
Phase Re-service in Reinforcement Learning Traffic Signal Control

Zhiyao Zhanga, c, George Guntera, c, Marcos Quinones-Grueiroc, Yuhang Zhanga, c,
William Barbourc, Gautam Biswasb, c, and Daniel Worka, c
The authors are with aDepartment of Civil and Environmental Engineering, bDepartment of Computer Science, cInstitute for Software Integrated Systems, Vanderbilt University, USA. Email: [email protected]. This work is supported by a grant from the U.S. Department of Transportation, Grant Number 693JJ22140000Z44ATNREG3202. This material is based upon work supported by the National Science Foundation under Grant No. CNS-2135579 (Work). The contents of this article reflect the views of the authors, who are responsible for the facts and accuracy of the information presented herein. The U.S. Government assumes no liability for the contents or use thereof. © 2024 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Abstract

This article proposes a novel approach to traffic signal control that combines phase re-service with reinforcement learning (RL). The RL agent directly determines the duration of the next phase in a pre-defined sequence. Before the RL agent’s decision is executed, we use shock wave theory to estimate queue growth for the designated movement allowed for re-service and decide whether phase re-service is necessary. If so, a temporary phase re-service is inserted before the next regular phase. We formulate the RL problem as a semi-Markov decision process (SMDP) and solve it with proximal policy optimization (PPO). A series of experiments shows significant improvements from introducing phase re-service: vehicle delays are reduced by up to 29.95% in the mean and up to 59.21% in the standard deviation, and the number of stops is reduced by up to 26.05% in the mean and 45.77% in the standard deviation.

I Introduction

Dynamically changing traffic patterns are a core challenge in managing traffic at intersections, and when unaccounted for, they can increase congestion and travel delay. Adaptive traffic signal control (ATSC) offers a promising solution to mitigate congestion and enhance traffic flow. Significant ATSC deployments, including SCOOT [1] and RHODES [2], extract traffic patterns over time and optimize the signal timings accordingly.

Recently, reinforcement learning (RL) has emerged as a promising tool for ATSC, owing to its learning and real-time computational capabilities in complex environments [3]. Although potentially promising algorithms have been developed (see [3, 4] for comprehensive surveys), phase starvation and safety concerns remain known challenges for RL-based ATSC without fixed signal sequences [5]. Additionally, some algorithms experience a performance drop under heavy traffic demands [6].

The left-turn movement is especially sensitive to high demand profiles because it often conflicts with oncoming traffic flow and has limited capacity. Performing phase re-service [7], which is when the controller serves the same phase twice in one cycle, can be an effective approach to manage left-turn queue lengths [8]. Phase re-service has been successfully deployed in many real-world intersections [9, 10]. While its implementation has been limited to traffic-responsive signal timing optimizations [11], we propose extending re-service to real-time ATSC for enhanced operational flexibility.

The main contribution of this article is an approach that augments reinforcement learning-based adaptive traffic signal control to enable phase re-service. The RL agent decides the duration of regular phases, and we use shock wave theory [12, 13, 14, 15, 16] to estimate queue growth and trigger phase re-service. Because the agent selects the phase duration, we model the control problem as a semi-Markov decision process (SMDP) [17]. We test the performance of our approach on two intersection geometries, each with five demand profiles. The simulation results demonstrate that phase re-service significantly reduces vehicle delays and the number of stops, both overall and for the protected left-turn movement.

The remainder of this article is organized as follows: In Section II, we provide the preliminaries for RL based signal control. Section III presents the technical details of phase re-service determination, RL formulation and algorithm, and the pseudocode summarizing the training process. Experimental settings and result analysis are presented in Section IV. We finally conclude our work in Section V.

II Preliminaries

Figure 1: An example intersection with a protected movement eligible for phase re-service (red). In the top cycle, each phase is served once. In the bottom cycle, the high-demand protected left-turn movement is re-served in phase 3.

II-A Traffic signal control setup

Consider traffic control at a single intersection with multiple incoming and outgoing roadways. Incoming vehicles travel on and are queued on incoming roadway lanes. Queues at signalized intersections are served by phases, each of which groups one or more non-conflicting turning movements. Phases are served with green signals in a pre-defined order known as the phase sequence, and a complete iteration of the sequence is a cycle [7].

At some intersections, particular left-turn movements face large demands during peak hours, such as the movement marked with a red arrow in Fig. 1. Serving a specific left-turn movement, such as a protected left turn, twice in a cycle can effectively clear excessive left-turn queues. The second service, known as phase re-service, is typically pre-configured to follow the through movement in the same direction. For example, in Fig. 1, the protected movement served in the first phase may be served again in the third phase. Our work considers adding phase re-service to RL-based signal controllers.

II-B Semi-Markov decision processes

RL problems for signal control are often formulated as a Markov decision process (MDP), $\{S, A, r, P, \gamma\}$. Here, $S \subseteq \mathbb{R}^{m}$ is the state space of the system in question with dimension $m$, $A \subseteq \mathbb{R}^{k}$ is the space of available actions with dimension $k$, $r: S \mapsto \mathbb{R}$ is a reward function, $P: S \times A \times S \mapsto \mathbb{R}$ is a probabilistic transition function, and $\gamma \in [0,1]$ is a discount factor.

Additionally, let $s_{t} \in S$ be the system state at time $t$, and $a_{t} \in A$ be the action taken at time $t$. Actions are chosen at each step from a policy $a_{t} \sim \pi_{\theta}(a_{t}|s_{t})$ parameterized by $\theta$. RL attempts to maximize the long-term discounted reward $\sum_{i=0}^{\infty}\gamma^{i} r_{i}(\cdot)$ by choosing an optimal policy, typically through the selection of optimal policy parameters. Signal control problems in which the action at each timestep is the phase to serve are naturally written as MDPs, which can then be solved with RL.

In this work we formulate the traffic signal control problem as a semi-Markov decision process (SMDP), an extension of the MDP that allows a varying transition time between states. The time between transitions is called the sojourn time, denoted $j_{t}$, and we incorporate it into the state transition as $P(s_{t+1}, j_{t} | s_{t}, a_{t})$, meaning that the transition probability accounts for both the next state and the time taken to reach it. Signal control problems in which the action is the temporal duration of the next phase are naturally written as SMDPs.

The maximization of the long-term discounted reward is achieved when the state value satisfies the Bellman optimality condition [17, 18], which for an SMDP is written as:

$$V^{*}(s_{t})=\max_{a\in A}\left[r(s_{t})+\sum_{s_{t+1},j_{t}}\gamma^{j_{t}}P(s_{t+1},j_{t}|s_{t},a_{t})V^{*}(s_{t+1})\right]. \qquad (1)$$

II-C Queue dynamics at intersections

Figure 2: Demonstration of the shock wave at a single intersection lane over a signal cycle. The yellow transition is ignored for simplicity.

Let $n$ be an index over signal cycles. A set of important speed properties associated with shock waves at intersections is as follows [16, 14]:

$$v_{1}^{n} = \frac{q_{a}^{n}}{k_{j}-k_{a}^{n}}, \qquad (2)$$
$$v_{2} = \frac{q_{m}}{k_{j}-k_{m}}, \qquad (3)$$
$$v_{3}^{n} = \left|\frac{q_{m}-q_{a}^{n}}{k_{m}-k_{a}^{n}}\right|, \qquad (4)$$
$$v_{4} = \frac{-q_{m}}{k_{m}-k_{j}} = v_{2}, \qquad (5)$$

where $k_{m}$, $k_{j}$, and $q_{m}$ are the critical density, jam density, and saturation flow, respectively. Additionally, $k_{a}^{n}$ and $q_{a}^{n}$ are the density and flow of arriving vehicles in cycle $n$. Modern sensors can directly measure the number of vehicles over a time window (the flow) and the average speed of these counted vehicles; the density is then obtained by dividing the flow by the average vehicle speed. Here $v_{1}^{n}$ is the queue expansion speed during the red signal, $v_{3}^{n}$ is the speed at which the end of the queue moves forward toward the stop line during the green signal, $v_{2}$ is the vehicle discharge speed from the queue during the green signal, and $v_{4}$ is the speed at which residual queued vehicles form a new queue when the green signal ends (only for saturated movements). Fig. 2 shows these quantities graphically.
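To make these relationships concrete, the following minimal Python sketch computes the four shock wave speeds from cycle-level measurements. The function name, the measurement window, and the conversion of counts and average speed into $q_{a}^{n}$ and $k_{a}^{n}$ are illustrative assumptions rather than the authors' implementation; the parameter defaults are the values reported later in Section IV-A.

```python
def shockwave_speeds(q_a, k_a, q_m=1550.0, k_m=50.0, k_j=133.3):
    """Shock wave speeds (2)-(5) for one cycle.

    q_a, q_m in veh/h; k_a, k_m, k_j in veh/km; returned speeds in km/h.
    Defaults follow the simulation parameters reported in Section IV-A.
    """
    v1 = q_a / (k_j - k_a)                  # queue expansion speed during red, Eq. (2)
    v2 = q_m / (k_j - k_m)                  # discharge speed from the queue, Eq. (3)
    v3 = abs((q_m - q_a) / (k_m - k_a))     # speed of the queue tail toward the stop line, Eq. (4)
    v4 = -q_m / (k_m - k_j)                 # residual queue formation speed (= v2), Eq. (5)
    return v1, v2, v3, v4

# Example: counts over a 60 s window give the arrival flow; density = flow / mean speed.
count, mean_speed_kmh, window_h = 12, 40.0, 60 / 3600
q_a = count / window_h          # veh/h
k_a = q_a / mean_speed_kmh      # veh/km
print(shockwave_speeds(q_a, k_a))
```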

III Controller design

In order to handle intersections with high left-turn demand via ATSC, we propose an RL controller design combined with phase re-service based on queue length estimation from shock wave theory. We first frame the traffic control problem as an SMDP. Then, we describe how the phase re-service is determined from queue length estimation. Finally, we present a training algorithm.

III-A SMDP formulation

The traffic control problem can be formulated as an SMDP as follows:

State $s$. The state consists of the count of queuing vehicles in each incoming lane, the count of non-stopping vehicles in each incoming lane, and the number of phases until service. This information is commonly available from sensor data [4].

Action $a$. The action taken by the RL agent is the duration of the next phase in a predefined cycle. To account for min/max green time constraints, actions are normalized within the range $[-1,1]$ via a tanh function and subsequently mapped linearly to the actual phase duration as follows:

$$\tilde{a}_{p}=\sigma_{p}^{-}+\frac{(a_{p}+1)\cdot(\sigma_{p}^{+}-\sigma_{p}^{-})}{2}, \qquad (6)$$

where $\tilde{a}_{p}$ and $a_{p}$ are the actual and normalized duration for phase $p$, and $\sigma_{p}^{+},\sigma_{p}^{-}$ are the maximum and minimum green times for $p$. For notational simplicity, we refer to $a$ as the actual duration in the rest of this paper.
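As an illustration, a minimal sketch of the mapping in (6), assuming the policy already outputs a tanh-squashed action in $[-1,1]$; the function and variable names are ours.

```python
def to_phase_duration(a_p, sigma_minus, sigma_plus):
    """Map a normalized action a_p in [-1, 1] to an actual green duration via Eq. (6)."""
    return sigma_minus + (a_p + 1.0) * (sigma_plus - sigma_minus) / 2.0

# Example: a phase with [5, 40] s green-time bounds; a mid-range action maps to 22.5 s.
print(to_phase_duration(0.0, 5.0, 40.0))
```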

Reward $r$. Let $l$ be the index of each incoming lane, $L$ be the total number of incoming lanes, and $X_{l}^{t}$ be the queue length at timestep $t$ for the $l$-th lane. We define the immediate reward as follows:

$$r_{t}=-\sum_{l\in L}\frac{X_{l}^{t+1}}{\Theta_{m}}, \qquad (7)$$

where $\Theta_{m}$ is the maximum distance from the stop line that allows vehicle detection. Queue length is often used as a surrogate reward for vehicle delay [3].
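A possible sketch of how the state described above and the reward in (7) could be assembled from SUMO through the TraCI Python interface is given below. The lane list, the use of halting-vehicle counts as a queue proxy, and the assumed 7.5 m of road space per queued vehicle are our assumptions, not the authors' exact implementation.

```python
import traci  # SUMO's Python control interface

THETA_M = 250.0        # detection range from the stop line, in meters (Section IV-A)
SPACE_PER_VEH = 7.5    # assumed road space occupied by one queued vehicle, in meters

def observe(incoming_lanes, phases_until_service):
    """State: queued and non-stopping vehicle counts per lane, plus phases until service."""
    queued = [traci.lane.getLastStepHaltingNumber(l) for l in incoming_lanes]
    moving = [traci.lane.getLastStepVehicleNumber(l) - q
              for l, q in zip(incoming_lanes, queued)]
    return queued + moving + [phases_until_service]

def reward(incoming_lanes):
    """Immediate reward (7): negative sum of queue lengths normalized by Theta_m."""
    queue_lengths = [traci.lane.getLastStepHaltingNumber(l) * SPACE_PER_VEH
                     for l in incoming_lanes]
    return -sum(x / THETA_M for x in queue_lengths)
```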

Sojourn time $j$. The sojourn time in this context is the time from applying one phase duration (the action) until the next point at which another phase duration is applied (the next action). The sojourn time is typically the phase duration plus the yellow signal transition time. However, if re-service is applied, the execution of the next action is delayed by the re-service phase, which extends the sojourn time accordingly.

III-B Queue estimation and phase re-service

Maximum queue estimation. First, we give a maximum queue estimation technique for this problem setup, which is a slightly altered form of the queue estimation technique presented in [19].

We define the following time quantities related to queue length estimation:

  • $T_{g}^{n}$: The start time at which the protected movement in the $n$-th cycle is served with a green signal. (Here $T$ denotes the actual simulation time in seconds; in contrast, $t$ is the timestep for the agent’s decision-making.)

  • $T_{r}^{n+1}$: The end time of the protected movement in the $n$-th cycle, which is also the start time of the $(n+1)$-th cycle.

  • $T_{re}^{n}$: The time at which the re-service decision is made for the $n$-th cycle.

Let $X(T)$ refer to the queue length at time $T$. Additionally, we define:

$$\Delta T^{n}=T_{g}^{n+1}-T_{re}^{n},$$

as the time between assessing for re-service and the next regular green signal.

At a given $T_{re}^{n}$, $\Delta T^{n}$ could be used to perform queue length estimation; however, it is not known in real time. Instead, let $\Delta\tilde{T}^{n}$ be a real-time estimate. In this work we estimate $\Delta\tilde{T}^{n}$ at a given $T_{re}^{n}$ using the running average of the true $\Delta T^{n}$ from prior cycles. In our implementation we use the two prior cycles.
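A small sketch of this running-average estimate of $\Delta\tilde{T}^{n}$, keeping only the two most recent cycles as stated above; the helper name is ours.

```python
from collections import deque

dT_history = deque(maxlen=2)  # true Delta T^n values from the two prior cycles

def estimate_dT(dT_observed=None):
    """Record a newly observed Delta T^n and return the running-average estimate."""
    if dT_observed is not None:
        dT_history.append(dT_observed)
    return sum(dT_history) / len(dT_history) if dT_history else 0.0
```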

Let $L_{max}^{n+1}$ be the forecasted maximum queue length in the next cycle if no re-service is applied. We calculate this as follows:

$$L_{max}^{n+1}=v_{1}^{n}\left(\frac{v_{2}\Delta\tilde{T}^{n}+X(T_{re}^{n})}{v_{2}-v_{1}^{n}}\right)+X(T_{re}^{n}). \qquad (8)$$

From $L_{max}^{n+1}$, the decision of whether or not to apply phase re-service is made. The quantities covered here are shown graphically in Fig. 2.

Re-service duration calculation. First, whether or not to execute the re-service is decided via

$$L_{max}^{n+1}>\Theta_{X},$$

where $\Theta_{X}$ is a threshold on queue length.

Next, we calculate the maximum queue length if re-service is applied as follows:

$$L_{re,max}^{n+1}=\frac{v_{2}\,X(T_{re}^{n})}{v_{2}-v_{1}^{n+1}}.$$

We then calculate the duration of the re-service as follows:

$$\Delta T_{re}^{n+1}=\begin{cases}\text{clip}\left[\sigma_{re}^{-},\ \zeta\dfrac{L_{re,max}^{n+1}}{v_{3}^{n}},\ \sigma_{re}^{+}\right],&\text{if }X(T_{re}^{n})<\Theta_{X}\\ \sigma_{re}^{+},&\text{if }X(T_{re}^{n})\geq\Theta_{X}\\ 0,&\text{otherwise},\end{cases} \qquad (9)$$

where $\Delta T_{re}^{n+1}$ is the assigned re-service duration, $\zeta\in(0,1]$ is a coefficient balancing re-service urgency against overall intersection management, and $\sigma_{re}^{-},\sigma_{re}^{+}$ are the minimum and maximum re-service durations. A re-service movement has queue dynamics analogous to the right-hand side of Fig. 2.
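The queue forecast (8) and the re-service decision (9) can be summarized in one short sketch. We read the "otherwise" branch of (9) as the case in which the forecast $L_{max}^{n+1}$ does not exceed $\Theta_{X}$, and we use the current-cycle $v_{1}^{n}$ in place of $v_{1}^{n+1}$; both are our assumptions, as are the variable names. All quantities are assumed to be in consistent units (e.g., meters and seconds).

```python
import numpy as np

def reservice_duration(X_re, v1, v2, v3, dT_est,
                       sigma_minus, sigma_plus, theta_X=200.0, zeta=0.7):
    """Decide the re-service duration for the next cycle.

    X_re:   current queue length X(T_re^n) of the protected movement
    v1, v2, v3: shock wave speeds for the current cycle
    dT_est: estimated time until the next regular green, Delta T-tilde^n
    """
    # Eq. (8): forecasted maximum queue length if no re-service is applied.
    L_max = v1 * (v2 * dT_est + X_re) / (v2 - v1) + X_re
    if L_max <= theta_X:
        return 0.0                       # re-service not triggered
    if X_re >= theta_X:
        return sigma_plus                # queue already past the threshold: maximum re-service
    # Maximum queue if re-service is applied, discharged toward the stop line at v3.
    L_re_max = v2 * X_re / (v2 - v1)
    return float(np.clip(zeta * L_re_max / v3, sigma_minus, sigma_plus))
```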

III-C Controller training framework

Our proposed training algorithm is presented in Algorithm 1. It joins queue-estimation-based phase re-service with our SMDP formulation and standard policy optimization techniques. In particular, proximal policy optimization with a generalized advantage estimator is used, with minor alterations to adapt the algorithm to the SMDP setting.

Algorithm 1 Episodic training process
1: function Re-service($k_{a}^{n+1}$, $q_{a}^{n+1}$)
2:     Forecast $L_{max}^{n+1}$ via (8)
3:     Calculate $\Delta T_{re}^{n+1}$ via (9); return $\Delta T_{re}^{n+1}$
4: end function
5: procedure Episode(episode length $T_{ep}$)
6:     Initialize env: $s_{0}$, $n\leftarrow 0$, $T\leftarrow 0$, $t\leftarrow 0$, $k_{a}^{n}\leftarrow 0$, $q_{a}^{n}\leftarrow 0$
7:     while $T<T_{ep}$ do
8:         Agent samples $a_{t}\leftarrow\pi_{\theta}(\cdot|s_{t})$
9:         Env executes $a_{t}$, returns $s_{t+1}$, $r_{t}$, $j_{t}$, $n$; $T\leftarrow T+j_{t}$
10:        if $T=T_{re}^{n}$ then
11:            Update $k_{a}^{n+1}$, $q_{a}^{n+1}$
12:            $\Delta T_{re}^{n+1}\leftarrow$ Re-service($k_{a}^{n+1}$, $q_{a}^{n+1}$)
13:            if $\Delta T_{re}^{n+1}>0$ then
14:                Env executes $a_{re}=\Delta T_{re}^{n+1}$, returns $s_{t+1}$, $r_{t}$; $j_{t}\leftarrow j_{t}+\Delta T_{re}^{n+1}$, $T\leftarrow T+\Delta T_{re}^{n+1}$
15:            end if
16:        end if
17:        Save $(s_{t},a_{t},r_{t},s_{t+1},j_{t})$ in buffer
18:        if buffer is full then
19:            Update $\pi_{\theta}$ via PPO
20:        end if
21:    end while
22: end procedure
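The paper notes only that minor alterations were made to PPO for the SMDP setting. One natural alteration, sketched below as an assumption rather than the authors' exact change, is to discount each transition by $\gamma^{j_{t}}$ (consistent with Eq. (1)) when computing generalized advantage estimates and returns from the buffered tuples $(s_{t},a_{t},r_{t},s_{t+1},j_{t})$.

```python
import numpy as np

def smdp_gae(rewards, values, sojourn_times, gamma=0.995, lam=0.99):
    """GAE with a per-step discount gamma**j_t (SMDP variant of the standard estimator).

    rewards, sojourn_times: arrays of length T from the buffer.
    values: array of length T+1, i.e., V(s_0), ..., V(s_T) including the bootstrap value.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        discount = gamma ** sojourn_times[t]            # gamma^{j_t} instead of gamma
        delta = rewards[t] + discount * values[t + 1] - values[t]
        gae = delta + discount * lam * gae
        advantages[t] = gae
    returns = advantages + values[:T]
    return advantages, returns
```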

IV Experimental results

We present the results of a series of numerical experiments conducted in the traffic microsimulation software SUMO [20]. Two different signal control environments are considered, namely a signalized intersection at a freeway ramp and a conventional four-leg intersection.

Three different ATSC algorithms are compared in both environments: our approach, RL with re-service; RL without re-service; and the SOTL algorithm [21]. The average vehicle delay, the average number of stops, and the total throughput are measured and compared across a set of different demands.

(a) Experimental intersection 1: A signalized freeway ramp.
(b) Experimental intersection 2: Four-leg intersection.
Figure 3: Experimental scenarios and their phase sequences. Vehicle movements are shown as arrows, protected ones in green and others in red. Regular and re-service phases are shown as boxes with solid and dotted outlines, respectively.

IV-A Implementation details

We model the two intersection types in SUMO using the default parameters to control the vehicle dynamics. At each intersection (Fig. 3), we define five time varying demand profiles, which are shown in Appendix A.

The [min, max] green time constraints for each phase are [5, 30], [5, 40], [10, 25], [5, 45] for the freeway ramp intersection, and [5, 25], [5, 70], [5, 25], [5, 25], [5, 70] for the four-leg intersection, all in seconds; green time constraints for the re-service phases are shown in italics. The yellow signal is set at five seconds. The simulation time for all scenarios is one hour. The maximum distance for vehicle detection $\Theta_{m}$ is set at 250 m, while the threshold for re-service $\Theta_{X}$ is set at 200 m. The shock wave parameters are estimated from the simulation model parameters as $k_{j}=133.3$ veh/km, $k_{m}=50$ veh/km, and $q_{m}=1550$ veh/h.

The actor network has a single 128-neuron hidden layer and a tanh activation function on the output. The critic network also has a 128-neuron hidden layer, but with a ReLU activation function. The two networks are independent, i.e., they share no parameters. The action value sampled from the actor’s stochastic policy is also passed through tanh. The re-service coefficient is set as $\zeta=0.7$. For the PPO hyperparameters, we follow the guidelines in [22] and set the loss clipping parameter $\epsilon=0.1$, the long-term reward discount factor $\gamma=0.995$, the multi-step weighting factor in advantage estimation $\lambda=0.99$, the learning rate to 2.5e-4, the minibatch size to 256, the update interval to 1200 transitions, and 20 epochs per update. All agents use the same hyperparameters.
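For concreteness, a minimal PyTorch sketch of the actor and critic described above; the state dimension, the Gaussian policy head with a state-independent log standard deviation, and the tanh hidden activation of the actor are our assumptions. The sampled action is squashed by tanh outside the network before being mapped to a duration with Eq. (6).

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Single 128-neuron hidden layer; tanh output activation bounds the action mean in [-1, 1]."""
    def __init__(self, state_dim, action_dim=1):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.Tanh(),
                                 nn.Linear(128, action_dim), nn.Tanh())
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # assumed state-independent

    def forward(self, s):
        return torch.distributions.Normal(self.net(s), self.log_std.exp())

class Critic(nn.Module):
    """Single 128-neuron hidden layer with ReLU; shares no parameters with the actor."""
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, s):
        return self.net(s)

# Example: sample an action and squash it before mapping it to a phase duration with Eq. (6).
actor = Actor(state_dim=25)  # state dimension is illustrative
a = torch.tanh(actor(torch.zeros(25)).sample())
```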

Figure 4: Step-average reward curves over 5 runs. Solid lines show averages and shaded intervals show standard deviations.

IV-B Training results

The step-average training results for 500 episodes at the two intersections are shown in Fig. 4. Demand 3 of each intersection is used as the training demand profile. The curves show that adding phase re-service as part of the environmental transition does not significantly affect the speed of convergence, and at both intersections re-service further increases the rewards. The freeway ramp intersection benefits more from phase re-service, likely because re-serviced vehicles account for a larger share of the total demand.

TABLE I: Summary of test results (mean, std) at the freeway ramp intersection. Percentages of re-service cycles are reported as mean values only (std omitted because it is negligible).
Metric Algorithm Demand 1 Demand 2 Demand 3 Demand 4 Demand 5
Vehicle delay (s) with re-service 34.967, 27.853 42.813, 36.168 48.393, 42.29 48.703, 36.119 43.019, 35.327
without re-service 34.62, 30.933 44.579, 47.773 57.322, 63.986 64.876, 69.702 61.497, 65.379
SOTL 33.167, 35.353 57.203, 75.361 88.622, 113.092 81.864, 102.859 70.121, 90.853
Number of stops with re-service 0.681, 0.508 0.807, 0.715 0.876, 0.772 0.897, 0.711 0.844, 0.765
without re-service 0.715, 0.59 0.887, 0.898 1.12, 1.223 1.213, 1.311 1.16, 1.225
SOTL 0.822, 0.887 1.46, 2.059 2.394, 3.025 2.181, 2.804 1.896, 2.608
% of re-service cycles with re-service 6.4 15.0 23.5 45.3 26.6
Throughput (veh/h) with re-service 1945, 6.59 2352, 6.9 2694, 8.09 2689, 10.06 2593, 6.84
without re-service 1941, 5.3 2358, 10.37 2690, 7.75 2611, 13.59 2594, 5.94
SOTL 1944, 4.08 2314, 13.18 2660, 13.36 2516, 13.41 2510, 13.91
TABLE II: Summary of test results (mean, std) at the four-leg intersection. Percentages of re-service cycles are reported as mean values only (std omitted because it is negligible).
Metric Algorithm Demand 1 Demand 2 Demand 3 Demand 4 Demand 5
Vehicle delay (s) with re-service 67.573, 51.672 66.355, 51.913 65.159, 53.597 66.588, 52.591 72.759, 55.823
without re-service 76.605, 77.757 72.344, 67.589 72.197, 69.815 89.74, 128.937 80.012, 93.294
SOTL 75.689, 80.501 80.274, 90.484 73.913, 86.785 92.121, 149.148 83.101, 118.338
Number of stops with re-service 0.841, 0.504 0.853, 0.53 0.83, 0.509 0.856, 0.527 0.905, 0.542
without re-service 0.926, 0.671 0.885, 0.597 0.878, 0.589 1.046, 1.068 0.994, 0.831
SOTL 0.971, 0.758 0.97, 0.78 0.959, 0.815 1.095, 1.266 1.084, 1.11
% of re-service cycles with re-service 4.8 11.0 4.0 13.7 14.1
Throughput (veh/h) with re-service 2906, 12.6 2739, 9.3 2782, 13.3 2613, 11.9 3130, 12.5
without re-service 2903, 14.9 2745, 16.4 2771, 11.1 2576, 6.3 3118, 10.6
SOTL 2868, 5.1 2701, 3.0 2755, 6.4 2543, 8.7 3116, 7.0
Figure 5: Density histogram of vehicle delays for the RL agent with and without phase re-service in the freeway ramp Demand 4 scenario.

IV-C Testing performance

For each intersection, the agent is trained five times with demand profile 3. The best-performing agent in the last episode of training is used for testing. We run each testing scenario (an intersection paired with a demand profile) 20 times and report the statistics. Metrics including throughput per hour, vehicle delay per trip, and the number of stops experienced per trip are compared and summarized in Table I and Table II. All metrics are calculated directly within SUMO. A baseline non-RL algorithm, SOTL [21], is evaluated for comparison; SOTL determines phase switching based on the count of arriving vehicles and has shown excellent performance in previous studies [23]. Below we analyze the different traffic performance metrics.

Vehicle delay: We report the average delay per vehicle. In 9 out of 10 scenarios tested, vehicle delay is significantly reduced with re-service compared to both the RL agent without re-service and the non-RL baseline. Numerically, compared to RL without re-service, the proposed approach improves the mean and standard deviation of vehicle delay by -1% to 29.95% and 9.95% to 59.21%, respectively. SOTL, in contrast, only exhibits a 4.2% mean vehicle delay improvement with 14.29% higher variance (Table I, Demand 1). Further, Fig. 5 presents the density histogram of vehicle delays for the freeway ramp Demand 4 scenario, which benefits most from phase re-service. RL with re-service reshapes the long-tailed distribution, i.e., one with significantly delayed vehicles, into a more concentrated one, leading to a lower average delay and a smaller standard deviation.

Number of stops: The average number of vehicle stops is reported as a measure of mobility. Phase re-service significantly reduces both the mean and the variance of the number of stops across all scenarios. The improvement reaches as high as 26.05% in the mean and 45.77% in the standard deviation (Table I, Demand 4 scenario).

Percentage of re-service cycles: This metric gives the proportion of cycles incorporating phase re-service in each scenario. The results in Table II, Demand 3 suggest that a re-service rate as low as 4.0% can reduce the average vehicle delay and number of stops by 9.75% and 5.47%, and moreover lower the variance of both metrics by 23.23% and 13.58%. The highest re-service rate reaches 45.3% (Table I, Demand 4 scenario), which contributes to the largest improvements in both vehicle delay and number of stops.

Throughput: The average throughput is also calculated for each scenario. The RL algorithm with and without phase re-service achieves similar throughput, which is also comparable to the baseline.

IV-D Performance metrics by directions

We list the trip-level evaluation metrics by vehicle movement for the RL algorithm with and without phase re-service, for the scenarios with the lowest and highest re-service penetration; they are summarized in Table III. Through and right-turn movements are grouped together in the four-leg intersection because they use the same phases.

As expected, we observe that movements served only in regular phases are delayed more and stop more often, since phase re-service takes additional time. Nevertheless, the additional delay is mild: the EE and SE movements in Table III(a) are delayed by an additional 5.6 s and 13.5 s on average, and the non-protected movements in Table III(b) experience additional delays ranging from 3.8 s to 14.5 s. In contrast, the improvement for the protected movement is substantial: 102.5 s less delay and 1.454 fewer stops in Table III(a), and 79.98 s less delay and 0.45 fewer stops in Table III(b).

TABLE III: Trip-level evaluation metric by vehicle movements. Bold and Italic font: protected movement.
(a) Ramp Demand 4 Scenario
Metric Movement with re-service without re-service
Vehicle delay (s) EE 55.906, 30.871 49.572, 27.026
WW 14.108, 12.629 18.604, 15.085
WS 71.994, 26.755 174.5, 65.297
SE 54.07, 31.034 40.508, 26.181
Number of stops EE 0.907, 0.389 0.862, 0.386
WW 0.344, 0.477 0.508, 0.606
WS 1.567, 0.672 3.021, 1.635
SE 0.855, 0.389 0.743, 0.44
(b) Four-leg Demand 3 Scenario
Metric Movement with re-service without re-service
Vehicle delay (s) NN&NE 47.422, 38.377 50.712, 37.564
NW 134.784, 92.833 214.751, 127.958
WW&WN 62.366, 41.288 55.418, 40.203
WS 79.247, 50.538 83.084, 46.177
EE&ES 56.873, 42.269 51.937, 38.735
EN 90.578, 51.432 95.351, 54.744
SS&SW 46.113, 38.467 50.441, 37.545
SE 75.983, 44.467 90.613, 46.752
Number of stops NN&NE 0.694, 0.473 0.761, 0.472
NW 1.382, 0.789 1.832, 0.977
WW&WN 0.829, 0.428 0.754, 0.459
WS 0.915, 0.358 0.908, 0.309
EE&ES 0.77, 0.431 0.741, 0.442
EN 0.964, 0.365 1.038, 0.393
SS&SW 0.684, 0.471 0.727, 0.452
SE 0.896, 0.311 0.995, 0.363

V Conclusion

In this paper, we propose a method to augment RL-based ATSC with temporary phase re-service, aiming to reduce vehicle delays and stops at intersections with high left-turn volumes. An RL agent determines the duration of the next regular phase, while a rule-based logic built on shock wave theory estimates queue growth and determines the phase re-service. We formulate the RL problem as an SMDP and solve it with PPO. We test the framework on two intersection types and ten demand profiles, and demonstrate its general merit: vehicle delays and the number of stops are reduced overall by up to 29.95% and 26.05% in the mean, and by up to 59.21% and 45.77% in the standard deviation, respectively.

References

  • [1] P. Hunt, D. Robertson, R. Bretherton, and M. C. Royle, “The scoot on-line traffic signal optimisation technique,” Traffic Engineering & Control, vol. 23, no. 4, 1982.
  • [2] P. Mirchandani and L. Head, “A real-time traffic signal control system: architecture, algorithms, and analysis,” Transportation Research Part C: Emerging Technologies, vol. 9, no. 6, pp. 415–432, 2001.
  • [3] H. Wei, G. Zheng, V. Gayah, and Z. Li, “Recent advances in reinforcement learning for traffic signal control: A survey of models and evaluation,” ACM SIGKDD Explorations Newsletter, vol. 22, no. 2, pp. 12–18, 2021.
  • [4] M. Noaeen, A. Naik, L. Goodman, J. Crebo, T. Abrar, Z. S. H. Abad, A. L. Bazzan, and B. Far, “Reinforcement learning in urban network traffic signal control: A systematic literature review,” Expert Systems with Applications, vol. 199, p. 116830, 2022.
  • [5] B. Ibrokhimov, Y.-J. Kim, and S. Kang, “Biased pressure: cyclic reinforcement learning model for intelligent traffic signal control,” Sensors, vol. 22, no. 7, p. 2818, 2022.
  • [6] Z. Zhang, M. Quinones-Grueiro, W. Barbour, Y. Zhang, G. Biswas, and D. Work, “Evaluation of traffic signal control at varying demand levels: A comparative study,” in 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC).   IEEE, 2023, pp. 3215–3221.
  • [7] T. Urbanik, A. Tanaka, B. Lozner, E. Lindstrom, K. Lee, S. Quayle, S. Beaird, S. Tsoi, P. Ryus, D. Gettman et al., Signal timing manual.   Transportation Research Board Washington, DC, 2015, vol. 1.
  • [8] A. Wang and Z. Tian, “Leveraging fully actuated signal coordination and phase reservice to facilitate signal timing practices,” Transportation research record, vol. 2677, no. 1, pp. 240–251, 2023.
  • [9] J. Corey, X. Xin, Y. Lao, and Y. Wang, “Improving intersection performance with left turn phase reservice strategies,” in 2012 15th International IEEE Conference on Intelligent Transportation Systems.   IEEE, 2012, pp. 403–408.
  • [10] S. M. Lavrenz, C. M. Day, A. M. Hainen, W. B. Smith, A. L. Stevens, H. Li, and D. M. Bullock, “Characterizing signalized intersection performance using maximum vehicle delay,” Transportation Research Record, 2015.
  • [11] F. C. Fang and L. Elefteriadou, “Development of an optimization methodology for adaptive traffic signal control at diamond interchanges,” Journal of Transportation Engineering, vol. 132, no. 8, pp. 629–637, 2006.
  • [12] G. Stephanopoulos, P. G. Michalopoulos, and G. Stephanopoulos, “Modelling and analysis of traffic queue dynamics at signalized intersections,” Transportation Research Part A: General, vol. 13, no. 5, pp. 295–307, 1979.
  • [13] P. G. Michalopoulos, G. Stephanopoulos, and G. Stephanopoulos, “An application of shock wave theory to traffic signal control,” Transportation Research Part B: Methodological, vol. 15, no. 1, pp. 35–51, 1981.
  • [14] Z. Wang, Q. Cai, B. Wu, L. Zheng, and Y. Wang, “Shockwave-based queue estimation approach for undersaturated and oversaturated signalized intersections using multi-source detection data,” Journal of Intelligent Transportation Systems, vol. 21, no. 3, pp. 167–178, 2017.
  • [15] Y. Cheng, X. Qin, J. Jin, and B. Ran, “An exploratory shockwave approach to estimating queue length using probe trajectories,” Journal of intelligent transportation systems, vol. 16, no. 1, pp. 12–23, 2012.
  • [16] H. X. Liu, X. Wu, W. Ma, and H. Hu, “Real-time queue length estimation for congested signalized intersections,” Transportation research part C: emerging technologies, vol. 17, no. 4, pp. 412–427, 2009.
  • [17] A. G. Barto and S. Mahadevan, “Recent advances in hierarchical reinforcement learning,” Discrete event dynamic systems, vol. 13, no. 1-2, pp. 41–77, 2003.
  • [18] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [19] J. Yao, F. Li, K. Tang, and S. Jian, “Sampled trajectory data-driven method of cycle-based volume estimation for signalized intersections by hybridizing shockwave theory and probability distribution,” IEEE Transactions on Intelligent Transportation Systems, vol. 21, no. 6, pp. 2615–2627, 2019.
  • [20] D. Krajzewicz, J. Erdmann, M. Behrisch, and L. Bieker, “Recent development and applications of sumo-simulation of urban mobility,” International journal on advances in systems and measurements, vol. 5, no. 3&4, 2012.
  • [21] S.-B. Cools, C. Gershenson, and B. D’Hooghe, “Self-organizing traffic lights: A realistic simulation,” Advances in applied self-organizing systems, pp. 45–55, 2013.
  • [22] M. Andrychowicz, A. Raichuk, P. Stańczyk, M. Orsini, S. Girgin, R. Marinier, L. Hussenot, M. Geist, O. Pietquin, M. Michalski et al., “What matters for on-policy deep actor-critic methods? a large-scale study,” in International conference on learning representations, 2020.
  • [23] G. Zheng, Y. Xiong, X. Zang, J. Feng, H. Wei, H. Zhang, Y. Li, K. Xu, and Z. Li, “Learning phase competition for traffic signal control,” in Proceedings of the 28th ACM international conference on information and knowledge management, 2019, pp. 1963–1972.

-A Demand profiles

We visualize the time-varying demand profiles by intersection movement in Fig. 6. Fig. 6(a) provides the ramp (RP) flows for five scenarios. Fig. 6(b) and 6(c) show the demand profiles for the four-leg intersection (FG).

(a) RP 1-5 scenarios
(b) FG 1-2 scenarios
(c) FG 3-5 scenarios
Figure 6: Traffic demands by moving directions for all scenarios.