Deep Reinforcement Learning for Voltage Control and Renewable Accommodation Using Spatial-Temporal Graph Information
Abstract
Renewable energy resources (RERs) have been increasingly integrated into distribution networks (DNs) for decarbonization. However, the variable nature of RERs introduces uncertainties to DNs, frequently resulting in voltage fluctuations that threaten system security and hamper the further adoption of RERs. To incentivize more RER penetration, we propose a deep reinforcement learning (DRL)-based strategy to dynamically balance the trade-off between voltage fluctuation control and renewable accommodation. To further extract multi-time-scale spatial-temporal (ST) graphical information of a DN, our strategy draws on a multi-grained attention-based spatial-temporal graph convolution network (MG-ASTGCN), consisting of an ST attention mechanism and ST convolution that explore node correlations in the spatial and temporal views. The continuous decision-making process of balancing such a trade-off is modeled as a Markov decision process optimized by the deep deterministic policy gradient (DDPG) algorithm with the help of the derived ST information. We validate our strategy on the modified IEEE 33, 69, and 118-bus radial distribution systems, with outcomes significantly outperforming the optimization-based benchmarks. Simulations also reveal that the developed MG-ASTGCN substantially accelerates the convergence of DDPG and improves its performance in stabilizing node voltages in an RER-rich DN. Moreover, our method improves the DN's robustness in the presence of generator failures.
Index Terms:
Renewable energy resources (RERs), voltage control, renewable accommodation, deep reinforcement learning (DRL), attention mechanism, graph convolution.

I Introduction
There has been exponential growth of distributed renewable energy resources (RERs), e.g., wind and solar energy, in distribution networks (DNs) for mitigating global climate change and providing affordable electricity to customers [1]. From 2007 to 2021, the global installed capacity of solar photovoltaics (PV) increased from 8 GW to 940 GW, as did that of wind power (from 94 GW to 837 GW) [2]. Despite the various benefits brought by RERs, e.g., decarbonization and power supply cost reduction, their increasing adoption introduces considerable uncertainties to DNs due to their intermittent nature [3]. Uncontrollable factors of RERs, such as solar irradiation and wind speed, produce stochastic and non-dispatchable RER generation, which makes generation forecasting extremely challenging, leads to frequent voltage fluctuations, threatens system security, and may cause potential economic losses [4].
Voltage control in an RER-rich DN has been widely discussed in the literature and can be broadly sorted into three classes [5]. 1) Distributed-optimization-based methods: the optimal voltage control strategy is often derived from a non-convex optimal power flow problem that is relaxed into a centralized convex problem through semi-definite programming or second-order cone programming and then solved by distributed algorithms. Typical consensus algorithms [6, 7, 8, 9] and the alternating direction method of multipliers [10, 11, 12, 13] are among the most prevalent solutions for distributed optimization problems. However, both approaches suffer from heavy computational costs, in particular in large-scale DNs. 2) Decentralized methods: based on network partitions, an optimization agent supported by conventional numerical algorithms [14, 15, 16] or heuristic methods (e.g., the genetic algorithm [17], particle swarm optimization [18], Harris hawks optimization [19], and grey wolf optimization [20]) can be applied to each divided zone to achieve overall voltage control in a decentralized manner. Nevertheless, numerical algorithms struggle to converge in the face of a high uptake of RERs, while heuristic algorithms rely heavily on accurate prior knowledge of a specific DN and can easily become trapped in local optima. 3) Learning-based methods: unlike the aforementioned optimization-based methods, deep reinforcement learning (DRL)-based methods have drawn increasing attention for voltage control due to their model-free characteristic [21, 22, 23, 24, 25, 26, 27], which enables the strategy to learn entirely from historical experience without any prior knowledge of either the uncertainty of renewables or the DN. Moreover, DRL, benefiting from its interactive learning manner, is also well-suited to capturing the uncertain dynamics of an RER-rich DN and thus better controlling voltage fluctuations. However, learning a stable and well-performing DRL-based control strategy in a complex physical system such as a DN is notoriously difficult due to slow learning convergence. Furthermore, ensuring the increasing accommodation of renewables in DNs while mitigating voltage fluctuations has been inadequately discussed in the literature.
To bridge this research gap, we propose a DRL-based strategy to balance the trade-off between voltage fluctuation control and renewable accommodation, leveraging spatial-temporal (ST) graphical information of the DN. Specifically, given that node pairs in the DN mutually influence each other in the spatial view while each node's features are inherently dependent in the temporal view, we develop a novel multi-grained attention-based spatial-temporal graph convolution network (MG-ASTGCN) to explore ST correlations through an ST attention mechanism and extract ST features through ST convolution over multiple time scales. The derived ST information indicates time-varying patterns of the DN's power flow, which the DRL can exploit to accelerate its learning process, improve its performance, and effectively interpret the graphical correlations among nodes in the DN. The main contributions of our work are summarized as follows.
•
ST Information Extraction: We develop a novel MG-ASTGCN to fully extract ST information from an RER-rich DN graph. Specifically, the attention mechanism and graph convolution inside the MG-ASTGCN are employed to explore ST correlations and extract ST features, respectively. Moreover, since the power flow exhibits periodic patterns over multiple time scales, we construct multi-grained power flow time series to better capture temporal ST information.
•
DRL for Balancing Voltage Control and Renewable Accommodation: We propose a DRL-based strategy leveraging the derived ST graphical information to dynamically balance the trade-off between voltage fluctuation control and renewable accommodation. The consecutive control process is modeled as a Markov decision process (MDP) and optimized via a state-of-the-art off-policy DRL algorithm, namely the deep deterministic policy gradient (DDPG).
•
Numerical Simulations and Implications: We validate our DRL-based method on modified IEEE 33, 69, and 118-bus radial distribution systems. Simulations demonstrate the effectiveness of our approach, with outcomes significantly outperforming benchmark algorithms, i.e., Harris hawks optimization (HHO), grey wolf optimization (GWO), an interior-point (IP)-based method, and a linear/quadratic programming (LQP)-based method. Moreover, our DRL-based strategy improves network stability in the presence of generator failures, especially for large-scale DNs.
The key insights drawn from simulation results are summarized as follows.
•
DDPG converges faster with the assistance of MG-ASTGCN: The developed MG-ASTGCN substantially accelerates the convergence of DDPG compared to other graphical correlation extraction methods, which demonstrates the effectiveness of our MG-ASTGCN in capturing the underlying ST graphical information of the DN.
•
Effectiveness of the ST attention mechanism inside the MG-ASTGCN: The spatial attention mechanism in the MG-ASTGCN captures mutual correlations among node pairs in the DN, while the temporal attention exploits temporal correlations of each node's features across multiple time steps. Simulations suggest that node pairs with more generator integration tend to have stronger spatial correlations, while, in the temporal view, each node's features are highly correlated with those of recent historical time steps.
•
Overemphasizing voltage fluctuation control or renewable accommodation undermines DDPG's performance: Our DRL-based strategy encodes all objectives and operational constraints of the derived optimization problem into reward functions as feedback from the DN. Simulation results show that if the reward term for either voltage fluctuation control or renewable accommodation is overemphasized, the DDPG's performance correspondingly degenerates, resulting in sub-optimal operation of the DN and highlighting that striking an effective balance between these objectives is essential for optimal control of the DN.
The remainder of this paper is organized as follows. Section II formulates an optimization problem balancing the trade-off between voltage fluctuation control and renewable accommodation, followed by Section III, where our DRL-based strategy, consisting of the MG-ASTGCN and DDPG, is proposed. Experimental results are presented in Section IV. Section V concludes.

II System Model
We consider a radial DN highly integrated with RERs. We use wind and solar PV generators as RERs since they are the most representative RERs, with the largest installed capacities worldwide [2]. The high uptake of RERs introduces considerable uncertainties into the DN and continuously leads to voltage fluctuations, hindering RERs' further adoption and often causing curtailments. We formulate an optimization problem for mitigating voltage fluctuations while accommodating renewable integration, along with minimizing generation costs, in Sections II-A to II-C. An overview of the presented work is illustrated in Fig. 1.
II-A Voltage Fluctuation Control
The increasing presence of RERs frequently triggers voltage fluctuations, even voltage violations, at load nodes in the DN, which may severely degrade the performance of electronic equipment and pose potential security risks to electricity consumers. To mitigate voltage fluctuations, we consider voltage fluctuation control with an L2-norm-based metric to describe voltage stability, which can be defined as
(1)
where the metric is evaluated at the current time step over all load nodes, with each load voltage expressed in phasor form in terms of its voltage magnitude and phase angle.
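For illustration, the following is a minimal sketch of an L2-norm-based voltage stability metric, assuming the fluctuation of each load node is measured as the deviation of its voltage phasor from a nominal reference of 1.0 p.u. at zero phase angle; the reference choice and the aggregation are assumptions rather than the exact definition in Eq. (1).

```python
import numpy as np

def voltage_fluctuation_metric(v_mag, v_ang, v_ref=1.0):
    """L2-norm-based voltage fluctuation metric over all load nodes.

    v_mag, v_ang: load-node voltage magnitudes (p.u.) and phase angles (rad)
    at the current time step. The nominal reference phasor is an assumption.
    """
    phasors = v_mag * np.exp(1j * v_ang)       # phasor form of load voltages
    deviations = np.abs(phasors - v_ref)       # per-node deviation from the reference
    return np.linalg.norm(deviations, ord=2)   # aggregate with an L2 norm

# Example: three load nodes with small magnitude/angle deviations.
print(voltage_fluctuation_metric(np.array([1.02, 0.97, 1.00]),
                                 np.array([0.01, -0.02, 0.0])))
```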
II-B Renewable Accommodation
Accommodating the increasing penetration of RERs is of great importance for an orderly energy transition in the power grid and functions as a main pillar of net-zero emissions. The metric reflecting the effectiveness of renewable accommodation can be formulated as
(2)
where the metric is computed over all wind and solar PV generators from their actual power outputs and their corresponding maximum available outputs at the current time step, the latter usually obtained via onsite monitoring devices.
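A minimal sketch of this accommodation metric follows, assuming it is the ratio of the total delivered wind and solar PV output to the total maximum available output at the current time step; the exact aggregation in Eq. (2) may differ.

```python
import numpy as np

def rer_accommodation(p_wind, p_wind_max, p_pv, p_pv_max):
    """Share of available renewable power actually delivered at one time step."""
    delivered = np.sum(p_wind) + np.sum(p_pv)
    available = np.sum(p_wind_max) + np.sum(p_pv_max)
    return delivered / available if available > 0 else 1.0

# Example: two wind turbines and one PV plant, with some curtailment.
print(rer_accommodation([1.8, 2.0], [2.0, 2.5], [0.9], [1.0]))
```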
Moreover, the inherent variability of RERs can lead to mismatch costs between scheduled power generation and actual power output [28], which can be further divided into a reserve cost (when the power output is overestimated) and a penalty cost (when the power output is underestimated). The reserve costs for wind and solar PV power can be formulated as
(3)
(4)
where the reserve cost of each wind or solar PV generator is determined by its constant reserve cost coefficient, its available and minimum power outputs, and the probability density function of its generation; we assume the wind speed and solar irradiation follow the Weibull and lognormal distributions, respectively [28].
Similarly, the penalty costs of wind and solar PV generators can be formulated as
(5)
(6)
where the penalty cost coefficients are constants.
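To illustrate how such mismatch costs are typically evaluated, the sketch below follows the common formulation in which the reserve cost integrates the overestimated portion (scheduled output above the realized output) and the penalty cost integrates the underestimated portion, each weighted by the generator's output probability density [28]; the Weibull density parameters, the integration bounds, and the cost coefficients are illustrative assumptions, not the paper's exact expressions.

```python
import numpy as np
from scipy import integrate, stats

def reserve_and_penalty_cost(p_sched, p_min, p_max, pdf, k_reserve, k_penalty):
    """Expected reserve/penalty costs of one stochastic generator.

    pdf: probability density of the generator's available power output.
    Reserve cost covers overestimation (available < scheduled);
    penalty cost covers underestimation (available > scheduled).
    """
    reserve, _ = integrate.quad(lambda p: (p_sched - p) * pdf(p), p_min, p_sched)
    penalty, _ = integrate.quad(lambda p: (p - p_sched) * pdf(p), p_sched, p_max)
    return k_reserve * reserve, k_penalty * penalty

# Illustrative Weibull-distributed wind power output (parameters are assumptions).
wind_pdf = stats.weibull_min(c=2.0, scale=1.5).pdf
print(reserve_and_penalty_cost(p_sched=1.0, p_min=0.0, p_max=3.0,
                               pdf=wind_pdf, k_reserve=3.0, k_penalty=1.5))
```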
Additionally, fuel-based energy resources still play an irreplaceable role in maintaining the DN's stability, especially in the face of RER generation shortages. The generation costs of the thermoelectric generators can be formulated as
(7)
where the generation cost of each thermoelectric generator is a quadratic function of its actual power output with constant cost coefficients.
Combining the generation costs of thermoelectric and RER generators, the overall generation cost in the DN at one time step can be formulated as
(8)
where the thermoelectric generation costs are summed over all thermoelectric generators in the DN.
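The following sketch shows how the overall generation cost at one time step can be assembled, assuming the usual quadratic fuel cost for each thermoelectric generator and adding pre-computed renewable reserve and penalty terms in the spirit of Eq. (3)-(6); all coefficient values are placeholders.

```python
def thermo_cost(p, a, b, c):
    """Quadratic fuel cost of one thermoelectric generator: a + b*p + c*p**2."""
    return a + b * p + c * p ** 2

def total_generation_cost(p_thermo, thermo_coeffs, rer_reserve_costs, rer_penalty_costs):
    """Overall DN generation cost at one time step (an Eq. (8)-style aggregation)."""
    fuel = sum(thermo_cost(p, *coef) for p, coef in zip(p_thermo, thermo_coeffs))
    return fuel + sum(rer_reserve_costs) + sum(rer_penalty_costs)

# Example with two thermoelectric units and pre-computed RER mismatch costs.
print(total_generation_cost(p_thermo=[0.8, 1.2],
                            thermo_coeffs=[(10.0, 20.0, 0.5), (12.0, 18.0, 0.4)],
                            rer_reserve_costs=[1.1], rer_penalty_costs=[0.3]))
```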
II-C Optimization Formulation
We formulate the objectives of our control strategy as the weighted summation of metrics regarding voltage fluctuation mitigation, renewable accommodation maximization, and generation cost minimization, which can be expressed as
(9)
where the objective is accumulated over the optimization time horizon and the three weights reflect the relative importance of voltage fluctuation control, renewable accommodation, and generation cost minimization, respectively. The objective defined in Eq. (9) is subject to physical constraints that ensure safe operation of the DN. In particular, the equality constraints, namely the active and reactive power balance equations of the DN, are formulated as
(10)
(11)
with the active/reactive power losses on the branch defined as
(12)
(13)
where the losses on each branch are determined by its conductance and susceptance, and the active/reactive power exchanged with the upstream power grid also enters the balance equations.
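For reference, a sketch of how the active and reactive losses on one branch can be evaluated from the terminal voltages, assuming the standard expressions in terms of branch conductance and susceptance; this is the textbook form and not necessarily the exact expressions used in Eq. (12)-(13).

```python
import numpy as np

def branch_losses(v_k, v_m, theta_k, theta_m, g_km, b_km):
    """Active/reactive losses on branch k-m (standard formulas, assumed here)."""
    d = v_k ** 2 + v_m ** 2 - 2.0 * v_k * v_m * np.cos(theta_k - theta_m)
    return g_km * d, -b_km * d   # (P_loss, Q_loss)

# Example branch with small voltage magnitude and angle differences.
print(branch_losses(1.02, 0.99, 0.02, -0.01, g_km=1.5, b_km=-4.0))
```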
Moreover, inequality constraints that limit both load and generator voltages can be defined as
(14)
(15)
(16)
(17)
where the load node voltages and the voltages of the thermoelectric, wind, and solar PV generators are each bounded by their respective minimum and maximum limits.
The active and reactive power output limits of generators can be defined as
(18)
(19)
(20)
(21)
(22)
(23)
(24)
where the active and reactive power outputs of the thermoelectric, wind, and solar PV generators are each bounded by their respective minimum and maximum limits. Constraint (19) describes the active power ramp rate constraint of the thermoelectric generators, whose ramp-down and ramp-up limits are set as a fixed fraction of their rated power by default [29, 30]. Moreover, for the shut-down constraint, before shutting down, a thermoelectric generator must first adjust its power output to its lower limit; once shut down, it is not allowed to restart within a minimum number of time steps by default. For the start-up constraint, an unavailable thermoelectric generator must adjust its power output to its lower limit before reconnecting to the DN; furthermore, once functioning in the DN, the generator is not allowed to be shut down within a minimum number of time steps by default.
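The ramp and start-up/shut-down rules described above can be summarized as a feasibility check on a proposed dispatch action, as in the hedged sketch below; the helper itself, its argument names, and the shutdown encoding (a zero output) are illustrative assumptions rather than part of the paper's formulation.

```python
def thermo_action_feasible(p_prev, p_new, p_min, p_max, ramp_down, ramp_up,
                           online, steps_since_switch, min_down, min_up):
    """Check one thermoelectric dispatch action against ramp and up/down-time rules."""
    if not online:
        # An offline unit may only restart after min_down steps, starting at its lower limit.
        return steps_since_switch >= min_down and abs(p_new - p_min) < 1e-6
    if p_new == 0.0:
        # A running unit may only shut down after min_up steps and from its lower limit.
        return steps_since_switch >= min_up and abs(p_prev - p_min) < 1e-6
    # Otherwise respect output limits and the ramp-rate constraint.
    return p_min <= p_new <= p_max and -ramp_down <= p_new - p_prev <= ramp_up

# Example: a unit online for 5 steps attempting a feasible upward ramp.
print(thermo_action_feasible(0.6, 0.8, 0.2, 1.5, 0.3, 0.3,
                             online=True, steps_since_switch=5, min_down=3, min_up=3))
```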
Also, the power flow constraint on the branch can be expressed as
(25)
where the apparent power flow on each branch, expressed in complex form, is limited by the branch's maximum power flow.
III Methodology
To solve the optimization problem defined in Eq. (9) to (25), we first develop the MG-ASTGCN to extract ST graphical information of the DN in Section III-A, with the aim of delivering prior graphical information of the DN to the subsequent DDPG. The consecutive control problem is modeled as an MDP in Section III-B, where we introduce the DDPG to learn the optimal strategy for controlling voltage fluctuations and accommodating renewable generation.
III-A ST Information Extraction via MG-ASTGCN
III-A1 MG-ASTGCN Preliminaries
A DN can be modeled as an undirected graph, as illustrated in Fig. 2, defined by its node set, edge set, and adjacency matrix. At each time step, each load node generates a feature vector gathering local information, e.g., its load voltage, load power, and connected branch power, which can be formulated as
(26)
where the dimension of the feature vector may differ from node to node. The feature matrix of the graph can be aggregated as
(27)
where shorter feature vectors are aligned to the highest node feature dimension.
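A minimal sketch of this aggregation is given below, assuming shorter feature vectors are zero-padded to the largest node feature dimension; the padding choice is an assumption.

```python
import numpy as np

def build_feature_matrix(node_features):
    """Stack per-node feature vectors (possibly of different lengths) into one matrix."""
    f_max = max(len(f) for f in node_features)   # highest node feature dimension
    matrix = np.zeros((len(node_features), f_max))
    for i, feats in enumerate(node_features):
        matrix[i, :len(feats)] = feats           # zero-pad shorter vectors
    return matrix

# Example: voltage magnitude, load P/Q, and connected branch powers per node.
x = build_feature_matrix([[1.01, 0.4, 0.2, 0.7],
                          [0.99, 0.3, 0.1, 0.5, 0.2],
                          [1.00, 0.5, 0.3]])
print(x.shape)   # (3 nodes, 5 features)
```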

Considering periodic patterns in the power flow, especially the daily and weekly ones [2], we develop a multi-grained vector constructor to better capture temporal correlations of the DN, where the recent, daily, and weekly graph segments can be formulated as
(28)
(29)
(30)
where the recent, daily, and weekly segments each have their own length and the daily and weekly segments are sampled according to the number of times the optimization problem is solved per day.
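The following sketch illustrates the multi-grained vector constructor, assuming one graph feature matrix per hourly step, with the recent segment taking the last few steps and the daily and weekly segments sampling the same hour from previous days and weeks; the segment lengths and the sampling rule are assumptions consistent in spirit with Eq. (28)-(30).

```python
import numpy as np

def multi_grained_segments(history, t, len_r=12, len_d=3, len_w=2, steps_per_day=24):
    """Slice recent / daily / weekly graph segments ending at time step t.

    history: array of shape (T, N, F) holding one graph feature matrix per step.
    """
    recent = history[t - len_r + 1: t + 1]
    daily = np.stack([history[t - d * steps_per_day] for d in range(len_d, 0, -1)])
    weekly = np.stack([history[t - w * 7 * steps_per_day] for w in range(len_w, 0, -1)])
    return recent, daily, weekly

history = np.random.rand(24 * 30, 33, 8)   # 30 days, 33 nodes, 8 features per node
r, d, w = multi_grained_segments(history, t=24 * 29)
print(r.shape, d.shape, w.shape)           # (12, 33, 8) (3, 33, 8) (2, 33, 8)
```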
The framework of the proposed MG-ASTGCN is illustrated in Fig. 3, consisting of the graph conversion operation, the multi-grained vector constructor, and the core ASTGCN, which takes the multi-grained segments as inputs and employs stacked ST components to extract ST information. The structure of one ST component is depicted in Fig. 4, including the ST attention mechanism and ST convolution, which are presented in detail in Sections III-A2 and III-A3, respectively.
III-A2 Spatial-Temporal Attention Mechanism
The key idea of the ST attention mechanism is to pay more attention to valuable graphical information in both the spatial and temporal perspectives, assisting the subsequent ST convolution in extracting more useful features for the DRL.
Spatial Attention: The mutual influences of neighboring node pairs in the DN vary dynamically with the changes in power flow. To explore such mutual influences, an attention mechanism in the spatial dimension is developed to capture the dynamic correlations [31], which can be formulated as (we omit the notation for the multi-grained segments in the rest of this section for brevity)
(31)
(32)
where the attention scores are computed from the input segment through element-wise multiplications with learnable parameter matrices, passed through a sigmoid activation, and normalized to form the spatial attention matrix, whose elements (the attention weights) semantically describe the correlation strength between pairs of nodes. The derived spatial attention matrix is adopted in the spatial graph convolution to adjust the spatial correlation strengths of node pairs, as shown in Fig. 4.


Temporal Attention: Similar to the spatial attention, the temporal attention mechanism [32] aims to track temporal correlations of the changing node features, which can be formulated as
(33)
(34)
where the attention scores are computed from the transposed input segment with learnable parameter matrices and normalized to form the temporal attention matrix, whose attention weights represent the temporal dependencies between graph feature vectors at different time steps. The temporal attention matrix is then applied to the original input segment to inject temporal correlation information, as shown in Fig. 4.
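To make the two attention blocks concrete, here is a much-simplified sketch that assumes a single learnable bilinear scoring matrix per block and softmax normalization; the actual MG-ASTGCN uses the richer parameterization of Eq. (31)-(34), so this only illustrates how weights are produced over node pairs (spatial) and over time-step pairs (temporal).

```python
import torch
import torch.nn.functional as F

class SimpleSTAttention(torch.nn.Module):
    """Toy spatial and temporal attention over one segment of shape (T, N, F)."""

    def __init__(self, num_features):
        super().__init__()
        self.w_spatial = torch.nn.Parameter(torch.randn(num_features, num_features))
        self.w_temporal = torch.nn.Parameter(torch.randn(num_features, num_features))

    def forward(self, x):                      # x: (T, N, F)
        # Spatial attention: correlation strength between every node pair.
        node_repr = x.mean(dim=0)              # (N, F), averaged over time
        s = node_repr @ self.w_spatial @ node_repr.T
        spatial_attn = F.softmax(s, dim=-1)    # (N, N)
        # Temporal attention: dependency between every pair of time steps.
        step_repr = x.mean(dim=1)              # (T, F), averaged over nodes
        e = step_repr @ self.w_temporal @ step_repr.T
        temporal_attn = F.softmax(e, dim=-1)   # (T, T)
        # Re-weight the input segment along the temporal axis.
        x_temporal = torch.einsum('ts,snf->tnf', temporal_attn, x)
        return spatial_attn, x_temporal

attn = SimpleSTAttention(num_features=8)
spatial_attn, x_t = attn(torch.rand(12, 33, 8))
print(spatial_attn.shape, x_t.shape)           # (33, 33) and (12, 33, 8)
```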
III-A3 Spatial-Temporal Convolution
The ST convolution consists of spatial graph convolution and temporal convolution, aiming to compress the three-dimensional input segments and extract ST features that can be integrated with the DRL algorithm.
Spatial Graph Convolution: Graph convolution is a convolution operation implemented by replacing the classical convolution operator with linear operators that are diagonalized in the Fourier domain [33], which can be expressed as
(35)
where the convolution filter acts on the graph signal through the eigenvalue decomposition of the Laplacian matrix of the derived undirected graph, and the rectified linear unit (ReLU) is adopted as the activation function. Given that conducting eigenvalue decomposition of the Laplacian matrix is computationally expensive, Chebyshev polynomials are often used to approximate the decomposition in practice [34]. With the addition of spatial correlation information from the spatial attention matrix, as shown in Fig. 4, the spatial graph convolution can be rewritten as
(36)
where the filter is expanded in Chebyshev polynomials of the normalized Laplacian matrix up to a highest order, with one learnable coefficient per polynomial term.
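A compact sketch of a Chebyshev-polynomial spatial graph convolution follows, assuming a scaled normalized Laplacian and an element-wise modulation of each polynomial term by the spatial attention matrix (one common way of injecting the attention weights); the polynomial order, layer sizes, and this specific modulation are assumptions.

```python
import numpy as np

def cheb_graph_conv(x, adj, spatial_attn, theta):
    """Spatial graph convolution with K-th order Chebyshev polynomials.

    x: (N, F) node features, adj: (N, N) adjacency, spatial_attn: (N, N),
    theta: list of K (F, F_out) coefficient matrices (learnable in practice).
    """
    deg = np.diag(adj.sum(axis=1))
    lap = deg - adj
    lam_max = np.linalg.eigvalsh(lap).max()
    lap_tilde = 2.0 * lap / lam_max - np.eye(adj.shape[0])       # scaled Laplacian
    t_prev, t_curr = np.eye(adj.shape[0]), lap_tilde
    out = (t_prev * spatial_attn) @ x @ theta[0] + (t_curr * spatial_attn) @ x @ theta[1]
    for k in range(2, len(theta)):
        t_next = 2.0 * lap_tilde @ t_curr - t_prev                # Chebyshev recurrence
        out += (t_next * spatial_attn) @ x @ theta[k]
        t_prev, t_curr = t_curr, t_next
    return np.maximum(out, 0.0)                                   # ReLU activation

n, f, f_out, order = 5, 8, 16, 3
adj = np.ones((n, n)) - np.eye(n)
print(cheb_graph_conv(np.random.rand(n, f), adj, np.ones((n, n)),
                      [np.random.rand(f, f_out) * 0.1 for _ in range(order)]).shape)
```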
Temporal Convolution and Feature Compression: The temporal convolution operation takes the result of the spatial graph convolution as input and performs convolution along the temporal dimension of the recent, daily, and weekly segments, which can be formulated as
(37)
where a temporal convolution filter is applied and the result serves as the input for the following ST component.
To pass the extracted ST information to the DRL algorithm, the multi-time-scale outputs are fused and then fed into a fully-connected neural network layer (FCNNL) [35] for feature compression as shown in Fig. 3, which can be formulated as
(38)
where concat represents the concatenating operation for the multi-time-scale outputs.
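The remaining steps of one ST component and the final fusion can be sketched as below, assuming a 1-D convolution along the time axis of each segment, temporal average pooling, concatenation of the recent/daily/weekly outputs, and a single fully-connected layer for compression; the kernel size, pooling choice, and dimensions are placeholders rather than the paper's settings.

```python
import torch

class TemporalConvAndFuse(torch.nn.Module):
    """Temporal convolution per segment, then multi-time-scale fusion and FC compression."""

    def __init__(self, num_features, out_dim):
        super().__init__()
        # Convolve along the time axis; per-node features act as channels.
        self.tconv = torch.nn.Conv1d(num_features, num_features, kernel_size=3, padding=1)
        self.fc = torch.nn.LazyLinear(out_dim)          # compresses the fused features

    def forward(self, segments):                        # list of (T_i, N, F) tensors
        pooled = []
        for seg in segments:
            h = self.tconv(seg.permute(1, 2, 0))        # (N, F, T_i)
            pooled.append(torch.relu(h).mean(dim=-1))   # average over the time axis
        fused = torch.cat(pooled, dim=-1)               # concat recent/daily/weekly
        return self.fc(fused)                           # (N, out_dim) ST feature

model = TemporalConvAndFuse(num_features=8, out_dim=16)
segments = [torch.rand(12, 33, 8), torch.rand(3, 33, 8), torch.rand(2, 33, 8)]
print(model(segments).shape)                            # torch.Size([33, 16])
```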
In summary, ST graphical information can be fully exploited by our MG-ASTGCN. The spatial attention mechanism explores spatial correlations of neighboring node pairs, while the temporal attention focuses on mining self-correlations of node features in the temporal view. Furthermore, based on ST correlations provided by the preceding ST attention, ST convolution extracts ST features of the underlying DN. The detailed process of the recent graph segment passing through one ST component is illustrated in Fig. 5.

III-B DRL-based Control Strategy
III-B1 MDP Modeling
Balancing the trade-off between voltage control and renewable accommodation in the DN can be considered as a consecutive decision-making process, which can be further modeled as an MDP [36] consisting of four critical parts: the state space, action space, transition probability space, and reward space.
State Space: The state of each load node is its feature vector defined in Eq. (26). Moreover, the extracted ST features are aggregated with the DN's state. Thus, the state of the DN can be expressed as
(39)
Action Space: For each power generator, only its active power and voltage magnitude can be manipulated; thus, the action of each power generator consists of these two quantities. The actions of all generators in the DN can be defined as
(40)
where the action vector collects the actions of the thermoelectric, wind, and solar PV generators. Note that our proposed strategy is also applicable to generators operating in the P/Q control mode by changing the generator's action space to its active power and reactive power.
Probability Space: the set of probabilities of transitioning to the next state from the current state after taking a deterministic action.
Reward Space: A reward is obtained after taking an action at a given state, indicating the effectiveness of the selected action. The goal of DRL is to learn an optimal action strategy that maximizes the expected cumulative reward. Hence, it is essential to encode the objective of the optimization problem into a reward function to facilitate DRL training. The reward function of the MDP can be formulated as
(41)
where the three reward terms correspond to the objective functions of the optimization problem, i.e., voltage control, renewable accommodation, and generation cost minimization.
We then introduce how we handle the optimization constraints defined in Eq. (10) to (25) and encode them as rewards in the DRL training process. For the power balance constraints in Eq. (10) and (11), if these two constraints cannot be satisfied during training, a constant penalty term (a fixed negative constant in our simulation by default) is added to the reward for the constraint violation; moreover, the current training episode is terminated, the DN environment is reset, and the input MDP state for the next episode is randomly initialized. For the load voltage constraint, the generator reactive power constraints, and the branch flow constraint presented in Eq. (14), (22)-(24), and (25), respectively, the same constant penalty term is added to the reward as violation feedback whenever they are violated. The generator voltage and power constraints defined in Eq. (15)-(17) and (18)-(21), respectively, are never violated, because the actions of each generator are its power and voltage, so the lower and upper limits of these constraints are encoded as the bounds of the MDP's action space; as a result, the action value of each generator always satisfies them.
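The reward and penalty handling described above could be assembled in an environment step as in the following sketch; the sign conventions, the penalty magnitude (whose actual value is not reproduced here), and the helper names are assumptions, while the default weights follow the setting reported in Table IV.

```python
def compute_reward(metrics, violations, w_v=1.0, w_rer=1.0, w_cost=0.01, penalty=-50.0):
    """Reward = weighted objectives + penalties for violated 'soft' constraints.

    metrics: dict with 'voltage_fluctuation', 'rer_accommodation', 'gen_cost'.
    violations: dict of booleans for the constraints handled via penalties.
    The penalty magnitude and sign conventions are illustrative assumptions.
    """
    reward = (-w_v * metrics['voltage_fluctuation']
              + w_rer * metrics['rer_accommodation']
              - w_cost * metrics['gen_cost'])
    done = False
    if violations.get('power_balance', False):
        reward += penalty      # Eq. (10)-(11): penalize, terminate, and reset the episode
        done = True
    for key in ('load_voltage', 'reactive_power', 'branch_flow'):
        if violations.get(key, False):
            reward += penalty  # Eq. (14), (22)-(24), (25): penalize but continue
    return reward, done

print(compute_reward({'voltage_fluctuation': 0.08, 'rer_accommodation': 0.95,
                      'gen_cost': 120.0},
                     {'load_voltage': True}))
```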
III-B2 Solving MDP by DDPG
The objective of DRL is to maximize the expected cumulative reward, which can be formulated as
(42)
where a trajectory of MDP transitions records all 4-tuples of state, action, reward, and next state from the beginning of an episode to its end, the action strategy is parameterized by learnable parameters, and the expectation is taken over the occurrence probabilities of trajectories together with their cumulative rewards.
We then introduce the DDPG [37] to maximize the expected cumulative reward. DDPG is a representative actor-critic DRL algorithm for optimizing the derived MDP. The major difference between DDPG and many other DRL algorithms is that its action policy deterministically outputs action values instead of a probability distribution over actions, which considerably decreases the computational cost and makes it much easier to implement. Specifically, the policy gradient method is applied in DDPG to update the action policy, which can be formulated as
(43)
with the gradient of the DRL objective defined as
(44)
where the policy parameters are updated with a learning rate, the gradient is estimated over a batch of sampled trajectories of finite length, and the advantage function assesses the effectiveness of a state-action pair relative to a baseline, which can be formulated as
(45)
where future rewards are discounted by a discounting factor, the baseline considers the expected reward over all possible actions, and the value term represents the expected cumulative reward over all possible states at the current time step. Due to the uncertainty of the MDP transitions, both quantities are random variables. To accurately estimate the advantage function, DDPG introduces a critic network formulated as
(46)
where the critic network has its own learnable parameters. With the critic network, the advantage function in Eq. (45) can be rewritten as
(47)
Given that the output value of the critic network depends on the environment rather than on our action policy, the critic network can be learned in an off-policy manner, taking advantage of transitions generated by a different behavior policy. The critic network can be updated by minimizing the root mean square error formulated as
(48)
with the target value estimated via the target actor and critic networks defined as
(49)
Similarly, the critic network is updated via gradient descent formulated as
(50)
where a separate learning rate is used for updating the critic network.
The two target networks are updated in an exponential moving average manner using the parameters of their corresponding actor and critic networks, which can be formulated as
(51)
(52)
where a smoothing parameter controls the update rate.
In summary, DDPG follows the policy-gradient way to optimize the derived MDP, where a critic network is introduced to assess the learned action policy. The workflow illustration and algorithmic procedure of the DDPG are presented in Fig. 6 and Algorithm 1, respectively.
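For completeness, a condensed sketch of one DDPG update on a batch of replayed transitions is given below, assuming standard target networks and soft updates in the spirit of Eq. (48)-(52); the network sizes and hyper-parameters are placeholders, not the values used in our simulations.

```python
import copy
import torch

def ddpg_update(actor, critic, actor_t, critic_t, batch, opt_a, opt_c,
                gamma=0.99, tau=0.005):
    """One DDPG step: critic regression to the TD target, actor policy gradient,
    then exponential-moving-average (soft) updates of both target networks."""
    s, a, r, s_next = batch
    with torch.no_grad():
        y = r + gamma * critic_t(torch.cat([s_next, actor_t(s_next)], dim=-1))
    critic_loss = torch.nn.functional.mse_loss(critic(torch.cat([s, a], dim=-1)), y)
    opt_c.zero_grad(); critic_loss.backward(); opt_c.step()

    actor_loss = -critic(torch.cat([s, actor(s)], dim=-1)).mean()
    opt_a.zero_grad(); actor_loss.backward(); opt_a.step()

    for net, target in ((actor, actor_t), (critic, critic_t)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1.0 - tau).add_(tau * p.data)

state_dim, action_dim = 16, 4
actor = torch.nn.Sequential(torch.nn.Linear(state_dim, 64), torch.nn.ReLU(),
                            torch.nn.Linear(64, action_dim), torch.nn.Tanh())
critic = torch.nn.Sequential(torch.nn.Linear(state_dim + action_dim, 64),
                             torch.nn.ReLU(), torch.nn.Linear(64, 1))
actor_t, critic_t = copy.deepcopy(actor), copy.deepcopy(critic)
batch = (torch.rand(32, state_dim), torch.rand(32, action_dim),
         torch.rand(32, 1), torch.rand(32, state_dim))
ddpg_update(actor, critic, actor_t, critic_t, batch,
            torch.optim.Adam(actor.parameters(), lr=1e-4),
            torch.optim.Adam(critic.parameters(), lr=1e-3))
```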

RDS Characteristics | 33-Bus | 69-Bus | 118-Bus
Total Buses | | |
Thermoelectric Generators | | |
Wind Turbines | | |
Solar PV Generators | | |
Baseline Voltage (kV) | | |
Baseline Apparent Power (MVA) | | |
Total Load Active Power (MW) | | |
Total Load Reactive Power (MVAR) | | |


IV Experiments and Results
IV-A Experimental Settings
IV-A1 Evaluation Scenario
The proposed DRL-based strategy is tested on the modified IEEE 33-bus, 69-bus, and 118-bus radial distribution systems (RDSs) [38], with their detailed statistics provided in Table I. The time interval between two consecutive operational time steps is one hour, aligning with previous studies on voltage control in DNs [22, 39, 25, 40]. The simulation lasts for a one-year duration to ensure sufficient DRL training, and one Nvidia TITAN RTX graphics processing unit is used for training. The exploration noise employed in the DDPG is Gaussian with fixed mean and standard deviation, the actor and critic networks each consist of several fully-connected hidden layers, and we adopt the Adam optimizer to update both networks. The batch size, noise parameters, network sizes, number of stacked ST components in the MG-ASTGCN, and the other initialized parameters of our DRL-based control strategy are provided in Table II.
Note that we use the reserve cost coefficients and penalty cost coefficients for wind and solar generators provided by [41], and the cost coefficients of the thermoelectric generators provided by [28]. Additionally, Fig. 7 shows the one-day power output profiles of one wind generator and one solar generator in a part of one of the IEEE test systems.
IV-A2 Algorithm Performance Metric
In our experiments, the reward function defined in Eq. (41) is adopted to measure the performance of both the DDPG and the benchmark algorithms. Specifically, we introduce three metrics, namely the SCORE, the voltage fluctuation rate, and the RER accommodation rate, to examine the effectiveness of the learned control strategy, voltage control, and renewable accommodation, respectively, which are defined as
(53)
(54)
(55)
where the metrics are averaged over a number of evaluation episodes of fixed length, and both the voltage fluctuation rate and the RER accommodation rate are initialized before each evaluation.
RDS | Metric | HHO | GWO | IP | LQP | Ours
33-Bus | Time Cost per Step (secs) | |||||
SCORE | ||||||
Voltage Fluctuation Rate | ||||||
RER Accommodation Rate | ||||||
69-Bus | Time Cost per Step (secs) | |||||
SCORE | ||||||
Voltage Fluctuation Rate | ||||||
RER Accommodation Rate | ||||||
118-Bus | Time Cost per Step (secs) | |||||
SCORE | ||||||
Voltage Fluctuation Rate | ||||||
RER Accommodation Rate |



IV-B Experimental Results
IV-B1 Optimization-based Benchmark Comparisons
Two representative optimization-based algorithms, HHO [19] and GWO [20], are adopted for comparison with our DRL-based strategy. Benefiting from their meta-heuristic characteristic, both the HHO and GWO algorithms can approach the optimal control strategy without modifying or relaxing the optimization formulation, and meta-heuristic methods have also shown good convergence and robustness in previous studies [41, 42, 43]. In addition, we introduce two relaxation-based optimization methods, namely an interior-point (IP)-based method [44] and a linear/quadratic programming (LQP)-based method [45], as benchmarks. The evaluation results of these methods on the IEEE 33, 69, and 118-bus RDSs are illustrated in Fig. 8, with the associated statistics presented in Table III. The results reveal that our proposed DRL-based strategy outperforms all benchmarks by significant margins in terms of faster testing time, higher SCOREs, lower voltage fluctuation rates, and more renewable accommodation. The reason for such performance gaps is twofold:
•
The optimization-based benchmarks do not involve offline training and must search for a solution online, resulting in longer computation time at each evaluation time step. Meanwhile, since they tend to get stuck in local optima, their output actions are less likely to be optimal, resulting in smaller rewards.
•
The large amount of data collected by the DRL-based strategy during training contributes to a more effective and efficient search of the action space. Therefore, the DRL-based strategy reacts much faster at the beginning of evaluation and gradually obtains higher rewards, especially in large-scale DNs.
RDS | Voltage Control Weight | Renewable Accommodation Weight | Generation Cost Weight | Voltage Fluctuation | Renewable Accommodation
33-Bus | 1 | 1 | 0.01 | |
69-Bus | 1 | 1 | 0.01 | |
118-Bus | 1 | 1 | 0.01 | |
Algorithm | 33-Bus | 69-Bus | 118-Bus
A2C | secs | secs | secs |
ACER | secs | secs | secs |
PPO | secs | secs | secs |
SAC | secs | secs | secs |
DDPG | secs | secs | secs |
Additionally, we conducted a parameter search over the weight coefficients of the objective functions to determine their values: we trained and evaluated our DRL-based strategy with different weight combinations, and the associated voltage fluctuation rates and renewable accommodation rates are presented in Table IV. We always assign the same weight to voltage control and renewable accommodation because we assume the two objectives are of equal significance in our simulation. The parameter search results support our weight setting of 1, 1, and 0.01 for voltage control, renewable accommodation, and generation cost minimization, respectively, since it achieves the best performance in terms of a lower voltage fluctuation rate and a higher renewable accommodation rate.

IV-B2 DRL-based Baseline Comparisons
We also conducted comparisons with other state-of-the-art DRL algorithms, including PPO [46] and SAC [47]. Moreover, we added two more baselines: actor-critic with experience replay (ACER) [48] and advantage actor-critic (A2C) [49]. The training reward curves of these DRL algorithms are depicted in Fig. 9. The results show that the SAC algorithm performs slightly better than our DDPG algorithm, while the performances of the DDPG and PPO algorithms are quite close. The reason why we still choose the DDPG algorithm is twofold: a) DDPG has a lower training time cost than the PPO and SAC algorithms, as presented in Table V; specifically, the DDPG algorithm has the lowest average training time cost per episode on all three test systems, outperforming the PPO and SAC algorithms by significant margins; b) the DDPG is relatively easy to implement, with fewer hyper-parameters to tune.
IV-B3 Effectiveness of ST Attention and ST Graphical Information of the DN
To evaluate the effectiveness of the ST attention and its impact on the subsequent DDPG, the ST attention mechanism is substituted with several other correlation extraction techniques, including cosine similarity (CS) and Jaccard similarity (JS); their corresponding DRL strategies are termed "CS-DRL" and "JS-DRL", respectively, and we term our method with ST attention "STA-DRL". Moreover, we also design a baseline without the ST attention, i.e., keeping only the ST convolution module of our proposed MG-ASTGCN, which is termed "Conv-DRL". Furthermore, to fully verify the value of the extracted graphical information, we additionally introduce a multilayer-perceptron-based (MLP-based) network structure for the DDPG algorithm, which is unable to extract the ST graphical information of the DN; each layer of the MLP network is a fully-connected neural network layer, and we term this method "MLP-DRL". The associated training and evaluation results of the aforementioned methods are depicted in Fig. 10 and 11, respectively. Note that each data point in Fig. 10 is calculated by averaging the rewards of a window of consecutive episodes; for each episode, the maximum number of time steps is fixed, and the episode reward is the sum of each step's reward.
We can summarize several noteworthy observations regarding the effectiveness of ST attention and the extracted graphical information based on the simulation results as follows.
•
The outstanding training performance of our STA-DRL method indicates that the adoption of ST attention can dramatically increase the DDPG's convergence speed, leading to a more effective search of the underlying action space based on the extracted ST information. As a result, our STA-DRL method significantly outperforms all baselines during evaluation, as shown in Fig. 11.
•
The alternative correlation extraction methods, i.e., CS-DRL and JS-DRL, are less effective than the ST attention, since they only focus on degree correlations among different nodes, ignoring the nodes' inner features and their temporal dependencies. Interestingly, the performance of the MLP-DRL method surpasses that of CS-DRL, JS-DRL, and Conv-DRL, suggesting that poorly extracted graphical information can even hurt performance and that adopting an effective and suitable graphical information extraction method, e.g., our proposed MG-ASTGCN, is crucial for deriving well-performing strategies for voltage control and renewable accommodation.
Additionally, we observe two interesting phenomena from spatial and temporal attention matrices, which are illustrated in Fig. 12 and Fig. 13, respectively.
•
In Fig. 12, the spatial attention mechanism tends to assign larger attention weights to node pairs with more generator integration. For instance, two buses that are not adjacent can be connected through intermediate buses with several integrated generators, and the correlation strength of such a pair is accordingly more significant than that of other nonadjacent node pairs.
•
In the temporal attention for the recent graph segment, the correlation strengths between the current and historical node features drop sharply beyond the first few historical time steps, which may indicate that the latest feature vectors of a node contain the most significant temporal correlation information.


IV-B4 Stability Test of the DRL-based Strategy
We define the average response time, i.e., the number of time steps the DN takes to recover its voltages to normal levels, to assess the stability of the DN in the event of generator faults. Fig. 14a illustrates the response times of HHO, GWO, and the proposed DRL-based strategy for different numbers of faulted generators, where the DRL's response time grows roughly linearly while that of the two heuristic algorithms grows exponentially. Besides, Fig. 14b shows a more detailed case study with one faulted generator in the IEEE 69-bus RDS, indicating that the stability of the DN can be significantly improved through our DRL-based strategy.


IV-B5 Trade-Off Between Voltage Fluctuation Control and Renewable Accommodation
The weights in our designed reward function represent the relative importance of the objectives when learning the optimal control strategy. We trained and evaluated our DRL-based strategy with different voltage control weights to investigate our method's capability in voltage fluctuation control; the results are presented in Table VI. Specifically, the voltage fluctuation profiles of one representative bus in the IEEE 69-bus RDS are illustrated in Fig. 15. Surprisingly, stronger voltage control (obtained by increasing the voltage control weight) leads to performance degradation of our DRL-based strategy, with lower SCOREs as shown in Table VI. Such results suggest that overemphasizing voltage fluctuation control may undermine the strategy's overall performance.
Furthermore, we found that, while the performance of voltage control improves as its weight coefficient increases, the renewable accommodation rate conversely decreases, as shown in Table VII. Given that the DDPG's performance is related to both voltage control and renewable accommodation, the decreasing renewable accommodation rate may lead to the degradation of the DDPG's performance. Similarly, overemphasizing renewable accommodation also undermines the DDPG's performance due to the increasing voltage fluctuation rates, as shown in Table VIII. Based on the evaluation results presented in Tables VII and VIII, we may conclude that controlling voltage fluctuations and accommodating renewables is a trade-off; hence, striking an effective balance between voltage control and renewable accommodation appears to be the best solution. According to the parameter search results presented in Table IV, our weight setting of 1, 1, and 0.01 achieves the best balance between voltage control and renewable accommodation. Additionally, in practice, if stricter voltage control is required, we can always increase the voltage control weight to more effectively mitigate voltage fluctuations, though the renewable accommodation rate will inevitably decrease; similarly, more renewable generation can be integrated into the DN by increasing the renewable accommodation weight.

Voltage Control Weight Value | 33-Bus SCORE | 69-Bus SCORE | 118-Bus SCORE
Voltage Control Weight Value | 33-Bus | 69-Bus | 118-Bus
Renewable Accommodation Weight Value | 33-Bus | 69-Bus | 118-Bus
V Conclusion and Future Works
In this paper, we proposed a DRL-based strategy to balance the trade-off between voltage fluctuation control and renewable accommodation in the DN, with the aim of promoting the further adoption of RERs in DNs. We first derived an optimization formulation considering voltage control and efficient renewable accommodation, along with generation cost minimization. A novel MG-ASTGCN was then proposed to fully explore the ST correlations among node pairs through the ST attention mechanism and to extract ST information of the DN through ST convolution. The extracted multi-time-scale ST information is finally delivered to the DDPG to learn the optimal control strategy. We can draw several conclusions from the experimental results: 1) extracting ST correlations in the DN plays an essential role in improving the DDPG's performance and convergence speed, while the strategy's performance without the ST attention mechanism or ST graphical information significantly degenerates; 2) compared with the optimization-based benchmarks, the proposed DRL-based strategy achieves better performance with less computation time, and its adoption improves the network's stability in terms of shorter response times in the event of generator failures; 3) node pairs in the DN with more generator integration tend to have stronger spatial correlations, and, in the temporal view, the most recent node feature vectors tend to contain the most valuable temporal correlation information; 4) overemphasizing either voltage fluctuation control or renewable accommodation may result in sub-optimal operation.
In our future work, designing a suitable incentive mechanism to accommodate more RERs in smart grids will be studied.
References
- [1] M. Meinshausen, J. Lewis, C. McGlade, J. Gütschow, Z. Nicholls, R. Burdon, L. Cozzi, and B. Hackmann, “Realization of Paris Agreement pledges may limit warming just below 2 °C,” Nature, vol. 604, no. 7905, pp. 304–309, apr 2022.
- [2] L. Ranalder, H. Busch, T. Hansen, M. Brommer, T. Couture, D. Gibb, F. Guerra, J. Nana, Y. Reddy, J. Sawin, K. Seyboth, and F. Sverrisson, Renewables in Cities 2021 Global Status Report. REN21 Secretariat, Mar. 2021.
- [3] IPCC, Climate Change 2022: Mitigation of Climate Change. Contribution of Working Group III to the Sixth Assessment Report of the Intergovernmental Panel on Climate Change, P. Shukla, J. Skea, R. Slade, A. A. Khourdajie, R. van Diemen, D. McCollum, M. Pathak, S. Some, P. Vyas, R. Fradera, M. Belkacemi, A. Hasija, G. Lisboa, S. Luz, and J. Malley, Eds. Cambridge, UK and New York, NY, USA: Cambridge University Press, 2022.
- [4] S. R. Sinsel, R. L. Riemke, and V. H. Hoffmann, “Challenges and solution technologies for the integration of variable renewable energy sources—a review,” Renewable Energy, vol. 145, pp. 2271–2285, 2020.
- [5] K. E. Antoniadou-Plytaria, I. N. Kouveliotis-Lysikatos, P. S. Georgilakis, and N. D. Hatziargyriou, “Distributed and decentralized voltage control of smart distribution networks: Models, methods, and future research,” IEEE Transactions on Smart Grid, vol. 8, no. 6, pp. 2999–3008, 2017.
- [6] S. Bolognani, G. Cavraro, R. Carli, and S. Zampieri, “Distributed reactive power feedback control for voltage regulation and loss minimization,” IEEE Transactions on Automatic Control, vol. 60, 03 2014.
- [7] B. Zhang, A. Y. Lam, A. D. Domínguez-García, and D. Tse, “An optimal and distributed method for voltage regulation in power distribution systems,” IEEE Transactions on Power Systems, vol. 30, no. 4, pp. 1714–1726, 2015.
- [8] A. Maknouninejad and Z. Qu, “Realizing unified microgrid voltage profile and loss minimization: A cooperative distributed optimization and control approach,” IEEE Transactions on Smart Grid, vol. 5, no. 4, pp. 1621–1630, 2014.
- [9] K. Utkarsh, A. Trivedi, D. Srinivasan, and T. Reindl, “A consensus-based distributed computational intelligence technique for real-time optimal control in smart distribution grids,” IEEE Transactions on Emerging Topics in Computational Intelligence, vol. 1, pp. 1–1, 12 2016.
- [10] E. Dall’Anese, H. Zhu, and G. B. Giannakis, “Distributed optimal power flow for smart microgrids,” IEEE Transactions on Smart Grid, vol. 4, no. 3, pp. 1464–1475, 2013.
- [11] W. Zheng, W. Wu, B. Zhang, H. Sun, and Y. Liu, “A fully distributed reactive power optimization and control method for active distribution networks,” IEEE Transactions on Smart Grid, vol. 7, no. 2, pp. 1021–1033, 2016.
- [12] B. A. Robbins, H. Zhu, and A. D. Domínguez-García, “Optimal tap setting of voltage regulation transformers in unbalanced distribution systems,” IEEE Transactions on Power Systems, vol. 31, no. 1, pp. 256–267, 2016.
- [13] B. A. Robbins and A. D. Domínguez-García, “Optimal reactive power dispatch for voltage regulation in unbalanced distribution systems,” IEEE Transactions on Power Systems, vol. 31, no. 4, pp. 2903–2913, 2016.
- [14] L. Yu, D. Czarkowski, and F. de Leon, “Optimal distributed voltage regulation for secondary networks with dgs,” IEEE Transactions on Smart Grid, vol. 3, no. 2, pp. 959–967, 2012.
- [15] A. R. Di Fazio, G. Fusco, and M. Russo, “Decentralized control of distributed generation for voltage profile optimization in smart feeders,” IEEE Transactions on Smart Grid, vol. 4, no. 3, pp. 1586–1596, 2013.
- [16] E. Dall’Anese, S. V. Dhople, and G. B. Giannakis, “Photovoltaic inverter controllers seeking ac optimal power flow solutions,” IEEE Transactions on Power Systems, vol. 31, no. 4, pp. 2809–2823, 2016.
- [17] A. Abessi, V. Vahidinasab, and M. S. Ghazizadeh, “Centralized support distributed voltage control by using end-users as reactive power support,” IEEE Transactions on Smart Grid, vol. 7, no. 1, pp. 178–188, 2016.
- [18] M. Nayeripour, H. Sobhani, E. Waffenschmidt, and S. Hasanvand, “Coordinated online voltage management of distributed generation using network partitioning,” Electric Power Systems Research, vol. 141, pp. 202–209, 12 2016.
- [19] K. Mahmoud, M. M. Hussein, M. Abdel-Nasser, and M. Lehtonen, “Optimal voltage control in distribution systems with intermittent pv using multiobjective grey-wolf-lévy optimizer,” IEEE Systems Journal, vol. 14, no. 1, pp. 760–770, 2020.
- [20] A. Routray, R. K. Singh, and R. Mahanty, “Harmonic reduction in hybrid cascaded multilevel inverter using modified grey wolf optimization,” IEEE Transactions on Industry Applications, vol. 56, no. 2, pp. 1827–1838, 2020.
- [21] W. Wang, N. Yu, Y. Gao, and J. Shi, “Safe off-policy deep reinforcement learning algorithm for volt-var control in power distribution systems,” IEEE Transactions on Smart Grid, vol. 11, no. 4, pp. 3008–3018, 2020.
- [22] D. Cao, W. Hu, J. Zhao, Q. Huang, Z. Chen, and F. Blaabjerg, “A multi-agent deep reinforcement learning based voltage regulation using coordinated pv inverters,” IEEE Transactions on Power Systems, vol. 35, no. 5, pp. 4120–4123, 2020.
- [23] Q. Yang, G. Wang, A. Sadeghi, G. B. Giannakis, and J. Sun, “Two-timescale voltage control in distribution grids using deep reinforcement learning,” IEEE Transactions on Smart Grid, vol. 11, no. 3, pp. 2313–2323, 2020.
- [24] Y. Zhang, X. Wang, J. Wang, and Y. Zhang, “Deep reinforcement learning based volt-var optimization in smart distribution systems,” IEEE Transactions on Smart Grid, vol. 12, no. 1, pp. 361–371, 2021.
- [25] D. Cao, J. Zhao, W. Hu, F. Ding, Q. Huang, Z. Chen, and F. Blaabjerg, “Data-driven multi-agent deep reinforcement learning for distribution system decentralized voltage control with high penetration of pvs,” IEEE Transactions on Smart Grid, vol. 12, no. 5, pp. 4137–4150, 2021.
- [26] X. Sun and J. Qiu, “Two-stage volt/var control in active distribution networks with multi-agent deep reinforcement learning method,” IEEE Transactions on Smart Grid, vol. 12, no. 4, pp. 2903–2912, 2021.
- [27] H. Liu and W. Wu, “Two-stage deep reinforcement learning for inverter-based volt-var control in active distribution networks,” IEEE Transactions on Smart Grid, vol. 12, no. 3, pp. 2037–2047, 2021.
- [28] A. Panda and M. Tripathy, “Security constrained optimal power flow solution of wind-thermal generation system using modified bacteria foraging algorithm,” Energy, vol. 93, pp. 816–827, 12 2015.
- [29] R. Ma, X. Li, Y. Luo, X. Wu, and F. Jiang, “Multi-objective dynamic optimal power flow of wind integrated power systems considering demand response,” CSEE Journal of Power and Energy Systems, vol. 5, no. 4, pp. 466–473, 2019.
- [30] J. Woo, L. Wu, J.-B. Park, and J. Roh, “Real-time optimal power flow using twin delayed deep deterministic policy gradient algorithm,” IEEE Access, vol. 8, pp. 213 611–213 618, 2020.
- [31] X. Shi, H. Qi, Y. Shen, G. Wu, and B. Yin, “A spatial–temporal attention approach for traffic prediction,” IEEE Transactions on Intelligent Transportation Systems, vol. 22, no. 8, pp. 4909–4918, 2021.
- [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. u. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances in Neural Information Processing Systems, vol. 30, 2017.
- [33] S. Zhang, H. Tong, J. Xu, and R. Maciejewski, “Graph convolutional networks: a comprehensive review,” Computational Social Networks, vol. 6, no. 1, p. 11, Nov 2019.
- [34] M. Simonovsky and N. Komodakis, “Dynamic edge-conditioned filters in convolutional neural networks on graphs,” 2017.
- [35] I. J. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. Cambridge, MA, USA: MIT Press, 2016.
- [36] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. Cambridge, MA, USA: A Bradford Book, 2018.
- [37] T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra, “Continuous control with deep reinforcement learning,” 2019.
- [38] K. P. Schneider, B. A. Mather, B. C. Pal, C.-W. Ten, G. J. Shirek, H. Zhu, J. C. Fuller, J. L. R. Pereira, L. F. Ochoa, L. R. de Araujo, R. C. Dugan, S. Matthias, S. Paudyal, T. E. McDermott, and W. Kersting, “Analytic considerations and design basis for the ieee distribution test feeders,” IEEE Transactions on Power Systems, vol. 33, no. 3, pp. 3181–3188, 2018.
- [39] D. Cao, J. Zhao, W. Hu, F. Ding, Q. Huang, Z. Chen, and F. Blaabjerg, “Model-free voltage regulation of unbalanced distribution network based on surrogate model and deep reinforcement learning,” 2020.
- [40] D. Cao, J. Zhao, W. Hu, F. Ding, N. Yu, Q. Huang, and Z. Chen, “Model-free voltage control of active distribution system with pvs using surrogate model-based deep reinforcement learning,” Applied Energy, vol. 306, p. 117982, 2022. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S030626192101285X
- [41] I. U. Khan, N. Javaid, K. A. Gamage, C. J. Taylor, S. Baig, and X. Ma, “Heuristic algorithm based optimal power flow model incorporating stochastic renewable energy sources,” IEEE Access, vol. 8, pp. 148 622–148 643, 2020.
- [42] T. Dokeroglu, A. Deniz, and H. E. Kiziloz, “A robust multiobjective harris’ hawks optimization algorithm for the binary classification problem,” Knowledge-Based Systems, vol. 227, p. 107219, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0950705121004810
- [43] P. Hao and B. Sobhani, “Application of the improved chaotic grey wolf optimization algorithm as a novel and efficient method for parameter estimation of solid oxide fuel cells model,” International Journal of Hydrogen Energy, vol. 46, no. 73, pp. 36 454–36 465, 2021. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0360319921034194
- [44] F. Capitanescu, M. Glavic, D. Ernst, and L. Wehenkel, “Interior-point based algorithms for the solution of optimal power flow problems,” Electric Power Systems Research, vol. 77, no. 5, pp. 508–517, 2007. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0378779606001209
- [45] P. Fortenbacher and T. Demiray, “Linear/quadratic programming-based optimal power flow using linear power flow and absolute loss approximations,” International Journal of Electrical Power & Energy Systems, vol. 107, pp. 680–689, 2019. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0142061518325377
- [46] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” 2017.
- [47] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in Proceedings of the 35th International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 80. PMLR, 10–15 Jul 2018, pp. 1861–1870.
- [48] Z. Wang, V. Bapst, N. Heess, V. Mnih, R. Munos, K. Kavukcuoglu, and N. de Freitas, “Sample efficient actor-critic with experience replay,” 2017.
- [49] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, and K. Kavukcuoglu, “Asynchronous methods for deep reinforcement learning,” in Proceedings of The 33rd International Conference on Machine Learning, ser. Proceedings of Machine Learning Research, vol. 48. PMLR, 20–22 Jun 2016, pp. 1928–1937.