On the Use of Reinforcement Learning for Attacking and Defending Load Frequency Control

Amr S. Mohamed
Department of Electrical Engineering
University of Toronto
Toronto, ON M5S 3G4, Canada
[email protected]
Deepa Kundur
Department of Electrical Engineering
University of Toronto
Toronto, ON M5S 3G4, Canada
[email protected]

Abstract

The electric grid is an attractive target for cyberattackers given its critical nature in society. With the increasing sophistication of cyberattacks, effective grid defense will benefit from proactively identifying vulnerabilities and attack strategies. We develop a deep reinforcement learning-based method that recognizes vulnerabilities in load frequency control, an essential process that maintains grid security and reliability. We demonstrate how our method can synthesize a variety of attacks involving false data injection and load switching, while specifying the attack and threat models – providing insight into potential attack strategies and impact. We discuss how our approach can be employed for testing electric grid vulnerabilities. Moreover, our method can be employed to generate data to inform the design of defense strategies and develop attack detection methods. For this, we design and compare a (deep learning-based) supervised attack detector with an unsupervised anomaly detector to highlight the benefits of developing defense strategies based on identified attack strategies.

Keywords Reinforcement learning, cyber-physical security, power system, autoencoder, anomaly detection

1 Introduction

The electrical power grid is evolving to provide enhanced availability, efficiency, and reliability of electricity through an increased reliance on information and communication technologies [1]. This modernization has and will continue to introduce new and critical cybersecurity vulnerabilities [2]. If exploited by cyberattackers, the resulting damage can have devastating consequences to the welfare of society, including economic loss, injury, and loss of life [3].

Recent cyberattacks illustrate this risk. The 2015 Ukrainian grid attack left 225,000 people without electricity for hours: the attackers sabotaged operator workstations, wiped system files, flooded phone lines, and disabled backup power supplies to bring down the grid and impede its restoration [4]. Such attacks have been shown to exhibit prior system knowledge on the part of the attacker, stealth, and a high degree of sophistication. We assert that without a reasonable understanding of attacker resources and strategies, electric grid defense is limited to a reactive stance, which leaves the defender at a fundamental disadvantage and makes defenses easier to bypass [5, 6, 7, 8, 9]. Hence, in this paper we consider a more proactive perspective of defense.

To detect common forms of data corruption attacks, power systems have traditionally relied on bad data detection (BDD) methods, which were originally developed to detect highly corrupt measurements (often stemming from telemetry error). BDD methods use historical data sets, statistical approaches, and approximate system models to flag abnormal measurements and are thus limited to detecting more naively constructed cyberattacks. Specifically, these methods, used as a reactive form of defense, fail to detect attacks that either exploit model inaccuracy or are intentionally crafted such that their distribution is similar to that of the historical system data [10]. To address these limitations, recent research on attack detection has leveraged data-driven methods such as machine learning [11, 12, 13] and more recently deep learning [14, 15]. Studies have shown how these methods are more effective than traditional BDD especially in detecting false data injection (FDI) in the context of state estimation or load frequency control in power systems.

Nevertheless, these data-driven methods are typically evaluated using attacks that are randomly generated or crafted using a simple library of templates, which we argue fail to assess their performance against more realistic attacks that are complex in nature and targeted based on knowledge of system vulnerabilities and dynamics. Thus, we assert that the effectiveness of learning-based methods against such attacks remains largely untested. Further, we believe that a proactive stance is needed to identify and address vulnerabilities.

One strategy, which is the focus of this paper, is to synthesize novel attacks (via intelligent attacker modelling) to inform defense development pre-emptively. The goal of attack synthesis, here, would be to provide insight into attacker strategies to forecast security requirements, appropriately reinforce grid defenses, and improve situational awareness. Computationally-based attack synthesis approaches include generative adversarial networks (GAN) (e.g., [16]), optimization (including reachability frameworks, e.g., [17]), and reinforcement learning (RL). GANs learn known attack patterns to synthesize additional attack realizations; however, prior knowledge of attacks is required, which limits their ability to synthesize new unknown attacks. Optimization-based methods are model-based, requiring accurate system models to synthesize effective attacks, and strong assumptions on the system and/or attack models. In contrast, RL agents can learn new attacks with zero to little prior knowledge of the system and attacks. Further, RL is data-driven, relying on partial system observations, which can involve complex dynamics and inter-dependencies.

Given these advantages, RL is increasingly being applied to electric grid attack and defense. Various studies have constructed (Q-learning) RL agents to synthesize line-switching attacks [18, 19, 20, 21, 22], which exploit how sudden changes in grid topology can lead to cascading failures and blackout, and to develop defense strategies against them [21, 22]. Additionally, RL has been applied to the synthesis of FDI attacks in power systems [23], where the RL agent mimics a virus in a compromised power substation attempting to induce voltage sags in the system. A multi-agent RL approach has also been proposed in [24] to synthesize FDI attacks that can bypass attack detection methods in DC microgrids.

In this work, we extend RL research for attack synthesis against the electric grid to load frequency control (LFC). Frequency deviation negatively impacts grid operation, security, and reliability [25], and can potentially result in equipment damage, load performance degradation, transmission line overload, generation loss, and grid instability [26]. Due to its critical role in maintaining nominal grid frequency, LFC is a valuable target of cyberattacks. As such, the security of LFC has been the subject of a growing body of research papers [25].

In this research, we build on this literature to demonstrate the utility of RL in replicating known attacks and synthesizing new realizations against LFC. We show how our RL paradigm holistically explores the attack space to expose possible attacker strategies to help specify attack requirements and verify attack/threat model assumptions to improve electric grid defense. Hence, in this paper,

1. we employ RL, for the first time, in the synthesis of attacks against LFC, training RL agents to execute FDI and load switching attack strategies. To the best of our knowledge, our study is the first to apply RL to dynamic power system cyber-physical security. Unlike previous studies that focused on the RL agent’s effect on power flow and state estimation computations, we investigate its influence on the power system’s dynamics and validate results empirically. Additionally, we provide novel RL reward function formulations that serve as templates for facilitating the training of RL agents against LFC. The reward functions can guide future research by serving as a blueprint for rewarding and training RL agents to relieve or induce stress on the power system by deviating from its nominal states.

2. we present and discuss various novel benefits to utilizing RL for cyber-physical security, including replicating known attacks, exploring the attack space, revealing potential attack strategies, specifying attack/threat model assumptions, and developing proactive defense strategies.

3. we harness the RL-generated data to train a supervised learning-based attack detector with a long short-term memory (LSTM) neural network, and compare it with state-of-the-art unsupervised anomaly detection based on autoencoders to demonstrate the benefits of RL-based attack synthesis for defense. In this way, we demonstrate the utility of RL attack synthesis in improving detection-based mitigation.

This paper is structured as follows. In Section 2, we review LFC, survey cyberattacks targeting LFC systems, and provide relevant background on RL. Section 3 discusses the threat model that we consider. RL agent and attack detection methodologies are developed in Section 4, followed in Section 5 by RL training and attack detector performance results, with a discussion of benefits and challenges. Section 6 provides concluding remarks.

2 Background

Our research focuses on the use of RL to generate attacks against LFC and subsequently leveraging the RL-generated attacks to devise defense strategies. Hence, in this section, we introduce LFC and its vulnerabilities to cyberattacks and summarize RL.

2.1 Frequency Control and Protection

LFC maintains power balance and grid frequency, and consists of primary, secondary, and tertiary control levels [27]. The primary level employs droop-governor control to regulate frequency, while the secondary level uses automatic generation control (AGC) to regulate the net interchange of power. Tertiary control provides additional frequency support mechanisms by restoring power reserve. Failure to regulate frequency can cause frequency protection devices (including ANSI 81U/O/R) to isolate power system equipment to protect it from damage caused by operation at abnormal frequency, resulting in an unwanted reduction of the system.

The implementation of LFC over a wide-area network with open communication protocols and minimal human supervision increases its cyberattack surface. Interference of LFC operation is possible by exploiting a variety of vulnerabilities in insecure legacy electric grid networks, open communication protocols and operating systems, as well as introducing malware through infected emails or USBs, performing supply-chain attacks, or capitalizing on disgruntled insiders [28, 29, 30]. These cyberattacks often aim to compromise critical measurement signals to ultimately destabilize the grid.

This work focuses on cyberattacks aimed at the unwanted triggering of frequency protection relays in power grids, which can initiate sudden power imbalance leading to grid instability, cascading failure, and blackout. Previous studies highlight the significance of investigating these attacks, showing that attackers can modify loads [26, 31, 32, 33] or tie-line power [26], inject disturbances to automatic generation control [17], or corrupt microgrid synchronization [34] to trigger protection devices and/or destabilize the grid. While these authors applied strong assumptions and/or knowledge of the system dynamics in the development of cyberattacks, in this paper we demonstrate that RL can synthesize attacks with zero to little prior knowledge.

2.2 Reinforcement Learning

In RL, an agent is trained through a process of trial-and-error to achieve optimal decisions or strategies in an environment of which the agent has zero to little prior knowledge [35]. Training an RL agent to attack the grid can yield novel unforeseen insight into system vulnerabilities and attack strategies. Establishing an RL problem within the context of synthesising attacks requires defining the environment (representing the cyber-physical system), and specifying what actions (representing attacks) the agent can execute in the environment and what environmental states it can observe and make decisions based on. A reward function is formulated to steer the agent into taking actions that achieve the attack goals.

The agent contains two components: a policy and a learning algorithm. The goal of the policy is to map environment observations to actions that maximize rewards. The policy can involve an actor, a critic, or both (actor-critic) as function approximators. An actor $\pi\colon S\rightarrow A$ maps environment observations $S$ to actions $A$. A critic $Q\colon(S,A)\rightarrow R$ maps action-observation pairs to (predicted) discounted cumulative long-term rewards $R$. The learning algorithm continuously updates the policy to find the optimal policy. Learning happens in episodes: simulations that expire after the RL agent achieves a certain goal or after a maximum simulation length.

In this work, we employ deep deterministic policy gradient (DDPG) RL as it is the simplest learning algorithm compatible with continuous actions and observations – to highlight the feasibility of implementing our work offensively as a cyberattacker or defensively for electric grid cyber-physical security.

3 Threat Model

We train RL agents to execute attacks against LFC that directly lead to protection tripping and loss of generation. Offensively, the agent can be deployed by a cyberattacker during a cyber breach in a man-in-the-middle or FDI attack, or be packaged within (malicious) software that is uploaded to a control device. We assume the following salient components of the threat model:

• The electric grid exhibits oscillatory eigenmodes that can be leveraged. Generators’ mechanical construction gives rise to their own eigenmodes. The existence of inter-area oscillatory eigenmodes is evident in most multi-machine power systems [32].

• The attacker can perform one of the following actions:
  1. corrupt frequency (sensor) measurements to the control center [36],
  2. corrupt generation control signals [34, 17],
  3. corrupt tie-line power measurements [26, 36], or
  4. compromise loads [26, 31, 32, 33].
  For detailed information on threat models and attack execution, please refer to the cited papers.

• The attacker can observe the grid frequency. Since frequency is a global state of the electric grid, the attacker can easily measure the grid frequency [26], and compute its derivative (rate of change) and/or time-integral.

• The attacker is not assumed to have any knowledge of the system; hence, we initialize the RL agent with zero knowledge of its environment.

4 Method

To the best of our knowledge, our work is the first to apply the methods presented in this section, including the dynamical power system model and machine learning algorithms, within an RL framework to attack and defend LFC. We first model LFC in the RL environment and develop the RL agents. Next, we construct the supervised attack detector and unsupervised anomaly detector. Numerical details of the algorithms are addressed in Section 5.

4.1 System Architecture

We apply the swing equation [27] to model LFC. The following state-space system expresses the linear load-frequency dynamics:

\bm{\dot{x}}=\bm{A}\bm{x}+\bm{B}\bm{u}+\bm{W}\bm{p}

(1)

where the state is

\bm{x}=\begin{bmatrix}\Delta e&\Delta P_{g}&\Delta P_{m}&\Delta\omega&\Delta\hat{\omega}&\hat{\dot{\omega}}\end{bmatrix}^{T},

(2)

Here, $e$ is the governor-droop control signal, $P_{g}$ the governor output, $P_{m}$ the mechanical power, $\omega$ the system frequency, $\hat{\omega}$ the frequency measurement, and $\hat{\dot{\omega}}$ the rate-of-change-of-frequency measurement.

The input vectors $\bm{u}$ and $\bm{p}$ represent the inputs to the systems during normal operation and attacks, respectively. The input vector:

\bm{u}=\begin{bmatrix}\Delta P_{L}&\Delta P_{tie}\end{bmatrix}^{T}

(3)

includes change in the demand, $P_{L}$ , and tie-line power, $P_{tie}$ , if any. The attack vector:

\bm{p}=\begin{bmatrix}p_{1}&p_{2}&p_{3}&p_{4}\end{bmatrix}^{T}

(4)

includes actions the attacker can execute as enumerated in our threat model, including corrupting frequency measurements to the control center ( $p_{1}$ ), corrupting generation control signals ( $p_{2}$ ), corrupting tie-line power measurements ( $p_{3}$ ), and compromising load switching ( $p_{4}$ ). The state matrices in (1) are elaborated in the Appendix.
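As an illustration of how a model of the form (1) can be stepped forward in time, the following Python sketch integrates a generic system $\dot{x}=Ax+Bu+Wp$ with forward Euler. The matrices are hypothetical two-state placeholders, not the LFC matrices given in the Appendix; the function name `simulate`, the step size, and the constant inputs are likewise assumptions made only for this example.

```python
import numpy as np


def simulate(A, B, W, x0, u_fn, p_fn, dt, steps):
    """Forward-Euler integration of x' = A x + B u(t) + W p(t)."""
    x = np.asarray(x0, dtype=float)
    trajectory = [x.copy()]
    for k in range(steps):
        t = k * dt
        dx = A @ x + B @ u_fn(t) + W @ p_fn(t)
        x = x + dt * dx
        trajectory.append(x.copy())
    return np.array(trajectory)


# Hypothetical stable toy system (NOT the paper's LFC matrices).
A = np.array([[-1.0, 0.0], [0.0, -2.0]])
B = np.array([[1.0], [0.0]])  # normal-operation input enters state 1
W = np.array([[0.0], [1.0]])  # attack input enters state 2

trajectory = simulate(
    A, B, W,
    x0=[0.0, 0.0],
    u_fn=lambda t: np.array([1.0]),  # constant normal-operation input
    p_fn=lambda t: np.array([2.0]),  # constant attack input
    dt=0.01,
    steps=2000,
)
```

In this toy system both states settle near 1, the steady state implied by $Ax+Bu+Wp=0$; with the actual LFC matrices the same loop would produce the frequency response that the RL agent observes.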

Frequency relays protect generators from damage sustained during operation in unsafe frequency by ensuring the frequency and its rate of change are within a safe set

\mathcal{S}=\{(\hat{\omega},\hat{\dot{\omega}})\colon\text{UF}\leq\hat{\omega}\leq\text{OF},\ \lvert\hat{\dot{\omega}}\rvert\leq\text{ROCOF}\}

(5)

Frequency relay functions include under-frequency (UF), over-frequency (OF), and rate-of-change-of-frequency (ROCOF) protection. Recommended settings per IEEE 1547 [37] are listed in Table 1.

Table 1: IEEE 1547 recommended relay settings for Category III

Protection Function	Threshold	Clearing time
OF2	$62.0$ Hz	$160$ ms
UF2	$56.5$ Hz	$160$ ms
ROCOF	3 Hz/s
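The safe set (5) with the thresholds of Table 1 can be written as a simple membership test. The sketch below is a minimal illustration: the function name `in_safe_set` is hypothetical, measurements are taken in Hz and Hz/s, and clearing times are ignored, which is a simplifying assumption.

```python
# Thresholds from Table 1 (IEEE 1547, Category III).
UF_HZ = 56.5
OF_HZ = 62.0
ROCOF_HZ_PER_S = 3.0


def in_safe_set(freq_hz, rocof_hz_per_s):
    """Return True if (frequency, ROCOF) lies inside the safe set of (5).

    Clearing times are ignored in this sketch (an assumption): a single
    out-of-range sample is treated as leaving the safe set.
    """
    freq_ok = UF_HZ <= freq_hz <= OF_HZ
    rocof_ok = abs(rocof_hz_per_s) <= ROCOF_HZ_PER_S
    return freq_ok and rocof_ok
```

For example, `in_safe_set(60.0, 0.0)` holds for nominal operation, while a measurement of 56.4 Hz or a rate of change of -3.1 Hz/s falls outside the set.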

4.2 RL for Attacking Load Frequency Control

Initialize the mini-batch size $M$; actor and critic learning rates $\alpha_{\theta},\alpha_{\phi}$; discount factor $\gamma$; target smoothing factor $\tau$; episode length; training step length;
Define the action space $\mathcal{A}$ and the noise distribution;
Initialize the critic $Q(S,A;\phi)$ and target critic $Q_{t}(S,A;\phi_{t})$ neural networks with random parameters $\phi=\phi_{t}$;
Initialize the actor $\pi(S;\theta)$ and target actor $\pi_{t}(S;\theta_{t})$ neural networks with random parameters $\theta=\theta_{t}$;
for each training episode do
  for each training step do
    For the current observation $S=(\hat{\omega},\hat{\dot{\omega}})$, select action $A=\pi(S;\theta)+N$ with noise $N$;
    Execute action $A$ as an attack on the power system through one of the inputs in $\bm{p}$ (refer to (1)). Observe the reward $R$ and the next observation $S^{\prime}$;
    Store the experience $(S,A,R,S^{\prime})$ in the experience buffer;
    Sample a random mini-batch of $M$ experiences $(S_{i},A_{i},R_{i},S^{\prime}_{i})$ from the experience buffer;
    for each sampled experience do
      Calculate the value function target $y_{i}$;
      if $S^{\prime}_{i}$ is a terminal state then
        $y_{i}=R_{i}$ (6)
      else
        $y_{i}=R_{i}+\gamma Q_{t}(S^{\prime}_{i},\pi_{t}(S^{\prime}_{i};\theta_{t});\phi_{t})$ (7)
      end if
    end for
    Compute the loss over the mini-batch;
      $L=\frac{1}{M}\sum_{i=1}^{M}(y_{i}-Q(S_{i},A_{i};\phi))^{2}$ (8)
    Update the critic parameters by minimizing $L$;
      $\phi\leftarrow\phi-\alpha_{\phi}\frac{\partial L}{\partial\phi}$ (9)
    Update the actor parameters by ascending the sampled policy gradient;
      $\frac{\partial J}{\partial\theta}\leftarrow\frac{1}{M}\sum_{i=1}^{M}\frac{\partial}{\partial A}Q(S_{i},A;\phi)\big|_{A=\pi(S_{i};\theta)}\frac{\partial}{\partial\theta}\pi(S_{i};\theta)$ (10)
      $\theta\leftarrow\theta+\alpha_{\theta}\frac{\partial J}{\partial\theta}$ (11)
    If $S\notin\mathcal{S}$, label $S$ as a terminal state and end the episode. Store episode data;
    Update the target actor and critic parameters periodically;
      $\phi_{t}=\tau\phi+(1-\tau)\phi_{t}$ (12)
      $\theta_{t}=\tau\theta+(1-\tau)\theta_{t}$ (13)
  end for
end for

Algorithm 1 DDPG algorithm for attacking LFC
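The value targets (6)-(7), the critic loss (8), and the target-network update (12)-(13) can be sketched with plain NumPy arrays. This is a minimal illustration under stated assumptions, not the implementation used for the results (which were generated with MATLAB and Simulink): the function names are hypothetical, the target-critic outputs `next_q` are assumed to be precomputed, and the network parameters are represented as flat arrays.

```python
import numpy as np


def td_targets(rewards, next_q, terminal, gamma):
    """Value targets y_i: R_i if S'_i is terminal, else R_i + gamma * Q_t(S'_i, pi_t(S'_i))."""
    rewards = np.asarray(rewards, dtype=float)
    next_q = np.asarray(next_q, dtype=float)      # assumed precomputed target-critic outputs
    terminal = np.asarray(terminal, dtype=float)  # 1.0 where S'_i is terminal
    return rewards + gamma * next_q * (1.0 - terminal)


def critic_loss(targets, q_values):
    """Mean squared error between targets y_i and critic outputs Q(S_i, A_i)."""
    targets = np.asarray(targets, dtype=float)
    q_values = np.asarray(q_values, dtype=float)
    return float(np.mean((targets - q_values) ** 2))


def soft_update(target_params, params, tau):
    """Target-network update: tau * params + (1 - tau) * target_params."""
    return tau * np.asarray(params, dtype=float) + (1.0 - tau) * np.asarray(target_params, dtype=float)


y = td_targets(rewards=[1.0, 2.0], next_q=[10.0, 10.0], terminal=[False, True], gamma=0.9)
loss = critic_loss(y, q_values=[9.0, 2.0])
new_target = soft_update(target_params=[0.0, 0.0], params=[1.0, 2.0], tau=0.1)
```

The gradient steps (9)-(11) are omitted because they depend on the chosen neural network library.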

We assume that the attacker can observe the system frequency. We model actions $\{p_{1},p_{2},p_{3}\}$, which entail corruption of communication, as continuous-valued actions. In practice, the attacker’s ability to inject an attack signal is limited by physical constraints, constraints imposed by the communication protocol, or the need to avoid detection by bad data detectors. Hence, we assume that the attack vector $\bm{p}$ is bounded. For load switching, if the attacker compromises an aggregate load $P_{sw}$ and can switch portions of the total load on and off, then $p_{4}$ can also be modelled as a continuous-valued action with $p_{4}\in[0,P_{sw}]$. If the attacker can only switch the entire load on and off, then $p_{4}$ is a discrete-valued action with $p_{4}\in\{0,P_{sw}\}$. We discretize the DDPG RL actions in this case as follows:

p_{4}=\begin{cases}0,&\text{if }p_{4}<P_{sw}/2,\\P_{sw},&\text{if }p_{4}\geq P_{sw}/2\end{cases}

(14)
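The thresholding in (14) amounts to a one-line mapping from the continuous actor output to an on/off load command. In this sketch, the function name `discretize_switch` and the argument `raw_action` (the continuous DDPG output before discretization) are hypothetical names introduced for illustration.

```python
def discretize_switch(raw_action, p_sw):
    """Map a continuous action in [0, p_sw] to a full-load on/off command, as in (14)."""
    return p_sw if raw_action >= p_sw / 2 else 0.0
```

For instance, with a switchable load of 0.18 pu, a raw action of 0.05 pu maps to 0 (load off) and a raw action of 0.1 pu maps to 0.18 pu (load on).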

In Algorithm 1, we adapt DDPG to attack LFC. Here, the RL agent observes the system state $S=(\hat{\omega},\hat{\dot{\omega}})$ and influences the power system (1) by injecting an attack action $A$ through the input vector $\bm{p}$. The learning objective is for the RL agent to learn a policy $\pi\colon S\rightarrow A$ that forces the observed state $S$ outside the safe set $\mathcal{S}$ (5), effectively triggering a frequency protection device. The attacker can then execute this policy to attack and destabilize a real system.

We collect all episodes in a dataset for training the supervised-learning attack detector. The DDPG neural network architecture is detailed in the Appendix.

4.3 Unsupervised Autoencoder-based Anomaly Detection

Unsupervised machine learning methods detect potential cyberattacks by learning patterns and regularities in normal operation data and flagging anomalies. Due to the lack of labelled cyberattack datasets, unsupervised learning approaches, particularly autoencoder-based detectors, have seen increased application for attack detection [38].

Autoencoders consist of a deep neural network – partitioned into an encoder and decoder – that is trained to reconstruct its input data. The encoder maps the input normal operation data to a compressed hidden representation based on regularities in the data, and its decoder attempts to map this representation back to the input data. When the trained autoencoder is applied to new data, a large variation between the data and the autoencoder’s reconstruction indicates an anomaly in the data, which can then be classified as an attack.

To simulate normal operation data, we model the change in power demand as a random process following the work in [39, 40]. We model the change in demand as a 1 second-per-step random walk sampling from a Gaussian distribution $\mathcal{N}(0,\sigma_{1}^{2})$ superimposed on a 5 minute-per-step random walk sampling from a Gaussian distribution $\mathcal{N}(0,\sigma_{2}^{2})$ . Fig. 1 illustrates the change in load power as simulated by this random process and the resulting frequency and rate of change of frequency deviation.

Figure 1: Change in load power as a stochastic process with $\sigma_{1}=(0.05/3)$ and $\sigma_{2}=(0.2/3)$ to simulate normal grid operation data.
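One possible reading of this load model is sketched below in Python: a fast random walk with a Gaussian increment every second, added to a slow random walk whose value changes every 5 minutes and is held constant in between. Interpreting each random walk as a cumulative sum of Gaussian increments, holding the slow component piecewise constant, and the function name `simulate_load_change` are all assumptions of this sketch.

```python
import numpy as np


def simulate_load_change(duration_s, sigma1, sigma2, slow_step_s=300, seed=0):
    """Return one load-change sample per second over duration_s seconds."""
    rng = np.random.default_rng(seed)
    # Fast component: one Gaussian increment per second.
    fast = np.cumsum(rng.normal(0.0, sigma1, duration_s))
    # Slow component: one Gaussian increment per 5-minute block, held constant within the block.
    n_blocks = -(-duration_s // slow_step_s)  # ceiling division
    slow_levels = np.cumsum(rng.normal(0.0, sigma2, n_blocks))
    slow = np.repeat(slow_levels, slow_step_s)[:duration_s]
    return fast + slow


# Standard deviations taken from the caption of Figure 1.
load_change = simulate_load_change(900, sigma1=0.05 / 3, sigma2=0.2 / 3)
```

Fixing the seed makes the trace reproducible, which is convenient when the same normal-operation data are reused to train and compare detectors.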

We develop an autoencoder to detect anomalies in the time-series consisting of the governor-droop control signal $e$ and the frequency measurement $\hat{\omega}$. To prepare the training dataset, we run a long simulation of normal system operation and then randomly crop $N$ portions. The portions are sampled at 50 milliseconds per sample and have variable (time) length. Each portion is a vector $X_{i}\in\mathbb{R}^{2\times n_{i}}$ representing a time-series of ($e,\hat{\omega}$), where $n_{i}$ is the number of samples in the portion.

The overall autoencoder can be expressed as a mapping $f_{ae}\colon X_{i}\rightarrow\hat{X}_{i}$ between the input data $X_{i}$ and its reconstruction $\hat{X}_{i}$, where $f_{ae}(X;\phi)$ is a neural network with parameters $\phi$. We develop an LSTM neural network-based autoencoder, described in the Appendix, given the suitability of LSTM networks for time-series. Training the autoencoder seeks to minimize the mean square error between the data and their reconstructions:

L_{MSE}=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{2}(\hat{X}_{i}-X_{i})^{2}

(15)

Next, a threshold is selected based on the maximum reconstruction error observed on the validation set to differentiate between normal and anomalous data. When new data are fed to the autoencoder, they are labelled normal if the reconstruction error is below this threshold and anomalous otherwise.
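The thresholding step can be sketched as follows. The function names are hypothetical, the reconstruction errors are assumed to have been computed already (one scalar per time-series portion), and treating an error exactly equal to the threshold as normal is a choice made for this sketch.

```python
import numpy as np


def select_threshold(validation_errors):
    """Use the largest reconstruction error seen on normal validation data as the threshold."""
    return float(np.max(validation_errors))


def label_portion(error, threshold):
    """Label a portion by comparing its reconstruction error with the threshold."""
    # Assumption: an error equal to the threshold still counts as normal.
    return "normal" if error <= threshold else "anomalous"


threshold = select_threshold([0.05, 0.12, 0.2])
labels = [label_portion(err, threshold) for err in (0.1, 0.3)]
```

A threshold chosen this way produces no false alarms on the validation data by construction, and its behavior on unseen normal data depends on how representative that validation set is.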

4.4 Supervised attack detection using RL generated data

Supervised methods are trained to distinguish between normal operation and attacks by training on labelled system data. To collect labelled data, we augment the normal operation dataset (used to train the autoencoder) with the RL-generated dataset and label the data into 4 categories: (1) normal operation, and attacks that (2) do not trigger protection, (3) trigger under- or over-frequency protection, and (4) trigger rate-of-change-of-frequency protection. Each record in the dataset includes a vector $X_{i}\in\mathbb{R}^{2\times n_{i}}$ representing a time-series of the control signal and frequency measurement, and a label $\ell_{i}\in\mathcal{C}=\{1,2,3,4\}$.

We train the supervised attack detector to classify the data to their correct labels. The attack detector consists of a neural network $f_{ad}\colon X_{i}\rightarrow\{P_{i}^{(c)}\}_{c\in\mathcal{C}}$ mapping the input data $X_{i}$ to the probability of $X_{i}$ belonging to each category, where $P_{i}^{(c)}$ is the probability of $X_{i}$ belonging to category $c$. The category with the highest probability is chosen as the label for the instance. Training the detector seeks to minimize the cross-entropy loss:

L_{CE}=-\sum_{i=1}^{N}\sum_{c\in\mathcal{C}}1\{\ell_{i}=c\}\log\left(P_{i}^{(c)}\right)

(16)
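To make (16) concrete, the following sketch evaluates the cross-entropy loss and the predicted labels for a small batch of class probabilities. The probabilities are made-up illustrative numbers, not detector outputs, and the function names are hypothetical.

```python
import numpy as np


def cross_entropy(probs, labels):
    """Summed cross-entropy of (16): -sum_i log P_i^(l_i), with labels in {1, 2, 3, 4}."""
    probs = np.asarray(probs, dtype=float)
    idx = np.asarray(labels) - 1  # categories are 1-based in the text
    picked = probs[np.arange(len(idx)), idx]
    return float(-np.sum(np.log(picked)))


def predict(probs):
    """Choose the category with the highest probability for each record."""
    return (np.argmax(np.asarray(probs, dtype=float), axis=1) + 1).tolist()


# Illustrative probabilities for two records (rows) over the four categories.
probs = [[0.7, 0.1, 0.1, 0.1],
         [0.1, 0.2, 0.6, 0.1]]
loss = cross_entropy(probs, labels=[1, 3])
predicted = predict(probs)
```

Only the probability assigned to the true category of each record contributes to the loss, so here the loss equals $-(\ln 0.7+\ln 0.6)$.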

Again, we develop an LSTM neural network-based attack detector (see the Appendix), given its suitability for time-series and to allow comparison between the supervised and unsupervised attack detection methods.

To deploy the anomaly and attack detectors, the intelligent electronic devices (IEDs) of the generator governors can be upgraded to collect measurements of the governor control signal and local frequency. The IED stores the last $n_{i}-1$ measurement samples, appends the latest sample, and performs the neural network computations to classify the system data. The classification can be communicated to the grid operator’s SCADA system to raise an alert on attacks, or actions can be programmed into the IED to autonomously mitigate detected attacks.
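The sample buffering described above can be sketched with a fixed-length queue. The class name `MeasurementWindow` and its methods are hypothetical, and a fixed window length is assumed for simplicity; the classifier call itself is left out because it depends on the deployed network.

```python
from collections import deque


class MeasurementWindow:
    """Keeps the most recent n samples of (control signal, frequency measurement)."""

    def __init__(self, n):
        self.samples = deque(maxlen=n)  # the oldest sample is dropped automatically

    def push(self, e, omega_hat):
        """Append the latest sample; return True once the window is full."""
        self.samples.append((e, omega_hat))
        return len(self.samples) == self.samples.maxlen

    def as_rows(self):
        """Return the window as two rows (e and omega_hat), i.e. a 2-by-n array layout."""
        return [list(row) for row in zip(*self.samples)]


window = MeasurementWindow(n=3)
ready = [window.push(e, w) for e, w in [(0.1, 1.0), (0.2, 1.1), (0.3, 1.2), (0.4, 1.3)]]
rows = window.as_rows()
```

Once `push` returns `True`, the two rows can be passed to the detector at every new sample.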

5 Results

This section demonstrates the use of RL for synthesizing FDI and load switching attacks, defense using the attack and anomaly detectors, the benefits of the RL method, an optimality evaluation of RL attacks, and the limitations and challenges of the method. Our results were validated using three different microgrid models (MG1 to MG3) that share the network depicted in Figure 2 but have different load frequency control parameters, as listed in the Appendix. MG1 is the base microgrid model and replicates a 2.5 MVA rated microgrid in rural Ontario, Canada. Microgrids, with their low inertia, are vulnerable to cyberattacks, making them ideal for studying power system vulnerabilities. FDI attacks compromise the control of the synchronous generator (SG in Fig. 2), and the load switching attacks compromise the switching of one or more of the microgrid’s loads. RL training employs a simplified LFC model of the microgrid, and is verified using time-domain simulations on a detailed model built using MATLAB Simscape. MATLAB and Simulink were employed to generate the results.

5.1 False data injection

Recall that we employ a DDPG RL agent to inject false data into any of the inputs $\{p_{1},p_{2},p_{3}\}$ of a single-area system while observing the frequency and its rate of change. The following results are specific to corrupting the frequency measurement to the control center ($p_{1}$), but are generally applicable to the other inputs. Following experimentation, we made the following decisions in the design of the RL agent:

• We bound the RL action space, representing the injected frequency bias, to $[-0.1,0.1]$ pu. The large action space makes it easier and faster for the RL agent to learn successful attack strategies and to generate a larger variety of attacks; smaller bounds lead to longer convergence times during training. After training, the RL action space can be scaled down to smaller, more practical bounds to destabilize vulnerable power systems. For example, the RL agent’s actions in the detailed-model time-domain simulations (Fig. 4) are restricted to the range $[-3.5,2]/60$ pu to match the frequency range ($56.5$ to $62$ Hz) in which the generator regulates the system frequency (refer to Table 1).

• The large action space allows the agent to quickly discover simple bias attacks that trigger UF or OF protection. We employ reward function (17) to encourage the agent to discover more complex attacks. The reward function is illustrated in Fig. 3. The safe set $\mathcal{S}$, which the agent attempts to force the system to exit, is inside the area depicted by the red square. Function (17) rewards the agent for increasing the rate of change of frequency towards and beyond the ROCOF relay setting while keeping the frequency deviation small. Additionally, the agent attains a high reward of $+20$ when the ROCOF relay trips and a high penalty of $-20$ when either the UF or OF relay trips. Without the penalization, the agent continues to prefer simple actions that trigger UF or OF protection. Our experiments show that following the above guidelines generally facilitates RL training.

• We limit each episode to 15 seconds to encourage the agent to destabilize the system quickly, and end the episode when the agent succeeds in triggering protection.

R_{i}=\left(\frac{\hat{\dot{\omega}}}{0.05}\right)^{2}\cdot\max\left(0,1-\left(\frac{\hat{\omega}}{0.0\dot{3}}\right)^{2}\right)+20\{\hat{\dot{\omega}}\notin\mathcal{S}\}-20\{\hat{\omega}\notin\mathcal{S}\}

(17)
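Reward function (17) can be sketched in Python as below. The sketch assumes that the observations are per-unit deviations from nominal on a 60 Hz base, so that the Table 1 thresholds become $-3.5/60$ pu (UF), $2/60$ pu (OF), and $3/60=0.05$ pu/s (ROCOF), and that $0.0\dot{3}$ denotes the repeating decimal $2/60$. The function name `fdi_reward` and its argument names are hypothetical.

```python
# Relay thresholds as per-unit deviations (assumption: 60 Hz base, Table 1 settings).
UF_PU = -3.5 / 60
OF_PU = 2.0 / 60
ROCOF_PU = 3.0 / 60  # equals 0.05 pu/s


def fdi_reward(freq_dev, rocof):
    """Reward (17): favor a large ROCOF while the frequency deviation stays small."""
    shaping = (rocof / ROCOF_PU) ** 2 * max(0.0, 1.0 - (freq_dev / OF_PU) ** 2)
    rocof_trip = abs(rocof) > ROCOF_PU                # ROCOF relay trips: +20
    freq_trip = freq_dev < UF_PU or freq_dev > OF_PU  # UF or OF relay trips: -20
    return shaping + 20.0 * rocof_trip - 20.0 * freq_trip
```

With these assumptions, a rate of change of 0.06 pu/s at zero frequency deviation yields $1.2^{2}+20=21.44$, whereas a frequency deviation of 0.04 pu with no rate of change yields $-20$.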

The agent generates an oscillatory frequency bias to excite the mechanical eigenmode of the microgrid, leading to generation tripping in vulnerable microgrids. Fig. 4 shows (detailed-model) time-domain simulations of two vulnerable microgrid testbeds, each with a different eigenmode frequency. The top plot in each column depicts the injected attack signal, with the frequency and rate of change of frequency deviation plot below it. The horizontal red dashed lines indicate the ROCOF protection relay bounds, which trigger the corresponding relay function and cause generation tripping when exceeded. For demonstration purposes, the relay triggering is suppressed to continue to demonstrate the RL agent’s attack strategy. The agent successfully triggers the rate of change of frequency relay function in the systems.

Fig. 4 demonstrates RL agents’ adaptability. Being data-driven, they can easily adjust their attacks to destabilize different systems. The agent utilizes easily available system frequency measurements to tailor the frequency of its injected attack signal to the mechanical eigenmode of the system.

5.2 Load switching

The RL agent learns to execute load switching attacks by manipulating the system load through $p_{4}$ , while monitoring the frequency and its rate of change. We use reward function (18) to incentivize the agent to increase the frequency or rate of change of frequency deviations, with high rewards of $+20$ earned when any of the UF, OF, or ROCOF relays trip. The reward function is illustrated in Fig. 5. The change in the reward function (from (17)) is attributed to the difficulty of tripping UF/OF protection with switching attacks.

R_{i}=\left(\frac{\hat{\dot{\omega}}}{0.05}\right)^{2}+\left(\frac{\hat{\omega}}{0.08\dot{3}}\right)^{2}+20\{(\hat{\omega},\hat{\dot{\omega}})\notin\mathcal{S}\}

(18)
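Reward function (18) admits a similar sketch. As before, the observations are assumed to be per-unit deviations on a 60 Hz base with the Table 1 thresholds, and $0.08\dot{3}$ is read as the repeating decimal $5/60$; the function name `switching_reward` is hypothetical.

```python
# Relay thresholds as per-unit deviations (assumption: 60 Hz base, Table 1 settings).
UF_PU = -3.5 / 60
OF_PU = 2.0 / 60
ROCOF_PU = 3.0 / 60  # equals 0.05 pu/s


def switching_reward(freq_dev, rocof):
    """Reward (18): favor large frequency or ROCOF deviations; +20 on any relay trip."""
    shaping = (rocof / ROCOF_PU) ** 2 + (freq_dev / (5.0 / 60)) ** 2
    tripped = abs(rocof) > ROCOF_PU or freq_dev < UF_PU or freq_dev > OF_PU
    return shaping + 20.0 * tripped
```

Unlike the FDI reward, no trip is penalized here, so the agent is free to pursue whichever relay function is easiest to trigger.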

Fig. 6 shows (detailed-model) time-domain simulations of a fixed load switching attack, wherein 464 kW (0.18 pu) of MG1’s load is switched on (1) and off (0), thereby exciting the mechanical eigenmode and eventually tripping the generation. We present examples of aggregate load switching attacks later in Fig. 8.

5.3 Supervised attack detection

In this section, the RL agent is trained to generate a large attack dataset to train the supervised-learning attack detector. To generate the data, the agent is trained with more emphasis on increasing exploration and delaying learning convergence. Figs. 7 and 8 showcase a variety of frequency measurement corruption and load switching attacks, respectively, collected during RL training that destabilized the LFC model. Although these attacks are generally not optimal (e.g., in terms of time-to-failure), they are still able to trigger protection relays and therefore warrant attention.

We gather 4000 records in the dataset, equally distributed among the 4 categories outlined in Section 4.4, and allocate 15% and 30% of the data for validation and testing, respectively. The detector’s performance on the test data is presented in the confusion matrix in Fig. 9, with an accuracy of 98.4%.
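The split sizes implied by these percentages can be checked with a short sketch. With 4000 records, 15% validation and 30% testing leave 55% for training, i.e. 2200, 600, and 1200 records. The per-class (stratified) shuffling below is an assumption made for illustration; the text does not state how the split was drawn, and the function name is hypothetical.

```python
import numpy as np


def stratified_split(labels, val_frac=0.15, test_frac=0.30, seed=0):
    """Split record indices per class (assumed stratified; see lead-in)."""
    rng = np.random.default_rng(seed)
    train, val, test = [], [], []
    for c in np.unique(labels):
        idx = rng.permutation(np.flatnonzero(labels == c))
        n_val = int(round(val_frac * len(idx)))
        n_test = int(round(test_frac * len(idx)))
        val.extend(idx[:n_val])
        test.extend(idx[n_val:n_val + n_test])
        train.extend(idx[n_val + n_test:])
    return np.array(train), np.array(val), np.array(test)


# 4000 records, equally distributed among 4 categories.
labels = np.repeat(np.arange(4), 1000)
train_idx, val_idx, test_idx = stratified_split(labels)

if __name__ == "__main__":
    print(len(train_idx), len(val_idx), len(test_idx))  # 2200 600 1200
```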

5.4 Unsupervised anomaly detection

Fig. 10 demonstrates the reconstruction of the trained autoencoder. The left plot shows a portion of the control signal and frequency measurement collected during normal operation. The right plot shows the autoencoder’s reconstruction of the data. By learning the patterns of data during normal operation, the autoencoder is able to reconstruct the data with low error. Fig. 11 is a histogram of the autoencoder’s reconstruction mean absolute error on the validation data. For comparison purposes, the test and validation sets are the same as those used to compute the supervised detector’s accuracy. Based on the histogram, we select an error threshold of $0.2$ to separate normal from anomalous behavior, which yields $100\%$ accuracy in classifying between normal and attack behavior.
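The thresholding step can be sketched as follows, assuming each record is an array of shape (time steps, features) and that the autoencoder’s reconstruction is already available. The function names and the synthetic arrays are hypothetical stand-ins; the value 0.2 is the threshold chosen above and is only meaningful for data scaled as in the paper.

```python
import numpy as np


def reconstruction_mae(x, x_hat):
    """Mean absolute error per record; inputs have shape (records, steps, features)."""
    return np.mean(np.abs(x - x_hat), axis=(1, 2))


def flag_anomalies(x, x_hat, threshold=0.2):
    """True where the reconstruction error exceeds the threshold."""
    return reconstruction_mae(x, x_hat) > threshold


# Synthetic stand-ins (NOT the paper's data): a well-reconstructed batch and
# a poorly reconstructed one, mimicking normal and attack records.
rng = np.random.default_rng(0)
x_normal = rng.normal(size=(5, 100, 2))
recon_normal = x_normal + 0.01 * rng.normal(size=x_normal.shape)
x_attack = rng.normal(size=(5, 100, 2))
recon_attack = x_attack + 0.5

flags_normal = flag_anomalies(x_normal, recon_normal)
flags_attack = flag_anomalies(x_attack, recon_attack)

if __name__ == "__main__":
    print(flags_normal, flags_attack)
```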

When comparing the unsupervised anomaly detector’s accuracy to that of the supervised attack detector in classifying normal and anomalous (comprising successful and unsuccessful attacks) operation, the detectors are comparable, at 100% and 99.8%, respectively. If, however, we compare their accuracy in distinguishing behavior that precedes relay triggering (successful attacks) from behavior that does not (normal operation and unsuccessful attacks), the anomaly detector’s accuracy is 74.5%, compared to 98.6% for the supervised detector. This echoes a major drawback of unsupervised methods, which is their high false alarm rate for safe anomalous events that are difficult to exhaustively include in their training data, even when these events do not have any impact on the system. We discuss the consequent implications and opportunities further in the discussion.

5.5 Discussion

5.5.1 RL suitability

We demonstrated that RL can be used to compromise LFC. Offensively, an attacker can execute the RL actions in a man-in-the-middle attack or package the RL agent as malicious software. The RL agent offers a simple, fast, flexible, and adaptive approach to cyber offense, modifying its actions to attack different systems without the need for prior reconnaissance to collect system information or exact models of the targeted systems. This highlights the urgency of using RL defensively, as suggested in this paper, to proactively identify and collect attack strategies before system vulnerabilities are exploited.

5.5.2 Attack model assumptions

The attacker’s knowledge, resources, and limitations are specified as part of developing the RL problem. For example, in the presented attacks, we specify that the attacker has no prior knowledge, their disclosure resources enable them to observe the grid frequency, and their disruption resources enable them to compromise specific communication channels (outlined in Section 3). Their limitations are formulated in the constraints on the action space. If the learning converges, the specified attack and threat models are sufficient to execute the attack.

This can enable verification of assumptions on the attack/threat model. For example, while presenting switching attacks, the authors in [32] assume that the attacker needs prior knowledge of the inter-area eigenmodes to execute the attack. In a preceding reconnaissance attack, the authors suggest that the attacker can impose faults at the compromised load (e.g., by employing a built-in de-energization circuit breaker). The attacker records transient line measurements following the fault. Next, the attacker performs spectral independent component analysis (ICA) on the recorded data to reveal the inter-area modes. As evident from the results in Figs. 4 and 6, a pre-trained RL agent does not necessarily need this knowledge; it only needs to observe the frequency during attack execution, which is an alarming indication that the attack can be executed with relative ease.

5.5.3 System vulnerability testing

We demonstrated that our method can be used to synthesize multiple attacks against a system during the RL training process. Practically, this yields a system vulnerability testing application for our research. For example, the RL training can be performed on an offline system model. Simulations can reveal vulnerabilities that need to be patched before deployment. Additionally, preceding every system change or upgrade, RL training can reveal vulnerabilities before cyberattackers capitalize on them.

Our method can also be used to validate defense strategies. After a vulnerability is identified in training, defense methods including upgrading control algorithms (physical), upgrading code security (computational), or adding channel redundancy (communication-based) can be designed and incorporated into the offline system model. If the defense method passes the previously successful logged attacks and further RL training without any system failures, then the defense can be deployed to enhance system security.

We also demonstrated that a single RL agent can execute successful attacks against different systems. This provides an opportunity to collect an ‘arsenal’ of RL agents and provide them for system owner-operators to automatically test vulnerabilities. After modelling their system, the owner-operator can retrieve the RL agents from repositories and have each RL agent check for a specific system vulnerability. In this way, our method can transform cyber-physical vulnerability scanning to an approach that is similar to current cyber malware scanning.

5.5.4 Attack optimality

If we quantify attack optimality by (shortest) time-to-failure or (maximum) deviation of frequency (or rate of change of frequency), then the RL method only provides a sub-optimal attack policy. The top plot of Fig. 12 shows an optimal FDI attack with a frequency matching the oscillatory eigenmode of the system. The bottom plot shows the RL-generated attack signal following learning convergence. Although the two signals are very similar, the optimal attack produces a larger rate of change of frequency deviation and triggers protection faster. Both attacks still successfully trigger protection.

We provide the following explanation for why the RL method might not yield an optimal policy. Once the RL agent successfully triggers a protection relay function, the episode stops. The largest reward is achieved by triggering the rate of change of frequency function. Policy improvement occurs over episodes as the agent seeks the small rewards gained by inducing higher frequency and rate of change of frequency deviations (refer to (17)). Hence, it is likely that the training will stop at a sub-optimal stage. More episodes will be needed to reach optimality.

5.5.5 Integrated attack detection

False triggers due to rare normal events are a significant shortcoming of unsupervised anomaly detection methods. In our study, the unsuccessful attacks category represents events that are not followed by any loss of generation due to false relay triggering. The supervised attack detector successfully classified 94% of these instances. The anomaly detector’s training, however, does not enable it to recognize these events. Consequently, the supervised attack detector is better equipped to alert only to malicious cyberattacks, while suppressing unnecessary responses to safe anomalous events. In practice, an integrated approach can be used to detect attacks, capitalizing on the anomaly detector’s strength in detecting normal behavior and the attack detector’s strength in detecting attacks that trigger protection. Fig. 13 demonstrates the integrated approach. Events that are classified as normal by the anomaly detector are accepted, while those that are not are evaluated by the attack detector to determine if they will result in relay triggering and require corrective action. The attack detection accuracy of this approach is 98.6%.
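The two-stage decision described above can be sketched as a small function. Both detectors are passed in as callables; the toy stand-ins at the bottom are hypothetical placeholders for the trained autoencoder and LSTM classifier, and the returned labels are illustrative names rather than the paper’s categories.

```python
from typing import Callable, Sequence


def integrated_detect(
    record: Sequence[float],
    reconstruction_error: Callable[[Sequence[float]], float],
    predicts_relay_trip: Callable[[Sequence[float]], bool],
    threshold: float = 0.2,
) -> str:
    """Stage 1: anomaly detector accepts normal events.
    Stage 2: attack detector decides whether an anomaly needs action."""
    if reconstruction_error(record) <= threshold:
        return "normal"
    if predicts_relay_trip(record):
        return "attack"          # expected to trigger a relay: corrective action
    return "benign anomaly"      # anomalous but not expected to trip protection


# Toy stand-ins (hypothetical, for illustration only).
def toy_error(record):
    return sum(abs(v) for v in record) / len(record)


def toy_trip(record):
    return max(abs(v) for v in record) > 1.0


if __name__ == "__main__":
    print(integrated_detect([0.0, 0.1, -0.1], toy_error, toy_trip))
    print(integrated_detect([0.5, -0.6, 0.4], toy_error, toy_trip))
    print(integrated_detect([0.9, -1.5, 1.2], toy_error, toy_trip))
```

Running the attack detector only on events the anomaly detector rejects keeps normal traffic on the cheap path and reserves the classifier for the cases where a false alarm would be costly.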

5.5.6 Limitations and challenges

RL training is time-consuming and computationally intensive, and this cost only increases for more complex environments. In this paper, we used linearized models to speed up training. Note that the use of linearized models is widely accepted, given that grid cyber-physical security studies utilize simplified models for ease of mathematical formulation. The RL agents still successfully destabilized the detailed microgrid models.

On another note, as training progresses and learning converges, the agent takes more exploitative actions, and the variation between the generated attacks becomes smaller. The distinct attacks that are generated early in training, when the agent is explorative, provide important insight and data samples for the attack detectors’ training. As such, it is beneficial to prolong agent exploration, which unfortunately increases the time to learning convergence and the number of episodes.

The RL training does not produce a comprehensive dataset of all attacks that cause system failure. However, with the dataset generated during RL training, security engineers can develop valuable defense strategies.

6 Conclusion

Proactively identifying grid vulnerabilities and attack strategies is critical to anticipate attacks and patch grid weaknesses before they are exploited. We develop deep (DDPG) RL agents to execute false data injection and load switching attacks against LFC. The RL-generated attacks directly induce protection relay tripping and generation loss, which can subsequently lead to grid instability and blackout. The process of training the RL agent provides valuable insight into attacker resources and strategies, including specifying attack and threat models and generating attack datasets. The attack datasets can be used defensively to inform, evaluate and develop defense strategies. We develop an LSTM-based supervised-learning model to classify and detect attacks and compare it with state-of-the-art autoencoder-based anomaly detection. The supervised attack detector achieves comparable accuracy (99.8%) when classifying normal and anomalous operation. While anomaly detection is not equipped to identify anomalous events that do not induce relay triggering (such as normal rare events), the supervised attack detector classifies these events with high accuracy (98.6%). We propose an integrated attack detector that capitalizes on the strengths of anomaly detection and supervised attack detection to improve attack detection accuracy while reducing false detection.

The state matrices in (1) are as follows:

$\displaystyle\begin{aligned}\bm{\dot{x}}={}&\begin{bmatrix}0&0&0&-(kB)&0&0\\ 1/\tau_{G}&-1/\tau_{G}&0&-d/(\tau_{G})&0&0\\ 0&1/\tau_{T}&-1/\tau_{T}&0&0&0\\ 0&0&1/M&-D/M&0&0\\ 0&0&0&1/\tau_{\omega}&-1/\tau_{\omega}&0\\ 0&0&1/(M\tau_{\nu})&-D/(M\tau_{\nu})&0&-1/\tau_{\nu}\end{bmatrix}\bm{x}\\ &+\begin{bmatrix}0&0\\ 0&-k\\ 0&0\\ -1/M&0\\ 0&0\\ -1/(M\tau_{\nu})&0\end{bmatrix}\bm{u}+\begin{bmatrix}-(kB)&k&-k&0\\ 0&0&0&0\\ 0&0&0&0\\ 0&0&0&-1/M\\ 0&0&0&0\\ 0&0&0&-1/(M\tau_{\nu})\end{bmatrix}\bm{p}\end{aligned}$		(19)

Table 2: System Data

Parameter	Symbol	MG1	MG2	MG3
AGC gain	$k$	3	10	12
Droop gain	$d$	40	50	50
Governor time-constant	$\tau_{G}$	0.08	0.08	0.1
Turbine time-constant	$\tau_{T}$	0.45	0.45	0.45
Generator inertia	$M$	6	6	8
Damping constant	$D$	0.03	0.03	0.03

Frequency sensors time-constants	$\tau_{\omega}=\tau_{\nu}=0.1$
Control center frequency measurement gain	$B=1$
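As a quick numerical check of the data above, the following sketch assembles the state matrix of (19) for each microgrid in Table 2 and reports its least-damped oscillatory eigenvalue. Identifying that eigenvalue pair with the eigenmode targeted by the attacks is an interpretation made here, the helper names are hypothetical, and frequencies are reported in rad per model time unit (assumed to be seconds).

```python
import numpy as np

# Table 2 values; sensor time constants and gain B are shared by all systems.
SYSTEMS = {
    "MG1": dict(k=3.0, d=40.0, tau_G=0.08, tau_T=0.45, M=6.0, D=0.03),
    "MG2": dict(k=10.0, d=50.0, tau_G=0.08, tau_T=0.45, M=6.0, D=0.03),
    "MG3": dict(k=12.0, d=50.0, tau_G=0.1, tau_T=0.45, M=8.0, D=0.03),
}
TAU_W = TAU_V = 0.1
B_GAIN = 1.0


def state_matrix(k, d, tau_G, tau_T, M, D):
    """State matrix of (19), entered row by row."""
    return np.array([
        [0.0, 0.0, 0.0, -k * B_GAIN, 0.0, 0.0],
        [1 / tau_G, -1 / tau_G, 0.0, -d / tau_G, 0.0, 0.0],
        [0.0, 1 / tau_T, -1 / tau_T, 0.0, 0.0, 0.0],
        [0.0, 0.0, 1 / M, -D / M, 0.0, 0.0],
        [0.0, 0.0, 0.0, 1 / TAU_W, -1 / TAU_W, 0.0],
        [0.0, 0.0, 1 / (M * TAU_V), -D / (M * TAU_V), 0.0, -1 / TAU_V],
    ])


def oscillatory_mode(A):
    """Least-damped eigenvalue with a positive imaginary part."""
    eigvals = np.linalg.eigvals(A)
    candidates = eigvals[eigvals.imag > 1e-3]
    return candidates[np.argmax(candidates.real)]


modes = {name: oscillatory_mode(state_matrix(**p)) for name, p in SYSTEMS.items()}

if __name__ == "__main__":
    for name, lam in modes.items():
        print(f"{name}: {lam.real:.3f} +/- j{lam.imag:.3f}")
```

Each system should show a single lightly damped pair at a different frequency, in line with the observation that an attack signal has to be tuned to the system it targets.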

Table 3: DDPG neural network architectures

Actor network
Layer	# of units	Hyperparameters
Input	2 $(\Delta\hat{\omega},\hat{\dot{\omega}})$	$M=128$
Normalization	2	$\alpha_{\theta}=10^{-4},\alpha_{\phi}=10^{-3}$
Fully-connected	100	$\gamma=0.99$
ReLU		$\tau=10^{-3}$
Fully-connected	50	$N\sim\mathcal{N}(0,0.3)$
ReLU
Tanh (or Sigmoid)
Scaling	1
Output	1 ( $A$ )
Critic network
Layer	# of units	Layer	# of units
Input	2 $(\Delta\hat{\omega},\hat{\dot{\omega}})$	Input	1 ( $A$ )
Normalization	2	Normalization	1
Fully-connected	100	Fully-connected	50
ReLU
Fully-connected	50
Addition	50	$\swarrow$
ReLU
Fully-connected	1
Output	1 $Q(\Delta\hat{\omega},\hat{\dot{\omega}},A)$

Table 4: Detectors neural network architecture

Supervised		Unsupervised
Layer	# of units	Layer	# of units
Sequence input	( $\Delta e,\Delta\hat{\omega}$ )	Sequence input	( $\Delta e,\Delta\hat{\omega}$ )
LSTM	75	BiLSTM (w/ normalization)	36
Dropout ( $10\%$ )		ReLU
LSTM	50	BiLSTM (w/ normalization)	8
Dropout ( $20\%$ )		ReLU
LSTM	35	BiLSTM (w/ normalization)	36
Dropout ( $10\%$ )		ReLU
Fully-connected	4	Sequence output
Softmax	4 ( $c\in\mathcal{C}$ )

References

[1] National Energy Technology Laboratory, “The NETL modern grid initiative powering our 21st-century economy: Barriers to achieving the modern grid,” 2007.
[2] A. Lee and T. Brewer, “Guidelines for smart grid cyber security: Vol. 1, smart grid cyber security strategy, architecture, and high-level requirements,” NISTIR, vol. 7628, p. 14, 2010.
[3] T. Baumeister, “Literature review on smart grid cyber security,” Collaborative Software Development Laboratory at the University of Hawaii, vol. 650, 2010.
[4] J. E. Sullivan and D. Kamensky, “How cyber-attacks in Ukraine show the vulnerability of the US power grid,” The Electricity Journal, vol. 30, no. 3, pp. 30–35, 2017.
[5] Z.-H. Yu and W.-L. Chin, “Blind false data injection attack using PCA approximation method in smart grid,” IEEE Transactions on Smart Grid, vol. 6, no. 3, pp. 1219–1226, 2015.
[6] Y. Zhang, J. Wang, and B. Chen, “Detecting false data injection attacks in smart grids: A semi-supervised deep learning approach,” IEEE Transactions on Smart Grid, vol. 12, no. 1, pp. 623–634, 2020.
[7] S. Lakshminarayana, A. Kammoun, M. Debbah, and H. V. Poor, “Data-driven false data injection attacks against power grids: A random matrix approach,” IEEE Transactions on Smart Grid, vol. 12, no. 1, pp. 635–646, 2020.
[8] Q. Zhang, F. Li, H. Cui, R. Bo, and L. Ren, “Market-level defense against FDIA and a new LMP-disguising attack strategy in real-time market operations,” IEEE Transactions on Power Systems, vol. 36, no. 2, pp. 1419–1431, 2020.
[9] D. Mukherjee, “Data-driven false data injection attack: A low-rank approach,” IEEE Transactions on Smart Grid, vol. 13, no. 3, pp. 2479–2482, 2022.
[10] A. Sayghe, Y. Hu, I. Zografopoulos, X. Liu, R. G. Dutta, Y. Jin, and C. Konstantinou, “Survey of machine learning methods for detecting false data injection attacks in power systems,” IET Smart Grid, vol. 3, no. 5, pp. 581–595, 2020.
[11] M. Ozay, I. Esnaola, F. T. Y. Vural, S. R. Kulkarni, and H. V. Poor, “Machine learning methods for attack detection in the smart grid,” IEEE Transactions on Neural Networks and Learning Systems, vol. 27, no. 8, pp. 1773–1786, 2015.
[12] Y. He, G. J. Mendis, and J. Wei, “Real-time detection of false data injection attacks in smart grid: A deep learning-based intelligent mechanism,” IEEE Transactions on Smart Grid, vol. 8, no. 5, pp. 2505–2516, 2017.
[13] J. James, Y. Hou, and V. O. Li, “Online false data injection attack detection with wavelet transform and deep neural networks,” IEEE Transactions on Industrial Informatics, vol. 14, no. 7, pp. 3271–3280, 2018.
[14] C. Chen, K. Zhang, K. Yuan, L. Zhu, and M. Qian, “Novel detection scheme design considering cyber attacks on load frequency control,” IEEE Transactions on Industrial Informatics, vol. 14, no. 5, pp. 1932–1941, 2017.
[15] A. Abbaspour, A. Sargolzaei, P. Forouzannezhad, K. K. Yen, and A. I. Sarwat, “Resilient control design for load frequency control system under false data injection attacks,” IEEE Transactions on Industrial Electronics, vol. 67, no. 9, pp. 7951–7962, 2019.
[16] Z. Liu, Q. Wang, Y. Ye, and Y. Tang, “A GAN based data injection attack method on data-driven strategies in power systems,” IEEE Transactions on Smart Grid, 2022.
[17] P. M. Esfahani, M. Vrakopoulou, K. Margellos, J. Lygeros, and G. Andersson, “Cyber attack in a two-area power system: Impact identification using reachability,” in Proceedings of the 2010 American control conference. IEEE, 2010, pp. 962–967.
[18] J. Yan, H. He, X. Zhong, and Y. Tang, “Q-learning-based vulnerability analysis of smart grid against sequential topology attacks,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 1, pp. 200–210, 2016.
[19] Z. Ni, S. Paul, X. Zhong, and Q. Wei, “A reinforcement learning approach for sequential decision-making process of attacks in smart grid,” in 2017 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE, 2017, pp. 1–8.
[20] Z. Wang, H. He, Z. Wan, and Y. Sun, “Coordinated topology attacks in smart grid using deep reinforcement learning,” IEEE Transactions on Industrial Informatics, vol. 17, no. 2, pp. 1407–1415, 2020.
[21] S. Paul and Z. Ni, “A study of linear programming and reinforcement learning for one-shot game in smart grid security,” in 2018 International Joint Conference on Neural Networks (IJCNN). IEEE, 2018, pp. 1–8.
[22] Z. Ni and S. Paul, “A multistage game in smart grid security: A reinforcement learning solution,” IEEE Transactions on Neural Networks and Learning Systems, vol. 30, no. 9, pp. 2684–2695, 2019.
[23] Y. Chen, S. Huang, F. Liu, Z. Wang, and X. Sun, “Evaluation of reinforcement learning-based false data injection attack to automatic voltage control,” IEEE Transactions on Smart Grid, vol. 10, no. 2, pp. 2158–2169, 2018.
[24] A. J. Abianeh, Y. Wan, F. Ferdowsi, N. Mijatovic, and T. Dragicevic, “Vulnerability identification and remediation of FDI attacks in islanded DC microgrids using multi-agent reinforcement learning,” IEEE Transactions on Power Electronics, 2021.
[25] A. M. Mohan, N. Meskin, and H. Mehrjerdi, “A comprehensive review of the cyber-attacks and cyber-security on load frequency control of power systems,” Energies, vol. 13, no. 15, p. 3860, 2020.
[26] Y. Wu, Z. Wei, J. Weng, X. Li, and R. H. Deng, “Resonance attacks on load frequency control of smart grids,” IEEE Transactions on Smart Grid, vol. 9, no. 5, pp. 4490–4502, 2017.
[27] P. Kundur, “Power system stability,” Power system stability and control, vol. 10, 2007.
[28] T. Flick and J. Morehouse, Securing the smart grid: next generation power grid security. Elsevier, 2010.
[29] C. Glenn, D. Sterbentz, and A. Wright, “Cyber threat and vulnerability analysis of the US electric sector,” Idaho National Lab. (INL), Idaho Falls, ID (United States), Tech. Rep., 2016.
[30] C. K. Veitch, J. M. Henry, B. T. Richardson, and D. H. Hart, “Microgrid cyber security reference architecture.” Sandia National Lab.(SNL-NM), Albuquerque, NM (United States), Tech. Rep., 2013.
[31] E. Hammad, A. M. Khalil, A. Farraj, D. Kundur, and R. Iravani, “Tuning out of phase: Resonance attacks,” in 2015 IEEE International Conference on Smart Grid Communications (SmartGridComm). IEEE, 2015, pp. 491–496.
[32] ——, “A class of switching exploits based on inter-area oscillations,” IEEE Transactions on Smart Grid, vol. 9, no. 5, pp. 4659–4668, 2017.
[33] M. E. Kabir, M. Ghafouri, B. Moussa, and C. Assi, “A two-stage protection method for detection and mitigation of coordinated EVSE switching attacks,” IEEE Transactions on Smart Grid, vol. 12, no. 5, pp. 4377–4388, 2021.
[34] A. S. Mohamed, M. F. M. Arani, A. A. Jahromi, and D. Kundur, “False data injection attacks against synchronization systems in microgrids,” IEEE Transactions on Smart Grid, vol. 12, no. 5, pp. 4471–4483, 2021.
[35] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction. MIT press, 2018.
[36] M. Khalaf, A. Youssef, and E. El-Saadany, “Joint detection and mitigation of false data injection attacks in AGC systems,” IEEE Transactions on Smart Grid, vol. 10, no. 5, pp. 4985–4995, 2018.
[37] “IEEE standard for interconnection and interoperability of distributed energy resources with associated electric power systems interfaces,” pp. 1–227, April 2018.
[38] J. Zhang, L. Pan, Q.-L. Han, C. Chen, S. Wen, and Y. Xiang, “Deep learning based attack detection for cyber-physical system cybersecurity: A survey,” IEEE/CAA Journal of Automatica Sinica, vol. 9, no. 3, pp. 377–391, 2021.
[39] F. Milano and R. Zárate-Miñano, “A systematic method to model power systems as stochastic differential algebraic equations,” IEEE Transactions on Power Systems, vol. 28, no. 4, pp. 4537–4544, 2013.
[40] P. Ferraro, E. Crisostomi, M. Raugi, and F. Milano, “Analysis of the impact of microgrid penetration on power system dynamics,” IEEE Transactions on Power Systems, vol. 32, no. 5, pp. 4101–4109, 2016.