
Reinforcement Learning for Feedback-Enabled Cyber Resilience

Yunhan Huang, Linan Huang, Quanyan Zhu
Department of Electrical and Computer Engineering, New York University, 370 Jay Street, Brooklyn, New York, United States, 11201
Abstract

Digitization and remote connectivity have enlarged the attack surface and made cyber systems more vulnerable. As attackers become increasingly sophisticated and resourceful, mere reliance on traditional cyber protection, such as intrusion detection, firewalls, and encryption, is insufficient to secure the cyber systems. Cyber resilience provides a new security paradigm that complements inadequate protection with resilience mechanisms. A Cyber-Resilient Mechanism (CRM) adapts to the known or zero-day threats and uncertainties in real-time and strategically responds to them to maintain critical functions of the cyber systems in the event of successful attacks. Feedback architectures play a pivotal role in enabling the online sensing, reasoning, and actuation process of the CRM. Reinforcement Learning (RL) is an essential tool that epitomizes the feedback architectures for cyber resilience. It allows the CRM to provide sequential responses to attacks with limited or no prior knowledge of the environment and the attacker. In this work, we review the literature on RL for cyber resilience and discuss cyber resilience against three major types of vulnerabilities, i.e., posture-related, information-related, and human-related vulnerabilities. We introduce three application domains of CRMs: moving target defense, defensive cyber deception, and assistive human security technologies. RL algorithms themselves also have vulnerabilities. We explain the three vulnerabilities of RL and present attack models where the attacker targets the information exchanged between the environment and the agent: the rewards, the state observations, and the action commands. We show that the attacker can trick the RL agent into learning a nefarious policy with minimum attacking effort. Lastly, we discuss the future challenges of RL for cyber security and resilience and emerging applications of RL-based CRMs.

keywords:
Reinforcement Learning, Security, Resilience, Feedback Control Systems, Advanced Persistent Threats, Optimal Control Theory, Cyber Vulnerabilities, Moving Target Defense, Cyber Deception, Honeypots, Human Inattention

1 Introduction

Recent attacks such as SolarWinds [1] and the ransomware attacks on the U.S. gas pipelines [2] have shown intensifying concern over cyber threats to industrial and government cyber systems. These unprecedented attacks have created significant disruptions in business and government operations. We have witnessed not only a surge in the number of attacks but also their increasing sophistication. Many attacks can now circumvent traditional methods such as intrusion detection systems, firewalls, and encryption. Advanced Persistent Threats (APTs) [3] are one such class of threats, known for their stealthiness, intelligence, and persistence.

Protection against such attacks becomes increasingly challenging. An attacker has to know only one vulnerability to launch a successful attack. In contrast, a defender needs to prepare for all its vulnerabilities and the associated attacks to secure the system successfully. Among those exploitable vulnerabilities, many are unknown until an attack exploits them (known as zero-day vulnerabilities) or known but not patched in time (known as 1-day or N-day vulnerabilities due to the delay of one or more days between the disclosure time and the attack time). An exploit directed at a zero-day vulnerability is called a zero-day attack, against which it is exceptionally difficult, if not impossible, for the defender to design a protection mechanism. This information disadvantage for the defender has made perfect protection not only difficult to achieve but also cost-prohibitive.

Furthermore, many recent attacks have targeted government agencies and critical infrastructures. The attackers are often supported by a nation-state or state-sponsored group. They are equipped with sophisticated tools, can conduct meticulous research about the system, and afford a prolonged period of persistent disguise. In contrast, security is often a secondary or add-on concern when designing or operating the system. A defender may not have invested sufficient resources in the protection mechanism. The striking disparity in the resources puts the defender at another disadvantage.

Both information and resource disadvantages have made a mere reliance on cyber protection an insufficient way to safeguard our networks. There is a need to shift the focus from cyber protection to a new security paradigm. Cyber resilience offers such a perspective [4, 5]. It accepts the reality that the attacker can be successful in their attacks and complements the imperfect protection with resilience mechanisms. A cyber-resilient system adapts to the known and unknown threats and adversities in real-time and strategically responds to them to maintain the critical functions and performance in the event of successful attacks.

1.1 P2R2 Cyber-Resilient Mechanism

A cyber-resilient mechanism (CRM) can be decomposed into four stages: Preparation, Prevention, Response, and Recovery. We also call it a P2R2 CRM. The first stage of the CRM is preparation. At this stage, we assess the security risk of the cyber system and design appropriate security policies, such as the deployment of honeypots or deception mechanisms [6, 7], the training of employees [8], the proper configuration of detection systems [9], and the preparation of backups and contingencies for recovery plans [10]. Preparation is often done offline and ahead of real-time operations. Good preparation can help facilitate effective prevention and fast response to unanticipated scenarios in later stages.

The second stage is prevention. At this stage, we implement the designed security policies and protect the cyber systems. Some attacks can be easily detected and thwarted in real-time because of the meticulous preparation and design of the security policies. For example, consolidating moving target defense [11] into the communication protocols would make it harder for the attacker to map out the traffic patterns and consequently thwart denial-of-service attacks. However, despite this effort, there is still a probability that an attacker succeeds, especially a highly resourceful and stealthy one.


Figure 1: The four stages of a cyber-resilient mechanism: Preparation, Prevention, Response, and Recovery. An attacker launches a sequence of attacks at $t_1, t_2, t_3, t_4$. The cyber security performance decreases when an attack successfully penetrates the system at $t_2$. The system partially restores its performance at $t_4$.

The third stage of response is critical to defending against attacks when we fail to thwart them at the prevention stage. At this stage, we acquire information about the attacker's footprint and reconfigure the cyber system to minimize the further risk of the attack on the cyber system [12, 13]. The response to attacks is delay-sensitive. The information acquisition and the reaction to the observables need to be fast and completed before the attacker moves on to the next stage of its attack. A fast response mechanism relies on the acquired information and the designs made at the preparation stage. There is a cost in usability and performance when we respond to the attacks. The optimal response mechanism is the result of studying such tradeoffs.

The fourth stage of the CRM is recovery. The goal of the recovery stage is to reduce the spill-over impact of an attack and restore the cyber system performance as much as possible. The response to attacks in real-time often comes at the expense of the performance of the cyber system. There is a need to maintain the system's operation and gradually restore its functionality to normal while reacting to the attacks.

Fig. 1 illustrates the four stages of a typical CRM over a timeline. Before the operation, the cyber system designs a protection mechanism by considering the anticipated and known attacks and the risks of its own system. The cyber system starts to operate at $t_0$. At time $t_1$, an attacker launches attack $a_1$. Because the cyber system has prepared for it, this attack is successfully thwarted, and the system is protected until the attacker launches a new attack $a_2$ at time $t_2$. The cyber system did not prepare for this attack. The attack accomplishes its goal (e.g., lateral movement or penetration of the system) and, as a result, the cyber risk increases or the cybersecurity performance drops. The cyber system needs to respond to the successful attack and the ensuing attack $a_3$ at time $t_3$. After fast learning from the attacker's footprint, the cyber system makes strategic decisions to reconfigure and adapt to the adversarial environment and gradually improves the security posture of the system. The adapted system becomes more robust to attacks. The ensuing attack $a_4$ at time $t_4$ can be thwarted by the adapted protection mechanism. The cyber system restores to its best-effort post-attack performance at $t_4$.

From Fig. 1, one can measure cyber resilience across the four stages by the time it takes from the first successful attack to the recovery, i.e., $T = t_4 - t_2$. This time interval indicates how fast the response is [14, 15]. The worst-case performance degradation over the interval $[t_2, t_4]$ is $M$. The gap between the restored performance and the initial or planned performance, i.e., $D$, shows the effectiveness of the resilience. The natural goal of the CRM is to minimize $T$, $M$, and $D$. It is also evident that these metrics depend not only on the cyber system itself but also on the type of attacks that the adversary launches. An optimal design would require understanding what threats are preventable and what can be mitigated by resilience.
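As a concrete illustration, the minimal sketch below computes the three metrics $T$, $M$, and $D$ from a sampled security-performance curve; the trace, attack and recovery times, and planned performance level are hypothetical assumptions rather than values from Fig. 1.

```python
# A minimal sketch of computing the resilience metrics T, M, and D from a sampled
# security-performance curve; the trace, breach/recovery times, and planned level
# below are hypothetical and not taken from Fig. 1.
import numpy as np

def resilience_metrics(times, performance, t_breach, t_recover, planned_level):
    """Return (T, M, D): recovery time, worst-case degradation, and residual performance gap."""
    incident = (times >= t_breach) & (times <= t_recover)
    T = t_recover - t_breach                                   # how fast the response/recovery is
    M = planned_level - performance[incident].min()            # worst-case drop during the incident
    D = planned_level - performance[times >= t_recover][0]     # gap left after best-effort recovery
    return T, M, D

times = np.linspace(0.0, 10.0, 101)                            # hypothetical timeline
perf = np.where(times < 4.0, 1.0,                              # nominal performance before the breach
                np.minimum(1.0, 0.55 + 0.05 * (times - 4.0)))  # drop at the breach, then gradual restoration
T, M, D = resilience_metrics(times, perf, t_breach=4.0, t_recover=8.0, planned_level=1.0)
print(f"T = {T:.1f}, M = {M:.2f}, D = {D:.2f}")
```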

1.2 Feedback Architecture and Reinforcement Learning

One critical component of the P2R2 CRM for cyber resilience is the response mechanism, which creates a dynamic process that involves information acquisition, online decision-making, and security reconfiguration. These components are strongly interdependent. One way to characterize this interdependence is through a feedback loop, as illustrated in Fig. 2.


Figure 2: Feedback structure that enables cyber resilience. At each iteration, the cyber-resilient mechanism senses the parameters (e.g., cost and state) of the cyber system and makes decisions to act on the cyber system to reduce the cyber risks.

The feedback structure provides a system and control approach to cyber resilience [16]. The cyber system can be viewed as a plant. The CRM collects information from the cyber system through sensing and observation. The acquired information is used to reason about the current situation and decide how the system should react. The CRM acts on the recommended decisions by reconfiguring the security parameters and updating the cyber systems. The reasoning for the optimal cyber response can be viewed as a controller. The security reconfiguration can be interpreted as the actuation of the cyber system. The control perspective provides a rich set of tools and concepts from control theory to design an optimal, robust, and adaptive CRM to improve cyber resilience.

One pivotal control framework to achieve an optimal, robust, and adaptive CRM is reinforcement learning (RL). Similar to classic control approaches, RL leverages the idea of a feedback architecture to guarantee certain aspects of performance, such as optimality and stability. However, RL approaches are distinguished from classic control approaches in several respects.

First, RL allows the CRM policies to adapt to online observations without knowing the underlying dynamics. The CRM can only observe the state $s$ and the cost $c$ from the cyber system at each step and then choose an action $a$ to optimally improve cyber resilience. RL is particularly useful for cyber systems as their precise models are often unavailable, and the influence of the attacker makes it even harder to map out an accurate cyber system model. There are also many unknowns in the attack model and cyber systems that require online learning and adaptation. RL provides an appropriate framework to bridge the gap between control theory and cyber resilience.

Second, RL allows the CRM to have a more generic state space that captures the activities in the cyber system. Classic control oftentimes deals with physical systems such as mechanical systems [17], electrical circuits [18], chemical systems [19], and biological systems [20]. The state space of such systems is mostly continuous and finite-dimensional, represented by $\mathbb{R}^n$ with $n$ being a finite integer. Cyber systems are computer and communication systems integrated with physical systems whose state space is discrete or hybrid (discrete and continuous) in most cases. RL is versatile at dealing with different types of state spaces. This versatility enables the CRM to capture the complex status of cyber systems. Besides, RL equipped with function approximation techniques copes well with large-scale cyber systems that contain a large number of elements or have high-dimensional state spaces [21, 22].

Third, RL requires different stages than classic control approaches before being implemented in real systems. Specifically, the implementation of RL techniques undergoes three phases: the training phase, the testing phase, and the execution phase. In the training phase, the agent interacts with the environment or a simulator to obtain data and improves its policy using the obtained data. In the testing phase, we evaluate the performance of the improved policy. If the policy reaches optimality or a satisfactory level of sub-optimality, the learned policy (or the learned CRM in our case) is put into use, and the implementation enters the execution phase. Classic control approaches go through four phases: the modeling phase, the design phase, the testing phase, and the execution phase. In the modeling phase, modelers with domain knowledge build a mathematical model for the underlying system. In classic control, the training phase is replaced by a design phase where experts with domain expertise design a policy/controller based on the model constructed for the cyber system. The designed policy is then tested and evaluated before being executed in real systems. The training phase consumes an enormous amount of data to constantly improve the learned CRM. Much effort needs to be spent obtaining the data through either simulation or interaction with the environment. Data inefficiency needs to be addressed by combining modern RL techniques with domain expertise from cyber security experts.

1.3 System Vulnerabilities and RL-enabled Defense

The uncertainties in the cyber system are caused not only by the complexity of its configuration but also by the incomplete information of the attack models. An attack model specifies the vulnerabilities an attacker would exploit on the attack surface of the cyber system. The design of the CRM pivots on the type of vulnerabilities that the defender aims to provide resilience for when he cannot succeed in defending against the attackers who exploit them. We categorize the vulnerabilities into three major types. The first one is the class of posture-related vulnerabilities that arise from the resource disadvantage of a defender. The defender is not able to prepare for all vulnerabilities on the attack surface and has to rely on the CRM to overcome this disadvantage. The second one is the class of information-related vulnerabilities that result from the information asymmetry between an attacker and a defender. The attacker has more information about the defender than the defender has about the attacker. The CRM is useful to tilt the information asymmetry and allow the defender to create uncertainties for the attacker while gaining reconnaissance of the attacker. The third type of vulnerability is human-induced. Humans are the weakest link in cyber protection. Social engineering, human errors, and insider threats are common attack vectors that an attacker can use to gain access to cyber systems before further penetrating the systems and reaching the targeted assets. The CRM can monitor human performance and guide humans to reduce the risks of their behaviors.

We use three applications to elaborate on the design of CRMs for the three major types of vulnerabilities. The first application designs moving target defense (MTD) for posture-related vulnerabilities that result from the defender's natural disadvantages. The MTD creates reconnaissance difficulties for attackers by introducing uncertainties into the cyber systems. The authors in [11] explore the capability of RL to change the system configurations so as to create the most uncertainty for the attackers while minimizing the change cost, which leads to a trade-off between security and usability. Both attackers and defenders apply RL to learn the risk and update their policies.

The second application addresses the information-related vulnerabilities that arise from the defender’s information disadvantage. Deception technologies, including honeypots and honeyfiles, have been used to reduce information asymmetry between attackers and defenders. The authors in [23] use RL to design the deception technologies to engage the attacker and obtain threat information. The goal is to extract the most threat intelligence from attacks while minimizing the risks of detection and evasion. The study leads to the security insight that compared to an immediate ejection after detection, it can be more beneficial to engage attackers in the honeypots to obtain threat intelligence.

The third application designs a CRM to mitigate human-related vulnerabilities. Adversaries have exploited human inattention to launch social engineering and phishing attacks against employees and users. Compared to reactive attentional attacks that exploit the existing human attention patterns, proactive attentional attacks can strategically change the attention pattern of a human operator or a network administrator. The authors in [24, 25] have identified a new type of attack called Informational Denial-of-Service (IDoS) attacks that generate a large volume of feint attacks to overload human operators and hide actual attacks among feints. They create an RL-based CRM to learn the optimal alert and attention management strategy.

1.4 Reinforcement Learning in Adversarial Environment and Countermeasures

As we use RL to improve the resilience of the cyber system, an intelligent attacker can also seek to compromise or mislead the RL process so that the cyber system ends up with a worse or vulnerable security configuration. In this case, RL enlarges the attack surface and creates opportunities for the attacker. Hence it is essential to safeguard the RL while using it to secure the cyber system. Section 5 presents several attack models that target RL algorithms and discusses their impact and ways to secure them. RL agents learn a 'good' policy through interacting with the environment and exchanging key information such as the rewards, the state observations, and the control commands. These three exchanged signals, which can be exploited by the attacker, become the vulnerabilities of most RL-enabled systems. To understand RL in an adversarial environment, the first step is to understand the adversarial behaviors of the attackers. To do so, we need to craft the attack models, which specify the attacker's objective, the attacks available to the attacker, and the knowledge the attacker has. The second step is to understand how different types of attacks affect the learning results of RL algorithms, which requires assessing the performance degradation of the RL-enabled systems. With an understanding of the attack models and their impact on the learning results, we can develop countermeasures to mitigate the effect of the attacks and to protect RL-enabled systems from attacks.

We review the current studies on the security of RL and focus on attacks on the three signals, i.e., attacks on the rewards, attacks on the sensors, and attacks on the actuators. We present a sequence of attack models that can achieve a significant impact on RL-enabled systems with only a small attack effort. Given that this branch of research is still in its infancy, we identify several research gaps that researchers from the learning community and the control community can help narrow. One gap is that most works focus on deceptive attacks that falsify the values of the feedback information, whereas few works study DoS/jamming attacks on RL algorithms. Another gap is the study of attacks on actuators and the defensive mechanisms against these attacks.

1.5 Notations and Organization of the Paper

Throughout the paper, we use a calligraphic letter $\mathcal{X}$ to denote a set and a boldface letter $\mathbf{X}$ to denote a vector. The notation $\mathbb{E}$ represents expectation, and $\mathbb{R}_+$ represents the set of positive real numbers. The notations $\mathbb{Z}_{\geq 0}$ and $\mathbb{Z}_{>0}$ represent the sets of non-negative and positive integers, respectively. The indicator function $\mathbf{1}_{\{x=y\}}$ equals one if $x=y$ and zero if $x\neq y$. For two sets $B\subset A$, $A\backslash B$ denotes the set $\{x \mid x\in A, x\notin B\}$. If $\mathcal{A}$ is a finite set, then we let $\Delta\mathcal{A}$ represent the set of probability distributions over $\mathcal{A}$, i.e., $\Delta\mathcal{A}:=\{p:\mathcal{A}\rightarrow\mathbb{R}_+ \mid \sum_{a\in\mathcal{A}}p(a)=1\}$. The main notation of this paper is introduced in Section 2.1. For sections that are dense with parameters, we include separate notation tables to help readers navigate.

The remainder of the paper is organized as follows. Section 2 first introduces a general schema of RL methods to help readers navigate the rich universe of RL algorithms. After that, Section 2 gives an overview of the literature that studies how RL is leveraged to fight against different types of attacks on cyber systems. While Section 2 focuses on the different attacks, Section 3 centers on three major classes of vulnerabilities: the posture-related vulnerabilities, the information-related vulnerabilities, and the human-related vulnerabilities. In Section 3, we discuss how each vulnerability can be addressed by RL. Following the discussion in Section 3, Section 4 delves into three application domains, each representing one class of vulnerabilities, and presents design methodologies for cyber resilience. Compared to the previous sections, which are about RL for cyber security, Section 5 is about the security of RL and introduces potential security threats faced by RL itself. In Section 5, we discuss the vulnerabilities of RL algorithms and provide an overarching review of this emerging research topic. We conclude the paper in Section 6 and discuss several directions for future research.

2 Reinforcement Learning and Its Applications in Cyber Resilience and Defense

The massive scale of Internet-connected or network-connected systems expands the attack surface. The attacker can leverage one vulnerability in such massive systems to penetrate the system and launch successful attacks. Most Internet-connected systems are dynamic in nature due to their movement or changes in the environment. Hence, attack windows can appear from time to time depending on the underlying dynamics, encouraging the attackers to look for opportunities persistently. The massiveness and the dynamics of the system require protection and security mechanisms to be responsive, adaptive, and scalable.

With the recent advances in Artificial Intelligence (AI), both the defense and the attacking sides have started to use AI techniques, especially machine learning (ML) methods, to obtain an edge. Sophisticated attackers leverage ML techniques to locate vulnerabilities, stay stealthy, and maximize the effect of attacks. Defenders employ ML techniques to adaptively prevent and minimize the impacts or damages. Both supervised and unsupervised learning methods have been used widely by the defenders for spam filtering [26], intrusion detection [27, 28], zero-day vulnerability forecasting [29], and malware detection [30]. However, these traditional ML methods cannot provide dynamic and sequential responses against cyber attacks from the dark side with unknown patterns and constantly evolving behaviors.

As a branch of ML, Reinforcement Learning (RL) is a learning paradigm in which the agent learns from its own experience over time through exploring and exploiting an unknown and changing environment. RL provides suitable tools for defenders to take sequential actions optimally with limited or no prior knowledge of the environment and the attacker. RL approaches allow the defender to capture various types of dynamics (e.g., stochastic or deterministic), a wide range of protective and defensive actions, and different kinds of system states, whether in high-dimensional continuous spaces or discrete spaces. RL demonstrates an excellent fit for cyberspace, where cyber attacks become increasingly advanced, swift, and omnipresent [31, 32, 33]. In recent years, the development of deep learning has fueled the successful synthesis of Deep Reinforcement Learning (DRL). The idea of DRL is to utilize a neural network to approximate complicated functions with high-dimensional inputs. The integration of deep learning enhances the conventional RL methods to capture the massive scale of many Internet-connected systems such as mobile networks and IoT systems [34, 35].

2.1 RL Framework and Algorithms

The RL agent learns a near-optimal, if not optimal, policy to protect against, prevent, or recover from attacks by constantly exploring and exploiting the environment in a feedback manner, as shown in Fig. 2. In Fig. 2, $s$ represents the security state of the cyber system, and the cost $c$ is the monetary cost or performance degradation of the system induced by attacks and operations. The RL agent, or the defender, aims to find an action $a$ that optimally modifies the system configuration or other parameters, leading to changes in the security state and the cost. The feedback loop forms an iterative process of agent-environment interactions.

RL is underpinned by a mathematical framework called the Markov Decision Process (MDP). An MDP is denoted by a 5-tuple $\langle\mathcal{S},\mathcal{A},c,\mathcal{P},\beta\rangle$, where $\mathcal{S}$ is the state space containing all possible states of the cyber system. The action space $\mathcal{A}$ denotes the actions available for the defender to protect the cyber system, to recover from the damage caused by attacks, or to mitigate the effect of attacks. The cost $c$ depends on the current and/or the next security state and the current action, and is usually denoted by a function that maps $\mathcal{S}\times\mathcal{A}\times\mathcal{S}\rightarrow\mathbb{R}_+$. The transition kernel $\mathcal{P}$ defines the rule of how the system state evolves based on the actions taken by the defender. The discount factor $\beta$ is a weighting factor that assigns more weight to current costs than future costs. The goal of an MDP problem is to find a policy $\pi_t:\mathcal{S}\rightarrow\mathcal{A}$ that minimizes a certain form of the accumulated cost $\sum_{t=0}^{\infty}\beta^t c_t$ over time.
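For concreteness, the sketch below encodes a toy two-state instance of the 5-tuple; the states, actions, costs, and transition probabilities are illustrative assumptions, and the cost is simplified to depend only on the current state and action.

```python
# A minimal encoding of the 5-tuple <S, A, c, P, beta> for a toy two-state cyber system;
# the states, actions, costs, and transition probabilities are illustrative assumptions.
states = ["secure", "compromised"]
actions = ["monitor", "reconfigure"]
beta = 0.95                               # discount factor

# Cost c(s, a): reconfiguration is costly, but staying compromised is costlier.
cost = {
    ("secure", "monitor"): 0.0, ("secure", "reconfigure"): 1.0,
    ("compromised", "monitor"): 5.0, ("compromised", "reconfigure"): 2.0,
}

# Transition kernel P(s' | s, a): reconfiguring reduces the chance of (remaining) compromise.
P = {
    ("secure", "monitor"): {"secure": 0.90, "compromised": 0.10},
    ("secure", "reconfigure"): {"secure": 0.98, "compromised": 0.02},
    ("compromised", "monitor"): {"secure": 0.10, "compromised": 0.90},
    ("compromised", "reconfigure"): {"secure": 0.70, "compromised": 0.30},
}
```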

MDPs are generic modeling tools that can model the dynamic and feedback nature of various types of cyber systems. Conventional MDP approaches require knowledge of the transition probability of the cyber system and an explicit definition of the reward function to find an appropriate strategy. However, in reality, obtaining such information is prohibitive for two reasons. The first is the complexity of the cyber system and its dynamics. The second is that the attackers are usually on the dark side, whose behavior is unknown to the defender. RL solves the MDP problem without knowledge of the transition kernel $\mathcal{P}$ and the cost function $c$. Instead of utilizing a known transition kernel and cost function, the RL agent learns the optimal policy from sequences of states $\{s_t\}_{t\in\mathbb{Z}}$, costs $\{c_t\}_{t\in\mathbb{Z}}$, and actions $\{a_t\}_{t\in\mathbb{Z}}$, which are obtained by interacting with the environment over time.

Having been actively studied for decades, RL has a rich universe of algorithms that help the agent find a satisfactory policy. Covering every single algorithm is beyond the scope of this paper. Here, we highlight a general schema of RL models shown in Fig. 3 and present a concise taxonomy of RL algorithms.


Figure 3: A general schema of RL algorithms.

2.1.1 Model-Based and Model-Free

One advantage of RL over classic MDP and control methods is its ability to cope with an unknown environment (e.g., the cost function $c$ and the dynamics $\mathcal{P}$). This ability to tackle unknown environments makes RL an appealing tool for cyber resilience. There are two types of methods that tackle the curse of modeling: model-based RL and model-free RL. Here, a 'model' means an ensemble of acquired environmental knowledge. If the model (the cost function $c$ and the transition kernel $\mathcal{P}$) is known, methods from classic MDP theory (e.g., value iteration, policy iteration, linear programming, etc.) can be applied directly [36].

In model-based RL, the agent estimates the cost function as well as the transition kernel using data sequences $\{s_t, c_t, a_t, s_{t+1}\}_{t\in\mathbb{Z}}$. Once the elements of the model are constructed/estimated from samples, the agent can apply classic MDP methods directly to obtain a near-optimal policy. There are occasions where the model is explicitly given and can be accessed directly by the agent. For example, the MCTS (AlphaGo/AlphaZero) algorithm relies on a fully known environment because the rules of the game of Go are exactly known [37]. Nevertheless, in most cases, the model is unknown due to the complexity or opaqueness of the environment. The agent estimates the elements of the environment first and then applies classic MDP methods. Typical examples of this kind include the MBMF algorithm [38], the World Models algorithm [39], the I2A algorithm [40], etc.
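A minimal sketch of this model-based route is given below: the empirical transition kernel and average cost are estimated from logged $(s, a, c, s')$ samples and can then be passed to a classic MDP solver such as value iteration. The sample data and the naming are hypothetical.

```python
# A minimal sketch of the model-based route: estimate the transition kernel and the
# average cost from logged (s, a, c, s') samples, then hand the estimated model to a
# classic MDP solver (e.g., value iteration). The sample data below are hypothetical.
from collections import defaultdict

def estimate_model(samples):
    """samples: iterable of (s, a, c, s_next) tuples collected from the environment."""
    counts = defaultdict(lambda: defaultdict(int))   # (s, a) -> {s_next: visit count}
    cost_sum = defaultdict(float)                    # (s, a) -> accumulated cost
    visits = defaultdict(int)                        # (s, a) -> number of visits
    for s, a, c, s_next in samples:
        counts[(s, a)][s_next] += 1
        cost_sum[(s, a)] += c
        visits[(s, a)] += 1
    P_hat = {sa: {sn: n / visits[sa] for sn, n in nexts.items()}
             for sa, nexts in counts.items()}                    # empirical transition kernel
    c_hat = {sa: cost_sum[sa] / visits[sa] for sa in visits}     # empirical cost function
    return P_hat, c_hat

samples = [("secure", "monitor", 0.0, "secure"),
           ("secure", "monitor", 0.0, "compromised"),
           ("compromised", "reconfigure", 2.0, "secure")]
P_hat, c_hat = estimate_model(samples)
```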

Model-free RL does not estimate the environment parameters but directly seeks the optimal policy using the samples. The idea is to constantly look for the policy that produces lower costs (or, equivalently, higher rewards). An example is the $Q$-learning algorithm, where the agent chooses, with a high probability, the action with the best $Q$-value (the lowest, in the cost-minimization setting of this paper). The $Q$-learning algorithm converges to the optimal $Q$-value function under proper conditions, which produces an optimal policy. The difference between model-based and model-free RL is whether the agent needs to construct/estimate a model of the environment (e.g., the transition kernel and the cost function). In Section 5.1, we will discuss the vulnerabilities of model-based RL and model-free RL under adversarial attacks on the cost samples.

2.1.2 Value-Based and Policy-Based

Model-free RL has two main categories of methods for optimizing the policy: value-based methods and policy-based methods. The value-based methods usually employ an action-value function $Q:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}_+$, which is also referred to as the $Q$-function. The most popular value-based algorithm is $Q$-learning, whose goal is to find the optimal $Q$-values that satisfy the Bellman equation [36]

$$Q(s,a)=c(s,a)+\beta\sum_{s^{\prime}}p(s,s^{\prime},a)\min_{a^{\prime}}Q(s^{\prime},a^{\prime}),\qquad \text{for } s\in\mathcal{S},\ a\in\mathcal{A},$$

where $p(s,s^{\prime},a)$ is the probability that the state of the cyber system at the next step is $s^{\prime}$ given the current state $s$ and the current action $a$. Without knowledge of the transition probability $p(\cdot,\cdot,\cdot)$ and the cost function $c(\cdot,\cdot)$, the RL agent can update its $Q$-values by interacting with the environment:

$$Q_{t+1}(s_{t},a_{t})=Q_{t}(s_{t},a_{t})+\alpha_{t}\left[\beta\min_{a^{\prime}}Q_{t}(s_{t+1},a^{\prime})+c_{t}-Q_{t}(s_{t},a_{t})\right], \qquad (1)$$

where the sequences of states $\{s_t\}_{t\in\mathbb{Z}}$ and costs $\{c_t\}_{t\in\mathbb{Z}}$ come from the environment, and the sequence of actions $\{a_t\}_{t\in\mathbb{Z}}$ is chosen by the agent.
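The update in (1) is compact in code. The sketch below is a minimal tabular implementation under the cost-minimization convention of this paper; the toy states, actions, and step size are illustrative assumptions, not part of any experiment reported here.

```python
# A minimal tabular implementation of the Q-learning update in Eq. (1) under the
# cost-minimization convention; the toy states, actions, and step size are assumptions.
from collections import defaultdict

Q = defaultdict(float)            # Q[(s, a)], implicitly initialized to zero
actions = ["monitor", "reconfigure"]
alpha, beta = 0.1, 0.95           # step size and discount factor

def q_update(s, a, c, s_next):
    """Move Q(s, a) toward the target c + beta * min_a' Q(s', a')."""
    target = c + beta * min(Q[(s_next, ap)] for ap in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

# One sample interaction: the defender monitored, incurred cost 5.0, and the system
# transitioned to the "compromised" state.
q_update("secure", "monitor", 5.0, "compromised")
```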

Policy-based RL focuses on the policy directly. The policy-based methods update the policy iteratively to minimize the accumulated cost over a period of time without explicitly constructing a value function. The adoption of policy-based RL allows the agent to parameterize the policy, which often results in better convergence and is suitable for continuous or high-dimensional action spaces. Frequently used policy-based algorithms include the Policy Gradient (PG) algorithm proposed by Sutton et al. [41], the Trust Region Policy Optimization (TRPO) algorithm [42], etc.

Combining the value-based and the policy-based methods gives rise to the actor-critic class of algorithms. The actor-critic method leverages the value-based methods to build a value function to improve sample efficiency and relies on the policy-based methods to update the policy parameterization. Common actor-critic RL algorithms include the Actor-Critic (AC) algorithm [43] and its variants: the Asynchronous Advantage Actor-Critic (A3C) algorithm [44], the Deterministic Policy Gradient (DPG) algorithm [45], etc.

2.1.3 Function Approximation

With the curse of modeling tamed, the next challenge is the curse of dimensionality. The curse of dimensionality appears when the state space or the action space is continuous or prohibitively large, which is the situation we often encounter in cyber systems. Function approximation is a technique that addresses the curse of dimensionality. A function approximator is selected to approximate the value function or the policy function so that the learning algorithms only have to deal with a few parameters. Common function approximators include linear function approximators [36, 43, 46], multiscale approximators [47], neural networks [33, 44, 48, 43], etc. The development of Deep Neural Networks (DNNs) over the last decade has enabled the application of Deep Reinforcement Learning (DRL). In DRL, the function approximator is a neural network with multiple layers between the input and output layers. A wide variety of DNNs, such as convolutional neural networks, recurrent neural networks [48], and transformers [49], have been developed to handle different application scenarios. One can refer to Section 7 of [48] and Section 9 of [43] for a detailed discussion on the selection of function approximators.
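As a minimal illustration of function approximation, the sketch below parameterizes the $Q$-function linearly as $Q(s,a)\approx w^{\top}\phi(s,a)$ and performs a semi-gradient TD step; the feature map $\phi$ and the toy state/action names are hand-crafted assumptions, and a DRL variant would replace $w$ and $\phi$ with a multi-layer neural network.

```python
# A minimal sketch of linear function approximation: Q(s, a) ~ w . phi(s, a), trained
# with a semi-gradient TD(0) step under the cost-minimization convention. The feature
# map phi and the toy state/action names are illustrative assumptions.
import numpy as np

def phi(s, a):
    """Hypothetical 3-dimensional feature map for a (state, action) pair."""
    return np.array([1.0,                                  # bias feature
                     1.0 if s == "compromised" else 0.0,   # state indicator
                     1.0 if a == "reconfigure" else 0.0])  # action indicator

w = np.zeros(3)            # weight vector parameterizing the approximate Q-function
alpha, beta = 0.05, 0.95   # step size and discount factor

def semi_gradient_update(s, a, c, s_next, actions):
    """One semi-gradient TD step toward the target c + beta * min_a' w.phi(s', a')."""
    global w
    target = c + beta * min(w @ phi(s_next, ap) for ap in actions)
    td_error = target - w @ phi(s, a)
    w += alpha * td_error * phi(s, a)

semi_gradient_update("secure", "monitor", 0.0, "compromised", ["monitor", "reconfigure"])
```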

2.1.4 Other RL Settings

In the training phase, the RL agent chooses a policy that interacts with the environment to generate more samples and constantly improves the policy using the generated samples. The RL agent accumulates knowledge about the cyber system. The agent has to make a trade-off between learning more about the system and pursuing what seems to be the most promising defense strategy with the samples gathered so far. This is called the exploration-exploitation trade-off in choosing how to sample from the environment (or the cyber system in our case). Commonly used sampling strategies include the greedy strategy (i.e., choose the action currently known to be the best), $\epsilon$-greedy exploration (i.e., apply the greedy action with probability $1-\epsilon$ and otherwise select an action uniformly at random), and Thompson sampling [50].
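The sketch below illustrates $\epsilon$-greedy sampling with a decaying exploration rate under the cost-minimization convention; the decay schedule and parameter values are assumptions for illustration.

```python
# A minimal sketch of epsilon-greedy exploration with a decaying exploration rate,
# under the cost-minimization convention; the decay schedule is an assumption.
import random

def epsilon_greedy(Q, s, actions, eps):
    """Explore uniformly with probability eps; otherwise exploit the cost-minimizing action."""
    if random.random() < eps:
        return random.choice(actions)
    return min(actions, key=lambda a: Q.get((s, a), 0.0))

def decayed_eps(t, eps_start=1.0, eps_min=0.05, decay=0.999):
    """Anneal exploration as the defender accumulates knowledge of the cyber system."""
    return max(eps_min, eps_start * decay**t)

a = epsilon_greedy({}, "secure", ["monitor", "reconfigure"], decayed_eps(t=100))
```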

In many RL applications, the agent is equipped with a replay memory that stores the past experience/samples of the agent. The storage of past experience allows the agent to reuse the samples later and process samples in batches to achieve a reasonably stable data distribution [48]. With replay memory, the agent can choose either on-policy or off-policy methods. On-policy methods evaluate and improve the policy using the samples generated by the same policy. In off-policy methods, learning is done using samples in the replay memory that are not necessarily generated under the current policy but under different policies. Off-policy methods allow the defender to leverage data from historical security events to learn a defense policy. A typical example of an on-policy method is SARSA [43], where the agent executes an action from an $\epsilon$-greedy policy and uses the sample generated by the action to update the current policy:

$$Q_{t+1}(s_{t},a_{t})=Q_{t}(s_{t},a_{t})+\alpha_{t}\left[\beta Q_{t}(s_{t+1},a_{t+1})+c_{t}-Q_{t}(s_{t},a_{t})\right]. \qquad (2)$$
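In code, the on-policy update (2) differs from the $Q$-learning update (1) only in the bootstrap target, as the minimal sketch below shows; the default parameter values are assumptions.

```python
# A minimal sketch of the SARSA update in Eq. (2). The only difference from the
# Q-learning update in Eq. (1) is the bootstrap target: SARSA evaluates the action
# a_{t+1} actually executed by the behavior policy rather than the greedy minimum.
def sarsa_update(Q, s, a, c, s_next, a_next, alpha=0.1, beta=0.95):
    target = c + beta * Q.get((s_next, a_next), 0.0)   # on-policy target
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (target - Q.get((s, a), 0.0))
```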

The $Q$-learning algorithm in (1) is a frequently used off-policy method. Even though it adopts the $\epsilon$-greedy policy to generate actions, the update of the $Q$-values follows the greedy policy, as indicated by the min operator in (1). With this basic knowledge of RL methods, we discuss in the following section the literature on how RL can be used to tackle different attacks on cyber systems.

2.2 Review of RL Against Cyber Attacks

The rapid development of Information and Communication Technologies (ICTs) has fueled significant growth in Internet of Things (IoT) usage and Cyber-Physical Systems (CPS) deployment. CPS involves embedded computers and networks that monitor and control physical processes, with feedback loops where physical processes affect computations and vice versa. The IoT refers to a network of physical objects capable of gathering and sharing electronic information, with the physical objects receiving orders from the network and the network receiving information from the physical objects. Both IoT and CPS involve computation units, networking units, and physical entities that form a feedback cycle with human or machine decisions in the loop. Due to the wide adoption of IoT and CPS in many domains, such as manufacturing, power, agriculture, battlefields, and homes, the security threats faced by IoT and CPS need to be well addressed.

First, in a CPS, traditional physical controllers have been replaced by micro-computers and processors with embedded operating systems [51]. Micro-computers provide advantages to the CPS, such as flexible configuration with an accessible human-machine interface and digital communication abilities that enable remote control and access. However, the embedded software opens up doors for software attacks such as code injection attacks, cross-site scripting attacks when using a web server for system configuration, and malware attacks. Second, CPS is networked, meaning that the essential components in the physical systems are often connected to corporate networks and the Internet. Security issues may be raised because complete isolation is difficult to achieve. Connectivity can occur in unexpected ways. For example, Stuxnet, a malicious computer worm accused of stymieing Iran’s nuclear-fuel enrichment program, was introduced to the target environment via an infected USB flash drive [52]. The security challenges of CPS will become more severe as the scale and scope of the Internet and other communication techniques grow.

Similarly, IoT is also facing security challenges amid the increasing adoption of IoT technology. IoT employs advanced computing and communication technologies such as Zigbee, Bluetooth, radio-frequency identification (RFID), and cloud computing. The employment of these technologies provides opportunities for the attacker to launch cyber attacks, such as DoS attacks, eavesdropping, and data injection attacks. Beyond that, unlike CPS, IoT usually encompasses small, low-power physical units that are mobile and geographically distributed. Hence, IoT is more susceptible to physical attacks. Due to the dynamic and feedback nature of CPS and IoT, RL has been applied to address the security and privacy challenges faced by these systems. In this section, we provide a brief review of the applications of RL for the security of CPS and IoT. The review follows a taxonomy based on the attack types and the different RL algorithms.

2.2.1 RL Against DoS Attacks

DoS is the most common attack among all the attacks targeting networked systems such as IoT and CPS [53, 54]. A denial-of-service (DoS) attack is a tactic used by attackers to overload a machine or network and make it unavailable. DoS attacks on networked systems bring serious threats to people and induce direct or indirect financial losses. Since many CPSs rely heavily on real-time communication, the sudden unavailability of information may cause serious problems. For example, suppose a critical chemical reaction process is unstable under open-loop control. In this case, a continuous DoS attack on the actuator may introduce irreversible damage to the system and the people around it. Liu et al. [55] have studied the resilient control problem for CPS under DoS attacks. The DoS attacks target the measurement and control channels to jeopardize the closed-loop system's functionality. The authors designed a resilient control strategy that can achieve a certain degree of stability and performance for the CPS even under DoS attacks. To compute such a resilient control strategy, the authors employed SARSA, an on-policy RL method. Dai et al. [56] have studied the remote state estimation of CPS with multiple sensors. The sensors transmit messages through several communication channels subject to possible DoS attacks. The authors leverage $Q$-learning to compute the communication strategy, i.e., to decide which communication channel to use.

Due to its distributed nature with limited energy, IoT is often targeted by Distributed DoS (DDoS) attacks. A DDoS attack often results in malfunctions in an enormous number of IoT devices within a short time window. Xu et al. [22] have proposed a DDoS detection scheme based on Hidden Markov Models and cooperative RL. The authors leverage DQN to compute optimized strategies for information exchange among the distributed detectors to ensure decent detection accuracy without placing much load on information communications among the detectors. Later, in 2013, Malialis et al. [21] proposed a protective mechanism that defends IoT against DDoS attacks by router throttling. A multi-agent RL framework is applied where the agents representing the routers learn to throttle traffic toward a victim server. However, this method suffers from scalability issues as the scale of the IoT increases. To tackle this problem, the same authors proposed a coordinated design on top of the previous RL-based router throttling method [57]. More recently, Liu et al. have proposed a DQN-based framework to mitigate the effect of DDoS attacks whose traffic flow can increase or slow down [58]. Most papers that leverage RL to mitigate the effect of DoS attacks use it as a computational tool when the behavior model of the attacker is unknown to the agent.

2.2.2 RL Against Spoofing Attacks

A spoofing attack happens when a malicious party impersonates another device or user on a network to launch further attacks, steal data, spread malware, or bypass access control [59]. There are many things the attackers can forge to make their attacks pan out: an IP address, a phone number, a GPS location, etc. Out of all the nefarious scenarios, the following few have become increasingly impactful for cyber security. ARP (Address Resolution Protocol) spoofing is a spoofing attack in which a malicious attacker sends falsified ARP messages over a local area network. This manipulation can redirect all traffic to the attacker's device before it reaches its legitimate destination. Another common type of spoofing attack is the IP spoofing attack, in which the attacker impersonates an IP address and obfuscates the actual online identity of the packet sender. Penetration using spoofing attacks enables the attacker to launch further attacks such as man-in-the-middle or DoS attacks.

Xiao et al. have formulated a zero-sum authentication game to capture the interplay between the legal receiver and spoofers [60, 61]. The authors leverage $Q$-learning algorithms to compute the spoofing detection strategy. The receiver uses this strategy to choose the test threshold in the hypothesis test for spoofing detection, maximizing its expected utility based on the Bayesian risk of detecting the spoofer. Beyond passive defensive methods such as detection, Elnaggar et al. [62] have proposed an inverse RL (IRL) approach that leverages historical measurement and control data to forecast the objective of spoofers. The authors studied an autonomous vehicle equipped with multiple sensors and tasked with go-to-goal navigation. The spoofer aims to hijack the vehicle to an adversary-desired location while staying stealthy. The agent can prevent the vehicle from reaching the adversary-desired location by leveraging the IRL method.

2.2.3 RL Against False Data Injection Attacks

False data injection (FDI) attacks affect the data integrity of packets by modifying their payloads, which are in general more difficult to detect but have been less investigated [63]. Li et al. [64] proposed a DQN-based RL method, on behalf of the attacker, to jeopardize the data analysis results by optimally injecting fake data conflicting with the ground truth in crowdsensing systems. Similarly, in [65], Chen et al. come up with a novel FDI attack strategy, on behalf of the attacker, aiming to distort the regular operation of a power system through its automatic voltage control. The problem of finding an optimal attack strategy is formulated as a partially observable MDP, and a $Q$-learning algorithm is proposed to enable online learning and attacking.

Instead of designing optimal FDI attacking strategies for the attacker, Kurt et al. [66] designed a defense strategy against a series of cyber attacks in smart grids. The authors proposed a model-free RL algorithm to detect cyber attacks, including FDI attacks, on the fly without knowing the underlying attack model. Numerical results show that the proposed detection system can detect FDI attacks with a precision of 0.9977 and a recall of 1. This section has introduced different types of attacks, how they can cause damage to the systems, and how RL can mitigate the damage. The following section focuses on the three major classes of vulnerabilities in cyber systems and discusses how RL can help address each class of vulnerabilities.

3 Resilient Cyber Defense Against Vulnerabilities

As discussed in Section 1, the preparation and prevention mechanisms for perfect security are too costly to implement. Response and recovery mechanisms are necessary to mitigate successful attacks and harden the system security. RL is an efficient tool to achieve this goal through online learning, correction, and adaptation. In this section, we provide a literature overview on cyber resilience and focus on methodologies that rely on feedback mechanisms, which include RL and control-theoretic approaches. We refer readers to several recent reviews on game theory for cybersecurity [67] and cyber deception [68], and Deep Reinforcement Learning (DRL) for cybersecurity [33]. The design of CRMs depends on the type of vulnerabilities that the cyber system aims to mitigate. We group the common vulnerabilities into three categories, namely, (1) the posture-related vulnerabilities, (2) the information-related vulnerabilities, and (3) the human-related vulnerabilities.

The posture-related vulnerabilities refer to a defender's naturally disadvantageous security posture in comparison to the attacker's; e.g., the defender has to be constantly vigilant and legitimately defend the entire attack surface, while the attacker only needs to succeed once at a single location. Due to the disadvantage in security posture, the defender with limited resources cannot afford to prepare for all possible attacks. The information-related vulnerabilities refer to the defender's information disadvantage, especially when facing deceptive and stealthy attacks. The defender cannot make a meticulous plan to protect his assets if he does not have the capability to map out the attack paths or predict the targets. The human-related vulnerabilities are the results of human misbehavior and cognitive limitations. The vulnerabilities of all human groups in the cyber system can expose the system to cyber threats and undermine cyber resilience. Human users and insiders can unintentionally fall victim to phishing attacks or intentionally break security rules for their convenience. Human operators and network administrators in charge of real-time monitoring and inspection of alerts and system status can suffer from alert fatigue. Attackers who aim to penetrate a computing system often exploit human vulnerabilities as the first step. Preparation and prevention mechanisms, e.g., security training and regulations, are not sufficient to protect humans from attacks.

We briefly review the existing technologies to mitigate the three types of vulnerabilities in Sections 3.1, 3.2, and 3.3, respectively. In particular, we emphasize the RL-based CRMs and introduce three cyber resilience designs, i.e., MTD in Section 4.1, honeypots in Section 4.2, and human-assistive technologies in Section 4.3, as representative examples of the RL solutions to these three types of vulnerabilities, respectively.

3.1 Posture-Related Vulnerability

The technologies to mitigate posture-related vulnerabilities have been extensively studied in the literature. Moving Target Defense (MTD) is one of the modern technologies that neutralizes the attacker's position advantage by creating reconnaissance difficulties and uncertainties for attackers. There is a surge of recent literature on using RL to choose an adaptive configuration strategy that maximizes the impact of MTD, with particular focus on dynamic environments [69], reduced resource consumption [70], usability [11], partially observable environments [71, 72], and multiagent scenarios that contain both the characteristics of the system and the adversary's observed activities [73, 72].

Other techniques have been proposed to protect disadvantaged cyber systems from falling victim to resourceful and determined attacks such as Distributed Denial-of-Service (DDoS) or powerful jamming attacks. For example, [58] mitigates DDoS flooding in Software-Defined Networks; [74] introduces a multi-objective reward function to guide an RL agent to learn the most suitable action for mitigating application-layer DDoS attacks; [75] proposes a feature-adaption RL approach based on the space-time flow regularities in IoV for DDoS mitigation. Among the works on RL for DDoS attacks, there has been recent attention on large-scale solutions [21, 22] via cooperative RL, low-rate attacks in the edge environment [76], and sparse constraints in cloud computing [77]. Besides DDoS attacks, RL has wide applications in mobile edge caching [78], mobile offloading for cloud-based malware detection [79], and host-based and network-based intrusion detection [80, 81].

3.2 Information-Related Vulnerability

Deception is a ubiquitous phenomenon and has been used as a proactive defense method. Honeypots and related cyber deception technologies, e.g., honeyfiles, honeynets, honeybots, etc., have been widely used to mitigate stealthy and deceptive attacks. In particular, RL has been widely used to develop self-adaptive honeypots in normal conditions [82, 83, 84] and in resource-constrained environments [85]. As attackers become more sophisticated, they manage to fingerprint and evade honeypots [86]. This reduces the size of the captured dataset and the chance to extract threat intelligence from it. Thus, there is an urgent need for stealthy honeypots [87] that can counter fingerprinting. In [88], the authors use RL to conceal the honeypot functionality. Besides honeypot fingerprinting, RL is also used to extract the most threat intelligence [89, 90, 23].

Besides honeypots, there are emerging technologies to detect and respond to deceptive attacks. In [91, 92, 93], the authors model the interactions between stealthy attackers and the defenders as a dynamic Bayesian game of discrete or continuous type, where multi-stage response strategies are designed based on belief updates. The convergence of continuous-type Bayesian games under a sequence of finer discretization schemes is discussed in [94]. In [95], the authors design the information revealed to the users and attackers to elicit behaviors in favor of the defender. In [96] and [97], RL is used to optimally deploy the deception resources and emulate adversaries, respectively. Many works have attempted to address various spoofing attacks using RL. In [61], the authors propose RL-based spoofing detection schemes where the interactions between a legitimate receiver and spoofers are formulated as a zero-sum authentication game. In [98], the authors model the behavior of exploring face-spoofing-related information from image sub-patches by leveraging deep RL. The development of counter-deceptive and defensive deceptive technologies is in its infancy. RL is a promising tool to make the design quantitative, automatic, and adaptive.

3.3 Human-Related Vulnerability

We classify human vulnerabilities into acquired and innate vulnerabilities, depending on whether they can be mitigated through short-term security training and awareness programs. Social engineering [99] is a common attack vector that targets acquired human vulnerabilities such as the fear of expressing anger, the lack of assertiveness to say no, and the desire to please others. Threat actors use psychological manipulation techniques to mislead people into breaking normal security procedures or divulging confidential information. The authors in [100] have used RL to detect cyberbullying automatically, and [101, 102] have proposed a feedback learning framework, e.g., RL, to fight against social engineering attacks. As a representative form of social engineering, phishing attacks use email or malicious websites to serve malware or steal credentials by masquerading as a legitimate entity. Non-technical anti-phishing solutions include security training and education programs, while technical solutions include blacklisting, whitelisting, and feature-based detection. To handle zero-day phishing attacks, (deep) RL has been used to detect phishing emails [103], phishing websites [104], spear phishing [105], and social bots [106] in Online Social Networks (OSNs). Attackers can also exploit innate human vulnerabilities induced by bounded attention and rationality. The authors in [24, 25] have identified a new type of attack called Informational Denial-of-Service (IDoS) attacks that can exacerbate the human operators' cognitive overload by generating a large number of feints and hiding real attacks among them.

Most of the existing works take humans as an independent component of the cyber system and aim to compensate indirectly for the human vulnerability through additional mechanisms. An alternative way is to directly affect the human component and consider an integrated human-cyber system. Due to the unpredictability and modeling challenges of human behaviors, RL and feedback control serve as tools to determine how to affect human incentives and perceptions effectively and efficiently. In [107], the penalty and reward are changed adaptively through a feedback system to improve the compliance of human employees and mitigate insider threats. In [108], RL is used to develop the optimal visual aids to enhance users' attention and help them identify phishing attacks. In [25], the authors use RL to determine resilient and adaptive strategies for alert and attention management. Developing security-assistive technologies that directly affect humans is in its infancy, and it is a promising direction to explore further.

4 Cyber-Resilient Mechanism Designs and Applications

The system vulnerabilities and the attacker's information and resource advantages have made response and recovery mechanisms a necessity. To achieve cyber resilience, the defender needs to make the response and recovery mechanisms configurable, adaptive, and autonomous. RL and feedback control provide a rich set of tools to achieve these three features.

In this section, we introduce the following three cyber resilience designs, i.e., MTD strategies for security-usability trade-off [11] in Section 4.1, honeypot strategies for attack engagement [23] in Section 4.2, and human-assistive alert management strategies [24, 25] in Section 4.3, as the representative examples of RL solutions to three vulnerabilities in Sections 3.1, 3.2, and 3.3, respectively. For each application, we first introduce the necessary background, motivation, and challenges of the defense technologies. Then, we introduce the feedback and learning models to capture the online behaviors of the players, e.g., the defender, the attackers, and the users. Finally, we illustrate how RL can enable cyber resilience.

The CRMs in all three works guarantee asymptotic convergence, but the learning speed varies based on the complexity of the objective strategy and the learning environment. Fast convergence can reduce the response time in Fig. 1 and lead to a more resilient solution to enhance cyber security performance. Many works in the RL literature, including approximations [109], transfer learning [110], and meta learning [111], have improved the learning speed and reduced sample complexity. However, few works have applied the above successes to the design of CRMs. We envision seeing more works that apply those RL algorithms to the cyber domain and thus enable efficient learning and timely responses. Besides developing general efficient RL algorithms, we can also tailor and improve the existing RL solutions to the characteristics of cyber systems. For example, it is sufficient to consider epsilon-optimality when the cyber system is not stationary; e.g., its state changes dynamically and stochastically based on the behaviors of users, defenders, and attackers.

4.1 Adaptive MTD Strategies for Security-Usability Trade-off

Many legacy Information Technology (IT) systems operate in a relatively static configuration. As many configuration parameters, e.g., IP addresses, remain the same over long periods, advanced attackers can reconnoiter and learn the configuration to launch targeted attacks. MTD is a countermeasure that dynamically configures the system settings and shifts the attack surface to increase uncertainties and learning costs for the attackers.

Many networked IT systems nowadays consist of hierarchical layers and adopt the Defense-in-Depth (DiD) approach where a series of defensive mechanisms reside on each of these layers. Let 𝒩:={1,2,,N}\mathcal{N}:=\{1,2,\cdots,N\} denote the set of NN layers in a system and 𝒱l:={vl,1,vl,2,,vl,nl}\mathcal{V}_{l}:=\{v_{l,1},v_{l,2},\cdots,v_{l,n_{l}}\} be the set of nln_{l} system vulnerabilities that an attacker can exploit to compromise the system at layer l𝒩l\in\mathcal{N}. Assuming that perfect security is unattainable and attacks can penetrate the layered defense of different depths, MTD can be applied for all layers to interact and slow down the attacker’s rate of penetration. At each layer l𝒩l\in\mathcal{N}, the defender can choose to change the system configuration from a finite set of mlm_{l} feasible configurations 𝒞l:={cl,1,cl,2,,cl,ml}\mathcal{C}_{l}:=\{c_{l,1},c_{l,2},\cdots,c_{l,m_{l}}\}. Different configurations result in different subsets of vulnerabilities among 𝒱l\mathcal{V}_{l}, which are characterized by the configuration-vulnerability map πl:𝒞l2𝒱l\pi_{l}:\mathcal{C}_{l}\rightarrow 2^{\mathcal{V}_{l}}. Under the hh-th configuration cl,h𝒞lc_{l,h}\in\mathcal{C}_{l}, the outcome of the configuration-vulnerability mapping, i.e., πl(cl,h)\pi_{l}(c_{l,h}), represents the attack surface at layer l𝒩l\in\mathcal{N}.

After the attacker has reached the layer l𝒩l\in\mathcal{N}, the attacker can launch an attack al,ka_{l,k} from a finite set 𝒜l:={al,1,al,2,,al,nl}\mathcal{A}_{l}:=\{a_{l,1},a_{l,2},\cdots,a_{l,n_{l}}\}. Let γl:𝒱l𝒜l\gamma_{l}:\mathcal{V}_{l}\rightarrow\mathcal{A}_{l} be the vulnerability-attack map that associates vulnerability vl,j𝒱lv_{l,j}\in\mathcal{V}_{l} with attack al,k𝒜la_{l,k}\in\mathcal{A}_{l}. Without loss of generality, there is a one-to-one correspondence between 𝒜l\mathcal{A}_{l} and 𝒱l\mathcal{V}_{l}, then the corresponding inverse map is denoted as γl1:𝒜l𝒱l\gamma_{l}^{-1}:\mathcal{A}_{l}\rightarrow\mathcal{V}_{l}. Attack action al,k𝒜la_{l,k}\in\mathcal{A}_{l} incurs a bounded cost Dhk+D_{hk}\in\mathbb{R}_{+} when the current attack surface πl(cl,h)\pi_{l}(c_{l,h}) under configuration cl,h𝒞lc_{l,h}\in\mathcal{C}_{l} contains the vulnerability vl,j=γl1(al,k)v_{l,j}=\gamma_{l}^{-1}(a_{l,k}). Otherwise, the attacker fails to exploit the existing vulnerabilities under the configuration and incurs zero damage. Thus, the damage caused by the attacker at layer l𝒩l\in\mathcal{N}, denoted by rl:𝒜l×𝒞l+r_{l}:\mathcal{A}_{l}\times\mathcal{C}_{l}\rightarrow\mathbb{R}_{+}, takes the following form:

rl(al,k,cl,h)={Dhk,γl1(al,k)πl(cl,h)0,otherwise.r_{l}(a_{l,k},c_{l,h})=\begin{cases}D_{hk},&\gamma_{l}^{-1}(a_{l,k})\in\pi_{l}(c_{l,h})\\ 0,&\text{otherwise}\end{cases}. (3)
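To make the notation above concrete, the following minimal Python sketch encodes one hypothetical layer with three vulnerabilities and two configurations and evaluates the damage function in (3); all names and damage values are illustrative assumptions rather than parameters from [11].

```python
# A minimal, illustrative encoding of one MTD layer; all values are hypothetical.
vulnerabilities = ["v1", "v2", "v3"]
attack_surface = {          # configuration-vulnerability map pi_l
    "c1": {"v1", "v2"},     # attack surface pi_l(c1)
    "c2": {"v2", "v3"},     # attack surface pi_l(c2)
}
attack_of = {"v1": "a1", "v2": "a2", "v3": "a3"}   # vulnerability-attack map gamma_l
vuln_of = {a: v for v, a in attack_of.items()}     # inverse map gamma_l^{-1}

# Hypothetical bounded damages D_{hk} indexed by (configuration, attack).
damage = {("c1", "a1"): 4.0, ("c1", "a2"): 2.0, ("c1", "a3"): 3.0,
          ("c2", "a1"): 4.0, ("c2", "a2"): 2.0, ("c2", "a3"): 3.0}

def r_l(attack: str, config: str) -> float:
    """Damage r_l(a_{l,k}, c_{l,h}) in Eq. (3): nonzero only when the exploited
    vulnerability lies in the current attack surface pi_l(c_{l,h})."""
    if vuln_of[attack] in attack_surface[config]:
        return damage[(config, attack)]
    return 0.0

print(r_l("a1", "c1"))  # 4.0: v1 is exposed under configuration c1
print(r_l("a1", "c2"))  # 0.0: switching to c2 removes v1 from the attack surface
```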
Table 1: A table of parameters for Section 4.1
Indices & Sets:
N,𝒩N,\mathcal{N} Number and set of layers
nl,𝒱ln_{l},\mathcal{V}_{l} Number and set of vulnerabilities at layer ll: {vl,1,vl,2,,vl,nl}\{v_{l,1},v_{l,2},\cdots,v_{l,n_{l}}\}
𝒞l\mathcal{C}_{l} Set of system configurations at layer ll: {cl,1,cl,2,,cl,ml}\{c_{l,1},c_{l,2},\cdots,c_{l,m_{l}}\}
𝒜l\mathcal{A}_{l} Set of attacks available at layer ll: {al,1,al,2,,al,nl}\{a_{l,1},a_{l,2},\cdots,a_{l,n_{l}}\}
Decision & Strategy:
al,ka_{l,k} Attack targeting at vulnerability kk at layer ll
Δ𝒞l,Δ𝒜l\Delta\mathcal{C}_{l},\Delta\mathcal{A}_{l} Sets of probability distributions over 𝒞l\mathcal{C}_{l} and 𝒜l\mathcal{A}_{l}
𝐟l={fl,1,fl,2,,fl,ml}Δ𝒞l\mathbf{f}_{l}=\{f_{l,1},f_{l,2},\cdots,f_{l,m_{l}}\}\in\Delta\mathcal{C}_{l} Defender’s randomized strategy over 𝒞l\mathcal{C}_{l}
𝐠l={gl,1,gl,2,,gl,nl}Δ𝒜l\mathbf{g}_{l}=\{g_{l,1},g_{l,2},\cdots,g_{l,n_{l}}\}\in\Delta\mathcal{A}_{l} Attacker’s randomized attacking strategy over 𝒜l\mathcal{A}_{l}
𝕔l,t(𝕒l,t)\mathbbm{c}_{l,t}(\mathbbm{a}_{l,t}) Action chosen by defender (attacker) at time tt
Other Variables:
πl:𝒞l2𝒱l\pi_{l}:\mathcal{C}_{l}\rightarrow 2^{\mathcal{V}_{l}} Configuration-vulnerability mapping
γl:𝒱l𝒜l\gamma_{l}:\mathcal{V}_{l}\rightarrow\mathcal{A}_{l} Vulnerability-attack map
γl1:𝒜l𝒱l\gamma_{l}^{-1}:\mathcal{A}_{l}\rightarrow\mathcal{V}_{l} Inverse vulnerability-attack map
DhkD_{hk} Cost induced by the attack indexed by kk under hh-th configuration
rl:𝒜l×𝒞l+r_{l}:\mathcal{A}_{l}\times\mathcal{C}_{l}\rightarrow\mathbb{R}_{+} Damage caused by the attacker at layer ll
r^l,tS(r^l,tA)\hat{r}_{l,t}^{S}(\hat{r}_{l,t}^{A}) Defender’s (attacker’s) estimate of the average risk
μtS(μtA)\mu_{t}^{S}(\mu_{t}^{A}) Payoff learning rates for the defender (attacker)
𝕣l(𝐟l,𝐠l)\mathbbm{r}_{l}(\mathbf{f}_{l},\mathbf{g}_{l}) Defender’s expected cost under strategies 𝐟l\mathbf{f}_{l} and 𝐠l\mathbf{g}_{l}
Rl,tSR_{l,t}^{S} Reconfigure cost of the defender
ϵl,tS\epsilon_{l,t}^{S} Parameter that balances the security and the usability
Wl,tS(Wl,tA)W_{l,t}^{S}(W_{l,t}^{A}) Optimal value of the defender’s (attacker’s) optimization problem at time tt
λl,tS(λl,tA)\lambda_{l,t}^{S}(\lambda_{l,t}^{A}) Learning rate of the defender (attacker)

The attacker’s goal is to penetrate and compromise the system, while the defender aims to minimize the damage or risk. Since vulnerabilities are inevitable in modern IT systems, the defender adopts MTD to randomize between configurations and make it difficult for the attacker to learn and exploit the vulnerabilities at each layer. Fig. 4 provides a paradigmatic example to illustrate the benefit of MTD. At the first layer highlighted by the blue box, there are three vulnerabilities. Configuration c1,1c_{1,1} in Fig. 4(a) has an attack surface π1(c1,1)={v1,1,v1,2}\pi_{1}(c_{1,1})=\{v_{1,1},v_{1,2}\} while configuration c1,2c_{1,2} in Fig. 4(b) has an attack surface π1(c1,2)={v1,2,v1,3}\pi_{1}(c_{1,2})=\{v_{1,2},v_{1,3}\}. For each attack surface, the existing and non-existing vulnerabilities are denoted by the solid and dashed arcs, respectively. Then, if the attacker takes action a1,1𝒜1a_{1,1}\in\mathcal{A}_{1} that exploits vulnerability v1,1v_{1,1} but the defender changes the configuration from c1,1c_{1,1} to c1,2c_{1,2}, then the attack is thwarted at the first layer.

(a) Attack surface π1(c1,1)={v1,1,v1,2}\pi_{1}(c_{1,1})=\{v_{1,1},v_{1,2}\} under configuration c1,1c_{1,1}.
(b) Attack surface π1(c1,2)={v1,2,v1,3}\pi_{1}(c_{1,2})=\{v_{1,2},v_{1,3}\} under configuration c1,2c_{1,2}.
Figure 4: Given a static configuration c1,1𝒞1c_{1,1}\in\mathcal{C}_{1}, an attacker can succeed in reaching the resources at deeper layers by forming an attack path from v1,1v_{1,1} to v2,2v_{2,2}. A change of configuration to c1,2𝒞1c_{1,2}\in\mathcal{C}_{1} can prevent the attacker from exploiting the vulnerabilities at the first layer.

The authors in [11] model the conflict between a multi-stage attacker and a defender adopting MTD as a zero-sum game. At each layer l𝒩l\in\mathcal{N}, the defender’s randomized strategy over the configuration set 𝒞l\mathcal{C}_{l} is denoted as 𝐟l{fl,1,fl,2,,fl,ml}Δ𝒞l\mathbf{f}_{l}\coloneqq\{f_{l,1},f_{l,2},\cdots,f_{l,m_{l}}\}\in\Delta\mathcal{C}_{l} where Δ𝒞l\Delta\mathcal{C}_{l} is the distribution of set 𝒞l\mathcal{C}_{l}. The attacker can also choose a randomized strategy 𝐠l:={gl,1,gl,2,,gl,nl}Δ𝒜l\mathbf{g}_{l}:=\{g_{l,1},g_{l,2},\cdots,g_{l,n_{l}}\}\in\Delta\mathcal{A}_{l} over the set of feasible attacks at layer l𝒩l\in\mathcal{N} to increase his chance of a successful vulnerability compromise. In particular, fl,h[0,1]f_{l,h}\in[0,1] and gl,k[0,1]g_{l,k}\in[0,1] represent the probabilities of the defender taking configuration cl,hc_{l,h} and the attacker taking action al,ka_{l,k}, respectively, at layer l𝒩l\in\mathcal{N}. Under the mixed-strategy pair (𝐟lΔ𝒞l,𝐠lΔ𝒜l)(\mathbf{f}_{l}\in\Delta\mathcal{C}_{l},\mathbf{g}_{l}\in\Delta\mathcal{A}_{l}), the defender’s expected cost 𝕣l\mathbbm{r}_{l} is given by

𝕣l(𝐟l,𝐠l):=𝔼𝐟l,𝐠lrl=h=1mlk=1nlfl,hgl,krl(al,k,cl,h).\mathbbm{r}_{l}(\mathbf{f}_{l},\mathbf{g}_{l}):=\mathbb{E}_{\mathbf{f}_{l},\mathbf{g}_{l}}r_{l}=\sum_{h=1}^{m_{l}}\sum_{k=1}^{n_{l}}f_{l,h}g_{l,k}r_{l}(a_{l,k},c_{l,h}). (4)

The defender’s optimal randomized strategy 𝐟lΔ𝒞l\mathbf{f}_{l}^{*}\in\Delta\mathcal{C}_{l} against the worst-case attacks can be obtained from the mixed strategy saddle-point equilibrium (SPE) (𝐟lΔ𝒞l,𝐠lΔ𝒜l)(\mathbf{f}_{l}^{*}\in\Delta\mathcal{C}_{l},\mathbf{g}_{l}^{*}\in\Delta\mathcal{A}_{l}) of the zero-sum game where the game value 𝕣(𝐟l,𝐠l)\mathbbm{r}(\mathbf{f}_{l}^{*},\mathbf{g}_{l}^{*}) is unique, i.e.,

𝕣l(𝐟l,𝐠l)𝕣l(𝐟l,𝐠l)𝕣l(𝐟l,𝐠l),𝐟lΔ𝒞l,𝐠lΔ𝒜l,\mathbbm{r}_{l}(\mathbf{f}_{l}^{*},\mathbf{g}_{l})\leq\mathbbm{r}_{l}(\mathbf{f}_{l}^{*},\mathbf{g}_{l}^{*})\leq\mathbbm{r}_{l}(\mathbf{f}_{l},\mathbf{g}_{l}^{*}),\forall\mathbf{f}_{l}\in\Delta\mathcal{C}_{l},\mathbf{g}_{l}\in\Delta\mathcal{A}_{l}, (5)
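Since each layer reduces to a finite zero-sum matrix game, the saddle-point strategy in (5) can be computed by linear programming whenever the damage matrix is known. The sketch below, with an arbitrary illustrative damage matrix, solves the defender's side of the game with scipy; it serves as a known-model baseline for the model-free learning discussed next and is not part of the algorithm in [11].

```python
# Sketch: compute the defender's saddle-point strategy of the per-layer zero-sum game
# by linear programming; R[h, k] = r_l(a_{l,k}, c_{l,h}) is a hypothetical damage matrix.
import numpy as np
from scipy.optimize import linprog

R = np.array([[4.0, 2.0, 0.0],      # damages under configuration c1 (illustrative)
              [0.0, 2.0, 3.0]])     # damages under configuration c2 (illustrative)
m, n = R.shape                      # m configurations, n attacks

# Decision variables [f_1, ..., f_m, v]; the defender minimizes the worst-case damage v.
c = np.r_[np.zeros(m), 1.0]
A_ub = np.c_[R.T, -np.ones(n)]      # for every attack k: sum_h f_h * R[h, k] - v <= 0
b_ub = np.zeros(n)
A_eq = np.r_[np.ones(m), 0.0].reshape(1, -1)    # probabilities sum to one
b_eq = [1.0]
bounds = [(0, 1)] * m + [(None, None)]

res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
f_star, value = res.x[:m], res.x[m]
print("saddle-point configuration strategy f_l*:", np.round(f_star, 3))
print("game value (worst-case expected damage):", round(value, 3))
```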

4.1.1 Optimal Configuration Policy via Reinforcement Learning

In practice, the costs Dhk,hml,knlD_{hk},\forall h\in m_{l},k\in n_{l}, are unknown and subject to noise. Thus, the attacker and the defender need to learn the cost rlr_{l} independently during their interaction over time, which is referred to as the procedure of ‘sense’ in Fig. 5. Then, each player updates his policy based on the estimated cost and then takes an action based on the updated policy, which are referred to as the procedures of ‘learn’ and ‘act’, respectively, in Fig. 5.

The subscript tt denotes the strategy or cost at time tt. Due to the non-cooperative environment, the attacker does not know the configuration and the defender does not know the attack action throughout their interaction. Thus, at time tt, they independently choose actions 𝕔l,t𝒞l\mathbbm{c}_{l,t}\in\mathcal{C}_{l} and 𝕒l,t𝒜l\mathbbm{a}_{l,t}\in\mathcal{A}_{l} according to strategies 𝐟l,t\mathbf{f}_{l,t} and 𝐠l,t\mathbf{g}_{l,t}, respectively. Then, they commonly observe the cost rl,tr_{l,t} as an outcome of their action pair (𝕔l,t,𝕒l,t)(\mathbbm{c}_{l,t},\mathbbm{a}_{l,t}). Based on the observed cost at time tt, the defender and the attacker can estimate the average risk r^l,tS:𝒞l+\hat{r}_{l,t}^{S}:\mathcal{C}_{l}\rightarrow\mathbb{R}_{+} and r^l,tA:𝒜l+\hat{r}_{l,t}^{A}:\mathcal{A}_{l}\rightarrow\mathbb{R}_{+}, respectively, as follows:

r^l,t+1S(cl,h)=r^l,tS(cl,h)+μtS𝟏{𝕔l,t=cl,h}(rl,tr^l,tS(cl,h)),h{1,,ml},r^l,t+1A(al,k)=r^l,tA(al,k)+μtA𝟏{𝕒l,t=al,k}(rl,tr^l,tA(al,k)),k{1,,nl},\begin{split}&\hat{r}_{l,t+1}^{S}(c_{l,h})=\hat{r}_{l,t}^{S}(c_{l,h})+\mu_{t}^{S}\mathbf{1}_{\{\mathbbm{c}_{l,t}=c_{l,h}\}}(r_{l,t}-\hat{r}_{l,t}^{S}(c_{l,h})),\forall h\in\{1,\cdots,m_{l}\},\\ &\hat{r}_{l,t+1}^{A}(a_{l,k})=\hat{r}_{l,t}^{A}(a_{l,k})+\mu_{t}^{A}\mathbf{1}_{\{\mathbbm{a}_{l,t}=a_{l,k}\}}(r_{l,t}-\hat{r}_{l,t}^{A}(a_{l,k})),\forall k\in\{1,\cdots,n_{l}\},\end{split} (6)

where μtS\mu_{t}^{S} and μtA\mu_{t}^{A} are the payoff learning rates for the defender and the attacker, respectively. The indicators in (6) mean that, at the current time tt, the defender and the attacker only update the estimated average risk of the observed actions 𝕔l,t\mathbbm{c}_{l,t} and 𝕒l,t\mathbbm{a}_{l,t}, respectively.

The defender uses the estimated risk r^l,t+1S(cl,h)\hat{r}_{l,t+1}^{S}(c_{l,h}) to update his configuration strategy from 𝐟l,t\mathbf{f}_{l,t} to 𝐟l,t+1\mathbf{f}_{l,t+1}. The strategy change involves a reconfigure cost of maneuvering the defense resources and altering the attack surface from πl(𝕔l,t)\pi_{l}(\mathbbm{c}_{l,t}) to πl(𝕔l,t+1)\pi_{l}(\mathbbm{c}_{l,t+1}), where 𝕔l,t\mathbbm{c}_{l,t} and 𝕔l,t+1\mathbbm{c}_{l,t+1} are selected according to 𝐟l,t\mathbf{f}_{l,t} and 𝐟l,t+1\mathbf{f}_{l,t+1}, respectively. Define the reconfigure cost as the relative entropy between two consecutive strategies:

Rl,tS:=h=1mlfl,h,t+1ln(fl,h,t+1fl,h,t).R_{l,t}^{S}:=\sum_{h=1}^{m_{l}}f_{l,h,t+1}\ln\left(\frac{f_{l,h,t+1}}{f_{l,h,t}}\right). (7)

Then, the reconfigure cost Rl,tSR_{l,t}^{S} is added to the original expected cost h=1mlfl,h,t+1r^l,tS(cl,h)\sum_{h=1}^{m_{l}}f_{l,h,t+1}\hat{r}^{S}_{l,t}(c_{l,h}) with a parameter ϵl,tS>0\epsilon_{l,t}^{S}>0 to quantify the trade-off between the security and the usability. Thus, the defender aims to solve the following optimization problem (SP) at time tt:

(SP):sup𝐟l,t+1Δ𝒞lh=1mlfl,h,t+1r^l,tS(cl,h)ϵl,tSRl,tS,(\texttt{SP}):\sup_{\mathbf{f}_{l,t+1}\in\Delta\mathcal{C}_{l}}-\sum_{h=1}^{m_{l}}f_{l,h,t+1}\hat{r}^{S}_{l,t}(c_{l,h})-\epsilon_{l,t}^{S}R_{l,t}^{S}, (8)

which has the following closed-form solution in (9) and the optimal value Wl,tSW_{l,t}^{S} in (10).

fl,h,t+1=fl,h,ter^l,t(cl,h)ϵl,tSh=1mlfl,h,ter^l,t(cl,h)ϵl,tS,h{1,,ml}.f_{l,h,t+1}=\frac{f_{l,h,t}e^{-\frac{\hat{r}_{l,t}(c_{l,h})}{\epsilon_{l,t}^{S}}}}{\sum_{h^{\prime}=1}^{m_{l}}f_{l,h^{\prime},t}e^{-\frac{\hat{r}_{l,t}(c_{l,h^{\prime}})}{\epsilon_{l,t}^{S}}}},\forall h\in\{1,\cdots,m_{l}\}. (9)
Wl,tS=ϵl,tSln(h=1mlfl,h,ter^l,t(cl,h)ϵl,tS).W_{l,t}^{S}=\epsilon_{l,t}^{S}\ln\left(\sum_{h=1}^{m_{l}}f_{l,h,t}e^{-\frac{\hat{r}_{l,t}(c_{l,h})}{\epsilon_{l,t}^{S}}}\right). (10)

When the value of ϵl,tS\epsilon_{l,t}^{S} is high, the configuration policy changes less and is more usable. However, it also becomes easier for the attacker to learn the policy, which reduces security. If ϵl,tS\epsilon_{l,t}^{S}\rightarrow\infty, then the configuration policy remains the same, fl,h,t+1=fl,h,tf_{l,h,t+1}=f_{l,h,t}, and the optimal value in (10) equals h=1mlfl,h,tr^l,tS(cl,h)-\sum_{h=1}^{m_{l}}f_{l,h,t}\hat{r}^{S}_{l,t}(c_{l,h}). If ϵl,tS0\epsilon_{l,t}^{S}\rightarrow 0, then the optimal value in (10) equals mincl,h𝒞lr^l,t(cl,h)-\min_{c_{l,h}\in\mathcal{C}_{l}}\hat{r}_{l,t}(c_{l,h}), i.e., the negative of the smallest estimated risk.

Analogously, it takes an attacker time and energy to change the attack strategy to explore new vulnerabilities and exploit them, which leads to the following optimization problem (AP) for the attacker.

(AP):sup𝐠l,t+1Δ𝒜lk=1nlgl,k,t+1r^l,tA(al,k)ϵl,tAk=1nlgl,k,t+1ln(gl,k,t+1gl,k,t),(\texttt{AP}):\sup_{\mathbf{g}_{l,t+1}\in\Delta\mathcal{A}_{l}}-\sum_{k=1}^{n_{l}}g_{l,k,t+1}\hat{r}^{A}_{l,t}(a_{l,k})-\epsilon_{l,t}^{A}\sum_{k=1}^{n_{l}}g_{l,k,t+1}\ln\left(\frac{g_{l,k,t+1}}{g_{l,k,t}}\right), (11)

which has the following closed-form solution in (12) and the optimal value Wl,tAW_{l,t}^{A} in (13).

gl,k,t+1=gl,k,ter^l,t(al,k)ϵl,tAk=1nlgl,k,ter^l,t(al,k)ϵl,tA.g_{l,k,t+1}=\frac{g_{l,k,t}e^{-\frac{\hat{r}_{l,t}(a_{l,k})}{\epsilon_{l,t}^{A}}}}{\sum_{k^{\prime}=1}^{n_{l}}g_{l,k^{\prime},t}e^{-\frac{\hat{r}_{l,t}(a_{l,k^{\prime}})}{\epsilon_{l,t}^{A}}}}. (12)
Wl,tA=ϵl,tAln(k=1nlgl,k,ter^l,t(al,k)ϵl,tA).W_{l,t}^{A}=\epsilon_{l,t}^{A}\ln\left(\sum_{k=1}^{n_{l}}g_{l,k,t}e^{-\frac{\hat{r}_{l,t}(a_{l,k})}{\epsilon_{l,t}^{A}}}\right). (13)

The parameter ϵl,tA>0\epsilon_{l,t}^{A}>0 achieves the trade-off between the attacker’s benefit of attacks and the cost to learn the most effective attack. With all these notations introduced, we use Fig. 5 to illustrate the attacker’s (resp. the defender’s) risk learning, policy update, and action implementation in the light orange (resp. light blue) background. In either the attacker’s adversarial learning or the defender’s defensive learning, the learning rule does not depend on the other player’s action, yet the observed payoff depends on both players’ actions.

Figure 5: Adaptive learning of the multistage MTD game at layer ll where the adversarial and defensive learning is presented in red and green, respectively. In both players’ learning feedback, the procedures of ‘sense’, ‘learn’, and ‘act’ correspond to the boxes of ‘information acquisition’, ‘decision making’, and ‘security configuration’ in Fig. 2, respectively.

The dynamics for the mixed strategy update (represented by the procedure of ‘learn’ in Fig. 2) in (9) and (12) can be generalized into the following two learning dynamics (14) and (15), respectively, with learning rates λl,tS,λl,tA[0,1]\lambda_{l,t}^{S},\lambda_{l,t}^{A}\in[0,1].

fl,h,t+1=(1λl,tS)fl,h,t+λl,tSfl,h,ter^l,t(cl,h)ϵl,tSh=1mlfl,h,ter^l,t(cl,h)ϵl,tS,h{1,,ml}.f_{l,h,t+1}=(1-\lambda_{l,t}^{S})f_{l,h,t}+\lambda_{l,t}^{S}\frac{f_{l,h,t}e^{-\frac{\hat{r}_{l,t}(c_{l,h})}{\epsilon_{l,t}^{S}}}}{\sum_{h^{\prime}=1}^{m_{l}}f_{l,h^{\prime},t}e^{-\frac{\hat{r}_{l,t}(c_{l,h^{\prime}})}{\epsilon_{l,t}^{S}}}},\forall h\in\{1,\cdots,m_{l}\}. (14)
gl,k,t+1=(1λl,tA)gl,k,t+λl,tAgl,k,ter^l,t(al,k)ϵl,tAk=1nlgl,k,ter^l,t(al,k)ϵl,tA,k{1,,nl}.g_{l,k,t+1}=(1-\lambda_{l,t}^{A})g_{l,k,t}+\lambda_{l,t}^{A}\frac{g_{l,k,t}e^{-\frac{\hat{r}_{l,t}(a_{l,k})}{\epsilon_{l,t}^{A}}}}{\sum_{k^{\prime}=1}^{n_{l}}g_{l,k^{\prime},t}e^{-\frac{\hat{r}_{l,t}(a_{l,k^{\prime}})}{\epsilon_{l,t}^{A}}}},\forall k\in\{1,\cdots,n_{l}\}. (15)

If the learning rates λl,tS=1\lambda_{l,t}^{S}=1 and λl,tA=1\lambda_{l,t}^{A}=1, the learning dynamics (14) and (15) are the same as (9) and (12), respectively. According to the stochastic approximation theory, the learning dynamics (14) and (15) converge to the SPE of the game in (5) under mild conditions [11].
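The following sketch simulates the coupled payoff learning in (6) and the strategy updates in (14)-(15) for a single layer. The damage matrix, learning rates, and trade-off parameters are illustrative assumptions. In addition, because the game is zero-sum, the sketch treats the observed damage as the defender's cost and its negative as the attacker's cost, so that both players' updates take the exponential-weight form above; this sign convention is an assumption of the sketch rather than a statement about [11].

```python
# Sketch: coupled payoff learning (6) and exponential-weight strategy updates (14)-(15)
# for one layer. The damage matrix and parameters are illustrative; the attacker's
# estimated cost is taken as the negative of the observed damage (zero-sum convention).
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[4.0, 2.0, 0.0],          # R[h, k] = r_l(a_{l,k}, c_{l,h})
              [0.0, 2.0, 3.0]])
m, n = R.shape
f = np.ones(m) / m                      # defender's mixed strategy over configurations
g = np.ones(n) / n                      # attacker's mixed strategy over attacks
r_hat_S = np.zeros(m)                   # defender's estimated average risk, Eq. (6)
r_hat_A = np.zeros(n)                   # attacker's estimated average cost (= -damage)
eps_S, eps_A = 2.0, 2.0                 # security-usability / exploration trade-offs

for t in range(1, 5001):
    mu = lam = 1.0 / t                  # diminishing payoff and strategy learning rates
    h = rng.choice(m, p=f)              # defender samples configuration c_{l,t}
    k = rng.choice(n, p=g)              # attacker samples attack a_{l,t}
    r = R[h, k]                         # commonly observed damage r_{l,t}
    r_hat_S[h] += mu * (r - r_hat_S[h])
    r_hat_A[k] += mu * (-r - r_hat_A[k])
    f_new = f * np.exp(-r_hat_S / eps_S); f_new /= f_new.sum()   # cf. Eq. (9)
    g_new = g * np.exp(-r_hat_A / eps_A); g_new /= g_new.sum()   # cf. Eq. (12)
    f = (1 - lam) * f + lam * f_new; f /= f.sum()                # Eq. (14)
    g = (1 - lam) * g + lam * g_new; g /= g.sum()                # Eq. (15)

print("learned defender strategy f_l:", np.round(f, 3))
print("learned attacker strategy g_l:", np.round(g, 3))
```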

4.2 Adaptive Honeypot Configuration for Attacker Engagement

Traditional cybersecurity techniques such as the firewall and intrusion detection systems rely on low-level Indicators of Compromise (IoCs), e.g., hash values, IP addresses, and domain names, to detect attacks. However, advanced attacks can revise these low-level IoCs and evade detection. Thus, there is an urgent need to disclose high-level IoCs, also referred to as threat intelligence, such as attack tools and Tactics, Techniques, and Procedures (TTP) of the attacker. As a promising active cyber defense mechanism, honeypots can gather essential threat intelligence by luring attackers to conduct adversarial behaviors in a controlled and monitored environment.

Table 2: Summary of notations for Section 4.2.
Variables Meaning
t[0,)t\in[0,\infty),k0k\in\mathbb{Z}_{\geq 0} Index for time and stage
sk𝒮s^{k}\in\mathcal{S} State at stage kk
ak𝒜(sk)a^{k}\in\mathcal{A}(s^{k}) Engagement action
trtr,zz Transition kernel and sojourn distribution
r1,r2,rr_{1},r_{2},r Transition reward, sojourn reward, and investigation reward
rγr^{\gamma} Equivalent investigation reward under discounted factor γ[0,)\gamma\in[0,\infty)
πΠ:=𝒮𝒜\pi\in\Pi:=\mathcal{S}\rightarrow\mathcal{A} Engagement strategy
u(s0,π),v(s0)u(s^{0},\pi),v(s^{0}) Long-term expected utility and value function starting from s0𝒮s^{0}\in\mathcal{S}

A honeynet is a network of honeypots, which emulates the real production system but has no production activities nor authorized services. Thus, an interaction with a honeynet, e.g., unauthorized inbound connections to any honeypot, directly reveals malicious activities. From an attacker’s viewpoint, the production network and the honeypot network share the same structure as shown in Fig. 6.

Figure 6: The honeynet in red emulates and shares the same structure as the targeted production system in green.

An attacker from the production system or external network can be attracted to the honeynet. Then, the attacker can move inside the honeynet through either physical (if the connecting nodes are real facilities such as computers) or logical (if the nodes represent integrated systems) links. The attacker’s transition is restricted by the network topology. Once the attacker launches an attack at a honeypot node, the defender gains an investigation reward by analyzing the attack behaviors and extracting high-level IoCs. Since the defender can obtain more threat intelligence when the attacker stays engaged for a longer time, the investigation reward increases with the engagement time. The defender aims to dynamically configure the honeynet through proper engagement actions to obtain more threat intelligence while minimizing the following three types of risks.

  • T1:

    Attackers identify the honeynet and thus either terminate on their own or behave misleadingly in honeypots.

  • T2:

    Attackers circumvent the honeywall and use the honeypots as a pivot to penetrate other production systems.

  • T3:

    The cost to engage attackers in the honeypots outweighs the resulting investigation reward.

To strike a balance between maximizing the resulting investigation reward and minimizing the risks, the authors in [23] model the attacker’s stochastic transition in the honeynet as an infinite-horizon Semi-Markov Decision Process (SMDP) to quantify the long-term reward and risks. The SMDP consists of the tuple {t[0,),𝒮,𝒜(sj),tr(sl|sj,aj),z(|sj,aj,sl),rγ(sj,aj,sl),γ[0,)}\{t\in[0,\infty),\mathcal{S},\mathcal{A}({s_{j}}),tr(s_{l}|s_{j},a_{j}),\allowbreak z(\cdot|s_{j},a_{j},s_{l}),\allowbreak r^{\gamma}(s_{j},a_{j},s_{l}),\gamma\in[0,\infty)\}. We illustrate each element of the tuple through a 1313-state example in Fig. 7.

Figure 7: Honeypots emulate different components of the production system. Actions aE,aP,aL,aHa_{E},a_{P},a_{L},a_{H} are denoted in red, blue, purple, and green, respectively. The size of node nin_{i} represents the state value v(si),i{1,2,,11}v(s_{i}),i\in\{1,2,\cdots,11\}.

Each node in Fig. 7 represents a state si𝒮,i{1,2,,13}s_{i}\in\mathcal{S},i\in\{1,2,\cdots,13\}. At time t[0,)t\in[0,\infty), the attacker is either at one of the honeypot nodes denoted by state si𝒮,i{1,2,,11}s_{i}\in\mathcal{S},i\in\{1,2,\cdots,11\}, at the normal zone s12s_{12}, or at a virtual absorbing state s13s_{13} once attackers are ejected or terminate on their own. At each state si𝒮s_{i}\in\mathcal{S}, the defender can choose an action ai𝒜(si)a_{i}\in\mathcal{A}(s_{i}). The action set 𝒜(si)\mathcal{A}(s_{i}) is finite and depends on the state si𝒮s_{i}\in\mathcal{S}. For example, at honeypot nodes, the defender can conduct action aEa_{E} to eject the attacker, action aPa_{P} to purely record the attacker’s activities, low-interactive action aLa_{L}, or high-interactive action aHa_{H}, i.e., 𝒜(si):={aE,aP,aL,aH},i{1,,11}\mathcal{A}(s_{i}):=\{a_{E},a_{P},a_{L},a_{H}\},i\in\{1,\cdots,11\}. At normal zone, the defender can choose either action aEa_{E} to eject the attacker immediately, or action aAa_{A} to attract the attacker to the honeynet by generating more deceptive inbound and outbound traffics in the honeynet, i.e., 𝒜(s12):={aE,aA}\mathcal{A}(s_{12}):=\{a_{E},a_{A}\}. No actions needed in the virtual absorbing state, i.e., 𝒜(s13):=\mathcal{A}(s_{13}):=\emptyset.

Based on the attacker’s current state sj𝒮s_{j}\in\mathcal{S} and the defender’s action aj𝒜(sj)a_{j}\in\mathcal{A}(s_{j}), the attacker transits to state sl𝒮s_{l}\in\mathcal{S} with probability tr(sl|sj,aj)tr(s_{l}|s_{j},a_{j}) and the sojourn time at state sjs_{j} is a continuous random variable with probability density z(|sj,aj,sl)z(\cdot|s_{j},a_{j},s_{l}). Once the attacker arrives at a new honeypot nin_{i}, the defender dynamically applies an interaction action at honeypot nin_{i} from 𝒜(si)\mathcal{A}(s_{i}) and keeps interacting with the attacker until the attacker’s next transition. Since the defender makes decision at the time of transition, the above continuous-time model over t[0,)t\in[0,\infty) can be transformed into a discrete decision model at decision epoch k{0,1,,}k\in\{0,1,\cdots,\infty\}. The time of the attacker’s kthk^{th} transition is denoted by a random variable TkT^{k}, the landing state is denoted as sk𝒮s^{k}\in\mathcal{S}, and the adopted action after arriving at sks^{k} is denoted as ak𝒜(sk)a^{k}\in\mathcal{A}(s^{k}).

At decision epoch k{0,1,,}k\in\{0,1,\cdots,\infty\}, the defender’s investigation reward rr consists of an immediate reward r1r_{1} for applying engagement action ak𝒜(sk)a^{k}\in\mathcal{A}(s^{k}) at state sk𝒮s^{k}\in\mathcal{S} and a reward rate r2r_{2}, i.e.,

r(sk,ak,sk+1,Tk,Tk+1,τ)=r1(sk,ak,sk+1)𝟏{τ=0}+r2(sk,ak,Tk,Tk+1,τ),τ[Tk,Tk+1].r(s^{k},a^{k},s^{k+1},T^{k},T^{k+1},\tau)=r_{1}(s^{k},a^{k},s^{k+1})\mathbf{1}_{\{\tau=0\}}+r_{2}(s^{k},a^{k},T^{k},T^{k+1},\tau),\tau\in[T^{k},T^{k+1}].

The reward rate r2r_{2} at time τ[Tk,Tk+1]\tau\in[T^{k},T^{k+1}] represents the benefit of threat information acquisition minus the cost rate of persistently generating deceptive traffic. Considering a discounted factor of γ[0,)\gamma\in[0,\infty) to penalize the decreasing value of the investigation as time elapses, the defender aims to maximize the long-term expected utility starting from state s0s^{0}, i.e., u(s0,π)=𝔼[k=0TkTk+1eγ(τ+Tk)(r(Sk,Ak,Sk+1,Tk,Tk+1,τ))𝑑τ],u(s^{0},\pi)=\mathbb{E}[\sum_{k=0}^{\infty}\int_{T^{k}}^{T^{k+1}}e^{-\gamma(\tau+T^{k})}(r(S^{k},A^{k},S^{k+1},T^{k},T^{k+1},\tau))d\tau], where the engagement strategy πΠ\pi\in\Pi maps state sk𝒮s^{k}\in\mathcal{S} to action ak𝒜(sk)a^{k}\in\mathcal{A}(s^{k}). Based on dynamic programming, the value function v(s0)=supπΠu(s0,π)v(s^{0})=\sup_{\pi\in\Pi}u(s^{0},\pi) can be represented as

v(s0)=supa0𝒜(s0)𝔼[T0T1eγ(τ+T0)r(s0,a0,S1,T0,T1,τ)𝑑τ+eγT1v(S1)].\displaystyle v(s^{0})=\sup_{a^{0}\in\mathcal{A}(s^{0})}\mathbb{E}\left[\int_{T^{0}}^{T^{1}}e^{-\gamma(\tau+T^{0})}r(s^{0},a^{0},S^{1},T^{0},T^{1},\tau)d\tau+e^{-\gamma T^{1}}v(S^{1})\right]. (16)

Assuming a constant reward rate r2(sk,ak,Tk,Tk+1,τ)=r¯2(sk,ak)r_{2}(s^{k},a^{k},T^{k},T^{k+1},\tau)=\bar{r}_{2}(s^{k},a^{k}), (16) can be transformed into the equivalent MDP form as follows, i.e.,

v(s0)=supa0𝒜(s0)s1𝒮tr(s1|s0,a0)(rγ(s0,a0,s1)+zγ(s0,a0,s1)v(s1)),s0𝒮,\displaystyle v(s^{0})=\sup_{a^{0}\in\mathcal{A}(s^{0})}\sum_{s^{1}\in\mathcal{S}}tr(s^{1}|s^{0},a^{0})(r^{\gamma}(s^{0},a^{0},s^{1})+z^{\gamma}(s^{0},a^{0},s^{1})v(s^{1})),\forall s^{0}\in\mathcal{S}, (17)

where zγ(s0,a0,s1):=0eγτz(τ|s0,a0,s1)𝑑τ[0,1]{z^{\gamma}}(s^{0},a^{0},s^{1}):=\int_{0}^{\infty}e^{-\gamma\tau}z(\tau|s^{0},a^{0},s^{1})d\tau\in[0,1] is the Laplace transform of the sojourn probability density z(τ|s0,a0,s1)z(\tau|s^{0},a^{0},s^{1}) and the equivalent reward rγ(s0,a0,s1):=r1(s0,a0,s1)+r¯2(s0,a0)γ(1zγ(s0,a0,s1))[mc,mc]r^{\gamma}(s^{0},a^{0},s^{1})\allowbreak:=r_{1}(s^{0},a^{0},s^{1})+\frac{\bar{r}_{2}(s^{0},a^{0})}{\gamma}(1-z^{\gamma}(s^{0},a^{0},s^{1}))\in[-m_{c},m_{c}] is assumed to be bounded by a constant mcm_{c}.
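To illustrate the reduction in (17), the toy sketch below builds a hypothetical three-state SMDP with exponentially distributed sojourn times, so that the Laplace transform of the sojourn distribution has a closed form, and solves the equivalent MDP by value iteration. The numbers are arbitrary and unrelated to the 13-state honeynet example of [23].

```python
# Sketch: SMDP-to-MDP reduction in (17) on a hypothetical 3-state, 2-action example
# with exponentially distributed sojourn times; state 2 plays the role of the
# absorbing "ejected" state. All numbers are illustrative.
import numpy as np

S, A, gamma = 3, 2, 0.1
tr = np.zeros((S, A, S))                 # transition kernel tr(s' | s, a)
tr[0, 0] = [0.0, 0.7, 0.3]; tr[0, 1] = [0.0, 0.2, 0.8]
tr[1, 0] = [0.6, 0.0, 0.4]; tr[1, 1] = [0.1, 0.0, 0.9]
tr[2, :, 2] = 1.0
rate = np.array([[0.5, 2.0], [0.5, 2.0], [1.0, 1.0]])   # sojourn rates lambda(s, a)
r1 = np.zeros((S, A, S))                                  # transition rewards
r2 = np.array([[1.0, 0.2], [2.0, 0.5], [0.0, 0.0]])      # reward rates r2_bar(s, a)

# Laplace transform of an Exp(lambda) sojourn time: z^gamma = lambda / (lambda + gamma).
z_gamma = rate / (rate + gamma)                           # shape (S, A)
# Equivalent one-step reward r^gamma(s, a, s') defined below (17).
r_gamma = r1 + (r2 / gamma * (1.0 - z_gamma))[:, :, None]

v = np.zeros(S)                                           # value iteration on (17)
for _ in range(2000):
    q = np.einsum("sap,sap->sa", tr, r_gamma + z_gamma[:, :, None] * v[None, None, :])
    v = q.max(axis=1)

print("value function v(s):", np.round(v, 3))
print("greedy engagement actions:", q.argmax(axis=1))
```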

4.2.1 Optimal Engagement Strategy via Reinforcement Learning

In practice, the defender cannot know the exact SMDP model, i.e., the investigation reward, the attacker’s transition probability (and even the network topology), and the sojourn distribution. Thus, the defender learns the optimal engagement policy iteratively during the honeynet interaction with the attacker through the QQ-learning algorithm for SMDP [112], i.e.,

Qk+1(sk,ak):=(1αk(sk,ak))Qk(sk,ak)+αk(sk,ak)[r¯1(sk,ak,s¯k+1)+r¯2(sk,ak)(1eγτ¯k)γeγτ¯kmaxa𝒜(s¯k+1)Qk(s¯k+1,a)],\begin{split}Q^{k+1}(s^{k},a^{k}):=&(1-\alpha^{k}(s^{k},a^{k}))Q^{k}(s^{k},a^{k})+\alpha^{k}(s^{k},a^{k})[\bar{r}_{1}(s^{k},a^{k},\bar{s}^{k+1})\\ &+\bar{r}_{2}(s^{k},a^{k})\frac{(1-e^{-\gamma\bar{\tau}^{k}})}{\gamma}+e^{-\gamma\bar{\tau}^{k}}\max_{a^{\prime}\in\mathcal{A}(\bar{s}^{k+1})}Q^{k}(\bar{s}^{k+1},a^{\prime})],\end{split} (18)

where sks^{k} is the current state sample, aka^{k} is the current selected action, αk(sk,ak)(0,1)\alpha^{k}(s^{k},a^{k})\in(0,1) is the learning rate, s¯k+1\bar{s}^{k+1} is the observed state at the next stage, r¯1,r¯2\bar{r}_{1},\bar{r}_{2} are the observed investigation rewards, and τ¯k\bar{\tau}^{k} is the observed sojourn time at state sks^{k}. When the learning rate satisfies k=0αk(sk,ak)=,k=0(αk(sk,ak))2<,sk𝒮,ak𝒜(sk)\sum_{k=0}^{\infty}\alpha^{k}(s^{k},a^{k})=\infty,\sum_{k=0}^{\infty}(\alpha^{k}(s^{k},a^{k}))^{2}<\infty,\forall s^{k}\in\mathcal{S},\forall a^{k}\in\mathcal{A}(s^{k}), and all state-action pairs are explored infinitely often, maxa𝒜(sk)Qk(sk,a)\max_{a^{\prime}\in\mathcal{A}(s^{k})}\allowbreak Q^{k}(s^{k},a^{\prime}) in (18) converges to the value v(sk)v(s^{k}) with probability 11 as kk\rightarrow\infty. At each decision epoch k{0,1,}k\in\{0,1,\cdots\}, the action aka^{k} is chosen according to the ϵ\epsilon-greedy policy, i.e., the defender chooses the optimal action argmaxa𝒜(sk)Qk(sk,a)arg\max_{a^{\prime}\in\mathcal{A}(s^{k})}Q^{k}(s^{k},a^{\prime}) with probability 1ϵ1-\epsilon and a random action with probability ϵ\epsilon. Fig. 8 illustrates an exemplary learning trajectory of the state transition and sojourn time under the ϵ\epsilon-greedy exploration policy, where the chosen actions aE,aP,aL,aHa_{E},a_{P},a_{L},a_{H} are denoted in red, blue, purple, and green, respectively.
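The sketch below implements the Q-learning update in (18) with epsilon-greedy exploration on the same hypothetical three-state toy environment used above; the sojourn rates, reward rates, and exploration parameters are illustrative assumptions, and the learned Q-values can be compared against the value iteration result of the previous sketch.

```python
# Sketch: Q-learning for the SMDP, Eq. (18), with epsilon-greedy exploration on the
# same hypothetical 3-state toy environment as above (r1_bar = 0 for simplicity).
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, eps = 3, 2, 0.1, 0.1
tr = np.zeros((S, A, S))
tr[0, 0] = [0.0, 0.7, 0.3]; tr[0, 1] = [0.0, 0.2, 0.8]
tr[1, 0] = [0.6, 0.0, 0.4]; tr[1, 1] = [0.1, 0.0, 0.9]
tr[2, :, 2] = 1.0
rate = np.array([[0.5, 2.0], [0.5, 2.0], [1.0, 1.0]])   # exponential sojourn rates
r2 = np.array([[1.0, 0.2], [2.0, 0.5], [0.0, 0.0]])     # reward rates

Q = np.zeros((S, A))
visits = np.zeros((S, A))
s = 0
for k in range(100_000):
    a = rng.integers(A) if rng.random() < eps else int(Q[s].argmax())   # epsilon-greedy
    s_next = rng.choice(S, p=tr[s, a])                   # attacker's observed transition
    tau = rng.exponential(1.0 / rate[s, a])              # observed sojourn time
    visits[s, a] += 1
    alpha = 1.0 / visits[s, a]                           # learning rate alpha^k(s, a)
    target = (r2[s, a] * (1.0 - np.exp(-gamma * tau)) / gamma
              + np.exp(-gamma * tau) * Q[s_next].max())  # bracketed term in Eq. (18)
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    s = int(s_next) if s_next != 2 else 0                # restart after absorption

print("learned Q(s, a):\n", np.round(Q, 3))
print("greedy engagement policy:", Q.argmax(axis=1))
```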

Figure 8: One instance of QQ-learning of SMDP where the xx-axis shows the sojourn time and the yy-axis represents the state transition. The chosen actions aE,aP,aL,aHa_{E},a_{P},a_{L},a_{H} are denoted in red, blue, purple, and green, respectively.

We plot the entire feedback learning structure in Fig. 9. The environmental uncertainties represented by the red background result from the attacker’s behaviors in the honeypots. It is assumed that the attacker does not identify the existence of the honeypot. Thus, the observed samples at the procedure of ‘sense’ indicate the attackers’ characteristics and contain the correct threat intelligence. Then, the defender updates the QQ-value at each decision epoch k{0,1,,}k\in\{0,1,\cdots,\infty\} and then chooses the engagement action aka^{k} as shown in the procedures of ‘learn’ and ‘act’, respectively.

Figure 9: The feedback structure of reinforcement learning methods on SMDP. The attacker’s characteristics determine the environmental uncertainties and the samples observed in the honeynet.

4.3 Adaptive Alert and Attention Management Strategy against IDoS Attacks

As an analogy to the DoS attacks that generate a large number of superfluous requests to exhaust the limited computing resources in communication systems, IDoS attacks generate a large number of feints to exhaust the operators’ limited cognitive resources (e.g., attention, reasoning, and decision-making) in human-in-the-loop systems. IDoS attacks are a broad class of attacks and pose the following significant security challenges. First, the strategic generation of feints exerts an additional cognitive load on the human operators and exacerbates alert fatigue in the age of infobesity. Second, IDoS attacks directly target human operators in the Security Operating Center (SOC), which is regarded as the cyber immune system. Third, it requires operators of a high expertise level to identify feints and respond to alerts in a timely manner in complicated systems. Therefore, there is a need to understand this kind of attentional attack, quantify its consequences and risks, and develop new mitigation methods.

Figure 10: Interaction among IDoS attacks, human operators, and assistive technologies in orange, green, and blue, respectively.
Table 3: Summary of notations for Section 4.3.
Variables Meaning
k,h0k,h\in\mathbb{Z}_{\geq 0} Index for attack stages and inspection stages
tk[0,)t^{k}\in[0,\infty) Arrival time of the kk-th attack
τk=tk+1tk[0,)\tau^{k}=t^{k+1}-t^{k}\in[0,\infty) Time duration between kk-th and (k+1)(k+1)-th attack
τINh,m:=k=hmhm+m1τk\tau_{IN}^{h,m}:=\sum_{{k}^{\prime}=hm}^{hm+m-1}\tau^{{k}^{\prime}} Inspection time at inspection stage h0h\in\mathbb{Z}_{\geq 0}
wk𝒲:={wFE,wRE,wUN}w^{k}\in\mathcal{W}:=\{w_{FE},w_{RE},w_{UN}\} Security decision at attack stages k0k\in\mathbb{Z}_{\geq 0}
am𝒜a_{m}\in\mathcal{A} Attention management strategy of period m>0m\in\mathbb{Z}_{>0}
θkΘ:={θFE,θRE}\theta^{k}\in\Theta:=\{\theta_{FE},\theta_{RE}\} Attack’s type at attack stages k0k\in\mathbb{Z}_{\geq 0}
θ¯h:=[θhm,,θhm+m1]\bar{\theta}^{h}:=[\theta^{hm},\cdots,\theta^{hm+m-1}] Consolidated type at inspection stage h0h\in\mathbb{Z}_{\geq 0}
sk𝒮s^{k}\in\mathcal{S} Alert’s category label at attack stages k0k\in\mathbb{Z}_{\geq 0}
xh:=[shm,,shm+m1]x^{h}:=[s^{hm},\cdots,s^{hm+m-1}] Consolidated state at inspection stage h0h\in\mathbb{Z}_{\geq 0}
c(wk,sk;θk),c¯(xh,am;θ¯h)c(w^{k},s^{k};\theta^{k}),\bar{c}(x^{h},a_{m};\bar{\theta}^{h}) Cost and Consolidated cost
u^(xh,am),u¯(shm,am)\hat{u}(x^{h},a_{m}),\bar{u}(s^{hm},a_{m}) Expected cumulative cost and expected aggregated cumulative cost
vh+1(x^h,am),v¯h+1(s^hm,am)v^{h+1}(\hat{x}^{h},a_{m}),\bar{v}^{h+1}(\hat{s}^{hm},a_{m}) Learning estimates of ECC and EACC

The authors in [24] establish a holistic model in Fig. 10 to formalize the definition of IDoS attacks, evaluate their severity levels, and assess the induced cyber risks. In particular, they model the IDoS attack as stochastic arrivals of feints (denoted by θFE\theta_{FE} with probability bFEb_{FE}) and real attacks (denoted by θRE\theta_{RE} with probability bREb_{RE}) characterized by the following Markov renewal process. As highlighted by the orange background in Fig. 11, the kk-th attack, or equivalently the one at attack stage k0k\in\mathbb{Z}_{\geq 0}, happens at time tk,k0t^{k},k\in\mathbb{Z}_{\geq 0}. Let τk:=tk+1tk[0,)\tau^{k}:=t^{k+1}-t^{k}\in[0,\infty) be the inter-arrival time between the (k+1)(k+1)-th attack and the kk-th attack for all k0k\in\mathbb{Z}_{\geq 0}.

Figure 11: The sequential arrival of alerts at attack stage k0k\in\mathbb{Z}_{\geq 0} and the periodic manual inspections at inspection stage h0h\in\mathbb{Z}_{\geq 0} under AM strategy am𝒜a_{m}\in\mathcal{A} where m=2m=2.

The alerts triggered by these attacks in real-time cannot directly reveal the attack’s type denoted by θkΘ:={θFE,θRE}\theta^{k}\in\Theta:=\{\theta_{FE},\theta_{RE}\} at all attack stages k0k\in\mathbb{Z}_{\geq 0}. However, the alerts can provide human operators with a category label sk𝒮s^{k}\in\mathcal{S} based on observable features or traces of the associated attacks at attack stage k0k\in\mathbb{Z}_{\geq 0}. Manual inspection of these alerts leads to three security decisions: the attack is feint (denoted by wFEw_{FE}), the attack is real (denoted by wREw_{RE}), or the attack’s type is unknown (denoted by wUNw_{UN}). Let wk𝒲:={wFE,wRE,wUN}w^{k}\in\mathcal{W}:=\{w_{FE},w_{RE},w_{UN}\} denote the human operator’s security decision of the kk-th alert. Since each human operator has limited attention, frequent alert pop-ups can distract humans from the current alert inspection and result in alert fatigue. To compensate for the human’s attention limitation, one strategy is to intentionally make some alerts less noticeable, e.g., without sounds or in a light color. Then, the human can pay sustained attention to the alert currently under inspection. Those inconspicuous alerts receive the security decision wUNw_{UN} and are assigned to other available inspectors with an additional cost of human resources and inspection delays.

The authors focus on the class of periodic Attention Management (AM) strategies, denoted by 𝒜:={am}m>0\mathcal{A}:=\{a_{m}\}_{m\in\mathbb{Z}_{>0}} and assume that the human operator can only notice and inspect an alert when it is highlighted. Then, AM strategy am𝒜a_{m}\in\mathcal{A} means that the human operator inspects the alerts at attack stages k=hm,h0k=hm,h\in\mathbb{Z}_{\geq 0}. The attack stages during the hh-th inspection are referred to as the inspection stage h0h\in\mathbb{Z}_{\geq 0}. Then, under AM strategy am𝒜a_{m}\in\mathcal{A}, each inspection stage contains mm attack stages as shown in the blue background of Fig. 11 and the hh-th inspection has a duration of τINh,m:=k=hmhm+m1τk\tau_{IN}^{h,m}:=\sum_{{k}^{\prime}=hm}^{hm+m-1}\tau^{{k}^{\prime}} for all h0h\in\mathbb{Z}_{\geq 0}. The consolidated state xh:=[shm,,shm+m1]𝒳:=𝒮mx^{h}:=[s^{hm},\cdots,s^{hm+m-1}]\in\mathcal{X}:=\mathcal{S}^{m} and the consolidated type θ¯h:=[θhm,,θhm+m1]Θ¯:=Θm\bar{\theta}^{h}:=[\theta^{hm},\cdots,\theta^{hm+m-1}]\in\bar{\Theta}:=\Theta^{m} consist of the category labels and types of mm successive alerts, respectively, at inspection stage h0h\in\mathbb{Z}_{\geq 0}.
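The indexing of attack and inspection stages can be made concrete with a short sketch: given a hypothetical stream of alert labels, types, and inter-arrival times, it groups every m consecutive alerts into the consolidated state, consolidated type, and inspection time defined above (all values are purely illustrative).

```python
# Sketch: forming consolidated states/types and inspection times from an alert stream
# under the periodic AM strategy a_m with m = 2. All values are hypothetical.
m = 2
labels = ["s_AL", "s_NL", "s_PL", "s_AL", "s_NL", "s_PL"]   # category labels s^k
types = ["FE", "RE", "FE", "FE", "RE", "FE"]                # hidden types theta^k
taus = [0.4, 1.1, 0.7, 0.3, 0.9, 0.5]                       # inter-arrival times tau^k

for h in range(len(labels) // m):                # inspection stages h = 0, 1, ...
    x_h = labels[h * m:(h + 1) * m]              # consolidated state x^h
    theta_bar_h = types[h * m:(h + 1) * m]       # consolidated type theta_bar^h
    t_in = sum(taus[h * m:(h + 1) * m])          # inspection time tau_IN^{h,m}
    print(f"h={h}: inspected alert k={h * m} ({x_h[0]}), "
          f"x^h={x_h}, consolidated type={theta_bar_h}, inspection time={t_in:.1f}")
```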

Denote c(wk,sk;θk)c(w^{k},s^{k};\theta^{k}) as the operator’s cost at attack stage k0k\in\mathbb{Z}_{\geq 0} when the alert’s category label is sk𝒮s^{k}\in\mathcal{S}, the attack’s type is θkΘ\theta^{k}\in\Theta, and the security decision is wk𝒲w^{k}\in\mathcal{W}. At attack stages where alerts are inconspicuous, i.e., for all khm,h0k\neq hm,h\in\mathbb{Z}_{\geq 0}, the security decision is wUNw_{UN} without manual inspection, which incurs an uncertainty cost c(wUN,sk;θk)>0c(w_{UN},s^{k};\theta^{k})>0. At attack stages of highlighted alerts, i.e., for all k=hm,h0k=hm,h\in\mathbb{Z}_{\geq 0}, the human operator obtains a reward (resp. cost) for correct (resp. incorrect) security decisions. If the human operator remains uncertain about the attack’s type after the inspection time τINh,m\tau_{IN}^{h,m}, i.e., whm=wUNw^{hm}=w_{UN}, there is the uncertainty cost c(wUN,shm;θhm)c(w_{UN},s^{hm};\theta^{hm}). Define the human operator’s consolidated cost at inspection stage h0h\in\mathbb{Z}_{\geq 0} as

c¯(xh,am;θ¯h):=(m1)c(wUN,shm;θhm)+whm𝒲Pr(whm|xh,am;θ¯h)c(whm,shm;θhm),\bar{c}(x^{h},a_{m};\bar{\theta}^{h}):=(m-1)c(w_{UN},s^{hm};\theta^{hm})+\sum_{w^{hm}\in\mathcal{W}}\Pr(w^{hm}|x^{h},a_{m};\bar{\theta}^{h})c(w^{hm},s^{hm};\theta^{hm}), (19)

where the decision probability Pr(wk|sk,am;θk)\Pr(w^{k}|s^{k},a_{m};\theta^{k}) represents the probability of human making decision wk𝒲w^{k}\in\mathcal{W} when the attack’s type is θkΘ\theta^{k}\in\Theta, the category label is sk𝒮s^{k}\in\mathcal{S}, and the AM strategy is am𝒜a_{m}\in\mathcal{A}. With discounted factor γ(0,1)\gamma\in(0,1), define u^(xh,am):=𝔼[h=h0(γ)hc¯(xh,am;θ¯h)]\hat{u}(x^{h},a_{m}):=\mathbb{E}[\sum_{h=h_{0}}^{\infty}(\gamma)^{h}\cdot\bar{c}(x^{h},a_{m};\bar{\theta}^{h})] as the Expected Cumulative Cost (ECC) under xhx^{h} and ama_{m}. To reduce the dimension of the samples and enable evaluations with no delay, the authors further define the Expected Aggregated Cumulative Cost (EACC) as

u¯(shm,am):=shm+1,,shm+m1𝒮[Pr(shm+1,,shm+m1|shm)u^([shm,,shm+m1],am)],\begin{split}\bar{u}(s^{hm},a_{m}):=\sum_{s^{hm+1},\cdots,s^{hm+m-1}\in\mathcal{S}}\bigg{[}\Pr(s^{hm+1},\cdots,s^{hm+m-1}|s^{hm})\cdot\hat{u}([s^{hm},\cdots,s^{hm+m-1}],a_{m})\bigg{]},\end{split} (20)

and prove the computational equivalency between ECC and EACC under mild conditions. Both ECC u^(x0,am)\hat{u}(x^{0},a_{m}) and EACC u¯0(s0,am)\bar{u}^{0}(s^{0},a_{m}) evaluate the long-term performance of the AM strategy am𝒜a_{m}\in\mathcal{A} on average. However, EACC depends on shms^{hm} but not on shm+1,,shm+m1s^{hm+1},\cdots,s^{hm+m-1}. Thus, ECC and EACC are referred to as the consolidated and the aggregated IDoS risks, respectively.

4.3.1 Optimal Attention Management Strategy via Reinforcement Learning

Since the parameters of the Markov renewal process are usually unknown, Temporal-Difference (TD) learning is used to evaluate the performance of the AM strategy am𝒜a_{m}\in\mathcal{A} based on the inspection results in real-time. Define vh(xh,am)v^{h}(x^{h},a_{m}) as the estimated value of u^(xh,am)\hat{u}(x^{h},a_{m}) at the inspection stage h0h\in\mathbb{Z}_{\geq 0}, then

vh+1(x^h,am)=(1αh(x^h))vh(x^h,am)+αh(x^h)(c^h+γvh(x^h+1,am)),v^{h+1}(\hat{x}^{h},a_{m})=(1-\alpha^{h}(\hat{x}^{h}))v^{h}(\hat{x}^{h},a_{m})+\alpha^{h}(\hat{x}^{h})(\hat{c}^{h}+\gamma v^{h}(\hat{x}^{h+1},a_{m})), (21)

where x^h\hat{x}^{h} (resp. x^h+1\hat{x}^{h+1}) is the observed state value at the current inspection stage hh (resp. the next inspection stage h+1h+1), αh(x^h)(0,1)\alpha^{h}(\hat{x}^{h})\in(0,1) is the learning rate, and c^h\hat{c}^{h} is the observed cost at stage h0h\in\mathbb{Z}_{\geq 0}. Alternatively, define v¯h(xh,am)\bar{v}^{h}(x^{h},a_{m}) be the estimated value of u¯(shm,am)\bar{u}(s^{hm},a_{m}) at the inspection stage h0h\in\mathbb{Z}_{\geq 0}, then

v¯h+1(s^hm,am)=(1α¯h(s^hm))v¯h(s^hm,am)+α¯h(s^hm)(c^h+γv¯h(s^(h+1)m,am)),\bar{v}^{h+1}(\hat{s}^{hm},a_{m})=(1-\bar{\alpha}^{h}(\hat{s}^{hm}))\bar{v}^{h}(\hat{s}^{hm},a_{m})+\bar{\alpha}^{h}(\hat{s}^{hm})(\hat{c}^{h}+\gamma\bar{v}^{h}(\hat{s}^{(h+1)m},a_{m})), (22)

where s^hm\hat{s}^{hm} (resp. s^(h+1)m\hat{s}^{(h+1)m}) is the observed state value at the current inspection stage hh (resp. the next inspection stage h+1h+1), α¯h(s^hm)(0,1)\bar{\alpha}^{h}(\hat{s}^{hm})\in(0,1) is the learning rate, and c^h\hat{c}^{h} is the observed cost at stage h0h\in\mathbb{Z}_{\geq 0}.
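The following sketch runs the aggregated TD evaluation in (22) for one AM strategy on a simulated alert stream. The label transition matrix, decision probabilities, and costs are illustrative assumptions that stand in for the human decision model of [24], and the observed consolidated cost is generated in the simplified spirit of (19).

```python
# Sketch: aggregated TD evaluation of a periodic AM strategy a_m, Eq. (22), on a
# simulated alert stream; the label dynamics, decision model, and costs are illustrative.
import numpy as np

rng = np.random.default_rng(2)
labels = ["s_AL", "s_NL", "s_PL"]
P = np.array([[0.6, 0.3, 0.1],           # Markov transitions of category labels s^k
              [0.2, 0.5, 0.3],
              [0.3, 0.3, 0.4]])
m, gamma = 2, 0.9                         # AM strategy a_m and discount factor
c_UN, c_wrong, c_right = 1.0, 2.0, -1.0   # uncertainty / wrong / correct decision costs
p_correct = 0.8                           # chance of a correct decision on an inspected alert

v_bar = np.zeros(len(labels))             # \bar{v}^h(s^{hm}, a_m), one entry per label
counts = np.zeros(len(labels))
s = 0                                     # label s^{hm} of the current inspection stage
for h in range(100_000):
    stage = [s]                           # labels of the m attack stages in stage h
    for _ in range(m - 1):
        stage.append(int(rng.choice(len(labels), p=P[stage[-1]])))
    inspected_cost = c_right if rng.random() < p_correct else c_wrong
    c_hat = (m - 1) * c_UN + inspected_cost          # observed consolidated cost, cf. (19)
    s_next = int(rng.choice(len(labels), p=P[stage[-1]]))   # s^{(h+1)m}
    counts[s] += 1
    alpha = 1.0 / counts[s]
    v_bar[s] = (1 - alpha) * v_bar[s] + alpha * (c_hat + gamma * v_bar[s_next])  # Eq. (22)
    s = s_next

for name, v in zip(labels, v_bar):
    print(f"EACC estimate for {name} under a_{m}: {v:.2f}")
```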

Finally, the authors provide a numerical case study where category labels sALs_{AL}, sNLs_{NL}, and sPLs_{PL} in the set 𝒮={sAL,sNL,sPL}\mathcal{S}=\{s_{AL},s_{NL},s_{PL}\} represent the application layer, network layer, and physical layer, respectively. Fig. 12(a) and Fig. 12(b) illustrate the impact of high and low uncertainty costs c(wUN,shm;θhm)c(w_{UN},s^{hm};\theta^{hm}), respectively. If the uncertainty cost is much higher than the expected reward of correct decision-making, then detailed inspection and correct security decisions are not a priority. As a result, the ‘spray and pray’ strategy should be adopted; i.e., let the operator inspect as many alerts as possible and use the high quantity to compensate for the low quality of these inspections. Under this scenario, u¯(shm,am)\bar{u}(s^{hm},a_{m}) increases with m>0m\in\mathbb{Z}_{>0} for all shm𝒮s^{hm}\in\mathcal{S} as shown in Fig. 12(a). If the uncertainty cost is of the same order as the inspection reward on average, then increasing mm in a certain range (e.g., m{1,2,3,4}m\in\{1,2,3,4\} in Fig. 12(b)) can increase the probability of correct decision-making and reduce the aggregated IDoS risk. The loss of alert omissions outweighs the gain of detailed inspection when mm is beyond that range.

(a) High cost c(wUN,shm;θhm)=20,shm,θhmc(w_{UN},s^{hm};\theta^{hm})=20,\forall s^{hm},\theta^{hm}.
(b) Low cost c(wUN,shm;θhm)=0.2,shm,θhmc(w_{UN},s^{hm};\theta^{hm})=0.2,\forall s^{hm},\theta^{hm}.
Figure 12: Aggregated IDoS risks under sALs_{AL}, sNLs_{NL}, and sPLs_{PL} in black, red, and blue. A small mm represents a coarse inspection with a large number of alerts while a large mm represents a fine inspection of a small number of alerts.

As an analogy to Fig. 5 and 9, we visualize the RL feedback in Fig. 13. Different from the previous two, the cyber system is replaced by the human operator’s attention system for alert inspections and responses.

Figure 13: The adaptive learning loop of the optimal alert and attention management strategy to combat IDoS attacks.

5 Reinforcement Learning in Adversarial Environment

As more and more networked systems are equipped with RL techniques to improve their performance, security, and resilience, the application of RL also creates opportunities for malicious third parties to exploit. Hence, it is as important to study the security problems of RL itself as those of RL-enabled networked systems. The successful implementation of RL relies on accurate and consistent feedback from the environment, which can be easily guaranteed in a simulated environment. However, in practice, especially in the presence of adversarial interventions, accurate and consistent feedback from the environment is unlikely to be guaranteed. For example, adversaries can manipulate reward signals by performing data injection attacks and prevent an agent from receiving reward signals by launching DoS attacks on the communication channel. Without consistent and/or accurate feedback from the environment, the RL algorithm can either fail to learn a good strategy or be tricked into a ‘trap’ policy that the attacker aims for. The failure of RL algorithms under adversarial interventions can lead to a catastrophe when RL is applied in critical domains. For example, self-driving platooning vehicles can collide with each other when the measurement data is manipulated [113]; drones equipped with RL techniques can be weaponized by terrorists to create chaotic and vicious situations where they are commanded to crash into a crowd or a building [114].

Hence, it is imperative to study RL in an adversarial environment. In the last five years, there has been a surge in the number of papers that study the security issues of RL [114, 115, 46, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126]. The vulnerabilities of RL come from the information exchange between the agent and the environment. As demonstrated in Fig. 2, the agent receives reward signals and state observations from the environment and sends action commands to the environment. Suppose the agent is conducting real-time learning by communicating with the environment. In that case, the RL agent is subject to DoS attacks, jamming attacks, spoofing attacks, and data injection attacks that all networked systems might suffer. If the agent learns from batches of stored data from previous experience, the RL agent might suffer from data poisoning attacks, test-item attacks, etc. To understand RL in an adversarial environment, we first need to understand the adversarial behaviors of the attackers. We need to develop an attack model that characterizes the attacker's objective, the available attacks, the limitation of attacks, and the information available to the attacker. The second step is to understand how specific attacks on the RL algorithms impact the learning results. To do so, we need to develop metrics that measure the success of the attacks and the consequences rendered by the attacks. With an understanding of the attack model and its impact on the learning results, the last step is to design defense mechanisms to protect RL algorithms from being compromised. The defensive acts may include detecting and removing corrupted feedback, designing resilient RL that can operate in the absence of feedback signals, and deploying cryptographic techniques to ensure confidentiality. In this section, we discuss the security problems faced by RL and review relevant works based on the three possible signals that the attacker can target: the reward, the sensing, and the actuating signals.

5.1 Attacks on the Reward

Reward/cost signals are the most important feedback that the RL agent absorbs in order to learn a proper strategy [43]. The most direct way, perhaps also the easiest way, to trick the RL agent into learning a nefarious strategy is to falsify the reward/cost signals or poison the reward/cost data. Huang et al. show that the attacker can alter the learned strategy of many states by poisoning the reward in one state [114]. In the past two years, many research works have focused on reward-poisoning attacks on RL [114, 115, 46, 116, 121, 123, 126]. In [115], Wang et al. have investigated the effect of perturbed rewards on a list of RL algorithms. The rewards received by the RL agent are perturbed with a certain probability, and the rewards take values only on a finite set. Here, the rewards are unintentionally perturbed, meaning this work takes a robustness perspective rather than a security perspective.

Huang and Zhu [114] have studied the security perspective of cost manipulation in RL. In this work, the attack model is described by the information available to the attacker, the attacker's actions, and his/her objectives.

  • 1.

    Information the attacker knows: The authors consider an omniscient attacker who knows almost all information regarding the learning process.

  • 2.

    Actions available to the attacker: The attacker can manipulate the cost signals in certain states. Formally, let c~:𝒮×𝒜+\tilde{c}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}_{+} be the cost function manipulated by the attacker. Suppose the attacker can only alter the cost signals at states in 𝒮𝒮\mathcal{S}^{\dagger}\subset\mathcal{S}; then we have c~(s,a)=c(s,a)\tilde{c}(s,a)=c(s,a) for all s𝒮\𝒮,a𝒜s\in\mathcal{S}\backslash\mathcal{S}^{\dagger},a\in\mathcal{A}, meaning the attacker cannot manipulate the cost signals at states other than the ones in 𝒮\mathcal{S}^{\dagger}. Here, cc is the original true cost function. The authors also consider other types of constraints to describe the limitation of the attacker’s power, e.g., the manipulated cost is bounded, i.e., cc~B\|c-\tilde{c}\|\leq B, where \|\cdot\| is an appropriate norm.

    Table 4: Summary of notations for Section 5.1.
    Variables Meaning
    𝒮,𝒜\mathcal{S},\mathcal{A} State space, action Space
    c~\tilde{c} Manipulated cost function
    𝒮\mathcal{S}^{\dagger} Set of states where the attacker can manipulate the corresponding cost signals
    BB The bound of the manipulation
    π\pi^{\dagger} A nefarious strategy learned from manipulated signals
    π\pi^{*} An optimal policy learned from accurate signals
    β\beta Discount factor
    Q~\tilde{Q}^{*} Optimal QQ function learned from manipulated signals
    rt0r_{t}^{0} True reward signal at time tt
    D(st,at,rt0,st)D\coloneqq(s_{t},a_{t},r_{t}^{0},s_{t}^{\prime}) Training set
    𝐫(r0,,rT)\mathbf{r}\coloneqq(r_{0},\cdots,r_{T}) Manipulated rewards
    R^\hat{R} Constructed reward function using the training set
    P^\hat{P} Constructed transition probabilities using the training set
  • 3.

    The objective of the attacker: The objective of the attacker is to trick the RL agent into learning a nefarious strategy π\pi^{\dagger} instead of the original optimal strategy π\pi^{*}.

Among many RL algorithms, Huang and Zhu have studied the QQ-learning algorithm and its limiting behavior under the manipulated cost signals. The authors derive some fundamental limits regarding the attack model. We highlight a few results in this section.

Theorem 1 ([114]).

Let QQ^{*} denote the QQ-factor learned from QQ learning algorithm (1) with the true cost signals cc and Q~\tilde{Q}^{*} be the QQ-factor learned from the manipulated cost signals c~\tilde{c}. Then, we have

Q~Q11βc~c,\|\tilde{Q}^{*}-Q^{*}\|\leq\frac{1}{1-\beta}\|\tilde{c}-c\|, (23)

where β\beta is the discount factor of the underlying MDP problem.

Theorem 1 states that manipulating the cost cc with a small perturbation does not cause significant changes in the limit point of the Q-learning algorithm. This result indicates that the attacker cannot cause a significant change in the learned QQ-factor with just a tiny perturbation in the cost signals. This is a feature known as stability, observed in problems that possess contraction mapping properties. One conclusion we can draw from (23) is that if the attacker wants to successfully make the RL agent learn the nefarious strategy π\pi^{\dagger}, the attacker has to manipulate the cost signals with a magnitude larger than a certain value:

c~c(1β)infQ𝒱πQQ,\|\tilde{c}-c\|\geq(1-\beta)\inf_{Q\in\mathcal{V}_{\pi^{\dagger}}}\|Q-Q^{*}\|,

where 𝒱π={Q|𝒮|×|𝒜|:Q(s,π(s))<Q(s,a),s𝒮,aπ(s)}.\mathcal{V}_{\pi}=\{Q\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}:Q(s,\pi(s))<Q(s,a),\forall s\in\mathcal{S},\forall a\neq\pi(s)\}.
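The stability bound (23) is easy to check numerically. The sketch below value-iterates the cost-minimizing Q-factors for a small random MDP under the true cost and a perturbed cost and compares the sup-norm gap with the bound; the MDP and the perturbation are arbitrary illustrative choices, not taken from [114].

```python
# Sketch: numerical check of the stability bound (23) on a small random MDP; the MDP
# and the cost perturbation are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(3)
S, A, beta = 5, 3, 0.9
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)   # transition kernel
c = rng.random((S, A))                                          # true cost signals
c_tilde = c + 0.05 * rng.standard_normal((S, A))                # manipulated cost signals

def q_star(cost):
    """Fixed point of the cost-minimizing Bellman operator Q = cost + beta * P min_a' Q."""
    Q = np.zeros((S, A))
    for _ in range(3000):
        Q = cost + beta * P @ Q.min(axis=1)
    return Q

Q, Q_tilde = q_star(c), q_star(c_tilde)
lhs = np.abs(Q_tilde - Q).max()                  # ||Q~* - Q*|| in the sup norm
rhs = np.abs(c_tilde - c).max() / (1 - beta)     # ||c~ - c|| / (1 - beta)
print(f"||Q~* - Q*|| = {lhs:.4f} <= {rhs:.4f} = ||c~ - c|| / (1 - beta)")
```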

Huang and Zhu have also shown that to make the agent learn the nefarious strategy π\pi^{\dagger}, the manipulated cost signal has to satisfy

c~(s,a)>(𝟙sβPsa)T(1βPπ)1c~π,\tilde{c}(s,a)>(\mathbbm{1}_{s}-\beta P_{sa})^{T}(1-\beta P_{\pi^{\dagger}})^{-1}\tilde{c}_{\pi^{\dagger}}, (24)

where 𝟙s|𝒮|\mathbbm{1}_{s}\in\mathbb{R}^{|\mathcal{S}|} is a |𝒮||\mathcal{S}|-dimensional vector whose elements are all zero except that the ss-th element is 11, Psa=(p(s,1,a),p(s,2,a),,p(s,|𝒮|,a))T|𝒮|P_{sa}=(p(s,1,a),p(s,2,a),\cdots,p(s,|\mathcal{S}|,a))^{T}\in\mathbb{R}^{|\mathcal{S}|}, and Pπ|𝒮|×|𝒮|P_{\pi}\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{S}|} is a matrix whose ijij-component is [Pπ]i,j=p(i,j,π(i))[P_{\pi}]_{i,j}=p(i,j,\pi(i)). Here, p(s,s,a)p(s,s^{\prime},a) is the probability that given current state ss and current action aa, the next state is ss^{\prime}, which characterizes the transition kernel of the underlying MDP problem. If the attacker manipulates the cost signals such that (24) is satisfied, the agent will be misled into the nefarious strategy π\pi^{\dagger}. The conditions in (24) can then be utilized by the attacker to design a manipulation that minimizes the cost of manipulating the cost signals incurred to the attacker. Huang and Zhu have also shown theoretically and numerically that the attacker can achieve his/her objective by manipulating the cost signals in a small subset of states.

In [123], Ma et al. have studied a similar problem for model-based RL. The attacker poisons the reward signals stored in a dataset that is used to train an RL agent. The attack model is given as

  • 1.

    Information the attacker knows: The attacker has access to the training set D=(st,at,rt0,st)t=0:T1D=(s_{t},a_{t},r^{0}_{t},s^{\prime}_{t})_{t=0:T-1}, in which rt0r^{0}_{t} is the true reward signal the agent has at time tt. The attacker also knows the model-based RL learner’s algorithms.

  • 2.

    Actions available to the attacker: The attacker is able to arbitrarily manipulate the rewards 𝐫0=(r00,,rT10)\mathbf{r}^{0}=(r^{0}_{0},\cdots,r^{0}_{T-1}) in DD into 𝐫=(r0,,rT)\mathbf{{r}}=({r}_{0},\cdots,{r}_{T}).

  • 3.

    The objective of the attacker: The objective of the attacker is to mislead the RL agent to learn a nefarious strategy π\pi^{\dagger} while minimizing his/her attack cost 𝐫0𝐫\|\mathbf{r}^{0}-\mathbf{{r}}\|.

With the attack model being defined as above, Ma et al. have converted the attacker’s problem into an equivalent convex optimization problem:

min𝐫T,R^,Q|𝒮|×|𝒜|\displaystyle\min_{\mathbf{{r}}\in\mathbb{R}^{T},\hat{R},Q\in\mathbb{R}^{|\mathcal{S}|\times|\mathcal{A}|}} 𝐫0𝐫,\displaystyle\|\mathbf{{r}}^{0}-\mathbf{r}\|, (25)
s.t.\displaystyle s.t. R^(s,a)=1|Ts,a|tTs,art,P^(s,s,a)=1|Ts,a|tTs,a𝟏{st=s}\displaystyle\hat{R}(s,a)=\frac{1}{|T_{s,a}|}{\sum_{t\in T_{s,a}}r_{t}},\ {\hat{P}(s,s^{\prime},a)=\frac{1}{|T_{s,a}|}\sum_{t\in T_{s,a}}\mathbf{1}_{\{s_{t}^{\prime}=s^{\prime}\}}}
Q(s,a)=R^(s,a)+γsP^(s,s,a)Q(s,π(s)),s,a,\displaystyle Q(s,a)=\hat{R}(s,a)+\gamma\sum_{s^{\prime}}\hat{P}(s,s^{\prime},a)Q\left(s^{\prime},\pi^{\dagger}(s^{\prime})\right),\forall s,\forall a,
Q(s,π(s))Q(s,a)+ϵ,s𝒮,aπ(s),\displaystyle Q(s,\pi^{\dagger}(s))\geq Q(s,a)+\epsilon,\forall s\in\mathcal{S},\forall a\notin\pi^{\dagger}(s),

where Ts,a={t|st=s,at=a}T_{s,a}=\{t|s_{t}=s,a_{t}=a\} is the set of time indices of all training data points for which action aa is taken at state ss. By solving (25), the attacker can mislead the RL agent into learning the strategy π\pi^{\dagger} while minimizing his/her own cost of manipulating the reward data. Ma et al. have shown the existence of the optimal solution to (25) and provided a bound on the optimal cost that the attacker needs to pay to achieve the strategy π\pi^{\dagger}.
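Program (25) is a convex problem that can be prototyped directly with an off-the-shelf modeling tool. The sketch below expresses it in cvxpy on a randomly generated hypothetical dataset and target policy; it assumes every state-action pair appears at least once in the dataset so that the empirical quantities are well defined, and it is only a schematic reconstruction of the formulation, not the authors' implementation.

```python
# Sketch: the reward-poisoning program (25) written in cvxpy for a randomly generated
# hypothetical dataset D and target policy pi_dagger (every (s, a) pair is assumed to
# appear in D). This is a schematic reconstruction, not the authors' implementation.
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(4)
S, A, T, gamma, eps = 3, 2, 300, 0.9, 0.1
pi_dagger = np.array([1, 0, 1])                     # attacker's target policy

s_t = rng.integers(S, size=T)                       # logged data D = (s_t, a_t, r_t^0, s'_t)
a_t = rng.integers(A, size=T)
r0 = rng.random(T)
s_next = rng.integers(S, size=T)

idx = {(s, a): np.where((s_t == s) & (a_t == a))[0] for s in range(S) for a in range(A)}
P_hat = np.zeros((S, A, S))                         # empirical transition model from D
for (s, a), ts in idx.items():
    for sp in range(S):
        P_hat[s, a, sp] = np.mean(s_next[ts] == sp)

r = cp.Variable(T)                                  # poisoned rewards (decision variable)
Q = cp.Variable((S, A))
constraints = []
for s in range(S):
    for a in range(A):
        R_hat_sa = cp.sum(r[idx[(s, a)]]) / len(idx[(s, a)])      # empirical reward R_hat
        backup = sum(P_hat[s, a, sp] * Q[sp, int(pi_dagger[sp])] for sp in range(S))
        constraints.append(Q[s, a] == R_hat_sa + gamma * backup)  # Bellman consistency
        if a != pi_dagger[s]:
            constraints.append(Q[s, int(pi_dagger[s])] >= Q[s, a] + eps)  # pi_dagger optimal

prob = cp.Problem(cp.Minimize(cp.norm(r - r0)), constraints)
prob.solve()
print("attack cost ||r - r0||:", round(prob.value, 4))
print("largest per-step reward change:", round(float(np.abs(r.value - r0).max()), 4))
```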

Ma et al. have applied the attacking strategy given by the solution of (25) to an RL-based Linear-Quadratic Regulator (LQR) problem. In Fig. 14 (b), the upper figure shows the true reward data and the poisoned reward data, and the bottom figure shows the difference between the two rewards. At each time step, the true reward is only poisoned slightly. However, the slight poisoning can drive the vehicle into a designated position, as we can see from Fig. 14 (a).

Figure 14: Poisoning a vehicle running LQR in a 44D state space.

As the dataset $D$ becomes large, i.e., as $T$ increases, solving (25) becomes computationally prohibitive. Zhang et al. have proposed an adaptive reward-poisoning attack scheme that focuses on online RL attacks. The reward poisoning is done on the fly; hence, their attack strategy does not require solving a large-scale optimization problem [116]. The authors demonstrated how easily an attacker could trick the agent into a designated target in a grid-world problem.
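For intuition only, the toy sketch below illustrates the general idea of on-the-fly reward perturbation; it is not Zhang et al.’s algorithm. At every step the attacker adds a bounded perturbation that favors a target action before the tabular Q-learner performs its update. The grid-world size, the placeholder dynamics and reward, and the per-step budget are all assumptions.

```python
# Toy illustration of on-the-fly reward poisoning against a tabular Q-learner.
import numpy as np

S, A, target_action = 25, 4, 0           # 5x5 grid world; action 0 is the attacker's target
Q = np.zeros((S, A))
alpha, gamma, budget = 0.1, 0.95, 1.0    # learning rate, discount, per-step poisoning budget

def poison(a, r):
    # Reward the target action and penalize all others, within the attacker's budget.
    return r + (budget if a == target_action else -budget)

rng = np.random.default_rng(0)
s = 0
for step in range(10_000):
    a = int(rng.integers(A)) if rng.random() < 0.1 else int(Q[s].argmax())
    s_next = int(rng.integers(S))        # placeholder dynamics for the sketch
    r = -1.0                             # placeholder true reward
    r_tilde = poison(a, r)               # the agent only ever sees the poisoned reward
    Q[s, a] += alpha * (r_tilde + gamma * Q[s_next].max() - Q[s, a])
    s = s_next
# After training, Q.argmax(axis=1) tends to the target action in most states.
```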

As the topic is an emerging area, most works have focused on exposing the vulnerabilities of RL rather than designing defensive mechanisms. Very few works have studied defenses against reward-poisoning attacks. One exception is [121], in which Banihashem et al. design a defense approach that makes the RL agent robust against reward-poisoning attacks.

5.2 Attacks on the Sensing

Attacks that target the sensing components, including sensors and the communication channels in CPS control systems, have been well studied [63, 127, 128, 129, 54]. RL-based systems share the same vulnerabilities as CPS — the observations may be delayed, inaccurate, and/or missing. One difference between sensing attacks on RL and on CPS is that sensing attacks on RL can happen during the training phase, the testing phase, and/or the execution phase. Sensing attacks during the training phase can cause the RL agent to learn a ‘bad’ policy that leads to dangerous behavior in the testing phase. Attacks in the testing phase can provide a false performance evaluation to the RL agent. Attacks during the execution phase are more similar to attacks on CPS control systems, in which controllers/agents receive real-time measurements and send real-time commands. Hence, during the execution phase, any delay, inaccuracy, or absence of the observations can cause malfunctioning of the system.

Offline RL can tolerate a certain degree of delay in the feedback signals, including the reward signals and state observations, since offline RL starts after gathering a collection of data points. However, online RL can be sensitive to delays. In 2003, [130] considered the setting of delayed state observations, in which the RL agent needs to select an action without knowledge of its current state. For a fixed delay $d$, the authors constructed an equivalent non-delayed MDP whose size grows exponentially in $d$. In [131], Lancewicki et al. have studied delayed feedback in online reinforcement learning algorithms. In their work, an attacker can strategically choose the delay $d^{t}$, meaning that the cost signal and the state observations only become available at the end of episode $t+d^{t}$. The authors proposed a policy-optimization-based algorithm that achieves a near-optimal high-probability regret of order $\tilde{O}(\sqrt{T}+\sqrt{D})$, where $T$ is the number of episodes and $D=\sum_{t}d^{t}$ is the total delay. More recently, Bouteiller et al. [132] proposed a delay-correcting actor-critic algorithm that achieves better performance in environments with delays.
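As a concrete illustration of the classical reduction mentioned above, the sketch below augments the observation with the last $d$ actions, so that a constant observation delay becomes an ordinary (but exponentially larger) MDP with state space of size $|\mathcal{S}||\mathcal{A}|^{d}$. The wrapper interface is an assumption and is not taken from [130].

```python
# A minimal sketch of the constant-delay reduction for delayed state observations.
from collections import deque

class ConstantDelayWrapper:
    """Expose the state observed d steps ago together with the last d actions."""
    def __init__(self, env, d):
        self.env, self.d = env, d       # env.reset()/env.step(a) interface is assumed

    def reset(self):
        s = self.env.reset()
        self.state_buf = deque([s], maxlen=self.d + 1)   # holds s_{t-d}, ..., s_t
        self.action_buf = deque(maxlen=self.d)           # holds a_{t-d}, ..., a_{t-1}
        return (self.state_buf[0], tuple(self.action_buf))

    def step(self, a):
        s_next, r, done = self.env.step(a)
        self.state_buf.append(s_next)
        self.action_buf.append(a)
        # The agent observes the oldest buffered state plus its own recent actions,
        # which together form the augmented (non-delayed) state.
        return (self.state_buf[0], tuple(self.action_buf)), r, done
```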

RL can be an easy target of DoS/jamming attacks and spoofing attacks and can suffer a significant loss if such attacks are successfully launched. Take an autonomous vehicle as an example. If the state is the location of the RL agent measured by GPS coordinates, spoofing of the GPS signals by the attacker can lead to an incorrect navigation policy and, hence, a collision. Although RL faces a high risk of being attacked and potentially significant losses, very few works have focused on RL with missing or inaccurate observations from a security perspective. Hence, attack models and defense mechanisms for DoS and spoofing attacks on RL sensing remain open for future studies.

RL agents observe the environment via sensors. Under DoS attacks, the state observations cannot reach the RL agent, and the agent becomes unaware of the current state. For remote sensing, jamming attacks can be launched on the communication channels. For in-house sensing, an example is the sequential blinding of the cameras of an RL-based autonomous vehicle via lasers, leading to a collision. From the attacker’s perspective, interesting research questions include when and where to launch DoS/jamming attacks to create the most damage to the RL agent, and whether the attacker can trick the RL agent into a nefarious policy through jamming attacks alone. From the agent’s perspective, researchers can design resilient RL algorithms that are less sensitive to missing state observations [127], employ prediction methods when state observations are missing [133], or study the effect of DoS attacks on existing RL algorithms. The attacker can also manipulate the sensing signals via spoofing attacks. The attacker’s problem is then to design a manipulation strategy, which can be characterized by a map $K$ that maps a state $s$ into a probability distribution over all possible states, $\Delta\mathcal{S}=\{p^{o}(s^{\prime}),s^{\prime}\in\mathcal{S}\}$. For example, $\Delta\mathcal{S}=K(s_{t})$ indicates that, given the true state $s_{t}$ at time $t$, the RL agent instead observes the state $s^{\prime}_{t}\in\mathcal{S}$ with probability $p^{o}(s^{\prime}_{t})$. The map $K$ is similar to the observation kernel in a partially observable Markov decision process [134], except that here $K$ is crafted by the attacker to mislead the RL agent and achieve his/her malicious purposes.
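The following hedged sketch illustrates such a spoofing map $K$ as a row-stochastic matrix placed between the environment and the agent; the environment interface, the integer state encoding, and the matrix $K$ itself are illustrative assumptions, not a construction from the cited works.

```python
# A hedged sketch of an attacker-crafted spoofing map K applied to the sensing channel.
import numpy as np

class SpoofedSensing:
    def __init__(self, env, K, rng=None):
        self.env = env
        self.K = np.asarray(K)          # K[s, s'] = probability of reporting s' when the true state is s
        self.rng = rng or np.random.default_rng()

    def _spoof(self, s):
        # Draw the observed state from the attacker's distribution K[s, :].
        return int(self.rng.choice(len(self.K), p=self.K[s]))

    def reset(self):
        return self._spoof(self.env.reset())

    def step(self, a):
        s_next, r, done = self.env.step(a)   # true transition, hidden from the agent
        return self._spoof(s_next), r, done  # the agent learns and acts on spoofed observations
```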

One concept that is close to, yet different from, sensing attacks on RL is the environment poisoning attack, which is the focus of many works on the security of RL [124, 119, 117]. In environment poisoning attacks, the attacker perturbs the environment, i.e., the transition probabilities of the MDP. As demonstrated by Behzadan and Munir in [113], through sequential reconfiguration of obstacles on the road, an attacker can manipulate the trajectory of an RL-based autonomous vehicle in the testing phase. In 2020, Rakhsha et al. [117] studied the problem of how to teach a policy to the RL agent through transition-dynamics poisoning. Their attack model can be described as follows:

  • 1.

Information the attacker knows: The attacker is omniscient and knows almost all information, including the original MDP.

  • 2.

Actions available to the attacker: The attacker can turn the original transition dynamics, denoted by $P$, into the manipulated transition dynamics $\tilde{P}$.

  • 3.

The objective of the attacker: The objective of the attacker is to mislead the RL agent into a nefarious policy $\pi^{\dagger}$ while minimizing his/her cost of poisoning the transition dynamics, i.e.,

\min_{\tilde{P}}\|P-\tilde{P}\|_{\rho}\coloneqq\left(\sum_{s,a}\left(\sum_{s^{\prime}}|p(s,s^{\prime},a)-\tilde{p}(s,s^{\prime},a)|\right)^{\rho}\right)^{1/\rho}.

The attacker can find the optimal way of manipulating the transition dynamics by solving the following optimization problem:

\min_{\tilde{P},\,\mu^{\pi^{\dagger}},\,\mu^{\pi^{\dagger}\{s,a\}}}\quad \|P-\tilde{P}\|_{\rho} \qquad (26)
\text{s.t.}\quad \mu^{\pi^{\dagger}}\text{ and }\tilde{P}\text{ satisfy (27)},
\forall s,a\neq\pi^{\dagger}(s):\ \mu^{\pi^{\dagger}\{s,a\}}\text{ and }\tilde{P}\text{ satisfy (27)},
\forall s,a\neq\pi^{\dagger}(s):\ \sum_{s^{\prime}}\mu^{\pi^{\dagger}}(s^{\prime})\cdot r(s^{\prime},\pi^{\dagger}(s^{\prime}))\geq\sum_{s^{\prime}}\mu^{\pi^{\dagger}\{s,a\}}(s^{\prime})\cdot r(s^{\prime},\pi^{\dagger}\{s,a\}(s^{\prime}))+\epsilon,
\forall s,a,s^{\prime}:\ \tilde{P}(s,s^{\prime},a)\geq\delta\cdot P(s,s^{\prime},a).

Here, $\pi\{s,a\}$ is a neighbor policy of the policy $\pi$ defined as

\pi\{s,a\}(s^{\prime})=\begin{cases}\pi(s^{\prime})\quad & s^{\prime}\neq s,\\ a\quad & s^{\prime}=s.\end{cases}

The neighbor policy $\pi\{s,a\}$ is almost the same as the policy $\pi$, except that at state $s$ the neighbor policy selects action $a$. Here, $\mu^{\pi}$ denotes the stationary state distribution induced by policy $\pi$ under a given transition kernel $P$, which satisfies

\mu^{\pi}(s)=\sum_{s^{\prime}}p(s^{\prime},s,\pi(s^{\prime}))\cdot\mu^{\pi}(s^{\prime}). \qquad (27)

In the optimization problem (26), the first three constraints ensure that the agent learns the policy $\pi^{\dagger}$. The last constraint specifies how much the attacker can decrease the original transition probabilities, where $\delta\in(0,1]$. The authors demonstrated that, using the attacking rule obtained from (26), the attacker can achieve his/her objective by altering only a small number of transition probabilities [117]. However, this approach faces scalability issues, and the assumption that the attacker knows the MDP is stringent, making the attack a white-box attack. In 2021, Xu et al. [124] studied environment-dynamics poisoning attacks at training time, extending the attacks to the black-box setting. They propose an attack model that prompts the momentary policy of the agent to change in the desired manner with minimal dynamics manipulation. More recently, Sun et al. [119] proposed an attack model that does not require any knowledge of the MDP and follows a so-called vulnerability-aware adversarial critic poisoning strategy. The authors demonstrated that the attacking strategy successfully prevents agents from learning a good policy or misleads them toward a target policy with little attacking effort.
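To make the building blocks of (26)–(27) concrete, the sketch below computes the stationary distribution of a deterministic policy for a candidate (possibly poisoned) transition kernel and checks the $\epsilon$-margin condition against all neighbor policies. The array shapes and function names are illustrative assumptions, not the solution method of [117].

```python
# A small numerical sketch of the stationary-distribution equation (27) and the
# epsilon-margin constraint in (26).
import numpy as np

def stationary_distribution(P, pi):
    """P[s, s', a]: transition probabilities; pi: deterministic policy (int array of length S)."""
    S = P.shape[0]
    P_pi = np.array([P[s, :, pi[s]] for s in range(S)])      # S x S chain induced by pi
    # Solve mu = P_pi^T mu together with the normalization sum(mu) = 1.
    A_mat = np.vstack([P_pi.T - np.eye(S), np.ones(S)])
    b = np.concatenate([np.zeros(S), [1.0]])
    mu, *_ = np.linalg.lstsq(A_mat, b, rcond=None)
    return mu

def margin_satisfied(P, r, pi_dagger, eps):
    """Check that pi_dagger's average reward under P beats every neighbor policy by eps."""
    S, A = P.shape[0], P.shape[2]
    mu = stationary_distribution(P, pi_dagger)
    rho_target = sum(mu[s] * r[s, pi_dagger[s]] for s in range(S))
    for s in range(S):
        for a in range(A):
            if a == pi_dagger[s]:
                continue
            pi_nb = pi_dagger.copy(); pi_nb[s] = a            # neighbor policy pi_dagger{s,a}
            mu_nb = stationary_distribution(P, pi_nb)
            rho_nb = sum(mu_nb[sp] * r[sp, pi_nb[sp]] for sp in range(S))
            if rho_target < rho_nb + eps:
                return False
    return True
```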

5.3 Attacks on the Actuator

Many works have studied attacks on the actuators or on the communication channel between the actuators and the controllers of CPS control systems [135, 136, 137, 138, 139, 140]. However, there are very few studies investigating attacks targeting the RL agent’s actuators. RL agents influence their environments by performing actions $a_{t}$ via actuators. To ensure that an action is applied to the environment timely and accurately, the RL agent must guarantee, first, that the action command is transmitted to the actuator in the field in a timely and accurate manner and, second, that the actuator executes the received command timely and accurately. The attacker may exploit these vulnerabilities. If the attacker can launch jamming attacks on the communication channel or manipulate the actuator, the actual action performed will differ from the one chosen by the RL agent, and hence the observed experience is corrupted. The learning results will also be corrupted if the RL agent learns from the corrupted experience without noticing the attacks. Future works need to investigate how attacks on the actuator affect the performance of the RL agent in the training, testing, and execution phases.
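As a simple illustration of this threat, the hedged sketch below wraps an environment with a compromised actuator channel that occasionally overrides the agent’s command, so that the logged experience no longer reflects what was actually executed. The interfaces and the attack probability are assumptions for illustration only.

```python
# A hedged sketch of an actuator-channel attack on an RL agent.
import numpy as np

class CompromisedActuator:
    def __init__(self, env, n_actions, attack_prob=0.2, rng=None):
        self.env, self.n_actions = env, n_actions
        self.attack_prob = attack_prob
        self.rng = rng or np.random.default_rng()

    def step(self, a_cmd):
        a_exec = a_cmd
        if self.rng.random() < self.attack_prob:
            a_exec = int(self.rng.integers(self.n_actions))   # command overridden in transit
        s_next, r, done = self.env.step(a_exec)
        # The agent records (s, a_cmd, r, s_next); learning from this tuple silently
        # attributes the outcome of a_exec to a_cmd and corrupts the learned model.
        return s_next, r, done
```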

5.4 Discussion and Future Outlooks

While there is a vast literature on IoT and CPS security issues, little is known about the security issues of RL. However, security issues of RL exist and are becoming increasingly significant as the application domain of RL widens. This section reviews the literature that studies RL in an adversarial environment, where three vulnerabilities are targeted by the attacker: the reward signals, the sensing, and the actuating. These are the three signals that need to be exchanged timely and accurately between the agent and its environment, and they become the vulnerabilities of RL-enabled systems that the attacker can exploit. Unlike real-time systems such as CPS control systems, attacks on different phases of RL may serve different purposes. Attacks during the training phase can cause the RL agent to fail to learn a ‘good’ policy; attacks in the testing phase can provide a false performance evaluation to the RL agent; and attacks during the execution phase, as in CPS control systems, can cause the system to malfunction through any delay, inaccuracy, or absence of the observations.

Within the relatively small literature on the security of RL, most works have focused on attacks on the reward signals, and a few papers have investigated poisoning attacks on the transition dynamics. They mainly focus on deceptive attacks that falsify the values of either the reward signals or the transition probabilities. To the best of our knowledge, very few works are dedicated to DoS/jamming attacks on the rewards, the sensing, or the actuating. Hence, we believe there is an emerging research opportunity, since DoS/jamming attacks are considered the most common attacks on a networked system [54]. Another gap that needs to be narrowed is attacks on sensing and actuating for RL-enabled systems. We notice that only a few works have addressed the problem of missing or noisy sensing/actuating signals, and they do so from a non-security perspective aimed at the robustness of the RL-enabled system. Only with a good understanding of how different attacks affect RL can one further safeguard RL-enabled systems.

6 Conclusions and Discussions

Cyber-resilient mechanisms (CRM) complement the imperfect cyber protections by strategically responding to unknown threats and maintaining the critical functions and performances in the event of successful attacks. A CRM can be viewed as a feedback system that consists of three critical components: sensing, reasoning, and actuation. Sensing aims to acquire information about the system as well as the footprint of the attacker. Reasoning builds on the acquired information to infer the attack behaviors and design the optimal resilience strategies. Actuation reconfigures the system according to the optimal strategy by adapting the system parameters and attributes to unknown threats. The sensing-reasoning-actuating feedback loop establishes an adaptive and dynamic system architecture for cyber resilience.

Reinforcement learning (RL) is a suitable data-driven framework for cyber resilience which implements the feedback-system architecture. The update of the system reconfiguration strategies dispenses with the accurate system models and relies on only the observations of the system state and its associated cost at each iteration. We have classified the design of RL-based CRM (RL-CRM) based on the type of vulnerabilities the system aims to mitigate. We have demonstrated that the RL-CRM has enabled an adaptive and autonomous way to configure moving target defense, engage attackers for reconnaissance, and guide human attention to mitigate visual vulnerabilities. We have learned from the literature that posture-related defense technologies are mature, while the mitigation solutions for information-related and human-induced vulnerabilities are still underdeveloped. New advances are needed to detect and deter deceptive attacks. Besides compensating for information disadvantage, the defender can proactively create uncertainties and increase the attack cost by designing defensive deception. To defend against human-induced vulnerabilities, we have observed the need for new designs in human-machine interactions. Besides indirectly designing corrective compensation for human-induced vulnerabilities, we also need to focus on (non)monetary incentives supported by human-study evidence and human-assistive technologies to elicit desirable human behaviors.

We have discussed the RL solutions to mitigate vulnerabilities of the cyber systems and the vulnerabilities of RL itself. Vulnerabilities of RL exist and are growing increasingly significant as the application domain of RL broadens. The RL agent exchanges three kinds of time-sensitive signals with the environment: the rewards, the state observations, and the action commands. These three signals become the vulnerabilities of RL-enabled systems that the attacker can influence. We have reviewed the current state of the art involving attacks on the three kinds of signals. These works show that, under proper conditions, a small malicious perturbation of the feedback signals can mislead the RL agent to learn any policy desired by the attacker. These results serve as a warning about the seriousness of the security issues of RL-enabled systems.

The study of vulnerabilities of RL from a security point of view is an emerging topic. We have seen a surge of publications on the topic in the last two years. Most works have focused on attacks on reward signals, and a few papers have investigated poisoning attacks on the transition dynamics. However, almost no works have studied attacks on sensing and actuating for RL-enabled systems, which is a considerable research gap that needs to be filled. Furthermore, the current literature primarily focuses on deceptive attacks that falsify the values of either the reward signals or the transition probabilities. Since DoS/jamming attacks are considered the most common attacks on a networked system, more works should be dedicated to DoS/jamming attacks on the rewards, the state observations, or the action commands.

6.1 Feedback Architectures

The RL-CRMs presented in this work follow the standard feedback architecture. This architecture can be enriched and extended to several more sophisticated ones. One is the nested feedback loop, where one RL feedback loop is nested within another RL feedback loop. This architecture is useful for separating and then fusing the learning of distinct components of the cyber system. For example, one RL feedback loop acquires the attack footprint and learns the attacker’s intent and capabilities, while the other acquires information regarding the system state. The two feedback loops can be fused for making online defense decisions in response to an unknown threat.

Another architecture is a mixture of feedback and open-loop structures. Leveraging the ideas from moving-horizon control and estimation, the CRM can make a moving-horizon plan by looking $N$ stages into the future and optimizing the cyber resilience for the current stage and the $N$ stages-to-go. This approach would require an open-loop prediction of the system under the attack and feedback-driven sensing of the environment and reasoning of the optimal moving-horizon resilience strategies. This architecture enables a look-ahead strategy that can prepare for a sequence of forthcoming events and improve resilience.

A centralized CRM would not be sufficient for large-scale cyber systems to provide a timely response, since the amount of data would increase exponentially with the system size, and so would its processing time. A distributed CRM can locally monitor a subsystem and coordinate with other CRMs to achieve a global CRM performance. In doing so, we can achieve local cyber resilience for each subsystem as well as the global resilience of the entire system. When one subsystem is attacked and fails, the remaining subsystems can adapt to it, maintain their local functions, and recover from the failure. This decentralized mechanism aligns with the recent advances in multi-agent control systems [141], distributed control theory [142], and game theory [143, 144], which can provide a theoretical underpinning for the design of distributed cyber-resilient systems.

6.2 Reinforcement Learning

From the three applications of RL-CRM designs, we have identified several research challenges to be addressed in future work in this direction. First, it is important to deal with system and performance constraints in the learning process. Cyber systems have many system constraints that need to be taken into account explicitly; for example, certain addresses or functions are not allowed or are undesirable when configuring the moving target defense. The performance of the cyber systems can impact the performance of the physical systems that they serve. Hence, the requirement on the physical-system performance naturally imposes a constraint on the performance of the cyber systems, and the CRM would need to adapt and respond while satisfying these performance constraints.

A second challenge is to improve the learning speed. The goal of the CRM is to restore the cyber system after an attack, and fast learning would enable a speedier and more resilient response. To achieve this, we would need to resort to control-theoretic ideas, such as optimal control [145] and adaptive control theory [146], and leverage recent advances in reinforcement learning to speed up the convergence rate or improve the finite-time learning performance. A third challenge is to deal with the nonstationarity of cyber systems. Classical RL algorithms assume that the environment is stationary and ergodic. In many cybersecurity applications, the systems are nonstationary: the system parameters and attributes change over time. For example, the attack surface may grow when the system is connected with other nodes or used by new users. There is a need to develop nonstationary RL schemes for cyber systems that guarantee performance over a finite horizon.

6.3 Emerging Applications

Beyond the applications presented in this work, there are many emerging ones where the RL-CRM can play a pivotal role in improving the security posture. One crucial area is cyber resilience for wireless communication systems under jamming and spoofing attacks. Wireless devices can prepare for a jammer with limited power by introducing redundant links and channels [147, 148]. An RL-CRM can learn from the communication and the environment to adaptively reconfigure routes, channels, or access points when the devices are subject to jamming attacks [149, 150]. The devices can also leverage moving target defense to create an RL-CRM [151] that attracts jammers to attack a fake route in order to protect the legitimate communication.

The concept of cyber resilience is also applicable to safeguard Cyber-Physical Systems (CPS) and Internet-of-Things (IoT) networks. The impact of successful attacks on CPS and IoT systems is not limited to cyber performance. They can create catastrophic consequences such as the meltdown of nuclear energy systems [152, 153], the blackout of the electric power grid [154, 155], the disruption of gas pipelines [156], and casualties in road accidents [157]. CPS resilience deals with resilience at both the cyber and physical layers of the system. Physical-layer resilience takes into account the physical-layer performance and requires a learning-based control system to respond to and restore the degraded performance resulting from successful attacks. Cyber resilience and physical resilience are interdependent. An effective CRM would restore the cyber performance and thus reduce the impact of the threat on the physical layer of the system, thereby facilitating physical resilience. There is a need to create a cross-layer resilience design framework [158] to provide a holistic view toward CPS resilience and reduce cross-layer cascading failures [159].

Cyber insurance is an alternative paradigm to improve cyber resilience from a non-technical and economic perspective. Cyber insurance aims to transfer the unpreventable and unmitigable risks to a third party to reduce the economic impact of cyber attacks on an organization. As attacks become increasingly financially motivated, e.g., with the rise of ransomware, cyber insurance is an important mechanism that needs to be integrated into the RL-CRM to further mitigate the economic loss induced by the degradation of cyber performance. Interested readers can refer to [160, 161, 162] for discussions on cyber insurance as a tool for cyber risk management.

References

  • [1] Oxford Analytica, Solarwinds hack will alter us cyber strategy, Emerald Expert Briefings.
  • [2] D. M. Nicol, The ransomware threat to energy-delivery systems, IEEE Security & Privacy 19 (3) (2021) 24–32.
  • [3] E. Cole, Advanced persistent threat: understanding the danger and how to protect your organization, Newnes, 2012.
  • [4] A. Kott, I. Linkov, Cyber resilience of systems and networks, Springer, 2019.
  • [5] I. Linkov, A. Kott, Fundamental concepts of cyber resilience: Introduction and overview, in: Cyber resilience of systems and networks, Springer, 2019, pp. 1–25.
  • [6] E. Al-Shaer, J. Wei, W. Kevin, C. Wang, Autonomous cyber deception, Springer.
  • [7] J. Pawlick, Q. Zhu, Game Theory for Cyber Deception: From Theory to Applications, Springer Nature, 2021.
  • [8] F. L. Greitzer, A. P. Moore, D. M. Cappelli, D. H. Andrews, L. A. Carroll, T. D. Hull, Combating the insider cyber threat, IEEE Security & Privacy 6 (1) (2008) 61–64.
  • [9] Q. Zhu, T. Başar, Dynamic policy-based ids configuration, in: Proceedings of the 48h IEEE Conference on Decision and Control (CDC) held jointly with 2009 28th Chinese Control Conference, IEEE, 2009, pp. 8600–8605.
  • [10] L. Huang, J. Chen, Q. Zhu, Distributed and optimal resilient planning of large-scale interdependent critical infrastructures, in: 2018 Winter Simulation Conference (WSC), IEEE, 2018, pp. 1096–1107.
  • [11] Q. Zhu, T. Başar, Game-theoretic approach to feedback-driven multi-stage moving target defense, in: International conference on decision and game theory for security, Springer, 2013, pp. 246–263.
  • [12] C. Reiger, I. Ray, Q. Zhu, M. A. Haney, Industrial control systems security and resiliency, Practice and Theory. Springer, Cham.
  • [13] M. A. Haque, G. K. De Teyou, S. Shetty, B. Krishnappa, Cyber resilience framework for industrial control systems: concepts, metrics, and insights, in: 2018 IEEE international conference on intelligence and security informatics (ISI), IEEE, 2018, pp. 25–30.
  • [14] Q. Zhu, D. Wei, K. Ji, Hierarchical architectures of resilient control systems: Concepts, metrics and design principles, in: Cyber security for industrial control systems: From the viewpoint of close-loop, CRC Press, 2015.
  • [15] Q. Zhu, Control challenges for resilient control systems, arXiv preprint arXiv:2001.00712.
  • [16] S. Jajodia, G. Cybenko, V. Subrahmanian, V. Swarup, C. Wang, M. Wellman, Adaptive Autonomous Secure Cyber Systems, Springer, 2020.
  • [17] M. S. De Queiroz, D. M. Dawson, S. P. Nagarkatti, F. Zhang, Lyapunov-based control of mechanical systems, Springer Science & Business Media, 2000.
  • [18] T. Tsubone, T. Saijo, Stabilizing and destabilizing control for a piecewise-linear circuit, IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications 45 (2) (1998) 172–177.
  • [19] S. Skogestad, Control structure design for complete chemical plants, Computers & Chemical Engineering 28 (1-2) (2004) 219–234.
  • [20] Y. Huang, Q. Zhu, A differential game approach to decentralized virus-resistant weight adaptation policy over complex networks, IEEE Transactions on Control of Network Systems 7 (2) (2019) 944–955.
  • [21] K. Malialis, D. Kudenko, Large-scale ddos response using cooperative reinforcement learning, in: 11th European Workshop on Multi-Agent Systems (EUMAS), 2013.
  • [22] X. Xu, Y. Sun, Z. Huang, Defending ddos attacks using hidden markov models and cooperative reinforcement learning, in: Pacific-Asia Workshop on Intelligence and Security Informatics, Springer, 2007, pp. 196–207.
  • [23] L. Huang, Q. Zhu, Adaptive honeypot engagement through reinforcement learning of semi-markov decision processes, in: International Conference on Decision and Game Theory for Security, Springer, 2019, pp. 196–216.
  • [24] L. Huang, Q. Zhu, Combating informational denial-of-service (idos) attacks: Modeling and mitigation of attentional human vulnerability, in: International Conference on Decision and Game Theory for Security, Springer, 2021, pp. 314–333.
  • [25] L. Huang, Q. Zhu, Radams: Resilient and adaptive alert and attention management strategy against informational denial-of-service (idos) attacks, arXiv preprint arXiv:2111.03463.
  • [26] M. Crawford, T. M. Khoshgoftaar, J. D. Prusa, A. N. Richter, H. Al Najada, Survey of review spam detection using machine learning techniques, Journal of Big Data 2 (1) (2015) 1–24.
  • [27] A. L. Buczak, E. Guven, A survey of data mining and machine learning methods for cyber security intrusion detection, IEEE Communications surveys & tutorials 18 (2) (2015) 1153–1176.
  • [28] R. Sommer, V. Paxson, Outside the closed world: On using machine learning for network intrusion detection, in: 2010 IEEE symposium on security and privacy, IEEE, 2010, pp. 305–316.
  • [29] D. Last, Forecasting zero-day vulnerabilities, in: Proceedings of the 11th Annual Cyber and Information Security Research Conference, 2016, pp. 1–4.
  • [30] L. Xiao, X. Wan, X. Lu, Y. Zhang, D. Wu, Iot security techniques based on machine learning: How do iot devices use ai to enhance security?, IEEE Signal Processing Magazine 35 (5) (2018) 41–49.
  • [31] Z. Ni, S. Paul, A multistage game in smart grid security: A reinforcement learning solution, IEEE transactions on neural networks and learning systems 30 (9) (2019) 2684–2695.
  • [32] A. Uprety, D. B. Rawat, Reinforcement learning for iot security: A comprehensive survey, IEEE Internet of Things Journal.
  • [33] T. T. Nguyen, V. J. Reddi, Deep reinforcement learning for cyber security, arXiv preprint arXiv:1906.05799.
  • [34] H. Zhu, Y. Cao, W. Wang, T. Jiang, S. Jin, Deep reinforcement learning for mobile edge caching: Review, new features, and open issues, IEEE Network 32 (6) (2018) 50–57.
  • [35] A. Ferdowsi, U. Challita, W. Saad, N. B. Mandayam, Robust deep reinforcement learning for security and safety in autonomous vehicle systems, in: 2018 21st International Conference on Intelligent Transportation Systems (ITSC), IEEE, 2018, pp. 307–312.
  • [36] D. P. Bertsekas, J. N. Tsitsiklis, Neuro-dynamic programming, Athena Scientific, 1996.
  • [37] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, et al., Mastering the game of go with deep neural networks and tree search, nature 529 (7587) (2016) 484–489.
  • [38] A. Nagabandi, G. Kahn, R. S. Fearing, S. Levine, Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning, in: 2018 IEEE International Conference on Robotics and Automation (ICRA), IEEE, 2018, pp. 7559–7566.
  • [39] D. Ha, J. Schmidhuber, Recurrent world models facilitate policy evolution, in: Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 2455–2467.
  • [40] S. Racanière, T. Weber, D. P. Reichert, L. Buesing, A. Guez, D. Rezende, A. P. Badia, O. Vinyals, N. Heess, Y. Li, et al., Imagination-augmented agents for deep reinforcement learning, in: Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 5694–5705.
  • [41] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour, Policy gradient methods for reinforcement learning with function approximation, in: Advances in neural information processing systems, 2000, pp. 1057–1063.
  • [42] J. Schulman, S. Levine, P. Abbeel, M. Jordan, P. Moritz, Trust region policy optimization, in: International conference on machine learning, PMLR, 2015, pp. 1889–1897.
  • [43] R. S. Sutton, A. G. Barto, Reinforcement learning: An introduction, MIT press, 2018.
  • [44] V. Mnih, A. P. Badia, M. Mirza, A. Graves, T. Lillicrap, T. Harley, D. Silver, K. Kavukcuoglu, Asynchronous methods for deep reinforcement learning, in: International conference on machine learning, PMLR, 2016, pp. 1928–1937.
  • [45] D. Silver, G. Lever, N. Heess, T. Degris, D. Wierstra, M. Riedmiller, Deterministic policy gradient algorithms, in: International conference on machine learning, PMLR, 2014, pp. 387–395.
  • [46] Y. Huang, Q. Zhu, Manipulating reinforcement learning: Stealthy attacks on cost signals, Game Theory and Machine Learning for Cyber Security (2021) 367–388.
  • [47] T. Li, Q. Zhu, On convergence rate of adaptive multiscale value function approximation for reinforcement learning, in: 2019 IEEE 29th International Workshop on Machine Learning for Signal Processing (MLSP), IEEE, 2019, pp. 1–6.
  • [48] V. François-Lavet, P. Henderson, R. Islam, M. G. Bellemare, J. Pineau, et al., An introduction to deep reinforcement learning, Foundations and Trends® in Machine Learning 11 (3-4) (2018) 219–354.
  • [49] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in neural information processing systems, 2017, pp. 5998–6008.
  • [50] D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, Z. Wen, A tutorial on thompson sampling, Foundations and Trends® in Machine Learning 11 (1) (2018) 1–96.
  • [51] A. A. Cárdenas, S. Amin, S. Sastry, Research challenges for the security of control systems., HotSec 5 (2008) 15.
  • [52] R. Langner, Stuxnet: Dissecting a cyberwarfare weapon, IEEE Security & Privacy 9 (3) (2011) 49–51.
  • [53] S. Alanazi, J. Al-Muhtadi, A. Derhab, K. Saleem, A. N. AlRomi, H. S. Alholaibah, J. J. Rodrigues, On resilience of wireless mesh routing protocol against dos attacks in iot-based ambient assisted living applications, in: 2015 17th International Conference on E-health Networking, Application & Services (HealthCom), IEEE, 2015, pp. 205–210.
  • [54] A. A. Cardenas, S. Amin, S. Sastry, Secure control: Towards survivable cyber-physical systems, in: 2008 The 28th International Conference on Distributed Computing Systems Workshops, IEEE, 2008, pp. 495–500.
  • [55] S. Liu, S. Li, B. Xu, Event-triggered resilient control for cyber-physical system under denial-of-service attacks, International Journal of Control 93 (8) (2020) 1907–1919.
  • [56] P. Dai, W. Yu, H. Wang, G. Wen, Y. Lv, Distributed reinforcement learning for cyber-physical system with multiple remote state estimation under dos attacker, IEEE Transactions on Network Science and Engineering.
  • [57] K. Malialis, D. Kudenko, Distributed response to network intrusions using multiagent reinforcement learning, Engineering Applications of Artificial Intelligence 41 (2015) 270–284.
  • [58] Y. Liu, M. Dong, K. Ota, J. Li, J. Wu, Deep reinforcement learning based smart mitigation of ddos flooding in software-defined networks, in: 2018 IEEE 23rd International Workshop on Computer Aided Modeling and Design of Communication Links and Networks (CAMAD), IEEE, 2018, pp. 1–6.
  • [59] P. Pradhan, K. Nagananda, P. Venkitasubramaniam, S. Kishore, R. S. Blum, Gps spoofing attack characterization and detection in smart grids, in: 2016 IEEE Conference on Communications and Network Security (CNS), IEEE, 2016, pp. 391–395.
  • [60] L. Xiao, Y. Li, G. Liu, Q. Li, W. Zhuang, Spoofing detection with reinforcement learning in wireless networks, in: 2015 IEEE Global Communications Conference (GLOBECOM), IEEE, 2015, pp. 1–5.
  • [61] L. Xiao, Y. Li, G. Han, G. Liu, W. Zhuang, Phy-layer spoofing detection with reinforcement learning in wireless networks, IEEE Transactions on Vehicular Technology 65 (12) (2016) 10037–10047.
  • [62] M. Elnaggar, N. Bezzo, An irl approach for cyber-physical attack intention prediction and recovery, in: 2018 Annual American Control Conference (ACC), IEEE, 2018, pp. 222–227.
  • [63] Y. Mo, B. Sinopoli, False data injection attacks in control systems, in: Preprints of the 1st workshop on Secure Control Systems, 2010, pp. 1–6.
  • [64] M. Li, Y. Sun, H. Lu, S. Maharjan, Z. Tian, Deep reinforcement learning for partially observable data poisoning attack in crowdsensing systems, IEEE Internet of Things Journal 7 (7) (2019) 6266–6278.
  • [65] Y. Chen, S. Huang, F. Liu, Z. Wang, X. Sun, Evaluation of reinforcement learning-based false data injection attack to automatic voltage control, IEEE Transactions on Smart Grid 10 (2) (2018) 2158–2169.
  • [66] M. N. Kurt, O. Ogundijo, C. Li, X. Wang, Online cyber-attack detection in smart grid: A reinforcement learning approach, IEEE Transactions on Smart Grid 10 (5) (2018) 5174–5185.
  • [67] M. H. Manshaei, Q. Zhu, T. Alpcan, T. Bacşar, J.-P. Hubaux, Game theory meets network security and privacy, ACM Computing Surveys (CSUR) 45 (3) (2013) 1–39.
  • [68] J. Pawlick, E. Colbert, Q. Zhu, A game-theoretic taxonomy and survey of defensive deception for cybersecurity and privacy, ACM Computing Surveys (CSUR) 52 (4) (2019) 1–28.
  • [69] C. Gao, Y. Wang, Reinforcement learning based self-adaptive moving target defense against ddos attacks, in: Journal of Physics: Conference Series, Vol. 1812, IOP Publishing, 2021, p. 012039.
  • [70] X. Chai, Y. Wang, C. Yan, Y. Zhao, W. Chen, X. Wang, Dq-motag: Deep reinforcement learning-based moving target defense against ddos attacks, in: 2020 IEEE Fifth International Conference on Data Science in Cyberspace (DSC), IEEE, 2020, pp. 375–379.
  • [71] S. Yoon, J.-H. Cho, D. S. Kim, T. J. Moore, F. Free-Nelson, H. Lim, Desolater: Deep reinforcement learning-based resource allocation and moving target defense deployment framework, IEEE Access 9 (2021) 70700–70714.
  • [72] S. Sengupta, S. Kambhampati, Multi-agent reinforcement learning in bayesian stackelberg markov games for adaptive moving target defense, arXiv preprint arXiv:2007.10457.
  • [73] T. Eghtesad, Y. Vorobeychik, A. Laszka, Adversarial deep reinforcement learning based adaptive moving target defense, in: International Conference on Decision and Game Theory for Security, Springer, 2020, pp. 58–79.
  • [74] Y. Feng, J. Li, T. Nguyen, Application-layer ddos defense with reinforcement learning, in: 2020 IEEE/ACM 28th International Symposium on Quality of Service (IWQoS), IEEE, 2020, pp. 1–10.
  • [75] Z. Li, Y. Kong, C. Wang, C. Jiang, Ddos mitigation based on space-time flow regularities in iov: A feature adaption reinforcement learning approach, IEEE Transactions on Intelligent Transportation Systems.
  • [76] Z. Liu, X. Yin, Y. Hu, Cpss lr-ddos detection and defense in edge computing utilizing dcnn q-learning, IEEE Access 8 (2020) 42120–42130.
  • [77] Z. Zhu, F. Zeng, G. Qi, Y. Li, H. Jie, N. Mazur, Power system structure optimization based on reinforcement learning and sparse constraints under dos attacks in cloud environments, Simulation Modelling Practice and Theory 110 (2021) 102272.
  • [78] L. Xiao, X. Wan, C. Dai, X. Du, X. Chen, M. Guizani, Security in mobile edge caching with reinforcement learning, IEEE Wireless Communications 25 (3) (2018) 116–122.
  • [79] X. Wan, G. Sheng, Y. Li, L. Xiao, X. Du, Reinforcement learning based mobile offloading for cloud-based malware detection, in: GLOBECOM 2017-2017 IEEE Global Communications Conference, IEEE, 2017, pp. 1–6.
  • [80] A. Servin, D. Kudenko, Multi-agent reinforcement learning for intrusion detection, in: Adaptive Agents and Multi-Agent Systems III. Adaptation and Multi-Agent Learning, Springer, 2005, pp. 211–223.
  • [81] M. Lopez-Martin, B. Carro, A. Sanchez-Esguevillas, Application of deep reinforcement learning to intrusion detection for supervised problems, Expert Systems with Applications 141 (2020) 112963. doi:https://doi.org/10.1016/j.eswa.2019.112963.
    URL https://www.sciencedirect.com/science/article/pii/S0957417419306815
  • [82] A. Pauna, A.-C. Iacob, I. Bica, Qrassh-a self-adaptive ssh honeypot driven by q-learning, in: 2018 international conference on communications (COMM), IEEE, 2018, pp. 441–446.
  • [83] A. Pauna, I. Bica, F. Pop, A. Castiglione, On the rewards of self-adaptive iot honeypots, Annals of Telecommunications 74 (7) (2019) 501–515.
  • [84] G. Wagener, R. State, T. Engel, A. Dulaunoy, Adaptive and self-configurable honeypots, in: 12th IFIP/IEEE International Symposium on Integrated Network Management (IM 2011) and Workshops, IEEE, 2011, pp. 345–352.
  • [85] S. Venkatesan, M. Albanese, A. Shah, R. Ganesan, S. Jajodia, Detecting stealthy botnets in a resource-constrained environment using reinforcement learning, in: Proceedings of the 2017 Workshop on Moving Target Defense, 2017, pp. 75–85.
  • [86] N. Krawetz, Anti-honeypot technology, IEEE Security & Privacy 2 (1) (2004) 76–79.
  • [87] L. Huang, Q. Zhu, Farsighted risk mitigation of lateral movement using dynamic cognitive honeypots, in: International Conference on Decision and Game Theory for Security, Springer, 2020, pp. 125–146.
  • [88] S. Dowling, M. Schukat, E. Barrett, Using reinforcement learning to conceal honeypot functionality, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2018, pp. 341–355.
  • [89] S. Dowling, M. Schukat, E. Barrett, Improving adaptive honeypot functionality with efficient reinforcement learning parameters for automated malware, Journal of Cyber Security Technology 2 (2) (2018) 75–91.
  • [90] S. Suratkar, K. Shah, A. Sood, A. Loya, D. Bisure, U. Patil, F. Kazi, An adaptive honeypot using q-learning with severity analyzer, Journal of Ambient Intelligence and Humanized Computing (2021) 1–12.
  • [91] L. Huang, Q. Zhu, A dynamic games approach to proactive defense strategies against advanced persistent threats in cyber-physical systems, Computers & Security 89 (2020) 101660. doi:https://doi.org/10.1016/j.cose.2019.101660.
    URL https://www.sciencedirect.com/science/article/pii/S0167404819302020
  • [92] L. Huang, Q. Zhu, Adaptive strategic cyber defense for advanced persistent threats in critical infrastructure networks, ACM SIGMETRICS Performance Evaluation Review 46 (2) (2019) 52–56.
  • [93] L. Huang, Q. Zhu, A dynamic game framework for rational and persistent robot deception with an application to deceptive pursuit-evasion, IEEE Transactions on Automation Science and Engineering.
  • [94] L. Huang, Q. Zhu, Convergence of bayesian nash equilibrium in infinite bayesian games under discretization, arXiv preprint arXiv:2102.12059.
  • [95] L. Huang, Q. Zhu, Duplicity games for deception design with an application to insider threat mitigation, IEEE Transactions on Information Forensics and Security 16 (2021) 4843–4856. doi:10.1109/TIFS.2021.3118886.
    URL https://ieeexplore.ieee.org/document/9564055/
  • [96] S. Wang, Q. Pei, J. Wang, G. Tang, Y. Zhang, X. Liu, An intelligent deployment policy for deception resources based on reinforcement learning, IEEE Access 8 (2020) 35792–35804.
  • [97] A. Bhattacharya, T. Ramachandran, S. Banik, C. P. Dowling, S. D. Bopardikar, Automated adversary emulation for cyber-physical systems via reinforcement learning, in: 2020 IEEE International Conference on Intelligence and Security Informatics (ISI), IEEE, 2020, pp. 1–6.
  • [98] R. Cai, H. Li, S. Wang, C. Chen, A. C. Kot, Drl-fas: A novel framework based on deep reinforcement learning for face anti-spoofing, IEEE Transactions on Information Forensics and Security 16 (2020) 937–951.
  • [99] F. Salahdine, N. Kaabouch, Social engineering attacks: a survey, Future Internet 11 (4) (2019) 89.
  • [100] A. T. Aind, A. Ramnaney, D. Sethia, Q-bully: A reinforcement learning based cyberbullying detection framework, in: 2020 International Conference for Emerging Technology (INCET), IEEE, 2020, pp. 1–6.
  • [101] G. H. Yang, Y. Yu, Use of interpersonal deception theory in counter social engineering., in: CIKM Workshops, 2018.
  • [102] J. J. Gonzalez, J. M. Sarriegi, A. Gurrutxaga, A framework for conceptualizing social engineering attacks, in: International Workshop on Critical Information Infrastructures Security, Springer, 2006, pp. 79–90.
  • [103] S. Smadi, N. Aslam, L. Zhang, Detection of online phishing email using dynamic evolving neural network based on reinforcement learning, Decision Support Systems 107 (2018) 88–102.
  • [104] M. Chatterjee, A.-S. Namin, Detecting phishing websites through deep reinforcement learning, in: 2019 IEEE 43rd Annual Computer Software and Applications Conference (COMPSAC), Vol. 2, IEEE, 2019, pp. 227–232.
  • [105] K. Evans, A. Abuadbba, M. Ahmed, T. Wu, M. Johnstone, S. Nepal, Raider: Reinforcement-aided spear phishing detector, arXiv preprint arXiv:2105.07582.
  • [106] G. Lingam, R. R. Rout, D. V. Somayajulu, Deep q-learning and particle swarm optimization for bot detection in online social networks, in: 2019 10th International Conference on Computing, Communication and Networking Technologies (ICCCNT), IEEE, 2019, pp. 1–6.
  • [107] W. A. Casey, Q. Zhu, J. A. Morales, B. Mishra, Compliance control: Managed vulnerability surface in social-technological systems via signaling games, in: Proceedings of the 7th ACM CCS International Workshop on Managing Insider Security Threats, 2015, pp. 53–62.
  • [108] L. Huang, Q. Zhu, Inadvert: An interactive and adaptive counterdeception platform for attention enhancement and phishing prevention, arXiv preprint arXiv:2106.06907.
  • [109] A. Barreto, S. Hou, D. Borsa, D. Silver, D. Precup, Fast reinforcement learning with generalized policy updates, Proceedings of the National Academy of Sciences 117 (48) (2020) 30079–30087.
  • [110] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, Q. He, A comprehensive survey on transfer learning, Proceedings of the IEEE 109 (1) (2020) 43–76.
  • [111] C. Lemke, M. Budka, B. Gabrys, Metalearning: a survey of trends and technologies, Artificial intelligence review 44 (1) (2015) 117–130.
  • [112] S. J. Bradtke, M. O. Duff, Reinforcement learning methods for continuous-time markov decision problems, in: Advances in neural information processing systems, 1995, pp. 393–400.
  • [113] V. Behzadan, A. Munir, Adversarial reinforcement learning framework for benchmarking collision avoidance mechanisms in autonomous vehicles, IEEE Intelligent Transportation Systems Magazine.
  • [114] Y. Huang, Q. Zhu, Deceptive reinforcement learning under adversarial manipulations on cost signals, in: International Conference on Decision and Game Theory for Security, Springer, 2019, pp. 217–237.
  • [115] J. Wang, Y. Liu, B. Li, Reinforcement learning with perturbed rewards, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 6202–6209.
  • [116] X. Zhang, Y. Ma, A. Singla, X. Zhu, Adaptive reward-poisoning attacks against reinforcement learning, in: International Conference on Machine Learning, PMLR, 2020, pp. 11225–11234.
  • [117] A. Rakhsha, G. Radanovic, R. Devidze, X. Zhu, A. Singla, Policy teaching via environment poisoning: Training-time adversarial attacks against reinforcement learning, in: International Conference on Machine Learning, PMLR, 2020, pp. 7974–7984.
  • [118] Y.-Y. Chen, C.-T. Chen, C.-Y. Sang, Y.-C. Yang, S.-H. Huang, Adversarial attacks against reinforcement learning-based portfolio management strategy, IEEE Access 9 (2021) 50667–50685.
  • [119] Y. Sun, D. Huo, F. Huang, Vulnerability-aware poisoning mechanism for online rl with unknown dynamics, in: International Conference on Learning Representations, 2021.
  • [120] Y. Sun, R. Zheng, Y. Liang, F. Huang, Who is the strongest enemy? towards optimal and efficient evasion attacks in deep rl, arXiv preprint arXiv:2106.05087.
  • [121] K. Banihashem, A. Singla, G. Radanovic, Defense against reward poisoning attacks in reinforcement learning, arXiv preprint arXiv:2102.05776.
  • [122] Z. Liu, Y. Yang, T. Miller, P. Masters, Deceptive reinforcement learning for privacy-preserving planning, in: Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, 2021, pp. 818–826.
  • [123] Y. Ma, X. Zhang, W. Sun, X. Zhu, Policy poisoning in batch reinforcement learning and control, Advances in Neural Information Processing Systems.
  • [124] H. Xu, R. Wang, L. Raizman, Z. Rabinovich, Transferable environment poisoning: Training-time attack on reinforcement learning, in: Proceedings of the 20th International Conference on Autonomous Agents and MultiAgent Systems, 2021, pp. 1398–1406.
  • [125] M. Figura, K. C. Kosaraju, V. Gupta, Adversarial attacks in consensus-based multi-agent reinforcement learning, arXiv preprint arXiv:2103.06967.
  • [126] R. Majadas, J. García, F. Fernández, Disturbing reinforcement learning agents with corrupted rewards, arXiv preprint arXiv:2102.06587.
  • [127] S. Feng, P. Tesi, Resilient control under denial-of-service: Robust design, Automatica 79 (2017) 42–51.
  • [128] Y. Yuan, Q. Zhu, F. Sun, Q. Wang, T. Başar, Resilient control of cyber-physical systems against denial-of-service attacks, in: 2013 6th International Symposium on Resilient Control Systems (ISRCS), IEEE, 2013, pp. 54–59.
  • [129] Y. Huang, Z. Xiong, Q. Zhu, Cross-layer coordinated attacks on cyber-physical systems: A lqg game framework with controlled observations, in: 2021 European Control Conference (ECC), IEEE, 2021.
  • [130] K. V. Katsikopoulos, S. E. Engelbrecht, Markov decision processes with delays and asynchronous cost collection, IEEE transactions on automatic control 48 (4) (2003) 568–574.
  • [131] T. Lancewicki, A. Rosenberg, Y. Mansour, Learning adversarial markov decision processes with delayed feedback, arXiv preprint arXiv:2012.14843.
  • [132] Y. Bouteiller, S. Ramstedt, G. Beltrame, C. Pal, J. Binas, Reinforcement learning with random delays, in: International Conference on Learning Representations, 2021.
    URL https://openreview.net/forum?id=QFYnKlBJYR
  • [133] Y. Huang, V. Kavitha, Q. Zhu, Continuous-time markov decision processes with controlled observations, in: 2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton), IEEE, 2019, pp. 32–39.
  • [134] V. Krishnamurthy, Partially observed Markov decision processes, Cambridge university press, 2016.
  • [135] A. Teixeira, D. Pérez, H. Sandberg, K. H. Johansson, Attack models and scenarios for networked control systems, in: Proceedings of the 1st international conference on High Confidence Networked Systems, 2012, pp. 55–64.
  • [136] X. Huang, J. Dong, Reliable control policy of cyber-physical systems against a class of frequency-constrained sensor and actuator attacks, IEEE transactions on cybernetics 48 (12) (2018) 3432–3439.
  • [137] H. Fawzi, P. Tabuada, S. Diggavi, Secure estimation and control for cyber-physical systems under adversarial attacks, IEEE Transactions on Automatic control 59 (6) (2014) 1454–1467.
  • [138] L. An, G.-H. Yang, Lq secure control for cyber-physical systems against sparse sensor and actuator attacks, IEEE transactions on control of network systems 6 (2) (2018) 833–841.
  • [139] J. Wu, T. Chen, Design of networked control systems with packet dropouts, IEEE Transactions on Automatic control 52 (7) (2007) 1314–1319.
  • [140] A. Gupta, C. Langbort, T. Başar, Optimal control in the presence of an intelligent jammer with limited actions, in: 49th IEEE Conference on Decision and Control (CDC), IEEE, 2010, pp. 1096–1101.
  • [141] F. Chen, W. Ren, et al., On the control of multi-agent systems: A survey, Foundations and Trends® in Systems and Control 6 (4) (2019) 339–499.
  • [142] L. Bakule, Decentralized control: An overview, Annual reviews in control 32 (1) (2008) 87–98.
  • [143] Y. Huang, J. Chen, L. Huang, Q. Zhu, Dynamic games for secure and resilient control system design, National Science Review 7 (7) (2020) 1125–1141.
  • [144] Q. Zhu, T. Basar, Game-theoretic methods for robustness, security, and resilience of cyberphysical control systems: games-in-games principle for optimal cross-layer resilient control systems, IEEE Control Systems Magazine 35 (1) (2015) 46–65.
  • [145] D. E. Kirk, Optimal control theory: an introduction, Courier Corporation, 2004.
  • [146] K. J. Åström, Theory and applications of adaptive control—a survey, Automatica 19 (5) (1983) 471–486.
  • [147] Q. Zhu, W. Saad, Z. Han, H. V. Poor, T. Başar, Eavesdropping and jamming in next-generation wireless networks: A game-theoretic approach, in: 2011-MILCOM 2011 Military Communications Conference, IEEE, 2011, pp. 119–124.
  • [148] Z. Xu, Q. Zhu, A game-theoretic approach to secure control of communication-based train control systems under jamming attacks, in: Proceedings of the 1st International Workshop on Safe Control of Connected and Autonomous Vehicles, 2017, pp. 27–34.
  • [149] Q. Zhu, Z. Yuan, J. B. Song, Z. Han, T. Basar, Interference aware routing game for cognitive radio multi-hop networks, IEEE Journal on Selected Areas in Communications 30 (10) (2012) 2006–2015.
  • [150] Q. Zhu, J. B. Song, T. Basar, Dynamic secure routing game in distributed cognitive radio networks, in: 2011 IEEE Global Telecommunications Conference-GLOBECOM 2011, IEEE, 2011, pp. 1–6.
  • [151] A. Clark, Q. Zhu, R. Poovendran, T. Başar, Deceptive routing in relay networks, in: International Conference on Decision and Game Theory for Security, Springer, 2012, pp. 171–185.
  • [152] Y. Zhao, L. Huang, C. Smidts, Q. Zhu, A game theoretic approach for responding to cyber-attacks on nuclear power plants, Nuclear Science and Engineering.
  • [153] Y. Zhao, L. Huang, C. Smidts, Q. Zhu, Finite-horizon semi-markov game for time-sensitive attack response and probabilistic risk assessment in nuclear power plants, Reliability Engineering & System Safety 201 (2020) 106878.
  • [154] Q. Zhu, J. Zhang, P. W. Sauer, A. Dominguez-Garcia, T. Başar, A game-theoretic framework for control of distributed renewable-based energy resources in smart grids, in: 2012 American Control Conference (ACC), IEEE, 2012, pp. 3623–3628.
  • [155] S. Maharjan, Q. Zhu, Y. Zhang, S. Gjessing, T. Basar, Dependable demand response management in the smart grid: A stackelberg game approach, IEEE Transactions on Smart Grid 4 (1) (2013) 120–132.
  • [156] S. Bajpai, J. Gupta, Securing oil and gas infrastructure, Journal of Petroleum Science and Engineering 55 (1-2) (2007) 174–186.
  • [157] B. Sheehan, F. Murphy, M. Mullins, C. Ryan, Connected and autonomous vehicles: A cyber-risk classification framework, Transportation research part A: policy and practice 124 (2019) 523–536.
  • [158] Q. Zhu, Z. Xu, Cross-layer framework for cpss, in: Cross-Layer Design for Secure and Resilient Cyber-Physical Systems, Springer, 2020, pp. 9–15.
  • [159] L. Huang, J. Chen, Q. Zhu, A large-scale markov game approach to dynamic protection of interdependent infrastructure networks, in: International Conference on Decision and Game Theory for Security, Springer, 2017, pp. 357–376.
  • [160] Y. Hayel, Q. Zhu, Attack-aware cyber insurance for risk sharing in computer networks, in: International Conference on Decision and Game Theory for Security, Springer, 2015, pp. 22–34.
  • [161] R. Zhang, Q. Zhu, Flipin: A game-theoretic cyber insurance framework for incentive-compatible cyber risk management of internet of things, IEEE Transactions on Information Forensics and Security 15 (2019) 2026–2041.
  • [162] R. Zhang, Q. Zhu, Y. Hayel, A bi-level game approach to attack-aware cyber insurance of computer networks, IEEE Journal on Selected Areas in Communications 35 (3) (2017) 779–794.