\DeclareMathOperator

*\argmaxarg max \DeclareMathOperator*\argminarg min \DeclareMathOperator*\softmaxsoftmax \DeclareMathOperator*\TrTr \DeclareMathOperator*\RERe \DeclareMathOperator*\IMIm \DeclareMathOperator\trtr \allowdisplaybreaks

Learning-based attacks in Cyber-Physical Systems:
Exploration, Detection, and Control Cost trade-offs

\NameAnshuka Rangi^∗ \Email[email protected]
\addrDepartment of Electrical and Computer Engineering University of California San Diego CA 92093 \NameMohammad Javad Khojasteh \Email[email protected]
\addrLaboratory for Information and Decision Systems indicates equal contribution. Massachusetts Institute of Technology MA 02139 \NameMassimo Franceschetti \Email[email protected]
\addrDepartment of Electrical and Computer Engineering University of California San Diego CA 92093

Abstract

We study the problem of learning-based attacks in linear systems, where the communication channel between the controller and the plant can be hijacked by a malicious attacker. We assume the attacker learns the dynamics of the system from observations, then overrides the controller’s actuation signal, while mimicking legitimate operation by providing fictitious feedback about the sensor readings to the controller. On the other hand, the controller is on a lookout to detect the presence of the attacker and tries to enhance the detection performance by carefully crafting its control signals. We study the trade-offs between the information acquired by the attacker from observations, the detection capabilities of the controller, and the control cost. Specifically, we provide tight upper and lower bounds on the expected $\epsilon$ -deception time, namely the time required by the controller to make a decision regarding the presence of an attacker with confidence at least $(1-\epsilon\log(1/\epsilon))$ . We then show a probabilistic lower bound on the time that must be spent by the attacker learning the system, in order for the controller to have a given expected $\epsilon$ -deception time. We show that this bound is also order optimal, in the sense that if the attacker satisfies it, then there exists a learning algorithm with the given order expected deception time. Finally, we show a lower bound on the expected energy expenditure required to guarantee detection with confidence at least $1-\epsilon\log(1/\epsilon)$ .

keywords:

Data poisoning attack, Man in the middle attack, Cyber physical system

1 Introduction

Attacks directed to Cyber-Physical Systems (CPS) can have catastrophic consequences ranging from hampering the economy through financial scams, to possible losses of human lives through hijacking autonomous vehicles and drones, see [pasqualetti2013attack, shoukry2018smt, hoehn2016detection]. In this framework, two important problems arise: understanding of the regime where the system can be attacked, and designing ways to mitigate these attacks and render the system secure, see [ma2019policy, zhang2020online, jun2018adversarial, zhan2020preventing, chen2019optimal, vemprala2020adversarial, ferdowsi2018deep, mao2020novel, khojasteh2019authentication, khojasteh2018authentication, rangi2021secure]. Techniques developed to secure CPS include watermarking, moving target and baiting, and typically require either a loss of performance, or additional resources available at the controller, see [Kumar:DynamicWatermarki, MoSinopoli:Magzine, kanellopoulos2019moving, flamholz2019baiting].

In this paper, we focus on the former aspect of the problem, namely understanding the regime under which the system can be attacked. We focus on linear plants and on an important and widely used class of attacks based on the “man-in-the-middle" (MITM) technique. In this case, the attacker takes over the physical plant’s control and feedback signals, and acts as a malicious controller for the plant and fictitious plant for the controller. By doing so, it overrides the control signals with malicious inputs aimed at destroying the plant; and it overrides the feedback signals to the controller, trying to mimic the safe and legitimate operation of the system. In learning based MITM attack, we assume that the attacker has full access to both sensor and control signals, but the plant dynamics are unknown to the attacker. Thus, the attacker needs to learn about the plant in order to being able to generate the fictitious signals to the controller that allow the attacker to remain undetected for the time needed to cause harm. On the other hand, the controller has perfect (or nearly perfect) knowledge of the system dynamics and is actively looking out for an anomalous behaviour in the feedback signals from the plant. This assumed information pattern is justified, since the controller is typically tuned in much longer than the attacker, and has knowledge of the system dynamics to a far greater precision than the attacker. Following the detection of the attacker, the controller can shut the plant down, or switch to a “safe” mode where the system is secured using additional resources, and the attacker is prevented from causing additional "harm" to the plant, see [dibaji2019systems, weerakkody2019resilient, teixeira2015secure, hashemi2020gain].

We consider a learning-based MITM attack that evolves in two phases: exploration and exploitation. In the exploration phase, the attacker observes the plant state and control inputs, and learns the plant dynamics. In the exploitation phase, the attacker hijacks the plant, and utilizes the learned estimate to feed the fictitious feedback signals to the controller. During this phase, the attacker may also refine its estimate by continuing to learn. Within this context, our results are as follows: first, we provide a lower bound on the expected $\epsilon$ -deception time, namely the time required by the controller to make a decision regarding the presence of an attacker with confidence at least $1-\epsilon\log(1/\epsilon)$ . This bound is expressed in terms of the parameters of the attacker’s learning algorithm and the controller’s strategy. Second, we show that there exists a learning-based attack and a detection strategy such that a matching upper bound on the expected $\epsilon$ -deception time is obtained. We then show that for a wide range of learning algorithms, if the expected $\epsilon$ -deception time is at least of duration $D$ , then the duration of the exploration phase of the attacker must be at least $\Omega({D/\log(1/\epsilon)})$ , as $\epsilon\to 0$ . We establish that this bound is also order-optimal since there exists a learning algorithm such that if the duration of the exploration phase is $O({D/\log(1/\epsilon)})$ as $\epsilon\to 0$ , then the expected $\epsilon$ -deception time is at least $D$ . Finally, we show that if the controller wants to detect the attacker in at most $D$ duration with confidence at least $1-\epsilon\log(1/\epsilon)$ , then the expected energy expenditure on the control signal must be at least of order $\Omega({D}/\log(1/\epsilon))$ , as $\epsilon\to 0$ .

All proofs are available in appendix.

2 Related Work

There is a wide range of recent research on learning-based control for linear systems [Dean2019, sarkar2019near, berkenkamp2017safe, fisac2018general, khojasteh2019probabilistic, vrabie2009adaptive, jiang2012computational, cheng2020safe, fan2020deep, lederer2019uniform, buisson2020actively]. In these works, learning algorithms are proposed to design controllers in the presence of uncertainty. In contrast, in our setting we assume that the controller has full knowledge of the system dynamics, while the attacker may take advantage of these algorithms. Thus, our focus is not on the optimal control design given the available data, but rather on the trade-offs between the attacker’s learning capability, the controller’s detection strategy, and the control cost.

The MITM attack has been extensively studied in control systems for two special cases, namely, the replay attack and the statistical duplicate attack. The detection of replay attacks has been studied in [mo2014detecting, MoSinopoli:Magzine, miao2013stochastic], and ways to mitigate these attacks have been studied in [zhu2014performance]. Likewise, the ways to detect and mitigate statistical duplicate attacks has been studied in [Kumar:DynamicWatermarki, smith2015covert, porter2020detecting]. These works do not consider learning-enabled attackers, and analyze the performance of the controller for only a specific detection strategy. In contrast, we investigate learning-enabled attacks, and present trade-offs between the attacker’s learning capability through observations, the controller detection strategy, and the control cost. Learning based attacks have been recently considered in [khojasteh2019authentication, khojasteh2018authentication, ziemann2020parameter]. In [khojasteh2019authentication, khojasteh2018authentication], a variance based detection strategy has been investigated to present bounds on the probabilities of detection (or false alarm) of the attacker. In [ziemann2020parameter], an optimization-based controller is proposed that has the additional capability of injecting noise to interfere with the learning process of the attacker. Here, we consider a wider class of learning-based attacks and detection strategies, and provide tight trade-offs for these attacks.

Multiple variants of MITM attacks are studied in Reinforcement Learning (RL). In [rakhsha2020policy], the work studies the MITM attacks under the assumption that the attacker has perfect knowledge of the underlying MDP. The results are further extended to the setting where attacker has no knowledge of the underlying MDP [rakhsha2021reward]. This is analogues to studying learning based attacks in RL where the attacker eavesdrops on the actions performed by the learner and manipulates the feedback from the environment. In [zhang2020adaptive], the work studies the feasibility of MITM attack under the constraint on the amount of contamination introduced by the attacker in the feedback signal. The relationship between the problem of designing optimal MITM attack in RL and the problem of designing optimal control is discussed in [zhu2018optimal]. Finally, the learning based MITM attacks are also an active area of research in the Multi-Armed Bandits (MAB), see [jun2018adversarial, ma2018data, bogunovic2020stochastic, rangi2021secure]. In the same spirit of our work, these works study the feasibility of the attacks, and provide bounds on the amount of contamination needed by the attacker to achieve its objective. However, these works do not consider the possibility of the detection of the attacker. In this work, we focus on understanding the regime where the system can be attacked without the detection of the attacker.

3 Problem Setup

Refer to caption — Figure 1: System model during the two attack phases.

We consider the networked control system depicted in Fig. 1 and Fig. 1, where the plant dynamics are described by a discrete-time and linear time-invariant (LTI) system, namely at time $k\in\mathbb{N}$ , we have

\displaystyle X_{k+1}=AX_{k}+U_{k}+W_{k},

(1)

where $X_{k}$ , $U_{k}$ , $W_{k}$ are vectors of dimension $M\times 1$ representing the plant state, control input, and plant disturbance respectively, and $A$ is a matrix of dimension $M\times M$ , representing the open-loop gain of the plant. At time $k$ , the controller observes the feedback signal $Y_{k}$ and generates a control signal $U_{k}$ as a function of $Y_{1:k}=\{Y_{1},\ldots,Y_{k}\}$ . The initial state $X_{0}$ is known to both the controller and the attacker, and is independent of the disturbance sequence $\{W_{k}\}_{k=1}^{\infty}$ , where $W_{k}$ is i.i.d. Gaussian noise $\mathcal{N}(0,\sigma^{2}I_{M})$ with PDF known to both the parties, and $I_{M}$ is the identity matrix of dimension $M\times M$ . Our results can also be extended to the scenario where the PDF of the noise known to the attacker is different from the actual PDF of the noise (or PDF known to the controller). With a slight loss of generality, we assume that $U_{0}=W_{0}=0$ for analysis.

The controller attempts to detect the presence of the attacker based on the observations $Y_{1:k}$ . When the controller detects an attack, it shuts the system down and prevents the attacker from causing further “damage” to the plant. The controller is aware of the plant dynamics in \eqrefeq:plant, and knows the gain $A$ . This is justified because one can assume that the controller is tuned to the plant for a long duration and thus has knowledge of $A$ to a great precision. On the other hand, the attacker only knows the form of the state evolution equation \eqrefeq:plant, but does not know the gain matrix $A$ .

4 Learning based Attacks

We consider learning based attacks that evolve in two phases.

Phase 1: Exploration. Let $L$ be the duration of the exploration phase. For all $k\leq L$ , as illustrated in Fig. 1, the attacker passively eavesdrops on the control input $U_{k}$ and the plant state $Y_{k}=X_{k}$ with the objective of learning the open loop gain of the plant. We let $\hat{A}_{k}$ be the attacker’s estimate of $A$ at time step $k$ . The duration $L$ can be considered as the cost incurred by the attacker, since its actions are limited to eavesdropping during this phase.

Phase 2: Exploitation. The exploration phase is followed by the exploitation phase. For all $k\geq L+1$ , as illustrated in Fig. 1, the attacker hijacks the system and feeds a malicious control signal $\tilde{U}_{k}$ to the plant in order to destroy the plant. Additionally, the attacker may continue to learn about $A$ , and utilizes its estimate $\hat{A}_{k}$ to design a fictitious feedback signal $Y_{k}=V_{k}$ in Fig. 1 to deceive the controller, namely