
Attack Impact Evaluation by Exact Convexification
through State Space Augmentation

Hampei Sasahara, Takashi Tanaka, and Henrik Sandberg

This work was supported by Swedish Research Council grant 2016-00861. H. Sasahara is with the Department of Systems and Control Engineering, School of Engineering, Tokyo Institute of Technology, Tokyo 152-8552, Japan ([email protected]). T. Tanaka is with the Department of Aerospace Engineering and Engineering Mechanics, Cockrell School of Engineering, The University of Texas at Austin, TX 78712, USA ([email protected]). H. Sandberg is with the Division of Decision and Control Systems, School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm SE-100 44, Sweden ([email protected]).
Abstract

We address the attack impact evaluation problem for control system security. We formulate the problem as a Markov decision process with a temporally joint chance constraint that forces the adversary to avoid being detected throughout the considered time period. Owing to the joint constraint, the optimal control policy depends not only on the current state but also on the entire history, which leads to an explosion of the search space and makes the problem generally intractable. We show that the current state, augmented with a binary flag indicating whether an alarm has been triggered, is sufficient for specifying the optimal decision at each time step. Augmenting this information to the state space induces an equivalent convex optimization problem, which is tractable using standard solvers.

I INTRODUCTION

Due to increased connectivity, the security of control systems has become an urgent matter. Indeed, numerous instances of malicious software targeting industrial control systems have been reported [1], and some of them have caused serious damage to critical infrastructures [2, 3, 4, 5]. Security risk assessment is a necessary process for effective security countermeasures. Risk assessment is conducted by specifying possible scenarios, quantifying their likelihoods, and evaluating their impacts [6, 7].

Attack impact evaluation for control systems is a challenging task since the impact depends on the malicious input sequence even if the supposed intrusion route is fixed. Typically, the impact is quantified as the solution of an optimal control problem with a constraint that forces the adversary to avoid being detected throughout the considered time period [8]. Based on this framework, various formulations have been proposed [9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. A common problem is that the class of systems that can be handled by existing works is limited because of the difficulty of solving the constrained optimal control problem. In particular, the class of attack detectors is limited. Typical works consider the $\chi^{2}$ detector or provide a bound for all possible detectors using the Kullback-Leibler divergence (KLD) between the observed output and the nominal signal. However, other stateful detectors, such as cumulative sum (CUSUM) detectors and exponentially weighted moving average (EWMA) detectors, are known to be effective in practice, from the perspectives of both detection performance and computational efficiency [20, 21].

The objective of this study is to provide an attack impact evaluation framework that can handle a general class of systems and detectors for practical risk assessment. In our formulation, the control system with an attack detector is modeled as a Markov decision process (MDP) in which an alarm region is embedded in the state space. The stealth condition is given as a temporally joint chance constraint, which restricts the probability of intrusion into the alarm region throughout the whole time period. However, because the chance constraint is joint over time, the optimal policy depends not only on the current state but also on the entire history. Consequently, the dimension of the search space increases exponentially with the time horizon length, and the stochastic optimal control problem becomes generally intractable.

This paper proposes an equivalent problem formulation that reduces the size of the search space and makes the problem tractable. It is shown that there exists a sufficient statistic for optimal decision making at each time step, and that the size of the search space can be reduced by augmenting this statistic into the state space. Specifically, an extra binary state representing whether the alarm has been triggered or not, in addition to the current state, is sufficient for the optimal decision. We refer to adding this information to the state space as state space augmentation. The optimal value can be achieved by Markov policies in the augmented MDP. This finding induces an equivalent optimal control problem whose search space is relatively small. The induced problem is a standard constrained optimal control problem, which can be reduced to an equivalent linear program, and hence can be solved by standard solvers. Moreover, it is observed that the problem formulation leads to an optimal policy under which the adversary does not care about future alarms once an alarm is triggered. In this case, the resulting state trajectory tends to stay in the alarm region, which is unreasonable in practice. To model more sophisticated attack strategies, we propose an extended problem formulation that takes into account the number of alarms throughout the entire time period. It is shown that state space augmentation can also be applied to the extended problem.

Related Work

The attack impact evaluation problem has been widely considered in control system security [8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. The formulations themselves are basically similar, and technically the problem is reduced to a constrained optimal control problem. However, different approaches must be taken depending on the class of the system and the detector, the objective function that quantifies the impact caused by an attack signal, and the class of possible attack strategies. An early work [9] considers the $\chi^{2}$ detector, which is the simplest stateless detector. The work [10] provides a rather general formulation where an essential bound is provided using KLD. The CUSUM detector is considered in [11]. In [15], the author considers a detector using the $\ell^{2}$ norm of the output signal, assuming that the system is deterministic. The other works each consider one of these specific detectors. To the best of our knowledge, this study is the first framework that can handle general detectors for attack impact evaluation.

Our proposed method is based on state space augmentation, which adds information on the alarm history for decision making at each step. A similar idea has been proposed for risk-averse MDPs [22, 23, 24]. In [22], the authors consider a non-standard MDP where the objective function is given not by an expectation but by the conditional value-at-risk (CVaR), also referred to as the average value-at-risk. They show that value iteration can be applied by considering an augmented state space even for CVaR. The idea is generalized in [23], where not only CVaR minimization but also chance-constrained MDPs are considered. The work [24] proposes risk-aware reinforcement learning based on this idea. Moreover, linear temporal logic specification techniques provide a rather general framework [25]. While these works are relatively abstract, this study exhibits a concrete application in the context of security, where the augmented state has a clear interpretation, namely the number of alarms up to that time step.

Organization and Notation

The paper is organized as follows. Sec. II provides the system model, clarifies the threat model, and formulates the impact evaluation problem. In Sec. III, the difficulty of the formulated problem is explained, and a solution based on state space augmentation is proposed. Sec. IV considers an extension of the problem in which the number of alarms throughout the entire time period is taken into account. It is shown that the approach proposed in Sec. III is still valid in the extended formulation. In Sec. V, the theoretical results are verified through numerical simulation. Finally, Sec. VI concludes the paper.

Let $\mathbb{N}$ and $\mathbb{R}$ be the sets of natural numbers and real numbers, respectively. The $k$-ary Cartesian power of the set $\mathcal{X}$ is denoted by $\mathcal{X}^{k}$. The tuple $(x_{0},\ldots,x_{k})$ is denoted by $x_{0:k}$.

II ATTACK IMPACT EVALUATION PROBLEM

II-A System Model

This study considers a control system with an attack detector. Its stochastic model is described by the finite-horizon MDP [26] with an alarm region given by

\mathcal{M}:=(\mathcal{X},\mathcal{A},P,\mathcal{T},\{r_{t}\}_{t\in\mathcal{T}},\mathcal{X}_{\rm a})  (1)

where $\mathcal{X}$ is the state space, $\mathcal{A}$ is the action space, $P$ is the state transition function from $\mathcal{X}\times\mathcal{A}$ to $\mathcal{X}$, $\mathcal{T}:=\{0,\ldots,T\}$ is the time index set, $r_{t}:\mathcal{X}\times\mathcal{A}\to\mathbb{R}$ for $t=0,\ldots,T-1$ and $r_{T}:\mathcal{X}\to\mathbb{R}$ are the reward functions, and $\mathcal{X}_{\rm a}\subset\mathcal{X}$ is the alarm region. An alarm implemented in the control system is triggered when the state reaches the alarm region. For simplicity, we assume the state space and the action space to be finite.

Figure 1: Block diagram of the stochastic system with a detector into which an attacker has intruded. The dynamics of the system with the detector is described as an MDP. The attacker aims at maximizing damage caused by the attack signal while avoiding triggering the alarm.

Example: We give a specific example that can be described in the form (1). Consider the block diagram in Fig. 1, which includes a nonlinear stochastic system equipped with a dynamic attack detector, also referred to as a stateful attack detector [27]. Let the dynamics of the control system $\Sigma$ and the detector $\mathcal{D}$ be given by

\Sigma:\ z_{t+1}=f(z_{t},a_{t},w_{t}),\qquad \mathcal{D}:\left\{\begin{array}{l} z^{\rm d}_{t+1}=g(z^{\rm d}_{t},z_{t}),\\ \delta_{t}=h(z^{\rm d}_{t},z_{t}), \end{array}\right.  (2)

respectively, where $z_{t}\in\mathcal{Z}$ and $z^{\rm d}_{t}\in\mathcal{Z}^{\rm d}$ are the states, $a_{t}\in\mathcal{A}$ is an actuation signal caused by the attack, $w_{t}\in\mathcal{W}$ is noise, and $\delta_{t}\in\{0,1\}$ is a binary signal that describes whether an alarm is triggered or not at time step $t$. Suppose that the noise $w_{t}$ is an independent and identically distributed random variable over a probability space $(\Omega,\mathcal{F},\mathbb{P})$. Let $\mathcal{X}:=\mathcal{Z}\times\mathcal{Z}^{\rm d}$ and $\mathcal{A}$ be the state space and the action space in the MDP form (1), respectively. Determine the state transition function by

P((B_{z},B^{\rm d}_{z})|(z,z^{\rm d}),a):=\left\{\begin{array}{ll} P_{z}(B_{z}|z,a) & {\rm if}\ g(z^{\rm d},z)\in B^{\rm d}_{z},\\ 0 & {\rm otherwise}, \end{array}\right.

where $B_{z}$ and $B^{\rm d}_{z}$ are Borel sets in $\mathcal{Z}$ and $\mathcal{Z}^{\rm d}$, respectively, and

P_{z}(B_{z}|z,a):=\mathbb{P}(w_{t}\in B_{w}(B_{z},z,a)),\qquad B_{w}(B_{z},z,a):=\{w\in\mathcal{W}:f(z,a,w)\in B_{z}\}.

Further, let $\mathcal{X}_{\rm a}:=h^{-1}(\{1\})$ be the alarm region (note that the state transition function $P$ is guaranteed to be a stochastic kernel if the maps $f,g,h$ are Borel measurable [28, Proposition 7.26]). Then, we can obtain an MDP in the form (1) by discretizing the spaces into finite sets using standard methods (for example, see [29]). More specific detector examples include the $\chi^{2}$ attack detector and the CUSUM attack detector [21]. The $\chi^{2}$ attack detector with the observed output $y_{t}=Cz_{t}$, the nominal output $y^{\rm n}_{t}$, and the threshold $\tau>0$ is represented by

\mathcal{D}:\ \delta_{t}=\left\{\begin{array}{ll} 1 & {\rm if}\ \|Cz_{t}-y^{\rm n}_{t}\|^{2}>\tau,\\ 0 & {\rm otherwise}. \end{array}\right.

The CUSUM attack detector with bias $b>0$ and threshold $\tau>0$ is represented by

\mathcal{D}:\left\{\begin{array}{l} z^{\rm d}_{t+1}=\max(0,z^{\rm d}_{t}+\|Cz_{t}-y^{\rm n}_{t}\|-b),\\ \delta_{t}=\left\{\begin{array}{ll} 1 & {\rm if}\ z^{\rm d}_{t}>\tau,\\ 0 & {\rm otherwise} \end{array}\right. \end{array}\right.

with the same observed output and nominal output. (More precisely, the corresponding MDP should be time-varying owing to the nominal output; however, by adding the time information to the state, our discussion can readily be extended as long as finite-horizon problems are considered.)
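As a concrete illustration of the detector dynamics in (2), the following Python sketch implements the $\chi^{2}$ test and one CUSUM update step described above; the matrix $C$, the state, the nominal output, the bias, and the thresholds in the usage example are placeholder values chosen for illustration only.

import numpy as np

def chi2_alarm(z_t, y_nom_t, C, tau):
    """Stateless chi^2-type test: alarm iff ||C z_t - y_nom_t||^2 > tau."""
    residual = C @ z_t - y_nom_t
    return float(residual @ residual) > tau

def cusum_step(zd_t, z_t, y_nom_t, C, b, tau):
    """One CUSUM update: returns (zd_{t+1}, delta_t) as in the detector D above."""
    zd_next = max(0.0, zd_t + np.linalg.norm(C @ z_t - y_nom_t) - b)
    delta_t = 1 if zd_t > tau else 0   # the alarm decision uses the current detector state
    return zd_next, delta_t

# Illustrative usage with placeholder values (not taken from the paper).
C = np.array([[1.0, 0.0]])
z_t = np.array([0.7, -0.2])
y_nom = np.array([0.0])
print(chi2_alarm(z_t, y_nom, C, tau=0.25))
zd_next, delta = cusum_step(0.1, z_t, y_nom, C, b=0.3, tau=1.0)
print(zd_next, delta)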

II-B Threat Model

In this study, we consider the following threat model:

  • The adversary has succeeded in intruding into the system and can execute any action $a\in\mathcal{A}$ in a probabilistic manner at every time step.

  • The adversary has perfect model knowledge of $\mathcal{M}$.

  • The adversary possesses unlimited memory and computational resources.

  • The adversary can observe the state at every time step.

  • The attack begins at $t=0$ and ends at $t=T-1$.

The threat model implies that the adversary can implement an arbitrary history-dependent randomized policy $\pi\in\Pi^{\rm h}$, where $\pi=(\pi_{t})_{t=0}^{T-1}$ is a tuple of policies at each time step, $\Pi^{\rm h}:=\{(\pi_{t})_{t=0}^{T-1}:\pi_{t}:\mathcal{H}_{t}\to\mathcal{P}(\mathcal{A})\}$ is the set of all history-dependent randomized policies, $\mathcal{H}_{t}:=\mathcal{X}^{t}\times\mathcal{A}^{t-1}$ is the set of histories at time step $t\in\mathcal{T}$, and $\mathcal{P}(\mathcal{A})$ is the set of probability measures with respect to the Borel algebra on $\mathcal{A}$. Let the canonical measurable space of the MDP be denoted by $(\Omega,\mathcal{F})$, where $\Omega:=\mathcal{X}^{T+1}\times\mathcal{A}^{T}$ and $\mathcal{F}$ is its product $\sigma$-algebra. Define the random variables $X_{t}$ and $A_{t}$ as the projections of each outcome onto the state and action at time step $t\in\mathcal{T}$, respectively. Throughout this paper, the initial state is fixed for simplicity. We denote the probability measure induced by $P$ given the policy $\pi$ by $\mathbb{P}^{\pi}$ and the expectation operator with respect to $\mathbb{P}^{\pi}$ by $\mathbb{E}^{\pi}$.

The objective of the adversary is to maximize a cumulative attack impact while avoiding being detected. Let the impact be quantified as the expectation of the sum of the reward functions $r_{t}$ with respect to $t\in\mathcal{T}$. Set the stealth condition to

\mathbb{P}^{\pi}(\lor_{t\in\mathcal{T}}\,X_{t}\in\mathcal{X}_{\rm a})\leq\Delta  (3)

where

(\lor_{t\in\mathcal{T}}\,X_{t}\in\mathcal{X}_{\rm a}):=\{x_{0:T}\in\mathcal{X}^{T+1}:\exists t\in\mathcal{T}\ {\rm s.t.}\ x_{t}\in\mathcal{X}_{\rm a}\}

with a constant $\Delta\geq 0$. The criterion (3) requires the probability that an alarm is triggered at some point of the attack sequence to be less than or equal to $\Delta$.
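As an illustration of how the stealth condition (3) can be checked empirically for a given policy, the following sketch estimates the joint alarm probability by Monte Carlo simulation; the three-state MDP, the uniform policy, and the alarm set used here are hypothetical and serve only to show the interface.

import numpy as np

rng = np.random.default_rng(0)

def estimate_joint_alarm_prob(P, policy, x0, alarm_set, T, n_runs=10_000):
    """Monte Carlo estimate of P^pi( OR_{t in T} X_t in X_a ) for a Markov policy.
    P[a] is an |X| x |X| transition matrix and policy[t, x] a distribution over actions."""
    n_alarm = 0
    for _ in range(n_runs):
        x, alarmed = x0, x0 in alarm_set
        for t in range(T):
            a = rng.choice(len(P), p=policy[t, x])
            x = rng.choice(P.shape[2], p=P[a, x])
            alarmed = alarmed or (x in alarm_set)
        n_alarm += alarmed
    return n_alarm / n_runs

# Hypothetical 3-state, 2-action MDP where state 2 is the alarm region.
P = np.array([[[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.1, 0.9]],
              [[0.5, 0.4, 0.1], [0.1, 0.4, 0.5], [0.0, 0.1, 0.9]]])
T = 5
uniform_policy = np.full((T, 3, 2), 0.5)
print(estimate_joint_alarm_prob(P, uniform_policy, x0=0, alarm_set={2}, T=T))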

II-C Problem Formulation

Based on the setting above, the attack impact evaluation problem is formulated as a stochastic optimal control problem with a temporally joint chance constraint.

Problem 1

The attack impact evaluation problem is given by

\begin{array}{cl} \displaystyle{\max_{\pi\in\Pi^{\rm h}}} & \mathbb{E}^{\pi}\left[\sum_{t=0}^{T-1}r_{t}(X_{t},A_{t})+r_{T}(X_{T})\right]\\ {\rm s.t.} & \mathbb{P}^{\pi}(\lor_{t\in\mathcal{T}}\,X_{t}\in\mathcal{X}_{\rm a})\leq\Delta. \end{array}  (4)

Technically, the goal of this paper is to solve Problem 1. In the subsequent section, we explain its difficulty and propose a solution method.

III EXACT CONVEXIFICATION BY STATE SPACE AUGMENTATION

III-A Idea: State Space Augmentation

It is well known that for standard MDPs without constraints the optimal value can be achieved by Markov policies, namely policies that depend only on the current state. This property is justified by the following fact: for any history-dependent policy, there exists a Markov policy such that the probabilities marginalized with respect to time are equal to those of the history-dependent policy [30, Theorem 18.1]. In Problem 1, however, there is a temporally joint chance constraint, which cannot be decomposed with respect to time. Hence Markov policies cannot achieve the optimal value of (4) in general; an example is provided in Appendix A. Thus the dimension of the search space increases exponentially with the time horizon length. Indeed, the dimension of $\Pi^{\rm h}$ is $\sum_{t\in\mathcal{T}}(|\mathcal{A}||\mathcal{X}|)^{t+1}$. This explosion of the search space makes the problem intractable.

The key idea to overcome the difficulty is to add the alarm information by augmenting the state space. We define the augmented state space and the induced augmented MDP.

Definition 1 (Augmented State Space and MDP)

The augmented state space of $\mathcal{X}$ for Problem 1 is defined as

\hat{\mathcal{X}}:=\mathcal{X}\times\mathcal{Y}  (5)

with

\mathcal{Y}:=\{0,1\}.

The augmented MDP of $\mathcal{M}$ for Problem 1 is defined as

\hat{\mathcal{M}}:=(\hat{\mathcal{X}},\mathcal{A},\hat{P},\mathcal{T},\{\hat{r}_{t}\}_{t\in\mathcal{T}},\hat{\mathcal{X}}_{\rm a}),  (6)

where the induced state transition function is given by

\begin{array}{l} \hat{P}((x^{\prime},0)|(x,0),a)=\left\{\begin{array}{ll} P(x^{\prime}|x,a) & {\rm if}\ x^{\prime}\not\in\mathcal{X}_{\rm a},\\ 0 & {\rm otherwise}, \end{array}\right.\\ \hat{P}((x^{\prime},0)|(x,1),a)=0,\\ \hat{P}((x^{\prime},1)|(x,0),a)=\left\{\begin{array}{ll} 0 & {\rm if}\ x^{\prime}\not\in\mathcal{X}_{\rm a},\\ P(x^{\prime}|x,a) & {\rm otherwise}, \end{array}\right.\\ \hat{P}((x^{\prime},1)|(x,1),a)=P(x^{\prime}|x,a), \end{array}

the induced reward function is given by $\hat{r}_{t}(x,y,a)=r_{t}(x,a)$ for any $y\in\mathcal{Y}$, and the induced alarm region is given by $\hat{\mathcal{X}}_{\rm a}:=\mathcal{X}_{\rm a}\times\mathcal{Y}$.

The augmented state $y_{t}\in\mathcal{Y}$ represents the alarm information. The binary flag $y_{t}=1$ indicates that the alarm has been triggered before time step $t\in\mathcal{T}$, whereas $y_{t}=0$ indicates otherwise. For the augmented MDP $\hat{\mathcal{M}}$, we denote the set of history-dependent randomized policies by $\hat{\Pi}^{\rm h}$, the canonical measurable space by $(\hat{\Omega},\hat{\mathcal{F}})$, the random variables of the state by $\hat{X}_{t}:=(X_{t},Y_{t})$, and the probability measure and the expectation induced by the initial condition $(x_{0},0)$ and the policy $\hat{\pi}\in\hat{\Pi}^{\rm h}$ by $\mathbb{P}^{\hat{\pi}}$ and $\mathbb{E}^{\hat{\pi}}$, respectively.
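The construction of $\hat{P}$ in Definition 1 is mechanical, as the following sketch indicates; it assumes, purely for illustration, that the original kernel is stored as an array P[a, x, x'] and the alarm region as a boolean mask.

import numpy as np

def augment_with_alarm_flag(P, alarm_mask):
    """Build the augmented kernel of Definition 1.
    Augmented states (x, y) are flattened as 2*x + y, with y = 1 meaning
    'an alarm has been triggered at some earlier step'."""
    A, X, _ = P.shape
    P_hat = np.zeros((A, 2 * X, 2 * X))
    s = lambda x, y: 2 * x + y
    for a in range(A):
        for x in range(X):
            for xp in range(X):
                yp = 1 if alarm_mask[xp] else 0
                # from y = 0: the flag is raised exactly when x' is in X_a
                P_hat[a, s(x, 0), s(xp, yp)] += P[a, x, xp]
                # from y = 1: the flag stays raised regardless of x'
                P_hat[a, s(x, 1), s(xp, 1)] += P[a, x, xp]
    return P_hat

Each row of the resulting array sums to one, so the output is again a transition kernel.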

By using the augmented information, we can rewrite the temporally joint chance constraint in (4) as an isolated chance constraint on the state at the final step. Intuitively, $(\lor_{t\in\mathcal{T}}X_{t}\in\mathcal{X}_{\rm a})$, the event that an alarm is triggered at some time step, is equivalent to $Y_{T}^{-1}(\{1\})$, the event that the augmented state takes the value $1$ at the final time step. This reformulation induces the equivalent problem

\begin{array}{cl} \displaystyle{\max_{\hat{\pi}\in\hat{\Pi}^{\rm h}}} & \mathbb{E}^{\hat{\pi}}\left[\sum_{t=0}^{T-1}r_{t}(X_{t},A_{t})+r_{T}(X_{T})\right]\\ {\rm s.t.} & \mathbb{P}^{\hat{\pi}}(Y_{T}=1)\leq\Delta. \end{array}  (7)

The most important feature of this formulation is that the chance constraint depends only on the marginal distribution at the final time step. Hence, the optimal value of (7) can be achieved by Markov policies on the augmented state space $\hat{\mathcal{X}}$. Thus, the problem (7) can be reduced to

\begin{array}{cl} \displaystyle{\max_{\hat{\pi}\in\hat{\Pi}^{\rm m}}} & \mathbb{E}^{\hat{\pi}}\left[\sum_{t=0}^{T-1}r_{t}(X_{t},A_{t})+r_{T}(X_{T})\right]\\ {\rm s.t.} & \mathbb{P}^{\hat{\pi}}(Y_{T}=1)\leq\Delta, \end{array}  (8)

where the search space of (7) is replaced with $\hat{\Pi}^{\rm m}$, the set of Markov policies for the augmented MDP $\hat{\mathcal{M}}$.

We formally justify the reformulation. We first show the following lemma.

Lemma 1

For any $\hat{\pi}\in\hat{\Pi}^{\rm h}$, there exists $\pi\in\Pi^{\rm h}$ such that

P^{\pi}_{X_{0:t},A_{0:t}}=P^{\hat{\pi}}_{X_{0:t},A_{0:t}},\quad\forall t\in\mathcal{T}  (9)

where $P^{\pi}_{X_{0:t},A_{0:t}}$ is the joint distribution of $(X_{0:t},A_{0:t})$ for $\mathcal{M}$ with respect to $\mathbb{P}^{\pi}$ and $P^{\hat{\pi}}_{X_{0:t},A_{0:t}}$ is that for $\hat{\mathcal{M}}$ with respect to $\mathbb{P}^{\hat{\pi}}$.

Proof.

Let $\hat{h}_{t}=(x_{0:t},y_{0:t},a_{0:t-1})$ be a history associated with the augmented MDP $\hat{\mathcal{M}}$. Denote the first time instant at which $y_{t}=1$ by $\underline{t}$, i.e., $\underline{t}:=\min\{t\in\mathcal{T}:y_{t}=1\}$, with $\underline{t}:=\infty$ when $y_{t}=0$ for all $t\in\mathcal{T}$. We say that $\hat{h}_{t}$ is consistent when it satisfies

\left\{\begin{array}{ll} y_{t}=1, & \forall t\geq\underline{t},\\ x_{t}\not\in\mathcal{X}_{\rm a}, & \forall t<\underline{t},\\ x_{\underline{t}}\in\mathcal{X}_{\rm a}. & \end{array}\right.

Note that, under the induced state transition function $\hat{P}$, the probability of any history that is not consistent is zero for any policy. A history $h_{t}$ in the original MDP $\mathcal{M}$ uniquely induces a consistent history in $\hat{\mathcal{M}}$, which we denote by $\hat{h}_{t}(h_{t})$. For a fixed policy $\hat{\pi}_{t}$ for $\hat{\mathcal{M}}$, define a policy $\pi_{t}$ for $\mathcal{M}$ by

\pi_{t}(a|h_{t}):=\hat{\pi}_{t}(a|\hat{h}_{t}(h_{t})),\quad t\in\mathcal{T}.  (10)

We confirm that the policy (10) satisfies the condition (9) by induction. Because the initial state is fixed to $(x_{0},0)$, the condition is satisfied for $t=0$. Assume that (9) holds for some $t\in\mathcal{T}$. Denoting the probability mass function with a policy $\pi$ by $P^{\pi}$, we have

\begin{array}{l} P^{\hat{\pi}}(x_{0:t+1},a_{0:t+1})\\ \quad=P^{\hat{\pi}}(x_{t+1},a_{t+1}|x_{0:t},a_{0:t})P^{\hat{\pi}}(x_{0:t},a_{0:t})\\ \quad=P^{\hat{\pi}}(a_{t+1}|x_{0:t+1},a_{0:t})P(x_{t+1}|x_{t},a_{t})P^{\pi}(x_{0:t},a_{0:t}) \end{array}

for any $x_{0:t+1}$ and $a_{0:t+1}$. Thus, it suffices to show that

P^{\hat{\pi}}(a_{t+1}|x_{0:t+1},a_{0:t})=P^{\pi}(a_{t+1}|x_{0:t+1},a_{0:t}).  (11)

For the policy $\hat{\pi}$, we have

\begin{array}{l} P^{\hat{\pi}}(a_{t+1}|x_{0:t+1},a_{0:t})\\ \quad=\sum_{y_{0:t+1}\in\mathcal{Y}^{t+2}}\hat{\pi}(a_{t+1}|\hat{h}_{t+1})P^{\hat{\pi}}(y_{0:t+1}|x_{0:t+1},a_{0:t}). \end{array}

Since $(x_{0:t+1},a_{0:t})$ uniquely induces the $y_{0:t+1}$ for which the corresponding history is consistent, and this $y_{0:t+1}$ satisfies $P^{\hat{\pi}}(y_{0:t+1}|x_{0:t+1},a_{0:t})=1$, we have

\begin{array}{cl} P^{\hat{\pi}}(a_{t+1}|x_{0:t+1},a_{0:t}) & =\hat{\pi}_{t+1}(a_{t+1}|\hat{h}_{t+1}(h_{t+1}))\\ & =\pi_{t+1}(a_{t+1}|x_{0:t+1},a_{0:t}) \end{array}

and (11) holds. ∎

Lemma 1 implies that the stochastic behaviors of the original MDP and the augmented MDP are identical with appropriate policies.

The following theorem is the main result of this paper.

Theorem 1

Denote the optimal values of the chance-constrained optimal control problems (4), (7), and (8) by $J^{\ast}$, $\hat{J}^{\ast}$, and $\hat{J}^{{\rm m}\ast}$, respectively. Then $J^{\ast}=\hat{J}^{\ast}=\hat{J}^{{\rm m}\ast}$.

Proof.

We first show $J^{\ast}=\hat{J}^{\ast}$. Since the policy set of the augmented MDP includes that of the original MDP, $J^{\ast}\leq\hat{J}^{\ast}$ clearly holds. Fix a feasible policy $\hat{\pi}\in\hat{\Pi}^{\rm h}$ for (7) and take the corresponding policy $\pi\in\Pi^{\rm h}$ for the original MDP according to (10). From Lemma 1, the marginal distributions of the state and the action under the two policies coincide. Denoting $(\lor_{t\in\mathcal{T}}X_{t}\in\mathcal{X}_{\rm a})$ by $\mathcal{E}^{1}_{\rm a}$, the construction of $\hat{P}$ gives $\mathbb{P}^{\hat{\pi}}(Y_{T}=1|\mathcal{E}^{1}_{\rm a})=\mathbb{P}^{\hat{\pi}}(\mathcal{E}^{1}_{\rm a}|Y_{T}=1)=1$. Thus $\mathbb{P}^{\pi}(\mathcal{E}^{1}_{\rm a})=\mathbb{P}^{\hat{\pi}}(\mathcal{E}^{1}_{\rm a})=\mathbb{P}^{\hat{\pi}}(Y_{T}=1)\leq\Delta$, and $\pi$ is feasible in (4) with the same objective value. Therefore $J^{\ast}\geq\hat{J}^{\ast}$, which leads to $J^{\ast}=\hat{J}^{\ast}$. Finally, $\hat{J}^{\ast}=\hat{J}^{{\rm m}\ast}$ is a direct consequence of [30, Theorem 18.1]. ∎

Theorem 1 justifies the reformulation from (4) to (8). The dimension of $\hat{\Pi}^{\rm m}$ is

|\mathcal{A}||\hat{\mathcal{X}}||\mathcal{T}|=|\mathcal{A}||\mathcal{X}||\mathcal{Y}||\mathcal{T}|=2|\mathcal{A}||\mathcal{X}||\mathcal{T}|,

which is much smaller than $\sum_{t\in\mathcal{T}}(|\mathcal{A}||\mathcal{X}|)^{t+1}$, the dimension of the history-dependent search space. Finally, note that only the binary information $y_{t}$ needs to be stored in order to implement the optimal policy.
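As a rough illustration, with the parameters of the numerical example in Sec. V ($|\mathcal{A}|=3$, $|\mathcal{X}|=16$, $|\mathcal{T}|=16$) the two dimensions compare as

2|\mathcal{A}||\mathcal{X}||\mathcal{T}|=2\cdot 3\cdot 16\cdot 16=1536, \qquad \sum_{t\in\mathcal{T}}(|\mathcal{A}||\mathcal{X}|)^{t+1}=\sum_{t=0}^{15}48^{\,t+1}\approx 8\times 10^{26}.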

III-B Solution to Problem 1

It is known that the standard class of constrained MDPs, to which the problem (8) belongs, can be solved using linear programming [31]. The procedure is briefly explained in the following. Denote $\mathbb{P}^{\hat{\pi}}(\hat{X}_{t}=\hat{x},A_{t}=a)$ by $\rho_{t}(\hat{x},a)$, where $\hat{\pi}$ is omitted from the notation. The objective function can be rewritten as

\begin{array}{l} \mathbb{E}^{\hat{\pi}}\left[\sum_{t=0}^{T-1}r_{t}(X_{t},A_{t})+r_{T}(X_{T})\right]\\ \quad=\sum_{t=0}^{T-1}\sum_{\hat{x}\in\hat{\mathcal{X}},a\in\mathcal{A}}\hat{r}_{t}(\hat{x},a)\rho_{t}(\hat{x},a)+\sum_{\hat{x}\in\hat{\mathcal{X}}}\hat{r}_{T}(\hat{x})\rho_{T}(\hat{x}). \end{array}

The left-hand side of the constraint can be rewritten as

\mathbb{P}^{\hat{\pi}}(Y_{T}=1)=\mathbb{E}^{\hat{\pi}}\left[\delta_{y=1}(\hat{X}_{T})\right]=\sum_{\hat{x}\in\hat{\mathcal{X}}}\delta_{y=1}(\hat{x})\rho_{T}(\hat{x})

where $\delta_{y=1}(\hat{x})$ for $\hat{x}=(x,y)$ takes $1$ if $y=1$ and $0$ otherwise. Moreover, the functions $\rho_{t}(\hat{x},a)$ satisfy

\left\{\begin{array}{l} \displaystyle{\sum_{a\in\mathcal{A}}\rho_{t}(\hat{x},a)=\sum_{\bar{x}\in\hat{\mathcal{X}},a\in\mathcal{A}}\hat{P}(\hat{x}|\bar{x},a)\rho_{t-1}(\bar{x},a),\quad 1\leq t\leq T-1,}\\ \rho_{0}(\hat{x},a)=\delta_{x=x_{0}}(\hat{x}),\\ \rho_{T}(\hat{x})=\sum_{\bar{x}\in\hat{\mathcal{X}},a\in\mathcal{A}}\hat{P}(\hat{x}|\bar{x},a)\rho_{T-1}(\bar{x},a),\\ 0\leq\rho_{t}(\hat{x},a)\leq 1,\quad\forall\hat{x}\in\hat{\mathcal{X}},\ \forall a\in\mathcal{A},\ t=0,\ldots,T-1,\\ 0\leq\rho_{T}(\hat{x})\leq 1,\quad\forall\hat{x}\in\hat{\mathcal{X}} \end{array}\right.  (12)

where $\delta_{x=x_{0}}(\hat{x})$ for $\hat{x}=(x,y)$ takes $1$ if $x=x_{0}$ and $0$ otherwise. Conversely, for any functions that satisfy (12), there exists a policy $\hat{\pi}\in\hat{\Pi}^{\rm m}$ such that $\mathbb{P}^{\hat{\pi}}(\hat{X}_{t}=\hat{x},A_{t}=a)=\rho_{t}(\hat{x},a)$. Such a policy can specifically be constructed by

\hat{\pi}^{\rm m}_{t}(a|\hat{x})=\rho_{t}(\hat{x},a)\Big/\sum_{\bar{a}\in\mathcal{A}}\rho_{t}(\hat{x},\bar{a})  (13)

when the denominator is nonzero; otherwise $\hat{\pi}^{\rm m}_{t}(a|\hat{x})$ can be arbitrary. Therefore, $\rho_{t}(\hat{x},a)$ can be used as the decision variables of the problem (8) instead of the policy. It should be emphasized that all the functions above are linear with respect to $\rho_{t}(\hat{x},a)$.

The linear programming formulation is given as follows:

\begin{array}{cl} \displaystyle{\max_{\rho}} & \displaystyle{\sum_{t=0}^{T-1}\sum_{\hat{x}\in\hat{\mathcal{X}},a\in\mathcal{A}}\hat{r}_{t}(\hat{x},a)\rho_{t}(\hat{x},a)+\sum_{\hat{x}\in\hat{\mathcal{X}}}\hat{r}_{T}(\hat{x})\rho_{T}(\hat{x})}\\ {\rm s.t.} & \displaystyle{\sum_{\hat{x}\in\hat{\mathcal{X}}}\delta_{y=1}(\hat{x})\rho_{T}(\hat{x})\leq\Delta}\ \ {\rm and}\ \ (12), \end{array}  (14)

which can readily be solved by standard solvers.
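A minimal sketch of this linear program, using scipy.optimize.linprog as one possible standard solver, is given below; it assumes the augmented kernel is stored as an array P_hat[a, x, x'], imposes the initial condition as $\sum_{a}\rho_{0}(\hat{x},a)=1$ for $\hat{x}=(x_{0},0)$ and $0$ otherwise (consistent with the fixed initial state of Sec. III-A), and recovers a Markov policy via (13).

import numpy as np
from scipy.optimize import linprog

def solve_augmented_lp(P_hat, r, rT, x0_idx, yT1_mask, T, Delta):
    """Sketch of the LP (14) over occupancy measures rho_t(x_hat, a).
    P_hat[a, x, x'] : augmented transition probabilities
    r[t, x, a], rT[x] : rewards on the augmented state space
    yT1_mask[x]      : 1 if the augmented state x has y = 1, else 0"""
    A, X, _ = P_hat.shape
    n = T * X * A + X                         # rho_0, ..., rho_{T-1} and rho_T
    idx = lambda t, x, a: t * X * A + x * A + a
    idxT = lambda x: T * X * A + x

    c = np.zeros(n)                           # linprog minimizes, so negate the rewards
    for t in range(T):
        for x in range(X):
            for a in range(A):
                c[idx(t, x, a)] = -r[t, x, a]
    for x in range(X):
        c[idxT(x)] = -rT[x]

    A_eq, b_eq = [], []
    # initial condition: sum_a rho_0(x, a) = 1{x = (x0, 0)}
    for x in range(X):
        row = np.zeros(n)
        for a in range(A):
            row[idx(0, x, a)] = 1.0
        A_eq.append(row); b_eq.append(1.0 if x == x0_idx else 0.0)
    # flow constraints in (12) for t = 1, ..., T-1 and the terminal marginal rho_T
    for t in range(1, T + 1):
        for x in range(X):
            row = np.zeros(n)
            if t < T:
                for a in range(A):
                    row[idx(t, x, a)] = 1.0
            else:
                row[idxT(x)] = 1.0
            for xb in range(X):
                for a in range(A):
                    row[idx(t - 1, xb, a)] -= P_hat[a, xb, x]
            A_eq.append(row); b_eq.append(0.0)

    # chance constraint: sum_{x : y = 1} rho_T(x) <= Delta
    A_ub = np.zeros((1, n)); A_ub[0, T * X * A:] = yT1_mask
    res = linprog(c, A_ub=A_ub, b_ub=[Delta], A_eq=np.array(A_eq),
                  b_eq=np.array(b_eq), bounds=(0, 1), method="highs")
    rho = res.x
    # recover a Markov policy via (13): pi_t(a|x) = rho_t(x, a) / sum_a rho_t(x, a)
    policy = np.zeros((T, X, A))
    for t in range(T):
        for x in range(X):
            mass = sum(rho[idx(t, x, a)] for a in range(A))
            policy[t, x] = ([rho[idx(t, x, a)] / mass for a in range(A)]
                            if mass > 1e-12 else np.full(A, 1.0 / A))
    return -res.fun, policy

The number of decision variables is $T|\hat{\mathcal{X}}||\mathcal{A}|+|\hat{\mathcal{X}}|$, so the program scales linearly with the horizon length.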

The solution to Problem 1 is summarized as follows:

  1. Given the optimization problem (4), consider the state space augmentation given by Definition 1.

  2. Using the augmented MDP, formulate the linear program in (14).

  3. The optimal value of (14) is equal to that of (4). The optimal policy is obtained through (10) and (13).

IV EXTENSION: MULTI-ALARM AVOIDANCE STRATEGY

The optimal policy of Problem 1 implies that the adversary cares about being detected as long as no alarm has been triggered, but no longer cares once an alarm is triggered. In practice, however, a single alarm may not result in counteractions by the defender owing to the existence of false alarms. Hence, it is more natural for the adversary to avoid repeated alarms. In this sense, Problem 1 leads to an unreasonable strategy, which is illustrated by Fig. 2 in the simulation section.

The cause of such unnatural strategies is that the chance constraint in the problem formulation is imposed only on the binary event of whether an alarm has been triggered or not. In this section, to obtain a more reasonable strategy and evaluate the attack impact more precisely, we extend the formulation of Problem 1 by taking multiple alarms into account.

First, define the event that alarms are triggered at least $i$ times by

\mathcal{E}^{i}_{\rm a}:=\{x_{0:T}\in\mathcal{X}^{T+1}:|\mathcal{T}_{\rm a}(x_{0:T})|\geq i\}

where

\mathcal{T}_{\rm a}(x_{0:T}):=\{t\in\{0,\ldots,T\}:x_{t}\in\mathcal{X}_{\rm a}\}.

Using the notation, the extended version of the attack impact evaluation problem for multi-alarm avoidance strategies is formulated as follows:

Problem 2

The attack impact evaluation problem for multi-alarm avoidance strategies is given by

\begin{array}{cl} \displaystyle{\max_{\pi\in\Pi^{\rm h}}} & \mathbb{E}^{\pi}\left[\sum_{t=0}^{T-1}r_{t}(X_{t},A_{t})+r_{T}(X_{T})\right]\\ {\rm s.t.} & \mathbb{P}^{\pi}(\mathcal{E}_{\rm a}^{1})\leq\Delta_{1}\\ & \quad\vdots\\ & \mathbb{P}^{\pi}(\mathcal{E}_{\rm a}^{T})\leq\Delta_{T} \end{array}  (15)

where $\Delta_{i}\geq 0$ for $i=1,\ldots,T$ are constants for the stealth constraints.

The same idea of state space augmentation can be applied to Problem 2 by adding the number of alarms instead of the binary information. The augmented state space and the induced augmented MDP for Problem 2 are defined as follows.

Definition 2

The augmented state space of $\mathcal{X}$ for Problem 2 is defined as (5) with

\mathcal{Y}:=\{0,\ldots,T\}.

The augmented MDP for Problem 2 is defined as (6) where the induced state transition function is given by

\begin{array}{l} \hat{P}((x^{\prime},y)|(x,y),a)=\left\{\begin{array}{ll} P(x^{\prime}|x,a) & {\rm if}\ x^{\prime}\not\in\mathcal{X}_{\rm a},\\ 0 & {\rm otherwise}, \end{array}\right.\\ \hat{P}((x^{\prime},y^{\prime})|(x,y),a)=0,\quad\forall y^{\prime}\not\in\{y,y+1\},\\ \hat{P}((x^{\prime},y+1)|(x,y),a)=\left\{\begin{array}{ll} 0 & {\rm if}\ x^{\prime}\not\in\mathcal{X}_{\rm a},\\ P(x^{\prime}|x,a) & {\rm otherwise}, \end{array}\right. \end{array}

the induced reward function and the induced alarm region are given by those in Definition 1.
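As with Definition 1, the counting kernel above can be built mechanically; the sketch below assumes the same array representation as before and clips the count at $T$ (a value that is never exceeded anyway when the process starts from $(x_{0},0)$).

import numpy as np

def augment_with_alarm_count(P, alarm_mask, T):
    """Build P_hat for Definition 2, where y in {0, ..., T} counts past alarms.
    The count increases by one exactly when the successor state x' is in X_a."""
    A, X, _ = P.shape
    Y = T + 1
    P_hat = np.zeros((A, X * Y, X * Y))
    s = lambda x, y: x * Y + y                 # flatten (x, y) into a single index
    for a in range(A):
        for x in range(X):
            for y in range(Y):
                for xp in range(X):
                    yp = min(y + 1, T) if alarm_mask[xp] else y
                    P_hat[a, s(x, y), s(xp, yp)] += P[a, x, xp]
    return P_hat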

The augmented MDP naturally leads to an equivalent problem

\begin{array}{cl} \displaystyle{\max_{\hat{\pi}\in\hat{\Pi}^{\rm m}}} & \mathbb{E}^{\hat{\pi}}\left[\sum_{t=0}^{T-1}r_{t}(X_{t},A_{t})+r_{T}(X_{T})\right]\\ {\rm s.t.} & \mathbb{P}^{\hat{\pi}}(Y_{T}\geq 1)\leq\Delta_{1}\\ & \quad\vdots\\ & \mathbb{P}^{\hat{\pi}}(Y_{T}\geq T)\leq\Delta_{T} \end{array}  (16)

where the search space is the set of Markov policies. The following theorem is the counterpart of Theorem 1.

Theorem 2

Denote the optimal values of the problems (15) and (16) by $J^{\ast}$ and $\hat{J}^{{\rm m}\ast}$, respectively. Then $J^{\ast}=\hat{J}^{{\rm m}\ast}$.

Proof.

The claim can be proven in a manner similar to the proof of Theorem 1. ∎

The dimension of $\hat{\Pi}^{\rm m}$ is $|\mathcal{A}||\mathcal{X}||\mathcal{T}|^{2}$. The problem (16) is also a standard constrained MDP and can be reformulated as a linear program.

Remark: The constraint in the extended problem (15) can be regarded as a restriction on the probability distribution of the number of alarms. In other words, the formulation utilizes a risk measure on a probability distribution. Several risk measures have been proposed, such as CVaR, which is one of the most commonly used coherent risk measures [32]. Such risk measures compress the risk of a random variable into a scalar value. Because our formulation uses the full information of the distribution, the constraint can be regarded as a fine-grained version of constraints based on standard risk measures.

V NUMERICAL EXAMPLE

We verify the effectiveness of the proposed method through a numerical example. Consider the state space $\mathcal{X}=\{1,\ldots,16\}$. The state $x_{t}=1$ represents the reference value, and the adversary's objective is to push the state away from the reference as far as possible. The action space is given by $\mathcal{A}:=\{{\rm up},{\rm stay},{\rm down}\}$. The time horizon is given by $\mathcal{T}=\{0,1,\ldots,15\}$. The state transition function is given by

\left\{\begin{array}{l} P(x+1|x,{\rm up})=0.8,\\ P(x|x,{\rm up})=0.2, \end{array}\right.\quad \left\{\begin{array}{l} P(x+1|x,{\rm stay})=0.2,\\ P(x|x,{\rm stay})=0.8, \end{array}\right.\quad \left\{\begin{array}{l} P(x+1|x,{\rm down})=0.2,\\ P(x|x,{\rm down})=0.8, \end{array}\right.

for $x=1$, and

\left\{\begin{array}{l} P(x+1|x,{\rm up})=0.8,\\ P(x|x,{\rm up})=0.1,\\ P(x-1|x,{\rm up})=0.1, \end{array}\right.\quad \left\{\begin{array}{l} P(x+1|x,{\rm stay})=0.1,\\ P(x|x,{\rm stay})=0.8,\\ P(x-1|x,{\rm stay})=0.1, \end{array}\right.\quad \left\{\begin{array}{l} P(x+1|x,{\rm down})=0.1,\\ P(x|x,{\rm down})=0.1,\\ P(x-1|x,{\rm down})=0.8, \end{array}\right.

for $x=2,\ldots,16$. The alarm region is given by $\mathcal{X}_{\rm a}=\{6,\ldots,16\}$. The reward function is given by $r_{t}(x,a)=|x|$ for any $a\in\mathcal{A}$ and $t\in\mathcal{T}$.
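For reference, the transition kernel, alarm region, and reward of this example could be encoded as follows; the 0-based indexing and the saturation of the upward move at the top state $x=16$ are assumptions made here, since the text specifies only the row for $x=1$ and the interior rows.

import numpy as np

X, ACTIONS = 16, ["up", "stay", "down"]
P = np.zeros((len(ACTIONS), X, X))       # P[a, x, x'] with index 0..15 for states 1..16

interior = {"up": (0.8, 0.1, 0.1),       # (P(x+1|x,a), P(x|x,a), P(x-1|x,a)) for x = 2..16
            "stay": (0.1, 0.8, 0.1),
            "down": (0.1, 0.1, 0.8)}
boundary = {"up": (0.8, 0.2),            # (P(2|1,a), P(1|1,a)) as specified for x = 1
            "stay": (0.2, 0.8),
            "down": (0.2, 0.8)}

for a, name in enumerate(ACTIONS):
    p_up1, p_stay1 = boundary[name]
    P[a, 0, 1], P[a, 0, 0] = p_up1, p_stay1
    pu, ps, pd = interior[name]
    for x in range(1, X):
        P[a, x, min(x + 1, X - 1)] += pu  # assumed saturation at the top state x = 16
        P[a, x, x] += ps
        P[a, x, x - 1] += pd

alarm_mask = np.zeros(X, dtype=bool)
alarm_mask[5:] = True                     # alarm region {6, ..., 16}
reward = np.abs(np.arange(1, X + 1))      # r_t(x, a) = |x|

This kernel can then be passed to the augmentation and LP sketches given earlier.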

Consider the formulation of Problem 1. Let the constant in the stealth condition be given by $\Delta=0.5$. Fig. 2 depicts the simulation results with the optimal policy obtained by solving the equivalent formulation (14). Fig. 2a depicts the probability mass function of the total number of alarms during the process. It can be observed that the resulting probability distribution satisfies the stealth constraint in (4), and hence Problem 1 is solved through the proposed method. On the other hand, as pointed out in Sec. IV, the probability mass is concentrated on the cases with more than ten alarms. This result indicates that the formulation in Problem 1 leads to a policy under which the number of alarms becomes large once an alarm is triggered. Figs. 2b and 2c depict the sample paths of 100 trials and the empirical means conditioned on whether or not an alarm is triggered at least once during the process, respectively. These subfigures exhibit the same tendency: the adversary no longer cares about being detected once an alarm is triggered.

Figure 2: Simulation results with the formulation of Problem 1. (a): Probability mass function of the total number of alarms during the process. (b): Sample paths of 100 trials. The dotted line depicts the boundary of the alarm region. (c): Conditional means. The dashed line corresponds to the case where the alarm is triggered at least once, while the solid line corresponds to the case where the alarm is not triggered during the process.

Next, consider the formulation of Problem 2. Let the constants in the stealth condition be given by $\Delta_{i}=(0.5)^{i}$. Fig. 3 depicts the simulation results. The subfigures correspond to those in Fig. 2. It can be observed from Fig. 3a that the obtained probability mass function looks more natural than that in Fig. 2a, i.e., the probability of the number of alarms decreases as the number increases. It can also be observed from Figs. 3b and 3c that the resulting state trajectories are also more natural, i.e., they remain close to the boundary of the alarm region even if an alarm is triggered during the process. The attack impacts evaluated through Problems 1 and 2 are compared in Table I.

Figure 3: Simulation results with the formulation of Problem 2. Each subfigure corresponds to those in Fig. 2.
TABLE I: Evaluated Attack Impacts through Problems 1 and 2
             Problem 1    Problem 2
$J^{\ast}$     84.99        58.16

VI CONCLUSION

This study has addressed the attack impact evaluation problem for control systems. The problem is formulated as an MDP with a temporally joint chance constraint. The difficulty of the optimal control problem lies in the explosion of the search space owing to the dependency of the optimal policy on the entire history. In this paper, we have shown that the information of whether an alarm has been triggered or not is sufficient, and that the size of the search space can be reduced by state space augmentation. Moreover, the problem has been extended to take the number of alarms into account, in order to obtain reasonable strategies and evaluate the attack impact more precisely.

Further research directions include extension to infinite and continuous state spaces, discretization of which results in a finite but huge state space when the dimension is high. Moreover, although the size of the search space is reduced compared with the original one, the optimal policy still depends on the time step. To reduce the size further and obtain an optimal stationary policy, extension to the infinite-horizon problem is of interest. Finally, in the extended formulation, we have used the full information of the probability distribution. Clarifying relationships to standard risk measures, such as CVaR, is an open problem.

Appendix A Example: History-Dependency of Optimal Policy and Sufficiency of Alarm Information

Consider the MDP illustrated by Fig. 4. In this example, the adversary can take an action only at $t=2$. When the action $a$ is taken, the state reaches $x_{3}$ with probability one and the adversary obtains reward $1$. On the other hand, when the action $a^{\prime}$ is taken, the state reaches $x^{\prime}_{3}$ or $x^{\prime}_{3{\rm a}}$ with equal probability. The reward is $10$ in the case of $x^{\prime}_{3}$, while no reward is given in the case of $x^{\prime}_{3{\rm a}}$. The alarm region is given by $\mathcal{X}_{\rm a}=\{x_{1{\rm a}},x^{\prime}_{3{\rm a}}\}$. The action $a^{\prime}$ can be interpreted as a risky action in the sense that it yields a large reward in expectation but may trigger an alarm.

Figure 4: MDP for which Markov policies cannot achieve the optimal value of Problem 1. The adversary can take an action only at $t=2$ and obtains a reward only at the final step, depending only on the state.

The history-dependent policies are parameterized by $\pi^{\rm h}(a^{\prime}|x_{1},x_{2})=\alpha$ and $\pi^{\rm h}(a^{\prime}|x_{1{\rm a}},x_{2})=\beta$ with parameters $(\alpha,\beta)\in[0,1]\times[0,1]$. The joint chance constraint is written as $\mathbb{P}^{\pi^{\rm h}}(\lor_{t\in\mathcal{T}}X_{t}\in\mathcal{X}_{\rm a})\leq\Delta \Leftrightarrow \mathbb{P}^{\pi^{\rm h}}(X_{1}=x_{1{\rm a}})+\mathbb{P}^{\pi^{\rm h}}(X_{1}=x_{1})\,\pi^{\rm h}(a^{\prime}|x_{1},x_{2})\,P(x^{\prime}_{3{\rm a}}|x_{2},a^{\prime})\leq 1/2 \Leftrightarrow 1/4+(3/4)(\alpha/2)\leq 1/2 \Leftrightarrow \alpha\leq 2/3$. Thus the feasible region of $(\alpha,\beta)$ is $[0,2/3]\times[0,1]$. The objective function is written as $\mathbb{P}^{\pi^{\rm h}}(X_{3}=x_{3})+10\,\mathbb{P}^{\pi^{\rm h}}(X_{3}=x^{\prime}_{3})=(3/4)(1-\alpha)+(1-\beta)/4+10\,(3\alpha/8+\beta/8)=3\alpha+\beta+1$. Because this is monotonically increasing in $\alpha$ and $\beta$, the optimal parameters are $(\alpha^{\ast},\beta^{\ast})=(2/3,1)$, which leads to

\left\{\begin{array}{l} \pi^{\rm h}(a^{\prime}|x_{1},x_{2})=2/3,\\ \pi^{\rm h}(a^{\prime}|x_{1{\rm a}},x_{2})=1. \end{array}\right.

On the other hand, the Markov policies are parameterized by $\pi^{\rm m}(a^{\prime}|x_{2})=\gamma$ with $\gamma\in[0,1]$. The joint chance constraint becomes $\gamma\leq 2/3$ and the objective function is $4\gamma+1$. Thus the optimal parameter is $\gamma^{\ast}=2/3$, which leads to $\pi^{\rm m}(a^{\prime}|x_{2})=2/3$. Denoting the value of the objective function under a policy $\pi$ by $J(\pi)$, we have $J(\pi^{\rm h})=4>11/3=J(\pi^{\rm m})$, which implies that Markov policies cannot achieve the optimal value in this example.

The optimal history-dependent policy means that the adversary reduces the risk when no alarm has been triggered ($X_{1}=x_{1}$), while she takes the risky action when there has been an alarm ($X_{1}=x_{1{\rm a}}$). In other words, the decision making at the state $x_{2}$ depends on the alarm history. This observation leads to the hypothesis that this binary information, in addition to the current state, is sufficient for optimal decision making, and hence to the idea of state space augmentation.
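The closed-form computation above can be double-checked with a short script; the reaching probabilities $3/4$ and $1/4$ for $x_{1}$ and $x_{1{\rm a}}$ are those appearing in the constraint evaluation above.

def objective(alpha, beta):
    # E[reward] = P(X_3 = x_3) * 1 + P(X_3 = x'_3) * 10
    return 0.75 * (1 - alpha) + 0.25 * (1 - beta) + 10 * (0.375 * alpha + 0.125 * beta)

def alarm_prob(alpha):
    # P(X_1 = x_1a) + P(X_1 = x_1) * alpha * P(x'_3a | x_2, a')
    return 0.25 + 0.75 * alpha * 0.5

print(objective(2 / 3, 1.0), alarm_prob(2 / 3))   # optimal history-dependent policy: 4.0, 0.5
print(objective(2 / 3, 2 / 3))                    # best Markov policy (alpha = beta = gamma): 11/3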

References

  • [1] K. E. Hemsley and D. R. E. Fisher, “History of industrial control system cyber incidents,” U.S. Department of Energy Office of Scientific and Technical Information, Tech. Rep. INL/CON-18-44411-Rev002, 2018.
  • [2] N. Falliere, L. O. Murchu, and E. Chien, “W32. Stuxnet Dossier,” Symantec, Tech. Rep., 2011.
  • [3] Cybersecurity & Infrastructure Security Agency, “Stuxnet malware mitigation,” Tech. Rep. ICSA-10-238-01B, 2014, [Online]. Available: https://www.us-cert.gov/ics/advisories/ICSA-10-238-01B.
  • [4] ——, “HatMan - safety system targeted malware,” Tech. Rep. MAR-17-352-01, 2017, [Online]. Available: https://www.us-cert.gov/ics/MAR-17-352-01-HatMan-Safety-System-Targeted-Malware-Update-B.
  • [5] ——, “Cyber-attack against Ukrainian critical infrastructure,” Tech. Rep. IR-ALERT-H-16-056-01, 2018, [Online]. Available: https://www.us-cert.gov/ics/alerts/IR-ALERT-H-16-056-01.
  • [6] S. Kaplan and B. J. Garrick, “On the quantitative definition of risk,” Risk Analysis, vol. 1, no. 1, pp. 11–27, 1981.
  • [7] S. Sridhar, A. Hahn, and M. Govindarasu, “Cyber–physical system security for the electric power grid,” Proc. IEEE, vol. 100, no. 1, pp. 210–224, 2012.
  • [8] A. Teixeira, I. Shames, H. Sandberg, and K. H. Johansson, “A secure control framework for resource-limited adversaries,” Automatica, vol. 51, pp. 135–148, 2015.
  • [9] Y. Mo and B. Sinopoli, “On the performance degradation of cyber-physical systems under stealthy integrity attacks,” IEEE Trans. Autom. Control, vol. 61, no. 9, pp. 2618–2624, 2016.
  • [10] C.-Z. Bai, F. Pasqualetti, and V. Gupta, “Data-injection attacks in stochastic control systems: Detectability and performance tradeoffs,” Automatica, vol. 82, pp. 251 – 260, 2017.
  • [11] D. Umsonst, H. Sandberg, and A. A. Cárdenas, “Security analysis of control system anomaly detectors,” in Proc. 2017 American Control Conference (ACC), 2017, pp. 5500–5506.
  • [12] N. H. Hirzallah and P. G. Voulgaris, “On the computation of worst attacks: A LP framework,” in 2018 Annual American Control Conference (ACC), 2018, pp. 4527–4532.
  • [13] C. Murguia and J. Ruths, “On reachable sets of hidden CPS sensor attacks,” in 2018 Annual American Control Conference (ACC), 2018, pp. 178–184.
  • [14] Y. Chen, S. Kar, and J. M. F. Moura, “Optimal attack strategies subject to detection constraints against cyber-physical systems,” IEEE Trans. Contr. Netw. Systems, vol. 5, no. 3, pp. 1157–1168, 2018.
  • [15] A. M. H. Teixeira, “Optimal stealthy attacks on actuators for strictly proper systems,” in 2019 IEEE 58th Conference on Decision and Control (CDC), 2019, pp. 4385–4390.
  • [16] J. Milošević, H. Sandberg, and K. H. Johansson, “Estimating the impact of cyber-attack strategies for stochastic networked control systems,” IEEE Trans. Control Netw. Syst., vol. 7, no. 2, pp. 747–757, 2019.
  • [17] C. Fang, Y. Qi, J. Chen, R. Tan, and W. X. Zheng, “Stealthy actuator signal attacks in stochastic control systems: Performance and limitations,” IEEE Trans. Autom. Control, vol. 65, no. 9, pp. 3927–3934, 2019.
  • [18] X.-L. Wang, G.-H. Yang, and D. Zhang, “Optimal stealth attack strategy design for linear cyber-physical systems,” IEEE Trans. on Cybern., 2020, (early access).
  • [19] T. Sui, Y. Mo, D. Marelli, X. Sun, and M. Fu, “The vulnerability of cyber-physical system under stealthy attacks,” IEEE Trans. Autom. Control, vol. 66, no. 2, pp. 637–650, 2020.
  • [20] A. A. Cárdenas, S. Amin, Z.-S. Lin, Y.-L. Huang, C.-Y. Huang, and S. Sastry, “Attacks against process control systems: Risk assessment, detection, and response,” in Proc. the 6th ACM ASIA Conference on Computer and Communications Security, 2011.
  • [21] C. Murguia and J. Ruths, “On model-based detectors for linear time-invariant stochastic systems under sensor attacks,” IET Control Theory Applications, vol. 13, no. 8, pp. 1051–1061, 2019.
  • [22] N. Bäuerle and J. Ott, “Markov decision processes with average-value-at-risk criteria,” Mathematical Methods of Operations Research, vol. 74, no. 3, pp. 361–379, 2011.
  • [23] W. B. Haskell and R. Jain, “A convex analytic approach to risk-aware Markov decision processes,” SIAM Journal on Control and Optimization, vol. 53, no. 3, pp. 1569–1598, 2015.
  • [24] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone, “Risk-constrained reinforcement learning with percentile risk criteria,” The Journal of Machine Learning Research, vol. 18, no. 1, pp. 6070–6120, 2017.
  • [25] C. Baier and J.-P. Katoen, Principles of Model Checking.   MIT Press, 2008.
  • [26] M. L. Puterman, Markov Decision Process: Discrete Stochastic Dynamic Programming.   John Wiley & Sons, 1994.
  • [27] J. Giraldo et al., “A survey of physics-based attack detection in cyber-physical systems,” ACM Comput. Surv., vol. 51, no. 4, 2018.
  • [28] D. Bertsekas and S. Shreve, Stochastic Optimal Control: The Discrete-Time Case.   Athena Scientific, 1996.
  • [29] R. Munos and A. Moore, “Variable resolution discretization in optimal control,” Machine Learning, vol. 49, no. 2, pp. 291–323, 2002.
  • [30] K. Hinderer, Foundations of Non-stationary Dynamic Programming with Discrete Time Parameter.   Springer, 1970.
  • [31] E. Altman, Constrained Markov Decision Processes.   Chapman and Hall/CRC, 1999.
  • [32] P. Artzner, F. Delbaen, J.-M. Eber, and D. Heath, “Coherent measures of risk,” Mathematical Finance, vol. 9, no. 3, pp. 203–228, 1999.