
A Game Theoretic Analysis of LQG Control under Adversarial Attack

Zuxing Li, György Dán, and Dong Liu

Z. Li, G. Dán, and D. Liu are with the School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden {zuxing;gyuri;doli}@kth.se. This work was partly supported by the Swedish Foundation for Strategic Research (SSF) through the CLAS project and by MSB through the CERCES project.
Abstract

Motivated by recent works addressing adversarial attacks on deep reinforcement learning, a deception attack on linear quadratic Gaussian control is studied in this paper. In the considered attack model, the adversary can manipulate the observation of the agent subject to a mutual information constraint. The adversarial problem is formulated as a novel dynamic cheap talk game to capture the strategic interaction between the adversary and the agent, the asymmetry of information availability, and the system dynamics. Necessary and sufficient conditions are provided for subgame perfect equilibria to exist in pure strategies and in behavioral strategies, and the equilibria and the resulting control rewards are characterized. The results show that pure strategy equilibria are informative, while only babbling equilibria exist in behavioral strategies. Numerical results illustrate the impact of strategic adversarial interaction.

I INTRODUCTION

Deep reinforcement learning (DRL) has recently emerged as a promising solution for solving large Markov decision processes (MDPs) and partially observable MDPs (POMDPs), thanks to deep neural networks used as policy approximators [1]. DRL has, however, been shown to be vulnerable to small perturbations of the state observation, called adversarial examples, which were found to mislead the control agent into taking suboptimal control actions [2]. While there has been significant recent interest in the design of adversarial examples against DRL [2, 3, 4, 5, 8], there has been little work on characterizing the ability of agents to adapt to such attacks.

Recent work proposed to use adversarial examples for making DRL agents more robust to perturbations, by letting the adversary and the agent play against each other and formulating the interaction as a stochastic game (SG) [7]. Nonetheless, in the case of adversarial examples the agent cannot observe the system state directly, nor can the adversary affect the state transition probabilities directly; the adversary acts only through the actions taken by the agent. Hence, the SG model does not capture the information structure of the problem. Effectively, in the presence of adversarial examples the agent has to solve a POMDP in which the observations are subverted by the adversary so as to mislead the agent.

As a model of this interaction, in this paper we propose a game theoretical model to study the strategic interaction between an agent that has to solve a linear quadratic Gaussian (LQG) control problem, and an adversary that can manipulate the agent’s observations by a randomly chosen affine transformation subject to a mutual information constraint, and aims at minimizing the control reward. The resulting problem is formulated as a dynamic cheap talk game, which captures information asymmetry, the beliefs of the adversary and the agent, and the undetectability constraint imposed on the adversarial attacks.

Our paper contributes to the solution of the formulated game theoretical problem in two ways. First, we establish necessary and sufficient conditions for the existence of subgame perfect equilibria (SPEs) in pure strategies and in behavioral strategies, and we characterize the equilibrium strategies. Second, we characterize the rewards achievable in equilibria, and relate them to the rewards of a naive agent and an alert agent under attack. The key novelty of our contribution is that we characterize the strategies to be followed by the agent and by the adversary under strategic interaction, which has not been addressed in the existing literature.

The rest of the paper is organized as follows. In Section II we review related work. In Section III we present the system model and problem formulation. In Section IV we provide analytical results. In Section V we provide numerical results. Section VI concludes the paper.

Notation: Unless otherwise specified, we denote a random variable by a capital letter and its realization by the corresponding lower-case letter. We denote by \mathcal{N}(\cdot,\cdot) the Gaussian distribution, by \mathbb{S}(\cdot) the support set, by I(\cdot;\cdot) the mutual information, and by ||\cdot|| the cardinality of a set.

II RELATED WORK

Related to our work is previous research on robust POMDPs under uncertainty of the system dynamics [6]. In [6] the control action was optimized under the worst-case assumption on the system dynamics in each stage, i.e., the agent plays as the leader and the dynamic system plays as the follower in a Stackelberg game. SGs and partially observable SGs (POSGs) have been used to model the strategic interaction of players in a dynamic system, and have been employed in robust and adversarial problems [7, 8, 9]. But unlike in the case of learning under adversarial attacks, in SGs and POSGs the players interact with each other through the impact of their actions on the state transitions, not on the state observations. Our work is related to the cheap talk game [18], where a sender with private information sends a message to a receiver, and the receiver takes an action based on the received message and on its belief about the inaccessible private information. Closest to our model are [10, 11, 12]. In [10] a dynamic cheap talk game was proposed to study a deception attack on a Markovian system, where the actions do not affect the state transitions. In [11] the authors developed a dynamic game model of the attacker-defender interaction, and characterized the optimal attack strategy as a function of the defense strategy, allowing for a static optimal defense strategy. In our preliminary work [12] we proposed a dynamic cheap talk framework to model deception attacks on a general MDP, and addressed computational issues.

Adversarial variants of LQG control were considered in a number of recent works. A Stackelberg game was formulated in [13], where the dynamic system is the leader, while the agent is the follower and may be an adversary. In [14], the authors formulated a finite horizon hierarchical signaling game between a sender and a receiver in a dynamic environment and showed that linear sender and receiver strategies can attain the equilibrium. In [15], the adversary optimally manipulates the control actions instead of the system states, but the complete strategic interaction is not considered. The optimal attack on both the system state and the control action in LQG control was studied in [16]. In [17], a targeted attack strategy was studied to mislead the LQG system to a particular state while evading detection.

III ADVERSARIAL LQG CONTROL PROBLEM


Figure 1: Considered adversarial attack on dynamic system control. In the i-th stage, the adversary observes the true system state s_i and presents the manipulated state \hat{s}_i to the agent. The agent does not have access to the true state s_i but takes an action a_i upon observing \hat{s}_i.

We consider an N-stage LQG control problem under adversarial attack, as illustrated in Fig. 1. The system states \{s_i\}_{i=1}^{N}, the manipulated states \{\hat{s}_i\}_{i=1}^{N}, the actions \{a_i\}_{i=1}^{N}, and the instantaneous rewards \{r_i\}_{i=1}^{N} for 1\leq i\leq N are described by

s_{i+1}=\alpha_{i}s_{i}+\beta_{i}a_{i}+z_{i}, \text{ given } \alpha_{i}\neq 0,\ \beta_{i}\neq 0; \quad (1)
\hat{s}_{i}=\pi_{i}s_{i}+c_{i}; \quad (2)
a_{i}=\kappa_{i}\hat{s}_{i}+\rho_{i}; \quad (3)
r_{i}=R_{i}(s_{i},a_{i})=-\theta_{i}s_{i}^{2}-\phi_{i}a_{i}^{2}, \text{ given } \theta_{i}>0,\ \phi_{i}>0; \quad (4)
S_{1}\sim b_{1}\triangleq\mathcal{N}(\mu_{1},\sigma_{1}^{2}), \text{ given } \mu_{1},\ \sigma_{1}^{2}>0; \quad (5)
Z_{i}\sim\mathcal{N}(0,\omega_{i}^{2}), \text{ given } \omega_{i}^{2}>0; \quad (6)
C_{i}\sim\mathcal{N}(0,\delta_{i}^{2}). \quad (7)

III-A LQG Recapitulation

If \{\pi_i\}_{i=1}^{N} and \{\delta_i^2\}_{i=1}^{N} are known to the agent, the above problem is a standard LQG control problem. In the i-th stage, the agent observes \hat{s}_i but not s_i, and determines the action a_i with the aim of maximizing its expected accumulated reward. Note that it is sufficient to consider an affine function of \hat{s}_i for a_i in the standard LQG problem, as the optimal action a_i^{\star} is a linear function of the mean of the agent's Gaussian posterior distribution of S_i after observing \{\hat{s}_k\}_{k=1}^{i} and \{a_k\}_{k=1}^{i-1} [20]. To compute the optimal coefficients \kappa_i^{\star} and \rho_i^{\star}, we first define \tilde{\theta}_{N+1}=0, and for 1\leq i\leq N define \tilde{\theta}_i as

\tilde{\theta}_{i}=\theta_{i}+\tilde{\theta}_{i+1}\alpha_{i}^{2}-\frac{\tilde{\theta}_{i+1}^{2}\alpha_{i}^{2}\beta_{i}^{2}}{\phi_{i}+\tilde{\theta}_{i+1}\beta_{i}^{2}}. \quad (8)

Furthermore, we denote by b_i\triangleq\mathcal{N}(\mu_i,\sigma_i^2) the belief of the agent about S_i, which is the agent's Gaussian posterior distribution of S_i after observing \{\hat{s}_k\}_{k=1}^{i-1} and \{a_k\}_{k=1}^{i-1}. Then, given the manipulated state \hat{s}_i, the optimal action can be expressed as

\kappa_{i}^{\star}=-\frac{\tilde{\theta}_{i+1}\alpha_{i}\beta_{i}\pi_{i}\sigma_{i}^{2}}{(\phi_{i}+\tilde{\theta}_{i+1}\beta_{i}^{2})(\pi_{i}^{2}\sigma_{i}^{2}+\delta_{i}^{2})}, \quad (9)
\rho_{i}^{\star}=-\frac{\tilde{\theta}_{i+1}\alpha_{i}\beta_{i}\mu_{i}\delta_{i}^{2}}{(\phi_{i}+\tilde{\theta}_{i+1}\beta_{i}^{2})(\pi_{i}^{2}\sigma_{i}^{2}+\delta_{i}^{2})}, \quad (10)
a_{i}^{\star}=\kappa_{i}^{\star}\hat{s}_{i}+\rho_{i}^{\star}=-\frac{\tilde{\theta}_{i+1}\alpha_{i}\beta_{i}}{\phi_{i}+\tilde{\theta}_{i+1}\beta_{i}^{2}}\,\frac{\pi_{i}\sigma_{i}^{2}\hat{s}_{i}+\mu_{i}\delta_{i}^{2}}{\pi_{i}^{2}\sigma_{i}^{2}+\delta_{i}^{2}}, \quad (11)

where the coefficients \kappa_i^{\star} and \rho_i^{\star} depend on b_i, and \frac{\pi_i\sigma_i^2\hat{s}_i+\mu_i\delta_i^2}{\pi_i^2\sigma_i^2+\delta_i^2} is the mean of the agent's Gaussian posterior distribution of S_i after observing \{\hat{s}_k\}_{k=1}^{i} and \{a_k\}_{k=1}^{i-1}. Note that when \pi_i\equiv 1 and \delta_i^2\equiv 0, the LQG strategy reduces to the linear quadratic regulator (LQR) strategy

(\kappa_{i}^{\star},\rho_{i}^{\star})=\left(-\frac{\tilde{\theta}_{i+1}\alpha_{i}\beta_{i}}{\phi_{i}+\tilde{\theta}_{i+1}\beta_{i}^{2}},\,0\right). \quad (12)
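As a sanity check, the recursion (8) and the coefficients (9)-(10) can be evaluated numerically. The following minimal NumPy sketch is ours, not part of the original formulation; the belief parameters mu and sigma2 are assumed to be given for each stage (in practice they come from the belief update of Section IV-A), and the illustrative call uses the default parameters of Table I with an arbitrary manipulation (pi, delta2).

import numpy as np

def lqg_coefficients(alpha, beta, theta, phi, pi, delta2, mu, sigma2):
    # Backward recursion (8); theta_tilde[N] corresponds to theta_tilde_{N+1} = 0.
    N = len(alpha)
    theta_tilde = np.zeros(N + 1)
    for i in range(N - 1, -1, -1):
        theta_tilde[i] = (theta[i] + theta_tilde[i + 1] * alpha[i] ** 2
                          - theta_tilde[i + 1] ** 2 * alpha[i] ** 2 * beta[i] ** 2
                          / (phi[i] + theta_tilde[i + 1] * beta[i] ** 2))
    # Per-stage optimal coefficients (9)-(10) for the given beliefs.
    kappa, rho = np.empty(N), np.empty(N)
    for i in range(N):
        den = (phi[i] + theta_tilde[i + 1] * beta[i] ** 2) * (pi[i] ** 2 * sigma2[i] + delta2[i])
        kappa[i] = -theta_tilde[i + 1] * alpha[i] * beta[i] * pi[i] * sigma2[i] / den
        rho[i] = -theta_tilde[i + 1] * alpha[i] * beta[i] * mu[i] * delta2[i] / den
    return theta_tilde, kappa, rho

# Illustration for N = 3 stages with the Table I parameters:
N = 3
ones = np.ones(N)
theta_tilde, kappa, rho = lqg_coefficients(
    alpha=-0.5 * ones, beta=-1.5 * ones, theta=2 * ones, phi=ones,
    pi=ones, delta2=0.5 * ones, mu=np.zeros(N), sigma2=ones)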

III-B Adversarial Model

The adversary can manipulate the observation of the agent, and its objective is to minimize the agent's expected accumulated reward, similar to [2, 5]. In the i-th stage, the adversary chooses the manipulation parameters \pi_i and \delta_i^2, manipulates the state s_i into \hat{s}_i, and then reports the manipulated state \hat{s}_i to the agent. We consider that the adversarial manipulation is "small", since a large manipulation may be easily detected and may also involve a high manipulation cost. Given the agent's belief b_i\triangleq\mathcal{N}(\mu_i,\sigma_i^2), we impose the following constraints on the manipulation:

-\infty<\varepsilon^{\prime}\leq\pi_{i}\leq\varepsilon<\infty; \quad (13)
I(\hat{S}_{i};S_{i})=\frac{1}{2}\log\frac{\pi_{i}^{2}\sigma_{i}^{2}+\delta_{i}^{2}}{\delta_{i}^{2}}\geq\frac{1}{2}\log\lambda>0, \quad \forall \pi_{i},\delta_{i}^{2}, \quad (14)

i.e., \frac{\pi_i^2\sigma_i^2+\delta_i^2}{\delta_i^2}\geq\lambda>1. The mutual information constraint (14) implies that the manipulated state conveys at least a certain amount of information about the system state to the agent. A larger value of \lambda means a weaker adversary, and vice versa. Note that in order to satisfy the mutual information constraint, the adversary cannot use \pi_i=0. We denote by \mathcal{A}_i(b_i,\varepsilon^{\prime},\varepsilon,\lambda) the set of feasible adversarial actions (\pi_i,\delta_i^2) in the i-th stage subject to (13)-(14). Finally, at the end of this stage, the adversary reveals \pi_i and \delta_i^2 to the agent¹, so as to keep the adversarial model consistent with the standard LQG control.

¹ This assumption is strong but may hold in some cases. For instance, the player in the shell game reveals the cup in which the pellet is after each round.
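A minimal sketch of the feasibility check implied by (13)-(14) is given below; the function name and the handling of the degenerate case \delta_i^2=0 are our own choices, not part of the paper's model.

import math

def is_feasible(pi, delta2, sigma2, eps_lo, eps_hi, lam):
    # Coefficient constraint (13); pi = 0 can never satisfy (14).
    if not (eps_lo <= pi <= eps_hi) or pi == 0 or delta2 < 0:
        return False
    if delta2 == 0:
        return True  # I(S_hat; S) is infinite, so (14) holds trivially
    # Mutual information constraint (14), equivalently (pi**2*sigma2 + delta2)/delta2 >= lam.
    mutual_info = 0.5 * math.log((pi ** 2 * sigma2 + delta2) / delta2)
    return mutual_info >= 0.5 * math.log(lam)

# For a given pi, the largest variance that still satisfies (14) saturates the constraint:
# delta2 = pi**2 * sigma2 / (lam - 1), cf. (24) and (38).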



Figure 2: Illustration of information structure for a three-stage ALQG game.

III-C Adversarial LQG Control Game

In every stage of the adversarial problem, there is a cheap talk interaction, where the adversary acts as the sender and the agent as the receiver. Different from the dynamic cheap talk game with an action-independent Markovian system [10], we propose a novel dynamic cheap talk game to model the strategic interaction of the adversary and the agent with asymmetric information in the adversarial LQG control problem. Unlike in recent works on adversarial reinforcement learning [2, 3, 4, 5], in our model of strategic interaction the agent is aware of and can adapt to the adversary.

The game is played between the adversary and the agent over N stages. In the i-th stage, the belief of the agent b_i is known to the adversary and determines the action set \mathcal{A}_i(b_i,\varepsilon^{\prime},\varepsilon,\lambda). The adversary uses a behavioral strategy g_i(\pi_i,\delta_i^2|b_i) over \mathcal{A}_i(b_i,\varepsilon^{\prime},\varepsilon,\lambda) for choosing (\pi_i,\delta_i^2). Then, given the observed system state s_i, it generates the manipulated state \hat{s}_i with the probability measure \mathcal{N}(\hat{s}_i|\pi_i s_i,\delta_i^2). The agent uses a pure strategy² f_i(b_i) for choosing \kappa_i and \rho_i based on the belief b_i, and takes the action a_i=\kappa_i\hat{s}_i+\rho_i once it receives the manipulated state \hat{s}_i. Finally, the players compute the belief b_{i+1}\triangleq\mathcal{N}(\mu_{i+1},\sigma_{i+1}^2) based on the current belief b_i, the coefficient \pi_i, the variance \delta_i^2, the manipulated state \hat{s}_i, and the action a_i as \mu_{i+1}=\Lambda_{\mu}(b_i,\pi_i,\delta_i^2,\hat{s}_i,a_i) and \sigma_{i+1}^2=\Lambda_{\nu}(b_i,\pi_i,\delta_i^2).

² The following analysis will show that it is sufficient to consider a pure agent strategy with an affine form.

We can thus express the expected accumulated agent reward using the adversarial strategies g^N\triangleq(g_1,\dots,g_N) and the agent's strategies f^N\triangleq(f_1,\dots,f_N) over N stages as

V\left(b_{1},g^{N},f^{N}\right)=E_{b_{1},g^{N},f^{N}}\left(\sum_{j=1}^{N}R_{j}(S_{j},A_{j})\right). \quad (15)

Consequently, the objective of the adversary is to minimize (15), while the agent aims at maximizing it. We refer to this particular dynamic cheap talk game as the adversarial LQG (ALQG) game. Fig. 2 illustrates a three-stage ALQG game. Our objective is to characterize SPEs of the ALQG game: their existence conditions and solution structure.

Remark 1

The adversarial LQG problem cannot be modeled as an SG or a POSG, since the adversary directly manipulates the observation of the agent. Furthermore, different from a zero-sum SG, an SPE of the ALQG game does not necessarily exist. On the other hand, for N=1 the game is a cheap talk game [18], where the strategies of both players depend on the belief of the agent and on the constraints on the adversarial manipulation. Nonetheless, in the ALQG game the reward function is different from that in [18], which gives rise to different equilibria, as we will show later.

IV EQUILIBRIUM ANALYSIS

In the following, we first formulate the value function and the belief update rule for the adversarial LQG problem; we then characterize SPEs in pure strategies and in behavioral strategies, respectively.

IV-A Value Function and Belief Update

Assume that there is an SPE consisting of strategies (g^{N*},f^{N*}). Induced by this SPE, we can define the value function of a subgame starting from the i-th stage as

V_{i}^{N}(b_{i})=V(b_{i},g_{i}^{N*},f_{i}^{N*})=E_{b_{i},g_{i}^{N*},f_{i}^{N*}}\left(\sum_{j=i}^{N}R_{j}(S_{j},A_{j})\right), \quad (16)

i.e., the value function V_i^N(b_i) is the expected accumulated agent reward in the subgame starting from the i-th stage when the belief in the i-th stage is b_i and the SPE strategies (g_i^{N*},f_i^{N*}) are used. The evaluation of V_i^N(b_i) requires the beliefs in the subgame. In the following, we specify the belief update rule.

Given the current belief b_i\triangleq\mathcal{N}(\mu_i,\sigma_i^2), the coefficient \pi_i, the variance \delta_i^2, the manipulated state \hat{s}_i, and the action a_i, it follows from the adversarial LQG model and Bayes' rule that b_{i+1}\triangleq\mathcal{N}(\mu_{i+1},\sigma_{i+1}^2) with

\mu_{i+1}=\Lambda_{\mu}(b_{i},\pi_{i},\delta_{i}^{2},\hat{s}_{i},a_{i})=\alpha_{i}\frac{\pi_{i}\sigma_{i}^{2}\hat{s}_{i}+\mu_{i}\delta_{i}^{2}}{\pi_{i}^{2}\sigma_{i}^{2}+\delta_{i}^{2}}+\beta_{i}a_{i}, \quad (17)
\sigma_{i+1}^{2}=\Lambda_{\nu}(b_{i},\pi_{i},\delta_{i}^{2})=\frac{\alpha_{i}^{2}\sigma_{i}^{2}\delta_{i}^{2}}{\pi_{i}^{2}\sigma_{i}^{2}+\delta_{i}^{2}}+\omega_{i}^{2}. \quad (18)
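For concreteness, a direct sketch of the update maps \Lambda_{\mu} and \Lambda_{\nu} in (17)-(18) is given below (function and argument names are ours):

def belief_update(mu, sigma2, pi, delta2, s_hat, a, alpha, beta, omega2):
    # Gaussian conditioning of S_i on s_hat, then propagation through the
    # dynamics (1); this is exactly (17)-(18).
    den = pi ** 2 * sigma2 + delta2
    mu_next = alpha * (pi * sigma2 * s_hat + mu * delta2) / den + beta * a
    sigma2_next = alpha ** 2 * sigma2 * delta2 / den + omega2
    return mu_next, sigma2_next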

An immediate consequence of the belief update rule is the following.

Property 1

It follows from \sigma_1^2>0, the variance update rule (18), and \pi_i\neq 0 that \sigma_i^2>0 for all 1\leq i\leq N.

Observe that the value functions \{V_i^N\}_{i=1}^{N-1} have to satisfy the backward dynamic programming equation

V_{i}^{N}(b_{i})=\min_{g_{i}} E_{b_{i},g_{i},f_{i}^{*}}\left\{R_{i}(S_{i},A_{i})+V_{i+1}^{N}\big(\mathcal{N}(\Lambda_{\mu}(b_{i},\Pi_{i},\Delta_{i}^{2},\hat{S}_{i},A_{i}),\Lambda_{\nu}(b_{i},\Pi_{i},\Delta_{i}^{2}))\big)\right\}
\qquad\;\; =\max_{f_{i}} E_{b_{i},g_{i}^{*},f_{i}}\left\{R_{i}(S_{i},A_{i})+V_{i+1}^{N}\big(\mathcal{N}(\Lambda_{\mu}(b_{i},\Pi_{i},\Delta_{i}^{2},\hat{S}_{i},A_{i}),\Lambda_{\nu}(b_{i},\Pi_{i},\Delta_{i}^{2}))\big)\right\}. \quad (19)

Thus the SPE has to satisfy (19), which is the basis for the analysis we present in the following.

IV-B Pure Strategy Equilibria

We start the analysis by considering pure strategy equilibria. With a slight abuse of notation, we denote by (\pi_i,\delta_i^2)=g_i(b_i) a pure strategy of the adversary as a function of the belief b_i.

We first consider the case N=1.

Proposition 1

Let N=1. An SPE consists of (f_1^*,g_1^*), where (\kappa_1^*,\rho_1^*)=f_1^*(b_1)=(0,0) for any belief b_1, and g_1^* can be any adversarial strategy defined on \mathcal{A}_1(b_1,\varepsilon^{\prime},\varepsilon,\lambda).

Proof:

Observe that (\kappa_1^*,\rho_1^*)=f_1^*(b_1)=(0,0) is a dominant strategy for the agent for any belief b_1. Under this strategy, the adversarial strategy has no impact on the agent's reward. This proves the result. ∎

The existence of a pure strategy SPE for N=1 is encouraging, even if the equilibrium is degenerate. Unfortunately, for N\geq 2 an SPE may not exist, as shown in the following theorem.

Theorem 1

Let N\geq 2. If \varepsilon^{\prime}\neq\varepsilon or if \varepsilon^{\prime}=\varepsilon=0, then there is no pure strategy SPE for the ALQG game. If \varepsilon^{\prime}=\varepsilon\neq 0, then there is a unique pure strategy SPE. The SPE strategies for 1\leq i\leq N are given by

\tilde{\theta}_{N+1}=\hat{\theta}_{N+1}=0; \quad (20)
\tilde{\theta}_{i}=\theta_{i}+\tilde{\theta}_{i+1}\alpha_{i}^{2}-\frac{\tilde{\theta}_{i+1}^{2}\alpha_{i}^{2}\beta_{i}^{2}}{\phi_{i}+\tilde{\theta}_{i+1}\beta_{i}^{2}}; \quad (21)
\hat{\theta}_{i}=\theta_{i}+\hat{\theta}_{i+1}\alpha_{i}^{2}-\left(\frac{\tilde{\theta}_{i+1}^{2}\alpha_{i}^{2}\beta_{i}^{2}}{\phi_{i}+\tilde{\theta}_{i+1}\beta_{i}^{2}}+(\hat{\theta}_{i+1}-\tilde{\theta}_{i+1})\alpha_{i}^{2}\right)\frac{\lambda-1}{\lambda}; \quad (22)
\pi_{i}^{*}=g_{i}^{*}(b_{i})=\varepsilon^{\prime}=\varepsilon; \quad (23)
\delta_{i}^{2*}=g_{i}^{*}(b_{i})=\frac{\varepsilon^{2}\sigma_{i}^{2}}{\lambda-1}; \quad (24)
\kappa_{i}^{*}=f_{i}^{*}(b_{i})=-\frac{\tilde{\theta}_{i+1}\alpha_{i}\beta_{i}(\lambda-1)}{(\phi_{i}+\tilde{\theta}_{i+1}\beta_{i}^{2})\lambda\varepsilon}; \quad (25)
\rho_{i}^{*}=f_{i}^{*}(b_{i})=-\frac{\tilde{\theta}_{i+1}\alpha_{i}\beta_{i}\mu_{i}}{(\phi_{i}+\tilde{\theta}_{i+1}\beta_{i}^{2})\lambda}. \quad (26)
Corollary 1

If \varepsilon^{\prime}=\varepsilon\neq 0, the value function induced by the unique pure strategy SPE is

V_{i}^{N}(b_{i})=-\tilde{\theta}_{i}\mu_{i}^{2}-\hat{\theta}_{i}\sigma_{i}^{2}-\sum_{j=i+1}^{N}\hat{\theta}_{j}\omega_{j-1}^{2}. \quad (27)

The proofs of Theorem 1 and Corollary 1 are provided in the appendix.
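The recursions (20)-(22) and the value function (27) are straightforward to evaluate; the following sketch is our own code (0-indexed stages, with omega2[k] standing for \omega_{k+1}^2), intended only as a numerical companion to Theorem 1 and Corollary 1.

import numpy as np

def pure_spe_value(alpha, beta, theta, phi, omega2, lam, mu1, sigma2_1):
    # Backward recursions (21)-(22); index N holds theta_tilde_{N+1} = theta_hat_{N+1} = 0.
    N = len(alpha)
    th_tilde = np.zeros(N + 1)
    th_hat = np.zeros(N + 1)
    for i in range(N - 1, -1, -1):
        gain = (th_tilde[i + 1] ** 2 * alpha[i] ** 2 * beta[i] ** 2
                / (phi[i] + th_tilde[i + 1] * beta[i] ** 2))
        th_tilde[i] = theta[i] + th_tilde[i + 1] * alpha[i] ** 2 - gain
        th_hat[i] = (theta[i] + th_hat[i + 1] * alpha[i] ** 2
                     - (gain + (th_hat[i + 1] - th_tilde[i + 1]) * alpha[i] ** 2) * (lam - 1) / lam)
    # Value function (27) at stage 1; the SPE strategies themselves follow from (23)-(26).
    V1 = (-th_tilde[0] * mu1 ** 2 - th_hat[0] * sigma2_1
          - sum(th_hat[j] * omega2[j - 1] for j in range(1, N)))
    return th_tilde, th_hat, V1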

Property 2

It follows from (20)-(22) that for 1\leq i\leq N,

\hat{\theta}_{i}\geq\tilde{\theta}_{i}>0. \quad (28)
Remark 2

We can make the following observations based on Theorem 1 and Corollary 1.

  • Since the LQG control is the best response to any given (adversarially manipulated) observation model, it is sufficient to consider a pure agent strategy in the form of an affine function for an SPE of the ALQG game with a pure adversarial strategy.

  • It follows from (24) that a rational adversary will always apply a manipulation with the largest variance.

  • The value function ViNV_{i}^{N} consists of a constant term and two separable terms depending on the mean μi\mu_{i} and the variance σi2\sigma_{i}^{2} of the belief, which allows a closed form solution for arbitrary NN.

Time-invariant system: We now turn to the asymptotic analysis of a time-invariant system, i.e., \alpha_i=\alpha\neq 0, \beta_i=\beta\neq 0, \omega_i^2=\omega^2>0, \theta_i=\theta>0, and \phi_i=\phi>0 for i\geq 1, and we let N\to\infty. Let us define the mapping L:\mathbb{R}^2_{\geq 0}\to\mathbb{R}^2_{\geq 0} as

L(x,y)=\left(\theta+\frac{\phi\alpha^{2}x}{\phi+\beta^{2}x},\ \theta+\frac{\phi\alpha^{2}x}{\phi+\beta^{2}x}\frac{\lambda-1}{\lambda}+\alpha^{2}y\frac{1}{\lambda}\right). \quad (29)

Observe that L is effectively the coefficient update (20)-(22) for the time-invariant model. In what follows, we first characterize L and then the pure strategy SPE of the ALQG game in the asymptotic regime.

Proposition 2

Let \lambda>\alpha^{2}. Then the mapping L admits a least fixed point (\tilde{\theta},\hat{\theta})\in\mathbb{R}_{\geq 0}^{2}, for which

\lim_{n\to\infty}L^{n}(0,0)=L(\tilde{\theta},\hat{\theta})=(\tilde{\theta},\hat{\theta}), \quad (30)

where L^{n}(0,0) denotes the n-fold composition of L applied to (0,0), i.e., L^{n}(0,0)\triangleq L(L(\cdots L(0,0)\cdots)) with n L-mappings.

Proof:

We start the proof by observing that the mapping L is order-preserving. That is, for all (x,y), (x^{\prime},y^{\prime})\in\mathbb{R}^{2}_{\geq 0} satisfying (x,y)\preccurlyeq(x^{\prime},y^{\prime}), i.e., x\leq x^{\prime} and y\leq y^{\prime}, we have L(x,y)\preccurlyeq L(x^{\prime},y^{\prime}). This can easily be shown by analyzing (29).

Furthermore, the fixed point equation L(\tilde{\theta},\hat{\theta})=(\tilde{\theta},\hat{\theta}) has a unique solution on \mathbb{R}_{\geq 0}^{2} if and only if \lambda>\alpha^{2}. Since L is order-preserving, the convergence result \lim_{n\to\infty}L^{n}(0,0)=(\tilde{\theta},\hat{\theta}) follows from Kleene's fixed point theorem [21]. ∎

Analytical expressions for \tilde{\theta} and \hat{\theta} can be obtained by solving the fixed point equation L(\tilde{\theta},\hat{\theta})=(\tilde{\theta},\hat{\theta}), and can be used for characterizing the SPE in pure strategies, using Theorem 1 and Proposition 2, as follows.
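In practice, the least fixed point can also be obtained by simply iterating the mapping from (0,0), as in the proof of Proposition 2. A minimal sketch (our own code; the tolerance is an arbitrary choice) that works for L in (29), and equally for J in (44) below, is:

def least_fixed_point(mapping, tol=1e-12, max_iter=10000):
    # Kleene iteration from (0, 0); converges when lambda > alpha**2.
    x, y = 0.0, 0.0
    for _ in range(max_iter):
        x_new, y_new = mapping(x, y)
        if abs(x_new - x) < tol and abs(y_new - y) < tol:
            return x_new, y_new
        x, y = x_new, y_new
    return x, y

def L_map(x, y, theta=2.0, phi=1.0, alpha=-0.5, beta=-1.5, lam=2.0):
    # The mapping L of (29) with the Table I parameters and lambda = 2.
    g = phi * alpha ** 2 * x / (phi + beta ** 2 * x)
    return theta + g, theta + g * (lam - 1) / lam + alpha ** 2 * y / lam

theta_tilde, theta_hat = least_fixed_point(L_map)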

Theorem 2

Let \lambda>\alpha^{2}, \varepsilon^{\prime}=\varepsilon\neq 0, and N\to\infty. Then the ALQG game of the time-invariant model has a stationary SPE in pure strategies given, for i\geq 1, by

\pi_{i}^{*}=g_{i}^{*}(b_{i})=\varepsilon; \quad (31)
\delta_{i}^{2*}=g_{i}^{*}(b_{i})=\frac{\varepsilon^{2}\sigma_{i}^{2}}{\lambda-1}; \quad (32)
\kappa_{i}^{*}=f_{i}^{*}(b_{i})=-\frac{\tilde{\theta}\alpha\beta(\lambda-1)}{(\phi+\tilde{\theta}\beta^{2})\lambda\varepsilon}; \quad (33)
\rho_{i}^{*}=f_{i}^{*}(b_{i})=-\frac{\tilde{\theta}\alpha\beta\mu_{i}}{(\phi+\tilde{\theta}\beta^{2})\lambda}. \quad (34)
Proof:

Since \varepsilon^{\prime}=\varepsilon\neq 0, a unique SPE in pure strategies exists and the SPE strategies are given in Theorem 1. Since \lambda>\alpha^{2} and N\to\infty, it follows from Proposition 2 that \tilde{\theta}_{i} and \hat{\theta}_{i} converge to \tilde{\theta} and \hat{\theta}, respectively. This leads to the stationary SPE in Theorem 2. ∎

Interestingly, for this stationary SPE in pure strategies we can obtain the expected average reward per stage in steady state in closed form.

Corollary 2

Let b_{1}\triangleq\mathcal{N}(\mu_{1},\sigma_{1}^{2}) with bounded mean and variance. For the stationary SPE in pure strategies in Theorem 2, the expected average reward per stage in steady state is independent of the initial belief and is given by

\lim_{N\to\infty}\frac{V_{1}^{N}(b_{1})}{N}=-\hat{\theta}\omega^{2}. \quad (35)
Proof:

This result follows from Corollary 1, Proposition 2, and Theorem 2. ∎

IV-C Equilibria in Behavioral Strategies

The previous results show that a pure strategy SPE does not exist if the adversary has multiple choices of the coefficient \pi_i. We thus turn to the analysis of SPEs in behavioral strategies.

Theorem 3

Let N\geq 2 and \varepsilon^{\prime}<0<\varepsilon. For 1\leq i\leq N, let \tilde{\theta}_{i} be given by (21), and let \check{\theta}_{i} be given by

\tilde{\theta}_{N+1}=\check{\theta}_{N+1}=0, \quad (36)
\check{\theta}_{i}=\theta_{i}+\check{\theta}_{i+1}\alpha_{i}^{2}-(\check{\theta}_{i+1}-\tilde{\theta}_{i+1})\alpha_{i}^{2}\frac{\lambda-1}{\lambda}. \quad (37)

Furthermore, there is a continuum of SPEs in behavioral strategies. Each SPE in the i-th stage consists of a behavioral strategy of the adversary and a pure strategy of the agent that satisfy

\mathbb{S}(g_{i}^{*}|b_{i})\triangleq\left\{(\pi_{i},\delta_{i}^{2}):\delta_{i}^{2}=\frac{\pi_{i}^{2}\sigma_{i}^{2}}{\lambda-1}\right\}\subseteq\mathcal{A}_{i}(b_{i},\varepsilon^{\prime},\varepsilon,\lambda); \quad (38)
||\mathbb{S}(g_{i}^{*}|b_{i})||\geq 2; \quad (39)
E_{g_{i}^{*}}(\Pi_{i})=0; \quad (40)
(\kappa_{i}^{*},\rho_{i}^{*})=f_{i}^{*}(b_{i})=\left(0,-\frac{\tilde{\theta}_{i+1}\alpha_{i}\beta_{i}\mu_{i}}{\phi_{i}+\tilde{\theta}_{i+1}\beta_{i}^{2}}\right). \quad (41)

Interestingly, this SPE is a babbling equilibrium in which the agent’s action is based on its belief, not on the manipulated state. Nonetheless, the value of the game depends on the adversarial manipulation, as shown in the following corollary.

Corollary 3

Let \varepsilon^{\prime}<0<\varepsilon. For any SPE in behavioral strategies, we have

V_{i}^{N}(b_{i})=-\tilde{\theta}_{i}\mu_{i}^{2}-\check{\theta}_{i}\sigma_{i}^{2}-\sum_{j=i+1}^{N}\check{\theta}_{j}\omega_{j-1}^{2}. \quad (42)

The proofs of Theorem 3 and Corollary 3 are given in the appendix.

Property 3

It follows from the update rules (20)-(22) and (36)-(37) that for all 1\leq i\leq N,

\check{\theta}_{i}\geq\hat{\theta}_{i}\geq\tilde{\theta}_{i}>0. \quad (43)
Remark 3

We can make the following observations based on Theorem 3 and Corollary 3.

  • It is sufficient for the agent to use a pure affine strategy against a behavioral strategy of the adversary.

  • Although the adversary cannot use \pi_i=0, the behavioral strategy g_i^* needs to achieve a zero mean of the random coefficient \Pi_i.

  • A rational adversary will always use a manipulation with the largest variance.

  • From Property 3, the value (42) of an SPE in behavioral strategies is always less than or equal to the value (27) of a pure strategy SPE.

Time-invariant system: We again turn to the time-invariant system for N\to\infty. Let us define the mapping J:\mathbb{R}_{\geq 0}^{2}\to\mathbb{R}_{\geq 0}^{2} as

J(x,y)=\left(\theta+\frac{\phi\alpha^{2}x}{\phi+\beta^{2}x},\ \theta+\alpha^{2}x\frac{\lambda-1}{\lambda}+\alpha^{2}y\frac{1}{\lambda}\right). \quad (44)

Observe that J is effectively the coefficient update rule (21), (36), (37) for the time-invariant system. In what follows, we characterize J and the stationary SPEs in behavioral strategies for the ALQG game.

Proposition 3

Let \lambda>\alpha^{2}. Then the mapping J admits a least fixed point (\tilde{\theta},\check{\theta})\in\mathbb{R}_{\geq 0}^{2}, for which

\lim_{n\to\infty}J^{n}(0,0)=J(\tilde{\theta},\check{\theta})=(\tilde{\theta},\check{\theta}). \quad (45)

The proof of Proposition 3 follows from the arguments used in the proof of Proposition 2.

Theorem 4

Let \lambda>\alpha^{2}, \varepsilon^{\prime}<0<\varepsilon, and N\to\infty. Then the ALQG game of the time-invariant model has a stationary SPE in behavioral strategies given, for i\geq 1, by

g_{i}^{*}\left(\pi_{i}=\varepsilon^{\prime},\ \delta_{i}^{2}=\frac{\varepsilon^{\prime 2}\sigma_{i}^{2}}{\lambda-1}\,\Big|\,b_{i}\right)=\frac{\varepsilon}{\varepsilon-\varepsilon^{\prime}}; \quad (46)
g_{i}^{*}\left(\pi_{i}=\varepsilon,\ \delta_{i}^{2}=\frac{\varepsilon^{2}\sigma_{i}^{2}}{\lambda-1}\,\Big|\,b_{i}\right)=-\frac{\varepsilon^{\prime}}{\varepsilon-\varepsilon^{\prime}}; \quad (47)
(\kappa_{i}^{*},\rho_{i}^{*})=f_{i}^{*}(b_{i})=\left(0,-\frac{\tilde{\theta}\alpha\beta\mu_{i}}{\phi+\tilde{\theta}\beta^{2}}\right). \quad (48)
Corollary 4

Let b_{1}\triangleq\mathcal{N}(\mu_{1},\sigma_{1}^{2}) with bounded mean and variance. For the stationary SPE in behavioral strategies in Theorem 4, the expected average reward per stage in steady state is independent of the initial belief and is given by

\lim_{N\to\infty}\frac{V_{1}^{N}(b_{1})}{N}=-\check{\theta}\omega^{2}. \quad (49)

The proofs of Theorem 4 and Corollary 4 are based on Theorem 3, Corollary 3, and Proposition 3, and follow from arguments similar to those used in the proofs of Theorem 2 and Corollary 2. Observe that Corollary 2, Corollary 4, and Property 3 jointly imply that the expected average agent reward per stage in steady state is higher under pure strategies, as behavioral strategies allow for more uncertainty about the attack and thus make the adversary stronger.
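As an illustration of the randomization in Theorem 4, the following sketch (our own code; the example parameter values are illustrative) draws the adversary's manipulation from the two-point distribution (46)-(47), which has zero mean and saturates the mutual information constraint.

import random

def sample_behavioral_attack(sigma2, eps_lo, eps_hi, lam):
    # Two-point distribution (46)-(47) over pi in {eps_lo, eps_hi}, with eps_lo < 0 < eps_hi;
    # the mixing weights make E[Pi] = 0, as required by (40).
    p_lo = eps_hi / (eps_hi - eps_lo)
    pi = eps_lo if random.random() < p_lo else eps_hi
    delta2 = pi ** 2 * sigma2 / (lam - 1)   # largest feasible variance, cf. (38)
    return pi, delta2

# e.g. sample_behavioral_attack(sigma2=1.0, eps_lo=-0.8, eps_hi=0.8, lam=2.0)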

The behavioral strategy of the adversary in Theorem 3 needs to achieve a zero mean of the random coefficient \Pi_{i}, which cannot be satisfied if 0\leq\varepsilon^{\prime}<\varepsilon or \varepsilon^{\prime}<\varepsilon\leq 0. In the following, we study SPEs under these conditions.

Theorem 5

Let N\geq 2. If 0=\varepsilon^{\prime}<\varepsilon or if \varepsilon^{\prime}<\varepsilon=0, there is no SPE for the ALQG game.

Theorem 6

Let N=2. If 0<\varepsilon^{\prime}<\varepsilon or if \varepsilon^{\prime}<\varepsilon<0, there is a unique SPE in behavioral strategies for the ALQG game: For any belief b_{1}\triangleq\mathcal{N}(\mu_{1},\sigma_{1}^{2}),

g_{1}^{*}\left(\pi_{1}=\varepsilon^{\prime},\ \delta_{1}^{2}=\frac{\varepsilon^{\prime 2}\sigma_{1}^{2}}{\lambda-1}\,\Big|\,b_{1}\right)=\frac{\varepsilon}{\varepsilon^{\prime}+\varepsilon}; \quad (50)
g_{1}^{*}\left(\pi_{1}=\varepsilon,\ \delta_{1}^{2}=\frac{\varepsilon^{2}\sigma_{1}^{2}}{\lambda-1}\,\Big|\,b_{1}\right)=\frac{\varepsilon^{\prime}}{\varepsilon^{\prime}+\varepsilon}; \quad (51)
\kappa_{1}^{*}=f_{1}^{*}(b_{1})=\frac{-\theta_{2}\alpha_{1}\beta_{1}E_{g_{1}^{*}}(\Pi_{1})\sigma_{1}^{2}}{(\phi_{1}+\theta_{2}\beta_{1}^{2})\left(E_{g_{1}^{*}}(\Pi_{1}^{2})\left(\mu_{1}^{2}+\frac{\lambda}{\lambda-1}\sigma_{1}^{2}\right)-E_{g_{1}^{*}}^{2}(\Pi_{1})\mu_{1}^{2}\right)}; \quad (52)-(53)
\rho_{1}^{*}=f_{1}^{*}(b_{1})=-E_{g_{1}^{*}}(\Pi_{1})\mu_{1}\kappa_{1}^{*}-\frac{\theta_{2}\alpha_{1}\beta_{1}\mu_{1}}{\phi_{1}+\theta_{2}\beta_{1}^{2}}; \quad (54)

for any belief b_{2}\triangleq\mathcal{N}(\mu_{2},\sigma_{2}^{2}), g_{2}^{*}=g_{1}^{*}; (\kappa_{2}^{*},\rho_{2}^{*})=f_{2}^{*}(b_{2})=(0,0); and

V_{1}^{2}(b_{1})=-\left(\theta_{1}+\theta_{2}\alpha_{1}^{2}-\frac{\theta_{2}^{2}\alpha_{1}^{2}\beta_{1}^{2}}{\phi_{1}+\theta_{2}\beta_{1}^{2}}\right)\mu_{1}^{2}-\theta_{2}\omega_{1}^{2}-(\theta_{1}+\theta_{2}\alpha_{1}^{2})\sigma_{1}^{2}+\frac{\theta_{2}^{2}\alpha_{1}^{2}\beta_{1}^{2}E_{g_{1}^{*}}^{2}(\Pi_{1})\sigma_{1}^{4}}{(\phi_{1}+\theta_{2}\beta_{1}^{2})\left(E_{g_{1}^{*}}(\Pi_{1}^{2})\left(\mu_{1}^{2}+\frac{\lambda}{\lambda-1}\sigma_{1}^{2}\right)-E_{g_{1}^{*}}^{2}(\Pi_{1})\mu_{1}^{2}\right)}. \quad (55)

The proofs of Theorems 5 & 6 are given in the appendix.

Remark 4

Different from the value functions (27) and (42), the value function (55) cannot be decomposed into separable terms of the mean and the variance, which makes the extension of Theorem 6 to N\geq 3 difficult.
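For completeness, the quantities in Theorem 6 can be evaluated directly from (50)-(55); the short sketch below is our own code with illustrative parameter values (satisfying 0 < \varepsilon^{\prime} < \varepsilon), not a computation reported in the paper.

def two_stage_spe(theta1, theta2, alpha1, beta1, phi1, omega1_2, lam, eps_lo, eps_hi, mu1, sigma2_1):
    # First-stage adversarial randomization (50)-(51) and its first two moments.
    p_lo, p_hi = eps_hi / (eps_lo + eps_hi), eps_lo / (eps_lo + eps_hi)
    E_pi = p_lo * eps_lo + p_hi * eps_hi
    E_pi2 = p_lo * eps_lo ** 2 + p_hi * eps_hi ** 2
    # Agent coefficients (52)-(54).
    den = (phi1 + theta2 * beta1 ** 2) * (E_pi2 * (mu1 ** 2 + lam / (lam - 1) * sigma2_1) - E_pi ** 2 * mu1 ** 2)
    kappa1 = -theta2 * alpha1 * beta1 * E_pi * sigma2_1 / den
    rho1 = -E_pi * mu1 * kappa1 - theta2 * alpha1 * beta1 * mu1 / (phi1 + theta2 * beta1 ** 2)
    # Value function (55).
    V12 = (-(theta1 + theta2 * alpha1 ** 2 - theta2 ** 2 * alpha1 ** 2 * beta1 ** 2 / (phi1 + theta2 * beta1 ** 2)) * mu1 ** 2
           - theta2 * omega1_2 - (theta1 + theta2 * alpha1 ** 2) * sigma2_1
           + theta2 ** 2 * alpha1 ** 2 * beta1 ** 2 * E_pi ** 2 * sigma2_1 ** 2 / den)
    return kappa1, rho1, V12

# e.g. two_stage_spe(theta1=2, theta2=2, alpha1=-0.5, beta1=-1.5, phi1=1,
#                    omega1_2=1, lam=2, eps_lo=0.4, eps_hi=0.8, mu1=0, sigma2_1=1)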

V NUMERICAL RESULTS

We illustrate the impact of strategic interaction on a time-invariant adversarial LQG problem. The parameters used for the evaluation are shown in Table I.

TABLE I: LQG Model Default Parameters
Parameter   μ_1   σ_1^2   α      β      ω^2   θ   φ
Value       0     1       −0.5   −1.5   1     2   1


Figure 3: \tilde{\theta}_n, \hat{\theta}_n, \check{\theta}_n computed as L^n(0,0) and J^n(0,0) vs. the number of iterations n, for \lambda=1.5 and \lambda=2 (\lambda>\alpha^2).


Figure 4: Expected average reward per stage vs. the mutual information constraint \lambda, for stationary SPEs in pure strategies and in behavioral strategies.


Figure 5: Expected average reward per stage for the stationary SPE in behavioral strategies, for a naive agent, and for an alert agent vs. the mutual information constraint \lambda.

We start by illustrating the convergence of the mappings in Propositions 2 and 3. Fig. 3 shows the mappings L^n(0,0) and J^n(0,0) as functions of the iteration n for \lambda=1.5 and \lambda=2. Both mappings increase monotonically starting from (0,0) and converge to their least fixed points (\tilde{\theta},\hat{\theta}) and (\tilde{\theta},\check{\theta}), respectively, which confirms the propositions.

We continue with the evaluation of stationary SPEs for the time-invariant system. Fig. 4 shows the expected average reward per stage for a stationary SPE in pure strategies and for a stationary SPE in behavioral strategies. Observe that the adversarial manipulation capability decreases as the mutual information lower bound \lambda increases, and hence the expected average rewards increase. The results also confirm Property 3, i.e., the expected average reward per stage for a stationary SPE in behavioral strategies cannot be higher than that for a stationary SPE in pure strategies.

Next, we assess the importance of strategic interaction for the agent's performance, compared to an agent that is unaware of the attack [2, 5]. Fig. 5 shows the expected average reward per stage for a stationary SPE in behavioral strategies, as per Corollary 4, for a naive agent under an optimal adversarial attack, and for an alert agent under the SPE adversarial behavioral strategy g_i^* of Theorem 4, for -\varepsilon^{\prime}=\varepsilon>0. The naive agent is unaware of the adversarial manipulation, i.e., it uses the optimal LQR strategy (12). The corresponding optimal stationary adversarial strategy can be obtained through dynamic programming, and is the pure strategy

(\pi_{i},\delta_{i}^{2})=g^{A}_{i}(b_{i})=\left(-\varepsilon,\ \frac{\varepsilon^{2}\sigma_{i}^{2}}{\lambda-1}\right).

The alert agent suspects an adversary but does not act strategically. It assumes \hat{\pi}_i=\hat{\pi}\neq 0 and \hat{\delta}_i^2=\frac{\hat{\pi}^2\sigma_i^2}{\lambda-1}, and uses the corresponding best response strategy

(\kappa_{i},\rho_{i})=f_{i}^{C}(b_{i})=\left(-\frac{\tilde{\theta}\alpha\beta(\lambda-1)}{(\phi+\tilde{\theta}\beta^{2})\lambda\hat{\pi}},\ -\frac{\tilde{\theta}\alpha\beta\mu_{i}}{(\phi+\tilde{\theta}\beta^{2})\lambda}\right).

The figure shows that the expected average reward per stage of the naive agent is always lower than that of the SPE. Clearly, as the mutual information lower bound \lambda increases, the adversarial manipulation capability becomes weaker and the expected average rewards per stage increase. At the same time, we can observe that if the bound \varepsilon on the manipulation coefficient is higher, then the adversarial manipulation capability becomes stronger and therefore the expected average reward per stage of the naive agent decreases. Note that the limit of the expected average reward per stage of the naive agent does not exist when \varepsilon is larger than a threshold, due to the resulting instability of the control system. The poor performance of the naive agent is consistent with recent works on adversarial DRL [2, 5], where naive DRL agents were found to perform poorly against strategic adversaries. The results for the SPE show, however, that an agent that is aware of the adversary can adjust its strategy to be resilient to the adversarial attack. The figure also shows that the expected average reward per stage of the alert agent is always lower than that of the SPE, since the alert agent does not adjust its best response strategically to the SPE adversarial strategy. Different from the SPE and the naive agent, it is interesting to observe that the alert agent's performance deteriorates as the adversarial constraint \lambda increases. This is because the alert agent's strategy f_i^C deviates more from the SPE strategy f_i^* of Theorem 4 as \lambda increases.
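The behavior shown in Fig. 5 can be reproduced in spirit by simulating the closed loop over a long horizon and averaging the per-stage rewards. The sketch below is our own code (Table I parameters; \lambda, \varepsilon, and the finite horizon are illustrative choices) and simulates the SPE agent of Theorem 4 against the SPE adversarial strategy; the naive and alert agents can be simulated analogously by swapping in (12) or f_i^C.

import numpy as np

def simulate_spe(N=100000, mu1=0.0, sigma2_1=1.0, alpha=-0.5, beta=-1.5,
                 omega2=1.0, theta=2.0, phi=1.0, lam=2.0, eps=0.8, seed=0):
    rng = np.random.default_rng(seed)
    # Stationary theta_tilde: fixed point of the first component of (29)/(44).
    th = 0.0
    for _ in range(10000):
        th = theta + phi * alpha ** 2 * th / (phi + beta ** 2 * th)
    s = rng.normal(mu1, np.sqrt(sigma2_1))
    mu, sigma2, total = mu1, sigma2_1, 0.0
    for _ in range(N):
        # Adversary (Theorem 4 with -eps' = eps): pi uniform on {-eps, eps},
        # largest feasible variance.
        pi = eps if rng.random() < 0.5 else -eps
        delta2 = pi ** 2 * sigma2 / (lam - 1)
        s_hat = rng.normal(pi * s, np.sqrt(delta2))
        # SPE agent (48): ignore s_hat, act on the belief mean only.
        a = -th * alpha * beta * mu / (phi + th * beta ** 2)
        total += -theta * s ** 2 - phi * a ** 2            # reward (4)
        # Belief update (17)-(18) and state transition (1).
        den = pi ** 2 * sigma2 + delta2
        mu = alpha * (pi * sigma2 * s_hat + mu * delta2) / den + beta * a
        sigma2 = alpha ** 2 * sigma2 * delta2 / den + omega2
        s = alpha * s + beta * a + rng.normal(0.0, np.sqrt(omega2))
    return total / N   # expected to approach -theta_check * omega^2, cf. (49)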

VI CONCLUSION

We proposed a game theoretic model to capture the strategic interaction, information asymmetry, and system dynamics of LQG control under adversarial input subject to a mutual information constraint. We characterized the subgame perfect equilibria in pure strategies and in behavioral strategies, including stationary equilibria for time-invariant systems. Our results show that if an equilibrium exists then the agent can use an affine pure strategy, but randomization enables the adversary to construct more powerful attacks under a wider range of parameters and forces the agent into a babbling equilibrium. Our numerical results show the importance of strategic interaction for LQG control, and highlight that an agent that is aware of an adversarial attack can be made resilient to it. Our work could be extended in a number of interesting directions, including considering a non-scalar state dynamic system, and relaxing the assumption that the adversarial strategy is revealed to the agent after each stage.

-A Proofs of Theorem 1 and Corollary 1

Proof:

We prove the result using backward dynamic programming. Recall that there is no feasible adversarial strategy for ε=ε=0\varepsilon^{\prime}=\varepsilon=0. Thus, it is sufficient to consider the cases εε\varepsilon^{\prime}\not=\varepsilon or ε=ε0\varepsilon^{\prime}=\varepsilon\not=0. In stage NN the value function for bNb_{N} can be expressed as

VNN(bN)=θ~NμN2θ^NσN2,\displaystyle V_{N}^{N}(b_{N})=-\tilde{\theta}_{N}\mu_{N}^{2}-\hat{\theta}_{N}\sigma_{N}^{2}, (56)

where θ~N=θ^N=θN\tilde{\theta}_{N}=\hat{\theta}_{N}=\theta_{N} from the update rules (20)-(22). The pure strategies, which form an SPE and achieve the value function, consist of any pure strategy gNg_{N}^{*} satisfying the adversarial constraints (13)-(14), and the dominant pure strategy

(κN,ρN)=fN(bN)=(0,0).\displaystyle(\kappa_{N}^{*},\rho_{N}^{*})=f_{N}^{*}(b_{N})=(0,0). (57)

In stage N1N-1, the Q-function of using πN10\pi_{N-1}\not=0, δN12\delta_{N-1}^{2}, κN1\kappa_{N-1}, and ρN1\rho_{N-1} given a belief bN1b_{N-1} is

QN1N\displaystyle Q_{N-1}^{N} (bN1,πN1,δN12,κN1,ρN1)\displaystyle(b_{N-1},\pi_{N-1},\delta_{N-1}^{2},\kappa_{N-1},\rho_{N-1})
=\displaystyle= θN1E(SN12)ϕN1E(AN12)\displaystyle\,-\theta_{N-1}E(S_{N-1}^{2})-\phi_{N-1}E(A_{N-1}^{2})
θ~NE(Λμ(bN1,πN1,δN12,S^N1,AN1))2\displaystyle\,-\tilde{\theta}_{N}E(\Lambda_{\mu}(b_{N-1},\pi_{N-1},\delta_{N-1}^{2},\hat{S}_{N-1},A_{N-1}))^{2}
θ^NΛν(bN1,πN1,δN12)\displaystyle\,-\hat{\theta}_{N}\Lambda_{\nu}(b_{N-1},\pi_{N-1},\delta_{N-1}^{2})
=\displaystyle= (θN1+θ~NαN12)μN12θ^NωN12\displaystyle\,-(\theta_{N-1}+\tilde{\theta}_{N}\alpha_{N-1}^{2})\mu_{N-1}^{2}-\hat{\theta}_{N}\omega_{N-1}^{2}
(θN1+θ^NαN12)σN12\displaystyle\,-(\theta_{N-1}+\hat{\theta}_{N}\alpha_{N-1}^{2})\sigma_{N-1}^{2}
(ϕN1+θ~NβN12)(πN1κN1μN1+ρN1)2\displaystyle\,-(\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2})(\pi_{N-1}\kappa_{N-1}\mu_{N-1}+\rho_{N-1})^{2}
2θ~NαN1βN1μN1(πN1κN1μN1+ρN1)\displaystyle\,-2\tilde{\theta}_{N}\alpha_{N-1}\beta_{N-1}\mu_{N-1}(\pi_{N-1}\kappa_{N-1}\mu_{N-1}+\rho_{N-1})
+(θ^Nθ~N)αN12πN12σN12πN12σN12+δN12σN12\displaystyle\,+(\hat{\theta}_{N}-\tilde{\theta}_{N})\alpha_{N-1}^{2}\frac{\pi_{N-1}^{2}\sigma_{N-1}^{2}}{\pi_{N-1}^{2}\sigma_{N-1}^{2}+\delta_{N-1}^{2}}\sigma_{N-1}^{2}
(ϕN1+θ~NβN12)κN12(πN12σN12+δN12)\displaystyle\,-(\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2})\kappa_{N-1}^{2}(\pi_{N-1}^{2}\sigma_{N-1}^{2}+\delta_{N-1}^{2})
2θ~NαN1βN1πN1κN1σN12,\displaystyle\,-2\tilde{\theta}_{N}\alpha_{N-1}\beta_{N-1}\pi_{N-1}\kappa_{N-1}\sigma_{N-1}^{2}, (58)

where the expectations are induced by the given bN1b_{N-1}, πN1\pi_{N-1}, δN12\delta_{N-1}^{2}, κN1\kappa_{N-1}, and ρN1\rho_{N-1}.

Given bN1b_{N-1}, πN10\pi_{N-1}\not=0, δN12\delta_{N-1}^{2}, and κN1\kappa_{N-1}, the Q-function QN1NQ_{N-1}^{N} is a concave quadratic function of ρN1\rho_{N-1}. As the best response to maximize the agent reward, we can substitute ρN1\rho_{N-1} in terms of bN1b_{N-1}, πN1\pi_{N-1}, and κN1\kappa_{N-1} as

ρN1=πN1κN1μN1θ~NαN1βN1ϕN1+θ~NβN12μN1.\displaystyle\rho_{N-1}=-\pi_{N-1}\kappa_{N-1}\mu_{N-1}-\frac{\tilde{\theta}_{N}\alpha_{N-1}\beta_{N-1}}{\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2}}\mu_{N-1}. (59)

Thus, it is sufficient to consider the Q-function

QN1N\displaystyle Q_{N-1}^{N} (bN1,πN1,δN12,κN1)\displaystyle(b_{N-1},\pi_{N-1},\delta_{N-1}^{2},\kappa_{N-1})
=\displaystyle= (θN1+θ~NαN12θ~N2αN12βN12ϕN1+θ~NβN12)μN12\displaystyle\,-\left(\theta_{N-1}+\tilde{\theta}_{N}\alpha_{N-1}^{2}-\frac{\tilde{\theta}_{N}^{2}\alpha_{N-1}^{2}\beta_{N-1}^{2}}{\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2}}\right)\mu_{N-1}^{2}
(θN1+θ^NαN12)σN12θ^NωN12\displaystyle\,-(\theta_{N-1}+\hat{\theta}_{N}\alpha_{N-1}^{2})\sigma_{N-1}^{2}-\hat{\theta}_{N}\omega_{N-1}^{2}
+(θ^Nθ~N)αN12πN12σN12πN12σN12+δN12σN12\displaystyle\,+(\hat{\theta}_{N}-\tilde{\theta}_{N})\alpha_{N-1}^{2}\frac{\pi_{N-1}^{2}\sigma_{N-1}^{2}}{\pi_{N-1}^{2}\sigma_{N-1}^{2}+\delta_{N-1}^{2}}\sigma_{N-1}^{2}
(ϕN1+θ~NβN12)κN12(πN12σN12+δN12)\displaystyle\,-(\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2})\kappa_{N-1}^{2}(\pi_{N-1}^{2}\sigma_{N-1}^{2}+\delta_{N-1}^{2})
2θ~NαN1βN1πN1κN1σN12.\displaystyle\,-2\tilde{\theta}_{N}\alpha_{N-1}\beta_{N-1}\pi_{N-1}\kappa_{N-1}\sigma_{N-1}^{2}. (60)

As shown in Property 1, the belief variance is always positive. Therefore, given bN1b_{N-1}, πN10\pi_{N-1}\not=0, and δN12\delta_{N-1}^{2}, the Q-function QN1NQ_{N-1}^{N} is a concave quadratic function of κN1\kappa_{N-1}, and the best response of the agent in terms of κN1\kappa_{N-1} for bN1b_{N-1}, πN1\pi_{N-1}, and δN12\delta_{N-1}^{2} can be expressed as

κN1=θ~NαN1βN1πN1σN12(ϕN1+θ~NβN12)(πN12σN12+δN12)0.\displaystyle\kappa_{N-1}=-\frac{\tilde{\theta}_{N}\alpha_{N-1}\beta_{N-1}\pi_{N-1}\sigma_{N-1}^{2}}{(\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2})(\pi_{N-1}^{2}\sigma_{N-1}^{2}+\delta_{N-1}^{2})}\not=0. (61)

Given bN1b_{N-1}, πN10\pi_{N-1}\not=0, and κN10\kappa_{N-1}\not=0, the Q-function QN1NQ_{N-1}^{N} is a decreasing function of δN12\delta_{N-1}^{2}. The best response of the adversary in terms of δN12\delta_{N-1}^{2} for bN1b_{N-1} and πN1\pi_{N-1} is thus

δN12=πN12σN12λ1.\displaystyle\delta_{N-1}^{2}=\frac{\pi_{N-1}^{2}\sigma_{N-1}^{2}}{\lambda-1}. (62)

Consequently, it is sufficient to consider the Q-function

QN1N\displaystyle Q_{N-1}^{N} (bN1,πN1,κN1)\displaystyle(b_{N-1},\pi_{N-1},\kappa_{N-1})
=\displaystyle= (θN1+θ~NαN12θ~N2αN12βN12ϕN1+θ~NβN12)μN12\displaystyle\,-\left(\theta_{N-1}+\tilde{\theta}_{N}\alpha_{N-1}^{2}-\frac{\tilde{\theta}_{N}^{2}\alpha_{N-1}^{2}\beta_{N-1}^{2}}{\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2}}\right)\mu_{N-1}^{2}
(θN1+θ^NαN12)σN12θ^NωN12\displaystyle\,-(\theta_{N-1}+\hat{\theta}_{N}\alpha_{N-1}^{2})\sigma_{N-1}^{2}-\hat{\theta}_{N}\omega_{N-1}^{2}
+(θ^Nθ~N)αN12λ1λσN12\displaystyle\,+(\hat{\theta}_{N}-\tilde{\theta}_{N})\alpha_{N-1}^{2}\frac{\lambda-1}{\lambda}\sigma_{N-1}^{2}
(ϕN1+θ~NβN12)κN12λλ1πN12σN12\displaystyle\,-(\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2})\kappa_{N-1}^{2}\frac{\lambda}{\lambda-1}\pi_{N-1}^{2}\sigma_{N-1}^{2}
2θ~NαN1βN1πN1κN1σN12.\displaystyle\,-2\tilde{\theta}_{N}\alpha_{N-1}\beta_{N-1}\pi_{N-1}\kappa_{N-1}\sigma_{N-1}^{2}. (63)

The pure strategies (gN1,fN1)(g_{N-1}^{*},f_{N-1}^{*}) form an SPE if πN1=gN1(bN1)\pi_{N-1}^{*}=g_{N-1}^{*}(b_{N-1}) and κN1=fN1(bN1)\kappa_{N-1}^{*}=f_{N-1}^{*}(b_{N-1}) satisfy

πN1=argminεπN1ε,πN10QN1N(bN1,πN1,κN1),\displaystyle\pi_{N-1}^{*}=\arg\min_{\varepsilon^{\prime}\leq\pi_{N-1}\leq\varepsilon,\pi_{N-1}\not=0}Q_{N-1}^{N}(b_{N-1},\pi_{N-1},\kappa_{N-1}^{*}), (64)
κN1=argmaxκN1QN1N(bN1,πN1,κN1).\displaystyle\kappa_{N-1}^{*}=\arg\max_{\kappa_{N-1}\in\mathbb{R}}Q_{N-1}^{N}(b_{N-1},\pi_{N-1}^{*},\kappa_{N-1}). (65)

If ε=ε0\varepsilon^{\prime}=\varepsilon\not=0, πN1=ε=ε=gN1(bN1)\pi_{N-1}^{*}=\varepsilon=\varepsilon^{\prime}=g_{N-1}^{*}(b_{N-1}) is a dominant adversarial strategy. Therefore, the SPE must exist. The pure strategies (gN1,fN1)(g_{N-1}^{*},f_{N-1}^{*}) can be obtained by substituting πN1=ε\pi_{N-1}^{*}=\varepsilon into (59), (61), and (62) as

δN12=gN1(bN1)=ε2σN12λ1;\displaystyle\delta_{N-1}^{2*}=g_{N-1}^{*}(b_{N-1})=\frac{\varepsilon^{2}\sigma_{N-1}^{2}}{\lambda-1};
κN1=fN1(bN1)=θ~NαN1βN1(λ1)(ϕN1+θ~NβN12)λε;\displaystyle\kappa_{N-1}^{*}=f_{N-1}^{*}(b_{N-1})=-\frac{\tilde{\theta}_{N}\alpha_{N-1}\beta_{N-1}(\lambda-1)}{(\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2})\lambda\varepsilon};
ρN1=fN1(bN1)=θ~NαN1βN1μN1(ϕN1+θ~NβN12)λ.\displaystyle\rho_{N-1}^{*}=f_{N-1}^{*}(b_{N-1})=-\frac{\tilde{\theta}_{N}\alpha_{N-1}\beta_{N-1}\mu_{N-1}}{(\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2})\lambda}.

The value function VN1N(bN1)V_{N-1}^{N}(b_{N-1}) can be obtained by substituting πN1\pi_{N-1}^{*} and κN1\kappa_{N-1}^{*} into the Q-function (63) as

VN1N(bN1)=θ~N1μN12θ^N1σN12θ^NωN12.\displaystyle V_{N-1}^{N}(b_{N-1})=-\tilde{\theta}_{N-1}\mu_{N-1}^{2}-\hat{\theta}_{N-1}\sigma_{N-1}^{2}-\hat{\theta}_{N}\omega_{N-1}^{2}.

Let us now consider the case εε\varepsilon^{\prime}\not=\varepsilon. Assume that there exists an SPE with πN1=gN1(bN1)0\pi_{N-1}^{*}=g_{N-1}^{*}(b_{N-1})\not=0. As the best response, solving (65) leads to κN1=θ~NαN1βN1(λ1)(ϕN1+θ~NβN12)λπN1\kappa_{N-1}^{*}=-\frac{\tilde{\theta}_{N}\alpha_{N-1}\beta_{N-1}(\lambda-1)}{(\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2})\lambda\pi_{N-1}^{*}}. For all πN1πN1\pi_{N-1}\not=\pi_{N-1}^{*} and πN10\pi_{N-1}\not=0 we have

QN1N(bN1,πN1,κN1)>QN1N(bN1,πN1,κN1).\displaystyle Q_{N-1}^{N}(b_{N-1},\pi_{N-1}^{*},\kappa_{N-1}^{*})>Q_{N-1}^{N}(b_{N-1},\pi_{N-1},\kappa_{N-1}^{*}).

Thus, condition (64) cannot hold and hence the assumption is not true, i.e., there is no pure strategy SPE in this case.

In the case of ε=ε0\varepsilon^{\prime}=\varepsilon\not=0, Theorem 1 and Corollary 1 can be justified in the remaining stages of the backward dynamic programming by using the same analysis. ∎

-B Proofs of Theorem 3 and Corollary 3

Proof:

We prove the result by verifying that the given strategies form an SPE. The dominant pure strategy of the agent and the value function in the final stage are as shown in the proofs of Theorem 1 and Corollary 1. Note that any behavioral adversarial strategy satisfying (13)-(14) can serve as g_N^* since it has no impact on the agent reward. Therefore, Theorem 3 and Corollary 3 hold in the final stage.

For stage N1N-1, we first show that it is sufficient to consider a pure agent strategy with an affine form. A general behavioral agent strategy fN1f_{N-1} decides an action aN1a_{N-1} based on the belief bN1b_{N-1} and the observation s^N1\hat{s}_{N-1} with the probability measure fN1(aN1|bN1,s^N1)f_{N-1}(a_{N-1}|b_{N-1},\hat{s}_{N-1}). Given a belief bN1b_{N-1}, a behavioral adversarial strategy gN1g_{N-1}, an observation s^N1\hat{s}_{N-1}, and an action aN1a_{N-1} from the support set of a behavioral agent strategy fN1f_{N-1}, the Q-function is

QN1N\displaystyle Q_{N-1}^{N} (bN1,gN1,s^N1,aN1)\displaystyle(b_{N-1},g_{N-1},\hat{s}_{N-1},a_{N-1})
=\displaystyle= θN1EbN1(SN12)ϕN1aN12\displaystyle\,-\theta_{N-1}E_{b_{N-1}}(S_{N-1}^{2})-\phi_{N-1}a_{N-1}^{2}
θ~NEgN1(Λμ(bN1,ΠN1,ΔN12,s^N1,aN1))2\displaystyle\,-\tilde{\theta}_{N}E_{g_{N-1}}(\Lambda_{\mu}(b_{N-1},\Pi_{N-1},\Delta_{N-1}^{2},\hat{s}_{N-1},a_{N-1}))^{2}
θˇNEgN1(Λν(bN1,ΠN1,ΔN12))\displaystyle\,-\check{\theta}_{N}E_{g_{N-1}}(\Lambda_{\nu}(b_{N-1},\Pi_{N-1},\Delta_{N-1}^{2}))
=\displaystyle= θN1(μN12+σN12)(ϕN1+θ~NβN12)aN12\displaystyle\,-\theta_{N-1}(\mu_{N-1}^{2}+\sigma_{N-1}^{2})-(\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2})a_{N-1}^{2}
2θ~NEgN1(ΠN1σN12s^N1+μN1ΔN12ΠN12σN12+ΔN12)\displaystyle\,-2\tilde{\theta}_{N}E_{g_{N-1}}\left(\frac{\Pi_{N-1}\sigma_{N-1}^{2}\hat{s}_{N-1}+\mu_{N-1}\Delta_{N-1}^{2}}{\Pi_{N-1}^{2}\sigma_{N-1}^{2}+\Delta_{N-1}^{2}}\right)
αN1βN1aN1\displaystyle\quad\;\alpha_{N-1}\beta_{N-1}a_{N-1}
θ~NEgN1(ΠN1σN12s^N1+μN1ΔN12ΠN12σN12+ΔN12)2αN12\displaystyle\,-\tilde{\theta}_{N}E_{g_{N-1}}\left(\frac{\Pi_{N-1}\sigma_{N-1}^{2}\hat{s}_{N-1}+\mu_{N-1}\Delta_{N-1}^{2}}{\Pi_{N-1}^{2}\sigma_{N-1}^{2}+\Delta_{N-1}^{2}}\right)^{2}\alpha_{N-1}^{2}
θˇNEgN1(αN12σN12ΔN12ΠN12σN12+ΔN12)θˇNωN12,\displaystyle\,-\check{\theta}_{N}E_{g_{N-1}}\left(\frac{\alpha_{N-1}^{2}\sigma_{N-1}^{2}\Delta_{N-1}^{2}}{\Pi_{N-1}^{2}\sigma_{N-1}^{2}+\Delta_{N-1}^{2}}\right)-\check{\theta}_{N}\omega_{N-1}^{2}, (66)

which is a concave quadratic function of aN1a_{N-1}. As the best response to maximize the agent reward, the support set of the behavioral agent strategy is a singleton, i.e., it is sufficient to use a pure agent strategy, which has an affine form as

aN1=\displaystyle a_{N-1}= θ~NαN1βN1EgN1(ΠN1σN12ΠN12σN12+ΔN12)ϕN1+θ~NβN12s^N1\displaystyle\,-\frac{\tilde{\theta}_{N}\alpha_{N-1}\beta_{N-1}E_{g_{N-1}}\left(\frac{\Pi_{N-1}\sigma_{N-1}^{2}}{\Pi_{N-1}^{2}\sigma_{N-1}^{2}+\Delta_{N-1}^{2}}\right)}{\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2}}\hat{s}_{N-1}
θ~NαN1βN1EgN1(ΔN12ΠN12σN12+ΔN12)ϕN1+θ~NβN12μN1.\displaystyle\,-\frac{\tilde{\theta}_{N}\alpha_{N-1}\beta_{N-1}E_{g_{N-1}}\left(\frac{\Delta_{N-1}^{2}}{\Pi_{N-1}^{2}\sigma_{N-1}^{2}+\Delta_{N-1}^{2}}\right)}{\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2}}\mu_{N-1}. (67)

Assume that an SPE in behavioral strategies consists of κN1=fN1(bN1)=0\kappa_{N-1}^{*}=f_{N-1}^{*}(b_{N-1})=0 and ρN1=fN1(bN1)=θ~NαN1βN1μN1ϕN1+θ~NβN12\rho_{N-1}^{*}=f_{N-1}^{*}(b_{N-1})=-\frac{\tilde{\theta}_{N}\alpha_{N-1}\beta_{N-1}\mu_{N-1}}{\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2}}. Given bN1b_{N-1}, (πN1,δN12)(\pi_{N-1},\delta_{N-1}^{2}) in the support set of a behavioral adversarial strategy gN1g_{N-1}, κN1\kappa_{N-1}^{*}, and ρN1\rho_{N-1}^{*}, we have the following Q-function:

QN1N\displaystyle Q_{N-1}^{N} (bN1,πN1,δN12,κN1,ρN1)\displaystyle(b_{N-1},\pi_{N-1},\delta_{N-1}^{2},\kappa_{N-1}^{*},\rho_{N-1}^{*})
=\displaystyle= (θN1+θ~NαN12θ~N2αN12βN12ϕN1+θ~NβN12)μN12\displaystyle\,-\left(\theta_{N-1}+\tilde{\theta}_{N}\alpha_{N-1}^{2}-\frac{\tilde{\theta}_{N}^{2}\alpha_{N-1}^{2}\beta_{N-1}^{2}}{\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2}}\right)\mu_{N-1}^{2}
(θN1+θˇNαN12)σN12θˇNωN12\displaystyle\,-(\theta_{N-1}+\check{\theta}_{N}\alpha_{N-1}^{2})\sigma_{N-1}^{2}-\check{\theta}_{N}\omega_{N-1}^{2}
+(θˇNθ~N)αN12πN12σN12πN12σN12+δN12σN12.\displaystyle\,+(\check{\theta}_{N}-\tilde{\theta}_{N})\alpha_{N-1}^{2}\frac{\pi_{N-1}^{2}\sigma_{N-1}^{2}}{\pi_{N-1}^{2}\sigma_{N-1}^{2}+\delta_{N-1}^{2}}\sigma_{N-1}^{2}. (68)

From Property 3 and the adversarial constraints (13)-(14), we have

(πN10,δN12=πN12σN12λ1)\displaystyle\left(\pi_{N-1}\not=0,\delta_{N-1}^{2}=\frac{\pi_{N-1}^{2}\sigma_{N-1}^{2}}{\lambda-1}\right)
=argmin(πN1,δN12)QN1N(bN1,πN1,δN12,κN1,ρN1).\displaystyle\;=\arg\min_{\left(\pi_{N-1},\delta_{N-1}^{2}\right)}Q_{N-1}^{N}(b_{N-1},\pi_{N-1},\delta_{N-1}^{2},\kappa_{N-1}^{*},\rho_{N-1}^{*}). (69)

Therefore, any behavioral adversarial strategy is the best response of fN1f_{N-1}^{*} if its support set consists of two or more elements of (πN10,δN12=πN12σN12λ1)\left(\pi_{N-1}\not=0,\delta_{N-1}^{2}=\frac{\pi_{N-1}^{2}\sigma_{N-1}^{2}}{\lambda-1}\right).

Assume that an SPE consists of a behavioral adversarial strategy gN1(|bN1)g_{N-1}^{*}(\cdot|b_{N-1}), which is defined on a support set containing two or more elements of (πN10,δN12=πN12σN12λ1)\left(\pi_{N-1}\not=0,\delta_{N-1}^{2}=\frac{\pi_{N-1}^{2}\sigma_{N-1}^{2}}{\lambda-1}\right), and satisfies EgN1(ΠN1)=0E_{g_{N-1}^{*}}(\Pi_{N-1})=0. Given bN1b_{N-1}, gN1g_{N-1}^{*}, κN1\kappa_{N-1}, and ρN1\rho_{N-1}, we have the following Q-function:

\begin{align}
Q_{N-1}^{N}&(b_{N-1},g_{N-1}^{*},\kappa_{N-1},\rho_{N-1})\nonumber\\
=&\,-(\theta_{N-1}+\tilde{\theta}_{N}\alpha_{N-1}^{2})\mu_{N-1}^{2}-\check{\theta}_{N}\omega_{N-1}^{2}\nonumber\\
&\,-\left(\theta_{N-1}+\check{\theta}_{N}\alpha_{N-1}^{2}-(\check{\theta}_{N}-\tilde{\theta}_{N})\alpha_{N-1}^{2}\frac{\lambda-1}{\lambda}\right)\sigma_{N-1}^{2}\nonumber\\
&\,-(\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2})E_{g_{N-1}^{*}}(\Pi_{N-1}^{2})\left(\mu_{N-1}^{2}+\frac{\lambda}{\lambda-1}\sigma_{N-1}^{2}\right)\kappa_{N-1}^{2}\nonumber\\
&\,-(\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2})\rho_{N-1}^{2}\nonumber\\
&\,-2\tilde{\theta}_{N}\alpha_{N-1}\beta_{N-1}\mu_{N-1}\rho_{N-1}. \tag{70}
\end{align}

Since (70) is a concave quadratic function of $(\kappa_{N-1},\rho_{N-1})$, the best response to $g_{N-1}^{*}$ is

\begin{align}
&\left(\kappa_{N-1}=0,\;\rho_{N-1}=-\frac{\tilde{\theta}_{N}\alpha_{N-1}\beta_{N-1}\mu_{N-1}}{\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2}}\right)\nonumber\\
&\;=\arg\max_{(\kappa_{N-1},\rho_{N-1})}Q_{N-1}^{N}(b_{N-1},g_{N-1}^{*},\kappa_{N-1},\rho_{N-1}). \tag{71}
\end{align}

It follows from (69) and (71) that a behavioral adversarial strategy $g_{N-1}^{*}(\cdot|b_{N-1})$, which is defined on a support set containing two or more elements of $\left\{(\pi_{N-1},\delta_{N-1}^{2}):\pi_{N-1}\not=0,\;\delta_{N-1}^{2}=\frac{\pi_{N-1}^{2}\sigma_{N-1}^{2}}{\lambda-1}\right\}$ and satisfies $E_{g_{N-1}^{*}}(\Pi_{N-1})=0$, and a pure agent strategy $(\kappa_{N-1}^{*},\rho_{N-1}^{*})=f_{N-1}^{*}(b_{N-1})=\left(0,-\frac{\tilde{\theta}_{N}\alpha_{N-1}\beta_{N-1}\mu_{N-1}}{\phi_{N-1}+\tilde{\theta}_{N}\beta_{N-1}^{2}}\right)$ form an SPE in stage $N-1$. Furthermore, the value function in this stage is

\begin{align}
V_{N-1}^{N}(b_{N-1})=-\tilde{\theta}_{N-1}\mu_{N-1}^{2}-\check{\theta}_{N-1}\sigma_{N-1}^{2}-\check{\theta}_{N}\omega_{N-1}^{2}.\nonumber
\end{align}

Thus, Theorem 3 and Corollary 42 hold in stage $N-1$.

In the remaining stages of the backward dynamic programming, the same analysis as in stage $N-1$ verifies that the SPE in behavioral strategies from Theorem 3 and the value function from Corollary 42 hold. ∎

-C Proofs of Theorems 5 and 6

Proof:

To prove Theorems 5 and 6, it is sufficient to consider a two-stage problem, i.e., $N=2$. The solution of the final stage is the same as in the proofs of Theorem 3 and Corollary 42 and is therefore omitted here; Theorems 5 and 6 hold in the final stage. Furthermore, as shown in the proofs of Theorem 3 and Corollary 42, it is sufficient to consider a pure agent strategy of affine form in the first stage.

Given $b_{1}$, $(\pi_{1},\delta_{1}^{2})$ in the support set of a behavioral adversarial strategy $g_{1}$, $\kappa_{1}$, and $\rho_{1}$, the Q-function is

\begin{align}
Q_{1}^{2}&(b_{1},\pi_{1},\delta_{1}^{2},\kappa_{1},\rho_{1})\nonumber\\
=&\,-(\theta_{1}+\theta_{2}\alpha_{1}^{2})\mu_{1}^{2}-\theta_{2}\omega_{1}^{2}-(\theta_{1}+\theta_{2}\alpha_{1}^{2})\sigma_{1}^{2}\nonumber\\
&\,-(\phi_{1}+\theta_{2}\beta_{1}^{2})(\pi_{1}\kappa_{1}\mu_{1}+\rho_{1})^{2}-2\theta_{2}\alpha_{1}\beta_{1}\mu_{1}(\pi_{1}\kappa_{1}\mu_{1}+\rho_{1})\nonumber\\
&\,-(\phi_{1}+\theta_{2}\beta_{1}^{2})\kappa_{1}^{2}(\pi_{1}^{2}\sigma_{1}^{2}+\delta_{1}^{2})-2\theta_{2}\alpha_{1}\beta_{1}\pi_{1}\kappa_{1}\sigma_{1}^{2}. \tag{72}
\end{align}

This Q-function is non-increasing in $\delta_{1}^{2}$ for any given $b_{1}$, $\pi_{1}$, $\kappa_{1}$, and $\rho_{1}$, since the coefficient of $\delta_{1}^{2}$ in (72) is $-(\phi_{1}+\theta_{2}\beta_{1}^{2})\kappa_{1}^{2}\leq 0$. Hence, for the best response that minimizes the agent reward, it is sufficient to consider a behavioral adversarial strategy defined on a non-singleton support set of $\left\{(\pi_{1},\delta_{1}^{2}):\pi_{1}\not=0,\;\delta_{1}^{2}=\frac{\pi_{1}^{2}\sigma_{1}^{2}}{\lambda-1}\right\}$.

Given $b_{1}$, $g_{1}$ with a non-singleton support set of $\left\{(\pi_{1},\delta_{1}^{2}):\pi_{1}\not=0,\;\delta_{1}^{2}=\frac{\pi_{1}^{2}\sigma_{1}^{2}}{\lambda-1}\right\}$, $\kappa_{1}$, and $\rho_{1}$, the Q-function is

\begin{align}
Q_{1}^{2}&(b_{1},g_{1},\kappa_{1},\rho_{1})\nonumber\\
=&\,-(\theta_{1}+\theta_{2}\alpha_{1}^{2})\mu_{1}^{2}-\theta_{2}\omega_{1}^{2}-(\theta_{1}+\theta_{2}\alpha_{1}^{2})\sigma_{1}^{2}\nonumber\\
&\,-(\phi_{1}+\theta_{2}\beta_{1}^{2})E_{g_{1}}(\Pi_{1}^{2})\left(\mu_{1}^{2}+\frac{\lambda}{\lambda-1}\sigma_{1}^{2}\right)\kappa_{1}^{2}\nonumber\\
&\,-2(\phi_{1}+\theta_{2}\beta_{1}^{2})E_{g_{1}}(\Pi_{1})\mu_{1}\kappa_{1}\rho_{1}-(\phi_{1}+\theta_{2}\beta_{1}^{2})\rho_{1}^{2}\nonumber\\
&\,-2\theta_{2}\alpha_{1}\beta_{1}E_{g_{1}}(\Pi_{1})(\mu_{1}^{2}+\sigma_{1}^{2})\kappa_{1}-2\theta_{2}\alpha_{1}\beta_{1}\mu_{1}\rho_{1}. \tag{73}
\end{align}

This is a concave quadratic function of $\rho_{1}$ when $b_{1}$, $g_{1}$, and $\kappa_{1}$ are fixed. For the best response that maximizes the agent reward, we can substitute $\rho_{1}$ with

\begin{align}
\rho_{1}=-E_{g_{1}}(\Pi_{1})\mu_{1}\kappa_{1}-\frac{\theta_{2}\alpha_{1}\beta_{1}\mu_{1}}{\phi_{1}+\theta_{2}\beta_{1}^{2}}. \tag{74}
\end{align}
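
For completeness, (74) is the stationary point of (73) in $\rho_{1}$; a brief sketch of this omitted step is

\begin{align}
\frac{\partial Q_{1}^{2}}{\partial\rho_{1}}=-2(\phi_{1}+\theta_{2}\beta_{1}^{2})E_{g_{1}}(\Pi_{1})\mu_{1}\kappa_{1}-2(\phi_{1}+\theta_{2}\beta_{1}^{2})\rho_{1}-2\theta_{2}\alpha_{1}\beta_{1}\mu_{1}=0\;\Longrightarrow\;\rho_{1}=-E_{g_{1}}(\Pi_{1})\mu_{1}\kappa_{1}-\frac{\theta_{2}\alpha_{1}\beta_{1}\mu_{1}}{\phi_{1}+\theta_{2}\beta_{1}^{2}}.\nonumber
\end{align}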

Then the Q-function (73) reduces to

\begin{align}
Q_{1}^{2}&(b_{1},g_{1},\kappa_{1})\nonumber\\
=&\,-\left(\theta_{1}+\theta_{2}\alpha_{1}^{2}-\frac{\theta_{2}^{2}\alpha_{1}^{2}\beta_{1}^{2}}{\phi_{1}+\theta_{2}\beta_{1}^{2}}\right)\mu_{1}^{2}-\theta_{2}\omega_{1}^{2}-(\theta_{1}+\theta_{2}\alpha_{1}^{2})\sigma_{1}^{2}\nonumber\\
&\,-(\phi_{1}+\theta_{2}\beta_{1}^{2})\left(E_{g_{1}}(\Pi_{1}^{2})\left(\mu_{1}^{2}+\frac{\lambda}{\lambda-1}\sigma_{1}^{2}\right)-E_{g_{1}}^{2}(\Pi_{1})\mu_{1}^{2}\right)\kappa_{1}^{2}\nonumber\\
&\,-2\theta_{2}\alpha_{1}\beta_{1}E_{g_{1}}(\Pi_{1})\sigma_{1}^{2}\kappa_{1}. \tag{75}
\end{align}

This is also a concave quadratic function of $\kappa_{1}$ when $b_{1}$ and $g_{1}$ are fixed. For the best response that maximizes the agent reward, we can substitute $\kappa_{1}$ with

\begin{align}
\kappa_{1}=\frac{-\theta_{2}\alpha_{1}\beta_{1}E_{g_{1}}(\Pi_{1})\sigma_{1}^{2}}{(\phi_{1}+\theta_{2}\beta_{1}^{2})\left(E_{g_{1}}(\Pi_{1}^{2})\left(\mu_{1}^{2}+\frac{\lambda}{\lambda-1}\sigma_{1}^{2}\right)-E_{g_{1}}^{2}(\Pi_{1})\mu_{1}^{2}\right)}. \tag{76}
\end{align}
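
Analogously, (76) is the stationary point of (75) in $\kappa_{1}$; as a sketch of the omitted step,

\begin{align}
\frac{\partial Q_{1}^{2}}{\partial\kappa_{1}}=-2(\phi_{1}+\theta_{2}\beta_{1}^{2})\left(E_{g_{1}}(\Pi_{1}^{2})\left(\mu_{1}^{2}+\frac{\lambda}{\lambda-1}\sigma_{1}^{2}\right)-E_{g_{1}}^{2}(\Pi_{1})\mu_{1}^{2}\right)\kappa_{1}-2\theta_{2}\alpha_{1}\beta_{1}E_{g_{1}}(\Pi_{1})\sigma_{1}^{2}=0,\nonumber
\end{align}

which gives (76).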

Since we consider $0\leq\varepsilon^{\prime}<\varepsilon$ or $\varepsilon^{\prime}<\varepsilon\leq 0$ and the behavioral adversarial strategy has a non-singleton support set, $E_{g_{1}}(\Pi_{1})\not=0$ and $\kappa_{1}\not=0$ in these cases.

We then study the support set of a behavioral adversarial strategy. Given $b_{1}$, $\left(\pi_{1}\not=0,\delta_{1}^{2}=\frac{\pi_{1}^{2}\sigma_{1}^{2}}{\lambda-1}\right)$ in the support set of a behavioral adversarial strategy $g_{1}$, $\kappa_{1}\not=0$, and $\rho_{1}$, the Q-function is

\begin{align}
Q_{1}^{2}&(b_{1},\pi_{1},\kappa_{1},\rho_{1})\nonumber\\
=&\,-(\theta_{1}+\theta_{2}\alpha_{1}^{2})\mu_{1}^{2}-\theta_{2}\omega_{1}^{2}-(\theta_{1}+\theta_{2}\alpha_{1}^{2})\sigma_{1}^{2}\nonumber\\
&\,-(\phi_{1}+\theta_{2}\beta_{1}^{2})(\pi_{1}\kappa_{1}\mu_{1}+\rho_{1})^{2}-2\theta_{2}\alpha_{1}\beta_{1}\mu_{1}(\pi_{1}\kappa_{1}\mu_{1}+\rho_{1})\nonumber\\
&\,-(\phi_{1}+\theta_{2}\beta_{1}^{2})\kappa_{1}^{2}\frac{\lambda}{\lambda-1}\pi_{1}^{2}\sigma_{1}^{2}-2\theta_{2}\alpha_{1}\beta_{1}\pi_{1}\kappa_{1}\sigma_{1}^{2}. \tag{77}
\end{align}

This is a concave quadratic function of $\pi_{1}$ given $b_{1}$, $\kappa_{1}\not=0$, and $\rho_{1}$, so its minimum over the feasible range of $\pi_{1}$ is attained at an endpoint. As the best response that minimizes the agent reward, it is therefore sufficient to consider a behavioral adversarial strategy $g_{1}$ with the support set $\left\{\left(\pi_{1}=\varepsilon^{\prime},\delta_{1}^{2}=\frac{\varepsilon^{\prime 2}\sigma_{1}^{2}}{\lambda-1}\right),\left(\pi_{1}=\varepsilon,\delta_{1}^{2}=\frac{\varepsilon^{2}\sigma_{1}^{2}}{\lambda-1}\right)\right\}$.

When $0=\varepsilon^{\prime}<\varepsilon$ or $\varepsilon^{\prime}<\varepsilon=0$, an SPE in behavioral strategies does not exist, since $\varepsilon^{\prime}=0$ or $\varepsilon=0$ would lead to a singleton support set of the behavioral adversarial strategy; meanwhile, a pure-strategy SPE does not exist since $\varepsilon^{\prime}\not=\varepsilon$. This proves Theorem 5.

When $0<\varepsilon^{\prime}<\varepsilon$ or $\varepsilon^{\prime}<\varepsilon<0$, we assume that an SPE in behavioral strategies consists of

\begin{align}
&g_{1}^{*}\left(\pi_{1}=\varepsilon^{\prime},\left.\delta_{1}^{2}=\frac{\varepsilon^{\prime 2}\sigma_{1}^{2}}{\lambda-1}\right|b_{1}\right)=p^{*};\nonumber\\
&g_{1}^{*}\left(\pi_{1}=\varepsilon,\left.\delta_{1}^{2}=\frac{\varepsilon^{2}\sigma_{1}^{2}}{\lambda-1}\right|b_{1}\right)=1-p^{*};\nonumber\\
&\kappa_{1}^{*}=f_{1}^{*}(b_{1})=\frac{-\theta_{2}\alpha_{1}\beta_{1}E_{g_{1}^{*}}(\Pi_{1})\sigma_{1}^{2}}{(\phi_{1}+\theta_{2}\beta_{1}^{2})\left(E_{g_{1}^{*}}(\Pi_{1}^{2})\left(\mu_{1}^{2}+\frac{\lambda}{\lambda-1}\sigma_{1}^{2}\right)-E_{g_{1}^{*}}^{2}(\Pi_{1})\mu_{1}^{2}\right)};\nonumber\\
&\rho_{1}^{*}=f_{1}^{*}(b_{1})=-E_{g_{1}^{*}}(\Pi_{1})\mu_{1}\kappa_{1}^{*}-\frac{\theta_{2}\alpha_{1}\beta_{1}\mu_{1}}{\phi_{1}+\theta_{2}\beta_{1}^{2}}.\nonumber
\end{align}

Since the assumed $f_{1}^{*}$ is the best response to the assumed $g_{1}^{*}$, we only need to verify that there exists $0<p^{*}<1$ such that $g_{1}^{*}$ is the best response to $f_{1}^{*}$, i.e., both $\pi_{1}=\varepsilon^{\prime}$ and $\pi_{1}=\varepsilon$ are minimizers of the Q-function $Q_{1}^{2}(b_{1},\pi_{1},\kappa_{1}^{*},\rho_{1}^{*})$. There is a unique solution

\begin{align}
p^{*}=\frac{\varepsilon}{\varepsilon^{\prime}+\varepsilon}. \tag{78}
\end{align}
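
To see where (78) comes from (a sketch of the omitted computation, assuming the adversarial constraint restricts $\pi_{1}$ to $[\varepsilon^{\prime},\varepsilon]$): substituting $\kappa_{1}^{*}$ and $\rho_{1}^{*}$ into (77) gives a concave quadratic in $\pi_{1}$ whose maximizer is $E_{g_{1}^{*}}(\Pi_{1}^{2})/E_{g_{1}^{*}}(\Pi_{1})$, so both endpoints $\varepsilon^{\prime}$ and $\varepsilon$ minimize it over $[\varepsilon^{\prime},\varepsilon]$ exactly when this vertex lies at the midpoint, i.e.,

\begin{align}
\frac{E_{g_{1}^{*}}(\Pi_{1}^{2})}{E_{g_{1}^{*}}(\Pi_{1})}=\frac{\varepsilon^{\prime}+\varepsilon}{2}\;\Longleftrightarrow\;p^{*}\varepsilon^{\prime 2}+(1-p^{*})\varepsilon^{2}=\varepsilon^{\prime}\varepsilon\;\Longleftrightarrow\;p^{*}=\frac{\varepsilon}{\varepsilon^{\prime}+\varepsilon}.\nonumber
\end{align}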

Therefore, there is a unique SPE in behavioral strategies for the two-stage ALQG game. The strategies and the value function of the SPE in Theorem 6 then follow directly. ∎
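
As an illustrative numerical sanity check of the first-stage equilibrium (not part of the proof; all parameter values and names below are hypothetical), the following Python sketch evaluates the Q-function (77) at $\pi_{1}=\varepsilon^{\prime}$, $\pi_{1}=\varepsilon$, and the midpoint, using the best responses (74), (76) and the mixing probability (78); the two endpoint values should coincide and be no larger than the interior value.

```python
# Hypothetical parameter values for illustration only.
theta1, theta2, alpha1, beta1, phi1 = 1.0, 2.0, 0.9, 0.8, 1.5
mu1, sigma2, omega2, lam = 0.5, 2.0, 1.0, 4.0
eps_p, eps = 0.4, 1.2          # 0 < eps' < eps

p_star = eps / (eps_p + eps)                      # equation (78)
m = p_star * eps_p + (1 - p_star) * eps           # E_{g1*}[Pi_1]
s = p_star * eps_p**2 + (1 - p_star) * eps**2     # E_{g1*}[Pi_1^2]

D = phi1 + theta2 * beta1**2
T = mu1**2 + lam / (lam - 1) * sigma2
kappa = -theta2 * alpha1 * beta1 * m * sigma2 / (D * (s * T - m**2 * mu1**2))  # (76)
rho = -m * mu1 * kappa - theta2 * alpha1 * beta1 * mu1 / D                      # (74)

def Q(pi):
    # Equation (77), with delta^2 = pi^2 sigma^2 / (lambda - 1) already substituted.
    return (-(theta1 + theta2 * alpha1**2) * mu1**2 - theta2 * omega2
            - (theta1 + theta2 * alpha1**2) * sigma2
            - D * (pi * kappa * mu1 + rho)**2
            - 2 * theta2 * alpha1 * beta1 * mu1 * (pi * kappa * mu1 + rho)
            - D * kappa**2 * lam / (lam - 1) * pi**2 * sigma2
            - 2 * theta2 * alpha1 * beta1 * pi * kappa * sigma2)

assert abs(Q(eps_p) - Q(eps)) < 1e-9             # the adversary is indifferent
assert Q(0.5 * (eps_p + eps)) >= Q(eps) - 1e-9   # an interior pi is not better for the adversary
```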

References

  • [1] V. Mnih et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, pp. 529-533, 2015.
  • [2] S. Huang, N. Papernot, I. Goodfellow, Y. Duan, and P. Abbeel, “Adversarial attacks on neural network policies,” arXiv:1702.02284.
  • [3] Y.-C. Lin, Z.-W. Hong, Y.-H. Liao, M.-L. Shih, M.-Y. Liu, and M. Sun, “Tactics of adversarial attack on deep reinforcement learning agents,” in Proc. of IJCAI, 2017.
  • [4] V. Behzadan and A. Munir, “Vulnerability of deep reinforcement learning to policy induction attacks,” in Proc. of MLDM, 2017, pp. 262-275.
  • [5] A. Russo and A. Proutiere, “Optimal attacks on reinforcement learning policies,” arXiv:1907.13548.
  • [6] T. Osogami, “Robust partially observable Markov decision process,” in Proc. of ICML, 2015.
  • [7] L. Pinto, J. Davidson, R. Sukthankar, and A. Gupta, “Robust adversarial reinforcement learning,” in Proc. of ICML, 2017.
  • [8] A. Gleave, M. Dennis, C. Wild, N. Kant, S. Levine, and S. Russell, “Adversarial policies: Attacking deep reinforcement learning,” arXiv:1905.10615.
  • [9] K. Horak, Q. Zhu, and B. Bosansky, “Manipulating adversary’s belief: A dynamic game approach to deception by design for proactive network security,” in Proc. of GameSec, 2017.
  • [10] S. Saritas, S. Yuksel, and S. Gezici, “Nash and Stackelberg equilibria for dynamic cheap talk and signaling games,” in Proc. of ACC, 2017.
  • [11] S. Saritas, E. Shereen, H. Sandberg, and G. Dán, “Adversarial attacks on continuous authentication security: A dynamic game approach,” in Proc. of GameSec, 2019.
  • [12] Z. Li and G. Dán, “Dynamic cheap talk for robust adversarial learning,” in Proc. of GameSec, 2019.
  • [13] M.O. Sayin and T. Basar, “Secure sensor design for cyber-physical systems against advanced persistent threats,” in Proc. of GameSec, 2017.
  • [14] M.O. Sayin, E. Akyol, and T. Basar, “Hierarchical multistage Gaussian signaling games in noncooperative communication and control systems,” Automatica, vol. 107, pp. 9-20, 2019.
  • [15] R. Zhang and P. Venkitasubramaniam, “Stealthy control signal attacks in linear quadratic Gaussian control systems: Detectability reward tradeoff,” IEEE Transactions on Information Forensics and Security, vol. 12, no. 7, pp. 1555-1570, 2017.
  • [16] Q. Zhang, K. Liu, Y. Xia, and A. Ma, “Optimal stealthy deception attack against cyber-physical systems,” IEEE Transactions on Cybernetics, 2020.
  • [17] Y. Chen, S. Kar, and J.M.F. Moura, “Cyber physical attacks constrained by control objectives,” in Proc. of ACC, 2016, pp. 1185-1190.
  • [18] V.P. Crawford and J. Sobel, “Strategic information transmission,” Econometrica, vol. 50, no. 6, pp. 1431-1451, 1982.
  • [19] L. Shapley, “Stochastic games,” Proc. of the National Academy of Sciences, vol. 39, no. 10, pp. 1095-1100, 1953.
  • [20] T. Soderstrom, Discrete-Time Stochastic Systems, Springer, 2002.
  • [21] A. Baranga, “The contraction principle as a particular case of Kleene’s fixed point theorem,” Discrete Mathematics, vol. 98, pp. 75-79, 1991.