
Stochastic Dynamic Information Flow Tracking Game using
Supervised Learning for Detecting Advanced Persistent Threats

Shana Moothedath [email protected] Dinuka Sahabandu [email protected] Joey Allen [email protected] Linda Bushnell [email protected] Wenke Lee [email protected] Radha Poovendran [email protected] Department of Electrical and Computer Engineering, University of Washington, USA College of Computing, Georgia Institute of Technology, USA
Abstract

Advanced persistent threats (APTs) are organized, prolonged cyberattacks by sophisticated attackers with the intent of stealing critical information. Although APT activities are stealthy and evade detection by traditional detection tools, they interact with the system components to make progress in the attack. These interactions lead to information flows that are recorded in the form of a system log. Dynamic Information Flow Tracking (DIFT) has been proposed as one of the effective ways to detect APTs using these information flows. A DIFT-based detection mechanism dynamically performs security analysis on the information flows to detect possible attacks. However, wide-range security analysis using DIFT results in a significant increase in performance overhead and high rates of false-positives and false-negatives. In this paper, we model the strategic interaction between APT and DIFT as a non-cooperative stochastic game. The game unfolds on a state space constructed from an information flow graph (IFG) that is extracted from the system log. The objective of the APT in the game is to choose transitions in the IFG so as to find an optimal path from an entry point of the attack to an attack target. On the other hand, the objective of DIFT is to dynamically select nodes in the IFG at which to perform security analysis for detecting the APT. Our game model has imperfect information as the players do not have information about the actions of the opponent. We consider two scenarios of the game: (i) when the false-positive and false-negative rates are known to both players and (ii) when the false-positive and false-negative rates are unknown to both players. Case (i) translates to a game model with complete information, for which we propose a value iteration-based algorithm and prove its convergence. Case (ii) translates to an incomplete information game with unknown transition probabilities. In this case, we propose the Hierarchical Supervised Learning (HSL) algorithm, which integrates a neural network, to predict the value vector of the game, with a policy iteration algorithm to compute an approximate equilibrium. We implemented our algorithms for cases (i) and (ii) on real attack datasets for nation state and ransomware attacks and validated the performance of our approach.

keywords:
Cyber security, Stochastic games, Neural network, Advanced persistent threats, Dynamic information flow tracking

1 Introduction

Advanced persistent threats (APTs) are sophisticated and prolonged cyberattacks that target specific high value organizations in sectors such as national defense, manufacturing, and the financial industry [14, 34]. APTs use advanced attack methods, including advanced exploits of zero-day vulnerabilities, as well as highly-targeted spear phishing and other social engineering techniques to gain access to a network and then remain undetected for a prolonged period of time. The attacking strategies of APTs are stealthy and are methodically designed to bypass conventional security mechanisms to cause more permanent, significant, and irreversible damages to the network.

Interactions of APTs with the network introduce information flows, in the form of control flow and command flow commands, which get recorded in the system log. Analyzing information flows is thus a promising approach to detect the presence of APTs [15]. Dynamic information flow tracking (DIFT) [32] is a flow tracking-based mechanism that is widely used to detect APTs. DIFT tags information flows originating from suspicious input channels in the network and tracks the propagation of the tagged flows. DIFT then initiates security check points, referred to as traps, to analyze the tagged flows and verify their authenticity. As the ratio of malicious flows to benign flows is very small in the network, implementation of DIFT introduces significant performance overhead on the network and generates false-positives and false-negatives [15].

This paper models detection of APTs using a DIFT-based detection mechanism. The effectiveness of detection depends both on the security policy of DIFT and on the actions of the APT. We model the strategic interaction between APTs and DIFT as a stochastic game. The game unfolds on a state space that is defined using the information flow graph (IFG) constructed from the system log. The objective of DIFT is to select locations in the system, i.e., nodes in the IFG, to place the security check points while minimizing the errors due to false-positives and false-negatives. On the other hand, the objective of the APT is to choose an attack path starting from an entry point to a target node in the IFG. We note that, while APTs aim to achieve a reachability objective, the aim of DIFT is an optimal node selection. The APT-DIFT game has an imperfect information structure as APTs are unaware of the locations at which DIFT places security check points and DIFT is unaware of the path chosen by the APT for the attack. We consider two scenarios of the game: (i) when the false-positive and false-negative rates are known to both players and (ii) when the false-positive and false-negative rates are unknown to both players. While case (i) is a complete information game, case (ii) is an incomplete information game with unknown transition probabilities.

We make the following contributions in this paper:

  1. \bullet

    We formulate a constant-sum stochastic game with total reward structure that models the interaction between the DIFT and APT. The game captures the information asymmetry between the attacker and the defender, false-positives, and false-negatives generated by DIFT.

  2. \bullet

    When the transition probabilities are known, we present a value iteration algorithm to compute an equilibrium strategy and prove its convergence and polynomial-time complexity.

  3. \bullet

    When the transition probabilities are unknown, we propose HSL (Hierarchical Supervised Learning), a supervised learning-based algorithm to compute an approximate equilibrium strategy of the incomplete information game. HSL integrates a neural network with policy iteration and utilizes the hierarchical structure of the state space of the APT-DIFT game to guarantee convergence.

  4. \bullet

    We implement our algorithms on attack data of a nation state attack and a ransomware attack, collected using the Refinable Attack INvestigation (RAIN) system, and show the convergence of the algorithms.

This paper is organized as follows: Section 2 summarizes the related work. Section 3 elaborates on the information flow graph and the attacker and defender models. Section 4 presents the formulation of the stochastic game between APT and DIFT. Section 5 discusses the min-max solution concept used for solving the game. Section 6 presents the value iteration algorithm for the model-based approach and the supervised learning-based approach for the model-free case. Section 7 illustrates the results and discussions on the experiments conducted. Section 8 concludes the paper.

2 Related work

Stochastic games model the strategic interactions among multiple agents or players in dynamical systems and are well studied in the context of security games [18], economic games [3], and resilience of cyber-physical systems [35]. While stochastic games provide a strong framework for modeling security problems, solving stochastic games is often hard. There exist dynamic programming-based approaches, value iteration and policy iteration, for computing Nash equilibrium (NE) strategies of these games. However, in many stochastic game formulations dynamic programming-based approaches do not converge. References [1, 2] modeled and studied the interaction between defender and attacker in a cyber setting as a two-player partially observable stochastic game and presented an approach for synthesizing sub-optimal strategies.

Multi-agent reinforcement learning (MARL) algorithms have been proposed in the literature to obtain NE strategies of zero-sum and nonzero-sum stochastic games when game information, such as the transition probabilities and the payoff functions of the players, is unknown. However, algorithms with guaranteed convergence are available only for special cases [11, 12, 23, 24]. A Nash-Q learning algorithm is given in [12] that converges to an NE of a general-sum game when the NE is unique. The algorithm in [12] is guaranteed to converge for discounted games with perfect information and a unique NE. A Q-learning algorithm is presented in [11] for discounted, constant-sum games with unknown transition probabilities when the players have perfect information. An adaptive policy approach using a regularized Lagrange function is proposed in [23] for zero-sum games with unknown transition probabilities and irreducible state space. A stochastic approximation-based algorithm is presented in [24] for computing NE of discounted general-sum games with irreducible state space. While there exist algorithms that converge to an NE of the game for special cases, there are no known algorithms to compute an NE of a general stochastic game with an incomplete information structure.

Detection of APTs has been analyzed using game theory in the literature by modeling the interactions between the APT and the detection mechanism as a game [25]. A dynamic game model is given in [13] to detect APTs that can adopt adversarial deceptions. A honeypot game theoretic model is introduced in [33] to address detection of APTs.

There has been some effort to model the detection of APTs by a DIFT-based detection mechanism using a game-theoretic framework [22], [29], [21]. Reference [22] studied a DIFT model which selects the trap locations in a dynamic manner and proposed a min-cut based solution approach. Detection of APTs when the attack consists of multiple attackers, possibly with different capabilities, is studied in [29]. We note that the game models in [22] and [29] did not consider false-negative and false-positive generation by DIFT and are hence non-stochastic. Recently, stochastic models of APT-DIFT games were proposed in [21], [30]. Reference [30] proposed a value iteration-based algorithm to obtain an $\epsilon$-NE of the discounted game and [21] proposed a min-cut solution approach to compute an NE of the game. The formulations in [21], [30] assume that the transition probabilities of the game (false-positives and false-negatives) are known.

DIFT-APT games with unknown transition probabilities are studied in [27], [28]. While reference [27] proposed a two-time scale algorithm to compute an equilibrium of the discounted game, [28] considered an average reward payoff structure. Moreover, the game models in [27], [28] resulted in a unichain structure on the state space of the game, unlike the game model considered in this paper. The unichain structure of the game is critically utilized in [27], [28] for developing the solution approach and deriving the results. Reference [19], the preliminary conference version of this work, studied the DIFT-APT game with unknown transition probabilities and an average reward payoff structure, when the state space graph does not have a unichain structure. The approach in [19] approximated the payoff function by a convex function and utilized an input convex neural network (ICNN) architecture. The ICNN is integrated with an alternating optimization technique, and empirical results are presented in [19] to learn approximate equilibrium strategies. In this paper, we do not restrict the payoff function to be convex and hence relax the requirement that the neural network be input convex.

3 Preliminaries

3.1 Information Flow Graph (IFG)

An Information Flow Graph (IFG) provides a graphical representation of the whole-system execution during the entire period of monitoring. We use the RAIN recording system [15] to construct the IFG. RAIN comprises a kernel module which logs all the system calls that are requested by the user-level processes in the target host. The target host then sends the recorded system logs to the analysis host. The analysis host consists of a provenance graph builder, which constructs a coarse-grained IFG. The coarse-grained IFG is then refined using various pruning techniques. A brief discussion of the pruning steps is included in Section 7. For more details, please refer to [15].

IFGs are widely used by analysts for effective cyber response [10], [15]. IFGs are directed graphs, where the nodes represent entities such as processes, files, network connections, and memory objects in the system. Edges correspond to system calls and are oriented in the direction of the information flows and/or causality [10]. Let the directed graph ${\pazocal{G}}=(V_{\pazocal{G}},E_{\pazocal{G}})$ represent the IFG of the system, where $V_{\pazocal{G}}=\{v_{1},\ldots,v_{N}\}$ and $E_{\pazocal{G}}\subseteq V_{\pazocal{G}}\times V_{\pazocal{G}}$. Given a system log, one can build the coarse-grained IFG, which is then pruned and refined incrementally to obtain the corresponding IFG [15].
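To make the graph representation concrete, the following minimal Python sketch builds the adjacency structure of a small, hypothetical IFG; the node names and edges are illustrative placeholders and are not taken from the RAIN datasets used in Section 7.

```python
# Hypothetical pruned IFG: nodes are system entities, edges follow the direction of
# information flow. The names below are placeholders, not data from the paper.
ifg_nodes = {
    "10.0.0.5:443": "network",
    "firefox": "process",
    "/tmp/payload": "file",
    "bash": "process",
}
ifg_edges = [
    ("10.0.0.5:443", "firefox"),   # network read by a process
    ("firefox", "/tmp/payload"),   # process writes a file
    ("/tmp/payload", "bash"),      # file read/executed by another process
]

# Adjacency-list view used later: out_neighbors[v] = {w : (v, w) in E_G}
out_neighbors = {v: [] for v in ifg_nodes}
for src, dst in ifg_edges:
    out_neighbors[src].append(dst)

print(out_neighbors)
```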

3.2 Attacker Model

This paper considers advanced cyberattacks called APTs. APTs are information security threats that target sensitive information in specific organizations. APTs enter the system by leveraging vulnerabilities in the system and implement multiple sophisticated methods to continuously and stealthily steal information [10]. A typical APT attack consists of multiple stages initiated by a successful penetration and followed by initial compromise, C&C communications, privilege escalation, internal reconnaissance, exfiltration, and cleanup [10]. Detection of APT attacks is very challenging as the attacker activities blend in seamlessly with normal system operation. Moreover, as APT attacks are customized, they cannot be detected using signature-based detection methods such as firewalls, intrusion detection systems, and antivirus software.

3.3 Defender Model

Dynamic information flow tracking (DIFT) is a promising technique to detect security attacks on computer systems [7, 9, 32]. DIFT tracks the system calls to detect malicious information flows from an adversary and to restrict the use of these malicious flows. The DIFT architecture is composed of (i) tag sources, (ii) tag propagation rules, and (iii) tag sinks (traps). Tag sources are the suspicious locations in the system, such as keyboards, network interfaces, and hard disks, that are tagged/tainted as suspicious by DIFT. Tags are single-bit or multiple-bit markings depending on the level of granularity manageable with the available memory and resources. All the processed values of the tag sources are tagged and DIFT tracks the propagation of the tagged flows. When anomalous behavior is detected in the system, DIFT initiates security analysis and tagged flows are inspected by DIFT at specific locations, referred to as traps. DIFT conducts fine-grained analysis at traps to detect the attack and to perform risk assessment. While tagging and trapping using DIFT is a promising detection mechanism against APTs, DIFT introduces performance overhead on the system. Performing security analysis (trapping) of tagged flows uses a considerable amount of system memory [9]. Thus there is a tradeoff between system log granularity and performance [15].

4 Problem Formulation

This section formulates the interaction between the APT and the DIFT-based defense mechanism as a stochastic game where the decisions taken by the adversary (APT) and the defender (DIFT), referred to as agents/players, influence the system behavior. The objective of the adversary is to choose transitions in the IFG so as to reach the destination node set $\pazocal{D}\subset V_{\pazocal{G}}$ from the set of entry points $\lambda\subset V_{\pazocal{G}}$. On the other hand, the objective of the defender is to dynamically choose locations to trap the information flow so as to secure the system from any possible attack. The state of the system denotes the location of the tagged information flow in the system. The defender cannot distinguish between a malicious and a benign flow. On the other hand, the adversary does not know whether the tagged flow will get trapped by the defender while choosing a transition. Thus the game is an imperfect information game. We note that the state of the game is known here, unlike in a partially observable game setting where the state of the game is unknown. The system's current state and the joint actions of the agents together determine a probability distribution over the possible next states of the system.

4.1 State Space

The state of the game at a time step $t$, denoted as $s_{t}$, is defined as the position of the tagged information flow in the system. Let ${\bf S}=\{v_{0},v_{1},\ldots,v_{N},\phi,\tau_{A},\tau_{B}\}$ be the finite state space of the game. Here, $s_{t}=\phi$ denotes the state when the tagged flow drops out by abandoning the attack, $s_{t}=\tau_{A}$ denotes the state when DIFT detects the adversary after performing the security analysis, and $s_{t}=\tau_{B}$ denotes the state when DIFT performs security analysis and concludes a benign flow as malicious (false positive). Let $\pazocal{D}$ be the destination (target) node set of the adversary and $|\pazocal{D}|=q$. Without loss of generality, let $\pazocal{D}=\{v_{1},v_{2},\ldots,v_{q}\}$. We assume that both agents know the IFG and the destination set $\pazocal{D}$. The state $v_{0}$ corresponds to a virtual state that denotes the starting point of the game. In the state space ${\bf S}$, $v_{0}$ is connected to all nodes in $\lambda$. Thus the state of the game at $t=0$ is $s_{0}=v_{0}$.
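The construction of the state space from the IFG can be sketched as follows; the five-node IFG, entry point, and destination below are illustrative assumptions, not data from the paper.

```python
# Sketch of the game's state space S = {v0, v1, ..., vN, phi, tau_A, tau_B}.
# v1 is the attack target (D) and v5 is the single entry point (lambda); both are assumptions.
ifg_nodes = ["v1", "v2", "v3", "v4", "v5"]
ifg_edges = [("v5", "v4"), ("v4", "v3"), ("v4", "v2"), ("v3", "v1"), ("v2", "v1")]
entry_points = ["v5"]                 # lambda
destinations = {"v1"}                 # D

V0, PHI, TAU_A, TAU_B = "v0", "phi", "tau_A", "tau_B"
states = [V0] + ifg_nodes + [PHI, TAU_A, TAU_B]
absorbing = {PHI, TAU_A, TAU_B} | destinations

# Successor map of the state space graph: v0 feeds the entry points,
# IFG edges are kept, and absorbing states have no outgoing edges.
succ = {s: [] for s in states}
succ[V0] = list(entry_points)
for u, w in ifg_edges:
    if u not in absorbing:
        succ[u].append(w)

print(succ)
```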

4.2 Action Spaces

At every time instant in the game, the players choose their actions from their respective action sets. The action set of the adversary is defined as the possible set of transitions the adversarial flow can execute at a state. The defender has limited resources and hence can perform security analysis on only one information flow at a time. Thus the defender's action set is defined such that, at every decision point, the defender chooses one node, among the possible nodes that the adversary can transition to, at which to perform security analysis.

Let the action set of the adversary and the defender at a state sts_{t} be AA(st)\pazocal{A}_{{\scriptscriptstyle{A}}}(s_{t}) and AD(st)\pazocal{A}_{{\scriptscriptstyle{D}}}(s_{t}), respectively. At a state stVGs_{t}\in V_{{\pazocal{G}}}, the adversary chooses an action either to drop out, i.e., abort the attack, or to continue the attack by transitioning to a neighboring node. Thus AA(st):={ϕ}{vj:st=viand(vi,vj)EG}\pazocal{A}_{{\scriptscriptstyle{A}}}(s_{t}):=\{\phi\}\cup\{v_{j}:s_{t}=v_{i}\leavevmode\nobreak\ {\rm and\leavevmode\nobreak\ }(v_{i},v_{j})\in E_{{\pazocal{G}}}\}. On the other hand, the defender’s action set at state sts_{t} AD(st):={0}{vj:st=viand(vi,vj)EG}\pazocal{A}_{{\scriptscriptstyle{D}}}(s_{t}):=\{0\}\cup\{v_{j}:s_{t}=v_{i}\leavevmode\nobreak\ {\rm and\leavevmode\nobreak\ }(v_{i},v_{j})\in E_{{\pazocal{G}}}\}, where AD(st)=0\pazocal{A}_{{\scriptscriptstyle{D}}}(s_{t})=0 represents that the tagged information flow is not trapped and AD(st)=vj\pazocal{A}_{{\scriptscriptstyle{D}}}(s_{t})=v_{j} represents that defender decides to perform security analysis at the node vjVGv_{j}\in V_{{\pazocal{G}}} at instant tt.

The game originates at $t=0$ with state $s_{0}=v_{0}$, with $\pazocal{A}_{A}(v_{0}):=\lambda$ and $\pazocal{A}_{D}(v_{0}):=\{0\}$. The definition of the action sets at $v_{0}$ captures the fact that the adversarial flow originates at one of the entry points. Performing security analysis at entry points cannot detect the attack as there are not enough traces to analyze. Moreover, the game terminates at time $t$ if the state of the game $s_{t}\in\{\phi,\tau_{A},\tau_{B}\}\cup\pazocal{D}$. The set of states $\{\phi,\tau_{A},\tau_{B}\}\cup\pazocal{D}$ are referred to as the absorbing states of the game. For $s_{t}\in\{\phi,\tau_{A},\tau_{B}\}\cup\pazocal{D}$, $\pazocal{A}_{D}(s_{t})=\pazocal{A}_{A}(s_{t})=\emptyset$. At a non-absorbing state, the players choose their actions from their respective action sets and the game evolves until the state of the game is an absorbing state.
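A minimal sketch of the two action sets, assuming the `succ` adjacency map, `absorbing` set, and state labels from the state-space sketch above:

```python
def adversary_actions(s, succ, absorbing, v0="v0", drop_out="phi"):
    """A_A(s): at v0 the flow starts at an entry point; elsewhere the adversary either
    drops out (phi) or moves to an out-neighbor of s; absorbing states have no actions."""
    if s in absorbing:
        return []
    if s == v0:
        return list(succ[s])                 # A_A(v0) = lambda
    return [drop_out] + list(succ[s])

def defender_actions(s, succ, absorbing, v0="v0"):
    """A_D(s): do not trap (0) or trap one of the out-neighbors of s; only 0 at v0."""
    if s in absorbing:
        return []
    if s == v0:
        return [0]
    return [0] + list(succ[s])
```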

4.3 Transitions

The transition probabilities of the game are governed by the uncertainty associated with DIFT. DIFT is not capable of performing security analysis accurately due to the generation of false-positives and false-negatives. Consider a tagged flow incoming at node $v_{i}\in V_{\pazocal{G}}$ at time $t$. Let the actions chosen by the defender and the adversary at $s_{t}$ be $d_{t}$ and $a_{t}$, respectively. If the defender chooses not to trap an information flow, then the flow proceeds to the node in $\pazocal{G}$ chosen by the adversary. That is, if $d_{t}=0$, then $s_{t+1}=a_{t}$. If the defender chooses to trap the node to which the adversary also decides to transition, then the adversary is detected with probability $1-FN(d_{t})$ and the flow transitions to the node in $\pazocal{G}$ corresponding to the action of the adversary with the remaining probability. That is, if $d_{t}=a_{t}$, then $s_{t+1}=\tau_{A}$ with probability $1-FN(d_{t})$ and $s_{t+1}=a_{t}$ with probability $FN(d_{t})$. If the defender decides to trap a node which is not the node the adversary decides to transition to, then the defender generates a false-positive with probability $FP(d_{t})$ and the flow transitions to the node in $\pazocal{G}$ corresponding to the action of the adversary with the remaining probability. That is, if $d_{t}\neq 0$ and $d_{t}\neq a_{t}$, then $s_{t+1}=\tau_{B}$ with probability $FP(d_{t})$ and $s_{t+1}=a_{t}$ with probability $1-FP(d_{t})$.

Let $P(s_{t},a_{t},d_{t},s_{t+1})$ denote the probability of transitioning to a state $s_{t+1}$ from a state $s_{t}$ under actions $a_{t}$ and $d_{t}$. Then,

P(s_{t},a_{t},d_{t},s_{t+1})=\begin{cases} 1, & s_{t+1}=a_{t}, & \text{if } d_{t}=0\\ FN(d_{t}), & s_{t+1}=a_{t}, & \text{if } d_{t}=a_{t}\\ 1-FN(d_{t}), & s_{t+1}=\tau_{A}, & \text{if } d_{t}=a_{t}\\ FP(d_{t}), & s_{t+1}=\tau_{B}, & \text{if } d_{t}\neq a_{t}\\ 1-FP(d_{t}), & s_{t+1}=a_{t}, & \text{if } d_{t}\neq a_{t} \end{cases} \qquad (1)

The transition probabilities $FN(v_{i})$ and $FP(v_{i})$, for $v_{i}\in V_{\pazocal{G}}$, denote the rates of generation of false-negatives and false-positives, respectively, at the different nodes in the IFG. We note that different nodes in the IFG have different capabilities to perform security analysis, and consequently the values of $FN(\cdot)$ and $FP(\cdot)$ differ across nodes. The numerical values of these transition probabilities also depend on the type of the attack. As APTs are tailored attacks that can manipulate the system operation and evade conventional security mechanisms such as firewalls, anti-virus software, and intrusion-detection systems, the $FN(\cdot)$'s and $FP(\cdot)$'s are often unknown and hard to estimate accurately.
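The transition rule in Eq. (1) can be written directly as a function. The sketch below assumes `FN` and `FP` are dictionaries mapping IFG nodes to their false-negative and false-positive rates, and that the adversary's drop-out action is labeled `"phi"`:

```python
def transition_prob(s, a, d, s_next, FN, FP, tau_A="tau_A", tau_B="tau_B"):
    """P(s, a, d, s') of Eq. (1). Here a is the adversary's chosen next node (or 'phi'
    for dropping out) and d is 0 (no trap) or the node the defender traps.
    The current state s enters only through the admissible action sets."""
    if d == 0:                                    # no security analysis: follow the adversary
        return 1.0 if s_next == a else 0.0
    if d == a:                                    # trap on the adversary's next node
        if s_next == tau_A:
            return 1.0 - FN[d]                    # detection
        return FN[d] if s_next == a else 0.0      # missed detection (false negative)
    if s_next == tau_B:                           # trap elsewhere: possible false positive
        return FP[d]
    return 1.0 - FP[d] if s_next == a else 0.0
```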

4.4 Strategies

A strategy for a player is a mapping which yields probability distribution over the player’s actions at every state. Consider mixed (stochastic) and stationary player strategies. When the strategy is stationary, the probability of choosing an action at a state depends only on the current state of the game. Let the stationary strategy space of the attacker be 𝐏A{\bf P}_{{\scriptscriptstyle{A}}} and that of the defender be 𝐏D{\bf P}_{{\scriptscriptstyle{D}}}. Then, 𝐏A:𝐒[0,1]|AA|{\bf P}_{{\scriptscriptstyle{A}}}:{\bf S}\rightarrow[0,1]^{|\pazocal{A}_{{\scriptscriptstyle{A}}}|} and 𝐏D:𝐒[0,1]|AD|{\bf P}_{{\scriptscriptstyle{D}}}:{\bf S}\rightarrow[0,1]^{|\pazocal{A}_{{\scriptscriptstyle{D}}}|}. Let pA𝐏Ap_{{\scriptscriptstyle{A}}}\in{\bf P}_{{\scriptscriptstyle{A}}} and pD𝐏Dp_{{\scriptscriptstyle{D}}}\in{\bf P}_{{\scriptscriptstyle{D}}}. Here, pD=[pD(vq+1),,pD(vN)]p_{{\scriptscriptstyle{D}}}=[p_{{\scriptscriptstyle{D}}}(v_{q+1}),\ldots,p_{{\scriptscriptstyle{D}}}(v_{N})], where pD(vi)p_{{\scriptscriptstyle{D}}}(v_{i}) denotes the probability distribution over all the actions of the defender at state viv_{i}, i.e., out-neighboring nodes of viv_{i} in IFG and not trapping. For pA=[pA(v0),pA(vq+1),,pA(vN)]p_{{\scriptscriptstyle{A}}}=[p_{{\scriptscriptstyle{A}}}(v_{0}),p_{{\scriptscriptstyle{A}}}(v_{q+1}),\ldots,p_{{\scriptscriptstyle{A}}}(v_{N})], pA(vi)p_{{\scriptscriptstyle{A}}}(v_{i}) is a probability distribution over all possible out-neighbors of the node viv_{i} in the IFG and ϕ\phi. We note that, AA(st)=AD(st)=\pazocal{A}_{{\scriptscriptstyle{A}}}(s_{t})=\pazocal{A}_{{\scriptscriptstyle{D}}}(s_{t})=\emptyset, for st{ϕ,τA,τB}Ds_{t}\in\{\phi,\tau_{{\scriptscriptstyle{A}}},\tau_{{\scriptscriptstyle{B}}}\}\cup\pazocal{D}. Also, AD(v0)=\pazocal{A}_{{\scriptscriptstyle{D}}}(v_{0})=\emptyset.

4.5 Payoffs

In the APT-DIFT game, the aim of the APT is to choose transitions in the IFG in order to reach a destination while evading detection by DIFT. The aim of DIFT is to select the security check points to secure the system from the APT attack. Recall that APTs are stealthy attacks that operate over a long period of time with the ultimate goal of stealing information. The destructive consequences of APTs are often unnoticeable until the final stages of the attack [4]. In this paper we consider payoff functions of the APT-DIFT game such that a player receives a reward ($\beta$) when its respective aim is achieved. DIFT achieves its aim under two scenarios: (1) when the APT is detected successfully and (2) when the APT drops out of the attack. We note that, in both cases (1) and (2), DIFT secures the system from the APT attack and hence is rewarded $\beta$. On the other hand, the APT achieves its aim under two scenarios: (i) when the APT reaches a destination and (ii) when a benign flow is concluded as malicious, i.e., a false-positive. We note that, when a benign flow is concluded as malicious, DIFT no longer analyzes the flows (as it believes the APT is detected) and hence the actual malicious flow evades detection and can achieve the aim. Thus in both cases (i) and (ii) the APT achieves its aim and receives a reward of $\beta$.

Let the payoff of player kk at an absorbing state sts_{t} be rk(st)r^{k}(s_{t}), where k{A,D}k\in\{A,D\}. At state τA\tau_{{\scriptscriptstyle{A}}} DIFT receives a payoff of β\beta and APT receives 0 payoff. At state τB\tau_{{\scriptscriptstyle{B}}} the APT receives a payoff of β\beta and DIFT receives 0 payoff. At a state in the set D={v1,v2,,vq}\pazocal{D}=\{v_{1},v_{2},\ldots,v_{q}\}, APT receives a payoff of β\beta and DIFT receives 0 payoff. At state ϕ\phi DIFT receives a payoff of β\beta and APT receives 0 payoff. Then,

r^{A}(s_{t}) = \begin{cases} \beta, & s_{t}\in\pazocal{D}\\ \beta, & s_{t}=\tau_{B}\\ 0, & \text{otherwise} \end{cases} \qquad (2)

r^{D}(s_{t}) = \begin{cases} \beta, & s_{t}=\tau_{A}\\ \beta, & s_{t}=\phi\\ 0, & \text{otherwise} \end{cases} \qquad (3)

At each stage of the game, with state $s_{t}$ at time $t$, both players simultaneously choose their actions $a_{t}$ and $d_{t}$, and the game transitions to a next state $s_{t+1}$. This continues until an absorbing state is reached and the players receive the payoffs defined in Eqs. (2) and (3).
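The terminal payoffs of Eqs. (2) and (3) translate into two small reward functions; a sketch, with the destination set and the labels of the absorbing states passed in as assumptions:

```python
def reward_A(s, beta, destinations, tau_B="tau_B"):
    """r^A(s) of Eq. (2): the APT is rewarded at a destination or on a false positive."""
    return beta if (s in destinations or s == tau_B) else 0.0

def reward_D(s, beta, phi="phi", tau_A="tau_A"):
    """r^D(s) of Eq. (3): DIFT is rewarded on detection or when the APT drops out."""
    return beta if s in (tau_A, phi) else 0.0
```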

Let $T$ denote the termination time of the game, i.e., $s_{t+1}=s_{t}$ for $t\geqslant T$. At the termination time, $s_{T}\in\{\phi,\tau_{A},\tau_{B}\}\cup\pazocal{D}$. Let $U_{A}$ and $U_{D}$ denote the payoff functions of the adversary and the defender, respectively. As the initial state of the game is $v_{0}$, for a strategy pair $(p_{A},p_{D})$ the expected payoffs of the players are

U_{A}(v_{0},p_{A},p_{D}) = \mathbb{E}_{v_{0},p_{A},p_{D}}\left[\sum_{t=0}^{T}r^{A}(s_{t})\right]=\mathbb{E}_{v_{0},p_{A},p_{D}}\Big[r^{A}(s_{T})\Big], \qquad (4)

U_{D}(v_{0},p_{A},p_{D}) = \mathbb{E}_{v_{0},p_{A},p_{D}}\left[\sum_{t=0}^{T}r^{D}(s_{t})\right]=\mathbb{E}_{v_{0},p_{A},p_{D}}\Big[r^{D}(s_{T})\Big], \qquad (5)

where 𝔼v0,pA,pD\mathbb{E}_{v_{0},p_{{\scriptscriptstyle{A}}},p_{{\scriptscriptstyle{D}}}} denotes the expectation with respect to pAp_{{\scriptscriptstyle{A}}} and pDp_{{\scriptscriptstyle{D}}} when the game originates at state v0v_{0}. The objective of APT and DIFT is to individually maximize their expected total payoff given in Eq. (4) and Eq. (5), respectively. Thus the optimization problem solved by DIFT is

\max_{p_{D}\in{\bf P}_{D}} U_{D}(v_{0},p_{A},p_{D}). \qquad (6)

Similarly, the optimization problem of the APT is

\max_{p_{A}\in{\bf P}_{A}} U_{A}(v_{0},p_{A},p_{D}). \qquad (7)

To bring out the structure of the payoffs well, we let Rs0(s)R_{s_{0}}(s) be the cumulative probability with which the state of the game at the termination time is ss, when the game originates at s0s_{0}. With slight abuse of notation, we use Rs0(D)R_{s_{0}}(\pazocal{D}) to denote the cumulative probability with which the state of the game at the termination time lies in set D\pazocal{D}, when the game originates at s0s_{0}. At the time of termination, i.e., t=Tt=T, the state of the game satisfies one of the following: (i) sT=τAs_{T}=\tau_{{\scriptscriptstyle{A}}}, (ii) sT=τBs_{T}=\tau_{{\scriptscriptstyle{B}}}, (iii) sTDs_{T}\in\pazocal{D}, and (iv) sT=ϕs_{T}=\phi. Using these definitions Eqs. (4) and (5) can be rewritten as

U_{A}(v_{0},p_{A},p_{D}) = \Big(R_{s_{0}}(\pazocal{D})+R_{s_{0}}(\tau_{B})\Big)\,\beta, \qquad (8)

U_{D}(v_{0},p_{A},p_{D}) = \Big(R_{s_{0}}(\tau_{A})+R_{s_{0}}(\phi)\Big)\,\beta. \qquad (9)

Using the reformulation of the payoff functions, we present the following property of the APT-DIFT game.

Proposition 4.1.

The APT-DIFT stochastic game is a constant-sum game.

Proof.

Recall that the APT-DIFT game has absorbing states $\phi,\tau_{A},\tau_{B}$ and $\pazocal{D}$. At the termination time of the game, i.e., $t=T$, the state of the game satisfies $s_{T}\in\{\phi,\tau_{A},\tau_{B}\}\cup\pazocal{D}$. This implies

R_{s_{0}}(\phi)+R_{s_{0}}(\tau_{A})+R_{s_{0}}(\tau_{B})+R_{s_{0}}(\pazocal{D})=1.

This gives $U_{A}(v_{0},p_{A},p_{D})+U_{D}(v_{0},p_{A},p_{D})=\Big(R_{s_{0}}(\phi)+R_{s_{0}}(\tau_{A})+R_{s_{0}}(\tau_{B})+R_{s_{0}}(\pazocal{D})\Big)\,\beta=\beta$. Thus the game between APT and DIFT is a constant-sum game with the constant equal to $\beta$. ∎

5 Solution Concept

The solution concept of the game is defined as follows. Each player is interested in maximizing its individual reward in the minimax sense. In other words, each player assumes a worst-case, optimal opponent. Since the APT-DIFT game is a constant-sum game, it is sufficient to view the opponent as acting to minimize the player's reward. Thus DIFT chooses its actions to find an optimal strategy $p^{\star}_{D}$ that achieves the upper value defined as

\overline{V}(s_{0})=\max_{p_{D}\in{\bf P}_{D}}\min_{p_{A}\in{\bf P}_{A}} U_{D}(v_{0},p_{A},p_{D}). \qquad (10)

On the other hand, APT chooses its actions to find an optimal strategy pAp^{\star}_{{\scriptscriptstyle{A}}} that achieves

\max_{p_{A}\in{\bf P}_{A}}\min_{p_{D}\in{\bf P}_{D}} U_{A}(v_{0},p_{A},p_{D})=\max_{p_{A}\in{\bf P}_{A}}\min_{p_{D}\in{\bf P}_{D}}\Big(\beta-U_{D}(v_{0},p_{A},p_{D})\Big). \qquad (11)

This is equivalent to saying that APT aims to find an optimal strategy pAp^{\star}_{{\scriptscriptstyle{A}}} that achieves the lower value defined as

\underline{V}(s_{0})=\min_{p_{A}\in{\bf P}_{A}}\max_{p_{D}\in{\bf P}_{D}} U_{D}(v_{0},p_{A},p_{D}). \qquad (12)

From Eqs. (10) and (12), the defender tries to maximize and the adversary tries to minimize UD(v0,pA,pD)U_{{\scriptscriptstyle{D}}}(v_{0},p_{{\scriptscriptstyle{A}}},p_{{\scriptscriptstyle{D}}}). Hence the value of the game is defined as

V^{\star}(s_{0}) := \max_{p_{D}\in{\bf P}_{D}} U_{D}(v_{0},p^{\star}_{A},p_{D}) = U_{D}(v_{0},p^{\star}_{A},p^{\star}_{D}) = \min_{p_{A}\in{\bf P}_{A}} U_{D}(v_{0},p_{A},p^{\star}_{D}). \qquad (13)

The strategy pair (pA,pD)(p^{\star}_{{\scriptscriptstyle{A}}},p^{\star}_{{\scriptscriptstyle{D}}}) is referred to as the saddle point or Nash equilibrium (NE) of the game.

Definition 5.1.

Let (pA,pD)(p^{\star}_{{\scriptscriptstyle{A}}},p^{\star}_{{\scriptscriptstyle{D}}}) be a Nash equilibrium of the APT-DIFT game. A strategy pair (pA,pD)(p_{{\scriptscriptstyle{A}}},p_{{\scriptscriptstyle{D}}}) is said to be an ϵ\epsilon-Nash equilibrium, for ϵ>0\epsilon>0, if

U_{D}(v_{0},p_{A},p_{D}) \geqslant U_{D}(v_{0},p^{\star}_{A},p^{\star}_{D})-\epsilon.

In other words, the value corresponding to the strategy pair (pA,pD)(p_{{\scriptscriptstyle{A}}},p_{{\scriptscriptstyle{D}}}) denoted as V(s0)V(s_{0}) satisfies

V(s_{0}) \geqslant V^{\star}(s_{0})-\epsilon.

Proposition 5.2 proves the existence of NE for the APT-DIFT game.

Proposition 5.2.

There exists a Nash equilibrium for the APT-DIFT stochastic game.

Proof of Proposition 5.2 is presented in the appendix.

In the next section we present our approach to compute an NE of the APT-DIFT game.

6 Solution to the APT-DIFT Game

This section presents our approach to compute an NE of the APT-DIFT game. Firstly, we propose a model-based approach to compute NE using a value iteration algorithm. Later, we present a model-free approach based on a policy iteration algorithm, when the transition probabilities, i.e., the rate of false-positives and false-negatives, are unknown.

6.1 Value Iteration Algorithm

This subsection presents our solution approach for solving the APT-DIFT game when the false-positive and false-negative rates (transition probabilities) are known. By Proposition 5.2, there exists an NE for the APT-DIFT game. Our approach to compute NE of the APT-DIFT game is presented below.

Let s𝐒s\in{\bf S} be an arbitrary state of the APT-DIFT game. The state value function for a constant-sum game, analogous to the Bellman equation, can be written as

V^{\star}(s)=\max_{p_{D}(s)\in{\bf P}_{D}(s)}\min_{a\in\pazocal{A}_{A}(s)}\sum_{d\in\pazocal{A}_{D}(s)}Q^{\star}(s,a,d)\,p_{D}(s,d), \qquad (14)

where the QQ-values are defined as

Q^{\star}(s,a,d)=\sum_{s^{\prime}\in{\bf S}}P(s,a,d,s^{\prime})\,V^{\star}(s^{\prime}). \qquad (15)

The min in Eq. (14) can also be defined over mixed (stochastic) policies, but, since it is ‘inside’ the max, the minimum is achieved for a deterministic action choice [17]. The strategy selected by Eq. (14) is referred to as the minimax strategy, and given by

p^{\star}_{D}(s)=\arg\max_{p_{D}(s)\in{\bf P}_{D}(s)}\min_{a\in\pazocal{A}_{A}(s)}\sum_{d\in\pazocal{A}_{D}(s)}Q^{\star}(s,a,d)\,p_{D}(s,d). \qquad (16)
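For a single state, the max-min problem in Eqs. (14) and (16) is a linear program over the defender's mixed strategy. The sketch below solves it with `scipy.optimize.linprog`; the Q-matrix passed in is a hypothetical example, and the routine only illustrates the per-state computation, not the authors' implementation.

```python
import numpy as np
from scipy.optimize import linprog

def solve_state_minimax(Q):
    """Solve max_{p_D} min_a sum_d Q[a, d] * p_D(d) for one state (Eqs. (14) and (16)).
    Q is an |A_A(s)| x |A_D(s)| matrix of Q-values; returns (value, defender mixed strategy)."""
    n_a, n_d = Q.shape
    # Decision variables x = (p_D(d_1), ..., p_D(d_m), V); maximize V <=> minimize -V.
    c = np.zeros(n_d + 1)
    c[-1] = -1.0
    # V - sum_d Q[a, d] p_D(d) <= 0 for every adversary action a.
    A_ub = np.hstack([-Q, np.ones((n_a, 1))])
    b_ub = np.zeros(n_a)
    # Probability simplex: sum_d p_D(d) = 1, p_D(d) >= 0.
    A_eq = np.hstack([np.ones((1, n_d)), np.zeros((1, 1))])
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_d + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds, method="highs")
    return res.x[-1], res.x[:-1]

# Example: a 2x2 matrix game where the defender must "match" the adversary's move.
value, p_D = solve_state_minimax(np.array([[1.0, 0.0], [0.0, 1.0]]))
print(value, p_D)   # ~0.5, [0.5, 0.5]
```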

The aim here is to compute the minimax strategy $p^{\star}_{D}$. Our proposed algorithm and convergence proof are given below.

Let V={V(s):s𝐒}V=\{V(s):s\in{\bf S}\} denote the value vector corresponding to a strategy pair (pA,pD)(p_{{\scriptscriptstyle{A}}},p_{{\scriptscriptstyle{D}}}). The value for a state s𝐒s\in{\bf S}, V(s)V(s), is the expected payoff under strategy pair (pA,pD)(p_{{\scriptscriptstyle{A}}},p_{{\scriptscriptstyle{D}}}) if the game originates at state ss. Then, V(s)V^{\star}(s) is the expected payoff corresponding to an NE strategy pair (pA,pD)(p^{\star}_{{\scriptscriptstyle{A}}},p^{\star}_{{\scriptscriptstyle{D}}}) if the game originates at state ss. Algorithm 6.1 presents the pseudocode to compute the value vector of the APT-DIFT game.

Algorithm 6.1 is a value iteration algorithm defined on a value vector $V=\{V(s):s\in{\bf S}\}$, where $V(s)$ is the value of the game starting at the state $s$. Thus $V(s)=\beta$, for $s\in\{\phi,\tau_{A}\}$, and $V(s^{\prime})=0$, for $s^{\prime}\in\{\tau_{B}\}\cup\pazocal{D}$. The values of the remaining states, $V^{(k)}(s)$, are computed recursively for every iteration $k=1,2,\ldots$ as follows:

V^{(k)}(s)=\begin{cases} \beta, & \text{if } s\in\{\phi,\tau_{A}\},\\ 0, & \text{if } s\in\{\tau_{B}\}\cup\pazocal{D},\\ \mbox{val}(s,V^{(k-1)}), & \text{otherwise}, \end{cases} \qquad (17)

where

\mbox{val}(s,V^{(k-1)})=\max\limits_{p_{D}\in{\bf P}_{D}}\min\limits_{a\in\pazocal{A}_{A}(s)}\sum\limits_{s^{\prime}\in{\bf S}}\sum\limits_{d\in\pazocal{A}_{D}(s)}p_{D}(s,d)\,P(s,a,d,s^{\prime})\,V^{(k-1)}(s^{\prime}).

The value-iteration algorithm computes a sequence $V^{(0)},V^{(1)},V^{(2)},\ldots$, where for $k=0,1,2,\ldots$, each valuation $V^{(k)}$ associates with each state $s\in{\bf S}$ a lower bound $V^{(k)}(s)$ on the value of the game. Algorithm 6.1 need not terminate in finitely many iterations [26]. The parameter $\delta$, in step 2, denotes the acceptable error bound, which is the maximum absolute difference between the values of two consecutive iterations, i.e., $\max\{|V^{(k)}(s)-V^{(k+1)}(s)|:s\in{\bf S}\}$, and serves as a stopping criterion for Algorithm 6.1. A smaller value of $\delta$ in Algorithm 6.1 will return a value vector that is closer to the actual value vector. Below we prove that as $k$ approaches $\infty$, the values $V^{(k)}(s)$ approach the exact values $V^{\star}(s)$ from below, i.e., $\lim_{k\rightarrow\infty}V^{(k)}(s)$ converges to the value of the game at state $s$. Theorem 6.3 proves the asymptotic convergence of the values.

Algorithm 6.1 Value iteration algorithm to find NE strategy in APT-DIFT game
Input: APT-DIFT game with state space 𝐒{\bf S}, destination set D\pazocal{D}, action space AD,AA\pazocal{A}_{{\scriptscriptstyle{D}}},\pazocal{A}_{{\scriptscriptstyle{A}}}, payoff parameter β\beta, transition probabilities FN(),FP()FN(\cdot),FP(\cdot)
Output: Value vector V^\hat{V} and defender policy p^D\hat{p}_{{\scriptscriptstyle{D}}}
1:Initialize value vector V(0)(s)0V^{(0)}(s)\leftarrow 0, for all s𝐒s\in{\bf S} and V(1)(s)βV^{(1)}(s^{\prime})\leftarrow\beta for s{τA,ϕ}s^{\prime}\in\{\tau_{{\scriptscriptstyle{A}}},\phi\}, V(1)(s′′)0V^{(1)}(s^{\prime\prime})\leftarrow 0 for all s′′𝐒{τA,ϕ}s^{\prime\prime}\in{\bf S}\setminus\{\tau_{{\scriptscriptstyle{A}}},\phi\}, k0k\leftarrow 0, δ0\delta\geqslant 0
2:while max{|V(k+1)(s)V(k)(s)|:s𝐒}>δ\max\{|V^{(k+1)}(s)-V^{(k)}(s)|:s\in{\bf S}\}>\delta do
3:     kk+1k\leftarrow k+1
4:     for s{ϕ,τA,τB}Ds\notin\{\phi,\tau_{{\scriptscriptstyle{A}}},\tau_{{\scriptscriptstyle{B}}}\}\cup\pazocal{D} do
5:         V(k+1)(s)V^{(k+1)}(s)\leftarrow
6:         maxpD𝐏DminaAA(s)s𝐒dAD(s)pD(s,d)P(s,a,d,s)V(k)(s)\max\limits_{p_{{\scriptscriptstyle{D}}}\in{\bf P}_{{\scriptscriptstyle{D}}}}\min\limits_{a\in\pazocal{A}_{{\scriptscriptstyle{A}}}(s)}\sum\limits_{s^{\prime}\in{\bf S}}\sum\limits_{d\in\pazocal{A}_{{\scriptscriptstyle{D}}}(s)}p_{{\scriptscriptstyle{D}}}(s,d)P(s,a,d,s^{\prime})V^{(k)}(s^{\prime})
7:     end for
8:end while
9:return Vector V^\hat{V}, where V^(s)V(k)(s)\hat{V}(s)\leftarrow V^{(k)}(s)
10:Compute DIFT strategy p^D\hat{p}_{{\scriptscriptstyle{D}}}, p^D(s)argmaxpD𝐏DminaAA(s)s𝐒dAD(s)pD(s,d)P(s,a,d,s)V^(s)\hat{p}_{{\scriptscriptstyle{D}}}(s)\leftarrow\arg\max\limits_{p_{{\scriptscriptstyle{D}}}\in{\bf P}_{{\scriptscriptstyle{D}}}}\min\limits_{a\in\pazocal{A}_{{\scriptscriptstyle{A}}}(s)}\sum\limits_{s^{\prime}\in{\bf S}}\sum\limits_{d\in\pazocal{A}_{{\scriptscriptstyle{D}}}(s)}p_{{\scriptscriptstyle{D}}}(s,d)P(s,a,d,s^{\prime})\hat{V}(s^{\prime})
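The following is a compact, self-contained sketch of Algorithm 6.1 on a toy three-node chain IFG. The graph, the false-negative and false-positive rates, $\beta$, and $\delta$ are illustrative assumptions rather than values from the paper's experiments; the max-min step at each state is solved with the linear program described above.

```python
import numpy as np
from scipy.optimize import linprog

# Toy chain IFG (illustrative): v0 -> v3 -> v2 -> v1, with v1 the attack target.
beta, delta = 1.0, 0.01
FN = {"v1": 0.2, "v2": 0.2, "v3": 0.2}
FP = {"v1": 0.1, "v2": 0.1, "v3": 0.1}
succ = {"v0": ["v3"], "v3": ["v2"], "v2": ["v1"]}            # non-absorbing states only
absorbing_val = {"v1": 0.0, "phi": beta, "tau_A": beta, "tau_B": 0.0}

def actions(s):
    """Action sets A_A(s), A_D(s) at a non-absorbing state s."""
    adv = list(succ[s]) if s == "v0" else ["phi"] + list(succ[s])
    dfn = [0] if s == "v0" else [0] + list(succ[s])
    return adv, dfn

def next_dist(a, d):
    """Successor distribution under the joint action (a, d), per Eq. (1)."""
    if d == 0:
        return {a: 1.0}
    if d == a:
        return {"tau_A": 1.0 - FN[d], a: FN[d]}
    return {"tau_B": FP[d], a: 1.0 - FP[d]}

def state_value(s, V):
    """Max-min update at state s (steps 5-6 of Algorithm 6.1) via a linear program."""
    adv, dfn = actions(s)
    Q = np.array([[sum(p * V[t] for t, p in next_dist(a, d).items()) for d in dfn]
                  for a in adv])
    c = np.zeros(len(dfn) + 1)
    c[-1] = -1.0                                             # maximize V(s)
    A_ub = np.hstack([-Q, np.ones((len(adv), 1))])           # V(s) <= sum_d Q[a,d] p_D(d)
    A_eq = np.hstack([np.ones((1, len(dfn))), np.zeros((1, 1))])
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(len(adv)), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, 1)] * len(dfn) + [(None, None)], method="highs")
    return res.x[-1], dict(zip(dfn, res.x[:-1]))

# Value iteration: repeat until the largest update is at most delta.
V = {s: 0.0 for s in list(succ) + list(absorbing_val)}
V.update(absorbing_val)
while True:
    V_new = dict(V)
    for s in succ:
        V_new[s], _ = state_value(s, V)
    diff = max(abs(V_new[s] - V[s]) for s in V)
    V = V_new
    if diff <= delta:
        break
strategy = {s: state_value(s, V)[1] for s in succ}
print({s: round(V[s], 3) for s in succ}, strategy)
```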
Lemma 6.1.

Consider the value iteration algorithm in Algorithm 6.1. Let V(k+1)V^{(k+1)} and V(k)V^{(k)} be the value vectors corresponding to iterations k+1k+1 and kk, respectively. Then, V(k+1)(s)V(k)(s)V^{(k+1)}(s)\geqslant V^{(k)}(s), for all s𝐒s\in{\bf S}.

Proof.

We first prove the result for a state $s\in\{\phi,\tau_{A},\tau_{B}\}\cup\pazocal{D}$. From Eq. (17), for every iteration $k=1,2,\ldots$, $V^{(k)}(s)=\beta$, for $s\in\{\phi,\tau_{A}\}$, and $V^{(k)}(s)=0$, for $s\in\{\tau_{B}\}\cup\pazocal{D}$. From the initialization step of Algorithm 6.1 (Step 1), $V^{(0)}(s)=0$ for all $s\in{\bf S}$. Thus for any state $s\in\{\phi,\tau_{A},\tau_{B}\}\cup\pazocal{D}$, $V^{(k+1)}(s)\geqslant V^{(k)}(s)$ for any arbitrary iteration $k$ of Algorithm 6.1.

For an arbitrary state ss, where s𝐒{ϕ,τA,τB}Ds\in{\bf S}\setminus\{\phi,\tau_{{\scriptscriptstyle{A}}},\tau_{{\scriptscriptstyle{B}}}\}\cup\pazocal{D}, we prove the result using an induction argument. The induction hypothesis is that arbitrary iterations kk and k+1k+1 satisfy V(k+1)(s)V(k)(s)V^{(k+1)}(s)\geqslant V^{(k)}(s), for all s𝐒{ϕ,τA,τB}Ds\in{\bf S}\setminus\{\phi,\tau_{{\scriptscriptstyle{A}}},\tau_{{\scriptscriptstyle{B}}}\}\cup\pazocal{D}.

Base step: Consider k=0k=0 as the base step. Initialize V(0)(s)=0V^{(0)}(s)=0 for all s𝐒{ϕ,τA,τB}Ds\in{\bf S}\setminus\{\phi,\tau_{{\scriptscriptstyle{A}}},\tau_{{\scriptscriptstyle{B}}}\}\cup\pazocal{D} and set V(1)(s)=0V^{(1)}(s)=0 for all s𝐒{ϕ,τA,τB}Ds\in{\bf S}\setminus\{\phi,\tau_{{\scriptscriptstyle{A}}},\tau_{{\scriptscriptstyle{B}}}\}\cup\pazocal{D}. This gives V(1)(s)V(0)(s)V^{(1)}(s)\geqslant V^{(0)}(s), for all s𝐒{ϕ,τA,τB}Ds\in{\bf S}\setminus\{\phi,\tau_{{\scriptscriptstyle{A}}},\tau_{{\scriptscriptstyle{B}}}\}\cup\pazocal{D}.

Induction step: For the induction step, assume that iteration kk satisfies V(k)(s)V(k1)(s)V^{(k)}(s)\geqslant V^{(k-1)}(s) for all s𝐒{ϕ,τA,τB}Ds\in{\bf S}\setminus\{\phi,\tau_{{\scriptscriptstyle{A}}},\tau_{{\scriptscriptstyle{B}}}\}\cup\pazocal{D}.

Consider iteration (k+1)(k+1). Then,

V^{(k+1)}(s) = \max\limits_{p_{D}\in{\bf P}_{D}}\min\limits_{a\in\pazocal{A}_{A}(s)}\sum\limits_{s^{\prime}\in{\bf S}}\sum\limits_{d\in\pazocal{A}_{D}(s)}p_{D}(s,d)\,P(s,a,d,s^{\prime})\,V^{(k)}(s^{\prime}) \qquad (18)
\geqslant \min\limits_{a\in\pazocal{A}_{A}(s)}\sum\limits_{s^{\prime}\in{\bf S}}\sum\limits_{d\in\pazocal{A}_{D}(s)}p^{(k)}_{D}(s,d)\,P(s,a,d,s^{\prime})\,V^{(k)}(s^{\prime})
\geqslant \min\limits_{a\in\pazocal{A}_{A}(s)}\sum\limits_{s^{\prime}\in{\bf S}}\sum\limits_{d\in\pazocal{A}_{D}(s)}p^{(k)}_{D}(s,d)\,P(s,a,d,s^{\prime})\,V^{(k-1)}(s^{\prime}) \qquad (19)
= \max\limits_{p_{D}\in{\bf P}_{D}}\min\limits_{a\in\pazocal{A}_{A}(s)}\sum\limits_{s^{\prime}\in{\bf S}}\sum\limits_{d\in\pazocal{A}_{D}(s)}p_{D}(s,d)\,P(s,a,d,s^{\prime})\,V^{(k-1)}(s^{\prime})
= V^{(k)}(s) \qquad (20)

Eq. (18) holds as the value obtained using a maximizing policy is at least the value obtained using the specific policy $p^{(k)}_{D}$. Eq. (19) follows from the induction hypothesis and Eq. (20) is from the definition of $V^{(k)}(s)$. This completes the proof. ∎

Proposition 6.2 (Monotone Convergence Theorem, [26]).

If a sequence is monotone increasing and bounded from above, then it is a convergent sequence.

The value of any state s𝐒s\in{\bf S} is bounded above by β\beta. Using Lemma 6.1 and Proposition 6.2, we state the convergence result of Algorithm 6.1.

Theorem 6.3.

Consider the value iteration algorithm in Algorithm 6.1. Let V(k)(s),V(s)V^{(k)}(s),V^{\star}(s) be the value at iteration kk and the optimal value of state s𝐒s\in{\bf S}, respectively. Then, as kk\rightarrow\infty, V(k)(s)V(s)V^{(k)}(s)\rightarrow V^{\star}(s), for all s𝐒s\in{\bf S}. Further, the output of Algorithm 6.1, p^D\hat{p}_{{\scriptscriptstyle{D}}}, for δ0\delta\rightarrow 0, is an optimal defender policy.

The value of any state s𝐒s\in{\bf S} is bounded above by β\beta. From Lemma 6.1 we know that the sequence V(k)(s)V^{(k)}(s) is a monotonically increasing sequence. By the monotone convergence theorem [26], a bounded and monotone sequence converges to the supremum, i.e., limkV(k)(s)V(s)\lim_{k\rightarrow\infty}V^{(k)}(s)\rightarrow V^{\star}(s), for all s𝐒s\in{\bf S}. Thus the value iteration algorithm converges and the proof follows.

Theorem 6.4.

For any ϵ>0\epsilon>0, Algorithm 6.1 returns an ϵ\epsilon-Nash equilibrium of the APT-DIFT game.

Proof of Theorem 6.4 follows from Lemma 6.1, Theorem 6.3, and Definition 5.1.

The value of δ\delta trades off the desired accuracy of the value vector and the computational time. For a given value of δ\delta, the number of iterations of Algorithm 6.1 depends on the size and connectivity structure of the IFG. As the size and connectivity structure of the IFG varies across different datasets, we base the selection of δ\delta solely on the accuracy desired by the security expert. In our experiments, we set δ=0.01\delta=0.01. Theorem 6.5 presents the computational complexity of Algorithm 6.1 for each iteration.

Theorem 6.5.

Consider the APT-DIFT game with NN number of nodes in the IFG and qq number of destination nodes. Let AA\pazocal{A}_{{\scriptscriptstyle{A}}} and AD\pazocal{A}_{{\scriptscriptstyle{D}}} be the action sets of APT and DIFT, respectively. Every iteration of Algorithm 6.1, i.e., steps 2-8, has computational complexity O((Nq+1)2|AA||AD|)O({(N-q+1)}^{2}|\pazocal{A}_{{\scriptscriptstyle{A}}}||\pazocal{A}_{{\scriptscriptstyle{D}}}|).

Proof.

Every iteration of Algorithm 6.1 involves computation of the value vector. This involves solving, for every state s𝐒{{ϕ,τA,τB}D}s\in{\bf S}\setminus\{\{\phi,\tau_{{\scriptscriptstyle{A}}},\tau_{{\scriptscriptstyle{B}}}\}\cup\pazocal{D}\}, a linear program of the form:

Maximize  V(s)
Subject to:  \sum\limits_{d\in\pazocal{A}_{D}(s)}p_{D}(s,d)=1,
  p_{D}(s,d)\geqslant 0, for all d\in\pazocal{A}_{D}(s),
  V(s)\leqslant\sum\limits_{d\in\pazocal{A}_{D}(s)}Q(s,d,a)\,p_{D}(s,d), for all a\in\pazocal{A}_{A}(s).

We note that |𝐒{{ϕ,τA,τB}D}|=Nq+1|{\bf S}\setminus\{\{\phi,\tau_{{\scriptscriptstyle{A}}},\tau_{{\scriptscriptstyle{B}}}\}\cup\pazocal{D}\}|=N-q+1. The above linear program has complexity of O((Nq+1)|AA||AD|)O({(N-q+1)}|\pazocal{A}_{{\scriptscriptstyle{A}}}||\pazocal{A}_{{\scriptscriptstyle{D}}}|) [5]. Thus solving for all (Nq+1)(N-q+1) states has a total complexity of O((Nq+1)2|AA||AD|)O({(N-q+1)}^{2}|\pazocal{A}_{{\scriptscriptstyle{A}}}||\pazocal{A}_{{\scriptscriptstyle{D}}}|). ∎

Algorithm 6.1 is guaranteed to converge to an NE strategy asymptotically. When the state space of the APT-DIFT game has cycles, Algorithm 6.1 updates the value vector recursively and consequently finite-time convergence cannot be guaranteed. However, when the IFG is acyclic, the state space of the APT-DIFT game is acyclic (Theorem 6.7) and finite-time convergence can be achieved using value iteration.

Henceforth, the following assumption holds.

Assumption 6.6.

The IFG G{\pazocal{G}} is acyclic.

The IFG obtained from the system log may not be acyclic in general. However, when the IFG is a cyclic graph, one can obtain an acyclic representation of the IFG using the node versioning technique proposed in [10]. Throughout this subsection, we consider directed acyclic IFGs rendered using the approach in [10] and hence Assumption 6.6 is non-restrictive.

The theorem below presents a termination condition of the APT-DIFT game when the IFG is acyclic.

Theorem 6.7.

Consider the APT-DIFT game. Let Assumption 6.6 hold and let $N$ be the number of nodes in the IFG. Then the state space of the APT-DIFT game is acyclic and the game terminates in at most $N+4$ steps.

Proof.

Consider any arbitrary strategy pair $(p_{A},p_{D})$. We first prove the acyclic property of the state space of the game under $(p_{A},p_{D})$. The state space ${\bf S}$ is constructed by augmenting the IFG with the states $v_{0},\phi$, $\tau_{A}$, and $\tau_{B}$. We note that a state $s\in\{\phi,\tau_{A},\tau_{B}\}\cup\pazocal{D}$ does not lie in a cycle in ${\bf S}$ as $s$ is an absorbing state and hence has no outgoing edges, i.e., $\pazocal{A}_{A}(s)=\pazocal{A}_{D}(s)=\emptyset$. The state $v_{0}$ does not lie in a cycle in ${\bf S}$ as there are no incoming edges to $v_{0}$. This concludes that a state $s\in\{v_{0},\phi,\tau_{A},\tau_{B}\}\cup\pazocal{D}$ is not part of a cycle in ${\bf S}$. Thus a cycle can possibly exist in ${\bf S}$ only if there is a cycle containing some of the states $v_{q+1},\ldots,v_{N}$, since $\pazocal{D}=\{v_{1},v_{2},\ldots,v_{q}\}$. Recall that the states $v_{1},\ldots,v_{N}$ correspond to nodes of $\pazocal{G}$. As $\pazocal{G}$ is acyclic, there are no cycles in ${\bf S}$. Since ${\bf S}$ is acyclic under any arbitrary strategy pair $(p_{A},p_{D})$ and the state space has finite cardinality, the game terminates in a finite number of steps. Further, since $|{\bf S}|=N+4$, $T\leqslant N+4$. ∎

Using Theorem 6.7, we propose a value iteration algorithm with guaranteed finite time convergence. We will use the following definition in our approach.

For a directed acyclic graph, topological ordering of the node set is defined below.

Definition 6.8 ([16]).

A topological ordering of a directed graph is a linear ordering of its vertices such that for every directed edge (u,v)(u,v) from vertex uu to vertex vv, uu comes before vv in the ordering.

For a directed acyclic graph with vertex set BB and edge set EE there exists an algorithm of complexity O(|B|+|E|)O(|B|+|E|) to find the topological order [16]. Using the topological ordering, one can find a hierarchical level partitioning of the nodes of a directed acyclic graph. Let S\pazocal{S} be the topologically ordered set of nodes of 𝐒{\bf S}. Let the number of hierarchical levels associated with S\pazocal{S} be MM, say L1,L2,,LML_{1},L_{2},\ldots,L_{M}. Then L1=v0L_{1}=v_{0}, LM={ϕ,τA,τB}DL_{M}=\{\phi,\tau_{{\scriptscriptstyle{A}}},\tau_{{\scriptscriptstyle{B}}}\}\cup\pazocal{D}. Moreover, a state ss is said to be in hierarchical level LjL_{j} if there exists an edge into ss from a state ss^{\prime} which is in some level LjL_{j^{\prime}}, where j<jj^{\prime}<j, and there is no edge into ss from a state s′′s^{\prime\prime} which is in level Lj′′L_{j^{\prime\prime}}, where j′′>jj^{\prime\prime}>j.
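A sketch of the level partitioning via Kahn's topological ordering, where the level of a state is one more than the maximum level of its predecessors; the small state space graph at the bottom is an illustrative assumption.

```python
from collections import deque

def hierarchical_levels(states, succ):
    """Level partition of a DAG: level(s) = 1 + max level over predecessors,
    so L_1 holds sources (here v0) and the last level holds the absorbing states.
    'succ' maps each state to its out-neighbors in the state space graph."""
    pred = {s: [] for s in states}
    indeg = {s: 0 for s in states}
    for u in states:
        for w in succ.get(u, []):
            pred[w].append(u)
            indeg[w] += 1
    level = {}
    queue = deque(s for s in states if indeg[s] == 0)
    while queue:                                  # Kahn's algorithm: topological order
        u = queue.popleft()
        level[u] = 1 + max((level[p] for p in pred[u]), default=0)
        for w in succ.get(u, []):
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return level

# Toy state space graph (illustrative, not from the paper's datasets).
states = ["v0", "v3", "v2", "v1", "phi", "tau_A", "tau_B"]
succ = {"v0": ["v3"], "v3": ["v2", "phi"], "v2": ["v1", "tau_A", "tau_B", "phi"]}
print(hierarchical_levels(states, succ))
```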

Algorithm 6.2 presents the pseudocode to solve the APT-DIFT game using the topological ordering and the hierarchical levels, when the IFG is acyclic.

Algorithm 6.2 Value iteration algorithm to find NE strategy for APT-DIFT game under Assumption 6.6
Input: APT-DIFT game with state space 𝐒, destination set 𝒟, action spaces 𝒜_D, 𝒜_A, payoff parameter β, transition probabilities FN(·), FP(·)
Output: Value vector V* and defender strategy p*_D
1: Find the topological ordering 𝒮 of the state space graph
2: Using 𝒮, obtain the sets of states corresponding to the hierarchical levels L_1, L_2, …, L_M
3: Initialize the value vector: V(s′) ← β for s′ ∈ {τ_A, ϕ} and V(s″) ← 0 for all s″ ∈ 𝐒 \ {τ_A, ϕ}
4: for k ∈ {M−1, M−2, …, 1} do
5:     for s ∈ L_k do
6:         V(s) ← max_{p_D ∈ 𝐏_D} min_{a ∈ 𝒜_A(s)} ∑_{s′ ∈ 𝐒} ∑_{d ∈ 𝒜_D(s)} p_D(s,d) P(s,a,d,s′) V(s′)
7:     end for
8: end for
9: return Value vector V* ← V
10: Compute DIFT strategy p*_D: p*_D(s) ← arg max_{p_D ∈ 𝐏_D} min_{a ∈ 𝒜_A(s)} ∑_{s′ ∈ 𝐒} ∑_{d ∈ 𝒜_D(s)} p_D(s,d) P(s,a,d,s′) V*(s′)
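As a hedged illustration, the backward pass of Algorithm 6.2 together with the max-min problem solved at each state (step 6) can be sketched in Python as follows. The dictionary layout of the transitions, action sets, and levels is an assumption made for readability, not the paper's implementation; the per-state problem is written as the standard matrix-game LP and solved with scipy.optimize.linprog.

```python
import numpy as np
from scipy.optimize import linprog

def solve_stage_game(Q):
    """Max-min LP at one state: maximize V(s) over the defender's mixed
    strategy p_D(s, .) subject to V(s) <= sum_d Q[d, a] * p_D(s, d) for every
    attacker action a. Rows of Q index defender actions, columns attacker
    actions. Returns (V(s), p_D(s, .))."""
    n_d, n_a = Q.shape
    c = np.zeros(n_d + 1)
    c[-1] = -1.0                                   # maximize v == minimize -v
    A_ub = np.hstack([-Q.T, np.ones((n_a, 1))])    # v - sum_d Q[d, a] p_D(d) <= 0
    b_ub = np.zeros(n_a)
    A_eq = np.hstack([np.ones((1, n_d)), np.zeros((1, 1))])   # sum_d p_D(d) = 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n_d + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[-1], res.x[:n_d]

def backward_value_iteration(levels, P, A_A, A_D, V_absorbing):
    """Backward induction over hierarchical levels L_{M-1}, ..., L_1.

    levels       : list of lists of states, levels[0] = L_1, levels[-1] = L_M
    P[(s, a, d)] : dict {next_state: probability} of the game transitions
    A_A, A_D     : per-state action sets
    V_absorbing  : known values at the absorbing states (beta or 0)
    """
    V = dict(V_absorbing)
    p_D = {}
    for level in reversed(levels[:-1]):            # k = M-1, M-2, ..., 1
        for s in level:
            # Q(s, d, a) = sum_{s'} P(s, a, d, s') V(s')
            Q = np.array([[sum(pr * V[t] for t, pr in P[(s, a, d)].items())
                           for a in A_A[s]]
                          for d in A_D[s]])
            V[s], p_D[s] = solve_stage_game(Q)
    return V, p_D
```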
Corollary 6.9.

Consider the APT-DIFT game and let 𝒜_A, 𝒜_D be the action sets of APT and DIFT, respectively. Let the IFG be acyclic with N nodes and q destination nodes. Then, Algorithm 6.2 returns the value vector V*. Moreover, Algorithm 6.2 has computational complexity O((N−q+1)^2 |𝒜_A| |𝒜_D|).

Proof.

Recall that under Assumption 6.6, the state space 𝐒 of the APT-DIFT game is acyclic. Thus the topological ordering 𝒮 and the hierarchical levels L_1, …, L_M of 𝐒 can be obtained in polynomial time [16]. The values at the absorbing states, i.e., the states at hierarchical level M, are V(s) = β for s ∈ {τ_A, ϕ} and V(s′) = 0 for all s′ ∈ {τ_B} ∪ 𝒟. Using the hierarchical levels, the value vector can be computed in a dynamic programming manner starting from the states in the last hierarchical level. First, using the values of the states at level M, the values of the states at level M−1 are obtained. Similarly, using the values of the states at levels M−1 and M, the values of the states at level M−2 are obtained, and so on. By this recursive computation we obtain the value vector V* and the corresponding DIFT strategy p*_D.

Finding the topological ordering and the corresponding hierarchical levels of the state space graph requires O(|𝐒|^2) computations. The linear program associated with each state in the value iteration involves O((N−q+1) |𝒜_A| |𝒜_D|) computations (see the proof of Theorem 6.5). Since we need to compute the values of (N−q+1) states, the complexity of Algorithm 6.2 is O((N−q+1)^2 |𝒜_A| |𝒜_D|). ∎

6.2 Hierarchical Supervised Learning (HSL) Algorithm

This subsection presents our approach to solve the APT-DIFT game when the game model is not fully known. It is often unrealistic to assume that the precise false-positive and false-negative rates of DIFT are known, as these values are estimated empirically; consequently, the transition probabilities of the APT-DIFT game are unknown. Traditional reinforcement learning algorithms (e.g., Q-learning) for constant-sum games with unknown transition probabilities assume perfect information, i.e., each player can observe the past actions and rewards of its opponent [11].

On the other hand, dynamic programming-based approaches, including value iteration and policy iteration algorithms, require the transition probabilities in their computations. When the FN(·) and FP(·) values are unknown, the optimization problem associated with the states in step 6 of Algorithm 6.1 and step 6 of Algorithm 6.2 cannot be solved. That is, in the LP

Problem 6.10.

Maximize V(s)
Subject to:
∑_{d ∈ 𝒜_D(s)} p_D(s,d) = 1,
p_D(s,d) ≥ 0, for all d ∈ 𝒜_D(s),
V(s) ≤ ∑_{d ∈ 𝒜_D(s)} Q(s,d,a) p_D(s,d), for all a ∈ 𝒜_A(s),

the Q(s,a,d) values are unknown, where

Q(s,a,d) = ∑_{s′ ∈ 𝐒} P(s,a,d,s′) V(s′),

and P(s,a,d,s′) is defined in Eq. (1). To the best of our knowledge, there are no known algorithms, with guaranteed equilibrium convergence, to solve imperfect- and incomplete-information stochastic game models such as the APT-DIFT game presented in this paper. To this end, we propose a supervised learning-based approach to solve the APT-DIFT game. Our approach consists of two key steps.

  1. (1)

    Training a neural network to approximate the value vector of the game for a given strategy pair, and

  2. (2)

A policy iteration algorithm to compute an ϵ-optimal NE strategy.

Our approach to solve the APT-DIFT game, when the rates of false-positives and false-negatives are unknown, utilizes the topological ordering and the hierarchical levels of the state space graph. We propose a hierarchical supervised learning (HSL)-based approach that predicts the Q-values of the APT-DIFT game and then solves the LP (Problem 6.10) for all the states in a hierarchical manner, to approximate an NE strategy of the APT-DIFT game. In HSL, we utilize a neural network to obtain a mapping between the strategies of the players and the value vector. Since this mapping is an approximation, HSL computes an approximate equilibrium, and we evaluate its performance empirically using real attack datasets.

In order to predict the Q-values of the game, we train a neural network to learn the value vector of the game and use the trained model to predict the Q-values. The reasons to train a neural network to learn the value vector, instead of directly learning the Q-values, are: (a) while the value vector has dimension |𝐒|, the Q-values have dimension |𝐒| |𝒜_A| |𝒜_D|, and (b) data samples for the value vector are easy to generate. The approach for data generation and training is elaborated below.

We note that it is possible to simulate the APT-DIFT game and observe the final game outcome, i.e., the payoffs of the players. For a given (p_A, p_D), the value at state s, V(s), is the payoff of the defender when the game originates at state s, i.e., V(s) = U_D(s, p_A, p_D). Training the neural network to predict the value vector of the APT-DIFT game consists of two steps (a minimal simulation sketch is given after the two steps below).

  1. (i)

Generate random samples of strategy pairs (p_A, p_D).

  2. (ii)

Simulate the APT-DIFT game for each randomly generated strategy pair and obtain the values corresponding to all states.
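A minimal Monte Carlo sketch of step (ii) is given below. For concreteness, the simulator is represented here by a transition dictionary P; in the paper's setting the FP(·) and FN(·) rates behind these transitions are unknown to the players, and the values would be obtained by playing out the game and observing the terminal payoff. All data structures, and the number of runs, are illustrative assumptions.

```python
import numpy as np

def simulate_value_vector(states, P, A_A, A_D, absorbing, payoff,
                          p_A, p_D, n_runs=200, seed=0):
    """Monte Carlo estimate of V(s) = U_D(s, p_A, p_D) for every state s.

    P[(s, a, d)] is a dict {next_state: probability} representing one game
    transition; absorbing is the set of terminal states; payoff[s] is the
    defender payoff at terminal state s (beta at tau_A and phi, 0 otherwise).
    p_A[s] and p_D[s] are probability vectors over A_A[s] and A_D[s].
    """
    rng = np.random.default_rng(seed)
    V = {}
    for s0 in states:
        total = 0.0
        for _ in range(n_runs):
            s = s0
            while s not in absorbing:
                a = A_A[s][rng.choice(len(A_A[s]), p=p_A[s])]   # attacker's move
                d = A_D[s][rng.choice(len(A_D[s]), p=p_D[s])]   # defender's move
                nxt = list(P[(s, a, d)].items())
                s = nxt[rng.choice(len(nxt), p=[pr for _, pr in nxt])][0]
            total += payoff[s]                                   # terminal payoff
        V[s0] = total / n_runs
    return V
```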

The neural network takes as input the strategy pairs and outputs the value vector. The training is done using a multi-input, multi-output neural network, represented as ℱ : X → Y, where X ⊆ [0,1]^{|𝒜_A|} × [0,1]^{|𝒜_D|} and Y ⊆ ℝ^{|𝐒|}. The neural network may not compute the exact value vector; however, it can approximate the value vector to arbitrary accuracy. Given a function f(x) and ξ > 0, the guarantee is that, by using enough hidden neurons, it is always possible to find a neural network whose output g(x) satisfies |f(x) − g(x)| < ξ for all inputs x [8]. In other words, the approximation will be good to within the desired accuracy for every possible input. However, the training method does not guarantee that the neural network obtained at the end of the training process satisfies the specified level of accuracy. Using the trained neural network, we predict the Q-values of the game.
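A minimal sketch of such a network, assuming the TensorFlow/Keras API and the two-dense-layer ReLU architecture later used in Section 7.3, is shown below; the layer widths, optimizer, loss, and validation split are illustrative choices rather than a prescription.

```python
import tensorflow as tf

def build_value_network(n_inputs, n_states):
    """Value-predicting network F: (p_A, p_D) -> V (a sketch).

    n_inputs is the length of the concatenated, flattened strategy pair and
    n_states = |S| is the length of the predicted value vector.
    """
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(1000, activation="relu", input_shape=(n_inputs,)),
        tf.keras.layers.Dense(1000, activation="relu"),
        tf.keras.layers.Dense(n_states),        # one linear output per state: V(s)
    ])
    model.compile(optimizer="sgd", loss="mae")  # SGD with mean absolute error
    return model

# X[i] encodes a sampled strategy pair (p_A, p_D); Y[i] is the value vector
# obtained by simulating the game under that pair (see the sketch above).
# model = build_value_network(X.shape[1], Y.shape[1])
# model.fit(X, Y, epochs=100, validation_split=0.1)
```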

Lemma 6.11.

Consider the APT-DIFT game with state space 𝐒. Let a neural network be trained using samples of strategy pairs (p_A, p_D) to predict the value vector V of the game such that the mean absolute error is within the desired tolerance of ζ ≤ 0.01. Then, the trained neural network also yields the Q-values, Q(s,a,d), for all s ∈ 𝐒, a ∈ 𝒜_A, and d ∈ 𝒜_D.

Proof.

Consider a neural network that is trained using enough samples of strategy pairs to predict the value vector of the game. Given a strategy pair (p_A, p_D), the neural network thus predicts V = {V(s) : s ∈ 𝐒}. Consider an arbitrary state s. The Q-value corresponding to state s and action pair (a, d) of the APT and DIFT, respectively, is given by Q(s,a,d) = ∑_{s′ ∈ 𝐒} P(s,a,d,s′) V(s′). For a strategy pair (p_A, p_D) with p_A(s) such that p_A(s,a) = 1 and p_D(s) such that p_D(s,d) = 1, we obtain

V(s) = ∑_{s′ ∈ 𝐒} ∑_{a ∈ 𝒜_A} ∑_{d ∈ 𝒜_D} p_A(s,a) p_D(s,d) P(s,a,d,s′) V(s′)
     = ∑_{s′ ∈ 𝐒} P(s,a,d,s′) V(s′)
     = Q(s,a,d).

Hence the Q-values of the game can be obtained from the neural network that is trained to predict the value vector, by inputting a strategy pair (p_A, p_D) with p_A(s,a) = 1 and p_D(s,d) = 1, for any s ∈ 𝐒, a ∈ 𝒜_A, and d ∈ 𝒜_D. ∎

Using the trained neural network we run a policy iteration algorithm. In the policy iteration, we update the strategies of both players, APT and DIFT, by solving the stage games in a dynamic programming manner. The details of the algorithm are presented below.

Algorithm 6.3 HSL Algorithm for APT-DIFT game
Input: APT-DIFT game with state space 𝐒, destination set 𝒟, action sets 𝒜_A, 𝒜_D, payoff parameter β
Output: Value vector V̂ and defender strategy p̂_D
1: Generate random samples of (p_A, p_D) and the corresponding value vectors
2: Train ℱ using the data set from Step 1
3: Find the topological ordering 𝒮 of the state space graph
4: Obtain the sets of states corresponding to the hierarchical levels L_1, L_2, …, L_M
5: Initialize V(τ_A) ← β, V(ϕ) ← β, V(τ_B) ← 0, and V(s) ← 0 for s ∈ 𝒟
6: Initialize (p̂_A^(M−1), p̂_D^(M−1)) randomly
7: for k ∈ {M−1, M−2, …, 1} do
8:     for s ∈ L_k do
9:         for a ∈ 𝒜_A(s), d ∈ 𝒜_D(s) do
10:             Update (p̂_A^(k), p̂_D^(k)) such that p̂_A^(k)(s,a) = 1 and p̂_D^(k)(s,d) = 1
11:             Predict the value vector V̂ using the neural network ℱ: V̂ ← ℱ(p̂_A^(k), p̂_D^(k))
12:             Assign Q̂(s,a,d) ← V̂(s)
13:         end for
14:         Assign Q(s,a,d) ← Q̂(s,a,d) and solve Problem 6.10 for s to obtain V(s) and p_D(s)
15:         Update p̂_D^(k) such that p̂_D^(k)(s) ← p_D(s)
16:         Find the action â minimizing ∑_{d ∈ 𝒜_D(s)} Q̂(s,a,d) p̂_D^(k)(s,d) over a ∈ 𝒜_A(s)
17:         Update p̂_A^(k) such that p̂_A^(k)(s, â) ← 1
18:     end for
19: end for
20: return p̂_D ← p̂_D^(k)

Algorithm 6.3 presents the pseudocode of our HSL algorithm for solving the APT-DIFT game. Initially, we generate random samples of strategy pairs (p_A, p_D) together with the corresponding value vectors V and train a neural network. Using the trained neural network, we then run a strategy iteration. Since Assumption 6.6 holds, the state space graph 𝐒 is acyclic, and one can compute the topological ordering (Step 3) and the hierarchical levels (Step 4) of 𝐒 in polynomial time. The values at the absorbing states are known and are hence set as V(τ_A) = V(ϕ) = β, V(τ_B) = 0, and V(s) = 0 for all s ∈ 𝒟. As the values of the states at level M are known, the algorithm begins with level M−1. Initially, the strategies of APT and DIFT are randomly set to p̂_A^(M−1) and p̂_D^(M−1), respectively. We then compute the Q-values of all the states using the trained neural network, following the hierarchical order. Due to the hierarchical structure of the state space, the computation of the Q-values of a state at the j-th hierarchical level depends only on the states at levels higher than j. Consider level M−1 and an arbitrary state s in level M−1. The Q-values of s are predicted using the trained neural network by selecting suitable deterministic strategy pairs. That is, to predict Q(s,a,d), choose strategies such that p̂_A^(M−1)(s,a) = 1 and p̂_D^(M−1)(s,d) = 1 as input to the neural network. Using the Q-values of state s, solve the LP to obtain the value, the DIFT strategy, and the optimal action of the APT at s. The strategies p̂_A^(M−1) and p̂_D^(M−1) are updated using the output of the LP such that p̂_A^(M−1)(s) and p̂_D^(M−1)(s) correspond to an NE strategy of the stage game at s. Once the strategies of both players are updated for all the states in level M−1, the process continues for level M−2, and so on.

In the HSL algorithm, the prediction of Q-values using the neural network (Steps 8–12) for the states in a particular level can be run in parallel. To elaborate, suppose there are x states in level L_j. Then an action pair (a,d) is selected for each state and the prediction step of the algorithm is run for all of them together. Thus, every run of the neural network can predict x Q-values, one for each of the x states.
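A hedged sketch of this batched prediction step is given below. It assumes that states are integer indices into the predicted value vector and that a hypothetical helper encode(profile, s, a, d) builds the network input for the current strategy profile with the deterministic choices p_A(s,a) = 1 and p_D(s,d) = 1 forced at state s (Lemma 6.11).

```python
import numpy as np

def predict_q_values_for_level(model, level_states, profile, encode, A_A, A_D):
    """Batch prediction of Q(s, a, d) for all states in one hierarchical level.

    profile holds the current strategy pair (already updated for higher
    levels); encode(profile, s, a, d) is an assumed helper returning the
    network input with p_A(s, a) = 1 and p_D(s, d) = 1 forced at state s.
    """
    inputs, index = [], []
    for s in level_states:
        for a in A_A[s]:
            for d in A_D[s]:
                inputs.append(encode(profile, s, a, d))
                index.append((s, a, d))
    V_hat = model.predict(np.asarray(inputs), verbose=0)   # one batched forward pass
    # Q-hat(s, a, d) is read off as the predicted value at state s (Step 12).
    return {(s, a, d): V_hat[i][s] for i, (s, a, d) in enumerate(index)}
```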

Remark: APT attacks typically consist of multiple stages, e.g., initial compromise, internal reconnaissance, foothold establishment, and data exfiltration, with each stage having a specific set of targets. To capture the multi-stage nature of the attack, we construct a multi-stage IFG, 𝒢_m, from the IFG 𝒢. Consider an attack that consists of m attack stages, with the destinations of stage j denoted by 𝒟_j. We create m copies of the IFG 𝒢 such that the nodes in 𝒟_j in the j-th copy of 𝒢 are connected to the respective nodes in 𝒟_j in the (j+1)-th copy, for j ∈ {1, …, m−1}. Also, set 𝒟_m = 𝒟. The construction of 𝒢_m guarantees that any path that originates in the set λ in the first copy and terminates in 𝒟 in the m-th copy is a feasible attack path. Figure 1 shows a schematic diagram of a multi-stage IFG. For notational brevity, the paper presents the single-stage attack case, i.e., m = 1. All the algorithms and results presented in this paper also apply to the case of a multi-stage attack, using 𝒢_m as the IFG.
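Assuming the IFG is available as a networkx directed graph, the construction of 𝒢_m can be sketched as follows; the node naming scheme and the stage-destination bookkeeping are illustrative.

```python
import networkx as nx

def build_multistage_ifg(G, stage_destinations):
    """Build the multi-stage IFG G_m from m copies of the IFG G.

    stage_destinations[j-1] is the destination set D_j of stage j; node (j, v)
    denotes node v in the j-th copy, and each destination of stage j is
    connected to its counterpart in copy j+1.
    """
    m = len(stage_destinations)
    G_m = nx.DiGraph()
    for j in range(1, m + 1):
        G_m.add_nodes_from((j, v) for v in G.nodes())
        G_m.add_edges_from(((j, u), (j, v)) for u, v in G.edges())
    for j in range(1, m):
        for d in stage_destinations[j - 1]:
            G_m.add_edge((j, d), (j + 1, d))     # stage j -> stage j+1
    return G_m
```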

Figure 1: An example of a multi-stage APT attack scenario (Figure 1(a)). An illustrative diagram of the IFG 𝒢 (Figure 1(b)) and the corresponding multi-stage IFG 𝒢_m (Figure 1(c)) that consists of three stages of the attack, i.e., m = 3. The propagation of the attack from stage j to stage (j+1) is captured in 𝒢_m by connecting the destination nodes 𝒟_j in stage j to their respective nodes in stage (j+1), for j = 1, 2.

7 Simulation

In this section we test and validate Algorithms 6.1 and 6.2 and the HSL algorithm (Algorithm 6.3) on two real-world attack datasets. First we provide the details of the attack datasets and explain the construction of the IFG corresponding to each attack from its respective system log data. Then we discuss our experiments and present the results.

7.1 Attack Datasets

We use two attack datasets in our experiments. The first dataset corresponds to a nation-state attack [6] and the second is related to a ransomware attack [20]. Each attack was executed individually on a computer running the Linux operating system, and system logs were recorded through the RAIN system [15]. The system logs contain records related to both malicious and benign information flows.

7.1.1 Nation State Attack

The nation-state attack (a state-of-the-art APT attack) is a three-day adversarial engagement orchestrated by a US DARPA red team during the evaluation of the RAIN system. The attack campaign was designed to steal sensitive proprietary and personal information from the victim system. In our experiments we used the system logs recorded during day 1 of the attack. The attack consists of four key stages: initial compromise, internal reconnaissance, foothold establishment, and data exfiltration. Our experiments considered the first stage of the attack, i.e., the initial compromise stage, where the APT used spear-phishing to lead the victim user to a website that was hosting ads from a malicious web server. After the victim navigated to the website, the APT exploited a vulnerability in the Firefox browser and compromised the Firefox process.

7.1.2 Ransomware Attack

The ransomware attack campaign was designed to block access to the files in the ./home directory and demand a payment from the victim in exchange for restoring access. The attack consists of three stages: privilege escalation, lateral movement of the attack, and encrypting and deleting the ./home directory. Upon reaching the final stage of the attack, the APT first opened and read all the files in the ./home directory of the victim computer. Then the APT wrote the content of the files in ./home into an encrypted file named ransomware.encrypted. Finally, the APT deleted the ./home directory.

7.2 Construction of IFG

Direct conversion of the system logs into an IFG typically results in a coarse graph with a large number of nodes and edges, as it includes all the information flows recorded during the system execution. The majority of these information flows are related to the system's background processes (noise), which are unrelated to the actual attack campaign, and it is computationally intensive to run the proposed algorithms on a state space induced by such a coarse graph. Hence, we use the following pruning steps to reduce the coarse graph without losing any attack-related causal information flow dependencies.

  1. 1.

When multiple edges with the same orientation exist between two nodes in the coarse IFG, combine them into a single directed edge. For example, consider a scenario where multiple “read” system calls are recorded between a file and a process. This results in multiple edges between the two corresponding nodes of the coarse IFG. Our APT-DIFT game formulation only requires capturing whether an information flow is feasible between a pair of processes and files. Hence, in scenarios similar to the above example, we collapse all the multiple edges between the two nodes in the coarse IFG into a single edge.

  2. 2.

Find all the nodes in the coarse IFG that lie on at least one information flow path from an entry point of the attack to a target of the attack. When the attack consists of multiple stages, find all the nodes in the coarse IFG that lie on at least one information flow path from a destination of stage j to a destination of stage j+1, for all j ∈ {1, …, m−1}.

  3. 3.

From the coarse graph, extract the subgraph corresponding to the entry points, the destinations, and the set of nodes found in Step 2.

  4. 4.

Group all the nodes that correspond to files of a single directory into a single node representing the parent file directory. For example, assume the resulting coarse IFG has three files, ./home/inventory/prices.xlsx, ./home/vendors/contacts/addresses.doc, and ./home/vendors/ledger.db, that are uniquely identified by their respective file paths. In this case we group all three files, prices.xlsx, addresses.doc, and ledger.db, under one super-node corresponding to the parent directory ./home. Incoming and outgoing edges associated with each of the three files are then connected to the new node ./home. If any subset of the new edges connected to ./home induces multiple edges with the same orientation to another node in the IFG, then follow Step 1. It is also possible to group the files into the respective sub-directories, such as ./home/inventory/ and ./home/vendors/ in the example. Such groupings facilitate finer-grained security analysis at the cost of larger IFGs, which require more computational resources to run the APT-DIFT game algorithms presented in this paper.

  5. 5.

If the resulting graph after Steps 1–4 contains any cycles, use node versioning techniques [10] to remove the cycles while preserving the information flow dependencies in the graph.

Steps 2 and 3 are performed using the upstream, downstream, and point-to-point stream pruning techniques described in [15]. The resulting information flow graph is called the pruned IFG. We tested the proposed algorithms on the state spaces of the APT-DIFT games corresponding to these pruned IFGs.
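A sketch of pruning Steps 1–3 for the single-stage case, assuming the coarse IFG is available as a networkx (multi)digraph, is given below; the directory grouping of Step 4 and the node versioning of Step 5 [10] are omitted for brevity.

```python
import networkx as nx

def prune_coarse_ifg(coarse_ifg, entry_points, targets):
    """Steps 1-3 of Section 7.2 on the coarse IFG (single-stage attack)."""
    # Step 1: converting a MultiDiGraph to a DiGraph keeps at most one edge
    # per ordered node pair, collapsing parallel edges of the same orientation.
    G = nx.DiGraph(coarse_ifg)
    # Step 2: nodes reachable from some entry point and able to reach some target.
    forward, backward = set(), set()
    for e in entry_points:
        forward |= {e} | nx.descendants(G, e)
    for t in targets:
        backward |= {t} | nx.ancestors(G, t)
    # Step 3: extract the subgraph induced by the entry points, the targets,
    # and the nodes lying on an entry-to-target information flow path.
    keep = (forward & backward) | set(entry_points) | set(targets)
    return G.subgraph(keep).copy()
```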

For the nation state attack, the initial conversion of the system logs into an IFG resulted in a coarse graph with 299 nodes and 404 edges, presented in Figure 2(a). We used Steps 1 to 4 explained in Section 7.2 to obtain a pruned IFG with 30 nodes and 74 edges. A network socket connected to an untrusted IP address was identified as the entry point of the attack, and the target of the attack is the Firefox process. Figure 2(b) shows the pruned IFG of the nation state attack. In the pruned IFG, there are 21 information flow paths from the entry point to the target Firefox.

Figure 2: Relevant attack information of the nation state attack: (a) coarse IFG and (b) pruned IFG. Nodes of the graph are color coded to illustrate their respective types (network socket, file, and process). A network socket is identified as the entry point of the nation state attack. The target of the attack is the Firefox process.
Figure 3: Relevant attack information of the ransomware attack: (a) coarse IFG and (b) pruned IFG. Nodes of the graph are color coded to illustrate their respective types (network socket, file, and process). Two network sockets are identified as the entry points of the ransomware attack. The targets of the attack (/usr/bin/sudo, /bin/bash, /home) are labeled in the graph.
Figure 4: Figure 4(a) plots the value V^(k)(s_0) of the APT-DIFT game corresponding to the pruned IFG of the nation state attack given in Figure 2, computed at iterations k = 1, 2, …, 32 in Algorithm 6.1. Figure 4(b) plots the maximum absolute error δ = max_{s ∈ 𝐒} |V^(k)(s) − V^(k−1)(s)|, for k = 1, 2, …, 32. The payoff parameter is set to β = 100.
Figure 5: Figure 5(a) shows the mean absolute error μ between the actual value vector V and the estimated value vector V̂ (using the neural network) for different payoff parameter values β = 5, 10, …, 100. We used 10^5 training data samples in the HSL algorithm (Algorithm 6.3). Figure 5(b) shows the variation of μ with respect to the number of training data samples used in the HSL algorithm; in this experiment β = 50. Each data point in both cases is calculated by averaging the results over 10 independent trials, and the error bars show the standard errors associated with the μ values obtained in the different trials. The pruned IFG corresponding to the ransomware attack in Figure 3 was used in both cases.

For the ransomware attack, direct conversion of the system logs resulted in a coarse IFG with 173 nodes and 482 edges, presented in Figure 3(a). We pruned the resulting coarse IFG using the steps given in Section 7.2. The pruned IFG of the ransomware attack consists of 18 nodes and 29 edges. Two network sockets that indicate a series of communications with external IP addresses in the recorded system logs were identified as the entry points of the attack. Figure 3(b) illustrates the pruned IFG of the ransomware attack with the annotated targets /usr/bin/sudo, /bin/bash, and /home.

7.3 Experiments and Results

Algorithm 6.1 was implemented on the APT-DIFT game model associated with the pruned cyclic IFG of the nation state attack given in Figure 2. Figure 4(a) shows the convergence of the value V^(k)(s_0) of the APT-DIFT game with the iteration number k. The threshold on the maximum absolute error δ in Algorithm 6.1 was set to 10^{-7}, where δ = max_{s ∈ 𝐒} |V^(k)(s) − V^(k−1)(s)| and k = 1, 2, …. Figure 4(b) shows that δ decreases monotonically with k. At iteration k = 32, δ = 7.79 × 10^{-8}.

The HSL algorithm (Algorithm 6.3) was tested on the APT-DIFT game associated with the pruned acyclic IFG of the ransomware attack given in Figure 3, and the results are given in Figure 5. We used a sequential neural network with two dense layers to learn the value vector for a given strategy pair of APT and DIFT. Each dense layer consists of 1000 neurons with the ReLU activation function. The training data consist of tuples of APT-DIFT strategy pairs and the corresponding value vectors of the APT-DIFT game. We generated 1 × 10^5 training data samples that consist of randomly generated deterministic APT policies. We randomly generated DIFT's policies such that 40% of the strategies are stochastic (mixed) and 60% are deterministic. For each randomly generated APT and DIFT strategy pair, the corresponding value vector was computed. A stochastic gradient descent optimizer was used to train the neural network. In each experiment trial, the neural network was trained for 100 episodes and the validation error threshold was set to < 1%.

In Figure 5(a) we analyze the sensitivity of the estimated value vector to variations of the payoff parameter β by plotting the mean absolute error μ between the actual value vector V and the estimated value vector V̂, i.e., μ = ∑_{s ∈ 𝐒} |V(s) − V̂(s)| / |𝐒|, with respect to β. In these experiments, 10^5 training data samples were used in the HSL algorithm. The results show that the V̂ values are close to and consistent with the V values when β takes smaller values, and that the deviation between V̂ and V increases when β takes larger values. A reason for this behavior can be the numerical instability associated with the training and estimation tasks performed by the neural network model used in the HSL algorithm. In order to study the effect of the number of training data samples on the estimated value vector, we plot μ against the number of training data samples used in the HSL algorithm in Figure 5(b); β = 50 was used in these experiments. The results show that μ decreases with the number of training data samples, as the increased number of training samples improves the learning in the neural network model.

8 Conclusion

This paper studied the detection of Advanced Persistent Threats (APTs) using a Dynamic Information Flow Tracking (DIFT)-based detection mechanism. We modeled the strategic interaction between the APT and DIFT as a non-cooperative stochastic game with a total reward structure. The APT-DIFT game has imperfect information, as both APT and DIFT are unaware of the actions of the opponent. Our game model also incorporates the false-positives and false-negatives generated by DIFT. We considered two scenarios of the game: (i) when the transition probabilities, i.e., the rates of false-positives and false-negatives, are known to both players, and (ii) when the transition probabilities are unknown to both players. For case (i), we proposed a value iteration-based algorithm and proved its convergence and complexity. For case (ii), we proposed Hierarchical Supervised Learning (HSL), a supervised learning-based algorithm that utilizes the hierarchical structure of the state space of the APT-DIFT game and integrates a neural network with a policy iteration algorithm to compute an approximate equilibrium of the game. We implemented our algorithms on real attack datasets for nation state and ransomware attacks collected using the RAIN framework and validated our results and performance.

References

  • [1] M. Ahmadi, M. Cubuktepe, N. Jansen, S. Junges, J.-P. Katoen, and U. Topcu. The partially observable games we play for cyber deception. arXiv:1810.00092, 2018.
  • [2] M. Ahmadi, A. A. Viswanathan, M. D. Ingham, K. Tan, and A. D. Ames. Partially Observable Games for Secure Autonomy. 2020.
  • [3] R. Amir. Stochastic games in economics and related fields: An overview. Stochastic Games and Applications, pages 455–470, 2003.
  • [4] B. Bencsáth, G. Pék, L. Buttyán, and M. Felegyhazi. The cousins of Stuxnet: Duqu, Flame, and Gauss. Future Internet, 4(4):971–1003, 2012.
  • [5] C. Boutilier, T. Dean, and S. Hanks. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research, 11:1–94, 1999.
  • [6] S. W. Brenner. Cyberthreats: The emerging fault lines of the nation state. Oxford University Press, 2009.
  • [7] J. Clause, W. Li, and A. Orso. Dytan: a generic dynamic taint analysis framework. International Symposium on Software Testing and Analysis, pages 196–206, 2007.
  • [8] B. C. Csáji et al. Approximation with artificial neural networks. Faculty of Sciences, Eötvös Loránd University, Hungary, 24(48):7, 2001.
  • [9] W. Enck, P. Gilbert, S. Han, V. Tendulkar, B.-G. Chun, L. P. Cox, J. Jung, P. McDaniel, and A. N. Sheth. Taintdroid: an information-flow tracking system for realtime privacy monitoring on smartphones. ACM Transactions on Computer Systems, 32(2):5, 2014.
  • [10] M. N. Hossain, J. Wang, R. Sekar, and S. D. Stoller. Dependence-preserving data compaction for scalable forensic analysis. USENIX Security Symposium, pages 1723–1740, 2018.
  • [11] J. Hu and M. P. Wellman. Multiagent reinforcement learning: Theoretical framework and an algorithm. International Conference on Machine Learning, 98:242–250, 1998.
  • [12] J. Hu and M. P. Wellman. Nash Q-learning for general-sum stochastic games. Journal of machine learning research, 4:1039–1069, 2003.
  • [13] L. Huang and Q. Zhu. Adaptive strategic cyber defense for advanced persistent threats in critical infrastructure networks. ACM SIGMETRICS Performance Evaluation Review, 46(2):52–56, 2019.
  • [14] J. Jang-Jaccard and S. Nepal. A survey of emerging threats in cybersecurity. Journal of Computer and System Sciences, 80(5):973–993, 2014.
  • [15] Y. Ji, S. Lee, E. Downing, W. Wang, M. Fazzini, T. Kim, A. Orso, and W. Lee. RAIN: Refinable attack investigation with on-demand inter-process information flow tracking. ACM SIGSAC Conference on Computer and Communications Security, pages 377–390, 2017.
  • [16] A. B. Kahn. Topological sorting of large networks. Communications of the ACM, 5(11):558–562, 1962.
  • [17] M. L. Littman. Value-function reinforcement learning in Markov games. Cognitive Systems Research, 2(1):55–66, 2001.
  • [18] K.-w. Lye and J. M. Wing. Game strategies in network security. International Journal of Information Security, 4(1-2):71–86, 2005.
  • [19] S. Misra, S. Moothedath, H. Hosseini, J. Allen, L. Bushnell, W. Lee, and R. Poovendran. Learning equilibria in stochastic information flow tracking games with partial knowledge. IEEE Conference on Decision and Control, pages 4053–4060, 2019.
  • [20] S. Mohurle and M. Patil. A brief study of wannacry threat: Ransomware attack 2017. International Journal of Advanced Research in Computer Science, 8(5), 2017.
  • [21] S. Moothedath, D. Sahabandu, J. Allen, A. Clark, L. Bushnell, W. Lee, and R. Poovendran. Dynamic Information Flow Tracking for Detection of Advanced Persistent Threats: A Stochastic Game Approach. arXiv: 2006.12327, 2020.
  • [22] S. Moothedath, D. Sahabandu, J. Allen, A. Clark, L. Bushnell, W. Lee, and R. Poovendran. A game-theoretic approach for dynamic information flow tracking to detect multi-stage advanced persistent threats. IEEE Transactions on Automatic Control, 2020.
  • [23] K. Najim, A. S. Poznyak, and E. Gomez. Adaptive policy for two finite Markov chains zero-sum stochastic game with unknown transition matrices and average payoffs. Automatica, 37(7):1007–1018, 2001.
  • [24] H. L. Prasad, L. A. Prashanth, and S. Bhatnagar. Two-timescale algorithms for learning Nash equilibria in general-sum stochastic games. International Conference on Autonomous Agents and Multiagent Systems, pages 1371–1379, 2015.
  • [25] S. Rass, S. König, and S. Schauer. Defending against advanced persistent threats using game-theory. PLOS ONE, Public Library of Science, 12(1):1–43, 2020.
  • [26] H. Royden and P. Fitzpatrick. Real Analysis. Prentice Hall, 2010.
  • [27] D. Sahabandu, S. Moothedath, J. Allen, L. Bushnell, W. Lee, and R. Poovendran. Stochastic dynamic information flow tracking game with reinforcement learning. International Conference on Decision and Game Theory for Security, pages 417–438, 2019.
  • [28] D. Sahabandu, S. Moothedath, J. Allen, L. Bushnell, W. Lee, and R. Poovendran. A Multi-Agent Reinforcement Learning Approach for Dynamic Information Flow Tracking Games for Advanced Persistent Threats. arXiv: 2007.00076, 2020.
  • [29] D. Sahabandu, S. Moothedath, J. Allen, A. Clark, L. Bushnell, W. Lee, and R. Poovendran. Dynamic information flow tracking games for simultaneous detection of multiple attackers. IEEE Conference on Decision and Control, pages 567–574, 2019.
  • [30] D. Sahabandu, S. Moothedath, J. Allen, A. Clark, L. Bushnell, W. Lee, and R. Poovendran. A game theoretic approach for dynamic information flow tracking with conditional branching. American Control Conference, pages 2289–2296, 2019.
  • [31] L. S. Shapley. Stochastic games. Proceedings of the national academy of sciences, 39(10):1095–1100, 1953.
  • [32] G. E. Suh, J. W. Lee, D. Zhang, and S. Devadas. Secure program execution via dynamic information flow tracking. ACM Sigplan Notices, 39(11):85–96, 2004.
  • [33] W. Tian, X.-P. Ji, W. Liu, J. Zhai, G. Liu, Y. Dai, and S. Huang. Honeypot game-theoretical model for defending against apt attacks with limited resources in cyber-physical systems. ETRI Journal, 41(5):585–598, 2019.
  • [34] B. Watkins. The impact of cyber attacks on the private sector. Briefing Paper, Association for International Affair, 12:1–11, 2014.
  • [35] Q. Zhu and T. Başar. Robust and resilient control design for cyber-physical systems with an application to power systems. IEEE Decision and Control and European Control Conference, pages 4066–4071, 2011.

9 Appendix

Proof of Proposition 5.2: It is known that the class of zero-sum, finite stochastic games with nonzero stopping (termination) probability has a Nash equilibrium in stationary strategies [31]. Note that the APT-DIFT game is a stochastic game with a finite state space and finite action spaces for both players. Moreover, the transition probabilities of the game, FP(·) and FN(·), satisfy 0 < FP(·) < 1 and 0 < FN(·) < 1. Thus the stopping probabilities of the APT-DIFT game are nonzero for all strategy pairs (p_A, p_D), except for a deterministic policy in which the defender does not perform security analysis at any state. However, such a policy is trivially irrelevant to the game, as the defender is idle and essentially not participating in the game. As the result for zero-sum games also holds for constant-sum games, it follows from [31] that there exists an NE for the APT-DIFT game. ∎