
Encrypted Value Iteration and Temporal Difference Learning over Leveled Homomorphic Encryption

Jihoon Suh    Takashi Tanaka This work is supported by NSF Award 1944318. Both authors are with the Department of Aerospace Engineering and Engineering Mechanics, University of Texas at Austin, TX 78712, USA. {jihoonsuh,ttanaka}@utexas.edu
Abstract

We consider an architecture of confidential cloud-based control synthesis based on Homomorphic Encryption (HE). Our study is motivated by the recent surge of data-driven control such as deep reinforcement learning, whose heavy computational requirements often necessitate outsourcing to a third-party server. To achieve more flexibility than Partially Homomorphic Encryption (PHE) and less computational overhead than Fully Homomorphic Encryption (FHE), we consider a Reinforcement Learning (RL) architecture over Leveled Homomorphic Encryption (LHE). We first show that the impact of the encryption noise under the Cheon-Kim-Kim-Song (CKKS) encryption scheme on the convergence of the model-based tabular Value Iteration (VI) can be analytically bounded. We also consider secure implementations of TD(0), SARSA(0), and Z-learning algorithms over the CKKS scheme, where we numerically demonstrate that the effects of the encryption noise on these algorithms are also minimal.

I INTRODUCTION

The growing demand for applications such as smart grids [1] and smart cities [2] underscores the important role of data usage and connectivity via advanced data-driven algorithms. However, the utility of advanced data-driven algorithms may be limited in many real-world control systems since components within the network are often resource-constrained. Cloud-based control can be an appealing solution in such scenarios, although naive outsourcing comes with a steep cost of privacy. Recent literature in control has shown that the use of HE can mitigate the privacy issue to a certain extent. However, many challenges remain in encrypted control technologies. For instance, in the current encrypted control literature, the potential of FHE is not fully utilized, even though FHE would be necessary to encrypt advanced control algorithms such as deep reinforcement learning [3]. The RL framework has recently accomplished impressive feats, with successful applications ranging from AlphaGo Zero to traffic signal control, robot control, and many others [4, 5, 6]. Its success is also bolstered by the availability of large-scale data and connectivity. Thus, we wish to study a cloud-based control synthesis architecture that integrates two promising technologies, namely HE and RL. In particular, we examine the effects of using HE on RL both theoretically and numerically.

The first applications of HE to control systems utilized PHE, since its relatively low computational overhead was suitable for real-time implementations. On the other hand, FHE or LHE can support more general classes of computations in the ciphertext domain, although their computational overhead makes them less suitable for real-time systems. In this regard, we first observe that encrypted computation is better suited to control synthesis problems than to control implementations. For instance, explicit model predictive control (MPC) or RL problems require heavy computations or a large set of data to synthesize the control policy, but implementing such a policy can be kept local, as it is often a relatively light computation. In addition, the real-time requirement is less stringent in control law synthesis than in control law implementation.

I-A Related Work

Various HE schemes were applied to linear controllers in [7, 8, 9]. The importance of quantization and of using a non-deterministic encryption scheme was identified in [7]. Necessary conditions on the encryption parameters for closed-loop stability were given in [8]. The feasibility of FHE in encrypted control was first shown in [9] by managing multiple controllers. Evaluation of the affine control law in explicit MPC was shown in [10]. In [11], encrypted implicit MPC was shown to be feasible, but the cost of privacy in the form of increased complexity was identified. A real-time proximal gradient method was used in [12] to treat encrypted implicit MPC, revealing the undesired computational load of the encrypted control system. Privacy and performance degradation were further highlighted in [13] through experiments on motion control systems. A dynamic controller was encrypted in [14] using FHE by exploiting the stability of the system, without relying on bootstrapping. Despite significant progress, encrypted control has been limited to simple computations for which the motivation for cloud computing is questionable.

I-B Contribution

To study the feasibility of RL over HE, this paper makes the following contributions. First, as a step towards more advanced problems of private RL, we formally study the convergence of encrypted tabular VI. We show that the impact of the encryption-induced noise can be made negligible if the Q-factor is decrypted in each iteration. Second, we present implementation results of temporal difference (TD) learning (namely, TD(0), SARSA(0), and Z-learning [15]) over the CKKS encryption scheme and compare their performance with the un-encrypted cases. Although a formal performance analysis for this class of algorithms is difficult in the presence of encryption noise, we show numerically that these RL schemes can be implemented accurately over LHE.

I-C Preview

In Section II, we summarize the relevant essentials of RL and HE. In Section III, we set up the encrypted control synthesis problem and specialize it to encrypted model-based and model-free RL algorithms. We analyze how the encryption-induced noise influences the convergence of standard VI. In Section IV, we perform simulation studies to demonstrate encrypted model-free RL over HE. Finally, in Section V, we summarize our contributions and discuss future research directions.

II Preliminaries

II-A Reinforcement Learning

RL can be formalized through the finite-state Markov decision process (MDP). We define a finite MDP as a tuple $(\mathcal{S},\mathcal{A},P,R)$, where $\mathcal{S}=\{1,2,\dots,n\}$ is a finite state space, $\mathcal{A}$ is a finite action space, $P=P(s_{t+1}|s_t,a_t)$ is a Markov state transition probability, and $R=R(s_t,a_t,s_{t+1})$ is a reward function for the state transition. A policy can be formalized via a sequence of stochastic kernels $\pi=(\pi_0,\pi_1,\dots)$, where $\pi_t(a_t|s_{0:t},a_{0:t-1})$ maps the history $(s_{0:t},a_{0:t-1})$, including the current state $s_t\in\mathcal{S}$, to the probability of selecting the action $a_t\in\mathcal{A}$.

We consider a discounted problem with a discount factor $\gamma\in[0,1)$ throughout this paper. Under a stationary policy $\pi$, Bellman's optimality equation can be defined through a recursive operator $T$ applied to the initial value vector $V^{\pi}_{0}$ of the form $V^{\pi}=(V^{\pi}(1),\dots,V^{\pi}(n))$:

$(TV^{\pi})(s_t)=\max_{a_t}\sum_{s_{t+1}}P[R+\gamma V^{\pi}(s_{t+1})].$   (1)

Throughout this paper, we adopt the conventional definition of the maximum norm $\|\cdot\|_{\infty}$. The following results are standard [16].

Lemma 1
(a) For any vectors $V$ and $\bar{V}$, we have
$\|TV-T\bar{V}\|_{\infty}\leq\gamma\|V-\bar{V}\|_{\infty}.$

(b) The optimal value vector $V^{*}$ is the unique solution to the equation $V^{*}=TV^{*}$.

(c) We have $\lim_{k\to\infty}T^{k}V=V^{*}$ for any vector $V$.

If some policy $\pi^{*}$ attains $V^{*}$, we call $\pi^{*}$ the optimal policy; the goal is to find such an optimal policy for the given MDP environment.

Often the state values are conveniently written as a function of state-action pairs, called $Q$-values:

$Q^{\pi}(s_t,a_t)=\sum_{s_{t+1}}P[R+\gamma V^{\pi}(s_{t+1})],$   (2)

and, for all states $s_t$, $V^{\pi}_{k+1}(s_t)=\max_{a_t\in\mathcal{A}}Q^{\pi}(s_t,a_t)$.

RL is a class of algorithms that solve MDPs when the model of the environment (e.g., $P$ and/or $R$) is unknown. RL can be further classified into two groups: model-based RL and model-free RL [17]. A simple model-based RL method first estimates the transition probability and reward function empirically (to build an artificial model) and then applies VI to solve the MDP. Model-free RL, on the other hand, directly computes the value function by averaging over episodes (Monte Carlo methods) or by estimating the values iteratively, as in TD learning.

TD(0) is one of the simplest TD learning algorithms, where the tuple $(s,R,s^{\prime})$ is sampled from the one-step-ahead observation [18]. The on-policy iterative update rule for estimating the value, called TD(0), is written as follows:

$\hat{V}_{t+1}^{\pi}(s)=\hat{V}_{t}^{\pi}(s)+\alpha_{t}(s)\delta_{t}.$   (3)

Here $\delta_{t}$ denotes the TD error and is computed from the sample as

$\delta_{t}=R(s,a)+\gamma\hat{V}_{t}^{\pi}(s^{\prime})-\hat{V}_{t}^{\pi}(s),$

starting with an initial guess $\hat{V}^{\pi}_{0}$. We use $\hat{V}$ to indicate that the values are estimated. Under standard assumptions on the step size $\alpha_{t}(s)$ and on exploring policies, the convergence of TD(0) is standard in the literature; see [19].
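To make the update (3) concrete, the following minimal Python sketch performs one TD(0) update for a single sampled transition; the state indices, reward, and step size are illustrative placeholders rather than quantities taken from the experiments later in the paper.

```python
import numpy as np

def td0_update(V_hat, s, s_next, reward, alpha, gamma=0.9):
    """One TD(0) update of the tabular value estimate, as in equation (3)."""
    td_error = reward + gamma * V_hat[s_next] - V_hat[s]   # delta_t
    V_hat[s] += alpha * td_error
    return V_hat

# Illustrative usage on a 5-state chain with a made-up transition.
V_hat = np.zeros(5)                                        # initial guess \hat{V}_0^pi
V_hat = td0_update(V_hat, s=2, s_next=3, reward=1.0, alpha=0.1)
```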

II-B Homomorphic Encryption

HE is a structure-preserving mapping between the plaintext domain $\mathcal{P}$ and the ciphertext domain $\mathcal{C}$. Thus, encrypted data in $\mathcal{C}$, together with a function $f$ evaluated on $\mathcal{C}$, can be outsourced to the cloud for confidential computations.

PHE supports only addition or only multiplication. On the other hand, LHE can support both addition and multiplication, but the noise growth under ciphertext multiplication is significant without bootstrapping [20]. The bootstrapping operation can promote an LHE scheme to FHE, enabling an unlimited number of multiplications, but it requires heavy computational resources on its own.

In order to use both the addition and the multiplication necessary for evaluating RL algorithms, we use an LHE scheme. In particular, we employ the CKKS encryption scheme proposed in [21], as it is well suited for engineering applications and also supports parallel computations. Moreover, there exists a widely available library optimized for implementations. We only list and describe several key operations and properties here; more detailed information is found in the Appendix. We are particularly concerned with the fact that CKKS encryption is noisy (due to error injection for security), but if properly implemented, the noise is manageable.

The security of the CKKS scheme relies on the difficulty of the Ring Learning With Errors (RLWE) problem. The security parameters are a power-of-two integer $\mathcal{N}$, a ciphertext modulus $\mathcal{Q}$, and the variance $\sigma$ used for drawing errors from a distribution. The operation KeyGen uses the security parameters $(\mathcal{N},\mathcal{Q},\sigma)$ to create a public key pk, a secret key sk, and an evaluation key evk for a security level of $\lambda$ bits.
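The experiments in this paper use the SEAL C++ library; purely as an illustration of the KeyGen/Enc/Dec workflow described above, the sketch below uses the TenSEAL Python bindings, whose API calls and parameter values are our assumptions rather than the authors' implementation.

```python
# A minimal sketch of the CKKS workflow, assuming the TenSEAL Python bindings
# are available (pip install tenseal); this is not the SEAL C++ code used in the paper.
import tenseal as ts

# KeyGen: N = 8192 and a chain of coefficient-modulus primes; the number of
# interior 40-bit primes sets the multiplicative depth (two levels here).
ctx = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
ctx.global_scale = 2 ** 40        # CKKS encoding scale for real numbers

# Enc: up to N/2 real values can be packed into a single ciphertext.
c = ts.ckks_vector(ctx, [0.0, 0.25, -1.3])

# Dec: recovers the values up to a small encryption-induced error.
print(c.decrypt())                # approximately [0.0, 0.25, -1.3]
```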

III Encrypted learning over the cloud

III-A Cloud-based reinforcement learning

Our control loop consists of the plant (environment), the controller (client), and the cloud server. We are concerned with a possible data breach at the cloud. In order to prevent the privacy of our data, such as the training data set and other parameters at play, from being compromised, we add a homomorphic encryption module as shown in Fig. 1. In particular, we ignore the malleability risks [22] inherent to the considered HE scheme, which are beyond the scope of this paper. In our context, $Syn(Enc(\cdot))$ can be thought of as a ciphertext-domain implementation of RL algorithms; more generally, $Syn(Enc(\cdot))$ is any control synthesis procedure that can be evaluated remotely. Ideal uses for $Syn(Enc(\cdot))$ would include advanced data-driven control synthesis procedures such as MPC or deep RL. We emphasize that the encryption adds communication delays to the control policy synthesis but does not add an extra computation cost to the control implementation. For RL, the controller's task is to implement the most up-to-date state-action map $\pi$ that was most recently synthesized, which can be done locally.

We assume a tabular representation of the state and action pairs. The client explores its plant and samples the information set $h_t=(s_{t+1},a_t,r_t)$. Other necessary parameters, such as the learning rate $\alpha_t(s)$ and $\gamma$, are grouped as $\theta$, denoting the hyperparameters. This set of data is then encrypted by the CKKS encryption module to generate ciphertexts $\textbf{c}_v$, $\textbf{c}_{v^{\prime}}$, $\textbf{c}_{\alpha}$, $\textbf{c}_{\gamma}$, and $\textbf{c}_r$, which are encrypted data for $\hat{V}_t^{\pi}(s)$, $\hat{V}_t^{\pi}(s^{\prime})$, $\alpha_t(s)$, $\gamma$, and $R(s,a)$, and will be used to implement (3). Note that the subscripts refer to the values recovered when these ciphertexts are decrypted. The cloud is instructed on how to evaluate the requested algorithms on ciphertexts. One may be concerned about the cloud knowing the form of the algorithm, but this can be resolved by circuit privacy; that is, we can hide the actual computation steps by adding zeros or multiplying by ones in the ciphertext version of the original algorithm. We also assume that the cloud is semi-honest, so that it faithfully performs the algorithm as instructed. The output of the cloud is a newly synthesized policy $\pi_t(s_{0:t})$, which, after decryption, can be accessed in real time by the client to produce the control action $a_t(s_t)$.

Figure 1: Encrypted RL Control synthesis with cloud in the loop.

As a first step towards investigating more sophisticated RL algorithms, we specialize the proposed framework to solving RL problems in finite MDPs with more elementary methods. In particular, we consider basic model-based RL and model-free RL using tabular algorithms. For the model-based case, we theoretically analyze the convergence of VI under the presence of encryption-induced noise. For the model-free case, we implement encrypted TD algorithms to estimate the values and investigate how the encryption noise affects the output.

III-B Encrypted Model-based RL

The client is assumed to explore the plant, sample the information sets $h_t$, and build the model first. A simple model of the state-transition probability has the form

$P(s_{t+1}|s_t,a_t)=\frac{N(s,a,s^{\prime})}{N(s,a)},$   (4)

where $N(s,a,s^{\prime})$ denotes the number of visits to the triplet $(s,a,s^{\prime})$ and $N(s,a)$ the number of visits to the pair $(s,a)$, with $s,s^{\prime}\in\mathcal{S}$, $a\in\mathcal{A}$. Similarly, the reward function $R(s_t,a_t,s_{t+1})$ can be quantified by the average reward accumulated per state-action pair.
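A minimal sketch of this model-building step is given below; the data layout (a list of sampled (s, a, r, s') tuples) is an illustrative assumption.

```python
import numpy as np

def build_model(transitions, n_states, n_actions):
    """Estimate P(s'|s,a) as in (4) and an average reward from sampled (s, a, r, s') tuples."""
    N_sas = np.zeros((n_states, n_actions, n_states))    # visit counts N(s, a, s')
    R_sum = np.zeros((n_states, n_actions, n_states))    # accumulated rewards per triplet
    for s, a, r, s_next in transitions:
        N_sas[s, a, s_next] += 1
        R_sum[s, a, s_next] += r
    N_sa = N_sas.sum(axis=2, keepdims=True)              # N(s, a)
    P_hat = np.divide(N_sas, N_sa, out=np.zeros_like(N_sas), where=N_sa > 0)
    R_hat = np.divide(R_sum, np.maximum(N_sas, 1))       # average reward per visited triplet
    return P_hat, R_hat
```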

Then, the client can encrypt the initial values of the states along with the model $P$ and $R$ and the discount factor $\gamma$, and request the cloud to perform the computation of (2) over the ciphertext domain. Since comparison over HE is non-trivial, the max operation is difficult to implement homomorphically. Thus, the cloud computes the $Q$ values, and the controller receives the decrypted $Q$ values and completes the VI step for iteration index $k$:

$\tilde{V}^{\pi}_{k+1}(s_t)=\max_{a_t\in\mathcal{A}}\tilde{Q}^{\pi}(s_t,a_t).$   (5)

However, note that the $Q$ values computed by the cloud will be noisy due to the encryption; hence we use $\tilde{Q}$ and $\tilde{V}$ to denote the noisy values. Let $w(s_t,a_t)$ be the encryption-induced noise produced by computations over HE. Then, the noisy $Q$ values can be written as

$\tilde{Q}^{\pi}(s_t,a_t)=Q^{\pi}(s_t,a_t)+w(s_t,a_t),$   (6)

where the noise term is bounded such that $|w(s_t,a_t)|\leq\epsilon$ for all $s_t,a_t$ and some $\epsilon>0$. The Appendix shows how $\epsilon$ depends on the encryption parameters and the operations used to evaluate the given algorithm.

We now analyze the discrepancy between the two sequences of vectors $V^{\pi}_k$ and $\tilde{V}^{\pi}_k$. We separate the analysis into the synchronous and the asynchronous cases.

III-B1 Synchronous VI

Synchronous VI means that the VI update is applied to all states $s$ simultaneously. First, note that $V^{\pi}_k$ is computed by the noiseless VI

$V^{\pi}_{k+1}(s_t)=\max_{a_t\in\mathcal{A}}\sum_{s_{t+1}}P[R+\gamma V^{\pi}_{k}(s_{t+1})]=(TV^{\pi}_{k})(s_t)\quad\forall s_t\in\mathcal{S},$   (7)

and $\tilde{V}^{\pi}_k$ is computed by the noisy VI

$\tilde{V}^{\pi}_{k+1}(s_t)=\max_{a_t\in\mathcal{A}}\sum_{s_{t+1}}P[R+\gamma\tilde{V}^{\pi}_{k}(s_{t+1})+w_k(s_t,a_t)].$   (8)

We will utilize the following simple lemma, whose proof is straightforward and hence omitted.

Lemma 2

For any vectors $x=(x(1),\dots,x(n))$ and $w=(w(1),\dots,w(n))$ such that $\|w\|_{\infty}\leq\epsilon$,

$\max_i\,(x(i)+w(i))=\max_i\,x(i)+\tilde{w}$   (9)

for some constant $\tilde{w}$ satisfying $|\tilde{w}|\leq\epsilon$.

By applying Lemma 2 to the noisy VI (8), we obtain

$\tilde{V}^{\pi}_{k+1}(s_t)=\max_{a_t\in\mathcal{A}}\sum_{s_{t+1}}P[R+\gamma\tilde{V}^{\pi}_{k}(s_{t+1})+w_k(s_t,a_t)]$
$\quad=\max_{a_t\in\mathcal{A}}\sum_{s_{t+1}}P[R+\gamma\tilde{V}^{\pi}_{k}(s_{t+1})]+\tilde{w}_k(s_t)$
$\quad=(T\tilde{V}^{\pi}_{k})(s_t)+\tilde{w}_k(s_t),$   (10)

where the vector $\tilde{w}_k$ satisfies $\|\tilde{w}_k\|_{\infty}\leq\epsilon$. In vector form, (10) can be written as

$\tilde{V}^{\pi}_{k+1}=T\tilde{V}^{\pi}_{k}+\tilde{w}_k=\tilde{T}\tilde{V}^{\pi}_{k}.$   (11)

We can now quantify the worst-case performance degradation due to the encryption-induced noise. The result is summarized in the next Theorem.

Theorem 1

(Approximate VI, [16]) Let $V^{*}$ be the optimal value function characterized in Lemma 1(b). Suppose $\tilde{V}^{\pi}_k$ is the sequence of vectors computed by the noisy VI (8). For an arbitrary initial condition $\tilde{V}^{\pi}_0$, we have

$\limsup_{k\rightarrow\infty}\|V^{*}-\tilde{V}^{\pi}_{k}\|_{\infty}\leq\frac{\epsilon}{1-\gamma}.$
Proof:

Let $V^{\pi}_k$ be the sequence of vectors computed by the noiseless VI (7), with arbitrary initial conditions $V^{\pi}_0$ and $\tilde{V}^{\pi}_0$. Then,

$\|V^{\pi}_{k+1}-\tilde{V}^{\pi}_{k+1}\|_{\infty}=\|TV^{\pi}_{k}-T\tilde{V}^{\pi}_{k}-\tilde{w}_{k}\|_{\infty}$   (12a)
$\quad\leq\|TV^{\pi}_{k}-T\tilde{V}^{\pi}_{k}\|_{\infty}+\|\tilde{w}_{k}\|_{\infty}$   (12b)
$\quad\leq\|TV^{\pi}_{k}-T\tilde{V}^{\pi}_{k}\|_{\infty}+\epsilon$   (12c)
$\quad\leq\gamma\|V^{\pi}_{k}-\tilde{V}^{\pi}_{k}\|_{\infty}+\epsilon,$   (12d)

where the equality (12a) follows from (7) and (11), the inequality (12b) follows from the triangle inequality, (12c) follows from the bound on $\tilde{w}_k$, and the last inequality (12d) is due to Lemma 1(a). Now define a sequence $e_k$ of positive numbers by

$e_{k+1}=\gamma e_k+\epsilon$   (13)

with $e_0=\|V^{\pi}_0-\tilde{V}^{\pi}_0\|_{\infty}$. By the geometric series,

$\limsup_{k\rightarrow\infty}e_k=\frac{\epsilon}{1-\gamma}.$

Comparing (12) and (13), we have by induction

$\|V^{\pi}_k-\tilde{V}^{\pi}_k\|_{\infty}\leq e_k,\quad\forall k=0,1,2,\dots.$

Therefore,

$\limsup_{k\rightarrow\infty}\|V^{\pi}_k-\tilde{V}^{\pi}_k\|_{\infty}\leq\frac{\epsilon}{1-\gamma}.$   (14)

Since (14) holds for any $V^{\pi}_0$, we can pick $V^{\pi}_0=V^{*}$, for which $V^{\pi}_k=T^{k}V^{*}=V^{*}$ for $k=0,1,2,\dots$ by Lemma 1(b). This yields the desired result. ∎
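The bound of Theorem 1 can be checked numerically with the sketch below, in which the encryption noise is modeled as a bounded perturbation drawn at each iteration and the MDP is randomly generated for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, gamma, eps = 6, 3, 0.9, 1e-6                 # eps models the bound on the HE noise w

# Random MDP for illustration only (rows of P sum to one).
P = rng.random((n_s, n_a, n_s)); P /= P.sum(axis=2, keepdims=True)
R = rng.random((n_s, n_a))

def bellman(V):
    # (TV)(s) = max_a sum_{s'} P [R + gamma V(s')]
    return np.max(R + gamma * (P @ V), axis=1)

V_star = np.zeros(n_s)                                 # noiseless VI, approximating V*
for _ in range(2000):
    V_star = bellman(V_star)

V_tilde = np.zeros(n_s)                                # noisy VI as in (8)
for _ in range(2000):
    Q_tilde = R + gamma * (P @ V_tilde) + rng.uniform(-eps, eps, (n_s, n_a))
    V_tilde = np.max(Q_tilde, axis=1)

assert np.max(np.abs(V_star - V_tilde)) <= eps / (1 - gamma) + 1e-9   # Theorem 1 bound
```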

III-B2 Asynchronous VI

It is often necessary or beneficial to run the VI algorithm asynchronously, i.e., state by state, in simulations. We define the asynchronous noiseless and noisy VI with the new mappings $F$ and $\tilde{F}$, respectively:

$(FV^{\pi}_k)(s)=\begin{cases}(TV^{\pi}_k)(s) & \text{if } s=s_t,\\ V^{\pi}_k(s) & \text{otherwise},\end{cases}$   (15)
$(\tilde{F}\tilde{V}^{\pi}_k)(s)=\begin{cases}(\tilde{T}\tilde{V}^{\pi}_k)(s) & \text{if } s=s_t,\\ \tilde{V}^{\pi}_k(s) & \text{otherwise}.\end{cases}$   (16)

In each case, we make the following assumption.

Assumption 1
(a) Each state is visited for updates infinitely often.

(b) There exists a finite constant $M$ that is greater than or equal to the number of updates needed to sweep through every state at least once.

Assumption 1 is essential for the following theorem, as it ensures that the mappings $F$ and $\tilde{F}$ are contraction operators as long as all states are visited at least once and the time it takes to visit all states at least once is finite. A common approach to ensuring this assumption is to incorporate exploring actions, as in an $\epsilon$-greedy policy.

Now, define the iteration sub-sequence $\{k_n\}_{n=0,1,2,\dots}$ such that $k_0=0$ and each state is visited at least once between $k_n$ and $k_{n+1}$. For instance, consider a finite MDP with 4 states denoted $s_1,s_2,s_3,s_4$. If the state trajectory is $(s_1,s_2,s_4,s_3,s_1,s_1,s_2,s_4,s_3,s_1,s_1,\dots)$ for $t=(1,2,3,4,5,6,7,8,9,10,11,\dots)$, then the resulting sub-sequence is $k_0=0$, $k_1=4$, $k_2=9,\dots$.
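The sub-sequence can be extracted from a recorded visit trajectory with the short sketch below, which reproduces $k_0=0$, $k_1=4$, $k_2=9$ for the example above (the trajectory is indexed from $t=1$, as in the text).

```python
def visit_subsequence(trajectory, states):
    """Return k_0 = 0 and the times by which every state has been visited
    at least once since the previous k_n (trajectory indexed from t = 1)."""
    ks, seen = [0], set()
    for t, s in enumerate(trajectory, start=1):
        seen.add(s)
        if seen == set(states):
            ks.append(t)
            seen = set()
    return ks

traj = [1, 2, 4, 3, 1, 1, 2, 4, 3, 1, 1]
print(visit_subsequence(traj, states={1, 2, 3, 4}))   # [0, 4, 9]
```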

Theorem 2

Let $V^{*}$ be the optimal value function as defined previously. Suppose $\tilde{V}^{\pi}_{k_n}$ is the sequence of vectors computed by the asynchronous noisy VI (16) under Assumption 1. For an arbitrary initial condition $\tilde{V}^{\pi}_0$, we have

$\limsup_{n\rightarrow\infty}\|V^{*}-\tilde{V}^{\pi}_{k_n}\|_{\infty}\leq\frac{M\epsilon}{1-\gamma}.$
Proof:

By the definition of the mappings $F$ and $\tilde{F}$, we can write

$\|V^{\pi}_{k_{n+1}}-\tilde{V}^{\pi}_{k_{n+1}}\|_{\infty}=\|F^{k_{n+1}-k_n}V^{\pi}_{k_n}-\tilde{F}^{k_{n+1}-k_n}\tilde{V}^{\pi}_{k_n}\|_{\infty}$
$\quad\leq\|F^{k_{n+1}-k_n}V^{\pi}_{k_n}-F^{k_{n+1}-k_n}\tilde{V}^{\pi}_{k_n}\|_{\infty}+(k_{n+1}-k_n)\epsilon.$   (17)

Similar to the proof of Theorem 1, the inequality follows from the bound on the error vector and the triangle inequality. We then note that, by Assumption 1, the composed mapping $F^{k_{n+1}-k_n}$ is a contraction with modulus $\gamma$ with respect to the sub-sequence index $k_n$. Thus, we have

$\|V^{\pi}_{k_{n+1}}-\tilde{V}^{\pi}_{k_{n+1}}\|_{\infty}\leq\gamma\|V^{\pi}_{k_n}-\tilde{V}^{\pi}_{k_n}\|_{\infty}+(k_{n+1}-k_n)\epsilon.$

By forming a sequence in $n$ as in the previous proof, and using Assumption 1(b) to bound $k_{n+1}-k_n\leq M$, this yields

$\limsup_{n\rightarrow\infty}\|V^{\pi}_{k_n}-\tilde{V}^{\pi}_{k_n}\|_{\infty}\leq\frac{M\epsilon}{1-\gamma}.$   (18)

Expanding the left-hand side and applying the reverse triangle inequality gives

$\|V^{\pi}_{k_n}-\tilde{V}^{\pi}_{k_n}\|_{\infty}=\|V^{\pi}_{k_n}-V^{*}-(\tilde{V}^{\pi}_{k_n}-V^{*})\|_{\infty}\geq\|\tilde{V}^{\pi}_{k_n}-V^{*}\|_{\infty}-\|V^{\pi}_{k_n}-V^{*}\|_{\infty},$   (19)

where the second term on the right-hand side approaches zero in the limit. Since the left-hand side is bounded in the limit by (18), we obtain the desired result. ∎

Theorem 2 assures that the encrypted VI outsourced to the cloud also guarantees a comparable performance asymptotically.

III-C Encrypted Model-free RL

Consider again the TD(0) update rule (3). The cloud receives the set of ciphertexts $\{\textbf{c}_v,\textbf{c}_{v^{\prime}},\textbf{c}_{\alpha},\textbf{c}_{\gamma},\textbf{c}_r\}$. Then, the update rule in the ciphertext domain becomes

$\textbf{c}_v(t+1)=\textbf{c}_v(t)+\textbf{c}_{\alpha}\cdot\textbf{c}_r+\textbf{c}_{\alpha}\cdot\textbf{c}_{\gamma}\cdot\textbf{c}_{v^{\prime}}(t)-\textbf{c}_{\alpha}\cdot\textbf{c}_v(t).$   (20)

Upon decryption of $\textbf{c}_v(t+1)$, we must remember that the computed value is corrupted with noise. A similar error analysis for the TD algorithms could in principle be performed (perhaps using stochastic approximation theory). However, a formal analysis presents some difficulties: the conventional theory of stochastic approximation with exogenous noise, as in [16], requires the noise to be bounded and to approach zero in the limit, whereas the encryption-induced noise satisfies only the boundedness condition. We therefore only provide implementation results at this time to gain some insight, and we hope to treat this case rigorously in future work. A simple way to preview the effect of (20) numerically is sketched below.
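The sketch below is a plain-arithmetic stand-in for the ciphertext update (20): every encryption and homomorphic operation is modeled by a small bounded perturbation, whose magnitude is chosen arbitrarily for illustration rather than derived from the CKKS parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
NOISE = 1e-7   # stand-in for a per-operation CKKS error bound (illustrative value only)

def noisy(x):
    """Model an encryption / homomorphic-operation error as a bounded perturbation."""
    return x + rng.uniform(-NOISE, NOISE)

def encrypted_td0_update(v, v_next, alpha, gamma, r):
    """Stand-in for (20): c_v <- c_v + c_alpha*c_r + c_alpha*c_gamma*c_v' - c_alpha*c_v."""
    return noisy(noisy(v) + noisy(alpha * r) + noisy(alpha * gamma * v_next) - noisy(alpha * v))

# One exact TD(0) update versus its "encrypted" counterpart.
v, v_next, alpha, gamma, r = 0.3, 0.5, 0.1, 0.9, 1.0
exact = v + alpha * (r + gamma * v_next - v)
print(abs(exact - encrypted_td0_update(v, v_next, alpha, gamma, r)))   # on the order of NOISE
```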

IV Simulation

We present implementation results of several model-free RL algorithms (TD(0), SARSA(0), and Z-learning) over the CKKS scheme. The environment is a grid world with state-space size $|\mathcal{S}|=36$ for all three algorithms and action-space size $|\mathcal{A}|=9$ for the first two. The available actions are up, up-right, right, down-right, down, down-left, left, up-left, and stay. The rewards are fixed at randomly drawn real numbers to simulate an unknown environment. The client starts with a policy $\pi$ at the box coordinate (1,1), top-left, and the grid world has three trap states and one goal state, marked by the letters T and G, which terminate the current episode. In each episode, we also set a maximum number of steps at which the current episode terminates. The learning parameter set $\theta$ consists of the discount factor, the learning rate, and the exploration percentage. As soon as the new data set $h_t$, containing the reward (or cost) and the values of states $s$ and $s^{\prime}$, becomes available, the client uploads the encrypted data to the cloud.

Choosing encryption parameters is not straightforward, but there exists a standardization effort [23]. Moreover, the open-source library SEAL [24] provides a practical tutorial and accessible tools, along with an encryption parameter generator. We use the default 128-bit-security ($\lambda=128$) encryption parameters $(\mathcal{N},\mathcal{Q},\sigma)$ generated by SEAL from the user input of $\mathcal{N}$. These are listed in Table I. The size of $\mathcal{N}$ need not be this large, as no parallel operation is exploited in the particular example application considered in this paper. However, future applications such as multi-agent RL or deep RL can find such capability useful, as they contain many batch operations.

The CKKS scheme without bootstrapping allows only a predetermined depth of multiplications. Thus, for interested users, it is important to note the largest depth of ciphertext multiplications needed to evaluate the algorithm at hand. For example, the TD(0) update rule in (20) requires at most the product $\textbf{c}_{\alpha}\cdot\textbf{c}_{\gamma}\cdot\textbf{c}_{v^{\prime}}(t)$, i.e., a multiplicative depth of two. This should be factored into the design of the encryption parameters.

To examine the effect of the encryption noise, we created two tables: one keeps track of the values from un-encrypted updates and the other keeps track of the values updated over HE. We recorded the error between the two values at each iteration. At the final stages of learning, these errors were confirmed to be bounded by a constant of very small magnitude. Although a formal convergence analysis of model-free algorithms such as TD(0) is not provided in this paper, the simulation results suggest that they can be performed in the encrypted domain equally well. A formal analysis of these algorithms, building on the analysis above, is left for future research.

TABLE I: Encryption Parameters
Param.            TD(0)                      SARSA(0)                   Z-learning
$\mathcal{N}$     8192                       8192                       16384
$\mathcal{Q}$     219                        219                        441
$\sigma$          $\frac{8}{\sqrt{2\pi}}$    $\frac{8}{\sqrt{2\pi}}$    $\frac{8}{\sqrt{2\pi}}$

IV-A Prediction: TD(0)

We implement a GLIE (greedy in the limit with infinite exploration) learning policy as in [25], where the client starts completely exploratory ($\varepsilon=1.00$) and slowly becomes greedy ($\varepsilon=0.00$) as the episodes accumulate. The learning rate is set to 0 for non-visited states; for visited states, we set $\alpha(s,t)=\frac{500}{500+n(s,t)}$, with $n(s,t)$ counting the number of visits to state $s$ up to time $t$, to satisfy the standard learning-rate assumptions. The discount factor is $\gamma=0.9$. One simple sketch of these schedules is given below. The *RULE* for TD(0) is the right-hand side of equation (20).
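The schedules below are one simple realization consistent with the description above; the linear decay of $\varepsilon$ over episodes is our illustrative choice, not necessarily the exact schedule used in the experiments.

```python
def epsilon_schedule(episode, n_episodes):
    """Linearly decaying exploration rate: fully exploratory at the start,
    greedy in the limit (one simple GLIE-style choice)."""
    return max(0.0, 1.0 - episode / n_episodes)

def learning_rate(n_visits):
    """alpha(s, t) = 500 / (500 + n(s, t)) for visited states, 0 otherwise."""
    return 500.0 / (500.0 + n_visits) if n_visits > 0 else 0.0
```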

Algorithm 1 Encrypted TD Learning

Client (Start)

1:Perform an action $a$ and a state transition $s\rightarrow s^{\prime}$ under the policy $\pi_t$ to earn a reward $r$.
2:Collect the transition data set $h_t$. Index values from the current table using $h_t$.
3:Encode the values (Q-values if SARSA(0); Z-values if Z-learning) and the set $\theta$ into the encoded messages $V_t$ and $\Theta$, respectively.
4:Encrypt $V_t$ and $\Theta$ to get $\tilde{V}$ and $\tilde{\Theta}$ and upload them to the cloud.

Cloud

1:Extract ciphertexts from $\tilde{V}$ and $\tilde{\Theta}$. Example: extract $\{\textbf{c}_z,\textbf{c}_{z^{\prime}},\textbf{c}_{\alpha},\textbf{c}_l,\textbf{c}_{l/2}\}$ (Z-learning).
2:Update: use the *RULE* with the extracted ciphertexts.
3:Upload the result of the *RULE*, denoted by the ciphertext c*, back to the Client.

Client

1:Decrypt the ciphertext c* to get the updated $\tilde{H}$.
2:Decode $\tilde{H}$ to get the newly synthesized table of values.
3:Update the policy $\pi_t$ if necessary.
4:Go to Start.

IV-B Control: SARSA(0)

The update rule for SARSA(0) is:

$\hat{Q}_{t+1}(s,a)=\hat{Q}_t(s,a)+\alpha_t(s,a)\,\delta_t^{S},$   (21)

where $\delta_t^{S}=r(s,a)+\gamma\hat{Q}_t(s^{\prime},a^{\prime})-\hat{Q}_t(s,a)$.

The policy for SARSA(0) is the same GLIE-type learning policy used for TD(0). The learning rate $\alpha(s,t)$ and the discount factor $\gamma$ are unchanged. The *RULE* for SARSA(0) is the right-hand side of equation (20) after substituting $\textbf{c}_v$ and $\textbf{c}_{v^{\prime}}$ with $\textbf{c}_Q$ and $\textbf{c}_{Q^{\prime}}$. A plaintext sketch of the corresponding update is given below.
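For reference, the sketch below performs one update of the form (21); the grid-world dimensions are those of Section IV, while the particular indices and reward are placeholders.

```python
import numpy as np

def sarsa0_update(Q_hat, s, a, r, s_next, a_next, alpha, gamma=0.9):
    """One SARSA(0) update of the tabular Q estimate, as in equation (21)."""
    td_error = r + gamma * Q_hat[s_next, a_next] - Q_hat[s, a]   # delta_t^S
    Q_hat[s, a] += alpha * td_error
    return Q_hat

Q_hat = np.zeros((36, 9))   # grid world of Section IV: 36 states, 9 actions
Q_hat = sarsa0_update(Q_hat, s=0, a=2, r=-1.0, s_next=1, a_next=4, alpha=0.1)
```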


Figure 2: State-value map learned over encrypted TD(0); states are numbered from 1 to 36 (left to right and top to bottom), with G (goal state) and T (trap states) marked.


Figure 3: Maximum norm of the encryption errors in TD(0) at each iterate $t$ (top). As seen in the un-encrypted value updates (bottom-left) and the encryption error (bottom-right) for state 1, the value of state 1 increases as value propagates from the goal state, which is reached more often once the policy becomes greedy (see Fig. 2), and its error dominates between iterates $t=30000$ and $t=35000$.


Figure 4: Value map learned over encrypted SARSA(0)


Figure 5: Maximum norm of the encryption errors in SARSA(0) at each iterate $t$ (top). The un-encrypted and decrypted values for each state are computed using $V^{*}=\max_a\hat{Q}(s,a)$. The bottom-left plot shows the sample history of the value $V^{*}(s=15)$ and the associated encryption error $\hat{V}_t^{\pi}(s=15)-\tilde{V}_t^{\pi}(s=15)$.


Figure 6: Z value map learned over encrypted Z-learning


Figure 7: Maximum norm of the encryption errors in Z-learning at each iterate $t$ (top). State 30 converges to the highest Z value, as seen in Fig. 6. The un-encrypted Z-value updates show more variance than their TD(0) counterparts due to constant exploration. The encryption error for state 30 is the largest until about $t=22000$, as can be seen in the maximum norm, but it remains bounded by a very small number.

IV-C Control: Z-Learning

An off-policy learning method called Z-learning was proposed in [15]. It is based on the key observation that the control action can be regarded as an effort to change the passive state-transition dynamics. The update rule for Z-learning is:

$\hat{Z}_{t+1}(s)=\hat{Z}_t(s)+\alpha_t(s)\,\delta_t^{Z},$   (22)

where $\delta_t^{Z}=\exp(-l_t)\hat{Z}_t(s^{\prime})-\hat{Z}_t(s)$. The estimate $Z$ is called the desirability function, and the update uses the cost $l_t$ associated with the unknown state transitions rather than the reward. Z-learning is exploratory by default; we therefore fix the exploration rate at $\varepsilon=1.00$, so the behavior policy explores continuously. The learning rate $\alpha(s,t)$ and the discount factor are unchanged. The *RULE* for Z-learning is obtained by evaluating the right-hand side of equation (22) over ciphertexts. Since evaluating $\exp(-l_t)$ over a ciphertext is not straightforward, we approximate it with a truncated Taylor series, as sketched below.
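The sketch below spells out this update with a second-order Taylor polynomial $1-l+l^{2}/2$ in place of $\exp(-l)$; the truncation degree is our reading of the ciphertexts $\textbf{c}_l$ and $\textbf{c}_{l/2}$ extracted in Algorithm 1, not a degree stated explicitly in the text.

```python
def exp_taylor2(l):
    """Second-order Taylor approximation of exp(-l), evaluable over ciphertexts
    with one extra multiplication: 1 - l + l * (l / 2). Accurate only for small l."""
    return 1.0 - l + l * (l / 2.0)

def z_update(Z_hat, s, s_next, cost, alpha):
    """One Z-learning update, as in equation (22), with the approximated exponential."""
    delta = exp_taylor2(cost) * Z_hat[s_next] - Z_hat[s]   # delta_t^Z
    Z_hat[s] += alpha * delta
    return Z_hat
```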

IV-D Results

We observed that the encryption-induced noise approached small values over time with minimal fluctuations. Moreover, the effect of the encryption-induced noise on the convergence of the values was minimal. This observation complements the analysis of VI in Section III, as the TD algorithms are sampling-based value estimation algorithms.

Encryption and decryption are all done on the client's side, so significant computation time is expected for the client. For this reason, the ideal application will require less frequent uploading and downloading and a more advanced synthesis procedure that operates on a large set of data.

V Conclusion and Future Work

We considered an architecture for confidential cloud-based RL over LHE. For model-based RL, we showed that the impact of the encryption noise on the convergence performance can be analytically bounded. For model-free RL, we implemented TD(0), SARSA(0), and Z-learning and numerically confirmed that the impact of the encryption noise on these algorithms is also minimal. Although the applications considered do not necessarily require the cloud, the framework can be developed to accommodate more advanced synthesis algorithms in the future.

There are numerous directions to extend this paper. First, the effort to derive analytical performance guarantees for encrypted RL (including the model-free schemes considered in this paper) is necessary to prove the utility of the encrypted RL concept. Second, an encrypted RL scheme that does not require periodic decryption (similar to the case in [26]) is highly desired, as the periodic decryption and communication between the cloud and the controller are costly. Finally, more extensive numerical experiments are needed to fully understand the potential of advanced RL (e.g., deep Q learning) over HE. The interplay between computational overhead, delay, accuracy, and security levels must be studied from both theoretical and experimental perspectives.

VI Appendix

VI-A CKKS Encryption Scheme

The CKKS encoding procedure maps a vector of complex numbers of size $\frac{\mathcal{N}}{2}$ to a message $m$ in the plaintext space $\mathcal{P}$. The plaintext space $\mathcal{P}$ is defined as the set $\mathbb{Z}_{\mathcal{Q}}[\mathcal{X}]/(\mathcal{X}^{\mathcal{N}}+1)$, where $\mathbb{Z}_{\mathcal{Q}}[\mathcal{X}]$ denotes the polynomials in $\mathcal{X}$ whose coefficients are integers modulo $\mathcal{Q}$, and $\mathcal{X}^{\mathcal{N}}+1$ denotes the degree-$\mathcal{N}$ cyclotomic polynomial. This allows the CKKS encryption scheme to accept multiple complex-valued inputs of size $\frac{\mathcal{N}}{2}$ at once, which is convenient for many computing applications.

Then, encryption of the message $m\in\mathcal{P}$ yields the ciphertext $\textbf{c}\in\mathcal{C}$:

$Enc_{pk}(m)=\textbf{c}.$   (23)

Each ciphertext is designated with a level $L$, which indicates how many multiplications can be performed before decryption fails. This is a key limitation in comparison to FHE.

The encryption also creates a noise polynomial $e$. A properly encrypted ciphertext has a bound on the message, $\|m\|_{\infty}\leq p$, and on the noise, $\|e\|_{\infty}\leq B$, where the norm is defined on those polynomials. Thus, a CKKS ciphertext can be described by the tuple $\textbf{c}=(c,L,p,B)$. Decryption can then be defined as follows:

$Dec(\textbf{c},sk)\equiv m+e\pmod{q_L},$   (24)

where $q_L$ is the coefficient modulus at level $L$.

For ciphertexts $\textbf{c}_i=Enc_{pk}(m_i)$ at the same level $L$,

$Add(\textbf{c}_1,\textbf{c}_2)=\textbf{c}_{add}=(c_{add},L,p_1+p_2,B_1+B_2),$   (25)
$Dec(Add(\textbf{c}_1,\textbf{c}_2),sk)\equiv m_1+m_2+e_{add}\pmod{q_L},$   (26)

where $\|e_{add}\|_{\infty}\leq B_1+B_2$.
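The bookkeeping in (25) can be made explicit with the small sketch below, which tracks only the abstract tuple $(c,L,p,B)$; the payload field is a placeholder rather than an actual ring element.

```python
from dataclasses import dataclass

@dataclass
class Ciphertext:
    """Abstract CKKS ciphertext tuple (c, L, p, B): payload, level,
    message bound p with ||m|| <= p, and noise bound B with ||e|| <= B."""
    payload: object
    level: int
    msg_bound: float
    noise_bound: float

def add(c1: Ciphertext, c2: Ciphertext) -> Ciphertext:
    """Homomorphic addition at a common level, as in (25): the message and
    noise bounds add, and the level is unchanged."""
    assert c1.level == c2.level, "operands must be at the same level L"
    return Ciphertext(payload=None, level=c1.level,
                      msg_bound=c1.msg_bound + c2.msg_bound,
                      noise_bound=c1.noise_bound + c2.noise_bound)
```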

References

  • [1] E. Hossain, I. Khan, F. Un-Noor, S. S. Sikander, and M. S. H. Sunny, “Application of big data and machine learning in smart grid, and associated security concerns: A review,” IEEE Access, vol. 7, pp. 13 960–13 988, 2019.
  • [2] M. Mohammadi, A. Al-Fuqaha, M. Guizani, and J. Oh, “Semisupervised deep reinforcement learning in support of IoT and smart city services,” IEEE Internet of Things Journal, vol. 5, no. 2, pp. 624–635, April 2018.
  • [3] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al., “Human-level control through deep reinforcement learning,” Nature, vol. 518, no. 7540, p. 529, 2015.
  • [4] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis, “Mastering the game of go without human knowledge,” Nature, pp. 354–, Oct. 2017.
  • [5] I. Arel, C. Liu, T. Urbanik, and A. G. Kohls, “Reinforcement learning-based multi-agent system for network traffic signal control,” IET Intelligent Transport Systems, vol. 4, no. 2, pp. 128–135, June 2010.
  • [6] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” J. Mach. Learn. Res., vol. 17, no. 1, p. 1334–1373, Jan. 2016.
  • [7] K. Kogiso and T. Fujita, “Cyber-security enhancement of networked control systems using homomorphic encryption,” 2015 54th IEEE Conference on Decision and Control (CDC), pp. 6836–6843, 2015.
  • [8] F. Farokhi, I. Shames, and N. Batterham, “Secure and private cloud-based control using semi-homomorphic encryption,” IFAC-PapersOnLine, vol. 49, no. 22, pp. 163 – 168, 2016.
  • [9] J. Kim, C. Lee, H. Shim, J. H. Cheon, A. Kim, M. Kim, and Y. Song, “Encrypting controller using fully homomorphic encryption for security of cyber-physical systems,” IFAC-PapersOnLine, vol. 49, no. 22, pp. 175 – 180, 2016.
  • [10] M. Schulze Darup, A. Redder, I. Shames, F. Farokhi, and D. Quevedo, “Towards encrypted MPC for linear constrained systems,” IEEE Control Systems Letters, vol. 2, no. 2, pp. 195–200, 2018.
  • [11] A. B. Alexandru, M. Morari, and G. J. Pappas, “Cloud-based MPC with encrypted data,” in 2018 IEEE Conference on Decision and Control (CDC), 2018, pp. 5014–5019.
  • [12] M. S. Darup, A. Redder, and D. E. Quevedo, “Encrypted cloud-based MPC for linear systems with input constraints,” IFAC-PapersOnLine, vol. 51, no. 20, pp. 535 – 542, 2018.
  • [13] K. Teranishi, M. Kusaka, N. Shimada, J. Ueda, and K. Kogiso, “Secure observer-based motion control based on controller encryption,” in 2019 American Control Conference (ACC), 2019, pp. 2978–2983.
  • [14] J. Kim, H. Shim, and K. Han, “Comprehensive introduction to fully homomorphic encryption for dynamic feedback controller via lwe-based cryptosystem,” CoRR, vol. abs/1904.08025, 2019.
  • [15] E. Todorov, “Efficient computation of optimal actions,” Proceedings of the National Academy of Sciences, vol. 106, no. 28, pp. 11 478–11 483, 2009.
  • [16] D. P. Bertsekas, Neuro-Dynamic Programming.   Boston, MA: Springer US, 2009.
  • [17] D. Silver, “Lecture 8: Integrating learning and planning.” [Online]. Available: https://www.davidsilver.uk/teaching/
  • [18] R. S. Sutton and A. G. Barto, Reinforcement learning: an introduction.   The MIT Press, 2018.
  • [19] J. N. Tsitsiklis, “Asynchronous stochastic approximation and QQ-learning,” in Proceedings of 32nd IEEE Conference on Decision and Control, Dec 1993, pp. 395–400 vol.1.
  • [20] C. Gentry, “A fully homomorphic encryption scheme,” Ph.D. dissertation, Stanford University, 2009, crypto.stanford.edu/craig.
  • [21] J. H. Cheon, A. Kim, M. Kim, and Y. Song, “Homomorphic encryption for arithmetic of approximate numbers,” Cryptology ePrint Archive, Report 2016/421, 2016, https://eprint.iacr.org/2016/421.
  • [22] K. Teranishi and K. Kogiso, “Control-theoretic approach to malleability cancellation by attacked signal normalization,” 09 2019.
  • [23] M. Albrecht, M. Chase, H. Chen, J. Ding, S. Goldwasser, S. Gorbunov, S. Halevi, J. Hoffstein, K. Laine, K. Lauter, S. Lokam, D. Micciancio, D. Moody, T. Morrison, A. Sahai, and V. Vaikuntanathan, “Homomorphic encryption security standard,” HomomorphicEncryption.org, Toronto, Canada, Tech. Rep., November 2018.
  • [24] “Microsoft SEAL (release 3.4),” https://github.com/Microsoft/SEAL, Oct. 2019, Microsoft Research, Redmond, WA.
  • [25] S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári, “Convergence results for single-step on-policy reinforcement-learning algorithms,” Machine learning, vol. 38, no. 3, pp. 287–308, 2000.
  • [26] J. Kim, H. Shim, and K. Han, “Dynamic controller that operates over homomorphically encrypted data for infinite time horizon,” arXiv preprint arXiv:1912.07362, 2019.