
Adaptive Control of Differentially Private Linear Quadratic Systems

Sayak Ray Chowdhury1
Indian Institute of Science, Bangalore, India
Email: [email protected]

Xingyu Zhou1
Wayne State University, Detroit, USA
Email: [email protected]

Ness Shroff
The Ohio State University, Columbus, USA
Email: [email protected]
Abstract

In this paper we study the problem of regret minimization in reinforcement learning (RL) under differential privacy constraints. This work is motivated by the wide range of RL applications for providing personalized service, where privacy concerns are becoming paramount. In contrast to previous works, we take the first step towards non-tabular RL settings, while providing a rigorous privacy guarantee. In particular, we consider the adaptive control of differentially private linear quadratic (LQ) systems. We develop the first private RL algorithm, Private-OFU-RL, which is able to attain a sub-linear regret while guaranteeing privacy protection. More importantly, the additional cost due to privacy is only on the order of $\ln(1/\delta)^{1/4}/\varepsilon^{1/2}$ given privacy parameters $\varepsilon, \delta > 0$. Through this process, we also provide a general procedure for adaptive control of LQ systems under changing regularizers, which not only generalizes previous non-private controls, but also serves as the basis for general private controls.

1 Equal contribution. This work was funded in part through NSF grants CNS-1901057 and CNS-2007231, and an Office of Naval Research grant N00014-17-1-241.

I Introduction

Reinforcement learning (RL) is a control-theoretic problem, in which an agent adaptively learns to make sequential decisions in an unknown environment through trial and error. RL has shown significant success in delivering a wide variety of personalized services, including online news and advertisement recommendation [1], medical treatment design [2], natural language processing [3], and social robots [4]. In these applications, an RL agent improves its personalization algorithm by interacting with users to maximize the reward. In particular, in each round, the RL agent offers an action based on the user’s state, and then receives feedback from the user (i.e., state information, state transition, reward, etc.). This feedback is used by the agent to learn the unknown environment and improve its action selection strategy.

However, in most practical scenarios, the feedback from the users often encodes their sensitive information. For example, in a personalized healthcare setting, the states of a patient include personal information such as age, gender, height, weight, state of the treatment, etc. Similarly, the states of a virtual keyboard user (e.g., a Google GBoard user) are the words and sentences she types, which inevitably contain private information about the user. Another intriguing example is a social robot for second-language education of children: the states include facial expressions, and the rewards encode whether the child has passed a quiz. Users may not want any of this information to be inferred by others. This directly results in an increasing concern about privacy protection in personalized services. To be more specific, although a user might be willing to share her own information with the agent to obtain a better tailored service, she would not want third parties to infer her private information from the output of the learning algorithm. For example, in the healthcare application, we would like to ensure that an adversary with arbitrary side knowledge cannot infer a particular patient’s state from the treatments prescribed to her.

Differential privacy (DP) [5] has become a standard mechanism for designing interactive learning algorithms under a rigorous privacy guarantee for individual data. Most of the previous works on differentially private learning under partial feedback focus on the simpler bandit setting (i.e., no state transition) [6, 7, 8, 9, 10]. For the general RL problem, there are only a few works that consider differential privacy [11, 12, 13]. More importantly, only tabular discrete-state, discrete-action environments are considered in these works. However, in the real-world applications mentioned above, the numbers of states and actions are often very large and can even be infinite. Over the years, for various non-tabular environments, efficient and provably optimal algorithms for reward maximization or, equivalently, regret minimization have been developed (see, e.g., [14, 15, 16, 17, 18]). This directly motivates the following question: Is it possible to obtain the optimal reward while providing individual privacy guarantees in the non-tabular RL scenario?

In this paper, we take the first step to answer the aforementioned question by considering a particular non-tabular RL problem – adaptive control of linear quadratic (LQ) systems, in which the state transition is a linear function and the immediate reward (cost) is a quadratic function of the current state and action. In particular, our main contributions can be summarized as follows.

  • First, we provide a general framework for adaptive control of LQ systems under changing regularizers using the optimism in the face of uncertainty (OFU) principle, which covers both extreme cases: non-private and fully private LQ control.

  • We then develop the first private RL algorithm, namely Private-OFU-RL, for regret minimization in LQ systems by adapting the binary counting mechanism to ensure differential privacy.

  • In particular, we show that Private-OFU-RL satisfies joint differential privacy (JDP), which, informally, implies that sensitive information about a given user is protected even if an adversary has access to the actions prescribed to all other users.

  • Finally, we prove that Private-OFU-RL achieves a sub-linear regret guarantee, where the regret due to privacy only grows as $\ln(1/\delta)^{1/4}/\varepsilon^{1/2}$ with privacy levels $\varepsilon, \delta > 0$, implying that a high amount of privacy (low $\varepsilon, \delta$) comes at a high cost and vice-versa.

II Preliminaries

II-A Stochastic Linear Quadratic Control

We consider the discrete-time episodic linear quadratic (LQ) control problem with $H$ time steps in every episode. Let $x_h \in \mathbb{R}^n$ be the state of the system, $u_h \in \mathbb{R}^d$ the control and $c_h \in \mathbb{R}$ the cost at time $h$. An LQ problem is characterized by linear dynamics and a quadratic cost function

x_{h+1} = A x_h + B u_h + w_h, \qquad c_h = x_h^\top Q x_h + u_h^\top R u_h,     (1)

where $A$, $B$ are unknown matrices, and $Q$, $R$ are known positive definite (p.d.) matrices. The starting state $x_1$ is fixed (it can possibly be chosen by an adversary) and the system noise $w_h \in \mathbb{R}^n$ is zero-mean. We summarize the unknown parameters in $\Theta = [A, B]^\top \in \mathbb{R}^{(n+d)\times n}$.
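
To make the setup in (1) concrete, here is a minimal Python sketch that rolls out one episode of such a system under a fixed linear feedback policy. The specific noise distribution, its truncation to respect the boundedness condition stated later (Assumption 1), and all variable names are illustrative assumptions rather than part of the paper's formal model.

```python
import numpy as np

def simulate_episode(A, B, Q, R, gains, x1, H, C_w=0.5, rng=None):
    """Roll out one H-step episode of x_{h+1} = A x_h + B u_h + w_h with
    stage cost c_h = x_h^T Q x_h + u_h^T R u_h, under controls u_h = K_h x_h.
    `gains[h]` is the (0-indexed) feedback matrix used at step h."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    x, total_cost = np.asarray(x1, dtype=float), 0.0
    for h in range(H):
        u = gains[h] @ x                          # linear state feedback
        total_cost += x @ Q @ x + u @ R @ u       # quadratic stage cost, cf. (1)
        w = rng.normal(scale=0.1, size=n)
        w *= min(1.0, C_w / (np.linalg.norm(w) + 1e-12))  # keep ||w_h|| <= C_w
        x = A @ x + B @ u + w                     # linear dynamics, cf. (1)
    return total_cost
```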

The goal of the agent is to design a closed-loop control policy $\pi: [H] \times \mathbb{R}^n \to \mathbb{R}^d$ mapping states to controls that minimizes the expected cost

J_h^\pi(\Theta, x) := \mathbb{E}_\pi\Big[\sum_{h'=h}^{H} c_{h'} \,\Big|\, x_h = x\Big],     (2)

for all $h \in [H]$ and $x \in \mathbb{R}^n$. Here the expectation is over the random trajectory induced by the policy $\pi$ starting from state $x$ at time $h$. From the standard theory of LQ control (e.g., [19]), the optimal policy $\pi^*$ has the form

\pi_h^*(x) = K_h(\Theta)\, x, \quad \forall h \in [H],

where the gain matrices $K_h(\Theta)$ are given by

K_h(\Theta) = -\big(R + B^\top P_h(\Theta) B\big)^{-1} B^\top P_h(\Theta) A.     (3)

Here the symmetric positive semidefinite matrices $P_h(\Theta)$ are defined recursively by the Riccati iteration

P_h(\Theta) = Q + A^\top P_{h+1}(\Theta) A - A^\top P_{h+1}(\Theta) B \big(R + B^\top P_{h+1}(\Theta) B\big)^{-1} B^\top P_{h+1}(\Theta) A,     (4)

with $P_{H+1}(\Theta) := 0$. The optimal cost is given by

J_h^*(\Theta, x) = x^\top P_h(\Theta)\, x + \sum_{h'=h}^{H} \mathbb{E}\big[w_{h'}^\top P_{h'+1}(\Theta)\, w_{h'}\big].     (5)
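
The gains (3) and cost-to-go matrices (4) can be computed by a single backward pass. The sketch below follows the indexing exactly as written in (3)-(4) and is meant only as an illustration; the function and variable names are our own.

```python
import numpy as np

def riccati_gains(A, B, Q, R, H):
    """Backward Riccati recursion (4) with P_{H+1} = 0; returns the matrices
    P_1, ..., P_{H+1} and the gains K_1, ..., K_H of (3), both 1-indexed."""
    n = A.shape[0]
    P = [None] * (H + 2)
    K = [None] * (H + 1)
    P[H + 1] = np.zeros((n, n))
    for h in range(H, 0, -1):
        Pn = P[h + 1]
        # Riccati update (4)
        M = np.linalg.solve(R + B.T @ Pn @ B, B.T @ Pn @ A)
        P[h] = Q + A.T @ Pn @ A - A.T @ Pn @ B @ M
        # gain (3), written with P_h as in the text
        K[h] = -np.linalg.solve(R + B.T @ P[h] @ B, B.T @ P[h] @ A)
    return P, K
```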

We let the agent play $K$ episodes and measure the performance by cumulative regret (in what follows, we add a subscript $k$ to denote the variables of the $k$-th episode: state $x_{k,h}$, control $u_{k,h}$, noise $w_{k,h}$ and cost $c_{k,h}$). In particular, if the true system dynamics are $\Theta_* = [A_*, B_*]^\top$, the cumulative regret over the first $K$ episodes is given by

\mathcal{R}(K) := \sum_{k=1}^{K}\big(J_1^{\pi_k}(\Theta_*, x_{k,1}) - J_1^*(\Theta_*, x_{k,1})\big),     (6)

where $J_1^*(\Theta_*, x_{k,1})$ is the (expected) cost under an optimal policy for episode $k$ (computed via (5)), and $J_1^{\pi_k}(\Theta_*, x_{k,1})$ is the (expected) cost under the chosen policy $\pi_k$ at the start of episode $k$ (computed via (2)). We seek to attain sublinear regret $\mathcal{R}(K) = o(K)$, which ensures that the agent finds the optimal policy as $K \to \infty$. We end this section by presenting our assumptions on the LQ system (1), which are common in the LQ control literature [17].

Assumption 1 (Boundedness).

(a) The true system dynamics $\Theta_*$ is a member of the set $\mathcal{S} := \{\Theta = [A, B]^\top : \|\Theta\|_F \le 1 \text{ and } [A, B] \text{ is controllable}\}$. (b) There exist constants $C$, $C_A$, $C_B$ such that $\|A_*\| \le C_A < 1$, $\|B_*\| \le C_B < 1$, and $\|Q\| \le C$, $\|R\| \le C$. (c) For all $k \ge 1$, $\|x_{k,1}\| \le 1$. (d) The noise $w_{k,h}$, for any $k \ge 1$ and $h \in [H]$, is (i) independent of all other randomness, (ii) zero-mean, $\mathbb{E}[w_{k,h}] = 0$, and (iii) bounded, $\|w_{k,h}\|_2 \le C_w < 1$. (e) There exists a constant $\gamma$ such that $C_A + \gamma C_B + C_w \le 1$.

II-B Differential Privacy

We now formally define the notion of differential privacy in the context of episodic LQ control. We write $v = (v_1, \ldots, v_K) \in \mathcal{V}^K$ to denote a sequence of $K$ unique users participating in the private RL protocol with an RL agent $\mathcal{M}$, where $\mathcal{V}$ is the set of all users. Each user $v_k$ is identified by the state responses $\{x_{k,h+1}\}_{h\in[H]}$ she gives to the controls $\{u_{k,h}\}_{h\in[H]}$ chosen by the agent. We write $\mathcal{M}(v) = \{u_{k,h}\}_{k\in[K], h\in[H]} \in (\mathbb{R}^d)^{KH}$ to denote the privatized controls chosen by the agent $\mathcal{M}$ when interacting with the users $v$. Informally, we will be interested in randomized algorithms $\mathcal{M}$ such that knowledge of the output $\mathcal{M}(v)$ and of all but the $k$-th user $v_k$ does not reveal ‘much’ about $v_k$. We formalize this in the following definition, which is adapted from [20].

Definition 1 (Differential Privacy (DP)).

For any $\varepsilon \ge 0$ and $\delta \in [0,1]$, an algorithm $\mathcal{M}: \mathcal{V}^K \to (\mathbb{R}^d)^{KH}$ is $(\varepsilon, \delta)$-differentially private if for all $v, v' \in \mathcal{V}^K$ differing on a single user and all subsets of controls $\mathcal{U} \subset (\mathbb{R}^d)^{KH}$,

\mathbb{P}\left[\mathcal{M}(v) \in \mathcal{U}\right] \le \exp(\varepsilon)\, \mathbb{P}\left[\mathcal{M}(v') \in \mathcal{U}\right] + \delta.

We now relax this definition, motivated by the fact that the controls recommended to a given user $v_k$ are only observed by her. We consider joint differential privacy [21], which requires that, simultaneously for all $k$, the joint distribution of the controls sent to users other than $v_k$ does not change substantially upon changing the state responses of the user $v_k$. To this end, we let $\mathcal{M}_{-k}(v) := \mathcal{M}(v) \setminus \{u_{k,h}\}_{h\in[H]}$ denote all the controls chosen by the agent $\mathcal{M}$ excluding those recommended to $v_k$.

Definition 2 (Joint Differential Privacy (JDP)).

For any $\varepsilon \ge 0$ and $\delta \in [0,1]$, an algorithm $\mathcal{M}: \mathcal{V}^K \to (\mathbb{R}^d)^{KH}$ is $(\varepsilon, \delta)$-jointly differentially private if for all $k \in [K]$, all $v, v' \in \mathcal{V}^K$ differing on the $k$-th user, and all subsets of controls $\mathcal{U}_{-k} \subset (\mathbb{R}^d)^{(K-1)H}$ given to all but the $k$-th user,

\mathbb{P}\left[\mathcal{M}_{-k}(v) \in \mathcal{U}_{-k}\right] \le \exp(\varepsilon)\, \mathbb{P}\left[\mathcal{M}_{-k}(v') \in \mathcal{U}_{-k}\right] + \delta.

This relaxation is necessary in our setting since knowledge of the controls recommended to the user $v_k$ can reveal a lot of information about her state responses. It weakens the constraint of DP only in that the controls given specifically to $v_k$ may be sensitive in her state responses. However, it is still a very strong definition since it protects $v_k$ from any arbitrary collusion of other users against her, so long as she does not herself make the controls reported to her public.

In this work, we look for algorithms that are $(\varepsilon, \delta)$-JDP. But we will build our algorithm upon standard DP mechanisms. Furthermore, to establish privacy, we will use a different relaxation called concentrated differential privacy (CDP) [22]. Roughly, a mechanism is CDP if its privacy loss has Gaussian tails. To this end, we let $\mathcal{M}$ be a mechanism taking as input a data-stream $x \in \mathcal{X}^n$ and releasing output from some range $\mathcal{Y}$.

Definition 3 (Concentrated Differential Privacy (CDP)).

For any $\rho \ge 0$, an algorithm $\mathcal{M}: \mathcal{X}^n \to \mathcal{Y}$ is $\rho$-zero-concentrated differentially private if for all $x, x' \in \mathcal{X}^n$ differing on a single entry and all $\alpha \in (1, \infty)$,

\operatorname{D}_{\alpha}\left(\mathcal{M}(x)\,||\,\mathcal{M}(x')\right) \le \rho\, \alpha,

where $\operatorname{D}_{\alpha}\left(\mathcal{M}(x)\,||\,\mathcal{M}(x')\right)$ is the $\alpha$-Rényi divergence between the distributions of $\mathcal{M}(x)$ and $\mathcal{M}(x')$. (For two probability distributions $P$ and $Q$ on $\Omega$, the $\alpha$-Rényi divergence is $\operatorname{D}_{\alpha}(P\,||\,Q) := \frac{1}{\alpha-1}\ln\big(\int_{\Omega} P(x)^{\alpha} Q(x)^{1-\alpha}\,\mathrm{d}x\big)$.)
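
As a point of reference for how this definition is used later (Section IV-A invokes [22, Proposition 1.6]), recall the standard calculation for the Gaussian mechanism: if $\mathcal{M}(x) = f(x) + \mathcal{N}(0, \sigma^2 I)$ and $f$ has $L_2$-sensitivity $\Delta$, then for neighboring $x, x'$ and every $\alpha > 1$,

\operatorname{D}_{\alpha}\left(\mathcal{M}(x)\,||\,\mathcal{M}(x')\right) = \frac{\alpha\left\lVert f(x) - f(x')\right\rVert_2^2}{2\sigma^2} \le \frac{\alpha\,\Delta^2}{2\sigma^2},

so the Gaussian mechanism is $(\Delta^2/2\sigma^2)$-zero-concentrated DP.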

III OFU-Based Control

Our proposed private RL algorithm implements the optimism in the face of uncertainty (OFU) principle in LQ systems. As in [14], the key to implementing the OFU-based control is a high-probability confidence set for the unknown parameter matrix $\Theta_*$.

III-A Adaptive Control with Changing Regularizers

We start with adaptive LQ control under changing regularizers. This not only allows us to generalize previous results for non-private control, but, more importantly, serves as a basis for the analysis of private control in the next section. We first define the following compact notations. For a state and control pair at step $h$ in episode $k$, i.e., $x_{k,h}$ and $u_{k,h}$, we write $z_{k,h} = [x_{k,h}^\top, u_{k,h}^\top]^\top$. For any $k \ge 1$, we define the following matrices: $Z_k := [z_{k',h'}^\top]_{k'\in[k-1], h'\in[H]}$, $X_k^{\text{next}} := [x_{k',h'+1}^\top]_{k'\in[k-1], h'\in[H]}$ and $W_k := [w_{k',h'}^\top]_{k'\in[k-1], h'\in[H]}$. For two matrices $A$ and $B$, we also define $\|A\|_B^2 := \operatorname{trace}(A^\top B A)$. Now, at every episode $k$, we consider the following ridge regression estimate with respect to a regularizing p.d. matrix $H_k \in \mathbb{R}^{(n+d)\times(n+d)}$:

\Theta_k := \operatorname*{arg\,min}_{\Theta \in \mathbb{R}^{(n+d)\times n}} \left\lVert X_k^{\text{next}} - Z_k \Theta\right\rVert_F^2 + \left\lVert\Theta\right\rVert_{H_k}^2 = (Z_k^\top Z_k + H_k)^{-1} Z_k^\top X_k^{\text{next}}.

In contrast to the standard online LQ control [14], here the sequence of matrices $\{Z_k^\top Z_k\}_{k\ge 1}$ is perturbed by a sequence of regularizers $\{H_k\}_{k\ge 1}$. In particular, when $H_k = \lambda I$, we recover the standard estimate of [14]. In addition, we also allow $Z_k^\top X_k^{\text{next}}$ to be perturbed by a matrix $L_k$ at every episode $k$. Setting $V_k := Z_k^\top Z_k + H_k$ and $U_k := Z_k^\top X_k^{\text{next}} + L_k$, we define the estimate under changing regularizers $\{H_k\}_{k\ge 1}$ and $\{L_k\}_{k\ge 1}$ as

\widehat{\Theta}_k = V_k^{-1} U_k.     (7)
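
A minimal sketch of the estimate (7), assuming the history up to episode $k$ is stored as stacked arrays `Z` (of shape $(k-1)H \times (n+d)$) and `X_next` (of shape $(k-1)H \times n$); the function and argument names are our own.

```python
import numpy as np

def regularized_estimate(Z, X_next, H_reg, L_reg):
    """Estimate (7): Theta_hat = V^{-1} U, with V = Z^T Z + H_k and
    U = Z^T X_next + L_k under changing regularizers H_k and L_k."""
    V = Z.T @ Z + H_reg            # (n+d) x (n+d)
    U = Z.T @ X_next + L_reg       # (n+d) x n
    return np.linalg.solve(V, U)   # Theta_hat, of shape (n+d) x n
```

With `H_reg` set to $\lambda I$ and `L_reg` set to zero, this reduces to the standard ridge estimate of [14].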

We make the following assumptions on the sequences of regularizers $\{H_k\}_{k\ge 1}$ and $\{L_k\}_{k\ge 1}$.

Assumption 2 (Regularity).

For any $\alpha \in (0,1]$, there exist constants $\lambda_{\max}$, $\lambda_{\min}$ and $\nu$, depending on $\alpha$, such that, with probability at least $1-\alpha$, for all $k \in [K]$,

\left\lVert H_k\right\rVert \le \lambda_{\max}, \quad \left\lVert H_k^{-1}\right\rVert \le 1/\lambda_{\min} \quad\text{and}\quad \left\lVert L_k\right\rVert_{H_k^{-1}} \le \nu.
Lemma 1 (Concentration under changing regularizers).

Under Assumptions 1 and 2, the following holds:

\forall \alpha \in (0,1], \quad \mathbb{P}\left[\exists k \in \mathbb{N}: \left\lVert\Theta_* - \widehat{\Theta}_k\right\rVert_{V_k} \ge \beta_k(\alpha)\right] \le \alpha,

where $\beta_k(\alpha) := C_w\sqrt{2\ln\big(\frac{2}{\alpha}\big) + n\ln\det\left(I + \lambda_{\min}^{-1} Z_k^\top Z_k\right)} + \sqrt{\lambda_{\max}} + \nu$.

Lemma 1 helps us to introduce the following high-probability confidence set

\mathcal{C}_k(\alpha) := \left\{\Theta : \left\lVert\Theta - \widehat{\Theta}_k\right\rVert_{V_k} \le \beta_k(\alpha)\right\}.     (8)

We then search for an optimistic estimate $\widetilde{\Theta}_k$ within this confidence region $\mathcal{C}_k(\alpha)$, such that

\widetilde{\Theta}_k \in \operatorname*{arg\,min}_{\Theta \in \mathcal{C}_k(\alpha) \cap \mathcal{S}} J_1^*(\Theta, x_{k,1}),     (9)

where $J_1^*(\Theta, x_{k,1})$ is the optimal cost when the system dynamics are $\Theta$ (it can be computed from (5)). With the estimate $\widetilde{\Theta}_k$, the agent then chooses policy $\pi_k$ and selects the controls recommended by this policy,

u_{k,h} := \pi_{k,h}(x_{k,h}) = K_h(\widetilde{\Theta}_k)\, x_{k,h},     (10)

where $K_h(\widetilde{\Theta}_k)$ can be computed from (3). We call this procedure OFU-RL; a sampling-based sketch of the optimistic step (9) is given below, and we bound the regret of the procedure in Theorem 1.
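
The optimistic step (9) requires minimizing $J_1^*(\Theta, x_{k,1})$ over the confidence set; the paper does not prescribe a particular solver, so the sketch below simply samples candidates $\Theta = \widehat{\Theta}_k + V_k^{-1/2} E$ with $\|E\|_F \le \beta_k(\alpha)$ (which lie in $\mathcal{C}_k(\alpha)$ by the definition of $\|\cdot\|_{V_k}$) and keeps the best feasible one. A practical implementation would use a more careful search; names and the helper callbacks are our own.

```python
import numpy as np
from scipy.linalg import sqrtm

def optimistic_theta(Theta_hat, V, beta, x1, optimal_cost, in_S,
                     n_samples=200, rng=None):
    """Crude sampling-based search for (9): candidates Theta_hat + V^{-1/2} E
    with ||E||_F <= beta all lie in C_k(alpha); keep the best one also in S.
    `optimal_cost(Theta, x1)` should evaluate J_1^*(Theta, x1) via (3)-(5),
    and `in_S(Theta)` should check membership in the set S of Assumption 1(a)."""
    rng = np.random.default_rng() if rng is None else rng
    V_inv_sqrt = np.linalg.inv(sqrtm(V).real)
    best, best_cost = Theta_hat, np.inf
    if in_S(Theta_hat):
        best_cost = optimal_cost(Theta_hat, x1)
    for _ in range(n_samples):
        E = rng.normal(size=Theta_hat.shape)
        E *= beta * rng.uniform() / np.linalg.norm(E)     # ||E||_F <= beta
        cand = Theta_hat + V_inv_sqrt @ E                 # lies in C_k(alpha)
        if in_S(cand):
            cost = optimal_cost(cand, x1)
            if cost < best_cost:
                best, best_cost = cand, cost
    return best
```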

Theorem 1 (Regret under changing regularizers).

Under Assumptions 1 and 2, for any $\alpha \in (0,1]$, with probability at least $1-\alpha$, the cumulative regret of OFU-RL satisfies

\mathcal{R}(K) = O\left(H\sqrt{K}\big(\sqrt{H} + n(n+d)\psi_{\lambda_{\min}} + \ln(1/\alpha)\big)\right) + O\left(H\sqrt{K}\left(\sqrt{\lambda_{\max}} + \nu\right)\sqrt{n(n+d)\psi_{\lambda_{\min}}}\right),

where $\psi_{\lambda_{\min}} := \ln\left(1 + HK/\big((n+d)\lambda_{\min}\big)\right)$.

Proof sketch. Inspired by [17], we first decompose the regret under the following ‘good’ event: $\mathcal{E}_K(\alpha) := \{\Theta_* \in \mathcal{C}_k(\alpha) \cap \mathcal{S},\ \forall k \in [K]\}$, which, by Assumption 1 and Lemma 1, holds with probability at least $1-\alpha$. Then, under the ‘good’ event, the cumulative regret (6) can be bounded as

\mathcal{R}(K) \le \sum_{k=1}^{K}\sum_{h=1}^{H}\big(\Delta_{k,h} + \Delta'_{k,h} + \Delta''_{k,h}\big), \quad\text{where}
\Delta_{k,h} := \mathbb{E}\left[J_{h+1}^{\pi_k}(\Theta_*, x_{k,h+1}) \mid \mathcal{F}_{k,h}\right] - J_{h+1}^{\pi_k}(\Theta_*, x_{k,h+1}),
\Delta'_{k,h} := \left\lVert x_{k,h+1}\right\rVert_{\widetilde{P}_{k,h+1}} - \mathbb{E}\left[\left\lVert x_{k,h+1}\right\rVert_{\widetilde{P}_{k,h+1}} \mid \mathcal{F}_{k,h}\right] \quad\text{and}
\Delta''_{k,h} := \left\lVert\Theta_*^\top z_{k,h}\right\rVert_{\widetilde{P}_{k,h+1}} - \left\lVert\widetilde{\Theta}_k^\top z_{k,h}\right\rVert_{\widetilde{P}_{k,h+1}},

in which $\widetilde{P}_{k,h} := P_h(\widetilde{\Theta}_k)$ is given by (4) and $\mathcal{F}_{k,h}$ denotes all the randomness present before time $(k,h)$.

We now bound each term in turn. For the first two terms, we can show that both are bounded martingale difference sequences. Therefore, by the Azuma–Hoeffding inequality, we have $\sum_{k,h}\Delta_{k,h} = O\big(\sqrt{KH^3}\big)$ and $\sum_{k,h}\Delta'_{k,h} = O\big(\sqrt{KH}\big)$ with high probability. We use Lemma 1 and the OFU principle (9) to bound the third term as $\sum_{k,h}\Delta''_{k,h} = O\big(H\sqrt{K}\,\beta_k(\alpha)\sqrt{\ln\det\left(I + \lambda_{\min}^{-1} Z_k^\top Z_k\right)}\big)$. To put everything together, first note from Assumption 1 that

\ln\det\left(I + \lambda_{\min}^{-1} Z_k^\top Z_k\right) \le (n+d)\ln\left(1 + \frac{HK(1+\gamma)^2}{(n+d)\lambda_{\min}}\right).

Plugging this into $\beta_k(\alpha)$ given in Lemma 1 and into the third term above yields the final result. ∎

We end the section with a proof sketch of Lemma 1.

Proof sketch (Lemma 1). Under Assumptions 1 and 2, with some basic algebra, we first have

\left\lVert\Theta_* - \widehat{\Theta}_k\right\rVert_{V_k} = \left\lVert H_k\Theta_* - Z_k^\top W_k - L_k\right\rVert_{V_k^{-1}} \le \underbrace{\left\lVert Z_k^\top W_k\right\rVert_{(Z_k^\top Z_k + \lambda_{\min} I)^{-1}}}_{\mathcal{T}_1} + \underbrace{\left\lVert H_k^{1/2}\right\rVert_2 + \left\lVert L_k\right\rVert_{H_k^{-1}}}_{\mathcal{T}_2}.

By Assumption 2, with probability at least $1-\alpha$, $\mathcal{T}_2 \le \sqrt{\lambda_{\max}} + \nu$. To bound $\mathcal{T}_1$, by the boundedness of $w_{k,h}$ in Assumption 1, we first note that each row of the matrix $W_k$ is a sub-Gaussian random vector with parameter $C_w$. We then generalize the self-normalized concentration inequality for vector-valued martingales [23, Theorem 1] to the setting of matrix-valued martingales. In particular, we show that, with probability at least $1-\alpha$,

\mathcal{T}_1 \le C_w\sqrt{2\ln\left(1/\alpha\right) + n\ln\det\left(I + \lambda_{\min}^{-1} Z_k^\top Z_k\right)}.

Combining the bounds on $\mathcal{T}_1$ and $\mathcal{T}_2$ via a union bound argument yields the final result. ∎

III-B Private Control

In this section, we introduce the Private-OFU-RL algorithm (Alg. 1). At every episode $k$, we keep track of the history via private versions of the matrices $Z_k^\top Z_k$ and $Z_k^\top X_k^{\text{next}}$. To do so, we first initialize two private counter mechanisms $\mathcal{B}_1$ and $\mathcal{B}_2$, which take as parameters the privacy levels $\varepsilon$, $\delta$, the number of episodes $K$, the horizon $H$ and a problem-specific constant $\gamma$ (see Assumption 1). The counter $\mathcal{B}_1$ (resp. $\mathcal{B}_2$) takes as input an event stream of matrices $\{\sum_{h=1}^H z_{k,h} z_{k,h}^\top\}_{k\in[K]}$ (resp. $\{\sum_{h=1}^H z_{k,h} x_{k,h+1}^\top\}_{k\in[K]}$) and, at the start of each episode $k$, releases a private version of the matrix $Z_k^\top Z_k$ (resp. $Z_k^\top X_k^{\text{next}}$), which is itself a matrix of the same dimension. Let $T_{1,k}$ and $T_{2,k}$ denote the privatized versions of $Z_k^\top Z_k$ and $Z_k^\top X_k^{\text{next}}$, respectively. For some $\eta > 0$ (to be determined later), we define $V_k := T_{1,k} + \eta I$ and $U_k := T_{2,k}$. We now instantiate the general OFU-RL procedure under changing regularizers (Section III-A) with these private statistics. First, we compute the point estimate $\widehat{\Theta}_k$ from (7) and build the confidence set $\mathcal{C}_k(\alpha)$ from (8). Then, we choose the most optimistic policy $\pi_k$ with respect to the entire set $\mathcal{C}_k(\alpha)$ from (9) and (10). Finally, we execute the policy for the entire episode and update the counters with the observed trajectory.

We now describe the private counter $\mathcal{B}_1$, which adapts the binary counting mechanism of [24]. First, we write $\Sigma_1[i,j] = \sum_{k=i}^{j}\sum_{h=1}^{H} z_{k,h} z_{k,h}^\top$ to denote a partial sum (P-sum) involving the state-control pairs in episodes $i$ through $j$. Next, we consider a binary interval tree, where each leaf node represents an episode (i.e., the tree has $k-1$ leaf nodes at the start of episode $k$), and each interior node represents the range of episodes covered by its children. At the start of episode $k$, we first release a noisy P-sum $\widehat{\Sigma}_1[i,j]$ corresponding to each node in the tree. Here $\widehat{\Sigma}_1[i,j]$ is obtained by perturbing both the $(p,q)$-th and $(q,p)$-th entries, $1 \le p \le q \le n+d$, of $\Sigma_1[i,j]$ with i.i.d. Gaussian noise $\zeta_{p,q} \sim \mathcal{N}(0, \sigma_1^2)$ (this ensures symmetry of the P-sums even after adding noise). Then $T_{1,k}$ is computed by summing up the noisy P-sums released by the set of nodes that uniquely cover the range $[1, k-1]$. Observe that, at the end of each episode, the mechanism only needs to store the noisy P-sums required for computing private statistics at future episodes, and can safely discard P-sums that are no longer needed. For the private counter $\mathcal{B}_2$, we maintain P-sums $\Sigma_2[i,j] = \sum_{k=i}^{j}\sum_{h=1}^{H} z_{k,h} x_{k,h+1}^\top$ with i.i.d. noise $\mathcal{N}(0, \sigma_2^2)$ and compute the private statistics $T_{2,k}$ using a similar procedure. The noise levels $\sigma_1$ and $\sigma_2$ depend on the problem intrinsics ($K$, $H$, $\gamma$) and the privacy parameters ($\varepsilon$, $\delta$). These, in turn, govern the constants $\lambda_{\max}$, $\lambda_{\min}$, $\nu$ appearing in the confidence set $\mathcal{C}_k(\alpha)$ and the regularizer $\eta$. The details will be specified in the next section as needed.
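
A minimal sketch of the tree-based counter $\mathcal{B}_1$ described above, assuming the per-episode matrices arrive as a stream: each dyadic range of episodes receives one symmetric Gaussian perturbation, and the prefix query for $[1, k-1]$ sums the noisy P-sums of its dyadic decomposition. Class and method names are our own; $\mathcal{B}_2$ would be analogous with rectangular, non-symmetric noise.

```python
import numpy as np

class BinaryMatrixCounter:
    """Tree-based (binary) counter releasing noisy prefix sums of matrices."""
    def __init__(self, dim, sigma, rng=None):
        self.dim, self.sigma = dim, sigma
        self.rng = np.random.default_rng() if rng is None else rng
        self.noisy_psum = {}      # (i, j) -> noisy P-sum over episodes i..j
        self.data = []            # raw per-episode matrices (kept only for clarity;
                                  # a real implementation keeps partial accumulators)

    def _noise(self):
        # symmetric Gaussian perturbation: noise on (p, q) mirrored to (q, p)
        N = self.rng.normal(scale=self.sigma, size=(self.dim, self.dim))
        return np.triu(N) + np.triu(N, 1).T

    def add(self, M):
        """Feed the episode-k statistic, e.g. sum_h z_{k,h} z_{k,h}^T."""
        self.data.append(M)
        k = len(self.data)
        # every dyadic interval ending at episode k becomes complete now
        span = 1
        while k % span == 0 and span <= k:
            i, j = k - span + 1, k
            self.noisy_psum[(i, j)] = sum(self.data[i - 1:j]) + self._noise()
            span *= 2

    def prefix(self, k_minus_1):
        """Release the private version of the sum over episodes 1..k-1 (T_{1,k})."""
        total, pos = np.zeros((self.dim, self.dim)), k_minus_1
        while pos > 0:
            span = 1
            while span * 2 <= pos and pos % (span * 2) == 0:
                span *= 2
            total += self.noisy_psum[(pos - span + 1, pos)]
            pos -= span
        return total
```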

Input: number of episodes $K$, horizon $H$, privacy levels $\varepsilon > 0$, $\delta \in (0,1]$, constants $\gamma$, $C_w$, confidence level $\alpha \in (0,1]$
1  initialize private counters $\mathcal{B}_1$ and $\mathcal{B}_2$ with parameters $K, H, \varepsilon, \delta, \gamma$
2  for each episode $k = 1, 2, 3, \ldots, K$ do
3      compute private statistics $T_{1,k}$ and $T_{2,k}$
4      construct confidence set $\mathcal{C}_k(\alpha)$
5      find $\widetilde{\Theta}_k \in \operatorname*{arg\,min}_{\Theta \in \mathcal{C}_k(\alpha)\cap\mathcal{S}} J_1^*(\Theta, x_{k,1})$
6      for each step $h = 1, 2, \ldots, H$ do
7          execute control $u_{k,h} = K_h(\widetilde{\Theta}_k)\, x_{k,h}$
8          observe cost $c_{k,h}$ and next state $x_{k,h+1}$
9      send $\sum_{h=1}^{H} z_{k,h} z_{k,h}^\top$ and $\sum_{h=1}^{H} z_{k,h} x_{k,h+1}^\top$ to the counters $\mathcal{B}_1$ and $\mathcal{B}_2$, respectively
Algorithm 1: Private-OFU-RL

IV Privacy and Regret Guarantees

In this section, we show that Private-OFU-RL is a JDP algorithm with a sublinear regret guarantee.

IV-A Privacy Guarantee

Theorem 2 (Privacy).

Under Assumption 1, for any $\varepsilon > 0$ and $\delta \in (0,1]$, Private-OFU-RL is $(\varepsilon, \delta)$-JDP.

Proof sketch. We first show that both counters $\mathcal{B}_1$ and $\mathcal{B}_2$ are $(\varepsilon/2, \delta/2)$-DP. We begin with the counter $\mathcal{B}_1$. To this end, we need to determine a global upper bound $\Delta_1$ on the $L_2$-sensitivity of all the P-sums $\Sigma_1[i,j]$. Informally, $\Delta_1$ encodes the maximum change in the Frobenius norm of each P-sum if the trajectory of a single episode is changed. By Assumption 1, we have $\|z_{k,h}\| \le 1+\gamma$, and hence $\Delta_1 = H(1+\gamma)^2$. Since the noisy P-sums $\widehat{\Sigma}_1[i,j]$ are obtained via the Gaussian mechanism, each $\widehat{\Sigma}_1[i,j]$ is $(\Delta_1^2/2\sigma_1^2)$-CDP [22, Proposition 1.6]. Next, note that every episode appears in at most $m := \lceil\log_2 K\rceil$ P-sums $\Sigma_1[i,j]$. Therefore, by the composition property, the whole counter $\mathcal{B}_1$ is $(m\Delta_1^2/2\sigma_1^2)$-CDP and thus, in turn, $\big(\frac{m\Delta_1^2}{2\sigma_1^2} + 2\sqrt{\frac{m\Delta_1^2}{2\sigma_1^2}\ln\big(\frac{2}{\delta}\big)}, \frac{\delta}{2}\big)$-DP for any $\delta > 0$ [22, Lemma 3.5]. Now, setting $\sigma_1^2 \approx 8m\Delta_1^2\ln(2/\delta)/\varepsilon^2$, we can ensure that $\mathcal{B}_1$ is $(\varepsilon/2, \delta/2)$-DP. A similar analysis yields that counter $\mathcal{B}_2$ is $(\varepsilon/2, \delta/2)$-DP if we set $\sigma_2^2 \approx 8m\Delta_2^2\ln(2/\delta)/\varepsilon^2$, where $\Delta_2 := H(1+\gamma)$.

To prove Theorem 2, we now use the billboard lemma [25, Lemma 9], which, informally, states that an algorithm is JDP under continual observation if the output sent to each user is a function of the user's private data and a common quantity computed using standard differential privacy. Note that at each episode $k$, Private-OFU-RL computes private statistics $T_{1,k}$ and $T_{2,k}$ for all users using the counters $\mathcal{B}_1$ and $\mathcal{B}_2$. These statistics are then used to compute the policy $\pi_k$. By the composition and post-processing properties of DP, we can argue that the sequence of policies $\{\pi_k\}_{k\in[K]}$ is computed using an $(\varepsilon, \delta)$-DP mechanism. Now, the controls $\{u_{k,h}\}_{h\in[H]}$ during episode $k$ are generated using the policy $\pi_k$ and the user's private data $x_{k,h}$ as $u_{k,h} = \pi_{k,h}(x_{k,h})$. Then, by the billboard lemma, the composition of the controls $\{u_{k,h}\}_{k\in[K], h\in[H]}$ sent to all the users is $(\varepsilon, \delta)$-JDP. ∎
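
A small helper sketching the noise calibration used in the proof above; it simply evaluates the displayed approximate settings $\sigma_i^2 \approx 8m\Delta_i^2\ln(2/\delta)/\varepsilon^2$ with $m = \lceil\log_2 K\rceil$, $\Delta_1 = H(1+\gamma)^2$ and $\Delta_2 = H(1+\gamma)$. The function name is ours, and the constants track only the approximate expressions given in the sketch.

```python
import math

def counter_noise_levels(eps, delta, K, H, gamma):
    """Per-entry Gaussian noise standard deviations for counters B1 and B2,
    following sigma_i^2 ~ 8 m Delta_i^2 ln(2/delta) / eps^2."""
    m = math.ceil(math.log2(K)) if K > 1 else 1
    delta1 = H * (1.0 + gamma) ** 2     # L2-sensitivity of the Sigma_1 P-sums
    delta2 = H * (1.0 + gamma)          # L2-sensitivity of the Sigma_2 P-sums
    scale = 8.0 * m * math.log(2.0 / delta) / eps ** 2
    return math.sqrt(scale) * delta1, math.sqrt(scale) * delta2
```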

IV-B Regret Guarantee

Theorem 3 (Private regret).

Under Assumption 1, for any privacy parameters $\varepsilon > 0$ and $\delta \in (0,1]$, and for any $\alpha \in (0,1]$, with probability at least $1-\alpha$, Private-OFU-RL enjoys the regret bound

\mathcal{R}(K) = O\left(H^{3/2}\sqrt{K}\big(n(n+d)\ln K + \ln(1/\alpha)\big)\right) + O\left(H^{3/2}\sqrt{K}\ln K\left(n(n+d) + \sqrt{\ln K/\alpha}\right)\frac{\ln(1/\delta)^{1/4}}{\varepsilon^{1/2}}\right).

Theorems 2 and 3 together imply that Private-OFU-RL can achieve a sub-linear regret under an $(\varepsilon, \delta)$-JDP privacy guarantee. Furthermore, comparing Theorem 3 with Theorem 1, we see that the first term in the regret bound corresponds to the non-private regret, while the second term is the cost of privacy. More importantly, the cost due to privacy grows only as $\ln(1/\delta)^{1/4}/\varepsilon^{1/2}$ with $\varepsilon, \delta$.

Proof sketch (Theorem 3). First note that the private statistic $T_{1,k}$ can be computed by summing at most $m = \lceil\log_2 K\rceil$ noisy P-sums $\widehat{\Sigma}_1[i,j]$. Hence the total noise $N_k$ in each $T_{1,k}$ is a symmetric matrix whose $(p,q)$-th entries, $1 \le p \le q \le n+d$, are i.i.d. $\mathcal{N}(0, m\sigma_1^2)$. Therefore, by an adaptation of [26, Corollary 4.4.8], we have, with probability at least $1-\alpha/2K$,

\left\lVert N_k\right\rVert \le \Lambda := \sigma_1\sqrt{m}\left(4\sqrt{n+d} + \sqrt{8\ln(4K/\alpha)}\right).

Similarly, the total noise $L_k$ in each $T_{2,k}$ is an $(n+d)\times n$ matrix whose entries are i.i.d. $\mathcal{N}(0, m\sigma_2^2)$. Hence $\left\lVert L_k\right\rVert_F^2/m\sigma_2^2$ is a $\chi^2$-statistic with $n(n+d)$ degrees of freedom, and therefore, by [27, Lemma 1], we have, with probability at least $1-\alpha/2K$,

\left\lVert L_k\right\rVert_F \le \sigma_2\sqrt{m}\left(\sqrt{2n(n+d)} + \sqrt{4\ln(2K/\alpha)}\right).

By construction, the regularizer is $H_k = N_k + \eta I$. Setting $\eta = 2\Lambda$, we ensure that $H_k$ is p.d., and hence $\left\lVert L_k\right\rVert_{H_k^{-1}} \le \Lambda^{-1/2}\left\lVert L_k\right\rVert_F$. Then, by a union bound argument, Assumption 2 holds with $\lambda_{\min} = \Lambda$, $\lambda_{\max} = 3\Lambda$ and $\nu = \sigma_2\sqrt{m/\Lambda}\left(\sqrt{2n(n+d)} + \sqrt{4\ln(2K/\alpha)}\right)$. Substituting the noise levels $\sigma_1, \sigma_2$ from Section IV-A, the regret bound now follows from Theorem 1. ∎
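
As a rough consistency check (not part of the formal proof), one can trace where the privacy cost in Theorem 3 comes from by plugging the above choices into the second term of Theorem 1, suppressing constants, the dependence on $\alpha$, and factors logarithmic in $K$:

\sigma_1 \asymp \sqrt{m}\,\Delta_1\,\frac{\sqrt{\ln(1/\delta)}}{\varepsilon}, \qquad \lambda_{\max} \asymp \Lambda \asymp \sigma_1\sqrt{m}\,\sqrt{n+d}, \qquad \sqrt{\lambda_{\max}} \asymp \sqrt{m}\,\sqrt{H}\,(1+\gamma)\,(n+d)^{1/4}\,\frac{\ln(1/\delta)^{1/4}}{\varepsilon^{1/2}},

and a similar computation shows $\nu$ scales the same way in $\varepsilon$ and $\delta$, so the factor $\sqrt{\lambda_{\max}} + \nu$ in Theorem 1 contributes the $\ln(1/\delta)^{1/4}/\varepsilon^{1/2}$ dependence appearing in Theorem 3.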

V Conclusion

We develop the first DP algorithm, Private-OFU-RL, for episodic LQ control. Through the notion of JDP, we show that it can protect private user information from being inferred by observing the control policy without losing much on its regret performance. We leave as future work private control of non-linear systems [16].

References

  • [1] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proceedings of the 19th international conference on World wide web, 2010, pp. 661–670.
  • [2] Y. Zhao, M. R. Kosorok, and D. Zeng, “Reinforcement learning design for cancer clinical trials,” Statistics in medicine, vol. 28, no. 26, pp. 3294–3315, 2009.
  • [3] A. R. Sharma and P. Kaushik, “Literature survey of statistical, deep and reinforcement learning in natural language processing,” in 2017 International Conference on Computing, Communication and Automation (ICCCA).   IEEE, 2017, pp. 350–354.
  • [4] G. Gordon, S. Spaulding, J. K. Westlund, J. J. Lee, L. Plummer, M. Martinez, M. Das, and C. Breazeal, “Affective personalization of a social robot tutor for children’s second language skills,” in Proceedings of the AAAI conference on artificial intelligence, vol. 30, no. 1, 2016.
  • [5] C. Dwork, “Differential privacy: A survey of results,” in International conference on theory and applications of models of computation.   Springer, 2008, pp. 1–19.
  • [6] A. Tossou and C. Dimitrakakis, “Algorithms for differentially private multi-armed bandits,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
  • [7] ——, “Achieving privacy in the adversarial multi-armed bandit,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
  • [8] D. Basu, C. Dimitrakakis, and A. Tossou, “Differential privacy for multi-armed bandits: What is it and what is its cost?” arXiv preprint arXiv:1905.12298, 2019.
  • [9] N. Mishra and A. Thakurta, “(nearly) optimal differentially private stochastic multi-arm bandits,” in Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, 2015, pp. 592–601.
  • [10] X. Zhou and J. Tan, “Local differential privacy for bayesian optimization,” arXiv preprint arXiv:2010.06709, 2020.
  • [11] B. Balle, M. Gomrokchi, and D. Precup, “Differentially private policy evaluation,” in International Conference on Machine Learning.   PMLR, 2016, pp. 2130–2138.
  • [12] G. Vietri, B. Balle, A. Krishnamurthy, and S. Wu, “Private reinforcement learning with pac and regret guarantees,” in International Conference on Machine Learning.   PMLR, 2020, pp. 9754–9764.
  • [13] E. Garcelon, V. Perchet, C. Pike-Burke, and M. Pirotta, “Local differentially private regret minimization in reinforcement learning,” arXiv preprint arXiv:2010.07778, 2020.
  • [14] Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” in Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 1–26.
  • [15] I. Osband and B. Van Roy, “Model-based reinforcement learning and the eluder dimension,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, 2014, pp. 1466–1474.
  • [16] S. R. Chowdhury and A. Gopalan, “Online learning in kernelized markov decision processes,” in The 22nd International Conference on Artificial Intelligence and Statistics.   PMLR, 2019, pp. 3197–3205.
  • [17] T. Wang and L. F. Yang, “Episodic linear quadratic regulators with low-rank transitions,” arXiv preprint arXiv:2011.01568, 2020.
  • [18] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan, “Provably efficient reinforcement learning with linear function approximation,” in Conference on Learning Theory, 2020, pp. 2137–2143.
  • [19] D. Bertsekas, Dynamic programming and optimal control.   Athena Scientific, Belmont, MA, 3rd edition, 2004.
  • [20] C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Foundations and Trends in Theoretical Computer Science, 2014.
  • [21] M. Kearns, M. Pai, A. Roth, and J. Ullman, “Mechanism design in large games: Incentives and privacy,” in Proceedings of the 5th conference on Innovations in theoretical computer science, 2014, pp. 403–410.
  • [22] M. Bun and T. Steinke, “Concentrated differential privacy: Simplifications, extensions, and lower bounds,” in Theory of Cryptography Conference.   Springer, 2016, pp. 635–658.
  • [23] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Improved algorithms for linear stochastic bandits,” in Advances in Neural Information Processing Systems, 2011, pp. 2312–2320.
  • [24] T.-H. H. Chan, E. Shi, and D. Song, “Private and continual release of statistics,” ACM Transactions on Information and System Security (TISSEC), vol. 14, no. 3, pp. 1–24, 2011.
  • [25] J. Hsu, Z. Huang, A. Roth, T. Roughgarden, and Z. S. Wu, “Private matchings and allocations,” SIAM Journal on Computing, vol. 45, no. 6, pp. 1953–1984, 2016.
  • [26] R. Vershynin, High-dimensional probability: An introduction with applications in data science.   Cambridge university press, 2018, vol. 47.
  • [27] B. Laurent and P. Massart, “Adaptive estimation of a quadratic functional by model selection,” Annals of Statistics, pp. 1302–1338, 2000.