
Adaptive Control of Differentially Private Linear Quadratic Systems

Sayak Ray Chowdhury1
Indian Institute of Science, Bangalore, India
Email: [email protected]

Xingyu Zhou1
Wayne State University, Detroit, USA
Email: [email protected]

Ness Shroff
The Ohio State University, Columbus, USA
Email: [email protected]
Abstract

In this paper we study the problem of regret minimization in reinforcement learning (RL) under differential privacy constraints. This work is motivated by the wide range of RL applications for providing personalized service, where privacy concerns are becoming paramount. In contrast to previous works, we take the first step towards non-tabular RL settings, while providing a rigorous privacy guarantee. In particular, we consider the adaptive control of differentially private linear quadratic (LQ) systems. We develop the first private RL algorithm, Private-OFU-RL, which is able to attain a sub-linear regret while guaranteeing privacy protection. More importantly, the additional cost due to privacy is only on the order of $\ln(1/\delta)^{1/4}/\varepsilon^{1/2}$ given privacy parameters $\varepsilon, \delta > 0$. Through this process, we also provide a general procedure for adaptive control of LQ systems under changing regularizers, which not only generalizes previous non-private controls, but also serves as the basis for general private controls.

1 Equal contribution. This work was funded in part through NSF grants CNS-1901057 and CNS-2007231, and an Office of Naval Research grant N00014-17-1-241.

I Introduction

Reinforcement learning (RL) is a control-theoretic problem, in which an agent adaptively learns to make sequential decisions in an unknown environment through trial and error. RL has shown significant success in delivering a wide variety of personalized services, including online news and advertisement recommendation [1], medical treatment design [2], natural language processing [3], and social robots [4]. In these applications, an RL agent improves its personalization algorithm by interacting with users to maximize the reward. In particular, in each round, the RL agent offers an action based on the user’s state, and then receives feedback from the user (i.e., state information, state transition, reward, etc.). This feedback is used by the agent to learn the unknown environment and improve its action selection strategy.

However, in most practical scenarios, the feedback from the users often encodes their sensitive information. For example, in a personalized healthcare setting, the states of a patient include personal information such as age, gender, height, weight, state of the treatment, etc. Similarly, the states of a virtual keyboard user (e.g., a Google GBoard user) are the words and sentences she types, which inevitably contain private information about the user. Another intriguing example is a social robot for second-language education of children: the states include facial expressions, and the rewards encode whether the child has passed a quiz. Users may not want any of this information to be inferred by others. This directly results in an increasing concern about privacy protection in personalized services. To be more specific, although a user might be willing to share her own information with the agent to obtain a better tailored service, she would not want third parties to infer her private information from the output of the learning algorithm. For example, in the healthcare application, we would like to ensure that an adversary with arbitrary side knowledge cannot infer a particular patient’s state from the treatments prescribed to her.

Differential privacy (DP) [5] has become a standard mechanism for designing interactive learning algorithms under a rigorous privacy guarantee for individual data. Most of the previous works on differentially private learning under partial feedback focus on the simpler bandit setting (i.e., no state transition) [6, 7, 8, 9, 10]. For the general RL problem, there are only a few works that consider differential privacy [11, 12, 13]. More importantly, only tabular discrete-state, discrete-action environments are considered in these works. However, in the real-world applications mentioned above, the numbers of states and actions are often very large and can even be infinite. Over the years, for various non-tabular environments, efficient and provably optimal algorithms for reward maximization or, equivalently, regret minimization have been developed (see, e.g., [14, 15, 16, 17, 18]). This directly motivates the following question: Is it possible to obtain the optimal reward while providing individual privacy guarantees in the non-tabular RL scenario?

In this paper, we take the first step to answer the aforementioned question by considering a particular non-tabular RL problem – adaptive control of linear quadratic (LQ) systems, in which the state transition is a linear function and the immediate reward (cost) is a quadratic function of the current state and action. In particular, our main contributions can be summarized as follows.

  • First, we provide a general framework for adaptive control of LQ systems under changing regularizers using the optimism in the face of uncertainty (OFU) principle, which covers both extreme cases: non-private and fully private LQ control.

  • We then develop the first private RL algorithm, namely Private-OFU-RL, for regret minimization in LQ systems by adapting the binary counting mechanism to ensure differential privacy.

  • In particular, we show that Private-OFU-RL satisfies joint differential privacy (JDP), which, informally, implies that sensitive information about a given user is protected even if an adversary has access to the actions prescribed to all other users.

  • Finally, we prove that Private-OFU-RL achieves a sub-linear regret guarantee, where the regret due to privacy only grows as $\ln(1/\delta)^{1/4}/\varepsilon^{1/2}$ with privacy levels $\varepsilon, \delta > 0$, implying that a high amount of privacy (low $\varepsilon, \delta$) comes at a high cost and vice-versa.

II Preliminaries

II-A Stochastic Linear Quadratic Control

We consider the discrete-time episodic linear quadratic (LQ) control problem with $H$ time steps in every episode. Let $x_h \in \mathbb{R}^n$ be the state of the system, $u_h \in \mathbb{R}^d$ the control and $c_h \in \mathbb{R}$ the cost at time $h$. An LQ problem is characterized by linear dynamics and a quadratic cost function

x_{h+1} = A x_h + B u_h + w_h, \qquad c_h = x_h^\top Q x_h + u_h^\top R u_h,     (1)

where $A$, $B$ are unknown matrices, and $Q$, $R$ are known positive definite (p.d.) matrices. The starting state $x_1$ is fixed (it can possibly be chosen by an adversary) and the system noise $w_h \in \mathbb{R}^n$ is zero-mean. We summarize the unknown parameters in $\Theta = [A, B]^\top \in \mathbb{R}^{(n+d)\times n}$.
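
To make the setup in (1) concrete, here is a minimal Python sketch that rolls out one episode of such a system under a fixed linear feedback policy. The specific noise distribution, its truncation to respect the boundedness condition stated later (Assumption 1), and all variable names are illustrative assumptions rather than part of the paper's formal model.

```python
import numpy as np

def simulate_episode(A, B, Q, R, gains, x1, H, C_w=0.5, rng=None):
    """Roll out one H-step episode of x_{h+1} = A x_h + B u_h + w_h with
    stage cost c_h = x_h^T Q x_h + u_h^T R u_h, under controls u_h = K_h x_h.
    `gains[h]` is the (0-indexed) feedback matrix used at step h."""
    rng = np.random.default_rng() if rng is None else rng
    n = A.shape[0]
    x, total_cost = np.asarray(x1, dtype=float), 0.0
    for h in range(H):
        u = gains[h] @ x                          # linear state feedback
        total_cost += x @ Q @ x + u @ R @ u       # quadratic stage cost, cf. (1)
        w = rng.normal(scale=0.1, size=n)
        w *= min(1.0, C_w / (np.linalg.norm(w) + 1e-12))  # keep ||w_h|| <= C_w
        x = A @ x + B @ u + w                     # linear dynamics, cf. (1)
    return total_cost
```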

The goal of the agent is to design a closed-loop control policy $\pi: [H] \times \mathbb{R}^n \to \mathbb{R}^d$ mapping states to controls that minimizes the expected cost

J_h^\pi(\Theta, x) := \mathbb{E}_\pi\Big[\sum_{h'=h}^{H} c_{h'} \,\Big|\, x_h = x\Big],     (2)

for all $h \in [H]$ and $x \in \mathbb{R}^n$. Here the expectation is over the random trajectory induced by the policy $\pi$ starting from state $x$ at time $h$. From the standard theory of LQ control (e.g., [19]), the optimal policy $\pi^*$ has the form

\pi_h^*(x) = K_h(\Theta)\, x, \quad \forall h \in [H],

where the gain matrices $K_h(\Theta)$ are given by

K_h(\Theta) = -\big(R + B^\top P_h(\Theta) B\big)^{-1} B^\top P_h(\Theta) A.     (3)

Here the symmetric positive semidefinite matrices $P_h(\Theta)$ are defined recursively by the Riccati iteration

P_h(\Theta) = Q + A^\top P_{h+1}(\Theta) A - A^\top P_{h+1}(\Theta) B \big(R + B^\top P_{h+1}(\Theta) B\big)^{-1} B^\top P_{h+1}(\Theta) A,     (4)

with $P_{H+1}(\Theta) := 0$. The optimal cost is given by

J_h^*(\Theta, x) = x^\top P_h(\Theta)\, x + \sum_{h'=h}^{H} \mathbb{E}\big[w_{h'}^\top P_{h'+1}(\Theta)\, w_{h'}\big].     (5)
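
The gains (3) and cost-to-go matrices (4) can be computed by a single backward pass. The sketch below follows the indexing exactly as written in (3)-(4) and is meant only as an illustration; the function and variable names are our own.

```python
import numpy as np

def riccati_gains(A, B, Q, R, H):
    """Backward Riccati recursion (4) with P_{H+1} = 0; returns the matrices
    P_1, ..., P_{H+1} and the gains K_1, ..., K_H of (3), both 1-indexed."""
    n = A.shape[0]
    P = [None] * (H + 2)
    K = [None] * (H + 1)
    P[H + 1] = np.zeros((n, n))
    for h in range(H, 0, -1):
        Pn = P[h + 1]
        # Riccati update (4)
        M = np.linalg.solve(R + B.T @ Pn @ B, B.T @ Pn @ A)
        P[h] = Q + A.T @ Pn @ A - A.T @ Pn @ B @ M
        # gain (3), written with P_h as in the text
        K[h] = -np.linalg.solve(R + B.T @ P[h] @ B, B.T @ P[h] @ A)
    return P, K
```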

We let the agent play $K$ episodes and measure the performance by cumulative regret (in what follows, we add a subscript $k$ to denote the variables of the $k$-th episode: state $x_{k,h}$, control $u_{k,h}$, noise $w_{k,h}$ and cost $c_{k,h}$). In particular, if the true system dynamics are $\Theta_* = [A_*, B_*]^\top$, the cumulative regret over the first $K$ episodes is given by

\mathcal{R}(K) := \sum_{k=1}^{K}\big(J_1^{\pi_k}(\Theta_*, x_{k,1}) - J_1^*(\Theta_*, x_{k,1})\big),     (6)

where $J_1^*(\Theta_*, x_{k,1})$ is the (expected) cost under an optimal policy for episode $k$ (computed via (5)), and $J_1^{\pi_k}(\Theta_*, x_{k,1})$ is the (expected) cost under the chosen policy $\pi_k$ at the start of episode $k$ (computed via (2)). We seek to attain sublinear regret $\mathcal{R}(K) = o(K)$, which ensures that the agent finds the optimal policy as $K \to \infty$. We end this section by presenting our assumptions on the LQ system (1), which are common in the LQ control literature [17].

Assumption 1 (Boundedness).

(a) The true system dynamics $\Theta_*$ is a member of the set $\mathcal{S} := \{\Theta = [A, B]^\top : \|\Theta\|_F \le 1 \text{ and } [A, B] \text{ is controllable}\}$. (b) There exist constants $C$, $C_A$, $C_B$ such that $\|A_*\| \le C_A < 1$, $\|B_*\| \le C_B < 1$, and $\|Q\| \le C$, $\|R\| \le C$. (c) For all $k \ge 1$, $\|x_{k,1}\| \le 1$. (d) The noise $w_{k,h}$, for any $k \ge 1$ and $h \in [H]$, is (i) independent of all other randomness, (ii) zero-mean, $\mathbb{E}[w_{k,h}] = 0$, and (iii) bounded, $\|w_{k,h}\|_2 \le C_w < 1$. (e) There exists a constant $\gamma$ such that $C_A + \gamma C_B + C_w \le 1$.

II-B Differential Privacy

We now formally define the notion of differential privacy in the context of episodic LQ control. We write $v = (v_1, \ldots, v_K) \in \mathcal{V}^K$ to denote a sequence of $K$ unique users participating in the private RL protocol with an RL agent $\mathcal{M}$, where $\mathcal{V}$ is the set of all users. Each user $v_k$ is identified by the state responses $\{x_{k,h+1}\}_{h\in[H]}$ she gives to the controls $\{u_{k,h}\}_{h\in[H]}$ chosen by the agent. We write $\mathcal{M}(v) = \{u_{k,h}\}_{k\in[K], h\in[H]} \in (\mathbb{R}^d)^{KH}$ to denote the privatized controls chosen by the agent $\mathcal{M}$ when interacting with the users $v$. Informally, we will be interested in randomized algorithms $\mathcal{M}$ such that knowledge of the output $\mathcal{M}(v)$ and of all but the $k$-th user $v_k$ does not reveal ‘much’ about $v_k$. We formalize this in the following definition, which is adapted from [20].

Definition 1 (Differential Privacy (DP)).

For any $\varepsilon \ge 0$ and $\delta \in [0,1]$, an algorithm $\mathcal{M}: \mathcal{V}^K \to (\mathbb{R}^d)^{KH}$ is $(\varepsilon, \delta)$-differentially private if for all $v, v' \in \mathcal{V}^K$ differing on a single user and all subsets of controls $\mathcal{U} \subset (\mathbb{R}^d)^{KH}$,

\mathbb{P}\left[\mathcal{M}(v) \in \mathcal{U}\right] \le \exp(\varepsilon)\, \mathbb{P}\left[\mathcal{M}(v') \in \mathcal{U}\right] + \delta.

We now relax this definition, motivated by the fact that the controls recommended to a given user $v_k$ are only observed by her. We consider joint differential privacy [21], which requires that, simultaneously for all $k$, the joint distribution of the controls sent to users other than $v_k$ does not change substantially upon changing the state responses of the user $v_k$. To this end, we let $\mathcal{M}_{-k}(v) := \mathcal{M}(v) \setminus \{u_{k,h}\}_{h\in[H]}$ denote all the controls chosen by the agent $\mathcal{M}$ excluding those recommended to $v_k$.

Definition 2 (Joint Differential Privacy (JDP)).

For any $\varepsilon \ge 0$ and $\delta \in [0,1]$, an algorithm $\mathcal{M}: \mathcal{V}^K \to (\mathbb{R}^d)^{KH}$ is $(\varepsilon, \delta)$-jointly differentially private if for all $k \in [K]$, all $v, v' \in \mathcal{V}^K$ differing on the $k$-th user, and all subsets of controls $\mathcal{U}_{-k} \subset (\mathbb{R}^d)^{(K-1)H}$ given to all but the $k$-th user,

\mathbb{P}\left[\mathcal{M}_{-k}(v) \in \mathcal{U}_{-k}\right] \le \exp(\varepsilon)\, \mathbb{P}\left[\mathcal{M}_{-k}(v') \in \mathcal{U}_{-k}\right] + \delta.

This relaxation is necessary in our setting since knowledge of the controls recommended to the user $v_k$ can reveal a lot of information about her state responses. It weakens the constraint of DP only in that the controls given specifically to $v_k$ may be sensitive in her state responses. However, it is still a very strong definition since it protects $v_k$ from any arbitrary collusion of other users against her, so long as she does not herself make the controls reported to her public.

In this work, we look for algorithms that are $(\varepsilon, \delta)$-JDP. But we will build our algorithm upon standard DP mechanisms. Furthermore, to establish privacy, we will use a different relaxation called concentrated differential privacy (CDP) [22]. Roughly, a mechanism is CDP if its privacy loss has Gaussian tails. To this end, we let $\mathcal{M}$ be a mechanism taking as input a data-stream $x \in \mathcal{X}^n$ and releasing output from some range $\mathcal{Y}$.

Definition 3 (Concentrated Differential Privacy (CDP)).

For any $\rho \ge 0$, an algorithm $\mathcal{M}: \mathcal{X}^n \to \mathcal{Y}$ is $\rho$-zero-concentrated differentially private if for all $x, x' \in \mathcal{X}^n$ differing on a single entry and all $\alpha \in (1, \infty)$,

\operatorname{D}_{\alpha}\left(\mathcal{M}(x)\,||\,\mathcal{M}(x')\right) \le \rho\, \alpha,

where $\operatorname{D}_{\alpha}\left(\mathcal{M}(x)\,||\,\mathcal{M}(x')\right)$ is the $\alpha$-Rényi divergence between the distributions of $\mathcal{M}(x)$ and $\mathcal{M}(x')$. (For two probability distributions $P$ and $Q$ on $\Omega$, the $\alpha$-Rényi divergence is $\operatorname{D}_{\alpha}(P\,||\,Q) := \frac{1}{\alpha-1}\ln\big(\int_{\Omega} P(x)^{\alpha} Q(x)^{1-\alpha}\,\mathrm{d}x\big)$.)
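
As a point of reference for how this definition is used later (Section IV-A invokes [22, Proposition 1.6]), recall the standard calculation for the Gaussian mechanism: if $\mathcal{M}(x) = f(x) + \mathcal{N}(0, \sigma^2 I)$ and $f$ has $L_2$-sensitivity $\Delta$, then for neighboring $x, x'$ and every $\alpha > 1$,

\operatorname{D}_{\alpha}\left(\mathcal{M}(x)\,||\,\mathcal{M}(x')\right) = \frac{\alpha\left\lVert f(x) - f(x')\right\rVert_2^2}{2\sigma^2} \le \frac{\alpha\,\Delta^2}{2\sigma^2},

so the Gaussian mechanism is $(\Delta^2/2\sigma^2)$-zero-concentrated DP.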

III OFU-Based Control

Our proposed private RL algorithm implements the optimism in the face of uncertainty (OFU) principle in LQ systems. As in [14], the key to implementing the OFU-based control is a high-probability confidence set for the unknown parameter matrix $\Theta_*$.

III-A Adaptive Control with Changing Regularizers

We start with adaptive LQ control under changing regularizers. This not only allows us to generalize previous results for non-private control, but, more importantly, serves as a basis for the analysis of private control in the next section. We first define the following compact notations. For a state and control pair at step $h$ in episode $k$, i.e., $x_{k,h}$ and $u_{k,h}$, we write $z_{k,h} = [x_{k,h}^\top, u_{k,h}^\top]^\top$. For any $k \ge 1$, we define the following matrices: $Z_k := [z_{k',h'}^\top]_{k'\in[k-1], h'\in[H]}$, $X_k^{\text{next}} := [x_{k',h'+1}^\top]_{k'\in[k-1], h'\in[H]}$ and $W_k := [w_{k',h'}^\top]_{k'\in[k-1], h'\in[H]}$. For two matrices $A$ and $B$, we also define $\|A\|_B^2 := \operatorname{trace}(A^\top B A)$. Now, at every episode $k$, we consider the following ridge regression estimate with respect to a regularizing p.d. matrix $H_k \in \mathbb{R}^{(n+d)\times(n+d)}$:

\Theta_k := \operatorname*{arg\,min}_{\Theta \in \mathbb{R}^{(n+d)\times n}} \left\lVert X_k^{\text{next}} - Z_k \Theta\right\rVert_F^2 + \left\lVert\Theta\right\rVert_{H_k}^2 = (Z_k^\top Z_k + H_k)^{-1} Z_k^\top X_k^{\text{next}}.

In contrast to the standard online LQ control [14], here the sequence of matrices $\{Z_k^\top Z_k\}_{k\ge 1}$ is perturbed by a sequence of regularizers $\{H_k\}_{k\ge 1}$. In particular, when $H_k = \lambda I$, we recover the standard estimate of [14]. In addition, we also allow $Z_k^\top X_k^{\text{next}}$ to be perturbed by a matrix $L_k$ at every episode $k$. Setting $V_k := Z_k^\top Z_k + H_k$ and $U_k := Z_k^\top X_k^{\text{next}} + L_k$, we define the estimate under changing regularizers $\{H_k\}_{k\ge 1}$ and $\{L_k\}_{k\ge 1}$ as

\widehat{\Theta}_k = V_k^{-1} U_k.     (7)
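
A minimal sketch of the estimate (7), assuming the history up to episode $k$ is stored as stacked arrays `Z` (of shape $(k-1)H \times (n+d)$) and `X_next` (of shape $(k-1)H \times n$); the function and argument names are our own.

```python
import numpy as np

def regularized_estimate(Z, X_next, H_reg, L_reg):
    """Estimate (7): Theta_hat = V^{-1} U, with V = Z^T Z + H_k and
    U = Z^T X_next + L_k under changing regularizers H_k and L_k."""
    V = Z.T @ Z + H_reg            # (n+d) x (n+d)
    U = Z.T @ X_next + L_reg       # (n+d) x n
    return np.linalg.solve(V, U)   # Theta_hat, of shape (n+d) x n
```

With `H_reg` set to $\lambda I$ and `L_reg` set to zero, this reduces to the standard ridge estimate of [14].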

We make the following assumptions on the sequences of regularizers $\{H_k\}_{k\ge 1}$ and $\{L_k\}_{k\ge 1}$.

Assumption 2 (Regularity).

For any $\alpha \in (0,1]$, there exist constants $\lambda_{\max}$, $\lambda_{\min}$ and $\nu$, depending on $\alpha$, such that, with probability at least $1-\alpha$, for all $k \in [K]$,

\left\lVert H_k\right\rVert \le \lambda_{\max}, \quad \left\lVert H_k^{-1}\right\rVert \le 1/\lambda_{\min} \quad\text{and}\quad \left\lVert L_k\right\rVert_{H_k^{-1}} \le \nu.
Lemma 1 (Concentration under changing regularizers).

Under Assumptions 1 and 2, the following holds:

\forall \alpha \in (0,1], \quad \mathbb{P}\left[\exists k \in \mathbb{N}: \left\lVert\Theta_* - \widehat{\Theta}_k\right\rVert_{V_k} \ge \beta_k(\alpha)\right] \le \alpha,

where $\beta_k(\alpha) := C_w\sqrt{2\ln\big(\frac{2}{\alpha}\big) + n\ln\det\left(I + \lambda_{\min}^{-1} Z_k^\top Z_k\right)} + \sqrt{\lambda_{\max}} + \nu$.

Lemma 1 helps us to introduce the following high-probability confidence set

\mathcal{C}_k(\alpha) := \left\{\Theta : \left\lVert\Theta - \widehat{\Theta}_k\right\rVert_{V_k} \le \beta_k(\alpha)\right\}.     (8)

We then search for an optimistic estimate $\widetilde{\Theta}_k$ within this confidence region $\mathcal{C}_k(\alpha)$, such that

\widetilde{\Theta}_k \in \operatorname*{arg\,min}_{\Theta \in \mathcal{C}_k(\alpha) \cap \mathcal{S}} J_1^*(\Theta, x_{k,1}),     (9)

where $J_1^*(\Theta, x_{k,1})$ is the optimal cost when the system dynamics are $\Theta$ (it can be computed from (5)). With the estimate $\widetilde{\Theta}_k$, the agent then chooses policy $\pi_k$ and selects the controls recommended by this policy,

u_{k,h} := \pi_{k,h}(x_{k,h}) = K_h(\widetilde{\Theta}_k)\, x_{k,h},     (10)

where $K_h(\widetilde{\Theta}_k)$ can be computed from (3). We call this procedure OFU-RL; a sampling-based sketch of the optimistic step (9) is given below, and we bound the regret of the procedure in Theorem 1.
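
The optimistic step (9) requires minimizing $J_1^*(\Theta, x_{k,1})$ over the confidence set; the paper does not prescribe a particular solver, so the sketch below simply samples candidates $\Theta = \widehat{\Theta}_k + V_k^{-1/2} E$ with $\|E\|_F \le \beta_k(\alpha)$ (which lie in $\mathcal{C}_k(\alpha)$ by the definition of $\|\cdot\|_{V_k}$) and keeps the best feasible one. A practical implementation would use a more careful search; names and the helper callbacks are our own.

```python
import numpy as np
from scipy.linalg import sqrtm

def optimistic_theta(Theta_hat, V, beta, x1, optimal_cost, in_S,
                     n_samples=200, rng=None):
    """Crude sampling-based search for (9): candidates Theta_hat + V^{-1/2} E
    with ||E||_F <= beta all lie in C_k(alpha); keep the best one also in S.
    `optimal_cost(Theta, x1)` should evaluate J_1^*(Theta, x1) via (3)-(5),
    and `in_S(Theta)` should check membership in the set S of Assumption 1(a)."""
    rng = np.random.default_rng() if rng is None else rng
    V_inv_sqrt = np.linalg.inv(sqrtm(V).real)
    best, best_cost = Theta_hat, np.inf
    if in_S(Theta_hat):
        best_cost = optimal_cost(Theta_hat, x1)
    for _ in range(n_samples):
        E = rng.normal(size=Theta_hat.shape)
        E *= beta * rng.uniform() / np.linalg.norm(E)     # ||E||_F <= beta
        cand = Theta_hat + V_inv_sqrt @ E                 # lies in C_k(alpha)
        if in_S(cand):
            cost = optimal_cost(cand, x1)
            if cost < best_cost:
                best, best_cost = cand, cost
    return best
```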

Theorem 1 (Regret under changing regularizers).

Under Assumptions 1 and 2, for any $\alpha \in (0,1]$, with probability at least $1-\alpha$, the cumulative regret of OFU-RL satisfies

\mathcal{R}(K) = O\left(H\sqrt{K}\big(\sqrt{H} + n(n+d)\psi_{\lambda_{\min}} + \ln(1/\alpha)\big)\right) + O\left(H\sqrt{K}\left(\sqrt{\lambda_{\max}} + \nu\right)\sqrt{n(n+d)\psi_{\lambda_{\min}}}\right),

where $\psi_{\lambda_{\min}} := \ln\left(1 + HK/\big((n+d)\lambda_{\min}\big)\right)$.

Proof sketch. Inspired by [17], we first decompose the regret under the following ‘good’ event: $\mathcal{E}_K(\alpha) := \{\Theta_* \in \mathcal{C}_k(\alpha) \cap \mathcal{S},\ \forall k \in [K]\}$, which, by Assumption 1 and Lemma 1, holds with probability at least $1-\alpha$. Then, under the ‘good’ event, the cumulative regret (6) can be bounded as

\mathcal{R}(K) \le \sum_{k=1}^{K}\sum_{h=1}^{H}\big(\Delta_{k,h} + \Delta'_{k,h} + \Delta''_{k,h}\big), \quad\text{where}
\Delta_{k,h} := \mathbb{E}\left[J_{h+1}^{\pi_k}(\Theta_*, x_{k,h+1}) \mid \mathcal{F}_{k,h}\right] - J_{h+1}^{\pi_k}(\Theta_*, x_{k,h+1}),
\Delta'_{k,h} := \left\lVert x_{k,h+1}\right\rVert_{\widetilde{P}_{k,h+1}} - \mathbb{E}\left[\left\lVert x_{k,h+1}\right\rVert_{\widetilde{P}_{k,h+1}} \mid \mathcal{F}_{k,h}\right] \quad\text{and}
\Delta''_{k,h} := \left\lVert\Theta_*^\top z_{k,h}\right\rVert_{\widetilde{P}_{k,h+1}} - \left\lVert\widetilde{\Theta}_k^\top z_{k,h}\right\rVert_{\widetilde{P}_{k,h+1}},

in which $\widetilde{P}_{k,h} := P_h(\widetilde{\Theta}_k)$ is given by (4) and $\mathcal{F}_{k,h}$ denotes all the randomness present before time $(k,h)$.

We now bound each term in turn. For the first two terms, we can show that both are bounded martingale difference sequences. Therefore, by the Azuma–Hoeffding inequality, we have $\sum_{k,h}\Delta_{k,h} = O\big(\sqrt{KH^3}\big)$ and $\sum_{k,h}\Delta'_{k,h} = O\big(\sqrt{KH}\big)$ with high probability. We use Lemma 1 and the OFU principle (9) to bound the third term as $\sum_{k,h}\Delta''_{k,h} = O\big(H\sqrt{K}\,\beta_k(\alpha)\sqrt{\ln\det\left(I + \lambda_{\min}^{-1} Z_k^\top Z_k\right)}\big)$. To put everything together, first note from Assumption 1 that

\ln\det\left(I + \lambda_{\min}^{-1} Z_k^\top Z_k\right) \le (n+d)\ln\left(1 + \frac{HK(1+\gamma)^2}{(n+d)\lambda_{\min}}\right).

Plugging this into $\beta_k(\alpha)$ given in Lemma 1 and into the third term above yields the final result. ∎

We end the section with a proof sketch of Lemma 1.

Proof sketch (Lemma 1). Under Assumptions 1 and 2, with some basic algebra, we first have

\left\lVert\Theta_* - \widehat{\Theta}_k\right\rVert_{V_k} = \left\lVert H_k\Theta_* - Z_k^\top W_k - L_k\right\rVert_{V_k^{-1}} \le \underbrace{\left\lVert Z_k^\top W_k\right\rVert_{(Z_k^\top Z_k + \lambda_{\min} I)^{-1}}}_{\mathcal{T}_1} + \underbrace{\left\lVert H_k^{1/2}\right\rVert_2 + \left\lVert L_k\right\rVert_{H_k^{-1}}}_{\mathcal{T}_2}.

By Assumption 2, with probability at least $1-\alpha$, $\mathcal{T}_2 \le \sqrt{\lambda_{\max}} + \nu$. To bound $\mathcal{T}_1$, by the boundedness of $w_{k,h}$ in Assumption 1, we first note that each row of the matrix $W_k$ is a sub-Gaussian random vector with parameter $C_w$. We then generalize the self-normalized concentration inequality for vector-valued martingales [23, Theorem 1] to the setting of matrix-valued martingales. In particular, we show that, with probability at least $1-\alpha$,

\mathcal{T}_1 \le C_w\sqrt{2\ln\left(1/\alpha\right) + n\ln\det\left(I + \lambda_{\min}^{-1} Z_k^\top Z_k\right)}.

Combining the bounds on $\mathcal{T}_1$ and $\mathcal{T}_2$ via a union bound argument yields the final result. ∎

III-B Private Control

In this section, we introduce the Private-OFU-RL algorithm (Alg. 1). At every episode $k$, we keep track of the history via private versions of the matrices $Z_k^\top Z_k$ and $Z_k^\top X_k^{\text{next}}$. To do so, we first initialize two private counter mechanisms $\mathcal{B}_1$ and $\mathcal{B}_2$, which take as parameters the privacy levels $\varepsilon$, $\delta$, the number of episodes $K$, the horizon $H$ and a problem-specific constant $\gamma$ (see Assumption 1). The counter $\mathcal{B}_1$ (resp. $\mathcal{B}_2$) takes as input an event stream of matrices $\{\sum_{h=1}^H z_{k,h} z_{k,h}^\top\}_{k\in[K]}$ (resp. $\{\sum_{h=1}^H z_{k,h} x_{k,h+1}^\top\}_{k\in[K]}$) and, at the start of each episode $k$, releases a private version of the matrix $Z_k^\top Z_k$ (resp. $Z_k^\top X_k^{\text{next}}$), which is itself a matrix of the same dimension. Let $T_{1,k}$ and $T_{2,k}$ denote the privatized versions of $Z_k^\top Z_k$ and $Z_k^\top X_k^{\text{next}}$, respectively. For some $\eta > 0$ (to be determined later), we define $V_k := T_{1,k} + \eta I$ and $U_k := T_{2,k}$. We now instantiate the general OFU-RL procedure under changing regularizers (Section III-A) with these private statistics. First, we compute the point estimate $\widehat{\Theta}_k$ from (7) and build the confidence set $\mathcal{C}_k(\alpha)$ from (8). Then, we choose the most optimistic policy $\pi_k$ with respect to the entire set $\mathcal{C}_k(\alpha)$ from (9) and (10). Finally, we execute the policy for the entire episode and update the counters with the observed trajectory.

We now describe the private counter $\mathcal{B}_1$, which adapts the binary counting mechanism of [24]. First, we write $\Sigma_1[i,j] = \sum_{k=i}^{j}\sum_{h=1}^{H} z_{k,h} z_{k,h}^\top$ to denote a partial sum (P-sum) involving the state-control pairs in episodes $i$ through $j$. Next, we consider a binary interval tree, where each leaf node represents an episode (i.e., the tree has $k-1$ leaf nodes at the start of episode $k$), and each interior node represents the range of episodes covered by its children. At the start of episode $k$, we first release a noisy P-sum $\widehat{\Sigma}_1[i,j]$ corresponding to each node in the tree. Here $\widehat{\Sigma}_1[i,j]$ is obtained by perturbing both the $(p,q)$-th and $(q,p)$-th entries, $1 \le p \le q \le n+d$, of $\Sigma_1[i,j]$ with i.i.d. Gaussian noise $\zeta_{p,q} \sim \mathcal{N}(0, \sigma_1^2)$ (this ensures symmetry of the P-sums even after adding noise). Then $T_{1,k}$ is computed by summing up the noisy P-sums released by the set of nodes that uniquely cover the range $[1, k-1]$. Observe that, at the end of each episode, the mechanism only needs to store the noisy P-sums required for computing private statistics at future episodes, and can safely discard P-sums that are no longer needed. For the private counter $\mathcal{B}_2$, we maintain P-sums $\Sigma_2[i,j] = \sum_{k=i}^{j}\sum_{h=1}^{H} z_{k,h} x_{k,h+1}^\top$ with i.i.d. noise $\mathcal{N}(0, \sigma_2^2)$ and compute the private statistics $T_{2,k}$ using a similar procedure. The noise levels $\sigma_1$ and $\sigma_2$ depend on the problem intrinsics ($K$, $H$, $\gamma$) and the privacy parameters ($\varepsilon$, $\delta$). These, in turn, govern the constants $\lambda_{\max}$, $\lambda_{\min}$, $\nu$ appearing in the confidence set $\mathcal{C}_k(\alpha)$ and the regularizer $\eta$. The details will be specified in the next section as needed.
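
A minimal sketch of the tree-based counter $\mathcal{B}_1$ described above, assuming the per-episode matrices arrive as a stream: each dyadic range of episodes receives one symmetric Gaussian perturbation, and the prefix query for $[1, k-1]$ sums the noisy P-sums of its dyadic decomposition. Class and method names are our own; $\mathcal{B}_2$ would be analogous with rectangular, non-symmetric noise.

```python
import numpy as np

class BinaryMatrixCounter:
    """Tree-based (binary) counter releasing noisy prefix sums of matrices."""
    def __init__(self, dim, sigma, rng=None):
        self.dim, self.sigma = dim, sigma
        self.rng = np.random.default_rng() if rng is None else rng
        self.noisy_psum = {}      # (i, j) -> noisy P-sum over episodes i..j
        self.data = []            # raw per-episode matrices (kept only for clarity;
                                  # a real implementation keeps partial accumulators)

    def _noise(self):
        # symmetric Gaussian perturbation: noise on (p, q) mirrored to (q, p)
        N = self.rng.normal(scale=self.sigma, size=(self.dim, self.dim))
        return np.triu(N) + np.triu(N, 1).T

    def add(self, M):
        """Feed the episode-k statistic, e.g. sum_h z_{k,h} z_{k,h}^T."""
        self.data.append(M)
        k = len(self.data)
        # every dyadic interval ending at episode k becomes complete now
        span = 1
        while k % span == 0 and span <= k:
            i, j = k - span + 1, k
            self.noisy_psum[(i, j)] = sum(self.data[i - 1:j]) + self._noise()
            span *= 2

    def prefix(self, k_minus_1):
        """Release the private version of the sum over episodes 1..k-1 (T_{1,k})."""
        total, pos = np.zeros((self.dim, self.dim)), k_minus_1
        while pos > 0:
            span = 1
            while span * 2 <= pos and pos % (span * 2) == 0:
                span *= 2
            total += self.noisy_psum[(pos - span + 1, pos)]
            pos -= span
        return total
```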

Input: number of episodes $K$, horizon $H$, privacy levels $\varepsilon > 0$, $\delta \in (0,1]$, constants $\gamma$, $C_w$, confidence level $\alpha \in (0,1]$
1  initialize private counters $\mathcal{B}_1$ and $\mathcal{B}_2$ with parameters $K, H, \varepsilon, \delta, \gamma$
2  for each episode $k = 1, 2, 3, \ldots, K$ do
3      compute private statistics $T_{1,k}$ and $T_{2,k}$
4      construct confidence set $\mathcal{C}_k(\alpha)$
5      find $\widetilde{\Theta}_k \in \operatorname*{arg\,min}_{\Theta \in \mathcal{C}_k(\alpha)\cap\mathcal{S}} J_1^*(\Theta, x_{k,1})$
6      for each step $h = 1, 2, \ldots, H$ do
7          execute control $u_{k,h} = K_h(\widetilde{\Theta}_k)\, x_{k,h}$
8          observe cost $c_{k,h}$ and next state $x_{k,h+1}$
9      send $\sum_{h=1}^{H} z_{k,h} z_{k,h}^\top$ and $\sum_{h=1}^{H} z_{k,h} x_{k,h+1}^\top$ to the counters $\mathcal{B}_1$ and $\mathcal{B}_2$, respectively
Algorithm 1: Private-OFU-RL

IV Privacy and Regret Guarantees

In this section, we show that Private-OFU-RL is a JDP algorithm with a sublinear regret guarantee.

IV-A Privacy Guarantee

Theorem 2 (Privacy).

Under Assumption 1, for any $\varepsilon > 0$ and $\delta \in (0,1]$, Private-OFU-RL is $(\varepsilon, \delta)$-JDP.

Proof sketch. We first show that both counters $\mathcal{B}_1$ and $\mathcal{B}_2$ are $(\varepsilon/2, \delta/2)$-DP. We begin with the counter $\mathcal{B}_1$. To this end, we need to determine a global upper bound $\Delta_1$ on the $L_2$-sensitivity of all the P-sums $\Sigma_1[i,j]$. Informally, $\Delta_1$ encodes the maximum change in the Frobenius norm of each P-sum if the trajectory of a single episode is changed. By Assumption 1, we have $\|z_{k,h}\| \le 1+\gamma$, and hence $\Delta_1 = H(1+\gamma)^2$. Since the noisy P-sums $\widehat{\Sigma}_1[i,j]$ are obtained via the Gaussian mechanism, each $\widehat{\Sigma}_1[i,j]$ is $(\Delta_1^2/2\sigma_1^2)$-CDP [22, Proposition 1.6]. Next, note that every episode appears in at most $m := \lceil\log_2 K\rceil$ P-sums $\Sigma_1[i,j]$. Therefore, by the composition property, the whole counter $\mathcal{B}_1$ is $(m\Delta_1^2/2\sigma_1^2)$-CDP and thus, in turn, $\big(\frac{m\Delta_1^2}{2\sigma_1^2} + 2\sqrt{\frac{m\Delta_1^2}{2\sigma_1^2}\ln\big(\frac{2}{\delta}\big)}, \frac{\delta}{2}\big)$-DP for any $\delta > 0$ [22, Lemma 3.5]. Now, setting $\sigma_1^2 \approx 8m\Delta_1^2\ln(2/\delta)/\varepsilon^2$, we can ensure that $\mathcal{B}_1$ is $(\varepsilon/2, \delta/2)$-DP. A similar analysis yields that counter $\mathcal{B}_2$ is $(\varepsilon/2, \delta/2)$-DP if we set $\sigma_2^2 \approx 8m\Delta_2^2\ln(2/\delta)/\varepsilon^2$, where $\Delta_2 := H(1+\gamma)$.

To prove Theorem 2, we now use the billboard lemma [25, Lemma 9], which, informally, states that an algorithm is JDP under continual observation if the output sent to each user is a function of the user's private data and a common quantity computed using standard differential privacy. Note that at each episode $k$, Private-OFU-RL computes private statistics $T_{1,k}$ and $T_{2,k}$ for all users using the counters $\mathcal{B}_1$ and $\mathcal{B}_2$. These statistics are then used to compute the policy $\pi_k$. By the composition and post-processing properties of DP, we can argue that the sequence of policies $\{\pi_k\}_{k\in[K]}$ is computed using an $(\varepsilon, \delta)$-DP mechanism. Now, the controls $\{u_{k,h}\}_{h\in[H]}$ during episode $k$ are generated using the policy $\pi_k$ and the user's private data $x_{k,h}$ as $u_{k,h} = \pi_{k,h}(x_{k,h})$. Then, by the billboard lemma, the composition of the controls $\{u_{k,h}\}_{k\in[K], h\in[H]}$ sent to all the users is $(\varepsilon, \delta)$-JDP. ∎
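
A small helper sketching the noise calibration used in the proof above; it simply evaluates the displayed approximate settings $\sigma_i^2 \approx 8m\Delta_i^2\ln(2/\delta)/\varepsilon^2$ with $m = \lceil\log_2 K\rceil$, $\Delta_1 = H(1+\gamma)^2$ and $\Delta_2 = H(1+\gamma)$. The function name is ours, and the constants track only the approximate expressions given in the sketch.

```python
import math

def counter_noise_levels(eps, delta, K, H, gamma):
    """Per-entry Gaussian noise standard deviations for counters B1 and B2,
    following sigma_i^2 ~ 8 m Delta_i^2 ln(2/delta) / eps^2."""
    m = math.ceil(math.log2(K)) if K > 1 else 1
    delta1 = H * (1.0 + gamma) ** 2     # L2-sensitivity of the Sigma_1 P-sums
    delta2 = H * (1.0 + gamma)          # L2-sensitivity of the Sigma_2 P-sums
    scale = 8.0 * m * math.log(2.0 / delta) / eps ** 2
    return math.sqrt(scale) * delta1, math.sqrt(scale) * delta2
```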

IV-B Regret Guarantee

Theorem 3 (Private regret).

Under Assumption 1, for any privacy parameters $\varepsilon > 0$ and $\delta \in (0,1]$, and for any $\alpha \in (0,1]$, with probability at least $1-\alpha$, Private-OFU-RL enjoys the regret bound

\mathcal{R}(K) = O\left(H^{3/2}\sqrt{K}\big(n(n+d)\ln K + \ln(1/\alpha)\big)\right) + O\left(H^{3/2}\sqrt{K}\ln K\left(n(n+d) + \sqrt{\ln K/\alpha}\right)\frac{\ln(1/\delta)^{1/4}}{\varepsilon^{1/2}}\right).

Theorems 2 and 3 together imply that Private-OFU-RL can achieve a sub-linear regret under an $(\varepsilon, \delta)$-JDP privacy guarantee. Furthermore, comparing Theorem 3 with Theorem 1, we see that the first term in the regret bound corresponds to the non-private regret, while the second term is the cost of privacy. More importantly, the cost due to privacy grows only as $\ln(1/\delta)^{1/4}/\varepsilon^{1/2}$ with $\varepsilon, \delta$.

Proof sketch (Theorem 3). First note that the private statistic $T_{1,k}$ can be computed by summing at most $m = \lceil\log_2 K\rceil$ noisy P-sums $\widehat{\Sigma}_1[i,j]$. Hence the total noise $N_k$ in each $T_{1,k}$ is a symmetric matrix whose $(p,q)$-th entries, $1 \le p \le q \le n+d$, are i.i.d. $\mathcal{N}(0, m\sigma_1^2)$. Therefore, by an adaptation of [26, Corollary 4.4.8], we have, with probability at least $1-\alpha/2K$,

\left\lVert N_k\right\rVert \le \Lambda := \sigma_1\sqrt{m}\left(4\sqrt{n+d} + \sqrt{8\ln(4K/\alpha)}\right).

Similarly, the total noise $L_k$ in each $T_{2,k}$ is an $(n+d)\times n$ matrix whose entries are i.i.d. $\mathcal{N}(0, m\sigma_2^2)$. Hence $\left\lVert L_k\right\rVert_F^2/m\sigma_2^2$ is a $\chi^2$-statistic with $n(n+d)$ degrees of freedom, and therefore, by [27, Lemma 1], we have, with probability at least $1-\alpha/2K$,

\left\lVert L_k\right\rVert_F \le \sigma_2\sqrt{m}\left(\sqrt{2n(n+d)} + \sqrt{4\ln(2K/\alpha)}\right).

By construction, the regularizer is $H_k = N_k + \eta I$. Setting $\eta = 2\Lambda$, we ensure that $H_k$ is p.d., and hence $\left\lVert L_k\right\rVert_{H_k^{-1}} \le \Lambda^{-1/2}\left\lVert L_k\right\rVert_F$. Then, by a union bound argument, Assumption 2 holds with $\lambda_{\min} = \Lambda$, $\lambda_{\max} = 3\Lambda$ and $\nu = \sigma_2\sqrt{m/\Lambda}\left(\sqrt{2n(n+d)} + \sqrt{4\ln(2K/\alpha)}\right)$. Substituting the noise levels $\sigma_1, \sigma_2$ from Section IV-A, the regret bound now follows from Theorem 1. ∎
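
As a rough consistency check (not part of the formal proof), one can trace where the privacy cost in Theorem 3 comes from by plugging the above choices into the second term of Theorem 1, suppressing constants, the dependence on $\alpha$, and factors logarithmic in $K$:

\sigma_1 \asymp \sqrt{m}\,\Delta_1\,\frac{\sqrt{\ln(1/\delta)}}{\varepsilon}, \qquad \lambda_{\max} \asymp \Lambda \asymp \sigma_1\sqrt{m}\,\sqrt{n+d}, \qquad \sqrt{\lambda_{\max}} \asymp \sqrt{m}\,\sqrt{H}\,(1+\gamma)\,(n+d)^{1/4}\,\frac{\ln(1/\delta)^{1/4}}{\varepsilon^{1/2}},

and a similar computation shows $\nu$ scales the same way in $\varepsilon$ and $\delta$, so the factor $\sqrt{\lambda_{\max}} + \nu$ in Theorem 1 contributes the $\ln(1/\delta)^{1/4}/\varepsilon^{1/2}$ dependence appearing in Theorem 3.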

V Conclusion

We develop the first DP algorithm, Private-OFU-RL, for episodic LQ control. Through the notion of JDP, we show that it can protect private user information from being inferred by observing the control policy without losing much on its regret performance. We leave as future work private control of non-linear systems [16].

References

  • [1] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proceedings of the 19th international conference on World wide web, 2010, pp. 661–670.
  • [2] Y. Zhao, M. R. Kosorok, and D. Zeng, “Reinforcement learning design for cancer clinical trials,” Statistics in medicine, vol. 28, no. 26, pp. 3294–3315, 2009.
  • [3] A. R. Sharma and P. Kaushik, “Literature survey of statistical, deep and reinforcement learning in natural language processing,” in 2017 International Conference on Computing, Communication and Automation (ICCCA).   IEEE, 2017, pp. 350–354.
  • [4] G. Gordon, S. Spaulding, J. K. Westlund, J. J. Lee, L. Plummer, M. Martinez, M. Das, and C. Breazeal, “Affective personalization of a social robot tutor for children’s second language skills,” in Proceedings of the AAAI conference on artificial intelligence, vol. 30, no. 1, 2016.
  • [5] C. Dwork, “Differential privacy: A survey of results,” in International conference on theory and applications of models of computation.   Springer, 2008, pp. 1–19.
  • [6] A. Tossou and C. Dimitrakakis, “Algorithms for differentially private multi-armed bandits,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 30, no. 1, 2016.
  • [7] ——, “Achieving privacy in the adversarial multi-armed bandit,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 31, no. 1, 2017.
  • [8] D. Basu, C. Dimitrakakis, and A. Tossou, “Differential privacy for multi-armed bandits: What is it and what is its cost?” arXiv preprint arXiv:1905.12298, 2019.
  • [9] N. Mishra and A. Thakurta, “(nearly) optimal differentially private stochastic multi-arm bandits,” in Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence, 2015, pp. 592–601.
  • [10] X. Zhou and J. Tan, “Local differential privacy for bayesian optimization,” arXiv preprint arXiv:2010.06709, 2020.
  • [11] B. Balle, M. Gomrokchi, and D. Precup, “Differentially private policy evaluation,” in International Conference on Machine Learning.   PMLR, 2016, pp. 2130–2138.
  • [12] G. Vietri, B. Balle, A. Krishnamurthy, and S. Wu, “Private reinforcement learning with pac and regret guarantees,” in International Conference on Machine Learning.   PMLR, 2020, pp. 9754–9764.
  • [13] E. Garcelon, V. Perchet, C. Pike-Burke, and M. Pirotta, “Local differentially private regret minimization in reinforcement learning,” arXiv preprint arXiv:2010.07778, 2020.
  • [14] Y. Abbasi-Yadkori and C. Szepesvári, “Regret bounds for the adaptive control of linear quadratic systems,” in Proceedings of the 24th Annual Conference on Learning Theory, 2011, pp. 1–26.
  • [15] I. Osband and B. Van Roy, “Model-based reinforcement learning and the eluder dimension,” in Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 1, 2014, pp. 1466–1474.
  • [16] S. R. Chowdhury and A. Gopalan, “Online learning in kernelized markov decision processes,” in The 22nd International Conference on Artificial Intelligence and Statistics.   PMLR, 2019, pp. 3197–3205.
  • [17] T. Wang and L. F. Yang, “Episodic linear quadratic regulators with low-rank transitions,” arXiv preprint arXiv:2011.01568, 2020.
  • [18] C. Jin, Z. Yang, Z. Wang, and M. I. Jordan, “Provably efficient reinforcement learning with linear function approximation,” in Conference on Learning Theory, 2020, pp. 2137–2143.
  • [19] D. Bertsekas, Dynamic programming and optimal control.   Athena Scientific, Belmont, MA, 3rd edition, 2004.
  • [20] C. Dwork and A. Roth, “The algorithmic foundations of differential privacy,” Foundations and Trends in Theoretical Computer Science, 2014.
  • [21] M. Kearns, M. Pai, A. Roth, and J. Ullman, “Mechanism design in large games: Incentives and privacy,” in Proceedings of the 5th conference on Innovations in theoretical computer science, 2014, pp. 403–410.
  • [22] M. Bun and T. Steinke, “Concentrated differential privacy: Simplifications, extensions, and lower bounds,” in Theory of Cryptography Conference.   Springer, 2016, pp. 635–658.
  • [23] Y. Abbasi-Yadkori, D. Pál, and C. Szepesvári, “Improved algorithms for linear stochastic bandits,” in Advances in Neural Information Processing Systems, 2011, pp. 2312–2320.
  • [24] T.-H. H. Chan, E. Shi, and D. Song, “Private and continual release of statistics,” ACM Transactions on Information and System Security (TISSEC), vol. 14, no. 3, pp. 1–24, 2011.
  • [25] J. Hsu, Z. Huang, A. Roth, T. Roughgarden, and Z. S. Wu, “Private matchings and allocations,” SIAM Journal on Computing, vol. 45, no. 6, pp. 1953–1984, 2016.
  • [26] R. Vershynin, High-dimensional probability: An introduction with applications in data science.   Cambridge university press, 2018, vol. 47.
  • [27] B. Laurent and P. Massart, “Adaptive estimation of a quadratic functional by model selection,” Annals of Statistics, pp. 1302–1338, 2000.