Near-Optimal Differentially Private Reinforcement Learning
Abstract
Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that no-regret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP) but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an $\epsilon$-JDP algorithm with a regret of $\widetilde{O}\big(\sqrt{SAH^2T}+S^2AH^3/\epsilon\big)$, which matches the information-theoretic lower bound of non-private learning for all choices of $\epsilon$, up to lower-order terms. In the above, $S$ and $A$ denote the number of states and actions, $H$ denotes the planning horizon, and $T$ is the total number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves privacy for free asymptotically as $T\rightarrow\infty$. Our techniques — which could be of independent interest — include privately releasing Bernstein-type exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case.
1 Introduction
Reinforcement Learning (RL) based algorithms are becoming central to many personalized services, including medical care (Raghu et al., 2017), autonomous driving (Sallab et al., 2017) and recommendation systems (Afsar et al., 2021). In these applications, the learning agent continuously improves its performance by learning from users’ private feedback and data. The private data from users, however, usually contain sensitive information. Take recommendation systems as an example: the agent makes recommendations (corresponding to actions in an MDP) according to users’ location, age, gender, etc. (corresponding to states in an MDP), and improves its performance based on users’ feedback (corresponding to rewards in an MDP). Unfortunately, it has been shown that unless privacy protections are in place, learning agents implicitly memorize information about individual training data points (Carlini et al., 2019), even if that information is irrelevant for learning (Brown et al., 2021), which makes RL agents vulnerable to various privacy attacks.
Differential privacy (DP) (Dwork et al., 2006) has become the standard notion of privacy. The output of a differentially private RL algorithm is indistinguishable from the output it would return in an alternative universe where any individual user is replaced, thereby preventing the aforementioned privacy risks. However, recent work (Shariff and Sheffet, 2018) shows that standard DP is incompatible with a sublinear regret bound for contextual bandits. Therefore, a relaxed variant of DP, Joint Differential Privacy (JDP) (Kearns et al., 2014), is considered instead. JDP ensures that the outputs provided to all other users do not leak much information about any specific user; this notion has been studied extensively in bandit problems (Shariff and Sheffet, 2018; Garcelon et al., 2022). In addition, another variant of DP, Local Differential Privacy (LDP) (Duchi et al., 2013), has drawn increasing attention due to its stronger privacy protection. LDP requires that each user's raw data be privatized before being sent to the agent, and LDP has been well studied in bandit settings (Basu et al., 2019; Zheng et al., 2020).
Compared to the large body of work on private bandits, existing work on private RL is sparser. Under the tabular MDP model, Vietri et al. (2020) first defined JDP and proposed PUCB, which enjoys a sublinear regret bound together with a JDP guarantee. Garcelon et al. (2021) introduced LDP under tabular MDPs and designed LDP-OBI, which enjoys a regret bound together with an LDP guarantee. Recently, Chowdhury and Zhou (2021) provided a general framework for this problem and derived the best-known regret bounds under both JDP and LDP. However, the best known regret bound under $\epsilon$-JDP, although the additional regret due to JDP is a lower order term, is still sub-optimal by a factor of $\sqrt{H}$ compared to the minimax optimal regret $\widetilde{O}(\sqrt{H^2SAT})$ (Azar et al., 2017) without constraints on DP. (Under the non-stationary MDP considered in this paper, the result in Azar et al. (2017) incurs an additional $\sqrt{H}$ dependence.) Therefore, if we run Algorithm 2 of Chowdhury and Zhou (2021), we not only pay a constant additional regret due to privacy, but also suffer from a multiplicative factor of $\sqrt{H}$ in the leading term. Motivated by this, we want to find out whether it is possible to design an algorithm with the optimal regret bound up to lower order terms while satisfying Joint DP.
| Algorithms | Regret under $\epsilon$-JDP | Regret under $\epsilon$-LDP | Type of bonus |
| --- | --- | --- | --- |
| PUCB (Vietri et al., 2020) | | NA | Hoeffding |
| LDP-OBI (Garcelon et al., 2021) | NA | | Hoeffding |
| Private-UCB-PO (Chowdhury and Zhou, 2021) | | | Hoeffding |
| Private-UCB-VI (Chowdhury and Zhou, 2021) | | | Hoeffding |
| DP-UCBVI (Our Algorithm 1) | $\widetilde{O}\big(\sqrt{SAH^2T}+S^2AH^3/\epsilon\big)$ | | Bernstein |
| Lower bound without DP (Jin et al., 2018) | $\Omega\big(\sqrt{H^2SAT}\big)$ | NA | NA |
Our contributions. In this paper, we answer the above question affirmatively by constructing a general algorithm for DP RL: Algorithm 1. Our contributions are threefold.
- A new upper confidence bound (UCB) based algorithm (DP-UCBVI, Algorithm 1) that can be combined with any Privatizer (for JDP or LDP). Under the constraint of $\epsilon$-JDP, DP-UCBVI achieves a regret of $\widetilde{O}\big(\sqrt{SAH^2T}+S^2AH^3/\epsilon\big)$, which matches the minimax lower bound up to lower order terms.
- We propose a novel privatization of visitation numbers that satisfies several nice properties (see Assumption 3.1 for details). More importantly, our approach is the first to privatize a Bernstein-type bonus, which helps tighten our regret bounds through the law of total variance.
- Under the $\epsilon$-LDP constraint, DP-UCBVI achieves a regret bound that improves upon the best known result (Chowdhury and Zhou, 2021).
1.1 Related work
Detailed comparisons with existing work on differentially private RL under tabular MDPs (Vietri et al., 2020; Garcelon et al., 2021; Chowdhury and Zhou, 2021) are given in Table 1, while we leave further discussion of results on regret minimization to Appendix A. Notably, all existing algorithms privatize Hoeffding-type bonuses and suffer from sub-optimal regret bounds. In comparison, we privatize a Bernstein-type bonus and the non-private part of our regret matches the minimax lower bound in Jin et al. (2018). (As shown in Table 1, the regret bounds of all DP-RL algorithms contain two parts: one results from running the non-private RL algorithm, while the other is the additional cost due to the DP guarantee. Throughout the paper, we use “non-private part” to denote the regret from running the non-private RL algorithm.)
Generally speaking, to achieve a DP guarantee in RL, a common approach is to add appropriate noise to existing non-private algorithms and derive tight regret bounds. We discuss private algorithms under tabular MDPs below and leave discussions of algorithms under other settings to Appendix A. Under the constraint of JDP, Vietri et al. (2020) designed PUCB by privatizing UBEV (Dann et al., 2017), and Private-UCB-VI (Chowdhury and Zhou, 2021) results from privatizing UCBVI (with bonus 1) (Azar et al., 2017). Under the constraint of LDP, Garcelon et al. (2021) designed LDP-OBI based on UCRL2 (Jaksch et al., 2010). However, all these works privatize Hoeffding-type bonuses, which are easier to handle but lead to sub-optimal regret bounds. In contrast, we directly build upon the non-private algorithm with the minimax optimal regret bound, UCBVI with bonus 2 (Azar et al., 2017), where the privatization of the Bernstein-type bonus requires more advanced techniques.
A concurrent work (Qiao and Wang, 2022b) focused on the offline RL setting and derived a private version of APVI (Yin and Wang, 2021). Their algorithm achieves a tight sub-optimality bound for the output policy through privatization of a Bernstein-type pessimism term (pessimism is the counterpart of the bonus in offline RL, which aims to discourage the choice of pairs with large uncertainty). However, their analysis relies on the assumption that the visitation numbers of all (state, action) pairs are larger than some threshold. We remove the need for such an assumption via an improved privatization of visitation numbers. More importantly, offline RL can be viewed as a single step of online RL; therefore, the privatization of a Bernstein-type bonus in the online setting is more technically demanding. Finally, our approach realizes the future direction stated in the conclusion of Qiao and Wang (2022b).
1.2 A remark on technical novelty.
The general idea behind the previous differentially private algorithms under tabular MDPs (Vietri et al., 2020; Garcelon et al., 2021; Chowdhury and Zhou, 2021) is to add noise to the accumulative visitation numbers and construct a private bonus based on the privatized visitation numbers. Since a Hoeffding-type bonus only uses the information of the visitation numbers (e.g., in Azar et al. (2017) the Hoeffding-type bonus scales as $H\sqrt{L/N^k_h(s,a)}$ for a logarithmic factor $L$), the construction of the private bonus is straightforward. We can simply replace the original counts with the private counts and add an additional term to account for the difference between the two bonuses. Next, combining the construction of private bonuses with the uniform upper bound on the gap between private and true counts, we can upper bound the private bonus by its non-private counterpart plus some additional lower order term. Therefore, the proof outline of the original non-private algorithms also applies to their private counterparts.
Unfortunately, although the idea of privatizing UCBVI with bonus 2 (Bernstein-type) (Azar et al., 2017) is straightforward, the generalization of the previous approaches is technically non-trivial. Since bonus 2 in Azar et al. (2017) includes an empirical-variance term of the form $\mathrm{Var}_{\widehat{P}_h(\cdot|s,a)}(V_{h+1})$, the first technical challenge is to replace the empirical transition kernel with a private estimate. However, the private transition kernel estimates constructed in previous works are not valid probability distributions. In this paper, for both JDP and LDP, we propose a novel privatization of visitation numbers such that the private transition kernel estimates are valid probability distributions and, meanwhile, the upper bound on the gap between private and true counts is of the same scale as in previous approaches. With the private transition kernel estimates, we can replace the empirical variance with the variance under the private estimates of the value function calculated from value iteration with private statistics. The second challenge is then to bound the difference between these two variances and retain optimism. We overcome this challenge via concentration inequalities. Briefly speaking, we add an additional term (using private statistics) to compensate for the difference between the two bonuses and recover the proof of optimism. With all these techniques, we derive our regret bound using techniques such as error decomposition and error propagation originating from Azar et al. (2017).
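To see why a variance-dependent (Bernstein-type) bonus can save a $\sqrt{H}$ factor, the following standard identity (the law of total variance, stated here as an illustrative sketch under the simplifying assumption of deterministic rewards in $[0,1]$) shows that the per-step variances of the value function, summed over one episode, are of order $H^2$ rather than the naive $H^3$:

```latex
% Law of total variance (standard identity; deterministic rewards assumed)
\mathbb{E}_{\pi}\!\left[\sum_{h=1}^{H}
  \mathrm{Var}_{s'\sim P_h(\cdot\mid s_h,a_h)}\big(V^{\pi}_{h+1}(s')\big)\right]
\;=\;
\mathrm{Var}_{\pi}\!\left(\sum_{h=1}^{H} r_h(s_h,a_h)\right)
\;\le\; H^{2}.
```

Hoeffding-type bonuses effectively replace each per-step variance by its worst-case value $H^2$, which is what introduces the extra $\sqrt{H}$ in the regret.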
2 Notations and Problem Setup
Throughout the paper, for $n\in\mathbb{N}$, $[n]=\{1,2,\ldots,n\}$. For any set $\mathcal{X}$, $\Delta(\mathcal{X})$ denotes the set of all probability distributions over $\mathcal{X}$. Besides, we use the standard notations $O$ and $\Omega$ to suppress constants, while $\widetilde{O}$ and $\widetilde{\Omega}$ also absorb logarithmic factors.
Below we present the definition of episodic Markov Decision Processes and introduce differential privacy in reinforcement learning.
2.1 Markov decision processes and regret
We consider finite-horizon episodic Markov Decision Processes (MDPs) with non-stationary transitions, denoted by a tuple $(\mathcal{S},\mathcal{A},H,P,r,d_1)$ (Sutton and Barto, 1998), where $\mathcal{S}$ is the state space with $|\mathcal{S}|=S$, $\mathcal{A}$ is the action space with $|\mathcal{A}|=A$ and $H$ is the horizon. The non-stationary transition kernel has the form $P_h:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to[0,1]$, with $P_h(s'\mid s,a)$ representing the probability of transitioning from state $s$ under action $a$ to the next state $s'$ at time step $h$. In addition, $r_h(s,a)$ denotes the corresponding reward distribution; we overload the notation so that $r_h(s,a)$ also denotes the expected (immediate) reward. Besides, $d_1$ is the initial state distribution. A policy can be seen as a series of mappings $\pi=(\pi_1,\ldots,\pi_H)$, where each $\pi_h$ maps a state to a probability distribution over actions, i.e. $\pi_h:\mathcal{S}\to\Delta(\mathcal{A})$ for all $h\in[H]$. A random trajectory $(s_1,a_1,r_1,\ldots,s_H,a_H,r_H,s_{H+1})$ is generated by the following rule: $s_1\sim d_1$, $a_h\sim\pi_h(\cdot\mid s_h)$, $r_h\sim r_h(s_h,a_h)$, $s_{h+1}\sim P_h(\cdot\mid s_h,a_h)$ for all $h\in[H]$.
Given a policy $\pi$ and any $h\in[H]$, the value function $V^\pi_h(\cdot)$ and the Q-value function $Q^\pi_h(\cdot,\cdot)$ are defined as $V^\pi_h(s)=\mathbb{E}_\pi\big[\sum_{t=h}^{H}r_t\mid s_h=s\big]$ and $Q^\pi_h(s,a)=\mathbb{E}_\pi\big[\sum_{t=h}^{H}r_t\mid s_h=s,a_h=a\big]$. The optimal policy $\pi^\star$ maximizes $V^\pi_h(s)$ for all $(s,h)$ simultaneously, and we denote the value function and Q-value function with respect to $\pi^\star$ by $V^\star_h$ and $Q^\star_h$. The Bellman (optimality) equations then follow: $Q^\pi_h(s,a)=r_h(s,a)+\mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}[V^\pi_{h+1}(s')]$, $V^\pi_h(s)=\mathbb{E}_{a\sim\pi_h(\cdot\mid s)}[Q^\pi_h(s,a)]$, and $Q^\star_h(s,a)=r_h(s,a)+\mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}[V^\star_{h+1}(s')]$, $V^\star_h(s)=\max_{a}Q^\star_h(s,a)$, with $V^\pi_{H+1}=V^\star_{H+1}=0$.
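For concreteness, the following is a minimal non-private sketch of backward value iteration implementing the Bellman optimality equations above on a known model; the array layout and the names `P` and `r` are illustrative assumptions rather than the paper's notation.

```python
import numpy as np

def value_iteration(P, r, H):
    """Finite-horizon (non-stationary) value iteration.

    P: array of shape (H, S, A, S); P[h, s, a, s'] = transition probability.
    r: array of shape (H, S, A); expected immediate reward in [0, 1].
    Returns Q (H, S, A), V (H+1, S) and the greedy (optimal) policy (H, S).
    """
    H_, S, A, _ = P.shape
    assert H_ == H
    Q = np.zeros((H, S, A))
    V = np.zeros((H + 1, S))          # V[H] = 0 by convention
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):    # backward induction over steps
        Q[h] = r[h] + P[h] @ V[h + 1] # Bellman optimality backup
        pi[h] = Q[h].argmax(axis=1)
        V[h] = Q[h].max(axis=1)
    return Q, V, pi
```

DP-UCBVI in Section 3 follows the same backward recursion, but with private model estimates and an added exploration bonus.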
We measure the performance of online reinforcement learning algorithms by the regret. The regret of an algorithm over $K$ episodes is defined as
$$\mathrm{Regret}(K)=\sum_{k=1}^{K}\Big(V^\star_1(s^k_1)-V^{\pi^k}_1(s^k_1)\Big),$$
where $s^k_1$ is the initial state and $\pi^k$ is the policy deployed at episode $k$. Here $K$ is the number of episodes the agent plans to play, and the total number of steps is $T=KH$.
2.2 Differential privacy under episodic RL
Under the episodic RL setting, each trajectory represents one specific user. We first consider the following RL protocol: during the $h$-th step of the $k$-th episode, user $u_k$ sends her state to the agent $\mathcal{M}$, the agent sends back an action, and finally the user sends her reward to the agent. Formally, we denote a sequence of $K$ users who participate in the above RL protocol by $U=(u_1,\ldots,u_K)$. Following the definition in Vietri et al. (2020), each user can be seen as a tree of depth $H$ encoding the state and reward responses she would give to all possible sequences of actions from the agent. We let $\mathcal{M}(U)$ denote the whole sequence of actions chosen by the agent $\mathcal{M}$. An ideal privacy-preserving agent would guarantee that $\mathcal{M}(U)$ and all users but $u_k$ together will not reveal much information about user $u_k$. We formalize such privacy preservation through an adaptation of differential privacy (Dwork et al., 2006).
Definition 2.1 (Differential Privacy (DP)).
For any $\epsilon\ge 0$ and $\delta\ge 0$, a mechanism $\mathcal{M}$ is $(\epsilon,\delta)$-differentially private if for any possible user sequences $U$ and $U'$ differing on a single user and any subset $E$ of possible outputs,
$$\mathbb{P}\big[\mathcal{M}(U)\in E\big]\le e^{\epsilon}\,\mathbb{P}\big[\mathcal{M}(U')\in E\big]+\delta.$$
If $\delta=0$, we say that $\mathcal{M}$ is $\epsilon$-differentially private ($\epsilon$-DP).
However, although the recommendations to other users will not compromise the privacy of user $u_k$ significantly, it is impractical to privately recommend actions to user $u_k$ herself while protecting the information in her own states and rewards. Therefore, the notion of DP is relaxed to Joint Differential Privacy (JDP) (Kearns et al., 2014), which requires that, for every user $u_k$, the recommendations to all users but $u_k$ will not reveal much information about $u_k$. JDP is weaker than DP, but it still provides strong privacy protection since it protects a specific user from any possible collusion of all other users against her. Formally, the definition of JDP is given below.
Definition 2.2 (Joint Differential Privacy (JDP)).
For any $\epsilon\ge 0$, a mechanism $\mathcal{M}$ is $\epsilon$-joint differentially private if for any $k\in[K]$, any user sequences $U$, $U'$ differing on the $k$-th user and any subset $E$ of action recommendations to all users but the $k$-th,
$$\mathbb{P}\big[\mathcal{M}_{-k}(U)\in E\big]\le e^{\epsilon}\,\mathbb{P}\big[\mathcal{M}_{-k}(U')\in E\big],$$
where $\mathcal{M}_{-k}(U)\in E$ means that the sequence of actions recommended to all users but $u_k$ belongs to the set $E$.
JDP ensures that even if an adversary can observe the recommended actions to all users but $u_k$, it is impossible to identify the trajectory of $u_k$ accurately. JDP was first defined and analyzed for RL by Vietri et al. (2020).
Although JDP provides strong privacy protection, the agent can still observe the raw trajectories from users. Under some circumstances, however, users are not even willing to share their original data with the agent. This motivates a stronger notion of privacy called Local Differential Privacy (LDP) (Duchi et al., 2013). Since under LDP the agent is not allowed to directly observe the states of users, we consider the following RL protocol for LDP: during the $k$-th episode, the agent sends the policy $\pi^k$ to user $u_k$; after deploying $\pi^k$ and obtaining her trajectory, user $u_k$ privatizes her trajectory and finally sends the privatized version to the agent. We denote the privacy mechanism on the user's side by $\widetilde{\mathcal{M}}$ and define local differential privacy formally below.
Definition 2.3 (Local Differential Privacy (LDP)).
For any $\epsilon\ge 0$, a mechanism $\widetilde{\mathcal{M}}$ is $\epsilon$-local differentially private if for any possible trajectories $X$, $X'$ and any possible set $E$ of outputs of $\widetilde{\mathcal{M}}$,
$$\mathbb{P}\big[\widetilde{\mathcal{M}}(X)\in E\big]\le e^{\epsilon}\,\mathbb{P}\big[\widetilde{\mathcal{M}}(X')\in E\big].$$
Local DP ensures that even if an adversary observes the whole reply from user $u_k$, it is still statistically hard to identify her trajectory. LDP was first defined and analyzed for RL by Garcelon et al. (2021).
3 Algorithm
In this section, we propose DP-UCBVI (Algorithm 1), which takes a Privatizer as input, where the Privatizer can be either Central (for JDP) or Local (for LDP). We provide a regret analysis for all Privatizers satisfying the following Assumption 3.1, which naturally implies regret bounds under both Joint DP and Local DP.
We begin with the following definition of counts. Let $N^k_h(s,a)$ denote the number of visits to $(s,a)$ at step $h$ before the $k$-th episode. Similarly, $N^k_h(s,a,s')$ and $R^k_h(s,a)$ denote the number of visits to $(s,a,s')$ and the accumulated reward at $(s,a)$, respectively, before the $k$-th episode. In non-private RL, such counts are sufficient for estimating the transition kernel and the reward function and for deciding the exploration policy, as in Azar et al. (2017). However, these counts are derived from the raw trajectories of the users, which could contain sensitive information. Therefore, under the constraint of privacy, we can only use these counts in a privacy-preserving way, i.e. we use the private counts returned by the Privatizer. We make Assumption 3.1 below, which says that, with high probability, the private counts are close to the real ones; this assumption will be justified by our Privatizers in Section 5.
Assumption 3.1 (Private counts).
We assume that for any privacy budget $\epsilon>0$ and failure probability $\delta\in(0,1)$, the private counts returned by the Privatizer satisfy the following: for some error bound $E_{\epsilon,\delta}>0$, with probability at least $1-\delta$, uniformly over all $(s,a,s',h,k)$:
(1) , and .
(2) . . Also, we let .
Under Assumption 3.1, for all $(s,a,s',h,k)$, we define the private estimates of the transition kernel and the reward function as follows.
(1) |
Remark 3.2.
Different from the private empirical transition kernels in Vietri et al. (2020); Garcelon et al. (2021); Chowdhury and Zhou (2021), Assumption 3.1 implies that our estimated transition kernel is a valid probability distribution; this property results from our construction of the Privatizer. We truncate the empirical reward function so that it stays in $[0,1]$ while still preserving privacy.
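As a concrete illustration, here is a minimal sketch of how private estimates of this kind could be formed from the private counts; the specific estimator form (ratios of private counts, with the empirical reward clipped to $[0,1]$) is an assumption based on Remark 3.2 rather than a verbatim transcription of Equation (1).

```python
import numpy as np

def private_estimates(N_sas, N_sa, R_sa):
    """Form private empirical transition/reward estimates from private counts.

    N_sas: (H, S, A, S) private counts of (s, a, s') visits.
    N_sa:  (H, S, A)    private counts of (s, a) visits, assumed to equal
                        N_sas.sum(axis=-1), so each row of P_hat is a valid
                        probability distribution whenever the count is positive.
    R_sa:  (H, S, A)    private accumulated rewards.
    """
    denom = np.maximum(N_sa, 1)                 # avoid division by zero
    P_hat = N_sas / denom[..., None]            # private transition estimate
    r_hat = np.clip(R_sa / denom, 0.0, 1.0)     # truncated private reward estimate
    return P_hat, r_hat
```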
Algorithmic design. Similar to non-private algorithms (Azar et al., 2017), DP-UCBVI (Algorithm 1) follows the procedure of optimistic value iteration. More specifically, in episode $k$, we perform value iteration based on the private estimates of the transition kernel and reward, together with a private bonus term, to derive private Q-value functions. Next, the greedy policy $\pi^k$ w.r.t. the private Q-value functions is chosen and we collect one trajectory by running $\pi^k$. Finally, the Privatizer translates the non-private counts into private ones for the next episode. We highlight that, different from all previous works on private RL, our bonus is variance-dependent. According to the law of total variance, a variance-dependent bonus can effectively save a factor of $\sqrt{H}$ in the regret bound. Intuitively, the first term of the bonus aims to approximate the variance of the value function under the true transition kernel, the last term accounts for the difference between these two variances, and the third term is the additional bonus due to differential privacy.
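Below is a minimal sketch of the optimistic value-iteration step just described. The bonus shown has the structure described in the text (an empirical-variance term, a lower-order term, a privacy term and a variance-correction term), but its coefficients, the confidence factor `L` and the privacy error level `E_priv` are placeholders assumed for illustration, not the exact quantities used in Algorithm 1.

```python
import numpy as np

def private_optimistic_vi(P_hat, r_hat, N_sa, H, L, E_priv):
    """One episode of optimistic value iteration with a Bernstein-style bonus.

    P_hat: (H, S, A, S) private transition estimates (rows sum to 1).
    r_hat: (H, S, A)    private reward estimates in [0, 1].
    N_sa:  (H, S, A)    private visitation counts.
    L:      a logarithmic confidence factor, e.g. log(S*A*H*K/delta).
    E_priv: high-probability bound on the count perturbation (Assumption 3.1).
    """
    _, S, A, _ = P_hat.shape
    V = np.zeros((H + 1, S))
    Q = np.zeros((H, S, A))
    pi = np.zeros((H, S), dtype=int)
    n = np.maximum(N_sa, 1)
    for h in range(H - 1, -1, -1):
        EV = P_hat[h] @ V[h + 1]                       # (S, A) expected next value
        EV2 = P_hat[h] @ (V[h + 1] ** 2)
        var = np.maximum(EV2 - EV ** 2, 0.0)           # empirical variance of V under P_hat
        bonus = (np.sqrt(2 * var * L / n[h])           # Bernstein-type main term
                 + H * L / n[h]                        # lower-order term
                 + H * S * E_priv * L / n[h]           # additional term due to privacy (placeholder scale)
                 + np.sqrt(H ** 2 * S * L / n[h]))     # correction for variance mismatch (placeholder)
        Q[h] = np.minimum(r_hat[h] + EV + bonus, H)    # clip at H to keep values bounded
        pi[h] = Q[h].argmax(axis=1)
        V[h] = Q[h].max(axis=1)
    return Q, V, pi
```

Running this backward pass in each episode, acting greedily with respect to the returned Q, and then refreshing the private counts reproduces the loop described above.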
4 Main results
In this section, we present our main results that formalize the algorithmic ideas discussed in previous sections. We first state a general result based on Assumption 3.1, which can be combined with any Privatizer. The proof of Theorem 4.1 is sketched in Section 6 with details in Appendix C.
Theorem 4.1.
Under Assumption 3.1, the best known regret bound is that of Theorem 4.2 of Chowdhury and Zhou (2021). As a comparison, in our regret bound, the term parameterized by the privacy loss remains the same, while the leading term is improved by a factor of $\sqrt{H}$ to $\widetilde{O}(\sqrt{SAH^2T})$. More importantly, when $T$ is sufficiently large, our result nearly matches the lower bound in Jin et al. (2018), hence it is information-theoretically optimal up to logarithmic factors.
5 Choice of Privatizers
In this section, we design Privatizers that satisfy Assumption 3.1 and different DP constraints (JDP or LDP). All the proofs in this section are deferred to Appendix D.
5.1 Central Privatizer for Joint DP
The Central Privatizer protects the information of every single user by privatizing all the counter streams $\{N^k_h(s,a)\}_{k}$, $\{N^k_h(s,a,s')\}_{k}$ and $\{R^k_h(s,a)\}_{k}$ using the Binary Mechanism (Chan et al., 2011), which privately releases the running sums of a data stream (Zhao et al., 2022). More specifically, for each $(s,a,s',h)$, the count $N^k_h(s,a,s')$ is a partial sum of a per-episode indicator stream. The Binary Mechanism works as follows: for each episode $k$, after observing the $k$-th element of the stream, the mechanism outputs a private version of the current partial sum while ensuring differential privacy. (For more details about the Binary Mechanism, please refer to Chan et al. (2011) or Kairouz et al. (2021).) Given a privacy budget $\epsilon$, we construct the Central Privatizer as below; a minimal sketch of such a tree-based counter is given after the list.
(1) For all $(s,a,s',h)$, we privatize the streams $\{N^k_h(s,a,s')\}_k$ and $\{N^k_h(s,a)\}_k$ (which are summations of bounded streams) by applying the Binary Mechanism (Algorithm 2 in Chan et al. (2011)) with an appropriately allocated share of the privacy budget. We denote the outputs of the Binary Mechanism as the noisy counts.

(2) The private counts are then obtained through the post-processing procedure in Section 5.1.1 applied to the noisy counts.

(3) For the counters of accumulated reward, for all $(s,a,h)$, we apply the same Binary Mechanism to privatize $R^k_h(s,a)$ and obtain the noisy reward counters.
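For intuition, here is a minimal sketch of a tree-based (binary) counter that privately releases all prefix sums of a bounded stream, in the spirit of Chan et al. (2011); the Laplace noise calibration (scale $\lceil\log_2 T\rceil/\epsilon$ per tree node) is a standard textbook choice and an assumption here, not the exact parameters of the Central Privatizer.

```python
import numpy as np

class BinaryMechanism:
    """Tree-based counter: releases private prefix sums of a stream in [0, 1].

    Each prefix sum is covered by at most ceil(log2(T)) dyadic intervals, so
    adding Laplace noise of scale ceil(log2(T))/eps to each interval's partial
    sum yields eps-DP for the whole stream via composition over tree levels.
    """

    def __init__(self, T, eps, rng=None):
        self.T = T
        self.eps = eps
        self.levels = int(np.ceil(np.log2(max(T, 2))))
        self.scale = self.levels / eps            # Laplace scale per node (assumed calibration)
        self.alpha = np.zeros(self.levels + 1)    # exact partial sums per tree level
        self.noisy = np.zeros(self.levels + 1)    # noisy partial sums per tree level
        self.t = 0
        self.rng = rng or np.random.default_rng()

    def step(self, x):
        """Feed the next stream element x in [0, 1]; return the private prefix sum."""
        self.t += 1
        # i = index of the lowest set bit of t: lower nodes are merged into node i
        i = 0
        while (self.t >> i) & 1 == 0:
            i += 1
        self.alpha[i] = self.alpha[:i].sum() + x
        self.alpha[:i] = 0.0
        self.noisy[i] = self.alpha[i] + self.rng.laplace(scale=self.scale)
        self.noisy[:i] = 0.0
        # the prefix sum is the sum of noisy nodes on the binary decomposition of t
        return sum(self.noisy[j] for j in range(self.levels + 1) if (self.t >> j) & 1)
```

In step (1), the Central Privatizer would maintain one such counter per $(s,a,h)$ and per $(s,a,s',h)$, feeding in the corresponding visit indicator after each episode.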
We summarize the properties of the Central Privatizer below.
Lemma 5.1.
For any $\epsilon>0$ and $\delta\in(0,1)$, the Central Privatizer satisfies $\epsilon$-JDP and Assumption 3.1, with an $E_{\epsilon,\delta}$ that depends only polylogarithmically on $K$.
Theorem 5.2 (Regret under JDP).
For any $\epsilon>0$ and $\delta\in(0,1)$, running DP-UCBVI (Algorithm 1) with the Central Privatizer as input, with probability at least $1-\delta$, it holds that
$$\mathrm{Regret}(K)=\widetilde{O}\Big(\sqrt{SAH^2T}+\frac{S^2AH^3}{\epsilon}\Big). \qquad (3)$$
Under the most prevalent regime where the privacy budget $\epsilon$ is a constant, the additional regret due to JDP is a lower order term. The main term of Theorem 5.2 improves the best known result (Corollary 5.2 of Chowdhury and Zhou (2021)) by a factor of $\sqrt{H}$ and matches the minimax lower bound without constraints on DP (Jin et al., 2018).
5.1.1 A post-processing step
During the $k$-th episode, given the noisy counts (for all $(s,a,s',h)$), we construct the following private counts that satisfy Assumption 3.1. The private counts $\{\widetilde{N}^k_h(s,a,s')\}_{s'}$ are chosen by solving, for all $(s,a,h)$, the optimization problem below:
(4) | ||||
Finally, for all $(s,a,s',h)$, we add the following terms so that, with high probability, the private counts will never underestimate the true ones.
(5) |
Remark 5.3.
The optimization problem (4) can be reformulated as:
(6) |
Note that (6) is a linear program with $O(S)$ variables and $O(S)$ linear constraints. It can be solved efficiently by the simplex method (Ficken, 2015) or other provably efficient algorithms (Nemhauser and Wolsey, 1988). Therefore, since during the whole process we only solve $O(HSAK)$ such linear programs, our Algorithm 1 is computationally efficient.
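A minimal sketch of how this post-processing step could be implemented with an off-the-shelf LP solver. The concrete objective and constraints below (minimize the maximum deviation from the noisy $(s,a,s')$ counts, subject to non-negativity and to the counts summing to the noisy $(s,a)$ count) are our assumed reading of problems (4) and (6) based on the surrounding discussion, not a verbatim transcription.

```python
import numpy as np
from scipy.optimize import linprog

def postprocess_counts(noisy_sas, noisy_sa):
    """Project noisy (s, a, s') counts, for one fixed (s, a, h), onto
    consistent non-negative counts.

    Assumed formulation (a sketch of problem (6)): minimize t subject to
        |x[s'] - noisy_sas[s']| <= t  for all s',
        x[s'] >= 0,
        sum_{s'} x[s'] = max(noisy_sa, 0).
    Variables are (x[0..S-1], t), so the LP has S + 1 variables and O(S) constraints.
    """
    S = len(noisy_sas)
    target = max(float(noisy_sa), 0.0)
    c = np.zeros(S + 1)
    c[-1] = 1.0                                    # minimize the slack t
    # encode x[s'] - t <= noisy_sas[s'] and -x[s'] - t <= -noisy_sas[s']
    A_ub = np.zeros((2 * S, S + 1))
    b_ub = np.zeros(2 * S)
    A_ub[:S, :S] = np.eye(S)
    A_ub[:S, -1] = -1.0
    b_ub[:S] = noisy_sas
    A_ub[S:, :S] = -np.eye(S)
    A_ub[S:, -1] = -1.0
    b_ub[S:] = -np.asarray(noisy_sas)
    A_eq = np.ones((1, S + 1))
    A_eq[0, -1] = 0.0                              # t does not enter the sum constraint
    b_eq = np.array([target])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * (S + 1))
    return res.x[:S]
```

One such LP would be solved per $(s,a,h)$ in each episode, which is consistent with the computational-efficiency claim above.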
The properties of the private counts are summarized below.
Lemma 5.4.
Remark 5.5.
Compared to the concurrent work (Qiao and Wang, 2022b), our private counts have the additional guarantee of never underestimating the true values, which is a desirable property for the analysis in Appendix B. In comparison, the analysis in Qiao and Wang (2022b) heavily relies on the assumption that the visitation numbers are larger than some threshold so that the scale of the noise is negligible.
5.2 Local Privatizer for Local DP
For each episode $k$, the Local Privatizer privatizes each single trajectory by perturbing the statistics calculated from that trajectory. For the visitation of (state, action) pairs, the per-trajectory visitation counts have bounded sensitivity, so their perturbed version (with appropriately scaled Laplace noise) satisfies $\epsilon$-LDP. In addition, similar perturbations of the $(s,a,s')$ counts and the rewards lead to the same result. As a result, we construct the Local Privatizer as below; a minimal sketch is given after the list.
(1) For all $(s,a,s',h)$, we perturb the per-trajectory visitation statistics by adding independent Laplace noise.

(2) The noisy cumulative counts are computed by summing the perturbed per-trajectory statistics over episodes. The private counts are then obtained through the post-processing procedure in Section 5.1.1 applied to these noisy counts.

(3) We perturb the trajectory-wise rewards by adding independent Laplace noise, and the accumulated reward statistic is computed by summing the perturbed rewards over episodes.
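A minimal sketch of the per-trajectory perturbation in steps (1)-(3); the Laplace scale (proportional to $H/\epsilon$, reflecting that a single trajectory changes each count vector by at most $O(H)$ in $\ell_1$ norm, with the budget split across the three statistics) is an assumed calibration rather than the exact constants of the Local Privatizer.

```python
import numpy as np

def local_privatize_trajectory(traj, S, A, H, eps, rng=None):
    """Perturb the statistics of one trajectory on the user's side.

    traj: list of H tuples (s, a, r, s_next) with r in [0, 1].
    Returns noisy per-trajectory (s, a, s') counts, (s, a) counts and rewards,
    each perturbed with independent Laplace noise (assumed calibration: the
    L1 sensitivity of each statistic is at most 2H, and the budget eps is
    split evenly over the three statistics).
    """
    rng = rng or np.random.default_rng()
    n_sas = np.zeros((H, S, A, S))
    n_sa = np.zeros((H, S, A))
    r_sa = np.zeros((H, S, A))
    for h, (s, a, r, s_next) in enumerate(traj):
        n_sas[h, s, a, s_next] += 1
        n_sa[h, s, a] += 1
        r_sa[h, s, a] += r
    scale = 6.0 * H / eps            # Laplace scale = (sensitivity 2H) / (eps / 3)
    return (n_sas + rng.laplace(scale=scale, size=n_sas.shape),
            n_sa + rng.laplace(scale=scale, size=n_sa.shape),
            r_sa + rng.laplace(scale=scale, size=r_sa.shape))
```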
The properties of our Local Privatizer are summarized below.
Lemma 5.6.
For any $\epsilon>0$ and $\delta\in(0,1)$, the Local Privatizer satisfies $\epsilon$-LDP and Assumption 3.1; in contrast to the JDP case, the corresponding $E_{\epsilon,\delta}$ grows with $\sqrt{K}$.
Theorem 5.7 (Regret under LDP).
For any $\epsilon>0$ and $\delta\in(0,1)$, running DP-UCBVI (Algorithm 1) with the Local Privatizer as input, with probability at least $1-\delta$, it holds that:
(7) |
5.3 More discussions
Step (1) of our Privatizers is similar to previous works (Vietri et al., 2020; Garcelon et al., 2021; Chowdhury and Zhou, 2021). However, different from their approaches (which directly use the noisy counts as private counts), we apply the post-processing step in Section 5.1.1, which ensures that the private transition estimate is a valid probability distribution while the error bound is only worse by a constant factor. Therefore, we can apply a Bernstein-type bonus to achieve the optimal non-private part in our regret bound.
6 Proof Sketch
In this section, we provide a proof overview for Theorem 4.1, which implies the results under JDP (Theorem 5.2) and LDP (Theorem 5.7). Recall that $N^k_h(s,a)$ and $N^k_h(s,a,s')$ are the real visitation numbers, while their private counterparts satisfy Assumption 3.1. Other notations are defined in Algorithm 1. The statement “with high probability” means that the summation of all failure probabilities is bounded by $\delta$.
Properties of the private estimates. Due to concentration inequalities and Assumption 3.1, we provide high-probability bounds for the private transition and reward estimates in Appendix B. In addition, we bound the key term below.
Lemma 6.1 (Informal version of Lemma B.6).
With high probability, for all $(s,a,h,k)$, it holds that:
(8) |
With these concentrations, we are ready to present our proof sketch. Since we apply a Bernstein-type bonus, the proof of optimism is not straightforward. We prove our regret upper bound through induction, which is shown below.
Induction over episodes. Our induction is over $k\in[K]$:

(1) Given that optimism holds for all episodes up to $k$ (i.e., the private value functions upper bound the optimal ones), we prove a high-probability regret bound for the first $k$ episodes.

(2) Given such a regret bound for the first $k$ episodes, we prove that optimism also holds at episode $k+1$, for all states and steps.

Suppose the above induction holds; then point (1) holds for all $k\in[K]$ and therefore,
(9)
Below we discuss the proofs of (1) and (2) separately.
Proof of the regret bound (point (1)). We only need to prove the first upper bound, as the second one follows similarly. Using the standard technique of layer-wise error decomposition (details in Appendix C.3) and ignoring lower-order terms (summations of martingale differences), we only need to bound the cumulative private bonus, which consists of four terms according to the definition of the bonus. First of all, the second and fourth terms are dominated by the first and third terms. Next, for the third term, we have
(10) |
Now we analyze the first term (which is also the main term), namely the summation of the variance-based bonuses. It holds that
(11) |
We bound (a) below (details are deferred to Appendix C.4).
(12) |
Therefore, the main term in the regret bound scales as $\widetilde{O}(\sqrt{SAH^2T})$. The details about lower-order terms are deferred to Appendices C.3, C.4 and C.5.
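A schematic version of the standard argument behind this scaling (a sketch with constants and logarithmic factors suppressed; here $\mathrm{Var}^k_h$ and $N^k_h$ denote the per-step value variance and the visitation count along the trajectory of episode $k$): the Cauchy-Schwarz inequality splits the summed variance bonus into a pigeonhole term and a total-variance term, and the law of total variance bounds the latter by $O(H^2)$ per episode.

```latex
% Sketch: why the main term scales as sqrt(S A H^2 T) (constants and logs omitted)
\sum_{k=1}^{K}\sum_{h=1}^{H}\sqrt{\frac{\mathrm{Var}^{k}_{h}}{N^{k}_{h}}}
\;\le\;
\underbrace{\sqrt{\sum_{k,h}\frac{1}{N^{k}_{h}}}}_{\lesssim\,\sqrt{SAH\log T}}
\cdot
\underbrace{\sqrt{\sum_{k,h}\mathrm{Var}^{k}_{h}}}_{\lesssim\,\sqrt{KH^{2}}}
\;\lesssim\;
\sqrt{SAH^{2}\,T\log T}.
```

The first factor follows from the pigeonhole argument over visitation counts and the second from the law of total variance over typical episodes.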
Proof of optimism: (2). To prove optimism, we only need
It is clear that can be bounded by the second term and a portion of the third term of . Due to Lemma 6.1, can be bounded by , which can be further bounded by plus a portion of the third term of . Finally, together with the upper bound of (derived from condition of (2) and optimism), the last term of compensates for the difference of (first term of ) and . More details about optimism are deferred to Appendix C.6.
7 Simulations
[Figure 1: Cumulative regret of DP-UCBVI under JDP and LDP constraints, compared with non-private UCBVI, on the RiverSwim environment.]
In this section, we run simulations to show the performance of DP-UCBVI (Algorithm 1). We run simulations on a standard benchmark for tabular MDPs, RiverSwim (Strehl and Littman, 2008); Chowdhury and Zhou (2021) ran simulations on the same environment. Briefly speaking, the environment consists of six consecutive states and two actions, “left” and “right”. Choosing “left”, the agent tends to move towards the left side, and vice versa. The agent starts from the left side and tries to reach the right side, where she can get a higher reward. For more details and an illustration of this setting, please refer to Chowdhury and Zhou (2021).
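For reference, a minimal sketch of a RiverSwim-style environment consistent with the description above; the specific transition probabilities and reward values are common choices from the literature and are assumptions here, not necessarily the exact parameters used in our experiments.

```python
import numpy as np

def make_riverswim(S=6, H=20):
    """Build (P, r) arrays for a RiverSwim-style MDP with actions 0=left, 1=right.

    Assumed parameters: going right succeeds with prob. 0.6, stays put with 0.35,
    slips back with 0.05; going left is deterministic. Reward 0.005 at the
    leftmost state under "left" and 1.0 at the rightmost state under "right".
    """
    A = 2
    P = np.zeros((S, A, S))
    r = np.zeros((S, A))
    for s in range(S):
        P[s, 0, max(s - 1, 0)] = 1.0               # "left" moves left deterministically
        P[s, 1, min(s + 1, S - 1)] += 0.6          # "right": move right
        P[s, 1, s] += 0.35                         # "right": stay
        P[s, 1, max(s - 1, 0)] += 0.05             # "right": slip back
    r[0, 0] = 0.005                                # small reward at the left end
    r[S - 1, 1] = 1.0                              # large reward at the right end
    # replicate across the horizon to match the non-stationary MDP format
    P_h = np.repeat(P[None], H, axis=0)
    r_h = np.repeat(r[None], H, axis=0)
    return P_h, r_h
```

The returned arrays can be plugged into the value-iteration sketch from Section 2.1 to obtain the optimal policy used as the regret baseline.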
Similar to Chowdhury and Zhou (2021), we fix the planning horizon and run a large number of episodes. For each algorithm, we repeat the experiment several times and report the average performance together with a confidence region. We compare the performance of DP-UCBVI under the constraints of JDP and LDP against the original (non-private) UCBVI. The cumulative regret of each algorithm is shown in Figure 1. Comparing the regret, the non-private UCBVI has the best performance, while the cost of privacy under the JDP constraint is a small constant and thus becomes negligible as the number of episodes increases. In addition, DP-UCBVI with weaker privacy protection (i.e., larger $\epsilon$) has smaller regret. However, under the constraint of LDP, the cost of privacy remains high and it takes a much longer period for the algorithm to converge to near-optimal policies. Our simulation results are consistent with our theory, which states that the cost of JDP is an additive constant while the cost of LDP is multiplicative.
8 Conclusion
In this paper, we studied the well-motivated problem of differentially private reinforcement learning. Under the tabular MDP setting, we propose a general framework, DP-UCBVI (Algorithm 1), that can be combined with any Privatizer for different variants of DP. Under $\epsilon$-JDP, we achieve a regret bound of $\widetilde{O}\big(\sqrt{SAH^2T}+S^2AH^3/\epsilon\big)$, which matches the lower bound up to lower order terms. Meanwhile, under $\epsilon$-LDP, we derive a regret upper bound that improves upon the best known result.
We believe our framework can be further generalized to broader settings, such as linear MDPs. The best known result under linear MDPs (Ngo et al., 2022) builds upon LSVI-UCB (Jin et al., 2020), which is arguably a Hoeffding-type algorithm. The main term of the regret bound in Ngo et al. (2022) is known to be suboptimal in light of the recent work of Hu et al. (2022), which incorporates Bernstein-type self-normalized concentration. An interesting future direction is to privatize LSVI-UCB+ (Algorithm 1 in Hu et al. (2022)) and derive tighter regret bounds under linear MDPs and the constraint of JDP. We believe the techniques in this paper (privatization of a Bernstein-type bonus under tabular MDPs) could serve as basic building blocks.
Acknowledgments
The research is partially supported by NSF Awards #2007117 and #2048091.
References
- Afsar et al. [2021] M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey. ACM Computing Surveys (CSUR), 2021.
- Ayoub et al. [2020] Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463–474. PMLR, 2020.
- Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
- Basu et al. [2019] Debabrota Basu, Christos Dimitrakakis, and Aristide Tossou. Differential privacy for multi-armed bandits: What is it and what is its cost? arXiv preprint arXiv:1905.12298, 2019.
- Brown et al. [2021] Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, and Kunal Talwar. When is memorization of irrelevant training data necessary for high-accuracy learning? In ACM SIGACT Symposium on Theory of Computing, pages 123–132, 2021.
- Bun and Steinke [2016] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.
- Carlini et al. [2019] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX Security Symposium (USENIX Security 19), pages 267–284, 2019.
- Chan et al. [2011] T-H Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC), 14(3):1–24, 2011.
- Chowdhury and Zhou [2021] Sayak Ray Chowdhury and Xingyu Zhou. Differentially private regret minimization in episodic markov decision processes. arXiv preprint arXiv:2112.10599, 2021.
- Dann et al. [2017] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
- Dann et al. [2019] Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pages 1507–1516. PMLR, 2019.
- Duchi et al. [2013] John C Duchi, Michael I Jordan, and Martin J Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 429–438. IEEE, 2013.
- Dwork et al. [2006] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
- Dwork et al. [2014] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211–407, 2014.
- Ficken [2015] Frederick Arthur Ficken. The simplex method of linear programming. Courier Dover Publications, 2015.
- Garcelon et al. [2021] Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, and Matteo Pirotta. Local differential privacy for regret minimization in reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021.
- Garcelon et al. [2022] Evrard Garcelon, Kamalika Chaudhuri, Vianney Perchet, and Matteo Pirotta. Privacy amplification via shuffling for linear contextual bandits. In International Conference on Algorithmic Learning Theory, pages 381–407. PMLR, 2022.
- Hsu et al. [2014] Justin Hsu, Zhiyi Huang, Aaron Roth, Tim Roughgarden, and Zhiwei Steven Wu. Private matchings and allocations. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 21–30, 2014.
- Hu et al. [2022] Pihe Hu, Yu Chen, and Longbo Huang. Nearly minimax optimal reinforcement learning with linear function approximation. In International Conference on Machine Learning, pages 8971–9019. PMLR, 2022.
- Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4), 2010.
- Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
- Jin et al. [2020] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
- Kairouz et al. [2021] Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu. Practical and private (deep) learning without sampling or shuffling. In International Conference on Machine Learning, pages 5213–5225. PMLR, 2021.
- Kearns and Singh [2002] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49(2-3):209–232, 2002.
- Kearns et al. [2014] Michael Kearns, Mallesh Pai, Aaron Roth, and Jonathan Ullman. Mechanism design in large games: Incentives and privacy. In Proceedings of the 5th conference on Innovations in theoretical computer science, pages 403–410, 2014.
- Liao et al. [2021] Chonghua Liao, Jiafan He, and Quanquan Gu. Locally differentially private reinforcement learning for linear mixture markov decision processes. arXiv preprint arXiv:2110.10133, 2021.
- Nemhauser and Wolsey [1988] George Nemhauser and Laurence Wolsey. Polynomial-time algorithms for linear programming. Integer and Combinatorial Optimization, pages 146–181, 1988.
- Ngo et al. [2022] Dung Daniel T Ngo, Giuseppe Vietri, and Steven Wu. Improved regret for differentially private exploration in linear mdp. In International Conference on Machine Learning, pages 16529–16552. PMLR, 2022.
- Qiao and Wang [2022a] Dan Qiao and Yu-Xiang Wang. Near-optimal deployment efficiency in reward-free reinforcement learning with linear function approximation. arXiv preprint arXiv:2210.00701, 2022a.
- Qiao and Wang [2022b] Dan Qiao and Yu-Xiang Wang. Offline reinforcement learning with differential privacy. arXiv preprint arXiv:2206.00810, 2022b.
- Qiao et al. [2022] Dan Qiao, Ming Yin, Ming Min, and Yu-Xiang Wang. Sample-efficient reinforcement learning with loglog(T) switching cost. In International Conference on Machine Learning, pages 18031–18061. PMLR, 2022.
- Raghu et al. [2017] Aniruddh Raghu, Matthieu Komorowski, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach. In Machine Learning for Healthcare Conference, pages 147–163, 2017.
- Sallab et al. [2017] Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70–76, 2017.
- Shariff and Sheffet [2018] Roshan Shariff and Or Sheffet. Differentially private contextual linear bandits. Advances in Neural Information Processing Systems, 31, 2018.
- Strehl and Littman [2008] Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
- Sutton and Barto [1998] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
- Vietri et al. [2020] Giuseppe Vietri, Borja Balle, Akshay Krishnamurthy, and Steven Wu. Private reinforcement learning with pac and regret guarantees. In International Conference on Machine Learning, pages 9754–9764. PMLR, 2020.
- Weissman et al. [2003] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the l1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.
- Xu and Wang [2021] Jianyu Xu and Yu-Xiang Wang. Logarithmic regret in feature-based dynamic pricing. Advances in Neural Information Processing Systems, 34:13898–13910, 2021.
- Xu and Wang [2022] Jianyu Xu and Yu-Xiang Wang. Towards agnostic feature-based dynamic pricing: Linear policies vs linear valuation with unknown noise. In International Conference on Artificial Intelligence and Statistics, pages 9643–9662. PMLR, 2022.
- Xu et al. [2022] Jianyu Xu, Dan Qiao, and Yu-Xiang Wang. Doubly fair dynamic pricing. arXiv preprint arXiv:2209.11837, 2022.
- Yin and Wang [2021] Ming Yin and Yu-Xiang Wang. Towards instance-optimal offline reinforcement learning with pessimism. Advances in neural information processing systems, 34, 2021.
- Yin et al. [2022] Ming Yin, Yaqi Duan, Mengdi Wang, and Yu-Xiang Wang. Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism. arXiv preprint arXiv:2203.05804, 2022.
- Zanette and Brunskill [2019] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312. PMLR, 2019.
- Zhang et al. [2020] Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33:15198–15207, 2020.
- Zhao et al. [2022] Fuheng Zhao, Dan Qiao, Rachel Redberg, Divyakant Agrawal, Amr El Abbadi, and Yu-Xiang Wang. Differentially private linear sketches: Efficient implementations and applications. arXiv preprint arXiv:2205.09873, 2022.
- Zheng et al. [2020] Kai Zheng, Tianle Cai, Weiran Huang, Zhenguo Li, and Liwei Wang. Locally differentially private (contextual) bandits learning. Advances in Neural Information Processing Systems, 33:12300–12310, 2020.
- Zhou [2022] Xingyu Zhou. Differentially private reinforcement learning with linear function approximation. arXiv preprint arXiv:2201.07052, 2022.
Appendix A Extended related works
Regret minimization under tabular MDP. Under the most fundamental setting of tabular MDPs, regret minimization has been widely studied by a long stream of works [Kearns and Singh, 2002, Jaksch et al., 2010, Jin et al., 2018, Xu and Wang, 2021, Qiao et al., 2022, Xu et al., 2022, Qiao and Wang, 2022a, Xu and Wang, 2022]. Among the optimal results, Azar et al. [2017] designed a UCB-based algorithm, UCBVI, and derived the minimax optimal regret bound under stationary MDPs. Later, Zhang et al. [2020] achieved the optimal regret bound under non-stationary MDPs through a Q-learning type algorithm, UCB-ADVANTAGE. Meanwhile, in addition to achieving an optimal regret bound, Dann et al. [2019] also provided policy certificates via their algorithm ORLC. Different from the minimax optimal algorithms above, Zanette and Brunskill [2019] designed an algorithm, EULER, and derived the first problem-dependent regret bound, which implies the minimax optimal regret.
Other differentially private reinforcement learning algorithms. In this paragraph, we discuss algorithms under linear MDPs or linear mixture MDPs. Under linear MDPs, the only algorithm with a JDP guarantee, Private LSVI-UCB [Ngo et al., 2022], is a private version of LSVI-UCB [Jin et al., 2020], while LDP under linear MDPs still remains open. Under linear mixture MDPs, LinOpt-VI-Reg [Zhou, 2022] generalizes UCRL-VTR [Ayoub et al., 2020] to guarantee JDP. In addition, Liao et al. [2021] also privatized UCRL-VTR for an LDP guarantee. On the offline side, Qiao and Wang [2022b] provided the first result under linear MDPs based on VAPVI [Yin et al., 2022].
Appendix B Properties of private estimations
In this section, we present some useful concentration results for our private estimates that hold with high probability. Throughout the proof, we denote the non-private estimates by:
(13) |
In addition, recall that our private estimations are defined as:
(14) |
Lemma B.1.
With probability , for all , it holds that:
(15) |
Proof of Lemma B.1.
We have for all ,
(16) |
where the third and last inequalities are because of Assumption 3.1. The fourth inequality holds with high probability due to Hoeffding's inequality and a union bound. ∎
Lemma B.2.
With probability , for all , it holds that:
(17) |
Proof of Lemma B.2.
Remark B.3.
Similarly, we have for all ,
(19) |
Lemma B.4.
With probability , for all , it holds that:
(20) |
Proof of Lemma B.4.
We have for all ,
(21) |
where the second, fourth and last inequalities result from Assumption 3.1. The fifth inequality holds with high probability due to Bernstein's inequality and a union bound. ∎
Remark B.5.
Similarly, we have for all ,
(22) |
Lemma B.6.
With probability , for all , it holds that:
(23) |
Proof of Lemma B.6.
We have for all ,
(24) |
where the third, fourth and last inequalities come from Assumption 3.1. The fifth inequality holds with high probability because of Bernstein's inequality, the empirical Bernstein inequality and a union bound. ∎
Remark B.7.
Combining all the concentrations, we have the following lemma.
Lemma B.8.
Appendix C Proof of Theorem 4.1
In this section, we assume the conclusions in Assumption 3.1 and Lemma B.8 hold and prove the regret bound.
C.1 Some preparations
C.1.1 Notations
For all , we define the following variances we will use throughout the proof.
(28) |
(29) |
(30) |
Next, recall the definition of our private bonus .
(31) |
According to Assumption 3.1, the private visitation numbers will never underestimate the real ones; therefore, it holds that
(32) |
For the analysis later, we define .
In addition, we define the following three terms for all :
(33) |
C.1.2 Typical episodes
Now we define the typical episodes and the typical episodes with respect to . Briefly speaking, typical episodes ensure that the total number of episodes, or the number of visits to some state, is large enough.
Definition C.1 (Typical episodes).
We define the general typical episodes as . Also, we define typical episodes with respect to as:
where is the real visitation number of before episode .
According to Definition C.1 above, it is clear that
(34) |
In the following proof, when we consider summations over episodes, we can consider only the typical episodes, since all episodes that are not typical contribute only a constant term to the final regret bound.
Finally, we define the following summations for all :
(35) |
(36) |
(37) |
(38) |
C.2 Our induction
Since we apply a Bernstein-type bonus, different from Chowdhury and Zhou [2021], optimism is not straightforward. This is because even if the private value function upper bounds the optimal one and the private variance is close to the true variance, the resulting private Q-value is not necessarily an upper bound of the optimal one. However, we can prove by induction that the private value functions are close enough to the optimal ones, and therefore the last term of the bonus will be sufficiently large to make the private Q-value a valid upper bound. More precisely, our induction is as below:
1. Assume that optimism holds for all episodes up to $k$; we prove a regret bound for the first $k$ episodes.

2. We deduce that the last term of the bonus compensates for the possible variance difference, and optimism holds for all states and steps at episode $k+1$.
C.3 Error decomposition
We define and . Now we provide the error decomposition below, based on optimism, for all ,
(39) |
The second inequality is because of the definition of the bonus. The third inequality results from Lemma B.1. The fourth inequality holds due to Lemma B.6 and the corresponding definitions. The last inequality holds by definition.
In addition, we have:
(40) |
The first inequality is because of Lemma B.4. The second inequality holds by the AM-GM inequality. The last inequality results from a direct calculation.
Recursively applying (41), we have:
(42) |
Summing over episodes, we have
(43) |
According to the Azuma-Hoeffding inequality and a union bound, we can bound the partial sums of martingale differences below.
Lemma C.2.
Let be the number of steps until episode . Then with probability , the following inequalities hold for all and :
(44) |
We define and
Therefore, combining (42) and Lemma C.2, we have the following key lemma that upper bounds summation of .
Lemma C.3.
C.4 Upper bounds of variance
From the analysis above, it suffices to derive upper bounds for the four summations defined in Section C.1. As an intermediate step, we discuss several upper bounds on summations of variances. We begin with the following lemma from Azar et al. [2017].
Lemma C.4 (Lemma 8 of Azar et al. [2017]).
With probability , for all , it holds that
(47) |
The proof results from a combination of the law of total variance, Freedman's inequality, a union bound and our definition of typical episodes; more details can be found in Azar et al. [2017]. Next, we provide another upper bound used for further bounding the summations above.
Lemma C.5.
Under the high probability event in Lemma C.2, it holds that
(48) |
Similarly, under the same high probability event, for all ,
(49) |
Proof of Lemma C.5.
Lastly, we prove the following lemma for bounding (and ). Recall that .
Lemma C.6.
Under the high probability event in Lemma C.2, with probability , it holds that
(51) |
Similarly, under the same high probability event, for all ,
(52) |
Proof of Lemma C.6.
We only prove the first conclusion; the second one can be proven in an identical way. We have
(53) |
The inequality holds because of direct calculation and the fact that . Next we bound these terms separately. First of all,
(54) |
The second inequality comes from Lemma B.2. The last inequality is because of direct calculation. For , according to Lemma C.2, similar to the proof of Lemma C.5, it holds that
(55) |
Lastly, for , we have
(56) |
The second inequality holds with high probability due to Hoeffding's inequality and Remark B.3. The last inequality holds because of a direct calculation.
Combining the upper bounds of , the proof is complete. ∎
C.5 Upper bound of regret
With the upper bounds in Section C.4, we are ready to bound , and the regret. In this section, we assume the high probability events in Lemma C.2 and Lemma C.4 hold. We begin with the upper bound of and . Recall that and .
Lemma C.7.
Proof of Lemma C.7.
We only prove the first conclusion; the second one can be proven in an identical way. We have
(59) |
For , due to Cauchy-Schwarz inequality, it holds that
(60) |
Due to direct calculation, we have and in addition,
(61) |
where the second inequality holds due to Lemma C.4 and Lemma C.5. Therefore, we have
(62) |
For , according to direct calculation, it holds that
(63) |
Next, we bound and . Recall that and
.
Lemma C.8.
Proof of Lemma C.8.
We only prove the first conclusion; the second one can be proven in an identical way. We have
(66) |
We bound these terms separately, we first bound below.
(67) |
The first inequality is because of the Cauchy-Schwarz inequality. The second inequality results from a direct calculation. The third inequality is due to Lemma C.4 and Lemma C.6. The last inequality holds because of the definition of typical episodes.
According to direct calculation,
(68) |
For , it holds that
(69) |
Finally, we bound the most complex term as below. Because of the Cauchy-Schwarz inequality,
(70) |
Note that . For , we have with probability ,
(71) |
where the first inequality holds by construction. The second inequality holds with high probability due to the Azuma-Hoeffding inequality. The third inequality results from Lemma B.2. The last inequality comes from a direct calculation. Therefore, we have
(72) |
Combining (67), (68), (69) and (72), the proof is complete. ∎
Now we are ready to bound the regret up to a given episode based on optimism. We define the following regret functions.
(73) |
In addition, we define the regret with respect to :
(74) |
Lemma C.9.
With high probability, for all episodes such that optimism holds (i.e., the private value functions upper bound the optimal ones for all states and steps), it holds that
(75) |
In addition, for all , we have
(76) |
Proof of Lemma C.9.
For the proof of this lemma, we assume the high probability events in Assumption 3.1, Lemma B.8, Lemma C.2, Lemma C.4, Lemma C.6 (for all ) and Lemma C.8 (for all ) hold. The failure probability is bounded by
(77) |
We only prove the first conclusion; the second one can be proven in an identical way. It holds that
(78) |
where the second inequality is because of Lemma C.3. The fourth inequality is by combining Lemma C.8 and Lemma C.7. The last inequality results from solving the resulting inequality. ∎
Corollary C.10.
Under the event in Lemma C.9, we have
(79) |
where the first three inequalities are due to the corresponding definitions. The fourth inequality holds by construction. The last inequality results from our algorithmic design that the private value estimates are non-increasing over episodes (line 9 of Algorithm 1).
Therefore, we have .
Now we have proven the first point of our induction. Together with point 2 (which we will prove in Section C.6), the whole induction process is valid with high probability.
For clarity, we restate the induction process under the high-probability event in Lemma C.9. For all $k\in[K]$:

1. Given that optimism holds for all episodes up to $k$, we prove a regret bound for the first $k$ episodes.

2. Given such a regret bound for the first $k$ episodes, we prove that optimism holds at episode $k+1$ for all states and steps.
C.6 Proof of optimism
In this part, we prove optimism. Given the condition from point (2) of the induction, we prove optimism at episode $k+1$ through backward induction over the steps (from $h=H+1$ down to $h=1$). Since the conclusion holds trivially for $h=H+1$, it suffices to prove the following Lemma C.11.
Lemma C.11.
Proof of Lemma C.11.
For all , since , it suffices to prove that . We have
(81) |
where the first inequality is because of the Bellman equation and condition 2. The second inequality holds due to the corresponding definition and Lemma B.1. The third inequality results from Lemma B.6. The fourth inequality comes from Lemma B.9 and Remark B.3. The last inequality is due to the following analysis.
Appendix D Missing proofs in Section 5
In this section, we state the missing proofs in Section 5. Recall the three kinds of counts: the original counts, the noisy counts after step (1) of both Privatizers, and the final private counts.
Proof of Lemma 5.1.
Due to Theorem 3.5 of Chan et al. [2011] and Lemma 34 of Hsu et al. [2014], the release of each noisy counter stream satisfies DP with its allocated budget. Similarly, the releases of the other counter streams satisfy DP with their allocated budgets. Therefore, the release of all noisy counters together satisfies $\epsilon$-DP. Due to post-processing (Lemma 2.3 of Bun and Steinke [2016]), the release of all private counts also satisfies $\epsilon$-DP. It then holds that the release of all private estimates is $\epsilon$-DP, again by post-processing. Finally, the guarantee of $\epsilon$-JDP results from the Billboard Lemma (Lemma 9 of Hsu et al. [2014]).
Proof of Lemma 5.4.
For clarity, we denote the solution of (4) by and therefore , .
When the condition (two inequalities) in Lemma 5.4 holds, the original counts are a feasible solution to the optimization problem, which means that
Combining with the condition in Lemma 5.4 with respect to , it holds that
Since and , we have
(85) |
For , according to the constraints in the optimization problem (4), it holds that
Combining with the condition in Lemma 5.4 with respect to , it holds that
Since , we have
(86) |
According to the last line of the optimization problem (4), we have and therefore,
(87) |