
Near-Optimal Differentially Private Reinforcement Learning

Dan Qiao Department of Computer Science, UC Santa Barbara Yu-Xiang Wang Department of Computer Science, UC Santa Barbara
Abstract

Motivated by personalized healthcare and other applications involving sensitive data, we study online exploration in reinforcement learning with differential privacy (DP) constraints. Existing work on this problem established that no-regret learning is possible under joint differential privacy (JDP) and local differential privacy (LDP), but did not provide an algorithm with optimal regret. We close this gap for the JDP case by designing an \epsilon-JDP algorithm with a regret of \widetilde{O}(\sqrt{SAH^{2}T}+S^{2}AH^{3}/\epsilon), which matches the information-theoretic lower bound of non-private learning for all choices of \epsilon>S^{1.5}A^{0.5}H^{2}/\sqrt{T}. In the above, S and A denote the number of states and actions, H denotes the planning horizon, and T is the number of steps. To the best of our knowledge, this is the first private RL algorithm that achieves privacy for free asymptotically as T\rightarrow\infty. Our techniques, which could be of independent interest, include privately releasing Bernstein-type exploration bonuses and an improved method for releasing visitation statistics. The same techniques also imply a slightly improved regret bound for the LDP case.

1 Introduction

Reinforcement Learning (RL) based algorithms are being widely deployed in many personalized services, including medical care (Raghu et al., 2017), autonomous driving (Sallab et al., 2017) and recommendation systems (Afsar et al., 2021). In these applications, the learning agent continuously improves its performance by learning from users' private feedback and data. The private data from users, however, usually contain sensitive information. Take recommendation systems as an example: the agent makes recommendations (corresponding to actions in an MDP) according to users' location, age, gender, etc. (corresponding to states in an MDP), and improves its performance based on users' feedback (corresponding to rewards in an MDP). Unfortunately, it has been shown that unless privacy protections are in place, learning agents will implicitly memorize information about individual training data points (Carlini et al., 2019), even if they are irrelevant for learning (Brown et al., 2021), which makes RL agents vulnerable to various privacy attacks.

Differential privacy (DP) (Dwork et al., 2006) has become the standard notion of privacy. The output of a differentially private RL algorithm is indistinguishable from the output it would return in an alternative universe where any individual user is replaced, thereby preventing the aforementioned privacy risks. However, prior work (Shariff and Sheffet, 2018) shows that standard DP is incompatible with sublinear regret for contextual bandits. Therefore, a relaxed variant of DP, Joint Differential Privacy (JDP) (Kearns et al., 2014), is considered. JDP ensures that the output to all other users will not leak much information about any specific user, and this notion has been studied extensively in bandit problems (Shariff and Sheffet, 2018; Garcelon et al., 2022). In addition, another variant of DP, Local Differential Privacy (LDP) (Duchi et al., 2013), has drawn increasing attention due to its stronger privacy protection. LDP requires that each user's raw data is privatized before being sent to the agent, and it has been well studied in bandits (Basu et al., 2019; Zheng et al., 2020).

Compared to the large body of work on private bandits, existing work on private RL is sparser. Under the tabular MDP model, Vietri et al. (2020) first defined JDP for RL and proposed PUCB with a regret bound and a JDP guarantee. Garcelon et al. (2021) introduced LDP for tabular MDPs and designed LDP-OBI with a regret bound and an LDP guarantee. Recently, Chowdhury and Zhou (2021) provided a general framework for this problem and derived the best-known regret bounds under both JDP and LDP. However, the best known regret bound under \epsilon-JDP, \widetilde{O}(\sqrt{SAH^{3}T}+S^{2}AH^{3}/\epsilon), although the additional regret due to JDP is a lower order term, is still sub-optimal by a factor of \sqrt{H} compared to the minimax optimal regret \widetilde{O}(\sqrt{SAH^{2}T}) (Azar et al., 2017) without DP constraints (under the non-stationary MDP considered in this paper, the result in Azar et al. (2017) carries an additional \sqrt{H} dependence). Therefore, if we run Algorithm 2 of Chowdhury and Zhou (2021), we not only pay a constant additional regret of \widetilde{O}(S^{2}AH^{3}/\epsilon), but also suffer from a multiplicative factor of \sqrt{H}. Motivated by this, we ask whether it is possible to design an algorithm that has an optimal regret bound up to lower order terms while satisfying Joint DP.

Algorithms | Regret under \epsilon-JDP | Regret under \epsilon-LDP | Type of bonus
PUCB (Vietri et al., 2020) | \widetilde{O}(\sqrt{S^{2}AH^{3}T}+S^{2}AH^{3}/\epsilon)^{\star} | NA | Hoeffding
LDP-OBI (Garcelon et al., 2021) | NA | \widetilde{O}(\sqrt{S^{2}AH^{3}T}+S^{2}A\sqrt{H^{5}T}/\epsilon)^{\dagger} | Hoeffding
Private-UCB-PO (Chowdhury and Zhou, 2021) | \widetilde{O}(\sqrt{S^{2}AH^{3}T}+S^{2}AH^{3}/\epsilon) | \widetilde{O}(\sqrt{S^{2}AH^{3}T}+S^{2}A\sqrt{H^{5}T}/\epsilon) | Hoeffding
Private-UCB-VI (Chowdhury and Zhou, 2021) | \widetilde{O}(\sqrt{SAH^{3}T}+S^{2}AH^{3}/\epsilon) | \widetilde{O}(\sqrt{SAH^{3}T}+S^{2}A\sqrt{H^{5}T}/\epsilon) | Hoeffding
DP-UCBVI (Our Algorithm 1) | \widetilde{O}(\sqrt{SAH^{2}T}+S^{2}AH^{3}/\epsilon) | \widetilde{O}(\sqrt{SAH^{2}T}+S^{2}A\sqrt{H^{5}T}/\epsilon) | Bernstein
Lower bound without DP (Jin et al., 2018) | \Omega(\sqrt{SAH^{2}T}) | \Omega(\sqrt{SAH^{2}T}) | NA
Table 1: Comparison of our results (in blue) to existing work regarding regret under \epsilon-joint differential privacy, regret under \epsilon-local differential privacy, and type of bonus. Here T=KH is the number of steps; S, A, H refer to the number of states, the number of actions and the planning horizon. A Bernstein-type bonus uses the knowledge of the estimated variance while a Hoeffding-type bonus directly bounds the variance by its uniform upper bound. \star: For more discussion about this bound, please refer to Chowdhury and Zhou (2021). \dagger: The original regret bound in Garcelon et al. (2021) is for the stationary MDP, and can be translated to the bound stated here by adding a \sqrt{H} factor to the first term.

Our contributions. In this paper, we answer the above question affirmatively by constructing a general algorithm for DP RL: Algorithm 1. Our contributions are threefold.

  • A new upper confidence bound (UCB) based algorithm (DP-UCBVI, Algorithm 1) that can be combined with any Privatizer (for JDP or LDP). Under the constraint of \epsilon-JDP, DP-UCBVI achieves a regret of \widetilde{O}(\sqrt{SAH^{2}T}+S^{2}AH^{3}/\epsilon), which matches the minimax lower bound up to lower order terms.

  • We propose a novel privatization of visitation numbers that satisfies several nice properties (see Assumption 3.1 for details). More importantly, our approach is the first to privatize a Bernstein-type bonus, which helps tighten our regret bounds through the law of total variance.

  • Under the \epsilon-LDP constraint, DP-UCBVI achieves a regret of \widetilde{O}(\sqrt{SAH^{2}T}+S^{2}A\sqrt{H^{5}T}/\epsilon) and improves the best known result (Chowdhury and Zhou, 2021).

1.1 Related work

Detailed comparisons with existing work on differentially private RL under tabular MDPs (Vietri et al., 2020; Garcelon et al., 2021; Chowdhury and Zhou, 2021) are given in Table 1, and we leave further discussion of results on regret minimization to Appendix A. Notably, all existing algorithms privatize a Hoeffding-type bonus and suffer from sub-optimal regret bounds. In comparison, we privatize a Bernstein-type bonus, and the non-private part of our regret matches the minimax lower bound in Jin et al. (2018). (As shown in Table 1, the regret bounds of all DP-RL algorithms contain two parts: one results from running the non-private RL algorithm, while the other is the additional cost due to the DP guarantee. Throughout the paper, we use "non-private part" to denote the regret from running the non-private RL algorithm.)

Generally speaking, to achieve a DP guarantee in RL, a common approach is to add appropriate noise to an existing non-private algorithm and derive tight regret bounds. We discuss private algorithms under tabular MDPs below and leave discussion of algorithms under other settings to Appendix A. Under the constraint of JDP, Vietri et al. (2020) designed PUCB by privatizing UBEV (Dann et al., 2017), and Private-UCB-VI (Chowdhury and Zhou, 2021) results from privatizing UCBVI (with bonus 1) (Azar et al., 2017). Under the constraint of LDP, Garcelon et al. (2021) designed LDP-OBI based on UCRL2 (Jaksch et al., 2010). However, all these works privatize a Hoeffding-type bonus, which is easier to handle but leads to sub-optimal regret bounds. In contrast, we directly build upon the non-private algorithm with the minimax optimal regret bound, UCBVI with bonus 2 (Azar et al., 2017), where the privatization of the Bernstein-type bonus requires more advanced techniques.

A concurrent work (Qiao and Wang, 2022b) focuses on the offline RL setting and derives a private version of APVI (Yin and Wang, 2021). Their algorithm achieves a tight sub-optimality bound for the output policy through privatization of Bernstein-type pessimism (pessimism is the counterpart of the bonus in offline RL, which discourages choosing (s,a) pairs with large uncertainty). However, their analysis relies on the assumption that the visitation numbers of all (state, action) pairs are larger than some threshold. We remove the need for such an assumption via an improved privatization of visitation numbers. More importantly, since offline RL can be viewed as a single step of online RL, privatizing the Bernstein-type bonus in the online setting is more technically demanding. Finally, our approach realizes the future direction stated in the conclusion of Qiao and Wang (2022b).

1.2 A remark on technical novelty

The general idea behind previous differentially private algorithms under tabular MDPs (Vietri et al., 2020; Garcelon et al., 2021; Chowdhury and Zhou, 2021) is to add noise to the accumulative visitation numbers and construct a private bonus based on the privatized visitation numbers. Since a Hoeffding-type bonus b^{k}_{h}(s,a) only uses the information of visitation numbers (e.g., in Azar et al. (2017), b^{k}_{h}(s,a)=\widetilde{O}(H\cdot\sqrt{1/N^{k}_{h}(s,a)})), the construction of the private bonus is straightforward: we can simply replace the original counts N^{k}_{h}(s,a) with private counts \widetilde{N}^{k}_{h}(s,a) and add an additional term to account for the difference between these two bonuses. Next, combining the construction of private bonuses with the uniform upper bound on |\widetilde{N}^{k}_{h}(s,a)-N^{k}_{h}(s,a)|, we can upper bound the private bonus by its non-private counterpart plus a lower order term. Therefore the proof strategy of the original non-private algorithms also applies to their private counterparts.

Unfortunately, although the idea of privatizing UCBVI with bonus 2 (Bernstein-type) (Azar et al., 2017) is straightforward, the generalization of the previous approaches is technically non-trivial. Since bonus 2 in Azar et al. (2017) includes the term \mathrm{Var}_{\widehat{P}^{k}_{h}(\cdot|s,a)}V^{k}_{h+1}(\cdot), the first technical challenge is to replace the empirical transition kernel \widehat{P}^{k}_{h}(s,a) with a private estimate. However, the private transition kernel estimates constructed in previous works are not valid probability distributions. In this paper, for both JDP and LDP, we propose a novel privatization of visitation numbers such that the private transition kernel estimates are valid probability distributions while the upper bound on |\widetilde{N}^{k}_{h}(s,a)-N^{k}_{h}(s,a)| is of the same scale as in previous approaches. With the private transition kernel estimates \widetilde{P}^{k}_{h}, we can replace \mathrm{Var}_{\widehat{P}^{k}_{h}(\cdot|s,a)}V^{k}_{h+1}(\cdot) with \mathrm{Var}_{\widetilde{P}^{k}_{h}(\cdot|s,a)}\widetilde{V}^{k}_{h+1}(\cdot), where \widetilde{V}^{k}_{h}(\cdot) is the value function calculated from value iteration with private estimates. The second challenge is then to bound the difference between these two variances and retain optimism. We overcome the second challenge via concentration inequalities: briefly speaking, we add an additional term (using private statistics) to compensate for the difference between these two bonuses and recover the proof of optimism. With all these techniques, we derive our regret bound using techniques such as error decomposition and error propagation originating from Azar et al. (2017).

2 Notations and Problem Setup

Throughout the paper, for N\in\mathbb{Z}^{+}, [N]=\{1,2,\cdots,N\}. For any set W, \Delta(W) denotes the set of all probability distributions over W. Besides, we use standard notations such as O and \Omega to suppress constants, while \widetilde{O} and \widetilde{\Omega} absorb logarithmic factors.

Below we present the definition of episodic Markov Decision Processes and introduce differential privacy in reinforcement learning.

2.1 Markov decision processes and regret

We consider finite-horizon episodic Markov Decision Processes (MDPs) with non-stationary transitions, denoted by a tuple \mathcal{M}=(\mathcal{S},\mathcal{A},H,\{P_{h}\}_{h=1}^{H},\{r_{h}\}_{h=1}^{H},d_{1}) (Sutton and Barto, 1998), where \mathcal{S} is the state space with |\mathcal{S}|=S, \mathcal{A} is the action space with |\mathcal{A}|=A and H is the planning horizon. The non-stationary transition kernel has the form P_{h}:\mathcal{S}\times\mathcal{A}\times\mathcal{S}\mapsto[0,1], with P_{h}(s^{\prime}|s,a) representing the probability of transitioning from state s with action a to the next state s^{\prime} at time step h. In addition, r_{h}(s,a)\in\Delta([0,1]) denotes the corresponding reward distribution; we overload the notation so that r also denotes the expected (immediate) reward function. Besides, d_{1} is the initial state distribution. A policy is a sequence of mappings \pi=(\pi_{1},\cdots,\pi_{H}), where each \pi_{h} maps each state s\in\mathcal{S} to a probability distribution over actions, i.e. \pi_{h}:\mathcal{S}\rightarrow\Delta(\mathcal{A}), \forall\,h\in[H]. A random trajectory (s_{1},a_{1},r_{1},\cdots,s_{H},a_{H},r_{H},s_{H+1}) is generated by the following rule: s_{1}\sim d_{1}, a_{h}\sim\pi_{h}(\cdot|s_{h}), r_{h}\sim r_{h}(s_{h},a_{h}), s_{h+1}\sim P_{h}(\cdot|s_{h},a_{h}), \forall\,h\in[H].

Given a policy \pi and any h\in[H], the value function V^{\pi}_{h}(\cdot) and Q-value function Q^{\pi}_{h}(\cdot,\cdot) are defined as: V^{\pi}_{h}(s)=\mathbb{E}_{\pi}[\sum_{t=h}^{H}r_{t}|s_{h}=s], Q^{\pi}_{h}(s,a)=\mathbb{E}_{\pi}[\sum_{t=h}^{H}r_{t}|s_{h}=s,a_{h}=a], \forall\,(s,a)\in\mathcal{S}\times\mathcal{A}. The optimal policy \pi^{\star} maximizes V_{h}^{\pi}(s) for all (s,h)\in\mathcal{S}\times[H] simultaneously, and we denote the value function and Q-value function with respect to \pi^{\star} by V^{\star}_{h}(\cdot) and Q^{\star}_{h}(\cdot,\cdot). The Bellman (optimality) equations then hold for all h\in[H]:

Q^{\pi}_{h}(s,a)=r_{h}(s,a)+P_{h}(\cdot|s,a)V^{\pi}_{h+1},\;\;V^{\pi}_{h}=\mathbb{E}_{a\sim\pi_{h}}[Q^{\pi}_{h}],
Q^{\star}_{h}(s,a)=r_{h}(s,a)+P_{h}(\cdot|s,a)V^{\star}_{h+1},\;\;V^{\star}_{h}=\max_{a}Q^{\star}_{h}(\cdot,a).
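As a concrete illustration of these backups, the following is a minimal sketch (in Python/NumPy, with assumed array shapes) of exact backward induction computing Q^{\star} and V^{\star} for a known finite-horizon MDP:

```python
import numpy as np

def backward_induction(P, r, H):
    """Exact dynamic programming for the optimal Q- and value functions.
    Assumed shapes: P is (H, S, A, S) with P[h, s, a, s'] = P_h(s'|s, a),
    r is (H, S, A) with the expected immediate rewards."""
    _, S, A, _ = P.shape
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))            # V_{H+1} = 0 by convention
    for h in range(H - 1, -1, -1):      # h = H, H-1, ..., 1 (0-indexed)
        Q[h] = r[h] + P[h] @ V[h + 1]   # Bellman optimality backup
        V[h] = Q[h].max(axis=1)         # V_h(s) = max_a Q_h(s, a)
    return Q[:H], V[:H]
```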

We measure the performance of online reinforcement learning algorithms by the regret. The regret of an algorithm is defined as

\text{Regret}(K):=\sum_{k=1}^{K}[V_{1}^{\star}(s_{1}^{k})-V_{1}^{\pi_{k}}(s_{1}^{k})],

where s_{1}^{k} is the initial state and \pi_{k} is the policy deployed in episode k. Let K be the number of episodes the agent plans to play, and let the total number of steps be T:=KH.

2.2 Differential privacy under episodic RL

In the episodic RL setting, each trajectory represents one specific user. We first consider the following RL protocol: during the h-th step of the k-th episode, user u_{k} sends her state s_{h}^{k} to the agent \mathcal{M}, \mathcal{M} sends back an action a_{h}^{k}, and finally u_{k} sends her reward r_{h}^{k} to \mathcal{M}. Formally, we denote a sequence of K users who participate in the above RL protocol by \mathcal{U}=(u_{1},\cdots,u_{K}). Following the definition in Vietri et al. (2020), each user can be seen as a tree of depth H encoding the state and reward responses they would reply to all A^{H} possible sequences of actions from the agent. We let \mathcal{M}(\mathcal{U})=(a_{1}^{1},\cdots,a_{H}^{K}) denote the whole sequence of actions chosen by agent \mathcal{M}. An ideal privacy preserving agent would guarantee that \mathcal{M}(\mathcal{U}) and all users but u_{k} together will not reveal much information about user u_{k}. We formalize such privacy preservation through an adaptation of differential privacy (Dwork et al., 2006).

Definition 2.1 (Differential Privacy (DP)).

For any \epsilon>0 and \delta\in[0,1], a mechanism \mathcal{M}:\mathcal{U}\rightarrow\mathcal{A}^{KH} is (\epsilon,\delta)-differentially private if for any possible user sequences \mathcal{U} and \mathcal{U}^{\prime} differing on a single user and any subset E of \mathcal{A}^{KH},

\mathbb{P}[\mathcal{M}(\mathcal{U})\in E]\leq e^{\epsilon}\mathbb{P}[\mathcal{M}(\mathcal{U}^{\prime})\in E]+\delta.

If \delta=0, we say that \mathcal{M} is \epsilon-differentially private (\epsilon-DP).
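For intuition only, the following is a minimal sketch of the classical Laplace mechanism satisfying \epsilon-DP for a simple counting statistic (a generic example, not the RL mechanism \mathcal{M} defined above):

```python
import numpy as np

def private_count(flags, eps, rng=None):
    """Laplace mechanism: release how many records satisfy a predicate with eps-DP.
    Changing one record changes the true count by at most 1 (sensitivity 1),
    so adding Laplace noise of scale 1/eps yields eps-DP."""
    rng = np.random.default_rng() if rng is None else rng
    true_count = int(np.sum(flags))   # flags: array/list of 0/1 indicators
    return true_count + rng.laplace(scale=1.0 / eps)
```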

However, although recommendations to other users will not affect the privacy of user u_{k} significantly, it is impractical to privately recommend actions to user u_{k} herself while protecting the information of her own state and reward. Therefore, the notion of DP is relaxed to Joint Differential Privacy (JDP) (Kearns et al., 2014), which requires that, for every user u_{k}, the recommendations to all users but u_{k} will not reveal much information about u_{k}. JDP is weaker than DP, but it still provides strong privacy protection since it protects a specific user from any possible collusion of all other users against her. Formally, the definition of JDP is shown below.

Definition 2.2 (Joint Differential Privacy (JDP)).

For any \epsilon>0, a mechanism \mathcal{M}:\mathcal{U}\rightarrow\mathcal{A}^{KH} is \epsilon-joint differentially private if for any k\in[K], any user sequences \mathcal{U}, \mathcal{U}^{\prime} differing on the k-th user and any subset E of \mathcal{A}^{(K-1)H},

\mathbb{P}[\mathcal{M}_{-k}(\mathcal{U})\in E]\leq e^{\epsilon}\mathbb{P}[\mathcal{M}_{-k}(\mathcal{U}^{\prime})\in E],

where \mathcal{M}_{-k}(\mathcal{U})\in E means that the sequence of actions recommended to all users but u_{k} belongs to the set E.

JDP ensures that even if an adversary can observe the recommended actions to all users but u_{k}, it is impossible to identify the trajectory of u_{k} accurately. JDP was first defined and analyzed for RL by Vietri et al. (2020).

Although JDP provides strong privacy protection, the agent can still observe the raw trajectories from users. Under some circumstances, however, the users are not even willing to share their original data with the agent. This motivates a stronger notion of privacy called Local Differential Privacy (LDP) (Duchi et al., 2013). Since under LDP the agent is not allowed to directly observe the states of users, we consider the following RL protocol for LDP: during the k-th episode, the agent \mathcal{M} sends the policy \pi_{k} to user u_{k}; after deploying \pi_{k} and obtaining trajectory X_{k}, user u_{k} privatizes her trajectory to X_{k}^{\prime} and finally sends it to \mathcal{M}. We denote the privacy mechanism on the user's side by \widetilde{\mathcal{M}} and define local differential privacy formally below.

Definition 2.3 (Local Differential Privacy (LDP)).

For any \epsilon>0, a mechanism \widetilde{\mathcal{M}} is \epsilon-local differentially private if for any possible trajectories X,X^{\prime} and any possible set
E\subseteq\{\widetilde{\mathcal{M}}(X)\,|\,X\;\text{is any possible trajectory}\},

\mathbb{P}[\widetilde{\mathcal{M}}(X)\in E]\leq e^{\epsilon}\mathbb{P}[\widetilde{\mathcal{M}}(X^{\prime})\in E].

Local DP ensures that even if an adversary observes the whole reply from user u_{k}, it is still statistically hard to identify her trajectory. LDP was first defined and analyzed for RL by Garcelon et al. (2021).

3 Algorithm

Algorithm 1 DP-UCBVI
1:  Input: Number of episodes K, privacy budget \epsilon, failure probability \beta and a Privatizer (either Central or Local).
2:  Initialize: Private counts \widetilde{R}^{1}_{h}(s,a)=\widetilde{N}^{1}_{h}(s,a)=\widetilde{N}^{1}_{h}(s,a,s^{\prime})=0 for all (h,s,a,s^{\prime})\in[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{S}. Set up the confidence bound E_{\epsilon,\beta} w.r.t. the Privatizer. \iota=\log(30HSAT/\beta).
3:  for k=1,2,\cdots,K do
4:     \widetilde{V}^{k}_{H+1}(\cdot)=0.
5:     for h=H,H-1,\cdots,1 do
6:        Compute \widetilde{P}^{k}_{h}(s^{\prime}|s,a) and \widetilde{r}^{k}_{h}(s,a) as in (1).
7:        Calculate the private bonus b^{k}_{h}(s,a)=2\sqrt{\frac{\mathrm{Var}_{s^{\prime}\sim\widetilde{P}_{h}^{k}(\cdot|s,a)}\widetilde{V}^{k}_{h+1}(\cdot)\cdot\iota}{\widetilde{N}^{k}_{h}(s,a)}}+\sqrt{\frac{2\iota}{\widetilde{N}^{k}_{h}(s,a)}}+\frac{20HSE_{\epsilon,\beta}\cdot\iota}{\widetilde{N}^{k}_{h}(s,a)}+4\sqrt{\iota}\cdot\sqrt{\frac{\sum_{s^{\prime}}\widetilde{P}^{k}_{h}(s^{\prime}|s,a)\min\left\{\frac{1000^{2}H^{3}SA\iota^{2}}{\widetilde{N}^{k}_{h+1}(s^{\prime})}+\frac{1000^{2}H^{4}S^{4}A^{2}E_{\epsilon,\beta}^{2}\iota^{4}}{\widetilde{N}^{k}_{h+1}(s^{\prime})^{2}}+\frac{1000^{2}H^{6}S^{4}A^{2}\iota^{4}}{\widetilde{N}^{k}_{h+1}(s^{\prime})^{2}},H^{2}\right\}}{\widetilde{N}^{k}_{h}(s,a)}}.
8:        for (s,a)\in\mathcal{S}\times\mathcal{A} do
9:           \widetilde{Q}^{k}_{h}(s,a)=\min\{\widetilde{Q}^{k-1}_{h}(s,a),H,\widetilde{r}^{k}_{h}(s,a)+\sum_{s^{\prime}}\widetilde{P}_{h}^{k}(s^{\prime}|s,a)\cdot\widetilde{V}^{k}_{h+1}(s^{\prime})+b^{k}_{h}(s,a)\}.
10:        end for
11:        for s\in\mathcal{S} do
12:           \widetilde{V}^{k}_{h}(s)=\max_{a\in\mathcal{A}}\widetilde{Q}^{k}_{h}(s,a).
13:           \pi_{h}^{k}(s)=\arg\max_{a\in\mathcal{A}}\widetilde{Q}^{k}_{h}(s,a) with ties broken arbitrarily.
14:        end for
15:     end for
16:     Deploy policy \pi_{k}=(\pi^{k}_{1},\cdots,\pi^{k}_{H}) and get trajectory (s_{1}^{k},a_{1}^{k},r_{1}^{k},\cdots,s_{H+1}^{k}).
17:     Update the private counts to \widetilde{R}^{k+1},\widetilde{N}^{k+1} via the Privatizer.
18:  end for

In this section, we propose DP-UCBVI (Algorithm 1), which takes a Privatizer as input; the Privatizer can be either Central (for JDP) or Local (for LDP). We provide a regret analysis for all Privatizers satisfying Assumption 3.1 below, which naturally implies regret bounds under both Joint DP and Local DP.

We begin with the following definition of counts. Let N^{k}_{h}(s,a)=\sum_{i=1}^{k-1}\mathds{1}(s_{h}^{i},a_{h}^{i}=s,a) denote the visitation number of (s,a) at step h before the k-th episode. Similarly, N^{k}_{h}(s,a,s^{\prime})=\sum_{i=1}^{k-1}\mathds{1}(s_{h}^{i},a_{h}^{i},s_{h+1}^{i}=s,a,s^{\prime}) and R^{k}_{h}(s,a)=\sum_{i=1}^{k-1}\mathds{1}(s_{h}^{i},a_{h}^{i}=s,a)\cdot r_{h}^{i} denote the visitation number of (h,s,a,s^{\prime}) and the accumulative reward at (h,s,a) before the k-th episode. In non-private RL, such counts are sufficient for estimating the transition kernel P_{h} and the reward function r_{h}, and for deciding the exploration policy, as in Azar et al. (2017). However, these counts are derived from the raw trajectories of the users, which could contain sensitive information. Therefore, under the constraint of privacy, we can only use these counts in a privacy-preserving way, i.e. we use the private counts \widetilde{N}^{k}_{h}(s,a),\widetilde{N}^{k}_{h}(s,a,s^{\prime}),\widetilde{R}^{k}_{h}(s,a) returned by the Privatizer. We make Assumption 3.1 below, which states that, with high probability, the private counts are close to the real ones; this assumption will be justified by our Privatizers in Section 5.

Assumption 3.1 (Private counts).

We assume that for any privacy budget \epsilon>0 and failure probability \beta\in[0,1], the private counts returned by the Privatizer satisfy, for some E_{\epsilon,\beta}>0, with probability at least 1-\beta/3, uniformly over all (h,s,a,s^{\prime},k)\in[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{S}\times[K]:
(1) |\widetilde{N}^{k}_{h}(s,a,s^{\prime})-N^{k}_{h}(s,a,s^{\prime})|\leq E_{\epsilon,\beta}, |\widetilde{N}^{k}_{h}(s,a)-N^{k}_{h}(s,a)|\leq E_{\epsilon,\beta} and |\widetilde{R}^{k}_{h}(s,a)-R^{k}_{h}(s,a)|\leq E_{\epsilon,\beta}.
(2) \widetilde{N}^{k}_{h}(s,a)=\sum_{s^{\prime}\in\mathcal{S}}\widetilde{N}^{k}_{h}(s,a,s^{\prime})\geq N^{k}_{h}(s,a) and \widetilde{N}^{k}_{h}(s,a,s^{\prime})>0. Also, we let \widetilde{N}^{k}_{h}(s)=\sum_{a\in\mathcal{A}}\widetilde{N}^{k}_{h}(s,a).

Under Assumption 3.1, for all (h,s,a,s^{\prime},k), we define the private estimates of the transition kernel and the reward function as

\widetilde{P}^{k}_{h}(s^{\prime}|s,a)=\frac{\widetilde{N}^{k}_{h}(s,a,s^{\prime})}{\widetilde{N}^{k}_{h}(s,a)},\qquad\widetilde{r}^{k}_{h}(s,a)=\left(\frac{\widetilde{R}^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}\right)_{[0,1]}. (1)
Remark 3.2.

Different from the private empirical transition kernels in Vietri et al. (2020); Garcelon et al. (2021); Chowdhury and Zhou (2021), Assumption 3.1 implies that our estimated transition kernel \widetilde{P}^{k}_{h}(\cdot|s,a) is a valid probability distribution; this property results from our construction of the Privatizer. We truncate the empirical reward function so that it stays in [0,1] while still preserving privacy.

Algorithmic design. Similar to non-private algorithms (Azar et al., 2017), DP-UCBVI (Algorithm 1) follows the procedure of optimistic value iteration. More specifically, in episode k, we perform value iteration based on the private estimates \widetilde{P}^{k}_{h}, \widetilde{r}^{k}_{h} and the private bonus term b^{k}_{h} to derive the private Q-value functions \widetilde{Q}^{k}_{h}. Next, the greedy policy \pi_{k} w.r.t. \widetilde{Q}^{k}_{h} is chosen, and we collect one trajectory by running \pi_{k}. Finally, the Privatizer translates the non-private counts to private ones for the next episode. We highlight that, different from all previous works on private RL, our bonus is variance-dependent. By the law of total variance, a variance-dependent bonus effectively saves a factor of \sqrt{H} in the regret bound. Intuitively, the first term of b^{k}_{h} aims to approximate the variance w.r.t. V^{\star}_{h+1}, the last term accounts for the difference between these two variances, and the third term is the additional bonus due to differential privacy.
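To make the update concrete, here is a minimal sketch (in Python/NumPy) of one episode of the optimistic value iteration in Algorithm 1. It is a simplified illustration under assumed array shapes: the bonus below keeps only the leading Bernstein term and the HSE_{\epsilon,\beta}/\widetilde{N} privacy term, whereas the full bonus on line 7 of Algorithm 1 carries additional correction terms.

```python
import numpy as np

def dp_ucbvi_episode(N_sa, N_sas, R_sa, E, H, iota, Q_prev):
    """One episode of DP-UCBVI's optimistic value iteration (simplified sketch).
    Assumed shapes: N_sa (H, S, A), N_sas (H, S, A, S), R_sa (H, S, A),
    Q_prev (H, S, A); E plays the role of E_{eps, beta}, iota of the log factor."""
    _, S, A = N_sa.shape
    P = N_sas / N_sa[..., None]               # private kernel estimate, eq. (1)
    r = np.clip(R_sa / N_sa, 0.0, 1.0)        # private reward estimate, eq. (1)
    Q = np.zeros((H + 1, S, A))
    V = np.zeros((H + 1, S))
    for h in range(H - 1, -1, -1):
        mean = P[h] @ V[h + 1]                                 # E_{s'~P~}[V~_{h+1}]
        var = np.maximum(P[h] @ (V[h + 1] ** 2) - mean ** 2, 0.0)
        bonus = 2 * np.sqrt(var * iota / N_sa[h]) + 20 * H * S * E * iota / N_sa[h]
        Q[h] = np.minimum(np.minimum(Q_prev[h], float(H)), r[h] + mean + bonus)
        V[h] = Q[h].max(axis=1)
    pi = Q[:H].argmax(axis=2)                 # greedy policy pi_h(s)
    return Q[:H], V[:H], pi
```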

4 Main results

In this section, we present our main results, which formalize the algorithmic ideas discussed in previous sections. We first state a general result based on Assumption 3.1, which can be combined with any Privatizer. The proof of Theorem 4.1 is sketched in Section 6 with details in Appendix C.

Theorem 4.1.

For any privacy budget \epsilon>0, failure probability 0<\beta<1 and any Privatizer that satisfies Assumption 3.1, with probability at least 1-\beta, the regret of DP-UCBVI (Algorithm 1) is

\text{Regret}(K)\leq\widetilde{O}(\sqrt{SAH^{2}T}+S^{2}AH^{2}E_{\epsilon,\beta}), (2)

where K is the number of episodes and T=HK.

Under Assumption 3.1, the best known regret bound is \widetilde{O}(\sqrt{SAH^{3}T}+S^{2}AH^{2}E_{\epsilon,\beta}) (Theorem 4.2 of Chowdhury and Zhou (2021)). In comparison, in our regret bound the term parameterized by the privacy loss \epsilon remains the same, while the leading term is improved by a factor of \sqrt{H} to \widetilde{O}(\sqrt{SAH^{2}T}). More importantly, when T is sufficiently large, our result nearly matches the lower bound in Jin et al. (2018), and hence is information-theoretically optimal up to logarithmic factors.

5 Choice of Privatizers

In this section, we design Privatizers that satisfy Assumption 3.1 and different DP constraints (JDP or LDP). All the proofs in this section are deferred to Appendix D.

5.1 Central Privatizer for Joint DP

The Central Privatizer protects the information of every single user by privatizing all the counter streams N^{k}_{h}(s,a), N^{k}_{h}(s,a,s^{\prime}) and R^{k}_{h}(s,a) using the Binary Mechanism (Chan et al., 2011), which privately releases the running sums of a data stream (Zhao et al., 2022). More specifically, for each (h,s,a), \{N^{k}_{h}(s,a)=\sum_{i=1}^{k-1}\mathds{1}(s_{h}^{i},a_{h}^{i}=s,a)\}_{k\in[K]} is the sequence of partial sums of the data stream \{\mathds{1}(s_{h}^{i},a_{h}^{i}=s,a)\}_{i\in[K]}. The Binary Mechanism works as follows: in each episode k, after observing \mathds{1}(s_{h}^{k-1},a_{h}^{k-1}=s,a), the mechanism outputs a private version of \sum_{i=1}^{k-1}\mathds{1}(s_{h}^{i},a_{h}^{i}=s,a) while ensuring differential privacy (for more details about the Binary Mechanism, please refer to Chan et al. (2011) or Kairouz et al. (2021)). Given privacy budget \epsilon>0, we construct the Central Privatizer as below (a code sketch of the mechanism follows the list):

  (1) For all (h,s,a,s^{\prime}), we privatize \{N^{k}_{h}(s,a)\}_{k\in[K]} and \{N^{k}_{h}(s,a,s^{\prime})\}_{k\in[K]} (which are partial sums of bounded streams) by applying the Binary Mechanism (Algorithm 2 in Chan et al. (2011)) with \epsilon^{\prime}=\frac{\epsilon}{3H\log K}. We denote the output of the Binary Mechanism by \widehat{N}^{k}_{h}.

  (2) The private counts \widetilde{N}^{k}_{h} are solved through the procedure in Section 5.1.1 with E_{\epsilon,\beta}=O(\frac{H}{\epsilon}\log(HSAT/\beta)^{2}).

  (3) For the counters of accumulative reward, for all (h,s,a), we apply the same Binary Mechanism with \epsilon^{\prime}=\frac{\epsilon}{3H\log K} to privatize R^{k}_{h}(s,a) and get \widetilde{R}^{k}_{h}(s,a).
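The following is a minimal sketch (in Python/NumPy, for a single counter stream) of the binary/tree-based mechanism of Chan et al. (2011) used in step (1). It is written under the convention that the per-level budget is the total budget divided by the number of levels; the constants and exact noise calibration in our Privatizer may differ.

```python
import numpy as np

def binary_mechanism(stream, eps, rng=None):
    """Release all prefix sums of a bounded 0/1 stream with eps-DP
    (binary / tree-based mechanism, Chan et al., 2011)."""
    rng = np.random.default_rng() if rng is None else rng
    K = len(stream)
    levels = int(np.ceil(np.log2(max(K, 2)))) + 1
    eps_prime = eps / levels               # per-level budget
    alpha = np.zeros(levels)               # exact partial sums ("p-sums")
    alpha_hat = np.zeros(levels)           # noisy p-sums
    out = np.zeros(K)
    for t, x in enumerate(stream, start=1):
        bits = [(t >> j) & 1 for j in range(levels)]
        i_min = next(j for j, b in enumerate(bits) if b == 1)
        # the new p-sum at level i_min absorbs all lower levels plus the new item
        alpha[i_min] = alpha[:i_min].sum() + x
        alpha[:i_min] = 0.0
        alpha_hat[:i_min] = 0.0
        alpha_hat[i_min] = alpha[i_min] + rng.laplace(scale=1.0 / eps_prime)
        # the private prefix sum is the sum of noisy p-sums on t's binary path
        out[t - 1] = sum(alpha_hat[j] for j in range(levels) if bits[j] == 1)
    return out
```

Each stream element contributes to at most `levels` p-sums, so releasing every p-sum with budget eps/levels composes to eps-DP for the whole stream, which is why the error only grows polylogarithmically in K.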

We sum up the properties of Central Privatizer below.

Lemma 5.1.

For any \epsilon>0 and 0<\beta<1, the Central Privatizer satisfies \epsilon-JDP and Assumption 3.1 with E_{\epsilon,\beta}=\widetilde{O}(\frac{H}{\epsilon}).

Therefore, combining Lemma 5.1 and Theorem 4.1, the following regret bound holds.

Theorem 5.2 (Regret under JDP).

For any \epsilon>0 and 0<\beta<1, running DP-UCBVI (Algorithm 1) with the Central Privatizer as input, with probability 1-\beta, it holds that:

\text{Regret}(K)\leq\widetilde{O}(\sqrt{SAH^{2}T}+S^{2}AH^{3}/\epsilon). (3)

Under the most prevalent regime where the privacy budget \epsilon is a constant, the additional regret due to JDP is a lower order term. The main term of Theorem 5.2 improves the best known result \widetilde{O}(\sqrt{SAH^{3}T}) (Corollary 5.2 of Chowdhury and Zhou (2021)) by a factor of \sqrt{H} and matches the minimax lower bound without constraints on DP (Jin et al., 2018).

5.1.1 A post-processing step

During the k-th episode, given the noisy counts \widehat{N}^{k}_{h}(s,a) and \widehat{N}^{k}_{h}(s,a,s^{\prime}) (for all (h,s,a,s^{\prime})\in[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{S}), we construct the following private counts that satisfy Assumption 3.1. The choice of \widetilde{N}^{k}_{h} is as follows: for all (h,s,a),

\{\widetilde{N}^{k}_{h}(s,a,s^{\prime})\}_{s^{\prime}\in\mathcal{S}}=\underset{\{x_{s^{\prime}}\}_{s^{\prime}\in\mathcal{S}}}{\operatorname*{argmin}}\;\left(\max_{s^{\prime}\in\mathcal{S}}\left|x_{s^{\prime}}-\widehat{N}^{k}_{h}(s,a,s^{\prime})\right|\right) (4)
\text{such that}\,\,\left|\sum_{s^{\prime}\in\mathcal{S}}x_{s^{\prime}}-\widehat{N}^{k}_{h}(s,a)\right|\leq\frac{E_{\epsilon,\beta}}{4}\,\,\text{and}\,\,x_{s^{\prime}}\geq 0,\,\forall\,s^{\prime},
\widetilde{N}^{k}_{h}(s,a)=\sum_{s^{\prime}\in\mathcal{S}}\widetilde{N}^{k}_{h}(s,a,s^{\prime}).

Finally, for all (h,s,a), we add the following terms so that, with high probability, \widetilde{N}^{k}_{h}(s,a) never underestimates the true count:

\widetilde{N}^{k}_{h}(s,a,s^{\prime})=\widetilde{N}^{k}_{h}(s,a,s^{\prime})+\frac{E_{\epsilon,\beta}}{2S},\qquad\widetilde{N}^{k}_{h}(s,a)=\widetilde{N}^{k}_{h}(s,a)+\frac{E_{\epsilon,\beta}}{2}. (5)
Remark 5.3.

The optimization problem (4) can be reformulated as:

\min\,\,t,\,\,\text{s.t.}\,\,|x_{s^{\prime}}-\widehat{N}^{k}_{h}(s,a,s^{\prime})|\leq t,\;x_{s^{\prime}}\geq 0,\,\,\forall\,s^{\prime}\in\mathcal{S},\quad\left|\sum_{s^{\prime}\in\mathcal{S}}x_{s^{\prime}}-\widehat{N}^{k}_{h}(s,a)\right|\leq\frac{E_{\epsilon,\beta}}{4}. (6)

Note that (6) is a Linear Programming problem with O(S) variables and O(S) linear constraints. It can be solved efficiently by the simplex method (Ficken, 2015) or other provably efficient algorithms (Nemhauser and Wolsey, 1988). Therefore, since during the whole process we only solve HSAK such Linear Programming problems, our Algorithm 1 is computationally efficient.
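As an illustration, the following is a minimal sketch (in Python with SciPy, under assumed variable names) of solving (6) for a single (h,s,a) and then applying the shift in (5):

```python
import numpy as np
from scipy.optimize import linprog

def postprocess_counts(n_hat_sas, n_hat_sa, E):
    """Post-process noisy counts for one (h, s, a): solve the LP (6),
    then shift as in (5) so the private counts never underestimate.
    n_hat_sas: length-S array of noisy N^(s,a,s'); n_hat_sa: noisy N^(s,a)."""
    S = len(n_hat_sas)
    c = np.zeros(S + 1); c[-1] = 1.0           # variables z = (x_1,...,x_S, t); minimize t
    A_ub, b_ub = [], []
    for sp in range(S):                        # |x_sp - n_hat_sas[sp]| <= t
        row = np.zeros(S + 1); row[sp] = 1.0; row[-1] = -1.0
        A_ub.append(row); b_ub.append(n_hat_sas[sp])
        row = np.zeros(S + 1); row[sp] = -1.0; row[-1] = -1.0
        A_ub.append(row); b_ub.append(-n_hat_sas[sp])
    row = np.zeros(S + 1); row[:S] = 1.0       # |sum_s' x_s' - n_hat_sa| <= E/4
    A_ub.append(row); b_ub.append(n_hat_sa + E / 4)
    A_ub.append(-row); b_ub.append(-(n_hat_sa - E / 4))
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub),
                  bounds=[(0, None)] * S + [(0, None)])
    x = res.x[:S]
    n_tilde_sas = x + E / (2 * S)              # shift (5)
    n_tilde_sa = x.sum() + E / 2
    return n_tilde_sas, n_tilde_sa
```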

The properties of the private counts \widetilde{N}^{k}_{h} are summarized below.

Lemma 5.4.

Suppose \widehat{N}^{k}_{h} satisfies that, with probability 1-\frac{\beta}{3}, uniformly over all (h,s,a,s^{\prime},k),

|\widehat{N}^{k}_{h}(s,a,s^{\prime})-N^{k}_{h}(s,a,s^{\prime})|\leq\frac{E_{\epsilon,\beta}}{4},
|\widehat{N}^{k}_{h}(s,a)-N^{k}_{h}(s,a)|\leq\frac{E_{\epsilon,\beta}}{4},

then the \widetilde{N}^{k}_{h} derived from (4) and (5) satisfies Assumption 3.1.

Remark 5.5.

Compared to the concurrent work (Qiao and Wang, 2022b), our private counts \widetilde{N}^{k}_{h}(s,a) have the additional guarantee of never underestimating the true values, which is a desirable property for the analysis in Appendix B. In comparison, the analysis in Qiao and Wang (2022b) heavily relies on the assumption that the visitation numbers are larger than some threshold, so that the scale of the noise is negligible.

5.2 Local Privatizer for Local DP

For each episode k, the Local Privatizer privatizes each single trajectory by perturbing the statistics calculated from that trajectory. For visitations of (state, action) pairs, the original visitation numbers \{\sigma^{k}_{h}(s,a)=\mathds{1}(s_{h}^{k},a_{h}^{k}=s,a)\}_{(h,s,a)} have \ell_{1} sensitivity H. Therefore, the perturbed version of these counts, \widetilde{\sigma}^{k}_{h}(s,a)=\sigma^{k}_{h}(s,a)+\text{Lap}(\frac{3H}{\epsilon}), satisfies \frac{\epsilon}{3}-LDP. In addition, similar perturbations of \{\mathds{1}(s^{k}_{h},a^{k}_{h},s^{k}_{h+1}=s,a,s^{\prime})\}_{(h,s,a,s^{\prime})} and \{\mathds{1}(s_{h}^{k},a_{h}^{k}=s,a)\cdot r^{k}_{h}\}_{(h,s,a)} lead to the same guarantee. As a result, we construct the Local Privatizer as below (a code sketch of the local randomizer follows the list):

  (1) For all (k,h,s,a,s^{\prime}), we perturb \sigma^{k}_{h}(s,a)=\mathds{1}(s_{h}^{k},a_{h}^{k}=s,a) and \sigma^{k}_{h}(s,a,s^{\prime})=\mathds{1}(s^{k}_{h},a^{k}_{h},s^{k}_{h+1}=s,a,s^{\prime}) by adding independent Laplace noises:

     \widetilde{\sigma}^{k}_{h}(s,a)=\sigma^{k}_{h}(s,a)+\text{Lap}(\frac{3H}{\epsilon}),
     \widetilde{\sigma}^{k}_{h}(s,a,s^{\prime})=\sigma^{k}_{h}(s,a,s^{\prime})+\text{Lap}(\frac{3H}{\epsilon}).

  (2) The noisy counts are calculated by

     \widehat{N}^{k}_{h}(s,a)=\sum_{i=1}^{k-1}\widetilde{\sigma}^{i}_{h}(s,a),
     \widehat{N}^{k}_{h}(s,a,s^{\prime})=\sum_{i=1}^{k-1}\widetilde{\sigma}^{i}_{h}(s,a,s^{\prime}).

     Then the private counts \widetilde{N}^{k}_{h} are solved through the procedure in Section 5.1.1 with E_{\epsilon,\beta}=O(\frac{H}{\epsilon}\sqrt{K\log(HSAT/\beta)}).

  (3) We perturb the trajectory-wise reward by adding independent Laplace noise: \widetilde{r}^{k}_{h}(s,a)=\mathds{1}(s_{h}^{k},a_{h}^{k}=s,a)\cdot r^{k}_{h}+\text{Lap}(\frac{3H}{\epsilon}). The accumulative statistic is calculated by \widetilde{R}^{k}_{h}(s,a)=\sum_{i=1}^{k-1}\widetilde{r}^{i}_{h}(s,a).
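A minimal sketch (in Python/NumPy, under an assumed trajectory encoding and variable names) of one user's local randomizer described above:

```python
import numpy as np

def privatize_trajectory(traj, S, A, H, eps, rng=None):
    """One user's local randomizer (eps-LDP): perturb the per-trajectory
    indicator counts and rewards with Laplace noise of scale 3H/eps.
    traj is assumed to be a list of (s, a, r, s_next) tuples, one per step h."""
    rng = np.random.default_rng() if rng is None else rng
    scale = 3.0 * H / eps
    sigma_sa = np.zeros((H, S, A))
    sigma_sas = np.zeros((H, S, A, S))
    reward = np.zeros((H, S, A))
    for h, (s, a, r, s_next) in enumerate(traj):
        sigma_sa[h, s, a] += 1.0
        sigma_sas[h, s, a, s_next] += 1.0
        reward[h, s, a] += r
    # each of the three statistics has l1 sensitivity H and gets budget eps/3
    sigma_sa += rng.laplace(scale=scale, size=sigma_sa.shape)
    sigma_sas += rng.laplace(scale=scale, size=sigma_sas.shape)
    reward += rng.laplace(scale=scale, size=reward.shape)
    return sigma_sa, sigma_sas, reward
```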

The properties of our Local Privatizer are summarized below.

Lemma 5.6.

For any \epsilon>0 and 0<\beta<1, the Local Privatizer satisfies \epsilon-LDP and Assumption 3.1 with E_{\epsilon,\beta}=\widetilde{O}(\frac{H}{\epsilon}\sqrt{K}).

Therefore, combining Lemma 5.6 and Theorem 4.1, the following regret bound holds.

Theorem 5.7 (Regret under LDP).

For any \epsilon>0 and 0<\beta<1, running DP-UCBVI (Algorithm 1) with the Local Privatizer as input, with probability 1-\beta, it holds that:

\text{Regret}(K)\leq\widetilde{O}(\sqrt{SAH^{2}T}+S^{2}A\sqrt{H^{5}T}/\epsilon). (7)

Theorem 5.7 improves the non-private part of the regret bound of the best known result (Corollary 5.5 of Chowdhury and Zhou (2021)).

5.3 More discussions

Step (1) of our Privatizers is similar to previous works (Vietri et al., 2020; Garcelon et al., 2021; Chowdhury and Zhou, 2021). However, different from their approaches (which directly use \widehat{N}^{k}_{h} as the private counts), we apply the post-processing step in Section 5.1.1, which ensures that \widetilde{P}^{k}_{h} is a valid probability distribution while E_{\epsilon,\beta} only becomes worse by a constant factor. Therefore, we can apply a Bernstein-type bonus to achieve the optimal non-private part in our regret bound.

We remark that the Laplace Mechanism can be replaced with other mechanisms, such as the Gaussian Mechanism (Dwork et al., 2014) for approximate DP (or zCDP). According to Theorem 4.1, the corresponding regret bounds can be derived by plugging in the corresponding E_{\epsilon,\beta}.

6 Proof Sketch

In this section, we provide a proof overview for Theorem 4.1, which implies the results under JDP (Theorem 5.2) and LDP (Theorem 5.7). Recall that N^{k}_{h}(s,a) and N^{k}_{h}(s,a,s^{\prime}) are the real visitation numbers, while the \widetilde{N}^{k}_{h}'s are private ones satisfying Assumption 3.1. Other notations such as \widetilde{P}^{k}_{h}, \widetilde{r}^{k}_{h}, \widetilde{Q}^{k}_{h}, \widetilde{V}^{k}_{h} and \iota are defined in Algorithm 1. The statement "with high probability" means that the summation of all failure probabilities is bounded by \beta. We begin with some properties of the private statistics below.

Properties of \widetilde{P} and \widetilde{r}. Due to concentration inequalities and Assumption 3.1, we provide high probability bounds for \left|\widetilde{r}^{k}_{h}(s,a)-r_{h}(s,a)\right|, \left\|\widetilde{P}^{k}_{h}(\cdot|s,a)-P_{h}(\cdot|s,a)\right\|_{1} and \left|\widetilde{P}^{k}_{h}(s^{\prime}|s,a)-P_{h}(s^{\prime}|s,a)\right| in Appendix B. In addition, we bound the key term \left|\left(\widetilde{P}^{k}_{h}-P_{h}\right)\cdot V^{\star}_{h+1}(s,a)\right| below.

Lemma 6.1 (Informal version of Lemma B.6).

With high probability, for all (h,s,a,k)(h,s,a,k), it holds that:

\left|\left(\widetilde{P}^{k}_{h}-P_{h}\right)\cdot V^{\star}_{h+1}(s,a)\right|\leq\widetilde{O}\left(\sqrt{\frac{\mathrm{Var}_{\widehat{P}^{k}_{h}(\cdot|s,a)}V^{\star}_{h+1}(\cdot)}{\widetilde{N}^{k}_{h}(s,a)}}\right)+\widetilde{O}\left(\frac{HSE_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}\right). (8)

With these concentrations, we are ready to present our proof sketch. Since we apply a Bernstein-type bonus, the proof of optimism is not straightforward. We prove our regret upper bound through induction, as shown below.

Induction over episodes. Our induction is over k\in[K]:

  (1) Given that for all (i,h,s,a)\in[k]\times[H]\times\mathcal{S}\times\mathcal{A}, \widetilde{Q}^{i}_{h}(s,a)\geq Q^{\star}_{h}(s,a), we prove (with T_{k}=kH)

     \text{Regret}(k)\leq\widetilde{O}\left(\sqrt{H^{2}SAT_{k}}+H^{2}S^{2}AE_{\epsilon,\beta}\right)

     and for all (h,s)\in[H]\times\mathcal{S},

     \widetilde{V}^{k}_{h}(s)-V^{\star}_{h}(s)\leq\widetilde{O}\left(\sqrt{\frac{SAH^{3}}{N^{k}_{h}(s)}}+\frac{S^{2}AH^{2}E_{\epsilon,\beta}}{N^{k}_{h}(s)}\right).

  (2) Given that for all (h,s)\in[H]\times\mathcal{S},

     \widetilde{V}^{k}_{h}(s)-V^{\star}_{h}(s)\leq\widetilde{O}\left(\sqrt{\frac{SAH^{3}}{N^{k}_{h}(s)}}+\frac{S^{2}AH^{2}E_{\epsilon,\beta}}{N^{k}_{h}(s)}\right),

     we prove that for all (h,s,a), \widetilde{Q}^{k+1}_{h}(s,a)\geq Q^{\star}_{h}(s,a).

Suppose the above induction holds; then point (1) holds for all k\in[K], and therefore

\text{Regret}(K)\leq\widetilde{O}\left(\sqrt{H^{2}SAT}+H^{2}S^{2}AE_{\epsilon,\beta}\right). (9)

Below we discuss the proofs of (1) and (2) separately.

Proof of regret bound: (1). We only need to prove the upper bound on \text{Regret}(k), as the upper bound on \widetilde{V}^{k}_{h}(s)-V^{\star}_{h}(s) follows similarly. Using the standard technique of layer-wise error decomposition (details in Appendix C.3) and ignoring lower order terms (summations of martingale differences), we only need to bound \sum_{i=1}^{k}\sum_{h=1}^{H}b^{i}_{h}(s^{i}_{h},a^{i}_{h}), which consists of four terms according to the definition of b^{k}_{h}. First of all, the second and fourth terms are dominated by the first and third terms. Next, for the third term, we have

\sum_{i=1}^{k}\sum_{h=1}^{H}\frac{20HSE_{\epsilon,\beta}\iota}{\widetilde{N}^{i}_{h}(s^{i}_{h},a^{i}_{h})}\leq\sum_{i=1}^{k}\sum_{h=1}^{H}\frac{20HSE_{\epsilon,\beta}\iota}{N^{i}_{h}(s^{i}_{h},a^{i}_{h})}\leq\widetilde{O}(S^{2}AH^{2}E_{\epsilon,\beta}). (10)

We now analyze the first term (which is also the main term): \sum_{i=1}^{k}\sum_{h=1}^{H}\sqrt{\frac{\mathrm{Var}_{s^{\prime}\sim\widetilde{P}_{h}^{i}(\cdot|s^{i}_{h},a^{i}_{h})}\widetilde{V}^{i}_{h+1}(\cdot)}{\widetilde{N}^{i}_{h}(s^{i}_{h},a^{i}_{h})}}. It holds that

\sum_{i=1}^{k}\sum_{h=1}^{H}\sqrt{\frac{\mathrm{Var}_{s^{\prime}\sim\widetilde{P}_{h}^{i}(\cdot|s^{i}_{h},a^{i}_{h})}\widetilde{V}^{i}_{h+1}(\cdot)}{\widetilde{N}^{i}_{h}(s^{i}_{h},a^{i}_{h})}}\leq\sqrt{\underbrace{\sum_{i=1}^{k}\sum_{h=1}^{H}\frac{1}{N^{i}_{h}(s^{i}_{h},a^{i}_{h})}}_{\leq\widetilde{O}(HSA)}}\cdot\sqrt{\underbrace{\sum_{i=1}^{k}\sum_{h=1}^{H}\mathrm{Var}_{\widetilde{P}_{h}^{i}(\cdot|s^{i}_{h},a^{i}_{h})}\widetilde{V}^{i}_{h+1}(\cdot)}_{(a)}}. (11)

We bound (a) below (details are deferred to Appendix C.4).

(a)\leq\underbrace{\sum_{i=1}^{k}\sum_{h=1}^{H}\mathrm{Var}_{P_{h}(\cdot|s^{i}_{h},a^{i}_{h})}V^{\pi_{i}}_{h+1}(\cdot)}_{\leq\widetilde{O}(H^{2}k)\;\text{w.h.p. due to LTV}}+\text{lower order terms}. (12)

Therefore, the main term in the regret bound scales as \widetilde{O}(\sqrt{H^{2}SAT_{k}}+H^{2}S^{2}AE_{\epsilon,\beta}). The details about the lower order terms are deferred to Appendix C.3, C.4 and C.5.
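For completeness, the law of total variance (LTV) step underlying (12) refers to the standard identity (see, e.g., Azar et al. (2017)): for any policy \pi, the per-episode sum of conditional variances is controlled by the variance of the return,

\mathbb{E}_{\pi}\left[\sum_{h=1}^{H}\mathrm{Var}_{P_{h}(\cdot|s_{h},a_{h})}V^{\pi}_{h+1}(\cdot)\right]\leq\mathrm{Var}_{\pi}\left(\sum_{h=1}^{H}r_{h}\right)\leq H^{2},

so summing over k episodes and applying a standard concentration argument yields the \widetilde{O}(H^{2}k) bound with high probability.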

Proof of optimism: (2). To prove optimism, we only need

b^{k}_{h}(s,a)\geq|\widetilde{r}^{k}_{h}(s,a)-r_{h}(s,a)|+|(\widetilde{P}^{k}_{h}-P_{h})\cdot V^{\star}_{h+1}(s,a)|.

It is clear that |\widetilde{r}^{k}_{h}(s,a)-r_{h}(s,a)| can be bounded by the second term and a portion of the third term of b^{k}_{h}(s,a). Due to Lemma 6.1, |(\widetilde{P}^{k}_{h}-P_{h})\cdot V^{\star}_{h+1}(s,a)| can be bounded by \widetilde{O}\left(\sqrt{\mathrm{Var}_{\widehat{P}^{k}_{h}(\cdot|s,a)}V^{\star}_{h+1}(\cdot)/\widetilde{N}^{k}_{h}(s,a)}\right), which can be further bounded by \widetilde{O}\left(\sqrt{\mathrm{Var}_{\widetilde{P}^{k}_{h}(\cdot|s,a)}V^{\star}_{h+1}(\cdot)/\widetilde{N}^{k}_{h}(s,a)}\right) plus a portion of the third term of b^{k}_{h}(s,a). Finally, together with the upper bound on |\widetilde{V}^{k}_{h}(s)-V^{\star}_{h}(s)| (derived from the condition of (2) and optimism), the last term of b^{k}_{h}(s,a) compensates for the difference between \sqrt{\mathrm{Var}_{\widetilde{P}^{k}_{h}(\cdot|s,a)}\widetilde{V}^{k}_{h+1}(\cdot)/\widetilde{N}^{k}_{h}(s,a)} (the first term of b^{k}_{h}(s,a)) and \sqrt{\mathrm{Var}_{\widetilde{P}^{k}_{h}(\cdot|s,a)}V^{\star}_{h+1}(\cdot)/\widetilde{N}^{k}_{h}(s,a)}. More details about optimism are deferred to Appendix C.6.

7 Simulations

Figure 1: Comparison of cumulative regret for UCBVI and DP-UCBVI with different DP guarantees.

In this section, we run simulations to evaluate the performance of DP-UCBVI (Algorithm 1). We run simulations on RiverSwim (Strehl and Littman, 2008), a standard benchmark for tabular MDPs; Chowdhury and Zhou (2021) run simulations on the same environment. Briefly speaking, the environment consists of six consecutive states and two actions, "left" and "right". Choosing "left", the agent tends to move towards the left side, and vice versa. The agent starts from the left side and tries to reach the right side, where she can get a higher reward. For more details and an illustration of this setting, please refer to Chowdhury and Zhou (2021).
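For concreteness, the sketch below builds a RiverSwim-style transition kernel in Python/NumPy; the specific transition probabilities are illustrative assumptions and not necessarily the exact configuration used in our experiments.

```python
import numpy as np

def riverswim_transitions(S=6, p_right=0.6, p_stay=0.35):
    """Illustrative RiverSwim-style transition kernel (probabilities are assumed
    values for illustration). Action 0 = "left" (deterministic move left),
    action 1 = "right" (tends to move right, may stay or slip back)."""
    P = np.zeros((S, 2, S))
    for s in range(S):
        P[s, 0, max(s - 1, 0)] = 1.0                 # "left" always moves left
        P[s, 1, min(s + 1, S - 1)] += p_right        # "right" tends to move right
        P[s, 1, s] += p_stay                         # ... may stay in place
        P[s, 1, max(s - 1, 0)] += 1.0 - p_right - p_stay   # ... or slip left
    return P
```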

Similar to Chowdhury and Zhou (2021), we set the planning horizon to H=20 and run K=50000 episodes. For each algorithm, we run 5 repetitions and report the average performance and confidence region. We compare the performance of DP-UCBVI under the constraints of JDP and LDP against the original UCBVI. The cumulative regret of each algorithm is shown in Figure 1. Comparing the regret, the non-private UCBVI has the best performance, while the cost of privacy under the constraint of JDP is a small constant and thus becomes negligible as the number of episodes increases. In addition, DP-UCBVI with weaker privacy protection (i.e., larger \epsilon) has smaller regret. However, under the constraint of LDP, the cost of privacy remains high and it takes a much longer period for the algorithm to converge to near-optimal policies. Our simulation results are consistent with our theory, which states that the cost of JDP is a constant additive term while the cost of LDP is multiplicative.

8 Conclusion

In this paper, we studied the well-motivated problem of differentially private reinforcement learning. Under the tabular MDP setting, we proposed a general framework, DP-UCBVI (Algorithm 1), that can be combined with any Privatizer for different variants of DP. Under \epsilon-JDP, we achieved a regret bound of \widetilde{O}(\sqrt{SAH^{2}T}+S^{2}AH^{3}/\epsilon), which matches the lower bound up to lower order terms. Meanwhile, under \epsilon-LDP, we derived a regret upper bound of \widetilde{O}(\sqrt{SAH^{2}T}+S^{2}A\sqrt{H^{5}T}/\epsilon), which improves the best known result.

We believe our framework can be further generalized to broader settings, such as the linear MDP setting. The best known result under linear MDPs (Ngo et al., 2022) builds upon LSVI-UCB (Jin et al., 2020), which is arguably a Hoeffding-type algorithm. The main term of the regret bound in Ngo et al. (2022), \widetilde{O}(\sqrt{d^{3}H^{3}T}), is known to be suboptimal due to the recent work (Hu et al., 2022), which incorporates Bernstein-type self-normalized concentration. An interesting future direction is to privatize LSVI-UCB+ (Algorithm 1 in Hu et al. (2022)) and derive tighter regret bounds under linear MDPs and the constraint of JDP. We believe the techniques in this paper (privatization of a Bernstein-type bonus under tabular MDPs) could serve as basic building blocks.

Acknowledgments

The research is partially supported by NSF Awards #2007117 and #2048091.

References

  • Afsar et al. [2021] M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey. ACM Computing Surveys (CSUR), 2021.
  • Ayoub et al. [2020] Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463–474. PMLR, 2020.
  • Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
  • Basu et al. [2019] Debabrota Basu, Christos Dimitrakakis, and Aristide Tossou. Differential privacy for multi-armed bandits: What is it and what is its cost? arXiv preprint arXiv:1905.12298, 2019.
  • Brown et al. [2021] Gavin Brown, Mark Bun, Vitaly Feldman, Adam Smith, and Kunal Talwar. When is memorization of irrelevant training data necessary for high-accuracy learning? In ACM SIGACT Symposium on Theory of Computing, pages 123–132, 2021.
  • Bun and Steinke [2016] Mark Bun and Thomas Steinke. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference, pages 635–658. Springer, 2016.
  • Carlini et al. [2019] Nicholas Carlini, Chang Liu, Úlfar Erlingsson, Jernej Kos, and Dawn Song. The secret sharer: Evaluating and testing unintended memorization in neural networks. In USENIX Security Symposium (USENIX Security 19), pages 267–284, 2019.
  • Chan et al. [2011] T-H Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC), 14(3):1–24, 2011.
  • Chowdhury and Zhou [2021] Sayak Ray Chowdhury and Xingyu Zhou. Differentially private regret minimization in episodic markov decision processes. arXiv preprint arXiv:2112.10599, 2021.
  • Dann et al. [2017] Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying pac and regret: Uniform pac bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
  • Dann et al. [2019] Christoph Dann, Lihong Li, Wei Wei, and Emma Brunskill. Policy certificates: Towards accountable reinforcement learning. In International Conference on Machine Learning, pages 1507–1516. PMLR, 2019.
  • Duchi et al. [2013] John C Duchi, Michael I Jordan, and Martin J Wainwright. Local privacy and statistical minimax rates. In 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, pages 429–438. IEEE, 2013.
  • Dwork et al. [2006] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
  • Dwork et al. [2014] Cynthia Dwork, Aaron Roth, et al. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci., 9(3-4):211–407, 2014.
  • Ficken [2015] Frederick Arthur Ficken. The simplex method of linear programming. Courier Dover Publications, 2015.
  • Garcelon et al. [2021] Evrard Garcelon, Vianney Perchet, Ciara Pike-Burke, and Matteo Pirotta. Local differential privacy for regret minimization in reinforcement learning. Advances in Neural Information Processing Systems, 34, 2021.
  • Garcelon et al. [2022] Evrard Garcelon, Kamalika Chaudhuri, Vianney Perchet, and Matteo Pirotta. Privacy amplification via shuffling for linear contextual bandits. In International Conference on Algorithmic Learning Theory, pages 381–407. PMLR, 2022.
  • Hsu et al. [2014] Justin Hsu, Zhiyi Huang, Aaron Roth, Tim Roughgarden, and Zhiwei Steven Wu. Private matchings and allocations. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 21–30, 2014.
  • Hu et al. [2022] Pihe Hu, Yu Chen, and Longbo Huang. Nearly minimax optimal reinforcement learning with linear function approximation. In International Conference on Machine Learning, pages 8971–9019. PMLR, 2022.
  • Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(4), 2010.
  • Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
  • Jin et al. [2020] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
  • Kairouz et al. [2021] Peter Kairouz, Brendan McMahan, Shuang Song, Om Thakkar, Abhradeep Thakurta, and Zheng Xu. Practical and private (deep) learning without sampling or shuffling. In International Conference on Machine Learning, pages 5213–5225. PMLR, 2021.
  • Kearns and Singh [2002] Michael Kearns and Satinder Singh. Near-optimal reinforcement learning in polynomial time. Machine learning, 49(2-3):209–232, 2002.
  • Kearns et al. [2014] Michael Kearns, Mallesh Pai, Aaron Roth, and Jonathan Ullman. Mechanism design in large games: Incentives and privacy. In Proceedings of the 5th conference on Innovations in theoretical computer science, pages 403–410, 2014.
  • Liao et al. [2021] Chonghua Liao, Jiafan He, and Quanquan Gu. Locally differentially private reinforcement learning for linear mixture Markov decision processes. arXiv preprint arXiv:2110.10133, 2021.
  • Nemhauser and Wolsey [1988] George Nemhauser and Laurence Wolsey. Polynomial-time algorithms for linear programming. Integer and Combinatorial Optimization, pages 146–181, 1988.
  • Ngo et al. [2022] Dung Daniel T Ngo, Giuseppe Vietri, and Steven Wu. Improved regret for differentially private exploration in linear MDP. In International Conference on Machine Learning, pages 16529–16552. PMLR, 2022.
  • Qiao and Wang [2022a] Dan Qiao and Yu-Xiang Wang. Near-optimal deployment efficiency in reward-free reinforcement learning with linear function approximation. arXiv preprint arXiv:2210.00701, 2022a.
  • Qiao and Wang [2022b] Dan Qiao and Yu-Xiang Wang. Offline reinforcement learning with differential privacy. arXiv preprint arXiv:2206.00810, 2022b.
  • Qiao et al. [2022] Dan Qiao, Ming Yin, Ming Min, and Yu-Xiang Wang. Sample-efficient reinforcement learning with loglog(T) switching cost. In International Conference on Machine Learning, pages 18031–18061. PMLR, 2022.
  • Raghu et al. [2017] Aniruddh Raghu, Matthieu Komorowski, Leo Anthony Celi, Peter Szolovits, and Marzyeh Ghassemi. Continuous state-space models for optimal sepsis treatment: a deep reinforcement learning approach. In Machine Learning for Healthcare Conference, pages 147–163, 2017.
  • Sallab et al. [2017] Ahmad EL Sallab, Mohammed Abdou, Etienne Perot, and Senthil Yogamani. Deep reinforcement learning framework for autonomous driving. Electronic Imaging, 2017(19):70–76, 2017.
  • Shariff and Sheffet [2018] Roshan Shariff and Or Sheffet. Differentially private contextual linear bandits. Advances in Neural Information Processing Systems, 31, 2018.
  • Strehl and Littman [2008] Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
  • Sutton and Barto [1998] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction, volume 1. MIT press Cambridge, 1998.
  • Vietri et al. [2020] Giuseppe Vietri, Borja Balle, Akshay Krishnamurthy, and Steven Wu. Private reinforcement learning with PAC and regret guarantees. In International Conference on Machine Learning, pages 9754–9764. PMLR, 2020.
  • Weissman et al. [2003] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.
  • Xu and Wang [2021] Jianyu Xu and Yu-Xiang Wang. Logarithmic regret in feature-based dynamic pricing. Advances in Neural Information Processing Systems, 34:13898–13910, 2021.
  • Xu and Wang [2022] Jianyu Xu and Yu-Xiang Wang. Towards agnostic feature-based dynamic pricing: Linear policies vs linear valuation with unknown noise. In International Conference on Artificial Intelligence and Statistics, pages 9643–9662. PMLR, 2022.
  • Xu et al. [2022] Jianyu Xu, Dan Qiao, and Yu-Xiang Wang. Doubly fair dynamic pricing. arXiv preprint arXiv:2209.11837, 2022.
  • Yin and Wang [2021] Ming Yin and Yu-Xiang Wang. Towards instance-optimal offline reinforcement learning with pessimism. Advances in neural information processing systems, 34, 2021.
  • Yin et al. [2022] Ming Yin, Yaqi Duan, Mengdi Wang, and Yu-Xiang Wang. Near-optimal offline reinforcement learning with linear representation: Leveraging variance information with pessimism. arXiv preprint arXiv:2203.05804, 2022.
  • Zanette and Brunskill [2019] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312. PMLR, 2019.
  • Zhang et al. [2020] Zihan Zhang, Yuan Zhou, and Xiangyang Ji. Almost optimal model-free reinforcement learning via reference-advantage decomposition. Advances in Neural Information Processing Systems, 33:15198–15207, 2020.
  • Zhao et al. [2022] Fuheng Zhao, Dan Qiao, Rachel Redberg, Divyakant Agrawal, Amr El Abbadi, and Yu-Xiang Wang. Differentially private linear sketches: Efficient implementations and applications. arXiv preprint arXiv:2205.09873, 2022.
  • Zheng et al. [2020] Kai Zheng, Tianle Cai, Weiran Huang, Zhenguo Li, and Liwei Wang. Locally differentially private (contextual) bandits learning. Advances in Neural Information Processing Systems, 33:12300–12310, 2020.
  • Zhou [2022] Xingyu Zhou. Differentially private reinforcement learning with linear function approximation. arXiv preprint arXiv:2201.07052, 2022.

Appendix A Extended related works

Regret minimization under tabular MDP. Under the most fundamental setting of tabular MDP, regret minimization has been widely studied by a long line of works [Kearns and Singh, 2002, Jaksch et al., 2010, Jin et al., 2018, Xu and Wang, 2021, Qiao et al., 2022, Xu et al., 2022, Qiao and Wang, 2022a, Xu and Wang, 2022]. Among the optimal results, Azar et al. [2017] designed a UCB-based algorithm, UCBVI, and derived the minimax optimal regret bound O~(HSAT)\widetilde{O}(\sqrt{HSAT}) under stationary MDPs. Later, Zhang et al. [2020] achieved the optimal regret bound O~(H2SAT)\widetilde{O}(\sqrt{H^{2}SAT}) under non-stationary MDPs through a Q-learning-type algorithm, UCB-ADVANTAGE. Meanwhile, in addition to achieving an optimal regret bound, Dann et al. [2019] also provided policy certificates via their algorithm ORLC. Different from the minimax optimal algorithms above, Zanette and Brunskill [2019] designed the algorithm EULER and derived the first problem-dependent regret bound, which implies the minimax optimal regret.

Other differentially private reinforcement learning algorithms. In this paragraph, we discuss algorithms under linear MDP or linear mixture MDP. Under linear MDP, the only algorithm with a JDP guarantee, Private LSVI-UCB [Ngo et al., 2022], is a private version of LSVI-UCB [Jin et al., 2020], while the LDP setting under linear MDP remains open. Under linear mixture MDP, LinOpt-VI-Reg [Zhou, 2022] generalized UCRL-VTR [Ayoub et al., 2020] to guarantee JDP. In addition, Liao et al. [2021] also privatized UCRL-VTR to obtain an LDP guarantee. On the offline side, Qiao and Wang [2022b] provided the first result under linear MDP, building on VAPVI [Yin et al., 2022].

Appendix B Properties of private estimations

In this section, we present some useful concentration bounds for our private estimations that hold with high probability. Throughout the proof, we denote the non-private estimations by:

P^hk(s|s,a)=Nhk(s,a,s)Nhk(s,a),r^hk(s,a)=Rhk(s,a)Nhk(s,a).\begin{split}\widehat{P}^{k}_{h}(s^{\prime}|s,a)=\frac{N^{k}_{h}(s,a,s^{\prime})}{N^{k}_{h}(s,a)},\\ \widehat{r}^{k}_{h}(s,a)=\frac{R^{k}_{h}(s,a)}{N^{k}_{h}(s,a)}.\end{split} (13)

In addition, recall that our private estimations are defined as:

P~hk(s|s,a)=N~hk(s,a,s)N~hk(s,a),r~hk(s,a)=(R~hk(s,a)N~hk(s,a))[0,1].\begin{split}\widetilde{P}^{k}_{h}(s^{\prime}|s,a)=\frac{\widetilde{N}^{k}_{h}(s,a,s^{\prime})}{\widetilde{N}^{k}_{h}(s,a)},\\ \widetilde{r}^{k}_{h}(s,a)=\left(\frac{\widetilde{R}^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}\right)_{[0,1]}.\end{split} (14)
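To make the construction in (14) concrete, below is a minimal NumPy sketch of how the private estimates can be formed once the privatizer has released the noisy counts N~hk(s,a,s'), N~hk(s,a) and noisy cumulative rewards R~hk(s,a) (all assumed to satisfy Assumption 3.1); the function and variable names are ours and purely illustrative, not part of the algorithm's specification.

import numpy as np

def private_estimates(N_tilde_sas, N_tilde_sa, R_tilde_sa):
    # N_tilde_sas: array of shape (S, A, S), private counts N~_h^k(s, a, s').
    # N_tilde_sa:  array of shape (S, A),    private counts N~_h^k(s, a) (positive by Assumption 3.1).
    # R_tilde_sa:  array of shape (S, A),    private cumulative rewards R~_h^k(s, a).
    # Private transition estimate P~_h^k(s'|s,a) = N~_h^k(s,a,s') / N~_h^k(s,a), as in (14).
    P_tilde = N_tilde_sas / N_tilde_sa[:, :, None]
    # Private reward estimate, clipped to [0, 1]: this is the projection (.)_{[0,1]} in (14).
    r_tilde = np.clip(R_tilde_sa / N_tilde_sa, 0.0, 1.0)
    return P_tilde, r_tilde

Since Assumption 3.1 guarantees that the private counts never underestimate the true ones (and in particular are positive), no special handling of zero counts is needed in this sketch.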
Lemma B.1.

With probability 1β151-\frac{\beta}{15}, for all h,s,a,k[H]×𝒮×𝒜×[K]h,s,a,k\in[H]\times\mathcal{S}\times\mathcal{A}\times[K], it holds that:

|r~hk(s,a)rh(s,a)|2ιN~hk(s,a)+2Eϵ,βN~hk(s,a).\left|\widetilde{r}^{k}_{h}(s,a)-r_{h}(s,a)\right|\leq\sqrt{\frac{2\iota}{\widetilde{N}^{k}_{h}(s,a)}}+\frac{2E_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}. (15)
Proof of Lemma B.1.

We have for all h,s,a,k[H]×𝒮×𝒜×[K]h,s,a,k\in[H]\times\mathcal{S}\times\mathcal{A}\times[K],

|r~hk(s,a)rh(s,a)||R~hk(s,a)N~hk(s,a)rh(s,a)||R~hk(s,a)N~hk(s,a)Rhk(s,a)N~hk(s,a)|+|Rhk(s,a)N~hk(s,a)rh(s,a)|Eϵ,βN~hk(s,a)+|Nhk(s,a)N~hk(s,a)(Rhk(s,a)Nhk(s,a)rh(s,a))|+|rh(s,a)(Nhk(s,a)N~hk(s,a)1)|Eϵ,βN~hk(s,a)+Nhk(s,a)N~hk(s,a)2ιNhk(s,a)+Eϵ,βN~hk(s,a)2ιN~hk(s,a)+2Eϵ,βN~hk(s,a),\begin{split}&\left|\widetilde{r}^{k}_{h}(s,a)-r_{h}(s,a)\right|\leq\left|\frac{\widetilde{R}^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}-r_{h}(s,a)\right|\\ \leq&\left|\frac{\widetilde{R}^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}-\frac{R^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}\right|+\left|\frac{R^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}-r_{h}(s,a)\right|\\ \leq&\frac{E_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}+\left|\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}\left(\frac{R^{k}_{h}(s,a)}{N^{k}_{h}(s,a)}-r_{h}(s,a)\right)\right|+\left|r_{h}(s,a)\left(\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}-1\right)\right|\\ \leq&\frac{E_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}+\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}\cdot\sqrt{\frac{2\iota}{N^{k}_{h}(s,a)}}+\frac{E_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}\\ \leq&\sqrt{\frac{2\iota}{\widetilde{N}^{k}_{h}(s,a)}}+\frac{2E_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)},\end{split} (16)

where the third and last inequalities are because of Assumption 3.1. The fourth inequality holds with probability 1β151-\frac{\beta}{15} due to Hoeffding’s inequality and a union bound over h,s,a,kh,s,a,k. ∎

Lemma B.2.

With probability 1β151-\frac{\beta}{15}, for all h,s,a,k[H]×𝒮×𝒜×[K]h,s,a,k\in[H]\times\mathcal{S}\times\mathcal{A}\times[K], it holds that:

P~hk(|s,a)Ph(|s,a)12SιN~hk(s,a)+2SEϵ,βN~hk(s,a).\left\|\widetilde{P}^{k}_{h}(\cdot|s,a)-P_{h}(\cdot|s,a)\right\|_{1}\leq 2\sqrt{\frac{S\iota}{\widetilde{N}^{k}_{h}(s,a)}}+\frac{2SE_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}. (17)
Proof of Lemma B.2.

We have for all h,s,a,k[H]×𝒮×𝒜×[K]h,s,a,k\in[H]\times\mathcal{S}\times\mathcal{A}\times[K],

P~hk(|s,a)Ph(|s,a)1=s|P~hk(s|s,a)Ph(s|s,a)|s|N~hk(s,a,s)Nhk(s,a,s)N~hk(s,a)|+s|Nhk(s,a,s)N~hk(s,a)Ph(s|s,a)|SEϵ,βN~hk(s,a)+s|Nhk(s,a,s)Nhk(s,a)Nhk(s,a)N~hk(s,a)Ph(s|s,a)|SEϵ,βN~hk(s,a)+s|(Nhk(s,a,s)Nhk(s,a)Ph(s|s,a))Nhk(s,a)N~hk(s,a)|+s|Ph(s|s,a)(Nhk(s,a)N~hk(s,a)1)|SEϵ,βN~hk(s,a)+Nhk(s,a)N~hk(s,a)P^hk(|s,a)Ph(|s,a)1+s|Ph(s|s,a)Eϵ,βN~hk(s,a)|Nhk(s,a)N~hk(s,a)2SιNhk(s,a)+2SEϵ,βN~hk(s,a)2SιN~hk(s,a)+2SEϵ,βN~hk(s,a),\begin{split}&\left\|\widetilde{P}^{k}_{h}(\cdot|s,a)-P_{h}(\cdot|s,a)\right\|_{1}=\sum_{s^{\prime}}\left|\widetilde{P}^{k}_{h}(s^{\prime}|s,a)-P_{h}(s^{\prime}|s,a)\right|\\ \leq&\sum_{s^{\prime}}\left|\frac{\widetilde{N}^{k}_{h}(s,a,s^{\prime})-N^{k}_{h}(s,a,s^{\prime})}{\widetilde{N}^{k}_{h}(s,a)}\right|+\sum_{s^{\prime}}\left|\frac{N^{k}_{h}(s,a,s^{\prime})}{\widetilde{N}^{k}_{h}(s,a)}-P_{h}(s^{\prime}|s,a)\right|\\ \leq&\frac{SE_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}+\sum_{s^{\prime}}\left|\frac{N^{k}_{h}(s,a,s^{\prime})}{N^{k}_{h}(s,a)}\cdot\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}-P_{h}(s^{\prime}|s,a)\right|\\ \leq&\frac{SE_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}+\sum_{s^{\prime}}\left|\left(\frac{N^{k}_{h}(s,a,s^{\prime})}{N^{k}_{h}(s,a)}-P_{h}(s^{\prime}|s,a)\right)\cdot\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}\right|+\sum_{s^{\prime}}\left|P_{h}(s^{\prime}|s,a)\left(\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}-1\right)\right|\\ \leq&\frac{SE_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}+\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}\left\|\widehat{P}^{k}_{h}(\cdot|s,a)-P_{h}(\cdot|s,a)\right\|_{1}+\sum_{s^{\prime}}\left|P_{h}(s^{\prime}|s,a)\frac{E_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}\right|\\ \leq&\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}\cdot 2\sqrt{\frac{S\iota}{N^{k}_{h}(s,a)}}+\frac{2SE_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}\\ \leq&2\sqrt{\frac{S\iota}{\widetilde{N}^{k}_{h}(s,a)}}+\frac{2SE_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)},\end{split} (18)

where the second, fourth and last inequalities hold because of Assumption 3.1. The fifth inequality holds with probability 1β151-\frac{\beta}{15} according to Theorem 2.1 of Weissman et al. [2003] and a union bound. ∎

Remark B.3.

Similarly, we have for all h,s,a,k[H]×𝒮×𝒜×[K]h,s,a,k\in[H]\times\mathcal{S}\times\mathcal{A}\times[K],

P~hk(|s,a)P^hk(|s,a)1s|N~hk(s,a,s)N~hk(s,a)Nhk(s,a,s)N~hk(s,a)|+s|Nhk(s,a,s)N~hk(s,a)Nhk(s,a,s)Nhk(s,a)|2SEϵ,βN~hk(s,a).\begin{split}\left\|\widetilde{P}^{k}_{h}(\cdot|s,a)-\widehat{P}^{k}_{h}(\cdot|s,a)\right\|_{1}\leq&\sum_{s^{\prime}}\left|\frac{\widetilde{N}^{k}_{h}(s,a,s^{\prime})}{\widetilde{N}^{k}_{h}(s,a)}-\frac{N^{k}_{h}(s,a,s^{\prime})}{\widetilde{N}^{k}_{h}(s,a)}\right|+\sum_{s^{\prime}}\left|\frac{N^{k}_{h}(s,a,s^{\prime})}{\widetilde{N}^{k}_{h}(s,a)}-\frac{N^{k}_{h}(s,a,s^{\prime})}{N^{k}_{h}(s,a)}\right|\\ \leq&\frac{2SE_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}.\end{split} (19)
Lemma B.4.

With probability 1β151-\frac{\beta}{15}, for all h,s,a,s,k[H]×𝒮×𝒜×𝒮×[K]h,s,a,s^{\prime},k\in[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{S}\times[K], it holds that:

|P~hk(s|s,a)Ph(s|s,a)|2Ph(s|s,a)ιN~hk(s,a)+2Eϵ,βιN~hk(s,a).\left|\widetilde{P}^{k}_{h}(s^{\prime}|s,a)-P_{h}(s^{\prime}|s,a)\right|\leq\sqrt{\frac{2P_{h}(s^{\prime}|s,a)\iota}{\widetilde{N}^{k}_{h}(s,a)}}+\frac{2E_{\epsilon,\beta}\iota}{\widetilde{N}^{k}_{h}(s,a)}. (20)
Proof of Lemma B.4.

We have for all h,s,a,s,k[H]×𝒮×𝒜×𝒮×[K]h,s,a,s^{\prime},k\in[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{S}\times[K],

|P~hk(s|s,a)Ph(s|s,a)||N~hk(s,a,s)Nhk(s,a,s)N~hk(s,a)|+|Nhk(s,a,s)N~hk(s,a)Ph(s|s,a)|Eϵ,βN~hk(s,a)+|Nhk(s,a,s)Nhk(s,a)Nhk(s,a)N~hk(s,a)Ph(s|s,a)|Eϵ,βN~hk(s,a)+|(Nhk(s,a,s)Nhk(s,a)Ph(s|s,a))Nhk(s,a)N~hk(s,a)|+|Ph(s|s,a)(Nhk(s,a)N~hk(s,a)1)|2Eϵ,βN~hk(s,a)+Nhk(s,a)N~hk(s,a)|P^hk(s|s,a)Ph(s|s,a)|2Eϵ,βN~hk(s,a)+Nhk(s,a)N~hk(s,a)(2Ph(s|s,a)ιNhk(s,a)+2ι3Nhk(s,a))2Ph(s|s,a)ιN~hk(s,a)+2Eϵ,βιN~hk(s,a),\begin{split}&\left|\widetilde{P}^{k}_{h}(s^{\prime}|s,a)-P_{h}(s^{\prime}|s,a)\right|\leq\left|\frac{\widetilde{N}^{k}_{h}(s,a,s^{\prime})-N^{k}_{h}(s,a,s^{\prime})}{\widetilde{N}^{k}_{h}(s,a)}\right|+\left|\frac{N^{k}_{h}(s,a,s^{\prime})}{\widetilde{N}^{k}_{h}(s,a)}-P_{h}(s^{\prime}|s,a)\right|\\ \leq&\frac{E_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}+\left|\frac{N^{k}_{h}(s,a,s^{\prime})}{N^{k}_{h}(s,a)}\cdot\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}-P_{h}(s^{\prime}|s,a)\right|\\ \leq&\frac{E_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}+\left|\left(\frac{N^{k}_{h}(s,a,s^{\prime})}{N^{k}_{h}(s,a)}-P_{h}(s^{\prime}|s,a)\right)\cdot\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}\right|+\left|P_{h}(s^{\prime}|s,a)\left(\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}-1\right)\right|\\ \leq&\frac{2E_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}+\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}\cdot\left|\widehat{P}^{k}_{h}(s^{\prime}|s,a)-P_{h}(s^{\prime}|s,a)\right|\\ \leq&\frac{2E_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}+\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}\cdot\left(\sqrt{\frac{2P_{h}(s^{\prime}|s,a)\iota}{N^{k}_{h}(s,a)}}+\frac{2\iota}{3N^{k}_{h}(s,a)}\right)\\ \leq&\sqrt{\frac{2P_{h}(s^{\prime}|s,a)\iota}{\widetilde{N}^{k}_{h}(s,a)}}+\frac{2E_{\epsilon,\beta}\iota}{\widetilde{N}^{k}_{h}(s,a)},\end{split} (21)

where the second, fourth and last inequalities result from Assumption 3.1. The fifth inequality holds with probability 1β151-\frac{\beta}{15} due to Bernstein’s inequality and a union bound. ∎

Remark B.5.

Similarly, we have for all h,s,a,s,k[H]×𝒮×𝒜×𝒮×[K]h,s,a,s^{\prime},k\in[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{S}\times[K],

|P~hk(s|s,a)P^hk(s|s,a)||N~hk(s,a,s)N~hk(s,a)Nhk(s,a,s)N~hk(s,a)|+|Nhk(s,a,s)N~hk(s,a)Nhk(s,a,s)Nhk(s,a)|Eϵ,βN~hk(s,a)+Nhk(s,a,s)Eϵ,βN~hk(s,a)Nhk(s,a)2Eϵ,βN~hk(s,a).\begin{split}\left|\widetilde{P}^{k}_{h}(s^{\prime}|s,a)-\widehat{P}^{k}_{h}(s^{\prime}|s,a)\right|\leq&\left|\frac{\widetilde{N}^{k}_{h}(s,a,s^{\prime})}{\widetilde{N}^{k}_{h}(s,a)}-\frac{N^{k}_{h}(s,a,s^{\prime})}{\widetilde{N}^{k}_{h}(s,a)}\right|+\left|\frac{N^{k}_{h}(s,a,s^{\prime})}{\widetilde{N}^{k}_{h}(s,a)}-\frac{N^{k}_{h}(s,a,s^{\prime})}{N^{k}_{h}(s,a)}\right|\\ \leq&\frac{E_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}+\frac{N^{k}_{h}(s,a,s^{\prime})E_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)\cdot N^{k}_{h}(s,a)}\\ \leq&\frac{2E_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}.\end{split} (22)
Lemma B.6.

With probability 12β151-\frac{2\beta}{15}, for all h,s,a,k[H]×𝒮×𝒜×[K]h,s,a,k\in[H]\times\mathcal{S}\times\mathcal{A}\times[K], it holds that:

|(P~hkPh)Vh+1(s,a)|min{2VarPh(|s,a)Vh+1()ιN~hk(s,a),2VarP^hk(|s,a)Vh+1()ιN~hk(s,a)}+2HSEϵ,βιN~hk(s,a).\left|\left(\widetilde{P}^{k}_{h}-P_{h}\right)\cdot V^{\star}_{h+1}(s,a)\right|\leq\min\left\{\sqrt{\frac{2\mathrm{Var}_{P_{h}(\cdot|s,a)}V^{\star}_{h+1}(\cdot)\cdot\iota}{\widetilde{N}^{k}_{h}(s,a)}},\sqrt{\frac{2\mathrm{Var}_{\widehat{P}^{k}_{h}(\cdot|s,a)}V^{\star}_{h+1}(\cdot)\cdot\iota}{\widetilde{N}^{k}_{h}(s,a)}}\right\}+\frac{2HSE_{\epsilon,\beta}\iota}{\widetilde{N}^{k}_{h}(s,a)}. (23)
Proof of Lemma B.6.

We have for all h,s,a,k[H]×𝒮×𝒜×[K]h,s,a,k\in[H]\times\mathcal{S}\times\mathcal{A}\times[K],

|(P~hkPh)Vh+1(s,a)||s(P~hk(s|s,a)Ph(s|s,a))Vh+1(s)||sN~hk(s,a,s)Nhk(s,a,s)N~hk(s,a)Vh+1(s)|+|s(Nhk(s,a,s)N~hk(s,a)Ph(s|s,a))Vh+1(s)|HSEϵ,βN~hk(s,a)+|Nhk(s,a)N~hk(s,a)s(Nhk(s,a,s)Nhk(s,a)Ph(s|s,a))Vh+1(s)|+|sPh(s|s,a)Vh+1(s)(Nhk(s,a)N~hk(s,a)1)|2HSEϵ,βN~hk(s,a)+|Nhk(s,a)N~hk(s,a)s(Nhk(s,a,s)Nhk(s,a)Ph(s|s,a))Vh+1(s)|2HSEϵ,βN~hk(s,a)+Nhk(s,a)N~hk(s,a)min{2VarPh(|s,a)Vh+1()ιNhk(s,a)+2Hι3Nhk(s,a),2VarP^hk(|s,a)Vh+1()ιNhk(s,a)+7Hι3Nhk(s,a)}min{2VarPh(|s,a)Vh+1()ιN~hk(s,a),2VarP^hk(|s,a)Vh+1()ιN~hk(s,a)}+2HSEϵ,βιN~hk(s,a),\begin{split}&\left|\left(\widetilde{P}^{k}_{h}-P_{h}\right)\cdot V^{\star}_{h+1}(s,a)\right|\leq\left|\sum_{s^{\prime}}\left(\widetilde{P}^{k}_{h}(s^{\prime}|s,a)-P_{h}(s^{\prime}|s,a)\right)V^{\star}_{h+1}(s^{\prime})\right|\\ \leq&\left|\sum_{s^{\prime}}\frac{\widetilde{N}^{k}_{h}(s,a,s^{\prime})-N^{k}_{h}(s,a,s^{\prime})}{\widetilde{N}^{k}_{h}(s,a)}\cdot V^{\star}_{h+1}(s^{\prime})\right|+\left|\sum_{s^{\prime}}\left(\frac{N^{k}_{h}(s,a,s^{\prime})}{\widetilde{N}^{k}_{h}(s,a)}-P_{h}(s^{\prime}|s,a)\right)V^{\star}_{h+1}(s^{\prime})\right|\\ \leq&\frac{HSE_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}+\left|\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}\cdot\sum_{s^{\prime}}\left(\frac{N^{k}_{h}(s,a,s^{\prime})}{N^{k}_{h}(s,a)}-P_{h}(s^{\prime}|s,a)\right)V^{\star}_{h+1}(s^{\prime})\right|+\left|\sum_{s^{\prime}}P_{h}(s^{\prime}|s,a)V^{\star}_{h+1}(s^{\prime})\left(\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}-1\right)\right|\\ \leq&\frac{2HSE_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}+\left|\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}\cdot\sum_{s^{\prime}}\left(\frac{N^{k}_{h}(s,a,s^{\prime})}{N^{k}_{h}(s,a)}-P_{h}(s^{\prime}|s,a)\right)V^{\star}_{h+1}(s^{\prime})\right|\\ \leq&\frac{2HSE_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}+\frac{N^{k}_{h}(s,a)}{\widetilde{N}^{k}_{h}(s,a)}\cdot\min\left\{\sqrt{\frac{2\mathrm{Var}_{P_{h}(\cdot|s,a)}V^{\star}_{h+1}(\cdot)\cdot\iota}{N^{k}_{h}(s,a)}}+\frac{2H\iota}{3N^{k}_{h}(s,a)},\sqrt{\frac{2\mathrm{Var}_{\widehat{P}^{k}_{h}(\cdot|s,a)}V^{\star}_{h+1}(\cdot)\cdot\iota}{N^{k}_{h}(s,a)}}+\frac{7H\iota}{3N^{k}_{h}(s,a)}\right\}\\ \leq&\min\left\{\sqrt{\frac{2\mathrm{Var}_{P_{h}(\cdot|s,a)}V^{\star}_{h+1}(\cdot)\cdot\iota}{\widetilde{N}^{k}_{h}(s,a)}},\sqrt{\frac{2\mathrm{Var}_{\widehat{P}^{k}_{h}(\cdot|s,a)}V^{\star}_{h+1}(\cdot)\cdot\iota}{\widetilde{N}^{k}_{h}(s,a)}}\right\}+\frac{2HSE_{\epsilon,\beta}\iota}{\widetilde{N}^{k}_{h}(s,a)},\end{split} (24)

where the third, fourth and last inequalities come from Assumption 3.1. The fifth inequality holds with probability 12β151-\frac{2\beta}{15} because of Bernstein’s inequality, the empirical Bernstein inequality and a union bound. ∎

Remark B.7.

Similarly, we have for all h,s,a,k[H]×𝒮×𝒜×[K]h,s,a,k\in[H]\times\mathcal{S}\times\mathcal{A}\times[K],

|(P~hkP^hk)Vh+1(s,a)|HP~hk(|s,a)P^hk(|s,a)12HSEϵ,βN~hk(s,a),\begin{split}\left|\left(\widetilde{P}^{k}_{h}-\widehat{P}^{k}_{h}\right)\cdot V^{\star}_{h+1}(s,a)\right|\leq H\cdot\left\|\widetilde{P}^{k}_{h}(\cdot|s,a)-\widehat{P}^{k}_{h}(\cdot|s,a)\right\|_{1}\leq\frac{2HSE_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)},\end{split} (25)

where the last inequality results from Remark B.3.

Combining all of the concentration bounds above, we have the following lemma.

Lemma B.8.

Under the high probability event that Assumption 3.1 holds, with probability at least 1β31-\frac{\beta}{3}, the conclusions in Lemma B.1, Lemma B.2, Lemma B.4, Lemma B.6, Remark B.3, Remark B.5 and Remark B.7 hold simultaneously.

In the following proofs, we work under the high probability event where Assumption 3.1 and Lemma B.8 hold. Lastly, we state the following lemma regarding the difference of variances.

Lemma B.9 (Lemma C.5 of Qiao and Wang [2022b]).

For any function VSV\in\mathbb{R}^{S} such that VH\|V\|_{\infty}\leq H, it holds that

|VarP~hk(|s,a)(V)VarP^hk(|s,a)(V)|3HP~hk(|s,a)P^hk(|s,a)1.\left|\sqrt{\mathrm{Var}_{\widetilde{P}^{k}_{h}(\cdot|s,a)}(V)}-\sqrt{\mathrm{Var}_{\widehat{P}^{k}_{h}(\cdot|s,a)}(V)}\right|\leq\sqrt{3}H\cdot\sqrt{\left\|\widetilde{P}^{k}_{h}(\cdot|s,a)-\widehat{P}^{k}_{h}(\cdot|s,a)\right\|_{1}}. (26)

In addition, according to Remark B.3, the left hand side can be further bounded by

|VarP~hk(|s,a)(V)VarP^hk(|s,a)(V)|3HSEϵ,βN~hk(s,a).\left|\sqrt{\mathrm{Var}_{\widetilde{P}^{k}_{h}(\cdot|s,a)}(V)}-\sqrt{\mathrm{Var}_{\widehat{P}^{k}_{h}(\cdot|s,a)}(V)}\right|\leq 3H\sqrt{\frac{SE_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}}. (27)
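For completeness, the step from (26) to (27) is the following one-line calculation, which plugs the bound of Remark B.3 into (26) and uses \sqrt{6}\leq 3:

\sqrt{3}H\cdot\sqrt{\left\|\widetilde{P}^{k}_{h}(\cdot|s,a)-\widehat{P}^{k}_{h}(\cdot|s,a)\right\|_{1}}\leq\sqrt{3}H\cdot\sqrt{\frac{2SE_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}}=\sqrt{6}\,H\sqrt{\frac{SE_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}}\leq 3H\sqrt{\frac{SE_{\epsilon,\beta}}{\widetilde{N}^{k}_{h}(s,a)}}.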

Appendix C Proof of Theorem 4.1

In this section, we assume the conclusions in Assumption 3.1 and Lemma B.8 hold and prove the regret bound.

C.1 Some preparations

C.1.1 Notations

For all i,j[K]×[H]i,j\in[K]\times[H], we define the following variances, which will be used throughout the proof.

Vi,jπ=VarPj(|sji,aji)Vj+1πi().V^{\pi}_{i,j}=\mathrm{Var}_{P_{j}(\cdot|s^{i}_{j},a^{i}_{j})}V^{\pi_{i}}_{j+1}(\cdot). (28)
Vi,j=VarPj(|sji,aji)Vj+1().V^{\star}_{i,j}=\mathrm{Var}_{P_{j}(\cdot|s^{i}_{j},a^{i}_{j})}V^{\star}_{j+1}(\cdot). (29)
V~i,j=VarP~ji(|sji,aji)V~j+1i().\widetilde{V}_{i,j}=\mathrm{Var}_{\widetilde{P}^{i}_{j}(\cdot|s^{i}_{j},a^{i}_{j})}\widetilde{V}^{i}_{j+1}(\cdot). (30)

Next, recall the definition of our private bonus bhkb^{k}_{h}.

bhk(s,a)=2VarsP~hk(|s,a)V~h+1k()ιN~hk(s,a)+2ιN~hk(s,a)+20HSEϵ,βιN~hk(s,a)+4ιsP~hk(s|s,a)min{10002H3SAι2N~h+1k(s)+10002H4S4A2Eϵ,β2ι4N~h+1k(s)2+10002H6S4A2ι4N~h+1k(s)2,H2}N~hk(s,a).\begin{split}b^{k}_{h}(s,a)=&2\sqrt{\frac{\mathrm{Var}_{s^{\prime}\sim\widetilde{P}_{h}^{k}(\cdot|s,a)}\widetilde{V}^{k}_{h+1}(\cdot)\cdot\iota}{\widetilde{N}^{k}_{h}(s,a)}}+\sqrt{\frac{2\iota}{\widetilde{N}^{k}_{h}(s,a)}}+\frac{20HSE_{\epsilon,\beta}\cdot\iota}{\widetilde{N}^{k}_{h}(s,a)}\\ +&4\sqrt{\iota}\cdot\sqrt{\frac{\sum_{s^{\prime}}\widetilde{P}^{k}_{h}(s^{\prime}|s,a)\min\left\{\frac{1000^{2}H^{3}SA\iota^{2}}{\widetilde{N}^{k}_{h+1}(s^{\prime})}+\frac{1000^{2}H^{4}S^{4}A^{2}E_{\epsilon,\beta}^{2}\iota^{4}}{\widetilde{N}^{k}_{h+1}(s^{\prime})^{2}}+\frac{1000^{2}H^{6}S^{4}A^{2}\iota^{4}}{\widetilde{N}^{k}_{h+1}(s^{\prime})^{2}},H^{2}\right\}}{\widetilde{N}^{k}_{h}(s,a)}}.\end{split} (31)

According to Assumption 3.1, the private visitation numbers never underestimate the real ones; therefore, it holds that

bhk(s,a)2VarsP~hk(|s,a)V~h+1k()ιNhk(s,a)+2ιNhk(s,a)+20HSEϵ,βιNhk(s,a)bh,1k(s,a)+4ιsP~hk(s|s,a)min{10002H3SAι2Nh+1k(s)+10002H4S4A2Eϵ,β2ι4Nh+1k(s)2+10002H6S4A2ι4Nh+1k(s)2,H2}Nhk(s,a)bh,2k(s,a).\begin{split}b^{k}_{h}(s,a)\leq&\underbrace{2\sqrt{\frac{\mathrm{Var}_{s^{\prime}\sim\widetilde{P}_{h}^{k}(\cdot|s,a)}\widetilde{V}^{k}_{h+1}(\cdot)\cdot\iota}{N^{k}_{h}(s,a)}}+\sqrt{\frac{2\iota}{N^{k}_{h}(s,a)}}+\frac{20HSE_{\epsilon,\beta}\cdot\iota}{N^{k}_{h}(s,a)}}_{\mathrm{b^{k}_{h,1}(s,a)}}\\ +&\underbrace{4\sqrt{\iota}\cdot\sqrt{\frac{\sum_{s^{\prime}}\widetilde{P}^{k}_{h}(s^{\prime}|s,a)\min\left\{\frac{1000^{2}H^{3}SA\iota^{2}}{N^{k}_{h+1}(s^{\prime})}+\frac{1000^{2}H^{4}S^{4}A^{2}E_{\epsilon,\beta}^{2}\iota^{4}}{N^{k}_{h+1}(s^{\prime})^{2}}+\frac{1000^{2}H^{6}S^{4}A^{2}\iota^{4}}{N^{k}_{h+1}(s^{\prime})^{2}},H^{2}\right\}}{N^{k}_{h}(s,a)}}}_{b^{k}_{h,2}(s,a)}.\end{split} (32)

For the analysis later, we define b^hk(s,a):=2bh,1k(s,a)+bh,2k(s,a)\widehat{b}^{k}_{h}(s,a):=2b^{k}_{h,1}(s,a)+b^{k}_{h,2}(s,a).
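As a concrete illustration of how the bonus in (31) is evaluated, the following NumPy sketch computes b_h^k(s,a) for a single state-action pair from the privately released quantities; the function and argument names are ours, and we assume the private counts satisfy Assumption 3.1 (in particular, they are positive).

import numpy as np

def private_bonus(P_tilde_sa, V_next, N_tilde_sa, N_tilde_next, H, S, A, iota, E_eps_beta):
    # P_tilde_sa:   array of shape (S,), private transition row P~_h^k(.|s,a).
    # V_next:       array of shape (S,), private value estimate V~_{h+1}^k(.).
    # N_tilde_sa:   scalar, private count N~_h^k(s,a).
    # N_tilde_next: array of shape (S,), private counts N~_{h+1}^k(s') for every s'.
    # Variance of V~_{h+1}^k under the private transition row (first term of (31)).
    mean = P_tilde_sa @ V_next
    var = max(float(P_tilde_sa @ V_next**2 - mean**2), 0.0)
    bonus = (2.0 * np.sqrt(var * iota / N_tilde_sa)
             + np.sqrt(2.0 * iota / N_tilde_sa)
             + 20.0 * H * S * E_eps_beta * iota / N_tilde_sa)
    # Last term of (31): correction compensating for the gap between V~_{h+1}^k and V*_{h+1}.
    c = 1000.0 ** 2
    inner = np.minimum(
        c * H**3 * S * A * iota**2 / N_tilde_next
        + c * H**4 * S**4 * A**2 * E_eps_beta**2 * iota**4 / N_tilde_next**2
        + c * H**6 * S**4 * A**2 * iota**4 / N_tilde_next**2,
        H**2)
    bonus += 4.0 * np.sqrt(iota) * np.sqrt(float(P_tilde_sa @ inner) / N_tilde_sa)
    return bonus

Replacing the private counts in the denominators by the true counts only increases each term (since the private counts never underestimate the true ones), which is exactly the upper bound b_{h,1}^k(s,a)+b_{h,2}^k(s,a) of (32).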

In addition, we define the following three terms for all (i,j)[K]×[H](i,j)\in[K]\times[H]:

c4,i,j=H2SιNji(sji,aji),c1,i,j=2Vi,jιNji(sji,aji),b^i,j=b^ji(sji,aji).c_{4,i,j}=\frac{H^{2}S\iota}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})},\,\,c_{1,i,j}=\sqrt{\frac{2V_{i,j}^{\star}\iota}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}},\,\,\widehat{b}_{i,j}=\widehat{b}^{i}_{j}(s^{i}_{j},a^{i}_{j}). (33)

C.1.2 Typical episodes

Now we define the typical episodes and the typical episodes with respect to (h,s)[H]×𝒮(h,s)\in[H]\times\mathcal{S}. Briefly speaking, typical episodes ensure that the total number of episodes, or the number of visits to a given state, is large enough.

Definition C.1 (Typical episodes).

We define the general typical episodes as [k]typ={i:i[k],i250H2S2Aι2}[k]_{\text{typ}}=\{i:i\in[k],\,i\geq 250H^{2}S^{2}A\iota^{2}\}. Also, we define typical episodes with respect to (h,s)[H]×𝒮(h,s)\in[H]\times\mathcal{S} as:

[k]typ,h,s={i[k]:Nhi(s)250H2S2Aι2},[k]_{\text{typ},h,s}=\{i\in[k]:N^{i}_{h}(s)\geq 250H^{2}S^{2}A\iota^{2}\},

where Nhi(s)N^{i}_{h}(s) is the real visitation number of (h,s)(h,s) before episode ii.

According to Definition C.1 above, it is clear that

H|[k]/[k]typ|250H3S2Aι2.H\cdot\left|[k]/[k]_{typ}\right|\leq 250H^{3}S^{2}A\iota^{2}. (34)
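For completeness, (34) follows directly from Definition C.1: every non-typical episode has index smaller than 250H^{2}S^{2}A\iota^{2}, so

\left|[k]/[k]_{\text{typ}}\right|\leq 250H^{2}S^{2}A\iota^{2}\quad\Longrightarrow\quad H\cdot\left|[k]/[k]_{\text{typ}}\right|\leq 250H^{3}S^{2}A\iota^{2}.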

In the following proof, when we consider summations over episodes, it suffices to consider only the typical episodes: each non-typical episode contributes at most H to the regret, so by (34) all of them together contribute at most the additive term in (34) to the final regret bound.

Finally, we define the following summations for all k,h,s[K]×[H]×𝒮k,h,s\in[K]\times[H]\times\mathcal{S}:

Ck=i=1k𝟙(i[k]typ)j=1H(c1,i,j+c4,i,j).C_{k}=\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}(c_{1,i,j}+c_{4,i,j}). (35)
Bk=i=1k𝟙(i[k]typ)j=1Hb^i,j.B_{k}=\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\widehat{b}_{i,j}. (36)
Ck,h,s=i=1k𝟙(shi=s,i[k]typ,h,s)j=hH(c1,i,j+c4,i,j).C_{k,h,s}=\sum_{i=1}^{k}\mathds{1}(s_{h}^{i}=s,\,i\in[k]_{\text{typ},h,s})\sum_{j=h}^{H}(c_{1,i,j}+c_{4,i,j}). (37)
Bk,h,s=i=1k𝟙(shi=s,i[k]typ,h,s)j=hHb^i,j.B_{k,h,s}=\sum_{i=1}^{k}\mathds{1}(s_{h}^{i}=s,\,i\in[k]_{\text{typ},h,s})\sum_{j=h}^{H}\widehat{b}_{i,j}. (38)

C.2 Our induction

Since we apply a Bernstein-type bonus, different from Chowdhury and Zhou [2021], establishing optimism is not straightforward. This is because even if V~hk\widetilde{V}^{k}_{h} is an upper bound of VhV^{\star}_{h} and P~hk\widetilde{P}^{k}_{h} is close to P^hk\widehat{P}^{k}_{h}, VarP~hk(|s,a)V~h+1k()\mathrm{Var}_{\widetilde{P}_{h}^{k}(\cdot|s,a)}\widetilde{V}^{k}_{h+1}(\cdot) is not necessarily an upper bound of VarP^hk(|s,a)Vh+1()\mathrm{Var}_{\widehat{P}^{k}_{h}(\cdot|s,a)}V^{\star}_{h+1}(\cdot). However, we can prove by induction that V~h+1k\widetilde{V}^{k}_{h+1} is close enough to Vh+1V^{\star}_{h+1}, and therefore the last term of bhkb^{k}_{h} will be sufficiently large to make V~hk\widetilde{V}^{k}_{h} a valid upper bound of VhV^{\star}_{h}. More precisely, our induction proceeds as follows:

  1. 1.

    Assuming that for all (i,h,s,a)[k]×[H]×𝒮×𝒜(i,h,s,a)\in[k]\times[H]\times\mathcal{S}\times\mathcal{A}, Q~hi(s,a)Qh(s,a)\widetilde{Q}^{i}_{h}(s,a)\geq Q^{\star}_{h}(s,a), we prove that for all (h,s)[H]×𝒮(h,s)\in[H]\times\mathcal{S},

    V~hk(s)Vh(s)O~(SAH3/Nhk(s)+S2AH2Eϵ,β/Nhk(s)+S2AH3/Nhk(s)).\widetilde{V}^{k}_{h}(s)-V^{\star}_{h}(s)\leq\widetilde{O}\left(\sqrt{SAH^{3}/N^{k}_{h}(s)}+S^{2}AH^{2}E_{\epsilon,\beta}/N^{k}_{h}(s)+S^{2}AH^{3}/N^{k}_{h}(s)\right).
  2. 2.

    We deduce that the last term of bhkb^{k}_{h} compensates for the possible variance difference, and hence for all (h,s,a)[H]×𝒮×𝒜(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}, Q~hk+1(s,a)Qh(s,a)\widetilde{Q}^{k+1}_{h}(s,a)\geq Q^{\star}_{h}(s,a).

Next, we first prove point 1 above under optimism in Sections C.3, C.4 and C.5, and then prove optimism (point 2 above) based on point 1 in Section C.6.

C.3 Error decomposition

We define δi,h:=Vh(shi)Vhπi(shi)\delta_{i,h}:=V^{\star}_{h}(s_{h}^{i})-V_{h}^{\pi_{i}}(s_{h}^{i}) and δ~i,h:=V~hi(shi)Vhπi(shi)\widetilde{\delta}_{i,h}:=\widetilde{V}^{i}_{h}(s^{i}_{h})-V^{\pi_{i}}_{h}(s^{i}_{h}). Now we provide the error decomposition below, based on optimism, for all (i,h)[k]×[H](i,h)\in[k]\times[H],

δi,hδ~i,h=V~hi(shi)Vhπi(shi)=Q~hi(shi,ahi)Qhπi(shi,ahi)r~hi(shi,ahi)+P~hiV~h+1i(shi,ahi)+bhi(shi,ahi)rh(shi,ahi)PhVh+1πi(shi,ahi)bhi(shi,ahi)+2ιNhi(shi,ahi)+2Eϵ,βNhi(shi,ahi)+(P~hiPh)Vh+1(shi,ahi)+(P~hiPh)(V~h+1iVh+1)(shi,ahi)+Ph(V~h+1iVh+1πi)(shi,ahi)bhi(shi,ahi)+bh,1i(shi,ahi)+c1,i,h+(P~hiPh)(V~h+1iVh+1)(shi,ahi)+Ph(V~h+1iVh+1πi)(shi,ahi)2HSEϵ,βιNhi(shi,ahi)b^i,h+c1,i,h+(P~hiPh)(V~h+1iVh+1)(shi,ahi)+Ph(V~h+1iVh+1πi)(shi,ahi)2HSEϵ,βιNhi(shi,ahi).\begin{split}&\delta_{i,h}\leq\widetilde{\delta}_{i,h}=\widetilde{V}^{i}_{h}(s^{i}_{h})-V_{h}^{\pi_{i}}(s^{i}_{h})=\widetilde{Q}^{i}_{h}(s^{i}_{h},a^{i}_{h})-Q^{\pi_{i}}_{h}(s^{i}_{h},a^{i}_{h})\\ \leq&\widetilde{r}^{i}_{h}(s^{i}_{h},a^{i}_{h})+\widetilde{P}^{i}_{h}\cdot\widetilde{V}^{i}_{h+1}(s^{i}_{h},a^{i}_{h})+b^{i}_{h}(s^{i}_{h},a^{i}_{h})-r_{h}(s^{i}_{h},a^{i}_{h})-P_{h}\cdot V^{\pi_{i}}_{h+1}(s^{i}_{h},a^{i}_{h})\\ \leq&b^{i}_{h}(s^{i}_{h},a^{i}_{h})+\sqrt{\frac{2\iota}{N^{i}_{h}(s^{i}_{h},a^{i}_{h})}}+\frac{2E_{\epsilon,\beta}}{N^{i}_{h}(s^{i}_{h},a^{i}_{h})}+(\widetilde{P}^{i}_{h}-P_{h})\cdot V^{\star}_{h+1}(s^{i}_{h},a^{i}_{h})+(\widetilde{P}^{i}_{h}-P_{h})\cdot(\widetilde{V}^{i}_{h+1}-V^{\star}_{h+1})(s^{i}_{h},a^{i}_{h})\\ +&P_{h}\cdot(\widetilde{V}^{i}_{h+1}-V^{\pi_{i}}_{h+1})(s^{i}_{h},a^{i}_{h})\\ \leq&b^{i}_{h}(s^{i}_{h},a^{i}_{h})+b^{i}_{h,1}(s^{i}_{h},a^{i}_{h})+c_{1,i,h}+(\widetilde{P}^{i}_{h}-P_{h})\cdot(\widetilde{V}^{i}_{h+1}-V^{\star}_{h+1})(s^{i}_{h},a^{i}_{h})+P_{h}\cdot(\widetilde{V}^{i}_{h+1}-V^{\pi_{i}}_{h+1})(s^{i}_{h},a^{i}_{h})-\frac{2HSE_{\epsilon,\beta}\iota}{N^{i}_{h}(s^{i}_{h},a^{i}_{h})}\\ \leq&\widehat{b}_{i,h}+c_{1,i,h}+(\widetilde{P}^{i}_{h}-P_{h})\cdot(\widetilde{V}^{i}_{h+1}-V^{\star}_{h+1})(s^{i}_{h},a^{i}_{h})+P_{h}\cdot(\widetilde{V}^{i}_{h+1}-V^{\pi_{i}}_{h+1})(s^{i}_{h},a^{i}_{h})-\frac{2HSE_{\epsilon,\beta}\iota}{N^{i}_{h}(s^{i}_{h},a^{i}_{h})}.\end{split} (39)

The second inequality is because of the definition of Q~hi\widetilde{Q}^{i}_{h}. The third inequality results from Lemma B.1. The fourth inequality holds because of Lemma B.6 and the definitions of bh,1ib^{i}_{h,1} and c1,i,hc_{1,i,h}. The last inequality holds due to the definition of b^i,h\widehat{b}_{i,h}.

In addition, we have:

(P~hiPh)(V~h+1iVh+1)(shi,ahi)=s(P~hi(s|shi,ahi)Ph(s|shi,ahi))(V~h+1i(s)Vh+1(s))s(2Ph(s|shi,ahi)ιNhi(shi,ahi)+2Eϵ,βιNhi(shi,ahi))(V~h+1i(s)Vh+1(s))s(Ph(s|shi,ahi)H+HιNhi(shi,ahi)+2Eϵ,βιNhi(shi,ahi))(V~h+1i(s)Vh+1(s))1HPh(V~h+1iVh+1πi)(shi,ahi)+H2SιNhi(shi,ahi)+2HSEϵ,βιNhi(shi,ahi).\begin{split}&(\widetilde{P}^{i}_{h}-P_{h})\cdot(\widetilde{V}^{i}_{h+1}-V^{\star}_{h+1})(s^{i}_{h},a^{i}_{h})=\sum_{s^{\prime}}\left(\widetilde{P}^{i}_{h}(s^{\prime}|s^{i}_{h},a^{i}_{h})-P_{h}(s^{\prime}|s^{i}_{h},a^{i}_{h})\right)\cdot\left(\widetilde{V}^{i}_{h+1}(s^{\prime})-V^{\star}_{h+1}(s^{\prime})\right)\\ \leq&\sum_{s^{\prime}}\left(\sqrt{\frac{2P_{h}(s^{\prime}|s^{i}_{h},a^{i}_{h})\iota}{N^{i}_{h}(s^{i}_{h},a^{i}_{h})}}+\frac{2E_{\epsilon,\beta}\iota}{N^{i}_{h}(s^{i}_{h},a^{i}_{h})}\right)\cdot\left(\widetilde{V}^{i}_{h+1}(s^{\prime})-V^{\star}_{h+1}(s^{\prime})\right)\\ \leq&\sum_{s^{\prime}}\left(\frac{P_{h}(s^{\prime}|s^{i}_{h},a^{i}_{h})}{H}+\frac{H\iota}{N^{i}_{h}(s^{i}_{h},a^{i}_{h})}+\frac{2E_{\epsilon,\beta}\iota}{N^{i}_{h}(s^{i}_{h},a^{i}_{h})}\right)\cdot\left(\widetilde{V}^{i}_{h+1}(s^{\prime})-V^{\star}_{h+1}(s^{\prime})\right)\\ \leq&\frac{1}{H}P_{h}\cdot(\widetilde{V}^{i}_{h+1}-V^{\pi_{i}}_{h+1})(s^{i}_{h},a^{i}_{h})+\frac{H^{2}S\iota}{N^{i}_{h}(s^{i}_{h},a^{i}_{h})}+\frac{2HSE_{\epsilon,\beta}\iota}{N^{i}_{h}(s^{i}_{h},a^{i}_{h})}.\end{split} (40)

The first inequality is because of Lemma B.4. The second inequality follows from the AM-GM inequality. The last inequality results from the fact that Vh+1Vh+1πiV^{\star}_{h+1}\geq V^{\pi_{i}}_{h+1}.

Plugging (40) into (39), we have:

δi,hδ~i,hb^i,h+c1,i,h+c4,i,h+(1+1H)Ph(V~h+1iVh+1πi)(shi,ahi)=(1+1H)δ~i,h+1+b^i,h+c1,i,h+c4,i,h+(1+1H)ϵi,h,\begin{split}\delta_{i,h}\leq\widetilde{\delta}_{i,h}\leq&\widehat{b}_{i,h}+c_{1,i,h}+c_{4,i,h}+(1+\frac{1}{H})P_{h}\cdot(\widetilde{V}^{i}_{h+1}-V^{\pi_{i}}_{h+1})(s^{i}_{h},a^{i}_{h})\\ =&(1+\frac{1}{H})\widetilde{\delta}_{i,h+1}+\widehat{b}_{i,h}+c_{1,i,h}+c_{4,i,h}+(1+\frac{1}{H})\epsilon_{i,h},\end{split} (41)

where ϵi,h\epsilon_{i,h} is a martingale difference bounded in [H,H][-H,H].

Recursively applying (41) and using the fact that (1+\frac{1}{H})^{H}\leq e\leq 3, we have:

δ~i,h3j=hH[b^i,j+c1,i,j+c4,i,j+ϵi,j].\widetilde{\delta}_{i,h}\leq 3\sum_{j=h}^{H}\left[\widehat{b}_{i,j}+c_{1,i,j}+c_{4,i,j}+\epsilon_{i,j}\right]. (42)

Summing over episodes, we have

i=1kδi,hi=1kδ~i,h3i=1kj=hH[b^i,j+c1,i,j+c4,i,j+ϵi,j].\sum_{i=1}^{k}\delta_{i,h}\leq\sum_{i=1}^{k}\widetilde{\delta}_{i,h}\leq 3\sum_{i=1}^{k}\sum_{j=h}^{H}\left[\widehat{b}_{i,j}+c_{1,i,j}+c_{4,i,j}+\epsilon_{i,j}\right]. (43)

According to the Azuma-Hoeffding inequality and a union bound, we can bound the partial sums of martingale differences below.

Lemma C.2.

Let Tk:=HkT_{k}:=Hk be the number of steps until episode kk. Then with probability 1β121-\frac{\beta}{12}, the following inequalities hold for all k,h,s[K]×[H]×𝒮k,h,s\in[K]\times[H]\times\mathcal{S} and hhh^{\prime}\geq h:

i=1k𝟙(i[k]typ)j=hHϵi,jHTkι.i=1k𝟙(shi=s,i[k]typ,h,s)j=hHϵi,jHHNhk(s)ι.\begin{split}\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=h}^{H}\epsilon_{i,j}\leq H\sqrt{T_{k}\iota}.\\ \sum_{i=1}^{k}\mathds{1}(s_{h}^{i}=s,\,i\in[k]_{\text{typ},h,s})\sum_{j=h^{\prime}}^{H}\epsilon_{i,j}\leq H\sqrt{HN_{h}^{k}(s)\iota}.\end{split} (44)

We define Uk,h=3i=1k𝟙(i[k]typ)j=hH[b^i,j+c1,i,j+c4,i,j]+3HTkιU_{k,h}=3\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=h}^{H}\left[\widehat{b}_{i,j}+c_{1,i,j}+c_{4,i,j}\right]+3H\sqrt{T_{k}\iota} and

Uk,h,s=3i=1k𝟙(shi=s,i[k]typ,h,s)j=hH[b^i,j+c1,i,j+c4,i,j]+3HHNhk(s)ι.U_{k,h,s}=3\sum_{i=1}^{k}\mathds{1}(s_{h}^{i}=s,\,i\in[k]_{\text{typ},h,s})\sum_{j=h}^{H}\left[\widehat{b}_{i,j}+c_{1,i,j}+c_{4,i,j}\right]+3H\sqrt{HN^{k}_{h}(s)\iota}.

Therefore, combining (42) and Lemma C.2, we have the following key lemma that upper bounds the summation of δi,j\delta_{i,j}.

Lemma C.3.

Under the high probability event in Lemma C.2, for all h,s[H]×𝒮h,s\in[H]\times\mathcal{S},

i=1k𝟙(i[k]typ)δi,hi=1k𝟙(i[k]typ)δ~i,hUk,hUk,1,i=1k𝟙(i[k]typ)j=hHδ~i,jHUk,1.\begin{split}\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\delta_{i,h}\leq\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\widetilde{\delta}_{i,h}\leq U_{k,h}\leq U_{k,1},\\ \sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=h}^{H}\widetilde{\delta}_{i,j}\leq HU_{k,1}.\end{split} (45)

At the same time, for all h,s[H]×𝒮h,s\in[H]\times\mathcal{S} and jhj\geq h,

i=1k𝟙(shi=s,i[k]typ,h,s)δi,ji=1k𝟙(shi=s,i[k]typ,h,s)δ~i,jUk,h,s,i=1k𝟙(shi=s,i[k]typ,h,s)j=hHδ~i,jHUk,h,s.\begin{split}\sum_{i=1}^{k}\mathds{1}(s_{h}^{i}=s,\,i\in[k]_{\text{typ},h,s})\delta_{i,j}\leq\sum_{i=1}^{k}\mathds{1}(s_{h}^{i}=s,\,i\in[k]_{\text{typ},h,s})\widetilde{\delta}_{i,j}\leq U_{k,h,s},\\ \sum_{i=1}^{k}\mathds{1}(s_{h}^{i}=s,\,i\in[k]_{\text{typ},h,s})\sum_{j=h}^{H}\widetilde{\delta}_{i,j}\leq HU_{k,h,s}.\end{split} (46)

C.4 Upper bounds of variance

From the analysis above, it suffices to derive upper bounds for CkC_{k}, BkB_{k}, Ck,h,sC_{k,h,s} and Bk,h,sB_{k,h,s}. As an intermediate step, we establish several upper bounds on summations of variances. We begin with the following lemma from Azar et al. [2017]. Recall that Vi,jπ=VarPj(|sji,aji)Vj+1πi()V^{\pi}_{i,j}=\mathrm{Var}_{P_{j}(\cdot|s^{i}_{j},a^{i}_{j})}V^{\pi_{i}}_{j+1}(\cdot) and Tk=HkT_{k}=Hk.

Lemma C.4 (Lemma 8 of Azar et al. [2017]).

With probability 1β121-\frac{\beta}{12}, for all k,h,s[K]×[H]×𝒮k,h,s\in[K]\times[H]\times\mathcal{S}, it holds that

i=1k𝟙(i[k]typ)j=hHVi,jπHTk+2H4Tkι+H3ι2HTk,i=1k𝟙(shi=s,i[k]typ,h,s)j=hHVi,jπH2Nhk(s)+2H5Nhk(s)ι+H3ι2H2Nhk(s).\begin{split}\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=h}^{H}V_{i,j}^{\pi}\leq HT_{k}+2\sqrt{H^{4}T_{k}\iota}+H^{3}\iota\leq 2HT_{k},\\ \sum_{i=1}^{k}\mathds{1}(s_{h}^{i}=s,\,i\in[k]_{\text{typ},h,s})\sum_{j=h}^{H}V_{i,j}^{\pi}\leq H^{2}N^{k}_{h}(s)+2\sqrt{H^{5}N_{h}^{k}(s)\iota}+H^{3}\iota\leq 2H^{2}N^{k}_{h}(s).\end{split} (47)

The proof results from a combination of the law of total variance, Freedman’s inequality, a union bound and our definition of typical episodes; more details can be found in Azar et al. [2017]. Next, we provide another upper bound for further bounding CkC_{k} (and Ck,h,sC_{k,h,s}). Recall that Vi,j=VarPj(|sji,aji)Vj+1()V^{\star}_{i,j}=\mathrm{Var}_{P_{j}(\cdot|s^{i}_{j},a^{i}_{j})}V^{\star}_{j+1}(\cdot).

Lemma C.5.

Under the high probability event in Lemma C.2, it holds that

i=1k𝟙(i[k]typ)j=1H(Vi,jVi,jπ)2H2Uk,1+2H2Tkι.\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\left(V^{\star}_{i,j}-V_{i,j}^{\pi}\right)\leq 2H^{2}U_{k,1}+2H^{2}\sqrt{T_{k}\iota}. (48)

Similarly, under the same high probability event, for all (h,s)[H]×𝒮(h,s)\in[H]\times\mathcal{S},

i=1k𝟙(shi=s,i[k]typ,h,s)j=hH(Vi,jVi,jπ)2H2Uk,h,s+2H2HNhk(s)ι.\sum_{i=1}^{k}\mathds{1}(s_{h}^{i}=s,\,i\in[k]_{\text{typ},h,s})\sum_{j=h}^{H}\left(V^{\star}_{i,j}-V^{\pi}_{i,j}\right)\leq 2H^{2}U_{k,h,s}+2H^{2}\sqrt{HN^{k}_{h}(s)\iota}. (49)
Proof of Lemma C.5.

We only prove the first conclusion; the second one can be proven in an identical way.

i=1k𝟙(i[k]typ)j=1H(Vi,jVi,jπ)i=1k𝟙(i[k]typ)j=1H𝔼sPj(|sji,aji)[Vj+1(s)2Vj+1πi(s)2]2Hi=1k𝟙(i[k]typ)j=1H𝔼sPj(|sji,aji)[Vj+1(s)Vj+1πi(s)]2H(i=1k𝟙(i[k]typ)j=1Hδ~i,j+1+HTkι)2H2Uk,1+2H2Tkι.\begin{split}&\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\left(V^{\star}_{i,j}-V_{i,j}^{\pi}\right)\leq\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\mathbb{E}_{s^{\prime}\sim P_{j}(\cdot|s^{i}_{j},a^{i}_{j})}\left[V^{\star}_{j+1}(s^{\prime})^{2}-V^{\pi_{i}}_{j+1}(s^{\prime})^{2}\right]\\ \leq&2H\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\mathbb{E}_{s^{\prime}\sim P_{j}(\cdot|s^{i}_{j},a^{i}_{j})}\left[V^{\star}_{j+1}(s^{\prime})-V^{\pi_{i}}_{j+1}(s^{\prime})\right]\\ \leq&2H\left(\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\widetilde{\delta}_{i,j+1}+H\sqrt{T_{k}\iota}\right)\\ \leq&2H^{2}U_{k,1}+2H^{2}\sqrt{T_{k}\iota}.\end{split} (50)

The first inequality is because Vj+1Vj+1πiV^{\star}_{j+1}\geq V^{\pi_{i}}_{j+1}. The second inequality results from the fact that Vj+1+Vj+1πi2HV^{\star}_{j+1}+V^{\pi_{i}}_{j+1}\leq 2H. The third inequality holds because of Lemma C.2. The last inequality is due to Lemma C.3. ∎

Lastly, we prove the following lemma for bounding BkB_{k} (and Bk,h,sB_{k,h,s}). Recall that V~i,j=VarP~ji(|sji,aji)V~j+1i()\widetilde{V}_{i,j}=\mathrm{Var}_{\widetilde{P}^{i}_{j}(\cdot|s^{i}_{j},a^{i}_{j})}\widetilde{V}^{i}_{j+1}(\cdot).

Lemma C.6.

Under the high probability event in Lemma C.2, with probability 1β12K1-\frac{\beta}{12K}, it holds that

i=1k𝟙(i[k]typ)j=1H(V~i,jVi,jπ)2H2Uk,1+8H2SHATkι+6H3S2AEϵ,βι.\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\left(\widetilde{V}_{i,j}-V_{i,j}^{\pi}\right)\leq 2H^{2}U_{k,1}+8H^{2}S\sqrt{HAT_{k}\iota}+6H^{3}S^{2}AE_{\epsilon,\beta}\iota. (51)

Similarly, under the same high probability event, for all (h,s)[H]×𝒮(h,s)\in[H]\times\mathcal{S},

i=1k𝟙(shi=s,i[k]typ,h,s)j=hH(V~i,jVi,jπ)2H2Uk,h,s+8H3SANhk(s)ι+6H3S2AEϵ,βι.\sum_{i=1}^{k}\mathds{1}(s_{h}^{i}=s,\,i\in[k]_{\text{typ},h,s})\sum_{j=h}^{H}\left(\widetilde{V}_{i,j}-V^{\pi}_{i,j}\right)\leq 2H^{2}U_{k,h,s}+8H^{3}S\sqrt{AN^{k}_{h}(s)\iota}+6H^{3}S^{2}AE_{\epsilon,\beta}\iota. (52)
Proof of Lemma C.6.

We only prove the first conclusion; the second one can be proven in an identical way. We have

i=1k𝟙(i[k]typ)j=1H(V~i,jVi,jπ)i=1k𝟙(i[k]typ)j=1H[𝔼sP~ji(|sji,aji)V~j+1i(s)2𝔼sPj(|sji,aji)V~j+1i(s)2](a)+i=1k𝟙(i[k]typ)j=1H𝔼sPj(|sji,aji)[V~j+1i(s)2Vj+1πi(s)2](b)+i=1k𝟙(i[k]typ)j=1H[(𝔼sPj(|sji,aji)Vj+1(s))2(𝔼sP~ji(|sji,aji)Vj+1(s))2](c).\begin{split}&\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\left(\widetilde{V}_{i,j}-V_{i,j}^{\pi}\right)\leq\underbrace{\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\left[\mathbb{E}_{s^{\prime}\sim\widetilde{P}^{i}_{j}(\cdot|s^{i}_{j},a^{i}_{j})}\widetilde{V}^{i}_{j+1}(s^{\prime})^{2}-\mathbb{E}_{s^{\prime}\sim P_{j}(\cdot|s^{i}_{j},a^{i}_{j})}\widetilde{V}^{i}_{j+1}(s^{\prime})^{2}\right]}_{\mathrm{(a)}}\\ +&\underbrace{\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\mathbb{E}_{s^{\prime}\sim P_{j}(\cdot|s^{i}_{j},a^{i}_{j})}\left[\widetilde{V}^{i}_{j+1}(s^{\prime})^{2}-V^{\pi_{i}}_{j+1}(s^{\prime})^{2}\right]}_{\mathrm{(b)}}\\ +&\underbrace{\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\left[\left(\mathbb{E}_{s^{\prime}\sim P_{j}(\cdot|s^{i}_{j},a^{i}_{j})}V^{\star}_{j+1}(s^{\prime})\right)^{2}-\left(\mathbb{E}_{s^{\prime}\sim\widetilde{P}^{i}_{j}(\cdot|s^{i}_{j},a^{i}_{j})}V^{\star}_{j+1}(s^{\prime})\right)^{2}\right]}_{\mathrm{(c)}}.\end{split} (53)

The inequality holds because of direct calculation and the fact that Vj+1πiVj+1V~j+1iV^{\pi_{i}}_{j+1}\leq V^{\star}_{j+1}\leq\widetilde{V}^{i}_{j+1}. Next we bound these terms separately. First of all,

(a)i=1k𝟙(i[k]typ)j=1HH2P~ji(|sji,aji)Pj(|sji,aji)1i=1k𝟙(i[k]typ)j=1HH2(2SιNji(sji,aji)+2SEϵ,βNji(sji,aji))2H2SHATkι+2H3S2AEϵ,βι.\begin{split}&\mathrm{(a)}\leq\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}H^{2}\cdot\left\|\widetilde{P}^{i}_{j}(\cdot|s^{i}_{j},a^{i}_{j})-P_{j}(\cdot|s^{i}_{j},a^{i}_{j})\right\|_{1}\\ \leq&\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}H^{2}\left(2\sqrt{\frac{S\iota}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}}+\frac{2SE_{\epsilon,\beta}}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}\right)\\ \leq&2H^{2}S\sqrt{HAT_{k}\iota}+2H^{3}S^{2}AE_{\epsilon,\beta}\iota.\end{split} (54)

The second inequality comes from Lemma B.2. The last inequality is because of direct calculation. For (b)\mathrm{(b)}, according to Lemma C.2 and similarly to the proof of Lemma C.5, it holds that

(b)2H(i=1k𝟙(i[k]typ)j=1Hδ~i,j+1+HTkι)2H2Uk,1+2H2Tkι.\mathrm{(b)}\leq 2H\left(\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\widetilde{\delta}_{i,j+1}+H\sqrt{T_{k}\iota}\right)\leq 2H^{2}U_{k,1}+2H^{2}\sqrt{T_{k}\iota}. (55)

Lastly, for (c)\mathrm{(c)}, we have

(c)i=1k𝟙(i[k]typ)j=1H[(𝔼sPj(|sji,aji)Vj+1(s))2(𝔼sP^ji(|sji,aji)Vj+1(s))2+2H2(P~jiP^ji)(|sji,aji)1]i=1k𝟙(i[k]typ)j=1H(2H2HιNji(sji,aji)+2H22SEϵ,βNji(sji,aji))4H2HSATkι+4H3S2AEϵ,βι.\begin{split}\mathrm{(c)}\leq&\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\left[\left(\mathbb{E}_{s^{\prime}\sim P_{j}(\cdot|s^{i}_{j},a^{i}_{j})}V^{\star}_{j+1}(s^{\prime})\right)^{2}-\left(\mathbb{E}_{s^{\prime}\sim\widehat{P}^{i}_{j}(\cdot|s^{i}_{j},a^{i}_{j})}V^{\star}_{j+1}(s^{\prime})\right)^{2}+2H^{2}\cdot\left\|\left(\widetilde{P}^{i}_{j}-\widehat{P}^{i}_{j}\right)(\cdot|s^{i}_{j},a^{i}_{j})\right\|_{1}\right]\\ \leq&\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\left(2H\cdot 2H\sqrt{\frac{\iota}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}}+2H^{2}\cdot\frac{2SE_{\epsilon,\beta}}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}\right)\\ \leq&4H^{2}\sqrt{HSAT_{k}\iota}+4H^{3}S^{2}AE_{\epsilon,\beta}\iota.\end{split} (56)

The second inequality holds with probability 1β12K1-\frac{\beta}{12K} due to Hoeffding’s inequality and Remark B.3. The last inequality holds because of direct calculation.

Combining the upper bounds of (a),(b),(c)\mathrm{(a)},\mathrm{(b)},\mathrm{(c)}, the proof is complete. ∎

C.5 Upper bound of regret

With the upper bounds in Section C.4, we are ready to bound CkC_{k}, BkB_{k} and the regret. In this section, we assume the high probability events in Lemma C.2 and Lemma C.4 hold. We begin with the upper bound of CkC_{k} and Ck,h,sC_{k,h,s}. Recall that Ck=i=1k𝟙(i[k]typ)j=1H(c1,i,j+c4,i,j)C_{k}=\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}(c_{1,i,j}+c_{4,i,j}) and Ck,h,s=i=1k𝟙(shi=s,i[k]typ,h,s)j=hH(c1,i,j+c4,i,j)C_{k,h,s}=\sum_{i=1}^{k}\mathds{1}(s_{h}^{i}=s,\,i\in[k]_{\text{typ},h,s})\sum_{j=h}^{H}(c_{1,i,j}+c_{4,i,j}).

Lemma C.7.

Under the high probability events in Lemma C.2 and Lemma C.4, we have

Ck=i=1k𝟙(i[k]typ)j=1H(c1,i,j+c4,i,j)3H2SATkι2+2H3SAUk,1ι2+H3S2Aι2.C_{k}=\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}(c_{1,i,j}+c_{4,i,j})\leq 3\sqrt{H^{2}SAT_{k}\iota^{2}}+2\sqrt{H^{3}SAU_{k,1}\iota^{2}}+H^{3}S^{2}A\iota^{2}. (57)

Similarly, under the same high probability event, for all h,s[H]×𝒮h,s\in[H]\times\mathcal{S},

Ck,h,s3H3SANhk(s)ι2+2H3SAUk,h,sι2+H3S2Aι2.C_{k,h,s}\leq 3\sqrt{H^{3}SAN_{h}^{k}(s)\iota^{2}}+2\sqrt{H^{3}SAU_{k,h,s}\iota^{2}}+H^{3}S^{2}A\iota^{2}. (58)
Proof of Lemma C.7.

We only prove the first conclusion; the second one can be proven in an identical way. We have

Ck=i=1k𝟙(i[k]typ)j=1Hc1,i,j(a)+i=1k𝟙(i[k]typ)j=1Hc4,i,j(b).C_{k}=\underbrace{\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}c_{1,i,j}}_{\mathrm{(a)}}+\underbrace{\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}c_{4,i,j}}_{\mathrm{(b)}}. (59)

For (a)\mathrm{(a)}, due to the Cauchy-Schwarz inequality, it holds that

(a)=i=1k𝟙(i[k]typ)j=1H2Vi,jιNji(sji,aji)2ιi=1k𝟙(i[k]typ)j=1H1Nji(sji,aji)(c)i=1k𝟙(i[k]typ)j=1HVi,j(d).\begin{split}&\mathrm{(a)}=\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\sqrt{\frac{2V^{\star}_{i,j}\iota}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}}\\ \leq&\sqrt{2\iota}\cdot\sqrt{\underbrace{\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\frac{1}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}}_{\mathrm{(c)}}}\cdot\sqrt{\underbrace{\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}V^{\star}_{i,j}}_{\mathrm{(d)}}}.\end{split} (60)

By direct calculation, we have (c)HSAι\mathrm{(c)}\leq HSA\iota, and in addition,

(d)i=1k𝟙(i[k]typ)j=1HVi,jπ+i=1k𝟙(i[k]typ)j=1H(Vi,jVi,jπ)2HTk+2H2Uk,1+2H2Tkι,\begin{split}\mathrm{(d)}\leq&\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}V^{\pi}_{i,j}+\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\left(V^{\star}_{i,j}-V^{\pi}_{i,j}\right)\\ \leq&2HT_{k}+2H^{2}U_{k,1}+2H^{2}\sqrt{T_{k}\iota},\end{split} (61)

where the second inequality holds due to Lemma C.4 and Lemma C.5. Therefore, we have

(a)3H2SATkι2+2H3SAUk,1ι2.\mathrm{(a)}\leq 3\sqrt{H^{2}SAT_{k}\iota^{2}}+2\sqrt{H^{3}SAU_{k,1}\iota^{2}}. (62)

For (b)\mathrm{(b)}, according to direct calculation, it holds that

(b)i=1k𝟙(i[k]typ)j=1HH2SιNji(sji,aji)H3S2Aι2.\mathrm{(b)}\leq\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\frac{H^{2}S\iota}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}\leq H^{3}S^{2}A\iota^{2}. (63)

Combining (62) and (63), the proof is complete. ∎

Next, we bound BkB_{k} and Bk,h,sB_{k,h,s}. Recall that Bk=i=1k𝟙(i[k]typ)j=1Hb^i,jB_{k}=\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\widehat{b}_{i,j} and
Bk,h,s=i=1k𝟙(shi=s,i[k]typ,h,s)j=hHb^i,jB_{k,h,s}=\sum_{i=1}^{k}\mathds{1}(s_{h}^{i}=s,\,i\in[k]_{\text{typ},h,s})\sum_{j=h}^{H}\widehat{b}_{i,j}.

Lemma C.8.

Under the high probability events in Lemma C.2, Lemma C.4 and Lemma C.6, with probability 1β12K1-\frac{\beta}{12K},

Bk16H2SATkι2+6H3SAUk,1ι2+250H2S2AEϵ,βι2+250H3S2Aι2.B_{k}\leq 16\sqrt{H^{2}SAT_{k}\iota^{2}}+6\sqrt{H^{3}SAU_{k,1}\iota^{2}}+250H^{2}S^{2}AE_{\epsilon,\beta}\iota^{2}+250H^{3}S^{2}A\iota^{2}. (64)

Similarly, under the same high probability event, for all h,s[H]×𝒮h,s\in[H]\times\mathcal{S},

Bk,h,s16H3SANhk(s)ι2+6H3SAUk,h,sι2+250H2S2AEϵ,βι2+250H3S2Aι2.B_{k,h,s}\leq 16\sqrt{H^{3}SAN_{h}^{k}(s)\iota^{2}}+6\sqrt{H^{3}SAU_{k,h,s}\iota^{2}}+250H^{2}S^{2}AE_{\epsilon,\beta}\iota^{2}+250H^{3}S^{2}A\iota^{2}. (65)
Proof of Lemma C.8.

We only prove the first conclusion; the second one can be proven in an identical way. We have

Bki=1k𝟙(i[k]typ)j=1H4V~i,jιNji(sji,aji)(a)+i=1k𝟙(i[k]typ)j=1H22ιNji(sji,aji)(b)+i=1k𝟙(i[k]typ)j=1H40HSEϵ,βιNji(sji,aji)(c)+i=1k𝟙(i[k]typ)j=1H4ιsP~ji(s|sji,aji)min{10002H3SAι2Nj+1i(s)+10002H4S4A2Eϵ,β2ι4Nj+1i(s)2+10002H6S4A2ι4Nj+1i(s)2,H2}Nji(sji,aji)(d).\begin{split}B_{k}\leq&\underbrace{\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}4\sqrt{\frac{\widetilde{V}_{i,j}\iota}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}}}_{\mathrm{(a)}}+\underbrace{\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}2\sqrt{\frac{2\iota}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}}}_{\mathrm{(b)}}+\underbrace{\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\frac{40HSE_{\epsilon,\beta}\iota}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}}_{\mathrm{(c)}}\\ +&\underbrace{\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}4\sqrt{\iota}\cdot\sqrt{\frac{\sum_{s^{\prime}}\widetilde{P}^{i}_{j}(s^{\prime}|s^{i}_{j},a^{i}_{j})\min\left\{\frac{1000^{2}H^{3}SA\iota^{2}}{N^{i}_{j+1}(s^{\prime})}+\frac{1000^{2}H^{4}S^{4}A^{2}E_{\epsilon,\beta}^{2}\iota^{4}}{N^{i}_{j+1}(s^{\prime})^{2}}+\frac{1000^{2}H^{6}S^{4}A^{2}\iota^{4}}{N^{i}_{j+1}(s^{\prime})^{2}},H^{2}\right\}}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}}}_{\mathrm{(d)}}.\end{split} (66)

We bound these terms separately; we first bound (a)\mathrm{(a)} below.

(a)4ιi=1k𝟙(i[k]typ)j=1H1Nji(sji,aji)i=1k𝟙(i[k]typ)j=1HV~i,j4ιHSAιi=1k𝟙(i[k]typ)j=1HVi,jπ+i=1k𝟙(i[k]typ)j=1H(V~i,jVi,jπ)4ιHSA2HTk+2H2Uk,1+8H2SHATkι+6H3S2AEϵ,βι8H2SATkι2+42H3SAUk,1ι2+10H4S3A2Eϵ,βι3.\begin{split}&\mathrm{(a)}\leq 4\sqrt{\iota}\cdot\sqrt{\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\frac{1}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}}\cdot\sqrt{\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\widetilde{V}_{i,j}}\\ \leq&4\sqrt{\iota}\cdot\sqrt{HSA\iota}\cdot\sqrt{\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}V^{\pi}_{i,j}+\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\left(\widetilde{V}_{i,j}-V^{\pi}_{i,j}\right)}\\ \leq&4\iota\cdot\sqrt{HSA}\sqrt{2HT_{k}+2H^{2}U_{k,1}+8H^{2}S\sqrt{HAT_{k}\iota}+6H^{3}S^{2}AE_{\epsilon,\beta}\iota}\\ \leq&8\sqrt{H^{2}SAT_{k}\iota^{2}}+4\sqrt{2H^{3}SAU_{k,1}\iota^{2}}+10\sqrt{H^{4}S^{3}A^{2}E_{\epsilon,\beta}\iota^{3}}.\end{split} (67)

The first inequality is because of the Cauchy-Schwarz inequality. The second inequality results from direct calculation. The third inequality is due to Lemma C.4 and Lemma C.6. The last inequality holds because for typical episodes, 8H2SHATkι2HTk8H^{2}S\sqrt{HAT_{k}\iota}\leq 2HT_{k}.

According to direct calculation,

(b)22ιHSATkH2SATkι2\mathrm{(b)}\leq 2\sqrt{2\iota\cdot HSAT_{k}}\leq\sqrt{H^{2}SAT_{k}\iota^{2}} (68)

For (c)\mathrm{(c)}, it holds that

(c)40HSEϵ,βιHSAι40H2S2AEϵ,βι2.\begin{split}\mathrm{(c)}\leq 40HSE_{\epsilon,\beta}\iota\cdot HSA\iota\leq 40H^{2}S^{2}AE_{\epsilon,\beta}\iota^{2}.\end{split} (69)

Finally, we bound the most complex term (d)\mathrm{(d)} as below. Because of the Cauchy-Schwarz inequality,

(d)i=1k𝟙(i[k]typ)j=1H4ιsP~ji(s|sji,aji)min{10002H3SAι2Nj+1i(s)+10002H4S4A2Eϵ,β2ι4Nj+1i(s)2+10002H6S4A2ι4Nj+1i(s)2,H2}Nji(sji,aji)4ιi=1k𝟙(i[k]typ)j=1H1Nji(sji,aji)(e)i=1k𝟙(i[k]typ)j=1HsP~ji(s|sji,aji)min{10002H3SAι2Nj+1i(s)+10002H4S4A2Eϵ,β2ι4Nj+1i(s)2+10002H6S4A2ι4Nj+1i(s)2,H2}(f).\begin{split}&\mathrm{(d)}\leq\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}4\sqrt{\iota}\cdot\sqrt{\frac{\sum_{s^{\prime}}\widetilde{P}^{i}_{j}(s^{\prime}|s^{i}_{j},a^{i}_{j})\min\left\{\frac{1000^{2}H^{3}SA\iota^{2}}{N^{i}_{j+1}(s^{\prime})}+\frac{1000^{2}H^{4}S^{4}A^{2}E_{\epsilon,\beta}^{2}\iota^{4}}{N^{i}_{j+1}(s^{\prime})^{2}}+\frac{1000^{2}H^{6}S^{4}A^{2}\iota^{4}}{N^{i}_{j+1}(s^{\prime})^{2}},H^{2}\right\}}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}}\\ \leq&4\sqrt{\iota\underbrace{\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\frac{1}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}}_{\mathrm{(e)}}}\cdot\\ &\sqrt{\underbrace{\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\sum_{s^{\prime}}\widetilde{P}^{i}_{j}(s^{\prime}|s^{i}_{j},a^{i}_{j})\min\left\{\frac{1000^{2}H^{3}SA\iota^{2}}{N^{i}_{j+1}(s^{\prime})}+\frac{1000^{2}H^{4}S^{4}A^{2}E_{\epsilon,\beta}^{2}\iota^{4}}{N^{i}_{j+1}(s^{\prime})^{2}}+\frac{1000^{2}H^{6}S^{4}A^{2}\iota^{4}}{N^{i}_{j+1}(s^{\prime})^{2}},H^{2}\right\}}_{\mathrm{(f)}}}.\end{split} (70)

Note that (e)HSAι\mathrm{(e)}\leq HSA\iota. For (f)\mathrm{(f)}, we have with probability 1β12K1-\frac{\beta}{12K},

(f)i=1k𝟙(i[k]typ)j=1HH2P~ji(|sji,aji)Pj(|sji,aji)1+i=1k𝟙(i[k]typ)j=1HsPj(s|sji,aji)min{10002H3SAι2Nj+1i(s)+10002H4S4A2Eϵ,β2ι4Nj+1i(s)2+10002H6S4A2ι4Nj+1i(s)2,H2}i=1k𝟙(i[k]typ)j=1HH2P~ji(|sji,aji)Pj(|sji,aji)1+H2Tkι+i=1k𝟙(i[k]typ)j=1Hmin{10002H3SAι2Nj+1i(sj+1i)+10002H4S4A2Eϵ,β2ι4Nj+1i(sj+1i)2+10002H6S4A2ι4Nj+1i(sj+1i)2,H2}i=1k𝟙(i[k]typ)j=1HH2(2SιNji(sji,aji)+2SEϵ,βNji(sji,aji))+i=1k𝟙(i[k]typ)j=1Hmin{10002H3SAι2Nj+1i(sj+1i),H2}+i=1k𝟙(i[k]typ)j=1Hmin{10002H4S4A2Eϵ,β2ι4Nj+1i(sj+1i)2,H2}+i=1k𝟙(i[k]typ)j=1Hmin{10002H6S4A2ι4Nj+1i(sj+1i)2,H2}+H2Tkι3H2SHATkι+2H3S2AEϵ,βι+1000H4S3AEϵ,βι2+2000H5S3Aι2,\begin{split}&\mathrm{(f)}\leq\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}H^{2}\left\|\widetilde{P}^{i}_{j}(\cdot|s^{i}_{j},a^{i}_{j})-P_{j}(\cdot|s^{i}_{j},a^{i}_{j})\right\|_{1}\\ +&\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\sum_{s^{\prime}}P_{j}(s^{\prime}|s^{i}_{j},a^{i}_{j})\min\left\{\frac{1000^{2}H^{3}SA\iota^{2}}{N^{i}_{j+1}(s^{\prime})}+\frac{1000^{2}H^{4}S^{4}A^{2}E_{\epsilon,\beta}^{2}\iota^{4}}{N^{i}_{j+1}(s^{\prime})^{2}}+\frac{1000^{2}H^{6}S^{4}A^{2}\iota^{4}}{N^{i}_{j+1}(s^{\prime})^{2}},H^{2}\right\}\\ \leq&\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}H^{2}\left\|\widetilde{P}^{i}_{j}(\cdot|s^{i}_{j},a^{i}_{j})-P_{j}(\cdot|s^{i}_{j},a^{i}_{j})\right\|_{1}+H^{2}\sqrt{T_{k}\iota}\\ +&\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\min\left\{\frac{1000^{2}H^{3}SA\iota^{2}}{N^{i}_{j+1}(s^{i}_{j+1})}+\frac{1000^{2}H^{4}S^{4}A^{2}E_{\epsilon,\beta}^{2}\iota^{4}}{N^{i}_{j+1}(s^{i}_{j+1})^{2}}+\frac{1000^{2}H^{6}S^{4}A^{2}\iota^{4}}{N^{i}_{j+1}(s^{i}_{j+1})^{2}},H^{2}\right\}\\ \leq&\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}H^{2}\left(2\sqrt{\frac{S\iota}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}}+\frac{2SE_{\epsilon,\beta}}{N^{i}_{j}(s^{i}_{j},a^{i}_{j})}\right)+\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\min\left\{\frac{1000^{2}H^{3}SA\iota^{2}}{N^{i}_{j+1}(s^{i}_{j+1})},H^{2}\right\}\\ +&\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\min\left\{\frac{1000^{2}H^{4}S^{4}A^{2}E_{\epsilon,\beta}^{2}\iota^{4}}{N^{i}_{j+1}(s^{i}_{j+1})^{2}},H^{2}\right\}+\sum_{i=1}^{k}\mathds{1}(i\in[k]_{\text{typ}})\sum_{j=1}^{H}\min\left\{\frac{1000^{2}H^{6}S^{4}A^{2}\iota^{4}}{N^{i}_{j+1}(s^{i}_{j+1})^{2}},H^{2}\right\}\\ +&H^{2}\sqrt{T_{k}\iota}\\ \leq&3H^{2}S\sqrt{HAT_{k}\iota}+2H^{3}S^{2}AE_{\epsilon,\beta}\iota+1000H^{4}S^{3}AE_{\epsilon,\beta}\iota^{2}+2000H^{5}S^{3}A\iota^{2},\end{split} (71)

where the first inequality holds because min{,H2}H2\min\{\cdot,H^{2}\}\leq H^{2}. The second inequality holds with probability 1β12K1-\frac{\beta}{12K} due to the Azuma-Hoeffding inequality. The third inequality results from Lemma B.2. The last inequality comes from direct calculation. Therefore, we have

(d)7H3S2Aι2HATkι+120H5S4A2Eϵ,βι4+200H3S2Aι2.\mathrm{(d)}\leq 7\sqrt{H^{3}S^{2}A\iota^{2}\sqrt{HAT_{k}\iota}}+120\sqrt{H^{5}S^{4}A^{2}E_{\epsilon,\beta}\iota^{4}}+200H^{3}S^{2}A\iota^{2}. (72)

Combining (67), (68), (69) and (72), the proof is complete. ∎

Now we are ready to bound the regret until episode kk based on optimism. We define the following regret functions.

Regret(k):=i=1kδi,1,Regret~(k):=i=1kδ~i,1.\text{Regret}(k):=\sum_{i=1}^{k}\delta_{i,1},\,\,\widetilde{\text{Regret}}(k):=\sum_{i=1}^{k}\widetilde{\delta}_{i,1}. (73)

In addition, we define the regret with respect to (h,s)[H]×𝒮(h,s)\in[H]\times\mathcal{S}:

Regret(k,h,s):=i=1k𝟙(shi=s)δi,h,Regret~(k,h,s):=i=1k𝟙(shi=s)δ~i,h.\text{Regret}(k,h,s):=\sum_{i=1}^{k}\mathds{1}(s^{i}_{h}=s)\cdot\delta_{i,h},\,\,\widetilde{\text{Regret}}(k,h,s):=\sum_{i=1}^{k}\mathds{1}(s^{i}_{h}=s)\cdot\widetilde{\delta}_{i,h}. (74)
Lemma C.9.

With probability 1β1-\beta, for all k[K]k\in[K] such that optimism holds (i.e., for all (i,h,s,a)[k]×[H]×𝒮×𝒜(i,h,s,a)\in[k]\times[H]\times\mathcal{S}\times\mathcal{A}, Q~hi(s,a)Qh(s,a)\widetilde{Q}^{i}_{h}(s,a)\geq Q^{\star}_{h}(s,a)), it holds that

Regret(k)Regret~(k)1000(H2SATkι2+H2S2AEϵ,βι2+H3S2Aι2).\text{Regret}(k)\leq\widetilde{\text{Regret}}(k)\leq 1000\left(\sqrt{H^{2}SAT_{k}\iota^{2}}+H^{2}S^{2}AE_{\epsilon,\beta}\iota^{2}+H^{3}S^{2}A\iota^{2}\right). (75)

In addition, for all (h,s)[H]×𝒮(h,s)\in[H]\times\mathcal{S}, we have

Regret(k,h,s)Regret~(k,h,s)1000(H3SANhk(s)ι2+H2S2AEϵ,βι2+H3S2Aι2).\text{Regret}(k,h,s)\leq\widetilde{\text{Regret}}(k,h,s)\leq 1000\left(\sqrt{H^{3}SAN^{k}_{h}(s)\iota^{2}}+H^{2}S^{2}AE_{\epsilon,\beta}\iota^{2}+H^{3}S^{2}A\iota^{2}\right). (76)
Proof of Lemma C.9.

For the proof of this lemma, we assume the high probability events in Assumption 3.1, Lemma B.8, Lemma C.2, Lemma C.4, Lemma C.6 (for all k[K]k\in[K]) and Lemma C.8 (for all k[K]k\in[K]) hold. The failure probability is bounded by

β3+β3+β12+β12+Kβ12K+Kβ12Kβ.\frac{\beta}{3}+\frac{\beta}{3}+\frac{\beta}{12}+\frac{\beta}{12}+K\cdot\frac{\beta}{12K}+K\cdot\frac{\beta}{12K}\leq\beta. (77)

We only prove the first conclusion; the second one can be proven in an identical way. It holds that

Regret(k)Regret~(k)=i=1kδ~i,1Uk,1+250H3S2Aι23Bk+3Ck+3HTkι+250H3S2Aι260H2SATkι2+24H3SAUk,1ι2+750H2S2AEϵ,βι2+1000H3S2Aι21000(H2SATkι2+H2S2AEϵ,βι2+H3S2Aι2),\begin{split}&\text{Regret}(k)\leq\widetilde{\text{Regret}}(k)=\sum_{i=1}^{k}\widetilde{\delta}_{i,1}\\ \leq&U_{k,1}+250H^{3}S^{2}A\iota^{2}\\ \leq&3B_{k}+3C_{k}+3H\sqrt{T_{k}\iota}+250H^{3}S^{2}A\iota^{2}\\ \leq&60\sqrt{H^{2}SAT_{k}\iota^{2}}+24\sqrt{H^{3}SAU_{k,1}\iota^{2}}+750H^{2}S^{2}AE_{\epsilon,\beta}\iota^{2}+1000H^{3}S^{2}A\iota^{2}\\ \leq&1000\left(\sqrt{H^{2}SAT_{k}\iota^{2}}+H^{2}S^{2}AE_{\epsilon,\beta}\iota^{2}+H^{3}S^{2}A\iota^{2}\right),\end{split} (78)

where the second inequality is due to Lemma C.3 and the fact that H\cdot\left|[k]\setminus[k]_{\text{typ}}\right|\leq 250H^{3}S^{2}A\iota^{2}. The fourth inequality follows by combining Lemma C.8 and Lemma C.7. The last inequality results from solving the inequality with respect to U_{k,1}. ∎

Corollary C.10.

Under the event in Lemma C.9, we have

1000(H3SANhk(s)ι2+H2S2AEϵ,βι2+H3S2Aι2)Regret~(k,h,s)i=1k𝟙(shi=s)δ~i,hi=1k𝟙(shi=s)(V~hi(s)Vhπi(s))i=1k𝟙(shi=s)(V~hi(s)Vh(s))Nhk(s)(V~hk(s)Vh(s)),\begin{split}&1000\left(\sqrt{H^{3}SAN^{k}_{h}(s)\iota^{2}}+H^{2}S^{2}AE_{\epsilon,\beta}\iota^{2}+H^{3}S^{2}A\iota^{2}\right)\geq\widetilde{\text{Regret}}(k,h,s)\\ \geq&\sum_{i=1}^{k}\mathds{1}(s^{i}_{h}=s)\cdot\widetilde{\delta}_{i,h}\\ \geq&\sum_{i=1}^{k}\mathds{1}(s^{i}_{h}=s)\left(\widetilde{V}^{i}_{h}(s)-V^{\pi_{i}}_{h}(s)\right)\\ \geq&\sum_{i=1}^{k}\mathds{1}(s^{i}_{h}=s)\left(\widetilde{V}^{i}_{h}(s)-V^{\star}_{h}(s)\right)\\ \geq&N^{k}_{h}(s)\cdot\left(\widetilde{V}^{k}_{h}(s)-V^{\star}_{h}(s)\right),\end{split} (79)

where the first inequality is (76) from Lemma C.9, the second and third inequalities follow from the definitions of \widetilde{\text{Regret}}(k,h,s) and \widetilde{\delta}_{i,h}, the fourth inequality holds because V^{\star}_{h}\geq V^{\pi_{i}}_{h}, and the last inequality results from our algorithmic design that \widetilde{V}^{i}_{h}(s) is non-increasing in i (line 9 of Algorithm 1).

Therefore, we have V~hk(s)Vh(s)1000(H3SAι2/Nhk(s)+H2S2AEϵ,βι2/Nhk(s)+H3S2Aι2/Nhk(s))\widetilde{V}^{k}_{h}(s)-V^{\star}_{h}(s)\leq 1000\left(\sqrt{H^{3}SA\iota^{2}/N^{k}_{h}(s)}+H^{2}S^{2}AE_{\epsilon,\beta}\iota^{2}/N^{k}_{h}(s)+H^{3}S^{2}A\iota^{2}/N^{k}_{h}(s)\right).

Now we have proven the first point of our induction. Together with point 2 (which we will prove in Section C.6), this shows that, with probability 1-\beta, the whole induction process is valid.

For clarity, we restate the induction process under the high probability event in Lemma C.9. For all k[K]k\in[K],
1. Given that for all (i,h,s,a)[k]×[H]×𝒮×𝒜(i,h,s,a)\in[k]\times[H]\times\mathcal{S}\times\mathcal{A}, Q~hi(s,a)Qh(s,a)\widetilde{Q}^{i}_{h}(s,a)\geq Q^{\star}_{h}(s,a), we prove

Regret(k)Regret~(k)1000(H2SATkι2+H2S2AEϵ,βι2+H3S2Aι2)\text{Regret}(k)\leq\widetilde{\text{Regret}}(k)\leq 1000\left(\sqrt{H^{2}SAT_{k}\iota^{2}}+H^{2}S^{2}AE_{\epsilon,\beta}\iota^{2}+H^{3}S^{2}A\iota^{2}\right)

and for all (h,s)[H]×𝒮(h,s)\in[H]\times\mathcal{S},

V~hk(s)Vh(s)1000(SAH3ι2/Nhk(s)+S2AH2Eϵ,βι2/Nhk(s)+S2AH3ι2/Nhk(s)).\widetilde{V}^{k}_{h}(s)-V^{\star}_{h}(s)\leq 1000\left(\sqrt{SAH^{3}\iota^{2}/N^{k}_{h}(s)}+S^{2}AH^{2}E_{\epsilon,\beta}\iota^{2}/N^{k}_{h}(s)+S^{2}AH^{3}\iota^{2}/N^{k}_{h}(s)\right).

2. Given that for all (h,s)\in[H]\times\mathcal{S},

\widetilde{V}^{k}_{h}(s)-V^{\star}_{h}(s)\leq 1000\left(\sqrt{SAH^{3}\iota^{2}/N^{k}_{h}(s)}+S^{2}AH^{2}E_{\epsilon,\beta}\iota^{2}/N^{k}_{h}(s)+S^{2}AH^{3}\iota^{2}/N^{k}_{h}(s)\right),

we prove that for all (h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}, \widetilde{Q}^{k+1}_{h}(s,a)\geq Q^{\star}_{h}(s,a).

Therefore, with probability 1-\beta, we have (recalling that T=KH)

Regret(K)Regret~(K)O~(H2SAT+H2S2AEϵ,β+H3S2A).\text{Regret}(K)\leq\widetilde{\text{Regret}}(K)\leq\widetilde{O}\left(\sqrt{H^{2}SAT}+H^{2}S^{2}AE_{\epsilon,\beta}+H^{3}S^{2}A\right). (80)

This completes the proof of Theorem 4.1.

C.6 Proof of optimism

In this part, we prove the optimism claimed in point 2 of the induction. Given the condition that for all (h,s)\in[H]\times\mathcal{S},

V~hk+1(s)Vh(s)1000(SAH3ι2/Nhk+1(s)+S2AH2Eϵ,βι2/Nhk+1(s)+S2AH3ι2/Nhk+1(s)),\widetilde{V}^{k+1}_{h}(s)-V^{\star}_{h}(s)\leq 1000\left(\sqrt{SAH^{3}\iota^{2}/N^{k+1}_{h}(s)}+S^{2}AH^{2}E_{\epsilon,\beta}\iota^{2}/N^{k+1}_{h}(s)+S^{2}AH^{3}\iota^{2}/N^{k+1}_{h}(s)\right),

we prove that \widetilde{Q}^{k+1}_{h}(s,a)\geq Q^{\star}_{h}(s,a) for all (h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A} through backward induction over h (from H+1 down to 1). Since the conclusion holds trivially for h=H+1, it suffices to prove the following Lemma C.11.

Lemma C.11.

Under the high-probability events in Assumption 3.1 and Lemma B.8, if it holds that

1. For all (j,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}, \widetilde{Q}^{k}_{j}(s,a)\geq Q^{\star}_{j}(s,a).

2. For all s\in\mathcal{S},

0\leq\widetilde{V}^{k+1}_{h+1}(s)-V^{\star}_{h+1}(s)\leq 1000\left(\sqrt{SAH^{3}\iota^{2}/N^{k+1}_{h+1}(s)}+S^{2}AH^{2}E_{\epsilon,\beta}\iota^{2}/N^{k+1}_{h+1}(s)+S^{2}AH^{3}\iota^{2}/N^{k+1}_{h+1}(s)\right).

Then we have for all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, Q~hk+1(s,a)Qh(s,a)\widetilde{Q}^{k+1}_{h}(s,a)\geq Q^{\star}_{h}(s,a).

Proof of Lemma C.11.

For all (s,a)\in\mathcal{S}\times\mathcal{A}, since both \widetilde{Q}^{k}_{h}(s,a)\geq Q^{\star}_{h}(s,a) (condition 1) and H\geq Q^{\star}_{h}(s,a), it suffices to prove that \widetilde{r}^{k+1}_{h}(s,a)+\widetilde{P}^{k+1}_{h}\cdot\widetilde{V}^{k+1}_{h+1}(s,a)+b^{k+1}_{h}(s,a)\geq Q^{\star}_{h}(s,a). We have

r~hk+1(s,a)+P~hk+1V~h+1k+1(s,a)+bhk+1(s,a)Qh(s,a)(r~hk+1rh)(s,a)+(P~hk+1Ph)Vh+1(s,a)+bhk+1(s,a)(P~hk+1Ph)Vh+1(s,a)+2VarsP~hk+1(|s,a)V~h+1k+1()ιN~hk+1(s,a)+19HSEϵ,βιN~hk+1(s,a)+4ιsP~hk+1(s|s,a)min{10002H3SAι2N~h+1k+1(s)+10002H4S4A2Eϵ,β2ι4N~h+1k+1(s)2+10002H6S4A2ι4N~h+1k+1(s)2,H2}N~hk(s,a)2VarsP^hk+1(|s,a)Vh+1()ιN~hk+1(s,a)+2VarsP~hk+1(|s,a)V~h+1k+1()ιN~hk+1(s,a)+17HSEϵ,βιN~hk+1(s,a)+4ιsP~hk+1(s|s,a)min{10002H3SAι2N~h+1k+1(s)+10002H4S4A2Eϵ,β2ι4N~h+1k+1(s)2+10002H6S4A2ι4N~h+1k+1(s)2,H2}N~hk(s,a)2VarsP^hk+1(|s,a)Vh+1()ιN~hk+1(s,a)+2VarsP^hk+1(|s,a)V~h+1k+1()ιN~hk+1(s,a)+4ιsP^hk+1(s|s,a)min{10002H3SAι2N~h+1k+1(s)+10002H4S4A2Eϵ,β2ι4N~h+1k+1(s)2+10002H6S4A2ι4N~h+1k+1(s)2,H2}N~hk(s,a)0,\begin{split}&\widetilde{r}^{k+1}_{h}(s,a)+\widetilde{P}^{k+1}_{h}\cdot\widetilde{V}^{k+1}_{h+1}(s,a)+b^{k+1}_{h}(s,a)-Q^{\star}_{h}(s,a)\\ \geq&\left(\widetilde{r}^{k+1}_{h}-r_{h}\right)(s,a)+\left(\widetilde{P}^{k+1}_{h}-P_{h}\right)\cdot V^{\star}_{h+1}(s,a)+b^{k+1}_{h}(s,a)\\ \geq&\left(\widetilde{P}^{k+1}_{h}-P_{h}\right)\cdot V^{\star}_{h+1}(s,a)+2\sqrt{\frac{\mathrm{Var}_{s^{\prime}\sim\widetilde{P}_{h}^{k+1}(\cdot|s,a)}\widetilde{V}^{k+1}_{h+1}(\cdot)\cdot\iota}{\widetilde{N}^{k+1}_{h}(s,a)}}+\frac{19HSE_{\epsilon,\beta}\cdot\iota}{\widetilde{N}^{k+1}_{h}(s,a)}\\ +&4\sqrt{\iota}\cdot\sqrt{\frac{\sum_{s^{\prime}}\widetilde{P}^{k+1}_{h}(s^{\prime}|s,a)\min\left\{\frac{1000^{2}H^{3}SA\iota^{2}}{\widetilde{N}^{k+1}_{h+1}(s^{\prime})}+\frac{1000^{2}H^{4}S^{4}A^{2}E_{\epsilon,\beta}^{2}\iota^{4}}{\widetilde{N}^{k+1}_{h+1}(s^{\prime})^{2}}+\frac{1000^{2}H^{6}S^{4}A^{2}\iota^{4}}{\widetilde{N}^{k+1}_{h+1}(s^{\prime})^{2}},H^{2}\right\}}{\widetilde{N}^{k}_{h}(s,a)}}\\ \geq&-\sqrt{\frac{2\mathrm{Var}_{s^{\prime}\sim\widehat{P}_{h}^{k+1}(\cdot|s,a)}{V}^{\star}_{h+1}(\cdot)\cdot\iota}{\widetilde{N}^{k+1}_{h}(s,a)}}+2\sqrt{\frac{\mathrm{Var}_{s^{\prime}\sim\widetilde{P}_{h}^{k+1}(\cdot|s,a)}\widetilde{V}^{k+1}_{h+1}(\cdot)\cdot\iota}{\widetilde{N}^{k+1}_{h}(s,a)}}+\frac{17HSE_{\epsilon,\beta}\cdot\iota}{\widetilde{N}^{k+1}_{h}(s,a)}\\ +&4\sqrt{\iota}\cdot\sqrt{\frac{\sum_{s^{\prime}}\widetilde{P}^{k+1}_{h}(s^{\prime}|s,a)\min\left\{\frac{1000^{2}H^{3}SA\iota^{2}}{\widetilde{N}^{k+1}_{h+1}(s^{\prime})}+\frac{1000^{2}H^{4}S^{4}A^{2}E_{\epsilon,\beta}^{2}\iota^{4}}{\widetilde{N}^{k+1}_{h+1}(s^{\prime})^{2}}+\frac{1000^{2}H^{6}S^{4}A^{2}\iota^{4}}{\widetilde{N}^{k+1}_{h+1}(s^{\prime})^{2}},H^{2}\right\}}{\widetilde{N}^{k}_{h}(s,a)}}\\ \geq&-\sqrt{\frac{2\mathrm{Var}_{s^{\prime}\sim\widehat{P}_{h}^{k+1}(\cdot|s,a)}{V}^{\star}_{h+1}(\cdot)\cdot\iota}{\widetilde{N}^{k+1}_{h}(s,a)}}+2\sqrt{\frac{\mathrm{Var}_{s^{\prime}\sim\widehat{P}_{h}^{k+1}(\cdot|s,a)}\widetilde{V}^{k+1}_{h+1}(\cdot)\cdot\iota}{\widetilde{N}^{k+1}_{h}(s,a)}}\\ +&4\sqrt{\iota}\cdot\sqrt{\frac{\sum_{s^{\prime}}\widehat{P}^{k+1}_{h}(s^{\prime}|s,a)\min\left\{\frac{1000^{2}H^{3}SA\iota^{2}}{\widetilde{N}^{k+1}_{h+1}(s^{\prime})}+\frac{1000^{2}H^{4}S^{4}A^{2}E_{\epsilon,\beta}^{2}\iota^{4}}{\widetilde{N}^{k+1}_{h+1}(s^{\prime})^{2}}+\frac{1000^{2}H^{6}S^{4}A^{2}\iota^{4}}{\widetilde{N}^{k+1}_{h+1}(s^{\prime})^{2}},H^{2}\right\}}{\widetilde{N}^{k}_{h}(s,a)}}\\ \geq&0,\end{split} (81)

where the first inequality is because of the Bellman equation and condition 2. The second inequality holds due to the definition of b^{k+1}_{h} and Lemma B.1. The third inequality results from Lemma B.6. The fourth inequality comes from Lemma B.9 and Remark B.3. The last inequality is due to the following analysis.

Because Var(X)2Var(Y)+2Var(XY)\sqrt{\mathrm{Var}(X)}\leq\sqrt{2\mathrm{Var}(Y)}+\sqrt{2\mathrm{Var}(X-Y)} (Lemma 2 of Azar et al. [2017]), we have

VarsP^hk+1(|s,a)Vh+1()2VarsP^hk+1(|s,a)V~h+1k+1()+2VarsP^hk+1(|s,a)(V~h+1k+1()Vh+1())(a).\begin{split}&\sqrt{\mathrm{Var}_{s^{\prime}\sim\widehat{P}_{h}^{k+1}(\cdot|s,a)}{V}^{\star}_{h+1}(\cdot)}\leq\sqrt{2\mathrm{Var}_{s^{\prime}\sim\widehat{P}_{h}^{k+1}(\cdot|s,a)}\widetilde{V}^{k+1}_{h+1}(\cdot)}+\underbrace{\sqrt{2\mathrm{Var}_{s^{\prime}\sim\widehat{P}_{h}^{k+1}(\cdot|s,a)}\left(\widetilde{V}^{k+1}_{h+1}(\cdot)-V^{\star}_{h+1}(\cdot)\right)}}_{\mathrm{(a)}}.\end{split} (82)

In addition,

(a)2sP^hk+1(s|s,a)(V~h+1k+1(s)Vh+1(s))26sP^hk+1(s|s,a)min{10002H3SAι2Nh+1k+1(s)+10002H4S4A2Eϵ,β2ι4Nh+1k+1(s)2+10002H6S4A2ι4Nh+1k+1(s)2,H2}22sP^hk+1(s|s,a)min{10002H3SAι2N~h+1k+1(s)+10002H4S4A2Eϵ,β2ι4N~h+1k+1(s)2+10002H6S4A2ι4N~h+1k+1(s)2,H2},\begin{split}&\mathrm{(a)}\leq\sqrt{2\sum_{s^{\prime}}\widehat{P}_{h}^{k+1}(s^{\prime}|s,a)\left(\widetilde{V}^{k+1}_{h+1}(s^{\prime})-V^{\star}_{h+1}(s^{\prime})\right)^{2}}\\ \leq&\sqrt{6\sum_{s^{\prime}}\widehat{P}_{h}^{k+1}(s^{\prime}|s,a)\min\left\{\frac{1000^{2}H^{3}SA\iota^{2}}{N^{k+1}_{h+1}(s^{\prime})}+\frac{1000^{2}H^{4}S^{4}A^{2}E_{\epsilon,\beta}^{2}\iota^{4}}{N^{k+1}_{h+1}(s^{\prime})^{2}}+\frac{1000^{2}H^{6}S^{4}A^{2}\iota^{4}}{N^{k+1}_{h+1}(s^{\prime})^{2}},H^{2}\right\}}\\ \leq&2\sqrt{2\sum_{s^{\prime}}\widehat{P}_{h}^{k+1}(s^{\prime}|s,a)\min\left\{\frac{1000^{2}H^{3}SA\iota^{2}}{\widetilde{N}^{k+1}_{h+1}(s^{\prime})}+\frac{1000^{2}H^{4}S^{4}A^{2}E_{\epsilon,\beta}^{2}\iota^{4}}{\widetilde{N}^{k+1}_{h+1}(s^{\prime})^{2}}+\frac{1000^{2}H^{6}S^{4}A^{2}\iota^{4}}{\widetilde{N}^{k+1}_{h+1}(s^{\prime})^{2}},H^{2}\right\}},\end{split} (83)

where the first inequality results from the definition of variance. The second inequality holds because of condition 2 and the fact that min{(a+b+c)2,H2}3min{a2+b2+c2,H2}\min\{(a+b+c)^{2},H^{2}\}\leq 3\min\{a^{2}+b^{2}+c^{2},H^{2}\}. The last inequality holds according to Assumption 3.1.

Finally, plugging (82) and (83) into (81) shows that the last inequality of (81) holds. Therefore, the proof of Lemma C.11 is complete. ∎
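The only probabilistic tool behind (82) is the decomposition \sqrt{\mathrm{Var}(X)}\leq\sqrt{2\mathrm{Var}(Y)}+\sqrt{2\mathrm{Var}(X-Y)}, which follows from (a+b)^{2}\leq 2a^{2}+2b^{2} applied to the centered deviations. A standalone numerical sanity check of this inequality is sketched below in Python; it uses arbitrary random distributions and value vectors and is not part of the proof.

import numpy as np

rng = np.random.default_rng(1)

def check_variance_decomposition(n_trials=1000, support=10):
    # Empirically check  sqrt(Var(X)) <= sqrt(2 Var(Y)) + sqrt(2 Var(X - Y))
    # for random finite distributions p and random value vectors X, Y on the support.
    for _ in range(n_trials):
        p = rng.dirichlet(np.ones(support))       # a random distribution over `support` states
        x = rng.uniform(0.0, 5.0, size=support)   # values of X on each state
        y = rng.uniform(0.0, 5.0, size=support)   # values of Y on each state
        var = lambda v: p @ (v - p @ v) ** 2      # variance of v under p
        lhs = np.sqrt(var(x))
        rhs = np.sqrt(2 * var(y)) + np.sqrt(2 * var(x - y))
        assert lhs <= rhs + 1e-12, (lhs, rhs)
    return True

print(check_variance_decomposition())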

Appendix D Missing proofs in Section 5

In this section, we provide the proofs omitted from Section 5. Recall that N^{k}_{h} denotes the original counts, \widehat{N}^{k}_{h} the noisy counts after step (1) of both Privatizers, and \widetilde{N}^{k}_{h} the final private counts.

Proof of Lemma 5.1.

Due to Theorem 3.5 of Chan et al. [2011] and Lemma 34 of Hsu et al. [2014], the release of \{\widehat{N}^{k}_{h}(s,a)\}_{(h,s,a,k)} satisfies \frac{\epsilon}{3}-DP. Similarly, the releases of \{\widehat{N}^{k}_{h}(s,a,s^{\prime})\}_{(k,h,s,a,s^{\prime})} and \{\widetilde{R}^{k}_{h}(s,a)\}_{(k,h,s,a)} each satisfy \frac{\epsilon}{3}-DP. Therefore, by composition, the release of the private counters \{\widehat{N}^{k}_{h}(s,a)\}_{(h,s,a,k)}, \{\widehat{N}^{k}_{h}(s,a,s^{\prime})\}_{(k,h,s,a,s^{\prime})} and \{\widetilde{R}^{k}_{h}(s,a)\}_{(k,h,s,a)} satisfies \epsilon-DP. Due to post-processing (Lemma 2.3 of Bun and Steinke [2016]), the release of all final private counts \{\widetilde{N}^{k}_{h}(s,a)\}_{(h,s,a,k)}, \{\widetilde{N}^{k}_{h}(s,a,s^{\prime})\}_{(k,h,s,a,s^{\prime})} and \{\widetilde{R}^{k}_{h}(s,a)\}_{(k,h,s,a)} also satisfies \epsilon-DP. It then follows, again by post-processing, that the release of all policies \pi_{k} is \epsilon-DP. Finally, the \epsilon-JDP guarantee results from the Billboard Lemma (Lemma 9 of Hsu et al. [2014]).

For utility analysis, by Theorem 3.6 of Chan et al. [2011], our choice of \epsilon^{\prime}=\frac{\epsilon}{3H\log K} in the Binary Mechanism, and a union bound, with probability 1-\frac{\beta}{3}, for all (k,h,s,a,s^{\prime}),

|N^hk(s,a,s)Nhk(s,a,s)|O(Hϵlog(HSAT/β)2),|N^hk(s,a)Nhk(s,a)|O(Hϵlog(HSAT/β)2),|R~hk(s,a)Rhk(s,a)|O(Hϵlog(HSAT/β)2).\begin{split}|\widehat{N}^{k}_{h}(s,a,s^{\prime})-N^{k}_{h}(s,a,s^{\prime})|\leq O(\frac{H}{\epsilon}\log(HSAT/\beta)^{2}),\;\;|\widehat{N}^{k}_{h}(s,a)-N^{k}_{h}(s,a)|\leq O(\frac{H}{\epsilon}\log(HSAT/\beta)^{2}),\\ |\widetilde{R}^{k}_{h}(s,a)-R^{k}_{h}(s,a)|\leq O(\frac{H}{\epsilon}\log(HSAT/\beta)^{2}).\end{split} (84)

Together with Lemma 5.4, the Central Privatizer satisfies Assumption 3.1 with Eϵ,β=O~(Hϵ)E_{\epsilon,\beta}=\widetilde{O}(\frac{H}{\epsilon}). ∎
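For illustration, the following is a minimal Python sketch of a tree-based counter in the spirit of the Binary Mechanism of Chan et al. [2011]; it is not the exact mechanism or parameterization used by the Central Privatizer, and the function name and block enumeration are choices made here for readability. Each dyadic block of time steps receives one Laplace-perturbed partial sum; since every stream element affects at most L = O(\log T) blocks, adding noise of scale L/\epsilon to each block makes the release of all prefix counts \epsilon-DP, with per-count error of order \log^{1.5}(T)/\epsilon.

import numpy as np

rng = np.random.default_rng(0)

def binary_mechanism_prefix_counts(stream, eps):
    # Release all prefix sums of a 0/1 stream with a tree-based counter.
    # Each dyadic block [j*2^l, (j+1)*2^l) gets one noisy partial sum with
    # Laplace scale L/eps (L = number of levels); every prefix [0, t) is then
    # rebuilt from at most L disjoint dyadic blocks.
    T = len(stream)
    L = int(np.ceil(np.log2(T))) + 1
    noisy = {}
    for level in range(L):
        size = 2 ** level
        for j in range(T // size):
            block_sum = sum(stream[j * size:(j + 1) * size])
            noisy[(level, j)] = block_sum + rng.laplace(scale=L / eps)
    prefixes = []
    for t in range(1, T + 1):
        total, pos = 0.0, 0
        for level in reversed(range(L)):
            size = 2 ** level
            if pos + size <= t:                      # take the largest block that still fits
                total += noisy[(level, pos // size)]
                pos += size
        prefixes.append(total)
    return prefixes

# Example: private running visit counts of one (h, s, a) triple over K episodes.
K = 64
visits = rng.integers(0, 2, size=K).tolist()
private_counts = binary_mechanism_prefix_counts(visits, eps=1.0)
print(np.max(np.abs(np.array(private_counts) - np.cumsum(visits))))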

Proof of Theorem 5.2.

The proof directly results from plugging Eϵ,β=O~(Hϵ)E_{\epsilon,\beta}=\widetilde{O}(\frac{H}{\epsilon}) into Theorem 4.1. ∎

Proof of Lemma 5.4.

For clarity, we denote the solution of (4) by N¯hk\bar{N}^{k}_{h} and therefore N~hk(s,a,s)=N¯hk(s,a,s)+Eϵ,β2S\widetilde{N}^{k}_{h}(s,a,s^{\prime})=\bar{N}^{k}_{h}(s,a,s^{\prime})+\frac{E_{\epsilon,\beta}}{2S}, N~hk(s,a)=N¯hk(s,a)+Eϵ,β2\widetilde{N}^{k}_{h}(s,a)=\bar{N}^{k}_{h}(s,a)+\frac{E_{\epsilon,\beta}}{2}.

When the condition (the two inequalities) in Lemma 5.4 holds, the original counts \{N^{k}_{h}(s,a,s^{\prime})\}_{s^{\prime}\in\mathcal{S}} form a feasible solution to the optimization problem, which means that

maxs|N¯hk(s,a,s)N^hk(s,a,s)|maxs|Nhk(s,a,s)N^hk(s,a,s)|Eϵ,β4.\max_{s^{\prime}}|\bar{N}^{k}_{h}(s,a,s^{\prime})-\widehat{N}^{k}_{h}(s,a,s^{\prime})|\leq\max_{s^{\prime}}|N^{k}_{h}(s,a,s^{\prime})-\widehat{N}^{k}_{h}(s,a,s^{\prime})|\leq\frac{E_{\epsilon,\beta}}{4}.

Combining this with the condition in Lemma 5.4 with respect to \widehat{N}^{k}_{h}(s,a,s^{\prime}), it holds that

|N¯hk(s,a,s)Nhk(s,a,s)||N¯hk(s,a,s)N^hk(s,a,s)|+|N^hk(s,a,s)Nhk(s,a,s)|Eϵ,β2.|\bar{N}^{k}_{h}(s,a,s^{\prime})-N^{k}_{h}(s,a,s^{\prime})|\leq|\bar{N}^{k}_{h}(s,a,s^{\prime})-\widehat{N}^{k}_{h}(s,a,s^{\prime})|+|\widehat{N}^{k}_{h}(s,a,s^{\prime})-N^{k}_{h}(s,a,s^{\prime})|\leq\frac{E_{\epsilon,\beta}}{2}.

Since N~hk(s,a,s)=N¯hk(s,a,s)+Eϵ,β2S\widetilde{N}^{k}_{h}(s,a,s^{\prime})=\bar{N}^{k}_{h}(s,a,s^{\prime})+\frac{E_{\epsilon,\beta}}{2S} and N¯hk(s,a,s)0\bar{N}^{k}_{h}(s,a,s^{\prime})\geq 0, we have

N~hk(s,a,s)>0,|N~hk(s,a,s)Nhk(s,a,s)|Eϵ,β.\widetilde{N}^{k}_{h}(s,a,s^{\prime})>0,\;\;\;|\widetilde{N}^{k}_{h}(s,a,s^{\prime})-N^{k}_{h}(s,a,s^{\prime})|\leq E_{\epsilon,\beta}. (85)

For N¯hk(s,a)\bar{N}^{k}_{h}(s,a), according to the constraints in the optimization problem (4), it holds that

|N¯hk(s,a)N^hk(s,a)|Eϵ,β4.|\bar{N}^{k}_{h}(s,a)-\widehat{N}^{k}_{h}(s,a)|\leq\frac{E_{\epsilon,\beta}}{4}.

Combining this with the condition in Lemma 5.4 with respect to \widehat{N}^{k}_{h}(s,a), it holds that

|N¯hk(s,a)Nhk(s,a)||N¯hk(s,a)N^hk(s,a)|+|N^hk(s,a)Nhk(s,a)|Eϵ,β2.|\bar{N}^{k}_{h}(s,a)-N^{k}_{h}(s,a)|\leq|\bar{N}^{k}_{h}(s,a)-\widehat{N}^{k}_{h}(s,a)|+|\widehat{N}^{k}_{h}(s,a)-N^{k}_{h}(s,a)|\leq\frac{E_{\epsilon,\beta}}{2}.

Since N~hk(s,a)=N¯hk(s,a)+Eϵ,β2\widetilde{N}^{k}_{h}(s,a)=\bar{N}^{k}_{h}(s,a)+\frac{E_{\epsilon,\beta}}{2}, we have

Nhk(s,a)N~hk(s,a)Nhk(s,a)+Eϵ,β.N^{k}_{h}(s,a)\leq\widetilde{N}^{k}_{h}(s,a)\leq N^{k}_{h}(s,a)+E_{\epsilon,\beta}. (86)

According to the last line of the optimization problem (4), we have N¯hk(s,a)=s𝒮N¯hk(s,a,s)\bar{N}^{k}_{h}(s,a)=\sum_{s^{\prime}\in\mathcal{S}}\bar{N}^{k}_{h}(s,a,s^{\prime}) and therefore,

N~hk(s,a)=s𝒮N~hk(s,a,s).\widetilde{N}^{k}_{h}(s,a)=\sum_{s^{\prime}\in\mathcal{S}}\widetilde{N}^{k}_{h}(s,a,s^{\prime}). (87)

The proof is complete by combining (85), (86) and (87). ∎
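A minimal sketch of this post-processing step is given below, under the assumption that the optimization problem (4) is instantiated as follows: for a fixed (h,s,a), minimize the maximum deviation of the candidate transition counts from the noisy ones, subject to nonnegativity and to the sum staying within E_{\epsilon,\beta}/4 of the noisy (s,a)-count. This matches the constraints and the feasibility argument used in the proof above, although the exact objective of (4) may differ; the final shift by E_{\epsilon,\beta}/(2S) and E_{\epsilon,\beta}/2 then reproduces (85), (86) and (87).

import numpy as np
from scipy.optimize import linprog

def consistent_private_counts(noisy_sas, noisy_sa, E):
    # noisy_sas : array of shape (S,), the noisy counts \hat N^k_h(s, a, s').
    # noisy_sa  : scalar, the noisy count \hat N^k_h(s, a).
    # E         : the noise bound E_{eps, beta}.
    S = len(noisy_sas)
    # Variables: x[0:S] = candidate counts \bar N(s, a, s'), x[S] = max deviation t.
    c = np.zeros(S + 1)
    c[S] = 1.0                                    # minimize t
    A_ub, b_ub = [], []
    for i in range(S):                            # |x_i - noisy_sas[i]| <= t
        row = np.zeros(S + 1); row[i] = 1.0; row[S] = -1.0
        A_ub.append(row); b_ub.append(noisy_sas[i])
        row = np.zeros(S + 1); row[i] = -1.0; row[S] = -1.0
        A_ub.append(row); b_ub.append(-noisy_sas[i])
    row = np.zeros(S + 1); row[:S] = 1.0          # sum_i x_i <= noisy_sa + E/4
    A_ub.append(row); b_ub.append(noisy_sa + E / 4)
    row = np.zeros(S + 1); row[:S] = -1.0         # sum_i x_i >= noisy_sa - E/4
    A_ub.append(row); b_ub.append(-(noisy_sa - E / 4))
    bounds = [(0, None)] * (S + 1)                # nonnegative counts and deviation
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    bar_sas = res.x[:S]
    # Final private counts: shift as in the proof of Lemma 5.4, so that
    # \tilde N(s,a) = sum_{s'} \tilde N(s,a,s') holds exactly.
    tilde_sas = bar_sas + E / (2 * S)
    tilde_sa = bar_sas.sum() + E / 2
    return tilde_sas, tilde_sa

# Example with arbitrary noisy counts.
tilde_sas, tilde_sa = consistent_private_counts(np.array([3.2, 0.4, 5.1, -0.3]), 8.1, E=2.0)
print(tilde_sas, tilde_sa, tilde_sas.sum())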

Proof of Lemma 5.6.

The privacy guarantee directly results from the properties of the Laplace Mechanism and the composition property of DP [Dwork et al., 2014].

For utility analysis, by Corollary 12.4 of Dwork et al. [2014] and a union bound, with probability 1-\frac{\beta}{3}, for all (k,h,s,a,s^{\prime}),

|N^hk(s,a,s)Nhk(s,a,s)|O(HϵKlog(HSAT/β)),|N^hk(s,a)Nhk(s,a)|O(HϵKlog(HSAT/β)),|R~hk(s,a)Rhk(s,a)|O(HϵKlog(HSAT/β)).\begin{split}|\widehat{N}^{k}_{h}(s,a,s^{\prime})-N^{k}_{h}(s,a,s^{\prime})|\leq O(\frac{H}{\epsilon}\sqrt{K\log(HSAT/\beta)}),\;\;|\widehat{N}^{k}_{h}(s,a)-N^{k}_{h}(s,a)|\leq O(\frac{H}{\epsilon}\sqrt{K\log(HSAT/\beta)}),\\ |\widetilde{R}^{k}_{h}(s,a)-R^{k}_{h}(s,a)|\leq O(\frac{H}{\epsilon}\sqrt{K\log(HSAT/\beta)}).\end{split} (88)

Together with Lemma 5.4, the Local Privatizer satisfies Assumption 3.1 with Eϵ,β=O~(HϵK)E_{\epsilon,\beta}=\widetilde{O}(\frac{H}{\epsilon}\sqrt{K}). ∎
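To see where the extra \sqrt{K} factor comes from relative to the Central Privatizer: under LDP each user perturbs their own single-episode statistics before sending them, so after K episodes the aggregated count carries the sum of K independent Laplace noises, whose typical magnitude grows as \sqrt{K}/\epsilon. The Python sketch below illustrates this scaling with a single scalar count and arbitrary parameters; it drops the H-dependent sensitivity factor that appears in the actual Local Privatizer.

import numpy as np

rng = np.random.default_rng(2)

def aggregated_local_noise(K, eps, n_rep=2000):
    # Each of K users perturbs a single 0/1 count with Lap(1/eps) before sending it.
    # Return the average absolute error of the aggregated count over n_rep repetitions.
    noise = rng.laplace(scale=1.0 / eps, size=(n_rep, K)).sum(axis=1)
    return np.abs(noise).mean()

eps = 1.0
for K in [100, 400, 1600]:
    # The empirical error grows roughly like sqrt(K) / eps.
    print(K, round(aggregated_local_noise(K, eps), 1), round(np.sqrt(K) / eps, 1))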

Proof of Theorem 5.7.

The proof directly results from plugging Eϵ,β=O~(HϵK)E_{\epsilon,\beta}=\widetilde{O}(\frac{H}{\epsilon}\sqrt{K}) into Theorem 4.1. ∎