
Online Learning for Unknown Partially Observable MDPs

Mehdi Jafarnia-Jahromi
University of Southern California
[email protected]
Rahul Jain
University of Southern California
[email protected]
Ashutosh Nayyar
University of Southern California
[email protected]
Abstract

Solving Partially Observable Markov Decision Processes (POMDPs) is hard. Learning optimal controllers for POMDPs when the model is unknown is harder. Online learning of optimal controllers for unknown POMDPs, which requires efficient learning using regret-minimizing algorithms that effectively trade off exploration and exploitation, is even harder, and no solution currently exists. In this paper, we consider infinite-horizon average-cost POMDPs with an unknown transition model but a known observation model. We propose a natural posterior sampling-based reinforcement learning algorithm (PSRL-POMDP) and show that it achieves a regret bound of \mathcal{O}(\log T), where T is the time horizon, when the parameter set is finite. In the general case (continuous parameter set), we show that the algorithm achieves \tilde{\mathcal{O}}(T^{2/3}) regret under two technical assumptions. To the best of our knowledge, this is the first online RL algorithm for POMDPs with sub-linear regret.

1 Introduction

Reinforcement learning (RL) considers the sequential decision-making problem of an agent in an unknown environment with the goal of minimizing the total cost. The agent faces a fundamental exploration-exploitation trade-off: should it exploit the available information to minimize the cost or should it explore the environment to gather more information for future decisions? Maintaining a proper balance between exploration and exploitation is a fundamental challenge in RL and is measured with the notion of cumulative regret: the difference between the cumulative cost of the learning algorithm and that of the best policy.

The problem of balancing exploration and exploitation in RL has been successfully addressed for MDPs, and algorithms with near-optimal regret bounds are known (Bartlett and Tewari, 2009; Jaksch et al., 2010; Ouyang et al., 2017b; Azar et al., 2017; Fruit et al., 2018; Jin et al., 2018; Abbasi-Yadkori et al., 2019b; Zhang and Ji, 2019; Zanette and Brunskill, 2019; Hao et al., 2020; Wei et al., 2020, 2021). MDPs assume that the state is perfectly observable by the agent and the only uncertainty is about the underlying dynamics of the environment. However, in many real-world scenarios such as robotics, healthcare and finance, the state is not fully observed by the agent, and only a partial observation is available. These scenarios are modeled by Partially Observable Markov Decision Processes (POMDPs). In addition to the uncertainty in the environment dynamics, the agent has to deal with the uncertainty about the underlying state. It is well known (Kumar and Varaiya, 2015) that introducing an information or belief state (a posterior distribution over the states given the history of observations and actions) allows the POMDP to be recast as an MDP over the belief state space. The resulting algorithm requires a posterior update of the belief state, which needs the transition and observation models to be fully known. This presents a significant difficulty when the model parameters are unknown. Thus, managing the exploration-exploitation trade-off for POMDPs is a significant challenge, and to the best of our knowledge, no online RL algorithm with sub-linear regret is known.

In this paper, we consider infinite-horizon average-cost POMDPs with finite states, actions and observations. The underlying state transition dynamics is unknown, though we assume the observation kernel to be known. We propose a Posterior Sampling Reinforcement Learning algorithm (PSRL-POMDP) and prove that it achieves a Bayesian expected regret bound of \mathcal{O}(\log T) in the finite (transition kernel) parameter set case, where T is the time horizon. We then show that in the general (continuous parameter set) case, it achieves \tilde{\mathcal{O}}(T^{2/3}) under some technical assumptions. The PSRL-POMDP algorithm is a natural extension of the TSDE algorithm for MDPs (Ouyang et al., 2017b) with two main differences. First, in addition to the posterior distribution on the environment dynamics, the algorithm maintains a posterior distribution on the underlying state. Second, since the state is not fully observable, the agent cannot keep track of the number of visits to state-action pairs, a quantity that is crucial in the design of algorithms for tabular MDPs. Instead, we introduce a notion of pseudo count and carefully handle its relation with the true counts to obtain sub-linear regret. To the best of our knowledge, PSRL-POMDP is the first online RL algorithm for POMDPs with sub-linear regret.

1.1 Related Literature

We review the related literature in two main domains: efficient exploration for MDPs, and learning in POMDPs.

Efficient exploration in MDPs. To balance the exploration and exploitation, two general techniques are used in the basic tabular MDPs: optimism in the face of uncertainty (OFU), and posterior sampling. Under the OFU technique, the agent constructs a confidence set around the system parameters, selects an optimistic parameter associated with the minimum cost from the confidence set, and takes actions with respect to the optimistic parameter. This principle is widely used in the literature to achieve optimal regret bounds (Bartlett and Tewari, 2009; Jaksch et al., 2010; Azar et al., 2017; Fruit et al., 2018; Jin et al., 2018; Zhang and Ji, 2019; Zanette and Brunskill, 2019; Wei et al., 2020). An alternative technique to encourage exploration is posterior sampling (Thompson, 1933). In this approach, the agent maintains a posterior distribution over the system parameters, samples a parameter from the posterior distribution, and takes action with respect to the sampled parameter (Strens, 2000; Osband et al., 2013; Fonteneau et al., 2013; Gopalan and Mannor, 2015; Ouyang et al., 2017b; Jafarnia-Jahromi et al., 2021). In particular, (Ouyang et al., 2017b) proposes TSDE, a posterior sampling-based algorithm for the infinite-horizon average-cost MDPs.

Extending these results to continuous state MDPs has recently been addressed with general function approximation (Osband and Van Roy, 2014; Dong et al., 2020; Ayoub et al., 2020; Wang et al., 2020), or in the special cases of linear function approximation (Abbasi-Yadkori et al., 2019a, b; Jin et al., 2020; Hao et al., 2020; Wei et al., 2021; Wang et al., 2021) and Linear Quadratic Regulators (Ouyang et al., 2017a; Dean et al., 2018; Cohen et al., 2019; Mania et al., 2019; Simchowitz and Foster, 2020; Lale et al., 2020a). In general, POMDPs can be formulated as continuous state MDPs by considering the belief as the state. However, computing the belief requires knowledge of the model parameters and is thus unavailable in the RL setting. Hence, learning algorithms for continuous state MDPs cannot be directly applied to POMDPs.

Learning in POMDPs. To the best of our knowledge, the only existing work with sub-linear regret in POMDPs is Azizzadenesheli et al. (2017). However, their definition of regret is not with respect to the optimal policy, but with respect to the best memoryless policy (a policy that maps the current observation to an action). With our natural definition of regret, their algorithm suffers linear regret. Other learning algorithms for POMDPs either consider linear dynamics (Lale et al., 2020b; Tsiamis and Pappas, 2020) or do not consider regret (Shani et al., 2005; Ross et al., 2007; Poupart and Vlassis, 2008; Cai et al., 2009; Liu et al., 2011, 2013; Doshi-Velez et al., 2013; Katt et al., 2018; Azizzadenesheli et al., 2018) and are not directly comparable to our setting.

2 Preliminaries

An infinite-horizon average-cost Partially Observable Markov Decision Process (POMDP) can be specified by ({\mathcal{S}},{\mathcal{A}},\theta,C,O,\eta) where {\mathcal{S}} is the state space, {\mathcal{A}} is the action space, C:{\mathcal{S}}\times{\mathcal{A}}\to[0,1] is the cost function, and O is the set of observations. Here \eta:{\mathcal{S}}\to\Delta_{O} is the observation kernel, and \theta:{\mathcal{S}}\times{\mathcal{A}}\to\Delta_{\mathcal{S}} is the transition kernel such that \eta(o|s)=\mathbb{P}(o_{t}=o|s_{t}=s) and \theta(s^{\prime}|s,a)=\mathbb{P}(s_{t+1}=s^{\prime}|s_{t}=s,a_{t}=a), where o_{t}\in O, s_{t}\in{\mathcal{S}} and a_{t}\in{\mathcal{A}} are the observation, state and action at time t=1,2,3,\cdots. Here, for a finite set {\mathcal{X}}, \Delta_{\mathcal{X}} is the set of all probability distributions on {\mathcal{X}}. We assume that the state space, the action space and the observation set are finite with sizes |{\mathcal{S}}|,|{\mathcal{A}}|,|O|, respectively.
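To fix notation, the following minimal Python sketch (ours, not part of the paper) instantiates a toy tabular POMDP with the objects defined above; the array layout theta[s, a, s'], eta[s, o] and C[s, a] is an illustrative assumption.

import numpy as np

# A toy tabular POMDP: theta[s, a, :] is a distribution over next states,
# eta[s, :] a distribution over observations, C[s, a] a cost in [0, 1].
rng = np.random.default_rng(0)
nS, nA, nO = 3, 2, 2

theta = rng.dirichlet(np.ones(nS), size=(nS, nA))   # transition kernel theta(.|s, a)
eta = rng.dirichlet(np.ones(nO), size=nS)           # known observation kernel eta(.|s)
C = rng.uniform(0.0, 1.0, size=(nS, nA))            # known cost function C(s, a)

def step(s, a):
    """Sample s_{t+1} ~ theta(.|s, a) and o_{t+1} ~ eta(.|s_{t+1}); the cost stays hidden."""
    s_next = rng.choice(nS, p=theta[s, a])
    o_next = rng.choice(nO, p=eta[s_next])
    return s_next, o_next, C[s, a]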

Let {\mathcal{F}}_{t} be the information available at time t (prior to action a_{t}), i.e., the sigma algebra generated by the history of actions and observations a_{1},o_{1},\cdots,a_{t-1},o_{t-1},o_{t}, and let {\mathcal{F}}_{t+} be the information after choosing action a_{t}. Unlike MDPs, the state is not observable by the agent and the optimal policy cannot be a function of the state. Instead, the agent maintains a belief h_{t}(\cdot;\theta)\in\Delta_{\mathcal{S}} given by h_{t}(s;\theta):=\mathbb{P}(s_{t}=s|{\mathcal{F}}_{t};\theta), as a sufficient statistic for the history of observations and actions. Here we use the notation h_{t}(\cdot;\theta) to explicitly show the dependency of the belief on \theta. After taking action a_{t} and observing o_{t+1}, the belief h_{t} can be updated as

h_{t+1}(s^{\prime};\theta)=\frac{\sum_{s}\eta(o_{t+1}|s^{\prime})\theta(s^{\prime}|s,a_{t})h_{t}(s;\theta)}{\sum_{s^{\prime}}\sum_{s}\eta(o_{t+1}|s^{\prime})\theta(s^{\prime}|s,a_{t})h_{t}(s;\theta)}. (1)

This update rule is compactly denoted by h_{t+1}(\cdot;\theta)=\tau(h_{t}(\cdot;\theta),a_{t},o_{t+1};\theta), with the initial condition

h_{1}(s;\theta)=\frac{\eta(o_{1}|s)h(s;\theta)}{\sum_{s}\eta(o_{1}|s)h(s;\theta)},

where h(\cdot;\theta) is the distribution of the initial state s_{1} (denoted by s_{1}\sim h). A deterministic stationary policy \pi:\Delta_{\mathcal{S}}\to{\mathcal{A}} maps a belief to an action. The long-term average cost of a policy \pi can be defined as

J_{\pi}(h;\theta):=\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\Big[C\Big(s_{t},\pi\big(h_{t}(\cdot;\theta)\big)\Big)\Big]. (2)

Let J(h,\theta):=\inf_{\pi}J_{\pi}(h,\theta) be the optimal long-term average cost, which in general may depend on the initial state distribution h. We will assume that it is independent of the initial distribution h (and thus denote it by J(\theta)) and that the following Bellman equation holds:

Assumption 1 (Bellman optimality equation).

There exist J(\theta)\in\mathbb{R} and a bounded function v(\cdot;\theta):\Delta_{\mathcal{S}}\to\mathbb{R} such that for all b\in\Delta_{\mathcal{S}}

J(\theta)+v(b;\theta)=\min_{a\in{\mathcal{A}}}\{c(b,a)+\sum_{o\in O}P(o|b,a;\theta)v(b^{\prime};\theta)\}, (3)

where v is called the relative value function, b^{\prime}=\tau(b,a,o;\theta) is the updated belief, c(b,a):=\sum_{s}C(s,a)b(s) is the expected cost, and P(o|b,a;\theta) is the probability of observing o in the next step, conditioned on the current belief b and action a, i.e.,

P(o|b,a;\theta)=\sum_{s^{\prime}\in{\mathcal{S}}}\sum_{s\in{\mathcal{S}}}\eta(o|s^{\prime})\theta(s^{\prime}|s,a)b(s). (4)

Various conditions are known under which Assumption 1 holds, e.g., when the MDP is weakly communicating (Bertsekas, 2017). Note that if Assumption 1 holds, the policy \pi^{*} that minimizes the right hand side of (3) is the optimal policy. More precisely,

Lemma 1.

Suppose Assumption 1 holds. Then, the policy \pi^{*}(\cdot,\theta):\Delta_{\mathcal{S}}\to{\mathcal{A}} given by

\pi^{*}(b;\theta):=\operatorname*{argmin}_{a\in{\mathcal{A}}}\{c(b,a)+\sum_{o\in O}P(o|b,a;\theta)v(b^{\prime};\theta)\} (5)

is the optimal policy with J_{\pi^{*}}(h;\theta)=J(\theta) for all h\in\Delta_{\mathcal{S}}.
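As a concrete illustration of (5), the following Python sketch (ours, not from the paper) computes the greedy action for a tabular model; the array shapes theta[s, a, s'], eta[s, o], C[s, a] and the value-function callable v are assumptions made for illustration.

import numpy as np

def belief_update(b, a, o, theta, eta):
    """Bayes update tau(b, a, o; theta) from equation (1)."""
    joint = eta[:, o] * (theta[:, a, :].T @ b)   # unnormalized posterior over s'
    return joint / joint.sum()

def greedy_action(b, theta, eta, C, v):
    """argmin_a { c(b, a) + sum_o P(o|b, a; theta) v(tau(b, a, o; theta)) }, as in (5)."""
    nS, nA, _ = theta.shape
    nO = eta.shape[1]
    best_a, best_q = 0, np.inf
    for a in range(nA):
        q = C[:, a] @ b                               # expected cost c(b, a)
        for o in range(nO):
            p_o = eta[:, o] @ (theta[:, a, :].T @ b)  # P(o | b, a; theta), equation (4)
            if p_o > 0:
                q += p_o * v(belief_update(b, a, o, theta, eta))
        if q < best_q:
            best_a, best_q = a, q
    return best_a

Obtaining the relative value function v for a given model is the computationally hard planning step and is not shown here.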

Note that if v satisfies the Bellman equation, so does v plus any constant. Therefore, without loss of generality, and since v is bounded, we can assume that \inf_{b\in\Delta_{\mathcal{S}}}v(b;\theta)=0 and define the span of a POMDP as \operatorname*{sp}(\theta):=\sup_{b\in\Delta_{\mathcal{S}}}v(b;\theta). Let \Theta_{H} be the class of POMDPs that satisfy Assumption 1 and have \operatorname*{sp}(\theta)\leq H for all \theta\in\Theta_{H}. In Section 4, we consider a finite subset \Theta\subseteq\Theta_{H} of POMDPs. In Section 5, the general class \Theta=\Theta_{H} is considered.

The learning protocol.

We consider the problem of an agent interacting with an unknown randomly generated POMDP \theta_{*}, where \theta_{*}\in\Theta is randomly generated according to the probability distribution f(\cdot) (in Section 4, f(\cdot) should be viewed as a probability mass function). After the initial generation of \theta_{*}, it remains fixed but unknown to the agent. The agent interacts with the POMDP \theta_{*} for T steps. Initially, the agent starts from state s_{1} that is randomly generated according to the conditional probability mass function h(\cdot;\theta_{*}). At time t=1,2,3,\cdots,T, the agent observes o_{t}\sim\eta(\cdot|s_{t}), takes action a_{t} and suffers a cost of C(s_{t},a_{t}). The environment then determines the next state s_{t+1}, which is randomly drawn from the probability distribution \theta_{*}(\cdot|s_{t},a_{t}). Note that although the cost function C is assumed to be known, the agent cannot observe the value of C(s_{t},a_{t}) since the state s_{t} is unknown to the agent. The goal of the agent is to minimize the expected cumulative regret defined as

R_{T}:=\mathbb{E}_{\theta_{*}}\Big[\sum_{t=1}^{T}\Big[C(s_{t},a_{t})-J(\theta_{*})\Big]\Big], (6)

where the expectation is with respect to the prior distribution h(\cdot;\theta_{*}) for s_{1}, the randomness in the state transitions, and the randomness in the algorithm. Here, \mathbb{E}_{\theta_{*}}[\cdot] is a shorthand for \mathbb{E}[\cdot|\theta_{*}]. In Section 4, a regret bound is provided on R_{T}; Section 5, however, considers \mathbb{E}[R_{T}] (also called Bayesian regret) as the performance measure for the learning algorithm. We note that the Bayesian regret is widely considered in the MDP literature (Osband et al., 2013; Gopalan and Mannor, 2015; Ouyang et al., 2017b, a).

3 The PSRL-POMDP Algorithm

We propose a general Posterior Sampling Reinforcement Learning for POMDPs (PSRL-POMDP) algorithm (Algorithm 1) for both the finite-parameter and the general case. The algorithm maintains a joint distribution on the unknown parameter \theta_{*} as well as the state s_{t}. PSRL-POMDP takes the prior distributions h and f as input. At time t, the agent computes the posterior distribution f_{t}(\cdot) on the unknown parameter \theta_{*} as well as the posterior conditional probability mass function (pmf) h_{t}(\cdot;\theta) on the state s_{t} for \theta\in\Theta. Upon taking action a_{t} and observing o_{t+1}, the posterior distribution at time t+1 can be updated by applying Bayes' rule as (when the parameter set is finite, \int_{\theta} should be replaced with \sum_{\theta})

f_{t+1}(\theta)=\frac{\sum_{s,s^{\prime}}\eta(o_{t+1}|s^{\prime})\theta(s^{\prime}|s,a_{t})h_{t}(s;\theta)f_{t}(\theta)}{\int_{\theta}\sum_{s,s^{\prime}}\eta(o_{t+1}|s^{\prime})\theta(s^{\prime}|s,a_{t})h_{t}(s;\theta)f_{t}(\theta)d\theta},
h_{t+1}(\cdot;\theta)=\tau(h_{t}(\cdot;\theta),a_{t},o_{t+1};\theta), (7)

with the initial condition

f_{1}(\theta)=\frac{\sum_{s}\eta(o_{1}|s)h(s;\theta)f(\theta)}{\int_{\theta}\sum_{s}\eta(o_{1}|s)h(s;\theta)f(\theta)d\theta},
h_{1}(s;\theta)=\frac{\eta(o_{1}|s)h(s;\theta)}{\sum_{s}\eta(o_{1}|s)h(s;\theta)}. (8)

Recall that \tau(h_{t}(\cdot;\theta),a_{t},o_{t+1};\theta) is a compact notation for (1). In the special case of perfect observation at time t, h_{t}(s;\theta)=\mathbbm{1}(s_{t}=s) for all \theta\in\Theta and s\in{\mathcal{S}}. Moreover, the update rule of f_{t+1} reduces to that of fully observable MDPs (see Eq. (4) of Ouyang et al. (2017b)) in the special case of perfect observation at times t and t+1.
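For intuition, the following Python sketch (ours) implements the joint update (7) for a finite parameter set, where the integral over \theta becomes a sum; thetas, f and H are illustrative names for the candidate kernels, the posterior over \Theta, and the conditional beliefs h_{t}(\cdot;\theta_{i}), and the observation is assumed to have positive probability under every candidate.

import numpy as np

def posterior_update(f, H, thetas, eta, a, o_next):
    """One step of (7): update f_t -> f_{t+1} and each h_t(.; theta_i) -> h_{t+1}(.; theta_i)."""
    new_f = np.empty_like(f)
    new_H = []
    for i, theta in enumerate(thetas):
        # joint[s'] = eta(o_{t+1}|s') * sum_s theta(s'|s, a) h_t(s; theta_i)
        joint = eta[:, o_next] * (theta[:, a, :].T @ H[i])
        new_f[i] = f[i] * joint.sum()         # prior times likelihood of o_{t+1}
        new_H.append(joint / joint.sum())     # belief update tau of equation (1)
    return new_f / new_f.sum(), new_H         # normalize over the finite Theta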

Let n_{t}(s,a)=\sum_{\tau=1}^{t-1}\mathbbm{1}(s_{\tau}=s,a_{\tau}=a) be the number of visits to the state-action pair (s,a) by time t. The number of visits n_{t} plays an important role in learning for MDPs (Jaksch et al., 2010; Ouyang et al., 2017b) and is one of the two criteria used to determine the length of the episodes in the TSDE algorithm for MDPs (Ouyang et al., 2017b). However, in POMDPs, n_{t} is not {\mathcal{F}}_{(t-1)+}-measurable since the states are not observable. Instead, let \tilde{n}_{t}(s,a):=\mathbb{E}[n_{t}(s,a)|{\mathcal{F}}_{(t-1)+}], and define the pseudo-count \tilde{m}_{t} as follows:

Definition 1.

(\tilde{m}_{t})_{t=1}^{T} is a pseudo-count if it is a non-decreasing, integer-valued sequence of random variables such that \tilde{m}_{t} is {\mathcal{F}}_{(t-1)+}-measurable, \tilde{m}_{t}(s,a)\geq\lceil\tilde{n}_{t}(s,a)\rceil, and \tilde{m}_{t}(s,a)\leq t for all t\leq T+1.

An example of such a sequence is simply \tilde{m}_{t}(s,a)=t for all state-action pairs (s,a)\in{\mathcal{S}}\times{\mathcal{A}}; this is used in Section 4. Another example is \tilde{m}_{t}(s,a):=\max\{\tilde{m}_{t-1}(s,a),\lceil\tilde{n}_{t}(s,a)\rceil\} with \tilde{m}_{0}(s,a)=0 for all (s,a)\in{\mathcal{S}}\times{\mathcal{A}}, which is used in Section 5. Here \lceil\tilde{n}_{t}(s,a)\rceil is the smallest integer that is greater than or equal to \tilde{n}_{t}(s,a). By definition, \tilde{m}_{t} is integer-valued and non-decreasing, which is essential to bound the number of episodes in the algorithm for the general case (see Lemma 5).
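The second construction can be written in one line; the sketch below (ours) assumes the conditional expectations \tilde{n}_{t}(s,a) are supplied by the posterior machinery and only records the recursion from Definition 1.

import numpy as np

def update_pseudo_count(m_prev, n_tilde):
    """m_tilde_t = max{ m_tilde_{t-1}, ceil(n_tilde_t) } elementwise over (s, a).
    The cap m_tilde_t <= t from Definition 1 holds automatically since n_tilde_t <= t - 1."""
    return np.maximum(m_prev, np.ceil(n_tilde))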

Algorithm 1 PSRL-POMDP
Input: prior distributions f(\cdot), h(\cdot;\cdot)
Initialization: t\leftarrow 1, t_{1}\leftarrow 0
Observe o_{1} and compute f_{1},h_{1} according to (8)
1:  for episodes k=1,2,\cdots do
2:     T_{k-1}\leftarrow t-t_{k}
3:     t_{k}\leftarrow t
4:     Generate \theta_{k}\sim f_{t_{k}}(\cdot) and compute \pi_{k}(\cdot)=\pi^{*}(\cdot;\theta_{k}) from (5)
5:     while t\leq\texttt{SCHED}(t_{k},T_{k-1}) and \tilde{m}_{t}(s,a)\leq 2\tilde{m}_{t_{k}}(s,a) for all (s,a)\in{\mathcal{S}}\times{\mathcal{A}} do
6:        Choose action a_{t}=\pi_{k}(h_{t}(\cdot;\theta_{k})) and observe o_{t+1}
7:        Update f_{t+1},h_{t+1} according to (7)
8:        t\leftarrow t+1
9:     end while
10:  end for

Similar to the TSDE algorithm for fully observable MDPs, the PSRL-POMDP algorithm proceeds in episodes. At the beginning of episode k, POMDP \theta_{k} is sampled from the posterior distribution f_{t_{k}}, where t_{k} denotes the start time of episode k. The optimal policy \pi^{*}(\cdot;\theta_{k}) is then computed and used during the episode. Note that the input of the policy is h_{t}(\cdot;\theta_{k}). The intuition behind such a choice (as opposed to the belief b_{t}(\cdot):=\int_{\theta}h_{t}(\cdot;\theta)f_{t}(\theta)d\theta) is that during episode k, the agent treats \theta_{k} as the true POMDP and adopts the optimal policy with respect to it. Consequently, the input to the policy should also be the conditional belief with respect to the sampled \theta_{k}.

A key factor in designing posterior sampling based algorithms is the design of episodes. Let T_{k} denote the length of episode k. In PSRL-POMDP, a new episode starts if either t>\texttt{SCHED}(t_{k},T_{k-1}) or \tilde{m}_{t}(s,a)>2\tilde{m}_{t_{k}}(s,a) for some (s,a). In the finite-parameter case (Section 4), we consider \texttt{SCHED}(t_{k},T_{k-1})=2t_{k} and \tilde{m}_{t}(s,a)=t. With these choices, the two criteria coincide and ensure that the start time and the length of the episodes are deterministic. In Section 5, we use \texttt{SCHED}(t_{k},T_{k-1})=t_{k}+T_{k-1} and \tilde{m}_{t}(s,a):=\max\{\tilde{m}_{t-1}(s,a),\lceil\tilde{n}_{t}(s,a)\rceil\}. This guarantees that T_{k}\leq T_{k-1}+1 and \tilde{m}_{t}(s,a)\leq 2\tilde{m}_{t_{k}}(s,a). These criteria were previously introduced in the TSDE algorithm (Ouyang et al., 2017b), except that TSDE uses the true count n_{t} rather than \tilde{m}_{t}.
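The episode logic of Algorithm 1 in the general case can be summarized as follows; this Python sketch (ours) abstracts sampling, planning and the posterior updates behind caller-supplied helpers, which are assumptions of the illustration rather than part of the algorithm's specification.

def run_psrl_pomdp(T, sample_posterior, plan, step, m_tilde):
    """Episode schedule with SCHED(t_k, T_{k-1}) = t_k + T_{k-1} and pseudo-count doubling.

    m_tilde(t) returns a dict {(s, a): pseudo-count at time t}; step(pi_k, t) takes the
    action pi_k(h_t(.; theta_k)), observes o_{t+1}, and updates the posteriors via (7).
    """
    t, prev_len = 1, 1                                # T_0 = 1, as in the initialization
    while t <= T:
        t_k, T_prev = t, prev_len
        theta_k = sample_posterior(t_k)               # theta_k ~ f_{t_k}(.)
        pi_k = plan(theta_k)                          # pi_k = pi*(.; theta_k) from (5)
        m_k = dict(m_tilde(t_k))                      # snapshot of m_tilde_{t_k}
        while (t <= T and t <= t_k + T_prev
               and all(m_tilde(t)[sa] <= 2 * m_k[sa] for sa in m_k)):
            step(pi_k, t)
            t += 1
        prev_len = t - t_k                            # episode length T_k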

4 Finite-Parameter Case (|\Theta|<\infty)

In this section, we consider \Theta\subseteq\Theta_{H} such that |\Theta|<\infty. When \Theta is finite, the posterior distribution concentrates on the true parameter exponentially fast if the transition kernels are separated enough (see Lemma 1). This allows us to achieve a regret bound of \mathcal{O}(H\log T). Let o_{1:t},a_{1:t} be shorthand for the history of observations o_{1},\cdots,o_{t} and the history of actions a_{1},\cdots,a_{t}, respectively. Let \nu_{\theta}^{o_{1:t},a_{1:t}}(o) be the probability of observing o at time t+1 if the action history is a_{1:t}, the observation history is o_{1:t}, and the underlying transition kernel is \theta, i.e.,

\nu_{\theta}^{o_{1:t},a_{1:t}}(o):=\mathbb{P}(o_{t+1}=o|o_{1:t},a_{1:t},\theta_{*}=\theta).

The distance between \nu_{\theta}^{o_{1:t},a_{1:t}} and \nu_{\gamma}^{o_{1:t},a_{1:t}} is measured by the Kullback-Leibler (KL) divergence, defined as follows. For any \theta,\gamma\in\Theta, the KL divergence between the probability distributions \nu_{\theta}^{o_{1:t},a_{1:t}} and \nu_{\gamma}^{o_{1:t},a_{1:t}}, denoted by {\mathcal{K}}(\nu_{\theta}^{o_{1:t},a_{1:t}}\|\nu_{\gamma}^{o_{1:t},a_{1:t}}), is given by

{\mathcal{K}}(\nu_{\theta}^{o_{1:t},a_{1:t}}\|\nu_{\gamma}^{o_{1:t},a_{1:t}}):=\sum_{o}\nu_{\theta}^{o_{1:t},a_{1:t}}(o)\log\frac{\nu_{\theta}^{o_{1:t},a_{1:t}}(o)}{\nu_{\gamma}^{o_{1:t},a_{1:t}}(o)}.

It can be shown that {\mathcal{K}}(\nu_{\theta}^{o_{1:t},a_{1:t}}\|\nu_{\gamma}^{o_{1:t},a_{1:t}})\geq 0 and that equality holds if and only if \nu_{\theta}^{o_{1:t},a_{1:t}}=\nu_{\gamma}^{o_{1:t},a_{1:t}}. Thus, the KL divergence can be thought of as a measure of the divergence of \nu_{\gamma}^{o_{1:t},a_{1:t}} from \nu_{\theta}^{o_{1:t},a_{1:t}}. In this section, we need to assume that the transition kernels in \Theta are distant enough in the following sense.
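For concreteness, the KL divergence above can be computed with a few lines of Python; this helper (ours) treats the two predictive observation distributions as arrays over O and, by convention, drops terms with \nu_{\theta}^{o_{1:t},a_{1:t}}(o)=0.

import numpy as np

def kl_divergence(nu_theta, nu_gamma):
    """KL(nu_theta || nu_gamma) for two distributions over the finite observation set O."""
    mask = nu_theta > 0                  # 0 * log(0 / q) is taken to be 0
    return float(np.sum(nu_theta[mask] * np.log(nu_theta[mask] / nu_gamma[mask])))

# e.g. kl_divergence(np.array([0.6, 0.4]), np.array([0.5, 0.5])) is roughly 0.02.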

Assumption 2.

For any time step t, any history of observations o_{1:t} and actions a_{1:t}, and any two distinct transition kernels \theta,\gamma\in\Theta, there exists a constant \epsilon>0 such that {\mathcal{K}}(\nu_{\theta}^{o_{1:t},a_{1:t}}\|\nu_{\gamma}^{o_{1:t},a_{1:t}})\geq\epsilon.

This assumption is similar to the one used in Kim (2017).

Theorem 1.

Suppose Assumptions 1 and 2 hold. Then, the regret of Algorithm 1 with \texttt{SCHED}(t_{k},T_{k-1})=2t_{k} and \tilde{m}_{t}(s,a)=t for all state-action pairs (s,a) is bounded as

R_{T}\leq H\log T+\frac{4(H+1)}{(e^{-\beta}-1)^{2}},

where \beta>0 is a universal constant defined in Lemma 1.

Observe that with \texttt{SCHED}(t_{k},T_{k-1})=2t_{k} and \tilde{m}_{t}(s,a)=t, the two stopping criteria in Algorithm 1 coincide and ensure that T_{k}=2T_{k-1} with T_{0}=1. In other words, the length of the episodes grows exponentially as T_{k}=2^{k}.
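For concreteness, the schedule can be written explicitly (our own short calculation, recording the identities used in the proof below):

T_{0}=1,\qquad T_{k}=2T_{k-1}=2^{k},\qquad t_{k}=1+\sum_{j=1}^{k-1}T_{j}=2^{k}-1,

so the last episode started by time T satisfies 2^{K_{T}}-1\leq T, i.e., K_{T}\leq\log_{2}(T+1)=\mathcal{O}(\log T), and \exp(-\beta t_{k})=\exp(-\beta(2^{k}-1)), which is exactly the form that appears in the series below.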

4.1 Proof of Theorem 1

In this section, the proof of Theorem 1 is provided. A key factor in achieving the \mathcal{O}(H\log T) regret bound in the case of finitely many parameters is that the posterior distribution f_{t}(\cdot) concentrates on the true \theta_{*} exponentially fast.

Lemma 1.

Suppose Assumption 2 holds. Then, there exist constants \alpha>1 and \beta>0 such that

\mathbb{E}[1-f_{t}(\theta_{*})|\theta_{*}]\leq\alpha\exp(-\beta t).

Equipped with this lemma, we are now ready to prove Theorem 1.

Proof.

Note that the regret R_{T} can be decomposed as R_{T}=H\mathbb{E}_{\theta_{*}}[K_{T}]+R_{1}+R_{2}+R_{3}, where

R_{1}:=\mathbb{E}_{\theta_{*}}\left[\sum_{k=1}^{K_{T}}T_{k}\Big[J(\theta_{k})-J(\theta_{*})\Big]\right],
R_{2}:=H\mathbb{E}_{\theta_{*}}\left[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\left[\sum_{s^{\prime}}\left|\theta_{*}(s^{\prime}|s_{t},a_{t})-\theta_{k}(s^{\prime}|s_{t},a_{t})\right|+\sum_{s}\left|h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k})\right|\right]\right],
R_{3}:=\mathbb{E}_{\theta_{*}}\left[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\Big[c(h_{t}(\cdot;\theta_{*}),a_{t})-c(h_{t}(\cdot;\theta_{k}),a_{t})\Big]\right].

Note that the start time and length of the episodes in Algorithm 1 are deterministic with the choice of SCHED and \tilde{m}_{t} in the statement of the theorem, i.e., t_{k}, T_{k} and hence K_{T} are deterministic. Moreover, if \theta_{k}=\theta_{*}, then R_{1}=R_{2}=R_{3}=0. We also have that J(\theta_{k})-J(\theta_{*})\leq 1, \sum_{s^{\prime}}\left|\theta_{*}(s^{\prime}|s_{t},a_{t})-\theta_{k}(s^{\prime}|s_{t},a_{t})\right|\leq 2, \sum_{s}\left|h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k})\right|\leq 2, and c(h_{t}(\cdot;\theta_{*}),a_{t})-c(h_{t}(\cdot;\theta_{k}),a_{t})\leq 1. Therefore,

R_{1}\leq\mathbb{E}_{\theta_{*}}\left[\sum_{k=1}^{K_{T}}T_{k}\mathbbm{1}(\theta_{k}\neq\theta_{*})\right]=\sum_{k=1}^{K_{T}}T_{k}\mathbb{P}_{\theta_{*}}(\theta_{k}\neq\theta_{*}),
R_{2}\leq 4H\mathbb{E}_{\theta_{*}}\left[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbbm{1}(\theta_{k}\neq\theta_{*})\right]=4H\sum_{k=1}^{K_{T}}T_{k}\mathbb{P}_{\theta_{*}}(\theta_{k}\neq\theta_{*}),
R_{3}\leq\mathbb{E}_{\theta_{*}}\left[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbbm{1}(\theta_{k}\neq\theta_{*})\right]=\sum_{k=1}^{K_{T}}T_{k}\mathbb{P}_{\theta_{*}}(\theta_{k}\neq\theta_{*}).

Note that \mathbb{P}_{\theta_{*}}(\theta_{k}\neq\theta_{*})=\mathbb{E}_{\theta_{*}}[1-f_{t_{k}}(\theta_{*})]\leq\alpha\exp(-\beta t_{k}) by Lemma 1. Combining all these bounds, we can write

R_{T}\leq HK_{T}+(4H+2)\alpha\sum_{k=1}^{K_{T}}T_{k}\exp(-\beta t_{k}).

With the episode schedule provided in the statement of the theorem, it is easy to check that K_{T}=\mathcal{O}(\log T). Let n=2^{K_{T}} and write

\sum_{k=1}^{K_{T}}T_{k}\exp(-\beta t_{k})=\sum_{k=1}^{K_{T}}2^{k}e^{-\beta(2^{k}-1)}\leq\sum_{j=2}^{n}je^{-\beta(j-1)}=\frac{d}{dx}\frac{x^{n+1}-1}{x-1}\Big|_{x=e^{-\beta}}-1.

The last equality is by the geometric series. Simplifying the derivative yields

\frac{d}{dx}\frac{x^{n+1}-1}{x-1}\Big|_{x=e^{-\beta}}=\frac{nx^{n+1}-(n+1)x^{n}+1}{(x-1)^{2}}\Big|_{x=e^{-\beta}}\leq\frac{nx^{n}-(n+1)x^{n}+1}{(x-1)^{2}}\Big|_{x=e^{-\beta}}
=\frac{-x^{n}+1}{(x-1)^{2}}\Big|_{x=e^{-\beta}}\leq\frac{2}{(e^{-\beta}-1)^{2}}.

Substituting these values implies R_{T}\leq H\log T+\frac{4(H+1)}{(e^{-\beta}-1)^{2}}. ∎
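As a quick sanity check of the series bound used above, the following Python snippet (ours) compares \sum_{k}2^{k}e^{-\beta(2^{k}-1)} with 2/(e^{-\beta}-1)^{2} for a sample value of \beta.

import math

def check_series_bound(beta, K=30):
    series = sum(2**k * math.exp(-beta * (2**k - 1)) for k in range(1, K + 1))
    bound = 2.0 / (math.exp(-beta) - 1.0) ** 2
    return series, bound

print(check_series_bound(0.5))   # roughly (2.36, 12.92): the bound holds with room to spare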

5 General Case (|\Theta|=\infty)

We now consider the general case, where the parameter set is infinite, and in particular, \Theta=\Theta_{H}, an uncountable set. We make the following two technical assumptions on the belief and the transition kernel.

Assumption 3.

Denote by k(t) the episode at time t. The true conditional belief h_{t}(\cdot;\theta_{*}) and the approximate conditional belief h_{t}(\cdot;\theta_{k(t)}) satisfy

\mathbb{E}\Big[\sum_{s}\big|h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k(t)})\big|\Big]\leq\frac{K_{1}(|{\mathcal{S}}|,|{\mathcal{A}}|,|O|,\iota)}{\sqrt{t_{k(t)}}}, (9)

with probability at least 1-\delta, for any \delta\in(0,1). Here K_{1}(|{\mathcal{S}}|,|{\mathcal{A}}|,|O|,\iota) is a constant that is polynomial in its input parameters and \iota hides the logarithmic dependency on |{\mathcal{S}}|,|{\mathcal{A}}|,|O|,T,\delta.

Assumption 3 states that the gap between the conditional posterior function for the sampled POMDP \theta_{k} and that for the true POMDP \theta_{*} decreases over episodes as a better approximation of the true POMDP becomes available. There has been recent work on the computation of approximate information states as required in Assumption 3 (Subramanian et al., 2020).

Assumption 4.

There exists an {\mathcal{F}}_{t}-measurable estimator \hat{\theta}_{t}:{\mathcal{S}}\times{\mathcal{A}}\to\Delta_{\mathcal{S}} such that

\sum_{s^{\prime}}|\theta_{*}(s^{\prime}|s,a)-\hat{\theta}_{t}(s^{\prime}|s,a)|\leq\frac{K_{2}(|{\mathcal{S}}|,|{\mathcal{A}}|,|O|,\iota)}{\sqrt{\max\{1,\tilde{m}_{t}(s,a)\}}} (10)

with probability at least 1-\delta, for any \delta\in(0,1), uniformly for all t=1,2,3,\cdots,T, where K_{2}(|{\mathcal{S}}|,|{\mathcal{A}}|,|O|,\iota) is a constant that is polynomial in its input parameters and \iota hides the logarithmic dependency on |{\mathcal{S}}|,|{\mathcal{A}}|,|O|,T,\delta.

There has been extensive work on estimation of transition dynamics of MDPs, e.g., (Grunewalder et al., 2012). Two examples where Assumptions 3 and 4 hold are:

  • Perfect observation. In the case of perfect observation, where h_{t}(s;\theta)=\mathbbm{1}(s_{t}=s), Assumption 3 is clearly satisfied. Moreover, with perfect observation, one can choose \tilde{m}_{t}(s,a)=n_{t}(s,a) and select \hat{\theta}_{t}(s^{\prime}|s,a)=\frac{n_{t}(s,a,s^{\prime})}{n_{t}(s,a)} to satisfy Assumption 4 (Jaksch et al., 2010; Ouyang et al., 2017b); a short code sketch of this empirical estimator is given after this list. Here n_{t}(s,a,s^{\prime}) denotes the number of visits to (s,a) such that the next state is s^{\prime} before time t.

  • Finite-parameter case. In the finite-parameter case with the choice of \tilde{m}_{t}(s,a)=t for all state-action pairs (s,a) and \texttt{SCHED}(t_{k},T_{k-1})=t_{k}+T_{k-1} or \texttt{SCHED}(t_{k},T_{k-1})=2t_{k}, both of the assumptions are satisfied (see Lemma 1 for details). Note that in this case a more refined analysis is performed in Section 4 to achieve the \mathcal{O}(H\log T) regret bound.
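As referenced in the first bullet, a minimal implementation of the empirical estimator under perfect observation could look as follows (our own sketch; the uniform default for unvisited pairs is an arbitrary choice, not part of the paper):

import numpy as np

class EmpiricalTransitionEstimator:
    """Maintains n_t(s, a, s') and returns theta_hat_t(s'|s, a) = n_t(s, a, s') / n_t(s, a)."""

    def __init__(self, nS, nA):
        self.counts = np.zeros((nS, nA, nS))

    def update(self, s, a, s_next):
        self.counts[s, a, s_next] += 1

    def theta_hat(self):
        n_sa = self.counts.sum(axis=2, keepdims=True)           # n_t(s, a)
        nS = self.counts.shape[2]
        uniform = np.full_like(self.counts, 1.0 / nS)
        return np.where(n_sa > 0, self.counts / np.maximum(n_sa, 1), uniform)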

Now, we state the main result of this section.

Theorem 2.

Under Assumptions 1, 3 and 4, running the PSRL-POMDP algorithm with \texttt{SCHED}(t_{k},T_{k-1})=t_{k}+T_{k-1} yields \mathbb{E}[R_{T}]\leq\tilde{\mathcal{O}}(HK_{2}(|{\mathcal{S}}||{\mathcal{A}}|T)^{2/3}), where K_{2}:=K_{2}(|{\mathcal{S}}|,|{\mathcal{A}}|,|O|,\iota) is the constant in Assumption 4.

The exact constants are known (see proof and Appendix C.1) though we have hidden the dependence above.

5.1 Proof Sketch of Theorem 2

We provide the proof sketch of Theorem 2 here. A key property of posterior sampling is that, conditioned on the information at time t, the sampled \theta_{t} and the true \theta_{*} have the same distribution (Osband et al., 2013; Russo and Van Roy, 2014). Since the episode start time t_{k} is a stopping time with respect to the filtration ({\mathcal{F}}_{t})_{t\geq 1}, we use a stopping-time version of this property:

Lemma 1 (Lemma 2 in Ouyang et al. (2017b)).

For any measurable function g and any {\mathcal{F}}_{t_{k}}-measurable random variable X, we have \mathbb{E}[g(\theta_{k},X)]=\mathbb{E}[g(\theta_{*},X)].

Introducing the pseudo-count \tilde{m}_{t}(s,a) in the algorithm requires a novel analysis to achieve a low regret bound. The following key lemma states that the pseudo-count \tilde{m}_{t} cannot be much smaller than the true count n_{t}.

Lemma 2.

Fix a state-action pair (s,a)\in{\mathcal{S}}\times{\mathcal{A}}. For any pseudo-count \tilde{m}_{t} and any \alpha\in[0,1],

\mathbb{P}\big(\tilde{m}_{t}(s,a)<\alpha n_{t}(s,a)\big)\leq\alpha. (11)
Proof.

We show that \mathbb{P}\big(\tilde{n}_{t}(s,a)<\alpha n_{t}(s,a)\big)\leq\alpha. Since by definition \tilde{m}_{t}(s,a)\geq\tilde{n}_{t}(s,a), the claim of the lemma then follows. For any \alpha\in[0,1],

\tilde{n}_{t}(s,a)\mathbbm{1}\big(\alpha n_{t}(s,a)>\tilde{n}_{t}(s,a)\big)\leq\alpha n_{t}(s,a). (12)

Taking the conditional expectation of both sides with respect to {\mathcal{F}}_{(t-1)+} and using the fact that \mathbb{E}[n_{t}(s,a)|{\mathcal{F}}_{(t-1)+}]=\tilde{n}_{t}(s,a), we have

\tilde{n}_{t}(s,a)\mathbb{E}\Big[\mathbbm{1}\big(\alpha n_{t}(s,a)>\tilde{n}_{t}(s,a)\big)\Big|{\mathcal{F}}_{(t-1)+}\Big]\leq\alpha\tilde{n}_{t}(s,a). (13)

We claim that

\mathbb{E}\Big[\mathbbm{1}\big(\alpha n_{t}(s,a)>\tilde{n}_{t}(s,a)\big)\Big|{\mathcal{F}}_{(t-1)+}\Big]\leq\alpha,\quad\text{a.s.} (14)

If this claim is true, taking another expectation of both sides completes the proof.

To prove the claim, let \Omega_{0},\Omega_{+} be the subsets of the sample space where \tilde{n}_{t}(s,a)=0 and \tilde{n}_{t}(s,a)>0, respectively. We consider these two cases separately: (a) on \Omega_{+}, one can divide both sides of (13) by \tilde{n}_{t}(s,a) and reach (14); (b) note that by definition \tilde{n}_{t}(s,a)=0 on \Omega_{0}. Thus, n_{t}(s,a)\mathbbm{1}(\Omega_{0})=0 almost surely (this is because \mathbb{E}[n_{t}(s,a)\mathbbm{1}(\Omega_{0})]=\mathbb{E}[\mathbb{E}[n_{t}(s,a)\mathbbm{1}(\Omega_{0})|{\mathcal{F}}_{(t-1)+}]]=\mathbb{E}[\tilde{n}_{t}(s,a)\mathbbm{1}(\Omega_{0})]=0). Therefore,

\mathbbm{1}(\Omega_{0})\mathbbm{1}\big(\alpha n_{t}(s,a)>\tilde{n}_{t}(s,a)\big)=0,\quad\text{a.s.,}

which implies

\mathbbm{1}(\Omega_{0})\mathbb{E}\Big[\mathbbm{1}\big(\alpha n_{t}(s,a)>\tilde{n}_{t}(s,a)\big)\Big|{\mathcal{F}}_{(t-1)+}\Big]=0,\quad\text{a.s.,}

which means that on \Omega_{0}, the left hand side of (14) is indeed zero almost surely, proving the claim. ∎

The parameter \alpha will be tuned later to balance two terms and achieve the \tilde{\mathcal{O}}(T^{2/3}) regret bound (see Lemma 4). We are now ready to provide the proof sketch of Theorem 2.

By Lemma 1, R_{T} can be decomposed as R_{T}=H\mathbb{E}_{\theta_{*}}[K_{T}]+R_{1}+R_{2}+R_{3}, where

R_{1}:=\mathbb{E}_{\theta_{*}}\left[\sum_{k=1}^{K_{T}}T_{k}\Big[J(\theta_{k})-J(\theta_{*})\Big]\right],
R_{2}:=H\mathbb{E}_{\theta_{*}}\left[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\left[\sum_{s^{\prime}}\left|\theta_{*}(s^{\prime}|s_{t},a_{t})-\theta_{k}(s^{\prime}|s_{t},a_{t})\right|+\sum_{s}\left|h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k})\right|\right]\right],
R_{3}:=\mathbb{E}_{\theta_{*}}\left[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\Big[c(h_{t}(\cdot;\theta_{*}),a_{t})-c(h_{t}(\cdot;\theta_{k}),a_{t})\Big]\right].

It follows from the first stopping criterion that T_{k}\leq T_{k-1}+1. Using this along with the property of posterior sampling (Lemma 1) proves that \mathbb{E}[R_{1}]\leq\mathbb{E}[K_{T}] (see Lemma 2 for details). \mathbb{E}[R_{3}] is bounded by K_{1}\mathbb{E}\big[\sum_{k=1}^{K_{T}}\frac{T_{k}}{\sqrt{t_{k}}}\big]+1, where K_{1}:=K_{1}(|{\mathcal{S}}|,|{\mathcal{A}}|,|O|,\iota) is the constant in Assumption 3 (see Lemma 3). To bound \mathbb{E}[R_{2}], we use Assumption 3 and follow the proof steps of Lemma 3 to conclude that

\mathbb{E}[R_{2}]\leq\bar{R}_{2}+HK_{1}\mathbb{E}\Big[\sum_{k=1}^{K_{T}}\frac{T_{k}}{\sqrt{t_{k}}}\Big]+1,

where

\bar{R}_{2}:=H\mathbb{E}\Big[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{s^{\prime}}\Big|\theta_{*}(s^{\prime}|s_{t},a_{t})-\theta_{k}(s^{\prime}|s_{t},a_{t})\Big|\Big].

\bar{R}_{2} is the dominating term in the final \tilde{\mathcal{O}}(T^{2/3}) regret bound and can be bounded by H+12HK_{2}(|{\mathcal{S}}||{\mathcal{A}}|T)^{2/3}, where K_{2}:=K_{2}(|{\mathcal{S}}|,|{\mathcal{A}}|,|O|,\iota) is the constant in Assumption 4. The detailed proof can be found in Lemma 4; here we sketch the main steps. By Assumption 4, one can show that

\bar{R}_{2}\leq\tilde{\mathcal{O}}\Big(\mathbb{E}\Big[\sum_{t=1}^{T}\frac{HK_{2}}{\sqrt{\max\{1,\tilde{m}_{t}(s_{t},a_{t})\}}}\Big]\Big).

Now, let E_{2} be the event that \tilde{m}_{t}(s,a)\geq\alpha n_{t}(s,a) for all s,a. Note that by Lemma 2 and the union bound, \mathbb{P}(E_{2}^{c})\leq|{\mathcal{S}}||{\mathcal{A}}|\alpha. Thus,

\bar{R}_{2}\leq\tilde{\mathcal{O}}\Big(\mathbb{E}\Big[\sum_{t=1}^{T}\frac{HK_{2}}{\sqrt{\max\{1,\tilde{m}_{t}(s_{t},a_{t})\}}}\big(\mathbbm{1}(E_{2})+\mathbbm{1}(E_{2}^{c})\big)\Big]\Big)
\leq\tilde{\mathcal{O}}\Big(H\mathbb{E}\Big[\sum_{t=1}^{T}\frac{K_{2}}{\sqrt{\alpha\max\{1,n_{t}(s_{t},a_{t})\}}}\Big]+HK_{2}|{\mathcal{S}}||{\mathcal{A}}|T\alpha\Big).

Algebraic manipulation of the inner summation yields \bar{R}_{2}\leq\tilde{\mathcal{O}}\Big(HK_{2}\sqrt{\frac{|{\mathcal{S}}||{\mathcal{A}}|T}{\alpha}}+HK_{2}|{\mathcal{S}}||{\mathcal{A}}|T\alpha\Big).
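For completeness, the two steps hidden in the last display can be spelled out (our own restatement of standard arguments): grouping the time steps by the state-action pair they visit gives

\sum_{t=1}^{T}\frac{1}{\sqrt{\max\{1,n_{t}(s_{t},a_{t})\}}}\leq\sum_{s,a}\sum_{j=1}^{n_{T+1}(s,a)}\frac{1}{\sqrt{\max\{1,j-1\}}}\leq 3\sum_{s,a}\sqrt{n_{T+1}(s,a)}\leq 3\sqrt{|{\mathcal{S}}||{\mathcal{A}}|T},

where the last step is Cauchy-Schwarz, and balancing HK_{2}\sqrt{|{\mathcal{S}}||{\mathcal{A}}|T/\alpha} against HK_{2}|{\mathcal{S}}||{\mathcal{A}}|T\alpha by choosing \alpha=(|{\mathcal{S}}||{\mathcal{A}}|T)^{-1/3} makes both terms of order HK_{2}(|{\mathcal{S}}||{\mathcal{A}}|T)^{2/3}.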

Optimizing over \alpha implies \bar{R}_{2}=\tilde{\mathcal{O}}(HK_{2}(|{\mathcal{S}}||{\mathcal{A}}|T)^{2/3}). Substituting the upper bounds for \mathbb{E}[R_{1}], \mathbb{E}[R_{2}] and \mathbb{E}[R_{3}], we get

\mathbb{E}[R_{T}]=H\mathbb{E}[K_{T}]+\mathbb{E}[R_{1}]+\mathbb{E}[R_{2}]+\mathbb{E}[R_{3}]
\leq(1+H)\mathbb{E}[K_{T}]+12HK_{2}(|{\mathcal{S}}||{\mathcal{A}}|T)^{2/3}+(H+1)K_{1}\mathbb{E}\Big[\sum_{k=1}^{K_{T}}\frac{T_{k}}{\sqrt{t_{k}}}\Big]+2+H.

From Lemma 5, we know that \mathbb{E}[K_{T}]=\tilde{\mathcal{O}}(\sqrt{|{\mathcal{S}}||{\mathcal{A}}|T}) and \sum_{k=1}^{K_{T}}\frac{T_{k}}{\sqrt{t_{k}}}=\tilde{\mathcal{O}}(|{\mathcal{S}}||{\mathcal{A}}|\sqrt{T}). Therefore, \mathbb{E}[R_{T}]\leq\tilde{\mathcal{O}}(HK_{2}(|{\mathcal{S}}||{\mathcal{A}}|T)^{2/3}).

6 Conclusions

In this paper, we have presented one of the first online reinforcement learning algorithms for POMDPs. Solving POMDPs is a hard problem. Designing an efficient learning algorithm that achieves sublinear regret is even harder. We show that the proposed PSRL-POMDP algorithm achieves a Bayesian regret bound of \mathcal{O}(\log T) when the parameter set is finite. When the parameter set may be uncountable, we showed an \tilde{\mathcal{O}}(T^{2/3}) regret bound under two technical assumptions on the belief-state approximation and the transition kernel estimation. There has been recent work on approximate belief-state computation, as well as on estimating the transition dynamics of continuous MDPs, and in future work we will try to incorporate such estimators. We also assume that the observation kernel is known; without it, designing online learning algorithms for POMDPs is very challenging. Posterior sampling-based algorithms are generally known to have superior numerical performance compared to OFU-based algorithms for bandits and MDPs. In future work, we will also conduct an experimental investigation of the proposed algorithm. An impediment is that available POMDP solvers mostly provide approximate solutions, which would lead to linear regret. In the future, we will also try to improve the regret bound for the general case to \tilde{\mathcal{O}}(\sqrt{T}).

References

  • Abbasi-Yadkori et al. [2019a] Yasin Abbasi-Yadkori, Peter Bartlett, Kush Bhatia, Nevena Lazic, Csaba Szepesvari, and Gellért Weisz. Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning, pages 3692–3702, 2019a.
  • Abbasi-Yadkori et al. [2019b] Yasin Abbasi-Yadkori, Nevena Lazic, Csaba Szepesvari, and Gellert Weisz. Exploration-enhanced politex. arXiv preprint arXiv:1908.10479, 2019b.
  • Ayoub et al. [2020] Alex Ayoub, Zeyu Jia, Csaba Szepesvari, Mengdi Wang, and Lin Yang. Model-based reinforcement learning with value-targeted regression. In International Conference on Machine Learning, pages 463–474. PMLR, 2020.
  • Azar et al. [2017] Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR. org, 2017.
  • Azizzadenesheli et al. [2017] Kamyar Azizzadenesheli, Alessandro Lazaric, and Animashree Anandkumar. Experimental results: Reinforcement learning of pomdps using spectral methods. arXiv preprint arXiv:1705.02553, 2017.
  • Azizzadenesheli et al. [2018] Kamyar Azizzadenesheli, Yisong Yue, and Animashree Anandkumar. Policy gradient in partially observable environments: Approximation and convergence. arXiv e-prints, pages arXiv–1810, 2018.
  • Bartlett and Tewari [2009] Peter L Bartlett and Ambuj Tewari. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.
  • Bertsekas [2017] Dimitri P Bertsekas. Dynamic programming and optimal control, vol i and ii, 4th edition. Belmont, MA: Athena Scientific, 2017.
  • Cai et al. [2009] Chenghui Cai, Xuejun Liao, and Lawrence Carin. Learning to explore and exploit in pomdps. Advances in Neural Information Processing Systems, 22:198–206, 2009.
  • Cohen et al. [2019] Alon Cohen, Tomer Koren, and Yishay Mansour. Learning linear-quadratic regulators efficiently with only t\sqrt{t} regret. In International Conference on Machine Learning, pages 1300–1309. PMLR, 2019.
  • Dean et al. [2018] Sarah Dean, Horia Mania, Nikolai Matni, Benjamin Recht, and Stephen Tu. Regret bounds for robust adaptive control of the linear quadratic regulator. arXiv preprint arXiv:1805.09388, 2018.
  • Dong et al. [2020] Kefan Dong, Jian Peng, Yining Wang, and Yuan Zhou. Root-n-regret for learning in markov decision processes with function approximation and low bellman rank. In Conference on Learning Theory, pages 1554–1557. PMLR, 2020.
  • Doshi-Velez et al. [2013] Finale Doshi-Velez, David Pfau, Frank Wood, and Nicholas Roy. Bayesian nonparametric methods for partially-observable reinforcement learning. IEEE transactions on pattern analysis and machine intelligence, 37(2):394–407, 2013.
  • Fonteneau et al. [2013] Raphaël Fonteneau, Nathan Korda, and Rémi Munos. An optimistic posterior sampling strategy for bayesian reinforcement learning. In NIPS 2013 Workshop on Bayesian Optimization (BayesOpt2013), 2013.
  • Fruit et al. [2018] Ronan Fruit, Matteo Pirotta, Alessandro Lazaric, and Ronald Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, pages 1573–1581, 2018.
  • Gopalan and Mannor [2015] Aditya Gopalan and Shie Mannor. Thompson sampling for learning parameterized markov decision processes. In Conference on Learning Theory, pages 861–898. PMLR, 2015.
  • Grunewalder et al. [2012] Steffen Grunewalder, Guy Lever, Luca Baldassarre, Massi Pontil, and Arthur Gretton. Modelling transition dynamics in mdps with rkhs embeddings. arXiv preprint arXiv:1206.4655, 2012.
  • Hao et al. [2020] Botao Hao, Nevena Lazic, Yasin Abbasi-Yadkori, Pooria Joulani, and Csaba Szepesvari. Provably efficient adaptive approximate policy iteration. arXiv preprint arXiv:2002.03069, 2020.
  • Jafarnia-Jahromi et al. [2021] Mehdi Jafarnia-Jahromi, Liyu Chen, Rahul Jain, and Haipeng Luo. Online learning for stochastic shortest path model via posterior sampling. arXiv preprint arXiv:2106.05335, 2021.
  • Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is Q-learning provably efficient? In Advances in Neural Information Processing Systems, pages 4863–4873, 2018.
  • Jin et al. [2020] Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
  • Katt et al. [2018] Sammie Katt, Frans Oliehoek, and Christopher Amato. Bayesian reinforcement learning in factored pomdps. arXiv preprint arXiv:1811.05612, 2018.
  • Kim [2017] Michael Jong Kim. Thompson sampling for stochastic control: The finite parameter case. IEEE Transactions on Automatic Control, 62(12):6415–6422, 2017.
  • Kumar and Varaiya [2015] Panqanamala Ramana Kumar and Pravin Varaiya. Stochastic systems: Estimation, identification, and adaptive control. SIAM Classic, 2015.
  • Lale et al. [2020a] Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Explore more and improve regret in linear quadratic regulators. arXiv preprint arXiv:2007.12291, 2020a.
  • Lale et al. [2020b] Sahin Lale, Kamyar Azizzadenesheli, Babak Hassibi, and Anima Anandkumar. Logarithmic regret bound in partially observable linear dynamical systems. arXiv preprint arXiv:2003.11227, 2020b.
  • Liu et al. [2011] Miao Liu, Xuejun Liao, and Lawrence Carin. The infinite regionalized policy representation. In ICML, 2011.
  • Liu et al. [2013] Miao Liu, Xuejun Liao, and Lawrence Carin. Online expectation maximization for reinforcement learning in pomdps. In IJCAI, pages 1501–1507, 2013.
  • Mania et al. [2019] Horia Mania, Stephen Tu, and Benjamin Recht. Certainty equivalence is efficient for linear quadratic control. arXiv preprint arXiv:1902.07826, 2019.
  • Osband and Van Roy [2014] Ian Osband and Benjamin Van Roy. Model-based reinforcement learning and the eluder dimension. arXiv preprint arXiv:1406.1853, 2014.
  • Osband et al. [2013] Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
  • Ouyang et al. [2017a] Yi Ouyang, Mukul Gagrani, and Rahul Jain. Learning-based control of unknown linear systems with thompson sampling. arXiv preprint arXiv:1709.04047, 2017a.
  • Ouyang et al. [2017b] Yi Ouyang, Mukul Gagrani, Ashutosh Nayyar, and Rahul Jain. Learning unknown markov decision processes: A thompson sampling approach. In Advances in Neural Information Processing Systems, pages 1333–1342, 2017b.
  • Poupart and Vlassis [2008] Pascal Poupart and Nikos Vlassis. Model-based bayesian reinforcement learning in partially observable domains. In Proc Int. Symp. on Artificial Intelligence and Mathematics,, pages 1–2, 2008.
  • Ross et al. [2007] Stephane Ross, Brahim Chaib-draa, and Joelle Pineau. Bayes-adaptive pomdps. In NIPS, pages 1225–1232, 2007.
  • Russo and Van Roy [2014] Daniel Russo and Benjamin Van Roy. Learning to optimize via posterior sampling. Mathematics of Operations Research, 39(4):1221–1243, 2014.
  • Shani et al. [2005] Guy Shani, Ronen I Brafman, and Solomon E Shimony. Model-based online learning of pomdps. In European Conference on Machine Learning, pages 353–364. Springer, 2005.
  • Simchowitz and Foster [2020] Max Simchowitz and Dylan Foster. Naive exploration is optimal for online lqr. In International Conference on Machine Learning, pages 8937–8948. PMLR, 2020.
  • Strens [2000] Malcolm Strens. A bayesian framework for reinforcement learning. In ICML, volume 2000, pages 943–950, 2000.
  • Subramanian et al. [2020] Jayakumar Subramanian, Amit Sinha, Raihan Seraj, and Aditya Mahajan. Approximate information state for approximate planning and reinforcement learning in partially observed systems. arXiv preprint arXiv:2010.08843, 2020.
  • Thompson [1933] William R Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3/4):285–294, 1933.
  • Tsiamis and Pappas [2020] Anastasios Tsiamis and George Pappas. Online learning of the kalman filter with logarithmic regret. arXiv preprint arXiv:2002.05141, 2020.
  • Wang et al. [2020] Ruosong Wang, Russ R Salakhutdinov, and Lin Yang. Reinforcement learning with general value function approximation: Provably efficient approach via bounded eluder dimension. Advances in Neural Information Processing Systems, 33, 2020.
  • Wang et al. [2021] Tianhao Wang, Dongruo Zhou, and Quanquan Gu. Provably efficient reinforcement learning with linear function approximation under adaptivity constraints. arXiv preprint arXiv:2101.02195, 2021.
  • Wei et al. [2020] Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. In International Conference on Machine Learning, pages 10170–10180. PMLR, 2020.
  • Wei et al. [2021] Chen-Yu Wei, Mehdi Jafarnia-Jahromi, Haipeng Luo, and Rahul Jain. Learning infinite-horizon average-reward mdps with linear function approximation. International Conference on Artificial Intelligence and Statistics, 2021.
  • Zanette and Brunskill [2019] Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, 2019.
  • Zhang and Ji [2019] Zihan Zhang and Xiangyang Ji. Regret minimization for reinforcement learning by evaluating the optimal bias function. In Advances in Neural Information Processing Systems, 2019.

Appendix A Regret Decomposition

Lemma 1.

R_{T} can be decomposed as R_{T}=H\mathbb{E}_{\theta_{*}}[K_{T}]+R_{1}+R_{2}+R_{3}, where

R_{1}:=\mathbb{E}_{\theta_{*}}\left[\sum_{k=1}^{K_{T}}T_{k}\Big[J(\theta_{k})-J(\theta_{*})\Big]\right],
R_{2}:=H\mathbb{E}_{\theta_{*}}\left[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\left[\sum_{s^{\prime}}\left|\theta_{*}(s^{\prime}|s_{t},a_{t})-\theta_{k}(s^{\prime}|s_{t},a_{t})\right|+\sum_{s}\left|h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k})\right|\right]\right],
R_{3}:=\mathbb{E}_{\theta_{*}}\left[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\Big[c(h_{t}(\cdot;\theta_{*}),a_{t})-c(h_{t}(\cdot;\theta_{k}),a_{t})\Big]\right].
Proof.

First, note that \mathbb{E}_{\theta_{*}}[C(s_{t},a_{t})|{\mathcal{F}}_{t+}]=c(h_{t}(\cdot;\theta_{*}),a_{t}) for any t\geq 1. Thus, we can write:

R_{T}=\mathbb{E}_{\theta_{*}}\Big[\sum_{t=1}^{T}\Big[C(s_{t},a_{t})-J(\theta_{*})\Big]\Big]=\mathbb{E}_{\theta_{*}}\Big[\sum_{t=1}^{T}\Big[c(h_{t}(\cdot;\theta_{*}),a_{t})-J(\theta_{*})\Big]\Big].

During episode k, by the Bellman equation for the sampled POMDP \theta_{k} and the fact that a_{t}=\pi^{*}(h_{t}(\cdot;\theta_{k});\theta_{k}), we can write:

c(h_{t}(\cdot;\theta_{k}),a_{t})-J(\theta_{k})=v(h_{t}(\cdot;\theta_{k});\theta_{k})-\sum_{o}P(o|h_{t}(\cdot;\theta_{k}),a_{t};\theta_{k})v(h^{\prime};\theta_{k}),

where h^{\prime}=\tau(h_{t}(\cdot;\theta_{k}),a_{t},o;\theta_{k}). Using this equation, we proceed by decomposing the regret as

R_{T}=\mathbb{E}_{\theta_{*}}\Big[\sum_{t=1}^{T}\Big[c(h_{t}(\cdot;\theta_{*}),a_{t})-J(\theta_{*})\Big]\Big]
=\mathbb{E}_{\theta_{*}}\Big[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\Big[c(h_{t}(\cdot;\theta_{*}),a_{t})-J(\theta_{*})\Big]\Big]
=\mathbb{E}_{\theta_{*}}\Big[\sum_{k=1}^{K_{T}}\underbrace{\sum_{t=t_{k}}^{t_{k+1}-1}\Big[v(h_{t}(\cdot;\theta_{k});\theta_{k})-v(h_{t+1}(\cdot;\theta_{k});\theta_{k})\Big]}_{\text{telescopic sum}}\Big]+\underbrace{\mathbb{E}_{\theta_{*}}\left[\sum_{k=1}^{K_{T}}T_{k}\Big[J(\theta_{k})-J(\theta_{*})\Big]\right]}_{=:R_{1}}
\qquad+\underbrace{\mathbb{E}_{\theta_{*}}\Big[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\Big[v(h_{t+1}(\cdot;\theta_{k});\theta_{k})-\sum_{o\in O}P(o|h_{t}(\cdot;\theta_{k}),a_{t};\theta_{k})v(h^{\prime};\theta_{k})\Big]\Big]}_{=:R_{2}^{\prime}}
\qquad+\underbrace{\mathbb{E}_{\theta_{*}}\Big[\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\Big[c(h_{t}(\cdot;\theta_{*}),a_{t})-c(h_{t}(\cdot;\theta_{k}),a_{t})\Big]\Big]}_{=:R_{3}}

where K_{T} is the number of episodes up to time T and t_{k} is the start time of episode k (we let t_{k}=T+1 for all k>K_{T}). The telescopic sum is equal to v(h_{t_{k}}(\cdot;\theta_{k});\theta_{k})-v(h_{t_{k+1}}(\cdot;\theta_{k});\theta_{k})\leq H. Thus, the first term on the right hand side is upper bounded by H\mathbb{E}_{\theta_{*}}[K_{T}]. It suffices to show that R_{2}^{\prime}\leq R_{2}. Throughout the proof, we change the order of expectation and summation at several points. A rigorous proof of why this is allowed in the case that K_{T} and t_{k} are random variables is presented in the proof of Lemma 3.

We proceed by bounding the term R2R_{2}^{\prime}. Recall that h=τ(ht(;θk),at,o;θk)h^{\prime}=\tau(h_{t}(\cdot;\theta_{k}),a_{t},o;\theta_{k}) and ht+1(;θk)=τ(ht(;θk),at,ot+1;θk)h_{t+1}(\cdot;\theta_{k})=\tau(h_{t}(\cdot;\theta_{k}),a_{t},o_{t+1};\theta_{k}). Conditioned on t,θ,θk{\mathcal{F}}_{t},\theta_{*},\theta_{k}, the only random variable in ht+1(;θk)h_{t+1}(\cdot;\theta_{k}) is ot+1o_{t+1} (at=π(ht(;θk);θk)a_{t}=\pi^{*}(h_{t}(\cdot;\theta_{k});\theta_{k}) is measurable with respect to the sigma algebra generated by t,θk{\mathcal{F}}_{t},\theta_{k}). Therefore,

𝔼θ[v(ht+1(;θk);θk)|t,θk]=oOv(h;θk)θ(ot+1=o|t,θk).\displaystyle\mathbb{E_{\theta_{*}}}\Big{[}v(h_{t+1}(\cdot;\theta_{k});\theta_{k})|{\mathcal{F}}_{t},\theta_{k}\Big{]}=\sum_{o\in O}v(h^{\prime};\theta_{k})\mathbb{P_{\theta_{*}}}(o_{t+1}=o|{\mathcal{F}}_{t},\theta_{k}). (15)

We claim that θ(ot+1=o|t,θk)=P(o|ht(;θ),at;θ)\mathbb{P_{\theta_{*}}}(o_{t+1}=o|{\mathcal{F}}_{t},\theta_{k})=P(o|h_{t}(\cdot;\theta_{*}),a_{t};\theta_{*}): by the law of total probability and the fact that θ(ot+1=o|st+1=s,t,θk)=η(o|s)\mathbb{P_{\theta_{*}}}(o_{t+1}=o|s_{t+1}=s^{\prime},{\mathcal{F}}_{t},\theta_{k})=\eta(o|s^{\prime}), we can write

θ(ot+1=o|t,θk)=sη(o|s)θ(st+1=s|t,θk).\displaystyle\mathbb{P_{\theta_{*}}}(o_{t+1}=o|{\mathcal{F}}_{t},\theta_{k})=\sum_{s^{\prime}}\eta(o|s^{\prime})\mathbb{P_{\theta_{*}}}(s_{t+1}=s^{\prime}|{\mathcal{F}}_{t},\theta_{k}).

Note that

θ(st+1=s|t,θk)\displaystyle\mathbb{P_{\theta_{*}}}(s_{t+1}=s^{\prime}|{\mathcal{F}}_{t},\theta_{k}) =sθ(st+1=s|st=s,t,at,θk)θ(st=s|t,θk)\displaystyle=\sum_{s}\mathbb{P_{\theta_{*}}}(s_{t+1}=s^{\prime}|s_{t}=s,{\mathcal{F}}_{t},a_{t},\theta_{k})\mathbb{P_{\theta_{*}}}(s_{t}=s|{\mathcal{F}}_{t},\theta_{k})
=sθ(s|s,at)θ(st=s|t).\displaystyle=\sum_{s}\theta_{*}(s^{\prime}|s,a_{t})\mathbb{P_{\theta_{*}}}(s_{t}=s|{\mathcal{F}}_{t}).

Thus,

θ(ot+1=o|t,θk)\displaystyle\mathbb{P_{\theta_{*}}}(o_{t+1}=o|{\mathcal{F}}_{t},\theta_{k}) =s,sη(o|s)θ(s|s,at)ht(s;θ)=P(o|ht(;θ),at;θ).\displaystyle=\sum_{s,s^{\prime}}\eta(o|s^{\prime})\theta_{*}(s^{\prime}|s,a_{t})h_{t}(s;\theta_{*})=P(o|h_{t}(\cdot;\theta_{*}),a_{t};\theta_{*}). (16)
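Equation (16) is just the one-step observation predictive of the belief MDP. As a quick sanity check, the following minimal sketch (illustrative only; the array shapes, names, and random kernels are assumptions, not part of the paper) computes P(o|b,a;θ)=∑_{s′}η(o|s′)∑_{s}θ(s′|s,a)b(s) with vectorized operations and verifies it against the explicit double sum and that it sums to one over o.

import numpy as np

rng = np.random.default_rng(0)
S, A, O = 4, 3, 5                       # illustrative sizes

# theta[s, a, s'] = theta(s'|s,a);  eta[s', o] = eta(o|s');  b = a belief h_t(.; theta)
theta = rng.dirichlet(np.ones(S), size=(S, A))
eta = rng.dirichlet(np.ones(O), size=S)
b = rng.dirichlet(np.ones(S))
a = 1                                   # an arbitrary action

next_state_dist = b @ theta[:, a, :]    # sum_s b(s) theta(s'|s,a)
obs_pred = next_state_dist @ eta        # sum_{s'} (...) eta(o|s') = P(o|b,a;theta)

brute = np.array([sum(eta[sp, o] * theta[s, a, sp] * b[s]
                      for s in range(S) for sp in range(S)) for o in range(O)])
assert np.allclose(obs_pred, brute) and np.isclose(obs_pred.sum(), 1.0)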

Combining (16) with (15) and substituting into R2R_{2}^{\prime}, we get

R2\displaystyle R_{2}^{\prime} =𝔼θ[k=1KTt=tktk+11[oO(P(o|ht(;θ),at;θ)P(o|ht(;θk),at;θk))v(h;θk)]].\displaystyle=\mathbb{E_{\theta_{*}}}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\Big{[}\sum_{o\in O}\Big{(}P(o|h_{t}(\cdot;\theta_{*}),a_{t};\theta_{*})-P(o|h_{t}(\cdot;\theta_{k}),a_{t};\theta_{k})\Big{)}v(h^{\prime};\theta_{k})\Big{]}\Big{]}.

Recall that for any θΘ\theta\in\Theta, P(o|ht(;θ),at;θ)=sη(o|s)sθ(s|s,at)ht(s;θ)P(o|h_{t}(\cdot;\theta),a_{t};\theta)=\sum_{s^{\prime}}\eta(o|s^{\prime})\sum_{s}\theta(s^{\prime}|s,a_{t})h_{t}(s;\theta). Thus,

R2\displaystyle R_{2}^{\prime} =𝔼θ[k=1KTt=tktk+11o,sv(h;θk)η(o|s)sθ(s|s,at)ht(s;θ)]\displaystyle=\mathbb{E_{\theta_{*}}}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{o,s^{\prime}}v(h^{\prime};\theta_{k})\eta(o|s^{\prime})\sum_{s}\theta_{*}(s^{\prime}|s,a_{t})h_{t}(s;\theta_{*})\Big{]}
𝔼θ[k=1KTt=tktk+11o,sv(h;θk)η(o|s)sθk(s|s,at)ht(s;θ)]\displaystyle\qquad-\mathbb{E_{\theta_{*}}}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{o,s^{\prime}}v(h^{\prime};\theta_{k})\eta(o|s^{\prime})\sum_{s}\theta_{k}(s^{\prime}|s,a_{t})h_{t}(s;\theta_{*})\Big{]}
+𝔼θ[k=1KTt=tktk+11o,sv(h;θk)η(o|s)sθk(s|s,at)(ht(s;θ)ht(s;θk))].\displaystyle\qquad+\mathbb{E_{\theta_{*}}}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{o,s^{\prime}}v(h^{\prime};\theta_{k})\eta(o|s^{\prime})\sum_{s}\theta_{k}(s^{\prime}|s,a_{t})\big{(}h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k})\big{)}\Big{]}. (17)

For the first term, note that conditioned on t,θ{\mathcal{F}}_{t},\theta_{*}, the distribution of sts_{t} is ht(;θ)h_{t}(\cdot;\theta_{*}) by the definition of hth_{t}. Furthermore, ata_{t} is measurable with respect to the sigma algebra generated by t,θk{\mathcal{F}}_{t},\theta_{k} since at=π(ht(;θk);θk)a_{t}=\pi^{*}(h_{t}(\cdot;\theta_{k});\theta_{k}). Thus, we have

𝔼θ[v(h;θk)sθ(s|s,at)ht(s;θ)|t,θk]=v(h;θk)𝔼θ[θ(s|st,at)|t,θk].\displaystyle\mathbb{E_{\theta_{*}}}\Big{[}v(h^{\prime};\theta_{k})\sum_{s}\theta_{*}(s^{\prime}|s,a_{t})h_{t}(s;\theta_{*})\Big{|}{\mathcal{F}}_{t},\theta_{k}\Big{]}=v(h^{\prime};\theta_{k})\mathbb{E_{\theta_{*}}}\Big{[}\theta_{*}(s^{\prime}|s_{t},a_{t})\Big{|}{\mathcal{F}}_{t},\theta_{k}\Big{]}. (18)

Similarly, for the second term on the right hand side of (17), we have

𝔼θ[v(h;θk)sθk(s|s,at)ht(s;θ)|t,θk]=v(h;θk)𝔼θ[θk(s|st,at)|t,θk].\displaystyle\mathbb{E_{\theta_{*}}}\Big{[}v(h^{\prime};\theta_{k})\sum_{s}\theta_{k}(s^{\prime}|s,a_{t})h_{t}(s;\theta_{*})\Big{|}{\mathcal{F}}_{t},\theta_{k}\Big{]}=v(h^{\prime};\theta_{k})\mathbb{E_{\theta_{*}}}\Big{[}\theta_{k}(s^{\prime}|s_{t},a_{t})\Big{|}{\mathcal{F}}_{t},\theta_{k}\Big{]}. (19)

Substituting (18) and (19) into (17) and using the tower property of conditional expectation, we get

R2\displaystyle R_{2}^{\prime} =𝔼θ[k=1KTt=tktk+11[sov(h;θk)η(o|s)(θ(s|st,at)θk(s|st,at))]]\displaystyle=\mathbb{E_{\theta_{*}}}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\Big{[}\sum_{s^{\prime}}\sum_{o}v(h^{\prime};\theta_{k})\eta(o|s^{\prime})\Big{(}\theta_{*}(s^{\prime}|s_{t},a_{t})-\theta_{k}(s^{\prime}|s_{t},a_{t})\Big{)}\Big{]}\Big{]}
+𝔼θ[k=1KTt=tktk+11[sov(h;θk)η(o|s)sθk(s|s,at)(ht(s;θ)ht(s;θk))]].\displaystyle\qquad+\mathbb{E_{\theta_{*}}}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\Big{[}\sum_{s^{\prime}}\sum_{o}v(h^{\prime};\theta_{k})\eta(o|s^{\prime})\sum_{s}\theta_{k}(s^{\prime}|s,a_{t})\big{(}h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k})\big{)}\Big{]}\Big{]}. (20)

Since supbΔ𝒮v(b,θk)H\sup_{b\in\Delta_{\mathcal{S}}}v(b,\theta_{k})\leq H and oη(o|s)=1\sum_{o}\eta(o|s^{\prime})=1, the inner summation for the first term on the right hand side of (20) can be bounded as

oOv(h;θk)η(o|s)(θ(s|st,at)θk(s|st,at))H|θ(s|st,at)θk(s|st,at)|.\displaystyle\sum_{o\in O}v(h^{\prime};\theta_{k})\eta(o|s^{\prime})\Big{(}\theta_{*}(s^{\prime}|s_{t},a_{t})-\theta_{k}(s^{\prime}|s_{t},a_{t})\Big{)}\leq H\Big{|}\theta_{*}(s^{\prime}|s_{t},a_{t})-\theta_{k}(s^{\prime}|s_{t},a_{t})\Big{|}. (21)

Using supbΔ𝒮v(b,θk)H\sup_{b\in\Delta_{\mathcal{S}}}v(b,\theta_{k})\leq H, oη(o|s)=1\sum_{o}\eta(o|s^{\prime})=1 and sθk(s|s,at)=1\sum_{s^{\prime}}\theta_{k}(s^{\prime}|s,a_{t})=1, the second term on the right hand side of (20) can be bounded as

soOv(h;θk)η(o|s)sθk(s|s,at)|ht(s;θ)ht(s;θk)|Hs|ht(s;θ)ht(s;θk)|\displaystyle\sum_{s^{\prime}}\sum_{o\in O}v(h^{\prime};\theta_{k})\eta(o|s^{\prime})\sum_{s}\theta_{k}(s^{\prime}|s,a_{t})\big{|}h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k})\big{|}\leq H\sum_{s}\big{|}h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k})\big{|}. (22)

Substituting (21) and (22) into (20) proves that R2R2R_{2}^{\prime}\leq R_{2}. ∎

Appendix B Proofs of Section 4

B.1 Proof of Lemma 1

Lemma (restatement of Lemma 1). Suppose Assumption 2 holds. Then, there exist constants α>1\alpha>1 and β>0\beta>0 such that

𝔼[1ft(θ)|θ]αexp(βt).\displaystyle\mathbb{E}[1-f_{t}(\theta_{*})|\theta_{*}]\leq\alpha\exp(-\beta t).
Proof.

Let τt\tau_{t} be the trajectory {a1,o1,,at1,ot1,ot}\{a_{1},o_{1},\cdots,a_{t-1},o_{t-1},o_{t}\} and define the likelihood function

:=(τt|θ)=(o1|θ)τ=2t(oτ|o1:τ1,a1:τ1,θ)=(o1|θ)τ=2tνθo1:τ1,a1:τ1(oτ)\displaystyle:=\mathbb{P}(\tau_{t}|\theta)=\mathbb{P}(o_{1}|\theta)\prod_{\tau=2}^{t}\mathbb{P}(o_{\tau}|o_{1:\tau-1},a_{1:\tau-1},\theta)=\mathbb{P}(o_{1}|\theta)\prod_{\tau=2}^{t}\nu_{\theta}^{o_{1:\tau-1},a_{1:\tau-1}}(o_{\tau}) (23)

Note that (o1|θ)\mathbb{P}(o_{1}|\theta) is independent of θ\theta, thus

(τt|θ)(τt|γ)=τ=2tνθo1:τ1,a1:τ1(oτ)νγo1:τ1,a1:τ1(oτ)\displaystyle\frac{{\mathcal{L}}(\tau_{t}|\theta)}{{\mathcal{L}}(\tau_{t}|\gamma)}=\prod_{\tau=2}^{t}\frac{\nu_{\theta}^{o_{1:\tau-1},a_{1:\tau-1}}(o_{\tau})}{\nu_{\gamma}^{o_{1:\tau-1},a_{1:\tau-1}}(o_{\tau})}

Recall that ft()f_{t}(\cdot) is the posterior associated with the likelihood given by

ft(θ)=(τt|θ)f(θ)γΘ(τt|γ)f(γ).\displaystyle f_{t}(\theta)=\frac{{\mathcal{L}}(\tau_{t}|\theta)f(\theta)}{\sum_{\gamma\in\Theta}{\mathcal{L}}(\tau_{t}|\gamma)f(\gamma)}.
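For finite Θ, the posterior ft above can be computed exactly from the per-step observation likelihoods. The sketch below is an illustrative implementation (the function and variable names are ours, not the paper's); it accumulates log-likelihoods and normalizes, which is numerically equivalent to the ratio form used in the next display.

import numpy as np

def posterior_over_finite_theta(log_lik, log_prior):
    # log_lik[i]   = sum_{tau <= t} log nu_{theta_i}^{o_{1:tau-1}, a_{1:tau-1}}(o_tau)
    # log_prior[i] = log f(theta_i)
    logw = np.asarray(log_lik, dtype=float) + np.asarray(log_prior, dtype=float)
    logw -= logw.max()                  # stabilize before exponentiating
    w = np.exp(logw)
    return w / w.sum()                  # f_t(theta_i), the normalized posterior

# toy usage with two candidate parameters (numbers are illustrative)
print(posterior_over_finite_theta([-10.2, -14.7], np.log([0.5, 0.5])))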

We now proceed to lower bound ft(θ)f_{t}(\theta_{*}). We can write

ft(θ)\displaystyle f_{t}(\theta_{*}) =(τt|θ)f(θ)θ(τt|θ)f(θ)=11+θθf(θ)f(θ)(τt|θ)(τt|θ)\displaystyle=\frac{{\mathcal{L}}(\tau_{t}|\theta_{*})f(\theta_{*})}{\sum_{\theta}{\mathcal{L}}(\tau_{t}|\theta)f(\theta)}=\frac{1}{1+\sum_{\theta\neq\theta_{*}}\frac{f(\theta)}{f(\theta_{*})}\frac{{\mathcal{L}}(\tau_{t}|\theta)}{{\mathcal{L}}(\tau_{t}|\theta_{*})}}
=11+θθf(θ)f(θ)exp(τ=1tlogΛτθ),\displaystyle=\frac{1}{1+\sum_{\theta\neq\theta_{*}}\frac{f(\theta)}{f(\theta_{*})}\exp(-\sum_{\tau=1}^{t}\log\Lambda_{\tau}^{\theta})},

where we define Λ1θ:=1\Lambda_{1}^{\theta}:=1 and for τ2\tau\geq 2,

Λτθ:=νθo1:τ1,a1:τ1(oτ)νθo1:τ1,a1:τ1(oτ)\displaystyle\Lambda_{\tau}^{\theta}:=\frac{\nu_{\theta_{*}}^{o_{1:\tau-1},a_{1:\tau-1}}(o_{\tau})}{\nu_{\theta}^{o_{1:\tau-1},a_{1:\tau-1}}(o_{\tau})}

Note that without loss of generality, we can assume that the denominator in the definition of Λτθ\Lambda_{\tau}^{\theta} is positive (otherwise, (τt|θ)=0{\mathcal{L}}(\tau_{t}|\theta)=0 and the corresponding θ\theta can be excluded from the denominator of ft(θ)f_{t}(\theta_{*})), and thus Λτθ\Lambda_{\tau}^{\theta} is well-defined.

Denote by Ztθ:=τ=1tlogΛτθZ_{t}^{\theta}:=\sum_{\tau=1}^{t}\log\Lambda_{\tau}^{\theta} and decompose it as Ztθ=Mtθ+AtθZ_{t}^{\theta}=M_{t}^{\theta}+A_{t}^{\theta} where

Mtθ\displaystyle M_{t}^{\theta} :=τ=1t(logΛτθ𝔼[logΛτθ|τ1,θ]),\displaystyle:=\sum_{\tau=1}^{t}\Big{(}\log\Lambda_{\tau}^{\theta}-\mathbb{E}\Big{[}\log\Lambda_{\tau}^{\theta}\big{|}{\mathcal{F}}_{\tau-1},\theta_{*}\Big{]}\Big{)},
Atθ\displaystyle A_{t}^{\theta} :=τ=1t𝔼[logΛτθ|τ1,θ].\displaystyle:=\sum_{\tau=1}^{t}\mathbb{E}\Big{[}\log\Lambda_{\tau}^{\theta}\big{|}{\mathcal{F}}_{\tau-1},\theta_{*}\Big{]}.

Note that the terms inside the first summation constitute a martingale difference sequence with respect to the filtration (τ)τ1({\mathcal{F}}_{\tau})_{\tau\geq 1} and conditional probability (|θ)\mathbb{P}(\cdot|\theta_{*}). Each term is bounded as |logΛτθ𝔼[logΛτθ|τ1,θ]|d|\log\Lambda_{\tau}^{\theta}-\mathbb{E}[\log\Lambda_{\tau}^{\theta}|{\mathcal{F}}_{\tau-1},\theta_{*}]|\leq d for some d>0d>0. The second term, AtθA_{t}^{\theta} can be lower bounded using Assumption 2 as follows

𝔼[logΛτθ|τ1,θ]\displaystyle\mathbb{E}\Big{[}\log\Lambda_{\tau}^{\theta}\big{|}{\mathcal{F}}_{\tau-1},\theta_{*}\Big{]} =𝔼[𝔼[logΛτθ|τ1,aτ1,θ]|τ1,θ]\displaystyle=\mathbb{E}\bigg{[}\mathbb{E}\Big{[}\log\Lambda_{\tau}^{\theta}\big{|}{\mathcal{F}}_{\tau-1},a_{\tau-1},\theta_{*}\Big{]}\Big{|}{\mathcal{F}}_{\tau-1},\theta_{*}\bigg{]}
=𝔼[𝒦(νθo1:τ1,a1:τ1νθo1:τ1,a1:τ1)|τ1,θ]ϵ\displaystyle=\mathbb{E}\Big{[}{\mathcal{K}}(\nu_{\theta_{*}}^{o_{1:\tau-1},a_{1:\tau-1}}\|\nu_{\theta}^{o_{1:\tau-1},a_{1:\tau-1}})\big{|}{\mathcal{F}}_{\tau-1},\theta_{*}\Big{]}\geq\epsilon

Summing over τ\tau implies that

Atθϵt.\displaystyle A_{t}^{\theta}\geq\epsilon t. (24)

To bound MtθM_{t}^{\theta}, let 0<δ<ϵ0<\delta<\epsilon, and apply Azuma’s inequality to obtain

(|Mtθ|δt|θ)2exp(δ2t2d2).\displaystyle\mathbb{P}\Big{(}|M_{t}^{\theta}|\geq\delta t\big{|}\theta_{*}\Big{)}\leq 2\exp(-\frac{\delta^{2}t}{2d^{2}}).

A union bound over all θθ\theta\neq\theta_{*} implies that the event Btδ:=θθ{|Mtθ|δt}B_{t}^{\delta}:=\cap_{\theta\neq\theta_{*}}\{|M_{t}^{\theta}|\leq\delta t\} happens with probability at least 12(|Θ|1)exp(δ2t2d2)1-2(|\Theta|-1)\exp(-\frac{\delta^{2}t}{2d^{2}}). If BtδB_{t}^{\delta} holds, then Mtθδt-M_{t}^{\theta}\leq\delta t for all θθ\theta\neq\theta_{*}. Combining this with (24) implies that exp(MtθAtθ)exp(δtϵt)\exp(-M_{t}^{\theta}-A_{t}^{\theta})\leq\exp(\delta t-\epsilon t).

𝔼[ft(θ)|θ]\displaystyle\mathbb{E}[f_{t}(\theta_{*})|\theta_{*}] =𝔼[11+θθf(θ)f(θ)exp(MtθAtθ)|θ]\displaystyle=\mathbb{E}\Bigg{[}\frac{1}{1+\sum_{\theta\neq\theta_{*}}\frac{f(\theta)}{f(\theta_{*})}\exp(-M_{t}^{\theta}-A_{t}^{\theta})}\bigg{|}\theta_{*}\Bigg{]}
𝔼[𝟙(Btδ)1+θθf(θ)f(θ)exp(δtϵt)|θ]\displaystyle\geq\mathbb{E}\Bigg{[}\frac{\mathbbm{1}(B_{t}^{\delta})}{1+\sum_{\theta\neq\theta_{*}}\frac{f(\theta)}{f(\theta_{*})}\exp(\delta t-\epsilon t)}\bigg{|}\theta_{*}\Bigg{]}
=(Btδ|θ)1+1f(θ)f(θ)exp(δtϵt)\displaystyle=\frac{\mathbb{P}(B_{t}^{\delta}|\theta_{*})}{1+\frac{1-f(\theta_{*})}{f(\theta_{*})}\exp(\delta t-\epsilon t)}
12(|Θ|1)exp(δ2t2d2)1+1f(θ)f(θ)exp(δtϵt).\displaystyle\geq\frac{1-2(|\Theta|-1)\exp(-\frac{\delta^{2}t}{2d^{2}})}{1+\frac{1-f(\theta_{*})}{f(\theta_{*})}\exp(\delta t-\epsilon t)}.

Now, by choosing δ=ϵ/2\delta=\epsilon/2, and constants α=2max{maxθΘ1f(θ)f(θ),2(|Θ|1)}\alpha=2\max\{\max_{\theta\in\Theta}\frac{1-f(\theta)}{f(\theta)},2(|\Theta|-1)\}, and β=min{ϵ2,ϵ28d2}\beta=\min\{\frac{\epsilon}{2},\frac{\epsilon^{2}}{8d^{2}}\}, we have

𝔼[1ft(θ)|θ]\displaystyle\mathbb{E}[1-f_{t}(\theta_{*})|\theta_{*}] 112(|Θ|1)exp(δ2t2d2)1+1f(θ)f(θ)exp(δtϵt)\displaystyle\leq 1-\frac{1-2(|\Theta|-1)\exp(-\frac{\delta^{2}t}{2d^{2}})}{1+\frac{1-f(\theta_{*})}{f(\theta_{*})}\exp(\delta t-\epsilon t)}
=1f(θ)f(θ)exp(δtϵt)+2(|Θ|1)exp(δ2t2d2)1+1f(θ)f(θ)exp(δtϵt)\displaystyle=\frac{\frac{1-f(\theta_{*})}{f(\theta_{*})}\exp(\delta t-\epsilon t)+2(|\Theta|-1)\exp(-\frac{\delta^{2}t}{2d^{2}})}{1+\frac{1-f(\theta_{*})}{f(\theta_{*})}\exp(\delta t-\epsilon t)}
1f(θ)f(θ)exp(δtϵt)+2(|Θ|1)exp(δ2t2d2)\displaystyle\leq\frac{1-f(\theta_{*})}{f(\theta_{*})}\exp(\delta t-\epsilon t)+2(|\Theta|-1)\exp(-\frac{\delta^{2}t}{2d^{2}})
=1f(θ)f(θ)exp(ϵt2)+2(|Θ|1)exp(ϵ2t8d2)\displaystyle=\frac{1-f(\theta_{*})}{f(\theta_{*})}\exp(-\frac{\epsilon t}{2})+2(|\Theta|-1)\exp(-\frac{\epsilon^{2}t}{8d^{2}})
αexp(βt).\displaystyle\leq\alpha\exp(-\beta t). ∎
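The exponential decay in Lemma 1 is easy to see in a toy simulation: with two candidate observation models separated in KL divergence (the role of Assumption 2), the posterior mass on the wrong parameter shrinks geometrically. The sketch below is illustrative only; the Bernoulli observation model and all numbers are assumptions, not the paper's setting.

import numpy as np

rng = np.random.default_rng(1)
p_star, p_wrong = 0.7, 0.4            # two candidate Bernoulli observation models
log_post = np.log([0.5, 0.5])         # log prior over {theta_*, theta}

mass_wrong = []
for t in range(1, 201):
    o = rng.random() < p_star                          # observation drawn under theta_*
    log_post += np.log([p_star if o else 1.0 - p_star,
                        p_wrong if o else 1.0 - p_wrong])
    f = np.exp(log_post - log_post.max()); f /= f.sum()
    mass_wrong.append(1.0 - f[0])                      # this is 1 - f_t(theta_*)

# 1 - f_t(theta_*) decays roughly like exp(-KL(Ber(p_star) || Ber(p_wrong)) * t)
print(mass_wrong[9], mass_wrong[49], mass_wrong[-1])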

Appendix C Proofs of Section 5

C.1 Full Upper Bound on the Expected Regret of Theorem 2

The exact expression for the upper bound of the expected regret in Theorem 2 is

𝔼[RT]H𝔼[KT]+𝔼[R1]+𝔼[R2]+𝔼[R3]\displaystyle\mathbb{E}[R_{T}]\leq H\mathbb{E}[K_{T}]+\mathbb{E}[R_{1}]+\mathbb{E}[R_{2}]+\mathbb{E}[R_{3}]
(1+H)𝔼[KT]+12HK2(|𝒮||𝒜|T)2/3\displaystyle\leq(1+H)\mathbb{E}[K_{T}]+12HK_{2}(|{\mathcal{S}}||{\mathcal{A}}|T)^{2/3}
+(H+1)K1𝔼[k=1KTTktk]+2+H\displaystyle\quad+(H+1)K_{1}\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\frac{T_{k}}{\sqrt{t_{k}}}\Big{]}+2+H
(1+H)2T(1+|𝒮||𝒜|log(T+1))+12HK2(|𝒮||𝒜|T)2/3\displaystyle\leq(1+H)\sqrt{2T(1+|{\mathcal{S}}||{\mathcal{A}}|\log(T+1))}+12HK_{2}(|{\mathcal{S}}||{\mathcal{A}}|T)^{2/3}
+7(H+1)K12T(1+|𝒮||𝒜|log(T+1))log2T+2+H.\displaystyle\quad+7(H+1)K_{1}\sqrt{2T}(1+|{\mathcal{S}}||{\mathcal{A}}|\log(T+1))\log\sqrt{2T}+2+H.

C.2 Finite-parameter Case Satisfies Assumptions 3 and 4

In this section, we show that Assumptions 3 and 4 are satisfied in the finite-parameter case, i.e., |Θ|<|\Theta|<\infty, as long as the PSRL-POMDP algorithm generates a deterministic schedule. For instance, a deterministic schedule can be generated by choosing m~t(s,a)=t\tilde{m}_{t}(s,a)=t for all state-action pairs (s,a)(s,a) and running Algorithm 1 with either SCHED(tk,Tk1)=2tk\texttt{SCHED}(t_{k},T_{k-1})=2t_{k} or SCHED(tk,Tk1)=tk+Tk1\texttt{SCHED}(t_{k},T_{k-1})=t_{k}+T_{k-1}.

Lemma 1.

Assume |Θ|<|\Theta|<\infty. If Algorithm 1 generates a deterministic schedule, then Assumptions 3 and 4 are satisfied.

Proof.

Observe that the left hand side of (9) is zero if θk(t)=θ\theta_{k(t)}=\theta_{*}, and is upper bounded by 22 if θk(t)θ\theta_{k(t)}\neq\theta_{*}. Thus, we can write

𝔼[s|ht(s;θ)ht(s;θk(t))||θ]2(θk(t)θ|θ)=2𝔼[1ftk(t)(θ)|θ]2αexp(βtk(t)),\displaystyle\mathbb{E}\Big{[}\sum_{s}\big{|}h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k(t)})\big{|}\Big{|}\theta_{*}\Big{]}\leq 2\mathbb{P}(\theta_{k(t)}\neq\theta_{*}|\theta_{*})=2\mathbb{E}\left[1-f_{t_{k(t)}}(\theta_{*})|\theta_{*}\right]\leq 2\alpha\exp(-\beta t_{k(t)}),

which satisfies Assumption 3 by choosing a large enough constant K1K_{1}. Here, the equality holds because θk(t)\theta_{k(t)} is sampled from the posterior ftk(t)f_{t_{k(t)}}, and the last inequality follows from Lemma 1 together with the fact that the start time of episode k(t)k(t) is deterministic.

To see why Assumption 4 is satisfied, let θ^t\hat{\theta}_{t} be the Maximum a Posteriori (MAP) estimator, i.e., θ^t=argmaxθΘft(θ)\hat{\theta}_{t}=\operatorname*{argmax}_{\theta\in\Theta}f_{t}(\theta). Then, the left hand side of (10) is equal to zero if θ^t=θ\hat{\theta}_{t}=\theta_{*}. Note that this happens with high probability, as the following argument shows:

(θ^tθ|θ)(ft(θ)0.5|θ)=(1ft(θ)0.5|θ)2𝔼[1ft(θ)|θ]2αexp(βt).\displaystyle\mathbb{P}(\hat{\theta}_{t}\neq\theta_{*}|\theta_{*})\leq\mathbb{P}\left(f_{t}(\theta_{*})\leq 0.5|\theta_{*}\right)=\mathbb{P}(1-f_{t}(\theta_{*})\geq 0.5|\theta_{*})\leq 2\mathbb{E}[1-f_{t}(\theta_{*})|\theta_{*}]\leq 2\alpha\exp(-\beta t).

Here the first inequality is by the fact that if ft(θ)>0.5f_{t}(\theta_{*})>0.5, then the MAP estimator would choose θ^t=θ\hat{\theta}_{t}=\theta_{*}. The second inequality is by Markov's inequality and the last inequality is by Lemma 1. Note that m~t(s,a)t\tilde{m}_{t}(s,a)\leq t by definition. We claim that Assumption 4 is satisfied by choosing K2=2(1/β)log(δ/2α)K_{2}=2\sqrt{(-1/\beta)\log(\delta/2\alpha)}. To see this, note that 2αexp(βt)δ2\alpha\exp(-\beta t)\leq\delta for t(1/β)log(δ/2α)t\geq(-1/\beta)\log(\delta/2\alpha). In this case, (10) automatically holds since with probability at least 1δ1-\delta the left hand side is zero. For t<(1/β)log(δ/2α)t<(-1/\beta)\log(\delta/2\alpha), note that the left hand side of (10) can be at most 2. Therefore, K2K_{2} can be found by solving 2K2/(1/β)log(δ/2α)2\leq K_{2}/\sqrt{(-1/\beta)\log(\delta/2\alpha)}. ∎

C.3 Auxiliary Lemmas for Section 5

Lemma 2.

[Lemma 3 in Ouyang et al. [2017b]] The term 𝔼[R1]\mathbb{E}[R_{1}] can be bounded as 𝔼[R1]𝔼[KT]\mathbb{E}[R_{1}]\leq\mathbb{E}[K_{T}].

Proof.
𝔼[R1]\displaystyle\mathbb{E}[R_{1}] =𝔼[k=1KTTk[J(θk)J(θ)]]=𝔼[k=1𝟙(tkT)TkJ(θk)]T𝔼[J(θ)].\displaystyle=\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}T_{k}\Big{[}J(\theta_{k})-J(\theta_{*})\Big{]}\Big{]}=\mathbb{E}\Big{[}\sum_{k=1}^{\infty}\mathbbm{1}(t_{k}\leq T)T_{k}J(\theta_{k})\Big{]}-T\mathbb{E}[J(\theta_{*})].

By the monotone convergence theorem and the fact that J(θk)0J(\theta_{k})\geq 0 and TkTk1+1T_{k}\leq T_{k-1}+1 (the first criterion in determining the episode length in Algorithm 1), the first term can be bounded as

𝔼[k=1𝟙(tkT)TkJ(θk)]=k=1𝔼[𝟙(tkT)TkJ(θk)]\displaystyle\mathbb{E}\Big{[}\sum_{k=1}^{\infty}\mathbbm{1}(t_{k}\leq T)T_{k}J(\theta_{k})\Big{]}=\sum_{k=1}^{\infty}\mathbb{E}\Big{[}\mathbbm{1}(t_{k}\leq T)T_{k}J(\theta_{k})\Big{]}
k=1𝔼[𝟙(tkT)(Tk1+1)J(θk)].\displaystyle\leq\sum_{k=1}^{\infty}\mathbb{E}\Big{[}\mathbbm{1}(t_{k}\leq T)(T_{k-1}+1)J(\theta_{k})\Big{]}.

Note that 𝟙(tkT)(Tk1+1)\mathbbm{1}(t_{k}\leq T)(T_{k-1}+1) is tk{\mathcal{F}}_{t_{k}}-measurable. Thus, by the property of posterior sampling (Lemma 1), 𝔼[𝟙(tkT)(Tk1+1)J(θk)]=𝔼[𝟙(tkT)(Tk1+1)J(θ)]\mathbb{E}[\mathbbm{1}(t_{k}\leq T)(T_{k-1}+1)J(\theta_{k})]=\mathbb{E}[\mathbbm{1}(t_{k}\leq T)(T_{k-1}+1)J(\theta_{*})]. Therefore,

𝔼[R1]\displaystyle\mathbb{E}[R_{1}] 𝔼[k=1𝟙(tkT)(Tk1+1)J(θ)]T𝔼[J(θ)]\displaystyle\leq\mathbb{E}\Big{[}\sum_{k=1}^{\infty}\mathbbm{1}(t_{k}\leq T)(T_{k-1}+1)J(\theta_{*})\Big{]}-T\mathbb{E}[J(\theta_{*})]
=𝔼[J(θ)(KT+k=1KTTk1)]T𝔼[J(θ)]\displaystyle=\mathbb{E}\Big{[}J(\theta_{*})(K_{T}+\sum_{k=1}^{K_{T}}T_{k-1})\Big{]}-T\mathbb{E}[J(\theta_{*})]
=𝔼[J(θ)KT]+𝔼[J(θ)(k=1KTTk1T)]𝔼[KT],\displaystyle=\mathbb{E}[J(\theta_{*})K_{T}]+\mathbb{E}\Big{[}J(\theta_{*})(\sum_{k=1}^{K_{T}}T_{k-1}-T)\Big{]}\leq\mathbb{E}[K_{T}],

where the last inequality is by the fact that k=1KTTk1T0\sum_{k=1}^{K_{T}}T_{k-1}-T\leq 0 and 0J(θ)10\leq J(\theta_{*})\leq 1. ∎
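The key step above is the posterior sampling property: conditioned on the information at the start of episode kk, θk\theta_{k} and θ\theta_{*} have the same distribution, so any fixed function of the parameter has the same expectation under both. A toy Monte Carlo sketch of this property (illustrative; the two-parameter Bernoulli model and the values standing in for JJ are assumptions):

import numpy as np

rng = np.random.default_rng(2)
Theta = np.array([0.2, 0.8])          # observation parameter of each candidate model
J = np.array([0.3, 0.9])              # some fixed function of theta (stand-in for J(theta))
prior = np.array([0.5, 0.5])

def one_run():
    i_star = rng.choice(2, p=prior)                     # theta_* ~ prior
    obs = rng.random(5) < Theta[i_star]                 # data generated under theta_*
    lik = np.prod(np.where(obs[:, None], Theta, 1.0 - Theta), axis=0)
    post = lik * prior; post /= post.sum()              # posterior at the episode start
    i_k = rng.choice(2, p=post)                         # theta_k ~ posterior
    return J[i_star], J[i_k]

vals = np.array([one_run() for _ in range(100_000)])
print(vals.mean(axis=0))    # the two sample means agree up to Monte Carlo error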

Lemma 3.

The term 𝔼[R3]\mathbb{E}[R_{3}] can be bounded as

𝔼[R3]K1𝔼[k=1KTTktk]+1,\displaystyle\mathbb{E}[R_{3}]\leq K_{1}\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\frac{T_{k}}{\sqrt{t_{k}}}\Big{]}+1,

where K1:=K1(|𝒮|,|𝒜|,|O|,ι)K_{1}:=K_{1}(|{\mathcal{S}}|,|{\mathcal{A}}|,|O|,\iota) is the constant in Assumption 3.

Proof.

Recall that

𝔼[R3]\displaystyle\mathbb{E}[R_{3}] =𝔼[k=1KTt=tktk+11[c(ht(;θ),at)c(ht(;θk),at)]].\displaystyle=\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\Big{[}c(h_{t}(\cdot;\theta_{*}),a_{t})-c(h_{t}(\cdot;\theta_{k}),a_{t})\Big{]}\Big{]}.

Let k(t)k(t) be a random variable denoting the episode number at time tt, i.e., tk(t)t<tk(t)+1t_{k(t)}\leq t<t_{k(t)+1} for all tTt\leq T. By the definition of cc, we can write

𝔼[R3]\displaystyle\mathbb{E}[R_{3}] =𝔼[k=1KTt=tktk+11sC(s,at)[ht(s;θ)ht(s;θk)]]\displaystyle=\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{s}C(s,a_{t})\Big{[}h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k})\Big{]}\Big{]}
=𝔼[t=1TsC(s,at)[ht(s;θ)ht(s;θk(t))]]\displaystyle=\mathbb{E}\Big{[}\sum_{t=1}^{T}\sum_{s}C(s,a_{t})\Big{[}h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k(t)})\Big{]}\Big{]}
t=1T𝔼[s|ht(s;θ)ht(s;θk(t))|]\displaystyle\leq\sum_{t=1}^{T}\mathbb{E}\Big{[}\sum_{s}\big{|}h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k(t)})\big{|}\Big{]}
=𝔼[k=1KTt=tktk+11𝔼[s|ht(s;θ)ht(s;θk)|]],\displaystyle=\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\mathbb{E}\Big{[}\sum_{s}\big{|}h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k})\big{|}\Big{]}\Big{]},

where the inequality is by 0C(s,at)10\leq C(s,a_{t})\leq 1. Let K1:=K1(|𝒮|,|𝒜|,|O|,ι)K_{1}:=K_{1}(|{\mathcal{S}}|,|{\mathcal{A}}|,|O|,\iota) be the constant in Assumption 3 and let E1E_{1} denote the event on which the bound 𝔼[s|ht(s;θ)ht(s;θk)|]K1tk\mathbb{E}\Big{[}\sum_{s}\big{|}h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k})\big{|}\Big{]}\leq\frac{K_{1}}{\sqrt{t_{k}}} of Assumption 3 holds. We can write

𝔼[s|ht(s;θ)ht(s;θk)|]\displaystyle\mathbb{E}\Big{[}\sum_{s}\big{|}h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k})\big{|}\Big{]}
=𝔼[s|ht(s;θ)ht(s;θk)|](𝟙(E1)+𝟙(E1c))\displaystyle=\mathbb{E}\Big{[}\sum_{s}\big{|}h_{t}(s;\theta_{*})-h_{t}(s;\theta_{k})\big{|}\Big{]}(\mathbbm{1}(E_{1})+\mathbbm{1}(E_{1}^{c}))
K1tk+2𝟙(E1c).\displaystyle\leq\frac{K_{1}}{\sqrt{t_{k}}}+2\mathbbm{1}(E_{1}^{c}).

Recall that by Assumption 3, (E1c)δ\mathbb{P}(E_{1}^{c})\leq\delta. Therefore,

𝔼[R3]K1𝔼[k=1KTTktk]+2Tδ.\displaystyle\mathbb{E}[R_{3}]\leq K_{1}\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\frac{T_{k}}{\sqrt{t_{k}}}\Big{]}+2T\delta.

Choosing δ=min(1/(2T),1/(2HT))\delta=\min(1/(2T),1/(2HT)) completes the proof. ∎

Lemma 4.

The term R¯2\bar{R}_{2} can be bounded as

R¯2H+12HK2(|𝒮||𝒜|T)2/3,\displaystyle\bar{R}_{2}\leq H+12HK_{2}(|{\mathcal{S}}||{\mathcal{A}}|T)^{2/3},

where K2:=K2(|𝒮|,|𝒜|,|O|,ι)K_{2}:=K_{2}(|{\mathcal{S}}|,|{\mathcal{A}}|,|O|,\iota) is the constant in Assumption 4.

Proof.

Recall that

R¯2=H𝔼[k=1KTt=tktk+11s|θ(s|st,at)θk(s|st,at)|].\displaystyle\bar{R}_{2}=H\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\sum_{s^{\prime}}\Big{|}\theta_{*}(s^{\prime}|s_{t},a_{t})-\theta_{k}(s^{\prime}|s_{t},a_{t})\Big{|}\Big{]}. (25)

We proceed by bounding the inner term of the above equation. For notational simplicity, define z:=(s,a)z:=(s,a) and zt:=(st,at)z_{t}:=(s_{t},a_{t}). Let θ^tk\hat{\theta}_{t_{k}} be the estimator in Assumption 4 and define the confidence set BkB_{k} as

Bk\displaystyle B_{k} :={θΘH:s𝒮|θ(s|z)θ^k(s|z)|K2max{1,m~tk(z)},z𝒮×𝒜},\displaystyle:=\Big{\{}\theta\in\Theta_{H}:\sum_{s^{\prime}\in{\mathcal{S}}}\Big{|}\theta(s^{\prime}|z)-\hat{\theta}_{k}(s^{\prime}|z)\Big{|}\leq\frac{K_{2}}{\sqrt{\max\{1,{\tilde{m}}_{t_{k}}(z)\}}},\forall z\in{\mathcal{S}}\times{\mathcal{A}}\Big{\}},

where K2:=K2(|𝒮|,|𝒜|,|O|,ι)K_{2}:=K_{2}(|{\mathcal{S}}|,|{\mathcal{A}}|,|O|,\iota) is the constant in Assumption 4. Note that BkB_{k} reduces to the confidence set used in Jaksch et al. [2010], Ouyang et al. [2017b] in the case of perfect observation by choosing m~t(s,a)=nt(s,a)\tilde{m}_{t}(s,a)=n_{t}(s,a). By triangle inequality, the inner term in (25) can be bounded by

s|θ(s|zt)θk(s|zt)|\displaystyle\sum_{s^{\prime}}\Big{|}\theta_{*}(s^{\prime}|z_{t})-\theta_{k}(s^{\prime}|z_{t})\Big{|}
s|θ(s|zt)θ^tk(s|zt)|+s|θk(s|zt)θ^tk(s|zt)|\displaystyle\leq\sum_{s^{\prime}}\Big{|}\theta_{*}(s^{\prime}|z_{t})-\hat{\theta}_{t_{k}}(s^{\prime}|z_{t})\Big{|}+\sum_{s^{\prime}}\Big{|}\theta_{k}(s^{\prime}|z_{t})-\hat{\theta}_{t_{k}}(s^{\prime}|z_{t})\Big{|}
2(𝟙(θBk)+𝟙(θkBk))+2K2max{1,m~tk(zt)}.\displaystyle\leq 2\big{(}\mathbbm{1}(\theta_{*}\notin B_{k})+\mathbbm{1}(\theta_{k}\notin B_{k})\big{)}+\frac{2K_{2}}{\sqrt{\max\{1,{\tilde{m}}_{t_{k}}(z_{t})\}}}.
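Before substituting this bound back, here is a minimal sketch of the confidence-set computation (illustrative; the function and variable names are ours, not the paper's): it tests membership of a transition kernel in BkB_{k} and checks that when both θ\theta_{*} and θk\theta_{k} lie in BkB_{k}, the triangle inequality gives the deviation bound of at most twice the confidence radius used above.

import numpy as np

def in_confidence_set(theta, theta_hat, m_tilde, K2):
    # theta, theta_hat: shape (S, A, S);  m_tilde: pseudo-counts, shape (S, A)
    dev = np.abs(theta - theta_hat).sum(axis=-1)            # sum_{s'} |theta - theta_hat|
    return bool((dev <= K2 / np.sqrt(np.maximum(1, m_tilde))).all())

rng = np.random.default_rng(3)
S, A, K2 = 3, 2, 1.0
theta_hat = rng.dirichlet(np.ones(S), size=(S, A))          # estimator at episode start
m_tilde = rng.integers(1, 50, size=(S, A))                  # pseudo-counts at episode start
noise = rng.normal(scale=1e-3, size=(S, A, S))
theta_star = theta_hat + noise - noise.mean(axis=-1, keepdims=True)   # small perturbation
theta_k = theta_hat.copy()

if in_confidence_set(theta_star, theta_hat, m_tilde, K2) and \
   in_confidence_set(theta_k, theta_hat, m_tilde, K2):
    dev = np.abs(theta_star - theta_k).sum(axis=-1)
    assert (dev <= 2 * K2 / np.sqrt(np.maximum(1, m_tilde))).all()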

Substituting this into (25) implies

R¯2\displaystyle\bar{R}_{2} 2H𝔼[k=1KTt=tktk+11(𝟙(θBk)+𝟙(θkBk))]+2H𝔼[k=1KTt=tktk+11K2max{1,m~tk(zt)}].\displaystyle\leq 2H\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\big{(}\mathbbm{1}(\theta_{*}\notin B_{k})+\mathbbm{1}(\theta_{k}\notin B_{k})\big{)}\Big{]}+2H\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\frac{K_{2}}{\sqrt{\max\{1,{\tilde{m}}_{t_{k}}(z_{t})\}}}\Big{]}. (26)

We need to bound these two terms separately.

Bounding the first term.

For the first term we can write:

𝔼[k=1KTt=tktk+11(𝟙(θBk)+𝟙(θkBk))]\displaystyle\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\big{(}\mathbbm{1}(\theta_{*}\notin B_{k})+\mathbbm{1}(\theta_{k}\notin B_{k})\big{)}\Big{]} =𝔼[k=1KTTk(𝟙(θBk)+𝟙(θkBk))]\displaystyle=\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}T_{k}\big{(}\mathbbm{1}(\theta_{*}\notin B_{k})+\mathbbm{1}(\theta_{k}\notin B_{k})\big{)}\Big{]}
T𝔼[k=1KT(𝟙(θBk)+𝟙(θkBk))]\displaystyle\leq T\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\big{(}\mathbbm{1}(\theta_{*}\notin B_{k})+\mathbbm{1}(\theta_{k}\notin B_{k})\big{)}\Big{]}
Tk=1T𝔼[(𝟙(θBk)+𝟙(θkBk))],\displaystyle\leq T\sum_{k=1}^{T}\mathbb{E}\Big{[}\big{(}\mathbbm{1}(\theta_{*}\notin B_{k})+\mathbbm{1}(\theta_{k}\notin B_{k})\big{)}\Big{]},

where the last inequality is by the fact that KTTK_{T}\leq T. Now, observe that since BkB_{k} is tk{\mathcal{F}}_{t_{k}}-measurable, Lemma 1 implies that 𝔼[𝟙(θkBk)]=𝔼[𝟙(θBk)]\mathbb{E}[\mathbbm{1}(\theta_{k}\notin B_{k})]=\mathbb{E}[\mathbbm{1}(\theta_{*}\notin B_{k})]. Moreover, by Assumption 4, 𝔼[𝟙(θBk)]=(θBk)δ\mathbb{E}[\mathbbm{1}(\theta_{*}\notin B_{k})]=\mathbb{P}(\theta_{*}\notin B_{k})\leq\delta. By choosing δ=14T2\delta=\frac{1}{4T^{2}}, we get

𝔼[k=1KTt=tktk+11(𝟙(θBk)+𝟙(θkBk))]12.\displaystyle\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\big{(}\mathbbm{1}(\theta_{*}\notin B_{k})+\mathbbm{1}(\theta_{k}\notin B_{k})\big{)}\Big{]}\leq\frac{1}{2}. (27)

Bounding the second term.

To bound the second term of (26), observe that by the second criterion of the algorithm in choosing the episode length, we have 2m~tk(zt)m~t(zt)2\tilde{m}_{t_{k}}(z_{t})\geq\tilde{m}_{t}(z_{t}). Thus,

𝔼[k=1KTt=tktk+11K2max{1,m~tk(zt)}]𝔼[t=1T2K2max{1,m~t(zt)}]\displaystyle\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\frac{K_{2}}{\sqrt{\max\{1,{\tilde{m}}_{t_{k}}(z_{t})\}}}\Big{]}\leq\mathbb{E}\Big{[}\sum_{t=1}^{T}\frac{\sqrt{2}K_{2}}{\sqrt{\max\{1,{\tilde{m}}_{t}(z_{t})\}}}\Big{]}
=t=1Tz𝔼[2K2𝟙(zt=z)max{1,m~t(z)}]\displaystyle=\sum_{t=1}^{T}\sum_{z}\mathbb{E}\Big{[}\frac{\sqrt{2}K_{2}\mathbbm{1}(z_{t}=z)}{\sqrt{\max\{1,{\tilde{m}}_{t}(z)\}}}\Big{]}
=t=1Tz𝔼[2K2𝟙(zt=z)max{1,m~t(z)}𝟙(m~t(z)αnt(z))]\displaystyle=\sum_{t=1}^{T}\sum_{z}\mathbb{E}\Big{[}\frac{\sqrt{2}K_{2}\mathbbm{1}(z_{t}=z)}{\sqrt{\max\{1,{\tilde{m}}_{t}(z)\}}}\mathbbm{1}\big{(}{\tilde{m}}_{t}(z)\geq\alpha n_{t}(z)\big{)}\Big{]}
+t=1Tz𝔼[2K2𝟙(zt=z)max{1,m~t(z)}𝟙(m~t(z)<αnt(z))]\displaystyle\qquad\qquad+\sum_{t=1}^{T}\sum_{z}\mathbb{E}\Big{[}\frac{\sqrt{2}K_{2}\mathbbm{1}(z_{t}=z)}{\sqrt{\max\{1,{\tilde{m}}_{t}(z)\}}}\mathbbm{1}\big{(}{\tilde{m}}_{t}(z)<\alpha n_{t}(z)\big{)}\Big{]}
t=1Tz𝔼[2K2𝟙(zt=z)max{1,αnt(z)}]+t=1Tz𝔼[2K2𝟙(m~t(z)<αnt(z))].\displaystyle\leq\sum_{t=1}^{T}\sum_{z}\mathbb{E}\Big{[}\frac{\sqrt{2}K_{2}\mathbbm{1}(z_{t}=z)}{\sqrt{\max\{1,\alpha{n}_{t}(z)\}}}\Big{]}+\sum_{t=1}^{T}\sum_{z}\mathbb{E}\Big{[}\sqrt{2}K_{2}\mathbbm{1}\big{(}{\tilde{m}}_{t}(z)<\alpha n_{t}(z)\big{)}\Big{]}. (28)

Lemma 2 implies that 𝔼[𝟙(m~t(z)<αnt(z))]=(m~t(z)<αnt(z))α\mathbb{E}\Big{[}\mathbbm{1}\big{(}{\tilde{m}}_{t}(z)<\alpha n_{t}(z)\big{)}\Big{]}=\mathbb{P}({\tilde{m}}_{t}(z)<\alpha n_{t}(z))\leq\alpha. Thus, the second term in (28) can be bounded by 2K2|𝒮||𝒜|Tα\sqrt{2}K_{2}|{\mathcal{S}}||{\mathcal{A}}|T\alpha. To bound the first term of (28), we can write:

t=1Tz𝔼[2K2𝟙(zt=z)max{1,αnt(z)}]\displaystyle\sum_{t=1}^{T}\sum_{z}\mathbb{E}\Big{[}\frac{\sqrt{2}K_{2}\mathbbm{1}(z_{t}=z)}{\sqrt{\max\{1,\alpha{n}_{t}(z)\}}}\Big{]}
2αK2𝔼[zt=1T𝟙(zt=z)max{1,nt(z)}].\displaystyle\leq\sqrt{\frac{2}{\alpha}}K_{2}\mathbb{E}\Big{[}\sum_{z}\sum_{t=1}^{T}\frac{\mathbbm{1}(z_{t}=z)}{\sqrt{\max\{1,{n}_{t}(z)\}}}\Big{]}.

Observe that whenever zt=zz_{t}=z, nt(z)n_{t}(z) increases by 1. Since nt(z)n_{t}(z) is the number of visits to zz by time t1t-1 (including t1t-1 and excluding tt), the denominator will be 1 for the first two times that zt=zz_{t}=z. Therefore, the term inside the expectation can be bounded by

zt=1T𝟙(zt=z)max{1,nt(z)}\displaystyle\sum_{z}\sum_{t=1}^{T}\frac{\mathbbm{1}(z_{t}=z)}{\sqrt{\max\{1,{n}_{t}(z)\}}} =z[𝟙(nT+1(z)>0)+j=1nT+1(z)11j]=\sum_{z}\Big{[}\mathbbm{1}(n_{T+1}(z)>0)+\sum_{j=1}^{n_{T+1}(z)-1}\frac{1}{\sqrt{j}}\Big{]}
z[𝟙(nT+1(z)>0)+2nT+1(z)]\displaystyle\leq\sum_{z}\Big{[}\mathbbm{1}(n_{T+1}(z)>0)+2\sqrt{n_{T+1}(z)}\Big{]}
3znT+1(z).\displaystyle\leq 3\sum_{z}\sqrt{n_{T+1}(z)}.

Since znT+1(z)=T\sum_{z}n_{T+1}(z)=T, the Cauchy-Schwarz inequality implies

3znT+1(z)3|𝒮||𝒜|znT+1(z)=3|𝒮||𝒜|T.\displaystyle 3\sum_{z}\sqrt{n_{T+1}(z)}\leq 3\sqrt{|{\mathcal{S}}||{\mathcal{A}}|\sum_{z}n_{T+1}(z)}=3\sqrt{|{\mathcal{S}}||{\mathcal{A}}|T}.

Therefore, the first term of (28) can be bounded by

t=1Tz𝔼[2K2𝟙(zt=z)max{1,αnt(z)}]3K22|𝒮||𝒜|Tα.\displaystyle\sum_{t=1}^{T}\sum_{z}\mathbb{E}\Big{[}\frac{\sqrt{2}K_{2}\mathbbm{1}(z_{t}=z)}{\sqrt{\max\{1,\alpha{n}_{t}(z)\}}}\Big{]}\leq 3K_{2}\sqrt{\frac{2|{\mathcal{S}}||{\mathcal{A}}|T}{\alpha}}.

Substituting this bound into (28) along with the bound on the second term of (28), we obtain

𝔼[k=1KTt=tktk+11K2max{1,m~tk(zt)}]3K22|𝒮||𝒜|Tα+2K2|𝒮||𝒜|Tα.\displaystyle\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1}\frac{K_{2}}{\sqrt{\max\{1,{\tilde{m}}_{t_{k}}(z_{t})\}}}\Big{]}\leq 3K_{2}\sqrt{\frac{2|{\mathcal{S}}||{\mathcal{A}}|T}{\alpha}}+\sqrt{2}K_{2}|{\mathcal{S}}||{\mathcal{A}}|T\alpha.

The choice α=(3/2)2/3(|𝒮||𝒜|T)1/3\alpha=(3/2)^{2/3}(|{\mathcal{S}}||{\mathcal{A}}|T)^{-1/3} minimizes this upper bound, and thus

𝔼[k=1KTt=tktk+11\displaystyle\mathbb{E}\Big{[}\sum_{k=1}^{K_{T}}\sum_{t=t_{k}}^{t_{k+1}-1} K2max{1,m~tk(zt)}]6K2(|𝒮||𝒜|T)2/3.\displaystyle\frac{K_{2}}{\sqrt{\max\{1,{\tilde{m}}_{t_{k}}(z_{t})\}}}\Big{]}\leq 6K_{2}(|{\mathcal{S}}||{\mathcal{A}}|T)^{2/3}. (29)

By substituting (27) and (29) into (26), we get

R¯2H+12HK2(|𝒮||𝒜|T)2/3.\displaystyle\bar{R}_{2}\leq H+12HK_{2}(|{\mathcal{S}}||{\mathcal{A}}|T)^{2/3}. ∎
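The pathwise inequality zt=1T𝟙(zt=z)/max{1,nt(z)}3|𝒮||𝒜|T\sum_{z}\sum_{t=1}^{T}\mathbbm{1}(z_{t}=z)/\sqrt{\max\{1,n_{t}(z)\}}\leq 3\sqrt{|{\mathcal{S}}||{\mathcal{A}}|T} used in the proof above can also be checked numerically; the following sketch is illustrative only (the trajectory is arbitrary, and n[z] plays the role of nt(z)n_{t}(z), the number of visits strictly before time tt).

import numpy as np

rng = np.random.default_rng(4)
S, A, T = 5, 3, 10_000
z_traj = rng.integers(0, S * A, size=T)       # arbitrary sequence of z_t = (s_t, a_t)

total, n = 0.0, np.zeros(S * A)               # n[z]: number of visits to z before time t
for z in z_traj:
    total += 1.0 / np.sqrt(max(1.0, n[z]))
    n[z] += 1

assert total <= 3.0 * np.sqrt(S * A * T)
print(total, 3.0 * np.sqrt(S * A * T))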

Lemma 5.

The following inequalities hold:

  1. The number of episodes KTK_{T} can be bounded as KT2T(1+|𝒮||𝒜|log(T+1))=𝒪~(|𝒮||𝒜|T)K_{T}\leq\sqrt{2T(1+|{\mathcal{S}}||{\mathcal{A}}|\log(T+1))}=\tilde{\mathcal{O}}(\sqrt{|{\mathcal{S}}||{\mathcal{A}}|T}).

  2. The following inequality holds: k=1KTTktk72T(1+|𝒮||𝒜|log(T+1))log2T=𝒪~(|𝒮||𝒜|T)\sum_{k=1}^{K_{T}}\frac{T_{k}}{\sqrt{t_{k}}}\leq 7\sqrt{2T}(1+|{\mathcal{S}}||{\mathcal{A}}|\log(T+1))\log\sqrt{2T}=\tilde{\mathcal{O}}(|{\mathcal{S}}||{\mathcal{A}}|\sqrt{T}).

Proof.

We first provide some intuition for why these results should be true. Note that the length of the episodes is determined by two criteria. The first criterion triggers when Tk=Tk1+1T_{k}=T_{k-1}+1 and the second criterion triggers when the pseudo count of some state-action pair doubles compared to the beginning of the episode. Intuitively, the second criterion triggers only logarithmically often, while the first criterion occurs much more frequently, so for a heuristic argument one can ignore the second criterion. Doing so gives Tk=𝒪(k)T_{k}=\mathcal{O}(k), KT=𝒪(T)K_{T}=\mathcal{O}(\sqrt{T}), and tk=𝒪(k2)t_{k}=\mathcal{O}(k^{2}), which implies k=1KTTktk=𝒪(KT)=𝒪(T)\sum_{k=1}^{K_{T}}\frac{T_{k}}{\sqrt{t_{k}}}=\mathcal{O}(K_{T})=\mathcal{O}(\sqrt{T}). This heuristic is checked numerically in the sketch below; the rigorous proof follows.
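A minimal simulation of the first criterion alone (Tk=Tk1+1T_{k}=T_{k-1}+1 with T0=1T_{0}=1 and t1=1t_{1}=1; the doubling criterion is ignored, an assumption made only for this illustration) confirms the heuristic orders above.

import math

def first_criterion_episodes(T):
    t_k, T_k = 1, 1                       # t_1 = 1 and T_0 = 1
    starts, lengths = [], []
    while t_k <= T:
        T_k += 1                          # first criterion: T_k = T_{k-1} + 1
        starts.append(t_k); lengths.append(T_k)
        t_k += T_k
    return starts, lengths

T = 100_000
starts, lengths = first_criterion_episodes(T)
K_T = len(starts)
print(K_T, math.sqrt(2 * T))                                        # K_T is about sqrt(2T)
print(sum(Tk / math.sqrt(tk) for Tk, tk in zip(lengths, starts)))   # grows like sqrt(T)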

1. Define macro episodes with start times tmit_{m_{i}} given by tm1=t1t_{m_{1}}=t_{1} and tmi:=t_{m_{i}}:=

min{tk>tmi1:m~tk(s,a)>2m~tk1(s,a) for some (s,a)}.\displaystyle\min\{t_{k}>t_{m_{i-1}}:\tilde{m}_{t_{k}}(s,a)>2\tilde{m}_{t_{k-1}}(s,a)\text{ for some }(s,a)\}.

Note that a new macro episode starts when the second criterion of episode length in Algorithm 1 triggers. Let MTM_{T} be the random variable denoting the number of macro episodes by time TT and define mMT+1=KT+1m_{M_{T}+1}=K_{T}+1.

Let T~i\tilde{T}_{i} denote the length of macro episode ii. Note that T~i=k=mimi+11Tk\tilde{T}_{i}=\sum_{k=m_{i}}^{m_{i+1}-1}T_{k}. Moreover, from the definition of macro episodes, we know that all the episodes in a macro episode except the last one are triggered by the first criterion, i.e., Tk=Tk1+1T_{k}=T_{k-1}+1 for all mikmi+12m_{i}\leq k\leq m_{i+1}-2. This implies that

T~i=k=mimi+11Tk=Tmi+11+j=1mi+1mi1(Tmi1+j)\displaystyle\tilde{T}_{i}=\sum_{k=m_{i}}^{m_{i+1}-1}T_{k}=T_{m_{i+1}-1}+\sum_{j=1}^{m_{i+1}-m_{i}-1}(T_{m_{i}-1}+j)
1+j=1mi+1mi1(1+j)=(mi+1mi)(mi+1mi+1)2.\displaystyle\geq 1+\sum_{j=1}^{m_{i+1}-m_{i}-1}(1+j)=\frac{(m_{i+1}-m_{i})(m_{i+1}-m_{i}+1)}{2}.

This implies that mi+1mi2T~im_{i+1}-m_{i}\leq\sqrt{2\tilde{T}_{i}}. Now, we can write:

KT=mMT+11=i=1MT(mi+1mi)\displaystyle K_{T}=m_{M_{T}+1}-1=\sum_{i=1}^{M_{T}}(m_{i+1}-m_{i})
i=1MT2T~i2MTiT~i=2MTT,\displaystyle\leq\sum_{i=1}^{M_{T}}\sqrt{2\tilde{T}_{i}}\leq\sqrt{2M_{T}\sum_{i}\tilde{T}_{i}}=\sqrt{2M_{T}T}, (30)

where the last inequality is by the Cauchy-Schwarz inequality.

Now, it suffices to show that MT1+|𝒮||𝒜|log(T+1)M_{T}\leq 1+|{\mathcal{S}}||{\mathcal{A}}|\log(T+1). Let 𝒯s,a{\mathcal{T}}_{s,a} be the start times at which the second criterion is triggered at state-action pair (s,a)(s,a), i.e.,

𝒯s,a:={tkT:m~tk(s,a)>2m~tk1(s,a)}.\displaystyle{\mathcal{T}}_{s,a}:=\{t_{k}\leq T:\tilde{m}_{t_{k}}(s,a)>2\tilde{m}_{t_{k-1}}(s,a)\}.

We claim that |𝒯s,a|log(m~T+1(s,a))|{\mathcal{T}}_{s,a}|\leq\log(\tilde{m}_{T+1}(s,a)). To prove this claim, assume by contradiction that |𝒯s,a|log(m~T+1(s,a))+1|{\mathcal{T}}_{s,a}|\geq\log(\tilde{m}_{T+1}(s,a))+1, then

m~tKT(s,a)tkT,m~tk1(s,a)1m~tk(s,a)m~tk1(s,a)\displaystyle\tilde{m}_{t_{K_{T}}}(s,a)\geq\prod_{t_{k}\leq T,\tilde{m}_{t_{k-1}}(s,a)\geq 1}\frac{\tilde{m}_{t_{k}}(s,a)}{\tilde{m}_{t_{k-1}}(s,a)}
tk𝒯s,a,m~tk1(s,a)1m~tk(s,a)m~tk1(s,a)\displaystyle\geq\prod_{t_{k}\in{\mathcal{T}}_{s,a},\tilde{m}_{t_{k-1}}(s,a)\geq 1}\frac{\tilde{m}_{t_{k}}(s,a)}{\tilde{m}_{t_{k-1}}(s,a)}
>tk𝒯s,a,m~tk1(s,a)12=2|𝒯s,a|1m~T+1(s,a),\displaystyle>\prod_{t_{k}\in{\mathcal{T}}_{s,a},\tilde{m}_{t_{k-1}}(s,a)\geq 1}2=2^{|{\mathcal{T}}_{s,a}|-1}\geq\tilde{m}_{T+1}(s,a),

which is a contradiction. The second inequality is by the fact that m~t(s,a)\tilde{m}_{t}(s,a) is non-decreasing, and the third inequality is by the definition of 𝒯s,a{\mathcal{T}}_{s,a}. Therefore,

MT1+s,a|𝒯s,a|1+s,alog(m~T+1(s,a))\displaystyle M_{T}\leq 1+\sum_{s,a}|{\mathcal{T}}_{s,a}|\leq 1+\sum_{s,a}\log(\tilde{m}_{T+1}(s,a))
1+|𝒮||𝒜|log(s,am~T+1(s,a)/|𝒮||𝒜|)\displaystyle\leq 1+|{\mathcal{S}}||{\mathcal{A}}|\log(\sum_{s,a}\tilde{m}_{T+1}(s,a)/|{\mathcal{S}}||{\mathcal{A}}|)
1+|𝒮||𝒜|log(T+1),\displaystyle\leq 1+|{\mathcal{S}}||{\mathcal{A}}|\log(T+1), (31)

where the third inequality is due to the concavity of log\log and the last inequality is by the fact that m~T+1(s,a)T+1\tilde{m}_{T+1}(s,a)\leq T+1.
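The counting argument for |𝒯s,a||{\mathcal{T}}_{s,a}| is the standard doubling trick; the sketch below (illustrative; the pseudo-count trajectory is made up) extracts the doubling times from a non-decreasing count sequence and checks that there are at most log base 2 of the final count of them.

import math

def doubling_times(counts):
    # indices k with counts[k] > 2 * counts[k-1], for a non-decreasing positive sequence
    return [k for k in range(1, len(counts)) if counts[k] > 2 * counts[k - 1]]

m_tilde_at_starts = [1, 2, 3, 7, 8, 15, 16, 33]     # illustrative pseudo-counts at episode starts
T_sa = doubling_times(m_tilde_at_starts)
assert len(T_sa) <= math.log2(m_tilde_at_starts[-1])
print(T_sa, math.log2(m_tilde_at_starts[-1]))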

2. First, we claim that Tk2TT_{k}\leq\sqrt{2T} for all kKTk\leq K_{T}. To see this, assume by contradiction that Tk>2TT_{k^{*}}>\sqrt{2T} for some kKTk^{*}\leq K_{T}. By the first stopping criterion, we can conclude that Tk1>2T1T_{k^{*}-1}>\sqrt{2T}-1, Tk2>2T2T_{k^{*}-2}>\sqrt{2T}-2, …, T1>max{2Tk+1,0}T_{1}>\max\{\sqrt{2T}-k^{*}+1,0\} since the episode length can increase at most by one compared to the previous one. Note that k2T1k^{*}\geq\sqrt{2T}-1, because otherwise T1>2T_{1}>2 which is not feasible since T1T0+1=2T_{1}\leq T_{0}+1=2. Thus, k=1kTk>0.52T(2T+1)>T\sum_{k=1}^{k^{*}}T_{k}>0.5\sqrt{2T}(\sqrt{2T}+1)>T which is a contradiction.

We now proceed to lower bound tkt_{k}. By the definition of macro episodes in part (1), during a macro episode the lengths of all episodes except the last one are determined by the first criterion, i.e., for macro episode ii, one can write Tk=Tk1+1T_{k}=T_{k-1}+1 for mikmi+12m_{i}\leq k\leq m_{i+1}-2. Hence, for mikmi+12m_{i}\leq k\leq m_{i+1}-2,

tk+1\displaystyle t_{k+1} =tk+Tk=tk+Tmi1+k(mi1)\displaystyle=t_{k}+T_{k}=t_{k}+T_{m_{i}-1}+k-(m_{i}-1)
tk+kmi+1.\displaystyle\geq t_{k}+k-m_{i}+1.

Recursive substitution of tkt_{k} implies that tktmi+0.5(kmi)(kmi+1)t_{k}\geq t_{m_{i}}+0.5(k-m_{i})(k-m_{i}+1) for mikmi+11m_{i}\leq k\leq m_{i+1}-1. Thus,

k=1KTTktk2Ti=1MTk=mimi+111tk\displaystyle\sum_{k=1}^{K_{T}}\frac{T_{k}}{\sqrt{t_{k}}}\leq\sqrt{2T}\sum_{i=1}^{M_{T}}\sum_{k=m_{i}}^{m_{i+1}-1}\frac{1}{\sqrt{t_{k}}}
2Ti=1MTk=mimi+111tmi+0.5(kmi)(kmi+1).\displaystyle\leq\sqrt{2T}\sum_{i=1}^{M_{T}}\sum_{k=m_{i}}^{m_{i+1}-1}\frac{1}{\sqrt{t_{m_{i}}+0.5(k-m_{i})(k-m_{i}+1)}}. (32)

The denominator of the summands at k=mik=m_{i} is equal to tmi\sqrt{t_{m_{i}}}. For other values of kk, the term under the square root can be lower bounded by 0.5(kmi)20.5(k-m_{i})^{2}. Thus,

i=1MTk=mimi+111tmi+0.5(kmi)(kmi+1)\displaystyle\sum_{i=1}^{M_{T}}\sum_{k=m_{i}}^{m_{i+1}-1}\frac{1}{\sqrt{t_{m_{i}}+0.5(k-m_{i})(k-m_{i}+1)}}
i=1MT1tmi+i=1MTk=mi+1mi+112kmi\displaystyle\leq\sum_{i=1}^{M_{T}}\frac{1}{\sqrt{t_{m_{i}}}}+\sum_{i=1}^{M_{T}}\sum_{k=m_{i}+1}^{m_{i+1}-1}\frac{\sqrt{2}}{k-m_{i}}
MT+i=1MTj=1mi+1mi12j\displaystyle\leq M_{T}+\sum_{i=1}^{M_{T}}\sum_{j=1}^{m_{i+1}-m_{i}-1}\frac{\sqrt{2}}{j}
MT+2(MT+i=1MTlog(mi+1mi))\displaystyle\leq M_{T}+\sqrt{2}(M_{T}+\sum_{i=1}^{M_{T}}\log(m_{i+1}-m_{i}))
MT(1+2)+2MTlog(1MTi=1MT(mi+1mi))\displaystyle\leq M_{T}(1+\sqrt{2})+\sqrt{2}M_{T}\log(\frac{1}{M_{T}}\sum_{i=1}^{M_{T}}(m_{i+1}-m_{i}))
MT(1+2)+2MTlog2T\displaystyle\leq M_{T}(1+\sqrt{2})+\sqrt{2}M_{T}\log\sqrt{2T}
7MTlog2T,\displaystyle\leq 7M_{T}\log\sqrt{2T},

where the second inequality is by tmi1t_{m_{i}}\geq 1, the third inequality is by the fact that j=1K1/j1+1Kdx/x=1+logK\sum_{j=1}^{K}1/j\leq 1+\int_{1}^{K}dx/x=1+\log K, the fourth inequality is by concavity of log\log and the fifth inequality is by the fact that i=1MT(mi+1mi)=mMT+11=KT\sum_{i=1}^{M_{T}}(m_{i+1}-m_{i})=m_{M_{T}+1}-1=K_{T} and KT/MT2T/MT2TK_{T}/M_{T}\leq\sqrt{2T/M_{T}}\leq\sqrt{2T} (see (30)). Substituting this bound into (32) and using the upper bound (31) on MTM_{T}, we can write

k=1KTTktk\displaystyle\sum_{k=1}^{K_{T}}\frac{T_{k}}{\sqrt{t_{k}}} 2T(7MTlog2T)\displaystyle\leq\sqrt{2T}\Big{(}7M_{T}\log\sqrt{2T}\Big{)}
72T(1+|𝒮||𝒜|log(T+1))log2T.\displaystyle\leq 7\sqrt{2T}(1+|{\mathcal{S}}||{\mathcal{A}}|\log(T+1))\log\sqrt{2T}. ∎

Appendix D Other Proofs

D.1 Proof of Lemma 1

Lemma (restatement of Lemma 1). Suppose Assumption 1 holds. Then, the policy π(,θ):Δ𝒮𝒜\pi^{*}(\cdot,\theta):\Delta_{\mathcal{S}}\to{\mathcal{A}} given by

π(b;θ):=argmina𝒜{c(b,a)+oOP(o|b,a;θ)v(b;θ)}\pi^{*}(b;\theta):=\operatorname*{argmin}_{a\in{\mathcal{A}}}\{c(b,a)+\sum_{o\in O}P(o|b,a;\theta)v(b^{\prime};\theta)\} (33)

where b=τ(b,a,o;θ)b^{\prime}=\tau(b,a,o;\theta), is the optimal policy with Jπ(h;θ)=J(θ)J_{\pi^{*}}(h;\theta)=J(\theta) for all hΔ𝒮h\in\Delta_{\mathcal{S}}.

Proof.

We prove that for any policy π\pi, Jπ(h,θ)Jπ(h,θ)=J(θ)J_{\pi}(h,\theta)\geq J_{\pi^{*}}(h,\theta)=J(\theta) for all hΔ𝒮h\in\Delta_{\mathcal{S}}. Let π:Δ𝒮𝒜\pi:\Delta_{\mathcal{S}}\to{\mathcal{A}} be an arbitrary policy. We can write

Jπ(h,θ)=lim supT1Tt=1T𝔼[C(st,π(ht))|s1h]\displaystyle J_{\pi}(h,\theta)=\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[C(s_{t},\pi(h_{t}))|s_{1}\sim h]
=lim supT1Tt=1T𝔼[𝔼[C(st,π(ht))|t,s1h]|s1h]\displaystyle=\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\Big{[}\mathbb{E}[C(s_{t},\pi(h_{t}))|{\mathcal{F}}_{t},s_{1}\sim h]\big{|}s_{1}\sim h\Big{]}
=lim supT1Tt=1T𝔼[c(ht,π(ht))|s1h]\displaystyle=\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[c(h_{t},\pi(h_{t}))|s_{1}\sim h]
lim supT1Tt=1T𝔼[J(θ)+v(ht,θ)v(ht+1,θ)|s1h]\displaystyle\geq\limsup_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}[J(\theta)+v(h_{t},\theta)-v(h_{t+1},\theta)|s_{1}\sim h]
=J(θ),\displaystyle=J(\theta),

with equality attained by π\pi^{*} completing the proof. ∎
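The policy in (33) is a one-step greedy rule on the belief simplex. The sketch below is an illustrative implementation only: the belief update is the standard Bayes rule consistent with the notation for τ\tau used above, vv is passed in as an arbitrary function, and a crude myopic surrogate for vv is used in the toy usage, since computing the true relative value function is exactly the hard part.

import numpy as np

def belief_update(b, a, o, theta, eta):
    # tau(b, a, o; theta): Bayes update of the belief after taking a and observing o
    pred = b @ theta[:, a, :]                  # P(s' | b, a; theta)
    post = eta[:, o] * pred
    return post / post.sum()

def greedy_policy(b, theta, eta, C, v):
    # pi*(b; theta) = argmin_a { c(b,a) + sum_o P(o|b,a;theta) v(tau(b,a,o;theta)) }
    A, O = theta.shape[1], eta.shape[1]
    best_a, best_q = None, np.inf
    for a in range(A):
        c_ba = b @ C[:, a]                     # c(b, a) = sum_s C(s, a) b(s)
        p_o = (b @ theta[:, a, :]) @ eta       # P(o | b, a; theta)
        q = c_ba + sum(p_o[o] * v(belief_update(b, a, o, theta, eta))
                       for o in range(O) if p_o[o] > 0)
        if q < best_q:
            best_a, best_q = a, q
    return best_a

# toy usage with random kernels and a crude stand-in for v (illustrative only)
rng = np.random.default_rng(5)
S, A, O = 4, 3, 5
theta = rng.dirichlet(np.ones(S), size=(S, A))
eta = rng.dirichlet(np.ones(O), size=S)
C = rng.random((S, A))
v = lambda b: float(b @ C.min(axis=1))
print(greedy_policy(rng.dirichlet(np.ones(S)), theta, eta, C, v))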