
Near-optimal Regret Bounds for Stochastic Shortest Path

Alon Cohen Google Research, Tel Aviv; [email protected].    Haim Kaplan Tel-Aviv University and Google Research, Tel Aviv; [email protected].    Yishay Mansour Tel-Aviv University and Google Research, Tel Aviv; Supported in part by a grant from the ISF. [email protected].    Aviv Rosenberg Tel-Aviv University; [email protected].
Abstract

Stochastic shortest path (SSP) is a well-known problem in planning and control, in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent is unaware of the environment dynamics (i.e., the transition function) and has to repeatedly play for a given number of episodes while learning the problem’s optimal solution. Unlike other well-studied models in reinforcement learning (RL), the length of an episode is not predetermined (or bounded) and is influenced by the agent’s actions. Recently, Tarbouriech et al. (2019) studied this problem in the context of regret minimization, and provided an algorithm whose regret bound is inversely proportional to the square root of the minimum instantaneous cost. In this work we remove this dependence on the minimum cost: we give an algorithm that guarantees a regret bound of \widetilde{O}(B_{\star}|S|\sqrt{|A|K}), where B_{\star} is an upper bound on the expected cost of the optimal policy, S is the set of states, A is the set of actions and K is the number of episodes. We additionally show that any learning algorithm must have at least \Omega(B_{\star}\sqrt{|S||A|K}) regret in the worst case.

1 Introduction

Stochastic shortest path (SSP) is one of the most basic models in reinforcement learning (RL). It includes the discounted return model and the finite-horizon model as special cases. In SSP the goal of the agent is to reach a predefined goal state in minimum expected cost. This setting captures a wide variety of realistic scenarios, such as car navigation, game playing and drone flying; i.e., tasks carried out in episodes that eventually terminate.

The focus of this work is on regret minimization in SSP. It builds on extensive literature on theoretical aspects of online RL, and in particular on the copious works about regret minimization in either the average-cost model or the finite-horizon model. A major contribution to this literature is the UCRL2 algorithm (Jaksch et al., 2010), which gives a general framework to achieve optimism in the face of uncertainty for these settings. The main methodology is to define a confidence set that includes the true model parameters with high probability. The algorithm periodically computes an optimistic policy that minimizes the overall expected cost simultaneously over all policies and over all parameters within the confidence set, and proceeds to play this policy.

The only regret minimization algorithm specifically designed for SSP is that of Tarbouriech et al. (2019), which assumes that all costs are bounded away from zero (i.e., there is a c_{\text{min}}>0 such that all costs are in the range [c_{\text{min}},1]). They show a regret bound that scales as O(D^{3/2}|S|\sqrt{|A|K/c_{\text{min}}}), where D is the minimum expected time of reaching the goal state from any state, S is the set of states, A is the set of actions and K is the number of episodes. In addition, they show that the algorithm’s regret is \widetilde{O}(K^{2/3}) when the costs are arbitrary (namely, may be zero).

Here we improve upon the work of Tarbouriech et al. (2019) in several important aspects. First, we remove the dependency on c_{\text{min}}^{-1} and allow for zero costs while maintaining regret of \widetilde{O}(\sqrt{K}). Second, we give a much simpler algorithm in which the computation of the optimistic policy has a simple solution. Our main regret term is \widetilde{O}(B_{\star}|S|\sqrt{|A|K}), where B_{\star} is an upper bound on the expected cost of the optimal policy (note that B_{\star}\leq D). We show that this is almost optimal by giving a lower bound of \Omega(B_{\star}\sqrt{|S||A|K}).

Our technical contribution is as follows. We start by assuming that the costs are lower bounded by c_{\text{min}} and give an algorithm that is simple to analyze and achieves a regret bound of \widetilde{O}(B_{\star}^{3/2}|S|\sqrt{|A|K/c_{\text{min}}}). Note that this bound is comparable to the one of Tarbouriech et al. (2019), yet our algorithm and its analysis are significantly simpler and more intuitive. We subsequently improve our algorithm by utilizing better confidence sets based on the Bernstein concentration inequality (Azar et al., 2017). This algorithm is even simpler than our first one, mainly since picking the parameters of the optimistic model is particularly easy. The analysis, however, is somewhat more delicate. We achieve our final bound by perturbing the instantaneous costs to be at least \epsilon>0. The additional cost due to this perturbation has a small effect since the dependency of our regret on c_{\text{min}}^{-1} is additive and does not multiply any term depending on K.

1.1 Related work.

Early work by Bertsekas and Tsitsiklis (1991) studied the problem of planning in SSPs, that is, computing the optimal strategy efficiently in a known SSP instance. They established that, under certain assumptions, the optimal strategy is a deterministic stationary policy (a mapping from states to actions) and can be computed efficiently using standard planning algorithms, e.g., Value Iteration or Policy Iteration.

The extensive literature about regret minimization in RL focuses on the average-cost infinite-horizon model (Bartlett and Tewari, 2009; Jaksch et al., 2010) and on the finite-horizon model (Osband et al., 2016; Azar et al., 2017; Dann et al., 2017; Zanette and Brunskill, 2019). These recent works give algorithms with near-optimal regret bounds using Bernstein-type concentration bounds.

Another related model is that of loop-free SSP with adversarial costs (Neu et al., 2010, 2012; Zimin and Neu, 2013; Rosenberg and Mansour, 2019a, b). This model eliminates the challenge of avoiding policies that never terminate, but the adversarial costs pose a different, unrelated, challenge.

2 Preliminaries and Main Results

An instance of the SSP problem is a Markov decision process (MDP) M=(S,A,P,c,s_{\text{init}}) where S is the state space and A is the action space. The agent begins at the initial state s_{\text{init}}, and ends her interaction with M by arriving at the goal state g (where g\not\in S). Whenever she plays action a in state s, she pays a cost c(s,a)\in[0,1] and the next state s^{\prime}\in S is chosen with probability P(s^{\prime}\mid s,a). Note that to simplify the presentation we avoid addressing the goal state g explicitly: we assume that the probability of reaching the goal state by playing action a at state s is 1-\sum_{s^{\prime}\in S}P(s^{\prime}\mid s,a).
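
To make this convention concrete, here is a minimal sketch of a tabular SSP instance in Python (the class and variable names are illustrative, not from the paper); the goal-transition probability is kept implicit as the missing mass of each row of P.

import numpy as np

# Minimal tabular SSP sketch: P[s, a, s'] holds transition probabilities among
# the non-goal states only, so rows may sum to less than 1; the remaining mass
# is the probability of moving directly to the goal state g.
class TabularSSP:
    def __init__(self, P, c, s_init):
        self.P = np.asarray(P, dtype=float)   # shape (|S|, |A|, |S|)
        self.c = np.asarray(c, dtype=float)   # shape (|S|, |A|), costs in [0, 1]
        self.s_init = s_init

    def goal_prob(self, s, a):
        # implicit probability of reaching the goal from (s, a)
        return 1.0 - self.P[s, a].sum()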

We now review planning in a known SSP instance. Under certain assumptions that we shall briefly discuss, the optimal behaviour of the agent, i.e., the policy that minimizes the expected total cost of reaching the goal state from any state, is a stationary, deterministic and proper policy. A stationary and deterministic policy \pi:S\mapsto A is a mapping that selects action \pi(s) whenever the agent is at state s. A proper policy is defined as follows.

Definition 1 (Proper and Improper Policies).

A policy \pi is proper if playing \pi reaches the goal state with probability 1 when starting from any state. A policy is improper if it is not proper.

Any policy \pi induces a cost-to-go function J^{\pi}:S\mapsto[0,\infty] defined as J^{\pi}(s)=\lim_{T\rightarrow\infty}\mathbb{E}_{\pi}\bigl[\sum_{t=1}^{T}c(s_{t},a_{t})\mid s_{1}=s\bigr], where the expectation is taken w.r.t the random sequence of states generated by playing according to \pi when the initial state is s. For a proper policy \pi, since the number of states |S| is finite, it follows that J^{\pi}(s) is finite for all s\in S. However, note that J^{\pi}(s) may be finite even if \pi is improper. We additionally denote by T^{\pi}(s) the expected time it takes for \pi to reach g starting at s; in particular, if \pi is proper then T^{\pi}(s) is finite for all s, and if \pi is improper there must exist some s such that T^{\pi}(s)=\infty. In this work we assume the following about the SSP model.

Assumption 1.

There exists at least one proper policy.

Under Assumption 1, we have the following important properties of proper policies. In particular, the first result shows that a policy is proper if and only if its cost-to-go function satisfies the Bellman equations. The second result proves that a policy is optimal if and only if it satisfies the Bellman optimality criterion. Note that both results assume that every improper policy has infinite cost-to-go from some state.

Lemma 2.1 (Bertsekas and Tsitsiklis, 1991, Lemma 1).

Suppose that Assumption 1 holds and that for every improper policy \pi^{\prime} there exists at least one state s\in S such that J^{\pi^{\prime}}(s)=\infty. Let \pi be any policy, then

  1. (i) If there exists some J:S\mapsto\mathbb{R} such that J(s)\geq c\bigl(s,\pi(s)\bigr)+\sum_{s^{\prime}\in S}P\bigl(s^{\prime}\mid s,\pi(s)\bigr)J(s^{\prime}) for all s\in S, then \pi is proper. Moreover, it holds that J^{\pi}(s)\leq J(s) for all s\in S.

  2. (ii) If \pi is proper then J^{\pi} is the unique solution to the equations J^{\pi}(s)=c\bigl(s,\pi(s)\bigr)+\sum_{s^{\prime}\in S}P\bigl(s^{\prime}\mid s,\pi(s)\bigr)J^{\pi}(s^{\prime}) for all s\in S.

Lemma 2.2 (Bertsekas and Tsitsiklis, 1991, Proposition 2).

Under the conditions of Lemma 2.1, the optimal policy \pi^{\star} is stationary, deterministic, and proper. Moreover, a policy \pi is optimal if and only if it satisfies the Bellman optimality equations for all s\in S:

J^{\pi}(s) = \min_{a\in A}\; c(s,a)+\sum_{s^{\prime}\in S}P(s^{\prime}\mid s,a)J^{\pi}(s^{\prime}),   (1)
\pi(s) \in \operatorname*{arg\,min}_{a\in A}\; c(s,a)+\sum_{s^{\prime}\in S}P(s^{\prime}\mid s,a)J^{\pi}(s^{\prime}).
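
Since Eq. (1) is a fixed-point characterization, the optimal cost-to-go can be approximated by value iteration. Below is a minimal, hedged sketch in Python assuming a tabular model with strictly positive costs (so that Lemma 2.2 applies); the tolerance and iteration cap are illustrative choices, not part of the paper.

import numpy as np

def ssp_value_iteration(P, c, tol=1e-8, max_iter=100_000):
    # P: (|S|, |A|, |S|) transitions among non-goal states (goal mass implicit).
    # c: (|S|, |A|) instantaneous costs, assumed strictly positive.
    num_states = P.shape[0]
    J = np.zeros(num_states)
    for _ in range(max_iter):
        Q = c + P @ J                 # Q[s, a] = c(s, a) + sum_{s'} P(s'|s, a) J(s')
        J_new = Q.min(axis=1)         # Bellman optimality update of Eq. (1)
        if np.max(np.abs(J_new - J)) < tol:
            J = J_new
            break
        J = J_new
    pi = (c + P @ J).argmin(axis=1)   # greedy (stationary, deterministic) policy
    return J, pi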

In this work we are not interested in approximating the optimal policy overall, but rather the best proper policy. In this case the second requirement in the lemmas above, that for every improper policy \pi there exists some state s\in S such that J^{\pi}(s)=\infty, can be circumvented in the following way (Bertsekas and Yu, 2013). First, note that this requirement is trivially satisfied when all instantaneous costs are strictly positive. Then, one can perturb the instantaneous costs by adding a small positive cost \epsilon\in[0,1], i.e., the new cost function is c_{\epsilon}(s,a)=\max\{c(s,a),\epsilon\}. After this perturbation, all proper policies remain proper, and every improper policy has infinite cost-to-go from some state (as all costs are positive). In the modified MDP, we apply Lemma 2.2 and obtain an optimal policy \pi^{\star}_{\epsilon} that is stationary, deterministic and proper and has a cost-to-go function J^{\star}_{\epsilon}. Taking the limit as \epsilon\rightarrow 0, we have that \pi^{\star}_{\epsilon}\rightarrow\pi^{\star} and J^{\star}_{\epsilon}\rightarrow J^{\pi^{\star}}, where \pi^{\star} is the optimal proper policy in the original model that is also stationary and deterministic, and J^{\pi^{\star}} denotes its cost-to-go function. We use this observation to obtain Corollaries 2.5 and 2.6 below that only require Assumption 1 to hold.

Learning formulation.

We assume that the costs are deterministic and known to the learner, and the transition probabilities P are fixed but unknown to the learner. The learner interacts with the model in episodes: each episode starts at the initial state s_{\text{init}}, and ends when the learner reaches the goal state g (note that she might never reach the goal state). Success is measured by the learner’s regret over K such episodes, that is, the difference between her total cost over the K episodes and the total expected cost of the optimal proper policy:

R_{K}=\sum_{k=1}^{K}\sum_{i=1}^{I^{k}}c(s^{k}_{i},a^{k}_{i})-K\cdot\min_{\pi\in\Pi_{\text{proper}}}J^{\pi}(s_{\text{init}}),

where I^{k} is the time it takes the learner to complete episode k (which may be infinite), \Pi_{\text{proper}} is the set of all stationary, deterministic and proper policies (which is not empty by Assumption 1), and (s^{k}_{i},a^{k}_{i}) is the i-th state-action pair at episode k. In the case that I^{k} is infinite for some k, we define R_{K}=\infty.

We denote the optimal proper policy by \pi^{\star}, i.e., J^{\pi^{\star}}(s)=\min_{\pi\in\Pi_{\text{proper}}}J^{\pi}(s) for all s\in S. Moreover, let B_{\star}>0 be an upper bound on the values of J^{\pi^{\star}} and let T_{\star}>0 be an upper bound on the times T^{\pi^{\star}}, i.e., B_{\star}\geq\max_{s\in S}J^{\pi^{\star}}(s) and T_{\star}\geq\max_{s\in S}T^{\pi^{\star}}(s).

2.1 Summary of our results

In Section 3 we present our Hoeffding-based algorithms (Algorithms 1 and 3) and their analysis. In Section 4 we show our Bernstein-based algorithm (Algorithm 2) for which we prove improved regret bounds. In addition, we give a lower bound on the learner’s regret showing that Algorithm 2 is near-optimal (see Appendix C).

The learner must reach the goal state, otherwise she has infinite regret. Therefore, she has to trade off two objectives: one is to reach the goal state and the other is to minimize the cost. Under the following assumption, the two objectives essentially coincide.

Assumption 2.

All costs are positive, i.e., there exists c_{\text{min}}>0 such that c(s,a)\geq c_{\text{min}} for every (s,a)\in S\times A.

This assumption allows us to upper bound the running time of the algorithm by its total cost up to a factor of c_{\text{min}}^{-1}. In particular, it guarantees that any policy that does not reach the goal state has infinite cost, so any bounded-regret algorithm has to reach the goal state. We eventually relax Assumption 2 by a technique similar to that of Bertsekas and Yu (2013). We add a small positive perturbation to the instantaneous costs and run our algorithms on the model with the perturbed costs. This provides a regret bound that scales with the expected running time of the optimal policy.

We now summarize our results. For ease of comparison, we first present our regret bounds for both the Hoeffding- and Bernstein-based algorithms for when Assumption 2 holds, and subsequently show the regret bounds of both algorithms for the general case. In order to simplify the presentation of our results, we assume that |S|\geq 2, |A|\geq 2 and K\geq|S|^{2}|A| throughout. In addition, we denote L=\log(KB_{\star}|S||A|/\delta c_{\text{min}}). Complete proofs of all statements can be found in the supplementary material.

Positive costs.

The following results hold when Assumption 2 holds (recall that we always assume Assumption 1). In particular, when this assumption holds the optimal policy overall is proper (Lemma 2.2), hence the regret bounds below are with respect to the best overall policy.

Theorem 2.3.

Suppose that Assumption 2 holds. With probability at least 1-\delta the regret of Algorithm 3 is bounded as follows:

R_{K} = O\Biggl(\sqrt{\frac{B_{\star}^{3}|S|^{2}|A|K}{c_{\text{min}}}}\,L+\frac{B_{\star}^{3}|S|^{2}|A|}{c_{\text{min}}^{2}}L^{2}\Biggr).

The main issue with the regret bound in Theorem 2.3 is that it scales with \sqrt{K/c_{\text{min}}}, which cannot be avoided regardless of how large K is with respect to c_{\text{min}}^{-1}. This problem is alleviated in Algorithm 2, which uses the tighter Bernstein-based confidence bounds.

Theorem 2.4.

Suppose that Assumption 2 holds. With probability at least 1-\delta the regret of Algorithm 2 is bounded as follows:

R_{K} = O\Biggl(B_{\star}|S|\sqrt{|A|K}\,L+\sqrt{\frac{B_{\star}^{3}|S|^{4}|A|^{2}}{c_{\text{min}}}}\,L^{2}\Biggr).

Note that when K\gg B_{\star}|S|^{2}|A|/c_{\text{min}}, the regret bound above scales as \widetilde{O}(B_{\star}|S|\sqrt{|A|K}), thus obtaining a near-optimal rate.

Arbitrary costs. Recall that in this case we can no longer assume that the optimal policy is proper. Therefore, the regret bounds below are in comparison to the best proper policy. Assumption 2 can be easily removed by adding a small fixed cost to all state-action pairs. Following the perturbation of the costs, we obtain the regret bounds of Theorems 2.3 and 2.4 with c_{\text{min}}\leftarrow\epsilon and B_{\star}\leftarrow B_{\star}+\epsilon T_{\star}, and the learner also suffers an additional cost of \epsilon T_{\star}K due to the misspecification of the model caused by the perturbation. By picking \epsilon to balance these terms we get the following corollaries (letting \widetilde{L}=\log(KB_{\star}T_{\star}|S||A|/\delta)).

Corollary 2.5.

Running Algorithm 3 using costs c_{\epsilon}(s,a)=\max\{c(s,a),\epsilon\} for \epsilon=(|S|^{2}|A|/K)^{1/3} gives the following regret bound with probability at least 1-\delta:

R_{K} = O\Bigl(T_{\star}^{3}|S|^{2/3}|A|^{1/3}K^{2/3}\widetilde{L}+T_{\star}^{3}|S|^{2}|A|\widetilde{L}^{2}\Bigr).
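
To see where the choice of \epsilon in Corollary 2.5 comes from, here is a rough balancing sketch (constants, logarithmic factors and the dependence on B_{\star}+\epsilon T_{\star} are suppressed): applying Theorem 2.3 with c_{\text{min}}\leftarrow\epsilon and adding the misspecification cost gives two competing terms,

R_{K} \lesssim \sqrt{\frac{|S|^{2}|A|K}{\epsilon}} + \epsilon T_{\star}K.

Choosing \epsilon=(|S|^{2}|A|/K)^{1/3} equalizes their dependence on K, |S| and |A|, and both terms become of order |S|^{2/3}|A|^{1/3}K^{2/3}, which is the rate appearing in Corollary 2.5.
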
Corollary 2.6.

Running Algorithm 2 using costs c_{\epsilon}(s,a)=\max\{c(s,a),\epsilon\} for \epsilon=|S|^{2}|A|/K gives the following regret bound with probability at least 1-\delta:

R_{K} = O\Bigl(B_{\star}^{3/2}|S|\sqrt{|A|K}\,\widetilde{L}+T_{\star}^{3/2}|S|^{2}|A|\widetilde{L}^{2}\Bigr).

Moreover, when the algorithm knows B_{\star} and K\gg|S|^{2}|A|T_{\star}^{2}, then choosing \epsilon=B_{\star}|S|^{2}|A|/K gets a near-optimal regret bound of \widetilde{O}(B_{\star}|S|\sqrt{|A|K}).

Lower bound. In Appendix C we show that Corollary 2.6 is nearly-tight using the following theorem.

Theorem 2.7.

There exists an SSP problem instance M=(S,A,P,c,s_{\text{init}}) in which J^{\pi^{\star}}(s)\leq B_{\star} for all s\in S, |S|\geq 2, |A|\geq 16, B_{\star}\geq 2, K\geq|S||A|, and c(s,a)=1 for all s\in S,a\in A, such that the expected regret of any learner after K episodes satisfies

\mathbb{E}[R_{K}]\geq\frac{1}{1024}B_{\star}\sqrt{|S||A|K}.

3 Hoeffding-type Confidence Bounds

We start with a simpler case in which B_{\star} is known to the learner. In Section 3.2 we relax this assumption with a penalty of an additional log-factor in the regret bound. For now, we prove the following bound on the learner’s regret.

Theorem 3.1.

Suppose that Assumption 2 holds. With probability at least 1-\delta the regret of Algorithm 1 is bounded as follows:

R_{K} = O\Biggl(\sqrt{\frac{B_{\star}^{3}|S|^{2}|A|K}{c_{\text{min}}}}\,L+\frac{B_{\star}^{3}|S|^{2}|A|}{c_{\text{min}}^{2}}L^{3/2}\Biggr).

Our algorithm follows the known concept of optimism in the face of uncertainty. That is, it maintains confidence sets that contain the true transition function with high probability and picks an optimistic optimal policy: a policy that minimizes the expected cost over all policies and all transition functions in the current confidence set. The computation of the optimistic optimal policy can be done efficiently, as shown by Tarbouriech et al. (2019). Construct an augmented MDP whose states are S and whose action set consists of tuples (a,\widetilde{P}), where a\in A and \widetilde{P} is any transition function such that

\bigl\|\widetilde{P}(\cdot\mid s,a)-\bar{P}(\cdot\mid s,a)\bigr\|_{1}\leq 5\sqrt{\frac{|S|\log(|S||A|N_{+}(s,a)/\delta)}{N_{+}(s,a)}}   (2)

where \bar{P} is the empirical estimate of P. It can be shown that the optimistic policy and the optimistic model, i.e., those that minimize the expected total cost over all policies and feasible transition functions, correspond to the optimal policy of the augmented MDP.
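
As a concrete illustration, the confidence set of Eq. (2) is determined entirely by the visit counters. A minimal sketch in Python (names are illustrative; counts[s, a, s'] plays the role of N(s, a, s')):

import numpy as np

def empirical_P_and_radius(counts, delta):
    # counts: (|S|, |A|, |S|) visit counters N(s, a, s'); delta: confidence parameter.
    S, A = counts.shape[0], counts.shape[1]
    N = counts.sum(axis=2)                        # N(s, a)
    N_plus = np.maximum(N, 1)                     # N_+(s, a) = max{N(s, a), 1}
    P_bar = counts / N_plus[:, :, None]           # empirical transition estimate
    radius = 5.0 * np.sqrt(S * np.log(S * A * N_plus / delta) / N_plus)
    # any P_tilde with ||P_tilde(.|s,a) - P_bar(.|s,a)||_1 <= radius[s, a] is feasible
    return P_bar, radius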

Algorithm 1 Hoeffding-type confidence bounds and known B_{\star}
  input: state space S, action space A, bound on cost-to-go of optimal policy B_{\star}, confidence parameter \delta.
  initialization: \forall(s,a,s^{\prime})\in S\times A\times S:\;N(s,a,s^{\prime})\leftarrow 0,\;N(s,a)\leftarrow 0, an arbitrary policy \tilde{\pi}, t\leftarrow 1.
  for k=1,2,\ldots do
     set s_{t}\leftarrow s_{\text{init}}.
     while s_{t}\neq g do
        follow optimistic optimal policy: a_{t}\leftarrow\tilde{\pi}(s_{t}).
        observe next state s_{t+1}\sim P(\cdot\mid s_{t},a_{t}).
        update: N(s_{t},a_{t},s_{t+1})\leftarrow N(s_{t},a_{t},s_{t+1})+1, N(s_{t},a_{t})\leftarrow N(s_{t},a_{t})+1.
        if N(s_{t+1},\tilde{\pi}(s_{t+1}))\leq\frac{5000B_{\star}^{2}|S|}{c_{\text{min}}^{2}}\log\frac{B_{\star}|S||A|}{\delta c_{\text{min}}} or s_{t+1}=g then
           # start new interval
           compute empirical transition function \bar{P} as \bar{P}(s^{\prime}\mid s,a)=N(s,a,s^{\prime})/N_{+}(s,a) where N_{+}(s,a)=\max\{N(s,a),1\}.
           compute optimistic policy \tilde{\pi} by minimizing the expected cost over transition functions \widetilde{P} that satisfy Eq. 2.
        end if
        set t\leftarrow t+1.
     end while
  end for

To ensure that the algorithm reaches the goal state in every episode, we define a state-action pair (s,a) as known if the number of visits to this pair is at least \frac{5000B_{\star}^{2}|S|}{c_{\text{min}}^{2}}\log\frac{B_{\star}|S||A|}{\delta c_{\text{min}}}, and as unknown otherwise. We show that, with high probability, the optimistic policy chosen by the algorithm will be proper once all state-action pairs are known. However, when some pairs are still unknown, our chosen policies may be improper. This implies that the strategy of keeping the policy fixed throughout an episode, as is usually done in episodic RL, will fail. Consequently, our algorithm changes policies at the start of every episode and also every time we reach an unknown state-action pair.

Formally, we split the time into intervals. The first interval begins at the first time step, and every interval ends by reaching the goal state or a state s such that (s,\tilde{\pi}(s)) is unknown (where \tilde{\pi} is the current policy followed by the learner). Recall that once all state-action pairs are known, the optimistic policy will eventually reach the goal state. Therefore, recomputing the optimistic policy at the end of every interval ensures that the algorithm will eventually reach the goal state with high probability. Note that the total number of intervals is at most the number of visits to an unknown state-action pair plus the number of episodes.

Observation 3.2.

The total number of intervals, M, is

O\biggl(K+\frac{B_{\star}^{2}|S|^{2}|A|}{c_{\text{min}}^{2}}\log\frac{B_{\star}|S||A|}{\delta c_{\text{min}}}\biggr).
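
The interval-ending rule described above is simple to state in code; here is a hedged helper sketch (names are illustrative; threshold is the known-pair visit threshold defined above, and goal is a sentinel for the goal state):

def interval_ended(next_state, policy, N, threshold, goal):
    # The interval ends when the goal is reached, or when the prescribed next
    # state-action pair (s', pi(s')) is still unknown (visit count at most threshold).
    return next_state == goal or N[next_state, policy[next_state]] <= threshold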

3.1 Analysis

The proof of Theorem 3.1 begins by defining the “good event” in which our confidence sets contain the true transition function and the total cost in every interval is bounded. This in turn implies that all episodes end in finite time. We prove that the good event holds with high probability.

Then, independently, we give a high-probability bound on the regret of the algorithm when the good event holds. To do so, recall that at the beginning of every interval m, the learner computes an optimistic policy by minimizing over all policies and over all transition functions within the current confidence set. We denote the chosen policy by \tilde{\pi}^{m} and let \widetilde{P}_{m} be the minimizing transition function (i.e., the optimistic model). A key observation is that, by the definition of our confidence sets, \widetilde{P}_{m} is such that there is always some positive probability to transition to the goal state directly from any state-action pair. This implies that all policies are proper in the optimistic model and that the cost-to-go function of \tilde{\pi}^{m} defined with respect to \widetilde{P}_{m}, and denoted by \widetilde{J}^{m}, is finite. By Lemma 2.1, the following Bellman optimality equations hold for all s\in S,

\widetilde{J}^{m}(s) = \min_{a\in A}\; c(s,a)+\sum_{s^{\prime}\in S}\widetilde{P}_{m}(s^{\prime}\mid s,a)\widetilde{J}^{m}(s^{\prime}).   (3)
High probability events.

For every interval m, we let \Omega^{m} denote the event that the confidence set for interval m contains the true transition function P. Formally, let \bar{P}_{m} denote the empirical estimate of the transition function at the beginning of interval m, let N_{m}(s,a) denote the number of visits to state-action pair (s,a) up to interval m (not including), and let n_{m}(s,a) be the number of visits to (s,a) during interval m. Then we say that \Omega^{m} holds if for all (s,a)\in S\times A, we have (with N_{+}^{m}(s,a)=\max\{1,N_{m}(s,a)\})

\lVert P(\cdot\mid s,a)-\bar{P}_{m}(\cdot\mid s,a)\rVert_{1}\leq 5\sqrt{\frac{|S|\log\bigl(|S||A|N_{+}^{m}(s,a)/\delta\bigr)}{N_{+}^{m}(s,a)}}.   (4)

In the following lemma we show that, with high probability, the events \Omega^{m} hold and that the total cost in each interval is bounded. Combining this with Observation 3.2, we get that all episodes terminate within a finite number of steps with high probability.

Lemma 3.3.

With probability at least 1-\delta/2, for all intervals m simultaneously, we have that \Omega^{m} holds and that \sum_{h=1}^{H^{m}}c(s_{h}^{m},a_{h}^{m})\leq 24B_{\star}\log\frac{4m}{\delta}, where H^{m} denotes the length of interval m, s_{h}^{m} is the observed state at time h of interval m and a_{h}^{m}=\tilde{\pi}^{m}(s_{h}^{m}) is the chosen action. This implies that the total number of steps of the algorithm is

T=O\biggl(\frac{KB_{\star}}{c_{\text{min}}}L+\frac{B_{\star}^{3}|S|^{2}|A|}{c_{\text{min}}^{3}}L^{2}\biggr).
Proof sketch.

The events \Omega^{m} hold with high probability due to standard concentration inequalities, and thus it remains to address the high-probability bound on the total cost within each interval.

This proof consists of three parts. In the first, we show that when \Omega^{m} occurs we have that \widetilde{J}^{m}(s)\leq J^{\pi^{\star}}(s)\leq B_{\star} for all s\in S due to the optimistic nature of the computation of \tilde{\pi}^{m}. In the second part, we postulate that had all state-action pairs been known, then having \Omega^{m} hold implies that J^{m}(s)\leq 2B_{\star} for all s\in S. That is, when all state-action pairs are known, not only is \tilde{\pi}^{m} proper in the true model, but its expected cumulative cost is at most 2B_{\star}.

The third part of the proof deals with the general case when not all state-action pairs are known. Fix some interval m. Since the interval ends when we reach an unknown state-action pair, it must be that all but the first state-action pair visited during the interval are known. For this unknown first state-action pair, it follows from the Bellman equations (Eq. 3) and from \widetilde{J}^{m}(s)\leq B_{\star} for all s\in S that \tilde{\pi}^{m} never picks an action whose instantaneous cost is larger than B_{\star}. Therefore, the cost of this first unknown state-action pair is at most B_{\star}, and we focus on bounding the total cost in the remaining time steps with high probability.

To that end, we define the following modified MDP M^{\text{know}}=(S^{\text{know}},A,P^{\text{know}},c,s_{\text{init}}) in which every state s\in S such that (s,\tilde{\pi}^{m}(s)) is unknown is contracted to the goal state. Let P^{\text{know}} be the transition function induced in M^{\text{know}} by P, and let J^{m}_{\text{know}} be the cost-to-go of \tilde{\pi}^{m} in M^{\text{know}} w.r.t P^{\text{know}}. Similarly, define \widetilde{P}^{\text{know}}_{m} as the transition function induced in M^{\text{know}} by \widetilde{P}_{m}, and \widetilde{J}^{m}_{\text{know}} as the cost-to-go of \tilde{\pi}^{m} in M^{\text{know}} w.r.t \widetilde{P}^{\text{know}}_{m}. It is clear that \widetilde{J}^{m}_{\text{know}}(s)\leq\widetilde{J}^{m}(s) for every s\in S, hence \widetilde{J}^{m}_{\text{know}}(s)\leq B_{\star}. Moreover, since all states s\in S for which (s,\tilde{\pi}^{m}(s)) is unknown were contracted to the goal state, in M^{\text{know}} all remaining state-action pairs are known. Therefore, by the second part of the proof, J^{m}_{\text{know}}(s)\leq 2B_{\star} for all s\in S. Note that reaching the goal state in M^{\text{know}} is equivalent to reaching either the goal state or an unknown state-action pair in the true model, hence the latter argument shows that the total expected cost in doing so is at most 2B_{\star}. We further obtain the high-probability bound by a probabilistic amplification argument using the Markov property of the MDP. ∎

Regret analysis.

In what follows, instead of bounding R_{K}, we bound \widetilde{R}_{K}=\sum_{m=1}^{M}\sum_{h=1}^{H^{m}}c(s_{h}^{m},a_{h}^{m})\mathbb{I}\{\Omega^{m}\}-K\cdot J^{\pi^{\star}}(s_{\text{init}}), where \mathbb{I} is the indicator function. Note that according to Lemma 3.3, we have that \widetilde{R}_{K}=R_{K} with high probability.

The definition of \widetilde{R}_{K} allows the analysis to disentangle two dependent probabilistic events. The first is the intersection of the events \Omega^{m}, which is dealt with in Lemma 3.3. The second holds when, for a fixed policy, the costs suffered by the learner do not deviate significantly from their expectation. In the following lemma we bound \widetilde{R}_{K}.

Lemma 3.4.

With probability at least 1-\delta/2, we have

\widetilde{R}_{K}\leq O\biggl(\underbrace{\frac{B_{\star}^{3}|S|^{2}|A|}{c_{\text{min}}^{2}}\log\frac{B_{\star}|S||A|}{c_{\text{min}}\delta}}_{(1)}+B_{\star}\sqrt{T\log\frac{T}{\delta}}+B_{\star}\sqrt{|S|\log\frac{|S||A|T}{\delta}}\underbrace{\sum_{s,a}\sum_{m=1}^{M}\frac{n_{m}(s,a)}{\sqrt{N_{+}^{m}(s,a)}}}_{(2)}\biggr).

Here we only explain how to interpret the resulting bound. The term (1) bounds the total cost spent in intervals that ended in unknown state-action pairs (it does not depend on K). The term (2) is at most O(\sqrt{|S||A|T}) when Lemma 3.3 holds, and then the dominant term in Lemma 3.4 becomes \widetilde{O}(B_{\star}|S|\sqrt{|A|T}). Theorem 3.1 is finally obtained by applying a union bound on Lemmas 3.3 and 3.4 and using Lemma 3.3 to bound T.

3.2 Unknown Cost Bound

In this section we relax the assumption that B_{\star} is known to the learner. Instead, we keep an estimate \widetilde{B} that is initialized to c_{\text{min}} and doubles every time the cost in interval m (denoted C_{m}) reaches 24\widetilde{B}\log\frac{4m}{\delta}. By Lemma 3.3, with high probability, \widetilde{B}\leq 2B_{\star}. We end an interval as before (once the goal state is reached or an unknown state-action pair is reached), but also when \widetilde{B} is doubled. The algorithm for this case is presented in Appendix A (Algorithm 3). Since \widetilde{B} changes, every state-action pair can become known once for every different value of \widetilde{B}.
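
The doubling rule itself is a one-liner; a hedged sketch (an illustrative helper, not part of Algorithm 3's pseudocode):

import math

def maybe_double_estimate(B_tilde, C_m, m, delta):
    # B_tilde starts at c_min; C_m is the cost accumulated in the current interval m.
    # Doubling is triggered when C_m reaches the threshold 24 * B_tilde * log(4m / delta),
    # which also ends the current interval.
    if C_m >= 24 * B_tilde * math.log(4 * m / delta):
        return 2 * B_tilde, True
    return B_tilde, False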

Observation 3.5.

When B_{\star} is unknown to the learner, the number of times a state-action pair can become known is at most \log_{2}(B_{\star}/c_{\text{min}}). The number of intervals M is

O\biggl(K+\frac{B_{\star}^{2}|S|^{2}|A|}{c_{\text{min}}^{2}}\log^{2}\frac{B_{\star}|S||A|}{\delta c_{\text{min}}}\biggr).
Lemma 3.6.

When B_{\star} is unknown, with probability at least 1-\delta/2, for all intervals m simultaneously, we have that \Omega^{m} holds and that \sum_{h=1}^{H^{m}}c(s_{h}^{m},a_{h}^{m})\leq 24B_{\star}\log\frac{4m}{\delta}. This implies that the total number of steps of the algorithm is

T=O\biggl(\frac{KB_{\star}}{c_{\text{min}}}L+\frac{B_{\star}^{3}|S|^{2}|A|}{c_{\text{min}}^{3}}L^{3}\biggr).

The analysis follows that of Algorithm 1. In particular, Lemma 3.4 still holds (with 2B_{\star} instead of B_{\star}), and together with Lemma 3.6 it implies Theorem 2.3.

4 Bernstein-type Confidence Bounds

Algorithm 2 Bernstein-type confidence bounds
  input: state space S, action space A and confidence parameter \delta.
  initialization: i\leftarrow 1, t\leftarrow 1, arbitrary policy \tilde{\pi}_{1}, \forall(s,a,s^{\prime})\in S\times A\times S:\;N_{1}(s,a,s^{\prime})\leftarrow 0, N_{1}(s,a)\leftarrow 0, n_{1}(s,a,s^{\prime})\leftarrow 0, n_{1}(s,a)\leftarrow 0.
  for k=1,2,\ldots do
     set s_{t}\leftarrow s_{\text{init}}.
     while s_{t}\neq g do
        follow optimistic optimal policy: a_{t}\leftarrow\tilde{\pi}_{i}(s_{t}).
        observe next state s_{t+1}\sim P(\cdot\mid s_{t},a_{t}).
        set: n_{i}(s_{t},a_{t})\leftarrow n_{i}(s_{t},a_{t})+1, n_{i}(s_{t},a_{t},s_{t+1})\leftarrow n_{i}(s_{t},a_{t},s_{t+1})+1.
        if n_{i}(s_{t+1},\tilde{\pi}_{i}(s_{t+1}))<N_{i}(s_{t+1},\tilde{\pi}_{i}(s_{t+1})) then
           set t\leftarrow t+1 and continue.
        end if
        # start new epoch
        set: N_{i+1}(s,a,s^{\prime})\leftarrow N_{i}(s,a,s^{\prime})+n_{i}(s,a,s^{\prime}), N_{i+1}(s,a)\leftarrow N_{i}(s,a)+n_{i}(s,a), n_{i+1}(s,a)\leftarrow 0, n_{i+1}(s,a,s^{\prime})\leftarrow 0 for all (s,a,s^{\prime})\in S\times A\times S.
        compute empirical transition function \bar{P} as \bar{P}(s^{\prime}\mid s,a)=N(s,a,s^{\prime})/N_{+}(s,a) for every (s,a,s^{\prime})\in S\times A\times S where N_{+}(s,a)=\max\{N(s,a),1\}.
        compute optimistic transition function \widetilde{P} using Eq. 5.
        compute optimal policy \tilde{\pi} w.r.t \widetilde{P}.
        i\leftarrow i+1, t\leftarrow t+1.
     end while
  end for

Algorithm 1 has two drawbacks. The first one is the use of Hoeffding-style confidence bounds, which we improve with Bernstein-style confidence bounds. The second is the number of times the optimistic optimal policy is computed. In this section we propose to compute it in a way similar to UCRL2, i.e., once the number of visits to some state-action pair is doubled. Note that this change also eliminates the need to know or to estimate B_{\star}.

The algorithm is presented in Algorithm 2. It consists of epochs. The first epoch starts at the first time step, and each epoch ends once the number of visits to some state-action pair is doubled. An optimistic policy is computed at the end of every epoch using (empirical) Bernstein confidence bounds. In contrast to Algorithm 1, Algorithm 2 defines a confidence range for each state, action, and next state separately, around its empirical estimate (i.e., we use an L_{\infty} “ball” rather than an L_{1} “ball” around the empirical estimates). This allows us to disentangle the computation of the optimistic policy from the computation of the optimistic model. Indeed, the computation of the optimistic model becomes very easy: one simply has to maximize the probability of transition directly to the goal state at every state-action pair, which means minimizing the probability of transition to all other states and setting them at the lowest possible value of their confidence range. This results in the following formula for \widetilde{P}(s^{\prime}\mid s,a):

\max\Bigl\{\bar{P}(s^{\prime}\mid s,a)-28A(s,a)-4\sqrt{\bar{P}(s^{\prime}\mid s,a)A(s,a)},\,0\Bigr\},   (5)

where A(s,a)=\log(|S||A|N_{+}(s,a)/\delta)/N_{+}(s,a). The optimistic policy is then the optimal policy in the SSP model defined by the transition function \widetilde{P}.
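
Since Eq. (5) is applied coordinate-wise, computing the optimistic model requires no optimization at all; a minimal sketch in Python (names are illustrative):

import numpy as np

def optimistic_transitions(counts, delta):
    # counts: (|S|, |A|, |S|) visit counters; delta: confidence parameter.
    S, A_num = counts.shape[0], counts.shape[1]
    N_plus = np.maximum(counts.sum(axis=2), 1)               # N_+(s, a)
    P_bar = counts / N_plus[:, :, None]                      # empirical estimate
    A_sa = np.log(S * A_num * N_plus / delta) / N_plus       # A(s, a)
    A_exp = A_sa[:, :, None]
    P_tilde = np.maximum(P_bar - 28 * A_exp - 4 * np.sqrt(P_bar * A_exp), 0.0)
    # the remaining mass 1 - sum_{s'} P_tilde(s'|s,a) is assigned to the goal state
    return P_tilde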

4.1 Analysis

In this section we prove Theorem 2.4. We start by showing that our new confidence sets contain P with high probability, which implies that each episode ends in finite time with high probability. Consequently, we are able to bound the regret through summation of our confidence bounds.

We once again distinguish between known and unknown state-action pairs, similarly to Algorithm 1. A state-action pair (s,a) becomes known at the end of an epoch if the total number of visits to (s,a) has passed \alpha\cdot\frac{B_{\star}|S|}{c_{\text{min}}}\log\frac{B_{\star}|S||A|}{\delta c_{\text{min}}} at some time step during the epoch (for some constant \alpha>0). Note that at the end of the epoch, the visit count of (s,a) may be strictly larger than \alpha\cdot\frac{B_{\star}|S|}{c_{\text{min}}}\log\frac{B_{\star}|S||A|}{\delta c_{\text{min}}}, but at most twice as large by the definition of our algorithm. Furthermore, we split each epoch into intervals similarly to what we did in Section 3. The first interval starts at the first time step and each interval ends once (1) the total cost in the interval accumulates to at least B_{\star}; (2) an unknown state-action pair is reached; (3) the current episode ends; or (4) the current epoch ends. We have the following observation.

Observation 4.1.

Let C_{M} denote the cost of the learner after M intervals. Observe that the total cost in each interval is at least B_{\star} unless the interval ends at the goal state, at an unknown state-action pair, or when the epoch ends. Thus the total number of intervals satisfies

M\leq\frac{C_{M}}{B_{\star}}+2|S||A|\log T+K+O\biggl(\frac{B_{\star}|S|^{2}|A|}{c_{\text{min}}}\log\frac{B_{\star}|S||A|}{\delta c_{\text{min}}}\biggr),

and the total time satisfies T\leq C_{M}/c_{\text{min}}.

Recall that in the analysis of Algorithm 1 we show that once all state-action pairs are known, the optimistic policies generated by the algorithm are proper in the true MDP. The same holds true for Algorithm 2, yet we never prove this directly. Instead, our proof goes as follows (we neglect lower-order terms here). We prove that C_{M}, the cost accumulated by the learner during the first M intervals, is at most K\cdot J^{\pi^{\star}}(s_{\text{init}})+B_{\star}\sqrt{M} with high probability, as long as no more than K episodes have been completed during these M intervals. We notice that once all state-action pairs are known, the total cost in each interval is at least B_{\star} (ignoring intervals that end with the end of an epoch or an episode), which implies that the total number of intervals M is bounded by C_{M}/B_{\star}. This allows us to get a bound on C_{M} that is independent of the number of intervals by solving the inequality C_{M}\lesssim K\cdot J^{\pi^{\star}}(s_{\text{init}})+B_{\star}\sqrt{M}\lesssim K\cdot J^{\pi^{\star}}(s_{\text{init}})+\sqrt{B_{\star}\cdot C_{M}}. From this, and since the instantaneous costs are strictly positive (by Assumption 2), it must be that the learner eventually completes all K episodes; i.e., there must be a time from which Algorithm 2 generates only proper policies.
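
The last step (“solving the inequality”) is the elementary fact that x\leq a+\sqrt{bx} implies x\leq a+b+\sqrt{ab}; a short verification, with a=K\cdot J^{\pi^{\star}}(s_{\text{init}}) and b of order B_{\star} (constants and lower-order terms suppressed):

x\leq a+\sqrt{bx}
\;\Longrightarrow\;
\sqrt{x}\leq\frac{\sqrt{b}+\sqrt{b+4a}}{2}
\;\Longrightarrow\;
x\leq\frac{2b+4a+2\sqrt{b(b+4a)}}{4}\leq a+b+\sqrt{ab},

so that C_{M}\lesssim K\cdot J^{\pi^{\star}}(s_{\text{init}})+\sqrt{B_{\star}K\cdot J^{\pi^{\star}}(s_{\text{init}})}+B_{\star}.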

Notation.

The epoch that interval m belongs to is denoted by i(m); other notations are as in Section 3.1. Note that since the optimistic policy is computed at the end of an epoch and not at the end of an interval, it follows that \tilde{\pi}^{m}=\tilde{\pi}^{i(m)} and \widetilde{J}^{m}=\widetilde{J}^{i(m)}. The trajectory visited in interval m is denoted by U^{m}=(s_{1}^{m},a_{1}^{m},\ldots,s_{H^{m}}^{m},a_{H^{m}}^{m},s_{H^{m}+1}^{m}), where a_{h}^{m} is the action taken in s_{h}^{m}, and H^{m} is the length of the interval. In addition, the concatenation of the trajectories of the intervals up to and including interval m is denoted by \bar{U}^{m}, that is, \bar{U}^{m}=\cup_{m^{\prime}=1}^{m}U^{m^{\prime}}.

High probability events.

Throughout the analysis we denote S^{+}=S\cup\{g\}. For every interval m we let \Omega^{m} denote the event that the confidence set for epoch i=i(m) contains the actual transition function P. Formally, if \Omega^{m} holds then for all (s,a,s^{\prime})\in S\times A\times S^{+}, we have (denoting N_{+}^{m}(s,a)=\max\{1,N^{m}(s,a)\} and A_{h}^{m}=A(s_{h}^{m},a_{h}^{m}))

|P(s^{\prime}\mid s,a)-\bar{P}_{m}(s^{\prime}\mid s,a)|\leq 28A_{h}^{m}+4\sqrt{\bar{P}_{m}(s^{\prime}\mid s,a)A_{h}^{m}}.   (6)

In the following lemma we show that the events \Omega^{m} hold with high probability.

Lemma 4.2.

With probability at least 1-\delta/2, \Omega^{m} holds for all intervals m simultaneously.

Regret analysis.

In what follows, instead of bounding R_{K}, we bound \widetilde{R}_{M}=\sum_{m=1}^{M}\sum_{h=1}^{H^{m}}c(s_{h}^{m},a_{h}^{m})\mathbb{I}\{\Omega^{m}\}-KJ^{\pi^{\star}}(s_{\text{init}}) for any number of intervals M. This implies Theorem 2.4 by the following argument. Lemma 4.2 implies that \widetilde{R}_{M}=R_{M} with high probability for any number of intervals M (where R_{M} is the true regret within the first M intervals). In particular, when M is the number of intervals in which the first K episodes elapse, this implies Theorem 2.4 (we show that the learner indeed completes these K episodes).

To bound \widetilde{R}_{M}, we use the next lemma to decompose \widetilde{R}_{M} into two terms which we bound independently.

Lemma 4.3.

It holds that \widetilde{R}_{M}=\sum_{m=1}^{M}\widetilde{R}_{m}^{1}+\sum_{m=1}^{M}\widetilde{R}_{m}^{2}-K\cdot J^{\pi^{\star}}(s_{\text{init}}), where

\widetilde{R}_{m}^{1} =\bigl(\widetilde{J}^{m}(s_{1}^{m})-\widetilde{J}^{m}(s_{H^{m}+1}^{m})\bigr)\mathbb{I}\{\Omega^{m}\},\quad\text{and}
\widetilde{R}_{m}^{2} =\Biggl(\sum_{h=1}^{H^{m}}\widetilde{J}^{m}(s_{h+1}^{m})-\sum_{s^{\prime}\in S}\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\Biggr)\mathbb{I}\{\Omega^{m}\}.

The lemma breaks down \widetilde{R}_{M} into two terms. The first term accounts for the number of times in which the learner changes her policy in the middle of an episode, which is at most the number of epochs. The second term sums the errors between the cost-to-go of the observed next state and its estimated expectation.

Indeed, \sum_{m=1}^{M}\widetilde{R}_{m}^{1} is related to the total number of epochs, which is at most |S||A|\log_{2}T due to the following lemma.

Lemma 4.4.

It holds that \sum_{m=1}^{M}\widetilde{R}_{m}^{1}\leq 2B_{\star}|S||A|\log T+KJ^{\pi^{\star}}(s_{\text{init}}).

The next lemma shows that \sum_{m=1}^{M}\widetilde{R}_{m}^{2} does not deviate from \sum_{m=1}^{M}\mathbb{E}[\widetilde{R}_{m}^{2}\mid\bar{U}^{m-1}] significantly.

Lemma 4.5.

With probability at least 1-\delta/4,

\sum_{m=1}^{M}\widetilde{R}_{m}^{2}\leq\sum_{m=1}^{M}\mathbb{E}\bigl[\widetilde{R}_{m}^{2}\mid\bar{U}^{m-1}\bigr]+3B_{\star}\sqrt{M\log\frac{8M}{\delta}}.

The key property of the lemma is that the deviations between \sum_{m=1}^{M}\widetilde{R}_{m}^{2} and its corresponding expectation are of order \sqrt{M} and do not scale with T.

To prove the lemma, we recall that an interval ends at most at the first time step in which the accumulated cost in the interval surpasses B_{\star}. We show in our analysis that \widetilde{J}^{m}(s)\leq J^{\pi^{\star}}(s)\leq B_{\star} for all s\in S due to the optimistic computation of \tilde{\pi}^{m}. Therefore, \tilde{\pi}^{m} never picks an action whose instantaneous cost is more than B_{\star}. This implies that the total cost within each interval is at most 2B_{\star}. Then, we use the Bellman equations to bound \widetilde{R}_{m}^{2} by the order of the total cost in the interval, and the lemma follows by an application of Azuma’s concentration inequality.

Lemma 4.6 below bounds \mathbb{E}\bigl[\widetilde{R}_{m}^{2}\mid\bar{U}^{m-1}\bigr] for every interval m by a sum of the confidence bounds used in Algorithm 2.

Lemma 4.6.

For every interval m,

\mathbb{E}\bigl[\widetilde{R}_{m}^{2}\mid\bar{U}^{m-1}\bigr]\leq 16\,\mathbb{E}\Biggl[\sum_{h=1}^{H^{m}}\sqrt{|S|\mathbb{V}_{h}^{m}A_{h}^{m}}\,\mathbb{I}\{\Omega^{m}\}\biggm|\bar{U}^{m-1}\Biggr]+272\,\mathbb{E}\Biggl[\sum_{h=1}^{H^{m}}B_{\star}|S|A_{h}^{m}\mathbb{I}\{\Omega^{m}\}\biggm|\bar{U}^{m-1}\Biggr],   (7)

where \mathbb{V}_{h}^{m} is the empirical variance defined as \mathbb{V}_{h}^{m}=\sum_{s^{\prime}\in S^{+}}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\bigl(\widetilde{J}^{m}(s^{\prime})-\mu_{h}^{m}\bigr)^{2}, and \mu_{h}^{m}=\sum_{s^{\prime}\in S^{+}}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime}).

The next step is the part of our proof in which our analysis departs from that of Algorithm 1. Note that when \Omega^{m} holds, \mathbb{V}_{h}^{m}\leq B_{\star}^{2}. Using this bound for each time step separately will result in a bound similar to that of Theorem 2.3. However, this bound is loose due to the following intuitive argument. Suppose that we replace \widetilde{J}^{m} with the true cost-to-go function of \tilde{\pi}^{m}, J^{m}, in the definition of \mathbb{V}_{h}^{m}. Note that from the Bellman equations (Eq. 1) we have J^{m}(s_{h}^{m})>J^{m}(s_{h+1}^{m}) in expectation on consecutive time steps h and h+1, hence we surmise that in expectation \mathbb{V}_{h}^{m} would also decrease on consecutive time steps. A similar argument holds when in reality we use \widetilde{J}^{m}, because all but one of the state-action pairs in the interval are known, and \widetilde{J}^{m} is a “close enough” approximation of J^{m} on known state-action pairs since they have been sampled sufficiently many times. Indeed, in Lemma 4.7 we use the technique of Azar et al. (2017) to show that (up to a constant) B_{\star}^{2} bounds the expected sum of the variances over the time steps of an interval.

Lemma 4.7.

\mathbb{E}\bigl[\sum_{h=1}^{H^{m}}\mathbb{V}_{h}^{m}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\bigr]\leq 44B_{\star}^{2}.

Armed with Lemma 4.7, we upper bound \sum_{m=1}^{M}\mathbb{E}\bigl[\widetilde{R}_{m}^{2}\mid\bar{U}^{m-1}\bigr] by applying some algebraic manipulation to Eq. 7 and summing over all intervals, which gives the next lemma.

Lemma 4.8.

With probability at least 1-\delta/4,

\sum_{m=1}^{M}\mathbb{E}\bigl[\widetilde{R}_{m}^{2}\mid\bar{U}^{m-1}\bigr]\leq 614B_{\star}\sqrt{M|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}}+8160B_{\star}|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}.

Theorem 2.4 is obtained by first applying a union bound on Lemmas 4.2, 4.5 and 4.8, plugging the bounds of Lemmas 4.4, 4.5 and 4.8 into Lemma 4.3, and bounding T and M using Observation 4.1. This results in a quadratic inequality in \sqrt{C_{M}}, and solving it yields the theorem.

Acknowledgements

HK is supported by the Israeli Science Foundation (ISF) grant 1595/19.

References

  • Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002.
  • Azar et al. (2017) Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 263–272. JMLR.org, 2017.
  • Bartlett and Tewari (2009) Peter L Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, pages 35–42. AUAI Press, 2009.
  • Bertsekas and Tsitsiklis (1991) Dimitri P Bertsekas and John N Tsitsiklis. An analysis of stochastic shortest path problems. Mathematics of Operations Research, 16(3):580–595, 1991.
  • Bertsekas and Yu (2013) Dimitri P Bertsekas and Huizhen Yu. Stochastic shortest path problems under weak conditions. Lab. for Information and Decision Systems Report LIDS-P-2909, MIT, 2013.
  • Cesa-Bianchi and Lugosi (2006) Nicolo Cesa-Bianchi and Gábor Lugosi. Prediction, learning, and games. Cambridge University Press, 2006.
  • Dann et al. (2017) Christoph Dann, Tor Lattimore, and Emma Brunskill. Unifying PAC and regret: Uniform PAC bounds for episodic reinforcement learning. In Advances in Neural Information Processing Systems, pages 5713–5723, 2017.
  • Jaksch et al. (2010) Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Neu et al. (2010) Gergely Neu, András György, and Csaba Szepesvári. The online loop-free stochastic shortest-path problem. In COLT 2010 - The 23rd Conference on Learning Theory, Haifa, Israel, June 27-29, 2010, pages 231–243, 2010.
  • Neu et al. (2012) Gergely Neu, Andras Gyorgy, and Csaba Szepesvári. The adversarial stochastic shortest path problem with unknown transition probabilities. In Artificial Intelligence and Statistics, pages 805–813, 2012.
  • Osband et al. (2016) Ian Osband, Benjamin Van Roy, and Zheng Wen. Generalization and exploration via randomized value functions. In International Conference on Machine Learning, pages 2377–2386, 2016.
  • Rosenberg and Mansour (2019a) Aviv Rosenberg and Yishay Mansour. Online stochastic shortest path with bandit feedback and unknown transition function. In Advances in Neural Information Processing Systems, pages 2209–2218, 2019a.
  • Rosenberg and Mansour (2019b) Aviv Rosenberg and Yishay Mansour. Online convex optimization in adversarial markov decision processes. In International Conference on Machine Learning, pages 5478–5486, 2019b.
  • Tarbouriech et al. (2019) Jean Tarbouriech, Evrard Garcelon, Michal Valko, Matteo Pirotta, and Alessandro Lazaric. No-regret exploration in goal-oriented reinforcement learning, 2019.
  • Weissman et al. (2003) Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep, 2003.
  • Zanette and Brunskill (2019) Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. In International Conference on Machine Learning, pages 7304–7312, 2019.
  • Zimin and Neu (2013) Alexander Zimin and Gergely Neu. Online learning in episodic Markovian decision processes by relative entropy policy search. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5-8, 2013, Lake Tahoe, Nevada, United States, pages 1583–1591, 2013.

Appendix A Algorithm

Algorithm 3 Hoeffding-type confidence bounds
  input: state space S, action space A and confidence parameter \delta.
  initialization: arbitrary policy \tilde{\pi}, m\leftarrow 1, \widetilde{B}\leftarrow c_{\text{min}}, C_{1}\leftarrow 0, \forall(s,a,s^{\prime})\in S\times A\times S:\;N(s,a,s^{\prime})\leftarrow 0, N(s,a)\leftarrow 0.
  for k=1,2,\ldots do
     set s\leftarrow s_{\text{init}}.
     while s\neq g do
        follow optimistic optimal policy: a\leftarrow\tilde{\pi}(s).
        suffer cost: C_{m}\leftarrow C_{m}+c(s,a).
        observe next state s^{\prime}\sim P(\cdot\mid s,a).
        update visit counters: N(s,a,s^{\prime})\leftarrow N(s,a,s^{\prime})+1, N(s,a)\leftarrow N(s,a)+1.
        if N(s^{\prime},\tilde{\pi}(s^{\prime}))\leq\frac{5000\widetilde{B}^{2}|S|}{c_{\text{min}}^{2}}\log\frac{\widetilde{B}|S||A|}{\delta c_{\text{min}}} or s^{\prime}=g or C_{m}\geq 24\widetilde{B}\log\frac{4m}{\delta} then
           # start new interval
           if C_{m}\geq 24\widetilde{B}\log\frac{4m}{\delta} then
              update B_{\star} estimate: \widetilde{B}\leftarrow 2\widetilde{B}.
           end if
           advance intervals counter: m\leftarrow m+1.
           initialize cost suffered in interval: C_{m}\leftarrow 0.
           compute empirical transition function \bar{P} for every (s,a,s^{\prime})\in S\times A\times S:
              \bar{P}(s^{\prime}\mid s,a)=\frac{N(s,a,s^{\prime})}{\max\{N(s,a),1\}}.
           compute policy \tilde{\pi} that minimizes the expected cost with respect to a transition function \widetilde{P}, such that for every (s,a)\in S\times A:
              \bigl\lVert\widetilde{P}(\cdot\mid s,a)-\bar{P}(\cdot\mid s,a)\bigr\rVert_{1}\leq 5\sqrt{\frac{|S|\log\bigl(|S||A|N_{+}(s,a)/\delta\bigr)}{N_{+}(s,a)}}.
        end if
        set s\leftarrow s^{\prime}.
     end while
  end for

Appendix B Proofs

B.1 Proofs for Section 3.1

B.1.1 Proof of Lemma 3.3

Lemma (restatement of Lemma 3.3).

With probability at least 1-\delta/2, \Omega^{m} holds and \sum_{h=1}^{H^{m}}c(s_{h}^{m},a_{h}^{m})\leq 24B_{\star}\log\frac{4m}{\delta} for all intervals m simultaneously. This implies that the total number of steps of the algorithm is

T=O\biggl(\frac{KB_{\star}}{c_{\text{min}}}\log\frac{KB_{\star}|S||A|}{\delta c_{\text{min}}}+\frac{B_{\star}^{3}|S|^{2}|A|}{c_{\text{min}}^{3}}\log^{2}\frac{KB_{\star}|S||A|}{\delta c_{\text{min}}}\biggr).
Lemma B.1.

The event \Omega^{m} holds for all intervals m simultaneously with probability at least 1-\delta/4.

Proof.

Fix a state ss and an action aa. Consider an infinite sequence {Zi}i=1\{Z_{i}\}_{i=1}^{\infty} of draws from the distribution P(s,a)P(\cdot\mid s,a). By Theorem D.2 we get that for a prefix of length tt of this sequence (that is {Zi}i=1t\{Z_{i}\}_{i=1}^{t})

P(s,a)P¯{Zi}i=1t(s,a)12|S|log(δt1)t,\bigl{\lVert}P(\cdot\mid s,a)-\bar{P}_{\{Z_{i}\}_{i=1}^{t}}(\cdot\mid s,a)\bigr{\rVert}_{1}\leq 2\sqrt{\frac{|S|\log(\delta_{t}^{-1})}{t}}\ ,

holds with probability 1δt1-\delta_{t}, where P¯{Zi}i=1t(s,a)\bar{P}_{\{Z_{i}\}_{i=1}^{t}}(\cdot\mid s,a) is the empirical distribution defined by the draws {Zi}i=1t\{Z_{i}\}_{i=1}^{t}. We repeat this argument for every prefix {Zi}i=1t\{Z_{i}\}_{i=1}^{t} of {Zi}i=1\{Z_{i}\}_{i=1}^{\infty} and for every state-action pair, with δt=δ/(8|S||A|t2)\delta_{t}=\delta/(8|S||A|t^{2}). Then from the union bound we get that Ωm\Omega^{m} holds for all intervals mm simultaneously with probability at least 1δ/41-\delta/4. ∎
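
As a small illustration of the union-bound schedule used in the proof, the Python snippet below (a sketch with arbitrary example values of |S|, |A| and δ, none taken from the paper) computes the per-prefix failure probability δ_t and the resulting L1 confidence radius.

import math

def l1_confidence_radius(t, num_states, num_actions, delta):
    # delta_t = delta / (8 |S| |A| t^2); summed over all prefix lengths t and all
    # (s, a) pairs this stays below delta * pi^2 / 48 < delta / 4 (union bound).
    delta_t = delta / (8 * num_states * num_actions * t ** 2)
    return 2 * math.sqrt(num_states * math.log(1 / delta_t) / t)

# Example with arbitrary values: the radius shrinks roughly like sqrt(log(t) / t).
for t in (10, 100, 1000, 10000):
    print(t, round(l1_confidence_radius(t, num_states=10, num_actions=4, delta=0.1), 4))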

Lemma B.2.

Let mm be an interval. If Ωm\Omega^{m} holds then J~m(s)Jπ(s)B\widetilde{J}^{m}(s)\leq J^{\pi^{\star}}(s)\leq B_{\star} for every sSs\in S.

Proof.

Tarbouriech et al. (2019) show that all the transition functions in the confidence set of Eq. 4 can be combined into a single augmented MDP. The optimal policy of the augmented MDP can be found efficiently, e.g., with Extended Value Iteration. The optimistic policy is the optimal policy in the augmented MDP. It minimizes J~m(s)\widetilde{J}^{m}(s) over all policies and feasible transition functions, for all states sSs\in S simultaneously (following Bertsekas and Tsitsiklis, 1991). Since Ωm\Omega^{m} holds, it follows that the real transition function is in the confidence set, and therefore it is also considered in the minimization. Thus J~m(s)Jπ(s)\widetilde{J}^{m}(s)\leq J^{\pi^{\star}}(s) for all sSs\in S. Finally, Jπ(s)BJ^{\pi^{\star}}(s)\leq B_{\star} by the definition of BB_{\star}. ∎

Lemma B.3.

Let mm be an interval and (s,a)(s,a) be a known state-action pair. If Ωm\Omega^{m} holds then

P~m(s,a)P(s,a)1c(s,a)2B.\lVert\widetilde{P}_{m}(\cdot\mid s,a)-P(\cdot\mid s,a)\rVert_{1}\leq\frac{c(s,a)}{2B_{\star}}\ .
Proof.

By the definition of the confidence set

P~m(s,a)P¯m(s,a)15|S|log(|S||A|N+m(s,a)/δ)N+m(s,a)c(s,a)4B,\lVert\widetilde{P}_{m}(\cdot\mid s,a)-\bar{P}_{m}(\cdot\mid s,a)\rVert_{1}\leq 5\sqrt{\frac{|S|\log\bigl{(}|S||A|N_{+}^{m}(s,a)/\delta\bigr{)}}{N_{+}^{m}(s,a)}}\leq\frac{c(s,a)}{4B_{\star}},

where the last inequality follows because log(x)/x\log(x)/x is decreasing, and N+m(s,a)5000B2|S|cmin2logB|S||A|δcminN_{+}^{m}(s,a)\geq\frac{5000B_{\star}^{2}|S|}{c_{\text{min}}^{2}}\log\frac{B_{\star}|S||A|}{\delta c_{\text{min}}} since (s,a)(s,a) is known. Similarly, since Ωm\Omega^{m} holds we also have that

P(s,a)P¯m(s,a)15|S|log(|S||A|N+m(s,a)/δ)N+m(s,a)c(s,a)4B,\lVert P(\cdot\mid s,a)-\bar{P}_{m}(\cdot\mid s,a)\rVert_{1}\leq 5\sqrt{\frac{|S|\log\bigl{(}|S||A|N_{+}^{m}(s,a)/\delta\bigr{)}}{N_{+}^{m}(s,a)}}\leq\frac{c(s,a)}{4B_{\star}},

and the lemma follows by the triangle inequality. ∎

Lemma B.4.

Let π~\tilde{\pi} be a policy and P~\widetilde{P} be a transition function. Denote the cost-to-go of π~\tilde{\pi} with respect to P~\widetilde{P} by J~\widetilde{J}. Assume that for every sSs\in S, J~(s)B\widetilde{J}(s)\leq B_{\star} and that

P~(s,π~(s))P(s,π~(s))1c(s,π~(s))2B.\bigl{\lVert}\widetilde{P}(\cdot\mid s,\tilde{\pi}(s))-P(\cdot\mid s,\tilde{\pi}(s))\bigr{\rVert}_{1}\leq\frac{c(s,\tilde{\pi}(s))}{2B_{\star}}.

Then, π~\tilde{\pi} is proper (with respect to PP), and it holds that Jπ~(s)2BJ^{\tilde{\pi}}(s)\leq 2B_{\star} for every sSs\in S.

Proof.

Consider the Bellman equations of π~\tilde{\pi} with respect to the transition function P~\widetilde{P} at some state sSs\in S (see Lemma 2.1), defined as

J~(s)\displaystyle\widetilde{J}(s) =c(s,π~(s))+sSP~(ss,π~(s))J~(s)\displaystyle=c(s,\tilde{\pi}(s))+\sum_{s^{\prime}\in S}\widetilde{P}(s^{\prime}\mid s,\tilde{\pi}(s))\widetilde{J}(s^{\prime})
=c(s,π~(s))+sSP(ss,π~(s))J~(s)+sSJ~(s)(P~(ss,π~(s))P(ss,π~(s))).\displaystyle=c(s,\tilde{\pi}(s))+\sum_{s^{\prime}\in S}P(s^{\prime}\mid s,\tilde{\pi}(s))\widetilde{J}(s^{\prime})+\sum_{s^{\prime}\in S}\widetilde{J}(s^{\prime})\left(\widetilde{P}(s^{\prime}\mid s,\tilde{\pi}(s))-P(s^{\prime}\mid s,\tilde{\pi}(s))\right)\ . (8)

Notice that, by our assumptions and Hölder's inequality,

|sSJ~(s)(P~(ss,π~(s))P(ss,π~(s)))|\displaystyle\left|\sum_{s^{\prime}\in S}\widetilde{J}(s^{\prime})\left(\widetilde{P}(s^{\prime}\mid s,\tilde{\pi}(s))-P(s^{\prime}\mid s,\tilde{\pi}(s))\right)\right| P~(s,π~(s))P(s,π~(s))1J~\displaystyle\leq\lVert\widetilde{P}(\cdot\mid s,\tilde{\pi}(s))-P(\cdot\mid s,\tilde{\pi}(s))\rVert_{1}\cdot\lVert\widetilde{J}\rVert_{\infty}
c(s,π~(s))2BB=c(s,π~(s))2.\displaystyle\leq\frac{c(s,\tilde{\pi}(s))}{2B_{\star}}\cdot B_{\star}=\frac{c(s,\tilde{\pi}(s))}{2}\ .

Plugging this into Eq. 8, we obtain

J~(s)c(s,π~(s))+sSP(ss,π~(s))J~(s)c(s,π~(s))2=c(s,π~(s))2+sSP(ss,π~(s))J~(s).\displaystyle\widetilde{J}(s)\geq c(s,\tilde{\pi}(s))+\sum_{s^{\prime}\in S}P(s^{\prime}\mid s,\tilde{\pi}(s))\widetilde{J}(s^{\prime})-\frac{c(s,\tilde{\pi}(s))}{2}=\frac{c(s,\tilde{\pi}(s))}{2}+\sum_{s^{\prime}\in S}P(s^{\prime}\mid s,\tilde{\pi}(s))\widetilde{J}(s^{\prime}).

Therefore, defining J=2J~{J}^{\prime}=2\widetilde{J}, we have J(s)c(s,π~(s))+sSP(ss,π~(s))J(s){J}^{\prime}(s)\geq c(s,\tilde{\pi}(s))+\sum_{s^{\prime}\in S}P(s^{\prime}\mid s,\tilde{\pi}(s)){J}^{\prime}(s^{\prime}) for all sSs\in S. The statement now follows by Lemma 2.1. ∎

Lemma B.5.

Let π\pi be a proper policy such that, for some v>0v>0, Jπ(s)vJ^{\pi}(s)\leq v for every sSs\in S. Then, for all m0m\geq 0, the probability that the cost of π\pi to reach the goal state from any state ss is more than mm is at most 2em/4v2e^{-m/4v}. Note that a cost of at most mm implies that the number of steps is at most mcmin\tfrac{m}{c_{\text{min}}}.

Proof.

By Markov's inequality, the probability that π\pi accumulates cost of more than 2v2v before reaching the goal state is at most 1/21/2. Iterating this argument, we get that the probability that π\pi accumulates cost of more than 2kv2kv before reaching the goal state is at most 2k2^{-k} for every integer k0k\geq 0. In general, for any m0m\geq 0, the probability that π\pi suffers a cost of more than mm is at most 2m/2v22m/2v2em/4v.2^{-\lfloor m/2v\rfloor}\leq 2\cdot 2^{-m/2v}\leq 2e^{-m/4v}. ∎
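
The final chain of inequalities can be sanity-checked numerically. The sketch below uses an arbitrary example value of v (an assumption of the sketch, not taken from the paper) and verifies that 2^{-⌊m/2v⌋} ≤ 2·2^{-m/2v} ≤ 2e^{-m/4v} for a few values of m.

import math

v = 3.0  # arbitrary example bound on the cost-to-go
for m in (0.0, 1.0, 2.5, 6.0, 12.0, 30.0, 100.0):
    lhs = 2.0 ** (-math.floor(m / (2 * v)))   # 2^{-floor(m / 2v)}
    mid = 2 * 2.0 ** (-m / (2 * v))           # 2 * 2^{-m / 2v}
    rhs = 2 * math.exp(-m / (4 * v))          # 2 * e^{-m / 4v}
    assert lhs <= mid <= rhs + 1e-12, (m, lhs, mid, rhs)
print("all inequalities hold for the sampled values of m")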

For the next lemma we will need the following definitions. The trajectory visited in interval mm is denoted by Um=(s1m,a1m,,sHmm,aHmm,sHm+1m)U^{m}=(s_{1}^{m},a_{1}^{m},\ldots,s_{H^{m}}^{m},a_{H^{m}}^{m},s_{H^{m}+1}^{m}) where ahma_{h}^{m} is the action taken in shms_{h}^{m}, and HmH^{m} is the length of the interval. In addition, the concatenation of the trajectories in the intervals up to and including interval mm is denoted by U¯m=m=1mUm\bar{U}^{m}=\cup_{m^{\prime}=1}^{m}U^{m^{\prime}}.

Lemma B.6.

Let mm be an interval. For all r0r\geq 0, we have that

[h=1Hmc(shm,ahm)𝕀{Ωm}>rU¯m1]3er/8B.\mathbb{P}\Biggl{[}\sum_{h=1}^{H^{m}}c\bigl{(}s_{h}^{m},a_{h}^{m}\bigr{)}\mathbb{I}\{\Omega^{m}\}>r\mid\bar{U}^{m-1}\Biggr{]}\leq 3e^{-r/8B_{\star}}.
Proof.

Note that Ωm\Omega^{m} is determined given U¯m1\bar{U}^{m-1}, and suppose that Ωm\Omega^{m} holds; otherwise h=1Hmc(shm,ahm)𝕀{Ωm}\sum_{h=1}^{H^{m}}c\bigl{(}s_{h}^{m},a_{h}^{m}\bigr{)}\mathbb{I}\{\Omega^{m}\} is 0 and the claim is trivial. Also assume that r8Br\geq 8B_{\star}, or else the statement holds trivially.

Define the MDP Mknow=(Sknow,A,Pknow,c,sinit)M^{\text{know}}=(S^{\text{know}},A,P^{\text{know}},c,s_{\text{init}}) in which every state sSs\in S such that (s,π~m(s))(s,\tilde{\pi}^{m}(s)) is unknown is contracted into the goal state. Let PknowP^{\text{know}} be the transition function induced in MknowM^{\text{know}} by PP, and let JknowmJ^{m}_{\text{know}} be the cost-to-go of π~m\tilde{\pi}^{m} in MknowM^{\text{know}} with respect to PknowP^{\text{know}}. Similarly, define P~mknow\widetilde{P}^{\text{know}}_{m} as the transition function induced in MknowM^{\text{know}} by P~m\widetilde{P}_{m}, and J~knowm\widetilde{J}^{m}_{\text{know}} as the cost-to-go of π~m\tilde{\pi}^{m} in MknowM^{\text{know}} with respect to P~mknow\widetilde{P}^{\text{know}}_{m}. It is clear that J~knowm(s)J~m(s)\widetilde{J}^{m}_{\text{know}}(s)\leq\widetilde{J}^{m}(s) for every sSs\in S, so by Lemma B.2, J~knowm(s)B\widetilde{J}^{m}_{\text{know}}(s)\leq B_{\star}. Moreover, since all the states sSs\in S for which (s,π~m(s))(s,\tilde{\pi}^{m}(s)) is unknown were contracted to the goal state, we can use Lemma B.3 to obtain for all sSknows\in S^{\text{know}}:

P~mknow(s,π~m(s))Pknow(s,π~m(s))1P~m(s,π~m(s))P(s,π~m(s))1c(s,π~m(s))2B.\bigl{\lVert}\widetilde{P}^{\text{know}}_{m}(\cdot\mid s,\tilde{\pi}^{m}(s))-P^{\text{know}}(\cdot\mid s,\tilde{\pi}^{m}(s))\bigr{\rVert}_{1}\leq\bigl{\lVert}\widetilde{P}_{m}(\cdot\mid s,\tilde{\pi}^{m}(s))-P(\cdot\mid s,\tilde{\pi}^{m}(s))\bigr{\rVert}_{1}\leq\frac{c(s,\tilde{\pi}^{m}(s))}{2B_{\star}}. (9)

We can apply Lemma B.4 in MknowM^{\text{know}} and obtain that Jknowm(s)2BJ^{m}_{\text{know}}(s)\leq 2B_{\star} for every sSknows\in S^{\text{know}}. Notice that reaching the goal state in MknowM^{\text{know}} is equivalent to reaching the goal state or an unknown state-action pair in MM, and also recall that all state-action pairs in the interval are known except for the first one. Thus, from Lemma B.5,

[h=2Hmc(shm,ahm)𝕀{Ωm}>rBU¯m1]2e(rB)/8B3er/8B.\displaystyle\mathbb{P}\Biggl{[}\sum_{h=2}^{H^{m}}c\bigl{(}s_{h}^{m},a_{h}^{m}\bigr{)}\mathbb{I}\{\Omega^{m}\}>r-B_{\star}\mid\bar{U}^{m-1}\Biggr{]}\leq 2e^{-(r-B_{\star})/8B_{\star}}\leq 3e^{-r/8B_{\star}}.

Since J~mB\widetilde{J}^{m}\leq B_{\star}, our algorithm never selects an action whose instantaneous cost is larger than BB_{\star}. Hence, even though the first state-action pair of the interval might be unknown, its cost is at most BB_{\star}, and therefore

[h=1Hmc(shm,ahm)𝕀{Ωm}>rU¯m1][h=2Hmc(shm,ahm)𝕀{Ωm}>rBU¯m1]3er/8B.\displaystyle\mathbb{P}\Biggl{[}\sum_{h=1}^{H^{m}}c\bigl{(}s_{h}^{m},a_{h}^{m}\bigr{)}\mathbb{I}\{\Omega^{m}\}>r\mid\bar{U}^{m-1}\Biggr{]}\leq\mathbb{P}\Biggl{[}\sum_{h=2}^{H^{m}}c\bigl{(}s_{h}^{m},a_{h}^{m}\bigr{)}\mathbb{I}\{\Omega^{m}\}>r-B_{\star}\mid\bar{U}^{m-1}\Biggr{]}\leq 3e^{-r/8B_{\star}}.\qquad\qed
Proof of Lemma 3.3.

From Lemma B.6, with probability at least 1δ/16m21-\delta/16m^{2}, h=1Hmc(shm,ahm)24Blog4mδ\sum_{h=1}^{H^{m}}c\bigl{(}s_{h}^{m},a_{h}^{m}\bigr{)}\leq 24B_{\star}\log\frac{4m}{\delta}, and by the union bound this holds for all intervals mm simultaneously with probability at least 1δ/41-\delta/4. By Lemma B.1, with probability 1δ/41-\delta/4, Ωm\Omega^{m} holds for all intervals mm. Combining these two facts again by a union bound, we get that, with probability at least 1δ/21-\delta/2, Ωm\Omega^{m} holds and the cost of interval mm is at most 24Blog4mδ24B_{\star}\log\frac{4m}{\delta} simultaneously for all intervals mm.

Since the cost of every interval is bounded as above (and therefore so is its length), we can use the bound on the number of intervals in 3.2 to conclude that

T\displaystyle T =O(BcminlogMδ(K+B2|S|2|A|cmin2logB|S||A|δcmin))\displaystyle=O\left(\frac{B_{\star}}{c_{\text{min}}}\log\frac{M}{\delta}\cdot\left(K+\frac{B_{\star}^{2}|S|^{2}|A|}{c_{\text{min}}^{2}}\log\frac{B_{\star}|S||A|}{\delta c_{\text{min}}}\right)\right)
=O(KBcminlogKB|S||A|δcmin+B3|S|2|A|cmin3log2KB|S||A|δcmin).\displaystyle=O\left(\frac{KB_{\star}}{c_{\text{min}}}\log\frac{KB_{\star}|S||A|}{\delta c_{\text{min}}}+\frac{B_{\star}^{3}|S|^{2}|A|}{c_{\text{min}}^{3}}\log^{2}\frac{KB_{\star}|S||A|}{\delta c_{\text{min}}}\right).\qed

B.1.2 Proof of Lemma 3.4

Lemma (restatement of Lemma 3.4).

With probability at least 1δ/21-\delta/2, we have

R~K5000B3|S|2|A|cmin2logB|S||A|cminδ+BTlog4Tδ+10B|S|log|S||A|Tδs,am=1Mnm(s,a)N+m(s,a).\displaystyle\widetilde{R}_{K}\leq\frac{5000B_{\star}^{3}|S|^{2}|A|}{c_{\text{min}}^{2}}\log\frac{B_{\star}|S||A|}{c_{\text{min}}\delta}+B_{\star}\sqrt{T\log\frac{4T}{\delta}}+10B_{\star}\sqrt{|S|\log\frac{|S||A|T}{\delta}}\sum_{s,a}\sum_{m=1}^{M}\frac{n_{m}(s,a)}{\sqrt{N_{+}^{m}(s,a)}}.

To analyze R~K\widetilde{R}_{K}, we begin by plugging the Bellman optimality equations of π~m\tilde{\pi}^{m} with respect to P~m\widetilde{P}_{m} into R~K\widetilde{R}_{K}. This allows us to decompose R~K\widetilde{R}_{K} into three terms as follows.

R~K\displaystyle\widetilde{R}_{K} =m=1Mh=1Hm(J~m(shm)sSP~m(sshm,ahm)J~m(s))𝕀{Ωm}KJπ(sinit)\displaystyle=\sum_{m=1}^{M}\sum_{h=1}^{H^{m}}\left(\widetilde{J}^{m}(s_{h}^{m})-\sum_{s^{\prime}\in S}\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\right)\mathbb{I}\{\Omega^{m}\}-K\cdot J^{\pi^{\star}}(s_{\text{init}})
=m=1Mh=1Hm(J~m(shm)J~m(sh+1m))𝕀{Ωm}KJπ(sinit)\displaystyle=\sum_{m=1}^{M}\sum_{h=1}^{H^{m}}\left(\widetilde{J}^{m}(s_{h}^{m})-\widetilde{J}^{m}(s_{h+1}^{m})\right)\mathbb{I}\{\Omega^{m}\}-K\cdot J^{\pi^{\star}}(s_{\text{init}}) (10)
+m=1Mh=1HmsSJ~m(s)(P(sshm,ahm)P~m(sshm,ahm))𝕀{Ωm}\displaystyle\qquad+\sum_{m=1}^{M}\sum_{h=1}^{H^{m}}\sum_{s^{\prime}\in S}\widetilde{J}^{m}(s^{\prime})\left(P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})-\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\right)\mathbb{I}\{\Omega^{m}\} (11)
+m=1M(h=1HmJ~m(sh+1m)sSP(sshm,ahm)J~m(s))𝕀{Ωm}.\displaystyle\qquad+\sum_{m=1}^{M}\left(\sum_{h=1}^{H^{m}}\widetilde{J}^{m}(s_{h+1}^{m})-\sum_{s^{\prime}\in S}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\right)\mathbb{I}\{\Omega^{m}\}. (12)

Eq. 10 accounts for the cost incurred by switching policies each time we visit an unknown state-action pair, and it is bounded by the following lemma.

Lemma B.7.

m=1Mh=1Hm(J~m(shm)J~m(sh+1m))𝕀{Ωm}B|S||A|5000B2|S|cmin2logB|S||A|δcmin+KJπ(sinit).\sum_{m=1}^{M}\sum_{h=1}^{H^{m}}\left(\widetilde{J}^{m}(s_{h}^{m})-\widetilde{J}^{m}(s_{h+1}^{m})\right)\mathbb{I}\{\Omega^{m}\}\leq B_{\star}|S||A|\cdot\frac{5000B_{\star}^{2}|S|}{c_{\text{min}}^{2}}\log\frac{B_{\star}|S||A|}{\delta c_{\text{min}}}+K\cdot J^{\pi^{\star}}(s_{\text{init}}).

Proof.

Note that per interval h=1Hm(J~m(shm)J~m(sh+1m))\sum_{h=1}^{H^{m}}(\widetilde{J}^{m}(s_{h}^{m})-\widetilde{J}^{m}(s_{h+1}^{m})) is a telescopic sum which equals J~m(s1m)J~m(sHm+1m)\widetilde{J}^{m}(s_{1}^{m})-\widetilde{J}^{m}(s_{H^{m}+1}^{m}). Furthermore, for every two consecutive intervals m,m+1m,m+1 one of the following occurs:

  1. (i)

    If interval mm ended in the goal state then J~m(sHm+1m)=J~m(g)=0\widetilde{J}^{m}(s_{H^{m}+1}^{m})=\widetilde{J}^{m}(g)=0 and J~m+1(s1m+1)=J~m+1(sinit).\widetilde{J}^{m+1}(s_{1}^{m+1})=\widetilde{J}^{m+1}(s_{\text{init}}). Thus, using Lemma B.2 for the last inequality,

    J~m+1(s1m+1)𝕀{Ωm+1}J~m(sHm+1m)𝕀{Ωm}=J~m+1(sinit)𝕀{Ωm+1}Jπ(sinit).\widetilde{J}^{m+1}(s_{1}^{m+1})\mathbb{I}\{\Omega^{m+1}\}-\widetilde{J}^{m}(s_{H^{m}+1}^{m})\mathbb{I}\{\Omega^{m}\}=\widetilde{J}^{m+1}(s_{\text{init}})\mathbb{I}\{\Omega^{m+1}\}\leq J^{\pi^{\star}}(s_{\text{init}}).

    This happens at most KK times.

  2. (ii)

    If interval mm ended in an unknown state then

    J~m+1(s1m+1)𝕀{Ωm+1}J~m(sHm+1m)𝕀{Ωm}J~m+1(s1m+1)𝕀{Ωm+1}B.\widetilde{J}^{m+1}(s_{1}^{m+1})\mathbb{I}\{\Omega^{m+1}\}-\widetilde{J}^{m}(s_{H^{m}+1}^{m})\mathbb{I}\{\Omega^{m}\}\leq\widetilde{J}^{m+1}(s_{1}^{m+1})\mathbb{I}\{\Omega^{m+1}\}\leq B_{\star}.

    This happens at most |S||A|5000B2|S|cmin2logB|S||A|δcmin|S||A|\cdot\frac{5000B_{\star}^{2}|S|}{c_{\text{min}}^{2}}\log\frac{B_{\star}|S||A|}{\delta c_{\text{min}}} times. ∎

Lemma B.8 bounds Eq. 11 using techniques borrowed from Jaksch et al. (2010).

Lemma B.8.

It holds that

m=1Mh=1HmsSJ~m(s)(P(sshm,ahm)P~m(sshm,ahm))𝕀{Ωm}10B|S|log|S||A|Tδs,am=1Mnm(s,a)N+m(s,a).\sum_{m=1}^{M}\sum_{h=1}^{H^{m}}\sum_{s^{\prime}\in S}\widetilde{J}^{m}(s^{\prime})\left(P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})-\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\right)\mathbb{I}\{\Omega^{m}\}\leq 10B_{\star}\sqrt{|S|\log\frac{|S||A|T}{\delta}}\sum_{s,a}\sum_{m=1}^{M}\frac{n_{m}(s,a)}{\sqrt{N_{+}^{m}(s,a)}}.
Proof.

Using the definition of the confidence sets we obtain

m=1Mh=1HmsSJ~m(s)\displaystyle\sum_{m=1}^{M}\sum_{h=1}^{H^{m}}\sum_{s^{\prime}\in S}\widetilde{J}^{m}(s^{\prime}) (P(sshm,ahm)P~m(sshm,ahm))𝕀{Ωm}\left(P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})-\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\right)\mathbb{I}\{\Omega^{m}\}
BsSaAm=1Mnm(s,a)P(s,a)P~m(s,a)1𝕀{Ωm}\displaystyle\leq B_{\star}\sum_{s\in S}\sum_{a\in A}\sum_{m=1}^{M}n_{m}(s,a)\lVert P(\cdot\mid s,a)-\widetilde{P}_{m}(\cdot\mid s,a)\rVert_{1}\mathbb{I}\{\Omega^{m}\}
10BsSaAm=1Mnm(s,a)|S|log(|S||A|N+m(s,a)/δ)N+m(s,a)\displaystyle\leq 10B_{\star}\sum_{s\in S}\sum_{a\in A}\sum_{m=1}^{M}n_{m}(s,a)\sqrt{\frac{|S|\log\bigl{(}|S||A|N_{+}^{m}(s,a)/\delta\bigr{)}}{N_{+}^{m}(s,a)}}
10B|S|log|S||A|TδsSaAm=1Mnm(s,a)N+m(s,a).\displaystyle\leq 10B_{\star}\sqrt{|S|\log\frac{|S||A|T}{\delta}}\sum_{s\in S}\sum_{a\in A}\sum_{m=1}^{M}\frac{n_{m}(s,a)}{\sqrt{N_{+}^{m}(s,a)}}.

where the first inequality follows from Hölder's inequality and Lemma B.2, and the second because P~m\widetilde{P}_{m} and PP are both in the confidence set of Eq. 4 when Ωm\Omega^{m} holds. The third inequality follows because N+m(s,a)TN_{+}^{m}(s,a)\leq T. ∎

Lemma B.9 bounds the term in Eq. 12 using Azuma’s concentration inequality.

Lemma B.9.

With probability at least 1δ/21-\delta/2,

m=1M(h=1HmJ~m(sh+1m)sSP(sshm,ahm)J~m(s))𝕀{Ωm}BTlog4Tδ.\sum_{m=1}^{M}\left(\sum_{h=1}^{H^{m}}\widetilde{J}^{m}(s_{h+1}^{m})-\sum_{s^{\prime}\in S}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\right)\mathbb{I}\{\Omega^{m}\}\leq B_{\star}\sqrt{T\log\frac{4T}{\delta}}.
Proof.

Consider the infinite sequence of random variables

Xt=(J~m(sh+1m)sSP(sshm,π~m(shm))J~m(s))𝕀{Ωm},X_{t}=\Biggl{(}\widetilde{J}^{m}(s_{h+1}^{m})-\sum_{s^{\prime}\in S}P(s^{\prime}\mid s_{h}^{m},\tilde{\pi}^{m}(s_{h}^{m}))\widetilde{J}^{m}(s^{\prime})\Biggr{)}\mathbb{I}\{\Omega^{m}\},

where mm is the interval containing time tt, and hh is the index of time step tt within interval mm. Notice that this is a martingale difference sequence, and |Xt|B|X_{t}|\leq B_{\star} by Lemma B.2. Now, we apply anytime Azuma’s inequality (Theorem D.1) to any prefix of the sequence {Xt}t=1\{X_{t}\}_{t=1}^{\infty}. Thus, with probability at least 1δ/21-\delta/2, for every TT:

t=1TXtBTlog4Tδ.\sum_{t=1}^{T}X_{t}\leq B_{\star}\sqrt{T\log\frac{4T}{\delta}}.\qed

B.1.3 Proof of Theorem 3.1

Theorem (restatement of Theorem 3.1).

Suppose that Assumption 2 holds. With probability at least 1δ1-\delta, the regret of Algorithm 1 is bounded as follows:

RK\displaystyle R_{K} =O(B3|S|2|A|KcminlogKB|S||A|δcmin+B3|S|2|A|cmin2log3/2KB|S||A|δcmin).\displaystyle=O\Biggl{(}\sqrt{\frac{B_{\star}^{3}|S|^{2}|A|K}{c_{\text{min}}}}\log\frac{KB_{\star}|S||A|}{\delta c_{\text{min}}}+\frac{B_{\star}^{3}|S|^{2}|A|}{c_{\text{min}}^{2}}\log^{3/2}\frac{KB_{\star}|S||A|}{\delta c_{\text{min}}}\Biggr{)}.
Lemma B.10.

Assume that the number of steps in every interval is at most 24Bcminlog4mδ\frac{24B_{\star}}{c_{\text{min}}}\log\frac{4m}{\delta}. Then for every sSs\in S and aAa\in A,

m=1Mnm(s,a)N+m(s,a)3NM+1(s,a).\sum_{m=1}^{M}\frac{n_{m}(s,a)}{\sqrt{N_{+}^{m}(s,a)}}\leq 3\sqrt{N_{M+1}(s,a)}\ .
Proof.

We claim that, by the assumption of the lemma, for every interval mm we have that nm(s,a)N+m(s,a)n_{m}(s,a)\leq N_{+}^{m}(s,a). Indeed, if (s,a)(s,a) is unknown then nm(s,a)=1n_{m}(s,a)=1 and since N+m(s,a)1N_{+}^{m}(s,a)\geq 1 the claim follows. If (s,a)(s,a) is known then N+m(s,a)5000B2|S|cmin2logB|S||A|δcminN_{+}^{m}(s,a)\geq\frac{5000B_{\star}^{2}|S|}{c_{\text{min}}^{2}}\log\frac{B_{\star}|S||A|}{\delta c_{\text{min}}} and by our assumption the length of the interval, and in particular nm(s,a)n_{m}(s,a), is at most 24Bcminlog4mδ\frac{24B_{\star}}{c_{\text{min}}}\log\frac{4m}{\delta}. Our statement then follows by Jaksch et al. (2010, Lemma 19). ∎
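
The inequality borrowed from Jaksch et al. (2010, Lemma 19) is easy to check numerically. The sketch below (only a sanity check, not part of the proof) draws random per-interval visit counts satisfying n_m ≤ N_+^m and verifies the stated bound with the constant 3.

import random

random.seed(0)
for _ in range(1000):
    N = 0                        # visits accumulated before the current interval, N_m
    total = 0.0                  # running value of sum_m n_m / sqrt(N_+^m)
    for _ in range(200):         # 200 intervals
        n_m = random.randint(0, max(N, 1))   # n_m <= N_+^m = max{N_m, 1}
        total += n_m / max(N, 1) ** 0.5
        N += n_m
    assert total <= 3 * N ** 0.5 + 1e-9, (total, N)
print("sum_m n_m / sqrt(N_+^m) <= 3 sqrt(N_{M+1}) in all sampled runs")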

Proof of Theorem 3.1.

With probability at least 1δ1-\delta, both Lemmas 3.3 and B.9 hold. Lemma 3.3 states that the cost of every interval is at most 24Blog4mδ24B_{\star}\log\frac{4m}{\delta}; since every instantaneous cost is at least cminc_{\text{min}}, the length of every interval is at most 24Bcminlog4mδ\frac{24B_{\star}}{c_{\text{min}}}\log\frac{4m}{\delta}, and Lemma B.10 obtains

sSaAm=1Mnm(s,a)N+m(s,a)3(s,a)S×ANM+1(s,a)3|S||A|T,\displaystyle\sum_{s\in S}\sum_{a\in A}\sum_{m=1}^{M}\frac{n_{m}(s,a)}{\sqrt{N_{+}^{m}(s,a)}}\leq 3\sum_{(s,a)\in S\times A}\sqrt{N_{M+1}(s,a)}\leq 3\sqrt{|S||A|T}, (13)

where the last inequality follows from Jensen’s inequality and the fact that (s,a)S×ANM+1(s,a)T\sum_{(s,a)\in S\times A}N_{M+1}(s,a)\leq T. Next, we sum the bounds of Lemmas B.7, B.8 and B.9 and use Eq. 13 to obtain

RK\displaystyle R_{K} 5000B3|S|2|A|cmin2logB|S||A|δcmin+30B|S||A|Tlog|S||A|Tδ+BTlog4Tδ.\displaystyle\leq 5000\frac{B_{\star}^{3}|S|^{2}|A|}{c_{\text{min}}^{2}}\log\frac{B_{\star}|S||A|}{\delta c_{\text{min}}}+30B_{\star}|S|\sqrt{|A|T\log\frac{|S||A|T}{\delta}}+B_{\star}\sqrt{T\log\frac{4T}{\delta}}.

To finish the proof use Lemma 3.3 to bound TT. ∎

B.2 Proofs for Section 4.1

B.2.1 Proof of Lemma 4.2

Lemma (restatement of Lemma 4.2).

With probability at least 1δ/21-\delta/2, Ωm\Omega^{m} holds for all intervals mm simultaneously.

Proof.

Fix a triplet (s,a,s)S×A×S+(s,a,s^{\prime})\in S\times A\times S^{+}. Consider an infinite sequence (Zi)i=1(Z_{i})_{i=1}^{\infty} of draws from the distribution P(s,a)P(\cdot\mid s,a) and let Xi=𝕀{Zi=s}X_{i}=\mathbb{I}\{Z_{i}=s^{\prime}\}. We apply Eq. 25 of Theorem D.3 with δt=δ4|S|2|A|t2\delta_{t}=\tfrac{\delta}{4|S|^{2}|A|t^{2}} to a prefix of length tt of the sequence (Xi)i=1(X_{i})_{i=1}^{\infty}. Then divide Eq. 25 by tt and obtain that, after simplifying using the assumptions that |S|2|S|\geq 2 and |A|2|A|\geq 2, Eq. 6 holds with probability 1δt1-\delta_{t}. We repeat this argument for every prefix (Zi)i=1t(Z_{i})_{i=1}^{t} of (Zi)i=1(Z_{i})_{i=1}^{\infty} and for every state-action-state triplet. Then from the union bound we get that Ωm\Omega^{m} holds for all intervals mm simultaneously with probability at least 1δ/21-\delta/2. ∎

B.2.2 Proof of Lemma 4.3

Lemma (restatement of Lemma 4.3).

It holds that

R~M\displaystyle\widetilde{R}_{M} =m=1M(h=1HmJ~m(shm)J~m(sh+1m))𝕀{Ωm}KJπ(sinit)\displaystyle=\sum_{m=1}^{M}\Biggl{(}\sum_{h=1}^{H^{m}}\widetilde{J}^{m}(s_{h}^{m})-\widetilde{J}^{m}(s_{h+1}^{m})\Biggr{)}\mathbb{I}\{\Omega^{m}\}-K\cdot J^{\pi^{\star}}(s_{\text{init}})
+m=1M(h=1HmJ~m(sh+1m)sSP~m(sshm,ahm)J~m(s))𝕀{Ωm}.\displaystyle\qquad+\sum_{m=1}^{M}\Biggl{(}\sum_{h=1}^{H^{m}}\widetilde{J}^{m}(s_{h+1}^{m})-\sum_{s^{\prime}\in S}\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\Biggr{)}\mathbb{I}\{\Omega^{m}\}.
Lemma B.11.

Let mm be an interval. If Ωm\Omega^{m} holds then π~m\tilde{\pi}^{m} satisfies the Bellman equations in the optimistic model:

J~m(s)=c(s,π~m(s))+sSP~m(ss,π~m(s))J~m(s),sS.\widetilde{J}^{m}(s)=c(s,\tilde{\pi}^{m}(s))+\sum_{s^{\prime}\in S}\widetilde{P}_{m}(s^{\prime}\mid s,\tilde{\pi}^{m}(s))\widetilde{J}^{m}(s^{\prime}),\quad\forall s\in S.
Proof.

Note that the Bellman equations hold in the optimistic model since, by the way we defined this model, there is a nonzero probability of transitioning to the goal state from every state under any action. Thus, in the optimistic model every policy is a proper policy and, in particular, Lemma 2.2 holds. ∎

Proof of Lemma 4.3.

By Lemma B.11, we can use the Bellman equations in the optimistic model to obtain the following interpretation of the cost for every interval mm and time hh:

c(shm,ahm)𝕀{Ωm}\displaystyle c(s_{h}^{m},a_{h}^{m})\mathbb{I}\{\Omega^{m}\} =(J~m(shm)sSP~m(sshm,ahm)J~m(s))𝕀{Ωm}\displaystyle=\Biggl{(}\widetilde{J}^{m}(s_{h}^{m})-\sum_{s^{\prime}\in S}\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\Biggr{)}\mathbb{I}\{\Omega^{m}\}
=(J~m(shm)J~m(sh+1m))𝕀{Ωm}+(J~m(sh+1m)sSP~m(sshm,ahm)J~m(s))𝕀{Ωm}.\displaystyle=\Biggl{(}\widetilde{J}^{m}(s_{h}^{m})-\widetilde{J}^{m}(s_{h+1}^{m})\Biggr{)}\mathbb{I}\{\Omega^{m}\}+\Biggl{(}\widetilde{J}^{m}(s_{h+1}^{m})-\sum_{s^{\prime}\in S}\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\Biggr{)}\mathbb{I}\{\Omega^{m}\}. (14)

We now write R~M=m=1Mh=1Hmc(shm,ahm)𝕀{Ωm}KJπ(sinit),\widetilde{R}_{M}=\sum_{m=1}^{M}\sum_{h=1}^{H^{m}}c(s_{h}^{m},a_{h}^{m})\mathbb{I}\{\Omega^{m}\}-K\cdot J^{\pi^{\star}}(s_{\text{init}}), and substitute for each cost using Eq. 14 to get the lemma. ∎

B.2.3 Proof of Lemma 4.4

Lemma (restatement of Lemma 4.4).

m=1M(h=1HmJ~m(shm)J~m(sh+1m))𝕀{Ωm}KJπ(sinit)2B|S||A|logT.\sum_{m=1}^{M}\bigl{(}\sum_{h=1}^{H^{m}}\widetilde{J}^{m}(s_{h}^{m})-\widetilde{J}^{m}(s_{h+1}^{m})\bigr{)}\mathbb{I}\{\Omega^{m}\}-K\cdot J^{\pi^{\star}}(s_{\text{init}})\leq 2B_{\star}|S||A|\log T.

Lemma B.12.

Let mm be an interval. If Ωm\Omega^{m} holds then J~m(s)Jπ(s)B\widetilde{J}^{m}(s)\leq J^{\pi^{\star}}(s)\leq B_{\star} for every sSs\in S.

Proof.

Denote by P~\widetilde{P} the transition function computed by Algorithm 2 at the beginning of epoch i(m)i(m), and by J~\widetilde{J} the cost-to-go with respect to P~\widetilde{P}. We claim that for every proper policy π\pi and state sSs\in S, J~π(s)Jπ(s)\widetilde{J}^{\pi}(s)\leq J^{\pi}(s). Then, the lemma follows easily since J~m(s)J~π(s)Jπ(s)B\widetilde{J}^{m}(s)\leq\widetilde{J}^{\pi^{\star}}(s)\leq J^{\pi^{\star}}(s)\leq B_{\star}.

Indeed, let sSs\in S and consider the Bellman equations of π\pi with respect to PP:

Jπ(s)=c(s,π(s))+sSP(ss,π(s))Jπ(s)c(s,π(s))+sSP~(ss,π(s))Jπ(s),\displaystyle J^{\pi}(s)=c(s,\pi(s))+\sum_{s^{\prime}\in S}P(s^{\prime}\mid s,\pi(s))J^{\pi}(s^{\prime})\geq c(s,\pi(s))+\sum_{s^{\prime}\in S}\widetilde{P}(s^{\prime}\mid s,\pi(s))J^{\pi}(s^{\prime}),

where the inequality follows because P~(ss,a)P(ss,a)\widetilde{P}(s^{\prime}\mid s,a)\leq P(s^{\prime}\mid s,a) for every (s,a,s)S×A×S(s,a,s^{\prime})\in S\times A\times S. This holds since PP is in the confidence set of Eq. 6 (as Ωm\Omega^{m} holds), and by the way P~\widetilde{P} is computed in Algorithm 2. Therefore, by Lemma 2.1 we obtain that Jπ(s)J~π(s)J^{\pi}(s)\geq\widetilde{J}^{\pi}(s) for every sSs\in S as required. ∎

Proof of Lemma 4.4.

For every two consecutive intervals m,m+1m,m+1, denoting i=i(m)i=i(m), we have one of the following:

  1. (i)

    If interval mm ended in the goal state then J~i(m)(sHm+1m)=J~i(m)(g)=0\widetilde{J}^{i(m)}(s_{H^{m}+1}^{m})=\widetilde{J}^{i(m)}(g)=0 and J~i(m+1)(s1m+1)=J~i(m+1)(sinit).\widetilde{J}^{i(m+1)}(s_{1}^{m+1})=\widetilde{J}^{i(m+1)}(s_{\text{init}}). Therefore, by Lemma B.12,

    J~i(m+1)(s1m+1)𝕀{Ωm+1}J~i(m)(sHm+1m)𝕀{Ωm}=J~i(m+1)(sinit)𝕀{Ωm+1}Jπ(sinit).\widetilde{J}^{i(m+1)}(s_{1}^{m+1})\mathbb{I}\{\Omega^{m+1}\}-\widetilde{J}^{i(m)}(s_{H^{m}+1}^{m})\mathbb{I}\{\Omega^{m}\}=\widetilde{J}^{i(m+1)}(s_{\text{init}})\mathbb{I}\{\Omega^{m+1}\}\leq J^{\pi^{\star}}(s_{\text{init}}).

    This happens at most KK times.

  2. (ii)

    If interval mm ended because it reached an unknown state-action pair or because its accumulated cost reached BB_{\star}, and we stay in the same epoch, then i(m)=i(m+1)=ii(m)=i(m+1)=i and s1m+1=sHm+1ms_{1}^{m+1}=s_{H^{m}+1}^{m}. Thus

    J~i(m+1)(s1m+1)𝕀{Ωm+1}J~i(m)(sHm+1m)𝕀{Ωm}=J~i(s1m+1)𝕀{Ωm}J~i(sHm+1m)𝕀{Ωm}=0.\widetilde{J}^{i(m+1)}(s_{1}^{m+1})\mathbb{I}\{\Omega^{m+1}\}-\widetilde{J}^{i(m)}(s_{H^{m}+1}^{m})\mathbb{I}\{\Omega^{m}\}=\widetilde{J}^{i}(s_{1}^{m+1})\mathbb{I}\{\Omega^{m}\}-\widetilde{J}^{i}(s_{H^{m}+1}^{m})\mathbb{I}\{\Omega^{m}\}=0.
  3. (iii)

    If interval mm ended by doubling the visit count to some state-action pair, then we start a new epoch. Thus by Lemma B.12,

    J~i(m+1)(s1m+1)𝕀{Ωm+1}J~i(m)(sHm+1m)𝕀{Ωm}J~i+1(s1m+1)𝕀{Ωm+1}B.\widetilde{J}^{i(m+1)}(s_{1}^{m+1})\mathbb{I}\{\Omega^{m+1}\}-\widetilde{J}^{i(m)}(s_{H^{m}+1}^{m})\mathbb{I}\{\Omega^{m}\}\leq\widetilde{J}^{i+1}(s_{1}^{m+1})\mathbb{I}\{\Omega^{m+1}\}\leq B_{\star}.

    This happens at most 2|S||A|logT2|S||A|\log T times.

To conclude, we have

m=1M(h=1HmJ~i(m)(shm)J~i(m)(sh+1m))𝕀{Ωm}KJπ(sinit)\displaystyle\sum_{m=1}^{M}\Biggl{(}\sum_{h=1}^{H^{m}}\widetilde{J}^{i(m)}(s_{h}^{m})-\widetilde{J}^{i(m)}(s_{h+1}^{m})\Biggr{)}\mathbb{I}\{\Omega^{m}\}-KJ^{\pi^{\star}}(s_{\text{init}}) KJπ(sinit)+2B|S||A|logTKJπ(sinit)\displaystyle\leq KJ^{\pi^{\star}}(s_{\text{init}})+2B_{\star}|S||A|\log T-KJ^{\pi^{\star}}(s_{\text{init}})
=2B|S||A|logT.\displaystyle=2B_{\star}|S||A|\log{T}.\qed

B.2.4 Proof of Lemma 4.5

Lemma (restatement of Lemma 4.5).

With probability at least 1δ/41-\delta/4, the following holds for all M=1,2,M=1,2,\ldots simultaneously.

m=1M(h=1HmJ~m(sh+1m)sSP~m(sshm,ahm)J~m(s))𝕀{Ωm}\displaystyle\sum_{m=1}^{M}\Biggl{(}\sum_{h=1}^{H^{m}}\widetilde{J}^{m}(s_{h+1}^{m})-\sum_{s^{\prime}\in S}\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\Biggr{)}\mathbb{I}\{\Omega^{m}\}
m=1M𝔼[(h=1HmJ~m(sh+1m)sSP~m(sshm,ahm)J~m(s))𝕀{Ωm}U¯m1]+3BMlog8Mδ.\displaystyle\qquad\leq\sum_{m=1}^{M}\mathbb{E}\Biggl{[}\Biggl{(}\sum_{h=1}^{H^{m}}\widetilde{J}^{m}(s_{h+1}^{m})-\sum_{s^{\prime}\in S}\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\Biggr{)}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}+3B_{\star}\sqrt{M\log\frac{8M}{\delta}}.
Proof.

Consider the following martingale difference sequence (Xm)m=1(X^{m})_{m=1}^{\infty} defined by

Xm=h=1Hm(J~m(sh+1m)sSP~m(sshm,ahm)J~m(s))𝕀{Ωm}.X^{m}=\sum_{h=1}^{H^{m}}\bigl{(}\widetilde{J}^{m}(s_{h+1}^{m})-\sum_{s^{\prime}\in S}\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\bigr{)}\mathbb{I}\{\Omega^{m}\}.

The Bellman optimality equations of π~m\tilde{\pi}^{m} with respect to P~m\widetilde{P}_{m} (Lemma B.11) obtains

|Xm|\displaystyle|X^{m}| =|(J~m(sHm+1m)J~m(s1m)B+h=1Hmc(shm,ahm)2B)𝕀{Ωm}|3B,\displaystyle=\biggl{|}\biggl{(}\underbrace{\widetilde{J}^{m}(s_{H^{m}+1}^{m})-\widetilde{J}^{m}(s_{1}^{m})}_{\leq B_{\star}}+\underbrace{\sum_{h=1}^{H^{m}}c(s_{h}^{m},a_{h}^{m})}_{\leq 2B_{\star}}\biggr{)}\mathbb{I}\{\Omega^{m}\}\biggr{|}\leq 3B_{\star},

where the inequality follows from Lemma B.12 and the fact that the total cost within each interval is at most 2B2B_{\star} by construction. Therefore, we use anytime Azuma’s inequality (Theorem D.1) to obtain that with probability at least 1δ/41-\delta/4:

m=1MXmm=1M𝔼[XmU¯m1]+3BMlog8Mδ.\displaystyle\sum_{m=1}^{M}X^{m}\leq\sum_{m=1}^{M}\mathbb{E}\bigl{[}X^{m}\mid\bar{U}^{m-1}\bigr{]}+3B_{\star}\sqrt{M\log\frac{8M}{\delta}}.\qquad\qed

B.2.5 Proof of Lemma 4.6

Lemma (restatement of Lemma 4.6).

For every interval mm and time hh, denote Ahm=log(|S||A|N+m(shm,ahm)/δ)N+m(shm,ahm).A_{h}^{m}=\tfrac{\log(|S||A|N_{+}^{m}(s_{h}^{m},a_{h}^{m})/\delta)}{N_{+}^{m}(s_{h}^{m},a_{h}^{m})}. Then,

𝔼[(h=1HmJ~m(sh+1m)sSP~m(sshm,ahm)J~m(s))𝕀{Ωm}U¯m1]\displaystyle\mathbb{E}\Biggl{[}\Biggl{(}\sum_{h=1}^{H^{m}}\widetilde{J}^{m}(s_{h+1}^{m})-\sum_{s^{\prime}\in S}\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\Biggr{)}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}
16𝔼[h=1Hm|S|𝕍hmAhm𝕀{Ωm}|U¯m1]+272𝔼[h=1HmB|S|Ahm𝕀{Ωm}|U¯m1],\displaystyle\qquad\leq 16\cdot\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}\sqrt{|S|\mathbb{V}_{h}^{m}A_{h}^{m}}\mathbb{I}\{\Omega^{m}\}\biggm{|}\bar{U}^{m-1}\Biggr{]}+272\cdot\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}B_{\star}|S|A_{h}^{m}\mathbb{I}\{\Omega^{m}\}\biggm{|}\bar{U}^{m-1}\Biggr{]},

where 𝕍hm\mathbb{V}_{h}^{m} is the empirical variance defined as

𝕍hm=sS+P(sshm,ahm)(J~m(s)s′′S+P(s′′shm,ahm)J~m(s′′))2.\mathbb{V}_{h}^{m}=\sum_{s^{\prime}\in S^{+}}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\Biggl{(}\widetilde{J}^{m}(s^{\prime})-\sum_{s^{\prime\prime}\in S^{+}}P(s^{\prime\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime\prime})\Biggr{)}^{2}.

The next lemma gives a different interpretation to the confidence bounds of Eq. 6, and will be useful in the proofs that follow.

Lemma B.13.

Denote Ahm=log(|S||A|N+m(s,a)/δ)/N+m(s,a)A_{h}^{m}=\log(|S||A|N_{+}^{m}(s,a)/\delta)/N_{+}^{m}(s,a). When Ωm\Omega^{m} holds we have for any (s,a,s)S×A×S+(s,a,s^{\prime})\in S\times A\times S^{+}:

|P(ss,a)P~m(ss,a)|8P(ss,a)Ahm+136Ahm.\bigl{|}P(s^{\prime}\mid s,a)-\widetilde{P}_{m}(s^{\prime}\mid s,a)\bigr{|}\leq 8\sqrt{P(s^{\prime}\mid s,a)A_{h}^{m}}+136A_{h}^{m}.
Proof.

Since Ωm\Omega^{m} holds we have for all (s,a,s)S×A×S+(s,a,s^{\prime})\in S\times A\times S^{+} that

P¯m(ss,a)P(ss,a)4P¯m(ss,a)Ahm+28Ahm.\bar{P}_{m}(s^{\prime}\mid s,a)-P(s^{\prime}\mid s,a)\leq 4\sqrt{\bar{P}_{m}(s^{\prime}\mid s,a)A_{h}^{m}}+28A_{h}^{m}.

This is a quadratic inequality in P¯m(ss,a)\sqrt{\bar{P}_{m}(s^{\prime}\mid s,a)}. Using the fact that x2ax+bx^{2}\leq a\cdot x+b implies xa+bx\leq a+\sqrt{b} with a=4Ahma=4\sqrt{A_{h}^{m}} and b=P(ss,a)+28Ahmb=P(s^{\prime}\mid s,a)+28A_{h}^{m}, we have

P¯m(ss,a)4Ahm+P(ss,a)+28AhmP(ss,a)+10Ahm,\sqrt{\bar{P}_{m}(s^{\prime}\mid s,a)}\leq 4\sqrt{A_{h}^{m}}+\sqrt{P(s^{\prime}\mid s,a)+28A_{h}^{m}}\leq\sqrt{P(s^{\prime}\mid s,a)}+10\sqrt{A_{h}^{m}},

where we used the inequality x+yx+y\sqrt{x+y}\leq\sqrt{x}+\sqrt{y} that holds for any x0x\geq 0 and y0y\geq 0. Substituting back into Eq. 6 obtains

|P(ss,a)P¯m(ss,a)|4P(ss,a)Ahm+68Ahm.\bigl{|}P(s^{\prime}\mid s,a)-\bar{P}_{m}(s^{\prime}\mid s,a)\bigr{|}\leq 4\sqrt{P(s^{\prime}\mid s,a)A_{h}^{m}}+68A_{h}^{m}.

From a similar argument

|P~m(ss,a)P¯m(ss,a)|4P(ss,a)Ahm+68Ahm.\bigl{|}\widetilde{P}_{m}(s^{\prime}\mid s,a)-\bar{P}_{m}(s^{\prime}\mid s,a)\bigr{|}\leq 4\sqrt{P(s^{\prime}\mid s,a)A_{h}^{m}}+68A_{h}^{m}.

Using the triangle inequality finishes the proof. ∎

Proof of Lemma 4.6.

Denote Xm=(h=1HmJ~m(sh+1m)sSP~m(sshm,ahm)J~m(s))𝕀{Ωm}X^{m}=\bigl{(}\sum_{h=1}^{H^{m}}\widetilde{J}^{m}(s_{h+1}^{m})-\sum_{s^{\prime}\in S}\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\bigr{)}\mathbb{I}\{\Omega^{m}\}, and Zhm=(J~m(sh+1m)sSP(sshm,ahm)J~m(s))𝕀{Ωm}Z_{h}^{m}=\bigl{(}\widetilde{J}^{m}(s_{h+1}^{m})-\sum_{s^{\prime}\in S}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\bigr{)}\mathbb{I}\{\Omega^{m}\}. Think of the interval as an infinite stochastic process, and note that, conditioned on U¯m1\bar{U}^{m-1}, (Zhm)h=1\bigl{(}Z_{h}^{m}\bigr{)}_{h=1}^{\infty} is a martingale difference sequence w.r.t. (Uh)h=1(U^{h})_{h=1}^{\infty}, where UhU^{h} is the trajectory of the learner from the beginning of the interval and up to and including time hh. This holds since, by conditioning on U¯m1\bar{U}^{m-1}, Ωm\Omega^{m} is determined and is independent of the randomness generated during the interval. Note that HmH^{m} is a stopping time with respect to (Zhm)h=1(Z_{h}^{m})_{h=1}^{\infty} which is bounded by 2B/cmin2B_{\star}/c_{\text{min}}. Hence, by the optional stopping theorem, 𝔼[h=1HmZhmU¯m1]=0,\mathbb{E}[\sum_{h=1}^{H^{m}}Z_{h}^{m}\mid\bar{U}^{m-1}]=0, which gets us

𝔼[XmU¯m1]\displaystyle\mathbb{E}[X^{m}\mid\bar{U}^{m-1}] =𝔼[h=1Hm(J~m(sh+1m)sSP~m(sshm,ahm)J~m(s))𝕀{Ωm}U¯m1]\displaystyle=\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}\biggl{(}\widetilde{J}^{m}(s_{h+1}^{m})-\sum_{s^{\prime}\in S}\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\biggr{)}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}
=𝔼[h=1HmZhmU¯m1]+𝔼[h=1HmsS(P(sshm,ahm)P~m(sshm,ahm))J~m(s)𝕀{Ωm}U¯m1]\displaystyle=\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}Z_{h}^{m}\mid\bar{U}^{m-1}\Biggr{]}+\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}\sum_{s^{\prime}\in S}\bigl{(}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})-\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\bigr{)}\widetilde{J}^{m}(s^{\prime})\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}
=𝔼[h=1HmsS(P(sshm,ahm)P~m(sshm,ahm))J~m(s)𝕀{Ωm}U¯m1].\displaystyle=\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}\sum_{s^{\prime}\in S}\bigl{(}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})-\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\bigr{)}\widetilde{J}^{m}(s^{\prime})\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}.

Furthermore, we have

𝔼[h=1HmsS(P(sshm,ahm)P~m(sshm,ahm))J~m(s)𝕀{Ωm}U¯m1]\displaystyle\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}\sum_{s^{\prime}\in S}\bigl{(}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})-\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\bigr{)}\widetilde{J}^{m}(s^{\prime})\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}
=𝔼[h=1HmsS+(P(sshm,ahm)P~m(sshm,ahm))(J~m(s)s′′S+P(s′′shm,ahm)J~m(s′′))𝕀{Ωm}U¯m1]\displaystyle\quad=\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}\sum_{s^{\prime}\in S^{+}}\biggl{(}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})-\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\biggr{)}\biggl{(}\widetilde{J}^{m}(s^{\prime})-\sum_{s^{\prime\prime}\in S^{+}}P(s^{\prime\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime\prime})\biggr{)}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}
𝔼[8h=1HmsS+AhmP(sshm,ahm)(J~m(s)s′′S+P(s′′shm,ahm)J~m(s′′))2𝕀{Ωm}U¯m1]\displaystyle\quad\leq\mathbb{E}\Biggl{[}8\sum_{h=1}^{H^{m}}\sum_{s^{\prime}\in S^{+}}\sqrt{A_{h}^{m}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\Biggl{(}\widetilde{J}^{m}(s^{\prime})-\sum_{s^{\prime\prime}\in S^{+}}P(s^{\prime\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime\prime})\Biggr{)}^{2}}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}
+𝔼[136h=1HmsS+Ahm|J~m(s)s′′S+P(s′′shm,ahm)J~m(s′′)|𝕀{Ωm}U¯m1]\displaystyle\quad\qquad+\mathbb{E}\Biggl{[}136\sum_{h=1}^{H^{m}}\sum_{s^{\prime}\in S^{+}}A_{h}^{m}\Biggl{|}\widetilde{J}^{m}(s^{\prime})-\sum_{s^{\prime\prime}\in S^{+}}P(s^{\prime\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime\prime})\Biggr{|}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}
𝔼[16h=1Hm|S|𝕍hmAhm𝕀{Ωm}+272|S|BAhm𝕀{Ωm}U¯m1],\displaystyle\quad\leq\mathbb{E}\Biggl{[}16\sum_{h=1}^{H^{m}}\sqrt{|S|\mathbb{V}_{h}^{m}A_{h}^{m}}\mathbb{I}\{\Omega^{m}\}+272|S|B_{\star}A_{h}^{m}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]},

where the first equality follows since J~m(g)=0\widetilde{J}^{m}(g)=0, and P(shm,ahm)P(\cdot\mid s_{h}^{m},a_{h}^{m}) and P~m(shm,ahm)\widetilde{P}_{m}(\cdot\mid s_{h}^{m},a_{h}^{m}) are probability distributions over S+S^{+}, whence s′′S+P(s′′shm,ahm)J~m(s′′)\sum_{s^{\prime\prime}\in S^{+}}P(s^{\prime\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime\prime}) does not depend on ss^{\prime}. The first inequality follows from Lemma B.13, and the second inequality from Jensen’s inequality, Lemma B.12, |S+|2|S||S^{+}|\leq 2|S|, and the definition of 𝕍hm\mathbb{V}_{h}^{m}. ∎

B.2.6 Proof of Lemma 4.7

Lemma (restatement of Lemma 4.7).

For any interval mm, 𝔼[h=1Hm𝕍hm𝕀{Ωm}U¯m1]44B2.\mathbb{E}\bigl{[}\sum_{h=1}^{H^{m}}\mathbb{V}_{h}^{m}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\bigr{]}\leq 44B_{\star}^{2}.

Lemma B.14.

Let mm be an interval and (s,a)(s,a) be a known state-action pair. If Ωm\Omega^{m} holds then for every sS+s^{\prime}\in S^{+}

|P~m(ss,a)P(ss,a)|18cminP(ss,a)|S|B+cmin4|S|B.\bigl{|}\widetilde{P}_{m}\bigl{(}s^{\prime}\mid s,a\bigr{)}-P\bigl{(}s^{\prime}\mid s,a\bigr{)}\bigr{|}\leq\frac{1}{8}\sqrt{\frac{c_{\text{min}}\cdot P\bigl{(}s^{\prime}\mid s,a\bigr{)}}{|S|B_{\star}}}+\frac{c_{\text{min}}}{4|S|B_{\star}}.
Proof.

By Lemma B.13 we have that

|P~m(ss,a)P(ss,a)|\displaystyle\bigl{|}\widetilde{P}_{m}\bigl{(}s^{\prime}\mid s,a\bigr{)}-P\bigl{(}s^{\prime}\mid s,a\bigr{)}\bigr{|} 8P(ss,a)log(|S||A|N+m(s,a)/δ)N+m(s,a)+136log(|S||A|N+m(s,a)/δ)N+m(s,a)\displaystyle\leq 8\sqrt{\frac{P(s^{\prime}\mid s,a)\log\bigl{(}|S||A|N_{+}^{m}(s,a)/\delta\bigr{)}}{N_{+}^{m}(s,a)}}+\frac{136\log\bigl{(}|S||A|N_{+}^{m}(s,a)/\delta\bigr{)}}{N_{+}^{m}(s,a)}

which gives the required bound because log(x)/x\log(x)/x is decreasing, and (s,a)(s,a) is a known state-action pair so N+m(s,a)30000B|S|cminlogB|S||A|δcminN_{+}^{m}(s,a)\geq 30000\cdot\frac{B_{\star}|S|}{c_{\text{min}}}\log\frac{B_{\star}|S||A|}{\delta c_{\text{min}}}. ∎
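
The last step, which plugs the threshold that makes (s,a) known into the bound of Lemma B.13, can be sanity-checked numerically. The sketch below uses arbitrary example values of B⋆, |S|, |A|, c_min, δ and p = P(s'|s,a) (all of them assumptions of the sketch, not values from the paper) and verifies the claimed inequality for these values.

import math

B_star, S, A, c_min, delta = 2.0, 5, 4, 0.1, 0.1    # arbitrary example values
N = 30000 * B_star * S / c_min * math.log(B_star * S * A / (delta * c_min))
A_val = math.log(S * A * N / delta) / N             # A_h^m at the known threshold

for p in (0.0, 1e-4, 0.01, 0.5, 1.0):               # p stands for P(s' | s, a)
    lhs = 8 * math.sqrt(p * A_val) + 136 * A_val              # bound of Lemma B.13
    rhs = math.sqrt(c_min * p / (S * B_star)) / 8 + c_min / (4 * S * B_star)
    assert lhs <= rhs, (p, lhs, rhs)
print("Lemma B.13 bound is below the Lemma B.14 bound at the known threshold")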

Proof of Lemma 4.7.

Note that the first state-action pair in the interval, (s1m,a1m)(s_{1}^{m},a_{1}^{m}), might be unknown and that all state-action pairs that appear afterwards are known. Thus, we bound

𝔼[h=1Hm𝕍hm𝕀{Ωm}U¯m1]\displaystyle\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}\mathbb{V}_{h}^{m}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]} =𝔼[𝕍1m𝕀{Ωm}U¯m1]+𝔼[h=2Hm𝕍hm𝕀{Ωm}U¯m1].\displaystyle=\mathbb{E}\Biggl{[}\mathbb{V}_{1}^{m}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}+\mathbb{E}\Biggl{[}\sum_{h=2}^{H^{m}}\mathbb{V}_{h}^{m}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}.

The first summand is trivially bounded by B2B_{\star}^{2} (Lemma B.12). We now upper bound 𝔼[h=2Hm𝕍hm𝕀{Ωm}U¯m1].\mathbb{E}\bigl{[}\sum_{h=2}^{H^{m}}\mathbb{V}_{h}^{m}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\bigr{]}. Denote Zhm=(J~m(sh+1m)sSP(sshm,ahm)J~m(s))𝕀{Ωm}Z_{h}^{m}=\bigl{(}\widetilde{J}^{m}(s_{h+1}^{m})-\sum_{s^{\prime}\in S}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\bigr{)}\mathbb{I}\{\Omega^{m}\}, and think of the interval as an infinite stochastic process. Note that, conditioned on U¯m1\bar{U}^{m-1}, (Zhm)h=1\bigl{(}Z_{h}^{m}\bigr{)}_{h=1}^{\infty} is a martingale difference sequence w.r.t. (Uh)h=1(U^{h})_{h=1}^{\infty}, where UhU^{h} is the trajectory of the learner from the beginning of the interval and up to and including time hh. This holds since, by conditioning on U¯m1\bar{U}^{m-1}, Ωm\Omega^{m} is determined and is independent of the randomness generated during the interval. Note that HmH^{m} is a stopping time with respect to (Zhm)h=1(Z_{h}^{m})_{h=1}^{\infty} which is bounded by 2B/cmin2B_{\star}/c_{\text{min}}. Therefore, applying Lemma B.15 below obtains

𝔼[h=2Hm𝕍hm𝕀{Ωm}U¯m1]=𝔼[(h=2HmZhm𝕀{Ωm})2U¯m1].\mathbb{E}\Biggl{[}\sum_{h=2}^{H^{m}}\mathbb{V}_{h}^{m}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}=\mathbb{E}\Biggl{[}\Biggl{(}\sum_{h=2}^{H^{m}}Z_{h}^{m}\mathbb{I}\{\Omega^{m}\}\Biggr{)}^{2}\mid\bar{U}^{m-1}\Biggr{]}. (15)

We now proceed by bounding |h=2HmZhm||\sum_{h=2}^{H^{m}}Z_{h}^{m}| when Ωm\Omega^{m} occurs. Indeed,

|h=2HmZhm|\displaystyle\Biggl{|}\sum_{h=2}^{H^{m}}Z_{h}^{m}\Biggr{|} =|h=2HmJ~m(sh+1m)sSP(sshm,ahm)J~m(s)|\displaystyle=\Biggl{|}\sum_{h=2}^{H^{m}}\widetilde{J}^{m}(s_{h+1}^{m})-\sum_{s^{\prime}\in S}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\Biggr{|}
|h=2HmJ~m(sh+1m)J~m(shm)|\displaystyle\leq\Biggl{|}\sum_{h=2}^{H^{m}}\widetilde{J}^{m}(s_{h+1}^{m})-\widetilde{J}^{m}(s_{h}^{m})\Biggr{|} (16)
+|h=2HmJ~m(shm)sSP~m(sshm,ahm)J~m(s)|\displaystyle\qquad+\Biggl{|}\sum_{h=2}^{H^{m}}\widetilde{J}^{m}(s_{h}^{m})-\sum_{s^{\prime}\in S}\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime})\Biggr{|} (17)
+|h=2HmsS+(P(sshm,ahm)P~m(sshm,ahm))(J~m(s)s′′S+P(s′′shm,ahm)J~m(s′′))|,\displaystyle\qquad+\Biggl{|}\sum_{h=2}^{H^{m}}\sum_{s^{\prime}\in S^{+}}\Bigl{(}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})-\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\Bigr{)}\Bigl{(}\widetilde{J}^{m}(s^{\prime})-\sum_{s^{\prime\prime}\in S^{+}}P(s^{\prime\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime\prime})\Bigr{)}\Biggr{|}, (18)

where Eq. 18 is given as P(shm,ahm)P(\cdot\mid s_{h}^{m},a_{h}^{m}) and P~i(shm,ahm)\widetilde{P}_{i}(\cdot\mid s_{h}^{m},a_{h}^{m}) are probability distributions over S+S^{+}, s′′S+P(s′′shm,ahm)J~m(s′′)\sum_{s^{\prime\prime}\in S^{+}}P(s^{\prime\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime\prime}) is constant w.r.t ss^{\prime}, and J~m(g)=0\widetilde{J}^{m}(g)=0.

We now bound each of the three terms above individually. Eq. 16 is a telescopic sum that is at most BB_{\star} on Ωm\Omega^{m} (Lemma B.12). For Eq. 17, we use the Bellman equations for π~m\tilde{\pi}^{m} on the optimistic model defined by the transitions P~m\widetilde{P}_{m} (Lemma B.11); thus it is at most h=2Hmc(shm,ahm)2B\sum_{h=2}^{H^{m}}c\bigl{(}s_{h}^{m},a_{h}^{m}\bigr{)}\leq 2B_{\star} (see the text following Lemma 4.5). For Eq. 18, recall that all state-action pairs at times h=2,,Hmh=2,\ldots,H^{m} are known by the definition of HmH^{m}. Hence by Lemma B.14,

|sS+(J~m(s)s′′S+P(s′′shm,ahm)J~m(s′′))(P~m(sshm,ahm)P(sshm,ahm))|\displaystyle\Biggl{|}\sum_{s^{\prime}\in S^{+}}\Bigl{(}\widetilde{J}^{m}(s^{\prime})-\sum_{s^{\prime\prime}\in S^{+}}P\bigl{(}s^{\prime\prime}\mid s_{h}^{m},a_{h}^{m}\bigr{)}\widetilde{J}^{m}(s^{\prime\prime})\Bigr{)}\Bigl{(}\widetilde{P}_{m}\bigl{(}s^{\prime}\mid s_{h}^{m},a_{h}^{m}\bigr{)}-P\bigl{(}s^{\prime}\mid s_{h}^{m},a_{h}^{m}\bigr{)}\Bigr{)}\Biggr{|}
18sS+cminP(sshm,ahm)(J~m(s)s′′S+P(s′′shm,ahm)J~m(s′′))2|S|B\displaystyle\qquad\leq\frac{1}{8}\sum_{s^{\prime}\in S^{+}}\sqrt{\frac{c_{\text{min}}\cdot P\bigl{(}s^{\prime}\mid s_{h}^{m},a_{h}^{m}\bigr{)}\bigl{(}\widetilde{J}^{m}(s^{\prime})-\sum_{s^{\prime\prime}\in S^{+}}P\bigl{(}s^{\prime\prime}\mid s_{h}^{m},a_{h}^{m}\bigr{)}\widetilde{J}^{m}(s^{\prime\prime})\bigr{)}^{2}}{|S|B_{\star}}}
+sS+cmin4|S|B|J~m(s)s′′SP(s′′shm,ahm)J~m(s′′)|B by Lemma B.12\displaystyle\qquad\qquad+\sum_{s^{\prime}\in S^{+}}\frac{c_{\text{min}}}{4|S|B_{\star}}\cdot\underbrace{\Bigl{|}\widetilde{J}^{m}(s^{\prime})-\sum_{s^{\prime\prime}\in S}P\bigl{(}s^{\prime\prime}\mid s_{h}^{m},a_{h}^{m}\bigr{)}\widetilde{J}^{m}(s^{\prime\prime})\Bigr{|}}_{\leq B_{\star}\text{ by Lemma B.12}}
14cmin𝕍hmB+c(shm,ahm)2,\displaystyle\qquad\leq\frac{1}{4}\sqrt{\frac{c_{\text{min}}\cdot\mathbb{V}_{h}^{m}}{B_{\star}}}+\frac{c\bigl{(}s_{h}^{m},a_{h}^{m}\bigr{)}}{2}, (by Jensen’s inequality, cminc(shm,ahm)c_{\text{min}}\leq c(s_{h}^{m},a_{h}^{m}), |S+|2|S||S^{+}|\leq 2|S|)

and again by Jensen’s inequality and that the total cost throughout the interval is at most 2B2B_{\star}, we have on Ωm\Omega^{m}

h=2Hm14cmin𝕍hmB+c(shm,ahm)2\displaystyle\sum_{h=2}^{H^{m}}\frac{1}{4}\sqrt{\frac{c_{\text{min}}\cdot\mathbb{V}_{h}^{m}}{B_{\star}}}+\frac{c\bigl{(}s_{h}^{m},a_{h}^{m}\bigr{)}}{2} 14Hm2B/cminh=2Hmcmin𝕍hmB+12h=2Hmc(shm,ahm)2B\displaystyle\leq\frac{1}{4}\sqrt{\underbrace{H^{m}}_{\leq 2B_{\star}/c_{\text{min}}}\cdot\sum_{h=2}^{H^{m}}\frac{c_{\text{min}}\cdot\mathbb{V}_{h}^{m}}{B_{\star}}}+\frac{1}{2}\underbrace{\sum_{h=2}^{H^{m}}c\bigl{(}s_{h}^{m},a_{h}^{m}\bigr{)}}_{\leq 2B_{\star}} (Jensen’s inequality)
142h=2Hm𝕍hm+B.\displaystyle\leq\frac{1}{4}\sqrt{2\sum_{h=2}^{H^{m}}\mathbb{V}_{h}^{m}}+B_{\star}.

Plugging these bounds back into Eq. 15 gets us

𝔼[h=2Hm𝕍hm𝕀{Ωm}|U¯m1]\displaystyle\mathbb{E}\Biggl{[}\sum_{h=2}^{H^{m}}\mathbb{V}_{h}^{m}\mathbb{I}\{\Omega^{m}\}\biggm{|}\bar{U}^{m-1}\Biggr{]} 𝔼[(4B+142h=1Hm𝕍hm𝕀{Ωm})2|U¯m1]\displaystyle\leq\mathbb{E}\Biggl{[}\Biggl{(}4B_{\star}+\frac{1}{4}\sqrt{2\sum_{h=1}^{H^{m}}\mathbb{V}_{h}^{m}\mathbb{I}\{\Omega^{m}\}}\Biggr{)}^{2}\biggm{|}\bar{U}^{m-1}\Biggr{]}
32B2+14𝔼[h=2Hm𝕍hm𝕀{Ωm}|U¯m1],\displaystyle\leq 32B_{\star}^{2}+\frac{1}{4}\mathbb{E}\Biggl{[}\sum_{h=2}^{H^{m}}\mathbb{V}_{h}^{m}\mathbb{I}\{\Omega^{m}\}\biggm{|}\bar{U}^{m-1}\Biggr{]},

where the last inequality is by the elementary inequality (a+b)22(a2+b2)(a+b)^{2}\leq 2(a^{2}+b^{2}). Rearranging gets us 𝔼[h=2Hm𝕍hm𝕀{Ωm}U¯m1]43B2\mathbb{E}\bigl{[}\sum_{h=2}^{H^{m}}\mathbb{V}_{h}^{m}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\bigr{]}\leq 43B_{\star}^{2}, and the lemma follows. ∎

Lemma B.15.

Let (Xt)t=1(X_{t})_{t=1}^{\infty} be a martingale difference sequence adapted to the filtration (t)t=0(\mathcal{F}_{t})_{t=0}^{\infty}. Let Yn=(t=1nXt)2t=1n𝔼[Xt2t1]Y_{n}=(\sum_{t=1}^{n}X_{t})^{2}-\sum_{t=1}^{n}\mathbb{E}[X_{t}^{2}\mid\mathcal{F}_{t-1}]. Then (Yn)n=0(Y_{n})_{n=0}^{\infty} is a martingale, and in particular if τ\tau is a stopping time such that τc\tau\leq c almost surely, then 𝔼[Yτ]=0\mathbb{E}[Y_{\tau}]=0.

Proof.

We first show that (Yn)n=1(Y_{n})_{n=1}^{\infty} is a martingale. Indeed,

𝔼[Ynn1]\displaystyle\mathbb{E}[Y_{n}\mid\mathcal{F}_{n-1}] =𝔼[(t=1nXt)2t=1n𝔼[Xt2t1]n1]\displaystyle=\mathbb{E}\Biggl{[}\Biggl{(}\sum_{t=1}^{n}X_{t}\Biggr{)}^{2}-\sum_{t=1}^{n}\mathbb{E}[X_{t}^{2}\mid\mathcal{F}_{t-1}]\mid\mathcal{F}_{n-1}\Biggr{]}
=𝔼[(t=1n1Xt)22(t=1n1Xt)Xn+Xn2t=1n𝔼[Xt2t1]n1]\displaystyle=\mathbb{E}\Biggl{[}\Biggl{(}\sum_{t=1}^{n-1}X_{t}\Biggr{)}^{2}-2\Biggl{(}\sum_{t=1}^{n-1}X_{t}\Biggr{)}X_{n}+X_{n}^{2}-\sum_{t=1}^{n}\mathbb{E}[X_{t}^{2}\mid\mathcal{F}_{t-1}]\mid\mathcal{F}_{n-1}\Biggr{]}
=(t=1n1Xt)22(t=1n1Xt)0+𝔼[Xn2n1]t=1n𝔼[Xt2t1]\displaystyle=\Biggl{(}\sum_{t=1}^{n-1}X_{t}\Biggr{)}^{2}-2\Biggl{(}\sum_{t=1}^{n-1}X_{t}\Biggr{)}\cdot 0+\mathbb{E}[X_{n}^{2}\mid\mathcal{F}_{n-1}]-\sum_{t=1}^{n}\mathbb{E}[X_{t}^{2}\mid\mathcal{F}_{t-1}] (𝔼[Xnn1]=0\mathbb{E}[X_{n}\mid\mathcal{F}_{n-1}]=0)
=(t=1n1Xt)2t=1n1𝔼[Xt2t1]=Yn1.\displaystyle=\Biggl{(}\sum_{t=1}^{n-1}X_{t}\Biggr{)}^{2}-\sum_{t=1}^{n-1}\mathbb{E}[X_{t}^{2}\mid\mathcal{F}_{t-1}]=Y_{n-1}.

We would now like to show that 𝔼[Yτ]=𝔼[Y1]=0\mathbb{E}[Y_{\tau}]=\mathbb{E}[Y_{1}]=0 using the optional stopping theorem. The latter holds since τc\tau\leq c almost surely and 𝔼[Y1]=𝔼[X12𝔼[X120]]=0.\mathbb{E}[Y_{1}]=\mathbb{E}[X_{1}^{2}-\mathbb{E}[X_{1}^{2}\mid\mathcal{F}_{0}]]=0. ∎
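
As a quick illustration of Lemma B.15, the following Monte Carlo sketch uses ±1 fair-coin increments (so E[X_t² | F_{t-1}] = 1) and a bounded stopping time, and checks that the empirical average of Y_τ = (∑_{t ≤ τ} X_t)² − τ is close to zero. All constants in the sketch are arbitrary choices for the illustration.

import random

random.seed(1)
num_runs, cap = 200000, 20           # the stopping time is capped, so tau <= 20 a.s.
acc = 0.0
for _ in range(num_runs):
    s, tau = 0, 0
    while tau < cap:
        s += random.choice((-1, 1))  # martingale difference with E[X_t^2 | F_{t-1}] = 1
        tau += 1
        if abs(s) >= 3:              # stop when the walk first leaves (-3, 3)
            break
    acc += s * s - tau               # Y_tau = (sum_t X_t)^2 - sum_t E[X_t^2 | F_{t-1}]
print("empirical E[Y_tau] =", acc / num_runs)   # should be close to 0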

B.2.7 Proof of Lemma 4.8

Lemma (restatement of Lemma 4.8).

With probability at least 1δ/41-\delta/4,

m=1M𝔼[h=1HmsS(P(sshm,ahm)P~m(sshm,ahm))J~m(s)𝕀{Ωm}U¯m1]\displaystyle\sum_{m=1}^{M}\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}\sum_{s^{\prime}\in S}\bigl{(}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})-\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\bigr{)}\widetilde{J}^{m}(s^{\prime})\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}
614BM|S|2|A|log2T|S||A|δ+8160B|S|2|A|log2T|S||A|δ.\displaystyle\qquad\leq 614B_{\star}\sqrt{M|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}}+8160B_{\star}|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}.
Proof.

Recall the following definitions:

Ahm=log(|S||A|N+m(shm,ahm)/δ)N+m(shm,ahm).𝕍hm=sS+P(sshm,ahm)(J~m(s)s′′S+P(s′′shm,ahm)J~m(s′′))2.\displaystyle A_{h}^{m}=\frac{\log(|S||A|N_{+}^{m}(s_{h}^{m},a_{h}^{m})/\delta)}{N_{+}^{m}(s_{h}^{m},a_{h}^{m})}.\qquad\qquad\mathbb{V}_{h}^{m}=\sum_{s^{\prime}\in S^{+}}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\Biggl{(}\widetilde{J}^{m}(s^{\prime})-\sum_{s^{\prime\prime}\in S^{+}}P(s^{\prime\prime}\mid s_{h}^{m},a_{h}^{m})\widetilde{J}^{m}(s^{\prime\prime})\Biggr{)}^{2}.

From Lemma 4.6 we have that

𝔼[h=1HmsS(P(sshm,ahm)\displaystyle\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}\sum_{s^{\prime}\in S}\bigl{(}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m}) P~m(sshm,ahm))J~m(s)𝕀{Ωm}U¯m1]\displaystyle-\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\bigr{)}\widetilde{J}^{m}(s^{\prime})\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}
𝔼[16|S|h=1Hm𝕍hmAhm𝕀{Ωm}+272B|S|Ahm𝕀{Ωm}U¯m1].\displaystyle\leq\mathbb{E}\Biggl{[}16\sqrt{|S|}\sum_{h=1}^{H^{m}}\sqrt{\mathbb{V}_{h}^{m}A_{h}^{m}}\mathbb{I}\{\Omega^{m}\}+272B_{\star}|S|A_{h}^{m}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}.

Moreover, by applying the Cauchy–Schwarz inequality twice, we get that

𝔼[h=1Hm𝕍hmAhm𝕀{Ωm}|U¯m1]\displaystyle\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}\sqrt{\mathbb{V}_{h}^{m}A_{h}^{m}}\mathbb{I}\{\Omega^{m}\}\biggm{|}\bar{U}^{m-1}\Biggr{]} 𝔼[h=1Hm𝕍hm𝕀{Ωm}h=1HmAhm𝕀{Ωm}|U¯m1]\displaystyle\leq\mathbb{E}\Biggl{[}\sqrt{\sum_{h=1}^{H^{m}}\mathbb{V}_{h}^{m}\mathbb{I}\{\Omega^{m}\}}\cdot\sqrt{\sum_{h=1}^{H^{m}}A_{h}^{m}\mathbb{I}\{\Omega^{m}\}}\biggm{|}\bar{U}^{m-1}\Biggr{]}
𝔼[h=1HmAhm𝕀{Ωm}|U¯m1]𝔼[h=1Hm𝕍hm𝕀{Ωm}|U¯m1]\displaystyle\leq\sqrt{\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}A_{h}^{m}\mathbb{I}\{\Omega^{m}\}\biggm{|}\bar{U}^{m-1}\Biggr{]}}\cdot\sqrt{\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}\mathbb{V}_{h}^{m}\mathbb{I}\{\Omega^{m}\}\biggm{|}\bar{U}^{m-1}\Biggr{]}}
7B𝔼[h=1HmAhm𝕀{Ωm}|U¯m1].\displaystyle\leq 7B_{\star}\sqrt{\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}A_{h}^{m}\mathbb{I}\{\Omega^{m}\}\biggm{|}\bar{U}^{m-1}\Biggr{]}}. (Lemma 4.7)

We sum over all intervals to obtain

m=1M𝔼[h=1HmsS(P(sshm,ahm)P~m(sshm,ahm))J~m(s)𝕀{Ωm}U¯m1]\displaystyle\sum_{m=1}^{M}\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}\sum_{s^{\prime}\in S}\bigl{(}P(s^{\prime}\mid s_{h}^{m},a_{h}^{m})-\widetilde{P}_{m}(s^{\prime}\mid s_{h}^{m},a_{h}^{m})\bigr{)}\widetilde{J}^{m}(s^{\prime})\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}
112Bm=1M|S|𝔼[h=1HmAhm𝕀{Ωm}U¯m1]+272B|S|m=1M𝔼[h=1HmAhm𝕀{Ωm}U¯m1]\displaystyle\qquad\leq 112B_{\star}\sum_{m=1}^{M}\sqrt{|S|\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}A_{h}^{m}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}}+272B_{\star}|S|\sum_{m=1}^{M}\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}A_{h}^{m}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}
112BM|S|m=1M𝔼[h=1HmAhm𝕀{Ωm}U¯m1]+272B|S|m=1M𝔼[h=1HmAhm𝕀{Ωm}U¯m1],\displaystyle\qquad\leq 112B_{\star}\sqrt{M|S|\sum_{m=1}^{M}\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}A_{h}^{m}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}}+272B_{\star}|S|\sum_{m=1}^{M}\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}A_{h}^{m}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]},

where the last inequality follows from Jensen’s inequality. We finish the proof using Lemma B.16 below. ∎

Lemma B.16.

With probability at least 1δ/41-\delta/4, the following holds for M=1,2,M=1,2,\ldots simultaneously.

m=1M𝔼[h=1HmAhm𝕀{Ωm}U¯m1]O(|S||A|log2T|S||A|δ).\sum_{m=1}^{M}\mathbb{E}\Biggl{[}\sum_{h=1}^{H^{m}}A_{h}^{m}\mathbb{I}\{\Omega^{m}\}\mid\bar{U}^{m-1}\Biggr{]}\leq O\biggl{(}|S||A|\log^{2}\frac{T|S||A|}{\delta}\biggr{)}.
Proof.

Define the infinite sequence of random variables: Xm=h=1HmAhm𝕀{Ωm}X^{m}=\sum_{h=1}^{H^{m}}A_{h}^{m}\mathbb{I}\{\Omega^{m}\} for which |Xm|3log(|S||A|/δ)|X^{m}|\leq 3\log(|S||A|/\delta) due to Lemma B.17 below. We apply Eq. 26 of Lemma D.4 to obtain with probability at least 1δ/41-\delta/4, for all M=1,2,M=1,2,\ldots simultaneously

m=1M𝔼[XmU¯m1]2m=1MXm+12log(|S||A|δ)log(8Mδ).\displaystyle\sum_{m=1}^{M}\mathbb{E}\bigl{[}X^{m}\mid\bar{U}^{m-1}\bigr{]}\leq 2\sum_{m=1}^{M}X^{m}+12\log\biggl{(}\frac{|S||A|}{\delta}\biggr{)}\log\biggl{(}\frac{8M}{\delta}\biggr{)}.

Now, we bound the sum over XmX^{m} by rewriting it as a sum over epochs:

m=1MXmm=1Mh=1Hmlog(|S||A|N+i(shm,ahm)/δ)N+i(shm,ahm)log|S||A|TδsSaAi=1Eni(s,a)N+i(s,a),\sum_{m=1}^{M}X^{m}\leq\sum_{m=1}^{M}\sum_{h=1}^{H^{m}}\frac{\log(|S||A|N_{+}^{i}(s_{h}^{m},a_{h}^{m})/\delta)}{N_{+}^{i}(s_{h}^{m},a_{h}^{m})}\leq\log\frac{|S||A|T}{\delta}\sum_{s\in S}\sum_{a\in A}\sum_{i=1}^{E}\frac{n_{i}(s,a)}{N_{+}^{i}(s,a)},

where EE is the last epoch. Finally, from Lemma B.18 below we have that for every (s,a)S×A(s,a)\in S\times A,

i=1Eni(s,a)N+i(s,a)2logNE+1(s,a)2logT.\sum_{i=1}^{E}\frac{n_{i}(s,a)}{N_{+}^{i}(s,a)}\leq 2\log N_{E+1}(s,a)\leq 2\log T.

We now plug in the resulting bound for m=1MXm\sum_{m=1}^{M}X^{m} and simplify the resulting expression using MTM\leq T. ∎

Lemma B.17.

For any interval mm, |h=1HmAhm|3log(|S||A|/δ).|\sum_{h=1}^{H^{m}}A_{h}^{m}|\leq 3\log(|S||A|/\delta).

Proof.

Note that all state-action pairs (shm,ahm)(s_{h}^{m},a_{h}^{m}) (except the first one (s1m,a1m)(s_{1}^{m},a_{1}^{m})) are known. Hence, for h2h\geq 2, N+i(shm,ahm)30000B|S|cminlogB|S||A|δcminN_{+}^{i}(s_{h}^{m},a_{h}^{m})\geq 30000\cdot\frac{B_{\star}|S|}{c_{\text{min}}}\log\frac{B_{\star}|S||A|}{\delta c_{\text{min}}}. Therefore, since log(x)/x\log(x)/x is decreasing and since |S|2|S|\geq 2 and |A|2|A|\geq 2 by assumption,

h=1Hmlog(|S||A|N+i(shm,ahm)/δ)N+i(shm,ahm)\displaystyle\sum_{h=1}^{H^{m}}\frac{\log(|S||A|N_{+}^{i}(s_{h}^{m},a_{h}^{m})/\delta)}{N_{+}^{i}(s_{h}^{m},a_{h}^{m})} log(|S||A|N+i(s1m,a1m)/δ)N+i(s1m,a1m)+h=2Hmlog(|S||A|N+i(shm,ahm)/δ)N+i(shm,ahm)\displaystyle\leq\frac{\log(|S||A|N_{+}^{i}(s_{1}^{m},a_{1}^{m})/\delta)}{N_{+}^{i}(s_{1}^{m},a_{1}^{m})}+\sum_{h=2}^{H^{m}}\frac{\log(|S||A|N_{+}^{i}(s_{h}^{m},a_{h}^{m})/\delta)}{N_{+}^{i}(s_{h}^{m},a_{h}^{m})}
log(|S||A|/δ)+cminHmB\displaystyle\leq\log(|S||A|/\delta)+\frac{c_{\text{min}}H^{m}}{B_{\star}}
log(|S||A|/δ)+2\displaystyle\leq\log(|S||A|/\delta)+2 (Hm2BcminH^{m}\leq\tfrac{2B_{\star}}{c_{\text{min}}} by definition.)
3log(|S||A|/δ).\displaystyle\leq 3\log(|S||A|/\delta).\qed
Lemma B.18.

For any sequence of integers z1,,znz_{1},\dots,z_{n} with 0zkZk1:=max{1,i=1k1zi}0\leq z_{k}\leq Z_{k-1}:=\max\{1,\sum_{i=1}^{k-1}z_{i}\} and Z0=1Z_{0}=1, it holds that

k=1nzkZk12logZn.\sum_{k=1}^{n}\frac{z_{k}}{Z_{k-1}}\leq 2\log Z_{n}.
Proof.

We use the inequality x2log(1+x)x\leq 2\log(1+x) for every 0x10\leq x\leq 1 to obtain

k=1nzkZk12k=1nlog(1+zkZk1)=2k=1nlogZk1+zkZk1=2k=1nlogZkZk1=2logk=1nZkZk1=2logZn.\sum_{k=1}^{n}\frac{z_{k}}{Z_{k-1}}\leq 2\sum_{k=1}^{n}\log\biggl{(}1+\frac{z_{k}}{Z_{k-1}}\biggr{)}=2\sum_{k=1}^{n}\log\frac{Z_{k-1}+z_{k}}{Z_{k-1}}=2\sum_{k=1}^{n}\log\frac{Z_{k}}{Z_{k-1}}=2\log\prod_{k=1}^{n}\frac{Z_{k}}{Z_{k-1}}=2\log Z_{n}.\qed
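
The elementary inequality x ≤ 2 log(1+x) on [0,1], which drives the proof, can be checked numerically with a short sketch (the grid below is an arbitrary choice).

import math

for i in range(0, 1001):
    x = i / 1000.0
    assert x <= 2 * math.log(1 + x) + 1e-12, x   # x <= 2 log(1 + x) for 0 <= x <= 1
print("x <= 2 log(1 + x) holds on the sampled grid of [0, 1]")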

B.2.8 Proof of Theorem 2.4

Theorem (restatement of Theorem 2.4).

Assume that Assumption 2 holds. With probability at least 1δ1-\delta, the regret of Algorithm 2 is bounded as follows:

RK=O(B|S||A|KlogKB|S||A|δcmin+B3|S|4|A|2cminlog2KB|S||A|δcmin).\displaystyle R_{K}=O\biggl{(}B_{\star}|S|\sqrt{|A|K}\log\frac{KB_{\star}|S||A|}{\delta c_{\text{min}}}+\sqrt{\frac{B_{\star}^{3}|S|^{4}|A|^{2}}{c_{\text{min}}}}\log^{2}\frac{KB_{\star}|S||A|}{\delta c_{\text{min}}}\biggr{)}.
Proof.

Let CMC_{M} denote the cost of the learner after MM intervals. First, with probability at least 1δ1-\delta, Lemmas 4.2, 4.5 and 4.8 all hold via a union bound. Now, as Ωm\Omega^{m} holds for all intervals, we have R~M=RM\widetilde{R}_{M}=R_{M} for any number of intervals MM. Plugging the bounds of Lemmas 4.4, 4.5 and 4.8 into Lemma 4.3, we have that for any number of intervals MM:

CM=O(KJπ(sinit)+BM|S|2|A|log2T|S||A|δ+B|S|2|A|log2T|S||A|δ).C_{M}=O\biggl{(}K\cdot J^{\pi^{\star}}(s_{\text{init}})+B_{\star}\sqrt{M|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}}+B_{\star}|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}\biggr{)}.

We now plug in the bounds on MM and TT from 4.1 into the bound above. First, we plug in the bound on MM. As long as the KK episodes have not elapsed, we have that MO(CM/B+K+2|S||A|logT+B|S|2|A|cminlogB|S||A|δcmin)M\leq O\bigl{(}C_{M}/B_{\star}+K+2|S||A|\log T+\frac{B_{\star}|S|^{2}|A|}{c_{\text{min}}}\log\frac{B_{\star}|S||A|}{\delta c_{\text{min}}}\bigr{)}. This gives, after using the subadditivity of the square root to simplify the resulting expression,

CM\displaystyle C_{M} =O(KJπ(sinit)+BK|S|2|A|log2T|S||A|δ\displaystyle=O\biggl{(}K\cdot J^{\pi^{\star}}(s_{\text{init}})+B_{\star}\sqrt{K|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}}
+BCM|S|2|A|log2T|S||A|δ+B3|S|4|A|2cminlog4TB|S||A|cminδ).\displaystyle\qquad\qquad+\sqrt{B_{\star}C_{M}|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}}+\sqrt{\frac{B_{\star}^{3}|S|^{4}|A|^{2}}{c_{\text{min}}}\log^{4}\frac{TB_{\star}|S||A|}{c_{\text{min}}\delta}}\biggr{)}.

Solving for CMC_{M} (using that xax+bx\leq a\sqrt{x}+b implies x(a+b)2x\leq(a+\sqrt{b})^{2} for a0a\geq 0 and b0b\geq 0), and simplifying the resulting expression by applying Jπ(sinit)BJ^{\pi^{\star}}(s_{\text{init}})\leq B_{\star} and our assumptions that K|S|2|A|K\geq|S|^{2}|A|, |S|2|S|\geq 2, |A|2|A|\geq 2, we get that

CM\displaystyle C_{M} =O((B|S|2|A|log2T|S||A|δ\displaystyle=O\Biggl{(}\biggl{(}\sqrt{B_{\star}|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}}
+KJπ(sinit)+BK|S|2|A|log2T|S||A|δ+B3|S|4|A|2cminlog4TB|S||A|cminδ)2)\displaystyle\qquad+\sqrt{K\cdot J^{\pi^{\star}}(s_{\text{init}})+B_{\star}\sqrt{K|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}}+\sqrt{\frac{B_{\star}^{3}|S|^{4}|A|^{2}}{c_{\text{min}}}\log^{4}\frac{TB_{\star}|S||A|}{c_{\text{min}}\delta}}}\biggr{)}^{2}\Biggr{)}
=O(B|S|2|A|log2T|S||A|δ\displaystyle=O\Biggl{(}B_{\star}|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}
+B|S|2|A|log2T|S||A|δKJπ(sinit)+BK|S|2|A|log2T|S||A|δ+B3|S|4|A|2cminlog4TB|S||A|cminδ\displaystyle\qquad+\sqrt{B_{\star}|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}}\cdot\sqrt{K\cdot J^{\pi^{\star}}(s_{\text{init}})+B_{\star}\sqrt{K|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}}+\sqrt{\frac{B_{\star}^{3}|S|^{4}|A|^{2}}{c_{\text{min}}}\log^{4}\frac{TB_{\star}|S||A|}{c_{\text{min}}\delta}}}
+KJπ(sinit)+BK|S|2|A|log2T|S||A|δ+B3|S|4|A|2cminlog4TB|S||A|cminδ)\displaystyle\qquad+K\cdot J^{\pi^{\star}}(s_{\text{init}})+B_{\star}\sqrt{K|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}}+\sqrt{\frac{B_{\star}^{3}|S|^{4}|A|^{2}}{c_{\text{min}}}\log^{4}\frac{TB_{\star}|S||A|}{c_{\text{min}}\delta}}\Biggr{)}
=O(B|S|2|A|log2T|S||A|δ+BK1/4|S|3|A|3/2log3T|S||A|δ+B5/2|S|4|A|2cmin1/2log4TB|S||A|cminδ\displaystyle=O\Biggl{(}B_{\star}|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}+B_{\star}\sqrt{K^{1/4}|S|^{3}|A|^{3/2}\log^{3}\frac{T|S||A|}{\delta}}+\sqrt{\frac{B_{\star}^{5/2}|S|^{4}|A|^{2}}{c_{\text{min}}^{1/2}}\log^{4}\frac{TB_{\star}|S||A|}{c_{\text{min}}\delta}}
+KJπ(sinit)+BK|S|2|A|log2T|S||A|δ+B3|S|4|A|2cminlog4TB|S||A|cminδ)\displaystyle\qquad+K\cdot J^{\pi^{\star}}(s_{\text{init}})+B_{\star}\sqrt{K|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}}+\sqrt{\frac{B_{\star}^{3}|S|^{4}|A|^{2}}{c_{\text{min}}}\log^{4}\frac{TB_{\star}|S||A|}{c_{\text{min}}\delta}}\Biggr{)}
=O(KJπ(sinit)+BK|S|2|A|log2T|S||A|δ+B3|S|4|A|2cminlog4TB|S||A|cminδ).\displaystyle=O\Biggl{(}K\cdot J^{\pi^{\star}}(s_{\text{init}})+B_{\star}\sqrt{K|S|^{2}|A|\log^{2}\frac{T|S||A|}{\delta}}+\sqrt{\frac{B_{\star}^{3}|S|^{4}|A|^{2}}{c_{\text{min}}}\log^{4}\frac{TB_{\star}|S||A|}{c_{\text{min}}\delta}}\Biggr{)}. (19)

Note that in particular, by simplifying the bound above, we have CM=O(B3|S|4|A|2KT/cminδ).C_{M}=O\Bigl{(}\sqrt{B_{\star}^{3}|S|^{4}|A|^{2}KT/c_{\text{min}}\delta}\Bigr{)}. Next we combine this with the fact, stated in 4.1, that TCM/cminT\leq C_{M}/c_{\text{min}}. Isolating TT gives T=O(B3|S|4|A|2Kcmin3δ),T=O\Bigl{(}\tfrac{B_{\star}^{3}|S|^{4}|A|^{2}K}{c_{\text{min}}^{3}\delta}\Bigr{)}, and plugging this bound back into Eq. 19 and simplifying yields

CM=O(KJπ(sinit)+B|S||A|Klog2KB|S||A|cminδ+B3|S|4|A|2cminlog4KB|S||A|cminδ).\displaystyle C_{M}=O\biggl{(}K\cdot J^{\pi^{\star}}(s_{\text{init}})+B_{\star}|S|\sqrt{|A|K\log^{2}\frac{KB_{\star}|S||A|}{c_{\text{min}}\delta}}+\sqrt{\frac{B_{\star}^{3}|S|^{4}|A|^{2}}{c_{\text{min}}}\log^{4}\frac{KB_{\star}|S||A|}{c_{\text{min}}\delta}}\biggr{)}.

Finally, we note that the bound above holds for any number of intervals MM as long as the KK episodes have not elapsed. Since every time step incurs a cost of at least cmin>0c_{\text{min}}>0, the learner must eventually finish the KK episodes, and subtracting KJπ(sinit)K\cdot J^{\pi^{\star}}(s_{\text{init}}) from the bound above gives the bound on RKR_{K} claimed by the theorem. ∎
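The key algebraic step above, solving the self-bounding inequality for CMC_{M}, relies on the implication that xax+bx\leq a\sqrt{x}+b with a,b0a,b\geq 0 yields x(a+b)2x\leq(a+\sqrt{b})^{2}. The snippet below is a small numerical sanity check of this implication on random triples; it is not part of the proof and the sampled values are arbitrary.

```python
import math
import random

# Sanity check (not part of the proof) of the implication used above:
# if x <= a * sqrt(x) + b with a, b >= 0, then x <= (a + sqrt(b))**2.
random.seed(0)
for _ in range(100000):
    a = random.uniform(0, 10)
    b = random.uniform(0, 10)
    x = random.uniform(0, 200)
    if x <= a * math.sqrt(x) + b:
        assert x <= (a + math.sqrt(b)) ** 2 + 1e-9, (a, b, x)
print("implication holds on all sampled triples")
```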

Appendix C Lower Bound

In this section we prove Theorem 2.7. At first glance, it is tempting to try to use the lower bound of Jaksch et al. (2010, Theorem 5) on the regret for learning average-reward MDPs, by reducing any average-reward MDP instance to an instance of SSP. However, it is unclear to us whether such a reduction is possible, and if it is, how to perform it (even though a reduction in the reverse direction is fairly straightforward in the unit-cost case; Tarbouriech et al., 2019). We consequently prove the theorem here directly.

By Yao’s minimax principle, in order to derive a lower bound on the learner’s regret, it suffices to show a distribution over MDP instances that forces any deterministic learner to suffer a regret of Ω(B|S||A|K)\Omega(B_{\star}\sqrt{|S||A|K}) in expectation.

To simplify our arguments, let us first consider the following simpler problem before considering the problem in its full generality. Think of a simple MDP with two states: the initial state sinits_{\text{init}} and a goal state gg. The set of actions AA has a special action aa^{\star} chosen uniformly at random a priori. Upon choosing the special action, the learner transitions to the goal state with probability 1/B\approx 1/B_{\star} and remains at sinits_{\text{init}} with the remaining probability. Concretely, P(ga)=1/BP(g\mid a^{\star})=1/B_{\star} and P(sinita)=11/BP(s_{\text{init}}\mid a^{\star})=1-1/B_{\star}, and for any other action aaa\neq a^{\star} we have P(ga)=(1ϵ)/BP(g\mid a)=(1-\epsilon)/B_{\star} and P(sinita)=1(1ϵ)/BP(s_{\text{init}}\mid a)=1-(1-\epsilon)/B_{\star} for some ϵ(0,1/8)\epsilon\in(0,1/8) (for ease of notation, and since there is only one state other than gg, we do not write this state as the origin state in the definition of the transition function). The costs of all actions equal 1; i.e., c(sinit,a)=1c(s_{\text{init}},a)=1 for all aAa\in A. Clearly, the optimal policy constantly plays aa^{\star}, and therefore Jπ(sinit)=BJ^{\pi^{\star}}(s_{\text{init}})=B_{\star}.
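To make the construction concrete, here is a minimal simulation sketch of this two-state instance (ours, for illustration only; the values of B_star, eps and the number of actions below are arbitrary, and the function names are not from the paper). It estimates the expected cost of the policy that always plays a_star, which should be close to J^{π⋆}(s_init) = B_star, and of a uniformly random policy, which is more expensive in expectation.

```python
import random

# A minimal simulation sketch (for illustration only) of the two-state instance:
# from s_init, action a_star reaches the goal with probability 1 / B_star, any
# other action with probability (1 - eps) / B_star, and every step costs 1.
def episode_cost(choose_action, a_star, B_star, eps, rng):
    cost = 0
    while True:
        a = choose_action(rng)          # action played at s_init
        cost += 1                       # all costs equal 1
        p_goal = 1.0 / B_star if a == a_star else (1.0 - eps) / B_star
        if rng.random() < p_goal:       # transition to the goal state g
            return cost

rng = random.Random(0)
B_star, eps, num_actions = 20.0, 0.05, 16   # arbitrary illustrative values
a_star = rng.randrange(num_actions)

always_a_star = lambda r: a_star                     # the optimal policy
uniform_policy = lambda r: r.randrange(num_actions)  # a suboptimal baseline

K = 20000
opt = sum(episode_cost(always_a_star, a_star, B_star, eps, rng) for _ in range(K)) / K
unif = sum(episode_cost(uniform_policy, a_star, B_star, eps, rng) for _ in range(K)) / K
print(f"average cost of always playing a_star: {opt:.2f} (expected about {B_star})")
print(f"average cost of the uniform policy:    {unif:.2f} (strictly larger in expectation)")
```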

Fix any deterministic learning algorithm; we now quantify the regret of the learner during a single episode in terms of the number of times that it chooses aa^{\star}. Let NkN_{k} denote the number of steps that the learner spends in sinits_{\text{init}} during episode kk, and let NkN^{\star}_{k} be the number of times the learner plays aa^{\star} at sinits_{\text{init}} during the episode. Note that NkN_{k} is also the total cost that the learning algorithm suffered during episode kk. We have the following lemma.

Lemma C.1.

𝔼[Nk]Jπ(sinit)=ϵ𝔼[NkNk].\mathbb{E}\bigl{[}N_{k}\bigr{]}-J^{\pi^{\star}}(s_{\text{init}})=\epsilon\cdot\mathbb{E}\bigl{[}N_{k}-N^{\star}_{k}\bigr{]}.

Proof.

Let us denote by s1,s2,s_{1},s_{2},\ldots and a1,a2,a_{1},a_{2},\ldots the sequences of states and actions observed by the learner during the episode. We have,

𝔼[Nk]\displaystyle\mathbb{E}[N_{k}] =t=1[st=sinit]\displaystyle=\sum_{t=1}^{\infty}\mathbb{P}[s_{t}=s_{\text{init}}]
=1+t=2[st=sinit]\displaystyle=1+\sum_{t=2}^{\infty}\mathbb{P}[s_{t}=s_{\text{init}}]
=1+t=2[st=sinitst1=sinit,at1=a][st1=sinit,at1=a]\displaystyle=1+\sum_{t=2}^{\infty}\mathbb{P}[s_{t}=s_{\text{init}}\mid s_{t-1}=s_{\text{init}},a_{t-1}=a^{\star}]\mathbb{P}[s_{t-1}=s_{\text{init}},a_{t-1}=a^{\star}]
+t=2[st=sinitst1=sinit,at1a][st1=sinit,at1a]\displaystyle\qquad+\sum_{t=2}^{\infty}\mathbb{P}[s_{t}=s_{\text{init}}\mid s_{t-1}=s_{\text{init}},a_{t-1}\neq a^{\star}]\mathbb{P}[s_{t-1}=s_{\text{init}},a_{t-1}\neq a^{\star}]
=1+t=2(11B)[st1=sinit,at1=a]+t=2(11ϵB)[st1=sinit,at1a]\displaystyle=1+\sum_{t=2}^{\infty}\biggl{(}1-\frac{1}{B_{\star}}\biggr{)}\mathbb{P}[s_{t-1}=s_{\text{init}},a_{t-1}=a^{\star}]+\sum_{t=2}^{\infty}\biggl{(}1-\frac{1-\epsilon}{B_{\star}}\biggr{)}\mathbb{P}[s_{t-1}=s_{\text{init}},a_{t-1}\neq a^{\star}]
=1+(11B)t=1[st=sinit,at=a]+(11ϵB)t=1[st=sinit,ata]\displaystyle=1+\biggl{(}1-\frac{1}{B_{\star}}\biggr{)}\sum_{t=1}^{\infty}\mathbb{P}[s_{t}=s_{\text{init}},a_{t}=a^{\star}]+\biggl{(}1-\frac{1-\epsilon}{B_{\star}}\biggr{)}\sum_{t=1}^{\infty}\mathbb{P}[s_{t}=s_{\text{init}},a_{t}\neq a^{\star}]
=1+(11B)𝔼[Nk]+(11ϵB)𝔼[NkNk].\displaystyle=1+\biggl{(}1-\frac{1}{B_{\star}}\biggr{)}\mathbb{E}[N_{k}^{\star}]+\biggl{(}1-\frac{1-\epsilon}{B_{\star}}\biggr{)}\mathbb{E}[N_{k}-N^{\star}_{k}].

Rearranging using Jπ(sinit)=BJ^{\pi^{\star}}(s_{\text{init}})=B_{\star} gives the Lemma’s statement. ∎

By Lemma C.1 the overall regret of the learner over KK episodes is: 𝔼[RK]=ϵ𝔼[NN],\mathbb{E}[R_{K}]=\epsilon\cdot\mathbb{E}\bigl{[}N-N^{\star}\bigr{]}, where N=k=1KNkN=\sum_{k=1}^{K}N_{k} and N=k=1KNkN^{\star}=\sum_{k=1}^{K}N_{k}^{\star}. In words, the regret of the learner is ϵ\epsilon times the expected number of visits to sinits_{\text{init}} in which the learner did not play aa^{\star}.

In the remainder of the proof we lower bound NN in expectation and upper bound the expected value of NN^{\star}. To upper bound NN^{\star}, we use standard techniques from lower bounds of multi-armed bandits (Auer et al., 2002) that bound the total variation distance between the distribution of the sequence of states traversed by the learner in the original MDP and that generated in a “uniform MDP” in which all actions are identical. However, we cannot apply this argument directly since it requires NN^{\star} to be bounded almost surely, yet here NN^{\star} depends on the total length of all KK episodes, which is unbounded in general. We fix this issue by looking only at the first TT steps (where TT is to be determined) and showing that the regret is large even in these TT steps.

Formally, we view the run of the KK episodes as a single continuous process in which, whenever the learner reaches the goal state, we transfer it to sinits_{\text{init}} (at no cost) and let it restart from there. Furthermore, we cap the learning process to consist of exactly TT steps as follows. If the KK episodes are completed before TT steps have elapsed, the learner remains in gg (until completing TT steps) without suffering any additional cost, and otherwise we stop the learner after TT steps before it completes its KK episodes. In this capped process, we denote the number of visits to sinits_{\text{init}} by NN_{-} and the number of times the learner played aa^{\star} in sinits_{\text{init}} by NN_{-}^{\star}. We have

𝔼[RK]ϵ(𝔼[N]𝔼[N]).\mathbb{E}[R_{K}]\geq\epsilon\cdot\bigl{(}\mathbb{E}\bigl{[}N_{-}\bigr{]}-\mathbb{E}\bigl{[}N^{\star}_{-}\bigr{]}\bigr{)}. (20)

The number of visits to sinits_{\text{init}} under this capping is lower bounded by the following lemma.

Lemma C.2.

For any deterministic learner, if T2KBT\geq 2KB_{\star} then we have that 𝔼[N]KB/4.\mathbb{E}\bigl{[}N_{-}\bigr{]}\geq KB_{\star}/4.

Proof.

If the capped learner finished its KK episodes then N=NN_{-}=N. Otherwise, it visits the goal state less than KK times and therefore NTKN_{-}\geq T-K. Hence 𝔼[N]𝔼[min{TK,N}]k=1K𝔼[min{T/K1,Nk}].\mathbb{E}\bigl{[}N_{-}\bigr{]}\geq\mathbb{E}\bigl{[}\min\{T-K,N\}\bigr{]}\geq\sum_{k=1}^{K}\mathbb{E}\bigl{[}\min\{T/K-1,N_{k}\}\bigr{]}. Since T2KBT\geq 2KB_{\star}, the lemma will follow if we show that NkBN_{k}\geq B_{\star} with probability at least 1/41/4. We lower bound the probability that NkBN_{k}\geq B_{\star} by the probability of remaining at sinits_{\text{init}} for the first B1B_{\star}-1 steps, which is smallest when aa^{\star} is picked in each of these steps. Indeed, using (11/x)x11/e(1-1/x)^{x-1}\geq 1/e for x1x\geq 1 and 1/e1/41/e\geq 1/4, we get that [NkB](11B)B114.\mathbb{P}[N_{k}\geq B_{\star}]\geq\bigl{(}1-\frac{1}{B_{\star}}\bigr{)}^{B_{\star}-1}\geq\frac{1}{4}.\qed
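The elementary inequality (1-1/x)^{x-1} ≥ 1/e ≥ 1/4 used at the end of the proof can be checked numerically; the snippet below (a sanity check only, on an arbitrary grid) does so.

```python
import math

# Numerical sanity check (not part of the proof): (1 - 1/x)**(x - 1) >= 1/e >= 1/4
# for all x > 1; at x = 1 the expression is interpreted as its limit, which is 1.
for i in range(1, 100001):
    x = 1.0 + i / 1000.0                       # grid over (1, 101]
    assert (1.0 - 1.0 / x) ** (x - 1.0) >= 1.0 / math.e - 1e-12
print("(1 - 1/x)^(x - 1) >= 1/e verified on the grid, and 1/e > 1/4")
```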

We now introduce an additional transition distribution which we call unif\mathbb{P}_{\text{unif}}. unif\mathbb{P}_{\text{unif}} is identical to \mathbb{P} as defined above, except that P(ga)=(1ϵ)/BP(g\mid a)=(1-\epsilon)/B_{\star} for all actions aa. We denote expectations over unif\mathbb{P}_{\text{unif}} by 𝔼unif\mathbb{E}_{\text{unif}}. The following lemma uses standard lower-bound techniques for multi-armed bandits (see, e.g., Jaksch et al., 2010, Theorem 13) to bound the difference in the expectation of NN^{\star}_{-} when the learner plays in \mathbb{P} compared to when it plays in unif\mathbb{P}_{\text{unif}}.

Lemma C.3.

For any deterministic learner we have that 𝔼[N]𝔼unif[N]+ϵT𝔼unif[N]/B.\mathbb{E}\bigl{[}N^{\star}_{-}\bigr{]}\leq\mathbb{E}_{\text{unif}}\bigl{[}N^{\star}_{-}\bigr{]}+\epsilon T\sqrt{\mathbb{E}_{\text{unif}}[N^{\star}_{-}]/B}.

Proof.

Fix any deterministic learner. Let us denote by s(t)s^{(t)} the sequence of states observed by the learner up to and including time tt. Now, since NTN_{-}^{\star}\leq T and NN_{-}^{\star} is a function of s(T)s^{(T)}, we have 𝔼[N]𝔼unif[N]+TTV(unif[s(T)],[s(T)]),\mathbb{E}\bigl{[}N_{-}^{\star}\bigr{]}\leq\mathbb{E}_{\text{unif}}\bigl{[}N^{\star}_{-}\bigr{]}+T\cdot\text{TV}(\mathbb{P}_{\text{unif}}[s^{(T)}],\mathbb{P}[s^{(T)}]), and Pinsker’s inequality yields

TV(unif[s(T)],[s(T)])12KL(unif[s(T)][s(T)]).\text{TV}(\mathbb{P}_{\text{unif}}[s^{(T)}],\mathbb{P}[s^{(T)}])\leq\sqrt{\frac{1}{2}\text{KL}(\mathbb{P}_{\text{unif}}[s^{(T)}]\;\|\;\mathbb{P}[s^{(T)}])}. (21)

Next, the chain rule of the KL divergence obtains

KL(unif[s(T)][s(T)])=t=1Ts(t1)unif[s(t1)]KL(unif[sts(t1)][sts(t1)]).\text{KL}(\mathbb{P}_{\text{unif}}[s^{(T)}]\;\|\;\mathbb{P}[s^{(T)}])=\sum_{t=1}^{T}\sum_{s^{(t-1)}}\mathbb{P}_{\text{unif}}[s^{(t-1)}]\cdot\text{KL}(\mathbb{P}_{\text{unif}}[s_{t}\mid s^{(t-1)}]\;\|\;\mathbb{P}[s_{t}\mid s^{(t-1)}]).

Observe that at any time, since the learning algorithm is deterministic, the learner chooses the same action given s(t1)s^{(t-1)} regardless of whether s(t1)s^{(t-1)} was generated under \mathbb{P} or under unif\mathbb{P}_{\text{unif}}. Thus, KL(unif[sts(t1)][sts(t1)])\text{KL}(\mathbb{P}_{\text{unif}}[s_{t}\mid s^{(t-1)}]\;\|\;\mathbb{P}[s_{t}\mid s^{(t-1)}]) is zero if at1aa_{t-1}\neq a^{\star}, and otherwise

KL(unif[sts(t1)][sts(t1)])\displaystyle\text{KL}(\mathbb{P}_{\text{unif}}[s_{t}\mid s^{(t-1)}]\;\|\;\mathbb{P}[s_{t}\mid s^{(t-1)}]) =sSunif[stst1=sinit,at1=a]logunif[stst1=sinit,at1=a][stst1=sinit,at1=a]\displaystyle=\sum_{s\in S}\mathbb{P}_{\text{unif}}[s_{t}\mid s_{t-1}=s_{\text{init}},a_{t-1}=a^{\star}]\log\frac{\mathbb{P}_{\text{unif}}[s_{t}\mid s_{t-1}=s_{\text{init}},a_{t-1}=a^{\star}]}{\mathbb{P}[s_{t}\mid s_{t-1}=s_{\text{init}},a_{t-1}=a^{\star}]}
=1ϵBlog(1ϵ)+(11ϵB)log(1+ϵB1)\displaystyle=\frac{1-\epsilon}{B_{\star}}\cdot\log(1-\epsilon)+\biggl{(}1-\frac{1-\epsilon}{B_{\star}}\biggr{)}\log\biggl{(}1+\frac{\epsilon}{B_{\star}-1}\biggr{)}
ϵ2B1.\displaystyle\leq\frac{\epsilon^{2}}{B_{\star}-1}. (using log(1+x)x\log(1+x)\leq x for all x>1x>-1)

Plugging the above back into Eq. 21 and using B2B_{\star}\geq 2 gives the lemma. ∎
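The per-step KL computation above amounts to bounding the KL divergence between two Bernoulli distributions, with parameters (1-ε)/B⋆ and 1/B⋆, by ε²/(B⋆-1). The snippet below is a small numerical sanity check of this bound for a few arbitrary values (not part of the proof); the function name is ours.

```python
import math

# Sanity check (not part of the proof): the per-step KL divergence between the
# transition at s_init under a != a_star, i.e. Bernoulli((1 - eps) / B) on reaching
# the goal, and under a_star, i.e. Bernoulli(1 / B), is at most eps**2 / (B - 1).
def kl_bernoulli(p, q):
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

for B in (2.0, 5.0, 20.0, 100.0):
    for eps in (0.01, 0.05, 0.1, 0.124):
        p, q = (1.0 - eps) / B, 1.0 / B
        assert kl_bernoulli(p, q) <= eps ** 2 / (B - 1) + 1e-15, (B, eps)
print("KL bound eps^2 / (B - 1) verified for the sampled values")
```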

In the following result, we combine the lemma above with standard techniques from lower bounds of multi-armed bandits (see Jaksch et al., 2010, Thm. 5 for example).

Theorem C.4.

Suppose that B2B_{\star}\geq 2, ϵ(0,18)\epsilon\in(0,\frac{1}{8}) and |A|16|A|\geq 16. For the problem described above we have that

𝔼[RK]ϵKB(182ϵ2K|A|).\mathbb{E}[R_{K}]\geq\epsilon KB_{\star}\biggl{(}\frac{1}{8}-2\epsilon\sqrt{\frac{2K}{|A|}}\biggr{)}.
Proof of Theorem C.4.

Note that, under unif\mathbb{P}_{\text{unif}}, the transition distributions of all actions are identical, so the distribution of the learner’s trajectory does not depend on the identity of aa^{\star}; hence

aAa=a𝔼unif[N]=𝔼unif[aAa=aN]=𝔼unif[N]T.\sum_{a\in A\mid a^{\star}=a}\mathbb{E}_{\text{unif}}\bigl{[}N^{\star}_{-}\bigr{]}=\mathbb{E}_{\text{unif}}\Biggl{[}\sum_{a\in A\mid a^{\star}=a}N^{\star}_{-}\Biggr{]}=\mathbb{E}_{\text{unif}}\bigl{[}N_{-}\bigr{]}\leq T. (22)

Suppose that aa^{\star} is sampled uniformly at random before the game starts. Denote the probability and expectation with respect to the distribution induced by a specific choice of a=aa^{\star}=a by a\mathbb{P}_{a} and 𝔼a\mathbb{E}_{a} respectively. Then for T=2KBT=2KB_{\star},

\displaystyle\mathbb{E}[R_{K}] \displaystyle=\frac{1}{|A|}\sum_{a\in A}\mathbb{E}_{a}[R_{K}]
\displaystyle\geq\frac{\epsilon}{|A|}\sum_{a\in A}\mathbb{E}_{a}[N_{-}-N^{\star}_{-}] (Eq. 20)
\displaystyle\geq\frac{\epsilon}{|A|}\sum_{a\in A\mid a^{\star}=a}\biggl{(}\frac{KB_{\star}}{4}-\mathbb{E}_{\text{unif}}[N^{\star}_{-}]-\epsilon T\sqrt{\frac{\mathbb{E}_{\text{unif}}[N^{\star}_{-}]}{B_{\star}}}\biggr{)} (Lemmas C.2 and C.3)
\displaystyle\geq\epsilon\biggl{(}\frac{KB_{\star}}{4}-\frac{1}{|A|}\sum_{a\in A\mid a^{\star}=a}\mathbb{E}_{\text{unif}}[N^{\star}_{-}]-\epsilon T\sqrt{\frac{1}{B_{\star}}\cdot\frac{1}{|A|}\sum_{a\in A\mid a^{\star}=a}\mathbb{E}_{\text{unif}}[N^{\star}_{-}]}\biggr{)} (Jensen's inequality)
\displaystyle\geq\epsilon\biggl{(}\frac{KB_{\star}}{4}-\frac{T}{|A|}-\epsilon T\sqrt{\frac{T}{B_{\star}|A|}}\biggr{)} (Eq. 22)
\displaystyle=\epsilon\biggl{(}\frac{KB_{\star}}{4}-\frac{2KB_{\star}}{|A|}-2\epsilon KB_{\star}\sqrt{\frac{2KB_{\star}}{|A|B_{\star}}}\biggr{)}
\displaystyle=\epsilon KB_{\star}\biggl{(}\frac{1}{4}-\frac{2}{|A|}-2\epsilon\sqrt{\frac{2K}{|A|}}\biggr{)}.

The theorem follows since |A|16|A|\geq 16 implies 142|A|18\frac{1}{4}-\frac{2}{|A|}\geq\frac{1}{8}. ∎

Proof of Theorem 2.7.

Consider the following MDP. Let SS be the set of states, excluding the goal state gg. The initial state is sampled uniformly at random from SS. Each sSs\in S has its own special action asa^{\star}_{s}. The transition distributions are defined by P(gas,s)=1/B,P(g\mid a^{\star}_{s},s)=1/B_{\star}, P(sas,s)=11/B,P(s\mid a^{\star}_{s},s)=1-1/B_{\star}, and P(ga,s)=(1ϵ)/B,P(g\mid a,s)=(1-\epsilon)/B_{\star}, P(sa,s)=1(1ϵ)/BP(s\mid a,s)=1-(1-\epsilon)/B_{\star} for any other action aA\{as}a\in A\backslash\{a^{\star}_{s}\}.

Note that for each sSs\in S, the learner faces a copy of the simple problem described above, about which it cannot learn anything from visits to other states sss^{\prime}\neq s. Therefore, we can apply Theorem C.4 to each sSs\in S separately and lower bound the learner’s expected regret by the sum of the regrets suffered at each sSs\in S, which depends on the number of times sSs\in S is drawn as the initial state. Since the initial states are chosen uniformly at random, a constant fraction of the states are chosen Θ(K/|S|)\Theta(K/|S|) times. Summing the regret bounds of Theorem C.4 over only these states and choosing ϵ\epsilon appropriately gives the sought-after bound.

Denote by KsK_{s} the number of episodes that start in state sSs\in S. Then

𝔼[RK]sS𝔼[ϵKsB(182ϵ2Ks|A|)]=ϵKB82ϵ2B2|A|sS𝔼[Ks3/2].\displaystyle\mathbb{E}[R_{K}]\geq\sum_{s\in S}\mathbb{E}\Biggl{[}\epsilon K_{s}B_{\star}\biggl{(}\frac{1}{8}-2\epsilon\sqrt{\frac{2K_{s}}{|A|}}\biggr{)}\Biggr{]}=\frac{\epsilon KB_{\star}}{8}-2\epsilon^{2}B_{\star}\sqrt{\frac{2}{|A|}}\sum_{s\in S}\mathbb{E}[K_{s}^{3/2}]. (23)

Taking expectation over the initial states and applying the Cauchy-Schwarz inequality gives

sS𝔼[Ks3/2]sS𝔼[Ks]𝔼[Ks2]=sS𝔼[Ks]𝔼[Ks]2+𝕍[Ks]=sSK|S|K2|S|2+K(|S|1)|S|2K2K|S|,\sum_{s\in S}\mathbb{E}\bigl{[}K_{s}^{3/2}\bigr{]}\leq\sum_{s\in S}\sqrt{\mathbb{E}[K_{s}]}\sqrt{\mathbb{E}[K_{s}^{2}]}=\sum_{s\in S}\sqrt{\mathbb{E}[K_{s}]}\sqrt{\mathbb{E}[K_{s}]^{2}+\mathbb{V}[K_{s}]}=\sum_{s\in S}\sqrt{\frac{K}{|S|}}\sqrt{\frac{K^{2}}{|S|^{2}}+\frac{K(|S|-1)}{|S|^{2}}}\leq K\sqrt{\frac{2K}{|S|}},

where we have used the expectation and variance formulas of the Binomial distribution (each KsK_{s} is distributed Binomial(K,1/|S|)\text{Binomial}(K,1/|S|)). The lower bound now follows by plugging the inequality above into Eq. 23 and choosing ϵ=164|A||S|/K\epsilon=\frac{1}{64}\sqrt{|A||S|/K}. ∎
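The final moment bound, ∑_s √(E[K_s])·√(E[K_s²]) ≤ K√(2K/|S|) with K_s distributed Binomial(K, 1/|S|), can be sanity-checked directly. The snippet below (a check only; the sampled values are ours and satisfy K ≥ |S|-1, as needed for the last inequality in the chain) evaluates both sides.

```python
import math

# Numerical sanity check (not part of the proof): with K_s ~ Binomial(K, 1/S),
# E[K_s] = K/S and E[K_s^2] = (K/S)**2 + K*(S-1)/S**2, so we check directly that
#   S * sqrt(K/S) * sqrt((K/S)**2 + K*(S-1)/S**2) <= K * sqrt(2*K/S)
# for sample pairs with K >= S - 1 (as needed for the final inequality).
for S in (2, 5, 10, 100):
    for K in (S, 4 * S, S * S, 10 * S * S):
        lhs = S * math.sqrt(K / S) * math.sqrt((K / S) ** 2 + K * (S - 1) / S ** 2)
        rhs = K * math.sqrt(2 * K / S)
        assert lhs <= rhs + 1e-9, (S, K)
print("moment bound verified for the sampled (S, K) pairs")
```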

Appendix D Concentration inequalities

Theorem D.1 (Anytime Azuma).

Let (Xn)n=1(X_{n})_{n=1}^{\infty} be a martingale difference sequence with respect to the filtration (n)n=0(\mathcal{F}_{n})_{n=0}^{\infty} such that |Xn|B|X_{n}|\leq B almost surely. Then with probability at least 1δ1-\delta,

|i=1nXi|Bnlog2nδ,n1.\Biggl{|}\sum_{i=1}^{n}X_{i}\Biggr{|}\leq B\sqrt{n\log\frac{2n}{\delta}},\qquad\forall n\geq 1.
Theorem D.2 (Weissman et al., 2003).

Let p()p(\cdot) be a distribution over mm elements, and let p¯t()\bar{p}_{t}(\cdot) be the empirical distribution defined by tt iid samples from p()p(\cdot). Then, with probability at least 1δ1-\delta,

p¯t()p()12mlog1δt.\bigl{\lVert}\bar{p}_{t}(\cdot)-p(\cdot)\bigr{\rVert}_{1}\leq 2\sqrt{\frac{m\log\frac{1}{\delta}}{t}}.
Theorem D.3 (Anytime Bernstein).

Let (Xn)n=1(X_{n})_{n=1}^{\infty} be a sequence of i.i.d. random variables with expectation μ\mu. Suppose that 0XnB0\leq X_{n}\leq B almost surely. Then with probability at least 1δ1-\delta, the following holds for all n1n\geq 1 simultaneously:

|i=1n(Xiμ)|2Bμnlog2nδ+Blog2nδ.\displaystyle\Biggl{|}\sum_{i=1}^{n}(X_{i}-\mu)\Biggr{|}\leq 2\sqrt{B\mu n\log\frac{2n}{\delta}}+B\log\frac{2n}{\delta}. (24)
|i=1n(Xiμ)|2Bi=1nXilog2nδ+7Blog2nδ.\displaystyle\Biggl{|}\sum_{i=1}^{n}(X_{i}-\mu)\Biggr{|}\leq 2\sqrt{B\sum_{i=1}^{n}X_{i}\log\frac{2n}{\delta}}+7B\log\frac{2n}{\delta}. (25)
Proof.

Fix some n1n\geq 1. By Bernstein’s concentration inequality (see, for example, Cesa-Bianchi and Lugosi, 2006, Corollary A.3), we have with probability at least 1δ2n21-\tfrac{\delta}{2n^{2}} that Eq. 24 holds. By a union bound over all nn (using n=11/n22\sum_{n=1}^{\infty}1/n^{2}\leq 2), the inequality holds with probability at least 1δ1-\delta for all n1n\geq 1 simultaneously.

To show Eq. 25, note that in particular we have

μni=1nXi2Bμnlog2nδ+Blog2nδ\mu\cdot n-\sum_{i=1}^{n}X_{i}\leq 2\sqrt{B\mu n\log\frac{2n}{\delta}}+B\log\frac{2n}{\delta}

which is a quadratic inequality in μ\sqrt{\mu}. This implies that

μ1ni=1nXi+3Blog2nδn.\sqrt{\mu}\leq\sqrt{\frac{1}{n}\sum_{i=1}^{n}X_{i}}+3\sqrt{\frac{B\log\frac{2n}{\delta}}{n}}.

Plugging this inequality back into the RHS of Eq. 24 gives Eq. 25. ∎
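The quadratic step above (passing from Eq. 24 to the bound on √μ) can also be sanity-checked numerically; the snippet below samples arbitrary tuples and verifies the implication (a check only, not part of the proof; the variable names are ours).

```python
import math
import random

# Sanity check (not part of the proof) of the quadratic-inequality step: whenever
#   mu * n - Sx <= 2 * sqrt(B * mu * n * L) + B * L   (with mu, B, L, Sx >= 0),
# it follows that
#   sqrt(mu) <= sqrt(Sx / n) + 3 * sqrt(B * L / n).
random.seed(0)
for _ in range(100000):
    mu = random.uniform(0, 5)
    B = random.uniform(0.1, 5)
    L = random.uniform(0.1, 10)
    n = random.randint(1, 1000)
    Sx = random.uniform(0, 5 * n)
    if mu * n - Sx <= 2 * math.sqrt(B * mu * n * L) + B * L:
        assert math.sqrt(mu) <= math.sqrt(Sx / n) + 3 * math.sqrt(B * L / n) + 1e-9
print("quadratic step verified on all sampled tuples")
```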

Lemma D.4.

Let (Xn)n=1(X_{n})_{n=1}^{\infty} be a sequence of random variables adapted to the filtration (n)n=0(\mathcal{F}_{n})_{n=0}^{\infty}. Suppose that 0XnB0\leq X_{n}\leq B almost surely. Then with probability at least 1δ1-\delta, the following holds for all n1n\geq 1 simultaneously:

i=1n𝔼[Xii1]2i=1nXi+4Blog2nδ.\sum_{i=1}^{n}\mathbb{E}[X_{i}\mid\mathcal{F}_{i-1}]\leq 2\sum_{i=1}^{n}X_{i}+4B\log\frac{2n}{\delta}. (26)
Proof.

For all n1n\geq 1, we have

𝔼[eXn/Bn1]\displaystyle\mathbb{E}[e^{-X_{n}/B}\mid\mathcal{F}_{n-1}] 𝔼[1XnB+Xn22B2|n1]\displaystyle\leq\mathbb{E}\biggl{[}1-\frac{X_{n}}{B}+\frac{X_{n}^{2}}{2B^{2}}\Bigm{|}\mathcal{F}_{n-1}\biggr{]} (ex1x+x22e^{-x}\leq 1-x+\tfrac{x^{2}}{2} for all x0x\geq 0)
1𝔼[Xnn1]B+𝔼[Xnn1]2B\displaystyle\leq 1-\frac{\mathbb{E}[X_{n}\mid\mathcal{F}_{n-1}]}{B}+\frac{\mathbb{E}[X_{n}\mid\mathcal{F}_{n-1}]}{2B} (XnBX_{n}\leq B)
=1𝔼[Xnn1]2B\displaystyle=1-\frac{\mathbb{E}[X_{n}\mid\mathcal{F}_{n-1}]}{2B}
e𝔼[Xnn1]/2B.\displaystyle\leq e^{-\mathbb{E}[X_{n}\mid\mathcal{F}_{n-1}]/2B}. (1xex1-x\leq e^{-x} for all xx)

Hence, fix some n1n\geq 1, then

𝔼[exp(1Bi=1n(12𝔼[Xii1]Xi))]\displaystyle\mathbb{E}\Biggl{[}\exp\biggl{(}\frac{1}{B}\sum_{i=1}^{n}\biggl{(}\frac{1}{2}\mathbb{E}[X_{i}\mid\mathcal{F}_{i-1}]-X_{i}\biggr{)}\biggr{)}\Biggr{]}
=𝔼[exp(1Bi=1n1(12𝔼[Xii1]Xi))𝔼[exp(1B(12𝔼[Xnn1]Xn)|n1]1]\displaystyle\qquad=\mathbb{E}\Biggl{[}\exp\biggl{(}\frac{1}{B}\sum_{i=1}^{n-1}\biggl{(}\frac{1}{2}\mathbb{E}[X_{i}\mid\mathcal{F}_{i-1}]-X_{i}\biggr{)}\biggr{)}\cdot\underbrace{\mathbb{E}\Biggl{[}\exp\biggl{(}\frac{1}{B}\biggl{(}\frac{1}{2}\mathbb{E}[X_{n}\mid\mathcal{F}_{n-1}]-X_{n}\biggr{)}\biggm{|}\mathcal{F}_{n-1}\Biggr{]}}_{\leq 1}\Biggr{]}
𝔼[exp(1Bi=1n1(12𝔼[Xii1]Xi))]\displaystyle\qquad\leq\mathbb{E}\Biggl{[}\exp\biggl{(}\frac{1}{B}\sum_{i=1}^{n-1}\biggl{(}\frac{1}{2}\mathbb{E}[X_{i}\mid\mathcal{F}_{i-1}]-X_{i}\biggr{)}\biggr{)}\Biggr{]}
1.\displaystyle\qquad\leq 1. (by repeating the last argument inductively.)

Therefore,

[i=1n(12𝔼[Xii1]Xi)>2Blog2nδ]\displaystyle\mathbb{P}\Biggl{[}\sum_{i=1}^{n}\biggl{(}\frac{1}{2}\mathbb{E}[X_{i}\mid\mathcal{F}_{i-1}]-X_{i}\biggr{)}>2B\log\frac{2n}{\delta}\Biggr{]} [exp(1Bi=1n(12𝔼[Xii1]Xi))>2n2δ]\displaystyle\leq\mathbb{P}\Biggl{[}\exp\biggl{(}\frac{1}{B}\sum_{i=1}^{n}\biggl{(}\frac{1}{2}\mathbb{E}[X_{i}\mid\mathcal{F}_{i-1}]-X_{i}\biggr{)}\biggr{)}>\frac{2n^{2}}{\delta}\Biggr{]}
𝔼[exp(1Bi=1n(12𝔼[Xii1]Xi))]δ2n2\displaystyle\leq\mathbb{E}\Biggl{[}\exp\biggl{(}\frac{1}{B}\sum_{i=1}^{n}\biggl{(}\frac{1}{2}\mathbb{E}[X_{i}\mid\mathcal{F}_{i-1}]-X_{i}\biggr{)}\biggr{)}\Biggr{]}\cdot\frac{\delta}{2n^{2}} (Markov inequality)
δ2n2.\displaystyle\leq\frac{\delta}{2n^{2}}.

Hence, via a union bound, the above holds for all n1n\geq 1 simultaneously, which proves the lemma. ∎
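As a final sanity check (not part of the proof), the following Monte Carlo snippet tests Eq. 26 in an i.i.d. special case, X_i uniform on [0, B], for which E[X_i | F_{i-1}] = B/2; the parameter values below are arbitrary.

```python
import math
import random

# Monte Carlo sanity check (not part of the proof) of Eq. 26 for an i.i.d. special
# case: X_i uniform on [0, B], so E[X_i | F_{i-1}] = B / 2. We count how often the
# inequality fails at some n <= N and compare the empirical rate against delta.
random.seed(0)
B, N, delta, runs = 2.0, 200, 0.1, 2000
violations = 0
for _ in range(runs):
    partial_sum, ok = 0.0, True
    for n in range(1, N + 1):
        partial_sum += random.uniform(0.0, B)
        lhs = n * (B / 2.0)                       # sum of conditional expectations
        rhs = 2.0 * partial_sum + 4.0 * B * math.log(2.0 * n / delta)
        if lhs > rhs:
            ok = False
            break
    violations += 0 if ok else 1
print(f"empirical violation rate: {violations / runs:.4f} (should be well below delta = {delta})")
```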