
Navigating to the Best Policy
in Markov Decision Processes

Aymen Al Marjani
UMPA, ENS Lyon
Lyon, France
[email protected]
Aurélien Garivier
UMPA, CNRS, INRIA, ENS Lyon
Lyon, France
[email protected]
Alexandre Proutiere
EECS and Digital Futures
KTH Royal Institute of Technology, Sweden
[email protected]
Abstract

We investigate the classical active pure exploration problem in Markov Decision Processes, where the agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible. We propose a problem-dependent lower bound on the average number of steps required before a correct answer can be given with probability at least $1-\delta$. We further provide the first algorithm with an instance-specific sample complexity in this setting. This algorithm addresses the general case of communicating MDPs; we also propose a variant with a reduced exploration rate (and hence faster convergence) under an additional ergodicity assumption. This work extends previous results relative to the generative setting [MP21], where the agent could at each step query the random outcome of any (state, action) pair. In contrast, we show here how to deal with the navigation constraints induced by the online setting. Our analysis relies on an ergodic theorem for non-homogeneous Markov chains which we consider of wide interest in the analysis of Markov Decision Processes.

1 Introduction

Somewhat surprisingly, learning in a Markov Decision Process is most often considered under the performance criteria of consistency or regret minimization (see e.g. [SB18, Sze10, LS20] and references therein). Regret minimization (see e.g. [AJO09, FCG10]) is particularly relevant when the rewards accumulated during the learning phase matter. This is however not always the case: for example, when learning a game (whether Go, chess, Atari, or any other), winning or losing during the learning phase does not really matter. One may intuitively think that deliberately getting into difficulty, so as to observe otherwise unseen situations, can significantly accelerate the learning process. Another example is the training of robot prototypes in the factory: a reasonably good policy is first searched, regardless of the losses incurred, and serves as an initialization for a second, regret-minimization phase that starts when the robot is deployed. It is hence also of great practical importance to study the sample complexity of learning, and to design strategies that might improve, in this perspective, on regret-minimizing algorithms.

In this work, we are interested in the best policy identification (BPI) problem for infinite-horizon discounted MDPs. This framework was introduced by [Fie94] under the name of PAC-RL. In BPI the algorithm explores the MDP until it has gathered enough samples to return an $\varepsilon$-optimal policy with probability at least $1-\delta$. Crucially, the algorithm halts at a random time step, determined by a stopping rule which guarantees that the probability of returning a wrong answer is less than $\delta$. The optimality of BPI algorithms is measured through their sample complexity, defined as their expected stopping time. Best policy identification in MDPs has been mostly investigated under the lens of minimax optimality. The minimax framework may be overly pessimistic by accounting for the worst possible MDP, whereas an algorithm with instance-specific guarantees can adapt its exploration procedure to the hardness of the MDP instance that it faces. Recently, a few works made the first attempts towards understanding the instance-specific sample complexity of reinforcement learning. However, they typically either make simplifying assumptions such as access to a generative model [ZKB19, MP21], or restrict their attention to episodic MDPs [WSJ21]. In practice however, the samples gathered rather correspond to a single, possibly infinite, trajectory of the system that we wish to control. This motivates our study of the full online setting where observations are only obtained by navigating in the MDP, that is by sequential choices of actions and following the transitions of the MDP.

Our contributions. [MP21] recently proposed an information-theoretical complexity analysis for MDPs in the case of access to a generative model. Here we extend their results to the online setting. Our main goal is to understand how the online learning scheme affects the sample complexity compared to the easier case where we have a generative model. A natural first step consists in understanding how the first-order term $T^{\star}\log(1/\delta)$ changes. Thus we only focus on the asymptotic regime $\delta\to 0$. Our key contributions can be summarized as follows:

  • First, we adapt the lower bound of [MP21] to the online setting (Proposition 2). The new bound can also be written as the value of a zero-sum two-player game between nature and the algorithm, where the loss function remains the same, but the set of possible strategies for the algorithm is restricted to a subset $\Omega(\mathcal{M})$ of the simplex of dimension $SA-1$. We refer to the constraints defining $\Omega(\mathcal{M})$ as the navigation constraints.

  • We propose MDP-NaS, the first algorithm for the online setting with instance-dependent bounds on its sample complexity in the asymptotic regime $\delta\to 0$ (Theorem 6); before publication of this work, but after a preprint was available online, [WSJ21] proposed another algorithm with instance-dependent guarantees. A major challenge lies in the design of a sampling rule that guarantees that the sampling frequencies of state-action pairs $(N_{sa}(t)/t)_{s,a}$ converge to some target oracle allocation $\omega^{\star}\in\Omega(\mathcal{M})$. Indeed, since we can no longer choose the next state of the agent, the tracking procedure developed by [GK16] for multi-armed bandits and used in [MP21] for MDPs with a generative model can no longer be applied in our setting. We propose a new sampling rule which performs exploration according to a mixture of the uniform policy and a plug-in estimate of the oracle policy (the policy whose stationary state-action distribution is $\omega^{\star}$) and prove that it satisfies the requirement above. The analysis of our sampling rule relies on an ergodic theorem for non-homogeneous Markov chains of independent interest (Proposition 12).

  • We investigate, depending on the communication properties of the ground-truth instance $\mathcal{M}$, the minimal forced-exploration rate in our sampling rule that guarantees the consistency of the plug-in estimator of the oracle policy. Our findings imply that when $\mathcal{M}$ is ergodic, an exploration rate as low as $1/\sqrt{t}$ is sufficient. However, when $\mathcal{M}$ is only assumed to be communicating, one has to resort to a much more conservative exploration rate of $t^{-\frac{1}{m+1}}$, where $m$ is a parameter defined in Lemma 4 that may be as large as $S-1$ in the worst case.

  • Finally, our stopping rule represents the first implementation of the Generalized Likelihood Ratio test for MDPs. Notably, we circumvent the need to solve the max-min program of the lower bound exactly, and show how an upper bound of the best-response problem, such as the one derived in [MP21], can be used to perform a GLR test. This improves upon the sample complexity bound that one obtains using the KL-Ball stopping rule of [MP21] by at least a factor of 2 (note that the stopping rule is independent of the sampling rule and thus can be used in both the generative and the online settings; furthermore, one may even obtain an improved factor of 4 by using a deviation inequality for the full distribution of (reward, next-state) instead of a union bound of deviation inequalities for each marginal distribution).

1.1 Related work

Minimax Best-Policy Identification. BPI in the online setting has been investigated recently by a number of works with minimax sample complexity bounds. In the case of episodic MDPs, [KMDD+21] and [MDJ+20] proposed algorithms that identify an $\varepsilon$-optimal policy at the initial state w.h.p. In contrast, in the case of infinite-horizon MDPs one is rather interested in finding a good policy at every state. Recent works provide convergence analyses for Q-learning [LWC+20b] or policy gradient algorithms [AHKS20], [FYAY21], [ZCA21]. Their results typically state that if the algorithm is fed with the appropriate hyperparameters, it can return an $\varepsilon$-optimal policy w.h.p. after collecting a polynomial number of samples. In practice, a pure exploration agent needs a stopping rule to determine when to halt the learning process. In particular, the question of how to tune the number of iterations of these algorithms without prior knowledge of the ground-truth instance remains open. (Here we chose not to mention works that investigate the sample complexity in the PAC-MDP framework [Kak03]. Indeed, in this framework the sample complexity is rather defined as the number of episodes where the algorithm does not play an $\varepsilon$-optimal policy. As explained in [DMKV21], this objective is closer to regret minimization than to pure exploration.)

Generative Model. A large body of literature focuses on the so-called generative model, where in each step the algorithm may query a sample (i.e., observe a reward and a next state drawn from the reward and transition distributions respectively) from any given state-action pair [KS98], [KMN99], [EDMM06], [AMK13], [DTC13], [Wan17], [SWW+18], [ZKB19], [AKY20], [LWC+20a], [LCC+21], [MP21].

Instance-specific bounds. Instance-optimal algorithms for Best Arm Identification in multi-armed bandits (MDPs with a single state) have been obtained independently by [GK16] and [Rus16]. Here, we extend their information-theoretical approach to the problem of Best Policy Identification in MDPs. More recently, [WSJ21] provided an algorithm for BPI in episodic MDPs with instance-specific sample complexity. A more detailed comparison with [WSJ21] can be found in Section 6.

Outline. The rest of the paper is organized as follows: after introducing the setting and giving some notation and definitions in Section 2, we derive in Section 3 a lower bound on the time required by any algorithm navigating the MDP until it is able to identify the best policy with probability at least $1-\delta$. The algorithm is presented in Section 4. Section 5 contains our main results along with a sketch of their analysis. Finally, in Section 6 we compare our results with MOCA, the only other algorithm (to the best of our knowledge) with problem-dependent guarantees in the online setting. Most technical results and proofs are given in the appendix.

2 Setting and notation

Discounted MDPs. We consider infinite-horizon discounted MDPs with finite state and action spaces $\mathcal{S}$ and $\mathcal{A}$. Such an MDP $\mathcal{M}$ is defined through its transition and reward distributions $p_{\mathcal{M}}(\cdot|s,a)$ and $q_{\mathcal{M}}(\cdot|s,a)$ for all $(s,a)$. For simplicity, $q_{\mathcal{M}}(\cdot|s,a)$ will denote the density of the reward distribution w.r.t. a positive measure $\lambda$ with support included in $[0,1]$. Specifically, $p_{\mathcal{M}}(s'|s,a)$ denotes the probability of transitioning to state $s'$ after playing action $a$ in state $s$, while $R(s,a)$ is the random instantaneous reward that is collected. Finally, $\gamma\in[0,1)$ is a discount factor. We look for an optimal control policy $\pi^{\star}:\mathcal{S}\to\mathcal{A}$ maximizing the long-term discounted reward $V_{\mathcal{M}}^{\pi}(s)=\mathbb{E}_{\mathcal{M},\pi}[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},\pi(s_{t}))]$, where $\mathbb{E}_{\mathcal{M},\pi}$ denotes the expectation w.r.t. the randomness in the rewards and the trajectory when the policy $\pi$ is used. Classically, we denote by $V_{\mathcal{M}}^{\star}$ the optimal value function of $\mathcal{M}$ and by $Q^{\star}_{\mathcal{M}}$ the optimal $Q$-value function of $\mathcal{M}$. $\Pi^{\star}(\mathcal{M})=\{\pi:\mathcal{S}\to\mathcal{A},\ V_{\mathcal{M}}^{\pi}=V_{\mathcal{M}}^{\star}\}$ denotes the set of optimal policies for $\mathcal{M}$.

Problem-dependent quantities. The sub-optimality gap of action $a$ in state $s$ is defined as $\Delta_{sa}=V_{\mathcal{M}}^{\star}(s)-Q_{\mathcal{M}}^{\star}(s,a)$. Let $\Delta_{\min}=\min_{s,a\neq\pi^{\star}(s)}\Delta_{sa}$ be the minimum gap and $\operatorname{sp}(V_{\mathcal{M}}^{\star})=\max_{s,s'}|V_{\mathcal{M}}^{\star}(s)-V_{\mathcal{M}}^{\star}(s')|$ be the span of $V_{\mathcal{M}}^{\star}$. Finally, we introduce the variance of the value function at $(s,a)$ as $\mathrm{Var}_{p(s,a)}[V_{\mathcal{M}}^{\star}]=\mathrm{Var}_{s'\sim p_{\mathcal{M}}(\cdot|s,a)}[V_{\mathcal{M}}^{\star}(s')]$.
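These quantities are straightforward to compute once $V_{\mathcal{M}}^{\star}$ and $Q_{\mathcal{M}}^{\star}$ are known. Below is a minimal numpy sketch on a small random instance; all array names are illustrative assumptions, and value iteration is used here merely as a convenient solver.

```python
import numpy as np

def optimal_values(P, r, gamma, tol=1e-10):
    """Value iteration on a tabular MDP: P[s, a, s'] transition probs, r[s, a] mean rewards."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = r + gamma * (P @ V)            # Q[s, a] = r[s, a] + gamma * sum_s' P[s, a, s'] V[s']
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

# A small random instance (illustrative only).
rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a, :] is a distribution over next states
r = rng.uniform(size=(S, A))
V, Q = optimal_values(P, r, gamma)

gaps = V[:, None] - Q                        # Delta_{sa} = V*(s) - Q*(s, a)
delta_min = np.min(gaps[gaps > 1e-8])        # minimum gap over sub-optimal pairs
span = V.max() - V.min()                     # sp(V*)
var_V = P @ (V ** 2) - (P @ V) ** 2          # Var_{s' ~ p(.|s,a)}[V*(s')]
```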

Active pure exploration with fixed confidence. When $\mathcal{M}$ is unknown, we wish to devise a learning algorithm $\mathbb{A}$ identifying, from a single trajectory, an optimal policy as quickly as possible with some given level of confidence. Formally, such an algorithm consists of (i) a sampling rule, selecting in each round $t$ in an adaptive manner the action $a_{t}$ to be played; $a_{t}$ depends on past observations, i.e., it is $\mathcal{F}_{t}^{\mathbb{A}}$-measurable, where $\mathcal{F}_{t}^{\mathbb{A}}=\sigma(s_{0},a_{0},R_{0},\ldots,s_{t-1},a_{t-1},R_{t-1},s_{t})$ is the $\sigma$-algebra generated by observations up to time $t$; (ii) a stopping rule $\tau_{\delta}$, a stopping time w.r.t. the filtration $(\mathcal{F}_{t}^{\mathbb{A}})_{t\geq 0}$, deciding when to stop collecting observations; (iii) a decision rule returning an estimated optimal policy $\hat{\pi}^{\star}_{\tau_{\delta}}$.

An algorithm $\mathbb{A}$ is $\delta$-Probably Correct ($\delta$-PC) over some set $\mathbb{M}$ of MDPs if, for any $\mathcal{M}\in\mathbb{M}$, it returns (in finite time) an optimal policy with probability at least $1-\delta$. In this paper, we aim to devise a $\delta$-PC algorithm $\mathbb{A}$ with minimal sample complexity $\mathbb{E}_{\mathcal{M},\mathbb{A}}[\tau_{\delta}]$. We make the following assumption.

Assumption 1.

We consider the set $\mathbb{M}$ of communicating MDPs with a unique optimal policy.

We justify the above assumption as follows. (i) We restrict our attention to the case where $\mathcal{M}$ is communicating because otherwise, if it is multichain, there would be a non-zero probability that the algorithm enters a subclass of states from which there is no possible return. In this case it becomes impossible to identify the global optimal policy (unless we modify the objective to finding the optimal policy in this subchain). (ii) Regarding the uniqueness of the optimal policy, treating the case of MDPs with multiple optimal policies, or that of $\varepsilon$-optimal policy identification, requires the use of more involved Overlapping Hypothesis tests, which is already challenging in multi-armed bandits (MDPs with a single state) [GK19]. We will analyze how to remove this assumption in future work.

Notation. $\mathcal{Z}\triangleq\mathcal{S}\times\mathcal{A}$ is the state-action space. $\Sigma\triangleq\{\omega\in\mathbb{R}_{+}^{S\times A}:\ \sum_{s,a}\omega_{sa}=1\}$ denotes the simplex of dimension $SA-1$. $N_{sa}(t)$ denotes the number of times the state-action pair $(s,a)$ has been visited up to the end of round $t$, and $N_{s}(t)=\sum_{a\in\mathcal{A}}N_{sa}(t)$. Similarly, for a vector $\omega\in\Sigma$ we write $\omega_{s}\triangleq\sum_{a\in\mathcal{A}}\omega_{sa}$. For a matrix $A\in\mathbb{R}^{n\times m}$, the infinity norm is defined as $\lVert A\rVert_{\infty}\triangleq\max_{1\leq i\leq n}\sum_{j=1}^{m}|a_{i,j}|$. The KL divergence between two probability distributions $P$ and $Q$ on some discrete space $\mathcal{S}$ is defined as $KL(P\|Q)\triangleq\sum_{s\in\mathcal{S}}P(s)\log\frac{P(s)}{Q(s)}$. For Bernoulli distributions of respective means $p$ and $q$, the KL divergence is denoted by $\operatorname{kl}(p,q)$. For distributions over $[0,1]$ defined through their densities $p$ and $q$ w.r.t. some positive measure $\lambda$, the KL divergence is $KL(p\|q)\triangleq\int p(x)\log\big(\frac{p(x)}{q(x)}\big)\,\lambda(dx)$. $\mathbb{M}$ denotes the set of communicating MDPs with a unique optimal policy. For two MDPs $\mathcal{M},\mathcal{M}'\in\mathbb{M}$, we write $\mathcal{M}\ll\mathcal{M}'$ if for all $(s,a)$, $p_{\mathcal{M}}(\cdot|s,a)\ll p_{\mathcal{M}'}(\cdot|s,a)$ and $q_{\mathcal{M}}(\cdot|s,a)\ll q_{\mathcal{M}'}(\cdot|s,a)$. In that case, for any state-action pair $(s,a)$, we define the KL divergence of the distributions of the one-step observations under $\mathcal{M}$ and $\mathcal{M}'$ when starting at $(s,a)$ as $\textrm{KL}_{\mathcal{M}|\mathcal{M}'}(s,a)\triangleq KL(p_{\mathcal{M}}(\cdot|s,a)\|p_{\mathcal{M}'}(\cdot|s,a))+KL(q_{\mathcal{M}}(\cdot|s,a)\|q_{\mathcal{M}'}(\cdot|s,a))$.
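As an illustration, here is a minimal numpy sketch of $\textrm{KL}_{\mathcal{M}|\mathcal{M}'}(s,a)$ in the special case of categorical transition distributions and Bernoulli rewards (the paper allows general reward densities; all function names are illustrative assumptions).

```python
import numpy as np

def kl_categorical(p, q, eps=1e-12):
    """KL(p || q) for finite transition distributions (assumes p << q)."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / (q[mask] + eps))))

def kl_bernoulli(p, q):
    """kl(p, q) for Bernoulli reward distributions with means p and q."""
    p, q = np.clip(p, 1e-12, 1 - 1e-12), np.clip(q, 1e-12, 1 - 1e-12)
    return float(p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q)))

def kl_pair(pM, rM, pN, rN, s, a):
    """KL_{M|M'}(s, a): transition KL plus reward KL at the pair (s, a)."""
    return kl_categorical(pM[s, a], pN[s, a]) + kl_bernoulli(rM[s, a], rN[s, a])
```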

3 Sample complexity lower bound

In this section, we first derive a lower bound on the expected sample complexity satisfied by any $\delta$-PC algorithm. The lower bound is obtained as the solution of a non-convex optimization problem, as in the case where the learner has access to a generative model [MP21]. The problem has however additional constraints, referred to as the navigation constraints, due to the fact that the learner has access to a single system trajectory.
The expected sample complexity of an algorithm $\mathbb{A}$ is $\mathbb{E}_{\mathcal{M},\mathbb{A}}[\tau_{\delta}]=\sum_{s,a}\mathbb{E}_{\mathcal{M},\mathbb{A}}[N_{sa}(\tau_{\delta})]$. Lower bounds on the sample complexity are derived by identifying constraints that the various $N_{sa}(\tau_{\delta})$'s need to satisfy so as to get a $\delta$-PC algorithm. We distinguish here two kinds of constraints:

Information constraints. These are constraints on $\mathbb{E}_{\mathcal{M},\mathbb{A}}[N_{sa}(\tau_{\delta})]$ ensuring that the algorithm can learn the optimal policy with probability at least $1-\delta$. They are the same as those derived in [MP21] when the learner has access to a generative model, and are recalled in Lemma 9 in Appendix C.

Navigation constraints. Observations come from a single (but controlled) system trajectory, which imposes additional constraints on the $N_{sa}(\tau_{\delta})$'s. These are derived by simply writing the Chapman-Kolmogorov equations of the controlled Markov chain (refer to Appendix C for a proof).

Lemma 1.

For any algorithm $\mathbb{A}$ and for any state $s$, we have:

\Big|\mathbb{E}_{\mathcal{M},\mathbb{A}}[N_{s}(\tau_{\delta})]-\sum_{s',a'}p_{\mathcal{M}}(s|s',a')\,\mathbb{E}_{\mathcal{M},\mathbb{A}}[N_{s'a'}(\tau_{\delta})]\Big|\leq 1.\qquad (1)

Putting all constraints together, we obtain the following sample complexity lower bound.

Proposition 2.

Define the set of navigation-constrained allocation vectors:

\Omega(\mathcal{M})=\big\{\omega\in\Sigma:\ \forall s\in\mathcal{S},\ \omega_{s}=\sum_{s',a'}p_{\mathcal{M}}(s|s',a')\,\omega_{s'a'}\big\}.

Further define $\operatorname{Alt}(\mathcal{M})=\{\mathcal{M}'\textrm{ MDP}:\mathcal{M}\ll\mathcal{M}',\ \Pi^{\star}(\mathcal{M})\cap\Pi^{\star}(\mathcal{M}')=\emptyset\}$, the set of alternative instances. Then the expected sample complexity $\mathbb{E}_{\mathcal{M},\mathbb{A}}[\tau_{\delta}]$ of any $\delta$-PC algorithm $\mathbb{A}$ satisfies:

\liminf_{\delta\to 0}\ \frac{\mathbb{E}_{\mathcal{M},\mathbb{A}}[\tau_{\delta}]}{\log(1/\delta)}\geq T_{o}(\mathcal{M}),\quad\textrm{where}\quad T_{o}(\mathcal{M})^{-1}=\sup_{\omega\in\Omega(\mathcal{M})}T(\mathcal{M},\omega)^{-1}\qquad (2)

and

T(\mathcal{M},\omega)^{-1}=\inf_{\mathcal{M}'\in\operatorname{Alt}(\mathcal{M})}\sum_{s,a}\omega_{sa}\,\textrm{KL}_{\mathcal{M}|\mathcal{M}'}(s,a).\qquad (3)
Remark 1.

A common interpretation of change-of-measure lower bounds like the one above is the following: the optimization problem in the definition of $T_{o}(\mathcal{M})$ can be seen as the value of a two-player zero-sum game between an algorithm which samples each state-action pair $(s,a)$ proportionally to $\omega_{sa}$ and an adversary who chooses an alternative instance $\mathcal{M}'$ that is difficult to distinguish from $\mathcal{M}$ under the algorithm's sampling strategy. This suggests that an optimal algorithm should play the optimal allocation that solves the optimization problem (2) and, as a consequence, rules out all alternative instances as fast as possible.

Remark 2.

Note that compared to the lower bound of the generative setting in [MP21], the only change is in the set of sampling strategies that the algorithm can play, which is no longer equal to the entire simplex.
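To illustrate the navigation constraints, here is a minimal numpy sketch (array names are illustrative assumptions) checking that the stationary state-action distribution of an arbitrary stationary policy indeed belongs to $\Omega(\mathcal{M})$.

```python
import numpy as np

def stationary_allocation(P, pi):
    """Stationary state-action distribution omega of the chain induced by policy pi[s, a]."""
    S, A, _ = P.shape
    K = np.einsum('sa,sab->sb', pi, P)               # K[s, s'] = sum_a pi[s, a] P[s, a, s']
    evals, evecs = np.linalg.eig(K.T)                # stationary state distribution mu: mu K = mu
    mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    mu = np.abs(mu) / np.abs(mu).sum()
    return mu[:, None] * pi                          # omega_{sa} = mu(s) pi(a|s)

def in_navigation_set(omega, P, tol=1e-8):
    """Check the constraints defining Omega(M): omega_s = sum_{s', a'} P(s | s', a') omega_{s'a'}."""
    lhs = omega.sum(axis=1)
    rhs = np.einsum('sab,sa->b', P, omega)
    return np.isclose(omega.sum(), 1.0, atol=tol) and np.allclose(lhs, rhs, atol=tol)

rng = np.random.default_rng(1)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)               # an arbitrary stochastic policy
assert in_navigation_set(stationary_allocation(P, pi), P)
```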

3.1 Proxy for the optimal allocation and the characteristic time

As shown in [MP21], even without accounting for the navigation constraints, computing the characteristic time $T_{o}(\mathcal{M})$, and in particular the optimal allocation leading to it, is not easy. Indeed, the sub-problem of computing $T(\mathcal{M},\omega)^{-1}$ is non-convex. This makes it difficult to design an algorithm that targets the optimal weights $\operatorname*{arg\,max}_{\omega\in\Omega(\mathcal{M})}T(\mathcal{M},\omega)^{-1}$. Instead, we use a tractable upper bound of $T(\mathcal{M},\omega)$ from [MP21]:

Lemma 3.

(Theorem 1, [MP21]) For all vectors $\omega\in\Sigma$, $T(\mathcal{M},\omega)\leq U(\mathcal{M},\omega)$, where (the exact definition of $H^{\star}$ is given in Appendix C):

U(\mathcal{M},\omega)=\max_{(s,a):a\neq\pi^{\star}(s)}\ \frac{H_{sa}}{\omega_{sa}}+\frac{H^{\star}}{S\,\min_{s}\omega_{s,\pi^{\star}(s)}},\qquad (4)

\textrm{and}\quad\begin{cases}H_{sa}=\dfrac{2}{\Delta_{sa}^{2}}+\max\bigg(\dfrac{16\,\mathrm{Var}_{p(s,a)}[V_{\mathcal{M}}^{\star}]}{\Delta_{sa}^{2}},\dfrac{6\operatorname{sp}(V_{\mathcal{M}}^{\star})^{4/3}}{\Delta_{sa}^{4/3}}\bigg),\\[2mm] H^{\star}=\dfrac{2S}{[\Delta_{\min}(1-\gamma)]^{2}}+\mathcal{O}\bigg(\dfrac{S}{\Delta_{\min}^{2}(1-\gamma)^{3}}\bigg).\end{cases}\qquad (5)

Using $U(\mathcal{M},\omega)$, we obtain the following upper bound on the characteristic time (2):

T_{o}(\mathcal{M})\leq U_{o}(\mathcal{M})\triangleq\inf_{\omega\in\Omega(\mathcal{M})}U(\mathcal{M},\omega).\qquad (6)

The advantages of the above upper bound $U_{o}(\mathcal{M})$ are that: (i) it is a problem-specific quantity, as it depends on the gaps and variances of the value function in $\mathcal{M}$; (ii) the corresponding allocation (that solves (6)) can be easily computed, and hence targeted. Indeed, the optimization problem in (6) has a convex objective and convex constraints. Therefore, we can use the projected subgradient-descent algorithm to compute

{\bm{\omega}}^{\star}(\mathcal{M})\triangleq\operatorname*{arg\,min}_{\omega\in\Omega(\mathcal{M})}\ U(\mathcal{M},\omega)=\operatorname*{arg\,max}_{\omega\in\Omega(\mathcal{M})}\ U(\mathcal{M},\omega)^{-1},\qquad (7)

which will be used in our algorithm as a proxy for $\operatorname*{arg\,max}_{\omega\in\Omega(\mathcal{M})}T(\mathcal{M},\omega)^{-1}$. (Note that if we have access to an optimization oracle that, given an MDP $\mathcal{M}$, returns the optimal allocation solving (2), then we can replace ${\bm{\omega}}^{\star}(\mathcal{M})$ by this optimal allocation. Our algorithm will then be asymptotically optimal up to a factor of 2.)
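For concreteness, here is a minimal sketch of the convex program (6)-(7), written with cvxpy as a generic convex solver instead of the projected subgradient method mentioned above; the inputs $H_{sa}$ and $H^{\star}$ would be computed from (5), and all names are illustrative assumptions.

```python
import numpy as np
import cvxpy as cp

def proxy_allocation(P, H_sa, H_star, pi_star):
    """Sketch of (6)-(7): minimize U(M, w) over Omega(M).
    P[s, a, s']: transitions; H_sa[s, a], H_star: hardness terms; pi_star[s]: optimal actions."""
    S, A, _ = P.shape
    w = cp.Variable(S * A, nonneg=True)               # flattened allocation, index s * A + a
    sub = [s * A + a for s in range(S) for a in range(A) if a != pi_star[s]]
    opt = [s * A + pi_star[s] for s in range(S)]
    H_sub = np.array([H_sa[s, a] for s in range(S) for a in range(A) if a != pi_star[s]])
    objective = cp.max(cp.multiply(H_sub, cp.inv_pos(w[sub]))) \
        + H_star * cp.inv_pos(S * cp.min(w[opt]))
    # Navigation constraints: omega_s = sum_{s', a'} p(s | s', a') omega_{s'a'} for every s.
    P_flat = P.reshape(S * A, S)
    constraints = [cp.sum(w) == 1, w >= 1e-9]
    constraints += [cp.sum(w[s * A:(s + 1) * A]) == P_flat[:, s] @ w for s in range(S)]
    cp.Problem(cp.Minimize(objective), constraints).solve()
    return w.value.reshape(S, A)
```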

4 Algorithm

We propose MDP-NaS (MDP Navigate-and-Stop), a model-based algorithm inspired by the lower bound. The lower bound suggests that, to identify the best policy in a sample-efficient manner, the algorithm must collect samples from each state-action pair $(s,a)$ proportionally to ${\bm{\omega}}_{sa}^{\star}(\mathcal{M})$. We propose two sampling rules which ensure that this holds in the long term (see Section 4.1 and Theorem 7 for a rigorous formulation). Our sampling rules are combined with a Generalized Likelihood Ratio (GLR) test (or rather a proxy of the GLR test, see Section 4.2 for details), which stops as soon as we are confident that $\widehat{\pi}_{t}^{\star}=\pi^{\star}$ with probability at least $1-\delta$. The pseudo-code for MDP-NaS is given in Algorithm 1.

Input: Confidence level $\delta$, ERGODIC boolean variable, communication parameter $m$ or an upper bound.
1: if ERGODIC then
2:     Set $(\varepsilon_{t})_{t\geq 1}=(1/\sqrt{t})_{t\geq 1}$.
3: else
4:     Set $(\varepsilon_{t})_{t\geq 1}=(t^{-\frac{1}{m+1}})_{t\geq 1}$.
5: end if
6: Set $t\leftarrow 0$ and $N_{sa}(t)\leftarrow 0$ for all $(s,a)$.
7: Initialize the empirical estimate $\widehat{\mathcal{M}}_{0}$ by drawing an arbitrary MDP from $\mathbb{M}$.
8: Compute $\widehat{\pi}_{t}^{\star}\leftarrow\textrm{POLICY-ITERATION}(\widehat{\mathcal{M}}_{t})$.
9: while Stopping condition (13) is not satisfied do
10:     Compute ${\bm{\omega}}^{\star}(\widehat{\mathcal{M}}_{t})$ according to (7) and $\pi^{o}(\widehat{\mathcal{M}}_{t})$ according to (8).
11:     if $t=0$ then
12:         Play $a_{0}\sim\mathrm{Unif}([|1,A|])$.
13:     else
14:         Play $a_{t}\sim\pi_{t}(\cdot|s_{t})$, where $\pi_{t}$ is determined by either (9) or (10).
15:     end if
16:     Observe $(R_{t},s_{t+1})\sim q(\cdot|s_{t},a_{t})\otimes p(\cdot|s_{t},a_{t})$.
17:     $t\leftarrow t+1$.
18:     Update $\widehat{\mathcal{M}}_{t}$ and $(N_{sa}(t))_{s,a}$.
19:     Compute $\widehat{\pi}_{t}^{\star}\leftarrow\textrm{POLICY-ITERATION}(\widehat{\mathcal{M}}_{t})$.
20: end while
Output: Empirical optimal policy $\widehat{\pi}_{\tau}^{\star}$.
Algorithm 1: MDP Navigate-and-Stop (MDP-NaS)

4.1 Sampling rule

We introduce a few definitions to simplify the presentation. Any stationary policy $\pi$ induces a Markov chain on $\mathcal{Z}$ whose transition kernel is defined by $P_{\pi}((s,a),(s',a'))\triangleq P(s'|s,a)\pi(a'|s')$. With some abuse of notation, we will use $P_{\pi}$ to refer to both the Markov chain and its kernel. We denote by $\pi_{u}$ the uniform random policy, i.e., $\pi_{u}(a|s)=1/A$ for all pairs $(s,a)$. Finally, we define the vector of visit frequencies ${\bm{N}}(t)/t\triangleq\big(N_{sa}(t)/t\big)_{(s,a)\in\mathcal{Z}}$.

In contrast with pure exploration in Multi-Armed Bandits and MDPs with a generative model, where any allocation vector in the simplex is achievable, here the agent can only choose a sequence of actions and follow the resulting trajectory. Therefore, one might ask whether the oracle allocation can be achieved by following a simple policy. A natural candidate is the oracle policy defined by

\forall(s,a)\in\mathcal{Z},\quad\pi^{o}(a|s)\triangleq\frac{\omega_{sa}^{\star}}{\omega_{s}^{\star}}=\frac{\omega_{sa}^{\star}}{\sum_{b\in\mathcal{A}}\omega_{sb}^{\star}}.\qquad (8)

It is immediate to check that ${\bm{\omega}}^{\star}$ is the stationary distribution of $P_{\pi^{o}}$. Denote by $\pi^{o}(\mathcal{M})$ the above oracle policy defined through $\omega^{\star}(\mathcal{M})$. Policy $\pi^{o}(\mathcal{M})$ is the target that we would like to play, but since the rewards and dynamics of $\mathcal{M}$ are unknown, the actions must be chosen so as to estimate the oracle policy consistently while at the same time ensuring that ${\bm{N}}(t)/t$ converges to ${\bm{\omega}}^{\star}(\mathcal{M})$. The following two sampling rules satisfy these requirements.

D-Navigation rule: At time step $t$, the learner plays the policy

\pi_{t}=\varepsilon_{t}\pi_{u}+(1-\varepsilon_{t})\pi^{o}(\widehat{\mathcal{M}}_{t}),\qquad (9)

C-Navigation rule: At time step $t$, the learner plays

\pi_{t}=\varepsilon_{t}\pi_{u}+(1-\varepsilon_{t})\frac{1}{t}\sum_{j=1}^{t}\pi^{o}(\widehat{\mathcal{M}}_{j}),\qquad (10)

where $\varepsilon_{t}$ is a decreasing exploration rate to be tuned later. In the second case, the agent navigates using a Cesàro mean of the oracle policies instead of the current estimate of the oracle policy, which makes the navigation more stable.
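As an illustration, here is a minimal numpy sketch (all names are illustrative assumptions) of the oracle policy (8), the induced kernel $P_{\pi}$, and one action draw under the D- or C-Navigation rules (9)-(10).

```python
import numpy as np

def oracle_policy(omega_star, floor=1e-12):
    """pi^o(a|s) = omega*_{sa} / omega*_s, from an allocation array of shape (S, A); see (8)."""
    return omega_star / np.maximum(omega_star.sum(axis=1, keepdims=True), floor)

def kernel(P, pi):
    """P_pi((s, a), (s', a')) = P(s'|s, a) pi(a'|s'), as an (S*A) x (S*A) matrix."""
    S, A, _ = P.shape
    return (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)

def navigation_action(rng, s, oracle_history, eps_t, mode="C"):
    """One action draw of the D- or C-Navigation rule: mix the uniform policy with the
    current (D) or Cesaro-averaged (C) plug-in estimate of the oracle policy."""
    A = oracle_history[-1].shape[1]
    pi_o = oracle_history[-1] if mode == "D" else np.mean(oracle_history, axis=0)
    probs = eps_t / A + (1.0 - eps_t) * pi_o[s]
    return rng.choice(A, p=probs / probs.sum())      # renormalize to guard against rounding
```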

4.1.1 Tuning the forced exploration parameter

The mixture with the uniform policy (our results still hold if $\pi_{u}$ is replaced by any other policy $\pi$ with an ergodic kernel; in practice, this can be helpful, especially when we have prior knowledge of a fast-mixing policy $\pi$) helps the agent to explore all state-action pairs often enough in the initial phase. This is particularly necessary in the case where the empirical weight $\omega_{sa}^{\star}(\widehat{\mathcal{M}}_{t})$ of some state-action pair $(s,a)$ is under-estimated, which may cause this pair to be left aside so that the estimate is never corrected. The next result gives a tuning of the rate $\varepsilon_{t}$ that ensures sufficient forced exploration:

Lemma 4.

Let $m$ be the maximum length of the shortest paths (in terms of number of transitions) between pairs of states in $\mathcal{M}$: $m\triangleq\max_{(s,s')\in\mathcal{S}^{2}}\ \min\{n\geq 1:\ \exists\pi:\mathcal{S}\rightarrow\mathcal{A},\ P_{\pi}^{n}(s,s')>0\}$. Then C-Navigation or D-Navigation with any decreasing sequence $(\varepsilon_{t})_{t\geq 1}$ such that $\forall t\geq 1,\ \varepsilon_{t}\geq t^{-\frac{1}{m+1}}$ satisfies: $\mathbb{P}_{\mathcal{M},\mathbb{A}}\big(\forall(s,a)\in\mathcal{Z},\ \lim_{t\to\infty}N_{sa}(t)=\infty\big)=1$.
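As an illustration, a minimal sketch (names are illustrative assumptions) computing $m$ by breadth-first search on the support graph of the transitions; for simplicity it restricts the maximum to pairs of distinct states.

```python
import numpy as np
from collections import deque

def communication_parameter(P):
    """Maximum over pairs of distinct states (s, s') of the shortest-path length from s to s'
    in the support graph (edge s -> s' iff P(s'|s, a) > 0 for some action a).
    Assumes the MDP is communicating."""
    S = P.shape[0]
    succ = [np.flatnonzero(P[s].max(axis=0) > 0) for s in range(S)]
    m = 1
    for src in range(S):
        dist = {src: 0}
        queue = deque([src])
        while queue:                                  # BFS from src
            u = queue.popleft()
            for v in succ[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        m = max(m, max(d for node, d in dist.items() if node != src))
    return m
```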

Remark 3.

When the parameter $m$ is unknown to the learner, one can replace it by its worst-case value $m_{\max}=S-1$. However, when prior knowledge is available, using a faster-decreasing sequence $\varepsilon_{t}=t^{-\frac{1}{m+1}}$ instead of $t^{-\frac{1}{S}}$ can be useful to accelerate convergence, especially when the states of $\mathcal{M}$ are densely connected ($m\ll S-1$).
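A minimal sketch of the corresponding exploration schedule, falling back to the worst-case value $m=S-1$ when $m$ is unknown (function name is an illustrative assumption):

```python
def exploration_rate(t, S, m=None, ergodic=False):
    """Forced-exploration rate: 1/sqrt(t) for ergodic MDPs, and t^{-1/(m+1)} otherwise,
    using the worst case m = S - 1 when m is unknown."""
    if ergodic:
        return t ** (-0.5)
    m = (S - 1) if m is None else m
    return t ** (-1.0 / (m + 1))
```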

Minimal exploration rate: Communicating MDPs. The forced exploration rate $t^{-\frac{1}{S}}$ vanishes quite slowly. One may wonder whether this rate is necessary to guarantee sufficient exploration in communicating MDPs: the answer is yes in the worst case, as the following example shows. Consider a variant of the classical RiverSwim MDP with state space $\mathcal{S}=[|1,S|]$ and action space $\mathcal{A}=\{\textsc{LEFT1},\textsc{LEFT2},\textsc{RIGHT}\}$. LEFT1 and LEFT2 are equivalent actions inducing the same rewards and transitions. After playing RIGHT the agent makes a transition of one step to the right, while playing LEFT1 (or LEFT2) moves the agent all the way back to state $1$. The rewards are null everywhere except in states $\{1,S\}$, where $R(1,\textsc{LEFT1})$ and $R(1,\textsc{LEFT2})$ are equal to $0.01$ almost surely and $R(S,\textsc{RIGHT})$ is Bernoulli with mean $0.02$.

When $\gamma$ is close to $1$, the optimal policy consists in always playing RIGHT so as to reach state $S$ and stay there indefinitely. Now suppose that the agent starts at $s=1$ and, due to the small probability of observing the large reward in state $S$, underestimates the value of this state in the first rounds and focuses on distinguishing the best action to play in state $1$ among $\{\textsc{LEFT1},\textsc{LEFT2}\}$. Under this scenario she ends up playing a sequence of policies that scales like $\pi_{t}=2\varepsilon_{t}\pi_{u}+(1-2\varepsilon_{t})(\textsc{LEFT1}+\textsc{LEFT2})/2$ (modulo a rescaling of $\varepsilon_{t}$ by a constant factor of $1/2$). This induces the non-homogeneous Markov chain depicted in Figure 1. For the exploration rate $\varepsilon_{t}=t^{-\alpha}$, we show that if the agent uses any $\alpha>\frac{1}{S-1}$, then with non-zero probability she will visit state $S$ only a finite number of times. Therefore she will fail to identify the optimal policy for small values of the confidence level $\delta$. The proof is deferred to Appendix D.

Figure 1: The non-homogeneous Markov chain on states $1,2,\ldots,S$ induced by the policies above (each state moves one step to the right with probability $\varepsilon_{t}$ and falls back to state $1$ with probability $1-\varepsilon_{t}$). An exploration rate of at least $t^{-\frac{1}{S-1}}$ is needed.
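For concreteness, a minimal numpy sketch of this hard instance (names are illustrative; the behaviour of RIGHT in state $S$ is not specified above and is assumed here to keep the agent in state $S$):

```python
import numpy as np

def hard_riverswim(S):
    """Transitions and mean rewards of the RiverSwim variant described above.
    Actions: 0 = LEFT1, 1 = LEFT2, 2 = RIGHT. Deterministic transitions."""
    A = 3
    P = np.zeros((S, A, S))
    r = np.zeros((S, A))
    for s in range(S):
        P[s, 0, 0] = P[s, 1, 0] = 1.0            # LEFT1 / LEFT2: back to state 1 (index 0)
        P[s, 2, min(s + 1, S - 1)] = 1.0         # RIGHT: one step to the right
    r[0, 0] = r[0, 1] = 0.01                     # deterministic small reward in state 1
    r[S - 1, 2] = 0.02                           # Bernoulli(0.02) reward in state S (mean shown)
    return P, r
```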
Minimal exploration rate: Ergodic MDPs.

For such MDPs, we can select $\varepsilon_{t}=1/t^{\alpha}$ with any $\alpha<1$ without compromising the conclusion of Lemma 4. The proof is deferred to Appendix D.

4.2 Stopping rule

To implement a GLR test, we define $\ell_{\mathcal{M}'}(t)$, the likelihood of the observations under some MDP $\mathcal{M}'\ll\mathcal{M}$: $\ell_{\mathcal{M}'}(t)=\prod_{k=0}^{t-1}p_{\mathcal{M}'}(s_{k+1}|s_{k},a_{k})\,q_{\mathcal{M}'}(R_{k}|s_{k},a_{k})$, where at step $k$ the algorithm is in state $s_{k}$, plays action $a_{k}$ and observes the reward $R_{k}$ and the next state $s_{k+1}$. Performing a GLR test at step $t$ consists in computing the optimal policy $\widehat{\pi}_{t}^{\star}$ of the estimated MDP $\widehat{\mathcal{M}}_{t}$ and in comparing the likelihood of the observations under the most likely model where $\widehat{\pi}_{t}^{\star}$ is optimal to the likelihood under the most likely model where $\widehat{\pi}_{t}^{\star}$ is sub-optimal. Following the standardized form of the GLR for multi-armed bandits in [DKM19], we write:

G_{\widehat{\pi}_{t}^{\star}}(t)=\log\frac{\sup_{\mathcal{M}':\,\widehat{\pi}_{t}^{\star}\in\Pi^{\star}(\mathcal{M}')}\ell_{\mathcal{M}'}(t)}{\sup_{\mathcal{M}':\,\widehat{\pi}_{t}^{\star}\notin\Pi^{\star}(\mathcal{M}')}\ell_{\mathcal{M}'}(t)}=\log\frac{\ell_{\widehat{\mathcal{M}}_{t}}(t)}{\sup_{\mathcal{M}'\in\operatorname{Alt}(\widehat{\mathcal{M}}_{t})}\ell_{\mathcal{M}'}(t)}=\inf_{\mathcal{M}'\in\operatorname{Alt}(\widehat{\mathcal{M}}_{t})}\log\frac{\ell_{\widehat{\mathcal{M}}_{t}}(t)}{\ell_{\mathcal{M}'}(t)}
=\inf_{\mathcal{M}'\in\operatorname{Alt}(\widehat{\mathcal{M}}_{t})}\sum_{(s,a)\in\mathcal{Z}}N_{sa}(t)\big[\operatorname{KL}(\widehat{q}_{s,a}(t),q_{\mathcal{M}'}(s,a))+\operatorname{KL}(\widehat{p}_{s,a}(t),p_{\mathcal{M}'}(s,a))\big]\qquad (11)
=t\,T\big(\widehat{\mathcal{M}}_{t},{\bm{N}}(t)/t\big)^{-1}.\qquad (12)

The hypothesis $(\widehat{\pi}_{t}^{\star}\neq\pi^{\star})$ is then rejected as soon as this ratio of likelihoods becomes greater than a threshold $\beta(t,\delta)$, properly tuned to ensure that the algorithm is $\delta$-PC. Since the likelihood ratio $G_{\widehat{\pi}_{t}^{\star}}(t)$ itself can be difficult to compute for MDPs (computing it is equivalent to solving (3)), we lower bound it using (12) and Lemma 3. This leads to the following proposition:

Proposition 5.

Define the random thresholds for the transitions and rewards respectively:

\beta_{p}(t,\delta)\triangleq\log(1/\delta)+(S-1)\sum_{(s,a)}\log\Big(e\big[1+N_{sa}(t)/(S-1)\big]\Big)

\beta_{r}(t,\delta)\triangleq SA\,\varphi\big(\log(1/\delta)/SA\big)+3\sum_{s,a}\log\big[1+\log(N_{sa}(t))\big]

where $\varphi(x)\underset{\infty}{\sim}x+\log(x)$ is defined in the appendix. Then the stopping rule:

\tau_{\delta}\triangleq\inf\Big\{t\geq 1:\ t\,U\big(\widehat{\mathcal{M}}_{t},{\bm{N}}(t)/t\big)^{-1}\geq\beta_{r}(t,\delta/2)+\beta_{p}(t,\delta/2)\Big\}\qquad (13)

is $\delta$-PC, i.e., $\mathbb{P}(\tau_{\delta}<\infty,\ \widehat{\pi}_{\tau_{\delta}}^{\star}\neq\pi^{\star})\leq\delta$.
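As an illustration, a minimal sketch of the stopping condition (13), using the asymptotic approximation $\varphi(x)\approx x+\log(x)$ in place of the exact $\varphi$ from the appendix; the quantity $U(\widehat{\mathcal{M}}_{t},{\bm{N}}(t)/t)^{-1}$ is assumed to be computed separately from the empirical MDP via Lemma 3, and all names are illustrative assumptions.

```python
import numpy as np

def thresholds(N, delta, S, A):
    """beta_p(t, delta/2) + beta_r(t, delta/2) from Proposition 5, with phi(x) approximated
    by x + log(x) (valid in the small-delta regime). N[s, a] are the visit counts."""
    beta_p = np.log(2 / delta) + (S - 1) * np.sum(np.log(np.e * (1 + N / (S - 1))))
    x = np.log(2 / delta) / (S * A)
    beta_r = S * A * (x + np.log(x)) + 3 * np.sum(np.log(1 + np.log(np.maximum(N, 1))))
    return beta_p + beta_r

def should_stop(t, U_inverse, N, delta, S, A):
    """Stopping rule (13): stop when t * U(M_hat_t, N(t)/t)^{-1} exceeds the threshold."""
    return t * U_inverse >= thresholds(N, delta, S, A)
```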

5 Main results and sample complexity analysis

First we state our main results, which take the form of asymptotic upper bounds on the sample complexity of MDP-NaS, under a slightly more conservative exploration rate than in Lemma 4. Then we present the most important ingredients in their proof. The complete proof is provided in Appendix F.

Theorem 6.

Using the C-Navigation sampling rule with $\varepsilon_{t}=t^{-\frac{1}{2(m+1)}}$ and the stopping rule (13): (i) MDP-NaS stops almost surely and its stopping time satisfies $\mathbb{P}\big(\limsup_{\delta\to 0}\frac{\tau_{\delta}}{\log(1/\delta)}\leq 2U_{o}(\mathcal{M})\big)=1$, where $U_{o}(\mathcal{M})$ was defined in (6); (ii) the stopping time of MDP-NaS has a finite expectation for all $\delta\in(0,1)$, and $\limsup_{\delta\to 0}\frac{\mathbb{E}[\tau_{\delta}]}{\log(1/\delta)}\leq 2U_{o}(\mathcal{M})$.

5.1 Proof sketch

Concentration of empirical MDPs: The starting point of our proof is a concentration event of the empirical estimates $\widehat{\mathcal{M}}_{t}$ around $\mathcal{M}$. For $\xi>0$ and $T\geq 1$, we define $\mathcal{C}_{T}^{1}(\xi)=\bigcap_{t=T^{1/4}}^{T}\big(\lVert\widehat{\mathcal{M}}_{t}-\mathcal{M}\rVert\leq\xi,\ \lVert\pi^{o}(\widehat{\mathcal{M}}_{t})-\pi^{o}(\mathcal{M})\rVert_{\infty}\leq\xi\big)$, where $\lVert\cdot\rVert$ is a semi-norm on MDPs. Then we prove in Lemma 18 that $\mathcal{C}_{T}^{1}(\xi)$ holds with high probability, in the sense that for all $T\geq 1$:

\mathbb{P}\left(\mathcal{C}_{T}^{1}(\xi)\right)\geq 1-\mathcal{O}\big(1/T^{2}\big).\qquad (14)

For this purpose we derive, for all pairs $(s,a)$, a lower bound on $N_{sa}(t)$ stating that $\mathbb{P}\big(\forall(s,a)\in\mathcal{Z},\ \forall t\geq 1,\ N_{sa}(t)\geq(t/\lambda_{\alpha})^{1/4}-1\big)\geq 1-\alpha$, where $\lambda_{\alpha}\propto\log^{2}(1+SA/\alpha)$ is a parameter that depends on the mixing properties of $\mathcal{M}$ under the uniform policy (using the method in our proof, one can derive a lower bound of the type $t^{\zeta}$ for any $\zeta<1/2$ and any exploration rate $\varepsilon_{t}=t^{-\theta}$ with $\theta<1/(m+1)$; we chose $\theta=\frac{1}{2(m+1)}$ and $\zeta=1/4$ because they enable us to have an explicit formula for $\lambda_{\alpha}$, see Lemma 11 in the appendix). These high-probability lower bounds on $N_{sa}(t)$ contrast with their deterministic counterparts obtained for C-tracking in [GK16].

Concentration of visit-frequencies: Before we proceed, we make the following assumption.

Assumption 2.

$P_{\pi_{u}}$ is aperiodic. (This assumption is mild, as it is enough to have one state $\tilde{s}$ and one action $\tilde{a}$ such that $P_{\mathcal{M}}(\tilde{s}|\tilde{s},\tilde{a})>0$ for it to be satisfied. Furthermore, Assumptions 1 and 2 combined are still less restrictive than the usual ergodicity assumption, which requires that the Markov chains of all policies be ergodic.)

Under Assumptions 1 and 2, the kernel $P_{\pi_{u}}$, and consequently also $P_{\pi^{o}}$, is ergodic. Hence the iterates of $P_{\pi^{o}}$ converge to ${\bm{\omega}}^{\star}(\mathcal{M})$ at a geometric rate. Note also that the Markov chains induced by playing either of our sampling rules are non-homogeneous, with a sequence of kernels $(P_{\pi_{t}})_{t\geq 1}$ that is history-dependent. To tackle this difficulty, we adapt a powerful ergodic theorem (see Proposition 12 in the appendix) from [FMP11], originally derived for adaptive Markov Chain Monte Carlo (MCMC) algorithms, to obtain the following result.

Theorem 7.

Using C-Navigation or D-Navigation, we have $\lim_{t\to\infty}{\bm{N}}(t)/t=\omega^{\star}(\mathcal{M})$ almost surely.

Lemma 4 and Theorem 7 combined prove Theorem 6 (i) in a straightforward fashion. The proof of Theorem 6 (ii) is more involved. Again, we adapt the proof method from [FMP11] to derive a finite-time version of Proposition 12, which results in the following proposition.

Proposition 8.

Under C-Navigation, for all $\xi>0$, there exists a time $T_{\xi}$ such that for all $T\geq T_{\xi}$, all $t\geq T^{3/4}$ and all functions $f:\mathcal{Z}\to\mathbb{R}^{+}$, we have:

\mathbb{P}\bigg(\bigg|\frac{\sum_{k=1}^{t}f(s_{k},a_{k})}{t}-\mathbb{E}_{(s,a)\sim\omega^{\star}}[f(s,a)]\bigg|\geq K_{\xi}\lVert f\rVert_{\infty}\,\xi\ \bigg|\ \mathcal{C}_{T}^{1}(\xi)\bigg)\leq 2\exp\big(-t\xi^{2}\big),

where $\xi\mapsto K_{\xi}$ is a mapping with values in $(1,\infty)$ such that $\limsup_{\xi\to 0}K_{\xi}<\infty$.

Now define $\mathcal{C}_{T}^{2}(\xi)=\bigcap_{t=T^{3/4}}^{T}\big(\big|{\bm{N}}(t)/t-\omega^{\star}(\mathcal{M})\big|\leq K_{\xi}\,\xi\big)$. Then Proposition 8 and Eq. (14) combined imply that, for $T$ large enough, the event $\mathcal{C}_{T}^{1}(\xi)\cap\mathcal{C}_{T}^{2}(\xi)$ holds w.h.p., so that the expected stopping time is finite on the complementary event: $\mathbb{E}[\tau_{\delta}\mathbbm{1}\{\overline{\mathcal{C}_{T}^{1}(\xi)}\cup\overline{\mathcal{C}_{T}^{2}(\xi)}\}]<\infty$. Now, given the asymptotic shape of the thresholds, $\beta_{r}(t,\delta/2)+\beta_{p}(t,\delta/2)\underset{\delta\to 0}{\sim}2\log(1/\delta)$, we may informally write:

\mathbb{E}\big[\tau_{\delta}\mathbbm{1}\{\mathcal{C}_{T}^{1}(\xi)\cap\mathcal{C}_{T}^{2}(\xi)\}\big]\underset{\delta\to 0}{\preceq}2\log(1/\delta)\sup_{(\mathcal{M}',\omega')\in B_{\xi}}U\big(\mathcal{M}',\omega'\big),

where $B_{\xi}=\{(\mathcal{M}',\omega'):\lVert\mathcal{M}'-\mathcal{M}\rVert\leq\xi,\ \lVert\omega'-\omega^{\star}(\mathcal{M})\rVert_{\infty}\leq K_{\xi}\,\xi\}$. Taking the limits as $\delta$ and $\xi$ go to zero, respectively, concludes the proof.

6 Comparison with MOCA

Recall from Lemma 3 and Theorem 6 that our sample complexity bound can be written as:

\mathcal{O}\bigg(\inf_{\omega\in\Omega(\mathcal{M})}\ \max_{(s,a):a\neq\pi^{\star}(s)}\ \frac{1+\mathrm{Var}_{p(s,a)}[V_{\mathcal{M}}^{\star}]}{\omega_{sa}\Delta_{sa}^{2}}+\frac{1}{\min_{s}\omega_{s,\pi^{\star}(s)}\,\Delta_{\min}^{2}(1-\gamma)^{3}}\bigg)\log(1/\delta).

Hence, in the asymptotic regime $\delta\to 0$, MDP-NaS finds the optimal way to balance exploration between state-action pairs proportionally to their hardness: $(1+\mathrm{Var}_{p(s,a)}[V_{\mathcal{M}}^{\star}])/\Delta_{sa}^{2}$ for sub-optimal pairs (resp. $1/[\Delta_{\min}^{2}(1-\gamma)^{3}]$ for optimal pairs).

After a preprint of this work was published, [WSJ21] proposed MOCA, an algorithm for BPI in the episodic setting. MOCA has the advantage of treating the more general case of $\varepsilon$-optimal policy identification, with finite-time guarantees on its sample complexity. The two papers have different and complementary objectives, but one can compare with their bound for exact policy identification, i.e., when $\varepsilon<\varepsilon^{\star}$, in the asymptotic regime $\delta\to 0$. ($\varepsilon^{\star}$ is defined in their work as the threshold value for $\varepsilon$ such that the only $\varepsilon$-optimal policy is the best policy; however, when the objective is to find the optimal policy, it is not clear from their paper how one can determine such a value of $\varepsilon$ without prior knowledge of the ground-truth instance.) In this case, by carefully inspecting the proofs of [WSJ21], we see that MOCA's sample complexity can be written as (the $\log^{3}(1/\delta)$ term below comes from the sample complexity of their sub-routine FindExplorableSets):

\mathcal{O}\bigg(\sum_{h=1}^{H}\ \inf_{\omega\in\Omega(\mathcal{M})}\ \max_{s,a\neq\pi^{\star}(s)}\frac{H^{2}}{\omega_{sa}(h)\Delta_{sa}(h)^{2}}\bigg)\log(1/\delta)+\frac{\textrm{polylog}(S,A,H,\log(\varepsilon^{\star}))\log^{3}(1/\delta)}{\varepsilon^{\star}}

where $H$ is the horizon and $\Delta_{sa}(h)$ is the sub-optimality gap of $(s,a)$ at time step $h$. We make the following remarks about the bounds above:

  1. MOCA only pays the cost of the worst-case visitation probability multiplied by the gap of the corresponding state, $\min_{s,a\neq\pi^{\star}(s)}\omega_{sa}\Delta_{sa}^{2}$. Instead, MDP-NaS pays a double worst-case cost of the smallest visitation probability multiplied by the minimum gap, $\min_{s}\omega_{s,\pi^{\star}(s)}\Delta_{\min}^{2}$. As pointed out by [WSJ21], the former scaling is better, especially when the state where the minimum gap is achieved is different from the one that is hardest to reach. This is however an artefact of the upper bound in Lemma 3 that MDP-NaS uses as a proxy for the characteristic time. Using a more refined bound, or an optimization oracle that solves the best-response problem, one can remove this double worst-case dependency.

  2. The sample complexity of MOCA divided by $\log(1/\delta)$ still explodes when $\delta$ goes to zero, contrary to MDP-NaS's bound.

  3. The sample complexity of MDP-NaS is variance-sensitive for sub-optimal state-action pairs, while MOCA's bound depends on a worst-case factor of $H^{2}$. Indeed, as the rewards are in $[0,1]$, we always have $V_{\mathcal{M}}^{\star}\leq H$ and $\mathrm{Var}_{p(s,a)}[V_{\mathcal{M}}^{\star}]\leq H^{2}$ in the episodic setting.

We conclude this section by noting that MDP-NaS has a simple design and can easily be implemented.

7 Conclusion

To the best of our knowledge, this paper is the first to propose an algorithm with instance-dependent sample complexity for Best Policy Identification (BPI) in the full online setting. Our results are encouraging as they show: 1) how the navigation constraints of online RL impact the difficulty of learning a good policy, compared to the more relaxed sampling schemes of Multi-Armed Bandits and MDPs with a generative model; 2) that, provided access to an optimization oracle that solves the information-theoretical lower bound (resp. some convex relaxation of the lower bound), asymptotically optimal (resp. near-optimal) sample complexity is still possible through adaptive control of the trajectory. This opens up exciting new research questions. First, it is intriguing to understand how the mixing times, and not just the stationary distributions, of the Markov chains induced by policies impact the sample complexity of BPI in the moderate confidence regime. A second direction would be to extend our contributions to the problem of finding an $\varepsilon$-optimal policy, which is of more practical interest than identifying the best policy.

References

  • [AHKS20] Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. Pc-pg: Policy cover directed exploration for provable policy gradient learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13399–13412. Curran Associates, Inc., 2020.
  • [AJO09] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2009.
  • [AKY20] Alekh Agarwal, Sham Kakade, and Lin F. Yang. Model-based reinforcement learning with a generative model is minimax optimal. volume 125 of Proceedings of Machine Learning Research, pages 67–83. PMLR, 09–12 Jul 2020.
  • [AMK13] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3):325–349, 2013.
  • [BK97] Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.
  • [CT06] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006.
  • [DKM19] Rémy Degenne, Wouter M. Koolen, and Pierre Ménard. Non-asymptotic pure exploration by solving games. In NeurIPS, 2019.
  • [DMKV21] O. D. Domingues, Pierre Ménard, E. Kaufmann, and Michal Valko. Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. In ALT, 2021.
  • [DTC13] Thomas Dietterich, Majid Taleghan, and Mark Crowley. Pac optimal planning for invasive species management : Improved exploration for reinforcement learning from simulator-defined mdps. 01 2013.
  • [EDMM06] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 06 2006.
  • [FCG10] Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning and kullback-leibler divergence. In Allerton Conference on Communication, Control, and Computing, pages 115–122, Sep. 2010.
  • [Fie94] C. Fiechter. Efficient reinforcement learning. In COLT ’94, 1994.
  • [FMP11] G. Fort, É. Moulines, and P. Priouret. Convergence of adaptive and interacting markov chain monte carlo algorithms. Annals of Statistics, 39:3262–3289, 2011.
  • [FYAY21] Fei Feng, W. Yin, Alekh Agarwal, and Lin F. Yang. Provably correct optimization and exploration with non-linear policies. ArXiv, abs/2103.11559, 2021.
  • [GK16] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 998–1027, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.
  • [GK19] A. Garivier and Emilie Kaufmann. Non-asymptotic sequential tests for overlapping hypotheses and application to near optimal arm identification in bandit models. arXiv preprint, Statistics Theory, arXiv:1905.03495, 2019.
  • [HH80] P. Hall and C.C. Heyde. 2 - inequalities and laws of large numbers. In P. Hall and C.C. Heyde, editors, Martingale Limit Theory and its Application, Probability and Mathematical Statistics: A Series of Monographs and Textbooks, pages 13–50. Academic Press, 1980.
  • [JKM+20] Anders Jonsson, Emilie Kaufmann, Pierre Menard, Omar Darwiche Domingues, Edouard Leurent, and Michal Valko. Planning in markov decision processes with gap-dependent sample complexity. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1253–1263. Curran Associates, Inc., 2020.
  • [Kak03] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London, England, 2003.
  • [KCG16] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.
  • [KK18] Emilie Kaufmann and Wouter M. Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. arXiv preprint, arXiv:1811.11419, 2018.
  • [KMDD+21] Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, and Michal Valko. Adaptive reward-free exploration. In Vitaly Feldman, Katrina Ligett, and Sivan Sabato, editors, Proceedings of the 32nd International Conference on Algorithmic Learning Theory, volume 132 of Proceedings of Machine Learning Research, pages 865–891. PMLR, 16–19 Mar 2021.
  • [KMN99] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large markov decision processes. In Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’99, page 1324–1331, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
  • [KS98] Michael Kearns and Satinder Singh. Finite-sample convergence rates for q-learning and indirect algorithms. In Proceedings of the 11th International Conference on Neural Information Processing Systems, NIPS’98, page 996–1002, Cambridge, MA, USA, 1998. MIT Press.
  • [LCC+21] Gen Li, Changxiao Cai, Yuxin Chen, Yuantao Gu, Yuting Wei, and Yuejie Chi. Is q-learning minimax optimal? a tight sample complexity analysis, 2021.
  • [LPW06] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov chains and mixing times. American Mathematical Society, 2006.
  • [LR85] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
  • [LS20] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
  • [LWC+20a] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. arXiv preprint, arXiv:2005.12900, 2020.
  • [LWC+20b] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction, 2020.
  • [MDJ+20] Pierre Ménard, O. D. Domingues, Anders Jonsson, E. Kaufmann, Edouard Leurent, and Michal Valko. Fast active learning for pure exploration in reinforcement learning. ArXiv, abs/2007.13442, 2020.
  • [MP21] Aymen Al Marjani and Alexandre Proutiere. Adaptive sampling for best policy identification in markov decision processes. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 7459–7468. PMLR, 18–24 Jul 2021.
  • [Rus16] D. Russo. Simple bayesian algorithms for best arm identification. In Proceedings of the 29th Conference On Learning Theory, 2016.
  • [SB18] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
  • [Sch68] P. Schweitzer. Perturbation theory and finite markov chains. Journal of Applied Probability, 5:401–413, 1968.
  • [SL08] Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
  • [SWW+18] Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving markov decision processes with a generative model. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5186–5196. Curran Associates, Inc., 2018.
  • [Sze10] Csaba Szepesvari. Algorithms for Reinforcement Learning. Morgan and Claypool Publishers, 2010.
  • [Wan17] Mengdi Wang. Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear (Sometimes Sublinear) Running Time. arXiv e-prints, page arXiv:1704.01869, April 2017.
  • [WSJ21] Andrew Wagenmaker, Max Simchowitz, and Kevin Jamieson. Beyond No Regret: Instance-Dependent PAC Reinforcement Learning. arXiv e-prints, page arXiv:2108.02717, August 2021.
  • [ZCA21] Andrea Zanette, Ching-An Cheng, and Alekh Agarwal. Cautiously optimistic policy optimization and exploration with linear function approximation. ArXiv, abs/2103.12923, 2021.
  • [ZKB19] Andrea Zanette, Mykel J Kochenderfer, and Emma Brunskill. Almost horizon-free structure-aware best policy identification with a generative model. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5625–5634. Curran Associates, Inc., 2019.

Appendix A Symbols

Table 1: Additional notations used in the appendix

m : Maximum length of shortest paths, max_{(s,s')∈S²} min{n ≥ 1 : ∃π : S → A, P_π^n(s,s') > 0}
P_{π_u} : Transition kernel of the uniform policy
ω_u : Stationary distribution of P_{π_u}
r : min{ℓ ≥ 1 : ∀(z,z')∈Z², P_{π_u}^ℓ(z,z') > 0}
σ_u : min_{z,z'∈Z} P_{π_u}^r(z,z') / ω_u(z')
η_1 : min{P_{π_u}(z,z') | (z,z')∈Z², P_{π_u}(z,z') > 0}
η_2 : min{P_{π_u}^n(z,z') | (z,z')∈Z², n∈[|1,m+1|], P_{π_u}^n(z,z') > 0}
η : Communication parameter η_1 η_2
ω⋆ : Oracle weights, ω⋆ ≜ argmax_{ω∈Ω(M)} U(M,ω)
π^o : Oracle policy, π^o(a|s) ≜ ω⋆_{sa} / Σ_{a'∈A} ω⋆_{sa'}
π_t^o : π^o(M̂_t)
π̄_t^o : Σ_{j=1}^t π^o(M̂_j)/t
Z_{π^o} : (I − P_{π^o} + 1 ω⋆ᵀ)^{-1}
κ_M : Condition number ‖Z_{π^o}‖_∞
P_t : Kernel of the policy π_t
ω_t : Stationary distribution of P_t
C_t, ρ_t : Constants such that ∀n ≥ 1, ‖P_t^n(z_0,·) − ω_tᵀ‖_∞ ≤ C_t ρ_t^n
L_t : C_t(1 − ρ_t)^{-1}

Appendix B Experiments

In this section, we test our algorithm on two small examples. The first instance is an ergodic MDP with 5 states, 5 actions per state, and a discount factor γ = 0.7. The rewards of each state-action pair come from independent Bernoulli distributions with means sampled from the uniform distribution 𝒰([0,1]). The transition kernels were generated from a Dirichlet distribution 𝒟(1,…,1). The second instance is the classical RiverSwim from [SL08], which is communicating but not ergodic. The instance we used has 5 states and 2 actions, {LEFT, RIGHT}, with deterministic transitions and a discount factor γ = 0.95. Rewards are null everywhere except in states {1,5}, where they are Bernoulli with respective means r(1, LEFT) = 0.05 and r(5, RIGHT) = 1. We fix a confidence level δ = 0.1 and, for each of these MDPs, we run 30 Monte-Carlo simulations of MDP-NaS with either C-Navigation or D-Navigation. For computational efficiency, we note that the empirical oracle policy does not change significantly after collecting a single sample, so we only update it every 10^4 time steps (this period was chosen to save computation time, knowing that for such MDPs algorithms usually require at least 10^6 samples to return a reasonably good policy).
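The following Python sketch (ours, purely for illustration; the function names and the exact RiverSwim variant are assumptions, not the code used in the experiments) shows one way to generate the two instances described above.

import numpy as np

def random_ergodic_mdp(S=5, A=5, gamma=0.7, seed=0):
    # Dirichlet(1,...,1) transition kernels and Bernoulli rewards with uniform means
    rng = np.random.default_rng(seed)
    P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over next states
    r = rng.uniform(0.0, 1.0, size=(S, A))       # means of the Bernoulli reward distributions
    return P, r, gamma

def riverswim(S=5, gamma=0.95):
    # Deterministic-transition RiverSwim-like chain; actions: 0 = LEFT, 1 = RIGHT
    P = np.zeros((S, 2, S))
    for s in range(S):
        P[s, 0, max(s - 1, 0)] = 1.0             # LEFT moves one state to the left
        P[s, 1, min(s + 1, S - 1)] = 1.0         # RIGHT moves one state to the right
    r = np.zeros((S, 2))
    r[0, 0] = 0.05                               # small Bernoulli mean in the leftmost state
    r[S - 1, 1] = 1.0                            # large Bernoulli mean in the rightmost state
    return P, r, gamma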
First, we seek to check whether the frequencies of state-action pair visits converge to their oracle weights, as stated in Theorem 7. Figure 2 shows the relative distance, in log scale, between the vector of empirical frequencies N(t)/t and the oracle allocation ω⋆. The shaded area represents the 10% and 90% quantiles. We see that the relative distance steadily decreases with time, indicating that the visit frequencies of both D-Navigation and C-Navigation converge to the oracle allocation. We also note that the D-Navigation rule exhibits faster convergence than the C-Navigation rule.

Figure 2: Relative distance in log scale: log10(max_{s,a} |N_sa(t)/t − ω⋆_sa| / ω⋆_sa). Left: ergodic MDP with S = A = 5. Right: RiverSwim with S = 5, A = 2.
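For reference, the quantity plotted in Figure 2 can be computed as follows (a small helper of ours; N_t is the array of visit counts at time t and omega_star the oracle allocation, both of shape S x A).

import numpy as np

def log_relative_distance(N_t, t, omega_star):
    # log10 of max_{s,a} |N_sa(t)/t - omega*_sa| / omega*_sa
    return np.log10(np.max(np.abs(N_t / t - omega_star) / omega_star))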

Next, we compare our algorithm with Variance-Reduced Q-Learning (VRQL) [LWC+20b], a variant of classical Q-learning with faster convergence rates. VRQL finds an ε-estimate Q̂ of the Q-function Q⋆ by following a fixed sampling rule, referred to as the behavior policy π_b, and updating its estimate of the Q-function via temporal-difference learning. VRQL does not have a stopping rule, but is guaranteed to yield an estimate such that ‖Q̂ − Q⋆‖_∞ ≤ ε with probability at least 1 − δ after using M(N + t_epoch) samples, where

M=c_{3}\log\big(1/\varepsilon^{2}(1-\gamma)^{2}\big),
t_{\textrm{epoch}}=\frac{c_{2}}{\mu_{\min}}\Big(\frac{1}{(1-\gamma)^{3}}+\frac{t_{\mathrm{mix}}}{1-\gamma}\Big)\log\big(1/(1-\gamma)^{2}\varepsilon\big)\log\big(SA/\delta\big),
N=\frac{c_{1}}{\mu_{\min}}\Big(\frac{1}{(1-\gamma)^{3}\min(1,\varepsilon^{2})}+t_{\mathrm{mix}}\Big)\log\big(SA\,t_{\textrm{epoch}}/\delta\big),

where μ_min (resp. t_mix) is the minimum state-action occupancy (resp. the mixing time) of π_b, and c_1, c_2, c_3 are large enough universal constants. We run VRQL with c_1 = c_2 = c_3 = 10 and ε ≤ Δ_min (since the goal is to identify the best policy), and use the uniform policy as the sampling rule (in the absence of prior knowledge about the MDP, the uniform policy is a reasonable choice to maximize μ_min). We plug these values into the equations above and compute the resulting sample complexity of VRQL. Table 2 compares the sample complexities of MDP-NaS and VRQL; MDP-NaS performs much better than VRQL.

                      MDP-NaS        VRQL
Small Ergodic MDP     8 × 10^5       2.5 × 10^8
RIVER-SWIM            2.6 × 10^6     3.3 × 10^9

Table 2: Average sample complexity of MDP-NaS (D-Navigation) vs. deterministic sample complexity of VRQL, with δ = 0.1.
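The VRQL numbers in Table 2 follow from evaluating the formulas above; the sketch below (ours) mirrors that computation, with mu_min, t_mix and eps supplied as instance-dependent inputs (for our comparison, eps is taken at most Delta_min and pi_b is the uniform policy).

import numpy as np

def vrql_sample_complexity(S, A, gamma, eps, delta, mu_min, t_mix, c1=10, c2=10, c3=10):
    # Deterministic sample complexity M * (N + t_epoch) of VRQL, as given above
    M = c3 * np.log(1.0 / (eps**2 * (1 - gamma) ** 2))
    t_epoch = (c2 / mu_min) * (1.0 / (1 - gamma) ** 3 + t_mix / (1 - gamma)) \
        * np.log(1.0 / ((1 - gamma) ** 2 * eps)) * np.log(S * A / delta)
    N = (c1 / mu_min) * (1.0 / ((1 - gamma) ** 3 * min(1.0, eps**2)) + t_mix) \
        * np.log(S * A * t_epoch / delta)
    return M * (N + t_epoch)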

Appendix C Sample complexity lower bound

Let Alt(M) be the set of MDPs M' such that M ≪ M' and Π⋆(M) ∩ Π⋆(M') = ∅. The information constraints are obtained by change-of-measure arguments, as in the bandit literature [LR85, KCG16]:

Lemma 9.

([MP21]) For any δ\delta-PC algorithm 𝔸\mathbb{A}, and for any Alt()\mathcal{M}^{\prime}\in\mathrm{Alt}(\mathcal{M}), we have:

s,a𝔼,𝔸[Nsa(τδ)]KL|(s,a)kl(δ,1δ).\sum_{s,a}\mathbb{E}_{\mathcal{M},\mathbb{A}}[N_{sa}(\tau_{\delta})]\textrm{KL}_{\mathcal{M}|\mathcal{M}^{\prime}}(s,a)\geq\operatorname{kl}(\delta,1-\delta). (15)

C.1 Navigation constraints: proof of Lemma 1

For all states ss,

Nτ(s)\displaystyle N_{\tau}(s) =𝟙{S1=s}+s,au=1Nτ1(s,a)𝟙{Wu=s},\displaystyle=\mathbbm{1}_{\{S_{1}=s\}}+\sum_{s^{\prime},a^{\prime}}\sum_{u=1}^{N_{\tau-1}(s^{\prime},a^{\prime})}\mathbbm{1}_{\{W_{u}=s\}},

where W_u denotes the state observed after the u-th time (s',a') has been visited. Fix (s',a'). Introduce G_t^{s',a'} = Σ_{u=1}^{N_{t−1}(s',a')} 1{W_u = s}. Observe that {W_u = s} and {N_{t−1}(s',a') > u−1} are independent. Furthermore, 𝔼_{M,𝔸}[1{W_u = s}] = p_M(s|s',a'). Hence:

𝔼,𝔸[Gτs,a]=p(s|s,a)𝔼,𝔸[Nτ1(s,a)].\mathbb{E}_{\mathcal{M},\mathbb{A}}[G_{\tau}^{s^{\prime},a^{\prime}}]=p_{\mathcal{M}}(s|s^{\prime},a^{\prime})\mathbb{E}_{\mathcal{M},\mathbb{A}}[N_{\tau-1}(s^{\prime},a^{\prime})].

Finally,

\mathbb{E}_{\mathcal{M},\mathbb{A}}[N_{\tau}(s)]=\mathbb{P}_{\mathcal{M}}[S_{1}=s]+\sum_{s^{\prime},a^{\prime}}p_{\mathcal{M}}(s|s^{\prime},a^{\prime})\mathbb{E}_{\mathcal{M},\mathbb{A}}[N_{\tau-1}(s^{\prime},a^{\prime})]. (16)

From the above equality, the lemma is proved by observing that ℙ_M[S_1 = s] ≤ 1, that 𝔼_{M,𝔸}[N_{τ−1}(s',a')] ≤ 𝔼_{M,𝔸}[N_τ(s',a')] for any (s',a'), and that 𝔼_{M,𝔸}[N_τ(s)] ≤ 𝔼_{M,𝔸}[N_{τ−1}(s)] + 1 for any s.
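Identity (16) can be checked numerically. The sketch below (ours; it uses a fixed behavior policy pi, an (S, A, S) kernel P, and a deterministic horizon tau instead of a stopping time, which is enough to illustrate the flow-conservation argument) estimates both sides by Monte-Carlo simulation.

import numpy as np

def check_navigation_identity(P, pi, s0, tau, n_runs=20000, seed=0):
    # Compare E[N_tau(s)] with P(S_1 = s) + sum_{s',a'} p(s|s',a') E[N_{tau-1}(s',a')]
    rng = np.random.default_rng(seed)
    S, A, _ = P.shape
    N_s = np.zeros(S)        # visits to each state at times 1..tau
    N_sa = np.zeros((S, A))  # visits to each state-action pair at times 1..tau-1
    for _ in range(n_runs):
        s = s0
        for t in range(1, tau + 1):
            N_s[s] += 1
            a = rng.choice(A, p=pi[s])
            if t <= tau - 1:
                N_sa[s, a] += 1
            s = rng.choice(S, p=P[s, a])
    lhs = N_s / n_runs
    rhs = (np.arange(S) == s0).astype(float) + np.einsum('ijk,ij->k', P, N_sa / n_runs)
    return lhs, rhs          # the two sides agree up to Monte-Carlo error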

C.2 Lower bound: proof of Proposition 2

By combining Lemma 9 and Lemma 1 we get the following proposition.

Proposition 10.

The expected sample complexity 𝔼,𝔸[τδ]\mathbb{E}_{\mathcal{M},\mathbb{A}}[\tau_{\delta}] of any δ\delta-PC algorithm 𝔸\mathbb{A} is larger than the value of the following optimization problem:

infn0s,ansa\displaystyle\inf_{n\geq 0}\ \sum_{s,a}n_{sa} (17)
s.t.s,|ansas,ap(s|s,a)nsa|1,\displaystyle\textrm{s.t.}\ \ \forall s,\ \Big{|}\sum_{a}n_{sa}-\sum_{s^{\prime},a^{\prime}}p_{\mathcal{M}}(s|s^{\prime},a^{\prime})n_{s^{\prime}a^{\prime}}\Big{|}\leq 1,
Alt(),s,ansaKL|(s,a)kl(δ,1δ).\displaystyle\ \ \ \ \ \ \forall\mathcal{M}^{\prime}\in\mathrm{Alt}(\mathcal{M}),\ \sum_{s,a}n_{sa}\textrm{KL}_{\mathcal{M}|\mathcal{M}^{\prime}}(s,a)\geq\operatorname{kl}(\delta,1-\delta).

In the optimization problem (17), n_sa is interpreted as the expected number of times (s,a) is visited before the stopping time. Note that the above proposition provides a lower bound for any value of δ. We can further simplify this bound when restricting our attention to asymptotic regimes where δ goes to 0. In that case, the navigation constraints (1) can be replaced by Σ_a n_sa = Σ_{s',a'} p_M(s|s',a') n_{s'a'} (by renormalizing n into n/log(1/δ) and letting δ → 0). For small δ regimes, we can hence rewrite the lower bound as follows:

lim infδ0𝔼,𝔸[τ]log(1/δ)\displaystyle\underset{\delta\to 0}{\liminf}\ \frac{\mathbb{E}_{\mathcal{M},\mathbb{A}}[\tau]}{\log(1/\delta)}\geq\ infn0s,ansa\displaystyle\inf_{n\geq 0}\ \sum_{s,a}n_{sa}
s.t.s,ansa=s,ap(s|s,a)nsa,\displaystyle\textrm{s.t.}\ \ \forall s,\ \sum_{a}n_{sa}=\sum_{s^{\prime},a^{\prime}}p_{\mathcal{M}}(s|s^{\prime},a^{\prime})n_{s^{\prime}a^{\prime}},
Alt(),s,ansaKL|(s,a)1.\displaystyle\ \ \ \ \ \ \forall\mathcal{M}^{\prime}\in\mathrm{Alt}(\mathcal{M}),\ \sum_{s,a}n_{sa}\textrm{KL}_{\mathcal{M}|\mathcal{M}^{\prime}}(s,a)\geq 1.

One can easily conclude by showing that the value of the optimization program above is equal to the one in Eq 2.
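When the supremum over Alt(M) is replaced by a finite family of alternative MDPs (for instance a discretized subset), the program above becomes a linear program. A sketch of this relaxation (ours, only meant to illustrate the structure of the constraints; it is not the characteristic time of Eq 2 itself) is given below.

import numpy as np
from scipy.optimize import linprog

def lower_bound_lp(P, KL_alt):
    # P: (S, A, S) kernel p(s'|s,a); KL_alt: (n_alt, S, A), KL_alt[i, s, a] = KL_{M|M'_i}(s, a)
    S, A, _ = P.shape
    c = np.ones(S * A)                                  # objective: sum_{s,a} n_sa
    A_eq = np.zeros((S, S * A))                         # navigation constraints, one per state
    for s in range(S):
        for sp in range(S):
            for ap in range(A):
                A_eq[s, sp * A + ap] -= P[sp, ap, s]    # - sum_{s',a'} p(s|s',a') n_{s'a'}
        for a in range(A):
            A_eq[s, s * A + a] += 1.0                   # + sum_a n_sa
    b_eq = np.zeros(S)
    A_ub = -KL_alt.reshape(len(KL_alt), S * A)          # sum_{s,a} n_sa KL_i(s,a) >= 1
    b_ub = -np.ones(len(KL_alt))
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=(0, None))
    return res.fun if res.success else np.inf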

C.3 Full definition of the terms in the upper bound

Let Var⋆_max[V⋆_M] = max_{s∈S} Var_{s'∼p_M(·|s,π⋆(s))}[V⋆_M(s')] denote the maximum variance of the value function along the trajectory of the optimal policy. The following functionals of M are defined in [MP21]:

T3()\displaystyle T_{3}(\mathcal{M}) 2Δmin2(1γ)2,\displaystyle\triangleq\frac{2}{\Delta_{\min}^{2}(1-\gamma)^{2}},
T4()\displaystyle T_{4}(\mathcal{M}) min(27Δmin2(1γ)3,max(16Varmax[V]Δmin2(1γ)2,6sp(V)4/3Δmin4/3(1γ)4/3)).\displaystyle\triangleq\min\Bigg{(}\frac{27}{\Delta_{\min}^{2}(1-\gamma)^{3}},\ \max\bigg{(}\frac{16\mathrm{Var}_{\max}^{\star}[V_{\mathcal{M}}^{\star}]}{\Delta_{\min}^{2}(1-\gamma)^{2}},\frac{6\operatorname{sp}(V_{\mathcal{M}}^{\star})^{4/3}}{\Delta_{\min}^{4/3}(1-\gamma)^{4/3}}\bigg{)}\Bigg{)}.

Then HH^{\star} is simply defined as H=S(T3()+T4())H^{\star}=S(T_{3}(\mathcal{M})+T_{4}(\mathcal{M})). Note that T4()=𝒪(1Δmin2(1γ)3)T_{4}(\mathcal{M})=\mathcal{O}\big{(}\frac{1}{\Delta_{\min}^{2}(1-\gamma)^{3}}\big{)}.
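For concreteness, here is a small helper (ours) evaluating T_3, T_4 and H⋆ from the instance-dependent quantities Δ_min, γ, the maximal variance and the span of the optimal value function.

def hardness_terms(Delta_min, gamma, var_max, span_V, S):
    # T3, T4 and H* = S * (T3 + T4), as defined above
    T3 = 2.0 / (Delta_min**2 * (1 - gamma) ** 2)
    T4 = min(
        27.0 / (Delta_min**2 * (1 - gamma) ** 3),
        max(
            16.0 * var_max / (Delta_min**2 * (1 - gamma) ** 2),
            6.0 * span_V ** (4 / 3) / (Delta_min ** (4 / 3) * (1 - gamma) ** (4 / 3)),
        ),
    )
    return T3, T4, S * (T3 + T4)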

Appendix D Sampling rule

Recall that 𝒵𝒮×𝒜\mathcal{Z}\triangleq\mathcal{S}\times\mathcal{A} denotes the set of state-action pairs. Any policy π\pi induces a Markov Chain on 𝒵\mathcal{Z} whose kernel is defined by:

Pπ((s,a),(s,a))=P(s|s,a)π(a|s).P_{\pi}((s,a),(s^{\prime},a^{\prime}))=P(s^{\prime}|s,a)\pi(a^{\prime}|s^{\prime}).
Fact 1:

Note that if any pair of states (s,s') can be connected with non-zero probability in at most m steps, then we can move between any pair of state-action pairs ((s,a),(s',a')) with non-zero probability in at most m+1 steps by playing the policy:

π~(x)={aif x=s,πs(x)otherwise\displaystyle\tilde{\pi}(x)=\begin{cases}a^{\prime}\quad\textrm{if $x=s^{\prime}$,}\\ \pi_{\hookrightarrow s^{\prime}}(x)\quad\textrm{otherwise}\end{cases}

where π_{↪s'} is the policy corresponding to a shortest path to s'. Finally, for the sake of simplicity, we write P_t ≜ P_{π_t}.
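The shortest-path policy π_{↪s'} of Fact 1 can be computed by a backward breadth-first search on the support graph of the transition kernel. The sketch below (ours; it ignores the n ≥ 1 requirement for the pair s = s', which only matters for shortest return times) also returns the distances, whose maximum over all pairs recovers the parameter m.

from collections import deque
import numpy as np

def shortest_path_policy(P, s_target):
    # For each state, an action lying on a shortest positive-probability path to s_target
    S, A, _ = P.shape
    dist = np.full(S, np.inf)
    policy = np.zeros(S, dtype=int)
    dist[s_target] = 0
    queue = deque([s_target])
    while queue:                                  # backward BFS on the support graph of P
        s = queue.popleft()
        for sp in range(S):
            for a in range(A):
                if P[sp, a, s] > 0 and dist[sp] == np.inf:
                    dist[sp] = dist[s] + 1
                    policy[sp] = a                # first move of a shortest path from sp
                    queue.append(sp)
    return policy, dist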

D.1 Almost sure forced exploration: proof of Lemma 4

Consider the event (z𝒵,M>0,t1,Nz(t)<M)\mathcal{E}\triangleq\big{(}\exists z\in\mathcal{Z},\ \exists M>0,\ \forall t\geq 1,\ N_{z}(t)<M\big{)}. Observe that =z𝒵z\mathcal{E}=\bigcup\limits_{z\in\mathcal{Z}}\mathcal{E}_{z}, where for z𝒵z\in\mathcal{Z}, z(M>0,t1,Nz(t)<M)\mathcal{E}_{z}\triangleq\big{(}\exists M>0,\ \forall t\geq 1,\ N_{z}(t)<M\big{)}. We will prove that (z)=0\mathbb{P}(\mathcal{E}_{z^{\prime}})=0 for all zz^{\prime}, which implies the desired result. From Fact 1 above, we have:

(z,z)𝒵2,r[|1,m+1|],πΠ,Pπr(z,z)>0,\displaystyle\forall(z,z^{\prime})\in\mathcal{Z}^{2},\ \exists r\in[|1,m+1|],\ \exists\pi\in\Pi,\ P_{\pi}^{r}(z,z^{\prime})>0, (18)

where PπrP_{\pi}^{r} is the rr-th power of the transition matrix induced by policy π\pi. Therefore:

η=minz,zmax1rm+1πΠPπr(z,z)>0\eta=\underset{z,z^{\prime}}{\min}\max_{\begin{subarray}{c}1\leq r\leq m+1\\ \pi\in\Pi\end{subarray}}P_{\pi}^{r}(z,z^{\prime})>0

is well defined. Fix z𝒵z\in\mathcal{Z} and let π,rz\pi,r_{z} be a policy and an integer satisfying the property (18) above for the pair (z,z)(z,z^{\prime}). Observe that:

PtεtPπuεtAPπ\displaystyle P_{t}\geq\varepsilon_{t}P_{\pi_{u}}\geq\frac{\varepsilon_{t}}{A}P_{\pi}

where the matrix inequality is entry-wise. Now define the stopping times (τ_k(z))_{k≥1} at which the agent reaches state-action z for the k-th time (we restrict our attention to departure state-action pairs z that are visited infinitely often; such pairs always exist, therefore τ_k(z) is well defined). Then:

(z|(πt)t1,(τk(z))k1)\displaystyle\mathbb{P}\big{(}\mathcal{E}_{z^{\prime}}|\ (\pi_{t})_{t\geq 1},\ (\tau_{k}(z))_{k\geq 1}\big{)} (N1,kN,Xτk(z)+rzz|(πt)t1,(τk(z))k1)\displaystyle\leq\mathbb{P}\big{(}\exists N\geq 1,\forall k\geq N,X_{\tau_{k}(z)+r_{z}}\neq z^{\prime}\ |\ (\pi_{t})_{t\geq 1},\ (\tau_{k}(z))_{k\geq 1}\big{)}
N=1k=N(Xτk(z)+rzz|τk(z),(πt)t[|τk(z)+1,τk(z)+rz|])\displaystyle\leq\sum_{N=1}^{\infty}\prod_{k=N}^{\infty}\mathbb{P}\big{(}X_{\tau_{k}(z)+r_{z}}\neq z^{\prime}\ |\ \tau_{k}(z),\ (\pi_{t})_{t\in[|\tau_{k}(z)+1,\tau_{k}(z)+r_{z}|]}\big{)}
=N=1k=N[1(t=τk(z)+1τk(z)+rzPπt)(z,z)]\displaystyle=\sum_{N=1}^{\infty}\prod_{k=N}^{\infty}\bigg{[}1-\bigg{(}\prod_{t=\tau_{k}(z)+1}^{\tau_{k}(z)+r_{z}}P_{\pi_{t}}\bigg{)}(z,z^{\prime})\bigg{]}
N=1k=N[1(t=τk(z)+1τk(z)+rzεtAPπ)(z,z)]\displaystyle\leq\sum_{N=1}^{\infty}\prod_{k=N}^{\infty}\bigg{[}1-\bigg{(}\prod_{t=\tau_{k}(z)+1}^{\tau_{k}(z)+r_{z}}\frac{\varepsilon_{t}}{A}P_{\pi}\bigg{)}(z,z^{\prime})\bigg{]}
N=1k=N[1ηArzt=τk(z)+1τk(z)+rzεt]\displaystyle\leq\sum_{N=1}^{\infty}\prod_{k=N}^{\infty}\bigg{[}1-\frac{\eta}{A^{r_{z}}}\prod_{t=\tau_{k}(z)+1}^{\tau_{k}(z)+r_{z}}\varepsilon_{t}\bigg{]}
N=1k=N[1ηAm+1t=τk(z)+1τk(z)+m+1εt].\displaystyle\leq\sum_{N=1}^{\infty}\prod_{k=N}^{\infty}\bigg{[}1-\frac{\eta}{A^{m+1}}\prod_{t=\tau_{k}(z)+1}^{\tau_{k}(z)+m+1}\varepsilon_{t}\bigg{]}\;.

The second inequality comes from a union bound and the strong Markov property (sometimes phrased as: Markov chains start afresh after stopping times). The last inequality comes from the fact that r_z ≤ m+1 and ε_t ≤ 1. Now observe that the inequality above holds for all realizations of the sequence (π_t)_{t≥1}. Therefore, integrating this inequality over all possible sequences of policies yields:

z𝒵,(z|(τk(z))k1)N=1k=N[1ηAm+1t=τk(z)+1τk(z)+m+1εt].\displaystyle\forall z\in\mathcal{Z},\ \mathbb{P}\big{(}\mathcal{E}_{z^{\prime}}|\ (\tau_{k}(z))_{k\geq 1}\big{)}\leq\sum_{N=1}^{\infty}\prod_{k=N}^{\infty}\bigg{[}1-\frac{\eta}{A^{m+1}}\prod_{t=\tau_{k}(z)+1}^{\tau_{k}(z)+m+1}\varepsilon_{t}\bigg{]}\;.

We can already see that if state-action z is visited "frequently enough" (τ_k(z) ∼ c·k for some constant c), then the right-hand side above is zero. Since we know that at least one state-action pair z is visited frequently enough, we consider the product over all state-action pairs z of the probabilities above:

z𝒵(z|(τk(z))k1)\displaystyle\prod_{z\in\mathcal{Z}}\mathbb{P}\bigg{(}\mathcal{E}_{z^{\prime}}|\ (\tau_{k}(z))_{k\geq 1}\bigg{)} (N1,,NSA)()SAz𝒵k=Nz[1ηAm+1t=τk(z)+1τk(z)+m+1εt]\displaystyle\leq\sum_{(N_{1},\ldots,N_{SA})\in(\mathbb{N}^{\star})^{SA}}\ \prod_{z\in\mathcal{Z}}\prod_{k=N_{z}}^{\infty}\bigg{[}1-\frac{\eta}{A^{m+1}}\prod_{t=\tau_{k}(z)+1}^{\tau_{k}(z)+m+1}\varepsilon_{t}\bigg{]} (19)
(N1,,NSA)a(N1,,NSA).\displaystyle\triangleq\sum_{(N_{1},\ldots,N_{SA})}a_{(N_{1},\ldots,N_{SA})}\;.

We will now show that a(N1,,NSA)=0a_{(N_{1},\ldots,N_{SA})}=0 for all tuples (N1,,NSA)(N_{1},\ldots,N_{SA}):

a(N1,,NSA)\displaystyle a_{(N_{1},\ldots,N_{SA})} z𝒵k=maxzNz[1ηAm+1t=τk(z)+1τk(z)+m+1εt]\displaystyle\leq\prod_{z\in\mathcal{Z}}\prod_{k=\max\limits_{z}N_{z}}^{\infty}\bigg{[}1-\frac{\eta}{A^{m+1}}\prod_{t=\tau_{k}(z)+1}^{\tau_{k}(z)+m+1}\varepsilon_{t}\bigg{]}
=k=maxzNzz𝒵[1ηAm+1t=τk(z)+1τk(z)+m+1εt].\displaystyle=\prod_{k=\max\limits_{z}N_{z}}^{\infty}\prod_{z\in\mathcal{Z}}\bigg{[}1-\frac{\eta}{A^{m+1}}\prod_{t=\tau_{k}(z)+1}^{\tau_{k}(z)+m+1}\varepsilon_{t}\bigg{]}\;.

Now observe that for all k1k\geq 1 there exists zk𝒵z_{k}\in\mathcal{Z} such that τk(zk)SAk\tau_{k}(z_{k})\leq SAk, i.e., at least one state-action has been visited kk times before time step SAkSAk. For that particular choice of zkz_{k} and since (εt)t1(\varepsilon_{t})_{t\geq 1} is decreasing, we get:

a(N1,,NSA)\displaystyle a_{(N_{1},\ldots,N_{SA})} k=maxzNz[1ηAm+1t=τk(zk)+1τk(zk)+m+1εt]\displaystyle\leq\prod_{k=\max\limits_{z}N_{z}}^{\infty}\bigg{[}1-\frac{\eta}{A^{m+1}}\prod_{t=\tau_{k}(z_{k})+1}^{\tau_{k}(z_{k})+m+1}\varepsilon_{t}\bigg{]}
k=maxzNz[1ηAm+1t=SA.k+1SA.k+m+1εt].\displaystyle\leq\prod_{k=\max\limits_{z}N_{z}}^{\infty}\bigg{[}1-\frac{\eta}{A^{m+1}}\prod_{t=SA.k+1}^{SA.k+m+1}\varepsilon_{t}\bigg{]}.

For the choice ε_t = t^{−1/(m+1)}, the right-hand side above is zero: indeed, ∏_{t=SA·k+1}^{SA·k+m+1} ε_t ≥ (SA·k+m+1)^{−1} ≥ c/k for some constant c > 0, so the logarithm of each infinite product is bounded by −(ηc/A^{m+1}) Σ_{k≥N} 1/k = −∞. To sum up, for all realizations of (τ_k(z))_{z∈Z,k≥1}:

z𝒵(z|(τk(z))k1)=0.\displaystyle\prod_{z\in\mathcal{Z}}\mathbb{P}\bigg{(}\mathcal{E}_{z^{\prime}}|\ (\tau_{k}(z))_{k\geq 1}\bigg{)}=0\;.

Therefore, for all zz^{\prime}: (z)=0\mathbb{P}\big{(}\mathcal{E}_{z^{\prime}}\big{)}=0 and consequently: ()=0\mathbb{P}(\mathcal{E})=0.
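A quick numerical illustration of the last step (ours; the parameter values are arbitrary): with ε_t = t^{-1/(m+1)}, each factor behaves like 1 − c/k, so the log of the partial products decreases like −c·log(K) and the infinite product is zero.

import numpy as np

def log_tail_product(eta=0.5, A=2, m=1, SA=4, N=1, K=10**6):
    # log of prod_{k=N}^{K} [1 - (eta/A^{m+1}) * prod_{t=SA*k+1}^{SA*k+m+1} t^{-1/(m+1)}]
    k = np.arange(N, K + 1, dtype=float)
    inner = np.ones_like(k)
    for l in range(1, m + 2):                     # the m+1 consecutive exploration factors
        inner *= (SA * k + l) ** (-1.0 / (m + 1))
    return np.log1p(-(eta / A ** (m + 1)) * inner).sum()   # diverges to -infinity as K grows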

D.2 Minimal exploration rate for communicating MDPs

Indeed, if the agent visits state S at time k, then the last S−1 transitions before k must have been to the right, i.e., ℙ(s_k = S) ≤ ∏_{j=k−S+1}^{k−1} ε_j ≤ (ε_{k−S+1})^{S−1}. Therefore 𝔼[N_S(t)] ≤ Σ_{k=S}^{t} (k−S+1)^{−α(S−1)}. In particular, this implies that for α > 1/(S−1), limsup_{t→∞} 𝔼[N_S(t)] = M < ∞. Therefore, using the reverse Fatou lemma and Markov's inequality:

\mathbb{P}(\forall t\geq 1,\ N_{S}(t)\leq 2M)=\mathbb{E}\Big[\limsup_{t\to\infty}\prod_{k=1}^{t}\mathbbm{1}\{N_{S}(k)\leq 2M\}\Big]
\geq\limsup_{t\to\infty}\mathbb{E}\Big[\prod_{k=1}^{t}\mathbbm{1}\{N_{S}(k)\leq 2M\}\Big]=\limsup_{t\to\infty}\mathbb{E}\big[\mathbbm{1}\{N_{S}(t)\leq 2M\}\big]\geq\frac{1}{2}.

D.3 Minimal exploration rate for ergodic MDPs

This is a consequence of Proposition 2 [BK97], stating that there exist c1,c2,C>0c_{1},c_{2},C>0 such that for all ss and tt large enough, ,𝔸[Ns(t)>c1t]1Cec2t\mathbb{P}_{\mathcal{M},\mathbb{A}}[N_{s}(t)>c_{1}t]\geq 1-Ce^{-c_{2}t}. A union bound yields: ,𝔸[s,Ns(t)>c1t]1CSec2t\mathbb{P}_{\mathcal{M},\mathbb{A}}[\forall s,N_{s}(t)>c_{1}t]\geq 1-CSe^{-c_{2}t}. To extend this result to the numbers of visits at the various (state, action) pairs, we can derive a lower bound on Nsa(t)N_{sa}(t) given that Ns(t)>c1tN_{s}(t)>c_{1}t by observing that a worst scenario (by monotonicity of εs\varepsilon_{s}) occurs when ss is visited only in the c1tc_{1}t rounds before tt. We get 𝔼[Nsa(t)|Ns(t)>c1t]c3t(1α)\mathbb{E}[N_{sa}(t)|N_{s}(t)>c_{1}t]\geq c_{3}t^{(1-\alpha)}. Remarking that Nsa(t+1)Ns(t)εtN_{sa}(t+1)-N_{s}(t)\varepsilon_{t} is a sub-martingale with bounded increments, standard concentration arguments then imply that ,𝔸[s,a,Nsa(t)>c32t(1α)]ϕ(t)\mathbb{P}_{\mathcal{M},\mathbb{A}}[\forall s,a,N_{sa}(t)>{\frac{c_{3}}{2}}t^{(1-\alpha)}]\geq\phi(t), where ϕ(t)1\phi(t)\to 1. Next, define the random variable Zt=s,a𝟙{Nsa(t)>c32t(1α)}Z_{t}=\prod_{s,a}\mathds{1}\{N_{sa}(t)>{\frac{c_{3}}{2}}t^{(1-\alpha)}\}. Applying the reverse Fatou lemma, we get 1=limsupt𝔼[Zt]𝔼[limsuptZt]1=\lim\sup_{t}\mathbb{E}[Z_{t}]\leq\mathbb{E}[\lim\sup_{t}Z_{t}]. From there, we directly deduce (by monotonicity of tNsa(t)t\mapsto N_{sa}(t)) that a.s. limtNsa(t)=\lim_{t\to\infty}N_{sa}(t)=\infty.

D.4 High probability forced exploration

Lemma 11.

Denote by τk(z)\tau_{k}(z) the k-th time the agent visits the state-action pair zz. Under C-Navigation with exploration rate εt=t12(m+1)\varepsilon_{t}=t^{-\frac{1}{2(m+1)}} we have: for all α(0,1)\alpha\in(0,1), there exists a parameter η>0\eta>0 that only depends on \mathcal{M} such that:

(z𝒵,k1,τk(z)λαk4)1α,\displaystyle\mathbb{P}\bigg{(}\forall z\in\mathcal{Z},\ \forall k\geq 1,\ \tau_{k}(z)\leq\lambda_{\alpha}k^{4}\bigg{)}\geq 1-\alpha,

where λα(m+1)2η2log2(1+SAα)\lambda_{\alpha}\triangleq\frac{(m+1)^{2}}{\eta^{2}}\log^{2}(1+\frac{SA}{\alpha}).

Corollary 1.

Denote by N_z(t) the number of times the agent visits state-action z up to and including time step t. Then, with the same notation as in the lemma above, for all α ∈ (0,1):

(z𝒵,t1,Nz(t)(tλα)1/41)1α.\displaystyle\mathbb{P}\bigg{(}\forall z\in\mathcal{Z},\ \forall t\geq 1,\ N_{z}(t)\geq\bigg{(}\frac{t}{\lambda_{\alpha}}\bigg{)}^{1/4}-1\bigg{)}\geq 1-\alpha.
Proof.

Let ff be some increasing function such that f()f(\mathbb{N})\subset\mathbb{N} and f(0)=0f(0)=0 and define the event =(z𝒵,k1,τk(z)f(k))\mathcal{E}=\bigg{(}\forall z\in\mathcal{Z},\ \forall k\geq 1,\ \tau_{k}(z)\leq f(k)\bigg{)}. We will prove the following more general result:

(c)SAk=1j=0f(k)f(k1)1m+11[1ηl=1m+1εf(k1)+(m+1).j+l],\displaystyle\mathbb{P}(\mathcal{E}^{c})\leq SA\sum\limits_{k=1}^{\infty}\prod\limits_{j=0}^{\lfloor\frac{f(k)-f(k-1)-1}{m+1}\rfloor-1}\bigg{[}1-\eta\prod\limits_{l=1}^{m+1}\varepsilon_{f(k-1)+(m+1).j+l}\bigg{]}\;, (20)

where η\eta is a constant depending on the communication properties of \mathcal{M}. Then we will tune f(k)f(k) and εt\varepsilon_{t} so that the right-hand side is less than α\alpha. First, observe that:

c=z𝒵k=1(τk(z)>f(k)andjk1,τj(z)f(j)).\mathcal{E}^{c}=\bigcup\limits_{z\in\mathcal{Z}}\ \bigcup\limits_{k=1}^{\infty}\bigg{(}\tau_{k}(z)>f(k)\ \mathrm{and}\ \forall j\leq k-1,\ \tau_{j}(z)\leq f(j)\bigg{)}\;.

Using the decomposition above, we upper bound the probability of c\mathcal{E}^{c} by the sum of probabilities for k1k\geq 1 that the kk-th excursion from and back to zz takes too long:

(c)\displaystyle\mathbb{P}(\mathcal{E}^{c}) z𝒵[(τ1(z)>f(1))+k=2(τk(z)>f(k)andjk1,τj(z)f(j))]\displaystyle\leq\sum\limits_{z\in\mathcal{Z}}\bigg{[}\mathbb{P}(\tau_{1}(z)>f(1))+\sum\limits_{k=2}^{\infty}\mathbb{P}\bigg{(}\tau_{k}(z)>f(k)\ \mathrm{and}\ \forall j\leq k-1,\ \tau_{j}(z)\leq f(j)\bigg{)}\bigg{]}
z𝒵[(τ1(z)>f(1))+k=2(τk(z)>f(k)andτk1(z)f(k1))]\displaystyle\leq\sum\limits_{z\in\mathcal{Z}}\bigg{[}\mathbb{P}(\tau_{1}(z)>f(1))+\sum\limits_{k=2}^{\infty}\mathbb{P}\bigg{(}\tau_{k}(z)>f(k)\ \mathrm{and}\ \tau_{k-1}(z)\leq f(k-1)\bigg{)}\bigg{]}
z𝒵[(τ1(z)>f(1))+k=2(τk(z)τk1(z)>f(k)f(k1),τk1(z)f(k1))]\displaystyle\leq\sum\limits_{z\in\mathcal{Z}}\bigg{[}\mathbb{P}(\tau_{1}(z)>f(1))+\sum\limits_{k=2}^{\infty}\mathbb{P}\bigg{(}\tau_{k}(z)-\tau_{k-1}(z)>f(k)-f(k-1),\ \tau_{k-1}(z)\leq f(k-1)\bigg{)}\bigg{]}
\leq\sum_{z\in\mathcal{Z}}\Big[\mathbb{P}(\tau_{1}(z)>f(1))+\sum_{k=2}^{\infty}\sum_{n=1}^{f(k-1)}\mathbb{P}\big(\tau_{k}(z)-\tau_{k-1}(z)>f(k)-f(k-1)\,\big|\,\tau_{k-1}(z)=n\big)\,\mathbb{P}(\tau_{k-1}(z)=n)\Big]
=\sum_{z\in\mathcal{Z}}\Big[a_{1}(z)+\sum_{k=2}^{\infty}\sum_{n=1}^{f(k-1)}a_{k,n}(z)\,\mathbb{P}(\tau_{k-1}(z)=n)\Big]\;, (21)

where

a1(z)\displaystyle a_{1}(z) (τ1(z)>f(1)),\displaystyle\triangleq\mathbb{P}(\tau_{1}(z)>f(1))\;,
k2n[|1,f(k1)|],ak,n(z)\displaystyle\forall k\geq 2\ \forall n\in[|1,f(k-1)|],\ a_{k,n}(z) (τk(z)τk1(z)>f(k)f(k1)|τk1(z)=n).\displaystyle\triangleq\mathbb{P}\big{(}\tau_{k}(z)-\tau_{k-1}(z)>f(k)-f(k-1)\big{|}\tau_{k-1}(z)=n\big{)}\;.

We will now prove an upper bound on ak,n(z)a_{k,n}(z) for a fixed z𝒵z\in\mathcal{Z} and k1k\geq 1.

1) Upper bounding the probability that an excursion takes too long:

Let us rewrite PtP_{t} as

Pt=(Qt(z)[Pt(z,z)]zz[Pt(z,z)]zzTPt(z,z)),P_{t}=\begin{pmatrix}\begin{matrix}\\ \quad Q_{t}(z)\quad\\ \\ \end{matrix}&\vline&[P_{t}(z^{\prime},z)]_{z^{\prime}\neq z}\\ \hline\cr[P_{t}(z,z^{\prime})]_{z^{\prime}\neq z}^{T}&\vline&\begin{matrix}P_{t}(z,z)\end{matrix}\end{pmatrix}\;,

so that state-action z corresponds to the last row and last column. Further, let p_t(z',¬z) ≜ [P_t(z',z'')]_{z''≠z} denote the vector of transition probabilities at time t from z' to state-action pairs z'' different from z. By induction on N, one can show that for all k, N, n ≥ 1 we have:

(τk(z)τk1(z)>N|τk1(z)=n)=pn(z,¬z)𝖳(j=n+1n+N1Qj(z))𝟙.\displaystyle\mathbb{P}\bigg{(}\tau_{k}(z)-\tau_{k-1}(z)>N\bigg{|}\tau_{k-1}(z)=n\bigg{)}=p_{n}(z,\neg z)^{\mathsf{\scriptscriptstyle T}}\bigg{(}\prod\limits_{j=n+1}^{n+N-1}Q_{j}(z)\bigg{)}\mathbbm{1}\;. (22)

Using Lemma 20, there exists η>0\eta>0 (that only depends on \mathcal{M}) such that for all n1n\geq 1 and all sequences (πt)t1(\pi_{t})_{t\geq 1} such that πtεtπu\pi_{t}\geq\varepsilon_{t}\pi_{u} we have:

l=n+1n+m+1Ql(z)1ηl=n+1n+m+1εl.\left\lVert\prod\limits_{l=n+1}^{n+m+1}Q_{l}(z)\right\rVert_{\infty}\leq 1-\eta\prod\limits_{l=n+1}^{n+m+1}\varepsilon_{l}\;. (23)

Therefore using (22) for N=f(k)f(k1)N=f(k)-f(k-1) and breaking the matrix product into smaller product terms of (m+1)(m+1) matrices, we get for k2k\geq 2:

ak,n(z)\displaystyle a_{k,n}(z) =(τk(z)τk1(z)>f(k)f(k1)|τk1(z)=n)\displaystyle=\mathbb{P}\bigg{(}\tau_{k}(z)-\tau_{k-1}(z)>f(k)-f(k-1)\bigg{|}\tau_{k-1}(z)=n\bigg{)}
=𝔼(πt)t1[(τk(s)τk1(s)>f(k)f(k1)|τk1(z)=n,(πt)t1)]\displaystyle=\mathbb{E}_{(\pi_{t})_{t\geq 1}}\bigg{[}\mathbb{P}\bigg{(}\tau_{k}(s)-\tau_{k-1}(s)>f(k)-f(k-1)\bigg{|}\tau_{k-1}(z)=n,(\pi_{t})_{t\geq 1}\bigg{)}\bigg{]}
=𝔼(πt)t1[pn(z,¬z)𝖳(j=n+1n+f(k)f(k1)1Qj(z))𝟙]\displaystyle=\mathbb{E}_{(\pi_{t})_{t\geq 1}}\bigg{[}p_{n}(z,\neg z)^{\mathsf{\scriptscriptstyle T}}\bigg{(}\prod\limits_{j=n+1}^{n+f(k)-f(k-1)-1}Q_{j}(z)\bigg{)}\mathbbm{1}\bigg{]}
l=n+1n+f(k)f(k1)1Ql(z)\displaystyle\leq\left\lVert\prod\limits_{l=n+1}^{n+f(k)-f(k-1)-1}Q_{l}(z)\right\rVert_{\infty}
l=(m+1)f(k)f(k1)1m+1+1f(k)f(k1)1Qn+l(z)×j=0f(k)f(k1)1m+11l=1m+1Qn+(m+1)j+l(z)\displaystyle\leq\left\lVert\prod\limits_{l=(m+1)\lfloor\frac{f(k)-f(k-1)-1}{m+1}\rfloor+1}^{f(k)-f(k-1)-1}Q_{n+l}(z)\right\rVert_{\infty}\times\prod\limits_{j=0}^{\lfloor\frac{f(k)-f(k-1)-1}{m+1}\rfloor-1}\left\lVert\prod_{l=1}^{m+1}Q_{n+(m+1)j+l}(z)\right\rVert_{\infty}
j=0f(k)f(k1)1m+11[1ηl=1m+1εn+(m+1)j+l]\displaystyle\leq\prod\limits_{j=0}^{\lfloor\frac{f(k)-f(k-1)-1}{m+1}\rfloor-1}\bigg{[}1-\eta\prod_{l=1}^{m+1}\varepsilon_{n+(m+1)j+l}\bigg{]}
j=0f(k)f(k1)1m+11[1ηl=1m+1εf(k1)+(m+1)j+l]bk,\displaystyle\leq\prod\limits_{j=0}^{\lfloor\frac{f(k)-f(k-1)-1}{m+1}\rfloor-1}\bigg{[}1-\eta\prod_{l=1}^{m+1}\varepsilon_{f(k-1)+(m+1)j+l}\bigg{]}\triangleq b_{k}\;, (24)

where in the fourth line we used that pn(z,¬z)11\left\lVert p_{n}(z,\neg z)\right\rVert_{1}\leq 1. The sixth line uses the fact that the matrices QQ are substochastic. The last line is due to the fact that nf(k1)n\leq f(k-1) and tεtt\mapsto\varepsilon_{t} is decreasing. Similarly, one can prove that:

a1(z)\displaystyle a_{1}(z) j=0f(1)1m+11[1ηl=1m+1ε(m+1)j+l]\displaystyle\leq\prod\limits_{j=0}^{\lfloor\frac{f(1)-1}{m+1}\rfloor-1}\bigg{[}1-\eta\prod_{l=1}^{m+1}\varepsilon_{(m+1)j+l}\bigg{]}
=j=0f(1)f(0)1m+11[1ηl=1m+1εf(0)+(m+1)j+l]b1,\displaystyle=\prod\limits_{j=0}^{\lfloor\frac{f(1)-f(0)-1}{m+1}\rfloor-1}\bigg{[}1-\eta\prod_{l=1}^{m+1}\varepsilon_{f(0)+(m+1)j+l}\bigg{]}\triangleq b_{1}\;, (25)

where we used the fact that f(0)=0f(0)=0. Now we only have to tune f(k)f(k) and εt\varepsilon_{t} so that k=1bk<αSA\sum\limits_{k=1}^{\infty}b_{k}<\frac{\alpha}{SA} and conclude using (21), (24) and (25).

2) Tuning ff and the exploration rate:

Since the sequence (εt)t1(\varepsilon_{t})_{t\geq 1} is decreasing we have:

bk\displaystyle b_{k} =j=0f(k)f(k1)1m+11[1ηl=1m+1εf(k1)+(m+1)j+l]\displaystyle=\prod\limits_{j=0}^{\lfloor\frac{f(k)-f(k-1)-1}{m+1}\rfloor-1}\bigg{[}1-\eta\prod_{l=1}^{m+1}\varepsilon_{f(k-1)+(m+1)j+l}\bigg{]}
\leq\prod_{j=0}^{\lfloor\frac{f(k)-f(k-1)-1}{m+1}\rfloor-1}\Big[1-\eta\big(\varepsilon_{f(k-1)+(m+1)j+m+1}\big)^{m+1}\Big]
[1η(εf(k))m+1]f(k)f(k1)1m+1.\displaystyle\leq\bigg{[}1-\eta\big{(}\varepsilon_{f(k)}\big{)}^{m+1}\bigg{]}^{\lfloor\frac{f(k)-f(k-1)-1}{m+1}\rfloor}.

For f(k)=λ.k4f(k)=\lambda.k^{4} where λ\lambda\in\mathbb{N}^{\star} and εt=t12(m+1)\varepsilon_{t}=t^{-\frac{1}{2(m+1)}} we have: f(k)f(k1)1m+1λk3(m+1)\lfloor\frac{f(k)-f(k-1)-1}{m+1}\rfloor\geq\frac{\lambda k^{3}}{(m+1)} and (εf(k))m+1=1λk2\big{(}\varepsilon_{f(k)}\big{)}^{m+1}=\frac{1}{\sqrt{\lambda}k^{2}}, implying:

bk\displaystyle b_{k} [1ηλk2]λk3(m+1)\displaystyle\leq\bigg{[}1-\frac{\eta}{\sqrt{\lambda}k^{2}}\bigg{]}^{\frac{\lambda k^{3}}{(m+1)}}
exp(λk3η(m+1)λk2)=exp(λ1/2kηm+1).\displaystyle\leq\exp\bigg{(}\frac{-\lambda k^{3}\eta}{(m+1)\sqrt{\lambda}k^{2}}\bigg{)}=\exp\bigg{(}-\frac{\lambda^{1/2}k\eta}{m+1}\bigg{)}\;.

Summing the last inequality, along with (21), (24) and (25) we get:

(c)\displaystyle\mathbb{P}(\mathcal{E}^{c}) SAk=1bk\displaystyle\leq SA\sum\limits_{k=1}^{\infty}b_{k}
SAk=1exp(λ1/2kηm+1)\displaystyle\leq SA\sum\limits_{k=1}^{\infty}\exp\bigg{(}-\frac{\lambda^{1/2}k\eta}{m+1}\bigg{)}
=SAexp(λ1/2ηm+1)1exp(λ1/2ηm+1)g(λ).\displaystyle=\frac{SA\exp\big{(}-\frac{\lambda^{1/2}\eta}{m+1}\big{)}}{1-\exp\big{(}-\frac{\lambda^{1/2}\eta}{m+1}\big{)}}\triangleq g(\lambda)\;.

For λα(m+1)2η2log2(1+SAα)\lambda_{\alpha}\triangleq\frac{(m+1)^{2}}{\eta^{2}}\log^{2}(1+\frac{SA}{\alpha}), we have g(λα)=αg(\lambda_{\alpha})=\alpha, which gives the desired result.
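The tuning can be verified directly (a sketch of ours):

import numpy as np

def lambda_alpha(m, eta, S, A, alpha):
    # lambda_alpha = ((m+1)/eta)^2 * log(1 + S*A/alpha)^2
    return ((m + 1) / eta) ** 2 * np.log(1 + S * A / alpha) ** 2

def g(lam, m, eta, S, A):
    # g(lambda) = S*A*exp(-sqrt(lambda)*eta/(m+1)) / (1 - exp(-sqrt(lambda)*eta/(m+1)))
    x = np.exp(-np.sqrt(lam) * eta / (m + 1))
    return S * A * x / (1 - x)

# sanity check: g(lambda_alpha(m, eta, S, A, alpha), m, eta, S, A) returns alpha up to rounding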

Remark 4.

It is natural that λ depends on η, which expresses how well connected the MDP is under the uniform policy; see the proof of Lemma 20.

D.5 An Ergodic Theorem for non-homogeneous Markov Chains

We start with some definitions and a technical result. Let {Pπ,πΠ}\{P_{\pi},\pi\in\Pi\} be a collection of Markov transition kernels on the state-action space 𝒵\mathcal{Z}, indexed by policies πΠ\pi\in\Pi. For any Markov transition kernel PP, bounded function ff and probability distribution μ\mu, we define:

Pf(z)z𝒵P(z,z)f(z)andμP(z)z𝒵μ(z)P(z,z).Pf(z)\triangleq\sum\limits_{z^{\prime}\in\mathcal{Z}}P(z,z^{\prime})f(z^{\prime})\quad\textrm{and}\quad\mu P(z)\triangleq\sum\limits_{z^{\prime}\in\mathcal{Z}}\mu(z^{\prime})P(z^{\prime},z).

For a measure μ\mu and a function ff, μ(f)=𝔼Xμ[f(X)]\mu(f)=\mathbb{E}_{X\sim\mu}[f(X)] denotes the mean of ff w.r.t. μ\mu. Finally, for two policies π\pi and π\pi^{\prime} we define D(π,π)PπPπ=maxz𝒵Pπ(z,.)Pπ(z,.)1D(\pi,\pi^{\prime})\triangleq\left\lVert P_{\pi}-P_{\pi^{\prime}}\right\rVert_{\infty}=\max\limits_{z\in\mathcal{Z}}\left\lVert P_{\pi}(z,.)-P_{\pi^{\prime}}(z,.)\right\rVert_{1}.
We consider a 𝒵×Π\mathcal{Z}\times\Pi-valued process {(zt,πt),t1}\{(z_{t},\pi_{t}),t\geq 1\} such that (zt,πt)(z_{t},\pi_{t}) is t\mathcal{F}_{t}-adapted and for any bounded measurable function ff:

𝔼[f(zt+1)|t]=Pπtf(zt).\mathbb{E}[f(z_{t+1})|\mathcal{F}_{t}]=P_{\pi_{t}}f(z_{t}).
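In matrix form, these operations are just products with the (finite) kernel matrix; the small helpers below (ours) make the notation concrete.

import numpy as np

def apply_kernel(P, f):
    # (P f)(z) = sum_{z'} P(z, z') f(z')
    return P @ f

def push_forward(mu, P):
    # (mu P)(z) = sum_{z'} mu(z') P(z', z)
    return mu @ P

def policy_distance(P_pi1, P_pi2):
    # D(pi, pi') = max_z || P_pi(z, .) - P_pi'(z, .) ||_1
    return np.abs(P_pi1 - P_pi2).sum(axis=1).max()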

The next result is adapted from [FMP11], where the authors prove an ergodic theorem for adaptive Markov Chain Monte-Carlo (MCMC) algorithms with a general state space and a parameter-dependent function. For completeness, we include here the proof of their result in the simpler case of finite state-space chains and of a function that does not depend on the policy π_t.

Proposition 12.

(Corollary 2.9, [FMP11]) Assume that:

(B1) For all t ≥ 1, P_t is ergodic. We denote by ω_t its stationary distribution.
(B2) There exists an ergodic kernel P such that ‖P_t − P‖_∞ → 0 almost surely as t → ∞.
(B3) There exist two constants C_t and ρ_t such that for all n ≥ 1, ‖P_t^n − W_t‖_∞ ≤ C_t ρ_t^n, where W_t is a rank-one matrix whose rows are equal to ω_tᵀ.
(B4) Denote by L_t ≜ C_t(1 − ρ_t)^{-1}. Then limsup_{t→∞} L_t < ∞ almost surely. (UNIFORM ERGODICITY)
(B5) D(π_{t+1}, π_t) → 0 almost surely as t → ∞. (STABILITY)

Finally, denote by ω\omega^{\star} the stationary distribution of PP. Then for any bounded non-negative function f:𝒵+f:\ \mathcal{Z}\to\mathbb{R}^{+} we have:

k=1tf(zk)ttω(f)\displaystyle\frac{\sum\limits_{k=1}^{t}f(z_{k})}{t}\underset{t\to\infty}{\longrightarrow}\omega^{\star}(f)

almost surely.

Proof.

Consider the difference

D\displaystyle D =k=1tf(zk)tω(f)\displaystyle=\frac{\sum\limits_{k=1}^{t}f(z_{k})}{t}-\omega(f)
=f(z1)ω(f)tD1,t+k=2t[f(zk)ωk1(f)]tD2,t+k=2t[ωk1(f)ω(f)]tD3,t.\displaystyle=\underbrace{\frac{f(z_{1})-\omega^{\star}(f)}{t}}\limits_{D_{1,t}}+\underbrace{\frac{\sum\limits_{k=2}^{t}\big{[}f(z_{k})-\omega_{k-1}(f)\big{]}}{t}}\limits_{D_{2,t}}+\underbrace{\frac{\sum\limits_{k=2}^{t}\big{[}\omega_{k-1}(f)-\omega^{\star}(f)\big{]}}{t}}\limits_{D_{3,t}}. (26)

We clearly have:

|D1,t|ftt0.\displaystyle|D_{1,t}|\leq\frac{\left\lVert f\right\rVert_{\infty}}{t}\underset{t\to\infty}{\longrightarrow}0. (28)

Next, by Lemma 22 there exists a constant κP\kappa_{P} (that only depends on PP) such that: ωkω1κPPkP\left\lVert\omega_{k}-\omega^{\star}\right\rVert_{1}\leq\kappa_{P}\left\lVert P_{k}-P\right\rVert_{\infty}. Therefore:

|D3,t|\displaystyle|D_{3,t}| κPfk=1t1PkPtt0,\displaystyle\leq\kappa_{P}\left\lVert f\right\rVert_{\infty}\frac{\sum\limits_{k=1}^{t-1}\left\lVert P_{k}-P\right\rVert_{\infty}}{t}\underset{t\to\infty}{\longrightarrow}0, (29)

where the convergence to zero is due to assumption (B2). To bound D_{2,t}, we use the function f̂_k that solves the Poisson equation (f̂_k − P_k f̂_k)(·) = f(·) − ω_k(f). By Lemma 23, f̂_k(·) = Σ_{n≥0} P_k^n[f − ω_k(f)](·) exists and is a solution of this equation. Therefore we can rewrite D_{2,t} as follows:

D2,t\displaystyle D_{2,t} =k=2t[f^k1(zk)Pk1f^k1(zk)]t\displaystyle=\frac{\sum\limits_{k=2}^{t}\big{[}\widehat{f}_{k-1}(z_{k})-P_{k-1}\widehat{f}_{k-1}(z_{k})\big{]}}{t}
=Mt+Ct+Rt,\displaystyle=M_{t}+C_{t}+R_{t}, (30)

where

Mtk=2t[f^k1(zk)Pk1f^k1(zk1)]t,\displaystyle M_{t}\triangleq\frac{\sum\limits_{k=2}^{t}\big{[}\widehat{f}_{k-1}(z_{k})-P_{k-1}\widehat{f}_{k-1}(z_{k-1})\big{]}}{t},
Ctk=2t[Pkf^k(zk)Pk1f^k1(zk)]t,\displaystyle C_{t}\triangleq\frac{\sum\limits_{k=2}^{t}\big{[}P_{k}\widehat{f}_{k}(z_{k})-P_{k-1}\widehat{f}_{k-1}(z_{k})\big{]}}{t},
RtP1f^1(z1)Ptf^t(zt)t.\displaystyle R_{t}\triangleq\frac{P_{1}\widehat{f}_{1}(z_{1})-P_{t}\widehat{f}_{t}(z_{t})}{t}.
Bounding MtM_{t}:

Note that SttMtS_{t}\triangleq tM_{t} is a martingale since 𝔼[f^k1(zk)|k1]=Pk1f^k1(zk1)\mathbb{E}[\widehat{f}_{k-1}(z_{k})|\mathcal{F}_{k-1}]=P_{k-1}\widehat{f}_{k-1}(z_{k-1}). Furthermore, by Lemma 23:

|StSt1|\displaystyle|S_{t}-S_{t-1}| =|f^t1(zt)Pk1f^t1(zt1)|\displaystyle=|\widehat{f}_{t-1}(z_{t})-P_{k-1}\widehat{f}_{t-1}(z_{t-1})|
2f^t1\displaystyle\leq 2\left\lVert\widehat{f}_{t-1}\right\rVert_{\infty}
2Lt1f.\displaystyle\leq 2L_{t-1}\left\lVert f\right\rVert_{\infty}.

In particular, this implies that:

k=2𝔼[|SkSk1|2|k1]k2\displaystyle\sum\limits_{k=2}^{\infty}\frac{\mathbb{E}[|S_{k}-S_{k-1}|^{2}\ |\mathcal{F}_{k-1}]}{k^{2}} k=24f2Lk12k2<\displaystyle\leq\sum\limits_{k=2}^{\infty}\frac{4\left\lVert f\right\rVert_{\infty}^{2}L_{k-1}^{2}}{k^{2}}<\infty

where the convergence of the series is due to (B4). Theorem 2.18 in [HH80] then implies that M_t → 0 almost surely as t → ∞.

Bounding CtC_{t}:

Using Lemma 23, we have:

|Ct|\displaystyle|C_{t}| fk=2tLk[ωkωk11+Lk1D(πk,πk1)]t\displaystyle\leq\left\lVert f\right\rVert_{\infty}\frac{\sum\limits_{k=2}^{t}L_{k}\bigg{[}\left\lVert\omega_{k}-\omega_{k-1}\right\rVert_{1}+L_{k-1}D(\pi_{k},\pi_{k-1})\bigg{]}}{t}
fk=2tLk[ωkω1+ωωk11+Lk1D(πk,πk1)]t\displaystyle\leq\left\lVert f\right\rVert_{\infty}\frac{\sum\limits_{k=2}^{t}L_{k}\bigg{[}\left\lVert\omega_{k}-\omega^{\star}\right\rVert_{1}+\left\lVert\omega^{\star}-\omega_{k-1}\right\rVert_{1}+L_{k-1}D(\pi_{k},\pi_{k-1})\bigg{]}}{t}
f[κPL2P1P+LtPtP+k=2t1(Lk+Lk+1)PkPt\displaystyle\leq\left\lVert f\right\rVert_{\infty}\bigg{[}\kappa_{P}\frac{L_{2}\left\lVert P_{1}-P\right\rVert_{\infty}+L_{t}\left\lVert P_{t}-P\right\rVert_{\infty}+\sum\limits_{k=2}^{t-1}(L_{k}+L_{k+1})\left\lVert P_{k}-P\right\rVert_{\infty}}{t}
+k=2tLkLk1D(πk,πk1)t]t0,\displaystyle\quad\quad\quad\quad+\frac{\sum\limits_{k=2}^{t}L_{k}L_{k-1}D(\pi_{k},\pi_{k-1})}{t}\bigg{]}\underset{t\to\infty}{\longrightarrow}0, (31)

where the third line comes from applying Lemma 22 and the convergence to zero is due to assumptions (B2)-(B4)-(B5).

Bounding RtR_{t}:

Finally, by Lemma 23 we have:

|Rt|\displaystyle|R_{t}| f^1+f^tt\displaystyle\leq\frac{\left\lVert\widehat{f}_{1}\right\rVert_{\infty}+\left\lVert\widehat{f}_{t}\right\rVert_{\infty}}{t}
f(L1+Lt)tt0,\displaystyle\leq\frac{\left\lVert f\right\rVert_{\infty}(L_{1}+L_{t})}{t}\underset{t\to\infty}{\longrightarrow}0, (32)

where the convergence to zero is due to Assumption (B4). Summing up the inequalities (28-32) gives the result. ∎
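The Poisson-equation device used to bound D_{2,t} is easy to reproduce numerically for a fixed ergodic kernel: the solution can be obtained from a fundamental matrix, exactly as in the definition of Z_{π^o} in Table 1. Below is a sketch of ours together with the corresponding check.

import numpy as np

def stationary_distribution(P):
    # Left eigenvector of P associated with eigenvalue 1, normalized to sum to one
    vals, vecs = np.linalg.eig(P.T)
    w = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    return w / w.sum()

def poisson_solution(P, f):
    # f_hat = sum_{n >= 0} P^n (f - w(f) 1), computed via (I - P + 1 w^T)^{-1}
    n = P.shape[0]
    w = stationary_distribution(P)
    Z = np.linalg.inv(np.eye(n) - P + np.outer(np.ones(n), w))
    f_hat = Z @ (f - w @ f)
    assert np.allclose(f_hat - P @ f_hat, f - w @ f)   # the Poisson equation holds
    return f_hat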

D.6 Application to C-Navigation: proof of Theorem 7

We now prove that C-Navigation satisfies assumptions (B1)-(B5).

The same can be proved for D-Navigation by replacing πto¯\overline{\pi_{t}^{o}} with πto\pi_{t}^{o}. Theorem 7 follows immediately by applying Proposition 12 for the functions f(z~)=𝟙{z~=z}f(\tilde{z})=\mathbbm{1}\{\tilde{z}=z\}, where zz is any fixed state-action pair.

(B1):

This is a direct consequence of the fact that PπuP_{\pi_{u}} is ergodic (due to Assumptions 1 and 10) which implies by construction that PtP_{t} is also ergodic.

(B2):

By Lemma 4, we have N_sa(t) → ∞ almost surely for all (s,a). Hence M̂_t → M almost surely and, by continuity, π^o(M̂_t) → π^o(M) almost surely, which implies that:

Pta.sPπo.P_{t}\overset{a.s}{\longrightarrow}P_{\pi^{o}}. (33)
(B3):

By Lemma 13, PtP_{t} satisfies (B3) for Ct=2θ(εt,πto¯,ωt)1C_{t}=2\theta(\varepsilon_{t},\overline{\pi_{t}^{o}},\omega_{t})^{-1} and ρt=θ(εt,πto¯,ωt)1/r\rho_{t}=\theta(\varepsilon_{t},\overline{\pi_{t}^{o}},\omega_{t})^{1/r}.

(B4):

We have:

σ(εt,πto¯,ωt)\displaystyle\sigma(\varepsilon_{t},\overline{\pi_{t}^{o}},\omega_{t}) =(εtr+[(1εt)Amins,aπto¯(a|s)]r)σu(minzωu(z)ωt(z))\displaystyle=\bigg{(}\varepsilon_{t}^{r}+\big{[}(1-\varepsilon_{t})A\min\limits_{s,a}\overline{\pi_{t}^{o}}(a|s)\big{]}^{r}\bigg{)}\sigma_{u}\bigg{(}\min\limits_{z}\frac{\omega_{u}(z)}{\omega_{t}(z)}\bigg{)}
a.s(Amins,aπo(a|s))σuminzωu(z)ω(z)σo.\displaystyle\overset{a.s}{\longrightarrow}\bigg{(}A\min\limits_{s,a}\pi^{o}(a|s)\bigg{)}\sigma_{u}\min\limits_{z}\frac{\omega_{u}(z)}{\omega^{\star}(z)}\triangleq\sigma_{o}. (34)

Note that σo>0\sigma_{o}>0 since ωu>0\omega_{u}>0 (ergodicity of PπuP_{\pi_{u}}), ω<1\omega^{\star}<1 and πo>0\pi^{o}>0 entry-wise. Similarly, it is trivial that σo<1\sigma_{o}<1 since Amins,aπo(a|s)<1,minzωu(z)ω(z)<1A\min\limits_{s,a}\pi^{o}(a|s)<1,\min\limits_{z}\frac{\omega_{u}(z)}{\omega^{\star}(z)}<1 and σu<1\sigma_{u}<1. Therefore θ(εt,πto¯,ωt)=1σ(εt,πto¯,ωt)a.s1σoθo(0,1)\theta(\varepsilon_{t},\overline{\pi_{t}^{o}},\omega_{t})=1-\sigma(\varepsilon_{t},\overline{\pi_{t}^{o}},\omega_{t})\overset{a.s}{\longrightarrow}1-\sigma_{o}\triangleq\theta_{o}\in(0,1) and:

lim suptLt\displaystyle\limsup\limits_{t\to\infty}\ L_{t} =lim suptCt(1ρt)1\displaystyle=\limsup\limits_{t\to\infty}\ C_{t}(1-\rho_{t})^{-1}
=lim supt2θ(εt,πto¯,ωt)[1θ(εt,πto¯,ωt)1/r]\displaystyle=\limsup\limits_{t\to\infty}\frac{2}{\theta(\varepsilon_{t},\overline{\pi_{t}^{o}},\omega_{t})\big{[}1-\theta(\varepsilon_{t},\overline{\pi_{t}^{o}},\omega_{t})^{1/r}\big{]}}
=2θo[1θo1r]<.\displaystyle=\frac{2}{\theta_{o}\big{[}1-\theta_{o}^{\frac{1}{r}}\big{]}}<\infty. (35)
(B5):

We have: (Pt+1Pt)(s,s)=a[πt+1(a|s)πt(a|s)]p(s|s,a)\big{(}P_{t+1}-P_{t}\big{)}(s,s^{\prime})=\sum\limits_{a}[\pi_{t+1}(a|s)-\pi_{t}(a|s)]p_{\mathcal{M}}(s^{\prime}|s,a). Hence: Pt+1Ptπt+1πt\left\lVert P_{t+1}-P_{t}\right\rVert_{\infty}\leq\left\lVert\pi_{t+1}-\pi_{t}\right\rVert_{\infty}, where πt+1\pi_{t+1} and πt\pi_{t} are viewed as vectors. On the other hand:

πt+1πt\displaystyle\pi_{t+1}-\pi_{t} =(εtεt+1)(πto¯πu)+(1εt)(πt+1o¯πto¯)\displaystyle=(\varepsilon_{t}-\varepsilon_{t+1})(\overline{\pi_{t}^{o}}-\pi_{u})+(1-\varepsilon_{t})(\overline{\pi_{t+1}^{o}}-\overline{\pi_{t}^{o}})
=(εtεt+1)(πto¯πu)+(1εt)(t×πto¯+πo(^t+1)t+1πto¯)\displaystyle=(\varepsilon_{t}-\varepsilon_{t+1})(\overline{\pi_{t}^{o}}-\pi_{u})+(1-\varepsilon_{t})(\frac{t\times\overline{\pi_{t}^{o}}+\pi^{o}(\widehat{\mathcal{M}}_{t+1})}{t+1}-\overline{\pi_{t}^{o}})
=(εtεt+1)(πto¯πu)+(1εt)πo(^t+1)πto¯t+1\displaystyle=(\varepsilon_{t}-\varepsilon_{t+1})(\overline{\pi_{t}^{o}}-\pi_{u})+(1-\varepsilon_{t})\frac{\pi^{o}(\widehat{\mathcal{M}}_{t+1})-\overline{\pi_{t}^{o}}}{t+1}

Therefore

Pt+1Pt\displaystyle\left\lVert P_{t+1}-P_{t}\right\rVert_{\infty} πt+1πt\displaystyle\leq\left\lVert\pi_{t+1}-\pi_{t}\right\rVert_{\infty}
(εtεt+1)+1t+1t0.\displaystyle\leq(\varepsilon_{t}-\varepsilon_{t+1})+\frac{1}{t+1}\underset{t\to\infty}{\longrightarrow}0.

For D-Navigation, we get in a similar fashion:

Pt+1Pt\displaystyle\left\lVert P_{t+1}-P_{t}\right\rVert_{\infty} πt+1πt\displaystyle\leq\left\lVert\pi_{t+1}-\pi_{t}\right\rVert_{\infty}
(εtεt+1)+(1εt)πt+1oπtot0.\displaystyle\leq(\varepsilon_{t}-\varepsilon_{t+1})+(1-\varepsilon_{t})\left\lVert\pi_{t+1}^{o}-\pi_{t}^{o}\right\rVert_{\infty}\underset{t\to\infty}{\longrightarrow}0.

D.7 Geometric ergodicity of the sampling rules

Since PπuP_{\pi_{u}} is ergodic, there exists r>0r>0 such that Pπur(z,z)>0P_{\pi_{u}}^{r}(z,z^{\prime})>0 for all z,zz,z^{\prime} (Proposition 1.7, [LPW06]). Thus we define

r\displaystyle r =min{1:(z,z)𝒵2,Pπu(z,z)>0},\displaystyle=\min\{\ell\geq 1:\ \forall(z,z^{\prime})\in\mathcal{Z}^{2},\ P_{\pi_{u}}^{\ell}(z,z^{\prime})>0\}, (36)
σu\displaystyle\sigma_{u} minz,zPπur(z,z)ωu(z),\displaystyle\triangleq\min\limits_{z,z^{\prime}}\ \frac{P_{\pi_{u}}^{r}(z,z^{\prime})}{\omega_{u}(z^{\prime})}, (37)

where ωu\omega_{u} is the stationary distribution of PπuP_{\pi_{u}}.

Lemma 13.

Let πtoπo(^t)\pi_{t}^{o}\triangleq\pi^{o}(\widehat{\mathcal{M}}_{t}) (resp. πto¯j=1tπo(^j)/t\overline{\pi_{t}^{o}}\triangleq\sum_{j=1}^{t}\pi^{o}(\widehat{\mathcal{M}}_{j})/t) denote the oracle policy of ^t\widehat{\mathcal{M}}_{t} (resp. the Cesaro-mean of oracle policies up to time tt). Further define

σ(ε,π,ω)\displaystyle\sigma(\varepsilon,\pi,\omega) (εr+[(1ε)Amins,aπ(a|s)]r)σu(minzωu(z)ω(z)),\displaystyle\triangleq\bigg{(}\varepsilon^{r}+\big{[}(1-\varepsilon)A\min\limits_{s,a}\pi(a|s)\big{]}^{r}\bigg{)}\sigma_{u}\bigg{(}\min\limits_{z}\frac{\omega_{u}(z)}{\omega(z)}\bigg{)},
θ(ε,π,ω)\displaystyle\theta(\varepsilon,\pi,\omega) 1σ(ε,π,ω),\displaystyle\triangleq 1-\sigma(\varepsilon,\pi,\omega),
(ε,π,ω)\displaystyle\mathcal{L}(\varepsilon,\pi,\omega) 2θ(ε,π,ω)[1θ(ε,π,ω)1/r].\displaystyle\triangleq\frac{2}{\theta(\varepsilon,\pi,\omega)\big{[}1-\theta(\varepsilon,\pi,\omega)^{1/r}\big{]}}.

Then for D-Navigation (resp. C-Navigation) we have:

n1,PtnWtCtρtn\displaystyle\forall n\geq 1,\ \left\lVert P_{t}^{n}-W_{t}\right\rVert_{\infty}\leq C_{t}\rho_{t}^{n}

where Ct=2θ(εt,πto,ωt)1C_{t}=2\theta(\varepsilon_{t},\pi_{t}^{o},\omega_{t})^{-1} and ρt=θ(εt,πto,ωt)1/r\rho_{t}=\theta(\varepsilon_{t},\pi_{t}^{o},\omega_{t})^{1/r} (resp. Ct=2θ(εt,πto¯,ωt)1C_{t}=2\theta(\varepsilon_{t},\overline{\pi_{t}^{o}},\omega_{t})^{-1} and ρt=θ(εt,πto¯,ωt)1/r\rho_{t}=\theta(\varepsilon_{t},\overline{\pi_{t}^{o}},\omega_{t})^{1/r}). In particular LtCt(1ρt)1=(εt,πto,ωt)L_{t}\triangleq C_{t}(1-\rho_{t})^{-1}=\mathcal{L}(\varepsilon_{t},\pi_{t}^{o},\omega_{t}) (resp. Lt=Ct(1ρt)1=(εt,πto¯,ωt)L_{t}=C_{t}(1-\rho_{t})^{-1}=\mathcal{L}(\varepsilon_{t},\overline{\pi_{t}^{o}},\omega_{t})).

Proof.

We only prove the lemma for C-Navigation. The statement for D-Navigation can be proved in the same way. Recall that: Pt=εtPπu+(1εt)Pπto¯P_{t}=\varepsilon_{t}P_{\pi_{u}}+(1-\varepsilon_{t})P_{\overline{\pi_{t}^{o}}}. Therefore:

(z,z),Ptr(z,z)\displaystyle\forall(z,z^{\prime}),\quad P_{t}^{r}(z,z^{\prime}) [εtrPπur+(1εt)rPπto¯r](z,z)\displaystyle\geq[\varepsilon_{t}^{r}P_{\pi_{u}}^{r}+(1-\varepsilon_{t})^{r}P_{\overline{\pi_{t}^{o}}}^{r}](z,z^{\prime})
(εtr+[(1εt)Amins,aπto¯(a|s)]r)Pπur(z,z)\displaystyle\geq\bigg{(}\varepsilon_{t}^{r}+\big{[}(1-\varepsilon_{t})A\min\limits_{s,a}\overline{\pi_{t}^{o}}(a|s)\big{]}^{r}\bigg{)}P_{\pi_{u}}^{r}(z,z^{\prime})
(εtr+[(1εt)Amins,aπto¯(a|s)]r)σuωu(z)\displaystyle\geq\bigg{(}\varepsilon_{t}^{r}+\big{[}(1-\varepsilon_{t})A\min\limits_{s,a}\overline{\pi_{t}^{o}}(a|s)\big{]}^{r}\bigg{)}\sigma_{u}\omega_{u}(z^{\prime})
(εtr+[(1εt)Amins,aπto¯(a|s)]r)σu(minzωu(z)ωt(z))σtωt(z)\displaystyle\geq\underbrace{\bigg{(}\varepsilon_{t}^{r}+\big{[}(1-\varepsilon_{t})A\min\limits_{s,a}\overline{\pi_{t}^{o}}(a|s)\big{]}^{r}\bigg{)}\sigma_{u}\bigg{(}\min\limits_{z}\frac{\omega_{u}(z)}{\omega_{t}(z)}\bigg{)}}\limits_{\sigma_{t}}\omega_{t}(z^{\prime})
=σ(εt,πto¯,ωt)ωt(z).\displaystyle=\sigma(\varepsilon_{t},\overline{\pi_{t}^{o}},\omega_{t})\omega_{t}(z^{\prime}).

where the second and third inequalities come from the fact that P_{π̄_t^o} ≥ A min_{s,a} π̄_t^o(a|s) P_{π_u} entry-wise and from (37), respectively. Using Lemma 21, we conclude that for all n ≥ 1:

PtnWt2θ(εt,πto¯,ωt)nr1\left\lVert P_{t}^{n}-W_{t}\right\rVert_{\infty}\leq 2\theta(\varepsilon_{t},\overline{\pi_{t}^{o}},\omega_{t})^{\frac{n}{r}-1}

where θ(εt,πto¯,ωt)=1σ(εt,πto¯,ωt)\theta(\varepsilon_{t},\overline{\pi_{t}^{o}},\omega_{t})=1-\sigma(\varepsilon_{t},\overline{\pi_{t}^{o}},\omega_{t}). Therefore PtP_{t} satisfies PtnWtCtρtn\left\lVert P_{t}^{n}-W_{t}\right\rVert_{\infty}\leq C_{t}\rho_{t}^{n} for Ct=2θ(εt,πto¯,ωt)1C_{t}=2\theta(\varepsilon_{t},\overline{\pi_{t}^{o}},\omega_{t})^{-1} and ρt=θ(εt,πto¯,ωt)1/r\rho_{t}=\theta(\varepsilon_{t},\overline{\pi_{t}^{o}},\omega_{t})^{1/r}. ∎
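Lemma 13 can be checked numerically on any small instance. The sketch below (ours; it assumes the mixture kernel is ergodic and, for simplicity, computes the Doeblin constants directly from P_t rather than through the uniform kernel as in the lemma) verifies the geometric bound for a range of powers n.

import numpy as np

def geometric_ergodicity_check(P_u, P_pi, eps, n_max=50):
    # Check ||P_t^n - W_t||_inf <= C_t * rho_t^n for the mixture P_t = eps*P_u + (1-eps)*P_pi
    P_t = eps * P_u + (1 - eps) * P_pi
    vals, vecs = np.linalg.eig(P_t.T)
    w = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
    w = w / w.sum()
    W = np.outer(np.ones(len(w)), w)
    r = 1
    while not np.all(np.linalg.matrix_power(P_t, r) > 0):   # smallest r with P_t^r > 0 entry-wise
        r += 1
    sigma = np.min(np.linalg.matrix_power(P_t, r) / w[None, :])
    theta = 1.0 - sigma                                      # assumes sigma < 1
    C, rho = 2.0 / theta, theta ** (1.0 / r)
    gaps = [np.abs(np.linalg.matrix_power(P_t, n) - W).sum(axis=1).max() for n in range(1, n_max)]
    return all(g <= C * rho ** n + 1e-10 for n, g in zip(range(1, n_max), gaps))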

Appendix E Stopping rule

E.1 Deviation inequality for KL divergences of rewards

We suppose that the reward distributions q_M(s,a) come from a one-dimensional exponential family and can therefore be parametrized by their respective means r_M(s,a). Furthermore, for any t such that N_sa(t) > 0, we let q̂_{s,a}(t) denote the distribution belonging to the same exponential family whose mean is the empirical average r̂_t(s,a) = Σ_{k=1}^t R_k 1{(s_k,a_k) = (s,a)} / N_sa(t). For x ≥ 1, define the function h(x) = x − log(x) and its inverse h^{−1}(x). Further define the function h̃: ℝ^+ → ℝ by:

\tilde{h}(x)=\begin{cases}h^{-1}(x)\exp\big(1/h^{-1}(x)\big)&\textrm{if }x\geq h^{-1}\big(1/\ln(3/2)\big),\\ \frac{3}{2}\big[x-\log\big(\log(3/2)\big)\big]&\textrm{otherwise.}\end{cases}

Finally let

\varphi(x)=2\tilde{h}\Big(\frac{h^{-1}(1+x)+\log(2\zeta(2))}{2}\Big),

where ζ(2) = Σ_{n=1}^∞ 1/n². Now we recall a deviation inequality from [KK18], which we use for the empirical KL divergence of the rewards.

Lemma 14.

(Theorem 14, [KK18]) Define the threshold βr(t,δ)SAφ(log(1/δ)/SA)+3s,alog[1+log(Nsa(t))]\beta_{r}(t,\delta)\triangleq SA\varphi\big{(}\log(1/\delta)/SA\big{)}+3\sum\limits_{s,a}\log\big{[}1+\log(N_{sa}(t))\big{]}. Then for all δ(0,1)\delta\in(0,1):

\mathbb{P}\Big(\exists t\geq 1,\ \sum_{(s,a)\in\mathcal{Z}}N_{sa}(t)\operatorname{KL}(\widehat{q}_{s,a}(t),\,q_{\mathcal{M}}(s,a))>\beta_{r}(t,\delta)\Big)\leq\delta.
Remark 5.

One can easily see that φ(x)xx\varphi(x)\underset{x\to\infty}{\sim}x.
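For completeness, the threshold β_r can be evaluated as follows (a sketch of ours: h^{-1} is computed by root finding, and the constant Σ_{n≥1} 1/n² is read as ζ(2) = π²/6; the exact constants should be taken from [KK18]).

import numpy as np
from scipy.optimize import brentq

def h(x):
    return x - np.log(x)                    # increasing on [1, +infinity)

def h_inv(y):
    return brentq(lambda x: h(x) - y, 1.0, np.exp(y) + 2.0)

def h_tilde(x):
    if x >= h_inv(1.0 / np.log(1.5)):
        z = h_inv(x)
        return z * np.exp(1.0 / z)
    return 1.5 * (x - np.log(np.log(1.5)))

def phi(x, zeta2=np.pi**2 / 6):
    return 2.0 * h_tilde((h_inv(1.0 + x) + np.log(2.0 * zeta2)) / 2.0)

def beta_r(N_sa, delta):
    # SA * phi(log(1/delta)/SA) + 3 * sum_{s,a} log(1 + log(N_sa(t))), with N_sa an array of counts
    SA = N_sa.size
    return SA * phi(np.log(1.0 / delta) / SA) + 3.0 * np.log1p(np.log(np.maximum(N_sa, 1.0))).sum()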

E.2 Deviation inequality for KL divergences of transitions

Our second deviation inequality is adapted from Proposition 1 in [JKM+20], where the authors derive a deviation inequality for a single KL divergence of a multinomial distribution. In order to obtain a deviation inequality for a sum of KL divergences, we modify their proof by considering the product over state-action pairs of the martingales they used. For completeness, we include the proof below. p̂_sa(t) is defined as the categorical distribution with a vector of probabilities q satisfying:

\forall s^{\prime}\in\mathcal{S},\ q_{s^{\prime}}=\begin{cases}\displaystyle\frac{\sum_{k=0}^{t-1}\mathbbm{1}\{(s_{k},a_{k},s_{k+1})=(s,a,s^{\prime})\}}{N_{sa}(t)}&\textrm{if }N_{sa}(t)\neq 0,\\ 1/S&\textrm{otherwise.}\end{cases}
Lemma 15.

(Proposition 1, [JKM+20]) Define the threshold βp(t,δ)log(1/δ)+(S1)s,alog(e[1+Nsa(t)/(S1)])\beta_{p}(t,\delta)\triangleq\log(1/\delta)+(S-1)\underset{s,a}{\sum}\log\bigg{(}e\big{[}1+N_{sa}(t)/(S-1)\big{]}\bigg{)}. Then for all δ(0,1)\delta\in(0,1) we have:

(t1,(s,a)𝒵Nsa(t)KL(p^sa(t),p(s,a))>βp(t,δ))δ,\mathbb{P}\bigg{(}\exists t\geq 1,\ \underset{(s,a)\in\mathcal{Z}}{\sum}\ N_{sa}(t)\operatorname{KL}(\widehat{p}_{sa}(t),\,p_{\mathcal{M}}(s,a))>\beta_{p}(t,\delta)\bigg{)}\leq\delta,

with the convention that Nsa(t)KL(p^sa(t),p(s,a))=0N_{sa}(t)\operatorname{KL}(\widehat{p}_{sa}(t),\,p_{\mathcal{M}}(s,a))=0 whenever Nsa(t)=0N_{sa}(t)=0.

Proof.

We begin with a few notations. For any vector λS1×{0}\lambda\in\mathbbm{R}^{S-1}\times\{0\} and any element of the simplex pΣS1p\in\Sigma_{S-1}, we denote <λ,p>i=1S1λipi<\lambda,p>\triangleq\sum_{i=1}^{S-1}\lambda_{i}p_{i}. We define the log-partition function of a discrete distribution pp supported over {1,,S}\{1,\ldots,S\} by:

\forall\lambda\in\mathbb{R}^{S-1}\times\{0\},\ \phi_{p}(\lambda)\triangleq\log\Big(p_{S}+\sum_{i=1}^{S-1}p_{i}e^{\lambda_{i}}\Big).

We use the shorthand and let ϕsa(λ)ϕpsa(λ)\phi_{sa}(\lambda)\triangleq\phi_{p_{sa}}(\lambda). For NN\in\mathbbm{N}^{*} and x{0,,N}kx\in\{0,\ldots,N\}^{k} such that i=1kxi=N\sum_{i=1}^{k}x_{i}=N the binomial coefficient is defined as: (Nx)N!i=1kxi!\binom{N}{x}\triangleq\frac{N!}{\prod_{i=1}^{k}x_{i}!}. Finally H(p)=i=1Spilog(1/pi)H(p)=\sum_{i=1}^{S}p_{i}\log(1/p_{i}) is the Shannon entropy of distribution pp.

Building a convenient mixture martingale for every state-action pair:

Following [JKM+20], we define for every integer tt:

Mtλ(s,a)=exp(Nsa(t)(<λ,p^sa(t)>ϕsa(λ))).M_{t}^{\lambda}(s,a)=\exp\bigg{(}N_{sa}(t)\big{(}<\lambda,\widehat{p}_{sa}(t)>-\phi_{sa}(\lambda)\big{)}\bigg{)}\;. (38)

The sequence (Mtλ(s,a))t(M_{t}^{\lambda}(s,a))_{t} is an (t)t(\mathcal{F}_{t})_{t}-martingale since:

𝔼[Mtλ(s,a)|t1,(st,at)=(s,a)]\displaystyle\mathbb{E}[M_{t}^{\lambda}(s,a)|\mathcal{F}_{t-1},\ (s_{t},a_{t})=(s,a)]
=𝔼[exp(Nsa(t)[<λ,p^sa(t)>ϕsa(λ)])|t1,(st,at)=(s,a)]\displaystyle=\mathbb{E}\bigg{[}\exp\bigg{(}N_{sa}(t)\big{[}<\lambda,\widehat{p}_{sa}(t)>-\phi_{sa}(\lambda)\big{]}\bigg{)}\bigg{|}\ \mathcal{F}_{t-1},\ (s_{t},a_{t})=(s,a)\bigg{]}
=𝔼Xpsa[exp((Nsa(t1)+1)(<λ,Nsa(t1)p^sa(t1)+XNsa(t1)+1>ϕsa(λ)))|t1]\displaystyle=\mathbb{E}_{X\sim p_{sa}}\bigg{[}\exp\bigg{(}\big{(}N_{sa}(t-1)+1\big{)}\big{(}<\lambda,\frac{N_{sa}(t-1)\widehat{p}_{sa}(t-1)+X}{N_{sa}(t-1)+1}>-\phi_{sa}(\lambda)\big{)}\bigg{)}\bigg{|}\ \mathcal{F}_{t-1}\bigg{]}
=𝔼Xpsa[Mt1λ(s,a)exp(<λ,X>ϕsa(λ))|t1]=Mt1λ(s,a).\displaystyle=\mathbb{E}_{X\sim p_{sa}}\bigg{[}M_{t-1}^{\lambda}(s,a)\exp\bigg{(}<\lambda,X>-\phi_{sa}(\lambda)\bigg{)}\bigg{|}\ \mathcal{F}_{t-1}\bigg{]}=M_{t-1}^{\lambda}(s,a).

The same holds trivially when (s_t,a_t) ≠ (s,a). Now, we consider the mixture martingale obtained from the family of priors λ_q = ∇φ_{sa}^{-1}(q), where q ∼ Dir(1,…,1) follows a Dirichlet distribution with parameters (1,…,1):

M_{t}(s,a)=\int M_{t}^{\lambda_{q}}(s,a)\,\frac{\Gamma(S)}{\prod_{i=1}^{S}\Gamma(1)}\,dq
=\int e^{N_{sa}(t)\big(KL(\widehat{p}_{sa}(t),p_{sa})-KL(\widehat{p}_{sa}(t),q)\big)}\,(S-1)!\,dq
=\exp\Big(N_{sa}(t)\big(KL(\widehat{p}_{sa}(t),p_{sa})+H(\widehat{p}_{sa}(t))\big)\Big)\,(S-1)!\int\prod_{i=1}^{S}q_{i}^{N_{sa}(t)\widehat{p}_{sa,i}(t)}\,dq
=exp(Nsa(t)[KL(p^sa(t),psa)+H(p^sa(t))])(S1)!i=1SΓ(1+Nsa(t)p^sa,i(t))Γ(Nsa(t)+S)\displaystyle=\exp\bigg{(}N_{sa}(t)\big{[}KL(\widehat{p}_{sa}(t),p_{sa})+H(\widehat{p}_{sa}(t))\big{]}\bigg{)}\frac{(S-1)!\prod_{i=1}^{S}\Gamma\bigg{(}1+N_{sa}(t)\widehat{p}_{sa,i}(t)\bigg{)}}{\Gamma(N_{sa}(t)+S)}
=exp(Nsa(t)[KL(p^sa(t),psa)+H(p^sa(t))])(S1)!i=1S(Nsa(t)p^sa,i(t))!(Nsa(t)+S1)!\displaystyle=\exp\bigg{(}N_{sa}(t)\big{[}KL(\widehat{p}_{sa}(t),p_{sa})+H(\widehat{p}_{sa}(t))\big{]}\bigg{)}\frac{(S-1)!\prod_{i=1}^{S}\big{(}N_{sa}(t)\widehat{p}_{sa,i}(t)\big{)}!}{(N_{sa}(t)+S-1)!}
=exp(Nsa(t)[KL(p^sa(t),psa)+H(p^sa(t))])i=1S(Nsa(t)p^sa,i(t))!Nsa(t)!(S1)!Nsa(t)!(Nsa(t)+S1)!\displaystyle=\exp\bigg{(}N_{sa}(t)\big{[}KL(\widehat{p}_{sa}(t),p_{sa})+H(\widehat{p}_{sa}(t))\big{]}\bigg{)}\frac{\prod_{i=1}^{S}\big{(}N_{sa}(t)\widehat{p}_{sa,i}(t)\big{)}!}{N_{sa}(t)!}\frac{(S-1)!N_{sa}(t)!}{(N_{sa}(t)+S-1)!}
=exp(Nsa(t)[KL(p^sa(t),psa)+H(p^sa(t))])1(Nsa(t)+S1S1)1(Nsa(t)Nsa(t)p^sa(t)),\displaystyle=\exp\bigg{(}N_{sa}(t)\big{[}KL(\widehat{p}_{sa}(t),p_{sa})+H(\widehat{p}_{sa}(t))\big{]}\bigg{)}\frac{1}{\binom{N_{sa}(t)+S-1}{S-1}}\frac{1}{\binom{N_{sa}(t)}{N_{sa}(t)\widehat{p}_{sa}(t)}},

where the second equality uses Lemma 16, and $\widehat{p}_{sa,i}(t)$ denotes the $i$-th component of $\widehat{p}_{sa}(t)$. Now, using Lemma 17 to upper bound both coefficients (below, $H(x)$ applied to a scalar $x\in(0,1)$ denotes the binary entropy $x\log(1/x)+(1-x)\log(1/(1-x))$), we obtain:

M_{t}(s,a) \geq\exp\bigg(N_{sa}(t)\big[KL(\widehat{p}_{sa}(t),p_{sa})+H(\widehat{p}_{sa}(t))\big]-N_{sa}(t)H(\widehat{p}_{sa}(t))
\quad-(N_{sa}(t)+S-1)\,H\big((S-1)/(N_{sa}(t)+S-1)\big)\bigg)
=\exp\bigg(N_{sa}(t)KL(\widehat{p}_{sa}(t),p_{sa})-(N_{sa}(t)+S-1)\,H\big((S-1)/(N_{sa}(t)+S-1)\big)\bigg).
The product martingale:

Taking the product over all state-action pairs we get:

Mt\displaystyle M_{t} (s,a)𝒵Mt(s,a)\displaystyle\triangleq\prod_{(s,a)\in\mathcal{Z}}M_{t}(s,a)
\geq\exp\bigg(\sum_{s,a}N_{sa}(t)KL(\widehat{p}_{sa}(t),p_{sa})-\sum_{s,a}(N_{sa}(t)+S-1)\,H\big((S-1)/(N_{sa}(t)+S-1)\big)\bigg). (39)

Next, using that log(1+x)x\log(1+x)\leq x we get:

(N_{sa}(t)+S-1)\,H\big((S-1)/(N_{sa}(t)+S-1)\big) =(S-1)\log(1+N_{sa}(t)/(S-1))
+Nsa(t)log(1+(S1)/Nsa(t))\displaystyle\quad+N_{sa}(t)\log(1+(S-1)/N_{sa}(t))
(S1)log(1+Nsa(t)/(S1))+(S1)\displaystyle\leq(S-1)\log(1+N_{sa}(t)/(S-1))+(S-1)
=(S1)log(e[1+Nsa(t)/(S1)]).\displaystyle=(S-1)\log\bigg{(}e\big{[}1+N_{sa}(t)/(S-1)\big{]}\bigg{)}.

Hence (39) becomes:

Mtexp(s,aNsa(t)KL(p^sa(t),psa)(S1)s,alog(e[1+Nsa(t)/(S1)])).\displaystyle M_{t}\geq\exp\bigg{(}\sum\limits_{s,a}N_{sa}(t)KL(\widehat{p}_{sa}(t),p_{sa})-(S-1)\underset{s,a}{\sum}\log\bigg{(}e\big{[}1+N_{sa}(t)/(S-1)\big{]}\bigg{)}\bigg{)}. (40)

Now, we show that MtM_{t} is a martingale. For any fixed pair (s,a)(s,a) we have:

𝔼[Mt|t1,(st,at)=(s,a)]\displaystyle\mathbb{E}[M_{t}|\mathcal{F}_{t-1},\ (s_{t},a_{t})=(s,a)] =𝔼[Mt(s,a)(s,a)(s,a)Mt(s,a)|t1,(st,at)=(s,a)]\displaystyle=\mathbb{E}[M_{t}(s,a)\prod_{(s^{\prime},a^{\prime})\neq(s,a)}M_{t}(s^{\prime},a^{\prime})|\mathcal{F}_{t-1},\ (s_{t},a_{t})=(s,a)]
=𝔼[Mt(s,a)(s,a)(s,a)Mt1(s,a)|t1]\displaystyle=\mathbb{E}[M_{t}(s,a)\prod_{(s^{\prime},a^{\prime})\neq(s,a)}M_{t-1}(s^{\prime},a^{\prime})|\mathcal{F}_{t-1}]
=𝔼[Mt(s,a)|t1]×(s,a)(s,a)Mt1(s,a)\displaystyle=\mathbb{E}[M_{t}(s,a)|\mathcal{F}_{t-1}]\times\prod_{(s^{\prime},a^{\prime})\neq(s,a)}M_{t-1}(s^{\prime},a^{\prime})
=Mt1,\displaystyle=M_{t-1},

where the third equality holds because the factors $\big(M_{t-1}(s^{\prime},a^{\prime})\big)_{(s^{\prime},a^{\prime})\neq(s,a)}$ are $\mathcal{F}_{t-1}$-measurable. Finally, using the tower rule we get:

𝔼[Mt|t1]=𝔼[𝔼[Mt|t1,(st,at)]|t1]=𝔼[Mt1|t1]=Mt1.\displaystyle\mathbb{E}[M_{t}|\mathcal{F}_{t-1}]=\mathbb{E}\bigg{[}\mathbb{E}[M_{t}|\mathcal{F}_{t-1},\ (s_{t},a_{t})]\bigg{|}\mathcal{F}_{t-1}\bigg{]}=\mathbb{E}[M_{t-1}|\mathcal{F}_{t-1}]=M_{t-1}.

Hence MtM_{t} is a martingale. Thanks to Doob’s maximal inequality we have:

(t0,Mt>1/δ)δ𝔼[M0]=δ.\displaystyle\mathbb{P}\bigg{(}\exists t\geq 0,\ M_{t}>1/\delta\bigg{)}\leq\delta\mathbb{E}[M_{0}]=\delta.

In view of (40), we conclude that for βp(t,δ)=log(1/δ)+(S1)s,alog(e[1+Nsa(t)/(S1)])\beta_{p}(t,\delta)=\log(1/\delta)+(S-1)\underset{s,a}{\sum}\log\bigg{(}e\big{[}1+N_{sa}(t)/(S-1)\big{]}\bigg{)} we have:

(t1,s,aNsa(t)KL(p^sa(t),psa)>βp(t,δ))δ.\displaystyle\mathbb{P}\bigg{(}\exists t\geq 1,\ \sum_{s,a}N_{sa}(t)KL(\widehat{p}_{sa}(t),p_{sa})>\beta_{p}(t,\delta)\bigg{)}\leq\delta.
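As an illustration only, the deviation inequality above can be checked numerically. The following Python sketch is ours (the helper names kl_categorical and beta_p are our own, and the sampling rule is an arbitrary stand-in for the algorithm's exploration): it simulates categorical transition observations, tracks the statistic $\sum_{s,a}N_{sa}(t)KL(\widehat{p}_{sa}(t),p_{sa})$, and records how often it ever exceeds $\beta_{p}(t,\delta)$.

\begin{verbatim}
import numpy as np

def kl_categorical(q, p, eps=1e-12):
    """KL divergence between two categorical distributions q and p."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / np.maximum(p[mask], eps))))

def beta_p(counts, delta, S):
    """Threshold log(1/delta) + (S-1) * sum_{s,a} log(e * (1 + N_sa(t)/(S-1)))."""
    return np.log(1.0 / delta) + (S - 1) * np.sum(np.log(np.e * (1.0 + counts / (S - 1))))

rng = np.random.default_rng(0)
S, A, delta, horizon, runs = 4, 2, 0.1, 500, 100
p = rng.dirichlet(np.ones(S), size=(S, A))       # true transition kernels p_sa
violations = 0
for _ in range(runs):
    counts = np.zeros((S, A))                    # N_sa(t)
    obs = np.zeros((S, A, S))                    # transition counts towards each next state
    violated = False
    for t in range(1, horizon + 1):
        s, a = rng.integers(S), rng.integers(A)  # arbitrary (state, action) sampling rule
        obs[s, a, rng.choice(S, p=p[s, a])] += 1
        counts[s, a] += 1
        stat = sum(counts[x, y] * kl_categorical(obs[x, y] / counts[x, y], p[x, y])
                   for x in range(S) for y in range(A) if counts[x, y] > 0)
        violated = violated or stat > beta_p(counts, delta, S)
    violations += violated
print(f"fraction of runs where the statistic ever exceeded beta_p: {violations / runs:.3f} "
      f"(should be below delta = {delta})")
\end{verbatim}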

Lemma 16.

(Lemma 3 in [JKM+20]) For q,pq,p in Σm\Sigma_{m} the simplex of dimension (m1)(m-1) and λm1\lambda\in\mathbbm{R}^{m-1}, we have:

<λ,q>ϕp(λ)=KL(q,p)KL(q,pλ)<\lambda,q>-\phi_{p}(\lambda)=KL(q,p)-KL(q,p^{\lambda})

where pλ=ϕp(λ)p^{\lambda}=\nabla\phi_{p}(\lambda).

Lemma 17.

(Theorem 11.1.3, [CT06]) Let NN\in\mathbbm{N}^{*}, and x{0,,N}kx\in\{0,\ldots,N\}^{k} such that i=1kxi=N\sum_{i=1}^{k}x_{i}=N then:

(Nx)=N!i=1kxi!eNH(x/N)\binom{N}{x}=\frac{N!}{\prod_{i=1}^{k}x_{i}!}\leq e^{NH(x/N)}

where H(x/N)H(x/N) is the Shannon entropy of the discrete distribution over {1,,k}\{1,\ldots,k\} with vector of probabilities (xiN)1ik(\frac{x_{i}}{N})_{1\leq i\leq k}.
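A quick numerical check of Lemma 17 (a sketch of ours, exhaustive over a small choice of $N$ and $k$) confirms that $N\,H(x/N)-\log\binom{N}{x}$ is nonnegative:

\begin{verbatim}
import math
from itertools import product

def log_multinomial(x):
    """log of N! / prod_i x_i!, computed with log-gamma to avoid overflow."""
    return math.lgamma(sum(x) + 1) - sum(math.lgamma(xi + 1) for xi in x)

def entropy(p):
    """Shannon entropy of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

N, k = 12, 3
gaps = [N * entropy([xi / N for xi in x]) - log_multinomial(x)
        for x in product(range(N + 1), repeat=k) if sum(x) == N]
print(f"min of N*H(x/N) - log C(N, x) over all x: {min(gaps):.4f} (nonnegative, as claimed)")
\end{verbatim}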

E.3 Correctness of the stopping rule

Proof.

Using Lemma 5 and equations (11-12) in the second and third lines respectively, we get:

\mathbb{P}(\widehat{\pi}_{\tau}^{*}\neq\pi^{*},\ \tau_{\delta}<\infty)\leq\mathbb{P}\bigg(\exists t\geq 1,\ t\;U\big(\widehat{\mathcal{M}}_{t},{\bm{N}}(t)/t\big)^{-1}\geq\beta_{r}(t,\delta/2)+\beta_{p}(t,\delta/2),\ \widehat{\pi}_{t}^{*}\neq\pi^{*}\bigg)
(t1,tT(^t,𝑵(t)/t)1βr(t,δ/2)+βp(t,δ/2),Alt(^t))\displaystyle\leq\mathbb{P}\bigg{(}\exists t\geq 1,\ t\;T\big{(}\widehat{\mathcal{M}}_{t},{\bm{N}}(t)/t\big{)}^{-1}\geq\beta_{r}(t,\delta/2)+\beta_{p}(t,\delta/2),\mathcal{M}\in\textrm{Alt}(\widehat{\mathcal{M}}_{t})\bigg{)}
=(t1,infAlt(^t)(s,a)𝒵Nsa(t)[KL(q^s,a(t),q(s,a))+KL(p^sa(t),p(s,a))]\displaystyle=\mathbb{P}\bigg{(}\exists t\geq 1,\ \underset{\mathcal{M}^{\prime}\in\operatorname{Alt}(\widehat{\mathcal{M}}_{t})}{\inf}\underset{(s,a)\in\mathcal{Z}}{\sum}\ N_{sa}(t)\big{[}\operatorname{KL}(\widehat{q}_{s,a}(t),\,q_{\mathcal{M}^{\prime}}(s,a))+\operatorname{KL}(\widehat{p}_{sa}(t),\,p_{\mathcal{M}^{\prime}}(s,a))\big{]}
βr(t,δ/2)+βp(t,δ/2),Alt(^t))\displaystyle\quad\quad\quad\quad\quad\quad\geq\beta_{r}(t,\delta/2)+\beta_{p}(t,\delta/2),\mathcal{M}\in\textrm{Alt}(\widehat{\mathcal{M}}_{t})\bigg{)}
(t1,(s,a)𝒵Nsa(t)[KL(q^s,a(t),q(s,a))+KL(p^sa(t),p(s,a))]\displaystyle\leq\mathbb{P}\bigg{(}\exists t\geq 1,\ \underset{(s,a)\in\mathcal{Z}}{\sum}\ N_{sa}(t)\big{[}\operatorname{KL}(\widehat{q}_{s,a}(t),\,q_{\mathcal{M}}(s,a))+\operatorname{KL}(\widehat{p}_{sa}(t),\,p_{\mathcal{M}}(s,a))\big{]}
βr(t,δ/2)+βp(t,δ/2))\displaystyle\quad\quad\quad\quad\quad\quad\geq\beta_{r}(t,\delta/2)+\beta_{p}(t,\delta/2)\bigg{)}
\leq\mathbb{P}\bigg(\exists t\geq 1,\ \underset{(s,a)\in\mathcal{Z}}{\sum}\ N_{sa}(t)\operatorname{KL}(\widehat{q}_{s,a}(t),\,q_{\mathcal{M}}(s,a))\geq\beta_{r}(t,\delta/2)\bigg)
+(t1,(s,a)𝒵Nsa(t)KL(p^sa(t),p(s,a))βp(t,δ/2))\displaystyle\quad\quad\quad+\mathbb{P}\bigg{(}\exists t\geq 1,\ \underset{(s,a)\in\mathcal{Z}}{\sum}\ N_{sa}(t)\operatorname{KL}(\widehat{p}_{sa}(t),\,p_{\mathcal{M}}(s,a))\geq\beta_{p}(t,\delta/2)\bigg{)}
δ/2+δ/2=δ\displaystyle\leq\delta/2+\delta/2=\delta

where the last inequality is due to Lemmas 14 and 15. ∎
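For concreteness, here is a hedged sketch (ours) of how the stopping test above can be coded: $\delta$ is split evenly between the reward and transition deviation inequalities, while the plug-in quantity $U\big(\widehat{\mathcal{M}}_{t},{\bm{N}}(t)/t\big)^{-1}$ and the reward threshold $\beta_{r}$ are passed in as black boxes, since their exact forms are defined elsewhere in the paper and are treated as assumptions here.

\begin{verbatim}
import numpy as np

def beta_p(counts, delta, S):
    """Transition threshold: log(1/delta) + (S-1) sum_sa log(e (1 + N_sa/(S-1)))."""
    return np.log(1.0 / delta) + (S - 1) * np.sum(np.log(np.e * (1.0 + counts / (S - 1))))

def should_stop(t, u_inverse_hat, beta_r_value, counts, delta, S):
    """Stopping test: t * U(M_hat_t, N(t)/t)^{-1} >= beta_r(t, delta/2) + beta_p(t, delta/2).

    Both `u_inverse_hat` (the plug-in inverse characteristic time) and `beta_r_value`
    are supplied by the caller: placeholders, not implementations of the paper's functionals.
    """
    return t * u_inverse_hat >= beta_r_value + beta_p(counts, delta / 2.0, S)
\end{verbatim}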

Appendix F Sample complexity upper bound

F.1 Almost sure upper bound: proof of Theorem 6 (i)

Proof.

Consider the event =((s,a)𝒵,limtNsa(t)t=ωs,a,^t)\mathcal{E}=\bigg{(}\forall(s,a)\in\mathcal{Z},\ \lim_{t\to\infty}\frac{N_{sa}(t)}{t}=\omega^{\star}_{s,a},\ \widehat{\mathcal{M}}_{t}\to\mathcal{M}\bigg{)}. By Lemma 4 and Theorem 7, we have ()=1\mathbb{P}(\mathcal{E})=1. We will prove that under \mathcal{E}, lim supδ0τδlog(1/δ)2Uo()\limsup\limits_{\delta\to 0}\frac{\tau_{\delta}}{\log(1/\delta)}\leq 2U_{o}(\mathcal{M}).
Fix $\eta>0$. There exists $t_{\eta}$ such that for all $t\geq t_{\eta}$:

U(^t,𝑵(t)/t)1(1η)U(,ω)1\displaystyle U\big{(}\widehat{\mathcal{M}}_{t},{\bm{N}}(t)/t\big{)}^{-1}\geq(1-\eta)U\big{(}\mathcal{M},\omega^{\star}\big{)}^{-1} (41)
βp(t,δ/2)log(1/δ)+ηU(,ω)1t\displaystyle\beta_{p}(t,\delta/2)\leq\log(1/\delta)+\eta U\big{(}\mathcal{M},\omega^{\star}\big{)}^{-1}t (42)
βr(t,δ/2)SAφ(log(1/δ)/SA)+ηU(,ω)1t\displaystyle\beta_{r}(t,\delta/2)\leq SA\;\varphi\big{(}\log(1/\delta)/SA\big{)}+\eta U\big{(}\mathcal{M},\omega^{\star}\big{)}^{-1}t (43)

where the last two inequalities come from the fact that both thresholds satisfy $\beta(t,\delta/2)=\mathcal{O}\big(\log(t)\big)=o(t)$. Combining the inequalities above with the definition of $\tau_{\delta}$, we get:

τδ\displaystyle\tau_{\delta} inf{ttη,(13η)tU(,ω)1log(1/δ)+SAφ(log(1/δ)/SA)}\displaystyle\leq\inf\bigg{\{}t\geq t_{\eta},(1-3\eta)tU\big{(}\mathcal{M},\omega^{\star}\big{)}^{-1}\geq\log(1/\delta)+SA\;\varphi\big{(}\log(1/\delta)/SA\big{)}\bigg{\}}
=\max\bigg(t_{\eta},\ \frac{\big[\log(1/\delta)+SA\;\varphi\big(\log(1/\delta)/SA\big)\big]U\big(\mathcal{M},\omega^{\star}\big)}{1-3\eta}\bigg).

Since $\varphi(x)\underset{\infty}{\sim}x$, the last inequality implies that $\limsup\limits_{\delta\to 0}\frac{\tau_{\delta}}{\log(1/\delta)}\leq\frac{2U(\mathcal{M},\omega^{\star})}{1-3\eta}$. Letting $\eta$ go to zero finishes the proof. ∎
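The only property of $\varphi$ used in this limit is $\varphi(x)\underset{\infty}{\sim}x$. As a quick sanity check of the resulting factor $2$, one can evaluate the ratio $\big[\log(1/\delta)+SA\,\varphi(\log(1/\delta)/SA)\big]/\log(1/\delta)$ numerically; the sketch below is ours and uses a surrogate threshold function $\varphi(x)=x+\log(1+x)$, an assumption standing in for the paper's exact $\varphi$.

\begin{verbatim}
import numpy as np

SA = 12
phi = lambda x: x + np.log(1 + x)    # surrogate with phi(x) ~ x; NOT the paper's exact phi
for delta in (1e-2, 1e-6, 1e-12, 1e-24):
    L = np.log(1.0 / delta)
    print(f"delta = {delta:.0e}: ratio = {(L + SA * phi(L / SA)) / L:.4f}   (tends to 2)")
\end{verbatim}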

F.2 Upper bound in expectation: proof of Theorem 6 (ii)

Proof.

We start by defining the semi-distance between MDPs:

=maxs,amax(|r(s,a)r(s,a)|,p(.|s,a)p(.|s,a)1).\displaystyle\left\lVert\mathcal{M}-\mathcal{M}^{\prime}\right\rVert=\max\limits_{s,a}\max\big{(}|r_{\mathcal{M}}(s,a)-r_{\mathcal{M}^{\prime}}(s,a)|,\left\lVert p_{\mathcal{M}}(.|s,a)-p_{\mathcal{M}^{\prime}}(.|s,a)\right\rVert_{1}\big{)}.

Now for $\xi>0$, by continuity of $\mathcal{M}\mapsto\pi^{o}(\mathcal{M})$ (a consequence of Berge's Theorem, which gives the continuity of $\mathcal{M}\mapsto{\bm{\omega}}^{\star}(\mathcal{M})$), there exists $\rho(\xi)\leq\xi$ such that:

(,ρ(ξ)),πo()πo()ξ\displaystyle\forall\mathcal{M}^{\prime}\in\mathcal{B}\big{(}\mathcal{M},\rho(\xi)\big{)},\ \left\lVert\pi^{o}(\mathcal{M}^{\prime})-\pi^{o}(\mathcal{M})\right\rVert_{\infty}\leq\xi

where $\mathcal{B}(\mathcal{M},\rho)=\{\mathcal{M}^{\prime}:\ \left\lVert\mathcal{M}^{\prime}-\mathcal{M}\right\rVert\leq\rho\}$. For $T\geq 1$, consider the following concentration events (for simplicity, and w.l.o.g., we assume that $T^{1/4}$ and $T^{3/4}$ are integers):

𝒞T1(ξ)\displaystyle\mathcal{C}_{T}^{1}(\xi) t=T1/4T(^t(,ρ(ξ))).\displaystyle\triangleq\bigcap\limits_{t=T^{1/4}}^{T}\bigg{(}\widehat{\mathcal{M}}_{t}\in\mathcal{B}\big{(}\mathcal{M},\rho(\xi)\big{)}\bigg{)}.
𝒞T2(ξ)\displaystyle\mathcal{C}_{T}^{2}(\xi) t=T3/4T(|𝑵(t)/tω|Kξξ)\displaystyle\triangleq\bigcap\limits_{t=T^{3/4}}^{T}\bigg{(}\big{|}{\bm{N}}(t)/t-\omega^{\star}\big{|}\leq K_{\xi}\xi\bigg{)}

where KξK_{\xi} is a mapping defined in Proposition 19. We will upper bound the stopping time of MDP-NaS under 𝒞T1(ξ)𝒞T2(ξ)\mathcal{C}_{T}^{1}(\xi)\cap\mathcal{C}_{T}^{2}(\xi). Define:

U(\mathcal{M},\omega^{\star},\xi)\triangleq\sup_{\substack{\mathcal{M}^{\prime}\in\mathcal{B}(\mathcal{M},\rho(\xi)),\\ \left\lVert\omega^{\prime}-\omega^{\star}\right\rVert_{\infty}\leq K_{\xi}\xi}}\ U\big(\mathcal{M}^{\prime},\omega^{\prime}\big).

By Proposition 19, there exists $T_{1}(\xi)$ such that for all $T\geq T_{1}(\xi)$, conditionally on $\mathcal{C}_{T}^{1}(\xi)$, the event $\mathcal{C}_{T}^{2}(\xi)$ occurs with high probability. Moreover, by definition of $U(\mathcal{M},\omega^{\star},\xi)$, under $\mathcal{C}_{T}^{1}(\xi)\cap\mathcal{C}_{T}^{2}(\xi)$ we have:

t[|T3/4,T|],U(^t,𝑵(t)/t)U(,ω,ξ).\displaystyle\forall t\in[|T^{3/4},T|],\ U\big{(}\widehat{\mathcal{M}}_{t},{\bm{N}}(t)/t\big{)}\leq U(\mathcal{M},\omega^{\star},\xi). (44)

Furthermore, using the bound Nsa(t)tN_{sa}(t)\leq t in the definitions of the thresholds βp\beta_{p} and βr\beta_{r} and the fact that log(t)=to(t)\log(t)\underset{t\to\infty}{=}o(t), we prove the existence of T2(ξ)T_{2}(\xi) such that for all tT2(ξ)t\geq T_{2}(\xi):

βp(t,δ/2)\displaystyle\beta_{p}(t,\delta/2) log(1/δ)+ξU(,ω,ξ)1t\displaystyle\leq\log(1/\delta)+\xi U(\mathcal{M},\omega^{\star},\xi)^{-1}t (45)
βr(t,δ/2)\displaystyle\beta_{r}(t,\delta/2) SAφ(log(1/δ)/SA)+ξU(,ω,ξ)1t.\displaystyle\leq SA\;\varphi\big{(}\log(1/\delta)/SA\big{)}+\xi U(\mathcal{M},\omega^{\star},\xi)^{-1}t. (46)

Finally define:

T3(ξ,δ)U(,ω,ξ)[log(1/δ)+SAφ(log(1/δ)SA)](12ξ).\displaystyle T_{3}(\xi,\delta)\triangleq\frac{U\big{(}\mathcal{M},\omega^{\star},\xi\big{)}\bigg{[}\log(1/\delta)+SA\varphi\big{(}\frac{\log(1/\delta)}{SA}\big{)}\bigg{]}}{(1-2\xi)}.

Using (44)-(46), for all $T\geq\max\big(T_{1}(\xi),T_{2}(\xi),T_{3}(\xi,\delta)\big)$, under $\mathcal{C}_{T}^{1}(\xi)\cap\mathcal{C}_{T}^{2}(\xi)$ the following holds:

T×U(^T,𝑵(T)/T)1βp(T,δ/2)+βr(T,δ/2).\displaystyle T\times U\big{(}\widehat{\mathcal{M}}_{T},{\bm{N}}(T)/T\big{)}^{-1}\geq\beta_{p}(T,\delta/2)+\beta_{r}(T,\delta/2).

In other words:

Tmax(T1(ξ),T2(ξ),T3(ξ,δ)),𝒞T1(ξ)𝒞T2(ξ)(τδT).\displaystyle\forall T\geq\max\big{(}T_{1}(\xi),T_{2}(\xi),T_{3}(\xi,\delta)\big{)},\quad\mathcal{C}_{T}^{1}(\xi)\cap\mathcal{C}_{T}^{2}(\xi)\subset\big{(}\tau_{\delta}\leq T\big{)}. (47)

Therefore (here $\overline{\mathcal{E}}$ denotes the complement of the event $\mathcal{E}$):

𝔼[τδ]\displaystyle\mathbb{E}[\tau_{\delta}] =T=1(τδ>T)\displaystyle=\sum\limits_{T=1}^{\infty}\mathbb{P}(\tau_{\delta}>T)
max(T1(ξ),T2(ξ),T3(ξ,δ))+T=max(T1,T2,T3)(τδ>T)\displaystyle\leq\max\big{(}T_{1}(\xi),T_{2}(\xi),T_{3}(\xi,\delta)\big{)}+\sum\limits_{T=\max(T_{1},T_{2},T_{3})}^{\infty}\mathbb{P}(\tau_{\delta}>T)
max(T1(ξ),T2(ξ),T3(ξ,δ))+T=max(T1,T2,T3)(𝒞T1(ξ)¯𝒞T2(ξ)¯)\displaystyle\leq\max\big{(}T_{1}(\xi),T_{2}(\xi),T_{3}(\xi,\delta)\big{)}+\sum\limits_{T=\max(T_{1},T_{2},T_{3})}^{\infty}\mathbb{P}\bigg{(}\overline{\mathcal{C}_{T}^{1}(\xi)}\cup\overline{\mathcal{C}_{T}^{2}(\xi)}\bigg{)}
max(T1(ξ),T2(ξ),T3(ξ,δ))+T=1[(𝒞T1(ξ)¯)+(𝒞T2(ξ)¯|𝒞T1(ξ))]\displaystyle\leq\max\big{(}T_{1}(\xi),T_{2}(\xi),T_{3}(\xi,\delta)\big{)}+\sum\limits_{T=1}^{\infty}\bigg{[}\mathbb{P}\bigg{(}\overline{\mathcal{C}_{T}^{1}(\xi)}\bigg{)}+\mathbb{P}\bigg{(}\overline{\mathcal{C}_{T}^{2}(\xi)}\bigg{|}\mathcal{C}_{T}^{1}(\xi)\bigg{)}\bigg{]}
max(T1(ξ),T2(ξ),T3(ξ,δ))+T=11T2+BTexp(CT1/16log(1+SAT2))\displaystyle\leq\max\big{(}T_{1}(\xi),T_{2}(\xi),T_{3}(\xi,\delta)\big{)}+\sum\limits_{T=1}^{\infty}\frac{1}{T^{2}}+BT\exp\bigg{(}-\frac{CT^{1/16}}{\sqrt{\log(1+SAT^{2})}}\bigg{)}
+2SAexp(T3/4ξ2)1exp(ξ2),\displaystyle\qquad\qquad\qquad\qquad+\frac{2SA\exp(-T^{3/4}\xi^{2})}{1-\exp(-\xi^{2})},

where we used Lemma 18 and Proposition 19 in the last inequality. This implies that 𝔼[τδ]\mathbb{E}[\tau_{\delta}] is finite and:

lim supδ0𝔼[τδ]log(1/δ)\displaystyle\limsup\limits_{\delta\to 0}\frac{\mathbb{E}[\tau_{\delta}]}{\log(1/\delta)} lim supδ0T3(ξ,δ)log(1/δ)\displaystyle\leq\limsup\limits_{\delta\to 0}\frac{T_{3}(\xi,\delta)}{\log(1/\delta)}
=2U(,ω,ξ)(12ξ)\displaystyle=\frac{2U\big{(}\mathcal{M},\omega^{\star},\xi\big{)}}{(1-2\xi)}

where we used $\varphi(x)\underset{\infty}{\sim}x+\log(x)$. Finally, we take the limit $\xi\to 0$: since $\rho(\xi)\leq\xi$ and $\limsup\limits_{\xi\to 0}K_{\xi}<\infty$, we get $\limsup\limits_{\xi\to 0}U\big(\mathcal{M},\omega^{\star},\xi\big)=U\big(\mathcal{M},\omega^{\star}\big)=U_{o}(\mathcal{M})$, which finishes the proof. ∎

F.3 Concentration of the empirical MDPs

Lemma 18.

Define the event $\mathcal{C}_{T}^{1}(\xi)\triangleq\bigcap\limits_{t=T^{1/4}}^{T}\Big(\widehat{\mathcal{M}}_{t}\in\mathcal{B}\big(\mathcal{M},\rho(\xi)\big)\Big)$. Then there exist two positive constants $B$ and $C$, depending only on $\xi$ and $\mathcal{M}$, such that:

T1,(𝒞T1(ξ)¯)1T2+BTexp(CT1/16log(1+SAT2)).\displaystyle\forall T\geq 1,\ \mathbb{P}\left(\overline{\mathcal{C}_{T}^{1}(\xi)}\right)\leq\frac{1}{T^{2}}+BT\exp\bigg{(}-\frac{CT^{1/16}}{\sqrt{\log(1+SAT^{2})}}\bigg{)}.
Proof.

For simplicity we will denote ρ(ξ)\rho(\xi) by ρ\rho. Consider the forced exploration event:

T=((s,a)𝒵,t1,Nsa(t)[tλ(T)]1/41)\displaystyle\mathcal{E}_{T}=\bigg{(}\forall(s,a)\in\mathcal{Z},\ \forall t\geq 1,\ N_{sa}(t)\geq\bigg{[}\frac{t}{\lambda(T)}\bigg{]}^{1/4}-1\bigg{)}

where λ(T)(m+1)2η2log2(1+SAT2)\lambda(T)\triangleq\frac{(m+1)^{2}}{\eta^{2}}\log^{2}(1+SAT^{2}) and η\eta is a parameter that only depends on \mathcal{M}. Applying Corollary 1 for α=1T2\alpha=\frac{1}{T^{2}}, we get (T)11/T2\mathbb{P}(\mathcal{E}_{T})\geq 1-1/T^{2}. Therefore we have:

(𝒞T1(ξ)¯)\displaystyle\mathbb{P}\left(\overline{\mathcal{C}_{T}^{1}(\xi)}\right) (T¯)+(𝒞T1(ξ)¯T)\displaystyle\leq\mathbb{P}(\overline{\mathcal{E}_{T}})+\mathbb{P}(\overline{\mathcal{C}_{T}^{1}(\xi)}\cap\mathcal{E}_{T})
1T2+(𝒞T1(ξ)¯T).\displaystyle\leq\frac{1}{T^{2}}+\mathbb{P}(\overline{\mathcal{C}_{T}^{1}(\xi)}\cap\mathcal{E}_{T}). (48)

On the other hand:

(𝒞T1(ξ)¯T)\displaystyle\mathbb{P}(\overline{\mathcal{C}_{T}^{1}(\xi)}\cap\mathcal{E}_{T}) t=T1/4T(^t(,ρ(ξ))T)\displaystyle\leq\sum_{t=T^{1/4}}^{T}\mathbb{P}\left(\widehat{\mathcal{M}}_{t}\notin\mathcal{B}\big{(}\mathcal{M},\rho(\xi)\big{)}\cap\mathcal{E}_{T}\right)
t=T1/4Ts,a[((r^t(s,a)r(s,a)>ρ)T)\displaystyle\leq\sum_{t=T^{1/4}}^{T}\underset{s,a}{\sum}\ \Bigg{[}\mathbb{P}\bigg{(}\big{(}\widehat{r}_{t}(s,a)-r(s,a)>\rho\big{)}\cap\mathcal{E}_{T}\bigg{)}
+((r^t(s,a)r(s,a)<ρ)T)\displaystyle\qquad\qquad+\mathbb{P}\bigg{(}\big{(}\widehat{r}_{t}(s,a)-r(s,a)<-\rho\big{)}\cap\mathcal{E}_{T}\bigg{)}
+s((p^t(s|s,a)p(s|s,a)>ρ/S)T)\displaystyle\qquad\qquad+\underset{s^{\prime}}{\sum}\ \mathbb{P}\bigg{(}\big{(}\widehat{p}_{t}(s^{\prime}|s,a)-p(s^{\prime}|s,a)>\rho/S\big{)}\cap\mathcal{E}_{T}\bigg{)}
+((p^t(s|s,a)p(s|s,a)<ρ/S)T)].\displaystyle\qquad\qquad+\mathbb{P}\bigg{(}\big{(}\widehat{p}_{t}(s^{\prime}|s,a)-p(s^{\prime}|s,a)<-\rho/S\big{)}\cap\mathcal{E}_{T}\bigg{)}\Bigg{]}. (49)

Using a union bound and the Chernoff-Hoeffding theorem, respectively, we get for $t\geq T^{1/4}$:

((p^t(s|s,a)\displaystyle\mathbb{P}\bigg{(}\big{(}\widehat{p}_{t}(s^{\prime}|s,a)- p(s|s,a)>ρ/S)T)\displaystyle p(s^{\prime}|s,a)>\rho/S\big{)}\cap\mathcal{E}_{T}\bigg{)}
(p^t(s|s,a)p(s|s,a)>ρ/S,Nsa(t)[tλ(T)]1/41)\displaystyle\leq\mathbb{P}\bigg{(}\widehat{p}_{t}(s^{\prime}|s,a)-p(s^{\prime}|s,a)>\rho/S,\ N_{sa}(t)\geq\bigg{[}\frac{t}{\lambda(T)}\bigg{]}^{1/4}-1\bigg{)}
=t=[tλ(T)]1/41t(p^t(s|s,a)p(s|s,a)>ρ/S,Nsa(t)=t)\displaystyle=\sum_{t^{\prime}=\big{[}\frac{t}{\lambda(T)}\big{]}^{1/4}-1}^{t}\mathbb{P}\bigg{(}\widehat{p}_{t}(s^{\prime}|s,a)-p(s^{\prime}|s,a)>\rho/S,\ N_{sa}(t)=t^{\prime}\bigg{)}
t=[tλ(T)]1/41texp(tkl(p(s|s,a)+ρ/S,p(s|s,a)))\displaystyle\leq\sum_{t^{\prime}=\big{[}\frac{t}{\lambda(T)}\big{]}^{1/4}-1}^{t}\exp\bigg{(}-t^{\prime}\cdot\operatorname{kl}\big{(}p(s^{\prime}|s,a)+\rho/S,\ p(s^{\prime}|s,a)\big{)}\bigg{)}
exp(([tλ(T)]1/41)kl(p(s|s,a)+ρ/S,p(s|s,a)))1exp(kl(p(s|s,a)+ρ/S,p(s|s,a)))\displaystyle\leq\frac{\exp\bigg{(}-\big{(}\big{[}\frac{t}{\lambda(T)}\big{]}^{1/4}-1\big{)}\operatorname{kl}\big{(}p(s^{\prime}|s,a)+\rho/S,\ p(s^{\prime}|s,a)\big{)}\bigg{)}}{1-\exp\bigg{(}-\operatorname{kl}\big{(}p(s^{\prime}|s,a)+\rho/S,\ p(s^{\prime}|s,a)\big{)}\bigg{)}}
exp([T1/16λ(T)1/41]kl(p(s|s,a)+ρ/S,p(s|s,a)))1exp(kl(p(s|s,a)+ρ/S,p(s|s,a))).\displaystyle\leq\frac{\exp\bigg{(}-[\frac{T^{1/16}}{\lambda(T)^{1/4}}-1]\operatorname{kl}\big{(}p(s^{\prime}|s,a)+\rho/S,\ p(s^{\prime}|s,a)\big{)}\bigg{)}}{1-\exp\bigg{(}-\operatorname{kl}\big{(}p(s^{\prime}|s,a)+\rho/S,\ p(s^{\prime}|s,a)\big{)}\bigg{)}}. (50)

In a similar fashion we prove that:

((p^t(s|s,a)\displaystyle\mathbb{P}\bigg{(}\big{(}\widehat{p}_{t}(s^{\prime}|s,a) p(s|s,a)<ρ/S)T)\displaystyle-p(s^{\prime}|s,a)<-\rho/S\big{)}\cap\mathcal{E}_{T}\bigg{)}
exp([T1/16λ(T)1/41]kl(p(s|s,a)ρ/S,p(s|s,a)))1exp(kl(p(s|s,a)ρ/S,p(s|s,a))),\displaystyle\qquad\qquad\leq\frac{\exp\bigg{(}-[\frac{T^{1/16}}{\lambda(T)^{1/4}}-1]\operatorname{kl}\big{(}p(s^{\prime}|s,a)-\rho/S,\ p(s^{\prime}|s,a)\big{)}\bigg{)}}{1-\exp\bigg{(}-\operatorname{kl}\big{(}p(s^{\prime}|s,a)-\rho/S,\ p(s^{\prime}|s,a)\big{)}\bigg{)}}, (51)
((r^t(s,a)r(s,a)>ρ)T)\displaystyle\mathbb{P}\bigg{(}\big{(}\widehat{r}_{t}(s,a)-r(s,a)>\rho\big{)}\cap\mathcal{E}_{T}\bigg{)} exp([T1/16λ(T)1/41]kl(r(s,a)+ρ,r(s,a)))1exp(kl(r(s,a)+ρ,r(s,a))),\displaystyle\leq\frac{\exp\bigg{(}-[\frac{T^{1/16}}{\lambda(T)^{1/4}}-1]\operatorname{kl}\big{(}r(s,a)+\rho,\ r(s,a)\big{)}\bigg{)}}{1-\exp\bigg{(}-\operatorname{kl}\big{(}r(s,a)+\rho,\ r(s,a)\big{)}\bigg{)}}, (52)
((r^t(s,a)r(s,a)<ρ)T)\displaystyle\mathbb{P}\bigg{(}\big{(}\widehat{r}_{t}(s,a)-r(s,a)<-\rho\big{)}\cap\mathcal{E}_{T}\bigg{)} exp([T1/16λ(T)1/41]kl(r(s,a)ρ,r(s,a)))1exp(kl(r(s,a)ρ,r(s,a))).\displaystyle\leq\frac{\exp\bigg{(}-[\frac{T^{1/16}}{\lambda(T)^{1/4}}-1]\operatorname{kl}\big{(}r(s,a)-\rho,\ r(s,a)\big{)}\bigg{)}}{1-\exp\bigg{(}-\operatorname{kl}\big{(}r(s,a)-\rho,\ r(s,a)\big{)}\bigg{)}}. (53)

Thus, for the following choice of constants

C=ηm+1mins,a,s(\displaystyle C=\sqrt{\frac{\eta}{m+1}}\min\limits_{s,a,s^{\prime}}\Bigg{(} kl(r(s,a)ρ,r(s,a)),kl(r(s,a)+ρ,r(s,a)),\displaystyle\operatorname{kl}\big{(}r(s,a)-\rho,\ r(s,a)\big{)},\ \operatorname{kl}\big{(}r(s,a)+\rho,\ r(s,a)\big{)},
kl(p(s|s,a)ρ/S,p(s|s,a)),kl(p(s|s,a)+ρ/S,p(s|s,a))),\displaystyle\operatorname{kl}\big{(}p(s^{\prime}|s,a)-\rho/S,\ p(s^{\prime}|s,a)\big{)},\ \operatorname{kl}\big{(}p(s^{\prime}|s,a)+\rho/S,\ p(s^{\prime}|s,a)\big{)}\Bigg{)},
B=\displaystyle B= s,a(exp(kl(r(s,a)+ρ,r(s,a)))1exp(kl(r(s,a)+ρ,r(s,a)))+exp(kl(r(s,a)ρ,r(s,a)))1exp(kl(r(s,a)ρ,r(s,a)))\displaystyle\sum\limits_{s,a}\ \Bigg{(}\frac{\exp\bigg{(}\operatorname{kl}\big{(}r(s,a)+\rho,\ r(s,a)\big{)}\bigg{)}}{1-\exp\bigg{(}-\operatorname{kl}\big{(}r(s,a)+\rho,\ r(s,a)\big{)}\bigg{)}}+\frac{\exp\bigg{(}\operatorname{kl}\big{(}r(s,a)-\rho,\ r(s,a)\big{)}\bigg{)}}{1-\exp\bigg{(}-\operatorname{kl}\big{(}r(s,a)-\rho,\ r(s,a)\big{)}\bigg{)}}
+s[exp(kl(p(s|s,a)+ρ/S,p(s|s,a)))1exp(kl(p(s|s,a)+ρ/S,p(s|s,a)))\displaystyle+\sum\limits_{s^{\prime}}\Bigg{[}\frac{\exp\bigg{(}\operatorname{kl}\big{(}p(s^{\prime}|s,a)+\rho/S,\ p(s^{\prime}|s,a)\big{)}\bigg{)}}{1-\exp\bigg{(}-\operatorname{kl}\big{(}p(s^{\prime}|s,a)+\rho/S,\ p(s^{\prime}|s,a)\big{)}\bigg{)}}
+exp(kl(p(s|s,a)ρ/S,p(s|s,a)))1exp(kl(p(s|s,a)ρ/S,p(s|s,a)))]),\displaystyle+\frac{\exp\bigg{(}\operatorname{kl}\big{(}p(s^{\prime}|s,a)-\rho/S,\ p(s^{\prime}|s,a)\big{)}\bigg{)}}{1-\exp\bigg{(}-\operatorname{kl}\big{(}p(s^{\prime}|s,a)-\rho/S,\ p(s^{\prime}|s,a)\big{)}\bigg{)}}\Bigg{]}\Bigg{)},

and using (49-53) we get:

(𝒞T1(ξ)¯T)t=T1/4TBexp(CT1/16log(1+SAT2))BTexp(CT1/16log(1+SAT2)).\displaystyle\mathbb{P}(\overline{\mathcal{C}_{T}^{1}(\xi)}\cap\mathcal{E}_{T})\leq\sum\limits_{t=T^{1/4}}^{T}B\exp\bigg{(}-\frac{CT^{1/16}}{\sqrt{\log(1+SAT^{2})}}\bigg{)}\leq BT\exp\bigg{(}-\frac{CT^{1/16}}{\sqrt{\log(1+SAT^{2})}}\bigg{)}\;.

Combined with (48), the previous inequality implies that:

(𝒞T1(ξ)¯)1T2+BTexp(CT1/16log(1+SAT2)).\displaystyle\mathbb{P}\left(\overline{\mathcal{C}_{T}^{1}(\xi)}\right)\leq\frac{1}{T^{2}}+BT\exp\bigg{(}-\frac{CT^{1/16}}{\sqrt{\log(1+SAT^{2})}}\bigg{)}\;.
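The key estimate (50) chains a union bound over the possible values of $N_{sa}(t)$, a Chernoff bound in Bernoulli KL divergence, and a geometric sum. The sketch below is ours, with hypothetical values for $p(s^{\prime}|s,a)$, the deviation $\rho/S$ and the count floor; it compares the simulated tail probability with the Chernoff bound and evaluates the corresponding geometric sum.

\begin{verbatim}
import numpy as np

def kl_bernoulli(a, b, eps=1e-12):
    """Binary KL divergence kl(a, b)."""
    a, b = min(max(a, eps), 1 - eps), min(max(b, eps), 1 - eps)
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

rng = np.random.default_rng(1)
p, gap, n = 0.3, 0.1, 200              # hypothetical p(s'|s,a), deviation rho/S, sample count
empirical = np.mean(rng.binomial(n, p, size=200_000) / n - p > gap)
chernoff = np.exp(-n * kl_bernoulli(p + gap, p))
print(f"empirical tail {empirical:.2e}  <=  Chernoff bound exp(-n kl) = {chernoff:.2e}")

n0 = 50                                # stands in for the floor [t/lambda(T)]^{1/4} - 1
kl = kl_bernoulli(p + gap, p)
geometric_sum = np.exp(-n0 * kl) / (1 - np.exp(-kl))
print(f"geometric sum over counts >= n0: {geometric_sum:.2e}")
\end{verbatim}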

F.4 Concentration of state-action visitation frequency

The following proposition is a somewhat stronger version of Proposition 12. The fact that $\widehat{\mathcal{M}}_{t}$ is within distance at most $\xi$ of $\mathcal{M}$ enables a tighter control of the ergodicity constants $L_{t}$, which in turn allows us to derive a finite-sample bound on the deviations of the state-action visitation frequencies. Before stating the result, we recall some simple facts:

Fact 1:

For any two policies π,π\pi,\pi^{\prime} we have: PπPπππ\left\lVert P_{\pi^{\prime}}-P_{\pi}\right\rVert_{\infty}\leq\left\lVert\pi^{\prime}-\pi\right\rVert_{\infty}, where the norm is on policies viewed as vectors of SA\mathbb{R}^{SA}.

Fact 2:

Under 𝒞T1(ξ)\mathcal{C}_{T}^{1}(\xi), for kTk\geq\sqrt{T} we have:

πko¯πo()\displaystyle\left\lVert\overline{\pi_{k}^{o}}-\pi^{o}(\mathcal{M})\right\rVert_{\infty} j=1T1/4εjπuπo()k+j=T1/4+1kπo(^j)πo()k\displaystyle\leq\frac{\sum\limits_{j=1}^{T^{1/4}}\varepsilon_{j}\left\lVert\pi_{u}-\pi^{o}(\mathcal{M})\right\rVert_{\infty}}{k}+\frac{\sum\limits_{j=T^{1/4}+1}^{k}\left\lVert\pi^{o}(\widehat{\mathcal{M}}_{j})-\pi^{o}(\mathcal{M})\right\rVert_{\infty}}{k}
T1/4k+ξ.\displaystyle\leq\frac{T^{1/4}}{k}+\xi.
Fact 3:

Under 𝒞T1(ξ)\mathcal{C}_{T}^{1}(\xi), for kTk\geq\sqrt{T} we have:

ωkω1\displaystyle\left\lVert\omega_{k}-\omega^{\star}\right\rVert_{1} κPkPπo\displaystyle\leq\kappa_{\mathcal{M}}\left\lVert P_{k}-P_{\pi^{o}}\right\rVert_{\infty}
κπkπo()\displaystyle\leq\kappa_{\mathcal{M}}\left\lVert\pi_{k}-\pi^{o}(\mathcal{M})\right\rVert_{\infty}
κ[εk+πko¯πo()]\displaystyle\leq\kappa_{\mathcal{M}}\big{[}\varepsilon_{k}+\left\lVert\overline{\pi_{k}^{o}}-\pi^{o}(\mathcal{M})\right\rVert_{\infty}\big{]}
κ[T14(m+1)+T1/4k+ξ].\displaystyle\leq\kappa_{\mathcal{M}}\big{[}T^{\frac{-1}{4(m+1)}}+\frac{T^{1/4}}{k}+\xi\big{]}.

where we used Lemma 22, Fact 1, the definitions of πk\pi_{k} and εk\varepsilon_{k} and Fact 2 respectively.

Fact 4:

Under 𝒞T1(ξ)\mathcal{C}_{T}^{1}(\xi), for kTk\geq\sqrt{T} we have:

D(πk,πk1)\displaystyle D(\pi_{k},\pi_{k-1}) =PkPk1\displaystyle=\left\lVert P_{k}-P_{k-1}\right\rVert_{\infty}
PkPπo+PπoPk1\displaystyle\leq\left\lVert P_{k}-P_{\pi^{o}}\right\rVert_{\infty}+\left\lVert P_{\pi^{o}}-P_{k-1}\right\rVert_{\infty}
2[T14(m+1)+T1/4k1+ξ]\displaystyle\leq 2\big{[}T^{\frac{-1}{4(m+1)}}+\frac{T^{1/4}}{k-1}+\xi\big{]}
Proposition 19.

Under C-Navigation, for all $\xi>0$, there exists a time $T_{\xi}$ such that for all $T\geq T_{\xi}$, all $t\geq T^{3/4}$ and all functions $f:\mathcal{Z}\to\mathbbm{R}^{+}$, we have:

(|k=1tf(sk,ak)t𝔼(s,a)ω[f(s,a)]|Kξfξ|𝒞T1(ξ))2exp(tξ2),\displaystyle\mathbb{P}\bigg{(}\bigg{|}\frac{\sum_{k=1}^{t}f(s_{k},a_{k})}{t}-\mathbb{E}_{(s,a)\sim\omega^{\star}}[f(s,a)]\bigg{|}\geq K_{\xi}\left\lVert f\right\rVert_{\infty}\xi\bigg{|}\mathcal{C}_{T}^{1}(\xi)\bigg{)}\leq 2\exp\big{(}-t\xi^{2}\big{)},

where ξKξ\xi\mapsto K_{\xi} is a mapping with values in (1,)(1,\infty) such that lim supξ0Kξ<\limsup_{\xi\to 0}K_{\xi}<\infty. In particular, this implies that:

((s,a)𝒵,|Nsa(t)/tωsa|Kξξ|𝒞T1(ξ))2SAexp(tξ2).\displaystyle\mathbb{P}\bigg{(}\exists(s,a)\in\mathcal{Z},\bigg{|}N_{sa}(t)/t-\omega_{sa}^{\star}\bigg{|}\geq K_{\xi}\xi\bigg{|}\mathcal{C}_{T}^{1}(\xi)\bigg{)}\leq 2SA\exp\big{(}-t\xi^{2}\big{)}.
Corollary 2.

We have (𝒞T2(ξ)¯|𝒞T1(ξ))2SAexp(T3/4ξ2)1exp(ξ2)\mathbb{P}\bigg{(}\overline{\mathcal{C}_{T}^{2}(\xi)}\bigg{|}\mathcal{C}_{T}^{1}(\xi)\bigg{)}\leq\displaystyle{\frac{2SA\exp(-T^{3/4}\xi^{2})}{1-\exp(-\xi^{2})}}.

Proof.

Consider the difference

D\displaystyle D =k=1tf(zk)tω(f)\displaystyle=\frac{\sum\limits_{k=1}^{t}f(z_{k})}{t}-\omega^{\star}(f)
=k=1T[f(zk)ω(f)]t+k=T+1t[f(zk)ω(f)]t\displaystyle=\frac{\sum\limits_{k=1}^{\sqrt{T}}[f(z_{k})-\omega^{\star}(f)]}{t}+\frac{\sum\limits_{k=\sqrt{T}+1}^{t}[f(z_{k})-\omega^{\star}(f)]}{t}
=k=1T[f(zk)ω(f)]tD1,t+k=T+1t[f(zk)ωk1(f)]tD2,t+k=T+1t[ωk1(f)ω(f)]tD3,t.\displaystyle=\underbrace{\frac{\sum\limits_{k=1}^{\sqrt{T}}[f(z_{k})-\omega^{\star}(f)]}{t}}\limits_{D_{1,t}}+\underbrace{\frac{\sum\limits_{k=\sqrt{T}+1}^{t}\big{[}f(z_{k})-\omega_{k-1}(f)\big{]}}{t}}\limits_{D_{2,t}}+\underbrace{\frac{\sum\limits_{k=\sqrt{T}+1}^{t}\big{[}\omega_{k-1}(f)-\omega^{\star}(f)\big{]}}{t}}\limits_{D_{3,t}}.

We clearly have:

T1ξ4,tT3/4,|D1,t|fTtfT1/4fξ.\displaystyle\forall T\geq\frac{1}{\xi^{4}},\ \forall t\geq T^{3/4},\ |D_{1,t}|\leq\frac{\left\lVert f\right\rVert_{\infty}\sqrt{T}}{t}\leq\frac{\left\lVert f\right\rVert_{\infty}}{T^{1/4}}\leq\left\lVert f\right\rVert_{\infty}\xi. (55)

Using Fact 3 and a series-integral comparison, we upper bound the third term as follows:

T(2ξ)4(m+1),tT3/4,|D3,t|\displaystyle\forall T\geq\big{(}\frac{2}{\xi}\big{)}^{4(m+1)},\ \forall t\geq T^{3/4},\ |D_{3,t}| κfk=T+1t1ωkω1t\displaystyle\leq\kappa_{\mathcal{M}}\left\lVert f\right\rVert_{\infty}\frac{\sum\limits_{k=\sqrt{T}+1}^{t-1}\left\lVert\omega_{k}-\omega^{\star}\right\rVert_{1}}{t}
κfk=T+1t1[T14(m+1)+T1/4k+ξ]t\displaystyle\leq\kappa_{\mathcal{M}}\left\lVert f\right\rVert_{\infty}\frac{\sum\limits_{k=\sqrt{T}+1}^{t-1}\big{[}T^{\frac{-1}{4(m+1)}}+\frac{T^{1/4}}{k}+\xi\big{]}}{t}
κf(T14(m+1)+k=T+1t1T1/4kt+ξ)\displaystyle\leq\kappa_{\mathcal{M}}\left\lVert f\right\rVert_{\infty}\bigg{(}T^{\frac{-1}{4(m+1)}}+\frac{\sum\limits_{k=\sqrt{T}+1}^{t-1}\frac{T^{1/4}}{k}}{t}+\xi\bigg{)}
κf(T14(m+1)+ξ+T1/4log(t)t)\displaystyle\leq\kappa_{\mathcal{M}}\left\lVert f\right\rVert_{\infty}\bigg{(}T^{\frac{-1}{4(m+1)}}+\xi+\frac{T^{1/4}\log(t)}{t}\bigg{)}
κf(T14(m+1)+ξ+T1/4t)\displaystyle\leq\kappa_{\mathcal{M}}\left\lVert f\right\rVert_{\infty}\bigg{(}T^{\frac{-1}{4(m+1)}}+\xi+\frac{T^{1/4}}{\sqrt{t}}\bigg{)}
κf(T14(m+1)+ξ+T18)\displaystyle\leq\kappa_{\mathcal{M}}\left\lVert f\right\rVert_{\infty}\bigg{(}T^{\frac{-1}{4(m+1)}}+\xi+T^{\frac{-1}{8}}\bigg{)}
κf(2T14(m+1)+ξ)\displaystyle\leq\kappa_{\mathcal{M}}\left\lVert f\right\rVert_{\infty}\bigg{(}2T^{\frac{-1}{4(m+1)}}+\xi\bigg{)}
2κfξ.\displaystyle\leq 2\kappa_{\mathcal{M}}\left\lVert f\right\rVert_{\infty}\xi. (56)

Now, to bound $D_{2,t}$, we use the function $\widehat{f}_{k}$ solving the Poisson equation $\big(\widehat{f}_{k}-P_{k}\widehat{f}_{k}\big)(.)=f(.)-\omega_{k}(f)$. By Lemma 23, $\widehat{f}_{k}(.)=\sum\limits_{n\geq 0}P_{k}^{n}[f-\omega_{k}(f)](.)$ is well defined and is a solution to this equation. Therefore we can rewrite $D_{2,t}$ as follows:

D2,t\displaystyle D_{2,t} =k=T+1t[f^k1(zk)Pk1f^k1(zk)]t\displaystyle=\frac{\sum\limits_{k=\sqrt{T}+1}^{t}\big{[}\widehat{f}_{k-1}(z_{k})-P_{k-1}\widehat{f}_{k-1}(z_{k})\big{]}}{t}
=Mt+Ct+Rt.\displaystyle=M_{t}+C_{t}+R_{t}. (57)

where

Mtk=T+1t[f^k1(zk)Pk1f^k1(zk1)]t.\displaystyle M_{t}\triangleq\frac{\sum\limits_{k=\sqrt{T}+1}^{t}\big{[}\widehat{f}_{k-1}(z_{k})-P_{k-1}\widehat{f}_{k-1}(z_{k-1})\big{]}}{t}.
Ctk=T+1t[Pkf^k(zk)Pk1f^k1(zk)]t.\displaystyle C_{t}\triangleq\frac{\sum\limits_{k=\sqrt{T}+1}^{t}\big{[}P_{k}\widehat{f}_{k}(z_{k})-P_{k-1}\widehat{f}_{k-1}(z_{k})\big{]}}{t}.
RtPTf^T(zT)Ptf^t(zt)t.\displaystyle R_{t}\triangleq\frac{P_{\sqrt{T}}\widehat{f}_{\sqrt{T}}(z_{\sqrt{T}})-P_{t}\widehat{f}_{t}(z_{t})}{t}.
Bounding MtM_{t}:

Note that SttMtS_{t}\triangleq tM_{t} is a martingale since 𝔼[f^k1(zk)|k1]=Pk1f^k1(zk1)\mathbb{E}[\widehat{f}_{k-1}(z_{k})|\mathcal{F}_{k-1}]=P_{k-1}\widehat{f}_{k-1}(z_{k-1}). Furthermore, by Lemma 23:

|SkSk1|\displaystyle|S_{k}-S_{k-1}| =|f^k1(zk)Pk1f^k1(zk1)|\displaystyle=|\widehat{f}_{k-1}(z_{k})-P_{k-1}\widehat{f}_{k-1}(z_{k-1})|
2f^k1\displaystyle\leq 2\left\lVert\widehat{f}_{k-1}\right\rVert_{\infty}
2fLk1.\displaystyle\leq 2\left\lVert f\right\rVert_{\infty}L_{k-1}. (58)

Recall from Lemma 13 that Lk=(εk,πko¯,ωk)L_{k}=\mathcal{L}(\varepsilon_{k},\overline{\pi_{k}^{o}},\omega_{k}) where:

(ε,π,ω)\displaystyle\mathcal{L}(\varepsilon,\pi,\omega) 2θ(ε,π,ω)[1θ(ε,π,ω)1/r]\displaystyle\triangleq\frac{2}{\theta(\varepsilon,\pi,\omega)\big{[}1-\theta(\varepsilon,\pi,\omega)^{1/r}\big{]}}
θ(ε,π,ω)\displaystyle\theta(\varepsilon,\pi,\omega) 1σ(ε,π,ω).\displaystyle\triangleq 1-\sigma(\varepsilon,\pi,\omega).
σ(ε,π,ω)\displaystyle\sigma(\varepsilon,\pi,\omega) [εr+((1ε)Amins,aπ(a|s))r]σu(minzωu(z)ω(z)).\displaystyle\triangleq\bigg{[}\varepsilon^{r}+\bigg{(}(1-\varepsilon)A\min\limits_{s,a}\pi(a|s)\bigg{)}^{r}\bigg{]}\sigma_{u}\bigg{(}\min\limits_{z}\frac{\omega_{u}(z)}{\omega(z)}\bigg{)}.

Now for T(2ξ)4(m+1)T\geq\big{(}\frac{2}{\xi}\big{)}^{4(m+1)} and kTk\geq\sqrt{T} we have:

|εk|=k12(m+1)T14(m+1)ξ/2.\displaystyle|\varepsilon_{k}|=k^{\frac{-1}{2(m+1)}}\leq T^{\frac{-1}{4(m+1)}}\leq\xi/2.
πko¯πo()T1/4k+ξ1T1/4+ξ2ξ.\displaystyle\left\lVert\overline{\pi_{k}^{o}}-\pi^{o}(\mathcal{M})\right\rVert_{\infty}\leq\frac{T^{1/4}}{k}+\xi\leq\frac{1}{T^{1/4}}+\xi\leq 2\xi.
ωkω1κ[T14(m+1)+T1/4k+ξ]κ[2T14(m+1)+ξ]2κξ.\displaystyle\left\lVert\omega_{k}-\omega^{\star}\right\rVert_{1}\leq\kappa_{\mathcal{M}}\big{[}T^{\frac{-1}{4(m+1)}}+\frac{T^{1/4}}{k}+\xi\big{]}\leq\kappa_{\mathcal{M}}\big{[}2T^{\frac{-1}{4(m+1)}}+\xi\big{]}\leq 2\kappa_{\mathcal{M}}\xi.

Therefore:

L_{k}\leq L_{\xi}\triangleq\sup_{\substack{\varepsilon\leq\xi/2,\ \left\lVert\pi-\pi^{o}(\mathcal{M})\right\rVert_{\infty}\leq 2\xi,\\ \left\lVert\omega-\omega^{\star}\right\rVert_{1}\leq 2\kappa_{\mathcal{M}}\xi}}\ \mathcal{L}(\varepsilon,\pi,\omega). (59)

Using (58), (59) and the Azuma-Hoeffding inequality, we get for all $t\geq T^{3/4}$:

(|Mt|2fLξξ)\displaystyle\mathbb{P}\big{(}|M_{t}|\geq 2\left\lVert f\right\rVert_{\infty}L_{\xi}\xi\big{)} =(|St|2tfLξξ)\displaystyle=\mathbb{P}\big{(}|S_{t}|\geq 2t\left\lVert f\right\rVert_{\infty}L_{\xi}\xi\big{)}
=(|StST|2tfLξξ)\displaystyle=\mathbb{P}\big{(}|S_{t}-S_{\sqrt{T}}|\geq 2t\left\lVert f\right\rVert_{\infty}L_{\xi}\xi\big{)}
2exp(t2ξ2(tT))\displaystyle\leq 2\exp\bigg{(}\frac{-t^{2}\xi^{2}}{(t-\sqrt{T})}\bigg{)}
2exp(tξ2).\displaystyle\leq 2\exp\big{(}-t\xi^{2}\big{)}. (60)
Bounding CtC_{t}:

Using Lemma 23 we have for all T(2ξ)4(m+1)T\geq\big{(}\frac{2}{\xi}\big{)}^{4(m+1)} and all tT3/4t\geq T^{3/4}:

|Ct|\displaystyle|C_{t}| fk=T+1tLk[ωkωk11+Lk1D(πk,πk1)]t\displaystyle\leq\left\lVert f\right\rVert_{\infty}\frac{\sum\limits_{k=\sqrt{T}+1}^{t}L_{k}\bigg{[}\left\lVert\omega_{k}-\omega_{k-1}\right\rVert_{1}+L_{k-1}D(\pi_{k},\pi_{k-1})\bigg{]}}{t}
fk=T+1tLξ[ωkω1+ωωk11+2Lξ(T14(m+1)+T1/4k1+ξ)]t\displaystyle\leq\left\lVert f\right\rVert_{\infty}\frac{\sum\limits_{k=\sqrt{T}+1}^{t}L_{\xi}\bigg{[}\left\lVert\omega_{k}-\omega^{\star}\right\rVert_{1}+\left\lVert\omega^{\star}-\omega_{k-1}\right\rVert_{1}+2L_{\xi}\big{(}T^{\frac{-1}{4(m+1)}}+\frac{T^{1/4}}{k-1}+\xi\big{)}\bigg{]}}{t}
fk=T+1tLξ[κ(2T14(m+1)+2ξ+T1/4k+T1/4k1)+2Lξ(T14(m+1)+T1/4k1+ξ)]t\displaystyle\leq\left\lVert f\right\rVert_{\infty}\frac{\sum\limits_{k=\sqrt{T}+1}^{t}L_{\xi}\bigg{[}\kappa_{\mathcal{M}}\big{(}2T^{\frac{-1}{4(m+1)}}+2\xi+\frac{T^{1/4}}{k}+\frac{T^{1/4}}{k-1}\big{)}+2L_{\xi}\big{(}T^{\frac{-1}{4(m+1)}}+\frac{T^{1/4}}{k-1}+\xi\big{)}\bigg{]}}{t}
2f(κLξ+Lξ2)[T14(m+1)+ξ+T1/4log(t)t]\displaystyle\leq 2\left\lVert f\right\rVert_{\infty}(\kappa_{\mathcal{M}}L_{\xi}+L_{\xi}^{2})\bigg{[}T^{\frac{-1}{4(m+1)}}+\xi+T^{1/4}\frac{\log(t)}{t}\bigg{]}
2f(κLξ+Lξ2)[T14(m+1)+ξ+T1/41t]\displaystyle\leq 2\left\lVert f\right\rVert_{\infty}(\kappa_{\mathcal{M}}L_{\xi}+L_{\xi}^{2})\bigg{[}T^{\frac{-1}{4(m+1)}}+\xi+T^{1/4}\frac{1}{\sqrt{t}}\bigg{]}
2f(κLξ+Lξ2)[T14(m+1)+ξ+T1/8]\displaystyle\leq 2\left\lVert f\right\rVert_{\infty}(\kappa_{\mathcal{M}}L_{\xi}+L_{\xi}^{2})\bigg{[}T^{\frac{-1}{4(m+1)}}+\xi+T^{-1/8}\bigg{]}
4f(κLξ+Lξ2)ξ,\displaystyle\leq 4\left\lVert f\right\rVert_{\infty}(\kappa_{\mathcal{M}}L_{\xi}+L_{\xi}^{2})\xi, (61)

where the second line comes from (59) and Fact 4 and the third line is due to Fact 3.

Bounding RtR_{t}:

Finally, by Lemma 23 we have:

T(2ξ)4(m+1),tT3/4,|Rt|\displaystyle\forall T\geq\big{(}\frac{2}{\xi}\big{)}^{4(m+1)},\ \forall t\geq T^{3/4},\ |R_{t}| f^T+f^tt\displaystyle\leq\frac{\left\lVert\widehat{f}_{\sqrt{T}}\right\rVert_{\infty}+\left\lVert\widehat{f}_{t}\right\rVert_{\infty}}{t}
f(LT+Lt)t\displaystyle\leq\frac{\left\lVert f\right\rVert_{\infty}(L_{\sqrt{T}}+L_{t})}{t}
2fLξT3/4\displaystyle\leq 2\left\lVert f\right\rVert_{\infty}L_{\xi}T^{-3/4}
2fLξξ.\displaystyle\leq 2\left\lVert f\right\rVert_{\infty}L_{\xi}\xi. (62)

Summing up the inequalities (55-62) yields for all T(2ξ)4(m+1)T\geq\big{(}\frac{2}{\xi}\big{)}^{4(m+1)} and all tT3/4t\geq T^{3/4}:

(|k=1tf(zk)tω(f)|Kξfξ|𝒞T1(ξ))2exp(tξ2).\displaystyle\mathbb{P}\bigg{(}\bigg{|}\frac{\sum\limits_{k=1}^{t}f(z_{k})}{t}-\omega^{\star}(f)\bigg{|}\geq K_{\xi}\left\lVert f\right\rVert_{\infty}\xi\bigg{|}\mathcal{C}_{T}^{1}(\xi)\bigg{)}\leq 2\exp\big{(}-t\xi^{2}\big{)}. (63)

where $K_{\xi}\triangleq 1+2\kappa_{\mathcal{M}}+4L_{\xi}(1+\kappa_{\mathcal{M}}+L_{\xi})$. Note that $\limsup\limits_{\xi\to 0}L_{\xi}=\mathcal{L}(0,\pi^{o}(\mathcal{M}),\omega^{\star})<\infty$ (refer to (34) and (35) for a formal justification), which implies that $\limsup\limits_{\xi\to 0}K_{\xi}<\infty$. We get the final result by applying (63) to the indicator functions $\mathbbm{1}_{s,a}(z)$ and using a union bound. ∎
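Proposition 19 is, in essence, a quantitative ergodic theorem for the non-homogeneous chain generated by the slowly mixing exploration policy. The sketch below is entirely ours (the MDP, the target policy standing in for $\pi^{o}(\mathcal{M})$, and the forced-exploration rate $\varepsilon_{k}=k^{-1/(2(m+1))}$ with $m=1$ are all hypothetical choices): it runs such a chain and compares the empirical frequencies $N_{sa}(t)/t$ with the stationary distribution $\omega^{\star}$ of the limiting policy.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
S, A, m, T = 3, 2, 1, 100_000
P = rng.dirichlet(np.ones(S), size=(S, A))       # hypothetical kernel p(.|s,a)
pi_target = rng.dirichlet(np.ones(A), size=S)    # hypothetical limiting policy
pi_unif = np.full((S, A), 1.0 / A)

def sa_kernel(pi):
    """Transition matrix of the state-action chain induced by a stationary policy pi."""
    K = np.zeros((S * A, S * A))
    for s in range(S):
        for a in range(A):
            for s2 in range(S):
                for a2 in range(A):
                    K[s * A + a, s2 * A + a2] = P[s, a, s2] * pi[s2, a2]
    return K

# stationary distribution omega* of the limiting state-action chain (left Perron eigenvector)
w, v = np.linalg.eig(sa_kernel(pi_target).T)
omega = np.real(v[:, np.argmax(np.real(w))])
omega /= omega.sum()

counts, s = np.zeros(S * A), 0
for k in range(1, T + 1):
    eps = k ** (-1.0 / (2 * (m + 1)))            # forced-exploration rate epsilon_k
    pi_k = eps * pi_unif + (1 - eps) * pi_target  # mixture policy used at step k
    a = rng.choice(A, p=pi_k[s])
    counts[s * A + a] += 1
    s = rng.choice(S, p=P[s, a])
print("max_sa |N_sa(T)/T - omega*_sa| =", np.max(np.abs(counts / T - omega)))
\end{verbatim}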

Appendix G Technical Lemmas

G.1 Upper bound on the norm of products of substochastic matrices

Before we proceed with the lemma, we lay out some definitions. $\eta_{1}\triangleq\min\big\{P_{\pi_{u}}(z,z^{\prime})\ \big|\ (z,z^{\prime})\in\mathcal{Z}^{2},\ P_{\pi_{u}}(z,z^{\prime})>0\big\}$ denotes the minimum positive transition probability in $\mathcal{M}$. Similarly, define $\eta_{2}\triangleq\min\big\{P^{n}_{\pi_{u}}(z,z^{\prime})\ \big|\ (z,z^{\prime})\in\mathcal{Z}^{2},\ n\in[|1,m+1|],\ P^{n}_{\pi_{u}}(z,z^{\prime})>0\big\}$, the minimal probability of reaching a state-action pair $z^{\prime}$ from any other state-action pair $z$ after $n\leq m+1$ transitions (refer to the preamble of Appendix D for more detail) in the Markov chain induced by the uniform random policy. Finally, $\eta\triangleq\eta_{1}\eta_{2}$.

Lemma 20.

Fix some state-action zz and let PtP_{t} be the transition matrix under some policy πt\pi_{t} satisfying πt(a|s)εtπu(a|s)\pi_{t}(a|s)\geq\varepsilon_{t}\pi_{u}(a|s) for all (s,a)𝒵(s,a)\in\mathcal{Z}. Define the substochastic matrix QtQ_{t} obtained by removing from PtP_{t} the row and the column corresponding to zz:

Pt=(Qt[Pt(z,z)]zz[Pt(z,z)]zzTPt(z,z)).P_{t}=\begin{pmatrix}\begin{matrix}\\ \quad Q_{t}\quad\\ \\ \end{matrix}&\vline&[P_{t}(z^{\prime},z)]_{z^{\prime}\neq z}\\ \hline\cr[P_{t}(z,z^{\prime})]_{z^{\prime}\neq z}^{T}&\vline&\begin{matrix}P_{t}(z,z)\end{matrix}\end{pmatrix}\;.

Then we have:

n1,l=n+1n+m+1Ql1ηl=n+1n+m+1εl.\forall n\geq 1,\ \left\lVert\prod\limits_{l=n+1}^{n+m+1}Q_{l}\right\rVert_{\infty}\leq 1-\eta\prod\limits_{l=n+1}^{n+m+1}\varepsilon_{l}.
Proof.

Define $r_{k}(n_{1},n_{2})=\sum\limits_{j=1}^{SA-1}\Big(\prod\limits_{l=n_{1}+1}^{n_{2}}Q_{l}\Big)_{kj}$, the sum of the $k$-th row of the product of the matrices $Q_{l}$ for $l\in[|n_{1}+1,n_{2}|]$. We will prove that for all $i\in[|1,SA-1|]$: $r_{i}(n,n+m+1)\leq 1-\eta\prod\limits_{l=n+1}^{n+m+1}\varepsilon_{l}$. The result follows immediately by noting that $\left\lVert\prod\limits_{l=n+1}^{n+m+1}Q_{l}\right\rVert_{\infty}=\max\limits_{i\in[|1,SA-1|]}r_{i}(n,n+m+1)$.
Consider zz^{\prime} such that Pπu(z,z)η1P_{\pi_{u}}(z^{\prime},z)\geq\eta_{1} (such zz^{\prime} always exists since \mathcal{M} is communicating) and let kk^{\star} be the index of the row corresponding to zz^{\prime} in QtQ_{t}. Then for all n11n_{1}\geq 1:

r_{k^{\star}}(n_{1},n_{1}+1) =\sum\limits_{j=1}^{SA-1}(Q_{n_{1}+1})_{k^{\star}j}
=1Pn1+1(z,z)\displaystyle=1-P_{n_{1}+1}(z^{\prime},z)
1η1εn1+1.\displaystyle\leq 1-\eta_{1}\varepsilon_{n_{1}+1}. (64)

Now for n1,n21n_{1},n_{2}\geq 1 we have:

rk(n1,n1+n2)\displaystyle r_{k^{\star}}(n_{1},n_{1}+n_{2}) =j1=1SA1(l=n1+1n1+n2Ql)kj1\displaystyle=\sum\limits_{j_{1}=1}^{SA-1}\bigg{(}\prod\limits_{l=n_{1}+1}^{n_{1}+n_{2}}Q_{l}\bigg{)}_{k^{\star}j_{1}}
=j1=1SA1j2=1SA1(l=n1+1n1+n21Ql)kj2(Qn1+n2)j2j1\displaystyle=\sum\limits_{j_{1}=1}^{SA-1}\sum\limits_{j_{2}=1}^{SA-1}\bigg{(}\prod\limits_{l=n_{1}+1}^{n_{1}+n_{2}-1}Q_{l}\bigg{)}_{k^{\star}j_{2}}(Q_{n_{1}+n_{2}})_{j_{2}j_{1}}
=j2=1SA1(l=n1+1n1+n21Ql)kj2[j1=1SA1(Qn1+n2)j2j1]\displaystyle=\sum\limits_{j_{2}=1}^{SA-1}\bigg{(}\prod\limits_{l=n_{1}+1}^{n_{1}+n_{2}-1}Q_{l}\bigg{)}_{k^{\star}j_{2}}\bigg{[}\sum\limits_{j_{1}=1}^{SA-1}(Q_{n_{1}+n_{2}})_{j_{2}j_{1}}\bigg{]}
=j2=1SA1(l=n1+1n1+n21Ql)kj2rj2(n1+n21,n1+n2)\displaystyle=\sum\limits_{j_{2}=1}^{SA-1}\bigg{(}\prod\limits_{l=n_{1}+1}^{n_{1}+n_{2}-1}Q_{l}\bigg{)}_{k^{\star}j_{2}}r_{j_{2}}(n_{1}+n_{2}-1,n_{1}+n_{2})
rk(n1,n1+n21)\displaystyle\leq r_{k^{\star}}(n_{1},n_{1}+n_{2}-1)
\displaystyle\quad\vdots
rk(n1,n1+1)\displaystyle\leq r_{k^{\star}}(n_{1},n_{1}+1)
1η1εn1+1,\displaystyle\leq 1-\eta_{1}\varepsilon_{n_{1}+1}, (65)

where in the fifth line we use the fact that for all j2,a,bj_{2},a,b: rj2(a,b)1r_{j_{2}}(a,b)\leq 1 since the matrices QlQ_{l} are substochastic. The last line comes from (64). Now for all other indexes i[|1,SA1|]i\in[|1,SA-1|] we have:

n1[|1,m|],ri(n,n+m+1)\displaystyle\forall n_{1}\in[|1,m|],\ r_{i}(n,n+m+1) =j1=1SA1(l=n+1n+n1Ql×l=n+n1+1n+m+1Ql)ij1\displaystyle=\sum\limits_{j_{1}=1}^{SA-1}\bigg{(}\prod\limits_{l=n+1}^{n+n_{1}}Q_{l}\ \times\prod\limits_{l=n+n_{1}+1}^{n+m+1}Q_{l}\bigg{)}_{ij_{1}}
=j1=1SA1j2=1SA1(l=n+1n+n1Ql)ij2(l=n+n1+1n+m+1Ql)j2j1\displaystyle=\sum\limits_{j_{1}=1}^{SA-1}\sum\limits_{j_{2}=1}^{SA-1}\bigg{(}\prod\limits_{l=n+1}^{n+n_{1}}Q_{l}\bigg{)}_{ij_{2}}\bigg{(}\prod\limits_{l=n+n_{1}+1}^{n+m+1}Q_{l}\bigg{)}_{j_{2}j_{1}}
=j2=1SA1(l=n+1n+n1Ql)ij2j1=1SA1(l=n+n1+1n+m+1Ql)j2j1\displaystyle=\sum\limits_{j_{2}=1}^{SA-1}\bigg{(}\prod\limits_{l=n+1}^{n+n_{1}}Q_{l}\bigg{)}_{ij_{2}}\ \sum\limits_{j_{1}=1}^{SA-1}\bigg{(}\prod\limits_{l=n+n_{1}+1}^{n+m+1}Q_{l}\bigg{)}_{j_{2}j_{1}}
=j2=1SA1(l=n+1n+n1Ql)ij2rj2(n+n1,n+m+1)\displaystyle=\sum\limits_{j_{2}=1}^{SA-1}\bigg{(}\prod\limits_{l=n+1}^{n+n_{1}}Q_{l}\bigg{)}_{ij_{2}}r_{j_{2}}(n+n_{1},n+m+1)
(1η1εn+n1+1)(l=n+1n+n1Ql)ik+j2k(l=n+1n+n1Ql)ij2\displaystyle\leq(1-\eta_{1}\varepsilon_{n+n_{1}+1})\bigg{(}\prod\limits_{l=n+1}^{n+n_{1}}Q_{l}\bigg{)}_{ik^{\star}}+\sum\limits_{j_{2}\neq k^{\star}}\bigg{(}\prod\limits_{l=n+1}^{n+n_{1}}Q_{l}\bigg{)}_{ij_{2}}
(1η1εn+n1+1)(l=n+1n+n1Ql)ik+1(l=n+1n+n1Ql)ik\displaystyle\leq(1-\eta_{1}\varepsilon_{n+n_{1}+1})\bigg{(}\prod\limits_{l=n+1}^{n+n_{1}}Q_{l}\bigg{)}_{ik^{\star}}+1-\bigg{(}\prod\limits_{l=n+1}^{n+n_{1}}Q_{l}\bigg{)}_{ik^{\star}}
=1η1εn+n1+1(l=n+1n+n1Ql)ik,\displaystyle=1-\eta_{1}\varepsilon_{n+n_{1}+1}\bigg{(}\prod\limits_{l=n+1}^{n+n_{1}}Q_{l}\bigg{)}_{ik^{\star}}, (66)

where we used (65) and the fact that the matrix $\prod\limits_{l=n+1}^{n+n_{1}}Q_{l}$ is substochastic. Since $\mathcal{M}$ is communicating, we can reach the state-action pair $z^{\prime}$ from any other state-action pair $z_{i}$, $i\in[|1,SA-1|]$, after some $n_{i}\leq m+1$ steps in the Markov chain induced by the uniform random policy. In other words, if $i$ is the index corresponding to $z_{i}$, then there exists $n_{i}\leq m+1$ such that $(P_{\pi_{u}}^{n_{i}})_{ik^{\star}}\geq\eta_{2}>0$. Therefore:

(l=n+1n+niQl)ik\displaystyle\bigg{(}\prod\limits_{l=n+1}^{n+n_{i}}Q_{l}\bigg{)}_{ik^{\star}} (l=n+1n+niεlPπu)ik\displaystyle\geq\bigg{(}\prod\limits_{l=n+1}^{n+n_{i}}\varepsilon_{l}P_{\pi_{u}}\bigg{)}_{ik^{\star}}
=(l=n+1n+niεl)(Pπuni)ik\displaystyle=\bigg{(}\prod\limits_{l=n+1}^{n+n_{i}}\varepsilon_{l}\bigg{)}(P_{\pi_{u}}^{n_{i}})_{ik^{\star}}
η2l=n+1n+niεl.\displaystyle\geq\eta_{2}\prod\limits_{l=n+1}^{n+n_{i}}\varepsilon_{l}. (67)

Thus, combining (66) for n1=nin_{1}=n_{i} and (67) we get:

i[|1,SA1|],ri(n,n+m+1)\displaystyle\forall i\in[|1,SA-1|],\ r_{i}(n,n+m+1) 1η1η2l=n+1n+niεl\displaystyle\leq 1-\eta_{1}\eta_{2}\prod\limits_{l=n+1}^{n+n_{i}}\varepsilon_{l}
1η1η2l=n+1n+m+1εl\displaystyle\leq 1-\eta_{1}\eta_{2}\prod\limits_{l=n+1}^{n+m+1}\varepsilon_{l}
=1ηl=n+1n+m+1εl.\displaystyle=1-\eta\prod\limits_{l=n+1}^{n+m+1}\varepsilon_{l}.
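Lemma 20 can be checked numerically: generate a small MDP, compute $\eta_{1}$, $\eta_{2}$ and $m+1$ from the uniform-policy chain, build state-action kernels $P_{l}$ for policies satisfying $\pi_{l}(a|s)\geq\varepsilon_{l}\pi_{u}(a|s)$, delete the row and column of a fixed pair $z$, and compare $\big\lVert\prod_{l}Q_{l}\big\rVert_{\infty}$ with $1-\eta\prod_{l}\varepsilon_{l}$. The instance below is a sketch of ours (a dense random kernel, for which $m=0$).

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
S, A = 3, 2
Z = S * A
P = rng.dirichlet(np.ones(S), size=(S, A))       # dense random kernel: communicating MDP
pi_u = np.full((S, A), 1.0 / A)

def sa_kernel(pi):
    """State-action transition matrix induced by policy pi."""
    K = np.zeros((Z, Z))
    for s in range(S):
        for a in range(A):
            for s2 in range(S):
                for a2 in range(A):
                    K[s * A + a, s2 * A + a2] = P[s, a, s2] * pi[s2, a2]
    return K

# m+1 = smallest horizon such that every pair is reachable from every pair under pi_u
Pu = sa_kernel(pi_u)
powers, reach = [Pu.copy()], Pu > 0
while not reach.all():
    powers.append(powers[-1] @ Pu)
    reach |= powers[-1] > 0
eta1 = Pu[Pu > 0].min()
eta2 = min(Q[Q > 0].min() for Q in powers)
eta = eta1 * eta2

z = 0                                            # removed state-action pair
eps_l = rng.uniform(0.3, 0.9, size=len(powers))  # one epsilon per factor, m+1 factors in total
prod_Q = np.eye(Z - 1)
for e in eps_l:
    pi_l = e * pi_u + (1 - e) * rng.dirichlet(np.ones(A), size=S)  # satisfies pi_l >= e * pi_u
    Q = np.delete(np.delete(sa_kernel(pi_l), z, axis=0), z, axis=1)
    prod_Q = prod_Q @ Q
lhs = prod_Q.sum(axis=1).max()                   # infinity norm of a nonnegative matrix
rhs = 1 - eta * np.prod(eps_l)
print(f"||prod Q_l||_inf = {lhs:.4f}  <=  1 - eta * prod eps_l = {rhs:.4f}")
\end{verbatim}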

G.2 Geometric ergodicity: a general result

The following lemma is adapted from the proof of the Convergence theorem (Theorem 4.9, [LPW06]).

Lemma 21.

Let PP be a stochastic matrix with stationary distribution vector ω\omega. Suppose that there exist σ>0\sigma>0 and an integer rr such that Pr(s,s)σω(s)P^{r}(s,s^{\prime})\geq\sigma\omega(s^{\prime}) for all (s,s)(s,s^{\prime}). Let WW be a rank-one matrix whose rows are equal to ω𝖳\omega^{\mathsf{\scriptscriptstyle T}}. Then:

n1,PnW2θnr1\forall n\geq 1,\ \left\lVert P^{n}-W\right\rVert_{\infty}\leq 2\theta^{\frac{n}{r}-1}

where θ=1σ\theta=1-\sigma.

Proof.

We write $P^{r}=(1-\theta)W+\theta Q$, where $Q$ is a stochastic matrix. Note that $WP^{k}=W$ for all $k\geq 0$, since $\omega^{\mathsf{\scriptscriptstyle T}}=\omega^{\mathsf{\scriptscriptstyle T}}P$. Furthermore, $MW=W$ for every stochastic matrix $M$, since all rows of $W$ are equal. Using these properties, we show by induction that $P^{rk}=(1-\theta^{k})W+\theta^{k}Q^{k}$ for all $k\geq 1$:
For $k=1$ the result is trivial. Now suppose that $P^{rk}=(1-\theta^{k})W+\theta^{k}Q^{k}$. Then:

P^{r(k+1)} =P^{rk}P^{r}
=[(1θk)W+θkQk]Pr\displaystyle=[(1-\theta^{k})W+\theta^{k}Q^{k}]P^{r}
=(1θk)WPr+(1θ)θkQkW+θk+1Qk+1\displaystyle=(1-\theta^{k})WP^{r}+(1-\theta)\theta^{k}Q^{k}W+\theta^{k+1}Q^{k+1}
=(1-\theta^{k})W+(1-\theta)\theta^{k}W+\theta^{k+1}Q^{k+1}
=(1θk+1)W+θk+1Qk+1.\displaystyle=(1-\theta^{k+1})W+\theta^{k+1}Q^{k+1}.

Hence the result holds for all $k\geq 1$. It follows that $P^{rk+j}-W=\theta^{k}(Q^{k}P^{j}-W)$ for all $j\geq 0$, which implies:

n=rk+j1,PnW\displaystyle\forall n=rk+j\geq 1,\ \left\lVert P^{n}-W\right\rVert_{\infty} θkQkPjW\displaystyle\leq\theta^{k}\left\lVert Q^{k}P^{j}-W\right\rVert_{\infty}
2θk=2θnr2θnr1.\displaystyle\leq 2\theta^{k}=2\theta^{\lfloor\frac{n}{r}\rfloor}\leq 2\theta^{\frac{n}{r}-1}.
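Lemma 21 is easy to verify numerically: take a stochastic matrix $P$ with stationary distribution $\omega$, pick a valid pair $(r,\sigma)$ with $P^{r}(s,s^{\prime})\geq\sigma\omega(s^{\prime})$, and compare $\left\lVert P^{n}-W\right\rVert_{\infty}$ with $2\theta^{n/r-1}$. The following sketch is ours, on a random $4$-state chain for which $r=1$ already works.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(4)
S = 4
P = rng.dirichlet(np.ones(S), size=S)     # random stochastic matrix, all entries positive
w_eig, v = np.linalg.eig(P.T)
omega = np.real(v[:, np.argmax(np.real(w_eig))])
omega /= omega.sum()                      # stationary distribution
W = np.tile(omega, (S, 1))                # rank-one matrix whose rows equal omega^T

r = 1
sigma = np.min(P / W)                     # largest sigma with P^r >= sigma * W entrywise
theta = 1 - sigma
for n in (1, 2, 5, 10, 20):
    lhs = np.abs(np.linalg.matrix_power(P, n) - W).sum(axis=1).max()
    rhs = 2 * theta ** (n / r - 1)
    print(f"n = {n:2d}: ||P^n - W||_inf = {lhs:.3e}  <=  2*theta^(n/r - 1) = {rhs:.3e}")
\end{verbatim}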

G.3 Condition number of Markov Chains

Lemma 22.

(Theorem 2 in [Sch68]) Let P1P_{1} (resp. P2P_{2}) be the transition kernel of a Markov Chain with stationary distribution ω1\omega_{1} (resp. ω2\omega_{2}). Define Z1(IP1+𝟙ω1𝖳)1Z_{1}\triangleq(I-P_{1}+\mathbbm{1}\omega_{1}^{\mathsf{\scriptscriptstyle T}})^{-1}. Then:

ω2𝖳ω1𝖳=ω2𝖳[P2P1]Z1andω2ω11κ1P2P1.\displaystyle\omega_{2}^{\mathsf{\scriptscriptstyle T}}-\omega_{1}^{\mathsf{\scriptscriptstyle T}}=\omega_{2}^{\mathsf{\scriptscriptstyle T}}[P_{2}-P_{1}]Z_{1}\quad\textrm{and}\quad\left\lVert\omega_{2}-\omega_{1}\right\rVert_{1}\leq\kappa_{1}\left\lVert P_{2}-P_{1}\right\rVert_{\infty}.

where κ1Z1\kappa_{1}\triangleq\left\lVert Z_{1}\right\rVert_{\infty}. Crucially, in our setting this implies that there exists a constant κ\kappa_{\mathcal{M}} that only depends on \mathcal{M} such that for all π\pi:

ωπω1κPπPπo.\displaystyle\left\lVert\omega_{\pi}-\omega^{\star}\right\rVert_{1}\leq\kappa_{\mathcal{M}}\left\lVert P_{\pi}-P_{\pi^{o}}\right\rVert_{\infty}.

where κZπo=(IPπo+𝟙ω𝖳)1\kappa_{\mathcal{M}}\triangleq\left\lVert Z_{\pi^{o}}\right\rVert_{\infty}=\left\lVert(I-P_{\pi^{o}}+\mathbbm{1}{\omega^{\star}}^{\mathsf{\scriptscriptstyle T}})^{-1}\right\rVert_{\infty}.
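The perturbation bound of Lemma 22 can also be checked directly: compute $Z_{1}=(I-P_{1}+\mathbbm{1}\omega_{1}^{\mathsf{\scriptscriptstyle T}})^{-1}$, set $\kappa_{1}=\left\lVert Z_{1}\right\rVert_{\infty}$, and compare $\left\lVert\omega_{2}-\omega_{1}\right\rVert_{1}$ with $\kappa_{1}\left\lVert P_{2}-P_{1}\right\rVert_{\infty}$ for a perturbed kernel. A minimal sketch of ours on random chains:

\begin{verbatim}
import numpy as np

def stationary(P):
    """Stationary distribution of an irreducible stochastic matrix (left Perron eigenvector)."""
    w, v = np.linalg.eig(P.T)
    x = np.real(v[:, np.argmax(np.real(w))])
    return x / x.sum()

rng = np.random.default_rng(5)
S = 5
P1 = rng.dirichlet(np.ones(S), size=S)
P2 = 0.9 * P1 + 0.1 * rng.dirichlet(np.ones(S), size=S)   # perturbation of P1
w1, w2 = stationary(P1), stationary(P2)

Z1 = np.linalg.inv(np.eye(S) - P1 + np.outer(np.ones(S), w1))
kappa1 = np.abs(Z1).sum(axis=1).max()                     # ||Z_1||_inf
lhs = np.abs(w2 - w1).sum()                               # ||omega_2 - omega_1||_1
rhs = kappa1 * np.abs(P2 - P1).sum(axis=1).max()          # kappa_1 * ||P_2 - P_1||_inf
print(f"||omega_2 - omega_1||_1 = {lhs:.4f}  <=  kappa_1 ||P_2 - P_1||_inf = {rhs:.4f}")
\end{verbatim}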

G.4 Properties of Poisson equation’s solutions

Lemma 23.

Let $P_{\pi}$ be a Markov transition kernel satisfying assumptions (B1) and (B3), and denote by $\omega_{\pi}$ its stationary distribution. Then for any bounded function $f:\mathcal{Z}\to\mathbb{R}^{+}$, the function defined by $\widehat{f}_{\pi}(.)\triangleq\sum\limits_{n\geq 0}P_{\pi}^{n}[f-\omega_{\pi}(f)](.)$ is well defined and is a solution to the Poisson equation $\big(\widehat{f}_{\pi}-P_{\pi}\widehat{f}_{\pi}\big)(.)=f(.)-\omega_{\pi}(f)$. Furthermore:

f^πLπf,\left\lVert\widehat{f}_{\pi}\right\rVert_{\infty}\leq L_{\pi}\left\lVert f\right\rVert_{\infty},

and for any pair of kernels Pπ,PπP_{\pi},P_{\pi^{\prime}}:

Pπf^πPπf^πLπf[ωπωπ1+LπD(π,π)].\left\lVert P_{\pi^{\prime}}\widehat{f}_{\pi^{\prime}}-P_{\pi}\widehat{f}_{\pi}\right\rVert_{\infty}\leq L_{\pi^{\prime}}\left\lVert f\right\rVert_{\infty}\big{[}\left\lVert\omega_{\pi^{\prime}}-\omega_{\pi}\right\rVert_{1}+L_{\pi}D(\pi^{\prime},\pi)\big{]}.
Proof.

We will prove that f^π\widehat{f}_{\pi} is well defined. Checking that it satisfies the Poisson equation is straightforward. Observe that:

f^π(.)\displaystyle\widehat{f}_{\pi}(.) n0Pπn[fωπ(f)](.)\displaystyle\triangleq\sum\limits_{n\geq 0}P_{\pi}^{n}[f-\omega_{\pi}(f)](.)
=n0[PπnWπ][fωπ(f)](.),\displaystyle=\sum\limits_{n\geq 0}[P_{\pi}^{n}-W_{\pi}][f-\omega_{\pi}(f)](.),

where the second equality is because (Wπf)(z)=ωπ(f)(W_{\pi}f)(z)=\omega_{\pi}(f) for all z𝒵z\in\mathcal{Z}. From the last expression, we see that the sum defining f^π\widehat{f}_{\pi} converges and we have the first bound f^πLπf\left\lVert\widehat{f}_{\pi}\right\rVert_{\infty}\leq L_{\pi}\left\lVert f\right\rVert_{\infty}. Now for the second bound, we write:

(Pπf^πPπf^π)(.)\displaystyle(P_{\pi^{\prime}}\widehat{f}_{\pi^{\prime}}-P_{\pi}\widehat{f}_{\pi})(.) =n1[Pπn[ωπ(f)ωπ(f)]+[PπnPπn][fωπ(f)]](.)\displaystyle=\sum\limits_{n\geq 1}\bigg{[}P_{\pi^{\prime}}^{n}\big{[}\omega_{\pi}(f)-\omega_{\pi^{\prime}}(f)\big{]}+\big{[}P_{\pi^{\prime}}^{n}-P_{\pi}^{n}\big{]}\big{[}f-\omega_{\pi}(f)\big{]}\bigg{]}(.)
=n1Pπn[ωπ(f)ωπ(f)](.)A(.)+n1[PπnPπn][fωπ(f)](.)B(.).\displaystyle=\underbrace{\sum\limits_{n\geq 1}P_{\pi^{\prime}}^{n}\big{[}\omega_{\pi}(f)-\omega_{\pi^{\prime}}(f)\big{]}(.)}\limits_{A(.)}+\underbrace{\sum\limits_{n\geq 1}\big{[}P_{\pi^{\prime}}^{n}-P_{\pi}^{n}\big{]}\big{[}f-\omega_{\pi}(f)\big{]}(.)}\limits_{B(.)}. (68)

Using the same trick as before we obtain:

A\displaystyle\left\lVert A\right\rVert_{\infty} fCπ(1ρπ)1ωπωπ1\displaystyle\leq\left\lVert f\right\rVert_{\infty}C_{\pi^{\prime}}(1-\rho_{\pi^{\prime}})^{-1}\left\lVert\omega_{\pi}-\omega_{\pi^{\prime}}\right\rVert_{1}
=fLπωπωπ1.\displaystyle=\left\lVert f\right\rVert_{\infty}L_{\pi^{\prime}}\left\lVert\omega_{\pi}-\omega_{\pi^{\prime}}\right\rVert_{1}. (69)

On the other hand, a simple calculation shows that $(B-P_{\pi^{\prime}}B)(.)=(P_{\pi^{\prime}}-P_{\pi})\widehat{f}_{\pi}(.)$, i.e., $B$ is a solution to the modified Poisson equation whose right-hand side is $(P_{\pi^{\prime}}-P_{\pi})\widehat{f}_{\pi}$. Therefore:

B(.)=n0Pπn[(PπPπ)f^π](.).\displaystyle B(.)=\sum\limits_{n\geq 0}P_{\pi^{\prime}}^{n}\big{[}(P_{\pi^{\prime}}-P_{\pi})\widehat{f}_{\pi}\big{]}(.).

and

B\displaystyle\left\lVert B\right\rVert_{\infty} Lπ(PπPπ)f^π\displaystyle\leq L_{\pi^{\prime}}\left\lVert(P_{\pi^{\prime}}-P_{\pi})\widehat{f}_{\pi}\right\rVert_{\infty}
LπD(π,π)f^π\displaystyle\leq L_{\pi^{\prime}}D(\pi^{\prime},\pi)\left\lVert\widehat{f}_{\pi}\right\rVert_{\infty}
LπLπD(π,π)f.\displaystyle\leq L_{\pi}L_{\pi^{\prime}}D(\pi^{\prime},\pi)\left\lVert f\right\rVert_{\infty}. (70)

Summing up equation (68) and inequalities (69-70) ends the proof. ∎
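To make the Poisson-equation machinery concrete, the following sketch (ours, on a random chain) builds $\widehat{f}_{\pi}$ by truncating the series $\sum_{n\geq 0}P_{\pi}^{n}[f-\omega_{\pi}(f)]$ and checks that it satisfies $\big(\widehat{f}_{\pi}-P_{\pi}\widehat{f}_{\pi}\big)=f-\omega_{\pi}(f)$ up to truncation error.

\begin{verbatim}
import numpy as np

def stationary(P):
    """Stationary distribution of an irreducible stochastic matrix."""
    w, v = np.linalg.eig(P.T)
    x = np.real(v[:, np.argmax(np.real(w))])
    return x / x.sum()

rng = np.random.default_rng(6)
n_pairs = 6
P = rng.dirichlet(np.ones(n_pairs), size=n_pairs)   # transition kernel on state-action pairs
f = rng.uniform(size=n_pairs)                       # bounded nonnegative test function
omega = stationary(P)
centered = f - omega @ f                            # f - omega_pi(f)

# truncated series f_hat = sum_{n >= 0} P^n (f - omega(f)); terms vanish geometrically
f_hat, term = np.zeros(n_pairs), centered.copy()
for _ in range(2000):
    f_hat += term
    term = P @ term
residual = (f_hat - P @ f_hat) - centered
print("Poisson equation residual (sup norm):", np.max(np.abs(residual)))
\end{verbatim}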