Navigating to the Best Policy
in Markov Decision Processes
Abstract
We investigate the classical active pure exploration problem in Markov Decision Processes, where the agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible. We propose a problem-dependent lower bound on the average number of steps required before a correct answer can be given with probability at least $1-\delta$. We further provide the first algorithm with an instance-specific sample complexity in this setting. This algorithm addresses the general case of communicating MDPs; we also propose a variant with a reduced exploration rate (and hence faster convergence) under an additional ergodicity assumption. This work extends previous results relative to the generative setting [MP21], where the agent could at each step query the random outcome of any (state, action) pair. In contrast, we show here how to deal with the navigation constraints induced by the online setting. Our analysis relies on an ergodic theorem for non-homogeneous Markov chains which we consider of wide interest in the analysis of Markov Decision Processes.
1 Introduction
Somewhat surprisingly, learning in a Markov Decision Process is most often considered under the performance criteria of consistency or regret minimization (see e.g. [SB18, Sze10, LS20] and references therein). Regret minimization (see e.g. [AJO09, FCG10]) is particularly relevant when the rewards accumulated during the learning phase are important. This is however not always the case: for example, when learning a game (whether Go, chess, Atari, or whatever), winning or losing during the learning phase does not really matter. One may intuitively think that sometimes getting into difficulty on purpose, so as to observe otherwise rarely seen situations, can significantly accelerate the learning process. Another example is the training of robot prototypes in the factory: a reasonably good policy is first searched, regardless of the losses incurred, that can serve as an initialization for a second, regret-minimization phase that starts when the robot is deployed. It is hence also of great practical importance to study the sample complexity of learning, and to work on strategies that might improve, in this perspective, on regret-minimizing algorithms.
In this work, we are interested in the best policy identification (BPI) problem for infinite-horizon discounted MDPs. This framework was introduced by [Fie94] under the name of PAC-RL. In BPI the algorithm explores the MDP until it has gathered enough samples to return an $\varepsilon$-optimal policy with probability at least $1-\delta$. Crucially, the algorithm halts at a random time step, determined by a stopping rule which guarantees that the probability of returning a wrong answer is less than $\delta$. The optimality of BPI algorithms is measured through their sample complexity, defined as their expected stopping time. Best policy identification in MDPs has been mostly investigated under the lens of minimax-optimality. The minimax framework may be overly pessimistic by accounting for the worst possible MDP, whereas an algorithm with instance-specific guarantees can adapt its exploration procedure to the hardness of the MDP instance that it faces. Recently, a few works made the first attempts towards understanding the instance-specific sample complexity of reinforcement learning. However, they typically either make simplifying assumptions such as access to a generative model [ZKB19, MP21], or restrict their attention to episodic MDPs [WSJ21]. In practice however, the samples gathered rather correspond to a single, possibly infinite, trajectory of the system that we wish to control. This motivates our study of the full online setting where observations are only obtained by navigating in the MDP, that is by sequential choices of actions and following the transitions of the MDP.
Our contributions. [MP21] recently proposed an information-theoretical complexity analysis for MDPs in the case of access to a generative model. Here we extend their results to the online setting. Our main goal is to understand how the online learning scheme affects the sample complexity compared to the easier case where we have a generative model. A natural first step consists in understanding how the first-order term changes. Thus we only focus on the asymptotic regime $\delta \to 0$. Our key contributions can be summarized as follows:
• First, we adapt the lower bound of [MP21] to the online setting (Proposition 2). The new bound also writes as the value of a zero-sum two-player game between nature and the algorithm, where the loss function remains the same, but where the set of possible strategies of the algorithm is restricted to a subset $\Omega(\mathcal{M})$ of the simplex of dimension $SA-1$. We refer to the constraints defining $\Omega(\mathcal{M})$ as the navigation constraints.
• We propose MDP-NaS, the first algorithm for the online setting with instance-dependent bounds on its sample complexity in the asymptotic regime $\delta \to 0$ (Theorem 6). (Before publication of this work, but after a preprint was available online, [WSJ21] proposed another algorithm with instance-dependent guarantees.) A major challenge lies in the design of a sampling rule that guarantees that the sampling frequency of state-action pairs converges to some target oracle allocation $\omega_o$. Indeed, since we can no longer choose the next state of the agent, the tracking procedure developed by [GK16] for multi-armed bandits and used in [MP21] for MDPs with a generative model can no longer be applied in our setting. We propose a new sampling rule which performs exploration according to a mixture of the uniform policy and a plug-in estimate of the oracle policy (the policy whose stationary state-action distribution is $\omega_o$) and prove that it satisfies the requirement above. The analysis of our sampling rule relies on an ergodic theorem for non-homogeneous Markov chains of independent interest (Proposition 12).
• We investigate, depending on the communication properties of the ground-truth instance $\mathcal{M}$, the minimal forced-exploration rate in our sampling rule that guarantees the consistency of the plug-in estimator of the oracle policy. Our findings imply that when $\mathcal{M}$ is ergodic, a relatively fast-decreasing exploration rate is sufficient. However, when $\mathcal{M}$ is only assumed to be communicating, one is obliged to resort to a much more conservative exploration rate, governed by a parameter $m$ defined in Lemma 4 that may scale as large as $S-1$ in the worst case.
• Finally, our stopping rule represents the first implementation of the Generalized Likelihood Ratio test for MDPs. Notably, we circumvent the need to solve the max-min program of the lower bound exactly, and show how an upper bound on the best-response problem, such as the one derived in [MP21], can be used to perform a GLR test. This improves upon the sample complexity bound that one obtains using the KL-Ball stopping rule of [MP21]. (Note that the stopping rule is independent of the sampling rule and thus can be used in both the generative and the online settings. Furthermore, one may even obtain an improved factor of 2 by using a deviation inequality for the full distribution of (reward, next-state) instead of a union bound of deviation inequalities for each marginal distribution.)
1.1 Related work
Minimax Best-Policy Identification. BPI in the online setting has been investigated recently by a number of works with minimax sample complexity bounds. In the case of episodic MDPs, [KMDD+21] and [MDJ+20] proposed algorithms that identify an $\varepsilon$-optimal policy at the initial state w.h.p. In contrast, in the case of infinite-horizon MDPs one is rather interested in finding a good policy at every state. Recent works provide convergence analyses for Q-learning [LWC+20b] or policy gradient algorithms [AHKS20], [FYAY21], [ZCA21]. Their results typically state that if the algorithm is fed with the appropriate hyperparameters, it can return an $\varepsilon$-optimal policy w.h.p. after collecting a polynomial number of samples. In practice, a pure exploration agent needs a stopping rule to determine when to halt the learning process. In particular, the question of how to tune the number of iterations of these algorithms without prior knowledge of the ground-truth instance remains open. (Here we chose not to mention works that investigate the sample complexity in the PAC-MDP framework [Kak03]. Indeed, in this framework the sample complexity is rather defined as the number of episodes where the algorithm does not play an $\varepsilon$-optimal policy. As explained in [DMKV21], this objective is closer to regret minimization than to pure exploration.)
Generative Model. A large body of literature focuses on the so-called generative model, where in each step, the algorithm may query a sample (i.e., observe a reward and a next-state drawn from the rewards and transitions distributions respectively) from any given state-action pair [KS98],[KMN99], [EDMM06], [AMK13],[DTC13], [Wan17], [SWW+18], [ZKB19], [AKY20], [LWC+20a], [LCC+21], [MP21].
Instance-specific bounds. Instance-optimal algorithms for Best Arm Identification in multi-armed bandits (MDPs with one state) have been obtained independently by [GK16], [Rus16]. Here, we extend their information-theoretical approach to the problem of Best Policy Identification in MDPs. More recently, [WSJ21] provided an algorithm for BPI in episodic MDPs with instance-specific sample complexity. A more detailed comparison with [WSJ21] can be found in Section 6.
Outline. The rest of the paper is organized as follows: After introducing the setting and giving some notation and definitions in Section 2, we derive in Section 3 a lower bound on the time required by any algorithm navigating the MDP until it is able to identify the best policy with probability at least $1-\delta$. The algorithm is presented in Section 4. Section 5 contains our main results along with a sketch of the analysis. Finally, in Section 6 we compare our results with MOCA, the only other algorithm (to the best of our knowledge) with problem-dependent guarantees in the online setting. Most technical results and proofs are given in the appendix.
2 Setting and notation
Discounted MDPs. We consider infinite-horizon MDPs with discount factor $\gamma$ and finite state and action spaces $\mathcal{S}$ and $\mathcal{A}$, of respective cardinalities $S$ and $A$. Such an MDP $\mathcal{M}$ is defined through its transition and reward distributions $p_{\mathcal{M}}(\cdot|s,a)$ and $q_{\mathcal{M}}(\cdot|s,a)$ for all $(s,a)$. For simplicity, $q_{\mathcal{M}}(\cdot|s,a)$ will also denote the density of the reward distribution w.r.t. a positive measure $\lambda$ with support included in $[0,1]$. Specifically, $p_{\mathcal{M}}(s'|s,a)$ denotes the probability of transitioning to state $s'$ after playing action $a$ at state $s$, while $R(s,a) \sim q_{\mathcal{M}}(\cdot|s,a)$ is the random instantaneous reward that is collected. Finally, $\gamma \in (0,1)$ is a discounting factor. We look for an optimal control policy $\pi^\star$ maximizing the long-term discounted reward $V^\pi(s) \triangleq \mathbb{E}^\pi\big[\sum_{t\geq 0}\gamma^t R(s_t,\pi(s_t)) \,\big|\, s_0=s\big]$, where $\mathbb{E}^\pi$ denotes the expectation w.r.t. the randomness in the rewards and the trajectory when the policy $\pi$ is used. Classically, we denote by $V^\star$ the optimal value function of $\mathcal{M}$ and by $Q^\star$ the optimal $Q$-value function of $\mathcal{M}$. $\Pi^\star(\mathcal{M})$ denotes the set of optimal policies of $\mathcal{M}$.
Problem-dependent quantities. The sub-optimality gap of action $a$ in state $s$ is defined as $\Delta(s,a) \triangleq V^\star(s) - Q^\star(s,a)$. Let $\Delta_{\min} \triangleq \min_{s,\, a\neq \pi^\star(s)} \Delta(s,a)$ be the minimum gap and $\mathrm{sp}(V^\star) \triangleq \max_s V^\star(s) - \min_s V^\star(s)$ be the span of $V^\star$. Finally, we introduce the variance of the value function in $(s,a)$ as $\mathrm{Var}_{p(\cdot|s,a)}[V^\star]$, the variance of $V^\star(s')$ when $s' \sim p_{\mathcal{M}}(\cdot|s,a)$.
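As an illustration of these definitions, the following sketch computes $V^\star$, $Q^\star$, the gaps, the span and the per-pair variances of the value function by standard value iteration on a small randomly generated MDP. It is not code from the paper; the array layout, random instance and tolerance are arbitrary choices made for the example.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10):
    """P[s, a, s'] = transition probabilities, R[s, a] = mean rewards."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (P @ V)          # Q(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) V(s')
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))    # random transition kernel p(.|s,a)
R = rng.uniform(size=(S, A))                  # mean rewards in [0, 1]

V_star, Q_star = value_iteration(P, R, gamma)
pi_star = Q_star.argmax(axis=1)               # optimal policy (unique with probability one)
gaps = V_star[:, None] - Q_star               # sub-optimality gaps Delta(s, a)
delta_min = np.min(gaps[gaps > 1e-8])         # minimum positive gap
span = V_star.max() - V_star.min()            # span of V*
var_V = P @ V_star**2 - (P @ V_star)**2       # Var_{p(.|s,a)}[V*] for every pair (s, a)
print(pi_star, round(delta_min, 4), round(span, 4))
```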
Active pure exploration with fixed confidence. When $\mathcal{M}$ is unknown, we wish to devise a learning algorithm identifying, from a single trajectory, an optimal policy as quickly as possible with some given level of confidence. Formally, such an algorithm consists of (i) a sampling rule, selecting in each round $t$ in an adaptive manner the action $a_t$ to be played; $a_t$ depends on past observations, i.e., it is $\mathcal{F}_t$-measurable, where $\mathcal{F}_t$ is the $\sigma$-algebra generated by observations up to time $t$; (ii) a stopping rule $\tau$, a stopping time w.r.t. the filtration $(\mathcal{F}_t)_{t\geq 1}$, deciding when to stop collecting observations; (iii) a decision rule returning an estimated optimal policy $\hat{\pi}_\tau$.
An algorithm is $\delta$-Probably Correct ($\delta$-PC) over some set of MDPs if, for any MDP $\mathcal{M}$ in this set, it returns (in finite time) an optimal policy with probability at least $1-\delta$. In this paper, we aim to devise a $\delta$-PC algorithm with minimal sample complexity $\mathbb{E}_{\mathcal{M}}[\tau]$. We make the following assumption.
Assumption 1.
We consider the set of communicating MDPs with a unique optimal policy.
We justify the above assumption as follows. (i) We restrict our attention to the case where $\mathcal{M}$ is communicating, for otherwise, if it is multichain, there would be a non-zero probability that the algorithm enters a subclass of states from which there is no possible return. In this case it becomes impossible to identify the global optimal policy (unless we modify the objective to finding the optimal policy in this subchain). (ii) As for the uniqueness of the optimal policy, treating the case of MDPs with multiple optimal policies, or that of $\varepsilon$-optimal policy identification, requires the use of more involved overlapping hypothesis tests, which is already challenging in multi-armed bandits (MDPs with a single state) [GK19]. We will analyze how to remove this assumption in future work.
Notation. $\mathcal{Z} \triangleq \mathcal{S}\times\mathcal{A}$ is the state-action space, and $\Sigma$ denotes the simplex of dimension $SA-1$. $N_t(s,a)$ denotes the number of times the state-action pair $(s,a)$ has been visited up to the end of round $t$. We also introduce $N_t(s) \triangleq \sum_{a} N_t(s,a)$. Similarly, for a vector $\omega \in \Sigma$ we will denote $\omega(s) \triangleq \sum_{a} \omega(s,a)$. For a matrix $B$, the infinity norm is defined as $\|B\|_\infty \triangleq \max_i \sum_j |B_{ij}|$. The KL divergence between two probability distributions $P$ and $Q$ on some discrete space $\mathcal{X}$ is defined as $\mathrm{KL}(P\|Q) \triangleq \sum_{x\in\mathcal{X}} P(x)\log\frac{P(x)}{Q(x)}$. For Bernoulli distributions of respective means $p$ and $q$, the KL divergence is denoted by $\mathrm{kl}(p,q)$. For distributions over $[0,1]$ defined through their densities $f$ and $g$ w.r.t. some positive measure $\lambda$, the KL divergence is $\mathrm{KL}(f\|g) \triangleq \int_0^1 f(x)\log\frac{f(x)}{g(x)}\,\lambda(\mathrm{d}x)$. $\mathbb{M}$ denotes the set of communicating MDPs with a unique optimal policy. For two MDPs $\mathcal{M}, \mathcal{M}'$, we say that $\mathcal{M} \ll \mathcal{M}'$ if, for all $(s,a)$, $p_{\mathcal{M}}(\cdot|s,a) \ll p_{\mathcal{M}'}(\cdot|s,a)$ and $q_{\mathcal{M}}(\cdot|s,a) \ll q_{\mathcal{M}'}(\cdot|s,a)$. In that case, for any state-action pair $(s,a)$, we define the KL divergence of the distributions of the one-step observations under $\mathcal{M}$ and $\mathcal{M}'$ when starting at $(s,a)$ as $\mathrm{KL}_{\mathcal{M}|\mathcal{M}'}(s,a) \triangleq \mathrm{KL}\big(p_{\mathcal{M}}(\cdot|s,a)\,\|\,p_{\mathcal{M}'}(\cdot|s,a)\big) + \mathrm{KL}\big(q_{\mathcal{M}}(\cdot|s,a)\,\|\,q_{\mathcal{M}'}(\cdot|s,a)\big)$.
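For reference, a short sketch of the two divergences used throughout (Bernoulli $\mathrm{kl}$ and KL between discrete distributions); the clipping constant only guards against numerical edge cases and is not part of the mathematical definitions.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """kl(p, q) between Bernoulli distributions of means p and q."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_discrete(P, Q, eps=1e-12):
    """KL(P || Q) between two distributions on the same finite space."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

print(kl_bernoulli(0.5, 0.02))                          # two Bernoulli reward distributions
print(kl_discrete([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))    # two transition rows
```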
3 Sample complexity lower bound
In this section, we first derive a lower bound on the expected sample complexity satisfied by any $\delta$-PC algorithm. The lower bound is obtained as the solution of a non-convex optimization problem, as in the case where the learner has access to a generative model [MP21]. The problem has however additional constraints, referred to as the navigation constraints, due to the fact that the learner has access to a single system trajectory.
The expected sample complexity of an algorithm is $\mathbb{E}_{\mathcal{M}}[\tau] = \sum_{(s,a)\in\mathcal{Z}} \mathbb{E}_{\mathcal{M}}[N_\tau(s,a)]$.
Lower bounds on the sample complexity are derived by identifying constraints that the various $\mathbb{E}_{\mathcal{M}}[N_\tau(s,a)]$'s need to satisfy so as to get a $\delta$-PC algorithm. We distinguish here two kinds of constraints:
Information constraints. These are constraints on the expected numbers of visits $\mathbb{E}_{\mathcal{M}}[N_\tau(s,a)]$, ensuring that the algorithm can learn the optimal policy with probability at least $1-\delta$. They are the same as those derived in [MP21] when the learner has access to a generative model and are recalled in Lemma 9 in Appendix C.
Navigation constraints. Observations come from a single (but controlled) system trajectory, which imposes additional constraints on the $\mathbb{E}_{\mathcal{M}}[N_\tau(s,a)]$'s. These are derived by simply writing the Chapman-Kolmogorov equations of the controlled Markov chain (refer to Appendix C for a proof).
Lemma 1.
For any algorithm and for any state $s$, we have:
$$\Big|\, \sum_{a\in\mathcal{A}} \mathbb{E}_{\mathcal{M}}[N_\tau(s,a)] \;-\; \sum_{(s',a')\in\mathcal{Z}} p_{\mathcal{M}}(s|s',a')\, \mathbb{E}_{\mathcal{M}}[N_\tau(s',a')] \,\Big| \;\leq\; 1. \qquad (1)$$
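In normalized form (dividing the expected visit counts by the expected horizon and letting it grow), the constraint of Lemma 1 says that an achievable allocation must be a stationary state-action distribution of the controlled chain. The sketch below checks this numerically for an arbitrary random MDP and policy: the stationary distribution of $P_\pi$ satisfies $\sum_a \omega(s,a) = \sum_{s',a'} p(s|s',a')\,\omega(s',a')$ for every state. All instance parameters here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=(S, A))       # p(s'|s,a)
pi = rng.dirichlet(np.ones(A), size=S)           # an arbitrary stochastic policy pi(a|s)

# Kernel on state-action pairs: P_pi[(s,a),(s',a')] = p(s'|s,a) * pi(a'|s')
P_pi = np.einsum('sax,xb->saxb', P, pi).reshape(S * A, S * A)

# Stationary distribution of P_pi: left eigenvector associated with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
w = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
w = (w / w.sum()).reshape(S, A)                  # omega(s, a)

lhs = w.sum(axis=1)                              # sum_a omega(s, a)
rhs = np.einsum('xas,xa->s', P, w)               # sum_{s',a'} p(s|s',a') omega(s',a')
print(np.allclose(lhs, rhs))                     # True: the navigation constraints hold
```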
Putting all constraints together, we obtain the following sample complexity lower bound.
Proposition 2.
Define the set of navigation-constrained allocation vectors:
$$\Omega(\mathcal{M}) \;\triangleq\; \Big\{\, \omega \in \Sigma \;:\; \forall s \in \mathcal{S},\ \ \omega(s) \,=\, \sum_{(s',a') \in \mathcal{Z}} p_{\mathcal{M}}(s|s',a')\,\omega(s',a') \,\Big\}.$$
Further define $\mathrm{Alt}(\mathcal{M}) \triangleq \{\mathcal{M}' :\ \mathcal{M} \ll \mathcal{M}' \text{ and } \Pi^\star(\mathcal{M}') \cap \Pi^\star(\mathcal{M}) = \emptyset\}$, the set of alternative instances. Then the expected sample complexity of any $\delta$-PC algorithm satisfies:
$$\mathbb{E}_{\mathcal{M}}[\tau] \;\geq\; T_o(\mathcal{M})\,\mathrm{kl}(\delta,1-\delta), \quad \text{where} \quad T_o(\mathcal{M})^{-1} \;\triangleq\; \sup_{\omega \in \Omega(\mathcal{M})} T(\mathcal{M},\omega)^{-1} \qquad (2)$$
and
$$T(\mathcal{M},\omega)^{-1} \;\triangleq\; \inf_{\mathcal{M}' \in \mathrm{Alt}(\mathcal{M})} \;\sum_{(s,a) \in \mathcal{Z}} \omega(s,a)\,\mathrm{KL}_{\mathcal{M}|\mathcal{M}'}(s,a). \qquad (3)$$
Remark 1.
A common interpretation of change-of-measure lower bounds like the one above is the following: the optimization problem in the definition of $T_o(\mathcal{M})$ can be seen as the value of a two-player zero-sum game between an algorithm which samples each state-action pair $(s,a)$ proportionally to $\omega(s,a)$ and an adversary who chooses an alternative instance $\mathcal{M}'$ that is difficult to distinguish from $\mathcal{M}$ under the algorithm's sampling strategy. This suggests that an optimal algorithm should play the optimal allocation $\omega^*$ that solves the optimization problem (2) and, as a consequence, rule out all alternative instances as fast as possible.
Remark 2.
Note that, compared to the lower bound of the generative setting in [MP21], the only change is in the set of sampling strategies that the algorithm can play, which is no longer the entire simplex but the subset $\Omega(\mathcal{M})$.
3.1 Proxy for the optimal allocation and the characteristic time
As shown in [MP21], even without accounting for the navigation constraints, computing the characteristic time, and in particular the optimal allocation $\omega^*$ leading to it, is not easy. Indeed, the sub-problem corresponding to computing the best-response value $T(\mathcal{M},\omega)$ is non-convex. This makes it difficult to design an algorithm that targets the optimal weights $\omega^*$. Instead, we use a tractable upper bound $U(\mathcal{M},\omega)$ of $T(\mathcal{M},\omega)$ from [MP21]:
Lemma 3.
Using $U(\mathcal{M},\omega)$, we obtain the following upper bound on the characteristic time (2):
$$T_o(\mathcal{M}) \;\leq\; U_o(\mathcal{M}) \;\triangleq\; \inf_{\omega \in \Omega(\mathcal{M})} U(\mathcal{M},\omega). \qquad (6)$$
The advantages of the above upper bound are that: (i) it is a problem-specific quantity, as it depends on the gaps and on the variances of the value function in $\mathcal{M}$; (ii) the corresponding allocation $\omega_o(\mathcal{M})$ (that solves (6)) can be easily computed, and hence targeted. Indeed, the optimization problem in (6) has a convex objective and convex constraints. Therefore, we can use the projected subgradient-descent algorithm to compute
$$\omega_o(\mathcal{M}) \;\triangleq\; \operatorname*{arg\,min}_{\omega \in \Omega(\mathcal{M})}\; U(\mathcal{M},\omega), \qquad (7)$$
which will be used in our algorithm as a proxy for $\omega^*$. (Note that if we have access to an optimization oracle that, given an MDP $\mathcal{M}$, returns the optimal allocation solving (2), then we can replace $\omega_o(\mathcal{M})$ by this optimal allocation. Our algorithm will then be asymptotically optimal up to a factor of 2.)
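To give an idea of how such a proxy allocation can be computed in practice, here is a rough sketch that minimizes a generic smooth convex surrogate $\sum_{(s,a)} c(s,a)/\omega(s,a)$ over the polytope $\Omega(\mathcal{M})$ with an off-the-shelf solver. The per-pair "hardness" vector `c` is a placeholder standing in for the terms of $U(\mathcal{M},\omega)$ (whose exact expression from [MP21] is not reproduced here), and using SciPy's SLSQP instead of projected subgradient descent is a simplification for the example; this is not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))       # p(s'|s,a)
c = rng.uniform(0.1, 1.0, size=(S, A))           # placeholder per-pair hardness terms

def objective(w_flat):
    # Smooth convex surrogate of the upper bound: sum_(s,a) c(s,a) / omega(s,a).
    w = w_flat.reshape(S, A)
    return float(np.sum(c / np.maximum(w, 1e-12)))

def nav_residual(w_flat):
    # Navigation constraints: sum_a w(s,a) = sum_{s',a'} p(s|s',a') w(s',a') for all s
    # (the constraint for the last state is redundant and dropped).
    w = w_flat.reshape(S, A)
    return (w.sum(axis=1) - np.einsum('xas,xa->s', P, w))[:-1]

constraints = [{'type': 'eq', 'fun': nav_residual},
               {'type': 'eq', 'fun': lambda w: w.sum() - 1.0}]
w0 = np.full(S * A, 1.0 / (S * A))
res = minimize(objective, w0, method='SLSQP', bounds=[(1e-6, 1.0)] * (S * A),
               constraints=constraints)
omega_o = res.x.reshape(S, A)                    # navigation-constrained proxy allocation
print(omega_o.round(3), round(res.fun, 3))
```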
4 Algorithm
We propose MDP-NaS (MDP Navigate-and-Stop), a model-based algorithm inspired by the lower bound. The lower bound suggests that, to identify the best policy in a sample-efficient manner, the algorithm must collect samples from each state-action pair $(s,a)$ proportionally to $\omega_o(s,a)$. We propose two sampling rules which ensure that this statement holds in the long term (see Section 4.1 and Theorem 7 for a rigorous formulation). Our sampling rules are combined with a Generalized Likelihood Ratio (GLR) test (or rather a proxy of the GLR test, see Section 4.2 for details), which stops as soon as we are confident that the empirical optimal policy is the optimal policy of $\mathcal{M}$ with probability at least $1-\delta$. The pseudo-code for MDP-NaS is given in Algorithm 1.
4.1 Sampling rule
We introduce a few definitions to simplify the presentation. Any stationary policy $\pi$ induces a Markov chain on $\mathcal{Z}$ whose transition kernel is defined by $P_\pi\big((s,a),(s',a')\big) \triangleq p(s'|s,a)\,\pi(a'|s')$. With some abuse of notation, we will use $P_\pi$ to refer to both the Markov chain and its kernel. We denote by $\pi_u$ the uniform random policy, i.e., $\pi_u(a|s) = 1/A$ for all pairs $(s,a)$. Finally, we define the vector of visit-frequencies $N(t)/t \triangleq \big(N_t(s,a)/t\big)_{(s,a)\in\mathcal{Z}}$.
In contrast with pure exploration in multi-armed bandits and in MDPs with a generative model, where any allocation vector in the simplex is achievable, here the agent can only choose a sequence of actions and follow the resulting trajectory. Therefore, one might ask whether the oracle allocation $\omega_o$ can be achieved by following a simple policy. A natural candidate is the oracle policy $\pi_o$ defined by
$$\pi_o(a|s) \;\triangleq\; \frac{\omega_o(s,a)}{\sum_{b \in \mathcal{A}} \omega_o(s,b)}. \qquad (8)$$
It is immediate to check that $\omega_o$ is the stationary distribution of $P_{\pi_o}$. Policy $\pi_o$ is the target that we would like to play, but since the rewards and dynamics of $\mathcal{M}$ are unknown, the actions must be chosen so as to estimate the oracle policy consistently while at the same time ensuring that $N(t)/t$ converges to $\omega_o$. The following two sampling rules satisfy these requirements.
D-Navigation rule: At time step $t$, the learner plays the policy
$$\pi_t \;=\; \varepsilon_t\, \pi_u \;+\; (1-\varepsilon_t)\,\hat{\pi}_t, \qquad (9)$$
C-Navigation rule: At time step $t$, the learner plays
$$\pi_t \;=\; \varepsilon_t\, \pi_u \;+\; (1-\varepsilon_t)\,\frac{1}{t}\sum_{j=1}^{t} \hat{\pi}_j, \qquad (10)$$
where $\hat{\pi}_t$ denotes the oracle policy of the estimated MDP at time $t$ and $(\varepsilon_t)_{t \geq 1}$ is a decreasing sequence of exploration rates to be tuned later. In the second case, the agent navigates using a Cesàro-mean of oracle policies instead of the current estimate of the oracle policy, which makes the navigation more stable.
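A minimal sketch of the two navigation rules. The routine `oracle_policy` normalizes an allocation into a policy as in (8); in the sketch the estimated oracle allocation is passed in as a sequence (in the actual algorithm it would be recomputed from the empirical MDP), and the exploration-rate schedule in the usage example is an arbitrary placeholder rather than the tuning of Lemma 4.

```python
import numpy as np

rng = np.random.default_rng(3)

def oracle_policy(omega):
    """Eq. (8): turn an allocation omega(s,a) into a policy pi_o(a|s)."""
    return omega / omega.sum(axis=1, keepdims=True)

def navigate(P, omega_hats, T, eps_fn, mode="C"):
    """Roll out D- or C-Navigation for T steps on the kernel P[s,a,s'].
    omega_hats[t] plays the role of the estimated oracle allocation at step t."""
    S, A, _ = P.shape
    pi_u = np.full((S, A), 1.0 / A)              # uniform policy
    cesaro = np.zeros((S, A))
    counts = np.zeros((S, A))
    s = 0
    for t in range(1, T + 1):
        pi_hat = oracle_policy(omega_hats[t - 1])
        cesaro += pi_hat
        target = pi_hat if mode == "D" else cesaro / t
        pi_t = eps_fn(t) * pi_u + (1 - eps_fn(t)) * target    # mixture policies (9)/(10)
        a = rng.choice(A, p=pi_t[s])
        counts[s, a] += 1
        s = rng.choice(S, p=P[s, a])
    return counts / T                            # empirical visit frequencies N(t)/t

S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))
omega = rng.dirichlet(np.ones(S * A)).reshape(S, A)    # a fixed stand-in allocation
print(navigate(P, [omega] * 2000, T=2000, eps_fn=lambda t: t ** -0.5).round(3))
```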
4.1.1 Tuning the forced exploration parameter
The mixture with the uniform policy $\pi_u$ helps the agent explore all state-action pairs often enough in the initial phase. (Our results still hold if $\pi_u$ is replaced by any other policy whose kernel is ergodic. In practice, this can be helpful especially when we have prior knowledge of a fast-mixing policy.) This forced exploration is particularly necessary when the empirical weight of some state-action pair is under-estimated, which may cause this pair to be left aside so that the estimate is never corrected. The next result gives a tuning of the rate $(\varepsilon_t)_{t\geq 1}$ that ensures sufficient forced exploration:
Lemma 4.
Let $m$ be the maximum length of the shortest paths (in terms of number of transitions) between pairs of states in $\mathcal{M}$: $m \triangleq \max_{(s,s')\in\mathcal{S}^2} \min\{\, n \geq 1 : \exists \pi,\ \mathbb{P}^\pi(s_n = s' \,|\, s_0 = s) > 0 \,\}$. Then C-Navigation or D-Navigation with any decreasing sequence $(\varepsilon_t)_{t\geq 1}$ that vanishes slowly enough satisfies $\mathbb{P}\big(\forall (s,a)\in\mathcal{Z},\ \lim_{t\to\infty} N_t(s,a) = \infty\big) = 1$.
Remark 3.
When the parameter $m$ is unknown to the learner, one can replace it by its worst-case value $S-1$. However, when prior knowledge is available, using a faster-decreasing sequence of exploration rates can be useful to accelerate convergence, especially when the states of $\mathcal{M}$ are densely connected ($m \ll S-1$).
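Since $m$ only depends on the support of the transition kernel, it can be computed exactly by a breadth-first search on the directed graph in which $s \to s'$ whenever some action reaches $s'$ from $s$ with positive probability. A small sketch (the chain-like kernel at the bottom is just an example):

```python
import numpy as np
from collections import deque

def max_shortest_path(P):
    """m = max over ordered pairs (s, s') of the least number of transitions
    needed to reach s' from s with positive probability (under some policy)."""
    S = P.shape[0]
    reach = (P > 0).any(axis=1)            # reach[s, s'] = True if some action goes s -> s'
    m = 0
    for s in range(S):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in np.flatnonzero(reach[u]):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        m = max(m, max(dist.values()))     # assumes the MDP is communicating
    return m

# Chain-like kernel: RIGHT moves one step right, LEFT jumps back to state 0.
S, A = 5, 2
P = np.zeros((S, A, S))
P[:, 0, 0] = 1.0                           # LEFT
for s in range(S):
    P[s, 1, min(s + 1, S - 1)] = 1.0       # RIGHT
print(max_shortest_path(P))                # here m = S - 1 = 4
```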
Minimal exploration rate: Communicating MDPs. The forced exploration rate of Lemma 4 vanishes quite slowly. One may wonder whether such a slow rate is necessary to guarantee sufficient exploration in communicating MDPs: the answer is yes in the worst case, as the following example shows. Consider a variant of the classical RiverSwim MDP with state space $\{1,\ldots,S\}$ and action space $\{\text{RIGHT}, \text{LEFT1}, \text{LEFT2}\}$. LEFT1 and LEFT2 are equivalent actions inducing the same rewards and transitions. After playing RIGHT the agent makes a transition of one step to the right, while playing LEFT1 (or LEFT2) moves the agent all the way back to state $1$. The rewards are null everywhere except in the two extreme states: the rewards in state $1$ are deterministic, while the reward in state $S$ is Bernoulli with mean 0.02.
When $\gamma$ is close to $1$, the optimal policy consists in always playing RIGHT so as to reach state $S$ and stay there indefinitely. Now suppose that the agent starts at state $1$ and, due to the small probability of observing the large reward in state $S$, she underestimates the value of this state in the first rounds and focuses on distinguishing the best action to play in state $1$ among LEFT1 and LEFT2. Under this scenario she ends up playing a sequence of policies whose probability of playing RIGHT scales with the exploration rate $\varepsilon_t$ (modulo a re-scaling of $\varepsilon_t$ by a constant factor of 1/2). This induces the non-homogeneous Markov chain depicted in Figure 1. We show that if the agent uses an exploration rate that decreases too fast, then with non-zero probability she will visit state $S$ only a finite number of times. Therefore she will fail to identify the optimal policy for small values of the confidence level $\delta$. The proof is deferred to Appendix D.
Minimal exploration rate: Ergodic MDPs. When $\mathcal{M}$ is ergodic, the number of visits to every state grows linearly under any sequence of policies, so a much faster-decreasing exploration rate suffices to guarantee that every state-action pair is visited infinitely often almost surely (see Appendix D.3).
4.2 Stopping rule
To implement a GLR test, we define $L_t(\mathcal{M}')$, the likelihood of the observations under some MDP $\mathcal{M}'$: $L_t(\mathcal{M}') \triangleq \prod_{k=1}^{t} q_{\mathcal{M}'}(r_k|s_k,a_k)\, p_{\mathcal{M}'}(s_{k+1}|s_k,a_k)$, where at step $k$ the algorithm is in state $s_k$, plays action $a_k$, and observes the reward $r_k$ and $s_{k+1}$ (the next state). Performing a GLR test at step $t$ consists in computing the optimal policy $\hat{\pi}^\star_t$ of the estimated MDP $\widehat{\mathcal{M}}_t$ and in comparing the likelihood of the observations under the most likely model where $\hat{\pi}^\star_t$ is optimal to the likelihood under the most likely model where $\hat{\pi}^\star_t$ is sub-optimal. Following the standardized form of the GLR for multi-armed bandits in [DKM19], we write:
$$G_t \;\triangleq\; \log \frac{\sup_{\mathcal{M}' :\, \hat{\pi}^\star_t \in \Pi^\star(\mathcal{M}')} L_t(\mathcal{M}')}{\sup_{\mathcal{M}' \in \mathrm{Alt}(\widehat{\mathcal{M}}_t)} L_t(\mathcal{M}')} \qquad (11)$$
$$\phantom{G_t \;\triangleq\;} \geq\; \inf_{\mathcal{M}' \in \mathrm{Alt}(\widehat{\mathcal{M}}_t)} \log \frac{L_t(\widehat{\mathcal{M}}_t)}{L_t(\mathcal{M}')}. \qquad (12)$$
The hypothesis that $\hat{\pi}^\star_t$ is sub-optimal is then rejected as soon as this ratio of likelihoods becomes greater than a threshold $\beta(t,\delta)$, properly tuned to ensure that the algorithm is $\delta$-PC. Note that the likelihood ratio itself can be difficult to compute for MDPs since it is equivalent to solving (3); we therefore lower bound it using (12) and Lemma 3. This leads to the following proposition:
Proposition 5.
Define the random thresholds for the transitions and the rewards respectively, as given explicitly in Appendix E. Then the stopping rule:
(13) |
is $\delta$-PC, i.e., $\mathbb{P}_{\mathcal{M}}\big(\tau < \infty,\ \hat{\pi}_\tau \notin \Pi^\star(\mathcal{M})\big) \leq \delta$.
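A schematic sketch of how the stopping test plugs together. The quantity `kl_to_closest_alt` stands for the per-pair divergences between the empirical MDP and the most confusing alternative (i.e., the lower bound on the GLR coming from (12)), and `threshold` stands for $\beta(t,\delta)$; both are supplied by the caller, and the particular threshold used in the usage example is a placeholder, not the calibrated expression of Proposition 5.

```python
import numpy as np

def should_stop(counts, kl_to_closest_alt, threshold, t, delta):
    """Generic GLR-style test: stop when the (lower bound on the) log-likelihood
    ratio sum_(s,a) N_t(s,a) * KL_hat(s,a) exceeds the threshold beta(t, delta).

    counts[s, a]            : N_t(s, a)
    kl_to_closest_alt[s, a] : divergence between the one-step observation distributions
                              of the empirical MDP and the most confusing alternative
    threshold(t, delta)     : the calibrated threshold beta(t, delta)
    """
    glr_lower_bound = float(np.sum(counts * kl_to_closest_alt))
    return glr_lower_bound >= threshold(t, delta)

# Illustrative use with placeholder inputs (not the paper's threshold):
beta = lambda t, delta: np.log(1.0 / delta) + 3 * np.log(1.0 + np.log(t))
counts = np.array([[120.0, 40.0], [80.0, 60.0]])
kl_hat = np.array([[0.02, 0.05], [0.01, 0.04]])
print(should_stop(counts, kl_hat, beta, t=counts.sum(), delta=0.1))
```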
5 Main results and sample complexity analysis
First we state our main results, which take the form of asymptotic upper bounds on the sample complexity of MDP-NaS, under a slightly more conservative exploration rate than in Lemma 4. Then we present the most important ingredients in their proof. The complete proof is provided in Appendix F.
Theorem 6.
Under the exploration rate mentioned above, the stopping time $\tau$ of MDP-NaS satisfies: (i) with C-Navigation or D-Navigation, $\limsup_{\delta \to 0} \frac{\tau}{\log(1/\delta)} \leq 2\,U_o(\mathcal{M})$ almost surely; (ii) with C-Navigation, $\limsup_{\delta \to 0} \frac{\mathbb{E}_{\mathcal{M}}[\tau]}{\log(1/\delta)} \leq 2\,U_o(\mathcal{M})$.
5.1 Proof sketch
Concentration of empirical MDPs: The starting point of our proof is a concentration event of the empirical estimates around . For and , we define , where is a semi-norm on MDPs. Then we prove in Lemma 18 that holds with high probability in the sense that for all :
(14) |
For this purpose we derive, for all pairs $(s,a)$, a lower bound on $N_t(s,a)$ that holds with high probability and involves a parameter depending on the mixing properties of $\mathcal{M}$ under the uniform policy. (Using the method in our proof, one can derive lower bounds of this type for a range of exploration rates; our particular choice enables us to have an explicit formula, see Lemma 11 in the appendix.) These high-probability lower bounds on $N_t(s,a)$ contrast with their deterministic equivalent obtained for C-tracking in [GK16].
Concentration of visit-frequencies: Before we proceed, we make the following assumption.
Assumption 2.
The Markov chain $P_{\pi_u}$ induced by the uniform policy is aperiodic. (This assumption is mild, as it is enough to have one state $s$ and one action $a$ with $p(s|s,a) > 0$ for it to be satisfied. Furthermore, Assumptions 1 and 2 combined are still less restrictive than the usual ergodicity assumption, which requires that the Markov chains of all policies are ergodic.)
Under Assumptions 1 and 2, the kernel $P_{\pi_u}$, and consequently also the kernel of any policy mixed with $\pi_u$, becomes ergodic. Hence its iterates converge to the corresponding stationary distribution at a geometric speed. Also note that the Markov chains induced by playing either of our sampling rules are non-homogeneous, with a sequence of kernels that is history-dependent. To tackle this difficulty, we adapt a powerful ergodic theorem (see Proposition 12 in the appendix) from [FMP11], originally derived for adaptive Markov Chain Monte-Carlo (MCMC) algorithms, to get the following result.
Theorem 7.
Using C-Navigation or D-Navigation, we have $\lim_{t \to \infty} N(t)/t = \omega_o(\mathcal{M})$ almost surely.
Lemma 4 and Theorem 7 combined prove Theorem 6 (i) in a straightforward fashion. The proof of Theorem 6 (ii) is more involved. Again, we adapt the proof method from [FMP11] to derive a finite-time version of Proposition 12, which results in the following proposition.
Proposition 8.
Under C-Navigation, for all , there exists a time such that for all , all and all functions , we have:
where is a mapping with values in such that .
Now define . Then Proposition 8 and Eq. (14) combined imply that for large enough, the event holds w.h.p. so that the expected stopping time is finite on the complementary event: . Now given the asymptotic shape of the thresholds: , we may informally write:
where . Taking the limits when and go to zero respectively concludes the proof.
6 Comparison with MOCA
Recall from Lemma 3 and Theorem 6 that our sample complexity bound writes as:
Hence, in the asymptotic regime $\delta \to 0$, MDP-NaS finds the optimal way to balance exploration between state-action pairs proportionally to their hardness: for sub-optimal pairs (resp. for optimal pairs).
After a preprint of this work was published, [WSJ21] proposed MOCA, an algorithm for BPI in the episodic setting. MOCA has the advantage of treating the more general case of $\varepsilon$-optimal policy identification, with finite-time guarantees on its sample complexity. The two papers have different and complementary objectives, but one can compare with their bound for exact policy identification, i.e., when $\varepsilon$ is below the threshold value for which the only $\varepsilon$-optimal policy is the best policy, in the asymptotic regime $\delta \to 0$. (However, when the objective is to find the optimal policy, it is not clear from their paper how one can determine such a value of $\varepsilon$ without prior knowledge of the ground-truth instance.) In this case, by carefully inspecting the proofs of [WSJ21], we see that MOCA's sample complexity writes as follows (one of the terms comes from the sample complexity of their sub-routine FindExplorableSets):
where is the horizon and is the sub-optimality gap of at time step . We make the following remarks about the bounds above:
1. MOCA only pays the cost of the worst-case visitation probability multiplied by the gap of the corresponding state, whereas MDP-NaS pays a double worst-case cost: the smallest visitation probability multiplied by the minimum gap $\Delta_{\min}$. As pointed out by [WSJ21], the former scaling is better, especially when the state where the minimum gap is achieved is different from the one that is hardest to reach. This is however an artefact of the upper bound in Lemma 3 that MDP-NaS uses as a proxy for the characteristic time. Using a more refined bound, or an optimization oracle that solves the best-response problem, one can remove this double worst-case dependency.
2. The sample complexity of MOCA divided by $\log(1/\delta)$ still explodes when $\delta$ goes to zero, contrary to MDP-NaS's bound.
3. The sample complexity of MDP-NaS is variance-sensitive for sub-optimal state-action pairs, while MOCA's bound depends on a worst-case factor. Indeed, as the rewards are in $[0,1]$, the variances of the value function are always bounded by this worst-case factor in the episodic setting.
We conclude this section by noting that MDP-NaS has a simple design and can easily be implemented.
7 Conclusion
To the best of our knowledge, this paper is the first to propose an algorithm with instance-dependent sample complexity for Best Policy Identification (BPI) in the full online setting. Our results are encouraging as they show: 1) how the navigation constraints of online RL impact the difficulty of learning a good policy, compared to the more relaxed sampling schemes of multi-armed bandits and MDPs with a generative model; 2) that, provided access to an optimization oracle that solves the information-theoretical lower bound (resp. some convex relaxation of the lower bound), asymptotically optimal (resp. near-optimal) sample complexity is still achievable through adaptive control of the trajectory. This opens up exciting new research questions. First, it is intriguing to understand how the mixing times (and not just the stationary distributions) of the Markov chains induced by policies impact the sample complexity of BPI in the moderate-confidence regime. A second direction would be to extend our contributions to the problem of finding an $\varepsilon$-optimal policy, which is of more practical interest than identifying the best policy.
References
- [AHKS20] Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. Pc-pg: Policy cover directed exploration for provable policy gradient learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13399–13412. Curran Associates, Inc., 2020.
- [AJO09] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2009.
- [AKY20] Alekh Agarwal, Sham Kakade, and Lin F. Yang. Model-based reinforcement learning with a generative model is minimax optimal. volume 125 of Proceedings of Machine Learning Research, pages 67–83. PMLR, 09–12 Jul 2020.
- [AMK13] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3):325–349, 2013.
- [BK97] Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.
- [CT06] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006.
- [DKM19] Rémy Degenne, Wouter M. Koolen, and Pierre Ménard. Non-asymptotic pure exploration by solving games. In NeurIPS, 2019.
- [DMKV21] O. D. Domingues, Pierre Ménard, E. Kaufmann, and Michal Valko. Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. In ALT, 2021.
- [DTC13] Thomas Dietterich, Majid Taleghan, and Mark Crowley. Pac optimal planning for invasive species management : Improved exploration for reinforcement learning from simulator-defined mdps. 01 2013.
- [EDMM06] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 06 2006.
- [FCG10] Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning and kullback-leibler divergence. In Allerton Conference on Communication, Control, and Computing, pages 115–122, Sep. 2010.
- [Fie94] C. Fiechter. Efficient reinforcement learning. In COLT ’94, 1994.
- [FMP11] G. Fort, É. Moulines, and P. Priouret. Convergence of adaptive and interacting markov chain monte carlo algorithms. Annals of Statistics, 39:3262–3289, 2011.
- [FYAY21] Fei Feng, W. Yin, Alekh Agarwal, and Lin F. Yang. Provably correct optimization and exploration with non-linear policies. ArXiv, abs/2103.11559, 2021.
- [GK16] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 998–1027, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.
- [GK19] A. Garivier and Emilie Kaufmann. Non-asymptotic sequential tests for overlapping hypotheses and application to near optimal arm identification in bandit models. arXiv preprint, Statistics Theory, arXiv:1905.03495, 2019.
- [HH80] P. Hall and C.C. Heyde. 2 - inequalities and laws of large numbers. In P. Hall and C.C. Heyde, editors, Martingale Limit Theory and its Application, Probability and Mathematical Statistics: A Series of Monographs and Textbooks, pages 13–50. Academic Press, 1980.
- [JKM+20] Anders Jonsson, Emilie Kaufmann, Pierre Menard, Omar Darwiche Domingues, Edouard Leurent, and Michal Valko. Planning in markov decision processes with gap-dependent sample complexity. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1253–1263. Curran Associates, Inc., 2020.
- [Kak03] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London, England, 2003.
- [KCG16] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.
- [KK18] Emilie Kaufmann and Wouter M. Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. arXiv preprint, arXiv:1811.11419, 2018.
- [KMDD+21] Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, and Michal Valko. Adaptive reward-free exploration. In Vitaly Feldman, Katrina Ligett, and Sivan Sabato, editors, Proceedings of the 32nd International Conference on Algorithmic Learning Theory, volume 132 of Proceedings of Machine Learning Research, pages 865–891. PMLR, 16–19 Mar 2021.
- [KMN99] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large markov decision processes. In Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’99, page 1324–1331, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
- [KS98] Michael Kearns and Satinder Singh. Finite-sample convergence rates for q-learning and indirect algorithms. In Proceedings of the 11th International Conference on Neural Information Processing Systems, NIPS’98, page 996–1002, Cambridge, MA, USA, 1998. MIT Press.
- [LCC+21] Gen Li, Changxiao Cai, Yuxin Chen, Yuantao Gu, Yuting Wei, and Yuejie Chi. Is q-learning minimax optimal? a tight sample complexity analysis, 2021.
- [LPW06] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov chains and mixing times. American Mathematical Society, 2006.
- [LR85] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
- [LS20] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
- [LWC+20a] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. arXiv preprint, arXiv:2005.12900, 2020.
- [LWC+20b] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction, 2020.
- [MDJ+20] Pierre M’enard, O. D. Domingues, Anders Jonsson, E. Kaufmann, Edouard Leurent, and Michal Valko. Fast active learning for pure exploration in reinforcement learning. ArXiv, abs/2007.13442, 2020.
- [MP21] Aymen Al Marjani and Alexandre Proutiere. Adaptive sampling for best policy identification in markov decision processes. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 7459–7468. PMLR, 18–24 Jul 2021.
- [Rus16] D. Russo. Simple bayesian algorithms for best arm identification. In Proceedings of the 29th Conference On Learning Theory, 2016.
- [SB18] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
- [Sch68] P. Schweitzer. Perturbation theory and finite markov chains. Journal of Applied Probability, 5:401–413, 1968.
- [SL08] Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
- [SWW+18] Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving markov decision processes with a generative model. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5186–5196. Curran Associates, Inc., 2018.
- [Sze10] Csaba Szepesvari. Algorithms for Reinforcement Learning. Morgan and Claypool Publishers, 2010.
- [Wan17] Mengdi Wang. Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear (Sometimes Sublinear) Running Time. arXiv e-prints, page arXiv:1704.01869, April 2017.
- [WSJ21] Andrew Wagenmaker, Max Simchowitz, and Kevin Jamieson. Beyond No Regret: Instance-Dependent PAC Reinforcement Learning. arXiv e-prints, page arXiv:2108.02717, August 2021.
- [ZCA21] Andrea Zanette, Ching-An Cheng, and Alekh Agarwal. Cautiously optimistic policy optimization and exploration with linear function approximation. ArXiv, abs/2103.12923, 2021.
- [ZKB19] Andrea Zanette, Mykel J Kochenderfer, and Emma Brunskill. Almost horizon-free structure-aware best policy identification with a generative model. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5625–5634. Curran Associates, Inc., 2019.
Appendix A Symbols
| Symbol | Definition |
|---|---|
| $m$ | Maximum length of shortest paths between pairs of states |
| $P_{\pi_u}$ | Transition kernel of the uniform policy |
| | Stationary distribution of $P_{\pi_u}$ |
| | Communication parameter |
| $\omega_o$ | Oracle weights |
| $\pi_o$ | Oracle policy |
| | Condition number |
| $P_\pi$ | Kernel of the policy $\pi$ |
| | Stationary distribution of $P_\pi$ |
| | Constants such that |
Appendix B Experiments
In this section, we test our algorithm on two small examples. The first instance is a small ergodic MDP: the rewards of each state-action pair come from independent Bernoulli distributions with means sampled uniformly at random, and the transition kernels were generated following a Dirichlet distribution. The second instance is the classical RiverSwim from [SL08], which is communicating but not ergodic; the instance we used has deterministic transitions, and rewards are null everywhere except in the two extreme states, where they are Bernoulli. We fix a confidence level $\delta$, and for each of these MDPs we run Monte-Carlo simulations of MDP-NaS with either C-Navigation or D-Navigation. Towards computational efficiency, we note that the empirical oracle policy does not change significantly after collecting one sample; therefore we only update it periodically. (The update period was chosen so as to save computation time, knowing that for MDPs, algorithms usually require a large number of samples to return a reasonably good policy.)
First, we seek to check whether the frequencies of state-action visits converge to their oracle weights, as stated in Theorem 7. Figure 2 shows the relative distance, in log scale, between the vector of empirical frequencies $N(t)/t$ and the oracle allocation $\omega_o$. The shaded area represents lower and upper quantiles over the runs. We see that the relative distance steadily decreases with time, indicating that the visit-frequencies of both D-Navigation and C-Navigation converge to the oracle allocation. We also note that the D-Navigation rule exhibits faster convergence than the C-Navigation rule.
Next we compare our algorithm with Variance-Reduced Q-Learning (VRQL) [LWC+20b], a variant of classical Q-learning with faster convergence rates. VRQL finds an accurate estimate of the Q-function by following a fixed sampling rule, referred to as the behavior policy, and updating its estimate of the Q-function via Temporal Difference learning. VRQL does not have a stopping rule, but is guaranteed to yield an estimate of the prescribed accuracy with high probability after using a number of samples given by
a formula involving the minimum state-action occupancy and the mixing time of the behavior policy, along with some large enough universal constants. We use VRQL with the uniform policy as the sampling rule (in the absence of prior knowledge about the MDP, the uniform policy is a reasonable choice to maximize the minimum occupancy) and with an accuracy small enough to identify the best policy. We plug these values into the equations above and then compute the sample complexity of VRQL. Table 2 shows a comparison of the sample complexities of MDP-NaS and VRQL. MDP-NaS has much better performance than VRQL.
| | MDP-NaS | VRQL |
|---|---|---|
| Small ergodic MDP | | |
| RiverSwim | | |
Appendix C Sample complexity lower bound
Let $\mathrm{Alt}(\mathcal{M})$ denote the set of MDPs $\mathcal{M}'$ such that $\mathcal{M} \ll \mathcal{M}'$ and $\Pi^\star(\mathcal{M}') \cap \Pi^\star(\mathcal{M}) = \emptyset$. The information constraints are obtained by change-of-measure arguments as in the bandit literature [LR85, KCG16]:
Lemma 9.
([MP21]) For any $\delta$-PC algorithm and for any $\mathcal{M}' \in \mathrm{Alt}(\mathcal{M})$, we have:
$$\sum_{(s,a)\in\mathcal{Z}} \mathbb{E}_{\mathcal{M}}[N_\tau(s,a)]\,\mathrm{KL}_{\mathcal{M}|\mathcal{M}'}(s,a) \;\geq\; \mathrm{kl}(\delta, 1-\delta). \qquad (15)$$
C.1 Navigation constraints: proof of Lemma 1
For all states ,
where denotes the state observed after the -th time has been visited. Fix . Introduce . Observe that and are independent. Furthermore, . Hence:
Finally,
(16) |
From the above equality, the lemma is proved by just observing that , for any , and for any .
C.2 Lower bound: proof of Proposition 2
Proposition 10.
The expected sample complexity of any -PC algorithm is larger than the value of the following optimization problem:
(17) | |||
In (C.2), each variable is interpreted as the expected number of times the corresponding state-action pair is visited before the stopping time. Note that the above proposition provides a lower bound for any value of $\delta$. We can further simplify this bound when restricting our attention to asymptotic regimes where $\delta$ goes to 0. In that case, the navigation constraints (1) can be replaced by their normalized counterparts (by renormalizing and letting $\delta$ go to 0). For small $\delta$ regimes, we can hence rewrite the lower bound as follows:
One can easily conclude by showing that the value of the optimization program above is equal to the one in Eq. (2).
C.3 Full definition of the terms in the upper bound
Let denote the maximum variance of the value function on the trajectory of the optimal policy. In [MP21] are defined the following functionals of :
Then is simply defined as . Note that .
Appendix D Sampling rule
Recall that $\mathcal{Z}$ denotes the set of state-action pairs. Any policy $\pi$ induces a Markov chain on $\mathcal{Z}$ whose kernel $P_\pi$ is defined by:
Fact 1:
Note that if it takes at most $m$ steps to move with non-zero probability between any pair of states, then we can move between any pair of state-action pairs in at most $m+1$ steps by playing the policy:
where is the policy corresponding to the shortest path to . Finally, for the sake of simplicity, we will write .
D.1 Almost sure forced exploration: proof of Lemma 4
Consider the event . Observe that , where for , . We will prove that for all , which implies the desired result. From Fact 1 above, we have:
(18) |
where is the -th power of the transition matrix induced by policy . Therefore:
is well defined. Fix and let be a policy and an integer satisfying the property (18) above for the pair . Observe that:
where the matrix inequality is entry-wise. Now define the stopping times at which the agent reaches the state-action pair for the -th time (we restrict our attention to departure state-action pairs that are visited infinitely often; such pairs always exist, so these stopping times are well defined). Then:
The second inequality comes from a union bound and the strong Markov property (sometimes referred to as: Markov chains start afresh after stopping times). The last inequality comes from the fact that and . Now observe that the inequality above holds for all realizations of the sequences . Therefore, integrating that inequality over all possible sequences of policies yields:
We can already see that if the state-action pair is visited "frequently enough" ( for some constant ) then the right-hand side above will be zero. Since we know that at least one state-action pair is visited frequently enough, we consider the product over all state-action pairs of the probabilities above:
(19) | ||||
We will now show that for all tuples :
Now observe that for all there exists such that , i.e., at least one state-action has been visited times before time step . For that particular choice of and since is decreasing, we get:
For the choice of the right-hand side above is zero. To sum up, for all realizations of :
Therefore, for all : and consequently: .
D.2 Minimal exploration rate for communicating MDPs
Indeed if the agent visits state at time , then the last transitions before must have been to the right, i.e., . Therefore . In particular this implies that for , . Therefore using the reverse Fatou lemma and Markov's inequality:
D.3 Minimal exploration rate for ergodic MDPs
This is a consequence of Proposition 2 of [BK97], stating that there exist such that for all and large enough, . A union bound yields: . To extend this result to the numbers of visits at the various (state, action) pairs, we can derive a lower bound on given that by observing that a worst-case scenario (by monotonicity of ) occurs when is visited only in the rounds before . We get . Remarking that is a sub-martingale with bounded increments, standard concentration arguments then imply that , where . Next, define the random variable . Applying the reverse Fatou lemma, we get . From there, we directly deduce (by monotonicity of ) that a.s. .
D.4 High probability forced exploration
Lemma 11.
Denote by the k-th time the agent visits the state-action pair . Under C-Navigation with exploration rate we have: for all , there exists a parameter that only depends on such that:
where .
Corollary 1.
Denote by the number of times the agent visits state-action up to and including time step . Then under the same notations of the lemma above we have: for all :
Proof.
Let be some increasing function such that and and define the event . We will prove the following more general result:
(20) |
where is a constant depending on the communication properties of . Then we will tune and so that the right-hand side is less than . First, observe that:
Using the decomposition above, we upper bound the probability of by the sum of the probabilities that the -th excursion from and back to takes too long:
(21) |
where
We will now prove an upper bound on for a fixed and .
1) Upper bounding the probability that an excursion takes too long:
Let us rewrite as
so that state-action corresponds to the last row and last column. Further let denote the vector of probabilities of transitions at time from to states different from . Using a simple recurrence on , one can prove that for all we have:
(22) |
Using Lemma 20, there exists (that only depends on ) such that for all and all sequences such that we have:
(23) |
Therefore using (22) for and breaking the matrix product into smaller product terms of matrices, we get for :
(24) |
where in the fourth line we used that . The sixth line uses the fact that the matrices are substochastic. The last line is due to the fact that and is decreasing. Similarly, one can prove that:
(25) |
where we used the fact that . Now we only have to tune and so that and conclude using (21), (24) and (25).
2) Tuning and the exploration rate:
Since the sequence is decreasing we have:
For where and we have: and , implying:
Summing the last inequality, along with (21), (24) and (25) we get:
For , we have , which gives the desired result.
Remark 4.
It is natural that depends on , which expresses how well connected the MDP is under the uniform policy; see the proof of Lemma 20.
∎
D.5 An Ergodic Theorem for non-homogeneous Markov Chains
We start with some definitions and a technical result. Let be a collection of Markov transition kernels on the state-action space , indexed by policies . For any Markov transition kernel , bounded function and probability distribution , we define:
For a measure and a function , denotes the mean of w.r.t. . Finally, for two policies and we define .
We consider a -valued process such that is -adapted and for any bounded measurable function :
The next result is adapted from [FMP11]. There the authors prove an ergodic theorem for adaptive Markov Chain Monte-Carlo (MCMC) algorithms with a general state space and a parameter-dependent function. For the sake of completeness, we include here the proof of their result in the simple case of finite state-space chains with a function that does not depend on the policy .
Proposition 12.
Proof.
Consider the difference
(26) |
We clearly have:
(28) |
Next, by Lemma 22 there exists a constant (that only depends on ) such that: . Therefore:
(29) |
where the convergence to zero is due to assumption (B2). Now to bound we use the function solution to the Poisson equation . By Lemma 23, exists and is solution to the Poisson equation. Therefore we can rewrite as follows:
(30) |
where
Bounding :
Bounding :
Bounding :
D.6 Application to C-Navigation: proof of Theorem 7
We will now prove that C-Navigation verifies the assumptions (B1-5).
(B1):
(B2):
By Lemma 4 we have: for all . Hence and by continuity: , which implies that:
(33) |
(B3):
By Lemma 13, satisfies (B3) for and .
(B4):
We have:
(34) |
Note that since (ergodicity of ), and entry-wise. Similarly, it is trivial that since and . Therefore and:
(35) |
(B5):
We have: . Hence: , where and are viewed as vectors. On the other hand:
Therefore
For D-Navigation, we get in a similar fashion:
D.7 Geometric ergodicity of the sampling rules
Since is ergodic, there exists such that for all (Proposition 1.7, [LPW06]). Thus we define
(36) | ||||
(37) |
where is the stationary distribution of .
Lemma 13.
Let (resp. ) denote the oracle policy of (resp. the Cesàro-mean of oracle policies up to time ). Further define
Then for D-Navigation (resp. C-Navigation) we have:
where and (resp. and ). In particular (resp. ).
Proof.
We only prove the lemma for C-Navigation. The statement for D-Navigation can be proved in the same way. Recall that: . Therefore:
Appendix E Stopping rule
E.1 Deviation inequality for KL divergences of rewards
We suppose that the reward distributions come from a one-dimensional exponential family and can therefore be parametrized by their respective means . Furthermore, for any such that , we let denote the distribution belonging to the same exponential family, whose mean is the empirical average . For , define the function and its inverse . Further define the function by:
Finally let
where . Now we recall a deviation inequality from [KK18], which we use for the empirical KL divergence of rewards.
Lemma 14.
(Theorem 14, [KK18]) Define the threshold . Then for all :
Remark 5.
One can easily see that .
E.2 Deviation inequality for KL divergences of transitions
Our second deviation inequality is adapted from Proposition 1 in [JKM+20]. There the authors derive a deviation inequality for a single KL divergence of a multinomial distribution. In order to get a deviation inequality of a sum of KL divergences, we modified their proof by considering the product over state-action pairs of the martingales they used. For the sake of self-containedness, we include the proof below. is defined as the categorical distribution with a vector of probabilities satisfying:
Lemma 15.
(Proposition 1, [JKM+20]) Define the threshold . Then for all we have:
with the convention that whenever .
Proof.
We begin with a few notations. For any vector and any element of the simplex , we denote . We define the log-partition function of a discrete distribution supported over by:
We use the shorthand and let . For and such that the binomial coefficient is defined as: . Finally is the Shannon entropy of distribution .
Building a convenient mixture martingale for every state-action pair:
Following [JKM+20], we define for every integer :
(38) |
The sequence is an -martingale since:
The same holds trivially when . Now, we define the mixture martingale defined by the family of priors where follows a Dirichlet distribution with parameters :
where in the second inequality we used Lemma 16 and denotes the i-th component of . Now using Lemma 17, we upper bound the binomial coefficients which leads to:
The product martingale:
Taking the product over all state-action pairs we get:
(39) |
Next, using that we get:
Hence (39) becomes:
(40) |
Now, we show that is a martingale. For any fixed pair we have:
where the third equality is because and are independent conditionally on . Finally, using the tower rule we get:
Hence is a martingale. Thanks to Doob’s maximal inequality we have:
In view of (40), we conclude that for we have:
∎
Lemma 16.
Lemma 17.
(Theorem 11.1.3, [CT06]) Let , and such that then:
where is the Shannon entropy of the discrete distribution over with vector of probabilities .
E.3 Correctness of the stopping rule
Appendix F Sample complexity upper bound
F.1 Almost sure upper bound: proof of Theorem 6 (i)
Proof.
Consider the event . By Lemma 4 and Theorem 7, we have . We will prove that under , .
Fix . There exists such that for all :
(41) | |||
(42) | |||
(43) |
where the last two inequalities come from the fact that both thresholds satisfy . Combining the inequalities above with the definition of , we get:
Since , the last inequality implies that . Taking the limit when goes to zero finishes the proof. ∎
F.2 Upper bound in expectation: proof of Theorem 6 (ii)
Proof.
We start by defining the semi-distance between MDPs:
Now for , by continuity of (a consequence of Berge's theorem, which gives the continuity of ), there exists such that:
where . For , consider the concentration events (for simplicity and w.l.o.g., we consider that and are integers):
where is a mapping defined in Proposition 19. We will upper bound the stopping time of MDP-NaS under . Define:
By Proposition 19, there exists such that for all , conditionally on , the event occurs with high probability. For we have:
(44) |
Furthermore, using the bound in the definitions of the thresholds and and the fact that , we prove the existence of such that for all :
(45) | ||||
(46) |
Finally define:
Using (44-46), we have for all under the following holds:
In other words:
(47) |
Therefore (where denotes the complement of event ):
where we used Lemma 18 and Proposition 19 in the last inequality. This implies that is finite and:
where we used . Finally, we take the limit . Since and then which finishes the proof. ∎
F.3 Concentration of the empirical MDPs
Lemma 18.
Define the event . Then there exist two positive constants and that only depend on and such that:
Proof.
For simplicity we will denote by . Consider the forced exploration event:
where and is a parameter that only depends on . Applying Corollary 1 for , we get . Therefore we have:
(48) |
On the other hand:
(49) |
Using a union bound and Chernoff-Hoeffding theorem respectively we get for :
(50) |
In a similar fashion we prove that:
(51) |
(52) | ||||
(53) |
Thus, for the following choice of constants
Combined with (48), the previous inequality implies that:
∎
F.4 Concentration of state-action visitation frequency
The following proposition is a somewhat stronger version of Proposition 12. The fact that is within a distance of at most from enables us to have a tighter control of the ergodicity constants . This in turn allows us to derive a finite sample bound on the deviations of state-action frequency of visits. Before stating the result we recall some simple facts:
Fact 1:
For any two policies we have: , where the norm is on policies viewed as vectors of .
Fact 2:
Under , for we have:
Fact 3:
Under , for we have:
where we used Lemma 22, Fact 1, the definitions of and and Fact 2 respectively.
Fact 4:
Under , for we have:
Proposition 19.
Under C-Navigation, for all , there exists a time such that for all , all and all functions , we have:
where is a mapping with values in such that . In particular, this implies that:
Corollary 2.
We have .
Proof.
Consider the difference
We clearly have:
(55) |
Using Fact 3 and integral-series comparison, we upper bound the third term as follows:
(56) |
Now to bound we use the function solution to the Poisson equation . By Lemma 23, exists and is solution to the Poisson equation. Therefore we can rewrite as follows:
(57) |
where
Bounding :
Bounding :
Bounding :
Appendix G Technical Lemmas
G.1 Upper bound on the norm of products of substochastic matrices
Before we proceed with the lemma, we lay out some definitions. denotes the minimum positive probability of transition in . Similarly, define the minimal probability of reaching some state-action pair from any other state-action pair after a given number of transitions (refer to the preamble of Appendix D for more detail) in the Markov chain induced by the uniform random policy. Finally, .
Lemma 20.
Fix some state-action and let be the transition matrix under some policy satisfying for all . Define the substochastic matrix obtained by removing from the row and the column corresponding to :
Then we have:
Proof.
Define the sum of the k-th row in the product of matrices for . We will prove that for all : . The result follows immediately by noting that .
Consider such that (such a pair always exists since the MDP is communicating) and let be the index of the corresponding row in . Then for all :
(64) |
Now for we have:
(65) |
where in the fifth line we use the fact that for all : since the matrices are substochastic. The last line comes from (64). Now for all other indexes we have:
(66) |
where we used (65) and the fact that the matrix is substochastic. Now, since the MDP is communicating, we can reach any state-action pair from any other state-action pair after some steps in the Markov chain corresponding to the uniform random policy. In other words, if is the index corresponding to then there exists , such that . Therefore:
(67) |
Thus, combining (66) for and (67) we get:
∎
G.2 Geometric ergodicity: a general result
The following lemma is adapted from the proof of the Convergence theorem (Theorem 4.9, [LPW06]).
Lemma 21.
Let be a stochastic matrix with stationary distribution vector . Suppose that there exist and an integer such that for all . Let be a rank-one matrix whose rows are equal to . Then:
where .
Proof.
We write: where is a stochastic matrix. Note that for all since . Furthermore for all stochastic matrices since all rows of are equal. Using these properties, we will show by induction that :
For the result is trivial. Now suppose that . Then:
Therefore the result holds for all . Hence , which implies:
∎
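The statement can be checked numerically. The sketch below draws a random ergodic stochastic matrix, takes $r = 1$ and $\theta = \min_{x,y} P(x,y)/\pi(y)$, and verifies that the total-variation distance to stationarity decays at least as fast as $(1-\theta)^n$, which is the standard form of the convergence theorem of [LPW06]; the example matrix and horizon are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n_states = 5
P = rng.dirichlet(np.ones(n_states), size=n_states)      # ergodic stochastic matrix

# Stationary distribution: left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()

theta = np.min(P / pi[None, :])                           # P(x, y) >= theta * pi(y), r = 1
Pi = np.tile(pi, (n_states, 1))                           # rank-one matrix with rows pi

Pn = np.eye(n_states)
for n in range(1, 21):
    Pn = Pn @ P
    tv = 0.5 * np.max(np.abs(Pn - Pi).sum(axis=1))        # max_x TV(P^n(x, .), pi)
    assert tv <= (1 - theta) ** n + 1e-12                 # convergence-theorem bound
print("bound verified up to n = 20, theta =", round(theta, 3))
```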
G.3 Condition number of Markov Chains
Lemma 22.
(Theorem 2 in [Sch68]) Let (resp. ) be the transition kernel of a Markov Chain with stationary distribution (resp. ). Define . Then:
where . Crucially, in our setting this implies that there exists a constant that only depends on such that for all :
where .
G.4 Properties of Poisson equation’s solutions
Lemma 23.
Let be a Markov transition kernel satisfying assumptions (B1) and (B3) and denote by its stationary distribution. Then, for any bounded function , the function defined by is well defined and is a solution to the Poisson equation . Furthermore:
and for any pair of kernels :
Proof.
We will prove that is well defined. Checking that it satisfies the Poisson equation is straightforward. Observe that:
where the second equality is because for all . From the last expression, we see that the sum defining converges and we have the first bound . Now for the second bound, we write:
(68) |
Using the same trick as before we obtain:
(69) |
On the other hand, a simple calculation shows that , i.e., it is a solution to the modified Poisson equation whose right-hand side is . Therefore:
and
(70) |
Summing up equation (68) and inequalities (69-70) ends the proof. ∎