Navigating to the Best Policy
in Markov Decision Processes
Abstract
We investigate the classical active pure exploration problem in Markov Decision Processes, where the agent sequentially selects actions and, from the resulting system trajectory, aims at identifying the best policy as fast as possible. We propose a problem-dependent lower bound on the average number of steps required before a correct answer can be given with probability at least $1-\delta$. We further provide the first algorithm with an instance-specific sample complexity in this setting. This algorithm addresses the general case of communicating MDPs; we also propose a variant with a reduced exploration rate (and hence faster convergence) under an additional ergodicity assumption. This work extends previous results relative to the generative setting [MP21], where the agent could at each step query the random outcome of any (state, action) pair. In contrast, we show here how to deal with the navigation constraints induced by the online setting. Our analysis relies on an ergodic theorem for non-homogeneous Markov chains which we consider of wide interest in the analysis of Markov Decision Processes.
1 Introduction
Somewhat surprisingly, learning in a Markov Decision Process is most often considered under the performance criteria of consistency or regret minimization (see e.g. [SB18, Sze10, LS20] and references therein). Regret minimization (see e.g. [AJO09, FCG10]) is particularly relevant when the rewards accumulated during the learning phase are important. This is however not always the case: for example, when learning a game (whether Go, chess, Atari, or whatever), winning or losing during the learning phase does not really matter. One may intuitively think that sometimes getting into difficulty on purpose, so as to observe otherwise rarely seen situations, can significantly accelerate the learning process. Another example is the training of robot prototypes in the factory: a reasonably good policy is first searched, regardless of the losses incurred, that can serve as an initialization for a second, regret-minimization phase that starts when the robot is deployed. It is hence also of great practical importance to study the sample complexity of learning, and to work on strategies that might improve, in this perspective, on regret-minimizing algorithms.
In this work, we are interested in the best policy identification (BPI) problem for infinite-horizon discounted MDPs. This framework was introduced by [Fie94] under the name of PAC-RL. In BPI the algorithm explores the MDP until it has gathered enough samples to return an $\varepsilon$-optimal policy with probability at least $1-\delta$. Crucially, the algorithm halts at a random time step, determined by a stopping rule which guarantees that the probability of returning a wrong answer is less than $\delta$. The optimality of BPI algorithms is measured through their sample complexity, defined as their expected stopping time. Best policy identification in MDPs has been mostly investigated under the lens of minimax-optimality. The minimax framework may be overly pessimistic by accounting for the worst possible MDP, whereas an algorithm with instance-specific guarantees can adapt its exploration procedure to the hardness of the MDP instance that it faces. Recently, a few works made the first attempts towards understanding the instance-specific sample complexity of reinforcement learning. However, they typically either make simplifying assumptions such as access to a generative model [ZKB19, MP21], or restrict their attention to episodic MDPs [WSJ21]. In practice however, the samples gathered rather correspond to a single, possibly infinite, trajectory of the system that we wish to control. This motivates our study of the full online setting where observations are only obtained by navigating in the MDP, that is by sequential choices of actions and following the transitions of the MDP.
Our contributions. [MP21] recently proposed an information-theoretical complexity analysis for MDPs in the case of access to a generative model. Here we extend their results to the online setting. Our main goal is to understand how the online learning scheme affects the sample complexity compared to the easier case where we have a generative model. A natural first step consists in understanding how the first-order term changes. Thus we only focus on the asymptotic regime $\delta \to 0$. Our key contributions can be summarized as follows:
• First, we adapt the lower bound of [MP21] to the online setting (Proposition 2). The new bound also writes as the value of a zero-sum two-player game between nature and the algorithm, where the loss function remains the same, but where the set of possible strategies of the algorithm is restricted to a subset $\Omega(\mathcal{M})$ of the simplex of dimension $SA-1$. We refer to the constraints defining $\Omega(\mathcal{M})$ as the navigation constraints.
• We propose MDP-NaS, the first algorithm for the online setting with instance-dependent bounds on its sample complexity in the asymptotic regime $\delta \to 0$ (Theorem 6). (Before publication of this work, but after a preprint was available online, [WSJ21] proposed another algorithm with instance-dependent guarantees.) A major challenge lies in the design of a sampling rule that guarantees that the sampling frequency of state-action pairs converges to some target oracle allocation $\omega_o$. Indeed, since we can no longer choose the next state of the agent, the tracking procedure developed by [GK16] for multi-armed bandits and used in [MP21] for MDPs with a generative model can no longer be applied in our setting. We propose a new sampling rule which performs exploration according to a mixture of the uniform policy and a plug-in estimate of the oracle policy (the policy whose stationary state-action distribution is $\omega_o$) and prove that it satisfies the requirement above. The analysis of our sampling rule relies on an ergodic theorem for non-homogeneous Markov chains of independent interest (Proposition 12).
• We investigate, depending on the communication properties of the ground-truth instance $\mathcal{M}$, the minimal forced-exploration rate in our sampling rule that guarantees the consistency of the plug-in estimator of the oracle policy. Our findings imply that when $\mathcal{M}$ is ergodic, a relatively fast-decreasing exploration rate is sufficient. However, when $\mathcal{M}$ is only assumed to be communicating, one is obliged to resort to a much more conservative exploration rate, governed by a parameter $m$ defined in Lemma 4 that may scale as large as $S-1$ in the worst case.
• Finally, our stopping rule represents the first implementation of the Generalized Likelihood Ratio test for MDPs. Notably, we circumvent the need to solve the max-min program of the lower bound exactly, and show how an upper bound on the best-response problem, such as the one derived in [MP21], can be used to perform a GLR test. This improves upon the sample complexity bound that one obtains using the KL-Ball stopping rule of [MP21]. (Note that the stopping rule is independent of the sampling rule and thus can be used in both the generative and the online settings. Furthermore, one may even obtain an improved factor of 2 by using a deviation inequality for the full distribution of (reward, next-state) instead of a union bound of deviation inequalities for each marginal distribution.)
1.1 Related work
Minimax Best-Policy Identification. BPI in the online setting has been investigated recently by a number of works with minimax sample complexity bounds. In the case of episodic MDPs, [KMDD+21] and [MDJ+20] proposed algorithms that identify an $\varepsilon$-optimal policy at the initial state w.h.p. In contrast, in the case of infinite-horizon MDPs one is rather interested in finding a good policy at every state. Recent works provide convergence analyses for Q-learning [LWC+20b] or policy gradient algorithms [AHKS20], [FYAY21], [ZCA21]. Their results typically state that if the algorithm is fed with the appropriate hyperparameters, it can return an $\varepsilon$-optimal policy w.h.p. after collecting a polynomial number of samples. In practice, a pure exploration agent needs a stopping rule to determine when to halt the learning process. In particular, the question of how to tune the number of iterations of these algorithms without prior knowledge of the ground-truth instance remains open. (Here we chose not to mention works that investigate the sample complexity in the PAC-MDP framework [Kak03]. Indeed, in this framework the sample complexity is rather defined as the number of episodes where the algorithm does not play an $\varepsilon$-optimal policy. As explained in [DMKV21], this objective is closer to regret minimization than to pure exploration.)
Generative Model. A large body of literature focuses on the so-called generative model, where in each step, the algorithm may query a sample (i.e., observe a reward and a next-state drawn from the rewards and transitions distributions respectively) from any given state-action pair [KS98],[KMN99], [EDMM06], [AMK13],[DTC13], [Wan17], [SWW+18], [ZKB19], [AKY20], [LWC+20a], [LCC+21], [MP21].
Instance-specific bounds. Instance-optimal algorithms for Best Arm Identification in multi-armed bandits (MDPs with one state) have been obtained independently by [GK16], [Rus16]. Here, we extend their information-theoretical approach to the problem of Best Policy Identification in MDPs. More recently, [WSJ21] provided an algorithm for BPI in episodic MDPs with instance-specific sample complexity. A more detailed comparison with [WSJ21] can be found in Section 6.
Outline. The rest of the paper is organized as follows: After introducing the setting and giving some notation and definitions in Section 2, we derive in Section 3 a lower bound on the time required by any algorithm navigating the MDP until it is able to identify the best policy with probability at least $1-\delta$. The algorithm is presented in Section 4. Section 5 contains our main results along with a sketch of the analysis. Finally, in Section 6 we compare our results with MOCA, the only other algorithm (to the best of our knowledge) with problem-dependent guarantees in the online setting. Most technical results and proofs are given in the appendix.
2 Setting and notation
Discounted MDPs. We consider infinite-horizon MDPs with discount factor $\gamma$ and finite state and action spaces $\mathcal{S}$ and $\mathcal{A}$, of respective cardinalities $S$ and $A$. Such an MDP $\mathcal{M}$ is defined through its transition and reward distributions $p_{\mathcal{M}}(\cdot|s,a)$ and $q_{\mathcal{M}}(\cdot|s,a)$ for all $(s,a)$. For simplicity, $q_{\mathcal{M}}(\cdot|s,a)$ will also denote the density of the reward distribution w.r.t. a positive measure $\lambda$ with support included in $[0,1]$. Specifically, $p_{\mathcal{M}}(s'|s,a)$ denotes the probability of transitioning to state $s'$ after playing action $a$ at state $s$, while $R(s,a) \sim q_{\mathcal{M}}(\cdot|s,a)$ is the random instantaneous reward that is collected. Finally, $\gamma \in (0,1)$ is a discounting factor. We look for an optimal control policy $\pi^\star$ maximizing the long-term discounted reward $V^\pi(s) \triangleq \mathbb{E}^\pi\big[\sum_{t\geq 0}\gamma^t R(s_t,\pi(s_t)) \,\big|\, s_0=s\big]$, where $\mathbb{E}^\pi$ denotes the expectation w.r.t. the randomness in the rewards and the trajectory when the policy $\pi$ is used. Classically, we denote by $V^\star$ the optimal value function of $\mathcal{M}$ and by $Q^\star$ the optimal $Q$-value function of $\mathcal{M}$. $\Pi^\star(\mathcal{M})$ denotes the set of optimal policies of $\mathcal{M}$.
Problem-dependent quantities. The sub-optimality gap of action $a$ in state $s$ is defined as $\Delta(s,a) \triangleq V^\star(s) - Q^\star(s,a)$. Let $\Delta_{\min} \triangleq \min_{s,\, a\neq \pi^\star(s)} \Delta(s,a)$ be the minimum gap and $\mathrm{sp}(V^\star) \triangleq \max_s V^\star(s) - \min_s V^\star(s)$ be the span of $V^\star$. Finally, we introduce the variance of the value function in $(s,a)$ as $\mathrm{Var}_{p(\cdot|s,a)}[V^\star]$, the variance of $V^\star(s')$ when $s' \sim p_{\mathcal{M}}(\cdot|s,a)$.
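As an illustration of these definitions, the following sketch computes $V^\star$, $Q^\star$, the gaps, the span and the per-pair variances of the value function by standard value iteration on a small randomly generated MDP. It is not code from the paper; the array layout, random instance and tolerance are arbitrary choices made for the example.

```python
import numpy as np

def value_iteration(P, R, gamma, tol=1e-10):
    """P[s, a, s'] = transition probabilities, R[s, a] = mean rewards."""
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * (P @ V)          # Q(s,a) = r(s,a) + gamma * sum_s' p(s'|s,a) V(s')
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q
        V = V_new

rng = np.random.default_rng(0)
S, A, gamma = 5, 3, 0.9
P = rng.dirichlet(np.ones(S), size=(S, A))    # random transition kernel p(.|s,a)
R = rng.uniform(size=(S, A))                  # mean rewards in [0, 1]

V_star, Q_star = value_iteration(P, R, gamma)
pi_star = Q_star.argmax(axis=1)               # optimal policy (unique with probability one)
gaps = V_star[:, None] - Q_star               # sub-optimality gaps Delta(s, a)
delta_min = np.min(gaps[gaps > 1e-8])         # minimum positive gap
span = V_star.max() - V_star.min()            # span of V*
var_V = P @ V_star**2 - (P @ V_star)**2       # Var_{p(.|s,a)}[V*] for every pair (s, a)
print(pi_star, round(delta_min, 4), round(span, 4))
```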
Active pure exploration with fixed confidence. When $\mathcal{M}$ is unknown, we wish to devise a learning algorithm identifying, from a single trajectory, an optimal policy as quickly as possible with some given level of confidence. Formally, such an algorithm consists of (i) a sampling rule, selecting in each round $t$ in an adaptive manner the action $a_t$ to be played; $a_t$ depends on past observations, i.e., it is $\mathcal{F}_t$-measurable, where $\mathcal{F}_t$ is the $\sigma$-algebra generated by observations up to time $t$; (ii) a stopping rule $\tau$, a stopping time w.r.t. the filtration $(\mathcal{F}_t)_{t\geq 1}$, deciding when to stop collecting observations; (iii) a decision rule returning an estimated optimal policy $\hat{\pi}_\tau$.
An algorithm is $\delta$-Probably Correct ($\delta$-PC) over some set of MDPs if, for any MDP $\mathcal{M}$ in this set, it returns (in finite time) an optimal policy with probability at least $1-\delta$. In this paper, we aim to devise a $\delta$-PC algorithm with minimal sample complexity $\mathbb{E}_{\mathcal{M}}[\tau]$. We make the following assumption.
Assumption 1.
We consider the set of communicating MDPs with a unique optimal policy.
We justify the above assumption as follows. (i) We restrict our attention to the case where $\mathcal{M}$ is communicating, for otherwise, if it is multichain, there would be a non-zero probability that the algorithm enters a subclass of states from which there is no possible return. In this case it becomes impossible to identify the global optimal policy (unless we modify the objective to finding the optimal policy in this subchain). (ii) As for the uniqueness of the optimal policy, treating the case of MDPs with multiple optimal policies, or that of $\varepsilon$-optimal policy identification, requires the use of more involved overlapping hypothesis tests, which is already challenging in multi-armed bandits (MDPs with a single state) [GK19]. We will analyze how to remove this assumption in future work.
Notation. $\mathcal{Z} \triangleq \mathcal{S}\times\mathcal{A}$ is the state-action space, and $\Sigma$ denotes the simplex of dimension $SA-1$. $N_t(s,a)$ denotes the number of times the state-action pair $(s,a)$ has been visited up to the end of round $t$. We also introduce $N_t(s) \triangleq \sum_{a} N_t(s,a)$. Similarly, for a vector $\omega \in \Sigma$ we will denote $\omega(s) \triangleq \sum_{a} \omega(s,a)$. For a matrix $B$, the infinity norm is defined as $\|B\|_\infty \triangleq \max_i \sum_j |B_{ij}|$. The KL divergence between two probability distributions $P$ and $Q$ on some discrete space $\mathcal{X}$ is defined as $\mathrm{KL}(P\|Q) \triangleq \sum_{x\in\mathcal{X}} P(x)\log\frac{P(x)}{Q(x)}$. For Bernoulli distributions of respective means $p$ and $q$, the KL divergence is denoted by $\mathrm{kl}(p,q)$. For distributions over $[0,1]$ defined through their densities $f$ and $g$ w.r.t. some positive measure $\lambda$, the KL divergence is $\mathrm{KL}(f\|g) \triangleq \int_0^1 f(x)\log\frac{f(x)}{g(x)}\,\lambda(\mathrm{d}x)$. $\mathbb{M}$ denotes the set of communicating MDPs with a unique optimal policy. For two MDPs $\mathcal{M}, \mathcal{M}'$, we say that $\mathcal{M} \ll \mathcal{M}'$ if, for all $(s,a)$, $p_{\mathcal{M}}(\cdot|s,a) \ll p_{\mathcal{M}'}(\cdot|s,a)$ and $q_{\mathcal{M}}(\cdot|s,a) \ll q_{\mathcal{M}'}(\cdot|s,a)$. In that case, for any state-action pair $(s,a)$, we define the KL divergence of the distributions of the one-step observations under $\mathcal{M}$ and $\mathcal{M}'$ when starting at $(s,a)$ as $\mathrm{KL}_{\mathcal{M}|\mathcal{M}'}(s,a) \triangleq \mathrm{KL}\big(p_{\mathcal{M}}(\cdot|s,a)\,\|\,p_{\mathcal{M}'}(\cdot|s,a)\big) + \mathrm{KL}\big(q_{\mathcal{M}}(\cdot|s,a)\,\|\,q_{\mathcal{M}'}(\cdot|s,a)\big)$.
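For reference, a short sketch of the two divergences used throughout (Bernoulli $\mathrm{kl}$ and KL between discrete distributions); the clipping constant only guards against numerical edge cases and is not part of the mathematical definitions.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """kl(p, q) between Bernoulli distributions of means p and q."""
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_discrete(P, Q, eps=1e-12):
    """KL(P || Q) between two distributions on the same finite space."""
    P, Q = np.asarray(P, dtype=float), np.asarray(Q, dtype=float)
    mask = P > 0
    return float(np.sum(P[mask] * np.log(P[mask] / np.maximum(Q[mask], eps))))

print(kl_bernoulli(0.5, 0.02))                          # two Bernoulli reward distributions
print(kl_discrete([0.7, 0.2, 0.1], [0.5, 0.3, 0.2]))    # two transition rows
```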
3 Sample complexity lower bound
In this section, we first derive a lower bound on the expected sample complexity satisfied by any $\delta$-PC algorithm. The lower bound is obtained as the solution of a non-convex optimization problem, as in the case where the learner has access to a generative model [MP21]. The problem has however additional constraints, referred to as the navigation constraints, due to the fact that the learner has access to a single system trajectory.
The expected sample complexity of an algorithm is $\mathbb{E}_{\mathcal{M}}[\tau] = \sum_{(s,a)\in\mathcal{Z}} \mathbb{E}_{\mathcal{M}}[N_\tau(s,a)]$.
Lower bounds on the sample complexity are derived by identifying constraints that the various $\mathbb{E}_{\mathcal{M}}[N_\tau(s,a)]$'s need to satisfy so as to get a $\delta$-PC algorithm. We distinguish here two kinds of constraints:
Information constraints. These are constraints on the expected numbers of visits $\mathbb{E}_{\mathcal{M}}[N_\tau(s,a)]$, ensuring that the algorithm can learn the optimal policy with probability at least $1-\delta$. They are the same as those derived in [MP21] when the learner has access to a generative model and are recalled in Lemma 9 in Appendix C.
Navigation constraints. Observations come from a single (but controlled) system trajectory, which imposes additional constraints on the $\mathbb{E}_{\mathcal{M}}[N_\tau(s,a)]$'s. These are derived by simply writing the Chapman-Kolmogorov equations of the controlled Markov chain (refer to Appendix C for a proof).
Lemma 1.
For any algorithm and for any state $s$, we have:
$$\Big|\, \sum_{a\in\mathcal{A}} \mathbb{E}_{\mathcal{M}}[N_\tau(s,a)] \;-\; \sum_{(s',a')\in\mathcal{Z}} p_{\mathcal{M}}(s|s',a')\, \mathbb{E}_{\mathcal{M}}[N_\tau(s',a')] \,\Big| \;\leq\; 1. \qquad (1)$$
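In normalized form (dividing the expected visit counts by the expected horizon and letting it grow), the constraint of Lemma 1 says that an achievable allocation must be a stationary state-action distribution of the controlled chain. The sketch below checks this numerically for an arbitrary random MDP and policy: the stationary distribution of $P_\pi$ satisfies $\sum_a \omega(s,a) = \sum_{s',a'} p(s|s',a')\,\omega(s',a')$ for every state. All instance parameters here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A = 4, 3
P = rng.dirichlet(np.ones(S), size=(S, A))       # p(s'|s,a)
pi = rng.dirichlet(np.ones(A), size=S)           # an arbitrary stochastic policy pi(a|s)

# Kernel on state-action pairs: P_pi[(s,a),(s',a')] = p(s'|s,a) * pi(a'|s')
P_pi = np.einsum('sax,xb->saxb', P, pi).reshape(S * A, S * A)

# Stationary distribution of P_pi: left eigenvector associated with eigenvalue 1.
eigvals, eigvecs = np.linalg.eig(P_pi.T)
w = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
w = (w / w.sum()).reshape(S, A)                  # omega(s, a)

lhs = w.sum(axis=1)                              # sum_a omega(s, a)
rhs = np.einsum('xas,xa->s', P, w)               # sum_{s',a'} p(s|s',a') omega(s',a')
print(np.allclose(lhs, rhs))                     # True: the navigation constraints hold
```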
Putting all constraints together, we obtain the following sample complexity lower bound.
Proposition 2.
Define the set of navigation-constrained allocation vectors:
$$\Omega(\mathcal{M}) \;\triangleq\; \Big\{\, \omega \in \Sigma \;:\; \forall s \in \mathcal{S},\ \ \omega(s) \,=\, \sum_{(s',a') \in \mathcal{Z}} p_{\mathcal{M}}(s|s',a')\,\omega(s',a') \,\Big\}.$$
Further define $\mathrm{Alt}(\mathcal{M}) \triangleq \{\mathcal{M}' :\ \mathcal{M} \ll \mathcal{M}' \text{ and } \Pi^\star(\mathcal{M}') \cap \Pi^\star(\mathcal{M}) = \emptyset\}$, the set of alternative instances. Then the expected sample complexity of any $\delta$-PC algorithm satisfies:
$$\mathbb{E}_{\mathcal{M}}[\tau] \;\geq\; T_o(\mathcal{M})\,\mathrm{kl}(\delta,1-\delta), \quad \text{where} \quad T_o(\mathcal{M})^{-1} \;\triangleq\; \sup_{\omega \in \Omega(\mathcal{M})} T(\mathcal{M},\omega)^{-1} \qquad (2)$$
and
$$T(\mathcal{M},\omega)^{-1} \;\triangleq\; \inf_{\mathcal{M}' \in \mathrm{Alt}(\mathcal{M})} \;\sum_{(s,a) \in \mathcal{Z}} \omega(s,a)\,\mathrm{KL}_{\mathcal{M}|\mathcal{M}'}(s,a). \qquad (3)$$
Remark 1.
A common interpretation of change-of-measure lower bounds like the one above is the following: the optimization problem in the definition of $T_o(\mathcal{M})$ can be seen as the value of a two-player zero-sum game between an algorithm which samples each state-action pair $(s,a)$ proportionally to $\omega(s,a)$ and an adversary who chooses an alternative instance $\mathcal{M}'$ that is difficult to distinguish from $\mathcal{M}$ under the algorithm's sampling strategy. This suggests that an optimal algorithm should play the optimal allocation $\omega^*$ that solves the optimization problem (2) and, as a consequence, rule out all alternative instances as fast as possible.
Remark 2.
Note that, compared to the lower bound of the generative setting in [MP21], the only change is in the set of sampling strategies that the algorithm can play, which is no longer the entire simplex but the subset $\Omega(\mathcal{M})$.
3.1 Proxy for the optimal allocation and the characteristic time
As shown in [MP21], even without accounting for the navigation constraints, computing the characteristic time, and in particular the optimal allocation $\omega^*$ leading to it, is not easy. Indeed, the sub-problem corresponding to computing the best-response value $T(\mathcal{M},\omega)$ is non-convex. This makes it difficult to design an algorithm that targets the optimal weights $\omega^*$. Instead, we use a tractable upper bound $U(\mathcal{M},\omega)$ of $T(\mathcal{M},\omega)$ from [MP21]:
Lemma 3.
Using $U(\mathcal{M},\omega)$, we obtain the following upper bound on the characteristic time (2):
$$T_o(\mathcal{M}) \;\leq\; U_o(\mathcal{M}) \;\triangleq\; \inf_{\omega \in \Omega(\mathcal{M})} U(\mathcal{M},\omega). \qquad (6)$$
The advantages of the above upper bound are that: (i) it is a problem-specific quantity, as it depends on the gaps and on the variances of the value function in $\mathcal{M}$; (ii) the corresponding allocation $\omega_o(\mathcal{M})$ (that solves (6)) can be easily computed, and hence targeted. Indeed, the optimization problem in (6) has a convex objective and convex constraints. Therefore, we can use the projected subgradient-descent algorithm to compute
$$\omega_o(\mathcal{M}) \;\triangleq\; \operatorname*{arg\,min}_{\omega \in \Omega(\mathcal{M})}\; U(\mathcal{M},\omega), \qquad (7)$$
which will be used in our algorithm as a proxy for $\omega^*$. (Note that if we have access to an optimization oracle that, given an MDP $\mathcal{M}$, returns the optimal allocation solving (2), then we can replace $\omega_o(\mathcal{M})$ by this optimal allocation. Our algorithm will then be asymptotically optimal up to a factor of 2.)
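To give an idea of how such a proxy allocation can be computed in practice, here is a rough sketch that minimizes a generic smooth convex surrogate $\sum_{(s,a)} c(s,a)/\omega(s,a)$ over the polytope $\Omega(\mathcal{M})$ with an off-the-shelf solver. The per-pair "hardness" vector `c` is a placeholder standing in for the terms of $U(\mathcal{M},\omega)$ (whose exact expression from [MP21] is not reproduced here), and using SciPy's SLSQP instead of projected subgradient descent is a simplification for the example; this is not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))       # p(s'|s,a)
c = rng.uniform(0.1, 1.0, size=(S, A))           # placeholder per-pair hardness terms

def objective(w_flat):
    # Smooth convex surrogate of the upper bound: sum_(s,a) c(s,a) / omega(s,a).
    w = w_flat.reshape(S, A)
    return float(np.sum(c / np.maximum(w, 1e-12)))

def nav_residual(w_flat):
    # Navigation constraints: sum_a w(s,a) = sum_{s',a'} p(s|s',a') w(s',a') for all s
    # (the constraint for the last state is redundant and dropped).
    w = w_flat.reshape(S, A)
    return (w.sum(axis=1) - np.einsum('xas,xa->s', P, w))[:-1]

constraints = [{'type': 'eq', 'fun': nav_residual},
               {'type': 'eq', 'fun': lambda w: w.sum() - 1.0}]
w0 = np.full(S * A, 1.0 / (S * A))
res = minimize(objective, w0, method='SLSQP', bounds=[(1e-6, 1.0)] * (S * A),
               constraints=constraints)
omega_o = res.x.reshape(S, A)                    # navigation-constrained proxy allocation
print(omega_o.round(3), round(res.fun, 3))
```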
4 Algorithm
We propose MDP-NaS (MDP Navigate-and-Stop), a model-based algorithm inspired by the lower bound. The lower bound suggests that, to identify the best policy in a sample-efficient manner, the algorithm must collect samples from each state-action pair $(s,a)$ proportionally to $\omega_o(s,a)$. We propose two sampling rules which ensure that this statement holds in the long term (see Section 4.1 and Theorem 7 for a rigorous formulation). Our sampling rules are combined with a Generalized Likelihood Ratio (GLR) test (or rather a proxy of the GLR test, see Section 4.2 for details), which stops as soon as we are confident that the empirical optimal policy is the optimal policy of $\mathcal{M}$ with probability at least $1-\delta$. The pseudo-code for MDP-NaS is given in Algorithm 1.
4.1 Sampling rule
We introduce a few definitions to simplify the presentation. Any stationary policy $\pi$ induces a Markov chain on $\mathcal{Z}$ whose transition kernel is defined by $P_\pi\big((s,a),(s',a')\big) \triangleq p(s'|s,a)\,\pi(a'|s')$. With some abuse of notation, we will use $P_\pi$ to refer to both the Markov chain and its kernel. We denote by $\pi_u$ the uniform random policy, i.e., $\pi_u(a|s) = 1/A$ for all pairs $(s,a)$. Finally, we define the vector of visit-frequencies $N(t)/t \triangleq \big(N_t(s,a)/t\big)_{(s,a)\in\mathcal{Z}}$.
In contrast with pure exploration in multi-armed bandits and in MDPs with a generative model, where any allocation vector in the simplex is achievable, here the agent can only choose a sequence of actions and follow the resulting trajectory. Therefore, one might ask whether the oracle allocation $\omega_o$ can be achieved by following a simple policy. A natural candidate is the oracle policy $\pi_o$ defined by
$$\pi_o(a|s) \;\triangleq\; \frac{\omega_o(s,a)}{\sum_{b \in \mathcal{A}} \omega_o(s,b)}. \qquad (8)$$
It is immediate to check that $\omega_o$ is the stationary distribution of $P_{\pi_o}$. Policy $\pi_o$ is the target that we would like to play, but since the rewards and dynamics of $\mathcal{M}$ are unknown, the actions must be chosen so as to estimate the oracle policy consistently while at the same time ensuring that $N(t)/t$ converges to $\omega_o$. The following two sampling rules satisfy these requirements.
D-Navigation rule: At time step $t$, the learner plays the policy
$$\pi_t \;=\; \varepsilon_t\, \pi_u \;+\; (1-\varepsilon_t)\,\hat{\pi}_t, \qquad (9)$$
C-Navigation rule: At time step $t$, the learner plays
$$\pi_t \;=\; \varepsilon_t\, \pi_u \;+\; (1-\varepsilon_t)\,\frac{1}{t}\sum_{j=1}^{t} \hat{\pi}_j, \qquad (10)$$
where $\hat{\pi}_t$ denotes the oracle policy of the estimated MDP at time $t$ and $(\varepsilon_t)_{t \geq 1}$ is a decreasing sequence of exploration rates to be tuned later. In the second case, the agent navigates using a Cesàro-mean of oracle policies instead of the current estimate of the oracle policy, which makes the navigation more stable.
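A minimal sketch of the two navigation rules. The routine `oracle_policy` normalizes an allocation into a policy as in (8); in the sketch the estimated oracle allocation is passed in as a sequence (in the actual algorithm it would be recomputed from the empirical MDP), and the exploration-rate schedule in the usage example is an arbitrary placeholder rather than the tuning of Lemma 4.

```python
import numpy as np

rng = np.random.default_rng(3)

def oracle_policy(omega):
    """Eq. (8): turn an allocation omega(s,a) into a policy pi_o(a|s)."""
    return omega / omega.sum(axis=1, keepdims=True)

def navigate(P, omega_hats, T, eps_fn, mode="C"):
    """Roll out D- or C-Navigation for T steps on the kernel P[s,a,s'].
    omega_hats[t] plays the role of the estimated oracle allocation at step t."""
    S, A, _ = P.shape
    pi_u = np.full((S, A), 1.0 / A)              # uniform policy
    cesaro = np.zeros((S, A))
    counts = np.zeros((S, A))
    s = 0
    for t in range(1, T + 1):
        pi_hat = oracle_policy(omega_hats[t - 1])
        cesaro += pi_hat
        target = pi_hat if mode == "D" else cesaro / t
        pi_t = eps_fn(t) * pi_u + (1 - eps_fn(t)) * target    # mixture policies (9)/(10)
        a = rng.choice(A, p=pi_t[s])
        counts[s, a] += 1
        s = rng.choice(S, p=P[s, a])
    return counts / T                            # empirical visit frequencies N(t)/t

S, A = 4, 2
P = rng.dirichlet(np.ones(S), size=(S, A))
omega = rng.dirichlet(np.ones(S * A)).reshape(S, A)    # a fixed stand-in allocation
print(navigate(P, [omega] * 2000, T=2000, eps_fn=lambda t: t ** -0.5).round(3))
```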
4.1.1 Tuning the forced exploration parameter
The mixture with the uniform policy $\pi_u$ helps the agent explore all state-action pairs often enough in the initial phase. (Our results still hold if $\pi_u$ is replaced by any other policy whose kernel is ergodic. In practice, this can be helpful especially when we have prior knowledge of a fast-mixing policy.) This forced exploration is particularly necessary when the empirical weight of some state-action pair is under-estimated, which may cause this pair to be left aside so that the estimate is never corrected. The next result gives a tuning of the rate $(\varepsilon_t)_{t\geq 1}$ that ensures sufficient forced exploration:
Lemma 4.
Let $m$ be the maximum length of the shortest paths (in terms of number of transitions) between pairs of states in $\mathcal{M}$: $m \triangleq \max_{(s,s')\in\mathcal{S}^2} \min\{\, n \geq 1 : \exists \pi,\ \mathbb{P}^\pi(s_n = s' \,|\, s_0 = s) > 0 \,\}$. Then C-Navigation or D-Navigation with any decreasing sequence $(\varepsilon_t)_{t\geq 1}$ that vanishes slowly enough satisfies $\mathbb{P}\big(\forall (s,a)\in\mathcal{Z},\ \lim_{t\to\infty} N_t(s,a) = \infty\big) = 1$.
Remark 3.
When the parameter $m$ is unknown to the learner, one can replace it by its worst-case value $S-1$. However, when prior knowledge is available, using a faster-decreasing sequence of exploration rates can be useful to accelerate convergence, especially when the states of $\mathcal{M}$ are densely connected ($m \ll S-1$).
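Since $m$ only depends on the support of the transition kernel, it can be computed exactly by a breadth-first search on the directed graph in which $s \to s'$ whenever some action reaches $s'$ from $s$ with positive probability. A small sketch (the chain-like kernel at the bottom is just an example):

```python
import numpy as np
from collections import deque

def max_shortest_path(P):
    """m = max over ordered pairs (s, s') of the least number of transitions
    needed to reach s' from s with positive probability (under some policy)."""
    S = P.shape[0]
    reach = (P > 0).any(axis=1)            # reach[s, s'] = True if some action goes s -> s'
    m = 0
    for s in range(S):
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in np.flatnonzero(reach[u]):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        m = max(m, max(dist.values()))     # assumes the MDP is communicating
    return m

# Chain-like kernel: RIGHT moves one step right, LEFT jumps back to state 0.
S, A = 5, 2
P = np.zeros((S, A, S))
P[:, 0, 0] = 1.0                           # LEFT
for s in range(S):
    P[s, 1, min(s + 1, S - 1)] = 1.0       # RIGHT
print(max_shortest_path(P))                # here m = S - 1 = 4
```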
Minimal exploration rate: Communicating MDPs. The forced exploration rate of Lemma 4 vanishes quite slowly. One may wonder whether such a slow rate is necessary to guarantee sufficient exploration in communicating MDPs: the answer is yes in the worst case, as the following example shows. Consider a variant of the classical RiverSwim MDP with state space $\{1,\ldots,S\}$ and action space $\{\text{RIGHT}, \text{LEFT1}, \text{LEFT2}\}$. LEFT1 and LEFT2 are equivalent actions inducing the same rewards and transitions. After playing RIGHT the agent makes a transition of one step to the right, while playing LEFT1 (or LEFT2) moves the agent all the way back to state $1$. The rewards are null everywhere except in the two extreme states: the rewards in state $1$ are deterministic, while the reward in state $S$ is Bernoulli with mean 0.02.
When $\gamma$ is close to $1$, the optimal policy consists in always playing RIGHT so as to reach state $S$ and stay there indefinitely. Now suppose that the agent starts at state $1$ and, due to the small probability of observing the large reward in state $S$, she underestimates the value of this state in the first rounds and focuses on distinguishing the best action to play in state $1$ among LEFT1 and LEFT2. Under this scenario she ends up playing a sequence of policies whose probability of playing RIGHT scales with the exploration rate $\varepsilon_t$ (modulo a re-scaling of $\varepsilon_t$ by a constant factor of 1/2). This induces the non-homogeneous Markov chain depicted in Figure 1. We show that if the agent uses an exploration rate that decreases too fast, then with non-zero probability she will visit state $S$ only a finite number of times. Therefore she will fail to identify the optimal policy for small values of the confidence level $\delta$. The proof is deferred to Appendix D.
Minimal exploration rate: Ergodic MDPs. When $\mathcal{M}$ is ergodic, the number of visits to every state grows linearly under any sequence of policies, so a much faster-decreasing exploration rate suffices to guarantee that every state-action pair is visited infinitely often almost surely (see Appendix D.3).
4.2 Stopping rule
To implement a GLR test, we define $L_t(\mathcal{M}')$, the likelihood of the observations under some MDP $\mathcal{M}'$: $L_t(\mathcal{M}') \triangleq \prod_{k=1}^{t} q_{\mathcal{M}'}(r_k|s_k,a_k)\, p_{\mathcal{M}'}(s_{k+1}|s_k,a_k)$, where at step $k$ the algorithm is in state $s_k$, plays action $a_k$, and observes the reward $r_k$ and $s_{k+1}$ (the next state). Performing a GLR test at step $t$ consists in computing the optimal policy $\hat{\pi}^\star_t$ of the estimated MDP $\widehat{\mathcal{M}}_t$ and in comparing the likelihood of the observations under the most likely model where $\hat{\pi}^\star_t$ is optimal to the likelihood under the most likely model where $\hat{\pi}^\star_t$ is sub-optimal. Following the standardized form of the GLR for multi-armed bandits in [DKM19], we write:
$$G_t \;\triangleq\; \log \frac{\sup_{\mathcal{M}' :\, \hat{\pi}^\star_t \in \Pi^\star(\mathcal{M}')} L_t(\mathcal{M}')}{\sup_{\mathcal{M}' \in \mathrm{Alt}(\widehat{\mathcal{M}}_t)} L_t(\mathcal{M}')} \qquad (11)$$
$$\phantom{G_t \;\triangleq\;} \geq\; \inf_{\mathcal{M}' \in \mathrm{Alt}(\widehat{\mathcal{M}}_t)} \log \frac{L_t(\widehat{\mathcal{M}}_t)}{L_t(\mathcal{M}')}. \qquad (12)$$
The hypothesis that $\hat{\pi}^\star_t$ is sub-optimal is then rejected as soon as this ratio of likelihoods becomes greater than a threshold $\beta(t,\delta)$, properly tuned to ensure that the algorithm is $\delta$-PC. Note that the likelihood ratio itself can be difficult to compute for MDPs since it is equivalent to solving (3); we therefore lower bound it using (12) and Lemma 3. This leads to the following proposition:
Proposition 5.
Define the random thresholds for the transitions and the rewards respectively, as given explicitly in Appendix E. Then the stopping rule:
(13) |
is $\delta$-PC, i.e., $\mathbb{P}_{\mathcal{M}}\big(\tau < \infty,\ \hat{\pi}_\tau \notin \Pi^\star(\mathcal{M})\big) \leq \delta$.
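A schematic sketch of how the stopping test plugs together. The quantity `kl_to_closest_alt` stands for the per-pair divergences between the empirical MDP and the most confusing alternative (i.e., the lower bound on the GLR coming from (12)), and `threshold` stands for $\beta(t,\delta)$; both are supplied by the caller, and the particular threshold used in the usage example is a placeholder, not the calibrated expression of Proposition 5.

```python
import numpy as np

def should_stop(counts, kl_to_closest_alt, threshold, t, delta):
    """Generic GLR-style test: stop when the (lower bound on the) log-likelihood
    ratio sum_(s,a) N_t(s,a) * KL_hat(s,a) exceeds the threshold beta(t, delta).

    counts[s, a]            : N_t(s, a)
    kl_to_closest_alt[s, a] : divergence between the one-step observation distributions
                              of the empirical MDP and the most confusing alternative
    threshold(t, delta)     : the calibrated threshold beta(t, delta)
    """
    glr_lower_bound = float(np.sum(counts * kl_to_closest_alt))
    return glr_lower_bound >= threshold(t, delta)

# Illustrative use with placeholder inputs (not the paper's threshold):
beta = lambda t, delta: np.log(1.0 / delta) + 3 * np.log(1.0 + np.log(t))
counts = np.array([[120.0, 40.0], [80.0, 60.0]])
kl_hat = np.array([[0.02, 0.05], [0.01, 0.04]])
print(should_stop(counts, kl_hat, beta, t=counts.sum(), delta=0.1))
```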
5 Main results and sample complexity analysis
First we state our main results, which take the form of asymptotic upper bounds on the sample complexity of MDP-NaS, under a slightly more conservative exploration rate than in Lemma 4. Then we present the most important ingredients in their proof. The complete proof is provided in Appendix F.
Theorem 6.
Under the exploration rate mentioned above, the stopping time $\tau$ of MDP-NaS satisfies: (i) with C-Navigation or D-Navigation, $\limsup_{\delta \to 0} \frac{\tau}{\log(1/\delta)} \leq 2\,U_o(\mathcal{M})$ almost surely; (ii) with C-Navigation, $\limsup_{\delta \to 0} \frac{\mathbb{E}_{\mathcal{M}}[\tau]}{\log(1/\delta)} \leq 2\,U_o(\mathcal{M})$.
5.1 Proof sketch
Concentration of empirical MDPs: The starting point of our proof is a concentration event of the empirical estimates around . For and , we define , where is a semi-norm on MDPs. Then we prove in Lemma 18 that holds with high probability in the sense that for all :
(14) |
For this purpose we derive, for all pairs $(s,a)$, a lower bound on $N_t(s,a)$ that holds with high probability and involves a parameter depending on the mixing properties of $\mathcal{M}$ under the uniform policy. (Using the method in our proof, one can derive lower bounds of this type for a range of exploration rates; our particular choice enables us to have an explicit formula, see Lemma 11 in the appendix.) These high-probability lower bounds on $N_t(s,a)$ contrast with their deterministic equivalent obtained for C-tracking in [GK16].
Concentration of visit-frequencies: Before we proceed, we make the following assumption.
Assumption 2.
The Markov chain $P_{\pi_u}$ induced by the uniform policy is aperiodic. (This assumption is mild, as it is enough to have one state $s$ and one action $a$ with $p(s|s,a) > 0$ for it to be satisfied. Furthermore, Assumptions 1 and 2 combined are still less restrictive than the usual ergodicity assumption, which requires that the Markov chains of all policies are ergodic.)
Under Assumptions 1 and 2, the kernel $P_{\pi_u}$, and consequently also the kernel of any policy mixed with $\pi_u$, becomes ergodic. Hence its iterates converge to the corresponding stationary distribution at a geometric speed. Also note that the Markov chains induced by playing either of our sampling rules are non-homogeneous, with a sequence of kernels that is history-dependent. To tackle this difficulty, we adapt a powerful ergodic theorem (see Proposition 12 in the appendix) from [FMP11], originally derived for adaptive Markov Chain Monte-Carlo (MCMC) algorithms, to get the following result.
Theorem 7.
Using C-Navigation or D-Navigation, we have $\lim_{t \to \infty} N(t)/t = \omega_o(\mathcal{M})$ almost surely.
Lemma 4 and Theorem 7 combined prove Theorem 6 (i) in a straightforward fashion. The proof of Theorem 6 (ii) is more involved. Again, we adapt the proof method from [FMP11] to derive a finite-time version of Proposition 12, which results in the following proposition.
Proposition 8.
Under C-Navigation, for all , there exists a time such that for all , all and all functions , we have:
where is a mapping with values in such that .
Now define . Then Proposition 8 and Eq. (14) combined imply that for large enough, the event holds w.h.p. so that the expected stopping time is finite on the complementary event: . Now given the asymptotic shape of the thresholds: , we may informally write:
where . Taking the limits when and go to zero respectively concludes the proof.
6 Comparison with MOCA
Recall from Lemma 3 and Theorem 6 that our sample complexity bound writes as:
Hence, in the asymptotic regime $\delta \to 0$, MDP-NaS finds the optimal way to balance exploration between state-action pairs proportionally to their hardness: for sub-optimal pairs (resp. for optimal pairs).
After a preprint of this work was published, [WSJ21] proposed MOCA, an algorithm for BPI in the episodic setting. MOCA has the advantage of treating the more general case of $\varepsilon$-optimal policy identification, with finite-time guarantees on its sample complexity. The two papers have different and complementary objectives, but one can compare with their bound for exact policy identification, i.e., when $\varepsilon$ is below the threshold value for which the only $\varepsilon$-optimal policy is the best policy, in the asymptotic regime $\delta \to 0$. (However, when the objective is to find the optimal policy, it is not clear from their paper how one can determine such a value of $\varepsilon$ without prior knowledge of the ground-truth instance.) In this case, by carefully inspecting the proofs of [WSJ21], we see that MOCA's sample complexity writes as follows (one of the terms comes from the sample complexity of their sub-routine FindExplorableSets):
where is the horizon and is the sub-optimality gap of at time step . We make the following remarks about the bounds above:
1. MOCA only pays the cost of the worst-case visitation probability multiplied by the gap of the corresponding state, whereas MDP-NaS pays a double worst-case cost: the smallest visitation probability multiplied by the minimum gap $\Delta_{\min}$. As pointed out by [WSJ21], the former scaling is better, especially when the state where the minimum gap is achieved is different from the one that is hardest to reach. This is however an artefact of the upper bound in Lemma 3 that MDP-NaS uses as a proxy for the characteristic time. Using a more refined bound, or an optimization oracle that solves the best-response problem, one can remove this double worst-case dependency.
2. The sample complexity of MOCA divided by $\log(1/\delta)$ still explodes when $\delta$ goes to zero, contrary to MDP-NaS's bound.
3. The sample complexity of MDP-NaS is variance-sensitive for sub-optimal state-action pairs, while MOCA's bound depends on a worst-case factor. Indeed, as the rewards are in $[0,1]$, the variances of the value function are always bounded by this worst-case factor in the episodic setting.
We conclude this section by noting that MDP-NaS has a simple design and can easily be implemented.
7 Conclusion
To the best of our knowledge, this paper is the first to propose an algorithm with instance-dependent sample complexity for Best Policy Identification (BPI) in the full online setting. Our results are encouraging as they show: 1) how the navigation constraints of online RL impact the difficulty of learning a good policy, compared to the more relaxed sampling schemes of multi-armed bandits and MDPs with a generative model; 2) that, provided access to an optimization oracle that solves the information-theoretical lower bound (resp. some convex relaxation of the lower bound), asymptotically optimal (resp. near-optimal) sample complexity is still achievable through adaptive control of the trajectory. This opens up exciting new research questions. First, it is intriguing to understand how the mixing times (and not just the stationary distributions) of the Markov chains induced by policies impact the sample complexity of BPI in the moderate-confidence regime. A second direction would be to extend our contributions to the problem of finding an $\varepsilon$-optimal policy, which is of more practical interest than identifying the best policy.
References
- [AHKS20] Alekh Agarwal, Mikael Henaff, Sham Kakade, and Wen Sun. Pc-pg: Policy cover directed exploration for provable policy gradient learning. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 13399–13412. Curran Associates, Inc., 2020.
- [AJO09] Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems, volume 21. Curran Associates, Inc., 2009.
- [AKY20] Alekh Agarwal, Sham Kakade, and Lin F. Yang. Model-based reinforcement learning with a generative model is minimax optimal. volume 125 of Proceedings of Machine Learning Research, pages 67–83. PMLR, 09–12 Jul 2020.
- [AMK13] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J Kappen. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Machine learning, 91(3):325–349, 2013.
- [BK97] Apostolos N. Burnetas and Michael N. Katehakis. Optimal adaptive policies for markov decision processes. Mathematics of Operations Research, 22(1):222–255, 1997.
- [CT06] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, USA, 2006.
- [DKM19] Rémy Degenne, Wouter M. Koolen, and Pierre Ménard. Non-asymptotic pure exploration by solving games. In NeurIPS, 2019.
- [DMKV21] O. D. Domingues, Pierre Ménard, E. Kaufmann, and Michal Valko. Episodic reinforcement learning in finite MDPs: Minimax lower bounds revisited. In ALT, 2021.
- [DTC13] Thomas Dietterich, Majid Taleghan, and Mark Crowley. Pac optimal planning for invasive species management : Improved exploration for reinforcement learning from simulator-defined mdps. 01 2013.
- [EDMM06] Eyal Even-Dar, Shie Mannor, and Yishay Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. Journal of Machine Learning Research, 7:1079–1105, 06 2006.
- [FCG10] Sarah Filippi, Olivier Cappé, and Aurélien Garivier. Optimism in reinforcement learning and kullback-leibler divergence. In Allerton Conference on Communication, Control, and Computing, pages 115–122, Sep. 2010.
- [Fie94] C. Fiechter. Efficient reinforcement learning. In COLT ’94, 1994.
- [FMP11] G. Fort, É. Moulines, and P. Priouret. Convergence of adaptive and interacting markov chain monte carlo algorithms. Annals of Statistics, 39:3262–3289, 2011.
- [FYAY21] Fei Feng, W. Yin, Alekh Agarwal, and Lin F. Yang. Provably correct optimization and exploration with non-linear policies. ArXiv, abs/2103.11559, 2021.
- [GK16] Aurélien Garivier and Emilie Kaufmann. Optimal best arm identification with fixed confidence. In Vitaly Feldman, Alexander Rakhlin, and Ohad Shamir, editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 998–1027, Columbia University, New York, New York, USA, 23–26 Jun 2016. PMLR.
- [GK19] A. Garivier and Emilie Kaufmann. Non-asymptotic sequential tests for overlapping hypotheses and application to near optimal arm identification in bandit models. arXiv preprint, Statistics Theory, arXiv:1905.03495, 2019.
- [HH80] P. Hall and C.C. Heyde. 2 - inequalities and laws of large numbers. In P. Hall and C.C. Heyde, editors, Martingale Limit Theory and its Application, Probability and Mathematical Statistics: A Series of Monographs and Textbooks, pages 13–50. Academic Press, 1980.
- [JKM+20] Anders Jonsson, Emilie Kaufmann, Pierre Menard, Omar Darwiche Domingues, Edouard Leurent, and Michal Valko. Planning in markov decision processes with gap-dependent sample complexity. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 1253–1263. Curran Associates, Inc., 2020.
- [Kak03] Sham Machandranath Kakade. On the sample complexity of reinforcement learning. PhD thesis, University of London, England, 2003.
- [KCG16] Emilie Kaufmann, Olivier Cappé, and Aurélien Garivier. On the complexity of best-arm identification in multi-armed bandit models. The Journal of Machine Learning Research, 17(1):1–42, 2016.
- [KK18] Emilie Kaufmann and Wouter M. Koolen. Mixture martingales revisited with applications to sequential tests and confidence intervals. arXiv preprint, arXiv:1811.11419, 2018.
- [KMDD+21] Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, and Michal Valko. Adaptive reward-free exploration. In Vitaly Feldman, Katrina Ligett, and Sivan Sabato, editors, Proceedings of the 32nd International Conference on Algorithmic Learning Theory, volume 132 of Proceedings of Machine Learning Research, pages 865–891. PMLR, 16–19 Mar 2021.
- [KMN99] Michael Kearns, Yishay Mansour, and Andrew Y. Ng. A sparse sampling algorithm for near-optimal planning in large markov decision processes. In Proceedings of the 16th International Joint Conference on Artificial Intelligence - Volume 2, IJCAI’99, page 1324–1331, San Francisco, CA, USA, 1999. Morgan Kaufmann Publishers Inc.
- [KS98] Michael Kearns and Satinder Singh. Finite-sample convergence rates for q-learning and indirect algorithms. In Proceedings of the 11th International Conference on Neural Information Processing Systems, NIPS’98, page 996–1002, Cambridge, MA, USA, 1998. MIT Press.
- [LCC+21] Gen Li, Changxiao Cai, Yuxin Chen, Yuantao Gu, Yuting Wei, and Yuejie Chi. Is q-learning minimax optimal? a tight sample complexity analysis, 2021.
- [LPW06] David A. Levin, Yuval Peres, and Elizabeth L. Wilmer. Markov chains and mixing times. American Mathematical Society, 2006.
- [LR85] T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4–22, 1985.
- [LS20] Tor Lattimore and Csaba Szepesvári. Bandit Algorithms. Cambridge University Press, 2020.
- [LWC+20a] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Breaking the sample size barrier in model-based reinforcement learning with a generative model. arXiv preprint, arXiv:2005.12900, 2020.
- [LWC+20b] Gen Li, Yuting Wei, Yuejie Chi, Yuantao Gu, and Yuxin Chen. Sample complexity of asynchronous q-learning: Sharper analysis and variance reduction, 2020.
- [MDJ+20] Pierre M’enard, O. D. Domingues, Anders Jonsson, E. Kaufmann, Edouard Leurent, and Michal Valko. Fast active learning for pure exploration in reinforcement learning. ArXiv, abs/2007.13442, 2020.
- [MP21] Aymen Al Marjani and Alexandre Proutiere. Adaptive sampling for best policy identification in markov decision processes. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 7459–7468. PMLR, 18–24 Jul 2021.
- [Rus16] D. Russo. Simple bayesian algorithms for best arm identification. In Proceedings of the 29th Conference On Learning Theory, 2016.
- [SB18] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
- [Sch68] P. Schweitzer. Perturbation theory and finite markov chains. Journal of Applied Probability, 5:401–413, 1968.
- [SL08] Alexander L Strehl and Michael L Littman. An analysis of model-based interval estimation for markov decision processes. Journal of Computer and System Sciences, 74(8):1309–1331, 2008.
- [SWW+18] Aaron Sidford, Mengdi Wang, Xian Wu, Lin Yang, and Yinyu Ye. Near-optimal time and sample complexities for solving markov decision processes with a generative model. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 5186–5196. Curran Associates, Inc., 2018.
- [Sze10] Csaba Szepesvari. Algorithms for Reinforcement Learning. Morgan and Claypool Publishers, 2010.
- [Wan17] Mengdi Wang. Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear (Sometimes Sublinear) Running Time. arXiv e-prints, page arXiv:1704.01869, April 2017.
- [WSJ21] Andrew Wagenmaker, Max Simchowitz, and Kevin Jamieson. Beyond No Regret: Instance-Dependent PAC Reinforcement Learning. arXiv e-prints, page arXiv:2108.02717, August 2021.
- [ZCA21] Andrea Zanette, Ching-An Cheng, and Alekh Agarwal. Cautiously optimistic policy optimization and exploration with linear function approximation. ArXiv, abs/2103.12923, 2021.
- [ZKB19] Andrea Zanette, Mykel J Kochenderfer, and Emma Brunskill. Almost horizon-free structure-aware best policy identification with a generative model. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 5625–5634. Curran Associates, Inc., 2019.
Appendix A Symbols
| Symbol | Definition |
|---|---|
| $m$ | Maximum length of shortest paths between pairs of states |
| $P_{\pi_u}$ | Transition kernel of the uniform policy |
| | Stationary distribution of $P_{\pi_u}$ |
| | Communication parameter |
| $\omega_o$ | Oracle weights |
| $\pi_o$ | Oracle policy |
| | Condition number |
| $P_\pi$ | Kernel of the policy $\pi$ |
| | Stationary distribution of $P_\pi$ |
| | Constants such that |
Appendix B Experiments
In this section, we test our algorithm on two small examples. The first instance is a small ergodic MDP: the rewards of each state-action pair come from independent Bernoulli distributions with means sampled uniformly at random, and the transition kernels were generated following a Dirichlet distribution. The second instance is the classical RiverSwim from [SL08], which is communicating but not ergodic; the instance we used has deterministic transitions, and rewards are null everywhere except in the two extreme states, where they are Bernoulli. We fix a confidence level $\delta$, and for each of these MDPs we run Monte-Carlo simulations of MDP-NaS with either C-Navigation or D-Navigation. Towards computational efficiency, we note that the empirical oracle policy does not change significantly after collecting one sample; therefore we only update it periodically. (The update period was chosen so as to save computation time, knowing that for MDPs, algorithms usually require a large number of samples to return a reasonably good policy.)
First, we seek to check whether the frequencies of state-action visits converge to their oracle weights, as stated in Theorem 7. Figure 2 shows the relative distance, in log scale, between the vector of empirical frequencies $N(t)/t$ and the oracle allocation $\omega_o$. The shaded area represents lower and upper quantiles over the runs. We see that the relative distance steadily decreases with time, indicating that the visit-frequencies of both D-Navigation and C-Navigation converge to the oracle allocation. We also note that the D-Navigation rule exhibits faster convergence than the C-Navigation rule.
Next we compare our algorithm with Variance-Reduced Q-Learning (VRQL) [LWC+20b], a variant of classical Q-learning with faster convergence rates. VRQL finds an accurate estimate of the Q-function by following a fixed sampling rule, referred to as the behavior policy, and updating its estimate of the Q-function via Temporal Difference learning. VRQL does not have a stopping rule, but is guaranteed to yield an estimate of the prescribed accuracy with high probability after using a number of samples given by
a formula involving the minimum state-action occupancy and the mixing time of the behavior policy, along with some large enough universal constants. We use VRQL with the uniform policy as the sampling rule (in the absence of prior knowledge about the MDP, the uniform policy is a reasonable choice to maximize the minimum occupancy) and with an accuracy small enough to identify the best policy. We plug these values into the equations above and then compute the sample complexity of VRQL. Table 2 shows a comparison of the sample complexities of MDP-NaS and VRQL. MDP-NaS has much better performance than VRQL.
| | MDP-NaS | VRQL |
|---|---|---|
| Small ergodic MDP | | |
| RiverSwim | | |
Appendix C Sample complexity lower bound
Let $\mathrm{Alt}(\mathcal{M})$ denote the set of MDPs $\mathcal{M}'$ such that $\mathcal{M} \ll \mathcal{M}'$ and $\Pi^\star(\mathcal{M}') \cap \Pi^\star(\mathcal{M}) = \emptyset$. The information constraints are obtained by change-of-measure arguments as in the bandit literature [LR85, KCG16]:
Lemma 9.
([MP21]) For any $\delta$-PC algorithm and for any $\mathcal{M}' \in \mathrm{Alt}(\mathcal{M})$, we have:
$$\sum_{(s,a)\in\mathcal{Z}} \mathbb{E}_{\mathcal{M}}[N_\tau(s,a)]\,\mathrm{KL}_{\mathcal{M}|\mathcal{M}'}(s,a) \;\geq\; \mathrm{kl}(\delta, 1-\delta). \qquad (15)$$
C.1 Navigation constraints: proof of Lemma 1
For all states ,
where denotes the state observed after the -th time has been visited. Fix . Introduce . Observe that and are independent. Furthermore, . Hence:
Finally,
(16) |
From the above equality, the lemma is proved by just observing that , for any , and for any .
C.2 Lower bound: proof of Proposition 2
Proposition 10.
The expected sample complexity of any -PC algorithm is larger than the value of the following optimization problem:
(17) | |||
In (C.2), each variable is interpreted as the expected number of times the corresponding state-action pair is visited before the stopping time. Note that the above proposition provides a lower bound for any value of $\delta$. We can further simplify this bound when restricting our attention to asymptotic regimes where $\delta$ goes to 0. In that case, the navigation constraints (1) can be replaced by their normalized counterparts (by renormalizing and letting $\delta$ go to 0). For small $\delta$ regimes, we can hence rewrite the lower bound as follows:
One can easily conclude by showing that the value of the optimization program above is equal to the one in Eq. (2).
C.3 Full definition of the terms in the upper bound
Let denote the maximum variance of the value function on the trajectory of the optimal policy. In [MP21] are defined the following functionals of :
Then is simply defined as . Note that .
Appendix D Sampling rule
Recall that $\mathcal{Z}$ denotes the set of state-action pairs. Any policy $\pi$ induces a Markov chain on $\mathcal{Z}$ whose kernel $P_\pi$ is defined by:
Fact 1:
Note that if it takes at most $m$ steps to move with non-zero probability between any pair of states, then we can move between any pair of state-action pairs in at most $m+1$ steps by playing the policy:
where is the policy corresponding to the shortest path to . Finally, for the sake of simplicity, we will write .
D.1 Almost sure forced exploration: proof of Lemma 4
Consider the event . Observe that , where for , . We will prove that for all , which implies the desired result. From Fact 1 above, we have:
(18) |
where is the -th power of the transition matrix induced by policy . Therefore:
is well defined. Fix and let be a policy and an integer satisfying the property (18) above for the pair . Observe that:
where the matrix inequality is entry-wise. Now define the stopping times at which the agent reaches the state-action pair for the -th time (we restrict our attention to departure state-action pairs that are visited infinitely often; such pairs always exist, so these stopping times are well defined). Then:
The second inequality comes from a union bound and the strong Markov property (sometimes referred to as: Markov chains start afresh after stopping times). The last inequality comes from the fact that and . Now observe that the inequality above holds for all realizations of the sequences . Therefore, integrating that inequality over all possible sequences of policies yields:
We can already see that if the state-action pair is visited "frequently enough" ( for some constant ) then the right-hand side above will be zero. Since we know that at least one state-action pair is visited frequently enough, we consider the product over all state-action pairs of the probabilities above:
(19) | ||||
We will now show that for all tuples :
Now observe that for all there exists such that , i.e., at least one state-action has been visited times before time step . For that particular choice of and since is decreasing, we get:
For the choice of the right-hand side above is zero. To sum up, for all realizations of :
Therefore, for all : and consequently: .
D.2 Minimal exploration rate for communicating MDPs
Indeed if the agent visits state at time , then the last transitions before must have been to the right, i.e., . Therefore . In particular this implies that for , . Therefore using the reverse Fatou lemma and Markov's inequality:
D.3 Minimal exploration rate for ergodic MDPs
This is a consequence of Proposition 2 of [BK97], stating that there exist such that for all and large enough, . A union bound yields: . To extend this result to the numbers of visits at the various (state, action) pairs, we can derive a lower bound on given that by observing that a worst-case scenario (by monotonicity of ) occurs when is visited only in the rounds before . We get . Remarking that is a sub-martingale with bounded increments, standard concentration arguments then imply that , where . Next, define the random variable . Applying the reverse Fatou lemma, we get . From there, we directly deduce (by monotonicity of ) that a.s. .
D.4 High probability forced exploration
Lemma 11.
Denote by the k-th time the agent visits the state-action pair . Under C-Navigation with exploration rate we have: for all , there exists a parameter that only depends on such that:
where .
Corollary 1.
Denote by the number of times the agent visits state-action up to and including time step . Then under the same notations of the lemma above we have: for all :
Proof.
Let be some increasing function such that and and define the event . We will prove the following more general result:
(20) |
where is a constant depending on the communication properties of . Then we will tune and so that the right-hand side is less than . First, observe that:
Using the decomposition above, we upper bound the probability of by the sum of the probabilities that the -th excursion from and back to takes too long:
(21) |
where
We will now prove an upper bound on for a fixed and .
1) Upper bounding the probability that an excursion takes too long:
Let us rewrite as
so that state-action corresponds to the last row and last column. Further let denote the vector of probabilities of transitions at time from to states different from . Using a simple recurrence on , one can prove that for all we have:
(22) |
Using Lemma 20, there exists (that only depends on ) such that for all and all sequences such that we have:
(23) |
Therefore using (22) for and breaking the matrix product into smaller product terms of matrices, we get for :
(24) |
where in the fourth line we used that . The sixth line uses the fact that the matrices are substochastic. The last line is due to the fact that and is decreasing. Similarly, one can prove that:
(25) |
where we used the fact that . Now we only have to tune and so that and conclude using (21), (24) and (25).
2) Tuning and the exploration rate:
Since the sequence is decreasing we have:
For where and we have: and , implying:
Summing the last inequality, along with (21), (24) and (25) we get:
For , we have , which gives the desired result.
Remark 4.
It is natural that depends on , which expresses how well connected the MDP is under the uniform policy; see the proof of Lemma 20.
∎
D.5 An Ergodic Theorem for non-homogeneous Markov Chains
We start with some definitions and a technical result. Let be a collection of Markov transition kernels on the state-action space , indexed by policies . For any Markov transition kernel , bounded function and probability distribution , we define:
For a measure and a function , denotes the mean of w.r.t. . Finally, for two policies and we define .
We consider a -valued process such that is -adapted and for any bounded measurable function :
The next result is adapted from [FMP11]. There the authors prove an ergodic theorem for adaptive Markov Chain Monte-Carlo (MCMC) algorithms with a general state space and a parameter-dependent function. For the sake of completeness, we include here the proof of their result in the simple case of finite state-space chains with a function that does not depend on the policy .
Proposition 12.
Proof.
Consider the difference
(26) |
We clearly have:
(28) |
Next, by Lemma 22 there exists a constant (that only depends on ) such that: . Therefore:
(29) |
where the convergence to zero is due to assumption (B2). Now to bound we use the function solution to the Poisson equation . By Lemma 23, exists and is solution to the Poisson equation. Therefore we can rewrite as follows:
(30) |
where
Bounding :
Bounding :
Bounding :
D.6 Application to C-Navigation: proof of Theorem 7
We will now prove that C-Navigation verifies the assumptions (B1-5).
(B1):
(B2):
By Lemma 4 we have: for all . Hence and by continuity: , which implies that:
(33) |
(B3):
By Lemma 13, satisfies (B3) for and .
(B4):
We have:
(34) |
Note that since (ergodicity of ), and entry-wise. Similarly, it is trivial that since and . Therefore and:
(35) |
(B5):
We have: . Hence: , where and are viewed as vectors. On the other hand:
Therefore
For D-Navigation, we get in a similar fashion:
D.7 Geometric ergodicity of the sampling rules
Since is ergodic, there exists such that for all (Proposition 1.7, [LPW06]). Thus we define
(36) | ||||
(37) |
where is the stationary distribution of .
Lemma 13.
Let (resp. ) denote the oracle policy of (resp. the Cesàro-mean of oracle policies up to time ). Further define
Then for D-Navigation (resp. C-Navigation) we have:
where and (resp. and ). In particular (resp. ).
Proof.
We only prove the lemma for C-Navigation. The statement for D-Navigation can be proved in the same way. Recall that: . Therefore:
Appendix E Stopping rule
E.1 Deviation inequality for KL divergences of rewards
We suppose that the reward distributions come from a one-dimensional exponential family and can therefore be parametrized by their respective means . Furthermore, for any such that , we let denote the distribution belonging to the same exponential family, whose mean is the empirical average . For , define the function and its inverse . Further define the function by:
Finally let
where . Now we recall a deviation inequality from [KK18], which we use for the empirical KL divergence of rewards.
Lemma 14.
(Theorem 14, [KK18]) Define the threshold . Then for all :
Remark 5.
One can easily see that .
E.2 Deviation inequality for KL divergences of transitions
Our second deviation inequality is adapted from Proposition 1 in [JKM+20]. There the authors derive a deviation inequality for a single KL divergence of a multinomial distribution. In order to get a deviation inequality of a sum of KL divergences, we modified their proof by considering the product over state-action pairs of the martingales they used. For the sake of self-containedness, we include the proof below. is defined as the categorical distribution with a vector of probabilities satisfying:
Lemma 15.
(Proposition 1, [JKM+20]) Define the threshold . Then for all we have:
with the convention that whenever .
Proof.
We begin with a few notations. For any vector and any element of the simplex , we denote . We define the log-partition function of a discrete distribution supported over by:
We use the shorthand and let . For and such that the binomial coefficient is defined as: . Finally is the Shannon entropy of distribution .
Building a convenient mixture martingale for every state-action pair:
Following [JKM+20], we define for every integer :
(38) |
The sequence is an -martingale since:
The same holds trivially when . Now, we define the mixture martingale defined by the family of priors where follows a Dirichlet distribution with parameters :
where in the second inequality we used Lemma 16 and denotes the i-th component of . Now using Lemma 17, we upper bound the binomial coefficients which leads to:
The product martingale:
Taking the product over all state-action pairs we get:
(39) |
Next, using that we get:
Hence (39) becomes:
(40) |
Now, we show that is a martingale. For any fixed pair we have:
where the third equality is because and are independent conditionally on . Finally, using the tower rule we get:
Hence is a martingale. Thanks to Doob’s maximal inequality we have:
In view of (40), we conclude that for we have:
∎
Lemma 16.
Lemma 17.
(Theorem 11.1.3, [CT06]) Let , and such that then:
where is the Shannon entropy of the discrete distribution over with vector of probabilities .
E.3 Correctness of the stopping rule
Appendix F Sample complexity upper bound
F.1 Almost sure upper bound: proof of Theorem 6 (i)
Proof.
Consider the event . By Lemma 4 and Theorem 7, we have . We will prove that under , .
Fix . There exists such that for all :
(41) | |||
(42) | |||
(43) |
where the last two inequalities come from the fact that both thresholds satisfy . Combining the inequalities above with the definition of , we get:
Since , the last inequality implies that . Taking the limit when goes to zero finishes the proof. ∎
F.2 Upper bound in expectation: proof of Theorem 6 (ii)
Proof.
We start by defining the semi-distance between MDPs:
Now for , by continuity of (a consequence of Berge's theorem, which gives the continuity of ), there exists such that:
where . For , consider the concentration events (for simplicity and w.l.o.g., we consider that and are integers):
where is a mapping defined in Proposition 19. We will upper bound the stopping time of MDP-NaS under . Define:
By Proposition 19, there exists such that for all , conditionally on , the event occurs with high probability. For we have:
(44) |
Furthermore, using the bound in the definitions of the thresholds and and the fact that , we prove the existence of such that for all :
(45) | ||||
(46) |
Finally define:
Using (44-46), we have for all under the following holds:
In other words:
(47) |
Therefore (where denotes the complement of event ):
where we used Lemma 18 and Proposition 19 in the last inequality. This implies that is finite and:
where we used . Finally, we take the limit . Since and then which finishes the proof. ∎
F.3 Concentration of the empirical MDPs
Lemma 18.
Define the event . Then there exist two positive constants and that only depend on and such that:
Proof.
For simplicity we will denote by . Consider the forced exploration event:
where and is a parameter that only depends on . Applying Corollary 1 for , we get . Therefore we have:
(48) |
On the other hand:
(49) |
Using a union bound and Chernoff-Hoeffding theorem respectively we get for :
(50) |
In a similar fashion we prove that:
(51) |
(52) | ||||
(53) |
Thus, for the following choice of constants
Combined with (48), the previous inequality implies that:
∎
F.4 Concentration of state-action visitation frequency
The following proposition is a somewhat stronger version of Proposition 12. The fact that is within a distance of at most from enables us to have a tighter control of the ergodicity constants . This in turn allows us to derive a finite sample bound on the deviations of state-action frequency of visits. Before stating the result we recall some simple facts:
Fact 1:
For any two policies we have: , where the norm is on policies viewed as vectors of .
Fact 2:
Under , for we have:
Fact 3:
Under , for we have:
where we used Lemma 22, Fact 1, the definitions of and and Fact 2 respectively.
Fact 4:
Under , for we have:
Proposition 19.
Under C-Navigation, for all , there exists a time such that for all , all and all functions , we have:
where is a mapping with values in such that . In particular, this implies that:
Corollary 2.
We have .
Proof.
Consider the difference
We clearly have:
(55) |
Using Fact 3 and integral-series comparison, we upper bound the third term as follows:
(56) |
Now to bound we use the function solution to the Poisson equation . By Lemma 23, exists and is solution to the Poisson equation. Therefore we can rewrite as follows:
(57) |
where
Bounding :
Bounding :
Bounding :
Appendix G Technical Lemmas
G.1 Upper bound on the norm of products of substochastic matrices
Before we proceed with the lemma, we lay out some definitions. denotes the minimum positive probability of transition in . Similarly, define the minimal probability of reaching some state-action pair from any other state-action pair after a given number of transitions (refer to the preamble of Appendix D for more detail) in the Markov chain induced by the uniform random policy. Finally, .
Lemma 20.
Fix some state-action and let be the transition matrix under some policy satisfying for all . Define the substochastic matrix obtained by removing from the row and the column corresponding to :
Then we have:
Proof.
Define the sum of the k-th row in the product of matrices for . We will prove that for all : . The result follows immediately by noting that .
Consider such that (such a pair always exists since the MDP is communicating) and let be the index of the corresponding row in . Then for all :
(64) |
Now for we have:
(65) |
where in the fifth line we use the fact that for all : since the matrices are substochastic. The last line comes from (64). Now for all other indexes we have:
(66) |
where we used (65) and the fact that the matrix is substochastic. Now, since the MDP is communicating, we can reach any state-action pair from any other state-action pair after some steps in the Markov chain corresponding to the uniform random policy. In other words, if is the index corresponding to then there exists , such that . Therefore:
(67) |
Thus, combining (66) for and (67) we get:
∎
G.2 Geometric ergodicity: a general result
The following lemma is adapted from the proof of the Convergence theorem (Theorem 4.9, [LPW06]).
Lemma 21.
Let be a stochastic matrix with stationary distribution vector . Suppose that there exist and an integer such that for all . Let be a rank-one matrix whose rows are equal to . Then:
where .
Proof.
We write: where is a stochastic matrix. Note that for all since . Furthermore for all stochastic matrices since all rows of are equal. Using these properties, we will show by induction that :
For the result is trivial. Now suppose that . Then:
Therefore the result holds for all . Hence , which implies:
∎
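The statement can be checked numerically. The sketch below draws a random ergodic stochastic matrix, takes $r = 1$ and $\theta = \min_{x,y} P(x,y)/\pi(y)$, and verifies that the total-variation distance to stationarity decays at least as fast as $(1-\theta)^n$, which is the standard form of the convergence theorem of [LPW06]; the example matrix and horizon are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)
n_states = 5
P = rng.dirichlet(np.ones(n_states), size=n_states)      # ergodic stochastic matrix

# Stationary distribution: left eigenvector of P for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1))])
pi = pi / pi.sum()

theta = np.min(P / pi[None, :])                           # P(x, y) >= theta * pi(y), r = 1
Pi = np.tile(pi, (n_states, 1))                           # rank-one matrix with rows pi

Pn = np.eye(n_states)
for n in range(1, 21):
    Pn = Pn @ P
    tv = 0.5 * np.max(np.abs(Pn - Pi).sum(axis=1))        # max_x TV(P^n(x, .), pi)
    assert tv <= (1 - theta) ** n + 1e-12                 # convergence-theorem bound
print("bound verified up to n = 20, theta =", round(theta, 3))
```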
G.3 Condition number of Markov Chains
Lemma 22.
(Theorem 2 in [Sch68]) Let (resp. ) be the transition kernel of a Markov Chain with stationary distribution (resp. ). Define . Then:
where . Crucially, in our setting this implies that there exists a constant that only depends on such that for all :
where .
G.4 Properties of Poisson equation’s solutions
Lemma 23.
Let be a Markov transition kernel satisfying assumptions (B1) and (B3) and denote by its stationary distribution. Then, for any bounded function , the function defined by is well defined and is a solution to the Poisson equation . Furthermore:
and for any pair of kernels :
Proof.
We will prove that is well defined. Checking that it satisfies the Poisson equation is straightforward. Observe that:
where the second equality is because for all . From the last expression, we see that the sum defining converges and we have the first bound . Now for the second bound, we write:
(68) |
Using the same trick as before we obtain:
(69) |
On the other hand, a simple calculation shows that , i.e., it is a solution to the modified Poisson equation whose right-hand side is . Therefore:
and
(70) |
Summing up equation (68) and inequalities (69-70) ends the proof. ∎