Stanford University
Optimal Sample Complexity of Reinforcement Learning for
Mixing Discounted Markov Decision Processes
Abstract
We consider the optimal sample complexity theory of tabular reinforcement learning (RL) for maximizing the infinite horizon discounted reward in a Markov decision process (MDP). Optimal worst-case complexity results have been developed for tabular RL problems in this setting, leading to a sample complexity dependence on $\gamma$ and $\epsilon$ of the form $\widetilde{\Theta}\big((1-\gamma)^{-3}\epsilon^{-2}\big)$, where $\gamma$ denotes the discount factor and $\epsilon$ is the solution error tolerance. However, in many applications of interest, the optimal policy (or all policies) induces mixing. We establish that in such settings, the optimal sample complexity dependence is $\widetilde{\Theta}\big(t_{\mathrm{mix}}(1-\gamma)^{-2}\epsilon^{-2}\big)$, where $t_{\mathrm{mix}}$ is the total variation mixing time. Our analysis is grounded in regeneration-type ideas, which we believe are of independent interest, as they can be used to study RL problems for general state space MDPs.
1 Introduction
Reinforcement learning (RL) (Sutton and Barto, 2018) has witnessed substantial empirical successes in a wide range of applications, including robotics (Kober et al., 2013), computer vision (Sadeghi and Levine, 2016), and finance (Deng et al., 2017). This has sparked substantial research on RL theory and applications in operations research and management sciences. This paper provides a theoretical contribution to this area by considering a tabular RL environment in which a controller wishes to maximize an infinite horizon discounted reward governed by a Markov decision process (MDP). In many applications, due to engineering and managerial considerations, the underlying MDP is naturally constrained to an environment in which good policies must be stable, cf. Bramson, (2008). In these settings, the controlled Markov chain induced by a reasonable policy will converge in distribution to a unique steady state behavior, regardless of the initial condition. This phenomenon is known as mixing. When the mixing rate of a system is rapid, a finite-time observation of the state process provides a more accurate statistical depiction of its long-term behavior. Consequently, it is reasonable to expect that for such systems, a lower-complexity algorithm for policy learning is feasible, even in the presence of small discount rates, provided that an optimal policy displays fast mixing. This intuition serves as the motivation and driving force for this paper.
In addition, an optimal algorithm and analysis for discounted mixing MDPs can serve as valuable algorithmic and theoretical tools for achieving optimal sample complexity results for long-run average reward MDPs through the reduction method proposed in Jin and Sidford, (2021). While the worst case sample complexities of discounted tabular RL algorithms have been studied extensively in recent years (Azar et al.,, 2013; Agarwal et al.,, 2020; Li et al.,, 2022), we are unaware of a complexity theory and algorithm that is optimized for discounted mixing MDPs. Therefore, it is of significant practical and theoretical value to develop a satisfactory complexity theory for discounted mixing MDPs.
This paper contributes to the existing literature in the following ways. First, we adapt the regeneration idea and the split chain technique (Athreya and Ney, 1978; Nummelin, 1978) for analyzing stable Markov chains to the RL context. This framework naturally leads to measuring the mixing rate of the system by the minorization time (Definition 3), which is the expected amount of time between successive Markov chain regenerations. The drift and minorization formulation has been widely used in modeling and analyzing stable queueing systems (Meyn and Down, 1994; Bramson, 2008). In Theorem 1, we prove that in the setting of uniformly ergodic Markov chains with finite state space, the minorization time is equivalent to the total variation mixing time that is used in the literature on the sample complexity of mixing MDPs (Wang, 2017; Jin and Sidford, 2021). Moreover, this formulation directly generalizes to infinite state spaces, foreshadowing a generalization of our results to general state space MDPs.
Secondly, we establish algorithms and worst case sample complexity upper bounds under three different mixing assumptions using Wainwright's variance-reduced Q-learning (Wainwright, 2019b) algorithmic framework. We begin with a general setting in which there exists an optimal policy such that an upper bound on its minorization time is available. Next, we require that there is a uniform upper bound on the minorization times of the class of optimal policies. Finally, we consider the case where the minorization times of all policies have a uniform upper bound. Our worst case sample complexity guarantees in these three contexts are summarized in Table 1. Note that in all three cases the dependence on the effective horizon $(1-\gamma)^{-1}$ is quadratic.
Thirdly, we introduce the non-asymptotic minimax risk of the class of MDPs with a uniform minorization time upper bound (Definition 4). We compare this definition of minimax risk to the instance-dependent version in Khamaru et al., 2021b and find ours easier to interpret. Moreover, we prove the same lower bounds in Table 1 for these two definitions of the non-asymptotic minimax risk by constructing a family of hard MDP instances (Definition 5) indexed by the discount factor and the minorization time. This allows us to conclude that our sample complexity upper bounds are optimal up to log factors.
Mixing Assumption | Sample Complexity | Relevant Theorems |
---|---|---|
One optimal policy (Assumption 1) | $\widetilde{O}\big(|\mathcal{S}||\mathcal{A}|\,t_{\mathrm{minorize}}(1-\gamma)^{-2}\epsilon^{-2}\big)$ | Theorem 2 |
All optimal policies (Assumptions 2, 3) | $\widetilde{O}\big(|\mathcal{S}||\mathcal{A}|\,t_{\mathrm{minorize}}(1-\gamma)^{-2}\epsilon^{-2}\big)$ | Theorem 3 |
All policies (Assumption 4) | $\widetilde{O}\big(|\mathcal{S}||\mathcal{A}|\,t_{\mathrm{minorize}}(1-\gamma)^{-2}\epsilon^{-2}\big)$ | Theorem 4 |
Lower Bound | $\widetilde{\Omega}\big(|\mathcal{S}||\mathcal{A}|\,t_{\mathrm{minorize}}(1-\gamma)^{-2}\epsilon^{-2}\big)$ | Theorems 5, 6 |
In this paper, we assume the availability of a simulator, also known as a generative model, that can generate independent state transitions for any input state-action pair. Given the significant memory and computational demands associated with vast state and action spaces, we adopt a model-free approach in which the algorithm only maintains a state-action value function, or $Q$-function. Our learning objective is to produce an estimator of the optimal $Q$-function within $\epsilon$ absolute error in the sup norm. Our algorithm and analysis achieve the optimal non-asymptotic minimax complexity in this regard. Notably, the variance bounds established in this work can be readily applied to the analysis of model-based algorithms, such as Azar et al., (2013), and the same optimal sample complexity upper bounds as in this work should follow.
1.1 Literature Review
- Markov Stability and Regeneration: There is a rich literature on the stability of Markovian stochastic systems using the regeneration idea (Athreya and Ney, 1978; Nummelin, 1978; Meyn and Down, 1994; Bramson, 2008; Meyn et al., 2009). It has also been an important theoretical tool for the design and analysis of simulation (Crane and Lemoine, 1977; Henderson and Glynn, 2001), statistics (Gilks et al., 1998), and machine learning (Glynn and Ormoneit, 2002; Lemańczyk, 2021) procedures. We provide a brief review of this literature since we study the mixing behavior of MDPs using these techniques.
- Sample Complexity of Discounted Tabular RL: The worst case sample complexity theory of tabular RL has been extensively studied in recent years. Two modeling perspectives inspire the model-based and model-free approaches to RL algorithmic design. In a model-based approach, the controller accumulates a data set to construct an empirical model of the underlying MDP and solves it by variants of the dynamic programming principle. Azar et al., (2013); Sidford et al., (2018); Agarwal et al., (2020); Li et al., (2022) propose algorithms and prove optimal upper bounds (the matching lower bound is proved in Azar et al., (2013)) on the sample complexity of achieving $\epsilon$ error using the model-based approach. On the other hand, the model-free approach only maintains lower dimensional statistics of the transition data and iteratively updates them. The celebrated Q-learning algorithm (Watkins and Dayan, 1992) and its generalizations are examples of model-free algorithms. Li et al., (2021) shows that Q-learning has a minimax sample complexity of order $\widetilde{\Theta}\big(|\mathcal{S}||\mathcal{A}|(1-\gamma)^{-4}\epsilon^{-2}\big)$. Nevertheless, Wainwright, 2019b proposes a variance-reduced variant of Q-learning that achieves the aforementioned model-based sample complexity lower bound $\widetilde{\Theta}\big(|\mathcal{S}||\mathcal{A}|(1-\gamma)^{-3}\epsilon^{-2}\big)$. Recent advances in the sample complexity theory of Q-learning and its variants are propelled by breakthroughs in the finite time analysis of stochastic approximation (SA) algorithms. Wainwright, 2019a proves a sample path bound for the SA recursion when the random operator (as opposed to its expectation) is monotone and a contraction. These strong assumptions yield path-wise error bounds that enable the variance reduction techniques used to achieve optimal complexity in Wainwright, 2019b. In comparison, Chen et al., (2020) establishes finite sample guarantees for SA only under a second moment bound on the martingale difference noise sequence.

The worst case analysis provides guarantees on the convergence rate across all instances of $\gamma$-discounted MDPs. Notably, instances that attain the complexity lower bound must involve a transition kernel and reward function that depend on $\gamma$. In contrast, Khamaru et al., 2021b delve into instance-dependent settings, where the transition kernel and reward function remain fixed, yielding matching sample complexity upper and lower bounds. Furthermore, Wang et al., (2023) explore an intermediate scenario where MDPs are assumed to possess an upper bound on their mixing time. This intermediate setting holds particular significance for this paper's main objective, as elaborated upon in the introduction.
- Sample Complexity of Average Reward RL: Mixing MDPs arise naturally in RL settings where the controller's objective is to maximize the long run average reward per unit of time achieved by the control policy. Wang, (2017); Jin and Sidford, (2020) propose algorithms that solve the policy learning problem directly from the average reward MDP model, achieving a complexity dependence of order $t_{\mathrm{mix}}^2\epsilon^{-2}$. On the other hand, Jin and Sidford, (2021); Wang et al., (2022) use a discounted MDP with $\gamma$ large enough to approximate the average reward MDP and prove worst case complexity bounds. Notably, the contemporary work Wang et al., (2022) only makes an assumption on the optimal bias function, which in the worst case is equivalent to assuming an upper bound on the mixing time of one of the optimal policies, whereas the other aforementioned works assume a uniform bound on the mixing time over all policies. Our results in the discounted case echo this behavior: assuming mixing of only one of the optimal policies already leads to a reduction in the complexity.
2 Markov Decision Processes: Definitions
Let $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, P, \gamma)$ be an MDP, where $\mathcal{S}$ and $\mathcal{A}$ are finite state and action spaces, $r : \mathcal{S}\times\mathcal{A}\to[0,1]$ is the reward function, and $P$ is the transition kernel; $\gamma\in(0,1)$ is the discount factor. We assume that $\mathcal{S}$, $\mathcal{A}$, $r$, and $\gamma$ are known, and we can draw samples from the transition measures $P(\cdot\mid s,a)$ independently through a sampler (generative model).
Let $\Omega = (\mathcal{S}\times\mathcal{A})^{\infty}$ and let $\mathcal{F}$ be the product $\sigma$-field, so that $(\Omega,\mathcal{F})$ is the underlying measurable space. Define the stochastic process $\{(X_t, A_t) : t\ge 0\}$ by the point evaluations $X_t(\omega) = x_t$ and $A_t(\omega) = a_t$ for all $t\ge 0$ and any $\omega = (x_0, a_0, x_1, a_1, \dots)\in\Omega$. At time $t$ and state $x$, if action $a$ is chosen, the decision maker will receive a reward $r(x,a)$, and then the law of the subsequent state is determined by the transition kernel: $X_{t+1}\sim P(\cdot\mid x, a)$. It is well known that to achieve optimal decision making in the context of infinite horizon discounted MDPs (to be introduced), it suffices to consider policies that are stationary, Markov, and deterministic. Therefore, we will restrict ourselves to this class of policies.
Let $\Pi$ denote the class of stationary Markov deterministic policies; i.e., any $\pi\in\Pi$ can be seen as a function $\pi:\mathcal{S}\to\mathcal{A}$. For $\pi\in\Pi$ and initial distribution $\mu$, there is a probability measure $P_\mu^\pi$ on the product space s.t. the chain $\{(X_t,A_t): t\ge 0\}$ has finite dimensional distributions
$P_\mu^\pi(X_0 = x_0, A_0 = a_0, \dots, X_t = x_t, A_t = a_t) = \mu(x_0)\,\mathbf{1}\{a_0 = \pi(x_0)\}\prod_{k=1}^{t} P(x_k\mid x_{k-1}, a_{k-1})\,\mathbf{1}\{a_k = \pi(x_k)\}.$
Note that this also implies that $\{X_t : t\ge 0\}$ is a Markov chain under $P_\mu^\pi$. Let $E_\mu^\pi$ denote the expectation under $P_\mu^\pi$. For $\mu$ with full support on $\mathcal{S}$, the value function $v^\pi$ of policy $\pi$ is defined via
$v^\pi(s) := E_\mu^\pi\!\left[\sum_{t=0}^{\infty}\gamma^t r(X_t, A_t)\,\middle|\, X_0 = s\right],$
while the optimal value function is given by
$v^*(s) := \max_{\pi\in\Pi} v^\pi(s), \quad s\in\mathcal{S}$. (1)
It can be seen as a vector $v^*\in\mathbb{R}^{\mathcal{S}}$. Let $\Pi^*$ denote the set of all optimal stationary Markov deterministic policies; i.e., those $\pi\in\Pi$ that achieve the maximum in (1).
Let $P_{s,a}v$ denote the sum $\sum_{s'\in\mathcal{S}} P(s'\mid s,a)\,v(s')$. It is well known that the optimal value function is the unique solution of the following Bellman equation:
$v^*(s) = \max_{a\in\mathcal{A}}\big\{r(s,a) + \gamma P_{s,a}v^*\big\}, \quad s\in\mathcal{S}$. (2)
To learn an optimal policy $\pi^*\in\Pi^*$, it is useful to introduce the $Q$-function: for any policy $\pi$,
$Q^\pi(s,a) := r(s,a) + \gamma P_{s,a}v^\pi$,
which can be identified with a vector in $\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$. Define the notation $v(Q)(s) := \max_{a\in\mathcal{A}} Q(s,a)$. When $\pi^*$ achieves the maximum in (1), we denote the corresponding optimal $Q$-function by $Q^* := Q^{\pi^*}$, and hence $v^* = v(Q^*)$. Note that $Q^*$ is the unique solution to the Bellman equation for the $Q$-function:
$Q^*(s,a) = r(s,a) + \gamma P_{s,a}\,v(Q^*), \quad (s,a)\in\mathcal{S}\times\mathcal{A}$, (3)
where the mapping
$\mathcal{T}(Q)(s,a) := r(s,a) + \gamma\sum_{s'\in\mathcal{S}} P(s'\mid s,a)\max_{a'\in\mathcal{A}} Q(s',a')$ (4)
is known as the Bellman operator for the $Q$-function. Here we list two elementary properties of the Bellman operator and the $Q$-function; the proofs can be found in Puterman, (1994). First, $\mathcal{T}$ is a $\gamma$-contraction in the $\ell_\infty$-norm; i.e., $\|\mathcal{T}(Q) - \mathcal{T}(Q')\|_\infty \le \gamma\|Q - Q'\|_\infty$ for all $Q, Q'$. Then, a consequence of $\mathcal{T}$ being a $\gamma$-contraction is
$\|Q - Q^*\|_\infty \le \frac{1}{1-\gamma}\|\mathcal{T}(Q) - Q\|_\infty$ for all $Q$.
A greedy policy $\pi_Q$ from a $Q$-function $Q$ is defined as
$\pi_Q(s) \in \arg\max_{a\in\mathcal{A}} Q(s,a)$,
where one action is picked arbitrarily if there is a tie. It is easy to see that $\pi_{Q^*}\in\Pi^*$. Therefore, policy learning can be achieved if we can learn a good estimate of $Q^*$. For $\pi\in\Pi$, we use $P_\pi$ to denote the linear operator
$(P_\pi Q)(s,a) := \sum_{s'\in\mathcal{S}} P(s'\mid s,a)\, Q(s', \pi(s'))$.
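To make these definitions concrete, the following Python sketch (our own illustration, not taken from the paper) implements the Bellman operator (4), greedy-policy extraction, and $Q$-value iteration for a tabular MDP. The array shapes and function names are assumptions of this sketch.

```python
import numpy as np

def bellman_operator(Q, r, P, gamma):
    """Bellman optimality operator T for Q-functions.
    Q, r: arrays of shape (S, A); P: transition kernel of shape (S, A, S).
    Returns T(Q)(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) max_{a'} Q(s', a')."""
    V = Q.max(axis=1)            # V(s') = max_{a'} Q(s', a')
    return r + gamma * (P @ V)   # shape (S, A)

def greedy_policy(Q):
    """Greedy (stationary, Markov, deterministic) policy from Q; ties broken by lowest index."""
    return Q.argmax(axis=1)

def q_value_iteration(r, P, gamma, tol=1e-8, max_iter=100_000):
    """Iterate the gamma-contraction T to approximate its unique fixed point Q*."""
    Q = np.zeros_like(r, dtype=float)
    for _ in range(max_iter):
        Q_next = bellman_operator(Q, r, P, gamma)
        if np.max(np.abs(Q_next - Q)) < tol:
            return Q_next
        Q = Q_next
    return Q
```

For example, `greedy_policy(q_value_iteration(r, P, gamma))` returns an optimal policy once the fixed-point iteration has converged.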
Under suitable mixing conditions, as presented in Puterman, (1994), it is useful to look at the span semi-norm of the value and $Q$-functions. For a vector $v$, let $e$ be the vector with all entries equal to one and define
$\|v\|_{\mathrm{span}} := \min_{c\in\mathbb{R}}\|v - c\,e\|_\infty$.
Note that $\|\cdot\|_{\mathrm{span}}$ is not a norm because $\|c\,e\|_{\mathrm{span}} = 0$ for $c\neq 0$, but it is an induced norm on the quotient space $\mathbb{R}^{\mathcal{S}}/\{c\,e : c\in\mathbb{R}\}$. We also note that $\|v\|_{\mathrm{span}} \le \|v\|_\infty$ for all $v$. However, $\|\cdot\|_{\mathrm{span}}$ and $\|\cdot\|_\infty$ are not equivalent on $\mathbb{R}^{\mathcal{S}}$.
Naturally, the variance of the cumulative reward induced by the optimal policy plays a key role in controlling the sample complexity of solving the MDP. Given a stationary deterministic Markov policy $\pi$, let us define $r_\pi(s) := r(s,\pi(s))$ and denote the variance of the cumulative reward by
$\sigma_\pi^2(s) := \mathrm{Var}^\pi\!\left[\sum_{t=0}^{\infty}\gamma^t r_\pi(X_t)\,\middle|\, X_0 = s\right]$. (5)
Another standard deviation parameter of interest to us is
$\sigma(Q^*)(s,a) := \gamma\sqrt{\mathrm{Var}_{S'\sim P(\cdot\mid s,a)}\big(v(Q^*)(S')\big)}$. (6)
We call this object the “1-sample standard deviation” of the Bellman operator. It is the sampling standard deviation of the so-called 1-sample Bellman operator, which we will introduce in equation (11) in the algorithm analysis section. To analyze $\sigma_\pi$ and $\sigma(Q^*)$, it is useful to define
(7)
Note that for any optimal policy , . So, .
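As an illustration of definition (6), the sketch below computes the entrywise 1-sample standard deviation of the Bellman operator exactly for a tabular model; the closed form used here (only the random next state contributes to the variance) and the variable names are our own reading of the definition, not code from the paper.

```python
import numpy as np

def one_sample_bellman_std(Q, P, gamma):
    """Entrywise standard deviation of the 1-sample Bellman operator applied to Q.
    Since hat{T}(Q)(s, a) = r(s, a) + gamma * max_{a'} Q(S', a') with S' ~ P(.|s, a),
    only the random next state is stochastic, so
    sigma(Q)(s, a) = gamma * sqrt(Var_{S' ~ P(.|s, a)}[ max_{a'} Q(S', a') ])."""
    V = Q.max(axis=1)                              # V(s') = max_{a'} Q(s', a')
    var = P @ (V ** 2) - (P @ V) ** 2              # Var[V(S')] for each (s, a)
    return gamma * np.sqrt(np.maximum(var, 0.0))   # clip tiny negatives from round-off
```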
3 Uniformly Ergodic Markov Chains and Mixing Metrics
Let $\mathcal{M} = (\mathcal{S},\mathcal{A},r,P,\gamma)$ be an MDP. Given a stationary Markov deterministic policy $\pi$, the state process $\{X_t\}$ will have transition kernel $P_\pi(s, s') := P(s'\mid s, \pi(s))$. In various applications, the state process induced by a reasonable policy will converge to a unique steady state regardless of the initial condition. In this section, we will formally characterize such behavior by introducing the notion of uniform ergodicity. A uniformly ergodic Markov chain is one for which the marginal distribution of $X_t$ converges to its stationary distribution in total variation distance at a uniform rate across all initial conditions. Recall that the total variation distance between two probability measures $\mu$ and $\nu$ on $\mathcal{S}$ is
$\|\mu - \nu\|_{\mathrm{TV}} := \sup_{B\subseteq\mathcal{S}}|\mu(B) - \nu(B)|$.
We use the norm notation here because it is equivalent to the $\ell_1$ norm if we view $\mu$ and $\nu$ as vectors in $\mathbb{R}^{\mathcal{S}}$; i.e., $\|\mu-\nu\|_{\mathrm{TV}} = \frac{1}{2}\|\mu-\nu\|_1$.
Uniform ergodicity is equivalent to the transition kernel satisfying the Doeblin condition (Meyn et al., 2009). This equivalence permits the application of the split chain representation (Athreya and Ney, 1978; Nummelin, 1978) of the Markov chain, which exhibits favorable independence properties, as illustrated in Section 3.2. Thus, the Doeblin condition will serve as our principal assumption and theoretical tool for the analysis of the sample complexity of mixing MDPs. To characterize the rate of mixing of a uniformly ergodic Markov chain, two metrics, namely the mixing time $t_{\mathrm{mix}}$ and the minorization time $t_{\mathrm{minorize}}$, emerge naturally from the total variation and Doeblin condition characterizations. We will introduce these metrics in Section 3.3. While $t_{\mathrm{mix}}$ is more commonly utilized in the literature on MDP complexity (Wang, 2017; Jin and Sidford, 2020, 2021), we have found $t_{\mathrm{minorize}}$ to be a more suitable metric for our objectives. To facilitate the comparison of our complexity theory with those in the literature, we establish the equivalence of $t_{\mathrm{mix}}$ and $t_{\mathrm{minorize}}$ up to a constant factor in Section 3.3.
3.1 Uniform Ergodicity
Definition 1 (Uniform Ergodicity).
A Markov chain with transition kernel $P_\pi$ is called uniformly ergodic if there exists a probability measure $\eta$ for which
$\sup_{s\in\mathcal{S}}\|P_\pi^n(s,\cdot) - \eta\|_{\mathrm{TV}} \to 0$ as $n\to\infty$.
Note that $\eta$ must be the unique stationary distribution of $P_\pi$: by uniform ergodicity, $\|\eta P_\pi^{n} - \eta\|_{\mathrm{TV}}\to 0$ and $\|\eta P_\pi^{n+1} - \eta\|_{\mathrm{TV}}\to 0$ as $n\to\infty$, and since $\|\eta P_\pi^{n+1} - \eta P_\pi\|_{\mathrm{TV}} \le \|\eta P_\pi^{n} - \eta\|_{\mathrm{TV}}$, it follows that $\eta P_\pi = \eta$. Moreover, if $\eta'$ is any stationary distribution of $P_\pi$, then $\|\eta' - \eta\|_{\mathrm{TV}} = \|\eta' P_\pi^n - \eta\|_{\mathrm{TV}}$ for all $n$, and the right-hand side tends to $0$; i.e., $\eta' = \eta$.
Definition 2 (Doeblin Minorization Condition).
A transition kernel $P_\pi$ satisfies the $(m, p)$-Doeblin condition for some $m\ge 1$ and $p\in(0,1]$ if there exists a probability measure $\nu$ and a stochastic kernel $R$ s.t.
$P_\pi^m(s, \cdot) = p\,\nu(\cdot) + (1-p)\,R(s,\cdot)$ (8)
for all $s\in\mathcal{S}$. The measure $\nu$ is called the minorization measure and the stochastic kernel $R$ is called the residual kernel.
Note that if $P_\pi$ is $(m, p)$-Doeblin, then it must be $(m, p')$-Doeblin for any $p'\le p$:
$P_\pi^m(s,\cdot) = p'\,\nu(\cdot) + (1-p')\left[\tfrac{p-p'}{1-p'}\,\nu(\cdot) + \tfrac{1-p}{1-p'}\,R(s,\cdot)\right]$.
The following result explains the equivalence between the Doeblin condition, uniform ergodicity, and uniform geometric ergodicity.
Theorem UE (Theorem 16.0.2 in Meyn et al., (2009)).
The following statements are equivalent:

1. There exists $n\ge 1$ s.t. $\sup_{s\in\mathcal{S}}\|P_\pi^n(s,\cdot) - \eta\|_{\mathrm{TV}} < 1/2$.
2. $P_\pi$ is uniformly ergodic.
3. There exist $C < \infty$ and $\rho\in(0,1)$ s.t. for all $s\in\mathcal{S}$ and $n\ge 0$, $\|P_\pi^n(s,\cdot) - \eta\|_{\mathrm{TV}} \le C\rho^n$.
4. $P_\pi$ is $(m, p)$-Doeblin for some $m\ge 1$ and $p\in(0,1]$.

Moreover, we have
$\sup_{s\in\mathcal{S}}\|P_\pi^n(s,\cdot) - \eta\|_{\mathrm{TV}} \le (1-p)^{\lfloor n/m\rfloor}$. (9)
Theorem UE implies the existence of a Doeblin minorization structure of any uniformly ergodic Markov kernel. As we will discuss in the next section, the Doeblin condition will allow us to use the split chain technique, which underlies our analysis of the sample complexity of mixing MDPs.
3.2 Doeblin Condition, Split Chain, and Regeneration
In this section, we introduce the split chain of a uniformly ergodic chain $\{X_t\}$. Athreya and Ney, (1978) and Nummelin, (1978) independently realized that a Markov chain with transition kernel admitting a decomposition of the form (8) can be simulated as follows:
1. Start from $t = 0$ and generate $X_0\sim\mu$, the initial distribution.
2. At time $t$, flip a coin $\delta_t$ with success probability $p$.
3. If $\delta_t = 1$, generate $X_{t+m}\sim\nu$; if not, generate $X_{t+m}\sim R(X_t,\cdot)$.
4. Generate $(X_{t+1},\dots,X_{t+m-1})$ from the conditional measure given $X_t$ and $X_{t+m}$.
5. Set $t \leftarrow t + m$ and go back to step 2.
The process $\{(X_t, \delta_t) : t\ge 0\}$ is known as the split chain. It can be shown that the original process $\{X_t : t\ge 0\}$ has the same law as the first coordinate of the split chain. Define the wide sense regeneration times $T_0 := 0$ and $T_{k+1} := \inf\{t > T_k : t\equiv 0\ (\mathrm{mod}\ m),\ \delta_{t-m} = 1\}$, and the (wide sense) regeneration cycles $\xi_k := (X_{T_k}, \dots, X_{T_{k+1}-1})$ with lengths $\tau_k := T_{k+1} - T_k$ for $k\ge 0$. This construction has the following implications:

- $X_{T_k}\sim\nu$ for $k\ge 1$, and $\tau_k \stackrel{d}{=} m\,G$ where $G\sim\mathrm{Geometric}(p)$, so that $E[\tau_k] = m/p$.
- The regeneration cycles are 1-dependent in the sense that $\xi_j$ and $\xi_k$ are independent for all $|j - k|\ge 2$.
- The regeneration cycles are identically distributed.
See Asmussen, (2003) for a detailed exposition. Because of these properties, the split chain is easier to work with. So, moving forward, whenever we analyze an $(m,p)$-Doeblin kernel $P_\pi$, the process $\{(X_t, \delta_t)\}$ will refer to the split chain. Also, since the $T_k$ are not stopping times w.r.t. the natural filtration generated by $\{X_t\}$, we define the enlarged filtration $\mathcal{F}_t := \sigma\big((X_s, \delta_s) : s\le t\big)$. Clearly, the $T_k$ are $\{\mathcal{F}_t\}$-stopping times and $\{X_t\}$ is Markov w.r.t. $\{\mathcal{F}_t\}$ as well.
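The following Python sketch simulates the split chain for the one-step ($m = 1$) Doeblin decomposition $P_\pi = p\,\nu + (1-p)R$; it is a minimal illustration of the five-step recipe above under that simplifying assumption, with variable names of our own choosing.

```python
import numpy as np

def simulate_split_chain(mu, nu, R, p, T, seed=0):
    """Simulate T steps of the split chain of a (1, p)-Doeblin kernel P = p*nu + (1-p)*R.
    mu: initial distribution (S,); nu: minorization measure (S,);
    R: residual stochastic kernel (S, S); p: minorization constant in (0, 1].
    Returns the state path and the coin (regeneration indicator) path."""
    rng = np.random.default_rng(seed)
    S = len(nu)
    states = np.empty(T, dtype=int)
    coins = np.zeros(T, dtype=int)
    x = rng.choice(S, p=mu)
    for t in range(T):
        states[t] = x
        if rng.random() < p:           # step 2: coin flip with success probability p
            coins[t] = 1
            x = rng.choice(S, p=nu)    # step 3: the next state regenerates from nu
        else:
            x = rng.choice(S, p=R[x])  # otherwise, draw from the residual kernel
    return states, coins

# The wide sense regeneration times are the steps immediately after a successful coin:
# states, coins = simulate_split_chain(mu, nu, R, p, T=1000)
# regen_times = np.flatnonzero(coins) + 1
```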
3.3 Two Mixing Metrics and Their Equivalence
The favorable characteristics of the split chain will facilitate the analysis of the behavior of the MDP under mixing policies. Consequently, we aim to study the minimax sample complexity of policy learning for an MDP by postulating the existence of an $(m,p)$-Doeblin condition. This naturally leads to the quantification of the mixing intensity of the MDP by means of the minorization time $t_{\mathrm{minorize}}$; see below. However, in the literature on sample complexity theory for mixing MDPs (Wang, 2017; Jin and Sidford, 2020, 2021), the intensity of mixing is usually quantified in terms of the mixing time $t_{\mathrm{mix}}$. In this section, we formally introduce these two measures of mixing. Furthermore, we establish that, for a uniformly ergodic Markov chain, $t_{\mathrm{mix}}$ and $t_{\mathrm{minorize}}$ are equivalent up to a multiplicative constant. This equivalence enables a direct comparison of our complexity outcomes, which use $t_{\mathrm{minorize}}$, with the existing theories.
Let $P_\pi$ be a uniformly ergodic stochastic kernel with stationary distribution $\eta$. By Theorem UE, there exist $m\ge 1$ and $p\in(0,1]$ s.t. $P_\pi$ is $(m,p)$-Doeblin.
Definition 3 (Mixing and Minorization Times).
Define the mixing time
$t_{\mathrm{mix}}(P_\pi) := \min\big\{n\ge 1 : \sup_{s\in\mathcal{S}}\|P_\pi^n(s,\cdot) - \eta\|_{\mathrm{TV}} \le 1/4\big\}$
and the minorization time
$t_{\mathrm{minorize}}(P_\pi) := \inf\big\{m/p : P_\pi \text{ is } (m,p)\text{-Doeblin}\big\}$.
We will drop the dependence on $P_\pi$ when it is clear from context.
Next, we aim to demonstrate the equivalence between these two mixing times up to a constant factor.
Theorem 1.
The two notions of mixing time, $t_{\mathrm{mix}}$ and $t_{\mathrm{minorize}}$, are equivalent up to constants: if $P_\pi$ is uniformly ergodic, then $t_{\mathrm{mix}}$ and $t_{\mathrm{minorize}}$ agree up to universal multiplicative constant factors.
We note that the proof idea for one direction follows from Aldous et al., (1997), while the other direction follows from Theorem UE. We defer the proof to the appendix.
Therefore, for a uniformly ergodic chain, quantitative knowledge of the mixing time and of the “best” minorization are equivalent up to constant factors. As a consequence of this equivalence between $t_{\mathrm{mix}}$ and $t_{\mathrm{minorize}}$, the minorization time in the complexity results highlighted in Table 1 can be replaced directly by $t_{\mathrm{mix}}$.
Beyond its theoretical convenience, $t_{\mathrm{minorize}}$ can be a more explicit metric than $t_{\mathrm{mix}}$ in the control of stochastic systems. For instance, consider a queueing system for which the sources of randomness are fully exogenous. A long inter-arrival time will imply that the system becomes empty and the state process regenerates. Therefore, we can obtain a natural upper bound on $t_{\mathrm{minorize}}$ from the frequency of observing long inter-arrival times.
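For a finite chain, both quantities in Definition 3 can be computed directly. The sketch below does so, assuming the standard $1/4$ threshold in the mixing time and using the columnwise minimum of $P^m$ to construct the best $m$-step minorization; these conventions are our assumptions for illustration.

```python
import numpy as np

def total_variation(p, q):
    """TV distance between two distributions on a finite state space (half the l1 norm)."""
    return 0.5 * np.abs(p - q).sum()

def stationary_distribution(P):
    """Left eigenvector of P for eigenvalue 1, normalized to a probability vector."""
    w, v = np.linalg.eig(P.T)
    eta = np.real(v[:, np.argmin(np.abs(w - 1))])
    return eta / eta.sum()

def mixing_time(P, eps=0.25, max_steps=100_000):
    """Smallest n with max_s || P^n(s, .) - eta ||_TV <= eps for a uniformly ergodic P."""
    eta = stationary_distribution(P)
    Pn = np.eye(P.shape[0])
    for n in range(1, max_steps + 1):
        Pn = Pn @ P
        if max(total_variation(row, eta) for row in Pn) <= eps:
            return n
    raise RuntimeError("did not mix within max_steps")

def minorization_time(P, max_m=1_000):
    """Approximate t_minorize = inf_m m / p_m, where p_m = sum_{s'} min_s P^m(s, s')
    is the largest p such that P^m(s, .) >= p * nu(.) holds for all s."""
    best = np.inf
    Pm = np.eye(P.shape[0])
    for m in range(1, max_m + 1):
        Pm = Pm @ P
        p_m = Pm.min(axis=0).sum()
        if p_m > 0:
            best = min(best, m / p_m)
    return best
```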
4 Uniformly Ergodic MDPs: Algorithms and Sample Complexity Upper Bounds
We have introduced uniform ergodicity, the minorization time as a measure of mixing that is equivalent to the mixing time, and the split chain technique. Using $t_{\mathrm{minorize}}$ as the mixing criterion, in this section we explore the worst case sample complexities of learning the optimal $Q$-function under different stability assumptions on the controlled Markov chain.
MDPs can exhibit different types of mixing behavior. However, since our primary focus is on developing a complexity theory for learning the optimal $Q$-function, we seek algorithms that obtain optimal policies by producing progressively more precise estimates of $Q^*$ as the sample size increases. Consequently, the asymptotic variability of the estimator should be influenced by the mixing characteristics of the optimal policies. Therefore, it is well motivated to make assumptions about the mixing properties of the class of optimal policies. We first consider the most general one:
Assumption 1.
There exists an optimal stationary Markov deterministic policy $\pi^*$ s.t. the transition kernel $P_{\pi^*}$ is uniformly ergodic.
In this setting, define $t_{\mathrm{minorize}} := t_{\mathrm{minorize}}(P_{\pi^*})$.
To obtain an optimal sample complexity result, we make further uniform assumptions on the minorization times of the kernels induced by the optimal policy class $\Pi^*$. In this paper, we consider the following two settings:
Assumption 2.
$\Pi^*$ is a singleton. Moreover, the unique optimal policy satisfies Assumption 1.
If $\Pi^*$ is not a singleton, we instead impose a “continuity in $Q$” assumption on the transition kernel, which we will refer to as Lipschitzness:
Assumption 3.
For any $\pi^*\in\Pi^*$, the transition kernel $P_{\pi^*}$ is uniformly ergodic, and the transition kernel satisfies a local Lipschitz condition: there is $L > 0$ s.t. for all $Q$ in a neighborhood of $Q^*$ and associated greedy policy $\pi_Q$,
(10)
In the setting of Assumptions 2 and 3, we define $t_{\mathrm{minorize}} := \max_{\pi^*\in\Pi^*} t_{\mathrm{minorize}}(P_{\pi^*})$.
We note that the settings in Assumptions 2 and 3 are considered in the Q-learning literature (Khamaru et al., 2021b; Li et al., 2023). Moreover, Assumption 3 is implied by Assumption 2 with a Lipschitz constant proportional to $1/\Delta$, where $\Delta$ is the optimality gap of the optimal policy class; see Lemma B.1 of Li et al., (2023).
Even though the asymptotic performance of a convergent algorithm should only depend on the mixing characteristics of the optimal policies, for a prescribed accuracy level $\epsilon$, assumptions on the mixing characteristics of non-optimal policies could potentially improve the performance guarantees as well. Moreover, in applications of MDPs, we typically have little a priori knowledge of the class of optimal policies. However, as mentioned earlier, there may be settings in which all policies induce mixing Markov chains with a uniform bound on the minorization times. This leads to the following assumption:
Assumption 4.
For all $\pi\in\Pi$, the transition kernel $P_\pi$ is uniformly ergodic.
In this setting, we define $t_{\mathrm{minorize}} := \max_{\pi\in\Pi} t_{\mathrm{minorize}}(P_\pi)$.
Since $\Pi$ is finite, the above maximum is always achieved and $t_{\mathrm{minorize}} < \infty$.
Remark.
The minimax sample complexity of tabular MDPs is well understood when $t_{\mathrm{minorize}} > \frac{1}{1-\gamma}$. We will assume the interesting case when $t_{\mathrm{minorize}} \le \frac{1}{1-\gamma}$. Moreover, for the convenience of our analysis, we also assume, w.l.o.g., that $t_{\mathrm{minorize}} \ge 1$.
In the following developments, we will demonstrate the impact of the mixing Assumptions 1, 2, 3, and 4 on algorithmic performance, leading to an improvement in sample complexity by a factor of $\frac{1}{t_{\mathrm{minorize}}(1-\gamma)}$ when compared to the minimax lower bound without mixing time assumptions, $\widetilde{\Theta}\big(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\epsilon^2}\big)$. In particular, the sample complexity upper bounds in Theorems 2, 3, and 4 are now enhanced to $\widetilde{O}\big(\frac{|\mathcal{S}||\mathcal{A}|\,t_{\mathrm{minorize}}}{(1-\gamma)^2\epsilon^2}\big)$.
4.1 Q-learning and Wainwright’s Variance Reduction
We first introduce the algorithmic frameworks that underlie our complexity results: the Q-learning algorithm and Wainwright’s variance reduction technique.
4.1.1 The Q-learning Algorithm
Define $\{\widehat{\mathcal{T}}_k : k\ge 1\}$ as a sequence of i.i.d. 1-sample empirical Bellman operators: i.e., for each $k$ and each pair $(s,a)$, sample $X_k(s,a)\sim P(\cdot\mid s,a)$ independently and assign
$\widehat{\mathcal{T}}_k(Q)(s,a) := r(s,a) + \gamma\, v(Q)\big(X_k(s,a)\big)$. (11)
Synchronous Q-learning with a given number of iterations can be expressed in terms of these empirical Bellman operators, as displayed in Algorithm 1. It is easily seen that $E[\widehat{\mathcal{T}}_k(Q)] = \mathcal{T}(Q)$ and that the entrywise standard deviation of $\widehat{\mathcal{T}}_k(Q^*)$ is $\sigma(Q^*)$. This echoes the previous definition (6).
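As a concrete reference point, here is a minimal Python sketch of synchronous Q-learning driven by the 1-sample empirical Bellman operators (11). The rescaled linear step size $\lambda_k = 1/(1 + (1-\gamma)k)$ follows Wainwright (2019a,b); the exact schedule and stopping rule of Algorithm 1 may differ, so treat this as an illustration rather than the paper's algorithm.

```python
import numpy as np

def empirical_bellman(Q, r, gamma, next_states):
    """1-sample empirical Bellman operator; next_states[s, a] is one draw from P(.|s, a)."""
    return r + gamma * Q.max(axis=1)[next_states]

def synchronous_q_learning(r, P, gamma, n_iters, Q0=None, seed=0):
    """Synchronous Q-learning: one fresh generative-model sample per (s, a) per iteration."""
    rng = np.random.default_rng(seed)
    S, A = r.shape
    Q = np.zeros((S, A)) if Q0 is None else np.array(Q0, dtype=float)
    for k in range(1, n_iters + 1):
        next_states = np.array([[rng.choice(S, p=P[s, a]) for a in range(A)] for s in range(S)])
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)   # rescaled linear step size
        Q = (1 - lam) * Q + lam * empirical_bellman(Q, r, gamma, next_states)
    return Q
```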
It turns out that the algorithms we develop in the future sections will require a reasonably good initialization. Typically, such an initialization can be achieved by first running the Q-learning Algorithm 1. Therefore, it is useful to have an error bound for this Q-learning algorithm. From Wainwright, 2019a ; Wainwright, 2019b , we have the following proposition.
Proposition 4.1 (Section 4.1.2 in Wainwright, 2019b ).
For any , let be the output of . Then, there exists absolute constant so that with probability at least , we have
To use this proposition, we need the following lemma, which is a consequence of Proposition 6.1.
Lemma 1.
Under Assumption 1, the following bounds hold:
Remark.
Combining Proposition 4.1 and Lemma 1, we obtain a sample complexity bound for this Q-learning algorithm under Assumption 1. Li et al., (2021) have shown that the worst case sample complexity of the Q-learning Algorithm 1 is of order $\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^4\epsilon^2}$ uniformly over all MDP instances with discount factor $\gamma$. So, the sample complexity bound implied by Proposition 4.1 is, in general, not optimal.
4.1.2 Wainwright’s Variance Reduction
Li et al., (2021) established that the Q-learning Algorithm 1 is not minimax optimal when used to learn the optimal action-value function $Q^*$. In the pursuit of an optimal model-free algorithm, Wainwright, 2019b proposed a variance-reduced variant of Q-learning, as outlined in Algorithm 2.
(12) |
This variant has been shown to achieve the minimax optimal sample complexity for any Markov decision process (MDP) instance , subject to optimal selection of the parameters and initialization . In the subsequent sections, we explore modifications to Algorithm 2 that enable it to attain the sample complexity upper bounds presented in Table 1.
Notably, the variance-reduced Q-learning variant generally satisfies a path-wise error bound with high probability, due to the empirical Bellman operators being $\gamma$-contractions as well. This desirable characteristic makes it a versatile tool for algorithmic design.
The variance-reduced Q-learning algorithm employs an inner loop indexed by $k$ and an outer loop indexed by $l$, where each outer iteration is referred to as an epoch. Within each epoch, the inner loop executes a “recentered” version of synchronous Q-learning. By selecting a series of empirical Bellman operators with exponentially increasing sample sizes as a function of $l$, the estimators generated by the epochs achieve an exponentially decreasing error, namely an error that is halved with each epoch, with high probability.
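To make the epoch structure concrete, here is one possible reading of the recursion (12) in Python. The recentering estimate, step size, and the constant per-epoch recentering sample size are simplifying assumptions of this sketch; the paper's parameter choices (e.g., geometrically growing recentering sample sizes) differ.

```python
import numpy as np

def variance_reduced_q_learning(r, P, gamma, Q_bar, n_epochs, n_inner, n_recenter, seed=0):
    """Sketch of variance-reduced Q-learning: each epoch recenters at the anchor Q_bar
    with a batch estimate of T(Q_bar), runs recentered Q-learning steps, and then
    updates the anchor to the inner-loop output."""
    rng = np.random.default_rng(seed)
    S, A = r.shape

    def sample_next_states():  # one next-state draw per (s, a) from the simulator
        return np.array([[rng.choice(S, p=P[s, a]) for a in range(A)] for s in range(S)])

    def emp_bellman(Q, ns):    # 1-sample empirical Bellman operator, cf. (11)
        return r + gamma * Q.max(axis=1)[ns]

    for _ in range(n_epochs):
        # Monte Carlo recentering estimate of T(Q_bar) from n_recenter samples per (s, a)
        T_bar = np.mean([emp_bellman(Q_bar, sample_next_states()) for _ in range(n_recenter)], axis=0)
        Q = Q_bar.copy()
        for k in range(1, n_inner + 1):
            ns = sample_next_states()
            lam = 1.0 / (1.0 + (1.0 - gamma) * k)
            # recentered update: hat_T_k(Q) - hat_T_k(Q_bar) + tilde_T(Q_bar)
            Q = (1 - lam) * Q + lam * (emp_bellman(Q, ns) - emp_bellman(Q_bar, ns) + T_bar)
        Q_bar = Q
    return Q_bar
```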
4.2 The General Setting: Assumption 1
In this section, we analyze the convergence and worst case sample complexity bound of Algorithm 2 under the general Assumption 1, which posits that there exists one optimal policy that induces a uniformly ergodic (Doeblin) kernel.
We aim to demonstrate that by combining Algorithms 1 and 2 under the aforementioned notion of mixing, one can produce an estimator of $Q^*$ within an absolute error $\epsilon$ with high probability using at most $\widetilde{O}\big(\frac{|\mathcal{S}||\mathcal{A}|\,t_{\mathrm{minorize}}}{(1-\gamma)^2\epsilon^2}\big)$ samples. This removes a power of $\frac{1}{1-\gamma}$ from the minimax sample complexity lower and upper bounds established in Azar et al., (2013). This is quite a surprising improvement as we are only assuming that one of the optimal policies induces mixing.
The central result in this section is the following sample complexity upper bound in Theorem 2. It is an immediate consequence of Proposition 4.3.
Theorem 2.
Remark.
For $\epsilon$ in the indicated range, this sample complexity bound reduces to $\widetilde{O}\big(\frac{|\mathcal{S}||\mathcal{A}|\,t_{\mathrm{minorize}}}{(1-\gamma)^2\epsilon^2}\big)$.
To arrive at this sample complexity upper bound, we analyze the statistical properties of Algorithm 2 under Assumption 1. At a high level, an upper bound on the asymptotic variance of estimating $Q^*$ using variance-reduced Q-learning can be established using the oscillation of $v^*$. According to Proposition 6.1, the oscillation of $v^*$ can in turn be bounded by the minorization time. Since all optimal policies induce the same $v^*$, the minorization time of any optimal policy can be used to control the convergence rate.
We will motivate and state two intermediate propositions that underlie the proof of Theorem 2. The proofs and relevant developments of the key results in this section are deferred to Section B.2. An important result is the following property of Algorithm 2.
Proposition 4.2.
Given an initialization $\bar{Q}_0$ that satisfies the required error bound w.p. at least $1-\delta$, then choosing
(13) | ||||
and the total number of outer iterations
(14) |
we have that the output satisfies w.p. at least .
It should be noted that the algorithm specified in Proposition 4.2 may not be directly implementable due to the lack of a priori knowledge of the quantities appearing in the definition of the parameters in (13). Fortunately, this can be addressed by first running Algorithm 1 to obtain an initialization $\bar{Q}_0$ with sufficiently small error with high probability. By Lemma 1, this initialization cancels out the corresponding term in the numerator in (13).
To simplify notation, we define
(15) |
Concretely, we choose
(16) | ||||
with the same as in (13). Moreover, the initialization is obtained from running Algorithm 1 with
(17) |
for some sufficiently large absolute constant . By Proposition 4.1, this should guarantee w.p. at least . Therefore, by Proposition 4.2, we have the following finite time algorithmic error guarantee:
Proposition 4.3.
Remark.
Notice that this proposition allows a range of parameter choices. This allows the user to run the algorithm without knowledge of $t_{\mathrm{minorize}}$. However, to achieve the optimized complexity in Theorem 2, we need to choose the parameters as specified above.
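Schematically, the combined procedure analyzed in Proposition 4.3 warm-starts with Algorithm 1 and then runs the variance-reduced scheme. A rough sketch, reusing the two Python functions above and with placeholder sample-size arguments rather than the tuned choices in (13), (16), and (17), could look as follows.

```python
def combined_procedure(r, P, gamma, n_warm_start, n_epochs, n_inner, n_recenter):
    """Warm start with synchronous Q-learning, then refine with variance-reduced
    Q-learning. All sample-size arguments are placeholders; the tuned values
    depend on epsilon, delta, gamma, and t_minorize."""
    Q0 = synchronous_q_learning(r, P, gamma, n_iters=n_warm_start)
    return variance_reduced_q_learning(r, P, gamma, Q_bar=Q0, n_epochs=n_epochs,
                                       n_inner=n_inner, n_recenter=n_recenter)
```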
4.3 The Lipschitz Setting: Assumptions 2 and 3
In the previous section, we proved that in the general setting, the variance-reduced $Q$-learning algorithm has a worst case sample complexity upper bound with the stated scaling, uniformly over the indicated range of $\epsilon$. This is not optimal for every $\epsilon$. In particular, there are known algorithms and analyses (Azar et al., 2013; Wainwright, 2019b) with worst case sample complexity upper bound $\widetilde{O}\big(\frac{|\mathcal{S}||\mathcal{A}|}{(1-\gamma)^3\epsilon^2}\big)$. In this section, we develop a better sample complexity bound under Assumption 2 or 3. The proofs of the results in this section are provided in Section B.3.
Define the variance vector
(18) |
Here, the notation $\mathrm{diag}(M)$ maps a square matrix $M$ to a column vector containing the diagonal elements of $M$. In Khamaru et al., 2021b, the authors establish that this variance vector serves as a lower bound on the non-asymptotic minimax risk of estimating $Q^*$ for a specific MDP instance. They also prove an upper bound that matches the lower bound under certain conditions, such as the uniqueness of the optimal policy or the satisfaction of the Lipschitz condition (10) by the transition kernel. It turns out that, as a direct consequence of Corollary 6.2.1, this variance vector admits an upper bound in terms of $t_{\mathrm{minorize}}$; see Lemma 2. This will imply a worst-case sample complexity upper bound of $\widetilde{O}\big(\frac{|\mathcal{S}||\mathcal{A}|\,t_{\mathrm{minorize}}}{(1-\gamma)^2\epsilon^2}\big)$, provided that the sample size is sufficiently large.
Recall that $\Pi^*$ is the set of optimal stationary deterministic Markov policies. Define the optimality gap
The main result in this section is the following theorem.
Theorem 3.
Suppose Assumption 2 or 3 holds. Let the parameters be as in (23) and as in Theorem K2. Then, for all sufficiently small $\epsilon$, the combined algorithm achieves an estimator of $Q^*$ with absolute error $\epsilon$ in the sup norm w.p. at least $1-\delta$ using $\widetilde{O}\big(\frac{|\mathcal{S}||\mathcal{A}|\,t_{\mathrm{minorize}}}{(1-\gamma)^2\epsilon^2}\big)$
samples. Specifically, it is sufficient under Assumption 2 for
(19) |
or under Assumption 3 for
(20) |
for any .
Remark.
If we further restrict the range of $\epsilon$, the worst case sample complexity upper bound becomes $\widetilde{O}\big(\frac{|\mathcal{S}||\mathcal{A}|\,t_{\mathrm{minorize}}}{(1-\gamma)^2\epsilon^2}\big)$.
The proof of Theorem 3 is based on the following key result in Khamaru et al., 2021b :
Theorem K2 (Theorem 2 of Khamaru et al., 2021b ).
Suppose either that the optimal policy is unique or that the Lipschitz condition (10) holds. We run Algorithm 2 with initialization satisfying . Also, for sufficiently large satisfying that there exists a
-
•
In the case of unique ,
-
•
In the case of Lipschitz condition (10) hold,
where is some large absolute constant. Choose
(21) | ||||
We have that w.p. at least
(22) |
As explained in the discussion after Assumption 3 and noted by Li et al., (2023), if the optimal policy is unique with optimality gap $\Delta$, then Assumption 3 holds with a Lipschitz constant proportional to $1/\Delta$. This is consistent with the requirement in Theorem K2.
Theorem K2 suggests that the finite sample convergence rate can be controlled by the local minimax rate parameter. Observe that if the entries of the variance vector can be uniformly upper bounded in terms of $t_{\mathrm{minorize}}$ for all optimal policies, then the error bound (22) should imply the desired sample complexity upper bound. This is indeed the case if all optimal policies induce Doeblin chains with a uniform minorization time upper bound.
Lemma 2.
Suppose that for all $\pi^*\in\Pi^*$, the kernel $P_{\pi^*}$ is uniformly ergodic with
Then
To use Theorem K2, we still need an initialization $\bar{Q}_0$. As in the previous section, this can be achieved w.p. at least $1-\delta$ by running Algorithm 1 with
(23) |
for a sufficiently large constant. Then we start Algorithm 2 with this initialization and the parameters specified in Theorem K2. This gives the desired sample complexity upper bound in Theorem 3.
Remark.
This method of generating $\bar{Q}_0$, however, requires knowledge of $t_{\mathrm{minorize}}$ in order to specify the run length. To resolve this, we can instead use a parameter choice that does not depend on $t_{\mathrm{minorize}}$
to initialize $\bar{Q}_0$. This will result in the same sample complexity upper bound under an additional restriction on $\epsilon$ on top of the conditions in Theorem 3. This is less efficient in terms of the admissible range of $\epsilon$. So, taking a complexity theoretic perspective, we omit a formal presentation and proof of this initialization method.
4.4 Uniform Mixing: Assumption 4
Assuming a uniform minorization time upper bound as in Assumption 4, in this section, we will construct an algorithm and outline the proof of the following sample complexity upper bound.
Theorem 4.
Remark.
To develop this sample complexity upper bound, we highlight some preliminary observations about Algorithm 2 made by Wainwright, 2019b. Consider the inner loop update of the variance-reduced Q-learning (12). Wainwright, 2019b defines a recentered Bellman operator, and (12) can then be written as a stochastic approximation recursion for this operator. It is easy to check that the recentered operator is a $\gamma$-contraction. So, the inner loop of the variance-reduced Q-learning can be thought of as a stochastic approximation to the unique fixed point of the recentered operator.
This observation leads us to consider a perturbed reward function
(24)
Then the fixed point of the recentered operator uniquely satisfies the Bellman equation
(25)
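For readability, one consistent way to write the recentered operator, the perturbed reward, and the resulting Bellman equation is the following (this is our notation, reconstructed in the spirit of Wainwright's analysis rather than copied from the paper). Let $\bar{Q}$ denote the current epoch's anchor and $\widetilde{\mathcal{T}}_N(\bar{Q})$ the Monte Carlo recentering estimate of $\mathcal{T}(\bar{Q})$ built from $N$ samples per state-action pair. Then one may set
$\mathcal{J}_{\bar{Q}}(Q) := \mathcal{T}(Q) - \mathcal{T}(\bar{Q}) + \widetilde{\mathcal{T}}_N(\bar{Q})$ and $\widehat{r} := r + \widetilde{\mathcal{T}}_N(\bar{Q}) - \mathcal{T}(\bar{Q})$,
so that the fixed point $\widehat{Q} = \mathcal{J}_{\bar{Q}}(\widehat{Q})$ satisfies
$\widehat{Q}(s,a) = \widehat{r}(s,a) + \gamma\sum_{s'} P(s'\mid s,a)\max_{a'}\widehat{Q}(s',a')$.
Since $\mathcal{J}_{\bar{Q}}$ differs from $\mathcal{T}$ only by a constant shift, it inherits the $\gamma$-contraction property, and its fixed point solves the Bellman equation with the perturbed reward $\widehat{r}$.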
With this reformulation of the variance-reduced Q-learning recursion, we can bound the error after each epoch by the triangle inequality.
Then, by tightly controlling the two resulting error terms, we can arrive at the following proposition.
Proposition 4.4.
Assume Assumption 4 holds and that the initialization satisfies the required error bound w.p. at least $1-\delta$. Choose
(26) | ||||
Then, we have that w.p. at least .
To use Proposition 4.4, we need to produce an estimator $\bar{Q}_0$ that satisfies the required error bound w.p. at least $1-\delta$. Therefore, to achieve such an initialization efficiently, we can run the combined procedure in Proposition 4.3 with the specified error and tolerance probability. By Theorem 2, the number of samples this requires is of the order given there.
Remark.
Similar to the remark in Section 4.3, this method of generating the initialization requires knowledge of $t_{\mathrm{minorize}}$. Instead, one could run the Q-learning Algorithm 1 with a parameter choice that does not depend on $t_{\mathrm{minorize}}$
to initialize $\bar{Q}_0$, where we recall the definition in (15). This will result in a sample complexity upper bound of the same order under a more restrictive condition on $\epsilon$. This is less efficient, and the proof is omitted.
We will use the following choice of parameters:
(27) | ||||
These choices of parameters imply the following algorithmic error bound.
Proposition 4.5.
5 Minimax Lower Bounds
In this section, we show that the sample complexity dependence on $\epsilon$, $1-\gamma$, and $t_{\mathrm{minorize}}$ in Theorems 3 and 4 cannot be improved in general. We consider two notions of non-asymptotic minimax risk for learning the optimal $Q$-function in the sup norm: one localized at a hard instance, the other uniform over the class of all MDPs with discount factor $\gamma$ and a uniform minorization time upper bound (Assumption 4). We prove that both minimax risks have lower bounds supporting the claim that our sample complexities in Theorems 3 and 4 are tight.
We start off with a rigorous definition of the generator model and the class of estimators. This is necessary because the lower bounding technique needs a careful specification of the probability measure. The generator model is a sequence of product probability spaces
where $P$ is the transition kernel of the instance $\mathcal{M}$. By construction, the identity random elements on these spaces are independent across coordinates. We will denote the product measure and the corresponding expectation accordingly. An estimator is a measurable function from the sample space to $\mathbb{R}^{\mathcal{S}\times\mathcal{A}}$. For fixed $n$, this is equivalent to the procedure that generates $n$ i.i.d. samples from $P(\cdot\mid s,a)$ for each $(s,a)$ and then forms the estimator from these samples.
For fixed state and action spaces and discount factor $\gamma$, we first consider the local non-asymptotic minimax risk at instance $\mathcal{M}$ of learning the $Q^*$ function accurately in the sup norm using $n$ samples:
(28) |
where the first supremum is taken over all MDP instances with discount factor $\gamma$. This quantity measures the complexity of learning the optimal $Q$-function on either $\mathcal{M}$ or the hardest local alternative. It is studied in the convex optimization context by Cai and Low, (2015); Chatterjee et al., (2016), and adapted to the RL context by Khamaru et al., 2021a; Khamaru et al., 2021b.
The previous complexity metric embodies the idea of constructing a hard instance that is difficult to distinguish from its local alternative, bearing the spirit of the lower bounds in seminal works on MDP complexity theory (cf. Azar et al., (2013)). However, one can argue that it is more natural to define the minimax risk over the class of MDP instances satisfying Assumption 4.
Definition 4.
We define the minimax risk of learning the optimal $Q$-function over this class of MDPs as
With these definitions, we are ready to state the main results of this section.
Theorem 5.
There exists an absolute constant $c > 0$ s.t. for any $\gamma$ and $t_{\mathrm{minorize}}$, one can find an MDP instance satisfying Assumption 4 with uniform minorization time upper bound $t_{\mathrm{minorize}}$ s.t.
provided that .
Here, as we are lower bounding the complexity metric, we abuse the notation in Assumption 4 so that $t_{\mathrm{minorize}}$ refers to the uniform upper bound on the minorization times of the kernels induced by all policies of the hard instance.
As we will explain in a moment, the hard instances in Theorem 5 also facilitate lower bounding in Definition 4.
Theorem 6.
There exists an absolute constant $c > 0$ s.t. for any $\gamma$ and $t_{\mathrm{minorize}}$,
provided that .
Next, we outline the proofs of these theorems and defer the details to Section C.1 and C.2 respectively.
To begin with, we construct a family of MDP instances satisfying Assumption 4 such that the local non-asymptotic risk parameter (see (18)) at each instance is of the desired order. We claim that this is the case for the following family of hard MDP instances:
Definition 5 (Hard MDP Family).
Consider the following family of hard MDP instances with , . The transition kernel for each action is
The reward function and .
Clearly, they define an optimal policy. Observe that
So, is -Doeblin and hence -Doeblin. For simplicity, we will use . It is also easy to see that all policies will induce a -Doeblin chain; i.e. Assumption 4 holds.
This and the following Theorem 1 in Khamaru et al., 2021b would imply a complexity lower bound for learning $Q^*$ within an absolute error $\epsilon$ in the sup norm if the risk measure is the local non-asymptotic minimax risk:
Theorem K1 (Theorem 1 in Khamaru et al., 2021b ).
There exists constant s.t. for any MDP instance ,
provided that
(29) |
To prove Theorem 6, we first observe that
(30) |
Therefore, we should obtain the desired lower bound if the hard alternative instance in (28) satisfies the mixing requirement. Observe that, for large $n$, if the two instances are very different, it will be very easy to tell them apart, and hence the estimation can be done with small expected error, because both instances are known to the controller. Thus, the instances that potentially achieve the outer supremum in (28) should be close to the hard instance. Hence, a hard alternative instance for (28) should have a similar minorization time as well. Following this intuition, in the proof of Theorem 6, we construct such a hard local alternative using the techniques in the proof of Theorem K1 in Khamaru et al., 2021b. See appendix Section C.2.
6 Cumulative Reward of a -Doeblin Chain
In order to prove the matching sample complexity upper and lower bounds stated in previous sections, we need to control the moments of the discounted cumulative reward
In this section, we state some key properties of this cumulative reward and their implications when the kernel $P_\pi$ is uniformly ergodic, where $\pi$ is some stationary Markov deterministic policy. By Theorem UE, $P_\pi$ satisfies an $(m,p)$-Doeblin condition, and the minorization time $t_{\mathrm{minorize}}(P_\pi) \le m/p$. The proofs for this section are deferred to the appendix, Section D.
First, we show that for a Doeblin MDP, the oscillation of the $Q$-function is upper bounded by the minorization time.
Proposition 6.1 (Oscillation of the -function).
Suppose is -Doeblin. Then there exists constant s.t. . In particular .
Note that Proposition 6.1 can be directly applied to bound as in Lemma 1. Next, we prove a bound on the variance of the cumulative reward achieved by a -Doeblin chain.
Proposition 6.2 (Variance of the Cumulative Reward).
Suppose $P_\pi$ is $(m,p)$-Doeblin. Then the variance of the cumulative reward is bounded in terms of $m/p$, where $\sigma_\pi^2$ is defined in (5).
Proposition 6.2 and the equality
(31) |
due to Azar et al., (2013) allow us to obtain the following important upper bound:
Corollary 6.2.1.
Recall the definition of in (7). Suppose is uniformly ergodic. Then
Remark.
If we no longer assume that the reward is bounded by $1$, then the bound scales with $\|r\|_\infty$ accordingly. Also, this bound can be easily applied to model-based RL settings, and the same complexity upper bound should follow.
Acknowledgements
The material in this paper is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-20-1-0397. Additional support is gratefully acknowledged from NSF 1915967 and 2118199.
References
- Agarwal et al., (2020) Agarwal, A., Kakade, S., and Yang, L. F. (2020). Model-based reinforcement learning with a generative model is minimax optimal. In Abernethy, J. and Agarwal, S., editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 67–83. PMLR.
- Aldous et al., (1997) Aldous, D., Lovász, L., and Winkler, P. (1997). Mixing times for uniformly ergodic markov chains. Stochastic Processes and their Applications, 71(2):165–185.
- Asmussen, (2003) Asmussen, S. (2003). Applied probability and queues, volume 2. Springer.
- Athreya and Ney, (1978) Athreya, K. B. and Ney, P. (1978). A new approach to the limit theory of recurrent markov chains. Transactions of the American Mathematical Society, 245:493–501.
- Azar et al., (2013) Azar, M. G., Munos, R., and Kappen, H. J. (2013). Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Mach. Learn., 91(3):325–349.
- Bramson, (2008) Bramson, M. (2008). Stability of Queueing Networks: École d’Été de Probabilités de Saint-Flour XXXVI - 2006. Springer Berlin Heidelberg, Berlin, Heidelberg.
- Cai and Low, (2015) Cai, T. T. and Low, M. G. (2015). A framework for estimation of convex functions. Statistica Sinica, 25(2):423–456.
- Chatterjee et al., (2016) Chatterjee, S., Duchi, J. C., Lafferty, J., and Zhu, Y. (2016). Local minimax complexity of stochastic convex optimization. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.
- Chen et al., (2020) Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 8223–8234. Curran Associates, Inc.
- Crane and Lemoine, (1977) Crane, M. A. and Lemoine, A. J., editors (1977). An Introduction to the Regenerative Method for Simulation Analysis. Springer Berlin Heidelberg, Berlin, Heidelberg.
- Deng et al., (2017) Deng, Y., Bao, F., Kong, Y., Ren, Z., and Dai, Q. (2017). Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems, 28(3):653–664.
- Gilks et al., (1998) Gilks, W. R., Roberts, G. O., and Sahu, S. K. (1998). Adaptive markov chain monte carlo through regeneration. Journal of the American Statistical Association, 93(443):1045–1054.
- Glynn and Ormoneit, (2002) Glynn, P. W. and Ormoneit, D. (2002). Hoeffding’s inequality for uniformly ergodic markov chains. Statistics & Probability Letters, 56(2):143–146.
- Henderson and Glynn, (2001) Henderson, S. G. and Glynn, P. W. (2001). Regenerative steady-state simulation of discrete-event systems. ACM Trans. Model. Comput. Simul., 11(4):313–345.
- Jin and Sidford, (2020) Jin, Y. and Sidford, A. (2020). Efficiently solving MDPs with stochastic mirror descent. In III, H. D. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4890–4900. PMLR.
- Jin and Sidford, (2021) Jin, Y. and Sidford, A. (2021). Towards tight bounds on the sample complexity of average-reward mdps.
- (17) Khamaru, K., Pananjady, A., Ruan, F., Wainwright, M. J., and Jordan, M. I. (2021a). Is temporal difference learning optimal? an instance-dependent analysis. SIAM Journal on Mathematics of Data Science, 3(4):1013–1040.
- (18) Khamaru, K., Xia, E., Wainwright, M. J., and Jordan, M. I. (2021b). Instance-optimality in optimal value estimation: Adaptivity via variance-reduced q-learning.
- Kober et al., (2013) Kober, J., Bagnell, J. A., and Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274.
- Lemańczyk, (2021) Lemańczyk, M. (2021). General bernstein-like inequality for additive functionals of markov chains. Journal of Theoretical Probability, 34(3):1426–1454.
- Li et al., (2021) Li, G., Cai, C., Chen, Y., Gu, Y., Wei, Y., and Chi, Y. (2021). Is q-learning minimax optimal? a tight sample complexity analysis. arXiv preprint arXiv:2102.06548.
- Li et al., (2022) Li, G., Shi, L., Chen, Y., Chi, Y., and Wei, Y. (2022). Settling the sample complexity of model-based offline reinforcement learning.
- Li et al., (2023) Li, X., Yang, W., Liang, J., Zhang, Z., and Jordan, M. I. (2023). A statistical analysis of polyak-ruppert averaged q-learning. In Ruiz, F., Dy, J., and van de Meent, J.-W., editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 2207–2261. PMLR.
- Meyn et al., (2009) Meyn, S., Tweedie, R. L., and Glynn, P. W. (2009). Markov Chains and Stochastic Stability. Cambridge Mathematical Library. Cambridge University Press, 2 edition.
- Meyn and Down, (1994) Meyn, S. P. and Down, D. (1994). Stability of Generalized Jackson Networks. The Annals of Applied Probability, 4(1):124 – 148.
- Nummelin, (1978) Nummelin, E. (1978). A splitting technique for harris recurrent markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 43:309–318.
- Puterman, (1994) Puterman, M. (1994). Average Reward and Related Criteria, chapter 8, pages 331–440. John Wiley & Sons, Ltd.
- Sadeghi and Levine, (2016) Sadeghi, F. and Levine, S. (2016). Cad2rl: Real single-image flight without a single real image. arXiv preprint arXiv:1611.04201.
- Sidford et al., (2018) Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. (2018). Near-optimal time and sample complexities for solving markov decision processes with a generative model. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.
- Sutton and Barto, (2018) Sutton, R. and Barto, A. (2018). Reinforcement Learning, second edition: An Introduction. Adaptive Computation and Machine Learning series. MIT Press.
- (31) Wainwright, M. J. (2019a). Stochastic approximation with cone-contractive operators: Sharp $\ell_\infty$-bounds for $Q$-learning.
- (32) Wainwright, M. J. (2019b). Variance-reduced $Q$-learning is minimax optimal.
- Wang et al., (2022) Wang, J., Wang, M., and Yang, L. F. (2022). Near sample-optimal reduction-based policy learning for average reward mdp.
- Wang, (2017) Wang, M. (2017). Primal-dual learning: Sample complexity and sublinear run time for ergodic markov decision problems. ArXiv, abs/1710.06100.
- Wang et al., (2023) Wang, S., Blanchet, J., and Glynn, P. (2023). Optimal sample complexity of reinforcement learning for uniformly ergodic discounted markov decision processes.
- Watkins and Dayan, (1992) Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279–292.
Appendix A Equivalence of and
In this section, we present results and proofs that justify the claim of Theorem 1.
Recall Definition 3. For convenience, we also define the separation between consecutive possible regeneration under measure and probability as
We will drop the dependence.
Now, we proceed to show that the two mixing times $t_{\mathrm{mix}}$ and $t_{\mathrm{minorize}}$ are equivalent up to constants.
Proposition A.1.
If is uniformly ergodic with for some and , then
Proof.
For the other inequality, we follow the proof idea in Aldous et al., (1997). The following lemma holds:
Lemma 3 (Lemma 4 in Aldous et al., (1997)).
Let kernel have invariant measure . If
then there exists s.t. and for any probability measure and
Let , , then by the above lemma,
Therefore, if we define , because ,
This implies that . ∎
Next, we discuss the quantitative relationship between the following assumption:
Proposition A.2.
The following properties hold:
-
•
If has mixing time , then for any , there exists s.t. and is -Doeblin.
-
•
If is -Doeblin, then .
Proof.
A.1 Proof of Theorem 1
Proof.
So, knowing the mixing time and the “best” minorization are equivalent for a uniformly ergodic chain. Given the equivalence of $t_{\mathrm{mix}}$ and $t_{\mathrm{minorize}}$, the minorization time in the complexity results in Table 1 can be directly replaced by $t_{\mathrm{mix}}$.
Appendix B Analysis of Algorithms
In this section, we carry out the proofs that justify our sample complexity upper bounds of the algorithms motivated in Section 4.
B.1 Proof of Lemma 1
Proof.
Recall that . Then by definition, for any , there exists s.t. is -Doeblin and . By Proposition 6.1, and
Since is arbitrary, this implies the statement of the lemma. ∎
B.2 Proofs for Section 4.2: The General Setting
Let $\mathcal{G}_l$ denote the $\sigma$-field generated by the samples used until the end of the $l$-th epoch of the variance-reduced Q-learning.
Before proving Proposition 4.2, we introduce the following lemma that underlies its proof and the parameter choice.
Lemma 4.
For s.t. satisfies , then choosing
and
for some known absolute constant , we have that under the probability measure , w.p. at least .
Proof.
From Section 4.1.6 of Wainwright, 2019b , we have that for s.t. satisfies ,
w.p. at least under , where is some absolute constant. It is easy to see that indeed under the parameter choice with , the r.h.s. will be less than . ∎
B.2.1 Proof of Proposition 4.2
Proof.
We recall the conditions for in (13) and (14). Let . By the chain rule for conditional probability, for
where the last equality follows from the assumption that w.p. at least .
Let
We analyze the probability
This is because, for every , the choice of and in (13) satisfies Lemma 4 with replaced by and replaced by . Hence
w.p.1. Therefore,
(33) | ||||
To finish the proof, we consider the function
Evidently, for and , is with derivatives
So, is non-decreasing. Hence
(34) |
Note that by the choice of in (14), the assumption that implies that . Finally, combining this with estimate (34) and the high probability error bound (33) yields that . ∎
Remark.
The proof implies that the error bound holds in a stronger pathwise sense; i.e. we can conclude from the same assumptions as in Proposition 4.2 that . Moreover, the geometric progression can be replaced by other pathwise tolerance levels, which will lead to different error bounds and complexity guarantees.
B.2.2 Proof of Proposition 4.3
Proof.
Recall from Lemma 1 that under Assumption 1. By the choice of in (17) and the error high probability bound for Algorithm 1 in Proposition 4.1, we can conclude that w.p. at least
the error .
Next, we would like to apply Proposition 4.2. The previous analysis implies that we can let . Note that the assumption implies . Also, under this , the requirement in Proposition 4.2 for satisfies
Therefore, we have that for all , the parameters and in (16) will satisfy Proposition 4.2. Hence, we conclude that w.p. at least . ∎
B.2.3 Proof of Theorem 2
Proof.
Let $N$ be the total number of 1-sample Bellman operators used by the overall procedure in Proposition 4.3 with the specified parameters. The total number of samples used by the algorithm is
∎
B.3 Proofs for Section 4.3: The Lipschitz Setting
B.3.1 Proof of Lemma 2
B.3.2 Proof of Theorem 3
Proof.
From Lemma 2, we have that if
then the r.h.s. of (22) is less than . Here are constants in Theorem K2. So, for , we have that by Theorem K2, , provided that satisfies the requirement of Theorem K2. This happens when satisfies the requirement in Theorem 3.
Indeed, under Assumption 2, if (19) holds, then
the stated condition is satisfied, as required in Theorem 3. Moreover, under Assumption 3, if (20) holds,
the condition required in Theorem 3 is satisfied as well.
Now, consider the error probability of the combined procedure. We have
By replacing with in Theorem K2, we conclude that it suffices to choose
The total number of samples used by the combined algorithm, therefore, is
This implies the claim of Theorem 3. ∎
B.4 Proofs for Section 4.4: Uniform Mixing
Recall the definition of and in equation (24) and (25) respectively. We denote an optimal policy for by . Before we prove Proposition 4.4, we introduce the following lemmas.
Lemma 5 (Lemma 5 in Wainwright, 2019b ).
For , w.p. at least under
for some absolute constants .
Lemma 6 (Lemma 3 in Wainwright, 2019b ).
Lemma 7.
On , for sufficiently large absolute constants ,
w.p. at least under probability measure , provided that .
B.4.1 Proof of Lemma 7
Proof.
By Lemma 5, we have that on under w.p. at least
(35) |
Also, By Lemma 6 in Wainwright, 2019b ,
(36) |
Here, $|\cdot|$ denotes the entrywise absolute value, and the inequalities hold entrywise. Let $\mathcal{E}$ be the set on which (35) holds. Then on $\mathcal{E}$, the vector satisfies
Moreover, let
Let . By construction and the optimality of , is the variance of the 1-sample Bellman operator associated with the reward (c.f. definition (6) and (7)). Therefore, we have that by Corollary 6.2.1,
(37) | ||||
where we note that by the definition of in Assumption 4. For , by the assumption on and
So, on , the error
where follows from (35), replaces with the upper bound (37). Therefore, combining these with (36), we have that w.p. at least
for some large absolute constant . This implies the claimed result. ∎
B.4.2 Proof of Proposition 4.4
Proof.
By the choice of with sufficiently large
Moreover, satisfies the assumption of Lemma 7 for every . Therefore, Lemma 7 implies that on w.p. at least under
Observe that the choice of in (27) satisfies (26). Hence, by Lemma 6, 7, and the union bound, on w.p. at least under ,
where the last inequality follows from the choice of . Therefore, repeating the proof of Proposition 4.2 with allow us to conclude that w.p. at least . Note that the assumption implies .
As in the previous remark following Proposition 4.2, the event holds w.p. at least . ∎
B.4.3 Proof of Proposition 4.5
B.4.4 Proof of Theorem 4
Appendix C Proofs of the Minimax Lower Bounds
C.1 Proof of Theorem 5
Proof.
By Theorem K1, it suffices to show that for any and MDP instance satisfies
(38) |
We compute the minimax risk parameter for
We show that for , , and . Define the function
It suffices to prove that the function is uniformly bounded away from 0 in the region of interest.
We compute the derivative w.r.t. ,
Note that . Moreover, let
For and , we can bound
So, for and , which implies that is non-decreasing. Recall that ; i.e. . So, for , , and
So, the constant . ∎
C.2 Proof of Theorem 6
Proof.
By (30), we have that for any
The rest of the proof directly follows from that of Theorem 1 in Khamaru et al., 2021b (Theorem K1). Here we sketch the key steps and the construction of the instance .
First, the proof in Khamaru et al., 2021b uses the standard reduction to testing and Le Cam's lower bound to conclude that for all
where is the Hellinger distance. Therefore, if and s.t. .
Next, we use the constructed in Lemma 2 of Khamaru et al., 2021b . Note that , any stationary deterministic Markov policy is optimal, induce the same and . Let , , and s.t. . Define having the same reward as and
Lemma 2 and 3 of Khamaru et al., 2021b and the proof of Theorem 5 imply that is a valid transition kernel s.t. and
when satisfies (29), which is implied by the assumption. Therefore, it is left to show that . To do this, it is sufficient to show that is -Doeblin for all .
Recall the definition of , for every policy ,
Moreover,
By (38), and the choice of ,
where . This completes the proof. ∎
Appendix D Proofs for Section 6
D.1 Proof of Proposition 6.1
Proof.
Recall that $Q^\pi$ is the state-action value function under the policy $\pi$:
where we use the Markov property on the bounded functional . Now
The last conditional expectation is a bounded function . Recall that is the first regeneration time of the split chain. Then
where we can use the strong Markov property by the boundedness of ; and the split chain’s distribution at regeneration time is . Note that the second term is constant. So, if we let
then can be written as
Since ,
Notice that
(39) |
Therefore, we conclude that . Also recall the definition of
where we used that and . ∎
D.2 Proof of Proposition 6.2
Proof.
Define . By uniform regeneration, where Geo i.i.d.. So,
Next, we expand the variance:
We split the second term using the regeneration cycles. Let , and
Note that
(40) | ||||
Recall that the sequence of random elements are 1-dependent. In particular . Then
The first term can be bounded by the second moment
where the last line follows from (39). Similarly, the third term can be bounded by the variance.
For the middle term,
We will defer the proof of the following lemma:
Lemma 8.
D.2.1 Proof of Lemma 8
Proof.
We simplify the notation by considering a new probability space supporting a cycle of the split chain starting from initial and cycle length . We use the property of the split chain: Let be an independent process on this probability space with skeleton having transition kernel and interpolated by . Then, the random element has the same distribution as .
Write
First, we handle the last -segment.
for some non-negative and deterministic function . Here we used the property of the split chain: given that the coin toss is successful and the current state , the path is generated conditionally on the independently sampled and . Similarly,
Therefore,
Define and the bounded functional
Notice that, is non-decreasing and is decreasing. Let Geo be independent, then consider
where the last equality follows from, and . ∎