\MHInternalSyntaxOn\MHInternalSyntaxOff^affil0^affil0affiliationtext: Department of Management Science and Engineering
Stanford University

Optimal Sample Complexity of Reinforcement Learning for
Mixing Discounted Markov Decision Processes

Shengbo Wang Jose Blanchet Peter Glynn

Abstract

We consider the optimal sample complexity theory of tabular reinforcement learning (RL) for maximizing the infinite horizon discounted reward in a Markov decision process (MDP). Optimal worst-case complexity results have been developed for tabular RL problems in this setting, leading to a sample complexity dependence on $\gamma$ and $\epsilon$ of the form $\tilde{\Theta}((1-\gamma)^{-3}\epsilon^{-2})$ , where $\gamma$ denotes the discount factor and $\epsilon$ is the solution error tolerance. However, in many applications of interest, the optimal policy (or all policies) induces mixing. We establish that in such settings, the optimal sample complexity dependence is $\tilde{\Theta}({t_{\mathrm{mix}}}(1-\gamma)^{-2}\epsilon^{-2})$ , where ${t_{\mathrm{mix}}}$ is the total variation mixing time. Our analysis is grounded in regeneration-type ideas, which we believe are of independent interest, as they can be used to study RL problems for general state space MDPs.

1 Introduction

Reinforcement learning (RL) (Sutton and Barto,, 2018) has witnessed substantial empirical successes in a wide range of applications, including robotics (Kober et al.,, 2013), computer vision (Sadeghi and Levine,, 2016), and finance (Deng et al.,, 2017). This has sparked substantial research of RL theory and applications in operations research and management sciences. This paper provides a theoretical contribution to this area by considering a tabular RL environment in which a controller wishes to maximize an infinite horizon discounted reward governed by a Markov decision process (MDP). In many applications, due to engineering and managerial considerations, the underlying MDP is naturally constrained to an environment in which good policies must be stable, c.f. Bramson, (2008). In these settings, the controlled Markov chain induced by a reasonable policy will converge in distribution to a unique steady state behavior, regardless of the initial condition. This phenomenon is known as mixing. In cases where the mixing rate of a system is rapid, a finite-time observation of the state process can provide a more accurate statistical depiction of its long-term behavior. Consequently, it is reasonable to expect that for such systems, the development of a lower-complexity algorithm for policy learning is feasible, even in the presence of small discount rates, provided that an optimal policy displays fast mixing. This intuition serves as the motivation and driving force for this paper.

In addition, an optimal algorithm and analysis for discounted mixing MDPs can serve as valuable algorithmic and theoretical tools for achieving optimal sample complexity results for long-run average reward MDPs through the reduction method proposed in Jin and Sidford, (2021). While the worst case sample complexities of discounted tabular RL algorithms have been studied extensively in recent years (Azar et al.,, 2013; Agarwal et al.,, 2020; Li et al.,, 2022), we are unaware of a complexity theory and algorithm that is optimized for discounted mixing MDPs. Therefore, it is of significant practical and theoretical value to develop a satisfactory complexity theory for discounted mixing MDPs.

This paper contributes to the existing literature in the following ways. First, we adapt the regeneration idea and the split chain technique (Athreya and Ney,, 1978; Nummelin,, 1978) for analyzing stable Markov chain to the RL context. This framework naturally leads to measuring the mixing rate of the system by the minorization time (Definition 3), which is the expected amount of time between successive Markov chain regenerations. The drift and minorization formulation have been widely used in modeling and analyzing stable queuing systems (Meyn and Down,, 1994; Bramson,, 2008). In Theorem 1, we prove that in the setting of uniformly ergodic Markov chains with finite state space, the minorization time is equivalent to the total variation mixing time that is used in the sample complexity of mixing MDP literature (Wang,, 2017; Jin and Sidford,, 2021). Moreover, this formulation directly generalizes to infinite state spaces, foreshadowing a generalization our results to general state space MDPs.

Secondly, we establish algorithms and worst case sample complexity upper bounds under three different mixing assumptions using Wainwright’s variance-reduced Q-learning (Wainwright, 2019b, ) algorithmic framework. We begin with a general setting where there exists an optimal policy such that an upper bound on the minorization time is available. Next, we require that there is a uniform upper bound on the minorization times of the class of optimal policies. Finally, we consider the case where the minorization time of all policy has a uniform upper bound. Our worst case sample complexity guarantees in these three contexts are summarized in Table 1. Note that in all three cases the dependence on the effective horizon is $(1-\gamma)^{-2}$ .

Thirdly, we introduce the non-asymptotic minimax risk of the class of MDPs with a uniform minorization time upper bound (Definition 4). We compare this definition of minimax risk to the instance dependent version in Khamaru et al., 2021b and find ours easier to interpret. Moreover, we prove the same lower bounds in Table 1 for these two definitions of the non-asymptotic minimax risks by constructing a family of hard MDP instances (Definition 5) indexed by the discount and the minorization time. This allows us to conclude that our sample complexity upper bounds are optimal up to log factors.

Mixing Assumption	Sample Complexity	Relevant Theorems
One optimal policy (Assumption 1)	$\tilde{O}(\|S\|\|A\|{t_{\mathrm{mix}}}^{2}(1-\gamma)^{-2}\epsilon^{-2})$	Theorem 2
All optimal policies (Assumption 2, 3)	$\tilde{O}(\|S\|\|A\|{t_{\mathrm{mix}}}(1-\gamma)^{-2}\epsilon^{-2})$	Theorem 3
All policies (Assumption 4)	$\tilde{O}(\|S\|\|A\|{t_{\mathrm{mix}}}(1-\gamma)^{-2}\epsilon^{-2})$	Theorem 4
Lower Bound: $\|S\|\|A\|=O(1)$	$\Omega({t_{\mathrm{mix}}}(1-\gamma)^{-2}\epsilon^{-2})$	Theorems 5, 6

Table 1: Summary of sample complexity upper bounds, where

{t_{\mathrm{mix}}}\leq(1-\gamma)^{-1}

is the total variation mixing time for uniformly egodic chains, c.f. Theorem 1.

In this paper, we assume the availability of a simulator, also known as a generative model, that can generate independent state transitions for any input state-action pair. Given the significant memory and computational demands associated with vast state and action spaces, we adopt a model-free approach in which the algorithm only maintains a state-action value function, or $q$ -function. Our learning objective is to achieve an estimator of the optimal $q^{*}$ function within $\epsilon$ absolute error in the sup norm. Our algorithm and analysis achieve the optimal non-asymptotic minimax complexity in this regard. Notably, the variance bounds established in this work can be readily applied to the analysis of model-based algorithms, such as Azar et al., (2013), and the same optimal sample complexity upper bounds as in this work should follow.

1.1 Literature Review

Markov Stability and Regeneration: There is a rich literature on the stability of Markovian stochastic systems using the regeneration idea (Athreya and Ney,, 1978; Nummelin,, 1978; Meyn and Down,, 1994; Bramson,, 2008; Meyn et al.,, 2009). It has also been a important theoretical tool for design and analysis of simulation (Crane and Lemoine,, 1977; Henderson and Glynn,, 2001), statistics (Gilks et al.,, 1998), and machine learning (Glynn and Ormoneit,, 2002; Lemańczyk,, 2021) procedures. We provide a brief review of this literature since we study mixing behavior of MDPs using these techniques.

Sample Complexity of Discounted Tabular RL: The worst case sample complexity theory of tabular RL has been extensively studied in recent years. There are two types of modeling environments that inspires the model-base and model-free approaches to RL algorithmic designs. In a model-base approach, the controller tries to accumulate a data set to construct a empirical model of the underlying MDP and solve it by variants of the dynamic programming principle. Azar et al., (2013); Sidford et al., (2018); Agarwal et al., (2020); Li et al., (2022) propose algorithms and prove optimal upper bounds (the matching lower bound is proved in Azar et al., (2013)) $\tilde{\Theta}(|S||A|(1-\gamma)^{-3}\epsilon^{-2})$ of the sample complexity to achieve $\epsilon$ error in using the model-based approach. On the other hand, the model-free approach only maintains lower dimensional statistics of the transition data, and iterative update them. The celebrated Q-learning (Watkins and Dayan,, 1992) and its generalizations are examples model-free algorithms. Li et al., (2021) shows that the Q-learning have a minimax sample complexity $\tilde{\Theta}(|S||A|(1-\gamma)^{-4}\epsilon^{-2})$ . Nevertheless, Wainwright, 2019b proposes a variance-reduced variants of the $Q$ -learning that achieves the aforementioned model-based sample complexity lower bound $\tilde{\Theta}(|S||A|(1-\gamma)^{-3}\epsilon^{-2})$ . Recent advances in sample complexity theory of Q-learning and its variants are propelled by the breakthroughs in finite time analysis of stochastic approximation (SA) algorithms. Wainwright, 2019a proves a sample path bound for the SA recursion where the random operator (as oppose to the expectation) is monotone and a contraction. These strong assumptions yield path-wise error bounds that enable variance reduction techniques that help to achieve optimal complexity in Wainwright, 2019b . In comparison, Chen et al., (2020) establishes finite sample guarantees of SA only under a second moment bound on the martingale difference noise sequence.

The worst case analysis provides guarantees on the convergence rate across all instances of $\gamma$ -discounted MDPs. Notably, instances that reach the complexity lower bound must involve a transition kernel and reward function that depends upon $\gamma$ . In contrast, Khamaru et al., 2021b delve into instance-dependent settings, where the transition kernel and reward function remain fixed, yielding matching sample complexity upper and lower bounds. Furthermore, Wang et al., (2023) explore an intermediate scenario where MDPs are assumed to possess an upper bound on their mixing time. This intermediate setting holds particular significance to this paper’s main objective, as elaborated upon in the introduction.

Sample Complexity of Average Reward RL: Mixing MDPs arise naturally in RL settings where the controller’s objective is to maximize the long run average reward per unit of time achieved by the control policy. Wang, (2017); Jin and Sidford, (2020) propose algorithms to solve the policy learning problem directly from the average reward MDP model, achieving a complexity dependence $\tilde{O}(|S||A|{t_{\mathrm{mix}}}^{2}\epsilon^{-2})$ . On the other hand, Jin and Sidford, (2021); Wang et al., (2022) use a discounted MDP with large enough $\gamma$ to approximate the average reward MDP and prove worst case complexity bounds $\tilde{O}(|S||A|{t_{\mathrm{mix}}}\epsilon^{-3})$ . Notably, the contemporary work Wang et al., (2022) only makes assumption on the optimal $q$ -function, which is equivalent to assuming an upper bound on ${t_{\mathrm{mix}}}$ of one of the optimal policy in the worst case. Whereas the other aforementioned literature assume a uniform bound on ${t_{\mathrm{mix}}}$ over all policy. Our results in the discounted case echo this behavior: only assuming mixing of one of the optimal policy will lead to a reduction in the complexity.

2 Markov Decision Processes: Definitions

Let $\mathcal{M}=(S,A,r,P,\gamma)$ be an MDP, where $S$ and $A$ are finite state and action spaces, $r:|S|\times|A|\rightarrow[0,1]$ the reward function, $P:|S|\times|A|\rightarrow\mathcal{P}(S,2^{S})$ the transition kernel. $\gamma\in(0,1)$ is the discount factor. We assume that $S,A,r,\gamma$ are known, and we can draw samples from the transition measures $\left\{{P_{s,a},s\in S,a\in A}\right\}$ independently through a sampler (generative model).

Let $\Omega=(|S|\times|A|)^{\mathbb{Z}_{\geq 0}}$ and let $\mathcal{F}$ be the product $\sigma$ -field be the underlying measureable space. Define the stochastic process $\left\{{(S_{t},A_{t}),{t\geq 0}}\right\}$ by the point evaluation $S_{t}(\omega)=s_{t},A_{t}(\omega)=a_{t}$ for all $t\geq 0$ for any $\omega=(s_{0},a_{0},s_{1},a_{1},\dots)\in\Omega$ . At time $t\geq 0$ and state $s_{t}\in S$ , if action $a_{t}\in A$ is chosen, the decision maker will receive a reward $r(s,a)$ , and then the law of the subsequent state is determined by the transition kernel $\mathcal{L}(S_{t+1}|S_{0}=s_{0},\dots,S_{t}=s_{t},A_{t}=a_{t})=P_{s_{t},a_{t}}(\cdot)$ . It is well known that to achieve optimal decision making in the context of infinite horizon discounted MDPs (to be introduced), it suffices to consider policies that are stationary, Markov, and deterministic. Therefore, we will restrict ourselves to consider only this class of policies.

Let $\Pi$ denote the class of stationary Markov deterministic policies; i.e. any $\pi\in\Pi$ can be seen as a function $\pi:S\rightarrow A$ . For $\pi\in\Pi$ and initial distribution $\mu\in\mathcal{P}(S,2^{S})$ , there is a probability measure $Q_{\mu}^{\pi}$ on the product space s.t. the chain $\left\{{(S_{t},A_{t}),t\geq 0}\right\}$ has finite dimensional distributions

Q^{\pi}_{\mu}(S_{0}=s_{0},A_{0}=a_{0},\dots,A_{t}=a_{t})=\mu(s_{0})P_{s_{0},\pi(s_{0})}(s_{1})P_{s_{1},\pi(s_{1})}(s_{2})\dots P_{s_{t-1},\pi(s_{t-1})}(s_{t})\mathds{1}\left\{{\pi(s_{i})=a_{i},i\leq t}\right\}.

Note that this also implies that $\left\{{S_{t},t\geq 0}\right\}$ is a Markov chain under $Q_{\mu}^{\pi}$ . Let $E_{\mu}^{\pi}$ denotes the expectation under under $Q_{\mu}^{\pi}$ . For $\mu$ with full support $S$ , the value function $v^{\pi}(s)$ is defined via

v^{\pi}(s):=E_{\mu}^{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r(S_{t},A_{t})\middle|S_{0}=s\right],

while the optimal value function is given by

v^{*}(s):=\max_{\pi\in\Pi}v^{\pi}(s),

(1)

$\forall s\in S$ . It can be seen as a vector $v^{*}\in\mathbb{R}^{|S|}$ . Let $\Pi^{*}$ denote the set of all optimal stationary Markov deterministic policies; i.e. $\pi\in\Pi$ that achieves the maximum in (1).

Let $P_{s,a}[v]$ denote the sum $\sum_{s^{\prime}\in S}P_{s,a}(s^{\prime})v(s^{\prime})$ . It is well known that the optimal value function is the unique solution of the following Bellman equation:

v^{*}(s)=\max_{a\in A}\left(r(s,a)+\gamma P_{s,a}[v^{*}]\right).

(2)

To learn an optimal policy $\pi^{*}\in\Pi^{*}$ , it is useful to introduce the $q$ -function: For any policy $\pi\in\Pi$ ,

q^{\pi}(s,a):=r(s,a)+\gamma P_{s,a}[v^{\pi}],

which can be identified with a vector $q^{\pi}\in\mathbb{R}^{|S|\times|A|}$ . Define the notation $v(q)(s)=\max_{a\in A}q(s,a)$ . When $\pi=\pi^{*}$ achieves the maximum in (1), we denote the corresponding optimal $q$ -function $q^{\pi^{*}}$ by $q^{*}$ and hence $v(q^{*})=v^{*}$ . Note that $q^{*}$ is the unique solution to the Bellman equation for the $q$ -function:

$\displaystyle q^{*}(s,a)$	$\displaystyle=r(s,a)+\gamma P_{s,a}\left[\max_{b\in A}q^{*}(\cdot,b)\right]$	(3)
	$\displaystyle=r(s,a)+\gamma P_{s,a}\left[v(q^{*})\right]$
	$\displaystyle=\mathcal{T}(q^{*})(s,a)$

where the mapping

\mathcal{T}:q(s,a)\rightarrow r(s,a)+\gamma P_{s,a}\left[v(q)\right],\forall(s,a)\in S\times A

(4)

is known as the Bellman operator for the $q$ -function. Here we list two elementary properties of the Bellman operator and the $q$ -function. The proofs can be found in Puterman, (1994). First, $\mathcal{T}$ is a $\gamma$ -contraction in the $\|\cdot\|_{\infty}$ -norm; i.e. $\|\mathcal{T}(q)-\mathcal{T}(q^{\prime})\|_{\infty}\leq\gamma\|q-q^{\prime}\|$ for all $q,q^{\prime}\in\mathbb{R}^{|S|\times|A|}$ . Then, a consequence of $\mathcal{T}$ being a $\gamma$ -contraction is

\|q^{*}\|_{\infty}\leq 1/(1-\gamma).

A greedy policy from a $q$ -function is defined as

\pi(q)(s)=\underset{a\in A}{\operatorname{arg}\,\operatorname{max}}\;q(s,a),

where one action is picked if there is a tie. It is easy to see that $\Pi^{*}\ni\pi^{*}=\arg\max_{a\in A}q^{*}(\cdot,a)=\pi(q^{*})$ . Therefore, policy learning can be achieved if we can learn a good estimate of $q^{*}$ . For $\pi\in\Pi$ , we use $P^{\pi}:\mathbb{R}^{|S||A|}\rightarrow\mathbb{R}^{|S||A|}$ to denote the linear operator

P^{\pi}:q(s,a)\rightarrow\sum_{s^{\prime}\in S}P_{s,a}(s^{\prime})q(s^{\prime},\pi(s)),\forall(s,a)\in S\times A.

Under suitable mixing conditions as presented in Puterman, (1994), it is useful to look at the span semi-norm of the value and $q$ -functions. For vector $v\in V=\mathbb{R}^{d}$ , let $e$ be the vector with all entries equal to one and define

	$\displaystyle\left\|v\right\|_{\mathrm{span}}$	$\displaystyle:=\inf_{c\in\mathbb{R}}\\|v-ce\\|_{\infty}$
		$\displaystyle=\max_{1\leq i\leq d}v_{i}-\min_{1\leq i\leq d}v_{i}.$

Note that $\left|\cdot\right|_{\mathrm{span}}$ is not a norm because $\left|e\right|_{\mathrm{span}}=0$ , but it is an induced norm on the quotient space $V\backslash\left\{{ce:c\in\mathbb{R}}\right\}$ . We also note that $\left|v\right|_{\mathrm{span}}\leq 2\|v\|_{\infty}$ for all $v\in V$ . However, $\left|\cdot\right|_{\mathrm{span}}$ and $\left\|\cdot\right\|_{\infty}$ are not equivalent on $\mathbb{R}^{d}$ .

Naturally, the variance of the cumulative reward induced by the optimal policy plays a key role in controlling the sample complexity of solving the MDP. Given a stationary deterministic Markov policy $\pi$ , let us define and denote the variance of the cumulative reward by

\Psi^{\pi}(s,a):=\mathrm{Var}_{s,a}^{\pi}\left(\sum_{k=0}^{\infty}\gamma^{k}r_{\pi}(S_{k})\right).

(5)

Another standard deviation parameter to our interest is

\sigma(q^{\pi})(s,a):=\gamma\sqrt{(P_{s,a}[v(q^{\pi})^{2}]-P_{s,a}[v(q^{\pi})]^{2})}.

(6)

We call this object the “1-sample standard deviation” of the Bellman operator. It is the sampling standard deviation of the so-called 1-sample Bellman operator, which we will introduce in equation (11) in the algorithm analysis section. To analyze $\Psi^{\pi}$ and $\sigma(q^{\pi})$ , it’s useful to define

\sigma^{\pi}(s,a):=\gamma\sqrt{(P_{s,a}^{\pi}[(q^{\pi})^{2}]-P_{s,a}^{\pi}[q^{\pi}]^{2})}.

(7)

Note that for any optimal policy $\pi^{*}$ , $v(q^{*})=v^{*}$ . So, $\sigma(q^{*})=\sigma^{\pi^{*}}$ .

3 Uniformly Ergodic Markov Chains and Mixing Metrics

Let $\mathcal{M}$ be a MDP. Given a stationary Markov deterministic policy $\pi\in\Pi$ , the state process $\left\{{S_{t},t\geq 0}\right\}$ will have transition kernel $P_{\pi}(s,s^{\prime})=P_{s,\pi(s)}(s^{\prime})$ . In various applications, the state process induced by a reasonable policy will converge to a unique steady state regardless of the initial condition. In this section, we will formally characterize such behavior by introducing the notion of uniform ergodicity. A uniformly ergodic Markov chain is one for which the marginal distribution of $S_{t}$ converges to its stationary distribution in total variation distance with uniform rate across all initial conditions. Recall the total variation distance between two probability measures $\mu,\nu$ on $2^{S}$ is

\left\|\mu-\nu\right\|_{\mathrm{TV}}:=\sup_{A\in 2^{S}}|\mu(A)-\nu(A)|.

We use the norm notation here because it is equivalent to the $l_{1}$ norm if we view $\mu$ and $\nu$ as vectors in $\mathbb{R}^{|S|}$ ; i.e. $\left\|\mu-\nu\right\|_{\mathrm{TV}}=2^{-1}\|\mu-\nu\|_{1}$ .

Uniform ergodicity is equivalent to the the transition kernel satisfying the Doeblin condition (Meyn et al.,, 2009). This equivalence permits the application of the split chain representation (Athreya and Ney,, 1978; Nummelin,, 1978) of the Markov chain, which exhibits favorable independence properties, as illustrated in Section 3.2. Thus, the Doeblin condition will serve as our principal assumption and theoretical tool for the analysis of the sample complexity of mixing MDPs. To characterize the rate of mixing of a uniformly ergodic Markov chain, two metrics, namely ${t_{\mathrm{mix}}}$ and $t_{\mathrm{minorize}}$ , emerge naturally from the total variation and Doeblin condition characterizations. We will introduce these metrics in Section 3.3. While ${t_{\mathrm{mix}}}$ is more commonly utilized in the literature on MDP complexity (Wang,, 2017; Jin and Sidford,, 2020, 2021), we have found $t_{\mathrm{minorize}}$ to be a more suitable metric for our objectives. To facilitate the comparison of our complexity theories with those in the literature, we establish the equivalence of ${t_{\mathrm{mix}}}$ and $t_{\mathrm{minorize}}$ up to a constant factor in Section 3.3.

3.1 Uniform Ergodicity

Definition 1 (Uniform Ergodicity).

A Markov chain $\left\{{S_{t},t\geq 0}\right\}$ with transition kernel $P$ is called uniformly ergodic if there exists probability measure $\eta$ for which

\sup_{s\in S}\left\|P^{n}(s,\cdot)-\eta\right\|_{\mathrm{TV}}\rightarrow 0

Note that $\eta$ must be the unique stationary distribution of $P$ : By uniform ergodicity $(\eta P^{n+1})(s)\rightarrow\eta(s)$ and $(\eta P^{n}P)(s)\rightarrow(\eta P)(s)$ for all $s\in S$ as $n\rightarrow\infty$ ; i.e. $\eta=\eta P$ . Let $\nu$ be any stationary distribution of $P$ . Then, $\nu P^{n}=\nu$ and $(\nu P^{n})(s)-\eta(s)\rightarrow 0$ for all $s\in S$ ; i.e. $\nu=\eta$ .

Definition 2 (Doeblin Minorization Condition).

A transition kernel $P$ satisfies the $(m,p)$ -Doeblin condition for some $m\in\mathbb{Z}_{>0}$ and $p\in(0,1]$ if there exists a probability measure $\psi$ and stochastic kernel $R$ s.t.

P^{m}(s,s^{\prime})=p\psi(s^{\prime})+(1-p)R(s,s^{\prime}).

(8)

for all $s,s^{\prime}\in S$ . The measure $\psi$ is called the minorization measure and the stochastic kernel $R$ is call the residual kernel.

Note that if $P$ is $(m,p)$ -Doeblin, then it must be $(m,q)$ -Doeblin for any $0<q\leq p$ :

P^{m}=p\psi+(1-p)R=q\psi+(1-p)R+(p-q)1\otimes\psi

The following result explains the equivalence between the Doeblin condition, uniform ergodicity, and uniform geometric ergodicity.

Theorem UE (Theorem 16.0.2 in Meyn et al., (2009)).

The following statements are equivalent:

1.

There exists $n<\infty$ s.t.

$\sup_{s\in S}\left\|P^{n}(s,\cdot)-\eta(\cdot)\right\|_{\mathrm{TV}}\leq\frac{1}{4}$
2.

$\left\{{S_{t},t\geq 0}\right\}$ is uniformly ergodic.
3.

There exists $r>1$ and $c<\infty$ s.t. for all $n$

$\sup_{s\in S}\left\|P^{n}(x,\cdot)-\eta(\cdot)\right\|_{\mathrm{TV}}\leq cr^{-n}.$
4.

$P$ is $(m,p)$ -Doeblin for some $m\in\mathbb{Z}_{>0}$ and $p\in(0,1]$ .

Moreover, we have

\sup_{s\in S}\left\|P^{n}(x,\cdot)-\eta(\cdot)\right\|_{\mathrm{TV}}\leq 2(1-p)^{-\left\lfloor{n/m}\right\rfloor}.

(9)

Theorem UE implies the existence of a Doeblin minorization structure of any uniformly ergodic Markov kernel. As we will discuss in the next section, the Doeblin condition will allow us to use the split chain technique, which underlies our analysis of the sample complexity of mixing MDPs.

3.2 Doeblin Condition, Split Chain, and Regeneration

In this section, we introduce the split chain of a uniformly ergodic Chain $\left\{{S_{t},t\geq 0}\right\}$ . Athreya and Ney, (1978) and Nummelin, (1978) independently realized that a Markov chain with transition kernel admitting a decomposition of the form (8) can be simulated as follow:

1.

Start from $t=0$ , and generate $X_{0}\sim\mu$ , the initial distribution.
2.

At time $t$ , flip a coin $B_{t+m}$ with success probability $P(B_{t+m}=1)=p$ .
3.

If $B_{t+m}=1$ , generate $S^{*}_{t+m}\sim\psi$ ; if not, generate $S^{*}_{t+m}\sim R(S^{*}_{t},\cdot)$ .
4.

Generate $S^{*}_{t+1},\dots,S^{*}_{t+m-1}$ from the conditional measure $P(\cdot|S^{*}_{t},S^{*}_{t+m})$ .
5.

Set $t\leftarrow t+m$ and go back to step 2.

The process $\left\{{S^{*}_{t},t\geq 0}\right\}$ is known as the split chain. It can be shown that the original process $\left\{{S_{t},t\geq 0}\right\}$ has the same law as $\left\{{S_{t}^{*},t\geq 0}\right\}$ . Define the wide sense regeneration times $\tau_{0}=0$ , $\tau_{j+1}=\inf\left\{{t>\tau_{j}:B_{t}=1}\right\}$ and (wide sense) regeneration cycles $W_{j+1}=(S_{\tau_{j}}^{*},\dots,S_{\tau_{j+1}-1}^{*},T_{j+1})$ where $T_{j+1}=\tau_{j+1}-\tau_{j}$ for $j\geq 1$ . This construction has the following implications:

•

$S_{\tau_{j}}^{*}\sim\psi$ and $T_{j}\overset{d}{=}mG$ where $G\sim\mathrm{Geom}(p)$ for all $j\geq 1$ .
•

The regeneration cycles are 1-dependent in the sense that $\left\{{W_{j}:1\leq j\leq k}\right\}$ and $\left\{{W_{j}:k+2\leq j}\right\}$ are independent for all $k\geq 1$ .
•

The regeneration cycles $\left\{{W_{j},j\geq 2}\right\}$ are identically distributed.

See Asmussen et al., (2003) for a detailed exposition. Because of these properties, the process $S^{*}$ is easier to work with. So, moving forward, whenever we analyze a $(m,p)$ -Doeblin kernel $P$ , the process $\left\{{S_{t},t\geq 0}\right\}$ will refer to the split chain. Also, since $\left\{{\tau_{j}}\right\}$ are not stopping times w.r.t. the natural filtration generated by $\left\{{S_{t}}\right\}$ , we define $\mathcal{F}_{t}=\sigma(\left\{{S_{j}:j\leq t}\right\},\left\{{B_{mj}:mj\leq t}\right\})$ . Clearly, $\left\{{\tau_{j}}\right\}$ are $\mathcal{F}_{t}$ -stopping times and $\left\{{S_{t}}\right\}$ is Markov w.r.t. $\mathcal{F}_{t}$ as well.

3.3 Two Mixing Metrics and Their Equivalence

The favorable characteristics of the split chain will facilitate the analysis of the behavior of the MDP under mixing policies. Consequently, we aim to study the minimax sample complexity of policy learning for an MDP by postulating the existence of a $(m,p)$ -Doeblin condition. This naturally leads to the quantification of the mixing intensity of the MDP by means of the minorization time; see below. However, in the literature of sample complexity theory for mixing MDPs (Wang,, 2017; Jin and Sidford,, 2020, 2021), the intensity of mixing is usually quantified in terms of the mixing time. In this section, we formally introduce these two measures of mixing. Furthermore, we establish that, for a uniformly ergodic Markov chain, ${t_{\mathrm{mix}}}$ and $t_{\mathrm{minorize}}$ are equivalent up to a multiplicative constant. This equivalence enables a direct comparison of our complexity outcomes, which use $t_{\mathrm{minorize}}$ , with the existing theories.

Let $P$ be a uniformly ergodic stochastic kernel with stationary distribution $\eta$ . By Theorem UE, there exists $(m,p,\psi,R)$ s.t. $P^{m}=p\psi+(1-p)R$ .

Definition 3 (Mixing and Minorization Times).

Define the mixing time

{t_{\mathrm{mix}}}(P):=\inf\left\{{t\geq 1:\sup_{s\in S}\left\|P^{t}(s,\cdot)-\eta(\cdot)\right\|_{\mathrm{TV}}\leq\frac{1}{4}}\right\},

and the minorization time

t_{\mathrm{minorize}}(P)=\inf\left\{{m/p:\inf_{s\in S}P^{m}(s,\cdot)\geq p\phi(\cdot)\text{ for some }\phi\in\mathcal{P}(2^{S})}\right\}.

We will drop the $P$ dependence when it is clear.

Next, we aim to demonstrate the equivalence between these two mixing times up to a constant factor.

Theorem 1.

The two notion of mixing times $t_{\mathrm{minorize}}$ and ${t_{\mathrm{mix}}}$ are equivalent up to constants: If $P$ is uniformly ergodic, then $t_{\mathrm{minorize}}\leq 22{t_{\mathrm{mix}}}\leq 22\log(16)t_{\mathrm{minorize}}$ .

We note that proof idea of the direction $t_{\mathrm{minorize}}\leq c{t_{\mathrm{mix}}}$ follows from Aldous et al., (1997), while ${t_{\mathrm{mix}}}\leq Ct_{\mathrm{minorize}}$ follows from Theorem UE. We defer the proof to the appendix.

Therefore, for a uniformly ergodic chain, quantitative knowledge of the mixing time and the “best” minorization are equivalent up-to constant factors. As a consequence of this equivalence between ${t_{\mathrm{mix}}}$ and $t_{\mathrm{minorize}}$ , the complexity outcomes highlighted in Table 1 can be substituted directly with ${t_{\mathrm{mix}}}$ .

Other than its theoretical convenience, $t_{\mathrm{minorize}}$ could be a more explicit metric than ${t_{\mathrm{mix}}}$ in control of stochastic systems. For instance, consider a queueing system for which the sources of randomness are fully exogenous. A long inter-arrival time will imply that the system become empty and the state process regenerates. Therefore, we can obtain a natural upper bound on $t_{\mathrm{minorize}}$ from the frequency of observing long inter-arrival times.

4 Uniformly Ergodic MDPs: Algorithms and Sample Complexity Upper Bounds

We have introduced uniform ergodicity and the minorization time $t_{\mathrm{minorize}}$ as an equivalent measure of mixing time and the split chain technique. Using $t_{\mathrm{minorize}}$ as the mixing criterion, in this section, we explore the worst case sample complexities of learning the optimal $q$ -function under different stability assumptions of the controlled Markov chain.

MDPs can exhibit different types of mixing behavior. However, since our primary focus is on developing a complexity theory for learning the optimal $q$ -function, we seek algorithms that obtain optimal policies by producing progressively more precise estimates of $q^{*}$ as the sample size increases. Consequently, the asymptotic variability of the estimator should be influenced by the mixing characteristics of the optimal policies. Therefore, it is well-motivated to make assumptions about the mixing properties of the class of optimal policies. We first consider the most general one:

Assumption 1.

There exists an optimal stationary Markov deterministic policy $\pi^{*}\in\Pi^{*}$ s.t. the transition kernel $P_{\pi^{*}}$ is uniformly ergodic.

In this setting, define

t_{\mathrm{minorize}}^{*}=t_{\mathrm{minorize}}(P_{\pi^{*}}).

To obtain an optimal sample complexity result, we make further uniform assumptions on the minorization times of kernels induced by $\pi^{*}\in\Pi^{*}$ . In this paper, we consider the following two settings:

Assumption 2.

$\Pi^{*}=\left\{{\pi^{*}}\right\}$ is a singleton. Moreover, $\pi^{*}$ satisfies Assumption 1.

If $\Pi^{*}$ is not a singleton, we instead impose a “continuity in $\pi$ ” assumption on $P$ which we will refer to as Lipschitzness:

Assumption 3.

For any $\pi^{*}\in\Pi^{*}$ , the transition kernel $P_{\pi^{*}}$ is uniformly ergodic, and the transition kernel satisfies a local Lipschitz condition: There is $L>0$ s.t. for all $q\in\mathbb{R}^{|S|\times|A|}$ and associated greedy policy $\pi(q)$ ,

\left\|\left(P^{\pi(q)}-P^{\pi^{*}}\right)(q-q^{*})\right\|_{\infty}\leq L\left\|q-q^{*}\right\|_{\infty}^{2}.

(10)

In the setting of Assumption 2 and 3, we define

t_{\mathrm{minorize}}^{*}=\max_{\pi^{*}\in\Pi^{*}}t_{\mathrm{minorize}}(P_{\pi^{*}}).

We note that the settings in Assumptions 2 and 3 are considered in the Q-learning literature (Khamaru et al., 2021b, ; Li et al.,, 2023). Moreover, Assumption 3 is implied by 2 with $L=4/\zeta$ where $\zeta$ is the optimality gap of the optimal policy class defined as $\zeta:=\min_{s\in S}\min_{\pi\in\Pi\backslash\Pi^{*}}|v^{*}(s)-q^{*}(s,\pi(s))|$ ; see Lemma B.1 of Li et al., (2023).

Even though the asymptotic performance of a convergent algorithm should only depend on the mixing characteristics of the optimal policies, for a prescribed accuracy level $\epsilon$ , assumptions on mixing characteristics of non-optimal policies could potentially increase the performance as well. Moreover, in applications of MDPs, we typically have little knowledge of the class of optimal policies a priori. However, as mentioned earlier, there may be settings in which all policies will induce mixing Markov chains with a uniform bound on the minorization times. This leads to the following assumption:

Assumption 4.

For all $\pi\in\Pi$ , the transition kernel $P_{\pi}$ is uniformly ergodic.

In this setting, we define

t_{\mathrm{minorize}}=\max_{\pi\in\Pi}t_{\mathrm{minorize}}(P_{\pi}).

Since $\Pi$ is finite, the above $\max$ is always achieved and $t_{\mathrm{minorize}}<\infty$ .

Remark.

The minimax sample complexity of tabular MPDs is well understood when $t_{\mathrm{minorize}},t_{\mathrm{minorize}}^{*}=\Omega((1-\gamma)^{-1})$ . We will assume the interesting case when $t_{\mathrm{minorize}},t_{\mathrm{minorize}}^{*}\leq(1-\gamma)^{-1}$ . Moreover, for the convenience of our analysis, we also assume that, w.l.o.g, $p^{\pi}\leq 1/2$ .

In the following developments, we will demonstrate the impact of mixing Assumptions 1, 2, 3, and 4 on algorithmic performance, leading to a improvement in sample complexity by a factor of $(1-\gamma)^{-1}$ when compared to the minimax lower bound without considering mixing time assumptions ( $\tilde{O}((1-\gamma)^{-3})$ ). In particular, the sample complexity upper bounds in Theorems 2, 3, and 4 are now enhanced to $\tilde{O}((1-\gamma)^{-2})$ .

4.1 Q-learning and Wainwright’s Variance Reduction

We first introduce the algorithmic frameworks that underlie our complexity results: the Q-learning algorithm and Wainwright’s variance reduction technique.

4.1.1 The Q-learning Algorithm

Define $\left\{{\widehat{\mathcal{T}}_{k},k\geq 0}\right\}$ as a sequence of i.i.d. 1-sample empirical Bellman operators: i.e. for each $s,a\in S\times A$ , sample $S^{\prime}\sim P(\cdot|s,a)$ independently and assign

\widehat{\mathcal{T}}_{k+1}(q)(s,a):=r(s,a)+\gamma\max_{b\in A}q(S^{\prime},b).

(11)

The synchronous Q-learning $\text{QL}(k_{0})$ with $k_{0}$ number of iterations can be expressed in terms of these empirical Bellman operators, as displayed in Algorithm 1. It is easily seen that $\sigma^{2}(q)(s,a)=\mathrm{Var}(\widehat{\mathcal{T}}(q)(s,a))$ . This echoes the previous definition (6).

Algorithm 1 Q-learning:

\text{QL}(k_{0})

Initialization:

q_{1}\equiv 0

;

k=1

for

1\leq k\leq k_{0}

Sample

\widehat{\mathcal{T}}_{k+1}

the 1-sample empirical Bellman operator.

Compute the Q-learning update

q_{k+1}=(1-\lambda_{k})q_{k}+\lambda_{k}\widehat{\mathcal{T}}_{k+1}(q_{k})

with stepsize

\lambda_{k}=1/(1+(1-\gamma)k)

end for

return

q_{k_{0}+1}

It turns out that the algorithms we develop in the future sections will require a reasonably good initialization. Typically, such an initialization can be achieved by first running the Q-learning Algorithm 1. Therefore, it is useful to have an error bound for this Q-learning algorithm. From Wainwright, 2019a ; Wainwright, 2019b , we have the following proposition.

Proposition 4.1 (Section 4.1.2 in Wainwright, 2019b ).

For any $k_{0}\geq 1$ , let $q_{k_{0}+1}$ be the output of $\text{QL}(k_{0})$ . Then, there exists absolute constant $c$ so that with probability at least $1-\delta$ , we have

	$\displaystyle\quad\\|q_{k_{0}+1}-q^{*}\\|_{\infty}$
	$\displaystyle\leq\frac{\\|q^{}\\|_{\infty}}{(1-\gamma)k_{0}}+c\left(\frac{\\|\sigma(q^{})\\|_{\infty}}{(1-\gamma)^{3/2}\sqrt{k_{0}}}\log\left(\frac{2\|S\|\|A\|k}{\delta}\right)^{1/2}+\frac{\left\|q^{*}\right\|_{\mathrm{span}}}{(1-\gamma)^{2}k_{0}}\log\left(\frac{2e\|S\|\|A\|k_{0}(1+(1-\gamma)k_{0})}{\delta}\right)\right).$

To use this proposition, we need the following lemma, which is a consequence of Proposition 6.1.

Lemma 1.

Under Assumption 1, the following bounds hold:

\left|q^{*}\right|_{\mathrm{span}}\leq 3t_{\mathrm{minorize}}^{*}\quad\text{and}\quad\sigma^{2}(q^{*})(s,a)\leq 4\gamma^{2}(t_{\mathrm{minorize}}^{*})^{2}.

Remark.

Combining Proposition 4.1 and Lemma 1, we have a sample complexity bound for this Q-learning algorithm under Assumption 1: $\tilde{O}(|S||A|(t_{\mathrm{minorize}}^{*})^{2}\epsilon^{-2}(1-\gamma)^{-3})$ . Li et al., (2021) have shown that the worst case sample complexity of the Q-learning Algorithm 1 should be $\tilde{O}(|S||A|/(\epsilon^{2}(1-\gamma)^{4}))$ uniformly in all MDP instance $\mathcal{M}$ with discount factor $\gamma$ . So, the sample complexity bound implied by Proposition 4.1 is is not optimal when $t_{\mathrm{minorize}}^{*}=\Theta((1-\gamma)^{-1})$ .

4.1.2 Wainwright’s Variance Reduction

Li et al., (2021) established that the Q-learning Algorithm 1 is not minimax optimal when used to learn the optimal action-value function $q^{*}$ . In the pursuit of an optimal model-free algorithm, Wainwright, 2019b proposed a variance-reduced variant of Q-learning, as outlined in Algorithm 2.

Algorithm 2 Variance-reduced Q-learning:

\text{VRQL}(q,k^{*},l^{*},\left\{{n_{l}:1\leq l\leq l^{*}}\right\})

Initialization:

\hat{q}_{0}=q

;

l=1

;

k=1

for

1\leq l\leq l^{*}

Sample i.i.d.

\left\{{\tilde{\mathcal{T}}_{l,j}:j=1,\dots,n_{l}}\right\}

1-sample empirical Bellman operators defined in (11).

Compute the recentering

\widetilde{\mathcal{T}}_{l}(\hat{q}_{l-1}):=\frac{1}{n_{l}}\sum_{j=1}^{n_{l}}\widetilde{\mathcal{T}}_{l,j}(\hat{q}_{l-1}).

Set

q_{l,1}=\hat{q}_{l-1}

for

1\leq k\leq k^{*}

Sample

\widehat{\mathcal{T}}_{l,k+1}

the 1-sample empirical Bellman operator.

Compute the recentered Q-learning update

q_{l,k+1}=(1-\lambda_{k})q_{l,k}+\lambda_{k}\left(\widehat{\mathcal{T}}_{l,k+1}(q_{l,k})-\widehat{\mathcal{T}}_{l,k+1}(\hat{q}_{l-1})+\widetilde{\mathcal{T}}_{l}(\hat{q}_{l-1})\right)

(12)

with stepsize

\lambda_{k}=1/(1+(1-\gamma)k)

end for

Set

\hat{q}_{l}=q_{l,k^{*}+1}

end for

return

\hat{q}_{l^{*}}

This variant has been shown to achieve the minimax optimal sample complexity $\tilde{O}(|S||A|(1-\gamma)^{-3})$ for any Markov decision process (MDP) instance $\mathcal{M}$ , subject to optimal selection of the parameters $q,l^{*},\left\{{n_{l},1\leq l\leq l^{*}}\right\},k^{*}$ and initialization $\hat{q}_{0}$ . In the subsequent sections, we explore modifications to Algorithm 2 that enable it to attain the sample complexity upper bounds presented in Table 1.

Notably, the variance-reduced Q-learning variant generally satisfies the path-wise error bound of the form $\left\{{\|\hat{q}_{l}-q^{*}\|_{\infty}\leq b_{l},\forall l\leq l^{*}}\right\}$ with high probability, due to the empirical Bellman operators being $\gamma$ -contractions as well. This desirable characteristic makes it a versatile tool for algorithmic designs.

The variance-reduced Q-learning algorithm employs an inner loop indexed by $k$ and an outer loop indexed by $l$ , where each outer loop is referred to as an epoch. Within each epoch, the inner loop executes a “recentered” version of synchronous Q-learning. By selecting a series of empirical Bellman operators $\widetilde{\mathcal{T}}_{l}$ with exponentially increasing sample sizes as a function of $l$ , the estimators $\left\{{\hat{q}_{l}}\right\}$ generated by the epochs achieve an exponentially decreasing error, namely $\|\hat{q}_{l}-q^{*}\|_{\infty}\leq 2^{-l}b$ , with high probability.

4.2 The General Setting: Assumption 1

In this section, we analyze the convergence and worst case sample complexity bound of Algorithm 2 under the general Assumption 1, which posits that there exists one optimal policy that induces a $(m,p)$ -Doeblin kernel.

We aim to demonstrate that by combining Algorithms 1 and 2 under the aforementioned notion of mixing, one can produce an estimator of $q^{*}$ within an $\epsilon$ absolute error with high probability using at most $\tilde{O}(|S||A|(t_{\mathrm{minorize}}^{*})^{2}(1-\gamma)^{-2}\epsilon^{-2})$ number of samples. This removes a power on $(1-\gamma)^{-1}$ from the minimax sample complexity lower and upper bounds established in Azar et al., (2013). This is quite a surprising improvement as we are only assuming that one of the optimal policies induces mixing.

The central result in this section is the following sample complexity upper bound in Theorem 2. It is is an immediate consequence of Proposition 4.3.

Theorem 2.

For any MDP instance satisfying Assumption 1, the sample complexity running the procedure specified by Proposition 4.3 with $l^{*}=\left\lceil{\log_{2}(t_{\mathrm{minorize}}^{*}/\epsilon))}\right\rceil$ to achieve an estimator of $q^{*}$ with absolute error $\epsilon\leq t_{\mathrm{minorize}}^{*}$ w.p. at least $1-\delta$ is

\tilde{O}\left(\frac{|S||A|(t_{\mathrm{minorize}}^{*})^{2}}{(1-\gamma)^{2}\epsilon^{2}}\max\left\{{1,\frac{\epsilon^{2}}{(t_{\mathrm{minorize}}^{*})^{2}(1-\gamma)}}\right\}\right).

Remark.

For $\epsilon\leq t_{\mathrm{minorize}}\sqrt{1-\gamma}$ , this sample complexity bound reduce to $\tilde{O}(|S||A|(t_{\mathrm{minorize}}^{*})^{2}(1-\gamma)^{-2})$ .

To arrive at this sample complexity upper bound, we analyze the statistical properties of Algorithm 2 under Assumption 1. At a high-level, an upper bound of the asymptotic variance of estimating $q^{*}$ using the variance-reduced Q-learning can be established using the oscillation of $q^{*}$ . According to Proposition 6.1, the oscillation of $q^{*}$ can again be bounded by the minorization time. Since all optimal policies induce the same $q^{*}$ , the minorization time of any optimal policy can be used to control the convergence rate.

We will motivate and state two intermediate propositions that underlie the proof of Theorem 2. The proofs and relevant developments of the key results in this section are deferred to Section B.2. An important result is the following property of Algorithm 2.

Proposition 4.2.

Given $\hat{q}_{0}$ satisfies $\|\hat{q}_{0}-q^{*}\|_{\infty}\leq b$ w.p. at least $1-\delta/(l^{*}+1)$ and $b/\epsilon\geq 1$ , then choosing

	$\displaystyle k^{*}$	$\displaystyle\geq c_{1}\frac{1}{(1-\gamma)^{3}}\log\left(\frac{8(l^{*}+1)\|S\|\|A\|}{(1-\gamma)\delta}\right)$		(13)
	$\displaystyle n_{l}$	$\displaystyle\geq c_{2}\frac{1}{(1-\gamma)^{2}}\left(\frac{\\|\sigma(q^{})\\|_{\infty}+(1-\gamma)\\|q^{}\\|_{\infty}}{2^{-l+1}b}+1\right)^{2}\log\left(\frac{8(l^{*}+1)\|S\|\|A\|}{\delta}\right),$		(13)

and the total number of outter iterations

l^{*}\geq\left\lceil{\log_{2}\left(\frac{b}{\epsilon}\right)}\right\rceil,

(14)

we have that the output $\hat{q}_{l^{*}}=\text{VRQL}(\hat{q}_{0},k^{*},l^{*},\left\{{n_{l}:1\leq l\leq l^{*}}\right\})$ satisfies $\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon$ w.p. at least $1-\delta$ .

It should be noted that the algorithm specified in Proposition 4.2 may not be directly implementable due to the lack of a priori knowledge of $\|\sigma(q^{*})\|_{\infty}$ in the definition of $\left\{{n_{l}}\right\}$ . Fortunately, this can be addressed by first running Algorithm 1 to obtain an initialization $\hat{q}_{0}$ s.t. $\|\hat{q}_{0}-q^{*}\|_{\infty}\leq b=O(t_{\mathrm{minorize}}^{*})$ with high probability. By Lemma 1, this initialization can cancel out the $\|\sigma(q^{*})\|_{\infty}$ term in the numerator of $\left\{{n_{l}}\right\}$ in (13).

To simplify notation, we define

d:=|S||A|\left(\left\lceil{\log\left(\frac{1}{(1-\gamma)\epsilon}\right)}\right\rceil+1\right).

(15)

Concretely, we choose

	$\displaystyle k^{*}$	$\displaystyle=c_{1}\frac{1}{(1-\gamma)^{3}}\log\left(\frac{8d}{(1-\gamma)\delta}\right)$		(16)
	$\displaystyle n_{l}$	$\displaystyle=3c_{2}\frac{4^{l}}{(1-\gamma)^{2}}\log\left(\frac{8d}{\delta}\right)$		(16)

with the same $c_{1},c_{2}$ as in (13). Moreover, the initialization $\hat{q}_{0}$ is obtained from running Algorithm 1 with

\displaystyle k_{0}=c_{0}\frac{1}{(1-\gamma)^{3}}\log\left(\frac{2d}{(1-\gamma)\delta}\right)

(17)

for some sufficiently large absolute constant $c_{0}$ . By Proposition 4.1, this should guarantee $\|\hat{q}_{0}-q^{*}\|_{\infty}\leq t_{\mathrm{minorize}}^{*}$ w.p. at least $1-\delta/(\left\lceil{\log_{2}((1-\gamma)^{-1}\epsilon^{-1})}\right\rceil+1)$ . Therefore, by Proposition 4.2, we have the following finite time algorithmic error guarantee:

Proposition 4.3.

Assume Assumption 1. We run the combined Algorithm 1 and 2 with parameters specified in (16) and (17); i.e. $\text{VRQL}(\text{QL}(k_{0}),k^{*},\left\{{n_{l}}\right\},l^{*})$ . Then, for $\epsilon\leq t_{\mathrm{minorize}}^{*}$ and $\left\lceil{\log_{2}(t_{\mathrm{minorize}}^{*}/\epsilon))}\right\rceil\leq l^{*}\leq\left\lceil{\log_{2}((1-\gamma)^{-1}\epsilon^{-1})}\right\rceil$ , the estimator $\hat{q}_{l^{*}}$ produced by this combined procedure satisfies $\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon$ w.p. at least $1-\delta$ .

Remark.

Notice that this proposition allows a range of $l^{*}$ . Doing this allows the user to run the algorithm without the knowledge of $t_{\mathrm{minorize}}$ . However, to achieve the optimized complexity in Theorem 2, we need to choose $l^{*}=\left\lceil{\log_{2}(t_{\mathrm{minorize}}^{*}/\epsilon))}\right\rceil$ .

Theorem 2 will then follow from Proposition 4.3 by calculating the total number of sample used.

4.3 The Lipschitz Setting: Assumptions 2 and 3

In the previous section, we prove that in the general setting, the variance-reduced $q$ -learning algorithm has worst case sample complexity upper bound scale as $\tilde{O}(|S||A|(t_{\mathrm{minorize}}^{*})^{2}\epsilon^{-2}(1-\gamma)^{-2})$ uniformly for $t_{\mathrm{minorize}}^{*}\leq 1/(1-\gamma)$ . This is not optimal when $1/\sqrt{1-\gamma}\leq t_{\mathrm{minorize}}^{*}\leq 1/(1-\gamma)$ . In particular, there are known algorithms and analysis (Azar et al.,, 2013; Wainwright, 2019b, ) with worst case sample complexity upper bound $\tilde{O}(|S||A|\epsilon^{-2}(1-\gamma)^{-3})$ . In this section, we develop a better sample complexity bound under the Assumption 2 or 3. The proof of the results in this section provided in Section B.3.

Define the variance vector

\nu^{\pi}(\mathcal{M})^{2}:=\mathrm{diag}(\mathrm{Cov}((I-\gamma P^{\pi})^{-1}\widehat{\mathcal{T}}(q^{*}))).

(18)

Here, the notation $\mathrm{diag}(\cdot)$ maps a square matrix $M$ to a column vector containing the diagonal elements of $M$ . In Khamaru et al., 2021b , the authors establish that $\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M})\|_{\infty}$ serves as a lower bound on the non-asymptotic minimax risk of estimating $q^{*}$ for a specific MDP instance $\mathcal{M}$ . They also prove an upper bound that matches the lower bound under certain conditions, such as the uniqueness of the optimal policy or the satisfaction of the Lipschitz condition (10) of the transition kernel. It turns out that, as a direct consequence of Corollary 6.2.1, $\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M})\|_{\infty}$ satisfies the upper bound $O(t_{\mathrm{minorize}}^{*}(1-\gamma)^{-2})$ , see Lemma 2. This will imply a worst-case sample complexity upper bound of $\tilde{O}(t_{\mathrm{minorize}}^{*}(1-\gamma)^{-2}\epsilon^{-2})$ , provided that the sample size is sufficiently large.

Recall that $\Pi^{*}$ is the set of optimal stationary deterministc Markov policies. Define the optimality gap

\displaystyle\Delta=\min_{\pi\in\Pi\backslash\Pi^{*}}\|q^{*}-(r+\gamma P^{\pi}q^{*})\|_{\infty}.

The main result in this section is the following theorem.

Theorem 3.

Suppose Assumption 2 or 3 hold. Let parameters $k_{0}$ as in (23) and $l^{*},\left\{{n_{l}}\right\},k^{*}$ as in Theorem K2. Then, for all sufficiently small $\epsilon$ , the combined algorithm $\text{VRQL}(\text{QL}(k_{0}),k^{*},\left\{{n_{l}}\right\},l^{*})$ achieves an estimator of $q^{*}$ with absolute error $\epsilon$ in the sup norm w.p. at least $1-\delta$ using

\tilde{O}\left(\frac{t_{\mathrm{minorize}}^{*}}{\epsilon^{2}(1-\gamma)^{2}}\max\left\{{1,\epsilon^{2}t_{\mathrm{minorize}}^{*},\frac{\epsilon^{2}}{(1-\gamma)t_{\mathrm{minorize}}^{*}}}\right\}\right)

number of samples. Specifically, it is sufficient under Assumption 2 for

\epsilon\leq\Delta\sqrt{t_{\mathrm{minorize}}^{*}(1-\gamma)^{1+\beta}};

(19)

or under Assumption 3 for

\epsilon\leq\max\left\{{\Delta,(1-\gamma)/L}\right\}\sqrt{t_{\mathrm{minorize}}^{*}(1-\gamma)^{1+\beta}}

(20)

for any $\beta>0$ .

Remark.

If we further require that $\epsilon\leq\min\left\{{((1-\gamma)t_{\mathrm{minorize}}^{*})^{1/2},(t_{\mathrm{minorize}}^{*})^{-1/2}}\right\}$ , the worst case sample complexity upper bound becomes $\tilde{O}(|S||A|t_{\mathrm{minorize}}^{*}\epsilon^{-2}(1-\gamma)^{-2})$ .

The proof of Theorem 3 is based on the following key result in Khamaru et al., 2021b :

Theorem K2 (Theorem 2 of Khamaru et al., 2021b ).

Suppose either that the optimal policy $\pi^{*}$ is unique or that the Lipschitz condition (10) holds. We run Algorithm 2 with initialization $\hat{q}$ satisfying $\|\hat{q}_{0}-q^{*}\|_{\infty}\leq 1/\sqrt{1-\gamma}$ . Also, for sufficiently large $n^{*}$ satisfying that there exists a $\beta>0$

•

In the case of unique $\pi^{*}$ ,

\frac{n^{*}}{(\log n^{*})^{2}}\geq c\frac{\log(|S||A|/\delta)}{(1-\gamma)^{3}}\max\left\{{1,\frac{1}{\Delta^{2}(1-\gamma)^{\beta}}}\right\}.

•

In the case of Lipschitz condition (10) hold,

\frac{n^{*}}{(\log n^{*})^{2}}\geq c\frac{\log(|S||A|/\delta)}{(1-\gamma)^{3+\beta}}\min\left\{{\frac{1}{\Delta^{2}},\frac{L^{2}}{(1-\gamma)^{2}}}\right\}.

where $c$ is some large absolute constant. Choose

$\displaystyle l^{*}$	$\displaystyle=\left\lceil{\log_{4}\left(\frac{n^{}(1-\gamma)^{2}}{8\log\left((16\|S\|\|A\|/\delta)\log n^{}\right)}\right)}\right\rceil$	(21)
$\displaystyle n_{l}$	$\displaystyle=\frac{4^{l}}{(1-\gamma)^{2}}\log_{4}\left(\frac{16l^{*}\|S\|\|A\|}{\delta}\right)$
$\displaystyle k^{*}$	$\displaystyle=\frac{n^{}}{2l^{}}.$

We have that w.p. at least $1-\delta$

\|\hat{q}_{l^{*}}-q^{*}\|\leq c^{\prime}\left(\sqrt{\frac{\log_{4}(8|S||A|l^{*}/\delta)}{n^{*}}}\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M})\|_{\infty}+\frac{\left|q^{*}\right|_{\mathrm{span}}\log_{4}(8|S||A|l^{*}/\delta)}{(1-\gamma)n^{*}}\right).

(22)

As explained in the discussion after Assumption 3, noted by Li et al., (2023), if the optimal policy is unique with optimality gap $\Delta$ , then Assumption 3 holds with $L=4/\Delta$ . This is consistent with the requirement of $n^{*}$ in Theorem K2.

Theorem K2 suggests that the finite sample convergence rate can be controlled by the local minimax rate parameter $\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M})\|_{\infty}$ . Observe that if the entries of $\nu^{\pi}(\mathcal{M})$ can be uniformly upper bounded by $O(t_{\mathrm{minorize}}^{*}(1-\gamma)^{-2})$ for all $\pi\in\Pi^{*}$ , then the error bound (22) should imply the desired sample complexity upper bound. This is indeed the case if all optimal policies induce Doeblin chains with a unifrom minorization time upper bound.

Lemma 2.

Suppose that for all $\pi\in\Pi^{*}$ , the kernel $P_{\pi}$ uniformly ergodic with

t_{\mathrm{minorize}}^{*}:=\max_{\pi\in\Pi^{*}}t_{\mathrm{minorize}}(P_{\pi}).

Then

\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M})^{2}\|_{\infty}\leq\frac{6400t_{\mathrm{minorize}}^{*}}{(1-\gamma)^{2}}.

To use Theorem K2, we still need a initialization $\|\hat{q}_{0}-q^{*}\|_{\infty}\leq 1/\sqrt{1-\gamma}$ . As in the previous section, this can be achieved w.p. at least $1-\delta/2$ by running Algorithm 1 with

k_{0}=c_{0}\frac{(t_{\mathrm{minorize}}^{*})^{2}}{(1-\gamma)^{2}}\log\left(\frac{4|S||A|}{(1-\gamma)\delta}\right)

(23)

for sufficiently large constant $c_{0}$ . Then we start the Algorithm 2 with this initialization and parameters specified in Theorem K2. This will give the desired sample complexity upper bound in Theorem 3.

Remark.

This method of generating $\hat{q}_{0}$ , however, requires the knowledge of $t_{\mathrm{minorize}}$ in order to specify $k_{0}$ . To resolve this, instead, we can use

k_{0}=c_{0}\frac{1}{(1-\gamma)^{4}}\log\left(\frac{4|S||A|}{(1-\gamma)\delta}\right)

to initialize $\hat{q}_{0}$ . This will result in a sample complexity upper bound of $\tilde{O}\left(|S||A|t_{\mathrm{minorize}}(1-\gamma)^{-2}\epsilon^{-2}\right)$ when $\epsilon\leq 1-\gamma$ on top of satisfying the conditions in Theorem 3. This is less efficient in terms of the range of $\epsilon$ . So, taking a complexity theoretical perspective, we omit a formal presentation and proofs of this initialization method.

Also, note that if compared to the procedure in Proposition 4.5, one might attempt to use the procedure in Proposition 4.3 as a more efficient initialization method. However, this is not possible due to the requirement that $\epsilon\leq\sqrt{t_{\mathrm{minorize}}}$ in Proposition 4.3.

4.4 Uniform Mixing: Assupmtion 4

Assuming a uniform minorization time upper bound as in Assumption 4, in this section, we will construct an algorithm and outline the proof of the following sample complexity upper bound.

Theorem 4.

Assume Assumption 4. Run the procedure in Proposition 4.5 with $l^{*}=\left\lceil{\log_{2}(\sqrt{t_{\mathrm{minorize}}}/\epsilon)}\right\rceil$ . The worst case sample complexity to achieve an estimator of $q^{*}$ with absolute error $\epsilon\leq\sqrt{t_{\mathrm{minorize}}}$ in the infinity norm w.p. at least $1-\delta$ is

\tilde{O}\left(\frac{|S||A|t_{\mathrm{minorize}}}{(1-\gamma)^{2}\epsilon^{2}}\max\left\{{\frac{\epsilon^{2}}{t_{\mathrm{minorize}}(1-\gamma)},1}\right\}\right).

Remark.

Note that when $\epsilon\leq\sqrt{t_{\mathrm{minorize}}(1-\gamma)}$ , Theorem 4 implies a sample complexity upper bound of $\tilde{O}\left(|S||A|t_{\mathrm{minorize}}(1-\gamma)^{-2}\epsilon^{-2}\right)$ . Compare to Theorem 3, this gives a larger range of $\epsilon$ in which the optimal sample complexity bound holds.

To develop this sample complexity upper bound, we highlight some preliminary observations of Algorithm 2 made by Wainwright, 2019b . Consider the inner loop update of the variance-reduced Q-learning (12). Wainwright, 2019b defines

\widehat{\mathcal{J}}_{l,k+1}(q)=\widehat{\mathcal{T}}_{l,k+1}(q)-\widehat{\mathcal{T}}_{l,k+1}(\hat{q}_{l-1})+\widetilde{\mathcal{T}}_{l}(\hat{q}_{l-1});\qquad\mathcal{J}_{l}(q)=\mathcal{T}(q)-\mathcal{T}(\hat{q}_{l-1})+\widetilde{\mathcal{T}}_{l}(\hat{q}_{l-1}).

Then, (12) becomes

q_{l,k+1}=(1-\lambda_{k})q_{l,k}+\lambda_{k}\widehat{\mathcal{J}}_{l,k+1}(q_{l,k}).

It is easy to check that $\mathcal{J}_{l}$ is a $\gamma$ -contraction, and $E[\widehat{\mathcal{J}}_{l,k+1}(q)|\hat{q}_{l-1}]=\mathcal{J}_{l}(q)$ . So, the inner loop of the variance-reduced Q-learning can be thought of as a stochastic approximation to the unique fixed point $\bar{q}_{l}$ of $\mathcal{J}_{l}$ .

This observation lead us to considering a perturbed reward function

\bar{r}_{l}:=r-\mathcal{T}(\hat{q}_{l-1})+\widetilde{\mathcal{T}}_{l}(\hat{q}_{l-1}).

(24)

Then the fixed point $\bar{q}_{l}$ of $\mathcal{J}_{l}$ uniquely satisfies the Bellman equation

\bar{q}_{l}=\bar{r}_{l}(s,a)+P_{s,a}v(\bar{q}_{l}).

(25)

With these reformulation of the variance-reduced Q-learning recursion, we can bound the error after $l$ epoch by the triangle inequality

\|\hat{q}_{l}-q^{*}\|_{\infty}\leq\|\hat{q}_{l}-\bar{q}_{l}\|_{\infty}+\|q^{*}-\bar{q}_{l}\|_{\infty}.

Then, by tightly controling the errors $\|\hat{q}_{l}-\bar{q}_{l}\|_{\infty}$ and $\|q^{*}-\bar{q}_{l}\|_{\infty}$ , we can arrive at the following Proposition.

Proposition 4.4.

Assume Assumption 4 and an initialization $\hat{q}_{0}$ that satisfies $\|\hat{q}_{0}-q^{*}\|_{\infty}\leq\sqrt{t_{\mathrm{minorize}}}$ w.p. at least $1-\delta/(l^{*}+1)$ , and $\epsilon\leq\sqrt{t_{\mathrm{minorize}}}$ . Choose $l^{*}\geq\left\lceil{\log_{2}(\sqrt{t_{\mathrm{minorize}}}/\epsilon)}\right\rceil$ ,

	$\displaystyle k^{*}$	$\displaystyle\geq c\frac{1}{(1-\gamma)^{3}}\log\left(\frac{2\|S\|\|A\|(l^{*}+1)}{\delta(1-\gamma)}\right),$		(26)
	$\displaystyle n_{l}$	$\displaystyle\geq c^{\prime}\frac{4^{l}}{(1-\gamma)^{2}}\log\left(\frac{8\|S\|\|A\|(l^{*}+1)}{\delta}\right).$		(26)

Then, we have that $\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon$ w.p. at least $1-\delta$ .

To use Proposition 4.4, we need to produce an estimator $\hat{q}_{0}$ s.t. $\|\hat{q}_{0}-q^{*}\|_{\infty}\leq\sqrt{t_{\mathrm{minorize}}}$ w.p. at least $1-\delta/(l^{*}+1)$ . Here, we require $l^{*}$ to satisfy $l^{*}\leq\left\lceil{\log_{2}((1-\gamma)^{-1/2}\epsilon^{-1})}\right\rceil$ . Therefore, to achieve such a initialization efficiently, we can run the combined procedure in Proposition 4.3 with specified error $\epsilon=\epsilon_{0}=\sqrt{t_{\mathrm{minorize}}}$ and tolerance probability $1-\delta/(\left\lceil{\log_{2}((1-\gamma)^{-1/2}\epsilon^{-1})}\right\rceil+1)$ . By Theorem 2, we need

\tilde{O}\left(\frac{|S||A|}{(1-\gamma)^{2}}\max\left\{{\frac{1}{1-\gamma},\frac{(t_{\mathrm{minorize}})^{2}}{\epsilon_{0}^{2}}}\right\}\right)=\tilde{O}\left(\frac{|S||A|}{(1-\gamma)^{3}}\right)

number of samples.

Remark.

Similar to the remark in section 4.3, this method of generating $\hat{q}_{0}$ requires the knowledge of $t_{\mathrm{minorize}}$ . Instead, one could run the Q-learning $\text{QL}(k_{0})$ with

k_{0}=c_{0}\frac{1}{(1-\gamma)^{4}}\log\left(\frac{2dt_{\mathrm{minorize}}}{(1-\gamma)\delta}\right)

to initialize $\hat{q}_{0}$ , where we recall the definition of $d$ in (15). This will result in a sample complexity upper bound of $\tilde{O}\left(|S||A|t_{\mathrm{minorize}}(1-\gamma)^{-2}\epsilon^{-2}\right)$ when $\epsilon\leq(1-\gamma)\sqrt{t_{\mathrm{minorize}}}$ . This is less efficient and the proof is omitted.

We will use the following choice of parameters:

	$\displaystyle k^{*}$	$\displaystyle=c_{1}\frac{1}{(1-\gamma)^{3}}\log\left(\frac{2d}{(1-\gamma)\delta}\right),$		(27)
	$\displaystyle n_{l}$	$\displaystyle=c_{2}\frac{4^{l}}{(1-\gamma)^{2}}\log\left(\frac{8d}{\delta}\right).$		(27)

These choice of parameters implies the following algorithmic error bound.

Proposition 4.5.

Assume Assumption 4. First, run the procedure in Proposition 4.3 with specified error $\epsilon_{0}=\sqrt{t_{\mathrm{minorize}}}$ and tolerance probability $1-\delta/(\left\lceil{\log_{2}((1-\gamma)^{-1/2}\epsilon^{-1})}\right\rceil+1)$ to produce $\hat{q}_{0}$ . Then, run $\text{VRQL}(\hat{q}_{0},k^{*},l^{*},\left\{{n_{l}}\right\})$ with parameters $k^{*}$ , $\left\{{n_{l}}\right\}$ as specified in (27) and $\left\lceil{\log_{2}(\sqrt{t_{\mathrm{minorize}}}/\epsilon)}\right\rceil\leq l^{*}\leq\left\lceil{\log_{2}((1-\gamma)^{-1/2}\epsilon^{-1})}\right\rceil$ . For $\epsilon\leq\sqrt{t_{\mathrm{minorize}}}$ , the output of this combined procedure $\hat{q}_{l^{*}}$ satisfies $\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon$ w.p. at least $1-\delta$ .

Theorem 4 will then follow from Proposition 4.5 and a simple calculation the number of samples used.

5 Minimax Lower Bounds

In this section, we show that the sample complexity dependence on $\epsilon$ , $(1-\gamma)^{-1}$ , and $t_{\mathrm{minorize}}$ in Theorem 3 and 4 cannot be improved in general. We consider two notions of non-asymptotic minimax risk of learning the optimal $q$ -function in the sup norm, one localized at a hard instance, the other is uniform over the class of all MDPs with discount factor $\gamma$ and uniform minorization time upper bound $t_{\mathrm{minorize}}$ (Assumption 4). We prove that both minimax risks have lower bounds supporting the claim that our sample complexities in Theorem 3 and 4 are tight.

We start off with a rigorous definition of the generator model and the class of estimators. This is necessary because the lower bounding technique needs a careful specification of the probability measure. The generator model is a sequence of product probability spaces

G_{n}^{\mathcal{M}}=\bigotimes_{i=1}^{n}\bigotimes_{s\in S,a\in A}(S,2^{S},P^{\mathcal{M}}_{s,a})

where the $P_{s,a}$ is the transition kernel of instance $\mathcal{M}$ . By construction, the identity random element on $G_{1}^{\mathcal{M}}$ , denoted by $\left\{{S^{\prime}(s,a)(\cdot),s\in S,a\in A}\right\}$ is independent across $S\times A$ . We will denote the product measure on $G_{n}^{\mathcal{M}}$ as $P^{\mathcal{M}}_{n}$ and the expectation as $E^{\mathcal{M}}_{n}$ . An estimator $\tilde{q}_{n}$ is a measureable function from $G_{n}^{\mathcal{M}}$ to $\mathbb{R}^{S\times A}$ . For fixed $n$ , this is equivalent to the procedure that generate $n$ i.i.d. $S_{i}^{\prime}(s,a)\sim P_{s,a}$ for each $s,a$ and then form $\tilde{q}_{n}$ from these samples.

For fixed state and action spaces and discount factor $\gamma$ , we first consider the local non-asymptotic minimax risk at instance $\mathcal{M}_{\gamma}$ of learning the $q^{*}$ function accurately in the sup norm using $n$ samples:

\mathfrak{M}_{n}(\mathcal{M}_{\gamma})=\sup_{\mathcal{M}_{\gamma}^{\prime}}\inf_{\tilde{q}_{n}}\max_{\mathcal{M}\in\left\{{\mathcal{M}_{\gamma},\mathcal{M}_{\gamma}^{\prime}}\right\}}\sqrt{n}E^{\mathcal{M}}_{n}\|\tilde{q}_{n}-q^{*}(\mathcal{M})\|_{\infty}.

(28)

where the first supremum is taken over all MDP instance with discount factor $\gamma$ . This quantity measures the complexity of learning optimal $q$ -function on either $\mathcal{M}_{\gamma}$ or the hardest local alternative. It is studied in the convex optimization context by Cai and Low, (2015); Chatterjee et al., (2016), and adapted to the the RL context by Khamaru et al., 2021a ; Khamaru et al., 2021b .

The previous complexity metric embodies the idea of constructing a hard instance that is difficult to distinguish from its local alternative, bearing the spirit of the lower bounds in seminal works on MDP complexity theory (c.f. Azar et al., (2013)). However, one can argue that it is more natural to define the minimax risk of the class of MDP instances $\mathcal{C}(\gamma,t_{\mathrm{minorize}}):=\left\{{\mathcal{M}_{\gamma}:\text{ satisfying Assumption \ref{assump:unif_m-doeblin}}}\right\}$ .

Definition 4.

We define the minimax risk of learning the optimal $q$ -function of the MDPs $\mathcal{C}(\gamma,t_{\mathrm{minorize}})$ as

\mathfrak{M}_{n}(\gamma,t_{\mathrm{minorize}})=\inf_{\tilde{q}_{n}}\sup_{\mathcal{M}\in\mathcal{C}(\gamma,t_{\mathrm{minorize}})}\sqrt{n}E^{\mathcal{M}}_{n}\|\tilde{q}_{n}-q^{*}(\mathcal{M})\|_{\infty}.

With these definitions, we are ready to state the main results of this section.

Theorem 5.

There exists absolute constant $c>0$ s.t. for any $\gamma\in[0.9,1)$ and $10\leq t_{\mathrm{minorize}}\leq 1/(1-\gamma)$ , one can find a MDP instance $\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}$ satisfying Assumption 4 with uniform minorization time upper bound $t_{\mathrm{minorize}}$ s.t.

\mathfrak{M}_{n}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})\geq c\frac{\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}

provided that $n\geq 2\max\left\{{(1-\gamma)^{-2},3^{5}t_{\mathrm{minorize}}}\right\}$ .

Here, as we are lower bounding the complexity metric, we abuse the notation $t_{\mathrm{minorize}}$ in the Assumption 4 so that $t_{\mathrm{minorize}}\geq\max_{\pi\in\Pi}t_{\mathrm{minorize}}(P_{\pi})$ where $P$ is the kernel of $\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}$ .

As we will explain in a moment, the hard instances $\left\{{\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}}\right\}$ in Theorem 5 also facilitate lower bounding $\mathfrak{M}_{n}(\gamma,t_{\mathrm{minorize}})$ in Definition 4.

Theorem 6.

There exists absolute constant $c>0$ s.t. for any $\gamma\in[0.9,1)$ and $10\leq t_{\mathrm{minorize}}\leq 1/(1-\gamma)$ ,

\mathfrak{M}_{n}(\gamma,t_{\mathrm{minorize}})\geq c\frac{\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}

provided that $n\geq 2\max\left\{{(1-\gamma)^{-2},3^{5}(t_{\mathrm{minorize}})^{3}}\right\}$ .

Next, we outline the proofs of these theorems and defer the details to Section C.1 and C.2 respectively.

To begin with, we construct a family of MDP instances satisfying Assumption 4 such that the local non-asymptotic risk parameter $\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M})^{2}\|_{\infty}$ (see (18)) at each instance is $\Omega(t_{\mathrm{minorize}}(1-\gamma)^{-2})$ . We claim that this is the case for the following family of hard MDP instances:

Definition 5 (Hard MDP Family).

Consider the following family of hard MDP instances $\left\{{\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}:\gamma,t_{\mathrm{minorize}}}\right\}$ with $S=\left\{{1,2}\right\}$ , $A=\left\{{a_{1},a_{2}}\right\}$ . The transition kernel for each action is

P_{a_{1}}=P_{a_{2}}=\begin{bmatrix}1-p&p\\ p&1-p\end{bmatrix}.

The reward function $r(1,\cdot)=1$ and $r(2,\cdot)=0$ .

Clearly, $\pi^{*}(s_{1})=a_{1}$ and $\pi^{*}(s_{2})=a_{1}$ define a optimal policy. Observe that

P_{\pi^{*}}=\begin{bmatrix}1-p&p\\ p&1-p\end{bmatrix}=2p\begin{bmatrix}1/2&1/2\end{bmatrix}+(1-2p)\begin{bmatrix}1&0\\ 0&1\end{bmatrix}

So, $P_{\pi^{*}}$ is $(1,2p)$ -Doeblin and hence $(1,p)$ -Doeblin. For simplicity, we will use $t_{\mathrm{minorize}}=1/p\geq t_{\mathrm{minorize}}(P_{\pi^{*}})$ . It is also easy to see that all policies will induce a $(1,p)$ -Doeblin chain; i.e. Assumption 4 holds.

This and the following Theorem 1 in Khamaru et al., 2021b would imply a $\Omega\left(t_{\mathrm{minorize}}(1-\gamma)^{-2}\epsilon^{-2}\right)$ complexity lower bound of learning $q^{*}$ within an $\epsilon$ absolute error in the sup norm if the risk measure is $\mathfrak{M}_{n}(\mathcal{M})$ :

Theorem K1 (Theorem 1 in Khamaru et al., 2021b ).

There exists constant $c>0$ s.t. for any MDP instance $\mathcal{M}$ ,

\mathfrak{M}_{n}(\mathcal{M})\geq c\gamma\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M})\|_{\infty}

provided that

n\geq\max\left\{{\frac{2\gamma^{2}}{(1-\gamma)^{2}},\frac{2\left|q^{*}\right|_{\mathrm{span}}^{2}}{(1-\gamma)^{2}\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M})^{2}\|_{\infty}}}\right\}.

(29)

To prove Theorem 6, we first observe that

\mathfrak{M}_{n}(\gamma,t_{\mathrm{minorize}})\geq\sup_{\mathcal{M}_{1},\mathcal{M}_{2}\in\mathcal{C}(\gamma,t_{\mathrm{minorize}})}\inf_{\tilde{q}_{n}}\max_{\mathcal{M}\in\left\{{\mathcal{M}_{1},\mathcal{M}_{2}}\right\}}\sqrt{n}E_{n}^{\mathcal{M}}\|\tilde{q}_{n}-q^{*}(\mathcal{M})\|_{\infty}

(30)

Therefore, we should have $\mathfrak{M}_{n}(\gamma,t_{\mathrm{minorize}})\geq\mathfrak{M}_{n}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})$ if the hard alternative instance in (28) satisfies $\mathcal{M}^{\prime}_{\gamma}\in\mathcal{C}(\gamma,t_{\mathrm{minorize}})$ . Observe that, for large $n$ , if $\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}$ and $\mathcal{M}^{\prime}_{\gamma}$ are very different, it will be very easy to tell them apart and thence the estimation can be done with small expected error, because both instances are known to the controller ( $\inf_{\tilde{q}_{n}}$ is inside $\sup_{\mathcal{M}^{\prime}_{\gamma}}$ ). Thus, the instances that potentially achieve the outer supremum in (28) should be close to $\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}$ . Hence, a hard alternative instance for (28) should have a minorization time similar to $t_{\mathrm{minorize}}$ as well. Following this intuition, in the proof of Theorem 6, we construct such a hard local alternative using the techniques in Khamaru et al., 2021b , proof of Theorem K1. See appendix Section C.2.

6 Cumulative Reward of a $(m,p)$ -Doeblin Chain

In order to prove the matching sample complexity upper and lower bounds stated in previous sections, we need to control the moments of the discounted cumulative rewards

\sum_{k=0}^{\infty}\gamma^{k}r(S_{k},A_{k}).

In this section, we state some key properties of this cumulative reward and their implications when the kernel $P_{\pi}$ is uniformly ergodic, where $\pi\in\Pi$ is some stationary Markov deterministic policy. By Theorem UE, $P_{\pi}$ satisfies a $(m,p)$ -Doeblin condition and the minorization time $t_{\mathrm{minorize}}(P_{\pi})\leq m/p$ . The proofs for this section will be deferred to the appendix Section D.

First, we show that for a Doeblin MDP, the oscillation of the $q$ -function is upper bounded by the minorization time.

Proposition 6.1 (Oscillation of the $q$ -function).

Suppose $P_{\pi}$ is $(m,p)$ -Doeblin. Then there exists constant $\bar{q}^{\pi}$ s.t. $\|q^{\pi}-\bar{q}^{\pi}\|_{\infty}\leq m/p+1$ . In particular $\left|q^{\pi}\right|_{\mathrm{span}}\leq 3m/p$ .

Note that Proposition 6.1 can be directly applied to bound $\sigma(q^{*})$ as in Lemma 1. Next, we prove a bound on the variance of the cumulative reward achieved by a $(m,p,\psi,R)$ -Doeblin chain.

Proposition 6.2 (Variance of the Cumulative Reward).

Suppose $P_{\pi}$ is $(m,p)$ -Doeblin. Then $\|\Psi^{\pi}\|_{\infty}\leq 20(m/p)(1-\gamma)^{-1}$ , where $\Psi^{\pi}$ is defined in 5.

Proposition 6.2 and the equality

\Psi^{\pi}=(\sigma^{\pi})^{2}+\gamma^{2}P^{\pi}\Psi^{\pi}

(31)

due to Azar et al., (2013) allow us to obtain the following important upper bound:

Corollary 6.2.1.

Recall the definition of $\sigma^{\pi}$ in (7). Suppose $P_{\pi}$ is uniformly ergodic. Then

\displaystyle\|(I-\gamma P^{\pi})^{-1}\sigma^{\pi}\|_{\infty}\leq\frac{80\sqrt{t_{\mathrm{minorize}}(P_{\pi})}}{1-\gamma}.

Remark.

If we no longer assume $r$ is bounded by $1$ , then $\|(I-\gamma P^{\pi})^{-1}\sigma^{\pi}\|_{\infty}\leq 80\|r\|_{\infty}\sqrt{t_{\mathrm{minorize}}(P_{\pi})}/(1-\gamma)$ . Also, this bound can be easily applied to the model-based RL settings, and the same complexity upper bound $\tilde{O}(|S||A|t_{\mathrm{minorize}}(1-\gamma)^{-2}\epsilon^{-2})$ should follow.

Acknowledgements

The material in this paper is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-20-1-0397. Additional support is gratefully acknowledged from NSF 1915967 and 2118199.

References

Agarwal et al., (2020) Agarwal, A., Kakade, S., and Yang, L. F. (2020). Model-based reinforcement learning with a generative model is minimax optimal. In Abernethy, J. and Agarwal, S., editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 67–83. PMLR.
Aldous et al., (1997) Aldous, D., Lovász, L., and Winkler, P. (1997). Mixing times for uniformly ergodic markov chains. Stochastic Processes and their Applications, 71(2):165–185.
Asmussen et al., (2003) Asmussen, S., Asmussen, S., and Asmussen, S. (2003). Applied probability and queues, volume 2. Springer.
Athreya and Ney, (1978) Athreya, K. B. and Ney, P. (1978). A new approach to the limit theory of recurrent markov chains. Transactions of the American Mathematical Society, 245:493–501.
Azar et al., (2013) Azar, M. G., Munos, R., and Kappen, H. J. (2013). Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Mach. Learn., 91(3):325–349.
Bramson, (2008) Bramson, M. (2008). Stability of Queueing Networks: École d’Été de Probabilités de Saint-Flour XXXVI - 2006. Springer Berlin Heidelberg, Berlin, Heidelberg.
Cai and Low, (2015) Cai, T. T. and Low, M. G. (2015). A framework for estimation of convex functions. Statistica Sinica, 25(2):423–456.
Chatterjee et al., (2016) Chatterjee, S., Duchi, J. C., Lafferty, J., and Zhu, Y. (2016). Local minimax complexity of stochastic convex optimization. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.
Chen et al., (2020) Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 8223–8234. Curran Associates, Inc.
Crane and Lemoine, (1977) Crane, M. A. and Lemoine, A. J., editors (1977). An Introduction to the Regenerative Method for Simulation Analysis. Springer Berlin Heidelberg, Berlin, Heidelberg.
Deng et al., (2017) Deng, Y., Bao, F., Kong, Y., Ren, Z., and Dai, Q. (2017). Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems, 28(3):653–664.
Gilks et al., (1998) Gilks, W. R., Roberts, G. O., and Sahu, S. K. (1998). Adaptive markov chain monte carlo through regeneration. Journal of the American Statistical Association, 93(443):1045–1054.
Glynn and Ormoneit, (2002) Glynn, P. W. and Ormoneit, D. (2002). Hoeffding’s inequality for uniformly ergodic markov chains. Statistics & Probability Letters, 56(2):143–146.
Henderson and Glynn, (2001) Henderson, S. G. and Glynn, P. W. (2001). Regenerative steady-state simulation of discrete-event systems. ACM Trans. Model. Comput. Simul., 11(4):313–345.
Jin and Sidford, (2020) Jin, Y. and Sidford, A. (2020). Efficiently solving MDPs with stochastic mirror descent. In III, H. D. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4890–4900. PMLR.
Jin and Sidford, (2021) Jin, Y. and Sidford, A. (2021). Towards tight bounds on the sample complexity of average-reward mdps.
(17) Khamaru, K., Pananjady, A., Ruan, F., Wainwright, M. J., and Jordan, M. I. (2021a). Is temporal difference learning optimal? an instance-dependent analysis. SIAM Journal on Mathematics of Data Science, 3(4):1013–1040.
(18) Khamaru, K., Xia, E., Wainwright, M. J., and Jordan, M. I. (2021b). Instance-optimality in optimal value estimation: Adaptivity via variance-reduced q-learning.
Kober et al., (2013) Kober, J., Bagnell, J. A., and Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274.
Lemańczyk, (2021) Lemańczyk, M. (2021). General bernstein-like inequality for additive functionals of markov chains. Journal of Theoretical Probability, 34(3):1426–1454.
Li et al., (2021) Li, G., Cai, C., Chen, Y., Gu, Y., Wei, Y., and Chi, Y. (2021). Is q-learning minimax optimal? a tight sample complexity analysis. arXiv preprint arXiv:2102.06548.
Li et al., (2022) Li, G., Shi, L., Chen, Y., Chi, Y., and Wei, Y. (2022). Settling the sample complexity of model-based offline reinforcement learning.
Li et al., (2023) Li, X., Yang, W., Liang, J., Zhang, Z., and Jordan, M. I. (2023). A statistical analysis of polyak-ruppert averaged q-learning. In Ruiz, F., Dy, J., and van de Meent, J.-W., editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 2207–2261. PMLR.
Meyn et al., (2009) Meyn, S., Tweedie, R. L., and Glynn, P. W. (2009). Markov Chains and Stochastic Stability. Cambridge Mathematical Library. Cambridge University Press, 2 edition.
Meyn and Down, (1994) Meyn, S. P. and Down, D. (1994). Stability of Generalized Jackson Networks. The Annals of Applied Probability, 4(1):124 – 148.
Nummelin, (1978) Nummelin, E. (1978). A splitting technique for harris recurrent markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 43:309–318.
Puterman, (1994) Puterman, M. (1994). Average Reward and Related Criteria, chapter 8, pages 331–440. John Wiley & Sons, Ltd.
Sadeghi and Levine, (2016) Sadeghi, F. and Levine, S. (2016). Cad2rl: Real single-image flight without a single real image. arXiv preprint arXiv:1611.04201.
Sidford et al., (2018) Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. (2018). Near-optimal time and sample complexities for solving markov decision processes with a generative model. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.
Sutton and Barto, (2018) Sutton, R. and Barto, A. (2018). Reinforcement Learning, second edition: An Introduction. Adaptive Computation and Machine Learning series. MIT Press.
(31) Wainwright, M. J. (2019a). Stochastic approximation with cone-contractive operators: Sharp $\ell_{\infty}$ -bounds for $q$ -learning.
(32) Wainwright, M. J. (2019b). Variance-reduced $q$ -learning is minimax optimal.
Wang et al., (2022) Wang, J., Wang, M., and Yang, L. F. (2022). Near sample-optimal reduction-based policy learning for average reward mdp.
Wang, (2017) Wang, M. (2017). Primal-dual $\pi$ learning: Sample complexity and sublinear run time for ergodic markov decision problems. ArXiv, abs/1710.06100.
Wang et al., (2023) Wang, S., Blanchet, J., and Glynn, P. (2023). Optimal sample complexity of reinforcement learning for uniformly ergodic discounted markov decision processes.
Watkins and Dayan, (1992) Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279–292.

\appendixpage

Appendix A Equivalence of $t_{\mathrm{minorize}}$ and ${t_{\mathrm{mix}}}$

In this section, we present results and proofs that justify the claim of Theorem 1.

Recall Definition 3. For convenience, we also define the separation between consecutive possible regeneration under measure $\phi$ and probability $q$ as

{t_{\mathrm{separation}}}(P,q,\phi)=\inf\left\{{m:\inf_{s\in S}P^{m}(s,\cdot)\geq q\phi(\cdot)}\right\}.

We will drop the $P$ dependence.

Now, we proceed to show that the two mixing times $t_{\mathrm{minorize}}$ and ${t_{\mathrm{mix}}}$ are equivalent up to constants.

Proposition A.1.

If $P$ is uniformly ergodic with $P^{m}=p\psi+(1-p)R$ for some $m\in\mathbb{Z}_{>0}$ and $p\in(0,1]$ , then

t_{\mathrm{minorize}}\leq 22{t_{\mathrm{mix}}}\leq\frac{22\log(16)}{\log(1/(1-p))}{t_{\mathrm{separation}}}(p,\psi).

Proof.

By Theorem UE identity (9),

\sup_{s\in S}\|P^{n}(s,\cdot)-\eta(\cdot)\|_{\text{TV}}\leq 2(1-p)^{\left\lfloor{n/{t_{\mathrm{separation}}}(p,\psi)}\right\rfloor}

(32)

So, choose

n\geq{t_{\mathrm{separation}}}(p,\psi)\frac{\log(16)}{\log(1/(1-p))}

suffices. So, ${t_{\mathrm{mix}}}\leq\frac{\log(16)}{\log(1/(1-p))}{t_{\mathrm{separation}}}(p,\psi)$ .

For the other inequality, we follow the proof idea in Aldous et al., (1997). The following lemma holds:

Lemma 3 (Lemma 4 in Aldous et al., (1997)).

Let kernel $Q$ have invariant measure $\eta$ . If

\sup_{x\in S}\left\|Q(x,\cdot)-\eta\right\|_{\mathrm{TV}}\leq\frac{1}{4},

then there exists $A\subset S$ s.t. $\eta(A)\geq 1/2$ and for any probability measure $\mu$ and $y\in A$

\mu Q(y)\geq\frac{1}{4}(1-\left\|\mu-\eta\right\|_{\mathrm{TV}})\eta(y).

Let $Q=P^{t_{\mathrm{mix}}}$ , $\mu=P^{t_{\mathrm{mix}}}(x,\cdot)$ , then by the above lemma, $\forall y\in A$

P^{2{t_{\mathrm{mix}}}}(x,y)\geq\frac{3}{16}\eta(y).

Therefore, if we define $\psi(\cdot):=\eta(\cdot\cap A)/\eta(A)=\eta(\cdot|A)$ , because $\eta(A)\geq 1/2$ ,

P^{2{t_{\mathrm{mix}}}}(x,\cdot)\geq\frac{3}{32}\psi(\cdot).

This implies that $t_{\mathrm{minorize}}\leq\frac{64}{3}{t_{\mathrm{mix}}}\leq 22{t_{\mathrm{mix}}}$ . ∎

Next, we discuss the quantitative relationship between the following assumption:

Proposition A.2.

The following properties hold:

•

If $P$ has mixing time ${t_{\mathrm{mix}}}<\infty$ , then for any $\epsilon>0$ , there exists $(m,p)$ s.t. $m/p\leq 22{t_{\mathrm{mix}}}+\epsilon$ and $P$ is $(m,p)$ -Doeblin.
•

If $P$ is $(m,p)$ -Doeblin, then ${t_{\mathrm{mix}}}\leq\log(16)m/p$ .

Proof.

First, note that by Theorem UE, both assumptions will imply that $P$ is uniformly ergodic. So, first entry simply follows from Proposition A.1. For the second entry, note that $\log(1-p)\leq-p$ for $p\in[0,1)$ . So, $1/\log(1/(1-p))\leq 1/p$ . We conclude that ${t_{\mathrm{mix}}}\leq\frac{\log(16)}{\log(1/(1-p))}{t_{\mathrm{separation}}}(p,\psi)\leq\log(16)m/p$ . ∎

A.1 Proof of Theorem 1

Proof.

From Proposition A.1 and A.2, there exists $(m,p,\psi,R)$ s.t. $t_{\mathrm{minorize}}\leq 22{t_{\mathrm{mix}}}\leq 22\log(16)m/p$ , where $m/p$ can be arbitrary close to $t_{\mathrm{minorize}}$ given the definition of $t_{\mathrm{minorize}}$ . ∎

So, knowing the mixing time and the “best” minorization are equivalent for a uniformly ergodic chain. Given the equivalence of ${t_{\mathrm{mix}}}$ and $t_{\mathrm{minorize}}$ , the complexity results in Table 1 can be directly replaced by ${t_{\mathrm{mix}}}$ .

Appendix B Analysis of Algorithms

In this section, we carry out the proofs that justify our sample complexity upper bounds of the algorithms motivated in Section 4.

B.1 Proof of Lemma 1

Proof.

Recall that $t_{\mathrm{minorize}}^{*}=t_{\mathrm{minorize}}(P_{\pi^{*}})$ . Then by definition, for any $\epsilon>0$ , there exists $(m,p)$ s.t. $P_{\pi^{*}}$ is $(m,p)$ -Doeblin and $t_{\mathrm{minorize}}^{*}+\epsilon\geq m/p$ . By Proposition 6.1, $\left|q^{*}\right|_{\mathrm{span}}\leq 3m/p\leq 3(t_{\mathrm{minorize}}^{*}+\epsilon)$ and

	$\displaystyle\sigma(q^{*})(s,a)^{2}$	$\displaystyle=\mathrm{Var}(r(s,a)+\gamma\max_{b\in A}q^{*}(S^{\prime},b))$
		$\displaystyle=\gamma^{2}\mathrm{Var}(\max_{b\in A}[q^{}(S^{\prime},b)-\bar{q}^{}])$
		$\displaystyle\leq\gamma^{2}E[(\max_{b\in A}(q^{}(S^{\prime},b)-\bar{q}^{}))^{2}]$
		$\displaystyle\leq 4\gamma^{2}(m/p)^{2}.$
		$\displaystyle\leq 4\gamma^{2}(t_{\mathrm{minorize}}^{*}+\epsilon)^{2}.$

Since $\epsilon>0$ is arbitrary, this implies the statement of the lemma. ∎

B.2 Proofs for Section 4.2: The General Setting

Let $\mathcal{G}_{l}$ denote the $\sigma$ -field generated by the samples used until the end of $l$ ’th epoch of the variance-reduced Q-learning. We define $E_{l}[\cdot]=E[\cdot|\mathcal{G}_{l}]$ .

Before Proving proposition 4.2, we introduce the following lemma that underlies its proof and the parameter choice.

Lemma 4.

For $\omega$ s.t. $\hat{q}_{l-1}$ satisfies $\|\hat{q}_{l-1}-q^{*}\|_{\infty}\leq b$ , then choosing

k^{*}\geq c_{1}\frac{1}{(1-\gamma)^{3}}\log\left(\frac{8|S||A|}{(1-\gamma)\delta}\right),

and

n_{l}\geq c_{2}\frac{1}{(1-\gamma)^{2}}\left(\frac{\|\sigma(q^{*})\|_{\infty}+(1-\gamma)\|q^{*}\|_{\infty}}{b}+1\right)^{2}\log\left(\frac{8|S||A|}{\delta}\right),

for some known absolute constant $c_{1},c_{2}$ , we have that under the probability measure $P_{l-1}(\cdot)(\omega)$ , $\|\hat{q}_{l}-q^{*}\|_{\infty}\leq b/2$ w.p. at least $1-\delta$ .

Proof.

From Section 4.1.6 of Wainwright, 2019b , we have that for $\omega$ s.t. $\hat{q}_{l-1}$ satisfies $\|\hat{q}_{l-1}-q^{*}\|_{\infty}\leq b$ ,

\|q_{l-1,k+1}-q^{*}\|_{\infty}\leq c\left(\frac{b}{(1-\gamma)k}+\frac{b\sqrt{\log(8|S||A|k/\delta)}}{(1-\gamma)^{3/2}\sqrt{k}}+\frac{b+\|\sigma(q^{*})\|+(1-\gamma)\|q^{*}\|_{\infty}}{(1-\gamma)\sqrt{n_{l}}}\sqrt{\log(8|S||A|/\delta)}\right).

w.p. at least $1-\delta$ under $P_{l-1}(\cdot)$ , where $c>0$ is some absolute constant. It is easy to see that indeed under the parameter choice with $c_{1}=c_{2}=16c$ , the r.h.s. will be less than $b/2$ . ∎

B.2.1 Proof of Proposition 4.2

Proof.

We recall the conditions for $k^{*},\left\{{n_{l}}\right\},l^{*}$ in (13) and (14). Let $l^{\prime}=l^{*}+1$ . By the chain rule for conditional probability, for $l^{*}\geq 1$

	$\displaystyle P(\\|\hat{q}_{l^{}}-q^{}\\|_{\infty}\leq 2^{-l^{*}}b)$	$\displaystyle\geq P\left(\bigcap_{l=0}^{l^{}}\left\{{\\|\hat{q}_{l}-q^{}\\|_{\infty}\leq 2^{-l}b}\right\}\right)$
		$\displaystyle=\prod_{l=0}^{l^{\prime}-1}P\left(\\|\hat{q}_{l}-q^{}\\|_{\infty}\leq 2^{-l}b\Bigg{\|}\bigcap_{n=0}^{l-1}\left\{{\\|\hat{q}_{n}-q^{}\\|_{\infty}\leq 2^{-n}b}\right\}\right)$
		$\displaystyle\geq\left(1-\frac{\delta}{l^{\prime}}\right)\prod_{l=1}^{l^{\prime}-1}P\left(\\|\hat{q}_{l}-q^{}\\|_{\infty}\leq 2^{-l}b\Bigg{\|}\bigcap_{n=0}^{l-1}\left\{{\\|\hat{q}_{n}-q^{}\\|_{\infty}\leq 2^{-n}b}\right\}\right)$

where the last equality follows from the assumption that $\|\hat{q}_{0}-q^{*}\|_{\infty}\leq b$ w.p. at least $1-\delta/l^{\prime}$ .

Let

A_{l-1}=\bigcap_{n=1}^{l-1}\left\{{\|\hat{q}_{n}-q^{*}\|_{\infty}\leq 2^{-n}b}\right\}.

We analyze the probability

	$\displaystyle P\left(\\|\hat{q}_{l}-q^{*}\\|_{\infty}\leq 2^{-l}b\middle\|A_{l-1}\right)$	$\displaystyle=\frac{1}{P(A_{l-1})}E\left[\mathds{1}\left\{{\\|\hat{q}_{l}-q^{*}\\|_{\infty}\leq 2^{-l}b}\right\}\mathds{1}_{A_{l-1}}\right]$
		$\displaystyle=\frac{1}{P(A_{l-1})}E\left[\mathds{1}\left\{{\\|\hat{q}_{l-1}-q^{}\\|_{\infty}\leq 2^{-l+1}b}\right\}E\left[\mathds{1}\left\{{\\|\hat{q}_{l}-q_{}\\|_{\infty}\leq 2^{-l}b}\right\}\middle\|\mathcal{G}_{l-1}\right]\mathds{1}_{A_{l-1}}\right]$
		$\displaystyle\geq 1-\frac{\delta}{l^{\prime}}.$

This is because, for every $1\leq l\leq l^{\prime}-1$ , the choice of $k^{*}$ and $\left\{{n_{l}}\right\}$ in (13) satisfies Lemma 4 with $b$ replaced by $2^{-l+1}b$ and $\delta$ replaced by $\delta/l^{\prime}$ . Hence

\mathds{1}\left\{{\|\hat{q}_{l-1}-q^{*}\|_{\infty}\leq 2^{-l+1}b}\right\}E\left[\mathds{1}\left\{{\|\hat{q}_{l}-q_{*}\|_{\infty}\leq 2^{-l}b}\right\}\middle|\mathcal{G}_{l-1}\right]\geq 1-\frac{\delta}{l^{\prime}}.

w.p.1. Therefore,

	$\displaystyle P\left(\\|\hat{q}_{l^{}}-q^{}\\|_{\infty}\leq 2^{-l^{*}}b\right)$	$\displaystyle\geq P\left(\\|\hat{q}_{0}-q^{*}\\|_{\infty}\leq b\right)\left(1-\frac{\delta}{l^{\prime}}\right)^{l^{\prime}-1}$		(33)
		$\displaystyle\geq\left(1-\frac{\delta}{l^{\prime}}\right)^{l^{\prime}}.$		(33)

To finish the proof, we consider the function

e(\eta):=\left(1-\frac{\eta}{l}\right)^{l}.

Evidently, for $l\geq 1$ and $\eta\in[0,1)$ , $e(\eta)$ is $C^{2}$ with derivatives

e^{\prime}(\eta)=-\left(1-\frac{\eta}{l}\right)^{l-1},\qquad e^{\prime\prime}(\eta)=\frac{l-1}{l}\left(1-\frac{\eta}{l}\right)^{l-2}\geq 0.

So, $e^{\prime}(\eta)$ is non-decreasing. Hence

\displaystyle e(\delta)

\displaystyle=e(0)+\int_{0}^{\delta}e^{\prime}(\eta)dt\geq 1+e^{\prime}(0)\delta\geq 1-\delta,

(34)

Note that by the choice of $l^{*}$ in (14), the assumption that $b/\epsilon\geq 1$ implies that $l^{\prime}=l^{*}+1\geq 1$ . Finally, combining this with estimate (34) and the high probability error bound (33) yields that $P\left(\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon\right)\geq 1-\delta$ . ∎

Remark.

The proof implies that the error bound holds in a stronger pathwise sense; i.e. we can conclude from the same assumptions as in Proposition 4.2 that $P\left(\left\{{\forall l,\|\hat{q}_{l}-q^{*}\|_{\infty}\leq 2^{-l}b}\right\}\right)\geq 1-\delta$ . Moreover, the geometric progression $2^{-l}b$ can be replaced by other pathwise tolerance levels, which will lead to different error bounds and complexity guarantees.

B.2.2 Proof of Proposition 4.3

Proof.

Recall from Lemma 1 that under Assumption 1 $,\|\sigma(q^{*})\|_{\infty}\leq 2t_{\mathrm{minorize}}^{*}$ . By the choice of $k_{0}$ in (17) and the error high probability bound for Algorithm 1 in Proposition 4.1, we can conclude that w.p. at least

1-\frac{\delta}{\left\lceil{\log_{2}((1-\gamma)^{-1}\epsilon^{-1})}\right\rceil+1}\geq 1-\frac{\delta}{l^{*}+1},

the error $\|\hat{q}_{0}-q^{*}\|_{\infty}\leq\|\sigma(q^{*})\|_{\infty}/4+1/2\leq t_{\mathrm{minorize}}^{*}$ .

Next, we would like to apply Proposition 4.2. The previous analysis implies that we can let $b=t_{\mathrm{minorize}}^{*}$ . Note that the assumption $\epsilon\leq t_{\mathrm{minorize}}^{*}$ implies $b/\epsilon\geq 1$ . Also, under this $b$ , the requirement in Proposition 4.2 for $n_{l}$ satisfies

\displaystyle c_{2}\frac{1}{(1-\gamma)^{2}}\left(\frac{\|\sigma(q^{*})\|_{\infty}+(1-\gamma)\|q^{*}\|_{\infty}}{2^{-l+1}b}+1\right)^{2}\log\left(\frac{8(l^{*}+1)|S||A|}{\delta}\right)\leq 3c_{2}\frac{4^{l}}{(1-\gamma)^{2}}\log\left(\frac{8(l^{*}+1)|S||A|}{\delta}\right)

Therefore, we have that for all $l^{*}\in[\left\lceil{\log_{2}(t_{\mathrm{minorize}}^{*}/\epsilon)}\right\rceil,\left\lceil{\log_{2}((1-\gamma)^{-1}\epsilon^{-1})}\right\rceil]\cap\mathbb{Z}$ , the parameters $k^{*}$ and $\left\{{n_{l}}\right\}$ in (16) will satisfy Proposition 4.2. Hence, we conclude that $\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon$ w.p. at least $1-\delta$ . ∎

B.2.3 Proof of Theorem 2

Proof.

Let $n^{*}$ be the total number of 1-sample Bellman operators used by the over all procedure in Proposition 4.3 with $l^{*}=\left\lceil{\log_{2}(t_{\mathrm{minorize}}^{*}/\epsilon))}\right\rceil$ . The total number of samples used by the algorithm is

	$\displaystyle\|S\|\|A\|n^{}=\|S\|\|A\|\left(k_{0}+\sum_{l=1}^{l^{}}n_{l}+l^{}k^{}\right)$
	$\displaystyle\leq\|S\|\|A\|\left(c_{0}\frac{1}{(1-\gamma)^{3}}+3c_{2}\frac{4^{l^{}+1}/3}{(1-\gamma)^{2}}+c_{1}\frac{\left\lceil{\log_{2}(t_{\mathrm{minorize}}^{}/\epsilon)}\right\rceil}{(1-\gamma)^{3}}\right)\log\left(\frac{8d}{(1-\gamma)\delta}\right)$
	$\displaystyle\leq c^{\prime}\|S\|\|A\|\left(\frac{(t_{\mathrm{minorize}}^{})^{2}}{\epsilon^{2}(1-\gamma)^{2}}+\frac{\left\lceil{\log_{2}(t_{\mathrm{minorize}}^{}/\epsilon)}\right\rceil}{(1-\gamma)^{3}}\right)\log\left(\frac{8d}{(1-\gamma)\delta}\right)$
	$\displaystyle=\tilde{O}\left(\frac{\|S\|\|A\|}{(1-\gamma)^{2}}\left(\frac{(t_{\mathrm{minorize}}^{*})^{2}}{\epsilon^{2}}+\frac{1}{1-\gamma}\right)\right)$
	$\displaystyle=\tilde{O}\left(\frac{\|S\|\|A\|(t_{\mathrm{minorize}}^{})^{2}}{(1-\gamma)^{2}\epsilon^{2}}\max\left\{{1,\frac{\epsilon^{2}}{(t_{\mathrm{minorize}}^{})^{2}(1-\gamma)}}\right\}\right).$

∎

B.3 Proofs for Section 4.3: The Lipschitz Setting

B.3.1 Proof of Lemma 2

Proof.

For any $\pi\in\Pi^{*}$ , consider $\sigma(q^{*})(s,a)^{2}=\mathrm{Var}_{s,a}(\widehat{\mathcal{T}}(q^{*}))=\sigma^{\pi}(s,a)^{2}$ . Recall the definition of $\nu^{\pi}(\mathcal{M})^{2}$ in equation (18). Let $D_{\sigma}$ be the matrix with $\sigma$ on the diagonal, then

	$\displaystyle\\|\nu^{\pi}(\mathcal{M})^{2}\\|_{\infty}$	$\displaystyle=\\|\mathrm{diag}((I-\gamma P^{\pi})^{-1}D_{\sigma(q^{*})^{2}}(I-\gamma P^{\pi})^{-T})\\|_{\infty}$
		$\displaystyle=\max_{s,a}\sum_{s^{\prime},a^{\prime}}\left((I-\gamma P^{\pi})^{-1}(s,a,s^{\prime},a^{\prime})\sigma(q^{*})(s^{\prime},a^{\prime})\right)^{2}$
		$\displaystyle\leq\max_{s,a}\left(\sum_{s^{\prime},a^{\prime}}(I-\gamma P^{\pi})^{-1}(s,a,s^{\prime},a^{\prime})\sigma(q^{*})(s^{\prime},a^{\prime})\right)^{2}$
		$\displaystyle=\\|(I-\gamma P^{\pi})^{-1}\sigma^{\pi}\\|_{\infty}^{2}$
		$\displaystyle\leq\frac{6400t_{\mathrm{minorize}}(P_{\pi})}{(1-\gamma)^{2}}$

where we also used the Cauchy-Schwarz inequality and Corollary 6.2.1. Taking maximum and recalling the definition of $t_{\mathrm{minorize}}^{*}$ completes the proof. ∎

B.3.2 Proof of Theorem 3

Proof.

From Lemma 2, we have that if

n^{*}=\left(c+25600(c^{\prime})^{2}\right)\max\left\{{1,\frac{(1-\gamma)t_{\mathrm{minorize}}^{*}}{\epsilon^{2}}}\right\}\frac{1}{(1-\gamma)^{3}}\log_{4}\left(\frac{8|S||A|l^{*}}{\delta}\right),

then the r.h.s. of (22) is less than $\epsilon$ . Here $c,c^{\prime}$ are constants in Theorem K2. So, for $\omega:\|\hat{q}_{0}(\omega)-q^{*}\|_{\infty}\leq 1/\sqrt{1-\gamma}$ , we have that by Theorem K2, $P(\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon|\hat{q}_{0})\geq 1-\delta$ , provided that $n^{*}$ satisfies the requirement of Theorem K2. This happens when $\epsilon$ satisfies the requirement in Theorem 3.

Indeed, under Assumption 2, if (19) hold, then

n^{*}\geq c\max\left\{{1,\frac{1}{\Delta^{2}(1-\gamma)^{\beta}}}\right\}\frac{1}{(1-\gamma)^{3}}\log_{4}\left(\frac{8|S||A|l^{*}}{\delta}\right)

satisfying Theorem 3. Moreover, under Assumption 3 and (20) hold,

n^{*}\geq c\max\left\{{1,\frac{1}{(1-\gamma)^{\beta}}\min\left\{{\frac{1}{\Delta^{2}},\frac{L^{2}}{(1-\gamma)^{2}}}\right\}}\right\}\frac{1}{(1-\gamma)^{3}}\log_{4}\left(\frac{8|S||A|l^{*}}{\delta}\right)

satisfying Theorem 3 as well.

Now, consider the error probability of the combined procedure. We have

	$\displaystyle P(\\|\hat{q}_{l^{}}-q^{}\\|_{\infty}\leq\epsilon)$	$\displaystyle\geq E\mathds{1}\left\{{\\|\hat{q}_{0}-q^{}\\|_{\infty}\leq 1/\sqrt{1-\gamma},\\|\hat{q}_{l^{}}-q^{*}\\|_{\infty}\leq\epsilon}\right\}$
		$\displaystyle\geq E\mathds{1}\left\{{\\|\hat{q}_{0}-q^{}\\|_{\infty}\leq 1/\sqrt{1-\gamma}}\right\}P\left(\\|\hat{q}_{l^{}}-q^{*}\\|_{\infty}\leq\epsilon\|\hat{q}_{0}\right)$
		$\displaystyle\geq(1-\delta^{2})$
		$\displaystyle\geq 1-2\delta.$

By replacing $\delta$ with $\delta/2$ in Theorem K2, we conclude that it suffices to choose

n^{*}=\tilde{O}\left(\frac{t_{\mathrm{minorize}}^{*}}{\epsilon^{2}(1-\gamma)^{2}}\max\left\{{\frac{\epsilon^{2}}{(1-\gamma)t_{\mathrm{minorize}}^{*}},1}\right\}\right).

The total number of samples used by the combined algorithm, therefore, is

	$\displaystyle\|S\|\|A\|(k_{0}+n^{*})$	$\displaystyle=\|S\|\|A\|\left(\tilde{O}\left(\frac{(t_{\mathrm{minorize}}^{})^{2}}{(1-\gamma)^{2}}\right)+\tilde{O}\left(\frac{t_{\mathrm{minorize}}^{}}{\epsilon^{2}(1-\gamma)^{2}}\max\left\{{\frac{\epsilon^{2}}{(1-\gamma)t_{\mathrm{minorize}}^{*}},1}\right\}\right)\right)$
		$\displaystyle=\tilde{O}\left(\frac{t_{\mathrm{minorize}}^{}}{\epsilon^{2}(1-\gamma)^{2}}\max\left\{{1,\epsilon^{2}t_{\mathrm{minorize}}^{},\frac{\epsilon^{2}}{(1-\gamma)t_{\mathrm{minorize}}^{*}}}\right\}\right).$

This implies the claim of Theorem 3. ∎

B.4 Proofs for Section 4.4: Uniform Mixing

Recall the definition of $\bar{r}_{l}$ and $\bar{q}_{l}$ in equation (24) and (25) respectively. We denote an optimal policy for $\bar{q}_{l}$ by $\bar{\pi}_{l}$ . Before we prove Proposition 4.4, we introduce the following lemmas.

Lemma 5 (Lemma 5 in Wainwright, 2019b ).

For $\omega:\|\hat{q}_{l-1}-q^{*}\|\leq b$ , w.p. at least $1-\delta$ under $P_{l-1}(\cdot)(\omega)$

|\bar{r}_{l}-r|\leq c_{a}(b+\sigma(q^{*}))\sqrt{\frac{\log(4|S||A|/\delta)}{n_{l}}}+c_{b}\|q^{*}\|_{\infty}\frac{\log(4|S||A|/\delta)}{n_{l}}

for some absolute constants $c_{a},c_{b}$ .

Lemma 6 (Lemma 3 in Wainwright, 2019b ).

Let $k^{*}$ satisfies (26) for some sufficently large absolute constant $c$ . Then under $P_{l-1}(\cdot)$ ,

\|\hat{q}_{l}-\bar{q}_{l}\|_{\infty}\leq\frac{1}{4}\|\hat{q}_{l-1}-\bar{q}_{l}\|

w.p. at least $1-\delta/(2(l^{*}+1))$ .

Lemma 7.

On $\left\{{\omega:\|\hat{q}_{l-1}-q^{*}\|\leq\sqrt{t_{\mathrm{minorize}}}}\right\}$ , for sufficiently large absolute constants $c,c^{\prime}$ ,

\|q^{*}-\bar{q}_{l}\|_{\infty}\leq c\left(\left(\frac{\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}+\frac{\gamma^{2}\|\bar{q}_{l}-q^{*}\|_{\infty}}{1-\gamma}\right)\sqrt{\frac{\log(8|S||A|(l^{*}+1)/\delta)}{n_{l}}}+\frac{\log(8|S||A|(l^{*}+1)/\delta)}{(1-\gamma)^{2}n_{l}}\right).

w.p. at least $1-\delta/(2(l^{*}+1))$ under probability measure $P_{l-1}$ , provided that $n_{l}\geq(c^{\prime})^{2}(1-\gamma)^{-2}\log(8(l^{*}+1)|S||A|/\delta)$ .

B.4.1 Proof of Lemma 7

Proof.

By Lemma 5, we have that on $\left\{{\omega:\|\hat{q}_{l-1}-q^{*}\|\leq\sqrt{t_{\mathrm{minorize}}}}\right\}$ under $P_{l-1}$ w.p. at least $1-\delta/(2l^{*})$

|\bar{r}_{l}-r|\leq c_{a}(\sqrt{t_{\mathrm{minorize}}}+\sigma(q^{*}))\sqrt{\frac{\log(8l^{*}|S||A|/\delta)}{n_{l}}}+c_{b}\|q^{*}\|_{\infty}\frac{\log(8l^{*}|S||A|/\delta)}{n_{l}}.

(35)

Also, By Lemma 6 in Wainwright, 2019b ,

|q^{*}-\bar{q}_{l}|\leq\max\left\{{(I-\gamma P^{\pi^{*}})^{-1}|\bar{r}_{l}-r|,(I-\gamma P^{\bar{\pi}_{l}})^{-1}|\bar{r}_{l}-r|}\right\}.

(36)

Here, $\left|\cdot\right|$ denotes the entrywise absolue value, and the inequalities holds entrywise. Let $B_{l}$ be the set s.t. (35) holds. Then on $B_{l}$ , the vector $(I-\gamma P^{\pi^{*}})|\bar{r}_{l}-r|$ satisfies

	$\displaystyle\quad(I-\gamma P^{\pi^{*}})^{-1}\|\bar{r}_{l}-r\|$
	$\displaystyle\leq c_{a}\left(\frac{\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}+\\|(I-\gamma P^{\pi^{}})^{-1}\sigma(q^{})\\|_{\infty}\right)\sqrt{\frac{\log(8l^{}\|S\|\|A\|/\delta)}{n_{l}}}+c_{b}\frac{\\|q^{}\\|_{\infty}}{1-\gamma}\frac{\log(8l^{*}\|S\|\|A\|/\delta)}{n_{l}}$
	$\displaystyle\leq c_{a}\frac{81\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}\sqrt{\frac{\log(8l^{}\|S\|\|A\|/\delta)}{n_{l}}}+c_{b}\frac{\log(8l^{}\|S\|\|A\|/\delta)}{(1-\gamma)^{2}n_{l}}$

Moreover, let $\bar{v}_{l}(\cdot)=q_{l}(\cdot,\bar{\pi}_{l}(\cdot))$

	$\displaystyle\sigma(q^{*})(s,a)^{2}$	$\displaystyle=\gamma^{2}P_{s,a}[(v(q^{})-P_{s,a}[v(q^{})])^{2}]$
		$\displaystyle=\gamma^{2}P_{s,a}[(v^{}-P_{s,a}[v^{}])^{2}]$
		$\displaystyle=\gamma^{2}P_{s,a}[(v^{}-\bar{v}_{l}+\bar{v}_{l}-P_{s,a}[\bar{v}_{l}]+P_{s,a}[\bar{v}_{l}]-P_{s,a}[v^{}])^{2}]$
		$\displaystyle=\gamma^{2}\mathrm{Var}_{s,a}(v^{*}(S^{\prime})-\bar{v}_{l}(S^{\prime}))+\gamma^{2}\mathrm{Var}_{s,a}(\bar{v}_{l}(S^{\prime}))$
		$\displaystyle\leq\gamma^{2}\mathrm{Var}_{s,a}(\bar{v}_{l}(S^{\prime}))+\gamma^{2}\\|\bar{q}_{l}-q^{*}\\|_{\infty}$

Let $\sigma_{l}(s,a)^{2}=\gamma^{2}\mathrm{Var}_{s,a}(\bar{v}_{l}(S^{\prime}))$ . By construction and the optimality of $\bar{\pi}_{l}$ , $\sigma_{l}^{2}$ is the variance of the 1-sample Bellman operator associated with the reward $\bar{r}_{l}$ (c.f. definition (6) and (7)). Therefore, we have that by Corollary 6.2.1,

	$\displaystyle\\|(I-\gamma P^{\bar{\pi}_{l}})^{-1}\sigma(q^{*})\\|_{\infty}$	$\displaystyle\leq\\|(I-\gamma P^{\bar{\pi}_{l}})^{-1}\sigma_{l}\\|_{\infty}+\frac{\gamma^{2}\\|\bar{q}_{l}-q^{*}\\|_{\infty}}{1-\gamma}$		(37)
		$\displaystyle\leq\\|\bar{r}_{l}\\|_{\infty}\frac{80\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}+\frac{\gamma^{2}\\|\bar{q}_{l}-q^{*}\\|_{\infty}}{1-\gamma}$		(37)

where we note that $t_{\mathrm{minorize}}(P_{\bar{\pi}_{l}})\leq t_{\mathrm{minorize}}$ by the definition of $t_{\mathrm{minorize}}$ in Assumption 4. For $\omega\in B_{l}$ , by the assumption on $n_{l}$ and $c^{\prime}\geq 10(c_{a}\vee c_{b})$

	$\displaystyle\\|\bar{r}_{l}\\|_{\infty}$	$\displaystyle\leq 1+\\|\bar{r}_{l}-r\\|_{\infty}$
		$\displaystyle\leq 1+c_{a}(\sqrt{t_{\mathrm{minorize}}}+\left\\|\sigma(q^{})\right\\|_{\infty})\sqrt{\frac{\log(8l^{}\|S\|\|A\|/\delta)}{n_{l}}}+c_{b}\\|q^{}\\|_{\infty}\frac{\log(8l^{}\|S\|\|A\|/\delta)}{n_{l}}$
		$\displaystyle\leq 1+5c_{a}t_{\mathrm{minorize}}\sqrt{\frac{\log(8l^{}\|S\|\|A\|/\delta)}{n_{l}}}+c_{b}\frac{\log(8l^{}\|S\|\|A\|/\delta)}{(1-\gamma)n_{l}}$
		$\displaystyle\leq 3.$

So, on $B_{l}$ , the error

	$\displaystyle\quad(I-\gamma P^{\bar{\pi}_{l}})^{-1}\|\bar{r}_{l}-r\|$
	$\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}c_{a}\left(\frac{\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}+\\|(I-\gamma P^{\bar{\pi}_{l}})^{-1}\sigma(q^{})\\|_{\infty}\right)\sqrt{\frac{\log(8l^{}\|S\|\|A\|/\delta)}{n_{l}}}+c_{b}\frac{\\|q^{}\\|_{\infty}}{1-\gamma}\frac{\log(8l^{}\|S\|\|A\|/\delta)}{n_{l}}$
	$\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}c_{a}\left(\frac{241\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}+\frac{\gamma^{2}\\|\bar{q}_{l}-q^{}\\|_{\infty}}{1-\gamma}\right)\sqrt{\frac{\log(8l^{}\|S\|\|A\|/\delta)}{n_{l}}}+c_{b}\frac{\log(8l^{*}\|S\|\|A\|/\delta)}{(1-\gamma)^{2}n_{l}}.$

where $(i)$ follows from (35), $(ii)$ replaces $\|(I-\gamma P^{\bar{\pi}_{l}})^{-1}\sigma(q^{*})\|_{\infty}$ with the upper bound (37). Therefore, combining these with (36), we have that w.p. at least $1-\delta/(2l^{*})$

|q^{*}-\bar{q}_{l}|\leq c\left(\left(\frac{\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}+\frac{\gamma^{2}\|\bar{q}_{l}-q^{*}\|_{\infty}}{1-\gamma}\right)\sqrt{\frac{\log(8l^{*}|S||A|/\delta)}{n_{l}}}+\frac{\log(8l^{*}|S||A|/\delta)}{(1-\gamma)^{2}n_{l}}\right)

for some large absolute constant $c$ . This implies the claimed result. ∎

Equipped with Lemma 6 and 7, we are ready to prove Proposition 4.4.

B.4.2 Proof of Proposition 4.4

Proof.

By the choice of $n_{l}$ with sufficiently large $c^{\prime}$

c\frac{\gamma^{2}\|\bar{q}_{l}-q^{*}\|_{\infty}}{1-\gamma}\sqrt{\frac{\log(8|S||A|(l^{*}+1)/\delta)}{n_{l}}}\leq\frac{1}{2}\|\bar{q}_{l}-q^{*}\|_{\infty}.

Moreover, $n_{l}$ satisfies the assumption of Lemma 7 for every $l$ . Therefore, Lemma 7 implies that on $\left\{{\omega:\|\hat{q}_{l-1}-q^{*}\|\leq 2^{-l+1}\sqrt{t_{\mathrm{minorize}}}}\right\}\subset\left\{{\omega:\|\hat{q}_{l-1}-q^{*}\|\leq\sqrt{t_{\mathrm{minorize}}}}\right\}$ w.p. at least $1-\delta/(2(l^{*}+1))$ under $P_{l-1}(\cdot)$

\|q^{*}-\bar{q}_{l}\|_{\infty}\leq 2c\left(\frac{\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}\sqrt{\frac{\log(8|S||A|(l^{*}+1)/\delta)}{n_{l}}}+\frac{\log(8|S||A|(l^{*}+1)/\delta)}{(1-\gamma)^{2}n_{l}}\right).

Observe that the choice of $k^{*}$ in (27) satisfies (26). Hence, by Lemma 6, 7, and the union bound, on $\left\{{\omega:\|\hat{q}_{l-1}-q^{*}\|\leq 2^{-l+1}\sqrt{t_{\mathrm{minorize}}}}\right\}$ w.p. at least $1-\delta/(l^{*}+1)$ under $P_{l-1}(\cdot)$ ,

	$\displaystyle\\|\hat{q}_{l}-q^{*}\\|_{\infty}$	$\displaystyle\leq\\|\hat{q}_{l}-\bar{q}_{l}\\|_{\infty}+\\|q^{*}-\bar{q}_{l}\\|_{\infty}$
		$\displaystyle\leq\frac{1}{4}\\|\hat{q}_{l-1}-\bar{q}_{l}\\|_{\infty}+\\|q^{*}-\bar{q}_{l}\\|_{\infty}$
		$\displaystyle\leq\frac{1}{4}\\|\hat{q}_{l-1}-q^{}\\|_{\infty}+\frac{5}{4}\\|q^{}-\bar{q}_{l}\\|_{\infty}$
		$\displaystyle\leq\frac{2^{-l}\sqrt{t_{\mathrm{minorize}}}}{2}+\frac{5}{2}c\left(\frac{\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}\sqrt{\frac{\log(8\|S\|\|A\|(l^{}+1)/\delta)}{n_{l}}}+\frac{\log(8\|S\|\|A\|(l^{}+1)/\delta)}{(1-\gamma)^{2}n_{l}}\right)$
		$\displaystyle\leq 2^{-l}\sqrt{t_{\mathrm{minorize}}}$

where the last inequality follows from the choice of $n_{l}$ . Therefore, repeating the proof of Proposition 4.2 with $b=\sqrt{t_{\mathrm{minorize}}}$ allow us to conclude that $\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon$ w.p. at least $1-\delta$ . Note that the assumption $\epsilon\leq\sqrt{t_{\mathrm{minorize}}}$ implies $b/\epsilon\geq 1$ .

As in the previous remark following Proposition 4.2, the event $\left\{{\forall 1\leq l\leq l^{*},\|\hat{q}_{l}-q^{*}\|_{\infty}\leq 2^{-l}\sqrt{t_{\mathrm{minorize}}}}\right\}$ holds w.p. at least $1-\delta$ . ∎

B.4.3 Proof of Proposition 4.5

Proof.

The proof is the same as that of Proposition 4.3 given 4.2. ∎

B.4.4 Proof of Theorem 4

Proof.

Notice that Assumption 1 is implied by Assumption 4 with $t_{\mathrm{minorize}}^{*}\leq t_{\mathrm{minorize}}$ . As pointed out before, the sample size to compute $\hat{q}_{0}$ , by Theorem 2, is $\tilde{O}(|S||A|(1-\gamma)^{-3})$ . The overall sample complexity of this procedure is

	$\displaystyle\tilde{O}\left(\frac{\|S\|\|A\|}{(1-\gamma)^{3}}\right)+c^{\prime}\|S\|\|A\|\left(\frac{t_{\mathrm{minorize}}}{\epsilon^{2}(1-\gamma)^{2}}+\frac{\left\lceil{\log_{2}(\sqrt{t_{\mathrm{minorize}}}/\epsilon)}\right\rceil}{(1-\gamma)^{3}}\right)\log\left(\frac{8dt_{\mathrm{minorize}}}{(1-\gamma)\delta}\right)$
	$\displaystyle=\tilde{O}\left(\frac{\|S\|\|A\|}{(1-\gamma)^{3}}\right)+\tilde{O}\left(\frac{\|S\|\|A\|}{(1-\gamma)^{2}}\max\left\{{\frac{1}{1-\gamma},\frac{t_{\mathrm{minorize}}}{\epsilon^{2}}}\right\}\right)$
	$\displaystyle=\tilde{O}\left(\frac{\|S\|\|A\|}{(1-\gamma)^{2}}\max\left\{{\frac{1}{1-\gamma},\frac{t_{\mathrm{minorize}}}{\epsilon^{2}}}\right\}\right).$

∎

Appendix C Proofs of the Minimax Lower Bounds

C.1 Proof of Theorem 5

Proof.

By Theorem K1, it suffices to show that for any $\gamma\in[0.9,1)$ and $10\leq t_{\mathrm{minorize}}\leq 1/(1-\gamma)$ MDP instance $\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}$ satisfies

\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})^{2}\|_{\infty}\geq\frac{t_{\mathrm{minorize}}}{3^{5}(1-\gamma)^{2}}.

(38)

We compute the minimax risk parameter for $\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}$

	$\displaystyle\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})(1,a_{1})^{2}$	$\displaystyle=\gamma^{2}\mathrm{Var}\left(\left\{{(I-\gamma P^{\pi^{}})^{-1}\widehat{\mathcal{T}}(q^{})}\right\}(1,a_{1})\right)$
		$\displaystyle=\gamma^{2}\frac{(1-p)p(1-2\gamma(1-p)+\gamma^{2}(1-2p+2p^{2}))}{(1-\gamma)^{2}(1-\gamma+2\gamma p)^{4}}.$

We show that $\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})(1,a_{1})^{2}=\Omega(p^{-1}(1-\gamma)^{-2})$ for $t_{\mathrm{minorize}}=1/p\leq 1/(1-\gamma)$ , $p\in(0,0.1]$ , and $\gamma\in[0.9,1)$ . Define the function

f(p,\gamma):=\frac{(1-\gamma)^{2}p}{\gamma^{2}}\nu^{\pi^{*}}(q^{*})(1,a_{1})^{2}.

It suffices to prove that $f(p,\gamma)$ is uniformly bounded away from 0 in the region of interests.

We compute the derivative w.r.t. $\gamma$ ,

\partial_{\gamma}f(p,\gamma)=\frac{2(1-p)p^{2}\left((\gamma-1)^{2}-4\gamma^{2}p^{3}-2\gamma(2-3\gamma)p^{2}-\left(4\gamma^{2}-7\gamma+3\right)p\right)}{(\gamma(2p-1)+1)^{5}}.

Note that $(\gamma(2p-1)+1)^{5}>0$ . Moreover, let

h=(\gamma-1)^{2}-4\gamma^{2}p^{3}-2\gamma(2-3\gamma)p^{2}-\left(4\gamma^{2}-7\gamma+3\right)p

For $p\in(0,0.1]$ and $\gamma\in[0.9,1)$ , we can bound

	$\displaystyle h$	$\displaystyle=(\gamma-1)^{2}-4\gamma^{2}p^{3}+2\gamma(3\gamma-2)p^{2}+(1-\gamma)(4\gamma-3)p$
		$\displaystyle\geq(\gamma-1)^{2}-4\gamma^{2}p^{3}+\gamma^{2}p^{2}+\gamma(5\gamma-4)p^{2}$
		$\displaystyle>(\gamma-1)^{2}$
		$\displaystyle\geq 0.$

So, $\partial_{\gamma}f(p,\gamma)>0$ for $p\in(0,0.1]$ and $\gamma\in[0.9,1)$ , which implies that $f(p,\cdot)$ is non-decreasing. Recall that $1/p\leq 1/(1-\gamma)$ ; i.e. $\gamma\geq 1-p$ . So, for $p\leq 0.1$ , $\gamma\geq 0.9$ , and $\gamma\geq 1-p$

f(p,\gamma)\geq f(p,1-p)=\frac{(1-p)\left(2p^{2}-6p+5\right)}{(3-2p)^{4}}\geq\frac{2}{3^{4}}.

So, the constant $2\cdot 3^{-4}\gamma^{2}\geq 3^{-5}$ . ∎

C.2 Proof of Theorem 6

Proof.

By (30), we have that for any $\bar{\mathcal{M}}\in\mathcal{C}(\gamma,t_{\mathrm{minorize}})$

\mathfrak{M}_{n}(\gamma,t_{\mathrm{minorize}})\geq\inf_{\tilde{q}_{n}}\max_{\mathcal{M}\in\left\{{\mathcal{M}_{\gamma,t_{\mathrm{minorize}}},\bar{\mathcal{M}}}\right\}}\sqrt{n}E_{n}^{\mathcal{M}}\|\tilde{q}_{n}-q^{*}(\mathcal{M})\|_{\infty}.

The rest of the proof directly follows from that of Theorem 1 in Khamaru et al., 2021b (Theorem K1). Here we sketch the key steps and the construction of the instance $\bar{\mathcal{M}}\in\mathcal{C}(\gamma,t_{\mathrm{minorize}})$ .

First, the proof in Khamaru et al., 2021b use the standard reduction to testing and the Le Cam’s lower bound to conclude that for all $\mathcal{M}_{1},\mathcal{M}_{2}$

\inf_{\hat{q}_{n}}\max_{\mathcal{M}\in\left\{{\mathcal{M}_{1},\mathcal{M}_{2}}\right\}}\sqrt{n}E_{n}^{\mathcal{M}}\|\hat{q}_{n}-q^{*}(\mathcal{M})\|_{\infty}\geq\frac{\sqrt{n}}{4}\|q^{*}(\mathcal{M}_{1})-q^{*}(\mathcal{M}_{2})\|_{\infty}\left(1-n\sqrt{2}{d_{\mathrm{Hel}}}(P_{1}^{\mathcal{M}_{1}},P_{1}^{\mathcal{M}_{2}})^{2}\right).

where ${d_{\mathrm{Hel}}}$ is the Hellinger distance. Therefore, if $\mathcal{M}_{1}=\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}$ and $\mathcal{M}_{2}=\bar{\mathcal{M}}$ s.t. ${d_{\mathrm{Hel}}}(P_{1}^{\mathcal{M}_{1}},P_{1}^{\mathcal{M}_{2}})\leq(2\sqrt{n})^{-1}$ .

\mathfrak{M}_{n}(\gamma,t_{\mathrm{minorize}})\geq\frac{\sqrt{n}}{8}\|q^{*}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})-q^{*}(\bar{\mathcal{M}})\|_{\infty}

Next, we use the $\bar{\mathcal{M}}$ constructed in Lemma 2 of Khamaru et al., 2021b . Note that $\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}$ , any stationary deterministic Markov policy is optimal, induce the same $P_{\pi}$ and $\|\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})\|_{\infty}$ . Let $U:=(I-\gamma P^{\pi})^{-1}$ , $q^{*}=Ur$ , and $(\bar{s},\bar{a})$ s.t. $\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})(\bar{s},\bar{a})=\|\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})\|_{\infty}$ . Define $\bar{\mathcal{M}}=(S,A,r,\bar{P},\gamma)$ having the same reward $r$ as $\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}$ and

\bar{P}_{s,a}(s^{\prime})=P_{s,a}(s^{\prime})+\frac{P_{s,a}(s^{\prime})U_{\bar{s},\bar{a}}(s,a)[v(q^{*})(s^{\prime})-(P^{\pi}q^{*})(s,a)]}{\sqrt{2n}\|\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})\|_{\infty}}.

Lemma 2 and 3 of Khamaru et al., 2021b and the proof of Theorem 5 imply that $\bar{P}$ is a valid transition kernel s.t. ${d_{\mathrm{Hel}}}(P_{1}^{\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}},P_{1}^{\bar{\mathcal{M}}})\leq(2\sqrt{n})^{-1}$ and

\|q^{*}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})-q^{*}(\bar{\mathcal{M}})\|_{\infty}\geq c\gamma\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})\|_{\infty}\geq\frac{c\sqrt{t_{\mathrm{minorize}}}}{3^{5}(1-\gamma)}

when $n$ satisfies (29), which is implied by the assumption. Therefore, it is left to show that $\bar{\mathcal{M}}\in\mathcal{C}(\gamma,t_{\mathrm{minorize}})$ . To do this, it is sufficient to show that $\bar{P}_{\pi}$ is $(1,p)$ -Doeblin for all $\pi\in\Pi$ .

Recall the definition of $\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}$ , for every policy $\pi\in\Pi$ ,

\displaystyle\bar{P}_{\pi}(s,s^{\prime})

\displaystyle=2p\frac{e^{T}}{2}(s^{\prime})+(1-2p)I(s,s^{\prime})+\frac{P_{s,\pi(s)}(s^{\prime})U_{\bar{s},\bar{a}}(s,\pi(s))[v(q^{*})(s^{\prime})-(P_{\pi}v(q^{*}))(s)]}{\sqrt{2n}\|\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})\|_{\infty}}.

Moreover,

\sup_{s,s^{\prime}}|P_{s,\pi(s)}(s^{\prime})U_{\bar{s},\bar{a}}(s,\pi(s))[v(q^{*})(s^{\prime})-(P_{\pi}v(q^{*}))(s)]|\leq\frac{\left|q^{*}\right|_{\mathrm{span}}}{1-\gamma}\leq\frac{t_{\mathrm{minorize}}}{1-\gamma}.

By (38), and the choice of $n$ ,

\sup_{s,s^{\prime}}\left|\frac{P_{s,\pi(s)}(s^{\prime})U_{\bar{s},\bar{a}}(s,\pi(s))[v(q^{*})(s^{\prime})-(P_{\pi}v(q^{*}))(s)]}{\sqrt{2n}\|\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})\|_{\infty}}\right|\leq\frac{3^{5/2}\sqrt{t_{\mathrm{minorize}}}}{\sqrt{2n}}\leq\frac{p}{2},

where $p=1/t_{\mathrm{minorize}}$ . This completes the proof. ∎

Appendix D Proofs for Section 6

D.1 Proof of Proposition 6.1

Proof.

The $q^{\pi}$ is the state-action value function under the policy $\pi$ :

	$\displaystyle q^{\pi}(s,a)$	$\displaystyle=E_{s,a}^{\pi}\sum_{k=0}^{\infty}\gamma^{k}r(S_{k},A_{k})$
		$\displaystyle=r(s,a)+E_{s,a}^{\pi}\sum_{k=1}^{\infty}\gamma^{k}r(S_{k},A_{k})$
		$\displaystyle=r(s,a)+\gamma E_{s,a}E_{S_{1}}^{\pi}\sum_{k=0}^{\infty}\gamma^{k}r_{\pi}(S_{k})$

where we use the Markov property on the bounded functional $\sum_{k}\gamma^{k}r$ . Now

	$\displaystyle E_{S_{1}}^{\pi}\sum_{k=0}^{\infty}\gamma^{k}r(S_{k},A_{k})$	$\displaystyle=E_{S_{1}}^{\pi}\sum_{j=0}^{\infty}\sum_{k=0}^{m-1}\gamma^{mj+k}r_{\pi}(S_{mj+k})$
		$\displaystyle=E_{S_{1}}^{\pi}\sum_{j=0}^{\infty}\sum_{k=0}^{m-1}\gamma^{mj+k}r_{\pi}(S_{mj+k})$
		$\displaystyle=\sum_{k=0}^{m-1}\gamma^{k}\sum_{j=0}^{\infty}\gamma^{mj}E_{S_{1}}^{\pi}E[r_{\pi}(S_{mj+k})\|\mathcal{F}_{mj}]$
		$\displaystyle=\sum_{k=0}^{m-1}\gamma^{k}E_{S_{1}}^{\pi}\sum_{j=0}^{\infty}\gamma^{mj}E_{S_{mj}}r_{\pi}(S_{k})$

The last conditional expectation is a bounded function $w_{k}(s):=E_{s}r_{\pi}(S_{k})\in[0,1]$ . Recall that $\tau_{1}$ is the first regeneration time of the split chain. Then

	$\displaystyle E_{S_{1}}^{\pi}\sum_{k=0}^{\infty}\gamma^{k}r_{\pi}(S_{k})$	$\displaystyle=\sum_{k=0}^{m-1}\gamma^{k}E_{S_{1}}^{\pi}\sum_{j=0}^{\infty}\gamma^{mj}w_{k}(S_{mj})$
		$\displaystyle=\sum_{k=0}^{m-1}\gamma^{k}E_{S_{1}}^{\pi}\left(\sum_{j=0}^{\tau_{1}-1}\gamma^{mj}w_{k}(S_{mj})+\sum_{j=\tau_{1}}^{\infty}\gamma^{mj}w_{k}(S_{mj})\right)$
		$\displaystyle=\sum_{k=0}^{m-1}\gamma^{k}E_{S_{1}}^{\pi}\left(\sum_{j=0}^{\tau_{1}-1}\gamma^{mj}w_{k}(S_{mj})+E\left[\sum_{j=\tau_{1}}^{\infty}\gamma^{mj}w_{k}(S_{mj})\middle\|\mathcal{F}_{\tau}\right]\right)$
		$\displaystyle=\sum_{k=0}^{m-1}\gamma^{k}E_{S_{1}}^{\pi}\left(\sum_{j=0}^{\infty}\gamma^{mj}w_{k}(S_{mj})\mathds{1}\left\{{\tau_{1}>j}\right\}+E_{\psi}\sum_{j=0}^{\infty}\gamma^{mj}w_{k}(S_{mj})\right)$

where we can use the strong Markov property by the boundedness of $w_{k}$ ; and the split chain’s distribution at regeneration time is $\psi$ . Note that the second term is constant. So, if we let

c:=\sum_{k=0}^{m-1}\gamma^{k}E_{\psi}\sum_{j=0}^{\infty}\gamma^{mj}w_{k}(S_{mj}),

then $q^{\pi}$ can be written as

\displaystyle q^{\pi}(s,a)

\displaystyle=r(s,a)+\gamma c+\gamma E_{s,a}\sum_{k=0}^{m-1}\gamma^{k}E_{S_{1}}^{\pi}\sum_{j=0}^{\infty}\gamma^{mj}w_{k}(S_{mj})\mathds{1}\left\{{\tau>j}\right\}.

Since $w_{k}(\cdot)\in[0,1]$ ,

	$\displaystyle 0\leq q^{\pi}(s,a)-\gamma c$	$\displaystyle\leq 1+\gamma E_{s,a}\sum_{k=0}^{m-1}\gamma^{k}E_{S_{1}}^{\pi}\sum_{j=0}^{\infty}\gamma^{mj}\mathds{1}\left\{{\tau>j}\right\}$
		$\displaystyle=1+\gamma E_{s,a}\sum_{k=0}^{m-1}\gamma^{k}\sum_{j=0}^{\infty}\gamma^{mj}P_{S_{1}}^{\pi}(\tau>j)$
		$\displaystyle=1+\gamma E_{s,a}\sum_{k=0}^{m-1}\gamma^{k}\sum_{j=0}^{\infty}\gamma^{mj}(1-p)^{j}$
		$\displaystyle=1+\frac{\gamma(1-\gamma^{m})}{(1-\gamma)(1-(1-p)\gamma^{m})}.$

Notice that

\frac{\gamma(1-\gamma^{m})}{(1-\gamma)}=\gamma\sum_{k=0}^{m-1}\gamma^{k}\leq m\gamma;\qquad\frac{1}{(1-(1-p)\gamma^{m})}\leq\frac{1}{p}.

(39)

Therefore, we conclude that $\|q^{\pi}-\gamma c\|_{\infty}\leq m/p+1$ . Also recall the definition of $\left|\cdot\right|_{\mathrm{span}}$

	$\displaystyle\left\|q^{\pi}\right\|_{\mathrm{span}}$	$\displaystyle=2\\|q^{\pi}-\gamma c\\|_{\infty}$
		$\displaystyle\leq 2m/p+2$
		$\displaystyle\leq 3m/p$

where we used that $p\leq 1/2$ and $m\geq 1$ . ∎

D.2 Proof of Proposition 6.2

Proof.

Define $T_{j+1}=\tau_{j+1}-\tau_{j}$ . By uniform regeneration, $\left\{{T_{j+1}}\right\}\overset{d}{=}\left\{{mG_{i}}\right\}$ where $G_{i}\sim$ Geo $(p)$ i.i.d.. So,

E\gamma^{cT_{i}}=E\exp(cm\log(\gamma)G)=\frac{p\gamma^{cm}}{1-(1-p)\gamma^{cm}}=:\chi(c)

Next, we expand the variance:

	$\displaystyle\Psi^{\pi}(s,a)$	$\displaystyle=\mathrm{Var}_{s,a}^{\pi}\left(\sum_{k=0}^{\infty}\gamma^{k}r_{\pi}(S_{k})\right)$
		$\displaystyle=\mathrm{Var}_{s,a}^{\pi}\left(\sum_{j=0}^{\infty}\sum_{k={\tau_{j}}}^{\tau_{j+1}-1}\gamma^{k}r_{\pi}(S_{k})\right)$
		$\displaystyle\leq 2\mathrm{Var}_{s,a}^{\pi}\left(\sum_{k=0}^{\tau_{1}-1}\gamma^{k}r_{\pi}(S_{k})\right)+2\mathrm{Var}_{s,a}^{\pi}\left(\sum_{j=1}^{\infty}\sum_{k={\tau_{j}}}^{\tau_{j+1}-1}\gamma^{k}r_{\pi}(S_{k})\right).$

We split the second term using the regeneration cycles. Let $W_{j+1}=(S_{\tau_{j}},\dots,S_{\tau_{j+1}-1},T_{j+1})$ , $\mathcal{C}_{k}=\sigma(W_{j},j\leq k)$ and

g(W_{j+1})=\sum_{k=0}^{T_{j+1}-1}\gamma^{k}r_{\pi}(S_{\tau_{j}+k}).

Note that

$\displaystyle Eg(W_{j+1})$	$\displaystyle\leq E\sum_{k=0}^{T_{j+1}-1}\gamma^{k}$	(40)
	$\displaystyle=\frac{1-E\gamma^{T_{j+1}}}{1-\gamma}$
	$\displaystyle=\frac{1-\chi(1)}{1-\gamma}$
	$\displaystyle=\frac{1-\gamma^{m}}{(1-\gamma)(1-(1-p)\gamma^{m})}.$

Recall that the sequence of random elements $\left\{{W_{j+1}}\right\}$ are 1-dependent. In particular $W_{j+2}\perp W_{j}$ . Then

	$\displaystyle\mathrm{Var}_{s,a}^{\pi}\left(\sum_{j=1}^{\infty}\sum_{k={\tau_{j}}}^{\tau_{j+1}-1}\gamma^{k}r_{\pi}(S_{k})\right)$	$\displaystyle=\mathrm{Var}_{s,a}^{\pi}\left(\sum_{j=1}^{\infty}\gamma^{\tau_{j}}g(W_{j+1})\right)$
		$\displaystyle=\sum_{j=1}^{\infty}\mathrm{Var}_{s,a}^{\pi}\left(\gamma^{\tau_{j}}g(W_{j+1})\right)+2\sum_{i=1}^{\infty}\sum_{j=1}^{i-2}\mathrm{Cov}(\gamma^{\tau_{i}}g(W_{i+1}),\gamma^{\tau_{j}}g(W_{j+1}))$
		$\displaystyle\qquad+2\sum_{i=1}^{\infty}\mathrm{Cov}(\gamma^{\tau_{i}}g(W_{i+1}),\gamma^{\tau_{i+1}}g(W_{i+2}))$
		$\displaystyle=:V_{1}+2V_{2}+2V_{3}$

The first term can be bounded by the second moment

	$\displaystyle V_{1}$	$\displaystyle=\sum_{j=1}^{\infty}\mathrm{Var}_{s,a}^{\pi}\left(\gamma^{\tau_{j}}g(W_{j+1})\right)$
		$\displaystyle\leq\sum_{j=1}^{\infty}E\gamma^{2\tau_{j}}g(W_{j+1})^{2}$
		$\displaystyle\leq\frac{1}{(1-\gamma)^{2}}\sum_{j=1}^{\infty}E\gamma^{2\tau_{j}}(1-\gamma^{T_{j+1}})^{2}$
		$\displaystyle\leq\frac{1}{(1-\gamma)^{2}}\sum_{j=1}^{\infty}\chi(2)^{j}(1-2\chi(1)+\chi(2))$
		$\displaystyle\leq\frac{p\gamma^{2m}(1-\gamma^{m})(1+(1-p)\gamma^{m})}{(1-\gamma)^{2}(1+\gamma^{m})(1-(1-p)\gamma^{m})(1-(1-p)\gamma^{2m})}$
		$\displaystyle\leq\frac{2m}{(1-\gamma)p}$

where the last line follows from (39). Similarly, the third term can be bounded by the variance.

	$\displaystyle V_{3}$	$\displaystyle=\sum_{i=1}^{\infty}\mathrm{Cov}(\gamma^{\tau_{i}}g(W_{i+1}),\gamma^{\tau_{i+1}}g(W_{i+2}))$
		$\displaystyle\leq\sum_{i=1}^{\infty}\sqrt{\mathrm{Var}(\gamma^{\tau_{i}}g(W_{i+1}))\mathrm{Var}(\gamma^{\tau_{i+1}}g(W_{i+2}))}$
		$\displaystyle\leq\frac{1}{(1-\gamma)^{2}}\sum_{i=1}^{\infty}E\gamma^{2\tau_{i}}(1-\gamma^{T_{i+1}})^{2}$
		$\displaystyle\leq\frac{2m}{(1-\gamma)p}.$

For the middle term,

\displaystyle\mathrm{Cov}(\gamma^{\tau_{i}}g(W_{i+1}),\gamma^{\tau_{j}}g(W_{j+1}))=E[\gamma^{2\tau_{j}}\gamma^{T_{j+1}}g(W_{j+1})\gamma^{\tau_{i}-\tau_{j+1}}g(W_{i+1})]-E[\gamma^{\tau_{j}}g(W_{j+1})]E[\gamma^{\tau_{i}}g(W_{i+1})].

We will defer the proof of the following lemma:

Lemma 8.

\mathrm{Cov}(\gamma^{T_{j+1}},g(W_{j+1}))\leq 0.

By Lemma 8, $E\gamma^{T_{j+1}}g(W_{j+1})\leq E\gamma^{T_{j+1}}Eg(W_{j+1})$ . Also, we use the regeneration property: for $j\leq i-2$ $W_{i+1}\perp\mathcal{C}_{j+1}$ . We have

	$\displaystyle E[\gamma^{2\tau_{j}}\gamma^{T_{j+1}}g(W_{j+1})\gamma^{\tau_{i}-\tau_{j+1}}g(W_{i+1})]$	$\displaystyle=E[\gamma^{2\tau_{j}}\gamma^{T_{j+1}}g(W_{j+1})E[\gamma^{\tau_{i}-\tau_{j+1}}g(W_{i+1})\|\mathcal{C}_{j+1}]]$
		$\displaystyle=E[\gamma^{2\tau_{j}}\gamma^{T_{j+1}}g(W_{j+1})]E[\gamma^{\tau_{i}-\tau_{j+1}}g(W_{i+1})]$
		$\displaystyle\leq E[\gamma^{2\tau_{j}}]E[\gamma^{T_{j+1}}]E[g(W_{j+1})]E[\gamma^{\tau_{i}-\tau_{j+1}}g(W_{i+1})]$
		$\displaystyle=\chi(2)^{j}\chi(1)^{i-j}E[g(W_{j+1})]E[g(W_{i+1})].$

Then, the covariance term

	$\displaystyle V_{2}$	$\displaystyle=\sum_{i=1}^{\infty}\sum_{j=1}^{i-2}\mathrm{Cov}(\gamma^{\tau_{i}}g(W_{i+1}),\gamma^{\tau_{j}}g(W_{j+1}))$
		$\displaystyle\leq\sum_{i=1}^{\infty}\sum_{j=1}^{i-2}(\chi(2)^{j}\chi(1)^{i-j}-\chi(1)^{j}\chi(1)^{i})E[g(W_{j+1})]E[g(W_{i+1})]$
		$\displaystyle\leq\frac{(1-p)p^{3}\gamma^{4m}}{(1-\gamma^{m})(1+\gamma^{m})(1-(1-p)\gamma^{m})(1-(1-2p)\gamma^{m})}\left(\frac{1-\gamma^{m}}{(1-\gamma)(1-(1-p)\gamma^{m})}\right)^{2}$
		$\displaystyle\leq\frac{2(1-\gamma^{m})p^{3}}{(1-\gamma)^{2}p(2p)p^{2}}$
		$\displaystyle\leq\frac{m}{(1-\gamma)p}$

where we used (40). We conclude that

	$\displaystyle\Psi^{\pi}(s,a)$	$\displaystyle\leq 2\mathrm{Var}_{s,a}^{\pi}\left(\sum_{k=0}^{\tau_{1}-1}\gamma^{k}r_{\pi}(S_{k})\right)+2(V_{1}+2V_{2}+2V_{3})$
		$\displaystyle\leq 2\frac{1+\chi(2)-2\chi(1)}{(1-\gamma)^{2}}+\frac{16m}{(1-\gamma)p}$
		$\displaystyle\leq 2\frac{(1-\gamma^{m})^{2}(1+(1-p)\gamma^{m})}{(1-\gamma)^{2}(1-(1-p)\gamma^{m})(1-(1-p)\gamma^{2m})}+\frac{16m}{(1-\gamma)p}$
		$\displaystyle\leq\frac{4m^{2}}{p^{2}}+\frac{16m}{(1-\gamma)p}$
		$\displaystyle\leq\frac{20m}{(1-\gamma)p}$

∎

D.2.1 Proof of Lemma 8

Proof.

We simplify the notation by consider a new probability space supporting a cycle $\left\{{S_{k},k=0,\dots,T}\right\}$ of the split chain starting from initial $S_{0}\sim\psi$ and cycle length $T$ . We use the property of the split chain: Let $\left\{{Y_{k}}\right\}$ be a independent process on this probability space with $m$ skeleton having transition kernel $R$ and interpolated by $P_{\pi}$ . Then, the random element $(S_{0},\dots,S_{T-m},T)$ has the same distribution as $(Y_{0},\dots,Y_{T-m},T)$ .

Write

	$\displaystyle\mathrm{Cov}(\gamma^{T},g(W))$	$\displaystyle=\mathrm{Cov}(\gamma^{T},\sum_{k=0}^{T-m-1}\gamma^{k}r_{\pi}(S_{k})+\sum_{k=T-m}^{T-1}\gamma^{k}r_{\pi}(S_{k}))$
		$\displaystyle=\mathrm{Cov}(\gamma^{T},\sum_{k=0}^{T-m-1}\gamma^{k}r_{\pi}(Y_{k}))+\mathrm{Cov}(\gamma^{T},\sum_{k=T-m}^{T-1}\gamma^{k}r_{\pi}(S_{k}))$

First, we handle the last $m$ -segment.

	$\displaystyle E\gamma^{T}\sum_{k=T-m}^{T-1}\gamma^{k}r_{\pi}(S_{k})$	$\displaystyle=EE\left[\gamma^{T}\sum_{k=T-m}^{T-1}\gamma^{k}r_{\pi}(S_{k})\middle\|T,S_{T-m}\right]$
		$\displaystyle=E\gamma^{2T-m}E\left[\sum_{k=0}^{m-1}\gamma^{k}r_{\pi}(S_{T-m+k})\middle\|T,S_{T-m}\right]$
		$\displaystyle=E\gamma^{2T-m}\tilde{f}(S_{T-m})$
		$\displaystyle=E\gamma^{2T-m}\tilde{f}(Y_{T-m})$

for some non-negative and deterministic function $\tilde{f}$ . Here we used the property of the split chain: given the coin toss is successful and the current state $S_{T-m}$ , the path $\left\{{S_{T-m+1},\dots,S_{T-1}}\right\}$ is generated condition on the independently sampled $S_{T}\sim\psi$ and $S_{T-m}$ . Similarly,

E\gamma^{T}E\sum_{k=T-m}^{T-1}\gamma^{k}r_{\pi}(S_{k})=E\gamma^{T}E\gamma^{T-m}\tilde{f}(Y_{T-m}).

Therefore,

\mathrm{Cov}(\gamma^{T},g(W))=\mathrm{Cov}\left(\gamma^{T},\sum_{k=0}^{T-m-1}\gamma^{k}r_{\pi}(Y_{k})+\gamma^{T-m}\tilde{f}(Y_{T-m})\right)

Define $h(T):=\gamma^{T}-E\gamma^{T}$ and the bounded functional $f:m\mathbb{N}\times\mathbb{R}^{\mathbb{N}}\rightarrow\mathbb{R}^{+}$

f(T,Y):=\sum_{k=0}^{T-m}\gamma^{k}r_{\pi}(Y_{k})+\gamma^{T-m}\tilde{f}(Y_{T-m})-E\left(\sum_{k=0}^{T-m}\gamma^{k}r_{\pi}(Y_{k})+\gamma^{T-m}\tilde{f}(Y_{T-m}))\right).

Notice that, $f(\cdot,Y)$ is non-decreasing and $h$ is decreasing. Let $T^{\prime}\sim m$ Geo $(p)$ be independent, then consider

	$\displaystyle 0$	$\displaystyle\geq E(f(T,Y)-f(T^{\prime},Y))(h(T)-h(T^{\prime}))$
		$\displaystyle=Ef(T,Y)h(T)+Ef(T^{\prime},Y)h(T^{\prime})-Ef(T,Y)Eh(T^{\prime})-Ef(T^{\prime},Y)Eh(T)$
		$\displaystyle=2\mathrm{Cov}(\gamma^{T},g(W))$

where the last equality follows from, $(Y,T)\overset{d}{=}(Y,T^{\prime})$ and $Eh(T)=0$ . ∎

D.3 Proof of Corollary 6.2.1

Proof.

By Theorem UE and the definition of $t_{\mathrm{minorize}}(P_{\pi})$ , for any $\epsilon>0$ , there exists $(m,p)$ s.t. $P_{\pi}$ is $(m,p)$ -Doeblin and $t_{\mathrm{minorize}}(P_{\pi})+\epsilon\geq m/p$ . From Azar et al., (2013), we have that (31) and inequality

\|(I-\gamma P^{\pi})^{-1}\sigma^{\pi}\|_{\infty}\leq\frac{\sqrt{t}}{1-\gamma^{t}}\sqrt{\|(I-\gamma^{2}P^{\pi})^{-1}(\sigma^{\pi})^{2}\|_{\infty}}.

holds. Now, we choose $t=-\log 2/\log\gamma\leq 2/(1-\gamma)$ to conclude that

	$\displaystyle\\|(I-\gamma P^{\pi})^{-1}\sigma^{\pi}\\|_{\infty}$	$\displaystyle\leq 2\sqrt{t}\sqrt{\\|\Psi^{\pi}\\|_{\infty}}$
		$\displaystyle\leq\frac{80\sqrt{m/p}}{(1-\gamma)}$
		$\displaystyle\leq\frac{80\sqrt{t_{\mathrm{minorize}}(P_{\pi})+\epsilon}}{(1-\gamma)}$

where we used Proposition 6.2 to bound $\|\Psi^{\pi}\|_{\infty}$ . Since $\epsilon>0$ is arbitrary, this implies the corollary. ∎

	$\displaystyle P(\\|\hat{q}_{l^{}}-q^{}\\|_{\infty}\leq 2^{-l^{*}}b)$	$\displaystyle\geq P\left(\bigcap_{l=0}^{l^{}}\left\{{\\|\hat{q}_{l}-q^{}\\|_{\infty}\leq 2^{-l}b}\right\}\right)$
		$\displaystyle=\prod_{l=0}^{l^{\prime}-1}P\left(\\|\hat{q}_{l}-q^{}\\|_{\infty}\leq 2^{-l}b\Bigg{\|}\bigcap_{n=0}^{l-1}\left\{{\\|\hat{q}_{n}-q^{}\\|_{\infty}\leq 2^{-n}b}\right\}\right)$
		$\displaystyle\geq\left(1-\frac{\delta}{l^{\prime}}\right)\prod_{l=1}^{l^{\prime}-1}P\left(\\|\hat{q}_{l}-q^{}\\|_{\infty}\leq 2^{-l}b\Bigg{\|}\bigcap_{n=0}^{l-1}\left\{{\\|\hat{q}_{n}-q^{}\\|_{\infty}\leq 2^{-n}b}\right\}\right)$

	$\displaystyle P\left(\\|\hat{q}_{l}-q^{*}\\|_{\infty}\leq 2^{-l}b\middle\|A_{l-1}\right)$	$\displaystyle=\frac{1}{P(A_{l-1})}E\left[\mathds{1}\left\{{\\|\hat{q}_{l}-q^{*}\\|_{\infty}\leq 2^{-l}b}\right\}\mathds{1}_{A_{l-1}}\right]$
		$\displaystyle=\frac{1}{P(A_{l-1})}E\left[\mathds{1}\left\{{\\|\hat{q}_{l-1}-q^{}\\|_{\infty}\leq 2^{-l+1}b}\right\}E\left[\mathds{1}\left\{{\\|\hat{q}_{l}-q_{}\\|_{\infty}\leq 2^{-l}b}\right\}\middle\|\mathcal{G}_{l-1}\right]\mathds{1}_{A_{l-1}}\right]$
		$\displaystyle\geq 1-\frac{\delta}{l^{\prime}}.$

	$\displaystyle\|S\|\|A\|n^{}=\|S\|\|A\|\left(k_{0}+\sum_{l=1}^{l^{}}n_{l}+l^{}k^{}\right)$
	$\displaystyle\leq\|S\|\|A\|\left(c_{0}\frac{1}{(1-\gamma)^{3}}+3c_{2}\frac{4^{l^{}+1}/3}{(1-\gamma)^{2}}+c_{1}\frac{\left\lceil{\log_{2}(t_{\mathrm{minorize}}^{}/\epsilon)}\right\rceil}{(1-\gamma)^{3}}\right)\log\left(\frac{8d}{(1-\gamma)\delta}\right)$
	$\displaystyle\leq c^{\prime}\|S\|\|A\|\left(\frac{(t_{\mathrm{minorize}}^{})^{2}}{\epsilon^{2}(1-\gamma)^{2}}+\frac{\left\lceil{\log_{2}(t_{\mathrm{minorize}}^{}/\epsilon)}\right\rceil}{(1-\gamma)^{3}}\right)\log\left(\frac{8d}{(1-\gamma)\delta}\right)$
	$\displaystyle=\tilde{O}\left(\frac{\|S\|\|A\|}{(1-\gamma)^{2}}\left(\frac{(t_{\mathrm{minorize}}^{*})^{2}}{\epsilon^{2}}+\frac{1}{1-\gamma}\right)\right)$
	$\displaystyle=\tilde{O}\left(\frac{\|S\|\|A\|(t_{\mathrm{minorize}}^{})^{2}}{(1-\gamma)^{2}\epsilon^{2}}\max\left\{{1,\frac{\epsilon^{2}}{(t_{\mathrm{minorize}}^{})^{2}(1-\gamma)}}\right\}\right).$

	$\displaystyle P(\\|\hat{q}_{l^{}}-q^{}\\|_{\infty}\leq\epsilon)$	$\displaystyle\geq E\mathds{1}\left\{{\\|\hat{q}_{0}-q^{}\\|_{\infty}\leq 1/\sqrt{1-\gamma},\\|\hat{q}_{l^{}}-q^{*}\\|_{\infty}\leq\epsilon}\right\}$
		$\displaystyle\geq E\mathds{1}\left\{{\\|\hat{q}_{0}-q^{}\\|_{\infty}\leq 1/\sqrt{1-\gamma}}\right\}P\left(\\|\hat{q}_{l^{}}-q^{*}\\|_{\infty}\leq\epsilon\|\hat{q}_{0}\right)$
		$\displaystyle\geq(1-\delta^{2})$
		$\displaystyle\geq 1-2\delta.$

	$\displaystyle\sigma(q^{*})(s,a)^{2}$	$\displaystyle=\gamma^{2}P_{s,a}[(v(q^{})-P_{s,a}[v(q^{})])^{2}]$
		$\displaystyle=\gamma^{2}P_{s,a}[(v^{}-P_{s,a}[v^{}])^{2}]$
		$\displaystyle=\gamma^{2}P_{s,a}[(v^{}-\bar{v}_{l}+\bar{v}_{l}-P_{s,a}[\bar{v}_{l}]+P_{s,a}[\bar{v}_{l}]-P_{s,a}[v^{}])^{2}]$
		$\displaystyle=\gamma^{2}\mathrm{Var}_{s,a}(v^{*}(S^{\prime})-\bar{v}_{l}(S^{\prime}))+\gamma^{2}\mathrm{Var}_{s,a}(\bar{v}_{l}(S^{\prime}))$
		$\displaystyle\leq\gamma^{2}\mathrm{Var}_{s,a}(\bar{v}_{l}(S^{\prime}))+\gamma^{2}\\|\bar{q}_{l}-q^{*}\\|_{\infty}$

Optimal Sample Complexity of Reinforcement Learning for Mixing Discounted Markov Decision Processes

Abstract

1 Introduction

1.1 Literature Review

2 Markov Decision Processes: Definitions

3 Uniformly Ergodic Markov Chains and Mixing Metrics

3.1 Uniform Ergodicity

Definition 1 (Uniform Ergodicity).

Definition 2 (Doeblin Minorization Condition).

Theorem UE (Theorem 16.0.2 in Meyn et al., (2009)).

3.2 Doeblin Condition, Split Chain, and Regeneration

3.3 Two Mixing Metrics and Their Equivalence

Definition 3 (Mixing and Minorization Times).

Theorem 1.

4 Uniformly Ergodic MDPs: Algorithms and Sample Complexity Upper Bounds

Assumption 1.

Assumption 2.

Assumption 3.

Assumption 4.

Remark.

4.1 Q-learning and Wainwright’s Variance Reduction

4.1.1 The Q-learning Algorithm

Proposition 4.1 (Section 4.1.2 in Wainwright, 2019b ).

Lemma 1.

Remark.

4.1.2 Wainwright’s Variance Reduction

4.2 The General Setting: Assumption 1

Theorem 2.

Remark.

Proposition 4.2.

Proposition 4.3.

Remark.

4.3 The Lipschitz Setting: Assumptions 2 and 3

Theorem 3.

Remark.

Theorem K2 (Theorem 2 of Khamaru et al., 2021b ).

Lemma 2.

Remark.

4.4 Uniform Mixing: Assupmtion 4

Theorem 4.

Remark.

Proposition 4.4.

Remark.

Proposition 4.5.

5 Minimax Lower Bounds

Definition 4.

Theorem 5.

Theorem 6.

Definition 5 (Hard MDP Family).

Theorem K1 (Theorem 1 in Khamaru et al., 2021b ).

6 Cumulative Reward of a (m,p)(m,p)-Doeblin Chain

Proposition 6.1 (Oscillation of the qq-function).

Proposition 6.2 (Variance of the Cumulative Reward).

Corollary 6.2.1.

Remark.

Acknowledgements

References

Appendix A Equivalence of tminorizet_{\mathrm{minorize}} and tmix{t_{\mathrm{mix}}}

Proposition A.1.

Proof.

Lemma 3 (Lemma 4 in Aldous et al., (1997)).

Proposition A.2.

Proof.

A.1 Proof of Theorem 1

Proof.

Appendix B Analysis of Algorithms

B.1 Proof of Lemma 1

Proof.

B.2 Proofs for Section 4.2: The General Setting

Lemma 4.

Proof.

B.2.1 Proof of Proposition 4.2

Proof.

Remark.

B.2.2 Proof of Proposition 4.3

Proof.

B.2.3 Proof of Theorem 2

Proof.

B.3 Proofs for Section 4.3: The Lipschitz Setting

B.3.1 Proof of Lemma 2

Optimal Sample Complexity of Reinforcement Learning for
Mixing Discounted Markov Decision Processes

6 Cumulative Reward of a $(m,p)$ -Doeblin Chain

Proposition 6.1 (Oscillation of the $q$ -function).

Appendix A Equivalence of $t_{\mathrm{minorize}}$ and ${t_{\mathrm{mix}}}$