
Department of Management Science and Engineering
Stanford University

Optimal Sample Complexity of Reinforcement Learning for
Mixing Discounted Markov Decision Processes

Shengbo Wang Jose Blanchet Peter Glynn
Abstract

We consider the optimal sample complexity theory of tabular reinforcement learning (RL) for maximizing the infinite horizon discounted reward in a Markov decision process (MDP). Optimal worst-case complexity results have been developed for tabular RL problems in this setting, leading to a sample complexity dependence on γ\gamma and ϵ\epsilon of the form Θ~((1γ)3ϵ2)\tilde{\Theta}((1-\gamma)^{-3}\epsilon^{-2}), where γ\gamma denotes the discount factor and ϵ\epsilon is the solution error tolerance. However, in many applications of interest, the optimal policy (or all policies) induces mixing. We establish that in such settings, the optimal sample complexity dependence is Θ~(tmix(1γ)2ϵ2)\tilde{\Theta}({t_{\mathrm{mix}}}(1-\gamma)^{-2}\epsilon^{-2}), where tmix{t_{\mathrm{mix}}} is the total variation mixing time. Our analysis is grounded in regeneration-type ideas, which we believe are of independent interest, as they can be used to study RL problems for general state space MDPs.

1 Introduction

Reinforcement learning (RL) (Sutton and Barto,, 2018) has witnessed substantial empirical successes in a wide range of applications, including robotics (Kober et al.,, 2013), computer vision (Sadeghi and Levine,, 2016), and finance (Deng et al.,, 2017). This has sparked extensive research on RL theory and its applications in operations research and the management sciences. This paper provides a theoretical contribution to this area by considering a tabular RL environment in which a controller wishes to maximize an infinite horizon discounted reward governed by a Markov decision process (MDP). In many applications, due to engineering and managerial considerations, the underlying MDP is naturally constrained to an environment in which good policies must be stable, cf. Bramson, (2008). In these settings, the controlled Markov chain induced by a reasonable policy will converge in distribution to a unique steady state behavior, regardless of the initial condition. This phenomenon is known as mixing. In cases where the mixing rate of a system is rapid, a finite-time observation of the state process can provide a more accurate statistical depiction of its long-term behavior. Consequently, it is reasonable to expect that for such systems, the development of a lower-complexity algorithm for policy learning is feasible, even in the presence of small discount rates, provided that an optimal policy displays fast mixing. This intuition serves as the motivation and driving force for this paper.

In addition, an optimal algorithm and analysis for discounted mixing MDPs can serve as valuable algorithmic and theoretical tools for achieving optimal sample complexity results for long-run average reward MDPs through the reduction method proposed in Jin and Sidford, (2021). While the worst case sample complexities of discounted tabular RL algorithms have been studied extensively in recent years (Azar et al.,, 2013; Agarwal et al.,, 2020; Li et al.,, 2022), we are unaware of a complexity theory and algorithm that is optimized for discounted mixing MDPs. Therefore, it is of significant practical and theoretical value to develop a satisfactory complexity theory for discounted mixing MDPs.

This paper contributes to the existing literature in the following ways. First, we adapt the regeneration idea and the split chain technique (Athreya and Ney,, 1978; Nummelin,, 1978) for analyzing stable Markov chains to the RL context. This framework naturally leads to measuring the mixing rate of the system by the minorization time (Definition 3), which is the expected amount of time between successive Markov chain regenerations. The drift and minorization formulation has been widely used in modeling and analyzing stable queuing systems (Meyn and Down,, 1994; Bramson,, 2008). In Theorem 1, we prove that in the setting of uniformly ergodic Markov chains with finite state space, the minorization time is equivalent to the total variation mixing time that is used in the sample complexity of mixing MDP literature (Wang,, 2017; Jin and Sidford,, 2021). Moreover, this formulation directly generalizes to infinite state spaces, foreshadowing a generalization of our results to general state space MDPs.

Secondly, we establish algorithms and worst case sample complexity upper bounds under three different mixing assumptions using Wainwright’s variance-reduced Q-learning (Wainwright, 2019b) algorithmic framework. We begin with a general setting where there exists an optimal policy such that an upper bound on the minorization time is available. Next, we require that there is a uniform upper bound on the minorization times of the class of optimal policies. Finally, we consider the case where the minorization times of all policies have a uniform upper bound. Our worst case sample complexity guarantees in these three contexts are summarized in Table 1. Note that in all three cases the dependence on the effective horizon is (1γ)2(1-\gamma)^{-2}.

Thirdly, we introduce the non-asymptotic minimax risk of the class of MDPs with a uniform minorization time upper bound (Definition 4). We compare this definition of minimax risk to the instance dependent version in Khamaru et al., 2021b and find ours easier to interpret. Moreover, we prove the same lower bounds in Table 1 for these two definitions of the non-asymptotic minimax risks by constructing a family of hard MDP instances (Definition 5) indexed by the discount and the minorization time. This allows us to conclude that our sample complexity upper bounds are optimal up to log factors.

Mixing Assumption Sample Complexity Relevant Theorems
One optimal policy (Assumption 1) O~(|S||A|tmix2(1γ)2ϵ2)\tilde{O}(|S||A|{t_{\mathrm{mix}}}^{2}(1-\gamma)^{-2}\epsilon^{-2}) Theorem 2
All optimal policies (Assumption 2, 3) O~(|S||A|tmix(1γ)2ϵ2)\tilde{O}(|S||A|{t_{\mathrm{mix}}}(1-\gamma)^{-2}\epsilon^{-2}) Theorem 3
All policies (Assumption 4) O~(|S||A|tmix(1γ)2ϵ2)\tilde{O}(|S||A|{t_{\mathrm{mix}}}(1-\gamma)^{-2}\epsilon^{-2}) Theorem 4
Lower Bound: |S||A|=O(1)|S||A|=O(1) Ω(tmix(1γ)2ϵ2)\Omega({t_{\mathrm{mix}}}(1-\gamma)^{-2}\epsilon^{-2}) Theorems 5, 6
Table 1: Summary of sample complexity bounds, where tmix(1γ)1{t_{\mathrm{mix}}}\leq(1-\gamma)^{-1} is the total variation mixing time for uniformly ergodic chains, cf. Theorem 1.

In this paper, we assume the availability of a simulator, also known as a generative model, that can generate independent state transitions for any input state-action pair. Given the significant memory and computational demands associated with vast state and action spaces, we adopt a model-free approach in which the algorithm only maintains a state-action value function, or qq-function. Our learning objective is to achieve an estimator of the optimal qq^{*} function within ϵ\epsilon absolute error in the sup norm. Our algorithm and analysis achieve the optimal non-asymptotic minimax complexity in this regard. Notably, the variance bounds established in this work can be readily applied to the analysis of model-based algorithms, such as Azar et al., (2013), and the same optimal sample complexity upper bounds as in this work should follow.

1.1 Literature Review

  • Markov Stability and Regeneration: There is a rich literature on the stability of Markovian stochastic systems using the regeneration idea (Athreya and Ney,, 1978; Nummelin,, 1978; Meyn and Down,, 1994; Bramson,, 2008; Meyn et al.,, 2009). It has also been an important theoretical tool for the design and analysis of simulation (Crane and Lemoine,, 1977; Henderson and Glynn,, 2001), statistics (Gilks et al.,, 1998), and machine learning (Glynn and Ormoneit,, 2002; Lemańczyk,, 2021) procedures. We provide a brief review of this literature since we study the mixing behavior of MDPs using these techniques.

  • Sample Complexity of Discounted Tabular RL: The worst case sample complexity theory of tabular RL has been extensively studied in recent years. There are two types of modeling environments that inspire the model-based and model-free approaches to RL algorithmic design. In a model-based approach, the controller accumulates a data set to construct an empirical model of the underlying MDP and solves it by variants of the dynamic programming principle. Azar et al., (2013); Sidford et al., (2018); Agarwal et al., (2020); Li et al., (2022) propose algorithms and prove optimal upper bounds (the matching lower bound is proved in Azar et al., (2013)) Θ~(|S||A|(1γ)3ϵ2)\tilde{\Theta}(|S||A|(1-\gamma)^{-3}\epsilon^{-2}) on the sample complexity to achieve ϵ\epsilon error using the model-based approach. On the other hand, the model-free approach only maintains lower dimensional statistics of the transition data and iteratively updates them. The celebrated Q-learning (Watkins and Dayan,, 1992) and its generalizations are examples of model-free algorithms. Li et al., (2021) show that Q-learning has a minimax sample complexity Θ~(|S||A|(1γ)4ϵ2)\tilde{\Theta}(|S||A|(1-\gamma)^{-4}\epsilon^{-2}). Nevertheless, Wainwright, 2019b proposes a variance-reduced variant of QQ-learning that achieves the aforementioned model-based sample complexity lower bound Θ~(|S||A|(1γ)3ϵ2)\tilde{\Theta}(|S||A|(1-\gamma)^{-3}\epsilon^{-2}). Recent advances in the sample complexity theory of Q-learning and its variants are propelled by breakthroughs in the finite time analysis of stochastic approximation (SA) algorithms. Wainwright, 2019a proves a sample path bound for the SA recursion where the random operator (as opposed to its expectation) is monotone and a contraction. These strong assumptions yield path-wise error bounds that enable the variance reduction techniques that help to achieve optimal complexity in Wainwright, 2019b . In comparison, Chen et al., (2020) establishes finite sample guarantees of SA only under a second moment bound on the martingale difference noise sequence.

    The worst case analysis provides guarantees on the convergence rate across all instances of γ\gamma-discounted MDPs. Notably, instances that reach the complexity lower bound must involve a transition kernel and reward function that depends upon γ\gamma. In contrast, Khamaru et al., 2021b delve into instance-dependent settings, where the transition kernel and reward function remain fixed, yielding matching sample complexity upper and lower bounds. Furthermore, Wang et al., (2023) explore an intermediate scenario where MDPs are assumed to possess an upper bound on their mixing time. This intermediate setting holds particular significance to this paper’s main objective, as elaborated upon in the introduction.

  • Sample Complexity of Average Reward RL: Mixing MDPs arise naturally in RL settings where the controller’s objective is to maximize the long run average reward per unit of time achieved by the control policy. Wang, (2017); Jin and Sidford, (2020) propose algorithms to solve the policy learning problem directly from the average reward MDP model, achieving a complexity dependence O~(|S||A|tmix2ϵ2)\tilde{O}(|S||A|{t_{\mathrm{mix}}}^{2}\epsilon^{-2}). On the other hand, Jin and Sidford, (2021); Wang et al., (2022) use a discounted MDP with large enough γ\gamma to approximate the average reward MDP and prove worst case complexity bounds O~(|S||A|tmixϵ3)\tilde{O}(|S||A|{t_{\mathrm{mix}}}\epsilon^{-3}). Notably, the contemporary work Wang et al., (2022) only makes assumptions on the optimal qq-function, which, in the worst case, is equivalent to assuming an upper bound on tmix{t_{\mathrm{mix}}} of one of the optimal policies, whereas the other aforementioned works assume a uniform bound on tmix{t_{\mathrm{mix}}} over all policies. Our results in the discounted case echo this behavior: assuming mixing of only one of the optimal policies already leads to a reduction in the complexity.

2 Markov Decision Processes: Definitions

Let =(S,A,r,P,γ)\mathcal{M}=(S,A,r,P,\gamma) be an MDP, where SS and AA are finite state and action spaces, r:|S|×|A|[0,1]r:|S|\times|A|\rightarrow[0,1] the reward function, P:|S|×|A|𝒫(S,2S)P:|S|\times|A|\rightarrow\mathcal{P}(S,2^{S}) the transition kernel. γ(0,1)\gamma\in(0,1) is the discount factor. We assume that S,A,r,γS,A,r,\gamma are known, and we can draw samples from the transition measures {Ps,a,sS,aA}\left\{{P_{s,a},s\in S,a\in A}\right\} independently through a sampler (generative model).

Let Ω=(|S|×|A|)0\Omega=(|S|\times|A|)^{\mathbb{Z}_{\geq 0}} and let \mathcal{F} be the product σ\sigma-field, so that (\Omega,\mathcal{F}) is the underlying measurable space. Define the stochastic process {(St,At),t0}\left\{{(S_{t},A_{t}),{t\geq 0}}\right\} by the point evaluation St(ω)=st,At(ω)=atS_{t}(\omega)=s_{t},A_{t}(\omega)=a_{t} for all t0t\geq 0 for any ω=(s0,a0,s1,a1,)Ω\omega=(s_{0},a_{0},s_{1},a_{1},\dots)\in\Omega. At time t0t\geq 0 and state stSs_{t}\in S, if action atAa_{t}\in A is chosen, the decision maker will receive a reward r(s_{t},a_{t}), and then the law of the subsequent state is determined by the transition kernel (St+1|S0=s0,,St=st,At=at)=Pst,at()\mathcal{L}(S_{t+1}|S_{0}=s_{0},\dots,S_{t}=s_{t},A_{t}=a_{t})=P_{s_{t},a_{t}}(\cdot). It is well known that to achieve optimal decision making in the context of infinite horizon discounted MDPs (to be introduced), it suffices to consider policies that are stationary, Markov, and deterministic. Therefore, we will restrict ourselves to consider only this class of policies.

Let Π\Pi denote the class of stationary Markov deterministic policies; i.e. any πΠ\pi\in\Pi can be seen as a function π:SA\pi:S\rightarrow A. For πΠ\pi\in\Pi and initial distribution μ𝒫(S,2S)\mu\in\mathcal{P}(S,2^{S}), there is a probability measure QμπQ_{\mu}^{\pi} on the product space s.t. the chain {(St,At),t0}\left\{{(S_{t},A_{t}),t\geq 0}\right\} has finite dimensional distributions

Qμπ(S0=s0,A0=a0,,At=at)=μ(s0)Ps0,π(s0)(s1)Ps1,π(s1)(s2)Pst1,π(st1)(st)𝟙{π(si)=ai,it}.Q^{\pi}_{\mu}(S_{0}=s_{0},A_{0}=a_{0},\dots,A_{t}=a_{t})=\mu(s_{0})P_{s_{0},\pi(s_{0})}(s_{1})P_{s_{1},\pi(s_{1})}(s_{2})\dots P_{s_{t-1},\pi(s_{t-1})}(s_{t})\mathds{1}\left\{{\pi(s_{i})=a_{i},i\leq t}\right\}.

Note that this also implies that {St,t0}\left\{{S_{t},t\geq 0}\right\} is a Markov chain under QμπQ_{\mu}^{\pi}. Let EμπE_{\mu}^{\pi} denote the expectation under QμπQ_{\mu}^{\pi}. For μ\mu with full support SS, the value function vπ(s)v^{\pi}(s) is defined via

vπ(s):=Eμπ[t=0γtr(St,At)|S0=s],v^{\pi}(s):=E_{\mu}^{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r(S_{t},A_{t})\middle|S_{0}=s\right],

while the optimal value function is given by

v(s):=maxπΠvπ(s),v^{*}(s):=\max_{\pi\in\Pi}v^{\pi}(s), (1)

sS\forall s\in S. It can be seen as a vector v|S|v^{*}\in\mathbb{R}^{|S|}. Let Π\Pi^{*} denote the set of all optimal stationary Markov deterministic policies; i.e. πΠ\pi\in\Pi that achieves the maximum in (1).

Let Ps,a[v]P_{s,a}[v] denote the sum sSPs,a(s)v(s)\sum_{s^{\prime}\in S}P_{s,a}(s^{\prime})v(s^{\prime}). It is well known that the optimal value function is the unique solution of the following Bellman equation:

v(s)=maxaA(r(s,a)+γPs,a[v]).v^{*}(s)=\max_{a\in A}\left(r(s,a)+\gamma P_{s,a}[v^{*}]\right). (2)

To learn an optimal policy πΠ\pi^{*}\in\Pi^{*}, it is useful to introduce the qq-function: For any policy πΠ\pi\in\Pi,

qπ(s,a):=r(s,a)+γPs,a[vπ],q^{\pi}(s,a):=r(s,a)+\gamma P_{s,a}[v^{\pi}],

which can be identified with a vector qπ|S|×|A|q^{\pi}\in\mathbb{R}^{|S|\times|A|}. Define the notation v(q)(s)=maxaAq(s,a)v(q)(s)=\max_{a\in A}q(s,a). When π=π\pi=\pi^{*} achieves the maximum in (1), we denote the corresponding optimal qq-function qπq^{\pi^{*}} by qq^{*} and hence v(q)=vv(q^{*})=v^{*}. Note that qq^{*} is the unique solution to the Bellman equation for the qq-function:

q(s,a)\displaystyle q^{*}(s,a) =r(s,a)+γPs,a[maxbAq(,b)]\displaystyle=r(s,a)+\gamma P_{s,a}\left[\max_{b\in A}q^{*}(\cdot,b)\right] (3)
=r(s,a)+γPs,a[v(q)]\displaystyle=r(s,a)+\gamma P_{s,a}\left[v(q^{*})\right]
=𝒯(q)(s,a)\displaystyle=\mathcal{T}(q^{*})(s,a)

where the mapping

𝒯:q(s,a)r(s,a)+γPs,a[v(q)],(s,a)S×A\mathcal{T}:q(s,a)\rightarrow r(s,a)+\gamma P_{s,a}\left[v(q)\right],\forall(s,a)\in S\times A (4)

is known as the Bellman operator for the qq-function. Here we list two elementary properties of the Bellman operator and the qq-function. The proofs can be found in Puterman, (1994). First, 𝒯\mathcal{T} is a γ\gamma-contraction in the \|\cdot\|_{\infty}-norm; i.e. 𝒯(q)𝒯(q)γqq\|\mathcal{T}(q)-\mathcal{T}(q^{\prime})\|_{\infty}\leq\gamma\|q-q^{\prime}\|_{\infty} for all q,q|S|×|A|q,q^{\prime}\in\mathbb{R}^{|S|\times|A|}. Then, since q^{*}=\mathcal{T}(q^{*}) and \|\mathcal{T}(0)\|_{\infty}\leq 1, a consequence of 𝒯\mathcal{T} being a γ\gamma-contraction is

q1/(1γ).\|q^{*}\|_{\infty}\leq 1/(1-\gamma).

A greedy policy from a qq-function is defined as

π(q)(s)=argmaxaAq(s,a),\pi(q)(s)=\underset{a\in A}{\operatorname{arg}\,\operatorname{max}}\;q(s,a),

where one action is picked if there is a tie. It is easy to see that Ππ=argmaxaAq(,a)=π(q)\Pi^{*}\ni\pi^{*}=\arg\max_{a\in A}q^{*}(\cdot,a)=\pi(q^{*}). Therefore, policy learning can be achieved if we can learn a good estimate of qq^{*}. For πΠ\pi\in\Pi, we use Pπ:|S||A||S||A|P^{\pi}:\mathbb{R}^{|S||A|}\rightarrow\mathbb{R}^{|S||A|} to denote the linear operator

Pπ:q(s,a)sSPs,a(s)q(s,π(s)),(s,a)S×A.P^{\pi}:q(s,a)\rightarrow\sum_{s^{\prime}\in S}P_{s,a}(s^{\prime})q(s^{\prime},\pi(s^{\prime})),\forall(s,a)\in S\times A.
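To make these definitions concrete, the following is a minimal Python sketch (ours, not the authors' code) of the Bellman operator (4), the operator P^{\pi}, and the greedy policy, assuming the kernel is stored as an array P of shape (|S|, |A|, |S|), the reward as r of shape (|S|, |A|), and a qq-function as q of shape (|S|, |A|); all function names are illustrative.

import numpy as np

def bellman_operator(q, r, P, gamma):
    # T(q)(s, a) = r(s, a) + gamma * P_{s,a}[v(q)], where v(q)(s') = max_b q(s', b).
    v = q.max(axis=1)
    return r + gamma * P @ v              # (|S|,|A|,|S|) @ (|S|,) -> (|S|,|A|)

def apply_P_pi(q, P, pi):
    # (P^pi q)(s, a) = sum_{s'} P_{s,a}(s') q(s', pi(s')) for a deterministic policy pi: S -> A.
    q_pi = q[np.arange(q.shape[0]), pi]   # the vector s' -> q(s', pi(s'))
    return P @ q_pi

def greedy_policy(q):
    # pi(q)(s) = argmax_a q(s, a), ties broken by the smallest action index.
    return q.argmax(axis=1)

For instance, iterating q = bellman_operator(q, r, P, gamma) is exact qq-value iteration, which converges geometrically at rate γ\gamma by the contraction property of the Bellman operator recalled above.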

Under suitable mixing conditions as presented in Puterman, (1994), it is useful to look at the span semi-norm of the value and qq-functions. For vector vV=dv\in V=\mathbb{R}^{d}, let ee be the vector with all entries equal to one and define

|v|span\displaystyle\left|v\right|_{\mathrm{span}} :=infcvce\displaystyle:=\inf_{c\in\mathbb{R}}\|v-ce\|_{\infty}
=max1idvimin1idvi.\displaystyle=\max_{1\leq i\leq d}v_{i}-\min_{1\leq i\leq d}v_{i}.

Note that ||span\left|\cdot\right|_{\mathrm{span}} is not a norm because |e|span=0\left|e\right|_{\mathrm{span}}=0, but it is an induced norm on the quotient space V\{ce:c}V\backslash\left\{{ce:c\in\mathbb{R}}\right\}. We also note that |v|span2v\left|v\right|_{\mathrm{span}}\leq 2\|v\|_{\infty} for all vVv\in V. However, ||span\left|\cdot\right|_{\mathrm{span}} and \left\|\cdot\right\|_{\infty} are not equivalent on d\mathbb{R}^{d}.

Naturally, the variance of the cumulative reward induced by the optimal policy plays a key role in controlling the sample complexity of solving the MDP. Given a stationary deterministic Markov policy π\pi, let us define and denote the variance of the cumulative reward by

Ψπ(s,a):=Vars,aπ(k=0γkrπ(Sk)).\Psi^{\pi}(s,a):=\mathrm{Var}_{s,a}^{\pi}\left(\sum_{k=0}^{\infty}\gamma^{k}r_{\pi}(S_{k})\right). (5)

Another standard deviation parameter of interest to us is

σ(qπ)(s,a):=γ(Ps,a[v(qπ)2]Ps,a[v(qπ)]2).\sigma(q^{\pi})(s,a):=\gamma\sqrt{(P_{s,a}[v(q^{\pi})^{2}]-P_{s,a}[v(q^{\pi})]^{2})}. (6)

We call this object the “1-sample standard deviation” of the Bellman operator. It is the sampling standard deviation of the so-called 1-sample Bellman operator, which we will introduce in equation (11) in the algorithm analysis section. To analyze Ψπ\Psi^{\pi} and σ(qπ)\sigma(q^{\pi}), it is useful to define

σπ(s,a):=γ(Ps,aπ[(qπ)2]Ps,aπ[qπ]2).\sigma^{\pi}(s,a):=\gamma\sqrt{(P_{s,a}^{\pi}[(q^{\pi})^{2}]-P_{s,a}^{\pi}[q^{\pi}]^{2})}. (7)

Note that for any optimal policy π\pi^{*}, v(q)=vv(q^{*})=v^{*}. So, σ(q)=σπ\sigma(q^{*})=\sigma^{\pi^{*}}.
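As an illustration, with the array conventions of the preceding sketch, the one-sample standard deviation (6) can be evaluated exactly from the model; this is only a hedged sketch, and the clipping of small negative values guards against floating-point round-off.

import numpy as np

def sigma_q(q, P, gamma):
    # sigma(q)(s, a) = gamma * sqrt( P_{s,a}[v(q)^2] - (P_{s,a}[v(q)])^2 ), equation (6).
    v = q.max(axis=1)
    second_moment = P @ (v ** 2)          # P_{s,a}[v(q)^2]
    first_moment_sq = (P @ v) ** 2        # (P_{s,a}[v(q)])^2
    return gamma * np.sqrt(np.maximum(second_moment - first_moment_sq, 0.0))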

3 Uniformly Ergodic Markov Chains and Mixing Metrics

Let \mathcal{M} be an MDP. Given a stationary Markov deterministic policy πΠ\pi\in\Pi, the state process {St,t0}\left\{{S_{t},t\geq 0}\right\} will have transition kernel Pπ(s,s)=Ps,π(s)(s)P_{\pi}(s,s^{\prime})=P_{s,\pi(s)}(s^{\prime}). In various applications, the state process induced by a reasonable policy will converge to a unique steady state regardless of the initial condition. In this section, we will formally characterize such behavior by introducing the notion of uniform ergodicity. A uniformly ergodic Markov chain is one for which the marginal distribution of StS_{t} converges to its stationary distribution in total variation distance with a uniform rate across all initial conditions. Recall that the total variation distance between two probability measures μ,ν\mu,\nu on 2S2^{S} is

μνTV:=supA2S|μ(A)ν(A)|.\left\|\mu-\nu\right\|_{\mathrm{TV}}:=\sup_{A\in 2^{S}}|\mu(A)-\nu(A)|.

We use the norm notation here because it is equivalent to the l1l_{1} norm if we view μ\mu and ν\nu as vectors in |S|\mathbb{R}^{|S|}; i.e. μνTV=21μν1\left\|\mu-\nu\right\|_{\mathrm{TV}}=2^{-1}\|\mu-\nu\|_{1}.

Uniform ergodicity is equivalent to the transition kernel satisfying the Doeblin condition (Meyn et al.,, 2009). This equivalence permits the application of the split chain representation (Athreya and Ney,, 1978; Nummelin,, 1978) of the Markov chain, which exhibits favorable independence properties, as illustrated in Section 3.2. Thus, the Doeblin condition will serve as our principal assumption and theoretical tool for the analysis of the sample complexity of mixing MDPs. To characterize the rate of mixing of a uniformly ergodic Markov chain, two metrics, namely tmix{t_{\mathrm{mix}}} and tminorizet_{\mathrm{minorize}}, emerge naturally from the total variation and Doeblin condition characterizations. We will introduce these metrics in Section 3.3. While tmix{t_{\mathrm{mix}}} is more commonly utilized in the literature on MDP complexity (Wang,, 2017; Jin and Sidford,, 2020, 2021), we have found tminorizet_{\mathrm{minorize}} to be a more suitable metric for our objectives. To facilitate the comparison of our complexity theories with those in the literature, we establish the equivalence of tmix{t_{\mathrm{mix}}} and tminorizet_{\mathrm{minorize}} up to a constant factor in Section 3.3.

3.1 Uniform Ergodicity

Definition 1 (Uniform Ergodicity).

A Markov chain {St,t0}\left\{{S_{t},t\geq 0}\right\} with transition kernel PP is called uniformly ergodic if there exists a probability measure η\eta for which

supsSPn(s,)ηTV0\sup_{s\in S}\left\|P^{n}(s,\cdot)-\eta\right\|_{\mathrm{TV}}\rightarrow 0\quad\text{as }n\rightarrow\infty.

Note that η\eta must be the unique stationary distribution of PP: By uniform ergodicity (ηPn+1)(s)η(s)(\eta P^{n+1})(s)\rightarrow\eta(s) and (ηPnP)(s)(ηP)(s)(\eta P^{n}P)(s)\rightarrow(\eta P)(s) for all sSs\in S as nn\rightarrow\infty; i.e. η=ηP\eta=\eta P. Let ν\nu be any stationary distribution of PP. Then, νPn=ν\nu P^{n}=\nu and (νPn)(s)η(s)0(\nu P^{n})(s)-\eta(s)\rightarrow 0 for all sSs\in S; i.e. ν=η\nu=\eta.

Definition 2 (Doeblin Minorization Condition).

A transition kernel PP satisfies the (m,p)(m,p)-Doeblin condition for some m>0m\in\mathbb{Z}_{>0} and p(0,1]p\in(0,1] if there exists a probability measure ψ\psi and stochastic kernel RR s.t.

Pm(s,s)=pψ(s)+(1p)R(s,s).P^{m}(s,s^{\prime})=p\psi(s^{\prime})+(1-p)R(s,s^{\prime}). (8)

for all s,sSs,s^{\prime}\in S. The measure ψ\psi is called the minorization measure and the stochastic kernel RR is called the residual kernel.

Note that if PP is (m,p)(m,p)-Doeblin, then it must be (m,q)(m,q)-Doeblin for any 0<qp0<q\leq p:

Pm=pψ+(1p)R=qψ+(1p)R+(pq)1ψP^{m}=p\psi+(1-p)R=q\psi+(1-p)R+(p-q)1\otimes\psi,

where 1\otimes\psi denotes the kernel whose rows all equal \psi, and the residual part (1-p)R+(p-q)1\otimes\psi, once normalized by 1-q, is again a stochastic kernel.
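For a finite state space, a valid (m,p)-Doeblin decomposition at a fixed m can be read off from the m-step kernel by taking elementwise column minima, and this choice maximizes p for that m. The following Python sketch (our illustration, not the paper's code) extracts p, \psi, and R as in (8).

import numpy as np

def doeblin_decomposition(P_pi, m):
    # Largest p with P_pi^m(s, .) >= p * psi(.) for all s, together with psi and the residual R of (8).
    Pm = np.linalg.matrix_power(P_pi, m)
    lower = Pm.min(axis=0)                # s' -> min_s P^m(s, s')
    p = lower.sum()
    if p == 0.0:
        return 0.0, None, None            # no minorization at this m
    psi = lower / p                       # minorization measure
    if p < 1.0:
        R = (Pm - p * psi[None, :]) / (1.0 - p)   # residual kernel; rows are nonnegative and sum to 1
    else:
        R = np.tile(psi, (Pm.shape[0], 1))        # degenerate case: every row of P^m equals psi
    return p, psi, R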

The following result explains the equivalence between the Doeblin condition, uniform ergodicity, and uniform geometric ergodicity.

Theorem UE (Theorem 16.0.2 in Meyn et al., (2009)).

The following statements are equivalent:

  1. 1.

    There exists n<n<\infty s.t.

    supsSPn(s,)η()TV14\sup_{s\in S}\left\|P^{n}(s,\cdot)-\eta(\cdot)\right\|_{\mathrm{TV}}\leq\frac{1}{4}
  2. 2.

    {St,t0}\left\{{S_{t},t\geq 0}\right\} is uniformly ergodic.

  3. 3.

    There exists r>1r>1 and c<c<\infty s.t. for all nn

    supsSPn(s,)η()TVcrn.\sup_{s\in S}\left\|P^{n}(s,\cdot)-\eta(\cdot)\right\|_{\mathrm{TV}}\leq cr^{-n}.
  4. 4.

    PP is (m,p)(m,p)-Doeblin for some m>0m\in\mathbb{Z}_{>0} and p(0,1]p\in(0,1].

Moreover, we have

supsSPn(s,)η()TV2(1p)n/m.\sup_{s\in S}\left\|P^{n}(s,\cdot)-\eta(\cdot)\right\|_{\mathrm{TV}}\leq 2(1-p)^{\left\lfloor{n/m}\right\rfloor}. (9)

Theorem UE implies the existence of a Doeblin minorization structure of any uniformly ergodic Markov kernel. As we will discuss in the next section, the Doeblin condition will allow us to use the split chain technique, which underlies our analysis of the sample complexity of mixing MDPs.

3.2 Doeblin Condition, Split Chain, and Regeneration

In this section, we introduce the split chain of a uniformly ergodic chain {St,t0}\left\{{S_{t},t\geq 0}\right\}. Athreya and Ney, (1978) and Nummelin, (1978) independently realized that a Markov chain with transition kernel admitting a decomposition of the form (8) can be simulated as follows:

  1. 1.

    Start from t=0t=0, and generate S^{*}_{0}\sim\mu, the initial distribution.

  2. 2.

    At time tt, flip a coin Bt+mB_{t+m} with success probability P(Bt+m=1)=pP(B_{t+m}=1)=p.

  3. 3.

    If Bt+m=1B_{t+m}=1, generate St+mψS^{*}_{t+m}\sim\psi; if not, generate St+mR(St,)S^{*}_{t+m}\sim R(S^{*}_{t},\cdot).

  4. 4.

    Generate St+1,,St+m1S^{*}_{t+1},\dots,S^{*}_{t+m-1} from the conditional measure P(|St,St+m)P(\cdot|S^{*}_{t},S^{*}_{t+m}).

  5. 5.

    Set tt+mt\leftarrow t+m and go back to step 2.

The process {St,t0}\left\{{S^{*}_{t},t\geq 0}\right\} is known as the split chain. It can be shown that the original process {St,t0}\left\{{S_{t},t\geq 0}\right\} has the same law as {St,t0}\left\{{S_{t}^{*},t\geq 0}\right\}. Define the wide sense regeneration times τ0=0\tau_{0}=0, τj+1=inf{t>τj:Bt=1}\tau_{j+1}=\inf\left\{{t>\tau_{j}:B_{t}=1}\right\} and (wide sense) regeneration cycles Wj+1=(Sτj,,Sτj+11,Tj+1)W_{j+1}=(S_{\tau_{j}}^{*},\dots,S_{\tau_{j+1}-1}^{*},T_{j+1}) where Tj+1=τj+1τjT_{j+1}=\tau_{j+1}-\tau_{j} for j1j\geq 1. This construction has the following implications:

  • SτjψS_{\tau_{j}}^{*}\sim\psi and Tj=𝑑mGT_{j}\overset{d}{=}mG where GGeom(p)G\sim\mathrm{Geom}(p) for all j1j\geq 1.

  • The regeneration cycles are 1-dependent in the sense that {Wj:1jk}\left\{{W_{j}:1\leq j\leq k}\right\} and {Wj:k+2j}\left\{{W_{j}:k+2\leq j}\right\} are independent for all k1k\geq 1.

  • The regeneration cycles {Wj,j2}\left\{{W_{j},j\geq 2}\right\} are identically distributed.

See Asmussen et al., (2003) for a detailed exposition. Because of these properties, the process SS^{*} is easier to work with. So, moving forward, whenever we analyze a (m,p)(m,p)-Doeblin kernel PP, the process {St,t0}\left\{{S_{t},t\geq 0}\right\} will refer to the split chain. Also, since {τj}\left\{{\tau_{j}}\right\} are not stopping times w.r.t. the natural filtration generated by {St}\left\{{S_{t}}\right\}, we define t=σ({Sj:jt},{Bmj:mjt})\mathcal{F}_{t}=\sigma(\left\{{S_{j}:j\leq t}\right\},\left\{{B_{mj}:mj\leq t}\right\}). Clearly, {τj}\left\{{\tau_{j}}\right\} are t\mathcal{F}_{t}-stopping times and {St}\left\{{S_{t}}\right\} is Markov w.r.t. t\mathcal{F}_{t} as well.
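The simulation recipe above is straightforward to implement for a finite chain. The following Python sketch treats the case m = 1 (so the conditional fill-in of step 4 is vacuous), assuming the kernel is (1,p)-Doeblin and reusing the doeblin_decomposition sketch from Section 3.1; it returns the simulated trajectory together with the wide sense regeneration times, whose successive gaps are Geometric(p) distributed, in line with the first property above (with m = 1).

import numpy as np

def simulate_split_chain(P_pi, mu, T, seed=0):
    rng = np.random.default_rng(seed)
    p, psi, R = doeblin_decomposition(P_pi, m=1)   # assumes p > 0 at m = 1
    n = P_pi.shape[0]
    states = np.zeros(T, dtype=int)
    regen_times = []                               # wide sense regeneration times tau_j
    states[0] = rng.choice(n, p=mu)                # S*_0 ~ mu
    for t in range(1, T):
        if rng.random() < p:                       # coin B_t with success probability p
            states[t] = rng.choice(n, p=psi)       # regeneration: S*_t ~ psi
            regen_times.append(t)
        else:
            states[t] = rng.choice(n, p=R[states[t - 1]])   # S*_t ~ R(S*_{t-1}, .)
    return states, regen_times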

3.3 Two Mixing Metrics and Their Equivalence

The favorable characteristics of the split chain will facilitate the analysis of the behavior of the MDP under mixing policies. Consequently, we aim to study the minimax sample complexity of policy learning for an MDP by postulating the existence of a (m,p)(m,p)-Doeblin condition. This naturally leads to the quantification of the mixing intensity of the MDP by means of the minorization time; see below. However, in the literature of sample complexity theory for mixing MDPs (Wang,, 2017; Jin and Sidford,, 2020, 2021), the intensity of mixing is usually quantified in terms of the mixing time. In this section, we formally introduce these two measures of mixing. Furthermore, we establish that, for a uniformly ergodic Markov chain, tmix{t_{\mathrm{mix}}} and tminorizet_{\mathrm{minorize}} are equivalent up to a multiplicative constant. This equivalence enables a direct comparison of our complexity outcomes, which use tminorizet_{\mathrm{minorize}}, with the existing theories.

Let PP be a uniformly ergodic stochastic kernel with stationary distribution η\eta. By Theorem UE, there exists (m,p,ψ,R)(m,p,\psi,R) s.t. Pm=pψ+(1p)RP^{m}=p\psi+(1-p)R.

Definition 3 (Mixing and Minorization Times).

Define the mixing time

tmix(P):=inf{t1:supsSPt(s,)η()TV14},{t_{\mathrm{mix}}}(P):=\inf\left\{{t\geq 1:\sup_{s\in S}\left\|P^{t}(s,\cdot)-\eta(\cdot)\right\|_{\mathrm{TV}}\leq\frac{1}{4}}\right\},

and the minorization time

tminorize(P)=inf{m/p:infsSPm(s,)pϕ() for some ϕ𝒫(2S)}.t_{\mathrm{minorize}}(P)=\inf\left\{{m/p:\inf_{s\in S}P^{m}(s,\cdot)\geq p\phi(\cdot)\text{ for some }\phi\in\mathcal{P}(2^{S})}\right\}.

We will drop the PP dependence when it is clear.
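Both quantities can be computed by brute force for a small finite kernel, which is useful for sanity checking the equivalence stated in Theorem 1 below. The following Python sketch (ours, with a finite search horizon as an explicit assumption) evaluates both metrics directly from Definition 3; in practice a modest search range for m suffices, since p_{m}\rightarrow 1 and m/p_{m} eventually grows linearly in m.

import numpy as np

def stationary_distribution(P_pi):
    # Left eigenvector of P_pi at eigenvalue 1, normalized to a probability vector.
    w, v = np.linalg.eig(P_pi.T)
    eta = np.real(v[:, np.argmin(np.abs(w - 1.0))])
    return eta / eta.sum()

def t_mix(P_pi, t_max=10_000):
    # Smallest t with sup_s ||P^t(s, .) - eta||_TV <= 1/4, using ||.||_TV = 0.5 * l1.
    eta = stationary_distribution(P_pi)
    Pt = np.eye(P_pi.shape[0])
    for t in range(1, t_max + 1):
        Pt = Pt @ P_pi
        if 0.5 * np.abs(Pt - eta[None, :]).sum(axis=1).max() <= 0.25:
            return t
    return None

def t_minorize(P_pi, m_max=200):
    # min over m <= m_max of m / p_m, where p_m = sum_{s'} min_s P^m(s, s').
    best = np.inf
    Pm = np.eye(P_pi.shape[0])
    for m in range(1, m_max + 1):
        Pm = Pm @ P_pi
        p = Pm.min(axis=0).sum()
        if p > 0.0:
            best = min(best, m / p)
    return best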

Next, we aim to demonstrate the equivalence between these two mixing times up to a constant factor.

Theorem 1.

The two notions of mixing time tminorizet_{\mathrm{minorize}} and tmix{t_{\mathrm{mix}}} are equivalent up to constants: If PP is uniformly ergodic, then tminorize22tmix22log(16)tminorizet_{\mathrm{minorize}}\leq 22{t_{\mathrm{mix}}}\leq 22\log(16)t_{\mathrm{minorize}}.

We note that the proof idea for the direction tminorizectmixt_{\mathrm{minorize}}\leq c{t_{\mathrm{mix}}} follows from Aldous et al., (1997), while tmixCtminorize{t_{\mathrm{mix}}}\leq Ct_{\mathrm{minorize}} follows from Theorem UE. We defer the proof to the appendix.

Therefore, for a uniformly ergodic chain, quantitative knowledge of the mixing time and the “best” minorization are equivalent up to constant factors. As a consequence of this equivalence between tmix{t_{\mathrm{mix}}} and tminorizet_{\mathrm{minorize}}, the complexity results highlighted in Table 1 can be stated directly in terms of tmix{t_{\mathrm{mix}}}.

Other than its theoretical convenience, tminorizet_{\mathrm{minorize}} could be a more explicit metric than tmix{t_{\mathrm{mix}}} in the control of stochastic systems. For instance, consider a queueing system for which the sources of randomness are fully exogenous. A long inter-arrival time will imply that the system becomes empty and the state process regenerates. Therefore, we can obtain a natural upper bound on tminorizet_{\mathrm{minorize}} from the frequency of observing long inter-arrival times.

4 Uniformly Ergodic MDPs: Algorithms and Sample Complexity Upper Bounds

We have introduced uniform ergodicity, the minorization time tminorizet_{\mathrm{minorize}} as an equivalent measure of the mixing time, and the split chain technique. Using tminorizet_{\mathrm{minorize}} as the mixing criterion, in this section we explore the worst case sample complexities of learning the optimal qq-function under different stability assumptions on the controlled Markov chain.

MDPs can exhibit different types of mixing behavior. However, since our primary focus is on developing a complexity theory for learning the optimal qq-function, we seek algorithms that obtain optimal policies by producing progressively more precise estimates of qq^{*} as the sample size increases. Consequently, the asymptotic variability of the estimator should be influenced by the mixing characteristics of the optimal policies. Therefore, it is well-motivated to make assumptions about the mixing properties of the class of optimal policies. We first consider the most general one:

Assumption 1.

There exists an optimal stationary Markov deterministic policy πΠ\pi^{*}\in\Pi^{*} s.t. the transition kernel PπP_{\pi^{*}} is uniformly ergodic.

In this setting, define

tminorize=tminorize(Pπ).t_{\mathrm{minorize}}^{*}=t_{\mathrm{minorize}}(P_{\pi^{*}}).

To obtain an optimal sample complexity result, we make further uniform assumptions on the minorization times of kernels induced by πΠ\pi^{*}\in\Pi^{*}. In this paper, we consider the following two settings:

Assumption 2.

Π={π}\Pi^{*}=\left\{{\pi^{*}}\right\} is a singleton. Moreover, π\pi^{*} satisfies Assumption 1.

If Π\Pi^{*} is not a singleton, we instead impose a “continuity in π\pi” assumption on PP which we will refer to as Lipschitzness:

Assumption 3.

For any πΠ\pi^{*}\in\Pi^{*}, the transition kernel PπP_{\pi^{*}} is uniformly ergodic, and the transition kernel satisfies a local Lipschitz condition: There is L>0L>0 s.t. for all q|S|×|A|q\in\mathbb{R}^{|S|\times|A|} and associated greedy policy π(q)\pi(q),

(Pπ(q)Pπ)(qq)Lqq2.\left\|\left(P^{\pi(q)}-P^{\pi^{*}}\right)(q-q^{*})\right\|_{\infty}\leq L\left\|q-q^{*}\right\|_{\infty}^{2}. (10)

In the setting of Assumption 2 and 3, we define

tminorize=maxπΠtminorize(Pπ).t_{\mathrm{minorize}}^{*}=\max_{\pi^{*}\in\Pi^{*}}t_{\mathrm{minorize}}(P_{\pi^{*}}).

We note that the settings in Assumptions 2 and 3 are considered in the Q-learning literature (Khamaru et al., 2021b, ; Li et al.,, 2023). Moreover, Assumption 3 is implied by 2 with L=4/ζL=4/\zeta where ζ\zeta is the optimality gap of the optimal policy class defined as ζ:=minsSminπΠ\Π|v(s)q(s,π(s))|\zeta:=\min_{s\in S}\min_{\pi\in\Pi\backslash\Pi^{*}}|v^{*}(s)-q^{*}(s,\pi(s))|; see Lemma B.1 of Li et al., (2023).

Even though the asymptotic performance of a convergent algorithm should only depend on the mixing characteristics of the optimal policies, for a prescribed accuracy level ϵ\epsilon, assumptions on the mixing characteristics of non-optimal policies could potentially improve the performance as well. Moreover, in applications of MDPs, we typically have little knowledge of the class of optimal policies a priori. However, as mentioned earlier, there may be settings in which all policies will induce mixing Markov chains with a uniform bound on the minorization times. This leads to the following assumption:

Assumption 4.

For all πΠ\pi\in\Pi, the transition kernel PπP_{\pi} is uniformly ergodic.

In this setting, we define

tminorize=maxπΠtminorize(Pπ).t_{\mathrm{minorize}}=\max_{\pi\in\Pi}t_{\mathrm{minorize}}(P_{\pi}).

Since Π\Pi is finite, the above max\max is always achieved and tminorize<t_{\mathrm{minorize}}<\infty.

Remark.

The minimax sample complexity of tabular MDPs is well understood when tminorize,tminorize=Ω((1γ)1)t_{\mathrm{minorize}},t_{\mathrm{minorize}}^{*}=\Omega((1-\gamma)^{-1}). We will assume the interesting case when tminorize,tminorize(1γ)1t_{\mathrm{minorize}},t_{\mathrm{minorize}}^{*}\leq(1-\gamma)^{-1}. Moreover, for the convenience of our analysis, we also assume, w.l.o.g., that pπ1/2p^{\pi}\leq 1/2.

In the following developments, we will demonstrate the impact of mixing Assumptions 1, 2, 3, and 4 on algorithmic performance, leading to an improvement in sample complexity by a factor of (1γ)1(1-\gamma)^{-1} when compared to the minimax lower bound without considering mixing time assumptions (O~((1γ)3)\tilde{O}((1-\gamma)^{-3})). In particular, the sample complexity upper bounds in Theorems 2, 3, and 4 are now enhanced to O~((1γ)2)\tilde{O}((1-\gamma)^{-2}).

4.1 Q-learning and Wainwright’s Variance Reduction

We first introduce the algorithmic frameworks that underlie our complexity results: the Q-learning algorithm and Wainwright’s variance reduction technique.

4.1.1 The Q-learning Algorithm

Define {𝒯^k,k0}\left\{{\widehat{\mathcal{T}}_{k},k\geq 0}\right\} as a sequence of i.i.d. 1-sample empirical Bellman operators: i.e. for each s,aS×As,a\in S\times A, sample SP(|s,a)S^{\prime}\sim P(\cdot|s,a) independently and assign

𝒯^k+1(q)(s,a):=r(s,a)+γmaxbAq(S,b).\widehat{\mathcal{T}}_{k+1}(q)(s,a):=r(s,a)+\gamma\max_{b\in A}q(S^{\prime},b). (11)

The synchronous Q-learning QL(k0)\text{QL}(k_{0}) with k0k_{0} iterations can be expressed in terms of these empirical Bellman operators, as displayed in Algorithm 1. It is easily seen that σ2(q)(s,a)=Var(𝒯^(q)(s,a))\sigma^{2}(q)(s,a)=\mathrm{Var}(\widehat{\mathcal{T}}(q)(s,a)). This echoes the previous definition (6).

Algorithm 1 Q-learning: QL(k0)\text{QL}(k_{0})
  Initialization: q10q_{1}\equiv 0; k=1k=1.
  for 1kk01\leq k\leq k_{0} do
     Sample 𝒯^k+1\widehat{\mathcal{T}}_{k+1} the 1-sample empirical Bellman operator.
     Compute the Q-learning update
qk+1=(1λk)qk+λk𝒯^k+1(qk)q_{k+1}=(1-\lambda_{k})q_{k}+\lambda_{k}\widehat{\mathcal{T}}_{k+1}(q_{k})
with stepsize λk=1/(1+(1γ)k)\lambda_{k}=1/(1+(1-\gamma)k).
  end for
  return  qk0+1q_{k_{0}+1}
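For concreteness, the following is a hedged Python sketch of Algorithm 1 under a generative model, reusing the array conventions of Section 2 (P of shape (|S|, |A|, |S|), r of shape (|S|, |A|)); here the simulator is emulated by inverse-CDF sampling from the known kernel, one draw per state-action pair and iteration, and the function names are ours.

import numpy as np

def sample_next_states(P, rng):
    # One draw S' ~ P(.|s, a) for every (s, a) pair, via inverse-CDF sampling.
    u = rng.random(P.shape[:2] + (1,))
    idx = (P.cumsum(axis=2) < u).sum(axis=2)
    return np.minimum(idx, P.shape[2] - 1)      # guard against floating-point round-off

def empirical_bellman(q, r, P, gamma, rng):
    # One-sample empirical Bellman operator, equation (11).
    s_next = sample_next_states(P, rng)
    return r + gamma * q[s_next].max(axis=2)    # max_b q(S', b) for each (s, a)

def q_learning(r, P, gamma, k0, rng=None):
    # QL(k0): q_{k+1} = (1 - lam_k) q_k + lam_k * T_hat_{k+1}(q_k), lam_k = 1 / (1 + (1 - gamma) k).
    rng = rng or np.random.default_rng()
    q = np.zeros(r.shape)
    for k in range(1, k0 + 1):
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)
        q = (1.0 - lam) * q + lam * empirical_bellman(q, r, P, gamma, rng)
    return q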

It turns out that the algorithms we develop in the future sections will require a reasonably good initialization. Typically, such an initialization can be achieved by first running the Q-learning Algorithm 1. Therefore, it is useful to have an error bound for this Q-learning algorithm. From Wainwright, 2019a ; Wainwright, 2019b , we have the following proposition.

Proposition 4.1 (Section 4.1.2 in Wainwright, 2019b ).

For any k01k_{0}\geq 1, let qk0+1q_{k_{0}+1} be the output of QL(k0)\text{QL}(k_{0}). Then, there exists absolute constant cc so that with probability at least 1δ1-\delta, we have

qk0+1q\displaystyle\quad\|q_{k_{0}+1}-q^{*}\|_{\infty}
q(1γ)k0+c(σ(q)(1γ)3/2k0log(2|S||A|kδ)1/2+|q|span(1γ)2k0log(2e|S||A|k0(1+(1γ)k0)δ)).\displaystyle\leq\frac{\|q^{*}\|_{\infty}}{(1-\gamma)k_{0}}+c\left(\frac{\|\sigma(q^{*})\|_{\infty}}{(1-\gamma)^{3/2}\sqrt{k_{0}}}\log\left(\frac{2|S||A|k}{\delta}\right)^{1/2}+\frac{\left|q^{*}\right|_{\mathrm{span}}}{(1-\gamma)^{2}k_{0}}\log\left(\frac{2e|S||A|k_{0}(1+(1-\gamma)k_{0})}{\delta}\right)\right).

To use this proposition, we need the following lemma, which is a consequence of Proposition 6.1.

Lemma 1.

Under Assumption 1, the following bounds hold:

|q|span3tminorizeandσ2(q)(s,a)4γ2(tminorize)2.\left|q^{*}\right|_{\mathrm{span}}\leq 3t_{\mathrm{minorize}}^{*}\quad\text{and}\quad\sigma^{2}(q^{*})(s,a)\leq 4\gamma^{2}(t_{\mathrm{minorize}}^{*})^{2}.
Remark.

Combining Proposition 4.1 and Lemma 1, we have a sample complexity bound for this Q-learning algorithm under Assumption 1: O~(|S||A|(tminorize)2ϵ2(1γ)3)\tilde{O}(|S||A|(t_{\mathrm{minorize}}^{*})^{2}\epsilon^{-2}(1-\gamma)^{-3}). Li et al., (2021) have shown that the worst case sample complexity of the Q-learning Algorithm 1 should be O~(|S||A|/(ϵ2(1γ)4))\tilde{O}(|S||A|/(\epsilon^{2}(1-\gamma)^{4})) uniformly over all MDP instances \mathcal{M} with discount factor γ\gamma. So, the sample complexity bound implied by Proposition 4.1 is not optimal when tminorize=Θ((1γ)1)t_{\mathrm{minorize}}^{*}=\Theta((1-\gamma)^{-1}).

4.1.2 Wainwright’s Variance Reduction

Li et al., (2021) established that the Q-learning Algorithm 1 is not minimax optimal when used to learn the optimal action-value function qq^{*}. In the pursuit of an optimal model-free algorithm, Wainwright, 2019b proposed a variance-reduced variant of Q-learning, as outlined in Algorithm 2.

Algorithm 2 Variance-reduced Q-learning: VRQL(q,k,l,{nl:1ll})\text{VRQL}(q,k^{*},l^{*},\left\{{n_{l}:1\leq l\leq l^{*}}\right\})
  Initialization: q^0=q\hat{q}_{0}=q; l=1l=1; k=1k=1.
  for 1ll1\leq l\leq l^{*} do
     Sample i.i.d. {𝒯~l,j:j=1,,nl}\left\{{\tilde{\mathcal{T}}_{l,j}:j=1,\dots,n_{l}}\right\} 1-sample empirical Bellman operators defined in (11).
     Compute the recentering
𝒯~l(q^l1):=1nlj=1nl𝒯~l,j(q^l1).\widetilde{\mathcal{T}}_{l}(\hat{q}_{l-1}):=\frac{1}{n_{l}}\sum_{j=1}^{n_{l}}\widetilde{\mathcal{T}}_{l,j}(\hat{q}_{l-1}).
     Set ql,1=q^l1q_{l,1}=\hat{q}_{l-1}.
     for 1kk1\leq k\leq k^{*} do
        Sample 𝒯^l,k+1\widehat{\mathcal{T}}_{l,k+1} the 1-sample empirical Bellman operator.
        Compute the recentered Q-learning update
ql,k+1=(1λk)ql,k+λk(𝒯^l,k+1(ql,k)𝒯^l,k+1(q^l1)+𝒯~l(q^l1))q_{l,k+1}=(1-\lambda_{k})q_{l,k}+\lambda_{k}\left(\widehat{\mathcal{T}}_{l,k+1}(q_{l,k})-\widehat{\mathcal{T}}_{l,k+1}(\hat{q}_{l-1})+\widetilde{\mathcal{T}}_{l}(\hat{q}_{l-1})\right) (12)
with stepsize λk=1/(1+(1γ)k)\lambda_{k}=1/(1+(1-\gamma)k).
     end for
     Set q^l=ql,k+1\hat{q}_{l}=q_{l,k^{*}+1}.
  end for
  return  q^l\hat{q}_{l^{*}}

This variant has been shown to achieve the minimax optimal sample complexity O~(|S||A|(1γ)3)\tilde{O}(|S||A|(1-\gamma)^{-3}) for any Markov decision process (MDP) instance \mathcal{M}, subject to optimal selection of the parameters q,l,{nl,1ll},kq,l^{*},\left\{{n_{l},1\leq l\leq l^{*}}\right\},k^{*} and initialization q^0\hat{q}_{0}. In the subsequent sections, we explore modifications to Algorithm 2 that enable it to attain the sample complexity upper bounds presented in Table 1.

Notably, the variance-reduced Q-learning variant generally satisfies the path-wise error bound of the form {q^lqbl,ll}\left\{{\|\hat{q}_{l}-q^{*}\|_{\infty}\leq b_{l},\forall l\leq l^{*}}\right\} with high probability, due to the empirical Bellman operators being γ\gamma-contractions as well. This desirable characteristic makes it a versatile tool for algorithmic designs.

The variance-reduced Q-learning algorithm employs an inner loop indexed by kk and an outer loop indexed by ll, where each outer loop is referred to as an epoch. Within each epoch, the inner loop executes a “recentered” version of synchronous Q-learning. By selecting a series of empirical Bellman operators 𝒯~l\widetilde{\mathcal{T}}_{l} with exponentially increasing sample sizes as a function of ll, the estimators {q^l}\left\{{\hat{q}_{l}}\right\} generated by the epochs achieve an exponentially decreasing error, namely q^lq2lb\|\hat{q}_{l}-q^{*}\|_{\infty}\leq 2^{-l}b, with high probability.
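The following Python sketch mirrors the structure of Algorithm 2, reusing sample_next_states and empirical_bellman from the sketch after Algorithm 1; the essential variance-reduction detail is that, within each inner step, the same draw of next states is used to evaluate the empirical Bellman operator at both q_{l,k} and \hat{q}_{l-1}, so that their difference has small variance when the two arguments are close. The parameter choices are left to the caller; this is an illustration under our own conventions, not the authors' implementation.

import numpy as np

def vrql(q0, r, P, gamma, k_star, l_star, n_list, rng=None):
    # Variance-reduced Q-learning (Algorithm 2); n_list[l-1] is the recentering sample size n_l.
    rng = rng or np.random.default_rng()
    q_hat = q0.copy()
    for l in range(1, l_star + 1):
        # Recentering: average n_l independent one-sample Bellman operators evaluated at q_hat.
        recenter = np.mean(
            [empirical_bellman(q_hat, r, P, gamma, rng) for _ in range(n_list[l - 1])], axis=0)
        q = q_hat.copy()
        for k in range(1, k_star + 1):
            lam = 1.0 / (1.0 + (1.0 - gamma) * k)
            s_next = sample_next_states(P, rng)              # one simulator call per (s, a)
            t_q = r + gamma * q[s_next].max(axis=2)          # T_hat_{l,k+1}(q_{l,k})
            t_qhat = r + gamma * q_hat[s_next].max(axis=2)   # T_hat_{l,k+1}(q_hat_{l-1}), same sample
            q = (1.0 - lam) * q + lam * (t_q - t_qhat + recenter)   # recentered update (12)
        q_hat = q
    return q_hat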

4.2 The General Setting: Assumption 1

In this section, we analyze the convergence and worst case sample complexity bound of Algorithm 2 under the general Assumption 1, which posits that there exists one optimal policy that induces a (m,p)(m,p)-Doeblin kernel.

We aim to demonstrate that by combining Algorithms 1 and 2 under the aforementioned notion of mixing, one can produce an estimator of qq^{*} within an ϵ\epsilon absolute error with high probability using at most O~(|S||A|(tminorize)2(1γ)2ϵ2)\tilde{O}(|S||A|(t_{\mathrm{minorize}}^{*})^{2}(1-\gamma)^{-2}\epsilon^{-2}) samples. This removes a factor of (1γ)1(1-\gamma)^{-1} from the minimax sample complexity lower and upper bounds established in Azar et al., (2013). This is quite a surprising improvement as we are only assuming that one of the optimal policies induces mixing.

The central result in this section is the following sample complexity upper bound in Theorem 2. It is an immediate consequence of Proposition 4.3.

Theorem 2.

For any MDP instance satisfying Assumption 1, the sample complexity of running the procedure specified by Proposition 4.3 with l=log2(tminorize/ϵ)l^{*}=\left\lceil{\log_{2}(t_{\mathrm{minorize}}^{*}/\epsilon)}\right\rceil to achieve an estimator of qq^{*} with absolute error ϵtminorize\epsilon\leq t_{\mathrm{minorize}}^{*} w.p. at least 1δ1-\delta is

O~(|S||A|(tminorize)2(1γ)2ϵ2max{1,ϵ2(tminorize)2(1γ)}).\tilde{O}\left(\frac{|S||A|(t_{\mathrm{minorize}}^{*})^{2}}{(1-\gamma)^{2}\epsilon^{2}}\max\left\{{1,\frac{\epsilon^{2}}{(t_{\mathrm{minorize}}^{*})^{2}(1-\gamma)}}\right\}\right).
Remark.

For ϵtminorize1γ\epsilon\leq t_{\mathrm{minorize}}^{*}\sqrt{1-\gamma}, this sample complexity bound reduces to O~(|S||A|(tminorize)2(1γ)2)\tilde{O}(|S||A|(t_{\mathrm{minorize}}^{*})^{2}(1-\gamma)^{-2}).

To arrive at this sample complexity upper bound, we analyze the statistical properties of Algorithm 2 under Assumption 1. At a high level, an upper bound on the asymptotic variance of estimating qq^{*} using variance-reduced Q-learning can be established using the oscillation of qq^{*}. According to Proposition 6.1, the oscillation of qq^{*} can again be bounded by the minorization time. Since all optimal policies induce the same qq^{*}, the minorization time of any optimal policy can be used to control the convergence rate.

We will motivate and state two intermediate propositions that underlie the proof of Theorem 2. The proofs and relevant developments of the key results in this section are deferred to Section B.2. An important result is the following property of Algorithm 2.

Proposition 4.2.

Suppose that q^0\hat{q}_{0} satisfies q^0qb\|\hat{q}_{0}-q^{*}\|_{\infty}\leq b w.p. at least 1δ/(l+1)1-\delta/(l^{*}+1) and that b/ϵ1b/\epsilon\geq 1. Then, choosing

k\displaystyle k^{*} c11(1γ)3log(8(l+1)|S||A|(1γ)δ)\displaystyle\geq c_{1}\frac{1}{(1-\gamma)^{3}}\log\left(\frac{8(l^{*}+1)|S||A|}{(1-\gamma)\delta}\right) (13)
nl\displaystyle n_{l} c21(1γ)2(σ(q)+(1γ)q2l+1b+1)2log(8(l+1)|S||A|δ),\displaystyle\geq c_{2}\frac{1}{(1-\gamma)^{2}}\left(\frac{\|\sigma(q^{*})\|_{\infty}+(1-\gamma)\|q^{*}\|_{\infty}}{2^{-l+1}b}+1\right)^{2}\log\left(\frac{8(l^{*}+1)|S||A|}{\delta}\right),

and the total number of outer iterations

llog2(bϵ),l^{*}\geq\left\lceil{\log_{2}\left(\frac{b}{\epsilon}\right)}\right\rceil, (14)

we have that the output q^l=VRQL(q^0,k,l,{nl:1ll})\hat{q}_{l^{*}}=\text{VRQL}(\hat{q}_{0},k^{*},l^{*},\left\{{n_{l}:1\leq l\leq l^{*}}\right\}) satisfies q^lqϵ\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon w.p. at least 1δ1-\delta.

It should be noted that the algorithm specified in Proposition 4.2 may not be directly implementable due to the lack of a priori knowledge of σ(q)\|\sigma(q^{*})\|_{\infty} in the definition of {nl}\left\{{n_{l}}\right\}. Fortunately, this can be addressed by first running Algorithm 1 to obtain an initialization q^0\hat{q}_{0} s.t. q^0qb=O(tminorize)\|\hat{q}_{0}-q^{*}\|_{\infty}\leq b=O(t_{\mathrm{minorize}}^{*}) with high probability. By Lemma 1, this initialization can cancel out the σ(q)\|\sigma(q^{*})\|_{\infty} term in the numerator of {nl}\left\{{n_{l}}\right\} in (13).

To simplify notation, we define

d:=|S||A|(log(1(1γ)ϵ)+1).d:=|S||A|\left(\left\lceil{\log\left(\frac{1}{(1-\gamma)\epsilon}\right)}\right\rceil+1\right). (15)

Concretely, we choose

k\displaystyle k^{*} =c11(1γ)3log(8d(1γ)δ)\displaystyle=c_{1}\frac{1}{(1-\gamma)^{3}}\log\left(\frac{8d}{(1-\gamma)\delta}\right) (16)
nl\displaystyle n_{l} =3c24l(1γ)2log(8dδ)\displaystyle=3c_{2}\frac{4^{l}}{(1-\gamma)^{2}}\log\left(\frac{8d}{\delta}\right)

with the same c1,c2c_{1},c_{2} as in (13). Moreover, the initialization q^0\hat{q}_{0} is obtained from running Algorithm 1 with

k0=c01(1γ)3log(2d(1γ)δ)\displaystyle k_{0}=c_{0}\frac{1}{(1-\gamma)^{3}}\log\left(\frac{2d}{(1-\gamma)\delta}\right) (17)

for some sufficiently large absolute constant c0c_{0}. By Proposition 4.1, this should guarantee q^0qtminorize\|\hat{q}_{0}-q^{*}\|_{\infty}\leq t_{\mathrm{minorize}}^{*} w.p. at least 1δ/(log2((1γ)1ϵ1)+1)1-\delta/(\left\lceil{\log_{2}((1-\gamma)^{-1}\epsilon^{-1})}\right\rceil+1). Therefore, by Proposition 4.2, we have the following finite time algorithmic error guarantee:

Proposition 4.3.

Assume Assumption 1. We run the combination of Algorithms 1 and 2 with parameters specified in (16) and (17); i.e. VRQL(QL(k0),k,{nl},l)\text{VRQL}(\text{QL}(k_{0}),k^{*},\left\{{n_{l}}\right\},l^{*}). Then, for ϵtminorize\epsilon\leq t_{\mathrm{minorize}}^{*} and log2(tminorize/ϵ)llog2((1γ)1ϵ1)\left\lceil{\log_{2}(t_{\mathrm{minorize}}^{*}/\epsilon)}\right\rceil\leq l^{*}\leq\left\lceil{\log_{2}((1-\gamma)^{-1}\epsilon^{-1})}\right\rceil, the estimator q^l\hat{q}_{l^{*}} produced by this combined procedure satisfies q^lqϵ\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon w.p. at least 1δ1-\delta.

Remark.

Notice that this proposition allows a range of ll^{*}. This allows the user to run the algorithm without knowledge of tminorizet_{\mathrm{minorize}}^{*}. However, to achieve the optimized complexity in Theorem 2, we need to choose l=log2(tminorize/ϵ)l^{*}=\left\lceil{\log_{2}(t_{\mathrm{minorize}}^{*}/\epsilon)}\right\rceil.

Theorem 2 will then follow from Proposition 4.3 by calculating the total number of samples used.
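Putting the pieces together, the following is a hedged Python sketch of the combined procedure \text{VRQL}(\text{QL}(k_{0}),k^{*},\{n_{l}\},l^{*}) of Proposition 4.3, reusing the q_learning and vrql sketches above. The constants c0, c1, c2 are placeholders for the unspecified absolute constants in (16) and (17), and \epsilon\leq t_{\mathrm{minorize}}^{*} is assumed so that l^{*} is well defined.

import numpy as np

def combined_procedure(r, P, gamma, eps, delta, t_minorize_star,
                       c0=1.0, c1=1.0, c2=1.0, rng=None):
    S, A = r.shape
    d = S * A * (int(np.ceil(np.log(1.0 / ((1.0 - gamma) * eps)))) + 1)                        # (15)
    k0 = int(np.ceil(c0 / (1.0 - gamma) ** 3 * np.log(2 * d / ((1.0 - gamma) * delta))))       # (17)
    k_star = int(np.ceil(c1 / (1.0 - gamma) ** 3 * np.log(8 * d / ((1.0 - gamma) * delta))))   # (16)
    l_star = int(np.ceil(np.log2(t_minorize_star / eps)))                # l* as chosen in Theorem 2
    n_list = [int(np.ceil(3 * c2 * 4 ** l / (1.0 - gamma) ** 2 * np.log(8 * d / delta)))
              for l in range(1, l_star + 1)]                             # (16)
    q0 = q_learning(r, P, gamma, k0, rng)                                # initialization QL(k0)
    return vrql(q0, r, P, gamma, k_star, l_star, n_list, rng)            # then VRQL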

4.3 The Lipschitz Setting: Assumptions 2 and 3

In the previous section, we proved that, in the general setting, the variance-reduced qq-learning algorithm has a worst case sample complexity upper bound scaling as O~(|S||A|(tminorize)2ϵ2(1γ)2)\tilde{O}(|S||A|(t_{\mathrm{minorize}}^{*})^{2}\epsilon^{-2}(1-\gamma)^{-2}) uniformly for tminorize1/(1γ)t_{\mathrm{minorize}}^{*}\leq 1/(1-\gamma). This is not optimal when 1/1γtminorize1/(1γ)1/\sqrt{1-\gamma}\leq t_{\mathrm{minorize}}^{*}\leq 1/(1-\gamma). In particular, there are known algorithms and analyses (Azar et al.,, 2013; Wainwright, 2019b, ) with worst case sample complexity upper bound O~(|S||A|ϵ2(1γ)3)\tilde{O}(|S||A|\epsilon^{-2}(1-\gamma)^{-3}). In this section, we develop a better sample complexity bound under Assumption 2 or 3. The proofs of the results in this section are provided in Section B.3.

Define the variance vector

νπ()2:=diag(Cov((IγPπ)1𝒯^(q))).\nu^{\pi}(\mathcal{M})^{2}:=\mathrm{diag}(\mathrm{Cov}((I-\gamma P^{\pi})^{-1}\widehat{\mathcal{T}}(q^{*}))). (18)

Here, the notation diag()\mathrm{diag}(\cdot) maps a square matrix MM to a column vector containing the diagonal elements of MM. In Khamaru et al., 2021b , the authors establish that maxπΠνπ()\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M})\|_{\infty} serves as a lower bound on the non-asymptotic minimax risk of estimating qq^{*} for a specific MDP instance \mathcal{M}. They also prove an upper bound that matches the lower bound under certain conditions, such as the uniqueness of the optimal policy or the satisfaction of the Lipschitz condition (10) of the transition kernel. It turns out that, as a direct consequence of Corollary 6.2.1, maxπΠνπ()\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M})\|_{\infty} satisfies the upper bound O(tminorize(1γ)2)O(t_{\mathrm{minorize}}^{*}(1-\gamma)^{-2}), see Lemma 2. This will imply a worst-case sample complexity upper bound of O~(tminorize(1γ)2ϵ2)\tilde{O}(t_{\mathrm{minorize}}^{*}(1-\gamma)^{-2}\epsilon^{-2}), provided that the sample size is sufficiently large.

Recall that Π\Pi^{*} is the set of optimal stationary deterministic Markov policies. Define the optimality gap

Δ=minπΠ\Πq(r+γPπq).\displaystyle\Delta=\min_{\pi\in\Pi\backslash\Pi^{*}}\|q^{*}-(r+\gamma P^{\pi}q^{*})\|_{\infty}.

The main result in this section is the following theorem.

Theorem 3.

Suppose Assumption 2 or 3 holds. Let the parameter k0k_{0} be as in (23) and l,{nl},kl^{*},\left\{{n_{l}}\right\},k^{*} be as in Theorem K2. Then, for all sufficiently small ϵ\epsilon, the combined algorithm VRQL(QL(k0),k,{nl},l)\text{VRQL}(\text{QL}(k_{0}),k^{*},\left\{{n_{l}}\right\},l^{*}) achieves an estimator of qq^{*} with absolute error ϵ\epsilon in the sup norm w.p. at least 1δ1-\delta using

O~(tminorizeϵ2(1γ)2max{1,ϵ2tminorize,ϵ2(1γ)tminorize})\tilde{O}\left(\frac{t_{\mathrm{minorize}}^{*}}{\epsilon^{2}(1-\gamma)^{2}}\max\left\{{1,\epsilon^{2}t_{\mathrm{minorize}}^{*},\frac{\epsilon^{2}}{(1-\gamma)t_{\mathrm{minorize}}^{*}}}\right\}\right)

number of samples. Specifically, it is sufficient under Assumption 2 for

ϵΔtminorize(1γ)1+β;\epsilon\leq\Delta\sqrt{t_{\mathrm{minorize}}^{*}(1-\gamma)^{1+\beta}}; (19)

or under Assumption 3 for

ϵmax{Δ,(1γ)/L}tminorize(1γ)1+β\epsilon\leq\max\left\{{\Delta,(1-\gamma)/L}\right\}\sqrt{t_{\mathrm{minorize}}^{*}(1-\gamma)^{1+\beta}} (20)

for any β>0\beta>0.

Remark.

If we further require that ϵmin{((1γ)tminorize)1/2,(tminorize)1/2}\epsilon\leq\min\left\{{((1-\gamma)t_{\mathrm{minorize}}^{*})^{1/2},(t_{\mathrm{minorize}}^{*})^{-1/2}}\right\}, the worst case sample complexity upper bound becomes O~(|S||A|tminorizeϵ2(1γ)2)\tilde{O}(|S||A|t_{\mathrm{minorize}}^{*}\epsilon^{-2}(1-\gamma)^{-2}).

The proof of Theorem 3 is based on the following key result in Khamaru et al., 2021b :

Theorem K2 (Theorem 2 of Khamaru et al., 2021b ).

Suppose either that the optimal policy π\pi^{*} is unique or that the Lipschitz condition (10) holds. We run Algorithm 2 with initialization q^0\hat{q}_{0} satisfying q^0q1/1γ\|\hat{q}_{0}-q^{*}\|_{\infty}\leq 1/\sqrt{1-\gamma}. Also, suppose nn^{*} is sufficiently large so that, for some β>0\beta>0,

  • In the case of unique π\pi^{*},

    n(logn)2clog(|S||A|/δ)(1γ)3max{1,1Δ2(1γ)β}.\frac{n^{*}}{(\log n^{*})^{2}}\geq c\frac{\log(|S||A|/\delta)}{(1-\gamma)^{3}}\max\left\{{1,\frac{1}{\Delta^{2}(1-\gamma)^{\beta}}}\right\}.
  • In the case where the Lipschitz condition (10) holds,

    n(logn)2clog(|S||A|/δ)(1γ)3+βmin{1Δ2,L2(1γ)2}.\frac{n^{*}}{(\log n^{*})^{2}}\geq c\frac{\log(|S||A|/\delta)}{(1-\gamma)^{3+\beta}}\min\left\{{\frac{1}{\Delta^{2}},\frac{L^{2}}{(1-\gamma)^{2}}}\right\}.

where cc is some large absolute constant. Choose

l\displaystyle l^{*} =log4(n(1γ)28log((16|S||A|/δ)logn))\displaystyle=\left\lceil{\log_{4}\left(\frac{n^{*}(1-\gamma)^{2}}{8\log\left((16|S||A|/\delta)\log n^{*}\right)}\right)}\right\rceil (21)
nl\displaystyle n_{l} =4l(1γ)2log4(16l|S||A|δ)\displaystyle=\frac{4^{l}}{(1-\gamma)^{2}}\log_{4}\left(\frac{16l^{*}|S||A|}{\delta}\right)
k\displaystyle k^{*} =n2l.\displaystyle=\frac{n^{*}}{2l^{*}}.

We have that w.p. at least 1δ1-\delta

q^lqc(log4(8|S||A|l/δ)nmaxπΠνπ()+|q|spanlog4(8|S||A|l/δ)(1γ)n).\|\hat{q}_{l^{*}}-q^{*}\|\leq c^{\prime}\left(\sqrt{\frac{\log_{4}(8|S||A|l^{*}/\delta)}{n^{*}}}\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M})\|_{\infty}+\frac{\left|q^{*}\right|_{\mathrm{span}}\log_{4}(8|S||A|l^{*}/\delta)}{(1-\gamma)n^{*}}\right). (22)

As explained in the discussion after Assumption 3 and noted by Li et al., (2023), if the optimal policy is unique with optimality gap Δ\Delta, then Assumption 3 holds with L=4/ΔL=4/\Delta. This is consistent with the requirement on nn^{*} in Theorem K2.

Theorem K2 suggests that the finite sample convergence rate can be controlled by the local minimax rate parameter maxπΠνπ()\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M})\|_{\infty}. Observe that if the entries of νπ()\nu^{\pi}(\mathcal{M}) can be uniformly upper bounded by O(tminorize(1γ)2)O(t_{\mathrm{minorize}}^{*}(1-\gamma)^{-2}) for all πΠ\pi\in\Pi^{*}, then the error bound (22) should imply the desired sample complexity upper bound. This is indeed the case if all optimal policies induce Doeblin chains with a uniform minorization time upper bound.

Lemma 2.

Suppose that for all πΠ\pi\in\Pi^{*}, the kernel PπP_{\pi} is uniformly ergodic with

tminorize:=maxπΠtminorize(Pπ).t_{\mathrm{minorize}}^{*}:=\max_{\pi\in\Pi^{*}}t_{\mathrm{minorize}}(P_{\pi}).

Then

maxπΠνπ()26400tminorize(1γ)2.\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M})^{2}\|_{\infty}\leq\frac{6400t_{\mathrm{minorize}}^{*}}{(1-\gamma)^{2}}.

To use Theorem K2, we still need an initialization satisfying q^0q1/1γ\|\hat{q}_{0}-q^{*}\|_{\infty}\leq 1/\sqrt{1-\gamma}. As in the previous section, this can be achieved w.p. at least 1δ/21-\delta/2 by running Algorithm 1 with

k0=c0(tminorize)2(1γ)2log(4|S||A|(1γ)δ)k_{0}=c_{0}\frac{(t_{\mathrm{minorize}}^{*})^{2}}{(1-\gamma)^{2}}\log\left(\frac{4|S||A|}{(1-\gamma)\delta}\right) (23)

for sufficiently large constant c0c_{0}. Then we start the Algorithm 2 with this initialization and parameters specified in Theorem K2. This will give the desired sample complexity upper bound in Theorem 3.

Remark.

This method of generating q^0\hat{q}_{0}, however, requires knowledge of tminorizet_{\mathrm{minorize}}^{*} in order to specify k0k_{0}. To resolve this, we can instead use

k0=c01(1γ)4log(4|S||A|(1γ)δ)k_{0}=c_{0}\frac{1}{(1-\gamma)^{4}}\log\left(\frac{4|S||A|}{(1-\gamma)\delta}\right)

to initialize q^0\hat{q}_{0}. This will result in a sample complexity upper bound of O~(|S||A|tminorize(1γ)2ϵ2)\tilde{O}\left(|S||A|t_{\mathrm{minorize}}(1-\gamma)^{-2}\epsilon^{-2}\right) when ϵ1γ\epsilon\leq 1-\gamma, in addition to the conditions in Theorem 3. This is less efficient in terms of the range of ϵ\epsilon. So, taking a complexity-theoretic perspective, we omit a formal presentation and proof of this initialization method.

Also, note that, compared to the procedure in Proposition 4.5, one might attempt to use the procedure in Proposition 4.3 as a more efficient initialization method. However, this is not possible due to the requirement that ϵtminorize\epsilon\leq\sqrt{t_{\mathrm{minorize}}} in Proposition 4.3.

4.4 Uniform Mixing: Assumption 4

Assuming a uniform minorization time upper bound as in Assumption 4, in this section, we will construct an algorithm and outline the proof of the following sample complexity upper bound.

Theorem 4.

Assume Assumption 4. Run the procedure in Proposition 4.5 with l=log2(tminorize/ϵ)l^{*}=\left\lceil{\log_{2}(\sqrt{t_{\mathrm{minorize}}}/\epsilon)}\right\rceil. The worst case sample complexity to achieve an estimator of qq^{*} with absolute error ϵtminorize\epsilon\leq\sqrt{t_{\mathrm{minorize}}} in the infinity norm w.p. at least 1δ1-\delta is

O~(|S||A|tminorize(1γ)2ϵ2max{ϵ2tminorize(1γ),1}).\tilde{O}\left(\frac{|S||A|t_{\mathrm{minorize}}}{(1-\gamma)^{2}\epsilon^{2}}\max\left\{{\frac{\epsilon^{2}}{t_{\mathrm{minorize}}(1-\gamma)},1}\right\}\right).
Remark.

Note that when ϵtminorize(1γ)\epsilon\leq\sqrt{t_{\mathrm{minorize}}(1-\gamma)}, Theorem 4 implies a sample complexity upper bound of O~(|S||A|tminorize(1γ)2ϵ2)\tilde{O}\left(|S||A|t_{\mathrm{minorize}}(1-\gamma)^{-2}\epsilon^{-2}\right). Compared to Theorem 3, this gives a larger range of ϵ\epsilon in which the optimal sample complexity bound holds.

To develop this sample complexity upper bound, we highlight some preliminary observations of Algorithm 2 made by Wainwright, 2019b . Consider the inner loop update of the variance-reduced Q-learning (12). Wainwright, 2019b defines

𝒥^l,k+1(q)=𝒯^l,k+1(q)𝒯^l,k+1(q^l1)+𝒯~l(q^l1);𝒥l(q)=𝒯(q)𝒯(q^l1)+𝒯~l(q^l1).\widehat{\mathcal{J}}_{l,k+1}(q)=\widehat{\mathcal{T}}_{l,k+1}(q)-\widehat{\mathcal{T}}_{l,k+1}(\hat{q}_{l-1})+\widetilde{\mathcal{T}}_{l}(\hat{q}_{l-1});\qquad\mathcal{J}_{l}(q)=\mathcal{T}(q)-\mathcal{T}(\hat{q}_{l-1})+\widetilde{\mathcal{T}}_{l}(\hat{q}_{l-1}).

Then, (12) becomes

ql,k+1=(1λk)ql,k+λk𝒥^l,k+1(ql,k).q_{l,k+1}=(1-\lambda_{k})q_{l,k}+\lambda_{k}\widehat{\mathcal{J}}_{l,k+1}(q_{l,k}).

It is easy to check that 𝒥l\mathcal{J}_{l} is a γ\gamma-contraction, and E[𝒥^l,k+1(q)|q^l1]=𝒥l(q)E[\widehat{\mathcal{J}}_{l,k+1}(q)|\hat{q}_{l-1}]=\mathcal{J}_{l}(q). So, the inner loop of the variance-reduced Q-learning can be thought of as a stochastic approximation to the unique fixed point q¯l\bar{q}_{l} of 𝒥l\mathcal{J}_{l}.
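
To make this recursion concrete, the following numpy sketch implements one epoch of the inner loop under a generative model. It is a minimal illustration, not the exact specification of Algorithm 2: the stepsize λ_k = 1/(1+(1−γ)k) and the helper names are assumptions made for the example. The two Bellman evaluations within a step share the same next-state draw, which is what makes the recentering in Ĵ reduce variance.

```python
import numpy as np

def coupled_bellman_pair(q, q_anchor, r, P, gamma, rng):
    # Evaluate the one-sample Bellman operator at q and at q_anchor using the SAME
    # next-state draw S' ~ P(.|s, a) for each (s, a).
    S, A = q.shape
    Tq, Tq_anchor = np.empty((S, A)), np.empty((S, A))
    for s in range(S):
        for a in range(A):
            s_next = rng.choice(S, p=P[s, a])
            Tq[s, a] = r[s, a] + gamma * q[s_next].max()
            Tq_anchor[s, a] = r[s, a] + gamma * q_anchor[s_next].max()
    return Tq, Tq_anchor

def vrql_epoch(q_prev, r, P, gamma, k_steps, n_l, rng):
    # One epoch of the variance-reduced inner loop: T_tilde is a Monte Carlo average
    # of n_l one-sample Bellman operators at the anchor q_prev; each inner step then
    # applies the recentered operator J_hat from the display above.
    T_tilde = np.mean([coupled_bellman_pair(q_prev, q_prev, r, P, gamma, rng)[0]
                       for _ in range(n_l)], axis=0)
    q = q_prev.copy()
    for k in range(1, k_steps + 1):
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)   # assumed rescaled linear stepsize
        Tq, Tq_anchor = coupled_bellman_pair(q, q_prev, r, P, gamma, rng)
        q = (1.0 - lam) * q + lam * (Tq - Tq_anchor + T_tilde)
    return q
```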

This fixed-point observation leads us to consider the perturbed reward function

r¯l:=r𝒯(q^l1)+𝒯~l(q^l1).\bar{r}_{l}:=r-\mathcal{T}(\hat{q}_{l-1})+\widetilde{\mathcal{T}}_{l}(\hat{q}_{l-1}). (24)

Then the fixed point q¯l\bar{q}_{l} of 𝒥l\mathcal{J}_{l} uniquely satisfies the Bellman equation

q¯l(s,a)=r¯l(s,a)+γPs,av(q¯l).\bar{q}_{l}(s,a)=\bar{r}_{l}(s,a)+\gamma P_{s,a}v(\bar{q}_{l}). (25)

With this reformulation of the variance-reduced Q-learning recursion, we can bound the error after ll epochs by the triangle inequality

q^lqq^lq¯l+qq¯l.\|\hat{q}_{l}-q^{*}\|_{\infty}\leq\|\hat{q}_{l}-\bar{q}_{l}\|_{\infty}+\|q^{*}-\bar{q}_{l}\|_{\infty}.

Then, by tightly controlling the errors q^lq¯l\|\hat{q}_{l}-\bar{q}_{l}\|_{\infty} and qq¯l\|q^{*}-\bar{q}_{l}\|_{\infty}, we arrive at the following proposition.

Proposition 4.4.

Assume Assumption 4 and an initialization q^0\hat{q}_{0} that satisfies q^0qtminorize\|\hat{q}_{0}-q^{*}\|_{\infty}\leq\sqrt{t_{\mathrm{minorize}}} w.p. at least 1δ/(l+1)1-\delta/(l^{*}+1), and ϵtminorize\epsilon\leq\sqrt{t_{\mathrm{minorize}}}. Choose llog2(tminorize/ϵ)l^{*}\geq\left\lceil{\log_{2}(\sqrt{t_{\mathrm{minorize}}}/\epsilon)}\right\rceil,

k\displaystyle k^{*} c1(1γ)3log(2|S||A|(l+1)δ(1γ)),\displaystyle\geq c\frac{1}{(1-\gamma)^{3}}\log\left(\frac{2|S||A|(l^{*}+1)}{\delta(1-\gamma)}\right), (26)
nl\displaystyle n_{l} c4l(1γ)2log(8|S||A|(l+1)δ).\displaystyle\geq c^{\prime}\frac{4^{l}}{(1-\gamma)^{2}}\log\left(\frac{8|S||A|(l^{*}+1)}{\delta}\right).

Then, we have that q^lqϵ\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon w.p. at least 1δ1-\delta.

To use Proposition 4.4, we need to produce an estimator q^0\hat{q}_{0} s.t. q^0qtminorize\|\hat{q}_{0}-q^{*}\|_{\infty}\leq\sqrt{t_{\mathrm{minorize}}} w.p. at least 1δ/(l+1)1-\delta/(l^{*}+1). Here, we require ll^{*} to satisfy llog2((1γ)1/2ϵ1)l^{*}\leq\left\lceil{\log_{2}((1-\gamma)^{-1/2}\epsilon^{-1})}\right\rceil. Therefore, to achieve such an initialization efficiently, we can run the combined procedure in Proposition 4.3 with specified error ϵ=ϵ0=tminorize\epsilon=\epsilon_{0}=\sqrt{t_{\mathrm{minorize}}} and tolerance probability 1δ/(log2((1γ)1/2ϵ1)+1)1-\delta/(\left\lceil{\log_{2}((1-\gamma)^{-1/2}\epsilon^{-1})}\right\rceil+1). By Theorem 2, we need

O~(|S||A|(1γ)2max{11γ,(tminorize)2ϵ02})=O~(|S||A|(1γ)3)\tilde{O}\left(\frac{|S||A|}{(1-\gamma)^{2}}\max\left\{{\frac{1}{1-\gamma},\frac{(t_{\mathrm{minorize}})^{2}}{\epsilon_{0}^{2}}}\right\}\right)=\tilde{O}\left(\frac{|S||A|}{(1-\gamma)^{3}}\right)

number of samples.

Remark.

Similar to the remark in Section 4.3, this method of generating q^0\hat{q}_{0} requires knowledge of tminorizet_{\mathrm{minorize}}. Instead, one could run the Q-learning procedure QL(k0)\text{QL}(k_{0}) with

k0=c01(1γ)4log(2dtminorize(1γ)δ)k_{0}=c_{0}\frac{1}{(1-\gamma)^{4}}\log\left(\frac{2dt_{\mathrm{minorize}}}{(1-\gamma)\delta}\right)

to initialize q^0\hat{q}_{0}, where we recall the definition of dd in (15). This will result in a sample complexity upper bound of O~(|S||A|tminorize(1γ)2ϵ2)\tilde{O}\left(|S||A|t_{\mathrm{minorize}}(1-\gamma)^{-2}\epsilon^{-2}\right) when ϵ(1γ)tminorize\epsilon\leq(1-\gamma)\sqrt{t_{\mathrm{minorize}}}. This is less efficient and the proof is omitted.

We will use the following choice of parameters:

k\displaystyle k^{*} =c11(1γ)3log(2d(1γ)δ),\displaystyle=c_{1}\frac{1}{(1-\gamma)^{3}}\log\left(\frac{2d}{(1-\gamma)\delta}\right), (27)
nl\displaystyle n_{l} =c24l(1γ)2log(8dδ).\displaystyle=c_{2}\frac{4^{l}}{(1-\gamma)^{2}}\log\left(\frac{8d}{\delta}\right).

This choice of parameters implies the following algorithmic error bound.

Proposition 4.5.

Assume Assumption 4. First, run the procedure in Proposition 4.3 with specified error ϵ0=tminorize\epsilon_{0}=\sqrt{t_{\mathrm{minorize}}} and tolerance probability 1δ/(log2((1γ)1/2ϵ1)+1)1-\delta/(\left\lceil{\log_{2}((1-\gamma)^{-1/2}\epsilon^{-1})}\right\rceil+1) to produce q^0\hat{q}_{0}. Then, run VRQL(q^0,k,l,{nl})\text{VRQL}(\hat{q}_{0},k^{*},l^{*},\left\{{n_{l}}\right\}) with parameters kk^{*}, {nl}\left\{{n_{l}}\right\} as specified in (27) and log2(tminorize/ϵ)llog2((1γ)1/2ϵ1)\left\lceil{\log_{2}(\sqrt{t_{\mathrm{minorize}}}/\epsilon)}\right\rceil\leq l^{*}\leq\left\lceil{\log_{2}((1-\gamma)^{-1/2}\epsilon^{-1})}\right\rceil. For ϵtminorize\epsilon\leq\sqrt{t_{\mathrm{minorize}}}, the output of this combined procedure q^l\hat{q}_{l^{*}} satisfies q^lqϵ\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon w.p. at least 1δ1-\delta.

Theorem 4 then follows from Proposition 4.5 and a simple calculation of the number of samples used.

5 Minimax Lower Bounds

In this section, we show that the sample complexity dependence on ϵ\epsilon, (1γ)1(1-\gamma)^{-1}, and tminorizet_{\mathrm{minorize}} in Theorems 3 and 4 cannot be improved in general. We consider two notions of non-asymptotic minimax risk of learning the optimal qq-function in the sup norm: one localized at a hard instance, and the other uniform over the class of all MDPs with discount factor γ\gamma and uniform minorization time upper bound tminorizet_{\mathrm{minorize}} (Assumption 4). We prove that both minimax risks have lower bounds supporting the claim that our sample complexities in Theorems 3 and 4 are tight.

We start off with a rigorous definition of the generator model and the class of estimators. This is necessary because the lower bounding technique needs a careful specification of the probability measure. The generator model is a sequence of product probability spaces

Gn=i=1nsS,aA(S,2S,Ps,a)G_{n}^{\mathcal{M}}=\bigotimes_{i=1}^{n}\bigotimes_{s\in S,a\in A}(S,2^{S},P^{\mathcal{M}}_{s,a})

where Ps,aP_{s,a} is the transition kernel of instance \mathcal{M}. By construction, the identity random element on G1G_{1}^{\mathcal{M}}, denoted by {S(s,a)(),sS,aA}\left\{{S^{\prime}(s,a)(\cdot),s\in S,a\in A}\right\}, is independent across S×AS\times A. We will denote the product measure on GnG_{n}^{\mathcal{M}} as PnP^{\mathcal{M}}_{n} and the expectation as EnE^{\mathcal{M}}_{n}. An estimator q~n\tilde{q}_{n} is a measurable function from GnG_{n}^{\mathcal{M}} to S×A\mathbb{R}^{S\times A}. For fixed nn, this is equivalent to the procedure that generates nn i.i.d. samples Si(s,a)Ps,aS_{i}^{\prime}(s,a)\sim P_{s,a} for each s,as,a and then forms q~n\tilde{q}_{n} from these samples.
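
As a concrete (and purely illustrative) member of this class of estimators, the model-based plug-in rule sketched below builds the empirical kernel from the n i.i.d. next-state draws per state-action pair and runs value iteration on it; the function name and the use of value iteration are assumptions for the sketch, not an estimator analyzed in this paper.

```python
import numpy as np

def plugin_estimator(samples, r, gamma, iters=2000):
    # samples[s][a] is an array holding the n i.i.d. next-state indices S'_i(s, a) ~ P_{s, a}.
    S, A = r.shape
    n = len(samples[0][0])
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            idx, counts = np.unique(samples[s][a], return_counts=True)
            P_hat[s, a, idx] = counts / n
    # Value iteration on the empirical model: one measurable function of the n samples.
    q = np.zeros((S, A))
    for _ in range(iters):
        q = r + gamma * P_hat @ q.max(axis=1)
    return q
```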

For fixed state and action spaces and discount factor γ\gamma, we first consider the local non-asymptotic minimax risk at instance γ\mathcal{M}_{\gamma} of learning the qq^{*} function accurately in the sup norm using nn samples:

𝔐n(γ)=supγinfq~nmax{γ,γ}nEnq~nq().\mathfrak{M}_{n}(\mathcal{M}_{\gamma})=\sup_{\mathcal{M}_{\gamma}^{\prime}}\inf_{\tilde{q}_{n}}\max_{\mathcal{M}\in\left\{{\mathcal{M}_{\gamma},\mathcal{M}_{\gamma}^{\prime}}\right\}}\sqrt{n}E^{\mathcal{M}}_{n}\|\tilde{q}_{n}-q^{*}(\mathcal{M})\|_{\infty}. (28)

where the first supremum is taken over all MDP instances with discount factor γ\gamma. This quantity measures the complexity of learning the optimal qq-function on either γ\mathcal{M}_{\gamma} or the hardest local alternative. It is studied in the convex optimization context by Cai and Low, (2015); Chatterjee et al., (2016), and adapted to the RL context by Khamaru et al., 2021a ; Khamaru et al., 2021b .

The previous complexity metric embodies the idea of constructing a hard instance that is difficult to distinguish from its local alternative, in the spirit of the lower bounds in seminal works on MDP complexity theory (c.f. Azar et al., (2013)). However, one can argue that it is more natural to define the minimax risk over the class of MDP instances 𝒞(γ,tminorize):={γ: satisfying Assumption 4}\mathcal{C}(\gamma,t_{\mathrm{minorize}}):=\left\{{\mathcal{M}_{\gamma}:\text{ satisfying Assumption 4}}\right\}.

Definition 4.

We define the minimax risk of learning the optimal qq-function of the MDPs 𝒞(γ,tminorize)\mathcal{C}(\gamma,t_{\mathrm{minorize}}) as

𝔐n(γ,tminorize)=infq~nsup𝒞(γ,tminorize)nEnq~nq().\mathfrak{M}_{n}(\gamma,t_{\mathrm{minorize}})=\inf_{\tilde{q}_{n}}\sup_{\mathcal{M}\in\mathcal{C}(\gamma,t_{\mathrm{minorize}})}\sqrt{n}E^{\mathcal{M}}_{n}\|\tilde{q}_{n}-q^{*}(\mathcal{M})\|_{\infty}.

With these definitions, we are ready to state the main results of this section.

Theorem 5.

There exists an absolute constant c>0c>0 s.t. for any γ[0.9,1)\gamma\in[0.9,1) and 10tminorize1/(1γ)10\leq t_{\mathrm{minorize}}\leq 1/(1-\gamma), one can find an MDP instance γ,tminorize\mathcal{M}_{\gamma,t_{\mathrm{minorize}}} satisfying Assumption 4 with uniform minorization time upper bound tminorizet_{\mathrm{minorize}} s.t.

𝔐n(γ,tminorize)ctminorize1γ\mathfrak{M}_{n}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})\geq c\frac{\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}

provided that n2max{(1γ)2,35tminorize}n\geq 2\max\left\{{(1-\gamma)^{-2},3^{5}t_{\mathrm{minorize}}}\right\}.

Here, as we are lower bounding the complexity metric, we abuse the notation tminorizet_{\mathrm{minorize}} in Assumption 4 so that tminorizemaxπΠtminorize(Pπ)t_{\mathrm{minorize}}\geq\max_{\pi\in\Pi}t_{\mathrm{minorize}}(P_{\pi}), where PP is the kernel of γ,tminorize\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}.

As we will explain in a moment, the hard instances {γ,tminorize}\left\{{\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}}\right\} in Theorem 5 also facilitate lower bounding 𝔐n(γ,tminorize)\mathfrak{M}_{n}(\gamma,t_{\mathrm{minorize}}) in Definition 4.

Theorem 6.

There exists an absolute constant c>0c>0 s.t. for any γ[0.9,1)\gamma\in[0.9,1) and 10tminorize1/(1γ)10\leq t_{\mathrm{minorize}}\leq 1/(1-\gamma),

𝔐n(γ,tminorize)ctminorize1γ\mathfrak{M}_{n}(\gamma,t_{\mathrm{minorize}})\geq c\frac{\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}

provided that n2max{(1γ)2,35(tminorize)3}n\geq 2\max\left\{{(1-\gamma)^{-2},3^{5}(t_{\mathrm{minorize}})^{3}}\right\}.

Next, we outline the proofs of these theorems and defer the details to Sections C.1 and C.2, respectively.

To begin with, we construct a family of MDP instances satisfying Assumption 4 such that the local non-asymptotic risk parameter maxπΠνπ()2\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M})^{2}\|_{\infty} (see (18)) at each instance is Ω(tminorize(1γ)2)\Omega(t_{\mathrm{minorize}}(1-\gamma)^{-2}). We claim that this is the case for the following family of hard MDP instances:

Definition 5 (Hard MDP Family).

Consider the following family of hard MDP instances {γ,tminorize:γ,tminorize}\left\{{\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}:\gamma,t_{\mathrm{minorize}}}\right\} with S={1,2}S=\left\{{1,2}\right\}, A={a1,a2}A=\left\{{a_{1},a_{2}}\right\}. The transition kernel for each action is

Pa1=Pa2=[1ppp1p].P_{a_{1}}=P_{a_{2}}=\begin{bmatrix}1-p&p\\ p&1-p\end{bmatrix}.

The reward function is given by r(1,)=1r(1,\cdot)=1 and r(2,)=0r(2,\cdot)=0.

Clearly, π(s1)=a1\pi^{*}(s_{1})=a_{1} and π(s2)=a1\pi^{*}(s_{2})=a_{1} define an optimal policy. Observe that

Pπ=[1ppp1p]=2p[1/21/2]+(12p)[1001]P_{\pi^{*}}=\begin{bmatrix}1-p&p\\ p&1-p\end{bmatrix}=2p\begin{bmatrix}1/2&1/2\end{bmatrix}+(1-2p)\begin{bmatrix}1&0\\ 0&1\end{bmatrix}

So, PπP_{\pi^{*}} is (1,2p)(1,2p)-Doeblin and hence (1,p)(1,p)-Doeblin. For simplicity, we will use tminorize=1/ptminorize(Pπ)t_{\mathrm{minorize}}=1/p\geq t_{\mathrm{minorize}}(P_{\pi^{*}}). It is also easy to see that all policies will induce a (1,p)(1,p)-Doeblin chain; i.e. Assumption 4 holds.
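
The following short numpy sketch instantiates this family, checks the displayed Doeblin split, and evaluates the optimal value in closed form; the particular values p = 0.05 and γ = 0.95 are arbitrary illustrative choices.

```python
import numpy as np

def hard_instance(p, gamma):
    # Definition 5: two states, two actions sharing the kernel P; r(1, .) = 1, r(2, .) = 0.
    P = np.array([[1 - p, p], [p, 1 - p]])
    # Doeblin split from the display above: each row of the first term equals psi = (1/2, 1/2).
    psi = np.array([0.5, 0.5])
    assert np.allclose(P, 2 * p * np.tile(psi, (2, 1)) + (1 - 2 * p) * np.eye(2))
    # Both actions coincide, so q*(s, a) = v*(s) with v* = (I - gamma * P)^{-1} r.
    v_star = np.linalg.solve(np.eye(2) - gamma * P, np.array([1.0, 0.0]))
    return P, v_star

P, v_star = hard_instance(p=0.05, gamma=0.95)
print(v_star, "span:", v_star.max() - v_star.min())   # span = 1/(1 - gamma + 2*gamma*p)
```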

The claim above, together with the following Theorem 1 in Khamaru et al., 2021b , would imply an Ω(tminorize(1γ)2ϵ2)\Omega\left(t_{\mathrm{minorize}}(1-\gamma)^{-2}\epsilon^{-2}\right) complexity lower bound for learning qq^{*} within an ϵ\epsilon absolute error in the sup norm when the risk measure is 𝔐n()\mathfrak{M}_{n}(\mathcal{M}):

Theorem K1 (Theorem 1 in Khamaru et al., 2021b ).

There exists constant c>0c>0 s.t. for any MDP instance \mathcal{M},

𝔐n()cγmaxπΠνπ()\mathfrak{M}_{n}(\mathcal{M})\geq c\gamma\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M})\|_{\infty}

provided that

nmax{2γ2(1γ)2,2|q|span2(1γ)2maxπΠνπ()2}.n\geq\max\left\{{\frac{2\gamma^{2}}{(1-\gamma)^{2}},\frac{2\left|q^{*}\right|_{\mathrm{span}}^{2}}{(1-\gamma)^{2}\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M})^{2}\|_{\infty}}}\right\}. (29)

To prove Theorem 6, we first observe that

𝔐n(γ,tminorize)sup1,2𝒞(γ,tminorize)infq~nmax{1,2}nEnq~nq()\mathfrak{M}_{n}(\gamma,t_{\mathrm{minorize}})\geq\sup_{\mathcal{M}_{1},\mathcal{M}_{2}\in\mathcal{C}(\gamma,t_{\mathrm{minorize}})}\inf_{\tilde{q}_{n}}\max_{\mathcal{M}\in\left\{{\mathcal{M}_{1},\mathcal{M}_{2}}\right\}}\sqrt{n}E_{n}^{\mathcal{M}}\|\tilde{q}_{n}-q^{*}(\mathcal{M})\|_{\infty} (30)

Therefore, we should have 𝔐n(γ,tminorize)𝔐n(γ,tminorize)\mathfrak{M}_{n}(\gamma,t_{\mathrm{minorize}})\geq\mathfrak{M}_{n}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}) if the hard alternative instance in (28) satisfies γ𝒞(γ,tminorize)\mathcal{M}^{\prime}_{\gamma}\in\mathcal{C}(\gamma,t_{\mathrm{minorize}}). Observe that, for large nn, if γ,tminorize\mathcal{M}_{\gamma,t_{\mathrm{minorize}}} and γ\mathcal{M}^{\prime}_{\gamma} are very different, it will be easy to tell them apart, and hence the estimation can be done with small expected error, because both instances are known to the controller (infq~n\inf_{\tilde{q}_{n}} is inside supγ\sup_{\mathcal{M}^{\prime}_{\gamma}}). Thus, the instances that potentially achieve the outer supremum in (28) should be close to γ,tminorize\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}. Hence, a hard alternative instance for (28) should have a minorization time similar to tminorizet_{\mathrm{minorize}} as well. Following this intuition, in the proof of Theorem 6, we construct such a hard local alternative using the techniques in the proof of Theorem K1 in Khamaru et al., 2021b . See Appendix Section C.2.

6 Cumulative Reward of a (m,p)(m,p)-Doeblin Chain

In order to prove the matching sample complexity upper and lower bounds stated in previous sections, we need to control the moments of the discounted cumulative rewards

k=0γkr(Sk,Ak).\sum_{k=0}^{\infty}\gamma^{k}r(S_{k},A_{k}).

In this section, we state some key properties of this cumulative reward and their implications when the kernel PπP_{\pi} is uniformly ergodic, where πΠ\pi\in\Pi is some stationary Markov deterministic policy. By Theorem UE, PπP_{\pi} satisfies an (m,p)(m,p)-Doeblin condition with minorization time tminorize(Pπ)m/pt_{\mathrm{minorize}}(P_{\pi})\leq m/p. The proofs for this section are deferred to Appendix Section D.

First, we show that for a Doeblin MDP, the oscillation of the qq-function is upper bounded by the minorization time.

Proposition 6.1 (Oscillation of the qq-function).

Suppose PπP_{\pi} is (m,p)(m,p)-Doeblin. Then there exists a constant q¯π\bar{q}^{\pi} s.t. qπq¯πm/p+1\|q^{\pi}-\bar{q}^{\pi}\|_{\infty}\leq m/p+1. In particular, |qπ|span3m/p\left|q^{\pi}\right|_{\mathrm{span}}\leq 3m/p.

Note that Proposition 6.1 can be directly applied to bound σ(q)\sigma(q^{*}) as in Lemma 1. Next, we prove a bound on the variance of the cumulative reward achieved by a (m,p,ψ,R)(m,p,\psi,R)-Doeblin chain.

Proposition 6.2 (Variance of the Cumulative Reward).

Suppose PπP_{\pi} is (m,p)(m,p)-Doeblin. Then Ψπ20(m/p)(1γ)1\|\Psi^{\pi}\|_{\infty}\leq 20(m/p)(1-\gamma)^{-1}, where Ψπ\Psi^{\pi} is defined in (5).
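
As a quick sanity check of Proposition 6.2 (illustrative only; the chain, horizon, and sample size below are arbitrary choices), one can estimate the variance of the discounted cumulative reward by simulation on the two-state chain of Definition 5 and compare it with the bound 20(m/p)(1−γ)^{-1}.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, p = 0.95, 0.05
P = np.array([[1 - p, p], [p, 1 - p]])   # (1, 2p)-Doeblin, so we may take m/p = 1/(2p) = 10
r = np.array([1.0, 0.0])

def discounted_reward(s0, horizon=400):
    # Simulate sum_k gamma^k r(S_k) starting from S_0 = s0; gamma^400 is negligible here.
    s, total = s0, 0.0
    for k in range(horizon):
        total += gamma**k * r[s]
        s = rng.choice(2, p=P[s])
    return total

samples = [discounted_reward(0) for _ in range(5000)]
print(np.var(samples), 20 * 10 / (1 - gamma))   # empirical variance vs 20 (m/p) / (1 - gamma)
```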

Proposition 6.2 and the equality

Ψπ=(σπ)2+γ2PπΨπ\Psi^{\pi}=(\sigma^{\pi})^{2}+\gamma^{2}P^{\pi}\Psi^{\pi} (31)

due to Azar et al., (2013) allow us to obtain the following important upper bound:

Corollary 6.2.1.

Recall the definition of σπ\sigma^{\pi} in (7). Suppose PπP_{\pi} is uniformly ergodic. Then

(IγPπ)1σπ80tminorize(Pπ)1γ.\displaystyle\|(I-\gamma P^{\pi})^{-1}\sigma^{\pi}\|_{\infty}\leq\frac{80\sqrt{t_{\mathrm{minorize}}(P_{\pi})}}{1-\gamma}.
Remark.

If we no longer assume rr is bounded by 11, then (IγPπ)1σπ80rtminorize(Pπ)/(1γ)\|(I-\gamma P^{\pi})^{-1}\sigma^{\pi}\|_{\infty}\leq 80\|r\|_{\infty}\sqrt{t_{\mathrm{minorize}}(P_{\pi})}/(1-\gamma). Also, this bound can be easily applied to the model-based RL settings, and the same complexity upper bound O~(|S||A|tminorize(1γ)2ϵ2)\tilde{O}(|S||A|t_{\mathrm{minorize}}(1-\gamma)^{-2}\epsilon^{-2}) should follow.
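
For a concrete feel for the constants, the sketch below evaluates both sides of the bound in Corollary 6.2.1 on the two-state chain of Definition 5 (again with the arbitrary illustrative values p = 0.05 and γ = 0.95, where both actions coincide so that σ^π can be indexed by the state).

```python
import numpy as np

gamma, p = 0.95, 0.05
P = np.array([[1 - p, p], [p, 1 - p]])
r = np.array([1.0, 0.0])
v = np.linalg.solve(np.eye(2) - gamma * P, r)      # value function of the chain
sigma = gamma * np.sqrt(P @ v**2 - (P @ v)**2)     # conditional std of gamma * v(S')
lhs = np.abs(np.linalg.solve(np.eye(2) - gamma * P, sigma)).max()
t_min = 1 / (2 * p)                                 # minorization time bound m/p
print(lhs, 80 * np.sqrt(t_min) / (1 - gamma))       # left side vs the Corollary 6.2.1 bound
```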

Acknowledgements

The material in this paper is based upon work supported by the Air Force Office of Scientific Research under award number FA9550-20-1-0397. Additional support is gratefully acknowledged from NSF 1915967 and 2118199.

References

  • Agarwal et al., (2020) Agarwal, A., Kakade, S., and Yang, L. F. (2020). Model-based reinforcement learning with a generative model is minimax optimal. In Abernethy, J. and Agarwal, S., editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 67–83. PMLR.
  • Aldous et al., (1997) Aldous, D., Lovász, L., and Winkler, P. (1997). Mixing times for uniformly ergodic markov chains. Stochastic Processes and their Applications, 71(2):165–185.
  • Asmussen, (2003) Asmussen, S. (2003). Applied probability and queues, volume 2. Springer.
  • Athreya and Ney, (1978) Athreya, K. B. and Ney, P. (1978). A new approach to the limit theory of recurrent markov chains. Transactions of the American Mathematical Society, 245:493–501.
  • Azar et al., (2013) Azar, M. G., Munos, R., and Kappen, H. J. (2013). Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Mach. Learn., 91(3):325–349.
  • Bramson, (2008) Bramson, M. (2008). Stability of Queueing Networks: École d’Été de Probabilités de Saint-Flour XXXVI - 2006. Springer Berlin Heidelberg, Berlin, Heidelberg.
  • Cai and Low, (2015) Cai, T. T. and Low, M. G. (2015). A framework for estimation of convex functions. Statistica Sinica, 25(2):423–456.
  • Chatterjee et al., (2016) Chatterjee, S., Duchi, J. C., Lafferty, J., and Zhu, Y. (2016). Local minimax complexity of stochastic convex optimization. In Lee, D., Sugiyama, M., Luxburg, U., Guyon, I., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 29. Curran Associates, Inc.
  • Chen et al., (2020) Chen, Z., Maguluri, S. T., Shakkottai, S., and Shanmugam, K. (2020). Finite-sample analysis of contractive stochastic approximation using smooth convex envelopes. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H., editors, Advances in Neural Information Processing Systems, volume 33, pages 8223–8234. Curran Associates, Inc.
  • Crane and Lemoine, (1977) Crane, M. A. and Lemoine, A. J., editors (1977). An Introduction to the Regenerative Method for Simulation Analysis. Springer Berlin Heidelberg, Berlin, Heidelberg.
  • Deng et al., (2017) Deng, Y., Bao, F., Kong, Y., Ren, Z., and Dai, Q. (2017). Deep direct reinforcement learning for financial signal representation and trading. IEEE Transactions on Neural Networks and Learning Systems, 28(3):653–664.
  • Gilks et al., (1998) Gilks, W. R., Roberts, G. O., and Sahu, S. K. (1998). Adaptive markov chain monte carlo through regeneration. Journal of the American Statistical Association, 93(443):1045–1054.
  • Glynn and Ormoneit, (2002) Glynn, P. W. and Ormoneit, D. (2002). Hoeffding’s inequality for uniformly ergodic markov chains. Statistics & Probability Letters, 56(2):143–146.
  • Henderson and Glynn, (2001) Henderson, S. G. and Glynn, P. W. (2001). Regenerative steady-state simulation of discrete-event systems. ACM Trans. Model. Comput. Simul., 11(4):313–345.
  • Jin and Sidford, (2020) Jin, Y. and Sidford, A. (2020). Efficiently solving MDPs with stochastic mirror descent. In III, H. D. and Singh, A., editors, Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pages 4890–4900. PMLR.
  • Jin and Sidford, (2021) Jin, Y. and Sidford, A. (2021). Towards tight bounds on the sample complexity of average-reward mdps.
  • Khamaru et al., (2021a) Khamaru, K., Pananjady, A., Ruan, F., Wainwright, M. J., and Jordan, M. I. (2021a). Is temporal difference learning optimal? an instance-dependent analysis. SIAM Journal on Mathematics of Data Science, 3(4):1013–1040.
  • Khamaru et al., (2021b) Khamaru, K., Xia, E., Wainwright, M. J., and Jordan, M. I. (2021b). Instance-optimality in optimal value estimation: Adaptivity via variance-reduced q-learning.
  • Kober et al., (2013) Kober, J., Bagnell, J. A., and Peters, J. (2013). Reinforcement learning in robotics: A survey. The International Journal of Robotics Research, 32(11):1238–1274.
  • Lemańczyk, (2021) Lemańczyk, M. (2021). General bernstein-like inequality for additive functionals of markov chains. Journal of Theoretical Probability, 34(3):1426–1454.
  • Li et al., (2021) Li, G., Cai, C., Chen, Y., Gu, Y., Wei, Y., and Chi, Y. (2021). Is q-learning minimax optimal? a tight sample complexity analysis. arXiv preprint arXiv:2102.06548.
  • Li et al., (2022) Li, G., Shi, L., Chen, Y., Chi, Y., and Wei, Y. (2022). Settling the sample complexity of model-based offline reinforcement learning.
  • Li et al., (2023) Li, X., Yang, W., Liang, J., Zhang, Z., and Jordan, M. I. (2023). A statistical analysis of polyak-ruppert averaged q-learning. In Ruiz, F., Dy, J., and van de Meent, J.-W., editors, Proceedings of The 26th International Conference on Artificial Intelligence and Statistics, volume 206 of Proceedings of Machine Learning Research, pages 2207–2261. PMLR.
  • Meyn et al., (2009) Meyn, S., Tweedie, R. L., and Glynn, P. W. (2009). Markov Chains and Stochastic Stability. Cambridge Mathematical Library. Cambridge University Press, 2 edition.
  • Meyn and Down, (1994) Meyn, S. P. and Down, D. (1994). Stability of Generalized Jackson Networks. The Annals of Applied Probability, 4(1):124 – 148.
  • Nummelin, (1978) Nummelin, E. (1978). A splitting technique for harris recurrent markov chains. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 43:309–318.
  • Puterman, (1994) Puterman, M. (1994). Average Reward and Related Criteria, chapter 8, pages 331–440. John Wiley & Sons, Ltd.
  • Sadeghi and Levine, (2016) Sadeghi, F. and Levine, S. (2016). Cad2rl: Real single-image flight without a single real image. arXiv preprint arXiv:1611.04201.
  • Sidford et al., (2018) Sidford, A., Wang, M., Wu, X., Yang, L., and Ye, Y. (2018). Near-optimal time and sample complexities for solving markov decision processes with a generative model. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R., editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.
  • Sutton and Barto, (2018) Sutton, R. and Barto, A. (2018). Reinforcement Learning, second edition: An Introduction. Adaptive Computation and Machine Learning series. MIT Press.
  • Wainwright, (2019a) Wainwright, M. J. (2019a). Stochastic approximation with cone-contractive operators: Sharp \ell_{\infty}-bounds for qq-learning.
  • Wainwright, (2019b) Wainwright, M. J. (2019b). Variance-reduced qq-learning is minimax optimal.
  • Wang et al., (2022) Wang, J., Wang, M., and Yang, L. F. (2022). Near sample-optimal reduction-based policy learning for average reward mdp.
  • Wang, (2017) Wang, M. (2017). Primal-dual π\pi learning: Sample complexity and sublinear run time for ergodic markov decision problems. ArXiv, abs/1710.06100.
  • Wang et al., (2023) Wang, S., Blanchet, J., and Glynn, P. (2023). Optimal sample complexity of reinforcement learning for uniformly ergodic discounted markov decision processes.
  • Watkins and Dayan, (1992) Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3):279–292.

Appendix A Equivalence of tminorizet_{\mathrm{minorize}} and tmix{t_{\mathrm{mix}}}

In this section, we present results and proofs that justify the claim of Theorem 1.

Recall Definition 3. For convenience, we also define the separation between consecutive possible regenerations under measure ϕ\phi with probability qq as

tseparation(P,q,ϕ)=inf{m:infsSPm(s,)qϕ()}.{t_{\mathrm{separation}}}(P,q,\phi)=\inf\left\{{m:\inf_{s\in S}P^{m}(s,\cdot)\geq q\phi(\cdot)}\right\}.

We will drop the PP dependence.
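
As a computational aside, for a finite kernel the best uniform minorization of P^m can be read off from the column-wise minima of P^m. The sketch below searches over m for the smallest ratio m/p_m; it is a hedged illustration that assumes Definition 3 takes an infimum of m/p over valid Doeblin pairs, with the search truncated at a finite m_max.

```python
import numpy as np

def best_minorization(P, m):
    # Largest p with P^m(s, .) >= p * psi(.) for all s: take the column minima of P^m.
    Pm = np.linalg.matrix_power(P, m)
    col_min = Pm.min(axis=0)
    p = col_min.sum()
    psi = col_min / p if p > 0 else col_min
    return p, psi

def t_minorize_numeric(P, m_max=200):
    # Finite truncation of the infimum of m / p_m over Doeblin pairs (m, p_m).
    best = np.inf
    for m in range(1, m_max + 1):
        p, _ = best_minorization(P, m)
        if p > 0:
            best = min(best, m / p)
    return best

# Two-state kernel switching with probability 0.05: one step already minorizes,
# giving t_minorize <= 1 / (2 * 0.05) = 10.
P = np.array([[0.95, 0.05], [0.05, 0.95]])
print(t_minorize_numeric(P))
```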

Now, we proceed to show that the two mixing times tminorizet_{\mathrm{minorize}} and tmix{t_{\mathrm{mix}}} are equivalent up to constants.

Proposition A.1.

If PP is uniformly ergodic with Pm=pψ+(1p)RP^{m}=p\psi+(1-p)R for some m>0m\in\mathbb{Z}_{>0} and p(0,1]p\in(0,1], then

tminorize22tmix22log(16)log(1/(1p))tseparation(p,ψ).t_{\mathrm{minorize}}\leq 22{t_{\mathrm{mix}}}\leq\frac{22\log(16)}{\log(1/(1-p))}{t_{\mathrm{separation}}}(p,\psi).
Proof.

By Theorem UE identity (9),

supsSPn(s,)η()TV2(1p)n/tseparation(p,ψ)\sup_{s\in S}\|P^{n}(s,\cdot)-\eta(\cdot)\|_{\text{TV}}\leq 2(1-p)^{\left\lfloor{n/{t_{\mathrm{separation}}}(p,\psi)}\right\rfloor} (32)

So, choosing

ntseparation(p,ψ)log(16)log(1/(1p))n\geq{t_{\mathrm{separation}}}(p,\psi)\frac{\log(16)}{\log(1/(1-p))}

suffices. Hence, tmixlog(16)log(1/(1p))tseparation(p,ψ){t_{\mathrm{mix}}}\leq\frac{\log(16)}{\log(1/(1-p))}{t_{\mathrm{separation}}}(p,\psi).

For the other inequality, we follow the proof idea in Aldous et al., (1997). The following lemma holds:

Lemma 3 (Lemma 4 in Aldous et al., (1997)).

Let kernel QQ have invariant measure η\eta. If

supxSQ(x,)ηTV14,\sup_{x\in S}\left\|Q(x,\cdot)-\eta\right\|_{\mathrm{TV}}\leq\frac{1}{4},

then there exists ASA\subset S s.t. η(A)1/2\eta(A)\geq 1/2 and for any probability measure μ\mu and yAy\in A

μQ(y)14(1μηTV)η(y).\mu Q(y)\geq\frac{1}{4}(1-\left\|\mu-\eta\right\|_{\mathrm{TV}})\eta(y).

Let Q=PtmixQ=P^{t_{\mathrm{mix}}}, μ=Ptmix(x,)\mu=P^{t_{\mathrm{mix}}}(x,\cdot), then by the above lemma, yA\forall y\in A

P2tmix(x,y)316η(y).P^{2{t_{\mathrm{mix}}}}(x,y)\geq\frac{3}{16}\eta(y).

Therefore, if we define ψ():=η(A)/η(A)=η(|A)\psi(\cdot):=\eta(\cdot\cap A)/\eta(A)=\eta(\cdot|A), because η(A)1/2\eta(A)\geq 1/2,

P2tmix(x,)332ψ().P^{2{t_{\mathrm{mix}}}}(x,\cdot)\geq\frac{3}{32}\psi(\cdot).

This implies that tminorize643tmix22tmixt_{\mathrm{minorize}}\leq\frac{64}{3}{t_{\mathrm{mix}}}\leq 22{t_{\mathrm{mix}}}. ∎

Next, we record the quantitative relationship between having a finite mixing time and satisfying a Doeblin condition:

Proposition A.2.

The following properties hold:

  • If PP has mixing time tmix<{t_{\mathrm{mix}}}<\infty, then for any ϵ>0\epsilon>0, there exists (m,p)(m,p) s.t. m/p22tmix+ϵm/p\leq 22{t_{\mathrm{mix}}}+\epsilon and PP is (m,p)(m,p)-Doeblin.

  • If PP is (m,p)(m,p)-Doeblin, then tmixlog(16)m/p{t_{\mathrm{mix}}}\leq\log(16)m/p.

Proof.

First, note that by Theorem UE, both conditions imply that PP is uniformly ergodic. So, the first entry simply follows from Proposition A.1. For the second entry, note that log(1p)p\log(1-p)\leq-p for p[0,1)p\in[0,1). So, 1/log(1/(1p))1/p1/\log(1/(1-p))\leq 1/p. We conclude that tmixlog(16)log(1/(1p))tseparation(p,ψ)log(16)m/p{t_{\mathrm{mix}}}\leq\frac{\log(16)}{\log(1/(1-p))}{t_{\mathrm{separation}}}(p,\psi)\leq\log(16)m/p. ∎

A.1 Proof of Theorem 1

Proof.

From Propositions A.1 and A.2, there exists (m,p,ψ,R)(m,p,\psi,R) s.t. tminorize22tmix22log(16)m/pt_{\mathrm{minorize}}\leq 22{t_{\mathrm{mix}}}\leq 22\log(16)m/p, where m/pm/p can be made arbitrarily close to tminorizet_{\mathrm{minorize}} by the definition of tminorizet_{\mathrm{minorize}}. ∎

So, knowing the mixing time and knowing the “best” minorization are equivalent for a uniformly ergodic chain. Given the equivalence of tmix{t_{\mathrm{mix}}} and tminorizet_{\mathrm{minorize}}, the tminorizet_{\mathrm{minorize}} appearing in the complexity results in Table 1 can be directly replaced by tmix{t_{\mathrm{mix}}}.
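
As a numerical illustration of Proposition A.2 on the two-state kernel from Definition 5 (assuming, for this sketch only, the common 1/4 total variation threshold in the definition of the mixing time), the following compares the observed mixing time with the bound log(16) m/p.

```python
import numpy as np

def tv_mixing_time(P, threshold=0.25, n_max=10_000):
    # Smallest n with sup_s || P^n(s, .) - eta ||_TV <= threshold; the threshold is an
    # assumption standing in for the paper's definition of t_mix.
    evals, evecs = np.linalg.eig(P.T)
    eta = np.real(evecs[:, np.argmax(np.real(evals))])
    eta = eta / eta.sum()                       # stationary distribution
    Pn = np.eye(P.shape[0])
    for n in range(1, n_max + 1):
        Pn = Pn @ P
        if 0.5 * np.abs(Pn - eta).sum(axis=1).max() <= threshold:
            return n
    return None

P = np.array([[0.95, 0.05], [0.05, 0.95]])      # (1, 2p)-Doeblin with p = 0.05
print(tv_mixing_time(P), np.log(16) * 1 / 0.1)  # observed t_mix vs log(16) * m / p
```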

Appendix B Analysis of Algorithms

In this section, we carry out the proofs that justify our sample complexity upper bounds of the algorithms motivated in Section 4.

B.1 Proof of Lemma 1

Proof.

Recall that tminorize=tminorize(Pπ)t_{\mathrm{minorize}}^{*}=t_{\mathrm{minorize}}(P_{\pi^{*}}). Then by definition, for any ϵ>0\epsilon>0, there exists (m,p)(m,p) s.t. PπP_{\pi^{*}} is (m,p)(m,p)-Doeblin and tminorize+ϵm/pt_{\mathrm{minorize}}^{*}+\epsilon\geq m/p. By Proposition 6.1, |q|span3m/p3(tminorize+ϵ)\left|q^{*}\right|_{\mathrm{span}}\leq 3m/p\leq 3(t_{\mathrm{minorize}}^{*}+\epsilon) and

σ(q)(s,a)2\displaystyle\sigma(q^{*})(s,a)^{2} =Var(r(s,a)+γmaxbAq(S,b))\displaystyle=\mathrm{Var}(r(s,a)+\gamma\max_{b\in A}q^{*}(S^{\prime},b))
=γ2Var(maxbA[q(S,b)q¯])\displaystyle=\gamma^{2}\mathrm{Var}(\max_{b\in A}[q^{*}(S^{\prime},b)-\bar{q}^{*}])
γ2E[(maxbA(q(S,b)q¯))2]\displaystyle\leq\gamma^{2}E[(\max_{b\in A}(q^{*}(S^{\prime},b)-\bar{q}^{*}))^{2}]
4γ2(m/p)2.\displaystyle\leq 4\gamma^{2}(m/p)^{2}.
4γ2(tminorize+ϵ)2.\displaystyle\leq 4\gamma^{2}(t_{\mathrm{minorize}}^{*}+\epsilon)^{2}.

Since ϵ>0\epsilon>0 is arbitrary, this implies the statement of the lemma. ∎

B.2 Proofs for Section 4.2: The General Setting

Let 𝒢l\mathcal{G}_{l} denote the σ\sigma-field generated by the samples used until the end of the ll’th epoch of the variance-reduced Q-learning. We define El[]=E[|𝒢l]E_{l}[\cdot]=E[\cdot|\mathcal{G}_{l}].

Before proving Proposition 4.2, we introduce the following lemma, which underlies its proof and the parameter choice.

Lemma 4.

For ω\omega s.t. q^l1\hat{q}_{l-1} satisfies q^l1qb\|\hat{q}_{l-1}-q^{*}\|_{\infty}\leq b, choosing

kc11(1γ)3log(8|S||A|(1γ)δ),k^{*}\geq c_{1}\frac{1}{(1-\gamma)^{3}}\log\left(\frac{8|S||A|}{(1-\gamma)\delta}\right),

and

nlc21(1γ)2(σ(q)+(1γ)qb+1)2log(8|S||A|δ),n_{l}\geq c_{2}\frac{1}{(1-\gamma)^{2}}\left(\frac{\|\sigma(q^{*})\|_{\infty}+(1-\gamma)\|q^{*}\|_{\infty}}{b}+1\right)^{2}\log\left(\frac{8|S||A|}{\delta}\right),

for some known absolute constants c1,c2c_{1},c_{2}, we have that under the probability measure Pl1()(ω)P_{l-1}(\cdot)(\omega), q^lqb/2\|\hat{q}_{l}-q^{*}\|_{\infty}\leq b/2 w.p. at least 1δ1-\delta.

Proof.

From Section 4.1.6 of Wainwright, 2019b , we have that for ω\omega s.t. q^l1\hat{q}_{l-1} satisfies q^l1qb\|\hat{q}_{l-1}-q^{*}\|_{\infty}\leq b,

ql1,k+1qc(b(1γ)k+blog(8|S||A|k/δ)(1γ)3/2k+b+σ(q)+(1γ)q(1γ)nllog(8|S||A|/δ)).\|q_{l-1,k+1}-q^{*}\|_{\infty}\leq c\left(\frac{b}{(1-\gamma)k}+\frac{b\sqrt{\log(8|S||A|k/\delta)}}{(1-\gamma)^{3/2}\sqrt{k}}+\frac{b+\|\sigma(q^{*})\|+(1-\gamma)\|q^{*}\|_{\infty}}{(1-\gamma)\sqrt{n_{l}}}\sqrt{\log(8|S||A|/\delta)}\right).

w.p. at least 1δ1-\delta under Pl1()P_{l-1}(\cdot), where c>0c>0 is some absolute constant. It is easy to see that indeed under the parameter choice with c1=c2=16cc_{1}=c_{2}=16c, the r.h.s. will be less than b/2b/2. ∎

B.2.1 Proof of Proposition 4.2

Proof.

We recall the conditions for k,{nl},lk^{*},\left\{{n_{l}}\right\},l^{*} in (13) and (14). Let l=l+1l^{\prime}=l^{*}+1. By the chain rule for conditional probability, for l1l^{*}\geq 1

P(q^lq2lb)\displaystyle P(\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq 2^{-l^{*}}b) P(l=0l{q^lq2lb})\displaystyle\geq P\left(\bigcap_{l=0}^{l^{*}}\left\{{\|\hat{q}_{l}-q^{*}\|_{\infty}\leq 2^{-l}b}\right\}\right)
=l=0l1P(q^lq2lb|n=0l1{q^nq2nb})\displaystyle=\prod_{l=0}^{l^{\prime}-1}P\left(\|\hat{q}_{l}-q^{*}\|_{\infty}\leq 2^{-l}b\Bigg{|}\bigcap_{n=0}^{l-1}\left\{{\|\hat{q}_{n}-q^{*}\|_{\infty}\leq 2^{-n}b}\right\}\right)
(1δl)l=1l1P(q^lq2lb|n=0l1{q^nq2nb})\displaystyle\geq\left(1-\frac{\delta}{l^{\prime}}\right)\prod_{l=1}^{l^{\prime}-1}P\left(\|\hat{q}_{l}-q^{*}\|_{\infty}\leq 2^{-l}b\Bigg{|}\bigcap_{n=0}^{l-1}\left\{{\|\hat{q}_{n}-q^{*}\|_{\infty}\leq 2^{-n}b}\right\}\right)

where the last inequality follows from the assumption that q^0qb\|\hat{q}_{0}-q^{*}\|_{\infty}\leq b w.p. at least 1δ/l1-\delta/l^{\prime}.

Let

Al1=n=1l1{q^nq2nb}.A_{l-1}=\bigcap_{n=1}^{l-1}\left\{{\|\hat{q}_{n}-q^{*}\|_{\infty}\leq 2^{-n}b}\right\}.

We analyze the probability

P(q^lq2lb|Al1)\displaystyle P\left(\|\hat{q}_{l}-q^{*}\|_{\infty}\leq 2^{-l}b\middle|A_{l-1}\right) =1P(Al1)E[𝟙{q^lq2lb}𝟙Al1]\displaystyle=\frac{1}{P(A_{l-1})}E\left[\mathds{1}\left\{{\|\hat{q}_{l}-q^{*}\|_{\infty}\leq 2^{-l}b}\right\}\mathds{1}_{A_{l-1}}\right]
=1P(Al1)E[𝟙{q^l1q2l+1b}E[𝟙{q^lq2lb}|𝒢l1]𝟙Al1]\displaystyle=\frac{1}{P(A_{l-1})}E\left[\mathds{1}\left\{{\|\hat{q}_{l-1}-q^{*}\|_{\infty}\leq 2^{-l+1}b}\right\}E\left[\mathds{1}\left\{{\|\hat{q}_{l}-q_{*}\|_{\infty}\leq 2^{-l}b}\right\}\middle|\mathcal{G}_{l-1}\right]\mathds{1}_{A_{l-1}}\right]
1δl.\displaystyle\geq 1-\frac{\delta}{l^{\prime}}.

This is because, for every 1ll11\leq l\leq l^{\prime}-1, the choice of kk^{*} and {nl}\left\{{n_{l}}\right\} in (13) satisfies Lemma 4 with bb replaced by 2l+1b2^{-l+1}b and δ\delta replaced by δ/l\delta/l^{\prime}. Hence

𝟙{q^l1q2l+1b}E[𝟙{q^lq2lb}|𝒢l1]1δl.\mathds{1}\left\{{\|\hat{q}_{l-1}-q^{*}\|_{\infty}\leq 2^{-l+1}b}\right\}E\left[\mathds{1}\left\{{\|\hat{q}_{l}-q_{*}\|_{\infty}\leq 2^{-l}b}\right\}\middle|\mathcal{G}_{l-1}\right]\geq 1-\frac{\delta}{l^{\prime}}.

w.p.1. Therefore,

P(q^lq2lb)\displaystyle P\left(\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq 2^{-l^{*}}b\right) P(q^0qb)(1δl)l1\displaystyle\geq P\left(\|\hat{q}_{0}-q^{*}\|_{\infty}\leq b\right)\left(1-\frac{\delta}{l^{\prime}}\right)^{l^{\prime}-1} (33)
(1δl)l.\displaystyle\geq\left(1-\frac{\delta}{l^{\prime}}\right)^{l^{\prime}}.

To finish the proof, we consider the function

e(η):=(1ηl)l.e(\eta):=\left(1-\frac{\eta}{l}\right)^{l}.

Evidently, for l1l\geq 1 and η[0,1)\eta\in[0,1), e(η)e(\eta) is C2C^{2} with derivatives

e(η)=(1ηl)l1,e′′(η)=l1l(1ηl)l20.e^{\prime}(\eta)=-\left(1-\frac{\eta}{l}\right)^{l-1},\qquad e^{\prime\prime}(\eta)=\frac{l-1}{l}\left(1-\frac{\eta}{l}\right)^{l-2}\geq 0.

So, e(η)e^{\prime}(\eta) is non-decreasing. Hence

e(δ)\displaystyle e(\delta) =e(0)+0δe(η)𝑑η1+e(0)δ1δ.\displaystyle=e(0)+\int_{0}^{\delta}e^{\prime}(\eta)d\eta\geq 1+e^{\prime}(0)\delta\geq 1-\delta. (34)

Note that by the choice of ll^{*} in (14), the assumption that b/ϵ1b/\epsilon\geq 1 implies that l=l+11l^{\prime}=l^{*}+1\geq 1. Finally, combining this with estimate (34) and the high probability error bound (33) yields that P(q^lqϵ)1δP\left(\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon\right)\geq 1-\delta. ∎

Remark.

The proof implies that the error bound holds in a stronger pathwise sense; i.e. we can conclude from the same assumptions as in Proposition 4.2 that P({l,q^lq2lb})1δP\left(\left\{{\forall l,\|\hat{q}_{l}-q^{*}\|_{\infty}\leq 2^{-l}b}\right\}\right)\geq 1-\delta. Moreover, the geometric progression 2lb2^{-l}b can be replaced by other pathwise tolerance levels, which will lead to different error bounds and complexity guarantees.

B.2.2 Proof of Proposition 4.3

Proof.

Recall from Lemma 1 that under Assumption 1, σ(q)2tminorize\|\sigma(q^{*})\|_{\infty}\leq 2t_{\mathrm{minorize}}^{*}. By the choice of k0k_{0} in (17) and the high probability error bound for Algorithm 1 in Proposition 4.1, we can conclude that w.p. at least

1δlog2((1γ)1ϵ1)+11δl+1,1-\frac{\delta}{\left\lceil{\log_{2}((1-\gamma)^{-1}\epsilon^{-1})}\right\rceil+1}\geq 1-\frac{\delta}{l^{*}+1},

the error q^0qσ(q)/4+1/2tminorize\|\hat{q}_{0}-q^{*}\|_{\infty}\leq\|\sigma(q^{*})\|_{\infty}/4+1/2\leq t_{\mathrm{minorize}}^{*} .

Next, we would like to apply Proposition 4.2. The previous analysis implies that we can let b=tminorizeb=t_{\mathrm{minorize}}^{*}. Note that the assumption ϵtminorize\epsilon\leq t_{\mathrm{minorize}}^{*} implies b/ϵ1b/\epsilon\geq 1. Also, under this bb, the requirement in Proposition 4.2 for nln_{l} satisfies

c21(1γ)2(σ(q)+(1γ)q2l+1b+1)2log(8(l+1)|S||A|δ)3c24l(1γ)2log(8(l+1)|S||A|δ)\displaystyle c_{2}\frac{1}{(1-\gamma)^{2}}\left(\frac{\|\sigma(q^{*})\|_{\infty}+(1-\gamma)\|q^{*}\|_{\infty}}{2^{-l+1}b}+1\right)^{2}\log\left(\frac{8(l^{*}+1)|S||A|}{\delta}\right)\leq 3c_{2}\frac{4^{l}}{(1-\gamma)^{2}}\log\left(\frac{8(l^{*}+1)|S||A|}{\delta}\right)

Therefore, we have that for all l[log2(tminorize/ϵ),log2((1γ)1ϵ1)]l^{*}\in[\left\lceil{\log_{2}(t_{\mathrm{minorize}}^{*}/\epsilon)}\right\rceil,\left\lceil{\log_{2}((1-\gamma)^{-1}\epsilon^{-1})}\right\rceil]\cap\mathbb{Z}, the parameters kk^{*} and {nl}\left\{{n_{l}}\right\} in (16) will satisfy Proposition 4.2. Hence, we conclude that q^lqϵ\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon w.p. at least 1δ1-\delta. ∎

B.2.3 Proof of Theorem 2

Proof.

Let nn^{*} be the total number of 1-sample Bellman operators used by the overall procedure in Proposition 4.3 with l=log2(tminorize/ϵ)l^{*}=\left\lceil{\log_{2}(t_{\mathrm{minorize}}^{*}/\epsilon)}\right\rceil. The total number of samples used by the algorithm is

|S||A|n=|S||A|(k0+l=1lnl+lk)\displaystyle|S||A|n^{*}=|S||A|\left(k_{0}+\sum_{l=1}^{l^{*}}n_{l}+l^{*}k^{*}\right)
|S||A|(c01(1γ)3+3c24l+1/3(1γ)2+c1log2(tminorize/ϵ)(1γ)3)log(8d(1γ)δ)\displaystyle\leq|S||A|\left(c_{0}\frac{1}{(1-\gamma)^{3}}+3c_{2}\frac{4^{l^{*}+1}/3}{(1-\gamma)^{2}}+c_{1}\frac{\left\lceil{\log_{2}(t_{\mathrm{minorize}}^{*}/\epsilon)}\right\rceil}{(1-\gamma)^{3}}\right)\log\left(\frac{8d}{(1-\gamma)\delta}\right)
c|S||A|((tminorize)2ϵ2(1γ)2+log2(tminorize/ϵ)(1γ)3)log(8d(1γ)δ)\displaystyle\leq c^{\prime}|S||A|\left(\frac{(t_{\mathrm{minorize}}^{*})^{2}}{\epsilon^{2}(1-\gamma)^{2}}+\frac{\left\lceil{\log_{2}(t_{\mathrm{minorize}}^{*}/\epsilon)}\right\rceil}{(1-\gamma)^{3}}\right)\log\left(\frac{8d}{(1-\gamma)\delta}\right)
=O~(|S||A|(1γ)2((tminorize)2ϵ2+11γ))\displaystyle=\tilde{O}\left(\frac{|S||A|}{(1-\gamma)^{2}}\left(\frac{(t_{\mathrm{minorize}}^{*})^{2}}{\epsilon^{2}}+\frac{1}{1-\gamma}\right)\right)
=O~(|S||A|(tminorize)2(1γ)2ϵ2max{1,ϵ2(tminorize)2(1γ)}).\displaystyle=\tilde{O}\left(\frac{|S||A|(t_{\mathrm{minorize}}^{*})^{2}}{(1-\gamma)^{2}\epsilon^{2}}\max\left\{{1,\frac{\epsilon^{2}}{(t_{\mathrm{minorize}}^{*})^{2}(1-\gamma)}}\right\}\right).

B.3 Proofs for Section 4.3: The Lipschitz Setting

B.3.1 Proof of Lemma 2

Proof.

For any πΠ\pi\in\Pi^{*}, consider σ(q)(s,a)2=Vars,a(𝒯^(q))=σπ(s,a)2\sigma(q^{*})(s,a)^{2}=\mathrm{Var}_{s,a}(\widehat{\mathcal{T}}(q^{*}))=\sigma^{\pi}(s,a)^{2}. Recall the definition of νπ()2\nu^{\pi}(\mathcal{M})^{2} in equation (18). Let Dσ(q)2D_{\sigma(q^{*})^{2}} be the diagonal matrix with σ(q)2\sigma(q^{*})^{2} on the diagonal. Then

νπ()2\displaystyle\|\nu^{\pi}(\mathcal{M})^{2}\|_{\infty} =diag((IγPπ)1Dσ(q)2(IγPπ)T)\displaystyle=\|\mathrm{diag}((I-\gamma P^{\pi})^{-1}D_{\sigma(q^{*})^{2}}(I-\gamma P^{\pi})^{-T})\|_{\infty}
=maxs,as,a((IγPπ)1(s,a,s,a)σ(q)(s,a))2\displaystyle=\max_{s,a}\sum_{s^{\prime},a^{\prime}}\left((I-\gamma P^{\pi})^{-1}(s,a,s^{\prime},a^{\prime})\sigma(q^{*})(s^{\prime},a^{\prime})\right)^{2}
maxs,a(s,a(IγPπ)1(s,a,s,a)σ(q)(s,a))2\displaystyle\leq\max_{s,a}\left(\sum_{s^{\prime},a^{\prime}}(I-\gamma P^{\pi})^{-1}(s,a,s^{\prime},a^{\prime})\sigma(q^{*})(s^{\prime},a^{\prime})\right)^{2}
=(IγPπ)1σπ2\displaystyle=\|(I-\gamma P^{\pi})^{-1}\sigma^{\pi}\|_{\infty}^{2}
6400tminorize(Pπ)(1γ)2\displaystyle\leq\frac{6400t_{\mathrm{minorize}}(P_{\pi})}{(1-\gamma)^{2}}

where we also used the Cauchy-Schwarz inequality and Corollary 6.2.1. Taking the maximum over πΠ\pi\in\Pi^{*} and recalling the definition of tminorizet_{\mathrm{minorize}}^{*} completes the proof. ∎

B.3.2 Proof of Theorem 3

Proof.

From Lemma 2, we have that if

n=(c+25600(c)2)max{1,(1γ)tminorizeϵ2}1(1γ)3log4(8|S||A|lδ),n^{*}=\left(c+25600(c^{\prime})^{2}\right)\max\left\{{1,\frac{(1-\gamma)t_{\mathrm{minorize}}^{*}}{\epsilon^{2}}}\right\}\frac{1}{(1-\gamma)^{3}}\log_{4}\left(\frac{8|S||A|l^{*}}{\delta}\right),

then the r.h.s. of (22) is less than ϵ\epsilon. Here c,cc,c^{\prime} are constants in Theorem K2. So, for ω:q^0(ω)q1/1γ\omega:\|\hat{q}_{0}(\omega)-q^{*}\|_{\infty}\leq 1/\sqrt{1-\gamma}, we have that by Theorem K2, P(q^lqϵ|q^0)1δP(\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon|\hat{q}_{0})\geq 1-\delta, provided that nn^{*} satisfies the requirement of Theorem K2. This happens when ϵ\epsilon satisfies the requirement in Theorem 3.

Indeed, under Assumption 2, if (19) holds, then

ncmax{1,1Δ2(1γ)β}1(1γ)3log4(8|S||A|lδ)n^{*}\geq c\max\left\{{1,\frac{1}{\Delta^{2}(1-\gamma)^{\beta}}}\right\}\frac{1}{(1-\gamma)^{3}}\log_{4}\left(\frac{8|S||A|l^{*}}{\delta}\right)

satisfies the requirement of Theorem 3. Moreover, under Assumption 3, if (20) holds,

ncmax{1,1(1γ)βmin{1Δ2,L2(1γ)2}}1(1γ)3log4(8|S||A|lδ)n^{*}\geq c\max\left\{{1,\frac{1}{(1-\gamma)^{\beta}}\min\left\{{\frac{1}{\Delta^{2}},\frac{L^{2}}{(1-\gamma)^{2}}}\right\}}\right\}\frac{1}{(1-\gamma)^{3}}\log_{4}\left(\frac{8|S||A|l^{*}}{\delta}\right)

satisfies the requirement of Theorem 3 as well.

Now, consider the error probability of the combined procedure. We have

P(q^lqϵ)\displaystyle P(\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon) E𝟙{q^0q1/1γ,q^lqϵ}\displaystyle\geq E\mathds{1}\left\{{\|\hat{q}_{0}-q^{*}\|_{\infty}\leq 1/\sqrt{1-\gamma},\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon}\right\}
E𝟙{q^0q1/1γ}P(q^lqϵ|q^0)\displaystyle\geq E\mathds{1}\left\{{\|\hat{q}_{0}-q^{*}\|_{\infty}\leq 1/\sqrt{1-\gamma}}\right\}P\left(\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon|\hat{q}_{0}\right)
(1δ2)\displaystyle\geq(1-\delta^{2})
12δ.\displaystyle\geq 1-2\delta.

By replacing δ\delta with δ/2\delta/2 in Theorem K2, we conclude that it suffices to choose

n=O~(tminorizeϵ2(1γ)2max{ϵ2(1γ)tminorize,1}).n^{*}=\tilde{O}\left(\frac{t_{\mathrm{minorize}}^{*}}{\epsilon^{2}(1-\gamma)^{2}}\max\left\{{\frac{\epsilon^{2}}{(1-\gamma)t_{\mathrm{minorize}}^{*}},1}\right\}\right).

The total number of samples used by the combined algorithm, therefore, is

|S||A|(k0+n)\displaystyle|S||A|(k_{0}+n^{*}) =|S||A|(O~((tminorize)2(1γ)2)+O~(tminorizeϵ2(1γ)2max{ϵ2(1γ)tminorize,1}))\displaystyle=|S||A|\left(\tilde{O}\left(\frac{(t_{\mathrm{minorize}}^{*})^{2}}{(1-\gamma)^{2}}\right)+\tilde{O}\left(\frac{t_{\mathrm{minorize}}^{*}}{\epsilon^{2}(1-\gamma)^{2}}\max\left\{{\frac{\epsilon^{2}}{(1-\gamma)t_{\mathrm{minorize}}^{*}},1}\right\}\right)\right)
=O~(tminorizeϵ2(1γ)2max{1,ϵ2tminorize,ϵ2(1γ)tminorize}).\displaystyle=\tilde{O}\left(\frac{t_{\mathrm{minorize}}^{*}}{\epsilon^{2}(1-\gamma)^{2}}\max\left\{{1,\epsilon^{2}t_{\mathrm{minorize}}^{*},\frac{\epsilon^{2}}{(1-\gamma)t_{\mathrm{minorize}}^{*}}}\right\}\right).

This implies the claim of Theorem 3. ∎

B.4 Proofs for Section 4.4: Uniform Mixing

Recall the definitions of r¯l\bar{r}_{l} and q¯l\bar{q}_{l} in equations (24) and (25), respectively. We denote an optimal policy for q¯l\bar{q}_{l} by π¯l\bar{\pi}_{l}. Before we prove Proposition 4.4, we introduce the following lemmas.

Lemma 5 (Lemma 5 in Wainwright, 2019b ).

For ω:q^l1qb\omega:\|\hat{q}_{l-1}-q^{*}\|\leq b, w.p. at least 1δ1-\delta under Pl1()(ω)P_{l-1}(\cdot)(\omega)

|r¯lr|ca(b+σ(q))log(4|S||A|/δ)nl+cbqlog(4|S||A|/δ)nl|\bar{r}_{l}-r|\leq c_{a}(b+\sigma(q^{*}))\sqrt{\frac{\log(4|S||A|/\delta)}{n_{l}}}+c_{b}\|q^{*}\|_{\infty}\frac{\log(4|S||A|/\delta)}{n_{l}}

for some absolute constants ca,cbc_{a},c_{b}.

Lemma 6 (Lemma 3 in Wainwright, 2019b ).

Let kk^{*} satisfy (26) for some sufficiently large absolute constant cc. Then under Pl1()P_{l-1}(\cdot),

q^lq¯l14q^l1q¯l\|\hat{q}_{l}-\bar{q}_{l}\|_{\infty}\leq\frac{1}{4}\|\hat{q}_{l-1}-\bar{q}_{l}\|

w.p. at least 1δ/(2(l+1))1-\delta/(2(l^{*}+1)).

Lemma 7.

On {ω:q^l1qtminorize}\left\{{\omega:\|\hat{q}_{l-1}-q^{*}\|\leq\sqrt{t_{\mathrm{minorize}}}}\right\}, for sufficiently large absolute constants c,cc,c^{\prime},

qq¯lc((tminorize1γ+γ2q¯lq1γ)log(8|S||A|(l+1)/δ)nl+log(8|S||A|(l+1)/δ)(1γ)2nl).\|q^{*}-\bar{q}_{l}\|_{\infty}\leq c\left(\left(\frac{\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}+\frac{\gamma^{2}\|\bar{q}_{l}-q^{*}\|_{\infty}}{1-\gamma}\right)\sqrt{\frac{\log(8|S||A|(l^{*}+1)/\delta)}{n_{l}}}+\frac{\log(8|S||A|(l^{*}+1)/\delta)}{(1-\gamma)^{2}n_{l}}\right).

w.p. at least 1δ/(2(l+1))1-\delta/(2(l^{*}+1)) under probability measure Pl1P_{l-1}, provided that nl(c)2(1γ)2log(8(l+1)|S||A|/δ)n_{l}\geq(c^{\prime})^{2}(1-\gamma)^{-2}\log(8(l^{*}+1)|S||A|/\delta).

B.4.1 Proof of Lemma 7

Proof.

By Lemma 5, we have that on {ω:q^l1qtminorize}\left\{{\omega:\|\hat{q}_{l-1}-q^{*}\|\leq\sqrt{t_{\mathrm{minorize}}}}\right\} under Pl1P_{l-1} w.p. at least 1δ/(2l)1-\delta/(2l^{*})

|r¯lr|ca(tminorize+σ(q))log(8l|S||A|/δ)nl+cbqlog(8l|S||A|/δ)nl.|\bar{r}_{l}-r|\leq c_{a}(\sqrt{t_{\mathrm{minorize}}}+\sigma(q^{*}))\sqrt{\frac{\log(8l^{*}|S||A|/\delta)}{n_{l}}}+c_{b}\|q^{*}\|_{\infty}\frac{\log(8l^{*}|S||A|/\delta)}{n_{l}}. (35)

Also, by Lemma 6 in Wainwright, 2019b ,

|qq¯l|max{(IγPπ)1|r¯lr|,(IγPπ¯l)1|r¯lr|}.|q^{*}-\bar{q}_{l}|\leq\max\left\{{(I-\gamma P^{\pi^{*}})^{-1}|\bar{r}_{l}-r|,(I-\gamma P^{\bar{\pi}_{l}})^{-1}|\bar{r}_{l}-r|}\right\}. (36)

Here, ||\left|\cdot\right| denotes the entrywise absolute value, and the inequalities hold entrywise. Let BlB_{l} be the set on which (35) holds. Then on BlB_{l}, the vector (IγPπ)1|r¯lr|(I-\gamma P^{\pi^{*}})^{-1}|\bar{r}_{l}-r| satisfies

(IγPπ)1|r¯lr|\displaystyle\quad(I-\gamma P^{\pi^{*}})^{-1}|\bar{r}_{l}-r|
ca(tminorize1γ+(IγPπ)1σ(q))log(8l|S||A|/δ)nl+cbq1γlog(8l|S||A|/δ)nl\displaystyle\leq c_{a}\left(\frac{\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}+\|(I-\gamma P^{\pi^{*}})^{-1}\sigma(q^{*})\|_{\infty}\right)\sqrt{\frac{\log(8l^{*}|S||A|/\delta)}{n_{l}}}+c_{b}\frac{\|q^{*}\|_{\infty}}{1-\gamma}\frac{\log(8l^{*}|S||A|/\delta)}{n_{l}}
ca81tminorize1γlog(8l|S||A|/δ)nl+cblog(8l|S||A|/δ)(1γ)2nl\displaystyle\leq c_{a}\frac{81\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}\sqrt{\frac{\log(8l^{*}|S||A|/\delta)}{n_{l}}}+c_{b}\frac{\log(8l^{*}|S||A|/\delta)}{(1-\gamma)^{2}n_{l}}

Moreover, let v¯l()=q¯l(,π¯l())\bar{v}_{l}(\cdot)=\bar{q}_{l}(\cdot,\bar{\pi}_{l}(\cdot)). Then

σ(q)(s,a)2\displaystyle\sigma(q^{*})(s,a)^{2} =γ2Ps,a[(v(q)Ps,a[v(q)])2]\displaystyle=\gamma^{2}P_{s,a}[(v(q^{*})-P_{s,a}[v(q^{*})])^{2}]
=γ2Ps,a[(vPs,a[v])2]\displaystyle=\gamma^{2}P_{s,a}[(v^{*}-P_{s,a}[v^{*}])^{2}]
=γ2Ps,a[(vv¯l+v¯lPs,a[v¯l]+Ps,a[v¯l]Ps,a[v])2]\displaystyle=\gamma^{2}P_{s,a}[(v^{*}-\bar{v}_{l}+\bar{v}_{l}-P_{s,a}[\bar{v}_{l}]+P_{s,a}[\bar{v}_{l}]-P_{s,a}[v^{*}])^{2}]
=γ2Vars,a(v(S)v¯l(S))+γ2Vars,a(v¯l(S))\displaystyle=\gamma^{2}\mathrm{Var}_{s,a}(v^{*}(S^{\prime})-\bar{v}_{l}(S^{\prime}))+\gamma^{2}\mathrm{Var}_{s,a}(\bar{v}_{l}(S^{\prime}))
γ2Vars,a(v¯l(S))+γ2q¯lq\displaystyle\leq\gamma^{2}\mathrm{Var}_{s,a}(\bar{v}_{l}(S^{\prime}))+\gamma^{2}\|\bar{q}_{l}-q^{*}\|_{\infty}

Let σl(s,a)2=γ2Vars,a(v¯l(S))\sigma_{l}(s,a)^{2}=\gamma^{2}\mathrm{Var}_{s,a}(\bar{v}_{l}(S^{\prime})). By construction and the optimality of π¯l\bar{\pi}_{l}, σl2\sigma_{l}^{2} is the variance of the 1-sample Bellman operator associated with the reward r¯l\bar{r}_{l} (c.f. definition (6) and (7)). Therefore, we have that by Corollary 6.2.1,

(IγPπ¯l)1σ(q)\displaystyle\|(I-\gamma P^{\bar{\pi}_{l}})^{-1}\sigma(q^{*})\|_{\infty} (IγPπ¯l)1σl+γ2q¯lq1γ\displaystyle\leq\|(I-\gamma P^{\bar{\pi}_{l}})^{-1}\sigma_{l}\|_{\infty}+\frac{\gamma^{2}\|\bar{q}_{l}-q^{*}\|_{\infty}}{1-\gamma} (37)
r¯l80tminorize1γ+γ2q¯lq1γ\displaystyle\leq\|\bar{r}_{l}\|_{\infty}\frac{80\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}+\frac{\gamma^{2}\|\bar{q}_{l}-q^{*}\|_{\infty}}{1-\gamma}

where we note that tminorize(Pπ¯l)tminorizet_{\mathrm{minorize}}(P_{\bar{\pi}_{l}})\leq t_{\mathrm{minorize}} by the definition of tminorizet_{\mathrm{minorize}} in Assumption 4. For ωBl\omega\in B_{l}, by the assumption on nln_{l} and c10(cacb)c^{\prime}\geq 10(c_{a}\vee c_{b})

r¯l\displaystyle\|\bar{r}_{l}\|_{\infty} 1+r¯lr\displaystyle\leq 1+\|\bar{r}_{l}-r\|_{\infty}
1+ca(tminorize+σ(q))log(8l|S||A|/δ)nl+cbqlog(8l|S||A|/δ)nl\displaystyle\leq 1+c_{a}(\sqrt{t_{\mathrm{minorize}}}+\left\|\sigma(q^{*})\right\|_{\infty})\sqrt{\frac{\log(8l^{*}|S||A|/\delta)}{n_{l}}}+c_{b}\|q^{*}\|_{\infty}\frac{\log(8l^{*}|S||A|/\delta)}{n_{l}}
1+5catminorizelog(8l|S||A|/δ)nl+cblog(8l|S||A|/δ)(1γ)nl\displaystyle\leq 1+5c_{a}t_{\mathrm{minorize}}\sqrt{\frac{\log(8l^{*}|S||A|/\delta)}{n_{l}}}+c_{b}\frac{\log(8l^{*}|S||A|/\delta)}{(1-\gamma)n_{l}}
3.\displaystyle\leq 3.

So, on BlB_{l}, the error

(IγPπ¯l)1|r¯lr|\displaystyle\quad(I-\gamma P^{\bar{\pi}_{l}})^{-1}|\bar{r}_{l}-r|
(i)ca(tminorize1γ+(IγPπ¯l)1σ(q))log(8l|S||A|/δ)nl+cbq1γlog(8l|S||A|/δ)nl\displaystyle\stackrel{{\scriptstyle(i)}}{{\leq}}c_{a}\left(\frac{\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}+\|(I-\gamma P^{\bar{\pi}_{l}})^{-1}\sigma(q^{*})\|_{\infty}\right)\sqrt{\frac{\log(8l^{*}|S||A|/\delta)}{n_{l}}}+c_{b}\frac{\|q^{*}\|_{\infty}}{1-\gamma}\frac{\log(8l^{*}|S||A|/\delta)}{n_{l}}
(ii)ca(241tminorize1γ+γ2q¯lq1γ)log(8l|S||A|/δ)nl+cblog(8l|S||A|/δ)(1γ)2nl.\displaystyle\stackrel{{\scriptstyle(ii)}}{{\leq}}c_{a}\left(\frac{241\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}+\frac{\gamma^{2}\|\bar{q}_{l}-q^{*}\|_{\infty}}{1-\gamma}\right)\sqrt{\frac{\log(8l^{*}|S||A|/\delta)}{n_{l}}}+c_{b}\frac{\log(8l^{*}|S||A|/\delta)}{(1-\gamma)^{2}n_{l}}.

where (i)(i) follows from (35), (ii)(ii) replaces (IγPπ¯l)1σ(q)\|(I-\gamma P^{\bar{\pi}_{l}})^{-1}\sigma(q^{*})\|_{\infty} with the upper bound (37). Therefore, combining these with (36), we have that w.p. at least 1δ/(2l)1-\delta/(2l^{*})

|qq¯l|c((tminorize1γ+γ2q¯lq1γ)log(8l|S||A|/δ)nl+log(8l|S||A|/δ)(1γ)2nl)|q^{*}-\bar{q}_{l}|\leq c\left(\left(\frac{\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}+\frac{\gamma^{2}\|\bar{q}_{l}-q^{*}\|_{\infty}}{1-\gamma}\right)\sqrt{\frac{\log(8l^{*}|S||A|/\delta)}{n_{l}}}+\frac{\log(8l^{*}|S||A|/\delta)}{(1-\gamma)^{2}n_{l}}\right)

for some large absolute constant cc. This implies the claimed result. ∎

Equipped with Lemmas 6 and 7, we are ready to prove Proposition 4.4.

B.4.2 Proof of Proposition 4.4

Proof.

By the choice of nln_{l} with sufficiently large cc^{\prime}

cγ2q¯lq1γlog(8|S||A|(l+1)/δ)nl12q¯lq.c\frac{\gamma^{2}\|\bar{q}_{l}-q^{*}\|_{\infty}}{1-\gamma}\sqrt{\frac{\log(8|S||A|(l^{*}+1)/\delta)}{n_{l}}}\leq\frac{1}{2}\|\bar{q}_{l}-q^{*}\|_{\infty}.

Moreover, nln_{l} satisfies the assumption of Lemma 7 for every ll. Therefore, Lemma 7 implies that on {ω:q^l1q2l+1tminorize}{ω:q^l1qtminorize}\left\{{\omega:\|\hat{q}_{l-1}-q^{*}\|\leq 2^{-l+1}\sqrt{t_{\mathrm{minorize}}}}\right\}\subset\left\{{\omega:\|\hat{q}_{l-1}-q^{*}\|\leq\sqrt{t_{\mathrm{minorize}}}}\right\} w.p. at least 1δ/(2(l+1))1-\delta/(2(l^{*}+1)) under Pl1()P_{l-1}(\cdot)

qq¯l2c(tminorize1γlog(8|S||A|(l+1)/δ)nl+log(8|S||A|(l+1)/δ)(1γ)2nl).\|q^{*}-\bar{q}_{l}\|_{\infty}\leq 2c\left(\frac{\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}\sqrt{\frac{\log(8|S||A|(l^{*}+1)/\delta)}{n_{l}}}+\frac{\log(8|S||A|(l^{*}+1)/\delta)}{(1-\gamma)^{2}n_{l}}\right).

Observe that the choice of kk^{*} in (27) satisfies (26). Hence, by Lemmas 6 and 7 and the union bound, on {ω:q^l1q2l+1tminorize}\left\{{\omega:\|\hat{q}_{l-1}-q^{*}\|\leq 2^{-l+1}\sqrt{t_{\mathrm{minorize}}}}\right\} w.p. at least 1δ/(l+1)1-\delta/(l^{*}+1) under Pl1()P_{l-1}(\cdot),

q^lq\displaystyle\|\hat{q}_{l}-q^{*}\|_{\infty} q^lq¯l+qq¯l\displaystyle\leq\|\hat{q}_{l}-\bar{q}_{l}\|_{\infty}+\|q^{*}-\bar{q}_{l}\|_{\infty}
14q^l1q¯l+qq¯l\displaystyle\leq\frac{1}{4}\|\hat{q}_{l-1}-\bar{q}_{l}\|_{\infty}+\|q^{*}-\bar{q}_{l}\|_{\infty}
14q^l1q+54qq¯l\displaystyle\leq\frac{1}{4}\|\hat{q}_{l-1}-q^{*}\|_{\infty}+\frac{5}{4}\|q^{*}-\bar{q}_{l}\|_{\infty}
2ltminorize2+52c(tminorize1γlog(8|S||A|(l+1)/δ)nl+log(8|S||A|(l+1)/δ)(1γ)2nl)\displaystyle\leq\frac{2^{-l}\sqrt{t_{\mathrm{minorize}}}}{2}+\frac{5}{2}c\left(\frac{\sqrt{t_{\mathrm{minorize}}}}{1-\gamma}\sqrt{\frac{\log(8|S||A|(l^{*}+1)/\delta)}{n_{l}}}+\frac{\log(8|S||A|(l^{*}+1)/\delta)}{(1-\gamma)^{2}n_{l}}\right)
2ltminorize\displaystyle\leq 2^{-l}\sqrt{t_{\mathrm{minorize}}}

where the last inequality follows from the choice of nln_{l}. Therefore, repeating the proof of Proposition 4.2 with b=tminorizeb=\sqrt{t_{\mathrm{minorize}}} allows us to conclude that q^lqϵ\|\hat{q}_{l^{*}}-q^{*}\|_{\infty}\leq\epsilon w.p. at least 1δ1-\delta. Note that the assumption ϵtminorize\epsilon\leq\sqrt{t_{\mathrm{minorize}}} implies b/ϵ1b/\epsilon\geq 1.

As in the previous remark following Proposition 4.2, the event {1ll,q^lq2ltminorize}\left\{{\forall 1\leq l\leq l^{*},\|\hat{q}_{l}-q^{*}\|_{\infty}\leq 2^{-l}\sqrt{t_{\mathrm{minorize}}}}\right\} holds w.p. at least 1δ1-\delta. ∎

B.4.3 Proof of Proposition 4.5

Proof.

The proof is the same as that of Proposition 4.3 given Proposition 4.2. ∎

B.4.4 Proof of Theorem 4

Proof.

Notice that Assumption 1 is implied by Assumption 4 with tminorizetminorizet_{\mathrm{minorize}}^{*}\leq t_{\mathrm{minorize}}. As pointed out before, the sample size to compute q^0\hat{q}_{0}, by Theorem 2, is O~(|S||A|(1γ)3)\tilde{O}(|S||A|(1-\gamma)^{-3}). The overall sample complexity of this procedure is

O~(|S||A|(1γ)3)+c|S||A|(tminorizeϵ2(1γ)2+log2(tminorize/ϵ)(1γ)3)log(8dtminorize(1γ)δ)\displaystyle\tilde{O}\left(\frac{|S||A|}{(1-\gamma)^{3}}\right)+c^{\prime}|S||A|\left(\frac{t_{\mathrm{minorize}}}{\epsilon^{2}(1-\gamma)^{2}}+\frac{\left\lceil{\log_{2}(\sqrt{t_{\mathrm{minorize}}}/\epsilon)}\right\rceil}{(1-\gamma)^{3}}\right)\log\left(\frac{8dt_{\mathrm{minorize}}}{(1-\gamma)\delta}\right)
=O~(|S||A|(1γ)3)+O~(|S||A|(1γ)2max{11γ,tminorizeϵ2})\displaystyle=\tilde{O}\left(\frac{|S||A|}{(1-\gamma)^{3}}\right)+\tilde{O}\left(\frac{|S||A|}{(1-\gamma)^{2}}\max\left\{{\frac{1}{1-\gamma},\frac{t_{\mathrm{minorize}}}{\epsilon^{2}}}\right\}\right)
=O~(|S||A|(1γ)2max{11γ,tminorizeϵ2}).\displaystyle=\tilde{O}\left(\frac{|S||A|}{(1-\gamma)^{2}}\max\left\{{\frac{1}{1-\gamma},\frac{t_{\mathrm{minorize}}}{\epsilon^{2}}}\right\}\right).

Appendix C Proofs of the Minimax Lower Bounds

C.1 Proof of Theorem 5

Proof.

By Theorem K1, it suffices to show that for any γ[0.9,1)\gamma\in[0.9,1) and 10tminorize1/(1γ)10\leq t_{\mathrm{minorize}}\leq 1/(1-\gamma), the MDP instance γ,tminorize\mathcal{M}_{\gamma,t_{\mathrm{minorize}}} satisfies

maxπΠνπ(γ,tminorize)2tminorize35(1γ)2.\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})^{2}\|_{\infty}\geq\frac{t_{\mathrm{minorize}}}{3^{5}(1-\gamma)^{2}}. (38)

We compute the minimax risk parameter for γ,tminorize\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}

νπ(γ,tminorize)(1,a1)2\displaystyle\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})(1,a_{1})^{2} =γ2Var({(IγPπ)1𝒯^(q)}(1,a1))\displaystyle=\gamma^{2}\mathrm{Var}\left(\left\{{(I-\gamma P^{\pi^{*}})^{-1}\widehat{\mathcal{T}}(q^{*})}\right\}(1,a_{1})\right)
=γ2(1p)p(12γ(1p)+γ2(12p+2p2))(1γ)2(1γ+2γp)4.\displaystyle=\gamma^{2}\frac{(1-p)p(1-2\gamma(1-p)+\gamma^{2}(1-2p+2p^{2}))}{(1-\gamma)^{2}(1-\gamma+2\gamma p)^{4}}.

We show that νπ(γ,tminorize)(1,a1)2=Ω(p1(1γ)2)\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})(1,a_{1})^{2}=\Omega(p^{-1}(1-\gamma)^{-2}) for tminorize=1/p1/(1γ)t_{\mathrm{minorize}}=1/p\leq 1/(1-\gamma), p(0,0.1]p\in(0,0.1], and γ[0.9,1)\gamma\in[0.9,1). Define the function

f(p,γ):=(1γ)2pγ2νπ(q)(1,a1)2.f(p,\gamma):=\frac{(1-\gamma)^{2}p}{\gamma^{2}}\nu^{\pi^{*}}(q^{*})(1,a_{1})^{2}.

It suffices to prove that f(p,γ)f(p,\gamma) is uniformly bounded away from 0 in the region of interest.

We compute the derivative w.r.t. γ\gamma,

γf(p,γ)=2(1p)p2((γ1)24γ2p32γ(23γ)p2(4γ27γ+3)p)(γ(2p1)+1)5.\partial_{\gamma}f(p,\gamma)=\frac{2(1-p)p^{2}\left((\gamma-1)^{2}-4\gamma^{2}p^{3}-2\gamma(2-3\gamma)p^{2}-\left(4\gamma^{2}-7\gamma+3\right)p\right)}{(\gamma(2p-1)+1)^{5}}.

Note that (γ(2p1)+1)5>0(\gamma(2p-1)+1)^{5}>0. Moreover, let

h=(γ1)24γ2p32γ(23γ)p2(4γ27γ+3)ph=(\gamma-1)^{2}-4\gamma^{2}p^{3}-2\gamma(2-3\gamma)p^{2}-\left(4\gamma^{2}-7\gamma+3\right)p

For p(0,0.1]p\in(0,0.1] and γ[0.9,1)\gamma\in[0.9,1), we can bound

h\displaystyle h =(γ1)24γ2p3+2γ(3γ2)p2+(1γ)(4γ3)p\displaystyle=(\gamma-1)^{2}-4\gamma^{2}p^{3}+2\gamma(3\gamma-2)p^{2}+(1-\gamma)(4\gamma-3)p
(γ1)24γ2p3+γ2p2+γ(5γ4)p2\displaystyle\geq(\gamma-1)^{2}-4\gamma^{2}p^{3}+\gamma^{2}p^{2}+\gamma(5\gamma-4)p^{2}
>(γ1)2\displaystyle>(\gamma-1)^{2}
0.\displaystyle\geq 0.

So, γf(p,γ)>0\partial_{\gamma}f(p,\gamma)>0 for p(0,0.1]p\in(0,0.1] and γ[0.9,1)\gamma\in[0.9,1), which implies that f(p,)f(p,\cdot) is non-decreasing. Recall that 1/p1/(1γ)1/p\leq 1/(1-\gamma); i.e. γ1p\gamma\geq 1-p. So, for p0.1p\leq 0.1, γ0.9\gamma\geq 0.9, and γ1p\gamma\geq 1-p

f(p,γ)f(p,1p)=(1p)(2p26p+5)(32p)4234.f(p,\gamma)\geq f(p,1-p)=\frac{(1-p)\left(2p^{2}-6p+5\right)}{(3-2p)^{4}}\geq\frac{2}{3^{4}}.

So, the constant satisfies 234γ2352\cdot 3^{-4}\gamma^{2}\geq 3^{-5}, which proves (38). ∎
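
The closed-form expression for ν^π(1,a1)^2 and the uniform lower bound on f used above can be spot-checked numerically; the grid below is coarse and purely illustrative, and is not part of the formal argument.

```python
import numpy as np

def f(p, gamma):
    # f(p, gamma) = (1 - gamma)^2 * p / gamma^2 * nu^2(1, a1), via the closed form above.
    num = p**2 * (1 - p) * (1 - 2 * gamma * (1 - p) + gamma**2 * (1 - 2 * p + 2 * p**2))
    return num / (1 - gamma + 2 * gamma * p)**4

# Feasible region in the proof: p in (0, 0.1] and gamma in [max(0.9, 1 - p), 1).
lo = min(f(p, g)
         for p in np.linspace(1e-4, 0.1, 100)
         for g in np.linspace(max(0.9, 1 - p), 1 - 1e-6, 100))
print(lo, 2 / 3**4)   # the grid minimum should stay above 2/81
```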

C.2 Proof of Theorem 6

Proof.

By (30), we have that for any ¯𝒞(γ,tminorize)\bar{\mathcal{M}}\in\mathcal{C}(\gamma,t_{\mathrm{minorize}})

𝔐n(γ,tminorize)infq~nmax{γ,tminorize,¯}nEnq~nq().\mathfrak{M}_{n}(\gamma,t_{\mathrm{minorize}})\geq\inf_{\tilde{q}_{n}}\max_{\mathcal{M}\in\left\{{\mathcal{M}_{\gamma,t_{\mathrm{minorize}}},\bar{\mathcal{M}}}\right\}}\sqrt{n}E_{n}^{\mathcal{M}}\|\tilde{q}_{n}-q^{*}(\mathcal{M})\|_{\infty}.

The rest of the proof directly follows from that of Theorem 1 in Khamaru et al., 2021b (Theorem K1). Here we sketch the key steps and the construction of the instance ¯𝒞(γ,tminorize)\bar{\mathcal{M}}\in\mathcal{C}(\gamma,t_{\mathrm{minorize}}).

First, the proof in Khamaru et al., 2021b uses the standard reduction to testing and Le Cam's lower bound to conclude that for all 1,2\mathcal{M}_{1},\mathcal{M}_{2},

infq^nmax{1,2}nEnq^nq()n4q(1)q(2)(1n2dHel(P11,P12)2).\inf_{\hat{q}_{n}}\max_{\mathcal{M}\in\left\{{\mathcal{M}_{1},\mathcal{M}_{2}}\right\}}\sqrt{n}E_{n}^{\mathcal{M}}\|\hat{q}_{n}-q^{*}(\mathcal{M})\|_{\infty}\geq\frac{\sqrt{n}}{4}\|q^{*}(\mathcal{M}_{1})-q^{*}(\mathcal{M}_{2})\|_{\infty}\left(1-n\sqrt{2}{d_{\mathrm{Hel}}}(P_{1}^{\mathcal{M}_{1}},P_{1}^{\mathcal{M}_{2}})^{2}\right).

Here, dHel{d_{\mathrm{Hel}}} denotes the Hellinger distance. Therefore, if 1=γ,tminorize\mathcal{M}_{1}=\mathcal{M}_{\gamma,t_{\mathrm{minorize}}} and 2=¯\mathcal{M}_{2}=\bar{\mathcal{M}} are such that dHel(P11,P12)(2n)1{d_{\mathrm{Hel}}}(P_{1}^{\mathcal{M}_{1}},P_{1}^{\mathcal{M}_{2}})\leq(2\sqrt{n})^{-1}, then

𝔐n(γ,tminorize)n8q(γ,tminorize)q(¯)\mathfrak{M}_{n}(\gamma,t_{\mathrm{minorize}})\geq\frac{\sqrt{n}}{8}\|q^{*}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})-q^{*}(\bar{\mathcal{M}})\|_{\infty}.

Next, we use the ¯\bar{\mathcal{M}} constructed in Lemma 2 of Khamaru et al., 2021b . Note that in γ,tminorize\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}, every stationary deterministic Markov policy is optimal, and all such policies induce the same PπP_{\pi} and the same νπ(γ,tminorize)\|\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})\|_{\infty}. Let U:=(IγPπ)1U:=(I-\gamma P^{\pi})^{-1}, q=Urq^{*}=Ur, and choose (s¯,a¯)(\bar{s},\bar{a}) s.t. νπ(γ,tminorize)(s¯,a¯)=νπ(γ,tminorize)\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})(\bar{s},\bar{a})=\|\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})\|_{\infty}. Define ¯=(S,A,r,P¯,γ)\bar{\mathcal{M}}=(S,A,r,\bar{P},\gamma) to have the same reward rr as γ,tminorize\mathcal{M}_{\gamma,t_{\mathrm{minorize}}} and transition kernel

P¯s,a(s)=Ps,a(s)+Ps,a(s)Us¯,a¯(s,a)[v(q)(s)(Pπq)(s,a)]2nνπ(γ,tminorize).\bar{P}_{s,a}(s^{\prime})=P_{s,a}(s^{\prime})+\frac{P_{s,a}(s^{\prime})U_{\bar{s},\bar{a}}(s,a)[v(q^{*})(s^{\prime})-(P^{\pi}q^{*})(s,a)]}{\sqrt{2n}\|\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})\|_{\infty}}.

Lemmas 2 and 3 of Khamaru et al., 2021b and the proof of Theorem 5 imply that P¯\bar{P} is a valid transition kernel s.t. dHel(P1γ,tminorize,P1¯)(2n)1{d_{\mathrm{Hel}}}(P_{1}^{\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}},P_{1}^{\bar{\mathcal{M}}})\leq(2\sqrt{n})^{-1} and

q(γ,tminorize)q(¯)cγmaxπΠνπ(γ,tminorize)ctminorize35(1γ)\|q^{*}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})-q^{*}(\bar{\mathcal{M}})\|_{\infty}\geq c\gamma\max_{\pi\in\Pi^{*}}\|\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})\|_{\infty}\geq\frac{c\sqrt{t_{\mathrm{minorize}}}}{3^{5}(1-\gamma)}

when nn satisfies (29), which is implied by the assumption. Therefore, it remains to show that ¯𝒞(γ,tminorize)\bar{\mathcal{M}}\in\mathcal{C}(\gamma,t_{\mathrm{minorize}}). To do this, it is sufficient to show that P¯π\bar{P}_{\pi} is (1,p)(1,p)-Doeblin for all πΠ\pi\in\Pi.

Recall the definition of γ,tminorize\mathcal{M}_{\gamma,t_{\mathrm{minorize}}}: for every policy πΠ\pi\in\Pi,

P¯π(s,s)\displaystyle\bar{P}_{\pi}(s,s^{\prime}) =2peT2(s)+(12p)I(s,s)+Ps,π(s)(s)Us¯,a¯(s,π(s))[v(q)(s)(Pπv(q))(s)]2nνπ(γ,tminorize).\displaystyle=2p\frac{e^{T}}{2}(s^{\prime})+(1-2p)I(s,s^{\prime})+\frac{P_{s,\pi(s)}(s^{\prime})U_{\bar{s},\bar{a}}(s,\pi(s))[v(q^{*})(s^{\prime})-(P_{\pi}v(q^{*}))(s)]}{\sqrt{2n}\|\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})\|_{\infty}}.

Moreover,

sups,s|Ps,π(s)(s)Us¯,a¯(s,π(s))[v(q)(s)(Pπv(q))(s)]||q|span1γtminorize1γ.\sup_{s,s^{\prime}}|P_{s,\pi(s)}(s^{\prime})U_{\bar{s},\bar{a}}(s,\pi(s))[v(q^{*})(s^{\prime})-(P_{\pi}v(q^{*}))(s)]|\leq\frac{\left|q^{*}\right|_{\mathrm{span}}}{1-\gamma}\leq\frac{t_{\mathrm{minorize}}}{1-\gamma}.

By (38) and the choice of nn,

sups,s|Ps,π(s)(s)Us¯,a¯(s,π(s))[v(q)(s)(Pπv(q))(s)]2nνπ(γ,tminorize)|35/2tminorize2np2,\sup_{s,s^{\prime}}\left|\frac{P_{s,\pi(s)}(s^{\prime})U_{\bar{s},\bar{a}}(s,\pi(s))[v(q^{*})(s^{\prime})-(P_{\pi}v(q^{*}))(s)]}{\sqrt{2n}\|\nu^{\pi}(\mathcal{M}_{\gamma,t_{\mathrm{minorize}}})\|_{\infty}}\right|\leq\frac{3^{5/2}\sqrt{t_{\mathrm{minorize}}}}{\sqrt{2n}}\leq\frac{p}{2},

where p=1/tminorizep=1/t_{\mathrm{minorize}}. This completes the proof. ∎

Appendix D Proofs for Section 6

D.1 Proof of Proposition 6.1

Proof.

Recall that qπq^{\pi} is the state-action value function under the policy π\pi:

qπ(s,a)\displaystyle q^{\pi}(s,a) =Es,aπk=0γkr(Sk,Ak)\displaystyle=E_{s,a}^{\pi}\sum_{k=0}^{\infty}\gamma^{k}r(S_{k},A_{k})
=r(s,a)+Es,aπk=1γkr(Sk,Ak)\displaystyle=r(s,a)+E_{s,a}^{\pi}\sum_{k=1}^{\infty}\gamma^{k}r(S_{k},A_{k})
=r(s,a)+γEs,aES1πk=0γkrπ(Sk)\displaystyle=r(s,a)+\gamma E_{s,a}E_{S_{1}}^{\pi}\sum_{k=0}^{\infty}\gamma^{k}r_{\pi}(S_{k})

where we use the Markov property applied to the bounded functional kγkr\sum_{k}\gamma^{k}r. Now

ES1πk=0γkr(Sk,Ak)\displaystyle E_{S_{1}}^{\pi}\sum_{k=0}^{\infty}\gamma^{k}r(S_{k},A_{k}) =ES1πj=0k=0m1γmj+krπ(Smj+k)\displaystyle=E_{S_{1}}^{\pi}\sum_{j=0}^{\infty}\sum_{k=0}^{m-1}\gamma^{mj+k}r_{\pi}(S_{mj+k})
=k=0m1γkj=0γmjES1πE[rπ(Smj+k)|mj]\displaystyle=\sum_{k=0}^{m-1}\gamma^{k}\sum_{j=0}^{\infty}\gamma^{mj}E_{S_{1}}^{\pi}E[r_{\pi}(S_{mj+k})|\mathcal{F}_{mj}]
=k=0m1γkES1πj=0γmjESmjrπ(Sk)\displaystyle=\sum_{k=0}^{m-1}\gamma^{k}E_{S_{1}}^{\pi}\sum_{j=0}^{\infty}\gamma^{mj}E_{S_{mj}}r_{\pi}(S_{k})

The last conditional expectation is a bounded function wk(s):=Esrπ(Sk)[0,1]w_{k}(s):=E_{s}r_{\pi}(S_{k})\in[0,1]. Recall that τ1\tau_{1} is the first regeneration time of the split chain. Then

ES1πk=0γkrπ(Sk)\displaystyle E_{S_{1}}^{\pi}\sum_{k=0}^{\infty}\gamma^{k}r_{\pi}(S_{k}) =k=0m1γkES1πj=0γmjwk(Smj)\displaystyle=\sum_{k=0}^{m-1}\gamma^{k}E_{S_{1}}^{\pi}\sum_{j=0}^{\infty}\gamma^{mj}w_{k}(S_{mj})
=k=0m1γkES1π(j=0τ11γmjwk(Smj)+j=τ1γmjwk(Smj))\displaystyle=\sum_{k=0}^{m-1}\gamma^{k}E_{S_{1}}^{\pi}\left(\sum_{j=0}^{\tau_{1}-1}\gamma^{mj}w_{k}(S_{mj})+\sum_{j=\tau_{1}}^{\infty}\gamma^{mj}w_{k}(S_{mj})\right)
=k=0m1γkES1π(j=0τ11γmjwk(Smj)+E[j=τ1γmjwk(Smj)|τ])\displaystyle=\sum_{k=0}^{m-1}\gamma^{k}E_{S_{1}}^{\pi}\left(\sum_{j=0}^{\tau_{1}-1}\gamma^{mj}w_{k}(S_{mj})+E\left[\sum_{j=\tau_{1}}^{\infty}\gamma^{mj}w_{k}(S_{mj})\middle|\mathcal{F}_{\tau}\right]\right)
=k=0m1γkES1π(j=0γmjwk(Smj)𝟙{τ1>j}+Eψj=0γmjwk(Smj))\displaystyle=\sum_{k=0}^{m-1}\gamma^{k}E_{S_{1}}^{\pi}\left(\sum_{j=0}^{\infty}\gamma^{mj}w_{k}(S_{mj})\mathds{1}\left\{{\tau_{1}>j}\right\}+E_{\psi}\sum_{j=0}^{\infty}\gamma^{mj}w_{k}(S_{mj})\right)

where the strong Markov property applies because wkw_{k} is bounded, and ψ\psi denotes the distribution of the split chain at regeneration times. Note that the second term is a constant. So, if we let

c:=k=0m1γkEψj=0γmjwk(Smj),c:=\sum_{k=0}^{m-1}\gamma^{k}E_{\psi}\sum_{j=0}^{\infty}\gamma^{mj}w_{k}(S_{mj}),

then qπq^{\pi} can be written as

qπ(s,a)\displaystyle q^{\pi}(s,a) =r(s,a)+γc+γEs,ak=0m1γkES1πj=0γmjwk(Smj)𝟙{τ>j}.\displaystyle=r(s,a)+\gamma c+\gamma E_{s,a}\sum_{k=0}^{m-1}\gamma^{k}E_{S_{1}}^{\pi}\sum_{j=0}^{\infty}\gamma^{mj}w_{k}(S_{mj})\mathds{1}\left\{{\tau>j}\right\}.

Since wk()[0,1]w_{k}(\cdot)\in[0,1],

0qπ(s,a)γc\displaystyle 0\leq q^{\pi}(s,a)-\gamma c 1+γEs,ak=0m1γkES1πj=0γmj𝟙{τ>j}\displaystyle\leq 1+\gamma E_{s,a}\sum_{k=0}^{m-1}\gamma^{k}E_{S_{1}}^{\pi}\sum_{j=0}^{\infty}\gamma^{mj}\mathds{1}\left\{{\tau>j}\right\}
=1+γEs,ak=0m1γkj=0γmjPS1π(τ>j)\displaystyle=1+\gamma E_{s,a}\sum_{k=0}^{m-1}\gamma^{k}\sum_{j=0}^{\infty}\gamma^{mj}P_{S_{1}}^{\pi}(\tau>j)
=1+γEs,ak=0m1γkj=0γmj(1p)j\displaystyle=1+\gamma E_{s,a}\sum_{k=0}^{m-1}\gamma^{k}\sum_{j=0}^{\infty}\gamma^{mj}(1-p)^{j}
=1+γ(1γm)(1γ)(1(1p)γm).\displaystyle=1+\frac{\gamma(1-\gamma^{m})}{(1-\gamma)(1-(1-p)\gamma^{m})}.

Notice that

γ(1γm)(1γ)=γk=0m1γkmγ;1(1(1p)γm)1p.\frac{\gamma(1-\gamma^{m})}{(1-\gamma)}=\gamma\sum_{k=0}^{m-1}\gamma^{k}\leq m\gamma;\qquad\frac{1}{(1-(1-p)\gamma^{m})}\leq\frac{1}{p}. (39)
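Both elementary facts in (39) can be spot-checked numerically. A minimal sketch (ours), sampling random values of gamma, p, and m:

import numpy as np

rng = np.random.default_rng(0)
for _ in range(1000):
    gamma = rng.uniform(0.5, 0.999)
    p = rng.uniform(0.01, 0.5)
    m = int(rng.integers(1, 20))
    # gamma*(1 - gamma^m)/(1 - gamma) = gamma*(1 + gamma + ... + gamma^(m-1)) <= m*gamma
    assert gamma*(1 - gamma**m)/(1 - gamma) <= m*gamma + 1e-9
    # 1 - (1 - p)*gamma^m >= 1 - (1 - p) = p, hence the reciprocal is at most 1/p
    assert 1/(1 - (1 - p)*gamma**m) <= 1/p + 1e-9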

Therefore, we conclude that qπγcm/p+1\|q^{\pi}-\gamma c\|_{\infty}\leq m/p+1. Also, recalling the definition of the span semi-norm ||span\left|\cdot\right|_{\mathrm{span}},

|qπ|span\displaystyle\left|q^{\pi}\right|_{\mathrm{span}} 2qπγc\displaystyle\leq 2\|q^{\pi}-\gamma c\|_{\infty}
2m/p+2\displaystyle\leq 2m/p+2
3m/p\displaystyle\leq 3m/p

where we used that p1/2p\leq 1/2 and m1m\geq 1. ∎
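As an illustration of the resulting span bound, the following minimal numerical sketch (ours; a randomly generated (1, p)-Doeblin kernel with rewards in [0, 1], i.e., the m = 1 case for a fixed policy) checks that the value function has span at most 3m/p:

import numpy as np

rng = np.random.default_rng(1)
n_states, p, gamma = 20, 0.3, 0.99

# every row of P dominates p * psi, so P is a (1, p)-Doeblin kernel
psi = rng.dirichlet(np.ones(n_states))
Q = rng.dirichlet(np.ones(n_states), size=n_states)
P = p*psi[None, :] + (1 - p)*Q
r = rng.uniform(0.0, 1.0, size=n_states)   # rewards in [0, 1]

# value function of the fixed policy and its span
v = np.linalg.solve(np.eye(n_states) - gamma*P, r)
assert v.max() - v.min() <= 3*1/p          # the bound 3*m/p with m = 1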

D.2 Proof of Proposition 6.2

Proof.

Define Tj+1=τj+1τjT_{j+1}=\tau_{j+1}-\tau_{j}. By uniform regeneration, {Tj+1}=𝑑{mGi}\left\{{T_{j+1}}\right\}\overset{d}{=}\left\{{mG_{i}}\right\} where the GiG_{i}\simGeo(p)(p) are i.i.d. So,

EγcTi=Eexp(cmlog(γ)G)=pγcm1(1p)γcm=:χ(c).E\gamma^{cT_{i}}=E\exp(cm\log(\gamma)G)=\frac{p\gamma^{cm}}{1-(1-p)\gamma^{cm}}=:\chi(c).
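The closed form for χ(c) is just the generating function of a geometric random variable evaluated at gamma^{cm}. A minimal Monte Carlo sketch (ours, with arbitrary parameter values):

import numpy as np

rng = np.random.default_rng(2)
p, m, gamma, c = 0.2, 3, 0.95, 1.0
G = rng.geometric(p, size=10**6)                    # Geo(p) supported on {1, 2, ...}
chi = p*gamma**(c*m)/(1 - (1 - p)*gamma**(c*m))     # the closed form above
assert abs(np.mean(gamma**(c*m*G)) - chi) < 1e-2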

Next, we expand the variance:

Ψπ(s,a)\displaystyle\Psi^{\pi}(s,a) =Vars,aπ(k=0γkrπ(Sk))\displaystyle=\mathrm{Var}_{s,a}^{\pi}\left(\sum_{k=0}^{\infty}\gamma^{k}r_{\pi}(S_{k})\right)
=Vars,aπ(j=0k=τjτj+11γkrπ(Sk))\displaystyle=\mathrm{Var}_{s,a}^{\pi}\left(\sum_{j=0}^{\infty}\sum_{k={\tau_{j}}}^{\tau_{j+1}-1}\gamma^{k}r_{\pi}(S_{k})\right)
2Vars,aπ(k=0τ11γkrπ(Sk))+2Vars,aπ(j=1k=τjτj+11γkrπ(Sk)).\displaystyle\leq 2\mathrm{Var}_{s,a}^{\pi}\left(\sum_{k=0}^{\tau_{1}-1}\gamma^{k}r_{\pi}(S_{k})\right)+2\mathrm{Var}_{s,a}^{\pi}\left(\sum_{j=1}^{\infty}\sum_{k={\tau_{j}}}^{\tau_{j+1}-1}\gamma^{k}r_{\pi}(S_{k})\right).

We split the second term using the regeneration cycles. Let Wj+1=(Sτj,,Sτj+11,Tj+1)W_{j+1}=(S_{\tau_{j}},\dots,S_{\tau_{j+1}-1},T_{j+1}), 𝒞k=σ(Wj,jk)\mathcal{C}_{k}=\sigma(W_{j},j\leq k) and

g(Wj+1)=k=0Tj+11γkrπ(Sτj+k).g(W_{j+1})=\sum_{k=0}^{T_{j+1}-1}\gamma^{k}r_{\pi}(S_{\tau_{j}+k}).

Note that

Eg(Wj+1)\displaystyle Eg(W_{j+1}) Ek=0Tj+11γk\displaystyle\leq E\sum_{k=0}^{T_{j+1}-1}\gamma^{k} (40)
=1EγTj+11γ\displaystyle=\frac{1-E\gamma^{T_{j+1}}}{1-\gamma}
=1χ(1)1γ\displaystyle=\frac{1-\chi(1)}{1-\gamma}
=1γm(1γ)(1(1p)γm).\displaystyle=\frac{1-\gamma^{m}}{(1-\gamma)(1-(1-p)\gamma^{m})}.
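The last equality is elementary algebra in gamma^m. A minimal sympy check (ours), with the symbol x standing in for gamma^m:

import sympy as sp

p, g, x = sp.symbols('p gamma x', positive=True)    # x stands in for gamma**m
chi1 = p*x/(1 - (1 - p)*x)                          # chi(1) written with gamma**m = x
lhs = (1 - chi1)/(1 - g)
rhs = (1 - x)/((1 - g)*(1 - (1 - p)*x))
assert sp.simplify(lhs - rhs) == 0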

Recall that the sequence of random elements {Wj+1}\left\{{W_{j+1}}\right\} is 1-dependent; in particular, Wj+2WjW_{j+2}\perp W_{j}. Then

Vars,aπ(j=1k=τjτj+11γkrπ(Sk))\displaystyle\mathrm{Var}_{s,a}^{\pi}\left(\sum_{j=1}^{\infty}\sum_{k={\tau_{j}}}^{\tau_{j+1}-1}\gamma^{k}r_{\pi}(S_{k})\right) =Vars,aπ(j=1γτjg(Wj+1))\displaystyle=\mathrm{Var}_{s,a}^{\pi}\left(\sum_{j=1}^{\infty}\gamma^{\tau_{j}}g(W_{j+1})\right)
=j=1Vars,aπ(γτjg(Wj+1))+2i=1j=1i2Cov(γτig(Wi+1),γτjg(Wj+1))\displaystyle=\sum_{j=1}^{\infty}\mathrm{Var}_{s,a}^{\pi}\left(\gamma^{\tau_{j}}g(W_{j+1})\right)+2\sum_{i=1}^{\infty}\sum_{j=1}^{i-2}\mathrm{Cov}(\gamma^{\tau_{i}}g(W_{i+1}),\gamma^{\tau_{j}}g(W_{j+1}))
+2i=1Cov(γτig(Wi+1),γτi+1g(Wi+2))\displaystyle\qquad+2\sum_{i=1}^{\infty}\mathrm{Cov}(\gamma^{\tau_{i}}g(W_{i+1}),\gamma^{\tau_{i+1}}g(W_{i+2}))
=:V1+2V2+2V3\displaystyle=:V_{1}+2V_{2}+2V_{3}

The first term can be bounded by the second moment

V1\displaystyle V_{1} =j=1Vars,aπ(γτjg(Wj+1))\displaystyle=\sum_{j=1}^{\infty}\mathrm{Var}_{s,a}^{\pi}\left(\gamma^{\tau_{j}}g(W_{j+1})\right)
j=1Eγ2τjg(Wj+1)2\displaystyle\leq\sum_{j=1}^{\infty}E\gamma^{2\tau_{j}}g(W_{j+1})^{2}
1(1γ)2j=1Eγ2τj(1γTj+1)2\displaystyle\leq\frac{1}{(1-\gamma)^{2}}\sum_{j=1}^{\infty}E\gamma^{2\tau_{j}}(1-\gamma^{T_{j+1}})^{2}
1(1γ)2j=1χ(2)j(12χ(1)+χ(2))\displaystyle\leq\frac{1}{(1-\gamma)^{2}}\sum_{j=1}^{\infty}\chi(2)^{j}(1-2\chi(1)+\chi(2))
pγ2m(1γm)(1+(1p)γm)(1γ)2(1+γm)(1(1p)γm)(1(1p)γ2m)\displaystyle\leq\frac{p\gamma^{2m}(1-\gamma^{m})(1+(1-p)\gamma^{m})}{(1-\gamma)^{2}(1+\gamma^{m})(1-(1-p)\gamma^{m})(1-(1-p)\gamma^{2m})}
2m(1γ)p\displaystyle\leq\frac{2m}{(1-\gamma)p}

where the last line follows from (39). Similarly, the third term can be bounded via the Cauchy-Schwarz inequality and the same second moment bound.

V3\displaystyle V_{3} =i=1Cov(γτig(Wi+1),γτi+1g(Wi+2))\displaystyle=\sum_{i=1}^{\infty}\mathrm{Cov}(\gamma^{\tau_{i}}g(W_{i+1}),\gamma^{\tau_{i+1}}g(W_{i+2}))
i=1Var(γτig(Wi+1))Var(γτi+1g(Wi+2))\displaystyle\leq\sum_{i=1}^{\infty}\sqrt{\mathrm{Var}(\gamma^{\tau_{i}}g(W_{i+1}))\mathrm{Var}(\gamma^{\tau_{i+1}}g(W_{i+2}))}
1(1γ)2i=1Eγ2τi(1γTi+1)2\displaystyle\leq\frac{1}{(1-\gamma)^{2}}\sum_{i=1}^{\infty}E\gamma^{2\tau_{i}}(1-\gamma^{T_{i+1}})^{2}
2m(1γ)p.\displaystyle\leq\frac{2m}{(1-\gamma)p}.
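The geometric series summation and the final bound by 2m/((1-gamma)p), which are used for both of the last two terms above, can be spot-checked numerically. A minimal sketch (ours) over random values of gamma, p, and m, with chi as defined earlier:

import numpy as np

rng = np.random.default_rng(3)
for _ in range(1000):
    gamma = rng.uniform(0.5, 0.999)
    p = rng.uniform(0.01, 0.5)
    m = int(rng.integers(1, 20))
    a = gamma**m
    chi1 = p*a/(1 - (1 - p)*a)
    chi2 = p*a**2/(1 - (1 - p)*a**2)
    # sum_{j>=1} chi(2)^j (1 - 2 chi(1) + chi(2)) / (1 - gamma)^2, summed in closed form
    series = chi2/(1 - chi2)*(1 - 2*chi1 + chi2)/(1 - gamma)**2
    closed = p*a**2*(1 - a)*(1 + (1 - p)*a) \
        / ((1 - gamma)**2*(1 + a)*(1 - (1 - p)*a)*(1 - (1 - p)*a**2))
    assert abs(series - closed) <= 1e-6*max(1.0, closed)
    assert closed <= 2*m/((1 - gamma)*p) + 1e-9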

For the middle term,

Cov(γτig(Wi+1),γτjg(Wj+1))=E[γ2τjγTj+1g(Wj+1)γτiτj+1g(Wi+1)]E[γτjg(Wj+1)]E[γτig(Wi+1)].\displaystyle\mathrm{Cov}(\gamma^{\tau_{i}}g(W_{i+1}),\gamma^{\tau_{j}}g(W_{j+1}))=E[\gamma^{2\tau_{j}}\gamma^{T_{j+1}}g(W_{j+1})\gamma^{\tau_{i}-\tau_{j+1}}g(W_{i+1})]-E[\gamma^{\tau_{j}}g(W_{j+1})]E[\gamma^{\tau_{i}}g(W_{i+1})].

We defer the proof of the following lemma to Section D.2.1.

Lemma 8.
Cov(γTj+1,g(Wj+1))0.\mathrm{Cov}(\gamma^{T_{j+1}},g(W_{j+1}))\leq 0.

By Lemma 8, EγTj+1g(Wj+1)EγTj+1Eg(Wj+1)E\gamma^{T_{j+1}}g(W_{j+1})\leq E\gamma^{T_{j+1}}Eg(W_{j+1}). Also, we use the regeneration property: for ji2j\leq i-2, Wi+1𝒞j+1W_{i+1}\perp\mathcal{C}_{j+1}. We have

E[γ2τjγTj+1g(Wj+1)γτiτj+1g(Wi+1)]\displaystyle E[\gamma^{2\tau_{j}}\gamma^{T_{j+1}}g(W_{j+1})\gamma^{\tau_{i}-\tau_{j+1}}g(W_{i+1})] =E[γ2τjγTj+1g(Wj+1)E[γτiτj+1g(Wi+1)|𝒞j+1]]\displaystyle=E[\gamma^{2\tau_{j}}\gamma^{T_{j+1}}g(W_{j+1})E[\gamma^{\tau_{i}-\tau_{j+1}}g(W_{i+1})|\mathcal{C}_{j+1}]]
=E[γ2τjγTj+1g(Wj+1)]E[γτiτj+1g(Wi+1)]\displaystyle=E[\gamma^{2\tau_{j}}\gamma^{T_{j+1}}g(W_{j+1})]E[\gamma^{\tau_{i}-\tau_{j+1}}g(W_{i+1})]
E[γ2τj]E[γTj+1]E[g(Wj+1)]E[γτiτj+1g(Wi+1)]\displaystyle\leq E[\gamma^{2\tau_{j}}]E[\gamma^{T_{j+1}}]E[g(W_{j+1})]E[\gamma^{\tau_{i}-\tau_{j+1}}g(W_{i+1})]
=χ(2)jχ(1)ijE[g(Wj+1)]E[g(Wi+1)].\displaystyle=\chi(2)^{j}\chi(1)^{i-j}E[g(W_{j+1})]E[g(W_{i+1})].

Then, the covariance term

V2\displaystyle V_{2} =i=1j=1i2Cov(γτig(Wi+1),γτjg(Wj+1))\displaystyle=\sum_{i=1}^{\infty}\sum_{j=1}^{i-2}\mathrm{Cov}(\gamma^{\tau_{i}}g(W_{i+1}),\gamma^{\tau_{j}}g(W_{j+1}))
i=1j=1i2(χ(2)jχ(1)ijχ(1)jχ(1)i)E[g(Wj+1)]E[g(Wi+1)]\displaystyle\leq\sum_{i=1}^{\infty}\sum_{j=1}^{i-2}(\chi(2)^{j}\chi(1)^{i-j}-\chi(1)^{j}\chi(1)^{i})E[g(W_{j+1})]E[g(W_{i+1})]
(1p)p3γ4m(1γm)(1+γm)(1(1p)γm)(1(12p)γm)(1γm(1γ)(1(1p)γm))2\displaystyle\leq\frac{(1-p)p^{3}\gamma^{4m}}{(1-\gamma^{m})(1+\gamma^{m})(1-(1-p)\gamma^{m})(1-(1-2p)\gamma^{m})}\left(\frac{1-\gamma^{m}}{(1-\gamma)(1-(1-p)\gamma^{m})}\right)^{2}
2(1γm)p3(1γ)2p(2p)p2\displaystyle\leq\frac{2(1-\gamma^{m})p^{3}}{(1-\gamma)^{2}p(2p)p^{2}}
m(1γ)p\displaystyle\leq\frac{m}{(1-\gamma)p}

where we used (40). We conclude that

Ψπ(s,a)\displaystyle\Psi^{\pi}(s,a) 2Vars,aπ(k=0τ11γkrπ(Sk))+2(V1+2V2+2V3)\displaystyle\leq 2\mathrm{Var}_{s,a}^{\pi}\left(\sum_{k=0}^{\tau_{1}-1}\gamma^{k}r_{\pi}(S_{k})\right)+2(V_{1}+2V_{2}+2V_{3})
21+χ(2)2χ(1)(1γ)2+16m(1γ)p\displaystyle\leq 2\frac{1+\chi(2)-2\chi(1)}{(1-\gamma)^{2}}+\frac{16m}{(1-\gamma)p}
2(1γm)2(1+(1p)γm)(1γ)2(1(1p)γm)(1(1p)γ2m)+16m(1γ)p\displaystyle\leq 2\frac{(1-\gamma^{m})^{2}(1+(1-p)\gamma^{m})}{(1-\gamma)^{2}(1-(1-p)\gamma^{m})(1-(1-p)\gamma^{2m})}+\frac{16m}{(1-\gamma)p}
4m2p2+16m(1γ)p\displaystyle\leq\frac{4m^{2}}{p^{2}}+\frac{16m}{(1-\gamma)p}
20m(1γ)p\displaystyle\leq\frac{20m}{(1-\gamma)p}
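As an illustration, the following minimal Monte Carlo sketch (ours; a randomly generated (1, p)-Doeblin kernel with rewards in [0, 1], m = 1, and arbitrary parameters) checks that the simulated variance of the discounted reward stays below the bound 20m/((1-gamma)p):

import numpy as np

rng = np.random.default_rng(4)
n_states, p, gamma, m = 10, 0.3, 0.95, 1
psi = rng.dirichlet(np.ones(n_states))
Q = rng.dirichlet(np.ones(n_states), size=n_states)
P = p*psi[None, :] + (1 - p)*Q                  # (1, p)-Doeblin kernel
r = rng.uniform(size=n_states)                  # rewards in [0, 1]

horizon, n_paths = 300, 500                     # gamma^300 is negligible
totals = np.zeros(n_paths)
for i in range(n_paths):
    s, disc = 0, 1.0
    for _ in range(horizon):
        totals[i] += disc*r[s]                  # accumulate the discounted reward
        disc *= gamma
        s = rng.choice(n_states, p=P[s])        # next state under the Doeblin kernel

assert totals.var() <= 20*m/((1 - gamma)*p)     # the bound of Proposition 6.2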

D.2.1 Proof of Lemma 8

Proof.

We simplify the notation by considering a new probability space supporting a cycle {Sk,k=0,,T}\left\{{S_{k},k=0,\dots,T}\right\} of the split chain, starting from initial distribution S0ψS_{0}\sim\psi and with cycle length TT. We use the following property of the split chain: let {Yk}\left\{{Y_{k}}\right\} be an independent process on this probability space whose mm-skeleton has transition kernel RR and which is interpolated by PπP_{\pi}. Then, the random element (S0,,STm,T)(S_{0},\dots,S_{T-m},T) has the same distribution as (Y0,,YTm,T)(Y_{0},\dots,Y_{T-m},T).

Write

Cov(γT,g(W))\displaystyle\mathrm{Cov}(\gamma^{T},g(W)) =Cov(γT,k=0Tm1γkrπ(Sk)+k=TmT1γkrπ(Sk))\displaystyle=\mathrm{Cov}(\gamma^{T},\sum_{k=0}^{T-m-1}\gamma^{k}r_{\pi}(S_{k})+\sum_{k=T-m}^{T-1}\gamma^{k}r_{\pi}(S_{k}))
=Cov(γT,k=0Tm1γkrπ(Yk))+Cov(γT,k=TmT1γkrπ(Sk))\displaystyle=\mathrm{Cov}(\gamma^{T},\sum_{k=0}^{T-m-1}\gamma^{k}r_{\pi}(Y_{k}))+\mathrm{Cov}(\gamma^{T},\sum_{k=T-m}^{T-1}\gamma^{k}r_{\pi}(S_{k}))

First, we handle the last mm-segment.

EγTk=TmT1γkrπ(Sk)\displaystyle E\gamma^{T}\sum_{k=T-m}^{T-1}\gamma^{k}r_{\pi}(S_{k}) =EE[γTk=TmT1γkrπ(Sk)|T,STm]\displaystyle=EE\left[\gamma^{T}\sum_{k=T-m}^{T-1}\gamma^{k}r_{\pi}(S_{k})\middle|T,S_{T-m}\right]
=Eγ2TmE[k=0m1γkrπ(STm+k)|T,STm]\displaystyle=E\gamma^{2T-m}E\left[\sum_{k=0}^{m-1}\gamma^{k}r_{\pi}(S_{T-m+k})\middle|T,S_{T-m}\right]
=Eγ2Tmf~(STm)\displaystyle=E\gamma^{2T-m}\tilde{f}(S_{T-m})
=Eγ2Tmf~(YTm)\displaystyle=E\gamma^{2T-m}\tilde{f}(Y_{T-m})

for some non-negative and deterministic function f~\tilde{f}. Here we used the following property of the split chain: given that the coin toss is successful and given the current state STmS_{T-m}, the path {STm+1,,ST1}\left\{{S_{T-m+1},\dots,S_{T-1}}\right\} is generated conditional on the independently sampled STψS_{T}\sim\psi and on STmS_{T-m}. Similarly,

EγTEk=TmT1γkrπ(Sk)=EγTEγTmf~(YTm).E\gamma^{T}E\sum_{k=T-m}^{T-1}\gamma^{k}r_{\pi}(S_{k})=E\gamma^{T}E\gamma^{T-m}\tilde{f}(Y_{T-m}).

Therefore,

Cov(γT,g(W))=Cov(γT,k=0Tm1γkrπ(Yk)+γTmf~(YTm))\mathrm{Cov}(\gamma^{T},g(W))=\mathrm{Cov}\left(\gamma^{T},\sum_{k=0}^{T-m-1}\gamma^{k}r_{\pi}(Y_{k})+\gamma^{T-m}\tilde{f}(Y_{T-m})\right)

Define h(T):=γTEγTh(T):=\gamma^{T}-E\gamma^{T} and the bounded functional f:m×f:m\mathbb{N}\times\mathbb{R}^{\mathbb{N}}\rightarrow\mathbb{R}

f(T,Y):=k=0Tm1γkrπ(Yk)+γTmf~(YTm)E(k=0Tm1γkrπ(Yk)+γTmf~(YTm)).f(T,Y):=\sum_{k=0}^{T-m-1}\gamma^{k}r_{\pi}(Y_{k})+\gamma^{T-m}\tilde{f}(Y_{T-m})-E\left(\sum_{k=0}^{T-m-1}\gamma^{k}r_{\pi}(Y_{k})+\gamma^{T-m}\tilde{f}(Y_{T-m})\right).

Notice that f(,Y)f(\cdot,Y) is non-decreasing in TT and hh is decreasing. Let TmT^{\prime}\sim mGeo(p)(p) be independent of (T,Y)(T,Y), and consider

0\displaystyle 0 E(f(T,Y)f(T,Y))(h(T)h(T))\displaystyle\geq E(f(T,Y)-f(T^{\prime},Y))(h(T)-h(T^{\prime}))
=Ef(T,Y)h(T)+Ef(T,Y)h(T)Ef(T,Y)Eh(T)Ef(T,Y)Eh(T)\displaystyle=Ef(T,Y)h(T)+Ef(T^{\prime},Y)h(T^{\prime})-Ef(T,Y)Eh(T^{\prime})-Ef(T^{\prime},Y)Eh(T)
=2Cov(γT,g(W))\displaystyle=2\mathrm{Cov}(\gamma^{T},g(W))

where the last equality follows from (Y,T)=𝑑(Y,T)(Y,T)\overset{d}{=}(Y,T^{\prime}) and Eh(T)=0Eh(T)=0. ∎

D.3 Proof of Corollary 6.2.1

Proof.

By Theorem UE and the definition of tminorize(Pπ)t_{\mathrm{minorize}}(P_{\pi}), for any ϵ>0\epsilon>0, there exists (m,p)(m,p) s.t. PπP_{\pi} is (m,p)(m,p)-Doeblin and tminorize(Pπ)+ϵm/pt_{\mathrm{minorize}}(P_{\pi})+\epsilon\geq m/p. From Azar et al., (2013), we have that (31) holds, along with the inequality

(IγPπ)1σπt1γt(Iγ2Pπ)1(σπ)2.\|(I-\gamma P^{\pi})^{-1}\sigma^{\pi}\|_{\infty}\leq\frac{\sqrt{t}}{1-\gamma^{t}}\sqrt{\|(I-\gamma^{2}P^{\pi})^{-1}(\sigma^{\pi})^{2}\|_{\infty}}.

Now, we choose t=log2/logγ2/(1γ)t=-\log 2/\log\gamma\leq 2/(1-\gamma) to conclude that

(IγPπ)1σπ\displaystyle\|(I-\gamma P^{\pi})^{-1}\sigma^{\pi}\|_{\infty} 2tΨπ\displaystyle\leq 2\sqrt{t}\sqrt{\|\Psi^{\pi}\|_{\infty}}
80m/p(1γ)\displaystyle\leq\frac{80\sqrt{m/p}}{(1-\gamma)}
80tminorize(Pπ)+ϵ(1γ)\displaystyle\leq\frac{80\sqrt{t_{\mathrm{minorize}}(P_{\pi})+\epsilon}}{(1-\gamma)}

where we used Proposition 6.2 to bound Ψπ\|\Psi^{\pi}\|_{\infty}. Since ϵ>0\epsilon>0 is arbitrary, this implies the corollary. ∎
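The choice of t above uses only that gamma^t = 1/2 and t <= 2/(1-gamma). A minimal numerical check (ours):

import numpy as np

for gamma in np.linspace(0.5, 0.999, 50):
    t = -np.log(2)/np.log(gamma)
    assert abs(gamma**t - 0.5) < 1e-9       # gamma^t = 1/2, so sqrt(t)/(1 - gamma^t) = 2*sqrt(t)
    assert t <= 2/(1 - gamma) + 1e-9        # since -log(gamma) >= 1 - gamma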