
On the Second-Order Convergence of Biased Policy Gradient Algorithms

Siqiao Mu    Diego Klabjan
Abstract

Since the objective functions of reinforcement learning problems are typically highly nonconvex, it is desirable that policy gradient, the most popular algorithm, escapes saddle points and arrives at second-order stationary points. Existing results only consider vanilla policy gradient algorithms with unbiased gradient estimators, but practical implementations under the infinite-horizon discounted reward setting are biased due to finite-horizon sampling. Moreover, actor-critic methods, whose second-order convergence has not yet been established, are also biased due to the critic approximation of the value function. We provide a novel second-order analysis of biased policy gradient methods, including the vanilla gradient estimator computed from Monte-Carlo sampling of trajectories as well as the double-loop actor-critic algorithm, where in the inner loop the critic improves the approximation of the value function via TD(0) learning. Separately, we also establish the convergence of TD(0) on Markov chains irrespective of initial state distribution.


1 Introduction

In the standard reinforcement learning framework, an agent interacts with an environment according to some policy, which dictates the best actions to take given the state of the environment. The ultimate goal of the agent is to adopt a policy that maximizes some measure of cumulative reward. To efficiently search for the optimal policy, policy gradient methods optimize a policy parameter $\theta$ through updates that approximate the gradient of the objective function with respect to $\theta$. These algorithms can be fast and flexible, and under mild assumptions, they share convergence properties with mainstream gradient descent algorithms.

The policy gradient can be estimated by way of the policy gradient theorem, which enables the straightforward application of existing techniques in gradient-based optimization towards understanding PG convergence. The theorem provides a formula for the exact gradient of the objective function as the expectation of the state-action value function over the discounted state-action measure (Sutton et al., 1999). In practice, the gradient is estimated through two common approaches: the “vanilla” approach, where the gradient is approximated via Monte-Carlo sampling, or the “actor-critic” approach, where the policy parameter $\theta$, called the “actor parameter,” is updated simultaneously alongside a “critic parameter” $w$, which parametrizes the state-action value function. The critic parameter is typically updated via bootstrapping temporal difference methods such as TD(0), although the actor updates may also be bootstrapped (Konda & Tsitsiklis, 1999; Bhatnagar et al., 2009).

Although it is well-established that policy gradient converges to first-order stationary points where the norm of the gradient is approximately zero (Sutton et al., 1999; Yuan et al., 2022; Agarwal et al., 2020), it is of interest whether policy gradient algorithms yield second-order stationary points (local maxima) as opposed to saddle points. This is because the function landscape of RL problems can be highly nonconvex and features suboptimal stationary points even in very simple examples (Agarwal et al., 2020; Bhandari & Russo, 2019; Zhang et al., 2020). We can utilize seminal results in nonconvex optimization that show that stochastic gradient descent can leverage randomness to escape saddle points either with added noise (Ge et al., 2015; Jin et al., 2019) or as long as there is a component of noise in the direction of curvature (Daneshmand et al., 2018; Vlaski & Sayed, 2022).

However, unlike stochastic gradient descent, practical implementations of policy gradient are frequently biased. For the discounted infinite-horizon reward objective function, the vanilla policy gradient estimator is biased due to truncation of sampled trajectory horizons (Yuan et al., 2022). In addition, actor-critic algorithms introduce a second source of bias in the critic’s approximation of the value function. This therefore requires novel techniques for controlling and bounding the bias. In comparison, existing works on second-order convergence of policy gradient only consider unbiased gradient estimators; for example, (Zhang et al., 2020) require artificially constructing an unbiased gradient estimator via Q-sampling as well as periodic step size enlargement, and (Yang et al., 2021) assume the gradient estimator is unbiased. Both works also only consider the vanilla policy gradient algorithm and not the actor-critic scheme.

In this paper, we address the aforementioned gap by showing that biased policy gradient methods, including both vanilla policy gradient and actor-critic methods, can escape saddle points. Borrowing from the analyses of (Vlaski & Sayed, 2022) regarding unbiased stochastic gradient descent, we tackle the biased policy gradient setting by showing that the components of the bias can be sufficiently controlled to yield second-order convergence guarantees.

Related Work

Escaping Saddle Points. As referenced earlier, escaping saddle points has become a central research topic in nonconvex optimization in the last few years, with natural extensions to the policy gradient setting. In addition to the seminal works showing the second-order convergence of gradient descent and stochastic gradient descent (Ge et al., 2015; Jin et al., 2019, 2017; Daneshmand et al., 2018; Vlaski & Sayed, 2022), an additional body of work has focused on second-order methods that use Hessian information to arrive at second-order stationary points faster. Some examples of reinforcement learning approaches in this direction include (Shen et al., 2019; Wang et al., 2022; Khorasani et al., 2023). However, these works only consider unbiased gradient estimators. For an in-depth comparison of sample complexities, see Appendix A.2.

Global Convergence. Separate from our line of work, there are several “global convergence” results for policy gradient and actor-critic algorithms that ensure convergence to a global optimum for specific policy parametrization or function structure. For instance, global convergence is established for the following settings: tabular or tabular softmax policy parametrization with exact gradients (Agarwal et al., 2020; Bhandari & Russo, 2021), objective functions that satisfy the gradient domination property (Bhandari & Russo, 2019), linear quadratic and nearly linear-quadratic control systems (Yang et al., 2019; Han et al., 2022), and overparametrized neural networks (Wang et al., 2020; Fu et al., 2020). The premise of “global convergence” also requires some assumption that the proposed policy parametrization can approximate the optimal policy with arbitrary precision, i.e., the optimal policy lies within the policy class. In comparison, our second-order convergence guarantees pertain to finding the best policy parameter $\theta$ under a generic policy parametrization, for policy gradient algorithms with noisy and biased updates. For an in-depth discussion of global convergence rates, see Appendix A.1.

Vanilla Policy Gradient. In this work, we refer to policy gradient algorithms that estimate the gradient via Monte-Carlo sampling of trajectories as “vanilla policy gradient.” Early formulas include REINFORCE (Williams, 2004) as well as GPOMDP (Baxter & Bartlett, 2001), a version of REINFORCE that enjoys reduced variance (Peters & Schaal, 2008) by employing the “reward-to-go” trick. Our work pertains to the GPOMDP estimator. Both REINFORCE and GPOMDP are unbiased estimators of objectives with deterministic, fixed horizons, but they are biased estimators of the infinite-horizon discounted reward due to truncation (Yuan et al., 2022).

Actor-critic Algorithms. The asymptotic convergence of actor-critic algorithms was first established in (Konda & Tsitsiklis, 1999) for gradient actor updates and bootstrapped critic updates, and in (Bhatnagar et al., 2009) for bootstrapped actor and critic updates. Since then, finite-time convergence has been established for a variety of actor-critic frameworks, although to the best of our knowledge no second-order convergence result exists. In this work we consider double-loop actor-critic algorithms where the critic parameter undergoes TD(0) updates in the inner loop and the actor parameter undergoes gradient updates in the outer loop. Several works have shown first-order convergence of these algorithms under various additional settings: (Yang et al., 2018) consider i.i.d. sampling in the actor and the critic, (Qiu et al., 2021) consider i.i.d. policy samples and critic sampling from a stationary Markov chain, and (Xu et al., 2020b) study mini-batch Markovian sampling for controlling the bias error. Separately, (Wang et al., 2020) establish global convergence for double-loop algorithms where the actor and critic functions are compatible overparametrized neural networks. In comparison, we consider linear critic parametrization and arbitrary actor parametrization with Markovian sampling, and we do not require the Markov chain to be stationary.

Recently, two-timescale (Xu et al., 2020a; Wu et al., 2020) and single-timescale (Fu et al., 2020; Olshevsky & Gharesifard, 2023) actor-critic algorithms have shown a slight performance advantage over double-loop algorithms, although existing works still only analyze their first-order or global convergence under specific function parametrizations, while we focus on second-order convergence.

Temporal Difference Algorithms. Actor-critic algorithms typically feature some bootstrapping element; in particular, we consider actor-critic algorithms where the critic updates via TD(0) learning. The finite-time convergence of TD(0) has been established recently under independent and identically distributed sampling (Dalal et al., 2018; Kumar et al., 2019) as well as under Markovian sampling (Liu & Olshevsky, 2020; Bhandari et al., 2018). The latter is more relevant to the RL setting; however, both (Bhandari et al., 2018) and (Liu & Olshevsky, 2020) assume the Markov chain begins in the stationary distribution, which is unrealistic for practical implementations such as in actor-critic algorithms.

Our Contributions

Our contributions are as follows.

  • We provide the first finite-time convergence analysis of vanilla policy gradient with a biased gradient estimator to $\epsilon$-second-order stationary points. This results in a sample complexity of $\tilde{O}(\epsilon^{-6.5})$ iterations, where $\tilde{O}(\cdot)$ hides logarithmic dependencies. We note that this is a stronger result than $\tilde{O}(\epsilon^{-9})$ from (Zhang et al., 2020) and a weaker result than $\tilde{O}(\epsilon^{-4.5})$ from (Yang et al., 2021), both of which only analyze unbiased gradient estimators.

  • We provide the first finite-time convergence analysis of an actor-critic policy gradient algorithm to $\epsilon$-second-order stationary points. We show that our double-loop actor-critic algorithm, where the critic updates via TD(0) and the actor updates via policy gradient, arrives at an $\epsilon$-second-order stationary point in $\tilde{O}(\epsilon^{-6.5})$ outer-loop iterations with $\tilde{O}(\epsilon^{-8})$ inner-loop TD(0) steps. In contrast to existing first-order analyses of actor-critic algorithms, we allow for Markovian sampling in both the actor and the critic.

  • Of separate interest, we provide the first finite-time convergence analysis of the classic TD(0) algorithm on nonstationary Markov chains; i.e., we do not assume that the initial state distribution is the stationary distribution of the Markov chain. This allows realistic analyses of the actor-critic setting, where we have no guarantee that after every policy update the new underlying Markov chain is in its stationary distribution. We show that for $K$ iterations with constant step size $\alpha=\frac{1}{\sqrt{K}}$ and exponential mixing, the algorithm converges at a rate of $O(\frac{1}{\sqrt{K}})+O(\frac{1}{K})$, as opposed to $O(\frac{1}{\sqrt{K}})$ when starting from the stationary distribution.

The structure of the paper is as follows. In Section 2, we formalize the problem and introduce notation used in the rest of the paper. In Section 3, we establish second-order convergence for biased policy gradient estimators and apply our results to vanilla policy gradient. In Section 4, we present our new analysis of the TD(0) algorithm, which is incorporated to bound the critic approximation bias and show second-order convergence of the actor-critic algorithm. Finally, in Section 5, we summarize our results and discuss the next steps of our work.

2 Problem Formulation

2.1 Markov Decision Process

We define the Markov decision process as a quadruple $(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R})$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, $\mathcal{P}(s^{\prime}|s,a)$ is the transition probability from state $s$ to state $s^{\prime}$ by taking action $a$, and $\mathcal{R}(s,a)$ is the reward function for performing action $a$ in state $s$. The agent is trying to learn a stochastic policy $\pi:\mathcal{S}\to\Delta(\mathcal{A})$, where $\Delta(\mathcal{A})$ is the space of probability distributions over $\mathcal{A}$, such that $\pi(a|s)$ is the probability that the agent performs action $a$ given state $s$. As the agent interacts with the environment, it generates a sequence of states, actions, and rewards referred to as a trajectory $\tau=\{s_{0},a_{0},s_{1},a_{1},...\}$. The trajectory is sampled from the probability distribution $p(\cdot|\pi)$, which describes the probability of a trajectory generated by some policy $\pi$, where $a_{k}\sim\pi(\cdot|s_{k})$ and $s_{k+1}\sim\mathcal{P}(\cdot|s_{k},a_{k})$.

We want to learn a policy $\pi$ that maximizes the expected infinite-horizon discounted reward, $J$. For policy gradient algorithms, we parameterize the policy $\pi$ with some parameter $\theta\in\mathbb{R}^{M}$ so that $J$ is a function of $\theta$ to obtain

J(\theta)=\mathbb{E}_{s_{0}\sim\rho_{0}(\cdot),\tau\sim p(\cdot|\pi_{\theta})}\left[\sum_{k=0}^{\infty}\gamma^{k}\mathcal{R}(s_{k},a_{k})\right]

where $\gamma\in(0,1)$ is the discount factor and the expectation is taken with respect to an initial state distribution $s_{0}\sim\rho_{0}(\cdot)$ and a stochastic policy $\pi_{\theta}$ under which trajectories are sampled $\tau\sim p(\cdot|\pi_{\theta})$. The RL problem is to find an optimal policy parameter $\theta^{*}$ such that $\theta^{*}=\arg\max_{\theta}J(\theta)$.

We also define the state value function and state-action value function as $V^{\pi}(s)=\mathbb{E}_{\pi}[\sum_{k=0}^{\infty}\gamma^{k}\mathcal{R}(s_{k},a_{k})|s_{0}=s]$ and $Q^{\pi}(s,a)=\mathbb{E}_{\pi}[\sum_{k=0}^{\infty}\gamma^{k}\mathcal{R}(s_{k},a_{k})|s_{0}=s,a_{0}=a]$ respectively, such that the objective can be alternatively formulated as $J(\theta)=\mathbb{E}_{s\sim\rho_{0}}[V^{\pi_{\theta}}(s)]$. Finally, we reference the discounted state-weighting measure as $d^{\pi_{\theta},\rho_{0}}(s)=\sum_{k=0}^{\infty}\gamma^{k}\mathbb{E}_{\rho_{0}}[Pr(s_{k}=s|s_{0},\pi_{\theta})]$.
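
To make the objective concrete, the following minimal sketch estimates $J(\theta)$ by Monte-Carlo rollouts; the `env.reset`/`env.step` interface and the `policy` callable are hypothetical stand-ins for the MDP and $\pi_{\theta}$, and the truncation at a finite horizon is exactly the source of bias discussed later.

```python
import numpy as np

def estimate_return(env, policy, gamma=0.99, horizon=1000, num_trajectories=100):
    """Monte-Carlo estimate of the discounted objective J(theta).

    `env` and `policy` are hypothetical interfaces: env.reset() -> s_0 drawn
    from rho_0, env.step(a) -> (s', r), and policy(s) samples a ~ pi_theta(.|s).
    """
    returns = []
    for _ in range(num_trajectories):
        s = env.reset()                      # s_0 ~ rho_0
        total, discount = 0.0, 1.0
        for _ in range(horizon):             # truncated surrogate for the infinite sum
            a = policy(s)                    # a_k ~ pi_theta(.|s_k)
            s, r = env.step(a)               # s_{k+1} ~ P(.|s_k, a_k), r = R(s_k, a_k)
            total += discount * r
            discount *= gamma
        returns.append(total)
    return np.mean(returns)                  # approximates J(theta) up to truncation bias
```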

2.2 Policy Gradient Algorithm

Algorithm 1 Biased Policy Gradient Algorithm
  Input: initial policy parameters $\theta_{0}$
  for $t=0,1,2,...,T-1$ do
     Sample a trajectory $\tau_{t}$ of length $H$ under $\pi_{\theta_{t}}$,
$\tau_{t}=\{s_{0},a_{0},s_{1},a_{1},...,s_{H-1},a_{H-1}\}$
     where $s_{0}\sim\rho_{0}(s)$
     Compute $\hat{G}(\theta_{t};\tau_{t})$ via Algorithm 2 or Algorithm 3
     Update $\theta_{t+1}=\theta_{t}+\mu\hat{G}(\theta_{t};\tau_{t})$
  end for

In this work we consider policy gradient algorithms of the form shown in Algorithm 1. We note that these gradient algorithms can be mini-batched for additional variance reduction, but our analysis does not rely on batching and we omit this notation for simplicity. At each time step $t+1$, a random trajectory $\tau_{t}$ is sampled and $\hat{G}(\theta_{t};\tau_{t})$, a biased estimator of the true policy gradient $\nabla J(\theta_{t})$, is computed. Then the policy parameter $\theta$ is updated with step-size $\mu$ as follows

\theta_{t+1}=\theta_{t}+\mu\hat{G}(\theta_{t};\tau_{t}). (1)

Let $G(\theta_{t})=\mathbb{E}_{\tau}[\hat{G}(\theta_{t};\tau)]$, where the expectation is taken with respect to the sampled trajectory $\tau$. Then in order to analyze the convergence of Algorithm 1, we decompose the updates in terms of the noise $\xi_{t+1}$ and bias $d_{t+1}$ to obtain

\theta_{t+1}=\theta_{t}+\mu\nabla J(\theta_{t})+\mu\xi_{t+1}+\mu d_{t+1},

where $\xi_{t+1}=\hat{G}(\theta_{t};\tau_{t})-G(\theta_{t})$ represents a zero-mean noise term induced from sampling the trajectory of length $H$ and $d_{t+1}=G(\theta_{t})-\nabla J(\theta_{t})$ represents the bias induced by the gradient estimator algorithm. Specifically, we denote by $\{\mathcal{F}_{t}\}_{t\geq 0}$ the filtration generated by the random process $\{\theta_{t}\}_{t\geq 0}$ such that $\theta_{t}$ is measurable with respect to $\mathcal{F}_{t}$. Then the gradient noise process $\{\xi_{t}\}_{t\geq 0}$ satisfies

\mathbb{E}[\xi_{t+1}|\mathcal{F}_{t}]=\mathbb{E}[\hat{G}(\theta_{t};\tau_{t})-G(\theta_{t})|\mathcal{F}_{t}]=0.

Although $d_{t+1}$ might be random or deterministic depending on the gradient estimator algorithm, it is generated by a different process than $\xi_{t+1}$. In the sequel, our analyses rely on assumptions on $\xi_{t+1}$ and $d_{t+1}$.
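
For reference, a minimal Python rendering of this update loop follows; the callables `sample_trajectory` and `gradient_estimator` are hypothetical placeholders for the sampling step and for Algorithm 2 or Algorithm 3, and the sketch is illustrative rather than a prescribed implementation.

```python
def biased_policy_gradient(theta0, sample_trajectory, gradient_estimator,
                           mu, num_iterations):
    """Outer loop of Algorithm 1: ascent along a biased gradient estimate.

    `sample_trajectory(theta)` returns a trajectory of length H under
    pi_theta, and `gradient_estimator(theta, tau)` stands in for
    Algorithm 2 (vanilla) or Algorithm 3 (actor-critic).
    """
    theta = theta0
    for _ in range(num_iterations):
        tau = sample_trajectory(theta)          # trajectory sampled under pi_theta
        g_hat = gradient_estimator(theta, tau)  # biased estimate of grad J(theta)
        theta = theta + mu * g_hat              # update (1)
    return theta
```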

2.3 Second-Order Stationary Points

We define second-order stationary points and approximate second-order stationary points as follows (Jin et al., 2017; Nesterov & Polyak, 2006).

Definition 2.1.

For the twice-differentiable function $J(\theta)$, $\theta$ is a second-order stationary point if $\lVert\nabla J(\theta)\rVert=0$ and $\lambda_{max}(\nabla^{2}J(\theta))\leq 0$. In addition, if $\nabla^{2}J(\theta)$ is $\chi$-Lipschitz, $\theta$ is an $\epsilon$-second-order stationary point of $J(\theta)$ if $\lVert\nabla J(\theta)\rVert\leq\epsilon$ and $\lambda_{max}(\nabla^{2}J(\theta))\leq\sqrt{\chi\epsilon}$.

In line with the strict-saddle definition introduced in (Ge et al., 2015) and later used in (Jin et al., 2019, 2017; Daneshmand et al., 2018), we focus on escaping saddle points with at least one strictly positive eigenvalue. We divide the parameter space of the objective function $J(\theta)$ into regions where the gradient is large or small with respect to the step-size $\mu$, which we assume to be a small hyperparameter. We define the following sets

\mathcal{G}=\left\{\theta:\lVert\nabla J(\theta)\rVert^{2}\geq\mu\ell\left(1+\frac{1}{\delta}\right)\right\},
\mathcal{H}=\{\theta:\theta\in\mathcal{G}^{C},\lambda_{max}(\nabla^{2}J(\theta))\geq\omega\},
\mathcal{M}=\{\theta:\theta\in\mathcal{G}^{C},\lambda_{max}(\nabla^{2}J(\theta))<\omega\},

where $\ell>0$ is a parameter depending on the problem and $\delta>0$, $\omega>0$ are parameters depending on the desired accuracy of the algorithm that we choose later on. The set $\mathcal{G}^{C}$ represents approximate first-order stationary points, and within $\mathcal{G}^{C}$, the region $\mathcal{H}$ represents “strict-saddle” points, whereas the region $\mathcal{M}$ represents approximately second-order stationary points. Specifically, points in $\mathcal{M}$ represent the set of local maxima with respect to first- and second-order information.
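
This partition can be checked numerically given first- and second-order information at a point; the sketch below is illustrative only and assumes access to the exact gradient and Hessian of $J$, which in practice are only available approximately.

```python
import numpy as np

def classify_iterate(grad, hessian, mu, ell, delta, omega):
    """Classify a point theta into the regions G, H, M defined above.

    `grad` is the gradient of J at theta and `hessian` its Hessian.
    """
    grad_sq = float(np.dot(grad, grad))
    lam_max = float(np.max(np.linalg.eigvalsh(hessian)))   # largest Hessian eigenvalue
    if grad_sq >= mu * ell * (1.0 + 1.0 / delta):
        return "G"   # large gradient
    if lam_max >= omega:
        return "H"   # approximate strict saddle
    return "M"       # approximate second-order stationary point (local maximum)
```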

3 Second-Order Convergence of Vanilla Policy Gradient

In this section, we establish the second-order convergence of biased policy gradient in general and apply it to vanilla policy gradient. We begin with the original policy gradient theorem (Sutton et al., 1999), which states

\nabla J(\theta)=\sum_{s\in\mathcal{S}}d^{\pi_{\theta},\rho_{0}}(s)\sum_{a\in\mathcal{A}}\pi_{\theta}(a|s)\nabla\log\pi_{\theta}(a|s)Q^{\pi_{\theta}}(s,a),

which is equivalent to the temporal summation

\nabla J(\theta)=\mathbb{E}_{\pi_{\theta},\rho_{0}}\left[\sum_{k=0}^{\infty}\sum_{t=k}^{\infty}\gamma^{t}\nabla\log\pi_{\theta}(a_{k}|s_{k})\mathcal{R}(s_{t},a_{t})\right].

For further details on the connection between these two formulations, see Appendix F.

Through Monte-Carlo sampling of trajectories $\tau$ of fixed length $H$, we construct the GPOMDP gradient estimator (Baxter & Bartlett, 2001) as follows

\hat{G}^{VPG}(\theta;\tau)=\sum_{h=0}^{H-1}\nabla\log\pi_{\theta}(a_{h}|s_{h})\sum_{i=h}^{H-1}\gamma^{i}\mathcal{R}(s_{i},a_{i}).
Algorithm 2 Vanilla Policy Gradient Estimator
  Input: policy parameter $\theta_{t}$, trajectory $\tau_{t}=\{s_{0},a_{0},s_{1},a_{1},...,s_{H-1},a_{H-1}\}$
  Compute gradient estimator from $\tau_{t}$:
\hat{G}^{VPG}(\theta_{t};\tau_{t})=\sum_{h=0}^{H-1}\nabla\log\pi_{\theta_{t}}(a_{h}|s_{h})\sum_{i=h}^{H-1}\gamma^{i}\mathcal{R}(s_{i},a_{i})
  return $\hat{G}^{VPG}(\theta_{t};\tau_{t})$

This yields the “vanilla” policy gradient algorithm as outlined in Algorithm 2. We define the truncated or finite-horizon objective $J_{H}(\theta)=\mathbb{E}_{\pi_{\theta},\rho_{0}}[\sum_{t=0}^{H-1}\gamma^{t}\mathcal{R}(s_{t},a_{t})]$. As observed in (Yuan et al., 2022; Zhang et al., 2020; Wu et al., 2022), since we cannot practically sample infinite-horizon trajectories, $\hat{G}^{VPG}(\theta;\tau)$ is a biased gradient estimator of $J(\theta)$ and an unbiased gradient estimator of $J_{H}(\theta)$, such that

\mathbb{E}[\hat{G}^{VPG}(\theta;\tau)]=\nabla J_{H}(\theta)\neq\nabla J(\theta).
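
A minimal sketch of the GPOMDP estimator in Algorithm 2 follows; it assumes the trajectory is stored as $(s_h, a_h, r_h)$ triples and that `score` is a hypothetical callable returning $\nabla\log\pi_{\theta}(a|s)$.

```python
import numpy as np

def gpomdp_estimator(trajectory, score, gamma):
    """GPOMDP gradient estimate from one truncated trajectory.

    `trajectory` is a list of (s_h, a_h, r_h) triples of length H, and
    `score(s, a)` returns grad_theta log pi_theta(a|s) as a numpy array.
    """
    H = len(trajectory)
    # discounted reward-to-go: sum_{i=h}^{H-1} gamma^i R(s_i, a_i)
    rewards_to_go = np.zeros(H)
    running = 0.0
    for h in reversed(range(H)):
        _, _, r = trajectory[h]
        running += (gamma ** h) * r
        rewards_to_go[h] = running
    grad = sum(rewards_to_go[h] * score(s, a)
               for h, (s, a, _) in enumerate(trajectory))
    return grad   # unbiased for grad J_H, biased for grad J
```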

3.1 Assumptions

We require the following assumptions for the convergence of biased policy gradient algorithms.

Assumption 3.1.

The following conditions hold for all $(s,a)\in\mathcal{S}\times\mathcal{A}$ and $\theta$:

  1. The rewards are bounded such that there exists $\mathcal{R}_{max}>0$ with $|\mathcal{R}(s,a)|\leq\mathcal{R}_{max}$.

  2. The policy score function $\nabla\log\pi_{\theta}(a|s)$ exists and its norm is bounded by a constant $G>0$ such that $\lVert\nabla\log\pi_{\theta}(a|s)\rVert\leq G$.

  3. The Jacobian of the score function exists and its norm is bounded by a constant $B>0$ such that $\lVert\nabla^{2}\log\pi_{\theta}(a|s)\rVert<B$, and it is Lipschitz continuous such that for all $\theta_{1},\theta_{2}$, we have

     \lVert\nabla^{2}\log\pi_{\theta_{1}}(a|s)-\nabla^{2}\log\pi_{\theta_{2}}(a|s)\rVert\leq\iota\lVert\theta_{1}-\theta_{2}\rVert.

In Assumption 3.1, we assume that the reward and the policy log gradient are bounded, assumptions first put forward in (Papini et al., 2018) and since widely adopted in many theoretical analyses of policy gradient (Zhu & Gong, 2023; Zhang et al., 2020; Yang et al., 2021). Assumption 3.1 is satisfied by commonly used policy parametrizations, including the softmax policy parametrization $\pi_{\theta}(a|s)=\frac{e^{h(s,a,\theta)}}{\sum_{b}e^{h(s,b,\theta)}}$. The action preferences $h(s,a,\theta)$ can be parametrized via deep neural networks or other functions of the feature vectors $\phi(s,a)$. The assumption is also satisfied by Gaussian policies such as $\pi_{\theta}(a|s)\sim\mathcal{N}(\phi(s)^{\intercal}\theta,\sigma^{2})$ if the parameter $\theta$ lies in some bounded set and the actions and the feature vectors $\phi(s)$ are bounded (Zhang et al., 2020). In addition, Assumption 3.1 has several important implications, formalized in Lemmas 3.2 and 3.3 below.
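
As a concrete illustration of the bounded-score condition, the sketch below computes the score function of a softmax policy with linear action preferences $h(s,a,\theta)=\theta^{\intercal}\phi(s,a)$; the feature map `phi` is a hypothetical input, and if $\lVert\phi(s,a)\rVert\leq c$ for all $(s,a)$ the score norm is at most $2c$, consistent with Assumption 3.1.

```python
import numpy as np

def softmax_policy_score(theta, s, a, actions, phi):
    """Score function grad_theta log pi_theta(a|s) for a softmax policy
    with linear preferences h(s, a, theta) = theta . phi(s, a).

    `phi(s, a)` is a hypothetical feature map returning a numpy array.
    """
    prefs = np.array([theta @ phi(s, b) for b in actions])
    prefs -= prefs.max()                       # for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    expected_feature = sum(p * phi(s, b) for p, b in zip(probs, actions))
    return phi(s, a) - expected_feature        # grad log pi = phi(s,a) - E_b[phi(s,b)]
```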

Lemma 3.2.

(Lemma 3.2 of (Zhang et al., 2020)) The score function $\nabla\log\pi_{\theta}(a|s)$ is $B$-Lipschitz continuous. Moreover, the policy gradient $\nabla J(\theta)$ is Lipschitz continuous such that for all $\theta_{1}$, $\theta_{2}$, we have

\lVert\nabla J(\theta_{1})-\nabla J(\theta_{2})\rVert\leq L\lVert\theta_{1}-\theta_{2}\rVert

where $L=\frac{\mathcal{R}_{max}B}{(1-\gamma)^{2}}+\frac{(1+\gamma)\mathcal{R}_{max}G^{2}}{(1-\gamma)^{3}}$.

Lemma 3.3.

(Lemma 5.4 from (Zhang et al., 2020)) The Hessian matrix of $J(\theta)$ is Lipschitz continuous such that for all $\theta_{1}$, $\theta_{2}$, we have

\lVert\nabla^{2}J(\theta_{1})-\nabla^{2}J(\theta_{2})\rVert\leq\chi\lVert\theta_{1}-\theta_{2}\rVert

where $\chi=\frac{\mathcal{R}_{max}GB}{(1-\gamma)^{2}}+\frac{\mathcal{R}_{max}G^{3}(1+\gamma)}{(1-\gamma)^{3}}+\frac{\mathcal{R}_{max}G}{1-\gamma}\cdot\max\{B,\frac{G^{2}\gamma}{1-\gamma},\frac{\iota}{G},\frac{B\gamma}{1-\gamma},\frac{G^{2}(1+\gamma)+B(1-\gamma)\gamma}{(1-\gamma)^{2}}\}$.

Finally, we require assumptions on the noise of the algorithm that allow the iterates to escape saddle points.

Assumption 3.4.

The covariance matrix of the noise $\xi_{t+1}$ generated at iterate $\theta_{t}$, defined as $R_{\xi}(\theta_{t})=\mathbb{E}[\xi_{t+1}\xi_{t+1}^{\intercal}|\mathcal{F}_{t}]$, is Lipschitz such that

\lVert R_{\xi}(\theta_{1})-R_{\xi}(\theta_{2})\rVert\leq\beta_{R}\lVert\theta_{1}-\theta_{2}\rVert^{\nu}

where $0<\nu\leq 4$ and $\beta_{R}>0$.

Assumption 3.5.

Let $\nabla^{2}J(\theta)=V_{\theta}\Lambda_{\theta}V_{\theta}^{\intercal}$ be the eigendecomposition of the Hessian matrix at $\theta$ where the eigenvalues and eigenvectors are ordered as follows

V_{\theta}=\begin{bmatrix}V_{\theta}^{>0}&V_{\theta}^{\leq 0}\end{bmatrix},\qquad\Lambda_{\theta}=\begin{bmatrix}\Lambda_{\theta}^{>0}&0\\ 0&\Lambda_{\theta}^{\leq 0}\end{bmatrix}

where $\Lambda_{\theta}^{>0}>0$ and $\Lambda_{\theta}^{\leq 0}\leq 0$. Then, we assume that there exists $\sigma_{l}^{2}>0$ such that for any approximate strict-saddle point $\theta_{t}\in\mathcal{H}$

\lambda_{min}((V_{\theta_{t}}^{>0})^{\intercal}\mathbb{E}[\xi_{t+1}\xi_{t+1}^{\intercal}]V_{\theta_{t}}^{>0})\geq\sigma_{l}^{2}.

Assumption 3.4 states that the covariance matrix is Lipschitz, an assumption proposed in (Vlaski & Sayed, 2022; Vlaski et al., 2020). Assumption 3.4 requires that the covariance of noise does not change too much between iterates, which can be ensured by restricting our parameter search space to a compact set, which reflects common practices. Assumption 3.5, or the condition of “correlated negative curvature,” states that there must be a component of noise in the direction of negative curvature in order for the noise to help the iterates escape the saddle point. It is necessary for most second-order convergence analyses (Daneshmand et al., 2018; Zhang et al., 2020; Yang et al., 2021). Like (Vlaski & Sayed, 2022), we note that although Assumption 3.5 is a technical requirement for convergence, the condition can be achieved by simply adding isotropic noise at each parameter update iteration like in (Ge et al., 2015; Jin et al., 2019).
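
As noted above, one way to enforce Assumption 3.5 in practice is to inject isotropic noise into each update, as in (Ge et al., 2015; Jin et al., 2019); a minimal sketch of such a perturbed update follows, with the noise scale `noise_std` left as a tunable hyperparameter.

```python
import numpy as np

def perturbed_update(theta, g_hat, mu, noise_std, rng=None):
    """Gradient ascent step with injected isotropic Gaussian noise.

    Isotropic noise has strictly positive variance in every direction,
    which is one simple way to enforce the correlated-negative-curvature
    condition of Assumption 3.5.
    """
    rng = np.random.default_rng() if rng is None else rng
    xi = noise_std * rng.standard_normal(theta.shape)   # isotropic perturbation
    return theta + mu * (g_hat + xi)
```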

3.2 Main Result

We first present the following theorem for general biased policy gradient, which shows that if the gradient estimator and bias are sufficiently bounded, we can conclude second-order convergence. Theorem 3.6 adapts Theorem 3 from (Vlaski & Sayed, 2022) regarding second-order convergence of unbiased stochastic gradient descent. Our approach follows the framework proposed in (Vlaski et al., 2020) for showing second-order convergence of federated learning, a form of biased stochastic gradient descent.

Theorem 3.6.

For the iterates $\theta_{t}$ of Algorithm 1, suppose that Assumptions 3.1-3.5 hold and that for some $\sigma>0$, $D>0$, the following hold

\lVert\hat{G}(\theta_{t};\tau_{t})\rVert\leq\sigma (2)
\mathbb{E}[\lVert d_{t+1}\rVert^{4}|\mathcal{F}_{t}]\leq D^{4}\mu^{4} (3)
\mathbb{E}[\lVert d_{t+1}\rVert^{2}\lVert\xi_{t+1}\rVert^{2}|\mathcal{F}_{t}]\leq\sigma^{2}\mathbb{E}[\lVert d_{t+1}\rVert^{2}|\mathcal{F}_{t}]. (4)

Let $\ell=L\sigma^{2}-D^{2}\mu$. Then with probability $1-\delta$, Algorithm 1 yields $\theta_{T}\in\mathcal{M}$ such that $\lVert\nabla J(\theta_{T})\rVert^{2}\leq\mu\ell(1+\frac{1}{\delta})$ and $\lambda_{max}(\nabla^{2}J(\theta_{T}))\leq\omega$ in $T$ iterations, where

T\leq\frac{4\mathcal{R}_{max}}{\mu^{2}(1-\gamma)(L\sigma^{2}+D^{2}\mu)\delta}\cdot\mathcal{T}
\mathcal{T}=\frac{\log(2M\frac{\sigma^{2}}{\sigma^{2}_{l}}+1)}{\log(1+2\mu\omega)}.

Proof. See Appendix B.

Therefore, for both the vanilla and actor-critic settings, our key challenge is establishing conditions (2), (3), and (4). By applying Theorem 3.6 and choosing $\mu$ to be small enough with respect to $\epsilon$, we can arrive at the following theorem establishing convergence of vanilla policy gradient to $\epsilon$-second-order stationary points, specifying the required horizon of each sampled trajectory.

Theorem 3.7.

Suppose Assumptions 3.1-3.5 hold and let $\epsilon>0$. For $\mu<\frac{\epsilon^{2}\delta}{L\sigma^{2}+D^{2}}$ where $D=\frac{G\mathcal{R}_{max}}{1-\gamma}$ and $\sigma=\frac{G\mathcal{R}_{max}}{(1-\gamma)^{2}}$, we have with probability $1-\delta$ that Algorithm 1 with the vanilla policy gradient estimator computed via Algorithm 2 and $H=O(\log(\epsilon^{-2}))$ reaches an $\epsilon$-second-order stationary point in $\tilde{O}(\epsilon^{-6.5})$ iterations, where $\tilde{O}(\cdot)$ hides logarithmic dependencies.

In Theorem 3.7, $O(\cdot)$ hides dependency on $\gamma$, and $\tilde{O}(\cdot)$ hides dependencies on $L$, $G$, $\mathcal{R}_{max}$, $\gamma$, $M$, $\sigma_{l}$, $\chi$, $\delta$.

The detailed proof of Theorem 3.7 is provided in Appendix C. The proof sketch is as follows. In order to apply Theorem 3.6, we first show that the gradient estimator is uniformly bounded based on our assumptions (Lemma C.1). This also implies that the second and fourth moments of the noise are bounded. We then establish that the gradient bias due to truncation is deterministically bounded and decays as the trajectory horizon $H$ increases (Lemma C.3). For large enough $H$, the bias error is proportional to $\mu$, allowing us to bound away its effects.
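
For concreteness, the hyperparameter choices of Theorem 3.7 can be instantiated as in the sketch below; the explicit horizon formula should be read as an assumption about the constant hidden in $H=O(\log(\epsilon^{-2}))$ (it follows from a standard $\gamma^{H}$ truncation argument) rather than as a value taken from the paper.

```python
import math

def vpg_hyperparameters(eps, delta, G, R_max, gamma, L):
    """Step size and horizon suggested by Theorem 3.7 (sketch only).

    sigma and D follow the theorem statement; the horizon instantiates
    H = O(log(1/eps^2)) by requiring gamma**H <= eps**2, an assumed
    constant rather than one stated in the theorem.
    """
    sigma = G * R_max / (1.0 - gamma) ** 2
    D = G * R_max / (1.0 - gamma)
    mu = eps ** 2 * delta / (L * sigma ** 2 + D ** 2)   # any step size strictly below this bound
    H = math.ceil(math.log(1.0 / eps ** 2) / math.log(1.0 / gamma))
    return mu, H
```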

4 Second-Order Convergence of Actor-Critic Policy Gradient

Now we consider the second-order convergence of actor-critic policy gradient algorithms. Actor-critic methods can reduce the variance of policy gradient methods by separately learning to approximate the state-action value function $Q^{\pi}$. With some algebraic manipulation, the policy gradient can be expressed as follows

\nabla J(\theta)=\mathbb{E}_{\pi_{\theta},\rho_{0}}\left[\sum_{k=0}^{\infty}\gamma^{k}\nabla\log\pi_{\theta}(a_{k}|s_{k})Q^{\pi_{\theta}}(s_{k},a_{k})\right].

This motivates the construction of the following biased gradient estimator from a trajectory $\tau$ of length $H$

\hat{G}^{AC}(\theta;\tau)=\sum_{k=0}^{H}\gamma^{k}\nabla\log\pi_{\theta}(a_{k}|s_{k})Q_{w}(s_{k},a_{k}),

where $Q_{w}(s,a)$ is a function with parameter $w$ that approximates $Q^{\pi_{\theta}}(s,a)$. Note that $\hat{G}^{AC}(\theta;\tau)$ depends on $w$, although we omit this dependency in notation. In contrast to vanilla policy gradient, the actor-critic gradient estimator has two sources of bias: bias from the horizon truncation that depends on $H$, and bias from the critic approximation $Q_{w}$. Our following analysis addresses both. Let

G(\theta)=G_{H}(\theta)=\mathbb{E}_{\tau}[\hat{G}^{AC}(\theta;\tau)]=\mathbb{E}_{\tau}\left[\sum_{k=0}^{H}\gamma^{k}\nabla\log\pi_{\theta}(a_{k}|s_{k})Q_{w}(s_{k},a_{k})\right]

represent the expectation of the gradient estimator with respect to the sampled trajectory $\tau$ of length $H$. Let

G_{\infty}(\theta)=\mathbb{E}_{\tau}\left[\sum_{k=0}^{\infty}\gamma^{k}\nabla\log\pi_{\theta}(a_{k}|s_{k})Q_{w}(s_{k},a_{k})\right]

represent the expectation of an infinite-horizon gradient estimator with respect to a sampled trajectory of infinite horizon. As before, we have the following noise-bias decomposition of the policy updates

\theta_{t+1}=\theta_{t}+\mu\nabla J(\theta_{t})+\mu\xi_{t+1}+\mu d_{t+1},

where $\xi_{t+1}=\hat{G}^{AC}(\theta_{t};\tau_{t})-G_{H}(\theta_{t})$ once again represents a zero-mean noise term. Then we can further decompose the bias term $d_{t+1}$ as follows

d_{t+1}=G_{H}(\theta_{t})-\nabla J(\theta_{t})=G_{H}(\theta_{t})-G_{\infty}(\theta_{t})+G_{\infty}(\theta_{t})-\nabla J(\theta_{t})=p_{t+1}+q_{t+1}

where $p_{t+1}=G_{H}(\theta_{t})-G_{\infty}(\theta_{t})$ represents the bias component due to truncation of the infinite horizon and $q_{t+1}=G_{\infty}(\theta_{t})-\nabla J(\theta_{t})$ represents the bias component induced by the critic approximation. In order to apply Theorem 3.6, we need to bound both $p_{t+1}$ and $q_{t+1}$; although the former can be bounded using the approach of the previous section, the latter requires novel techniques.

The structure of our algorithm is as follows. We consider a double-loop actor-critic algorithm with linear function approximation of $Q^{\pi_{\theta}}$ and an arbitrary policy parametrization. In the inner loop, the critic parameter $w$ is updated via TD(0) and projected onto a convex set $\Theta$, and in the outer loop, the policy parameter $\theta$ is updated via gradient updates. The gradient estimator algorithm is outlined in Algorithm 3.

Algorithm 3 Actor-Critic Gradient Estimator
  Input: initial critic parameters $w_{0}$, policy parameter $\theta_{t}$, trajectory $\tau_{t}=\{s_{0},a_{0},s_{1},a_{1},...,s_{H-1},a_{H-1}\}$
  Sample initial state-action pair $s^{\prime}_{0}\sim\rho_{0}$ and $a^{\prime}_{0}\sim\pi_{\theta_{t}}(\cdot|s^{\prime}_{0})$
  for $k=0,1,2,...,K-1$ do
     Sample $s^{\prime}_{k+1}\sim\mathcal{P}(\cdot|s^{\prime}_{k},a^{\prime}_{k})$ and $a^{\prime}_{k+1}\sim\pi_{\theta_{t}}(\cdot|s^{\prime}_{k+1})$
     Compute the TD(0) semi-gradient $g_{k}(w_{k})$
g_{k}(w_{k})=(\mathcal{R}(s^{\prime}_{k},a^{\prime}_{k})+\gamma Q_{w_{k}}(s^{\prime}_{k+1},a^{\prime}_{k+1})-Q_{w_{k}}(s^{\prime}_{k},a^{\prime}_{k}))\nabla Q_{w_{k}}(s^{\prime}_{k},a^{\prime}_{k})
     $w_{k+1}=Proj_{\Theta}[w_{k}+\alpha_{k}g_{k}(w_{k})]$
  end for
  Calculate the averaged parameter $\bar{w}_{K,t}=\frac{1}{K}\sum_{k=0}^{K-1}w_{k}$
  Compute gradient estimator from $\tau_{t}$:
\hat{G}^{AC}(\theta_{t};\tau_{t})=\sum_{h=0}^{H-1}\gamma^{h}Q_{\bar{w}_{K,t}}(s_{h},a_{h})\nabla\log\pi_{\theta_{t}}(a_{h}|s_{h})
  return $\hat{G}^{AC}(\theta_{t};\tau_{t})$
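
A minimal Python rendering of the final step of Algorithm 3 is given below; it assumes the averaged critic parameter $\bar{w}_{K,t}$ has already been produced by the inner TD(0) loop (a sketch of that projected update appears in Section 4.1.1), and `score` and `phi` are hypothetical callables for $\nabla\log\pi_{\theta_{t}}$ and the critic feature map.

```python
def actor_critic_estimator(trajectory, w_bar, score, phi, gamma):
    """Actor-critic gradient estimate from one trajectory and a fixed critic.

    `trajectory` is a list of (s_h, a_h) pairs of length H, `w_bar` is the
    averaged TD(0) iterate, `phi(s, a)` the critic feature map (numpy array),
    and `score(s, a)` returns grad_theta log pi_theta(a|s).
    """
    grad = 0.0
    for h, (s, a) in enumerate(trajectory):
        q_value = w_bar @ phi(s, a)                     # linear critic Q_w(s, a)
        grad = grad + (gamma ** h) * q_value * score(s, a)
    return grad
```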

To control the actor-critic bias $q_{t+1}$ and ensure second-order convergence by Theorem 3.6, we require that for each policy parameter iterate, the inner loop converges to the global optimum $w^{*}(\theta_{t})$ and that $Q_{w^{*}(\theta_{t})}$ precisely approximates the true value function $Q^{\pi_{\theta_{t}}}$. The first requirement is satisfied by linear TD, as shown in Section 4.1, and the second is formalized later in Assumption 4.9.

4.1 Convergence of TD(0) on Nonstationary Markov Chains

The inner-loop structure of Algorithm 3 suggests that we should apply existing results regarding the finite-time convergence of TD(0) with Markovian sampling (Bhandari et al., 2018; Liu & Olshevsky, 2020). However, these analyses rely on an additional assumption that the Markov chain begins in the stationary distribution, which is unrealistic for the actor-critic setting since there is no guarantee that after each policy update we can begin at the new stationary distribution with respect to the updated policy. As argued in (Bhandari et al., 2018; Liu & Olshevsky, 2020; Dalal et al., 2018), an exponentially mixing Markov chain approximately arrives at its stationary distribution after a logarithmic number of time steps. However, although this explanation justifies the assumption in a practical sense, it is not conducive for finite-time convergence analysis, since we can never reach the exact stationary distribution after a finite number of time steps.

As it turns out, the general convergence of TD(0) does hold without this initial state distribution assumption after some proof adjustments. In this section we reestablish the core results of (Bhandari et al., 2018) for a nonstationary Markov chain (i.e. a Markov chain that has not reached steady-state). These results, which are also utilized in (Qiu et al., 2021; Liu & Olshevsky, 2020), may be of independent interest.

4.1.1 Setup

We consider TD(0) with linear function approximation and a projection step. We define a set of $N$ feature functions $\phi_{n}:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$, $0<n\leq N$. For each state-action pair $(s,a)$, we define the vector $\phi(s,a)=(\phi_{1}(s,a),\phi_{2}(s,a),...,\phi_{N}(s,a))^{\intercal}$ as the vector representing the features of $(s,a)$. Then we denote as $\Phi\in\mathbb{R}^{|\mathcal{S}\times\mathcal{A}|\times N}$ the matrix $\Phi=[\phi_{1},...,\phi_{N}]$. Finally, we denote our linear parametrization of $Q^{\pi}$ by the parameter $w\in\mathbb{R}^{N}$ as

Q_{w}(s,a)=w^{\intercal}\phi(s,a).

Let $g_{k}$ represent the stochastic semi-gradient at time step $k$ such that

g_{k}(w)=\mathcal{R}(s_{k},a_{k})\phi(s_{k},a_{k})+(\gamma\phi(s_{k+1},a_{k+1})^{\intercal}w-\phi(s_{k},a_{k})^{\intercal}w)\phi(s_{k},a_{k}).

As is common in the TD convergence literature, we consider TD(0) projected onto a convex set $\Theta$ that contains the limit point $w^{*}$ of the algorithm (Bhandari et al., 2018). Therefore, the TD update can be written as

w_{k+1}=Proj_{\Theta}[w_{k}+\alpha g_{k}(w_{k})]. (5)
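
A minimal sketch of the projected update (5) with linear features follows; projecting onto a Euclidean ball of radius $R/2$ centered at the origin is one common concrete choice of $\Theta$ and is an assumption made here for illustration, not a requirement of the analysis.

```python
import numpy as np

def projected_td0_step(w, transition, phi, alpha, gamma, radius):
    """One projected TD(0) update with linear function approximation.

    `transition` = (s, a, r, s_next, a_next) is a single Markovian sample,
    `phi(s, a)` is the feature map, and the projection set Theta is taken
    to be a Euclidean ball of the given radius (an illustrative choice).
    """
    s, a, r, s_next, a_next = transition
    feat, feat_next = phi(s, a), phi(s_next, a_next)
    td_error = r + gamma * (w @ feat_next) - (w @ feat)
    w_new = w + alpha * td_error * feat            # semi-gradient step g_k(w)
    norm = np.linalg.norm(w_new)
    if norm > radius:                              # Euclidean projection onto the ball
        w_new = w_new * (radius / norm)
    return w_new
```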

4.1.2 Assumptions

The following assumptions and Assumption 3.1 are necessary to show the convergence of TD(0).

Assumption 4.1.

For all $\theta$ and $(s,a)\in\mathcal{S}\times\mathcal{A}$, we have $\pi_{\theta}(a|s)>0$.

Assumption 4.1 is satisfied by popular policy parametrizations such as the softmax policy parametrization.

Assumption 4.2.

For any $\pi>0$, the Markov chain defined by $P(s,a,s^{\prime},a^{\prime})=\mathcal{P}(s^{\prime}|s,a)\pi(a^{\prime}|s^{\prime})$ is ergodic.

Combined with Assumption 4.1, Assumption 4.2 is satisfied if there is a positive probability of transitioning between any two state-action pairs in a finite number of steps. Assumption 4.2 is a common assumption in the TD and actor-critic literature (Bhandari et al., 2018; Qiu et al., 2021; Liu & Olshevsky, 2020; Wu et al., 2020; Xu et al., 2020a, b) that has a number of implications: the Markov chain is irreducible and aperiodic, it has a unique stationary distribution $\eta_{\pi}(s,a)$, and $\eta_{\pi}(s,a)\neq 0$ for all $(s,a)$. In addition, the Markov chain mixes at a uniform geometric rate, i.e., there exist $m>0$, $r\in(0,1)$, such that for $t\in\mathbb{N}$ we have

\sup_{s\in\mathcal{S},a\in\mathcal{A}}d_{TV}(\mathbb{P}(s_{t}=\cdot,a_{t}=\cdot|s_{0}=s,a_{0}=a),\eta_{\pi})\leq mr^{t} (6)

and $\tau^{\text{mix}}(\epsilon)=\min\{t\in\mathbb{N}\,|\,mr^{t}\leq\epsilon\}$ is the mixing time.
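
Under the geometric mixing bound (6), solving $mr^{t}\leq\epsilon$ for the smallest integer $t$ gives the explicit estimate (assuming $\epsilon\leq m$)

\tau^{\text{mix}}(\epsilon)\leq\left\lceil\frac{\log(m/\epsilon)}{\log(1/r)}\right\rceil,

so the mixing time at accuracy $\frac{1}{\sqrt{K}}$, which appears in Theorem 4.8 below, grows only logarithmically in $K$.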

Assumption 4.3.

The number of states and actions is finite.

As discussed in (Bhandari et al., 2018), Assumption 4.3 can be relaxed to consider countably infinite state-action pairs.

Assumption 4.4.

The feature matrix $\Phi$ has full column rank, i.e., the feature vectors $\{\phi_{1},...,\phi_{N}\}$ are linearly independent. In addition, for all $(s,a)\in\mathcal{S}\times\mathcal{A}$, $\lVert\phi(s,a)\rVert^{2}\leq 1$.

Assumption 4.5.

There exists $R>0$ such that $diam(\Theta)\leq R$, where $diam$ is the diameter.

Assumptions 4.4 and 4.5 are also common in the literature (Bhandari et al., 2018; Liu & Olshevsky, 2020; Tsitsiklis & Van Roy, 1997; Qiu et al., 2021). See Section 8.2 of (Bhandari et al., 2018) and Proposition 3 of (Qiu et al., 2021) for additional details on defining $R$.

Finally, let $A_{\theta}$ denote the matrix $\mathbb{E}_{(s,a)\sim\eta_{\theta},(s^{\prime},a^{\prime})\sim P(s,a,\cdot)}[\phi(s,a)(\phi(s,a)-\gamma\phi(s^{\prime},a^{\prime}))^{\intercal}]$. Based on Assumptions 4.2 and 4.4, we can conclude that $A_{\theta}$ is positive definite (Tsitsiklis & Van Roy, 1997). We further assume the smallest eigenvalue of $A_{\theta}+A_{\theta}^{\intercal}$ is uniformly bounded away from zero in the following assumption, which is also utilized in (Qiu et al., 2021).

Assumption 4.6.

There exists a lower bound $\varsigma>0$ such that for all $\theta\in\mathbb{R}^{M}$ we have $\lambda_{min}(A_{\theta}+A_{\theta}^{\intercal})\geq\varsigma$.

4.1.3 Finite-Time Convergence of TD(0)

We first require the following theorem from (Tsitsiklis & Van Roy, 1997), which establishes the existence and uniqueness of the solution to the projected Bellman equation as the limit point of the TD(0) algorithm.

Theorem 4.7.

(Tsitsiklis & Van Roy, 1997) Denote by $\mathbb{T}^{\pi}$ the Bellman operator under policy $\pi$ such that for a value function $Q:\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ we have

(\mathbb{T}^{\pi}Q)(s,a)=\mathcal{R}(s,a)+\gamma\sum_{s^{\prime}\in\mathcal{S},a^{\prime}\in\mathcal{A}}\mathcal{P}(s^{\prime}|s,a)\pi(a^{\prime}|s^{\prime})Q(s^{\prime},a^{\prime}).

Then given Assumptions 3.1, 4.1-4.5, the limit point $w^{*}$ of the TD(0) algorithm with linear function approximation exists, and it is the unique solution to the projected Bellman equation

\Phi w=Proj_{\Phi}(\mathbb{T}^{\pi}\Phi w)

where $Proj_{\Phi}$ is the projection operator onto the subspace $\{\Phi x\,|\,x\in\mathbb{R}^{N}\}$ spanned by the feature vectors $\phi_{n}$.

Now we share our main result regarding the convergence of TD(0). Theorem 4.8 establishes the convergence rate of the Projected TD(0) algorithm after $K$ time steps with a constant step size on a nonstationary Markov chain.

Theorem 4.8.

Suppose Assumptions 3.1, 4.1-4.5 hold and $\bar{w}_{K}=\frac{1}{K}\sum_{k=0}^{K-1}w_{k}$ is generated by $K$ steps of the Projected TD(0) algorithm with $w^{*}\in\Theta$ and $\alpha=\frac{1}{\sqrt{K}}$. Then

\mathbb{E}[\lVert Q_{w^{*}}-Q_{\bar{w}_{K}}\rVert^{2}_{\eta_{\pi}}]\leq\frac{\lVert w^{*}-w_{0}\rVert^{2}+F^{2}(17+12\tau^{\text{mix}}(\frac{1}{\sqrt{K}}))}{2(1-\gamma)\sqrt{K}}+\frac{10F^{2}m}{(1-r)(1-\gamma)K},

where $F=\mathcal{R}_{max}+2R$ and

\lVert Q_{w^{*}}-Q_{\bar{w}_{K}}\rVert^{2}_{\eta_{\pi}}=\sum_{s\in\mathcal{S},a\in\mathcal{A}}\eta_{\pi}(s,a)(Q_{w^{*}}(s,a)-Q_{\bar{w}_{K}}(s,a))^{2}.

Proof. See Appendix D.

We can compare this result with Theorem 3 from (Bhandari et al., 2018) which shows

\mathbb{E}[\lVert Q_{w^{*}}-Q_{\bar{w}_{K}}\rVert^{2}_{\eta_{\pi}}]\leq\frac{\lVert w^{*}-w_{0}\rVert^{2}+F^{2}(9+12\tau^{\text{mix}}(\frac{1}{\sqrt{K}}))}{2(1-\gamma)\sqrt{K}} (7)

for TD(0) on a stationary Markov chain. We note that the nonstationary result involves slightly different constants and an additional term that decays at the rate of $O(1/K)$.

4.2 Main Result

Returning to the actor-critic setting, the following assumption, combined with the existence and uniqueness theorem (Theorem 4.7), implies that for the limit point $w^{*}$ of the TD(0) algorithm, the resulting function $Q_{w^{*}}(s,a)={w^{*}}^{\intercal}\phi(s,a)$ approximates the true state-action value function $Q^{\pi_{\theta}}$ with arbitrarily close precision.

Assumption 4.9.

The value function lies in the linear function class such that

\inf_{w\in\Theta}\mathbb{E}_{(s,a)\sim\eta_{\pi_{\theta}}}[(\mathbb{T}^{\pi_{\theta}}(w^{\intercal}\phi(s,a))-w^{\intercal}\phi(s,a))^{2}]=0.

Assumption 4.9 is a linear realizability assumption that states that the value function can be sufficiently represented by a linear model of the feature vectors. This can be satisfied via an appropriate choice of feature vectors, such as radial basis functions, Fourier basis functions, or neural networks (Ji et al., 2019). Other actor-critic works (Xu et al., 2020a; Kumar et al., 2019; Fu et al., 2020) also require similar assumptions.

By characterizing the convergence of TD(0), we show in the following lemma that the norm of the bias term $q_{t+1}$ decays with respect to the number of inner-loop iterations. Lemma 4.10 features diminishing step sizes because it requires a stronger fourth-moment bound showing the direct convergence of $\bar{w}_{K}$ to $w^{*}$, as opposed to the weaker result obtained in Theorem 4.8. The details of the proof are in Appendix E.1.

Lemma 4.10.

For $K=O(\frac{\log^{2}(\mu^{-4})}{\mu^{4}})$ as $\mu\to 0$, for which $O(\cdot)$ does not hide dependencies on other constants, and

D_{q}=G\left(\frac{192F^{2}R^{2}}{\varsigma^{2}\log^{2}(r^{-1})}+O(\frac{1}{\log\mu^{-4}})\right)^{-1/4}

for which $O(\cdot)$ hides dependencies on $F$, $r$, $\varsigma$, $m$, the expected function approximation bias is bounded such that

\mathbb{E}[\lVert q_{t+1}\rVert^{4}|\mathcal{F}_{t}]\leq G^{4}\mathbb{E}[\lVert\bar{w}_{K,t}-w^{*}\rVert^{4}|\mathcal{F}_{t}]\leq D_{q}^{4}\mu^{4}.

Finally, by combining Theorem 3.6 with Lemma 4.10 and bounds on the truncation bias $p_{t+1}$ (Lemma E.2) and noise $\xi_{t+1}$ (Lemma E.1), we arrive at Theorem 4.11.

Theorem 4.11.

Suppose Assumptions 3.1-3.5, 4.1-4.6, 4.9 hold and let $\epsilon>0$. For $\mu<\frac{\epsilon^{2}\delta}{L\sigma^{2}+D^{2}}$ where $\sigma=\frac{GR}{1-\gamma}$, $D=2(D_{p}^{4}+D_{q}^{4})^{1/4}$ and $D_{p}=\frac{GR}{1-\gamma}$, we have with probability $1-\delta$ that Algorithm 1 with the actor-critic policy gradient estimator computed via Algorithm 3 with $H=O(\log(\epsilon^{-2}))$ and $K=\tilde{O}(\epsilon^{-8})$ reaches an $\epsilon$-second-order stationary point in $\tilde{O}(\epsilon^{-6.5})$ iterations.

Proof. See Appendix E.

In Theorem 4.11, $O(\cdot)$ hides dependency on $\gamma$ and $\tilde{O}(\cdot)$ hides dependencies on $L$, $G$, $R$, $\gamma$, $M$, $\sigma_{l}$, $\chi$, $\delta$, $F$, $r$, $m$, $\varsigma$.

5 Conclusion

In this work, we provide a novel analysis of the convergence of biased policy gradient methods to second-order stationary points. Our results apply to general policy parametrizations and Markovian sampling. We also show the convergence of TD(0) on nonstationary Markov chains, which pertains to realistic actor-critic implementations.

Future directions may involve extending this work to two-timescale or single-timescale actor-critic algorithms, which may provide some performance improvement. In addition, instead of assuming Assumption 4.9, we may want to show second-order convergence of actor-critic algorithms with respect to some irremovable approximation error $\epsilon_{app}$ representing the imperfect critic approximation, similar to several first-order analyses (Wu et al., 2020; Qiu et al., 2021).

Impact Statement

This paper advances the field of reinforcement learning by characterizing the solutions achieved by policy gradient algorithms. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Agarwal et al. (2020) Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. Optimality and approximation with policy gradient methods in markov decision processes. In Abernethy, J. and Agarwal, S. (eds.), Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pp.  64–66. PMLR, 09–12 Jul 2020.
  • Alfano et al. (2023) Alfano, C., Yuan, R., and Rebeschini, P. A novel framework for policy mirror descent with general parameterization and linear convergence. In Thirty-seventh Conference on Neural Information Processing Systems, 2023.
  • Baxter & Bartlett (2001) Baxter, J. and Bartlett, P. L. Infinite-horizon policy-gradient estimation. J. Artif. Int. Res., 15(1):319–350, Nov 2001. ISSN 1076-9757.
  • Bhandari & Russo (2019) Bhandari, J. and Russo, D. Global optimality guarantees for policy gradient methods. CoRR, abs/1906.01786, 2019.
  • Bhandari & Russo (2021) Bhandari, J. and Russo, D. On the linear convergence of policy gradient methods for finite mdps. In Banerjee, A. and Fukumizu, K. (eds.), Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, volume 130 of Proceedings of Machine Learning Research, pp.  2386–2394. PMLR, 13–15 Apr 2021.
  • Bhandari et al. (2018) Bhandari, J., Russo, D., and Singal, R. A finite time analysis of temporal difference learning with linear function approximation. In Bubeck, S., Perchet, V., and Rigollet, P. (eds.), Proceedings of the 31st Conference On Learning Theory, volume 75 of Proceedings of Machine Learning Research, pp.  1691–1692. PMLR, 06–09 Jul 2018.
  • Bhatnagar et al. (2009) Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. Natural actor–critic algorithms. Automatica, 45(11):2471–2482, 2009. ISSN 0005-1098. doi: https://doi.org/10.1016/j.automatica.2009.07.008.
  • Cayci et al. (2022) Cayci, S., He, N., and Srikant, R. Finite-time analysis of entropy-regularized neural natural actor-critic algorithm, 2022.
  • Dalal et al. (2018) Dalal, G., Szorenyi, B., Thoppe, G., and Mannor, S. Finite sample analyses for td(0) with function approximation. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence, AAAI’18/IAAI’18/EAAI’18. AAAI Press, 2018. ISBN 978-1-57735-800-8.
  • Daneshmand et al. (2018) Daneshmand, H., Kohler, J., Lucchi, A., and Hofmann, T. Escaping saddles with stochastic gradients. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.  1155–1164. PMLR, 10–15 Jul 2018.
  • Fang et al. (2018) Fang, C., Li, C. J., Lin, Z., and Zhang, T. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Bengio, S., Wallach, H., Larochelle, H., Grauman, K., Cesa-Bianchi, N., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018.
  • Fu et al. (2020) Fu, Z., Yang, Z., and Wang, Z. Single-timescale actor-critic provably finds globally optimal policy. CoRR, abs/2008.00483, 2020.
  • Ge et al. (2015) Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points - online stochastic gradient for tensor decomposition. CoRR, abs/1503.02101, 2015.
  • Han et al. (2022) Han, Y., Razaviyayn, M., and Xu, R. Policy gradient finds global optimum of nearly linear-quadratic control systems. In OPT 2022: Optimization for Machine Learning (NeurIPS 2022 Workshop), 2022.
  • Ji et al. (2019) Ji, Z., Telgarsky, M., and Xian, R. Neural tangent kernels, transportation mappings, and universal approximation. CoRR, abs/1910.06956, 2019.
  • Jin et al. (2017) Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In Precup, D. and Teh, Y. W. (eds.), Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp.  1724–1732. PMLR, 06–11 Aug 2017.
  • Jin et al. (2019) Jin, C., Netrapalli, P., Ge, R., Kakade, S. M., and Jordan, M. I. Stochastic gradient descent escapes saddle points efficiently. CoRR, abs/1902.04811, 2019.
  • Khorasani et al. (2023) Khorasani, S., Salehkaleybar, S., Kiyavash, N., He, N., and Grossglauser, M. Efficiently escaping saddle points for non-convex policy optimization, 2023.
  • Konda & Tsitsiklis (1999) Konda, V. and Tsitsiklis, J. Actor-critic algorithms. In Solla, S., Leen, T., and Müller, K. (eds.), Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999.
  • Kumar et al. (2019) Kumar, H., Koppel, A., and Ribeiro, A. On the sample complexity of actor-critic method for reinforcement learning with function approximation. CoRR, abs/1910.08412, 2019.
  • Liu & Olshevsky (2020) Liu, R. and Olshevsky, A. Temporal difference learning as gradient splitting. CoRR, abs/2010.14657, 2020.
  • Mei et al. (2020) Mei, J., Xiao, C., Szepesvari, C., and Schuurmans, D. On the global convergence rates of softmax policy gradient methods. In III, H. D. and Singh, A. (eds.), Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp.  6820–6829. PMLR, 13–18 Jul 2020.
  • Nesterov & Polyak (2006) Nesterov, Y. and Polyak, B. Cubic regularization of newton method and its global performance. Math. Program., 108:177–205, 08 2006. doi: 10.1007/s10107-006-0706-8.
  • Olshevsky & Gharesifard (2023) Olshevsky, A. and Gharesifard, B. A small gain analysis of single timescale actor critic. SIAM Journal on Control and Optimization, 61(2):980–1007, 2023. doi: 10.1137/22M1483335.
  • Papini et al. (2018) Papini, M., Binaghi, D., Canonaco, G., Pirotta, M., and Restelli, M. Stochastic variance-reduced policy gradient. In Dy, J. and Krause, A. (eds.), Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp.  4026–4035. PMLR, 10–15 Jul 2018.
  • Peters & Schaal (2008) Peters, J. and Schaal, S. Reinforcement learning of motor skills with policy gradients. Neural Networks, 21(4):682–697, 2008. ISSN 0893-6080. doi: https://doi.org/10.1016/j.neunet.2008.02.003. Robotics and Neuroscience.
  • Qiu et al. (2021) Qiu, S., Yang, Z., Ye, J., and Wang, Z. On finite-time convergence of actor-critic algorithm. IEEE Journal on Selected Areas in Information Theory, 2(2):652–664, 2021. doi: 10.1109/JSAIT.2021.3078754.
  • Shen et al. (2019) Shen, Z., Ribeiro, A., Hassani, H., Qian, H., and Mi, C. Hessian aided policy gradient. In Chaudhuri, K. and Salakhutdinov, R. (eds.), Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp.  5729–5738. PMLR, 09–15 Jun 2019.
  • Sutton et al. (1999) Sutton, R. S., McAllester, D., Singh, S., and Mansour, Y. Policy gradient methods for reinforcement learning with function approximation. In Solla, S., Leen, T., and Müller, K. (eds.), Advances in Neural Information Processing Systems, volume 12. MIT Press, 1999.
  • Tsitsiklis & Van Roy (1997) Tsitsiklis, J. and Van Roy, B. An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690, 1997. doi: 10.1109/9.580874.
  • Vlaski & Sayed (2022) Vlaski, S. and Sayed, A. H. Second-order guarantees of stochastic gradient descent in nonconvex optimization. IEEE Transactions on Automatic Control, 67(12):6489–6504, 2022. doi: 10.1109/TAC.2021.3131963.
  • Vlaski et al. (2020) Vlaski, S., Rizk, E., and Sayed, A. H. Second-order guarantees in federated learning. In 2020 54th Asilomar Conference on Signals, Systems, and Computers, pp.  915–922, 2020. doi: 10.1109/IEEECONF51394.2020.9443421.
  • Wang et al. (2020) Wang, L., Cai, Q., Yang, Z., and Wang, Z. Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=BJgQfkSYDS.
  • Wang et al. (2022) Wang, P., Wang, H., and Zheng, N. Stochastic cubic-regularized policy gradient method. Knowledge-Based Systems, 255:109687, 2022. ISSN 0950-7051. doi: https://doi.org/10.1016/j.knosys.2022.109687.
  • Williams (2004) Williams, R. J. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 2004.
  • Wu et al. (2022) Wu, S., Shi, L., Wang, J., and Tian, G. Understanding policy gradient algorithms: A sensitivity-based approach. In Chaudhuri, K., Jegelka, S., Song, L., Szepesvari, C., Niu, G., and Sabato, S. (eds.), Proceedings of the 39th International Conference on Machine Learning, volume 162 of Proceedings of Machine Learning Research, pp.  24131–24149. PMLR, 17–23 Jul 2022.
  • Wu et al. (2020) Wu, Y. F., Zhang, W., Xu, P., and Gu, Q. A finite-time analysis of two time-scale actor-critic methods. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  17617–17628. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper_files/paper/2020/file/cc9b3c69b56df284846bf2432f1cba90-Paper.pdf.
  • Xu et al. (2020a) Xu, T., Wang, Z., and Liang, Y. Improving sample complexity bounds for (natural) actor-critic algorithms. In Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., and Lin, H. (eds.), Advances in Neural Information Processing Systems, volume 33, pp.  4358–4369. Curran Associates, Inc., 2020a.
  • Xu et al. (2020b) Xu, T., Wang, Z., and Liang, Y. Improving sample complexity bounds for actor-critic algorithms. CoRR, abs/2004.12956, 2020b.
  • Yang et al. (2021) Yang, L., Zheng, Q., and Pan, G. Sample complexity of policy gradient finding second-order stationary points. Proceedings of the AAAI Conference on Artificial Intelligence, 35(12):10630–10638, May 2021.
  • Yang et al. (2018) Yang, Z., Zhang, K., Hong, M., and Başar, T. A finite sample analysis of the actor-critic algorithm. In 2018 IEEE Conference on Decision and Control (CDC), pp.  2759–2764, 2018. doi: 10.1109/CDC.2018.8619440.
  • Yang et al. (2019) Yang, Z., Chen, Y., Hong, M., and Wang, Z. Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. In Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc., 2019.
  • Yuan et al. (2022) Yuan, R., Gower, R. M., and Lazaric, A. A general sample complexity analysis of vanilla policy gradient. In Camps-Valls, G., Ruiz, F. J. R., and Valera, I. (eds.), Proceedings of The 25th International Conference on Artificial Intelligence and Statistics, volume 151 of Proceedings of Machine Learning Research, pp.  3332–3380. PMLR, 28–30 Mar 2022.
  • Yuan et al. (2023) Yuan, R., Du, S. S., Gower, R. M., Lazaric, A., and Xiao, L. Linear convergence of natural policy gradient methods with log-linear policies. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=-z9hdsyUwVQ.
  • Zhang et al. (2020) Zhang, K., Koppel, A., Zhu, H., and Başar, T. Global convergence of policy gradient methods to (almost) locally optimal policies. SIAM Journal on Control and Optimization, 58(6):3586–3612, 2020. doi: 10.1137/19M1288012.
  • Zhu & Gong (2023) Zhu, Y. and Gong, X. Distributed policy gradient with heterogeneous computations for federated reinforcement learning. In 2023 57th Annual Conference on Information Sciences and Systems (CISS), pp.  1–6, 2023. doi: 10.1109/CISS56502.2023.10089771.

Appendix A Additional Discussion of Related Work

A.1 Global Convergence

We first present some of the many global convergence results available in terms of \epsilon-optimality. In reinforcement learning, global convergence guarantees for policy gradient algorithms tend to arise from some underlying structure of the optimization problem, due to the specific policy parametrization or algorithm.

For first-order algorithms, the following papers achieve global optimality.

  • In (Agarwal et al., 2020), tabular parametrization with exact gradients, O(\epsilon^{-2}) iterations

  • In (Bhandari & Russo, 2019), objective functions that satisfy a gradient dominance property with exact gradients, O(\epsilon^{-2}) iterations

  • In (Mei et al., 2020), tabular softmax parametrization with exact gradients, O(\epsilon^{-1}) iterations

  • In (Wang et al., 2020), neural policy gradient with extremely wide shallow neural networks, O(\epsilon^{-2}) iterations

More powerful global convergence results are achieved for quasi-second-order methods like natural policy gradient and mirror descent policy gradient. These results can go beyond the tabular setting.

  • In (Agarwal et al., 2020), natural policy gradient with tabular softmax parametrization with exact gradients, O(\epsilon^{-1}) iterations

  • In (Xu et al., 2020b), natural actor-critic with function approximation where the feature vectors vary in each iteration, O(\epsilon^{-4}) outer-loop iterations

  • In (Cayci et al., 2022), natural actor-critic policy gradient where the actor and critic are parametrized with extremely wide neural networks, O(\epsilon^{-2}) iterations

  • In (Yuan et al., 2023), natural policy gradient for log-linear policies with compatible function approximation, linear convergence in terms of outer loop iterations

  • In (Alfano et al., 2023), policy mirror descent with general parametrization, linear convergence in terms of outer loop iterations

In comparison, we achieve an \epsilon-second-order stationary point in \tilde{O}(\epsilon^{-6.5}) iterations. Our work focuses on first-order algorithms and general policy parametrization, which are simpler to implement and more widely used in practice. We also do not use oracles or exact gradients in our analysis. Finally, our analysis has no dependence on the distribution mismatch coefficient, which is widely used in global convergence results and roughly quantifies how well the initial state distribution matches the optimal state distribution.

A.2 Second-Order Convergence

Our work inherits the sample complexity of \tilde{O}(\epsilon^{-6.5}) from (Vlaski & Sayed, 2022), which is weaker than the best sample complexity of \tilde{O}(\epsilon^{-4}) obtained by (Jin et al., 2019); both works analyze the second-order convergence of vanilla stochastic gradient descent. Faster second-order convergence is available for algorithms with exact gradients, such as perturbed gradient descent, which converges in \tilde{O}(\epsilon^{-2}) iterations (Jin et al., 2017). However, exact gradient computations are intractable for realistic policy gradient implementations. Second-order or quasi-second-order algorithms that utilize Hessian information can also converge faster. SPIDER-SFO from (Fang et al., 2018) obtains a sample complexity of \tilde{O}(\epsilon^{-3}) via a negative curvature search method. Similar techniques are extended to the policy gradient setting in (Khorasani et al., 2023) with complexity \tilde{O}(\epsilon^{-3}) and in (Wang et al., 2022) with complexity \tilde{O}(\epsilon^{-3.5}). All of the aforementioned works only deal with unbiased gradient estimators. To the best of our knowledge, only (Vlaski et al., 2020) show second-order convergence with biased gradient estimators, in the setting of federated learning, requiring \tilde{O}(\epsilon^{-6.5}) global iterations. This matches our sample complexity, as expected.

Appendix B Proof of Theorem 3.6

B.1 Key Lemmas and Proof Sketch

Our approach for proving that Algorithm 1 arrives at an \epsilon-second-order stationary point relies on bounding the bias and noise of the gradient estimator and applying the techniques developed in (Vlaski & Sayed, 2022). Broadly speaking, (Vlaski & Sayed, 2022) show second-order convergence for unbiased stochastic gradient descent by first showing that iterates on the second-order Taylor expansion of the objective function escape saddle points, and then showing that the original iterates and those on the Taylor approximation remain sufficiently close. Therefore, if we also show that the policy gradient iterates are close to the iterates on the Taylor expansion, we can conveniently apply the convergence results of (Vlaski & Sayed, 2022) to our setting.

We begin by establishing fourth-moment bounds on the noise term \xi_{t+1} via the following lemma.

Lemma B.1.

Suppose for some random variable X we have \mathbb{E}[X]=\mu and \lVert X\rVert\leq\sigma. Then

\mathbb{E}[\lVert X-\mu\rVert^{2}]\leq\sigma^{2},\\
\mathbb{E}[\lVert X-\mu\rVert^{4}]\leq 4\sigma^{4}.
Proof.
\mathbb{E}[\lVert X-\mu\rVert^{2}]=\mathbb{E}[\lVert X\rVert^{2}]-\lVert\mu\rVert^{2}\leq\mathbb{E}[\lVert X\rVert^{2}]\leq\sigma^{2}
Since \lVert\mu\rVert=\lVert\mathbb{E}[X]\rVert\leq\sigma, we have \lVert X-\mu\rVert^{2}\leq(\lVert X\rVert+\lVert\mu\rVert)^{2}\leq 4\sigma^{2}, and therefore
\displaystyle\lVert X-\mu\rVert^{4}\leq\lVert X-\mu\rVert^{2}\cdot 4\sigma^{2}
\displaystyle\mathbb{E}[\lVert X-\mu\rVert^{4}]\leq 4\sigma^{4}
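As a quick sanity check of Lemma B.1, the following minimal sketch draws a bounded random vector and verifies the two moment bounds empirically; the specific distribution, dimension, and constants are illustrative assumptions rather than anything used in the analysis.

```python
import numpy as np

# Sanity check of Lemma B.1: for a bounded random vector ||X|| <= sigma,
# the centered second moment is at most sigma^2 and the centered fourth
# moment is at most 4*sigma^4. The distribution below is illustrative only.
rng = np.random.default_rng(0)
sigma = 2.0
n = 100_000
directions = rng.normal(size=(n, 3))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)
radii = sigma * rng.uniform(size=(n, 1)) ** (1 / 3)   # uniform in the ball of radius sigma
X = directions * radii

mu = X.mean(axis=0)
centered = np.linalg.norm(X - mu, axis=1)
print("E||X - mu||^2 =", np.mean(centered**2), "<= sigma^2 =", sigma**2)
print("E||X - mu||^4 =", np.mean(centered**4), "<= 4*sigma^4 =", 4 * sigma**4)
```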

So by Lemma B.1, we have

\mathbb{E}[\lVert\xi_{t+1}\rVert^{2}|\mathcal{F}_{t}]\leq\sigma^{2}, (8)
\mathbb{E}[\lVert\xi_{t+1}\rVert^{4}|\mathcal{F}_{t}]\leq 4\sigma^{4}. (9)

In addition, by Jensen’s inequality, the fourth-moment bound

\mathbb{E}[\lVert d_{t+1}\rVert^{4}|\mathcal{F}_{t}]\leq D^{4}\mu^{4}

also implies the following second-moment bound

\mathbb{E}[\lVert d_{t+1}\rVert^{2}|\mathcal{F}_{t}]\leq D^{2}\mu^{2}. (10)

To proceed with the proof, we first show that in the large gradient regime, we observe a large ascent in function value, whereas around local maxima, the possible descent is bounded. Then we construct a pair of coupled sequences \{\theta_{i+j}\} and \{\theta^{\prime}_{i+j}\}, where \{\theta_{i+j}\} represents the gradient ascent iterates on the original objective function and \{\theta^{\prime}_{i+j}\} represents gradient ascent iterates on the second-order Taylor approximation of the function centered at \theta_{i}, driven by the same noise terms. Through moment bounds, we show that the difference between the coupled sequences is sufficiently small. These results allow us to leverage Theorems 2 and 3 in (Vlaski & Sayed, 2022).

The following lemma establishes that for small enough step sizes, we have sufficient ascent starting in the large gradient regime \theta_{i}\in\mathcal{G}, and the descent is bounded starting near a local maximum \theta_{i}\in\mathcal{M}.

Lemma B.2.

For \mu<\frac{1}{L}, we have after one iteration of Algorithm 1,

\mathbb{E}[J(\theta_{i+1})|\theta_{i}\in\mathcal{G}]\geq\mathbb{E}[J(\theta_{i})|\theta_{i}\in\mathcal{G}]+\frac{\mu^{2}(L\sigma^{2}+D^{2}\mu)}{2\delta}
\mathbb{E}[J(\theta_{i+1})|\theta_{i}\in\mathcal{M}]\geq\mathbb{E}[J(\theta_{i})|\theta_{i}\in\mathcal{M}]-\frac{\mu^{2}(L\sigma^{2}+D^{2}\mu)}{2}.

Proof of Lemma B.2: see Appendix B.3.

Beginning from \theta_{i}\in\mathcal{H}, we define \{\theta^{\prime}_{i+j}\} as the gradient ascent iterates on the second-order Taylor approximation of J(\theta) plus the noise term \xi_{i+j+1} from the original sequence. Denote the Taylor expansion around \theta_{i} as \hat{J} as follows

\hat{J}(\theta)=J(\theta_{i})+\nabla J(\theta_{i})^{\intercal}(\theta-\theta_{i})+\frac{1}{2}(\theta-\theta_{i})^{\intercal}\nabla^{2}J(\theta_{i})(\theta-\theta_{i})
\nabla\hat{J}(\theta)=\nabla J(\theta_{i})+\nabla^{2}J(\theta_{i})(\theta-\theta_{i}).

So we have \{\theta^{\prime}_{i+j}\} and our original sequence iterates \{\theta_{i}\} defined as follows

\theta^{\prime}_{i+j+1}=\theta^{\prime}_{i+j}+\mu\nabla J(\theta_{i})+\mu\nabla^{2}J(\theta_{i})(\theta^{\prime}_{i+j}-\theta_{i})+\mu\xi_{i+j+1}
\theta_{i+j+1}=\theta_{i+j}+\mu\nabla J(\theta_{i+j})+\mu\xi_{i+j+1}+\mu d_{i+j+1}.
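To make the coupling concrete, the following minimal sketch runs the two recursions above side by side with a shared noise realization on a toy two-dimensional objective; the objective, noise scale, and bias magnitude are illustrative assumptions and are not the RL objective J analyzed in the paper.

```python
import numpy as np

# Illustrative coupling of the iterates {theta} on the true gradient and the
# iterates {theta_prime} on the second-order Taylor model around theta_i,
# sharing the same noise realization. The toy objective below is an assumption
# made purely for illustration.
rng = np.random.default_rng(1)

def grad_J(th):      # gradient of J(th) = th[0]^2/2 - th[1]^2/2 - 0.1*th[0]^4
    return np.array([th[0] - 0.4 * th[0] ** 3, -th[1]])

def hess_J(th):
    return np.array([[1.0 - 1.2 * th[0] ** 2, 0.0], [0.0, -1.0]])

mu, C = 1e-3, 1.0                       # step size and horizon constant
theta_i = np.array([0.05, 0.05])        # start near the saddle at the origin
g_i, H_i = grad_J(theta_i), hess_J(theta_i)

theta = theta_prime = theta_i.copy()
for _ in range(int(C / mu)):
    xi = 0.1 * rng.normal(size=2)       # shared noise realization
    d = 1e-4 * np.ones(2)               # small deterministic bias term
    theta = theta + mu * grad_J(theta) + mu * xi + mu * d
    theta_prime = theta_prime + mu * g_i + mu * H_i @ (theta_prime - theta_i) + mu * xi

print("||theta - theta_prime|| after C/mu steps:", np.linalg.norm(theta - theta_prime))
```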

Then we can conclude in the following lemma that in the vicinity of a saddle point, the distance between \theta_{i} and \theta_{i+j+1} is bounded, and the distance between \theta^{\prime}_{i+j+1} and \theta_{i+j+1} is bounded.

Lemma B.3.

For \{\theta_{i}\} and \{\theta^{\prime}_{i}\} defined above, and j\leq\frac{C}{\mu}, where C is a constant independent of \mu, we have

\mathbb{E}[\lVert\theta_{i}-\theta_{i+j+1}\rVert^{2}|\theta_{i}\in\mathcal{H}]\leq O(\mu) (11)
\mathbb{E}[\lVert\theta_{i}-\theta_{i+j+1}\rVert^{4}|\theta_{i}\in\mathcal{H}]\leq O(\mu^{2}) (12)
\mathbb{E}[\lVert\theta^{\prime}_{i+j+1}-\theta_{i+j+1}\rVert^{2}|\theta_{i}\in\mathcal{H}]\leq O(\mu^{2}). (13)
Corollary B.4.

From the results of Lemma B.3, we can conclude

\mathbb{E}[\lVert\theta_{i}-\theta_{i+j+1}\rVert^{3}|\theta_{i}\in\mathcal{H}]\leq O(\mu^{3/2}) (14)
\mathbb{E}[\lVert\theta_{i}-\theta^{\prime}_{i+j+1}\rVert^{2}|\theta_{i}\in\mathcal{H}]\leq O(\mu) (15)
\mathbb{E}[\lVert\theta_{i}-\theta^{\prime}_{i+j+1}\rVert^{3}|\theta_{i}\in\mathcal{H}]\leq O(\mu^{3/2}). (16)

The first inequality follows from Jensen's inequality, and the second and third follow from the bounds on \lVert\theta_{i}-\theta_{i+j+1}\rVert and \lVert\theta^{\prime}_{i+j+1}-\theta_{i+j+1}\rVert.

Proof of Lemma B.3. See Appendix B.4.

B.2 Proof of Theorem 3.6

Proof.

For the sequences \{\theta_{i}\} and \{\theta^{\prime}_{i}\} defined above, suppose the moment bounds in Lemma B.3 hold. Then from Corollary 1 in (Vlaski & Sayed, 2022), beginning at \theta_{i}\in\mathcal{H}, for the finite horizon j\leq\frac{C}{\mu} we have

\mathbb{E}[J(\theta_{i+j})|\theta_{i}\in\mathcal{H}]\geq\mathbb{E}[J(\theta^{\prime}_{i+j})|\theta_{i}\in\mathcal{H}]-O(\mu^{3/2}),

which states that the function values on \{\theta_{i}\} stay close to the function values on \{\theta^{\prime}_{i}\}. This allows us to conclude that sufficient ascent occurs on the original function as well as on the Taylor approximation by way of Theorem 2 from (Vlaski & Sayed, 2022). Beginning at a strict saddle point \theta_{i}\in\mathcal{H}, running gradient ascent on the short-term model for \mathcal{T} iterations after i with

\mathcal{T}=\frac{\log(2M\frac{\sigma^{2}}{\sigma^{2}_{l}}+1)}{\log(1+2\mu\omega)}\leq O(\frac{1}{\mu\omega})

guarantees

\mathbb{E}[J(\theta^{\prime}_{i+\mathcal{T}})|\theta_{i}\in\mathcal{H}]\geq\mathbb{E}[J(\theta_{i})|\theta_{i}\in\mathcal{H}]+\frac{\mu}{2}M\sigma^{2}-o(\mu).

Combined with the bounds on the iterates from Lemma B.3, this implies

\mathbb{E}[J(\theta_{i+\mathcal{T}})|\theta_{i}\in\mathcal{H}]\geq\mathbb{E}[J(\theta_{i})|\theta_{i}\in\mathcal{H}]+\frac{\mu}{2}M\sigma^{2}-o(\mu).

This result, in combination with Lemma B.2 and the observation that |J(\theta)|\leq\frac{\mathcal{R}_{\max}}{1-\gamma} for all \theta, allows us to apply Theorem 3 from (Vlaski & Sayed, 2022), yielding our final result. ∎

B.3 Proof of Lemma B.2

Proof.

Our iterates are

θi+1=θi+μJ(θi)+μξi+1+μdi+1.\theta_{i+1}=\theta_{i}+\mu\nabla J(\theta_{i})+\mu\xi_{i+1}+\mu d_{i+1}.

Because JJ is Lipschitz smooth by Lemma 3.2, we have

J(θi+1)\displaystyle J(\theta_{i+1})\geq J(θi)+J(θi)T(θi+1θi)L2θi+1θi2\displaystyle J(\theta_{i})+\nabla J(\theta_{i})^{T}(\theta_{i+1}-\theta_{i})-\frac{L}{2}\lVert\theta_{i+1}-\theta_{i}\lVert^{2}
\displaystyle\geq J(θi)+μJ(θi)T(J(θi)+ξi+1+di+1)Lμ22J(θi)+ξi+1+di+12\displaystyle J(\theta_{i})+\mu\nabla J(\theta_{i})^{T}(\nabla J(\theta_{i})+\xi_{i+1}+d_{i+1})-\frac{L\mu^{2}}{2}\lVert\nabla J(\theta_{i})+\xi_{i+1}+d_{i+1}\lVert^{2}
\displaystyle\geq J(θi)+μJ(θi)2+μJ(θi)Tξi+1+μJ(θi)Tdi+1\displaystyle J(\theta_{i})+\mu\lVert\nabla J(\theta_{i})\lVert^{2}+\mu\nabla J(\theta_{i})^{T}\xi_{i+1}+\mu\nabla J(\theta_{i})^{T}d_{i+1}
Lμ22(J(θi)+di+12+ξi+12+2(J(θi)+di+1)Tξi+1).\displaystyle-\frac{L\mu^{2}}{2}(\lVert\nabla J(\theta_{i})+d_{i+1}\lVert^{2}+\lVert\xi_{i+1}\lVert^{2}+2(\nabla J(\theta_{i})+d_{i+1})^{T}\xi_{i+1}).

We take conditional expectation with respect to the filtration \mathcal{F}_{i} on both sides to remove the cross terms with the noise term \xi_{i+1}, and then we have by (8)

𝔼[J(θi+1)|i]\displaystyle\mathbb{E}[J(\theta_{i+1})|\mathcal{F}_{i}] J(θi)+μJ(θi)2+μ𝔼[J(θi)Tdi+1|i]Lμ22𝔼[J(θi)+di+12|i]Lμ22𝔼[ξi+12|i]\displaystyle\geq J(\theta_{i})+\mu\lVert\nabla J(\theta_{i})\lVert^{2}+\mu\mathbb{E}[\nabla J(\theta_{i})^{T}d_{i+1}|\mathcal{F}_{i}]-\frac{L\mu^{2}}{2}\mathbb{E}[\lVert\nabla J(\theta_{i})+d_{i+1}\rVert^{2}|\mathcal{F}_{i}]-\frac{L\mu^{2}}{2}\mathbb{E}[\lVert\xi_{i+1}\lVert^{2}|\mathcal{F}_{i}]
J(θi)+μJ(θi)2+μ𝔼[J(θi)Tdi+1|i]Lμ22𝔼[J(θi)+di+12|i]Lμ2σ22.\displaystyle\geq J(\theta_{i})+\mu\lVert\nabla J(\theta_{i})\lVert^{2}+\mu\mathbb{E}[\nabla J(\theta_{i})^{T}d_{i+1}|\mathcal{F}_{i}]-\frac{L\mu^{2}}{2}\mathbb{E}[\lVert\nabla J(\theta_{i})+d_{i+1}\rVert^{2}|\mathcal{F}_{i}]-\frac{L\mu^{2}\sigma^{2}}{2}.

We assume that \mu<\frac{1}{L} to obtain

𝔼[J(θi+1)|i]J(θi)+μJ(θi)2+μ𝔼[J(θi)Tdi+1|i]μ2𝔼[J(θi)+di+12|i]Lμ2σ22.\mathbb{E}[J(\theta_{i+1})|\mathcal{F}_{i}]\geq J(\theta_{i})+\mu\lVert\nabla J(\theta_{i})\lVert^{2}+\mu\mathbb{E}[\nabla J(\theta_{i})^{T}d_{i+1}|\mathcal{F}_{i}]-\frac{\mu}{2}\mathbb{E}[\lVert\nabla J(\theta_{i})+d_{i+1}\rVert^{2}|\mathcal{F}_{i}]-\frac{L\mu^{2}\sigma^{2}}{2}.

Then we use the fact that \lVert a+b\rVert^{2}=\lVert a\rVert^{2}+2a^{T}b+\lVert b\rVert^{2} and (10) to obtain

𝔼[J(θi+1)|i]\displaystyle\mathbb{E}[J(\theta_{i+1})|\mathcal{F}_{i}]\geq J(θi)+μJ(θi)2+μ𝔼[J(θi)Tdi+1|i]μ2J(θi)2μ2𝔼[di+12|i]\displaystyle J(\theta_{i})+\mu\lVert\nabla J(\theta_{i})\lVert^{2}+\mu\mathbb{E}[\nabla J(\theta_{i})^{T}d_{i+1}|\mathcal{F}_{i}]-\frac{\mu}{2}\lVert\nabla J(\theta_{i})\lVert^{2}-\frac{\mu}{2}\mathbb{E}[\lVert d_{i+1}\lVert^{2}|\mathcal{F}_{i}]
μ𝔼[J(θi)Tdi+1|i]Lμ2σ22\displaystyle-\mu\mathbb{E}[\nabla J(\theta_{i})^{T}d_{i+1}|\mathcal{F}_{i}]-\frac{L\mu^{2}\sigma^{2}}{2}
=\displaystyle= J(θi)+μ2J(θi)2μ2𝔼[di+12|i]Lμ2σ22\displaystyle J(\theta_{i})+\frac{\mu}{2}\lVert\nabla J(\theta_{i})\lVert^{2}-\frac{\mu}{2}\mathbb{E}[\lVert d_{i+1}\lVert^{2}|\mathcal{F}_{i}]-\frac{L\mu^{2}\sigma^{2}}{2}
\displaystyle\geq J(θi)+μ2J(θi)2D2μ32Lμ2σ22.\displaystyle J(\theta_{i})+\frac{\mu}{2}\lVert\nabla J(\theta_{i})\lVert^{2}-\frac{D^{2}\mu^{3}}{2}-\frac{L\mu^{2}\sigma^{2}}{2}.

Now we apply the law of total expectation and condition on where \theta_{i} is located in the parameter space. We first condition on \theta_{i}\in\mathcal{G}, where we have \lVert\nabla J(\theta_{i})\rVert^{2}>\mu(L\sigma^{2}+D^{2}\mu)(1+\frac{1}{\delta}), to arrive at

𝔼[J(θi+1)|θi𝒢]𝔼[J(θi)|θi𝒢]+μ2μ(Lσ2+D2μ)(1+1δ)Lμ2σ22D2μ32\mathbb{E}[J(\theta_{i+1})|\theta_{i}\in\mathcal{G}]\geq\mathbb{E}[J(\theta_{i})|\theta_{i}\in\mathcal{G}]+\frac{\mu}{2}\mu(L\sigma^{2}+D^{2}\mu)(1+\frac{1}{\delta})-\frac{L\mu^{2}\sigma^{2}}{2}-\frac{D^{2}\mu^{3}}{2}
𝔼[J(θi+1)|θi𝒢]𝔼[J(θi)|θi𝒢]+μ2(Lσ2+D2μ)2δ.\mathbb{E}[J(\theta_{i+1})|\theta_{i}\in\mathcal{G}]\geq\mathbb{E}[J(\theta_{i})|\theta_{i}\in\mathcal{G}]+\frac{\mu^{2}(L\sigma^{2}+D^{2}\mu)}{2\delta}.

If we instead condition on \theta_{i}\in\mathcal{M} and drop the nonnegative term \frac{\mu}{2}\lVert\nabla J(\theta_{i})\rVert^{2}, we have that

𝔼[J(θi+1)|θi]𝔼[J(θi)|θi]μ2(Lσ2+D2μ)2.\mathbb{E}[J(\theta_{i+1})|\theta_{i}\in\mathcal{M}]\geq\mathbb{E}[J(\theta_{i})|\theta_{i}\in\mathcal{M}]-\frac{\mu^{2}(L\sigma^{2}+D^{2}\mu)}{2}.

B.4 Proof of Lemma B.3

Before we proceed with the proof of Lemma B.3, we require a preliminary lemma from (Vlaski & Sayed, 2022) that will help us show that the geometric products appearing in our recursions do not blow up for small \mu.

Lemma B.5.

For C,\mu,L>0 and k\in\mathbb{Z}_{+} with \mu<\frac{1}{L}

\lim_{\mu\to 0}\left(\frac{(1+\mu L)^{k}+O(\mu^{2})}{(1-\mu L)^{k-1}}\right)^{C/\mu}=O(1).

Proof of Lemma B.5. See (Vlaski & Sayed, 2022).
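For intuition, the following short sketch evaluates the quantity in Lemma B.5 for the fixed exponents k=2 and k=4 that appear in the recursions of Appendix B.4, with illustrative constants C and L; the values stabilize as \mu shrinks, consistent with the O(1) limit.

```python
import numpy as np

# Numerical illustration of Lemma B.5 with fixed k (k = 2 and k = 4 are the
# exponents that appear in the recursions of Appendix B.4); constants C and L
# are illustrative assumptions.
L, C = 2.0, 1.0
for k in (2, 4):
    for mu in (1e-2, 1e-3, 1e-4):
        base = (1 + mu * L) ** k / (1 - mu * L) ** (k - 1)
        val = base ** (C / mu)
        print(f"k={k}, mu={mu:.0e}: ((1+muL)^k/(1-muL)^(k-1))^(C/mu) = {val:.4f}")
# The printed values approach exp((2k-1)*C*L) as mu -> 0, i.e. they stay O(1).
```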

Proof.

First we want to show (11), restated below

𝔼[θiθi+j+12|θi]O(μ).\mathbb{E}[\lVert\theta_{i}-\theta_{i+j+1}\lVert^{2}|\theta_{i}\in\mathcal{H}]\leq O(\mu).

We have by (8)

θiθi+j+12\displaystyle\lVert\theta_{i}-\theta_{i+j+1}\rVert^{2} =θiθi+jμJ(θi+j)μξi+j+1μdi+j+12\displaystyle=\lVert\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})-\mu\xi_{i+j+1}-\mu d_{i+j+1}\lVert^{2}
𝔼[θiθi+j+12|i+j]\displaystyle\mathbb{E}[\lVert\theta_{i}-\theta_{i+j+1}\lVert^{2}|\mathcal{F}_{i+j}] =𝔼[θiθi+jμJ(θi+j)μdi+j+12|i+j]+μ2𝔼[ξi+j+12|i+j]\displaystyle=\mathbb{E}[\lVert\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})-\mu d_{i+j+1}\lVert^{2}|\mathcal{F}_{i+j}]+\mu^{2}\mathbb{E}[\lVert\xi_{i+j+1}\lVert^{2}|\mathcal{F}_{i+j}]
𝔼[θiθi+jμJ(θi+j)μdi+j+12|i+j]+μ2σ2\displaystyle\leq\mathbb{E}[\lVert\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})-\mu d_{i+j+1}\rVert^{2}|\mathcal{F}_{i+j}]+\mu^{2}\sigma^{2}
=𝔼[θiθi+jμJ(θi+j)+μJ(θi)μJ(θi)μdi+j+12|i+j]+μ2σ2.\displaystyle=\mathbb{E}[\lVert\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})+\mu\nabla J(\theta_{i})-\mu\nabla J(\theta_{i})-\mu d_{i+j+1}\rVert^{2}|\mathcal{F}_{i+j}]+\mu^{2}\sigma^{2}.

By Jensen's inequality, we have for 0<\alpha<1,

\lVert a+b\rVert^{2}\leq\frac{1}{\alpha}\lVert a\rVert^{2}+\frac{1}{1-\alpha}\lVert b\rVert^{2}.

So we have

𝔼[θiθi+j+12|i+j]11μLθiθi+jμJ(θi+j)+μJ(θi)2+μ2μL𝔼[J(θi)+di+j+12|i+j]+μ2σ2.\mathbb{E}[\lVert\theta_{i}-\theta_{i+j+1}\lVert^{2}|\mathcal{F}_{i+j}]\leq\frac{1}{1-\mu L}\lVert\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})+\mu\nabla J(\theta_{i})\lVert^{2}+\frac{\mu^{2}}{\mu L}\mathbb{E}[\lVert\nabla J(\theta_{i})+d_{i+j+1}\lVert^{2}|\mathcal{F}_{i+j}]+\mu^{2}\sigma^{2}. (17)

We consider the first term on the right hand side of (17) and expand it to obtain

θiθi+jμJ(θi+j)+μJ(θi)2\lVert\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})+\mu\nabla J(\theta_{i})\lVert^{2}
θiθi+j2+2μθiθi+jJ(θi+j)J(θi)+μ2J(θi+j)J(θi)2.\leq\lVert\theta_{i}-\theta_{i+j}\lVert^{2}+2\mu\lVert\theta_{i}-\theta_{i+j}\lVert\cdot\lVert\nabla J(\theta_{i+j})-\nabla J(\theta_{i})\lVert+\mu^{2}\lVert\nabla J(\theta_{i+j})-\nabla J(\theta_{i})\lVert^{2}.

By Lipschitz smoothness, we have

θiθi+jμJ(θi+j)+μJ(θi)2\displaystyle\lVert\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})+\mu\nabla J(\theta_{i})\lVert^{2} θiθi+j2+2μLθiθi+jθi+jθi+μ2L2θi+jθi2\displaystyle\leq\lVert\theta_{i}-\theta_{i+j}\lVert^{2}+2\mu L\lVert\theta_{i}-\theta_{i+j}\lVert\cdot\lVert\theta_{i+j}-\theta_{i}\lVert+\mu^{2}L^{2}\lVert\theta_{i+j}-\theta_{i}\lVert^{2}
=(1+2μL+μ2L2)θiθi+j2\displaystyle=(1+2\mu L+\mu^{2}L^{2})\lVert\theta_{i}-\theta_{i+j}\lVert^{2}
=(1+μL)2θiθi+j2.\displaystyle=(1+\mu L)^{2}\lVert\theta_{i}-\theta_{i+j}\lVert^{2}.

We plug this into our original expression (17) and use (10) to obtain

𝔼[θiθi+j+12|i+j]\displaystyle\mathbb{E}[\lVert\theta_{i}-\theta_{i+j+1}\lVert^{2}|\mathcal{F}_{i+j}] (1+μL)21μLθiθi+j2+μL𝔼[J(θi)+di+j+12|i+j]+μ2σ2\displaystyle\leq\frac{(1+\mu L)^{2}}{1-\mu L}\lVert\theta_{i}-\theta_{i+j}\lVert^{2}+\frac{\mu}{L}\mathbb{E}[\lVert\nabla J(\theta_{i})+d_{i+j+1}\lVert^{2}|\mathcal{F}_{i+j}]+\mu^{2}\sigma^{2}
(1+μL)21μLθiθi+j2+2μL𝔼[di+j+12|i+j]+2μLJ(θi)2+μ2σ2\displaystyle\leq\frac{(1+\mu L)^{2}}{1-\mu L}\lVert\theta_{i}-\theta_{i+j}\lVert^{2}+\frac{2\mu}{L}\mathbb{E}[\lVert d_{i+j+1}\lVert^{2}|\mathcal{F}_{i+j}]+\frac{2\mu}{L}\lVert\nabla J(\theta_{i})\lVert^{2}+\mu^{2}\sigma^{2}
(1+μL)21μLθiθi+j2+2D2μ3L+2μLJ(θi)2+μ2σ2.\displaystyle\leq\frac{(1+\mu L)^{2}}{1-\mu L}\lVert\theta_{i}-\theta_{i+j}\lVert^{2}+\frac{2D^{2}\mu^{3}}{L}+\frac{2\mu}{L}\lVert\nabla J(\theta_{i})\lVert^{2}+\mu^{2}\sigma^{2}.

Now we condition on \theta_{i}\in\mathcal{H} to obtain

𝔼[θiθi+j+12|θi]\displaystyle\mathbb{E}[\lVert\theta_{i}-\theta_{i+j+1}\lVert^{2}|\theta_{i}\in\mathcal{H}] 𝔼[(1+μL)21μLθiθi+j2|θi]+2D2μ3L+2μ2Lμ(Lσ2+D2μ)(1+1δ)+μ2σ2\displaystyle\leq\mathbb{E}[\frac{(1+\mu L)^{2}}{1-\mu L}\lVert\theta_{i}-\theta_{i+j}\lVert^{2}|\theta_{i}\in\mathcal{H}]+\frac{2D^{2}\mu^{3}}{L}+\frac{2\mu^{2}}{L}\cdot\mu(L\sigma^{2}+D^{2}\mu)(1+\frac{1}{\delta})+\mu^{2}\sigma^{2}
(1+μL)21μL𝔼[θiθi+j2|θi]+O(μ2).\displaystyle\leq\frac{(1+\mu L)^{2}}{1-\mu L}\mathbb{E}[\lVert\theta_{i}-\theta_{i+j}\lVert^{2}|\theta_{i}\in\mathcal{H}]+O(\mu^{2}).

Then we can evaluate this recursive formula starting at j=0, since \mathbb{E}[\lVert\theta_{i}-\theta_{i}\rVert^{2}]=0, to arrive at

𝔼[θiθi+j+12|θi]\displaystyle\mathbb{E}[\lVert\theta_{i}-\theta_{i+j+1}\lVert^{2}|\theta_{i}\in\mathcal{H}] O(μ2)n=0j1((1+μL)21μL)n\displaystyle\leq O(\mu^{2})\sum_{n=0}^{j-1}\Big{(}\frac{(1+\mu L)^{2}}{1-\mu L}\Big{)}^{n}
O(μ2)1((1+μL)21μL)j1(1+μL)21μL\displaystyle\leq O(\mu^{2})\frac{1-(\frac{(1+\mu L)^{2}}{1-\mu L})^{j}}{1-\frac{(1+\mu L)^{2}}{1-\mu L}}
=O(μ2)(1μL)(((1+μL)21μL)j1)1+2μL+μ2L21+μL\displaystyle=O(\mu^{2})\frac{(1-\mu L)((\frac{(1+\mu L)^{2}}{1-\mu L})^{j}-1)}{1+2\mu L+\mu^{2}L^{2}-1+\mu L}
=O(μ)(1μL)((1+2μL+μ2L21μL)j1)3L+μL2\displaystyle=O(\mu)\frac{(1-\mu L)((\frac{1+2\mu L+\mu^{2}L^{2}}{1-\mu L})^{j}-1)}{3L+\mu L^{2}}
O(μ)((1+μL)21μL)j3LO(μ)((1+μL)21μL)Cμ3L.\displaystyle\leq O(\mu)\frac{(\frac{(1+\mu L)^{2}}{1-\mu L})^{j}}{3L}\leq O(\mu)\frac{(\frac{(1+\mu L)^{2}}{1-\mu L})^{\frac{C}{\mu}}}{3L}.

By Lemma B.5, this gives us

𝔼[θiθi+j+12|θi]O(μ).\mathbb{E}[\lVert\theta_{i}-\theta_{i+j+1}\lVert^{2}|\theta_{i}\in\mathcal{H}]\leq O(\mu).

Now we want to show the fourth moment bound (12), restated below

𝔼[θiθi+j+14|θi]O(μ2).\mathbb{E}[\lVert\theta_{i}-\theta_{i+j+1}\lVert^{4}|\theta_{i}\in\mathcal{H}]\leq O(\mu^{2}).

We use the inequality \lVert a+b\rVert^{4}\leq\lVert a\rVert^{4}+3\lVert b\rVert^{4}+8\lVert a\rVert^{2}\lVert b\rVert^{2}+4\lVert a\rVert^{2}(a^{T}b) to expand the expression as follows

θiθi+j+14=\displaystyle\lVert\theta_{i}-\theta_{i+j+1}\lVert^{4}= θiθi+jμJ(θi+j)μξi+j+1μdi+j+14\displaystyle\lVert\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})-\mu\xi_{i+j+1}-\mu d_{i+j+1}\lVert^{4} (18)
\displaystyle\leq θiθi+jμJ(θi+j)μdi+j+14+3μ4ξi+j+14\displaystyle\lVert\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})-\mu d_{i+j+1}\lVert^{4}+3\mu^{4}\lVert\xi_{i+j+1}\lVert^{4}
+8μ2θiθi+jμJ(θi+j)μdi+j+12ξi+j+12\displaystyle+8\mu^{2}\lVert\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})-\mu d_{i+j+1}\lVert^{2}\cdot\lVert\xi_{i+j+1}\rVert^{2}
+4θiθi+jμJ(θi+j)μdi+j+12(θiθi+jμJ(θi+j)μdi+j+1)Tξi+j+1.\displaystyle+4\lVert\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})-\mu d_{i+j+1}\lVert^{2}(\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})-\mu d_{i+j+1})^{T}\xi_{i+j+1}.

We first consider the first term on the right hand side of (18) and decompose it via Jensen’s inequality:

θiθi+jμJ(θi+j)μdi+j+14\displaystyle\lVert\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})-\mu d_{i+j+1}\lVert^{4}\leq 1(1μL)3θiθi+jμJ(θi+j)+μJ(θi)4\displaystyle\frac{1}{(1-\mu L)^{3}}\lVert\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})+\mu\nabla J(\theta_{i})\lVert^{4} (19)
+μL3J(θi)+di+j+14\displaystyle+\frac{\mu}{L^{3}}\lVert\nabla J(\theta_{i})+d_{i+j+1}\lVert^{4}
\displaystyle\leq (1+μL)4(1μL)3θiθi+j4+8μL3J(θi)4+8μL3di+j+14.\displaystyle\frac{(1+\mu L)^{4}}{(1-\mu L)^{3}}\lVert\theta_{i}-\theta_{i+j}\lVert^{4}+\frac{8\mu}{L^{3}}\lVert\nabla J(\theta_{i})\lVert^{4}+\frac{8\mu}{L^{3}}\lVert d_{i+j+1}\lVert^{4}.

We then consider the third term on the right hand side of (18). From the analysis above, we have

θiθi+jμJ(θi+j)μdi+j+12(1+μL)21μLθiθi+j2+2μLJ(θi)2+2μLdi+j+12.\lVert\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})-\mu d_{i+j+1}\lVert^{2}\leq\frac{(1+\mu L)^{2}}{1-\mu L}\lVert\theta_{i}-\theta_{i+j}\lVert^{2}+\frac{2\mu}{L}\lVert\nabla J(\theta_{i})\lVert^{2}+\frac{2\mu}{L}\lVert d_{i+j+1}\lVert^{2}. (20)

Now we can plug (19) and (20) into (18) to obtain

θiθi+j+14\displaystyle\lVert\theta_{i}-\theta_{i+j+1}\rVert^{4}\leq (1+μL)4(1μL)3θiθi+j4+8μL3J(θi)4+8μL3di+j+14+3μ4ξi+j+14\displaystyle\frac{(1+\mu L)^{4}}{(1-\mu L)^{3}}\lVert\theta_{i}-\theta_{i+j}\lVert^{4}+\frac{8\mu}{L^{3}}\lVert\nabla J(\theta_{i})\lVert^{4}+\frac{8\mu}{L^{3}}\lVert d_{i+j+1}\lVert^{4}+3\mu^{4}\lVert\xi_{i+j+1}\rVert^{4}
+8μ2ξi+j+12((1+μL)21μLθiθi+j2+2μLJ(θi)2+2μLdi+j+12)\displaystyle+8\mu^{2}\lVert\xi_{i+j+1}\rVert^{2}\left(\frac{(1+\mu L)^{2}}{1-\mu L}\lVert\theta_{i}-\theta_{i+j}\lVert^{2}+\frac{2\mu}{L}\lVert\nabla J(\theta_{i})\lVert^{2}+\frac{2\mu}{L}\lVert d_{i+j+1}\lVert^{2}\right)
+4θiθi+jμJ(θi+j)μdi+j+12(θiθi+jμJ(θi+j)μdi+j+1)Tξi+j+1\displaystyle+4\lVert\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})-\mu d_{i+j+1}\lVert^{2}(\theta_{i}-\theta_{i+j}-\mu\nabla J(\theta_{i+j})-\mu d_{i+j+1})^{T}\xi_{i+j+1}

When we take the expectation on both sides, the cross term with ξi+j+1\xi_{i+j+1} disappears, and we have by (8), (9), (3) and (4)

𝔼[θiθi+j+14|i+j]\displaystyle\mathbb{E}[\lVert\theta_{i}-\theta_{i+j+1}\rVert^{4}|\mathcal{F}_{i+j}]\leq (1+μL)4(1μL)3θiθi+j4+8μL3J(θi)4+8μL3𝔼[di+j+14|i+j]+12μ4σ4\displaystyle\frac{(1+\mu L)^{4}}{(1-\mu L)^{3}}\lVert\theta_{i}-\theta_{i+j}\lVert^{4}+\frac{8\mu}{L^{3}}\lVert\nabla J(\theta_{i})\lVert^{4}+\frac{8\mu}{L^{3}}\mathbb{E}[\lVert d_{i+j+1}\lVert^{4}|\mathcal{F}_{i+j}]+12\mu^{4}\sigma^{4}
+8μ2σ2(1+μL)21μLθiθi+j2+16μ3σ2LJ(θi)2\displaystyle+\frac{8\mu^{2}\sigma^{2}(1+\mu L)^{2}}{1-\mu L}\lVert\theta_{i}-\theta_{i+j}\lVert^{2}+\frac{16\mu^{3}\sigma^{2}}{L}\lVert\nabla J(\theta_{i})\lVert^{2}
+16μ3L𝔼[ξi+j+12di+j+12|i+j]\displaystyle+\frac{16\mu^{3}}{L}\mathbb{E}[\lVert\xi_{i+j+1}\rVert^{2}\cdot\lVert d_{i+j+1}\lVert^{2}|\mathcal{F}_{i+j}]
𝔼[θiθi+j+14|i+j]\displaystyle\mathbb{E}[\lVert\theta_{i}-\theta_{i+j+1}\rVert^{4}|\mathcal{F}_{i+j}]\leq (1+μL)4(1μL)3θiθi+j4+8μL3J(θi)4+8D4μ5L3+12μ4σ4\displaystyle\frac{(1+\mu L)^{4}}{(1-\mu L)^{3}}\lVert\theta_{i}-\theta_{i+j}\lVert^{4}+\frac{8\mu}{L^{3}}\lVert\nabla J(\theta_{i})\lVert^{4}+\frac{8D^{4}\mu^{5}}{L^{3}}+12\mu^{4}\sigma^{4}
+8μ2σ2(1+μL)21μLθiθi+j2+16μ3σ2LJ(θi)2+16σ2D2μ5L\displaystyle+\frac{8\mu^{2}\sigma^{2}(1+\mu L)^{2}}{1-\mu L}\lVert\theta_{i}-\theta_{i+j}\lVert^{2}+\frac{16\mu^{3}\sigma^{2}}{L}\lVert\nabla J(\theta_{i})\lVert^{2}+\frac{16\sigma^{2}D^{2}\mu^{5}}{L}

Now we take expectation conditioned on θi\theta_{i}\in\mathcal{H}, allowing us to use the bound (11) derived before on θiθi+j2\lVert\theta_{i}-\theta_{i+j}\lVert^{2} to obtain

𝔼[θiθi+j+14|θi]\displaystyle\mathbb{E}[\lVert\theta_{i}-\theta_{i+j+1}\lVert^{4}|\theta_{i}\in\mathcal{H}]\leq (1+μL)4(1μL)3𝔼[θiθi+j4|θi]+8μL3𝔼[J(θi)4|θi]\displaystyle\frac{(1+\mu L)^{4}}{(1-\mu L)^{3}}\mathbb{E}[\lVert\theta_{i}-\theta_{i+j}\lVert^{4}|\theta_{i}\in\mathcal{H}]+\frac{8\mu}{L^{3}}\mathbb{E}[\lVert\nabla J(\theta_{i})\lVert^{4}|\theta_{i}\in\mathcal{H}]
+8μ2σ2(1+μL)21μL𝔼[θiθi+j2|θi]+16μ3σ2L𝔼[J(θi)2|θi]+O(μ4)\displaystyle+\frac{8\mu^{2}\sigma^{2}(1+\mu L)^{2}}{1-\mu L}\mathbb{E}[\lVert\theta_{i}-\theta_{i+j}\lVert^{2}|\theta_{i}\in\mathcal{H}]+\frac{16\mu^{3}\sigma^{2}}{L}\mathbb{E}[\lVert\nabla J(\theta_{i})\lVert^{2}|\theta_{i}\in\mathcal{H}]+O(\mu^{4})
\displaystyle\leq (1+μL)4(1μL)3𝔼[θiθi+j4|θi]+O(μ3)\displaystyle\frac{(1+\mu L)^{4}}{(1-\mu L)^{3}}\mathbb{E}[\lVert\theta_{i}-\theta_{i+j}\lVert^{4}|\theta_{i}\in\mathcal{H}]+O(\mu^{3})

Then we can evaluate this recursive expression as follows

𝔼[θiθi+j+14|θi]\displaystyle\mathbb{E}[\lVert\theta_{i}-\theta_{i+j+1}\lVert^{4}|\theta_{i}\in\mathcal{H}] O(μ3)n=0j1((1+μL)4(1μL)3)n\displaystyle\leq O(\mu^{3})\sum_{n=0}^{j-1}(\frac{(1+\mu L)^{4}}{(1-\mu L)^{3}})^{n}
=O(μ3)1((1+μL)4(1μL)3)j1(1+μL)4(1μL)3\displaystyle=O(\mu^{3})\frac{1-(\frac{(1+\mu L)^{4}}{(1-\mu L)^{3}})^{j}}{1-\frac{(1+\mu L)^{4}}{(1-\mu L)^{3}}}
=O(μ3)(((1+μL)4(1μL)3)j1)(1μL)3(1+μL)4(1μL)3O(μ2)((1+μL)4(1μL)3))j7L+3μL2+5μ2L3+μ3L4\displaystyle=O(\mu^{3})\frac{((\frac{(1+\mu L)^{4}}{(1-\mu L)^{3}})^{j}-1)(1-\mu L)^{3}}{(1+\mu L)^{4}-(1-\mu L)^{3}}\leq O(\mu^{2})\frac{(\frac{(1+\mu L)^{4}}{(1-\mu L)^{3})})^{j}}{7L+3\mu L^{2}+5\mu^{2}L^{3}+\mu^{3}L^{4}}

By Lemma B.5, this gives us

𝔼[θiθi+j+14|θi]O(μ2).\mathbb{E}[\lVert\theta_{i}-\theta_{i+j+1}\lVert^{4}|\theta_{i}\in\mathcal{H}]\leq O(\mu^{2}).

Finally, we want to bound (13), restated below

\mathbb{E}[\lVert\theta^{\prime}_{i+j+1}-\theta_{i+j+1}\rVert^{2}|\theta_{i}\in\mathcal{H}]\leq O(\mu^{2}).

First we expand the expression as follows using the definition of θ\theta and θ\theta^{\prime}

θi+j+1\displaystyle\lVert\theta_{i+j+1} θi+j+12=θi+jθi+j+μJ(θi+j)+μdi+j+1μJ(θi)μ2J(θi)(θi+jθi)2\displaystyle-\theta^{\prime}_{i+j+1}\lVert^{2}=\lVert\theta_{i+j}-\theta^{\prime}_{i+j}+\mu\nabla J(\theta_{i+j})+\mu d_{i+j+1}-\mu\nabla J(\theta_{i})-\mu\nabla^{2}J(\theta_{i})(\theta^{\prime}_{i+j}-\theta_{i})\lVert^{2}
=(I+μ2J(θi))(θi+jθi+j)+μ2J(θi)(θiθi+j)+μJ(θi+j)μJ(θi)+μdi+j+12\displaystyle=\lVert(I+\mu\nabla^{2}J(\theta_{i}))(\theta_{i+j}-\theta^{\prime}_{i+j})+\mu\nabla^{2}J(\theta_{i})(\theta_{i}-\theta_{i+j})+\mu\nabla J(\theta_{i+j})-\mu\nabla J(\theta_{i})+\mu d_{i+j+1}\lVert^{2}

Define H_{i+j}=\int_{0}^{1}\nabla^{2}J((1-t)\theta_{i+j}+t\theta_{i})dt. Then we can plug this into the expression and expand via Jensen's inequality to obtain

θi+j+1\displaystyle\lVert\theta_{i+j+1} θi+j+12=(I+μ2J(θi))(θi+jθi+j)+μ(2J(θi)Hi+j)(θiθi+j)+μdi+j+12\displaystyle-\theta^{\prime}_{i+j+1}\lVert^{2}=\lVert(I+\mu\nabla^{2}J(\theta_{i}))(\theta_{i+j}-\theta^{\prime}_{i+j})+\mu(\nabla^{2}J(\theta_{i})-H_{i+j})(\theta_{i}-\theta_{i+j})+\mu d_{i+j+1}\lVert^{2}
11μL(I+μ2J(θi))(θi+jθi+j)2+μL(2J(θi)Hi+j)(θiθi+j)+di+j+12\displaystyle\leq\frac{1}{1-\mu L}\lVert(I+\mu\nabla^{2}J(\theta_{i}))(\theta_{i+j}-\theta^{\prime}_{i+j})\lVert^{2}+\frac{\mu}{L}\lVert(\nabla^{2}J(\theta_{i})-H_{i+j})(\theta_{i}-\theta_{i+j})+d_{i+j+1}\lVert^{2}
11μL(I+μ2J(θi))(θi+jθi+j)2+2μL(2J(θi)Hi+j)(θiθi+j)2+2μLdi+j+12\displaystyle\leq\frac{1}{1-\mu L}\lVert(I+\mu\nabla^{2}J(\theta_{i}))(\theta_{i+j}-\theta^{\prime}_{i+j})\lVert^{2}+\frac{2\mu}{L}\lVert(\nabla^{2}J(\theta_{i})-H_{i+j})(\theta_{i}-\theta_{i+j})\lVert^{2}+\frac{2\mu}{L}\lVert d_{i+j+1}\lVert^{2}

As observed in (Vlaski & Sayed, 2022), we have

2J(θi)Hi+j\displaystyle\lVert\nabla^{2}J(\theta_{i})-H_{i+j}\lVert =2J(θi)012J((1t)θi+j+tθi)dt\displaystyle=\lVert\nabla^{2}J(\theta_{i})-\int_{0}^{1}\nabla^{2}J((1-t)\theta_{i+j}+t\theta_{i})dt\lVert (21)
=012J(θi)2J((1t)θi+j+tθi)dt\displaystyle=\lVert\int_{0}^{1}\nabla^{2}J(\theta_{i})-\nabla^{2}J((1-t)\theta_{i+j}+t\theta_{i})dt\lVert
012J(θi)2J((1t)θi+j+tθi)dt\displaystyle\leq\int_{0}^{1}\lVert\nabla^{2}J(\theta_{i})-\nabla^{2}J((1-t)\theta_{i+j}+t\theta_{i})\lVert dt
χ01(1t)θi(1t)θi+jdtχ2θiθi+j,\displaystyle\leq\chi\int_{0}^{1}\lVert(1-t)\theta_{i}-(1-t)\theta_{i+j}\lVert dt\leq\frac{\chi}{2}\lVert\theta_{i}-\theta_{i+j}\lVert,

which implies

\lVert(\nabla^{2}J(\theta_{i})-H_{i+j})(\theta_{i}-\theta_{i+j})\rVert^{2}\leq\frac{\chi^{2}}{4}\lVert\theta_{i}-\theta_{i+j}\rVert^{4}. (22)

We can plug (22) back into the bound on \lVert\theta_{i+j+1}-\theta^{\prime}_{i+j+1}\rVert^{2} above and take expectation of both sides, conditioned on \theta_{i}\in\mathcal{H}. When we apply the fourth moment bound from Lemma B.3, we obtain

𝔼[θi+j+1θi+j+12|θi](1+μL)2(1μL)𝔼[θi+jθi+j2|θi]+O(μ3).\mathbb{E}[\lVert\theta_{i+j+1}-\theta^{\prime}_{i+j+1}\lVert^{2}|\theta_{i}\in\mathcal{H}]\leq\frac{(1+\mu L)^{2}}{(1-\mu L)}\mathbb{E}[\lVert\theta_{i+j}-\theta^{\prime}_{i+j}\lVert^{2}|\theta_{i}\in\mathcal{H}]+O(\mu^{3}).

This is the same recursion as in the proof of (11), so again from Lemma B.5 we have

𝔼[θi+j+1θi+j+12|θi]O(μ2).\mathbb{E}[\lVert\theta_{i+j+1}-\theta^{\prime}_{i+j+1}\lVert^{2}|\theta_{i}\in\mathcal{H}]\leq O(\mu^{2}).

Appendix C Noise and Bias Bounds for Vanilla Policy Gradient

To apply Theorem 3.6, we first show in Lemma C.1 that the gradient estimator and the second and fourth moments of the noise are bounded. Then we show that the deterministic bias term d_{t+1} is bounded via Lemma C.3. This allows us to directly conclude the results of Theorem 3.7.

Lemma C.1.

The gradient noise process \{\xi_{t}\}_{t\geq 0} satisfies

\mathbb{E}[\xi_{t+1}|\mathcal{F}_{t}]=\mathbb{E}[\hat{G}^{VPG}(\theta_{t};\tau_{t})-\nabla J_{H}(\theta_{t})|\mathcal{F}_{t}]=0.

In addition, let \sigma=\frac{G\mathcal{R}_{\max}}{(1-\gamma)^{2}}. Then we have the following bounds

\lVert\hat{G}^{VPG}(\theta_{t};\tau_{t})\rVert\leq\sigma,
\mathbb{E}[\lVert\xi_{t+1}\rVert^{2}|\mathcal{F}_{t}]\leq\sigma^{2},
\mathbb{E}[\lVert\xi_{t+1}\rVert^{4}|\mathcal{F}_{t}]\leq 4\sigma^{4}.
Proof.
G^VPG(θ;τ)\displaystyle\lVert\hat{G}^{VPG}(\theta;\tau)\rVert =h=0H1θlogπθ(ah|sh)t=hH1γt(st,at)\displaystyle=\lVert\sum_{h=0}^{H-1}\nabla_{\theta}\log\pi_{\theta}(a_{h}|s_{h})\sum_{t=h}^{H-1}\gamma^{t}\mathcal{R}(s_{t},a_{t})\lVert
h=0H1θlogπθ(ah|sh)t=hH1γt(st,at)\displaystyle\leq\sum_{h=0}^{H-1}\lVert\nabla_{\theta}\log\pi_{\theta}(a_{h}|s_{h})\sum_{t=h}^{H-1}\gamma^{t}\mathcal{R}(s_{t},a_{t})\lVert
h=0H1θlogπθ(ah|sh)γht=hH1γthmax\displaystyle\leq\sum_{h=0}^{H-1}\lVert\nabla_{\theta}\log\pi_{\theta}(a_{h}|s_{h})\rVert\gamma^{h}\sum_{t=h}^{H-1}\gamma^{t-h}\mathcal{R}_{max}
h=0H1θlogπθ(ah|sh)γhmax1γ\displaystyle\leq\sum_{h=0}^{H-1}\lVert\nabla_{\theta}\log\pi_{\theta}(a_{h}|s_{h})\rVert\gamma^{h}\frac{\mathcal{R}_{max}}{1-\gamma}
Gmax(1γ)2\displaystyle\leq\frac{G\mathcal{R}_{max}}{(1-\gamma)^{2}}

Then the rest of the bounds follow from Lemma B.1. Thanks to reviewer feedback, we note that the bound on the noise variance \mathbb{E}[\lVert\xi_{t+1}\rVert^{2}] can be tightened by a factor of \frac{1}{1-\gamma}, as shown in Lemma 4.2 of (Yuan et al., 2022). ∎
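For concreteness, the following minimal sketch computes the truncated estimator \hat{G}^{VPG} from a single sampled trajectory under a tabular softmax policy and checks the norm bound of Lemma C.1; the toy MDP, reward model, and all constants are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the truncated estimator \hat{G}^{VPG} from a single
# finite-horizon trajectory, with a tabular softmax policy on a toy MDP.
rng = np.random.default_rng(2)
nS, nA, H, gamma, R_max = 3, 2, 50, 0.95, 1.0
theta = rng.normal(scale=0.1, size=(nS, nA))

def policy(s):
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def grad_log_pi(s, a):
    g = np.zeros_like(theta)
    g[s] = -policy(s)
    g[s, a] += 1.0
    return g

# Sample one trajectory of length H from a random-transition toy MDP.
P = rng.dirichlet(np.ones(nS), size=(nS, nA))        # P[s, a] is a distribution
s, traj = 0, []
for _ in range(H):
    a = rng.choice(nA, p=policy(s))
    r = R_max * rng.uniform()                        # bounded rewards in [0, R_max]
    traj.append((s, a, r))
    s = rng.choice(nS, p=P[s, a])

# G_hat = sum_h grad log pi(a_h|s_h) * sum_{t >= h} gamma^t * r_t
G_hat = np.zeros_like(theta)
for h, (s_h, a_h, _) in enumerate(traj):
    reward_to_go = sum(gamma**t * r for t, (_, _, r) in enumerate(traj) if t >= h)
    G_hat += grad_log_pi(s_h, a_h) * reward_to_go

G = max(np.linalg.norm(grad_log_pi(s, a)) for s in range(nS) for a in range(nA))
print("||G_hat|| =", np.linalg.norm(G_hat), "<= sigma =", G * R_max / (1 - gamma) ** 2)
```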

Before we can prove Lemma C.3, we require the following lemma from (Yuan et al., 2022).

Lemma C.2.

(Lemma 4.5 from (Yuan et al., 2022)) For D=\frac{G\mathcal{R}_{\max}}{1-\gamma}, we have that the bias term d_{i+1} is bounded such that

\lVert d_{i+1}\rVert=\lVert\nabla J(\theta_{i})-\nabla J_{H}(\theta_{i})\rVert\leq D(\frac{1}{1-\gamma}+H)^{1/2}\gamma^{H}.

Now we can proceed with the proof of Lemma C.3.

Lemma C.3.

For H=\frac{1}{\log\frac{1}{\gamma}}\cdot O(\log(\frac{1}{\mu})) where \mu\to 0, we have that the gradient bias is deterministically bounded as follows

\lVert d_{t+1}\rVert=\lVert\nabla J(\theta_{t})-\nabla J_{H}(\theta_{t})\rVert\leq D(\frac{1}{1-\gamma}+H)^{1/2}\gamma^{H}\leq D\mu

where D=\frac{G\mathcal{R}_{\max}}{1-\gamma}.

Proof.

We have the bound on the bias in terms of H from Lemma C.2. We want to choose H large enough so that

D(11γ+H)1/2γHDμD(\frac{1}{1-\gamma}+H)^{1/2}\gamma^{H}\leq D\mu
(11γ+H)1/2γHμ.(\frac{1}{1-\gamma}+H)^{1/2}\gamma^{H}\leq\mu.

We begin by finding the approximate solution to the following equation using asymptotic expansion

(11γ+H)1/2γH=μ(\frac{1}{1-\gamma}+H)^{1/2}\gamma^{H}=\mu
12log(11γ+H)+Hlogγ=logμ\frac{1}{2}\log(\frac{1}{1-\gamma}+H)+H\log\gamma=\log\mu
12log(11γ+H)Hlog1γ=log1μ\frac{1}{2}\log(\frac{1}{1-\gamma}+H)-H\log\frac{1}{\gamma}=-\log\frac{1}{\mu}
Hlog1γ12log(11γ+H)=log1μ.H\log\frac{1}{\gamma}-\frac{1}{2}\log(\frac{1}{1-\gamma}+H)=\log\frac{1}{\mu}.

Now we use the method of dominant balance, treating \mu as a small parameter. We have that \log\frac{1}{\mu} and H\log\frac{1}{\gamma} must balance each other out, so

Hlog1μlog1γ=logμlogγ.H\sim\frac{\log\frac{1}{\mu}}{\log\frac{1}{\gamma}}=\frac{\log\mu}{\log\gamma}.

We consider the next term in our asymptotic expansion of H, assuming that it is much smaller than the first term

Hlogμlogγ+x1.H\sim\frac{\log\mu}{\log\gamma}+x_{1}.

We substitute this into the inequality to obtain

(logμlogγ+x1)log1γ12log(11γ+logμlogγ+x1)=log1μ(\frac{\log\mu}{\log\gamma}+x_{1})\log\frac{1}{\gamma}-\frac{1}{2}\log(\frac{1}{1-\gamma}+\frac{\log\mu}{\log\gamma}+x_{1})=\log\frac{1}{\mu}
x1log1γ12log(11γ+logμlogγ+x1)=0x_{1}\log\frac{1}{\gamma}-\frac{1}{2}\log(\frac{1}{1-\gamma}+\frac{\log\mu}{\log\gamma}+x_{1})=0
x1=log(11γ+logμlogγ+x1)2log1γ.x_{1}=\frac{\log(\frac{1}{1-\gamma}+\frac{\log\mu}{\log\gamma}+x_{1})}{2\log\frac{1}{\gamma}}.

Since we assume x_{1} is much smaller than \frac{\log\mu}{\log\gamma}, we have

x1log(11γ+logμlogγ)2log1γ.x_{1}\sim\frac{\log(\frac{1}{1-\gamma}+\frac{\log\mu}{\log\gamma})}{2\log\frac{1}{\gamma}}.

Our asymptotic expansion is H=\frac{\log\mu}{\log\gamma}+O(\log(\frac{\log\mu}{\log\gamma})). So we can pick H=O(\frac{\log\mu}{\log\gamma}) to achieve our inequality. ∎
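The following short sketch illustrates Lemma C.3 numerically: it finds the smallest horizon H satisfying (\frac{1}{1-\gamma}+H)^{1/2}\gamma^{H}\leq\mu by direct search and compares it to the leading-order term \log(1/\mu)/\log(1/\gamma); the values of \gamma and \mu are illustrative assumptions.

```python
import numpy as np

# Illustration of Lemma C.3: the smallest horizon H with
# (1/(1-gamma) + H)^(1/2) * gamma^H <= mu scales like log(1/mu)/log(1/gamma).
gamma = 0.95
for mu in (1e-1, 1e-2, 1e-3, 1e-4):
    H = 0
    while (1.0 / (1.0 - gamma) + H) ** 0.5 * gamma**H > mu:
        H += 1
    leading_term = np.log(1.0 / mu) / np.log(1.0 / gamma)
    print(f"mu={mu:.0e}: minimal H = {H}, log(1/mu)/log(1/gamma) = {leading_term:.1f}")
```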

Appendix D Proof of Theorem 4.8

D.1 Key Lemmas and Proof Sketch

In Theorem 4.8, we establish the convergence of Q_{\bar{w}_{K}} to Q_{w^{*}} under a constant step size. Our approach will mirror that of (Bhandari et al., 2018) by establishing a recurrence relation for the iterates and then bounding the bias induced by Markovian sampling. The key challenge is characterizing the distance between the initial state distribution and the stationary distribution in terms of the mixing rate.

We define \bar{g}(w) as the expectation of the semigradient g_{t}(w) with respect to the stationary distribution of the Markov chain as follows

\bar{g}(w)=\mathbb{E}_{\eta_{\pi}}[g_{t}(w)]=\sum_{s,s^{\prime},a,a^{\prime}}\eta_{\pi}(s,a)\mathcal{P}(s,a,s^{\prime},a^{\prime})(\mathcal{R}(s,a)+\gamma\phi(s^{\prime},a^{\prime})^{\intercal}w-\phi(s,a)^{\intercal}w)\phi(s,a).

We also define \zeta_{t} to represent the bias from Markovian sampling as follows

\zeta_{t}(w)=(g_{t}(w)-\bar{g}(w))^{\intercal}(w-w^{*}).
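For reference, the following minimal sketch implements the critic update analyzed here, namely TD(0) with linear features, a projection onto the ball of radius R, a constant step size \alpha_{0}=1/\sqrt{K}, and iterate averaging; the toy Markov chain over state-action pairs, the features, and the rewards are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of the inner-loop critic: TD(0) with linear features,
# constant step size alpha_0 = 1/sqrt(K), projection onto ||w|| <= R,
# and iterate averaging. The toy chain, features, and rewards are illustrative.
rng = np.random.default_rng(3)
n_pairs, dim, gamma, R, K = 6, 3, 0.9, 5.0, 10_000
P = rng.dirichlet(np.ones(n_pairs), size=n_pairs)    # chain over state-action pairs
phi = rng.normal(size=(n_pairs, dim))
phi /= np.linalg.norm(phi, axis=1, keepdims=True)    # ||phi(s, a)|| <= 1
reward = rng.uniform(size=n_pairs)

alpha0 = 1.0 / np.sqrt(K)
w = np.zeros(dim)
w_bar = np.zeros(dim)
x = 0                                                # current state-action pair
for t in range(K):
    x_next = rng.choice(n_pairs, p=P[x])
    td_error = reward[x] + gamma * phi[x_next] @ w - phi[x] @ w
    w = w + alpha0 * td_error * phi[x]               # semigradient step g_t(w)
    if np.linalg.norm(w) > R:                        # projection onto ||w|| <= R
        w *= R / np.linalg.norm(w)
    w_bar += (w - w_bar) / (t + 1)                   # running average, i.e. w_bar_K
    x = x_next

print("averaged critic parameter w_bar_K:", w_bar)
```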

First, the following lemma, which combines Lemmas 6 and 10 from (Bhandari et al., 2018), uniformly bounds the norm of the semi-gradient and the Markov bias term \zeta_{t}.

Lemma D.1.

Let F=\mathcal{R}_{\max}+2R. Then R\leq\frac{F}{2} and for all t\geq 0 we have

\lVert g_{t}(w)\rVert_{2}\leq\mathcal{R}_{\max}+2\lVert w\rVert_{2}\leq F.

In addition, for all w,w^{\prime} with \lVert w\rVert_{2},\lVert w^{\prime}\rVert_{2}\leq R, the gradient bias is bounded such that

|\zeta_{t}(w)|\leq 2F^{2}
|\zeta_{t}(w)-\zeta_{t}(w^{\prime})|\leq 6F\lVert w-w^{\prime}\rVert_{2}.

Then we obtain the following lemma for Markov chains that are not necessarily initialized at the stationary distribution, which differs from Lemma 9 in (Bhandari et al., 2018) by a factor of 2.

Lemma D.2.

Consider two random variables X and Y such that

X\to s_{t}\to s_{t+\tau}\to Y

forms a Markov chain for some fixed t\geq 0 and \tau>0. Assume the Markov chain mixes at a uniform geometric rate as in Assumption 4.2. Let X^{\prime} and Y^{\prime} denote independent copies drawn from the marginal distributions of X and Y, so \mathbb{P}(X^{\prime}=\cdot,Y^{\prime}=\cdot)=\mathbb{P}(X=\cdot)\otimes\mathbb{P}(Y=\cdot). Then, for any bounded function v,

|\mathbb{E}[v(X,Y)]-\mathbb{E}[v(X^{\prime},Y^{\prime})]|\leq 4\lVert v\rVert_{\infty}(mr^{\tau}),

where \lVert v\rVert_{\infty}=\sup_{x}|v(x)|.

Proof. See Appendix D.3.

Now in the following key lemma, we apply Lemma D.2 to bound \zeta_{t}(w_{t}) with respect to exponential mixing. Although we follow the proof of Lemma 11 in (Bhandari et al., 2018), it is not sufficient to directly carry the factor of 2 over from Lemma D.2 because we need to account for the fact that the marginal distribution of each observation O_{t} is now time-dependent and not equal to the stationary distribution.

Lemma D.3.

Consider a non-increasing step-size sequence \alpha_{0}\geq\alpha_{1}\geq...\geq\alpha_{T}. Let \tau_{0}=\tau^{\text{\tiny mix}}(\alpha_{T}). Fix any t\leq T and set t^{*}=\max\{0,t-\tau_{0}\}. Then,

\mathbb{E}[\zeta_{t}(w_{t})]\leq F^{2}(8+6\tau_{0})\alpha_{t^{*}}+10F^{2}mr^{t}.

Proof. See Appendix D.4.

Finally, we have the following lemma from (Bhandari et al., 2018) that establishes a recursion for the distance between the iterates and the limit point ww^{*}.

Lemma D.4.

With probability 1, for every t\in\mathbb{N}_{0},

\lVert w^{*}-w_{t+1}\rVert^{2}\leq\lVert w^{*}-w_{t}\rVert^{2}-2\alpha_{t}(1-\gamma)\lVert Q_{w^{*}}-Q_{w_{t}}\rVert_{\eta_{\pi}}^{2}+2\alpha_{t}\zeta_{t}(w_{t})+\alpha_{t}^{2}F^{2}.

Proof. See the proof of Lemma 8 in (Bhandari et al., 2018).

These key lemmas allow us to proceed with the proof of Theorem 4.8.

D.2 Proof of Theorem 4.8

Proof.

Rearranging the terms of the inequality in Lemma D.4, summing from t=0 to K-1, and using the fact that the constant step size \alpha_{t}=\alpha_{0}=\frac{1}{\sqrt{K}} satisfies K\alpha_{0}^{2}=1, we arrive at

2α0(1γ)t=0K1𝔼[QwQwtηπ2]ww022+F2+2α0t=0K1𝔼[ζt(wt)].2\alpha_{0}(1-\gamma)\sum_{t=0}^{K-1}\mathbb{E}[\lVert Q_{w^{*}}-Q_{w_{t}}\lVert^{2}_{\eta_{\pi}}]\leq\lVert w^{*}-w_{0}\lVert^{2}_{2}+F^{2}+2\alpha_{0}\sum_{t=0}^{K-1}\mathbb{E}[\zeta_{t}(w_{t})].

Then we can use the bound on \zeta_{t}(w_{t}) from Lemma D.3 and the fact that our step sizes are constant to obtain

2α0(1γ)t=0K1𝔼[QwQwtηπ2]\displaystyle 2\alpha_{0}(1-\gamma)\sum_{t=0}^{K-1}\mathbb{E}[\lVert Q_{w^{*}}-Q_{w_{t}}\lVert^{2}_{\eta_{\pi}}] ww022+F2+2α0t=0K1F2(8+6τ0)α0+2α0t=0K110F2mrt\displaystyle\leq\lVert w^{*}-w_{0}\lVert^{2}_{2}+F^{2}+2\alpha_{0}\sum_{t=0}^{K-1}F^{2}(8+6\tau_{0})\alpha_{0}+2\alpha_{0}\sum_{t=0}^{K-1}10F^{2}mr^{t}
ww022+F2+2α02KF2(8+6τ0)+20F2mα01r.\displaystyle\leq\lVert w^{*}-w_{0}\lVert^{2}_{2}+F^{2}+2\alpha_{0}^{2}KF^{2}(8+6\tau_{0})+\frac{20F^{2}m\alpha_{0}}{1-r}.

Now we can divide both sides by 2\alpha_{0}(1-\gamma) and substitute \alpha_{0}=\frac{1}{\sqrt{K}} to obtain

t=0K1𝔼[QwQwtηπ2]\displaystyle\sum_{t=0}^{K-1}\mathbb{E}[\lVert Q_{w^{*}}-Q_{w_{t}}\lVert^{2}_{\eta_{\pi}}] ww022+F22α0(1γ)+α0KF2(8+6τ0)1γ+10F2m(1r)(1γ)\displaystyle\leq\frac{\lVert w^{*}-w_{0}\lVert^{2}_{2}+F^{2}}{2\alpha_{0}(1-\gamma)}+\frac{\alpha_{0}KF^{2}(8+6\tau_{0})}{1-\gamma}+\frac{10F^{2}m}{(1-r)(1-\gamma)}
=K(ww022+F2)2(1γ)+KF2(8+6τ0)1γ+10F2m(1r)(1γ)\displaystyle=\frac{\sqrt{K}(\lVert w^{*}-w_{0}\lVert^{2}_{2}+F^{2})}{2(1-\gamma)}+\frac{\sqrt{K}F^{2}(8+6\tau_{0})}{1-\gamma}+\frac{10F^{2}m}{(1-r)(1-\gamma)}
=K(ww022+17F2+12F2τ0)2(1γ)+10F2m(1r)(1γ).\displaystyle=\frac{\sqrt{K}(\lVert w^{*}-w_{0}\lVert^{2}_{2}+17F^{2}+12F^{2}\tau_{0})}{2(1-\gamma)}+\frac{10F^{2}m}{(1-r)(1-\gamma)}.

Finally, we divide both sides by K and use Jensen's inequality to obtain our final result

𝔼[QwQw¯K2ηπ]1Kt=0K1𝔼[QwQwt2ηπ]ww022+F2(17+12τ0)2(1γ)K+10F2m(1r)(1γ)K.\displaystyle\mathbb{E}[\lVert Q_{w^{*}}-Q_{\bar{w}_{K}}\lVert^{2}_{\eta_{\pi}}]\leq\frac{1}{K}\sum_{t=0}^{K-1}\mathbb{E}[\lVert Q_{w^{*}}-Q_{w_{t}}\lVert^{2}_{\eta_{\pi}}]\leq\frac{\lVert w^{*}-w_{0}\lVert^{2}_{2}+F^{2}(17+12\tau_{0})}{2(1-\gamma)\sqrt{K}}+\frac{10F^{2}m}{(1-r)(1-\gamma)K}.

D.3 Proof of Lemma D.2

We first require the following auxiliary lemma.

Lemma D.5.

For any Markov chain \{s_{t}\} with stationary distribution \eta and finite state space \mathcal{S},

d_{TV}(\eta,\mathbb{P}(s_{t+\tau}=\cdot))\leq\sup_{s\in\mathcal{S}}d_{TV}(\eta,\mathbb{P}(s_{t+\tau}=\cdot|s_{0}=s)).
Proof.

It follows by induction that for a convex function f, if f(x_{n})\geq f(x_{i}) for all x_{i}\in\{x_{1},...,x_{n}\}, then f(x_{n})\geq f(x) for any x=\sum_{i=1}^{n}\alpha_{i}x_{i} with \alpha_{i}\geq 0 and \sum_{i=1}^{n}\alpha_{i}=1. In other words, f(x_{n}) is greater than or equal to f(x) for any x that is a convex combination of \{x_{1},...,x_{n}\}.

Now let f(x)=\frac{1}{2}\sum_{s_{i}\in\mathcal{S}}|x^{\intercal}P^{t+\tau}(s_{i})-\eta(s_{i})|, which is a convex function, and let e_{i}\in\mathbb{R}^{|\mathcal{S}|} represent the unit vector that is 1 at index i and 0 everywhere else. Then we have

d_{TV}(\eta,\mathbb{P}(s_{t+\tau}=\cdot))=f(\rho_{0})
d_{TV}(\eta,\mathbb{P}(s_{t+\tau}=\cdot|s_{0}=s_{i}))=f(e_{i}).

Since any initial state distribution \rho_{0} is a convex combination of the unit vectors e_{i}, we have shown the result. ∎
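The following minimal sketch illustrates Lemma D.5 on a toy two-state chain: the total variation distance between the stationary distribution and the time-t law started from an arbitrary \rho_{0} is dominated by the worst point-mass initialization, and both decay geometrically; the transition matrix is an illustrative assumption.

```python
import numpy as np

# Illustration of Lemma D.5 on a toy two-state chain: the TV distance from the
# stationary distribution is maximized over point-mass initializations, and it
# decays geometrically for any initial distribution.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
eigvals, eigvecs = np.linalg.eig(P.T)
eta = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
eta /= eta.sum()                                     # stationary distribution

def tv(p, q):
    return 0.5 * np.abs(p - q).sum()

rho0 = np.array([0.5, 0.5])                          # an arbitrary initial distribution
for t in (1, 5, 10, 20):
    Pt = np.linalg.matrix_power(P, t)
    d_rho = tv(eta, rho0 @ Pt)
    d_sup = max(tv(eta, e @ Pt) for e in np.eye(2))  # sup over unit vectors e_i
    print(f"t={t}: d_TV(eta, rho0 P^t) = {d_rho:.4f} <= sup_i d_TV(eta, e_i P^t) = {d_sup:.4f}")
```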

Now we can proceed with the proof of Lemma D.2.

Proof.

Let h=\frac{v}{2\lVert v\rVert_{\infty}} denote the function v rescaled to take values in [-1/2,1/2]. Then we can follow the steps of Lemma 9 in (Bhandari et al., 2018) to arrive at

|\mathbb{E}[h(X,Y)]-\mathbb{E}[h(X^{\prime},Y^{\prime})]|\leq\sum_{s\in\mathcal{S}}\mathbb{P}(s_{t}=s)d_{TV}(\mathbb{P}(s_{t+\tau}=\cdot|s_{t}=s),\mathbb{P}(s_{t+\tau}=\cdot)). (23)

We can bound d_{TV}(\mathbb{P}(s_{t+\tau}=\cdot|s_{t}=s),\mathbb{P}(s_{t+\tau}=\cdot)) as follows

dTV((st+τ=|st=s),(st+τ=))dTV((st+τ=|st=s),ηπ)+dTV(ηπ,(st+τ=))d_{TV}(\mathbb{P}(s_{t+\tau}=\cdot|s_{t}=s),\mathbb{P}(s_{t+\tau}=\cdot))\leq d_{TV}(\mathbb{P}(s_{t+\tau}=\cdot|s_{t}=s),\eta_{\pi})+d_{TV}(\eta_{\pi},\mathbb{P}(s_{t+\tau}=\cdot)) (24)

because the total variation distance is a metric and obeys the triangle inequality. The first term on the right hand side of (24) is bounded by exponential mixing

dTV((st+τ=|st=s),ηπ)mrτ.d_{TV}(\mathbb{P}(s_{t+\tau}=\cdot|s_{t}=s),\eta_{\pi})\leq mr^{\tau}.

The second term can be bounded as follows by Lemma D.5

dTV(ηπ,(st+τ=))supsSdTV(ηπ,(st+τ=|s0=s))mrt+τ.d_{TV}(\eta_{\pi},\mathbb{P}(s_{t+\tau}=\cdot))\leq\sup_{s\in S}d_{TV}(\eta_{\pi},\mathbb{P}(s_{t+\tau}=\cdot|s_{0}=s))\leq mr^{t+\tau}.

Returning to (24), we have

dTV((st+τ=|st=s),(st+τ=))mrτ+mrt+τ2mrτ,d_{TV}(\mathbb{P}(s_{t+\tau}=\cdot|s_{t}=s),\mathbb{P}(s_{t+\tau}=\cdot))\leq mr^{\tau}+mr^{t+\tau}\leq 2mr^{\tau},

and applying these inequalities to (23) and rescaling h back to v=2\lVert v\rVert_{\infty}h, we have

|𝔼[v(X,Y)]𝔼[v(X,Y)]|4vmrτ.|\mathbb{E}[v(X,Y)]-\mathbb{E}[v(X^{\prime},Y^{\prime})]|\leq 4\lVert v\lVert_{\infty}mr^{\tau}.

D.4 Proof of Lemma D.3

First we require the following auxiliary lemmas.

Lemma D.6.

Let P and Q represent two different probability distributions. For some bounded function f and random variable X, we have

|\mathbb{E}_{P}[f(X)]-\mathbb{E}_{Q}[f(X)]|\leq 2\lVert f\rVert_{\infty}d_{TV}(P,Q).
Proof.

This can be shown with the definition

d_{TV}(P,Q)=\sup_{v:\lVert v\rVert_{\infty}\leq\frac{1}{2}}\left|\int vdP-\int vdQ\right|.

Lemma D.7.

Let O^{\prime\prime}=(s,a,s^{\prime},a^{\prime}) represent the observation tuple of consecutive state-action pairs where (s,a) is drawn from the stationary distribution \eta_{\pi} of the Markov chain, and let O_{t}=(s_{t},a_{t},s_{t+1},a_{t+1}) represent the observation tuple drawn at time t from the Markov chain. Then

d_{TV}(\mathbb{P}(O^{\prime\prime}=\cdot),\mathbb{P}(O_{t}=\cdot))=d_{TV}(\eta_{\pi},\mathbb{P}(s_{t}=\cdot,a_{t}=\cdot))\leq mr^{t}.

Now we can proceed with the proof of Lemma D.3, which mirrors the framework of Lemma 11 in (Bhandari et al., 2018). The main difference in our proofs is in Step 2.

Proof.

Step 1: Relate ζt(wt)\zeta_{t}(w_{t}) and ζt(wtτ)\zeta_{t}(w_{t-\tau})

We apply Lemma D.1 to bound ζt\zeta_{t} as follows

|\zeta_{t}(w_{t})-\zeta_{t}(w_{t-\tau})|\leq 6F\lVert w_{t}-w_{t-\tau}\rVert\leq 6F^{2}\sum_{i=t-\tau}^{t-1}\alpha_{i}
\zeta_{t}(w_{t})\leq\zeta_{t}(w_{t-\tau})+6F^{2}\sum_{i=t-\tau}^{t-1}\alpha_{i} (25)

Step 2: Bound 𝔼[ζt(wtτ)]\mathbb{E}[\zeta_{t}(w_{t-\tau})] using Lemma D.2 and exponential mixing

We denote by O_{t}=(s_{t},a_{t},s_{t+1},a_{t+1}) the observation tuple at each time step, and we overload the notation of g_{t} and \zeta_{t} to make clear their dependence on O_{t} as follows

g_{t}(w,O_{t})=(\mathcal{R}(s_{t},a_{t})+\gamma\phi(s_{t+1},a_{t+1})^{\intercal}w-\phi(s_{t},a_{t})^{\intercal}w)\phi(s_{t},a_{t})
\zeta(w,O_{t})=(g_{t}(w,O_{t})-\bar{g}(w))^{\intercal}(w-w^{*})

To apply Lemma D.2, we consider random variables w^{\prime}_{t-\tau} and O^{\prime}_{t} drawn independently from their marginal distributions \mathbb{P}(w_{t-\tau}=\cdot) and \mathbb{P}(O_{t}=\cdot) respectively, such that their joint distribution is defined as follows

\mathbb{P}(w^{\prime}_{t-\tau}=\cdot,O^{\prime}_{t}=\cdot)=\mathbb{P}(w_{t-\tau}=\cdot)\otimes\mathbb{P}(O_{t}=\cdot)

Note that typically the random variables w_{t-\tau} and O_{t} are not independent. Now we want to bound \mathbb{E}[\zeta(w^{\prime}_{t-\tau},O^{\prime}_{t})]. We can use the law of total expectation as follows

𝔼[ζ(wtτ,Ot)]=𝔼[𝔼[ζ(wtτ,Ot)|wtτ]]\mathbb{E}[\zeta(w^{\prime}_{t-\tau},O^{\prime}_{t})]=\mathbb{E}[\mathbb{E}[\zeta(w^{\prime}_{t-\tau},O^{\prime}_{t})|w^{\prime}_{t-\tau}]]

Now we consider the conditional expectation term \mathbb{E}[\zeta(w^{\prime}_{t-\tau},O^{\prime}_{t})|w^{\prime}_{t-\tau}]. Since w^{\prime}_{t-\tau} and O^{\prime}_{t} are independent, we have

𝔼[ζ(wtτ,Ot)|wtτ]\displaystyle\mathbb{E}[\zeta(w^{\prime}_{t-\tau},O^{\prime}_{t})|w^{\prime}_{t-\tau}] =𝔼Ot(Ot=)[ζ(wtτ,Ot)|wtτ]\displaystyle=\mathbb{E}_{O^{\prime}_{t}\sim\mathbb{P}(O_{t}=\cdot)}[\zeta(w^{\prime}_{t-\tau},O^{\prime}_{t})|w^{\prime}_{t-\tau}]
=𝔼Ot(Ot=)[(gt(wtτ,Ot)g¯(wtτ))(wtτw)|wtτ]\displaystyle=\mathbb{E}_{O^{\prime}_{t}\sim\mathbb{P}(O_{t}=\cdot)}[(g_{t}(w^{\prime}_{t-\tau},O^{\prime}_{t})-\bar{g}(w^{\prime}_{t-\tau}))^{\intercal}(w^{\prime}_{t-\tau}-w^{*})|w^{\prime}_{t-\tau}]
=(𝔼Ot(Ot=)[gt(wtτ,Ot)]g¯(wtτ))(wtτw)\displaystyle=(\mathbb{E}_{O^{\prime}_{t}\sim\mathbb{P}(O_{t}=\cdot)}[g_{t}(w^{\prime}_{t-\tau},O^{\prime}_{t})]-\bar{g}(w^{\prime}_{t-\tau}))^{\intercal}(w^{\prime}_{t-\tau}-w^{*})
=(𝔼Ot(Ot=)[gt(wtτ,Ot)]𝔼Oηπ[g(wtτ,O)])(wtτw)\displaystyle=(\mathbb{E}_{O^{\prime}_{t}\sim\mathbb{P}(O_{t}=\cdot)}[g_{t}(w^{\prime}_{t-\tau},O^{\prime}_{t})]-\mathbb{E}_{O^{\prime\prime}\sim\eta_{\pi}}[g(w^{\prime}_{t-\tau},O^{\prime\prime})])^{\intercal}(w^{\prime}_{t-\tau}-w^{*})
𝔼Ot(Ot=)[gt(wtτ,Ot)]𝔼Oηπ[g(wtτ,O)]wtτw\displaystyle\leq\lVert\mathbb{E}_{O^{\prime}_{t}\sim\mathbb{P}(O_{t}=\cdot)}[g_{t}(w^{\prime}_{t-\tau},O^{\prime}_{t})]-\mathbb{E}_{O^{\prime\prime}\sim\eta_{\pi}}[g(w^{\prime}_{t-\tau},O^{\prime\prime})]\lVert\cdot\lVert w^{\prime}_{t-\tau}-w^{*}\lVert
𝔼Ot(Ot=)[gt(wtτ,Ot)]𝔼Oηπ[g(wtτ,O)]2R,\displaystyle\leq\lVert\mathbb{E}_{O^{\prime}_{t}\sim\mathbb{P}(O_{t}=\cdot)}[g_{t}(w^{\prime}_{t-\tau},O^{\prime}_{t})]-\mathbb{E}_{O^{\prime\prime}\sim\eta_{\pi}}[g(w^{\prime}_{t-\tau},O^{\prime\prime})]\lVert\cdot 2R,

where the last inequality is due to the fact that \lVert w\rVert\leq R. Here O^{\prime}_{t} is drawn from the time-dependent marginal distribution \mathbb{P}(O_{t}=\cdot), whereas O^{\prime\prime} is drawn from the stationary distribution of the Markov chain. Then by Lemma D.6 and the fact that \lVert g_{t}\rVert\leq F and R\leq\frac{F}{2}, we can bound this difference in expectation by the total variation distance between the two distributions as follows

𝔼[ζ(wtτ,Ot)|wtτ]2F2dTV((O=),(Ot=)).\displaystyle\mathbb{E}[\zeta(w^{\prime}_{t-\tau},O^{\prime}_{t})|w^{\prime}_{t-\tau}]\leq 2F^{2}d_{TV}(\mathbb{P}(O^{\prime\prime}=\cdot),\mathbb{P}(O_{t}=\cdot)).

Then by Lemma D.5 and Lemma D.7 we have

𝔼[ζ(wtτ,Ot)|wtτ]2F2mrt.\displaystyle\mathbb{E}[\zeta(w^{\prime}_{t-\tau},O^{\prime}_{t})|w^{\prime}_{t-\tau}]\leq 2F^{2}mr^{t}.

Now we can apply Lemma D.2 to bound \mathbb{E}[\zeta(w_{t-\tau},O_{t})]. By Lemma D.1, we have |\zeta(w,O_{t})|\leq 2F^{2}. Since w_{t-\tau}\to s_{t-\tau}\to s_{t}\to O_{t} forms a Markov chain, applying Lemma D.2 yields

|𝔼[ζ(wtτ,Ot)]𝔼[ζ(wtτ,Ot)]|8F2mrτ|\mathbb{E}[\zeta(w_{t-\tau},O_{t})]-\mathbb{E}[\zeta(w^{\prime}_{t-\tau},O^{\prime}_{t})]|\leq 8F^{2}mr^{\tau}
|𝔼[ζ(wtτ,Ot)]|8F2mrτ+2F2mrt.|\mathbb{E}[\zeta(w_{t-\tau},O_{t})]|\leq 8F^{2}mr^{\tau}+2F^{2}mr^{t}. (26)

Step 3: Combine terms.

We combine (25) and (26) to arrive at

𝔼[ζt(wt)]8F2mrτ+2F2mrt+6F2ταtτ.\mathbb{E}[\zeta_{t}(w_{t})]\leq 8F^{2}mr^{\tau}+2F^{2}mr^{t}+6F^{2}\tau\alpha_{t-\tau}.

Let \tau_{0}=\tau^{\text{\tiny mix}}(\alpha_{T}). For t\leq\tau_{0}, pick \tau=t to arrive at the bound

𝔼[ζt(wt)]8F2mrt+2F2mrt+6F2tα010F2mrt+6F2τ0α0.\mathbb{E}[\zeta_{t}(w_{t})]\leq 8F^{2}mr^{t}+2F^{2}mr^{t}+6F^{2}t\alpha_{0}\leq 10F^{2}mr^{t}+6F^{2}\tau_{0}\alpha_{0}.

Now for Tt>τ0T\geq t>\tau_{0} we pick τ=τ0\tau=\tau_{0} to get the bound

𝔼[ζt(wt)]8F2αT+2F2mrt+6F2τ0αtτ0F2(8+6τ0)αtτ0+2F2mrt.\mathbb{E}[\zeta_{t}(w_{t})]\leq 8F^{2}\alpha_{T}+2F^{2}mr^{t}+6F^{2}\tau_{0}\alpha_{t-\tau_{0}}\leq F^{2}(8+6\tau_{0})\alpha_{t-\tau_{0}}+2F^{2}mr^{t}.
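To illustrate how the two regimes combine, the following minimal sketch (with purely illustrative constants of our own; we take \tau^{\text{\tiny mix}}(\epsilon) to be the smallest \tau with mr^{\tau}\leq\epsilon, consistent with its use above) evaluates the resulting bound on \mathbb{E}[\zeta_{t}(w_{t})] over t.

```python
import numpy as np

# Purely illustrative constants (not values from the paper).
F, m, r = 1.0, 2.0, 0.9           # gradient bound and mixing constants
varsigma = 0.5                    # step-size scale, alpha_t = 1 / ((t + 1) * varsigma)
T = 200                           # number of TD(0) iterations

def alpha(t):
    return 1.0 / ((t + 1) * varsigma)

def tau_mix(eps):
    # Smallest tau with m * r**tau <= eps.
    return int(np.ceil(np.log(m / eps) / np.log(1.0 / r)))

tau0 = tau_mix(alpha(T))

def zeta_bound(t):
    # For t <= tau0 we picked tau = t; for t > tau0 we picked tau = tau0.
    if t <= tau0:
        return 10 * F**2 * m * r**t + 6 * F**2 * tau0 * alpha(0)
    return F**2 * (8 + 6 * tau0) * alpha(t - tau0) + 2 * F**2 * m * r**t

print("tau0 =", tau0)
print("bound at t = 0:", zeta_bound(0), " bound at t = T - 1:", zeta_bound(T - 1))
```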

Appendix E Noise and Bias Bounds for Actor-Critic Policy Gradient

Our general approach for actor-critic policy gradient is similar to that of vanilla policy gradient. Once again, we bound the noise and the bias: \xi_{t} is bounded in Lemma E.1, p_{t} in Lemma E.2, and the critic approximation bias in Section E.1 (the proof of Lemma 4.10).

We note that for vanilla policy gradient, the noise term ξt+1\xi_{t+1} is random due to the sampled trajectory τt\tau_{t} whereas the bias term dt+1d_{t+1} is deterministic conditioned on t\mathcal{F}_{t}. In contrast, for the actor-critic algorithm, ξt+1\xi_{t+1} depends on two random variables, τt\tau_{t} and the averaged critic parameter w¯K,t\bar{w}_{K,t}, whereas the bias term depends only on w¯K,t\bar{w}_{K,t}. Moreover, τt\tau_{t} and w¯K,t\bar{w}_{K,t} are generated from independently sampled trajectories and are therefore independent. We quantify these observations in the following lemma.

Lemma E.1.

The gradient noise process {ξt}t0\{\xi_{t}\}_{t\geq 0} satisfies

𝔼[ξt+1|t]=0\mathbb{E}[\xi_{t+1}|\mathcal{F}_{t}]=0

In addition, let σ=GR1γ\sigma=\frac{GR}{1-\gamma}. Then we have the following bounds

G^(θt;τt)σ,\lVert\hat{G}(\theta_{t};\tau_{t})\lVert\leq\sigma,
𝔼[ξt+12|t]σ2\mathbb{E}[\lVert\xi_{t+1}\lVert^{2}|\mathcal{F}_{t}]\leq\sigma^{2}
𝔼[ξt+14|t]4σ4\mathbb{E}[\lVert\xi_{t+1}\lVert^{4}|\mathcal{F}_{t}]\leq 4\sigma^{4}
𝔼[dt+12ξt+12|t]σ2𝔼[dt+12|t]\mathbb{E}[\lVert d_{t+1}\rVert^{2}\lVert\xi_{t+1}\rVert^{2}|\mathcal{F}_{t}]\leq\sigma^{2}\mathbb{E}[\lVert d_{t+1}\rVert^{2}|\mathcal{F}_{t}]
Proof.

As discussed earlier, \xi_{t+1} depends on two independent random variables: \tau_{t} and \bar{w}_{K,t}. We overload notation and denote the gradient estimator by \hat{G}(\theta_{t};\tau_{t};\bar{w}_{K,t}). We therefore obtain

𝔼[ξt+1|t]\displaystyle\mathbb{E}[\xi_{t+1}|\mathcal{F}_{t}] =𝔼[G^(θt;τt;w¯K,t)𝔼τ[G^(θt;τt;w¯K,t)]|t]\displaystyle=\mathbb{E}[\hat{G}(\theta_{t};\tau_{t};\bar{w}_{K,t})-\mathbb{E}_{\tau}[\hat{G}(\theta_{t};\tau_{t};\bar{w}_{K,t})]|\mathcal{F}_{t}]
=𝔼[𝔼[G^(θt;τt;w¯K,t)𝔼τ[G^(θt;τt;w¯K,t)]|w¯K,t]|t]\displaystyle=\mathbb{E}[\mathbb{E}[\hat{G}(\theta_{t};\tau_{t};\bar{w}_{K,t})-\mathbb{E}_{\tau}[\hat{G}(\theta_{t};\tau_{t};\bar{w}_{K,t})]|\bar{w}_{K,t}]|\mathcal{F}_{t}]
=𝔼[𝔼τ[G^(θt;τt;w¯K,t)]𝔼τ[G^(θt;τt;w¯K,t)]|t]=0\displaystyle=\mathbb{E}[\mathbb{E}_{\tau}[\hat{G}(\theta_{t};\tau_{t};\bar{w}_{K,t})]-\mathbb{E}_{\tau}[\hat{G}(\theta_{t};\tau_{t};\bar{w}_{K,t})]|\mathcal{F}_{t}]=0

Now we bound \lVert\hat{G}(\theta_{t};\tau_{t};\bar{w}_{K,t})\rVert.

G^(θt;τt;w¯K,t)\displaystyle\lVert\hat{G}(\theta_{t};\tau_{t};\bar{w}_{K,t})\lVert =j=0HγjQw¯K,t(sj,aj)logπθt(aj|sj)\displaystyle=\lVert\sum_{j=0}^{H}\gamma^{j}Q_{\bar{w}_{K,t}}(s_{j},a_{j})\nabla\log\pi_{\theta_{t}}(a_{j}|s_{j})\lVert
=j=0Hγjw¯K,tϕ(sj,aj)logπθt(aj|sj)RG1γ\displaystyle=\lVert\sum_{j=0}^{H}\gamma^{j}\bar{w}_{K,t}^{\intercal}\phi(s_{j},a_{j})\nabla\log\pi_{\theta_{t}}(a_{j}|s_{j})\lVert\leq\frac{RG}{1-\gamma}

By Lemma B.1

\mathbb{E}[\lVert\xi_{t+1}\rVert^{2}|\bar{w}_{K,t}]=\mathbb{E}_{\tau}[\lVert\xi_{t+1}\rVert^{2}]\leq\sigma^{2}
𝔼[ξt+12|t]\displaystyle\mathbb{E}[\lVert\xi_{t+1}\lVert^{2}|\mathcal{F}_{t}] =𝔼[𝔼[ξt+12|w¯K,t]|t]σ2\displaystyle=\mathbb{E}[\mathbb{E}[\lVert\xi_{t+1}\lVert^{2}|\bar{w}_{K,t}]|\mathcal{F}_{t}]\leq\sigma^{2}
\mathbb{E}[\lVert\xi_{t+1}\rVert^{4}|\mathcal{F}_{t}]\leq 4\sigma^{4}

Finally we have

𝔼[dt+12ξt+12|t]\displaystyle\mathbb{E}[\lVert d_{t+1}\rVert^{2}\lVert\xi_{t+1}\rVert^{2}|\mathcal{F}_{t}] =𝔼[𝔼[dt+12ξt+12|w¯K,t]|t]\displaystyle=\mathbb{E}[\mathbb{E}[\lVert d_{t+1}\rVert^{2}\lVert\xi_{t+1}\rVert^{2}|\bar{w}_{K,t}]|\mathcal{F}_{t}]
=𝔼[dt+12𝔼[ξt+12|w¯K,t]|t]\displaystyle=\mathbb{E}[\lVert d_{t+1}\rVert^{2}\mathbb{E}[\lVert\xi_{t+1}\rVert^{2}|\bar{w}_{K,t}]|\mathcal{F}_{t}]
𝔼[dt+12|t]σ2\displaystyle\leq\mathbb{E}[\lVert d_{t+1}\rVert^{2}|\mathcal{F}_{t}]\sigma^{2}

In the following lemma, we bound the bias in the actor-critic gradient estimator due to truncation of the infinite horizon. The approach is similar to the proof of Lemma C.2.

Lemma E.2.

For Dp=GR(1γ)D_{p}=\frac{GR}{(1-\gamma)} and HlogμlogγH\geq\frac{\log\mu}{\log\gamma}, the truncation bias is deterministically bounded such that

pt+1Dpμ\lVert p_{t+1}\lVert\leq D_{p}\mu
pt+14Dp4μ4.\lVert p_{t+1}\lVert^{4}\leq D_{p}^{4}\mu^{4}.
Proof.
pt+1\displaystyle\lVert p_{t+1}\lVert =GH(θt)G(θt)\displaystyle=\lVert G_{H}(\theta_{t})-G_{\infty}(\theta_{t})\lVert
=𝔼[j=0HγjQw¯K,t(sj,aj)logπθt(aj|sj)]𝔼[j=0γjQw¯K,t(sj,aj)logπθt(aj|sj)]\displaystyle=\lVert\mathbb{E}[\sum_{j=0}^{H}\gamma^{j}Q_{\bar{w}_{K,t}}(s_{j},a_{j})\nabla\log\pi_{\theta_{t}}(a_{j}|s_{j})]-\mathbb{E}[\sum_{j=0}^{\infty}\gamma^{j}Q_{\bar{w}_{K,t}}(s_{j},a_{j})\nabla\log\pi_{\theta_{t}}(a_{j}|s_{j})]\lVert
=𝔼[j=HγjQw¯K,t(sj,aj)logπθt(aj|sj)]\displaystyle=\lVert\mathbb{E}[\sum_{j=H}^{\infty}\gamma^{j}Q_{\bar{w}_{K,t}}(s_{j},a_{j})\nabla\log\pi_{\theta_{t}}(a_{j}|s_{j})]\lVert
=𝔼[j=Hγjw¯K,tϕ(sj,aj)logπθt(aj|sj)]\displaystyle=\lVert\mathbb{E}[\sum_{j=H}^{\infty}\gamma^{j}\bar{w}_{K,t}^{\intercal}\phi(s_{j},a_{j})\nabla\log\pi_{\theta_{t}}(a_{j}|s_{j})]\lVert
\leq RG\sum_{j=H}^{\infty}\gamma^{j}\leq\frac{GR}{1-\gamma}\gamma^{H}\leq D_{p}\mu,
where the final inequality uses \gamma^{H}\leq\mu, which follows from H\geq\frac{\log\mu}{\log\gamma} since \gamma\in(0,1). Raising both sides to the fourth power gives the second claim. ∎
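To make the estimator and the horizon choice concrete, here is a minimal sketch (assuming a tabular softmax policy, a linear critic Q_{w}(s,a)=w^{\intercal}\phi(s,a), and randomly generated placeholder data; all names and dimensions are illustrative, not taken from the paper) of the truncated actor-critic gradient estimator \hat{G}(\theta;\tau;\bar{w}) with H chosen so that \gamma^{H}\leq\mu.

```python
import numpy as np

def grad_log_softmax(theta, s, a):
    # Score function of a tabular softmax policy with parameters theta[s, a].
    p = np.exp(theta[s] - theta[s].max())
    p /= p.sum()
    g = np.zeros_like(theta)
    g[s] = -p
    g[s, a] += 1.0          # d/dtheta[s, .] log pi(a|s) = e_a - pi(.|s)
    return g

def truncated_ac_gradient(theta, traj, w_bar, phi, gamma):
    # hat{G}(theta; tau; w_bar) = sum_{j=0}^{H} gamma^j * (w_bar^T phi(s_j, a_j)) * grad log pi(a_j|s_j)
    G_hat = np.zeros_like(theta)
    for j, (s, a) in enumerate(traj):
        q_est = w_bar @ phi[s, a]                      # linear critic approximation of Q(s_j, a_j)
        G_hat += gamma**j * q_est * grad_log_softmax(theta, s, a)
    return G_hat

rng = np.random.default_rng(0)
n_states, n_actions, d, gamma, mu = 4, 3, 5, 0.95, 1e-2
H = int(np.ceil(np.log(mu) / np.log(gamma)))           # horizon from Lemma E.2: gamma**H <= mu
theta = rng.normal(size=(n_states, n_actions))
phi = rng.normal(size=(n_states, n_actions, d))
phi /= np.maximum(np.linalg.norm(phi, axis=-1, keepdims=True), 1.0)   # enforce ||phi(s, a)|| <= 1
w_bar = rng.normal(size=d)
traj = [(rng.integers(n_states), rng.integers(n_actions)) for _ in range(H + 1)]  # placeholder trajectory
print("H =", H, " ||G_hat|| =", np.linalg.norm(truncated_ac_gradient(theta, traj, w_bar, phi, gamma)))
```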

E.1 Proof of Lemma 4.10

Although Theorem 4.8 establishes the convergence of TD(0) in terms of Q_{\bar{w}_{K}} under constant step sizes, bounding the critic approximation bias requires a stronger result: the direct convergence of \bar{w}_{K} to w^{*}. In the following proofs, we build on the core results of Appendix D to establish an alternative fourth-moment bound under diminishing step sizes. This is formalized in Lemma E.3.

Lemma E.3.

Suppose w¯K\bar{w}_{K} is generated by KK steps of the Projected TD(0) algorithm with wΘw^{*}\in\Theta and diminishing step-size αt=1(t+1)ς\alpha_{t}~{}=~{}\frac{1}{(t+1)\varsigma}. Then

𝔼[ww¯K4]log2KK(192F2R2ς2log2(r1)+O(1logK)+O(1log2K))\mathbb{E}[\lVert w^{*}-\bar{w}_{K}\lVert^{4}]\leq\frac{\log^{2}K}{K}\cdot\left(\frac{192F^{2}R^{2}}{\varsigma^{2}\log^{2}(r^{-1})}+O(\frac{1}{\log K})+O(\frac{1}{\log^{2}K})\right)

Proof. See Appendix E.2.1.
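For reference, a minimal sketch of the procedure analyzed in Lemma E.3, projected TD(0) with diminishing step size \alpha_{t}=\frac{1}{(t+1)\varsigma} followed by iterate averaging, is given below. The transition-sampling interface, feature map, and constants are our own illustrative assumptions.

```python
import numpy as np

def projected_averaged_td0(sample_transition, phi, w0, K, varsigma, gamma, R):
    """Projected TD(0) with alpha_t = 1/((t+1)*varsigma); returns the averaged iterate w_bar_K.

    sample_transition() should return one observation O_t = (s, a, r, s_next, a_next) from the
    on-policy Markov chain, and phi(s, a) is a feature map with ||phi(s, a)|| <= 1.
    """
    w = np.array(w0, dtype=float)
    w_sum = np.zeros_like(w)
    for t in range(K):
        w_sum += w                                          # accumulate w_t for t = 0, ..., K - 1
        s, a, rew, s_next, a_next = sample_transition()
        feat, feat_next = phi(s, a), phi(s_next, a_next)
        td_error = rew + gamma * feat_next @ w - feat @ w   # semi-gradient g_t(w_t, O_t) = td_error * feat
        w = w + td_error * feat / ((t + 1) * varsigma)
        norm = np.linalg.norm(w)                            # projection onto the ball Theta = {||w|| <= R}
        if norm > R:
            w *= R / norm
    return w_sum / K                                        # w_bar_K = (1/K) * sum_{t=0}^{K-1} w_t
```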

Lemma E.4 follows from Lemma E.3.

Lemma E.4.

For K=O\left(\frac{\log^{2}(\mu^{-4})}{\mu^{4}}\right) iterations of Algorithm 3 as \mu\to 0 and

D=\left(\frac{192F^{2}R^{2}}{\varsigma^{2}\log^{2}(r^{-1})}+O\left(\frac{1}{\log\mu^{-4}}\right)\right)^{1/4}

we have

𝔼[ww¯K4]D4μ4\mathbb{E}[\lVert w^{*}-\bar{w}_{K}\lVert^{4}]\leq D^{4}\mu^{4}

Proof. See Appendix E.2.2.

Now we can proceed with the proof of Lemma 4.10.

Proof.

We can characterize the bias by the quality of the critic approximation achieved by the TD(0) algorithm. We can “roll up” the temporal summation as follows

qt+1\displaystyle q_{t+1} =GJ(θt)\displaystyle=G_{\infty}-\nabla J(\theta_{t})
=𝔼π,ρ0[k=0γkQw¯K,t(sk,ak)logπθ(ak|sk)]J(θ)\displaystyle=\mathbb{E}_{\pi,\rho_{0}}[\sum_{k=0}^{\infty}\gamma^{k}Q_{\bar{w}_{K,t}}(s_{k},a_{k})\nabla\log\pi_{\theta}(a_{k}|s_{k})]-\nabla J(\theta)
=𝔼π,ρ0[k=0γkQw¯K,t(sk,ak)logπθ(ak|sk)]𝔼π,ρ0[k=0γklogπ(ak|sk)Qγπ(sk,ak)]\displaystyle=\mathbb{E}_{\pi,\rho_{0}}[\sum_{k=0}^{\infty}\gamma^{k}Q_{\bar{w}_{K,t}}(s_{k},a_{k})\nabla\log\pi_{\theta}(a_{k}|s_{k})]-\mathbb{E}_{\pi,\rho_{0}}[\sum_{k=0}^{\infty}\gamma^{k}\nabla\log\pi(a_{k}|s_{k})Q_{\gamma}^{\pi}(s_{k},a_{k})]
=𝔼π,ρ0[k=0γklogπθ(ak|sk)(Qw¯K,t(sk,ak)Qγπ(sk,ak))]\displaystyle=\mathbb{E}_{\pi,\rho_{0}}[\sum_{k=0}^{\infty}\gamma^{k}\nabla\log\pi_{\theta}(a_{k}|s_{k})(Q_{\bar{w}_{K,t}}(s_{k},a_{k})-Q_{\gamma}^{\pi}(s_{k},a_{k}))]
=s𝒮a𝒜k=0(sk=s|s0)π(a|s)γklogπθ(a|s)(Qw¯K,t(s,a)Qγπ(s,a))\displaystyle=\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}\sum_{k=0}^{\infty}\mathbb{P}(s_{k}=s|s_{0})\pi(a|s)\gamma^{k}\nabla\log\pi_{\theta}(a|s)(Q_{\bar{w}_{K,t}}(s,a)-Q_{\gamma}^{\pi}(s,a))
=s𝒮a𝒜dγπ(s)π(a|s)logπθ(a|s)(Qw¯K,t(s,a)Qγπ(s,a)),\displaystyle=\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d_{\gamma}^{\pi}(s)\pi(a|s)\nabla\log\pi_{\theta}(a|s)(Q_{\bar{w}_{K,t}}(s,a)-Q_{\gamma}^{\pi}(s,a)),

where dπγ(s)=k=0γk(sk=s|s0)d^{\pi}_{\gamma}(s)=\sum_{k=0}^{\infty}\gamma^{k}\mathbb{P}(s_{k}=s|s_{0}) denotes the discounted state visitation measure. Since we have

s𝒮a𝒜dγπ(s)π(a|s)=s𝒮dγπ(s)=11γ,\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d_{\gamma}^{\pi}(s)\pi(a|s)=\sum_{s\in\mathcal{S}}d_{\gamma}^{\pi}(s)=\frac{1}{1-\gamma},

then by Jensen’s inequality, we can obtain the following bound

qt+14\displaystyle\lVert q_{t+1}\lVert^{4} (1γ)s𝒮a𝒜dγπ(s)π(a|s)logπθ(a|s)(Qw¯K,t(s,a)Qγπ(s,a))4\displaystyle\leq(1-\gamma)\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d_{\gamma}^{\pi}(s)\pi(a|s)\lVert\nabla\log\pi_{\theta}(a|s)(Q_{\bar{w}_{K,t}}(s,a)-Q_{\gamma}^{\pi}(s,a))\lVert^{4}
(1γ)s𝒮a𝒜dγπ(s)π(a|s)G4Qw¯K,t(s,a)Qγπ(s,a)4\displaystyle\leq(1-\gamma)\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d_{\gamma}^{\pi}(s)\pi(a|s)G^{4}\lVert Q_{\bar{w}_{K,t}}(s,a)-Q_{\gamma}^{\pi}(s,a)\lVert^{4}
=(1γ)s𝒮a𝒜dγπ(s)π(a|s)G4w¯K,tϕ(s,a)wϕ(s,a)4\displaystyle=(1-\gamma)\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d_{\gamma}^{\pi}(s)\pi(a|s)G^{4}\lVert\bar{w}_{K,t}^{\intercal}\phi(s,a)-{w^{*}}^{\intercal}\phi(s,a)\lVert^{4}
(1γ)s𝒮a𝒜dγπ(s)π(a|s)G4ϕ(s,a)4w¯K,tw4\displaystyle\leq(1-\gamma)\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d_{\gamma}^{\pi}(s)\pi(a|s)G^{4}\lVert\phi(s,a)\lVert^{4}\cdot\lVert\bar{w}_{K,t}-{w^{*}}\lVert^{4}
(1γ)s𝒮a𝒜dγπ(s)π(a|s)G4w¯K,tw4\displaystyle\leq(1-\gamma)\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}d_{\gamma}^{\pi}(s)\pi(a|s)G^{4}\cdot\lVert\bar{w}_{K,t}-{w^{*}}\lVert^{4}
G4w¯K,tw4.\displaystyle\leq G^{4}\cdot\lVert\bar{w}_{K,t}-{w^{*}}\lVert^{4}.

Now we take expectations on both sides with respect to the randomness of the TD(0) updates performed between timesteps t and t+1 to obtain

𝔼[qt+14]G4𝔼[w¯K,tw4].\mathbb{E}[\lVert q_{t+1}\lVert^{4}]\leq G^{4}\cdot\mathbb{E}[\lVert\bar{w}_{K,t}-{w^{*}}\lVert^{4}].

Finally, we can apply Lemma E.4 to achieve the final result. ∎

E.2 Proof of Key Lemmas

E.2.1 Proof of Lemma E.3

We first require the following auxiliary lemma.

Lemma E.5.

Recall the definition of Aθ=𝔼(s,a)ηθ,(s,a)P(s,a,)[ϕ(s,a)(ϕ(s,a)γϕ(s,a))]A_{\theta}=\mathbb{E}_{(s,a)\sim\eta_{\theta},(s^{\prime},a^{\prime})\sim P(s,a,\cdot)}[\phi(s,a)(\phi(s,a)-\gamma\phi(s^{\prime},a^{\prime}))^{\intercal}]. Then

(wwt)g¯(wt)12λmin(Aθ+AθT)wwt212ςwtw2(w^{*}-w_{t})^{\intercal}\bar{g}(w_{t})\geq\frac{1}{2}\lambda_{min}(A_{\theta}+A_{\theta}^{T})\lVert w^{*}-w_{t}\lVert^{2}\geq\frac{1}{2}\varsigma\lVert w_{t}-w^{*}\lVert^{2}

where ς\varsigma is defined in Assumption 4.6.

Proof. Since w^{*} is the TD(0) fixed point, \bar{g}(w^{*})=0, and therefore
(w^{*}-w_{t})^{\intercal}\bar{g}(w_{t})=(w^{*}-w_{t})^{\intercal}(\bar{g}(w_{t})-\bar{g}(w^{*}))
=(w^{*}-w_{t})^{\intercal}\mathbb{E}[(\gamma\phi(s^{\prime},a^{\prime})^{\intercal}w_{t}-\phi(s,a)^{\intercal}w_{t}-\gamma\phi(s^{\prime},a^{\prime})^{\intercal}w^{*}+\phi(s,a)^{\intercal}w^{*})\phi(s,a)]
=(wwt)𝔼[(γϕ(s,a)ϕ(s,a))(wtw)ϕ(s,a)]\displaystyle=(w^{*}-w_{t})^{\intercal}\mathbb{E}[(\gamma\phi(s^{\prime},a^{\prime})-\phi(s,a))^{\intercal}(w_{t}-w^{*})\phi(s,a)]
=(wwt)Aθ(wwt)\displaystyle=(w^{*}-w_{t})^{\intercal}A_{\theta}(w^{*}-w_{t})
=12(wwt)(Aθ+AθT)(wwt)\displaystyle=\frac{1}{2}(w^{*}-w_{t})^{\intercal}(A_{\theta}+A_{\theta}^{T})(w^{*}-w_{t})
\geq\frac{1}{2}\lambda_{min}(A_{\theta}+A_{\theta}^{T})\lVert w^{*}-w_{t}\rVert^{2}\geq\frac{1}{2}\varsigma\lVert w^{*}-w_{t}\rVert^{2},
where the last inequality follows from the definition of \varsigma in Assumption 4.6. ∎
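As a sanity check on Lemma E.5, the following sketch builds A_{\theta} and \bar{g} exactly for a small synthetic on-policy chain over state-action pairs (a toy construction of ours; none of the quantities are taken from the paper) and verifies the chain of inequalities numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, gamma = 6, 3, 0.9                        # number of state-action pairs, feature dim, discount

# Toy on-policy chain over state-action pairs, with stationary distribution eta.
P = rng.random((n, n))
P /= P.sum(axis=1, keepdims=True)
evals, evecs = np.linalg.eig(P.T)
eta = np.real(evecs[:, np.argmax(np.real(evals))])
eta /= eta.sum()

Phi = rng.normal(size=(n, d))
Phi /= np.maximum(np.linalg.norm(Phi, axis=1, keepdims=True), 1.0)   # ||phi|| <= 1
rew = rng.random(n)

# A_theta = E[phi (phi - gamma * phi')^T],  b = E[r * phi],  so gbar(w) = b - A_theta w.
D = np.diag(eta)
A = Phi.T @ D @ (Phi - gamma * P @ Phi)
b = Phi.T @ D @ rew
w_star = np.linalg.solve(A, b)                 # TD fixed point: gbar(w_star) = 0

w = rng.normal(size=d)
gbar = b - A @ w
lhs = (w_star - w) @ gbar
quad = 0.5 * (w_star - w) @ (A + A.T) @ (w_star - w)
eig_bound = 0.5 * np.linalg.eigvalsh(A + A.T).min() * np.linalg.norm(w_star - w) ** 2
print(f"(w*-w)^T gbar(w) = {lhs:.4f}, quadratic form = {quad:.4f}, eigenvalue bound = {eig_bound:.4f}")
assert np.isclose(lhs, quad) and lhs >= eig_bound - 1e-10
```

The first two printed quantities agree exactly, reflecting the identity (w^{*}-w_{t})^{\intercal}\bar{g}(w_{t})=(w^{*}-w_{t})^{\intercal}A_{\theta}(w^{*}-w_{t}) used in the proof.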

Now we can proceed with the proof of Lemma E.3.

Proof.
wwt+12\displaystyle\lVert w^{*}-w_{t+1}\lVert^{2} =wProjΘ(wt+αtgt(wt))2\displaystyle=\lVert w^{*}-Proj_{\Theta}(w_{t}+\alpha_{t}g_{t}(w_{t}))\lVert^{2}
=ProjΘ(w)ProjΘ(wt+αtgt(wt))2\displaystyle=\lVert Proj_{\Theta}(w^{*})-Proj_{\Theta}(w_{t}+\alpha_{t}g_{t}(w_{t}))\lVert^{2}
wwtαtgt(wt)2\displaystyle\leq\lVert w^{*}-w_{t}-\alpha_{t}g_{t}(w_{t})\lVert^{2}
=wwt22αtgt(wt)(wwt)+αt2gt(wt)2\displaystyle=\lVert w^{*}-w_{t}\lVert^{2}-2\alpha_{t}g_{t}(w_{t})^{\intercal}(w^{*}-w_{t})+\alpha_{t}^{2}\lVert g_{t}(w_{t})\lVert^{2}
\leq\lVert w^{*}-w_{t}\rVert^{2}-2\alpha_{t}\bar{g}(w_{t})^{\intercal}(w^{*}-w_{t})+2\alpha_{t}\zeta_{t}(w_{t})+\alpha_{t}^{2}F^{2},
where the inequality uses \lVert g_{t}(w_{t})\rVert\leq F.

So we have the following fourth moment bound (since both sides of the inequality are positive):

wwt+14\displaystyle\lVert w^{*}-w_{t+1}\lVert^{4} [wwt22αtg¯(wt)(wwt)+2αtζt(wt)+αt2F2]2\displaystyle\leq[\lVert w^{*}-w_{t}\lVert^{2}-2\alpha_{t}\bar{g}(w_{t})^{\intercal}(w^{*}-w_{t})+2\alpha_{t}\zeta_{t}(w_{t})+\alpha_{t}^{2}F^{2}\big{]}^{2}
=wwt4+4αt2(g¯(wt)(wwt))2+4αt2(ζt(wt))2+αt4F4\displaystyle=\lVert w^{*}-w_{t}\lVert^{4}+4\alpha_{t}^{2}(\bar{g}(w_{t})^{\intercal}(w^{*}-w_{t}))^{2}+4\alpha_{t}^{2}(\zeta_{t}(w_{t}))^{2}+\alpha_{t}^{4}F^{4}
4αtwwt2(g¯(wt)(wwt))+4αtwwt2ζt(wt)+2αt2F2wwt2\displaystyle-4\alpha_{t}\lVert w^{*}-w_{t}\lVert^{2}(\bar{g}(w_{t})^{\intercal}(w^{*}-w_{t}))+4\alpha_{t}\lVert w^{*}-w_{t}\lVert^{2}\zeta_{t}(w_{t})+2\alpha_{t}^{2}F^{2}\lVert w^{*}-w_{t}\lVert^{2}
8αt2(g¯(wt)(wwt))ζt(wt)4αt3F2(g¯(wt)(wwt))+4αt3F2ζt(wt)\displaystyle-8\alpha_{t}^{2}(\bar{g}(w_{t})^{\intercal}(w^{*}-w_{t}))\zeta_{t}(w_{t})-4\alpha_{t}^{3}F^{2}(\bar{g}(w_{t})^{\intercal}(w^{*}-w_{t}))+4\alpha_{t}^{3}F^{2}\zeta_{t}(w_{t})

By Lemma E.5,

wwt+14\displaystyle\lVert w^{*}-w_{t+1}\lVert^{4}\leq wwt4+4αt2(g¯(wt)(wwt))2+4αt2(ζt(wt))2+αt4F4\displaystyle\lVert w^{*}-w_{t}\lVert^{4}+4\alpha_{t}^{2}(\bar{g}(w_{t})^{\intercal}(w^{*}-w_{t}))^{2}+4\alpha_{t}^{2}(\zeta_{t}(w_{t}))^{2}+\alpha_{t}^{4}F^{4}
2αtwwt2ςwwt2+4αtwwt2ζt(wt)+2αt2F2wwt2\displaystyle-2\alpha_{t}\lVert w^{*}-w_{t}\lVert^{2}\varsigma\lVert w^{*}-w_{t}\lVert^{2}+4\alpha_{t}\lVert w^{*}-w_{t}\lVert^{2}\zeta_{t}(w_{t})+2\alpha_{t}^{2}F^{2}\lVert w^{*}-w_{t}\lVert^{2}
+4αt2ςwwt2|ζt(wt)|2αt3F2ςwwt2+4αt3F2ζt(wt)\displaystyle+4\alpha_{t}^{2}\varsigma\lVert w^{*}-w_{t}\lVert^{2}|\zeta_{t}(w_{t})|-2\alpha_{t}^{3}F^{2}\varsigma\lVert w^{*}-w_{t}\lVert^{2}+4\alpha_{t}^{3}F^{2}\zeta_{t}(w_{t})
\displaystyle\leq (12αtς)wwt4+4αt2(g¯(wt)(wwt))2+4αt2(ζt(wt))2+αt4F4\displaystyle(1-2\alpha_{t}\varsigma)\lVert w^{*}-w_{t}\lVert^{4}+4\alpha_{t}^{2}(\bar{g}(w_{t})^{\intercal}(w^{*}-w_{t}))^{2}+4\alpha_{t}^{2}(\zeta_{t}(w_{t}))^{2}+\alpha_{t}^{4}F^{4}
+(4αt+4αt2ς)wwt2|ζt(wt)|+(2αt2F22αt3F2ς)wwt2+4αt3F2ζt(wt).\displaystyle+(4\alpha_{t}+4\alpha_{t}^{2}\varsigma)\lVert w^{*}-w_{t}\lVert^{2}|\zeta_{t}(w_{t})|+(2\alpha_{t}^{2}F^{2}-2\alpha_{t}^{3}F^{2}\varsigma)\lVert w^{*}-w_{t}\lVert^{2}+4\alpha_{t}^{3}F^{2}\zeta_{t}(w_{t}).

We can utilize the bounds wR\lVert w\lVert\leq R, |ζt(w)|2F2|\zeta_{t}(w)|\leq 2F^{2}, gt(w)F\lVert g_{t}(w)\lVert\leq F from Lemma D.1 to arrive at

wwt+14\displaystyle\lVert w^{*}-w_{t+1}\lVert^{4}\leq (12αtς)wwt4+16αt2F2R2+16αt2F4+αt4F4\displaystyle(1-2\alpha_{t}\varsigma)\lVert w^{*}-w_{t}\lVert^{4}+16\alpha_{t}^{2}F^{2}R^{2}+16\alpha_{t}^{2}F^{4}+\alpha_{t}^{4}F^{4}
+(4αt+4αt2ς)wwt2|ζt(wt)|+(2αt2F22αt3F2ς)4R2+8αt3F4\displaystyle+(4\alpha_{t}+4\alpha_{t}^{2}\varsigma)\lVert w^{*}-w_{t}\lVert^{2}|\zeta_{t}(w_{t})|+(2\alpha_{t}^{2}F^{2}-2\alpha_{t}^{3}F^{2}\varsigma)4R^{2}+8\alpha_{t}^{3}F^{4}
\displaystyle\leq (12αtς)wwt4+(8αt+8αt2ς)R2|ζt(wt)|+24αt2F2R2+16αt2F4\displaystyle(1-2\alpha_{t}\varsigma)\lVert w^{*}-w_{t}\lVert^{4}+(8\alpha_{t}+8\alpha_{t}^{2}\varsigma)R^{2}|\zeta_{t}(w_{t})|+24\alpha_{t}^{2}F^{2}R^{2}+16\alpha_{t}^{2}F^{4}
+αt4F48αt3F2ςR2+8αt3F4.\displaystyle+\alpha_{t}^{4}F^{4}-8\alpha_{t}^{3}F^{2}\varsigma R^{2}+8\alpha_{t}^{3}F^{4}.

After some rearrangement of terms, and then dividing by \alpha_{t}\varsigma and taking expectations, we have

αtςwwt4\displaystyle\alpha_{t}\varsigma\lVert w^{*}-w_{t}\lVert^{4}\leq (1αtς)wwt4wwt+14+(8αt+8αt2ς)R2|ζt(wt)|\displaystyle(1-\alpha_{t}\varsigma)\lVert w^{*}-w_{t}\lVert^{4}-\lVert w^{*}-w_{t+1}\lVert^{4}+(8\alpha_{t}+8\alpha_{t}^{2}\varsigma)R^{2}|\zeta_{t}(w_{t})|
+24αt2F2R2+16αt2F4+αt4F48αt3F2ςR2+8αt3F4\displaystyle+24\alpha_{t}^{2}F^{2}R^{2}+16\alpha_{t}^{2}F^{4}+\alpha_{t}^{4}F^{4}-8\alpha_{t}^{3}F^{2}\varsigma R^{2}+8\alpha_{t}^{3}F^{4}
𝔼[wwt4]\displaystyle\mathbb{E}[\lVert w^{*}-w_{t}\lVert^{4}]\leq (1αtς1)𝔼[wwt4]1αtς𝔼[wwt+14]+(8ς+8αt)R2𝔼[|ζt(wt)|]\displaystyle(\frac{1}{\alpha_{t}\varsigma}-1)\mathbb{E}[\lVert w^{*}-w_{t}\lVert^{4}]-\frac{1}{\alpha_{t}\varsigma}\mathbb{E}[\lVert w^{*}-w_{t+1}\lVert^{4}]+(\frac{8}{\varsigma}+8\alpha_{t})R^{2}\mathbb{E}[|\zeta_{t}(w_{t})|]
+24αtF2R2ς+16αtF4ς+αt3F4ς8αt2F2R2+8αt2F4ς.\displaystyle+\frac{24\alpha_{t}F^{2}R^{2}}{\varsigma}+\frac{16\alpha_{t}F^{4}}{\varsigma}+\frac{\alpha_{t}^{3}F^{4}}{\varsigma}-8\alpha_{t}^{2}F^{2}R^{2}+\frac{8\alpha_{t}^{2}F^{4}}{\varsigma}.

Substituting \alpha_{t}=\frac{1}{(t+1)\varsigma}, we have

𝔼[wwt4]\displaystyle\mathbb{E}[\lVert w^{*}-w_{t}\lVert^{4}]\leq t𝔼[wwt4](t+1)𝔼[wwt+14]+(8ς+8(t+1)ς)R2𝔼[|ζt(wt)|]\displaystyle t\mathbb{E}[\lVert w^{*}-w_{t}\lVert^{4}]-(t+1)\mathbb{E}[\lVert w^{*}-w_{t+1}\lVert^{4}]+(\frac{8}{\varsigma}+\frac{8}{(t+1)\varsigma})R^{2}\mathbb{E}[|\zeta_{t}(w_{t})|]
+24F2R2ς2(t+1)+16F4ς2(t+1)+F4ς4(t+1)38F2R2(t+1)2ς2+8F4ς3(t+1)2.\displaystyle+\frac{24F^{2}R^{2}}{\varsigma^{2}(t+1)}+\frac{16F^{4}}{\varsigma^{2}(t+1)}+\frac{F^{4}}{\varsigma^{4}(t+1)^{3}}-\frac{8F^{2}R^{2}}{(t+1)^{2}\varsigma^{2}}+\frac{8F^{4}}{\varsigma^{3}(t+1)^{2}}.

Then we sum both sides from t=0 to K-1 and divide by K, using the facts that \sum_{t=1}^{K}\frac{1}{t^{2}}\leq\frac{\pi^{2}}{6} and \sum_{t=1}^{K}\frac{1}{t}=\log(K)+O(1), to conclude

1Kt=0K1𝔼[wwt4]\displaystyle\frac{1}{K}\sum_{t=0}^{K-1}\mathbb{E}[\lVert w^{*}-w_{t}\lVert^{4}]\leq 1Kt=0K1[t𝔼[wwt4](t+1)𝔼[wwt+14]]+1Kt=0K1(8ς+8(t+1)ς)R2𝔼[|ζt(wt)|]\displaystyle\frac{1}{K}\sum_{t=0}^{K-1}[t\mathbb{E}[\lVert w^{*}-w_{t}\lVert^{4}]-(t+1)\mathbb{E}[\lVert w^{*}-w_{t+1}\lVert^{4}]]+\frac{1}{K}\sum_{t=0}^{K-1}(\frac{8}{\varsigma}+\frac{8}{(t+1)\varsigma})R^{2}\mathbb{E}[|\zeta_{t}(w_{t})|]
+1Kt=0K1[24F2R2ς2(t+1)+16F4ς2(t+1)+F4ς4(t+1)2+8F4ς3(t+1)2]\displaystyle+\frac{1}{K}\sum_{t=0}^{K-1}[\frac{24F^{2}R^{2}}{\varsigma^{2}(t+1)}+\frac{16F^{4}}{\varsigma^{2}(t+1)}+\frac{F^{4}}{\varsigma^{4}(t+1)^{2}}+\frac{8F^{4}}{\varsigma^{3}(t+1)^{2}}]
\displaystyle\leq w1w4K+1Kt=0K1(8ς+8(t+1)ς)R2𝔼[|ζt(wt)|]+1Kt=0K1[24F2R2ς2(t+1)+16F4ς2(t+1)]\displaystyle\frac{\lVert w_{1}-w^{*}\lVert^{4}}{K}+\frac{1}{K}\sum_{t=0}^{K-1}(\frac{8}{\varsigma}+\frac{8}{(t+1)\varsigma})R^{2}\mathbb{E}[|\zeta_{t}(w_{t})|]+\frac{1}{K}\sum_{t=0}^{K-1}[\frac{24F^{2}R^{2}}{\varsigma^{2}(t+1)}+\frac{16F^{4}}{\varsigma^{2}(t+1)}]
+4F4π23ς3K+F4π26ς4K\displaystyle+\frac{4F^{4}\pi^{2}}{3\varsigma^{3}K}+\frac{F^{4}\pi^{2}}{6\varsigma^{4}K}
\displaystyle\leq w1w4K+1Kt=0K116R2ς𝔼[|ζt(wt)|]+(24F2R2+16F4ς2)logK+O(1)K\displaystyle\frac{\lVert w_{1}-w^{*}\lVert^{4}}{K}+\frac{1}{K}\sum_{t=0}^{K-1}\frac{16R^{2}}{\varsigma}\mathbb{E}[|\zeta_{t}(w_{t})|]+(\frac{24F^{2}R^{2}+16F^{4}}{\varsigma^{2}})\frac{\log K+O(1)}{K}
+4F4π23ς3K+F4π26ς4K.\displaystyle+\frac{4F^{4}\pi^{2}}{3\varsigma^{3}K}+\frac{F^{4}\pi^{2}}{6\varsigma^{4}K}.

To bound the summation on the right-hand side, we can apply the results of Lemma D.3 for t\leq K-1, with \tau_{0}=\tau^{\text{\tiny mix}}(\alpha_{K-1})\leq\frac{\log(m\varsigma K)}{\log(r^{-1})}. Then we have

t=0K1𝔼[|ζt(wt)|]\displaystyle\sum_{t=0}^{K-1}\mathbb{E}[|\zeta_{t}(w_{t})|] F2(8+6τ0)t=0K1αt+10F2mt=0K1rt\displaystyle\leq F^{2}(8+6\tau_{0})\sum_{t=0}^{K-1}\alpha_{t^{*}}+10F^{2}m\sum_{t=0}^{K-1}r^{t}
=F2(8+6τ0)t=0τ0α0+F2(8+6τ0)t=τ0+1K1αt+10F2mt=0K1rt\displaystyle=F^{2}(8+6\tau_{0})\sum_{t=0}^{\tau_{0}}\alpha_{0}+F^{2}(8+6\tau_{0})\sum_{t=\tau_{0}+1}^{K-1}\alpha_{t}+10F^{2}m\sum_{t=0}^{K-1}r^{t}
F2(8+6τ0)t=0τ0α0+F2(8+6τ0)t=τ0+1K1αt+10F2m1r\displaystyle\leq F^{2}(8+6\tau_{0})\sum_{t=0}^{\tau_{0}}\alpha_{0}+F^{2}(8+6\tau_{0})\sum_{t=\tau_{0}+1}^{K-1}\alpha_{t}+\frac{10F^{2}m}{1-r}
=F2(8+6τ0)τ0α0+F2(8+6τ0)t=τ0+1K1αt+10F2m1r\displaystyle=F^{2}(8+6\tau_{0})\tau_{0}\alpha_{0}+F^{2}(8+6\tau_{0})\sum_{t=\tau_{0}+1}^{K-1}\alpha_{t}+\frac{10F^{2}m}{1-r}
8F2ςlog(mςK)log(r1)+6F2ςlog2(mςK)log2(r1)+F2ς(8+6log(mςK)log(r1))(logK+O(1))+10F2m1r\displaystyle\leq\frac{8F^{2}}{\varsigma}\frac{\log(m\varsigma K)}{\log(r^{-1})}+\frac{6F^{2}}{\varsigma}\frac{\log^{2}(m\varsigma K)}{\log^{2}(r^{-1})}+\frac{F^{2}}{\varsigma}(8+6\frac{\log(m\varsigma K)}{\log(r^{-1})})(\log K+O(1))+\frac{10F^{2}m}{1-r}
(6F2ςlog2(r1)+6F2ςlog(r1))log2K+O(log(K))+O(1)\displaystyle\leq(\frac{6F^{2}}{\varsigma\log^{2}(r^{-1})}+\frac{6F^{2}}{\varsigma\log(r^{-1})})\log^{2}K+O(\log(K))+O(1)
12F2ςlog2(r1)log2K+O(log(K))+O(1).\displaystyle\leq\frac{12F^{2}}{\varsigma\log^{2}(r^{-1})}\log^{2}K+O(\log(K))+O(1).

Substituting this bound back into the original expression, we have

1Kt=0K1𝔼[wwt4]\displaystyle\frac{1}{K}\sum_{t=0}^{K-1}\mathbb{E}[\lVert w^{*}-w_{t}\lVert^{4}]\leq (192F2R2ς2log2(r1)+O(1logK)+O(1log2K))log2KK,\displaystyle(\frac{192F^{2}R^{2}}{\varsigma^{2}\log^{2}(r^{-1})}+O(\frac{1}{\log K})+O(\frac{1}{\log^{2}K}))\cdot\frac{\log^{2}K}{K},

and by Jensen’s inequality, we obtain

\mathbb{E}[\lVert w^{*}-\bar{w}_{K}\rVert^{4}]\leq\frac{1}{K}\sum_{t=0}^{K-1}\mathbb{E}[\lVert w^{*}-w_{t}\rVert^{4}]\leq\left(\frac{192F^{2}R^{2}}{\varsigma^{2}\log^{2}(r^{-1})}+O\left(\frac{1}{\log K}\right)+O\left(\frac{1}{\log^{2}K}\right)\right)\cdot\frac{\log^{2}K}{K}. ∎

E.2.2 Proof of Lemma E.4

Proof. We establish a bound on \mathbb{E}[\lVert w^{*}-\bar{w}_{K}\rVert^{4}] via Lemma E.3. Let \epsilon=\mu^{2}. Since the bound in Lemma E.3 scales as \frac{\log^{2}K}{K}, it suffices to find K large enough such that \frac{\log^{2}K}{K}\leq\epsilon^{2}, or equivalently

log(K)ϵK\log(K)\leq\epsilon\sqrt{K}

We look for the asymptotic solution to

log(K)=ϵK\log(K)=\epsilon\sqrt{K}

in the form K=k1ϵ2K=\frac{k_{1}}{\epsilon^{2}}. Plugging this in, we get

log(k1)+log(1ϵ2)=k1\log(k_{1})+\log(\frac{1}{\epsilon^{2}})=\sqrt{k_{1}}

By the method of dominant balance, we have

k1\displaystyle\sqrt{k_{1}} log(1ϵ2)\displaystyle\approx\log(\frac{1}{\epsilon^{2}})
k1\displaystyle k_{1} log2(1ϵ2)\displaystyle\approx\log^{2}(\frac{1}{\epsilon^{2}})

So we have K=O\left(\frac{\log^{2}(\frac{1}{\epsilon^{2}})}{\epsilon^{2}}\right)=O\left(\frac{\log^{2}(\mu^{-4})}{\mu^{4}}\right), and for such K, Lemma E.3 yields \mathbb{E}[\lVert w^{*}-\bar{w}_{K}\rVert^{4}]\leq D^{4}\mu^{4} with D as defined in the statement of Lemma E.4. ∎
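The dominant-balance choice can be sanity-checked numerically; in the sketch below the factor of 4 stands in for the unspecified constant hidden in the O(\cdot) and is an illustrative choice of ours.

```python
import numpy as np

for mu in [0.1, 0.05, 0.01]:
    eps = mu**2
    # K on the order of log^2(1/eps^2) / eps^2; the factor 4 is an illustrative O(.) constant.
    K = int(np.ceil(4 * np.log(1.0 / eps**2) ** 2 / eps**2))
    lhs, rhs = np.log(K), eps * np.sqrt(K)
    print(f"mu = {mu}: K = {K}, log K = {lhs:.2f} <= eps * sqrt(K) = {rhs:.2f} -> {lhs <= rhs}")
```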

Appendix F Policy Gradient Theorem

The original policy gradient theorem derived in (Sutton et al., 1999) addresses the gradient of the value function at a fixed initial state s_{0}, which is a slightly different objective that we denote by \tilde{J}(\theta):

J~(θ)=Vπθ(s0)=𝔼πθ[k=0γk(sk,ak)|s0]\tilde{J}(\theta)=V^{\pi_{\theta}}(s_{0})=\mathbb{E}_{\pi_{\theta}}[\sum_{k=0}^{\infty}\gamma^{k}\mathcal{R}(s_{k},a_{k})|s_{0}]

Then from (Sutton et al., 1999) we have the original policy gradient theorem:

J~(θ)=s𝒮dπ(s)a𝒜πθ(s,a)Qπθ(s,a)\nabla\tilde{J}(\theta)=\sum_{s\in\mathcal{S}}d^{\pi}(s)\sum_{a\in\mathcal{A}}\nabla\pi_{\theta}(s,a)Q^{\pi_{\theta}}(s,a)

where

dπ(s)=k=0γkPr(sk=s|s0,π)d^{\pi}(s)=\sum_{k=0}^{\infty}\gamma^{k}Pr(s_{k}=s|s_{0},\pi)

We consider instead the expectation of the value function over an initial state distribution:

J(θ)=𝔼ρ0[Vπθ(s0)]J(\theta)=\mathbb{E}_{\rho_{0}}[V^{\pi_{\theta}}(s_{0})]

Then

J(θ)=s𝒮dπθ,ρ0(s)a𝒜πθ(a|s)Qπ(s,a)\nabla J(\theta)=\sum_{s\in\mathcal{S}}d^{\pi_{\theta},\rho_{0}}(s)\sum_{a\in\mathcal{A}}\nabla\pi_{\theta}(a|s)Q^{\pi}(s,a)

where

dπθ,ρ0(s)=k=0γk𝔼ρ0[Pr(sk=s|s0,πθ)]d^{\pi_{\theta},\rho_{0}}(s)=\sum_{k=0}^{\infty}\gamma^{k}\mathbb{E}_{\rho_{0}}[Pr(s_{k}=s|s_{0},\pi_{\theta})]
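As an illustration, the following sketch checks the policy gradient theorem on a small synthetic tabular MDP with a softmax policy (entirely our own toy construction): it computes Q^{\pi_{\theta}} and d^{\pi_{\theta},\rho_{0}} exactly by solving the corresponding linear systems and compares the formula above with a finite-difference gradient of J(\theta).

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 4, 3, 0.9
P = rng.random((nS, nA, nS))
P /= P.sum(axis=2, keepdims=True)          # transition kernel P(s'|s, a)
Rsa = rng.random((nS, nA))                 # reward R(s, a)
rho0 = np.full(nS, 1.0 / nS)               # initial state distribution

def policy(theta):
    e = np.exp(theta - theta.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)               # softmax pi_theta(a|s)

def q_and_v(theta):
    pi = policy(theta)
    P_sa = np.einsum('sap,pb->sapb', P, pi).reshape(nS * nA, nS * nA)   # chain over (s, a) pairs
    q = np.linalg.solve(np.eye(nS * nA) - gamma * P_sa, Rsa.reshape(-1)).reshape(nS, nA)
    return q, (pi * q).sum(axis=1)                        # Q^pi and V^pi

def J(theta):
    return rho0 @ q_and_v(theta)[1]

def grad_policy_gradient_theorem(theta):
    pi = policy(theta)
    P_s = np.einsum('sap,sa->sp', P, pi)                  # state chain under pi_theta
    d = np.linalg.solve(np.eye(nS) - gamma * P_s.T, rho0) # d(s) = sum_k gamma^k E_rho0[Pr(s_k = s)]
    q, v = q_and_v(theta)
    # For a tabular softmax policy, sum_a grad pi(a|s) Q(s, a) has entries pi(a|s) * (Q(s, a) - V(s)).
    return d[:, None] * pi * (q - v[:, None])

theta = rng.normal(size=(nS, nA))
g_exact = grad_policy_gradient_theorem(theta)
g_fd = np.zeros_like(theta)
h = 1e-6
for s in range(nS):
    for a in range(nA):
        tp, tm = theta.copy(), theta.copy()
        tp[s, a] += h
        tm[s, a] -= h
        g_fd[s, a] = (J(tp) - J(tm)) / (2 * h)
print("max |policy-gradient-theorem formula - finite differences| =", np.abs(g_exact - g_fd).max())
```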

When we implement the “log-likelihood trick”, we have

\nabla J(\theta)=\sum_{s}d^{\pi_{\theta},\rho_{0}}(s)\sum_{a}\pi_{\theta}(a|s)\nabla\log\pi_{\theta}(a|s)Q^{\pi_{\theta}}(s,a)

We can “unroll” this result as in (Wu et al., 2022) to acquire the temporal formulation:

\nabla J(\theta)=\mathbb{E}_{\pi,\rho_{0}}[\sum_{k=0}^{\infty}\gamma^{k}\nabla\log\pi(a_{k}|s_{k})Q^{\pi}(s_{k},a_{k})]

To derive the GPOMDP estimator from this result, we use the definition of QπQ^{\pi}

Qπ(s,a)=𝔼π[t=0γt(st,at)|s0=s,a0=a]=𝔼π[t=0γt(st+k,at+k)|sk=s,ak=a]Q^{\pi}(s,a)=\mathbb{E}_{\pi}[\sum_{t=0}^{\infty}\gamma^{t}\mathcal{R}(s_{t},a_{t})|s_{0}=s,a_{0}=a]=\mathbb{E}_{\pi}[\sum_{t=0}^{\infty}\gamma^{t}\mathcal{R}(s_{t+k},a_{t+k})|s_{k}=s,a_{k}=a]
J(θ)\displaystyle\nabla J(\theta) =𝔼π,ρ0[k=0γklogπ(ak|sk)𝔼π[t=0γtR(st+k,at+k)|sk=sk,ak=ak]]\displaystyle=\mathbb{E}_{\pi,\rho_{0}}[\sum_{k=0}^{\infty}\gamma^{k}\nabla\log\pi(a_{k}|s_{k})\mathbb{E}_{\pi}[\sum_{t=0}^{\infty}\gamma^{t}R(s^{\prime}_{t+k},a^{\prime}_{t+k})|s^{\prime}_{k}=s_{k},a^{\prime}_{k}=a_{k}]]
=𝔼π,ρ0[k=0𝔼π[t=0γk+tlogπ(ak|sk)(st+k,at+k)|sk=sk,ak=ak]]\displaystyle=\mathbb{E}_{\pi,\rho_{0}}[\sum_{k=0}^{\infty}\mathbb{E}_{\pi}[\sum_{t=0}^{\infty}\gamma^{k+t}\nabla\log\pi(a_{k}|s_{k})\mathcal{R}(s^{\prime}_{t+k},a^{\prime}_{t+k})|s^{\prime}_{k}=s_{k},a^{\prime}_{k}=a_{k}]]
=𝔼π,ρ0[k=0t=0γk+tlogπ(ak|sk)(st+k,at+k)]\displaystyle=\mathbb{E}_{\pi,\rho_{0}}[\sum_{k=0}^{\infty}\sum_{t=0}^{\infty}\gamma^{k+t}\nabla\log\pi(a_{k}|s_{k})\mathcal{R}(s_{t+k},a_{t+k})]
=𝔼π,ρ0[k=0t=kγtlogπ(ak|sk)(st,at)]\displaystyle=\mathbb{E}_{\pi,\rho_{0}}[\sum_{k=0}^{\infty}\sum_{t=k}^{\infty}\gamma^{t}\nabla\log\pi(a_{k}|s_{k})\mathcal{R}(s_{t},a_{t})]

Hence, the gradient can be written as an expectation over trajectories \tau\sim p(\cdot|\theta), so that a single sampled trajectory yields an unbiased estimator of the policy gradient:

J(θ)=𝔼τp(|θ)[logpθ(τ)(τ)].\nabla J(\theta)=\mathbb{E}_{\tau\sim p(\cdot|\theta)}[\nabla\log p_{\theta}(\tau)\mathcal{R}(\tau)].
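A single-trajectory Monte-Carlo version of this estimator (the GPOMDP form derived above, truncated to a finite horizon; the tabular softmax policy and helper names are our own illustrative choices) can be sketched as follows.

```python
import numpy as np

def grad_log_softmax(theta, s, a):
    # Score function grad_theta log pi_theta(a|s) for a tabular softmax policy theta[s, a].
    p = np.exp(theta[s] - theta[s].max())
    p /= p.sum()
    g = np.zeros_like(theta)
    g[s] = -p
    g[s, a] += 1.0
    return g

def gpomdp_estimator(theta, states, actions, rewards, gamma):
    # sum_k grad log pi(a_k|s_k) * sum_{t >= k} gamma^t R(s_t, a_t), from one truncated trajectory.
    disc = np.array([gamma**t * r for t, r in enumerate(rewards)])
    tails = np.cumsum(disc[::-1])[::-1]                  # tails[k] = sum_{t >= k} gamma^t R(s_t, a_t)
    grad = np.zeros_like(theta)
    for k, (s, a) in enumerate(zip(states, actions)):
        grad += tails[k] * grad_log_softmax(theta, s, a)
    return grad

# Averaging gpomdp_estimator over independently sampled trajectories approximates grad J(theta).
```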