
Fractal Landscapes in Policy Optimization

Abstract

Policy gradient lies at the core of reinforcement learning (RL) in continuous spaces. Although it has achieved much success, it is often observed in practice that RL training with policy gradient can fail for a variety of reasons, even for standard control problems that can be solved with other methods. We propose a framework for understanding one inherent limitation of the policy gradient approach: the optimization landscape in the policy space can be extremely non-smooth or fractal, so that there may not be gradients to estimate in the first place. We draw on chaos theory and non-smooth analysis, and analyze the Maximum Lyapunov Exponents and Hölder Exponents of the policy optimization objectives. Moreover, we develop a practical method that statistically estimates Hölder exponents, to identify when the training process has encountered fractal landscapes. Our experiments illustrate how some failure cases of policy optimization in continuous control environments can be explained by our theory.

1 Introduction

Deep reinforcement learning has achieved much success in various applications atari; alphago; video, but it also often fails, especially in continuous spaces, on control problems that other methods can readily solve. The understanding of such failure cases is still very limited. Among the failures, two main issues arise: first, the training process of reinforcement learning is unstable in the sense that the loss curve can fluctuate violently during training; second, the probability of obtaining a satisfactory solution can be surprisingly low, or even close to zero, in certain tasks such as acrobot. Currently, these failures are attributed to the function approximation error Tsitsiklis; wang, the data error and inefficiency thomas, and making large moves in policy space schulman17. Nevertheless, all of these explanations focus solely on the training process, and there remains a lack of systematic study of the underlying causes of these failures from the perspective of the MDP itself.

Given that even a small update in policy space can significantly change the performance metric, it is natural to question the smoothness of the landscape of the objective function. Motivated by chaos theory, we adopt the concept of the maximum Lyapunov exponent kantz to describe the exponential rate of trajectory divergence in the MDP under a given policy. It is related to the smoothness of the objective function in the following way: suppose that the discount factor in the MDP is γ(0,1)\gamma\in(0,1), and the maximum Lyapunov exponent at a certain policy parameter θ0\theta_{0} is λ(θ0)>0\lambda(\theta_{0})>0, which is inherent to the MDP. Then, we show that the objective function J(θ)J(\theta) can be non-differentiable at θ=θ0\theta=\theta_{0} when λ(θ0)>logγ\lambda(\theta_{0})>-\log\gamma. Intuitively speaking, the objective function is non-differentiable when the rate of trajectory divergence is faster than the decay rate of the discount factor. Furthermore, this demonstrates that the fluctuation of the loss curve is not a result of numerical error but rather a reflection of intrinsic properties of the corresponding MDP.

In this paper, we first analyze deterministic policies in Section 4. This is not only because deterministic policies can be regarded as the limit of stochastic policies as the variance σ0\sigma\rightarrow 0. More importantly, the point we try to highlight is that non-smoothness in objective functions is not a result of randomness but rather an inherent feature of the MDP itself. As shown in Section 6, the loss curve is still fractal even when everything in the MDP is deterministic. After that, we briefly discuss the case of stochastic policies and show, through an example, why non-smoothness is even more likely to arise. Although having positive MLEs can explain the non-smoothness of objective functions, we do not need to compute them in practice, even though it is possible as in kantz. Instead, we propose a statistical method in Section 5 that directly estimates the Hölder exponent logγλ(θ)\frac{-\log\gamma}{\lambda(\theta)} as a whole to determine whether J(θ)J(\theta) is differentiable at a specific θN\theta\in\mathbb{R}^{N}. It can also indicate whether the training has run into fractal areas and hence should stop.

2 Related work

Policy gradient methods.

While the optimal policy is usually approximated by minimizing (or maximizing) the objective function, the most common way of performing the optimization is via policy gradient methods, which originated from sutton99; william. After the problem was reformulated as optimization in parameter space, increasingly efficient algorithms such as natural policy gradient kakade, deterministic policy gradient silver, deep deterministic policy gradient lillicrap, trust region policy optimization schulman15 and proximal policy optimization schulman17 were proposed. As all of these algorithms seek to estimate the gradient of the objective function for a descent direction, they become fundamentally ill-posed when the true objective is non-differentiable since there does not exist any gradient to estimate at all, which is exactly the point we try to make in this paper. Consequently, it is not surprising to have a low probability of obtaining a good solution using these methods.

Chaos in machine learning.

Chaotic behaviors in the learning dynamics have been reported in other learning problems camuto; pascanu. For instance, when training recurrent neural networks over long horizons, the outcome behaves like a random walk due to the vanishing and exploding gradient problems bengio. In parmas, the authors point out that the chaotic behavior in finite-horizon model-based reinforcement learning problems may be caused by long chains of non-linear computations. However, there is a significant difference compared to our theory: the true objective function is provably differentiable when the time horizon is finite, whereas our theory focuses on more general infinite-horizon MDPs. Actually, all of these works fundamentally differ from our theory in the following sense: they attribute the chaos to randomness during training, such as noise and errors from computation or sampling, without recognizing that it can be an intrinsic feature of the model.

Loss landscape of policy optimization.

There are many works on the landscape of objective functions. In particular, it has been shown that the objective functions in finite state-space MDPs are smooth agarwal; xiao, which allows one to apply further optimization algorithms such as gradient descent and direct policy search without worrying about the well-posedness of those methods. It also explains why the classical RL algorithms in sutton98 are provably efficient in finite-space settings. Such smoothness results can also be extended to continuous state-space MDPs having "very nice properties". For instance, the objective function in Linear Quadratic Regulator (LQR) problems is almost smooth fazel as long as the cost is finite, along with a similar result obtained for the \mathcal{H}_{2}/\mathcal{H}_{\infty} problem zhang. For the robust control problem, despite the objective function not being smooth, it is locally Lipschitz continuous, which implies differentiability almost everywhere and further leads to global convergence of direct policy search guo. Due to the lack of analytical tools, however, there is still a gap in the theoretical study of loss landscapes of policy optimization for nonlinear and complex MDPs. Our paper partially addresses this gap by pointing out that the loss landscape can be highly non-smooth and even fractal, which is far more complex than all of the aforementioned types.

3 Preliminaries

3.1 Continuous state-space MDPs

Let us consider a deterministic continuous MDP given as the dynamical system:

st+1=f(st,at),s0n,s_{t+1}=f(s_{t},a_{t}),\quad s_{0}\in\mathbb{R}^{n}, (1)

where st𝒮ns_{t}\in\mathcal{S}\subset\mathbb{R}^{n} is the state at time tt, s0s_{0} is the initial state and atπθ(|st)𝒜ma_{t}\sim\pi_{\theta}(\cdot|s_{t})\in\mathcal{A}\subset\mathbb{R}^{m} is the action taken at time tt based on a stochastic policy parameterized by θN\theta\in\mathbb{R}^{N}. The state space 𝒮\mathcal{S} is compact. Also, we define the value function VπθV^{\pi_{\theta}} of policy πθ\pi_{\theta} and the objective function J(θ)J(\theta) to be minimized as

J(θ)=Vπθ(s0)=𝔼atπθ(|st)[t=0γtR(st,at)]J(\theta)=V^{\pi_{\theta}}(s_{0})=\mathbb{E}_{a_{t}\sim\pi_{\theta}(\cdot|s_{t})}[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})] (2)

where γ(0,1)\gamma\in(0,1) is the discount factor and R(s,a)R(s,a) is the cost function. The following assumptions are made throughout this paper:

  • (A.1) f:n×mnf:\mathbb{R}^{n}\times\mathbb{R}^{m}\rightarrow\mathbb{R}^{n} is Lipschitz continuous over any compact domains (i.e., locally Lipschitz continuous);

  • (A.2) The cost function R:n×mR:\mathbb{R}^{n}\times\mathbb{R}^{m}\rightarrow\mathbb{R} is non-negative and locally Lipschitz continuous everywhere;

  • (A.3) The state space is closed under transitions, i.e., for any (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, the next state s=f(s,a)𝒮s^{\prime}=f(s,a)\in\mathcal{S}.

3.2 Policy gradient methods

Generally, policy gradient methods provide an estimator of the gradient of true objective function J()J(\cdot) with respect to policy parameters. One of the most commonly used forms is

η(θ)=𝔼πθ[θlogπθ(at|st)Qπθ(st,at)]=𝔼πθ[θlogπθ(at|st)Aπθ(st,at)]\eta(\theta)=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\ Q^{\pi_{\theta}}(s_{t},a_{t})]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\ A^{\pi_{\theta}}(s_{t},a_{t})] (3)

where πθ(|)\pi_{\theta}(\cdot|\cdot) is a stochastic policy parameterized by θ\theta and Qπθ(,)Q^{\pi_{\theta}}(\cdot,\cdot) is the Q-value function of πθ\pi_{\theta}, and Aπθ(s,a)=Qπθ(s,a)Vπθ(s)A^{\pi_{\theta}}(s,a)=Q^{\pi_{\theta}}(s,a)-V^{\pi_{\theta}}(s) is the advantage function, often used for variance reduction. The theoretical guarantee of the convergence of policy gradient methods is essentially established upon the argument that the tail term satisfies sutton99

γtθVπθ(s)0 as t\gamma^{t}\ \nabla_{\theta}V^{\pi_{\theta}}(s)\rightarrow 0\textit{ as }t\rightarrow\infty (4)

for any s𝒮s\in\mathcal{S}. Usually, two additional assumptions are made for (4):

  • θVπθ(s)\nabla_{\theta}V^{\pi_{\theta}}(s) exists and is continuous for all s𝒮s\in\mathcal{S};

  • θVπθ(s)\|\nabla_{\theta}V^{\pi_{\theta}}(s)\| is uniformly bounded over 𝒮\mathcal{S}.

Clearly, the second assumption is automatically satisfied if the first assumption holds and 𝒮\mathcal{S} is either finite or compact. However, as we will see in Sections 4 and 6, the existence of θVπθ()\nabla_{\theta}V^{\pi_{\theta}}(\cdot) may fail in many continuous MDPs even if 𝒮\mathcal{S} is compact, which poses challenges to the fundamental well-posedness of all policy gradient methods.
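
For concreteness, the following is a minimal Monte-Carlo sketch of the estimator in (3). It is a hedged illustration rather than any particular implementation from the paper: it assumes a linear-Gaussian policy a_t ~ N(theta s_t, sigma^2 I) so that the score function has a closed form, replaces the Q-value by the sampled discounted cost-to-go, uses the gamma^t-weighted convention, and leaves the dynamics and running cost as placeholder callables step_fn and cost_fn.

```python
import numpy as np

def pg_estimate(theta, sigma, step_fn, cost_fn, s0, gamma=0.99,
                horizon=200, n_episodes=64, rng=None):
    """Monte-Carlo sketch of (3) for a linear-Gaussian policy
    a_t ~ N(theta @ s_t, sigma^2 I); Q is replaced by the sampled
    discounted cost-to-go. step_fn(s, a) -> next state and
    cost_fn(s, a) -> running cost are assumed placeholders."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)                    # theta: (m, n) matrix
    for _ in range(n_episodes):
        s = np.array(s0, dtype=float)
        score_grads, costs = [], []
        for _ in range(horizon):
            mean = theta @ s
            a = mean + sigma * rng.standard_normal(mean.shape)
            # grad_theta log N(a; theta s, sigma^2 I) = (a - theta s) s^T / sigma^2
            score_grads.append(np.outer(a - mean, s) / sigma ** 2)
            costs.append(cost_fn(s, a))
            s = step_fn(s, a)
        # discounted cost-to-go: Q_t = sum_{k >= t} gamma^(k - t) c_k
        q, Q = 0.0, [0.0] * horizon
        for t in reversed(range(horizon)):
            q = costs[t] + gamma * q
            Q[t] = q
        for t in range(horizon):
            grad += (gamma ** t) * Q[t] * score_grads[t]
    return grad / n_episodes
```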

3.3 Maximal Lyapunov exponents

Notice that the most important property of chaotic systems is their “sensitive dependence on initial conditions.” To make it precise, consider the following dynamical system

st+1=F(st),s0N,s_{t+1}=F(s_{t}),\quad s_{0}\in\mathbb{R}^{N}, (5)

and suppose that a small perturbation ΔZ0\Delta Z_{0} is made to the initial state s0s_{0}; the resulting divergence at time tt, say ΔZ(t)\Delta Z(t), can be estimated by ΔZ(t)eλtΔZ0\|\Delta Z(t)\|\simeq e^{\lambda t}\|\Delta Z_{0}\|, where λ\lambda is called the Lyapunov exponent. For chaotic systems, positive Lyapunov exponents are usually observed, which implies an exponential divergence rate of the separation of nearby trajectories lorenz. Since the Lyapunov exponent at a given point may depend on the direction of the perturbation ΔZ0\Delta Z_{0}, and we are often interested in the largest divergence rate, the maximal Lyapunov exponent (MLE) used in this paper is defined as follows:

Definition 3.1.

(Maximal Lyapunov exponent) For the dynamical system (5), the maximal Lyapunov exponent λmax\lambda_{max} at s0s_{0} is defined as the largest value such that

λmax=limtlimΔZ001tlogΔZ(t)ΔZ0.\lambda_{max}=\lim_{t\rightarrow\infty}\lim_{\|\Delta Z_{0}\|\rightarrow 0}\frac{1}{t}\log\frac{\|\Delta Z(t)\|}{\|\Delta Z_{0}\|}. (6)

As shown in (6), not only chaotic systems but also systems with unstable equilibria can have a positive MLE at the corresponding states. Therefore, the framework built in this paper is not limited to chaotic MDPs, but is applicable to a broad class of continuous state-space MDPs having any one of these characteristics.
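
As a concrete illustration, the short sketch below estimates the MLE numerically. It is a hedged example rather than the method of kantz: for a one-dimensional map s_{t+1}=F(s_t), the limit in (6) reduces to the time average of log|F'(s_t)|, and the logistic-type map and all constants below are illustrative assumptions.

```python
import numpy as np

def mle_1d(f, df, s0, n_steps=100_000, n_transient=1_000):
    """Estimate the maximal Lyapunov exponent of a 1-D map s_{t+1} = f(s_t)
    as the time average of log|f'(s_t)|, which coincides with the limit
    in (6) for one-dimensional systems."""
    s = s0
    for _ in range(n_transient):   # discard a transient so the average
        s = f(s)                   # reflects the long-run behavior
    total = 0.0
    for _ in range(n_steps):
        total += np.log(abs(df(s)) + 1e-300)  # guard against log(0)
        s = f(s)
    return total / n_steps

# Illustrative example: logistic-type map s_{t+1} = theta * s_t * (1 - s_t)
theta = 3.7
f = lambda s: theta * s * (1.0 - s)
df = lambda s: theta * (1.0 - 2.0 * s)
print("estimated MLE:", mle_1d(f, df, s0=0.4))  # positive value => chaotic regime
```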

4 Non-smooth objective functions

Now we begin to consider the landscape of objective functions. In this section, we will show that the true objective function J(θ)J(\theta) given in (2) can be non-differentiable at some θN\theta\in\mathbb{R}^{N} where the system (1) has a positive MLE. Before showing how this argument is made, we first introduce the notion of α\alpha-Hölder continuity to extend the concept of Lipschitz continuity:

Definition 4.1.

(α\alpha-Hölder continuity) Suppose that some scalar α>0\alpha>0. A function g:Ng:\mathbb{R}^{N}\rightarrow\mathbb{R} is α\alpha-Hölder continuous at xNx\in\mathbb{R}^{N} if there exist C>0C>0 and δ>0\delta>0 such that

|g(x)g(y)|Cxyα|g(x)-g(y)|\leq C\|x-y\|^{\alpha}

for any y(x,δ)y\in\mathcal{B}(x,\delta), where (x,δ)\mathcal{B}(x,\delta) denotes the open ball of radius δ\delta centered at xx.

In the remainder of this section, we mainly investigate the Hölder continuity of VπθV^{\pi_{\theta}} and JJ, respectively, in the case of deterministic policies. After that, we discuss why non-smoothness is even more likely to arise in the stochastic case.

4.1 Hölder Exponent of Vπθ()V^{\pi_{\theta}}(\cdot) at s0s_{0}

For simplicity, we denote the deterministic policy by πθ(s)\pi_{\theta}(s) throughout this section as it directly gives the action a=πθ(s)a=\pi_{\theta}(s) instead of a probability distribution. Consider a given parameter θN\theta\in\mathbb{R}^{N} such that the MLE of (1), say λ(θ)\lambda(\theta), is greater than logγ-\log\gamma. Let s0𝒮s^{\prime}_{0}\in\mathcal{S} be another initial state that is close to s0s_{0}, i.e., δ=s0s0>0\delta=\|s^{\prime}_{0}-s_{0}\|>0 is small enough. According to the assumption (A.3), we can find a constant M>0M>0 such that both stM\|s_{t}\|\leq M and stM\|s^{\prime}_{t}\|\leq M for all tt\in\mathbb{N}. Additionally, it is natural to make the following assumptions:

  • (A.4) There exists K1>0K_{1}>0 such that ststK1δeλ(θ)t\|s^{\prime}_{t}-s_{t}\|\leq K_{1}\delta e^{\lambda(\theta)t} for all tt\in\mathbb{N} and δ=s0s0>0\delta=\|s^{\prime}_{0}-s_{0}\|>0.

  • (A.5) The policy π:N×nm\pi:\mathbb{R}^{N}\times\mathbb{R}^{n}\rightarrow\mathbb{R}^{m} is locally Lipschitz continuous;

First, we present the following theorem, which provides a lower bound for the Hölder exponent of Vπθ()V^{\pi_{\theta}}(\cdot); the detailed proof can be found in the Appendix:

Theorem 4.1.

(Non-smoothness of VπθV^{\pi_{\theta}}) Assume (A.1)-(A.5) and the parameterized policy πθ()\pi_{\theta}(\cdot) is deterministic. Let λ(θ)\lambda(\theta) denote the MLE of (1) at θN\theta\in\mathbb{R}^{N}. Suppose that λ(θ)>logγ\lambda(\theta)>-\log\gamma, then Vπθ()V^{\pi_{\theta}}(\cdot) is logγλ(θ)\frac{-\log\gamma}{\lambda(\theta)}-Hölder continuous at s0s_{0}.

Although the counterexample given in Example 4.1 shows that it is theoretically impossible to prove pp-Hölder continuity for VπθV^{\pi_{\theta}} for any p>logγλ(θ)p>\frac{-\log\gamma}{\lambda(\theta)}, we believe that providing an illustrative demonstration can help to gain a better understanding of how large MLEs lead to non-smoothness.

To show that the strongest continuity result one can prove is logγλ(θ)\frac{-\log\gamma}{\lambda(\theta)}-Hölder continuity where 0<logγλ(θ)<10<-\frac{\log\gamma}{\lambda(\theta)}<1, suppose that p(0,1]p\in(0,1] is some constant for which we would like to prove that Vπθ(s)V^{\pi_{\theta}}(s) is pp-Hölder continuous at s=s0s=s_{0}, and let us see why the largest value for pp is logγλ(θ)-\frac{\log\gamma}{\lambda(\theta)}:

According to Definition 4.1, it suffices to find some C>0C^{\prime}>0 such that |Vπθ(s0)Vπθ(s0)|Cδp|V^{\pi_{\theta}}(s^{\prime}_{0})-V^{\pi_{\theta}}(s_{0})|\leq C^{\prime}\delta^{p} when δ1\delta\ll 1. A usual way is to work on a relaxed form

|Vπθ(s0)Vπθ(s0)|t=0γt|R(st,πθ(st))R(st,πθ(st))|Cδp.|V^{\pi_{\theta}}(s^{\prime}_{0})-V^{\pi_{\theta}}(s_{0})|\leq\sum_{t=0}^{\infty}\gamma^{t}|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))|\leq C^{\prime}\delta^{p}. (7)

While handling the entire series is tricky, we instead split this analysis into three parts as shown in Figure 1: the sum of first T2T_{2} terms, the sum from t=T2+1t=T_{2}+1 to T31T_{3}-1, and the sum from t=T3t=T_{3} to infinity. First, applying (A.4) to the sum of the first T2T_{2} terms yields

t=0T2γt|R(st,πθ(st))R(st,πθ(st))|e(λ(θ)+logγ)T21γK1K2δ\sum_{t=0}^{T_{2}}\gamma^{t}|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))|\leq\frac{e^{(\lambda(\theta)+\log\gamma)T_{2}}}{1-\gamma}K_{1}K_{2}\delta (8)

where K2>0K_{2}>0 is a Lipschitz constant obtained by (A.2) and (A.5). Therefore, if we wish to bound the right-hand side of (8) by some term of order 𝒪(δp)\mathcal{O}(\delta^{p}) when δ1\delta\ll 1, the length T2(δ)T_{2}(\delta)\in\mathbb{N} should satisfy

T2(δ)C1+p1λ(θ)+logγlog(δ)T_{2}(\delta)\simeq C_{1}+\frac{p-1}{\lambda(\theta)+\log\gamma}\log(\delta) (9)

where C1>0C_{1}>0 is some constant independent of pp and δ\delta.



Figure 1: An illustration of the two series (9) and (11) that need to cover the entire \mathbb{R} when δ0\delta\rightarrow 0.

Next, for the sum of the tail terms in Vπθ()V^{\pi_{\theta}}(\cdot) starting from T3T_{3}\in\mathbb{N}, it is naturally bounded by

t=T3γt|R(st,πθ(st))R(st,πθ(st))|2M2eT3logγ1γ,\sum_{t=T_{3}}^{\infty}\gamma^{t}|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))|\leq\frac{2M_{2}e^{T_{3}\log\gamma}}{1-\gamma}, (10)

where M2=maxs𝒮R(s,πθ(s))M_{2}=\max_{s\in\mathcal{S}}R(s,\pi_{\theta}(s)) is the maximum of continuous function R(,πθ())R(\cdot,\pi_{\theta}(\cdot)) over the compact domain 𝒮\mathcal{S} (and hence exists). Therefore, letting the right-hand side of (10) be of order 𝒪(δp)\mathcal{O}(\delta^{p}) yields

T3(δ)C2+plogγlog(δ),T_{3}(\delta)\simeq C_{2}+\frac{p}{\log\gamma}\log(\delta), (11)

for some independent constant C2>0C_{2}>0. Since the sum of (8) and (10) provides a good estimate of VπθV^{\pi_{\theta}} only if the slope of T2(δ)T_{2}(\delta), say p1λ(θ)+logγ\frac{p-1}{\lambda(\theta)+\log\gamma} should be no less than that of T3(δ)T_{3}(\delta), or there would be infinitely many terms in the middle part as δ0\delta\rightarrow 0 and hence cannot be controlled by any 𝒪(δp)\mathcal{O}(\delta^{p}) terms. In this case, the slopes satisfy the inequality p1λ(θ)+logγplogγ\frac{p-1}{\lambda(\theta)+\log\gamma}\leq\frac{p}{\log\gamma} by solving which we finally obtain the desired plogγλ(θ)p\leq-\frac{\log\gamma}{\lambda(\theta)}.

Specifically, the following counterexample shows that Theorem 4.1 has already provided the strongest guaranteed continuity result for Vπθ(s)V^{\pi_{\theta}}(s) at s=s0s=s_{0}:

Example 4.1.

Consider the one-dimensional MDP st+1=f(st,at)s_{t+1}=f(s_{t},a_{t}), where

f(s,a)={1,a1,a,1<a<1,1,a1,f(s,a)=\begin{cases}-1,&a\leq-1,\\ a,&-1<a<1,\\ 1,&a\geq 1,\\ \end{cases}

is defined over the state space 𝒮=[1,1]\mathcal{S}=[-1,1] and the cost function R(s,a)=|s|R(s,a)=|s|. Also, the policy πθ(s)=θs\pi_{\theta}(s)=\theta s is linear. It is easy to verify that all assumptions (A.1)-(A.5) are satisfied. Now let s0=0s_{0}=0 and θ>1\theta>1, then applying (6) directly yields

λ(θ)\displaystyle\lambda(\theta) =limtlimδ01tlogΔZ(t)δ=limtlimδ01tlogδθtδ=logθ.\displaystyle=\lim_{t\rightarrow\infty}\lim_{\delta\rightarrow 0}\frac{1}{t}\log\frac{\|\Delta Z(t)\|}{\delta}=\lim_{t\rightarrow\infty}\lim_{\delta\rightarrow 0}\frac{1}{t}\log\frac{\delta\theta^{t}}{\delta}=\log\theta.

On the other hand, let δ>0\delta>0 be sufficiently small, then

V^{\pi_{\theta}}(\delta)=\sum_{t=0}^{\infty}\gamma^{t}s_{t}\geq\sum_{t=T_{0}(\delta)}^{\infty}\gamma^{t}=\frac{\gamma^{T_{0}(\delta)}}{1-\gamma}

where T0(δ)=1+logδlogθT_{0}(\delta)=1+\lfloor\frac{-\ \log\delta}{\log\theta}\rfloor\in\mathbb{N}, \lfloor\cdot\rfloor is the floor function, and st=1s_{t}=1 for all tT0(δ)t\geq T_{0}(\delta). Therefore, we have

Vπθ(δ)Vπθ(0)=Vπθ(δ)γlogδlogθ+11γ=γ1γδlogγlogθV^{\pi_{\theta}}(\delta)-V^{\pi_{\theta}}(0)=V^{\pi_{\theta}}(\delta)\geq\frac{\gamma^{\frac{-\log\delta}{\log\theta}+1}}{1-\gamma}=\frac{\gamma}{1-\gamma}\delta^{\frac{-\log\gamma}{\log\theta}}

Thus, p=logγλ(θ)p=\frac{-\log\gamma}{\lambda(\theta)} is the largest Hölder exponent of VπθV^{\pi_{\theta}} that can be proved in the worst case.

4.2 Hölder Exponent of J()J(\cdot) at θ\theta

The following lemma establishes a direct connection between J(θ)J(\theta) and J(θ)J(\theta^{\prime}) through VπθV^{\pi_{\theta}}:

Lemma 4.1.

Suppose that θ,θN\theta,\theta^{\prime}\in\mathbb{R}^{N}, then

Vπθ(s0)Vπθ(s0)=t=0γt(Qπθ(stθ,πθ(stθ))Vπθ(stθ))V^{\pi_{\theta^{\prime}}}(s_{0})-V^{\pi_{\theta}}(s_{0})=\sum_{t=0}^{\infty}\gamma^{t}(Q^{\pi_{\theta}}(s^{\theta^{\prime}}_{t},\pi_{\theta^{\prime}}(s^{\theta^{\prime}}_{t}))-V^{{\pi_{\theta}}}(s^{\theta^{\prime}}_{t}))

where {stθ}t=0\{s^{\theta^{\prime}}_{t}\}_{t=0}^{\infty} is the trajectory generated by the policy πθ()\pi_{\theta^{\prime}}(\cdot).

The proof can be found in the Appendix. Notice that J(θ)=Vπθ(s0)J(\theta^{\prime})=V^{\pi_{\theta^{\prime}}}(s_{0}) and J(θ)=Vπθ(s0)J(\theta)=V^{\pi_{\theta}}(s_{0}); substituting these two terms into the previous lemma and applying some additional calculations yields the main theorem as follows:

Theorem 4.2.

(Non-smoothness of JJ) Assume (A.1)-(A.5) and the parameterized policy πθ()\pi_{\theta}(\cdot) is deterministic. Let λ(θ)\lambda(\theta) denote the MLE of (1) at θN\theta\in\mathbb{R}^{N}. Suppose that λ(θ)>logγ\lambda(\theta)>-\log\gamma, then J()J(\cdot) is logγλ(θ)\frac{-\log\gamma}{\lambda(\theta)}-Hölder continuous at θ\theta.

The detailed proof of Theorem 4.2 is given in the Appendix. Actually, the most important implication of this theorem is that J(θ)J(\theta) is not guaranteed (and is very unlikely) to be Lipschitz continuous when λ(θ)>logγ\lambda(\theta)>-\log\gamma; a counterexample similar to Example 4.1 can be constructed. Finally, let us see how the loss landscape of the objective function becomes non-smooth when λ(θ)\lambda(\theta) is large through the following example:

Example 4.2.

(Logistic model) Consider the following MDP:

st+1=(1st)at,s0=0.9,s_{t+1}=(1-s_{t})a_{t},\quad s_{0}=0.9, (12)

where the policy is given by the deterministic linear function at=πθ(st)=θsta_{t}=\pi_{\theta}(s_{t})=\theta s_{t}. The objective function is defined as J(θ)=t=0γt(st2+0.1at2)J(\theta)=\sum_{t=0}^{\infty}\gamma^{t}\ (s_{t}^{2}+0.1\ a^{2}_{t}) where γ(0,1)\gamma\in(0,1) is user-defined. It is well-known that (12) begins to exhibit chaotic behavior with positive MLEs (as shown in Figure 2(a)) when θ3.3\theta\geq 3.3 jordan, so we plot the graphs of J(θ)J(\theta) for different discount factors over the interval θ[3.3,3.9]\theta\in[3.3,3.9]. From Figure 2(b) to 2(d), the non-smoothness becomes more and more significant as γ\gamma grows. In particular, Figure 2(e) shows that the value of J(θ)J(\theta) fluctuates violently even within a very small interval of θ\theta, suggesting a high degree of non-differentiability in this region.
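
A minimal sketch of this computation is given below, assuming the truncation to the first 1000 terms used for Figure 2; the number of sampled values of θ and the plotting step are illustrative choices.

```python
import numpy as np

def J(theta, gamma, s0=0.9, horizon=1000):
    """Truncated objective J(theta) = sum_t gamma^t (s_t^2 + 0.1 a_t^2)
    for the MDP s_{t+1} = (1 - s_t) a_t with linear policy a_t = theta * s_t."""
    s, total, disc = s0, 0.0, 1.0
    for _ in range(horizon):
        a = theta * s
        total += disc * (s ** 2 + 0.1 * a ** 2)
        s = (1.0 - s) * a
        disc *= gamma
    return total

thetas = np.linspace(3.3, 3.9, 2000)
curves = {g: [J(th, g) for th in thetas] for g in (0.5, 0.9, 0.99)}
# Plotting thetas against curves[g] reproduces the increasingly rough
# landscapes of Figure 2(b)-2(d); zooming into a tiny interval of theta
# for gamma = 0.99 shows the violent fluctuation of Figure 2(e).
```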

Figure 2: The value of the MLE λ(θ)\lambda(\theta) for θ[3.3,3.9]\theta\in[3.3,3.9] is shown in 2(a). The graphs of the objective function J(θ)J(\theta) for different values of γ\gamma are shown in 2(b)-2(e), where J(θ)J(\theta) is estimated by the sum of the first 1000 terms of the infinite series. Panels: (a) MLE λ(θ)\lambda(\theta); (b) γ=0.5\gamma=0.5; (c) γ=0.9\gamma=0.9; (d) γ=0.99\gamma=0.99; (e) magnified view.

4.3 Non-smoothness in the case of stochastic policies

The original MDP (1) becomes stochastic when a stochastic policy is employed. In fact, this makes the objective even less likely to be smooth, as we now demonstrate. First, let us consider a slightly modified version of the MLE:

λ~max=limtlimΔZ001tlog𝔼π[ΔZω(t)]ΔZ0.\tilde{\lambda}_{max}=\lim_{t\rightarrow\infty}\lim_{\|\Delta Z_{0}\|\rightarrow 0}\frac{1}{t}\log\frac{\|\mathbb{E}_{\pi}[\Delta Z_{\omega}(t)]\|}{\|\Delta Z_{0}\|}. (13)

where ΔZ0=s0s0\Delta Z_{0}=s^{\prime}_{0}-s_{0} is a small perturbation made to the initial state and ΔZω(t)=st(ω)st(ω)\Delta Z_{\omega}(t)=s^{\prime}_{t}(\omega)-s_{t}(\omega) is the difference in the sample path at time tt\in\mathbb{N} for ωΩ\omega\in\Omega. Since this definition is consistent with that in (6) when taking the variance to 0, we use the same notation λ(θ)\lambda(\theta) to denote the MLE at a given θN\theta\in\mathbb{R}^{N} and again assume λ(θ)>logγ\lambda(\theta)>-\log\gamma.

Recall that in the deterministic case, the Hölder continuity result was proved under the assumption that the policy πθ(s)\pi_{\theta}(s) is locally Lipschitz continuous. When it comes to the stochastic case, however, we cannot make any similar assumption for πθ(|s)\pi_{\theta}(\cdot|s). The reason is that while the deterministic policy πθ(s)\pi_{\theta}(s) directly gives the action, the stochastic policy πθ(|s)\pi_{\theta}(\cdot|s) is instead a probability density function from which the action is sampled, and hence cannot be locally Lipschitz continuous in θ\theta. For instance, consider the one-dimensional Gaussian distribution \pi_{\theta}(a|s)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(a-\mu s)^{2}}{2\sigma^{2}}} where θ=[μ,σ]T\theta=[\mu,\sigma]^{T} denotes the parameters. Apparently, πθ(a|s)\pi_{\theta}(a|s) becomes more and more concentrated at a=μsa=\mu s as σ\sigma approaches 0, and eventually converges to the Dirac delta function δ(aμs)\delta(a-\mu s), which means that πθ(a|s)\pi_{\theta}(a|s) cannot be Lipschitz continuous within a neighborhood of any θ=[μ,0]T\theta=[\mu,0]^{T} even though its deterministic limit πθ(s)=μs\pi_{\theta}(s)=\mu s is indeed Lipschitz continuous.

Therefore, there is a trade-off: if we want the probability density function πθ(|s)\pi_{\theta}(\cdot|s) to cover its deterministic limit, then it cannot be locally Lipschitz continuous. Conversely, it is not allowed to cover the deterministic limit if it is assumed to be locally Lipschitz continuous. Since the former case is widely accepted in practice, we present the following example to show that in this case the Hölder exponent of the objective function J()J(\cdot) can be arbitrarily smaller than logγλ(θ)\frac{-\log\gamma}{\lambda(\theta)}:

Example 4.3.

Consider the one-dimensional MDP st+1=f(st,at)s_{t+1}=f(s_{t},a_{t}), where

f(s,a)=\begin{cases}0,&a\leq 0,\\ a,&0<a<1,\\ 1,&a\geq 1,\\ \end{cases}

is defined over the state space 𝒮=[0,1]\mathcal{S}=[0,1] and action space 𝒜=[0,)\mathcal{A}=[0,\infty). The cost function is R(s,a)=s+1R(s,a)=s+1. Also, the parameter is θ=[θ1,θ2]T2\theta=[\theta_{1},\theta_{2}]^{T}\in\mathbb{R}^{2} and the policy πθ(|s)𝒰(|θ1|s+|θ2|β,|θ1|s+2|θ2|β)\pi_{\theta}(\cdot|s)\sim\mathcal{U}(|\theta_{1}|s+|\theta_{2}|^{\beta},|\theta_{1}|s+2|\theta_{2}|^{\beta}) is a uniform distribution where β>0\beta>0 is a constant. It is easy to verify that all required assumptions are satisfied. Let the initial state s0=0s_{0}=0 and θ1>1,θ2=0\theta_{1}>1,\theta_{2}=0, then applying (13) directly yields λ(θ)=logθ1\lambda(\theta)=\log\theta_{1} similarly to Example 4.1. Now suppose that θ2>0\theta^{\prime}_{2}>0 is small and θ=[θ1,θ2]T\theta^{\prime}=[\theta_{1},\theta^{\prime}_{2}]^{T}, then for any ωΩ\omega\in\Omega, the sampled trajectory {st}\{s^{\prime}_{t}\} generated by πθ\pi_{\theta^{\prime}} satisfies

st+1(ω)θ1st(ω)+(θ2)β>θ1st(ω)θ1ts1(ω)θ1t(θ2)β\displaystyle s^{\prime}_{t+1}(\omega)\geq\theta_{1}s^{\prime}_{t}(\omega)+(\theta^{\prime}_{2})^{\beta}>\theta_{1}s^{\prime}_{t}(\omega)\geq\theta^{t}_{1}s^{\prime}_{1}(\omega)\geq\theta^{t}_{1}(\theta^{\prime}_{2})^{\beta}

when st+1(ω)<1s^{\prime}_{t+1}(\omega)<1. Thus, we have st+1(ω)=1s^{\prime}_{t+1}(\omega)=1 for all ωΩ\omega\in\Omega and tT0(θ)=1+βlogθ2logθ1t\geq T_{0}(\theta^{\prime})=1+\lfloor\frac{-\beta\log\theta^{\prime}_{2}}{\log\theta_{1}}\rfloor, which further leads to

J(\theta^{\prime})=\frac{1}{1-\gamma}+\sum_{t=0}^{\infty}\gamma^{t}\ \mathbb{E}_{\pi_{\theta^{\prime}}}[s^{\prime}_{t}]\geq J(\theta)+\sum_{t=T_{0}(\theta^{\prime})}^{\infty}\gamma^{t}\ \mathbb{E}_{\pi_{\theta^{\prime}}}[s^{\prime}_{t}]\geq J(\theta)+\frac{\gamma}{1-\gamma}(\theta^{\prime}_{2})^{\frac{-\beta\log\gamma}{\log\theta_{1}}}

using the fact that J(θ)=11γJ(\theta)=\frac{1}{1-\gamma}. Plugging θθ=θ2\|\theta-\theta^{\prime}\|=\theta^{\prime}_{2} into the above inequality yields

J(θ)J(θ)γ1γθθβlogγlogθ1.J(\theta^{\prime})-J(\theta)\geq\frac{\gamma}{1-\gamma}\|\theta^{\prime}-\theta\|^{\frac{-\beta\log\gamma}{\log\theta_{1}}}. (14)

Since β>0\beta>0 is arbitrary, the Hölder exponent can be arbitrarily close to 0.

In particular, (14) prohibits us from achieving the logγλ(θ)\frac{-\log\gamma}{\lambda(\theta)}-Hölder continuity result for J()J(\cdot) when 0<β<10<\beta<1, which further implies that the loss landscape of J()J(\cdot) can become even more fractal and non-smooth in the case of stochastic policies. As a consequence, it is not surprising that policy gradient methods perform poorly in such problems. Some discussion from the perspective of fractal theory can be found in the Appendix and also in barnsley; falconer; hardy; hirsch; mcdonald; preiss.

5 Estimating Hölder Exponents from Samples

In the previous sections, we have seen that the true objective function J(θ)J(\theta) can be highly non-smooth. Thus, gradient-based methods are no longer guaranteed to work well in the policy parameter space. The question is: how can we determine whether the objective function J(θ)J(\theta) is differentiable at some θ=θ0\theta=\theta_{0} in high-dimensional settings? To answer this question, we propose a statistical method to estimate the Hölder exponent. Consider the objective function J(θ)J(\theta) and the Gaussian distribution X𝒩(θ0,σ2N)X\sim\mathcal{N}(\theta_{0},\sigma^{2}\mathcal{I}_{N}) where N\mathcal{I}_{N} is the N×NN\times N identity matrix. The diagonal form of the covariance matrix is assumed for simplicity. In the general case, let J()J(\cdot) be α\alpha-Hölder continuous over N\mathbb{R}^{N}; then the variance of J(X)J(X) can be expressed as

Var(J(X))=\mathbb{E}_{X\sim\mathcal{N}(\theta_{0},\sigma^{2}\mathcal{I})}[(J(X)-\mathbb{E}_{X\sim\mathcal{N}(\theta_{0},\sigma^{2}\mathcal{I})}[J(X)])^{2}]=\mathbb{E}_{X\sim\mathcal{N}(\theta_{0},\sigma^{2}\mathcal{I})}[(J(X)-J(\xi^{\prime}))^{2}]

where ξN\xi^{\prime}\in\mathbb{R}^{N} is obtained from applying the intermediate value theorem to 𝔼X𝒩(θ0,σ2)[J(X)]\mathbb{E}_{X\sim\mathcal{N}(\theta_{0},\sigma^{2}\mathcal{I})}[J(X)] and hence is not a constant. If the α\alpha-Hölder continuity is tight, say |J(θ)J(θ0)|θθ0α|J(\theta)-J(\theta_{0})|\simeq\|\theta-\theta_{0}\|^{\alpha} when θθ01\|\theta-\theta_{0}\|\ll 1, then we have the following approximation

Var(J(X))=𝔼X𝒩(θ0,σ2)[Xξ2α](Var(X))α𝒪(σ2α)Var(J(X))=\mathbb{E}_{X\sim\mathcal{N}(\theta_{0},\sigma^{2}\mathcal{I})}[\|X-\xi^{\prime}\|^{2\alpha}]\simeq(Var(X))^{\alpha}\sim\mathcal{O}(\sigma^{2\alpha}) (15)

when σ1\sigma\ll 1. Therefore, (15) provides a way to directly estimate the Hölder exponent of J()J(\cdot) around any given θN\theta\in\mathbb{R}^{N}, especially when the dimension NN is large. In particular, taking the logarithm on both sides of (15) yields

logVarσ(J(X))C+2αlogσ\log Var_{\sigma}(J(X))\simeq C+2\alpha\log\sigma (16)

for some constant CC where the subscript in Varσ(J(X))Var_{\sigma}(J(X)) indicates its dependence on the standard deviation σ\sigma of XX. Thus, the log-log plot of Varσ(J(X))Var_{\sigma}(J(X)) versus σ\sigma is expected to be close to a straight line with slope k=2αk=2\alpha where α\alpha is the Hölder exponent of J()J(\cdot) around θ\theta. Therefore, one can estimate the Hölder exponent by sampling around θ\theta with different variances and estimating the slope using linear regression. Usually, J(θ)J(\theta) is Lipschitz continuous at θ=θ0\theta=\theta_{0} when the slope kk is close to or greater than 22. Some examples will be presented in Section 6.
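
The following is a minimal sketch of this procedure under stated assumptions: the sample count, the grid of standard deviations, and the helper name estimate_holder_exponent are illustrative, and the objective passed in is assumed to be an ordinary Python callable (for instance, the J(theta, gamma) helper sketched in Example 4.2).

```python
import numpy as np

def estimate_holder_exponent(J, theta0, sigmas, n_samples=200, rng=None):
    """Estimate the Hoelder exponent of J around theta0 via (16): regress
    log Var_sigma(J(X)) on log sigma for X ~ N(theta0, sigma^2 I);
    the fitted slope is approximately 2 * alpha."""
    rng = np.random.default_rng() if rng is None else rng
    log_vars = []
    for sigma in sigmas:
        X = theta0 + sigma * rng.standard_normal((n_samples, theta0.size))
        values = np.array([J(x) for x in X])
        log_vars.append(np.log(values.var()))
    slope, intercept = np.polyfit(np.log(sigmas), log_vars, 1)
    return slope / 2.0, (slope, intercept)

# Illustrative usage with the J(theta, gamma) helper sketched in Example 4.2:
# alpha, fit = estimate_holder_exponent(lambda th: J(th[0], gamma=0.99),
#                                       theta0=np.array([3.6]),
#                                       sigmas=np.logspace(-7, -3, 9))
# A slope close to (or above) 2, i.e. alpha close to 1, suggests local
# Lipschitz continuity; a much smaller slope suggests a fractal region.
```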

6 Experiments

In this section, we validate the theory established in this paper through some of the most common tasks in RL. All environments are adopted from OpenAI Gym brockman with continuous control inputs. The experiments are conducted in two steps: first, we randomly sample a parameter θ0\theta_{0} from a Gaussian distribution and estimate the gradient θη(θ0)\nabla_{\theta}\eta(\theta_{0}) from (3); second, we evaluate J(θ)J(\theta) at θ=θ0+δθη(θ0)\theta=\theta_{0}+\delta\nabla_{\theta}\eta(\theta_{0}) for each small δ>0\delta>0. Based on the previous theoretical results, the loss curve is expected to become smoother as γ\gamma decreases, since a smaller γ\gamma makes the Hölder exponent logγλ(θ)\frac{-\log\gamma}{\lambda(\theta)} larger. In the meantime, the policy gradient method (3) should give a better descent direction as the true objective function J()J(\cdot) becomes smoother. Since a single sample path can be non-smooth when the policy is stochastic, which would interfere with the desired observation, we use stochastic policies to estimate the gradient in (3) and apply their deterministic versions (by setting the variance to 0) when evaluating J(θ)J(\theta). Regarding the infinite series, we use the sum of the first 1000 terms to approximate J(θ)J(\theta). The stochastic policy is given by πθ(|s)𝒩(u(s),σ2N)\pi_{\theta}(\cdot|s)\sim\mathcal{N}(u(s),\sigma^{2}\mathcal{I}_{N}) where the mean u(s)u(s) is represented by the 2-layer neural network u(s)=W2tanh(W1s)u(s)=W_{2}\tanh(W_{1}s) with weight matrices W1r×n()W_{1}\in\mathcal{M}_{r\times n}(\mathbb{R}) and W2m×r()W_{2}\in\mathcal{M}_{m\times r}(\mathbb{R}). Let θ=[W1,W2,σ]T\theta=[W_{1},W_{2},\sigma]^{T} denote the vectorized policy parameter. For the width of the hidden layer, we use r=8r=8 for the inverted pendulum and acrobot tasks, and r=64r=64 for the hopper task.
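
A hedged sketch of the evaluation step is shown below: policy_mean unpacks the 2-layer network described above from a flattened parameter vector (with the σ entry ignored for the deterministic version), and J_truncated evaluates the truncated objective along an estimated gradient direction. The helpers step_fn and cost_fn stand in for the environment dynamics and cost and are assumptions, as is the exact parameter layout; the gradient direction g is assumed to come from an estimator of (3).

```python
import numpy as np

def policy_mean(theta, s, n_state, n_action, r=8):
    """Deterministic 2-layer policy u(s) = W2 tanh(W1 s); theta is assumed to
    store the flattened [W1, W2] (the sigma entry of the stochastic policy
    is ignored for the deterministic version)."""
    W1 = theta[: r * n_state].reshape(r, n_state)
    W2 = theta[r * n_state : r * n_state + n_action * r].reshape(n_action, r)
    return W2 @ np.tanh(W1 @ s)

def J_truncated(theta, step_fn, cost_fn, s0, gamma, horizon=1000,
                n_state=4, n_action=1, r=8):
    """Sum of the first `horizon` terms of the objective under the
    deterministic policy. step_fn(s, a) -> next state and cost_fn(s, a) ->
    running cost are placeholder callables for the environment."""
    s, total, disc = np.array(s0, dtype=float), 0.0, 1.0
    for _ in range(horizon):
        a = policy_mean(theta, s, n_state, n_action, r)
        total += disc * cost_fn(s, a)
        s = step_fn(s, a)
        disc *= gamma
    return total

# Loss curve along an estimated policy-gradient direction g at theta0:
# deltas = np.arange(200) * 1e-7
# curve = [J_truncated(theta0 + d * g, step_fn, cost_fn, s0, gamma=0.99)
#          for d in deltas]
```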

Figure 3: Experiment results of the inverted pendulum example. Panels: (a) γ=0.9\gamma=0.9, deterministic; (b) γ=0.9\gamma=0.9, stochastic; (c) γ=0.99\gamma=0.99, deterministic; (d) γ=0.99\gamma=0.99, stochastic; (e) linear regression y=1.980x+2.088y=1.980x+2.088. In 3(e), the linear regression result is obtained for γ=0.9\gamma=0.9. The loss curves J(θ)J(\theta) are presented in 3(a)-3(d) where θ=θ0+δθη(θ0)\theta=\theta_{0}+\delta\nabla_{\theta}\eta(\theta_{0}) with step size 10710^{-7}.

Inverted pendulum.

The inverted pendulum task is a standard test case for RL algorithms, and here we use it as an example of a non-chaotic system. The initial state is always taken as s0=[1,0]Ts_{0}=[-1,0]^{T} ([0,0]T[0,0]^{T} is the upright position), and the quadratic cost function is R(s,a)=stTQst+0.001at2R(s,a)=s^{T}_{t}Qs_{t}+0.001\|a_{t}\|^{2}, where Q=diag(1,0.1)Q={\text{diag}}(1,0.1) is a 2×22\times 2 diagonal matrix, st4s_{t}\in\mathbb{R}^{4} and ata_{t}\in\mathbb{R}. The initial parameter is given by θ0𝒩(0,0.052)\theta_{0}\sim\mathcal{N}(0,0.05^{2}\ \mathcal{I}). In Figures 3(a) and 3(c), we see that the loss curve is close to a straight line within a very small interval, which indicates local smoothness at θ0\theta_{0}. This is validated by the estimate of the Hölder exponent of J(θ)J(\theta) at θ=θ0\theta=\theta_{0}, which is based on (15) by sampling many parameters around θ0\theta_{0} with different variances. In Figure 3(e), the slope k=1.980k=1.980 is very close to 22, so Lipschitz continuity (and hence differentiability) is indicated at θ=θ0\theta=\theta_{0}. As a comparison, the loss curve of a single sample path is totally non-smooth, as shown in Figures 3(b) and 3(d).

Figure 4: Experiment results of the acrobot example. Panels: (a) γ=0.8\gamma=0.8, deterministic; (b) γ=0.9\gamma=0.9, deterministic; (c) γ=0.99\gamma=0.99, deterministic; (d) γ=0.99\gamma=0.99, stochastic; (e) linear regression y=0.8631x+2.041y=0.8631x+2.041. In Figure 4(e), the linear regression result is obtained for γ=0.9\gamma=0.9. The loss curves J(θ)J(\theta) are presented in 4(a)-4(d) where θ=θ0+δθη(θ0)\theta=\theta_{0}+\delta\nabla_{\theta}\eta(\theta_{0}) with step size 10710^{-7}.

Acrobot.

The acrobot system is well-known for its chaotic behavior, and hence we use it as the main test case. Here we use the cost function R(s,a)=stTQst+0.005at2R(s,a)=s^{T}_{t}Qs_{t}+0.005\|a_{t}\|^{2}, where Q=diag(1,1,0.1,0.1)Q={\text{diag}}(1,1,0.1,0.1), st4s_{t}\in\mathbb{R}^{4} and ata_{t}\in\mathbb{R}. The initial state is s0=[1,0,0,0]Ts_{0}=[1,0,0,0]^{T}. The initial parameter is again sampled from θ0𝒩(0,0.052)\theta_{0}\sim\mathcal{N}(0,0.05^{2}\ \mathcal{I}). From Figure 4(a)-4(c), the non-smoothness grows as γ\gamma increases, and the objective finally becomes completely non-differentiable when γ=0.99\gamma=0.99, which is the most common value used for the discount factor. This may partially explain why the acrobot task is difficult and almost unsolvable by policy gradient methods. In Figure 4(e), the Hölder exponent of J(θ)J(\theta) at θ=θ0\theta=\theta_{0} is estimated as α0.43155<1\alpha\simeq 0.43155<1, which further indicates non-differentiability around θ0\theta_{0}.

Figure 5: Experiment results of the hopper example. Panels: (a) γ=0.8\gamma=0.8, deterministic; (b) γ=0.9\gamma=0.9, deterministic; (c) γ=0.99\gamma=0.99, deterministic; (d) γ=0.99\gamma=0.99, stochastic; (e) linear regression y=0.5036x1.250y=0.5036x-1.250. In Figure 5(e), the linear regression result is obtained for γ=0.9\gamma=0.9. The loss curves J(θ)J(\theta) are presented in 5(a)-5(d) where θ=θ0+δθη(θ0)\theta=\theta_{0}+\delta\nabla_{\theta}\eta(\theta_{0}) with step size 10310^{-3}.

Hopper.

Now we consider the hopper task, in which the cost function is defined as R(s,a)=(1.25s[0])+0.001a2R(s,a)=(1.25-s[0])+0.001\|a\|^{2}, where s[0]s[0] is the first coordinate of s11s\in\mathbb{R}^{11}, indicating the height of the hopper. Because the number of parameters involved in the neural network is larger, the initial parameter is instead sampled from θ0𝒩(0,102)\theta_{0}\sim\mathcal{N}(0,10^{2}\ \mathcal{I}). As shown in Figure 5(a), the loss curve is almost a straight line when γ=0.8\gamma=0.8; it begins to show non-smoothness when γ=0.9\gamma=0.9 and becomes totally non-differentiable when γ=0.99\gamma=0.99. Supporting evidence is given by the Hölder exponent estimate in Figure 5(e), where the slope is far less than 22.

7 Conclusion

In this paper, we initiate the study of chaotic behavior in reinforcement learning, focusing especially on how it is reflected in the non-smoothness of objective functions. A method to statistically estimate the Hölder exponent at a given parameter is proposed, so that one can detect whether the training has run into a fractal region and hence determine whether to stop. We believe that the theory established in this paper can help to explain many existing results in reinforcement learning, such as the unsolvability of the acrobot task and the fluctuating behavior of training curves. It also poses a serious question about the well-posedness of policy gradient methods, given the fact that no gradient exists in many continuous state-space RL problems. Finally, we conjecture that chaotic behavior may be able to explain the limits of a much wider range of deep learning problems beyond the scope of reinforcement learning.

Checklist


  1. For all authors…
    (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]
    (b) Did you describe the limitations of your work? [Yes]
    (c) Did you discuss any potential negative societal impacts of your work? [N/A]
    (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
  2. If you are including theoretical results…
    (a) Did you state the full set of assumptions of all theoretical results? [Yes]
    (b) Did you include complete proofs of all theoretical results? [Yes]
  3. If you ran experiments…
    (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
    (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
    (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
    (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
    (a) If your work uses existing assets, did you cite the creators? [N/A]
    (b) Did you mention the license of the assets? [N/A]
    (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
    (d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A]
    (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
  5. If you used crowdsourcing or conducted research with human subjects…
    (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
    (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
    (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix A Brief introduction to chaos theory

As mentioned in the Introduction, chaos exists in many systems in the real world. Although no universal definition of chaos can be made, there are, indeed, three features that a chaotic system usually exhibits [hirsch]:

  • Dense periodic points;

  • Topological transitivity;

  • Sensitive dependence on initial conditions;

In some cases, some of these properties imply the others. It is important to note that, although the appearance of chaos is always accompanied by high unpredictability, chaotic behavior is entirely deterministic and is not a consequence of randomness. Another interesting fact is that trajectories in a chaotic system are usually bounded, which forces us to think about the convergence of policy gradient methods beyond the boundedness property.

Figure 6: (a) Lorenz attractor; (b) Rössler attractor. The Lorenz system and the Rössler system are standard examples of chaotic systems, in which a small perturbation on the initial state can cause a significant divergence in the entire trajectory.

Actually, it can be summarized from the results in this paper that for a given MDP, it has to possess the following three features to exhibit chaotic behavior:

  • Infinite time horizon (allows tt\rightarrow\infty);

  • Continuous state space (allows ΔZ00\|\Delta Z_{0}\|\rightarrow 0);

  • Exponential divergence (allows λmax>0\lambda_{max}>0);

Since these features are not necessarily tied to certain types of continuous state-space MDPs, it would be exciting for future studies to investigate other types of MDPs using the framework developed in this paper.

Appendix B Proofs omitted in Section 4

B.1 Proof of Theorem 4.1

Proof.

Suppose that s0𝒮s^{\prime}_{0}\in\mathcal{S} is another initial state close to s0s_{0}. Let T1T_{1}\in\mathbb{N} be the smallest integer that satisfies

T11λ(θ)log(2MK1δ),T_{1}\geq\frac{1}{\lambda(\theta)}\log(\frac{2M}{K_{1}\delta}), (17)

then applying the Lipschitz condition of R(,πθ())R(\cdot,\pi_{\theta}(\cdot)) yields

t=0T1γt|R(st,πθ(st))R(st,πθ(st))|\displaystyle\sum_{t=0}^{T_{1}}\gamma^{t}|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))| t=0T1γtK2stst\displaystyle\leq\sum_{t=0}^{T_{1}}\gamma^{t}K_{2}\|s_{t}-s^{\prime}_{t}\|
t=0T1K1K2e(λ(θ)+logγ)tδ\displaystyle\leq\sum_{t=0}^{T_{1}}K_{1}K_{2}e^{(\lambda(\theta)+\log\gamma)t}\delta
K1K2δeλ(θ)+logγλ(θ)log(2MK1δ)+1e(λ(θ)+logγ)1\displaystyle\leq K_{1}K_{2}\delta\frac{e^{\frac{\lambda(\theta)+\log\gamma}{\lambda(\theta)}\log(\frac{2M}{K_{1}\delta})+1}}{e^{(\lambda(\theta)+\log\gamma)}-1}
=eK2K1logγλ(θ)(2M)1+logγλ(θ)e(λ(θ)+logγ)1δlogγλ(θ)\displaystyle=\frac{eK_{2}K_{1}^{\frac{-\log\gamma}{\lambda(\theta)}}(2M)^{1+\frac{\log\gamma}{\lambda(\theta)}}}{e^{(\lambda(\theta)+\log\gamma)}-1}\ \delta^{\frac{-\log\gamma}{\lambda(\theta)}}

where K2>0K_{2}>0 is the Lipschitz constant of R(,πθ())R(\cdot,\pi_{\theta}(\cdot)) over compact set 𝒮\mathcal{S}.

On the other hand, the tail terms in Vπθ()V^{\pi_{\theta}}(\cdot) are bounded by

t=T1+1γt|R(st,πθ(st))R(st,πθ(st))|\displaystyle\sum_{t=T_{1}+1}^{\infty}\gamma^{t}|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))| t=T1+12M2γt\displaystyle\leq\sum_{t=T_{1}+1}^{\infty}2M_{2}\gamma^{t}
t=T12M2γt\displaystyle\leq\sum_{t=T_{1}}^{\infty}2M_{2}\gamma^{t}
=2M2γT11γ\displaystyle=2M_{2}\frac{\gamma^{T_{1}}}{1-\gamma}
2M21γ(K12M)logγλ(θ)δlogγλ(θ)\displaystyle\leq\frac{2M_{2}}{1-\gamma}(\frac{K_{1}}{2M})^{\frac{-\log\gamma}{\lambda(\theta)}}\ \delta^{\frac{-\log\gamma}{\lambda(\theta)}}

using that |R(st,πθ(st))R(st,πθ(st))|2M2|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))|\leq 2M_{2} where M2=maxs𝒮R(s,πθ(s))M_{2}=\max_{s\in\mathcal{S}}R(s,\pi_{\theta}(s)) is the maximum of the continuous function R(,πθ())R(\cdot,\pi_{\theta}(\cdot)) over 𝒮\mathcal{S}.

Combining the above two inequalities yields

|Vπθ(s0)Vπθ(s0)|\displaystyle|V^{\pi_{\theta}}(s^{\prime}_{0})-V^{\pi_{\theta}}(s_{0})|
\displaystyle\leq t=0γt|R(st,πθ(st))R(st,πθ(st))|\displaystyle\sum_{t=0}^{\infty}\gamma^{t}|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))|
=\displaystyle= t=0T1γt|R(st,πθ(st))R(st,πθ(st))|+t=T1+1γt|R(st,πθ(st))R(st,πθ(st))|\displaystyle\sum_{t=0}^{T_{1}}\gamma^{t}|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))|+\sum_{t=T_{1}+1}^{\infty}\gamma^{t}|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))|
\displaystyle\leq (eK2K1logγλ(θ)(2M)1+logγλ(θ)e(λ(θ)+logγ)1+2M21γ(K12M)logγλ(θ))δlogγλ(θ)\displaystyle(\frac{eK_{2}K_{1}^{\frac{-\log\gamma}{\lambda(\theta)}}(2M)^{1+\frac{\log\gamma}{\lambda(\theta)}}}{e^{(\lambda(\theta)+\log\gamma)}-1}+\frac{2M_{2}}{1-\gamma}(\frac{K_{1}}{2M})^{\frac{-\log\gamma}{\lambda(\theta)}})\ \delta^{\frac{-\log\gamma}{\lambda(\theta)}}

and we complete the proof. ∎

B.2 Proof of Lemma 4.1

Proof.

For ease of notation, let st=stθ,st=stθ,u(s)=πθ(s)s_{t}=s^{\theta}_{t},s^{\prime}_{t}=s^{\theta^{\prime}}_{t},u(s)=\pi_{\theta}(s) and u(s)=πθ(s)u^{\prime}(s)=\pi_{\theta^{\prime}}(s).

Vπθ(s0)Vπθ(s0)\displaystyle V^{\pi_{\theta^{\prime}}}(s_{0})-V^{\pi_{\theta}}(s_{0}) =t=0γtR(st;u(st))Vπθ(s0)\displaystyle=\sum_{t=0}^{\infty}\gamma^{t}R(s^{\prime}_{t};u^{\prime}(s^{\prime}_{t}))-V^{\pi_{\theta}}(s_{0})
=t=0γt(R(st;u(st))+Vπθ(st)Vπθ(st))Vπθ(s0)\displaystyle=\sum_{t=0}^{\infty}\gamma^{t}(R(s^{\prime}_{t};u^{\prime}(s^{\prime}_{t}))+V^{\pi_{\theta}}(s^{\prime}_{t})-V^{\pi_{\theta}}(s^{\prime}_{t}))-V^{\pi_{\theta}}(s_{0})
=t=0γt(R(st;u(st))+γVπθ(st+1)Vπθ(st)+Vπθ(st)γVπθ(st+1))Vπθ(s0)\displaystyle=\sum_{t=0}^{\infty}\gamma^{t}(R(s^{\prime}_{t};u^{\prime}(s^{\prime}_{t}))+\gamma V^{\pi_{\theta}}(s^{\prime}_{t+1})-V^{\pi_{\theta}}(s^{\prime}_{t})+V^{\pi_{\theta}}(s^{\prime}_{t})-\gamma V^{\pi_{\theta}}(s^{\prime}_{t+1}))-V^{\pi_{\theta}}(s_{0})
=t=0γt(Qθ(st,u(st))Vπθ(st))+t=0γt(Vπθ(st)γVπθ(st+1))Vπθ(s0).\displaystyle=\sum_{t=0}^{\infty}\gamma^{t}(Q^{\theta}(s^{\prime}_{t},u^{\prime}(s^{\prime}_{t}))-V^{\pi_{\theta}}(s^{\prime}_{t}))+\sum_{t=0}^{\infty}\gamma^{t}(V^{\pi_{\theta}}(s^{\prime}_{t})-\gamma V^{\pi_{\theta}}(s^{\prime}_{t+1}))-V^{\pi_{\theta}}(s_{0}).

Using the fact that γtVπθ(st+1)0\gamma^{t}V^{\pi_{\theta}}(s^{\prime}_{t+1})\rightarrow 0 as tt\rightarrow\infty (by the boundedness of VπθV^{\pi_{\theta}}) yields

Vπθ(s0)Vπθ(s0)\displaystyle V^{\pi_{\theta^{\prime}}}(s_{0})-V^{\pi_{\theta}}(s_{0}) =t=0γt(Qπθ(st,u(st))Vπθ(st))+Vπθ(s0)Vπθ(s0)\displaystyle=\sum_{t=0}^{\infty}\gamma^{t}(Q^{\pi_{\theta}}(s^{\prime}_{t},u^{\prime}(s^{\prime}_{t}))-V^{\pi_{\theta}}(s^{\prime}_{t}))+V^{\pi_{\theta}}(s^{\prime}_{0})-V^{\pi_{\theta}}(s_{0})
=t=0γt(Qπθ(st,u(st))Vπθ(st))\displaystyle=\sum_{t=0}^{\infty}\gamma^{t}(Q^{\pi_{\theta}}(s^{\prime}_{t},u^{\prime}(s^{\prime}_{t}))-V^{\pi_{\theta}}(s^{\prime}_{t}))

and the proof is completed using s0=s0s^{\prime}_{0}=s_{0}. ∎

B.3 Proof of Theorem 4.2

Proof.

First, we will show that Qπθ(s,a)Q^{\pi_{\theta}}(s,a) is logγλ(θ)\frac{-\log\gamma}{\lambda(\theta)}-Hölder continuous with respect to aa. Note that for any a,a𝒜a,a^{\prime}\in\mathcal{A},

|Qπθ(s,a)Qπθ(s,a)|\displaystyle|Q^{\pi_{\theta}}(s,a)-Q^{\pi_{\theta}}(s,a^{\prime})| |R(s,a)R(s,a)|+γ|Vπθ(f(s,a))Vπθ(f(s,a))|\displaystyle\leq|R(s,a)-R(s,a^{\prime})|+\gamma|V^{\pi_{\theta}}(f(s,a))-V^{\pi_{\theta}}(f(s,a^{\prime}))|
K1aa+γf(s,a)f(s,a)logγλ(θ)\displaystyle\leq K_{1}\|a-a^{\prime}\|+\gamma\|f(s,a)-f(s,a^{\prime})\|^{\frac{-\log\gamma}{\lambda(\theta)}}
K1aa+γK2aalogγλ(θ)\displaystyle\leq K_{1}\|a-a^{\prime}\|+\gamma K_{2}\|a-a^{\prime}\|^{\frac{-\log\gamma}{\lambda(\theta)}}
K3aalogγλ(θ)\displaystyle\leq K_{3}\|a-a^{\prime}\|^{\frac{-\log\gamma}{\lambda(\theta)}}

for some K3>0K_{3}>0, using the locally Lipschitz continuity of RR and ff together with the Hölder continuity of VπθV^{\pi_{\theta}} from Theorem 4.1.

Noting that Vπθ(s)=Qπθ(s,πθ(s))V^{\pi_{\theta}}(s)=Q^{\pi_{\theta}}(s,\pi_{\theta}(s)) and combining it with Lemma 4.1 yields

|J(θ)J(θ)|\displaystyle|J(\theta^{\prime})-J(\theta)| t=0γt|Qπθ(st,u(st))Vπθ(st)|\displaystyle\leq\sum_{t=0}^{\infty}\gamma^{t}|Q^{\pi_{\theta}}(s^{\prime}_{t},u^{\prime}(s^{\prime}_{t}))-V^{\pi_{\theta}}(s^{\prime}_{t})|
=t=0γt|Qπθ(st,πθ(st))Qπθ(st,πθ(st))|\displaystyle=\sum_{t=0}^{\infty}\gamma^{t}|Q^{\pi_{\theta}}(s^{\prime}_{t},\pi_{\theta^{\prime}}(s^{\prime}_{t}))-Q^{\pi_{\theta}}(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))|
t=0γtK3πθ(st)πθ(st)logγλ(θ)\displaystyle\leq\sum_{t=0}^{\infty}\gamma^{t}K_{3}\|\pi_{\theta^{\prime}}(s^{\prime}_{t})-\pi_{\theta}(s^{\prime}_{t})\|^{\frac{-\log\gamma}{\lambda(\theta)}}
t=0γtK3K4θθlogγλ(θ)\displaystyle\leq\sum_{t=0}^{\infty}\gamma^{t}K_{3}K_{4}\|\theta^{\prime}-\theta\|^{\frac{-\log\gamma}{\lambda(\theta)}}
=K3K41γθθlogγλ(θ)\displaystyle=\frac{K_{3}K_{4}}{1-\gamma}\|\theta^{\prime}-\theta\|^{\frac{-\log\gamma}{\lambda(\theta)}}

using the fact that πθ(s)\pi_{\theta}(s) is Lipschitz continuous, with some constant K4>0K_{4}>0, in a neighborhood of (θ,s)N×𝒮(\theta,s)\in\mathbb{R}^{N}\times\mathcal{S}, and we complete the proof. ∎

Appendix C From the perspective of fractal theory

We will go through some basic concepts in fractal theory that are related to the study of non-smooth functions.

C.1 The Hausdorff dimension

The Hausdorff dimension is the most fundamental concept in fractal theory. First, let us define the concept of δ\delta-cover and Hausdorff measure

Definition C.1.

(δ\delta-cover) Let {Ui}\{U_{i}\} be a countable collection of sets of diameter at most δ\delta (i.e. |Ui|=sup{xy:x,yUi}δ|U_{i}|=\sup\{\|x-y\|:x,y\in U_{i}\}\leq\delta) and FNF\subset\mathbb{R}^{N}, then {Ui}\{U_{i}\} is a δ\delta-cover of FF if Fi=1UiF\subset\cup_{i=1}^{\infty}U_{i}.

Definition C.2.

(Hausdorff measure) For any FNF\subset\mathbb{R}^{N} and s0s\geq 0, let

δs(F)=inf{i=1|Ui|s:{Ui} is a δ-cover of F}.\mathcal{H}_{\delta}^{s}(F)=\inf\{\sum_{i=1}^{\infty}|U_{i}|^{s}:\{U_{i}\}\textit{ is a }\delta\textit{-cover of }F\}.

Then we call the limit s(F)=limδ0δs(F)\mathcal{H}^{s}(F)=\lim_{\delta\rightarrow 0}\mathcal{H}_{\delta}^{s}(F) the ss-dimensional Hausdorff measure of FF.

Definition C.3.

(Hausdorff dimension) Let FNF\subset\mathbb{R}^{N} be a subset, then its Hausdorff dimension

dimHF=inf{s0:s(F)=0}=sup{s0:s(F)=}.\dim_{H}F=\inf\{s\geq 0:\mathcal{H}^{s}(F)=0\}=\sup\{s\geq 0:\mathcal{H}^{s}(F)=\infty\}.

Indeed, the Hausdorff dimension is well-defined: First, it is clear that when δ<1\delta<1, δs(F)\mathcal{H}_{\delta}^{s}(F) is non-increasing with respect to ss. Thus, s(F)\mathcal{H}^{s}(F) is non-increasing as well.

Let s0s\geq 0 such that s(F)<\mathcal{H}^{s}(F)<\infty, then for any t>st>s and any δ\delta-cover {Ui}\{U_{i}\} of FF, we have

i=1|Ui|tδtsi=1|Ui|s\sum_{i=1}^{\infty}|U_{i}|^{t}\leq\delta^{t-s}\sum_{i=1}^{\infty}|U_{i}|^{s}

which implies t(F)=0\mathcal{H}^{t}(F)=0 by taking infimum on both sides and letting δ0\delta\rightarrow 0.

Therefore, the set {s0:0<s(F)<}\{s\geq 0:0<\mathcal{H}^{s}(F)<\infty\} contains at most one point, which further implies inf{s0:s(F)=0}=sup{s0:s(F)=}\inf\{s\geq 0:\mathcal{H}^{s}(F)=0\}=\sup\{s\geq 0:\mathcal{H}^{s}(F)=\infty\}.

More details regarding the well-posedness of the Hausdorff dimension can be found in [barnsley, falconer]. In particular, one can easily verify that the Hausdorff dimension coincides with the standard dimension (i.e. ss\in\mathbb{N}) when FF is a regular manifold. Typically, the Hausdorff dimension of a fractal is not an integer, and we will be exploiting this fact throughout this section. To see how it is associated with the non-smoothness of functions, the following claim makes a rigorous statement on the dimension of fractal loss surfaces:

Proposition C.1.

([falconer]) Let FkF\subset\mathbb{R}^{k} be a subset and suppose that η:Fp\eta:F\rightarrow\mathbb{R}^{p} is α\alpha-Hölder continuous, then dimHη(F)1αdimHF\dim_{H}\eta(F)\leq\frac{1}{\alpha}\dim_{H}F.

An immediate implication is that if the objective function is α\alpha-Hölder for some α<1\alpha<1, its loss landscape J={(θ,J(θ))N+1:θN}\mathcal{L}_{J}=\{(\theta,J(\theta))\in\mathbb{R}^{N+1}:\theta\in\mathbb{R}^{N}\} can be a fractal with Hausdorff dimension dimHJ(N,N+1]\dim_{H}\mathcal{L}_{J}\in(N,N+1]. A famous example is the Weierstrass function shown in Figure 7. A comparison of Figure 2(e) and Figure 7(c) (they have the same scale) gives a sense of how non-smooth the objective function can be.
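
For reference, a minimal sketch for reproducing a Weierstrass-type curve with the parameters of Figure 7 is shown below; truncating the series at 30 terms is an illustrative choice (the neglected tail is bounded by a^30/(1-a)).

```python
import numpy as np

def weierstrass(x, a=0.6, b=7, n_terms=30):
    """Partial sum of W(x) = sum_n a^n cos(b^n pi x); with 0 < a < 1 the
    neglected tail is bounded by a**n_terms / (1 - a)."""
    n = np.arange(n_terms, dtype=float)
    return np.sum((a ** n) * np.cos(np.outer(x, np.power(float(b), n)) * np.pi),
                  axis=1)

x = np.linspace(-2.0, 2.0, 20001)
w = weierstrass(x)
# Plotting (x, w), and then zooming into an interval such as [0.5, 0.5001],
# reproduces the self-similar roughness illustrated in Figure 7(b)-7(c).
```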

Figure 7: (a) shows what the double sector S(x,ϕ,ψ)S(x,\phi,\psi) looks like. In (b) and (c), the Weierstrass function is given by W(x)=n=0ancos(bnπx)W(x)=\sum_{n=0}^{\infty}a^{n}\cos(b^{n}\pi x) with a=0.6,b=7a=0.6,b=7; (b) plots W(x)W(x) over x[2,2]x\in[-2,2] and (c) zooms into x[0.5,0.5001]x\in[0.5,0.5001]. The Hausdorff dimension of its graph is dimHW=2+logba1.73\dim_{H}\mathcal{L}_{W}=2+\log_{b}a\simeq 1.73. Also, according to [hardy], such W(x)W(x) is nowhere differentiable when 0<a<10<a<1 and ab1ab\geq 1.

C.2 Non-existence of tangent plane

Actually, when J()J(\cdot) is Lipschitz continuous on any compact subset of N\mathbb{R}^{N}, Rademacher's theorem tells us that it is differentiable almost everywhere, which implies the existence of tangent planes. For fractal landscapes, however, the tangent plane itself does not exist at almost every θN\theta\in\mathbb{R}^{N}, which makes all policy gradient algorithms ill-posed. Although similar results were obtained for higher-dimensional cases as in [preiss], we restrict attention to the two-dimensional case, which provides clearer geometric intuition. First, we introduce the notion of ss-sets:

Definition C.4.

Let F2F\subset\mathbb{R}^{2} be a Borel set and s0s\geq 0, then FF is called an ss-set if 0<s(F)<0<\mathcal{H}^{s}(F)<\infty.

The intuition is that when the dimension of the fractal FF is a fraction between 1 and 2, there is no direction along which a significant part of FF concentrates within a small double sector with vertex xx, as shown in Figure 7(a). To be precise, let S(x,ϕ,ψ)S(x,\phi,\psi) denote the double sector and let r>0r>0; we say that FF has a tangent at xFx\in F if there exists a direction ϕ\phi such that, for every angle ψ>0\psi>0, the following hold:

  1. \limsup_{r\rightarrow 0}\frac{\mathcal{H}^{s}(F\cap\mathcal{B}(x,r))}{(2r)^{s}}>0;

  2. \lim_{r\rightarrow 0}\frac{\mathcal{H}^{s}(F\cap(\mathcal{B}(x,r)\setminus S(x,\phi,\psi)))}{(2r)^{s}}=0;

where the first condition means that the set FF behaves like a fractal around xx, and the second condition means that the part of FF that lies outside any double sector S(x,ϕ,ψ)S(x,\phi,\psi) is negligible as r0r\rightarrow 0. Then, the main result is as follows:

Proposition C.2.

(Non-existence of tangent planes, [falconer]) If F2F\subset\mathbb{R}^{2} is an ss-set with 1<s<21<s<2, then at almost all points of FF, no tangent exists.

Therefore, "estimate the gradient" no longer makes sense since there does not exist a tangent line/plane at almost every point on the loss surface. This means that all policy gradient algorithms are ill-posed since there is no gradient for them to estimate at all.

C.3 Accumulated uncertainty

Another issue that may emerge during the optimization process is the accumulation of uncertainty. To see how the uncertainty introduced at each step accumulates and eventually blows up when generating a path along fractal boundaries, let us consider the following toy problem. Suppose that the distance between the initial point θ0\theta_{0} and the target θ\theta^{*} is d>0d>0, and that a step of size δk>0\delta_{k}>0 is taken at the kk-th step, as shown in Figure 8(a). Suppose there exists c>0c>0 such that the projection θθ0,θk+1θkcdδk\langle\theta^{*}-\theta_{0},\theta_{k+1}-\theta_{k}\rangle\geq cd\delta_{k} for all kk\in\mathbb{N}, which implies that the angle between the direction from θk\theta_{k} to θk+1\theta_{k+1} and the true direction θθ0\theta^{*}-\theta_{0} does not exceed arccos(c)\arccos(c). In this case, a successful path {θk}\{\theta_{k}\} that converges to θ\theta^{*} should give

k=0cdδkk=0θθ0,θk+1θk=θθ0,θθ0=d2\sum_{k=0}^{\infty}cd\delta_{k}\leq\sum_{k=0}^{\infty}\langle\theta^{*}-\theta_{0},\theta_{k+1}-\theta_{k}\rangle=\langle\theta^{*}-\theta_{0},\theta^{*}-\theta_{0}\rangle=d^{2}

using θkθ\theta_{k}\rightarrow\theta^{*} as kk\rightarrow\infty, which is equivalent to k=0δkdc\sum_{k=0}^{\infty}\delta_{k}\leq\frac{d}{c}.

Figure 8: Illustrations of the statistical difficulties of implementing search gradient algorithms on a fractal loss surface: (a) generating a path from θ0\theta_{0} to θ\theta^{*}; (b) updating θn+1\theta_{n+1} from θn\theta_{n}.

On the other hand, when walking on the loss surface, it is not guaranteed to follow the correct direction at each iteration. For any small step size δ>0\delta>0, the uncertainty fraction u(δ)u(\delta) involved in every single step can be estimated by the following result [mcdonald]:

Proposition C.3.

Let δ>0\delta>0 be the step size and β=NdimHJ\beta=N-\dim_{H}J where dimHJ\dim_{H}J is the Hausdorff dimension of loss surface of J()J(\cdot), then the uncertainty u(δ)𝒪(δβ)u(\delta)\sim\mathcal{O}(\delta^{\beta}) when δ1\delta\ll 1.

Therefore, we may assume that there exists another c>0c^{\prime}>0 such that the uncertainty UkU_{k} at the kk-th step has UkcδkβU_{k}\leq c^{\prime}\delta_{k}^{\beta} for all k=0,1,k=0,1,.... Then, the accumulated uncertainty

U=k=0Ukck=0δkβU=\sum_{k=0}^{\infty}U_{k}\leq c^{\prime}\sum_{k=0}^{\infty}\delta_{k}^{\beta}

is bounded when β=1\beta=1 (i.e., the boundary is smooth) using the earlier result k=0δkdc\sum_{k=0}^{\infty}\delta_{k}\leq\frac{d}{c}. However, the convergence of k=0δk\sum_{k=0}^{\infty}\delta_{k} no longer guarantees the convergence of k=0δkβ\sum_{k=0}^{\infty}\delta_{k}^{\beta} when β<1\beta<1, and a counterexample is the following series:

\delta_{k}=\frac{1}{(k+1)(\log(k+2))^{2}}

for all k=0,1,k=0,1,..., which means that the uncertainty accumulated over the course of iterations may blow up and eventually lead the sequence {θk}\{\theta_{k}\} toward randomness.
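
A quick numerical check of this counterexample is sketched below; the number of terms and the chosen values of β are illustrative assumptions.

```python
import numpy as np

# Step sizes delta_k = 1 / ((k + 1) * log(k + 2)^2): the series converges,
# but its beta-th powers diverge for beta < 1, so the accumulated
# uncertainty bound c' * sum_k delta_k^beta can blow up on a fractal surface.
k = np.arange(1_000_000, dtype=float)
delta = 1.0 / ((k + 1.0) * np.log(k + 2.0) ** 2)
for beta in (1.0, 0.8, 0.5):
    print(f"beta = {beta}: partial sum = {np.sum(delta ** beta):.2f}")
# The beta = 1.0 partial sum stabilizes as more terms are added, whereas the
# beta < 1 partial sums keep growing without bound.
```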