
Fractal Landscapes in Policy Optimization

Abstract

Policy gradient lies at the core of reinforcement learning (RL) in continuous spaces. Although it has achieved much success, it is often observed in practice that RL training with policy gradient can fail for a variety of reasons, even for standard control problems that can be solved with other methods. We propose a framework for understanding one inherent limitation of the policy gradient approach: the optimization landscape in the policy space can be extremely non-smooth or fractal, so that there may not be gradients to estimate in the first place. We draw on chaos theory and non-smooth analysis, and analyze the Maximum Lyapunov Exponents and Hölder Exponents of the policy optimization objectives. Moreover, we develop a practical method that statistically estimates Hölder exponents, to identify when the training process has encountered fractal landscapes. Our experiments illustrate how some failure cases of policy optimization in continuous control environments can be explained by our theory.

1 Introduction

Deep reinforcement learning has achieved much success in various applications atari; alphago; video, but it also often fails, especially in continuous spaces, on control problems that other methods can readily solve. The understanding of such failure cases is still very limited. Among the failures, two main issues arise: first, the training process of reinforcement learning is unstable in the sense that the loss curve can fluctuate violently during training; second, the probability of obtaining a satisfactory solution can be surprisingly low, or even close to zero, in certain tasks such as acrobot. Currently, these failures are attributed to the function approximation error Tsitsiklis; wang, the data error and inefficiency thomas, and making large moves in policy space schulman17. Nevertheless, all of these explanations focus solely on the training process, and there remains a lack of systematic study of the underlying causes of these failures from the perspective of the MDP itself.

Given that even a small update in policy space can significantly change the performance metric, it is natural to question the smoothness of the landscape of the objective function. Motivated by chaos theory, we adopt the concept of the maximum Lyapunov exponent kantz to describe the exponential rate of trajectory divergence in the MDP under a given policy. It is related to the smoothness of the objective function in the following way: suppose that the discount factor in the MDP is γ(0,1)\gamma\in(0,1), and the maximum Lyapunov exponent at a certain policy parameter θ0\theta_{0} is λ(θ0)>0\lambda(\theta_{0})>0, which is inherent to the MDP. Then, we show that the objective function J(θ)J(\theta) can be non-differentiable at θ=θ0\theta=\theta_{0} when λ(θ0)>logγ\lambda(\theta_{0})>-\log\gamma. Intuitively speaking, the objective function is non-differentiable when the rate of trajectory divergence is faster than the decay rate of the discount factor. Furthermore, this demonstrates that the fluctuation of the loss curve is not a result of numerical error but rather a reflection of intrinsic properties of the corresponding MDP.

In this paper, we first analyze deterministic policies in Section 4. This is not only because deterministic policies can be regarded as the limit of stochastic policies as the variance σ0\sigma\rightarrow 0. More importantly, the point we try to highlight is that non-smoothness in objective functions is not a result of randomness but rather an inherent feature of the MDP itself. As shown in Section 6, the loss curve is still fractal even when everything in the MDP is deterministic. After that, we briefly discuss the case of stochastic policies and show, through an example, why non-smoothness is even more likely to arise. Although having positive MLEs can explain the non-smoothness of objective functions, we do not need to compute them in practice, even though it is possible as in kantz. Instead, we propose a statistical method in Section 5 that directly estimates the Hölder exponent logγλ(θ)\frac{-\log\gamma}{\lambda(\theta)} as a whole to determine whether J(θ)J(\theta) is differentiable at a specific θN\theta\in\mathbb{R}^{N}. It can also indicate whether the training has run into fractal areas and hence should stop.

2 Related work

Policy gradient methods.

While the optimal policy is usually approximated by minimizing (or maximizing) the objective function, the most common way of performing the optimization is via policy gradient methods, which originated from sutton99; william. After the problem was reformulated as optimization in parameter space, increasingly efficient algorithms such as natural policy gradient kakade, deterministic policy gradient silver, deep deterministic policy gradient lillicrap, trust region policy optimization schulman15 and proximal policy optimization schulman17 were proposed. As all of these algorithms seek to estimate the gradient of the objective function for a descent direction, they become fundamentally ill-posed when the true objective is non-differentiable since there does not exist any gradient to estimate at all, which is exactly the point we try to make in this paper. Consequently, it is not surprising to have a low probability of obtaining a good solution using these methods.

Chaos in machine learning.

Chaotic behaviors in the learning dynamics have been reported in other learning problems camuto; pascanu. For instance, when training recurrent neural networks over long horizons, the outcome behaves like a random walk due to the vanishing and exploding gradient problems bengio. In parmas, the authors point out that the chaotic behavior in finite-horizon model-based reinforcement learning problems may be caused by long chains of non-linear computations. However, there is a significant difference compared to our theory: the true objective function is provably differentiable when the time horizon is finite, whereas our theory focuses on more general infinite-horizon MDPs. Actually, all of these works fundamentally differ from our theory in the following sense: they attribute the chaos to randomness during training, such as noise and errors from computation or sampling, without recognizing that it can be an intrinsic feature of the model.

Loss landscape of policy optimization.

There are many works on the landscape of objective functions. In particular, it has been shown that the objective functions in finite state-space MDPs are smooth agarwal; xiao, which allows one to apply further optimization algorithms such as gradient descent and direct policy search without worrying about the well-posedness of those methods. It also explains why the classical RL algorithms in sutton98 are provably efficient in finite-space settings. Such smoothness results can also be extended to continuous state-space MDPs having "very nice properties". For instance, the objective function in Linear Quadratic Regulator (LQR) problems is almost smooth fazel as long as the cost is finite, along with a similar result obtained for the \mathcal{H}_{2}/\mathcal{H}_{\infty} problem zhang. For the robust control problem, despite the objective function not being smooth, it is locally Lipschitz continuous, which implies differentiability almost everywhere and further leads to global convergence of direct policy search guo. Due to the lack of analytical tools, however, there is still a gap in the theoretical study of loss landscapes of policy optimization for nonlinear and complex MDPs. Our paper partially addresses this gap by pointing out that the loss landscape can be highly non-smooth and even fractal, which is far more complex than all of the aforementioned types.

3 Preliminaries

3.1 Continuous state-space MDPs

Let us consider a deterministic continuous MDP given as the dynamical system:

st+1=f(st,at),s0n,s_{t+1}=f(s_{t},a_{t}),\quad s_{0}\in\mathbb{R}^{n}, (1)

where st𝒮ns_{t}\in\mathcal{S}\subset\mathbb{R}^{n} is the state at time tt, s0s_{0} is the initial state and atπθ(|st)𝒜ma_{t}\sim\pi_{\theta}(\cdot|s_{t})\in\mathcal{A}\subset\mathbb{R}^{m} is the action taken at time tt based on a stochastic policy parameterized by θN\theta\in\mathbb{R}^{N}. The state space 𝒮\mathcal{S} is compact. Also, we define the value function VπθV^{\pi_{\theta}} of policy πθ\pi_{\theta} and the objective function J(θ)J(\theta) to be minimized as

J(θ)=Vπθ(s0)=𝔼atπθ(|st)[t=0γtR(st,at)]J(\theta)=V^{\pi_{\theta}}(s_{0})=\mathbb{E}_{a_{t}\sim\pi_{\theta}(\cdot|s_{t})}[\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t})] (2)

where γ(0,1)\gamma\in(0,1) is the discount factor and R(s,a)R(s,a) is the cost function. The following assumptions are made throughout this paper:

  • (A.1) f:n×mnf:\mathbb{R}^{n}\times\mathbb{R}^{m}\rightarrow\mathbb{R}^{n} is Lipschitz continuous over any compact domains (i.e., locally Lipschitz continuous);

  • (A.2) The cost function R:n×mR:\mathbb{R}^{n}\times\mathbb{R}^{m}\rightarrow\mathbb{R} is non-negative and locally Lipschitz continuous everywhere;

  • (A.3) The state space is closed under transitions, i.e., for any (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}, the next state s=f(s,a)𝒮s^{\prime}=f(s,a)\in\mathcal{S}.

3.2 Policy gradient methods

Generally, policy gradient methods provide an estimator of the gradient of true objective function J()J(\cdot) with respect to policy parameters. One of the most commonly used forms is

η(θ)=𝔼πθ[θlogπθ(at|st)Qπθ(st,at)]=𝔼πθ[θlogπθ(at|st)Aπθ(st,at)]\eta(\theta)=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\ Q^{\pi_{\theta}}(s_{t},a_{t})]=\mathbb{E}_{\pi_{\theta}}[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\ A^{\pi_{\theta}}(s_{t},a_{t})] (3)

where πθ(|)\pi_{\theta}(\cdot|\cdot) is a stochastic policy parameterized by θ\theta and Qπθ(,)Q^{\pi_{\theta}}(\cdot,\cdot) is the Q-value function of πθ\pi_{\theta}, and Aπθ(s,a)=Qπθ(s,a)Vπθ(s)A^{\pi_{\theta}}(s,a)=Q^{\pi_{\theta}}(s,a)-V^{\pi_{\theta}}(s) is the advantage function, often used for variance reduction. The theoretical guarantee of the convergence of policy gradient methods is essentially established upon the argument that the tail term satisfies sutton99

γtθVπθ(s)0 as t\gamma^{t}\ \nabla_{\theta}V^{\pi_{\theta}}(s)\rightarrow 0\textit{ as }t\rightarrow\infty (4)

for any s𝒮s\in\mathcal{S}. Usually, two additional assumptions are made for (4):

  • θVπθ(s)\nabla_{\theta}V^{\pi_{\theta}}(s) exists and is continuous for all s𝒮s\in\mathcal{S};

  • θVπθ(s)\|\nabla_{\theta}V^{\pi_{\theta}}(s)\| is uniformly bounded over 𝒮\mathcal{S}.

Clearly, the second assumption is automatically satisfied if the first assumption holds and 𝒮\mathcal{S} is either finite or compact. However, as we will see in Sections 4 and 6, the existence of θVπθ()\nabla_{\theta}V^{\pi_{\theta}}(\cdot) may fail in many continuous MDPs even if 𝒮\mathcal{S} is compact, which poses challenges to the fundamental well-posedness of all policy gradient methods.
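
For concreteness, the following is a minimal Monte-Carlo sketch of the estimator in (3). It is a hedged illustration rather than any particular implementation from the paper: it assumes a linear-Gaussian policy a_t ~ N(theta s_t, sigma^2 I) so that the score function has a closed form, replaces the Q-value by the sampled discounted cost-to-go, uses the gamma^t-weighted convention, and leaves the dynamics and running cost as placeholder callables step_fn and cost_fn.

```python
import numpy as np

def pg_estimate(theta, sigma, step_fn, cost_fn, s0, gamma=0.99,
                horizon=200, n_episodes=64, rng=None):
    """Monte-Carlo sketch of (3) for a linear-Gaussian policy
    a_t ~ N(theta @ s_t, sigma^2 I); Q is replaced by the sampled
    discounted cost-to-go. step_fn(s, a) -> next state and
    cost_fn(s, a) -> running cost are assumed placeholders."""
    rng = np.random.default_rng() if rng is None else rng
    grad = np.zeros_like(theta)                    # theta: (m, n) matrix
    for _ in range(n_episodes):
        s = np.array(s0, dtype=float)
        score_grads, costs = [], []
        for _ in range(horizon):
            mean = theta @ s
            a = mean + sigma * rng.standard_normal(mean.shape)
            # grad_theta log N(a; theta s, sigma^2 I) = (a - theta s) s^T / sigma^2
            score_grads.append(np.outer(a - mean, s) / sigma ** 2)
            costs.append(cost_fn(s, a))
            s = step_fn(s, a)
        # discounted cost-to-go: Q_t = sum_{k >= t} gamma^(k - t) c_k
        q, Q = 0.0, [0.0] * horizon
        for t in reversed(range(horizon)):
            q = costs[t] + gamma * q
            Q[t] = q
        for t in range(horizon):
            grad += (gamma ** t) * Q[t] * score_grads[t]
    return grad / n_episodes
```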

3.3 Maximal Lyapunov exponents

Notice that the most important property of chaotic systems is their “sensitive dependence on initial conditions.” To make it precise, consider the following dynamical system

st+1=F(st),s0N,s_{t+1}=F(s_{t}),\quad s_{0}\in\mathbb{R}^{N}, (5)

and suppose that a small perturbation ΔZ0\Delta Z_{0} is made to the initial state s0s_{0}; the resulting divergence at time tt, say ΔZ(t)\Delta Z(t), can be estimated by ΔZ(t)eλtΔZ0\|\Delta Z(t)\|\simeq e^{\lambda t}\|\Delta Z_{0}\|, where λ\lambda is called the Lyapunov exponent. For chaotic systems, positive Lyapunov exponents are usually observed, which implies an exponential divergence rate of the separation of nearby trajectories lorenz. Since the Lyapunov exponent at a given point may depend on the direction of the perturbation ΔZ0\Delta Z_{0}, and we are often interested in the largest divergence rate, the maximal Lyapunov exponent (MLE) used in this paper is defined as follows:

Definition 3.1.

(Maximal Lyapunov exponent) For the dynamical system (5), the maximal Lyapunov exponent λmax\lambda_{max} at s0s_{0} is defined as the largest value such that

λmax=limtlimΔZ001tlogΔZ(t)ΔZ0.\lambda_{max}=\lim_{t\rightarrow\infty}\lim_{\|\Delta Z_{0}\|\rightarrow 0}\frac{1}{t}\log\frac{\|\Delta Z(t)\|}{\|\Delta Z_{0}\|}. (6)

As shown in (6), not only chaotic systems but also systems with unstable equilibria can have a positive MLE at the corresponding states. Therefore, the framework built in this paper is not limited to chaotic MDPs, but is applicable to a broad class of continuous state-space MDPs having any one of these characteristics.
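
As a concrete illustration, the short sketch below estimates the MLE numerically. It is a hedged example rather than the method of kantz: for a one-dimensional map s_{t+1}=F(s_t), the limit in (6) reduces to the time average of log|F'(s_t)|, and the logistic-type map and all constants below are illustrative assumptions.

```python
import numpy as np

def mle_1d(f, df, s0, n_steps=100_000, n_transient=1_000):
    """Estimate the maximal Lyapunov exponent of a 1-D map s_{t+1} = f(s_t)
    as the time average of log|f'(s_t)|, which coincides with the limit
    in (6) for one-dimensional systems."""
    s = s0
    for _ in range(n_transient):   # discard a transient so the average
        s = f(s)                   # reflects the long-run behavior
    total = 0.0
    for _ in range(n_steps):
        total += np.log(abs(df(s)) + 1e-300)  # guard against log(0)
        s = f(s)
    return total / n_steps

# Illustrative example: logistic-type map s_{t+1} = theta * s_t * (1 - s_t)
theta = 3.7
f = lambda s: theta * s * (1.0 - s)
df = lambda s: theta * (1.0 - 2.0 * s)
print("estimated MLE:", mle_1d(f, df, s0=0.4))  # positive value => chaotic regime
```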

4 Non-smooth objective functions

Now we begin to consider the landscape of objective functions. In this section, we will show that the true objective function J(θ)J(\theta) given in (2) can be non-differentiable at some θN\theta\in\mathbb{R}^{N} where the system (1) has a positive MLE. Before showing how this argument is made, we first introduce the notion of α\alpha-Hölder continuity to extend the concept of Lipschitz continuity:

Definition 4.1.

(α\alpha-Hölder continuity) Suppose that some scalar α>0\alpha>0. A function g:Ng:\mathbb{R}^{N}\rightarrow\mathbb{R} is α\alpha-Hölder continuous at xNx\in\mathbb{R}^{N} if there exist C>0C>0 and δ>0\delta>0 such that

|g(x)g(y)|Cxyα|g(x)-g(y)|\leq C\|x-y\|^{\alpha}

for any y(x,δ)y\in\mathcal{B}(x,\delta), where (x,δ)\mathcal{B}(x,\delta) denotes the open ball of radius δ\delta centered at xx.

In the remainder of this section, we mainly investigate the Hölder continuity of VπθV^{\pi_{\theta}} and JJ, respectively, in the case of deterministic policies. After that, we discuss why non-smoothness is even more likely to arise in the stochastic case.

4.1 Hölder Exponent of Vπθ()V^{\pi_{\theta}}(\cdot) at s0s_{0}

For simplicity, we denote the deterministic policy by πθ(s)\pi_{\theta}(s) throughout this section as it directly gives the action a=πθ(s)a=\pi_{\theta}(s) instead of a probability distribution. Consider a given parameter θN\theta\in\mathbb{R}^{N} such that the MLE of (1), say λ(θ)\lambda(\theta), is greater than logγ-\log\gamma. Let s0𝒮s^{\prime}_{0}\in\mathcal{S} be another initial state that is close to s0s_{0}, i.e., δ=s0s0>0\delta=\|s^{\prime}_{0}-s_{0}\|>0 is small enough. According to the assumption (A.3), we can find a constant M>0M>0 such that both stM\|s_{t}\|\leq M and stM\|s^{\prime}_{t}\|\leq M for all tt\in\mathbb{N}. Additionally, it is natural to make the following assumptions:

  • (A.4) There exists K1>0K_{1}>0 such that ststK1δeλ(θ)t\|s^{\prime}_{t}-s_{t}\|\leq K_{1}\delta e^{\lambda(\theta)t} for all tt\in\mathbb{N} and δ=s0s0>0\delta=\|s^{\prime}_{0}-s_{0}\|>0.

  • (A.5) The policy π:N×nm\pi:\mathbb{R}^{N}\times\mathbb{R}^{n}\rightarrow\mathbb{R}^{m} is locally Lipschitz continuous;

First, we present the following theorem, which provides a lower bound for the Hölder exponent of Vπθ()V^{\pi_{\theta}}(\cdot); the detailed proof can be found in the Appendix:

Theorem 4.1.

(Non-smoothness of VπθV^{\pi_{\theta}}) Assume (A.1)-(A.5) and the parameterized policy πθ()\pi_{\theta}(\cdot) is deterministic. Let λ(θ)\lambda(\theta) denote the MLE of (1) at θN\theta\in\mathbb{R}^{N}. Suppose that λ(θ)>logγ\lambda(\theta)>-\log\gamma, then Vπθ()V^{\pi_{\theta}}(\cdot) is logγλ(θ)\frac{-\log\gamma}{\lambda(\theta)}-Hölder continuous at s0s_{0}.

Although the counterexample given in Example 4.1 shows that it is theoretically impossible to prove pp-Hölder continuity for VπθV^{\pi_{\theta}} for any p>logγλ(θ)p>\frac{-\log\gamma}{\lambda(\theta)}, we believe that providing an illustrative demonstration can help to gain a better understanding of how large MLEs lead to non-smoothness.

To show that the strongest continuity result one can prove is logγλ(θ)\frac{-\log\gamma}{\lambda(\theta)}-Hölder continuity where 0<logγλ(θ)<10<-\frac{\log\gamma}{\lambda(\theta)}<1, suppose that p(0,1]p\in(0,1] is some constant for which we would like to prove that Vπθ(s)V^{\pi_{\theta}}(s) is pp-Hölder continuous at s=s0s=s_{0}, and let us see why the largest value for pp is logγλ(θ)-\frac{\log\gamma}{\lambda(\theta)}:

According to Definition 4.1, it suffices to find some C>0C^{\prime}>0 such that |Vπθ(s0)Vπθ(s0)|Cδp|V^{\pi_{\theta}}(s^{\prime}_{0})-V^{\pi_{\theta}}(s_{0})|\leq C^{\prime}\delta^{p} when δ1\delta\ll 1. A usual way is to work on a relaxed form

|Vπθ(s0)Vπθ(s0)|t=0γt|R(st,πθ(st))R(st,πθ(st))|Cδp.|V^{\pi_{\theta}}(s^{\prime}_{0})-V^{\pi_{\theta}}(s_{0})|\leq\sum_{t=0}^{\infty}\gamma^{t}|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))|\leq C^{\prime}\delta^{p}. (7)

While handling the entire series is tricky, we instead split this analysis into three parts as shown in Figure 1: the sum of first T2T_{2} terms, the sum from t=T2+1t=T_{2}+1 to T31T_{3}-1, and the sum from t=T3t=T_{3} to infinity. First, applying (A.4) to the sum of the first T2T_{2} terms yields

t=0T2γt|R(st,πθ(st))R(st,πθ(st))|e(λ(θ)+logγ)T21γK1K2δ\sum_{t=0}^{T_{2}}\gamma^{t}|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))|\leq\frac{e^{(\lambda(\theta)+\log\gamma)T_{2}}}{1-\gamma}K_{1}K_{2}\delta (8)

where K2>0K_{2}>0 is a Lipschitz constant obtained by (A.2) and (A.5). Therefore, if we wish to bound the right-hand side of (8) by some term of order 𝒪(δp)\mathcal{O}(\delta^{p}) when δ1\delta\ll 1, the length T2(δ)T_{2}(\delta)\in\mathbb{N} should satisfy

T2(δ)C1+p1λ(θ)+logγlog(δ)T_{2}(\delta)\simeq C_{1}+\frac{p-1}{\lambda(\theta)+\log\gamma}\log(\delta) (9)

where C1>0C_{1}>0 is some constant independent of pp and δ\delta.



Figure 1: An illustration of the two series (9) and (11) that need to cover the entire \mathbb{R} when δ0\delta\rightarrow 0.

Next, for the sum of the tail terms in Vπθ()V^{\pi_{\theta}}(\cdot) starting from T3T_{3}\in\mathbb{N}, it is naturally bounded by

t=T3γt|R(st,πθ(st))R(st,πθ(st))|2M2eT3logγ1γ,\sum_{t=T_{3}}^{\infty}\gamma^{t}|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))|\leq\frac{2M_{2}e^{T_{3}\log\gamma}}{1-\gamma}, (10)

where M2=maxs𝒮R(s,πθ(s))M_{2}=\max_{s\in\mathcal{S}}R(s,\pi_{\theta}(s)) is the maximum of continuous function R(,πθ())R(\cdot,\pi_{\theta}(\cdot)) over the compact domain 𝒮\mathcal{S} (and hence exists). Therefore, letting the right-hand side of (10) be of order 𝒪(δp)\mathcal{O}(\delta^{p}) yields

T3(δ)C2+plogγlog(δ),T_{3}(\delta)\simeq C_{2}+\frac{p}{\log\gamma}\log(\delta), (11)

for some independent constant C2>0C_{2}>0. Since the sum of (8) and (10) provides a good estimate of VπθV^{\pi_{\theta}} only if the slope of T2(δ)T_{2}(\delta), say p1λ(θ)+logγ\frac{p-1}{\lambda(\theta)+\log\gamma} should be no less than that of T3(δ)T_{3}(\delta), or there would be infinitely many terms in the middle part as δ0\delta\rightarrow 0 and hence cannot be controlled by any 𝒪(δp)\mathcal{O}(\delta^{p}) terms. In this case, the slopes satisfy the inequality p1λ(θ)+logγplogγ\frac{p-1}{\lambda(\theta)+\log\gamma}\leq\frac{p}{\log\gamma} by solving which we finally obtain the desired plogγλ(θ)p\leq-\frac{\log\gamma}{\lambda(\theta)}.

Specifically, the following counterexample shows that Theorem 4.1 has already provided the strongest guaranteed continuity result for Vπθ(s)V^{\pi_{\theta}}(s) at s=s0s=s_{0}:

Example 4.1.

Consider the one-dimensional MDP st+1=f(st,at)s_{t+1}=f(s_{t},a_{t}), where

f(s,a)={1,a1,a,1<a<1,1,a1,f(s,a)=\begin{cases}-1,&a\leq-1,\\ a,&-1<a<1,\\ 1,&a\geq 1,\\ \end{cases}

is defined over the state space 𝒮=[1,1]\mathcal{S}=[-1,1] and the cost function R(s,a)=|s|R(s,a)=|s|. Also, the policy πθ(s)=θs\pi_{\theta}(s)=\theta s is linear. It is easy to verify that all assumptions (A.1)-(A.5) are satisfied. Now let s0=0s_{0}=0 and θ>1\theta>1, then applying (6) directly yields

λ(θ)\displaystyle\lambda(\theta) =limtlimδ01tlogΔZ(t)δ=limtlimδ01tlogδθtδ=logθ.\displaystyle=\lim_{t\rightarrow\infty}\lim_{\delta\rightarrow 0}\frac{1}{t}\log\frac{\|\Delta Z(t)\|}{\delta}=\lim_{t\rightarrow\infty}\lim_{\delta\rightarrow 0}\frac{1}{t}\log\frac{\delta\theta^{t}}{\delta}=\log\theta.

On the other hand, let δ>0\delta>0 be sufficiently small, then

V^{\pi_{\theta}}(\delta)=\sum_{t=0}^{\infty}\gamma^{t}s_{t}\geq\sum_{t=T_{0}(\delta)}^{\infty}\gamma^{t}=\frac{\gamma^{T_{0}(\delta)}}{1-\gamma}

where T0(δ)=1+logδlogθT_{0}(\delta)=1+\lfloor\frac{-\ \log\delta}{\log\theta}\rfloor\in\mathbb{N}, \lfloor\cdot\rfloor is the floor function, and st=1s_{t}=1 for all tT0(δ)t\geq T_{0}(\delta). Therefore, we have

Vπθ(δ)Vπθ(0)=Vπθ(δ)γlogδlogθ+11γ=γ1γδlogγlogθV^{\pi_{\theta}}(\delta)-V^{\pi_{\theta}}(0)=V^{\pi_{\theta}}(\delta)\geq\frac{\gamma^{\frac{-\log\delta}{\log\theta}+1}}{1-\gamma}=\frac{\gamma}{1-\gamma}\delta^{\frac{-\log\gamma}{\log\theta}}

Thus, p=logγλ(θ)p=\frac{-\log\gamma}{\lambda(\theta)} is the largest Hölder exponent of VπθV^{\pi_{\theta}} that can be proved in the worst case.

4.2 Hölder Exponent of J()J(\cdot) at θ\theta

The following lemma establishes a direct connection between J(θ)J(\theta) and J(θ)J(\theta^{\prime}) through VπθV^{\pi_{\theta}}:

Lemma 4.1.

Suppose that θ,θN\theta,\theta^{\prime}\in\mathbb{R}^{N}, then

Vπθ(s0)Vπθ(s0)=t=0γt(Qπθ(stθ,πθ(stθ))Vπθ(stθ))V^{\pi_{\theta^{\prime}}}(s_{0})-V^{\pi_{\theta}}(s_{0})=\sum_{t=0}^{\infty}\gamma^{t}(Q^{\pi_{\theta}}(s^{\theta^{\prime}}_{t},\pi_{\theta^{\prime}}(s^{\theta^{\prime}}_{t}))-V^{{\pi_{\theta}}}(s^{\theta^{\prime}}_{t}))

where {stθ}t=0\{s^{\theta^{\prime}}_{t}\}_{t=0}^{\infty} is the trajectory generated by the policy πθ()\pi_{\theta^{\prime}}(\cdot).

The proof can be found in the Appendix. Notice that J(θ)=Vπθ(s0)J(\theta^{\prime})=V^{\pi_{\theta^{\prime}}}(s_{0}) and J(θ)=Vπθ(s0)J(\theta)=V^{\pi_{\theta}}(s_{0}); substituting these two terms into the previous lemma and applying some additional calculations yields the main theorem as follows:

Theorem 4.2.

(Non-smoothness of JJ) Assume (A.1)-(A.5) and the parameterized policy πθ()\pi_{\theta}(\cdot) is deterministic. Let λ(θ)\lambda(\theta) denote the MLE of (1) at θN\theta\in\mathbb{R}^{N}. Suppose that λ(θ)>logγ\lambda(\theta)>-\log\gamma, then J()J(\cdot) is logγλ(θ)\frac{-\log\gamma}{\lambda(\theta)}-Hölder continuous at θ\theta.

The detailed proof of Theorem 4.2 is given in the Appendix. Actually, the most important implication of this theorem is that J(θ)J(\theta) is not guaranteed (and is very unlikely) to be Lipschitz continuous when λ(θ)>logγ\lambda(\theta)>-\log\gamma; a counterexample similar to Example 4.1 can be constructed. Finally, let us see how the loss landscape of the objective function becomes non-smooth when λ(θ)\lambda(\theta) is large through the following example:

Example 4.2.

(Logistic model) Consider the following MDP:

st+1=(1st)at,s0=0.9,s_{t+1}=(1-s_{t})a_{t},\quad s_{0}=0.9, (12)

where the policy is given by the deterministic linear function at=πθ(st)=θsta_{t}=\pi_{\theta}(s_{t})=\theta s_{t}. The objective function is defined as J(θ)=t=0γt(st2+0.1at2)J(\theta)=\sum_{t=0}^{\infty}\gamma^{t}\ (s_{t}^{2}+0.1\ a^{2}_{t}) where γ(0,1)\gamma\in(0,1) is user-defined. It is well-known that (12) begins to exhibit chaotic behavior with positive MLEs (as shown in Figure 2(a)) when θ3.3\theta\geq 3.3 jordan, so we plot the graphs of J(θ)J(\theta) for different discount factors over the interval θ[3.3,3.9]\theta\in[3.3,3.9]. From Figure 2(b) to 2(d), the non-smoothness becomes more and more significant as γ\gamma grows. In particular, Figure 2(e) shows that the value of J(θ)J(\theta) fluctuates violently even within a very small interval of θ\theta, suggesting a high degree of non-differentiability in this region.
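
A minimal sketch of this computation is given below, assuming the truncation to the first 1000 terms used for Figure 2; the number of sampled values of θ and the plotting step are illustrative choices.

```python
import numpy as np

def J(theta, gamma, s0=0.9, horizon=1000):
    """Truncated objective J(theta) = sum_t gamma^t (s_t^2 + 0.1 a_t^2)
    for the MDP s_{t+1} = (1 - s_t) a_t with linear policy a_t = theta * s_t."""
    s, total, disc = s0, 0.0, 1.0
    for _ in range(horizon):
        a = theta * s
        total += disc * (s ** 2 + 0.1 * a ** 2)
        s = (1.0 - s) * a
        disc *= gamma
    return total

thetas = np.linspace(3.3, 3.9, 2000)
curves = {g: [J(th, g) for th in thetas] for g in (0.5, 0.9, 0.99)}
# Plotting thetas against curves[g] reproduces the increasingly rough
# landscapes of Figure 2(b)-2(d); zooming into a tiny interval of theta
# for gamma = 0.99 shows the violent fluctuation of Figure 2(e).
```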

Figure 2: The value of the MLE λ(θ)\lambda(\theta) for θ[3.3,3.9]\theta\in[3.3,3.9] is shown in 2(a). The graphs of the objective function J(θ)J(\theta) for different values of γ\gamma are shown in 2(b)-2(e), where J(θ)J(\theta) is estimated by the sum of the first 1000 terms of the infinite series. Panels: (a) MLE λ(θ)\lambda(\theta); (b) γ=0.5\gamma=0.5; (c) γ=0.9\gamma=0.9; (d) γ=0.99\gamma=0.99; (e) magnified view.

4.3 Non-smoothness in the case of stochastic policies

The original MDP (1) becomes stochastic when a stochastic policy is employed. In fact, this makes the objective even less likely to be smooth, as we now demonstrate. First, let us consider a slightly modified version of the MLE:

λ~max=limtlimΔZ001tlog𝔼π[ΔZω(t)]ΔZ0.\tilde{\lambda}_{max}=\lim_{t\rightarrow\infty}\lim_{\|\Delta Z_{0}\|\rightarrow 0}\frac{1}{t}\log\frac{\|\mathbb{E}_{\pi}[\Delta Z_{\omega}(t)]\|}{\|\Delta Z_{0}\|}. (13)

where ΔZ0=s0s0\Delta Z_{0}=s^{\prime}_{0}-s_{0} is a small perturbation made to the initial state and ΔZω(t)=st(ω)st(ω)\Delta Z_{\omega}(t)=s^{\prime}_{t}(\omega)-s_{t}(\omega) is the difference in the sample path at time tt\in\mathbb{N} for ωΩ\omega\in\Omega. Since this definition is consistent with that in (6) when taking the variance to 0, we use the same notation λ(θ)\lambda(\theta) to denote the MLE at a given θN\theta\in\mathbb{R}^{N} and again assume λ(θ)>logγ\lambda(\theta)>-\log\gamma.

Recall that in the deterministic case, the Hölder continuity result was proved under the assumption that the policy πθ(s)\pi_{\theta}(s) is locally Lipschitz continuous. When it comes to the stochastic case, however, we cannot make any similar assumption for πθ(|s)\pi_{\theta}(\cdot|s). The reason is that while the deterministic policy πθ(s)\pi_{\theta}(s) directly gives the action, the stochastic policy πθ(|s)\pi_{\theta}(\cdot|s) is instead a probability density function from which the action is sampled, and hence cannot be locally Lipschitz continuous in θ\theta. For instance, consider the one-dimensional Gaussian distribution \pi_{\theta}(a|s)=\frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(a-\mu s)^{2}}{2\sigma^{2}}} where θ=[μ,σ]T\theta=[\mu,\sigma]^{T} denotes the parameters. Apparently, πθ(a|s)\pi_{\theta}(a|s) becomes more and more concentrated at a=μsa=\mu s as σ\sigma approaches 0, and eventually converges to the Dirac delta function δ(aμs)\delta(a-\mu s), which means that πθ(a|s)\pi_{\theta}(a|s) cannot be Lipschitz continuous within a neighborhood of any θ=[μ,0]T\theta=[\mu,0]^{T} even though its deterministic limit πθ(s)=μs\pi_{\theta}(s)=\mu s is indeed Lipschitz continuous.

Therefore, there is a trade-off: if we want the probability density function πθ(|s)\pi_{\theta}(\cdot|s) to cover its deterministic limit, then it cannot be locally Lipschitz continuous. Conversely, it is not allowed to cover the deterministic limit if it is assumed to be locally Lipschitz continuous. Since the former case is widely accepted in practice, we present the following example to show that in this case the Hölder exponent of the objective function J()J(\cdot) can be arbitrarily smaller than logγλ(θ)\frac{-\log\gamma}{\lambda(\theta)}:

Example 4.3.

Consider the one-dimensional MDP st+1=f(st,at)s_{t+1}=f(s_{t},a_{t}), where

f(s,a)=\begin{cases}0,&a\leq 0,\\ a,&0<a<1,\\ 1,&a\geq 1,\\ \end{cases}

is defined over the state space 𝒮=[0,1]\mathcal{S}=[0,1] and action space 𝒜=[0,)\mathcal{A}=[0,\infty). The cost function is R(s,a)=s+1R(s,a)=s+1. Also, the parameter is θ=[θ1,θ2]T2\theta=[\theta_{1},\theta_{2}]^{T}\in\mathbb{R}^{2} and the policy πθ(|s)𝒰(|θ1|s+|θ2|β,|θ1|s+2|θ2|β)\pi_{\theta}(\cdot|s)\sim\mathcal{U}(|\theta_{1}|s+|\theta_{2}|^{\beta},|\theta_{1}|s+2|\theta_{2}|^{\beta}) is a uniform distribution where β>0\beta>0 is a constant. It is easy to verify that all required assumptions are satisfied. Let the initial state s0=0s_{0}=0 and θ1>1,θ2=0\theta_{1}>1,\theta_{2}=0, then applying (13) directly yields λ(θ)=logθ1\lambda(\theta)=\log\theta_{1} similarly to Example 4.1. Now suppose that θ2>0\theta^{\prime}_{2}>0 is small and θ=[θ1,θ2]T\theta^{\prime}=[\theta_{1},\theta^{\prime}_{2}]^{T}, then for any ωΩ\omega\in\Omega, the sampled trajectory {st}\{s^{\prime}_{t}\} generated by πθ\pi_{\theta^{\prime}} satisfies

st+1(ω)θ1st(ω)+(θ2)β>θ1st(ω)θ1ts1(ω)θ1t(θ2)β\displaystyle s^{\prime}_{t+1}(\omega)\geq\theta_{1}s^{\prime}_{t}(\omega)+(\theta^{\prime}_{2})^{\beta}>\theta_{1}s^{\prime}_{t}(\omega)\geq\theta^{t}_{1}s^{\prime}_{1}(\omega)\geq\theta^{t}_{1}(\theta^{\prime}_{2})^{\beta}

when st+1(ω)<1s^{\prime}_{t+1}(\omega)<1. Thus, we have st+1(ω)=1s^{\prime}_{t+1}(\omega)=1 for all ωΩ\omega\in\Omega and tT0(θ)=1+βlogθ2logθ1t\geq T_{0}(\theta^{\prime})=1+\lfloor\frac{-\beta\log\theta^{\prime}_{2}}{\log\theta_{1}}\rfloor, which further leads to

J(\theta^{\prime})=\frac{1}{1-\gamma}+\sum_{t=0}^{\infty}\gamma^{t}\ \mathbb{E}_{\pi_{\theta^{\prime}}}[s^{\prime}_{t}]\geq J(\theta)+\sum_{t=T_{0}(\theta^{\prime})}^{\infty}\gamma^{t}\ \mathbb{E}_{\pi_{\theta^{\prime}}}[s^{\prime}_{t}]\geq J(\theta)+\frac{\gamma}{1-\gamma}(\theta^{\prime}_{2})^{\frac{-\beta\log\gamma}{\log\theta_{1}}}

using the fact that J(θ)=11γJ(\theta)=\frac{1}{1-\gamma}. Plugging θθ=θ2\|\theta-\theta^{\prime}\|=\theta^{\prime}_{2} into the above inequality yields

J(θ)J(θ)γ1γθθβlogγlogθ1.J(\theta^{\prime})-J(\theta)\geq\frac{\gamma}{1-\gamma}\|\theta^{\prime}-\theta\|^{\frac{-\beta\log\gamma}{\log\theta_{1}}}. (14)

Since β>0\beta>0 is arbitrary, the Hölder exponent can be arbitrarily close to 0.

In particular, (14) prohibits us from achieving the logγλ(θ)\frac{-\log\gamma}{\lambda(\theta)}-Hölder continuity result for J()J(\cdot) when 0<β<10<\beta<1, which further implies that the loss landscape of J()J(\cdot) can become even more fractal and non-smooth in the case of stochastic policies. As a consequence, it is not surprising that policy gradient methods perform poorly in such problems. Some discussion from the perspective of fractal theory can be found in the Appendix and also in barnsley; falconer; hardy; hirsch; mcdonald; preiss.

5 Estimating Hölder Exponents from Samples

In the previous sections, we have seen that the true objective function J(θ)J(\theta) can be highly non-smooth. Thus, gradient-based methods are no longer guaranteed to work well in the policy parameter space. The question is: how can we determine whether the objective function J(θ)J(\theta) is differentiable at some θ=θ0\theta=\theta_{0} in high-dimensional settings? To answer this question, we propose a statistical method to estimate the Hölder exponent. Consider the objective function J(θ)J(\theta) and the Gaussian distribution X𝒩(θ0,σ2N)X\sim\mathcal{N}(\theta_{0},\sigma^{2}\mathcal{I}_{N}) where N\mathcal{I}_{N} is the N×NN\times N identity matrix. The diagonal form of the covariance matrix is assumed for simplicity. In the general case, let J()J(\cdot) be α\alpha-Hölder continuous over N\mathbb{R}^{N}; then the variance of J(X)J(X) can be expressed as

Var(J(X))=\mathbb{E}_{X\sim\mathcal{N}(\theta_{0},\sigma^{2}\mathcal{I})}[(J(X)-\mathbb{E}_{X\sim\mathcal{N}(\theta_{0},\sigma^{2}\mathcal{I})}[J(X)])^{2}]=\mathbb{E}_{X\sim\mathcal{N}(\theta_{0},\sigma^{2}\mathcal{I})}[(J(X)-J(\xi^{\prime}))^{2}]

where ξN\xi^{\prime}\in\mathbb{R}^{N} is obtained from applying the intermediate value theorem to 𝔼X𝒩(θ0,σ2)[J(X)]\mathbb{E}_{X\sim\mathcal{N}(\theta_{0},\sigma^{2}\mathcal{I})}[J(X)] and hence is not a constant. If the α\alpha-Hölder continuity is tight, say |J(θ)J(θ0)|θθ0α|J(\theta)-J(\theta_{0})|\simeq\|\theta-\theta_{0}\|^{\alpha} when θθ01\|\theta-\theta_{0}\|\ll 1, then we have the following approximation

Var(J(X))=𝔼X𝒩(θ0,σ2)[Xξ2α](Var(X))α𝒪(σ2α)Var(J(X))=\mathbb{E}_{X\sim\mathcal{N}(\theta_{0},\sigma^{2}\mathcal{I})}[\|X-\xi^{\prime}\|^{2\alpha}]\simeq(Var(X))^{\alpha}\sim\mathcal{O}(\sigma^{2\alpha}) (15)

when σ1\sigma\ll 1. Therefore, (15) provides a way to directly estimate the Hölder exponent of J()J(\cdot) around any given θN\theta\in\mathbb{R}^{N}, especially when the dimension NN is large. In particular, taking the logarithm on both sides of (15) yields

logVarσ(J(X))C+2αlogσ\log Var_{\sigma}(J(X))\simeq C+2\alpha\log\sigma (16)

for some constant CC where the subscript in Varσ(J(X))Var_{\sigma}(J(X)) indicates its dependence on the standard deviation σ\sigma of XX. Thus, the log-log plot of Varσ(J(X))Var_{\sigma}(J(X)) versus σ\sigma is expected to be close to a straight line with slope k=2αk=2\alpha where α\alpha is the Hölder exponent of J()J(\cdot) around θ\theta. Therefore, one can estimate the Hölder exponent by sampling around θ\theta with different variances and estimating the slope using linear regression. Usually, J(θ)J(\theta) is Lipschitz continuous at θ=θ0\theta=\theta_{0} when the slope kk is close to or greater than 22. Some examples will be presented in Section 6.
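
The following is a minimal sketch of this procedure under stated assumptions: the sample count, the grid of standard deviations, and the helper name estimate_holder_exponent are illustrative, and the objective passed in is assumed to be an ordinary Python callable (for instance, the J(theta, gamma) helper sketched in Example 4.2).

```python
import numpy as np

def estimate_holder_exponent(J, theta0, sigmas, n_samples=200, rng=None):
    """Estimate the Hoelder exponent of J around theta0 via (16): regress
    log Var_sigma(J(X)) on log sigma for X ~ N(theta0, sigma^2 I);
    the fitted slope is approximately 2 * alpha."""
    rng = np.random.default_rng() if rng is None else rng
    log_vars = []
    for sigma in sigmas:
        X = theta0 + sigma * rng.standard_normal((n_samples, theta0.size))
        values = np.array([J(x) for x in X])
        log_vars.append(np.log(values.var()))
    slope, intercept = np.polyfit(np.log(sigmas), log_vars, 1)
    return slope / 2.0, (slope, intercept)

# Illustrative usage with the J(theta, gamma) helper sketched in Example 4.2:
# alpha, fit = estimate_holder_exponent(lambda th: J(th[0], gamma=0.99),
#                                       theta0=np.array([3.6]),
#                                       sigmas=np.logspace(-7, -3, 9))
# A slope close to (or above) 2, i.e. alpha close to 1, suggests local
# Lipschitz continuity; a much smaller slope suggests a fractal region.
```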

6 Experiments

In this section, we validate the theory established in this paper through some of the most common tasks in RL. All environments are adopted from OpenAI Gym brockman with continuous control inputs. The experiments are conducted in two steps: first, we randomly sample a parameter θ0\theta_{0} from a Gaussian distribution and estimate the gradient θη(θ0)\nabla_{\theta}\eta(\theta_{0}) from (3); second, we evaluate J(θ)J(\theta) at θ=θ0+δθη(θ0)\theta=\theta_{0}+\delta\nabla_{\theta}\eta(\theta_{0}) for each small δ>0\delta>0. Based on the previous theoretical results, the loss curve is expected to become smoother as γ\gamma decreases, since a smaller γ\gamma makes the Hölder exponent logγλ(θ)\frac{-\log\gamma}{\lambda(\theta)} larger. In the meantime, the policy gradient method (3) should give a better descent direction as the true objective function J()J(\cdot) becomes smoother. Since a single sample path can be non-smooth when the policy is stochastic, which would interfere with the desired observation, we use stochastic policies to estimate the gradient in (3) and apply their deterministic versions (by setting the variance to 0) when evaluating J(θ)J(\theta). Regarding the infinite series, we use the sum of the first 1000 terms to approximate J(θ)J(\theta). The stochastic policy is given by πθ(|s)𝒩(u(s),σ2N)\pi_{\theta}(\cdot|s)\sim\mathcal{N}(u(s),\sigma^{2}\mathcal{I}_{N}) where the mean u(s)u(s) is represented by the 2-layer neural network u(s)=W2tanh(W1s)u(s)=W_{2}\tanh(W_{1}s) with weight matrices W1r×n()W_{1}\in\mathcal{M}_{r\times n}(\mathbb{R}) and W2m×r()W_{2}\in\mathcal{M}_{m\times r}(\mathbb{R}). Let θ=[W1,W2,σ]T\theta=[W_{1},W_{2},\sigma]^{T} denote the vectorized policy parameter. For the width of the hidden layer, we use r=8r=8 for the inverted pendulum and acrobot tasks, and r=64r=64 for the hopper task.
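
A hedged sketch of the evaluation step is shown below: policy_mean unpacks the 2-layer network described above from a flattened parameter vector (with the σ entry ignored for the deterministic version), and J_truncated evaluates the truncated objective along an estimated gradient direction. The helpers step_fn and cost_fn stand in for the environment dynamics and cost and are assumptions, as is the exact parameter layout; the gradient direction g is assumed to come from an estimator of (3).

```python
import numpy as np

def policy_mean(theta, s, n_state, n_action, r=8):
    """Deterministic 2-layer policy u(s) = W2 tanh(W1 s); theta is assumed to
    store the flattened [W1, W2] (the sigma entry of the stochastic policy
    is ignored for the deterministic version)."""
    W1 = theta[: r * n_state].reshape(r, n_state)
    W2 = theta[r * n_state : r * n_state + n_action * r].reshape(n_action, r)
    return W2 @ np.tanh(W1 @ s)

def J_truncated(theta, step_fn, cost_fn, s0, gamma, horizon=1000,
                n_state=4, n_action=1, r=8):
    """Sum of the first `horizon` terms of the objective under the
    deterministic policy. step_fn(s, a) -> next state and cost_fn(s, a) ->
    running cost are placeholder callables for the environment."""
    s, total, disc = np.array(s0, dtype=float), 0.0, 1.0
    for _ in range(horizon):
        a = policy_mean(theta, s, n_state, n_action, r)
        total += disc * cost_fn(s, a)
        s = step_fn(s, a)
        disc *= gamma
    return total

# Loss curve along an estimated policy-gradient direction g at theta0:
# deltas = np.arange(200) * 1e-7
# curve = [J_truncated(theta0 + d * g, step_fn, cost_fn, s0, gamma=0.99)
#          for d in deltas]
```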

Figure 3: Experiment results of the inverted pendulum example. Panels: (a) γ=0.9\gamma=0.9, deterministic; (b) γ=0.9\gamma=0.9, stochastic; (c) γ=0.99\gamma=0.99, deterministic; (d) γ=0.99\gamma=0.99, stochastic; (e) linear regression y=1.980x+2.088y=1.980x+2.088. In 3(e), the linear regression result is obtained for γ=0.9\gamma=0.9. The loss curves J(θ)J(\theta) are presented in 3(a)-3(d) where θ=θ0+δθη(θ0)\theta=\theta_{0}+\delta\nabla_{\theta}\eta(\theta_{0}) with step size 10710^{-7}.

Inverted pendulum.

The inverted pendulum task is a standard test case for RL algorithms, and here we use it as an example of a non-chaotic system. The initial state is always taken as s0=[1,0]Ts_{0}=[-1,0]^{T} ([0,0]T[0,0]^{T} is the upright position), and the quadratic cost function is R(s,a)=stTQst+0.001at2R(s,a)=s^{T}_{t}Qs_{t}+0.001\|a_{t}\|^{2}, where Q=diag(1,0.1)Q={\text{diag}}(1,0.1) is a 2×22\times 2 diagonal matrix, st4s_{t}\in\mathbb{R}^{4} and ata_{t}\in\mathbb{R}. The initial parameter is given by θ0𝒩(0,0.052)\theta_{0}\sim\mathcal{N}(0,0.05^{2}\ \mathcal{I}). In Figures 3(a) and 3(c), we see that the loss curve is close to a straight line within a very small interval, which indicates local smoothness at θ0\theta_{0}. This is validated by the estimate of the Hölder exponent of J(θ)J(\theta) at θ=θ0\theta=\theta_{0}, which is based on (15) by sampling many parameters around θ0\theta_{0} with different variances. In Figure 3(e), the slope k=1.980k=1.980 is very close to 22, so Lipschitz continuity (and hence differentiability) is indicated at θ=θ0\theta=\theta_{0}. As a comparison, the loss curve of a single sample path is totally non-smooth, as shown in Figures 3(b) and 3(d).

Figure 4: Experiment results of the acrobot example. Panels: (a) γ=0.8\gamma=0.8, deterministic; (b) γ=0.9\gamma=0.9, deterministic; (c) γ=0.99\gamma=0.99, deterministic; (d) γ=0.99\gamma=0.99, stochastic; (e) linear regression y=0.8631x+2.041y=0.8631x+2.041. In Figure 4(e), the linear regression result is obtained for γ=0.9\gamma=0.9. The loss curves J(θ)J(\theta) are presented in 4(a)-4(d) where θ=θ0+δθη(θ0)\theta=\theta_{0}+\delta\nabla_{\theta}\eta(\theta_{0}) with step size 10710^{-7}.

Acrobot.

The acrobot system is well-known for its chaotic behavior, and hence we use it as the main test case. Here we use the cost function R(s,a)=stTQst+0.005at2R(s,a)=s^{T}_{t}Qs_{t}+0.005\|a_{t}\|^{2}, where Q=diag(1,1,0.1,0.1)Q={\text{diag}}(1,1,0.1,0.1), st4s_{t}\in\mathbb{R}^{4} and ata_{t}\in\mathbb{R}. The initial state is s0=[1,0,0,0]Ts_{0}=[1,0,0,0]^{T}. The initial parameter is again sampled from θ0𝒩(0,0.052)\theta_{0}\sim\mathcal{N}(0,0.05^{2}\ \mathcal{I}). From Figure 4(a)-4(c), the non-smoothness grows as γ\gamma increases, and the objective finally becomes completely non-differentiable when γ=0.99\gamma=0.99, which is the most common value used for the discount factor. This may partially explain why the acrobot task is difficult and almost unsolvable by policy gradient methods. In Figure 4(e), the Hölder exponent of J(θ)J(\theta) at θ=θ0\theta=\theta_{0} is estimated as α0.43155<1\alpha\simeq 0.43155<1, which further indicates non-differentiability around θ0\theta_{0}.

Figure 5: Experiment results of the hopper example. Panels: (a) γ=0.8\gamma=0.8, deterministic; (b) γ=0.9\gamma=0.9, deterministic; (c) γ=0.99\gamma=0.99, deterministic; (d) γ=0.99\gamma=0.99, stochastic; (e) linear regression y=0.5036x1.250y=0.5036x-1.250. In Figure 5(e), the linear regression result is obtained for γ=0.9\gamma=0.9. The loss curves J(θ)J(\theta) are presented in 5(a)-5(d) where θ=θ0+δθη(θ0)\theta=\theta_{0}+\delta\nabla_{\theta}\eta(\theta_{0}) with step size 10310^{-3}.

Hopper.

Now we consider the hopper task, in which the cost function is defined as R(s,a)=(1.25s[0])+0.001a2R(s,a)=(1.25-s[0])+0.001\|a\|^{2}, where s[0]s[0] is the first coordinate of s11s\in\mathbb{R}^{11}, indicating the height of the hopper. Because the number of parameters involved in the neural network is larger, the initial parameter is instead sampled from θ0𝒩(0,102)\theta_{0}\sim\mathcal{N}(0,10^{2}\ \mathcal{I}). As shown in Figure 5(a), the loss curve is almost a straight line when γ=0.8\gamma=0.8; it begins to show non-smoothness when γ=0.9\gamma=0.9 and becomes totally non-differentiable when γ=0.99\gamma=0.99. Supporting evidence is given by the Hölder exponent estimate in Figure 5(e), where the slope is far less than 22.

7 Conclusion

In this paper, we initiate the study of chaotic behavior in reinforcement learning, focusing especially on how it is reflected in the non-smoothness of objective functions. A method to statistically estimate the Hölder exponent at a given parameter is proposed, so that one can detect whether the training has run into a fractal region and hence determine whether to stop. We believe that the theory established in this paper can help to explain many existing results in reinforcement learning, such as the unsolvability of the acrobot task and the fluctuating behavior of training curves. It also poses a serious question about the well-posedness of policy gradient methods, given the fact that no gradient exists in many continuous state-space RL problems. Finally, we conjecture that chaotic behavior may be able to explain the limits of a much wider range of deep learning problems beyond the scope of reinforcement learning.

Checklist


  1. For all authors…
    (a) Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? [Yes]
    (b) Did you describe the limitations of your work? [Yes]
    (c) Did you discuss any potential negative societal impacts of your work? [N/A]
    (d) Have you read the ethics review guidelines and ensured that your paper conforms to them? [Yes]
  2. If you are including theoretical results…
    (a) Did you state the full set of assumptions of all theoretical results? [Yes]
    (b) Did you include complete proofs of all theoretical results? [Yes]
  3. If you ran experiments…
    (a) Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)? [Yes]
    (b) Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? [Yes]
    (c) Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? [N/A]
    (d) Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? [N/A]
  4. If you are using existing assets (e.g., code, data, models) or curating/releasing new assets…
    (a) If your work uses existing assets, did you cite the creators? [N/A]
    (b) Did you mention the license of the assets? [N/A]
    (c) Did you include any new assets either in the supplemental material or as a URL? [N/A]
    (d) Did you discuss whether and how consent was obtained from people whose data you’re using/curating? [N/A]
    (e) Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? [N/A]
  5. If you used crowdsourcing or conducted research with human subjects…
    (a) Did you include the full text of instructions given to participants and screenshots, if applicable? [N/A]
    (b) Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? [N/A]
    (c) Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? [N/A]

Appendix A Brief introduction to chaos theory

As mentioned in the Introduction, chaos exists in many systems in the real world. Although no universal definition of chaos can be made, there are, indeed, three features that a chaotic system usually exhibits [hirsch]:

  • Dense periodic points;

  • Topological transitivity;

  • Sensitive dependence on initial conditions;

In some cases, some of these properties imply the others. It is important to note that, although the appearance of chaos is always accompanied by high unpredictability, chaotic behavior is entirely deterministic and is not a consequence of randomness. Another interesting fact is that trajectories in a chaotic system are usually bounded, which forces us to think about the convergence of policy gradient methods beyond the boundedness property.

Figure 6: (a) Lorenz attractor; (b) Rössler attractor. The Lorenz system and the Rössler system are standard examples of chaotic systems, in which a small perturbation on the initial state can cause a significant divergence in the entire trajectory.

Actually, it can be summarized from the results in this paper that for a given MDP, it has to possess the following three features to exhibit chaotic behavior:

  • Infinite time horizon (allows tt\rightarrow\infty);

  • Continuous state space (allows ΔZ00\|\Delta Z_{0}\|\rightarrow 0);

  • Exponential divergence (allows λmax>0\lambda_{max}>0);

Since these features are not necessarily tied to certain types of continuous state-space MDPs, it would be exciting for future studies to investigate other types of MDPs using the framework developed in this paper.

Appendix B Proofs omitted in Section 4

B.1 Proof of Theorem 4.1

Proof.

Suppose that s0𝒮s^{\prime}_{0}\in\mathcal{S} is another initial state close to s0s_{0}. Let T1T_{1}\in\mathbb{N} be the smallest integer that satisfies

T11λ(θ)log(2MK1δ),T_{1}\geq\frac{1}{\lambda(\theta)}\log(\frac{2M}{K_{1}\delta}), (17)

then applying the Lipschitz condition of R(,πθ())R(\cdot,\pi_{\theta}(\cdot)) yields

t=0T1γt|R(st,πθ(st))R(st,πθ(st))|\displaystyle\sum_{t=0}^{T_{1}}\gamma^{t}|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))| t=0T1γtK2stst\displaystyle\leq\sum_{t=0}^{T_{1}}\gamma^{t}K_{2}\|s_{t}-s^{\prime}_{t}\|
t=0T1K1K2e(λ(θ)+logγ)tδ\displaystyle\leq\sum_{t=0}^{T_{1}}K_{1}K_{2}e^{(\lambda(\theta)+\log\gamma)t}\delta
K1K2δeλ(θ)+logγλ(θ)log(2MK1δ)+1e(λ(θ)+logγ)1\displaystyle\leq K_{1}K_{2}\delta\frac{e^{\frac{\lambda(\theta)+\log\gamma}{\lambda(\theta)}\log(\frac{2M}{K_{1}\delta})+1}}{e^{(\lambda(\theta)+\log\gamma)}-1}
=eK2K1logγλ(θ)(2M)1+logγλ(θ)e(λ(θ)+logγ)1δlogγλ(θ)\displaystyle=\frac{eK_{2}K_{1}^{\frac{-\log\gamma}{\lambda(\theta)}}(2M)^{1+\frac{\log\gamma}{\lambda(\theta)}}}{e^{(\lambda(\theta)+\log\gamma)}-1}\ \delta^{\frac{-\log\gamma}{\lambda(\theta)}}

where K2>0K_{2}>0 is the Lipschitz constant of R(,πθ())R(\cdot,\pi_{\theta}(\cdot)) over compact set 𝒮\mathcal{S}.

On the other hand, the tail terms in Vπθ()V^{\pi_{\theta}}(\cdot) are bounded by

t=T1+1γt|R(st,πθ(st))R(st,πθ(st))|\displaystyle\sum_{t=T_{1}+1}^{\infty}\gamma^{t}|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))| t=T1+12M2γt\displaystyle\leq\sum_{t=T_{1}+1}^{\infty}2M_{2}\gamma^{t}
t=T12M2γt\displaystyle\leq\sum_{t=T_{1}}^{\infty}2M_{2}\gamma^{t}
=2M2γT11γ\displaystyle=2M_{2}\frac{\gamma^{T_{1}}}{1-\gamma}
2M21γ(K12M)logγλ(θ)δlogγλ(θ)\displaystyle\leq\frac{2M_{2}}{1-\gamma}(\frac{K_{1}}{2M})^{\frac{-\log\gamma}{\lambda(\theta)}}\ \delta^{\frac{-\log\gamma}{\lambda(\theta)}}

using that |R(st,πθ(st))R(st,πθ(st))|2M2|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))|\leq 2M_{2} where M2=maxs𝒮R(s,πθ(s))M_{2}=\max_{s\in\mathcal{S}}R(s,\pi_{\theta}(s)) is the maximum of the continuous function R(,πθ())R(\cdot,\pi_{\theta}(\cdot)) over 𝒮\mathcal{S}.

Combining the above two inequalities yields

|Vπθ(s0)Vπθ(s0)|\displaystyle|V^{\pi_{\theta}}(s^{\prime}_{0})-V^{\pi_{\theta}}(s_{0})|
\displaystyle\leq t=0γt|R(st,πθ(st))R(st,πθ(st))|\displaystyle\sum_{t=0}^{\infty}\gamma^{t}|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))|
=\displaystyle= t=0T1γt|R(st,πθ(st))R(st,πθ(st))|+t=T1+1γt|R(st,πθ(st))R(st,πθ(st))|\displaystyle\sum_{t=0}^{T_{1}}\gamma^{t}|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))|+\sum_{t=T_{1}+1}^{\infty}\gamma^{t}|R(s_{t},\pi_{\theta}(s_{t}))-R(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))|
\displaystyle\leq (eK2K1logγλ(θ)(2M)1+logγλ(θ)e(λ(θ)+logγ)1+2M21γ(K12M)logγλ(θ))δlogγλ(θ)\displaystyle(\frac{eK_{2}K_{1}^{\frac{-\log\gamma}{\lambda(\theta)}}(2M)^{1+\frac{\log\gamma}{\lambda(\theta)}}}{e^{(\lambda(\theta)+\log\gamma)}-1}+\frac{2M_{2}}{1-\gamma}(\frac{K_{1}}{2M})^{\frac{-\log\gamma}{\lambda(\theta)}})\ \delta^{\frac{-\log\gamma}{\lambda(\theta)}}

and we complete the proof. ∎

B.2 Proof of Lemma 4.1

Proof.

For ease of notation, let st=stθ,st=stθ,u(s)=πθ(s)s_{t}=s^{\theta}_{t},s^{\prime}_{t}=s^{\theta^{\prime}}_{t},u(s)=\pi_{\theta}(s) and u(s)=πθ(s)u^{\prime}(s)=\pi_{\theta^{\prime}}(s).

Vπθ(s0)Vπθ(s0)\displaystyle V^{\pi_{\theta^{\prime}}}(s_{0})-V^{\pi_{\theta}}(s_{0}) =t=0γtR(st;u(st))Vπθ(s0)\displaystyle=\sum_{t=0}^{\infty}\gamma^{t}R(s^{\prime}_{t};u^{\prime}(s^{\prime}_{t}))-V^{\pi_{\theta}}(s_{0})
=t=0γt(R(st;u(st))+Vπθ(st)Vπθ(st))Vπθ(s0)\displaystyle=\sum_{t=0}^{\infty}\gamma^{t}(R(s^{\prime}_{t};u^{\prime}(s^{\prime}_{t}))+V^{\pi_{\theta}}(s^{\prime}_{t})-V^{\pi_{\theta}}(s^{\prime}_{t}))-V^{\pi_{\theta}}(s_{0})
=t=0γt(R(st;u(st))+γVπθ(st+1)Vπθ(st)+Vπθ(st)γVπθ(st+1))Vπθ(s0)\displaystyle=\sum_{t=0}^{\infty}\gamma^{t}(R(s^{\prime}_{t};u^{\prime}(s^{\prime}_{t}))+\gamma V^{\pi_{\theta}}(s^{\prime}_{t+1})-V^{\pi_{\theta}}(s^{\prime}_{t})+V^{\pi_{\theta}}(s^{\prime}_{t})-\gamma V^{\pi_{\theta}}(s^{\prime}_{t+1}))-V^{\pi_{\theta}}(s_{0})
=t=0γt(Qθ(st,u(st))Vπθ(st))+t=0γt(Vπθ(st)γVπθ(st+1))Vπθ(s0).\displaystyle=\sum_{t=0}^{\infty}\gamma^{t}(Q^{\theta}(s^{\prime}_{t},u^{\prime}(s^{\prime}_{t}))-V^{\pi_{\theta}}(s^{\prime}_{t}))+\sum_{t=0}^{\infty}\gamma^{t}(V^{\pi_{\theta}}(s^{\prime}_{t})-\gamma V^{\pi_{\theta}}(s^{\prime}_{t+1}))-V^{\pi_{\theta}}(s_{0}).

Using the fact that γtVπθ(st+1)0\gamma^{t}V^{\pi_{\theta}}(s^{\prime}_{t+1})\rightarrow 0 as tt\rightarrow\infty (by the boundedness of VπθV^{\pi_{\theta}}) yields

Vπθ(s0)Vπθ(s0)\displaystyle V^{\pi_{\theta^{\prime}}}(s_{0})-V^{\pi_{\theta}}(s_{0}) =t=0γt(Qπθ(st,u(st))Vπθ(st))+Vπθ(s0)Vπθ(s0)\displaystyle=\sum_{t=0}^{\infty}\gamma^{t}(Q^{\pi_{\theta}}(s^{\prime}_{t},u^{\prime}(s^{\prime}_{t}))-V^{\pi_{\theta}}(s^{\prime}_{t}))+V^{\pi_{\theta}}(s^{\prime}_{0})-V^{\pi_{\theta}}(s_{0})
=t=0γt(Qπθ(st,u(st))Vπθ(st))\displaystyle=\sum_{t=0}^{\infty}\gamma^{t}(Q^{\pi_{\theta}}(s^{\prime}_{t},u^{\prime}(s^{\prime}_{t}))-V^{\pi_{\theta}}(s^{\prime}_{t}))

and the proof is completed using s0=s0s^{\prime}_{0}=s_{0}. ∎

B.3 Proof of Theorem 4.2

Proof.

First, we will show that Qπθ(s,a)Q^{\pi_{\theta}}(s,a) is logγλ(θ)\frac{-\log\gamma}{\lambda(\theta)}-Hölder continuous with respect to aa. Note that for any a,a𝒜a,a^{\prime}\in\mathcal{A},

|Qπθ(s,a)Qπθ(s,a)|\displaystyle|Q^{\pi_{\theta}}(s,a)-Q^{\pi_{\theta}}(s,a^{\prime})| |R(s,a)R(s,a)|+γ|Vπθ(f(s,a))Vπθ(f(s,a))|\displaystyle\leq|R(s,a)-R(s,a^{\prime})|+\gamma|V^{\pi_{\theta}}(f(s,a))-V^{\pi_{\theta}}(f(s,a^{\prime}))|
K1aa+γf(s,a)f(s,a)logγλ(θ)\displaystyle\leq K_{1}\|a-a^{\prime}\|+\gamma\|f(s,a)-f(s,a^{\prime})\|^{\frac{-\log\gamma}{\lambda(\theta)}}
K1aa+γK2aalogγλ(θ)\displaystyle\leq K_{1}\|a-a^{\prime}\|+\gamma K_{2}\|a-a^{\prime}\|^{\frac{-\log\gamma}{\lambda(\theta)}}
K3aalogγλ(θ)\displaystyle\leq K_{3}\|a-a^{\prime}\|^{\frac{-\log\gamma}{\lambda(\theta)}}

for some K3>0K_{3}>0, using the locally Lipschitz continuity of RR and ff together with the Hölder continuity of VπθV^{\pi_{\theta}} from Theorem 4.1.

Noting that Vπθ(s)=Qπθ(s,πθ(s))V^{\pi_{\theta}}(s)=Q^{\pi_{\theta}}(s,\pi_{\theta}(s)) and combining it with Lemma 4.1 yields

|J(θ)J(θ)|\displaystyle|J(\theta^{\prime})-J(\theta)| t=0γt|Qπθ(st,u(st))Vπθ(st)|\displaystyle\leq\sum_{t=0}^{\infty}\gamma^{t}|Q^{\pi_{\theta}}(s^{\prime}_{t},u^{\prime}(s^{\prime}_{t}))-V^{\pi_{\theta}}(s^{\prime}_{t})|
=t=0γt|Qπθ(st,πθ(st))Qπθ(st,πθ(st))|\displaystyle=\sum_{t=0}^{\infty}\gamma^{t}|Q^{\pi_{\theta}}(s^{\prime}_{t},\pi_{\theta^{\prime}}(s^{\prime}_{t}))-Q^{\pi_{\theta}}(s^{\prime}_{t},\pi_{\theta}(s^{\prime}_{t}))|
t=0γtK3πθ(st)πθ(st)logγλ(θ)\displaystyle\leq\sum_{t=0}^{\infty}\gamma^{t}K_{3}\|\pi_{\theta^{\prime}}(s^{\prime}_{t})-\pi_{\theta}(s^{\prime}_{t})\|^{\frac{-\log\gamma}{\lambda(\theta)}}
t=0γtK3K4θθlogγλ(θ)\displaystyle\leq\sum_{t=0}^{\infty}\gamma^{t}K_{3}K_{4}\|\theta^{\prime}-\theta\|^{\frac{-\log\gamma}{\lambda(\theta)}}
=K3K41γθθlogγλ(θ)\displaystyle=\frac{K_{3}K_{4}}{1-\gamma}\|\theta^{\prime}-\theta\|^{\frac{-\log\gamma}{\lambda(\theta)}}

using the fact that πθ(s)\pi_{\theta}(s) is Lipschitz continuous, with some constant K4>0K_{4}>0, in a neighborhood of (θ,s)N×𝒮(\theta,s)\in\mathbb{R}^{N}\times\mathcal{S}, and we complete the proof. ∎

Appendix C From the perspective of fractal theory

We will go through some basic concepts in fractal theory that are related to the study of non-smooth functions.

C.1 The Hausdorff dimension

The Hausdorff dimension is the most fundamental concept in fractal theory. First, let us define the concept of δ\delta-cover and Hausdorff measure

Definition C.1.

(δ\delta-cover) Let {Ui}\{U_{i}\} be a countable collection of sets of diameter at most δ\delta (i.e. |Ui|=sup{xy:x,yUi}δ|U_{i}|=\sup\{\|x-y\|:x,y\in U_{i}\}\leq\delta) and FNF\subset\mathbb{R}^{N}, then {Ui}\{U_{i}\} is a δ\delta-cover of FF if Fi=1UiF\subset\cup_{i=1}^{\infty}U_{i}.

Definition C.2.

(Hausdorff measure) For any FNF\subset\mathbb{R}^{N} and s0s\geq 0, let

δs(F)=inf{i=1|Ui|s:{Ui} is a δ-cover of F}.\mathcal{H}_{\delta}^{s}(F)=\inf\{\sum_{i=1}^{\infty}|U_{i}|^{s}:\{U_{i}\}\textit{ is a }\delta\textit{-cover of }F\}.

Then we call the limit s(F)=limδ0δs(F)\mathcal{H}^{s}(F)=\lim_{\delta\rightarrow 0}\mathcal{H}_{\delta}^{s}(F) the ss-dimensional Hausdorff measure of FF.

Definition C.3.

(Hausdorff dimension) Let FNF\subset\mathbb{R}^{N} be a subset, then its Hausdorff dimension

dimHF=inf{s0:s(F)=0}=sup{s0:s(F)=}.\dim_{H}F=\inf\{s\geq 0:\mathcal{H}^{s}(F)=0\}=\sup\{s\geq 0:\mathcal{H}^{s}(F)=\infty\}.

Indeed, the Hausdorff dimension is well-defined: First, it is clear that when δ<1\delta<1, δs(F)\mathcal{H}_{\delta}^{s}(F) is non-increasing with respect to ss. Thus, s(F)\mathcal{H}^{s}(F) is non-increasing as well.

Let s0s\geq 0 such that s(F)<\mathcal{H}^{s}(F)<\infty, then for any t>st>s and any δ\delta-cover {Ui}\{U_{i}\} of FF, we have

i=1|Ui|tδtsi=1|Ui|s\sum_{i=1}^{\infty}|U_{i}|^{t}\leq\delta^{t-s}\sum_{i=1}^{\infty}|U_{i}|^{s}

which implies t(F)=0\mathcal{H}^{t}(F)=0 by taking infimum on both sides and letting δ0\delta\rightarrow 0.

Therefore, the set {s0:0<s(F)<}\{s\geq 0:0<\mathcal{H}^{s}(F)<\infty\} contains at most one point, which further implies inf{s0:s(F)=0}=sup{s0:s(F)=}\inf\{s\geq 0:\mathcal{H}^{s}(F)=0\}=\sup\{s\geq 0:\mathcal{H}^{s}(F)=\infty\}.

More details regarding the well-posedness of the Hausdorff dimension can be found in [barnsley, falconer]. In particular, one can easily verify that the Hausdorff dimension coincides with the standard dimension (i.e. ss\in\mathbb{N}) when FF is a regular manifold. Typically, the Hausdorff dimension of a fractal is not an integer, and we will be exploiting this fact throughout this section. To see how it is associated with the non-smoothness of functions, the following claim makes a rigorous statement on the dimension of fractal loss surfaces:

Proposition C.1.

([falconer]) Let FkF\subset\mathbb{R}^{k} be a subset and suppose that η:Fp\eta:F\rightarrow\mathbb{R}^{p} is α\alpha-Hölder continuous, then dimHη(F)1αdimHF\dim_{H}\eta(F)\leq\frac{1}{\alpha}\dim_{H}F.

An immediate implication is that if the objective function is α\alpha-Hölder for some α<1\alpha<1, its loss landscape J={(θ,J(θ))N+1:θN}\mathcal{L}_{J}=\{(\theta,J(\theta))\in\mathbb{R}^{N+1}:\theta\in\mathbb{R}^{N}\} can be a fractal with Hausdorff dimension dimHJ(N,N+1]\dim_{H}\mathcal{L}_{J}\in(N,N+1]. A famous example is the Weierstrass function shown in Figure 7. A comparison of Figure 2(e) and Figure 7(c) (they have the same scale) gives a sense of how non-smooth the objective function can be.
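
For reference, a minimal sketch for reproducing a Weierstrass-type curve with the parameters of Figure 7 is shown below; truncating the series at 30 terms is an illustrative choice (the neglected tail is bounded by a^30/(1-a)).

```python
import numpy as np

def weierstrass(x, a=0.6, b=7, n_terms=30):
    """Partial sum of W(x) = sum_n a^n cos(b^n pi x); with 0 < a < 1 the
    neglected tail is bounded by a**n_terms / (1 - a)."""
    n = np.arange(n_terms, dtype=float)
    return np.sum((a ** n) * np.cos(np.outer(x, np.power(float(b), n)) * np.pi),
                  axis=1)

x = np.linspace(-2.0, 2.0, 20001)
w = weierstrass(x)
# Plotting (x, w), and then zooming into an interval such as [0.5, 0.5001],
# reproduces the self-similar roughness illustrated in Figure 7(b)-7(c).
```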

Figure 7: (a) shows what the double sector S(x,ϕ,ψ)S(x,\phi,\psi) looks like. In (b) and (c), the Weierstrass function is given by W(x)=n=0ancos(bnπx)W(x)=\sum_{n=0}^{\infty}a^{n}\cos(b^{n}\pi x) with a=0.6,b=7a=0.6,b=7; (b) plots W(x)W(x) over x[2,2]x\in[-2,2] and (c) zooms into x[0.5,0.5001]x\in[0.5,0.5001]. The Hausdorff dimension of its graph is dimHW=2+logba1.73\dim_{H}\mathcal{L}_{W}=2+\log_{b}a\simeq 1.73. Also, according to [hardy], such W(x)W(x) is nowhere differentiable when 0<a<10<a<1 and ab1ab\geq 1.

C.2 Non-existence of tangent plane

Actually, when J()J(\cdot) is Lipschitz continuous on any compact subset of N\mathbb{R}^{N}, Rademacher's theorem tells us that it is differentiable almost everywhere, which implies the existence of tangent planes. For fractal landscapes, however, the tangent plane itself does not exist at almost every θN\theta\in\mathbb{R}^{N}, which makes all policy gradient algorithms ill-posed. Although similar results were obtained for higher-dimensional cases as in [preiss], we restrict attention to the two-dimensional case, which provides clearer geometric intuition. First, we introduce the notion of ss-sets:

Definition C.4.

Let F2F\subset\mathbb{R}^{2} be a Borel set and s0s\geq 0, then FF is called an ss-set if 0<s(F)<0<\mathcal{H}^{s}(F)<\infty.

The intuition is that when the dimension of the fractal FF is a fraction between 1 and 2, there is no direction along which a significant part of FF concentrates within a small double sector with vertex xx, as shown in Figure 7(a). To be precise, let S(x,ϕ,ψ)S(x,\phi,\psi) denote the double sector and let r>0r>0; we say that FF has a tangent at xFx\in F if there exists a direction ϕ\phi such that, for every angle ψ>0\psi>0, the following hold:

  1. \limsup_{r\rightarrow 0}\frac{\mathcal{H}^{s}(F\cap\mathcal{B}(x,r))}{(2r)^{s}}>0;

  2. \lim_{r\rightarrow 0}\frac{\mathcal{H}^{s}(F\cap(\mathcal{B}(x,r)\setminus S(x,\phi,\psi)))}{(2r)^{s}}=0;

where the first condition means that the set FF behaves like a fractal around xx, and the second condition means that the part of FF that lies outside any double sector S(x,ϕ,ψ)S(x,\phi,\psi) is negligible as r0r\rightarrow 0. Then, the main result is as follows:

Proposition C.2.

(Non-existence of tangent planes, [falconer]) If F2F\subset\mathbb{R}^{2} is an ss-set with 1<s<21<s<2, then at almost all points of FF, no tangent exists.

Therefore, "estimate the gradient" no longer makes sense since there does not exist a tangent line/plane at almost every point on the loss surface. This means that all policy gradient algorithms are ill-posed since there is no gradient for them to estimate at all.

C.3 Accumulated uncertainty

Another issue that may emerge during the optimization process is the accumulation of uncertainty. To see how the uncertainty introduced at each step accumulates and eventually blows up when generating a path along fractal boundaries, let us consider the following toy problem. Suppose that the distance between the initial point θ0\theta_{0} and the target θ\theta^{*} is d>0d>0, and that a step of size δk>0\delta_{k}>0 is taken at the kk-th step, as shown in Figure 8(a). Suppose there exists c>0c>0 such that the projection θθ0,θk+1θkcdδk\langle\theta^{*}-\theta_{0},\theta_{k+1}-\theta_{k}\rangle\geq cd\delta_{k} for all kk\in\mathbb{N}, which implies that the angle between the direction from θk\theta_{k} to θk+1\theta_{k+1} and the true direction θθ0\theta^{*}-\theta_{0} does not exceed arccos(c)\arccos(c). In this case, a successful path {θk}\{\theta_{k}\} that converges to θ\theta^{*} should give

k=0cdδkk=0θθ0,θk+1θk=θθ0,θθ0=d2\sum_{k=0}^{\infty}cd\delta_{k}\leq\sum_{k=0}^{\infty}\langle\theta^{*}-\theta_{0},\theta_{k+1}-\theta_{k}\rangle=\langle\theta^{*}-\theta_{0},\theta^{*}-\theta_{0}\rangle=d^{2}

using θkθ\theta_{k}\rightarrow\theta^{*} as kk\rightarrow\infty, which is equivalent to k=0δkdc\sum_{k=0}^{\infty}\delta_{k}\leq\frac{d}{c}.

Figure 8: Illustrations of the statistical difficulties of implementing search gradient algorithms on a fractal loss surface: (a) generating a path from θ0\theta_{0} to θ\theta^{*}; (b) updating θn+1\theta_{n+1} from θn\theta_{n}.

On the other hand, when walking on the loss surface, it is not guaranteed to follow the correct direction at each iteration. For any small step size δ>0\delta>0, the uncertainty fraction u(δ)u(\delta) involved in every single step can be estimated by the following result [mcdonald]:

Proposition C.3.

Let δ>0\delta>0 be the step size and β=NdimHJ\beta=N-\dim_{H}J where dimHJ\dim_{H}J is the Hausdorff dimension of loss surface of J()J(\cdot), then the uncertainty u(δ)𝒪(δβ)u(\delta)\sim\mathcal{O}(\delta^{\beta}) when δ1\delta\ll 1.

Therefore, we may assume that there exists another c>0c^{\prime}>0 such that the uncertainty UkU_{k} at the kk-th step has UkcδkβU_{k}\leq c^{\prime}\delta_{k}^{\beta} for all k=0,1,k=0,1,.... Then, the accumulated uncertainty

U=k=0Ukck=0δkβU=\sum_{k=0}^{\infty}U_{k}\leq c^{\prime}\sum_{k=0}^{\infty}\delta_{k}^{\beta}

is bounded when β=1\beta=1 (i.e., the boundary is smooth) using the earlier result k=0δkdc\sum_{k=0}^{\infty}\delta_{k}\leq\frac{d}{c}. However, the convergence of k=0δk\sum_{k=0}^{\infty}\delta_{k} no longer guarantees the convergence of k=0δkβ\sum_{k=0}^{\infty}\delta_{k}^{\beta} when β<1\beta<1, and a counterexample is the following series:

\delta_{k}=\frac{1}{(k+1)(\log(k+2))^{2}}

for all k=0,1,k=0,1,..., which means that the uncertainty accumulated over the course of iterations may blow up and eventually lead the sequence {θk}\{\theta_{k}\} toward randomness.
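
A quick numerical check of this counterexample is sketched below; the number of terms and the chosen values of β are illustrative assumptions.

```python
import numpy as np

# Step sizes delta_k = 1 / ((k + 1) * log(k + 2)^2): the series converges,
# but its beta-th powers diverge for beta < 1, so the accumulated
# uncertainty bound c' * sum_k delta_k^beta can blow up on a fractal surface.
k = np.arange(1_000_000, dtype=float)
delta = 1.0 / ((k + 1.0) * np.log(k + 2.0) ** 2)
for beta in (1.0, 0.8, 0.5):
    print(f"beta = {beta}: partial sum = {np.sum(delta ** beta):.2f}")
# The beta = 1.0 partial sum stabilizes as more terms are added, whereas the
# beta < 1 partial sums keep growing without bound.
```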