
Accelerating Reinforcement Learning with a Directional-Gaussian-Smoothing Evolution Strategy

Jiaxin Zhang, Hoang Tran, and Guannan Zhang (corresponding author; Email: [email protected])
Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831
Abstract

Evolution strategy (ES) has shown great promise in many challenging reinforcement learning (RL) tasks, rivaling other state-of-the-art deep RL methods. Yet, two limitations in current ES practice may hinder its further capabilities. First, most current methods rely on Monte Carlo type gradient estimators to suggest search directions, where the policy parameters are, in general, randomly sampled. Due to the low accuracy of such estimators, RL training may suffer from slow convergence and require more iterations to reach an optimal solution. Second, the landscape of the reward function can be deceptive and contain many local maxima, causing ES algorithms to converge prematurely and preventing them from exploring other parts of the parameter space with potentially greater rewards. In this work, we employ a Directional Gaussian Smoothing Evolution Strategy (DGS-ES) to accelerate RL training, which is well suited to address these two challenges with its ability to i) provide gradient estimates with high accuracy, and ii) find nonlocal search directions which emphasize large-scale variation of the reward function and disregard local fluctuations. Through several benchmark RL tasks demonstrated herein, we show that DGS-ES is highly scalable, possesses superior wall-clock time, and achieves competitive reward scores compared to other popular policy gradient and ES approaches.

1 INTRODUCTION

Reinforcement learning is a class of problems which aim to find, through trial and error, a feedback policy that prescribes how an agent should act in an uncertain, complex environment to maximize some notion of cumulative reward (Sutton & Barto, 1998). Traditionally, RL algorithms have mainly been employed for small input and action spaces, and suffered difficulties when scaling to high-dimensional problems. With the recent emergence of deep learning, powerful non-linear function approximators such as deep neural networks (DNN) can be integrated into RL and extend the capability of RL in a variety of challenging tasks which would otherwise be infeasible, ranging from playing Atari from pixels (Mnih et al., 2015, 2016), playing expert-level Go (Silver et al., 2016) to robotic control (Andrychowicz et al., 2017; Lillicrap et al., 2016).

Among the most popular current deep RL algorithms are Q-learning methods, policy gradient methods, and evolution strategies. Deep Q-learning algorithms (Mnih et al., 2015) use a DNN to approximate the optimal Q function, yielding policies that, for a given state, choose the action that maximizes the Q-value. Policy gradient methods (Sehnke et al., 2010) improve the policies with a gradient estimator obtained from sample trajectories in action space, examples of which are A3C (Mnih et al., 2016), TRPO (Schulman et al., 2015) and PPO (Schulman et al., 2017).

This work concerns RL techniques based on evolution strategies. ES refers to a family of blackbox optimization algorithms inspired by ideas of natural evolution, often used to optimize functions when gradient information is inaccessible. This is exactly the prominent challenge in a typical RL problem, where the environment and policy are usually nonsmooth or can only be accessed via noisy sampling. It is therefore not surprising that ES has become a convincing competitor to Q-learning and policy gradient methods in deep RL in recent years. Unlike policy gradient methods, ES perturbs and performs policy search directly in the parameter space to find an effective policy, which is now generally considered superior to action perturbation (Sigaud & Stulp, 2013). The policy search can be guided by a surrogate gradient (Salimans et al., 2017), be completely population-based and gradient-free (Such et al., 2017), or be hybridized with other exploration strategies such as novelty search and quality diversity (Conti et al., 2018). It has been shown in those works that ES is easy to parallelize and requires low communication overhead in a distributed setting. More importantly, while achieving competitive performance on many RL tasks, these methods are advantageous over other RL approaches in their high scalability and substantially lower wall-clock time. Given the wide availability of distributed computing resources, all environment simulations at each iteration of training can be conducted entirely in parallel. Thus, a more reasonable metric for the performance of a training algorithm is the number of non-parallelizable iterations, as opposed to the sample complexity. Under this metric, ES is truly an appealing choice.

Nevertheless, several challenges need to be addressed in order to further improve the performance of ES in training complex policies. First, most ES methods cope with the non-smoothness of the objective function by considering a Gaussian-smoothed version of the expected total reward. The gradient of this function is intractable and must be estimated to provide the policy parameter update. In the pioneering work (Salimans et al., 2017), a gradient estimator was proposed based on random parameter sampling. Developing efficient sampling strategies for gradient estimation has since become an active interest in ES research, and several improvements have been proposed, based on imposing structure on the parameter perturbations (Choromanski et al., 2018, 2019a) or reusing past evaluations (Choromanski et al., 2019b; Maheswaranathan et al., 2019; Meier et al., 2019). Yet, most of these gradient estimators are of Monte Carlo type and are therefore affected by the low accuracy of Monte Carlo methods. For faster convergence of training (i.e., reducing the number of iterations), more accurate gradient estimators are desired, particularly in RL tasks where the policy has a large number of parameters to learn. Another prominent challenge is that the landscape of the objective function is complex and possesses many local maxima. Any optimization algorithm risks getting trapped at some of those points and becoming unable to explore the parameter space effectively. Gaussian smoothing, with its ability to smooth out the objective function and damp out small, insignificant fluctuations, is a strong candidate for addressing this very challenge. Specifically, with a moderately large smoothing parameter (i.e., a strong smoothing effect), we can expect the gradient of the smoothed objective function to look past unimportant variation in the adjacent area and detect the general trend of the function from a distance, thereby providing an efficient nonlocal search direction. This potential of Gaussian smoothing, however, has not been explored in reinforcement learning.

In this paper, we propose a new strategy to accelerate the time-to-solution of reinforcement learning by exploiting the Directional Gaussian Smoothing Evolution Strategy (DGS-ES), recently developed in (Zhang et al., 2020). The DGS-ES method introduced a new directional Gaussian smoothing (DGS) gradient operator that smooths the original objective function only along d orthogonal directions in the parameter space. In other words, the DGS gradient requires d one-dimensional Gaussian convolutions, instead of one d-dimensional convolution as in existing ES methods. There are several advantages to using the DGS-gradient operator in reinforcement learning. First, each component of the DGS gradient, represented as a one-dimensional integral, can be accurately approximated with classic numerical integration techniques. Since the kernels are Gaussian, we use the Gauss-Hermite quadrature rule, which provides spectral accuracy (see Abramowitz & Stegun (1972)) in the DGS gradient approximation. Second, the use of Gauss-Hermite quadrature features the same embarrassing parallelism as the random sampling used in existing ES methods. Since the communication cost between computing processors/cores is negligible, the total computing time for each iteration of training does not increase with the number of environment simulations, given sufficient computing resources. Third, the directional smoothing approach enables nonlocal exploration, which takes into account large-scale variation of the objective function and disregards local fluctuations. This property greatly helps skip local optima or saddle points during training. It is demonstrated in §4 that the proposed strategy can significantly reduce the number of iterations needed to train several benchmark reinforcement learning problems, compared to state-of-the-art baselines.

1.1 RELATED WORKS

ES belongs to the family of blackbox optimization methods that employ random search techniques to optimize an objective function (Rechenberg & Eigen, 1973; Schwefel, 1977). The search can be guided either via covariance matrix adaptation (Hansen & Ostermeier, 2001; Hansen, 2016) or via search gradients fed to first-order optimization schemes (Wierstra et al., 2014). The DGS-ES method is in line with the second approach. Gaussian smoothing as a method for approximating search directions was introduced in (Flaxman et al., 2005; Nesterov & Spokoiny, 2015). In recent years, ES has been revived as a popular strategy in machine learning. In addition to those mentioned above, applications of ES to RL can also be found in Fazel et al. (2018); Ha & Schmidhuber (2018); Houthooft et al. (2018); Müller & Glasmachers (2018); Liu et al. (2019); Pourchot & Sigaud (2019). Applications of ES beyond RL include meta-learning (Metz et al., 2018), optimizing neural network architectures (Real et al., 2017; Miikkulainen et al., 2017; Cui et al., 2018), and direct training of neural networks (Morse & Stanley, 2016).

Effective exploration is also critical for deep RL on high-dimensional action and state spaces, and a wide variety of exploration strategies have been developed in recent years, to name a few, count-based exploration (Ostrovski et al., 2017; Tang et al., 2017), intrinsic motivation (Bellemare et al., 2016), curiosity (Pathak et al., 2017) and variational information maximization (Houthooft et al., 2016). For exploration techniques that add noise directly to the parameter space of the policy, see (Fortunato et al., 2017; Plappert et al., 2017). Exploration can also be combined with deep RL methods to improve sample efficiency (Colas et al., 2018). Finally, for a recent survey on policy search, we refer to (Sigaud & Stulp, 2019).

2 BACKGROUND

We introduce the background of reinforcement learning and discuss the application of ES methods in reinforcement learning.

2.1 REINFORCEMENT LEARNING

Reinforcement learning considers agents that are able to take a sequence of actions in an environment, denoted by \mathcal{E}, over a number of discrete time steps t\in\{1,\ldots,T\}. At each time step t, the agent receives a state \bm{s}_{t}\in\mathcal{S} and produces a follow-up action \bm{a}_{t}\in\mathcal{A}. The agent then observes the next state \bm{s}_{t+1} and a scalar reward r_{t}. The goal of the agent is to learn a policy \pi(\bm{a}_{t}|\bm{s}_{t}) that maximizes the objective function, i.e., the expected cumulative return of the form

J(\pi):=\sum_{t=0}^{T}\mathbb{E}_{(\bm{s}_{t},\bm{a}_{t})}\left[\gamma^{t}r(\bm{s}_{t},\bm{a}_{t})\right],  (1)

where 0\leq\gamma\leq 1 is a discount rate, \bm{a}_{t} is drawn from the policy \pi(\bm{a}_{t}|\bm{s}_{t}), and \bm{s}_{t+1}=\mathcal{E}(\bm{s}_{t},\bm{a}_{t}) is generated by running the environment dynamics.
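As a concrete illustration (ours, not part of the paper's implementation), the expected return in Eq. (1) can be estimated for a given policy by averaging the discounted return of sampled episodes. The sketch below assumes the classic Gym interface, where env.step returns a (state, reward, done, info) tuple; the env and policy objects are placeholders.

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """Discounted cumulative return sum_t gamma^t * r_t of one trajectory."""
    return float(sum(gamma**t * r for t, r in enumerate(rewards)))

def estimate_J(env, policy, episodes=5, max_steps=200, gamma=0.99):
    """Monte Carlo estimate of the expected return J(pi) in Eq. (1)."""
    returns = []
    for _ in range(episodes):
        state, rewards = env.reset(), []
        for _ in range(max_steps):
            action = policy(state)                      # a_t ~ pi(a_t | s_t)
            state, reward, done, _ = env.step(action)   # s_{t+1} = E(s_t, a_t)
            rewards.append(reward)
            if done:
                break
        returns.append(discounted_return(rewards, gamma))
    return float(np.mean(returns))
```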

In policy-based reinforcement learning approaches, the policy \pi(\bm{a}|\bm{s};\bm{\theta}) is parameterized by \bm{\theta}\in\mathbb{R}^{d}, where the vector \bm{\theta}:=(\theta_{1},\ldots,\theta_{d})^{\top} represents the parameters of the policy, e.g., the weights of a neural network. The task of learning a good policy \pi then becomes iteratively updating the parameter \bm{\theta} to solve the following optimization problem

\max_{\bm{\theta}\in\mathbb{R}^{d}}J(\bm{\theta}),  (2)

where we denote J(\bm{\theta}):=J(\pi(\bm{a}|\bm{s};\bm{\theta})) by an abuse of notation. In reinforcement learning, the gradient of the environment \mathcal{E} is usually inaccessible, so automatic differentiation cannot be used to obtain the gradient of J(\bm{\theta}). Thus, much of the innovation in reinforcement learning algorithms is focused on addressing the lack of access to, or the existence of, gradients of the environment/policy. In this work, we focus on the evolution strategy; other types of training algorithms for reinforcement learning are summarized in §1.

2.2 EVOLUTION STRATEGY

We briefly recall the evolution strategy methods, e.g., (Hansen & Ostermeier, 2001; Salimans et al., 2017), which use a multivariate Gaussian distribution to generate the population around the current parameter value \bm{\theta}_{n} at the n-th iteration. When the Gaussian distribution can be factorized into d independent one-dimensional Gaussian distributions, the standard ES method can be mathematically interpreted based on the Gaussian smoothing technique (Flaxman et al., 2005; Nesterov & Spokoiny, 2015). Specifically, a smoothed version of J(\bm{\theta}) in Eq. (1), denoted by J_{\bm{\sigma}}(\bm{\theta}), is defined by

J_{\bm{\sigma}}(\bm{\theta}) := \frac{1}{(2\pi)^{\frac{d}{2}}}\int_{\mathbb{R}^{d}}J(\bm{\theta}+\bm{\sigma}\circ\bm{u})\,{\rm e}^{-\frac{1}{2}\|\bm{u}\|^{2}_{2}}\,d\bm{u} = \mathbb{E}_{\bm{u}\sim\mathcal{N}(0,\mathbf{I}_{d})}\left[J(\bm{\theta}+\bm{\sigma}\circ\bm{u})\right],  (3)

where \mathcal{N}(0,\mathbf{I}_{d}) is the d-dimensional standard Gaussian distribution, the notation "\circ" represents the element-wise product, and the vector \bm{\sigma}=(\sigma_{1},\ldots,\sigma_{d}) controls the smoothing effect. It is well known that J_{\bm{\sigma}} is always differentiable, even if J is not. In addition, most characteristics of the original objective function J(\bm{\theta}), e.g., convexity and the Lipschitz constant, are inherited by J_{\bm{\sigma}}(\bm{\theta}). When J(\bm{\theta}) is differentiable, the difference \nabla J-\nabla J_{\bm{\sigma}} can be bounded in terms of the Lipschitz constant of J (see Lemma 3 in (Nesterov & Spokoiny, 2015) for details). Thus, the original optimization problem in Eq. (2) can be replaced by a smoothed version, i.e.,

\max_{\bm{\theta}\in\mathbb{R}^{d}}J_{\bm{\sigma}}(\bm{\theta}),  (4)

where the gradient of J_{\bm{\sigma}}(\bm{\theta}) is given by

\nabla J_{\bm{\sigma}}(\bm{\theta})=\frac{1}{\|\bm{\sigma}\|_{2}^{d/2}}\mathbb{E}_{\bm{u}\sim\mathcal{N}(0,\mathbf{I}_{d})}\left[J(\bm{\theta}+\bm{\sigma}\circ\bm{u})\,\bm{u}\right].  (5)

The standard ES method (Salimans et al., 2017) uses Monte Carlo sampling to estimate the gradient \nabla J_{\bm{\sigma}}(\bm{\theta}) and update the parameter \bm{\theta} from iteration n to n+1 by

\bm{\theta}_{n+1}=\bm{\theta}_{n}+\frac{\lambda}{M\sigma}\sum_{m=1}^{M}J(\bm{\theta}_{n}+\bm{\sigma}\circ\bm{u}_{m})\,\bm{u}_{m},  (6)

where \lambda is the learning rate and the \bm{u}_{m} are sampled from the Gaussian distribution \mathcal{N}(0,\mathbf{I}_{d}).
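For reference, a minimal NumPy sketch of the Monte Carlo estimator and the resulting update in Eq. (6) is given below (ours, not the authors' implementation); J is a placeholder for a (possibly noisy) evaluation of the expected return, e.g., one or more environment rollouts, and a scalar sigma stands in for a constant smoothing vector. The step is written as gradient ascent on the reward.

```python
import numpy as np

def es_update(J, theta, sigma=0.1, M=100, lr=0.01, rng=None):
    """One iteration of the vanilla ES update in Eq. (6), written as ascent on the reward.

    J : callable returning a (possibly noisy) evaluation of the expected return.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = theta.size
    u = rng.standard_normal((M, d))                            # u_m ~ N(0, I_d)
    scores = np.array([J(theta + sigma * u_m) for u_m in u])
    grad_est = (scores[:, None] * u).mean(axis=0) / sigma      # MC estimate of grad J_sigma
    return theta + lr * grad_est
```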

One drawback of the ES method and its variants is the slow convergence of the training process, due to the low accuracy of the MC-based gradient estimator (see (Berahas et al., 2019) and (Zhang et al., 2020) for extended discussions on the accuracy of gradient approximations using Eq. (6) and related methods). On the other hand, the evaluations of J(\bm{\theta}_{n}+\bm{\sigma}\circ\bm{u}_{m}) for m=1,\ldots,M at the n-th iteration can be generated totally in parallel, which makes the approach well suited to scaling up to a large number of parallel workers on modern supercomputers. Therefore, the motivation of this work is to develop a new gradient operator to replace the one in Eq. (5), such that the new gradient can be approximated in a much more accurate way while the embarrassing parallelism feature is retained.

3 THE DGS-ES METHOD

This section introduces our main framework. We start by introducing in §3.1 the DGS gradient operator and its approximation using the Gauss-Hermite quadrature rule. In §3.2, we describe how to incorporate the DGS gradient operator into the ES for reinforcement learning.

3.1 THE DGS GRADIENT OPERATOR

For a given direction \bm{\xi}\in\mathbb{R}^{d}, the restriction of the objective function J(\bm{\theta}) along \bm{\xi} can be represented by

G(y\,|\,\bm{\theta},\bm{\xi})=J(\bm{\theta}+y\,\bm{\xi}),\quad y\in\mathbb{R},  (7)

where \bm{\theta} is the current state of the agent's parameters. Then, we can define the one-dimensional Gaussian smoothing of G(y), denoted by G_{\sigma}(y), by

G_{\sigma}(y\,|\,\bm{\theta},\bm{\xi}):=\mathbb{E}_{v\sim\mathcal{N}(0,1)}\left[G(y+\sigma v\,|\,\bm{\theta},\bm{\xi})\right],  (8)

which is also the Gaussian smoothing of J(\bm{\theta}) along \bm{\xi} in the neighbourhood of \bm{\theta}. The derivative of G_{\sigma}(y\,|\,\bm{\theta},\bm{\xi}) at y=0 is given by

\mathscr{D}[G_{\sigma}(0\,|\,\bm{\theta},\bm{\xi})]=\frac{1}{\sigma}\,\mathbb{E}_{v\sim\mathcal{N}(0,1)}\left[G(\sigma v\,|\,\bm{\theta},\bm{\xi})\,v\right],  (9)

where \mathscr{D} denotes the differential operator. It is easy to see that \mathscr{D}[G_{\sigma}(0\,|\,\bm{\theta},\bm{\xi})] only involves the directionally smoothed objective function given in Eq. (8).

We can assemble a new gradient, i.e., the DGS gradient, by putting together the derivatives in Eq. (9) along d orthogonal directions, i.e.,

{\nabla}_{\bm{\sigma},\bm{\Xi}}[J](\bm{\theta})=\bm{\Xi}^{\top}\begin{bmatrix}{\mathscr{D}}\left[G_{\sigma_{1}}(0\,|\,\bm{\theta},\bm{\xi}_{1})\right]\\ \vdots\\ {\mathscr{D}}\left[G_{\sigma_{d}}(0\,|\,\bm{\theta},\bm{\xi}_{d})\right]\end{bmatrix},  (10)

where \bm{\Xi}:=(\bm{\xi}_{1},\ldots,\bm{\xi}_{d})^{\top} represents the matrix consisting of d orthonormal vectors. It is important to notice that

\nabla J_{\bm{\sigma}}(\bm{\theta})\neq{\nabla}_{\bm{\sigma},\bm{\Xi}}[J](\bm{\theta})

for any \bm{\sigma}>0, because of the directional Gaussian smoothing used in Eq. (10). However, there is consistency between the two quantities as \bm{\sigma}\rightarrow 0, i.e.,

\lim_{\bm{\sigma}\rightarrow 0}\left|\nabla J_{\bm{\sigma}}(\bm{\theta})-{\nabla}_{\bm{\sigma},\bm{\Xi}}[J](\bm{\theta})\right|=0,  (11)

for fixed \bm{\theta} and \bm{\Xi}. If \nabla J(\bm{\theta}) exists, then {\nabla}_{\bm{\sigma},\bm{\Xi}}[J](\bm{\theta}) also converges to \nabla J(\bm{\theta}) as \bm{\sigma}\rightarrow 0. This consistency naturally leads to the idea of replacing \nabla J_{\bm{\sigma}}(\bm{\theta}) with {\nabla}_{\bm{\sigma},\bm{\Xi}}[J](\bm{\theta}) in the ES framework.

3.2 THE DGS-ES ALGORITHM FOR REINFORCEMENT LEARNING

Since each component of {\nabla}_{\bm{\sigma},\bm{\Xi}}[J](\bm{\theta}) in Eq. (10) only involves a one-dimensional integral, we can use Gaussian quadrature rules (Quarteroni et al., 2007) to obtain spectral convergence. In the case of Gaussian smoothing, a natural choice is the Gauss-Hermite (GH) rule, which is used to approximate integrals of the form \int_{\mathbb{R}}g(x)\,{\rm e}^{-x^{2}}dx. By doing a simple change of variable in Eq. (9), the GH rule can be directly used to obtain the following estimator:

\widetilde{\mathscr{D}}^{M}[G_{\sigma}(0\,|\,\bm{\theta},\bm{\xi})]:=\frac{1}{\sqrt{\pi}\sigma}\sum_{m=1}^{M}w_{m}\,G(\sqrt{2}\sigma v_{m}\,|\,\bm{\theta},\bm{\xi})\,\sqrt{2}v_{m},  (12)

where w_{m} are the GH quadrature weights defined by

w_{m}=\frac{2^{M+1}M!\sqrt{\pi}}{[H_{M}^{\prime}(v_{m})]^{2}},\quad m=1,\ldots,M,  (13)

v_{m} are the roots of the Hermite polynomial of degree M,

H_{M}(v)=(-1)^{M}{\rm e}^{v^{2}}\frac{d^{M}}{dv^{M}}\big({\rm e}^{-v^{2}}\big),  (14)

and M is the number of function evaluations, i.e., environment simulations, needed to compute the quadrature in Eq. (12). The weights \{w_{m}\}_{m=1}^{M} and the roots \{v_{m}\}_{m=1}^{M} can be found in Abramowitz & Stegun (1972). The approximation error of the GH estimator \widetilde{\mathscr{D}}^{M}[G_{\sigma}] can be bounded by

\left|\widetilde{\mathscr{D}}^{M}[G_{\sigma}]-\mathscr{D}[G_{\sigma}]\right|\leq C_{0}\frac{M!\sqrt{\pi}}{2^{M}(2M)!}\,\sigma^{2M-1},  (15)

where M! is the factorial of M, and the constant C_{0}>0 is independent of M and \sigma.
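The GH nodes and weights in Eqs. (13) and (14) need not be tabulated by hand; NumPy, for instance, provides them directly. A small sanity check (ours, not part of the paper's implementation):

```python
import numpy as np

# Nodes v_m (roots of H_M) and weights w_m of the M-point GH rule, matching
# Eqs. (13)-(14); M = 7 is the value used in the experiments of Section 4.
M = 7
v, w = np.polynomial.hermite.hermgauss(M)

# Sanity checks: the rule integrates exactly against the weight e^{-v^2},
# so sum_m w_m = sqrt(pi) and sum_m w_m v_m^2 = sqrt(pi)/2.
assert np.isclose(w.sum(), np.sqrt(np.pi))
assert np.isclose(np.sum(w * v**2), np.sqrt(np.pi) / 2)
```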

Applying the GH quadrature rule \widetilde{\mathscr{D}}^{M} to each component of {\nabla}_{\bm{\sigma},\bm{\Xi}}[J](\bm{\theta}) in Eq. (10), we define the following estimator:

\widetilde{\nabla}^{M}_{\bm{\sigma},\bm{\Xi}}[J](\bm{\theta}):=\bm{\Xi}^{\top}\begin{bmatrix}\widetilde{\mathscr{D}}^{M}\left[G_{\sigma_{1}}(0\,|\,\bm{\theta},\bm{\xi}_{1})\right]\\ \vdots\\ \widetilde{\mathscr{D}}^{M}\left[G_{\sigma_{d}}(0\,|\,\bm{\theta},\bm{\xi}_{d})\right]\end{bmatrix},  (16)

which requires a total of M\times d parallelizable environment simulations at each iteration of training. The error bound in Eq. (15) indicates that \widetilde{\nabla}^{M}_{\bm{\sigma},\bm{\Xi}}[J](\bm{\theta}) is an accurate estimator of the DGS gradient even for a small M, regardless of the value of \bm{\sigma}. This enables the use of relatively large values of \bm{\sigma} in Eq. (16) to exploit the nonlocal features of the landscape of J(\bm{\theta}) during training. This nonlocal exploration ability of \widetilde{\nabla}^{M}_{\bm{\sigma},\bm{\Xi}}[J] is demonstrated in §4 to be effective in reducing the number of iterations needed to achieve a prescribed reward score.
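Putting Eqs. (12) and (16) together, a minimal NumPy sketch of the DGS-gradient estimator might look as follows (ours, under the notation above); J is again a placeholder for the expected-return evaluation, and in practice the M×d evaluations inside the loop are exactly the environment simulations distributed across parallel workers.

```python
import numpy as np

def dgs_gradient(J, theta, Xi, sigma, M=7):
    """Gauss-Hermite estimator of the DGS gradient, Eqs. (12) and (16).

    J     : callable returning the expected return at a parameter vector
    theta : current parameters, shape (d,)
    Xi    : orthonormal direction matrix with rows xi_1, ..., xi_d, shape (d, d)
    sigma : per-direction smoothing radii sigma_1, ..., sigma_d, shape (d,)
    """
    d = theta.size
    v, w = np.polynomial.hermite.hermgauss(M)       # GH nodes/weights, Eqs. (13)-(14)
    dG = np.zeros(d)
    for i in range(d):
        # Eq. (12): (1/(sqrt(pi)*sigma_i)) * sum_m w_m * J(theta + sqrt(2)*sigma_i*v_m*xi_i) * sqrt(2)*v_m
        vals = np.array([J(theta + np.sqrt(2.0) * sigma[i] * v_m * Xi[i]) for v_m in v])
        dG[i] = np.sum(w * vals * np.sqrt(2.0) * v) / (np.sqrt(np.pi) * sigma[i])
    return Xi.T @ dG                                # Eq. (16): assemble in the original coordinates
```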

On the other hand, as the quadrature weights w_{m} and nodes v_{m} defined in Eq. (13) and Eq. (14) are deterministic, the DGS estimator \widetilde{\nabla}^{M}_{\bm{\sigma},\bm{\Xi}}[J](\bm{\theta}) in Eq. (16) is also deterministic for fixed \bm{\Xi} and \bm{\sigma}. To introduce random exploration ability into our approach, we add random perturbations to both \bm{\Xi} and \bm{\sigma}. For the orthonormal matrix \bm{\Xi}, we add a small random rotation, denoted by \Delta\bm{\Xi}, to the current matrix \bm{\Xi}. The matrix \Delta\bm{\Xi} is generated as a random skew-symmetric matrix whose entries have magnitude smaller than a prescribed threshold \alpha>0. The perturbation of \bm{\sigma} is conducted by drawing random samples from a uniform distribution U(r-\beta,r+\beta) with two hyper-parameters r and \beta satisfying r-\beta>0. The random perturbation of \bm{\Xi} and \bm{\sigma} can be triggered by various types of indicators, e.g., the magnitude of the DGS gradient or the number of iterations since the last perturbation.
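A sketch of this random perturbation step (the helper names are ours; the re-orthonormalization is an optional addition, since \mathbf{I}_{d}+\Delta\bm{\Xi} is only approximately orthogonal):

```python
import numpy as np

def perturb_directions(d, alpha=2.0, rng=None):
    """Random skew-symmetric Delta_Xi with entry magnitudes below alpha,
    applied as Xi = I_d + Delta_Xi (Algorithm 1, line 17)."""
    rng = np.random.default_rng() if rng is None else rng
    A = rng.uniform(-alpha, alpha, size=(d, d))
    delta = (A - A.T) / 2.0                  # skew-symmetric, |entries| < alpha
    Xi = np.eye(d) + delta
    Xi, _ = np.linalg.qr(Xi)                 # optional re-orthonormalization (our addition)
    return Xi

def perturb_sigma(d, r=1.0, beta=0.2, rng=None):
    """Resample the smoothing radii sigma_i ~ U(r - beta, r + beta), with r - beta > 0."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.uniform(r - beta, r + beta, size=d)
```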

Algorithm 1: The DGS-ES for reinforcement learning
1:  Hyper-parameters: M: the order of the GH quadrature rule; \alpha: the scaling factor controlling the norm of \Delta\bm{\Xi}; r, \beta: the mean and perturbation scale for sampling \bm{\sigma}; \gamma: the tolerance for triggering random perturbation
2:  Input: \bm{\theta}_{0}: the initial parameter value; L: the number of parallel workers
3:  Output: the final parameter value \bm{\theta}_{N}
4:  Initialize the policy \pi with \bm{\theta}_{0}
5:  Set \bm{\Xi}=\mathbf{I}_{d} and \sigma_{i}=r for i=1,\ldots,d
6:  Broadcast L copies of \pi(\bm{a}|\bm{s};\bm{\theta}_{0}) to the L workers
7:  Divide the total GH quadrature points into L subsets, and send each subset to a worker
8:  for n=0,\ldots,N-1 do
9:     Each worker runs Md/L environment simulations at its assigned quadrature points
10:    Each worker sends its Md/L scores to the master
11:    for i=1,\ldots,d do
12:       Compute \widetilde{\mathscr{D}}^{M}[G_{\sigma_{i}}(0\,|\,\bm{\theta}_{n},\bm{\xi}_{i})] in Eq. (12)
13:    end for
14:    Assemble \widetilde{\nabla}^{M}_{\bm{\sigma},\bm{\Xi}}[J](\bm{\theta}_{n}) in Eq. (16)
15:    Update \bm{\theta}_{n} to \bm{\theta}_{n+1} using Adam
16:    if \|\widetilde{\nabla}^{M}_{\bm{\sigma},\bm{\Xi}}[J](\bm{\theta}_{n})\|_{2}<\gamma then
17:       Generate \Delta\bm{\Xi} and update \bm{\Xi}=\mathbf{I}_{d}+\Delta\bm{\Xi}
18:       Generate \bm{\sigma} from U(r-\beta,r+\beta)
19:    end if
20:    Broadcast \bm{\theta}_{n+1} to the L workers
21:    Each worker updates its policy to \pi(\bm{a}|\bm{s};\bm{\theta}_{n+1})
22: end for

4 EXPERIMENTS

To evaluate the DGS-ES algorithm, we tested its performance on two classes of reinforcement learning environments: three classical control theory problems from OpenAI Gym (https://github.com/openai/gym) (Brockman et al., 2016) and three continuous control tasks simulated using PyBullet 2.6.5 (https://pybullet.org/) (Coumans & Bai, 2016), an open-source library. Within OpenAI Gym, we demonstrate the proposed approach on three benchmark examples: CartPole-v0 (discrete), MountainCarContinuous-v0 (continuous) and Pendulum-v0 (continuous). The maximum numbers of time steps for these examples are 200, 999 and 200, respectively. More details about the environment and reward settings can be found in Brockman et al. (2016). We also examined the DGS-ES algorithm on the challenging continuous robotic control problems in the PyBullet library, namely HopperBulletEnv-v0, InvertedPendulumBulletEnv-v0 and ReacherBulletEnv-v0. In these three tasks, the maximum numbers of time steps are 1000, 1000 and 150, respectively. For the purpose of reproducible comparison, we employed the original environment settings from OpenAI Gym and the PyBullet library without modifying the rewards or the environments.

For our implementation of DGS-ES, we defined our policies as a two-layer feed-forward neural network with 16 hidden nodes and tanh activation functions. For gradient-based optimization, we used Adam to adaptively update the network parameters with a learning rate of \ell_{r}=0.1. We chose the hyper-parameters used in Algorithm 1 as follows: M=7, \alpha=2.0, r=1.0, \beta=0.2, \gamma=0.01. In practice, one can tune the critical hyper-parameters (M, r and \ell_{r}) within the following suggested ranges: M\in[7,9], r\in[0.5,1.0] and \ell_{r}\in[0.01,0.1]. For each task, our results were obtained over 5 repeated independent trials (different random seeds) of the Gym/PyBullet simulators and the network policy initialization.
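For concreteness, a policy of the kind described above might be defined in PyTorch as follows; this is a sketch that assumes a single hidden layer of 16 tanh units (one reading of the "two-layer" network above), with helpers for mapping the flat parameter vector \bm{\theta} used by DGS-ES to the network weights.

```python
import torch
import torch.nn as nn

class Policy(nn.Module):
    """Feed-forward policy with one hidden layer of 16 tanh units."""
    def __init__(self, obs_dim, act_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 16), nn.Tanh(),
            nn.Linear(16, act_dim),
        )

    def forward(self, obs):
        return self.net(obs)

    # Map between the flat parameter vector theta used by DGS-ES and the weights.
    def get_flat_params(self):
        return torch.cat([p.data.view(-1) for p in self.parameters()])

    def set_flat_params(self, theta):
        idx = 0
        for p in self.parameters():
            n = p.numel()
            p.data.copy_(theta[idx:idx + n].view_as(p))
            idx += n
```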

The DGS-ES algorithm is particularly amenable to parallelization since it only needs to communicate scalars, allowing it to scale over a large number of parallel workers. We implemented a distributed version of Algorithm 1 for the reinforcement learning tasks. The distributed DGS-ES is implemented using PyTorch (Paszke et al., 2017) combined with Ray (Moritz et al., 2018) (https://github.com/ray-project/ray), which does not rely on any special networking setup and was tested on large-scale high performance computing facilities with thousands of computing nodes/workers.
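A minimal Ray sketch of the master/worker pattern used here (function names and the toy objective are ours; in the actual implementation each worker would run environment rollouts and return the corresponding scores):

```python
import numpy as np
import ray

@ray.remote
def evaluate_batch(thetas):
    """Evaluate a batch of perturbed parameter vectors and return one score each.
    A toy quadratic objective stands in for the environment rollout here."""
    return [float(-np.sum(theta**2)) for theta in thetas]

def parallel_scores(perturbed_thetas, num_workers=4):
    """Split the M*d perturbed parameter vectors across workers; only scalars come back."""
    chunks = np.array_split(perturbed_thetas, num_workers)
    futures = [evaluate_batch.remote(chunk) for chunk in chunks]
    return [s for batch in ray.get(futures) for s in batch]

if __name__ == "__main__":
    ray.init()                              # on a cluster, pass the head node's address
    d, M = 10, 7
    thetas = np.random.randn(M * d, d)      # stand-in for the shifted quadrature points
    print(len(parallel_scores(thetas)), "scores collected")
```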

Comparison metric: As the motivation of this work is to accelerate the time-to-solution of reinforcement learning training under the assumption that sufficient distributed computing resources are available, we used a different metric to evaluate the performance of DGS-ES and the baselines. Specifically, we are interested in the average return \mathbb{E}[J] versus the number of iterations, i.e., N in Algorithm 1, because those iterations cannot be parallelized.

4.1 BASELINE METHODS

We compared Algorithm 1 against several RL baselines, including ES, PPO and TRPO, as well as the state-of-the-art algorithms such as ASEBO, DDPG, and TD3. Below is the information of the packages used in this effort.

  • ES: The Evolution Strategy proposed in Salimans et al. (2017). We used the implementation of ES from the open-source code https://github.com/hardmaru/estool.

  • ASEBO: Adaptive ES-Active Subspaces for Blackbox Optimization, which was recently developed by Choromanski et al. (2019a). We used the implementation released by the authors at https://github.com/jparkerholder/ASEBO.

  • PPO: Proximal Policy Optimization in Schulman et al. (2017), which is available in OpenAI’s baselines repository at https://github.com/openai/baselines (Dhariwal et al., 2017).

  • TRPO: Trust Region Policy Optimization, developed by Schulman et al. (2015). We also used the OpenAI’s baselines implementation (Dhariwal et al., 2017).

  • DDPG: Deep Deterministic Policy Gradient, proposed by Lillicrap et al. (2016). We used the implementation from https://github.com/georgesung/TD3 where the benchmark DDPG in PyBullet is provided.

  • TD3: Twin Delayed Deep Deterministic Policy Gradient (Fujimoto et al., 2018), which was built upon the DDPG. The original results were reported for the MuJoCo version environments using the implementation from https://github.com/sfujim/TD3, but we used the PyBullet implementation from https://github.com/georgesung/TD3.

The hyper-parameters for all algorithms above were set to match the original papers without further tuning to improve performance on the testing benchmark examples.

4.2 COMPARATIVE EVALUATION

Figure 1 shows the comparison results on the CartPole, Pendulum and MountainCar problems from OpenAI Gym. We compared DGS-ES with the classical ES and the improved ASEBO method. In general, the DGS-ES method features faster convergence than the baselines. For the simplest CartPole problem, the three methods perform equally well. A discrepancy appears in the Pendulum test, where the DGS-ES method not only converges faster than the baselines but also achieves a higher average return. An even bigger discrepancy between DGS-ES and the baselines appears in the MountainCar test. According to the guideline provided in the OpenAI Gym, the success threshold is an average return of 90. The DGS-ES method achieves this threshold within 500 iterations, while the average returns of the ES and ASEBO methods remain around zero. It is well known that the challenge of this problem is that the surface of the objective function J(\bm{\theta}) is very flat at most locations in the parameter space, which makes it difficult to capture the peak of J(\bm{\theta}). This test is a good demonstration of the nonlocal exploration ability of the DGS-ES method. Since the mean of the smoothing factor \bm{\sigma} is set to 1.0, DGS-ES can capture the peak of J(\bm{\theta}) much faster than the baselines. In fact, it took around 50 iterations for DGS-ES to find the peak region, while ES and ASEBO needed more than 3000 iterations (not plotted) to move out of the flat region.

Figure 1: Comparison between the DGS-ES and two baselines, i.e., ES and ASEBO, for solving the three problems from OpenAI Gym. The colored curves are the average return over 5 repeated runs with different random seeds, and the corresponding shade represents the interval between the maximum and minimum return.

Figure 2: Comparison between the DGS-ES method and the baselines, i.e., ES, ASEBO, PPO, TRPO, DDPG and TD3, for solving the three problems from the PyBullet library. The colored curves are the average return over 5 runs with different random seeds, and the corresponding shade represents the interval between the maximum and minimum return.

Figure 2 shows the comparison results on the Hopper-v0, InvertedPendulum-v0 and Reacher-v0 problems from the PyBullet library. We compared DGS-ES with six baselines: ES, ASEBO, PPO, TRPO, DDPG and TD3. As expected, the DGS-ES method shows better performance in terms of convergence speed. For the Hopper-v0 problem, DGS-ES achieves the highest return of all the methods within 400 iterations. It is worth pointing out that some of the baselines would eventually catch up with DGS-ES given a sufficiently large number of iterations. For example, DDPG and TD3 do not provide much improvement within 400 iterations, but DDPG could reach a 1650 average return with over 3000 iterations, and TD3 could reach even higher, around a 2200 average return, with over 6000 iterations, according to the baselines provided by OpenAI (Dhariwal et al., 2017) and Fujimoto et al. (2018). This phenomenon illustrates the fast-convergence feature of DGS-ES. For the InvertedPendulum-v0 problem, DGS-ES achieves the maximum return of 1000 (the default value in the PyBullet library) in around 30 iterations. In comparison, ES and ASEBO can reach the maximum return but require more iterations than DGS-ES. According to the benchmark results for PyBullet environments in Fujimoto et al. (2018), DDPG and TD3 cannot converge to the maximum return even with a large number of iterations. For the Reacher-v0 problem, DGS-ES and ASEBO are still the top performers, and the advantage of DGS-ES is, again, faster convergence.

Figure 3 illustrates the effect of the radius \bm{\sigma} of \widetilde{\nabla}^{M}_{\bm{\sigma},\bm{\Xi}}[J](\bm{\theta}) on the performance of the DGS-ES method in solving the Reacher-v0 problem. All the simulations were done using the same initialization. We set the mean of \bm{\sigma}, i.e., the hyper-parameter r in Algorithm 1, to 0.5, 0.05, 0.01, and 0.005. It is easy to see that the performance of DGS-ES deteriorates as \bm{\sigma} decreases. Since the surface of J(\bm{\theta}) is evidently not convex and may have many local maxima, a relatively large radius \bm{\sigma} is necessary to help DGS-ES skip the local maxima. As \bm{\sigma} becomes smaller, the DGS gradient \widetilde{\nabla}^{M}_{\bm{\sigma},\bm{\Xi}}[J](\bm{\theta}) converges to the local gradient \nabla J(\bm{\theta}), which may get the optimizer trapped in a local maximum.

Figure 3: Illustration of the effect of the radius \bm{\sigma} of \widetilde{\nabla}^{M}_{\bm{\sigma},\bm{\Xi}}[J](\bm{\theta}) on the performance of the DGS-ES method in solving the Reacher-v0 problem.

5 CONCLUSION

Despite the successful demonstrations shown in §4, there are several limitations of the DGS-ES method for reinforcement learning. First, it requires a sufficiently powerful cluster or distributed computing resources to show superior performance. Even though more and more parallel environments for complex RL tasks are becoming available, those parallel codes might not be compatible with a cluster/supercomputer of a specific architecture. Extra effort may therefore be needed to modify environment codes in order to exploit the advantage of the DGS-ES approach. Second, asynchronization between different environment simulations may drag down the overall performance. In a distributed computing system, all parallel workers receive the same number of environment simulations. However, as different parameter values may lead to different termination times of the environment simulations, there is a potential waste of computing resources due to such asynchronization. Thus, a better scheduling algorithm is needed to further improve the performance of DGS-ES in RL. Third, even though the performance of DGS-ES is not very sensitive to the hyper-parameters, especially the radius \bm{\sigma}, in the experiments conducted in this work, optimal or even viable hyper-parameters of the DGS-ES method are still problem dependent, which means hyper-parameter tuning may be needed when applying the method to another RL problem.

Acknowledgement

This material is based upon work supported by the U.S. Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under contract and award numbers ERKJ352, by the DOE SciDAC FastMath project, and by the Artificial Intelligence Initiative at the Oak Ridge National Laboratory (ORNL). ORNL is operated by UT-Battelle, LLC, for the U.S. Department of Energy under Contract DE-AC05-00OR22725.

References

  • Abramowitz & Stegun (1972) Abramowitz, M. and Stegun, I. (eds.). Handbook of Mathematical Functions. Dover, New York, 1972.
  • Andrychowicz et al. (2017) Andrychowicz, M., Wolski, F., Ray, A., Schneider, J., Fong, R., Welinder, P., McGrew, B., Tobin, J., Abbeel, P., and Zaremba, W. Hindsight experience replay. In Advances in Neural Information Processing Systems, pp. 5048–5058, 2017.
  • Bellemare et al. (2016) Bellemare, M., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. Unifying count-based exploration and intrinsic motivation. Advances in Neural Information Processing Systems, pp. 1471–1479, 2016.
  • Berahas et al. (2019) Berahas, A. S., Cao, L., Choromanskiv, K., and Scheinberg, K. A theoretical and empirical comparison of gradient approximations in derivative-free optimization. arXiv:1905.01332, 2019.
  • Brockman et al. (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Choromanski et al. (2018) Choromanski, K., Rowland, M., Sindhwani, V., Turner, R. E., and Weller, A. Structured evolution with compact architectures for scalable policy optimization. International Conference on Machine Learning, pp.  969–977, 2018.
  • Choromanski et al. (2019a) Choromanski, K., Pacchiano, A., Parker-Holder, J., and Tang, Y. From complexity to simplicity: Adaptive ES-active subspaces for blackbox optimization. NeurIPS, 2019a.
  • Choromanski et al. (2019b) Choromanski, K., Pacchiano, A., Parker-Holder, J., and Tang, Y. Provably robust blackbox optimization for reinforcement learning. arXiv:1903.02993, 2019b.
  • Colas et al. (2018) Colas, C., Sigaud, O., and Oudeyer, P. Y. GEP-PG: Decoupling exploration and exploitation in deep reinforcement learning algorithms. arXiv preprint arXiv: 1802.05054., 2018.
  • Conti et al. (2018) Conti, E., Madhavan, V., Such, F. P., Lehman, J., Stanley, K. O., and Clune, J. Improving exploration in evolution strategies for deep reinforcement learning via a population of novelty-seeking agents. NIPS, 2018.
  • Coumans & Bai (2016) Coumans, E. and Bai, Y. Pybullet, a python module for physics simulation for games, robotics and machine learning. GitHub repository, 2016.
  • Cui et al. (2018) Cui, X., Zhang, W., Tüske, Z., and Picheny, M. Evolutionary stochastic gradient descent for optimization of deep neural networks. NeurIPS, 2018.
  • Dhariwal et al. (2017) Dhariwal, P., Hesse, C., Klimov, O., Nichol, A., Plappert, M., Radford, A., Schulman, J., Sidor, S., Wu, Y., and Zhokhov, P. Openai baselines. https://github.com/openai/baselines, 2017.
  • Fazel et al. (2018) Fazel, M., Ge, R., Kakade, S. M., and Mesbahi, M. Global convergence of policy gradient methods for the linear quadratic regulator. Proceedings of the 35th International Conference on Machine Learning, 80:1467–1476, 2018.
  • Flaxman et al. (2005) Flaxman, A. D., Kalai, A. T., and McMahan, H. B. Online convex optimization in the bandit setting: gradient descent without a gradient. Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 385–394, 2005.
  • Fortunato et al. (2017) Fortunato, M., Azar, M. G., Piot, B., Menick, J., Osband, I., Graves, A., Mnih, V., Munos, R., Hassabis, D., Pietquin, O., et al. Noisy networks for exploration. arXiv preprint arXiv:1706.10295, 2017.
  • Fujimoto et al. (2018) Fujimoto, S., Van Hoof, H., and Meger, D. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477, 2018.
  • Ha & Schmidhuber (2018) Ha, D. and Schmidhuber, J. Recurrent world models facilitate policy evolution. Advances in Neural Information Processing Systems, pp. 2450–2462, 2018.
  • Hansen (2016) Hansen, N. The cma evolution strategy: A tutorial. arXiv preprint arXiv:1604.00772, 2016.
  • Hansen & Ostermeier (2001) Hansen, N. and Ostermeier, A. Completely derandomized self-adaptation in evolution strategies. Evolutionary computation, 9(2):159–195, 2001.
  • Houthooft et al. (2016) Houthooft, R., Chen, X., Duan, Y., Schulman, J., Turck, F. D., and Abbeel, P. Vime: Variational information maximizing exploration. Advances in Neural Information Processing Systems, pp. 1109–1117, 2016.
  • Houthooft et al. (2018) Houthooft, R., Chen, Y., Isola, P., Stadie, B., Wolski, F., Ho, O. J., and Abbeel, P. Evolved policy gradients. Advances in Neural Information Processing Systems, pp. 5400–5409, 2018.
  • Lillicrap et al. (2016) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D., and Wierstra, D. Continuous control with deep reinforcement learning. ICLR, 2016.
  • Liu et al. (2019) Liu, G., Zhao, L., Yang, F., Bian, J., Qin, T., Yu, N., and Liu, T.-Y. Trust region evolution strategies. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp.  4352–4359, 2019.
  • Maheswaranathan et al. (2019) Maheswaranathan, N., Metz, L., Tucker, G., Choi, D., and Sohl-Dickstein, J. Guided evolutionary strategies: Augmenting random search with surrogate gradients. Proceedings of the 36th International Conference on Machine Learning, 2019.
  • Meier et al. (2019) Meier, F., Mujika, A., Gauy, M. M., and Steger, A. Improving gradient estimation in evolutionary strategies with past descent directions. Optimization Foundations for Reinforcement Learning Workshop at NeurIPS, 2019.
  • Metz et al. (2018) Metz, L., Maheswaranathan, N., Nixon, J., Freeman, D., and Sohl-dickstein, J. Learned optimizers that outperform sgd on wall-clock and test loss. 2nd Workshop on Meta-Learning at NeurIPS, 2018.
  • Miikkulainen et al. (2017) Miikkulainen, R., Liang, J., Meyerson, E., Rawal, A., Fink, D., Francon, O., Raju, B., Shahrzad, H., Navruzyan, A., Nuffy, N., and Hodjat, B. Evolving deep neural networks. arXiv preprint arXiv:1703.00548, 2017.
  • Mnih et al. (2015) Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A. K., Ostrovski, G., and et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Mnih et al. (2016) Mnih, V., Badia, A. P., Mirza, M., Graves, A., Lillicrap, T., Harley, T., Silver, D., and Kavukcuoglu, K. Asynchronous methods for deep reinforcement learning. ICML, pp.  1928–1937, 2016.
  • Moritz et al. (2018) Moritz, P., Nishihara, R., Wang, S., Tumanov, A., Liaw, R., Liang, E., Elibol, M., Yang, Z., Paul, W., Jordan, M. I., et al. Ray: A distributed framework for emerging AI applications. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pp. 561–577, 2018.
  • Morse & Stanley (2016) Morse, G. and Stanley, K. O. Simple evolutionary optimization can rival stochastic gradient descent in neural networks. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2016), pp.  477–484, 2016.
  • Müller & Glasmachers (2018) Müller, N. and Glasmachers, T. Challenges in high-dimensional reinforcement learning with evolution strategies. In International Conference on Parallel Problem Solving from Nature, pp.  411–423. Springer, 2018.
  • Nesterov & Spokoiny (2015) Nesterov, Y. and Spokoiny, V. Random gradient-free minimization of convex functions. Foundations of Computational Mathematics, pp.  1–40, 2015.
  • Ostrovski et al. (2017) Ostrovski, G., Bellemare, M. G., v. d. Oord, A., and Munos, R. Count-based exploration with neural density models. arXiv preprint arXiv:1703.01310, 2017.
  • Paszke et al. (2017) Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in pytorch. 2017.
  • Pathak et al. (2017) Pathak, D., Agrawal, P., Efros, A. A., and Darrell, T. Curiosity-driven exploration by self-supervised prediction. International Conference on Machine Learning (ICML), 2017.
  • Plappert et al. (2017) Plappert, M., Houthooft, R., Dhariwal, P., Sidor, S., Chen, R. Y., Chen, X., Asfour, T., Abbeel, P., and Andrychowicz, M. Parameter space noise for exploration. arXiv preprint arXiv:1706.01905, 2017.
  • Pourchot & Sigaud (2019) Pourchot, A. and Sigaud, O. Cem-rl: Combining evolutionary and gradient-based methods for policy search. ICLR, 2019.
  • Quarteroni et al. (2007) Quarteroni, A., Sacco, R., and Saleri, F. Numerical Mathematics, volume 37. Springer Science & Business Media, 2007.
  • Real et al. (2017) Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y. L., Tan, J., Le, Q. V., and Kurakin, A. Large-scale evolution of image classifiers. International Conference on Machine Learning (ICML), pp. 2902–2911, 2017.
  • Rechenberg & Eigen (1973) Rechenberg, I. and Eigen, M. Evolutionsstrategie: Optimierung Technischer Systeme nach Prinzipien der Biologischen Evolution. Frommann-Holzboog Stuttgart, 1973.
  • Salimans et al. (2017) Salimans, T., Ho, J., Chen, X., Sidor, S., and Sutskever, I. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864, 2017.
  • Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M. I., and Moritz, P. Trust region policy optimization. ICML, pp.  1889–1897, 2015.
  • Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Schwefel (1977) Schwefel, H.-P. Numerische optimierung von computer-modellen mittels der evolutionsstrategie. 1977.
  • Sehnke et al. (2010) Sehnke, F., Osendorfer, C., Ruckstieb, T., Graves, A., Peters, J., and Schmidhuber, J. Parameter-exploring policy gradients. Neural Networks, 2010.
  • Sigaud & Stulp (2013) Sigaud, O. and Stulp, F. Robot skill learning: From reinforcement learning to evolution strategies. Paladyn Journal of Behavioral Robotics, 4(1):49–61, 2013.
  • Sigaud & Stulp (2019) Sigaud, O. and Stulp, F. Policy search in continuous action domains: An overview. Neural Networks, 113:28–40, 2019.
  • Silver et al. (2016) Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, 2016.
  • Such et al. (2017) Such, F. P., Madhavan, V., Conti, E., Lehman, J., Stanley, K. O., and Clune, J. Deep neuroevolution: Genetic algorithms are a competitive alternative for training deep neural networks for reinforcement learning. arXiv preprint arXiv:1712.06567, 2017.
  • Sutton & Barto (1998) Sutton, R. S. and Barto, A. G. Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA, 1998.
  • Tang et al. (2017) Tang, H., Houthooft, R., Foote, D., Stooke, A., Chen, O. X., Duan, Y., Schulman, J., DeTurck, F., and Abbeel, P. #Exploration: A study of count-based exploration for deep reinforcement learning. Advances in Neural Information Processing Systems, pp. 2750–2759, 2017.
  • Wierstra et al. (2014) Wierstra, D., Schaul, T., Glasmachers, T., Sun, Y., Peters, J., and Schmidhuber, J. Natural evolution strategies. Journal of Machine Learning Research, 15:949–980, 2014.
  • Zhang et al. (2020) Zhang, J., Tran, H., Lu, D., and Zhang, G. A Scalable Evolution Strategy with Directional Gaussian Smoothing for Blackbox Optimization. arXiv e-prints, art. arXiv:2002.03001, Feb 2020.