
Consolidated Adaptive T-soft Update for Deep Reinforcement Learning

Taisuke Kobayashi¹ (¹Taisuke Kobayashi is with the Division of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan; [email protected])
Abstract

Demand for deep reinforcement learning (DRL) is gradually increasing to enable robots to perform complex tasks, while DRL is known to be unstable. As a technique to stabilize its learning, a target network that slowly and asymptotically matches a main network is widely employed to generate stable pseudo-supervised signals. Recently, T-soft update has been proposed as a noise-robust update rule for the target network and has contributed to improving DRL performance. However, the noise robustness of T-soft update is specified by a hyperparameter, which should be tuned for each task, and is deteriorated by its simplified implementation. This study develops adaptive T-soft (AT-soft) update by utilizing the update rule in the recently developed AdaTerm optimizer. In addition, the concern that the target network may not asymptotically match the main network is mitigated by a new consolidation that brings the main network back toward the target network. This so-called consolidated AT-soft (CAT-soft) update is verified through numerical simulations.

Index Terms:
Reinforcement Learning, Machine Learning for Robot Control, Deep Learning Methods

I Introduction

As robotic and machine learning technologies develop remarkably, the tasks required of intelligent robots become more complex: e.g. physical human-robot interaction [1, 2]; work on disaster sites [3, 4]; and manipulation of various objects [5, 6]. In most cases, these complex tasks have no accurate analytical model. To resolve this difficulty, deep reinforcement learning (DRL) [7] has received a lot of attention as an alternative to classic model-based control. Using deep neural networks (DNNs) as nonlinear function approximators, DRL can learn a complicated policy and/or value function in a model-free manner [8, 9], or learn a complicated world model for planning the optimal policy in a model-based manner [10, 11].

Because DNNs are nonlinear and, in particular, model-free DRL must generate pseudo-supervised signals by itself, DRL tends to be unstable. Techniques to stabilize learning have been actively proposed, such as the design of regularization [5, 9, 12] and the introduction of models that make learning conservative [13, 14]. Among them, the target network is one of the current standard techniques in DRL [8, 15]. After generating it as a copy of the main network to be learned by DRL, an update rule is given to make it slowly match the main network, either at regular intervals or asymptotically. In this case, the pseudo-supervised signals generated from the target network are more stable than those generated from the main network, which greatly contributes to the overall stability of DRL.

The challenge in using the target network is its update rule. It has been reported that too slow an update stagnates the whole learning process [16], while too fast an update reverts to instability of the pseudo-supervised signals. A new update rule, T-soft update [15], has been proposed to mitigate the latter problem. This method provides a mechanism to limit the amount of update when the main network deviates unnaturally from the target network, which can be regarded as noise. Such noise robustness makes it possible to stabilize the whole learning process even with a high update rate by appropriately ignoring the unstable behaviors of the main network.

However, the noise robustness of T-soft update is specified as a hyperparameter, which must be set to an appropriate value depending on the task to be solved. In addition, the simplified implementation for detecting noise deteriorates the noise robustness; a more sophisticated implementation with adaptive noise robustness is desired. As another concern, when the update of the target network is restricted as T-soft update does, the target network may not asymptotically match the main network. A new constraint is needed to avoid the situation where the main network deviates from the target network.

Hence, this paper proposes two methods to resolve each of the above two issues: i) an adaptive and sophisticated implementation of T-soft update; and ii) an appropriate consolidation of the main network toward the target network. Specifically, for issue i), a new update rule, so-called adaptive T-soft (AT-soft) update, is developed based on the recently proposed AdaTerm [17] formulation, which is an adaptively noise-robust stochastic gradient descent method. This allows us to sophisticate the simplified implementation of T-soft update and improve the noise robustness, which becomes adaptive to the input patterns. For issue ii), a new consolidation is designed so that the main network is regularized toward the target network when AT-soft update restricts the updates of the target network. By implementing it with interpolation, the parameters in the main network that deviate significantly from those of the target network are pulled back to a larger extent. With this consolidation, the proposed method is so-called consolidated AT-soft (CAT-soft) update.

To verify CAT-soft update, typical benchmarks implemented in Pybullet [18] are tried using the latest DRL algorithms [12, 19]. It is shown that, even though the learning rate is larger than the standard value for DRL, the task performance is improved by CAT-soft update and more stable learning can be achieved. In addition, the developed consolidation successfully suppresses the divergence between the main and target networks.

II Preliminaries

II-A Reinforcement learning

First of all, the basic problem statement of DRL is introduced together with an actor-critic algorithm, which can handle continuous action spaces and is a natural choice [7] as one of the basic algorithms for robot control. Note that the proposed method can be applied to other algorithms that use a target network.

In DRL, an agent interacts with an unknown environment under a Markov decision process (MDP) with the current state $s$, the agent's action $a$, the next state $s'$, and the reward from the environment $r$. Specifically, the environment implicitly has its initial randomness $p_0(s)$ and its state transition probability $p_e(s' \mid s, a)$. Since the agent can act on the environment's state transition through $a$, the goal is to find the optimal policy to reach the desired state. To this end, $a$ is sampled from a state-dependent trainable policy, $\pi(a \mid s; \theta_\pi)$, with its parameter set $\theta_\pi$ (a.k.a. weights and biases of DNNs in DRL). The outcome of the interaction between the agent and the environment can be evaluated as $r = r(s, a, s')$.

By repeating the above process, the agent gains the sum of $r$ over the future (the so-called return), $R = \sum_{k=0}^{\infty} \gamma^k r_k$, with $\gamma \in [0, 1)$ the discount factor. The main purpose of DRL is to maximize $R$ by optimizing $\pi$ (i.e. $\theta_\pi$). However, $R$ cannot be gained directly due to its future information, hence its expected value is inferred as a trainable (state) value function, $V(s; \theta_V) = \mathbb{E}[R \mid s]$, with its parameter set $\theta_V$. Finally, DRL optimizes $\pi$ to maximize $V$ while increasing the accuracy of $V$.
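As a minimal illustration of the return defined above (not code from the paper), the following Python snippet computes the discounted return for a finite reward sequence, which truncates the infinite sum at the episode length; the reward values and $\gamma$ are arbitrary examples.

```python
# Minimal sketch (not from the paper): discounted return R = sum_k gamma^k r_k
# for a finite episode, accumulated backwards as R_t = r_t + gamma * R_{t+1}.
def discounted_return(rewards, gamma=0.99):
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

print(discounted_return([1.0, 1.0, 1.0], gamma=0.99))  # 1 + 0.99 + 0.99**2 = 2.9701
```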

To learn $V$, a temporal difference (TD) error method is widely used as follows:

\mathcal{L}_V(\theta_V) = \frac{1}{2}\left(y - V(s; \theta_V)\right)^2 \qquad (1)
y = r + \gamma V(s'; \bar{\theta}_V) \qquad (2)

where $y$ denotes the pseudo-supervised signal generated from the target network with the parameter set $\bar{\theta}_V$ (see later). By minimizing $\mathcal{L}_V$, $\theta_V$ can be optimized to correctly infer the value over $s$.

To learn $\pi$, a policy-gradient method is applied as follows:

\mathcal{L}_\pi(\theta_\pi) = -\left(y - V(s; \theta_V)\right) \frac{\pi(a \mid s; \theta_\pi)}{b(a \mid s; \bar{\theta}_\pi)} \qquad (3)

where $a$ is sampled from the alternative policy $b$, which is often given by the target network with the parameter set $\bar{\theta}_\pi$. The sampler change is allowed by importance sampling, and the likelihood ratio is introduced as in the above loss function. By minimizing $\mathcal{L}_\pi$, $\theta_\pi$ can be optimized to reach the states with higher value.
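To make eqs. (1)–(3) concrete, the sketch below computes the pseudo-supervised signal $y$ and both losses for a single transition. The callables `v_main`, `v_target`, `pi_prob`, and `b_prob` are hypothetical stand-ins for the main/target value networks and the current/behavior policies; they are not part of the paper's implementation.

```python
# Hedged sketch of eqs. (1)-(3) for one transition (s, a, r, s').
# v_main, v_target, pi_prob, b_prob are hypothetical callables standing in for
# the main/target value networks and the current/behavior (target) policies.
def td_and_policy_losses(s, a, r, s_next, gamma, v_main, v_target, pi_prob, b_prob):
    y = r + gamma * v_target(s_next)       # eq. (2): target from the target network
    td_error = y - v_main(s)               # TD error (advantage estimate)
    loss_v = 0.5 * td_error ** 2           # eq. (1): value loss
    ratio = pi_prob(a, s) / b_prob(a, s)   # importance weight pi / b
    loss_pi = -td_error * ratio            # eq. (3): policy loss (td_error held constant)
    return loss_v, loss_pi
```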

II-B Target network with T-soft update

The target network with $\bar{\theta}_{V,\pi}$ is briefly introduced together with the latest update rule, T-soft update [15]. First, in the initialization phase of the main network, the target network is also created as a copy with $\bar{\theta}_{V,\pi} = \theta_{V,\pi}$. The copied $\bar{\theta}_{V,\pi}$ is then held independently of $\theta_{V,\pi}$ and is not updated through the minimization problems of eqs. (1) and (3). Therefore, the pseudo-supervised signal $y$ has the same value for the same input, which greatly contributes to the stability of learning by making the minimization problems stationary.

However, in practice, if $\bar{\theta}_{V,\pi}$ is fixed at its initial value, the correct $y$ cannot be generated and the task is never accomplished. Thus, $\bar{\theta}_{V,\pi}$ must be updated slowly towards $\theta_{V,\pi}$ as in alternating optimization [20]. When the target network was first introduced, a technique called hard update was employed [8], where $\theta_{V,\pi}$ was updated a certain number of times and then copied again as $\bar{\theta}_{V,\pi} = \theta_{V,\pi}$. Afterwards, the soft update shown in the following equation was proposed to make $\bar{\theta}_{V,\pi}$ asymptotically match $\theta_{V,\pi}$ more smoothly:

\bar{\theta}_{V,\pi} \leftarrow (1 - \tau)\bar{\theta}_{V,\pi} + \tau\theta_{V,\pi} \qquad (4)

where $\tau \in (0, 1]$ denotes the update rate.
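A minimal sketch of the soft update in eq. (4), assuming the parameters are held as NumPy arrays of the same shape (an actual implementation would loop over all parameter tensors of the networks):

```python
import numpy as np

def soft_update(theta_target, theta_main, tau=0.1):
    """Eq. (4): exponential moving average of the main parameters."""
    return (1.0 - tau) * theta_target + tau * theta_main

# e.g. theta_bar = soft_update(theta_bar, theta, tau=0.1) after every gradient step
```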

The soft update rule (4) is given as an exponential moving average, and all new inputs are treated equivalently. As a result, even when $\theta_{V,\pi}$ is incorrectly updated, its adverse effect as noise is reflected in $\bar{\theta}_{V,\pi}$. This effect is more pronounced when $\tau$ is large, but mitigating this by reducing $\tau$ causes a reduction in learning speed [16].

To tackle this problem, T-soft update, which is robust to noise even with a relatively large $\tau$, has recently been proposed [15]. It regards the exponential moving average as the update of the location parameter of a normal distribution, and derives a new update rule by replacing it with a student-t distribution that is made more robust to noise by specifying the degrees of freedom $\nu \in \mathbb{R}_+$. T-soft update is described in Alg. 1. Note that $\sigma^2$ and $W$ must be updated as new internal states.

With these mathematical details, the issues of T-soft update are summarized as below.

  1. Since $\nu$ must be specified as a constant in advance, it must be tuned for each task to provide the appropriate noise robustness to maximize performance.

  2. The larger $\Delta_i\sigma_i^{-2}$ is, the more the update is suppressed. However, the simple calculation of $\Delta_i$ as a mean squared error makes it easier for noise hidden in the $i$-th subset to be overlooked.

  3. If $\tau_i$ is frequently close to zero (i.e. no update), there is a risk that $\bar{\theta}_{V,\pi}$ will not asymptotically match $\theta_{V,\pi}$.

Algorithm 1 T-soft update [15]
1: (Initialize $\bar{\theta}_i = \theta_i$, $W_i = (1-\tau)\tau^{-1}$, $\sigma_i = \epsilon \ll 1$)
2: for $\theta_i, \bar{\theta}_i \subset \theta_{V,\pi}, \bar{\theta}_{V,\pi}$ do
3:     $\Delta_i^2 = \frac{1}{d_i}\sum_{j=1}^{d_i}(\theta_{i,j} - \bar{\theta}_{i,j})^2$
4:     $w_i = (\nu + 1)(\nu + \Delta_i^2\sigma_i^{-2})^{-1}$
5:     $\tau_i = w_i(W_i + w_i)^{-1}$
6:     $\tau_{\sigma_i} = \tau w_i \nu (\nu + 1)^{-1}$
7:     $\bar{\theta}_i \leftarrow (1 - \tau_i)\bar{\theta}_i + \tau_i\theta_i$
8:     $\sigma_i^2 \leftarrow (1 - \tau_{\sigma_i})\sigma_i^2 + \tau_{\sigma_i}\Delta_i^2$
9:     $W_i \leftarrow (1 - \tau)(W_i + w_i)$
10: end for
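For reference, Alg. 1 can be transcribed into NumPy almost line by line, as in the following sketch for a single parameter subset; the internal states $\sigma_i^2$ and $W_i$ are carried between calls and initialized as in line 1 of Alg. 1. This is a sketch of the published update rule, not the authors' released code.

```python
import numpy as np

def t_soft_update(theta_bar, theta, sigma2, W, tau=0.1, nu=1.0):
    """Sketch of Alg. 1 (T-soft update) for one parameter subset.

    theta_bar, theta : target / main parameters (arrays of the same shape)
    sigma2, W        : scalar internal states (init: sigma2 = eps**2, W = (1 - tau) / tau)
    """
    delta2 = np.mean((theta - theta_bar) ** 2)                # line 3
    w = (nu + 1.0) / (nu + delta2 / sigma2)                   # line 4
    tau_i = w / (W + w)                                       # line 5
    tau_sigma = tau * w * nu / (nu + 1.0)                     # line 6
    theta_bar = (1.0 - tau_i) * theta_bar + tau_i * theta     # line 7
    sigma2 = (1.0 - tau_sigma) * sigma2 + tau_sigma * delta2  # line 8
    W = (1.0 - tau) * (W + w)                                 # line 9
    return theta_bar, sigma2, W
```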

III Proposal

III-A Adaptive T-soft update

The first two of the three issues mentioned above are resolved by deriving a new update rule, so-called AT-soft update. To develop AT-soft update, the formulation of AdaTerm [17], a stochastic gradient descent method with adaptive noise robustness, is applied. In that method, by assuming that the gradient is generated from a student-t distribution, its location, scale, and degrees-of-freedom parameters, which are utilized for updating the network, can be estimated by approximate maximum likelihood estimation. Here, instead of the gradient, the parameters of the main network are considered to be the stochastic variable generated from a student-t distribution, and its location is mapped to the parameters of the target network. With this assumption, AT-soft update obtains the noise robustness as in the conventional T-soft update. In addition, the degrees of freedom can be estimated at the same time in this formulation, so that the noise robustness can be automatically adjusted according to the task at hand.

Specifically, the $i$-th subset of $\theta_{V,\pi}$ (e.g. a weight matrix in each layer), $\theta_i$ with $d_i$ the number of dimensions, is assumed to be generated from a $d_i$-dimensional diagonal student-t distribution with three kinds of sample statistics: a location parameter $\bar{\theta}_i \in \mathbb{R}^{d_i}$; a scale parameter $\sigma_i \in \mathbb{R}_+^{d_i}$; and degrees of freedom $\nu_i \in \mathbb{R}_+$. With $\tilde{\nu}_i = \nu_i d_i^{-1}$ and $D_i = d_i^{-1}\sum_{j=1}^{d_i}(\theta_{i,j} - \bar{\theta}_{i,j})^2\sigma_{i,j}^{-2}$, its density can be described as below.

\theta_i \sim \frac{\Gamma\left(\frac{\nu_i + d_i}{2}\right)}{\Gamma\left(\frac{\nu_i}{2}\right)(\nu_i\pi)^{\frac{d_i}{2}}\prod_{j=1}^{d_i}\sigma_{i,j}}\left(1 + \frac{D_i}{\tilde{\nu}_i}\right)^{-\frac{\nu_i + d_i}{2}} =: \mathcal{T}(\theta_i \mid \bar{\theta}_i, \sigma_i, \nu_i) \qquad (5)

where $\Gamma$ denotes the gamma function. Note that the conventional T-soft update simplifies this model to a one-dimensional student-t distribution for the mean of $\theta_i - \bar{\theta}_i$, but here it is treated as a $d_i$-dimensional distribution with slightly increased computational cost.
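For completeness, the log of the density in eq. (5) can be evaluated as below; this is an illustrative sketch only (SciPy's `gammaln` is used for the log-gamma function), not code used in the paper.

```python
import numpy as np
from scipy.special import gammaln

def log_t_density(theta, theta_bar, sigma, nu):
    """Log of eq. (5): d_i-dimensional diagonal student-t density."""
    d = theta.size
    D = np.mean(((theta - theta_bar) / sigma) ** 2)     # D_i in the text
    nu_tilde = nu / d                                   # nu_tilde_i = nu_i / d_i
    log_norm = (gammaln((nu + d) / 2.0) - gammaln(nu / 2.0)
                - 0.5 * d * np.log(nu * np.pi) - np.sum(np.log(sigma)))
    return log_norm - 0.5 * (nu + d) * np.log1p(D / nu_tilde)
```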

With this assumption, following the derivation of AdaTerm, $\bar{\theta}_i$, $\sigma_i$, and $\nu_i$ are optimally inferred to maximize the approximated log-likelihood. The important variable in the derivation is $w_1$, which indicates the deviation of $\theta_i$ from $\mathcal{T}$, and is calculated as follows:

w_1 = \frac{\tilde{\nu}_i + 1}{\tilde{\nu}_i + D_i} \qquad (6)

That is, since $D_i$ represents the pseudo-distance from $\mathcal{T}$, the larger $D_i$ is, the closer $w_1$ is to 0. In addition, the smaller $\tilde{\nu}_i$ is, the more sensitive $w_1$ is to fluctuations in $D_i$, leading to higher noise robustness. Using $w_1$, $w_2$, which is used only for updating $\tilde{\nu}_i$, can be derived as follows:

w_2 = w_1 - \ln(w_1) \qquad (7)

These $w_{1,2}$ are used to calculate the update ratios of the sample statistics, $\tau_{1,2}$.

\tau_{1,2} = \tau\frac{w_{1,2}}{\overline{w}_{1,2}} \qquad (8)

where $\tau \in (0, 1]$ denotes the basic update ratio given as a hyperparameter. To satisfy $\tau_{1,2} \in (0, 1]$, the upper bounds of $w_{1,2}$, $\overline{w}_{1,2}$, are employed.

\overline{w}_1 = \frac{\tilde{\nu}_i + 1}{\tilde{\nu}_i}, \quad \overline{w}_2 = \max(\overline{w}_1 - \ln(\overline{w}_1), 87.3365) \qquad (9)

where $87.3365$ is the negative logarithm of the tiny (smallest positive normal) number of float32.
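This constant can be checked numerically with an illustrative one-liner (not from the paper):

```python
import numpy as np
# Negative log of the smallest positive normal float32 value.
print(-np.log(np.finfo(np.float32).tiny))  # approximately 87.3365
```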

The update amounts for $\bar{\theta}_i$, $\sigma_i^2$, and $\tilde{\nu}_i$ are respectively given as follows:

\bar{\theta}_i' = \theta_i \qquad (10)
(\sigma_i^2)' = (\theta_i - \bar{\theta}_i)^2 + \max\left(\epsilon^2, \left((\theta_i - \bar{\theta}_i)^2 - D_i\sigma_i^2\right)\tilde{\nu}_i^{-1}\right) \qquad (11)
\tilde{\nu}_i' = \left(\frac{\tilde{\nu}_i + 2}{\tilde{\nu}_i + 1} + \tilde{\nu}_i\right)\frac{\tilde{\nu}_i - \underline{\tilde{\nu}}}{\tilde{\nu}_i w_2} + \underline{\tilde{\nu}} + \epsilon \qquad (12)

where $\epsilon \ll 1$ denotes a small value for stabilizing the computation and $\underline{\tilde{\nu}}$ denotes the lower bound of $\tilde{\nu}$ (i.e. the maximum noise robustness) given as a hyperparameter.

Using the update ratios and the update amounts obtained above, $\bar{\theta}_i$, $\sigma_i^2$, and $\tilde{\nu}_i$ can be updated.

\bar{\theta}_i \leftarrow (1 - \tau_1)\bar{\theta}_i + \tau_1\bar{\theta}_i' \qquad (13)
\sigma_i^2 \leftarrow (1 - \tau_1)\sigma_i^2 + \tau_1(\sigma_i^2)' \qquad (14)
\tilde{\nu}_i \leftarrow (1 - \tau_2)\tilde{\nu}_i + \tau_2\tilde{\nu}_i' \qquad (15)

As a result, AT-soft update enables the parameter set $\bar{\theta}_i$ of the target network to be updated adaptively (i.e. depending on the deviation of $\theta_i$ from $\mathcal{T}$), while automatically tuning the noise robustness represented by $\sigma_i^2$ and $\tilde{\nu}_i$.
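Putting eqs. (6)–(15) together, one AT-soft step for a single parameter subset can be sketched in NumPy as below; variable names follow the text, and the returned $w_1$ and $\overline{w}_1$ are kept for the consolidation introduced in the next subsection. This is a sketch under the stated equations, not the authors' released implementation.

```python
import numpy as np

def at_soft_update(theta_bar, theta, sigma2, nu_tilde,
                   tau=0.1, nu_lower=1.0, eps=1e-5):
    """Sketch of AT-soft update (eqs. (6)-(15)) for one parameter subset."""
    diff2 = (theta - theta_bar) ** 2
    D = np.mean(diff2 / sigma2)                                   # pseudo-distance D_i
    w1 = (nu_tilde + 1.0) / (nu_tilde + D)                        # eq. (6)
    w2 = w1 - np.log(w1)                                          # eq. (7)
    w1_bar = (nu_tilde + 1.0) / nu_tilde                          # eq. (9)
    w2_bar = max(w1_bar - np.log(w1_bar), 87.3365)
    tau1, tau2 = tau * w1 / w1_bar, tau * w2 / w2_bar             # eq. (8)
    sigma2_new = diff2 + np.maximum(eps ** 2,
                                    (diff2 - D * sigma2) / nu_tilde)      # eq. (11)
    nu_new = ((nu_tilde + 2.0) / (nu_tilde + 1.0) + nu_tilde) \
             * (nu_tilde - nu_lower) / (nu_tilde * w2) + nu_lower + eps   # eq. (12)
    theta_bar = (1.0 - tau1) * theta_bar + tau1 * theta           # eq. (13) with (10)
    sigma2 = (1.0 - tau1) * sigma2 + tau1 * sigma2_new            # eq. (14)
    nu_tilde = (1.0 - tau2) * nu_tilde + tau2 * nu_new            # eq. (15)
    return theta_bar, sigma2, nu_tilde, w1, w1_bar
```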

III-B Consolidation from main to target networks

However, if $\tau_1 \simeq 0$ continues in the above update, $\theta_i$ will gradually deviate from $\bar{\theta}_i$, and the target network will no longer be able to generate appropriate pseudo-supervised signals since the assumption $\bar{\theta}_i \simeq \theta_i$ is broken. In such a case, parts of $\theta_i$ would be updated in a wrong direction by the minimization of eqs. (1)–(3), causing $\tau_1 \simeq 0$ as if they were outliers. Hence, to stop this fruitless judgement and restart the appropriate updates, reverting $\theta_i$ to $\bar{\theta}_i$ and holding $\theta_i \simeq \bar{\theta}_i$ would be a natural and effective way. To this end, a heuristic consolidation is designed as below.

Specifically, the ratio for consolidating the main network toward the target network, $\tau_c$, is designed to be larger when the update ratio of $\bar{\theta}_i$ is smaller (i.e. when $\theta_i$ deviates from $\mathcal{T}$).

\tau_c = \lambda\tau\left(1 - \frac{w_1}{\overline{w}_1}\right) \qquad (16)

where $\lambda \in [0, 1]$ adjusts the strength of this consolidation, which should be the same as or weaker than the update speed of the target network.

Next, since consolidating all of $\theta_i$ would interfere with learning, the consolidated subset of outliers, $\theta_i^c$, should be extracted. A simple and popular way is to use the $q$-th quantile $Q$ with $q \in [0, 1]$. Since the components that contribute to making $w_1$ small are those with large $(\theta_{i,j} - \bar{\theta}_{i,j})^2\sigma_{i,j}^{-2} =: \Delta_{i,j}$, $\theta_i^c$ is defined as follows:

\theta_i^c = \left\{\theta_{i,j} \in \theta_i \mid \Delta_{i,j} \geq Q(\Delta_i; q)\right\} \qquad (17)

Thus, the following update formula consolidates $\theta_i^c$ toward the corresponding subset of the target network, $\bar{\theta}_i^c$.

\theta_i^c \leftarrow (1 - \tau_c)\theta_i^c + \tau_c\bar{\theta}_i^c \qquad (18)

A rough sketch of this consolidation is shown in Fig. 1. Although loss-function-based consolidations, as proposed in the context of continual learning [21, 22], would also be possible, a more convenient implementation with lower computational cost was employed.
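The interpolation in eqs. (16)–(18) can be sketched as follows, reusing $w_1$, $\overline{w}_1$, and the elementwise $\Delta_{i,j}$ from the AT-soft step; `np.quantile` provides the $q$-th quantile threshold. This is again a hedged sketch, not the paper's released implementation.

```python
import numpy as np

def consolidate(theta, theta_bar, delta, w1, w1_bar, tau=0.1, lam=1.0, q=1.0):
    """Sketch of eqs. (16)-(18): pull the most deviating main parameters
    back toward the target network when the target update was suppressed.
    delta is the elementwise Delta_{i,j} = (theta - theta_bar)**2 / sigma2."""
    tau_c = lam * tau * (1.0 - w1 / w1_bar)             # eq. (16)
    mask = delta >= np.quantile(delta, q)               # eq. (17): outlier subset
    return np.where(mask, (1.0 - tau_c) * theta + tau_c * theta_bar, theta)  # eq. (18)
```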

The pseudo-code for the consolidated adaptive T-soft (CAT-soft) update is summarized in Alg. 2. Note that, although $\underline{\tilde{\nu}}$ must be specified as a new hyperparameter in (C)AT-soft update, the $\nu$ specified in T-soft update is already set conservatively for noise and can be inherited (i.e. $\underline{\tilde{\nu}} = 1$). Therefore, the additional hyperparameters to be tuned are $\lambda$ and $q$. $q$ can be given as $q \simeq 1$ so that only a few parameters in the $i$-th subset are consolidated without interfering with learning. $\lambda$ can be given as the inverse of the number of parameters to be consolidated. In other words, we can decide whether to make $q$ closer to $1$ and consolidate fewer parameters tightly, or make $q$ smaller and consolidate more parameters slightly.

Figure 1: Rough sketches of the proposed consolidation: when $\theta_i$ is not far from $\mathcal{T}$ as in (a), almost no consolidation works; when parts of $\theta_i$ are far from $\mathcal{T}$ as in (b), only $\theta_i^c$, which leads to $w_1 \simeq 0$, is consolidated to the target network.
Algorithm 2 CAT-soft update
1: (Initialize $\bar{\theta}_i = \theta_i$, $\sigma_i = \epsilon$, $\tilde{\nu}_i = \underline{\tilde{\nu}}$)
2: for $\theta_i, \bar{\theta}_i \subset \theta_{V,\pi}, \bar{\theta}_{V,\pi}$ do
3:     $\Delta_i = (\theta_i - \bar{\theta}_i)^2\sigma_i^{-2}$
4:     $D_i = d_i^{-1}\sum_{j=1}^{d_i}\Delta_{i,j}$
5:     $w_1 = (\tilde{\nu}_i + 1)(\tilde{\nu}_i + D_i)^{-1}$, $w_2 = w_1 - \ln(w_1)$
6:     $\overline{w}_1 = (\tilde{\nu}_i + 1)\tilde{\nu}_i^{-1}$, $\overline{w}_2 = \max(\overline{w}_1 - \ln(\overline{w}_1), 87.3365)$
7:     $\tau_{1,2} = \tau w_{1,2}\overline{w}_{1,2}^{-1}$
8:     $(\sigma_i^2)' = \Delta_i\sigma_i^2 + \max(\epsilon^2, (\Delta_i - D_i)\sigma_i^2\tilde{\nu}_i^{-1})$
9:     $\tilde{\nu}_i' = \{1 + (\tilde{\nu}_i + 1)^{-1} + \tilde{\nu}_i\}(\tilde{\nu}_i - \underline{\tilde{\nu}})(\tilde{\nu}_i w_2)^{-1} + \underline{\tilde{\nu}} + \epsilon$
10:     $\bar{\theta}_i \leftarrow (1 - \tau_1)\bar{\theta}_i + \tau_1\theta_i$
11:     $\sigma_i^2 \leftarrow (1 - \tau_1)\sigma_i^2 + \tau_1(\sigma_i^2)'$
12:     $\tilde{\nu}_i \leftarrow (1 - \tau_2)\tilde{\nu}_i + \tau_2\tilde{\nu}_i'$
13:     if Consolidation then
14:         $\tau_c = \lambda\tau(1 - w_1\overline{w}_1^{-1})$
15:         $\theta_i^c = \{\theta_{i,j} \in \theta_i \mid \Delta_{i,j} \geq Q(\Delta_i; q)\}$
16:         $\theta_i^c \leftarrow (1 - \tau_c)\theta_i^c + \tau_c\bar{\theta}_i^c$
17:     end if
18: end for
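In practice, Alg. 2 would be applied to every parameter subset (e.g. each weight matrix and bias vector) after each gradient step. A hedged usage sketch, assuming the `at_soft_update` and `consolidate` functions sketched above and dictionaries of per-subset internal states, could look as follows:

```python
# Hedged usage sketch: main_params / target_params are assumed dictionaries of
# NumPy arrays keyed by layer name; sigma2 / nu_tilde hold the per-subset states.
for name, theta in main_params.items():
    theta_bar = target_params[name]
    delta = (theta - theta_bar) ** 2 / sigma2[name]               # Alg. 2, line 3
    theta_bar, sigma2[name], nu_tilde[name], w1, w1_bar = at_soft_update(
        theta_bar, theta, sigma2[name], nu_tilde[name], tau=0.1, nu_lower=1.0)
    main_params[name] = consolidate(theta, theta_bar, delta, w1, w1_bar,
                                    tau=0.1, lam=1.0, q=1.0)      # lines 13-16
    target_params[name] = theta_bar
```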

IV Simulations

TABLE I: Hyperparameters for the DRL algorithms used
Symbol | Meaning | Value
$L$ | #Hidden layers | 2
$N$ | #Neurons for each layer | 100
$\gamma$ | Discount factor | 0.99
$(\alpha, \beta, \epsilon, \underline{\tilde{\nu}})$ | For AdaTerm [17] | $(10^{-3}, 0.9, 10^{-5}, 1)$
$(\kappa, \beta, \lambda, \underline{\Delta})$ | For PPO-RPE [12] | $(0.5, 0.5, 0.999, 0.1)$
$(N_c, N_b, \alpha, \beta)$ | For PER [19] | $(10^4, 32, 1.0, 0.5)$
$(\sigma, \underline{\lambda}, \overline{\lambda}, \beta)$ | For L2C2 [23] | $(1, 0.01, 1, 0.1)$

IV-A Setup

For the statistical verification of the proposed method, the following simulations are conducted. As the simulation environment, Pybullet [18] with OpenAI Gym [24] is employed. From it, InvertedDoublePendulumBulletEnv-v0 (DoublePendulum), HopperBulletEnv-v0 (Hopper), and AntBulletEnv-v0 (Ant) are chosen as tasks. To make the tasks harder, the observations from them are corrupted by white noise with a scale of 0.001. With 18 random seeds, each method attempts to accomplish each task. After training, the learned policy is run 100 times to evaluate the sum of rewards as a score (larger is better).

The implementation of the base network architecture and DRL algorithm is basically the same as in the literature [23]. However, it is worth noting that the stochastic policy function is modeled by a student-t distribution for conservative learning and efficient exploration [14], instead of a normal distribution. Hyperparameters for the implementation are summarized in Table I. Note that the learning difficulty is higher because the specified learning rate is higher than the value usually suitable for DRL, revealing the effectiveness of the target network for stabilizing learning.

The following three methods are compared.

  • T-soft update: $(\tau = 0.1, \nu = 1)$

  • AT-soft update: $(\tau = 0.1, \underline{\tilde{\nu}} = 1)$

  • CAT-soft update: $(\tau = 0.1, \underline{\tilde{\nu}} = 1, \lambda = 1, q = 1)$

Here, $q$ is designed to consolidate only one parameter in each subset as the simplest implementation. Correspondingly, $\lambda$ is given as the inverse of the (maximum) number of consolidated parameters. $\tau = 0.1$ is set smaller than the value in the literature [15], in order to counteract the negative effects of the high learning rate (i.e. $10^{-3}$) set above.

Figure 2: Learning curves for the benchmarks ((a) DoublePendulum, (b) Hopper, (c) Ant): the upper row shows the mean of the deviation between the main and target networks, $|\theta - \bar{\theta}|$; the lower row plots the mean of $1 - w_1/\overline{w}_1$, which corresponds to the noise robustness (or the magnitude of consolidation); compared to T-soft update, AT-soft and CAT-soft updates have larger deviations, but this was suppressed by the consolidation in the early stages of learning; as learning progressed, the cases where updates of the target network were suppressed due to noise decreased and the consolidation was relaxed, resulting in AT-soft and CAT-soft updates converging to roughly the same degree of deviation.
TABLE II: The sum of rewards after training
Method | DoublePendulum | Hopper | Ant
T-soft | 6427.1 ± 3357.8 | 1852.8 ± 900.9 | 2683.8 ± 249.3
AT-soft | 6379.7 ± 3299.7 | 1662.7 ± 897.4 | 2764.1 ± 265.5
CAT-soft | 7129.2 ± 2946.0 | 1971.2 ± 812.9 | 2760.0 ± 312.2

IV-B Result

The learning behaviors are depicted in Fig. 2. As pointed out, the deviations under (C)AT-soft update were larger than that under the conventional T-soft update, since (C)AT-soft update has better outlier and noise detection performance and the target network updates are more easily suppressed. This was pronounced in the early stage of training, when the noise robustness is high and the update of the main network is unstable. However, CAT-soft update suppressed the deviation in this early stage. As learning progressed, CAT-soft update converged to roughly the same level of deviation as AT-soft update because the consolidation was relaxed along with the weakened noise robustness.

The scores of 100 runs after learning are summarized in Table II. AT-soft update slightly improved the performance over T-soft update on Ant, but decreased it on Hopper. In contrast, CAT-soft update outperformed T-soft update in all tasks.

IV-C Demonstration

TABLE III: Arguments for MinitaurBulletDuckEnv-v0
Argument | Default | Modified
motor_velocity_limit | $\infty$ | 100
pd_control_enabled | False | True
accurate_motor_model_enabled | True | False
action_repeat | 1 | 4
observation_noise_stdev | 0 | $10^{-5}$
hard_reset | True | False
env_randomizer | Uniform | None
distance_weight | 1 | 100
energy_weight | 0.005 | 1
shake_weight | 0 | 1
Figure 3: Demonstration results ((a) learning curves, (b) test results): CAT-soft update showed a more stable learning curve than that of T-soft update and reached the success level of the task in the final performance.

As a demonstration, a simulation closer to a real robot experiment, MinitaurBulletDuckEnv-v0 (Minitaur) in Pybullet, is tried. The task is to move a duck placed on top of a Ghost Minitaur, a quadruped robot developed by Ghost Robotics. Since the duck is not fixed, careful locomotion is required; moreover, its states (e.g. position) are unobserved, making this task a partially observable MDP (POMDP). Note that the default setting for Minitaur tasks is unrealistic, as pointed out in the literature [25]. Therefore, it was modified as shown in Table III (arguments not listed are left at their defaults).

T-soft and CAT-soft updates are compared under the same conditions as in the above simulations. The learning curves of the scores for 8 trials and the test results of the trained policies are depicted in Fig. 3. The best behaviors on the tests can be found in the attached video. As can be seen from Fig. 3, only the proposed CAT-soft update was able to acquire the successful cases of the task (walking without dropping the duck). Thus, it is suggested that CAT-soft update can contribute to the success of the task by steadily improving the learning performance even for more practical tasks.

V Conclusion

This paper proposed a new update rule for the target network, CAT-soft update, which stabilizes DRL. In order to adaptively adjust the noise robustness, the update rule inspired by AdaTerm, which has been developed recently, was derived. In addition, a heuristic consolidation from the main to target networks was developed to suppress the deviation between them, which may occur when updates are continuously limited due to noise. The developed CAT-soft update was tested on the DRL benchmark tasks, and succeeded in improving and stabilizing the learning performance over the conventional T-soft update.

Strictly speaking, the target network should be kept close to the main network in terms of its outputs rather than its parameters. A new consolidation and a noise-robust update rule based on the output space are expected to contribute to further performance improvements. These efforts to stabilize DRL will lead to its practical use in the near future.

ACKNOWLEDGMENT

This work was supported by JSPS KAKENHI, Grant-in-Aid for Scientific Research (B), Grant Number JP20H04265.

References

  • [1] H. Modares, I. Ranatunga, F. L. Lewis, and D. O. Popa, “Optimized assistive human–robot interaction using reinforcement learning,” IEEE transactions on cybernetics, vol. 46, no. 3, pp. 655–667, 2015.
  • [2] T. Kobayashi, E. Dean-Leon, J. R. Guadarrama-Olvera, F. Bergner, and G. Cheng, “Whole-body multicontact haptic human–humanoid interaction based on leader–follower switching: A robot dance of the “box step”,” Advanced Intelligent Systems, p. 2100038, 2021.
  • [3] T. Kobayashi, T. Aoyama, K. Sekiyama, and T. Fukuda, “Selection algorithm for locomotion based on the evaluation of falling risk,” IEEE Transactions on Robotics, vol. 31, no. 3, pp. 750–765, 2015.
  • [4] J. Delmerico, S. Mintchev, A. Giusti, B. Gromov, K. Melo, T. Horvat, C. Cadena, M. Hutter, A. Ijspeert, D. Floreano et al., “The current state and future outlook of rescue robotics,” Journal of Field Robotics, vol. 36, no. 7, pp. 1171–1191, 2019.
  • [5] Y. Tsurumine, Y. Cui, E. Uchibe, and T. Matsubara, “Deep reinforcement learning with smooth policy update: Application to robotic cloth manipulation,” Robotics and Autonomous Systems, vol. 112, pp. 72–83, 2019.
  • [6] O. Kroemer, S. Niekum, and G. Konidaris, “A review of robot learning for manipulation: Challenges, representations, and algorithms,” Journal of Machine Learning Research, vol. 22, no. 30, pp. 1–82, 2021.
  • [7] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [8] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [9] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International conference on machine learning.   PMLR, 2018, pp. 1861–1870.
  • [10] K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” in Advances in Neural Information Processing Systems, 2018, pp. 4754–4765.
  • [11] M. Okada, N. Kosaka, and T. Taniguchi, “Planet of the bayesians: Reconsidering and improving deep planning network by incorporating bayesian inference,” in IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2020, pp. 5611–5618.
  • [12] T. Kobayashi, “Proximal policy optimization with relative pearson divergence,” in IEEE international conference on robotics and automation.   IEEE, 2021, pp. 8416–8421.
  • [13] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy, “Deep exploration via bootstrapped dqn,” Advances in neural information processing systems, vol. 29, pp. 4026–4034, 2016.
  • [14] T. Kobayashi, “Student-t policy in reinforcement learning to acquire global optimum of robot control,” Applied Intelligence, vol. 49, no. 12, pp. 4335–4347, 2019.
  • [15] T. Kobayashi and W. E. L. Ilboudo, “t-soft update of target network for deep reinforcement learning,” Neural Networks, 2021.
  • [16] S. Kim, K. Asadi, M. Littman, and G. Konidaris, “Deepmellow: removing the need for a target network in deep q-learning,” in International Joint Conference on Artificial Intelligence.   AAAI Press, 2019, pp. 2733–2739.
  • [17] W. E. L. Ilboudo, T. Kobayashi, and K. Sugimoto, “Adaterm: Adaptive t-distribution estimated robust moments towards noise-robust stochastic gradient optimizer,” arXiv preprint arXiv:2201.06714, 2022.
  • [18] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” GitHub repository, 2016.
  • [19] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
  • [20] J. C. Bezdek and R. J. Hathaway, “Convergence of alternating optimization,” Neural, Parallel & Scientific Computations, vol. 11, no. 4, pp. 351–368, 2003.
  • [21] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the national academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
  • [22] F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic intelligence,” in International Conference on Machine Learning.   PMLR, 2017, pp. 3987–3995.
  • [23] T. Kobayashi, “L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning,” arXiv preprint arXiv:2202.07152, 2022.
  • [24] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
  • [25] T. Kobayashi, “Optimistic reinforcement learning by forward kullback-leibler divergence optimization,” arXiv preprint arXiv:2105.12991, 2021.