
Consolidated Adaptive T-soft Update for Deep Reinforcement Learning

Taisuke Kobayashi¹ (¹Taisuke Kobayashi is with the Division of Information Science, Nara Institute of Science and Technology, 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan; [email protected])
Abstract

Demand for deep reinforcement learning (DRL) is gradually increasing to enable robots to perform complex tasks, while DRL is known to be unstable. As a technique to stabilize its learning, a target network that slowly and asymptotically matches a main network is widely employed to generate stable pseudo-supervised signals. Recently, T-soft update has been proposed as a noise-robust update rule for the target network and has contributed to improving DRL performance. However, the noise robustness of T-soft update is specified by a hyperparameter, which should be tuned for each task, and is deteriorated by its simplified implementation. This study develops adaptive T-soft (AT-soft) update by utilizing the update rule in the recently developed AdaTerm optimizer. In addition, the concern that the target network may not asymptotically match the main network is mitigated by a new consolidation that brings the main network back toward the target network. This so-called consolidated AT-soft (CAT-soft) update is verified through numerical simulations.

Index Terms:
Reinforcement Learning, Machine Learning for Robot Control, Deep Learning Methods

I Introduction

As robotic and machine learning technologies develop remarkably, the tasks required of intelligent robots become more complex: e.g. physical human-robot interaction [1, 2]; work on disaster sites [3, 4]; and manipulation of various objects [5, 6]. In most cases, these complex tasks have no accurate analytical model. To resolve this difficulty, deep reinforcement learning (DRL) [7] has received a lot of attention as an alternative to classic model-based control. Using deep neural networks (DNNs) as nonlinear function approximators, DRL can learn a complicated policy and/or value function in a model-free manner [8, 9], or learn a complicated world model for planning the optimal policy in a model-based manner [10, 11].

Because DNNs are nonlinear and, in particular, model-free DRL must generate pseudo-supervised signals by itself, DRL tends to be unstable. Techniques to stabilize learning have been actively proposed, such as the design of regularization [5, 9, 12] and the introduction of models that make learning conservative [13, 14]. Among them, the target network is one of the current standard techniques in DRL [8, 15]. After generating it as a copy of the main network to be learned by DRL, an update rule is given to make it slowly match the main network, either at regular intervals or asymptotically. In this case, the pseudo-supervised signals generated from the target network are more stable than those generated from the main network, which greatly contributes to the overall stability of DRL.

The challenge in using the target network is its update rule. It has been reported that too slow an update stagnates the whole learning process [16], while too fast an update reverts to instability of the pseudo-supervised signals. A new update rule, T-soft update [15], has been proposed to mitigate the latter problem. This method provides a mechanism to limit the amount of update when the main network deviates unnaturally from the target network, which can be regarded as noise. Such noise robustness makes it possible to stabilize the whole learning process even with a high update rate by appropriately ignoring the unstable behaviors of the main network.

However, the noise robustness of T-soft update is specified as a hyperparameter, which must be set to an appropriate value depending on the task to be solved. In addition, the simplified implementation for detecting noise deteriorates the noise robustness; a more sophisticated implementation with adaptive noise robustness is desired. As another concern, when the update of the target network is restricted as T-soft update does, the target network may not asymptotically match the main network. A new constraint is needed to avoid the situation where the main network deviates from the target network.

Hence, this paper proposes two methods to resolve each of the above two issues: i) an adaptive and sophisticated implementation of T-soft update; and ii) an appropriate consolidation of the main network toward the target network. Specifically, for issue i), a new update rule, so-called adaptive T-soft (AT-soft) update, is developed based on the recently proposed AdaTerm [17] formulation, which is an adaptively noise-robust stochastic gradient descent method. This allows us to sophisticate the simplified implementation of T-soft update and improve the noise robustness, which becomes adaptive to the input patterns. For issue ii), a new consolidation is designed so that the main network is regularized toward the target network when AT-soft update restricts the updates of the target network. By implementing it with interpolation, the parameters in the main network that deviate significantly from those of the target network are pulled back to a larger extent. With this consolidation, the proposed method is so-called consolidated AT-soft (CAT-soft) update.

To verify CAT-soft update, typical benchmarks implemented in Pybullet [18] are tried using the latest DRL algorithms [12, 19]. It is shown that, even though the learning rate is larger than the standard value for DRL, the task performance is improved by CAT-soft update and more stable learning can be achieved. In addition, the developed consolidation successfully suppresses the divergence between the main and target networks.

II Preliminaries

II-A Reinforcement learning

First of all, the basic problem statement of DRL is introduced together with an actor-critic algorithm, which can handle continuous action spaces and is a natural choice [7] as one of the basic algorithms for robot control. Note that the proposed method can be applied to other algorithms that use a target network.

In DRL, an agent interacts with an unknown environment under a Markov decision process (MDP) with the current state $s$, the agent's action $a$, the next state $s'$, and the reward from the environment $r$. Specifically, the environment implicitly has its initial randomness $p_0(s)$ and its state transition probability $p_e(s' \mid s, a)$. Since the agent can act on the environment's state transition through $a$, the goal is to find the optimal policy to reach the desired state. To this end, $a$ is sampled from a state-dependent trainable policy, $\pi(a \mid s; \theta_\pi)$, with its parameter set $\theta_\pi$ (a.k.a. weights and biases of DNNs in DRL). The outcome of the interaction between the agent and the environment can be evaluated as $r = r(s, a, s')$.

By repeating the above process, the agent gains the sum of $r$ over the future (the so-called return), $R = \sum_{k=0}^{\infty} \gamma^k r_k$, with $\gamma \in [0, 1)$ the discount factor. The main purpose of DRL is to maximize $R$ by optimizing $\pi$ (i.e. $\theta_\pi$). However, $R$ cannot be gained directly due to its future information, hence its expected value is inferred as a trainable (state) value function, $V(s; \theta_V) = \mathbb{E}[R \mid s]$, with its parameter set $\theta_V$. Finally, DRL optimizes $\pi$ to maximize $V$ while increasing the accuracy of $V$.
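As a minimal illustration of the return defined above (not code from the paper), the following Python snippet computes the discounted return for a finite reward sequence, which truncates the infinite sum at the episode length; the reward values and $\gamma$ are arbitrary examples.

```python
# Minimal sketch (not from the paper): discounted return R = sum_k gamma^k r_k
# for a finite episode, accumulated backwards as R_t = r_t + gamma * R_{t+1}.
def discounted_return(rewards, gamma=0.99):
    ret = 0.0
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret

print(discounted_return([1.0, 1.0, 1.0], gamma=0.99))  # 1 + 0.99 + 0.99**2 = 2.9701
```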

To learn $V$, a temporal difference (TD) error method is widely used as follows:

\mathcal{L}_V(\theta_V) = \frac{1}{2}\left(y - V(s; \theta_V)\right)^2 \qquad (1)
y = r + \gamma V(s'; \bar{\theta}_V) \qquad (2)

where $y$ denotes the pseudo-supervised signal generated from the target network with the parameter set $\bar{\theta}_V$ (see later). By minimizing $\mathcal{L}_V$, $\theta_V$ can be optimized to correctly infer the value over $s$.

To learn $\pi$, a policy-gradient method is applied as follows:

\mathcal{L}_\pi(\theta_\pi) = -\left(y - V(s; \theta_V)\right) \frac{\pi(a \mid s; \theta_\pi)}{b(a \mid s; \bar{\theta}_\pi)} \qquad (3)

where $a$ is sampled from the alternative policy $b$, which is often given by the target network with the parameter set $\bar{\theta}_\pi$. The sampler change is allowed by importance sampling, and the likelihood ratio is introduced as in the above loss function. By minimizing $\mathcal{L}_\pi$, $\theta_\pi$ can be optimized to reach the states with higher value.
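To make eqs. (1)–(3) concrete, the sketch below computes the pseudo-supervised signal $y$ and both losses for a single transition. The callables `v_main`, `v_target`, `pi_prob`, and `b_prob` are hypothetical stand-ins for the main/target value networks and the current/behavior policies; they are not part of the paper's implementation.

```python
# Hedged sketch of eqs. (1)-(3) for one transition (s, a, r, s').
# v_main, v_target, pi_prob, b_prob are hypothetical callables standing in for
# the main/target value networks and the current/behavior (target) policies.
def td_and_policy_losses(s, a, r, s_next, gamma, v_main, v_target, pi_prob, b_prob):
    y = r + gamma * v_target(s_next)       # eq. (2): target from the target network
    td_error = y - v_main(s)               # TD error (advantage estimate)
    loss_v = 0.5 * td_error ** 2           # eq. (1): value loss
    ratio = pi_prob(a, s) / b_prob(a, s)   # importance weight pi / b
    loss_pi = -td_error * ratio            # eq. (3): policy loss (td_error held constant)
    return loss_v, loss_pi
```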

II-B Target network with T-soft update

The target network with $\bar{\theta}_{V,\pi}$ is briefly introduced together with the latest update rule, T-soft update [15]. First, in the initialization phase of the main network, the target network is also created as a copy with $\bar{\theta}_{V,\pi} = \theta_{V,\pi}$. The copied $\bar{\theta}_{V,\pi}$ is then held independently of $\theta_{V,\pi}$ and is not updated through the minimization problems of eqs. (1) and (3). Therefore, the pseudo-supervised signal $y$ has the same value for the same input, which greatly contributes to the stability of learning by making the minimization problems stationary.

However, in practice, if $\bar{\theta}_{V,\pi}$ is fixed at its initial value, the correct $y$ cannot be generated and the task is never accomplished. Thus, $\bar{\theta}_{V,\pi}$ must be updated slowly towards $\theta_{V,\pi}$ as in alternating optimization [20]. When the target network was first introduced, a technique called hard update was employed [8], where $\theta_{V,\pi}$ was updated a certain number of times and then copied again as $\bar{\theta}_{V,\pi} = \theta_{V,\pi}$. Afterwards, the soft update shown in the following equation was proposed to make $\bar{\theta}_{V,\pi}$ asymptotically match $\theta_{V,\pi}$ more smoothly:

\bar{\theta}_{V,\pi} \leftarrow (1 - \tau)\bar{\theta}_{V,\pi} + \tau\theta_{V,\pi} \qquad (4)

where $\tau \in (0, 1]$ denotes the update rate.
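A minimal sketch of the soft update in eq. (4), assuming the parameters are held as NumPy arrays of the same shape (an actual implementation would loop over all parameter tensors of the networks):

```python
import numpy as np

def soft_update(theta_target, theta_main, tau=0.1):
    """Eq. (4): exponential moving average of the main parameters."""
    return (1.0 - tau) * theta_target + tau * theta_main

# e.g. theta_bar = soft_update(theta_bar, theta, tau=0.1) after every gradient step
```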

The soft update rule (4) is given as an exponential moving average, and all new inputs are treated equivalently. As a result, even when $\theta_{V,\pi}$ is incorrectly updated, its adverse effect as noise is reflected in $\bar{\theta}_{V,\pi}$. This effect is more pronounced when $\tau$ is large, but mitigating this by reducing $\tau$ causes a reduction in learning speed [16].

To tackle this problem, T-soft update, which is robust to noise even with a relatively large $\tau$, has recently been proposed [15]. It regards the exponential moving average as the update of the location parameter of a normal distribution, and derives a new update rule by replacing it with a student-t distribution that is made more robust to noise by specifying the degrees of freedom $\nu \in \mathbb{R}_+$. T-soft update is described in Alg. 1. Note that $\sigma^2$ and $W$ must be updated as new internal states.

With these mathematical details, the issues of T-soft update are summarized as below.

  1. Since $\nu$ must be specified as a constant in advance, it must be tuned for each task to provide the appropriate noise robustness to maximize performance.

  2. The larger $\Delta_i\sigma_i^{-2}$ is, the more the update is suppressed. However, the simple calculation of $\Delta_i$ as a mean squared error makes it easier for noise hidden in the $i$-th subset to be overlooked.

  3. If $\tau_i$ is frequently close to zero (i.e. no update), there is a risk that $\bar{\theta}_{V,\pi}$ will not asymptotically match $\theta_{V,\pi}$.

Algorithm 1 T-soft update [15]
1: (Initialize $\bar{\theta}_i = \theta_i$, $W_i = (1-\tau)\tau^{-1}$, $\sigma_i = \epsilon \ll 1$)
2: for $\theta_i, \bar{\theta}_i \subset \theta_{V,\pi}, \bar{\theta}_{V,\pi}$ do
3:     $\Delta_i^2 = \frac{1}{d_i}\sum_{j=1}^{d_i}(\theta_{i,j} - \bar{\theta}_{i,j})^2$
4:     $w_i = (\nu + 1)(\nu + \Delta_i^2\sigma_i^{-2})^{-1}$
5:     $\tau_i = w_i(W_i + w_i)^{-1}$
6:     $\tau_{\sigma_i} = \tau w_i \nu (\nu + 1)^{-1}$
7:     $\bar{\theta}_i \leftarrow (1 - \tau_i)\bar{\theta}_i + \tau_i\theta_i$
8:     $\sigma_i^2 \leftarrow (1 - \tau_{\sigma_i})\sigma_i^2 + \tau_{\sigma_i}\Delta_i^2$
9:     $W_i \leftarrow (1 - \tau)(W_i + w_i)$
10: end for
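For reference, Alg. 1 can be transcribed into NumPy almost line by line, as in the following sketch for a single parameter subset; the internal states $\sigma_i^2$ and $W_i$ are carried between calls and initialized as in line 1 of Alg. 1. This is a sketch of the published update rule, not the authors' released code.

```python
import numpy as np

def t_soft_update(theta_bar, theta, sigma2, W, tau=0.1, nu=1.0):
    """Sketch of Alg. 1 (T-soft update) for one parameter subset.

    theta_bar, theta : target / main parameters (arrays of the same shape)
    sigma2, W        : scalar internal states (init: sigma2 = eps**2, W = (1 - tau) / tau)
    """
    delta2 = np.mean((theta - theta_bar) ** 2)                # line 3
    w = (nu + 1.0) / (nu + delta2 / sigma2)                   # line 4
    tau_i = w / (W + w)                                       # line 5
    tau_sigma = tau * w * nu / (nu + 1.0)                     # line 6
    theta_bar = (1.0 - tau_i) * theta_bar + tau_i * theta     # line 7
    sigma2 = (1.0 - tau_sigma) * sigma2 + tau_sigma * delta2  # line 8
    W = (1.0 - tau) * (W + w)                                 # line 9
    return theta_bar, sigma2, W
```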

III Proposal

III-A Adaptive T-soft update

The first two of the three issues mentioned above are resolved by deriving a new update rule, so-called AT-soft update. To develop AT-soft update, the formulation of AdaTerm [17], a stochastic gradient descent method with adaptive noise robustness, is applied. In that method, by assuming that the gradient is generated from a student-t distribution, its location, scale, and degrees-of-freedom parameters, which are utilized for updating the network, can be estimated by approximate maximum likelihood estimation. Here, instead of the gradient, the parameters of the main network are considered to be the stochastic variable generated from a student-t distribution, and its location is mapped to the parameters of the target network. With this assumption, AT-soft update obtains the noise robustness as in the conventional T-soft update. In addition, the degrees of freedom can be estimated at the same time in this formulation, so that the noise robustness can be automatically adjusted according to the task at hand.

Specifically, the $i$-th subset of $\theta_{V,\pi}$ (e.g. a weight matrix in each layer), $\theta_i$ with $d_i$ the number of dimensions, is assumed to be generated from a $d_i$-dimensional diagonal student-t distribution with three kinds of sample statistics: a location parameter $\bar{\theta}_i \in \mathbb{R}^{d_i}$; a scale parameter $\sigma_i \in \mathbb{R}_+^{d_i}$; and degrees of freedom $\nu_i \in \mathbb{R}_+$. With $\tilde{\nu}_i = \nu_i d_i^{-1}$ and $D_i = d_i^{-1}\sum_{j=1}^{d_i}(\theta_{i,j} - \bar{\theta}_{i,j})^2\sigma_{i,j}^{-2}$, its density can be described as below.

\theta_i \sim \frac{\Gamma\left(\frac{\nu_i + d_i}{2}\right)}{\Gamma\left(\frac{\nu_i}{2}\right)(\nu_i\pi)^{\frac{d_i}{2}}\prod_{j=1}^{d_i}\sigma_{i,j}}\left(1 + \frac{D_i}{\tilde{\nu}_i}\right)^{-\frac{\nu_i + d_i}{2}} =: \mathcal{T}(\theta_i \mid \bar{\theta}_i, \sigma_i, \nu_i) \qquad (5)

where $\Gamma$ denotes the gamma function. Note that the conventional T-soft update simplifies this model to a one-dimensional student-t distribution for the mean of $\theta_i - \bar{\theta}_i$, but here it is treated as a $d_i$-dimensional distribution with slightly increased computational cost.
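For completeness, the log of the density in eq. (5) can be evaluated as below; this is an illustrative sketch only (SciPy's `gammaln` is used for the log-gamma function), not code used in the paper.

```python
import numpy as np
from scipy.special import gammaln

def log_t_density(theta, theta_bar, sigma, nu):
    """Log of eq. (5): d_i-dimensional diagonal student-t density."""
    d = theta.size
    D = np.mean(((theta - theta_bar) / sigma) ** 2)     # D_i in the text
    nu_tilde = nu / d                                   # nu_tilde_i = nu_i / d_i
    log_norm = (gammaln((nu + d) / 2.0) - gammaln(nu / 2.0)
                - 0.5 * d * np.log(nu * np.pi) - np.sum(np.log(sigma)))
    return log_norm - 0.5 * (nu + d) * np.log1p(D / nu_tilde)
```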

With this assumption, following the derivation of AdaTerm, $\bar{\theta}_i$, $\sigma_i$, and $\nu_i$ are optimally inferred to maximize the approximated log-likelihood. The important variable in the derivation is $w_1$, which indicates the deviation of $\theta_i$ from $\mathcal{T}$, and is calculated as follows:

w_1 = \frac{\tilde{\nu}_i + 1}{\tilde{\nu}_i + D_i} \qquad (6)

That is, since $D_i$ represents the pseudo-distance from $\mathcal{T}$, the larger $D_i$ is, the closer $w_1$ is to 0. In addition, the smaller $\tilde{\nu}_i$ is, the more sensitive $w_1$ is to fluctuations in $D_i$, leading to higher noise robustness. Using $w_1$, $w_2$, which is used only for updating $\tilde{\nu}_i$, can be derived as follows:

w_2 = w_1 - \ln(w_1) \qquad (7)

These $w_{1,2}$ are used to calculate the update ratios of the sample statistics, $\tau_{1,2}$.

\tau_{1,2} = \tau\frac{w_{1,2}}{\overline{w}_{1,2}} \qquad (8)

where $\tau \in (0, 1]$ denotes the basic update ratio given as a hyperparameter. To satisfy $\tau_{1,2} \in (0, 1]$, the upper bounds of $w_{1,2}$, $\overline{w}_{1,2}$, are employed.

\overline{w}_1 = \frac{\tilde{\nu}_i + 1}{\tilde{\nu}_i}, \quad \overline{w}_2 = \max(\overline{w}_1 - \ln(\overline{w}_1), 87.3365) \qquad (9)

where $87.3365$ is the negative logarithm of the tiny (smallest positive normal) number of float32.
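This constant can be checked numerically with an illustrative one-liner (not from the paper):

```python
import numpy as np
# Negative log of the smallest positive normal float32 value.
print(-np.log(np.finfo(np.float32).tiny))  # approximately 87.3365
```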

The update amounts for $\bar{\theta}_i$, $\sigma_i^2$, and $\tilde{\nu}_i$ are respectively given as follows:

\bar{\theta}_i' = \theta_i \qquad (10)
(\sigma_i^2)' = (\theta_i - \bar{\theta}_i)^2 + \max\left(\epsilon^2, \left((\theta_i - \bar{\theta}_i)^2 - D_i\sigma_i^2\right)\tilde{\nu}_i^{-1}\right) \qquad (11)
\tilde{\nu}_i' = \left(\frac{\tilde{\nu}_i + 2}{\tilde{\nu}_i + 1} + \tilde{\nu}_i\right)\frac{\tilde{\nu}_i - \underline{\tilde{\nu}}}{\tilde{\nu}_i w_2} + \underline{\tilde{\nu}} + \epsilon \qquad (12)

where $\epsilon \ll 1$ denotes a small value for stabilizing the computation and $\underline{\tilde{\nu}}$ denotes the lower bound of $\tilde{\nu}$ (i.e. the maximum noise robustness) given as a hyperparameter.

Using the update ratios and the update amounts obtained above, $\bar{\theta}_i$, $\sigma_i^2$, and $\tilde{\nu}_i$ can be updated.

\bar{\theta}_i \leftarrow (1 - \tau_1)\bar{\theta}_i + \tau_1\bar{\theta}_i' \qquad (13)
\sigma_i^2 \leftarrow (1 - \tau_1)\sigma_i^2 + \tau_1(\sigma_i^2)' \qquad (14)
\tilde{\nu}_i \leftarrow (1 - \tau_2)\tilde{\nu}_i + \tau_2\tilde{\nu}_i' \qquad (15)

As a result, AT-soft update enables the parameter set $\bar{\theta}_i$ of the target network to be updated adaptively (i.e. depending on the deviation of $\theta_i$ from $\mathcal{T}$), while automatically tuning the noise robustness represented by $\sigma_i^2$ and $\tilde{\nu}_i$.
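Putting eqs. (6)–(15) together, one AT-soft step for a single parameter subset can be sketched in NumPy as below; variable names follow the text, and the returned $w_1$ and $\overline{w}_1$ are kept for the consolidation introduced in the next subsection. This is a sketch under the stated equations, not the authors' released implementation.

```python
import numpy as np

def at_soft_update(theta_bar, theta, sigma2, nu_tilde,
                   tau=0.1, nu_lower=1.0, eps=1e-5):
    """Sketch of AT-soft update (eqs. (6)-(15)) for one parameter subset."""
    diff2 = (theta - theta_bar) ** 2
    D = np.mean(diff2 / sigma2)                                   # pseudo-distance D_i
    w1 = (nu_tilde + 1.0) / (nu_tilde + D)                        # eq. (6)
    w2 = w1 - np.log(w1)                                          # eq. (7)
    w1_bar = (nu_tilde + 1.0) / nu_tilde                          # eq. (9)
    w2_bar = max(w1_bar - np.log(w1_bar), 87.3365)
    tau1, tau2 = tau * w1 / w1_bar, tau * w2 / w2_bar             # eq. (8)
    sigma2_new = diff2 + np.maximum(eps ** 2,
                                    (diff2 - D * sigma2) / nu_tilde)      # eq. (11)
    nu_new = ((nu_tilde + 2.0) / (nu_tilde + 1.0) + nu_tilde) \
             * (nu_tilde - nu_lower) / (nu_tilde * w2) + nu_lower + eps   # eq. (12)
    theta_bar = (1.0 - tau1) * theta_bar + tau1 * theta           # eq. (13) with (10)
    sigma2 = (1.0 - tau1) * sigma2 + tau1 * sigma2_new            # eq. (14)
    nu_tilde = (1.0 - tau2) * nu_tilde + tau2 * nu_new            # eq. (15)
    return theta_bar, sigma2, nu_tilde, w1, w1_bar
```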

III-B Consolidation from main to target networks

However, if $\tau_1 \simeq 0$ continues in the above update, $\theta_i$ will gradually deviate from $\bar{\theta}_i$, and the target network will no longer be able to generate appropriate pseudo-supervised signals since the assumption $\bar{\theta}_i \simeq \theta_i$ is broken. In such a case, parts of $\theta_i$ would be updated in a wrong direction by the minimization of eqs. (1)–(3), causing $\tau_1 \simeq 0$ as if they were outliers. Hence, to stop this fruitless judgement and restart the appropriate updates, reverting $\theta_i$ to $\bar{\theta}_i$ and holding $\theta_i \simeq \bar{\theta}_i$ would be a natural and effective way. To this end, a heuristic consolidation is designed as below.

Specifically, the ratio for consolidating the main network toward the target network, $\tau_c$, is designed to be larger when the update ratio of $\bar{\theta}_i$ is smaller (i.e. when $\theta_i$ deviates from $\mathcal{T}$).

\tau_c = \lambda\tau\left(1 - \frac{w_1}{\overline{w}_1}\right) \qquad (16)

where $\lambda \in [0, 1]$ adjusts the strength of this consolidation, which should be the same as or weaker than the update speed of the target network.

Next, since consolidating all of $\theta_i$ would interfere with learning, the consolidated subset of outliers, $\theta_i^c$, should be extracted. A simple and popular way is to use the $q$-th quantile $Q$ with $q \in [0, 1]$. Since the components that contribute to making $w_1$ small are those with large $(\theta_{i,j} - \bar{\theta}_{i,j})^2\sigma_{i,j}^{-2} =: \Delta_{i,j}$, $\theta_i^c$ is defined as follows:

\theta_i^c = \left\{\theta_{i,j} \in \theta_i \mid \Delta_{i,j} \geq Q(\Delta_i; q)\right\} \qquad (17)

Thus, the following update formula consolidates $\theta_i^c$ toward the corresponding subset of the target network, $\bar{\theta}_i^c$.

\theta_i^c \leftarrow (1 - \tau_c)\theta_i^c + \tau_c\bar{\theta}_i^c \qquad (18)

A rough sketch of this consolidation is shown in Fig. 1. Although loss-function-based consolidations, as proposed in the context of continual learning [21, 22], would also be possible, a more convenient implementation with lower computational cost was employed.
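The interpolation in eqs. (16)–(18) can be sketched as follows, reusing $w_1$, $\overline{w}_1$, and the elementwise $\Delta_{i,j}$ from the AT-soft step; `np.quantile` provides the $q$-th quantile threshold. This is again a hedged sketch, not the paper's released implementation.

```python
import numpy as np

def consolidate(theta, theta_bar, delta, w1, w1_bar, tau=0.1, lam=1.0, q=1.0):
    """Sketch of eqs. (16)-(18): pull the most deviating main parameters
    back toward the target network when the target update was suppressed.
    delta is the elementwise Delta_{i,j} = (theta - theta_bar)**2 / sigma2."""
    tau_c = lam * tau * (1.0 - w1 / w1_bar)             # eq. (16)
    mask = delta >= np.quantile(delta, q)               # eq. (17): outlier subset
    return np.where(mask, (1.0 - tau_c) * theta + tau_c * theta_bar, theta)  # eq. (18)
```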

The pseudo-code for the consolidated adaptive T-soft (CAT-soft) update is summarized in Alg. 2. Note that, although $\underline{\tilde{\nu}}$ must be specified as a new hyperparameter in (C)AT-soft update, the $\nu$ specified in T-soft update is already set conservatively for noise and can be inherited (i.e. $\underline{\tilde{\nu}} = 1$). Therefore, the additional hyperparameters to be tuned are $\lambda$ and $q$. $q$ can be given as $q \simeq 1$ so that only a few parameters in the $i$-th subset are consolidated without interfering with learning. $\lambda$ can be given as the inverse of the number of parameters to be consolidated. In other words, we can decide whether to make $q$ closer to $1$ and consolidate fewer parameters tightly, or make $q$ smaller and consolidate more parameters slightly.

Figure 1: Rough sketches of the proposed consolidation: when $\theta_i$ is not far from $\mathcal{T}$ as in (a), almost no consolidation works; when parts of $\theta_i$ are far from $\mathcal{T}$ as in (b), only $\theta_i^c$, which leads to $w_1 \simeq 0$, is consolidated to the target network.
Algorithm 2 CAT-soft update
1: (Initialize $\bar{\theta}_i = \theta_i$, $\sigma_i = \epsilon$, $\tilde{\nu}_i = \underline{\tilde{\nu}}$)
2: for $\theta_i, \bar{\theta}_i \subset \theta_{V,\pi}, \bar{\theta}_{V,\pi}$ do
3:     $\Delta_i = (\theta_i - \bar{\theta}_i)^2\sigma_i^{-2}$
4:     $D_i = d_i^{-1}\sum_{j=1}^{d_i}\Delta_{i,j}$
5:     $w_1 = (\tilde{\nu}_i + 1)(\tilde{\nu}_i + D_i)^{-1}$, $w_2 = w_1 - \ln(w_1)$
6:     $\overline{w}_1 = (\tilde{\nu}_i + 1)\tilde{\nu}_i^{-1}$, $\overline{w}_2 = \max(\overline{w}_1 - \ln(\overline{w}_1), 87.3365)$
7:     $\tau_{1,2} = \tau w_{1,2}\overline{w}_{1,2}^{-1}$
8:     $(\sigma_i^2)' = \Delta_i\sigma_i^2 + \max(\epsilon^2, (\Delta_i - D_i)\sigma_i^2\tilde{\nu}_i^{-1})$
9:     $\tilde{\nu}_i' = \{1 + (\tilde{\nu}_i + 1)^{-1} + \tilde{\nu}_i\}(\tilde{\nu}_i - \underline{\tilde{\nu}})(\tilde{\nu}_i w_2)^{-1} + \underline{\tilde{\nu}} + \epsilon$
10:     $\bar{\theta}_i \leftarrow (1 - \tau_1)\bar{\theta}_i + \tau_1\theta_i$
11:     $\sigma_i^2 \leftarrow (1 - \tau_1)\sigma_i^2 + \tau_1(\sigma_i^2)'$
12:     $\tilde{\nu}_i \leftarrow (1 - \tau_2)\tilde{\nu}_i + \tau_2\tilde{\nu}_i'$
13:     if Consolidation then
14:         $\tau_c = \lambda\tau(1 - w_1\overline{w}_1^{-1})$
15:         $\theta_i^c = \{\theta_{i,j} \in \theta_i \mid \Delta_{i,j} \geq Q(\Delta_i; q)\}$
16:         $\theta_i^c \leftarrow (1 - \tau_c)\theta_i^c + \tau_c\bar{\theta}_i^c$
17:     end if
18: end for
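In practice, Alg. 2 would be applied to every parameter subset (e.g. each weight matrix and bias vector) after each gradient step. A hedged usage sketch, assuming the `at_soft_update` and `consolidate` functions sketched above and dictionaries of per-subset internal states, could look as follows:

```python
# Hedged usage sketch: main_params / target_params are assumed dictionaries of
# NumPy arrays keyed by layer name; sigma2 / nu_tilde hold the per-subset states.
for name, theta in main_params.items():
    theta_bar = target_params[name]
    delta = (theta - theta_bar) ** 2 / sigma2[name]               # Alg. 2, line 3
    theta_bar, sigma2[name], nu_tilde[name], w1, w1_bar = at_soft_update(
        theta_bar, theta, sigma2[name], nu_tilde[name], tau=0.1, nu_lower=1.0)
    main_params[name] = consolidate(theta, theta_bar, delta, w1, w1_bar,
                                    tau=0.1, lam=1.0, q=1.0)      # lines 13-16
    target_params[name] = theta_bar
```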

IV Simulations

TABLE I: Hyperparameters for the DRL algorithms used
Symbol | Meaning | Value
$L$ | #Hidden layers | 2
$N$ | #Neurons for each layer | 100
$\gamma$ | Discount factor | 0.99
$(\alpha, \beta, \epsilon, \underline{\tilde{\nu}})$ | For AdaTerm [17] | $(10^{-3}, 0.9, 10^{-5}, 1)$
$(\kappa, \beta, \lambda, \underline{\Delta})$ | For PPO-RPE [12] | $(0.5, 0.5, 0.999, 0.1)$
$(N_c, N_b, \alpha, \beta)$ | For PER [19] | $(10^4, 32, 1.0, 0.5)$
$(\sigma, \underline{\lambda}, \overline{\lambda}, \beta)$ | For L2C2 [23] | $(1, 0.01, 1, 0.1)$

IV-A Setup

For the statistical verification of the proposed method, the following simulations are conducted. As the simulation environment, Pybullet [18] with OpenAI Gym [24] is employed. From it, InvertedDoublePendulumBulletEnv-v0 (DoublePendulum), HopperBulletEnv-v0 (Hopper), and AntBulletEnv-v0 (Ant) are chosen as tasks. To make the tasks harder, the observations from them are corrupted by white noise with a scale of 0.001. With 18 random seeds, each method attempts to accomplish each task. After training, the learned policy is run 100 times to evaluate the sum of rewards as a score (larger is better).

The implementation of the base network architecture and DRL algorithm is basically the same as in the literature [23]. However, it is worth noting that the stochastic policy function is modeled by a student-t distribution for conservative learning and efficient exploration [14], instead of a normal distribution. Hyperparameters for the implementation are summarized in Table I. Note that the learning difficulty is higher because the specified learning rate is higher than the value usually suitable for DRL, revealing the effectiveness of the target network for stabilizing learning.

The following three methods are compared.

  • T-soft update: $(\tau = 0.1, \nu = 1)$

  • AT-soft update: $(\tau = 0.1, \underline{\tilde{\nu}} = 1)$

  • CAT-soft update: $(\tau = 0.1, \underline{\tilde{\nu}} = 1, \lambda = 1, q = 1)$

Here, $q$ is designed to consolidate only one parameter in each subset as the simplest implementation. Correspondingly, $\lambda$ is given as the inverse of the (maximum) number of consolidated parameters. $\tau = 0.1$ is set smaller than the value in the literature [15], in order to counteract the negative effects of the high learning rate (i.e. $10^{-3}$) set above.

Figure 2: Learning curves for the benchmarks ((a) DoublePendulum, (b) Hopper, (c) Ant): the upper row shows the mean of the deviation between the main and target networks, $|\theta - \bar{\theta}|$; the lower row plots the mean of $1 - w_1/\overline{w}_1$, which corresponds to the noise robustness (or the magnitude of consolidation); compared to T-soft update, AT-soft and CAT-soft updates have larger deviations, but this was suppressed by the consolidation in the early stages of learning; as learning progressed, the cases where updates of the target network were suppressed due to noise decreased and the consolidation was relaxed, resulting in AT-soft and CAT-soft updates converging to roughly the same degree of deviation.
TABLE II: The sum of rewards after training
Method | DoublePendulum | Hopper | Ant
T-soft | 6427.1 ± 3357.8 | 1852.8 ± 900.9 | 2683.8 ± 249.3
AT-soft | 6379.7 ± 3299.7 | 1662.7 ± 897.4 | 2764.1 ± 265.5
CAT-soft | 7129.2 ± 2946.0 | 1971.2 ± 812.9 | 2760.0 ± 312.2

IV-B Result

The learning behaviors are depicted in Fig. 2. As pointed out, the deviations under (C)AT-soft update were larger than that under the conventional T-soft update, since (C)AT-soft update has better outlier and noise detection performance and the target network updates are more easily suppressed. This was pronounced in the early stage of training, when the noise robustness is high and the update of the main network is unstable. However, CAT-soft update suppressed the deviation in this early stage. As learning progressed, CAT-soft update converged to roughly the same level of deviation as AT-soft update because the consolidation was relaxed along with the weakened noise robustness.

The scores of 100 runs after learning are summarized in Table II. AT-soft update slightly improved the performance over T-soft update on Ant, but decreased it on Hopper. In contrast, CAT-soft update outperformed T-soft update in all tasks.

IV-C Demonstration

TABLE III: Arguments for MinitaurBulletDuckEnv-v0
Argument | Default | Modified
motor_velocity_limit | $\infty$ | 100
pd_control_enabled | False | True
accurate_motor_model_enabled | True | False
action_repeat | 1 | 4
observation_noise_stdev | 0 | $10^{-5}$
hard_reset | True | False
env_randomizer | Uniform | None
distance_weight | 1 | 100
energy_weight | 0.005 | 1
shake_weight | 0 | 1
Figure 3: Demonstration results ((a) learning curves, (b) test results): CAT-soft update showed a more stable learning curve than that of T-soft update and reached the success level of the task in the final performance.

As a demonstration, a simulation closer to a real robot experiment, MinitaurBulletDuckEnv-v0 (Minitaur) in Pybullet, is tried. The task is to move a duck placed on top of a Ghost Minitaur, a quadruped robot developed by Ghost Robotics. Since the duck is not fixed, careful locomotion is required; moreover, its states (e.g. position) are unobserved, making this task a partially observable MDP (POMDP). Note that the default setting for Minitaur tasks is unrealistic, as pointed out in the literature [25]. Therefore, it was modified as shown in Table III (arguments not listed are left at their defaults).

T-soft and CAT-soft updates are compared under the same conditions as in the above simulations. The learning curves of the scores for 8 trials and the test results of the trained policies are depicted in Fig. 3. The best behaviors on the tests can be found in the attached video. As can be seen from Fig. 3, only the proposed CAT-soft update was able to acquire the successful cases of the task (walking without dropping the duck). Thus, it is suggested that CAT-soft update can contribute to the success of the task by steadily improving the learning performance even for more practical tasks.

V Conclusion

This paper proposed a new update rule for the target network, CAT-soft update, which stabilizes DRL. In order to adaptively adjust the noise robustness, the update rule inspired by AdaTerm, which has been developed recently, was derived. In addition, a heuristic consolidation from the main to target networks was developed to suppress the deviation between them, which may occur when updates are continuously limited due to noise. The developed CAT-soft update was tested on the DRL benchmark tasks, and succeeded in improving and stabilizing the learning performance over the conventional T-soft update.

Strictly speaking, the target network should be kept close to the main network in terms of its outputs rather than its parameters. A new consolidation and a noise-robust update rule based on the output space are expected to contribute to further performance improvements. These efforts to stabilize DRL will lead to its practical use in the near future.

ACKNOWLEDGMENT

This work was supported by JSPS KAKENHI, Grant-in-Aid for Scientific Research (B), Grant Number JP20H04265.

References

  • [1] H. Modares, I. Ranatunga, F. L. Lewis, and D. O. Popa, “Optimized assistive human–robot interaction using reinforcement learning,” IEEE transactions on cybernetics, vol. 46, no. 3, pp. 655–667, 2015.
  • [2] T. Kobayashi, E. Dean-Leon, J. R. Guadarrama-Olvera, F. Bergner, and G. Cheng, “Whole-body multicontact haptic human–humanoid interaction based on leader–follower switching: A robot dance of the “box step”,” Advanced Intelligent Systems, p. 2100038, 2021.
  • [3] T. Kobayashi, T. Aoyama, K. Sekiyama, and T. Fukuda, “Selection algorithm for locomotion based on the evaluation of falling risk,” IEEE Transactions on Robotics, vol. 31, no. 3, pp. 750–765, 2015.
  • [4] J. Delmerico, S. Mintchev, A. Giusti, B. Gromov, K. Melo, T. Horvat, C. Cadena, M. Hutter, A. Ijspeert, D. Floreano et al., “The current state and future outlook of rescue robotics,” Journal of Field Robotics, vol. 36, no. 7, pp. 1171–1191, 2019.
  • [5] Y. Tsurumine, Y. Cui, E. Uchibe, and T. Matsubara, “Deep reinforcement learning with smooth policy update: Application to robotic cloth manipulation,” Robotics and Autonomous Systems, vol. 112, pp. 72–83, 2019.
  • [6] O. Kroemer, S. Niekum, and G. Konidaris, “A review of robot learning for manipulation: Challenges, representations, and algorithms,” Journal of Machine Learning Research, vol. 22, no. 30, pp. 1–82, 2021.
  • [7] R. S. Sutton and A. G. Barto, Reinforcement learning: An introduction.   MIT press, 2018.
  • [8] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski et al., “Human-level control through deep reinforcement learning,” nature, vol. 518, no. 7540, pp. 529–533, 2015.
  • [9] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine, “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor,” in International conference on machine learning.   PMLR, 2018, pp. 1861–1870.
  • [10] K. Chua, R. Calandra, R. McAllister, and S. Levine, “Deep reinforcement learning in a handful of trials using probabilistic dynamics models,” in Advances in Neural Information Processing Systems, 2018, pp. 4754–4765.
  • [11] M. Okada, N. Kosaka, and T. Taniguchi, “Planet of the bayesians: Reconsidering and improving deep planning network by incorporating bayesian inference,” in IEEE/RSJ International Conference on Intelligent Robots and Systems.   IEEE, 2020, pp. 5611–5618.
  • [12] T. Kobayashi, “Proximal policy optimization with relative pearson divergence,” in IEEE international conference on robotics and automation.   IEEE, 2021, pp. 8416–8421.
  • [13] I. Osband, C. Blundell, A. Pritzel, and B. Van Roy, “Deep exploration via bootstrapped dqn,” Advances in neural information processing systems, vol. 29, pp. 4026–4034, 2016.
  • [14] T. Kobayashi, “Student-t policy in reinforcement learning to acquire global optimum of robot control,” Applied Intelligence, vol. 49, no. 12, pp. 4335–4347, 2019.
  • [15] T. Kobayashi and W. E. L. Ilboudo, “t-soft update of target network for deep reinforcement learning,” Neural Networks, 2021.
  • [16] S. Kim, K. Asadi, M. Littman, and G. Konidaris, “Deepmellow: removing the need for a target network in deep q-learning,” in International Joint Conference on Artificial Intelligence.   AAAI Press, 2019, pp. 2733–2739.
  • [17] W. E. L. Ilboudo, T. Kobayashi, and K. Sugimoto, “Adaterm: Adaptive t-distribution estimated robust moments towards noise-robust stochastic gradient optimizer,” arXiv preprint arXiv:2201.06714, 2022.
  • [18] E. Coumans and Y. Bai, “Pybullet, a python module for physics simulation for games, robotics and machine learning,” GitHub repository, 2016.
  • [19] T. Schaul, J. Quan, I. Antonoglou, and D. Silver, “Prioritized experience replay,” arXiv preprint arXiv:1511.05952, 2015.
  • [20] J. C. Bezdek and R. J. Hathaway, “Convergence of alternating optimization,” Neural, Parallel & Scientific Computations, vol. 11, no. 4, pp. 351–368, 2003.
  • [21] J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska et al., “Overcoming catastrophic forgetting in neural networks,” Proceedings of the national academy of sciences, vol. 114, no. 13, pp. 3521–3526, 2017.
  • [22] F. Zenke, B. Poole, and S. Ganguli, “Continual learning through synaptic intelligence,” in International Conference on Machine Learning.   PMLR, 2017, pp. 3987–3995.
  • [23] T. Kobayashi, “L2c2: Locally lipschitz continuous constraint towards stable and smooth reinforcement learning,” arXiv preprint arXiv:2202.07152, 2022.
  • [24] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, and W. Zaremba, “Openai gym,” arXiv preprint arXiv:1606.01540, 2016.
  • [25] T. Kobayashi, “Optimistic reinforcement learning by forward kullback-leibler divergence optimization,” arXiv preprint arXiv:2105.12991, 2021.