
AdaTerm: Adaptive T-Distribution Estimated Robust Moments for Noise-Robust Stochastic Gradient Optimization

Wendyam Eric Lionel Ilboudo ([email protected]), Taisuke Kobayashi ([email protected]), Takamitsu Matsubara ([email protected])
Abstract

With the increasing practicality of deep learning applications, practitioners are inevitably faced with datasets corrupted by noise from various sources such as measurement errors, mislabeling, and estimated surrogate inputs/outputs that can adversely impact the optimization results. It is common practice to improve the optimization algorithm's robustness to noise, since this algorithm is ultimately in charge of updating the network parameters. Previous studies revealed that the first-order moment used in Adam-like stochastic gradient descent optimizers can be modified based on the Student's t-distribution. While this modification led to noise-resistant updates, the other associated statistics remained unchanged, resulting in inconsistencies in the assumed models. In this paper, we propose AdaTerm, a novel approach that incorporates the Student's t-distribution to derive not only the first-order moment but also all the associated statistics. This provides a unified treatment of the optimization process, offering, for the first time, a comprehensive framework under the statistical model of the t-distribution. The proposed approach offers several advantages over previously proposed approaches, including reduced hyperparameters and improved robustness and adaptability. AdaTerm achieves this by considering the interdependence of gradient dimensions. In particular, upon detecting aberrant gradients, AdaTerm excludes them from the update process and enhances its robustness for subsequent updates. Conversely, it performs normal parameter updates when the gradients are statistically valid, allowing for flexibility in adapting its robustness. This noise-adaptive behavior contributes to AdaTerm's exceptional learning performance, as demonstrated through various optimization problems with different and/or unknown noise ratios. Furthermore, we introduce a new technique for deriving a theoretical regret bound without relying on AMSGrad, providing a valuable contribution to the field.

Keywords: Robust optimization, Stochastic gradient descent, Deep neural networks, Student's t-distribution

Journal: Neurocomputing

Affiliations: [1] Nara Institute of Science and Technology, Nara, Japan; [2] National Institute of Informatics / The Graduate University for Advanced Studies, SOKENDAI, Tokyo, Japan

1 Introduction

Deep learning [1] has emerged as a highly successful technology in recent years. A significant factor contributing to its success is the widespread adoption of stochastic gradient descent (SGD) algorithms [2] for optimizing deep neural networks. These algorithms utilize first-order gradients computed through the back-propagation technique to update the network parameters, thereby addressing the optimization problem at hand. Therefore, in tandem with the advancement of diverse network architectures such as residual networks [3], various SGD-based optimizers that can effectively and reliably optimize these networks have been pursued. The most representative optimizer is Adam [4], and diverse variants with their respective features have been proposed (see the surveys [5, 6] for details). Among these, RAdam [7] and AdaBelief [8] have demonstrated state-of-the-art (SOTA) learning performance, to the best of our knowledge.

One important feature of SGD optimizers is their robustness to gradient noise, which generally results from the use of noisy datasets containing sensory and mislabeling errors [9, 10]. Moreover, optimization problems that require the use of estimated inputs and/or outputs, such as long-term dynamics learning [11, 12], reinforcement learning (RL) [13], and agents trained by distillation from trained teacher(s) [14, 15], also suffer from the noise caused by estimation errors. This feature becomes particularly crucial in robot learning tasks, where the available datasets may be limited in size, amplifying the adverse effects of noise. Additionally, empirical evidence from [16] and [17] has demonstrated that both Adam and SGD exhibit heavy-tailed gradient noise, even in the absence of input/output noise.

To address such noise and robustly conduct efficient gradient updates, previous studies have proposed the detection and exclusion of the aberrant gradients affected by noise. In particular, recognizing that the conventional exponential moving average (EMA) momentum used in recent Adam-like optimizers is sensitive to noise owing to its association with a Gaussian distribution, [18] and [19] introduced a novel momentum algorithm called t-momentum that is derived from the Student’s t-distribution. However, the t-momentum approach lacks a unified derivation of all its algorithm components, as described in section 3, leading to certain limitations in its application.

This paper presents a novel Adaptive T-distribution estimated robust moments algorithm, AdaTerm (code available at https://github.com/kbys-t/adaterm), that addresses the limitations of previous approaches by offering a unified derivation of all the distribution parameters, based on their gradients, for an online maximum likelihood estimation of the Student's t-distribution. AdaTerm introduces adaptive step sizes, transforming the gradient-based update of the statistics into an interpolation between past statistics values and the update amounts. Such adaptive step sizes allow a common smoothness parameter $\beta$ to be shared by all the involved statistics. This is enabled by AdaTerm's comprehensive integration of the diagonal Student's t-distribution, allowing the scale estimator to involve every entry and avoiding the independent estimation of scale entries for each variable as in the diagonal Gaussian distribution. In addition, AdaTerm appropriately approximates the gradient of the degrees-of-freedom in the multi-dimensional case based on the behavior observed in the one-dimensional case (since it has been reported that the degrees-of-freedom is unintuitively small in multiple dimensions; see details in [20]).

In order to provide a theoretical basis for the convergence of optimization algorithms, including the proposed AdaTerm, the derivation of a regret bound is required. However, since the introduction of AMSGrad [21], the literature has assumed its usage, even for optimizers that do not practically employ AMSGrad. To address this theoretical and practical contradiction, this paper introduces a new approach for deriving a regret bound without relying on the usage of AMSGrad. This approach can be applied to other optimizers as well, providing a more consistent framework for analyzing the convergence properties of various algorithms. This allows for a more accurate theoretical understanding of the convergence behavior of optimizers, including the proposed AdaTerm.

To summarize, this study presents four-fold contributions:

1. Unified derivation of a novel SGD algorithm that is adaptively robust to noise;

2. Alleviation of the difficulty of tuning multiple hyper-parameters for each statistic ($\beta_1$, $\beta_2$, and $\beta_z$) through replacement with a common and unique smoothness parameter $\beta$ for all the distribution parameters;

3. Theoretical proof of the regret bound without incorporating AMSGrad;

4. Numerical verification of efficacy on major test functions and typical problems such as classification with mislabeling, long-term prediction, reinforcement learning, and policy distillation.

In the final verification, we compared AdaTerm not only with AdaBelief and RAdam as state-of-the-art algorithms but also with the t-Adam variants developed in related studies.

2 Problem statement

2.1 Optimization problem

Here, we briefly define the optimization (or, without loss of generality, minimization) problem to be solved by an SGD optimizer. Suppose that input data $x$ and output data $y$ are generated according to a problem-specific (stochastic) rule, $p(x,y)$. Accordingly, the problem-specific minimization target, $\mathcal{L}$, is given as follows:

$$\mathcal{L} = \mathbb{E}_{x,y \sim p(x,y)}[\ell(f(x;\theta), y)] \quad (1)$$

where $\ell$ denotes the loss function for each data point, and $f(x;\theta)$ denotes the mapping function (e.g., from $x$ to $y$) with the parameter set $\theta$ that is optimized through this minimization (e.g., network weights and biases).

The above expectation can be approximated by the Monte Carlo method. In other words, consider a dataset containing $N$ pairs of $(x,y)$, $\mathcal{D} = \{(x_n, y_n)\}_{n=1}^{N}$, constructed according to the problem-specific rule. Based on $\mathcal{D}$, the above minimization target is replaced as follows:

$$\mathcal{L}_{\mathcal{D}} = \frac{1}{|\mathcal{D}|} \sum_{(x_n, y_n) \in \mathcal{D}} \ell(f(x_n;\theta), y_n) \quad (2)$$

where $|\mathcal{D}|$ denotes the size of the dataset ($N$ in this case).
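For illustration, the empirical objective in eq. (2) is simply the dataset average of the per-sample losses. A minimal NumPy sketch (where `model`, `loss_fn`, and `dataset` are hypothetical placeholders, not symbols from this paper):

```python
import numpy as np

def empirical_loss(model, loss_fn, dataset, theta):
    # Monte-Carlo approximation of eq. (1) by the dataset average of eq. (2)
    return np.mean([loss_fn(model(x, theta), y) for (x, y) in dataset])
```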

2.2 SGD optimizer

To solve the above optimization problem, we can compute the gradient with respect to $\theta$, $g = \nabla_\theta \mathcal{L}_{\mathcal{D}}$, and use gradient descent to obtain the (sub)optimal $\theta$ that (locally) minimizes $\mathcal{L}_{\mathcal{D}}$. However, if $\mathcal{D}$ is large, this gradient computation becomes infeasible due to the limitation of computational resources. Therefore, the SGD optimizer [2] extracts a subset (also known as a mini-batch) at each update step $t$, $\mathcal{B}_t \subset \mathcal{D}$, and updates $\theta$ from $\theta_{t-1}$ to $\theta_t$ as follows:

$$g_t = \nabla_{\theta_{t-1}} \mathcal{L}_{\mathcal{B}_t} \quad (3)$$
$$\theta_t = \theta_{t-1} - \alpha \eta(g_t) \quad (4)$$

where $\alpha > 0$ denotes the learning rate, and $\eta$ represents a function used to modify $g_t$ to improve the learning performance. In other words, each SGD-based optimizer has its own $\eta(g_t)$.
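In code, eqs. (3)-(4) amount to the following generic loop; this is a sketch in which `grad_fn` is an assumed helper evaluating eq. (3) on a mini-batch, and `eta` defaults to the identity (plain SGD):

```python
import numpy as np

def sgd_update(theta, grad_fn, batch, alpha=1e-3, eta=lambda g: g):
    g = grad_fn(theta, batch)       # eq. (3): mini-batch gradient
    return theta - alpha * eta(g)   # eq. (4): optimizer-specific modification eta
```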

For example, in the case of Adam [4], the most popular optimizer in recent years, $\eta$ has three hyper-parameters, $\beta_1 \in (0,1)$, $\beta_2 \in (0,1)$, and $\epsilon \ll 1$, and is defined by:

$$\eta^{\mathrm{Adam}}(g_t) = \frac{m_t (1-\beta_1^t)^{-1}}{\sqrt{v_t (1-\beta_2^t)^{-1}} + \epsilon} \quad (5)$$

where

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t = m_{t-1} + (1-\beta_1)(g_t - m_{t-1}) \quad (6)$$
$$v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2 = v_{t-1} + (1-\beta_2)(g_t^2 - v_{t-1}) \quad (7)$$

where $g_t^2$ is the element-wise square. A simple interpretation of the Adam algorithm is as follows: since $g_t$ fluctuates depending on the sampling of $\mathcal{B}_t$, the first-order moment $m_t$ smooths $g_t$ with the past gradients to stabilize the update, and the second-order moment $v_t$ scales $g_t$ according to the problem, increasing generality without tuning $\alpha$.
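For reference, eqs. (5)-(7) can be sketched as follows; the `state` dictionary holding $m_t$, $v_t$, and the step counter is our implementation choice, not part of the original formulation:

```python
import numpy as np

def adam_eta(g, state, beta1=0.9, beta2=0.999, eps=1e-8):
    state["t"] += 1
    state["m"] += (1.0 - beta1) * (g - state["m"])     # eq. (6): first-order moment
    state["v"] += (1.0 - beta2) * (g**2 - state["v"])  # eq. (7): second-order moment
    m_hat = state["m"] / (1.0 - beta1 ** state["t"])   # bias corrections
    v_hat = state["v"] / (1.0 - beta2 ** state["t"])
    return m_hat / (np.sqrt(v_hat) + eps)              # eq. (5)

# usage:
# state = {"t": 0, "m": np.zeros_like(theta), "v": np.zeros_like(theta)}
# theta = theta - alpha * adam_eta(g, state)  # repeated at each step
```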

If the problem setting described in Section 2.1 is satisfied, we can stably acquire one of the local solutions through any of the SGD optimizers, although their respective regret bounds may differ [21, 22]. However, in real problems, constructing datasets that follow the problem-specific rules exactly (i.e., $\mathcal{D}$ generated from $p(x,y)$) is difficult. In other words, when dealing with real problems, mostly inaccurate datasets $\tilde{\mathcal{D}} \neq \mathcal{D}$, contaminated by improper estimated values, mislabeling errors, etc., are available.

3 Previous work

The ubiquity of imperfect data in practical settings, combined with the inherent noisiness of the stochastic gradient, has encouraged the development of machine learning algorithms that are more robust and efficient on noisy or heavy-tailed datasets. These methods can be divided into two main approaches: methods that produce robust estimates of the loss function, and methods based on the detection and attenuation of incorrect gradient updates [23, 24, 25, 26]. Each approach has its own pros and cons, as summarized in Table 1.

Table 1: Pros/Cons of the two main approaches to regulate robustness

          Robust Loss Estimation                               Robust Gradient Descent
Pros      Robustness independent of batch size                 Widely applicable
Cons      Usually problem-specific                             Robustness dependent on batch size and outlier repartition
          Usually requires the use of all the available data   Relies only on estimates of the true gradient
          Can be both unstable and costly in high dimensions

As mentioned in the Introduction, we proceed with the latter approach in this study, in view of its wide applicability in machine learning. In particular, we extend the use of the Student's t-distribution as a statistical model for the gradients of the optimization process, first proposed by [18]. To explicitly highlight the difference between the previous studies (i.e., [18] and [19]) and our current study, we review the previous methods in this section.

3.1 The t-momentum

3.1.1 Overview

Given a variable $x_t$ (e.g., $g_t$ or $g_t^2$), the popular EMA-based momentum underlying most recent optimization algorithms is defined by $m_t = \beta m_{t-1} + (1-\beta) x_t$, as described in eqs. (6) and (7). This arises from the Gaussian distribution by solving the following equation [18]:

$$\frac{\partial}{\partial m} \sum_{i=1}^{n} \ln \mathcal{N}(x_i \mid m, v) = 0 \quad (8)$$

where $\mathcal{N}(x_i \mid m, v)$ is the diagonal Gaussian distribution with mean $m$ and variance $v$, and $n$ is a fixed number of samples (corresponding to the $n$ most recent observations) defined by $n = \frac{1}{1-\beta}$.

Based on this observation, the classical EMA-based momentum (now regarded as the Gaussian-momentum) was replaced by the t-momentum, $m_t = \frac{W_{t-1}}{W_{t-1}+w_t} m_{t-1} + \frac{w_t}{W_{t-1}+w_t} x_t$, obtained by solving the following equation:

$$\frac{\partial}{\partial m} \sum_{i=1}^{n} \ln \mathcal{T}(x_i \mid m, v, \nu) = 0 \quad (9)$$

where $\mathcal{T}(x_i \mid m, v, \nu)$ is the diagonal Student's t-distribution with mean $m$, diagonal scale $v$, and degrees-of-freedom $\nu$. Subsequently, based on the requirement of a fixed number of samples and the Gaussian limit (as $\nu \to \infty$), the sum $W_t = \sum_{i=1}^{t} w_i$ was replaced by a decaying sum $W_t = \frac{2\beta_1 - 1}{\beta_1} W_{t-1} + w_t$, where $w_i = (\nu + d) / (\nu + \sum_{j=1}^{d} \frac{(x_{i,j} - m_{i-1,j})^2}{v_{i-1,j} + \epsilon})$, with $x_{i,j}$ denoting the $j$-th element of the vector $x_i$. Note that both eqs. (8) and (9) originate from the maximum log-likelihood (MLL) formulation and yield analytical solutions for the mean estimator.
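A minimal sketch of this t-momentum mean update, following the equations above (the scale $v$ and weight sum $W$ are maintained externally; the variable names are ours):

```python
import numpy as np

def t_momentum_mean(x, m, v, W, nu, beta1=0.9, eps=1e-8):
    d = x.size
    w = (nu + d) / (nu + np.sum((x - m) ** 2 / (v + eps)))  # sample weight w_t
    m_new = (W * m + w * x) / (W + w)                       # weighted mean interpolation
    W_new = (2.0 * beta1 - 1.0) / beta1 * W + w             # decaying sum of weights
    return m_new, W_new
```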

3.1.2 Limitations of the t-momentum

Nonetheless, in the previous studies, the second-order moment $v_t$ is still based on the regular EMA, i.e., $v_t = \beta_2 v_{t-1} + (1-\beta_2) s_t$, where $s_t$ is a function of the squared gradient (e.g., $s_t = g_t^2$ for Adam [4] in eq. (7) and $s_t = (g_t - m_t)^2$ for AdaBelief [8]). This is in contrast to its usage in the computation of $w_i$, which falsely assumes that $v_t$ is also derived from the MLL of the Student's t-distribution, resulting in an unnatural blending of two statistical models.

Although one could also integrate the t-momentum with the second-order moment, this would require the computation of two “variances”, one for the gradient $g$ and the other for the squared-gradient function $s$, increasing the overall complexity of the algorithm. Another strategy would be to derive the MLL analytical solution corresponding to the scale of the t-distribution, i.e., $v_t = \beta v_{t-1} + (1-\beta) w_t s_t$, similar to the mean estimator. However, we found this approach to be unstable in practice, as also mentioned in Appendix H, and it can result in NaN errors on tasks such as the long-term prediction task.

Finally, the degrees-of-freedom $\nu$ is treated as a hyper-parameter and kept constant throughout the optimization process. This is likely because the MLL formulation of the Student's t-distribution provides no analytic solution to the degrees-of-freedom estimator. This issue can be connected to the high non-linearity of the distribution with respect to $\nu$. To alleviate this limitation, the subsequent work [19] proposed the application of the direct incremental degrees-of-freedom estimation algorithm developed by [27] to automatically update the degrees-of-freedom, as discussed below.

3.2 The At-momentum

3.2.1 Overview

As stated, in [19], the authors suggested circumventing the difficulties related to the MLL degrees-of-freedom by adopting a different approach and employing an alternative estimation method that does not require the MLL framework. Indeed, the algorithm developed by Aeschliman et al. [27] and employed in the At-momentum is based on an approximation of $\mathbb{E}[\ln \lVert g \rVert^2]$ and $\mathbb{V}[\ln \lVert g \rVert^2]$. Specifically, given a $d$-dimensional variable $g$ following a Student's t-distribution with scale $v = \xi I$, where $\xi$ is a constant and $I$ the identity matrix, we have the following results:

$$\mathbb{E}[\ln \lVert g \rVert^2] = \ln \xi + \ln \nu + \psi\left(\frac{d}{2}\right) - \psi\left(\frac{\nu}{2}\right)$$
$$\mathbb{V}[\ln \lVert g \rVert^2] = \psi_1\left(\frac{\nu}{2}\right) + \psi_1\left(\frac{d}{2}\right)$$

where $\psi(x)$ and $\psi_1(x)$ are the digamma and trigamma functions, respectively, and $\lVert \cdot \rVert$ is the $\ell_2$-norm (or Euclidean norm). By setting $z_i = \ln \lVert g_i \rVert^2$, the degrees-of-freedom $\nu$ can be estimated by solving the following equation:

$$\psi_1\left(\frac{\nu}{2}\right) = \frac{1}{n} \sum_{i=1}^{n} (z_i - \bar{z})^2 - \psi_1\left(\frac{d}{2}\right), \quad \text{with } \bar{z} = \frac{1}{n} \sum_{i=1}^{n} z_i$$

Since this equation cannot be solved directly, an approximation of the trigamma function, $\psi_1(x) \approx \frac{x+1}{x^2}$, is used on the left side, and the estimator for the degrees-of-freedom is expressed as follows:

$$b = \frac{1}{n} \sum_{i=1}^{n} (z_i - \bar{z})^2 - \psi_1\left(\frac{d}{2}\right) \quad (10)$$
$$\hat{\nu} = \frac{1 + \sqrt{1 + 4b}}{b} \quad (11)$$

In the previous work [19], the incremental version of this estimator, obtained by replacing the arithmetic mean and variance of $z$ with EMAs governed by $\beta_z$, was employed to derive the At-momentum.
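A sketch of this incremental estimator (with `z_mean` and `z_var` denoting the EMA statistics that replace the arithmetic mean and variance; the names are ours) might read:

```python
import numpy as np
from scipy.special import polygamma  # polygamma(1, x) is the trigamma function

def at_dof_estimate(g, z_mean, z_var, beta_z=0.999):
    d = g.size
    z = np.log(np.dot(g, g))                               # z = ln ||g||^2
    z_mean += (1.0 - beta_z) * (z - z_mean)                # EMA replacing the mean of z
    z_var += (1.0 - beta_z) * ((z - z_mean) ** 2 - z_var)  # EMA replacing the variance
    b = z_var - polygamma(1, d / 2.0)                      # eq. (10)
    nu_hat = (1.0 + np.sqrt(1.0 + 4.0 * b)) / b            # eq. (11); b must be positive
    return nu_hat, z_mean, z_var
```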

3.2.2 Limitations of At-momentum

Notably, the direct incremental degrees-of-freedom estimation algorithm was originally developed to work with specific estimation methods for the location and scale parameters. This again introduces false assumptions: the degrees-of-freedom estimator assumes that the mean and scale are based on the principles developed by Aeschliman et al. [27], while they are actually given by the MLL of the Student's t-distribution and of the Gaussian distribution, respectively. In the experiment section below, we show that this inconsistency causes insufficient adaptability of the robustness to noise.

3.3 Contributions

Table 2: Summary of differences between AdaTerm and previous studies

                       t-momentum             At-momentum                       AdaTerm
Origin of $m$          Student's t MLL        Student's t MLL                   Student's t MLL
Origin of $v$          Gaussian MLL           Gaussian MLL                      Student's t MLL
Origin of $\nu$        Fixed                  Aeschliman et al. [27]            Student's t MLL
MLL solution           Analytic               Analytic                          Sequential w/ custom step sizes
Decay factors          $\beta_1$, $\beta_2$   $\beta_1$, $\beta_2$, $\beta_z$   $\beta$

As explained in the previous paragraphs, the discrepancy in statistical model between the first- and second-order moments and the introduction of a different framework to estimate the degrees-of-freedom create a chain of equations with differing assumptions about how their defining parameters are obtained. This afflicts the optimization algorithm with certain limitations (e.g., limited robustness and limited adaptability).

In this study, we tackle this lack of a unified approach by estimating all the parameters through the unified MLL estimation of the t-distribution. In the previous studies, the Student's t-based EMA of the first-order moment was derived by relying on the analytic solution of the MLL (i.e., by setting the gradient of the log-likelihood to zero and solving explicitly). However, this approach is ill-suited for obtaining the scale $v_t$ and the degrees-of-freedom $\nu$, owing to its practical inefficiency and the absence of a closed-form solution, respectively.

To meet the unification requirement, we propose replacing the direct MLL solution of the previous studies with a sequential solution based on a gradient ascent algorithm for all the parameters of the Student's t-model. This new approach has two main advantages: (i) it allows the derivation of an update rule for the degrees-of-freedom, which has no closed-form solution when one attempts to set its gradient to zero; and (ii) it allows for the use of customized step sizes that stabilize the components of the algorithm, avoiding the instability of the analytic solution to the scale $v$ estimator (i.e., the appearance of NaN values, as mentioned in Appendix H). Based on the properties of the t-distribution and the success of the EMA approach for well-behaved datasets, we derive suitable adaptive step sizes that allow for smooth updates. In addition, further modifications ensure that the algorithm still recovers the EMA in the Gaussian limit and does not violate the positiveness of the scale and degrees-of-freedom parameters. As another byproduct of this unified approach, we are able to employ a single decay factor $\beta$ for all the parameters, eliminating the requirement for the usual two distinct decay factors ($\beta_1$ and $\beta_2$).

The differences between our algorithm and the previous studies are summarized in Table 2, and the details of the derivation are given in the next section.

4 Derivation of AdaTerm

4.1 The Student’s t-distribution statistical model

To robustly estimate the true average gradient from the disturbed gradients, we model the distribution of these gradients $g$ by the Student's t-distribution, which has a heavy tail and is robust to outliers (as shown in Fig. 1), referring to the related work [18]. Notably, as further motivation for this statistical model, empirical evidence in [16] has shown that the norm of the gradient noise in SGD has heavy tails, and [28] revealed that, in a continuous-time analysis, the gradient exhibits a stationary distribution following Student's t-like distributions.

Figure 1: Robustness of the Student's t-distribution (right) compared with the Gaussian distribution (left). The green curve represents the true distribution to be captured.

Specifically, in the present study, $g$ is assumed to be generated from a $d$-dimensional diagonal Student's t-distribution, characterized by three types of parameters: a location parameter $m \in \mathbb{R}^d$; a scale parameter $v \in \mathbb{R}_{>0}^d$; and a degrees-of-freedom parameter $\nu \in \mathbb{R}_{>0}$. That is,

$$g \sim \frac{\Gamma\left(\frac{\nu+d}{2}\right)}{\Gamma\left(\frac{\nu}{2}\right) (\nu\pi)^{\frac{d}{2}} \prod_{i=1}^{d} \sqrt{v_i}} \left(1 + \frac{1}{\nu} \sum_{i=1}^{d} (g_i - m_i)^2 v_i^{-1}\right)^{-\frac{\nu+d}{2}} =: \mathcal{T}(g \mid m, v, \nu) \quad (12)$$

where $\Gamma$ denotes the gamma function. Following the related work [18] and PyTorch's implementation [29], AdaTerm is in practice applied to the weight matrix and the bias of each neural network layer separately. Therefore, $d$ here corresponds to the dimension size of each subset of parameters (the weight or bias of the layer).
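For concreteness, the log-density of eq. (12) can be evaluated as follows (a sketch using SciPy's `gammaln`):

```python
import numpy as np
from scipy.special import gammaln  # log of the gamma function

def log_t_density(g, m, v, nu):
    d = g.size
    D = np.sum((g - m) ** 2 / v) / d  # scaled squared distance, (1/d) s^T v^{-1}
    return (gammaln((nu + d) / 2.0) - gammaln(nu / 2.0)
            - 0.5 * d * np.log(nu * np.pi) - 0.5 * np.sum(np.log(v))
            - 0.5 * (nu + d) * np.log1p(d * D / nu))
```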

4.2 Gradients for the maximum log-likelihood

To derive AdaTerm, let us consider the MLL problem, $\max_{m,v,\nu} \ln \mathcal{T}(g \mid m, v, \nu)$, to estimate the parameters $m$, $v$, and $\nu$ that adequately model the most recently observed loss-function gradients $g$.

To simplify the notation, the following variables are defined:

$$s = (g - m)^2, \quad D = \frac{1}{d} s^\top v^{-1}, \quad \tilde{\nu} = \nu d^{-1}, \quad w_{mv} = \frac{\tilde{\nu} + 1}{\tilde{\nu} + D}$$

When $\nu$ is defined to be proportional to $d$, as in the literature [18], $\tilde{\nu}$ corresponds to the proportionality coefficient. Note that it is not strictly necessary to follow the literature [18] in setting the degrees-of-freedom as $\nu = \tilde{\nu} d$. Nonetheless, this procedure is important to provide the same level of robustness to the respective subsets of parameters, regardless of their respective sizes.

Based on these, the log-likelihood gradients with respect to $m$, $v$, and $\nu$ can respectively be derived as follows:

$$\nabla_m \ln \mathcal{T} = -\frac{\nu+d}{2} \cdot \frac{-\nu^{-1}(g-m)v^{-1}}{1 + \nu^{-1} d D} = \frac{\nu d^{-1} + 1}{\nu d^{-1} + D} \cdot \frac{g-m}{2v} = w_{mv} \frac{g-m}{2v} =: g_m \quad (13)$$

$$\nabla_v \ln \mathcal{T} = -\frac{1}{2v} - \frac{\nu+d}{2} \cdot \frac{-\nu^{-1}(g-m)^2 v^{-2}}{1 + \nu^{-1} d D} = \frac{1}{2v^2} \left(\frac{\tilde{\nu}+1}{\tilde{\nu}+D} s - v\right) = w_{mv} \frac{\tilde{\nu}}{2v^2(\tilde{\nu}+1)} \{(s - v) + (s - Dv)\tilde{\nu}^{-1}\} \quad (14)$$

$$\nabla_\nu \ln \mathcal{T} = \frac{1}{2} \psi\left(\frac{\nu+d}{2}\right) - \frac{1}{2} \psi\left(\frac{\nu}{2}\right) - \frac{d}{2\nu} - \frac{1}{2} \ln(1 + \nu^{-1} d D) - \frac{\nu+d}{2} \{(\nu + dD)^{-1} - \nu^{-1}\} \quad (15)$$

where the gradients with respect to $m$ and $v$ are transformed so that the response to outliers can be intuitively analyzed through $w_{mv}$, and $\psi$ denotes the digamma function.

In the previous approach, the gradients would be set to zero to find a direct analytical solution. We instead employ a gradient ascent update rule to solve this MLL problem. Next, we discuss how the adaptive step sizes and the update rule for each one of the parameters are derived to develop the AdaTerm algorithm.

4.3 Online maximum likelihood updates with adaptive step sizes

Figure 2: The Student's t-distribution tends toward the Gaussian distribution as $\nu \to \infty$.

As shown in Fig. 2, the Student's t-distribution exhibits the property that, as $\nu \to \infty$, it reverts to a Gaussian distribution. Similarly, when $\nu \to \infty$ (i.e., $w_{mv} \to 1$), the AdaTerm optimizer should also yield a non-robust Gaussian optimization algorithm.

Following the success of the EMA scheme in outlier-free optimization, we specifically choose the adaptive step sizes such that, in the Gaussian limit (i.e., when $\nu \to \infty$), the gradient ascent updates of the mean $m$ and scale $v$ parameters revert to the simple EMA updates given by eqs. (6) and (7).

4.3.1 The first-order moment

Since the location parameter $m$ can take any value in real space, we can perform unconstrained gradient-based updates without any restriction. The gradient ascent equation is therefore expressed as follows:

$$m_t = m_{t-1} + \kappa_m g_m = m_{t-1} + \kappa_m w_{mv} \frac{g_t - m_{t-1}}{2 v_{t-1}} \quad (16)$$

where $\kappa_m$ is the update step size to be defined. Because we expect an EMA-like update rule similar to eq. (6), $m_t$ can also be written as:

$$m_t = m_{t-1} + \tau_m (g_t - m_{t-1})$$

where $\tau_m$ is the interpolation factor (or decay factor) satisfying $\tau_m \in (0,1)$ and $\tau_m \to (1-\beta)$ for $\tilde{\nu} \to \infty$. By identification with the gradient ascent equation, we can set

$$\tau_m = \frac{\kappa_m w_{mv}}{2 v_{t-1}}$$

Using the second constraint on $\tau_m$, i.e., $\tau_m \to (1-\beta)$, and the fact that $w_{mv} \to 1$ for $\tilde{\nu} \to \infty$, we can write the following proposition:

$$\tilde{\nu} \to \infty \implies \tau_m = \frac{\kappa_m w_{mv}}{2 v_{t-1}} \to (1-\beta) \implies \kappa_m \to 2(1-\beta) v_{t-1}$$

Let $k(\tilde{\nu})$ be an arbitrary function of $\tilde{\nu}$ such that $\tilde{\nu} \to \infty \implies k(\tilde{\nu}) \to 1$; accordingly, $\kappa_m = 2(1-\beta) v_{t-1} k(\tilde{\nu})$ satisfies the previous proposition.

To identify a suitable function $k(\tilde{\nu})$ for our optimizer, we can utilize the first constraint on $\tau_m$, i.e., $\tau_m \in (0,1)$. By substituting the expression for $\kappa_m$ into that for $\tau_m$, this constraint can be rewritten as follows:

$$0 < (1-\beta) w_{mv} k(\tilde{\nu}) < 1 \implies 0 < k(\tilde{\nu}) < \frac{1}{(1-\beta) w_{mv}} \quad (17)$$

This implies that any function satisfying $\tilde{\nu} \to \infty \implies k(\tilde{\nu}) \to 1$ and $0 < k(\tilde{\nu}) < \frac{1}{(1-\beta) w_{mv}}$ will result in a valid step size $\kappa_m$. In this paper, we consider the simplest version among these functions for which $\tau_m$ is not a constant, i.e., $k(\tilde{\nu}) = \frac{1}{\overline{w}_{mv}}$, where $\overline{w}_{mv} = (\tilde{\nu}+1)\tilde{\nu}^{-1} \geq w_{mv} \geq (1-\beta) w_{mv}$ (since $\beta \in (0,1)$ by definition).

Based on this, the adaptive step size $\kappa_m$ is expressed as follows:

$$\kappa_m = 2 v_{t-1} \frac{1-\beta}{\overline{w}_{mv}} \quad (18)$$

and the update rule for the first-order moment $m$ becomes

$$m_t = m_{t-1} + \kappa_m g_m = m_{t-1} + (1-\beta) \frac{w_{mv}}{\overline{w}_{mv}} (g_t - m_{t-1}) = (1-\tau_m) m_{t-1} + \tau_m g_t \quad (19)$$

where $\tau_m = (1-\beta) \frac{w_{mv}}{\overline{w}_{mv}}$.

4.3.2 Central second-order moment

Similarly, the gradient ascent update rule for the scale parameter $v$ is expressed as follows:

$$v_t = v_{t-1} + \kappa_v g_v = v_{t-1} + \kappa_v \frac{w_{mv} \tilde{\nu}}{2 v_{t-1}^2 (\tilde{\nu}+1)} [(s_t + \Delta s) - v_{t-1}] = v_{t-1} + \kappa_v \frac{w_{mv}}{2 v_{t-1}^2 \overline{w}_{mv}} [(s_t + \Delta s) - v_{t-1}] \quad (20)$$

where $\kappa_v$ is the update step size, and $\Delta s = (s - Dv)\tilde{\nu}^{-1}$. Again, we restrict ourselves to an EMA-like update rule similar to eq. (7):

$$v_t = v_{t-1} + \tau_v [(s_t + \Delta s) - v_{t-1}]$$

where $\tau_v$ must satisfy the same constraints as $\tau_m$, i.e., $\tau_v \in (0,1)$ and $\tau_v \to (1-\beta)$ for $\tilde{\nu} \to \infty$. The simplicity of this second constraint, despite the presence of $\Delta s$, is because we already have $\tilde{\nu} \to \infty \implies \Delta s \to 0$. This leaves only $(1-\beta)(s_t - v_{t-1})$, as found in the Gaussian-based EMA for the variance. Therefore, the derivations of $\kappa_v$ and $\tau_v$ follow almost exactly the same procedure as performed above for $\kappa_m$ and $\tau_m$:

$$\tilde{\nu} \to \infty \implies \tau_v = \frac{\kappa_v w_{mv}}{2 v_{t-1}^2 \overline{w}_{mv}} \to (1-\beta) \implies \kappa_v \to 2(1-\beta) v_{t-1}^2$$
$$\implies \kappa_v = 2(1-\beta) v_{t-1}^2 k(\tilde{\nu}), \quad (\tilde{\nu} \to \infty \implies k(\tilde{\nu}) \to 1)$$
$$0 < \tau_v < 1 \implies 0 < k(\tilde{\nu}) < \frac{\overline{w}_{mv}}{(1-\beta) w_{mv}}$$

Again, we can use the lowest bound for $\frac{1}{(1-\beta) w_{mv}}$, i.e., $\frac{1}{\overline{w}_{mv}}$, and set $k(\tilde{\nu}) = \frac{\overline{w}_{mv}}{\overline{w}_{mv}} = 1$. This results in the following step size:

$$\kappa_v = 2 v_{t-1}^2 (1-\beta) \quad (21)$$

and the corresponding update rule:

$$v_t = v_{t-1} + \kappa_v g_v = v_{t-1} + (1-\beta) \frac{w_{mv}}{\overline{w}_{mv}} [(s_t + \Delta s) - v_{t-1}] = (1-\tau_v) v_{t-1} + \tau_v (s_t + \Delta s) \quad (22)$$

where $\tau_v = \tau_m = (1-\beta) \frac{w_{mv}}{\overline{w}_{mv}}$.

A closer look at this update rule reveals that, although $\Delta s = (s - Dv)\tilde{\nu}^{-1} \to 0$ for $\tilde{\nu} \to \infty$, the value of $\Delta s$ can still be negative for finite $\tilde{\nu}$. This implies that the update shown above can potentially drive the scale parameter $v$ into a negative region during the learning process. This is undesirable, since $v$ is only defined in the positive real space. There are many ways to satisfy this positivity constraint, e.g., reparameterization or mirror descent (MD) strategies [30]. However, to preserve the EMA-like interpolation directly on the parameter $v$, we simply employ the projected gradient method. Specifically, we use a simple gradient clipping method [31], where we place a lower bound on $\Delta s$:

$$\Delta s := \max(\epsilon^2, (s - Dv)\tilde{\nu}^{-1}) \quad (23)$$

where $\epsilon \ll 1$. Such a lower bound on the gradient is disadvantageous in that it can cause the algorithm to overestimate the scale parameter $v$. In this case, when $v$ is larger than the exact value, $D$ and $w_{mv}$ become smaller and larger, respectively; in other words, the robustness to noise may be impaired. However, this drawback is attenuated by the fact that the scaled learning rate (similar to that of Adam in eq. (5)) is inversely proportional to $v$, rendering the effect of noise insignificant. We also argue that the slow decrease of $v$ permitted by the gradient clipping strategy is advantageous in preventing excessive robustness, compared with the case where $v$ is analytically estimated (see details in Appendix H).

Finally, since $\tau_m = \tau_v = (1-\beta) w_{mv} \overline{w}_{mv}^{-1}$, we adopt one common notation for both, $\tau_{mv} = (1-\beta) w_{mv} \overline{w}_{mv}^{-1}$. Note that the past work stores the discounted sum $W_t$ of $w_{mv}$ in memory; in AdaTerm, $\overline{w}_{mv}$ is used instead, saving memory.
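Putting eqs. (19), (22), and (23) together, the joint update of $m$ and $v$ with the shared factor $\tau_{mv}$ can be sketched as:

```python
import numpy as np

def update_m_v(g, m, v, nu_tilde, beta=0.9, eps=1e-5):
    d = g.size
    s = (g - m) ** 2
    D = np.sum(s / v) / d
    w_mv = (nu_tilde + 1.0) / (nu_tilde + D)
    w_bar_mv = (nu_tilde + 1.0) / nu_tilde
    tau_mv = (1.0 - beta) * w_mv / w_bar_mv                 # shared step size
    delta_s = np.maximum(eps ** 2, (s - D * v) / nu_tilde)  # eq. (23): clipped term
    m_new = (1.0 - tau_mv) * m + tau_mv * g                 # eq. (19)
    v_new = (1.0 - tau_mv) * v + tau_mv * (s + delta_s)     # eq. (22)
    return m_new, v_new
```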

4.3.3 Degrees-of-freedom

Simplifying the gradient

Compared with $m$ and $v$, the update rule of $\nu$ is considerably more delicate to handle. In particular, it cannot easily be identified when the positivity constraint on $\nu$ would be violated by a simple gradient ascent update. This is mainly due to the digamma function $\psi$. Hence, the first step is to eliminate it from the problem. In this study, we use the upper and lower bounds of $\psi$ to find an upper bound of the gradient with respect to $\nu$. Specifically, the upper and lower bounds of $\psi$ are expressed as follows:

$$\ln x - \frac{1}{x} \leq \psi(x) \leq \ln x - \frac{1}{2x}$$

implying that we can define an upper bound for $\psi(\frac{\nu+d}{2}) - \psi(\frac{\nu}{2})$ as $\ln(\frac{\nu+d}{2}) - \frac{1}{\nu+d} - \ln(\frac{\nu}{2}) + \frac{2}{\nu}$.

Fig. 3 shows the curves traced by the two expressions as a function of $\tilde{\nu}$, for different values of $d$ (given that, for our application, $\nu = \tilde{\nu} d$ with typically $\tilde{\nu} \geq 1$). As can be seen, for large values of $d$, which is generally the case for neural networks (NN), the defined upper bound tightens. For $d = 1$ (which can occur for the NN biases), the gap evaluated in the range $\tilde{\nu} \in [1, \infty)$ is at most $\approx 0.80685$ and occurs when $\tilde{\nu} = 1$.
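These digamma bounds are standard and can be verified numerically; a quick sketch using SciPy's digamma:

```python
import numpy as np
from scipy.special import psi  # digamma function

x = np.linspace(0.5, 50.0, 500)
lower = np.log(x) - 1.0 / x
upper = np.log(x) - 0.5 / x
assert np.all(lower <= psi(x)) and np.all(psi(x) <= upper)
```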

Figure 3: The tightness of the upper bound employed for eq. (24). For a large $d$, the upper bound's curve and the original expression's curve are practically indistinguishable.

Although the use of an upper bound can impair the noise robustness in theory (as in the case of $v$), the relatively small gap shown in Fig. 3 hardly impairs it in practice. Therefore, the following upper bound for eq. (15) can be defined:

$$\nabla_\nu \ln \mathcal{T} \leq \frac{1}{2} \left\{ \ln(\nu+d) - \ln 2 - \frac{1}{\nu+d} - \ln\nu + \ln 2 + \frac{2}{\nu} - \frac{d}{\nu} - \ln(\nu + dD) + \ln\nu - \frac{\nu+d}{\nu+dD} + 1 + \frac{d}{\nu} \right\} = \frac{1}{2} \left\{ -w_{mv} + \ln w_{mv} + 1 + \frac{\tilde{\nu}+2}{\tilde{\nu}+1} \cdot \frac{1}{\nu} \right\} = \frac{1}{2} \left\{ -w_{\tilde{\nu}} + 1 + \frac{\tilde{\nu}+2}{\tilde{\nu}+1} \cdot \frac{1}{\nu} \right\} \quad (24)$$

where $w_{\tilde{\nu}} = w_{mv} - \ln w_{mv}$.

Figure 4: $w_{mv}$ vs. the surrogate gradient in eq. (24) with $\nu = d$, for several dimension sizes.
Tackling the curse of dimensionality

The overestimated gradient of the previous paragraph is simple, and its behavior is easy to analyze. Notably, this behavior is considerably affected by $d$, as shown in Fig. 4 with $\nu = d$ (i.e., $\tilde{\nu} = 1$). As can be seen, when $d = 1$, the gradient with respect to $\nu$ is negative only when $w_{mv}$ is exceedingly small, corresponding to an extremely large deviation $D$. In this case, the negative gradient effectively decreases $\nu$ to exclude the noisy gradients of the network's parameters $g$. Conversely, for small deviations $D$, the gradient in eq. (24) becomes positive, and $\nu$ is increased to suppress the robustness and easily update the parameters.

Unfavorably, with $d \simeq 10{,}000$, which is common for an NN's weight matrix, the gradient of eq. (24) is always negative and almost zero even when $w_{mv} = 1$, implying that $\nu$ can only decrease. The same phenomenon occurs even for $d \simeq 100$, whose curve is practically indistinguishable from that of $d \simeq 10{,}000$. Such pessimistic behavior is consistent with the non-intuitive behavior of the multivariate Student's t-distribution reported in the literature [20]. This problem has also been empirically confirmed in [19], where the excessive robustness is forcibly suppressed by correcting the obtained (approximated) estimate by multiplying it with $d$.

To alleviate this problem, we consider a further upper bound, where $\nu$ is replaced by $\tilde{\nu}$. This choice is further supported by the fact that $\nu$ does not appear directly in any of the other MLL gradients, implying that directly dealing with the gradient with respect to $\tilde{\nu}$ is more natural. The above modifications are summarized below.

$$\nabla_{\tilde{\nu}} \ln \mathcal{T} = d \nabla_\nu \ln \mathcal{T} \leq \frac{d}{2} \left\{ -w_{\tilde{\nu}} + 1 + \frac{\tilde{\nu}+2}{\tilde{\nu}+1} \cdot \frac{1}{\tilde{\nu}} \right\} = w_{\tilde{\nu}} \frac{d}{2} \left\{ -1 + \left( \frac{\tilde{\nu}+2}{\tilde{\nu}+1} + \tilde{\nu} \right) \frac{1}{\tilde{\nu} w_{\tilde{\nu}}} \right\} =: g_{\tilde{\nu}} \quad (25)$$

where the inequality is obtained using $\frac{1}{\nu} \leq \frac{1}{\tilde{\nu}}$. This new upper bound can be interpreted as an approximation of the multivariate distribution by a univariate distribution, and it allows the degrees-of-freedom to increase even for large dimensions $d$ when $w_{mv}$ is close to 1 (i.e., when $D$ is small). Although it is possible to model the distribution as a univariate distribution from the beginning of the derivation [32], the robustness may be degraded, since it would capture the average without focusing on the amount of deviation along each axis.

Update rule

Based on this upper bound, we now proceed to the derivation of a suitable update rule for the robustness parameter $\tilde{\nu}$. We draw similarities with the update rules obtained previously for $m$ and $v$ and set our goal to an EMA-like equation with an adaptive smoothness parameter $\tau_{\tilde{\nu}} = (1-\beta) w_{\tilde{\nu}} \overline{w}_{\tilde{\nu}}^{-1}$, where $w_{\tilde{\nu}} \leq \overline{w}_{\tilde{\nu}}$. We aim for the following update rule:

$$\tilde{\nu}_t = (1 - \tau_{\tilde{\nu}}) \tilde{\nu}_{t-1} + \tau_{\tilde{\nu}} \lambda_t = \tilde{\nu}_{t-1} + \tau_{\tilde{\nu}} (\lambda_t - \tilde{\nu}_{t-1}) \quad (26)$$

where $\lambda_t$ is some function of $w_{mv}$. To derive such an equation, we focus first on $\tau_{\tilde{\nu}}$, and particularly on determining the maximum value of $w_{\tilde{\nu}}$, denoted $\overline{w}_{\tilde{\nu}}$.

We begin by noticing that $w_{\tilde{\nu}}$ is a convex function of $w_{mv}$ with a minimum at $w_{mv} = 1$. The maximum value is therefore attained when $w_{mv}$ is either largest ($w_{mv} \gg 1$) or smallest ($w_{mv} \ll 1$). The largest $w_{mv}$ value has already been derived as $\overline{w}_{mv}$; however, the smallest value cannot be given exactly, since $w_{mv} \to 0$ when $D \to \infty$. Therefore, instead of the exact minimum value, we employ the tiny float value $\epsilon_{\mathrm{float}}$ that is numerically closest to zero (in the case of float32, $w_{\tilde{\nu}}(w_{mv} = \epsilon_{\mathrm{float}}) \simeq 87.3365$). In summary, the maximum of $w_{\tilde{\nu}}$, $\overline{w}_{\tilde{\nu}}$, can be defined as follows:

$$\overline{w}_{\tilde{\nu}} = \max(\overline{w}_{mv} - \ln(\overline{w}_{mv}),\; \epsilon_{\mathrm{float}} - \ln(\epsilon_{\mathrm{float}})) \quad (27)$$
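The float32 constant quoted above can be reproduced directly (a sketch; NumPy's `finfo` exposes the tiny value):

```python
import numpy as np

tiny = np.finfo(np.float32).tiny  # smallest positive normal float32, about 1.18e-38
print(tiny - np.log(tiny))        # about 87.3365, matching w_nu(w_mv = eps_float)
```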

If we design the step size with $\overline{w}_{\tilde{\nu}}$, we can obtain an interpolation update rule for $\tilde{\nu}$ similar to those for $m$ and $v$. However, although the minimum value of $\tilde{\nu}$ is expected to be positive, excessive robustness could inhibit the learning process. Thus, it is desirable that the user retain a certain level of control over the extent of robustness the algorithm is allowed to achieve. Therefore, our final step is to transform $\tilde{\nu} = \underline{\tilde{\nu}} + \Delta\tilde{\nu}$, with a user-specified minimum value $\underline{\tilde{\nu}} > 0$ and a deviation $\Delta\tilde{\nu} > 0$ automatically controlled by the algorithm. With this transformation, the appropriate step size $\kappa_{\Delta\tilde{\nu}}$ satisfies the following:

$$\Delta\tilde{\nu}_t = \Delta\tilde{\nu}_{t-1} + \kappa_{\Delta\tilde{\nu}} g_{\tilde{\nu}} = \Delta\tilde{\nu}_{t-1} + \tau_{\tilde{\nu}} (\lambda'_t - \Delta\tilde{\nu}_{t-1}) \quad (28)$$

where the right side of the second equality is obtained from eq. (26) by replacing $\tilde{\nu}$ with $\Delta\tilde{\nu}$ and using $\lambda'_t \neq \lambda_t$. Substituting $g_{\tilde{\nu}}$ by its expression from eq. (25) and using $\tau_{\tilde{\nu}} = (1-\beta) w_{\tilde{\nu}} \overline{w}_{\tilde{\nu}}^{-1}$, the following equalities are obtained:

$$\underbrace{\kappa_{\Delta\tilde{\nu}} \left[ w_{\tilde{\nu}} \frac{d}{2} \left\{ \frac{1}{\tilde{\nu}_{t-1} w_{\tilde{\nu}}} \left( \frac{\tilde{\nu}_{t-1}+2}{\tilde{\nu}_{t-1}+1} + \tilde{\nu}_{t-1} \right) - 1 \right\} \right]}_{A} = (1-\beta) \frac{w_{\tilde{\nu}}}{\overline{w}_{\tilde{\nu}}} (\lambda'_t - \Delta\tilde{\nu}_{t-1})$$
$$A = \underbrace{\kappa_{\Delta\tilde{\nu}} \left[ \frac{d w_{\tilde{\nu}}}{2 \Delta\tilde{\nu}_{t-1}} \right]}_{(1-\beta) \frac{w_{\tilde{\nu}}}{\overline{w}_{\tilde{\nu}}}} \left( \underbrace{\frac{\Delta\tilde{\nu}_{t-1}}{\tilde{\nu}_{t-1} w_{\tilde{\nu}}} \left( \frac{\tilde{\nu}_{t-1}+2}{\tilde{\nu}_{t-1}+1} + \tilde{\nu}_{t-1} \right)}_{\lambda'_t} - \Delta\tilde{\nu}_{t-1} \right) \quad (29)$$

From this, we can derive

$$\kappa_{\Delta\tilde{\nu}} = 2 \Delta\tilde{\nu}_{t-1} \frac{1-\beta}{d \overline{w}_{\tilde{\nu}}} \quad (30)$$
$$\lambda'_t = \left( \frac{\tilde{\nu}_{t-1}+2}{\tilde{\nu}_{t-1}+1} + \tilde{\nu}_{t-1} \right) \frac{\tilde{\nu}_{t-1} - \underline{\tilde{\nu}}}{\tilde{\nu}_{t-1} w_{\tilde{\nu}}} \quad (31)$$

along with the update rule for $\Delta\tilde{\nu}$:

$$\Delta\tilde{\nu}_t = \Delta\tilde{\nu}_{t-1} + \kappa_{\Delta\tilde{\nu}} g_{\tilde{\nu}} = (1 - \tau_{\tilde{\nu}}) \Delta\tilde{\nu}_{t-1} + \tau_{\tilde{\nu}} \left( \frac{\tilde{\nu}_{t-1}+2}{\tilde{\nu}_{t-1}+1} + \tilde{\nu}_{t-1} \right) \frac{\tilde{\nu}_{t-1} - \underline{\tilde{\nu}}}{\tilde{\nu}_{t-1} w_{\tilde{\nu}}}$$

Finally, by adding $\underline{\tilde{\nu}} + \epsilon$ to both sides, we can directly obtain the update rule for $\tilde{\nu}$:

$$\tau_{\tilde{\nu}} = (1-\beta) w_{\tilde{\nu}} \overline{w}_{\tilde{\nu}}^{-1} \quad (32)$$
$$\lambda_t = \lambda'_t + \underline{\tilde{\nu}} + \epsilon \quad (33)$$
$$\tilde{\nu}_t = (1 - \tau_{\tilde{\nu}}) \tilde{\nu}_{t-1} + \tau_{\tilde{\nu}} \lambda_t \quad (34)$$

Note that the minimum value of $\Delta\tilde{\nu} = \tilde{\nu} - \underline{\tilde{\nu}}$ is given by $\epsilon$, such that $\Delta\tilde{\nu} > 0$ is always satisfied. This also prevents $\lambda$ from becoming $0$ and stopping the update of $\tilde{\nu}$.
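A sketch of the resulting degrees-of-freedom update, eqs. (32)-(34), given $w_{mv}$ and $\overline{w}_{mv}$ from the previous subsections (`nu_min` stands for $\underline{\tilde{\nu}}$):

```python
import numpy as np

def update_nu_tilde(nu_tilde, w_mv, w_bar_mv, beta=0.9, nu_min=1.0, eps=1e-5):
    tiny = np.finfo(np.float32).tiny
    w_nu = w_mv - np.log(w_mv)
    w_bar_nu = max(w_bar_mv - np.log(w_bar_mv), tiny - np.log(tiny))  # eq. (27)
    tau_nu = (1.0 - beta) * w_nu / w_bar_nu                           # eq. (32)
    lam = (((nu_tilde + 2.0) / (nu_tilde + 1.0) + nu_tilde)
           * (nu_tilde - nu_min) / (nu_tilde * w_nu) + nu_min + eps)  # eqs. (31), (33)
    return (1.0 - tau_nu) * nu_tilde + tau_nu * lam                   # eq. (34)
```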

4.4 Algorithm

Finally, the update amount of the optimization parameters $\theta$ for AdaTerm is expressed as follows:

$$\eta^{\mathrm{AdaTerm}}(g_t) = \frac{m_t (1-\beta^t)^{-1}}{\sqrt{v_t (1-\beta^t)^{-1}}} \quad (35)$$

Unlike Adam [4], the small amount $\epsilon$ usually added to the denominator is removed, since $\sqrt{v} \geq \epsilon$ in our implementation.

The pseudo-code of AdaTerm is summarized in Alg. 1. (AdaTerm's t-momentum requires scale estimation and would incur a higher cost if integrated into the pure SGD algorithm; therefore, we omit such a variant in this study and only consider a variant of the Adam algorithm.) The regret bound is analyzed in Appendix A based on a novel approach, different from the literature [21], that combines the approach proposed by [22] with a novel method based on Lemma A.1 to eliminate the AMSGrad assumption from the regret analysis.

Algorithm 1 AdaTerm optimizer
1:  Set $\alpha > 0$ ($10^{-3}$ is the default value)
2:  Set $\beta \in (0,1)$ ($0.9$ is the default value)
3:  Set $\epsilon \ll 1$ ($10^{-5}$ is the default value)
4:  Set $\underline{\tilde{\nu}} > 0$ ($1$ is the default value)
5:  Set $d$ as the dimension size of each subset of parameters
6:  Initialize $\tilde{\mathcal{D}}$, $\theta_1$, $m_0 \leftarrow 0$, $v_0 \leftarrow \epsilon^2$, $\tilde{\nu}_0 \leftarrow \underline{\tilde{\nu}} + \epsilon$, $t \leftarrow 0$
7:  while $\theta_t$ not converged do
8:     // Compute gradient
9:     $t \leftarrow t + 1$
10:    $g_t = \nabla_{\theta_t} \mathcal{L}_{\mathcal{B}_t}$, $\mathcal{B}_t \sim \tilde{\mathcal{D}}$
11:    // Compute index of outlier
12:    $s = (g_t - m_{t-1})^2$
13:    $D = d^{-1} s^\top v_{t-1}^{-1}$
14:    // Compute adaptive step sizes
15:    $w_{mv} = (\tilde{\nu}_{t-1} + 1)(\tilde{\nu}_{t-1} + D)^{-1}$
16:    $\overline{w}_{mv} = (\tilde{\nu}_{t-1} + 1) \tilde{\nu}_{t-1}^{-1}$
17:    $w_{\tilde{\nu}} = w_{mv} - \ln(w_{mv})$
18:    $\overline{w}_{\tilde{\nu}} = \max(\overline{w}_{mv} - \ln(\overline{w}_{mv}), \epsilon_{\mathrm{float}} - \ln(\epsilon_{\mathrm{float}}))$
19:    $\tau_{mv} = (1-\beta) \frac{w_{mv}}{\overline{w}_{mv}}$, $\tau_{\tilde{\nu}} = (1-\beta) \frac{w_{\tilde{\nu}}}{\overline{w}_{\tilde{\nu}}}$
20:    // Compute update amounts
21:    $\Delta s = \max(\epsilon^2, (s - D v_{t-1}) \tilde{\nu}_{t-1}^{-1})$
22:    $\lambda = \left( \frac{\tilde{\nu}_{t-1}+2}{\tilde{\nu}_{t-1}+1} + \tilde{\nu}_{t-1} \right) \frac{\tilde{\nu}_{t-1} - \underline{\tilde{\nu}}}{\tilde{\nu}_{t-1} w_{\tilde{\nu}}} + \underline{\tilde{\nu}} + \epsilon$
23:    // Update parameters
24:    $m_t = (1-\tau_{mv}) m_{t-1} + \tau_{mv} g_t$
25:    $v_t = (1-\tau_{mv}) v_{t-1} + \tau_{mv} (s + \Delta s)$
26:    $\tilde{\nu}_t = (1-\tau_{\tilde{\nu}}) \tilde{\nu}_{t-1} + \tau_{\tilde{\nu}} \lambda$
27:    $\theta_{t+1} = \theta_t - \alpha \frac{m_t (1-\beta^t)^{-1}}{\sqrt{v_t (1-\beta^t)^{-1}}}$
28: end while
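For reference, Alg. 1 translates into the following self-contained NumPy sketch for a single flat parameter vector; this is an illustration under the defaults above, not the official implementation (see the repository linked in Section 1 for that):

```python
import numpy as np

def adaterm_step(theta, g, state, alpha=1e-3, beta=0.9, eps=1e-5, nu_min=1.0):
    m, v, nu, t = state["m"], state["v"], state["nu"], state["t"] + 1
    d = g.size
    # Index of outlier (lines 12-13)
    s = (g - m) ** 2
    D = np.sum(s / v) / d
    # Adaptive step sizes (lines 15-19)
    w_mv = (nu + 1.0) / (nu + D)
    w_bar_mv = (nu + 1.0) / nu
    w_nu = w_mv - np.log(w_mv)
    tiny = np.finfo(np.float32).tiny
    w_bar_nu = max(w_bar_mv - np.log(w_bar_mv), tiny - np.log(tiny))
    tau_mv = (1.0 - beta) * w_mv / w_bar_mv
    tau_nu = (1.0 - beta) * w_nu / w_bar_nu
    # Update amounts (lines 21-22)
    delta_s = np.maximum(eps ** 2, (s - D * v) / nu)
    lam = ((nu + 2.0) / (nu + 1.0) + nu) * (nu - nu_min) / (nu * w_nu) + nu_min + eps
    # Parameter updates (lines 24-27)
    m = (1.0 - tau_mv) * m + tau_mv * g
    v = (1.0 - tau_mv) * v + tau_mv * (s + delta_s)
    nu = (1.0 - tau_nu) * nu + tau_nu * lam
    theta = theta - alpha * (m / (1.0 - beta ** t)) / np.sqrt(v / (1.0 - beta ** t))
    state.update(m=m, v=v, nu=nu, t=t)
    return theta

# initialization per Alg. 1, line 6:
# state = {"m": np.zeros_like(theta), "v": np.full_like(theta, (1e-5) ** 2),
#          "nu": 1.0 + 1e-5, "t": 0}
```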

4.5 Behavior analysis

4.5.1 Convergence analysis

Our convergence proof adopts the approach highlighted in [22]. As such, we start by stating the same assumptions:

Assumption 4.1.

Necessary assumptions:

1. $\mathcal{F} \subset \mathbb{R}^d$ is a compact convex set;

2. $\mathcal{L}_{\mathcal{B}_t} : \mathcal{F} \to \mathbb{R}$ is a convex lower semicontinuous (lsc) function;

3. $\mathcal{F}$ has a bounded diameter, i.e., $D = \max_{x,y \in \mathcal{F}} \lVert x - y \rVert_\infty$, and $G = \max_{t \in [T]} \lVert g_t \rVert_\infty$.

The following convergence result then holds true; its proof can be found in Appendix A:

Theorem 4.2.

Let $\tau_t$ be the value of $\tau_{mv}$ at time step $t$, and let $\underline{\tau} \leq \tau_t$, $\forall t$. Under Assumption 4.1, with $\beta < 1$ and $\alpha_t = \alpha/\sqrt{t}$, AdaTerm achieves a regret $R_T = \sum_{t=1}^{T} \mathcal{L}_{\mathcal{B}_t}(\theta_t) - \mathcal{L}_{\mathcal{B}_t}(\theta^*)$ such that

$$R_T \leq \frac{D^2 \sqrt{T}}{4 \tau_T \alpha} \sum_{i=1}^{d} v_{T,i}^{1/2} + \left[ \frac{\underline{\tau}^2 + 1 - (\beta + \underline{\tau})}{2 \underline{\tau}^2} \right] \sum_{t=1}^{T-1} \frac{D^2}{\alpha_t} \sum_{i=1}^{d} v_{t,i}^{1/2} + \left[ \frac{(1-\beta)^2 \alpha}{\epsilon \underline{\tau}^2 \sqrt{T}} \right] \sum_{k=1}^{T-1} \sum_{i=1}^{d} (1-\underline{\tau})^{T-k} g_{k,i}^2 + \left[ \frac{\underline{\tau}(1-\underline{\tau}) + (1-\beta)}{2 \underline{\tau}^2} \right] \left[ \frac{(1-\beta)^2 \alpha}{\epsilon \underline{\tau}^2} \right] \sqrt{1 + \ln(T-1)} \sum_{i=1}^{d} \lVert g_{1:T-1,i}^2 \rVert_2 \quad (36)$$
Corollary 4.3 (Non-robust case regret bound).

If $\underline{\tilde{\nu}} \to \infty$, then $\underline{\tau} \to (1-\beta)$, a constant, and the regret becomes:

$$R_T \leq \frac{D^2 \sqrt{T}}{4 (1-\beta) \alpha} \sum_{i=1}^{d} v_{T,i}^{1/2} + \frac{1}{2} \sum_{t=1}^{T-1} \frac{D^2}{\alpha_t} \sum_{i=1}^{d} v_{t,i}^{1/2} + \frac{\alpha}{\epsilon \sqrt{T}} \sum_{k=1}^{T-1} \sum_{i=1}^{d} \beta^{T-k} g_{k,i}^2 + \frac{(1+\beta) \alpha \sqrt{1 + \ln(T-1)}}{2 (1-\beta) \epsilon} \sum_{i=1}^{d} \lVert g_{1:T-1,i}^2 \rVert_2 \quad (37)$$

Note the similarity between this regret bound and the ones derived by [21] and [8] using AMSGrad. In particular, this regret can be bounded by $\mathcal{O}(G\sqrt{T})$, so the worst-case dependence of the regret on $T$ remains $\mathcal{O}(\sqrt{T})$, even though AMSGrad is not used.

We emphasize that this approach to the regret bound is not specific to AdaTerm but can be used to bound the regret of other moment-based optimizers, including Adam and AdaBelief.

4.5.2 Qualitative robustness behavior analysis

In AdaTerm, the robustness to noise is qualitatively expressed differently from the previous studies [18, 19], owing to the following two factors.

First, $v$ is now updated robustly depending on $w_{mv}$, ensuring that the observation of aberrant gradients $g$ does not cause a sudden increase in the scale parameter and inadvertently loosen the threshold for other aberrant update directions in the subsequent steps. In addition, the update of the scale $v$ is expected to self-coordinate among the axes, by virtue of the inter-dependency between all of the parameters when estimating the scale.

Indeed, in a diagonal Gaussian distribution, the scale for each axis can be and is estimated independently; however, in a diagonal Student's t-distribution, each entry of the scale also depends on the previous values of the other entries, leading to a coordinated update of the scales. This inter-dependency exists thanks to the Mahalanobis distance metric $D$ and its role in the weight computation $w_{mv}$.

Specifically, as illustrated in Fig. 5, if an anomaly is detected only along a particular axis (e.g., the red and green points in Fig. 5), the corresponding entry of the vector $\Delta s$ becomes larger, causing the associated entry of $v$ to increase. This in turn mitigates the anomaly detection of the next step (i.e., a gradient far from the current location estimate $m$ would be attributed a higher weight $w_{mv}$, since the squared Mahalanobis distance $D \propto s^\top v^{-1}$ will be smaller overall). Conversely, if anomalies are detected on most of the axes (e.g., the violet point in Fig. 5), $\Delta s$ will become smaller ($\Delta s \simeq \epsilon^2$) and either prevent any further increase in $v$ or cause it to decrease. This prompts the algorithm to treat a larger number of the further gradients as anomalies ($D$ becomes larger overall, implying a smaller $w_{mv}$), as described above. Such adaptive and coordinated behavior yields stable updates even if $\beta$ is smaller than the conventional $\beta_2$ (set to $0.999$ in most optimizers).

Figure 5: Interdependence of the scale entries. The ellipses represent $3\sigma$, and the blue color is the initial $\sigma$ value. Each other color corresponds to how $\sigma$ would change under the algorithm when the point of the same color is observed.

Secondly, as illustrated in Fig. 6, the robustness indicator $\tilde{\nu}$ is increased when the deviation $D$ is small (i.e., when the observed gradient is not an aberrant value). Indeed, in this case, $\lambda'_t$ becomes larger and $\kappa_{\Delta\tilde{\nu}} g_{\tilde{\nu}}$ becomes positive. However, the speed of this increase is limited by $\overline{w}_{\tilde{\nu}}$, owing to its inclusion in the step size $\kappa_{\Delta\tilde{\nu}}$. In contrast, if an aberrant value is observed, $w_{\tilde{\nu}} \gg 1$ and $\lambda'_t$ decreases, such that $\kappa_{\Delta\tilde{\nu}} g_{\tilde{\nu}}$ becomes negative.

Figure 6: Change in the robustness parameter $\nu$ based on the Mahalanobis distance $D$ (represented by the contour plot) of the observations. The green vertical lines represent the 90th percentile. As illustrated, when $D$ is small, the degrees-of-freedom increases; when $D$ is large, it decreases, and the tail of the distribution becomes heavier to account for the outliers.

An intuitive visualization of the described behavior, obtained from the AdaTerm equations for $\tau_{mv}$ and $\kappa_{\Delta\tilde{\nu}}g_{\tilde{\nu}}$, can also be found in B.

Overall, although the robustness-tuning mechanism replaces the MLL gradient with its upper bounds, it still behaves conservatively and can be expected to retain excellent robustness to noise, as illustrated in Fig. 7.

Figure 7: Illustration of the first-order moment acquired by the AdaTerm and Gaussian ($\tilde{\nu}\to\infty$) strategies after 10 steps (with $\beta=0.9$), based on step-by-step noisy gradients observed on the sphere function $f(x,y)=x^{2}+y^{2}$.

As a final remark, if $\underline{\tilde{\nu}}\to\infty$, the robustness is lost by design, and AdaTerm essentially reduces to a slightly different version of AdaBelief, with performance differences arising from the simplification of the hyper-parameters ($\beta_{1}=\beta_{2}=\beta$) and from the variance computation (AdaBelief estimates $\mathbb{E}[(g_{t}-m_{t})^{2}]$, whereas AdaTerm estimates $\mathbb{E}[(g_{t}-m_{t-1})^{2}]$, the usual variance estimator). Therefore, in problems where AdaBelief is effective, AdaTerm can be expected to perform effectively as well.
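The difference between the two estimators can be made concrete with a short sketch (deliberately simplified: a fixed smoothing factor $1-\beta$, no $\Delta s$ term, and no bias correction, so that only the centering differs):

```python
def second_moments(grads, beta=0.9):
    """Contrast the two centered second-moment EMAs: AdaBelief centers on the
    updated mean m_t, whereas AdaTerm centers on the previous mean m_{t-1}."""
    m = s_belief = s_term = 0.0
    for g in grads:
        m_prev = m
        m = beta * m + (1 - beta) * g
        s_belief = beta * s_belief + (1 - beta) * (g - m) ** 2
        s_term = beta * s_term + (1 - beta) * (g - m_prev) ** 2
    return s_belief, s_term

# On an oscillating gradient sequence the AdaTerm-style estimate is larger,
# since m_t lies between m_{t-1} and g_t, hence |g_t - m_t| <= |g_t - m_{t-1}|.
print(second_moments([1.0, -1.0] * 50))
```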

5 Simulation benchmarks

Before solving more practical problems, we analyze the behavior of AdaTerm through minimization of typical test functions, focusing mainly on the adaptiveness of the robustness parameter and the convergence profile.

To evaluate its adaptiveness, in all simulations and experiments, the initial $\nu$ in At-Adam is set to the same value used in t-Adam. Furthermore, although many SGD-based optimization algorithms are available, in this study we focus on comparing AdaTerm against the two related works, t-momentum-based Adam (t-Adam) [18] and At-momentum-based Adam (At-Adam) [19] (the employed implementation can be found at https://github.com/Mahoumaru/t-momentum), and include only the results from Adam [4], AdaBelief [8], and RAdam [7] as references for non-robust optimization.

5.1 Analysis based on test functions

Figure 8: Noise ratio vs. L2 norm between the converged point and the analytically-optimal point
Figure 9: Noise ratio vs. degrees-of-freedom $\nu$

In this task, our aim is to determine the minimum point of a two-dimensional potential field (benchmark function) by relying on the field gradients. To analyze the robustness of the optimization algorithms, uniformly distributed noise ($\in(-0.1,0.1)$) is added at a specified ratio to the point coordinates. The results are summarized in Figs. 8 and 9 (details are in C).

Fig. 8 shows the error norm from the analytically-optimal point, revealing that AdaTerm's convergence accuracy on the McCormick function degraded at the lower noise ratios, probably due to the large learning rate. On the other fields, however, AdaTerm was able to maintain the standard update performance while mitigating the adverse effects of the added noise.

The performance of Adam [4] significantly degraded when noise was introduced, indicating its sensitivity to noise. On the Michalewicz function, At-Adam [19] attributes the steep gradients near the optimal value to noise and tends to exclude them; thus, the optimal solution is not obtained. This result implies that the automatic mechanism for tuning ν\nu employed by At-Adam has insufficient robustness adaptability to noise.

Indeed, the final $\nu$ plotted in Fig. 9 shows that the robustness parameter converges to a nearly constant value with At-Adam, independently of the noise ratio. In contrast, in AdaTerm, the degrees-of-freedom $\nu$ is inversely related to the noise ratio; thus, we observe the expected behavior of $\nu$ increasing under minimal noise and decreasing under high noise ratios.

5.2 Robustness on regression task

Problem settings

Following the same process as in [18], we consider the problem of fitting a ground-truth function $f(x)=x^{2}+\ln(x+1)+\sin(2\pi x)\cos(2\pi x)$ given noisy observations $y=f(x)+\zeta$, with $\zeta$ expressed as:

$\zeta \sim \mathcal{T}(1,0,0.05)\,\mathrm{Bern}\left(\frac{p}{100}\right),\quad p=0,10,20,\ldots,100$ (38)

where $\mathcal{T}(1,0,0.05)$ designates a Student's t-distribution with degrees-of-freedom $\nu_{\zeta}=1$, location $\mu_{\zeta}=0$, and scale $v_{\zeta}=0.05$; $\mathrm{Bern}(p/100)$ is a Bernoulli distribution with the ratio $p$ as its parameter. 50 trials with different random seeds are conducted for each noise ratio $p$ and each optimization method, using 40000 $(x,y)$ pairs sampled as observations and split into batches of size 10. This batch size is chosen arbitrarily as being neither exceedingly large nor exceedingly small with respect to the full dataset size.
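A possible NumPy sketch of this noise model is given below; the input range of $x$ is an assumption for illustration, since it is not restated here.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_noise(p, size, nu=1.0, loc=0.0, scale=0.05):
    """zeta ~ T(nu, loc, scale) * Bern(p/100) as in Eq. (38): heavy-tailed
    (Cauchy, since nu=1) noise applied to roughly p% of the observations."""
    t_noise = loc + scale * rng.standard_t(df=nu, size=size)
    corrupted = rng.random(size) < p / 100.0
    return t_noise * corrupted

f = lambda x: x**2 + np.log(x + 1) + np.sin(2*np.pi*x) * np.cos(2*np.pi*x)
x = rng.uniform(0.0, 1.0, 40000)   # assumed input range for illustration
y = f(x) + sample_noise(p=30, size=x.size)
```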

The model used is a fully connected neural network with 5 linear layers, each composed of 50 neurons, equipped with a ReLU activation function [33] for all the hidden layers. The training and test loss functions are the mean squared error (MSE) applied on $(\hat{y},y)$ and $(\hat{y},f(x))$, respectively, where $\hat{y}$ is the network's estimate given $x$.
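One plausible PyTorch realization of this architecture (reading it as a stack of five linear layers with 50 hidden units; input/output dimensions of 1 are an assumption for this 1-D task) is:

```python
import torch.nn as nn

# Five linear layers with 50 hidden units and ReLU on the hidden layers.
model = nn.Sequential(
    nn.Linear(1, 50), nn.ReLU(),
    nn.Linear(50, 50), nn.ReLU(),
    nn.Linear(50, 50), nn.ReLU(),
    nn.Linear(50, 50), nn.ReLU(),
    nn.Linear(50, 1),
)
train_loss = nn.MSELoss()  # on (y_hat, y); the test loss uses (y_hat, f(x))
```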

Result

The test loss against the noise ratio $p$ is plotted in Fig. 10. As can be observed, AdaTerm performed even more robustly than t-Adam and At-Adam and was able to maintain its prediction loss almost constant across all values of the noise ratio. The effect of batch size on the learning performance is also studied and can be found in F.

Figure 10: Test loss on the regression task against the noise ratio $p$

6 Experiments

6.1 Configurations of practical problems

After verifying the optimization efficiency and robustness on certain test functions and a regression task with artificial noise, we now investigate four practical benchmark problems to evaluate the performance of the proposed algorithm, AdaTerm. Details of the setups (e.g., network architectures) can be found in D.

For each method, 24 trials are conducted with different random seeds, and the mean and standard deviation of the scores for the respective benchmarks are evaluated. All the test results after learning can be found in Table 3. An ablation test was also conducted for AdaTerm, as summarized in E.

6.1.1 Classification of mislabeled CIFAR-100

The first problem is an image classification problem with the CIFAR-100 dataset. As artificial noise, a proportion of the labels (0% and 10%) of the training data was randomized to differ from the true labels. As simple data augmentation during the training process, random horizontal flips are introduced along with four-pixel padding.

The loss function is the cross-entropy. To stabilize the learning performance, we utilize the label smoothing technique [34], a regularization tool for deep neural networks. It involves the generation of “soft” labels by applying a weighted average between a uniform distribution and the “hard” labels. In our experiments, the degree of smoothing is set to 20%, in reference to the literature [35].
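As a concrete sketch of the smoothing described above (a common formulation; the paper's exact weighting may differ slightly), the soft target mixes the one-hot label with a uniform distribution over the classes:

```python
import torch
import torch.nn.functional as F

def smoothed_cross_entropy(logits, target, eps=0.2):
    """Cross-entropy against 'soft' labels: a weighted average between the
    one-hot ('hard') label and a uniform distribution over classes."""
    n_classes = logits.size(-1)
    log_probs = F.log_softmax(logits, dim=-1)
    # Uniform mass eps/K everywhere, plus the remaining 1-eps on the true class.
    soft = torch.full_like(log_probs, eps / n_classes)
    soft.scatter_(-1, target.unsqueeze(-1), 1.0 - eps + eps / n_classes)
    return -(soft * log_probs).sum(dim=-1).mean()
```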

6.1.2 Long-term prediction of robot locomotion

The second problem is a motion-prediction problem. The dataset used was sourced from the literature [36]. It contains 150 trajectories as training data, 25 trajectories as validation data, and 25 trajectories as test data, all collected from a hexapod locomotion task with 18 observed joint angles and 18 reference joint angles. An agent continually predicts the states within the given time interval (1 and 30 steps) from the initial true state and the subsequent predicted states. Therefore, the predicted states used as inputs deviate from the true states and become noise in the learning process.

For the loss function, instead of the commonly used mean squared error (MSE), we employ the negative log likelihood (NLL) of the predictive distribution, allowing the scale difference of each state to be considered internally. The NLL is computed at each prediction step, and their sum is used as the loss function. Because of the high cost of back-propagation through time (BPTT) over the entire trajectory, truncated BPTT (30 steps on average) [37, 38] is employed.
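A hedged sketch of this loss computation follows; the model interface returning Gaussian parameters and a recurrent hidden state is an assumption. Feeding the prediction back as the next input is what introduces the noisy surrogate inputs discussed above.

```python
import torch

def rollout_nll(model, x0, targets, hidden=None):
    """Sum of per-step Gaussian NLLs over a multi-step prediction, where
    each step consumes the previous *predicted* state as its input."""
    nll, x = 0.0, x0
    for target in targets:                     # truncated-BPTT window
        mean, log_std, hidden = model(x, hidden)
        dist = torch.distributions.Normal(mean, log_std.exp())
        nll = nll - dist.log_prob(target).sum(dim=-1).mean()
        x = mean                               # predicted state as next input
    return nll, hidden.detach()                # detach between windows (truncated BPTT)
```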

6.1.3 Reinforcement learning on Pybullet simulator

The third problem is RL performed on the Pybullet simulation engine using OpenAI Gym [39, 40] environments. The tasks to be solved are Hopper and Ant, both of which require an agent to move as straight as possible on flat terrain. As mentioned before, RL relies on estimated values owing to the absence of true target signals, which can easily introduce noise in the training.

The implemented RL algorithm is an actor-critic algorithm based on the literature [41]. The agent only learns after each episode using an experience replay. This experience replay samples 128 batches after each episode, and its buffer size is set to be sufficiently small (at 10,000) to reveal the influence of noise.

6.1.4 Policy distillation by behavioral cloning

The fourth problem is policy distillation [14], a method used to replicate the policy of a given RL agent by training a smaller and computationally more efficient network to perform at the expert level. Three policies for Ant properly trained on the RL problem above were considered as experts, and 10 trajectories of state-action pairs were collected for each policy. We also collected, as non-expert (or amateur) demonstrations, three trajectories from one policy that fell into a local solution. In the dataset constructed from these trajectories, not only the amateur trajectories but also the expert trajectories could be non-optimal in certain situations, thereby leading to noise.

The loss function is the negative log likelihood of the policy according to behavioral cloning [42], which is the simplest imitation learning method. In addition, a weight decay is added to prevent the small networks (i.e., the distillation target) from over-fitting particular behaviors.

6.2 Analysis of results of practical problems

Table 3: Experimental results: the best result for each benchmark is written in bold. The numbers in parentheses denote standard deviations. We can see that AdaTerm relaxes the requirement of manually tuning the robustness of the algorithm for every single task and interpolates between or prevails over the performance of non-robust and robust optimizers when facing practical problems (see section 6.2 for the detailed analysis).
Classification Distillation Prediction RL
Accuracy The sum of rewards MSE at final prediction The sum of rewards
Methods 0 % 10 % w/o amateur w/ amateur 1 step 30 steps Hopper Ant
Adam 0.7205 0.6811 1686.94 1401.01 0.0321 1.1591 1539.06 863.31
(4.33e-3) (4.93e-3) (2.34e+2) (2.39e+2) (4.48e-4) (1.04e-1) (5.64e+2) (4.01e+2)
AdaBelief 0.7227 0.6799 1731.51 1344.23 0.0320 1.2410 1152.10 692.68
(5.56e-3) (3.94e-3) (1.88e+2) (2.75e+2) (5.67e-4) (1.34e-1) (5.48e+2) (1.17e+2)
RAdam 0.7192 0.6797 1574.13 1258.10 0.0328 1.1218 1373.85 871.23
(4.28e-3) (3.41e-3) (2.86e+2) (2.70e+2) (3.55e-4) (1.36e-1) (7.55e+2) (3.57e+2)
t-Adam 0.7174 0.6803 1586.99 1347.29 0.0320 1.1048 1634.20 2272.58
(5.67e-3) (3.10e-3) (2.06e+2) (2.11e+2) (4.09e-4) (2.66e-1) (4.52e+2) (3.20e+2)
At-Adam 0.7178 0.6811 1611.23 1347.51 0.0319 1.1075 1523.37 2213.75
(4.21e-3) (4.03e-3) (2.25e+2) (2.67e+2) (3.14e-4) (1.97e-1) (5.48e+2) (2.31e+2)
AdaTerm 0.7315 0.6815 1770.17 1411.02 0.0335 1.0016 1550.25 2021.37
(Ours) (3.66e-3) (4.46e-3) (2.17e+2) (1.92e+2) (3.09e-4) (2.31e-1) (5.88e+2) (3.87e+2)

To analyze the experimental results, we divide the four tasks into two cases depending on the composition of the dataset and the origin of the inaccuracies:

  1. In the first class of problems, the corrupted dataset can be partitioned into two subsets of accurate and noisy observations, i.e., $\tilde{\mathcal{D}}=\mathcal{D}\cup\mathcal{E}$, where $\mathcal{E}\nsubseteq\mathcal{D}$ is the noisy subset.

  2. In the second class, the corruption extends to the entire dataset, either due to a stationary noise distribution or a dynamic noise distribution.

6.2.1 Case 1 Tasks: $\tilde{\mathcal{D}}=\mathcal{D}\cup\mathcal{E}$

This first case comprises the classification and distillation tasks. In the classification task, the subset of data (0% or 10%) with randomized labels constitutes the corrupted subset $\mathcal{E}$, and in the distillation task, $\mathcal{E}$ corresponds to the amateur demonstrations. This subdivision holds even though what we consider to be the uncorrupted data may contain noise (CIFAR-100 contains general images that are prone to imperfections, and similarly, the RL expert policies are stochastic), because these corruptions are not as problematic as the ones mentioned previously.

In these types of tasks, adaptiveness is paramount because the proportion of $\mathcal{E}$ is typically unknown in advance. As can be seen in the left part of Table 3, the non-robust optimizers (Adam, AdaBelief, and RAdam) outperform the fixed-robustness t-Adam and the low-adaptive-capability At-Adam when the proportion of $\mathcal{E}$ is small (0% and w/o amateur). Furthermore, although At-Adam exhibits only a low level of adaptiveness, this adaptiveness nevertheless results in better performance compared with t-Adam. In contrast, when the proportion of $\mathcal{E}$ increases (10% and w/ amateur), the robust optimizers lead in terms of average performance. Notably, At-Adam still outperforms t-Adam, which uses the most robust version of the algorithm with its degrees-of-freedom fixed at 1. Finally, we observe that AdaTerm's adaptive ability allows it to outperform both the non-robust and robust optimizers under all dataset conditions considered. Its better performance over non-robust optimizers even without corruption can be attributed to the natural imperfections mentioned earlier, implying that a small level of robustness ($\nu\ll\infty$) is still desirable during the optimization process even in the absence of notable corruption.

6.2.2 Case 2 Tasks: $\tilde{\mathcal{D}}=\mathcal{E}$

The prediction and RL tasks constitute the second case. Indeed, the estimators and approximators used in these types of tasks are a constant source of noise and corruption. Although such noise can be suspected to be stronger in the earlier stages of learning, its prevalence may necessitate an optimizer that maintains a constant robustness.

This preference can be observed on the right side of Table 3. Indeed, the robust optimizers lead by performance, achieving the lowest prediction errors in the 1-step and 30-step prediction and the highest scores in the Hopper and Ant environments.

Although the single-step prediction task is a relatively simple supervised learning problem that does not necessitate a high robustness to noise, the performance of AdaTerm is the worst among all algorithms. We postulate that, similar to the error norm of the McCormick function in Fig. 8, the exceedingly high learning rate caused by $\beta<\beta_{2}$ contributes to the performance degradation. In contrast, in the 30-step prediction task, only AdaTerm succeeds in achieving an MSE of approximately 1. As a consequence of the inaccurate estimated inputs, particularly in the earlier stages of learning, this task challenges the ability of the optimization algorithm to ignore aberrant gradient update directions. However, as learning progresses, the accuracy of the estimated inputs improves, and the noise robustness gradually becomes less critical.

For the RL tasks, although AdaTerm did not exhibit the best performance, we can confirm its usefulness from two facts: (i) it demonstrated the second-best performance in the Hopper task, and (ii) the Ant task was only solved successfully by the noise-robust optimizers. This particular task highlights a common drawback of adaptive methods compared with methods built on the right assumption (i.e., whether or not robustness will be needed in the considered problem). The inferior performance of the non-robust optimizers indicates the prevalence of noise in RL tasks. Therefore, t-Adam unsurprisingly exhibits the best performance in such a situation where robustness is critical. t-Adam was followed closely by At-Adam in the Ant task (clearly, judging from the results of the non-robust algorithms, the most challenging in terms of noisiness), thanks to At-Adam's low adaptiveness (similar to Fig. 9), which enabled it to maintain its robustness level close to that of t-Adam during the learning process. Regardless, At-Adam was outperformed by AdaTerm (and Adam) on Hopper.

In summary, the experiments show that AdaTerm interpolates between or prevails over the performance of non-robust and robust optimizers when encountering practical and noisy problems.

7 Discussion

The above experiments and simulations demonstrated the robustness and adaptability improvement of AdaTerm compared with the related works. However, we discuss its limitations below.

7.1 Performance and drawbacks of AdaTerm

As can be seen in Fig. 8 and in the right part of Table 3, there is a drawback to using AdaTerm on noise-free optimization problems when compared with a non-robust optimizer, such as Adam, and similarly on full-noise problems when compared with a fixed-robustness optimizer, such as t-Adam. These results show that in the absence of noise (respectively, in the presence of abundant noise), employing an optimization method that assumes the absence of aberrant data points (respectively, the presence of worst-case noisy data) from the beginning can afford better results on noise-free (respectively, extremely noisy) problems. This is to be expected of an adaptive method such as AdaTerm, which inevitably requires a non-zero adaptation time to converge to a non-robust behavior, or which may relax its robustness during the learning process before encountering an extremely noisy set of gradients necessitating a fixed high robustness.

Nevertheless, the overall results indicate that the penalty incurred by using AdaTerm instead of Adam or AdaBelief in noise-free applications, or instead of t-Adam in high-noise applications, is not significant when considering its ability to adapt to different unknown noise ratios. This is precisely why algorithms are equipped with adaptive parameters.

When encountering a potentially noise-free problem, AdaTerm allows setting a large initial value for the degrees-of-freedom, and during the course of the training process, the algorithm decides whether the dataset is indeed noise-free and adapts accordingly. Similarly, if there is uncertainty pertaining to the noisiness of the task at hand, the practitioner can directly execute the algorithm and let it automatically adapt to the problem. Such freedom is only possible thanks to the adaptive capability of AdaTerm, as clearly displayed in Fig. 9, and the resulting performance is, at worst, sub-optimal with respect to the true nature of the problem.

7.2 Gap between theoretical and experimental convergence analysis

Although Theorem 4.2 provides a theoretical upper bound on the regret achieved by AdaTerm, as is typical of convergence analyses, it completely eludes the robustness factor introduced by AdaTerm. In particular, Corollary 4.3 provides a tighter bound than Theorem 4.2, which stands in contrast to the practical results (as displayed in the ablation study of E). This implies a gap between the theoretical and experimental analyses, anchored in the fact that the theoretical bound relies on $\underline{\tau}$.

As a remedy to this shortcoming of the regret analysis, we have provided (in section 4.5.2) an intuitive analysis of the robustness factor based on the behavior of the different components of our algorithm. Although this qualitative analysis carries little theoretical weight regarding convergence, the experimental analysis against different noise settings shows that AdaTerm is not only robust but also efficient as an optimization algorithm. Nevertheless, we acknowledge the need for a stronger theoretical analysis that considers the noisiness of the gradients while allowing for a theoretical comparison between the robustness and efficiency of different optimizers.

7.3 Normalization of gradient

By considering the gradients to be generated from a Student's t-distribution, AdaTerm normalizes the gradients (more precisely, their first-order moment) using the estimated scale, instead of the second-order moment. This is similar to the normalization of AdaBelief, as mentioned before. However, as shown in G, the normalization with the second-order moment can exhibit better performance in certain circumstances.

Since the second-order moment is larger than the scale, normalization by the second-order moment renders the update conservative, while normalization by the scale is expected to break through the stagnation of updates. While both characteristics are important, the answer to the question "which one is desirable?" remains situation-dependent. Therefore, we need to pursue the theory concerning this point and introduce a mechanism to adaptively switch between both normalization approaches according to the required characteristics.

8 Conclusion

In this study, we presented AdaTerm, an optimizer adaptively robust to noise and outliers for deep learning. AdaTerm models the stochastic gradients generated during training using a multivariate diagonal Student's t-distribution. Accordingly, the distribution parameters are optimized through surrogate maximum log-likelihood estimation. Optimization of test functions reveals that AdaTerm can adjust its robustness to noise in accordance with the impact of the noise. Through the four typical benchmarks, we confirmed that the robustness to noise and the learning performance of AdaTerm are comparable to or better than those of conventional optimizers. In addition, we derived a new regret bound for Adam-like optimizers without incorporating AMSGrad.

This study focused on computing the moments of the t-distribution; however, in recent years, the importance of integration with raw SGD (i.e., decay of scaling in Adam-like optimizers) has been confirmed [43, 17]. Therefore, we will investigate a natural integration of AdaTerm and raw SGD by reviewing $\Delta s$, which may enable the normalization to be constant. In addition, as mentioned in the discussion, a new framework for analyzing optimization algorithms in terms of both robustness and efficiency (such that they can be compared) is required. One such analysis was performed by [44] on SGD; however, its extension to momentum-based optimizers and its ability to allow theoretical comparison across different algorithms remain limited. Therefore, we will seek a similar but better approach in the future, e.g., via trajectory analysis [45], that can be applied to analyze the robustness and efficiency of different optimization algorithms, including AdaTerm. Finally, latent feature analysis methods [46] have attracted extensive attention in recent years. The use of AdaTerm in such approaches is therefore an interesting field of study that we will investigate in the future.

Acknowledgements

This work was supported by JSPS KAKENHI, Grant-in-Aid for Scientific Research (B), Grant Number JP20H04265.

References

  • [1] Y. LeCun, Y. Bengio, G. Hinton, Deep learning, Nature 521 (7553) (2015) 436–444.
  • [2] H. Robbins, S. Monro, A stochastic approximation method, The Annals of Mathematical Statistics (1951) 400–407.
  • [3] K. He, X. Zhang, S. Ren, J. Sun, Deep residual learning for image recognition, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [4] D. P. Kingma, J. Ba, Adam: A method for stochastic optimization, arXiv preprint arXiv:1412.6980 (2014).
  • [5] S. Sun, Z. Cao, H. Zhu, J. Zhao, A survey of optimization methods from a machine learning perspective, IEEE Transactions on Cybernetics 50 (8) (2019) 3668–3681.
  • [6] R. M. Schmidt, F. Schneider, P. Hennig, Descending through a crowded valley-benchmarking deep learning optimizers, in: International Conference on Machine Learning, PMLR, 2021, pp. 9367–9376.
  • [7] L. Liu, H. Jiang, P. He, W. Chen, X. Liu, J. Gao, J. Han, On the variance of the adaptive learning rate and beyond, arXiv preprint arXiv:1908.03265 (2019).
  • [8] J. Zhuang, T. Tang, Y. Ding, S. C. Tatikonda, N. Dvornek, X. Papademetris, J. Duncan, Adabelief optimizer: Adapting stepsizes by the belief in observed gradients, Advances in Neural Information Processing Systems 33 (2020) 18795–18806.
  • [9] K. Mirylenka, G. Giannakopoulos, L. M. Do, T. Palpanas, On classifier behavior in the presence of mislabeling noise, Data Mining and Knowledge Discovery 31 (2017) 661–701.
  • [10] M. Suchi, T. Patten, D. Fischinger, M. Vincze, Easylabel: A semi-automatic pixel-wise object annotation tool for creating robotic rgb-d datasets, in: International Conference on Robotics and Automation, IEEE, 2019, pp. 6678–6684.
  • [11] R. T. Chen, Y. Rubanova, J. Bettencourt, D. Duvenaud, Neural ordinary differential equations, Vol. 31, 2018, pp. 6572–6583.
  • [12] M. Kishida, M. Ogura, Y. Yoshida, T. Wadayama, Deep learning-based average consensus, IEEE Access 8 (2020) 142404–142412.
  • [13] R. S. Sutton, A. G. Barto, Reinforcement learning: An introduction, MIT press, 2018.
  • [14] A. A. Rusu, S. G. Colmenarejo, C. Gulcehre, G. Desjardins, J. Kirkpatrick, R. Pascanu, V. Mnih, K. Kavukcuoglu, R. Hadsell, Policy distillation, arXiv preprint arXiv:1511.06295 (2015).
  • [15] J. Gou, B. Yu, S. J. Maybank, D. Tao, Knowledge distillation: A survey, International Journal of Computer Vision 129 (6) (2021) 1789–1819.
  • [16] U. Simsekli, L. Sagun, M. Gurbuzbalaban, A tail-index analysis of stochastic gradient noise in deep neural networks, in: International Conference on Machine Learning, PMLR, 2019, pp. 5827–5837.
  • [17] P. Zhou, J. Feng, C. Ma, C. Xiong, S. C. H. Hoi, et al., Towards theoretically understanding why sgd generalizes better than adam in deep learning, Advances in Neural Information Processing Systems 33 (2020) 21285–21296.
  • [18] W. E. L. Ilboudo, T. Kobayashi, K. Sugimoto, Robust stochastic gradient descent with student-t distribution based first-order momentum, IEEE Transactions on Neural Networks and Learning Systems 33 (3) (2020) 1324–1337.
  • [19] W. E. L. Ilboudo, T. Kobayashi, K. Sugimoto, Adaptive t-momentum-based optimization for unknown ratio of outliers in amateur data in imitation learning, in: IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, 2021, pp. 7851–7857.
  • [20] C. Ley, A. Neven, The value at the mode in multivariate $t$ distributions: a curiosity or not?, arXiv preprint arXiv:1211.1174 (2012).
  • [21] S. J. Reddi, S. Kale, S. Kumar, On the convergence of adam and beyond, arXiv preprint arXiv:1904.09237 (2019).
  • [22] A. Alacaoglu, Y. Malitsky, P. Mertikopoulos, V. Cevher, A new regret analysis for adam-type algorithms, in: International Conference on Machine Learning, PMLR, 2020, pp. 202–210.
  • [23] C. Gulcehre, J. Sotelo, M. Moczulski, Y. Bengio, A robust adaptive stochastic gradient method for deep learning, in: International Joint Conference on Neural Networks, IEEE, 2017, pp. 125–132.
  • [24] M. J. Holland, K. Ikeda, Efficient learning with robust gradient descent, Machine Learning 108 (8) (2019) 1523–1560.
  • [25] A. Prasad, A. S. Suggala, S. Balakrishnan, P. Ravikumar, Robust estimation via robust gradient estimation, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 82 (3) (2020) 601–627.
  • [26] K.-S. Kim, Y.-S. Choi, Hyadamc: A new adam-based hybrid optimization algorithm for convolution neural networks, Sensors 21 (12) (2021) 4054.
  • [27] C. Aeschliman, J. Park, A. C. Kak, A novel parameter estimation algorithm for the multivariate t-distribution and its application to computer vision, in: European Conference on Computer Vision, Springer, 2010, pp. 594–607.
  • [28] L. Ziyin, K. Liu, T. Mori, M. Ueda, Strength of minibatch noise in sgd, arXiv preprint arXiv:2102.05375 (2021).
  • [29] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, A. Lerer, Automatic differentiation in pytorch, in: Advances in Neural Information Processing Systems Workshop, 2017.
  • [30] A. Beck, M. Teboulle, Mirror descent and nonlinear projected subgradient methods for convex optimization, Operations Research Letters 31 (3) (2003) 167–175.
  • [31] E. Gorbunov, M. Danilova, A. Gasnikov, Stochastic optimization with heavy-tailed noise via accelerated gradient clipping, Advances in Neural Information Processing Systems 33 (2020) 15042–15053.
  • [32] T. Kobayashi, W. E. L. Ilboudo, t-soft update of target network for deep reinforcement learning, Neural Networks 136 (2021) 63–71.
  • [33] W. Shang, K. Sohn, D. Almeida, H. Lee, Understanding and improving convolutional neural networks via concatenated rectified linear units, in: International Conference on Machine Learning, PMLR, 2016, pp. 2217–2225.
  • [34] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, Z. Wojna, Rethinking the inception architecture for computer vision, in: IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
  • [35] M. Lukasik, S. Bhojanapalli, A. Menon, S. Kumar, Does label smoothing mitigate label noise?, in: International Conference on Machine Learning, PMLR, 2020, pp. 6448–6458.
  • [36] T. Kobayashi, q-vae for disentangled representation learning and latent dynamical systems, IEEE Robotics and Automation Letters 5 (4) (2020) 5669–5676.
  • [37] G. Puskorius, L. Feldkamp, Truncated backpropagation through time and kalman filter training for neurocontrol, in: IEEE International Conference on Neural Networks, Vol. 4, IEEE, 1994, pp. 2488–2493.
  • [38] C. Tallec, Y. Ollivier, Unbiasing truncated backpropagation through time, arXiv preprint arXiv:1705.08209 (2017).
  • [39] G. Brockman, V. Cheung, L. Pettersson, J. Schneider, J. Schulman, J. Tang, W. Zaremba, Openai gym, arXiv preprint arXiv:1606.01540 (2016).
  • [40] E. Coumans, Y. Bai, Pybullet, a python module for physics simulation for games, robotics and machine learning, GitHub repository (2016).
  • [41] T. Kobayashi, Proximal policy optimization with adaptive threshold for symmetric relative density ratio, Results in Control and Optimization 10 (2023) 100192.
  • [42] M. Bain, C. Sammut, A framework for behavioural cloning., in: Machine Intelligence 15, 1995, pp. 103–129.
  • [43] L. Luo, Y. Xiong, Y. Liu, X. Sun, Adaptive gradient methods with dynamic bound of learning rate, arXiv preprint arXiv:1902.09843 (2019).
  • [44] K. Scaman, C. Malherbe, Robustness analysis of non-convex stochastic gradient descent using biased expectations, Advances in Neural Information Processing Systems 33 (2020) 16377–16387.
  • [45] M. Sandler, A. Zhmoginov, M. Vladymyrov, N. Miller, Training trajectories, mini-batch losses and the curious role of the learning rate, arXiv preprint arXiv:2301.02312 (2023).
  • [46] X. Luo, Y. Yuan, S. Chen, N. Zeng, Z. Wang, Position-transitional particle swarm optimization-incorporated latent factor analysis, IEEE Transactions on Knowledge and Data Engineering 34 (8) (2020) 3958–3970.
  • [47] J. Chung, C. Gulcehre, K. Cho, Y. Bengio, Empirical evaluation of gated recurrent neural networks on sequence modeling, arXiv preprint arXiv:1412.3555 (2014).
  • [48] J. L. Ba, J. R. Kiros, G. E. Hinton, Layer normalization, arXiv preprint arXiv:1607.06450 (2016).
  • [49] J. Xu, X. Sun, Z. Zhang, G. Zhao, J. Lin, Understanding and improving layer normalization, Advances in Neural Information Processing Systems 32 (2019) 4381–4391.
  • [50] S. Elfwing, E. Uchibe, K. Doya, Sigmoid-weighted linear units for neural network function approximation in reinforcement learning, Neural Networks 107 (2018) 3–11.
  • [51] T. Kobayashi, Student-t policy in reinforcement learning to acquire global optimum of robot control, Applied Intelligence 49 (12) (2019) 4335–4347.
  • [52] T. De Ryck, S. Lanthaler, S. Mishra, On the approximation of functions by tanh neural networks, Neural Networks 143 (2021) 732–750.
  • [53] C. Lee, K. Cho, W. Kang, Directional analysis of stochastic gradient descent via von mises-fisher distributions in deep learning, arXiv preprint arXiv:1810.00150 (2018).
  • [54] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, P. T. P. Tang, On large-batch training for deep learning: Generalization gap and sharp minima, arXiv preprint arXiv:1609.04836 (2016).

Appendix A Proof of regret bounds

Below is the proof and the detailed analysis of the AdaTerm algorithm. Following the literature, we ignore the bias correction term. We also consider a scheduled learning rate $\alpha_{t}=\alpha(\sqrt{t})^{-1}$ and the non-expansive weighted projection operator $\Pi_{\mathcal{F},V}(\theta)=\arg\min_{\theta^{\prime}\in\mathcal{F}}\left\lVert\theta^{\prime}-\theta\right\rVert_{V}$. The use of the projection operator is not necessary for deriving the upper bound; in fact, no projection is used in Algorithm 1. Regardless, it provides a more general setting, and therefore, following the literature, we consider its potential use in our proof.

A.1 Preliminary

We start by stating intermediary results that we shall use later to bound the regret.

Lemma A.1 (Peter–Paul inequality, or Young's inequality with $\zeta$ and exponent 2).

$\forall\zeta>0$ and $\forall(x,y)\in\mathbb{R}^{2}$, we have that $xy\leq\frac{\zeta}{2}x^{2}+\frac{1}{2\zeta}y^{2}$.
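For completeness, the lemma follows directly from the non-negativity of a square:

$0\leq\left(\sqrt{\zeta}\,x-\frac{y}{\sqrt{\zeta}}\right)^{2}=\zeta x^{2}-2xy+\frac{y^{2}}{\zeta}\;\Longrightarrow\;xy\leq\frac{\zeta}{2}x^{2}+\frac{1}{2\zeta}y^{2}.$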

From which we obtain the following results:

Corollary A.2 (Partial bound of $\langle m_{t},\theta_{t}-\theta^{*}\rangle$).

Let $(m_{t})_{t\geq 0}$, $(v_{t})_{t\geq 0}$, and $(\theta_{t})_{t\geq 0}$ be the sequences of moments, variances, and iterates produced by AdaTerm with a scheduled learning rate $\alpha_{t}>0,\ \forall t$. Then, we have:

$\langle m_{t},\theta_{t}-\theta^{*}\rangle\leq\frac{D^{2}}{2\alpha_{t}}\left\lVert v_{t}^{1/4}\right\rVert^{2}+\frac{1}{2}\alpha_{t}\left\lVert m_{t}\right\rVert^{2}_{v_{t}^{1/2}}$ (39)
Proof.

This follows from the application of Lemma A.1 with the identification

$\zeta=\frac{v_{t,i}^{1/2}}{\alpha_{t}},\quad x_{i}=(\theta_{t,i}-\theta^{*}_{i}),\quad y_{i}=m_{t,i},$

combined with the fact that $\forall i\in[d]$, $(\theta_{t,i}-\theta^{*}_{i})\leq D$. ∎

A.2 Introduction

Let $m_{t}$ and $v_{t}$ be the first- and second-order moments produced by AdaTerm; then:

$m_{t}=(1-\tau_{t})m_{t-1}+\tau_{t}g_{t}\;,\qquad v_{t}=(1-\tau_{t})v_{t-1}+\tau_{t}\left((g_{t}-m_{t-1})^{2}+\Delta s\right)$ (40)

From the definition of $m_{t}$, we can derive the following:

$g_{t}=\frac{1}{\tau_{t}}m_{t}-\frac{1-\tau_{t}}{\tau_{t}}m_{t-1}$ (41)

which leads to:

$\langle g_{t},\theta_{t}-\theta^{*}\rangle=\frac{1}{\tau_{t}}\langle m_{t},\theta_{t}-\theta^{*}\rangle-\frac{1-\tau_{t}}{\tau_{t}}\langle m_{t-1},\theta_{t}-\theta^{*}\rangle=\frac{1}{\tau_{t}}\langle m_{t},\theta_{t}-\theta^{*}\rangle-\frac{1-\tau_{t}}{\tau_{t}}\langle m_{t-1},\theta_{t-1}-\theta^{*}\rangle-\frac{1-\tau_{t}}{\tau_{t}}\langle m_{t-1},\theta_{t}-\theta_{t-1}\rangle$ (42)

$=\frac{1}{\tau_{t}}\left(\langle m_{t},\theta_{t}-\theta^{*}\rangle-\langle m_{t-1},\theta_{t-1}-\theta^{*}\rangle\right)+\langle m_{t-1},\theta_{t-1}-\theta^{*}\rangle-\frac{1-\tau_{t}}{\tau_{t}}\langle m_{t-1},\theta_{t}-\theta_{t-1}\rangle$ (43)

We therefore have that:

$\sum_{t=1}^{T}\langle g_{t},\theta_{t}-\theta^{*}\rangle=R_{1,T}+R_{2,T}+R_{3,T}$ (44)

$R_{1,T}=\sum_{t=1}^{T}\langle m_{t-1},\theta_{t-1}-\theta^{*}\rangle=\sum_{t=1}^{T-1}\langle m_{t},\theta_{t}-\theta^{*}\rangle\quad(\text{since}\;m_{0}=0)$ (45)

$R_{2,T}=-\sum_{t=1}^{T}\frac{1-\tau_{t}}{\tau_{t}}\langle m_{t-1},\theta_{t}-\theta_{t-1}\rangle=\sum_{t=1}^{T}\frac{1-\tau_{t}}{\tau_{t}}\langle m_{t-1},\theta_{t-1}-\theta_{t}\rangle$ (46)

$R_{3,T}=\sum_{t=1}^{T}\frac{1}{\tau_{t}}\left(\langle m_{t},\theta_{t}-\theta^{*}\rangle-\langle m_{t-1},\theta_{t-1}-\theta^{*}\rangle\right)$ (47)

A.3 Bound of $R_{1,T}$

Using Corollary A.2, we have that:

$R_{1,T}=\sum_{t=1}^{T-1}\langle m_{t},\theta_{t}-\theta^{*}\rangle\leq\sum_{t=1}^{T-1}\frac{D^{2}}{2\alpha_{t}}\left\lVert v_{t}^{1/4}\right\rVert^{2}+\frac{1}{2}\sum_{t=1}^{T-1}\alpha_{t}\left\lVert m_{t}\right\rVert^{2}_{v_{t}^{1/2}}$ (48)

A.4 Bound of $R_{2,T}$

With $\tilde{m}_{t-1}=\frac{1-\tau_{t}}{\tau_{t}}m_{t-1}$ and by using $m_{0}=0$, we have:

$R_{2,T}=\sum_{t=1}^{T}\langle\tilde{m}_{t-1},\theta_{t-1}-\theta_{t}\rangle=\sum_{t=2}^{T}\langle\tilde{m}_{t-1},\theta_{t-1}-\theta_{t}\rangle=\sum_{t=1}^{T-1}\langle\tilde{m}_{t},\theta_{t}-\theta_{t+1}\rangle$

Then, by applying Hölder's inequality, followed by the use of the projection operator with $\theta_{t}=\Pi_{\mathcal{F},\hat{v}_{t}^{1/2}}(\theta_{t})$ (since $\theta_{t}\in\mathcal{F}$), we obtain:

$R_{2,T}\leq\sum_{t=1}^{T-1}\left\lVert\tilde{m}_{t}\right\rVert_{\hat{v}_{t}^{-1/2}}\left\lVert\theta_{t}-\theta_{t+1}\right\rVert_{\hat{v}_{t}^{1/2}}=\sum_{t=1}^{T-1}\left\lVert\tilde{m}_{t}\right\rVert_{\hat{v}_{t}^{-1/2}}\left\lVert\Pi_{\mathcal{F},\hat{v}_{t}^{1/2}}(\theta_{t})-\Pi_{\mathcal{F},\hat{v}_{t}^{1/2}}(\theta_{t}-\alpha_{t}\hat{v}_{t}^{-1/2}m_{t})\right\rVert_{\hat{v}_{t}^{1/2}}$

The non-expansive property of the operator allows the following:

$R_{2,T}\leq\sum_{t=1}^{T-1}\left\lVert\tilde{m}_{t}\right\rVert_{\hat{v}_{t}^{-1/2}}\left\lVert\theta_{t}-(\theta_{t}-\alpha_{t}\hat{v}_{t}^{-1/2}m_{t})\right\rVert_{\hat{v}_{t}^{1/2}}=\sum_{t=1}^{T-1}\left\lVert\tilde{m}_{t}\right\rVert_{\hat{v}_{t}^{-1/2}}\left\lVert\alpha_{t}\hat{v}_{t}^{-1/2}m_{t}\right\rVert_{\hat{v}_{t}^{1/2}}=\sum_{t=1}^{T-1}\alpha_{t}\left\lVert\tilde{m}_{t}\right\rVert_{\hat{v}_{t}^{-1/2}}\left\lVert m_{t}\right\rVert_{\hat{v}_{t}^{-1/2}}$

Finally, by restoring $\tilde{m}_{t}=\frac{1-\tau_{t+1}}{\tau_{t+1}}m_{t}$ and using $\frac{1-\tau_{t+1}}{\tau_{t+1}}\leq\frac{1-\underline{\tau}}{\underline{\tau}}$, we have:

$R_{2,T}\leq\sum_{t=1}^{T-1}\alpha_{t}\frac{1-\tau_{t+1}}{\tau_{t+1}}\left\lVert m_{t}\right\rVert_{\hat{v}_{t}^{-1/2}}^{2}\leq\frac{1-\underline{\tau}}{\underline{\tau}}\sum_{t=1}^{T-1}\alpha_{t}\left\lVert m_{t}\right\rVert_{\hat{v}_{t}^{-1/2}}^{2}$ (49)

A.5 Bound of $R_{3,T}$

$R_{3,T}=\sum_{t=1}^{T}\frac{1}{\tau_{t}}\left(\langle m_{t},\theta_{t}-\theta^{*}\rangle-\langle m_{t-1},\theta_{t-1}-\theta^{*}\rangle\right)=\sum_{t=1}^{T}A_{t}$

$A_{t}=\frac{1}{\tau_{t}}\langle m_{t},\theta_{t}-\theta^{*}\rangle-\frac{1}{\tau_{t-1}}\langle m_{t-1},\theta_{t-1}-\theta^{*}\rangle+\left(\frac{\tau_{t}-\tau_{t-1}}{\tau_{t}\tau_{t-1}}\right)\langle m_{t-1},\theta_{t-1}-\theta^{*}\rangle$

$\leq\frac{1}{\tau_{t}}\langle m_{t},\theta_{t}-\theta^{*}\rangle-\frac{1}{\tau_{t-1}}\langle m_{t-1},\theta_{t-1}-\theta^{*}\rangle+\left(\frac{|\tau_{t}-\tau_{t-1}|}{2\tau_{t}\tau_{t-1}}\right)\left\{\frac{D^{2}}{\alpha_{t-1}}\left\lVert v_{t-1}^{1/4}\right\rVert^{2}+\alpha_{t-1}\left\lVert m_{t-1}\right\rVert^{2}_{v_{t-1}^{1/2}}\right\}$

where the last line is obtained by applying Corollary A.2 at time step $t-1$. Finally, using the telescoping sum over the first two terms in $A_{t}$, with the particularity that $m_{0}=0$, combined with $|\tau_{t}-\tau_{t-1}|\leq(1-\beta)-\underline{\tau}=\left[1-(\beta+\underline{\tau})\right]$, we arrive at the following relation:

$\sum_{t=1}^{T}A_{t}\leq\frac{1}{\tau_{T}}\langle m_{T},\theta_{T}-\theta^{*}\rangle+\left(\frac{1-(\beta+\underline{\tau})}{2\underline{\tau}^{2}}\right)\left\{\sum_{t=1}^{T}\frac{D^{2}}{\alpha_{t-1}}\left\lVert v_{t-1}^{1/4}\right\rVert^{2}+\sum_{t=1}^{T}\alpha_{t-1}\left\lVert m_{t-1}\right\rVert^{2}_{v_{t-1}^{1/2}}\right\}$

$R_{3,T}\leq\frac{1}{\tau_{T}}\langle m_{T},\theta_{T}-\theta^{*}\rangle+\left(\frac{1-(\beta+\underline{\tau})}{2\underline{\tau}^{2}}\right)\left\{\sum_{t=1}^{T-1}\frac{D^{2}}{\alpha_{t}}\left\lVert v_{t}^{1/4}\right\rVert^{2}+\sum_{t=1}^{T-1}\alpha_{t}\left\lVert m_{t}\right\rVert^{2}_{v_{t}^{1/2}}\right\}$ (50)

A.6 Bound of $\frac{1}{\tau_{T}}\langle m_{T},\theta_{T}-\theta^{*}\rangle$

Applying Lemma A.1 with $\zeta=\frac{v_{T,i}^{1/2}}{2\alpha_{T}}$, $x_{i}=(\theta_{T,i}-\theta^{*}_{i})$, and $y_{i}=m_{T,i}$, we get:

$\frac{1}{\tau_{T}}\langle m_{T},\theta_{T}-\theta^{*}\rangle\leq\frac{1}{\tau_{T}}\left[\frac{D^{2}}{4\alpha_{T}}\left\lVert v_{T}^{1/4}\right\rVert^{2}+\alpha_{T}\left\lVert m_{T}\right\rVert^{2}_{v_{T}^{1/2}}\right]\leq\frac{1}{\tau_{T}}\frac{D^{2}}{4\alpha_{T}}\left\lVert v_{T}^{1/4}\right\rVert^{2}+\frac{(1-\beta)^{2}\alpha_{T}}{\epsilon\underline{\tau}^{2}}\sum_{i=1}^{d}\sum_{k=1}^{T}(1-\underline{\tau})^{T-k}g^{2}_{k,i}$ (51)

A.7 Bound of $\sum_{t=1}^{T-1}\alpha_{t}\left\lVert m_{t}\right\rVert_{\hat{v}_{t}^{-1/2}}^{2}$

With $\alpha_{t}=\frac{\alpha}{\sqrt{t}}$, we have:

$\sum_{t=1}^{T-1}\alpha_{t}\left\lVert m_{t}\right\rVert_{\hat{v}_{t}^{-1/2}}^{2}=\sum_{t=1}^{T-1}\alpha_{t}\left\lVert\hat{v}_{t}^{-1/4}m_{t}\right\rVert^{2}_{2}\leq\sum_{t=1}^{T-1}\frac{\alpha}{\epsilon\sqrt{t}}\left\lVert m_{t}\right\rVert^{2}=\sum_{t=1}^{T-1}\frac{\alpha}{\epsilon\sqrt{t}}\sum_{i=1}^{d}\left(\sum_{k=1}^{t}\tau_{k}g_{k,i}\prod_{j=1}^{t-k}(1-\tau_{k+j})\right)^{2}$

$\leq\sum_{t=1}^{T-1}\frac{(1-\beta)^{2}\alpha}{\epsilon\sqrt{t}}\sum_{i=1}^{d}\left(\sum_{k=1}^{t}|g_{k,i}|(1-\underline{\tau})^{t-k}\right)^{2}\leq\sum_{t=1}^{T-1}\frac{(1-\beta)^{2}\alpha}{\epsilon\sqrt{t}}\sum_{i=1}^{d}\left(\sum_{k=1}^{t}(1-\underline{\tau})^{t-k}\right)\left(\sum_{k=1}^{t}(1-\underline{\tau})^{t-k}g_{k,i}^{2}\right)$

$\leq\frac{(1-\beta)^{2}\alpha}{\epsilon}\sum_{t=1}^{T-1}\left(\frac{1-(1-\underline{\tau})^{t}}{\underline{\tau}}\right)\sum_{i=1}^{d}\sum_{k=1}^{t}(1-\underline{\tau})^{t-k}g_{k,i}^{2}\frac{1}{\sqrt{t}}\leq\frac{(1-\beta)^{2}\alpha}{\epsilon\underline{\tau}}\sum_{t=1}^{T-1}\sum_{i=1}^{d}\sum_{k=1}^{t}(1-\underline{\tau})^{t-k}g_{k,i}^{2}\frac{1}{\sqrt{t}}$

$=\frac{(1-\beta)^{2}\alpha}{\epsilon\underline{\tau}}\sum_{i=1}^{d}\sum_{t=1}^{T-1}g_{t,i}^{2}\sum_{k=t}^{T-1}(1-\underline{\tau})^{k-t}\frac{1}{\sqrt{k}}\leq\frac{(1-\beta)^{2}\alpha}{\epsilon\underline{\tau}}\sum_{i=1}^{d}\sum_{t=1}^{T-1}g_{t,i}^{2}\sum_{k=t}^{T-1}(1-\underline{\tau})^{k-t}\frac{1}{\sqrt{t}}$

$=\frac{(1-\beta)^{2}\alpha}{\epsilon\underline{\tau}}\sum_{i=1}^{d}\sum_{t=1}^{T-1}g_{t,i}^{2}\frac{1}{\sqrt{t}}\left(\frac{1-(1-\underline{\tau})^{T-t}}{\underline{\tau}}\right)\leq\frac{(1-\beta)^{2}\alpha}{\epsilon\underline{\tau}^{2}}\sum_{i=1}^{d}\sum_{t=1}^{T-1}g_{t,i}^{2}\frac{1}{\sqrt{t}}\quad\left(\text{since}\;(1-(1-\underline{\tau})^{T-t})<1\right)$

Finally, using the Cauchy–Schwarz inequality (without the squares) with $u_{t}=g_{t,i}^{2}$ and $v_{t}=\frac{1}{\sqrt{t}}$, and with $\sum_{t=1}^{T-1}\frac{1}{t}\leq 1+\ln(T-1)$, we get:

$\sum_{t=1}^{T-1}\alpha_{t}\left\lVert m_{t}\right\rVert_{\hat{v}_{t}^{-1/2}}^{2}\leq\frac{\alpha(1-\beta)^{2}}{\epsilon\underline{\tau}^{2}}\sum_{i=1}^{d}\left\lVert g_{1:T-1,i}^{2}\right\rVert_{2}\sqrt{\sum_{t=1}^{T-1}\frac{1}{t}}\leq\frac{\alpha(1-\beta)^{2}}{\epsilon\underline{\tau}^{2}}\sqrt{1+\ln(T-1)}\sum_{i=1}^{d}\left\lVert g_{1:T-1,i}^{2}\right\rVert_{2}$ (52)

where $\left\lVert g_{1:T-1,i}^{2}\right\rVert_{2}=\sqrt{\sum_{t=1}^{T-1}g_{t,i}^{4}}\leq G_{\infty}^{2}\sqrt{T-1}$, with $G_{\infty}=\max_{t\in[T-1]}\left\lVert g_{t}\right\rVert_{\infty}$.

A.8 Bound of the regret

With

$R_{T}=\sum_{t=1}^{T}\mathcal{L}_{\mathcal{B}_{t}}(\theta_{t})-\mathcal{L}_{\mathcal{B}_{t}}(\theta^{*})\leq\sum_{t=1}^{T}\langle g_{t},\theta_{t}-\theta^{*}\rangle=R_{1,T}+R_{2,T}+R_{3,T},$

the bound of the regret is obtained by combining the upper bounds of $R_{1,T}$, $R_{2,T}$, and $R_{3,T}$ and through a straightforward factorization:

$R_{T}\leq\frac{D^{2}\sqrt{T}}{4\tau_{T}\alpha}\sum_{i=1}^{d}v_{T,i}^{1/2}+\left[\frac{\underline{\tau}^{2}+1-(\beta+\underline{\tau})}{2\underline{\tau}^{2}}\right]\sum_{t=1}^{T-1}\frac{D^{2}}{\alpha_{t}}\sum_{i=1}^{d}v_{t,i}^{1/2}+\left[\frac{(1-\beta)^{2}\alpha}{\epsilon\underline{\tau}^{2}\sqrt{T}}\right]\sum_{k=1}^{T}\sum_{i=1}^{d}(1-\underline{\tau})^{T-k}g^{2}_{k,i}+\left[\frac{1}{2}+\frac{1-\underline{\tau}}{\underline{\tau}}+\frac{1-(\beta+\underline{\tau})}{2\underline{\tau}^{2}}\right]\left[\frac{(1-\beta)^{2}\alpha}{\epsilon\underline{\tau}^{2}}\right]\sqrt{1+\ln(T-1)}\sum_{i=1}^{d}\left\lVert g_{1:T-1,i}^{2}\right\rVert_{2}$ (53)

Appendix B Visualization of the key elements of AdaTerm

Figures 11 and 12 show the visual description of the interpolation factor $\tau_{mv}$ and of the gradient-ascent increment $\kappa_{\Delta\tilde{\nu}}g_{\tilde{\nu}}$ for the robustness parameter $\tilde{\nu}$, respectively. As can be seen in Fig. 11, for small deviations $D_{t}$, $\tau_{mv}\to(1-\beta)$, and conversely, when $D_{t}$ is large, $\tau_{mv}\to 0$ to alleviate the influence of the aberrant gradient on both $m_{t}$ and $v_{t}$. Similarly, Fig. 12 shows how the degrees-of-freedom is incremented based on the value of $D_{t}$. The blue region corresponds to a region where $\lambda_{t}$ is larger than the current robustness value $\tilde{\nu}_{t-1}$, leading to an increase in $\tilde{\nu}_{t}$. Conversely, the orange region, corresponding to large $D_{t}$, prompts a decrease in the value of $\tilde{\nu}_{t}$ because $\kappa_{\Delta\tilde{\nu}}g_{\tilde{\nu}}$ is negative.

Figure 11: Visualization of $\tau_{mv}$ against $\tilde{\nu}_{t-1}$ and $D_{t}$ values (with $\beta=0.9$)
Figure 12: Visualization of $\kappa_{\Delta\tilde{\nu}}g_{\tilde{\nu}}$ against $\tilde{\nu}_{t-1}$ and $D_{t}$ values (with $\beta=0.9$)

Appendix C Details of test functions

The following test functions $f(x,y)$ were tested: Rosenbrock with a valley, McCormick with a plate, and Michalewicz with steep drops, as defined below.

$f(x,y)=\begin{cases}100(y-x^{2})^{2}+(x-1)^{2}&\text{Rosenbrock}\\\sin(x+y)+(x-y)^{2}-1.5x+2.5y+1&\text{McCormick}\\-\sin(x)\sin^{20}(x^{2}/\pi)-\sin(y)\sin^{20}(2y^{2}/\pi)&\text{Michalewicz}\end{cases}$ (54)

Using the gradients of $f(x,y)$, $(x,y)$ can be optimized to minimize $f(x,y)$. The initial values of $(x,y)$ were fixed to $(-2,2)$, $(4,-3)$, and $(1,1)$, respectively. The noise added to $(x,y)$ for computing the gradients was sampled from the uniform distribution with range $(-0.1,0.1)$. The probability of noise was set to six conditions: $\{0,1,2.5,5,10,15\}$%.

The following optimizers were tested: Adam, At-Adam with adaptive $\nu$, and the proposed AdaTerm. AdaBelief [8], similar to AdaTerm with $\nu\to\infty$, was also tested but excluded from Figs. 8 and 9 owing to its similarity to Adam. The learning rate was set to $\alpha=0.01$, which is higher than that for typical network updates; the other hyper-parameters were left at their default settings. The update was performed 15000 times, and the error norm between the final $(x,y)$ and the analytically-optimal point was computed. To evaluate the statistics, 100 trials were run using different random seeds.
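A minimal sketch of this benchmark protocol for the Rosenbrock field follows (the gradient oracle and noise injection as described above; the optimizer itself, which consumes the noisy gradient queries, is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def rosenbrock_grad(x, y):
    """Analytical gradient of the Rosenbrock field in Eq. (54)."""
    return np.array([-400 * x * (y - x**2) + 2 * (x - 1),
                     200 * (y - x**2)])

def noisy_grad(theta, noise_prob):
    """With the given probability, perturb the coordinates with
    Uniform(-0.1, 0.1) noise before evaluating the gradient."""
    if rng.random() < noise_prob:
        theta = theta + rng.uniform(-0.1, 0.1, size=2)
    return rosenbrock_grad(*theta)

theta = np.array([-2.0, 2.0])            # fixed initial point for Rosenbrock
g = noisy_grad(theta, noise_prob=0.15)   # one noisy gradient query at 15% noise
# An optimizer (Adam, At-Adam, AdaTerm, ...) would consume 15000 such queries.
```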

Trajectories of the convergence process are illustrated in Figs. 13, 14, and 15. Note that the color of each point indicates the elapsed time, changing from blue to red. AdaTerm was less likely than the others to update in an incorrect direction caused by noise, indicating stable updating. In addition, we observe a higher update speed for AdaTerm compared with the others. This is probably because $\beta<\beta_{2}$ allowed rapid adaptation to the small gradients in the saddle area, although this caused an overshoot around the optimal point in Fig. 14(c), resulting in the larger error norm than the others in Fig. 8.

A remarkable result was obtained for the Michalewicz function (see Fig. 15(c)). Specifically, AdaTerm initially moved only in the $y$-direction, as in AdaBelief, and subsequently paused at the steep gradients in the $x$-direction. This is because the steep gradients toward the optimum were considered noise in certain cases. In fact, with the elapse of time, AdaTerm judged the steep gradients to be non-noise; gradually, as $\tilde{\nu}$ increased, the update was resumed, finally reaching the optimal point. Furthermore, At-Adam was unable to adapt $\nu$ to the steep gradients, and the point moved so as to avoid the optimal point (see Fig. 15(b)). However, when noise was added, the steep gradients started to be utilized with the aid of the noise, and the optimal point was finally reached.

Figure 13: Trajectories on the Rosenbrock function (panels: Adam, At-Adam, AdaTerm, and AdaBelief at 0% and 15% noise)
Figure 14: Trajectories on the McCormick function (panels: Adam, At-Adam, AdaTerm, and AdaBelief at 0% and 15% noise)
Figure 15: Trajectories on the Michalewicz function (panels: Adam, At-Adam, AdaTerm, and AdaBelief at 0% and 15% noise)

Appendix D Learning setups

The models for learning the four benchmark problems are implemented using PyTorch [29] with the respective setups, as summarized in Table 4. The parameters that are not listed in the table are basically set at the recommended default values. Note that the learning rate for RL was smaller than the ones for the other problems to avoid being trapped in local solutions before effective samples are collected. In addition, the batch size was set to 32, such that the noise is more likely to affect the gradients. However, for the classification problem only, the batch size was increased to 256, which is within the general range, in response to the large learning time cost of this task.

For the classification problem, ResNet18 [3] was employed as the network model. The official PyTorch implementation was used; however, to account for the smaller input image size, the first convolution layer was changed to one with a kernel size of 3 and a stride of 1, and the subsequent MaxPool layer was excluded.
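One common way to realize the described modification with torchvision is sketched below (a padding of 1 is an assumption, as it is not stated above):

```python
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(num_classes=100)
# Adapt the ImageNet stem to 32x32 CIFAR images: a 3x3 convolution with
# stride 1 replaces the 7x7/stride-2 stem, and the MaxPool layer is dropped.
model.conv1 = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1, bias=False)
model.maxpool = nn.Identity()
```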

For the prediction problem, a gated recurrent unit (GRU) [47] was employed for constructing a network model consisting of two serial hidden layers, with each including 128 GRU units. The output layer generates the means and standard deviations of the multivariate diagonal Gaussian distribution used to sample the predicted state. Note that, since extending the number of prediction steps delays the learning process, we set the number of epochs to 100 for single-step prediction and 300 for 30-step prediction.

For the RL problem, five fully-connected layers with 100 neurons each were implemented as hidden layers, each followed by an unlearned LayerNorm [48, 49] and a Swish activation function [50]. The output is a multivariate diagonal Student's t-distribution, used to improve the efficiency of the exploration [51].

For the policy distillation problem, two fully-connected layers with only 32 neurons each were implemented as the hidden layers. This structure is clearly smaller than that for the RL problem described above. The only activation function used for the different layers is the Tanh function, based on reports that this design has sufficient expressive capability [52]. The output layer generates the means and standard deviations of the multivariate diagonal Gaussian distribution used to capture the stochastic teacher policy.

Table 4: Learning setups
Parameter Classification Prediction RL Distillation
Learning rate 1e-3 1e-3 1e-4 1e-3
Batch size 256 32 32 32
#Epoch 100 100/300 2000 200
Label smoothing 0.2 - - -
Truncation - 30 - -
#Batch/#Replay buffer - - 128/1e+4 -
Weight decay - - - 1e-4

Appendix E Ablation study

For the ablation test of AdaTerm, we test the following three conditions: $\Delta\tilde{\nu}=0$, called no adaptiveness; $\underline{\tilde{\nu}}=\infty$, called no robustness; and $\tilde{\nu}_{0}=100$, called large init. All other conditions are the same as in the experiments in the main text.

The test results are summarized in Table 5. As expected, the cases with no adaptiveness outperformed the cases with no robustness on the problems with strong noise effects, and vice versa. As observed from the results of the 30-step prediction and the policy distillation, the optimal solution may lie somewhere in the middle rather than at the two extremes, and the normal AdaTerm succeeded in finding it. In the case where $\tilde{\nu}_{0}$ is increased, so that the noise robustness is inferior at the beginning of training, the performance was worse than in the normal case, where $\tilde{\nu}_{0}=\underline{\tilde{\nu}}+\epsilon$ and the noise robustness is maximized at the beginning, except for the classification problem. This tendency suggests that even if $\tilde{\nu}$ is adjusted to gain the appropriate noise robustness after optimizing without considering the effects of noise, the performance would be prone to local solution traps. For the classification problem only, the size of the network architecture was larger than that for the other problems, and its redundancy allowed the classifier to escape from local solutions; in such a case, using more gradients from the beginning resulted in a performance improvement.

Table 5: Ablation results
Classification Prediction RL Distillation
Accuracy MSE at final prediction The sum of rewards The sum of rewards
Method 0 % 10 % 1 step 30 steps Hopper Ant w/o amateur w/ amateur
AdaTerm 0.7315 0.6815 0.0335 1.0016 1550.25 2021.37 1770.17 1411.02
(3.66e-3) (4.46e-3) (3.09e-4) (2.31e-1) (5.88e+2) (3.87e+2) (2.17e+2) (1.92e+2)
No adaptiveness 0.7330 0.6840 0.0336 1.0718 496.75 2192.21 1701.63 1437.18
(3.20e-3) (4.20e-3) (4.30e-4) (2.75e-1) (6.19e+2) (4.18e+2) (2.32e+2) (2.26e+2)
No robustness 0.7330 0.6810 0.0328 1.2553 1666.30 1412.93 1625.05 1403.92
(3.20e-3) (3.80e-3) (4.14e-4) (2.09e-1) (6.57e+2) (6.52e+2) (2.61e+2) (2.56e+2)
Large init 0.7350 0.6830 0.0332 1.2966 1750.63 1902.63 1655.40 1442.24
(3.50e-3) (3.90e-3) (4.47e-4) (2.36e-1) (5.47e+2) (3.87e+2) (1.71e+2) (2.84e+2)
t-AdaBelief 0.7231 0.6827 0.0318 1.1517 1104.87 2185.45 1639.93 1303.19
(4.23e-3) (4.02e-3) (4.42e-4) (1.17e-1) (4.48e+2) (4.83e+2) (3.19e+2) (2.58e+2)
At-AdaBelief 0.7223 0.6815 0.0318 1.1306 1214.47 2196.84 1671.65 1325.70
(4.50e-3) (3.46e-3) (5.02e-4) (1.17e-1) (3.77e+2) (2.29e+2) (3.32e+2) (2.72e+2)

Appendix F Analysis of the batch size’s effect on robustness

As previously proven [53], the batch size has an impact on the stochastic gradient's noise strength. Furthermore, larger batch sizes would help mitigate the effect of corrupted data points by drowning them in a majority of good points inside the mini-batch. However, in typical machine learning tasks with a given dataset, an increase in the batch size also implies a decrease in the number of updates (and therefore also a decrease in the number of gradients observed). It has also been reported that large batch sizes degrade generalization performance [54]. To replicate this trade-off and study its influence on robustness, we use the regression task, where the number of samples is fixed at $40000$ and subsequently split into batches of sizes $\in\{8,16,32,64\}$.
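The following minimal sketch makes the trade-off concrete: with the dataset size fixed, each doubling of the batch size halves the number of gradient updates per epoch.

```python
# With 40000 samples fixed, larger mini-batches mean fewer updates per epoch.
N_SAMPLES = 40000
for batch_size in (8, 16, 32, 64):
    print(batch_size, N_SAMPLES // batch_size)
# -> 5000, 2500, 1250, and 625 updates per epoch, respectively
```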

The results show that, for all optimizers in the noiseless scenarios, the performance degrades as the batch size increases, consistent with the results of previous studies, although this is more pronounced for AdaTerm and (A)t-Adam. In the noisy scenarios, the performance of the robust optimizers also degraded as the batch size increased, as expected, since detecting the effect of aberrant points hidden inside a majority of good points becomes more difficult. Nevertheless, this majority of good points alone could not mitigate the effect of noise, as confirmed by the fact that the non-robust Adam and AdaBelief remained sensitive to noise.

(Figure) Regression results for each batch size: (a) batch size $=8$; (b) batch size $=16$; (c) batch size $=32$; (d) batch size $=64$.

Appendix G AdaTerm variants

Our main proposed algorithm is given in Alg. 1. In addition, we consider and study two alternative variants, described below.

G.1 Updates with uncentered second-order moment

The derived AdaTerm equations use the estimate of the centered second-order moment, i.e., $\mathbb{E}[(g-m)^{2}]$, in the update, based on the same logic outlined in [8]. However, the Adam algorithm employs an estimate of the uncentered second-order moment, i.e., $\mathbb{E}[g^{2}]$. Therefore, we also consider a gradient descent update based on the uncentered estimated moment.

$$\eta^{\mathrm{AdaTerm\_Uncentered}}(g_{t})=\cfrac{m_{t}(1-\beta^{t})^{-1}}{\sqrt{(v_{t}+m_{t}^{2})(1-\beta^{t})^{-1}}} \tag{55}$$

Here, we approximately use the relationship $\mathbb{E}[(g-m)^{2}]=\mathbb{E}[g^{2}]-\mathbb{E}[g]^{2}$. Note that, in this case, $|\eta^{\mathrm{AdaTerm\_Uncentered}}|<1$ is always satisfied, which would yield more conservative learning.
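A minimal sketch of eq. (55), assuming the AdaTerm statistics $m_{t}$ and $v_{t}$ are computed elsewhere; the small constant added under the square root is ours, for numerical safety.

```python
import math

def adaterm_uncentered_direction(m_t: float, v_t: float, beta: float, t: int,
                                 eps: float = 1e-8) -> float:
    # Bias-corrected update direction of eq. (55); v_t + m_t**2 approximates
    # the uncentered moment E[g^2] via E[(g - m)^2] = E[g^2] - E[g]^2.
    c = 1.0 - beta ** t
    return (m_t / c) / math.sqrt((v_t + m_t ** 2) / c + eps)
```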

G.2 Adaptive bias correction term

The bias correction term has its origin in the first proposal of Adam [4], where the authors derive the bias term as follows:

$$(1-\beta)\sum_{i=1}^{t}\beta^{t-i}=1-\beta^{t} \tag{56}$$

and they interpret it as the term needed to correct the bias caused by initializing the EMA with zeros.

Another interpretation (that does not require considering the expectation of the EMA) is to view this term as the normalization term of a weighted average. Indeed, given $\{g_{1},\cdots,g_{t}\}$, the weighted sum of $g$ with respect to the weights $\{w_{1},\cdots,w_{t}\}$ is expressed as follows:

$$\sum_{i=1}^{t}w_{i}g_{i}$$

If $\sum_{i=1}^{t}w_{i}\neq 1$, then to obtain the proper weighted average, a correction is required as follows:

$$\mathrm{ave}(g)=\left(\sum_{i=1}^{t}w_{i}\right)^{-1}\sum_{i=1}^{t}w_{i}g_{i}$$

In the case of the EMA, the weights $w_{i}$ are given by:

$$m_{t}=\sum_{i=1}^{t}(1-\beta)\beta^{t-i}g_{i}\;\;\Leftrightarrow\;\;w_{i}=(1-\beta)\beta^{t-i}\implies\sum_{i=1}^{t}w_{i}=\sum_{i=1}^{t}(1-\beta)\beta^{t-i}=1-\beta^{t} \tag{57}$$

The bias correction term of Adam can therefore be seen as the correction required for a proper weighted average. We observe that the bias correction term at time step $t$, $c_{t}$, corresponds to the EMA of $1$:

$$c_{t}=\sum_{i=1}^{t}(1-\beta)\beta^{t-i}=1-\beta^{t}=\mathrm{EMA}(x_{i}=1),\;\;\text{such that}\;\;c_{t}=\beta c_{t-1}+(1-\beta),\;\;\text{with}\;c_{0}=0$$

$c_{0}=0$ corresponds to the fact that the EMA is initialized at zero. If instead $m_{0}\neq 0$, then $c_{0}=1$.
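A quick numerical check of this interpretation (eqs. (56)–(57)): the recursion reproduces the closed form $1-\beta^{t}$ at every step.

```python
# The EMA-of-1 recursion c_t = beta * c_{t-1} + (1 - beta), with c_0 = 0,
# matches the closed-form bias correction term 1 - beta**t.
beta, c = 0.9, 0.0
for t in range(1, 11):
    c = beta * c + (1.0 - beta)
    assert abs(c - (1.0 - beta ** t)) < 1e-12
```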

In AdaTerm, the EMA is expressed as follows:

$$m_{t}=(1-\tau_{mv,t})m_{t-1}+\tau_{mv,t}g_{t}=\sum_{i=1}^{t}\tau_{mv,i}\prod_{j=i+1}^{t}(1-\tau_{mv,j})g_{i}\implies w_{i}=\tau_{mv,i}\prod_{j=i+1}^{t}(1-\tau_{mv,j}) \tag{58}$$

Since $(1-\beta)$ enters linearly into the expression of $\tau_{mv}$, the bias correction of Adam can also be used in AdaTerm, as expressed in eq. (35). However, following the weighted-average interpretation, a different correction term can be obtained as follows:

$$c_{t}=\sum_{i=1}^{t}\tau_{mv,i}\prod_{j=i+1}^{t}(1-\tau_{mv,j})\implies c_{t}=(1-\tau_{mv,t})c_{t-1}+\tau_{mv,t},\;\;\text{with}\;c_{0}=0 \tag{59}$$

This correction term is adaptive, and in the Gaussian limit, where $\tau_{mv}\to(1-\beta)$, it coincides with the Adam/AdaBelief term in eq. (56). Using this adaptive correction term, a slightly modified AdaTerm update is obtained:

$$\eta^{\mathrm{AdaTerm\_AdaBias}}(g_{t})=\cfrac{m_{t}c_{t}^{-1}}{\sqrt{v_{t}c_{t}^{-1}}} \tag{60}$$
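A minimal sketch of eqs. (59)–(60), assuming the statistics $m_{t}$, $v_{t}$, and the adaptive smoothing factor $\tau_{mv,t}$ are produced by the AdaTerm algorithm; the small constant under the square root is ours.

```python
def adabias_direction(m_t: float, v_t: float, c_prev: float, tau_mv: float,
                      eps: float = 1e-8):
    # Eq. (59): the correction term is the EMA of 1 under the same adaptive
    # smoothing factor as m and v, initialized at c_0 = 0.
    c_t = (1.0 - tau_mv) * c_prev + tau_mv
    # Eq. (60): update direction with the adaptive bias correction.
    direction = (m_t / c_t) / ((v_t / c_t) + eps) ** 0.5
    return direction, c_t
```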

G.3 Comparison

As can be observed in Table 6, the adaptive bias correction is consistently inferior to the vanilla bias correction. However, the uncentered version of AdaTerm outperforms the main algorithm in certain cases and shows a merit that deserves further investigation, as mentioned in the conclusion section.

Table 6: Results for AdaTerm’s variants
Classification Prediction RL Distillation
Accuracy MSE at final prediction The sum of rewards The sum of rewards
Method 0 % 10 % 1 step 30 steps Hopper Ant w/o amateur w/ amateur
AdaTerm 0.7315 0.6815 0.0335 1.0016 1550.25 2021.37 1770.17 1411.02
(Ours) (3.66e-3) (4.46e-3) (3.09e-4) (2.31e-1) (5.88e+2) (3.87e+2) (2.17e+2) (1.92e+2)
AdaTerm 0.7309 0.6815 0.0338 0.9085 1688.59 2113.81 1649.53 1372.08
(Uncentered) (2.60e-3) (4.77e-3) (5.23e-4) (1.21e-1) (5.42e+2) (4.01e+2) (2.48e+2) (2.48e+2)
AdaTerm 0.7294 0.6810 0.0336 1.0342 1595.10 2017.69 1672.71 1400.88
(AdaBias) (4.26e-3) (3.13e-3) (4.65e-4) (2.53e-1) (6.45e+2) (3.21e+2) (2.50e+2) (2.45e+2)
AdaTerm 0.7299 0.6800 0.0336 1.1310 1552.83 1960.78 1624.56 1302.55
(Uncentered + AdaBias) (3.70e-3) (3.76e-3) (4.31e-4) (6.43e-1) (6.72e+2) (3.74e+2) (2.21e+2) (2.47e+2)

Appendix H Alternative $v$ update rule

The gradient ascent update rule for the scale parameter $v$ is expressed as follows:

$$v_{t}=v_{t-1}+\kappa_{v}g_{v}=v_{t-1}+\kappa_{v}\frac{w_{mv}\tilde{\nu}}{2v_{t-1}^{2}(\tilde{\nu}+1)}\left[(s_{t}+\Delta s)-v_{t-1}\right]=v_{t-1}+\kappa_{v}\frac{w_{mv}\tilde{\nu}}{2v_{t-1}^{2}(\tilde{\nu}+1)}\left[\left(s_{t}+(s_{t}-D_{t}v_{t-1})\tilde{\nu}^{-1}\right)-v_{t-1}\right] \tag{61}$$
$$=v_{t-1}+\kappa_{v}\frac{w_{mv}\tilde{\nu}}{2v_{t-1}^{2}(\tilde{\nu}+1)}\left[\left(\frac{\tilde{\nu}+1}{\tilde{\nu}}\right)s_{t}-\left(\frac{\tilde{\nu}+D_{t}}{\tilde{\nu}}\right)v_{t-1}\right]=v_{t-1}+\frac{\kappa_{v}}{2v_{t-1}^{2}}\left[w_{mv}s_{t}-v_{t-1}\right]=v_{t-1}+\frac{\kappa_{v}}{2v_{t-1}^{2}}\left[s_{t}+\epsilon_{t}-v_{t-1}\right] \tag{62}$$
$$=v_{t-1}+\tau_{v}\left[s_{t}+\epsilon_{t}-v_{t-1}\right] \tag{63}$$

where $\kappa_{v}$ is the update step size, and $\epsilon_{t}=(w_{mv}-1)s_{t}$. To preserve a limiting Gaussian model, we require that $\tau_{v}\to(1-\beta)$ as $\tilde{\nu}\to\infty$, since we already have $\epsilon_{t}\to 0$. Therefore, we have the following:

$$\tau_{v}=\frac{\kappa_{v}}{2v_{t-1}^{2}}\implies 0<\kappa_{v}<2v_{t-1}^{2}\implies\kappa_{v}=2kv_{t-1}^{2},\;0<k<1\implies\tau_{v}=k \tag{64}$$

Since $s_{t}+\epsilon_{t}=w_{mv}s_{t}$, it is no longer necessary to have $k$ as a function of $w_{mv}$. In particular, we can directly set $k=(1-\beta)$. However, in practice, this can lead to instabilities and NaN values. Therefore, we instead use $k=(1-\beta)(\overline{w}_{mv})^{-1}$, resulting in the following update rule:

$$\kappa_{v}=2v_{t-1}^{2}(1-\beta)(\overline{w}_{mv})^{-1},\quad\tau_{v}=(1-\beta)(\overline{w}_{mv})^{-1} \tag{65}$$
$$v_{t}=v_{t-1}+\kappa_{v}g_{v}=v_{t-1}+\frac{1-\beta}{\overline{w}_{mv}}\left[(s_{t}+\epsilon_{t})-v_{t-1}\right]=(1-\tau_{v})v_{t-1}+\tau_{v}(w_{mv}s_{t})$$
$$v_{t}=\left(1-\frac{1-\beta}{\overline{w}_{mv}}\right)v_{t-1}+\frac{1-\beta}{\overline{w}_{mv}}\left(w_{mv}s_{t}+\epsilon^{2}\right) \tag{66}$$

where $\epsilon^{2}$ is added to avoid the zero-value problem. Since $w_{mv}$ and $s_{t}$ are both positive, the gradient ascent rule never violates the positivity constraint on $v$.
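A minimal sketch of eq. (66), assuming $w_{mv}$, $\overline{w}_{mv}$, and $s_{t}$ are computed by the algorithm; the variable names are ours.

```python
def adaterm2_v_update(v_prev: float, s_t: float, w_mv: float,
                      w_mv_bar: float, beta: float, eps: float = 1e-8) -> float:
    # Eq. (66): interpolate toward the weighted statistic w_mv * s_t with
    # the adaptive step tau_v = (1 - beta) / w_mv_bar; eps**2 keeps v_t > 0.
    tau_v = (1.0 - beta) / w_mv_bar
    return (1.0 - tau_v) * v_prev + tau_v * (w_mv * s_t + eps ** 2)
```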

This version of AdaTerm is called AdaTerm2 in the results below, where we can observe that it performs on par with our main algorithm on the simple regression task but less efficiently on the prediction task. Notably, we can see signs of its instability in the $30$-step prediction task around epoch $100$.

(Figure) Comparison with AdaTerm2: (a) test loss on the regression task; (b) test error on the prediction task.
(Figure) Detailed prediction-task results: (a) training loss (1 step); (b) evaluation prediction error (1 step); (c) training loss (30 steps); (d) evaluation prediction error (30 steps).