
Signal Processing Meets SGD: From Momentum to Filter

1Zhipeng Yao,  1Guiyuan Fu,  1Ying Li,  1,2Yu Zhang,  1Dazhou Li*,  3Rui Yu
1Shenyang University of Chemical Technology,  2University of Macau,  3University of Louisville
[email protected],  [email protected],  [email protected],
{zhangy;lidazhou}@syuct.edu.cn,  [email protected]
*Corresponding author.
Abstract

In deep learning, stochastic gradient descent (SGD) and its momentum-based variants are widely used for optimization, but they typically suffer from slow convergence. Conversely, existing adaptive learning rate optimizers speed up convergence but often compromise generalization. To resolve this issue, we propose a novel optimization method designed to accelerate SGD’s convergence without sacrificing generalization. Our approach reduces the variance of the historical gradient, improves first-order moment estimation of SGD by applying Wiener filter theory, and introduces a time-varying adaptive gain. Empirical results demonstrate that SGDF (SGD with Filter) effectively balances convergence and generalization compared to state-of-the-art optimizers. The code is available at https://github.com/LilYau350/SGDF-Optimizer.

1 Introduction

During the training process, the optimizer serves as a critical component of the model. It refines and adjusts model parameters to ensure that the model can recognize underlying data patterns. Beyond updating weights, the optimizer’s role includes strategically navigating complex loss landscapes [16] to locate regions that offer the best generalization [30]. The chosen optimizer significantly impacts training efficiency, influencing model convergence speed, generalization performance, and resilience to data distribution shifts [3]. A poor optimizer choice can result in suboptimal convergence or failure to converge, whereas a suitable one can accelerate learning and ensure robust performance [46]. Thus, continually refining optimization algorithms is essential for enhancing the capabilities of machine learning models.

Meanwhile, Stochastic Gradient Descent (SGD) [41] and its variants, such as momentum-based SGD [50], Adam [32], and RMSprop [25], have secured prominent roles. Despite their substantial contributions to deep learning, these methods have inherent drawbacks. They primarily exploit first-order moment estimation and frequently overlook the pivotal influence of historical gradients on current parameter adjustments. Consequently, they can result in training instability or poor generalization [7], especially with high-dimensional, non-convex loss functions common in deep learning [21]. Such characteristics render adaptive learning rate methods prone to entrapment in sharp local minima, which can significantly impair the model’s generalization capability [63]. Various Adam variants [8, 37, 40, 66] aim to improve optimization and enhance generalization performance by adjusting the adaptive learning rate. Although these variants have achieved some success, they still have not completely resolved the issue of generalization loss.

To achieve an effective trade-off between convergence speed and generalization capability [20], this paper introduces a novel optimization method called SGDF (SGD with Filter). SGDF incorporates filter theory from signal processing to enhance first-moment estimation, balancing historical and current gradient estimates. Through its adaptive weighting mechanism, SGDF precisely adjusts gradient estimates throughout the training process, thereby accelerating model convergence while preserving generalization ability.

Initial evaluations demonstrate that SGDF surpasses many traditional adaptive learning rate and variance reduction optimization methods across various benchmark datasets, particularly in terms of accelerating convergence and maintaining generalization. This indicates that SGDF successfully navigates the trade-off between speeding up convergence and preserving generalization capability. By achieving this balance, SGDF offers a more efficient and robust optimization option for training deep learning models.

The main contributions of this paper can be summarized as follows:

  • We introduce SGDF, an optimizer that integrates historical and current gradient data to compute the gradient’s variance estimate, addressing the slow convergence of the vanilla SGD method.

  • We theoretically analyze the benefits of SGDF in terms of generalization (Section 3.3) and convergence (Section 3.4), and empirically verify the effectiveness of SGDF (Section 4).

  • We employ first-moment filter estimation in SGDF, which can also significantly enhance the generalization capacity of adaptive optimization algorithms (e.g., Adam) (Section 4.4), surpassing traditional momentum strategies.

2 Related Works

Convergence: Variance Reduction to Adaptive Methods. In the early stages of deep learning development, optimization algorithms focused on reducing the variance of gradient estimation [2, 13, 27, 48] to achieve a linear convergence rate. Subsequently, the emergence of adaptive learning rate methods [15, 17, 62] marked a significant shift in optimization algorithms. While SGD and its variants have advanced many applications, they come with inherent limitations. They often oscillate or become trapped in sharp minima [54]. Although these methods can lead models to achieve low training loss, such minima frequently fail to generalize effectively to new data [22, 56]. This issue is exacerbated in the high-dimensional, non-convex landscapes characteristic of deep learning settings [11, 39].

Generalization: Sharp and Flat Solutions. The generalization ability of a deep learning model depends heavily on the nature of the solutions found during optimization. Keskar et al. [31] demonstrated experimentally that flat minima generalize better than sharp minima. SAM [19] showed theoretically that the generalization error of flat minima is lower than that of sharp minima on test data, and proposed optimizing zeroth-order flatness. GAM [65] improves on SAM by simultaneously optimizing the prediction error and the maximum gradient norm in a neighborhood during training. Adaptive Inertia [55] aims to balance exploration and exploitation by adjusting the inertia of each parameter update; this adaptive inertia mechanism helps the model avoid falling into sharp local minima.

Second-Order and Filter Methods. The recent integration of second-order information into optimization problems has gained popularity [36, 60]. Methods such as Kalman Filter [28] combined with Gradient Descent incorporate second-order curvature information [42, 51]. The KOALA algorithm [12] posits that the optimizer must adapt to the loss landscape. It adjusts learning rates based on both gradient magnitudes and the curvature of the loss landscape. However, it should be noted that the Kalman filtering framework introduces more complex parameter settings, which can hinder understanding and application.

3 Method

We derive SGDF and show that it shares the same underlying principles as the Wiener Filter [53]. SGDF minimizes the mean squared error [29], which effectively reduces the variance of the estimated gradient distribution. This design allows the algorithm to trade off convergence speed against generalization. Fig. 1 illustrates how SGDF acts intuitively: by estimating the optimal gradient when the noise of a particular stochastic gradient is very large, SGDF keeps the descent direction accurate, whereas plain stochastic gradient descent is perturbed by the noise, causing its descent path to deviate from the optimal point.

Input: $\{\alpha_t\}_{t=1}^{T}$: step size, $\{\beta_1, \beta_2\}$: attenuation coefficients, $\theta_0$: initial parameter, $f(\theta)$: stochastic objective function
Output: $\theta_T$: resulting parameters.
Init: $m_0 \leftarrow 0$, $s_0 \leftarrow 0$
for $t = 1$ to $T$ do
       $g_t \leftarrow \nabla f_t(\theta_{t-1})$ (calculate gradient of the stochastic objective at timestep $t$)
       $m_t \leftarrow \beta_1 m_{t-1} + (1-\beta_1) g_t$ (calculate exponential moving average)
       $s_t \leftarrow \beta_2 s_{t-1} + (1-\beta_2)(g_t - m_t)^2$ (calculate exponential moving variance)
       $\widehat{m}_t \leftarrow \dfrac{m_t}{1-\beta_1^t}$,  $\widehat{s}_t \leftarrow \dfrac{(1-\beta_1)(1-\beta_1^{2t})\, s_t}{(1+\beta_1)(1-\beta_2^t)}$ (bias correction)
       $K_t \leftarrow \dfrac{\widehat{s}_t}{\widehat{s}_t + (g_t - \widehat{m}_t)^2}$ (calculate estimate gain)
       $\widehat{g}_t \leftarrow \widehat{m}_t + K_t(g_t - \widehat{m}_t)$ (update gradient estimation)
       $\theta_t \leftarrow \theta_{t-1} - \alpha_t \widehat{g}_t$ (update parameters)
return $\theta_T$
Algorithm 1 SGDF, Wiener Filter Estimate Gradient. All operations are element-wise.
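To make Algorithm 1 concrete, here is a minimal NumPy sketch of a single SGDF step. This is our own illustrative re-implementation, not the code from the linked repository; the small `eps` term is an added safeguard for numerical stability, consistent with the $\epsilon = 10^{-8}$ setting used in the experiments.

```python
import numpy as np

def sgdf_step(theta, g, m, s, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One SGDF step following Algorithm 1; all operations element-wise, t >= 1."""
    m = beta1 * m + (1 - beta1) * g                       # exponential moving average
    s = beta2 * s + (1 - beta2) * (g - m) ** 2            # exponential moving variance
    m_hat = m / (1 - beta1 ** t)                          # bias-corrected first moment
    s_hat = ((1 - beta1) * (1 - beta1 ** (2 * t)) * s
             / ((1 + beta1) * (1 - beta2 ** t)))          # bias-corrected variance
    K = s_hat / (s_hat + (g - m_hat) ** 2 + eps)          # estimate gain K_t in [0, 1]
    g_hat = m_hat + K * (g - m_hat)                       # Wiener-filtered gradient
    theta = theta - alpha * g_hat                         # parameter update
    return theta, m, s
```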

3.1 SGDF General Introduction

Figure 1: Schematic of the SGDF intuition.

In the SGDF algorithm, $s_t$ serves as a key indicator: it is the exponential moving average of the squared difference between the current gradient $g_t$ and its momentum $m_t$, acting as an indicator of gradient variation whose weighting is controlled by $\beta_2$. Zhuang et al. [66] first proposed the calculation of $s_t$, which is utilized for estimating the fluctuation variance of the stochastic gradient. We derived a correction factor $(1-\beta_1)(1-\beta_1^{2t})/(1+\beta_1)$, under the assumption that $m_t$ and $g_t$ are independently and identically distributed (i.i.d.), so that $s_t$ accurately estimates the variance of $m_t$. Fig. 11 compares performance with and without the correction factor, showing superior results with correction. For the derivation of the correction factor, please refer to Appendix A.2.

At each time step $t$, $g_t$ represents the stochastic gradient of our objective function, while $m_t$ approximates the historical trend of the gradient through an exponential moving average. The difference $g_t - m_t$ highlights the gradient’s deviation from its historical pattern, reflecting the inherent noise or uncertainty in the instantaneous gradient estimate, which can be expressed as $p(g_t|\mathcal{D}) \sim \mathcal{N}(g_t; m_t, \sigma_t^2)$ [37].

SGDF utilizes the gain $K_t$, whose components in each dimension lie between 0 and 1, to balance the current observed gradient $g_t$ and the past corrected gradient $\widehat{m}_t$, thus optimizing the gradient estimate. This balance plays a crucial role in noisy or complex optimization scenarios, helping to mitigate noise and achieve a stable gradient direction, faster convergence, and enhanced performance. The computation of $K_t$, based on $s_t$ and $g_t - m_t$, minimizes the expected variance of the corrected gradient $\widehat{g}_t$, yielding the optimal linear estimate in noisy conditions. For the method derivation, please refer to Appendix A.1.

3.2 Method Principle of Fusion of Gaussian Distributions for Gradient Estimate

SGDF refines the gradient estimate through the Wiener Filter, enhancing the stability and effectiveness of the optimization process and facilitating faster convergence. By fusing two Gaussian distributions, SGDF significantly reduces the variance of gradient estimates, which is beneficial for solving complex stochastic optimization problems. In this section, we delve into how SGDF achieves this reduction of gradient estimate variance.

The properties of SGDF ensure that the estimated gradient is a linear combination of the current noisy gradient observation $g_t$ and the first-order moment estimate $\widehat{m}_t$. These two components are assumed to be Gaussian, with $g_i \sim \mathcal{N}(\mu, \sigma^2)$. Hence, their fusion by the filter naturally ensures that the fused estimate $\widehat{g}_t$ is also Gaussian.

Consider two Gaussian distributions for the momentum term $\widehat{m}_t$ and the current gradient $g_t$:

  • The momentum term $\widehat{m}_t$ is normally distributed with mean $\mu_m$ and variance $\sigma_m^2$, denoted as $\widehat{m}_t \sim \mathcal{N}(\mu_m, \sigma_m^2)$.

  • The current gradient $g_t$ is normally distributed with mean $\mu_g$ and variance $\sigma_g^2$, denoted as $g_t \sim \mathcal{N}(\mu_g, \sigma_g^2)$.

The product of their probability density functions is given by:

$$N(\widehat{m}_t;\mu_m,\sigma_m)\cdot N(g_t;\mu_g,\sigma_g) = \frac{1}{2\pi\sigma_m\sigma_g}\exp\left(-\frac{(\widehat{m}_t-\mu_m)^2}{2\sigma_m^2}-\frac{(g_t-\mu_g)^2}{2\sigma_g^2}\right) \quad (1)$$

Through coefficient matching in the exponential terms, we obtain the new mean and variance:

$$\mu' = \frac{\sigma_g^2\mu_m+\sigma_m^2\mu_g}{\sigma_m^2+\sigma_g^2}, \qquad \sigma'^2 = \frac{\sigma_m^2\sigma_g^2}{\sigma_m^2+\sigma_g^2} \quad (2)$$

The new mean $\mu'$ is a weighted average of the two means $\mu_m$ and $\mu_g$, with weights inversely proportional to their variances. This places $\mu'$ between $\mu_m$ and $\mu_g$, closer to the mean with the smaller variance. The new standard deviation $\sigma'$ is smaller than either of the original standard deviations $\sigma_m$ and $\sigma_g$, reflecting the reduced uncertainty in the estimate due to the combination of information from both sources. This is a direct consequence of the Wiener Filter’s optimality in the mean-square error sense. The proof is provided in Appendix A.3.
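As a quick illustrative check of Eq. (2) with arbitrary values (not ones from the paper), the following sketch fuses two Gaussians and confirms that the fused mean is pulled toward the lower-variance source while the fused variance drops below both:

```python
def fuse_gaussians(mu_m, var_m, mu_g, var_g):
    """Product-of-Gaussians fusion, Eq. (2)."""
    mu = (var_g * mu_m + var_m * mu_g) / (var_m + var_g)
    var = var_m * var_g / (var_m + var_g)
    return mu, var

mu, var = fuse_gaussians(mu_m=1.0, var_m=0.5, mu_g=2.0, var_g=2.0)
print(mu, var)  # 1.2 0.4 -> mean closer to the lower-variance source; variance below both
```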

3.3 Generalization Analysis of the Variance Lower Bound for SGDF

In Section 3.2, we explained the principle of noise reduction by SGDF. In this section, we illustrate the advantages of SGDF over other variance reduction techniques, ensuring better generalization.

In prior variance reduction techniques [13, 27, 48], the variance is gradually reduced at a rate of $\xi^{t-1}$, $\xi \in (0,1)$. This rapidly decreases the initial variance and facilitates a linear convergence rate, as noted in [5]. However, it poses a potential problem: the variance may approach zero, depriving the optimization process of necessary stochastic exploration. The Wiener Filter, by contrast, obeys statistical estimation theory, in which the Cramér-Rao lower bound (CRLB) [43] imposes a lower bound on the variance of the estimator. Accordingly, we construct a dynamical system modeled by the Fokker-Planck equation to demonstrate the potential benefits of a lower-bounded variance for optimization.

Theorem 3.1.

Consider a system described by the Fokker-Planck equation, evolving the probability density function $P$ in parameter space. Given a loss function $f(\theta)$, let the noise variance or diffusion matrix $D_{ij}$ satisfy $D_i \geq C > 0$, where $C$ is a positive lower-bound constant, known as the Cramér-Rao lower bound. In the steady-state condition, i.e., $\frac{\partial P}{\partial t} = 0$, the analytical form of the probability density $P$ can be obtained by solving the corresponding Fokker-Planck equation. This solution reveals the probability distribution of the system at steady state, described as follows:

$$P(\theta) = \frac{1}{Z}\exp\left(-\sum_{i=1}^{n}\frac{f(\theta)}{D_i}\right), \quad (3)$$

Here, $Z$ is a normalization constant ensuring the total probability integrates to one, assuming $D_{ij} = D_i\delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta.

The existence of a variance lower bound critically enhances the algorithm’s exploration capabilities, especially in regions of the loss landscape where gradients are minimal. By preventing the probability density function from becoming unbounded, it ensures continuous exploration and increases the probability of converging to flat minima associated with better generalization properties [57]. The proof of Theorem 3.1 is provided in Appendix A.4.
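As a 1-D illustration of Theorem 3.1 (with an assumed quadratic loss, not one from the paper), the sketch below evaluates the steady-state density of Eq. (3) for several diffusion levels; a larger lower bound on $D$ keeps the density spread out and exploratory, while $D \to 0$ concentrates all mass at the minimum:

```python
import numpy as np

theta = np.linspace(-3.0, 3.0, 2001)
f = 0.5 * theta ** 2                        # assumed quadratic loss f(theta) = theta^2 / 2

for D in (1.0, 0.1, 0.01):                  # diffusion levels with decreasing lower bound
    p = np.exp(-f / D)                      # unnormalized steady-state density, Eq. (3)
    p /= p.sum() * (theta[1] - theta[0])    # Riemann-sum normalization stands in for Z
    print(D, p.max())                       # smaller D -> density piles up at the minimum
```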

3.4 Convergence Analysis in Convex and Non-convex Optimization

Finally, we provide the convergence property of SGDF as shown in Theorem 3.2 and Theorem 3.3. The assumptions are common and standard when analyzing the convergence of convex and non-convex functions via SGD-based methods [9, 32, 44]. Proofs for convergence in convex and non-convex cases are provided in Appendix B and Appendix C, respectively. In the convergence analysis, the assumptions are relaxed and the upper bound is reduced due to the estimation gain introduced by SGDF, promoting faster convergence.

Theorem 3.2.

(Convergence in convex optimization) Assume that the function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \leq G$, $\|\nabla f_t(\theta)\|_\infty \leq G_\infty$ for all $\theta \in \mathbb{R}^d$, and that the distance between any $\theta_t$ generated by SGDF is bounded, $\|\theta_n - \theta_m\|_2 \leq D$, $\|\theta_m - \theta_n\|_\infty \leq D_\infty$ for any $m, n \in \{1, \ldots, T\}$, and $\beta_1, \beta_2 \in [0, 1)$. Let $\alpha_t = \alpha/\sqrt{t}$. SGDF achieves the following guarantee for all $T \geq 1$:

$$R(T) \leq \frac{D^2}{\alpha}\sum_{i=1}^{d}\sqrt{T} + \frac{2 D_\infty G_\infty}{1-\beta_1}\sum_{i=1}^{d}\left\|g_{1:T,i}\right\|_2 + \frac{2\alpha G_\infty^2\left(1+(1-\beta_1)^2\right)}{\sqrt{T}\,(1-\beta_1)^2}\sum_{i=1}^{d}\left\|g_{1:T,i}\right\|_2^2 \quad (4)$$

where $R(T) = \sum_{t=1}^{T} f_t(\theta_t) - f_t(\theta^*)$ denotes the cumulative performance gap between the generated solution and the optimal solution.

For the convex case, Theorem 3.2 implies that the regret of SGDF is upper bounded by $O(\sqrt{T})$. In Adam-type optimizers, the convex analysis crucially requires decaying $\beta_{1,t}$ towards zero [32, 66]. We relax this assumption by introducing a time-varying gain $K_t$, which adapts with the variance. Moreover, $K_t$ converges with the variance at the end of training, which improves convergence [50].

Theorem 3.3.

(Convergence for non-convex stochastic optimization) Under the assumptions:

  • A1 Bounded variables (same as convex): $\|\theta - \theta^*\|_2 \leq D$ for all $\theta, \theta^*$, or, for any dimension $i$ of the variable, $\|\theta_i - \theta_i^*\|_2 \leq D_i$ for all $\theta_i, \theta_i^*$.

  • A2 The noisy gradient is unbiased: for all $t$, the random variable $\zeta_t = g_t - \nabla f(\theta_t)$ satisfies $\mathbb{E}[\zeta_t] = 0$ and $\mathbb{E}\left[\|\zeta_t\|_2^2\right] \leq \sigma^2$, and for $t_1 \neq t_2$, $\zeta_{t_1}$ and $\zeta_{t_2}$ are statistically independent, i.e., $\zeta_{t_1} \perp \zeta_{t_2}$.

  • A3 Bounded gradient and noisy gradient: at step $t$, the algorithm can access a bounded noisy gradient, and the true gradient is also bounded, i.e., $\|\nabla f(\theta_t)\| \leq G$, $\|g_t\| \leq G$ for all $t > 1$.

  • A4 Properties of the function: (1) $f$ is differentiable; (2) $\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|$ for all $x, y$; (3) $f$ is lower bounded.

Consider a non-convex optimization problem. Suppose assumptions A1-A4 are satisfied, and let $\alpha_t = \alpha/\sqrt{t}$. For all $T \geq 1$, SGDF achieves the following guarantee:

$$\mathbb{E}(T) \leq \frac{C_7\,\alpha^2(\log T + 1) + C_8}{2\alpha\sqrt{T}} \quad (5)$$

where $\mathbb{E}(T) = \min_{t=1,2,\ldots,T}\mathbb{E}_{t-1}\left[\|\nabla f(\theta_t)\|_2^2\right]$ denotes the minimum expected squared norm of the gradient, $\alpha$ is the learning rate at the first step, $C_7$ is a constant independent of $d$ and $T$, $C_8$ is a constant independent of $T$, and the expectation is taken with respect to all randomness in $g_t$.

Theorem 3.3 implies that the convergence rate of SGDF in the non-convex case is $O(\log T/\sqrt{T})$, similar to Adam-type optimizers [9, 44]. Note that these bounds are derived under the worst possible case, so they may be loose. The components of each dimension of the estimated gain $K_t$ converge with the variance, causing the upper bound to decrease rapidly. This is why SGDF converges faster than SGD.

4 Experiments

4.1 Empirical Evaluation

In this study, we focus on the following tasks. Image Classification. We employed the VGG [49], ResNet [23], and DenseNet [26] models for image classification tasks on the CIFAR dataset [33]. The major difference between these three network architectures is residual connectivity, which we discuss in Section 4.4. We evaluated and compared the performance of SGDF with other optimizers such as SGD, Adam, RAdam [37], AdamW [38], MSVAG [2], Adabound [40], Sophia [36], and Lion [10], all implemented based on the official PyTorch. Additionally, we tested the performance of SGDF on the ImageNet dataset [14] using the ResNet model. Object Detection. Object detection was performed on the PASCAL VOC dataset [18] using Faster-RCNN [45] integrated with FPN. For hyper-parameter tuning related to image classification and object detection, refer to [66]. Image Generation. We trained a Wasserstein-GAN (WGAN) [1] on the CIFAR-10 dataset.

Hyperparameter tuning. Following Zhuang et al. [66], we delved deep into the optimal hyperparameter settings for our experiments. In the image classification task, we employed these settings:

  • SGDF: We adhered to Adam’s original parameter values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.

  • SGD: We set the momentum to 0.9, the default for networks like ResNet and DenseNet. The learning rate was searched in the set {10.0, 1.0, 0.1, 0.01, 0.001}.

  • Adam, RAdam, MSVAG, AdaBound: We searched $\beta_1$ over $\{0.5, 0.6, 0.7, 0.8, 0.9\}$, searched $\alpha$ as for SGD, and kept the other parameters at their default values from the literature.

  • AdamW, SophiaG, Lion: We followed Adam’s parameter search scheme and fixed the weight decay at $5\times 10^{-4}$; for AdamW, whose optimal decay is often larger than usual [38], we searched the weight decay over $\{10^{-4}, 5\times 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$.

  • SophiaG, Lion: We searched for the learning rate among $\{10^{-3}, 10^{-4}, 10^{-5}\}$ and adjusted Lion’s learning rate accordingly [36]. Following [36, 10], we set $\beta_1 = 0.965$ and $0.9$, respectively, and $\beta_2 = 0.99$ as the default parameters.
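For reference, a hypothetical training-loop usage under the settings above; the `SGDF` import path and constructor signature here are assumptions modeled on `torch.optim` conventions, and the official repository may differ:

```python
import torch
import torch.nn.functional as F
from sgdf import SGDF  # hypothetical import; see the linked repository for the real module

model = torch.nn.Linear(10, 2)
optimizer = SGDF(model.parameters(), lr=0.1, betas=(0.9, 0.999), eps=1e-8)  # assumed signature

x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))   # dummy batch
optimizer.zero_grad()
loss = F.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```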

CIFAR-10/100 Experiments. We initially trained on the CIFAR-10 and CIFAR-100 datasets using the VGG, ResNet, and DenseNet models and assessed the performance of the SGDF optimizer. In these experiments, we employed basic data augmentation techniques such as random horizontal flipping and random cropping (with 4-pixel padding). To facilitate reproduction, we provide the parameter table for this subsection in Tab. 5. The results represent the mean and standard deviation of 3 runs, visualized as curve graphs in Fig. 2.

Figure 7: Test accuracy ($\mu \pm \sigma$) on CIFAR.

As Fig. 2 shows, the SGDF optimizer exhibits convergence speeds comparable to adaptive optimization algorithms. Additionally, SGDF’s final test set accuracy was better than or equal to that achieved by SGD.

ImageNet Experiments. We utilized the ResNet18 model to train on the ImageNet dataset. Due to extensive computational demands, we did not conduct an exhaustive hyperparameter search for other optimizers and instead used the best-reported parameters from [8, 37]. We applied basic data augmentation strategies such as random cropping and random horizontal flipping. The results are presented in Tab. 1. To facilitate reproduction, we provide the parameter table for this subsection in Tab. 6. Detailed training and test curves are depicted in Fig. 8. Additionally, to mitigate the effect of learning rate scheduling, we employed cosine learning rate scheduling as suggested by [10, 65] and trained ResNet18, 34, and 50 models. The results are summarized in Tab. 2. Experiments on the ImageNet dataset demonstrate that SGDF improves convergence speed and achieves accuracy similar to SGD on the test set.

Table 1: Top-1 and Top-5 accuracy (%) of ResNet18 on ImageNet. † is reported in [66, 8, 37].

Method | SGDF  | SGD   | AdaBound | Yogi  | MSVAG | Adam           | RAdam | AdamW
Top-1  | 70.23 | 70.23 | 68.13    | 68.23 | 65.99 | 63.79 (66.54†) | 67.62 | 67.93
Top-5  | 89.55 | 89.40 | 88.55    | 88.59 | -     | 85.61          | -     | 88.47
Table 2: Top-1 accuracy (%) on ImageNet with cosine learning rate scheduling. Baseline results are reported in [65].

Model | ResNet18 | ResNet34 | ResNet50
SGDF  | 70.16    | 73.37    | 76.03
SGD   | 69.80    | 73.26    | 76.01

Object Detection. We conducted object detection experiments on the PASCAL VOC dataset [18]. The model used in these experiments was pre-trained on the COCO dataset [35], obtained from the official website. We trained this model on the VOC2007 and VOC2012 trainval datasets (17K images) and evaluated it on the VOC2007 test dataset (5K images). The utilized model was Faster-RCNN [45] with FPN, and the backbone was ResNet50 [23]. Results are summarized in Tab. 3. To facilitate reproduction, we provide the parameter table for this subsection in Tab. 6. As expected, SGDF outperforms the other methods. These results also illustrate the efficiency of our method on object detection tasks.

Table 3: The mAP on PASCAL VOC using Faster-RCNN+FPN.

Method | SGDF  | SGD   | Adam  | AdamW | RAdam
mAP    | 83.81 | 80.43 | 78.67 | 78.48 | 75.21

Image Generation. Optimizer stability is crucial, especially when training Generative Adversarial Networks (GANs). If the generator and discriminator have mismatched complexities, training can become imbalanced and the GAN may fail to converge, a failure commonly known as mode collapse. For instance, vanilla SGD frequently causes mode collapse, making adaptive optimizers like Adam and RMSProp the preferred choice. GAN training therefore provides a good benchmark for assessing optimizer stability. For reproducibility details, please refer to the parameter table in Table 7.

Figure 8: FID score of WGAN-GP.

We evaluated the Wasserstein-GAN with gradient penalty (WGAN-GP) [47]. Using well-known optimizers [4, 61], the model was trained for 100 epochs. We then calculated the Fréchet Inception Distance (FID) [24], a metric that measures the similarity between the real and generated image distributions and is used to assess the quality of the generative model (lower FID indicates superior performance). Five random runs were conducted, and the outcomes are presented in Figure 3. Results for SGD and MSVAG were extracted from Zhuang et al. [66].

Experimental results demonstrate that SGDF significantly enhances WGAN-GP training, achieving a better FID score than vanilla SGD and outperforming most adaptive optimization methods. The Wiener filter in SGDF facilitates smooth gradient updates, mitigating training oscillations and effectively addressing mode collapse.

4.2 Top Eigenvalues of Hessian and Hessian Trace

The success of optimization algorithms in deep learning not only depends on their ability to minimize training loss but also critically hinges on the nature of the solutions they converge to. Adam-type algorithms tend to converge to sharp minima, characterized by large eigenvalues of the Hessian matrix [6].

We computed the Hessian spectrum of ResNet-18 trained on the CIFAR-100 dataset for 200 epochs using four optimization methods: SGD, SGDM, Adam, and SGDF. These experiments ensure that all methods achieve similar results on the training set. We employed power iteration [58] to compute the top eigenvalues of the Hessian and Hutchinson’s method [59] to compute the Hessian trace. Histograms illustrating the distribution of the top 50 Hessian eigenvalues for each optimization method are presented in Fig. 4.
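The sketch below illustrates Hutchinson's estimator in PyTorch under simplified assumptions (a single loss evaluation and Rademacher probes); our experiments used the tooling of [58, 59] rather than this code:

```python
import torch

def hutchinson_trace(loss, params, n_probes=100):
    """Estimate tr(H) via E[v^T H v] with Rademacher probe vectors v."""
    grads = torch.autograd.grad(loss, params, create_graph=True)      # keep graph for HVPs
    trace = 0.0
    for _ in range(n_probes):
        vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]   # entries in {-1, +1}
        gv = sum((g * v).sum() for g, v in zip(grads, vs))            # g^T v
        hvs = torch.autograd.grad(gv, params, retain_graph=True)      # Hessian-vector product
        trace += sum((hv * v).sum().item() for hv, v in zip(hvs, vs)) # v^T H v
    return trace / n_probes

# usage: params = [p for p in model.parameters() if p.requires_grad]
```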

4.3 Visualization of Landscapes

We visualized the loss landscapes of models trained with SGD, SGDM, SGDF, and Adam using the ResNet-18 model on CIFAR-100, following the method in [34]. All models are trained with the same hyperparameters for 200 epochs, as detailed in Section 4.1. As shown in Fig. 5, SGDF finds flatter minima. Notably, the visualization reveals that Adam is more prone to converge to sharper minima.

4.4 Combining the Wiener Filter with Adam

We conducted comparative experiments on the CIFAR-100 dataset, evaluating both the vanilla Adam algorithm and Wiener-Adam, which substitutes Wiener filter estimates for the first-moment gradient estimates in the Adam optimizer. The results are presented in Tab. 4, and the detailed test curves are depicted in Fig. 10. This suggests that our first-moment filter estimation method has the potential to be applied to other optimization methods.

Table 4: Test accuracy (%) comparison between Adam and Wiener-Adam on CIFAR-100.

Model        | VGG11 | ResNet34 | DenseNet121
Wiener-Adam  | 62.64 | 73.98    | 74.89
Vanilla-Adam | 56.73 | 72.34    | 74.89

The performance improvement observed when using filter estimate can be attributed to several factors related to the nature of the Wiener filter and the architecture of the neural networks used.

Noise Reduction. The Wiener Filter is fundamentally designed to minimize the mean square error in the presence of noise. In the context of gradient estimation, this means it can effectively reduce the noise in the gradient updates. The vanilla Adam optimizer calculates the first moment (the mean) of the gradients, which may still include some noise components. By employing the Wiener filter, the algorithm may benefit from cleaner and more reliable gradient signals, thus improving the optimizer’s ability to converge to better minima.

Network Architecture. The impact of the Wiener Filter on gradient estimation varies with how each network architecture handles features and gradients. For VGG, which lacks residual connections, the Wiener filter significantly enhances performance by providing more accurate gradient estimates and reducing noise-induced errors, leading to better overall accuracy. In contrast, the benefits are less pronounced in ResNet and DenseNet, which incorporate residual and dense connections, respectively. These architectures naturally promote more stable gradient flows through direct paths or dense layer interconnectivity, diminishing the additional advantage offered by the Wiener filter. Intuitively, the loss landscapes of residually connected networks can be seen in [34]. Thus, while the Wiener filter improves performance in simpler architectures like VGG, its impact is mitigated in more complex architectures whose inherent features already assist gradient stability. A sketch of the Wiener-Adam modification follows.
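The sketch assumes the filtered estimate $\widehat{g}_t$ simply replaces Adam's bias-corrected first moment while the second-moment accumulator is kept unchanged; this is our illustrative NumPy reconstruction of the variant described above, and the variable names are ours:

```python
import numpy as np

def wiener_adam_step(theta, g, m, s, v, t, alpha=1e-3,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam step with the first moment replaced by the Wiener-filtered gradient."""
    m = beta1 * m + (1 - beta1) * g                       # first moment (EMA)
    s = beta2 * s + (1 - beta2) * (g - m) ** 2            # gradient-noise variance (SGDF)
    v = beta2 * v + (1 - beta2) * g ** 2                  # Adam's second moment
    m_hat = m / (1 - beta1 ** t)
    s_hat = ((1 - beta1) * (1 - beta1 ** (2 * t)) * s
             / ((1 + beta1) * (1 - beta2 ** t)))
    K = s_hat / (s_hat + (g - m_hat) ** 2 + eps)          # Wiener gain
    g_hat = m_hat + K * (g - m_hat)                       # filtered first moment
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * g_hat / (np.sqrt(v_hat) + eps)
    return theta, m, s, v
```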

5 Limitations

Theoretical Limitations. Our analysis relies on the common L-smooth condition for optimization continuity. Zhang et al. [64] propose gradient clipping, weakening this constraint. This suggests considering additional conditions for better understanding optimization in non-convex settings [52].

Empirical Limitations. Modern Large Language Models (LLMs) prioritize performance on training datasets [10, 36]. Discussing the theoretical properties of the Hessian matrix may lack relevance without comparable training losses. In practice, metrics like convergence speed and final loss values matter most.

6 Conclusion

In this paper, we introduce SGDF, a novel optimization method that estimates the gradient for faster convergence by leveraging both the variance of historical gradients and the current gradient. We demonstrate that SGDF yields solutions with a flat spectrum akin to SGD through Hessian spectral analysis. Through extensive experiments employing various deep learning architectures on benchmark datasets, we showcase SGDF’s superior performance compared to other state-of-the-art optimizers, striking a balance between convergence speed and generalization.

Acknowledgement

Zhipeng Yao thanks Aram Davtyan and Prof. Dr. Paolo Favaro of the Computer Vision Group at the University of Bern for discussions that improved the paper.

We thank the Ascend AI Eco-Innovation Centre at the Shenyang AI Computing Hub for computational support.

References

  • Arjovsky et al. [2017] Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein gan. 2017.
  • Balles & Hennig [2018] Balles, L. and Hennig, P. Dissecting adam: The sign, magnitude and variance of stochastic gradients. In International Conference on Machine Learning, pp. 404–413. PMLR, 2018.
  • Bengio & Lecun [2007] Bengio, Y. and Lecun, Y. Scaling learning algorithms towards ai. 2007.
  • Bernstein et al. [2020] Bernstein, J., Vahdat, A., Yue, Y., and Liu, M.-Y. On the distance between two neural networks and the stability of learning. Advances in Neural Information Processing Systems, 33:21370–21381, 2020.
  • Bottou et al. [2018] Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM review, 60(2):223–311, 2018.
  • Caldarola et al. [2022] Caldarola, D., Caputo, B., and Ciccone, M. Improving generalization in federated learning by seeking flat minima. ArXiv, 2022.
  • Chandramoorthy et al. [2022] Chandramoorthy, N., Loukas, A., Gatmiry, K., and Jegelka, S. On the generalization of learning algorithms that do not converge. Advances in Neural Information Processing Systems, 35:34241–34257, 2022.
  • Chen et al. [2018a] Chen, J., Zhou, D., Tang, Y., Yang, Z., Cao, Y., and Gu, Q. Closing the generalization gap of adaptive gradient methods in training deep neural networks. arXiv preprint arXiv:1806.06763, 2018a.
  • Chen et al. [2018b] Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941, 2018b.
  • Chen et al. [2023] Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Liu, Y., Pham, H., Dong, X., Luong, T., Hsieh, C.-J., et al. Symbolic discovery of optimization algorithms. arXiv preprint arXiv:2302.06675, 2023.
  • Dauphin et al. [2014] Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. MIT Press, 2014.
  • Davtyan et al. [2022] Davtyan, A., Sameni, S., Cerkezi, L., Meishvili, G., Bielski, A., and Favaro, P. Koala: A kalman optimization algorithm with loss adaptivity. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  6471–6479, 2022.
  • Defazio et al. [2014] Defazio, A., Bach, F., and Lacoste-Julien, S. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.
  • Deng et al. [2009] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • Dozat [2016] Dozat, T. Incorporating nesterov momentum into adam. ICLR Workshop, 2016.
  • Du & Lee [2018] Du, S. S. and Lee, J. D. On the power of over-parametrization in neural networks with quadratic activation. 2018.
  • Duchi et al. [2011] Duchi, John, Hazan, Elad, Singer, and Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.
  • Everingham et al. [2010] Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • Foret et al. [2021] Foret, P. et al. Sharpness-aware minimization for efficiently improving generalization. In ICLR, 2021. spotlight.
  • Geman et al. [2014] Geman, S., Bienenstock, E., and Doursat, R. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, 2014.
  • Goodfellow et al. [2016] Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT Press, 2016.
  • Hardt et al. [2015] Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. Mathematics, 2015.
  • He et al. [2016] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Heusel et al. [2017] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a nash equilibrium. 2017.
  • Hinton et al. [2012] Hinton, G., Srivastava, N., and Swersky, K. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on, 14(8):2, 2012.
  • Huang et al. [2017] Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  4700–4708, 2017.
  • Johnson & Zhang [2013] Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems, 26, 2013.
  • Kalman [1960] Kalman, R. E. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 1960.
  • Kay [1993] Kay, S. M. Fundamentals of Statistical Signal Processing: Estimation Theory, volume 1. Prentice-Hall, Inc., 1993.
  • Keskar et al. [2016] Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. 2016.
  • Keskar et al. [2017] Keskar, N. S. et al. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
  • Kingma & Ba [2014] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krizhevsky et al. [2009] Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
  • Li et al. [2018] Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. Advances in neural information processing systems, 31, 2018.
  • Lin et al. [2014] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. European Conference on Computer Vision (ECCV), 2014.
  • Liu et al. [2023] Liu, H., Li, Z., Hall, D., Liang, P., and Ma, T. Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv preprint arXiv:2305.14342, 2023.
  • Liu et al. [2019] Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.
  • Loshchilov & Hutter [2017] Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Lucchi et al. [2022] Lucchi, A., Proske, F., Orvieto, A., Bach, F., and Kersting, H. On the theoretical properties of noise correlation in stochastic optimization. Advances in Neural Information Processing Systems, 35:14261–14273, 2022.
  • Luo et al. [2019] Luo, L., Xiong, Y., Liu, Y., and Sun, X. Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843, 2019.
  • Robbins & Monro [1951] Robbins, H. and Monro, S. A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407, 1951.
  • Ollivier [2019] Ollivier, Y. The extended kalman filter is a natural gradient descent in trajectory space. arXiv: Optimization and Control, 2019.
  • Rao [1992] Rao, C. R. Information and the accuracy attainable in the estimation of statistical parameters. In Breakthroughs in Statistics: Foundations and basic theory, pp.  235–247. Springer, 1992.
  • Reddi et al. [2019] Reddi, S. J., Kale, S., and Kumar, S. On the convergence of adam and beyond. 2019.
  • Ren et al. [2015] Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Neural Information Processing Systems (NIPS), 2015.
  • Ruder [2016] Ruder, S. An overview of gradient descent optimization algorithms. 2016.
  • Salimans et al. [2016] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
  • Schmidt et al. [2017] Schmidt, M., Roux, N. L., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1):83–112, 2017.
  • Simonyan & Zisserman [2014] Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. Computer Science, 2014.
  • Sutskever et al. [2013] Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147. PMLR, 2013.
  • Vuckovic [2018] Vuckovic, J. Kalman gradient descent: Adaptive variance reduction in stochastic optimization. ArXiv, 2018.
  • Wang et al. [2022] Wang, B., Zhang, Y., Zhang, H., Meng, Q., Ma, Z.-M., Liu, T.-Y., and Chen, W. Provable adaptivity in adam. arXiv preprint arXiv:2208.09900, 2022.
  • Wiener [1950] Wiener, N. The extrapolation, interpolation and smoothing of stationary time series, with engineering applications. Journal of the Royal Statistical Society Series A (General), 1950.
  • Wilson et al. [2017] Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. 2017.
  • Xie et al. [2020] Xie, Z., Wang, X., Zhang, H., Sato, I., and Sugiyama, M. Adai: Separating the effects of adaptive learning rate and momentum inertia. arXiv preprint arXiv:2006.15815, 2020.
  • Xie et al. [2022] Xie, Z., Tang, Q. Y., Cai, Y., Sun, M., and Li, P. On the power-law spectrum in deep learning: A bridge to protein science. 2022.
  • Yang et al. [2023] Yang, N., Tang, C., and Tu, Y. Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions. Physical Review Letters, 130(23):237101, 2023.
  • Yao et al. [2018] Yao, Z., Gholami, A., Lei, Q., Keutzer, K., and Mahoney, M. W. Hessian-based analysis of large batch training and robustness to adversaries. 2018.
  • Yao et al. [2020a] Yao, Z., Gholami, A., Keutzer, K., and Mahoney, M. W. Pyhessian: Neural networks through the lens of the hessian. In International Conference on Big Data, 2020a.
  • Yao et al. [2020b] Yao, Z., Gholami, A., Shen, S., Keutzer, K., and Mahoney, M. W. Adahessian: An adaptive second order optimizer for machine learning. arXiv preprint arXiv:2006.00719, 2020b.
  • Zaheer et al. [2018] Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex optimization. Advances in neural information processing systems, 31, 2018.
  • Zeiler [2012] Zeiler, M. D. Adadelta: An adaptive learning rate method. arXiv e-prints, 2012.
  • Zhang et al. [2016] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. 2016.
  • Zhang et al. [2019] Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2019.
  • Zhang et al. [2023] Zhang, X., Xu, R., Yu, H., Zou, H., and Cui, P. Gradient norm aware minimization seeks first-order flatness and improves generalization. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Zhuang et al. [2020] Zhuang, J., Tang, T., Ding, Y., Tatikonda, S. C., Dvornek, N., Papademetris, X., and Duncan, J. Adabelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in neural information processing systems, 33:18795–18806, 2020.

Appendix A Method Derivation (Section 3 in main paper)

A.1 Wiener Filter Derivation for Gradient Estimation (Main paper Section 3.1)

Given the sequence of gradients $\{g_t\}$ in a stochastic gradient descent process, we aim to find an estimate $\widehat{g}_t$ that incorporates information from both the historical gradients and the current gradient. The Wiener Filter provides an estimate that minimizes the mean squared error. We begin by constructing the estimate as a simple average and then refine it using the properties of the Wiener Filter.

$$\begin{aligned}
\widehat{g}_t &= \frac{1}{T+1}\sum_{t=1}^{T} g_t + \frac{1}{T+1} g_t \\
&= \frac{1}{T+1}\frac{T}{T}\sum_{t=1}^{T} g_t + \frac{1}{T+1} g_t \\
&= \frac{T}{T+1}\bar{g}_t + \frac{1}{T+1} g_t \\
&\overset{(a)}{\approx} \frac{T}{T+1}\widehat{m}_t + \frac{1}{T+1} g_t \\
&= \left(1 - \frac{1}{T+1}\right)\widehat{m}_t + \frac{1}{T+1} g_t \\
&= \widehat{m}_t - K_t\widehat{m}_t + K_t g_t \\
&= \widehat{m}_t + K_t\left(g_t - \widehat{m}_t\right)
\end{aligned} \quad (6)$$

In the above derivation, step (a) replaces the arithmetic mean of the gradients $\bar{g}_t$ with the momentum term $\widehat{m}_t$. The Wiener gain $K_t = \frac{1}{T+1}$ is then introduced to update the gradient estimate with information from the new gradient.

By defining $\widehat{g}_t$ as the weighted combination of the momentum term $\widehat{m}_t$ and the current gradient $g_t$, we can compute the variance of $\widehat{g}_t$ as follows:

$$\begin{aligned}
\mathrm{Var}(\widehat{g}_t) &= \mathrm{Var}\left((1-K_t)\widehat{m}_t + K_t g_t\right) \\
&= (1-K_t)^2\,\mathrm{Var}(\widehat{m}_t) + K_t^2\,\mathrm{Var}(g_t)
\end{aligned} \quad (7)$$

Minimizing the variance of $\widehat{g}_t$ with respect to $K_t$, by setting the derivative $\frac{\mathrm{d}\,\mathrm{Var}(\widehat{g}_t)}{\mathrm{d}K_t} = 0$, yields:

$$\begin{aligned}
0 &= -2(1-K_t)\,\mathrm{Var}(\widehat{m}_t) + 2K_t\,\mathrm{Var}(g_t) \\
0 &= -(1-K_t)\,\mathrm{Var}(\widehat{m}_t) + K_t\,\mathrm{Var}(g_t) \\
K_t &= \frac{\mathrm{Var}(\widehat{m}_t)}{\mathrm{Var}(\widehat{m}_t) + \mathrm{Var}(g_t)}
\end{aligned} \quad (8)$$

The final expression for KtK_{t} shows that the optimal interpolation coefficient is the ratio of the variance of the momentum term to the sum of the variances of the momentum term and the current gradient. This result exemplifies the essence of the Wiener Filter: optimally combining past information with new observations to reduce estimation error due to noisy data.
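This optimality is easy to verify numerically (with illustrative variances, not measured ones): sweeping $K$ shows the combined variance is minimized exactly at $\mathrm{Var}(\widehat{m}_t)/(\mathrm{Var}(\widehat{m}_t)+\mathrm{Var}(g_t))$:

```python
import numpy as np

var_m, var_g = 0.2, 0.8                          # assumed variances of m_hat and g_t
K = np.linspace(0.0, 1.0, 100001)
combined = (1 - K) ** 2 * var_m + K ** 2 * var_g # variance of the combined estimate
print(K[np.argmin(combined)])                    # ~0.2  = var_m / (var_m + var_g)
print(combined.min())                            # 0.16 = var_m * var_g / (var_m + var_g)
```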

A.2 Variance Correction (Correction factor in main paper Section 3.1)

The momentum term is defined as:

$$m_t = (1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i} g_i, \quad (9)$$

which means that the momentum term is a weighted sum of past gradients, where the weights decrease exponentially over time.

To compute the variance of the momentum term $m_t$, we first observe that since the $g_i$ are independent and identically distributed with constant variance $\sigma_g^2$, the variance of the momentum term can be obtained by summing the variances of all the weighted gradients.

The variance of each weighted gradient $\beta_1^{t-i} g_i$ is $\beta_1^{2(t-i)}\sigma_g^2$, because variance is quadratic in scale, so the weight $\beta_1^{t-i}$ becomes $\beta_1^{2(t-i)}$ in the variance computation.

Therefore, the variance of mtm_{t} is the sum of all these weighted variances:

$$\sigma_{m_t}^2 = (1-\beta_1)^2\sigma_g^2\sum_{i=1}^{t}\beta_1^{2(t-i)}. \quad (10)$$

The factor $(1-\beta_1)^2$ comes from the multiplicative factor $(1-\beta_1)$ in the momentum update formula, which is squared when calculating the variance.

The summation $\sum_{i=1}^{t}\beta_1^{2(t-i)}$ is a geometric series, which can be written as:

$$\sum_{i=1}^{t}\beta_1^{2(t-i)} = \frac{1-\beta_1^{2t}}{1-\beta_1^{2}}. \quad (11)$$

As $t \rightarrow \infty$, given that $\beta_1 < 1$, we have $\beta_1^{2t} \rightarrow 0$, and the geometric series converges to:

$$\lim_{t\rightarrow\infty}\sum_{i=1}^{t}\beta_1^{2(t-i)} = \lim_{t\rightarrow\infty}\frac{1-\beta_1^{2t}}{1-\beta_1^{2}} = \frac{1}{1-\beta_1^{2}}. \quad (12)$$

Consequently, the long-term variance of the momentum term $m_t$ is:

$$\sigma_{m_t}^2 = \frac{(1-\beta_1)^2}{1-\beta_1^{2}}\,\sigma_g^2 = \frac{1-\beta_1}{1+\beta_1}\,\sigma_g^2. \quad (13)$$

This result shows how the momentum term reduces the effective gradient noise: by a factor of $\frac{1-\beta_1}{1+\beta_1}$ relative to the gradient variance $\sigma_g^2$.
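A Monte-Carlo check of this factor under the i.i.d. assumption (illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, sigma_g, steps, runs = 0.9, 1.0, 2000, 5000

m = np.zeros(runs)                              # one EMA per independent run
for _ in range(steps):
    g = rng.normal(0.0, sigma_g, size=runs)     # i.i.d. gradient noise
    m = beta1 * m + (1 - beta1) * g

print(m.var())                                  # empirical Var(m_t)
print((1 - beta1) / (1 + beta1) * sigma_g**2)   # predicted factor: ~0.0526
```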

A.3 Fusion Gaussian distribution (Main paper Section 3.2)

Consider two Gaussian distributions for the momentum term $\widehat{m}_t$ and the current gradient $g_t$:

  • The momentum term $\widehat{m}_t$ is normally distributed with mean $\mu_m$ and variance $\sigma_m^2$, denoted as $\widehat{m}_t \sim \mathcal{N}(\mu_m, \sigma_m^2)$.

  • The current gradient $g_t$ is normally distributed with mean $\mu_g$ and variance $\sigma_g^2$, denoted as $g_t \sim \mathcal{N}(\mu_g, \sigma_g^2)$.

The product of their probability density functions is given by:

$$N(\widehat{m}_t;\mu_m,\sigma_m)\cdot N(g_t;\mu_g,\sigma_g) = \frac{1}{2\pi\sigma_m\sigma_g}\exp\left(-\frac{(\widehat{m}_t-\mu_m)^2}{2\sigma_m^2}-\frac{(g_t-\mu_g)^2}{2\sigma_g^2}\right) \quad (14)$$

The goal is to find an equivalent mean $\mu'$ and variance $\sigma'^2$ for the new Gaussian distribution that matches the product:

$$N(x;\mu',\sigma'^2) = \frac{1}{\sqrt{2\pi}\,\sigma'}\exp\left(-\frac{(x-\mu')^2}{2\sigma'^2}\right) \quad (15)$$

We derive the expression for combining these two distributions. For convenience, define the exponent $t$ as follows:

$$\begin{aligned}
t &= -\frac{(x-\mu_m)^2}{2\sigma_m^2} - \frac{(x-\mu_g)^2}{2\sigma_g^2} \\
&= -\frac{\sigma_g^2(x-\mu_m)^2 + \sigma_m^2(x-\mu_g)^2}{2\sigma_m^2\sigma_g^2} \\
&= -\frac{\left(x - \frac{\sigma_g^2\mu_m + \sigma_m^2\mu_g}{\sigma_m^2+\sigma_g^2}\right)^2}{\frac{2\sigma_m^2\sigma_g^2}{\sigma_m^2+\sigma_g^2}} - \frac{(\mu_m-\mu_g)^2}{2(\sigma_m^2+\sigma_g^2)}.
\end{aligned} \quad (16)$$

Through coefficient matching in the exponential terms, we obtain the new mean and variance:

$$\mu' = \frac{\sigma_g^2\mu_m+\sigma_m^2\mu_g}{\sigma_m^2+\sigma_g^2}, \qquad \sigma'^2 = \frac{\sigma_m^2\sigma_g^2}{\sigma_m^2+\sigma_g^2} \quad (17)$$

The new mean $\mu'$ is a weighted average of the two means $\mu_m$ and $\mu_g$, with weights inversely proportional to their variances. This places $\mu'$ between $\mu_m$ and $\mu_g$, closer to the mean with the smaller variance. The new standard deviation $\sigma'$ is smaller than either of the original standard deviations $\sigma_m$ and $\sigma_g$, which reflects the reduced uncertainty in the estimate due to the combination of information from both sources. This is a direct consequence of the Wiener Filter’s optimality in the mean-square error sense.

A.4 Fokker Planck modelling (Theorem 3.1 in main paper)

Theorem A.1.

Consider a system described by the Fokker-Planck equation, evolving the probability density function $P$ in one-dimensional and multi-dimensional parameter spaces. Given a loss function $f(\theta)$, let the noise variance $D$ or diffusion matrix $D_{ij}$ satisfy $D \geq C > 0$ or $D_i \geq C > 0$, where $C$ is a positive lower-bound constant, known as the Cramér-Rao lower bound. In the steady-state condition, i.e., $\frac{\partial P}{\partial t} = 0$, the analytical form of the probability density $P$ can be obtained by solving the corresponding Fokker-Planck equation. These solutions reveal the probability distribution of the system at steady state, described as follows:

One-dimensional case. In a one-dimensional parameter space, the probability density function $P(\theta)$ is

$$P(\theta) = \frac{1}{Z}\exp\left(-\int\frac{1}{D}\frac{\partial f}{\partial\theta}\,d\theta\right), \quad (18)$$

where ZZ is a normalization constant, ensuring the total probability sums to one.

Multi-dimensional case. In a multi-dimensional parameter space, the probability density function $P(\theta)$ is

$$P(\theta) = \frac{1}{Z}\exp\left(-\sum_{i=1}^{n}\frac{f(\theta)}{D_i}\right), \quad (19)$$

Here, $Z$ is again a normalization constant ensuring the total probability integrates to one, assuming $D_{ij} = D_i\delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta.

Proof.

One-dimensional Fokker-Planck equation. Consider the one-dimensional Fokker-Planck equation associated with the gradient-descent drift $-\partial f/\partial\theta$:

$$\frac{\partial P}{\partial t} = \frac{\partial}{\partial\theta}\left(P\frac{\partial f}{\partial\theta}\right) + \frac{\partial^2}{\partial\theta^2}(DP), \quad (20)$$

where $f(\theta)$ is the loss function and $D$ is the variance of the noise, with $D \geq C > 0$ representing a positive lower bound for the variance. $P$ denotes the probability density of finding the state of the system near a given point or region.

Derivation of the Steady-State Distribution:

In the steady-state condition, $\frac{\partial P}{\partial t} = 0$, the equation simplifies to:

$$0 = \frac{\partial}{\partial\theta}\left(P\frac{\partial f}{\partial\theta}\right) + \frac{\partial^2}{\partial\theta^2}(DP). \quad (21)$$

Our goal is to find the probability density $P$ as a function of $\theta$.

Rearranging, we obtain:

$$\frac{\partial}{\partial\theta}\left(P\frac{\partial f}{\partial\theta}\right) = -\frac{\partial^2}{\partial\theta^2}(DP). \quad (22)$$

Next, we set $J = P\frac{\partial f}{\partial\theta}$ (the drift contribution to the probability flux); then, for constant $D$, we have:

$$\frac{\partial J}{\partial\theta} = -\frac{\partial}{\partial\theta}\left(D\frac{\partial P}{\partial\theta}\right). \quad (23)$$

Upon integration, we get:

$$J = -D\frac{\partial P}{\partial\theta} + C_1, \quad (24)$$

where $C_1$ is an integration constant. Assuming the probability current vanishes at infinity, $C_1 = 0$.

Therefore, we have:

$$D\frac{\partial P}{\partial\theta} = -P\frac{\partial f}{\partial\theta}. \quad (25)$$

This equation can be rewritten as:

$$\frac{\partial P}{\partial\theta} = -\frac{P}{D}\frac{\partial f}{\partial\theta}. \quad (26)$$

Now, leveraging the variance lower bound $D \geq C$, we analyze the above equation. Since $D$ is a positive constant, we can integrate further to get $P$:

$$\ln P = -\int\frac{1}{D}\frac{\partial f}{\partial\theta}\,d\theta + C_2, \quad (27)$$

where $C_2$ is an integration constant.

Solving for $P$, we get:

$$P = e^{C_2}\exp\left(-\int\frac{1}{D}\frac{\partial f}{\partial\theta}\,d\theta\right). \quad (28)$$

Since $D$ has a lower bound, $\frac{1}{D}$ is bounded above, which ensures that $P$ does not blow up at any particular value of $\theta$.

Multi-dimensional Fokker-Planck equation. Consider a multi-dimensional parameter space $\theta \in \mathbb{R}^n$ and a loss function $f(\theta)$. The evolution of the probability density function $P(\theta, t)$ in this space is governed by the Fokker-Planck equation:

$$\frac{\partial P}{\partial t} = \sum_{i=1}^{n}\frac{\partial}{\partial\theta_i}\left(P\frac{\partial f}{\partial\theta_i}\right) + \sum_{i=1}^{n}\sum_{j=1}^{n}\frac{\partial^2}{\partial\theta_i\partial\theta_j}\left(D_{ij}P\right), \quad (29)$$

where DijD_{ij} are elements of the diffusion matrix, representing the intensity and correlation of the stochastic in the directions θi\theta_{i} and θj\theta_{j}. At the steady state, where the time derivative of PP vanishes, we find:

0=i=1nθi(Pfθi)+i=1nj=1n2θiθj(DijP).0=-\sum_{i=1}^{n}\frac{\partial}{\partial\theta_{i}}\left(P\frac{\partial f}{\partial\theta_{i}}\right)+\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{\partial^{2}}{\partial\theta_{i}\partial\theta_{j}}\left(D_{ij}P\right). (30)

Assuming Dij=DiδijD_{ij}=D_{i}\delta_{ij} where δij\delta_{ij} is the Kronecker delta, and DiC>0D_{i}\geq C>0, the equation simplifies to:

0=i=1nθi(Pfθi)+i=1n2θi2(DiP).0=-\sum_{i=1}^{n}\frac{\partial}{\partial\theta_{i}}\left(P\frac{\partial f}{\partial\theta_{i}}\right)+\sum_{i=1}^{n}\frac{\partial^{2}}{\partial\theta_{i}^{2}}\left(D_{i}P\right). (31)

Integrating with respect to θi\theta_{i}, we obtain a set of equations:

DiPθi=Pfθi+Ci,D_{i}\frac{\partial P}{\partial\theta_{i}}=P\frac{\partial f}{\partial\theta_{i}}+C_{i}, (32)

where CiC_{i} is an integration constant. Assuming Ci=0C_{i}=0, which corresponds to no flux at the boundaries, we can solve for PP:

P(θ)=1Zexp(i=1nf(θ)Di),P(\theta)=\frac{1}{Z}\exp\left(-\sum_{i=1}^{n}\frac{f(\theta)}{D_{i}}\right), (33)

where ZZ is a normalization constant ensuring that the total probability integrates to one.

Exploration Efficacy of SGD due to Variance Lower Bound The existence of a variance lower bound in Stochastic Gradient Descent (SGD) critically enhances the algorithm’s exploration capabilities, particularly in regions of the loss landscape where gradients are minimal. By preventing the probability density function from becoming unbounded, it ensures continuous exploration and increases the probability of converging to flat minima that are associated with better generalization properties. This principle holds true across both one-dimensional and multi-dimensional scenarios, making the variance lower bound an essential consideration for optimizing SGD’s performance in finding robust, generalizable solutions.
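To make the steady-state analysis concrete, the following sketch simulates one-dimensional Langevin dynamics (a continuous-time idealization of SGD with constant noise variance $D$) and compares the empirical histogram of the iterates with the density of Eq. (18). The double-well loss and all numeric constants are illustrative assumptions for the demonstration, not values taken from the paper.

```python
import numpy as np

# Illustrative double-well loss and its gradient (assumed for this demo).
f = lambda th: 0.25 * th**4 - 0.5 * th**2
df = lambda th: th**3 - th

D, eta, steps = 0.2, 1e-3, 500_000
rng = np.random.default_rng(0)

# Euler-Maruyama simulation of d(theta) = -f'(theta) dt + sqrt(2D) dW.
theta = 0.0
samples = np.empty(steps)
for k in range(steps):
    theta += -eta * df(theta) + np.sqrt(2.0 * D * eta) * rng.standard_normal()
    samples[k] = theta

# Predicted steady state from Eq. (18) with constant D: P(theta) ~ exp(-f/D).
grid = np.linspace(-2.5, 2.5, 401)
p = np.exp(-f(grid) / D)
p /= np.trapz(p, grid)  # normalization constant 1/Z

hist, edges = np.histogram(samples[steps // 10:], bins=80, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print("max density gap:", np.abs(hist - np.interp(centers, grid, p)).max())
```

Because $D$ is bounded away from zero, the simulated chain keeps visiting both wells instead of collapsing into a single minimum, which is exactly the exploration behavior discussed above.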

Appendix B Convergence analysis in convex online learning case (Theorem 3.2 in main paper).

Assumption B.1.

Variables are bounded: $\exists D$ such that $\forall t,\ \|\theta_{t}\|_{2}\leq D$. Gradients are bounded: $\exists G$ such that $\forall t,\ \|g_{t}\|_{2}\leq G$.

Definition B.2.

Let $f_{t}(\theta_{t})$ be the loss at time $t$ and $f_{t}(\theta^{*})$ be the loss of the best possible strategy at the same time. The cumulative regret $R(T)$ at time $T$ is defined as:

R(T)=\sum_{t=1}^{T}\left[f_{t}(\theta_{t})-f_{t}(\theta^{*})\right]    (34)
Definition B.3.

A function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is convex if for all $x,y\in\mathbb{R}^{d}$ and all $\lambda\in[0,1]$,

\lambda f(x)+(1-\lambda)f(y)\geq f(\lambda x+(1-\lambda)y)    (35)

Also, notice that a convex function can be lower bounded by a hyperplane at its tangent.

Lemma B.4.

If a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is convex, then for all $x,y\in\mathbb{R}^{d}$,

f(y)\geq f(x)+\nabla f(x)^{T}(y-x)    (36)

The above lemma can be used to upper bound the regret, and our proof for the main theorem is constructed by substituting the hyperplane with SGDF update rules.
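As a quick numerical illustration of Lemma B.4 (purely illustrative; the convex test function and sample counts are arbitrary choices), the sketch below checks the first-order lower bound for the log-sum-exp function at random point pairs:

```python
import numpy as np

rng = np.random.default_rng(1)

# Convex test function f(x) = log(sum_j exp(x_j)); its gradient is the softmax.
f = lambda x: np.logaddexp.reduce(x)
grad = lambda x: np.exp(x - f(x))

# Check f(y) >= f(x) + <grad f(x), y - x> (Eq. 36) on random pairs.
for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-10
print("first-order lower bound held on all sampled pairs")
```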

The following two lemmas are used to support our main theorem. We also use some definitions to simplify our notation, where $g_{t}\triangleq\nabla f_{t}(\theta_{t})$ and $g_{t,i}$ denotes its $i^{\text{th}}$ element. We denote $g_{1:t,i}\in\mathbb{R}^{t}$ as the vector containing the $i^{\text{th}}$ dimension of the gradients over all iterations up to $t$: $g_{1:t,i}=[g_{1,i},g_{2,i},\cdots,g_{t,i}]$.

Lemma B.5.

Let $g_{t}=\nabla f_{t}(\theta_{t})$ and $g_{1:t}$ be defined as above and bounded,

\|g_{t}\|_{2}\leq G,\qquad\|g_{t}\|_{\infty}\leq G_{\infty}.    (37)

Then,

\sum_{t=1}^{T}g_{t,i}\leq 2G_{\infty}\|g_{1:T,i}\|_{2}.    (38)

Proof. We will prove the inequality using induction over $T$. For the base case $T=1$:

g_{1,i}\leq 2G_{\infty}\|g_{1,i}\|_{2}.    (39)

Assuming the inductive hypothesis holds for $T-1$, the inductive step gives:

\sum_{t=1}^{T}g_{t,i}=\sum_{t=1}^{T-1}g_{t,i}+g_{T,i}    (40)
\leq 2G_{\infty}\|g_{1:T-1,i}\|_{2}+g_{T,i}
=2G_{\infty}\sqrt{\|g_{1:T,i}\|_{2}^{2}-g_{T,i}^{2}}+g_{T,i}.

Given,

\|g_{1:T,i}\|_{2}^{2}-g_{T,i}^{2}+\frac{g_{T,i}^{4}}{4\|g_{1:T,i}\|_{2}^{2}}\geq\|g_{1:T,i}\|_{2}^{2}-g_{T,i}^{2},    (41)

taking the square root of both sides, we get:

\sqrt{\|g_{1:T,i}\|_{2}^{2}-g_{T,i}^{2}}\leq\|g_{1:T,i}\|_{2}-\frac{g_{T,i}^{2}}{2\|g_{1:T,i}\|_{2}}    (42)
\leq\|g_{1:T,i}\|_{2}-\frac{g_{T,i}^{2}}{2\sqrt{TG_{\infty}^{2}}}.

Substituting into the previous inequality:

2G_{\infty}\sqrt{\|g_{1:T,i}\|_{2}^{2}-g_{T,i}^{2}}+g_{T,i}\leq 2G_{\infty}\|g_{1:T,i}\|_{2}    (43)
Lemma B.6.

Let $g_{t}$ be bounded, $\|g_{t}\|_{2}\leq G$, $\|g_{t}\|_{\infty}\leq G_{\infty}$. Then the following inequality holds:

\sum_{t=1}^{T}\widehat{m}_{t,i}^{2}\leq\frac{4G_{\infty}^{2}}{(1-\beta_{1})^{2}}\|g_{1:T,i}\|_{2}^{2}    (44)

Proof. Using the inequality $\frac{1}{(1-\beta_{1}^{t})^{2}}\leq\frac{1}{(1-\beta_{1})^{2}}$, we can expand the last term in the summation using the update rules in Algorithm 1:

\sum_{t=1}^{T}\widehat{m}_{t,i}^{2}=\sum_{t=1}^{T-1}\widehat{m}_{t,i}^{2}+\frac{\left(\sum_{k=1}^{T}(1-\beta_{1})\beta_{1}^{T-k}g_{k,i}\right)^{2}}{(1-\beta_{1}^{T})^{2}}    (45)
\leq\sum_{t=1}^{T-1}\widehat{m}_{t,i}^{2}+\frac{\sum_{k=1}^{T}T\left((1-\beta_{1})\beta_{1}^{T-k}g_{k,i}\right)^{2}}{(1-\beta_{1}^{T})^{2}}
\leq\sum_{t=1}^{T-1}\widehat{m}_{t,i}^{2}+\frac{(1-\beta_{1})^{2}}{(1-\beta_{1}^{T})^{2}}\sum_{k=1}^{T}T\left(\beta_{1}^{2}\right)^{T-k}\|g_{k,i}\|_{2}^{2}
\leq\sum_{t=1}^{T-1}\widehat{m}_{t,i}^{2}+T\sum_{k=1}^{T}\left(\beta_{1}^{2}\right)^{T-k}\|g_{k,i}\|_{2}^{2}

Similarly, we can upper-bound the rest of the terms in the summation.

\sum_{t=1}^{T}\widehat{m}_{t,i}^{2}\leq\sum_{t=1}^{T}\|g_{t,i}\|_{2}^{2}\sum_{j=0}^{T-t}t\beta_{1}^{j}    (46)
\leq\sum_{t=1}^{T}\|g_{t,i}\|_{2}^{2}\sum_{j=0}^{T}t\beta_{1}^{j}

For $\beta_{1}<1$, using the upper bound on the arithmetic-geometric series, $\sum_{t}t\beta_{1}^{t}<\frac{1}{(1-\beta_{1})^{2}}$:

\sum_{t=1}^{T}\|g_{t,i}\|_{2}^{2}\sum_{j=0}^{T}t\beta_{1}^{j}\leq\frac{1}{(1-\beta_{1})^{2}}\sum_{t=1}^{T}\|g_{t,i}\|_{2}^{2}    (47)

Applying Lemma B.5,

\sum_{t=1}^{T}\widehat{m}_{t,i}^{2}\leq\frac{4G_{\infty}^{2}}{(1-\beta_{1})^{2}}\|g_{1:T,i}\|_{2}^{2}    (48)
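A quick numerical sanity check of inequality (44) under typical hyperparameters ($\beta_{1}=0.9$, random bounded gradients; all values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, T = 0.9, 1000
g = rng.uniform(-1.0, 1.0, size=T)  # one coordinate i, so G_inf = max |g_t|
G_inf = np.abs(g).max()

m, lhs = 0.0, 0.0
for t in range(1, T + 1):
    m = beta1 * m + (1 - beta1) * g[t - 1]  # first-moment EMA
    m_hat = m / (1 - beta1**t)              # bias correction
    lhs += m_hat**2

rhs = 4 * G_inf**2 / (1 - beta1)**2 * np.sum(g**2)
print(f"sum m_hat^2 = {lhs:.2f} <= bound = {rhs:.2f}: {lhs <= rhs}")
```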
Theorem B.7.

Assume that the function $f_{t}$ has bounded gradients, $\|\nabla f_{t}(\theta)\|_{2}\leq G$, $\|\nabla f_{t}(\theta)\|_{\infty}\leq G_{\infty}$ for all $\theta\in\mathbb{R}^{d}$, that the distance between any $\theta_{t}$ generated by SGDF is bounded, $\|\theta_{n}-\theta_{m}\|_{2}\leq D$, $\|\theta_{m}-\theta_{n}\|_{\infty}\leq D_{\infty}$ for any $m,n\in\{1,\ldots,T\}$, and that $\beta_{1},\beta_{2}\in[0,1)$. Let $\alpha_{t}=\alpha/\sqrt{t}$. For all $T\geq 1$, SGDF achieves the following guarantee:

R(T)\leq\frac{D^{2}}{\alpha}\sum_{i=1}^{d}\sqrt{T}+\frac{2D_{\infty}G_{\infty}}{1-\beta_{1}}\sum_{i=1}^{d}\|g_{1:T,i}\|_{2}+\frac{2\alpha G_{\infty}^{2}\left(1+(1-\beta_{1})^{2}\right)}{\sqrt{T}(1-\beta_{1})^{2}}\sum_{i=1}^{d}\|g_{1:T,i}\|_{2}^{2}    (49)

Proof of Convex Convergence.

We aim to prove the convergence of the algorithm by showing that $R(T)$ is bounded, or equivalently, that $R(T)/T$ converges to zero as $T$ goes to infinity.

To express the cumulative regret in terms of each dimension, let $f_{t}(\theta_{t})$ and $f_{t}(\theta^{*})$ represent the loss and the best strategy's loss restricted to the $i$th dimension, respectively. Define $R_{T,i}$ as:

R_{T,i}=\sum_{t=1}^{T}f_{t}(\theta_{t})-f_{t}(\theta^{*})    (50)

Then, the overall regret $R(T)$ can be expressed as a sum over all $d$ dimensions:

R(T)=\sum_{i=1}^{d}R_{T,i}    (51)

Establishing the Connection: From the Iteration of $\theta_{t}$ to $\left\langle g_{t},\theta_{t}-\theta^{*}\right\rangle$

Using Lemma B.4, we have,

f_{t}(\theta_{t})-f_{t}(\theta^{*})\leq g_{t}^{T}(\theta_{t}-\theta^{*})=\sum_{i=1}^{d}g_{t,i}(\theta_{t,i}-\theta_{i}^{*})    (52)

From the update rules presented in Algorithm 1,

\theta_{t+1}=\theta_{t}-\alpha_{t}\widehat{g}_{t}    (53)
=\theta_{t}-\alpha_{t}\left(\widehat{m}_{t}+K_{t}\left(g_{t}-\widehat{m}_{t}\right)\right)
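For concreteness, a minimal NumPy sketch of this per-step update follows. The Wiener gain $K_{t}$ comes from the variance estimation of Algorithm 1 in the main paper; here it is simply passed in as an argument, so this is illustrative pseudocode rather than the released implementation:

```python
import numpy as np

def sgdf_step(theta, g, m, t, K, lr=0.1, beta1=0.9):
    """One SGDF update per Eq. (53). K is the (per-coordinate) Wiener gain,
    assumed to be computed elsewhere as in Algorithm 1 of the main paper."""
    m = beta1 * m + (1.0 - beta1) * g  # first-moment EMA
    m_hat = m / (1.0 - beta1**t)       # bias-corrected estimate
    g_hat = m_hat + K * (g - m_hat)    # filtered gradient
    return theta - lr * g_hat, m
```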

We focus on the $i^{\text{th}}$ dimension of the parameter vector $\theta_{t}\in\mathbb{R}^{d}$. Subtracting the scalar $\theta_{i}^{*}$ and squaring both sides of the above update rule, we have:

(\theta_{t+1,i}-\theta_{i}^{*})^{2}=(\theta_{t,i}-\theta_{i}^{*})^{2}-2\alpha_{t}\left(\widehat{m}_{t,i}+K_{t,i}(g_{t,i}-\widehat{m}_{t,i})\right)(\theta_{t,i}-\theta_{i}^{*})+\alpha_{t}^{2}\widehat{g}_{t,i}^{2}    (54)

Separating out the term $g_{t,i}(\theta_{t,i}-\theta_{i}^{*})$:

g_{t,i}(\theta_{t,i}-\theta_{i}^{*})=\underset{(1)}{\underbrace{\frac{(\theta_{t,i}-\theta_{i}^{*})^{2}-(\theta_{t+1,i}-\theta_{i}^{*})^{2}}{2\alpha_{t}K_{t,i}}}}\underset{(2)}{\underbrace{-\frac{1-K_{t,i}}{K_{t,i}}\widehat{m}_{t,i}(\theta_{t,i}-\theta_{i}^{*})}}+\underset{(3)}{\underbrace{\frac{\alpha_{t}}{2K_{t,i}}(\widehat{g}_{t,i})^{2}}}    (55)

We then deal with (1), (2) and (3) separately.

For the first term (1), we have:

\sum_{t=1}^{T}\frac{(\theta_{t,i}-\theta_{i}^{*})^{2}-(\theta_{t+1,i}-\theta_{i}^{*})^{2}}{2\alpha_{t}K_{t,i}}    (56)
=\frac{(\theta_{1,i}-\theta_{i}^{*})^{2}}{2\alpha_{1}K_{1,i}}-\frac{(\theta_{T+1,i}-\theta_{i}^{*})^{2}}{2\alpha_{T}K_{T,i}}+\sum_{t=2}^{T}(\theta_{t,i}-\theta_{i}^{*})^{2}\left[\frac{1}{2\alpha_{t}K_{t,i}}-\frac{1}{2\alpha_{t-1}K_{t-1,i}}\right]

Given that $-\frac{(\theta_{T+1,i}-\theta_{i}^{*})^{2}}{2\alpha_{T}K_{T,i}}\leq 0$ and $\frac{(\theta_{1,i}-\theta_{i}^{*})^{2}}{2\alpha_{1}K_{1,i}}\leq\frac{D_{i}^{2}}{2\alpha_{1}K_{1,i}}$, we can bound it as:

\sum_{t=1}^{T}\frac{(\theta_{t,i}-\theta_{i}^{*})^{2}-(\theta_{t+1,i}-\theta_{i}^{*})^{2}}{2\alpha_{t}K_{t,i}}\leq\sum_{t=1}^{T}\frac{(\theta_{t,i}-\theta_{i}^{*})^{2}}{2\alpha_{t}K_{t,i}}    (57)

For the second term (2), we have:

\sum_{t=1}^{T}-\frac{1-K_{t,i}}{K_{t,i}}\widehat{m}_{t,i}(\theta_{t,i}-\theta_{i}^{*})    (58)
=\sum_{t=1}^{T}-\frac{1-K_{t,i}}{K_{t,i}(1-\beta_{1}^{t})}\left(\sum_{k=1}^{t}(1-\beta_{1})\beta_{1}^{t-k}\right)g_{t,i}(\theta_{t,i}-\theta_{i}^{*})
\leq\sum_{t=1}^{T}-\frac{1-K_{t,i}}{K_{t,i}(1-\beta_{1}^{t})}\left(1-\prod_{k=1}^{t}\beta_{1}\right)g_{t,i}(\theta_{t,i}-\theta_{i}^{*})
\leq\sum_{t=1}^{T}\frac{1-K_{t,i}}{K_{t,i}(1-\beta_{1}^{t})}g_{t,i}(\theta_{t,i}-\theta_{i}^{*})

For the third term (3), we have:

\sum_{t=1}^{T}\frac{\alpha_{t}}{2K_{t,i}}(\widehat{g}_{t,i})^{2}=\sum_{t=1}^{T}\frac{\alpha_{t}}{2K_{t,i}}\left(\widehat{m}_{t,i}+K_{t,i}(g_{t,i}-\widehat{m}_{t,i})\right)^{2}    (59)
=\sum_{t=1}^{T}\frac{\alpha_{t}}{2K_{t,i}}\left((1-K_{t,i})\widehat{m}_{t,i}+K_{t,i}g_{t,i}\right)^{2}
\leq\sum_{t=1}^{T}\frac{\alpha_{t}}{2K_{t,i}}\left(2(1-K_{t,i})^{2}\widehat{m}_{t,i}^{2}+2K_{t,i}^{2}g_{t,i}^{2}\right)
=\sum_{t=1}^{T}\frac{\alpha_{t}}{K_{t,i}}\left((1-K_{t,i})^{2}\widehat{m}_{t,i}^{2}+K_{t,i}^{2}g_{t,i}^{2}\right)

Collecting all the terms, we have:

R(T)\leq\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{(\theta_{t,i}-\theta_{i}^{*})^{2}}{2\alpha_{t}K_{t,i}}+\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{1-K_{t,i}}{K_{t,i}(1-\beta_{1}^{t})}g_{t,i}(\theta_{t,i}-\theta_{i}^{*})+\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{\alpha_{t}}{K_{t,i}}\left((1-K_{t,i})^{2}\widehat{m}_{t,i}^{2}+K_{t,i}^{2}g_{t,i}^{2}\right)    (60)

Using Lemma B.5 and Lemma B.6, and noting that $\sum_{t=1}^{T}\widehat{s}_{t}>\sum_{t=1}^{T}(g_{t}-\widehat{m}_{t})^{2}$ implies $\frac{1}{T}\sum_{t=1}^{T}K_{t}>\frac{1}{2}$, together with the assumptions $\|\theta_{t}-\theta^{*}\|_{2}\leq D$ and $\|\theta_{m}-\theta_{n}\|_{\infty}\leq D_{\infty}$, we obtain the following regret bound:

R(T)\leq\frac{D^{2}}{\alpha}\sum_{i=1}^{d}\sqrt{T}+\frac{2D_{\infty}G_{\infty}}{1-\beta_{1}}\sum_{i=1}^{d}\|g_{1:T,i}\|_{2}+\frac{2\alpha G_{\infty}^{2}\left(1+(1-\beta_{1})^{2}\right)}{\sqrt{T}(1-\beta_{1})^{2}}\sum_{i=1}^{d}\|g_{1:T,i}\|_{2}^{2}    (61)

Appendix C Convergence analysis for non-convex stochastic optimization (Theorem 3.3 in main paper).

We have relaxed the assumption on the objective function, allowing it to be non-convex, and adjusted the criterion for convergence from the statistic $R(T)$ to $\mathbb{E}(T)$. Let us briefly review the assumptions and the criterion for convergence after relaxing the assumption:

Assumption C.1.
  • A1 Bounded variables (same as convex). $\|\theta-\theta^{*}\|_{2}\leq D,\ \forall\theta,\theta^{*}$, or for any dimension $i$ of the variable, $\|\theta_{i}-\theta_{i}^{*}\|_{2}\leq D_{i},\ \forall\theta_{i},\theta_{i}^{*}$.

  • A2 The noisy gradient is unbiased. For all $t$, the random variable $\zeta_{t}$ is defined as $\zeta_{t}=g_{t}-\nabla f(\theta_{t})$; $\zeta_{t}$ satisfies $\mathbb{E}[\zeta_{t}]=0$, $\mathbb{E}[\|\zeta_{t}\|_{2}^{2}]\leq\sigma^{2}$, and when $t_{1}\neq t_{2}$, $\zeta_{t_{1}}$ and $\zeta_{t_{2}}$ are statistically independent, i.e., $\zeta_{t_{1}}\perp\zeta_{t_{2}}$.

  • A3 Bounded gradient and noisy gradient. At step $t$, the algorithm can access a bounded noisy gradient, and the true gradient is also bounded, i.e., $\|\nabla f(\theta_{t})\|\leq G$, $\|g_{t}\|\leq G$, $\forall t>1$.

  • A4 The property of the function. The objective function $f(\theta)$ is a global loss function, defined as $f(\theta)=\lim_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}f_{t}(\theta)$. Although $f(\theta)$ is no longer a convex function, it must still be an $L$-smooth function, i.e., it satisfies (1) $f$ is differentiable, and $\nabla f$ exists everywhere in the domain; (2) there exists $L>0$ such that for any $\theta_{1}$ and $\theta_{2}$ in the domain, (first definition)

    f(\theta_{2})\leq f(\theta_{1})+\left\langle\nabla f(\theta_{1}),\theta_{2}-\theta_{1}\right\rangle+\frac{L}{2}\|\theta_{2}-\theta_{1}\|_{2}^{2}    (62)

    or (second definition)

    \|\nabla f(\theta_{1})-\nabla f(\theta_{2})\|_{2}\leq L\|\theta_{1}-\theta_{2}\|_{2}    (63)

    This condition is also known as $L$-Lipschitz continuity of the gradient.
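To make the two definitions concrete, the following check (an illustrative example, not part of the proof) verifies both (62) and (63) for a quadratic $f(\theta)=\frac{1}{2}\theta^{T}A\theta$, whose smoothness constant is the largest eigenvalue of $A$:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
A = A @ A.T                      # symmetric PSD Hessian
L = np.linalg.eigvalsh(A).max()  # smoothness constant

f = lambda th: 0.5 * th @ A @ th
grad = lambda th: A @ th

for _ in range(100):
    t1, t2 = rng.normal(size=5), rng.normal(size=5)
    # first definition, Eq. (62)
    assert f(t2) <= (f(t1) + grad(t1) @ (t2 - t1)
                     + 0.5 * L * np.sum((t2 - t1) ** 2) + 1e-9)
    # second definition, Eq. (63)
    assert np.linalg.norm(grad(t1) - grad(t2)) <= L * np.linalg.norm(t1 - t2) + 1e-9
print(f"both L-smoothness definitions hold with L = {L:.3f}")
```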

Definition C.2.

The criterion for convergence is the statistic $\mathbb{E}(T)$:

\mathbb{E}(T)=\min_{t=1,2,\ldots,T}\mathbb{E}_{t-1}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]    (64)

When $T\rightarrow\infty$, if the amortized value $\mathbb{E}(T)/T\rightarrow 0$, we consider such an algorithm to be convergent; in general, the slower $\mathbb{E}(T)$ grows with $T$, the faster the algorithm converges.

Definition C.3.

Define $\xi_{t}$ as

\xi_{t}=\begin{cases}\theta_{t}&t=1\\ \theta_{t}+\frac{\beta_{1}}{1-\beta_{1}}\left(\theta_{t}-\theta_{t-1}\right)&t\geq 2\end{cases}    (65)
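A useful property of this auxiliary sequence, exploited in the $t=1$ case of the proof below, is that $\xi_{2}-\xi_{1}=-\frac{\alpha_{1}}{1-\beta_{1}}g_{1}$ regardless of the gain $K_{1}$ (because $m_{0}=0$, the bias-corrected first moment at $t=1$ equals $g_{1}$). A small numerical check under these assumptions, with arbitrary illustrative values:

```python
import numpy as np

alpha, beta1, K1 = 0.1, 0.9, 0.37     # K1 arbitrary in [0, 1]
theta1 = np.array([1.0, -2.0])
g1 = np.array([0.5, 0.3])

m1 = (1 - beta1) * g1                 # EMA with m_0 = 0
m1_hat = m1 / (1 - beta1)             # bias-corrected: equals g1
g1_hat = m1_hat + K1 * (g1 - m1_hat)  # filtered gradient (= g1 here)
theta2 = theta1 - alpha * g1_hat

xi1 = theta1                          # Eq. (65), t = 1
xi2 = theta2 + beta1 / (1 - beta1) * (theta2 - theta1)
print(np.allclose(xi2 - xi1, -alpha / (1 - beta1) * g1))  # True for any K1
```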
Theorem C.4.

Consider a non-convex optimization problem. Suppose assumptions A1-A4 are satisfied, and let $\alpha_{t}=\alpha/\sqrt{t}$. For all $T\geq 1$, SGDF achieves the following guarantee:

\mathbb{E}(T)\leq\frac{C_{7}\alpha^{2}(\log T+1)+C_{8}}{2\alpha\sqrt{T}}    (66)

where $\mathbb{E}(T)=\min_{t=1,2,\ldots,T}\mathbb{E}_{t-1}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]$ denotes the minimum expected squared norm of the gradient, $\alpha$ is the learning rate at the first step, $C_{7}$ is a constant independent of $d$ and $T$, $C_{8}$ is a constant independent of $T$, and the expectation is taken w.r.t. all randomness corresponding to $g_{t}$.

Proof of Non-convex Convergence.

Since $f$ is an $L$-smooth function,

\|\nabla f(\xi_{t})-\nabla f(\theta_{t})\|_{2}^{2}\leq L^{2}\|\xi_{t}-\theta_{t}\|_{2}^{2}    (67)

Thus,

f(\xi_{t+1})-f(\xi_{t})    (68)
\leq\left\langle\nabla f(\xi_{t}),\xi_{t+1}-\xi_{t}\right\rangle+\frac{L}{2}\|\xi_{t+1}-\xi_{t}\|_{2}^{2}
=\left\langle\frac{1}{\sqrt{L}}\left(\nabla f(\xi_{t})-\nabla f(\theta_{t})\right),\sqrt{L}\left(\xi_{t+1}-\xi_{t}\right)\right\rangle+\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle+\frac{L}{2}\|\xi_{t+1}-\xi_{t}\|_{2}^{2}
\leq\frac{1}{2}\left(\frac{1}{L}\|\nabla f(\xi_{t})-\nabla f(\theta_{t})\|_{2}^{2}+L\|\xi_{t+1}-\xi_{t}\|_{2}^{2}\right)+\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle+\frac{L}{2}\|\xi_{t+1}-\xi_{t}\|_{2}^{2}
\leq\frac{1}{2L}\|\nabla f(\xi_{t})-\nabla f(\theta_{t})\|_{2}^{2}+L\|\xi_{t+1}-\xi_{t}\|_{2}^{2}+\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle
\leq\frac{1}{2L}L^{2}\|\xi_{t}-\theta_{t}\|_{2}^{2}+L\|\xi_{t+1}-\xi_{t}\|_{2}^{2}+\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle
=\frac{L}{2}\underset{(1)}{\underbrace{\|\xi_{t}-\theta_{t}\|_{2}^{2}}}+L\underset{(2)}{\underbrace{\|\xi_{t+1}-\xi_{t}\|_{2}^{2}}}+\underset{(3)}{\underbrace{\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle}}

Next, we will deal with the three terms (1), (2), and (3) separately.

For term (1)

When $t=1$, $\|\xi_{t}-\theta_{t}\|_{2}^{2}=0$.

When $t\geq 2$,

\|\xi_{t}-\theta_{t}\|_{2}^{2}=\left\|\frac{\beta_{1}}{1-\beta_{1}}\left(\theta_{t}-\theta_{t-1}\right)\right\|_{2}^{2}    (69)
=\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\alpha_{t-1}^{2}\|\widehat{g}_{t-1}\|_{2}^{2}
=\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\alpha_{t-1}^{2}\sum_{i=1}^{d}\left((1-K_{t-1,i})\widehat{m}_{t-1,i}+K_{t-1,i}g_{t-1,i}\right)^{2}
\overset{(a)}{\leq}\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\alpha_{t-1}^{2}\sum_{i=1}^{d}G_{i}^{2}

where (a) holds because for any $t$:

  • $|\widehat{m}_{t,i}|\leq\frac{1}{1-\beta_{1}^{t}}\sum_{s=1}^{t}(1-\beta_{1})\beta_{1}^{t-s}|g_{s,i}|\leq\frac{1}{1-\beta_{1}^{t}}\sum_{s=1}^{t}(1-\beta_{1})\beta_{1}^{t-s}G_{i}=G_{i}$.

  • $\|g_{t}\|_{2}\leq G,\ \forall t$, or for any dimension $i$ of the variable: $|g_{t,i}|\leq G_{i},\ \forall t$.

For term (2)

When $t=1$,

\xi_{t+1}-\xi_{t}=\theta_{t+1}+\frac{\beta_{1}}{1-\beta_{1}}\left(\theta_{t+1}-\theta_{t}\right)-\theta_{t}    (70)
=\frac{1}{1-\beta_{1}}\left(\theta_{t+1}-\theta_{t}\right)
=-\frac{\alpha_{t}}{1-\beta_{1}}\widehat{g}_{t}
=-\frac{\alpha_{t}}{1-\beta_{1}}\left(\frac{1-K_{t}}{1-\beta_{1}^{t}}m_{t}+K_{t}g_{t}\right)
=-\frac{\alpha_{t}}{1-\beta_{1}}\frac{1-K_{t}}{1-\beta_{1}^{t}}\left(\beta_{1}\underset{=0}{\underbrace{m_{t-1}}}+(1-\beta_{1})g_{t}\right)-\frac{\alpha_{t}}{1-\beta_{1}}K_{t}g_{t}
=-\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}g_{t}-\frac{\alpha_{t}K_{t}}{1-\beta_{1}}g_{t}
=-\frac{\alpha_{t}}{1-\beta_{1}}g_{t}

Thus,

\|\xi_{t+1}-\xi_{t}\|_{2}^{2}=\left\|-\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}}g_{t}-\frac{\alpha_{t}K_{t}}{1-\beta_{1}}g_{t}\right\|_{2}^{2}    (71)
=\left(\frac{\alpha_{t}}{1-\beta_{1}}\right)^{2}\|g_{t}\|_{2}^{2}
=\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\|g_{t}\|_{2}^{2}
=\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}g_{t,i}^{2}
\leq\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}

When $t\geq 2$,

\xi_{t+1}-\xi_{t}=\theta_{t+1}+\frac{\beta_{1}}{1-\beta_{1}}\left(\theta_{t+1}-\theta_{t}\right)-\theta_{t}-\frac{\beta_{1}}{1-\beta_{1}}\left(\theta_{t}-\theta_{t-1}\right)    (72)
=\frac{1}{1-\beta_{1}}\left(\theta_{t+1}-\theta_{t}\right)-\frac{\beta_{1}}{1-\beta_{1}}\left(\theta_{t}-\theta_{t-1}\right)

Due to

\theta_{t+1}-\theta_{t}=-\alpha_{t}\widehat{g}_{t}    (73)
=-\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}m_{t}-\alpha_{t}K_{t}g_{t}
=-\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}\left(\beta_{1}m_{t-1}+(1-\beta_{1})g_{t}\right)-\alpha_{t}K_{t}g_{t}

So,

\xi_{t+1}-\xi_{t}    (74)
=\frac{1}{1-\beta_{1}}\left(-\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}\left(\beta_{1}m_{t-1}+(1-\beta_{1})g_{t}\right)-\alpha_{t}K_{t}g_{t}\right)-\frac{\beta_{1}}{1-\beta_{1}}\left(-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}m_{t-1}-\alpha_{t-1}K_{t-1}g_{t-1}\right)
=-\frac{\beta_{1}}{1-\beta_{1}}m_{t-1}\odot\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right)-\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}g_{t}-\frac{\alpha_{t}K_{t}}{1-\beta_{1}}g_{t}+\frac{\beta_{1}}{1-\beta_{1}}\alpha_{t-1}K_{t-1}g_{t-1}
=-\frac{\beta_{1}}{1-\beta_{1}}m_{t-1}\odot\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right)-\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}+\frac{\alpha_{t}K_{t}}{1-\beta_{1}}\right)g_{t}+\frac{\beta_{1}\alpha_{t-1}K_{t-1}}{1-\beta_{1}}g_{t-1}

We have:

\|\xi_{t+1}-\xi_{t}\|_{2}^{2}\leq 2\left\|-\frac{\beta_{1}}{1-\beta_{1}}m_{t-1}\odot\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right)\right\|_{2}^{2}    (75)
+2\left\|-\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}+\frac{\alpha_{t}K_{t}}{1-\beta_{1}}\right)g_{t}\right\|_{2}^{2}+2\left\|\frac{\beta_{1}\alpha_{t-1}K_{t-1}}{1-\beta_{1}}g_{t-1}\right\|_{2}^{2}
\leq 2\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\|m_{t-1}\|_{\infty}^{2}\left\|\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right\|_{\infty}\cdot\left\|\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right\|_{1}
+2\left\|-\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}+\frac{\alpha_{t}K_{t}}{1-\beta_{1}}\right)g_{t}\right\|_{2}^{2}+2\left\|\frac{\beta_{1}\alpha_{t-1}K_{t-1}}{1-\beta_{1}}g_{t-1}\right\|_{2}^{2}

Because

  • $|m_{t-1,i}|=(1-\beta_{1}^{t-1})|\widehat{m}_{t-1,i}|\leq|\widehat{m}_{t-1,i}|\leq G_{i}$, hence $\|m_{t-1}\|_{\infty}^{2}\leq\left(\max_{i}G_{i}\right)^{2}$.

  • $\|g_{t}\|_{2}^{2}=\sum_{i=1}^{d}g_{t,i}^{2}\leq\sum_{i=1}^{d}G_{i}^{2}$.

  • $K_{t}\in[0,1]^{d}$, so $\|K_{t}\|_{\infty}\leq 1$ and $\|1-K_{t}\|_{\infty}\leq 1$, while $\|\mathbf{1}\|_{1}=\sum_{i=1}^{d}\mathbf{1}_{i}=d$.

\alpha_{t}/(1-\beta_{1}^{t})\geq 0,\quad\alpha_{t-1}/(1-\beta_{1}^{t-1})\geq 0    (76)
\alpha_{t}\leq\alpha_{t-1},\quad\frac{1}{1-\beta_{1}^{t}}\leq\frac{1}{1-\beta_{1}^{t-1}}
\Longrightarrow\frac{\alpha_{t}}{1-\beta_{1}^{t}}\leq\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}
\Longrightarrow\left|\frac{\alpha_{t}}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}\right|=\alpha_{t-1}/(1-\beta_{1}^{t-1})-\alpha_{t}/(1-\beta_{1}^{t})\leq\alpha_{t-1}/(1-\beta_{1}^{t-1})\leq\alpha_{1}/(1-\beta_{1})
\Longrightarrow\left\|\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right\|_{\infty}\leq\frac{\alpha_{1}}{1-\beta_{1}}

\left\|\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right\|_{1}\leq\sum_{i=1}^{d}\left(\alpha_{t-1}/(1-\beta_{1}^{t-1})-\alpha_{t}/(1-\beta_{1}^{t})\right)\mathbf{1}_{i}\leq d\left(\alpha_{t-1}/(1-\beta_{1}^{t-1})-\alpha_{t}/(1-\beta_{1}^{t})\right)    (77)

Therefore

\|\xi_{t+1}-\xi_{t}\|_{2}^{2}\leq 2\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\left(\max_{i}G_{i}\right)^{2}\frac{d\alpha_{1}}{1-\beta_{1}}\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)+4\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}    (78)

For term (3)

When $t=1$, referring to the case of $t=1$ in the previous subsection,

\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle=\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}}g_{t}\right\rangle    (79)
=\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}}\nabla f(\theta_{t})\right\rangle+\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}}\zeta_{t}\right\rangle

The last equality follows from the definition of $g_{t}$: $g_{t}=\nabla f(\theta_{t})+\zeta_{t}$. Let us consider the two terms separately:

\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}}\nabla f(\theta_{t})\right\rangle=-\frac{\alpha_{t}}{1-\beta_{1}}\|\nabla f(\theta_{t})\|_{2}^{2}    (80)

\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}}\zeta_{t}\right\rangle\leq\frac{\alpha_{t}}{1-\beta_{1}}\|\nabla f(\theta_{t})\|_{2}\|\zeta_{t}\|_{2}    (81)
=\frac{\alpha_{t}}{1-\beta_{1}}\|\nabla f(\theta_{t})\|_{2}\|g_{t}-\nabla f(\theta_{t})\|_{2}
\leq\frac{\alpha_{t}}{1-\beta_{1}}\cdot 2\sum_{i=1}^{d}G_{i}^{2}

Thus

\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle\leq-\frac{\alpha_{t}}{1-\beta_{1}}\|\nabla f(\theta_{t})\|_{2}^{2}+\frac{2\alpha_{t}}{1-\beta_{1}}\sum_{i=1}^{d}G_{i}^{2}    (82)

When $t\geq 2$,

\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle=\left\langle\nabla f(\theta_{t}),-\frac{\beta_{1}}{1-\beta_{1}}m_{t-1}\odot\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right)\right\rangle    (83)
+\left\langle\nabla f(\theta_{t}),-\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}+\frac{\alpha_{t}K_{t}}{1-\beta_{1}}\right)\nabla f(\theta_{t})\right\rangle+\left\langle\nabla f(\theta_{t}),-\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}+\frac{\alpha_{t}K_{t}}{1-\beta_{1}}\right)\zeta_{t}\right\rangle
+\left\langle\nabla f(\theta_{t-1}),\frac{\beta_{1}\alpha_{t-1}K_{t-1}}{1-\beta_{1}}\nabla f(\theta_{t-1})\right\rangle+\left\langle\nabla f(\theta_{t-1}),\frac{\beta_{1}\alpha_{t-1}K_{t-1}}{1-\beta_{1}}\zeta_{t-1}\right\rangle

We start with the first term after the equality:

\left\langle\nabla f(\theta_{t}),-\frac{\beta_{1}}{1-\beta_{1}}m_{t-1}\odot\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right)\right\rangle    (84)
\leq\frac{\beta_{1}}{1-\beta_{1}}\|\nabla f(\theta_{t})\|_{\infty}\|m_{t-1}\|_{\infty}\cdot\left\|\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right\|_{1}
\leq\frac{\beta_{1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}\sum_{i=1}^{d}\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)\mathbf{1}_{i}
\leq\frac{\beta_{1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)

The second and third terms after the equal sign:

\left\langle\nabla f(\theta_{t}),-\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}+\frac{\alpha_{t}K_{t}}{1-\beta_{1}}\right)\nabla f(\theta_{t})\right\rangle+\left\langle\nabla f(\theta_{t}),-\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}+\frac{\alpha_{t}K_{t}}{1-\beta_{1}}\right)\zeta_{t}\right\rangle    (85)
\leq-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\|\nabla f(\theta_{t})\|_{2}^{2}+\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\zeta_{t}\right\rangle

The fourth and fifth terms after the equal sign:

\left\langle\nabla f(\theta_{t-1}),\frac{\beta_{1}\alpha_{t-1}K_{t-1}}{1-\beta_{1}}\nabla f(\theta_{t-1})\right\rangle+\left\langle\nabla f(\theta_{t-1}),\frac{\beta_{1}\alpha_{t-1}K_{t-1}}{1-\beta_{1}}\zeta_{t-1}\right\rangle    (86)
\leq\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\|\nabla f(\theta_{t-1})\|_{\infty}^{2}\|\mathbf{1}\|_{1}+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\|\nabla f(\theta_{t-1})\|_{\infty}\|\zeta_{t-1}\|_{\infty}\|\mathbf{1}\|_{1}
\leq\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)\left(2\max_{i}G_{i}\right)d

Finally:

\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle    (87)
\leq\frac{\beta_{1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\|\nabla f(\theta_{t})\|_{2}^{2}
+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)\left(2\max_{i}G_{i}\right)d+\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\zeta_{t}\right\rangle

Summarizing the results

When $t=1$,

f(\xi_{t+1})-f(\xi_{t})\leq\frac{L}{2}\cdot 0+L\cdot\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}-\frac{\alpha_{t}}{1-\beta_{1}}\|\nabla f(\theta_{t})\|_{2}^{2}+\frac{2\alpha_{t}}{1-\beta_{1}}\sum_{i=1}^{d}G_{i}^{2}    (88)

Taking the expectation over the random distribution of $\zeta_{1},\zeta_{2},\ldots,\zeta_{t}$ on both sides of the inequality:

\mathbb{E}_{t}\left[f(\xi_{t+1})-f(\xi_{t})\right]\leq L\cdot\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}-\frac{\alpha_{t}}{1-\beta_{1}}\mathbb{E}_{t}\|\nabla f(\theta_{t})\|_{2}^{2}+\frac{2\alpha_{t}}{1-\beta_{1}}\sum_{i=1}^{d}G_{i}^{2}    (89)

When $t\geq 2$,

f(\xi_{t+1})-f(\xi_{t})    (90)
\leq\frac{L}{2}\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\alpha_{t-1}^{2}\sum_{i=1}^{d}G_{i}^{2}+L\cdot 2\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\left(\max_{i}G_{i}\right)^{2}\frac{d\alpha_{1}}{1-\beta_{1}}\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)
+L\cdot 4\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}+\frac{\beta_{1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)
-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\|\nabla f(\theta_{t})\|_{2}^{2}+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)\left(2\max_{i}G_{i}\right)d
+\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\zeta_{t}\right\rangle

Taking the expectation over the random distribution of $\zeta_{1},\zeta_{2},\ldots,\zeta_{t}$ on both sides of the inequality:

\mathbb{E}_{t}\left[f(\xi_{t+1})-f(\xi_{t})\right]    (91)
\leq\frac{L}{2}\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\alpha_{t-1}^{2}\sum_{i=1}^{d}G_{i}^{2}+L\cdot 2\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\left(\max_{i}G_{i}\right)^{2}\frac{d\alpha_{1}}{1-\beta_{1}}\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)
+L\cdot 4\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}+\frac{\beta_{1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)
-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\mathbb{E}_{t}\|\nabla f(\theta_{t})\|_{2}^{2}+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)\left(2\max_{i}G_{i}\right)d
+\mathbb{E}_{t}\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\zeta_{t}\right\rangle

Since $\theta_{t}$ is determined by $g_{1},\ldots,g_{t-1}$, it is statistically independent of $\zeta_{t}$:

\mathbb{E}_{t}\left[\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\zeta_{t}\right\rangle\right]=\mathbb{E}_{t}\left[\left\langle-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\nabla f(\theta_{t}),\zeta_{t}\right\rangle\right]    (92)
=\left\langle-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\mathbb{E}_{t}\left[\nabla f(\theta_{t})\right],\underset{=0}{\underbrace{\mathbb{E}_{t}\left[\zeta_{t}\right]}}\right\rangle=0

Summing up both sides of the inequality for $t=1,2,\ldots,T$:

• Left side of the inequality (can be reduced to maintain the inequality)

\sum_{t=1}^{T}\textrm{LHS of the inequality}=\sum_{t=1}^{T}\mathbb{E}_{t}\left[f(\xi_{t+1})-f(\xi_{t})\right]    (93)
=\sum_{t=1}^{T}\mathbb{E}_{t}\left[f(\xi_{t+1})\right]-\mathbb{E}_{t}\left[f(\xi_{t})\right]
=\sum_{t=1}^{T}\mathbb{E}_{t}\left[f(\xi_{t+1})\right]-\mathbb{E}_{t-1}\left[f(\xi_{t})\right]
=\mathbb{E}_{T}\left[f(\xi_{T+1})\right]-\mathbb{E}_{0}\left[f(\xi_{1})\right]

Since $f(\xi_{T+1})\geq\min_{\theta}f(\theta)=f(\theta^{*})$, $\xi_{1}=\theta_{1}$, and both are deterministic:

\sum_{t=1}^{T}\mathbb{E}_{t}\left[f(\xi_{t+1})-f(\xi_{t})\right]\geq\mathbb{E}_{T}\left[f(\theta^{*})\right]-\mathbb{E}_{0}\left[f(\theta_{1})\right]    (94)
=f(\theta^{*})-f(\theta_{1})

• The right side of the inequality (can be enlarged to keep the inequality valid)

We perform a series of substitutions to simplify the symbols:

When $t\geq 2$,

  1. $\frac{L}{2}\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\alpha_{t-1}^{2}\sum_{i=1}^{d}G_{i}^{2}\triangleq C_{1}\alpha_{t-1}^{2}$

  2. $L\cdot 2\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\left(\max_{i}G_{i}\right)^{2}\frac{d\alpha_{1}}{1-\beta_{1}}\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)\triangleq C_{2}\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)$

  3. $L\cdot 4\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}\triangleq C_{3}\alpha_{t}^{2}$

  4. $\frac{\beta_{1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)\triangleq C_{4}\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)$

  5. $-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]\leq-\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]$

  6. $\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)\left(2\max_{i}G_{i}\right)d\triangleq C_{5}\alpha_{t-1}$

When $t=1$,

  1. $L\cdot\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}\leq L\cdot 4\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}=C_{3}\alpha_{t}^{2}$

  2. $-\frac{\alpha_{t}}{1-\beta_{1}}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]\leq-\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]$

  3. $\frac{2\alpha_{t}}{1-\beta_{1}}\sum_{i=1}^{d}G_{i}^{2}\triangleq C_{6}\alpha_{t}$

After substitution,

\sum_{t=1}^{T}\textrm{RHS of the inequality}\leq\sum_{t=2}^{T}C_{1}\alpha_{t-1}^{2}+\sum_{t=1}^{T}C_{3}\alpha_{t}^{2}-\sum_{t=1}^{T}\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]    (95)
+\sum_{t=2}^{T}\left(C_{2}+C_{4}\right)\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)+\sum_{t=1}^{T}C_{5}\alpha_{t-1}+\sum_{t=1}^{T}C_{6}\alpha_{t}
=\sum_{t=2}^{T}C_{1}\alpha_{t-1}^{2}+\sum_{t=1}^{T}C_{3}\alpha_{t}^{2}-\sum_{t=1}^{T}\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]+\sum_{t=1}^{T}C_{5}\alpha_{t-1}+\sum_{t=1}^{T}C_{6}\alpha_{t}
+\left(C_{2}+C_{4}\right)\sum_{t=2}^{T}\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)
=\sum_{t=2}^{T}C_{1}\alpha_{t-1}^{2}+\sum_{t=1}^{T}C_{3}\alpha_{t}^{2}-\sum_{t=1}^{T}\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]+\sum_{t=1}^{T}C_{5}\alpha_{t-1}+\sum_{t=1}^{T}C_{6}\alpha_{t}
+\left(C_{2}+C_{4}\right)\left(\frac{\alpha_{1}}{1-\beta_{1}}-\frac{\alpha_{T}}{1-\beta_{1}^{T}}\right)
\leq\left(C_{1}+C_{3}+C_{5}+C_{6}\right)\sum_{t=1}^{T}\alpha_{t}^{2}-\sum_{t=1}^{T}\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]+\left(C_{2}+C_{4}\right)\frac{\alpha_{1}}{1-\beta_{1}}

Combining the results of scaling on both sides of the inequality:

f(\theta^{*})-f(\theta_{1})\leq\left(C_{1}+C_{3}+C_{5}+C_{6}\right)\sum_{t=1}^{T}\alpha_{t}^{2}-\sum_{t=1}^{T}\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]+\left(C_{2}+C_{4}\right)\frac{\alpha_{1}}{1-\beta_{1}}    (96)
\Longrightarrow\sum_{t=1}^{T}\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]\leq\left(C_{1}+C_{3}+C_{5}+C_{6}\right)\sum_{t=1}^{T}\alpha_{t}^{2}+f(\theta_{1})-f(\theta^{*})+\left(C_{2}+C_{4}\right)\frac{\alpha_{1}}{1-\beta_{1}}

Since $\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]=\mathbb{E}_{t-1}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]$,

\sum_{t=1}^{T}\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]=\sum_{t=1}^{T}\alpha_{t}\mathbb{E}_{t-1}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]    (97)
\geq\sum_{t=1}^{T}\alpha_{t}\min_{t=1,2,\ldots,T}\mathbb{E}_{t-1}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]
=\min_{t=1,2,\ldots,T}\mathbb{E}_{t-1}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]\sum_{t=1}^{T}\alpha_{t}
=\mathbb{E}(T)\cdot\sum_{t=1}^{T}\alpha_{t}

Then let $C_{1}+C_{3}+C_{5}+C_{6}\triangleq C_{7}$ and $\underset{\geq 0}{\underbrace{f(\theta_{1})-f(\theta^{*})}}+\left(C_{2}+C_{4}\right)\frac{\alpha_{1}}{1-\beta_{1}}\triangleq C_{8}$; therefore

\mathbb{E}(T)\cdot\sum_{t=1}^{T}\alpha_{t}\leq C_{7}\sum_{t=1}^{T}\alpha_{t}^{2}+C_{8}    (98)
\Longrightarrow\mathbb{E}(T)\leq\frac{C_{7}\sum_{t=1}^{T}\alpha_{t}^{2}+C_{8}}{\sum_{t=1}^{T}\alpha_{t}}

Since $\alpha_{t}=\alpha/\sqrt{t}$ and $\sum_{t=1}^{T}\frac{1}{t}\leq 1+\log T$, we have:

\mathbb{E}(T)\leq\frac{C_{7}\alpha^{2}(\log T+1)+C_{8}}{2\alpha\sqrt{T}}    (99)

Appendix D Detailed Experimental Supplement

We performed extensive comparisons with other optimizers, including SGD [41], Adam [32], RAdam [37] and AdamW [38]. The experiments include image classification on the CIFAR dataset [33] with VGG [49], ResNet [23] and DenseNet [26], and image recognition with ResNet on ImageNet [14].

D.1 Image classification with CNNs on CIFAR

For all experiments, the model is trained for 200 epochs with a batch size of 128, and the learning rate is multiplied by 0.1 at epoch 150. We performed an extensive hyperparameter search as described in the main paper; detailed experimental parameters are listed in Tab. 5. We report both training and test accuracy in Fig. 6 and Fig. 7. SGDF not only achieves the highest test accuracy, but also a smaller gap between training and test accuracy compared with other optimizers.

Table 5: Hyperparameters used for CIFAR-10 and CIFAR-100 datasets.
Optimizer Learning Rate $\beta_{1}$ $\beta_{2}$ Epochs Schedule Weight Decay Batch Size $\varepsilon$
SGDF 0.3 0.9 0.999 200 StepLR 0.0005 128 1e-8
SGD 0.1 0.9 - 200 StepLR 0.0005 128 -
Adam 0.001 0.9 0.999 200 StepLR 0.0005 128 1e-8
RAdam 0.001 0.9 0.999 200 StepLR 0.0005 128 1e-8
AdamW 0.001 0.9 0.999 200 StepLR 0.01 128 1e-8
MSVAG 0.1 0.9 0.999 200 StepLR 0.0005 128 1e-8
AdaBound 0.001 0.9 0.999 200 StepLR 0.0005 128 -
Sophia 0.0001 0.965 0.99 200 StepLR 0.1 128 -
Lion 0.00002 0.9 0.99 200 StepLR 0.1 128 -

Note: StepLR indicates a learning rate decay by a factor of 0.1 at the 150th epoch.
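For reproducibility, the Epochs/Schedule columns of Tab. 5 correspond to the following PyTorch-style sketch (shown with the SGD row; the SGDF class from the released repository is swapped in the same way with lr=0.3, and the linear model is only a stand-in for the CNN backbones):

```python
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)  # stand-in for VGG/ResNet/DenseNet

# SGD row of Tab. 5.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# StepLR: multiply the learning rate by 0.1 at epoch 150.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[150], gamma=0.1)

for epoch in range(200):
    # ... one training epoch over CIFAR with batch size 128 ...
    scheduler.step()
```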

D.2 Image Classification on ImageNet

We experimented with a ResNet18 on the ImageNet classification task. For SGD, we set an initial learning rate of 0.1, multiplied by 0.1 every 30 epochs; for SGDF, we use an initial learning rate of 0.5 and set $\beta_{1}=0.5$. Weight decay is set to $10^{-4}$ for both cases to match the settings in [37]. Detailed experimental parameters are listed in Tab. 6. As shown in Fig. 8, SGDF achieves an accuracy very close to SGD.

Table 6: Hyperparameters used for ImageNet.
Optimizer Learning Rate $\beta_{1}$ $\beta_{2}$ Epochs Schedule Weight Decay Batch Size $\varepsilon$
SGDF 0.5 0.5 0.999 100 StepLR 0.0005 256 1e-8
SGD 0.1 - - 100 StepLR 0.0005 256 -
SGDF 0.5 0.5 0.999 90 Cosine 0.0005 256 1e-8
SGD 0.1 - - 90 Cosine 0.0005 256 -

Note: StepLR indicates a learning rate decay by a factor of 0.1 every 30 epochs.

D.3 Object Detection on PASCAL VOC

We show results for object detection on PASCAL VOC [18] with a Faster-RCNN model [45]. Detailed experimental parameters are listed in Tab. 7. The results are reported in Tab. 2, and detection examples are shown in Fig. 9. These results illustrate that our method remains efficient in object detection tasks.

Table 7: Hyperparameters for object detection on PASCAL VOC using Faster-RCNN+FPN with different optimizers.
Optimizer Learning Rate $\beta_{1}$ $\beta_{2}$ Epochs Schedule Weight Decay Batch Size $\varepsilon$
SGDF 0.01 0.9 0.999 4 StepLR 0.0001 2 1e-8
SGD 0.01 0.9 - 4 StepLR 0.0001 2 -
Adam 0.0001 0.9 0.999 4 StepLR 0.0001 2 1e-8
AdamW 0.0001 0.9 0.999 4 StepLR 0.0001 2 1e-8
RAdam 0.0001 0.9 0.999 4 StepLR 0.0001 2 1e-8

Note: StepLR schedule indicates a learning rate decay by a factor of 0.1 at the last epoch.

D.4 Image Generation.

We experiment with one of the most widely used models, the Wasserstein-GAN with gradient penalty (WGAN-GP) [47], using a small model with a vanilla CNN generator. Using popular optimizers [40, 61, 2, 4], we train the model for 100 epochs, generate 64,000 fake images from noise, and compute the Fréchet Inception Distance (FID) [24] between the fake images and the real dataset (60,000 real images). The FID score captures both the quality and diversity of generated images and is widely used to assess generative models (lower FID is better). For SGD and MSVAG, we report results from [66]. We perform 5 runs of each experiment and report the results in Fig. 5. Detailed experimental parameters are listed in Tab. 8.

Table 8: Hyperparameters for Image Generation Tasks.
Optimizer Learning Rate $\beta_{1}$ $\beta_{2}$ Epochs Batch Size $\varepsilon$
SGDF 0.01 0.5 0.999 100 64 1e-8
Adam 0.0002 0.5 0.999 100 64 1e-8
AdamW 0.0002 0.5 0.999 100 64 1e-8
Fromage 0.01 0.5 0.999 100 64 1e-8
RMSProp 0.0002 0.5 0.999 100 64 1e-8
AdaBound 0.0002 0.5 0.999 100 64 1e-8
Yogi 0.01 0.9 0.999 100 64 1e-8
RAdam 0.0002 0.5 0.999 100 64 1e-8
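For reference, FID fits a Gaussian to the Inception features of each image set and evaluates $\|\mu_{r}-\mu_{g}\|_{2}^{2}+\mathrm{Tr}\left(\Sigma_{r}+\Sigma_{g}-2(\Sigma_{r}\Sigma_{g})^{1/2}\right)$. A minimal sketch of that final computation from precomputed feature matrices (the Inception feature extraction itself is assumed to happen upstream):

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """FID between two feature matrices of shape (n_samples, dim), e.g.
    Inception pool3 activations; feature extraction happens upstream."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2)
                 + np.trace(cov_r + cov_f - 2.0 * covmean))
```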

D.5 Extended Experiment.

The study evaluates the vanilla Adam optimization algorithm and its enhancement with a Wiener filter on the CIFAR-100 dataset. Fig. 9 contains detailed test accuracy curves for both methods across different models. The results indicate that adaptive learning rate algorithms perform better when supplemented with the proposed first-moment filter estimation, suggesting that integrating a Wiener filter into Adam-style optimizers can improve performance.

D.6 Optimizer Test.

We derived a correction factor $(1-\beta_{1})(1-\beta_{1}^{2t})/(1+\beta_{1})$ from the geometric progression to correct the variance of the first-moment estimate. We therefore test SGDF with and without this correction using VGG, ResNet, and DenseNet on CIFAR, and report test accuracy in Fig. 11. SGDF with the correction exceeds the uncorrected variant.
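The factor is cheap to tabulate; with $\beta_{1}=0.9$ (an illustrative value) it grows from $0.01$ at $t=1$ toward its limit $(1-\beta_{1})/(1+\beta_{1})\approx 0.0526$:

```python
import numpy as np

beta1 = 0.9  # illustrative value
t = np.arange(1, 51)
corr = (1 - beta1) * (1 - beta1 ** (2 * t)) / (1 + beta1)
print(corr[0], corr[-1])  # 0.0100 at t=1, approaching (1-beta1)/(1+beta1)
```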

We built a simple neural network to test the convergence speed of SGDF compared to SGDM and vanilla SGD. We trained for 5 epochs and recorded the loss every 30 iterations. As shown in Fig. 12, the convergence rate of the filter method surpasses that of the momentum method, which in turn exceeds that of vanilla SGD.

Figure 18: Training and test accuracy (top-1) of ResNet18 on ImageNet.
Figure 23: SGDF with or without the correction factor. The curve shows the accuracy of the test.
Figure 24: Comparison of convergence rates.