
Signal Processing Meets SGD: From Momentum to Filter

1Zhipeng Yao,  1Guiyuan Fu,  1Ying Li,  1,2Yu Zhang,  1Dazhou Li*,  3Rui Yu
1Shenyang University of Chemical Technology,  2University of Macau,  3University of Louisville
[email protected],  [email protected],  [email protected],
{zhangy;lidazhou}@syuct.edu.cn,  [email protected]
*Corresponding author.
Abstract

In deep learning, stochastic gradient descent (SGD) and its momentum-based variants are widely used for optimization, but they typically suffer from slow convergence. Conversely, existing adaptive learning rate optimizers speed up convergence but often compromise generalization. To resolve this issue, we propose a novel optimization method designed to accelerate SGD’s convergence without sacrificing generalization. Our approach reduces the variance of the historical gradient, improves first-order moment estimation of SGD by applying Wiener filter theory, and introduces a time-varying adaptive gain. Empirical results demonstrate that SGDF (SGD with Filter) effectively balances convergence and generalization compared to state-of-the-art optimizers. The code is available at https://github.com/LilYau350/SGDF-Optimizer.

1 Introduction

During the training process, the optimizer serves as a critical component of the model. It refines and adjusts model parameters to ensure that the model can recognize underlying data patterns. Beyond updating weights, the optimizer’s role includes strategically navigating complex loss landscapes [16] to locate regions that offer the best generalization [30]. The chosen optimizer significantly impacts training efficiency, influencing model convergence speed, generalization performance, and resilience to data distribution shifts [3]. A poor optimizer choice can result in suboptimal convergence or failure to converge, whereas a suitable one can accelerate learning and ensure robust performance [46]. Thus, continually refining optimization algorithms is essential for enhancing the capabilities of machine learning models.

Meanwhile, Stochastic Gradient Descent (SGD) [41] and its variants, such as momentum-based SGD [50], Adam [32], and RMSprop [25], have secured prominent roles. Despite their substantial contributions to deep learning, these methods have inherent drawbacks. They primarily exploit first-order moment estimation and frequently overlook the pivotal influence of historical gradients on current parameter adjustments. Consequently, they can result in training instability or poor generalization [7], especially with high-dimensional, non-convex loss functions common in deep learning [21]. Such characteristics render adaptive learning rate methods prone to entrapment in sharp local minima, which can significantly impair the model’s generalization capability [63]. Various Adam variants [8, 37, 40, 66] aim to improve optimization and enhance generalization performance by adjusting the adaptive learning rate. Although these variants have achieved some success, they still have not completely resolved the issue of generalization loss.

To achieve an effective trade-off between convergence speed and generalization capability [20], this paper introduces a novel optimization method called SGDF (SGD with Filter). SGDF incorporates filter theory from signal processing to enhance first-moment estimation, balancing historical and current gradient estimates. Through its adaptive weighting mechanism, SGDF precisely adjusts gradient estimates throughout the training process, thereby accelerating model convergence while preserving generalization ability.

Initial evaluations demonstrate that SGDF surpasses many traditional adaptive learning rate and variance reduction optimization methods across various benchmark datasets, particularly in terms of accelerating convergence and maintaining generalization. This indicates that SGDF successfully navigates the trade-off between speeding up convergence and preserving generalization capability. By achieving this balance, SGDF offers a more efficient and robust optimization option for training deep learning models.

The main contributions of this paper can be summarized as follows:

  • We introduce SGDF, an optimizer that integrates historical and current gradient data to compute the gradient’s variance estimate, addressing the slow convergence of the vanilla SGD method.

  • We theoretically analyze the benefits of SGDF in terms of generalization (Section 3.3) and convergence (Section 3.4), and empirically verify the effectiveness of SGDF (Section 4).

  • We employ first-moment filter estimation in SGDF, which can also significantly enhance the generalization capacity of adaptive optimization algorithms (e.g., Adam) (Section 4.4), surpassing traditional momentum strategies.

2 Related Works

Convergence: Variance Reduction to Adaptive Methods. In the early stages of deep learning development, optimization algorithms focused on reducing the variance of gradient estimation [2, 13, 27, 48] to achieve a linear convergence rate. Subsequently, the emergence of adaptive learning rate methods [15, 17, 62] marked a significant shift in optimization algorithms. While SGD and its variants have advanced many applications, they come with inherent limitations. They often oscillate or become trapped in sharp minima [54]. Although these methods can lead models to achieve low training loss, such minima frequently fail to generalize effectively to new data [22, 56]. This issue is exacerbated in the high-dimensional, non-convex landscapes characteristic of deep learning settings [11, 39].

Generalization: Sharp and Flat Solutions. The generalization ability of a deep learning model depends heavily on the nature of the solutions found during optimization. Keskar et al. [31] demonstrated experimentally that flat minima generalize better than sharp minima. SAM [19] showed theoretically that the generalization error of flat minima is lower than that of sharp minima on test data, and proposed optimizing zeroth-order flatness. GAM [65] improves on SAM by simultaneously optimizing the prediction error and the maximum gradient norm in a neighborhood during training. Adaptive Inertia [55] aims to balance exploration and exploitation by adjusting the inertia of each parameter update; this adaptive inertia mechanism helps the model avoid falling into sharp local minima.

Second-Order and Filter Methods. The recent integration of second-order information into optimization problems has gained popularity [36, 60]. Methods such as Kalman Filter [28] combined with Gradient Descent incorporate second-order curvature information [42, 51]. The KOALA algorithm [12] posits that the optimizer must adapt to the loss landscape. It adjusts learning rates based on both gradient magnitudes and the curvature of the loss landscape. However, it should be noted that the Kalman filtering framework introduces more complex parameter settings, which can hinder understanding and application.

3 Method

We derive SGDF and show that it shares the same underlying principles as the Wiener Filter [53]. SGDF minimizes the mean squared error [29], which effectively reduces the variance of the estimated gradient distribution. This design allows the algorithm to trade off convergence speed against generalization. Fig. 1 illustrates how SGDF acts intuitively: by estimating the optimal gradient when the noise of a particular stochastic gradient is very large, SGDF keeps the descent direction accurate, whereas plain stochastic gradient descent is perturbed by the noise, causing its descent path to deviate from the optimal point.

Input: $\{\alpha_t\}_{t=1}^{T}$: step size, $\{\beta_1, \beta_2\}$: attenuation coefficients, $\theta_0$: initial parameter, $f(\theta)$: stochastic objective function
Output: $\theta_T$: resulting parameters.
Init: $m_0 \leftarrow 0$, $s_0 \leftarrow 0$
for $t = 1$ to $T$ do
       $g_t \leftarrow \nabla f_t(\theta_{t-1})$ (calculate gradient of the stochastic objective at timestep $t$)
       $m_t \leftarrow \beta_1 m_{t-1} + (1-\beta_1) g_t$ (calculate exponential moving average)
       $s_t \leftarrow \beta_2 s_{t-1} + (1-\beta_2)(g_t - m_t)^2$ (calculate exponential moving variance)
       $\widehat{m}_t \leftarrow \dfrac{m_t}{1-\beta_1^t}$,  $\widehat{s}_t \leftarrow \dfrac{(1-\beta_1)(1-\beta_1^{2t})\, s_t}{(1+\beta_1)(1-\beta_2^t)}$ (bias correction)
       $K_t \leftarrow \dfrac{\widehat{s}_t}{\widehat{s}_t + (g_t - \widehat{m}_t)^2}$ (calculate estimate gain)
       $\widehat{g}_t \leftarrow \widehat{m}_t + K_t(g_t - \widehat{m}_t)$ (update gradient estimation)
       $\theta_t \leftarrow \theta_{t-1} - \alpha_t \widehat{g}_t$ (update parameters)
return $\theta_T$
Algorithm 1 SGDF, Wiener Filter Estimate Gradient. All operations are element-wise.
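To make Algorithm 1 concrete, here is a minimal NumPy sketch of a single SGDF step. This is our own illustrative re-implementation, not the code from the linked repository; the small `eps` term is an added safeguard for numerical stability, consistent with the $\epsilon = 10^{-8}$ setting used in the experiments.

```python
import numpy as np

def sgdf_step(theta, g, m, s, t, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One SGDF step following Algorithm 1; all operations element-wise, t >= 1."""
    m = beta1 * m + (1 - beta1) * g                       # exponential moving average
    s = beta2 * s + (1 - beta2) * (g - m) ** 2            # exponential moving variance
    m_hat = m / (1 - beta1 ** t)                          # bias-corrected first moment
    s_hat = ((1 - beta1) * (1 - beta1 ** (2 * t)) * s
             / ((1 + beta1) * (1 - beta2 ** t)))          # bias-corrected variance
    K = s_hat / (s_hat + (g - m_hat) ** 2 + eps)          # estimate gain K_t in [0, 1]
    g_hat = m_hat + K * (g - m_hat)                       # Wiener-filtered gradient
    theta = theta - alpha * g_hat                         # parameter update
    return theta, m, s
```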

3.1 SGDF General Introduction

Figure 1: Schematic of the SGDF intuition.

In the SGDF algorithm, $s_t$ serves as a key indicator: it is the exponential moving average of the squared difference between the current gradient $g_t$ and its momentum $m_t$, acting as an indicator of gradient variation whose weighting is controlled by $\beta_2$. Zhuang et al. [66] first proposed the calculation of $s_t$, which is utilized for estimating the fluctuation variance of the stochastic gradient. We derived a correction factor $(1-\beta_1)(1-\beta_1^{2t})/(1+\beta_1)$, under the assumption that $m_t$ and $g_t$ are independently and identically distributed (i.i.d.), so that $s_t$ accurately estimates the variance of $m_t$. Fig. 11 compares performance with and without the correction factor, showing superior results with correction. For the derivation of the correction factor, please refer to Appendix A.2.

At each time step $t$, $g_t$ represents the stochastic gradient of our objective function, while $m_t$ approximates the historical trend of the gradient through an exponential moving average. The difference $g_t - m_t$ highlights the gradient’s deviation from its historical pattern, reflecting the inherent noise or uncertainty in the instantaneous gradient estimate, which can be expressed as $p(g_t|\mathcal{D}) \sim \mathcal{N}(g_t; m_t, \sigma_t^2)$ [37].

SGDF utilizes the gain $K_t$, whose components in each dimension lie between 0 and 1, to balance the current observed gradient $g_t$ and the past corrected gradient $\widehat{m}_t$, thus optimizing the gradient estimate. This balance plays a crucial role in noisy or complex optimization scenarios, helping to mitigate noise and achieve a stable gradient direction, faster convergence, and enhanced performance. The computation of $K_t$, based on $s_t$ and $g_t - m_t$, minimizes the expected variance of the corrected gradient $\widehat{g}_t$, yielding the optimal linear estimate in noisy conditions. For the method derivation, please refer to Appendix A.1.

3.2 Method Principle of Fusion of Gaussian Distributions for Gradient Estimate

SGDF refines the gradient estimate through the Wiener Filter, enhancing the stability and effectiveness of the optimization process and facilitating faster convergence. By fusing two Gaussian distributions, SGDF significantly reduces the variance of gradient estimates, which is beneficial for solving complex stochastic optimization problems. In this section, we delve into how SGDF achieves this reduction of gradient estimate variance.

The properties of SGDF ensure that the estimated gradient is a linear combination of the current noisy gradient observation $g_t$ and the first-order moment estimate $\widehat{m}_t$. These two components are assumed to be Gaussian, with $g_i \sim \mathcal{N}(\mu, \sigma^2)$. Hence, their fusion by the filter naturally ensures that the fused estimate $\widehat{g}_t$ is also Gaussian.

Consider two Gaussian distributions for the momentum term $\widehat{m}_t$ and the current gradient $g_t$:

  • The momentum term $\widehat{m}_t$ is normally distributed with mean $\mu_m$ and variance $\sigma_m^2$, denoted as $\widehat{m}_t \sim \mathcal{N}(\mu_m, \sigma_m^2)$.

  • The current gradient $g_t$ is normally distributed with mean $\mu_g$ and variance $\sigma_g^2$, denoted as $g_t \sim \mathcal{N}(\mu_g, \sigma_g^2)$.

The product of their probability density functions is given by:

$$N(\widehat{m}_t;\mu_m,\sigma_m)\cdot N(g_t;\mu_g,\sigma_g) = \frac{1}{2\pi\sigma_m\sigma_g}\exp\left(-\frac{(\widehat{m}_t-\mu_m)^2}{2\sigma_m^2}-\frac{(g_t-\mu_g)^2}{2\sigma_g^2}\right) \quad (1)$$

Through coefficient matching in the exponential terms, we obtain the new mean and variance:

$$\mu' = \frac{\sigma_g^2\mu_m+\sigma_m^2\mu_g}{\sigma_m^2+\sigma_g^2}, \qquad \sigma'^2 = \frac{\sigma_m^2\sigma_g^2}{\sigma_m^2+\sigma_g^2} \quad (2)$$

The new mean $\mu'$ is a weighted average of the two means $\mu_m$ and $\mu_g$, with weights inversely proportional to their variances. This places $\mu'$ between $\mu_m$ and $\mu_g$, closer to the mean with the smaller variance. The new standard deviation $\sigma'$ is smaller than either of the original standard deviations $\sigma_m$ and $\sigma_g$, reflecting the reduced uncertainty in the estimate due to the combination of information from both sources. This is a direct consequence of the Wiener Filter’s optimality in the mean-square error sense. The proof is provided in Appendix A.3.
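As a quick illustrative check of Eq. (2) with arbitrary values (not ones from the paper), the following sketch fuses two Gaussians and confirms that the fused mean is pulled toward the lower-variance source while the fused variance drops below both:

```python
def fuse_gaussians(mu_m, var_m, mu_g, var_g):
    """Product-of-Gaussians fusion, Eq. (2)."""
    mu = (var_g * mu_m + var_m * mu_g) / (var_m + var_g)
    var = var_m * var_g / (var_m + var_g)
    return mu, var

mu, var = fuse_gaussians(mu_m=1.0, var_m=0.5, mu_g=2.0, var_g=2.0)
print(mu, var)  # 1.2 0.4 -> mean closer to the lower-variance source; variance below both
```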

3.3 Generalization Analysis of the Variance Lower Bound for SGDF

In Section 3.2, we explained the principle of noise reduction by SGDF. In this section, we illustrate the advantages of SGDF over other variance reduction techniques, ensuring better generalization.

In prior variance reduction techniques [13, 27, 48], the variance is gradually reduced at a rate of $\xi^{t-1}$, $\xi \in (0,1)$. This rapidly decreases the initial variance and facilitates a linear convergence rate, as noted in [5]. However, it poses a potential problem: the variance may approach zero, depriving the optimization process of necessary stochastic exploration. The Wiener Filter, by contrast, obeys statistical estimation theory, in which the Cramér-Rao lower bound (CRLB) [43] imposes a lower bound on the variance of the estimator. Accordingly, we construct a dynamical system modeled by the Fokker-Planck equation to demonstrate the potential benefits of a lower-bounded variance for optimization.

Theorem 3.1.

Consider a system described by the Fokker-Planck equation, evolving the probability density function $P$ in parameter space. Given a loss function $f(\theta)$, let the noise variance or diffusion matrix $D_{ij}$ satisfy $D_i \geq C > 0$, where $C$ is a positive lower-bound constant, known as the Cramér-Rao lower bound. In the steady-state condition, i.e., $\frac{\partial P}{\partial t} = 0$, the analytical form of the probability density $P$ can be obtained by solving the corresponding Fokker-Planck equation. This solution reveals the probability distribution of the system at steady state, described as follows:

$$P(\theta) = \frac{1}{Z}\exp\left(-\sum_{i=1}^{n}\frac{f(\theta)}{D_i}\right), \quad (3)$$

Here, $Z$ is a normalization constant ensuring the total probability integrates to one, assuming $D_{ij} = D_i\delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta.

The existence of a variance lower bound critically enhances the algorithm’s exploration capabilities, especially in regions of the loss landscape where gradients are minimal. By preventing the probability density function from becoming unbounded, it ensures continuous exploration and increases the probability of converging to flat minima associated with better generalization properties [57]. The proof of Theorem 3.1 is provided in Appendix A.4.
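As a 1-D illustration of Theorem 3.1 (with an assumed quadratic loss, not one from the paper), the sketch below evaluates the steady-state density of Eq. (3) for several diffusion levels; a larger lower bound on $D$ keeps the density spread out and exploratory, while $D \to 0$ concentrates all mass at the minimum:

```python
import numpy as np

theta = np.linspace(-3.0, 3.0, 2001)
f = 0.5 * theta ** 2                        # assumed quadratic loss f(theta) = theta^2 / 2

for D in (1.0, 0.1, 0.01):                  # diffusion levels with decreasing lower bound
    p = np.exp(-f / D)                      # unnormalized steady-state density, Eq. (3)
    p /= p.sum() * (theta[1] - theta[0])    # Riemann-sum normalization stands in for Z
    print(D, p.max())                       # smaller D -> density piles up at the minimum
```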

3.4 Convergence Analysis in Convex and Non-convex Optimization

Finally, we provide the convergence property of SGDF as shown in Theorem 3.2 and Theorem 3.3. The assumptions are common and standard when analyzing the convergence of convex and non-convex functions via SGD-based methods [9, 32, 44]. Proofs for convergence in convex and non-convex cases are provided in Appendix B and Appendix C, respectively. In the convergence analysis, the assumptions are relaxed and the upper bound is reduced due to the estimation gain introduced by SGDF, promoting faster convergence.

Theorem 3.2.

(Convergence in convex optimization) Assume that the function $f_t$ has bounded gradients, $\|\nabla f_t(\theta)\|_2 \leq G$, $\|\nabla f_t(\theta)\|_\infty \leq G_\infty$ for all $\theta \in \mathbb{R}^d$, and that the distance between any $\theta_t$ generated by SGDF is bounded, $\|\theta_n - \theta_m\|_2 \leq D$, $\|\theta_m - \theta_n\|_\infty \leq D_\infty$ for any $m, n \in \{1, \ldots, T\}$, and $\beta_1, \beta_2 \in [0, 1)$. Let $\alpha_t = \alpha/\sqrt{t}$. SGDF achieves the following guarantee for all $T \geq 1$:

$$R(T) \leq \frac{D^2}{\alpha}\sum_{i=1}^{d}\sqrt{T} + \frac{2 D_\infty G_\infty}{1-\beta_1}\sum_{i=1}^{d}\left\|g_{1:T,i}\right\|_2 + \frac{2\alpha G_\infty^2\left(1+(1-\beta_1)^2\right)}{\sqrt{T}\,(1-\beta_1)^2}\sum_{i=1}^{d}\left\|g_{1:T,i}\right\|_2^2 \quad (4)$$

where $R(T) = \sum_{t=1}^{T} f_t(\theta_t) - f_t(\theta^*)$ denotes the cumulative performance gap between the generated solution and the optimal solution.

For the convex case, Theorem 3.2 implies that the regret of SGDF is upper bounded by $O(\sqrt{T})$. In Adam-type optimizers, the convex analysis crucially requires decaying $\beta_{1,t}$ towards zero [32, 66]. We relax this assumption by introducing a time-varying gain $K_t$, which adapts with the variance. Moreover, $K_t$ converges with the variance at the end of training, which improves convergence [50].

Theorem 3.3.

(Convergence for non-convex stochastic optimization) Under the assumptions:

  • A1 Bounded variables (same as convex): $\|\theta - \theta^*\|_2 \leq D$ for all $\theta, \theta^*$, or, for any dimension $i$ of the variable, $\|\theta_i - \theta_i^*\|_2 \leq D_i$ for all $\theta_i, \theta_i^*$.

  • A2 The noisy gradient is unbiased: for all $t$, the random variable $\zeta_t = g_t - \nabla f(\theta_t)$ satisfies $\mathbb{E}[\zeta_t] = 0$ and $\mathbb{E}\left[\|\zeta_t\|_2^2\right] \leq \sigma^2$, and for $t_1 \neq t_2$, $\zeta_{t_1}$ and $\zeta_{t_2}$ are statistically independent, i.e., $\zeta_{t_1} \perp \zeta_{t_2}$.

  • A3 Bounded gradient and noisy gradient: at step $t$, the algorithm can access a bounded noisy gradient, and the true gradient is also bounded, i.e., $\|\nabla f(\theta_t)\| \leq G$, $\|g_t\| \leq G$ for all $t > 1$.

  • A4 Properties of the function: (1) $f$ is differentiable; (2) $\|\nabla f(x) - \nabla f(y)\| \leq L\|x - y\|$ for all $x, y$; (3) $f$ is lower bounded.

Consider a non-convex optimization problem. Suppose assumptions A1-A4 are satisfied, and let $\alpha_t = \alpha/\sqrt{t}$. For all $T \geq 1$, SGDF achieves the following guarantee:

$$\mathbb{E}(T) \leq \frac{C_7\,\alpha^2(\log T + 1) + C_8}{2\alpha\sqrt{T}} \quad (5)$$

where $\mathbb{E}(T) = \min_{t=1,2,\ldots,T}\mathbb{E}_{t-1}\left[\|\nabla f(\theta_t)\|_2^2\right]$ denotes the minimum expected squared norm of the gradient, $\alpha$ is the learning rate at the first step, $C_7$ is a constant independent of $d$ and $T$, $C_8$ is a constant independent of $T$, and the expectation is taken with respect to all randomness in $g_t$.

Theorem 3.3 implies that the convergence rate of SGDF in the non-convex case is $O(\log T/\sqrt{T})$, similar to Adam-type optimizers [9, 44]. Note that these bounds are derived under the worst possible case, so they may be loose. The components of each dimension of the estimated gain $K_t$ converge with the variance, causing the upper bound to decrease rapidly. This is why SGDF converges faster than SGD.

4 Experiments

4.1 Empirical Evaluation

In this study, we focus on the following tasks. Image Classification. We employed the VGG [49], ResNet [23], and DenseNet [26] models for image classification tasks on the CIFAR dataset [33]. The major difference between these three network architectures is residual connectivity, which we discuss in Section 4.4. We evaluated and compared the performance of SGDF with other optimizers such as SGD, Adam, RAdam [37], AdamW [38], MSVAG [2], Adabound [40], Sophia [36], and Lion [10], all implemented based on the official PyTorch. Additionally, we tested the performance of SGDF on the ImageNet dataset [14] using the ResNet model. Object Detection. Object detection was performed on the PASCAL VOC dataset [18] using Faster-RCNN [45] integrated with FPN. For hyper-parameter tuning related to image classification and object detection, refer to [66]. Image Generation. We trained a Wasserstein-GAN (WGAN) [1] on the CIFAR-10 dataset.

Hyperparameter tuning. Following Zhuang et al. [66], we delved deep into the optimal hyperparameter settings for our experiments. In the image classification task, we employed these settings:

  • SGDF: We adhered to Adam’s original parameter values: $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$.

  • SGD: We set the momentum to 0.9, the default for networks like ResNet and DenseNet. The learning rate was searched in the set {10.0, 1.0, 0.1, 0.01, 0.001}.

  • Adam, RAdam, MSVAG, AdaBound: We searched $\beta_1$ over $\{0.5, 0.6, 0.7, 0.8, 0.9\}$, searched $\alpha$ as for SGD, and kept the other parameters at their default values from the literature.

  • AdamW, SophiaG, Lion: We followed Adam’s parameter search scheme and fixed the weight decay at $5\times 10^{-4}$; for AdamW, whose optimal decay is often larger than usual [38], we searched the weight decay over $\{10^{-4}, 5\times 10^{-4}, 10^{-3}, 10^{-2}, 10^{-1}\}$.

  • SophiaG, Lion: We searched for the learning rate among $\{10^{-3}, 10^{-4}, 10^{-5}\}$ and adjusted Lion’s learning rate accordingly [36]. Following [36, 10], we set $\beta_1 = 0.965$ and $0.9$, respectively, and $\beta_2 = 0.99$ as the default parameters.
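For reference, a hypothetical training-loop usage under the settings above; the `SGDF` import path and constructor signature here are assumptions modeled on `torch.optim` conventions, and the official repository may differ:

```python
import torch
import torch.nn.functional as F
from sgdf import SGDF  # hypothetical import; see the linked repository for the real module

model = torch.nn.Linear(10, 2)
optimizer = SGDF(model.parameters(), lr=0.1, betas=(0.9, 0.999), eps=1e-8)  # assumed signature

x, y = torch.randn(8, 10), torch.randint(0, 2, (8,))   # dummy batch
optimizer.zero_grad()
loss = F.cross_entropy(model(x), y)
loss.backward()
optimizer.step()
```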

CIFAR-10/100 Experiments. We initially trained on the CIFAR-10 and CIFAR-100 datasets using the VGG, ResNet, and DenseNet models and assessed the performance of the SGDF optimizer. In these experiments, we employed basic data augmentation techniques such as random horizontal flipping and random cropping (with 4-pixel padding). To facilitate reproduction, we provide the parameter table for this subsection in Tab. 5. The results represent the mean and standard deviation of 3 runs, visualized as curve graphs in Fig. 2.

Figure 7: Test accuracy ($\mu \pm \sigma$) on CIFAR.

As Fig. 2 shows, the SGDF optimizer exhibits convergence speeds comparable to adaptive optimization algorithms. Additionally, SGDF’s final test set accuracy was better than or equal to that achieved by SGD.

ImageNet Experiments. We utilized the ResNet18 model to train on the ImageNet dataset. Due to extensive computational demands, we did not conduct an exhaustive hyperparameter search for other optimizers and instead used the best-reported parameters from [8, 37]. We applied basic data augmentation strategies such as random cropping and random horizontal flipping. The results are presented in Tab. 1. To facilitate reproduction, we provide the parameter table for this subsection in Tab. 6. Detailed training and test curves are depicted in Fig. 8. Additionally, to mitigate the effect of learning rate scheduling, we employed cosine learning rate scheduling as suggested by [10, 65] and trained ResNet18, 34, and 50 models. The results are summarized in Tab. 2. Experiments on the ImageNet dataset demonstrate that SGDF improves convergence speed and achieves accuracy similar to SGD on the test set.

Table 1: Top-1 and Top-5 accuracy (%) of ResNet18 on ImageNet. † is reported in [66, 8, 37].

Method | SGDF  | SGD   | AdaBound | Yogi  | MSVAG | Adam           | RAdam | AdamW
Top-1  | 70.23 | 70.23 | 68.13    | 68.23 | 65.99 | 63.79 (66.54†) | 67.62 | 67.93
Top-5  | 89.55 | 89.40 | 88.55    | 88.59 | -     | 85.61          | -     | 88.47
Table 2: Top-1 accuracy (%) on ImageNet with cosine learning rate scheduling. Baseline results are reported in [65].

Model | ResNet18 | ResNet34 | ResNet50
SGDF  | 70.16    | 73.37    | 76.03
SGD   | 69.80    | 73.26    | 76.01

Object Detection. We conducted object detection experiments on the PASCAL VOC dataset [18]. The model used in these experiments was pre-trained on the COCO dataset [35], obtained from the official website. We trained this model on the VOC2007 and VOC2012 trainval datasets (17K images) and evaluated it on the VOC2007 test dataset (5K images). The utilized model was Faster-RCNN [45] with FPN, and the backbone was ResNet50 [23]. Results are summarized in Tab. 3. To facilitate reproduction, we provide the parameter table for this subsection in Tab. 6. As expected, SGDF outperforms the other methods. These results also illustrate the efficiency of our method on object detection tasks.

Table 3: The mAP on PASCAL VOC using Faster-RCNN+FPN.

Method | SGDF  | SGD   | Adam  | AdamW | RAdam
mAP    | 83.81 | 80.43 | 78.67 | 78.48 | 75.21

Image Generation. Optimizer stability is crucial, especially when training Generative Adversarial Networks (GANs). If the generator and discriminator have mismatched complexities, training can become imbalanced and the GAN may fail to converge, a failure commonly known as mode collapse. For instance, vanilla SGD frequently causes mode collapse, making adaptive optimizers like Adam and RMSProp the preferred choice. GAN training therefore provides a good benchmark for assessing optimizer stability. For reproducibility details, please refer to the parameter table in Table 7.

Figure 8: FID score of WGAN-GP.

We evaluated the Wasserstein-GAN with gradient penalty (WGAN-GP) [47]. Using well-known optimizers [4, 61], the model was trained for 100 epochs. We then calculated the Fréchet Inception Distance (FID) [24], a metric that measures the similarity between the real and generated image distributions and is used to assess the quality of the generative model (lower FID indicates superior performance). Five random runs were conducted, and the outcomes are presented in Figure 3. Results for SGD and MSVAG were extracted from Zhuang et al. [66].

Experimental results demonstrate that SGDF significantly enhances WGAN-GP training, achieving a better FID score than vanilla SGD and outperforming most adaptive optimization methods. The Wiener filter in SGDF facilitates smooth gradient updates, mitigating training oscillations and effectively addressing mode collapse.

4.2 Top Eigenvalues of Hessian and Hessian Trace

The success of optimization algorithms in deep learning not only depends on their ability to minimize training loss but also critically hinges on the nature of the solutions they converge to. Adam-type algorithms tend to converge to sharp minima, characterized by large eigenvalues of the Hessian matrix [6].

We computed the Hessian spectrum of ResNet-18 trained on the CIFAR-100 dataset for 200 epochs using four optimization methods: SGD, SGDM, Adam, and SGDF. These experiments ensure that all methods achieve similar results on the training set. We employed power iteration [58] to compute the top eigenvalues of the Hessian and Hutchinson’s method [59] to compute the Hessian trace. Histograms illustrating the distribution of the top 50 Hessian eigenvalues for each optimization method are presented in Fig. 4.
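The sketch below illustrates Hutchinson's estimator in PyTorch under simplified assumptions (a single loss evaluation and Rademacher probes); our experiments used the tooling of [58, 59] rather than this code:

```python
import torch

def hutchinson_trace(loss, params, n_probes=100):
    """Estimate tr(H) via E[v^T H v] with Rademacher probe vectors v."""
    grads = torch.autograd.grad(loss, params, create_graph=True)      # keep graph for HVPs
    trace = 0.0
    for _ in range(n_probes):
        vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]   # entries in {-1, +1}
        gv = sum((g * v).sum() for g, v in zip(grads, vs))            # g^T v
        hvs = torch.autograd.grad(gv, params, retain_graph=True)      # Hessian-vector product
        trace += sum((hv * v).sum().item() for hv, v in zip(hvs, vs)) # v^T H v
    return trace / n_probes

# usage: params = [p for p in model.parameters() if p.requires_grad]
```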

4.3 Visualization of Landscapes

We visualized the loss landscapes of models trained with SGD, SGDM, SGDF, and Adam using the ResNet-18 model on CIFAR-100, following the method in [34]. All models are trained with the same hyperparameters for 200 epochs, as detailed in Section 4.1. As shown in Fig. 5, SGDF finds flatter minima. Notably, the visualization reveals that Adam is more prone to converge to sharper minima.

4.4 Combining the Wiener Filter with Adam

We conducted comparative experiments on the CIFAR-100 dataset, evaluating both the vanilla Adam algorithm and Wiener-Adam, which substitutes Wiener filter estimates for the first-moment gradient estimates in the Adam optimizer. The results are presented in Tab. 4, and the detailed test curves are depicted in Fig. 10. This suggests that our first-moment filter estimation method has the potential to be applied to other optimization methods.

Table 4: Test accuracy (%) comparison between Adam and Wiener-Adam on CIFAR-100.

Model        | VGG11 | ResNet34 | DenseNet121
Wiener-Adam  | 62.64 | 73.98    | 74.89
Vanilla-Adam | 56.73 | 72.34    | 74.89

The performance improvement observed when using filter estimate can be attributed to several factors related to the nature of the Wiener filter and the architecture of the neural networks used.

Noise Reduction. The Wiener Filter is fundamentally designed to minimize the mean square error in the presence of noise. In the context of gradient estimation, this means it can effectively reduce the noise in the gradient updates. The vanilla Adam optimizer calculates the first moment (the mean) of the gradients, which may still include some noise components. By employing the Wiener filter, the algorithm may benefit from cleaner and more reliable gradient signals, thus improving the optimizer’s ability to converge to better minima.

Network Architecture. The impact of the Wiener Filter on gradient estimation varies with how each network architecture handles features and gradients. For VGG, which lacks residual connections, the Wiener filter significantly enhances performance by providing more accurate gradient estimates and reducing noise-induced errors, leading to better overall accuracy. In contrast, the benefits are less pronounced in ResNet and DenseNet, which incorporate residual and dense connections, respectively. These architectures naturally promote more stable gradient flows through direct paths or dense layer interconnectivity, diminishing the additional advantage offered by the Wiener filter. Intuitively, the loss landscapes of residually connected networks can be seen in [34]. Thus, while the Wiener filter improves performance in simpler architectures like VGG, its impact is mitigated in more complex architectures whose inherent features already assist gradient stability. A sketch of the Wiener-Adam modification follows.
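The sketch assumes the filtered estimate $\widehat{g}_t$ simply replaces Adam's bias-corrected first moment while the second-moment accumulator is kept unchanged; this is our illustrative NumPy reconstruction of the variant described above, and the variable names are ours:

```python
import numpy as np

def wiener_adam_step(theta, g, m, s, v, t, alpha=1e-3,
                     beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam step with the first moment replaced by the Wiener-filtered gradient."""
    m = beta1 * m + (1 - beta1) * g                       # first moment (EMA)
    s = beta2 * s + (1 - beta2) * (g - m) ** 2            # gradient-noise variance (SGDF)
    v = beta2 * v + (1 - beta2) * g ** 2                  # Adam's second moment
    m_hat = m / (1 - beta1 ** t)
    s_hat = ((1 - beta1) * (1 - beta1 ** (2 * t)) * s
             / ((1 + beta1) * (1 - beta2 ** t)))
    K = s_hat / (s_hat + (g - m_hat) ** 2 + eps)          # Wiener gain
    g_hat = m_hat + K * (g - m_hat)                       # filtered first moment
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * g_hat / (np.sqrt(v_hat) + eps)
    return theta, m, s, v
```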

5 Limitations

Theoretical Limitations. Our analysis relies on the common L-smooth condition for optimization continuity. Zhang et al. [64] propose gradient clipping, weakening this constraint. This suggests considering additional conditions for better understanding optimization in non-convex settings [52].

Empirical Limitations. Modern Large Language Models (LLMs) prioritize performance on training datasets [10, 36]. Discussing the theoretical properties of the Hessian matrix may lack relevance without comparable training losses. In practice, metrics like convergence speed and final loss values matter most.

6 Conclusion

In this paper, we introduce SGDF, a novel optimization method that estimates the gradient for faster convergence by leveraging both the variance of historical gradients and the current gradient. We demonstrate that SGDF yields solutions with a flat spectrum akin to SGD through Hessian spectral analysis. Through extensive experiments employing various deep learning architectures on benchmark datasets, we showcase SGDF’s superior performance compared to other state-of-the-art optimizers, striking a balance between convergence speed and generalization.

Acknowledgement

Zhipeng Yao thanks Aram Davtyan and Prof. Dr. Paolo Favaro of the Computer Vision Group at the University of Bern for discussions that improved the paper.

We thank the Ascend AI Eco-Innovation Centre at the Shenyang AI Computing Hub for computational support.

References

  • Arjovsky et al. [2017] Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein gan. 2017.
  • Balles & Hennig [2018] Balles, L. and Hennig, P. Dissecting adam: The sign, magnitude and variance of stochastic gradients. In International Conference on Machine Learning, pp. 404–413. PMLR, 2018.
  • Bengio & Lecun [2007] Bengio, Y. and Lecun, Y. Scaling learning algorithms towards ai. 2007.
  • Bernstein et al. [2020] Bernstein, J., Vahdat, A., Yue, Y., and Liu, M.-Y. On the distance between two neural networks and the stability of learning. Advances in Neural Information Processing Systems, 33:21370–21381, 2020.
  • Bottou et al. [2018] Bottou, L., Curtis, F. E., and Nocedal, J. Optimization methods for large-scale machine learning. SIAM review, 60(2):223–311, 2018.
  • Caldarola et al. [2022] Caldarola, D., Caputo, B., and Ciccone, M. Improving generalization in federated learning by seeking flat minima. ArXiv, 2022.
  • Chandramoorthy et al. [2022] Chandramoorthy, N., Loukas, A., Gatmiry, K., and Jegelka, S. On the generalization of learning algorithms that do not converge. Advances in Neural Information Processing Systems, 35:34241–34257, 2022.
  • Chen et al. [2018a] Chen, J., Zhou, D., Tang, Y., Yang, Z., Cao, Y., and Gu, Q. Closing the generalization gap of adaptive gradient methods in training deep neural networks. arXiv preprint arXiv:1806.06763, 2018a.
  • Chen et al. [2018b] Chen, X., Liu, S., Sun, R., and Hong, M. On the convergence of a class of adam-type algorithms for non-convex optimization. arXiv preprint arXiv:1808.02941, 2018b.
  • Chen et al. [2023] Chen, X., Liang, C., Huang, D., Real, E., Wang, K., Liu, Y., Pham, H., Dong, X., Luong, T., Hsieh, C.-J., et al. Symbolic discovery of optimization algorithms. arXiv preprint arXiv:2302.06675, 2023.
  • Dauphin et al. [2014] Dauphin, Y., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. MIT Press, 2014.
  • Davtyan et al. [2022] Davtyan, A., Sameni, S., Cerkezi, L., Meishvili, G., Bielski, A., and Favaro, P. Koala: A kalman optimization algorithm with loss adaptivity. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pp.  6471–6479, 2022.
  • Defazio et al. [2014] Defazio, A., Bach, F., and Lacoste-Julien, S. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pp. 1646–1654, 2014.
  • Deng et al. [2009] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In IEEE Conference on Computer Vision and Pattern Recognition, 2009.
  • Dozat [2016] Dozat, T. Incorporating nesterov momentum into adam. ICLR Workshop, 2016.
  • Du & Lee [2018] Du, S. S. and Lee, J. D. On the power of over-parametrization in neural networks with quadratic activation. 2018.
  • Duchi et al. [2011] Duchi, John, Hazan, Elad, Singer, and Yoram. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 2011.
  • Everingham et al. [2010] Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The pascal visual object classes (voc) challenge. International journal of computer vision, 88(2):303–338, 2010.
  • Foret et al. [2021] Foret, P. et al. Sharpness-aware minimization for efficiently improving generalization. In ICLR, 2021. spotlight.
  • Geman et al. [2014] Geman, S., Bienenstock, E., and Doursat, R. Neural networks and the bias/variance dilemma. Neural Computation, 4(1):1–58, 2014.
  • Goodfellow et al. [2016] Goodfellow, I., Bengio, Y., and Courville, A. Deep learning. MIT Press, 2016.
  • Hardt et al. [2015] Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. Mathematics, 2015.
  • He et al. [2016] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  • Heusel et al. [2017] Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Klambauer, G., and Hochreiter, S. Gans trained by a two time-scale update rule converge to a nash equilibrium. 2017.
  • Hinton et al. [2012] Hinton, G., Srivastava, N., and Swersky, K. Neural networks for machine learning lecture 6a overview of mini-batch gradient descent. Cited on, 14(8):2, 2012.
  • Huang et al. [2017] Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  4700–4708, 2017.
  • Johnson & Zhang [2013] Johnson, R. and Zhang, T. Accelerating stochastic gradient descent using predictive variance reduction. Advances in neural information processing systems, 26, 2013.
  • Kalman [1960] Kalman, R. E. A new approach to linear filtering and prediction problems. Journal of Basic Engineering, 1960.
  • Kay [1993] Kay, S. M. Fundamentals of Statistical Signal Processing: Estimation Theory, volume 1. Prentice-Hall, Inc., 1993.
  • Keskar et al. [2016] Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On large-batch training for deep learning: Generalization gap and sharp minima. 2016.
  • Keskar et al. [2017] Keskar, N. S. et al. On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR, 2017.
  • Kingma & Ba [2014] Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Krizhevsky et al. [2009] Krizhevsky, A., Hinton, G., et al. Learning multiple layers of features from tiny images. 2009.
  • Li et al. [2018] Li, H., Xu, Z., Taylor, G., Studer, C., and Goldstein, T. Visualizing the loss landscape of neural nets. Advances in neural information processing systems, 31, 2018.
  • Lin et al. [2014] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. European Conference on Computer Vision (ECCV), 2014.
  • Liu et al. [2023] Liu, H., Li, Z., Hall, D., Liang, P., and Ma, T. Sophia: A scalable stochastic second-order optimizer for language model pre-training. arXiv preprint arXiv:2305.14342, 2023.
  • Liu et al. [2019] Liu, L., Jiang, H., He, P., Chen, W., Liu, X., Gao, J., and Han, J. On the variance of the adaptive learning rate and beyond. arXiv preprint arXiv:1908.03265, 2019.
  • Loshchilov & Hutter [2017] Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
  • Lucchi et al. [2022] Lucchi, A., Proske, F., Orvieto, A., Bach, F., and Kersting, H. On the theoretical properties of noise correlation in stochastic optimization. Advances in Neural Information Processing Systems, 35:14261–14273, 2022.
  • Luo et al. [2019] Luo, L., Xiong, Y., Liu, Y., and Sun, X. Adaptive gradient methods with dynamic bound of learning rate. arXiv preprint arXiv:1902.09843, 2019.
  • Robbins & Monro [1951] Robbins, H. and Monro, S. A stochastic approximation method. Annals of Mathematical Statistics, 22(3):400–407, 1951.
  • Ollivier [2019] Ollivier, Y. The extended kalman filter is a natural gradient descent in trajectory space. arXiv: Optimization and Control, 2019.
  • Rao [1992] Rao, C. R. Information and the accuracy attainable in the estimation of statistical parameters. In Breakthroughs in Statistics: Foundations and basic theory, pp.  235–247. Springer, 1992.
  • Reddi et al. [2019] Reddi, S. J., Kale, S., and Kumar, S. On the convergence of adam and beyond. 2019.
  • Ren et al. [2015] Ren, S., He, K., Girshick, R., and Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Neural Information Processing Systems (NIPS), 2015.
  • Ruder [2016] Ruder, S. An overview of gradient descent optimization algorithms. 2016.
  • Salimans et al. [2016] Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
  • Schmidt et al. [2017] Schmidt, M., Roux, N. L., and Bach, F. Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1):83–112, 2017.
  • Simonyan & Zisserman [2014] Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. Computer Science, 2014.
  • Sutskever et al. [2013] Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On the importance of initialization and momentum in deep learning. In International conference on machine learning, pp. 1139–1147. PMLR, 2013.
  • Vuckovic [2018] Vuckovic, J. Kalman gradient descent: Adaptive variance reduction in stochastic optimization. ArXiv, 2018.
  • Wang et al. [2022] Wang, B., Zhang, Y., Zhang, H., Meng, Q., Ma, Z.-M., Liu, T.-Y., and Chen, W. Provable adaptivity in adam. arXiv preprint arXiv:2208.09900, 2022.
  • Wiener [1950] Wiener, N. The extrapolation, interpolation and smoothing of stationary time series, with engineering applications. Journal of the Royal Statistical Society Series A (General), 1950.
  • Wilson et al. [2017] Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and Recht, B. The marginal value of adaptive gradient methods in machine learning. 2017.
  • Xie et al. [2020] Xie, Z., Wang, X., Zhang, H., Sato, I., and Sugiyama, M. Adai: Separating the effects of adaptive learning rate and momentum inertia. arXiv preprint arXiv:2006.15815, 2020.
  • Xie et al. [2022] Xie, Z., Tang, Q. Y., Cai, Y., Sun, M., and Li, P. On the power-law spectrum in deep learning: A bridge to protein science. 2022.
  • Yang et al. [2023] Yang, N., Tang, C., and Tu, Y. Stochastic gradient descent introduces an effective landscape-dependent regularization favoring flat solutions. Physical Review Letters, 130(23):237101, 2023.
  • Yao et al. [2018] Yao, Z., Gholami, A., Lei, Q., Keutzer, K., and Mahoney, M. W. Hessian-based analysis of large batch training and robustness to adversaries. 2018.
  • Yao et al. [2020a] Yao, Z., Gholami, A., Keutzer, K., and Mahoney, M. W. Pyhessian: Neural networks through the lens of the hessian. In International Conference on Big Data, 2020a.
  • Yao et al. [2020b] Yao, Z., Gholami, A., Shen, S., Keutzer, K., and Mahoney, M. W. Adahessian: An adaptive second order optimizer for machine learning. arXiv preprint arXiv:2006.00719, 2020b.
  • Zaheer et al. [2018] Zaheer, M., Reddi, S., Sachan, D., Kale, S., and Kumar, S. Adaptive methods for nonconvex optimization. Advances in neural information processing systems, 31, 2018.
  • Zeiler [2012] Zeiler, M. D. Adadelta: An adaptive learning rate method. arXiv e-prints, 2012.
  • Zhang et al. [2016] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. 2016.
  • Zhang et al. [2019] Zhang, J., He, T., Sra, S., and Jadbabaie, A. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2019.
  • Zhang et al. [2023] Zhang, X., Xu, R., Yu, H., Zou, H., and Cui, P. Gradient norm aware minimization seeks first-order flatness and improves generalization. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  • Zhuang et al. [2020] Zhuang, J., Tang, T., Ding, Y., Tatikonda, S. C., Dvornek, N., Papademetris, X., and Duncan, J. Adabelief optimizer: Adapting stepsizes by the belief in observed gradients. Advances in neural information processing systems, 33:18795–18806, 2020.

Appendix A Method Derivation (Section 3 in main paper)

A.1 Wiener Filter Derivation for Gradient Estimation (Main paper Section 3.1)

Given the sequence of gradients $\{g_t\}$ in a stochastic gradient descent process, we aim to find an estimate $\widehat{g}_t$ that incorporates information from both the historical gradients and the current gradient. The Wiener Filter provides an estimate that minimizes the mean squared error. We begin by constructing the estimate as a simple average and then refine it using the properties of the Wiener Filter.

$$\begin{aligned}
\widehat{g}_t &= \frac{1}{T+1}\sum_{t=1}^{T} g_t + \frac{1}{T+1} g_t \\
&= \frac{1}{T+1}\frac{T}{T}\sum_{t=1}^{T} g_t + \frac{1}{T+1} g_t \\
&= \frac{T}{T+1}\bar{g}_t + \frac{1}{T+1} g_t \\
&\overset{(a)}{\approx} \frac{T}{T+1}\widehat{m}_t + \frac{1}{T+1} g_t \\
&= \left(1 - \frac{1}{T+1}\right)\widehat{m}_t + \frac{1}{T+1} g_t \\
&= \widehat{m}_t - K_t\widehat{m}_t + K_t g_t \\
&= \widehat{m}_t + K_t\left(g_t - \widehat{m}_t\right)
\end{aligned} \quad (6)$$

In the above derivation, step (a) replaces the arithmetic mean of the gradients $\bar{g}_t$ with the momentum term $\widehat{m}_t$. The Wiener gain $K_t = \frac{1}{T+1}$ is then introduced to update the gradient estimate with information from the new gradient.

By defining $\widehat{g}_t$ as the weighted combination of the momentum term $\widehat{m}_t$ and the current gradient $g_t$, we can compute the variance of $\widehat{g}_t$ as follows:

$$\begin{aligned}
\mathrm{Var}(\widehat{g}_t) &= \mathrm{Var}\left((1-K_t)\widehat{m}_t + K_t g_t\right) \\
&= (1-K_t)^2\,\mathrm{Var}(\widehat{m}_t) + K_t^2\,\mathrm{Var}(g_t)
\end{aligned} \quad (7)$$

Minimizing the variance of $\widehat{g}_t$ with respect to $K_t$, by setting the derivative $\frac{\mathrm{d}\,\mathrm{Var}(\widehat{g}_t)}{\mathrm{d}K_t} = 0$, yields:

$$\begin{aligned}
0 &= -2(1-K_t)\,\mathrm{Var}(\widehat{m}_t) + 2K_t\,\mathrm{Var}(g_t) \\
0 &= -(1-K_t)\,\mathrm{Var}(\widehat{m}_t) + K_t\,\mathrm{Var}(g_t) \\
K_t &= \frac{\mathrm{Var}(\widehat{m}_t)}{\mathrm{Var}(\widehat{m}_t) + \mathrm{Var}(g_t)}
\end{aligned} \quad (8)$$

The final expression for KtK_{t} shows that the optimal interpolation coefficient is the ratio of the variance of the momentum term to the sum of the variances of the momentum term and the current gradient. This result exemplifies the essence of the Wiener Filter: optimally combining past information with new observations to reduce estimation error due to noisy data.
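This optimality is easy to verify numerically (with illustrative variances, not measured ones): sweeping $K$ shows the combined variance is minimized exactly at $\mathrm{Var}(\widehat{m}_t)/(\mathrm{Var}(\widehat{m}_t)+\mathrm{Var}(g_t))$:

```python
import numpy as np

var_m, var_g = 0.2, 0.8                          # assumed variances of m_hat and g_t
K = np.linspace(0.0, 1.0, 100001)
combined = (1 - K) ** 2 * var_m + K ** 2 * var_g # variance of the combined estimate
print(K[np.argmin(combined)])                    # ~0.2  = var_m / (var_m + var_g)
print(combined.min())                            # 0.16 = var_m * var_g / (var_m + var_g)
```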

A.2 Variance Correction (Correction factor in main paper Section 3.1)

The momentum term is defined as:

$$m_t = (1-\beta_1)\sum_{i=1}^{t}\beta_1^{t-i} g_i, \quad (9)$$

which means that the momentum term is a weighted sum of past gradients, where the weights decrease exponentially over time.

To compute the variance of the momentum term $m_t$, we first observe that since the $g_i$ are independent and identically distributed with constant variance $\sigma_g^2$, the variance of the momentum term can be obtained by summing the variances of all the weighted gradients.

The variance of each weighted gradient $\beta_1^{t-i} g_i$ is $\beta_1^{2(t-i)}\sigma_g^2$, because variance is quadratic in scale, so the weight $\beta_1^{t-i}$ becomes $\beta_1^{2(t-i)}$ in the variance computation.

Therefore, the variance of mtm_{t} is the sum of all these weighted variances:

$$\sigma_{m_t}^2 = (1-\beta_1)^2\sigma_g^2\sum_{i=1}^{t}\beta_1^{2(t-i)}. \quad (10)$$

The factor $(1-\beta_1)^2$ comes from the multiplicative factor $(1-\beta_1)$ in the momentum update formula, which is squared when calculating the variance.

The summation $\sum_{i=1}^{t}\beta_1^{2(t-i)}$ is a geometric series, which can be written as:

$$\sum_{i=1}^{t}\beta_1^{2(t-i)} = \frac{1-\beta_1^{2t}}{1-\beta_1^{2}}. \quad (11)$$

As $t \rightarrow \infty$, given that $\beta_1 < 1$, we have $\beta_1^{2t} \rightarrow 0$, and the geometric series converges to:

$$\lim_{t\rightarrow\infty}\sum_{i=1}^{t}\beta_1^{2(t-i)} = \lim_{t\rightarrow\infty}\frac{1-\beta_1^{2t}}{1-\beta_1^{2}} = \frac{1}{1-\beta_1^{2}}. \quad (12)$$

Consequently, the long-term variance of the momentum term $m_t$ is:

$$\sigma_{m_t}^2 = \frac{(1-\beta_1)^2}{1-\beta_1^{2}}\,\sigma_g^2 = \frac{1-\beta_1}{1+\beta_1}\,\sigma_g^2. \quad (13)$$

This result shows how the momentum term reduces the effective gradient noise: by a factor of $\frac{1-\beta_1}{1+\beta_1}$ relative to the gradient variance $\sigma_g^2$.
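A Monte-Carlo check of this factor under the i.i.d. assumption (illustrative parameters):

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, sigma_g, steps, runs = 0.9, 1.0, 2000, 5000

m = np.zeros(runs)                              # one EMA per independent run
for _ in range(steps):
    g = rng.normal(0.0, sigma_g, size=runs)     # i.i.d. gradient noise
    m = beta1 * m + (1 - beta1) * g

print(m.var())                                  # empirical Var(m_t)
print((1 - beta1) / (1 + beta1) * sigma_g**2)   # predicted factor: ~0.0526
```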

A.3 Fusion Gaussian distribution (Main paper Section 3.2)

Consider two Gaussian distributions for the momentum term $\widehat{m}_t$ and the current gradient $g_t$:

  • The momentum term $\widehat{m}_t$ is normally distributed with mean $\mu_m$ and variance $\sigma_m^2$, denoted as $\widehat{m}_t \sim \mathcal{N}(\mu_m, \sigma_m^2)$.

  • The current gradient $g_t$ is normally distributed with mean $\mu_g$ and variance $\sigma_g^2$, denoted as $g_t \sim \mathcal{N}(\mu_g, \sigma_g^2)$.

The product of their probability density functions is given by:

$$N(\widehat{m}_t;\mu_m,\sigma_m)\cdot N(g_t;\mu_g,\sigma_g) = \frac{1}{2\pi\sigma_m\sigma_g}\exp\left(-\frac{(\widehat{m}_t-\mu_m)^2}{2\sigma_m^2}-\frac{(g_t-\mu_g)^2}{2\sigma_g^2}\right) \quad (14)$$

The goal is to find an equivalent mean $\mu'$ and variance $\sigma'^2$ for the new Gaussian distribution that matches the product:

$$N(x;\mu',\sigma'^2) = \frac{1}{\sqrt{2\pi}\,\sigma'}\exp\left(-\frac{(x-\mu')^2}{2\sigma'^2}\right) \quad (15)$$

We derive the expression for combining these two distributions. For convenience, define the exponent $t$ as follows:

$$\begin{aligned}
t &= -\frac{(x-\mu_m)^2}{2\sigma_m^2} - \frac{(x-\mu_g)^2}{2\sigma_g^2} \\
&= -\frac{\sigma_g^2(x-\mu_m)^2 + \sigma_m^2(x-\mu_g)^2}{2\sigma_m^2\sigma_g^2} \\
&= -\frac{\left(x - \frac{\sigma_g^2\mu_m + \sigma_m^2\mu_g}{\sigma_m^2+\sigma_g^2}\right)^2}{\frac{2\sigma_m^2\sigma_g^2}{\sigma_m^2+\sigma_g^2}} - \frac{(\mu_m-\mu_g)^2}{2(\sigma_m^2+\sigma_g^2)}.
\end{aligned} \quad (16)$$

Through coefficient matching in the exponential terms, we obtain the new mean and variance:

$$\mu' = \frac{\sigma_g^2\mu_m+\sigma_m^2\mu_g}{\sigma_m^2+\sigma_g^2}, \qquad \sigma'^2 = \frac{\sigma_m^2\sigma_g^2}{\sigma_m^2+\sigma_g^2} \quad (17)$$

The new mean $\mu'$ is a weighted average of the two means $\mu_m$ and $\mu_g$, with weights inversely proportional to their variances. This places $\mu'$ between $\mu_m$ and $\mu_g$, closer to the mean with the smaller variance. The new standard deviation $\sigma'$ is smaller than either of the original standard deviations $\sigma_m$ and $\sigma_g$, which reflects the reduced uncertainty in the estimate due to the combination of information from both sources. This is a direct consequence of the Wiener Filter’s optimality in the mean-square error sense.

A.4 Fokker Planck modelling (Theorem 3.1 in main paper)

Theorem A.1.

Consider a system described by the Fokker-Planck equation, evolving the probability density function $P$ in one-dimensional and multi-dimensional parameter spaces. Given a loss function $f(\theta)$, let the noise variance $D$ or diffusion matrix $D_{ij}$ satisfy $D \geq C > 0$ or $D_i \geq C > 0$, where $C$ is a positive lower-bound constant, known as the Cramér-Rao lower bound. In the steady-state condition, i.e., $\frac{\partial P}{\partial t} = 0$, the analytical form of the probability density $P$ can be obtained by solving the corresponding Fokker-Planck equation. These solutions reveal the probability distribution of the system at steady state, described as follows:

One-dimensional case. In a one-dimensional parameter space, the probability density function $P(\theta)$ is

$$P(\theta) = \frac{1}{Z}\exp\left(-\int\frac{1}{D}\frac{\partial f}{\partial\theta}\,d\theta\right), \quad (18)$$

where ZZ is a normalization constant, ensuring the total probability sums to one.

Multi-dimensional case. In a multi-dimensional parameter space, the probability density function $P(\theta)$ is

$$P(\theta) = \frac{1}{Z}\exp\left(-\sum_{i=1}^{n}\frac{f(\theta)}{D_i}\right), \quad (19)$$

Here, $Z$ is again a normalization constant ensuring the total probability integrates to one, assuming $D_{ij} = D_i\delta_{ij}$, where $\delta_{ij}$ is the Kronecker delta.

Proof.

One-dimensional Fokker-Planck equation. Consider the one-dimensional Fokker-Planck equation associated with the gradient-descent drift $-\partial f/\partial\theta$:

$$\frac{\partial P}{\partial t} = \frac{\partial}{\partial\theta}\left(P\frac{\partial f}{\partial\theta}\right) + \frac{\partial^2}{\partial\theta^2}(DP), \quad (20)$$

where $f(\theta)$ is the loss function and $D$ is the variance of the noise, with $D \geq C > 0$ representing a positive lower bound for the variance. $P$ denotes the probability density of finding the state of the system near a given point or region.

Derivation of the Steady-State Distribution:

In the steady-state condition, $\frac{\partial P}{\partial t} = 0$, the equation simplifies to:

$$0 = \frac{\partial}{\partial\theta}\left(P\frac{\partial f}{\partial\theta}\right) + \frac{\partial^2}{\partial\theta^2}(DP). \quad (21)$$

Our goal is to find the probability density $P$ as a function of $\theta$.

Rearranging, we obtain:

$$\frac{\partial}{\partial\theta}\left(P\frac{\partial f}{\partial\theta}\right) = -\frac{\partial^2}{\partial\theta^2}(DP). \quad (22)$$

Next, we set $J = P\frac{\partial f}{\partial\theta}$ (the drift contribution to the probability flux); then, for constant $D$, we have:

$$\frac{\partial J}{\partial\theta} = -\frac{\partial}{\partial\theta}\left(D\frac{\partial P}{\partial\theta}\right). \quad (23)$$

Upon integration, we get:

$$J = -D\frac{\partial P}{\partial\theta} + C_1, \quad (24)$$

where $C_1$ is an integration constant. Assuming the probability current vanishes at infinity, $C_1 = 0$.

Therefore, we have:

$$D\frac{\partial P}{\partial\theta} = -P\frac{\partial f}{\partial\theta}. \quad (25)$$

This equation can be rewritten as:

$$\frac{\partial P}{\partial\theta} = -\frac{P}{D}\frac{\partial f}{\partial\theta}. \quad (26)$$

Now, leveraging the variance lower bound $D \geq C$, we analyze the above equation. Since $D$ is a positive constant, we can integrate further to get $P$:

$$\ln P = -\int\frac{1}{D}\frac{\partial f}{\partial\theta}\,d\theta + C_2, \quad (27)$$

where $C_2$ is an integration constant.

Solving for $P$, we get:

$$P = e^{C_2}\exp\left(-\int\frac{1}{D}\frac{\partial f}{\partial\theta}\,d\theta\right). \quad (28)$$

Since $D$ has a lower bound, $\frac{1}{D}$ is bounded above, which ensures that $P$ does not blow up at any particular value of $\theta$.

Multi-dimensional Fokker-Planck equation. Consider a multi-dimensional parameter space $\theta \in \mathbb{R}^n$ and a loss function $f(\theta)$. The evolution of the probability density function $P(\theta, t)$ in this space is governed by the Fokker-Planck equation:

$$\frac{\partial P}{\partial t} = \sum_{i=1}^{n}\frac{\partial}{\partial\theta_i}\left(P\frac{\partial f}{\partial\theta_i}\right) + \sum_{i=1}^{n}\sum_{j=1}^{n}\frac{\partial^2}{\partial\theta_i\partial\theta_j}\left(D_{ij}P\right), \quad (29)$$

where DijD_{ij} are elements of the diffusion matrix, representing the intensity and correlation of the stochastic in the directions θi\theta_{i} and θj\theta_{j}. At the steady state, where the time derivative of PP vanishes, we find:

0=i=1nθi(Pfθi)+i=1nj=1n2θiθj(DijP).0=-\sum_{i=1}^{n}\frac{\partial}{\partial\theta_{i}}\left(P\frac{\partial f}{\partial\theta_{i}}\right)+\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{\partial^{2}}{\partial\theta_{i}\partial\theta_{j}}\left(D_{ij}P\right). (30)

Assuming Dij=DiδijD_{ij}=D_{i}\delta_{ij} where δij\delta_{ij} is the Kronecker delta, and DiC>0D_{i}\geq C>0, the equation simplifies to:

0=i=1nθi(Pfθi)+i=1n2θi2(DiP).0=-\sum_{i=1}^{n}\frac{\partial}{\partial\theta_{i}}\left(P\frac{\partial f}{\partial\theta_{i}}\right)+\sum_{i=1}^{n}\frac{\partial^{2}}{\partial\theta_{i}^{2}}\left(D_{i}P\right). (31)

Integrating with respect to θi\theta_{i}, we obtain a set of equations:

DiPθi=Pfθi+Ci,D_{i}\frac{\partial P}{\partial\theta_{i}}=P\frac{\partial f}{\partial\theta_{i}}+C_{i}, (32)

where CiC_{i} is an integration constant. Assuming Ci=0C_{i}=0, which corresponds to no flux at the boundaries, we can solve for PP:

P(θ)=1Zexp(i=1nf(θ)Di),P(\theta)=\frac{1}{Z}\exp\left(-\sum_{i=1}^{n}\frac{f(\theta)}{D_{i}}\right), (33)

where ZZ is a normalization constant ensuring that the total probability integrates to one.

Exploration Efficacy of SGD due to Variance Lower Bound The existence of a variance lower bound in Stochastic Gradient Descent (SGD) critically enhances the algorithm’s exploration capabilities, particularly in regions of the loss landscape where gradients are minimal. By preventing the probability density function from becoming unbounded, it ensures continuous exploration and increases the probability of converging to flat minima that are associated with better generalization properties. This principle holds true across both one-dimensional and multi-dimensional scenarios, making the variance lower bound an essential consideration for optimizing SGD’s performance in finding robust, generalizable solutions.
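To make the steady-state analysis concrete, the following sketch simulates one-dimensional Langevin dynamics (a continuous-time idealization of SGD with constant noise variance $D$) and compares the empirical histogram of the iterates with the density of Eq. (18). The double-well loss and all numeric constants are illustrative assumptions for the demonstration, not values taken from the paper.

```python
import numpy as np

# Illustrative double-well loss and its gradient (assumed for this demo).
f = lambda th: 0.25 * th**4 - 0.5 * th**2
df = lambda th: th**3 - th

D, eta, steps = 0.2, 1e-3, 500_000
rng = np.random.default_rng(0)

# Euler-Maruyama simulation of d(theta) = -f'(theta) dt + sqrt(2D) dW.
theta = 0.0
samples = np.empty(steps)
for k in range(steps):
    theta += -eta * df(theta) + np.sqrt(2.0 * D * eta) * rng.standard_normal()
    samples[k] = theta

# Predicted steady state from Eq. (18) with constant D: P(theta) ~ exp(-f/D).
grid = np.linspace(-2.5, 2.5, 401)
p = np.exp(-f(grid) / D)
p /= np.trapz(p, grid)  # normalization constant 1/Z

hist, edges = np.histogram(samples[steps // 10:], bins=80, density=True)
centers = 0.5 * (edges[:-1] + edges[1:])
print("max density gap:", np.abs(hist - np.interp(centers, grid, p)).max())
```

Because $D$ is bounded away from zero, the simulated chain keeps visiting both wells instead of collapsing into a single minimum, which is exactly the exploration behavior discussed above.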

Appendix B Convergence analysis in convex online learning case (Theorem 3.2 in main paper).

Assumption B.1.

Variables are bounded: $\exists D$ such that $\forall t,\ \|\theta_{t}\|_{2}\leq D$. Gradients are bounded: $\exists G$ such that $\forall t,\ \|g_{t}\|_{2}\leq G$.

Definition B.2.

Let $f_{t}(\theta_{t})$ be the loss at time $t$ and $f_{t}(\theta^{*})$ be the loss of the best possible strategy at the same time. The cumulative regret $R(T)$ at time $T$ is defined as:

R(T)=\sum_{t=1}^{T}\left[f_{t}(\theta_{t})-f_{t}(\theta^{*})\right]    (34)
Definition B.3.

A function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is convex if for all $x,y\in\mathbb{R}^{d}$ and all $\lambda\in[0,1]$,

\lambda f(x)+(1-\lambda)f(y)\geq f(\lambda x+(1-\lambda)y)    (35)

Also, notice that a convex function can be lower bounded by a hyperplane at its tangent.

Lemma B.4.

If a function $f:\mathbb{R}^{d}\rightarrow\mathbb{R}$ is convex, then for all $x,y\in\mathbb{R}^{d}$,

f(y)\geq f(x)+\nabla f(x)^{T}(y-x)    (36)

The above lemma can be used to upper bound the regret, and our proof for the main theorem is constructed by substituting the hyperplane with SGDF update rules.
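As a quick numerical illustration of Lemma B.4 (purely illustrative; the convex test function and sample counts are arbitrary choices), the sketch below checks the first-order lower bound for the log-sum-exp function at random point pairs:

```python
import numpy as np

rng = np.random.default_rng(1)

# Convex test function f(x) = log(sum_j exp(x_j)); its gradient is the softmax.
f = lambda x: np.logaddexp.reduce(x)
grad = lambda x: np.exp(x - f(x))

# Check f(y) >= f(x) + <grad f(x), y - x> (Eq. 36) on random pairs.
for _ in range(1000):
    x, y = rng.normal(size=4), rng.normal(size=4)
    assert f(y) >= f(x) + grad(x) @ (y - x) - 1e-10
print("first-order lower bound held on all sampled pairs")
```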

The following two lemmas are used to support our main theorem. We also use some definitions to simplify our notation, where $g_{t}\triangleq\nabla f_{t}(\theta_{t})$ and $g_{t,i}$ denotes its $i^{\text{th}}$ element. We denote $g_{1:t,i}\in\mathbb{R}^{t}$ as the vector containing the $i^{\text{th}}$ dimension of the gradients over all iterations up to $t$: $g_{1:t,i}=[g_{1,i},g_{2,i},\cdots,g_{t,i}]$.

Lemma B.5.

Let $g_{t}=\nabla f_{t}(\theta_{t})$ and $g_{1:t}$ be defined as above and bounded,

\|g_{t}\|_{2}\leq G,\qquad\|g_{t}\|_{\infty}\leq G_{\infty}.    (37)

Then,

\sum_{t=1}^{T}g_{t,i}\leq 2G_{\infty}\|g_{1:T,i}\|_{2}.    (38)

Proof. We will prove the inequality using induction over $T$. For the base case $T=1$:

g_{1,i}\leq 2G_{\infty}\|g_{1,i}\|_{2}.    (39)

Assuming the inductive hypothesis holds for $T-1$, the inductive step gives:

\sum_{t=1}^{T}g_{t,i}=\sum_{t=1}^{T-1}g_{t,i}+g_{T,i}    (40)
\leq 2G_{\infty}\|g_{1:T-1,i}\|_{2}+g_{T,i}
=2G_{\infty}\sqrt{\|g_{1:T,i}\|_{2}^{2}-g_{T,i}^{2}}+g_{T,i}.

Given,

\|g_{1:T,i}\|_{2}^{2}-g_{T,i}^{2}+\frac{g_{T,i}^{4}}{4\|g_{1:T,i}\|_{2}^{2}}\geq\|g_{1:T,i}\|_{2}^{2}-g_{T,i}^{2},    (41)

taking the square root of both sides, we get:

\sqrt{\|g_{1:T,i}\|_{2}^{2}-g_{T,i}^{2}}\leq\|g_{1:T,i}\|_{2}-\frac{g_{T,i}^{2}}{2\|g_{1:T,i}\|_{2}}    (42)
\leq\|g_{1:T,i}\|_{2}-\frac{g_{T,i}^{2}}{2\sqrt{TG_{\infty}^{2}}}.

Substituting into the previous inequality:

2G_{\infty}\sqrt{\|g_{1:T,i}\|_{2}^{2}-g_{T,i}^{2}}+g_{T,i}\leq 2G_{\infty}\|g_{1:T,i}\|_{2}    (43)
Lemma B.6.

Let $g_{t}$ be bounded, $\|g_{t}\|_{2}\leq G$, $\|g_{t}\|_{\infty}\leq G_{\infty}$. Then the following inequality holds:

\sum_{t=1}^{T}\widehat{m}_{t,i}^{2}\leq\frac{4G_{\infty}^{2}}{(1-\beta_{1})^{2}}\|g_{1:T,i}\|_{2}^{2}    (44)

Proof. Using the inequality $\frac{1}{(1-\beta_{1}^{t})^{2}}\leq\frac{1}{(1-\beta_{1})^{2}}$, we can expand the last term in the summation using the update rules in Algorithm 1:

\sum_{t=1}^{T}\widehat{m}_{t,i}^{2}=\sum_{t=1}^{T-1}\widehat{m}_{t,i}^{2}+\frac{\left(\sum_{k=1}^{T}(1-\beta_{1})\beta_{1}^{T-k}g_{k,i}\right)^{2}}{(1-\beta_{1}^{T})^{2}}    (45)
\leq\sum_{t=1}^{T-1}\widehat{m}_{t,i}^{2}+\frac{\sum_{k=1}^{T}T\left((1-\beta_{1})\beta_{1}^{T-k}g_{k,i}\right)^{2}}{(1-\beta_{1}^{T})^{2}}
\leq\sum_{t=1}^{T-1}\widehat{m}_{t,i}^{2}+\frac{(1-\beta_{1})^{2}}{(1-\beta_{1}^{T})^{2}}\sum_{k=1}^{T}T\left(\beta_{1}^{2}\right)^{T-k}\|g_{k,i}\|_{2}^{2}
\leq\sum_{t=1}^{T-1}\widehat{m}_{t,i}^{2}+T\sum_{k=1}^{T}\left(\beta_{1}^{2}\right)^{T-k}\|g_{k,i}\|_{2}^{2}

Similarly, we can upper-bound the rest of the terms in the summation.

\sum_{t=1}^{T}\widehat{m}_{t,i}^{2}\leq\sum_{t=1}^{T}\|g_{t,i}\|_{2}^{2}\sum_{j=0}^{T-t}t\beta_{1}^{j}    (46)
\leq\sum_{t=1}^{T}\|g_{t,i}\|_{2}^{2}\sum_{j=0}^{T}t\beta_{1}^{j}

For $\beta_{1}<1$, using the upper bound on the arithmetic-geometric series, $\sum_{t}t\beta_{1}^{t}<\frac{1}{(1-\beta_{1})^{2}}$:

\sum_{t=1}^{T}\|g_{t,i}\|_{2}^{2}\sum_{j=0}^{T}t\beta_{1}^{j}\leq\frac{1}{(1-\beta_{1})^{2}}\sum_{t=1}^{T}\|g_{t,i}\|_{2}^{2}    (47)

Applying Lemma B.5,

\sum_{t=1}^{T}\widehat{m}_{t,i}^{2}\leq\frac{4G_{\infty}^{2}}{(1-\beta_{1})^{2}}\|g_{1:T,i}\|_{2}^{2}    (48)
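A quick numerical sanity check of inequality (44) under typical hyperparameters ($\beta_{1}=0.9$, random bounded gradients; all values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
beta1, T = 0.9, 1000
g = rng.uniform(-1.0, 1.0, size=T)  # one coordinate i, so G_inf = max |g_t|
G_inf = np.abs(g).max()

m, lhs = 0.0, 0.0
for t in range(1, T + 1):
    m = beta1 * m + (1 - beta1) * g[t - 1]  # first-moment EMA
    m_hat = m / (1 - beta1**t)              # bias correction
    lhs += m_hat**2

rhs = 4 * G_inf**2 / (1 - beta1)**2 * np.sum(g**2)
print(f"sum m_hat^2 = {lhs:.2f} <= bound = {rhs:.2f}: {lhs <= rhs}")
```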
Theorem B.7.

Assume that the function $f_{t}$ has bounded gradients, $\|\nabla f_{t}(\theta)\|_{2}\leq G$, $\|\nabla f_{t}(\theta)\|_{\infty}\leq G_{\infty}$ for all $\theta\in\mathbb{R}^{d}$, that the distance between any $\theta_{t}$ generated by SGDF is bounded, $\|\theta_{n}-\theta_{m}\|_{2}\leq D$, $\|\theta_{m}-\theta_{n}\|_{\infty}\leq D_{\infty}$ for any $m,n\in\{1,\ldots,T\}$, and that $\beta_{1},\beta_{2}\in[0,1)$. Let $\alpha_{t}=\alpha/\sqrt{t}$. For all $T\geq 1$, SGDF achieves the following guarantee:

R(T)\leq\frac{D^{2}}{\alpha}\sum_{i=1}^{d}\sqrt{T}+\frac{2D_{\infty}G_{\infty}}{1-\beta_{1}}\sum_{i=1}^{d}\|g_{1:T,i}\|_{2}+\frac{2\alpha G_{\infty}^{2}\left(1+(1-\beta_{1})^{2}\right)}{\sqrt{T}(1-\beta_{1})^{2}}\sum_{i=1}^{d}\|g_{1:T,i}\|_{2}^{2}    (49)

Proof of Convex Convergence.

We aim to prove the convergence of the algorithm by showing that $R(T)$ is bounded, or equivalently, that $R(T)/T$ converges to zero as $T$ goes to infinity.

To express the cumulative regret in terms of each dimension, let $f_{t}(\theta_{t})$ and $f_{t}(\theta^{*})$ represent the loss and the best strategy's loss restricted to the $i$th dimension, respectively. Define $R_{T,i}$ as:

R_{T,i}=\sum_{t=1}^{T}f_{t}(\theta_{t})-f_{t}(\theta^{*})    (50)

Then, the overall regret $R(T)$ can be expressed as a sum over all $d$ dimensions:

R(T)=\sum_{i=1}^{d}R_{T,i}    (51)

Establishing the Connection: From the Iteration of $\theta_{t}$ to $\left\langle g_{t},\theta_{t}-\theta^{*}\right\rangle$

Using Lemma B.4, we have,

f_{t}(\theta_{t})-f_{t}(\theta^{*})\leq g_{t}^{T}(\theta_{t}-\theta^{*})=\sum_{i=1}^{d}g_{t,i}(\theta_{t,i}-\theta_{i}^{*})    (52)

From the update rules presented in Algorithm 1,

\theta_{t+1}=\theta_{t}-\alpha_{t}\widehat{g}_{t}    (53)
=\theta_{t}-\alpha_{t}\left(\widehat{m}_{t}+K_{t}\left(g_{t}-\widehat{m}_{t}\right)\right)
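For concreteness, a minimal NumPy sketch of this per-step update follows. The Wiener gain $K_{t}$ comes from the variance estimation of Algorithm 1 in the main paper; here it is simply passed in as an argument, so this is illustrative pseudocode rather than the released implementation:

```python
import numpy as np

def sgdf_step(theta, g, m, t, K, lr=0.1, beta1=0.9):
    """One SGDF update per Eq. (53). K is the (per-coordinate) Wiener gain,
    assumed to be computed elsewhere as in Algorithm 1 of the main paper."""
    m = beta1 * m + (1.0 - beta1) * g  # first-moment EMA
    m_hat = m / (1.0 - beta1**t)       # bias-corrected estimate
    g_hat = m_hat + K * (g - m_hat)    # filtered gradient
    return theta - lr * g_hat, m
```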

We focus on the $i^{\text{th}}$ dimension of the parameter vector $\theta_{t}\in\mathbb{R}^{d}$. Subtracting the scalar $\theta_{i}^{*}$ and squaring both sides of the above update rule, we have:

(\theta_{t+1,i}-\theta_{i}^{*})^{2}=(\theta_{t,i}-\theta_{i}^{*})^{2}-2\alpha_{t}\left(\widehat{m}_{t,i}+K_{t,i}(g_{t,i}-\widehat{m}_{t,i})\right)(\theta_{t,i}-\theta_{i}^{*})+\alpha_{t}^{2}\widehat{g}_{t,i}^{2}    (54)

Separating out the term $g_{t,i}(\theta_{t,i}-\theta_{i}^{*})$:

g_{t,i}(\theta_{t,i}-\theta_{i}^{*})=\underset{(1)}{\underbrace{\frac{(\theta_{t,i}-\theta_{i}^{*})^{2}-(\theta_{t+1,i}-\theta_{i}^{*})^{2}}{2\alpha_{t}K_{t,i}}}}\underset{(2)}{\underbrace{-\frac{1-K_{t,i}}{K_{t,i}}\widehat{m}_{t,i}(\theta_{t,i}-\theta_{i}^{*})}}+\underset{(3)}{\underbrace{\frac{\alpha_{t}}{2K_{t,i}}(\widehat{g}_{t,i})^{2}}}    (55)

We then deal with (1), (2) and (3) separately.

For the first term (1), we have:

\sum_{t=1}^{T}\frac{(\theta_{t,i}-\theta_{i}^{*})^{2}-(\theta_{t+1,i}-\theta_{i}^{*})^{2}}{2\alpha_{t}K_{t,i}}    (56)
=\frac{(\theta_{1,i}-\theta_{i}^{*})^{2}}{2\alpha_{1}K_{1,i}}-\frac{(\theta_{T+1,i}-\theta_{i}^{*})^{2}}{2\alpha_{T}K_{T,i}}+\sum_{t=2}^{T}(\theta_{t,i}-\theta_{i}^{*})^{2}\left[\frac{1}{2\alpha_{t}K_{t,i}}-\frac{1}{2\alpha_{t-1}K_{t-1,i}}\right]

Given that $-\frac{(\theta_{T+1,i}-\theta_{i}^{*})^{2}}{2\alpha_{T}K_{T,i}}\leq 0$ and $\frac{(\theta_{1,i}-\theta_{i}^{*})^{2}}{2\alpha_{1}K_{1,i}}\leq\frac{D_{i}^{2}}{2\alpha_{1}K_{1,i}}$, we can bound it as:

\sum_{t=1}^{T}\frac{(\theta_{t,i}-\theta_{i}^{*})^{2}-(\theta_{t+1,i}-\theta_{i}^{*})^{2}}{2\alpha_{t}K_{t,i}}\leq\sum_{t=1}^{T}\frac{(\theta_{t,i}-\theta_{i}^{*})^{2}}{2\alpha_{t}K_{t,i}}    (57)

For the second term (2), we have:

\sum_{t=1}^{T}-\frac{1-K_{t,i}}{K_{t,i}}\widehat{m}_{t,i}(\theta_{t,i}-\theta_{i}^{*})    (58)
=\sum_{t=1}^{T}-\frac{1-K_{t,i}}{K_{t,i}(1-\beta_{1}^{t})}\left(\sum_{k=1}^{t}(1-\beta_{1})\beta_{1}^{t-k}\right)g_{t,i}(\theta_{t,i}-\theta_{i}^{*})
\leq\sum_{t=1}^{T}-\frac{1-K_{t,i}}{K_{t,i}(1-\beta_{1}^{t})}\left(1-\prod_{k=1}^{t}\beta_{1}\right)g_{t,i}(\theta_{t,i}-\theta_{i}^{*})
\leq\sum_{t=1}^{T}\frac{1-K_{t,i}}{K_{t,i}(1-\beta_{1}^{t})}g_{t,i}(\theta_{t,i}-\theta_{i}^{*})

For the third term (3), we have:

\sum_{t=1}^{T}\frac{\alpha_{t}}{2K_{t,i}}(\widehat{g}_{t,i})^{2}=\sum_{t=1}^{T}\frac{\alpha_{t}}{2K_{t,i}}\left(\widehat{m}_{t,i}+K_{t,i}(g_{t,i}-\widehat{m}_{t,i})\right)^{2}    (59)
=\sum_{t=1}^{T}\frac{\alpha_{t}}{2K_{t,i}}\left((1-K_{t,i})\widehat{m}_{t,i}+K_{t,i}g_{t,i}\right)^{2}
\leq\sum_{t=1}^{T}\frac{\alpha_{t}}{2K_{t,i}}\left(2(1-K_{t,i})^{2}\widehat{m}_{t,i}^{2}+2K_{t,i}^{2}g_{t,i}^{2}\right)
=\sum_{t=1}^{T}\frac{\alpha_{t}}{K_{t,i}}\left((1-K_{t,i})^{2}\widehat{m}_{t,i}^{2}+K_{t,i}^{2}g_{t,i}^{2}\right)

Collecting all the terms, we have:

R(T)\leq\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{(\theta_{t,i}-\theta_{i}^{*})^{2}}{2\alpha_{t}K_{t,i}}+\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{1-K_{t,i}}{K_{t,i}(1-\beta_{1}^{t})}g_{t,i}(\theta_{t,i}-\theta_{i}^{*})+\sum_{i=1}^{d}\sum_{t=1}^{T}\frac{\alpha_{t}}{K_{t,i}}\left((1-K_{t,i})^{2}\widehat{m}_{t,i}^{2}+K_{t,i}^{2}g_{t,i}^{2}\right)    (60)

Using Lemma B.5 and Lemma B.6, and noting that $\sum_{t=1}^{T}\widehat{s}_{t}>\sum_{t=1}^{T}(g_{t}-\widehat{m}_{t})^{2}$ implies $\frac{1}{T}\sum_{t=1}^{T}K_{t}>\frac{1}{2}$, together with the assumptions $\|\theta_{t}-\theta^{*}\|_{2}\leq D$ and $\|\theta_{m}-\theta_{n}\|_{\infty}\leq D_{\infty}$, we obtain the following regret bound:

R(T)\leq\frac{D^{2}}{\alpha}\sum_{i=1}^{d}\sqrt{T}+\frac{2D_{\infty}G_{\infty}}{1-\beta_{1}}\sum_{i=1}^{d}\|g_{1:T,i}\|_{2}+\frac{2\alpha G_{\infty}^{2}\left(1+(1-\beta_{1})^{2}\right)}{\sqrt{T}(1-\beta_{1})^{2}}\sum_{i=1}^{d}\|g_{1:T,i}\|_{2}^{2}    (61)

Appendix C Convergence analysis for non-convex stochastic optimization (Theorem 3.3 in main paper).

We have relaxed the assumption on the objective function, allowing it to be non-convex, and adjusted the criterion for convergence from the statistic $R(T)$ to $\mathbb{E}(T)$. Let us briefly review the assumptions and the criterion for convergence after relaxing the assumption:

Assumption C.1.
  • A1 Bounded variables (same as convex). $\|\theta-\theta^{*}\|_{2}\leq D,\ \forall\theta,\theta^{*}$, or for any dimension $i$ of the variable, $\|\theta_{i}-\theta_{i}^{*}\|_{2}\leq D_{i},\ \forall\theta_{i},\theta_{i}^{*}$.

  • A2 The noisy gradient is unbiased. For all $t$, the random variable $\zeta_{t}$ is defined as $\zeta_{t}=g_{t}-\nabla f(\theta_{t})$; $\zeta_{t}$ satisfies $\mathbb{E}[\zeta_{t}]=0$, $\mathbb{E}[\|\zeta_{t}\|_{2}^{2}]\leq\sigma^{2}$, and when $t_{1}\neq t_{2}$, $\zeta_{t_{1}}$ and $\zeta_{t_{2}}$ are statistically independent, i.e., $\zeta_{t_{1}}\perp\zeta_{t_{2}}$.

  • A3 Bounded gradient and noisy gradient. At step $t$, the algorithm can access a bounded noisy gradient, and the true gradient is also bounded, i.e., $\|\nabla f(\theta_{t})\|\leq G$, $\|g_{t}\|\leq G$, $\forall t>1$.

  • A4 The property of the function. The objective function $f(\theta)$ is a global loss function, defined as $f(\theta)=\lim_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}f_{t}(\theta)$. Although $f(\theta)$ is no longer a convex function, it must still be an $L$-smooth function, i.e., it satisfies (1) $f$ is differentiable, and $\nabla f$ exists everywhere in the domain; (2) there exists $L>0$ such that for any $\theta_{1}$ and $\theta_{2}$ in the domain, (first definition)

    f(\theta_{2})\leq f(\theta_{1})+\left\langle\nabla f(\theta_{1}),\theta_{2}-\theta_{1}\right\rangle+\frac{L}{2}\|\theta_{2}-\theta_{1}\|_{2}^{2}    (62)

    or (second definition)

    \|\nabla f(\theta_{1})-\nabla f(\theta_{2})\|_{2}\leq L\|\theta_{1}-\theta_{2}\|_{2}    (63)

    This condition is also known as $L$-Lipschitz continuity of the gradient.
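To make the two definitions concrete, the following check (an illustrative example, not part of the proof) verifies both (62) and (63) for a quadratic $f(\theta)=\frac{1}{2}\theta^{T}A\theta$, whose smoothness constant is the largest eigenvalue of $A$:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(5, 5))
A = A @ A.T                      # symmetric PSD Hessian
L = np.linalg.eigvalsh(A).max()  # smoothness constant

f = lambda th: 0.5 * th @ A @ th
grad = lambda th: A @ th

for _ in range(100):
    t1, t2 = rng.normal(size=5), rng.normal(size=5)
    # first definition, Eq. (62)
    assert f(t2) <= (f(t1) + grad(t1) @ (t2 - t1)
                     + 0.5 * L * np.sum((t2 - t1) ** 2) + 1e-9)
    # second definition, Eq. (63)
    assert np.linalg.norm(grad(t1) - grad(t2)) <= L * np.linalg.norm(t1 - t2) + 1e-9
print(f"both L-smoothness definitions hold with L = {L:.3f}")
```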

Definition C.2.

The criterion for convergence is the statistic $\mathbb{E}(T)$:

\mathbb{E}(T)=\min_{t=1,2,\ldots,T}\mathbb{E}_{t-1}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]    (64)

When $T\rightarrow\infty$, if the amortized value $\mathbb{E}(T)/T\rightarrow 0$, we consider such an algorithm to be convergent; in general, the slower $\mathbb{E}(T)$ grows with $T$, the faster the algorithm converges.

Definition C.3.

Define $\xi_{t}$ as

\xi_{t}=\begin{cases}\theta_{t}&t=1\\ \theta_{t}+\frac{\beta_{1}}{1-\beta_{1}}\left(\theta_{t}-\theta_{t-1}\right)&t\geq 2\end{cases}    (65)
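A useful property of this auxiliary sequence, exploited in the $t=1$ case of the proof below, is that $\xi_{2}-\xi_{1}=-\frac{\alpha_{1}}{1-\beta_{1}}g_{1}$ regardless of the gain $K_{1}$ (because $m_{0}=0$, the bias-corrected first moment at $t=1$ equals $g_{1}$). A small numerical check under these assumptions, with arbitrary illustrative values:

```python
import numpy as np

alpha, beta1, K1 = 0.1, 0.9, 0.37     # K1 arbitrary in [0, 1]
theta1 = np.array([1.0, -2.0])
g1 = np.array([0.5, 0.3])

m1 = (1 - beta1) * g1                 # EMA with m_0 = 0
m1_hat = m1 / (1 - beta1)             # bias-corrected: equals g1
g1_hat = m1_hat + K1 * (g1 - m1_hat)  # filtered gradient (= g1 here)
theta2 = theta1 - alpha * g1_hat

xi1 = theta1                          # Eq. (65), t = 1
xi2 = theta2 + beta1 / (1 - beta1) * (theta2 - theta1)
print(np.allclose(xi2 - xi1, -alpha / (1 - beta1) * g1))  # True for any K1
```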
Theorem C.4.

Consider a non-convex optimization problem. Suppose assumptions A1-A4 are satisfied, and let $\alpha_{t}=\alpha/\sqrt{t}$. For all $T\geq 1$, SGDF achieves the following guarantee:

\mathbb{E}(T)\leq\frac{C_{7}\alpha^{2}(\log T+1)+C_{8}}{2\alpha\sqrt{T}}    (66)

where $\mathbb{E}(T)=\min_{t=1,2,\ldots,T}\mathbb{E}_{t-1}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]$ denotes the minimum expected squared norm of the gradient, $\alpha$ is the learning rate at the first step, $C_{7}$ is a constant independent of $d$ and $T$, $C_{8}$ is a constant independent of $T$, and the expectation is taken w.r.t. all randomness corresponding to $g_{t}$.

Proof of Non-convex Convergence.

Since $f$ is an $L$-smooth function,

\|\nabla f(\xi_{t})-\nabla f(\theta_{t})\|_{2}^{2}\leq L^{2}\|\xi_{t}-\theta_{t}\|_{2}^{2}    (67)

Thus,

f(\xi_{t+1})-f(\xi_{t})    (68)
\leq\left\langle\nabla f(\xi_{t}),\xi_{t+1}-\xi_{t}\right\rangle+\frac{L}{2}\|\xi_{t+1}-\xi_{t}\|_{2}^{2}
=\left\langle\frac{1}{\sqrt{L}}\left(\nabla f(\xi_{t})-\nabla f(\theta_{t})\right),\sqrt{L}\left(\xi_{t+1}-\xi_{t}\right)\right\rangle+\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle+\frac{L}{2}\|\xi_{t+1}-\xi_{t}\|_{2}^{2}
\leq\frac{1}{2}\left(\frac{1}{L}\|\nabla f(\xi_{t})-\nabla f(\theta_{t})\|_{2}^{2}+L\|\xi_{t+1}-\xi_{t}\|_{2}^{2}\right)+\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle+\frac{L}{2}\|\xi_{t+1}-\xi_{t}\|_{2}^{2}
\leq\frac{1}{2L}\|\nabla f(\xi_{t})-\nabla f(\theta_{t})\|_{2}^{2}+L\|\xi_{t+1}-\xi_{t}\|_{2}^{2}+\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle
\leq\frac{1}{2L}L^{2}\|\xi_{t}-\theta_{t}\|_{2}^{2}+L\|\xi_{t+1}-\xi_{t}\|_{2}^{2}+\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle
=\frac{L}{2}\underset{(1)}{\underbrace{\|\xi_{t}-\theta_{t}\|_{2}^{2}}}+L\underset{(2)}{\underbrace{\|\xi_{t+1}-\xi_{t}\|_{2}^{2}}}+\underset{(3)}{\underbrace{\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle}}

Next, we will deal with the three terms (1), (2), and (3) separately.

For term (1)

When $t=1$, $\|\xi_{t}-\theta_{t}\|_{2}^{2}=0$.

When $t\geq 2$,

\|\xi_{t}-\theta_{t}\|_{2}^{2}=\left\|\frac{\beta_{1}}{1-\beta_{1}}\left(\theta_{t}-\theta_{t-1}\right)\right\|_{2}^{2}    (69)
=\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\alpha_{t-1}^{2}\|\widehat{g}_{t-1}\|_{2}^{2}
=\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\alpha_{t-1}^{2}\sum_{i=1}^{d}\left((1-K_{t-1,i})\widehat{m}_{t-1,i}+K_{t-1,i}g_{t-1,i}\right)^{2}
\overset{(a)}{\leq}\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\alpha_{t-1}^{2}\sum_{i=1}^{d}G_{i}^{2}

where (a) holds because for any $t$:

  • $|\widehat{m}_{t,i}|\leq\frac{1}{1-\beta_{1}^{t}}\sum_{s=1}^{t}(1-\beta_{1})\beta_{1}^{t-s}|g_{s,i}|\leq\frac{1}{1-\beta_{1}^{t}}\sum_{s=1}^{t}(1-\beta_{1})\beta_{1}^{t-s}G_{i}=G_{i}$.

  • $\|g_{t}\|_{2}\leq G,\ \forall t$, or for any dimension $i$ of the variable: $|g_{t,i}|\leq G_{i},\ \forall t$.

For term (2)

When $t=1$,

\xi_{t+1}-\xi_{t}=\theta_{t+1}+\frac{\beta_{1}}{1-\beta_{1}}\left(\theta_{t+1}-\theta_{t}\right)-\theta_{t}    (70)
=\frac{1}{1-\beta_{1}}\left(\theta_{t+1}-\theta_{t}\right)
=-\frac{\alpha_{t}}{1-\beta_{1}}\widehat{g}_{t}
=-\frac{\alpha_{t}}{1-\beta_{1}}\left(\frac{1-K_{t}}{1-\beta_{1}^{t}}m_{t}+K_{t}g_{t}\right)
=-\frac{\alpha_{t}}{1-\beta_{1}}\frac{1-K_{t}}{1-\beta_{1}^{t}}\left(\beta_{1}\underset{=0}{\underbrace{m_{t-1}}}+(1-\beta_{1})g_{t}\right)-\frac{\alpha_{t}}{1-\beta_{1}}K_{t}g_{t}
=-\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}g_{t}-\frac{\alpha_{t}K_{t}}{1-\beta_{1}}g_{t}
=-\frac{\alpha_{t}}{1-\beta_{1}}g_{t}

Thus,

\|\xi_{t+1}-\xi_{t}\|_{2}^{2}=\left\|-\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}}g_{t}-\frac{\alpha_{t}K_{t}}{1-\beta_{1}}g_{t}\right\|_{2}^{2}    (71)
=\left(\frac{\alpha_{t}}{1-\beta_{1}}\right)^{2}\|g_{t}\|_{2}^{2}
=\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\|g_{t}\|_{2}^{2}
=\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}g_{t,i}^{2}
\leq\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}

When $t\geq 2$,

\xi_{t+1}-\xi_{t}=\theta_{t+1}+\frac{\beta_{1}}{1-\beta_{1}}\left(\theta_{t+1}-\theta_{t}\right)-\theta_{t}-\frac{\beta_{1}}{1-\beta_{1}}\left(\theta_{t}-\theta_{t-1}\right)    (72)
=\frac{1}{1-\beta_{1}}\left(\theta_{t+1}-\theta_{t}\right)-\frac{\beta_{1}}{1-\beta_{1}}\left(\theta_{t}-\theta_{t-1}\right)

Due to

\theta_{t+1}-\theta_{t}=-\alpha_{t}\widehat{g}_{t}    (73)
=-\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}m_{t}-\alpha_{t}K_{t}g_{t}
=-\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}\left(\beta_{1}m_{t-1}+(1-\beta_{1})g_{t}\right)-\alpha_{t}K_{t}g_{t}

So,

\xi_{t+1}-\xi_{t}    (74)
=\frac{1}{1-\beta_{1}}\left(-\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}\left(\beta_{1}m_{t-1}+(1-\beta_{1})g_{t}\right)-\alpha_{t}K_{t}g_{t}\right)-\frac{\beta_{1}}{1-\beta_{1}}\left(-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}m_{t-1}-\alpha_{t-1}K_{t-1}g_{t-1}\right)
=-\frac{\beta_{1}}{1-\beta_{1}}m_{t-1}\odot\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right)-\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}g_{t}-\frac{\alpha_{t}K_{t}}{1-\beta_{1}}g_{t}+\frac{\beta_{1}}{1-\beta_{1}}\alpha_{t-1}K_{t-1}g_{t-1}
=-\frac{\beta_{1}}{1-\beta_{1}}m_{t-1}\odot\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right)-\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}+\frac{\alpha_{t}K_{t}}{1-\beta_{1}}\right)g_{t}+\frac{\beta_{1}\alpha_{t-1}K_{t-1}}{1-\beta_{1}}g_{t-1}

We have:

\|\xi_{t+1}-\xi_{t}\|_{2}^{2}\leq 2\left\|-\frac{\beta_{1}}{1-\beta_{1}}m_{t-1}\odot\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right)\right\|_{2}^{2}    (75)
+2\left\|-\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}+\frac{\alpha_{t}K_{t}}{1-\beta_{1}}\right)g_{t}\right\|_{2}^{2}+2\left\|\frac{\beta_{1}\alpha_{t-1}K_{t-1}}{1-\beta_{1}}g_{t-1}\right\|_{2}^{2}
\leq 2\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\|m_{t-1}\|_{\infty}^{2}\left\|\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right\|_{\infty}\cdot\left\|\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right\|_{1}
+2\left\|-\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}+\frac{\alpha_{t}K_{t}}{1-\beta_{1}}\right)g_{t}\right\|_{2}^{2}+2\left\|\frac{\beta_{1}\alpha_{t-1}K_{t-1}}{1-\beta_{1}}g_{t-1}\right\|_{2}^{2}

Because

  • $|m_{t-1,i}|=(1-\beta_{1}^{t-1})|\widehat{m}_{t-1,i}|\leq|\widehat{m}_{t-1,i}|\leq G_{i}$, hence $\|m_{t-1}\|_{\infty}^{2}\leq\left(\max_{i}G_{i}\right)^{2}$.

  • $\|g_{t}\|_{2}^{2}=\sum_{i=1}^{d}g_{t,i}^{2}\leq\sum_{i=1}^{d}G_{i}^{2}$.

  • $K_{t}\in[0,1]^{d}$, so $\|K_{t}\|_{\infty}\leq 1$ and $\|1-K_{t}\|_{\infty}\leq 1$, while $\|\mathbf{1}\|_{1}=\sum_{i=1}^{d}\mathbf{1}_{i}=d$.

\alpha_{t}/(1-\beta_{1}^{t})\geq 0,\quad\alpha_{t-1}/(1-\beta_{1}^{t-1})\geq 0    (76)
\alpha_{t}\leq\alpha_{t-1},\quad\frac{1}{1-\beta_{1}^{t}}\leq\frac{1}{1-\beta_{1}^{t-1}}
\Longrightarrow\frac{\alpha_{t}}{1-\beta_{1}^{t}}\leq\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}
\Longrightarrow\left|\frac{\alpha_{t}}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}\right|=\alpha_{t-1}/(1-\beta_{1}^{t-1})-\alpha_{t}/(1-\beta_{1}^{t})\leq\alpha_{t-1}/(1-\beta_{1}^{t-1})\leq\alpha_{1}/(1-\beta_{1})
\Longrightarrow\left\|\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right\|_{\infty}\leq\frac{\alpha_{1}}{1-\beta_{1}}

\left\|\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right\|_{1}\leq\sum_{i=1}^{d}\left(\alpha_{t-1}/(1-\beta_{1}^{t-1})-\alpha_{t}/(1-\beta_{1}^{t})\right)\mathbf{1}_{i}\leq d\left(\alpha_{t-1}/(1-\beta_{1}^{t-1})-\alpha_{t}/(1-\beta_{1}^{t})\right)    (77)

Therefore

\|\xi_{t+1}-\xi_{t}\|_{2}^{2}\leq 2\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\left(\max_{i}G_{i}\right)^{2}\frac{d\alpha_{1}}{1-\beta_{1}}\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)+4\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}    (78)

For term (3)

When $t=1$, referring to the case of $t=1$ in the previous subsection,

\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle=\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}}g_{t}\right\rangle    (79)
=\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}}\nabla f(\theta_{t})\right\rangle+\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}}\zeta_{t}\right\rangle

The last equality follows from the definition of $g_{t}$: $g_{t}=\nabla f(\theta_{t})+\zeta_{t}$. Let us consider the two terms separately:

\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}}\nabla f(\theta_{t})\right\rangle=-\frac{\alpha_{t}}{1-\beta_{1}}\|\nabla f(\theta_{t})\|_{2}^{2}    (80)

\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}}\zeta_{t}\right\rangle\leq\frac{\alpha_{t}}{1-\beta_{1}}\|\nabla f(\theta_{t})\|_{2}\|\zeta_{t}\|_{2}    (81)
=\frac{\alpha_{t}}{1-\beta_{1}}\|\nabla f(\theta_{t})\|_{2}\|g_{t}-\nabla f(\theta_{t})\|_{2}
\leq\frac{\alpha_{t}}{1-\beta_{1}}\cdot 2\sum_{i=1}^{d}G_{i}^{2}

Thus

\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle\leq-\frac{\alpha_{t}}{1-\beta_{1}}\|\nabla f(\theta_{t})\|_{2}^{2}+\frac{2\alpha_{t}}{1-\beta_{1}}\sum_{i=1}^{d}G_{i}^{2}    (82)

When $t\geq 2$,

\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle=\left\langle\nabla f(\theta_{t}),-\frac{\beta_{1}}{1-\beta_{1}}m_{t-1}\odot\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right)\right\rangle    (83)
+\left\langle\nabla f(\theta_{t}),-\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}+\frac{\alpha_{t}K_{t}}{1-\beta_{1}}\right)\nabla f(\theta_{t})\right\rangle+\left\langle\nabla f(\theta_{t}),-\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}+\frac{\alpha_{t}K_{t}}{1-\beta_{1}}\right)\zeta_{t}\right\rangle
+\left\langle\nabla f(\theta_{t-1}),\frac{\beta_{1}\alpha_{t-1}K_{t-1}}{1-\beta_{1}}\nabla f(\theta_{t-1})\right\rangle+\left\langle\nabla f(\theta_{t-1}),\frac{\beta_{1}\alpha_{t-1}K_{t-1}}{1-\beta_{1}}\zeta_{t-1}\right\rangle

We start with the first term after the equality:

\left\langle\nabla f(\theta_{t}),-\frac{\beta_{1}}{1-\beta_{1}}m_{t-1}\odot\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right)\right\rangle    (84)
\leq\frac{\beta_{1}}{1-\beta_{1}}\|\nabla f(\theta_{t})\|_{\infty}\|m_{t-1}\|_{\infty}\cdot\left\|\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}-\frac{\alpha_{t-1}(1-K_{t-1})}{1-\beta_{1}^{t-1}}\right\|_{1}
\leq\frac{\beta_{1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}\sum_{i=1}^{d}\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)\mathbf{1}_{i}
\leq\frac{\beta_{1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)

The second and third terms after the equal sign:

\left\langle\nabla f(\theta_{t}),-\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}+\frac{\alpha_{t}K_{t}}{1-\beta_{1}}\right)\nabla f(\theta_{t})\right\rangle+\left\langle\nabla f(\theta_{t}),-\left(\frac{\alpha_{t}(1-K_{t})}{1-\beta_{1}^{t}}+\frac{\alpha_{t}K_{t}}{1-\beta_{1}}\right)\zeta_{t}\right\rangle    (85)
\leq-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\|\nabla f(\theta_{t})\|_{2}^{2}+\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\zeta_{t}\right\rangle

The fourth and fifth terms after the equal sign:

\left\langle\nabla f(\theta_{t-1}),\frac{\beta_{1}\alpha_{t-1}K_{t-1}}{1-\beta_{1}}\nabla f(\theta_{t-1})\right\rangle+\left\langle\nabla f(\theta_{t-1}),\frac{\beta_{1}\alpha_{t-1}K_{t-1}}{1-\beta_{1}}\zeta_{t-1}\right\rangle    (86)
\leq\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\|\nabla f(\theta_{t-1})\|_{\infty}^{2}\|\mathbf{1}\|_{1}+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\|\nabla f(\theta_{t-1})\|_{\infty}\|\zeta_{t-1}\|_{\infty}\|\mathbf{1}\|_{1}
\leq\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)\left(2\max_{i}G_{i}\right)d

Finally:

\left\langle\nabla f(\theta_{t}),\xi_{t+1}-\xi_{t}\right\rangle    (87)
\leq\frac{\beta_{1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\|\nabla f(\theta_{t})\|_{2}^{2}
+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)\left(2\max_{i}G_{i}\right)d+\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\zeta_{t}\right\rangle

Summarizing the results

When $t=1$,

f(\xi_{t+1})-f(\xi_{t})\leq\frac{L}{2}\cdot 0+L\cdot\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}-\frac{\alpha_{t}}{1-\beta_{1}}\|\nabla f(\theta_{t})\|_{2}^{2}+\frac{2\alpha_{t}}{1-\beta_{1}}\sum_{i=1}^{d}G_{i}^{2}    (88)

Taking the expectation over the random distribution of $\zeta_{1},\zeta_{2},\ldots,\zeta_{t}$ on both sides of the inequality:

\mathbb{E}_{t}\left[f(\xi_{t+1})-f(\xi_{t})\right]\leq L\cdot\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}-\frac{\alpha_{t}}{1-\beta_{1}}\mathbb{E}_{t}\|\nabla f(\theta_{t})\|_{2}^{2}+\frac{2\alpha_{t}}{1-\beta_{1}}\sum_{i=1}^{d}G_{i}^{2}    (89)

When $t\geq 2$,

f(\xi_{t+1})-f(\xi_{t})    (90)
\leq\frac{L}{2}\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\alpha_{t-1}^{2}\sum_{i=1}^{d}G_{i}^{2}+L\cdot 2\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\left(\max_{i}G_{i}\right)^{2}\frac{d\alpha_{1}}{1-\beta_{1}}\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)
+L\cdot 4\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}+\frac{\beta_{1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)
-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\|\nabla f(\theta_{t})\|_{2}^{2}+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)\left(2\max_{i}G_{i}\right)d
+\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\zeta_{t}\right\rangle

Taking the expectation over the random distribution of $\zeta_{1},\zeta_{2},\ldots,\zeta_{t}$ on both sides of the inequality:

\mathbb{E}_{t}\left[f(\xi_{t+1})-f(\xi_{t})\right]    (91)
\leq\frac{L}{2}\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\alpha_{t-1}^{2}\sum_{i=1}^{d}G_{i}^{2}+L\cdot 2\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\left(\max_{i}G_{i}\right)^{2}\frac{d\alpha_{1}}{1-\beta_{1}}\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)
+L\cdot 4\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}+\frac{\beta_{1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)
-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\mathbb{E}_{t}\|\nabla f(\theta_{t})\|_{2}^{2}+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)\left(2\max_{i}G_{i}\right)d
+\mathbb{E}_{t}\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\zeta_{t}\right\rangle

Since $\theta_{t}$ is determined by $g_{1},\ldots,g_{t-1}$, it is statistically independent of $\zeta_{t}$:

\mathbb{E}_{t}\left[\left\langle\nabla f(\theta_{t}),-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\zeta_{t}\right\rangle\right]=\mathbb{E}_{t}\left[\left\langle-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\nabla f(\theta_{t}),\zeta_{t}\right\rangle\right]    (92)
=\left\langle-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\mathbb{E}_{t}\left[\nabla f(\theta_{t})\right],\underset{=0}{\underbrace{\mathbb{E}_{t}\left[\zeta_{t}\right]}}\right\rangle=0

Summing up both sides of the inequality for $t=1,2,\ldots,T$:

• Left side of the inequality (can be reduced to maintain the inequality)

\sum_{t=1}^{T}\textrm{LHS of the inequality}=\sum_{t=1}^{T}\mathbb{E}_{t}\left[f(\xi_{t+1})-f(\xi_{t})\right]    (93)
=\sum_{t=1}^{T}\mathbb{E}_{t}\left[f(\xi_{t+1})\right]-\mathbb{E}_{t}\left[f(\xi_{t})\right]
=\sum_{t=1}^{T}\mathbb{E}_{t}\left[f(\xi_{t+1})\right]-\mathbb{E}_{t-1}\left[f(\xi_{t})\right]
=\mathbb{E}_{T}\left[f(\xi_{T+1})\right]-\mathbb{E}_{0}\left[f(\xi_{1})\right]

Since $f(\xi_{T+1})\geq\min_{\theta}f(\theta)=f(\theta^{*})$, $\xi_{1}=\theta_{1}$, and both are deterministic:

\sum_{t=1}^{T}\mathbb{E}_{t}\left[f(\xi_{t+1})-f(\xi_{t})\right]\geq\mathbb{E}_{T}\left[f(\theta^{*})\right]-\mathbb{E}_{0}\left[f(\theta_{1})\right]    (94)
=f(\theta^{*})-f(\theta_{1})

• The right side of the inequality (can be enlarged to keep the inequality valid)

We perform a series of substitutions to simplify the symbols:

When $t\geq 2$,

  1. $\frac{L}{2}\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\alpha_{t-1}^{2}\sum_{i=1}^{d}G_{i}^{2}\triangleq C_{1}\alpha_{t-1}^{2}$

  2. $L\cdot 2\frac{\beta_{1}^{2}}{(1-\beta_{1})^{2}}\left(\max_{i}G_{i}\right)^{2}\frac{d\alpha_{1}}{1-\beta_{1}}\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)\triangleq C_{2}\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)$

  3. $L\cdot 4\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}\triangleq C_{3}\alpha_{t}^{2}$

  4. $\frac{\beta_{1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)\triangleq C_{4}\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)$

  5. $-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]\leq-\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]$

  6. $\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)^{2}d+\frac{\beta_{1}\alpha_{t-1}}{1-\beta_{1}}\left(\max_{i}G_{i}\right)\left(2\max_{i}G_{i}\right)d\triangleq C_{5}\alpha_{t-1}$

When $t=1$,

  1. $L\cdot\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}\leq L\cdot 4\frac{\alpha_{t}^{2}}{(1-\beta_{1})^{2}}\sum_{i=1}^{d}G_{i}^{2}=C_{3}\alpha_{t}^{2}$

  2. $-\frac{\alpha_{t}}{1-\beta_{1}}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]\leq-\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]$

  3. $\frac{2\alpha_{t}}{1-\beta_{1}}\sum_{i=1}^{d}G_{i}^{2}\triangleq C_{6}\alpha_{t}$

After substitution,

\sum_{t=1}^{T}\textrm{RHS of the inequality}\leq\sum_{t=2}^{T}C_{1}\alpha_{t-1}^{2}+\sum_{t=1}^{T}C_{3}\alpha_{t}^{2}-\sum_{t=1}^{T}\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]    (95)
+\sum_{t=2}^{T}\left(C_{2}+C_{4}\right)\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)+\sum_{t=1}^{T}C_{5}\alpha_{t-1}+\sum_{t=1}^{T}C_{6}\alpha_{t}
=\sum_{t=2}^{T}C_{1}\alpha_{t-1}^{2}+\sum_{t=1}^{T}C_{3}\alpha_{t}^{2}-\sum_{t=1}^{T}\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]+\sum_{t=1}^{T}C_{5}\alpha_{t-1}+\sum_{t=1}^{T}C_{6}\alpha_{t}
+\left(C_{2}+C_{4}\right)\sum_{t=2}^{T}\left(\frac{\alpha_{t-1}}{1-\beta_{1}^{t-1}}-\frac{\alpha_{t}}{1-\beta_{1}^{t}}\right)
=\sum_{t=2}^{T}C_{1}\alpha_{t-1}^{2}+\sum_{t=1}^{T}C_{3}\alpha_{t}^{2}-\sum_{t=1}^{T}\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]+\sum_{t=1}^{T}C_{5}\alpha_{t-1}+\sum_{t=1}^{T}C_{6}\alpha_{t}
+\left(C_{2}+C_{4}\right)\left(\frac{\alpha_{1}}{1-\beta_{1}}-\frac{\alpha_{T}}{1-\beta_{1}^{T}}\right)
\leq\left(C_{1}+C_{3}+C_{5}+C_{6}\right)\sum_{t=1}^{T}\alpha_{t}^{2}-\sum_{t=1}^{T}\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]+\left(C_{2}+C_{4}\right)\frac{\alpha_{1}}{1-\beta_{1}}

Combining the results of scaling on both sides of the inequality:

f(\theta^{*})-f(\theta_{1})\leq\left(C_{1}+C_{3}+C_{5}+C_{6}\right)\sum_{t=1}^{T}\alpha_{t}^{2}-\sum_{t=1}^{T}\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]+\left(C_{2}+C_{4}\right)\frac{\alpha_{1}}{1-\beta_{1}}    (96)
\Longrightarrow\sum_{t=1}^{T}\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]\leq\left(C_{1}+C_{3}+C_{5}+C_{6}\right)\sum_{t=1}^{T}\alpha_{t}^{2}+f(\theta_{1})-f(\theta^{*})+\left(C_{2}+C_{4}\right)\frac{\alpha_{1}}{1-\beta_{1}}

Since $\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]=\mathbb{E}_{t-1}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]$,

\sum_{t=1}^{T}\alpha_{t}\mathbb{E}_{t}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]=\sum_{t=1}^{T}\alpha_{t}\mathbb{E}_{t-1}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]    (97)
\geq\sum_{t=1}^{T}\alpha_{t}\min_{t=1,2,\ldots,T}\mathbb{E}_{t-1}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]
=\min_{t=1,2,\ldots,T}\mathbb{E}_{t-1}\left[\|\nabla f(\theta_{t})\|_{2}^{2}\right]\sum_{t=1}^{T}\alpha_{t}
=\mathbb{E}(T)\cdot\sum_{t=1}^{T}\alpha_{t}

Then let $C_{1}+C_{3}+C_{5}+C_{6}\triangleq C_{7}$ and $\underset{\geq 0}{\underbrace{f(\theta_{1})-f(\theta^{*})}}+\left(C_{2}+C_{4}\right)\frac{\alpha_{1}}{1-\beta_{1}}\triangleq C_{8}$; therefore

\mathbb{E}(T)\cdot\sum_{t=1}^{T}\alpha_{t}\leq C_{7}\sum_{t=1}^{T}\alpha_{t}^{2}+C_{8}    (98)
\Longrightarrow\mathbb{E}(T)\leq\frac{C_{7}\sum_{t=1}^{T}\alpha_{t}^{2}+C_{8}}{\sum_{t=1}^{T}\alpha_{t}}

Since $\alpha_{t}=\alpha/\sqrt{t}$ and $\sum_{t=1}^{T}\frac{1}{t}\leq 1+\log T$, we have:

\mathbb{E}(T)\leq\frac{C_{7}\alpha^{2}(\log T+1)+C_{8}}{2\alpha\sqrt{T}}    (99)

Appendix D Detailed Experimental Supplement

We performed extensive comparisons with other optimizers, including SGD [41], Adam [32], RAdam [37] and AdamW [38]. The experiments include image classification on the CIFAR dataset [33] with VGG [49], ResNet [23] and DenseNet [26], and image recognition with ResNet on ImageNet [14].

D.1 Image classification with CNNs on CIFAR

For all experiments, the model is trained for 200 epochs with a batch size of 128, and the learning rate is multiplied by 0.1 at epoch 150. We performed an extensive hyperparameter search as described in the main paper; detailed experimental parameters are listed in Tab. 5. We report both training and test accuracy in Fig. 6 and Fig. 7. SGDF not only achieves the highest test accuracy, but also a smaller gap between training and test accuracy compared with other optimizers.

Table 5: Hyperparameters used for CIFAR-10 and CIFAR-100 datasets.
Optimizer Learning Rate $\beta_{1}$ $\beta_{2}$ Epochs Schedule Weight Decay Batch Size $\varepsilon$
SGDF 0.3 0.9 0.999 200 StepLR 0.0005 128 1e-8
SGD 0.1 0.9 - 200 StepLR 0.0005 128 -
Adam 0.001 0.9 0.999 200 StepLR 0.0005 128 1e-8
RAdam 0.001 0.9 0.999 200 StepLR 0.0005 128 1e-8
AdamW 0.001 0.9 0.999 200 StepLR 0.01 128 1e-8
MSVAG 0.1 0.9 0.999 200 StepLR 0.0005 128 1e-8
AdaBound 0.001 0.9 0.999 200 StepLR 0.0005 128 -
Sophia 0.0001 0.965 0.99 200 StepLR 0.1 128 -
Lion 0.00002 0.9 0.99 200 StepLR 0.1 128 -

Note: StepLR indicates a learning rate decay by a factor of 0.1 at the 150th epoch.
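For reproducibility, the Epochs/Schedule columns of Tab. 5 correspond to the following PyTorch-style sketch (shown with the SGD row; the SGDF class from the released repository is swapped in the same way with lr=0.3, and the linear model is only a stand-in for the CNN backbones):

```python
import torch

model = torch.nn.Linear(3 * 32 * 32, 10)  # stand-in for VGG/ResNet/DenseNet

# SGD row of Tab. 5.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# StepLR: multiply the learning rate by 0.1 at epoch 150.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer,
                                                 milestones=[150], gamma=0.1)

for epoch in range(200):
    # ... one training epoch over CIFAR with batch size 128 ...
    scheduler.step()
```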

D.2 Image Classification on ImageNet

We experimented with a ResNet18 on the ImageNet classification task. For SGD, we set an initial learning rate of 0.1, multiplied by 0.1 every 30 epochs; for SGDF, we use an initial learning rate of 0.5 and set $\beta_{1}=0.5$. Weight decay is set to $10^{-4}$ for both cases to match the settings in [37]. Detailed experimental parameters are listed in Tab. 6. As shown in Fig. 8, SGDF achieves an accuracy very close to SGD.

Table 6: Hyperparameters used for ImageNet.
Optimizer Learning Rate $\beta_{1}$ $\beta_{2}$ Epochs Schedule Weight Decay Batch Size $\varepsilon$
SGDF 0.5 0.5 0.999 100 StepLR 0.0005 256 1e-8
SGD 0.1 - - 100 StepLR 0.0005 256 -
SGDF 0.5 0.5 0.999 90 Cosine 0.0005 256 1e-8
SGD 0.1 - - 90 Cosine 0.0005 256 -

Note: StepLR indicates a learning rate decay by a factor of 0.1 every 30 epochs.

D.3 Object Detection on PASCAL VOC

We show results for object detection on PASCAL VOC [18] with a Faster-RCNN model [45]. Detailed experimental parameters are listed in Tab. 7. The results are reported in Tab. 2, and detection examples are shown in Fig. 9. These results illustrate that our method remains efficient in object detection tasks.

Table 7: Hyperparameters for object detection on PASCAL VOC using Faster-RCNN+FPN with different optimizers.
Optimizer Learning Rate $\beta_{1}$ $\beta_{2}$ Epochs Schedule Weight Decay Batch Size $\varepsilon$
SGDF 0.01 0.9 0.999 4 StepLR 0.0001 2 1e-8
SGD 0.01 0.9 - 4 StepLR 0.0001 2 -
Adam 0.0001 0.9 0.999 4 StepLR 0.0001 2 1e-8
AdamW 0.0001 0.9 0.999 4 StepLR 0.0001 2 1e-8
RAdam 0.0001 0.9 0.999 4 StepLR 0.0001 2 1e-8

Note: StepLR schedule indicates a learning rate decay by a factor of 0.1 at the last epoch.

D.4 Image Generation.

We experiment with one of the most widely used models, the Wasserstein-GAN with gradient penalty (WGAN-GP) [47], using a small model with a vanilla CNN generator. Using popular optimizers [40, 61, 2, 4], we train the model for 100 epochs, generate 64,000 fake images from noise, and compute the Fréchet Inception Distance (FID) [24] between the fake images and the real dataset (60,000 real images). The FID score captures both the quality and diversity of generated images and is widely used to assess generative models (lower FID is better). For SGD and MSVAG, we report results from [66]. We perform 5 runs of each experiment and report the results in Fig. 5. Detailed experimental parameters are listed in Tab. 8.

Table 8: Hyperparameters for Image Generation Tasks.
Optimizer Learning Rate $\beta_{1}$ $\beta_{2}$ Epochs Batch Size $\varepsilon$
SGDF 0.01 0.5 0.999 100 64 1e-8
Adam 0.0002 0.5 0.999 100 64 1e-8
AdamW 0.0002 0.5 0.999 100 64 1e-8
Fromage 0.01 0.5 0.999 100 64 1e-8
RMSProp 0.0002 0.5 0.999 100 64 1e-8
AdaBound 0.0002 0.5 0.999 100 64 1e-8
Yogi 0.01 0.9 0.999 100 64 1e-8
RAdam 0.0002 0.5 0.999 100 64 1e-8
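For reference, FID fits a Gaussian to the Inception features of each image set and evaluates $\|\mu_{r}-\mu_{g}\|_{2}^{2}+\mathrm{Tr}\left(\Sigma_{r}+\Sigma_{g}-2(\Sigma_{r}\Sigma_{g})^{1/2}\right)$. A minimal sketch of that final computation from precomputed feature matrices (the Inception feature extraction itself is assumed to happen upstream):

```python
import numpy as np
from scipy import linalg

def fid(feats_real, feats_fake):
    """FID between two feature matrices of shape (n_samples, dim), e.g.
    Inception pool3 activations; feature extraction happens upstream."""
    mu_r, mu_f = feats_real.mean(axis=0), feats_fake.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2)
                 + np.trace(cov_r + cov_f - 2.0 * covmean))
```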

D.5 Extended Experiment.

The study evaluates the vanilla Adam optimization algorithm and its enhancement with a Wiener filter on the CIFAR-100 dataset. Fig. 9 contains detailed test accuracy curves for both methods across different models. The results indicate that adaptive learning rate algorithms perform better when supplemented with the proposed first-moment filter estimation, suggesting that integrating a Wiener filter into Adam-style optimizers can improve performance.

D.6 Optimizer Test.

We derived a correction factor $(1-\beta_{1})(1-\beta_{1}^{2t})/(1+\beta_{1})$ from the geometric progression to correct the variance of the first-moment estimate. We therefore test SGDF with and without this correction using VGG, ResNet, and DenseNet on CIFAR, and report test accuracy in Fig. 11. SGDF with the correction exceeds the uncorrected variant.
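The factor is cheap to tabulate; with $\beta_{1}=0.9$ (an illustrative value) it grows from $0.01$ at $t=1$ toward its limit $(1-\beta_{1})/(1+\beta_{1})\approx 0.0526$:

```python
import numpy as np

beta1 = 0.9  # illustrative value
t = np.arange(1, 51)
corr = (1 - beta1) * (1 - beta1 ** (2 * t)) / (1 + beta1)
print(corr[0], corr[-1])  # 0.0100 at t=1, approaching (1-beta1)/(1+beta1)
```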

We built a simple neural network to test the convergence speed of SGDF compared to SGDM and vanilla SGD. We trained for 5 epochs and recorded the loss every 30 iterations. As shown in Fig. 12, the convergence rate of the filter method surpasses that of the momentum method, which in turn exceeds that of vanilla SGD.

Figure 18: Training and test accuracy (top-1) of ResNet18 on ImageNet.
Figure 23: SGDF with or without the correction factor. The curve shows the accuracy of the test.
Figure 24: Comparison of convergence rates.