
Probabilistic Guarantees of Stochastic Recursive Gradient in Non-Convex Finite Sum Problems

Yanjie Zhong, Department of Statistics and Data Science, Washington University in St. Louis; Jiaqi Li (corresponding author: [email protected]), Department of Statistics and Data Science, Washington University in St. Louis; Soumendra Lahiri, Department of Statistics and Data Science, Washington University in St. Louis
Abstract

This paper develops a new dimension-free Azuma-Hoeffding type bound on the summation norm of a martingale difference sequence with random individual bounds. With this novel result, we provide high-probability bounds for the gradient norm estimator in the proposed algorithm Prob-SARAH, which is a modified version of the StochAstic Recursive grAdient algoritHm (SARAH), a state-of-the-art variance-reduced algorithm that achieves optimal computational complexity in expectation for the finite sum problem. The in-probability complexity of Prob-SARAH matches the best in-expectation result up to logarithmic factors. Empirical experiments demonstrate the superior probabilistic performance of Prob-SARAH on real datasets compared to other popular algorithms.

Keywords: machine learning, variance-reduced method, stochastic gradient descent, non-convex optimization

1 Introduction

We consider the popular non-convex finite sum optimization problem in this work, that is, estimating 𝐱𝒟d\mathbf{x}^{*}\in\mathcal{D}\subseteq\mathbb{R}^{d} minimizing the following loss function

f(𝐱)=1ni=1nfi(𝐱),𝐱𝒟,f(\mathbf{x})=\frac{1}{n}\sum\limits_{i=1}\limits^{n}f_{i}(\mathbf{x}),\ \mathbf{x}\in\mathcal{D}, (1)

where fi:df_{i}:\mathbb{R}^{d}\mapsto\mathbb{R} is a potentially non-convex function on some compact set 𝒟\mathcal{D}. Such non-convex problems lie at the heart of many applications of statistical learning [13] and machine learning [7].

Unlike convex optimization problems, non-convex problems are in general intractable and the best we can expect is to find a stationary point. Given a target error ε\varepsilon, since f(𝐱)=0\nabla f(\mathbf{x}^{*})=0, we aim to find an estimator 𝐱^\hat{\mathbf{x}} such that, roughly, f(𝐱^)ε\|\nabla f(\hat{\mathbf{x}})\|\leq\varepsilon, where f()\nabla f(\cdot) denotes the gradient vector of the loss function ff and \|\cdot\| is the Euclidean norm. With a non-deterministic algorithm, the output 𝐱^\hat{\mathbf{x}} is stochastic, and the most frequently considered error bound is in expectation, i.e.,

𝔼f(𝐱^)2ε2.\mathbb{E}\|\nabla f(\hat{\mathbf{x}})\|^{2}\leq\varepsilon^{2}. (2)

There has been a substantial amount of work providing upper bounds on the computational complexity needed to achieve the in-expectation bound. However, in practice we typically run a stochastic algorithm only once, and an in-expectation bound cannot provide a convincing guarantee in this situation. Instead, a high-probability bound is more appropriate by nature. Given a pair of target errors (ε,δ)(\varepsilon,\delta), we want to obtain an estimator 𝐱^\hat{\mathbf{x}} such that with probability at least 1δ1-\delta, f(𝐱^)ε\|\nabla f(\hat{\mathbf{x}})\|\leq\varepsilon, that is

(f(𝐱^)ε)1δ.\mathbb{P}\big{(}\|\nabla f(\hat{\mathbf{x}})\|\leq\varepsilon\big{)}\geq 1-\delta. (3)

Though the Markov inequality might help, in general an in-expectation bound cannot simply be converted to an in-probability bound with a desirable dependency on δ\delta. It is therefore important to prove upper bounds on the high-probability complexity, which ideally should be polylogarithmic in 1/δ1/\delta and have polynomial terms comparable to the in-expectation complexity bound.
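To make this gap concrete, here is a back-of-the-envelope illustration (ours, not an argument taken from the paper). Applying Markov's inequality to the in-expectation guarantee (2) gives

\mathbb{P}\big{(}\|\nabla f(\hat{\mathbf{x}})\|>\varepsilon\big{)}=\mathbb{P}\big{(}\|\nabla f(\hat{\mathbf{x}})\|^{2}>\varepsilon^{2}\big{)}\leq\frac{\mathbb{E}\|\nabla f(\hat{\mathbf{x}})\|^{2}}{\varepsilon^{2}},

so certifying failure probability δ\delta through (2) alone requires driving \mathbb{E}\|\nabla f(\hat{\mathbf{x}})\|^{2} down to \delta\varepsilon^{2}, i.e., running at the stricter target \varepsilon^{\prime}=\sqrt{\delta}\varepsilon; a method with in-expectation complexity \mathcal{O}(\varepsilon^{-3}) would then cost \mathcal{O}(\delta^{-3/2}\varepsilon^{-3}), which is polynomial rather than polylogarithmic in 1/\delta.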

Gradient-based methods are favored by practitioners due to their simplicity and efficiency and have been widely studied in the non-convex setting ([26, 6, 1, 31, 5, 33]). Among numerous gradient-based methods, the StochAstic Recursive grAdient algoritHm (SARAH) ([27, 28, 33]) achieves the best known first-order guarantee for a given in-expectation error target, in both convex and non-convex finite sum problems. It is worth noticing that [23] attempted to show that a modified version of SARAH is able to approximate a second-order stationary point with high probability. However, we believe that their application of the martingale Azuma-Hoeffding inequality is unjustifiable because the bounds involved are potentially random and uncontrollable. In this paper, we provide a correct dimension-free martingale Azuma-Hoeffding inequality with rigorous proofs and leverage it to establish in-probability properties of SARAH-based algorithms in the non-convex setting.

1.1 Related Works

  • High-Probability Bounds: While most works in the optimization literature provide in-expectation bounds, only a small fraction discusses bounds in the high-probability sense. [17] provide a high-probability bound on the excess risk given a bound on the regret. [12], [8, 9] derive some high-probability bounds for SGD in convex online optimization problems. [34, 22] prove high-probability bounds for several adaptive methods, including AMSGrad, RMSProp and Delayed AdaGrad with momentum. All these works rely on (generalized) Freedman’s inequality or the concentration inequality given in Lemma 6 in [15]. Unlike these works, our high-probability results are built on a novel Azuma-Hoeffding type inequality proved in this work and Corollary 8 from [15]. In addition, we notice that [23] provide some probabilistic bounds on a SARAH-based algorithm. However, we believe their use of the plain martingale Azuma-Hoeffding inequality is not justifiable. [5] show an in-probability upper bound for SPIDER. Nevertheless, SPIDER’s practical performance is inferior due to its accuracy-dependent small step size [32, 33].

  • Variance-Reduced Methods in Non-Convex Finite Sum Problems: Since the invention of the variance-reduction technique in [18, 16, 4], there has been a large amount of work incorporating this efficient technique to methods targeting the non-convex finite-sum problem. Subsequent methods, including SVRG ([1, 31, 25]), SARAH ([27, 28]), SCSG ([19, 21, 11]), SNVRG ([34]), SPIDER ([5]), SpiderBoost ([33]) and PAGE ([24]), have greatly reduced computational complexity in non-convex problems.

1.2 Our Contributions

  • Dimension-Free Martingale Azuma-Hoeffding inequality: To facilitate our probabilistic analysis, we provide a novel Azuma-Hoeffding type bound on the summation norm of a martingale difference sequence. The novelty is two-fold. First, like the plain martingale Azuma-Hoeffding inequality, it provides a dimension-free bound. In a recent paper, a sub-Gaussian type bound has been developed by [15]. However, their results are not dimension-free. Our proof technique is built on a classic paper by [29] and is completely different from the random matrix technique used in [15]. Second, our concentration inequality allows random bounds on each element of the martingale difference sequence, which can be much tighter than a single large deterministic bound. It should be highlighted that our novel concentration result perfectly suits SARAH-style methods, where the increments form a martingale difference sequence, and it can further be used to analyze other algorithms beyond the current paper.

  • In-probability error bounds of stochastic recursive gradient: We design a SARAH-based algorithm, named Prob-SARAH, adapted to the high-probability target and provably show its good in-probability properties. Under appropriate parameter setting, the first order complexity needed to achieve the in-probability target is 𝒪~(1ε3nε2)\tilde{\mathcal{O}}\left(\frac{1}{\varepsilon^{3}}\wedge\frac{\sqrt{n}}{\varepsilon^{2}}\right), which matches the best known in-expectation upper bound up to some logarithmic factors ([34, 33, 11]). We would like to point out that the parameter setting used to achieve such complexity is semi-adaptive to ε\varepsilon. That is, only the final stopping rule relies on ε\varepsilon while other key parameters are independent of ε\varepsilon, including step size, mini-batch sizes, and lengths of loops.

  • Probabilistic analysis of SARAH for non-convex finite sum: Existing literature on bounds for SARAH mostly focuses on the strongly convex or general convex settings. We extend the analysis to the non-convex scenario, which can be considered a complementary study of the stochastic recursive gradient in probability.

1.3 Notation

For a sequence of sets 𝒜1,𝒜2,\mathcal{A}_{1},\mathcal{A}_{2},\ldots, we denote the smallest sigma algebra containing 𝒜i\mathcal{A}_{i}, i1i\geq 1, by σ(i=1𝒜i)\sigma\big{(}\bigcup_{i=1}^{\infty}\mathcal{A}_{i}\big{)}. By abuse of notation, for a random variable 𝐗\mathbf{X}, we denote the sigma algebra generated by 𝐗\mathbf{X} by σ(𝐗)\sigma(\mathbf{X}). We define the constant Ce=i=1i2C_{e}=\sum_{i=1}^{\infty}i^{-2}. For two scalars a,ba,b\in\mathbb{R}, we denote ab=min{a,b}a\wedge b=\min\{a,b\} and ab=max{a,b}a\vee b=\max\{a,b\}. When we say a quantity TT is 𝒪θ1,θ2(θ3)\mathcal{O}_{\theta_{1},\theta_{2}}(\theta_{3}) for some θ1,θ2,θ3\theta_{1},\theta_{2},\theta_{3}\in\mathbb{R}, we mean that there exists a gg\in\mathbb{R} polylogarithmic in θ1\theta_{1} and θ2\theta_{2} such that Tgθ3T\leq g\cdot\theta_{3}; 𝒪~()()\tilde{\mathcal{O}}_{(\cdot)}(\cdot) is defined analogously, up to an additional logarithmic factor.

2 Prob-SARAH Algorithm

The algorithm Prob-SARAH proposed in our work is a modified version of SpiderBoost ([33]) and SARAH ([27, 28]). Since the key update structure originates from [27], we call our modified algorithm Prob-SARAH. In fact, it can also be viewed as a generalization of the SPIDER algorithm introduced in [5].

We present the Prob-SARAH in Algorithm 1, and here, we provide some explanation of the key steps. Following other SARAH-based algorithms, we adopt a similar gradient approximation design with nested loops, specifically with a checkpoint gradient estimator 𝝂0(j)\boldsymbol{\nu}^{(j)}_{0} using a large mini-batch size BjB_{j} in Line 4 and a recursive gradient estimator 𝝂k(j)\boldsymbol{\nu}_{k}^{(j)} updated in Line 9. When the mini-batch size BjB_{j} is large, we can regard the checkpoint gradient estimator 𝝂0(j)\boldsymbol{\nu}^{(j)}_{0} as a solid approximation to the true gradient at 𝐱~j1\tilde{\mathbf{x}}_{j-1}. With this checkpoint, we can update the gradient estimator 𝝂k(j)\boldsymbol{\nu}_{k}^{(j)} with a small mini-batch size bjb_{j} while maintaining a desirable estimation accuracy.

To emphasize, our stopping rules in Line 11 of Algorithm 1 are newly proposed and ensure a critical performance enhancement compared to the previous literature. In particular, with this new design, we can control the gradient norm of the output with high probability. For a more intuitive understanding of these stopping rules, we will see in our proof sketch section that the gradient norm of iterates in the jj-th outer iteration, f\|\nabla f\|, can be bounded by a linear combination of {𝝂k(j)}k=1Kj\big{\{}\boldsymbol{\nu}_{k}^{(j)}\big{\}}_{k=1}^{K_{j}} with a small remainder. The first stopping rule, therefore, strives to control the magnitude of the linear combination of {𝝂k(j)}k=1Kj\big{\{}\boldsymbol{\nu}_{k}^{(j)}\big{\}}_{k=1}^{K_{j}}, while the second stopping rule is specifically designed to control the size of the remainder terms. For this purpose, εj\varepsilon_{j} should be set as a credible controller of the remainder term, with an example given in Theorem 3.1. In this way, with small preset constants ε~\tilde{\varepsilon} and ε\varepsilon, we guarantee that the output has a desirably small gradient norm, dependent on ε~\tilde{\varepsilon} and ε\varepsilon, when the designed stopping rules are activated. Indeed, Proposition B.1 in Appendix B guarantees that the stopping rules will be satisfied at some point. More refined quantitative results regarding the number of steps required for stopping follow in Theorem 3.1 and Appendix D.3.

Algorithm 1 Probabilistic Stochastic Recursive Gradient (Prob-SARAH)
1:  Input: sample size nn, constraint area 𝒟\mathcal{D}, initial point 𝐱~0𝒟\tilde{\mathbf{x}}_{0}\in\mathcal{D}, large batch size {Bj}j1\{B_{j}\}_{j\geq 1}, mini batch size {bj}j1\{b_{j}\}_{j\geq 1}, inner loop length {Kj}j1\{K_{j}\}_{j\geq 1}, auxiliary error estimator {εj}j1\{\varepsilon_{j}\}_{j\geq 1}, errors ε~2,ε2\tilde{\varepsilon}^{2},\varepsilon^{2}
2:  for j=1,2,j=1,2,\ldots do
3:     Uniformly sample a batch j{1,,n}\mathcal{I}_{j}\subseteq\{1,\ldots,n\} without replacement, |j|=Bj|\mathcal{I}_{j}|=B_{j};
4:     𝝂0(j)1Bjijfi(𝐱~j1)\boldsymbol{\nu}^{(j)}_{0}\,\leftarrow\,\frac{1}{B_{j}}\sum_{i\in\mathcal{I}_{j}}\nabla f_{i}(\tilde{\mathbf{x}}_{j-1});
5:     𝐱0(j)𝐱~j1\mathbf{x}_{0}^{(j)}\,\leftarrow\,\tilde{\mathbf{x}}_{j-1};
6:     for k=1,2,,Kjk=1,2,\ldots,K_{j} do
7:        𝐱k(j)Proj(𝐱k1(j)ηj𝝂k1(j),𝒟)\mathbf{x}_{k}^{(j)}\,\leftarrow\,\mathrm{Proj}\big{(}\mathbf{x}_{k-1}^{(j)}-\eta_{j}\boldsymbol{\nu}_{k-1}^{(j)},\mathcal{D}\big{)}, project the update back to 𝒟\mathcal{D};
8:        Uniformly sample a mini-batch k(j){1,,n}\mathcal{I}_{k}^{(j)}\subseteq\{1,\ldots,n\} with replacement and |k(j)|=bj|\mathcal{I}_{k}^{(j)}|=b_{j};
9:        𝝂k(j)1bjik(j)fi(𝐱k(j))1bjik(j)fi(𝐱k1(j))+𝝂k1(j)\boldsymbol{\nu}_{k}^{(j)}\,\leftarrow\,\frac{1}{b_{j}}\sum_{i\in\mathcal{I}_{k}^{(j)}}\nabla f_{i}(\mathbf{x}_{k}^{(j)})-\frac{1}{b_{j}}\sum_{i\in\mathcal{I}_{k}^{(j)}}\nabla f_{i}(\mathbf{x}_{k-1}^{(j)})+\boldsymbol{\nu}_{k-1}^{(j)};
10:     end for
11:     if 1Kjk=0Kj1𝝂k(j)2ε~2\frac{1}{K_{j}}\sum_{k=0}^{K_{j}-1}\big{\|}\boldsymbol{\nu}_{k}^{(j)}\big{\|}^{2}\leq\tilde{\varepsilon}^{2} and εj12ε2\varepsilon_{j}\leq\frac{1}{2}\varepsilon^{2} then
12:        k^argmin0kKj1𝝂k(j)2\hat{k}\,\leftarrow\,\mathop{\arg\min}_{0\leq k\leq K_{j}-1}\big{\|}\boldsymbol{\nu}_{k}^{(j)}\big{\|}^{2};
13:        Return 𝐱^𝐱k^(j)\hat{\mathbf{x}}\,\leftarrow\,\mathbf{x}_{\hat{k}}^{(j)};
14:     end if
15:     𝐱~j𝐱Kj(j)\tilde{\mathbf{x}}_{j}\,\leftarrow\,\mathbf{x}_{K_{j}}^{(j)};
16:  end for
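For readers who prefer code, the following is a minimal NumPy sketch of Algorithm 1. It is an illustration under our own interface assumptions, not the implementation used in the experiments: grad_batch, proj and the schedule callables B, b, K, eps_aux are placeholders to be supplied by the user.

```python
import numpy as np

def prob_sarah(grad_batch, n, x0, proj, B, b, K, eps_aux,
               eps_tilde_sq, eps_sq, eta, max_outer=1000, rng=None):
    """Minimal sketch of Prob-SARAH (Algorithm 1).

    grad_batch(x, idx): average gradient (1/|idx|) * sum_{i in idx} grad f_i(x).
    proj(x):            Euclidean projection onto the compact set D.
    B, b, K, eps_aux:   callables j -> B_j, b_j, K_j, eps_j (hyperparameter schedules).
    eta:                step size (Theorem 3.1 uses eta_j = 1/(4L) for all j).
    """
    rng = np.random.default_rng() if rng is None else rng
    x_tilde = np.asarray(x0, dtype=float)
    for j in range(1, max_outer + 1):
        Bj, bj, Kj = B(j), b(j), K(j)
        # Lines 3-4: checkpoint estimator nu_0 from a large batch sampled without replacement.
        I_j = rng.choice(n, size=min(Bj, n), replace=False)
        nu = grad_batch(x_tilde, I_j)
        x = x_tilde.copy()
        nus, xs = [nu], [x.copy()]          # nu_0,...,nu_{K_j-1} and x_0,...,x_{K_j-1}
        for k in range(1, Kj + 1):
            # Line 7: projected gradient step with the current estimator.
            x_new = proj(x - eta * nu)
            # Lines 8-9: recursive update of the estimator with a small batch (with replacement).
            I_k = rng.choice(n, size=bj, replace=True)
            nu = grad_batch(x_new, I_k) - grad_batch(x, I_k) + nu
            x = x_new
            if k < Kj:
                nus.append(nu)
                xs.append(x.copy())
        # Line 11: stopping rules on the averaged squared estimator norms and the auxiliary error.
        norms_sq = [float(np.dot(v, v)) for v in nus]
        if np.mean(norms_sq) <= eps_tilde_sq and eps_aux(j) <= 0.5 * eps_sq:
            k_hat = int(np.argmin(norms_sq))   # Line 12
            return xs[k_hat]                   # Line 13: output x_{k_hat}^{(j)}
        x_tilde = x                            # Line 15: x_tilde_j = x_{K_j}^{(j)}
    return x_tilde
```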

3 Theoretical Results

This section is devoted to the main theoretical results for our proposed algorithm Prob-SARAH. We provide a stopping guarantee for the algorithm along with an upper bound on the number of steps. A high-probability error bound on the gradient of the output is also established. A discussion of the dependence of our algorithm on the parameters follows the main theorems.

3.1 Technical Assumptions

We shall introduce some necessary regularity assumptions. Most assumptions are commonly used in the optimization literature. We have further clarifications in Appendix A.

Assumption 3.1 (Existence of achievable minimum).

Assume that for each i=1,2,,ni=1,2,\ldots,n, fif_{i} has continuous gradient on 𝒟\mathcal{D} and 𝒟\mathcal{D} is a compact subset of d\mathbb{R}^{d}. Then, there exists a constant αM<\alpha_{M}<\infty such that

max1insup𝐱𝒟fi(𝐱)αM.\max\limits_{1\leq i\leq n}\sup\limits_{\mathbf{x}\in\mathcal{D}}\|\nabla f_{i}(\mathbf{x})\|\leq\alpha_{M}. (4)

Also, assume that there exists an interior point 𝐱\mathbf{x}^{*} of the set 𝒟\mathcal{D} such that

f(𝐱)=inf𝐱𝒟f(𝐱).f(\mathbf{x}^{*})=\inf\limits_{\mathbf{x}\in\mathcal{D}}f(\mathbf{x}).
Assumption 3.2 (LL-smoothness).

For each i=1,2,,ni=1,2,\ldots,n, fi:𝒟f_{i}:\mathcal{D}\rightarrow\mathbb{R} is LL-smooth for some constant L>0L>0, i.e.,

fi(𝐱)fi(𝐱)L𝐱𝐱,𝐱,𝐱𝒟.\|\nabla f_{i}(\mathbf{x})-\nabla f_{i}(\mathbf{x}^{\prime})\|\leq L\|\mathbf{x}-\mathbf{x}^{\prime}\|,\ \forall\ \mathbf{x},\mathbf{x}^{\prime}\in\mathcal{D}.
Assumption 3.3 (LL-smoothness extension).

There exists a LL-smooth function f~:𝒟\tilde{f}:\mathcal{D}\rightarrow\mathbb{R} such that

f~(𝐱)=f(𝐱),𝐱𝒟,andf~(Proj(𝐱,𝒟))f~(𝐱),𝐱d,\tilde{f}(\mathbf{x})=f(\mathbf{x}),\ \forall\ \mathbf{x}\in\mathcal{D},\quad\text{and}\quad\tilde{f}(\mathrm{Proj}(\mathbf{x},\mathcal{D}))\leq\tilde{f}(\mathbf{x}),\ \forall\ \mathbf{x}\in\mathbb{R}^{d},

where Proj(𝐱,𝒟)\mathrm{Proj}(\mathbf{x},\mathcal{D}) is the Euclidean projection of 𝐱\mathbf{x} on some compact set 𝒟\mathcal{D}.

Assumption 3.4.

Assume that the following conditions hold.

  1. 1.

    ε1e\varepsilon\leq\frac{1}{e} and αM2110240\alpha_{M}^{2}\geq\frac{1}{10240}, where ε\varepsilon is the target error bound in (3) and αM\alpha_{M} is defined in (4).

  2. 2.

    The diameter of 𝒟\mathcal{D} is at least 1, i.e. d1max{𝐱𝐱:𝐱,𝐱𝒟}1d_{1}\triangleq\max\{\|\mathbf{x}-\mathbf{x}^{\prime}\|:\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{D}\}\geq 1.

Assumption 3.1 also indicates that there exists a positive number Δf\Delta_{f} such that sup𝐱𝒟[f(𝐱)f(𝐱)]Δf.\sup_{\mathbf{x}\in\mathcal{D}}\big{[}f(\mathbf{x})-f(\mathbf{x}^{*})\big{]}\leq\Delta_{f}. Assumptions 3.1–3.3 are commonly used in the optimization literature, and Assumption 3.4 can be easily satisfied in practical use as long as the initial points are not too far from the optimum. See more comments on assumptions in Appendix A.

3.2 Main Results on Complexity

According to the definition given in [20], an algorithm is called ε\varepsilon-independent if it can guarantee convergence at all target accuracies ε\varepsilon in expectation without explicitly using ε\varepsilon in the algorithm. This is a very favorable property because it means that we no longer need to set the target error beforehand. Here, we introduce a similar property regarding the dependency on ε\varepsilon.

Definition 3.1 (ε\varepsilon-semi-independence).

An algorithm is ε\varepsilon-semi-independent, given δ\delta, if it can guarantee convergence at all target accuracies ε\varepsilon with probability at least 1δ1-\delta, and the knowledge of ε\varepsilon is only needed in post-processing. That is, the algorithm can iterate without knowing ε\varepsilon, and we can select an appropriate iterate afterwards.

The newly introduced property can be perceived as the probabilistic equivalent of ε\varepsilon-independence. As stated in the succeeding theorem, under the given conditions, Prob-SARAH can achieve ε\varepsilon-semi-independence, given δ\delta.

Theorem 3.1.

Suppose that Assumptions 3.1, 3.2, 3.3 and 3.4 are valid. Given a pair of errors (ε,δ)(\varepsilon,\delta), in Algorithm 1 (Prob-SARAH), set hyperparameters

\eta_{j}=\frac{1}{4L},\quad K_{j}=\sqrt{B_{j}}=\sqrt{j^{2}\wedge n},\quad b_{j}=l_{j}K_{j},\quad\varepsilon_{j}=8L^{2}\tau_{j}+2q_{j},\quad\tilde{\varepsilon}^{2}=\frac{1}{5}\varepsilon^{2}, (5)

for j1j\geq 1, where

Then,

Comp(ε,δ)=𝒪~L,Δf,αM(1ε3nε2),Comp(\varepsilon,\delta)=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\Big{(}\frac{1}{\varepsilon^{3}}\wedge\frac{\sqrt{n}}{\varepsilon^{2}}\Big{)},

where Comp(ε,δ)Comp(\varepsilon,\delta) represents the number of computations needed to get an output 𝐱^\hat{\mathbf{x}} satisfying f(𝐱^)2ε2\left\|\nabla f(\hat{\mathbf{x}})\right\|^{2}\leq\varepsilon^{2} with probability at least 1δ1-\delta.

More detailed results can be found in Appendix C, where we also introduce another hyperparameter setting that leads to a complexity with a better dependency on αM2\alpha_{M}^{2}, which could be implicitly affected by the choice of the constraint region 𝒟\mathcal{D}.

3.3 Proof Sketch

In this part, we explain the idea of the proof of Theorem 3.1. The same proof strategy can be applied to other hyper-parameter settings. First, we bound the difference between 𝝂k(j)\boldsymbol{\nu}_{k}^{(j)} and f(𝐱k(j))\nabla f\big{(}\mathbf{x}_{k}^{(j)}\big{)} by a linear combination of {𝝂m(j)}m=0k1\{\|\boldsymbol{\nu}_{m}^{(j)}\|\}_{m=0}^{k-1} and small remainders, which gives good control of f(𝐱k(j))\|\nabla f\big{(}\mathbf{x}_{k}^{(j)}\big{)}\| when the stopping rules are met. Second, we bound the number of steps needed to meet the stopping rules. Combining these two key components, we obtain the final conclusions.

Let us first introduce a novel Azuma-Hoeffding type inequality, which is key to our analysis.

Theorem 3.2 (Martingale Azuma-Hoeffding Inequality with Random Bounds).

Suppose 𝐳1,,𝐳Kd\mathbf{z}_{1},\ldots,\mathbf{z}_{K}\in\mathbb{R}^{d} is a martingale difference sequence adapted to 0,,K\mathcal{F}_{0},\ldots,\mathcal{F}_{K}. Suppose {rk}k=1K\{r_{k}\}_{k=1}^{K} is a sequence of random variables such that 𝐳krk\|\mathbf{z}_{k}\|\leq r_{k} and rkr_{k} is measurable with respect to k\mathcal{F}_{k}, k=1,,Kk=1,\ldots,K. Then, for any fixed δ>0\delta>0 and B>b>0B>b>0, with probability at least 1δ1-\delta, for every 1tK1\leq t\leq K, either

\sum\limits_{k=1}\limits^{t}r_{k}^{2}\geq B\quad\text{ or }\quad\Big{\|}\sum\limits_{k=1}\limits^{t}\mathbf{z}_{k}\Big{\|}^{2}\leq 9\max\Big{\{}\sum\limits_{k=1}\limits^{t}r_{k}^{2},b\Big{\}}\Big{(}\log(\frac{2}{\delta})+\log\log(\frac{B}{b})\Big{)}.

Remark 3.1.

It is noteworthy that this probabilistic bound on large-deviation is dimension-free, which is a nontrivial extension of Theorem 3.5 in [30]. If r1,r2,,rKr_{1},r_{2},\ldots,r_{K} are not random, we can let B=k=1Krk2+ζ1B=\sum_{k=1}^{K}r_{k}^{2}+\zeta_{1} and b=ζ2Bb=\zeta_{2}B with ζ1>0\zeta_{1}>0, 0<ζ2<10<\zeta_{2}<1. Since ζ1\zeta_{1} can be arbitrarily close to 0 and ζ2\zeta_{2} can be arbitrarily close to 1, we can recover Theorem 3.5 in [30]. Compared with Corollary 8 in [15], which can be viewed as a sub-Gaussian counterpart of our result, a key feature of our Theorem 3.2 is its dimension-independence. We are also working towards improving the bound in Corollary 8 from [15] to a dimension-free one.
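As an informal sanity check of Theorem 3.2 (ours, not part of the paper's analysis), the following NumPy simulation constructs a martingale difference sequence whose per-step bound r_k is random and depends on the past, and verifies empirically that the event in the theorem fails with frequency far below δ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, delta = 50, 200, 0.01
B, b = 4.0 * K, 1.0          # deterministic envelope with sum_k r_k^2 < B almost surely
n_rep, violations = 2000, 0

for _ in range(n_rep):
    z_sum = np.zeros(d)
    r_sq_sum = 0.0
    ok = True
    for k in range(1, K + 1):
        # r_k depends on the past iterates (random, F_k-measurable), and r_k < 1.5 so sum r_k^2 < B.
        r_k = 0.5 + np.tanh(np.linalg.norm(z_sum) / k)
        # z_k: uniform direction on the sphere scaled to norm exactly r_k; E[z_k | past] = 0.
        g = rng.standard_normal(d)
        z_k = r_k * g / np.linalg.norm(g)
        z_sum += z_k
        r_sq_sum += r_k ** 2
        rhs = 9.0 * max(r_sq_sum, b) * (np.log(2.0 / delta) + np.log(np.log(B / b)))
        if r_sq_sum < B and np.dot(z_sum, z_sum) > rhs:
            ok = False
    violations += (not ok)

print(f"fraction of runs violating the bound: {violations / n_rep:.4f} (should be <= {delta})")
```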

The success of Algorithm 1 is largely because f(𝐱k(j))\nabla f(\mathbf{x}_{k}^{(j)}) is well-approximated by 𝝂k(j)\boldsymbol{\nu}_{k}^{(j)}, and meanwhile 𝝂k(j)\boldsymbol{\nu}_{k}^{(j)} can be easily updated. We can observe that 𝝂k(j)f(𝐱k(j))\boldsymbol{\nu}_{k}^{(j)}-\nabla f(\mathbf{x}_{k}^{(j)}) is actually the sum of a martingale difference sequence, as

\begin{aligned}
\boldsymbol{\nu}_{k}^{(j)}-\nabla f(\mathbf{x}_{k}^{(j)})&=\Big{[}\frac{1}{b_{j}}\sum_{i\in\mathcal{I}_{k}^{(j)}}\nabla f_{i}(\mathbf{x}_{k}^{(j)})-\frac{1}{b_{j}}\sum_{i\in\mathcal{I}_{k}^{(j)}}\nabla f_{i}(\mathbf{x}_{k-1}^{(j)})+\nabla f(\mathbf{x}_{k-1}^{(j)})-\nabla f(\mathbf{x}_{k}^{(j)})\Big{]}+\Big{[}\boldsymbol{\nu}_{k-1}^{(j)}-\nabla f(\mathbf{x}_{k-1}^{(j)})\Big{]}\\
&=\sum_{m=1}^{k}\Big{[}\frac{1}{b_{j}}\sum_{i\in\mathcal{I}_{m}^{(j)}}\nabla f_{i}(\mathbf{x}_{m}^{(j)})-\frac{1}{b_{j}}\sum_{i\in\mathcal{I}_{m}^{(j)}}\nabla f_{i}(\mathbf{x}_{m-1}^{(j)})+\nabla f(\mathbf{x}_{m-1}^{(j)})-\nabla f(\mathbf{x}_{m}^{(j)})\Big{]}+\Big{[}\boldsymbol{\nu}_{0}^{(j)}-\nabla f(\mathbf{x}_{0}^{(j)})\Big{]}. \quad (6)
\end{aligned}

To be more specific, let \mathcal{F}_{0}=\{\emptyset,\Omega\}, and iteratively define \mathcal{F}_{j,-1}=\mathcal{F}_{j-1}, \mathcal{F}_{j,0}=\sigma\big{(}\mathcal{F}_{j-1}\cup\sigma(\mathcal{I}_{j})\big{)}, \mathcal{F}_{j,k}=\sigma\big{(}\mathcal{F}_{j,k-1}\cup\sigma(\mathcal{I}_{k}^{(j)})\big{)}, \mathcal{F}_{j}=\sigma\big{(}\bigcup\limits_{k=1}\limits^{\infty}\mathcal{F}_{j,k}\big{)}, j\geq 1,k\geq 1. We also denote \boldsymbol{\epsilon}_{0}^{(j)}\triangleq\boldsymbol{\nu}_{0}^{(j)}-\nabla f\big{(}\mathbf{x}_{0}^{(j)}\big{)} and \boldsymbol{\epsilon}_{m}^{(j)}\triangleq\frac{1}{b_{j}}\sum\limits_{i\in\mathcal{I}_{m}^{(j)}}\nabla f_{i}(\mathbf{x}_{m}^{(j)})-\nabla f(\mathbf{x}_{m}^{(j)})+\nabla f(\mathbf{x}_{m-1}^{(j)})-\frac{1}{b_{j}}\sum\limits_{i\in\mathcal{I}_{m}^{(j)}}\nabla f_{i}(\mathbf{x}_{m-1}^{(j)}), m\geq 1. Then, we can see that \{\boldsymbol{\epsilon}_{m}^{(j)}\}_{m=0}^{k} is a martingale difference sequence adapted to \{\mathcal{F}_{j,m}\}_{m=-1}^{k}. With the help of our new martingale Azuma-Hoeffding inequality, we can control the difference between \boldsymbol{\nu}_{k}^{(j)} and \nabla f\big{(}\mathbf{x}_{k}^{(j)}\big{)} by a linear combination of \{\|\boldsymbol{\nu}_{m}^{(j)}\|\}_{m=0}^{k-1} and small remainders, with details given in Appendix D.1. Then, given the stopping rules in Line 11 and the selection method specified in Line 12 of Algorithm 1, it is not hard to obtain \left\|\nabla f(\hat{\mathbf{x}})\right\|^{2}\leq\varepsilon^{2} with high probability. More details can be found in Appendix D.2.

Another key question to be resolved is: when does the algorithm stop? The following analysis builds some intuition. Given a T+T\in\mathbb{Z}_{+}, with the bound given in Proposition F.1 in Appendix F, with high probability,

Δf\displaystyle-\Delta_{f} f(𝐱~2T)f(𝐱~T)AT116Lj=T+12Tk=0Kj1𝝂k(j)2,\displaystyle\leq f\left(\tilde{\mathbf{x}}_{{2T}}\right)-f\left(\tilde{\mathbf{x}}_{{T}}\right)\leq A_{T}-\frac{1}{16L}\sum\limits_{j=T+1}\limits^{2T}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}, (7)

where ATA_{T} is upper bounded by a value polylogarithmic in TT. As for the second summation, if εj12ε2\varepsilon_{j}\leq\frac{1}{2}\varepsilon^{2} for j=T,T+1,,2Tj=T,T+1,\ldots,2T (which is obviously true when TT is moderately large) and our algorithm does not stop in 2T2T outer iterations,

116Lj=T+12Tk=0Kj1𝝂k(j)2ε~216Lj=T+12TKjε~216Lj=T+12T(Tn)=ε~216LT2(nT),\displaystyle\quad\frac{1}{16L}\sum\limits_{j=T+1}\limits^{2T}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\geq\frac{\tilde{\varepsilon}^{2}}{16L}\sum\limits_{j=T+1}\limits^{2T}K_{j}\geq\frac{\tilde{\varepsilon}^{2}}{16L}\sum\limits_{j=T+1}\limits^{2T}\left(T\wedge\sqrt{n}\right)=\frac{\tilde{\varepsilon}^{2}}{16L}T^{2}\wedge(\sqrt{n}T),

which grows at least linearly in TT. Consequently, when TT is sufficiently large, the RHS of (7) can be smaller than Δf-\Delta_{f}, which leads to a contradiction. Roughly, we can see that the stopping time TT cannot exceed the order of 𝒪~(1ε1nε2)\tilde{\mathcal{O}}\big{(}\frac{1}{\varepsilon}\vee\frac{1}{\sqrt{n}\varepsilon^{2}}\big{)}. More details can be found in Appendix D.3.
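To spell out the last step (our own rearrangement of the two displays above, not text from the paper): if the algorithm has not stopped within 2T2T outer iterations, then

\frac{\tilde{\varepsilon}^{2}}{16L}\big{(}T^{2}\wedge\sqrt{n}T\big{)}\leq A_{T}+\Delta_{f}\quad\Longrightarrow\quad T\leq\frac{\sqrt{16L(A_{T}+\Delta_{f})}}{\tilde{\varepsilon}}\vee\frac{16L(A_{T}+\Delta_{f})}{\sqrt{n}\tilde{\varepsilon}^{2}},

and since \tilde{\varepsilon}^{2}=\varepsilon^{2}/5 and A_{T} is polylogarithmic in TT, this caps the stopping time at the claimed order \tilde{\mathcal{O}}\big{(}\frac{1}{\varepsilon}\vee\frac{1}{\sqrt{n}\varepsilon^{2}}\big{)}.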

4 Numerical Experiments

In order to validate our theoretical results and show the good probabilistic properties of the newly introduced Prob-SARAH, we conduct some numerical experiments where the objectives are possibly non-convex.

4.1 Logistic Regression with Non-Convex Regularization

In this part, we consider adding a non-convex regularization term to the commonly used logistic regression objective. Specifically, given a sequence of observations (𝐰i,yi)d×{1,1}(\mathbf{w}_{i},y_{i})\in\mathbb{R}^{d}\times\{-1,1\}, i=1,2,,ni=1,2,\ldots,n and a regularization parameter λ>0\lambda>0, the objective is

f(𝐱)=1ni=1nlog(1+eyi𝐰iT𝐱)+λ2j=1dxj21+xj2.f(\mathbf{x})=\frac{1}{n}\sum\limits_{i=1}\limits^{n}\log\left(1+e^{-y_{i}\mathbf{w}_{i}^{T}\mathbf{x}}\right)+\frac{\lambda}{2}\sum\limits_{j=1}\limits^{d}\frac{x_{j}^{2}}{1+x_{j}^{2}}.
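For concreteness, here is a minimal NumPy sketch of this objective and its full gradient; it is our own illustration (the array names W, y are placeholders), not the code used for the experiments.

```python
import numpy as np

def loss_and_grad(x, W, y, lam=0.1):
    """Non-convexly regularized logistic loss f(x) and its gradient.

    W: (n, d) matrix of covariates w_i; y: (n,) labels in {-1, +1}.
    """
    margins = -y * (W @ x)                          # m_i = -y_i <w_i, x>
    loss = np.mean(np.logaddexp(0.0, margins))      # stable log(1 + exp(m_i))
    loss += 0.5 * lam * np.sum(x ** 2 / (1.0 + x ** 2))
    # d/dm log(1+e^m) = sigmoid(m); computed stably via exp(m - logaddexp(0, m)).
    sig = np.exp(margins - np.logaddexp(0.0, margins))
    grad = -(W.T @ (sig * y)) / len(y)
    grad += lam * x / (1.0 + x ** 2) ** 2           # gradient of (lam/2) * sum x_j^2/(1+x_j^2)
    return loss, grad
```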
Figure 1: Comparison of convergence with respect to the (1δ)(1-\delta)-quantile of the squared gradient norm (f2)\left(\|\nabla f\|^{2}\right) and the δ\delta-quantile of validation accuracy on the MNIST dataset for δ=0.1\delta=0.1 and δ=0.01\delta=0.01. The second (fourth) column presents zoomed-in versions of the figures in the first (third) column. Top: δ=0.1\delta=0.1. Bottom: δ=0.01\delta=0.01. ’bs’ stands for batch size. ’sj=x’ means that the smallest batch size is approximately xlogxx\log x.

Such an objective has also been considered in other works like [11] and [14]. Following these works, we set the regularization parameter λ=0.1\lambda=0.1 across all experiments. We compare the newly-introduced Prob-SARAH against three popular methods including SGD ([6]), SVRG ([31]) and SCSG ([21]). Based on results given in Theorem 3.1, we let the length of the inner loop KjjnK_{j}\sim j\wedge\sqrt{n}, the inner loop batch size bjlogj(jn)b_{j}\sim\log j\left(j\wedge\sqrt{n}\right), the outer loop batch size Bjj2nB_{j}\sim j^{2}\wedge n. For fair comparison, we determine the batch size (inner loop batch size) for SGD (SCSG and SVRG) based on the sample size nn and the number of epochs needed to have sufficient decrease in gradient norm. For example, for the w7a dataset, the sample size is 24692 and we run 60 epochs in total. In the 20th epoch, the inner loop batch size of Prob-SARAH is approximately 67log6728167\log 67\approx 281. Thus, we set batch size 256 for SGD, SCSG and SVRG so that they can be roughly matched. In addition, based on the theoretical results from [31], we also consider a large inner loop batch size comparable to n2/3n^{2/3} for SVRG. Finally, we set step size η=0.01\eta=0.01 for all algorithms across all experiments for simplicity.

Results are displayed in Figure 2, from which we can see that Prob-SARAH has a superior probabilistic guarantee in controlling the gradient norm in all experiments. It is significantly better than SCSG and SVRG under our current setting. Prob-SARAH can achieve a lower gradient norm than SGD in the early stages, while SGD has a slight advantage when the number of epochs is large.
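The probabilistic metric reported in Figures 1 and 2 is a per-epoch quantile over repeated independent runs. A minimal sketch of how such curves can be computed is given below; the array names and shapes are our assumptions, not the paper's code.

```python
import numpy as np

def probabilistic_curves(grad_sq, val_acc, delta=0.1):
    """Per-epoch curves plotted in the figures, computed over independent runs.

    grad_sq, val_acc: arrays of shape (n_runs, n_epochs).
    Returns the (1-delta)-quantile of ||grad f||^2 and the delta-quantile of
    validation accuracy at each epoch.
    """
    grad_curve = np.quantile(np.asarray(grad_sq), 1.0 - delta, axis=0)
    acc_curve = np.quantile(np.asarray(val_acc), delta, axis=0)
    return grad_curve, acc_curve
```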

Figure 2: Comparison of convergence with respect to (1δ)(1-\delta)-quantile of square of gradient norm (f2)\left(\|\nabla f\|^{2}\right) over 3 datasets for δ=0.1\delta=0.1 and δ=0.01\delta=0.01. Top: δ=0.1\delta=0.1. Bottom: δ=0.01\delta=0.01. Datasets: mushrooms, ijcnn1, w7a (from left to right). ’bs’ stands for batch size.

4.2 Two-Layer Neural Network

We also evaluate the performance of Prob-SARAH, SGD, SVRG and SCSG on the MNIST dataset with a simple 2-layer neural network. The two hidden layers respectively have 128 and 64 neurons. We include a GELU activation layer following each hidden layer. We use the negative log likelihood as our loss function. Under this setting, the objective is possibly non-convex and smooth on any given compact set. The step size is fixed to be 0.01 for all algorithms. For Prob-SARAH, we still have the length of the inner loop KjjnK_{j}\sim j\wedge\sqrt{n}, the inner loop batch size bjlogj(jn)b_{j}\sim\log j\left(j\wedge\sqrt{n}\right), the outer loop batch size Bjj2nB_{j}\sim j^{2}\wedge n. But to reduce computational time, we let jj start from 10, 30 and 50 respectively. Based on the same rule described in the previous subsection, we let the batch size (or inner loop batch size) for SGD, SVRG and SCSG be 512.
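A minimal PyTorch sketch of the network described above is given below; the module and function names are ours, and this is an illustration rather than the exact training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerMLP(nn.Module):
    """784 -> 128 -> 64 -> 10 network with GELU activations, for MNIST."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 64)
        self.out = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)          # flatten 28x28 images
        x = F.gelu(self.fc1(x))
        x = F.gelu(self.fc2(x))
        return F.log_softmax(self.out(x), dim=1)

def nll_loss_on_batch(model, images, labels):
    # Negative log-likelihood on log-probabilities, as used for the objective.
    return F.nll_loss(model(images), labels)
```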

Results are given in Figure 1. In terms of gradient norm, Prob-SARAH has the best performance among algorithms considered here when the number of epochs is relatively small. With increasing number of epochs, SVRG tends to be better in finding first-order stationary points. However, based on the 3rd and 4th columns in Figure 1, SVRG apparently has an inferior performance on the validation set, which indicates that it could be trapped at local minima. In brief, Prob-SARAH achieves the best tradeoff between finding a first-order stationary point and generalization.

We also consider another set of experiments by replacing the GELU activation function with ReLU, resulting in a non-smooth objective. The results are shown in Appendix G; they resemble those in Figure 1, and similar conclusions can be drawn.

5 Conclusion

In this paper, we propose a SARAH-based variance reduction algorithm called Prob-SARAH and provide high-probability bounds on the gradient norm of the estimator produced by Prob-SARAH. Under appropriate assumptions, the high-probability first-order complexity nearly matches the best in-expectation complexity. The main tool used in the theoretical analysis is a novel Azuma-Hoeffding type inequality. We believe that similar probabilistic analysis can be applied to SARAH-based algorithms in other settings.

Appendix A Remarks and Examples for Assumptions

A.1 More comments on Assumptions 3.1–3.4

Remark A.1 (Convexity and smoothness).

It is worth noticing that Assumption 3.1 is widely used in many non-convex optimization works and can be met for most applications in practice. Assumption 3.2 is also needed in deriving in-expectation bounds for many non-convex variance-reduced methods, including state-of-the-art ones like SPIDER and SpiderBoost. As for Assumption 3.3, it is a byproduct of the compact constraint and is satisfied for many commonly seen ff and usual choices of 𝒟\mathcal{D}. For more discussions on Assumption 3.3, please see Appendix A.2.

Remark A.2 (Compact set 𝒟\mathcal{D}).

Compared with other works in the literature of non-convex optimization, the compact constraint region 𝒟d\mathcal{D}\subseteq\mathbb{R}^{d} imposed in the finite sum problem (1) may seem somewhat restrictive. In fact, such a constraint is imposed largely for technical convenience and can be removed under additional conditions on the gradients. We will elaborate on this point in subsection C.1. Besides, in many practical applications, it is reasonable to restrict estimators to a compact set when certain prior knowledge is available.

A.2 An Example of Assumption 3.3

Let us consider the logistic regression with non-convex regularization, where the objective function can be characterized as

f(𝐱)=1ni=1nlog(1+exp(yi𝐰i,𝐱))+λ2Φ(𝐱),f(\mathbf{x})=\frac{1}{n}\sum\limits_{i=1}\limits^{n}\log\left(1+\text{exp}\left(-y_{i}\langle\mathbf{w}_{i},\mathbf{x}\rangle\right)\right)+\frac{\lambda}{2}\Phi(\mathbf{x}),

where Φ(𝐱)=j=1d(xj2)14\Phi(\mathbf{x})=\sum\limits_{j=1}\limits^{d}(x_{j}^{2})^{\frac{1}{4}}, xjx_{j} is the jjth element of 𝐱\mathbf{x}, λ>0\lambda>0 is the regularization parameter, {yi}i=1n\{y_{i}\}_{i=1}^{n} are labels and {𝐰i}i=1n\{\mathbf{w}_{i}\}_{i=1}^{n} are normalized covariates with norm 1. In fact, for any fixed λ>0\lambda>0, Assumption 3.3 holds with f~=f\tilde{f}=f and 𝒟={𝐱:𝐱R}\mathcal{D}=\{\mathbf{x}:\|\mathbf{x}\|\leq R\} when RR is sufficiently large. Since smoothness is easy to show, we focus on the second part of Assumption 3.3. To show that

f(Proj(𝐱,𝒟))f(𝐱)f\left(\mathrm{Proj}(\mathbf{x},\mathcal{D})\right)\leq f(\mathbf{x})

holds for any 𝐱d\mathbf{x}\in\mathbb{R}^{d}, since the projection direction is pointed towards the origin, it suffices to show that for any 𝝂d\boldsymbol{\nu}\in\mathbb{R}^{d} with 𝝂=1\|\boldsymbol{\nu}\|=1,

ddtfi(t𝝂)=ddt(log(1+exp(tyi𝐰i,𝝂))+λ2j=1dt(νj2)14)0,\frac{d}{dt}f_{i}(t\boldsymbol{\nu})=\frac{d}{dt}\Big{(}\log\big{(}1+\text{exp}(-ty_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle)\big{)}+\frac{\lambda}{2}\sum\limits_{j=1}\limits^{d}\sqrt{t}(\nu_{j}^{2})^{\frac{1}{4}}\Big{)}\geq 0,

when tRt\geq R for i=1,2,,ni=1,2,\ldots,n, where νj\nu_{j} is the jjth element of 𝝂\boldsymbol{\nu}. To see this,

ddtfi(t𝝂)\displaystyle\quad\frac{d}{dt}f_{i}(t\boldsymbol{\nu})
=yi𝐰i,𝝂exp(tyi𝐰i,𝝂)1+exp(tyi𝐰i,𝝂)+λ2j=1d(νj2)142t\displaystyle=\frac{-y_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle\text{exp}\left(-ty_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle\right)}{1+\text{exp}\left(-ty_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle\right)}+\frac{\lambda}{2}\sum\limits_{j=1}\limits^{d}\frac{(\nu_{j}^{2})^{\frac{1}{4}}}{2\sqrt{t}}
=yi𝐰i,𝝂1+exp(tyi𝐰i,𝝂)+λ2j=1d(νj2)142t\displaystyle=\frac{-y_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle}{1+\text{exp}\left(ty_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle\right)}+\frac{\lambda}{2}\sum\limits_{j=1}\limits^{d}\frac{(\nu_{j}^{2})^{\frac{1}{4}}}{2\sqrt{t}}
yi𝐰i,𝝂1+exp(tyi𝐰i,𝝂)+λ2j=1dνj22t\displaystyle\geq\frac{-y_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle}{1+\text{exp}\left(ty_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle\right)}+\frac{\lambda}{2}\sum\limits_{j=1}\limits^{d}\frac{\nu_{j}^{2}}{2\sqrt{t}}
=yi𝐰i,𝝂1+exp(tyi𝐰i,𝝂)+λ4t.\displaystyle=\frac{-y_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle}{1+\text{exp}\left(ty_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle\right)}+\frac{\lambda}{4\sqrt{t}}.

If yi𝐰i,𝝂0y_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle\leq 0, we can immediately know that ddtfi(t𝝂)0\frac{d}{dt}f_{i}(t\boldsymbol{\nu})\geq 0 for any t>0t>0.

If yi𝐰i,𝝂>0y_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle>0, let us consider an auxiliary function

g(b)=b1+etb.g(b)=\frac{-b}{1+e^{tb}}.

Then,

g(b)(1+etb)+btebt,g^{\prime}(b)\propto-\left(1+e^{tb}\right)+bte^{bt},

from which we know that the minimum of g(b)g(b) is attained at some b[1t,2t]b^{*}\in[\frac{1}{t},\frac{2}{t}]. Thus,

g(b)\geq g(b^{*})\geq\frac{-2}{(1+e^{tb^{*}})t}\geq\frac{-2}{(1+e)t}.

Therefore,

ddtfi(t𝝂)2(1+e)t+λ4t,\frac{d}{dt}f_{i}(t\boldsymbol{\nu})\geq\frac{-2}{(1+e)t}+\frac{\lambda}{4\sqrt{t}},

which is positive when t(8(1+e)λ)2t\geq\left(\frac{8}{(1+e)\lambda}\right)^{2}.
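As a quick numerical sanity check of this threshold (our own illustration with an arbitrary unit-norm covariate, label and direction, not part of the paper), one can evaluate the derivative directly:

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam = 20, 0.1
w = rng.standard_normal(d); w /= np.linalg.norm(w)      # normalized covariate, ||w|| = 1
nu = rng.standard_normal(d); nu /= np.linalg.norm(nu)   # unit projection direction
y = 1.0
s = y * np.dot(w, nu)                                   # y * <w, nu>

def dfi_dt(t):
    """d/dt [ log(1 + exp(-t*s)) + (lam/2) * sum_j sqrt(t) * |nu_j|^(1/2) ]."""
    # clip the exponent to avoid overflow; for t*s >> 1 the logistic term is ~0 anyway
    logistic_part = -s / (1.0 + np.exp(min(t * s, 700.0)))
    reg_part = 0.5 * lam * np.sum(np.sqrt(np.abs(nu))) / (2.0 * np.sqrt(t))
    return logistic_part + reg_part

t_threshold = (8.0 / ((1.0 + np.e) * lam)) ** 2
ts = np.linspace(t_threshold, 10.0 * t_threshold, 200)
print("threshold t:", round(t_threshold, 1))
print("min d/dt f_i(t*nu) beyond threshold:", min(dfi_dt(t) for t in ts))   # expected >= 0
```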

If we consider other non-convex regularization terms in logistic regression, such as Φ(𝐱)=j=1dxj21+xj2\Phi(\mathbf{x})=\sum\limits_{j=1}\limits^{d}\frac{x_{j}^{2}}{1+x_{j}^{2}}, we may no longer enjoy Assumption 3.3 because monotonicity may not hold for a few projection directions even when the constraint region is large. Nevertheless, such a theoretical flaw can easily be remedied by adding an extra regularization term like λe2𝐱2\frac{\lambda_{e}}{2}\|\mathbf{x}\|^{2} with appropriate λe>0\lambda_{e}>0.

Appendix B Stop Guarantee

We would like to point out that, under appropriate parameter settings, Prob-SARAH is guaranteed to stop. Actually, the stopping guarantee holds under more general conditions than those stated in the following proposition. But for simplicity, we only present conditions that naturally match the parameter settings given in the next two subsections.

Proposition B.1 (Stop guarantee of Prob-SARAH).

Suppose that Assumptions 3.1, 3.2, 3.3 and 3.4 are satisfied. Let step size ηj1/(4L)\eta_{j}\equiv 1/(4L) and suppose that bjKjb_{j}\geq K_{j}, j1j\geq 1. The large batch size {Bj}j1\{B_{j}\}_{j\geq 1} is set appropriately such that Bj=nB_{j}=n when jj is sufficiently large. If the limit of {εj}j1\{\varepsilon_{j}\}_{j\geq 1} is 0, then, for any fixed ε~\tilde{\varepsilon} and ε\varepsilon, with probability 1, Prob-SARAH (Algorithm 1) stops. In settings where we always have εj12ε2\varepsilon_{j}\leq\frac{1}{2}\varepsilon^{2}, we also have the result that Prob-SARAH (Algorithm 1) stops with probability 1.

Appendix C Detailed Results on Complexity

Theorem C.1.

Suppose that Assumptions 3.1, 3.2, 3.3 and 3.4 are valid. Given a pair of errors (ε,δ)(\varepsilon,\delta), in Algorithm 1 (Prob-SARAH), set hyperparameters

\eta_{j}=\frac{1}{4L},\quad K_{j}=\sqrt{B_{j}}=\sqrt{j^{2}\wedge n},\quad b_{j}=l_{j}K_{j},\quad\varepsilon_{j}=8L^{2}\tau_{j}+2q_{j},\quad\tilde{\varepsilon}^{2}=\frac{1}{5}\varepsilon^{2}, (8)

for j1j\geq 1, where

Then, with probability at least 1δ1-\delta, Prob-SARAH stops in at most

2(T1T2T3T4)=𝒪~L,Δf,αM(1ε+1nε2)2(T_{1}\vee T_{2}\vee T_{3}\vee T_{4})=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon}+\frac{1}{\sqrt{n}\varepsilon^{2}}\right)

outer iterations and the output satisfies f(𝐱^)2ε2\left\|\nabla f(\hat{\mathbf{x}})\right\|^{2}\leq\varepsilon^{2}. Detailed definitions of T1,T2,T3T_{1},T_{2},T_{3} and T4T_{4} can be found in Propositions D.3 and D.4.

Corollary C.1.

Under parameter settings in Theorem C.1,

Comp(ε,δ)=𝒪~L,Δf,αM(1ε3nε2).Comp(\varepsilon,\delta)=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon^{3}}\wedge\frac{\sqrt{n}}{\varepsilon^{2}}\right).

We introduce another setting that can help to reduce the dependence on αM2\alpha_{M}^{2}, which could be implicitly affected by the choice of constraint region 𝒟\mathcal{D}. We should also notice that, under this setting, the algorithm is no longer ε\varepsilon-semi-independent.

Theorem C.2.

Suppose that Assumptions 3.1, 3.2, 3.3 and 3.4 are valid. We denote Δf0f(𝐱~0)f(𝐱)\Delta_{f}^{0}\triangleq f\left(\tilde{\mathbf{x}}_{0}\right)-f\left(\mathbf{x}^{*}\right). Given a pair of errors (ε,δ)(\varepsilon,\delta), in Algorithm 1 (Prob-SARAH), set parameters

ηj=14L,Kj=Bj=n,bj=ljKj,εj=12ε~2=110ε2,\eta_{j}=\frac{1}{4L},\quad K_{j}=\sqrt{B_{j}}=\sqrt{n},\quad b_{j}=l_{j}K_{j},\quad\varepsilon_{j}=\frac{1}{2}\tilde{\varepsilon}^{2}=\frac{1}{10}\varepsilon^{2}, (9)

for j1j\geq 1, where

τj=140L2ε2,δj=δ4Cej4,lj=18(log(2δj)+loglog(2d1τj)).\displaystyle\tau_{j}=\frac{1}{40L^{2}}\varepsilon^{2},\delta^{\prime}_{j}=\frac{\delta}{4C_{e}j^{4}},\quad l_{j}=18\Big{(}\log(\frac{2}{\delta^{\prime}_{j}})+\log\log(\frac{2d_{1}}{\tau_{j}})\Big{)}.

Then, with probability at least 1δ1-\delta, Prob-SARAH stops in at most

T5=160L(Δf0+1)nε2=𝒪L,Δf0(1nε2)T_{5}=\frac{160L(\Delta_{f}^{0}+1)}{\sqrt{n}\varepsilon^{2}}=\mathcal{O}_{L,\Delta_{f}^{0}}\left(\frac{1}{\sqrt{n}\varepsilon^{2}}\right)

outer iterations and the output satisfies f(𝐱^)2ε2\left\|\nabla f(\hat{\mathbf{x}})\right\|^{2}\leq\varepsilon^{2}.

Corollary C.2.

Under parameter settings in Theorem C.2,

Comp(ε,δ)=𝒪~L,Δf0(nε2).Comp(\varepsilon,\delta)=\tilde{\mathcal{O}}_{L,\Delta_{f}^{0}}\left(\frac{\sqrt{n}}{\varepsilon^{2}}\right).

Comparing the complexities in Corollary C.1 and Corollary C.2, we can notice that under the second setting, we can get rid of the dependence on αM\alpha_{M} at the expense of losing some adaptivity to ε\varepsilon. We would also like to point out that, in the low precision region, i.e. when 1ε=o(n)\frac{1}{\varepsilon}=o\left(\sqrt{n}\right), the second complexity is inferior.

C.1 Dependency on Parameters

To apply our newly introduced Azuma-Hoeffding type inequality (see Theorem 3.2), it is necessary to impose a compact constraint region 𝒟\mathcal{D}. Therefore, let us provide a careful analysis of how 𝒟\mathcal{D} affects the convergence guarantee.

Dependency on d1d_{1}: d1d_{1}, the diameter of 𝒟\mathcal{D}, is a parameter directly related to the choice of 𝒟\mathcal{D}. As shown in the theoretical results presented above, the in-probability first-order complexities always have a polylogarithmic dependency on d1d_{1}, which implies that as long as d1d_{1} is polynomial in nn or 1ε\frac{1}{\varepsilon}, it should only have a minor effect on the complexity. With certain prior knowledge, we should be able to control d1d_{1} at a reasonable scale.

Dependency on Δf\Delta_{f} and Δf0\Delta_{f}^{0}: Under the setting given in Theorem 3.1, the first-order complexity is polynomial in Δf\Delta_{f}. Such dependency implies that the complexity would not deteriorate much if Δf\Delta_{f} is of a small order, which is definitely true when the loss function is bounded. As for the setting given in Theorem C.2, the first-order complexity is polynomial in Δf0\Delta_{f}^{0}, which is conventionally assumed to be 𝒪(1)\mathcal{O}(1) and will not be affected by 𝒟\mathcal{D}.

Appendix D Postponed Proofs for the Results in Section 3

D.1 Bounding the Difference between 𝝂k(j)\boldsymbol{\nu}_{k}^{(j)} and f(𝐱k(j))\nabla f\big{(}\mathbf{x}_{k}^{(j)}\big{)}

Proposition D.1.

For k0k\geq 0, j1j\geq 1, denote

(σ~k(j))24L2ηj2bjm=1k𝝂m1(j)2.\big{(}\tilde{\sigma}_{k}^{(j)}\big{)}^{2}\triangleq\frac{4L^{2}\eta_{j}^{2}}{b_{j}}\sum\limits_{m=1}\limits^{k}\big{\|}\boldsymbol{\nu}_{m-1}^{(j)}\big{\|}^{2}.

Under Assumptions 3.2 and 3.4, for any prescribed constant δ(0,1)\delta^{\prime}\in(0,1), τ(0,1)\tau\in(0,1), k0k\geq 0, j1j\geq 1,

𝝂k(j)f(𝐱k(j))2\displaystyle\quad\big{\|}\boldsymbol{\nu}_{k}^{(j)}-\nabla f\big{(}\mathbf{x}_{k}^{(j)}\big{)}\big{\|}^{2}
18((σ~k(j))2+4L2τkbj)(log2δ+loglog2d12τ)\displaystyle\leq 18\Big{(}\big{(}\tilde{\sigma}_{k}^{(j)}\big{)}^{2}+\frac{4L^{2}\tau k}{b_{j}}\Big{)}\Big{(}\log\frac{2}{\delta^{\prime}}+\log\log\frac{2d_{1}^{2}}{\tau}\Big{)} (10)
+128αM2Bjlog3δ𝟏{Bj<n}\displaystyle\quad+\frac{128\alpha_{M}^{2}}{B_{j}}\log\frac{3}{\delta^{\prime}}\mathbf{1}\left\{B_{j}<n\right\}

with probability at least 12δ1-2\delta^{\prime}.

Remark D.1.

Let us briefly explain this high-probability bound on 𝛎k(j)f(𝐱k(j))2\|\boldsymbol{\nu}_{k}^{(j)}-\nabla f(\mathbf{x}_{k}^{(j)})\|^{2}. When k=o(bj)k=o(b_{j}) and L=𝒪(1)L=\mathcal{O}({1}), by letting τ\tau be of appropriate n1n^{-1}-polynomial order, 4L2τk/bj4L^{2}\tau k/b_{j} will be roughly o(1)o(1). If, further, d1d_{1} is of nn-polynomial order and δ\delta^{\prime} is of n1n^{-1}-polynomial order, log(2/δ)+loglog(2d12/τ)\log(2/\delta^{\prime})+\log\log(2d_{1}^{2}/\tau) will be 𝒪~(1)\tilde{\mathcal{O}}(1). As a result, the upper bound is roughly (σ~k(j))2=(4L2ηj2/bj)m=1k𝛎m1(j)2(\tilde{\sigma}_{k}^{(j)})^{2}=(4L^{2}\eta_{j}^{2}/b_{j})\sum_{m=1}^{k}\|\boldsymbol{\nu}_{m-1}^{(j)}\|^{2} when BjB_{j} is sufficiently large so that the last term in the bound (10) is negligible. Bounding 𝛎k(j)f(𝐱k(j))2\|\boldsymbol{\nu}_{k}^{(j)}-\nabla f(\mathbf{x}_{k}^{(j)})\|^{2} by a linear combination of {𝛎m(j)}m=0\{\|\boldsymbol{\nu}_{m}^{(j)}\|\}_{m=0}^{\infty} is the key to our analysis.

D.2 Analysis on the Output 𝐱^\hat{\mathbf{x}}

Under parameter setting specified in Theorem 3.1, if we suppose that the algorithm stops at the jj-th outer iteration, i.e.

\frac{1}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\leq\tilde{\varepsilon}^{2},\ \varepsilon_{j}\leq\frac{1}{2}\varepsilon^{2}, (11)

there must exist a 0kKj10\leq k^{\prime}\leq K_{j}-1, such that 𝝂k(j)2ε~2.\left\|\boldsymbol{\nu}_{k^{\prime}}^{(j)}\right\|^{2}\leq\tilde{\varepsilon}^{2}.

Then, on the event

Ωj{ω:𝝂k(j)f(𝐱k(j))2lj((σ~k(j))2+4L2τjkbj)+qj,0kKj},\Omega_{j}\triangleq\left\{\omega:\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla f\left(\mathbf{x}_{k}^{(j)}\right)\right\|^{2}\leq l_{j}\left(\left(\tilde{\sigma}_{k}^{(j)}\right)^{2}+\frac{4L^{2}\tau_{j}k}{b_{j}}\right)+q_{j},0\leq k\leq K_{j}\right\}, (12)

where lj=18(log2δj+loglog2d12τj)l_{j}=18\left(\log\frac{2}{\delta^{\prime}_{j}}+\log\log\frac{2d_{1}^{2}}{\tau_{j}}\right) and qj=128αM2Bjlog3δj𝟏{Bj<n}q_{j}=\frac{128\alpha_{M}^{2}}{B_{j}}\log\frac{3}{\delta^{\prime}_{j}}\mathbf{1}\left\{B_{j}<n\right\}, we can easily derive an upper bound on f(𝐱k(j))2\left\|\nabla f(\mathbf{x}_{k^{\prime}}^{(j)})\right\|^{2},

f(𝐱k(j))2\displaystyle\quad\left\|\nabla f(\mathbf{x}_{k^{\prime}}^{(j)})\right\|^{2}
2𝝂k(j)2+2𝝂k(j)f(𝐱k(j))2\displaystyle\leq 2\|\boldsymbol{\nu}_{k^{\prime}}^{(j)}\|^{2}+2\left\|\boldsymbol{\nu}_{k^{\prime}}^{(j)}-\nabla f\left(\mathbf{x}_{k^{\prime}}^{(j)}\right)\right\|^{2}
2ε~2+2lj((σ~k(j))2+4L2τjkbj)+2qj\displaystyle\leq 2\tilde{\varepsilon}^{2}+2l_{j}\left(\left(\tilde{\sigma}_{k^{\prime}}^{(j)}\right)^{2}+\frac{4L^{2}\tau_{j}k^{\prime}}{b_{j}}\right)+2q_{j}
=2ε~2+2lj(4L2ηj2bjm=1k𝝂m1(j)2+4L2τjkbj)+2qj\displaystyle=2\tilde{\varepsilon}^{2}+2l_{j}\left(\frac{4L^{2}\eta_{j}^{2}}{b_{j}}\sum\limits_{m=1}\limits^{k^{\prime}}\left\|\boldsymbol{\nu}_{m-1}^{(j)}\right\|^{2}+\frac{4L^{2}\tau_{j}k^{\prime}}{b_{j}}\right)+2q_{j}
2ε~2+2lj(4L2ηj2bjm=1Kj𝝂m1(j)2+4L2τjKjbj)+2qj\displaystyle\leq 2\tilde{\varepsilon}^{2}+2l_{j}\left(\frac{4L^{2}\eta_{j}^{2}}{b_{j}}\sum\limits_{m=1}\limits^{K_{j}}\left\|\boldsymbol{\nu}_{m-1}^{(j)}\right\|^{2}+\frac{4L^{2}\tau_{j}K_{j}}{b_{j}}\right)+2q_{j}
2ε~2+2lj(4L2ηj2Kjbjε~2+4L2τjKjbj)+2qj\displaystyle\leq 2\tilde{\varepsilon}^{2}+2l_{j}\left(\frac{4L^{2}\eta_{j}^{2}K_{j}}{b_{j}}\tilde{\varepsilon}^{2}+\frac{4L^{2}\tau_{j}K_{j}}{b_{j}}\right)+2q_{j}
=2ε~2+2lj(Kj4bjε~2+4L2τjKjbj)+2qj\displaystyle=2\tilde{\varepsilon}^{2}+2l_{j}\left(\frac{K_{j}}{4b_{j}}\tilde{\varepsilon}^{2}+\frac{4L^{2}\tau_{j}K_{j}}{b_{j}}\right)+2q_{j}
=2.5ε~2+8L2τj+2qj=2.5ε~2+εjε2,\displaystyle=2.5\tilde{\varepsilon}^{2}+8L^{2}\tau_{j}+2q_{j}=2.5\tilde{\varepsilon}^{2}+\varepsilon_{j}\leq\varepsilon^{2},

where the 2nd step is based on (10) with definitions of ljl_{j} and qjq_{j} given in Theorem 3.1, the 5th step is based on (11), the 6th step is based on the choice of ηj=14L\eta_{j}=\frac{1}{4L} and the 7th step is based on the choice of bj=ljKjb_{j}=l_{j}K_{j}. In addition, based on Proposition D.1 and a union bound, the event j=1Ωj\bigcap\limits_{j=1}\limits^{\infty}\Omega_{j} occurs with probability at least

12j=1k=0Kjδj1j=1δKj2Cej41j=1δCej2=1δ.1-2\sum\limits_{j=1}\limits^{\infty}\sum\limits_{k=0}\limits^{K_{j}}\delta^{\prime}_{j}\geq 1-\sum\limits_{j=1}\limits^{\infty}\frac{\delta K_{j}}{2C_{e}j^{4}}\geq 1-\sum\limits_{j=1}\limits^{\infty}\frac{\delta}{C_{e}j^{2}}=1-\delta.

In short, when the algorithm stops, the gradient norm is controlled at the desired level with high probability.

The above results explain our choice of stopping rules in Algorithm 1 and can be summarized as the following proposition.

Proposition D.2.

Suppose that Assumptions 3.2 and 3.4 are true. Under the parameter setting given in Theorem 3.1, the output of Algorithm 1 satisfies

f(𝐱^)2ε2,\left\|\nabla f(\hat{\mathbf{x}})\right\|^{2}\leq\varepsilon^{2},

with probability at least 1δ1-\delta.

D.3 Upper-bounding the Stopping Time

Proposition D.3 (First Stopping Rule).

Suppose that Assumptions 3.2, 3.3 and 3.4 are valid. Let

T1=320L(c1+Δf)ε+320L(c1+Δf)nε2,\displaystyle T_{1}=\Bigg{\lceil}\frac{\sqrt{320L(c_{1}+\Delta_{f})}}{\varepsilon}+\frac{320L(c_{1}+\Delta_{f})}{\sqrt{n}\varepsilon^{2}}\Bigg{\rceil},
T2=3(320Lc2εlog320Lc2ε+640Lc2nε2log320Lc2ε2+1),\displaystyle T_{2}=\Bigg{\lceil}3\left(\frac{\sqrt{320Lc_{2}}}{\varepsilon}\log\frac{\sqrt{320Lc_{2}}}{\varepsilon}+\frac{640Lc_{2}}{\sqrt{n}\varepsilon^{2}}\log\frac{320Lc_{2}}{\varepsilon^{2}}+1\right)\Bigg{\rceil}, (13)

where

c1=CeL4+16αM2Llog192Ceδ,c2=64αM2L.c_{1}=\frac{C_{e}L}{4}+\frac{16\alpha_{M}^{2}}{L}\log\frac{192C_{e}}{\delta},\quad c_{2}=\frac{64\alpha_{M}^{2}}{L}.

Under the parameter setting given in Theorem 3.1, on Ω\Omega, when TT1T2T\geq T_{1}\vee T_{2}, there exists a T+1j2TT+1\leq j\leq 2T such that

1Kjk=0Kj1𝝂k(j)2ε~2.\frac{1}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\leq\tilde{\varepsilon}^{2}.
Proposition D.4 (Second Stopping Rule).

Let

T3=2c3ε,T4=6c4εlog2c4ε,\displaystyle T_{3}=\Bigg{\lceil}\frac{2\sqrt{c_{3}}}{\varepsilon}\Bigg{\rceil},\quad T_{4}=\Bigg{\lceil}\frac{6\sqrt{c_{4}}}{\varepsilon}\log\frac{2\sqrt{c_{4}}}{\varepsilon}\Bigg{\rceil}, (14)

where

c3=8L2+256αM2log12Ceδ,c4=1024αM2.c_{3}=8L^{2}+256\alpha_{M}^{2}\log\frac{12C_{e}}{\delta},\quad c_{4}=1024\alpha_{M}^{2}.

Under the parameter setting given in Theorem 3.1, on Ω\Omega, when TT3T4T\geq T_{3}\vee T_{4},

εT12ε2.\varepsilon_{T}\leq\frac{1}{2}\varepsilon^{2}.
Proposition D.5 (Stop Guarantee).

Under the parameter setting and assumptions given in Theorem 3.1, on Ω\Omega, when TT1T2T3T4T\geq T_{1}\vee T_{2}\vee T_{3}\vee T_{4}, Algorithm 1 stops in at most 2T2T outer iterations.

Proof.

If Algorithm 1 stops in TT outer iterations, our conclusion is obviously true. If not, according to Proposition D.3, there must exist a j[T+1,2T]j\in[T+1,2T] such that the first stopping rule is met, i.e.

1Kjk=0Kj1𝝂k(j)2ε~2.\frac{1}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\leq\tilde{\varepsilon}^{2}.

According to Proposition D.4, the second stopping rule is also met, i.e. εj12ε2.\varepsilon_{j}\leq\frac{1}{2}\varepsilon^{2}.

Consequently, the algorithm stops at the jj-th outer iteration. ∎

Appendix E Technical Lemmas

Lemma E.1 (Theorem 4 in [10]).

Let {ϵ1,ϵ2,,ϵn}\{\boldsymbol{\epsilon}_{1},\boldsymbol{\epsilon}_{2},\ldots,\boldsymbol{\epsilon}_{n}\} be a set of fixed vectors in d\mathbb{R}^{d}. Let ,𝒥{1,2,,n}\mathcal{I},\mathcal{J}\subseteq\{1,2,\ldots,n\} be two random index sets sampled with replacement and without replacement, respectively, with size ||=|𝒥|=k|\mathcal{I}|=|\mathcal{J}|=k. For any continuous and convex function f:df:\mathbb{R}^{d}\rightarrow\mathbb{R},

𝔼f(j𝒥ϵj)𝔼f(iϵi).\mathbb{E}f\Big{(}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}\Big{)}\leq\mathbb{E}f\Big{(}\sum\limits_{i\in\mathcal{I}}\boldsymbol{\epsilon}_{i}\Big{)}.
Lemma E.2 (Proposition 1.2 in [3]; see also Lemma 1.3 in [2]).

Let XX be a real random variable such that 𝔼X=0\mathbb{E}X=0 and aXba\leq X\leq b for some a,ba,b\in\mathbb{R}. Then, for all tt\in\mathbb{R},

log𝔼etXt2(ba)28.\log\mathbb{E}e^{tX}\leq\frac{t^{2}(b-a)^{2}}{8}.
Lemma E.3 (Theorem 3.5 in [30]; see also Theorem 3 in [29] and Proposition 2 in [5]).

Let {ϵk}k=1Kd\{\boldsymbol{\epsilon}_{k}\}_{k=1}^{K}\subseteq\mathbb{R}^{d} be a vector-valued martingale difference sequence with respect to k\mathcal{F}_{k}, k=0,1,,Kk=0,1,\ldots,K, i.e. for k=1,,Kk=1,\ldots,K, 𝔼[ϵk|k1]=0\mathbb{E}\left[\boldsymbol{\epsilon}_{k}|\mathcal{F}_{k-1}\right]=\textbf{0}. Assume ϵk2Bk2\|\boldsymbol{\epsilon}_{k}\|^{2}\leq B_{k}^{2}, k=1,2,,Kk=1,2,\ldots,K. Then,

(k=1Kϵkt)2exp(t22k=1KBk2),\mathbb{P}\left(\Big{\|}\sum\limits_{k=1}\limits^{K}\boldsymbol{\epsilon}_{k}\Big{\|}\geq t\right)\leq 2\text{exp}\Bigg{(}-\frac{t^{2}}{2\sum\limits_{k=1}\limits^{K}B_{k}^{2}}\Bigg{)},

t\forall t\in\mathbb{R}.

Proposition E.1 (Norm-Hoeffding, Sampling without Replacement).

Let {ϵ1,ϵ2,,ϵn}\{\boldsymbol{\epsilon}_{1},\boldsymbol{\epsilon}_{2},\ldots,\boldsymbol{\epsilon}_{n}\} be a set of nn fixed vectors in d\mathbb{R}^{d} such that ϵi2σ2\|\boldsymbol{\epsilon}_{i}\|^{2}\leq\sigma^{2}, 1in\forall 1\leq i\leq n, for some σ2>0\sigma^{2}>0. Let 𝒥{1,2,,n}\mathcal{J}\subseteq\{1,2,\ldots,n\} be a random index set sampled without replacement from {1,2,,n}\{1,2,\ldots,n\}, with size |𝒥|=k|\mathcal{J}|=k. Then,

(1kj𝒥ϵj1nj=1nϵjt)3exp(kt264σ2),\mathbb{P}\Bigg{(}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}\geq t\Bigg{)}\leq 3\text{exp}\Big{(}-\frac{kt^{2}}{64\sigma^{2}}\Big{)},

t\forall t\in\mathbb{R}. In addition,

𝔼1kjϵj1nj=1nϵj216σ2k.\mathbb{E}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{I}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}^{2}\leq\frac{16\sigma^{2}}{k}.
Proof.

First, we develop moment bounds. Let \mathcal{I} be a random index set sampled with replacement from \{1,2,\ldots,n\}, independent of 𝒥\mathcal{J}, with size ||=k|\mathcal{I}|=k. For any p+p\in\mathbb{Z}_{+},

𝔼1kj𝒥ϵj1nj=1nϵjp\displaystyle\mathbb{E}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}^{p}
\displaystyle\leq 𝔼1kjϵj1nj=1nϵjp\displaystyle\mathbb{E}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{I}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}^{p}
=\displaystyle= \bigintss0(1kjϵj1nj=1nϵjpr)dr\displaystyle\bigintss_{0}^{\infty}\mathbb{P}\Bigg{(}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{I}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}^{p}\geq r\Bigg{)}dr
=\displaystyle= \bigintss0(1kjϵj1nj=1nϵjr1/p)dr\displaystyle\bigintss_{0}^{\infty}\mathbb{P}\Bigg{(}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{I}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}\geq r^{1/p}\Bigg{)}dr
\displaystyle\leq \bigintss02exp(kr2/p8σ2)dr\displaystyle\bigintss_{0}^{\infty}2\text{exp}\Bigg{(}-\frac{kr^{2/p}}{8\sigma^{2}}\Bigg{)}dr
=\displaystyle= p(8σ2k)p/2Γ(p2),\displaystyle p\cdot\left(\frac{8\sigma^{2}}{k}\right)^{p/2}\cdot\Gamma\left(\frac{p}{2}\right),

where the 1st step is based on Lemma E.1 and the 4th step is based on the fact that ϵj1ni=1nϵi2σ,j\big{\|}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{i=1}\limits^{n}\boldsymbol{\epsilon}_{i}\big{\|}\leq 2\sigma,\forall j and Lemma E.3.

Then, for any s>0s>0,

𝔼exp(s1kj𝒥ϵj1nj=1nϵj)\displaystyle\quad\mathbb{E}\text{exp}\Bigg{(}s\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}\Bigg{)}
1+p=1sp𝔼1kj𝒥ϵj1nj=1nϵjpp!\displaystyle\leq 1+\sum\limits_{p=1}\limits^{\infty}\frac{s^{p}\mathbb{E}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}^{p}}{p!}
1+p=2sp𝔼1kj𝒥ϵj1nj=1nϵjpp!+s8πσ2k\displaystyle\leq 1+\sum\limits_{p=2}\limits^{\infty}\frac{s^{p}\mathbb{E}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}^{p}}{p!}+s\sqrt{\frac{8\pi\sigma^{2}}{k}}
=1+s8πσ2k+p=1(8σ2s2k)p(2p)Γ(p)(2p)!\displaystyle=1+s\sqrt{\frac{8\pi\sigma^{2}}{k}}+\sum\limits_{p=1}\limits^{\infty}\frac{\left(\frac{8\sigma^{2}s^{2}}{k}\right)^{p}\cdot(2p)\cdot\Gamma(p)}{(2p)!}
+p=1(8σ2s2k)2p+12(2p+1)Γ(p+12)(2p+1)!\displaystyle\quad+\sum\limits_{p=1}\limits^{\infty}\frac{\left(\frac{8\sigma^{2}s^{2}}{k}\right)^{\frac{2p+1}{2}}\cdot(2p+1)\cdot\Gamma\left(p+\frac{1}{2}\right)}{(2p+1)!}
=1+s8πσ2k+2p=1(8σ2s2k)p(p!)(2p)!\displaystyle=1+s\sqrt{\frac{8\pi\sigma^{2}}{k}}+2\sum\limits_{p=1}\limits^{\infty}\frac{\left(\frac{8\sigma^{2}s^{2}}{k}\right)^{p}\cdot(p!)}{(2p)!}
+8s2σ2kp=1(8σ2s2k)pΓ(p+12)(2p)!\displaystyle\quad+\sqrt{\frac{8s^{2}\sigma^{2}}{k}}\sum\limits_{p=1}\limits^{\infty}\frac{\left(\frac{8\sigma^{2}s^{2}}{k}\right)^{p}\cdot\Gamma\left(p+\frac{1}{2}\right)}{(2p)!}
1+s8πσ2k+(2+8πs2σ2k)p=1(8σ2s2k)p(p!)(2p)!\displaystyle\leq 1+s\sqrt{\frac{8\pi\sigma^{2}}{k}}+\left(2+\sqrt{\frac{8\pi s^{2}\sigma^{2}}{k}}\right)\sum\limits_{p=1}\limits^{\infty}\frac{\left(\frac{8\sigma^{2}s^{2}}{k}\right)^{p}\cdot(p!)}{(2p)!}
1+s8πσ2k+(1+2πs2σ2k)p=1(8σ2s2k)pp!\displaystyle\leq 1+s\sqrt{\frac{8\pi\sigma^{2}}{k}}+\left(1+\sqrt{\frac{2\pi s^{2}\sigma^{2}}{k}}\right)\sum\limits_{p=1}\limits^{\infty}\frac{\left(\frac{8\sigma^{2}s^{2}}{k}\right)^{p}}{p!}
=1+s8πσ2k+(1+2πs2σ2k)[exp(8s2σ2k)1]\displaystyle=1+s\sqrt{\frac{8\pi\sigma^{2}}{k}}+\left(1+\sqrt{\frac{2\pi s^{2}\sigma^{2}}{k}}\right)\left[\text{exp}\left(\frac{8s^{2}\sigma^{2}}{k}\right)-1\right]
8πs2σ2k+(1+2πs2σ2k)exp(8s2σ2k)\displaystyle\leq\sqrt{\frac{8\pi s^{2}\sigma^{2}}{k}}+\left(1+\sqrt{\frac{2\pi s^{2}\sigma^{2}}{k}}\right)\text{exp}\left(\frac{8s^{2}\sigma^{2}}{k}\right)
exp(8s2σ2k)+2exp(16s2σ2k)\displaystyle\leq\text{exp}\left(\frac{8s^{2}\sigma^{2}}{k}\right)+2\text{exp}\left(\frac{16s^{2}\sigma^{2}}{k}\right)
3exp(16s2σ2k),\displaystyle\leq 3\text{exp}\left(\frac{16s^{2}\sigma^{2}}{k}\right),

where the 1st step is based on Taylor’s expansion, and the second-to-last step is based on the facts that xex2πx\leq e^{\frac{x^{2}}{\pi}} and π2xex2e2x2\sqrt{\frac{\pi}{2}}xe^{x^{2}}\leq e^{2x^{2}} for all x0x\geq 0.

For any s>0s>0,

(1kj𝒥ϵj1nj=1nϵjt)𝔼exp(s1kj𝒥ϵj1nj=1nϵjts)3exp(16s2σ2kts).\begin{array}[]{cl}&\mathbb{P}\Bigg{(}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}\geq t\Bigg{)}\\ \leq&\mathbb{E}\text{exp}\Bigg{(}s\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}-ts\Bigg{)}\\ \leq&3\text{exp}\left(\frac{16s^{2}\sigma^{2}}{k}-ts\right).\\ \end{array} (15)

By letting s=kt32σ2s=\frac{kt}{32\sigma^{2}} in (15),

(1kj𝒥ϵj1nj=1nϵjt)3exp(kt264σ2).\mathbb{P}\Bigg{(}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}\geq t\Bigg{)}\leq 3\text{exp}\left(-\frac{kt^{2}}{64\sigma^{2}}\right).
The in-expectation bound follows from the moment bound above with p=2p=2, which gives \mathbb{E}\big{\|}\frac{1}{k}\sum_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum_{j=1}^{n}\boldsymbol{\epsilon}_{j}\big{\|}^{2}\leq 2\cdot(8\sigma^{2}/k)\cdot\Gamma(1)=16\sigma^{2}/k. ∎
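As a purely illustrative numerical sanity check of Proposition E.1 (not used in any proof; the population of vectors, the dimension, the subsample size, and the threshold grid below are arbitrary choices of ours):

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5, 20                     # population size, dimension, subsample size (arbitrary)
eps_vecs = rng.normal(size=(n, d))
eps_vecs /= np.maximum(1.0, np.linalg.norm(eps_vecs, axis=1, keepdims=True))  # enforce ||eps_i|| <= 1
sigma2 = float(np.max(np.linalg.norm(eps_vecs, axis=1) ** 2))                 # a valid sigma^2 for this population
full_mean = eps_vecs.mean(axis=0)

def deviation():
    # draw k indices without replacement and return ||subsample mean - full mean||
    J = rng.choice(n, size=k, replace=False)
    return np.linalg.norm(eps_vecs[J].mean(axis=0) - full_mean)

devs = np.array([deviation() for _ in range(20000)])
print(f"mean squared deviation {np.mean(devs ** 2):.4f}, bound 16*sigma^2/k = {16.0 * sigma2 / k:.4f}")
for t in (0.1, 0.2, 0.3, 0.4):
    empirical = float(np.mean(devs >= t))
    bound = min(1.0, 3.0 * float(np.exp(-k * t ** 2 / (64.0 * sigma2))))
    print(f"t={t:.1f}: empirical tail {empirical:.4f}, tail bound {bound:.4f}")

The empirical quantities should sit well below the (rather conservative) analytic bounds.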

Definition E.1.

A random vector ϵd\boldsymbol{\epsilon}\in\mathbb{R}^{d} is (a,σ2)(a,\sigma^{2})-norm-subGaussian (or nSG(a,σ2)(a,\sigma^{2})), with constants a,σ2>0a,\sigma^{2}>0, if

(ϵ𝔼ϵt)aexp(t22σ2),\mathbb{P}\left(\|\boldsymbol{\epsilon}-\mathbb{E}\boldsymbol{\epsilon}\|\geq t\right)\leq a\cdot\text{exp}\left(-\frac{t^{2}}{2\sigma^{2}}\right),

t>0\forall t>0.

Definition E.2.

A sequence of random vectors ϵ1,,ϵKd\boldsymbol{\epsilon}_{1},\ldots,\boldsymbol{\epsilon}_{K}\in\mathbb{R}^{d} is an (a,{σk2}k=1K)(a,\{\sigma_{k}^{2}\}_{k=1}^{K})-norm-subGaussian martingale difference sequence adapted to 0,1,,K\mathcal{F}_{0},\mathcal{F}_{1},\ldots,\mathcal{F}_{K}, with constants a,σ12,,σK2>0a,\sigma^{2}_{1},\ldots,\sigma_{K}^{2}>0, if for k=1,2,,Kk=1,2,\ldots,K,

𝔼[ϵk|k1]=0,σkk1,ϵkk,\mathbb{E}\left[\boldsymbol{\epsilon}_{k}|\mathcal{F}_{k-1}\right]=\textbf{0},\ \sigma_{k}\in\mathcal{F}_{k-1},\ \boldsymbol{\epsilon}_{k}\in\mathcal{F}_{k},

and ϵk|k1\boldsymbol{\epsilon}_{k}|\mathcal{F}_{k-1} is (a,σk2)(a,\sigma^{2}_{k})-norm-subGaussian.

Lemma E.4 (Corollary 8 in [15]).

Suppose ϵ1,,ϵKd\boldsymbol{\epsilon}_{1},\ldots,\boldsymbol{\epsilon}_{K}\in\mathbb{R}^{d} is (a,{σk2}k=1K)(a,\{\sigma_{k}^{2}\}_{k=1}^{K})-norm-subGaussian martingale difference sequence adapted to 0,1,,K\mathcal{F}_{0},\mathcal{F}_{1},\ldots,\mathcal{F}_{K}. Then for any fixed δ>0\delta>0, and B>b>0B>b>0, with probability at least 1δ1-\delta, either

k=1Kσk2B\sum\limits_{k=1}\limits^{K}\sigma_{k}^{2}\geq B

or,

k=1Kϵka2e1/emax{k=1Kσk2,b}(log2dδ+loglogBb).\Big{\|}\sum\limits_{k=1}\limits^{K}\boldsymbol{\epsilon}_{k}\Big{\|}\leq\frac{a}{2}e^{1/e}\sqrt{\max\left\{\sum\limits_{k=1}\limits^{K}\sigma_{k}^{2},b\right\}\left(\log\frac{2d}{\delta}+\log\log\frac{B}{b}\right)}.
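As a simple illustrative instance of Definition E.2 and Lemma E.4 (an added remark for orientation, not invoked verbatim elsewhere): if \mathbb{E}[\boldsymbol{\epsilon}_{k}|\mathcal{F}_{k-1}]=\mathbf{0} and \|\boldsymbol{\epsilon}_{k}\|\leq r_{k} almost surely with r_{k}\in\mathcal{F}_{k-1}, then ϵ1,,ϵK\boldsymbol{\epsilon}_{1},\ldots,\boldsymbol{\epsilon}_{K} is a (2,\{r_{k}^{2}\}_{k=1}^{K})-norm-subGaussian martingale difference sequence, since for 0<t\leq r_{k},

2\,\text{exp}\left(-\frac{t^{2}}{2r_{k}^{2}}\right)\geq 2e^{-1/2}>1\geq\mathbb{P}\left(\|\boldsymbol{\epsilon}_{k}\|\geq t\,\big{|}\,\mathcal{F}_{k-1}\right),

while for t>r_{k} the conditional probability on the right-hand side is 0. Bounded martingale difference sequences of this type are exactly the objects handled via the per-step bounds r_{k} in the proof of Theorem 3.2 below.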
Lemma E.5.

For any ε>0,n+\varepsilon>0,n\in\mathbb{Z}_{+},

1T2(nT)ε2,\frac{1}{T^{2}\wedge(\sqrt{n}T)}\leq\varepsilon^{2},

when T1ε+1nε2T\geq\lceil\frac{1}{\varepsilon}+\frac{1}{\sqrt{n}\varepsilon^{2}}\rceil.

Proof.
1T2(nT)=max{1T2,1nT}\displaystyle\quad\frac{1}{T^{2}\wedge(\sqrt{n}T)}=\max\Big{\{}\frac{1}{T^{2}},\frac{1}{\sqrt{n}T}\Big{\}}
max{1T2|T=1ε,1nT|T=1nε2}\displaystyle\leq\max\Big{\{}\frac{1}{T^{\prime 2}}\big{|}_{T^{\prime}=\frac{1}{\varepsilon}},\frac{1}{\sqrt{n}T^{\prime}}\big{|}_{T^{\prime}=\frac{1}{\sqrt{n}\varepsilon^{2}}}\Big{\}}
=ε2.\displaystyle=\varepsilon^{2}.

Lemma E.6.

For any ε(0,e1],n+\varepsilon\in(0,e^{-1}],n\in\mathbb{Z}_{+},

logTT2(nT)ε2,\frac{\log T}{T^{2}\wedge(\sqrt{n}T)}\leq\varepsilon^{2},

when T3(1εlog1ε+2nε2log1ε2+𝟏{1nε2log1ε2e6})T\geq\Big{\lceil}3\left(\frac{1}{\varepsilon}\log\frac{1}{\varepsilon}+\frac{2}{\sqrt{n}\varepsilon^{2}}\log\frac{1}{\varepsilon^{2}}+\mathbf{1}\left\{\frac{1}{\sqrt{n}\varepsilon^{2}}\log\frac{1}{\varepsilon^{2}}\leq\frac{e}{6}\right\}\right)\Big{\rceil}.

Proof.

Function h(T)=logTT2h(T)=\frac{\log T}{T^{2}} is monotonically decreasing when TeT\geq\sqrt{e}. Since T3εlog1εeT\geq\frac{3}{\varepsilon}\log\frac{1}{\varepsilon}\geq\sqrt{e},

logTT2log(3εlog1ε)(3εlog1ε)2=log3+log1ε+loglog1ε9(log1ε)2ε2\displaystyle\quad\frac{\log T}{T^{2}}\leq\frac{\log\left(\frac{3}{\varepsilon}\log\frac{1}{\varepsilon}\right)}{\left(\frac{3}{\varepsilon}\log\frac{1}{\varepsilon}\right)^{2}}=\frac{\log 3+\log\frac{1}{\varepsilon}+\log\log\frac{1}{\varepsilon}}{9\left(\log\frac{1}{\varepsilon}\right)^{2}}\varepsilon^{2}
1+log3+2log1ε9(log1ε)2ε23+log39ε2ε2.\displaystyle\leq\frac{1+\log 3+2\log\frac{1}{\varepsilon}}{9\left(\log\frac{1}{\varepsilon}\right)^{2}}\varepsilon^{2}\leq\frac{3+\log 3}{9}\varepsilon^{2}\leq\varepsilon^{2}.

Define a function h~(T)=logTT\tilde{h}(T)=\frac{\log T}{T}. It is monotonically decreasing when TeT\geq e. Thus, if 6nε2log1ε2e\frac{6}{\sqrt{n}\varepsilon^{2}}\log\frac{1}{\varepsilon^{2}}\geq e, we know TeT\geq e and consequently,

logTnTlog(6nε2log1ε2)n(6nε2log1ε2)\displaystyle\quad\frac{\log T}{\sqrt{n}T}\leq\frac{\log\left(\frac{6}{\sqrt{n}\varepsilon^{2}}\log\frac{1}{\varepsilon^{2}}\right)}{\sqrt{n}\left(\frac{6}{\sqrt{n}\varepsilon^{2}}\log\frac{1}{\varepsilon^{2}}\right)}
=log1nε2+loglog1ε2+log66log1ε2ε2\displaystyle=\frac{\log\frac{1}{\sqrt{n}\varepsilon^{2}}+\log\log\frac{1}{\varepsilon^{2}}+\log 6}{6\log\frac{1}{\varepsilon^{2}}}\varepsilon^{2}
log1ε2+loglog1ε2+log66log1ε2ε2\displaystyle\leq\frac{\log\frac{1}{\varepsilon^{2}}+\log\log\frac{1}{\varepsilon^{2}}+\log 6}{6\log\frac{1}{\varepsilon^{2}}}\varepsilon^{2}
2log1ε2+1+log66log1ε2ε2\displaystyle\leq\frac{2\log\frac{1}{\varepsilon^{2}}+1+\log 6}{6\log\frac{1}{\varepsilon^{2}}}\varepsilon^{2}
3+log66ε2\displaystyle\leq\frac{3+\log 6}{6}\varepsilon^{2}
ε2.\displaystyle\leq\varepsilon^{2}.

If 6nε2log1ε2e\frac{6}{\sqrt{n}\varepsilon^{2}}\log\frac{1}{\varepsilon^{2}}\leq e, T3T\geq 3. Hence,

logTnTlog3n36nelog1ε2(log33e6)\displaystyle\quad\frac{\log T}{\sqrt{n}T}\leq\frac{\log 3}{\sqrt{n}3}\leq\frac{6}{\sqrt{n}e}\log\frac{1}{\varepsilon^{2}}\left(\frac{\log 3}{3}\cdot\frac{e}{6}\right)
6nelog1ε2ε2.\displaystyle\leq\frac{6}{\sqrt{n}e}\log\frac{1}{\varepsilon^{2}}\leq\varepsilon^{2}.

Based on the above results,

logTT2(nT)max{logTT2,logT(nT)}ε2.\frac{\log T}{T^{2}\wedge(\sqrt{n}T)}\leq\max\left\{\frac{\log T}{T^{2}},\frac{\log T}{(\sqrt{n}T)}\right\}\leq\varepsilon^{2}.
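As a quick numerical sanity check of the thresholds in Lemmas E.5 and E.6 (illustrative only; the (ε, n) pairs and the variable names below are arbitrary choices of ours):

import math

def check(eps, n):
    # Lemma E.5 threshold
    T_e5 = math.ceil(1.0 / eps + 1.0 / (math.sqrt(n) * eps ** 2))
    # Lemma E.6 threshold (requires eps in (0, 1/e])
    tail = (1.0 / (math.sqrt(n) * eps ** 2)) * math.log(1.0 / eps ** 2)
    T_e6 = math.ceil(3.0 * ((1.0 / eps) * math.log(1.0 / eps) + 2.0 * tail
                            + (1.0 if tail <= math.e / 6.0 else 0.0)))
    ok5 = 1.0 / min(T_e5 ** 2, math.sqrt(n) * T_e5) <= eps ** 2
    ok6 = math.log(T_e6) / min(T_e6 ** 2, math.sqrt(n) * T_e6) <= eps ** 2
    print(f"eps={eps}, n={n}: E.5 holds at T={T_e5}: {ok5}; E.6 holds at T={T_e6}: {ok6}")

for eps, n in [(0.2, 10), (0.1, 100), (0.05, 10000)]:
    check(eps, n)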

Appendix F Proofs of Main Theorems

Proof of Proposition B.1.

It is not hard to conclude that we only need to show

(j1,1Kjk=0Kj1𝝂k(j)2ϵ~2)=1.\mathbb{P}\Bigg{(}\exists\ j\geq 1,\frac{1}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\big{\|}\boldsymbol{\nu}_{k}^{(j)}\big{\|}^{2}\leq\tilde{\boldsymbol{\epsilon}}^{2}\Bigg{)}=1. (16)

For simplicity, we denote Vj1Kjk=0Kj1𝝂k(j)2V_{j}\triangleq\frac{1}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\big{\|}\boldsymbol{\nu}_{k}^{(j)}\big{\|}^{2}. To show (16), we first derive an in-expectation bound on VjV_{j}, which has been established in works such as [33].

With our basic assumptions, we have

f(𝐱k+1(j))\displaystyle\quad f\left(\mathbf{x}_{k+1}^{(j)}\right)
=f~(Proj(𝐱k(j)ηj𝝂k(j),𝒟))\displaystyle=\tilde{f}\Big{(}\mathrm{Proj}\big{(}\mathbf{x}_{k}^{(j)}-\eta_{j}\boldsymbol{\nu}_{k}^{(j)},\mathcal{D}\big{)}\Big{)}
f~(𝐱k(j)ηj𝝂k(j))\displaystyle\leq\tilde{f}\left(\mathbf{x}_{k}^{(j)}-\eta_{j}\boldsymbol{\nu}_{k}^{(j)}\right)
f~(𝐱k(j))<f~(𝐱k(j)),ηj𝝂k(j)>+L2ηj2𝝂k(j)2\displaystyle\leq\tilde{f}\left(\mathbf{x}_{k}^{(j)}\right)-\Bigl{<}\nabla\tilde{f}\left(\mathbf{x}_{k}^{(j)}\right),\eta_{j}\boldsymbol{\nu}_{k}^{(j)}\Bigr{>}+\frac{L}{2}\eta_{j}^{2}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}
=f(𝐱k(j))<f(𝐱k(j)),ηj𝝂k(j)>+L2ηj2𝝂k(j)2\displaystyle={f}\left(\mathbf{x}_{k}^{(j)}\right)-\Bigl{<}\nabla{f}\left(\mathbf{x}_{k}^{(j)}\right),\eta_{j}\boldsymbol{\nu}_{k}^{(j)}\Bigr{>}+\frac{L}{2}\eta_{j}^{2}\big{\|}\boldsymbol{\nu}_{k}^{(j)}\big{\|}^{2}
=f(𝐱k(j))+ηj2𝝂k(j)f(𝐱k(j))2ηj2f(𝐱k(j))2\displaystyle={f}\left(\mathbf{x}_{k}^{(j)}\right)+\frac{\eta_{j}}{2}\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla{f}\big{(}\mathbf{x}_{k}^{(j)}\big{)}\right\|^{2}-\frac{\eta_{j}}{2}\left\|\nabla{f}\big{(}\mathbf{x}_{k}^{(j)}\big{)}\right\|^{2}
ηj2(1Lηj)𝝂k(j)2,\displaystyle\quad-\frac{\eta_{j}}{2}\left(1-L\eta_{j}\right)\big{\|}\boldsymbol{\nu}_{k}^{(j)}\big{\|}^{2},

where the 2nd and 3rd steps are based on Assumption 3.3. Summing the above inequality from k=0k=0 to Kj1K_{j}-1,

f(𝐱~j)f(𝐱~j1)\displaystyle\quad f\left(\tilde{\mathbf{x}}_{j}\right)-f\left(\tilde{\mathbf{x}}_{j-1}\right)
=f(𝐱Kj(j))f(𝐱0(j))\displaystyle=f\big{(}\mathbf{x}_{K_{j}}^{(j)}\big{)}-f\big{(}\mathbf{x}_{0}^{(j)}\big{)}
ηj2k=0Kj1𝝂k(j)f(𝐱k(j))2ηj2(1Lηj)KjVj.\displaystyle\leq\frac{\eta_{j}}{2}\sum\limits_{k=0}\limits^{K_{j}-1}\Big{\|}\boldsymbol{\nu}_{k}^{(j)}-\nabla{f}\big{(}\mathbf{x}_{k}^{(j)}\big{)}\Big{\|}^{2}-\frac{\eta_{j}}{2}\left(1-L\eta_{j}\right)K_{j}V_{j}. (17)

Then,

𝔼(f(𝐱~j)f(𝐱~j1)|j1)\displaystyle\quad\mathbb{E}\big{(}f\left(\tilde{\mathbf{x}}_{j}\right)-f\left(\tilde{\mathbf{x}}_{j-1}\right)\big{|}\mathcal{F}_{j-1}\big{)}
𝔼(ηj2k=0Kj1𝝂k(j)f(𝐱k(j))2|j1)\displaystyle\leq\mathbb{E}\Bigg{(}\frac{\eta_{j}}{2}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla{f}\big{(}\mathbf{x}_{k}^{(j)}\big{)}\right\|^{2}\big{|}\mathcal{F}_{j-1}\Bigg{)}
ηj2(1Lηj)Kj𝔼(Vj|j1).\displaystyle\quad-\frac{\eta_{j}}{2}\left(1-L\eta_{j}\right)K_{j}\mathbb{E}\left(V_{j}\big{|}\mathcal{F}_{j-1}\right). (18)

For convenience, we abbreviate 𝔼(|j1)\mathbb{E}\left(\cdot|\mathcal{F}_{j-1}\right) as 𝔼j1()\mathbb{E}_{j-1}(\cdot). For k=1,2,,Kj1k=1,2,\ldots,K_{j}-1,

𝔼j1𝝂k(j)f(𝐱k(j))2\displaystyle\quad\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla{f}\left(\mathbf{x}_{k}^{(j)}\right)\right\|^{2}
=𝔼j1𝔼(𝝂k(j)f(𝐱k(j))2|j,k1)\displaystyle=\mathbb{E}_{j-1}\mathbb{E}\Bigg{(}\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla{f}\left(\mathbf{x}_{k}^{(j)}\right)\right\|^{2}\big{|}\mathcal{F}_{j,k-1}\Bigg{)}
=𝔼j1𝔼(1bjik(j)fi(𝐱k(j))1bjik(j)fi(𝐱k1(j))\displaystyle=\mathbb{E}_{j-1}\mathbb{E}\Bigg{(}\Big{\|}\frac{1}{b_{j}}\sum\limits_{i\in\mathcal{I}_{k}^{(j)}}\nabla f_{i}(\mathbf{x}_{k}^{(j)})-\frac{1}{b_{j}}\sum\limits_{i\in\mathcal{I}_{k}^{(j)}}\nabla f_{i}(\mathbf{x}_{k-1}^{(j)})
+f(𝐱k1(j))f(𝐱k(j))2|j,k1)\displaystyle\quad+\nabla f(\mathbf{x}_{k-1}^{(j)})-\nabla f(\mathbf{x}_{k}^{(j)})\Big{\|}^{2}\Big{|}\mathcal{F}_{j,k-1}\Bigg{)}
+𝔼j1𝝂k1(j)f(𝐱k1(j))2\displaystyle\quad+\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{k-1}^{(j)}-\nabla f(\mathbf{x}_{k-1}^{(j)})\right\|^{2}
≤𝔼j11bj2ik(j)𝔼(fi(𝐱k(j))fi(𝐱k1(j))\displaystyle\leq\mathbb{E}_{j-1}\frac{1}{b_{j}^{2}}\sum\limits_{i\in\mathcal{I}_{k}^{(j)}}\mathbb{E}\Bigg{(}\Big{\|}\nabla f_{i}(\mathbf{x}_{k}^{(j)})-\nabla f_{i}(\mathbf{x}_{k-1}^{(j)})
+f(𝐱k1(j))f(𝐱k(j))2|j,k1)\displaystyle\quad+\nabla f(\mathbf{x}_{k-1}^{(j)})-\nabla f(\mathbf{x}_{k}^{(j)})\Big{\|}^{2}\Big{|}\mathcal{F}_{j,k-1}\Bigg{)}
+𝔼j1𝝂k1(j)f(𝐱k1(j))2\displaystyle\quad+\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{k-1}^{(j)}-\nabla f(\mathbf{x}_{k-1}^{(j)})\right\|^{2}
4L2bj𝔼j1𝐱k(j)𝐱k1(j)2+𝔼j1𝝂k1(j)f(𝐱k1(j))2\displaystyle\leq\frac{4L^{2}}{b_{j}}\mathbb{E}_{j-1}\left\|\mathbf{x}_{k}^{(j)}-\mathbf{x}_{k-1}^{(j)}\right\|^{2}+\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{k-1}^{(j)}-\nabla f(\mathbf{x}_{k-1}^{(j)})\right\|^{2}
4L2ηj2bj𝔼j1𝝂k1(j)2+𝔼j1𝝂k1(j)f(𝐱k1(j))2\displaystyle\leq\frac{4L^{2}\eta_{j}^{2}}{b_{j}}\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{k-1}^{(j)}\right\|^{2}+\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{k-1}^{(j)}-\nabla f(\mathbf{x}_{k-1}^{(j)})\right\|^{2}
4L2ηj2bjt=0k1𝔼j1𝝂t(j)2+𝔼j1𝝂0(j)f(𝐱0(j))2\displaystyle\leq\frac{4L^{2}\eta_{j}^{2}}{b_{j}}\sum\limits_{t=0}\limits^{k-1}\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{t}^{(j)}\right\|^{2}+\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{0}^{(j)}-\nabla f(\mathbf{x}_{0}^{(j)})\right\|^{2}
4L2ηj2Kjbj𝔼j1Vj+𝔼j1𝝂0(j)f(𝐱0(j))2.\displaystyle\leq\frac{4L^{2}\eta_{j}^{2}K_{j}}{b_{j}}\mathbb{E}_{j-1}V_{j}+\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{0}^{(j)}-\nabla f(\mathbf{x}_{0}^{(j)})\right\|^{2}. (19)

Based on (18) and (19),

𝔼(f(𝐱~j)f(𝐱~j1))\displaystyle\quad\mathbb{E}\big{(}f\left(\tilde{\mathbf{x}}_{j}\right)-f\left(\tilde{\mathbf{x}}_{j-1}\right)\big{)}
ηjKj2(1Lηj4L2ηj2Kjbj)𝔼Vj\displaystyle\leq-\frac{\eta_{j}K_{j}}{2}\left(1-L\eta_{j}-\frac{4L^{2}\eta_{j}^{2}K_{j}}{b_{j}}\right)\mathbb{E}V_{j}
+ηjKj2𝔼𝝂0(j)f(𝐱0(j))2\displaystyle\quad+\frac{\eta_{j}K_{j}}{2}\mathbb{E}\left\|\boldsymbol{\nu}_{0}^{(j)}-\nabla f(\mathbf{x}_{0}^{(j)})\right\|^{2}
Kj16L𝔼Vj+Kj8L𝔼𝝂0(j)f(𝐱0(j))2,\displaystyle\leq-\frac{K_{j}}{16L}\mathbb{E}V_{j}+\frac{K_{j}}{8L}\mathbb{E}\left\|\boldsymbol{\nu}_{0}^{(j)}-\nabla f(\mathbf{x}_{0}^{(j)})\right\|^{2}, (20)

where the second step is based on the choice of ηj14L\eta_{j}\equiv\frac{1}{4L} and bjKjb_{j}\geq K_{j}, j1j\geq 1. Let us define J0=min{j:Bj=n}J_{0}=\min\{j:B_{j}=n\}. For jJ0j\geq J_{0}, the outer batch uses all nn samples, so 𝝂0(j)=f(𝐱0(j))\boldsymbol{\nu}_{0}^{(j)}=\nabla f(\mathbf{x}_{0}^{(j)}) and the second term in (20) vanishes; hence, based on (20),

116L𝔼VjKj16L𝔼Vj𝔼(f(𝐱~j1)f(𝐱~j)).\frac{1}{16L}\mathbb{E}V_{j}\leq\frac{K_{j}}{16L}\mathbb{E}V_{j}\leq\mathbb{E}\left(f\left(\tilde{\mathbf{x}}_{j-1}\right)-f\left(\tilde{\mathbf{x}}_{j}\right)\right).

Then for any m+m\in\mathbb{Z}_{+},

(Vj>ϵ~2,j1)\displaystyle\quad\mathbb{P}\left(V_{j}>\tilde{\boldsymbol{\epsilon}}^{2},\ \forall j\geq 1\right)
(VJ0+VJ0+1++VJ0+m>(m+1)ϵ~2)\displaystyle\leq\mathbb{P}\left(V_{J_{0}}+V_{J_{0}+1}+\ldots+V_{J_{0}+m}>(m+1)\tilde{\boldsymbol{\epsilon}}^{2}\right)
𝔼(VJ0+VJ0+1++VJ0+m)(m+1)ϵ~2\displaystyle\leq\frac{\mathbb{E}\left(V_{J_{0}}+V_{J_{0}+1}+\ldots+V_{J_{0}+m}\right)}{(m+1)\tilde{\boldsymbol{\epsilon}}^{2}}
16L(m+1)ϵ~2j=J0J0+m𝔼(f(𝐱~j1)f(𝐱~j))\displaystyle\leq\frac{16L}{(m+1)\tilde{\boldsymbol{\epsilon}}^{2}}\sum\limits_{j=J_{0}}\limits^{J_{0}+m}\mathbb{E}\left(f\left(\tilde{\mathbf{x}}_{j-1}\right)-f\left(\tilde{\mathbf{x}}_{j}\right)\right)
16LΔf(m+1)ϵ~2.\displaystyle\leq\frac{16L\Delta_{f}}{(m+1)\tilde{\boldsymbol{\epsilon}}^{2}}.

Since mm can be arbitrarily large, we know

(Vj>ϵ~2,j1)=0,\mathbb{P}\left(V_{j}>\tilde{\boldsymbol{\epsilon}}^{2},\ \forall j\geq 1\right)=0,

which can directly lead to (16). ∎
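For orientation only, the recursion analyzed in the proof above corresponds to the following minimal Python sketch of one outer iteration of a projected recursive-gradient inner loop; the oracle names grad_i, grad_anchor, and proj are placeholders we introduce here, and the parameters are schematic rather than the exact schedules of Theorem 3.1.

import numpy as np

def sarah_inner_loop(x0, grad_i, grad_anchor, proj, n, K, b, eta, rng):
    """One outer iteration: anchor gradient estimate nu_0, then K projected
    steps driven by the recursive gradient estimator nu_k."""
    x_prev, nu = x0, grad_anchor(x0)                 # nu_0^{(j)}; full gradient when B_j = n
    x = proj(x_prev - eta * nu)                      # x_1^{(j)}
    sq_norms = [float(np.linalg.norm(nu) ** 2)]
    for _ in range(1, K):
        batch = rng.choice(n, size=b, replace=False)
        # nu_k = nu_{k-1} + (1/b) * sum_{i in batch} [grad_i(x_k) - grad_i(x_{k-1})]
        nu = nu + np.mean([grad_i(i, x) - grad_i(i, x_prev) for i in batch], axis=0)
        x_prev, x = x, proj(x - eta * nu)
        sq_norms.append(float(np.linalg.norm(nu) ** 2))
    return x, sum(sq_norms) / K                      # plays the role of (x_{K_j}^{(j)}, V_j)

The second returned value is the average V_{j}=\frac{1}{K_{j}}\sum_{k=0}^{K_{j}-1}\|\boldsymbol{\nu}_{k}^{(j)}\|^{2} whose behaviour Proposition B.1 controls.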

Proof of Theorem 3.2.

In this proof, for simplicity, we denote 𝔼[|k]\mathbb{E}\left[\cdot|\mathcal{F}_{k}\right] by 𝔼k[]\mathbb{E}_{k}\left[\cdot\right]. Let 𝐬0=𝟎\mathbf{s}_{0}=\mathbf{0} and 𝐬k=i=1k𝐳i,k1\mathbf{s}_{k}=\sum\limits_{i=1}\limits^{k}\mathbf{z}_{i},k\geq 1. For any 1kK1\leq k\leq K, consider

fk(t)=𝔼k1[cosh(λ𝐬k1+t𝐳k)],λ>0,t0.f_{k}(t)=\mathbb{E}_{k-1}\left[cosh(\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|)\right],\ \lambda>0,t\geq 0.

Then,

fk(t)=12𝔼k1[λ𝐳k,𝐬k1+t𝐳k𝐬k1+t𝐳k(eλ𝐬k1+t𝐳keλ𝐬k1+t𝐳k)],f^{\prime}_{k}(t)=\frac{1}{2}\mathbb{E}_{k-1}\left[\frac{\lambda\langle\mathbf{z}_{k},\mathbf{s}_{k-1}+t\mathbf{z}_{k}\rangle}{\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|}\left(e^{\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|}-e^{-\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|}\right)\right],

and consequently,

fk(0)\displaystyle f^{\prime}_{k}(0) =12𝔼k1[λ𝐳k,𝐬k1𝐬k1(eλ𝐬k1eλ𝐬k1)]\displaystyle=\frac{1}{2}\mathbb{E}_{k-1}\left[\frac{\lambda\langle\mathbf{z}_{k},\mathbf{s}_{k-1}\rangle}{\|\mathbf{s}_{k-1}\|}\left(e^{\lambda\|\mathbf{s}_{k-1}\|}-e^{-\lambda\|\mathbf{s}_{k-1}\|}\right)\right]
=0,\displaystyle=0,
since \mathbb{E}_{k-1}[\mathbf{z}_{k}]=\mathbf{0} and \mathbf{s}_{k-1} is \mathcal{F}_{k-1}-measurable.

Next,

fk′′(t)\displaystyle\quad f^{\prime\prime}_{k}(t)
≤12𝔼k1[(λ2𝐳k,𝐬k1+t𝐳k2𝐬k1+t𝐳k2+λ𝐳k2𝐬k1+t𝐳k)eλ𝐬k1+t𝐳k\displaystyle\leq\frac{1}{2}\mathbb{E}_{k-1}\Biggl{[}\left(\frac{\lambda^{2}\langle\mathbf{z}_{k},\mathbf{s}_{k-1}+t\mathbf{z}_{k}\rangle^{2}}{\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|^{2}}+\frac{\lambda\|\mathbf{z}_{k}\|^{2}}{\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|}\right)e^{\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|}
+(λ2𝐳k,𝐬k1+t𝐳k2𝐬k1+t𝐳k2λ𝐳k2𝐬k1+t𝐳k)eλ𝐬k1+t𝐳k]\displaystyle\quad\quad+\left(\frac{\lambda^{2}\langle\mathbf{z}_{k},\mathbf{s}_{k-1}+t\mathbf{z}_{k}\rangle^{2}}{\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|^{2}}-\frac{\lambda\|\mathbf{z}_{k}\|^{2}}{\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|}\right)e^{-\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|}\Biggr{]}
=𝔼k1[λ2𝐳k,𝐬k1+t𝐳k2𝐬k1+t𝐳k2cosh(λ𝐬k1+t𝐳k)\displaystyle=\mathbb{E}_{k-1}\Biggl{[}\frac{\lambda^{2}\langle\mathbf{z}_{k},\mathbf{s}_{k-1}+t\mathbf{z}_{k}\rangle^{2}}{\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|^{2}}cosh(\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|)
+λ2𝐳k2λ𝐬k1+t𝐳ksinh(λ𝐬k1+t𝐳k)]\displaystyle\quad\quad+\frac{\lambda^{2}\|\mathbf{z}_{k}\|^{2}}{\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|}sinh(\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|)\Biggr{]}
𝔼k1[(λ2𝐳k,𝐬k1+t𝐳k2𝐬k1+t𝐳k2+λ2𝐳k2)cosh(λ𝐬k1+t𝐳k)]\displaystyle\leq\mathbb{E}_{k-1}\Biggl{[}\left(\frac{\lambda^{2}\langle\mathbf{z}_{k},\mathbf{s}_{k-1}+t\mathbf{z}_{k}\rangle^{2}}{\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|^{2}}+\lambda^{2}\|\mathbf{z}_{k}\|^{2}\right)cosh(\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|)\Biggr{]}
2λ2𝔼k1[𝐳k2cosh(λ𝐬k1+t𝐳k)]\displaystyle\leq 2\lambda^{2}\mathbb{E}_{k-1}\left[\|\mathbf{z}_{k}\|^{2}cosh(\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|)\right]
2λ2rk2𝔼k1[cosh(λ𝐬k1+t𝐳k)]\displaystyle\leq 2\lambda^{2}r_{k}^{2}\mathbb{E}_{k-1}\left[cosh(\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|)\right]
=2λ2rk2fk(t),\displaystyle=2\lambda^{2}r_{k}^{2}f_{k}(t),

where the 1st inequality drops the non-positive term -\lambda\mathbb{E}_{k-1}\left[\frac{\langle\mathbf{z}_{k},\mathbf{s}_{k-1}+t\mathbf{z}_{k}\rangle^{2}}{\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|^{3}}\sinh(\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|)\right] arising from the product rule, the 2nd inequality is based on the fact that \frac{\sinh(y)}{y}\leq\cosh(y) for y>0y>0, the 3rd inequality uses \langle\mathbf{z}_{k},\mathbf{s}_{k-1}+t\mathbf{z}_{k}\rangle^{2}\leq\|\mathbf{z}_{k}\|^{2}\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|^{2}, and the 4th inequality uses 𝐳krk\|\mathbf{z}_{k}\|\leq r_{k} with rkk1r_{k}\in\mathcal{F}_{k-1}.

According to Lemma 3 in [29],

fk(t)fk(0)exp(λ2rk2t2)=cosh(λ𝐬k1)exp(λ2rk2t2).f_{k}(t)\leq f_{k}(0)\text{exp}\left(\lambda^{2}r_{k}^{2}t^{2}\right)=cosh(\lambda\|\mathbf{s}_{k-1}\|)\text{exp}\left(\lambda^{2}r_{k}^{2}t^{2}\right).

Thus,

𝔼k1[cosh(λ𝐬k)]\displaystyle\quad\mathbb{E}_{k-1}\left[cosh(\lambda\|\mathbf{s}_{k}\|)\right]
=fk(1)cosh(λ𝐬k1)exp(λ2rk2).\displaystyle=f_{k}(1)\leq cosh(\lambda\|\mathbf{s}_{k-1}\|)\text{exp}\left(\lambda^{2}r_{k}^{2}\right). (21)

Now, let

Gk=cosh(λ𝐬k)exp(λ2i=1kri2),k=0,1,,K.G_{k}=cosh(\lambda\|\mathbf{s}_{k}\|)\text{exp}\Big{(}-\lambda^{2}\sum\limits_{i=1}\limits^{k}r_{i}^{2}\Big{)},\ k=0,1,\ldots,K.

Note that G0=cosh(0)=1G_{0}=cosh(0)=1 and that, for k=0,1,,Kk=0,1,\ldots,K, GkG_{k} is measurable with respect to k\mathcal{F}_{k}. According to (21),

𝔼k1Gk\displaystyle\mathbb{E}_{k-1}G_{k} =exp(λ2i=1kri2)𝔼k1[cosh(λ𝐬k)]\displaystyle=\text{exp}\Big{(}-\lambda^{2}\sum\limits_{i=1}\limits^{k}r_{i}^{2}\Big{)}\mathbb{E}_{k-1}\left[cosh(\lambda\|\mathbf{s}_{k}\|)\right]
cosh(λ𝐬k1)exp(λ2i=1k1ri2)\displaystyle\leq cosh(\lambda\|\mathbf{s}_{k-1}\|)\text{exp}\Big{(}-\lambda^{2}\sum\limits_{i=1}\limits^{k-1}r_{i}^{2}\Big{)}
=Gk1,\displaystyle=G_{k-1},

which implies that {Gk}k=1K\{G_{k}\}_{k=1}^{K} is a non-negative super-martingale adapted to 0,1,,K\mathcal{F}_{0},\mathcal{F}_{1},\ldots,\mathcal{F}_{K}.

For any constant m>0m>0, if we define stopping time Tm=inf{t:𝐬tλi=1tri2+m}T_{m}=\inf\left\{t:\|\mathbf{s}_{t}\|\geq\lambda\sum\limits_{i=1}\limits^{t}r_{i}^{2}+m\right\}, we immediately know that GTmkG_{T_{m}\wedge k}, k0k\geq 0, is a supermartingale and

(1tk,𝐬tλi=1tri2+m)\displaystyle\quad\mathbb{P}\left(\exists 1\leq t\leq k,\|\mathbf{s}_{t}\|\geq\lambda\sum\limits_{i=1}\limits^{t}r_{i}^{2}+m\right)
=(𝐬Tmλi=1Tmri2+m,1Tmk)\displaystyle=\mathbb{P}\left(\|\mathbf{s}_{T_{m}}\|\geq\lambda\sum\limits_{i=1}\limits^{T_{m}}r_{i}^{2}+m,1\leq T_{m}\leq k\right)
=(𝐬Tmkλi=1Tmkri2+m,1Tmk)\displaystyle=\mathbb{P}\left(\|\mathbf{s}_{T_{m}\wedge k}\|\geq\lambda\sum\limits_{i=1}\limits^{T_{m}\wedge k}r_{i}^{2}+m,1\leq T_{m}\leq k\right)
(GTmkexp(λ2i=1Tmkri2)cosh(λ2i=1Tmkri2+mλ))\displaystyle\leq\mathbb{P}\left(G_{T_{m}\wedge k}\geq\text{exp}\Big{(}-\lambda^{2}\sum\limits_{i=1}\limits^{{T_{m}\wedge k}}r_{i}^{2}\Big{)}cosh\Big{(}\lambda^{2}\sum\limits_{i=1}\limits^{T_{m}\wedge k}r_{i}^{2}+m\lambda\Big{)}\right)
(GTmk12exp(λ2i=1Tmkri2+(λ2i=1Tmkri2+mλ)))\displaystyle\leq\mathbb{P}\Bigg{(}G_{T_{m}\wedge k}\geq\frac{1}{2}\text{exp}\Big{(}-\lambda^{2}\sum\limits_{i=1}\limits^{{T_{m}\wedge k}}r_{i}^{2}+\big{(}\lambda^{2}\sum\limits_{i=1}\limits^{{T_{m}\wedge k}}r_{i}^{2}+m\lambda\big{)}\Big{)}\Bigg{)}
=(2GTmkeλm)\displaystyle=\mathbb{P}\left(2G_{T_{m}\wedge k}\geq e^{\lambda m}\right)
2𝔼GTmkeλm\displaystyle\leq\frac{2\mathbb{E}G_{T_{m}\wedge k}}{e^{\lambda m}}
2eλm𝔼G0\displaystyle\leq 2e^{-\lambda m}\mathbb{E}G_{0}
=2eλm,\displaystyle=2e^{-\lambda m},

where the 4th step is based on the fact that cosh(y)12eycosh(y)\geq\frac{1}{2}e^{y} for all yy\in\mathbb{R}, the 6th step is by Markov’s inequality, and the 7th step is based on the supermartingale property via optional stopping, which gives 𝔼GTmk𝔼G0=1\mathbb{E}G_{T_{m}\wedge k}\leq\mathbb{E}G_{0}=1.

Therefore, if we let λm=log2δ\lambda m=\log\frac{2}{\delta},

(1tk,𝐬tλi=1tri2+1λlog2δ)δ.\mathbb{P}\left(\exists 1\leq t\leq k,\|\mathbf{s}_{t}\|\geq\lambda\sum\limits_{i=1}\limits^{t}r_{i}^{2}+\frac{1}{\lambda}\log\frac{2}{\delta}\right)\leq\delta.

Since kk can be up to KK,

(1tK,𝐬tλi=1tri2+1λlog2δ)δ.\mathbb{P}\left(\exists 1\leq t\leq K,\|\mathbf{s}_{t}\|\geq\lambda\sum\limits_{i=1}\limits^{t}r_{i}^{2}+\frac{1}{\lambda}\log\frac{2}{\delta}\right)\leq\delta.

The final conclusion can be obtained immediately by following similar steps given in the proof of Corollary 8 from [15]. ∎
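As an illustrative Monte-Carlo check of the maximal bound just obtained (not part of the analysis; the martingale differences below are independent vectors drawn uniformly from spheres of fixed radii r_i, which is one admissible bounded case, and all constants are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
d, K, delta, lam = 10, 50, 0.05, 0.5
r = rng.uniform(0.5, 1.5, size=K)            # fixed per-step norm bounds r_i

def path_exceeds():
    s, cum_r2 = np.zeros(d), 0.0
    for i in range(K):
        z = rng.normal(size=d)
        z *= r[i] / np.linalg.norm(z)        # mean-zero increment with ||z_i|| = r_i
        s += z
        cum_r2 += r[i] ** 2
        if np.linalg.norm(s) >= lam * cum_r2 + np.log(2.0 / delta) / lam:
            return True
    return False

freq = float(np.mean([path_exceeds() for _ in range(5000)]))
print(f"empirical exceedance frequency {freq:.4f} (bound: delta = {delta})")

For any fixed λ>0\lambda>0 the observed frequency should stay below δ\delta.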

Proof of Proposition D.1.

Recall that

ϵ0(j)=𝝂0(j)f(𝐱0(j))=1Bjijfi(𝐱0(j))f(𝐱0(j)),\boldsymbol{\epsilon}_{0}^{(j)}=\boldsymbol{\nu}_{0}^{(j)}-\nabla f\left(\mathbf{x}_{0}^{(j)}\right)=\frac{1}{B_{j}}\sum\limits_{i\in\mathcal{I}_{j}}\nabla f_{i}\left(\mathbf{x}_{0}^{(j)}\right)-\nabla f\left(\mathbf{x}_{0}^{(j)}\right),

where j\mathcal{I}_{j} is sampled without replacement. Since fi(𝐱0(j))αM\Big{\|}\nabla f_{i}\big{(}\mathbf{x}_{0}^{(j)}\big{)}\Big{\|}\leq\alpha_{M}, i=1,2,,ni=1,2,\ldots,n, based on Proposition E.1,

(ϵ0(j)t|j,1)3exp(Bjt264αM2)𝟏{Bj<n}.\mathbb{P}\left(\|\boldsymbol{\epsilon}_{0}^{(j)}\|\geq t|\mathcal{F}_{j,-1}\right)\leq 3\text{exp}\left(-\frac{B_{j}t^{2}}{64\alpha_{M}^{2}}\right)\mathbf{1}\left\{B_{j}<n\right\}. (22)

Next, if we suppose m(j)={im,1(j),im,2(j),,im,bj(j)}\mathcal{I}_{m}^{(j)}=\left\{i_{m,1}^{(j)},i_{m,2}^{(j)},\ldots,i_{m,b_{j}}^{(j)}\right\}, where im,t1(j)im,t2(j)i_{m,t_{1}}^{(j)}\neq i_{m,t_{2}}^{(j)} for any 1t1<t2bj1\leq t_{1}<t_{2}\leq b_{j}, we have

ϵm(j)\displaystyle\quad\boldsymbol{\epsilon}_{m}^{(j)}
=1bjim(j)[fi(𝐱m(j))f(𝐱m(j))+f(𝐱m1(j))fi(𝐱m1(j))]\displaystyle=\frac{1}{b_{j}}\sum\limits_{i\in\mathcal{I}_{m}^{(j)}}\Biggl{[}\nabla f_{i}(\mathbf{x}_{m}^{(j)})-\nabla f(\mathbf{x}_{m}^{(j)})+\nabla f(\mathbf{x}_{m-1}^{(j)})-\nabla f_{i}(\mathbf{x}_{m-1}^{(j)})\Biggr{]}
=r=1bj1bj[fim,r(j)(𝐱m(j))f(𝐱m(j))+f(𝐱m1(j))fim,r(j)(𝐱m1(j))]\displaystyle=\sum\limits_{r=1}\limits^{b_{j}}\frac{1}{b_{j}}\Biggl{[}\nabla f_{i_{m,r}^{(j)}}(\mathbf{x}_{m}^{(j)})-\nabla f(\mathbf{x}_{m}^{(j)})+\nabla f(\mathbf{x}_{m-1}^{(j)})-\nabla f_{i_{m,r}^{(j)}}(\mathbf{x}_{m-1}^{(j)})\Biggr{]}
r=1bj𝝆(m1)bj+r(j).\displaystyle\triangleq\sum\limits_{r=1}\limits^{b_{j}}\boldsymbol{\rho}_{(m-1)b_{j}+r}^{(j)}.

Let

~0(j)=j,0\tilde{\mathcal{F}}_{0}^{(j)}=\mathcal{F}_{j,0}

and

~a1bj+a2(j)=σ(~a1bj+a21(j)σ(ia1+1,a2(j)))\tilde{\mathcal{F}}_{a_{1}b_{j}+a_{2}}^{(j)}=\sigma\left(\tilde{\mathcal{F}}_{a_{1}b_{j}+a_{2}-1}^{(j)}\bigcup\sigma\big{(}i_{a_{1}+1,a_{2}}^{(j)}\big{)}\right)

for a1=0,1,2,a_{1}=0,1,2,\ldots and a2=1,2,,bja_{2}=1,2,\ldots,b_{j}. Then, we can see that {𝝆s(j)}s=1kbj\big{\{}\boldsymbol{\rho}_{s}^{(j)}\big{\}}_{s=1}^{kb_{j}} is a martingale difference sequence adapted to {~s(j)}s=0kbj\big{\{}\tilde{\mathcal{F}}_{s}^{(j)}\big{\}}_{s=0}^{kb_{j}}.

Notice that for m=1,2,,km=1,2,\ldots,k and r=1,2,,bjr=1,2,\ldots,b_{j},

𝝆(m1)bj+r(j)\displaystyle\quad\big{\|}\boldsymbol{\rho}_{(m-1)b_{j}+r}^{(j)}\big{\|}
=1bj[fim,r(j)(𝐱m(j))f(𝐱m(j))+f(𝐱m1(j))fim,r(j)(𝐱m1(j))]\displaystyle=\left\|\frac{1}{b_{j}}\big{[}\nabla f_{i_{m,r}^{(j)}}(\mathbf{x}_{m}^{(j)})-\nabla f(\mathbf{x}_{m}^{(j)})+\nabla f(\mathbf{x}_{m-1}^{(j)})-\nabla f_{i_{m,r}^{(j)}}(\mathbf{x}_{m-1}^{(j)})\big{]}\right\|
2Lbj𝐱m(j)𝐱m1(j).\displaystyle\leq\frac{2L}{b_{j}}\left\|\mathbf{x}_{m}^{(j)}-\mathbf{x}_{m-1}^{(j)}\right\|.

Therefore, based on Theorem 3.2, for any fixed δ>0\delta^{\prime}>0, B>b>0B>b>0, with probability at least 1δ1-\delta^{\prime}, either

(σk(j))2\displaystyle\left(\sigma_{k}^{(j)}\right)^{2} m=1kr=1bj(2Lbj𝐱m(j)𝐱m1(j))2\triangleq\sum\limits_{m=1}\limits^{k}\sum\limits_{r=1}\limits^{b_{j}}\left(\frac{2L}{b_{j}}\left\|\mathbf{x}_{m}^{(j)}-\mathbf{x}_{m-1}^{(j)}\right\|\right)^{2}
=4L2bjm=1k𝐱m(j)𝐱m1(j)2\displaystyle=\frac{4L^{2}}{b_{j}}\sum\limits_{m=1}\limits^{k}\left\|\mathbf{x}_{m}^{(j)}-\mathbf{x}_{m-1}^{(j)}\right\|^{2}
B,\displaystyle\geq B,

or

𝝂k(j)f(𝐱k(j))ϵ0(j)2\displaystyle\quad\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla f\left(\mathbf{x}_{k}^{(j)}\right)-\boldsymbol{\epsilon}_{0}^{(j)}\right\|^{2}
=s=1kbj𝝆s(j)2\displaystyle=\Big{\|}\sum\limits_{s=1}\limits^{kb_{j}}\boldsymbol{\rho}_{s}^{(j)}\Big{\|}^{2}
9max{(σk(j))2,b}(log2δ+loglogBb).\displaystyle\leq 9\max\left\{\big{(}\sigma_{k}^{(j)}\big{)}^{2},b\right\}\left(\log\frac{2}{\delta^{\prime}}+\log\log\frac{B}{b}\right).

Under the compact constraint,

(σk(j))24L2d12kbj.\big{(}\sigma_{k}^{(j)}\big{)}^{2}\leq\frac{4L^{2}d_{1}^{2}k}{b_{j}}.

Thus, if we let B=8L2d12kbjB=\frac{8L^{2}d_{1}^{2}k}{b_{j}} and b=4L2τkbjb=\frac{4L^{2}\tau k}{b_{j}} for some τ(0,1)\tau\in(0,1), the event (σk(j))2B\big{(}\sigma_{k}^{(j)}\big{)}^{2}\geq B has probability 0. Hence, with probability at least 1δ1-\delta^{\prime},

𝝂k(j)f(𝐱k(j))ϵ0(j)2\displaystyle\quad\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla f\left(\mathbf{x}_{k}^{(j)}\right)-\boldsymbol{\epsilon}_{0}^{(j)}\right\|^{2}
9((σk(j))2+4L2τkbj)(log2δ+loglog2d12τ)\displaystyle\leq 9\left(\big{(}\sigma_{k}^{(j)}\big{)}^{2}+\frac{4L^{2}\tau k}{b_{j}}\right)\left(\log\frac{2}{\delta^{\prime}}+\log\log\frac{2d_{1}^{2}}{\tau}\right) (23)
9((σ~k(j))2+4L2τkbj)(log2δ+loglog2d12τ).\displaystyle\leq 9\left(\big{(}\tilde{\sigma}_{k}^{(j)}\big{)}^{2}+\frac{4L^{2}\tau k}{b_{j}}\right)\left(\log\frac{2}{\delta^{\prime}}+\log\log\frac{2d_{1}^{2}}{\tau}\right).

According to (22), with probability at least 1δ1-\delta^{\prime},

ϵ0(j)264αM2Bjlog3δ𝟏{Bj<n}.\big{\|}\boldsymbol{\epsilon}_{0}^{(j)}\big{\|}^{2}\leq\frac{64\alpha_{M}^{2}}{B_{j}}\log\frac{3}{\delta^{\prime}}\mathbf{1}\left\{B_{j}<n\right\}. (24)

Thus, combining (23) and (24), with probability at least 12δ1-2\delta^{\prime},

𝝂k(j)f(𝐱k(j))2\displaystyle\quad\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla f\left(\mathbf{x}_{k}^{(j)}\right)\right\|^{2}
18((σ~k(j))2+4L2τkbj)(log2δ+loglog2d12τ)\displaystyle\leq 18\left(\big{(}\tilde{\sigma}_{k}^{(j)}\big{)}^{2}+\frac{4L^{2}\tau k}{b_{j}}\right)\left(\log\frac{2}{\delta^{\prime}}+\log\log\frac{2d_{1}^{2}}{\tau}\right)
+128αM2Bjlog3δ𝟏{Bj<n}.\displaystyle\quad+\frac{128\alpha_{M}^{2}}{B_{j}}\log\frac{3}{\delta^{\prime}}\mathbf{1}\left\{B_{j}<n\right\}.

Proposition F.1 (Inner Loop Analysis).

Given Assumptions 3.2, 3.3 and 3.4, under the parameter setting given in Theorem 3.1, let Ω=j=1Ωj\Omega=\bigcup\limits_{j=1}\limits^{\infty}\Omega_{j}, where the definition of Ωj\Omega_{j} is given in (12). On Ω\Omega,

f(𝐱Kj(j))f(𝐱0(j))\displaystyle\quad f\left(\mathbf{x}_{K_{j}}^{(j)}\right)-f\left(\mathbf{x}_{0}^{(j)}\right)
116Lk=0Kj1𝝂k(j)2+L2ηjτjljKj2bj+ηjKjqj2,\displaystyle\leq-\frac{1}{16L}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2},

for all j+j\in\mathbb{Z}_{+}. The event Ω\Omega occurs with probability at least 1δ1-\delta.

Proof of Proposition F.1.

First, as shown in Section 3, Ω=j=1Ωj\Omega=\bigcup\limits_{j=1}\limits^{\infty}\Omega_{j} occurs with probability at least 1δ1-\delta.

Based on (17), on Ω\Omega,

f(𝐱Kj(j))f(𝐱0(j))\displaystyle\quad f\left(\mathbf{x}_{K_{j}}^{(j)}\right)-f\left(\mathbf{x}_{0}^{(j)}\right)
ηj2k=0Kj1[lj((σ~k(j))2+4L2τjkbj)+qj]\displaystyle\leq\frac{\eta_{j}}{2}\sum\limits_{k=0}\limits^{K_{j}-1}\left[l_{j}\left(\left(\tilde{\sigma}_{k}^{(j)}\right)^{2}+\frac{4L^{2}\tau_{j}k}{b_{j}}\right)+q_{j}\right]
ηj2(1Lηj)k=0Kj1𝝂k(j)2\displaystyle\quad-\frac{\eta_{j}}{2}(1-L\eta_{j})\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}
ηj2k=0Kj1(4L2ηj2ljbjm=1k𝝂m1(j)2)+2L2ηjτjljbjKj22\displaystyle\leq\frac{\eta_{j}}{2}\sum\limits_{k=0}\limits^{K_{j}-1}\left(\frac{4L^{2}\eta_{j}^{2}l_{j}}{b_{j}}\sum\limits_{m=1}\limits^{k}\left\|\boldsymbol{\nu}_{m-1}^{(j)}\right\|^{2}\right)+\frac{2L^{2}\eta_{j}\tau_{j}l_{j}}{b_{j}}\frac{K_{j}^{2}}{2}
+ηjKjqj2ηj2(1Lηj)k=0Kj1𝝂k(j)2\displaystyle\quad+\frac{\eta_{j}K_{j}q_{j}}{2}-\frac{\eta_{j}}{2}(1-L\eta_{j})\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}
2L2ηj3ljKjbjk=0Kj1𝝂k(j)2+L2ηjτjljKj2bj+ηjKjqj2\displaystyle\leq\frac{2L^{2}\eta_{j}^{3}l_{j}K_{j}}{b_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2}
ηj2(1Lηj)k=0Kj1𝝂k(j)2\displaystyle\quad-\frac{\eta_{j}}{2}(1-L\eta_{j})\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}
=ηj2(1Lηj4L2ηj2ljKjbj)k=0Kj1𝝂k(j)2\displaystyle=-\frac{\eta_{j}}{2}\left(1-L\eta_{j}-\frac{4L^{2}\eta_{j}^{2}l_{j}K_{j}}{b_{j}}\right)\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}
+L2ηjτjljKj2bj+ηjKjqj2\displaystyle\quad+\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2}
=18L(11414)k=0Kj1𝝂k(j)2+L2ηjτjljKj2bj+ηjKjqj2\displaystyle=-\frac{1}{8L}\left(1-\frac{1}{4}-\frac{1}{4}\right)\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2}
=116Lk=0Kj1𝝂k(j)2+L2ηjτjljKj2bj+ηjKjqj2,\displaystyle=-\frac{1}{16L}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2},

where the 5th step is based on our choices of ηj=14L\eta_{j}=\frac{1}{4L} and bj=ljKjb_{j}=l_{j}K_{j}, j=1,2,j=1,2,\ldots. ∎

Proof of Proposition D.3.

Firstly,

Δff(𝐱~2T)f(𝐱~T)=f(𝐱K2T(2T))f(𝐱0(T+1))\displaystyle\quad-\Delta_{f}\leq f\left(\tilde{\mathbf{x}}_{{2T}}\right)-f\left(\tilde{\mathbf{x}}_{{T}}\right)=f\left(\mathbf{x}_{K_{2T}}^{(2T)}\right)-f\left(\mathbf{x}_{0}^{(T+1)}\right)
j=T+12T[116Lk=0Kj1𝝂k(j)2+L2ηjτjljKj2bj+ηjKjqj2]\displaystyle\leq\sum\limits_{j=T+1}\limits^{2T}\left[\frac{-1}{16L}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2}\right] (25)
=j=T+12T[L2ηjτjljKj2bj+ηjKjqj2]116Lj=T+12Tk=0Kj1𝝂k(j)2,\displaystyle=\sum\limits_{j=T+1}\limits^{2T}\left[\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2}\right]-\frac{1}{16L}\sum\limits_{j=T+1}\limits^{2T}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2},

where the 3rd step is based on Proposition F.1.

For simplifying notations, we denote

ATj=T+12T[L2ηjτjljKj2bj+ηjKjqj2].A_{T}\triangleq\sum\limits_{j=T+1}\limits^{2T}\left[\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2}\right].

Then,

AT\displaystyle\quad A_{T}
=j=T+12T(LτjKj4+Kjqj8L)\displaystyle=\sum\limits_{j=T+1}\limits^{2T}\left(\frac{L\tau_{j}K_{j}}{4}+\frac{K_{j}q_{j}}{8L}\right)
=j=T+12T(Lj2n4j3+16αM2Lj2nlog12Cej4δ𝟏{j2<n})\displaystyle=\sum\limits_{j=T+1}\limits^{2T}\left(\frac{L\sqrt{j^{2}\wedge n}}{4j^{3}}+\frac{16\alpha_{M}^{2}}{L\sqrt{j^{2}\wedge n}}\log\frac{12C_{e}j^{4}}{\delta}\mathbf{1}\left\{j^{2}<n\right\}\right)
j=T+12T(L4j2+16αM2Lj2nlog12Cej4δ𝟏{j2<n})\displaystyle\leq\sum\limits_{j=T+1}\limits^{2T}\left(\frac{L}{4j^{2}}+\frac{16\alpha_{M}^{2}}{L\sqrt{j^{2}\wedge n}}\log\frac{12C_{e}j^{4}}{\delta}\mathbf{1}\left\{j^{2}<n\right\}\right)
j=T+12T(L4j2+16αM2Ljlog12Cej4δ)\displaystyle\leq\sum\limits_{j=T+1}\limits^{2T}\left(\frac{L}{4j^{2}}+\frac{16\alpha_{M}^{2}}{Lj}\log\frac{12C_{e}j^{4}}{\delta}\right) (26)
CeL4+j=T+12T16αM2Ljlog12Cej4δ\displaystyle\leq\frac{C_{e}L}{4}+\sum\limits_{j=T+1}\limits^{2T}\frac{16\alpha_{M}^{2}}{Lj}\log\frac{12C_{e}j^{4}}{\delta}
CeL4+j=T+12T16αM2LTlog12Ce(2T)4δ\displaystyle\leq\frac{C_{e}L}{4}+\sum\limits_{j=T+1}\limits^{2T}\frac{16\alpha_{M}^{2}}{LT}\log\frac{12C_{e}(2T)^{4}}{\delta}
=CeL4+16αM2Llog192CeT4δ\displaystyle=\frac{C_{e}L}{4}+\frac{16\alpha_{M}^{2}}{L}\log\frac{192C_{e}T^{4}}{\delta}
=CeL4+16αM2Llog192Ceδ+64αM2LlogT,\displaystyle=\frac{C_{e}L}{4}+\frac{16\alpha_{M}^{2}}{L}\log\frac{192C_{e}}{\delta}+\frac{64\alpha_{M}^{2}}{L}\log T,

where the 1st step is based on the choices ηj=14L\eta_{j}=\frac{1}{4L} and bj=ljKjb_{j}=l_{j}K_{j}, and the 2nd step is based on the choices Kj=Bj=j2nK_{j}=\sqrt{B_{j}}=\sqrt{j^{2}\wedge n} and δj=δ4Cej4\delta^{\prime}_{j}=\frac{\delta}{4C_{e}j^{4}}. According to Lemma E.5, as TT1T\geq T_{1},

1T2(nT)ε2320L(c1+Δf).\frac{1}{T^{2}\wedge(\sqrt{n}T)}\leq\frac{\varepsilon^{2}}{320L(c_{1}+\Delta_{f})}. (27)

According to Lemma E.6, as TT2T\geq T_{2},

logTT2(nT)ε2320Lc2.\frac{\log T}{T^{2}\wedge(\sqrt{n}T)}\leq\frac{\varepsilon^{2}}{320Lc_{2}}. (28)

If we suppose to the contrary that

1Kjk=0Kj1𝝂k(j)2>ε~2\frac{1}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}>\tilde{\varepsilon}^{2}

holds for all T+1j2TT+1\leq j\leq 2T, then we have

116Lj=T+12Tk=0Kj1𝝂k(j)2ε~216Lj=T+12TKjε~216Lj=T+12T(Tn)=ε~216LT2(nT).\displaystyle\quad\frac{1}{16L}\sum\limits_{j=T+1}\limits^{2T}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\geq\frac{\tilde{\varepsilon}^{2}}{16L}\sum\limits_{j=T+1}\limits^{2T}K_{j}\geq\frac{\tilde{\varepsilon}^{2}}{16L}\sum\limits_{j=T+1}\limits^{2T}\left(T\wedge\sqrt{n}\right)=\frac{\tilde{\varepsilon}^{2}}{16L}T^{2}\wedge(\sqrt{n}T).

By (26), (27), (28) and the above results,

80LT2(nT){Δf+AT116Lj=T+12Tk=0Kj1𝝂k(j)2}\displaystyle\quad\frac{80L}{T^{2}\wedge(\sqrt{n}T)}\left\{\Delta_{f}+A_{T}-\frac{1}{16L}\sum\limits_{j=T+1}\limits^{2T}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\right\}
80LT2(nT)(Δf+AT)5ε~2\displaystyle\leq\frac{80L}{T^{2}\wedge(\sqrt{n}T)}(\Delta_{f}+A_{T})-5\tilde{\varepsilon}^{2}
=80LT2(nT)(Δf+AT)ε2\displaystyle=\frac{80L}{T^{2}\wedge(\sqrt{n}T)}(\Delta_{f}+A_{T})-\varepsilon^{2}
80LT2(nT)(Δf+c1)+80Lc2logTT2(nT)ε2\displaystyle\leq\frac{80L}{T^{2}\wedge(\sqrt{n}T)}(\Delta_{f}+c_{1})+\frac{80Lc_{2}\log T}{T^{2}\wedge(\sqrt{n}T)}-\varepsilon^{2}
ε24+ε24ε2\displaystyle\leq\frac{\varepsilon^{2}}{4}+\frac{\varepsilon^{2}}{4}-\varepsilon^{2}
=ε22,\displaystyle=-\frac{\varepsilon^{2}}{2},

which contradicts (25). ∎

Proof of Proposition D.4.
εT\displaystyle\quad\varepsilon_{T}
=8L2τT+2qT\displaystyle=8L^{2}\tau_{T}+2q_{T}
=8L2T3+256αM2BTlog3δT𝟏{BT<n}\displaystyle=\frac{8L^{2}}{T^{3}}+\frac{256\alpha_{M}^{2}}{B_{T}}\log\frac{3}{\delta^{\prime}_{T}}\mathbf{1}\left\{B_{T}<n\right\}
8L2T3+256αM2T2log3δT\displaystyle\leq\frac{8L^{2}}{T^{3}}+\frac{256\alpha_{M}^{2}}{T^{2}}\log\frac{3}{\delta^{\prime}_{T}} (29)
=8L2T3+256αM2T2log12CeT4δ\displaystyle=\frac{8L^{2}}{T^{3}}+\frac{256\alpha_{M}^{2}}{T^{2}}\log\frac{12C_{e}T^{4}}{\delta}
(8L2+256αM2log12Ceδ)1T2+1024αM2T2logT\displaystyle\leq\left(8L^{2}+256\alpha_{M}^{2}\log\frac{12C_{e}}{\delta}\right)\frac{1}{T^{2}}+\frac{1024\alpha_{M}^{2}}{T^{2}}\log T
=c31T2+c4logTT2,\displaystyle=c_{3}\frac{1}{T^{2}}+c_{4}\frac{\log T}{T^{2}}, (30)

where the 2nd step is based on our choice of τT=1T3\tau_{T}=\frac{1}{T^{3}} and the 4th step is based on the choice of δT=δ4CeT4\delta^{\prime}_{T}=\frac{\delta}{4C_{e}T^{4}}. According to Lemma E.5, where we can simply let n=n=\infty, as TT3T\geq T_{3},

1T2ε24c3.\frac{1}{T^{2}}\leq\frac{\varepsilon^{2}}{4c_{3}}. (31)

Similarly, according to Lemma E.6 and Assumption 3.4, as TT4T\geq T_{4},

logTT2ε24c4.\frac{\log T}{T^{2}}\leq\frac{\varepsilon^{2}}{4c_{4}}. (32)

Combining (30), (31) and (32),

εTε22.\varepsilon_{T}\leq\frac{\varepsilon^{2}}{2}.

Proof of Corollary C.1.

This part follows a similar path to the complexity analysis in [11]. If Algorithm 1 stops within TT outer iterations, the first-order computational complexity is

𝒪~L,Δf,αM(T3(nT)).\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(T^{3}\wedge(nT)\right).

Thus, it is sufficient to show

Ti3(nTi)=𝒪~L,Δf,αM(1ε3nε2),i=1,2,3,4.T_{i}^{3}\wedge(nT_{i})=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon^{3}}\wedge\frac{\sqrt{n}}{\varepsilon^{2}}\right),\ i=1,2,3,4.

T13(nT1)\bullet\ T_{1}^{3}\wedge(nT_{1})

For simplicity, we let c~1=320L(c1+Δf)\tilde{c}_{1}=\sqrt{320L(c_{1}+\Delta_{f})} and consequently T1=c~1ε+c~12nε2T_{1}=\Big{\lceil}\frac{\tilde{c}_{1}}{\varepsilon}+\frac{\tilde{c}_{1}^{2}}{\sqrt{n}\varepsilon^{2}}\Big{\rceil}.

When nεc~1\sqrt{n}\varepsilon\leq\tilde{c}_{1}, c~1εc~12nε2\frac{\tilde{c}_{1}}{\varepsilon}\leq\frac{\tilde{c}_{1}^{2}}{\sqrt{n}\varepsilon^{2}} and consequently T1=𝒪(c~12nε2)T_{1}=\mathcal{O}\left(\frac{\tilde{c}_{1}^{2}}{\sqrt{n}\varepsilon^{2}}\right). Hence,

T13(nT1)=𝒪(nT1)=𝒪(nc~12ε2)=𝒪(nc~12ε2c~13ε3),T_{1}^{3}\wedge(nT_{1})=\mathcal{O}(nT_{1})=\mathcal{O}\left(\frac{\sqrt{n}\tilde{c}_{1}^{2}}{\varepsilon^{2}}\right)=\mathcal{O}\left(\frac{\sqrt{n}\tilde{c}_{1}^{2}}{\varepsilon^{2}}\wedge\frac{\tilde{c}_{1}^{3}}{\varepsilon^{3}}\right),

where the last step is due to nc~1ε\sqrt{n}\leq\frac{\tilde{c}_{1}}{\varepsilon}.

When nεc~1\sqrt{n}\varepsilon\geq\tilde{c}_{1}, c~1εc~12nε2\frac{\tilde{c}_{1}}{\varepsilon}\geq\frac{\tilde{c}_{1}^{2}}{\sqrt{n}\varepsilon^{2}} and consequently T1=𝒪(c~1ε)T_{1}=\mathcal{O}\left(\frac{\tilde{c}_{1}}{\varepsilon}\right). Hence,

T13(nT1)=𝒪(T13)=𝒪(c~13ε3)=𝒪(nc~12ε2c~13ε3),T_{1}^{3}\wedge(nT_{1})=\mathcal{O}(T_{1}^{3})=\mathcal{O}\left(\frac{\tilde{c}_{1}^{3}}{\varepsilon^{3}}\right)=\mathcal{O}\left(\frac{\sqrt{n}\tilde{c}_{1}^{2}}{\varepsilon^{2}}\wedge\frac{\tilde{c}_{1}^{3}}{\varepsilon^{3}}\right),

where the last step is due to c~1nε\tilde{c}_{1}\leq\sqrt{n}\varepsilon.

To sum up,

T13(nT1)=𝒪(nc~12ε2c~13ε3)=𝒪~L,Δf,αM(1ε3nε2).T_{1}^{3}\wedge(nT_{1})=\mathcal{O}\left(\frac{\sqrt{n}\tilde{c}_{1}^{2}}{\varepsilon^{2}}\wedge\frac{\tilde{c}_{1}^{3}}{\varepsilon^{3}}\right)=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon^{3}}\wedge\frac{\sqrt{n}}{\varepsilon^{2}}\right).

T23(nT2)\bullet\ T_{2}^{3}\wedge(nT_{2})

Secondly, if we let c~2=320Lc2\tilde{c}_{2}=\sqrt{320Lc_{2}}, we have c~24\tilde{c}_{2}\geq 4 based on Assumption 3.4. As a result, T2=Θ(3c~2εlog3c~2ε+2c~22nε2logc~22ε2)T_{2}=\Theta\left(\frac{3\tilde{c}_{2}}{\varepsilon}\log\frac{3\tilde{c}_{2}}{\varepsilon}+\frac{2\tilde{c}_{2}^{2}}{\sqrt{n}\varepsilon^{2}}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}\right). Therefore, it is equivalent to study T¯23(nT¯2)\bar{T}_{2}^{3}\wedge(n\bar{T}_{2}) where T¯2=3c~2εlog3c~2ε+2c~22nε2logc~22ε2\bar{T}_{2}=\frac{3\tilde{c}_{2}}{\varepsilon}\log\frac{3\tilde{c}_{2}}{\varepsilon}+\frac{2\tilde{c}_{2}^{2}}{\sqrt{n}\varepsilon^{2}}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}.

When c~21.5nε\tilde{c}_{2}\geq 1.5\sqrt{n}\varepsilon,

3c~2εlog3c~2ε3c~2εlogc~22ε22c~22nε2logc~22ε2.\frac{3\tilde{c}_{2}}{\varepsilon}\log\frac{3\tilde{c}_{2}}{\varepsilon}\leq\frac{3\tilde{c}_{2}}{\varepsilon}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}\leq\frac{2\tilde{c}_{2}^{2}}{\sqrt{n}\varepsilon^{2}}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}.

Thus, T¯2=𝒪(2c~22nε2logc~22ε2)\bar{T}_{2}=\mathcal{O}\left(\frac{2\tilde{c}_{2}^{2}}{\sqrt{n}\varepsilon^{2}}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}\right). Then,

T¯23(nT¯2)=𝒪(nT¯2)\displaystyle\quad\bar{T}_{2}^{3}\wedge(n\bar{T}_{2})=\mathcal{O}\left(n\bar{T}_{2}\right)
=𝒪(2nc~22ε2logc~22ε2)=𝒪(2nc~22ε2logc~22ε24c~233ε3logc~22ε2).\displaystyle=\mathcal{O}\left(\frac{2\sqrt{n}\tilde{c}_{2}^{2}}{\varepsilon^{2}}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}\right)=\mathcal{O}\left(\frac{2\sqrt{n}\tilde{c}_{2}^{2}}{\varepsilon^{2}}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}\wedge\frac{4\tilde{c}_{2}^{3}}{3\varepsilon^{3}}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}\right).

When c~21.5nε\tilde{c}_{2}\leq 1.5\sqrt{n}\varepsilon,

2c~22nε2logc~22ε2=4c~22nε2logc~2ε4c~22nε2log3c~2ε\displaystyle\quad\frac{2\tilde{c}_{2}^{2}}{\sqrt{n}\varepsilon^{2}}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}=\frac{4\tilde{c}_{2}^{2}}{\sqrt{n}\varepsilon^{2}}\log\frac{\tilde{c}_{2}}{\varepsilon}\leq\frac{4\tilde{c}_{2}^{2}}{\sqrt{n}\varepsilon^{2}}\log\frac{3\tilde{c}_{2}}{\varepsilon}
1.5εc~24c~22ε2log3c~2ε=6c~2εlog3c~2ε.\displaystyle\leq\frac{1.5\varepsilon}{\tilde{c}_{2}}\cdot\frac{4\tilde{c}_{2}^{2}}{\varepsilon^{2}}\log\frac{3\tilde{c}_{2}}{\varepsilon}=\frac{6\tilde{c}_{2}}{\varepsilon}\log\frac{3\tilde{c}_{2}}{\varepsilon}.

Thus, T¯2=𝒪(c~2εlog3c~2ε)\bar{T}_{2}=\mathcal{O}\left(\frac{\tilde{c}_{2}}{\varepsilon}\log\frac{3\tilde{c}_{2}}{\varepsilon}\right). Then

T¯23(nT¯2)=𝒪(T¯23)=𝒪(c~23ε3(log3c~2ε)3)\displaystyle\quad\bar{T}_{2}^{3}\wedge(n\bar{T}_{2})=\mathcal{O}\left(\bar{T}_{2}^{3}\right)=\mathcal{O}\left(\frac{\tilde{c}_{2}^{3}}{\varepsilon^{3}}\left(\log\frac{3\tilde{c}_{2}}{\varepsilon}\right)^{3}\right)
=𝒪(c~23ε3(log3c~2ε)31.5nc~22ε2(log3c~2ε)3).\displaystyle=\mathcal{O}\left(\frac{\tilde{c}_{2}^{3}}{\varepsilon^{3}}\left(\log\frac{3\tilde{c}_{2}}{\varepsilon}\right)^{3}\wedge\frac{1.5\sqrt{n}\tilde{c}_{2}^{2}}{\varepsilon^{2}}\left(\log\frac{3\tilde{c}_{2}}{\varepsilon}\right)^{3}\right).

To sum up,

T23(nT2)=𝒪~L,Δf,αM(1ε3nε2).{T}_{2}^{3}\wedge(n{T}_{2})=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon^{3}}\wedge\frac{\sqrt{n}}{\varepsilon^{2}}\right).

T33(nT3)\bullet\ T_{3}^{3}\wedge(nT_{3})

Since T3=Θ~L,Δf,αM(1ε)T_{3}=\tilde{\Theta}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon}\right), we can directly know that

T33(nT3)=𝒪~L,Δf,αM(1ε3nε)=𝒪~L,Δf,αM(1ε3nε2).{T}_{3}^{3}\wedge(n{T}_{3})=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon^{3}}\wedge\frac{{n}}{\varepsilon}\right)=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon^{3}}\wedge\frac{\sqrt{n}}{\varepsilon^{2}}\right).

T43(nT4)\bullet\ T_{4}^{3}\wedge(nT_{4})

Similar to the previous case,

T43(nT4)=𝒪~L,Δf,αM(1ε3nε2).T_{4}^{3}\wedge(nT_{4})=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon^{3}}\wedge\frac{\sqrt{n}}{\varepsilon^{2}}\right).
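The two regimes in the case analysis above can also be checked numerically (illustrative only; the stand-in value of \tilde{c}_1 and the (ε, n) grid are arbitrary):

import math

c1_tilde = 10.0                              # stands in for \tilde{c}_1 = \sqrt{320L(c_1+\Delta_f)}
for eps, n in [(1e-2, 1e4), (1e-3, 1e4), (1e-2, 1e8), (1e-4, 1e2)]:
    T1 = math.ceil(c1_tilde / eps + c1_tilde ** 2 / (math.sqrt(n) * eps ** 2))
    cost = min(T1 ** 3, n * T1)              # T_1^3 wedge (n T_1)
    target = min(c1_tilde ** 3 / eps ** 3, math.sqrt(n) * c1_tilde ** 2 / eps ** 2)
    print(f"eps={eps:.0e}, n={n:.0e}: cost / target = {cost / target:.2f}")

In both regimes the ratio stays bounded by a modest constant, consistent with T_{1}^{3}\wedge(nT_{1})=\mathcal{O}\big{(}\frac{\tilde{c}_{1}^{3}}{\varepsilon^{3}}\wedge\frac{\sqrt{n}\tilde{c}_{1}^{2}}{\varepsilon^{2}}\big{)}.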

Proof of Theorem C.2.

We can see that many results given under the setting of Theorem 3.1 can still apply under the current setting. If we still define Ωj\Omega_{j} as (12), Ω=j=1Ωj\Omega=\bigcup\limits_{j=1}\limits^{\infty}\Omega_{j} occurs with probability at least 1δ1-\delta.

Under the current setting, Proposition F.1 is still valid. Thus, on Ω\Omega, for any j+j\in\mathbb{Z}_{+},

f(𝐱Kj(j))f(𝐱0(j))\displaystyle\quad f\left(\mathbf{x}_{K_{j}}^{(j)}\right)-f\left(\mathbf{x}_{0}^{(j)}\right)
116Lk=0Kj1𝝂k(j)2+L2ηjτjljKj2bj+ηjKjqj2\displaystyle\leq-\frac{1}{16L}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2}
=116Lk=0Kj1𝝂k(j)2+L2ηjτjljKj2bj\displaystyle=-\frac{1}{16L}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}
=116Lk=0Kj1𝝂k(j)2+LτjKj4\displaystyle=-\frac{1}{16L}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{L\tau_{j}K_{j}}{4}
=116Lk=0Kj1𝝂k(j)2+nL4τj,\displaystyle=-\frac{1}{16L}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{\sqrt{n}L}{4}\tau_{j},

where the 2nd step is due to our choice of BjnB_{j}\equiv n and consequently qj0q_{j}\equiv 0, the 3rd step is based on our choices of ηj=14L\eta_{j}=\frac{1}{4L} and bj=ljKjb_{j}=l_{j}K_{j}, and the 4th step is based on our choice of Kj=nK_{j}=\sqrt{n}. Summing the above inequality from j=1j=1 to TT,

Δf0\displaystyle\quad-\Delta_{f}^{0}
=f(𝐱)f(𝐱0(1))\displaystyle=f\left(\mathbf{x}^{*}\right)-f\left(\mathbf{x}_{0}^{(1)}\right)
f(𝐱KT(T))f(𝐱0(1))\displaystyle\leq f\left(\mathbf{x}_{K_{T}}^{(T)}\right)-f\left(\mathbf{x}_{0}^{(1)}\right)
=j=1T(nL4τj116Lk=0Kj1𝝂k(j)2)\displaystyle=\sum\limits_{j=1}\limits^{T}\left(\frac{\sqrt{n}L}{4}\tau_{j}-\frac{1}{16L}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\right)
=nTε~232L116Lj=1Tk=0Kj1𝝂k(j)2,\displaystyle=\frac{\sqrt{n}T\tilde{\varepsilon}^{2}}{32L}-\frac{1}{16L}\sum\limits_{j=1}\limits^{T}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}, (33)

where the 2nd step is according to Assumption 3.1, the 4th step is based on our choice of τjε~28L2\tau_{j}\equiv\frac{\tilde{\varepsilon}^{2}}{8L^{2}}. We assert that when TT5T\geq T_{5}, there must exist a 1jT1\leq j\leq T such that

1Kjk=0Kj1𝝂k(j)2ε~2.\frac{1}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\leq\tilde{\varepsilon}^{2}.

If not,

nTε~232L116Lj=1Tk=0Kj1𝝂k(j)2\displaystyle\quad\frac{\sqrt{n}T\tilde{\varepsilon}^{2}}{32L}-\frac{1}{16L}\sum\limits_{j=1}\limits^{T}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}
=nTε~232L116Lj=1TKjKjk=0Kj1𝝂k(j)2\displaystyle=\frac{\sqrt{n}T\tilde{\varepsilon}^{2}}{32L}-\frac{1}{16L}\sum\limits_{j=1}\limits^{T}\frac{K_{j}}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}
nTε~232L116Lj=1Tε~2Kj\displaystyle\leq\frac{\sqrt{n}T\tilde{\varepsilon}^{2}}{32L}-\frac{1}{16L}\sum\limits_{j=1}\limits^{T}\tilde{\varepsilon}^{2}{K_{j}}
=nTε~232Lnε~2T16L\displaystyle=\frac{\sqrt{n}T\tilde{\varepsilon}^{2}}{32L}-\frac{\sqrt{n}\tilde{\varepsilon}^{2}T}{16L}
=nTε~232L\displaystyle=-\frac{\sqrt{n}T\tilde{\varepsilon}^{2}}{32L}
=nε2T160L\displaystyle=-\frac{\sqrt{n}\varepsilon^{2}T}{160L}
(Δf0+1),\displaystyle\leq-(\Delta_{f}^{0}+1),

which contradicts (33). Thus, on Ω\Omega, the first stopping rule will be met within at most TT outer iterations, while the second stopping rule is always satisfied. When both stopping rules are met, we can show that the output has the desired property. Let 1jT1\leq j\leq T and 0kKj0\leq k\leq K_{j} be such that

1Kjk=0Kj1𝝂k(j)2ε~2\frac{1}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\leq\tilde{\varepsilon}^{2}

and

𝝂k(j)2ε~2.\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\leq\tilde{\varepsilon}^{2}.

Then, on Ω\Omega,

f(𝐱^)2\displaystyle\quad\left\|\nabla f\left(\hat{\mathbf{x}}\right)\right\|^{2}
=f(𝐱k(j))2\displaystyle=\left\|\nabla f\big{(}\mathbf{x}_{k}^{(j)}\big{)}\right\|^{2}
2𝝂k(j)2+2𝝂k(j)f(𝐱k(j))2\displaystyle\leq 2\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+2\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla f\big{(}\mathbf{x}_{k}^{(j)}\big{)}\right\|^{2}
2ε~2+2𝝂k(j)f(𝐱k(j))2\displaystyle\leq 2\tilde{\varepsilon}^{2}+2\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla f\big{(}\mathbf{x}_{k}^{(j)}\big{)}\right\|^{2}
2ε~2+2lj(4L2ηj2bjm=1k𝝂m1(j)2+4L2τjkbj)\displaystyle\leq 2\tilde{\varepsilon}^{2}+2l_{j}\left(\frac{4L^{2}\eta_{j}^{2}}{b_{j}}\sum\limits_{m=1}\limits^{k}\left\|\boldsymbol{\nu}_{m-1}^{(j)}\right\|^{2}+\frac{4L^{2}\tau_{j}k}{b_{j}}\right)
2ε~2+2lj(4L2ηj2bjm=1Kj𝝂m1(j)2+4L2τjKjbj)\displaystyle\leq 2\tilde{\varepsilon}^{2}+2l_{j}\left(\frac{4L^{2}\eta_{j}^{2}}{b_{j}}\sum\limits_{m=1}\limits^{K_{j}}\left\|\boldsymbol{\nu}_{m-1}^{(j)}\right\|^{2}+\frac{4L^{2}\tau_{j}K_{j}}{b_{j}}\right)
2ε~2+2lj(4L2ηj2Kjε~2bj+4L2τjKjbj)\displaystyle\leq 2\tilde{\varepsilon}^{2}+2l_{j}\left(\frac{4L^{2}\eta_{j}^{2}K_{j}\tilde{\varepsilon}^{2}}{b_{j}}+\frac{4L^{2}\tau_{j}K_{j}}{b_{j}}\right)
=2ε~2+0.5ε~2+8L2τj\displaystyle=2\tilde{\varepsilon}^{2}+0.5\tilde{\varepsilon}^{2}+8L^{2}\tau_{j}
=3.5ε~2\displaystyle=3.5\tilde{\varepsilon}^{2}
ε2,\displaystyle\leq\varepsilon^{2},

where the 4th step is based on Proposition D.1, the 7th step is based on our choices of ηj=14L\eta_{j}=\frac{1}{4L} and bj=ljKjb_{j}=l_{j}K_{j}, the 8th step is based on our choice of τjε~28L2\tau_{j}\equiv\frac{\tilde{\varepsilon}^{2}}{8L^{2}}. ∎

Appendix G Supplementary Figures

Figure 3: Comparison of convergence with respect to the (1δ)(1-\delta)-quantile of the squared gradient norm (f2)\left(\|\nabla f\|^{2}\right) and the δ\delta-quantile of validation accuracy on the MNIST dataset for δ=0.1\delta=0.1 and δ=0.01\delta=0.01. The second (fourth) column presents zoom-in figures of those in the first (third) column. Top: δ=0.1\delta=0.1. Bottom: δ=0.01\delta=0.01. ‘bs’ stands for batch size. ‘sj=x’ means that the smallest batch size is approximately xlogxx\log x.

References

  • [1] Zeyuan Allen-Zhu and Elad Hazan “Variance reduction for faster non-convex optimization” In International conference on machine learning, 2016, pp. 699–707 PMLR
  • [2] Rémi Bardenet and Odalric-Ambrym Maillard “Concentration inequalities for sampling without replacement” In Bernoulli 21.3 Bernoulli Society for Mathematical Statistics and Probability, 2015, pp. 1361–1385
  • [3] Stéphane Boucheron, Gábor Lugosi and Pascal Massart “Concentration inequalities: A nonasymptotic theory of independence” Oxford university press, 2013
  • [4] Aaron Defazio, Francis Bach and Simon Lacoste-Julien “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives” In Advances in neural information processing systems, 2014, pp. 1646–1654
  • [5] Cong Fang, Chris Junchi Li, Zhouchen Lin and Tong Zhang “SPIDER: near-optimal non-convex optimization via stochastic path integrated differential estimator” In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 687–697
  • [6] Saeed Ghadimi and Guanghui Lan “Stochastic first- and zeroth-order methods for nonconvex stochastic programming” In SIAM Journal on Optimization 23.4 SIAM, 2013, pp. 2341–2368
  • [7] Ian Goodfellow, Yoshua Bengio and Aaron Courville “Deep learning” MIT press, 2016
  • [8] Nicholas JA Harvey, Christopher Liaw, Yaniv Plan and Sikander Randhawa “Tight analyses for non-smooth stochastic gradient descent” In Conference on Learning Theory, 2019, pp. 1579–1613 PMLR
  • [9] Nicholas JA Harvey, Christopher Liaw and Sikander Randhawa “Simple and optimal high-probability bounds for strongly-convex stochastic gradient descent” In arXiv preprint arXiv:1909.00843, 2019
  • [10] Wassily Hoeffding “Probability Inequalities for Sums of Bounded Random Variables” In Journal of the American Statistical Association 58.301, 1963, pp. 13–30
  • [11] Samuel Horváth, Lihua Lei, Peter Richtárik and Michael I Jordan “Adaptivity of stochastic gradient methods for nonconvex optimization” In arXiv preprint arXiv:2002.05359, 2020
  • [12] Prateek Jain, Dheeraj Nagaraj and Praneeth Netrapalli “Making the last iterate of sgd information theoretically optimal” In Conference on Learning Theory, 2019, pp. 1752–1755 PMLR
  • [13] Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani “An introduction to statistical learning” Springer, 2013
  • [14] Kaiyi Ji, Zhe Wang, Bowen Weng, Yi Zhou, Wei Zhang and Yingbin Liang “History-gradient aided batch size adaptation for variance reduced algorithms” In International Conference on Machine Learning, 2020, pp. 4762–4772 PMLR
  • [15] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade and Michael I Jordan “A short note on concentration inequalities for random vectors with subgaussian norm” In arXiv preprint arXiv:1902.03736, 2019
  • [16] Rie Johnson and Tong Zhang “Accelerating stochastic gradient descent using predictive variance reduction” In Advances in neural information processing systems 26, 2013, pp. 315–323
  • [17] Sham M Kakade and Ambuj Tewari “On the generalization ability of online strongly convex programming algorithms” In Advances in Neural Information Processing Systems, 2009, pp. 801–808
  • [18] Nicolas Le Roux, Mark Schmidt and Francis Bach “A stochastic gradient method with an exponential convergence rate for finite training sets” In Proceedings of the 25th International Conference on Neural Information Processing Systems-Volume 2, 2012, pp. 2663–2671
  • [19] Lihua Lei and Michael Jordan “Less than a single pass: Stochastically controlled stochastic gradient” In Artificial Intelligence and Statistics, 2017, pp. 148–156 PMLR
  • [20] Lihua Lei and Michael I Jordan “On the adaptivity of stochastic gradient-based optimization” In SIAM Journal on Optimization 30.2 SIAM, 2020, pp. 1473–1500
  • [21] Lihua Lei, Cheng Ju, Jianbo Chen and Michael I Jordan “Non-convex finite-sum optimization via SCSG methods” In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 2345–2355
  • [22] Xiaoyu Li and Francesco Orabona “A high probability analysis of adaptive sgd with momentum” In arXiv preprint arXiv:2007.14294, 2020
  • [23] Zhize Li “SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points” In Advances in Neural Information Processing Systems 32, 2019, pp. 1523–1533
  • [24] Zhize Li, Hongyan Bao, Xiangliang Zhang and Peter Richtárik “PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization” In International Conference on Machine Learning, 2021, pp. 6286–6295 PMLR
  • [25] Zhize Li and Jian Li “A simple proximal stochastic gradient method for nonsmooth nonconvex optimization” In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 5569–5579
  • [26] Yurii Nesterov “Introductory lectures on convex optimization: A basic course” Springer Science & Business Media, 2003
  • [27] Lam M Nguyen, Jie Liu, Katya Scheinberg and Martin Takáč “SARAH: A novel method for machine learning problems using stochastic recursive gradient” In International Conference on Machine Learning, 2017, pp. 2613–2621 PMLR
  • [28] Lam M Nguyen, Jie Liu, Katya Scheinberg and Martin Takáč “Stochastic recursive gradient algorithm for nonconvex optimization” In arXiv preprint arXiv:1705.07261, 2017
  • [29] Iosif Pinelis “An approach to inequalities for the distributions of infinite-dimensional martingales” In Probability in Banach Spaces, 8: Proceedings of the Eighth International Conference, 1992, pp. 128–134 Springer
  • [30] Iosif Pinelis “Optimum bounds for the distributions of martingales in Banach spaces” In The Annals of Probability JSTOR, 1994, pp. 1679–1706
  • [31] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos and Alex Smola “Stochastic variance reduction for nonconvex optimization” In International conference on machine learning, 2016, pp. 314–323 PMLR
  • [32] Quoc Tran-Dinh, Nhan H Pham, Dzung T Phan and Lam M Nguyen “Hybrid stochastic gradient descent algorithms for stochastic nonconvex optimization” In arXiv preprint arXiv:1905.05920, 2019
  • [33] Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang and Vahid Tarokh “Spiderboost and momentum: Faster variance reduction algorithms” In Advances in Neural Information Processing Systems 32, 2019, pp. 2406–2416
  • [34] Dongruo Zhou, Jinghui Chen, Yuan Cao, Yiqi Tang, Ziyan Yang and Quanquan Gu “On the convergence of adaptive gradient methods for nonconvex optimization” In arXiv preprint arXiv:1808.05671, 2018