
Probabilistic Guarantees of Stochastic Recursive Gradient in Non-Convex Finite Sum Problems

Yanjie Zhong, Department of Statistics and Data Science, Washington University in St. Louis; Jiaqi Li (corresponding author: [email protected]), Department of Statistics and Data Science, Washington University in St. Louis; Soumendra Lahiri, Department of Statistics and Data Science, Washington University in St. Louis
Abstract

This paper develops a new dimension-free Azuma-Hoeffding type bound on the summation norm of a martingale difference sequence with random individual bounds. With this novel result, we provide high-probability bounds for the gradient norm estimator in the proposed algorithm Prob-SARAH, which is a modified version of the StochAstic Recursive grAdient algoritHm (SARAH), a state-of-the-art variance-reduced algorithm that achieves optimal computational complexity in expectation for the finite sum problem. The in-probability complexity of Prob-SARAH matches the best in-expectation result up to logarithmic factors. Empirical experiments demonstrate the superior probabilistic performance of Prob-SARAH on real datasets compared to other popular algorithms.

Keywords: machine learning, variance-reduced method, stochastic gradient descent, non-convex optimization

1 Introduction

We consider the popular non-convex finite sum optimization problem in this work, that is, estimating 𝐱𝒟d\mathbf{x}^{*}\in\mathcal{D}\subseteq\mathbb{R}^{d} minimizing the following loss function

f(𝐱)=1ni=1nfi(𝐱),𝐱𝒟,f(\mathbf{x})=\frac{1}{n}\sum\limits_{i=1}\limits^{n}f_{i}(\mathbf{x}),\ \mathbf{x}\in\mathcal{D}, (1)

where fi:df_{i}:\mathbb{R}^{d}\mapsto\mathbb{R} is a potentially non-convex function on some compact set 𝒟\mathcal{D}. Such non-convex problems lie at the heart of many applications of statistical learning [13] and machine learning [7].

Unlike convex optimization problems, non-convex problems are in general intractable and the best we can expect is to find a stationary point. Given a target error ε\varepsilon, since f(𝐱)=0\nabla f(\mathbf{x}^{*})=0, we aim to find an estimator 𝐱^\hat{\mathbf{x}} such that, roughly, f(𝐱^)ε\|\nabla f(\hat{\mathbf{x}})\|\leq\varepsilon, where f()\nabla f(\cdot) denotes the gradient vector of the loss function ff and \|\cdot\| is the Euclidean norm. With a non-deterministic algorithm, the output 𝐱^\hat{\mathbf{x}} is stochastic, and the most frequently considered error bound is in expectation, i.e.,

𝔼f(𝐱^)2ε2.\mathbb{E}\|\nabla f(\hat{\mathbf{x}})\|^{2}\leq\varepsilon^{2}. (2)

There has been a substantial amount of work providing upper bounds on the computational complexity needed to achieve the in-expectation bound. However, in practice we typically run a stochastic algorithm only once, and an in-expectation bound cannot provide a convincing guarantee in this situation. Instead, a high-probability bound is more appropriate by nature. Given a pair of target errors (ε,δ)(\varepsilon,\delta), we want to obtain an estimator 𝐱^\hat{\mathbf{x}} such that with probability at least 1δ1-\delta, f(𝐱^)ε\|\nabla f(\hat{\mathbf{x}})\|\leq\varepsilon, that is

(f(𝐱^)ε)1δ.\mathbb{P}\big{(}\|\nabla f(\hat{\mathbf{x}})\|\leq\varepsilon\big{)}\geq 1-\delta. (3)

Though the Markov inequality might help, in general an in-expectation bound cannot simply be converted to an in-probability bound with a desirable dependency on δ\delta. It is therefore important to prove upper bounds on the high-probability complexity, which ideally should be polylogarithmic in 1/δ1/\delta and have polynomial terms comparable to the in-expectation complexity bound.
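To make this gap concrete, here is a back-of-the-envelope illustration (ours, not an argument taken from the paper). Applying Markov's inequality to the in-expectation guarantee (2) gives

\mathbb{P}\big{(}\|\nabla f(\hat{\mathbf{x}})\|>\varepsilon\big{)}=\mathbb{P}\big{(}\|\nabla f(\hat{\mathbf{x}})\|^{2}>\varepsilon^{2}\big{)}\leq\frac{\mathbb{E}\|\nabla f(\hat{\mathbf{x}})\|^{2}}{\varepsilon^{2}},

so certifying failure probability δ\delta through (2) alone requires driving \mathbb{E}\|\nabla f(\hat{\mathbf{x}})\|^{2} down to \delta\varepsilon^{2}, i.e., running at the stricter target \varepsilon^{\prime}=\sqrt{\delta}\varepsilon; a method with in-expectation complexity \mathcal{O}(\varepsilon^{-3}) would then cost \mathcal{O}(\delta^{-3/2}\varepsilon^{-3}), which is polynomial rather than polylogarithmic in 1/\delta.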

Gradient-based methods are favored by practitioners due to their simplicity and efficiency and have been widely studied in the non-convex setting ([26, 6, 1, 31, 5, 33]). Among numerous gradient-based methods, the StochAstic Recursive grAdient algoritHm (SARAH) ([27, 28, 33]) achieves the best known first-order guarantee for a given in-expectation error target, in both convex and non-convex finite sum problems. It is worth noticing that [23] attempted to show that a modified version of SARAH is able to approximate a second-order stationary point with high probability. However, we believe that their application of the martingale Azuma-Hoeffding inequality is unjustifiable because the bounds involved are potentially random and uncontrollable. In this paper, we provide a correct dimension-free martingale Azuma-Hoeffding inequality with rigorous proofs and leverage it to establish in-probability properties of SARAH-based algorithms in the non-convex setting.

1.1 Related Works

  • High-Probability Bounds: While most works in the optimization literature provide in-expectation bounds, only a small fraction discusses bounds in the high-probability sense. [17] provide a high-probability bound on the excess risk given a bound on the regret. [12], [8, 9] derive some high-probability bounds for SGD in convex online optimization problems. [34, 22] prove high-probability bounds for several adaptive methods, including AMSGrad, RMSProp and Delayed AdaGrad with momentum. All these works rely on (generalized) Freedman’s inequality or the concentration inequality given in Lemma 6 in [15]. Unlike these works, our high-probability results are built on a novel Azuma-Hoeffding type inequality proved in this work and Corollary 8 from [15]. In addition, we notice that [23] provide some probabilistic bounds on a SARAH-based algorithm. However, we believe their use of the plain martingale Azuma-Hoeffding inequality is not justifiable. [5] show an in-probability upper bound for SPIDER. Nevertheless, SPIDER’s practical performance is inferior due to its accuracy-dependent small step size [32, 33].

  • Variance-Reduced Methods in Non-Convex Finite Sum Problems: Since the invention of the variance-reduction technique in [18, 16, 4], there has been a large amount of work incorporating this efficient technique to methods targeting the non-convex finite-sum problem. Subsequent methods, including SVRG ([1, 31, 25]), SARAH ([27, 28]), SCSG ([19, 21, 11]), SNVRG ([34]), SPIDER ([5]), SpiderBoost ([33]) and PAGE ([24]), have greatly reduced computational complexity in non-convex problems.

1.2 Our Contributions

  • Dimension-Free Martingale Azuma-Hoeffding inequality: To facilitate our probabilistic analysis, we provide a novel Azuma-Hoeffding type bound on the summation norm of a martingale difference sequence. The novelty is two-fold. First, like the plain martingale Azuma-Hoeffding inequality, it provides a dimension-free bound. In a recent paper, a sub-Gaussian type bound has been developed by [15]. However, their results are not dimension-free. Our proof technique is built on a classic paper by [29] and is completely different from the random matrix technique used in [15]. Second, our concentration inequality allows random bounds on each element of the martingale difference sequence, which can be much tighter than a single large deterministic bound. It should be highlighted that our novel concentration result perfectly suits SARAH-style methods, where the increments form a martingale difference sequence, and it can further be used to analyze other algorithms beyond the current paper.

  • In-probability error bounds of stochastic recursive gradient: We design a SARAH-based algorithm, named Prob-SARAH, adapted to the high-probability target and provably show its good in-probability properties. Under appropriate parameter setting, the first order complexity needed to achieve the in-probability target is 𝒪~(1ε3nε2)\tilde{\mathcal{O}}\left(\frac{1}{\varepsilon^{3}}\wedge\frac{\sqrt{n}}{\varepsilon^{2}}\right), which matches the best known in-expectation upper bound up to some logarithmic factors ([34, 33, 11]). We would like to point out that the parameter setting used to achieve such complexity is semi-adaptive to ε\varepsilon. That is, only the final stopping rule relies on ε\varepsilon while other key parameters are independent of ε\varepsilon, including step size, mini-batch sizes, and lengths of loops.

  • Probabilistic analysis of SARAH for non-convex finite sum: Existing literature on bounds for SARAH mostly focuses on the strongly convex or general convex settings. We extend the analysis to the non-convex scenario, which can be considered a complementary study of the stochastic recursive gradient in probability.

1.3 Notation

For a sequence of sets 𝒜1,𝒜2,\mathcal{A}_{1},\mathcal{A}_{2},\ldots, we denote the smallest sigma algebra containing 𝒜i\mathcal{A}_{i}, i1i\geq 1, by σ(i=1𝒜i)\sigma\big{(}\bigcup_{i=1}^{\infty}\mathcal{A}_{i}\big{)}. By abuse of notation, for a random variable 𝐗\mathbf{X}, we denote the sigma algebra generated by 𝐗\mathbf{X} by σ(𝐗)\sigma(\mathbf{X}). We define the constant Ce=i=1i2C_{e}=\sum_{i=1}^{\infty}i^{-2}. For two scalars a,ba,b\in\mathbb{R}, we denote ab=min{a,b}a\wedge b=\min\{a,b\} and ab=max{a,b}a\vee b=\max\{a,b\}. When we say a quantity TT is 𝒪θ1,θ2(θ3)\mathcal{O}_{\theta_{1},\theta_{2}}(\theta_{3}) for some θ1,θ2,θ3\theta_{1},\theta_{2},\theta_{3}\in\mathbb{R}, we mean that there exists a gg\in\mathbb{R} polylogarithmic in θ1\theta_{1} and θ2\theta_{2} such that Tgθ3T\leq g\cdot\theta_{3}; 𝒪~()()\tilde{\mathcal{O}}_{(\cdot)}(\cdot) is defined analogously, up to an additional logarithmic factor.

2 Prob-SARAH Algorithm

The algorithm Prob-SARAH proposed in our work is a modified version of SpiderBoost ([33]) and SARAH ([27, 28]). Since the key update structure originates from [27], we call our modified algorithm Prob-SARAH. In fact, it can also be viewed as a generalization of the SPIDER algorithm introduced in [5].

We present the Prob-SARAH in Algorithm 1, and here, we provide some explanation of the key steps. Following other SARAH-based algorithms, we adopt a similar gradient approximation design with nested loops, specifically with a checkpoint gradient estimator 𝝂0(j)\boldsymbol{\nu}^{(j)}_{0} using a large mini-batch size BjB_{j} in Line 4 and a recursive gradient estimator 𝝂k(j)\boldsymbol{\nu}_{k}^{(j)} updated in Line 9. When the mini-batch size BjB_{j} is large, we can regard the checkpoint gradient estimator 𝝂0(j)\boldsymbol{\nu}^{(j)}_{0} as a solid approximation to the true gradient at 𝐱~j1\tilde{\mathbf{x}}_{j-1}. With this checkpoint, we can update the gradient estimator 𝝂k(j)\boldsymbol{\nu}_{k}^{(j)} with a small mini-batch size bjb_{j} while maintaining a desirable estimation accuracy.

To emphasize, our stopping rules in Line 11 of Algorithm 1 are newly proposed and ensure a critical performance enhancement compared to the previous literature. In particular, with this new design, we can control the gradient norm of the output with high probability. For a more intuitive understanding of these stopping rules, we will see in our proof sketch section that the gradient norm of iterates in the jj-th outer iteration, f\|\nabla f\|, can be bounded by a linear combination of {𝝂k(j)}k=1Kj\big{\{}\boldsymbol{\nu}_{k}^{(j)}\big{\}}_{k=1}^{K_{j}} with a small remainder. The first stopping rule, therefore, strives to control the magnitude of the linear combination of {𝝂k(j)}k=1Kj\big{\{}\boldsymbol{\nu}_{k}^{(j)}\big{\}}_{k=1}^{K_{j}}, while the second stopping rule is specifically designed to control the size of the remainder terms. For this purpose, εj\varepsilon_{j} should be set as a credible controller of the remainder term, with an example given in Theorem 3.1. In this way, with small preset constants ε~\tilde{\varepsilon} and ε\varepsilon, we guarantee that the output has a desirably small gradient norm, dependent on ε~\tilde{\varepsilon} and ε\varepsilon, when the designed stopping rules are activated. Indeed, Proposition B.1 in Appendix B guarantees that the stopping rules will be satisfied at some point. More refined quantitative results regarding the number of steps required for stopping follow in Theorem 3.1 and Appendix D.3.

Algorithm 1 Probabilistic Stochastic Recursive Gradient (Prob-SARAH)
1:  Input: sample size nn, constraint area 𝒟\mathcal{D}, initial point 𝐱~0𝒟\tilde{\mathbf{x}}_{0}\in\mathcal{D}, large batch size {Bj}j1\{B_{j}\}_{j\geq 1}, mini batch size {bj}j1\{b_{j}\}_{j\geq 1}, inner loop length {Kj}j1\{K_{j}\}_{j\geq 1}, auxiliary error estimator {εj}j1\{\varepsilon_{j}\}_{j\geq 1}, errors ε~2,ε2\tilde{\varepsilon}^{2},\varepsilon^{2}
2:  for j=1,2,j=1,2,\ldots do
3:     Uniformly sample a batch j{1,,n}\mathcal{I}_{j}\subseteq\{1,\ldots,n\} without replacement, |j|=Bj|\mathcal{I}_{j}|=B_{j};
4:     𝝂0(j)1Bjijfi(𝐱~j1)\boldsymbol{\nu}^{(j)}_{0}\,\leftarrow\,\frac{1}{B_{j}}\sum_{i\in\mathcal{I}_{j}}\nabla f_{i}(\tilde{\mathbf{x}}_{j-1});
5:     𝐱0(j)𝐱~j1\mathbf{x}_{0}^{(j)}\,\leftarrow\,\tilde{\mathbf{x}}_{j-1};
6:     for k=1,2,,Kjk=1,2,\ldots,K_{j} do
7:        𝐱k(j)Proj(𝐱k1(j)ηj𝝂k1(j),𝒟)\mathbf{x}_{k}^{(j)}\,\leftarrow\,\mathrm{Proj}\big{(}\mathbf{x}_{k-1}^{(j)}-\eta_{j}\boldsymbol{\nu}_{k-1}^{(j)},\mathcal{D}\big{)}, project the update back to 𝒟\mathcal{D};
8:        Uniformly sample a mini-batch k(j){1,,n}\mathcal{I}_{k}^{(j)}\subseteq\{1,\ldots,n\} with replacement and |k(j)|=bj|\mathcal{I}_{k}^{(j)}|=b_{j};
9:        𝝂k(j)1bjik(j)fi(𝐱k(j))1bjik(j)fi(𝐱k1(j))+𝝂k1(j)\boldsymbol{\nu}_{k}^{(j)}\,\leftarrow\,\frac{1}{b_{j}}\sum_{i\in\mathcal{I}_{k}^{(j)}}\nabla f_{i}(\mathbf{x}_{k}^{(j)})-\frac{1}{b_{j}}\sum_{i\in\mathcal{I}_{k}^{(j)}}\nabla f_{i}(\mathbf{x}_{k-1}^{(j)})+\boldsymbol{\nu}_{k-1}^{(j)};
10:     end for
11:     if 1Kjk=0Kj1𝝂k(j)2ε~2\frac{1}{K_{j}}\sum_{k=0}^{K_{j}-1}\big{\|}\boldsymbol{\nu}_{k}^{(j)}\big{\|}^{2}\leq\tilde{\varepsilon}^{2} and εj12ε2\varepsilon_{j}\leq\frac{1}{2}\varepsilon^{2} then
12:        k^argmin0kKj1𝝂k(j)2\hat{k}\,\leftarrow\,\mathop{\arg\min}_{0\leq k\leq K_{j}-1}\big{\|}\boldsymbol{\nu}_{k}^{(j)}\big{\|}^{2};
13:        Return 𝐱^𝐱k^(j)\hat{\mathbf{x}}\,\leftarrow\,\mathbf{x}_{\hat{k}}^{(j)};
14:     end if
15:     𝐱~j𝐱Kj(j)\tilde{\mathbf{x}}_{j}\,\leftarrow\,\mathbf{x}_{K_{j}}^{(j)};
16:  end for
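For readers who prefer code, the following is a minimal NumPy sketch of Algorithm 1. It is an illustration under our own interface assumptions, not the implementation used in the experiments: grad_batch, proj and the schedule callables B, b, K, eps_aux are placeholders to be supplied by the user.

```python
import numpy as np

def prob_sarah(grad_batch, n, x0, proj, B, b, K, eps_aux,
               eps_tilde_sq, eps_sq, eta, max_outer=1000, rng=None):
    """Minimal sketch of Prob-SARAH (Algorithm 1).

    grad_batch(x, idx): average gradient (1/|idx|) * sum_{i in idx} grad f_i(x).
    proj(x):            Euclidean projection onto the compact set D.
    B, b, K, eps_aux:   callables j -> B_j, b_j, K_j, eps_j (hyperparameter schedules).
    eta:                step size (Theorem 3.1 uses eta_j = 1/(4L) for all j).
    """
    rng = np.random.default_rng() if rng is None else rng
    x_tilde = np.asarray(x0, dtype=float)
    for j in range(1, max_outer + 1):
        Bj, bj, Kj = B(j), b(j), K(j)
        # Lines 3-4: checkpoint estimator nu_0 from a large batch sampled without replacement.
        I_j = rng.choice(n, size=min(Bj, n), replace=False)
        nu = grad_batch(x_tilde, I_j)
        x = x_tilde.copy()
        nus, xs = [nu], [x.copy()]          # nu_0,...,nu_{K_j-1} and x_0,...,x_{K_j-1}
        for k in range(1, Kj + 1):
            # Line 7: projected gradient step with the current estimator.
            x_new = proj(x - eta * nu)
            # Lines 8-9: recursive update of the estimator with a small batch (with replacement).
            I_k = rng.choice(n, size=bj, replace=True)
            nu = grad_batch(x_new, I_k) - grad_batch(x, I_k) + nu
            x = x_new
            if k < Kj:
                nus.append(nu)
                xs.append(x.copy())
        # Line 11: stopping rules on the averaged squared estimator norms and the auxiliary error.
        norms_sq = [float(np.dot(v, v)) for v in nus]
        if np.mean(norms_sq) <= eps_tilde_sq and eps_aux(j) <= 0.5 * eps_sq:
            k_hat = int(np.argmin(norms_sq))   # Line 12
            return xs[k_hat]                   # Line 13: output x_{k_hat}^{(j)}
        x_tilde = x                            # Line 15: x_tilde_j = x_{K_j}^{(j)}
    return x_tilde
```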

3 Theoretical Results

This section is devoted to the main theoretical results for our proposed algorithm Prob-SARAH. We provide a stopping guarantee for the algorithm along with an upper bound on the number of steps. A high-probability error bound on the gradient of the output is also established. A discussion of the dependence of our algorithm on the parameters follows the main theorems.

3.1 Technical Assumptions

We shall introduce some necessary regularity assumptions. Most assumptions are commonly used in the optimization literature. We have further clarifications in Appendix A.

Assumption 3.1 (Existence of achievable minimum).

Assume that for each i=1,2,,ni=1,2,\ldots,n, fif_{i} has continuous gradient on 𝒟\mathcal{D} and 𝒟\mathcal{D} is a compact subset of d\mathbb{R}^{d}. Then, there exists a constant αM<\alpha_{M}<\infty such that

max1insup𝐱𝒟fi(𝐱)αM.\max\limits_{1\leq i\leq n}\sup\limits_{\mathbf{x}\in\mathcal{D}}\|\nabla f_{i}(\mathbf{x})\|\leq\alpha_{M}. (4)

Also, assume that there exists an interior point 𝐱\mathbf{x}^{*} of the set 𝒟\mathcal{D} such that

f(𝐱)=inf𝐱𝒟f(𝐱).f(\mathbf{x}^{*})=\inf\limits_{\mathbf{x}\in\mathcal{D}}f(\mathbf{x}).
Assumption 3.2 (LL-smoothness).

For each i=1,2,,ni=1,2,\ldots,n, fi:𝒟f_{i}:\mathcal{D}\rightarrow\mathbb{R} is LL-smooth for some constant L>0L>0, i.e.,

fi(𝐱)fi(𝐱)L𝐱𝐱,𝐱,𝐱𝒟.\|\nabla f_{i}(\mathbf{x})-\nabla f_{i}(\mathbf{x}^{\prime})\|\leq L\|\mathbf{x}-\mathbf{x}^{\prime}\|,\ \forall\ \mathbf{x},\mathbf{x}^{\prime}\in\mathcal{D}.
Assumption 3.3 (LL-smoothness extension).

There exists a LL-smooth function f~:𝒟\tilde{f}:\mathcal{D}\rightarrow\mathbb{R} such that

f~(𝐱)=f(𝐱),𝐱𝒟,andf~(Proj(𝐱,𝒟))f~(𝐱),𝐱d,\tilde{f}(\mathbf{x})=f(\mathbf{x}),\ \forall\ \mathbf{x}\in\mathcal{D},\quad\text{and}\quad\tilde{f}(\mathrm{Proj}(\mathbf{x},\mathcal{D}))\leq\tilde{f}(\mathbf{x}),\ \forall\ \mathbf{x}\in\mathbb{R}^{d},

where Proj(𝐱,𝒟)\mathrm{Proj}(\mathbf{x},\mathcal{D}) is the Euclidean projection of 𝐱\mathbf{x} on some compact set 𝒟\mathcal{D}.

Assumption 3.4.

Assume that the following conditions hold.

  1. 1.

    ε1e\varepsilon\leq\frac{1}{e} and αM2110240\alpha_{M}^{2}\geq\frac{1}{10240}, where ε\varepsilon is the target error bound in (3) and αM\alpha_{M} is defined in (4).

  2. 2.

    The diameter of 𝒟\mathcal{D} is at least 1, i.e. d1max{𝐱𝐱:𝐱,𝐱𝒟}1d_{1}\triangleq\max\{\|\mathbf{x}-\mathbf{x}^{\prime}\|:\mathbf{x},\mathbf{x}^{\prime}\in\mathcal{D}\}\geq 1.

Assumption 3.1 also indicates that there exists a positive number Δf\Delta_{f} such that sup𝐱𝒟[f(𝐱)f(𝐱)]Δf.\sup_{\mathbf{x}\in\mathcal{D}}\big{[}f(\mathbf{x})-f(\mathbf{x}^{*})\big{]}\leq\Delta_{f}. Assumptions 3.1–3.3 are commonly used in the optimization literature, and Assumption 3.4 can be easily satisfied in practical use as long as the initial points are not too far from the optimum. See more comments on assumptions in Appendix A.

3.2 Main Results on Complexity

According to the definition given in [20], an algorithm is called ε\varepsilon-independent if it can guarantee convergence at all target accuracies ε\varepsilon in expectation without explicitly using ε\varepsilon in the algorithm. This is a very favorable property because it means that we no longer need to set the target error beforehand. Here, we introduce a similar property regarding the dependency on ε\varepsilon.

Definition 3.1 (ε\varepsilon-semi-independence).

An algorithm is ε\varepsilon-semi-independent, given δ\delta, if it can guarantee convergence at all target accuracies ε\varepsilon with probability at least 1δ1-\delta, and the knowledge of ε\varepsilon is only needed in post-processing. That is, the algorithm can iterate without knowing ε\varepsilon, and we can select an appropriate iterate afterwards.

The newly introduced property can be perceived as the probabilistic equivalent of ε\varepsilon-independence. As stated in the succeeding theorem, under the given conditions, Prob-SARAH can achieve ε\varepsilon-semi-independence, given δ\delta.

Theorem 3.1.

Suppose that Assumptions 3.1, 3.2, 3.3 and 3.4 are valid. Given a pair of errors (ε,δ)(\varepsilon,\delta), in Algorithm 1 (Prob-SARAH), set hyperparameters

\eta_{j}=\frac{1}{4L},\quad K_{j}=\sqrt{B_{j}}=\sqrt{j^{2}\wedge n},\quad b_{j}=l_{j}K_{j},\quad\varepsilon_{j}=8L^{2}\tau_{j}+2q_{j},\quad\tilde{\varepsilon}^{2}=\frac{1}{5}\varepsilon^{2}, (5)

for j1j\geq 1, where

Then,

Comp(ε,δ)=𝒪~L,Δf,αM(1ε3nε2),Comp(\varepsilon,\delta)=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\Big{(}\frac{1}{\varepsilon^{3}}\wedge\frac{\sqrt{n}}{\varepsilon^{2}}\Big{)},

where Comp(ε,δ)Comp(\varepsilon,\delta) represents the number of computations needed to get an output 𝐱^\hat{\mathbf{x}} satisfying f(𝐱^)2ε2\left\|\nabla f(\hat{\mathbf{x}})\right\|^{2}\leq\varepsilon^{2} with probability at least 1δ1-\delta.

More detailed results can be found in Appendix C, where we also introduce another hyperparameter setting that leads to a complexity with a better dependency on αM2\alpha_{M}^{2}, which could be implicitly affected by the choice of the constraint region 𝒟\mathcal{D}.

3.3 Proof Sketch

In this part, we explain the idea of the proof of Theorem 3.1. The same proof strategy can be applied to other hyper-parameter settings. First, we bound the difference between 𝝂k(j)\boldsymbol{\nu}_{k}^{(j)} and f(𝐱k(j))\nabla f\big{(}\mathbf{x}_{k}^{(j)}\big{)} by a linear combination of {𝝂m(j)}m=0k1\{\|\boldsymbol{\nu}_{m}^{(j)}\|\}_{m=0}^{k-1} and small remainders, which gives good control of f(𝐱k(j))\|\nabla f\big{(}\mathbf{x}_{k}^{(j)}\big{)}\| when the stopping rules are met. Second, we bound the number of steps needed to meet the stopping rules. Combining these two key components, we obtain the final conclusions.

Let us first introduce a novel Azuma-Hoeffding type inequality, which is key to our analysis.

Theorem 3.2 (Martingale Azuma-Hoeffding Inequality with Random Bounds).

Suppose 𝐳1,,𝐳Kd\mathbf{z}_{1},\ldots,\mathbf{z}_{K}\in\mathbb{R}^{d} is a martingale difference sequence adapted to 0,,K\mathcal{F}_{0},\ldots,\mathcal{F}_{K}. Suppose {rk}k=1K\{r_{k}\}_{k=1}^{K} is a sequence of random variables such that 𝐳krk\|\mathbf{z}_{k}\|\leq r_{k} and rkr_{k} is measurable with respect to k\mathcal{F}_{k}, k=1,,Kk=1,\ldots,K. Then, for any fixed δ>0\delta>0 and B>b>0B>b>0, with probability at least 1δ1-\delta, for every 1tK1\leq t\leq K, either

\sum\limits_{k=1}\limits^{t}r_{k}^{2}\geq B\quad\text{ or }\quad\Big{\|}\sum\limits_{k=1}\limits^{t}\mathbf{z}_{k}\Big{\|}^{2}\leq 9\max\Big{\{}\sum\limits_{k=1}\limits^{t}r_{k}^{2},b\Big{\}}\Big{(}\log(\frac{2}{\delta})+\log\log(\frac{B}{b})\Big{)}.

Remark 3.1.

It is noteworthy that this probabilistic bound on large-deviation is dimension-free, which is a nontrivial extension of Theorem 3.5 in [30]. If r1,r2,,rKr_{1},r_{2},\ldots,r_{K} are not random, we can let B=k=1Krk2+ζ1B=\sum_{k=1}^{K}r_{k}^{2}+\zeta_{1} and b=ζ2Bb=\zeta_{2}B with ζ1>0\zeta_{1}>0, 0<ζ2<10<\zeta_{2}<1. Since ζ1\zeta_{1} can be arbitrarily close to 0 and ζ2\zeta_{2} can be arbitrarily close to 1, we can recover Theorem 3.5 in [30]. Compared with Corollary 8 in [15], which can be viewed as a sub-Gaussian counterpart of our result, a key feature of our Theorem 3.2 is its dimension-independence. We are also working towards improving the bound in Corollary 8 from [15] to a dimension-free one.
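As an informal sanity check of Theorem 3.2 (ours, not part of the paper's analysis), the following NumPy simulation constructs a martingale difference sequence whose per-step bound r_k is random and depends on the past, and verifies empirically that the event in the theorem fails with frequency far below δ.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, delta = 50, 200, 0.01
B, b = 4.0 * K, 1.0          # deterministic envelope with sum_k r_k^2 < B almost surely
n_rep, violations = 2000, 0

for _ in range(n_rep):
    z_sum = np.zeros(d)
    r_sq_sum = 0.0
    ok = True
    for k in range(1, K + 1):
        # r_k depends on the past iterates (random, F_k-measurable), and r_k < 1.5 so sum r_k^2 < B.
        r_k = 0.5 + np.tanh(np.linalg.norm(z_sum) / k)
        # z_k: uniform direction on the sphere scaled to norm exactly r_k; E[z_k | past] = 0.
        g = rng.standard_normal(d)
        z_k = r_k * g / np.linalg.norm(g)
        z_sum += z_k
        r_sq_sum += r_k ** 2
        rhs = 9.0 * max(r_sq_sum, b) * (np.log(2.0 / delta) + np.log(np.log(B / b)))
        if r_sq_sum < B and np.dot(z_sum, z_sum) > rhs:
            ok = False
    violations += (not ok)

print(f"fraction of runs violating the bound: {violations / n_rep:.4f} (should be <= {delta})")
```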

The success of Algorithm 1 is largely because f(𝐱k(j))\nabla f(\mathbf{x}_{k}^{(j)}) is well-approximated by 𝝂k(j)\boldsymbol{\nu}_{k}^{(j)}, and meanwhile 𝝂k(j)\boldsymbol{\nu}_{k}^{(j)} can be easily updated. We can observe that 𝝂k(j)f(𝐱k(j))\boldsymbol{\nu}_{k}^{(j)}-\nabla f(\mathbf{x}_{k}^{(j)}) is actually the sum of a martingale difference sequence, as

\begin{aligned}
\boldsymbol{\nu}_{k}^{(j)}-\nabla f(\mathbf{x}_{k}^{(j)})&=\Big{[}\frac{1}{b_{j}}\sum_{i\in\mathcal{I}_{k}^{(j)}}\nabla f_{i}(\mathbf{x}_{k}^{(j)})-\frac{1}{b_{j}}\sum_{i\in\mathcal{I}_{k}^{(j)}}\nabla f_{i}(\mathbf{x}_{k-1}^{(j)})+\nabla f(\mathbf{x}_{k-1}^{(j)})-\nabla f(\mathbf{x}_{k}^{(j)})\Big{]}+\Big{[}\boldsymbol{\nu}_{k-1}^{(j)}-\nabla f(\mathbf{x}_{k-1}^{(j)})\Big{]}\\
&=\sum_{m=1}^{k}\Big{[}\frac{1}{b_{j}}\sum_{i\in\mathcal{I}_{m}^{(j)}}\nabla f_{i}(\mathbf{x}_{m}^{(j)})-\frac{1}{b_{j}}\sum_{i\in\mathcal{I}_{m}^{(j)}}\nabla f_{i}(\mathbf{x}_{m-1}^{(j)})+\nabla f(\mathbf{x}_{m-1}^{(j)})-\nabla f(\mathbf{x}_{m}^{(j)})\Big{]}+\Big{[}\boldsymbol{\nu}_{0}^{(j)}-\nabla f(\mathbf{x}_{0}^{(j)})\Big{]}. \quad (6)
\end{aligned}

To be more specific, let \mathcal{F}_{0}=\{\emptyset,\Omega\}, and iteratively define \mathcal{F}_{j,-1}=\mathcal{F}_{j-1}, \mathcal{F}_{j,0}=\sigma\big{(}\mathcal{F}_{j-1}\cup\sigma(\mathcal{I}_{j})\big{)}, \mathcal{F}_{j,k}=\sigma\big{(}\mathcal{F}_{j,k-1}\cup\sigma(\mathcal{I}_{k}^{(j)})\big{)}, \mathcal{F}_{j}=\sigma\big{(}\bigcup\limits_{k=1}\limits^{\infty}\mathcal{F}_{j,k}\big{)}, j\geq 1,k\geq 1. We also denote \boldsymbol{\epsilon}_{0}^{(j)}\triangleq\boldsymbol{\nu}_{0}^{(j)}-\nabla f\big{(}\mathbf{x}_{0}^{(j)}\big{)} and \boldsymbol{\epsilon}_{m}^{(j)}\triangleq\frac{1}{b_{j}}\sum\limits_{i\in\mathcal{I}_{m}^{(j)}}\nabla f_{i}(\mathbf{x}_{m}^{(j)})-\nabla f(\mathbf{x}_{m}^{(j)})+\nabla f(\mathbf{x}_{m-1}^{(j)})-\frac{1}{b_{j}}\sum\limits_{i\in\mathcal{I}_{m}^{(j)}}\nabla f_{i}(\mathbf{x}_{m-1}^{(j)}), m\geq 1. Then, we can see that \{\boldsymbol{\epsilon}_{m}^{(j)}\}_{m=0}^{k} is a martingale difference sequence adapted to \{\mathcal{F}_{j,m}\}_{m=-1}^{k}. With the help of our new martingale Azuma-Hoeffding inequality, we can control the difference between \boldsymbol{\nu}_{k}^{(j)} and \nabla f\big{(}\mathbf{x}_{k}^{(j)}\big{)} by a linear combination of \{\|\boldsymbol{\nu}_{m}^{(j)}\|\}_{m=0}^{k-1} and small remainders, with details given in Appendix D.1. Then, given the stopping rules in Line 11 and the selection method specified in Line 12 of Algorithm 1, it is not hard to obtain \left\|\nabla f(\hat{\mathbf{x}})\right\|^{2}\leq\varepsilon^{2} with high probability. More details can be found in Appendix D.2.

Another key question to be resolved is: when does the algorithm stop? The following analysis builds some intuition. Given a T+T\in\mathbb{Z}_{+}, with the bound given in Proposition F.1 in Appendix F, with high probability,

Δf\displaystyle-\Delta_{f} f(𝐱~2T)f(𝐱~T)AT116Lj=T+12Tk=0Kj1𝝂k(j)2,\displaystyle\leq f\left(\tilde{\mathbf{x}}_{{2T}}\right)-f\left(\tilde{\mathbf{x}}_{{T}}\right)\leq A_{T}-\frac{1}{16L}\sum\limits_{j=T+1}\limits^{2T}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}, (7)

where ATA_{T} is upper bounded by a value polylogarithmic in TT. As for the second summation, if εj12ε2\varepsilon_{j}\leq\frac{1}{2}\varepsilon^{2} for j=T,T+1,,2Tj=T,T+1,\ldots,2T (which is obviously true when TT is moderately large) and our algorithm does not stop in 2T2T outer iterations,

116Lj=T+12Tk=0Kj1𝝂k(j)2ε~216Lj=T+12TKjε~216Lj=T+12T(Tn)=ε~216LT2(nT),\displaystyle\quad\frac{1}{16L}\sum\limits_{j=T+1}\limits^{2T}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\geq\frac{\tilde{\varepsilon}^{2}}{16L}\sum\limits_{j=T+1}\limits^{2T}K_{j}\geq\frac{\tilde{\varepsilon}^{2}}{16L}\sum\limits_{j=T+1}\limits^{2T}\left(T\wedge\sqrt{n}\right)=\frac{\tilde{\varepsilon}^{2}}{16L}T^{2}\wedge(\sqrt{n}T),

which grows at least linearly in TT. Consequently, when TT is sufficiently large, the RHS of (7) can be smaller than Δf-\Delta_{f}, which leads to a contradiction. Roughly, we can see that the stopping time TT cannot exceed the order of 𝒪~(1ε1nε2)\tilde{\mathcal{O}}\big{(}\frac{1}{\varepsilon}\vee\frac{1}{\sqrt{n}\varepsilon^{2}}\big{)}. More details can be found in Appendix D.3.
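To spell out the last step (our own rearrangement of the two displays above, not text from the paper): if the algorithm has not stopped within 2T2T outer iterations, then

\frac{\tilde{\varepsilon}^{2}}{16L}\big{(}T^{2}\wedge\sqrt{n}T\big{)}\leq A_{T}+\Delta_{f}\quad\Longrightarrow\quad T\leq\frac{\sqrt{16L(A_{T}+\Delta_{f})}}{\tilde{\varepsilon}}\vee\frac{16L(A_{T}+\Delta_{f})}{\sqrt{n}\tilde{\varepsilon}^{2}},

and since \tilde{\varepsilon}^{2}=\varepsilon^{2}/5 and A_{T} is polylogarithmic in TT, this caps the stopping time at the claimed order \tilde{\mathcal{O}}\big{(}\frac{1}{\varepsilon}\vee\frac{1}{\sqrt{n}\varepsilon^{2}}\big{)}.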

4 Numerical Experiments

In order to validate our theoretical results and show the good probabilistic properties of the newly introduced Prob-SARAH, we conduct some numerical experiments where the objectives are possibly non-convex.

4.1 Logistic Regression with Non-Convex Regularization

In this part, we consider adding a non-convex regularization term to the commonly used logistic regression objective. Specifically, given a sequence of observations (𝐰i,yi)d×{1,1}(\mathbf{w}_{i},y_{i})\in\mathbb{R}^{d}\times\{-1,1\}, i=1,2,,ni=1,2,\ldots,n and a regularization parameter λ>0\lambda>0, the objective is

f(𝐱)=1ni=1nlog(1+eyi𝐰iT𝐱)+λ2j=1dxj21+xj2.f(\mathbf{x})=\frac{1}{n}\sum\limits_{i=1}\limits^{n}\log\left(1+e^{-y_{i}\mathbf{w}_{i}^{T}\mathbf{x}}\right)+\frac{\lambda}{2}\sum\limits_{j=1}\limits^{d}\frac{x_{j}^{2}}{1+x_{j}^{2}}.
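For concreteness, here is a minimal NumPy sketch of this objective and its full gradient; it is our own illustration (the array names W, y are placeholders), not the code used for the experiments.

```python
import numpy as np

def loss_and_grad(x, W, y, lam=0.1):
    """Non-convexly regularized logistic loss f(x) and its gradient.

    W: (n, d) matrix of covariates w_i; y: (n,) labels in {-1, +1}.
    """
    margins = -y * (W @ x)                          # m_i = -y_i <w_i, x>
    loss = np.mean(np.logaddexp(0.0, margins))      # stable log(1 + exp(m_i))
    loss += 0.5 * lam * np.sum(x ** 2 / (1.0 + x ** 2))
    # d/dm log(1+e^m) = sigmoid(m); computed stably via exp(m - logaddexp(0, m)).
    sig = np.exp(margins - np.logaddexp(0.0, margins))
    grad = -(W.T @ (sig * y)) / len(y)
    grad += lam * x / (1.0 + x ** 2) ** 2           # gradient of (lam/2) * sum x_j^2/(1+x_j^2)
    return loss, grad
```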
Figure 1: Comparison of convergence with respect to the (1δ)(1-\delta)-quantile of the squared gradient norm (f2)\left(\|\nabla f\|^{2}\right) and the δ\delta-quantile of validation accuracy on the MNIST dataset for δ=0.1\delta=0.1 and δ=0.01\delta=0.01. The second (fourth) column presents zoomed-in versions of the figures in the first (third) column. Top: δ=0.1\delta=0.1. Bottom: δ=0.01\delta=0.01. ’bs’ stands for batch size. ’sj=x’ means that the smallest batch size is approximately xlogxx\log x.

Such an objective has also been considered in other works like [11] and [14]. Following these works, we set the regularization parameter λ=0.1\lambda=0.1 across all experiments. We compare the newly-introduced Prob-SARAH against three popular methods including SGD ([6]), SVRG ([31]) and SCSG ([21]). Based on results given in Theorem 3.1, we let the length of the inner loop KjjnK_{j}\sim j\wedge\sqrt{n}, the inner loop batch size bjlogj(jn)b_{j}\sim\log j\left(j\wedge\sqrt{n}\right), the outer loop batch size Bjj2nB_{j}\sim j^{2}\wedge n. For fair comparison, we determine the batch size (inner loop batch size) for SGD (SCSG and SVRG) based on the sample size nn and the number of epochs needed to have sufficient decrease in gradient norm. For example, for the w7a dataset, the sample size is 24692 and we run 60 epochs in total. In the 20th epoch, the inner loop batch size of Prob-SARAH is approximately 67log6728167\log 67\approx 281. Thus, we set batch size 256 for SGD, SCSG and SVRG so that they can be roughly matched. In addition, based on the theoretical results from [31], we also consider a large inner loop batch size comparable to n2/3n^{2/3} for SVRG. Finally, we set step size η=0.01\eta=0.01 for all algorithms across all experiments for simplicity.

Results are displayed in Figure 2, from which we can see that Prob-SARAH has a superior probabilistic guarantee in controlling the gradient norm in all experiments. It is significantly better than SCSG and SVRG under our current setting. Prob-SARAH can achieve a lower gradient norm than SGD in the early stages, while SGD has a slight advantage when the number of epochs is large.
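The probabilistic metric reported in Figures 1 and 2 is a per-epoch quantile over repeated independent runs. A minimal sketch of how such curves can be computed is given below; the array names and shapes are our assumptions, not the paper's code.

```python
import numpy as np

def probabilistic_curves(grad_sq, val_acc, delta=0.1):
    """Per-epoch curves plotted in the figures, computed over independent runs.

    grad_sq, val_acc: arrays of shape (n_runs, n_epochs).
    Returns the (1-delta)-quantile of ||grad f||^2 and the delta-quantile of
    validation accuracy at each epoch.
    """
    grad_curve = np.quantile(np.asarray(grad_sq), 1.0 - delta, axis=0)
    acc_curve = np.quantile(np.asarray(val_acc), delta, axis=0)
    return grad_curve, acc_curve
```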

Figure 2: Comparison of convergence with respect to (1δ)(1-\delta)-quantile of square of gradient norm (f2)\left(\|\nabla f\|^{2}\right) over 3 datasets for δ=0.1\delta=0.1 and δ=0.01\delta=0.01. Top: δ=0.1\delta=0.1. Bottom: δ=0.01\delta=0.01. Datasets: mushrooms, ijcnn1, w7a (from left to right). ’bs’ stands for batch size.

4.2 Two-Layer Neural Network

We also evaluate the performance of Prob-SARAH, SGD, SVRG and SCSG on the MNIST dataset with a simple 2-layer neural network. The two hidden layers respectively have 128 and 64 neurons. We include a GELU activation layer following each hidden layer. We use the negative log likelihood as our loss function. Under this setting, the objective is possibly non-convex and smooth on any given compact set. The step size is fixed to be 0.01 for all algorithms. For Prob-SARAH, we still have the length of the inner loop KjjnK_{j}\sim j\wedge\sqrt{n}, the inner loop batch size bjlogj(jn)b_{j}\sim\log j\left(j\wedge\sqrt{n}\right), the outer loop batch size Bjj2nB_{j}\sim j^{2}\wedge n. But to reduce computational time, we let jj start from 10, 30 and 50 respectively. Based on the same rule described in the previous subsection, we let the batch size (or inner loop batch size) for SGD, SVRG and SCSG be 512.
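A minimal PyTorch sketch of the network described above is given below; the module and function names are ours, and this is an illustration rather than the exact training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoLayerMLP(nn.Module):
    """784 -> 128 -> 64 -> 10 network with GELU activations, for MNIST."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(28 * 28, 128)
        self.fc2 = nn.Linear(128, 64)
        self.out = nn.Linear(64, 10)

    def forward(self, x):
        x = x.view(x.size(0), -1)          # flatten 28x28 images
        x = F.gelu(self.fc1(x))
        x = F.gelu(self.fc2(x))
        return F.log_softmax(self.out(x), dim=1)

def nll_loss_on_batch(model, images, labels):
    # Negative log-likelihood on log-probabilities, as used for the objective.
    return F.nll_loss(model(images), labels)
```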

Results are given in Figure 1. In terms of gradient norm, Prob-SARAH has the best performance among algorithms considered here when the number of epochs is relatively small. With increasing number of epochs, SVRG tends to be better in finding first-order stationary points. However, based on the 3rd and 4th columns in Figure 1, SVRG apparently has an inferior performance on the validation set, which indicates that it could be trapped at local minima. In brief, Prob-SARAH achieves the best tradeoff between finding a first-order stationary point and generalization.

We also consider another set of experiments by replacing the GELU activation function with ReLU, resulting in a non-smooth objective. The results are shown in Appendix G; they resemble those in Figure 1, and similar conclusions can be drawn.

5 Conclusion

In this paper, we propose a SARAH-based variance reduction algorithm called Prob-SARAH and provide high-probability bounds on the gradient norm of the estimator produced by Prob-SARAH. Under appropriate assumptions, the high-probability first-order complexity nearly matches the best in-expectation complexity. The main tool used in the theoretical analysis is a novel Azuma-Hoeffding type inequality. We believe that similar probabilistic analysis can be applied to SARAH-based algorithms in other settings.

Appendix A Remarks and Examples for Assumptions

A.1 More comments on Assumptions 3.1–3.4

Remark A.1 (Convexity and smoothness).

It is worth noticing that Assumption 3.1 is widely used in many non-convex optimization works and can be met for most applications in practice. Assumption 3.2 is also needed in deriving in-expectation bounds for many non-convex variance-reduced methods, including state-of-the-art ones like SPIDER and SpiderBoost. As for Assumption 3.3, it is a byproduct of the compact constraint and is satisfied for many commonly seen ff and usual choices of 𝒟\mathcal{D}. For more discussions on Assumption 3.3, please see Appendix A.2.

Remark A.2 (Compact set 𝒟\mathcal{D}).

Compared with other works in the literature of non-convex optimization, the compact constraint region 𝒟d\mathcal{D}\subseteq\mathbb{R}^{d} imposed in the finite sum problem (1) may seem somewhat restrictive. In fact, such a constraint is imposed largely for technical convenience and can be removed under additional conditions on the gradients. We will elaborate on this point in subsection C.1. Besides, in many practical applications, it is reasonable to restrict estimators to a compact set when certain prior knowledge is available.

A.2 An Example of Assumption 3.3

Let us consider the logistic regression with non-convex regularization, where the objective function can be characterized as

f(𝐱)=1ni=1nlog(1+exp(yi𝐰i,𝐱))+λ2Φ(𝐱),f(\mathbf{x})=\frac{1}{n}\sum\limits_{i=1}\limits^{n}\log\left(1+\text{exp}\left(-y_{i}\langle\mathbf{w}_{i},\mathbf{x}\rangle\right)\right)+\frac{\lambda}{2}\Phi(\mathbf{x}),

where Φ(𝐱)=j=1d(xj2)14\Phi(\mathbf{x})=\sum\limits_{j=1}\limits^{d}(x_{j}^{2})^{\frac{1}{4}}, xjx_{j} is the jjth element of 𝐱\mathbf{x}, λ>0\lambda>0 is the regularization parameter, {yi}i=1n\{y_{i}\}_{i=1}^{n} are labels and {𝐰i}i=1n\{\mathbf{w}_{i}\}_{i=1}^{n} are normalized covariates with norm 1. In fact, for any fixed λ>0\lambda>0, Assumption 3.3 holds with f~=f\tilde{f}=f and 𝒟={𝐱:𝐱R}\mathcal{D}=\{\mathbf{x}:\|\mathbf{x}\|\leq R\} when RR is sufficiently large. Since smoothness is easy to show, we focus on the second part of Assumption 3.3. To show that

f(Proj(𝐱,𝒟))f(𝐱)f\left(\mathrm{Proj}(\mathbf{x},\mathcal{D})\right)\leq f(\mathbf{x})

holds for any 𝐱d\mathbf{x}\in\mathbb{R}^{d}, since the projection direction is pointed towards the origin, it suffices to show that for any 𝝂d\boldsymbol{\nu}\in\mathbb{R}^{d} with 𝝂=1\|\boldsymbol{\nu}\|=1,

ddtfi(t𝝂)=ddt(log(1+exp(tyi𝐰i,𝝂))+λ2j=1dt(νj2)14)0,\frac{d}{dt}f_{i}(t\boldsymbol{\nu})=\frac{d}{dt}\Big{(}\log\big{(}1+\text{exp}(-ty_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle)\big{)}+\frac{\lambda}{2}\sum\limits_{j=1}\limits^{d}\sqrt{t}(\nu_{j}^{2})^{\frac{1}{4}}\Big{)}\geq 0,

when tRt\geq R for i=1,2,,ni=1,2,\ldots,n, where νj\nu_{j} is the jjth element of 𝝂\boldsymbol{\nu}. To see this,

ddtfi(t𝝂)\displaystyle\quad\frac{d}{dt}f_{i}(t\boldsymbol{\nu})
=yi𝐰i,𝝂exp(tyi𝐰i,𝝂)1+exp(tyi𝐰i,𝝂)+λ2j=1d(νj2)142t\displaystyle=\frac{-y_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle\text{exp}\left(-ty_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle\right)}{1+\text{exp}\left(-ty_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle\right)}+\frac{\lambda}{2}\sum\limits_{j=1}\limits^{d}\frac{(\nu_{j}^{2})^{\frac{1}{4}}}{2\sqrt{t}}
=yi𝐰i,𝝂1+exp(tyi𝐰i,𝝂)+λ2j=1d(νj2)142t\displaystyle=\frac{-y_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle}{1+\text{exp}\left(ty_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle\right)}+\frac{\lambda}{2}\sum\limits_{j=1}\limits^{d}\frac{(\nu_{j}^{2})^{\frac{1}{4}}}{2\sqrt{t}}
yi𝐰i,𝝂1+exp(tyi𝐰i,𝝂)+λ2j=1dνj22t\displaystyle\geq\frac{-y_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle}{1+\text{exp}\left(ty_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle\right)}+\frac{\lambda}{2}\sum\limits_{j=1}\limits^{d}\frac{\nu_{j}^{2}}{2\sqrt{t}}
=yi𝐰i,𝝂1+exp(tyi𝐰i,𝝂)+λ4t.\displaystyle=\frac{-y_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle}{1+\text{exp}\left(ty_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle\right)}+\frac{\lambda}{4\sqrt{t}}.

If yi𝐰i,𝝂0y_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle\leq 0, we can immediately know that ddtfi(t𝝂)0\frac{d}{dt}f_{i}(t\boldsymbol{\nu})\geq 0 for any t>0t>0.

If yi𝐰i,𝝂>0y_{i}\langle\mathbf{w}_{i},\boldsymbol{\nu}\rangle>0, let us consider an auxiliary function

g(b)=b1+etb.g(b)=\frac{-b}{1+e^{tb}}.

Then,

g(b)(1+etb)+btebt,g^{\prime}(b)\propto-\left(1+e^{tb}\right)+bte^{bt},

from which we know that the minimum of g(b)g(b) is attained at some b[1t,2t]b^{*}\in[\frac{1}{t},\frac{2}{t}]. Thus,

g(b)\geq g(b^{*})\geq\frac{-2}{(1+e^{tb^{*}})t}\geq\frac{-2}{(1+e)t}.

Therefore,

ddtfi(t𝝂)2(1+e)t+λ4t,\frac{d}{dt}f_{i}(t\boldsymbol{\nu})\geq\frac{-2}{(1+e)t}+\frac{\lambda}{4\sqrt{t}},

which is positive when t(8(1+e)λ)2t\geq\left(\frac{8}{(1+e)\lambda}\right)^{2}.
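As a quick numerical sanity check of this threshold (our own illustration with an arbitrary unit-norm covariate, label and direction, not part of the paper), one can evaluate the derivative directly:

```python
import numpy as np

rng = np.random.default_rng(1)
d, lam = 20, 0.1
w = rng.standard_normal(d); w /= np.linalg.norm(w)      # normalized covariate, ||w|| = 1
nu = rng.standard_normal(d); nu /= np.linalg.norm(nu)   # unit projection direction
y = 1.0
s = y * np.dot(w, nu)                                   # y * <w, nu>

def dfi_dt(t):
    """d/dt [ log(1 + exp(-t*s)) + (lam/2) * sum_j sqrt(t) * |nu_j|^(1/2) ]."""
    # clip the exponent to avoid overflow; for t*s >> 1 the logistic term is ~0 anyway
    logistic_part = -s / (1.0 + np.exp(min(t * s, 700.0)))
    reg_part = 0.5 * lam * np.sum(np.sqrt(np.abs(nu))) / (2.0 * np.sqrt(t))
    return logistic_part + reg_part

t_threshold = (8.0 / ((1.0 + np.e) * lam)) ** 2
ts = np.linspace(t_threshold, 10.0 * t_threshold, 200)
print("threshold t:", round(t_threshold, 1))
print("min d/dt f_i(t*nu) beyond threshold:", min(dfi_dt(t) for t in ts))   # expected >= 0
```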

If we consider other non-convex regularization terms in logistic regression, such as Φ(𝐱)=j=1dxj21+xj2\Phi(\mathbf{x})=\sum\limits_{j=1}\limits^{d}\frac{x_{j}^{2}}{1+x_{j}^{2}}, we may no longer enjoy Assumption 3.3 because monotonicity may not hold for a few projection directions even when the constraint region is large. Nevertheless, such a theoretical flaw can easily be remedied by adding an extra regularization term like λe2𝐱2\frac{\lambda_{e}}{2}\|\mathbf{x}\|^{2} with appropriate λe>0\lambda_{e}>0.

Appendix B Stop Guarantee

We would like to point out that, under appropriate parameter settings, Prob-SARAH is guaranteed to stop. Actually, the stopping guarantee holds under more general conditions than those stated in the following proposition. But for simplicity, we only present conditions that naturally match the parameter settings given in the next two subsections.

Proposition B.1 (Stop guarantee of Prob-SARAH).

Suppose that Assumptions 3.1, 3.2, 3.3 and 3.4 are satisfied. Let step size ηj1/(4L)\eta_{j}\equiv 1/(4L) and suppose that bjKjb_{j}\geq K_{j}, j1j\geq 1. The large batch size {Bj}j1\{B_{j}\}_{j\geq 1} is set appropriately such that Bj=nB_{j}=n when jj is sufficiently large. If the limit of {εj}j1\{\varepsilon_{j}\}_{j\geq 1} is 0, then, for any fixed ε~\tilde{\varepsilon} and ε\varepsilon, with probability 1, Prob-SARAH (Algorithm 1) stops. In settings where we always have εj12ε2\varepsilon_{j}\leq\frac{1}{2}\varepsilon^{2}, we also have the result that Prob-SARAH (Algorithm 1) stops with probability 1.

Appendix C Detailed Results on Complexity

Theorem C.1.

Suppose that Assumptions 3.1, 3.2, 3.3 and 3.4 are valid. Given a pair of errors (ε,δ)(\varepsilon,\delta), in Algorithm 1 (Prob-SARAH), set hyperparameters

\eta_{j}=\frac{1}{4L},\quad K_{j}=\sqrt{B_{j}}=\sqrt{j^{2}\wedge n},\quad b_{j}=l_{j}K_{j},\quad\varepsilon_{j}=8L^{2}\tau_{j}+2q_{j},\quad\tilde{\varepsilon}^{2}=\frac{1}{5}\varepsilon^{2}, (8)

for j1j\geq 1, where

Then, with probability at least 1δ1-\delta, Prob-SARAH stops in at most

2(T1T2T3T4)=𝒪~L,Δf,αM(1ε+1nε2)2(T_{1}\vee T_{2}\vee T_{3}\vee T_{4})=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon}+\frac{1}{\sqrt{n}\varepsilon^{2}}\right)

outer iterations and the output satisfies f(𝐱^)2ε2\left\|\nabla f(\hat{\mathbf{x}})\right\|^{2}\leq\varepsilon^{2}. Detailed definitions of T1,T2,T3T_{1},T_{2},T_{3} and T4T_{4} can be found in Propositions D.3 and D.4.

Corollary C.1.

Under parameter settings in Theorem C.1,

Comp(ε,δ)=𝒪~L,Δf,αM(1ε3nε2).Comp(\varepsilon,\delta)=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon^{3}}\wedge\frac{\sqrt{n}}{\varepsilon^{2}}\right).

We introduce another setting that can help to reduce the dependence on αM2\alpha_{M}^{2}, which could be implicitly affected by the choice of constraint region 𝒟\mathcal{D}. We should also notice that, under this setting, the algorithm is no longer ε\varepsilon-semi-independent.

Theorem C.2.

Suppose that Assumptions 3.1, 3.2, 3.3 and 3.4 are valid. We denote Δf0f(𝐱~0)f(𝐱)\Delta_{f}^{0}\triangleq f\left(\tilde{\mathbf{x}}_{0}\right)-f\left(\mathbf{x}^{*}\right). Given a pair of errors (ε,δ)(\varepsilon,\delta), in Algorithm 1 (Prob-SARAH), set parameters

ηj=14L,Kj=Bj=n,bj=ljKj,εj=12ε~2=110ε2,\eta_{j}=\frac{1}{4L},\quad K_{j}=\sqrt{B_{j}}=\sqrt{n},\quad b_{j}=l_{j}K_{j},\quad\varepsilon_{j}=\frac{1}{2}\tilde{\varepsilon}^{2}=\frac{1}{10}\varepsilon^{2}, (9)

for j1j\geq 1, where

τj=140L2ε2,δj=δ4Cej4,lj=18(log(2δj)+loglog(2d1τj)).\displaystyle\tau_{j}=\frac{1}{40L^{2}}\varepsilon^{2},\delta^{\prime}_{j}=\frac{\delta}{4C_{e}j^{4}},\quad l_{j}=18\Big{(}\log(\frac{2}{\delta^{\prime}_{j}})+\log\log(\frac{2d_{1}}{\tau_{j}})\Big{)}.

Then, with probability at least 1δ1-\delta, Prob-SARAH stops in at most

T5=160L(Δf0+1)nε2=𝒪L,Δf0(1nε2)T_{5}=\frac{160L(\Delta_{f}^{0}+1)}{\sqrt{n}\varepsilon^{2}}=\mathcal{O}_{L,\Delta_{f}^{0}}\left(\frac{1}{\sqrt{n}\varepsilon^{2}}\right)

outer iterations and the output satisfies f(𝐱^)2ε2\left\|\nabla f(\hat{\mathbf{x}})\right\|^{2}\leq\varepsilon^{2}.

Corollary C.2.

Under parameter settings in Theorem C.2,

Comp(ε,δ)=𝒪~L,Δf0(nε2).Comp(\varepsilon,\delta)=\tilde{\mathcal{O}}_{L,\Delta_{f}^{0}}\left(\frac{\sqrt{n}}{\varepsilon^{2}}\right).

Comparing the complexities in Corollary C.1 and Corollary C.2, we can notice that under the second setting, we can get rid of the dependence on αM\alpha_{M} at the expense of losing some adaptivity to ε\varepsilon. We would also like to point out that, in the low precision region, i.e. when 1ε=o(n)\frac{1}{\varepsilon}=o\left(\sqrt{n}\right), the second complexity is inferior.

C.1 Dependency on Parameters

To apply our newly introduced Azuma-Hoeffding type inequality (see Theorem 3.2), it is necessary to impose a compact constraint region 𝒟\mathcal{D}. Therefore, let us provide a careful analysis of how 𝒟\mathcal{D} affects the convergence guarantee.

Dependency on d1d_{1}: d1d_{1}, the diameter of 𝒟\mathcal{D}, is a parameter directly related to the choice of 𝒟\mathcal{D}. As shown in the theoretical results presented above, the in-probability first-order complexities always have a polylogarithmic dependency on d1d_{1}, which implies that as long as d1d_{1} is polynomial in nn or 1ε\frac{1}{\varepsilon}, it should only have a minor effect on the complexity. With certain prior knowledge, we should be able to control d1d_{1} at a reasonable scale.

Dependency on Δf\Delta_{f} and Δf0\Delta_{f}^{0}: Under the setting given in Theorem 3.1, the first-order complexity is polynomial in Δf\Delta_{f}. Such dependency implies that the complexity would not deteriorate much if Δf\Delta_{f} is of a small order, which is definitely true when the loss function is bounded. As for the setting given in Theorem C.2, the first-order complexity is polynomial in Δf0\Delta_{f}^{0}, which is conventionally assumed to be 𝒪(1)\mathcal{O}(1) and will not be affected by 𝒟\mathcal{D}.

Appendix D Postponed Proofs for the Results in Section 3

D.1 Bounding the Difference between 𝝂k(j)\boldsymbol{\nu}_{k}^{(j)} and f(𝐱k(j))\nabla f\big{(}\mathbf{x}_{k}^{(j)}\big{)}

Proposition D.1.

For k0k\geq 0, j1j\geq 1, denote

(σ~k(j))24L2ηj2bjm=1k𝝂m1(j)2.\big{(}\tilde{\sigma}_{k}^{(j)}\big{)}^{2}\triangleq\frac{4L^{2}\eta_{j}^{2}}{b_{j}}\sum\limits_{m=1}\limits^{k}\big{\|}\boldsymbol{\nu}_{m-1}^{(j)}\big{\|}^{2}.

Under Assumptions 3.2 and 3.4, for any prescribed constant δ(0,1)\delta^{\prime}\in(0,1), τ(0,1)\tau\in(0,1), k0k\geq 0, j1j\geq 1,

𝝂k(j)f(𝐱k(j))2\displaystyle\quad\big{\|}\boldsymbol{\nu}_{k}^{(j)}-\nabla f\big{(}\mathbf{x}_{k}^{(j)}\big{)}\big{\|}^{2}
18((σ~k(j))2+4L2τkbj)(log2δ+loglog2d12τ)\displaystyle\leq 18\Big{(}\big{(}\tilde{\sigma}_{k}^{(j)}\big{)}^{2}+\frac{4L^{2}\tau k}{b_{j}}\Big{)}\Big{(}\log\frac{2}{\delta^{\prime}}+\log\log\frac{2d_{1}^{2}}{\tau}\Big{)} (10)
+128αM2Bjlog3δ𝟏{Bj<n}\displaystyle\quad+\frac{128\alpha_{M}^{2}}{B_{j}}\log\frac{3}{\delta^{\prime}}\mathbf{1}\left\{B_{j}<n\right\}

with probability at least 12δ1-2\delta^{\prime}.

Remark D.1.

Let us briefly explain this high-probability bound on 𝛎k(j)f(𝐱k(j))2\|\boldsymbol{\nu}_{k}^{(j)}-\nabla f(\mathbf{x}_{k}^{(j)})\|^{2}. When k=o(bj)k=o(b_{j}) and L=𝒪(1)L=\mathcal{O}({1}), by letting τ\tau be of appropriate n1n^{-1}-polynomial order, 4L2τk/bj4L^{2}\tau k/b_{j} will be roughly o(1)o(1). If, further, d1d_{1} is of nn-polynomial order and δ\delta^{\prime} is of n1n^{-1}-polynomial order, log(2/δ)+loglog(2d12/τ)\log(2/\delta^{\prime})+\log\log(2d_{1}^{2}/\tau) will be 𝒪~(1)\tilde{\mathcal{O}}(1). As a result, the upper bound is roughly (σ~k(j))2=(4L2ηj2/bj)m=1k𝛎m1(j)2(\tilde{\sigma}_{k}^{(j)})^{2}=(4L^{2}\eta_{j}^{2}/b_{j})\sum_{m=1}^{k}\|\boldsymbol{\nu}_{m-1}^{(j)}\|^{2} when BjB_{j} is sufficiently large so that the last term in the bound (10) is negligible. Bounding 𝛎k(j)f(𝐱k(j))2\|\boldsymbol{\nu}_{k}^{(j)}-\nabla f(\mathbf{x}_{k}^{(j)})\|^{2} by a linear combination of {𝛎m(j)}m=0\{\|\boldsymbol{\nu}_{m}^{(j)}\|\}_{m=0}^{\infty} is the key to our analysis.

D.2 Analysis on the Output 𝐱^\hat{\mathbf{x}}

Under parameter setting specified in Theorem 3.1, if we suppose that the algorithm stops at the jj-th outer iteration, i.e.

\frac{1}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\leq\tilde{\varepsilon}^{2},\ \varepsilon_{j}\leq\frac{1}{2}\varepsilon^{2}, (11)

there must exist a 0kKj10\leq k^{\prime}\leq K_{j}-1, such that 𝝂k(j)2ε~2.\left\|\boldsymbol{\nu}_{k^{\prime}}^{(j)}\right\|^{2}\leq\tilde{\varepsilon}^{2}.

Then, on the event

Ωj{ω:𝝂k(j)f(𝐱k(j))2lj((σ~k(j))2+4L2τjkbj)+qj,0kKj},\Omega_{j}\triangleq\left\{\omega:\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla f\left(\mathbf{x}_{k}^{(j)}\right)\right\|^{2}\leq l_{j}\left(\left(\tilde{\sigma}_{k}^{(j)}\right)^{2}+\frac{4L^{2}\tau_{j}k}{b_{j}}\right)+q_{j},0\leq k\leq K_{j}\right\}, (12)

where lj=18(log2δj+loglog2d12τj)l_{j}=18\left(\log\frac{2}{\delta^{\prime}_{j}}+\log\log\frac{2d_{1}^{2}}{\tau_{j}}\right) and qj=128αM2Bjlog3δj𝟏{Bj<n}q_{j}=\frac{128\alpha_{M}^{2}}{B_{j}}\log\frac{3}{\delta^{\prime}_{j}}\mathbf{1}\left\{B_{j}<n\right\}, we can easily derive an upper bound on f(𝐱k(j))2\left\|\nabla f(\mathbf{x}_{k^{\prime}}^{(j)})\right\|^{2},

f(𝐱k(j))2\displaystyle\quad\left\|\nabla f(\mathbf{x}_{k^{\prime}}^{(j)})\right\|^{2}
2𝝂k(j)2+2𝝂k(j)f(𝐱k(j))2\displaystyle\leq 2\|\boldsymbol{\nu}_{k^{\prime}}^{(j)}\|^{2}+2\left\|\boldsymbol{\nu}_{k^{\prime}}^{(j)}-\nabla f\left(\mathbf{x}_{k^{\prime}}^{(j)}\right)\right\|^{2}
2ε~2+2lj((σ~k(j))2+4L2τjkbj)+2qj\displaystyle\leq 2\tilde{\varepsilon}^{2}+2l_{j}\left(\left(\tilde{\sigma}_{k^{\prime}}^{(j)}\right)^{2}+\frac{4L^{2}\tau_{j}k^{\prime}}{b_{j}}\right)+2q_{j}
=2ε~2+2lj(4L2ηj2bjm=1k𝝂m1(j)2+4L2τjkbj)+2qj\displaystyle=2\tilde{\varepsilon}^{2}+2l_{j}\left(\frac{4L^{2}\eta_{j}^{2}}{b_{j}}\sum\limits_{m=1}\limits^{k^{\prime}}\left\|\boldsymbol{\nu}_{m-1}^{(j)}\right\|^{2}+\frac{4L^{2}\tau_{j}k^{\prime}}{b_{j}}\right)+2q_{j}
2ε~2+2lj(4L2ηj2bjm=1Kj𝝂m1(j)2+4L2τjKjbj)+2qj\displaystyle\leq 2\tilde{\varepsilon}^{2}+2l_{j}\left(\frac{4L^{2}\eta_{j}^{2}}{b_{j}}\sum\limits_{m=1}\limits^{K_{j}}\left\|\boldsymbol{\nu}_{m-1}^{(j)}\right\|^{2}+\frac{4L^{2}\tau_{j}K_{j}}{b_{j}}\right)+2q_{j}
2ε~2+2lj(4L2ηj2Kjbjε~2+4L2τjKjbj)+2qj\displaystyle\leq 2\tilde{\varepsilon}^{2}+2l_{j}\left(\frac{4L^{2}\eta_{j}^{2}K_{j}}{b_{j}}\tilde{\varepsilon}^{2}+\frac{4L^{2}\tau_{j}K_{j}}{b_{j}}\right)+2q_{j}
=2ε~2+2lj(Kj4bjε~2+4L2τjKjbj)+2qj\displaystyle=2\tilde{\varepsilon}^{2}+2l_{j}\left(\frac{K_{j}}{4b_{j}}\tilde{\varepsilon}^{2}+\frac{4L^{2}\tau_{j}K_{j}}{b_{j}}\right)+2q_{j}
=2.5ε~2+8L2τj+2qj=2.5ε~2+εjε2,\displaystyle=2.5\tilde{\varepsilon}^{2}+8L^{2}\tau_{j}+2q_{j}=2.5\tilde{\varepsilon}^{2}+\varepsilon_{j}\leq\varepsilon^{2},

where the 2nd step is based on (10) with definitions of ljl_{j} and qjq_{j} given in Theorem 3.1, the 5th step is based on (11), the 6th step is based on the choice of ηj=14L\eta_{j}=\frac{1}{4L} and the 7th step is based on the choice of bj=ljKjb_{j}=l_{j}K_{j}. In addition, based on Proposition D.1 and a union bound, the event j=1Ωj\bigcap\limits_{j=1}\limits^{\infty}\Omega_{j} occurs with probability at least

12j=1k=0Kjδj1j=1δKj2Cej41j=1δCej2=1δ.1-2\sum\limits_{j=1}\limits^{\infty}\sum\limits_{k=0}\limits^{K_{j}}\delta^{\prime}_{j}\geq 1-\sum\limits_{j=1}\limits^{\infty}\frac{\delta K_{j}}{2C_{e}j^{4}}\geq 1-\sum\limits_{j=1}\limits^{\infty}\frac{\delta}{C_{e}j^{2}}=1-\delta.

In short, when the algorithm stops, the gradient norm is controlled at the desired level with high probability.

The above results explain our choice of stopping rules in Algorithm 1 and can be summarized as the following proposition.

Proposition D.2.

Suppose that Assumptions 3.2 and 3.4 are true. Under the parameter setting given in Theorem 3.1, the output of Algorithm 1 satisfies

f(𝐱^)2ε2,\left\|\nabla f(\hat{\mathbf{x}})\right\|^{2}\leq\varepsilon^{2},

with probability at least 1δ1-\delta.

D.3 Upper-bounding the Stopping Time

Proposition D.3 (First Stopping Rule).

Suppose that Assumptions 3.2, 3.3 and 3.4 are valid. Let

T1=320L(c1+Δf)ε+320L(c1+Δf)nε2,\displaystyle T_{1}=\Bigg{\lceil}\frac{\sqrt{320L(c_{1}+\Delta_{f})}}{\varepsilon}+\frac{320L(c_{1}+\Delta_{f})}{\sqrt{n}\varepsilon^{2}}\Bigg{\rceil},
T2=3(320Lc2εlog320Lc2ε+640Lc2nε2log320Lc2ε2+1),\displaystyle T_{2}=\Bigg{\lceil}3\left(\frac{\sqrt{320Lc_{2}}}{\varepsilon}\log\frac{\sqrt{320Lc_{2}}}{\varepsilon}+\frac{640Lc_{2}}{\sqrt{n}\varepsilon^{2}}\log\frac{320Lc_{2}}{\varepsilon^{2}}+1\right)\Bigg{\rceil}, (13)

where

c1=CeL4+16αM2Llog192Ceδ,c2=64αM2L.c_{1}=\frac{C_{e}L}{4}+\frac{16\alpha_{M}^{2}}{L}\log\frac{192C_{e}}{\delta},\quad c_{2}=\frac{64\alpha_{M}^{2}}{L}.

Under the parameter setting given in Theorem 3.1, on Ω\Omega, when TT1T2T\geq T_{1}\vee T_{2}, there exists a T+1j2TT+1\leq j\leq 2T such that

1Kjk=0Kj1𝝂k(j)2ε~2.\frac{1}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\leq\tilde{\varepsilon}^{2}.
Proposition D.4 (Second Stopping Rule).

Let

T3=2c3ε,T4=6c4εlog2c4ε,\displaystyle T_{3}=\Bigg{\lceil}\frac{2\sqrt{c_{3}}}{\varepsilon}\Bigg{\rceil},\quad T_{4}=\Bigg{\lceil}\frac{6\sqrt{c_{4}}}{\varepsilon}\log\frac{2\sqrt{c_{4}}}{\varepsilon}\Bigg{\rceil}, (14)

where

c3=8L2+256αM2log12Ceδ,c4=1024αM2.c_{3}=8L^{2}+256\alpha_{M}^{2}\log\frac{12C_{e}}{\delta},\quad c_{4}=1024\alpha_{M}^{2}.

Under the parameter setting given in Theorem 3.1, on Ω\Omega, when TT3T4T\geq T_{3}\vee T_{4},

εT12ε2.\varepsilon_{T}\leq\frac{1}{2}\varepsilon^{2}.
Proposition D.5 (Stop Guarantee).

Under the parameter setting and assumptions given in Theorem 3.1, on Ω\Omega, when TT1T2T3T4T\geq T_{1}\vee T_{2}\vee T_{3}\vee T_{4}, Algorithm 1 stops in at most 2T2T outer iterations.

Proof.

If Algorithm 1 stops in TT outer iterations, our conclusion is obviously true. If not, according to Proposition D.3, there must exist a j[T+1,2T]j\in[T+1,2T] such that the first stopping rule is met, i.e.

1Kjk=0Kj1𝝂k(j)2ε~2.\frac{1}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\leq\tilde{\varepsilon}^{2}.

According to Proposition D.4, the second stopping rule is also met, i.e. εj12ε2.\varepsilon_{j}\leq\frac{1}{2}\varepsilon^{2}.

Consequently, the algorithm stops at the jj-th outer iteration. ∎

Appendix E Technical Lemmas

Lemma E.1 (Theorem 4 in [10]).

Let {ϵ1,ϵ2,,ϵn}\{\boldsymbol{\epsilon}_{1},\boldsymbol{\epsilon}_{2},\ldots,\boldsymbol{\epsilon}_{n}\} be a set of fixed vectors in d\mathbb{R}^{d}. Let ,𝒥{1,2,,n}\mathcal{I},\mathcal{J}\subseteq\{1,2,\ldots,n\} be two random index sets sampled with replacement and without replacement, respectively, with size ||=|𝒥|=k|\mathcal{I}|=|\mathcal{J}|=k. For any continuous and convex function f:df:\mathbb{R}^{d}\rightarrow\mathbb{R},

𝔼f(j𝒥ϵj)𝔼f(iϵi).\mathbb{E}f\Big{(}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}\Big{)}\leq\mathbb{E}f\Big{(}\sum\limits_{i\in\mathcal{I}}\boldsymbol{\epsilon}_{i}\Big{)}.
Lemma E.2 (Proposition 1.2 in [3]; see also Lemma 1.3 in [2]).

Let XX be a real random variable such that 𝔼X=0\mathbb{E}X=0 and aXba\leq X\leq b for some a,ba,b\in\mathbb{R}. Then, for all tt\in\mathbb{R},

log𝔼etXt2(ba)28.\log\mathbb{E}e^{tX}\leq\frac{t^{2}(b-a)^{2}}{8}.
Lemma E.3 (Theorem 3.5 in [30]; see also Theorem 3 in [29] and Proposition 2 in [5]).

Let {ϵk}k=1Kd\{\boldsymbol{\epsilon}_{k}\}_{k=1}^{K}\subseteq\mathbb{R}^{d} be a vector-valued martingale difference sequence with respect to k\mathcal{F}_{k}, k=0,1,,Kk=0,1,\ldots,K, i.e. for k=1,,Kk=1,\ldots,K, 𝔼[ϵk|k1]=0\mathbb{E}\left[\boldsymbol{\epsilon}_{k}|\mathcal{F}_{k-1}\right]=\textbf{0}. Assume ϵk2Bk2\|\boldsymbol{\epsilon}_{k}\|^{2}\leq B_{k}^{2}, k=1,2,,Kk=1,2,\ldots,K. Then,

(k=1Kϵkt)2exp(t22k=1KBk2),\mathbb{P}\left(\Big{\|}\sum\limits_{k=1}\limits^{K}\boldsymbol{\epsilon}_{k}\Big{\|}\geq t\right)\leq 2\text{exp}\Bigg{(}-\frac{t^{2}}{2\sum\limits_{k=1}\limits^{K}B_{k}^{2}}\Bigg{)},

t\forall t\in\mathbb{R}.

Proposition E.1 (Norm-Hoeffding, Sampling without Replacement).

Let {ϵ1,ϵ2,,ϵn}\{\boldsymbol{\epsilon}_{1},\boldsymbol{\epsilon}_{2},\ldots,\boldsymbol{\epsilon}_{n}\} be a set of nn fixed vectors in d\mathbb{R}^{d} such that ϵi2σ2\|\boldsymbol{\epsilon}_{i}\|^{2}\leq\sigma^{2}, 1in\forall 1\leq i\leq n, for some σ2>0\sigma^{2}>0. Let 𝒥{1,2,,n}\mathcal{J}\subseteq\{1,2,\ldots,n\} be a random index set sampled without replacement from {1,2,,n}\{1,2,\ldots,n\}, with size |𝒥|=k|\mathcal{J}|=k. Then,

(1kj𝒥ϵj1nj=1nϵjt)3exp(kt264σ2),\mathbb{P}\Bigg{(}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}\geq t\Bigg{)}\leq 3\text{exp}\Big{(}-\frac{kt^{2}}{64\sigma^{2}}\Big{)},

t\forall t\in\mathbb{R}. In addition,

𝔼1kjϵj1nj=1nϵj216σ2k.\mathbb{E}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{I}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}^{2}\leq\frac{16\sigma^{2}}{k}.
Proof.

First, we develop moment bounds. Let \mathcal{I} be a random index set sampled with replacement from \{1,2,\ldots,n\}, independent of 𝒥\mathcal{J}, with size ||=k|\mathcal{I}|=k. For any p+p\in\mathbb{Z}_{+},

𝔼1kj𝒥ϵj1nj=1nϵjp\displaystyle\mathbb{E}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}^{p}
\displaystyle\leq 𝔼1kjϵj1nj=1nϵjp\displaystyle\mathbb{E}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{I}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}^{p}
=\displaystyle= \bigintss0(1kjϵj1nj=1nϵjpr)dr\displaystyle\bigintss_{0}^{\infty}\mathbb{P}\Bigg{(}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{I}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}^{p}\geq r\Bigg{)}dr
=\displaystyle= \bigintss0(1kjϵj1nj=1nϵjr1/p)dr\displaystyle\bigintss_{0}^{\infty}\mathbb{P}\Bigg{(}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{I}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}\geq r^{1/p}\Bigg{)}dr
\displaystyle\leq \bigintss02exp(kr2/p8σ2)dr\displaystyle\bigintss_{0}^{\infty}2\text{exp}\Bigg{(}-\frac{kr^{2/p}}{8\sigma^{2}}\Bigg{)}dr
=\displaystyle= p(8σ2k)p/2Γ(p2),\displaystyle p\cdot\left(\frac{8\sigma^{2}}{k}\right)^{p/2}\cdot\Gamma\left(\frac{p}{2}\right),

where the 1st step is based on Lemma E.1 and the 4th step is based on the fact that ϵj1ni=1nϵi2σ,j\big{\|}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{i=1}\limits^{n}\boldsymbol{\epsilon}_{i}\big{\|}\leq 2\sigma,\forall j and Lemma E.3.

Then, for any s>0s>0,

𝔼exp(s1kj𝒥ϵj1nj=1nϵj)\displaystyle\quad\mathbb{E}\text{exp}\Bigg{(}s\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}\Bigg{)}
1+p=1sp𝔼1kj𝒥ϵj1nj=1nϵjpp!\displaystyle\leq 1+\sum\limits_{p=1}\limits^{\infty}\frac{s^{p}\mathbb{E}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}^{p}}{p!}
1+p=2sp𝔼1kj𝒥ϵj1nj=1nϵjpp!+s8πσ2k\displaystyle\leq 1+\sum\limits_{p=2}\limits^{\infty}\frac{s^{p}\mathbb{E}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}^{p}}{p!}+s\sqrt{\frac{8\pi\sigma^{2}}{k}}
=1+s8πσ2k+p=1(8σ2s2k)p(2p)Γ(p)(2p)!\displaystyle=1+s\sqrt{\frac{8\pi\sigma^{2}}{k}}+\sum\limits_{p=1}\limits^{\infty}\frac{\left(\frac{8\sigma^{2}s^{2}}{k}\right)^{p}\cdot(2p)\cdot\Gamma(p)}{(2p)!}
+p=1(8σ2s2k)2p+12(2p+1)Γ(p+12)(2p+1)!\displaystyle\quad+\sum\limits_{p=1}\limits^{\infty}\frac{\left(\frac{8\sigma^{2}s^{2}}{k}\right)^{\frac{2p+1}{2}}\cdot(2p+1)\cdot\Gamma\left(p+\frac{1}{2}\right)}{(2p+1)!}
=1+s8πσ2k+2p=1(8σ2s2k)p(p!)(2p)!\displaystyle=1+s\sqrt{\frac{8\pi\sigma^{2}}{k}}+2\sum\limits_{p=1}\limits^{\infty}\frac{\left(\frac{8\sigma^{2}s^{2}}{k}\right)^{p}\cdot(p!)}{(2p)!}
+8s2σ2kp=1(8σ2s2k)pΓ(p+12)(2p)!\displaystyle\quad+\sqrt{\frac{8s^{2}\sigma^{2}}{k}}\sum\limits_{p=1}\limits^{\infty}\frac{\left(\frac{8\sigma^{2}s^{2}}{k}\right)^{p}\cdot\Gamma\left(p+\frac{1}{2}\right)}{(2p)!}
1+s8πσ2k+(2+8πs2σ2k)p=1(8σ2s2k)p(p!)(2p)!\displaystyle\leq 1+s\sqrt{\frac{8\pi\sigma^{2}}{k}}+\left(2+\sqrt{\frac{8\pi s^{2}\sigma^{2}}{k}}\right)\sum\limits_{p=1}\limits^{\infty}\frac{\left(\frac{8\sigma^{2}s^{2}}{k}\right)^{p}\cdot(p!)}{(2p)!}
1+s8πσ2k+(1+2πs2σ2k)p=1(8σ2s2k)pp!\displaystyle\leq 1+s\sqrt{\frac{8\pi\sigma^{2}}{k}}+\left(1+\sqrt{\frac{2\pi s^{2}\sigma^{2}}{k}}\right)\sum\limits_{p=1}\limits^{\infty}\frac{\left(\frac{8\sigma^{2}s^{2}}{k}\right)^{p}}{p!}
=1+s8πσ2k+(1+2πs2σ2k)[exp(8s2σ2k)1]\displaystyle=1+s\sqrt{\frac{8\pi\sigma^{2}}{k}}+\left(1+\sqrt{\frac{2\pi s^{2}\sigma^{2}}{k}}\right)\left[\text{exp}\left(\frac{8s^{2}\sigma^{2}}{k}\right)-1\right]
8πs2σ2k+(1+2πs2σ2k)exp(8s2σ2k)\displaystyle\leq\sqrt{\frac{8\pi s^{2}\sigma^{2}}{k}}+\left(1+\sqrt{\frac{2\pi s^{2}\sigma^{2}}{k}}\right)\text{exp}\left(\frac{8s^{2}\sigma^{2}}{k}\right)
exp(8s2σ2k)+2exp(16s2σ2k)\displaystyle\leq\text{exp}\left(\frac{8s^{2}\sigma^{2}}{k}\right)+2\text{exp}\left(\frac{16s^{2}\sigma^{2}}{k}\right)
3exp(16s2σ2k),\displaystyle\leq 3\text{exp}\left(\frac{16s^{2}\sigma^{2}}{k}\right),

where the 1st step is based on Taylor’s expansion, and the second-to-last step is based on the facts that xex2πx\leq e^{\frac{x^{2}}{\pi}} and π2xex2e2x2\sqrt{\frac{\pi}{2}}xe^{x^{2}}\leq e^{2x^{2}} for all x0x\geq 0.

For any s>0s>0,

(1kj𝒥ϵj1nj=1nϵjt)𝔼exp(s1kj𝒥ϵj1nj=1nϵjts)3exp(16s2σ2kts).\begin{array}[]{cl}&\mathbb{P}\Bigg{(}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}\geq t\Bigg{)}\\ \leq&\mathbb{E}\text{exp}\Bigg{(}s\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}-ts\Bigg{)}\\ \leq&3\text{exp}\left(\frac{16s^{2}\sigma^{2}}{k}-ts\right).\\ \end{array} (15)

By letting s=kt32σ2s=\frac{kt}{32\sigma^{2}} in (15),

(1kj𝒥ϵj1nj=1nϵjt)3exp(kt264σ2).\mathbb{P}\Bigg{(}\Big{\|}\frac{1}{k}\sum\limits_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum\limits_{j=1}\limits^{n}\boldsymbol{\epsilon}_{j}\Big{\|}\geq t\Bigg{)}\leq 3\text{exp}\left(-\frac{kt^{2}}{64\sigma^{2}}\right).
The in-expectation bound follows from the moment bound above with p=2p=2, which gives \mathbb{E}\big{\|}\frac{1}{k}\sum_{j\in\mathcal{J}}\boldsymbol{\epsilon}_{j}-\frac{1}{n}\sum_{j=1}^{n}\boldsymbol{\epsilon}_{j}\big{\|}^{2}\leq 2\cdot(8\sigma^{2}/k)\cdot\Gamma(1)=16\sigma^{2}/k. ∎
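As a purely illustrative numerical sanity check of Proposition E.1 (not used in any proof; the population of vectors, the dimension, the subsample size, and the threshold grid below are arbitrary choices of ours):

import numpy as np

rng = np.random.default_rng(0)
n, d, k = 200, 5, 20                     # population size, dimension, subsample size (arbitrary)
eps_vecs = rng.normal(size=(n, d))
eps_vecs /= np.maximum(1.0, np.linalg.norm(eps_vecs, axis=1, keepdims=True))  # enforce ||eps_i|| <= 1
sigma2 = float(np.max(np.linalg.norm(eps_vecs, axis=1) ** 2))                 # a valid sigma^2 for this population
full_mean = eps_vecs.mean(axis=0)

def deviation():
    # draw k indices without replacement and return ||subsample mean - full mean||
    J = rng.choice(n, size=k, replace=False)
    return np.linalg.norm(eps_vecs[J].mean(axis=0) - full_mean)

devs = np.array([deviation() for _ in range(20000)])
print(f"mean squared deviation {np.mean(devs ** 2):.4f}, bound 16*sigma^2/k = {16.0 * sigma2 / k:.4f}")
for t in (0.1, 0.2, 0.3, 0.4):
    empirical = float(np.mean(devs >= t))
    bound = min(1.0, 3.0 * float(np.exp(-k * t ** 2 / (64.0 * sigma2))))
    print(f"t={t:.1f}: empirical tail {empirical:.4f}, tail bound {bound:.4f}")

The empirical quantities should sit well below the (rather conservative) analytic bounds.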

Definition E.1.

A random vector ϵd\boldsymbol{\epsilon}\in\mathbb{R}^{d} is (a,σ2)(a,\sigma^{2})-norm-subGaussian (or nSG(a,σ2)(a,\sigma^{2})), with constants a,σ2>0a,\sigma^{2}>0, if

(ϵ𝔼ϵt)aexp(t22σ2),\mathbb{P}\left(\|\boldsymbol{\epsilon}-\mathbb{E}\boldsymbol{\epsilon}\|\geq t\right)\leq a\cdot\text{exp}\left(-\frac{t^{2}}{2\sigma^{2}}\right),

t>0\forall t>0.

Definition E.2.

A sequence of random vectors ϵ1,,ϵKd\boldsymbol{\epsilon}_{1},\ldots,\boldsymbol{\epsilon}_{K}\in\mathbb{R}^{d} is an (a,{σk2}k=1K)(a,\{\sigma_{k}^{2}\}_{k=1}^{K})-norm-subGaussian martingale difference sequence adapted to 0,1,,K\mathcal{F}_{0},\mathcal{F}_{1},\ldots,\mathcal{F}_{K}, with constants a,σ12,,σK2>0a,\sigma^{2}_{1},\ldots,\sigma_{K}^{2}>0, if for k=1,2,,Kk=1,2,\ldots,K,

𝔼[ϵk|k1]=0,σkk1,ϵkk,\mathbb{E}\left[\boldsymbol{\epsilon}_{k}|\mathcal{F}_{k-1}\right]=\textbf{0},\ \sigma_{k}\in\mathcal{F}_{k-1},\ \boldsymbol{\epsilon}_{k}\in\mathcal{F}_{k},

and ϵk|k1\boldsymbol{\epsilon}_{k}|\mathcal{F}_{k-1} is (a,σk2)(a,\sigma^{2}_{k})-norm-subGaussian.

Lemma E.4 (Corollary 8 in [15]).

Suppose ϵ1,,ϵKd\boldsymbol{\epsilon}_{1},\ldots,\boldsymbol{\epsilon}_{K}\in\mathbb{R}^{d} is (a,{σk2}k=1K)(a,\{\sigma_{k}^{2}\}_{k=1}^{K})-norm-subGaussian martingale difference sequence adapted to 0,1,,K\mathcal{F}_{0},\mathcal{F}_{1},\ldots,\mathcal{F}_{K}. Then for any fixed δ>0\delta>0, and B>b>0B>b>0, with probability at least 1δ1-\delta, either

k=1Kσk2B\sum\limits_{k=1}\limits^{K}\sigma_{k}^{2}\geq B

or,

k=1Kϵka2e1/emax{k=1Kσk2,b}(log2dδ+loglogBb).\Big{\|}\sum\limits_{k=1}\limits^{K}\boldsymbol{\epsilon}_{k}\Big{\|}\leq\frac{a}{2}e^{1/e}\sqrt{\max\left\{\sum\limits_{k=1}\limits^{K}\sigma_{k}^{2},b\right\}\left(\log\frac{2d}{\delta}+\log\log\frac{B}{b}\right)}.
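As a simple illustrative instance of Definition E.2 and Lemma E.4 (an added remark for orientation, not invoked verbatim elsewhere): if \mathbb{E}[\boldsymbol{\epsilon}_{k}|\mathcal{F}_{k-1}]=\mathbf{0} and \|\boldsymbol{\epsilon}_{k}\|\leq r_{k} almost surely with r_{k}\in\mathcal{F}_{k-1}, then ϵ1,,ϵK\boldsymbol{\epsilon}_{1},\ldots,\boldsymbol{\epsilon}_{K} is a (2,\{r_{k}^{2}\}_{k=1}^{K})-norm-subGaussian martingale difference sequence, since for 0<t\leq r_{k},

2\,\text{exp}\left(-\frac{t^{2}}{2r_{k}^{2}}\right)\geq 2e^{-1/2}>1\geq\mathbb{P}\left(\|\boldsymbol{\epsilon}_{k}\|\geq t\,\big{|}\,\mathcal{F}_{k-1}\right),

while for t>r_{k} the conditional probability on the right-hand side is 0. Bounded martingale difference sequences of this type are exactly the objects handled via the per-step bounds r_{k} in the proof of Theorem 3.2 below.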
Lemma E.5.

For any ε>0,n+\varepsilon>0,n\in\mathbb{Z}_{+},

1T2(nT)ε2,\frac{1}{T^{2}\wedge(\sqrt{n}T)}\leq\varepsilon^{2},

when T1ε+1nε2T\geq\lceil\frac{1}{\varepsilon}+\frac{1}{\sqrt{n}\varepsilon^{2}}\rceil.

Proof.
1T2(nT)=max{1T2,1nT}\displaystyle\quad\frac{1}{T^{2}\wedge(\sqrt{n}T)}=\max\Big{\{}\frac{1}{T^{2}},\frac{1}{\sqrt{n}T}\Big{\}}
max{1T2|T=1ε,1nT|T=1nε2}\displaystyle\leq\max\Big{\{}\frac{1}{T^{\prime 2}}\big{|}_{T^{\prime}=\frac{1}{\varepsilon}},\frac{1}{\sqrt{n}T^{\prime}}\big{|}_{T^{\prime}=\frac{1}{\sqrt{n}\varepsilon^{2}}}\Big{\}}
=ε2.\displaystyle=\varepsilon^{2}.

Lemma E.6.

For any ε(0,e1],n+\varepsilon\in(0,e^{-1}],n\in\mathbb{Z}_{+},

logTT2(nT)ε2,\frac{\log T}{T^{2}\wedge(\sqrt{n}T)}\leq\varepsilon^{2},

when T3(1εlog1ε+2nε2log1ε2+𝟏{1nε2log1ε2e6})T\geq\Big{\lceil}3\left(\frac{1}{\varepsilon}\log\frac{1}{\varepsilon}+\frac{2}{\sqrt{n}\varepsilon^{2}}\log\frac{1}{\varepsilon^{2}}+\mathbf{1}\left\{\frac{1}{\sqrt{n}\varepsilon^{2}}\log\frac{1}{\varepsilon^{2}}\leq\frac{e}{6}\right\}\right)\Big{\rceil}.

Proof.

Function h(T)=logTT2h(T)=\frac{\log T}{T^{2}} is monotonically decreasing when TeT\geq\sqrt{e}. Since T3εlog1εeT\geq\frac{3}{\varepsilon}\log\frac{1}{\varepsilon}\geq\sqrt{e},

logTT2log(3εlog1ε)(3εlog1ε)2=log3+log1ε+loglog1ε9(log1ε)2ε2\displaystyle\quad\frac{\log T}{T^{2}}\leq\frac{\log\left(\frac{3}{\varepsilon}\log\frac{1}{\varepsilon}\right)}{\left(\frac{3}{\varepsilon}\log\frac{1}{\varepsilon}\right)^{2}}=\frac{\log 3+\log\frac{1}{\varepsilon}+\log\log\frac{1}{\varepsilon}}{9\left(\log\frac{1}{\varepsilon}\right)^{2}}\varepsilon^{2}
1+log3+2log1ε9(log1ε)2ε23+log39ε2ε2.\displaystyle\leq\frac{1+\log 3+2\log\frac{1}{\varepsilon}}{9\left(\log\frac{1}{\varepsilon}\right)^{2}}\varepsilon^{2}\leq\frac{3+\log 3}{9}\varepsilon^{2}\leq\varepsilon^{2}.

Define a function h~(T)=logTT\tilde{h}(T)=\frac{\log T}{T}. It is monotonically decreasing when TeT\geq e. Thus, if 6nε2log1ε2e\frac{6}{\sqrt{n}\varepsilon^{2}}\log\frac{1}{\varepsilon^{2}}\geq e, we know TeT\geq e and consequently,

logTnTlog(6nε2log1ε2)n(6nε2log1ε2)\displaystyle\quad\frac{\log T}{\sqrt{n}T}\leq\frac{\log\left(\frac{6}{\sqrt{n}\varepsilon^{2}}\log\frac{1}{\varepsilon^{2}}\right)}{\sqrt{n}\left(\frac{6}{\sqrt{n}\varepsilon^{2}}\log\frac{1}{\varepsilon^{2}}\right)}
=log1nε2+loglog1ε2+log66log1ε2ε2\displaystyle=\frac{\log\frac{1}{\sqrt{n}\varepsilon^{2}}+\log\log\frac{1}{\varepsilon^{2}}+\log 6}{6\log\frac{1}{\varepsilon^{2}}}\varepsilon^{2}
log1ε2+loglog1ε2+log66log1ε2ε2\displaystyle\leq\frac{\log\frac{1}{\varepsilon^{2}}+\log\log\frac{1}{\varepsilon^{2}}+\log 6}{6\log\frac{1}{\varepsilon^{2}}}\varepsilon^{2}
2log1ε2+1+log66log1ε2ε2\displaystyle\leq\frac{2\log\frac{1}{\varepsilon^{2}}+1+\log 6}{6\log\frac{1}{\varepsilon^{2}}}\varepsilon^{2}
3+log66ε2\displaystyle\leq\frac{3+\log 6}{6}\varepsilon^{2}
ε2.\displaystyle\leq\varepsilon^{2}.

If 6nε2log1ε2e\frac{6}{\sqrt{n}\varepsilon^{2}}\log\frac{1}{\varepsilon^{2}}\leq e, T3T\geq 3. Hence,

logTnTlog3n36nelog1ε2(log33e6)\displaystyle\quad\frac{\log T}{\sqrt{n}T}\leq\frac{\log 3}{\sqrt{n}3}\leq\frac{6}{\sqrt{n}e}\log\frac{1}{\varepsilon^{2}}\left(\frac{\log 3}{3}\cdot\frac{e}{6}\right)
6nelog1ε2ε2.\displaystyle\leq\frac{6}{\sqrt{n}e}\log\frac{1}{\varepsilon^{2}}\leq\varepsilon^{2}.

Based on the above results,

logTT2(nT)max{logTT2,logT(nT)}ε2.\frac{\log T}{T^{2}\wedge(\sqrt{n}T)}\leq\max\left\{\frac{\log T}{T^{2}},\frac{\log T}{(\sqrt{n}T)}\right\}\leq\varepsilon^{2}.
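As a quick numerical sanity check of the thresholds in Lemmas E.5 and E.6 (illustrative only; the (ε, n) pairs and the variable names below are arbitrary choices of ours):

import math

def check(eps, n):
    # Lemma E.5 threshold
    T_e5 = math.ceil(1.0 / eps + 1.0 / (math.sqrt(n) * eps ** 2))
    # Lemma E.6 threshold (requires eps in (0, 1/e])
    tail = (1.0 / (math.sqrt(n) * eps ** 2)) * math.log(1.0 / eps ** 2)
    T_e6 = math.ceil(3.0 * ((1.0 / eps) * math.log(1.0 / eps) + 2.0 * tail
                            + (1.0 if tail <= math.e / 6.0 else 0.0)))
    ok5 = 1.0 / min(T_e5 ** 2, math.sqrt(n) * T_e5) <= eps ** 2
    ok6 = math.log(T_e6) / min(T_e6 ** 2, math.sqrt(n) * T_e6) <= eps ** 2
    print(f"eps={eps}, n={n}: E.5 holds at T={T_e5}: {ok5}; E.6 holds at T={T_e6}: {ok6}")

for eps, n in [(0.2, 10), (0.1, 100), (0.05, 10000)]:
    check(eps, n)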

Appendix F Proofs of Main Theorems

Proof of Proposition B.1.

It is not hard to conclude that we only need to show

(j1,1Kjk=0Kj1𝝂k(j)2ϵ~2)=1.\mathbb{P}\Bigg{(}\exists\ j\geq 1,\frac{1}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\big{\|}\boldsymbol{\nu}_{k}^{(j)}\big{\|}^{2}\leq\tilde{\boldsymbol{\epsilon}}^{2}\Bigg{)}=1. (16)

For simplicity, we denote Vj1Kjk=0Kj1𝝂k(j)2V_{j}\triangleq\frac{1}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\big{\|}\boldsymbol{\nu}_{k}^{(j)}\big{\|}^{2}. To show (16), we first derive an in-expectation bound on VjV_{j}, which has been established in works such as [33].

With our basic assumptions, we have

f(𝐱k+1(j))\displaystyle\quad f\left(\mathbf{x}_{k+1}^{(j)}\right)
=f~(Proj(𝐱k(j)ηj𝝂k(j),𝒟))\displaystyle=\tilde{f}\Big{(}\mathrm{Proj}\big{(}\mathbf{x}_{k}^{(j)}-\eta_{j}\boldsymbol{\nu}_{k}^{(j)},\mathcal{D}\big{)}\Big{)}
f~(𝐱k(j)ηj𝝂k(j))\displaystyle\leq\tilde{f}\left(\mathbf{x}_{k}^{(j)}-\eta_{j}\boldsymbol{\nu}_{k}^{(j)}\right)
f~(𝐱k(j))<f~(𝐱k(j)),ηj𝝂k(j)>+L2ηj2𝝂k(j)2\displaystyle\leq\tilde{f}\left(\mathbf{x}_{k}^{(j)}\right)-\Bigl{<}\nabla\tilde{f}\left(\mathbf{x}_{k}^{(j)}\right),\eta_{j}\boldsymbol{\nu}_{k}^{(j)}\Bigr{>}+\frac{L}{2}\eta_{j}^{2}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}
=f(𝐱k(j))<f(𝐱k(j)),ηj𝝂k(j)>+L2ηj2𝝂k(j)2\displaystyle={f}\left(\mathbf{x}_{k}^{(j)}\right)-\Bigl{<}\nabla{f}\left(\mathbf{x}_{k}^{(j)}\right),\eta_{j}\boldsymbol{\nu}_{k}^{(j)}\Bigr{>}+\frac{L}{2}\eta_{j}^{2}\big{\|}\boldsymbol{\nu}_{k}^{(j)}\big{\|}^{2}
=f(𝐱k(j))+ηj2𝝂k(j)f(𝐱k(j))2ηj2f(𝐱k(j))2\displaystyle={f}\left(\mathbf{x}_{k}^{(j)}\right)+\frac{\eta_{j}}{2}\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla{f}\big{(}\mathbf{x}_{k}^{(j)}\big{)}\right\|^{2}-\frac{\eta_{j}}{2}\left\|\nabla{f}\big{(}\mathbf{x}_{k}^{(j)}\big{)}\right\|^{2}
ηj2(1Lηj)𝝂k(j)2,\displaystyle\quad-\frac{\eta_{j}}{2}\left(1-L\eta_{j}\right)\big{\|}\boldsymbol{\nu}_{k}^{(j)}\big{\|}^{2},

where the 2nd and 3rd steps are based on Assumption 3.3. Summing the above inequality from k=0k=0 to Kj1K_{j}-1,

f(𝐱~j)f(𝐱~j1)\displaystyle\quad f\left(\tilde{\mathbf{x}}_{j}\right)-f\left(\tilde{\mathbf{x}}_{j-1}\right)
=f(𝐱Kj(j))f(𝐱0(j))\displaystyle=f\big{(}\mathbf{x}_{K_{j}}^{(j)}\big{)}-f\big{(}\mathbf{x}_{0}^{(j)}\big{)}
ηj2k=0Kj1𝝂k(j)f(𝐱k(j))2ηj2(1Lηj)KjVj.\displaystyle\leq\frac{\eta_{j}}{2}\sum\limits_{k=0}\limits^{K_{j}-1}\Big{\|}\boldsymbol{\nu}_{k}^{(j)}-\nabla{f}\big{(}\mathbf{x}_{k}^{(j)}\big{)}\Big{\|}^{2}-\frac{\eta_{j}}{2}\left(1-L\eta_{j}\right)K_{j}V_{j}. (17)

Then,

𝔼(f(𝐱~j)f(𝐱~j1)|j1)\displaystyle\quad\mathbb{E}\big{(}f\left(\tilde{\mathbf{x}}_{j}\right)-f\left(\tilde{\mathbf{x}}_{j-1}\right)\big{|}\mathcal{F}_{j-1}\big{)}
𝔼(ηj2k=0Kj1𝝂k(j)f(𝐱k(j))2|j1)\displaystyle\leq\mathbb{E}\Bigg{(}\frac{\eta_{j}}{2}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla{f}\big{(}\mathbf{x}_{k}^{(j)}\big{)}\right\|^{2}\big{|}\mathcal{F}_{j-1}\Bigg{)}
ηj2(1Lηj)Kj𝔼(Vj|j1).\displaystyle\quad-\frac{\eta_{j}}{2}\left(1-L\eta_{j}\right)K_{j}\mathbb{E}\left(V_{j}\big{|}\mathcal{F}_{j-1}\right). (18)

For convenience, we abbreviate 𝔼(|j1)\mathbb{E}\left(\cdot|\mathcal{F}_{j-1}\right) as 𝔼j1()\mathbb{E}_{j-1}(\cdot). For k=1,2,,Kj1k=1,2,\ldots,K_{j}-1,

𝔼j1𝝂k(j)f(𝐱k(j))2\displaystyle\quad\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla{f}\left(\mathbf{x}_{k}^{(j)}\right)\right\|^{2}
=𝔼j1𝔼(𝝂k(j)f(𝐱k(j))2|j,k1)\displaystyle=\mathbb{E}_{j-1}\mathbb{E}\Bigg{(}\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla{f}\left(\mathbf{x}_{k}^{(j)}\right)\right\|^{2}\big{|}\mathcal{F}_{j,k-1}\Bigg{)}
=𝔼j1𝔼(1bjik(j)fi(𝐱k(j))1bjik(j)fi(𝐱k1(j))\displaystyle=\mathbb{E}_{j-1}\mathbb{E}\Bigg{(}\Big{\|}\frac{1}{b_{j}}\sum\limits_{i\in\mathcal{I}_{k}^{(j)}}\nabla f_{i}(\mathbf{x}_{k}^{(j)})-\frac{1}{b_{j}}\sum\limits_{i\in\mathcal{I}_{k}^{(j)}}\nabla f_{i}(\mathbf{x}_{k-1}^{(j)})
+f(𝐱k1(j))f(𝐱k(j))2|j,k1)\displaystyle\quad+\nabla f(\mathbf{x}_{k-1}^{(j)})-\nabla f(\mathbf{x}_{k}^{(j)})\Big{\|}^{2}\Big{|}\mathcal{F}_{j,k-1}\Bigg{)}
+𝔼j1𝝂k1(j)f(𝐱k1(j))2\displaystyle\quad+\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{k-1}^{(j)}-\nabla f(\mathbf{x}_{k-1}^{(j)})\right\|^{2}
≤𝔼j11bj2ik(j)𝔼(fi(𝐱k(j))fi(𝐱k1(j))\displaystyle\leq\mathbb{E}_{j-1}\frac{1}{b_{j}^{2}}\sum\limits_{i\in\mathcal{I}_{k}^{(j)}}\mathbb{E}\Bigg{(}\Big{\|}\nabla f_{i}(\mathbf{x}_{k}^{(j)})-\nabla f_{i}(\mathbf{x}_{k-1}^{(j)})
+f(𝐱k1(j))f(𝐱k(j))2|j,k1)\displaystyle\quad+\nabla f(\mathbf{x}_{k-1}^{(j)})-\nabla f(\mathbf{x}_{k}^{(j)})\Big{\|}^{2}\Big{|}\mathcal{F}_{j,k-1}\Bigg{)}
+𝔼j1𝝂k1(j)f(𝐱k1(j))2\displaystyle\quad+\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{k-1}^{(j)}-\nabla f(\mathbf{x}_{k-1}^{(j)})\right\|^{2}
4L2bj𝔼j1𝐱k(j)𝐱k1(j)2+𝔼j1𝝂k1(j)f(𝐱k1(j))2\displaystyle\leq\frac{4L^{2}}{b_{j}}\mathbb{E}_{j-1}\left\|\mathbf{x}_{k}^{(j)}-\mathbf{x}_{k-1}^{(j)}\right\|^{2}+\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{k-1}^{(j)}-\nabla f(\mathbf{x}_{k-1}^{(j)})\right\|^{2}
4L2ηj2bj𝔼j1𝝂k1(j)2+𝔼j1𝝂k1(j)f(𝐱k1(j))2\displaystyle\leq\frac{4L^{2}\eta_{j}^{2}}{b_{j}}\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{k-1}^{(j)}\right\|^{2}+\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{k-1}^{(j)}-\nabla f(\mathbf{x}_{k-1}^{(j)})\right\|^{2}
4L2ηj2bjt=0k1𝔼j1𝝂t(j)2+𝔼j1𝝂0(j)f(𝐱0(j))2\displaystyle\leq\frac{4L^{2}\eta_{j}^{2}}{b_{j}}\sum\limits_{t=0}\limits^{k-1}\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{t}^{(j)}\right\|^{2}+\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{0}^{(j)}-\nabla f(\mathbf{x}_{0}^{(j)})\right\|^{2}
4L2ηj2Kjbj𝔼j1Vj+𝔼j1𝝂0(j)f(𝐱0(j))2.\displaystyle\leq\frac{4L^{2}\eta_{j}^{2}K_{j}}{b_{j}}\mathbb{E}_{j-1}V_{j}+\mathbb{E}_{j-1}\left\|\boldsymbol{\nu}_{0}^{(j)}-\nabla f(\mathbf{x}_{0}^{(j)})\right\|^{2}. (19)

Based on (18) and (19),

𝔼(f(𝐱~j)f(𝐱~j1))\displaystyle\quad\mathbb{E}\big{(}f\left(\tilde{\mathbf{x}}_{j}\right)-f\left(\tilde{\mathbf{x}}_{j-1}\right)\big{)}
ηjKj2(1Lηj4L2ηj2Kjbj)𝔼Vj\displaystyle\leq-\frac{\eta_{j}K_{j}}{2}\left(1-L\eta_{j}-\frac{4L^{2}\eta_{j}^{2}K_{j}}{b_{j}}\right)\mathbb{E}V_{j}
+ηjKj2𝔼𝝂0(j)f(𝐱0(j))2\displaystyle\quad+\frac{\eta_{j}K_{j}}{2}\mathbb{E}\left\|\boldsymbol{\nu}_{0}^{(j)}-\nabla f(\mathbf{x}_{0}^{(j)})\right\|^{2}
Kj16L𝔼Vj+Kj8L𝔼𝝂0(j)f(𝐱0(j))2,\displaystyle\leq-\frac{K_{j}}{16L}\mathbb{E}V_{j}+\frac{K_{j}}{8L}\mathbb{E}\left\|\boldsymbol{\nu}_{0}^{(j)}-\nabla f(\mathbf{x}_{0}^{(j)})\right\|^{2}, (20)

where the second step is based on the choice of ηj14L\eta_{j}\equiv\frac{1}{4L} and bjKjb_{j}\geq K_{j}, j1j\geq 1. Let us define J0=min{j:Bj=n}J_{0}=\min\{j:B_{j}=n\}. For jJ0j\geq J_{0}, the outer batch uses all nn samples, so 𝝂0(j)=f(𝐱0(j))\boldsymbol{\nu}_{0}^{(j)}=\nabla f(\mathbf{x}_{0}^{(j)}) and the second term in (20) vanishes; hence, based on (20),

116L𝔼VjKj16L𝔼Vj𝔼(f(𝐱~j1)f(𝐱~j)).\frac{1}{16L}\mathbb{E}V_{j}\leq\frac{K_{j}}{16L}\mathbb{E}V_{j}\leq\mathbb{E}\left(f\left(\tilde{\mathbf{x}}_{j-1}\right)-f\left(\tilde{\mathbf{x}}_{j}\right)\right).

Then for any m+m\in\mathbb{Z}_{+},

(Vj>ϵ~2,j1)\displaystyle\quad\mathbb{P}\left(V_{j}>\tilde{\boldsymbol{\epsilon}}^{2},\ \forall j\geq 1\right)
(VJ0+VJ0+1++VJ0+m>(m+1)ϵ~2)\displaystyle\leq\mathbb{P}\left(V_{J_{0}}+V_{J_{0}+1}+\ldots+V_{J_{0}+m}>(m+1)\tilde{\boldsymbol{\epsilon}}^{2}\right)
𝔼(VJ0+VJ0+1++VJ0+m)(m+1)ϵ~2\displaystyle\leq\frac{\mathbb{E}\left(V_{J_{0}}+V_{J_{0}+1}+\ldots+V_{J_{0}+m}\right)}{(m+1)\tilde{\boldsymbol{\epsilon}}^{2}}
16L(m+1)ϵ~2j=J0J0+m𝔼(f(𝐱~j1)f(𝐱~j))\displaystyle\leq\frac{16L}{(m+1)\tilde{\boldsymbol{\epsilon}}^{2}}\sum\limits_{j=J_{0}}\limits^{J_{0}+m}\mathbb{E}\left(f\left(\tilde{\mathbf{x}}_{j-1}\right)-f\left(\tilde{\mathbf{x}}_{j}\right)\right)
16LΔf(m+1)ϵ~2.\displaystyle\leq\frac{16L\Delta_{f}}{(m+1)\tilde{\boldsymbol{\epsilon}}^{2}}.

Since mm can be arbitrarily large, we know

(Vj>ϵ~2,j1)=0,\mathbb{P}\left(V_{j}>\tilde{\boldsymbol{\epsilon}}^{2},\ \forall j\geq 1\right)=0,

which can directly lead to (16). ∎
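For orientation only, the recursion analyzed in the proof above corresponds to the following minimal Python sketch of one outer iteration of a projected recursive-gradient inner loop; the oracle names grad_i, grad_anchor, and proj are placeholders we introduce here, and the parameters are schematic rather than the exact schedules of Theorem 3.1.

import numpy as np

def sarah_inner_loop(x0, grad_i, grad_anchor, proj, n, K, b, eta, rng):
    """One outer iteration: anchor gradient estimate nu_0, then K projected
    steps driven by the recursive gradient estimator nu_k."""
    x_prev, nu = x0, grad_anchor(x0)                 # nu_0^{(j)}; full gradient when B_j = n
    x = proj(x_prev - eta * nu)                      # x_1^{(j)}
    sq_norms = [float(np.linalg.norm(nu) ** 2)]
    for _ in range(1, K):
        batch = rng.choice(n, size=b, replace=False)
        # nu_k = nu_{k-1} + (1/b) * sum_{i in batch} [grad_i(x_k) - grad_i(x_{k-1})]
        nu = nu + np.mean([grad_i(i, x) - grad_i(i, x_prev) for i in batch], axis=0)
        x_prev, x = x, proj(x - eta * nu)
        sq_norms.append(float(np.linalg.norm(nu) ** 2))
    return x, sum(sq_norms) / K                      # plays the role of (x_{K_j}^{(j)}, V_j)

The second returned value is the average V_{j}=\frac{1}{K_{j}}\sum_{k=0}^{K_{j}-1}\|\boldsymbol{\nu}_{k}^{(j)}\|^{2} whose behaviour Proposition B.1 controls.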

Proof of Theorem 3.2.

In this proof, for simplicity, we denote 𝔼[|k]\mathbb{E}\left[\cdot|\mathcal{F}_{k}\right] by 𝔼k[]\mathbb{E}_{k}\left[\cdot\right]. Let 𝐬0=𝟎\mathbf{s}_{0}=\mathbf{0} and 𝐬k=i=1k𝐳i,k1\mathbf{s}_{k}=\sum\limits_{i=1}\limits^{k}\mathbf{z}_{i},k\geq 1. For any 1kK1\leq k\leq K, consider

fk(t)=𝔼k1[cosh(λ𝐬k1+t𝐳k)],λ>0,t0.f_{k}(t)=\mathbb{E}_{k-1}\left[cosh(\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|)\right],\ \lambda>0,t\geq 0.

Then,

fk(t)=12𝔼k1[λ𝐳k,𝐬k1+t𝐳k𝐬k1+t𝐳k(eλ𝐬k1+t𝐳keλ𝐬k1+t𝐳k)],f^{\prime}_{k}(t)=\frac{1}{2}\mathbb{E}_{k-1}\left[\frac{\lambda\langle\mathbf{z}_{k},\mathbf{s}_{k-1}+t\mathbf{z}_{k}\rangle}{\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|}\left(e^{\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|}-e^{-\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|}\right)\right],

and consequently,

fk(0)\displaystyle f^{\prime}_{k}(0) =12𝔼k1[λ𝐳k,𝐬k1𝐬k1(eλ𝐬k1eλ𝐬k1)]\displaystyle=\frac{1}{2}\mathbb{E}_{k-1}\left[\frac{\lambda\langle\mathbf{z}_{k},\mathbf{s}_{k-1}\rangle}{\|\mathbf{s}_{k-1}\|}\left(e^{\lambda\|\mathbf{s}_{k-1}\|}-e^{-\lambda\|\mathbf{s}_{k-1}\|}\right)\right]
=0,\displaystyle=0,
since \mathbb{E}_{k-1}[\mathbf{z}_{k}]=\mathbf{0} and \mathbf{s}_{k-1} is \mathcal{F}_{k-1}-measurable.

Next,

fk′′(t)\displaystyle\quad f^{\prime\prime}_{k}(t)
≤12𝔼k1[(λ2𝐳k,𝐬k1+t𝐳k2𝐬k1+t𝐳k2+λ𝐳k2𝐬k1+t𝐳k)eλ𝐬k1+t𝐳k\displaystyle\leq\frac{1}{2}\mathbb{E}_{k-1}\Biggl{[}\left(\frac{\lambda^{2}\langle\mathbf{z}_{k},\mathbf{s}_{k-1}+t\mathbf{z}_{k}\rangle^{2}}{\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|^{2}}+\frac{\lambda\|\mathbf{z}_{k}\|^{2}}{\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|}\right)e^{\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|}
+(λ2𝐳k,𝐬k1+t𝐳k2𝐬k1+t𝐳k2λ𝐳k2𝐬k1+t𝐳k)eλ𝐬k1+t𝐳k]\displaystyle\quad\quad+\left(\frac{\lambda^{2}\langle\mathbf{z}_{k},\mathbf{s}_{k-1}+t\mathbf{z}_{k}\rangle^{2}}{\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|^{2}}-\frac{\lambda\|\mathbf{z}_{k}\|^{2}}{\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|}\right)e^{-\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|}\Biggr{]}
=𝔼k1[λ2𝐳k,𝐬k1+t𝐳k2𝐬k1+t𝐳k2cosh(λ𝐬k1+t𝐳k)\displaystyle=\mathbb{E}_{k-1}\Biggl{[}\frac{\lambda^{2}\langle\mathbf{z}_{k},\mathbf{s}_{k-1}+t\mathbf{z}_{k}\rangle^{2}}{\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|^{2}}cosh(\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|)
+λ2𝐳k2λ𝐬k1+t𝐳ksinh(λ𝐬k1+t𝐳k)]\displaystyle\quad\quad+\frac{\lambda^{2}\|\mathbf{z}_{k}\|^{2}}{\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|}sinh(\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|)\Biggr{]}
𝔼k1[(λ2𝐳k,𝐬k1+t𝐳k2𝐬k1+t𝐳k2+λ2𝐳k2)cosh(λ𝐬k1+t𝐳k)]\displaystyle\leq\mathbb{E}_{k-1}\Biggl{[}\left(\frac{\lambda^{2}\langle\mathbf{z}_{k},\mathbf{s}_{k-1}+t\mathbf{z}_{k}\rangle^{2}}{\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|^{2}}+\lambda^{2}\|\mathbf{z}_{k}\|^{2}\right)cosh(\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|)\Biggr{]}
2λ2𝔼k1[𝐳k2cosh(λ𝐬k1+t𝐳k)]\displaystyle\leq 2\lambda^{2}\mathbb{E}_{k-1}\left[\|\mathbf{z}_{k}\|^{2}cosh(\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|)\right]
2λ2rk2𝔼k1[cosh(λ𝐬k1+t𝐳k)]\displaystyle\leq 2\lambda^{2}r_{k}^{2}\mathbb{E}_{k-1}\left[cosh(\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|)\right]
=2λ2rk2fk(t),\displaystyle=2\lambda^{2}r_{k}^{2}f_{k}(t),

where the 1st inequality drops the non-positive term -\lambda\mathbb{E}_{k-1}\left[\frac{\langle\mathbf{z}_{k},\mathbf{s}_{k-1}+t\mathbf{z}_{k}\rangle^{2}}{\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|^{3}}\sinh(\lambda\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|)\right] arising from the product rule, the 2nd inequality is based on the fact that \frac{\sinh(y)}{y}\leq\cosh(y) for y>0y>0, the 3rd inequality uses \langle\mathbf{z}_{k},\mathbf{s}_{k-1}+t\mathbf{z}_{k}\rangle^{2}\leq\|\mathbf{z}_{k}\|^{2}\|\mathbf{s}_{k-1}+t\mathbf{z}_{k}\|^{2}, and the 4th inequality uses 𝐳krk\|\mathbf{z}_{k}\|\leq r_{k} with rkk1r_{k}\in\mathcal{F}_{k-1}.

According to Lemma 3 in [29],

fk(t)fk(0)exp(λ2rk2t2)=cosh(λ𝐬k1)exp(λ2rk2t2).f_{k}(t)\leq f_{k}(0)\text{exp}\left(\lambda^{2}r_{k}^{2}t^{2}\right)=cosh(\lambda\|\mathbf{s}_{k-1}\|)\text{exp}\left(\lambda^{2}r_{k}^{2}t^{2}\right).

Thus,

𝔼k1[cosh(λ𝐬k)]\displaystyle\quad\mathbb{E}_{k-1}\left[cosh(\lambda\|\mathbf{s}_{k}\|)\right]
=fk(1)cosh(λ𝐬k1)exp(λ2rk2).\displaystyle=f_{k}(1)\leq cosh(\lambda\|\mathbf{s}_{k-1}\|)\text{exp}\left(\lambda^{2}r_{k}^{2}\right). (21)

Now, let

Gk=cosh(λ𝐬k)exp(λ2i=1kri2),k=0,1,,K.G_{k}=cosh(\lambda\|\mathbf{s}_{k}\|)\text{exp}\Big{(}-\lambda^{2}\sum\limits_{i=1}\limits^{k}r_{i}^{2}\Big{)},\ k=0,1,\ldots,K.

Note that G0=cosh(0)=1G_{0}=cosh(0)=1 and that, for k=0,1,,Kk=0,1,\ldots,K, GkG_{k} is measurable with respect to k\mathcal{F}_{k}. According to (21),

𝔼k1Gk\displaystyle\mathbb{E}_{k-1}G_{k} =exp(λ2i=1kri2)𝔼k1[cosh(λ𝐬k)]\displaystyle=\text{exp}\Big{(}-\lambda^{2}\sum\limits_{i=1}\limits^{k}r_{i}^{2}\Big{)}\mathbb{E}_{k-1}\left[cosh(\lambda\|\mathbf{s}_{k}\|)\right]
cosh(λ𝐬k1)exp(λ2i=1k1ri2)\displaystyle\leq cosh(\lambda\|\mathbf{s}_{k-1}\|)\text{exp}\Big{(}-\lambda^{2}\sum\limits_{i=1}\limits^{k-1}r_{i}^{2}\Big{)}
=Gk1,\displaystyle=G_{k-1},

which implies that {Gk}k=1K\{G_{k}\}_{k=1}^{K} is a non-negative super-martingale adapted to 0,1,,K\mathcal{F}_{0},\mathcal{F}_{1},\ldots,\mathcal{F}_{K}.

For any constant m>0m>0, if we define stopping time Tm=inf{t:𝐬tλi=1tri2+m}T_{m}=\inf\left\{t:\|\mathbf{s}_{t}\|\geq\lambda\sum\limits_{i=1}\limits^{t}r_{i}^{2}+m\right\}, we immediately know that GTmkG_{T_{m}\wedge k}, k0k\geq 0, is a supermartingale and

(1tk,𝐬tλi=1tri2+m)\displaystyle\quad\mathbb{P}\left(\exists 1\leq t\leq k,\|\mathbf{s}_{t}\|\geq\lambda\sum\limits_{i=1}\limits^{t}r_{i}^{2}+m\right)
=(𝐬Tmλi=1Tmri2+m,1Tmk)\displaystyle=\mathbb{P}\left(\|\mathbf{s}_{T_{m}}\|\geq\lambda\sum\limits_{i=1}\limits^{T_{m}}r_{i}^{2}+m,1\leq T_{m}\leq k\right)
=(𝐬Tmkλi=1Tmkri2+m,1Tmk)\displaystyle=\mathbb{P}\left(\|\mathbf{s}_{T_{m}\wedge k}\|\geq\lambda\sum\limits_{i=1}\limits^{T_{m}\wedge k}r_{i}^{2}+m,1\leq T_{m}\leq k\right)
(GTmkexp(λ2i=1Tmkri2)cosh(λ2i=1Tmkri2+mλ))\displaystyle\leq\mathbb{P}\left(G_{T_{m}\wedge k}\geq\text{exp}\Big{(}-\lambda^{2}\sum\limits_{i=1}\limits^{{T_{m}\wedge k}}r_{i}^{2}\Big{)}cosh\Big{(}\lambda^{2}\sum\limits_{i=1}\limits^{T_{m}\wedge k}r_{i}^{2}+m\lambda\Big{)}\right)
(GTmk12exp(λ2i=1Tmkri2+(λ2i=1Tmkri2+mλ)))\displaystyle\leq\mathbb{P}\Bigg{(}G_{T_{m}\wedge k}\geq\frac{1}{2}\text{exp}\Big{(}-\lambda^{2}\sum\limits_{i=1}\limits^{{T_{m}\wedge k}}r_{i}^{2}+\big{(}\lambda^{2}\sum\limits_{i=1}\limits^{{T_{m}\wedge k}}r_{i}^{2}+m\lambda\big{)}\Big{)}\Bigg{)}
=(2GTmkeλm)\displaystyle=\mathbb{P}\left(2G_{T_{m}\wedge k}\geq e^{\lambda m}\right)
2𝔼GTmkeλm\displaystyle\leq\frac{2\mathbb{E}G_{T_{m}\wedge k}}{e^{\lambda m}}
2eλm𝔼G0\displaystyle\leq 2e^{-\lambda m}\mathbb{E}G_{0}
=2eλm,\displaystyle=2e^{-\lambda m},

where the 4th step is based on the fact that cosh(y)12eycosh(y)\geq\frac{1}{2}e^{y} for all yy\in\mathbb{R}, the 6th step is by Markov’s inequality, and the 7th step is based on the supermartingale property via optional stopping, which gives 𝔼GTmk𝔼G0=1\mathbb{E}G_{T_{m}\wedge k}\leq\mathbb{E}G_{0}=1.

Therefore, if we let λm=log2δ\lambda m=\log\frac{2}{\delta},

(1tk,𝐬tλi=1tri2+1λlog2δ)δ.\mathbb{P}\left(\exists 1\leq t\leq k,\|\mathbf{s}_{t}\|\geq\lambda\sum\limits_{i=1}\limits^{t}r_{i}^{2}+\frac{1}{\lambda}\log\frac{2}{\delta}\right)\leq\delta.

Since kk can be up to KK,

(1tK,𝐬tλi=1tri2+1λlog2δ)δ.\mathbb{P}\left(\exists 1\leq t\leq K,\|\mathbf{s}_{t}\|\geq\lambda\sum\limits_{i=1}\limits^{t}r_{i}^{2}+\frac{1}{\lambda}\log\frac{2}{\delta}\right)\leq\delta.

The final conclusion can be obtained immediately by following similar steps given in the proof of Corollary 8 from [15]. ∎
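As an illustrative Monte-Carlo check of the maximal bound just obtained (not part of the analysis; the martingale differences below are independent vectors drawn uniformly from spheres of fixed radii r_i, which is one admissible bounded case, and all constants are arbitrary):

import numpy as np

rng = np.random.default_rng(1)
d, K, delta, lam = 10, 50, 0.05, 0.5
r = rng.uniform(0.5, 1.5, size=K)            # fixed per-step norm bounds r_i

def path_exceeds():
    s, cum_r2 = np.zeros(d), 0.0
    for i in range(K):
        z = rng.normal(size=d)
        z *= r[i] / np.linalg.norm(z)        # mean-zero increment with ||z_i|| = r_i
        s += z
        cum_r2 += r[i] ** 2
        if np.linalg.norm(s) >= lam * cum_r2 + np.log(2.0 / delta) / lam:
            return True
    return False

freq = float(np.mean([path_exceeds() for _ in range(5000)]))
print(f"empirical exceedance frequency {freq:.4f} (bound: delta = {delta})")

For any fixed λ>0\lambda>0 the observed frequency should stay below δ\delta.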

Proof of Proposition D.1.

Recall that

ϵ0(j)=𝝂0(j)f(𝐱0(j))=1Bjijfi(𝐱0(j))f(𝐱0(j)),\boldsymbol{\epsilon}_{0}^{(j)}=\boldsymbol{\nu}_{0}^{(j)}-\nabla f\left(\mathbf{x}_{0}^{(j)}\right)=\frac{1}{B_{j}}\sum\limits_{i\in\mathcal{I}_{j}}\nabla f_{i}\left(\mathbf{x}_{0}^{(j)}\right)-\nabla f\left(\mathbf{x}_{0}^{(j)}\right),

where j\mathcal{I}_{j} is sampled without replacement. Since fi(𝐱0(j))αM\Big{\|}\nabla f_{i}\big{(}\mathbf{x}_{0}^{(j)}\big{)}\Big{\|}\leq\alpha_{M}, i=1,2,,ni=1,2,\ldots,n, based on Proposition E.1,

(ϵ0(j)t|j,1)3exp(Bjt264αM2)𝟏{Bj<n}.\mathbb{P}\left(\|\boldsymbol{\epsilon}_{0}^{(j)}\|\geq t|\mathcal{F}_{j,-1}\right)\leq 3\text{exp}\left(-\frac{B_{j}t^{2}}{64\alpha_{M}^{2}}\right)\mathbf{1}\left\{B_{j}<n\right\}. (22)

Next, if we suppose m(j)={im,1(j),im,2(j),,im,bj(j)}\mathcal{I}_{m}^{(j)}=\left\{i_{m,1}^{(j)},i_{m,2}^{(j)},\ldots,i_{m,b_{j}}^{(j)}\right\}, where im,t1(j)im,t2(j)i_{m,t_{1}}^{(j)}\neq i_{m,t_{2}}^{(j)} for any 1t1<t2bj1\leq t_{1}<t_{2}\leq b_{j}, we have

ϵm(j)\displaystyle\quad\boldsymbol{\epsilon}_{m}^{(j)}
=1bjim(j)[fi(𝐱m(j))f(𝐱m(j))+f(𝐱m1(j))fi(𝐱m1(j))]\displaystyle=\frac{1}{b_{j}}\sum\limits_{i\in\mathcal{I}_{m}^{(j)}}\Biggl{[}\nabla f_{i}(\mathbf{x}_{m}^{(j)})-\nabla f(\mathbf{x}_{m}^{(j)})+\nabla f(\mathbf{x}_{m-1}^{(j)})-\nabla f_{i}(\mathbf{x}_{m-1}^{(j)})\Biggr{]}
=r=1bj1bj[fim,r(j)(𝐱m(j))f(𝐱m(j))+f(𝐱m1(j))fim,r(j)(𝐱m1(j))]\displaystyle=\sum\limits_{r=1}\limits^{b_{j}}\frac{1}{b_{j}}\Biggl{[}\nabla f_{i_{m,r}^{(j)}}(\mathbf{x}_{m}^{(j)})-\nabla f(\mathbf{x}_{m}^{(j)})+\nabla f(\mathbf{x}_{m-1}^{(j)})-\nabla f_{i_{m,r}^{(j)}}(\mathbf{x}_{m-1}^{(j)})\Biggr{]}
r=1bj𝝆(m1)bj+r(j).\displaystyle\triangleq\sum\limits_{r=1}\limits^{b_{j}}\boldsymbol{\rho}_{(m-1)b_{j}+r}^{(j)}.

Let

~0(j)=j,0\tilde{\mathcal{F}}_{0}^{(j)}=\mathcal{F}_{j,0}

and

~a1bj+a2(j)=σ(~a1bj+a21(j)σ(ia1+1,a2(j)))\tilde{\mathcal{F}}_{a_{1}b_{j}+a_{2}}^{(j)}=\sigma\left(\tilde{\mathcal{F}}_{a_{1}b_{j}+a_{2}-1}^{(j)}\bigcup\sigma\big{(}i_{a_{1}+1,a_{2}}^{(j)}\big{)}\right)

for a1=0,1,2,a_{1}=0,1,2,\ldots and a2=1,2,,bja_{2}=1,2,\ldots,b_{j}. Then, we can see that {𝝆s(j)}s=1kbj\big{\{}\boldsymbol{\rho}_{s}^{(j)}\big{\}}_{s=1}^{kb_{j}} is a martingale difference sequence adapted to {~s(j)}s=0kbj\big{\{}\tilde{\mathcal{F}}_{s}^{(j)}\big{\}}_{s=0}^{kb_{j}}.

Notice that for m=1,2,,km=1,2,\ldots,k and r=1,2,,bjr=1,2,\ldots,b_{j},

𝝆(m1)bj+r(j)\displaystyle\quad\big{\|}\boldsymbol{\rho}_{(m-1)b_{j}+r}^{(j)}\big{\|}
=1bj[fim,r(j)(𝐱m(j))f(𝐱m(j))+f(𝐱m1(j))fim,r(j)(𝐱m1(j))]\displaystyle=\left\|\frac{1}{b_{j}}\big{[}\nabla f_{i_{m,r}^{(j)}}(\mathbf{x}_{m}^{(j)})-\nabla f(\mathbf{x}_{m}^{(j)})+\nabla f(\mathbf{x}_{m-1}^{(j)})-\nabla f_{i_{m,r}^{(j)}}(\mathbf{x}_{m-1}^{(j)})\big{]}\right\|
2Lbj𝐱m(j)𝐱m1(j).\displaystyle\leq\frac{2L}{b_{j}}\left\|\mathbf{x}_{m}^{(j)}-\mathbf{x}_{m-1}^{(j)}\right\|.

Therefore, based on Theorem 3.2, for any fixed δ>0\delta^{\prime}>0, B>b>0B>b>0, with probability at least 1δ1-\delta^{\prime}, either

(σk(j))2\displaystyle\left(\sigma_{k}^{(j)}\right)^{2} m=1kr=1bj(2Lbj𝐱m(j)𝐱m1(j))2\triangleq\sum\limits_{m=1}\limits^{k}\sum\limits_{r=1}\limits^{b_{j}}\left(\frac{2L}{b_{j}}\left\|\mathbf{x}_{m}^{(j)}-\mathbf{x}_{m-1}^{(j)}\right\|\right)^{2}
=4L2bjm=1k𝐱m(j)𝐱m1(j)2\displaystyle=\frac{4L^{2}}{b_{j}}\sum\limits_{m=1}\limits^{k}\left\|\mathbf{x}_{m}^{(j)}-\mathbf{x}_{m-1}^{(j)}\right\|^{2}
B,\displaystyle\geq B,

or

𝝂k(j)f(𝐱k(j))ϵ0(j)2\displaystyle\quad\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla f\left(\mathbf{x}_{k}^{(j)}\right)-\boldsymbol{\epsilon}_{0}^{(j)}\right\|^{2}
=s=1kbj𝝆s(j)2\displaystyle=\Big{\|}\sum\limits_{s=1}\limits^{kb_{j}}\boldsymbol{\rho}_{s}^{(j)}\Big{\|}^{2}
9max{(σk(j))2,b}(log2δ+loglogBb).\displaystyle\leq 9\max\left\{\big{(}\sigma_{k}^{(j)}\big{)}^{2},b\right\}\left(\log\frac{2}{\delta^{\prime}}+\log\log\frac{B}{b}\right).

Under the compact constraint,

(σk(j))24L2d12kbj.\big{(}\sigma_{k}^{(j)}\big{)}^{2}\leq\frac{4L^{2}d_{1}^{2}k}{b_{j}}.

Thus, if we let B=8L2d12kbjB=\frac{8L^{2}d_{1}^{2}k}{b_{j}} and b=4L2τkbjb=\frac{4L^{2}\tau k}{b_{j}} for some τ(0,1)\tau\in(0,1), the event (σk(j))2B\big{(}\sigma_{k}^{(j)}\big{)}^{2}\geq B has probability 0. Hence, with probability at least 1δ1-\delta^{\prime},

𝝂k(j)f(𝐱k(j))ϵ0(j)2\displaystyle\quad\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla f\left(\mathbf{x}_{k}^{(j)}\right)-\boldsymbol{\epsilon}_{0}^{(j)}\right\|^{2}
9((σk(j))2+4L2τkbj)(log2δ+loglog2d12τ)\displaystyle\leq 9\left(\big{(}\sigma_{k}^{(j)}\big{)}^{2}+\frac{4L^{2}\tau k}{b_{j}}\right)\left(\log\frac{2}{\delta^{\prime}}+\log\log\frac{2d_{1}^{2}}{\tau}\right) (23)
9((σ~k(j))2+4L2τkbj)(log2δ+loglog2d12τ).\displaystyle\leq 9\left(\big{(}\tilde{\sigma}_{k}^{(j)}\big{)}^{2}+\frac{4L^{2}\tau k}{b_{j}}\right)\left(\log\frac{2}{\delta^{\prime}}+\log\log\frac{2d_{1}^{2}}{\tau}\right).

According to (22), with probability at least 1δ1-\delta^{\prime},

ϵ0(j)264αM2Bjlog3δ𝟏{Bj<n}.\big{\|}\boldsymbol{\epsilon}_{0}^{(j)}\big{\|}^{2}\leq\frac{64\alpha_{M}^{2}}{B_{j}}\log\frac{3}{\delta^{\prime}}\mathbf{1}\left\{B_{j}<n\right\}. (24)

Thus, combining (23) and (24), with probability at least 12δ1-2\delta^{\prime},

𝝂k(j)f(𝐱k(j))2\displaystyle\quad\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla f\left(\mathbf{x}_{k}^{(j)}\right)\right\|^{2}
18((σ~k(j))2+4L2τkbj)(log2δ+loglog2d12τ)\displaystyle\leq 18\left(\big{(}\tilde{\sigma}_{k}^{(j)}\big{)}^{2}+\frac{4L^{2}\tau k}{b_{j}}\right)\left(\log\frac{2}{\delta^{\prime}}+\log\log\frac{2d_{1}^{2}}{\tau}\right)
+128αM2Bjlog3δ𝟏{Bj<n}.\displaystyle\quad+\frac{128\alpha_{M}^{2}}{B_{j}}\log\frac{3}{\delta^{\prime}}\mathbf{1}\left\{B_{j}<n\right\}.

Proposition F.1 (Inner Loop Analysis).

Given Assumptions 3.2, 3.3 and 3.4, under the parameter setting given in Theorem 3.1, let Ω=j=1Ωj\Omega=\bigcup\limits_{j=1}\limits^{\infty}\Omega_{j}, where the definition of Ωj\Omega_{j} is given in (12). On Ω\Omega,

f(𝐱Kj(j))f(𝐱0(j))\displaystyle\quad f\left(\mathbf{x}_{K_{j}}^{(j)}\right)-f\left(\mathbf{x}_{0}^{(j)}\right)
116Lk=0Kj1𝝂k(j)2+L2ηjτjljKj2bj+ηjKjqj2,\displaystyle\leq-\frac{1}{16L}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2},

for all j+j\in\mathbb{Z}_{+}. The event Ω\Omega occurs with probability at least 1δ1-\delta.

Proof of Proposition F.1.

First, as shown in Section 3, Ω=j=1Ωj\Omega=\bigcup\limits_{j=1}\limits^{\infty}\Omega_{j} occurs with probability at least 1δ1-\delta.

Based on (17), on Ω\Omega,

f(𝐱Kj(j))f(𝐱0(j))\displaystyle\quad f\left(\mathbf{x}_{K_{j}}^{(j)}\right)-f\left(\mathbf{x}_{0}^{(j)}\right)
ηj2k=0Kj1[lj((σ~k(j))2+4L2τjkbj)+qj]\displaystyle\leq\frac{\eta_{j}}{2}\sum\limits_{k=0}\limits^{K_{j}-1}\left[l_{j}\left(\left(\tilde{\sigma}_{k}^{(j)}\right)^{2}+\frac{4L^{2}\tau_{j}k}{b_{j}}\right)+q_{j}\right]
ηj2(1Lηj)k=0Kj1𝝂k(j)2\displaystyle\quad-\frac{\eta_{j}}{2}(1-L\eta_{j})\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}
ηj2k=0Kj1(4L2ηj2ljbjm=1k𝝂m1(j)2)+2L2ηjτjljbjKj22\displaystyle\leq\frac{\eta_{j}}{2}\sum\limits_{k=0}\limits^{K_{j}-1}\left(\frac{4L^{2}\eta_{j}^{2}l_{j}}{b_{j}}\sum\limits_{m=1}\limits^{k}\left\|\boldsymbol{\nu}_{m-1}^{(j)}\right\|^{2}\right)+\frac{2L^{2}\eta_{j}\tau_{j}l_{j}}{b_{j}}\frac{K_{j}^{2}}{2}
+ηjKjqj2ηj2(1Lηj)k=0Kj1𝝂k(j)2\displaystyle\quad+\frac{\eta_{j}K_{j}q_{j}}{2}-\frac{\eta_{j}}{2}(1-L\eta_{j})\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}
2L2ηj3ljKjbjk=0Kj1𝝂k(j)2+L2ηjτjljKj2bj+ηjKjqj2\displaystyle\leq\frac{2L^{2}\eta_{j}^{3}l_{j}K_{j}}{b_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2}
ηj2(1Lηj)k=0Kj1𝝂k(j)2\displaystyle\quad-\frac{\eta_{j}}{2}(1-L\eta_{j})\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}
=ηj2(1Lηj4L2ηj2ljKjbj)k=0Kj1𝝂k(j)2\displaystyle=-\frac{\eta_{j}}{2}\left(1-L\eta_{j}-\frac{4L^{2}\eta_{j}^{2}l_{j}K_{j}}{b_{j}}\right)\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}
+L2ηjτjljKj2bj+ηjKjqj2\displaystyle\quad+\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2}
=18L(11414)k=0Kj1𝝂k(j)2+L2ηjτjljKj2bj+ηjKjqj2\displaystyle=-\frac{1}{8L}\left(1-\frac{1}{4}-\frac{1}{4}\right)\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2}
=116Lk=0Kj1𝝂k(j)2+L2ηjτjljKj2bj+ηjKjqj2,\displaystyle=-\frac{1}{16L}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2},

where the 5th step is based on our choices of ηj=14L\eta_{j}=\frac{1}{4L} and bj=ljKjb_{j}=l_{j}K_{j}, j=1,2,j=1,2,\ldots. ∎

Proof of Proposition D.3.

Firstly,

Δff(𝐱~2T)f(𝐱~T)=f(𝐱K2T(2T))f(𝐱0(T+1))\displaystyle\quad-\Delta_{f}\leq f\left(\tilde{\mathbf{x}}_{{2T}}\right)-f\left(\tilde{\mathbf{x}}_{{T}}\right)=f\left(\mathbf{x}_{K_{2T}}^{(2T)}\right)-f\left(\mathbf{x}_{0}^{(T+1)}\right)
j=T+12T[116Lk=0Kj1𝝂k(j)2+L2ηjτjljKj2bj+ηjKjqj2]\displaystyle\leq\sum\limits_{j=T+1}\limits^{2T}\left[\frac{-1}{16L}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2}\right] (25)
=j=T+12T[L2ηjτjljKj2bj+ηjKjqj2]116Lj=T+12Tk=0Kj1𝝂k(j)2,\displaystyle=\sum\limits_{j=T+1}\limits^{2T}\left[\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2}\right]-\frac{1}{16L}\sum\limits_{j=T+1}\limits^{2T}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2},

where the 3rd step is based on Proposition F.1.

For simplifying notations, we denote

ATj=T+12T[L2ηjτjljKj2bj+ηjKjqj2].A_{T}\triangleq\sum\limits_{j=T+1}\limits^{2T}\left[\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2}\right].

Then,

AT\displaystyle\quad A_{T}
=j=T+12T(LτjKj4+Kjqj8L)\displaystyle=\sum\limits_{j=T+1}\limits^{2T}\left(\frac{L\tau_{j}K_{j}}{4}+\frac{K_{j}q_{j}}{8L}\right)
=j=T+12T(Lj2n4j3+16αM2Lj2nlog12Cej4δ𝟏{j2<n})\displaystyle=\sum\limits_{j=T+1}\limits^{2T}\left(\frac{L\sqrt{j^{2}\wedge n}}{4j^{3}}+\frac{16\alpha_{M}^{2}}{L\sqrt{j^{2}\wedge n}}\log\frac{12C_{e}j^{4}}{\delta}\mathbf{1}\left\{j^{2}<n\right\}\right)
j=T+12T(L4j2+16αM2Lj2nlog12Cej4δ𝟏{j2<n})\displaystyle\leq\sum\limits_{j=T+1}\limits^{2T}\left(\frac{L}{4j^{2}}+\frac{16\alpha_{M}^{2}}{L\sqrt{j^{2}\wedge n}}\log\frac{12C_{e}j^{4}}{\delta}\mathbf{1}\left\{j^{2}<n\right\}\right)
j=T+12T(L4j2+16αM2Ljlog12Cej4δ)\displaystyle\leq\sum\limits_{j=T+1}\limits^{2T}\left(\frac{L}{4j^{2}}+\frac{16\alpha_{M}^{2}}{Lj}\log\frac{12C_{e}j^{4}}{\delta}\right) (26)
CeL4+j=T+12T16αM2Ljlog12Cej4δ\displaystyle\leq\frac{C_{e}L}{4}+\sum\limits_{j=T+1}\limits^{2T}\frac{16\alpha_{M}^{2}}{Lj}\log\frac{12C_{e}j^{4}}{\delta}
CeL4+j=T+12T16αM2LTlog12Ce(2T)4δ\displaystyle\leq\frac{C_{e}L}{4}+\sum\limits_{j=T+1}\limits^{2T}\frac{16\alpha_{M}^{2}}{LT}\log\frac{12C_{e}(2T)^{4}}{\delta}
=CeL4+16αM2Llog192CeT4δ\displaystyle=\frac{C_{e}L}{4}+\frac{16\alpha_{M}^{2}}{L}\log\frac{192C_{e}T^{4}}{\delta}
=CeL4+16αM2Llog192Ceδ+64αM2LlogT,\displaystyle=\frac{C_{e}L}{4}+\frac{16\alpha_{M}^{2}}{L}\log\frac{192C_{e}}{\delta}+\frac{64\alpha_{M}^{2}}{L}\log T,

where the 1st step is based on the choices ηj=14L\eta_{j}=\frac{1}{4L} and bj=ljKjb_{j}=l_{j}K_{j}, and the 2nd step is based on the choices Kj=Bj=j2nK_{j}=\sqrt{B_{j}}=\sqrt{j^{2}\wedge n} and δj=δ4Cej4\delta^{\prime}_{j}=\frac{\delta}{4C_{e}j^{4}}. According to Lemma E.5, as TT1T\geq T_{1},

1T2(nT)ε2320L(c1+Δf).\frac{1}{T^{2}\wedge(\sqrt{n}T)}\leq\frac{\varepsilon^{2}}{320L(c_{1}+\Delta_{f})}. (27)

According to Lemma E.6, as TT2T\geq T_{2},

logTT2(nT)ε2320Lc2.\frac{\log T}{T^{2}\wedge(\sqrt{n}T)}\leq\frac{\varepsilon^{2}}{320Lc_{2}}. (28)

If we suppose to the contrary that

1Kjk=0Kj1𝝂k(j)2>ε~2\frac{1}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}>\tilde{\varepsilon}^{2}

holds for all T+1j2TT+1\leq j\leq 2T, then we have

116Lj=T+12Tk=0Kj1𝝂k(j)2ε~216Lj=T+12TKjε~216Lj=T+12T(Tn)=ε~216LT2(nT).\displaystyle\quad\frac{1}{16L}\sum\limits_{j=T+1}\limits^{2T}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\geq\frac{\tilde{\varepsilon}^{2}}{16L}\sum\limits_{j=T+1}\limits^{2T}K_{j}\geq\frac{\tilde{\varepsilon}^{2}}{16L}\sum\limits_{j=T+1}\limits^{2T}\left(T\wedge\sqrt{n}\right)=\frac{\tilde{\varepsilon}^{2}}{16L}T^{2}\wedge(\sqrt{n}T).

By (26), (27), (28) and the above results,

80LT2(nT){Δf+AT116Lj=T+12Tk=0Kj1𝝂k(j)2}\displaystyle\quad\frac{80L}{T^{2}\wedge(\sqrt{n}T)}\left\{\Delta_{f}+A_{T}-\frac{1}{16L}\sum\limits_{j=T+1}\limits^{2T}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\right\}
80LT2(nT)(Δf+AT)5ε~2\displaystyle\leq\frac{80L}{T^{2}\wedge(\sqrt{n}T)}(\Delta_{f}+A_{T})-5\tilde{\varepsilon}^{2}
=80LT2(nT)(Δf+AT)ε2\displaystyle=\frac{80L}{T^{2}\wedge(\sqrt{n}T)}(\Delta_{f}+A_{T})-\varepsilon^{2}
80LT2(nT)(Δf+c1)+80Lc2logTT2(nT)ε2\displaystyle\leq\frac{80L}{T^{2}\wedge(\sqrt{n}T)}(\Delta_{f}+c_{1})+\frac{80Lc_{2}\log T}{T^{2}\wedge(\sqrt{n}T)}-\varepsilon^{2}
ε24+ε24ε2\displaystyle\leq\frac{\varepsilon^{2}}{4}+\frac{\varepsilon^{2}}{4}-\varepsilon^{2}
=ε22,\displaystyle=-\frac{\varepsilon^{2}}{2},

which contradicts (25). ∎

Proof of Proposition D.4.
εT\displaystyle\quad\varepsilon_{T}
=8L2τT+2qT\displaystyle=8L^{2}\tau_{T}+2q_{T}
=8L2T3+256αM2BTlog3δT𝟏{BT<n}\displaystyle=\frac{8L^{2}}{T^{3}}+\frac{256\alpha_{M}^{2}}{B_{T}}\log\frac{3}{\delta^{\prime}_{T}}\mathbf{1}\left\{B_{T}<n\right\}
8L2T3+256αM2T2log3δT\displaystyle\leq\frac{8L^{2}}{T^{3}}+\frac{256\alpha_{M}^{2}}{T^{2}}\log\frac{3}{\delta^{\prime}_{T}} (29)
=8L2T3+256αM2T2log12CeT4δ\displaystyle=\frac{8L^{2}}{T^{3}}+\frac{256\alpha_{M}^{2}}{T^{2}}\log\frac{12C_{e}T^{4}}{\delta}
(8L2+256αM2log12Ceδ)1T2+1024αM2T2logT\displaystyle\leq\left(8L^{2}+256\alpha_{M}^{2}\log\frac{12C_{e}}{\delta}\right)\frac{1}{T^{2}}+\frac{1024\alpha_{M}^{2}}{T^{2}}\log T
=c31T2+c4logTT2,\displaystyle=c_{3}\frac{1}{T^{2}}+c_{4}\frac{\log T}{T^{2}}, (30)

where the 2nd step is based on our choice of τT=1T3\tau_{T}=\frac{1}{T^{3}} and the 4th step is based on the choice of δT=δ4CeT4\delta^{\prime}_{T}=\frac{\delta}{4C_{e}T^{4}}. According to Lemma E.5, where we can simply let n=n=\infty, as TT3T\geq T_{3},

1T2ε24c3.\frac{1}{T^{2}}\leq\frac{\varepsilon^{2}}{4c_{3}}. (31)

Similarly, according to Lemma E.6 and Assumption 3.4, as TT4T\geq T_{4},

logTT2ε24c4.\frac{\log T}{T^{2}}\leq\frac{\varepsilon^{2}}{4c_{4}}. (32)

Combining (30), (31) and (32),

εTε22.\varepsilon_{T}\leq\frac{\varepsilon^{2}}{2}.

Proof of Corollary C.1.

This part follows a similar path to the complexity analysis in [11]. If Algorithm 1 stops within TT outer iterations, the first-order computational complexity is

𝒪~L,Δf,αM(T3(nT)).\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(T^{3}\wedge(nT)\right).

Thus, it is sufficient to show

Ti3(nTi)=𝒪~L,Δf,αM(1ε3nε2),i=1,2,3,4.T_{i}^{3}\wedge(nT_{i})=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon^{3}}\wedge\frac{\sqrt{n}}{\varepsilon^{2}}\right),\ i=1,2,3,4.

T13(nT1)\bullet\ T_{1}^{3}\wedge(nT_{1})

For simplicity, we let c~1=320L(c1+Δf)\tilde{c}_{1}=\sqrt{320L(c_{1}+\Delta_{f})} and consequently T1=c~1ε+c~12nε2T_{1}=\Big{\lceil}\frac{\tilde{c}_{1}}{\varepsilon}+\frac{\tilde{c}_{1}^{2}}{\sqrt{n}\varepsilon^{2}}\Big{\rceil}.

When nεc~1\sqrt{n}\varepsilon\leq\tilde{c}_{1}, c~1εc~12nε2\frac{\tilde{c}_{1}}{\varepsilon}\leq\frac{\tilde{c}_{1}^{2}}{\sqrt{n}\varepsilon^{2}} and consequently T1=𝒪(c~12nε2)T_{1}=\mathcal{O}\left(\frac{\tilde{c}_{1}^{2}}{\sqrt{n}\varepsilon^{2}}\right). Hence,

T13(nT1)=𝒪(nT1)=𝒪(nc~12ε2)=𝒪(nc~12ε2c~13ε3),T_{1}^{3}\wedge(nT_{1})=\mathcal{O}(nT_{1})=\mathcal{O}\left(\frac{\sqrt{n}\tilde{c}_{1}^{2}}{\varepsilon^{2}}\right)=\mathcal{O}\left(\frac{\sqrt{n}\tilde{c}_{1}^{2}}{\varepsilon^{2}}\wedge\frac{\tilde{c}_{1}^{3}}{\varepsilon^{3}}\right),

where the last step is due to nc~1ε\sqrt{n}\leq\frac{\tilde{c}_{1}}{\varepsilon}.

When nεc~1\sqrt{n}\varepsilon\geq\tilde{c}_{1}, c~1εc~12nε2\frac{\tilde{c}_{1}}{\varepsilon}\geq\frac{\tilde{c}_{1}^{2}}{\sqrt{n}\varepsilon^{2}} and consequently T1=𝒪(c~1ε)T_{1}=\mathcal{O}\left(\frac{\tilde{c}_{1}}{\varepsilon}\right). Hence,

T13(nT1)=𝒪(T13)=𝒪(c~13ε3)=𝒪(nc~12ε2c~13ε3),T_{1}^{3}\wedge(nT_{1})=\mathcal{O}(T_{1}^{3})=\mathcal{O}\left(\frac{\tilde{c}_{1}^{3}}{\varepsilon^{3}}\right)=\mathcal{O}\left(\frac{\sqrt{n}\tilde{c}_{1}^{2}}{\varepsilon^{2}}\wedge\frac{\tilde{c}_{1}^{3}}{\varepsilon^{3}}\right),

where the last step is due to c~1nε\tilde{c}_{1}\leq\sqrt{n}\varepsilon.

To sum up,

T13(nT1)=𝒪(nc~12ε2c~13ε3)=𝒪~L,Δf,αM(1ε3nε2).T_{1}^{3}\wedge(nT_{1})=\mathcal{O}\left(\frac{\sqrt{n}\tilde{c}_{1}^{2}}{\varepsilon^{2}}\wedge\frac{\tilde{c}_{1}^{3}}{\varepsilon^{3}}\right)=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon^{3}}\wedge\frac{\sqrt{n}}{\varepsilon^{2}}\right).

T23(nT2)\bullet\ T_{2}^{3}\wedge(nT_{2})

Secondly, if we let c~2=320Lc2\tilde{c}_{2}=\sqrt{320Lc_{2}}, we have c~24\tilde{c}_{2}\geq 4 based on Assumption 3.4. As a result, T2=Θ(3c~2εlog3c~2ε+2c~22nε2logc~22ε2)T_{2}=\Theta\left(\frac{3\tilde{c}_{2}}{\varepsilon}\log\frac{3\tilde{c}_{2}}{\varepsilon}+\frac{2\tilde{c}_{2}^{2}}{\sqrt{n}\varepsilon^{2}}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}\right). Therefore, it is equivalent to study T¯23(nT¯2)\bar{T}_{2}^{3}\wedge(n\bar{T}_{2}) where T¯2=3c~2εlog3c~2ε+2c~22nε2logc~22ε2\bar{T}_{2}=\frac{3\tilde{c}_{2}}{\varepsilon}\log\frac{3\tilde{c}_{2}}{\varepsilon}+\frac{2\tilde{c}_{2}^{2}}{\sqrt{n}\varepsilon^{2}}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}.

When c~21.5nε\tilde{c}_{2}\geq 1.5\sqrt{n}\varepsilon,

3c~2εlog3c~2ε3c~2εlogc~22ε22c~22nε2logc~22ε2.\frac{3\tilde{c}_{2}}{\varepsilon}\log\frac{3\tilde{c}_{2}}{\varepsilon}\leq\frac{3\tilde{c}_{2}}{\varepsilon}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}\leq\frac{2\tilde{c}_{2}^{2}}{\sqrt{n}\varepsilon^{2}}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}.

Thus, T¯2=𝒪(2c~22nε2logc~22ε2)\bar{T}_{2}=\mathcal{O}\left(\frac{2\tilde{c}_{2}^{2}}{\sqrt{n}\varepsilon^{2}}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}\right). Then,

T¯23(nT¯2)=𝒪(nT¯2)\displaystyle\quad\bar{T}_{2}^{3}\wedge(n\bar{T}_{2})=\mathcal{O}\left(n\bar{T}_{2}\right)
=𝒪(2nc~22ε2logc~22ε2)=𝒪(2nc~22ε2logc~22ε24c~233ε3logc~22ε2).\displaystyle=\mathcal{O}\left(\frac{2\sqrt{n}\tilde{c}_{2}^{2}}{\varepsilon^{2}}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}\right)=\mathcal{O}\left(\frac{2\sqrt{n}\tilde{c}_{2}^{2}}{\varepsilon^{2}}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}\wedge\frac{4\tilde{c}_{2}^{3}}{3\varepsilon^{3}}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}\right).

When c~21.5nε\tilde{c}_{2}\leq 1.5\sqrt{n}\varepsilon,

2c~22nε2logc~22ε2=4c~22nε2logc~2ε4c~22nε2log3c~2ε\displaystyle\quad\frac{2\tilde{c}_{2}^{2}}{\sqrt{n}\varepsilon^{2}}\log\frac{\tilde{c}_{2}^{2}}{\varepsilon^{2}}=\frac{4\tilde{c}_{2}^{2}}{\sqrt{n}\varepsilon^{2}}\log\frac{\tilde{c}_{2}}{\varepsilon}\leq\frac{4\tilde{c}_{2}^{2}}{\sqrt{n}\varepsilon^{2}}\log\frac{3\tilde{c}_{2}}{\varepsilon}
1.5εc~24c~22ε2log3c~2ε=6c~2εlog3c~2ε.\displaystyle\leq\frac{1.5\varepsilon}{\tilde{c}_{2}}\cdot\frac{4\tilde{c}_{2}^{2}}{\varepsilon^{2}}\log\frac{3\tilde{c}_{2}}{\varepsilon}=\frac{6\tilde{c}_{2}}{\varepsilon}\log\frac{3\tilde{c}_{2}}{\varepsilon}.

Thus, T¯2=𝒪(c~2εlog3c~2ε)\bar{T}_{2}=\mathcal{O}\left(\frac{\tilde{c}_{2}}{\varepsilon}\log\frac{3\tilde{c}_{2}}{\varepsilon}\right). Then

T¯23(nT¯2)=𝒪(T¯23)=𝒪(c~23ε3(log3c~2ε)3)\displaystyle\quad\bar{T}_{2}^{3}\wedge(n\bar{T}_{2})=\mathcal{O}\left(\bar{T}_{2}^{3}\right)=\mathcal{O}\left(\frac{\tilde{c}_{2}^{3}}{\varepsilon^{3}}\left(\log\frac{3\tilde{c}_{2}}{\varepsilon}\right)^{3}\right)
=𝒪(c~23ε3(log3c~2ε)31.5nc~22ε2(log3c~2ε)3).\displaystyle=\mathcal{O}\left(\frac{\tilde{c}_{2}^{3}}{\varepsilon^{3}}\left(\log\frac{3\tilde{c}_{2}}{\varepsilon}\right)^{3}\wedge\frac{1.5\sqrt{n}\tilde{c}_{2}^{2}}{\varepsilon^{2}}\left(\log\frac{3\tilde{c}_{2}}{\varepsilon}\right)^{3}\right).

To sum up,

T23(nT2)=𝒪~L,Δf,αM(1ε3nε2).{T}_{2}^{3}\wedge(n{T}_{2})=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon^{3}}\wedge\frac{\sqrt{n}}{\varepsilon^{2}}\right).

T33(nT3)\bullet\ T_{3}^{3}\wedge(nT_{3})

Since T3=Θ~L,Δf,αM(1ε)T_{3}=\tilde{\Theta}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon}\right), we can directly know that

T33(nT3)=𝒪~L,Δf,αM(1ε3nε)=𝒪~L,Δf,αM(1ε3nε2).{T}_{3}^{3}\wedge(n{T}_{3})=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon^{3}}\wedge\frac{{n}}{\varepsilon}\right)=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon^{3}}\wedge\frac{\sqrt{n}}{\varepsilon^{2}}\right).

T43(nT4)\bullet\ T_{4}^{3}\wedge(nT_{4})

Similar to the previous case,

T43(nT4)=𝒪~L,Δf,αM(1ε3nε2).T_{4}^{3}\wedge(nT_{4})=\tilde{\mathcal{O}}_{L,\Delta_{f},\alpha_{M}}\left(\frac{1}{\varepsilon^{3}}\wedge\frac{\sqrt{n}}{\varepsilon^{2}}\right).
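The two regimes in the case analysis above can also be checked numerically (illustrative only; the stand-in value of \tilde{c}_1 and the (ε, n) grid are arbitrary):

import math

c1_tilde = 10.0                              # stands in for \tilde{c}_1 = \sqrt{320L(c_1+\Delta_f)}
for eps, n in [(1e-2, 1e4), (1e-3, 1e4), (1e-2, 1e8), (1e-4, 1e2)]:
    T1 = math.ceil(c1_tilde / eps + c1_tilde ** 2 / (math.sqrt(n) * eps ** 2))
    cost = min(T1 ** 3, n * T1)              # T_1^3 wedge (n T_1)
    target = min(c1_tilde ** 3 / eps ** 3, math.sqrt(n) * c1_tilde ** 2 / eps ** 2)
    print(f"eps={eps:.0e}, n={n:.0e}: cost / target = {cost / target:.2f}")

In both regimes the ratio stays bounded by a modest constant, consistent with T_{1}^{3}\wedge(nT_{1})=\mathcal{O}\big{(}\frac{\tilde{c}_{1}^{3}}{\varepsilon^{3}}\wedge\frac{\sqrt{n}\tilde{c}_{1}^{2}}{\varepsilon^{2}}\big{)}.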

Proof of Theorem C.2.

We can see that many results given under the setting of Theorem 3.1 can still apply under the current setting. If we still define Ωj\Omega_{j} as (12), Ω=j=1Ωj\Omega=\bigcup\limits_{j=1}\limits^{\infty}\Omega_{j} occurs with probability at least 1δ1-\delta.

Under the current setting, Proposition F.1 is still valid. Thus, on Ω\Omega, for any j+j\in\mathbb{Z}_{+},

f(𝐱Kj(j))f(𝐱0(j))\displaystyle\quad f\left(\mathbf{x}_{K_{j}}^{(j)}\right)-f\left(\mathbf{x}_{0}^{(j)}\right)
116Lk=0Kj1𝝂k(j)2+L2ηjτjljKj2bj+ηjKjqj2\displaystyle\leq-\frac{1}{16L}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}+\frac{\eta_{j}K_{j}q_{j}}{2}
=116Lk=0Kj1𝝂k(j)2+L2ηjτjljKj2bj\displaystyle=-\frac{1}{16L}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{L^{2}\eta_{j}\tau_{j}l_{j}K_{j}^{2}}{b_{j}}
=116Lk=0Kj1𝝂k(j)2+LτjKj4\displaystyle=-\frac{1}{16L}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{L\tau_{j}K_{j}}{4}
=116Lk=0Kj1𝝂k(j)2+nL4τj,\displaystyle=-\frac{1}{16L}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+\frac{\sqrt{n}L}{4}\tau_{j},

where the 2nd step is due to our choice of BjnB_{j}\equiv n and consequently qj0q_{j}\equiv 0, the 3rd step is based on our choices of ηj=14L\eta_{j}=\frac{1}{4L} and bj=ljKjb_{j}=l_{j}K_{j}, and the 4th step is based on our choice of Kj=nK_{j}=\sqrt{n}. Summing the above inequality from j=1j=1 to TT,

Δf0\displaystyle\quad-\Delta_{f}^{0}
=f(𝐱)f(𝐱0(1))\displaystyle=f\left(\mathbf{x}^{*}\right)-f\left(\mathbf{x}_{0}^{(1)}\right)
f(𝐱KT(T))f(𝐱0(1))\displaystyle\leq f\left(\mathbf{x}_{K_{T}}^{(T)}\right)-f\left(\mathbf{x}_{0}^{(1)}\right)
=j=1T(nL4τj116Lk=0Kj1𝝂k(j)2)\displaystyle=\sum\limits_{j=1}\limits^{T}\left(\frac{\sqrt{n}L}{4}\tau_{j}-\frac{1}{16L}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\right)
=nTε~232L116Lj=1Tk=0Kj1𝝂k(j)2,\displaystyle=\frac{\sqrt{n}T\tilde{\varepsilon}^{2}}{32L}-\frac{1}{16L}\sum\limits_{j=1}\limits^{T}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}, (33)

where the 2nd step is according to Assumption 3.1, the 4th step is based on our choice of τjε~28L2\tau_{j}\equiv\frac{\tilde{\varepsilon}^{2}}{8L^{2}}. We assert that when TT5T\geq T_{5}, there must exist a 1jT1\leq j\leq T such that

1Kjk=0Kj1𝝂k(j)2ε~2.\frac{1}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\leq\tilde{\varepsilon}^{2}.

If not,

nTε~232L116Lj=1Tk=0Kj1𝝂k(j)2\displaystyle\quad\frac{\sqrt{n}T\tilde{\varepsilon}^{2}}{32L}-\frac{1}{16L}\sum\limits_{j=1}\limits^{T}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}
=nTε~232L116Lj=1TKjKjk=0Kj1𝝂k(j)2\displaystyle=\frac{\sqrt{n}T\tilde{\varepsilon}^{2}}{32L}-\frac{1}{16L}\sum\limits_{j=1}\limits^{T}\frac{K_{j}}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}
nTε~232L116Lj=1Tε~2Kj\displaystyle\leq\frac{\sqrt{n}T\tilde{\varepsilon}^{2}}{32L}-\frac{1}{16L}\sum\limits_{j=1}\limits^{T}\tilde{\varepsilon}^{2}{K_{j}}
=nTε~232Lnε~2T16L\displaystyle=\frac{\sqrt{n}T\tilde{\varepsilon}^{2}}{32L}-\frac{\sqrt{n}\tilde{\varepsilon}^{2}T}{16L}
=nTε~232L\displaystyle=-\frac{\sqrt{n}T\tilde{\varepsilon}^{2}}{32L}
=nε2T160L\displaystyle=-\frac{\sqrt{n}\varepsilon^{2}T}{160L}
(Δf0+1),\displaystyle\leq-(\Delta_{f}^{0}+1),

which contradicts (33). Thus, on Ω\Omega, the first stopping rule will be met within at most TT outer iterations, while the second stopping rule is always satisfied. When both stopping rules are met, we can show that the output has the desired property. Let 1jT1\leq j\leq T and 0kKj0\leq k\leq K_{j} be such that

1Kjk=0Kj1𝝂k(j)2ε~2\frac{1}{K_{j}}\sum\limits_{k=0}\limits^{K_{j}-1}\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\leq\tilde{\varepsilon}^{2}

and

𝝂k(j)2ε~2.\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}\leq\tilde{\varepsilon}^{2}.

Then, on Ω\Omega,

f(𝐱^)2\displaystyle\quad\left\|\nabla f\left(\hat{\mathbf{x}}\right)\right\|^{2}
=f(𝐱k(j))2\displaystyle=\left\|\nabla f\big{(}\mathbf{x}_{k}^{(j)}\big{)}\right\|^{2}
2𝝂k(j)2+2𝝂k(j)f(𝐱k(j))2\displaystyle\leq 2\left\|\boldsymbol{\nu}_{k}^{(j)}\right\|^{2}+2\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla f\big{(}\mathbf{x}_{k}^{(j)}\big{)}\right\|^{2}
2ε~2+2𝝂k(j)f(𝐱k(j))2\displaystyle\leq 2\tilde{\varepsilon}^{2}+2\left\|\boldsymbol{\nu}_{k}^{(j)}-\nabla f\big{(}\mathbf{x}_{k}^{(j)}\big{)}\right\|^{2}
2ε~2+2lj(4L2ηj2bjm=1k𝝂m1(j)2+4L2τjkbj)\displaystyle\leq 2\tilde{\varepsilon}^{2}+2l_{j}\left(\frac{4L^{2}\eta_{j}^{2}}{b_{j}}\sum\limits_{m=1}\limits^{k}\left\|\boldsymbol{\nu}_{m-1}^{(j)}\right\|^{2}+\frac{4L^{2}\tau_{j}k}{b_{j}}\right)
2ε~2+2lj(4L2ηj2bjm=1Kj𝝂m1(j)2+4L2τjKjbj)\displaystyle\leq 2\tilde{\varepsilon}^{2}+2l_{j}\left(\frac{4L^{2}\eta_{j}^{2}}{b_{j}}\sum\limits_{m=1}\limits^{K_{j}}\left\|\boldsymbol{\nu}_{m-1}^{(j)}\right\|^{2}+\frac{4L^{2}\tau_{j}K_{j}}{b_{j}}\right)
2ε~2+2lj(4L2ηj2Kjε~2bj+4L2τjKjbj)\displaystyle\leq 2\tilde{\varepsilon}^{2}+2l_{j}\left(\frac{4L^{2}\eta_{j}^{2}K_{j}\tilde{\varepsilon}^{2}}{b_{j}}+\frac{4L^{2}\tau_{j}K_{j}}{b_{j}}\right)
=2ε~2+0.5ε~2+8L2τj\displaystyle=2\tilde{\varepsilon}^{2}+0.5\tilde{\varepsilon}^{2}+8L^{2}\tau_{j}
=3.5ε~2\displaystyle=3.5\tilde{\varepsilon}^{2}
ε2,\displaystyle\leq\varepsilon^{2},

where the 4th step is based on Proposition D.1, the 7th step is based on our choices of ηj=14L\eta_{j}=\frac{1}{4L} and bj=ljKjb_{j}=l_{j}K_{j}, the 8th step is based on our choice of τjε~28L2\tau_{j}\equiv\frac{\tilde{\varepsilon}^{2}}{8L^{2}}. ∎

Appendix G Supplementary Figures

Figure 3: Comparison of convergence with respect to the (1δ)(1-\delta)-quantile of the squared gradient norm (f2)\left(\|\nabla f\|^{2}\right) and the δ\delta-quantile of validation accuracy on the MNIST dataset for δ=0.1\delta=0.1 and δ=0.01\delta=0.01. The second (fourth) column presents zoom-in figures of those in the first (third) column. Top: δ=0.1\delta=0.1. Bottom: δ=0.01\delta=0.01. ‘bs’ stands for batch size. ‘sj=x’ means that the smallest batch size is approximately xlogxx\log x.

References

  • [1] Zeyuan Allen-Zhu and Elad Hazan “Variance reduction for faster non-convex optimization” In International conference on machine learning, 2016, pp. 699–707 PMLR
  • [2] Rémi Bardenet and Odalric-Ambrym Maillard “Concentration inequalities for sampling without replacement” In Bernoulli 21.3 Bernoulli Society for Mathematical Statistics and Probability, 2015, pp. 1361–1385
  • [3] Stéphane Boucheron, Gábor Lugosi and Pascal Massart “Concentration inequalities: A nonasymptotic theory of independence” Oxford university press, 2013
  • [4] Aaron Defazio, Francis Bach and Simon Lacoste-Julien “SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives” In Advances in neural information processing systems, 2014, pp. 1646–1654
  • [5] Cong Fang, Chris Junchi Li, Zhouchen Lin and Tong Zhang “SPIDER: near-optimal non-convex optimization via stochastic path integrated differential estimator” In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 687–697
  • [6] Saeed Ghadimi and Guanghui Lan “Stochastic first- and zeroth-order methods for nonconvex stochastic programming” In SIAM Journal on Optimization 23.4 SIAM, 2013, pp. 2341–2368
  • [7] Ian Goodfellow, Yoshua Bengio and Aaron Courville “Deep learning” MIT press, 2016
  • [8] Nicholas JA Harvey, Christopher Liaw, Yaniv Plan and Sikander Randhawa “Tight analyses for non-smooth stochastic gradient descent” In Conference on Learning Theory, 2019, pp. 1579–1613 PMLR
  • [9] Nicholas JA Harvey, Christopher Liaw and Sikander Randhawa “Simple and optimal high-probability bounds for strongly-convex stochastic gradient descent” In arXiv preprint arXiv:1909.00843, 2019
  • [10] Wassily Hoeffding “Probability Inequalities for Sums of Bounded Random Variables” In Journal of the American Statistical Association 58.301, 1963, pp. 13–30
  • [11] Samuel Horváth, Lihua Lei, Peter Richtárik and Michael I Jordan “Adaptivity of stochastic gradient methods for nonconvex optimization” In arXiv preprint arXiv:2002.05359, 2020
  • [12] Prateek Jain, Dheeraj Nagaraj and Praneeth Netrapalli “Making the last iterate of sgd information theoretically optimal” In Conference on Learning Theory, 2019, pp. 1752–1755 PMLR
  • [13] Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani “An introduction to statistical learning” Springer, 2013
  • [14] Kaiyi Ji, Zhe Wang, Bowen Weng, Yi Zhou, Wei Zhang and Yingbin Liang “History-gradient aided batch size adaptation for variance reduced algorithms” In International Conference on Machine Learning, 2020, pp. 4762–4772 PMLR
  • [15] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade and Michael I Jordan “A short note on concentration inequalities for random vectors with subgaussian norm” In arXiv preprint arXiv:1902.03736, 2019
  • [16] Rie Johnson and Tong Zhang “Accelerating stochastic gradient descent using predictive variance reduction” In Advances in neural information processing systems 26, 2013, pp. 315–323
  • [17] Sham M Kakade and Ambuj Tewari “On the generalization ability of online strongly convex programming algorithms” In Advances in Neural Information Processing Systems, 2009, pp. 801–808
  • [18] Nicolas Le Roux, Mark Schmidt and Francis Bach “A stochastic gradient method with an exponential convergence rate for finite training sets” In Proceedings of the 25th International Conference on Neural Information Processing Systems-Volume 2, 2012, pp. 2663–2671
  • [19] Lihua Lei and Michael Jordan “Less than a single pass: Stochastically controlled stochastic gradient” In Artificial Intelligence and Statistics, 2017, pp. 148–156 PMLR
  • [20] Lihua Lei and Michael I Jordan “On the adaptivity of stochastic gradient-based optimization” In SIAM Journal on Optimization 30.2 SIAM, 2020, pp. 1473–1500
  • [21] Lihua Lei, Cheng Ju, Jianbo Chen and Michael I Jordan “Non-convex finite-sum optimization via SCSG methods” In Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017, pp. 2345–2355
  • [22] Xiaoyu Li and Francesco Orabona “A high probability analysis of adaptive sgd with momentum” In arXiv preprint arXiv:2007.14294, 2020
  • [23] Zhize Li “SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points” In Advances in Neural Information Processing Systems 32, 2019, pp. 1523–1533
  • [24] Zhize Li, Hongyan Bao, Xiangliang Zhang and Peter Richtárik “PAGE: A simple and optimal probabilistic gradient estimator for nonconvex optimization” In International Conference on Machine Learning, 2021, pp. 6286–6295 PMLR
  • [25] Zhize Li and Jian Li “A simple proximal stochastic gradient method for nonsmooth nonconvex optimization” In Proceedings of the 32nd International Conference on Neural Information Processing Systems, 2018, pp. 5569–5579
  • [26] Yurii Nesterov “Introductory lectures on convex optimization: A basic course” Springer Science & Business Media, 2003
  • [27] Lam M Nguyen, Jie Liu, Katya Scheinberg and Martin Takáč “SARAH: A novel method for machine learning problems using stochastic recursive gradient” In International Conference on Machine Learning, 2017, pp. 2613–2621 PMLR
  • [28] Lam M Nguyen, Jie Liu, Katya Scheinberg and Martin Takáč “Stochastic recursive gradient algorithm for nonconvex optimization” In arXiv preprint arXiv:1705.07261, 2017
  • [29] Iosif Pinelis “An approach to inequalities for the distributions of infinite-dimensional martingales” In Probability in Banach Spaces, 8: Proceedings of the Eighth International Conference, 1992, pp. 128–134 Springer
  • [30] Iosif Pinelis “Optimum bounds for the distributions of martingales in Banach spaces” In The Annals of Probability JSTOR, 1994, pp. 1679–1706
  • [31] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos and Alex Smola “Stochastic variance reduction for nonconvex optimization” In International conference on machine learning, 2016, pp. 314–323 PMLR
  • [32] Quoc Tran-Dinh, Nhan H Pham, Dzung T Phan and Lam M Nguyen “Hybrid stochastic gradient descent algorithms for stochastic nonconvex optimization” In arXiv preprint arXiv:1905.05920, 2019
  • [33] Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang and Vahid Tarokh “Spiderboost and momentum: Faster variance reduction algorithms” In Advances in Neural Information Processing Systems 32, 2019, pp. 2406–2416
  • [34] Dongruo Zhou, Jinghui Chen, Yuan Cao, Yiqi Tang, Ziyan Yang and Quanquan Gu “On the convergence of adaptive gradient methods for nonconvex optimization” In arXiv preprint arXiv:1808.05671, 2018