
Stability of Stochastic Gradient Descent
on Nonsmooth Convex Losses

Raef Bassily Department of Computer Science & Engineering, The Ohio State University. [email protected]    Vitaly Feldman Work done while at Google Research. [email protected]    Cristóbal Guzmán Institute for Mathematical and Computational Engineering, Pontificia Universidad Católica de Chile. [email protected]    Kunal Talwar Work done while at Google Research. [email protected]
Abstract

Uniform stability is a notion of algorithmic stability that bounds the worst-case change in the model output by the algorithm when a single data point in the dataset is replaced. An influential work of Hardt et al. (2016) provides strong upper bounds on the uniform stability of the stochastic gradient descent (SGD) algorithm on sufficiently smooth convex losses. These results led to important progress in understanding the generalization properties of SGD and to several applications in differentially private convex optimization for smooth losses.

Our work is the first to address uniform stability of SGD on nonsmooth convex losses. Specifically, we provide sharp upper and lower bounds for several forms of SGD and full-batch GD on arbitrary Lipschitz nonsmooth convex losses. Our lower bounds show that, in the nonsmooth case, (S)GD can be inherently less stable than in the smooth case. On the other hand, our upper bounds show that (S)GD is sufficiently stable for deriving new and useful bounds on generalization error. Most notably, we obtain the first dimension-independent generalization bounds for multi-pass SGD in the nonsmooth case. In addition, our bounds allow us to derive a new algorithm for differentially private nonsmooth stochastic convex optimization with optimal excess population risk. Our algorithm is simpler and more efficient than the best known algorithm for the nonsmooth case (Feldman et al., 2020).

1 Introduction

Successful applications of a machine learning algorithm require the algorithm to generalize well to unseen data. Thus understanding and bounding the generalization error of machine learning algorithms is an area of intense theoretical interest and practical importance. The single most popular approach to modern machine learning relies on the use of continuous optimization techniques to optimize the appropriate loss function, most notably the stochastic (sub)gradient descent (SGD) method. Yet the generalization properties of SGD are still not well understood.

Consider the setting of stochastic convex optimization (SCO). In this problem, we are interested in the minimization of the population risk F𝒟(x):=𝔼𝐳𝒟[f(x,𝐳)]F_{\mathcal{D}}(x):=\mathbb{E}_{\mathbf{z}\sim\mathcal{D}}[f(x,\mathbf{z})], where 𝒟\mathcal{D} is an arbitrary and unknown distribution, for which we have access to an i.i.d. sample of size nn, 𝐒=(𝐳1,,𝐳n)\mathbf{S}=(\mathbf{z}_{1},\ldots,\mathbf{z}_{n}); and f(,z)f(\cdot,z) is convex and Lipschitz for all zz. The performance of an algorithm 𝒜{\cal A} is quantified by its expected excess population risk,

εrisk(𝒜):=𝔼[F𝒟(𝒜(𝐒))]minx𝒳F𝒟(x),\varepsilon_{\mbox{\footnotesize{risk}}}({\mathcal{A}}):=\mathbb{E}[F_{\mathcal{D}}({\mathcal{A}}(\mathbf{S}))]-\min_{x\in\mathcal{X}}F_{\mathcal{D}}(x),

where the expectation is taken with respect to the randomness of the sample 𝐒\mathbf{S} and internal randomness of 𝒜{\mathcal{A}}. A standard way to bound the excess risk is given by its decomposition into optimization error (a.k.a. training error) and generalization error (see eqn. (3) in Sec. 2). The optimization error can be easily measured empirically but assessing the generalization error requires access to fresh samples from the same distribution. Thus bounds on the generalization error lead directly to provable guarantees on the excess population risk.

The classical analysis of SGD yields bounds on the excess population risk of one-pass SGD. In particular, with an appropriately chosen step size, SGD gives a solution with expected excess population risk of O(1/n)O(1/\sqrt{n}), and this rate is optimal (Nemirovsky and Yudin, 1983). However, this analysis does not apply to multi-pass SGD, which is ubiquitous in practice.

In an influential work, Hardt et al. (2016) gave the first bounds on the generalization error of general forms of SGD (such as those that make multiple passes over the data). Their analysis relies on algorithmic stability, a classical tool for proving bounds on the generalization error. Specifically, they gave strong bounds on the uniform stability of several variants of SGD on convex and smooth losses (with 2/η2/\eta-smoothness sufficing when all the step sizes are at most η\eta). Uniform stability bounds the worst-case change in the loss of the model output by the algorithm on a worst-case point when a single data point in the dataset is replaced (Bousquet and Elisseeff, 2002). Formally, for a randomized algorithm 𝒜{\cal A}, datasets SSS\simeq S^{\prime}, and a point z𝒵z\in\mathcal{Z}, let γ𝒜(S,S,z):=f(𝒜(S),z)f(𝒜(S),z)\gamma_{\cal A}(S,S^{\prime},z):=f({\mathcal{A}}(S),z)-f({\mathcal{A}}(S^{\prime}),z), where SSS\simeq S^{\prime} denotes that the two datasets differ only in a single data point. We say 𝒜{\cal A} is γ\gamma-uniformly stable if

supSS,z𝔼[γ𝒜(S,S,z)]γ,\sup_{S\simeq S^{\prime},z}\mathbb{E}[\gamma_{\cal A}(S,S^{\prime},z)]\leq\gamma,

where the expectation is over the internal randomness of 𝒜\cal{A}. Stronger notions of stability can also be considered, e.g., bounding the probability – over the internal randomness of 𝒜\cal{A} – that γ𝒜(S,S,z)>γ\gamma_{\cal A}(S,S^{\prime},z)>\gamma. Using stability, Hardt et al. (2016) showed that several variants of SGD simultaneously achieve the optimal tradeoff between the excess empirical risk and stability, with both being O(1/n)O(1/\sqrt{n}). Several works have used this approach to derive new generalization properties of SGD (London, 2017; Chen et al., 2018; Feldman and Vondrák, 2019).

The key insight of Hardt et al. (2016) is that a gradient step on a sufficiently smooth convex function is a nonexpansive operator (that is, it does not increase the 2\ell_{2} distance between points). Unfortunately, this property does not hold for nonsmooth losses such as the hinge loss. As a result, no non-trivial bounds on the uniform stability of SGD were previously known in this case.
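To make this concrete, here is a minimal numeric illustration of our own (not from the paper): a gradient step on a 1-smooth convex function contracts distances for any step size at most 2, while a subgradient step on the nonsmooth 1-Lipschitz loss f(x) = |x| can push two nearby points far apart.

```python
import numpy as np

eta = 0.5  # step size

# Smooth convex loss: f(x) = x^2 / 2 (1-smooth, gradient x).
# For eta <= 2, the gradient step x -> (1 - eta) * x is nonexpansive.
def smooth_step(x):
    return x - eta * x

# Nonsmooth convex loss: f(x) = |x| (1-Lipschitz, subgradient sign(x)).
def nonsmooth_step(x):
    return x - eta * np.sign(x)

x, y = 0.01, -0.01
print(abs(smooth_step(x) - smooth_step(y)))        # 0.01 <= |x - y| = 0.02
print(abs(nonsmooth_step(x) - nonsmooth_step(y)))  # about 0.98: the gap expanded 49x
```

The two points straddling the kink at the origin are sent in opposite directions, which is exactly the expansive behavior that breaks the smooth-case analysis.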

Uniform stability is also closely related to the notion of differential privacy (DP). DP upper bounds the worst-case change in the output distribution of an algorithm when a single data point in the dataset is replaced (Dwork et al., 2006b). This connection has been exploited in the design of several DP algorithms for SCO. In particular, bounds on the uniform stability of SGD from (Hardt et al., 2016) have been crucial in the design and analysis of new DP-SCO algorithms (Wu et al., 2017; Dwork and Feldman, 2018; Feldman et al., 2018; Bassily et al., 2019; Feldman et al., 2020).

1.1 Our Results

We establish tight bounds on the uniform stability of the (stochastic) subgradient descent method on nonsmooth convex losses. These results demonstrate that, in the nonsmooth case, SGD can be substantially less stable. At the same time, we show that SGD has strong stability properties even in the regime where its iterations can be expansive.

For convenience, we describe our results in terms of uniform argument stability (UAS), which bounds the output sensitivity in 2\ell_{2}-norm w.r.t. an arbitrary change in a single data point. Formally, a (randomized) algorithm has δ\delta-UAS if

supSS𝔼𝒜(S)𝒜(S)2δ.\sup_{S\simeq S^{\prime}}\mathbb{E}\left\|{\mathcal{A}}(S)-{\mathcal{A}}(S^{\prime})\right\|_{2}\leq\delta. (1)

This notion is implicit in existing analyses of uniform stability (Bousquet and Elisseeff, 2002; Shalev-Shwartz et al., 2010; Hardt et al., 2016) and was explicitly defined by Liu et al. (2017). In this work, we prove stronger, high-probability upper bounds on the random variable δ𝒜(S,S):=𝒜(S)𝒜(S)\delta_{\cal A}(S,S^{\prime}):=\left\|{\mathcal{A}}(S)-{\mathcal{A}}(S^{\prime})\right\|. (In fact, for both GD and fixed-permutation SGD we obtain upper bounds on δ𝒜(S,S)\delta_{\cal A}(S,S^{\prime}) that hold with probability 1, whereas for sampling-with-replacement SGD we obtain a high-probability upper bound.) We also provide matching lower bounds for the weaker, in-expectation notion of UAS (1). A summary of our bounds is in Table 1. For simplicity, they are stated for constant step size; upper bounds for general step sizes are provided in Section 3.

Algorithm | H.p. upper bound | Exp. upper bound | Exp. lower bound
GD (full batch) | 4\big{(}\eta\sqrt{T}+\frac{\eta T}{n}\big{)} | 4\big{(}\eta\sqrt{T}+\frac{\eta T}{n}\big{)} | \Omega\big{(}\eta\sqrt{T}+\frac{\eta T}{n}\big{)}
SGD (w/replacement) | 4\big{(}\eta\sqrt{T}+\frac{\eta T}{n}\big{)} | \min\{1,\frac{T}{n}\}4\eta\sqrt{T}+4\frac{\eta T}{n} | \Omega\Big{(}\min\{1,\frac{T}{n}\}\eta\sqrt{T}+\frac{\eta T}{n}\Big{)}
SGD (fixed permutation) | 2\eta\sqrt{T}+4\frac{\eta T}{n} | \min\{1,\frac{T}{n}\}2\eta\sqrt{T}+4\frac{\eta T}{n} | \Omega\Big{(}\min\{1,\frac{T}{n}\}\eta\sqrt{T}+\frac{\eta T}{n}\Big{)}
Table 1: UAS bounds for variants of GD/SGD, with normalized radius and Lipschitz constant. Here TT is the number of iterations and η>0\eta>0 is the step size. Each bound should be read as \min\{2,(\cdot)\} of the stated quantity, due to the bounded radius of the feasible domain.

Compared to the smooth case (Hardt et al., 2016), the main difference is the presence of the additional ηT\eta\sqrt{T} term. This term has important implications for the generalization bounds derived from UAS. The first is that the standard step size η=Θ(1/n)\eta=\Theta(1/\sqrt{n}) used in single-pass SGD leads to a vacuous stability bound. Unfortunately, as shown by our lower bounds, this is unavoidable (at least in high dimension). However, by decreasing the step size and increasing the number of steps, one obtains a variant of SGD with a nearly optimal balance between UAS and the excess empirical risk.

We highlight two major consequences of our bounds:

  • Generalization bounds for multi-pass nonsmooth SGD. We prove that the generalization error of multi-pass SGD with KK passes is bounded by O((Kn+K)η)O((\sqrt{Kn}+K)\eta). This result can be easily combined with training-error guarantees to provide excess risk bounds for this algorithm. Since the training error can be measured directly, our generalization bounds immediately yield strong guarantees on the excess risk in practical scenarios where a small training error can be certified.

  • Differentially private stochastic convex optimization for nonsmooth losses. We show that a variant of standard noisy SGD (Bassily et al., 2014) with constant step size and n2n^{2} iterations yields the optimal excess population risk O(1n+dlog(1/β)αn)O\big{(}\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log(1/\beta)}}{\alpha n}\big{)} for convex nonsmooth losses under (α,β)(\alpha,\beta)-differential privacy. The best previous algorithm for this problem is substantially more involved: it relies on a multi-phase regularized SGD with decreasing step sizes and variable noise rates, and uses O(n2log(1/β))O(n^{2}\sqrt{\log(1/\beta)}) gradient computations (Feldman et al., 2020).
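The shape of such a noisy SGD variant can be sketched as follows. This is our own schematic illustration, not the paper's algorithm: the interface (`grad_f`, `project_ball`) is hypothetical, the toy loss is ours, and the Gaussian multiplier `sigma` is a placeholder whose calibration to a target (α, β)-DP guarantee is omitted.

```python
import numpy as np

def project_ball(x, R):
    """Euclidean projection onto the ball B(0, R)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= R else (R / nrm) * x

def noisy_sgd(data, grad_f, d, R=1.0, eta=0.01, T=None, sigma=0.1, seed=0):
    """Schematic noisy projected SGD. grad_f(x, z) returns a subgradient
    of f(., z) at x; sigma is an illustrative noise multiplier (its DP
    calibration is not shown here)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    T = T if T is not None else n * n       # constant step size, n^2 iterations
    x = np.zeros(d)
    avg = np.zeros(d)
    for _ in range(T):
        z = data[rng.integers(n)]                           # sample w/replacement
        g = grad_f(x, z) + sigma * rng.standard_normal(d)   # noisy subgradient
        x = project_ball(x - eta * g, R)
        avg += x / T
    return avg                                              # average iterate

# Toy nonsmooth 1-Lipschitz loss: f(x, z) = ||x - z||.
def grad_f(x, z):
    g = x - z
    nrm = np.linalg.norm(g)
    return g / nrm if nrm > 0 else np.zeros_like(g)

data = 0.1 * np.random.default_rng(1).normal(size=(50, 3))
x_hat = noisy_sgd(data, grad_f, d=3, T=2500)
```

The point of the result above is that this simple single-phase scheme, run with constant step size for n² iterations, already achieves the optimal rate.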

1.2 Overview of Techniques

  • Upper bounds. When gradient steps are nonexpansive, upper-bounding UAS requires simply summing the differences between the gradients on the neighboring datasets at the steps where the replaced data point is used (Hardt et al., 2016). This gives the bound of ηT/n\eta T/n in the smooth case.

    By contrast, in the nonsmooth case, UAS may increase even when the gradient step is performed on the same function. As a result, it may increase in every single iteration. However, we use the fact that the difference in the subgradients has a nonpositive inner product with the difference between the iterates themselves (by monotonicity of the subgradient). Thus, the increase in distance satisfies a recurrence with a quadratic and a linear term. Solving this recurrence leads to our upper bounds.

  • Lower bounds. The lower bounds are based on a function with a highly nonsmooth behavior around the origin. More precisely, it is the maximum of linear functions plus a small linear drift that is controlled by a single data point. We show that, when starting the algorithm from the origin, the presence of the linear drift pushes the iterate into a trajectory in which each subgradient step is orthogonal to the current iterate. Thus, if dmin{T,1/η2}d\geq\min\{T,1/\eta^{2}\}, we get the Tη\sqrt{T}\eta increase in UAS. Our lower bounds are also robust to averaging of the iterates.
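The flavor of the recurrence from the upper-bound argument can be seen numerically. In the case a_t = 0 (both trajectories see the same functions after the swap), the worst case of the recurrence δ_{t+1}² ≤ δ_t² + 4L²η² alone already forces growth of order 2Lη√T, which is exactly the extra √T term in our bounds. The constants below are arbitrary choices of ours.

```python
import math

L, eta, T = 1.0, 0.01, 10_000

# Iterate the worst case of delta_{t+1}^2 <= delta_t^2 + 4 * L^2 * eta^2,
# i.e., the a_t = 0 branch of the recurrence.
delta = 0.0
for _ in range(T):
    delta = math.sqrt(delta ** 2 + 4 * L ** 2 * eta ** 2)

print(delta)                        # 2.0 (up to rounding)
print(2 * L * eta * math.sqrt(T))   # 2.0: growth is exactly 2 * L * eta * sqrt(T)
```

With the standard single-pass choice η = Θ(1/√n) and T = n this term is a constant, which is why that step size gives a vacuous stability bound in the nonsmooth case.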

1.3 Other Related Work

Stability is a classical approach to proving generalization bounds, pioneered by Rogers and Wagner (1978) and Devroye and Wagner (1979a, 1979b). It is based on analysis of the sensitivity of the learning algorithm to changes in the dataset, such as leaving one of the data points out or replacing it with a different one. The choice of how to measure the effect of the change and various ways to average over multiple changes give rise to a variety of stability notions (e.g., (Bousquet and Elisseeff, 2002; Mukherjee et al., 2006; Shalev-Shwartz et al., 2010)). Uniform stability was introduced by Bousquet and Elisseeff (2002) in order to derive general bounds on the generalization error that hold with high probability. These bounds have been significantly improved in a recent sequence of works (Feldman and Vondrák, 2018, 2019; Bousquet et al., 2019). A long line of work focuses on the relationship between various notions of stability and learnability in the supervised setting (see (Kearns and Ron, 1999; Poggio et al., 2004; Shalev-Shwartz et al., 2010) for an overview). These works employ relatively weak notions of average stability and derive a variety of asymptotic equivalence results. Chen et al. (2018) establish limits of stability in the smooth convex setting, proving that accelerated methods must satisfy strong stability lower bounds. Stability-based data-dependent generalization bounds for continuous losses were studied in (Maurer, 2017; Kuzborskij and Lampert, 2018).

The first applications of uniform stability in the context of stochastic convex optimization relied on the stability of the empirical minimizer for strongly convex losses (Bousquet and Elisseeff, 2002). Therefore, a natural approach to achieving uniform stability (and also UAS) is to add a strongly convex regularizer and solve the ERM to high accuracy (Shalev-Shwartz et al., 2010). Recent applications of this approach can be found, for example, in (Koren and Levy, 2015; Charles and Papailiopoulos, 2018; Feldman et al., 2020). In contrast, our approach does not require strong convexity and applies to all iterates of SGD, not only to a highly accurate empirical minimizer.

The classical approach to generalization relies on uniform convergence of the empirical risk to the population risk. Unfortunately, without additional structural assumptions on the convex functions, a lower bound of Ω(d/n)\Omega(\sqrt{d/n}) on the rate of uniform convergence for convex SCO is known (Shalev-Shwartz et al., 2010; Feldman, 2016). The dependence on the dimension dd makes bounds obtained via the uniform-convergence approach vacuous in the high-dimensional settings common in modern applications.

Differentially private convex optimization has been studied extensively for over a decade (see, e.g., (Chaudhuri and Monteleoni, 2008; Chaudhuri et al., 2011; Jain et al., 2012; Kifer et al., 2012; Smith and Thakurta, 2013; Bassily et al., 2014; Ullman, 2015; Jain and Thakurta, 2014; Talwar et al., 2015; Bassily et al., 2019; Feldman et al., 2020)). However, until recently, the research focused on minimization of the empirical risk. Population risk for DP-SCO was first studied by Bassily et al. (2014), who gave an upper bound of max(d14n,dαn)\max\left(\tfrac{d^{\frac{1}{4}}}{\sqrt{n}},\tfrac{\sqrt{d}}{\alpha n}\right) on the excess risk (Bassily et al., 2014, Sec. F). A recent work of Bassily et al. (2019) established that the optimal rate of the excess population risk for (α,β)(\alpha,\beta)-DP SCO algorithms is O(1n+dlog(1/β)αn)O\big{(}\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log(1/\beta)}}{\alpha n}\big{)}. Their algorithms are relatively inefficient, especially in the nonsmooth case. Subsequently, Feldman et al. (2020) gave several new algorithms for DP-SCO with the optimal population risk. For sufficiently smooth losses, their algorithms use a linear number of gradient computations. In the nonsmooth case, as mentioned earlier, their algorithm requires O(n2log(1/β))O(n^{2}\sqrt{\log(1/\beta)}) gradient computations and is significantly more involved than the algorithm given here.

2 Notation and Preliminaries

Throughout, we work in the Euclidean space (d,2)(\mathbb{R}^{d},\|\cdot\|_{2}); accordingly, we write =2\|\cdot\|=\|\cdot\|_{2} unambiguously. Vectors are denoted by lower-case letters, e.g., x,yx,y. Random variables (either scalar or vector) are denoted by boldface letters, e.g., 𝐳,𝐮\mathbf{z},\mathbf{u}. We denote the Euclidean ball of radius r>0r>0 centered at xdx\in\mathbb{R}^{d} by (x,r)\mathcal{B}(x,r). In what follows, 𝒳d\mathcal{X}\subseteq\mathbb{R}^{d} is a compact convex set, and we assume we know a bound R>0R>0 on its Euclidean radius, i.e., 𝒳(0,R)\mathcal{X}\subseteq\mathcal{B}(0,R). Let 𝖯𝗋𝗈𝗃𝒳\mathsf{Proj}_{\mathcal{X}} be the Euclidean projection onto 𝒳\mathcal{X}, which is nonexpansive: 𝖯𝗋𝗈𝗃𝒳(x)𝖯𝗋𝗈𝗃𝒳(y)xy\|\mathsf{Proj}_{\mathcal{X}}(x)-\mathsf{Proj}_{\mathcal{X}}(y)\|\leq\|x-y\|. A convex function f:𝒳f:\mathcal{X}\mapsto\mathbb{R} is LL-Lipschitz if

f(x)f(y)Lxy(x,y𝒳).f(x)-f(y)\leq L\|x-y\|\qquad(\forall x,y\in\mathcal{X}). (2)

Functions with these properties are guaranteed to be subdifferentiable. Moreover, in the convex case, property (2) is “almost” equivalent to having subgradients bounded as f(x)(0,L)\partial f(x)\subseteq\mathcal{B}(0,L), for all x𝒳x\in\mathcal{X}. (For this equivalence to hold, it is necessary that the function is well-defined and satisfies (2) over an open set containing 𝒳\mathcal{X}; see Thm. 3.61 in Beck (2017). We assume this is the case, which can be done w.l.o.g.) We denote the class of convex LL-Lipschitz functions by 𝒳0(L)\mathcal{F}_{\mathcal{X}}^{0}(L). With a slight abuse of notation, given a function f𝒳0(L)f\in\mathcal{F}_{\mathcal{X}}^{0}(L), we denote by f(x)\nabla f(x) an arbitrary choice of gf(x)g\in\partial f(x). In this work, we focus on the class 𝒳0(L)\mathcal{F}_{\mathcal{X}}^{0}(L) defined over a compact convex set 𝒳\mathcal{X}. Since the Euclidean radius of 𝒳\mathcal{X} is bounded by RR, we assume that the range of these functions lies in [RL,RL][-RL,RL].

Nonsmooth stochastic convex optimization: We study the standard setting of nonsmooth stochastic convex optimization

xargmin{F𝒟(x):=𝔼𝐳𝒟[f(x,𝐳)]:x𝒳}.x^{\ast}\in\arg\min\{F_{\mathcal{D}}(x):=\mathbb{E}_{\mathbf{z}\sim\mathcal{D}}[f(x,\mathbf{z})]:\,\,x\in\mathcal{X}\}.

Here, 𝒟\mathcal{D} is an unknown distribution supported on a set 𝒵\mathcal{Z}, and f(,z)𝒳0(L)f(\cdot,z)\in\mathcal{F}_{\mathcal{X}}^{0}(L) for all z𝒵z\in\mathcal{Z}. In the stochastic setting, we assume access to an i.i.d. sample from 𝒟\mathcal{D}, denoted as 𝐒=(𝐳1,,𝐳n)𝒟n\mathbf{S}=(\mathbf{z}_{1},\ldots,\mathbf{z}_{n})\sim\mathcal{D}^{n}. Here, we will use the bold symbol 𝐒\mathbf{S} to denote a random sample from the unknown distribution. A fixed (not random) dataset from 𝒵n\mathcal{Z}^{n} will be denoted as S=(z1,,zn)𝒵nS=(z_{1},\ldots,z_{n})\in\mathcal{Z}^{n}.

A stochastic optimization algorithm is a (randomized) mapping 𝒜:𝒵n𝒳{\cal A}:\mathcal{Z}^{n}\mapsto\mathcal{X}. When the algorithm is randomized, 𝒜(𝐒){\cal A}(\mathbf{S}) is a random variable depending on both the sample 𝐒𝒟n\mathbf{S}\sim\mathcal{D}^{n} and its own random coins. The performance of 𝒜{\cal A} is quantified by its excess population risk

εrisk(𝒜):=F𝒟(𝒜(𝐒))F𝒟(x).\varepsilon_{\mbox{\footnotesize{risk}}}({\mathcal{A}}):=F_{\mathcal{D}}({\mathcal{A}}(\mathbf{S}))-F_{\mathcal{D}}(x^{\ast}).

Note that εrisk(𝒜)\varepsilon_{\mbox{\footnotesize{risk}}}({\mathcal{A}}) is a random variable (due to randomness in the sample 𝐒\mathbf{S} and any possible internal randomness of the algorithm). Our guarantees on the excess population risk will be expressed in terms of upper bounds on this quantity that hold with high probability over the randomness of both 𝐒\mathbf{S} and the random coins of the algorithm.

Empirical risk minimization (ERM) is one of the most standard approaches to stochastic convex optimization. In the ERM problem, we are given a sample 𝐒=(𝐳1,,𝐳n)\mathbf{S}=(\mathbf{z}_{1},\ldots,\mathbf{z}_{n}), and the goal is to find

x(𝐒)argmin{F𝐒(x):=1ni=1nf(x,𝐳i):x𝒳}.x^{\ast}(\mathbf{S})\in\arg\min\Big{\{}F_{\mathbf{S}}(x):=\frac{1}{n}\sum_{i=1}^{n}f(x,\mathbf{z}_{i}):\,\,x\in\mathcal{X}\Big{\}}.

One way to bound the excess population risk is to solve the ERM problem and appeal to uniform convergence; however, uniform convergence rates in this case are dimension-dependent: Ω(d/n)\Omega(\sqrt{d/n}) (Feldman, 2016).

Risk decomposition: Guaranteeing low excess population risk for a general algorithm is a nontrivial task. A common way to bound it is by decomposing it into generalization, optimization and approximation error:

εrisk(𝒜)F𝒟(𝒜(𝐒))F𝐒(𝒜(𝐒))εgen(𝒜)+F𝐒(𝒜(𝐒))F𝐒(x(𝐒))εopt(𝒜)+F𝐒(x(𝐒))F𝒟(x)εapprox.\varepsilon_{\mbox{\footnotesize{risk}}}({\mathcal{A}})\leq\underbrace{F_{\mathcal{D}}({\mathcal{A}}(\mathbf{S}))-F_{\mathbf{S}}({\mathcal{A}}(\mathbf{S}))}_{\varepsilon_{\mbox{\tiny gen}}({\mathcal{A}})}+\underbrace{F_{\mathbf{S}}({\mathcal{A}}(\mathbf{S}))-F_{\mathbf{S}}(x^{\ast}(\mathbf{S}))}_{\varepsilon_{\mbox{\tiny opt}}({\mathcal{A}})}+\underbrace{F_{\mathbf{S}}(x^{\ast}(\mathbf{S}))-F_{\mathcal{D}}(x^{\ast})}_{\varepsilon_{\mbox{\tiny approx}}}. (3)

Here, the optimization error corresponds to the empirical optimization gap, which can be bounded by standard optimization convergence analysis. The expected value of the approximation error is at most zero. One can show, e.g., by Hoeffding’s inequality, that the approximation error is bounded by O~(LR/n)\tilde{O}(LR/\sqrt{n}) with high probability (see Lemma 2.1 below). Therefore, to establish bounds on the excess risk, it suffices to upper bound the optimization and generalization errors.
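The decomposition can be estimated term by term on a synthetic problem; the following toy sketch is ours (the learner and loss are arbitrary choices, and a large fresh sample stands in for the population risk F_D):

```python
import numpy as np

rng = np.random.default_rng(0)

def risk(x, Z):
    """Empirical risk F_Z(x) for the toy loss f(x, z) = |x - z|."""
    return np.mean(np.abs(x - Z))

def train(S, T=200, eta=0.05):
    """Toy learner A: subgradient descent on F_S, averaged iterate."""
    x, avg = 0.0, 0.0
    for _ in range(T):
        g = np.mean(np.sign(x - S))   # subgradient of F_S at x
        x -= eta * g
        avg += x / T
    return avg

S = rng.normal(size=100)              # training sample
fresh = rng.normal(size=100_000)      # proxy for the population distribution D

x_hat = train(S)
x_erm = np.median(S)                  # exact ERM for this loss

eps_opt = risk(x_hat, S) - risk(x_erm, S)       # optimization error (>= 0)
eps_gen = risk(x_hat, fresh) - risk(x_hat, S)   # generalization error estimate
print(eps_opt, eps_gen)
```

As the text notes, eps_opt is directly measurable from the data, whereas eps_gen requires fresh samples, which is why stability-based bounds on it are valuable.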

Lemma 2.1.

For any θ(0,1),\theta\in(0,1), with probability at least 1θ1-\theta, the approximation error is bounded as

εapproxRL2log(1/θ)n.\varepsilon_{\mathrm{\tiny approx}}\leq\frac{RL\sqrt{2\log(1/\theta)}}{\sqrt{n}}.
Proof.

First, note that F𝐒(x)F_{\mathbf{S}}(x^{\ast}) is the average of the nn independent random variables f(x,𝐳i)f(x^{\ast},\mathbf{z}_{i}), and that 𝔼𝐒[F𝐒(x)]=F𝒟(x)\underset{\mathbf{S}}{\mathbb{E}}\left[F_{\mathbf{S}}(x^{\ast})\right]=F_{\mathcal{D}}(x^{\ast}). Hence, by independence and the fact that f(x,𝐳i)[RL,RL]f(x^{\ast},\mathbf{z}_{i})\in[-RL,RL] with probability 11 for all i[n]i\in[n], Hoeffding’s inequality gives:

𝐒𝒟n[F𝐒(x)F𝒟(x)RL2log(1/θ)n]θ.\underset{\mathbf{S}\sim\mathcal{D}^{n}}{\mathbb{P}}\left[F_{\mathbf{S}}(x^{\ast})-F_{\mathcal{D}}(x^{\ast})\geq\frac{RL\sqrt{2\log(1/\theta)}}{\sqrt{n}}\right]\leq\theta.

Finally, note that by definition of x(𝐒)x^{\ast}(\mathbf{S}), we have F𝐒(x(𝐒))F𝐒(x)0F_{\mathbf{S}}(x^{\ast}(\mathbf{S}))-F_{\mathbf{S}}(x^{\ast})\leq 0. Combining this with the above bound completes the proof. ∎

We say that two datasets S,SS,S^{\prime} are neighboring, denoted SSS\simeq S^{\prime}, if they differ in at most a single entry; i.e., there exists i[n]i\in[n] such that zk=zkz_{k}=z_{k}^{\prime} for all kik\neq i.

Uniform argument stability (UAS): Given an algorithm 𝒜{\cal A} and datasets SSS\simeq S^{\prime}, we define the uniform argument stability (UAS) random variable as

δ𝒜(S,S):=𝒜(S)𝒜(S).\delta_{\cal A}(S,S^{\prime}):=\|{\cal A}(S)-{\cal A}(S^{\prime})\|.

The randomness here is due to any possible internal randomness of 𝒜{\cal A}. For any LL-Lipschitz function ff, we have that f(𝒜(S),z)f(𝒜(S),z)Lδ𝒜(S,S).f\left({\mathcal{A}}(S),z\right)-f\left({\mathcal{A}}(S^{\prime}),z\right)\leq L\,\delta_{\cal A}(S,S^{\prime}). Hence, upper bounds on UAS can be easily transformed into upper bounds on uniform stability.

In this work, we will consider two types of bounds on UAS.

2.1 High-probability guarantees on UAS

In Section 3, we give upper bounds on UAS for three variants of the (stochastic) gradient descent algorithm, namely, (i) full-batch gradient descent, (ii) sampling-with-replacement stochastic gradient descent, and (iii) fixed-permutation stochastic gradient descent. Variant (i) is deterministic (and hence its UAS is a deterministic quantity). For variant (ii), for any pair of neighboring datasets S,SS,S^{\prime}, we give an upper bound on the UAS random variable that holds with high probability over the algorithm’s internal randomness (the sampling with replacement). For variant (iii), we give an upper bound on UAS that holds for an arbitrary choice of permutation; in particular, for any random permutation, our upper bound on the UAS random variable holds with probability 1.

High-probability upper bounds on UAS lead to high-probability upper bounds on the generalization error εgen\varepsilon_{\mathrm{\tiny gen}}. We will use the following theorem, which follows in a straightforward fashion from (Feldman and Vondrák, 2019, Theorem 1.1), to derive generalization-error guarantees for our results in Sections 5 and 6 based on our UAS upper bounds in Section 3.

Theorem 2.2 (follows from Theorem 1.1 in Feldman and Vondrák (2019)).

Let 𝒜:𝒵n𝒳{\mathcal{A}}:\mathcal{Z}^{n}\rightarrow\mathcal{X} be a randomized algorithm. For any pair of neighboring datasets S,SS,S^{\prime}, suppose that the UAS random variable of 𝒜{\mathcal{A}} satisfies:

𝒜[δ𝒜(S,S)γ]\displaystyle\underset{{\mathcal{A}}}{\mathbb{P}}\left[\delta_{{\mathcal{A}}}(S,S^{\prime})\geq\gamma\right] θ0.\displaystyle\leq\theta_{0}.

Then there is a constant cc such that for any distribution 𝒟\mathcal{D} over 𝒵\mathcal{Z} and any θ(0,1)\theta\in(0,1), we have

𝐒𝒟n,𝒜[|εgen(𝒜)|c(Lγlog(n)log(n/θ)+LRlog(1/θ)n)]\displaystyle\underset{\mathbf{S}\sim\mathcal{D}^{n},\,{\mathcal{A}}}{\mathbb{P}}\left[\lvert\varepsilon_{\mathrm{\tiny gen}}({\mathcal{A}})\rvert\geq c\left(L\gamma\log(n)\log(n/\theta)+LR\sqrt{\frac{\log(1/\theta)}{n}}\right)\right] θ+θ0,\displaystyle\leq\theta+\theta_{0},

where εgen(𝒜)=F𝒟(𝒜(𝐒))F𝐒(𝒜(𝐒))\varepsilon_{\mathrm{\tiny gen}}({\mathcal{A}})=F_{\mathcal{D}}({\mathcal{A}}(\mathbf{S}))-F_{\mathbf{S}}({\mathcal{A}}(\mathbf{S})) as defined earlier.

2.2 Expectation guarantees on UAS

Our results also include upper and lower bounds on supSS𝔼𝒜[δ𝒜(S,S)]\sup\limits_{S\simeq S^{\prime}}\underset{{\mathcal{A}}}{\mathbb{E}}\left[\delta_{{\mathcal{A}}}(S,S^{\prime})\right]; that is, the supremum of the expected value of the UAS random variable, where the supremum is taken over all pairs of neighboring datasets. In Section 3.3.1, we provide an upper bound on this quantity for sampling-with-replacement stochastic gradient descent. The upper bounds for the other two variants of the gradient descent method hold in the strongest sense (they hold with probability 11). Moreover, in Appendix F, we give slightly tighter expectation guarantees on UAS for both sampling-with-replacement SGD and fixed-permutation SGD with a uniformly random permutation.

In Section 4, we give lower bounds on this quantity for the two variants of the stochastic subgradient method, together with a deterministic lower bound for the full-batch variant.

3 Upper Bounds on Uniform Argument Stability

3.1 The Basic Lemma

We begin by stating a key lemma that encompasses the UAS analysis of multiple variants of (S)GD. In particular, all of our UAS upper bounds are obtained by an almost direct application of this lemma. In the lemma, we consider two gradient descent trajectories associated with different sequences of objective functions. The degree of concordance of the two sequences, quantified by the distance between the subgradients at the current iterate, controls the deviation between the trajectories. We note that this distance condition is satisfied by all (S)GD variants we study in this work.

Lemma 3.1.

Let (xt)t[T](x^{t})_{t\in[T]} and (yt)t[T](y^{t})_{t\in[T]}, with x1=y1x^{1}=y^{1}, be online gradient descent trajectories for convex LL-Lipschitz objectives (ft)t[T1](f_{t})_{t\in[T-1]} and (ft)t[T1](f_{t}^{\prime})_{t\in[T-1]}, respectively; i.e.,

xt+1\displaystyle x^{t+1} =\displaystyle= 𝖯𝗋𝗈𝗃𝒳[xtηtft(xt)]\displaystyle\mathsf{Proj}_{\mathcal{X}}[x^{t}-\eta_{t}\nabla f_{t}(x^{t})]
yt+1\displaystyle y^{t+1} =\displaystyle= 𝖯𝗋𝗈𝗃𝒳[ytηtft(yt)],\displaystyle\mathsf{Proj}_{\mathcal{X}}[y^{t}-\eta_{t}\nabla f_{t}^{\prime}(y^{t})],

for all t[T1]t\in[T-1]. Suppose for every t[T1]t\in[T-1], ft(xt)ft(xt)at\|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(x^{t})\|\leq a_{t}, for scalars 0at2L0\leq a_{t}\leq 2L. Then, if t0=inf{t:ftft},t_{0}=\inf\{t:f_{t}\neq f_{t}^{\prime}\},

xTyT2Lt=t0T1ηt2+2t=t0+1T1ηtat.\|x^{T}-y^{T}\|\leq 2L\sqrt{\sum_{t=t_{0}}^{T-1}\eta_{t}^{2}}+2\sum_{t=t_{0}+1}^{T-1}\eta_{t}a_{t}.
Proof.

Let δt=xtyt\delta_{t}=\|x^{t}-y^{t}\|. By definition of t0t_{0}, it is clear that δ1==δt0=0\delta_{1}=\ldots=\delta_{t_{0}}=0. For t=t0+1t=t_{0}+1, since xt0=yt0x^{t_{0}}=y^{t_{0}} and the projection is nonexpansive, we have δt0+1ηt0(ft0(xt0)ft0(xt0))2Lηt0\delta_{t_{0}+1}\leq\|\eta_{t_{0}}(\nabla f_{t_{0}}(x^{t_{0}})-\nabla f_{t_{0}}^{\prime}(x^{t_{0}}))\|\leq 2L\eta_{t_{0}}.

Now, we derive a recurrence for (δt)t[T](\delta_{t})_{t\in[T]}:

δt+12=𝖯𝗋𝗈𝗃𝒳[xtηtft(xt)]𝖯𝗋𝗈𝗃𝒳[ytηtft(yt)]2xtytηt(ft(xt)ft(yt))2\displaystyle\delta_{t+1}^{2}=\|\mathsf{Proj}_{\mathcal{X}}[x^{t}-\eta_{t}\nabla f_{t}(x^{t})]-\mathsf{Proj}_{\mathcal{X}}[y^{t}-\eta_{t}\nabla f_{t}^{\prime}(y^{t})]\|^{2}\,\,\leq\,\,\|x^{t}-y^{t}-\eta_{t}(\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(y^{t}))\|^{2}
=δt2+ηt2ft(xt)ft(yt)22ηtft(xt)ft(yt),xtyt\displaystyle=\delta_{t}^{2}+\eta_{t}^{2}\|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(y^{t})\|^{2}-2\eta_{t}\langle\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(y^{t}),x^{t}-y^{t}\rangle
δt2+ηt2ft(xt)ft(yt)22ηtft(xt)ft(xt),xtyt2ηtft(xt)ft(yt),xtyt\displaystyle\leq\delta_{t}^{2}+\eta_{t}^{2}\|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(y^{t})\|^{2}-2\eta_{t}\langle\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(x^{t}),x^{t}-y^{t}\rangle-2\eta_{t}\langle\nabla f_{t}^{\prime}(x^{t})-\nabla f_{t}^{\prime}(y^{t}),x^{t}-y^{t}\rangle
δt2+ηt2ft(xt)ft(yt)2+2ηtft(xt)ft(xt)δt2ηtft(xt)ft(yt),xtyt\displaystyle\leq\delta_{t}^{2}+\eta_{t}^{2}\|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(y^{t})\|^{2}+2\eta_{t}\|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(x^{t})\|\delta_{t}-2\eta_{t}\langle\nabla f_{t}^{\prime}(x^{t})-\nabla f_{t}^{\prime}(y^{t}),x^{t}-y^{t}\rangle
δt2+4L2ηt2+2ηtatδt,\displaystyle\leq\delta_{t}^{2}+4L^{2}\eta_{t}^{2}+2\eta_{t}a_{t}\delta_{t},

where at the last step we use the monotonicity of the subgradient. Note that

δt0+1ηt0ft0(xt0)ft0(xt0)2Lηt0.\delta_{t_{0}+1}\leq\eta_{t_{0}}\|\nabla f_{t_{0}}(x^{t_{0}})-\nabla f_{t_{0}}^{\prime}(x^{t_{0}})\|\leq 2L\eta_{t_{0}}.

Hence,

δt2\displaystyle\delta_{t}^{2} \displaystyle\leq δt0+12+4L2s=t0+1t1ηs2+2s=t0+1t1ηsasδs\displaystyle\textstyle\delta_{t_{0}+1}^{2}+4L^{2}\sum_{s=t_{0}+1}^{t-1}\eta_{s}^{2}+2\sum_{s=t_{0}+1}^{t-1}\eta_{s}a_{s}\delta_{s} (4)
\displaystyle\leq 4L2s=t0t1ηs2+2s=t0+1t1ηsasδs.\displaystyle\textstyle 4L^{2}\sum_{s=t_{0}}^{t-1}\eta_{s}^{2}+2\sum_{s=t_{0}+1}^{t-1}\eta_{s}a_{s}\delta_{s}.

Now we prove the following bound by induction (notice this claim proves the result):

\textstyle\delta_{t}\leq 2L\sqrt{\sum_{s=t_{0}}^{t-1}\eta_{s}^{2}}+2\sum_{s=t_{0}+1}^{t-1}\eta_{s}a_{s}\qquad(\forall t\in[T]).

Indeed, the claim is clearly true for t=t0t=t_{0}. For the inductive step, we assume it holds for some t[T1]t\in[T-1]. To prove the result we consider two cases: first, when δt+1maxs[t]δs\delta_{t+1}\leq\max_{s\in[t]}\delta_{s}, by induction hypothesis we have

\textstyle\delta_{t+1}\leq\max_{s\in[t]}\delta_{s}\leq 2L\sqrt{\sum_{s=t_{0}}^{t-1}\eta_{s}^{2}}+2\sum_{s=t_{0}+1}^{t-1}\eta_{s}a_{s}\leq 2L\sqrt{\sum_{s=t_{0}}^{t}\eta_{s}^{2}}+2\sum_{s=t_{0}+1}^{t}\eta_{s}a_{s}.

In the other case, \delta_{t+1}>\max_{s\in[t]}\delta_{s}, we use (4):

\textstyle\delta_{t+1}^{2}\,\leq\,4L^{2}\sum_{s=t_{0}}^{t}\eta_{s}^{2}+2\sum_{s=t_{0}+1}^{t}\eta_{s}a_{s}\delta_{s}\,\,\leq\,\,4L^{2}\sum_{s=t_{0}}^{t}\eta_{s}^{2}+2\delta_{t+1}\sum_{s=t_{0}+1}^{t}\eta_{s}a_{s},

which is equivalent to

\textstyle\Big{(}\delta_{t+1}-\sum_{s=t_{0}+1}^{t}a_{s}\eta_{s}\Big{)}^{2}\,\,\leq\,\,4L^{2}\sum_{s=t_{0}}^{t}\eta_{s}^{2}+\Big{(}\sum_{s=t_{0}+1}^{t}\eta_{s}a_{s}\Big{)}^{2}.

Taking the square root of this inequality and using the subadditivity of the square root, we obtain the inductive step, and therefore the result. ∎
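As a quick numerical sanity check of Lemma 3.1 (an illustration of ours, not part of the formal argument), one can drive the recursion (4) with equality at every step and verify that the closed-form bound still dominates the resulting sequence. The sketch below does this in Python with L=1 and arbitrary step sizes, taking deviations a_t in {0, 2L}:

```python
import math
import random

def recursion_vs_bound(etas, avals, L=1.0):
    """Drive the recursion (4) with equality,
        delta_{t+1}^2 = delta_t^2 + 4 L^2 eta_t^2 + 2 eta_t a_t delta_t,
    and check it against the closed-form bound of Lemma 3.1,
        delta_t <= 2 L sqrt(sum_s eta_s^2) + 2 sum_s eta_s a_s."""
    delta, sum_sq, sum_lin, bound = 0.0, 0.0, 0.0, 0.0
    for eta, a in zip(etas, avals):
        delta = math.sqrt(delta ** 2 + 4 * L ** 2 * eta ** 2 + 2 * eta * a * delta)
        sum_sq += eta ** 2
        sum_lin += eta * a
        bound = 2 * L * math.sqrt(sum_sq) + 2 * sum_lin
        assert delta <= bound + 1e-9, (delta, bound)
    return delta, bound

random.seed(0)
etas = [random.uniform(0.0, 0.1) for _ in range(200)]
avals = [random.choice([0.0, 2.0]) for _ in range(200)]  # a_t in {0, 2L}
recursion_vs_bound(etas, avals)
```

When all a_t = 0 the two sides coincide (the recursion telescopes to delta_t = 2L sqrt(sum eta_s^2)), so the bound is tight in that regime.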

3.2 Upper Bounds for the Full Batch GD

Algorithm 1 𝒜𝖦𝖣{\mathcal{A}}_{\sf GD}: Full-batch Gradient Descent
0:  Dataset: S=(z1,,zn)𝒵nS=(z_{1},\ldots,z_{n})\in\mathcal{Z}^{n}, # iterations TT,  step sizes {ηt:t[T]}\{\eta_{t}:t\in[T]\}
1:  Choose arbitrary initial point x1𝒳x^{1}\in\mathcal{X}
2:  for t=1t=1 to T1T-1  do
3:     xt+1:=𝖯𝗋𝗈𝗃𝒳(xtηtFS(xt)),x^{t+1}:=\mathsf{Proj}_{\mathcal{X}}\left(x^{t}-\eta_{t}\cdot\nabla F_{S}(x^{t})\right),
4:  return  x¯T=1t[T]ηtt[T]ηtxt\overline{x}^{T}=\frac{1}{\sum_{t\in[T]}\eta_{t}}\sum_{t\in[T]}\eta_{t}x^{t}

As a direct corollary of Lemma 3.1, we derive the following upper bound on UAS for the batch gradient descent algorithm.

Theorem 3.2.

Let 𝒳(0,R)\mathcal{X}\subseteq\mathcal{B}(0,R) and =𝒳0(L)\mathcal{F}=\mathcal{F}_{\mathcal{X}}^{0}(L). The full-batch gradient descent (Algorithm 1) has uniform argument stability

supSSδ𝒜𝖦𝖣(S,S)min{2R,4L(1nt=1T1ηt+t=1T1ηt2)}.\sup\limits_{S\simeq S^{\prime}}\delta_{{\mathcal{A}}_{\sf GD}}(S,S^{\prime})\leq\min\left\{2R,~{}4L\,\Big{(}\frac{1}{n}\sum_{t=1}^{T-1}\eta_{t}+\sqrt{\sum_{t=1}^{T-1}\eta_{t}^{2}}\Big{)}\right\}.
Proof.

The bound of 2R2R is obtained directly from the diameter bound on 𝒳\mathcal{X}. Therefore, we focus exclusively on the second term. Let SSS\simeq S^{\prime} be arbitrary neighboring datasets, x1=y1x^{1}=y^{1}, and consider the trajectories (xt)t,(yt)t(x^{t})_{t},(y^{t})_{t} associated with the batch GD method on datasets SS and SS^{\prime}, respectively. We use Lemma 3.1 with ft=FSf_{t}=F_{S} and ft=FSf_{t}^{\prime}=F_{S^{\prime}}, for all t[T1]t\in[T-1]. Notice that

supx𝒳FS(x)FS(x)2L/n,\sup_{x\in{\cal X}}\|\nabla F_{S}(x)-\nabla F_{S^{\prime}}(x)\|\leq 2L/n,

since SSS\simeq S^{\prime}; in particular, ft(xt)ft(xt)at\|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(x^{t})\|\leq a_{t}, with at=2L/na_{t}=2L/n. We conclude by Lemma 3.1 that for all t[T]t\in[T]

xtyt2Ls=1t1ηs2+4Lns=2t1ηs.\|x^{t}-y^{t}\|\leq 2L\sqrt{\sum_{s=1}^{t-1}\eta_{s}^{2}}+\frac{4L}{n}\sum_{s=2}^{t-1}\eta_{s}.

Hence, the stability bound holds for all the iterates, and thus for x¯T\overline{x}^{T} by the triangle inequality. ∎
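To make Theorem 3.2 concrete, the following Python sketch (ours; the toy nonsmooth loss f(x,z)=|⟨x,z⟩| with ‖z‖≤1, hence L=1, and all parameter values are illustrative assumptions) runs Algorithm 1 on two neighboring datasets from the same initial point and checks the gap between the average iterates against the stated UAS bound:

```python
import numpy as np

def proj_ball(x, R):
    """Euclidean projection onto the ball B(0, R)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= R else x * (R / nrm)

def subgrad(x, z):
    """A subgradient of the toy nonsmooth loss f(x, z) = |<x, z>|."""
    return np.sign(x @ z) * z

def full_batch_gd(S, T, eta, R, x1):
    """Algorithm 1 with constant step size: projected full-batch
    subgradient descent, returning the (uniform) average iterate."""
    x = x1.copy()
    iterates = [x.copy()]
    for _ in range(T - 1):
        g = np.mean([subgrad(x, z) for z in S], axis=0)  # subgradient of F_S
        x = proj_ball(x - eta * g, R)
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)

rng = np.random.default_rng(0)
n, d, T, eta, R, L = 10, 5, 50, 0.05, 1.0, 1.0
S = [z / max(1.0, np.linalg.norm(z)) for z in rng.normal(size=(n, d))]
S2 = list(S)
S2[0] = -S[0]                      # neighboring dataset: one point replaced
x1 = proj_ball(rng.normal(size=d), R)
xbar = full_batch_gd(S, T, eta, R, x1)
ybar = full_batch_gd(S2, T, eta, R, x1)
gap = np.linalg.norm(xbar - ybar)
bound = min(2 * R, 4 * L * ((T - 1) * eta / n + eta * np.sqrt(T - 1)))
assert gap <= bound + 1e-9
```

Both runs share the same deterministic subgradient oracle, as required by the coupling argument in the proof.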

3.3 Upper Bounds for SGD

Next, we state and prove upper bounds on UAS for two variants of stochastic gradient descent: sampling-with-replacement SGD (Section 3.3.1) and fixed-permutation SGD (Section 3.3.2). Here, we give strong upper bounds that hold with high probability (for sampling-with-replacement SGD) and with probability 1 (for fixed-permutation SGD). In Appendix F, we derive tighter upper bounds for these two variants in the regime where the number of iterations T is smaller than the number of samples n; however, the bounds in that regime hold only in expectation.

3.3.1 Sampling-with-replacement SGD

Next, we study the uniform argument stability of the sampling-with-replacement stochastic gradient descent (Algorithm 2). This algorithm has the benefit that each iteration is extremely cheap compared to an iteration of Algorithm 1. Despite these savings, we will show that the same bound on UAS holds with high probability.

Algorithm 2 𝒜𝗋𝖲𝖦𝖣{\mathcal{A}}_{\sf rSGD}: Sampling with replacement SGD
0:  Dataset: S=(z1,,zn)𝒵nS=(z_{1},\ldots,z_{n})\in\mathcal{Z}^{n}, # iterations TT,  stepsizes {ηt:t[T]}\{\eta_{t}:t\in[T]\}
1:  Choose arbitrary initial point x1𝒳x^{1}\in\mathcal{X}
2:  for t=1t=1 to T1T-1  do
3:     Sample 𝐢tUnif([n])\mathbf{i}_{t}\sim\mbox{Unif}([n])
4:     xt+1:=𝖯𝗋𝗈𝗃𝒳(xtηtf(xt,z𝐢t))x^{t+1}:=\mathsf{Proj}_{\mathcal{X}}\left(x^{t}-\eta_{t}\cdot\nabla f(x^{t},z_{\mathbf{i}_{t}})\right)
5:  return  x¯T=1t[T]ηtt[T]ηtxt\overline{x}^{T}=\frac{1}{\sum_{t\in[T]}\eta_{t}}\sum_{t\in[T]}\eta_{t}x^{t}

We now state and prove our upper bound for sampling-with-replacement SGD.

Theorem 3.3.

Let 𝒳(0,R)\mathcal{X}\subseteq\mathcal{B}(0,R) and =𝒳0(L)\mathcal{F}=\mathcal{F}_{\mathcal{X}}^{0}(L). The uniform argument stability of the sampling-with-replacement SGD (Algorithm 2) satisfies:

supSS𝔼𝒜𝗋𝖲𝖦𝖣[δ𝒜𝗋𝖲𝖦𝖣(S,S)]min(2R,4L(t=1T1ηt2+1nt=1T1ηt)).\sup\limits_{S\simeq S^{\prime}}\underset{{\mathcal{A}}_{\sf rSGD}}{\mathbb{E}}\left[\delta_{{\mathcal{A}}_{\sf rSGD}}(S,S^{\prime})\right]\leq\min\left(2R,~{}4L\left(\sqrt{\sum_{t=1}^{T-1}\eta_{t}^{2}}+\frac{1}{n}\sum_{t=1}^{T-1}\eta_{t}\right)\right).

Moreover, if ηt=η>0t\eta_{t}=\eta>0~{}\forall t then, for any pair (S,S)(S,S^{\prime}) of neighboring datasets, with probability at least 1exp(n/2)1-\exp\left(-n/2\right) (over the algorithm’s internal randomness), the UAS random variable is bounded as

δ𝒜𝗋𝖲𝖦𝖣(S,S)min(2R,4L(ηT1+ηT1n)).\delta_{{\mathcal{A}}_{\sf rSGD}}(S,S^{\prime})\leq\min\left(2R,~{}4L\left(\eta\,\sqrt{T-1}+\eta\,\frac{T-1}{n}\right)\right).
Proof.

The bound of 2R trivially follows from the diameter bound on \mathcal{X}. We thus focus on the second term of the bound. Let S\simeq S^{\prime} be arbitrary neighboring datasets, x^{1}=y^{1}, and consider the trajectories (x^{t})_{t\in[T]},(y^{t})_{t\in[T]} associated with the sampling-with-replacement stochastic subgradient method on datasets S and S^{\prime}, respectively. We use Lemma 3.1 with f_{t}(\cdot)=f(\cdot,\mathbf{z}_{\mathbf{i}_{t}}) and f_{t}^{\prime}(\cdot)=f(\cdot,\mathbf{z}_{\mathbf{i}_{t}}^{\prime}). Let us define \mathbf{r}_{t}\triangleq\mathbf{1}_{\{\mathbf{z}_{\mathbf{i}_{t}}\neq\mathbf{z}_{\mathbf{i}_{t}}^{\prime}\}}. Since S and S^{\prime} differ in a single element, at every step t we have \mathbf{r}_{t}=1 with probability 1/n, and \mathbf{r}_{t}=0 otherwise. Moreover, note that \{\mathbf{r}_{t}:~t\in[T]\} is an independent sequence of Bernoulli random variables. Finally, note that \|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(x^{t})\|\leq 2L\mathbf{r}_{t}.

Hence, by Lemma 3.1, for any realization of the trajectories of the SGD method, we have

t[T]:xtyt\displaystyle\forall t\in[T]:\quad\|x^{t}-y^{t}\| 2Ls=1t1ηs2+4Ls=1t1𝐫sηsΔT,\displaystyle\leq 2L\sqrt{\sum_{s=1}^{t-1}\eta_{s}^{2}}+4L\sum_{s=1}^{t-1}\mathbf{r}_{s}\eta_{s}\leq\Delta_{T}, (5)

where ΔT2Ls=1T1ηs2+4Ls=1T1𝐫sηs\Delta_{T}\triangleq 2L\sqrt{\sum_{s=1}^{T-1}\eta_{s}^{2}}+4L\sum_{s=1}^{T-1}\mathbf{r}_{s}\eta_{s}. Taking expectation of (5), we have

t[T]:𝔼[xtyt]𝔼[ΔT]=2Ls=1T1ηs2+4Lns=1T1ηs.\forall t\in[T]:\quad\mathbb{E}\left[\|x^{t}-y^{t}\|\right]\leq\mathbb{E}\left[\Delta_{T}\right]=2L\sqrt{\sum_{s=1}^{T-1}\eta_{s}^{2}}+\frac{4L}{n}\sum_{s=1}^{T-1}\eta_{s}.

This establishes the upper bound on UAS, but only in expectation. We now proceed to prove the high-probability bound. Here, we assume that the step size is fixed; that is, \eta_{t}=\eta>0 for all t\in[T-1]. Note that each \mathbf{r}_{s}, s\in[T], has variance \frac{1}{n}\left(1-\frac{1}{n}\right)<\frac{1}{n}. Hence, by Chernoff's bound (applied to scaled Bernoulli random variables, with the exponent expressed in terms of the variance), we have

[ηs=1T1𝐫sηT1n+ηT1]\displaystyle\mathbb{P}\left[\eta\sum_{s=1}^{T-1}\mathbf{r}_{s}\geq\eta\frac{T-1}{n}+\eta\sqrt{T-1}\right] exp(η2(T1)2η2T1n)=exp(n2).\displaystyle\leq\exp\left(-\,\frac{\eta^{2}(T-1)}{2\eta^{2}\,\frac{T-1}{n}}\right)=\exp\left(-\frac{n}{2}\right).

Therefore, with probability at least 1exp(n/2)1-\exp\left(-n/2\right), we have

\Delta_{T}\leq 6L\eta\sqrt{T-1}+\frac{4L}{n}\eta(T-1).

Putting this together with (5), with probability at least 1exp(n/2)1-\exp\left(-n/2\right), we have

\forall t\in[T]:\quad\|x^{t}-y^{t}\|\leq 6L\eta\sqrt{T-1}+\frac{4L}{n}\eta(T-1).

Finally, by the triangle inequality, we get that with probability at least 1exp(n/2)1-\exp\left(-n/2\right), the same stability bound holds for the average of the iterates x¯T\overline{x}^{T}, y¯T\overline{y}^{T}. ∎
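The coupling used in this proof is easy to reproduce numerically. The sketch below (ours; the toy loss f(x,z)=|⟨x,z⟩| and the parameter values are illustrative assumptions) couples the two trajectories through the shared index sequence, exactly as in the proof, and checks the per-realization bound (5):

```python
import numpy as np

def proj_ball(x, R):
    """Euclidean projection onto the ball B(0, R)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= R else x * (R / nrm)

def subgrad(x, z):
    """A subgradient of f(x, z) = |<x, z>|; Lipschitz constant ||z||."""
    return np.sign(x @ z) * z

rng = np.random.default_rng(1)
n, d, T, eta, R, L = 8, 4, 200, 0.02, 1.0, 1.0
S = [z / max(1.0, np.linalg.norm(z)) for z in rng.normal(size=(n, d))]
S2 = list(S)
S2[3] = -S[3]                          # neighboring dataset
idx = rng.integers(n, size=T - 1)      # shared sampling randomness i_t
x = np.zeros(d)
y = np.zeros(d)
max_gap, r = 0.0, []
for t in range(T - 1):
    i = idx[t]
    r.append(1.0 if not np.allclose(S[i], S2[i]) else 0.0)  # r_t indicator
    x = proj_ball(x - eta * subgrad(x, S[i]), R)
    y = proj_ball(y - eta * subgrad(y, S2[i]), R)
    max_gap = max(max_gap, np.linalg.norm(x - y))
# per-realization bound (5): Delta_T = 2 L eta sqrt(T-1) + 4 L eta sum_s r_s
Delta_T = 2 * L * eta * np.sqrt(T - 1) + 4 * L * eta * sum(r)
assert max_gap <= Delta_T + 1e-9
```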

3.3.2 Upper Bounds for the Fixed Permutation SGD

In Algorithm 3, we describe the fixed-permutation stochastic gradient descent. This algorithm works in epochs, where each epoch is a single pass on the data. The order in which data is used is the same across epochs, and is given by a permutation π\pi. The algorithm can be alternatively described without the epoch loop simply by

x^{t+1}=\mathsf{Proj}_{\mathcal{X}}\left(x^{t}-\eta_{t}\cdot\nabla f(x^{t},z_{\bm{\pi}(t\,\mbox{\footnotesize mod }n)})\right)\qquad(\forall t\in[nK]). (6)
Algorithm 3 𝒜𝖯𝖾𝗋𝖲𝖦𝖣{\mathcal{A}}_{\sf PerSGD}: Fixed Permutation SGD
0:  Dataset S=(z_{1},\ldots,z_{n})\in\mathcal{Z}^{n}, # rounds K, total # steps T\triangleq nK, step sizes \{\eta_{t}\}_{t\in[nK]}, permutation \pi:[n]\rightarrow[n]
1:  Choose arbitrary initial point xn+10𝒳x_{n+1}^{0}\in\mathcal{X}
2:  for k=1,,Kk=1,\ldots,K do
3:     x1k=xn+1k1x_{1}^{k}=x_{n+1}^{k-1}
4:     for t=1t=1 to nn  do
5:        xt+1k:=𝖯𝗋𝗈𝗃𝒳(xtkη(k1)n+tf(xtk,zπ(t)))x^{k}_{t+1}:=\mathsf{Proj}_{\mathcal{X}}\left(x^{k}_{t}-\eta_{(k-1)n+t}\cdot\nabla f(x^{k}_{t},z_{\pi(t)})\right)
6:     η¯k=t=1nη(k1)n+t\overline{\eta}_{k}=\sum_{t=1}^{n}\eta_{(k-1)n+t}
7:  return  x¯K=1k[K]η¯kk[K]η¯kx1k\overline{x}^{K}=\frac{1}{\sum_{k\in[K]}\overline{\eta}_{k}}\sum_{k\in[K]}\overline{\eta}_{k}\cdot x_{1}^{k}

We show that the same UAS bound as for batch gradient descent and sampling-with-replacement SGD holds for the fixed-permutation SGD. We also observe that a slightly tighter bound can be achieved for the expectation guarantee on UAS when \bm{\pi} is chosen uniformly at random; we defer the details to Theorem F.2 in the Appendix.

In the next result, we assume that the sequence of step sizes (ηt)t[T]\left(\eta_{t}\right)_{t\in[T]} is non-increasing, which is indeed the case for almost all known variants of SGD.

Theorem 3.4.

Let 𝒳(0,R)\mathcal{X}\subseteq\mathcal{B}(0,R), =𝒳0(L)\mathcal{F}=\mathcal{F}_{\mathcal{X}}^{0}(L), and π\pi be any permutation over [n][n]. Suppose the step sizes (ηt)t[T]\left(\eta_{t}\right)_{t\in[T]} form a non-increasing sequence. Then the uniform argument stability of the fixed-permutation SGD (Algorithm 3) is bounded as

supSSδ𝒜𝖯𝖾𝗋𝖲𝖦𝖣(S,S)min{2R,2L(t=1T1ηt2+2nt=1T1ηt)}.\sup\limits_{S\simeq S^{\prime}}\delta_{{\mathcal{A}}_{\sf PerSGD}}(S,S^{\prime})\leq\min\left\{2R,~{}2L\left(\sqrt{\sum_{t=1}^{T-1}\eta_{t}^{2}}+\frac{2}{n}\sum_{t=1}^{T-1}\eta_{t}\right)\right\}.
Proof.

Again, the bound of 2R2R is trivial. Now, we show the second term of the bound. Let SSS\simeq S^{\prime} be arbitrary neighboring datasets, x1=y1x^{1}=y^{1}, and consider the trajectories (xt)t[T],(yt)t[T](x^{t})_{t\in[T]},(y^{t})_{t\in[T]} associated with the fixed permutation stochastic subgradient method on datasets SS and SS^{\prime}, respectively. Since the datasets SSS\simeq S^{\prime} are arbitrary, we may assume without loss of generality that π\pi is the identity, whereas the perturbed coordinate 𝐢=i\mathbf{i}=i is arbitrary. We use Lemma 3.1 with ft()=f(,𝐳(tmod n))f_{t}(\cdot)=f\left(\cdot,\mathbf{z}_{\left(t~{}\mbox{\footnotesize mod }n\right)}\right) and ft()=f(,𝐳(tmod n))f_{t}^{\prime}(\cdot)=f\left(\cdot,\mathbf{z}^{\prime}_{\left(t~{}\mbox{\footnotesize mod }n\right)}\right). It is easy to see then that ft(xt)ft(xt)at\|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(x^{t})\|\leq a_{t}, with at=2L𝟏{(tmod n)=i}a_{t}=2L\cdot\mathbf{1}_{\{(t~{}\mbox{\footnotesize mod }n)=i\}}, where 𝟏{𝖼𝗈𝗇𝖽𝗂𝗍𝗂𝗈𝗇}\mathbf{1}_{\{\mathsf{condition}\}} is the indicator of 𝖼𝗈𝗇𝖽𝗂𝗍𝗂𝗈𝗇\mathsf{condition}. Hence, by Lemma 3.1, we have

xtyt\displaystyle\|x^{t}-y^{t}\| \displaystyle\leq 2Ls=1t1ηs2+4Lr=1(t1)/nηrn+i\displaystyle 2L\sqrt{\sum_{s=1}^{t-1}\eta_{s}^{2}}+4L\sum_{r=1}^{\lfloor(t-1)/n\rfloor}\eta_{rn+i}
\displaystyle\leq \displaystyle 2L\sqrt{\sum_{s=1}^{t-1}\eta_{s}^{2}}+\frac{4L}{n}\sum_{s=1}^{t-1}\eta_{s},

where at the last step we used the fact that (ηt)t[T](\eta_{t})_{t\in[T]} is non-increasing; namely, for any r1r\geq 1

ηrn+i1ns=(r1)n+i+1rn+iηs.\eta_{rn+i}\leq\frac{1}{n}\sum_{s=(r-1)n+i+1}^{rn+i}\eta_{s}.

Since the bound holds for all the iterates, the triangle inequality shows that it also holds for the output \overline{x}^{K}, which averages the iterates x_{1}^{k} over the K epochs. ∎
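Since the bound of Theorem 3.4 is deterministic, it can be checked directly on a toy instance. The sketch below (ours; the loss f(x,z)=|⟨x,z⟩| and the parameter values are illustrative assumptions) runs Algorithm 3 with constant step size on two neighboring datasets and verifies the stated bound on the output gap:

```python
import numpy as np

def proj_ball(x, R):
    """Euclidean projection onto the ball B(0, R)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= R else x * (R / nrm)

def subgrad(x, z):
    """A subgradient of f(x, z) = |<x, z>|."""
    return np.sign(x @ z) * z

def fixed_perm_sgd(S, K, eta, R, perm):
    """Algorithm 3 with constant step size: K epochs over a fixed
    permutation; returns the average of the epoch-start iterates x_1^k."""
    n, d = len(S), len(S[0])
    x = np.zeros(d)
    epoch_starts = []
    for _ in range(K):
        epoch_starts.append(x.copy())
        for t in range(n):
            x = proj_ball(x - eta * subgrad(x, S[perm[t]]), R)
    return np.mean(epoch_starts, axis=0)

rng = np.random.default_rng(2)
n, d, K, eta, R, L = 6, 3, 5, 0.03, 1.0, 1.0
S = [z / max(1.0, np.linalg.norm(z)) for z in rng.normal(size=(n, d))]
S2 = list(S)
S2[2] = -S[2]                       # neighboring dataset
perm = list(range(n))               # any fixed permutation
xbar = fixed_perm_sgd(S, K, eta, R, perm)
ybar = fixed_perm_sgd(S2, K, eta, R, perm)
T = n * K
bound = min(2 * R, 2 * L * (eta * np.sqrt(T - 1) + 2 * (T - 1) * eta / n))
assert np.linalg.norm(xbar - ybar) <= bound + 1e-9
```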

3.4 Discussion of the upper bounds: examples of specific instantiations

The upper bounds on stability from this section all behave very similarly. Let us explore the consequences of the obtained rates in terms of generalization bounds for different choices of the step size sequence. As a case study, we will consider excess risk bounds for the full-batch subgradient method (Algorithm 1), but similar conclusions hold for all the variants that we studied. We emphasize that, prior to this work, no dimension-independent bounds on the excess risk were known for this method (specifically, for nonsmooth losses and without explicit regularization).

To bound the excess risk, we will use the risk decomposition, eqn. (3). For simplicity, we will only study excess risk bounds in expectation (in Section 6 we consider stronger, high-probability, bounds). In this case, the stability-implies-generalization result (Theorem 2.2) simplifies to the following (Bousquet and Elisseeff, 2002; Hardt et al., 2016):

𝔼𝐒[F𝐒(𝒜(𝐒))F𝒟(𝒜(𝐒))]supSSδ𝒜(S,S).\mathbb{E}_{\mathbf{S}}[F_{\mathbf{S}}({\cal A}(\mathbf{S}))-F_{\cal D}({\cal A}(\mathbf{S}))]\leq\sup_{S\simeq S^{\prime}}\delta_{{\cal A}}(S,S^{\prime}).

Finally, the approximation error (Lemma 2.1) simplifies as well: it is upper bounded by 0 in expectation.

  • Fixed stepsize: Let ηtη>0\eta_{t}\equiv\eta>0. By Thm. 3.2, UAS is bounded by 4LTη+4LTηn4L\sqrt{T}\eta+\frac{4LT\eta}{n}. On the other hand, the standard analysis of subgradient descent guarantees that εopt(𝒜𝖦𝖣)R22ηT+ηL22\varepsilon_{\mbox{\footnotesize opt}}({\mathcal{A}}_{\sf GD})\leq\frac{R^{2}}{2\eta T}+\frac{\eta L^{2}}{2}. Therefore, by the expected risk decomposition (3)

    𝔼𝐒[εrisk(𝒜𝖦𝖣)]\displaystyle\mathbb{E}_{\mathbf{S}}[\varepsilon_{\mbox{\footnotesize{risk}}}({\mathcal{A}}_{\sf GD})] \displaystyle\leq 𝔼𝐒[εgen(𝒜𝖦𝖣)]+𝔼𝐒[εopt(𝒜𝖦𝖣)]4L2Tη+4L2Tηn+R22ηT+ηL22.\displaystyle\mathbb{E}_{\mathbf{S}}[\varepsilon_{\mathrm{\tiny gen}}({\mathcal{A}}_{\sf GD})]+\mathbb{E}_{\mathbf{S}}[\varepsilon_{\mathrm{\tiny opt}}({\mathcal{A}}_{\sf GD})]\leq 4L^{2}\sqrt{T}\eta+\frac{4L^{2}T\eta}{n}+\frac{R^{2}}{2\eta T}+\frac{\eta L^{2}}{2}.

    If we consider the standard method choice, η=R/[Ln]\eta=R/[L\sqrt{n}] and T=nT=n, the bound above is at least 4LR4LR (due to the first term). Consequently, upper bounds obtained from this approach are vacuous.

    In order to deal with the L\sqrt{T}\eta term, we need to substantially decrease the stepsize, together with running the algorithm for longer. For example, \eta=\frac{R}{4L\sqrt{Tn}} gives \mathbb{E}_{\mathbf{S}}[\varepsilon_{\mbox{\footnotesize{risk}}}({\mathcal{A}}_{\sf GD})]\leq\frac{2LR}{\sqrt{n}}+\frac{2LR\sqrt{n}}{\sqrt{T}}+\frac{LR\sqrt{T}}{n^{3/2}}, so by choosing T=n^{2} we obtain an expected excess risk bound of O(LR/\sqrt{n}), which is optimal. We will see next that it is not possible to obtain the same rate from this bound if T=o(n^{2}), for any choice of \eta>0. It is also easy to observe that, at least for constant stepsize, it is not possible to recover the optimal excess risk if T=\omega(n^{2}).

  • Varying stepsize: For a general sequence of stepsizes the optimization guarantees of Algorithm 1 are the following

    \mathbb{E}_{\mathbf{S}}[\varepsilon_{\mathrm{\tiny opt}}({\mathcal{A}}_{\sf GD})]\leq\frac{R^{2}}{2\sum_{t=1}^{T-1}\eta_{t}}+\frac{L^{2}\sum_{t=1}^{T-1}\eta_{t}^{2}}{2\sum_{t=1}^{T-1}\eta_{t}}.

    From the risk decomposition, we have

    𝔼𝐒[εrisk(𝒜𝖦𝖣)]\displaystyle\mathbb{E}_{\mathbf{S}}[\varepsilon_{\mbox{\footnotesize{risk}}}({\mathcal{A}}_{\sf GD})] \displaystyle\leq 𝔼𝐒[εgen(𝒜𝖦𝖣)]+𝔼𝐒[εopt(𝒜𝖦𝖣)]\displaystyle\mathbb{E}_{\mathbf{S}}[\varepsilon_{\mathrm{\tiny gen}}({\mathcal{A}}_{\sf GD})]+\mathbb{E}_{\mathbf{S}}[\varepsilon_{\mathrm{\tiny opt}}({\mathcal{A}}_{\sf GD})]
    \displaystyle\leq \displaystyle 4L^{2}\sqrt{\sum_{t=1}^{T-1}\eta_{t}^{2}}+\frac{4L^{2}}{n}\sum_{t=1}^{T-1}\eta_{t}+\frac{R^{2}}{2\sum_{t=1}^{T-1}\eta_{t}}+\frac{L^{2}\sum_{t=1}^{T-1}\eta_{t}^{2}}{2\sum_{t=1}^{T-1}\eta_{t}}.

    In fact, we can show that any choice of step sizes that makes the quantity above O(LR/n)O(LR/\sqrt{n}) must necessarily have T=Ω(n2)T=\Omega(n^{2}). Indeed, notice that in such case

    R22t=1T1ηt=O(LRn);\displaystyle\frac{R^{2}}{2\sum_{t=1}^{T-1}\eta_{t}}=O\Big{(}\frac{LR}{\sqrt{n}}\Big{)}; 4L2t=1T1ηt2=O(LRn)\displaystyle 4L^{2}\sqrt{\sum_{t=1}^{T-1}\eta_{t}^{2}}=O\Big{(}\frac{LR}{\sqrt{n}}\Big{)}
    t=1T1ηt=Ω(RnL);\displaystyle\Longleftrightarrow\qquad\sum_{t=1}^{T-1}\eta_{t}=\Omega\Big{(}\frac{R\sqrt{n}}{L}\Big{)}; t=1T1ηt2=O(RLn).\displaystyle\sqrt{\sum_{t=1}^{T-1}\eta_{t}^{2}}=O\Big{(}\frac{R}{L\sqrt{n}}\Big{)}.

    Therefore, by Cauchy-Schwarz inequality,

    Ω(RnL)=tηtTtηt2=O(RTLn)T=Ω(n2).\Omega\Big{(}\frac{R\sqrt{n}}{L}\Big{)}=\sum_{t}\eta_{t}\leq\sqrt{T}\sqrt{\sum_{t}\eta_{t}^{2}}=O\Big{(}\frac{R\sqrt{T}}{L\sqrt{n}}\Big{)}\quad\Longrightarrow\quad T=\Omega(n^{2}).
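The two step-size regimes above are easy to tabulate numerically. The following sketch (ours; the choice of n and the constant in the final assertion are illustrative) evaluates the terms of the expected excess-risk bound for full-batch GD with constant step size, confirming that the standard one-pass choice is vacuous while \eta=R/(4L\sqrt{Tn}) with T=n^{2} gives O(LR/\sqrt{n}):

```python
import math

def gd_risk_bound(n, T, eta, L=1.0, R=1.0):
    """Evaluate the expected excess-risk bound for full-batch GD with
    constant step size: stability terms plus optimization terms."""
    stability = 4 * L**2 * math.sqrt(T) * eta + 4 * L**2 * T * eta / n
    optimization = R**2 / (2 * eta * T) + eta * L**2 / 2
    return stability + optimization

n = 10_000
L = R = 1.0

# Standard one-pass choice: eta = R/(L sqrt(n)), T = n -> bound >= 4 L R (vacuous).
vacuous = gd_risk_bound(n, T=n, eta=R / (L * math.sqrt(n)))
assert vacuous >= 4 * L * R

# Smaller steps, longer run: eta = R/(4 L sqrt(T n)), T = n^2 -> O(L R / sqrt(n)).
T = n**2
tuned = gd_risk_bound(n, T=T, eta=R / (4 * L * math.sqrt(T * n)))
assert tuned <= 5 * L * R / math.sqrt(n)
```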

The high iteration complexity required to obtain optimal bounds motivates studying whether it is possible to improve our uniform argument stability bounds. We will show that, unfortunately, they are sharp up to absolute constant factors.

4 Lower Bounds on Uniform Argument Stability

In this section we provide matching lower bounds for the previously studied first-order methods. These lower bounds show that our analyses are tight, up to absolute constant factors.

We note that it is possible to prove a general-purpose lower bound on stability by appealing to sample complexity lower bounds for stochastic convex optimization (Nemirovsky and Yudin,, 1983). This approach was first studied for the smooth convex case in (Chen et al.,, 2018), where the resulting lower bounds are sharp; in the nonsmooth case, however, they are very far from the bounds of the previous section. The idea is that, for a sufficiently small step size, a first-order method must incur \Omega(LT\eta/n) uniform argument stability; this term can be added to any other lower bound we prove for specific algorithms that enjoy rates comparable to those of gradient descent. Details of this lower bound can be found in Appendix C.

Next, we prove finer lower bounds on the UAS of specific algorithms. For this, note that the objective functions we use are polyhedral, so the subdifferential is a polytope at any point. Since the algorithm should work for any oracle, we let the subgradients provided be extreme points, \nabla f(x,z)\in\mbox{ext}(\partial f(x,z)). Moreover, we can make adversarial choices of the chosen subgradient.

4.1 Lower Bounds for Full Batch GD

Theorem 4.1.

Let \mathcal{X}={\cal B}(0,1), {\cal F}={\cal F}_{\mathcal{X}}^{0}(1) and d\geq\min\{T,1/\eta^{2}\}. For the full-batch gradient descent (Alg. 1) with constant step size \eta>0, there exist S\simeq S^{\prime} such that the UAS is lower bounded as \delta_{{\mathcal{A}}_{\sf GD}}(S,S^{\prime})=\Omega(\min\{1,\eta\sqrt{T}+\eta T/n\}).

The proof of this result is deferred to Appendix D, due to space considerations.

4.2 Lower Bounds for SGD Sampled with Replacement

We use a similar construction as from the previous result to prove a sharp lower bound on the uniform argument stability for stochastic gradient descent where the sampling is with replacement.

Theorem 4.2.

Let 𝒳=(0,1)\mathcal{X}={\cal B}(0,1), =𝒳0(1){\cal F}={\cal F}_{\mathcal{X}}^{0}(1), and dmin{T,1/η2}d\geq\min\{T,1/\eta^{2}\}. For the sampled with replacement stochastic gradient descent (Algorithm 2) with constant step size η>0\eta>0, there exist SSS\simeq S^{\prime} such that the uniform argument stability satisfies 𝔼[δ𝒜𝗋𝖲𝖦𝖣(S,S)]=Ω(min{1,Tn}ηT+ηTn).\mathbb{E}[\delta_{{\mathcal{A}}_{\sf rSGD}}(S,S^{\prime})]=\Omega\Big{(}\min\Big{\{}1,\frac{T}{n}\Big{\}}\eta\sqrt{T}+\frac{\eta T}{n}\Big{)}.

Proof.

Let D\triangleq\min\{T,1/\eta^{2}\}\leq d, let \nu>0, and let K\geq\sqrt{D}. Consider {\cal Z}=\{0,1\} and define

f(x,z)={max{0,x1ν,,xDν} if z=0r,x/K if z=1,f(x,z)=\left\{\begin{array}[]{ll}\max\{0,x_{1}-\nu,\ldots,x_{D}-\nu\}&\mbox{ if }z=0\\ \langle r,x\rangle/K&\mbox{ if }z=1,\end{array}\right.

where r=(-1,\ldots,-1,0,\ldots,0) (i.e., supported on the first D coordinates). Let (\mathbf{i}_{t})_{t\geq 0}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}\mbox{Unif}([n]) be the random sequence of indices used by the algorithm. Let S=(1,0,\ldots,0) and S^{\prime}=(0,0,\ldots,0) be neighboring datasets, and denote by (x^{t})_{t} and (y^{t})_{t} the respective stochastic gradient descent trajectories on S and S^{\prime}, initialized at x^{1}=y^{1}=0. It is easy to see that under S^{\prime}, we have y^{t}=0 for all t\in[T]. Now, suppose that \nu<\eta/K. Then x^{t}=0 precisely for t\leq\tau, where \tau:=\inf\{t\geq 1:\,\mathbf{i}_{t}=1\}. After time \tau, x^{\tau+1}=-\eta r/K, and consequently x^{\tau+1+j}=-\frac{\eta\mathbf{k}(\tau+j)}{K}r-\eta\sum_{s=1}^{j-\mathbf{k}(\tau+j)+1}e_{s} for all j\in[D+\mathbf{k}(\tau+j)-1], where \mathbf{k}(t)\triangleq|\{s\in[t]:\mathbf{i}_{s}=1\}|. Note that, conditioned on any fixed value of \tau, \mathbf{k}(\tau+j)\leq j+1.

Let δT=xTyT=xT\delta_{T}=\|x^{T}-y^{T}\|=\|x^{T}\|. Hence, we have δTηs=1Tτ𝐤(T1)esη𝐤(T1)D/KηT𝐤(T1)τηTD/K\delta_{T}\geq\eta\|\sum_{s=1}^{T-\tau-\mathbf{k}(T-1)}e_{s}\|-\eta\mathbf{k}(T-1)\sqrt{D}/K\geq\eta\sqrt{T-\mathbf{k}(T-1)-\tau}-\eta T\sqrt{D}/K. Let 𝐤=𝐤(T1)\mathbf{k}=\mathbf{k}(T-1) from now on. Note that conditioned on any value for τ,\tau,  𝐤1\mathbf{k}-1 is a binomial random variable taking values in {0,,T1τ}\{0,\ldots,T-1-\tau\}. Hence, conditioned on τ=t\tau=t, by the binomial tail, we always have [𝐤>T/2|τ=t]exp(T/4)\mathbb{P}[\mathbf{k}>T/2~{}|~{}\tau=t]\leq\exp(-T/4) for all t[T]t\in[T] (in particular, this conditional probability is zero when tT/2t\geq T/2). Also, note that the same upper bound is valid without conditioning on τ\tau. Hence, by the law of total expectation, we have

𝔼[δT]\displaystyle\mathbb{E}[\delta_{T}] =\displaystyle= 𝔼[δT|𝐤T/2][𝐤T/2]+𝔼[δT|𝐤>T/2][𝐤>T/2]c𝔼[δT|𝐤T/2]\displaystyle\mathbb{E}[\delta_{T}|~{}\mathbf{k}\leq T/2]\cdot\mathbb{P}[\mathbf{k}\leq T/2]+\mathbb{E}[\delta_{T}|~{}\mathbf{k}>T/2]\cdot\mathbb{P}[\mathbf{k}>T/2]\geq c\,\mathbb{E}[\delta_{T}|~{}\mathbf{k}\leq T/2]

where c=(1exp(T/4))=Ω(1)c=(1-\exp(-T/4))=\Omega(1). Hence,

𝔼[δT]\displaystyle\mathbb{E}[\delta_{T}] \displaystyle\geq ct=1T/2𝔼[δT|τ=t,𝐤T/2][τ=t|𝐤T/2]\displaystyle c\sum_{t=1}^{T/2}\mathbb{E}[\delta_{T}|\tau=t,~{}\mathbf{k}\leq T/2]\,\mathbb{P}[\tau=t|~{}\mathbf{k}\leq T/2]
\displaystyle\geq c2t=1T/2𝔼[δT|τ=t,𝐤T/2][τ=t]\displaystyle c^{2}\sum_{t=1}^{T/2}\mathbb{E}[\delta_{T}|\tau=t,~{}\mathbf{k}\leq T/2]\,\mathbb{P}[\tau=t]
\displaystyle\geq c2ηnt=1T/2TT/2t(11n)t1c2ηDT/K.\displaystyle c^{2}\frac{\eta}{n}\sum_{t=1}^{T/2}\sqrt{T-T/2-t}\big{(}1-\frac{1}{n}\big{)}^{t-1}-c^{2}\eta\sqrt{D}T/K.

We choose KK sufficiently large such that ηDT/K=o(ηmin{T3/2/n,T})\eta\sqrt{D}T/K=o(\eta\min\{T^{3/2}/n,\sqrt{T}\}). Hence, we have

𝔼[δT]{c2ηnt=1T/2t(11n)n2c2ηDTKc2ηe1nt=1T/2to(ηT3/2n)=Ω(ηT3/2n) if Tnc2ηnt=1n/4T/2n/4e1o(ηT)=Ω(ηT) if T>n.\mathbb{E}[\delta_{T}]\geq\left\{\begin{array}[]{ll}\frac{c^{2}\eta}{n}\sum_{t=1}^{T/2}\sqrt{t}\big{(}1-\frac{1}{n}\big{)}^{n-2}-\frac{c^{2}\eta\sqrt{D}T}{K}\geq\frac{c^{2}\eta e^{-1}}{n}\sum_{t=1}^{T/2}\sqrt{t}-o(\frac{\eta T^{3/2}}{n})=\Omega\Big{(}\frac{\eta T^{3/2}}{n}\Big{)}&\mbox{ if }T\leq n\\ \frac{c^{2}\eta}{n}\sum_{t=1}^{n/4}\sqrt{T/2-n/4}\,e^{-1}-o(\eta\sqrt{T})=\Omega(\eta\sqrt{T})&\mbox{ if }T>n.\end{array}\right.

This gives a lower bound on 𝔼[δT]\mathbb{E}[\delta_{T}]. Proving that x¯T\overline{x}^{T} satisfies the same lower bound is analogous to the proof in Theorem 4.1. Finally, Ω(ηT/n)\Omega(\eta T/n) can be added to the lower bound by Appendix C.∎
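The construction in this proof can be simulated directly. The sketch below (ours; all parameter values are illustrative assumptions) implements the adversarial subgradient oracle for the hard instance and observes the \Omega(\eta\sqrt{T}) growth of the trajectory gap in the T>n regime; projection onto the unit ball is omitted since, for these parameters, the iterate norm stays well below 1:

```python
import numpy as np

def hard_subgrad(x, z, r, K, nu):
    """Adversarial subgradient oracle for the lower-bound instance:
    f(x, 0) = max{0, x_1 - nu, ..., x_D - nu},  f(x, 1) = <r, x>/K."""
    if z == 1:
        return r / K
    g = np.zeros_like(x)
    if x.max() > nu:                 # max branch active: return the
        g[int(np.argmax(x))] = 1.0   # extreme point e_{argmax} (ties -> smallest index)
    return g

rng = np.random.default_rng(3)
n, T, eta, K = 5, 200, 0.02, 1e9
D = min(T, int(1 / eta**2))          # here D = T
nu = eta / (2 * K)                   # requires nu < eta / K
r = -np.ones(D)
S = np.zeros(n)
S[0] = 1                             # neighboring S' = all zeros keeps y^t = 0
x = np.zeros(D)
for t in range(T - 1):
    z = S[rng.integers(n)]
    x = x - eta * hard_subgrad(x, z, r, K, nu)
delta_T = np.linalg.norm(x)          # = ||x^T - y^T|| since y^T = 0
# each non-bumped step after tau pushes a fresh coordinate to about -eta,
# so the gap grows like eta * sqrt(T), matching the Omega(eta sqrt(T)) regime
assert delta_T >= 0.5 * eta * np.sqrt(T / 2)
```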

4.3 Lower Bounds for the Fixed Permutation Stochastic Gradient Descent

Finally, we study fixed permutation SGD. The proof of this result is deferred to Appendix E.

Theorem 4.3.

Let 𝒳=(0,1)\mathcal{X}={\cal B}(0,1), =𝒳0(1){\cal F}={\cal F}_{\mathcal{X}}^{0}(1) and dmin{T,1/η2}d\geq\min\{T,1/\eta^{2}\}. For the fixed permutation stochastic gradient descent (Algorithm 3) with constant step size η>0\eta>0, there exist SSS\simeq S^{\prime} such that the uniform argument stability is lower bounded by 𝔼[δ𝒜𝖯𝖾𝗋𝖲𝖦𝖣(S,S)]=Ω(min{1,Tn}ηT+ηTn).\mathbb{E}[\delta_{{\mathcal{A}}_{\sf PerSGD}}(S,S^{\prime})]=\Omega\Big{(}\min\Big{\{}1,\frac{T}{n}\Big{\}}\eta\sqrt{T}+\frac{\eta T}{n}\Big{)}.

5 Generalization Guarantees for Multi-pass SGD

One important implication of our stability bounds is that they provide non-trivial generalization error guarantees for multi-pass SGD on nonsmooth losses. Multi-pass SGD is one of the most extensively used settings of SGD in practice, where SGD is run for K passes (epochs) over the dataset (namely, the number of iterations is T=Kn). To the best of our knowledge, aside from the dimension-dependent bounds based on uniform convergence (Shalev-Shwartz et al.,, 2010), no generalization error guarantees were previously known for the multi-pass setting on general nonsmooth convex losses. Given our uniform stability upper bounds, we can prove the following generalization error guarantees for the multi-pass setting of sampling-with-replacement SGD. Analogous results can be obtained for fixed-permutation SGD.

Theorem 5.1.

Running Algorithm 2 for KK passes (i.e., for T=KnT=Kn iterations) with constant stepsize ηt=η>0\eta_{t}=\eta>0 yields the following generalization error guarantees:

|𝔼𝒜𝗋𝖲𝖦𝖣[εgen(𝒜𝗋𝖲𝖦𝖣)]|4L2η(Kn+K),|\mathbb{E}_{{\mathcal{A}}_{\sf rSGD}}[\varepsilon_{\mathrm{\tiny gen}}({\mathcal{A}}_{\sf rSGD})]|\leq 4L^{2}\eta\left(\sqrt{Kn}+K\right),

and there exists c>0c>0, such that for any 0<θ<10<\theta<1, with probability 1θexp(n/2)\geq 1-\theta-\exp(-n/2),

|εgen(𝒜𝗋𝖲𝖦𝖣)|c(L2η(Kn+K)log(n)log(n/θ)+LRlog(1/θ)n).|\varepsilon_{\mathrm{\tiny gen}}({\mathcal{A}}_{\sf rSGD})|\leq c~{}\Bigg{(}L^{2}\eta\left(\sqrt{Kn}+K\right)\log(n)\log(n/\theta)+LR\sqrt{\frac{\log(1/\theta)}{n}}~{}\Bigg{)}.
Proof.

First, by the expectation guarantee on UAS given in Theorem 3.3 together with the fact that the losses are LL-Lipschitz, it follows that Algorithm 2 (when run for KK passes with constant stepsize η\eta) is γ\gamma-uniformly stable, where γ=4L2(ηKn+ηK)\gamma=4L^{2}\left(\eta\sqrt{Kn}+\eta K\right). Then, by (Hardt et al.,, 2016, Thm. 2.2), we have

|𝔼𝒜𝗋𝖲𝖦𝖣[εgen(𝒜𝗋𝖲𝖦𝖣)]|γ.|\mathbb{E}_{{\mathcal{A}}_{\sf rSGD}}[\varepsilon_{\mathrm{\tiny gen}}({\mathcal{A}}_{\sf rSGD})]|\leq\gamma.

For the high-probability bound, we combine the high-probability guarantee on UAS given in Theorem 3.3 with Theorem 2.2 to get the claimed bound. ∎

These bounds on generalization error can be used to obtain excess risk bounds using the standard risk decomposition (see (3)). In practical scenarios where one can certify small optimization error for multi-pass SGD, Thm. 5.1 can be used to readily estimate the excess risk. In Section 6.2 we provide worst-case analysis showing that multi-pass SGD is guaranteed to attain the optimal excess risk of LR/n\approx LR/\sqrt{n} within nn passes (with appropriately chosen constant stepsize).
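As a back-of-the-envelope consequence (a sketch of ours, assuming the step-size choice of Section 3.4 carries over with T=Kn), the stability parameter \gamma of Theorem 5.1 evaluates to LR(1/\sqrt{n}+\sqrt{K}/n), so K=n passes give \gamma=2LR/\sqrt{n}, consistent with the optimal-rate claim above:

```python
import math

def multipass_gamma(K, n, eta, L=1.0):
    """Uniform-stability parameter of K-pass SGD with constant step eta
    (Theorem 5.1): gamma = 4 L^2 eta (sqrt(K n) + K)."""
    return 4 * L**2 * eta * (math.sqrt(K * n) + K)

# With T = K n and eta = R / (4 L sqrt(T n)), gamma = L R (1/sqrt(n) + sqrt(K)/n).
L = R = 1.0
n = 10_000
K = n                                   # n passes
eta = R / (4 * L * math.sqrt(K * n * n))
gamma = multipass_gamma(K, n, eta, L)
assert abs(gamma - 2 * L * R / math.sqrt(n)) < 1e-12
```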

6 Implications of Our Stability Bounds

6.1 Differentially Private Nonsmooth Stochastic Convex Optimization

Now we show an application of our stability upper bound to differentially private stochastic convex optimization (DP-SCO). Here, the input sample to the stochastic convex optimization algorithm is a sensitive and private data set, thus the algorithm is required to satisfy the notion of (α,β)(\alpha,\beta)-differential privacy. A randomized algorithm 𝒜{\mathcal{A}} is (α,β)(\alpha,\beta)-differentially private if, for any pair of datasets SSS\simeq S^{\prime}, and for all events 𝒪\mathcal{O} in the output range of 𝒜{\mathcal{A}}, we have [𝒜(S)𝒪]eα[𝒜(S)𝒪]+β,\underset{}{\mathbb{P}}\left[{\mathcal{A}}(S)\in\mathcal{O}\right]\leq e^{\alpha}\cdot\underset{}{\mathbb{P}}\left[{\mathcal{A}}(S^{\prime})\in\mathcal{O}\right]+\beta, where the probability is taken over the random coins of 𝒜{\mathcal{A}} (Dwork et al., 2006b, ; Dwork et al., 2006a, ). For meaningful privacy guarantees, the typical settings of the privacy parameters are α<1\alpha<1 and β1/n\beta\ll 1/n.

Using our UAS upper bounds, we show that a simple variant of noisy SGD (Bassily et al.,, 2014), that requires only n2n^{2} gradient computations, yields the optimal excess population risk for DP-SCO. In terms of running time, this is a small improvement over the algorithm of Feldman et al., (2020) for the nonsmooth case, which requires O(n2log1/β)O(n^{2}\sqrt{\log 1/\beta}) gradient computations. More importantly, our algorithm is substantially simpler. For comparison, the algorithm in (Feldman et al.,, 2020) is based on a multi-phase SGD, where in each phase a separate regularized ERM problem is solved. To ensure privacy, the output of each phase is perturbed with an appropriately chosen amount of noise before being used as the initial point for the next phase.

The description of the algorithm is given in Algorithm 4.

Algorithm 4 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD}: Noisy SGD for convex losses
0:  Private dataset S=(z1,,zn)𝒵nS=(z_{1},\ldots,z_{n})\in\mathcal{Z}^{n}, step size η\eta;  privacy parameters α1,β1/n\alpha\leq 1,~{}\beta\ll 1/n
1:  Set noise variance σ2:=8L2log(1/β)α2\sigma^{2}:=\frac{8\,L^{2}\,\log(1/\beta)}{\alpha^{2}}
2:  Choose an arbitrary initial point x1𝒳x^{1}\in\mathcal{X}
3:  for t=1t=1 to n21n^{2}-1  do
4:     Sample 𝐢tUnif([n])\mathbf{i}_{t}\sim\mbox{Unif}([n])
5:     x^{t+1}:=\mathsf{Proj}_{\mathcal{X}}\left(x^{t}-\eta\cdot\left(\nabla f(x^{t},z_{\mathbf{i}_{t}})+\mathbf{G}_{t}\right)\right), where \mathbf{G}_{t}\sim\mathcal{N}\left(\mathbf{0},\sigma^{2}\mathbb{I}_{d}\right) is drawn independently at each iteration
6:  return  x¯=1n2t=1n2xt\overline{x}=\frac{1}{n^{2}}\sum_{t=1}^{n^{2}}x^{t}
Theorem 6.1 (Privacy guarantee of 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD}).

Algorithm 4 is (α,β)(\alpha,\beta)-differentially private.

The proof of the theorem follows the same lines as (Bassily et al.,, 2014, Theorem 2.1), but we replace their privacy analysis of the Gaussian mechanism with the tighter Moments Accountant analysis of Abadi et al., (2016).
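For concreteness, the following Python sketch (ours; the toy loss, dataset, and dimensions are illustrative assumptions, and no formal privacy accounting is performed here) implements the noisy projected SGD of Algorithm 4 with the stated noise variance and step size:

```python
import numpy as np

def proj_ball(x, R):
    """Euclidean projection onto the ball B(0, R)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= R else x * (R / nrm)

def noisy_sgd(S, subgrad, eta, R, L, alpha, beta, rng):
    """Algorithm 4 sketch: projected SGD over T = n^2 steps with Gaussian
    noise of variance sigma^2 = 8 L^2 log(1/beta) / alpha^2 per coordinate."""
    n, d = len(S), len(S[0])
    sigma = np.sqrt(8 * L**2 * np.log(1 / beta) / alpha**2)
    x = np.zeros(d)
    iterates = [x.copy()]
    for _ in range(n**2 - 1):
        z = S[rng.integers(n)]
        g = subgrad(x, z) + sigma * rng.standard_normal(d)  # noisy subgradient
        x = proj_ball(x - eta * g, R)
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)

rng = np.random.default_rng(4)
n, d, R, L, alpha, beta = 20, 5, 1.0, 1.0, 1.0, 1e-4
S = [z / max(1.0, np.linalg.norm(z)) for z in rng.normal(size=(n, d))]
# step size from Theorem 6.2
eta = R / (L * n * max(np.sqrt(n), np.sqrt(d * np.log(1 / beta)) / alpha))
subgrad = lambda x, z: np.sign(x @ z) * z     # toy loss |<x, z>|, L = 1
xbar = noisy_sgd(S, subgrad, eta, R, L, alpha, beta, rng)
assert np.linalg.norm(xbar) <= R + 1e-9
```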

Theorem 6.2 (Excess risk of {\mathcal{A}}_{\sf NSGD}).

In Algorithm 4, let η=R/(Lnmax(n,dlog(1/β)α))\eta=R/\Big{(}L\cdot n\cdot\max\big{(}\sqrt{n},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha}\big{)}\Big{)}. Then, for any θ(6exp(n/2),1)\theta\in(6\exp(-n/2),1), with probability at least 1θ1-\theta over the randomness in both the sample and the algorithm, we have

ε𝗋𝗂𝗌𝗄(𝒜𝖭𝖲𝖦𝖣)\displaystyle\varepsilon_{\mathsf{\footnotesize{risk}}}\left({\mathcal{A}}_{\sf NSGD}\right) =RLO(max(log(n)log(n/θ)n,dlog(1/β)αn))\displaystyle=RL\cdot O\left(\max\left(\frac{\log(n)\log(n/\theta)}{\sqrt{n}},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha\,n}\right)\right)
Proof.

Fix any confidence parameter θ6exp(n/2)\theta\geq 6\exp(-n/2). First, for any data set S𝒵nS\in\mathcal{Z}^{n} and any step size η>0,\eta>0, by Lemma H.1 in Appendix H, we have the following high-probability guarantee on the training error of 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD}:

With probability at least 1θ/3,1-\theta/3, we have

εopt(𝒜𝖭𝖲𝖦𝖣)FS(x¯)minx𝒳FS(x)\displaystyle\varepsilon_{\mbox{\footnotesize opt}}({\mathcal{A}}_{\sf NSGD})\triangleq F_{S}(\overline{x})-\min\limits_{x\in\mathcal{X}}F_{S}(x) R2ηn2+7RLlog(1/β)log(12/θ)αn+ηL2(32dlog(1/β)α2+1)\displaystyle\leq\frac{R^{2}}{\,\eta\,n^{2}}+7RL\frac{\sqrt{\log(1/\beta)\log(12/\theta)}}{\alpha n}+\eta\,L^{2}\Big{(}32\frac{d\,\log(1/\beta)}{\alpha^{2}}+1\Big{)}

where the probability is over the sampling in step 4 and the independent Gaussian noise vectors 𝐆1,,𝐆n2\mathbf{G}_{1},\ldots,\mathbf{G}_{n^{2}}. Given the setting of η\eta in the theorem, we get

εopt(𝒜𝖭𝖲𝖦𝖣)\displaystyle\varepsilon_{\mbox{\footnotesize opt}}({\mathcal{A}}_{\sf NSGD}) 8RLmax(1n,dlog(1/β)αn)+33RLdlog(1/β)α2nmax(n,dlog(1/β)α)\displaystyle\leq 8RL\max\Big{(}\frac{1}{\sqrt{n}},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha\,n}\Big{)}+33RL\,\frac{d\,\frac{\log(1/\beta)}{\alpha^{2}}}{n\cdot\max\Big{(}\sqrt{n},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha}\Big{)}}
8RLmax(1n,dlog(1/β)αn)+33RLdlog(1/β)nα\displaystyle\leq 8RL\max\Big{(}\frac{1}{\sqrt{n}},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha\,n}\Big{)}+33RL\,\frac{\sqrt{d\,\log(1/\beta)}}{n\,\alpha}
=RLO(max(1n,dlog(1/β)nα)).\displaystyle=RL\cdot O\Bigg{(}\max\Big{(}\frac{1}{\sqrt{n}},~{}\frac{\sqrt{d\,\log(1/\beta)}}{n\,\alpha}\Big{)}\Bigg{)}. (7)

Next, it is not hard to show that 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD} attains the same UAS bound as 𝒜𝗋𝖲𝖦𝖣{\mathcal{A}}_{\sf rSGD} (Theorem 3.3). Indeed, the only difference is the noise addition in the gradient step; however, this does not impact the stability analysis. This is because the sequence of noise vectors {𝐆1,,𝐆n2}\{\mathbf{G}_{1},\ldots,\mathbf{G}_{n^{2}}\} is the same for the trajectories corresponding to the pair S,SS,~{}S^{\prime} of neighboring datasets. Hence, the argument follows the same lines as the proof of Theorem 3.3 since the noise terms cancel out. Thus, we conclude that for any pair SSS\simeq S^{\prime} of neighboring datasets, with probability at least 1exp(n/2)1θ/61-\exp(-n/2)\geq 1-\theta/6 (over the randomness of 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD}), the uniform argument stability of 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD} is bounded as: δ𝒜𝖭𝖲𝖦𝖣4Lη(T+Tn),\delta_{{\mathcal{A}}_{\sf NSGD}}\leq 4L\eta\left(\sqrt{T}+\frac{T}{n}\right), where T=n2T=n^{2}. Given the setting of η\eta in the theorem, this bound reduces to 8R/max(n,dlog(1/β)α)8R/\max\big{(}\sqrt{n},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha}\big{)}.

Hence, by Theorem 2.2, with probability at least 1θ/31-\theta/3 (over the randomness in both the i.i.d. dataset SS and the algorithm), the generalization error of 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD} is bounded as

εgen(𝒜𝖭𝖲𝖦𝖣)\displaystyle\varepsilon_{\mbox{\footnotesize gen}}({\mathcal{A}}_{\sf NSGD}) 8cRLlog(n)log(6n/θ)max(n,dlog(1/β)α)+clog(6/θ)n=RLO(log(n)log(n/θ)n),\displaystyle\leq\frac{8c\,RL\,\log(n)\log(6n/\theta)}{\max\Big{(}\sqrt{n},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha}\Big{)}}+\frac{c\,\sqrt{\log(6/\theta)}}{\sqrt{n}}=RL\cdot O\left(\frac{\log(n)\log(n/\theta)}{\sqrt{n}}\right), (8)

where cc in the first bound is a universal constant.

Now, using (7), (8), and Lemma 2.1, we finally conclude that with probability at least 1θ1-\theta (over the randomness in the sample SS and the internal randomness of 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD}), the excess population risk of 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD} is bounded as

ε𝗋𝗂𝗌𝗄(𝒜𝖭𝖲𝖦𝖣)\displaystyle\varepsilon_{\mathsf{\footnotesize{risk}}}({\mathcal{A}}_{\sf NSGD}) εopt(𝒜𝖭𝖲𝖦𝖣)+εgen(𝒜𝖭𝖲𝖦𝖣)+εapprox\displaystyle\leq\varepsilon_{\mbox{\footnotesize opt}}({\mathcal{A}}_{\sf NSGD})+\varepsilon_{\mbox{\footnotesize gen}}({\mathcal{A}}_{\sf NSGD})+\varepsilon_{\mbox{\tiny approx}}
=RLO(max(1n,dlog(1/β)αn)+log(n)log(n/θ)n+log(1/θ)n)\displaystyle=RL\cdot O\Bigg{(}\max\Big{(}\frac{1}{\sqrt{n}},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha\,n}\Big{)}+\frac{\log(n)\log(n/\theta)}{\sqrt{n}}+\frac{\sqrt{\log(1/\theta)}}{\sqrt{n}}\Bigg{)}
=RLO(max(log(n)log(n/θ)n,dlog(1/β)αn)),\displaystyle=RL\cdot O\Bigg{(}\max\left(\frac{\log(n)\log(n/\theta)}{\sqrt{n}},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha\,n}\right)\Bigg{)},

which completes the proof. ∎
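The noise-cancellation step in the stability argument above can be seen concretely in a purely illustrative unconstrained linear setting, where gradients do not depend on the iterate: coupling two trajectories with the same index sequence and the same Gaussian draws leaves a trajectory gap identical to the noiseless one. The setup below is a hypothetical sanity check (our own instance), not part of the proof.

```python
import numpy as np

# For linear losses f(x, z) = <z, x>, the gradient is z regardless of x, so two
# trajectories run with the SAME noise sequence and the SAME index sequence on
# neighboring datasets differ only through the replaced data point: the shared
# noise cancels exactly in the difference of the trajectories.
rng = np.random.default_rng(0)
n, d, T, eta = 10, 4, 50, 0.05
S = rng.normal(size=(n, d))
Sp = S.copy()
Sp[0] = rng.normal(size=d)              # neighboring dataset: one point replaced
idx = rng.integers(n, size=T)           # shared sampling i_1, ..., i_T
noise = rng.normal(size=(T, d))         # shared Gaussian noise G_1, ..., G_T

def run(data, with_noise):
    x = np.zeros(d)
    for t in range(T):
        g = data[idx[t]] + (noise[t] if with_noise else 0.0)
        x = x - eta * g
    return x

gap_noisy = run(S, True) - run(Sp, True)
gap_clean = run(S, False) - run(Sp, False)
assert np.allclose(gap_noisy, gap_clean)  # the shared noise cancels out
```

In the constrained case the projection prevents exact cancellation, but since projections are nonexpansive the same one-step stability bounds go through, which is precisely the observation used above.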

Remark 6.3.

Using the expectation guarantee on UAS given in Theorem 3.3 and following steps similar to the analysis above, we can also show that the expected excess population risk of 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD} is bounded as:

𝔼[ε𝗋𝗂𝗌𝗄(𝒜𝖭𝖲𝖦𝖣)]=RLO(max(1n,dlog(1/β)αn)).\underset{}{\mathbb{E}}\left[\varepsilon_{\mathsf{\footnotesize{risk}}}\left({\mathcal{A}}_{\sf NSGD}\right)\right]=RL\cdot O\left(\max\left(\frac{1}{\sqrt{n}},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha\,n}\right)\right).

6.2 Nonsmooth Stochastic Convex Optimization with Multi-pass SGD

Another application of our results concerns obtaining optimal excess risk for stochastic nonsmooth convex optimization via multi-pass SGD. It is known that one-pass SGD is guaranteed to have optimal excess risk, which can be shown via martingale arguments that trace back to the stochastic approximation literature (Robbins and Monro,, 1951; Kiefer and Wolfowitz,, 1952).

Using our UAS bound, we show that Algorithms 2 and 3 can recover nearly-optimal high-probability excess risk bounds by making nn passes over the data. Analogous bounds hold for Algorithm 1; however, these are less interesting from a computational efficiency perspective.

In short, we prove the following excess population risk bounds; their analyses are deferred to Appendix G:

εrisk(𝒜𝗋𝖲𝖦𝖣)4RLn;εrisk(𝒜𝖯𝖾𝗋𝖲𝖦𝖣)8LRn.\varepsilon_{\mbox{\footnotesize{risk}}}({\mathcal{A}}_{\sf rSGD})\leq\frac{4RL}{\sqrt{n}}\qquad;\qquad\varepsilon_{\mbox{\footnotesize risk}}({\mathcal{A}}_{\sf PerSGD})\leq\frac{8LR}{\sqrt{n}}.
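The shape of these bounds comes from balancing optimization error against generalization error; a heuristic sketch with constants suppressed (the precise step size choices are in Appendix G) is:

```latex
% With T = n^2 steps and a constant step size \eta:
\varepsilon_{\mathrm{opt}} \lesssim \frac{R^{2}}{\eta T}+\eta L^{2},
\qquad
\delta_{\mathrm{UAS}} \lesssim L\eta\Big(\sqrt{T}+\frac{T}{n}\Big)=2L\eta n,
\qquad
\varepsilon_{\mathrm{gen}} \lesssim L\,\delta_{\mathrm{UAS}}.
% Choosing \eta = R/(L n^{3/2}) balances the two terms:
\varepsilon_{\mathrm{opt}} \lesssim \frac{RL}{\sqrt{n}},
\qquad
\varepsilon_{\mathrm{gen}} \lesssim \frac{RL}{\sqrt{n}},
\qquad\Longrightarrow\qquad
\varepsilon_{\mathrm{risk}} = O\Big(\frac{RL}{\sqrt{n}}\Big).
```

That is, running for T=n2T=n^{2} iterations with a step size smaller by a factor n\sqrt{n} than the single-pass choice keeps the UAS of order 1/n1/\sqrt{n} while driving the optimization error down to the statistical rate.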

7 Discussion and Open Problems

In this work we provide sharp upper and lower bounds on uniform argument stability for the (stochastic) subgradient method in stochastic nonsmooth convex optimization. Our lower bounds show inherent limitations of stability bounds compared to the smooth convex case; however, we can still derive optimal population risk bounds by reducing the step size and running the algorithms for a larger number of iterations. We provide applications of this idea to differentially private noisy SGD, and to two versions of SGD (sampling-with-replacement and fixed-permutation SGD).

The first open problem concerns lower bounds that are robust to general forms of algorithmic randomization. Unfortunately, the methods presented here are not robust in this respect, since random initialization would prevent the trajectories from reaching the region where the objective is highly nonsmooth (or would reach it in a way that does not increase the UAS). One may try to strengthen the lower bound by using a random rotation of the objective; however, this leads to an uninformative lower bound. Finding distributional constructions for lower bounds against randomization is a very interesting future direction.

Our privacy application provides optimal risk for an algorithm that runs for n2n^{2} steps, which is impractical for large datasets. Other algorithms, e.g. in (Feldman et al.,, 2020), run into similar limitations. Proving that quadratic running time is necessary for general nonsmooth DP-SCO is a very interesting open problem that can be formalized in terms of the oracle complexity of stochastic convex optimization (Nemirovsky and Yudin,, 1983) under stability and/or privacy constraints.

Acknowledgements

Part of this work was done while the authors were visiting the Simons Institute for the Theory of Computing during the “Data Privacy: Foundations and Applications” program. RB’s research is supported by NSF Awards AF-1908281, SHF-1907715, Google Faculty Research Award, and OSU faculty start-up support. Work by CG was partially funded by the Millennium Science Initiative of the Ministry of Economy, Development, and Tourism, grant “Millennium Nucleus Center for the Discovery of Structures in Complex Data.” CG would like to thank Nicolas Flammarion and Juan Peypouquet for extremely valuable discussions at early stages of this work.

References

  • Abadi et al., (2016) Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. (2016). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM.
  • Bassily et al., (2019) Bassily, R., Feldman, V., Talwar, K., and Thakurta, A. G. (2019). Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, pages 11279–11288.
  • Bassily et al., (2014) Bassily, R., Smith, A., and Thakurta, A. (2014). Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (full version available at arXiv:1405.7085), pages 464–473. IEEE.
  • Beck, (2017) Beck, A. (2017). First-Order Methods in Optimization. MOS-SIAM Series on Optimization. Society for Industrial and Applied Mathematics.
  • Bousquet and Elisseeff, (2002) Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. JMLR, 2:499–526.
  • Bousquet et al., (2019) Bousquet, O., Klochkov, Y., and Zhivotovskiy, N. (2019). Sharper bounds for uniformly stable algorithms. CoRR, abs/1910.07833.
  • Charles and Papailiopoulos, (2018) Charles, Z. and Papailiopoulos, D. (2018). Stability and generalization of learning algorithms that converge to global optima. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 745–754, Stockholmsmässan, Stockholm Sweden. PMLR.
  • Chaudhuri and Monteleoni, (2008) Chaudhuri, K. and Monteleoni, C. (2008). Privacy-preserving logistic regression. In NIPS.
  • Chaudhuri et al., (2011) Chaudhuri, K., Monteleoni, C., and Sarwate, A. D. (2011). Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109.
  • Chen et al., (2018) Chen, Y., Jin, C., and Yu, B. (2018). Stability and convergence trade-off of iterative optimization algorithms. CoRR, abs/1804.01619.
  • Devroye and Wagner, (1979a) Devroye, L. and Wagner, T. J. (1979a). Distribution-free inequalities for the deleted and holdout error estimates. IEEE Trans. Information Theory, 25(2):202–207.
  • Devroye and Wagner, (1979b) Devroye, L. and Wagner, T. J. (1979b). Distribution-free performance bounds with the resubstitution error estimate (corresp.). IEEE Trans. Information Theory, 25(2):208–210.
  • Dwork and Feldman, (2018) Dwork, C. and Feldman, V. (2018). Privacy-preserving prediction. CoRR, abs/1803.10266. Extended abstract in COLT 2018.
  • Dwork et al., (2006a) Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., and Naor, M. (2006a). Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT.
  • Dwork et al., (2006b) Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006b). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer.
  • Feldman, (2016) Feldman, V. (2016). Generalization of ERM in stochastic convex optimization: The dimension strikes back. CoRR, abs/1608.04414. Extended abstract in NIPS 2016.
  • Feldman et al., (2020) Feldman, V., Koren, T., and Talwar, K. (2020). Private stochastic convex optimization: Optimal rates in linear time. Extended abstract in STOC 2020.
  • Feldman et al., (2018) Feldman, V., Mironov, I., Talwar, K., and Thakurta, A. (2018). Privacy amplification by iteration. In FOCS, pages 521–532.
  • Feldman and Vondrák, (2018) Feldman, V. and Vondrák, J. (2018). Generalization bounds for uniformly stable algorithms. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 9770–9780.
  • Feldman and Vondrák, (2019) Feldman, V. and Vondrák, J. (2019). High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, pages 1270–1279.
  • Hardt et al., (2016) Hardt, M., Recht, B., and Singer, Y. (2016). Train faster, generalize better: Stability of stochastic gradient descent. In ICML, pages 1225–1234.
  • Jain et al., (2012) Jain, P., Kothari, P., and Thakurta, A. (2012). Differentially private online learning. In 25th Annual Conference on Learning Theory (COLT), pages 24.1–24.34.
  • Jain and Thakurta, (2014) Jain, P. and Thakurta, A. (2014). (near) dimension independent risk bounds for differentially private learning. In ICML.
  • Kearns and Ron, (1999) Kearns, M. J. and Ron, D. (1999). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453.
  • Kiefer and Wolfowitz, (1952) Kiefer, J. and Wolfowitz, J. (1952). Stochastic estimation of the maximum of a regression function. Ann. Math. Statist., 23(3):462–466.
  • Kifer et al., (2012) Kifer, D., Smith, A., and Thakurta, A. (2012). Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pages 25–1.
  • Koren and Levy, (2015) Koren, T. and Levy, K. (2015). Fast rates for exp-concave empirical risk minimization. In NIPS, pages 1477–1485.
  • Kuzborskij and Lampert, (2018) Kuzborskij, I. and Lampert, C. H. (2018). Data-dependent stability of stochastic gradient descent. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 2820–2829. PMLR.
  • Liu et al., (2017) Liu, T., Lugosi, G., Neu, G., and Tao, D. (2017). Algorithmic stability and hypothesis complexity. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 2159–2167.
  • London, (2017) London, B. (2017). A pac-bayesian analysis of randomized learning with application to stochastic gradient descent. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 2931–2940. Curran Associates, Inc.
  • Maurer, (2017) Maurer, A. (2017). A second-order look at stability and generalization. In COLT, pages 1461–1475.
  • Mukherjee et al., (2006) Mukherjee, S., Niyogi, P., Poggio, T., and Rifkin, R. (2006). Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1-3):161–193.
  • Nedic and Bertsekas, (2001) Nedic, A. and Bertsekas, D. P. (2001). Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12(1):109–138.
  • Nemirovsky and Yudin, (1983) Nemirovsky, A. and Yudin, D. (1983). Problem Complexity and Method Efficiency in Optimization. J. Wiley & Sons, New York.
  • Poggio et al., (2004) Poggio, T., Rifkin, R., Mukherjee, S., and Niyogi, P. (2004). General conditions for predictivity in learning theory. Nature, 428(6981):419–422.
  • Rigollet, (2015) Rigollet, P. (2015). Lecture Notes, 18.S997: High Dimensional Statistics. MIT Courses/Mathematics. https://ocw.mit.edu/courses/mathematics/18-s997-high-dimensional-statistics-spring-2015
  • Robbins and Monro, (1951) Robbins, H. and Monro, S. (1951). A stochastic approximation method. Ann. Math. Statist., 22(3):400–407.
  • Rogers and Wagner, (1978) Rogers, W. H. and Wagner, T. J. (1978). A finite sample distribution-free performance bound for local discrimination rules. The Annals of Statistics, 6(3):506–514.
  • Shalev-Shwartz et al., (2010) Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. (2010). Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670.
  • Smith and Thakurta, (2013) Smith, A. and Thakurta, A. (2013). Differentially private feature selection via stability arguments, and the robustness of the LASSO. In Conference on Learning Theory (COLT), pages 819–850.
  • Talwar et al., (2015) Talwar, K., Thakurta, A., and Zhang, L. (2015). Nearly optimal private LASSO. In Proceedings of the 28th International Conference on Neural Information Processing Systems, volume 2, pages 3025–3033.
  • Ullman, (2015) Ullman, J. (2015). Private multiplicative weights beyond linear queries. In Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 303–312. ACM.
  • Wu et al., (2017) Wu, X., Li, F., Kumar, A., Chaudhuri, K., Jha, S., and Naughton, J. (2017). Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In SIGMOD. ACM.
  • Zinkevich, (2003) Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, ICML’03, page 928–935. AAAI Press.

Appendix A Proof of Theorem 3.2

Proof.

The bound of 2R2R is obtained directly from the diameter bound on 𝒳\mathcal{X}. Therefore, we focus exclusively on the second term. Let SSS\simeq S^{\prime} be arbitrary neighboring datasets, x1=y1x^{1}=y^{1}, and consider the trajectories (xt)t,(yt)t(x^{t})_{t},(y^{t})_{t} associated with the batch GD method on datasets SS and SS^{\prime}, respectively. We use Lemma 3.1 with ft=FSf_{t}=F_{S} and ft=FSf_{t}^{\prime}=F_{S^{\prime}}, for all t[T1]t\in[T-1]. Notice that

supx𝒳FS(x)FS(x)2L/n,\sup_{x\in{\cal X}}\|\nabla F_{S}(x)-\nabla F_{S^{\prime}}(x)\|\leq 2L/n,

since SSS\simeq S^{\prime}; in particular, ft(xt)ft(xt)at\|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(x^{t})\|\leq a_{t}, with at=2L/na_{t}=2L/n. We conclude by Lemma 3.1 that for all t[T]t\in[T]

xtyt2Ls=1t1ηs2+4Lns=2t1ηs.\|x^{t}-y^{t}\|\leq 2L\sqrt{\sum_{s=1}^{t-1}\eta_{s}^{2}}+\frac{4L}{n}\sum_{s=2}^{t-1}\eta_{s}.

Hence, the stability bound holds for all the iterates, and thus for x¯T\overline{x}^{T} by the triangle inequality. ∎
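As an illustrative numerical companion to this bound (not part of the proof), one can run full-batch subgradient descent on a pair of neighboring datasets for the simple one-dimensional loss f(x,z)=L|xz|f(x,z)=L|x-z| and check the claimed inequality along the trajectory; the instance and constants below are our own choices.

```python
import numpy as np

# Empirical check of the Theorem 3.2-style bound for full-batch subgradient
# descent on neighboring datasets, with the 1-d L-Lipschitz loss f(x,z) = L|x-z|.
# Here the empirical gradients of F_S and F_{S'} differ by at most 2L/n.
L_const, n, T = 1.0, 20, 200
rng = np.random.default_rng(1)
S = rng.uniform(-1, 1, size=n)
Sp = S.copy()
Sp[0] = rng.uniform(-1, 1)                        # replace one data point
eta = 0.1 / np.sqrt(np.arange(2, T + 1))          # non-increasing step sizes

def full_batch_grad(x, data):
    # Subgradient of F(x) = (L/n) * sum_i |x - z_i|.
    return L_const * np.mean(np.sign(x - data))

x = y = 0.5                                       # common initialization
for t in range(T - 1):
    x = x - eta[t] * full_batch_grad(x, S)
    y = y - eta[t] * full_batch_grad(y, Sp)
    # Bound (slightly loosened: both sums start at s=1) guaranteed by the theorem.
    bound = (2 * L_const * np.sqrt(np.sum(eta[: t + 1] ** 2))
             + 4 * L_const / n * np.sum(eta[: t + 1]))
    assert abs(x - y) <= bound + 1e-12
```

In this benign instance the observed gap is far below the bound; the 2LΣsηs22L\sqrt{\Sigma_{s}\eta_{s}^{2}} term is needed for worst-case nonsmooth instances such as those in Section 4.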

Appendix B Proof of Theorem 3.4

Proof.

The stability bound of 2R2R is implied directly by the diameter of the feasible set. Let SSS\simeq S^{\prime}, and let (xt)t[T],(yt)t[T](x^{t})_{t\in[T]},(y^{t})_{t\in[T]} be the trajectories of Algorithm 3 on SS and SS^{\prime}, respectively, with x1=y1x^{1}=y^{1}.

Notice that since 𝝅\bm{\pi} is a random permutation, we may assume w.l.o.g. that 𝝅\bm{\pi} is the identity, whereas the perturbed coordinate between S,SS,S^{\prime} is 𝐢Unif([n])\mathbf{i}\sim\mbox{Unif}([n]). The rest of the proof is a stability analysis conditioned on 𝝅\bm{\pi} (which fixes all the randomness of the algorithm), but from the observation above this is the same as conditioning on the random perturbed coordinate 𝐢\mathbf{i}.

Let TnT\leq n, and δt=xtyt\delta_{t}=\|x^{t}-y^{t}\| so that δ1=0\delta_{1}=0. Conditioned on 𝐢=i\mathbf{i}=i, we have that for all tTt\leq T,

δt+12{0t<i4ηt2L2t=iδt2+4ηt2L2i<tT\delta_{t+1}^{2}\leq\left\{\begin{array}[]{ll}0&t<i\\ 4\eta_{t}^{2}L^{2}&t=i\\ \delta_{t}^{2}+4\eta_{t}^{2}L^{2}&i<t\leq T\end{array}\right.

Indeed, for all tit\leq i, δt=0\delta_{t}=0. For t=it=i, we have

δi+1\displaystyle\delta_{i+1} =\displaystyle= 𝖯𝗋𝗈𝗃𝒳[xiηif(xi,zi)]𝖯𝗋𝗈𝗃𝒳[yiηif(yi,zi)]\displaystyle\|\mathsf{Proj}_{\mathcal{X}}[x^{i}-\eta_{i}\nabla f(x^{i},z_{i})]-\mathsf{Proj}_{\mathcal{X}}[y^{i}-\eta_{i}\nabla f(y^{i},z_{i}^{\prime})]\|
\displaystyle\leq xiyiηi[f(xi,zi)f(yi,zi)]\displaystyle\|x^{i}-y^{i}-\eta_{i}[\nabla f(x^{i},z_{i})-\nabla f(y^{i},z_{i}^{\prime})]\|
\displaystyle\leq 2Lηi,\displaystyle 2L\eta_{i},

where we used xi=yix^{i}=y^{i}, and that both gradients are bounded in norm by LL. Finally, when t>it>i, we have zt=ztz_{t}=z_{t}^{\prime}, and therefore we can leverage the monotonicity of the subgradients

δt+12\displaystyle\delta_{t+1}^{2} =\displaystyle= 𝖯𝗋𝗈𝗃𝒳[xtηtf(xt,zt)]𝖯𝗋𝗈𝗃𝒳[ytηtf(yt,zt)]2\displaystyle\|\mathsf{Proj}_{\mathcal{X}}[x^{t}-\eta_{t}\nabla f(x^{t},z_{t})]-\mathsf{Proj}_{\mathcal{X}}[y^{t}-\eta_{t}\nabla f(y^{t},z_{t})]\|^{2}
\displaystyle\leq δt2+4L2ηt22ηtf(xt,zt)f(yt,zt),xtyt\displaystyle\delta_{t}^{2}+4L^{2}\eta_{t}^{2}-2\eta_{t}\langle\nabla f(x^{t},z_{t})-\nabla f(y^{t},z_{t}),x^{t}-y^{t}\rangle
\displaystyle\leq δt2+4L2ηt2.\displaystyle\delta_{t}^{2}+4L^{2}\eta_{t}^{2}.

Unravelling this recursion, we get 𝔼[δt+12|𝐢=i]4L2s=itηs2\mathbb{E}[\delta_{t+1}^{2}|\mathbf{i}=i]\leq 4L^{2}\sum_{s=i}^{t}\eta_{s}^{2}, so by the law of total expectation:

𝔼[δt]\displaystyle\mathbb{E}[\delta_{t}] =\displaystyle= 1ni=1n𝔼[δt|𝐢=i]1ni=1n𝔼[δt2|𝐢=i]2Lni=1t1s=it1ηs2\displaystyle\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\delta_{t}|\mathbf{i}=i]\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{\mathbb{E}[\delta_{t}^{2}|\mathbf{i}=i]}\leq\frac{2L}{n}\sum_{i=1}^{t-1}\sqrt{\sum_{s=i}^{t-1}\eta_{s}^{2}}
\displaystyle\leq 2Lni=1t1(ti)ηi2Ln(i=1t1(ti))(i=1t1ηi2)\displaystyle\frac{2L}{n}\sum_{i=1}^{t-1}\sqrt{(t-i)}\eta_{i}\leq\frac{2L}{n}\sqrt{\Big{(}\sum_{i=1}^{t-1}(t-i)\Big{)}\Big{(}\sum_{i=1}^{t-1}\eta_{i}^{2}}\Big{)}
\displaystyle\leq 2Lnt22i=1t1ηi2=2Ltni=1t1ηi2.\frac{2L}{n}\sqrt{\frac{t^{2}}{2}\sum_{i=1}^{t-1}\eta_{i}^{2}}=\frac{\sqrt{2}\,L\,t}{n}\sqrt{\sum_{i=1}^{t-1}\eta_{i}^{2}}.

where the first inequality holds by Jensen's inequality, the second inequality comes from the bound on the conditional expectation, the third inequality from the non-increasing step size assumption, and the fourth inequality from Cauchy-Schwarz. Since averaging the iterates can only improve stability, the result follows. ∎
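The one-step inequality δt+12δt2+4L2ηt2\delta_{t+1}^{2}\leq\delta_{t}^{2}+4L^{2}\eta_{t}^{2}, which drives the recursion above, can be checked numerically. The sketch below uses the convex LL-Lipschitz loss f(x,z)=Lxzf(x,z)=L\|x-z\| (our choice, purely for illustration) and projection onto a Euclidean ball; nonexpansiveness of the projection and monotonicity of the subgradients guarantee the inequality for every pair of points.

```python
import numpy as np

# Check: for a SHARED example z, one projected subgradient step satisfies
#   ||x' - y'||^2 <= ||x - y||^2 + 4 L^2 eta^2,
# using monotonicity of subgradients of the convex loss f(x, z) = L * ||x - z||.
rng = np.random.default_rng(2)
L_const, eta, R, d = 1.0, 0.3, 1.0, 5

def subgrad(x, z):
    # Subgradient of L * ||x - z|| (zero at the kink x = z).
    diff = x - z
    nrm = np.linalg.norm(diff)
    return L_const * diff / nrm if nrm > 0 else np.zeros_like(diff)

def proj_step(x, z):
    # One projected subgradient step onto the Euclidean ball of radius R.
    v = x - eta * subgrad(x, z)
    nrm = np.linalg.norm(v)
    return v * (R / nrm) if nrm > R else v

for _ in range(1000):
    x, y = rng.normal(size=d), rng.normal(size=d)
    z = rng.normal(size=d)
    lhs = np.linalg.norm(proj_step(x, z) - proj_step(y, z)) ** 2
    rhs = np.linalg.norm(x - y) ** 2 + 4 * L_const**2 * eta**2
    assert lhs <= rhs + 1e-10
```

The inner-product term 2ηf(x,z)f(y,z),xy-2\eta\langle\nabla f(x,z)-\nabla f(y,z),x-y\rangle is nonpositive by monotonicity, which is exactly why the same data point never expands the gap beyond the additive 4L2η24L^{2}\eta^{2}.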

Appendix C General Lower Bound on Stability of SGD

Let 𝒜{\cal A} be a γ\gamma-uniformly stable stochastic convex optimization algorithm with γ=s(T)/n\gamma=s(T)/n, where s(T)s(T) is increasing and limT+s(T)=+\lim_{T\rightarrow+\infty}s(T)=+\infty. By the lower bound on the optimal risk of nonsmooth convex optimization, εriskLRC1n\varepsilon_{\mbox{\footnotesize risk}}\geq\frac{LR}{C_{1}\sqrt{n}}, where C1>0C_{1}>0 is a universal constant (Nemirovsky and Yudin,, 1983). This, combined with the risk decomposition (3), implies that

εoptLRC1ns(T)n=s(T)(1nLR2C1s(T))2+(LR)24C12s(T).\varepsilon_{\mbox{\footnotesize opt}}\geq\frac{LR}{C_{1}\sqrt{n}}-\frac{s(T)}{n}=-s(T)\Big{(}\frac{1}{\sqrt{n}}-\frac{LR}{2C_{1}s(T)}\Big{)}^{2}+\frac{(LR)^{2}}{4C_{1}^{2}s(T)}.

By our assumption on s(T)s(T), for TT sufficiently large, there always exists nn such that

4C1s(T)3LRn4C1s(T)LR\frac{4C_{1}s(T)}{3LR}\leq\sqrt{n}\leq\frac{4C_{1}s(T)}{LR}

which leads to εopt(LR)2C2s(T)\varepsilon_{\mbox{\footnotesize opt}}\geq\frac{(LR)^{2}}{C_{2}s(T)}, where C2>0C_{2}>0 is a universal constant.

If algorithm 𝒜{\cal A} is based on TT subgradient iterations with constant step size η>0\eta>0 (these could be either stochastic, batch, or minibatch), then by standard analysis the optimization guarantee of such an algorithm is εopt12(R2ηT+ηL2)\varepsilon_{\mbox{\footnotesize opt}}\leq\frac{1}{2}(\frac{R^{2}}{\eta T}+\eta L^{2}). Both bounds in combination give

s(T)2(LR)2C2(ηL2+R2/[ηT])=2(LR)2ηTC2(η2TL2+R2).s(T)\geq\frac{2(LR)^{2}}{C_{2}(\eta L^{2}+R^{2}/[\eta T])}=\frac{2(LR)^{2}\eta T}{C_{2}(\eta^{2}TL^{2}+R^{2})}.

If we further assume that η(R/L)/T\eta\leq(R/L)/\sqrt{T} (notice that η=(R/L)/T\eta=(R/L)/\sqrt{T} minimizes the optimization error), then s(T)L2ηT/C2s(T)\geq L^{2}\eta T/C_{2}. We also emphasize that all the choices of step size we make to control the generalization error lie in this range.
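For concreteness, substituting the risk-optimal step size into the last bound makes the growth rate explicit (a direct substitution, with the same constants as above):

```latex
\text{With } \eta = \frac{R}{L\sqrt{T}}:\qquad
s(T) \;\ge\; \frac{L^{2}\eta T}{C_{2}}
     \;=\; \frac{LR\sqrt{T}}{C_{2}}.
```

That is, any uniformly stable constant-step subgradient method run with the optimization-optimal step size must have a stability parameter growing as Ω(T)\Omega(\sqrt{T}), which is consistent with the T\sqrt{T} dependence in the UAS lower bounds of Section 4.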

Appendix D Proof of Theorem 4.1

Proof.

Let Dmin{T,1/η2}dD\triangleq\min\{T,1/\eta^{2}\}\leq d, and ν,K>0\nu,K>0. We consider 𝒵={0,1}\mathcal{Z}=\{0,1\}, and the objective function

f(x,z)={max{0,x1ν,,xDν} if z=0r,x/K if z=1,f(x,z)=\left\{\begin{array}[]{ll}\max\{0,x_{1}-\nu,\ldots,x_{D}-\nu\}&\mbox{ if }z=0\\ \langle r,x\rangle/K&\mbox{ if }z=1,\end{array}\right.

where r=(1,,1,0,,0)r=(-1,\ldots,-1,0,\ldots,0) (i.e., supported on the first DD coordinates). Notice that for normalization purposes, we need KDK\geq\sqrt{D}; furthermore, we will choose KK sufficiently large such that TD/[nK]=o(1)T\sqrt{D}/[nK]=o(1). Consider the data sets SSS\simeq S^{\prime}, leading to the empirical objectives:

FS(x)=1nKr,x+n1nmax{0,x1ν,,xDν}andFS(x)=max{0,x1ν,,xDν}.F_{S}(x)=\frac{1}{nK}\langle r,x\rangle+\frac{n-1}{n}\max\{0,x_{1}-\nu,\ldots,x_{D}-\nu\}\,\,\mbox{and}\,\,F_{S^{\prime}}(x)=\max\{0,x_{1}-\nu,\ldots,x_{D}-\nu\}.

Let (xt)t[T](x^{t})_{t\in[T]} and (yt)t[T](y^{t})_{t\in[T]} be the trajectories of the algorithm over datasets SS and SS^{\prime}, respectively, initialized from x1=y1=0x^{1}=y^{1}=0. Clearly, yt=0y^{t}=0 for all tt. Now x2=ηnKrx^{2}=-\frac{\eta}{nK}\,r; choosing ν<η/(nK)\nu<\eta/(nK), we have f(x2,z)=ηnKr+n1ne1\nabla f(x^{2},z)=-\frac{\eta}{nK}\,r+\frac{n-1}{n}e_{1}, and hence x3=2ηnKrηn1ne1.x^{3}=-\frac{2\eta}{nK}\,r-\eta\frac{n-1}{n}e_{1}. Sequentially, the method will perform cumulative subgradient steps on e2,e3,eDe_{2},e_{3}\ldots,e_{D}. In particular, for any t[D+1],t\in[D+1], we have xt+1=tηnKrηn1ns=1t1es.x^{t+1}=-t\frac{\eta}{nK}\,r-\eta\frac{n-1}{n}\sum_{s=1}^{t-1}e_{s}.

By orthogonality of the subgradients and given our choice of KK, we conclude that

xD+2yD+2\displaystyle\|x^{D+2}-y^{D+2}\| =\displaystyle= xD+2η2t=1DetηDDnK\displaystyle\|x^{D+2}\|\geq\frac{\eta}{2}\Big{\|}\sum_{t=1}^{D}e_{t}\Big{\|}-\eta\frac{D\sqrt{D}}{nK}
\displaystyle\geq η2Dηo(1)=Ω(ηD)\displaystyle\frac{\eta}{2}\sqrt{D}-\eta\cdot o(1)=\Omega(\eta\sqrt{D})
=\displaystyle= Ω(min{1,ηT}),\displaystyle\Omega(\min\{1,\eta\sqrt{T}\}),

and further subgradient steps t=D+1,,Tt=D+1,\ldots,T are only given by the linear term, r/[nK]r/[nK], which are negligible perturbations.

We finish by arguing that averaging does not help. First, in the case D=TD=T:

δ(𝒜𝖦𝖣)\displaystyle\delta({\mathcal{A}}_{\sf GD}) =\displaystyle= x¯Tη21Tt=1Ts=1t2eso(1)=η2s=1TTs2Teso(1)\displaystyle\|\overline{x}^{T}\|\geq\frac{\eta}{2}\Big{\|}\frac{1}{T}\sum_{t=1}^{T}\sum_{s=1}^{t-2}e_{s}\Big{\|}-o(1)=\frac{\eta}{2}\Big{\|}\sum_{s=1}^{T}\frac{T-s-2}{T}e_{s}\Big{\|}-o(1)
\displaystyle\geq η4sT/22eso(1)=Ω(ηT).\displaystyle\frac{\eta}{4}\Big{\|}\sum_{s\leq T/2-2}e_{s}\Big{\|}-o(1)=\Omega(\eta\sqrt{T}).

And second, in the case D=1/η2D=1/\eta^{2}:

δ(𝒜𝖦𝖣)\displaystyle\delta({\mathcal{A}}_{\sf GD}) =\displaystyle= x¯Tη2Tt=1D+2s=1t2es+t=D+3Ts=1Deso(1)\displaystyle\|\overline{x}^{T}\|\geq\frac{\eta}{2T}\Big{\|}\sum_{t=1}^{D+2}\sum_{s=1}^{t-2}e_{s}+\sum_{t=D+3}^{T}\sum_{s=1}^{D}e_{s}\Big{\|}-o(1)
=\displaystyle= η2Ts=1D(Ds1)es+s=1D(TD+2)eso(1)\displaystyle\frac{\eta}{2T}\Big{\|}\sum_{s=1}^{D}(D-s-1)e_{s}+\sum_{s=1}^{D}(T-D+2)e_{s}\Big{\|}-o(1)
=\displaystyle= η2s=1D1Ts+1Teso(1)η4s=1D/2eso(1)=Ω(Dη)=Ω(1).\displaystyle\frac{\eta}{2}\Big{\|}\sum_{s=1}^{D-1}\frac{T-s+1}{T}e_{s}\Big{\|}-o(1)\geq\frac{\eta}{4}\Big{\|}\sum_{s=1}^{D/2}e_{s}\Big{\|}-o(1)=\Omega(\sqrt{D}\eta)=\Omega(1).

Finally, the additional term Ω(ηT/n)\Omega(\eta T/n) in the lower bound is obtained by Appendix C. ∎
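The construction above can be simulated directly. The following sketch (our own instantiation of the parameters, with a subgradient oracle that breaks ties toward the smallest maximizing coordinate, as in the proof) runs full-batch subgradient descent on FSF_{S} starting from 00; the trajectory on FSF_{S^{\prime}} stays at 00, so the final norm is the stability gap.

```python
import numpy as np

# Simulation of the lower-bound instance: subgradient descent on F_S sweeps one
# coordinate per step while the trajectory on F_{S'} stays at 0, giving a gap
# of order eta * sqrt(D). All parameter values here are our own choices.
eta, D, n, K = 0.1, 100, 50, 1e6
nu = eta / (2 * n * K)                  # nu < eta/(nK), as required in the proof
r = -np.ones(D)                         # r supported on the first D coordinates

def grad_FS(x):
    # Subgradient of F_S(x) = <r, x>/(nK) + ((n-1)/n) * max{0, x_1-nu, ..., x_D-nu}.
    g = r / (n * K)
    vals = x - nu
    if vals.max() > 0:                  # max-term active: pick e_{argmax}
        e = np.zeros(D)
        e[np.argmax(vals)] = 1.0        # argmax breaks ties toward lowest index
        g = g + (n - 1) / n * e
    return g

x = np.zeros(D)                         # trajectory on S; on S' it stays at 0
for _ in range(D + 1):                  # D+1 updates produce x^{D+2}
    x = x - eta * grad_FS(x)

gap = np.linalg.norm(x)                 # equals ||x^{D+2} - y^{D+2}||
assert gap >= 0.5 * eta * np.sqrt(D)    # Omega(eta * sqrt(D)) separation
```

After a single warm-up step driven by the linear term, each subsequent step pushes a fresh coordinate by roughly η-\eta, so after D+1D+1 updates the gap is close to ηD\eta\sqrt{D}, matching the Ω(min{1,ηT})\Omega(\min\{1,\eta\sqrt{T}\}) bound derived above.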

Appendix E Proof of Theorem 4.3

Proof.

We consider the same function class as in Thm. 4.1, and neighboring datasets S=(0,0,,0)S^{\prime}=(0,0,\ldots,0), S=(1,0,,0)S=(1,0,\ldots,0). We will assume in what follows that D=min{T,1/η2}D=\min\{T,1/\eta^{2}\}, KK is sufficiently large, and ν<ηr/K\nu<\eta\|r\|/K. Let (xt)t[T](x^{t})_{t\in[T]} and (yt)t[T](y^{t})_{t\in[T]} be the trajectories of Algorithm 3 over datasets S,SS,S^{\prime} respectively, both initialized at x1=y1=0x^{1}=y^{1}=0. Let now τ=𝝅1(1)Unif[n]\tau=\bm{\pi}^{-1}(1)\sim\mbox{Unif}[n]. Arguing as in Thm. 4.1, we have that yt=0y^{t}=0 for all tt, whereas

xt+1={0t<τη(1+t/n)rKηs=1tτ(1+t/n)esτtτ+D.x^{t+1}=\left\{\begin{array}[]{ll}0&t<\tau\\ -\frac{\eta(1+\lfloor t/n\rfloor)r}{K}-\eta\sum_{s=1}^{t-\tau-(1+\lfloor t/n\rfloor)}e_{s}&\tau\leq t\leq\tau+D.\end{array}\right.

Later iterations will satisfy xt=1o(1)\|x^{t}\|=1-o(1) if D=1/η2D=1/\eta^{2} (and otherwise the algorithm stops earlier). Therefore, for all t[T]t\in[T],

𝔼𝝅[xtyt]\displaystyle\mathbb{E}_{\bm{\pi}}[\|x^{t}-y^{t}\|] =\displaystyle= s=1n𝔼[xtyt|τ=s][τ=s]\displaystyle\sum_{s=1}^{n}\mathbb{E}[\|x^{t}-y^{t}\|\,|\,\tau=s]\mathbb{P}[\tau=s]
\displaystyle\geq η2ns=1min{t,n}tsηo(1)\displaystyle\frac{\eta}{2n}\sum_{s=1}^{\min\{t,n\}}\sqrt{t-s}-\eta\cdot o(1)
=\displaystyle= {Ω(ηt3/2n) if tnΩ(ηt) if t>n.\displaystyle\left\{\begin{array}[]{ll}\Omega(\frac{\eta t^{3/2}}{n})&\mbox{ if }t\leq n\\ \Omega(\eta\sqrt{t})&\mbox{ if }t>n.\end{array}\right.

Notice that we used above that KK is such that TD/nK=o(1)T\sqrt{D}/nK=o(1). As in Thm. 4.1, we can obtain the same conclusion for x¯T\overline{x}^{T}. The additional lower bound of ηT/n\eta T/n follows from Appendix C, so the result follows. ∎

Appendix F Upper bounds on UAS of SGD when TnT\leq n

Theorem F.1.

Let 𝒳(0,R)\mathcal{X}\subseteq\mathcal{B}(0,R) and =𝒳0(L)\mathcal{F}=\mathcal{F}_{\mathcal{X}}^{0}(L). Suppose TnT\leq n. Then the uniform argument stability of sampling-with-replacement stochastic gradient descent (Algorithm 2) satisfies

𝔼[δ𝒜𝗋𝖲𝖦𝖣]min(2R,3LT1n(t=1T1ηt2+1nt=1T1ηt)).\mathbb{E}\left[\delta_{{\mathcal{A}}_{\sf rSGD}}\right]\leq\min\left(2R,~{}3L\,\frac{T-1}{n}\,\left(\sqrt{\sum_{t=1}^{T-1}\eta_{t}^{2}}+\frac{1}{n}\sum_{t=1}^{T-1}\eta_{t}\right)\right).
Proof.

The bound of 2R2R is obtained directly from the diameter bound on 𝒳\mathcal{X}. Therefore, we focus exclusively on the second term. Let SSS\simeq S^{\prime}, and let k[n]k\in[n] be the entry where both datasets differ. Let (xt)t[T],(yt)t[T](x^{t})_{t\in[T]},(y^{t})_{t\in[T]} be the trajectories of Algorithm 2 on SS and SS^{\prime}, respectively, with x1=y1x^{1}=y^{1}.

Let BtB_{t} denote the event that 𝐢j=k\mathbf{i}_{j}=k for some jtj\leq t; that is, BtB_{t} is the event that the index kk is sampled at least once in the first tt iterations. We note that

[Bt]\displaystyle\underset{}{\mathbb{P}}\left[B_{t}\right] =\displaystyle= 1nj=0t1(11n)j=1(11n)tmin(1,tn)\displaystyle\frac{1}{n}\sum_{j=0}^{t-1}\big{(}1-\frac{1}{n}\big{)}^{j}=1-\big{(}1-\frac{1}{n}\big{)}^{t}\leq\min\left(1,~{}\frac{t}{n}\right)
𝔼[δT]\displaystyle\Longrightarrow\quad\underset{}{\mathbb{E}}\left[\delta_{T}\right] \displaystyle\leq min(1,T1n)𝔼[δT|BT1].\displaystyle\min\left(1,~{}\frac{T-1}{n}\right)\cdot\underset{}{\mathbb{E}}\left[\delta_{T}|~{}B_{T-1}\right]. (10)

For the rest of the proof we bound 𝔼[δT|BT1]\underset{}{\mathbb{E}}\left[\delta_{T}|B_{T-1}\right]. To do this, we derive a recurrence for 𝔼[δt+1|Bt]\underset{}{\mathbb{E}}\left[\delta_{t+1}|B_{t}\right]. Note that BtB_{t} is the union of two mutually exclusive events: {𝐢t=k}Bt1¯\{\mathbf{i}_{t}=k\}\cap\overline{B_{t-1}} and Bt1B_{t-1}, where Bt1¯\overline{B_{t-1}} is the complement of Bt1B_{t-1}, i.e., the event “index kk is never sampled in the first t1t-1 iterations.” Hence,

𝔼[δt+12|Bt]\displaystyle\underset{}{\mathbb{E}}\left[\delta^{2}_{t+1}~{}|~{}B_{t}\right] =\displaystyle= [𝐢t=k,Bt1¯|Bt]𝔼[δt+12|𝐢t=k,Bt1¯]+[Bt1|Bt]𝔼[δt+12|Bt1]\displaystyle\underset{}{\mathbb{P}}\left[\mathbf{i}_{t}=k,~{}\overline{B_{t-1}}~{}|~{}B_{t}\right]\underset{}{\mathbb{E}}\left[\delta^{2}_{t+1}~{}|~{}\mathbf{i}_{t}=k,~{}\overline{B_{t-1}}\right]+\underset{}{\mathbb{P}}\left[B_{t-1}~{}|~{}B_{t}\right]\underset{}{\mathbb{E}}\left[\delta^{2}_{t+1}~{}|~{}B_{t-1}\right] (11)
\displaystyle\leq 𝔼[δt+12|𝐢t=k,Bt1¯]+𝔼[δt+12|Bt1].\displaystyle\underset{}{\mathbb{E}}\left[\delta^{2}_{t+1}~{}|~{}\mathbf{i}_{t}=k,~{}\overline{B_{t-1}}\right]+\underset{}{\mathbb{E}}\left[\delta^{2}_{t+1}~{}|~{}B_{t-1}\right].

Now, conditioned on the past sampled coordinates 𝐢1,,𝐢t1\mathbf{i}_{1},\ldots,\mathbf{i}_{t-1}, we have

\begin{align*}
\delta_{t+1} &= \|\mathsf{Proj}_{\mathcal{X}}[x^t-\eta_t\nabla f(x^t,z_{\mathbf{i}_t})] - \mathsf{Proj}_{\mathcal{X}}[y^t-\eta_t\nabla f(y^t,z'_{\mathbf{i}_t})]\|\\
&\leq \|x^t-y^t-\eta_t[\nabla f(x^t,z_{\mathbf{i}_t})-\nabla f(y^t,z'_{\mathbf{i}_t})]\|\\
&\leq \mathbf{1}_{\{\mathbf{i}_t=k\}}(\delta_t+2L\eta_t) + \mathbf{1}_{\{\mathbf{i}_t\neq k\}}\sqrt{\delta_t^2+4L^2\eta_t^2},
\end{align*}

where the last inequality follows from convexity and Lipschitzness of the objective. Squaring, we get

\[
\delta_{t+1}^2 \leq \mathbf{1}_{\{\mathbf{i}_t=k\}}(\delta_t+2L\eta_t)^2 + \mathbf{1}_{\{\mathbf{i}_t\neq k\}}(\delta_t^2+4L^2\eta_t^2) = \delta_t^2+4\eta_t^2L^2+\mathbf{1}_{\{\mathbf{i}_t=k\}}4L\eta_t\delta_t.
\]

From this formula we derive bounds for the two conditional expectations:

\begin{align*}
\mathbb{E}[\delta_{t+1}^2 \mid B_{t-1}] &\leq \mathbb{E}[\delta_t^2 \mid B_{t-1}] + 4L^2\eta_t^2 + \frac{4L}{n}\eta_t\,\mathbb{E}[\delta_t \mid B_{t-1}] \tag{12}\\
\mathbb{E}[\delta^2_{t+1} \mid \mathbf{i}_t=k,\,\overline{B_{t-1}}] &\leq 4L^2\eta_t^2, \tag{13}
\end{align*}

where (12) holds by independence of $\mathbf{i}_t$ and $B_{t-1}$, and in (13) we used that $\delta_t=0$ conditioned on $\overline{B_{t-1}}$.

Combining (12) and (13) in (11), we get

\begin{align*}
\mathbb{E}[\delta_{t+1}^2 \mid B_t] &\leq \mathbb{E}[\delta_t^2 \mid B_{t-1}] + 8L^2\eta_t^2 + \frac{4L}{n}\eta_t\,\mathbb{E}[\delta_t \mid B_{t-1}]\\
\Longrightarrow\quad \mathbb{E}[\delta_T^2 \mid B_{T-1}] &\leq 8L^2\sum_{t=1}^{T-1}\eta_t^2 + \frac{4L}{n}\sum_{t=1}^{T-1}\eta_t\,\mathbb{E}[\delta_t \mid B_{t-1}].
\end{align*}

With this last bound we can proceed inductively to show that

\[
\mathbb{E}[\delta_T \mid B_{T-1}] \leq \sqrt{8}\,L\sqrt{\sum_{t=1}^{T-1}\eta_t^2} + \frac{2L}{n}\sum_{t=1}^{T-1}\eta_t.
\]

The base case, $T=0$, is evident, and the inductive step splits into two cases: the case $\mathbb{E}[\delta_T \mid B_{T-1}] \leq \max_{t\in[T-1]}\mathbb{E}[\delta_t \mid B_{t-1}]$, which follows directly from the induction hypothesis; and the case $\mathbb{E}[\delta_T \mid B_{T-1}] > \max_{t\in[T-1]}\mathbb{E}[\delta_t \mid B_{t-1}]$, for which

\[
\mathbb{E}[\delta_T^2 \mid B_{T-1}] \leq 8L^2\sum_{t=1}^{T-1}\eta_t^2 + \frac{4L}{n}\sum_{t=1}^{T-1}\eta_t\,\mathbb{E}[\delta_t \mid B_{t-1}] \,\leq\, 8L^2\sum_{t=1}^{T-1}\eta_t^2 + \frac{4L}{n}\sum_{t=1}^{T-1}\eta_t\,\mathbb{E}[\delta_T \mid B_{T-1}].
\]

Then, completing the square,

\[
\mathbb{E}_{\mathbf{i}}\Big[\Big(\delta_T-\frac{2L}{n}\sum_{t=1}^{T-1}\eta_t\Big)^2 \,\Big|\, B_{T-1}\Big] \leq 8L^2\sum_{t=1}^{T-1}\eta_t^2 + \Big(\frac{2L}{n}\sum_{t=1}^{T-1}\eta_t\Big)^2,
\]

and by Jensen's inequality,

\[
\mathbb{E}_{\mathbf{i}}\Big[\delta_T-\frac{2L}{n}\sum_{t=1}^{T-1}\eta_t \,\Big|\, B_{T-1}\Big] \leq \sqrt{8L^2\sum_{t=1}^{T-1}\eta_t^2+\Big(\frac{2L}{n}\sum_{t=1}^{T-1}\eta_t\Big)^2} \,\leq\, \sqrt{8}\,L\sqrt{\sum_{t=1}^{T-1}\eta_t^2}+\frac{2L}{n}\sum_{t=1}^{T-1}\eta_t,
\]

proving the inductive step. Finally, putting this together with (10) completes the proof. ∎
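To illustrate the flavor of this stability bound, the following self-contained Python sketch simulates two sampling-with-replacement SGD trajectories on neighboring datasets for the nonsmooth loss $f(x,z)=L|x-z|$ in one dimension. This is our own illustrative instance (the loss, dimension, and all parameter values are assumptions for the demo, not taken from the paper); it checks that the averaged final distance stays below $\min\{2R,\text{bound}\}$.

```python
import math, random

# Minimal 1-D simulation sketch (an illustration, not the paper's Algorithm 2
# verbatim): two SGD runs on neighboring datasets S, S' differing only in
# entry k = 0, with f(x, z) = L*|x - z| and projection onto X = [-R, R].
L, R, n, T = 1.0, 1.0, 50, 200
eta = R / (L * math.sqrt(T))  # an illustrative constant step size
proj = lambda x: max(-R, min(R, x))
subgrad = lambda x, z: L * (1.0 if x > z else -1.0 if x < z else 0.0)

rng = random.Random(0)
S = [rng.uniform(-R, R) for _ in range(n)]
Sp = list(S)
Sp[0] = rng.uniform(-R, R)  # neighboring dataset S'

deltas = []
for _ in range(100):  # average the final distance delta_T over repeated runs
    x = y = 0.0
    for _ in range(T):
        i = rng.randrange(n)  # same with-replacement index for both runs
        x = proj(x - eta * subgrad(x, S[i]))
        y = proj(y - eta * subgrad(y, Sp[i]))
    deltas.append(abs(x - y))
avg_delta = sum(deltas) / len(deltas)

# Bound from the proof, for constant step size (using T in place of T - 1):
bound = min(1.0, T / n) * (math.sqrt(8) * L * eta * math.sqrt(T)
                           + 2 * L * eta * T / n)
assert 0.0 <= avg_delta <= min(2 * R, bound) + 1e-12
```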

Theorem F.2.

Let $\mathcal{X}\subseteq\mathcal{B}(0,R)$, $\mathcal{F}=\mathcal{F}_{\mathcal{X}}^{0}(L)$, let $\bm{\pi}$ be a uniformly random permutation over $[n]$, and let $(\eta_t)_{t\in[T]}$ be a non-increasing sequence with $T\leq n$. Then the fixed-permutation stochastic gradient descent (Algorithm 3) satisfies uniform argument stability
\[
\mathbb{E}[\delta_{\mathcal{A}_{\sf PerSGD}}] \leq \min\Big\{2R,\ \sqrt{2}\,L\,\frac{T-1}{n}\sqrt{\sum_{t=1}^{T-1}\eta_t^2}\Big\}.
\]

Notice that the UAS bounds for Algorithms 2 and 3 are of the same order. The proof is in Appendix B.

Appendix G Generalization in Stochastic Optimization with Data Reuse

G.0.1 Risk Bounds for Sampling-with-Replacement Stochastic Gradient Descent

Theorem G.1.

Let $\mathcal{X}\subseteq\mathcal{B}(0,R)$ and $\mathcal{F}=\mathcal{F}_{\mathcal{X}}^{0}(L)$. The sampling-with-replacement stochastic gradient descent (Algorithm 2) with $T=n^2$ iterations and $\eta=\frac{R}{Ln^{3/2}}$ satisfies, for any $12\exp\{-n^2/32\}<\theta<1$,

\[
\mathbb{P}\Big[\varepsilon_{\mathrm{risk}}({\mathcal{A}}_{\sf rSGD}) > \frac{cLR}{\sqrt{n}}\log(n)\log\Big(\frac{n}{\theta}\Big)\Big] \leq \theta,
\]
where $c>0$ is an absolute constant.

It should be noted that, similarly to Remark 6.3, if we are only interested in bounds on the expected risk, the polylogarithmic factor above can be shaved off, which yields the optimal expected excess risk.

Proof.

Let $\mathbf{S}\sim\mathcal{D}^n$ be an i.i.d. random sample for the stochastic convex program, and run the algorithm $\mathcal{A}_{\sf rSGD}$ on these data with constant step size $\eta>0$ and $T$ iterations.

We consider $\theta>0$ such that $\theta>12\exp\{-T/32\}$. Notice that the sampling-with-replacement stochastic gradient is a bounded first-order stochastic oracle for the empirical objective, and it is direct to verify that the assumptions of Lemma H.1 are satisfied with $\sigma=0$. Hence, by Lemma H.1, we have that, with probability at least $1-\theta/3$,

\[
\varepsilon_{\mathrm{opt}}({\mathcal{A}}_{\sf rSGD}) \leq O\Big(LR\sqrt{\frac{2\log(12/\theta)}{T}} + \frac{R^2}{\eta T} + \eta L^2\Big).
\]

On the other hand, Theorem 3.3 together with Theorem 2.2 guarantees that, with probability at least $1-\theta/3$, we have

\[
|\varepsilon_{\mathrm{gen}}({\mathcal{A}}_{\sf rSGD})| \leq O\Big(L^2\Big[\sqrt{T}\eta+\frac{T\eta}{n}\Big]\log n\log(6n/\theta) + LR\sqrt{\frac{\log(6/\theta)}{n}}\Big).
\]

Finally, Lemma 2.1 ensures that, with probability at least $1-\theta/3$,

\[
\varepsilon_{\mathrm{approx}} \leq LR\sqrt{\frac{2\log(3/\theta)}{n}}.
\]

By the union bound and the excess risk decomposition (3), we have that, with probability at least $1-\theta$,

\begin{align*}
\varepsilon_{\mathrm{risk}}({\mathcal{A}}) &= O\Big(LR\sqrt{\frac{\log(1/\theta)}{T}} + \frac{R^2}{\eta T} + \eta L^2 + L^2\eta\Big(\sqrt{T}+\frac{T}{n}\Big)\log(n)\log\Big(\frac{6n}{\theta}\Big) + LR\sqrt{\frac{\log(6/\theta)}{n}} + LR\sqrt{\frac{\log(3/\theta)}{n}}\Big)\\
&= O\Big(\frac{LR}{\sqrt{n}}\log(n)\log\Big(\frac{n}{\theta}\Big)\Big),
\end{align*}

where only in the last step did we substitute the choice of step size and number of iterations from the statement. ∎
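The last substitution can be verified term by term. The sketch below, with arbitrary illustrative constants $L$ and $R$ of our choosing, checks that with $T=n^2$ and $\eta=R/(Ln^{3/2})$ every non-logarithmic term in the decomposition is at most $LR/\sqrt{n}$.

```python
import math

# Sanity check (not part of the proof): with T = n^2 and eta = R/(L*n^{3/2}),
# each non-logarithmic term in the risk decomposition is at most L*R/sqrt(n).
L, R = 2.0, 3.0  # arbitrary illustrative constants
for n in (10, 100, 1000):
    T = n ** 2
    eta = R / (L * n ** 1.5)
    target = L * R / math.sqrt(n)
    terms = {
        "LR/sqrt(T)": L * R / math.sqrt(T),            # = LR/n
        "R^2/(eta*T)": R ** 2 / (eta * T),             # = LR/sqrt(n) exactly
        "eta*L^2": eta * L ** 2,                       # = LR/n^{3/2}
        "L^2*eta*sqrt(T)": L ** 2 * eta * math.sqrt(T),# = LR/sqrt(n) exactly
        "L^2*eta*T/n": L ** 2 * eta * T / n,           # = LR/sqrt(n) exactly
    }
    for name, value in terms.items():
        assert value <= target * (1 + 1e-9), name
```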

G.0.2 Risk Bounds for Fixed-Permutation Stochastic Gradient Descent

As a final application, we provide a population risk bound based on the UAS of Algorithm 3. As in the case of sampling-with-replacement SGD, we need an optimization error analysis, which for completeness is provided in Appendix I; it is based on the analysis of the incremental subgradient method (Nedic and Bertsekas, 2001).

Interestingly, combining the incremental-method analysis, which holds for an arbitrary permutation (Nedic and Bertsekas, 2001), with our novel stability bounds, which also hold for an arbitrary permutation, yields generalization bounds for fixed-permutation SGD without the need for reshuffling, or indeed any form of randomization. We believe this could be of independent interest.

Theorem G.2.

Algorithm 3 with constant step size $\eta_k\equiv\eta=R/[Ln\sqrt{K}]$ and $K=n$ epochs satisfies, for every $0<\theta<1$,

\[
\mathbb{P}\Big[\varepsilon_{\mathrm{risk}}({\mathcal{A}}_{\sf PerSGD}) > \frac{cLR}{\sqrt{n}}\log n\log\Big(\frac{n}{\theta}\Big)\Big] \leq \theta,
\]

where $c>0$ is an absolute constant.

As in the previous result, we can remove the polylogarithmic factor if we are only interested in expected excess risk guarantees.

Proof.

By Corollary I.2,

\[
\varepsilon_{\mathrm{opt}}({\mathcal{A}}_{\sf PerSGD}) \leq \frac{R^2}{nK\eta} + \frac{L^2(n+2)\eta}{2} = O\Big(\frac{LR}{\sqrt{n}}\Big),
\]

by our choice of $K$ and $\eta$. On the other hand, Theorem 3.4 guarantees that the algorithm is $\delta$-UAS with probability 1, where $\delta=O(R/\sqrt{n})$. Therefore, by Theorem 2.2, we have that, with probability at least $1-\theta/2$,

\[
|\varepsilon_{\mathrm{gen}}({\mathcal{A}}_{\sf PerSGD})| \leq c\Big(\frac{LR}{\sqrt{n}}\log n\log(2n/\theta) + LR\sqrt{\frac{\log(2/\theta)}{n}}\Big).
\]

Finally, Lemma 2.1 ensures that, with probability at least $1-\theta/2$,

\[
\varepsilon_{\mathrm{approx}} \leq LR\sqrt{\frac{2\log(2/\theta)}{n}}.
\]

By the union bound and the excess risk decomposition (3), we have that, with probability at least $1-\theta$,

\begin{align*}
\varepsilon_{\mathrm{risk}}({\mathcal{A}}_{\sf PerSGD}) &\leq \varepsilon_{\mathrm{opt}}({\mathcal{A}}_{\sf PerSGD}) + \varepsilon_{\mathrm{gen}}({\mathcal{A}}_{\sf PerSGD}) + \varepsilon_{\mathrm{approx}}\\
&= O\Big(\frac{LR}{\sqrt{n}} + \frac{LR}{\sqrt{n}}\log n\log(n/\theta) + LR\sqrt{\frac{\log(1/\theta)}{n}} + LR\sqrt{\frac{\log(2/\theta)}{n}}\Big)\\
&= O\Big(\frac{LR}{\sqrt{n}}\log n\log\Big(\frac{n}{\theta}\Big)\Big). \qquad ∎
\end{align*}
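The choice of $K$ and $\eta$ can likewise be checked numerically; the sketch below, with arbitrary illustrative constants of our choosing, verifies that both terms of the Corollary I.2 bound are $O(LR/\sqrt{n})$ under $\eta=R/[Ln\sqrt{K}]$ and $K=n$.

```python
import math

# Numeric check (illustrative, not part of the proof): with eta = R/(L*n*sqrt(K))
# and K = n epochs, both terms of the Corollary I.2 bound are O(L*R/sqrt(n)).
L, R = 1.5, 2.0  # arbitrary illustrative constants
for n in (10, 100, 10000):
    K = n
    eta = R / (L * n * math.sqrt(K))
    term1 = R ** 2 / (n * K * eta)      # = LR/sqrt(K) = LR/sqrt(n) exactly
    term2 = L ** 2 * (n + 2) * eta / 2  # = LR/sqrt(n) * (n+2)/(2n)
    target = L * R / math.sqrt(n)
    assert abs(term1 - target) <= 1e-9 * target
    assert term2 <= target              # since (n+2)/(2n) <= 1 for n >= 2
```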

Appendix H High-probability Bound on Optimization Error of SGD with Noisy Gradient Oracle

It is known that the standard online-to-batch conversion technique can be used to obtain a high-probability bound on the optimization error (i.e., the excess empirical risk) of stochastic gradient descent. For the sake of completeness and to make the paper more self-contained, we present this technique here for stochastic gradient descent with a noisy gradient oracle. This is done in the following lemma, which is used in the proofs of our results in Section 6.

Lemma H.1 (Optimization error of SGD with noisy gradient oracle).

Let $S=(z_1,\ldots,z_n)\in\mathcal{Z}^n$ be a dataset. Let $F_S(x)=\frac{1}{n}\sum_{i\in[n]}f(x,z_i)$ be the empirical risk associated with $S$, where for every $z\in\mathcal{Z}$, $f(\cdot,z)$ is a convex, $L$-Lipschitz function over $\mathcal{X}\subseteq\mathcal{B}(0,R)$ for some $L,R>0$. Consider the stochastic (sub)gradient method

\[
x^{t+1} = x^t - \eta\cdot\mathbf{g}(x^t,\xi_t) \qquad (\forall t=0,\ldots,T-1),
\]

with output $\overline{x}^T=\frac{1}{T}\sum_{t\in[T]}x^t$, where $\xi_1,\ldots,\xi_T$ are drawn uniformly from $S$ with replacement, and for all $z\in\mathcal{Z}$, $\mathbf{g}(\cdot,z):\mathcal{X}\rightarrow\mathbb{R}^d$ is a random map (referred to as a noisy gradient oracle) that satisfies the following conditions:

  1. Unbiasedness: For every $x\in\mathcal{X}$, $z\in\mathcal{Z}$, $\mathbb{E}[\mathbf{g}(x,z)]=\nabla f(x,z)$, where the expectation is taken over the randomness of the gradient oracle $\mathbf{g}(\cdot,z)$.

  2. Sub-Gaussian gradient noise: There is $\sigma^2\geq 0$ such that for every $x\in\mathcal{X}$, $z\in\mathcal{Z}$, $\mathbf{g}(x,z)-\nabla f(x,z)$ is a $\sigma^2$-sub-Gaussian random vector; that is, $\langle\mathbf{g}(x,z)-\nabla f(x,z),\,u\rangle$ is a $\sigma^2$-sub-Gaussian random variable for every $u\in\mathcal{B}(0,1)$.

  3. Independence of the gradient noise across iterations: Conditioned on any fixed realization of $(\xi_t)_{t\in[T]}$, the sequence of random maps $\mathbf{g}(\cdot,\xi_1),\ldots,\mathbf{g}(\cdot,\xi_T)$ is independent. (Here, the randomness comes only from the gradient oracle.)

Then, for any $\theta\in(4e^{-T/32},1)$, with probability at least $1-\theta$, the optimization error (i.e., the excess empirical risk) of this method is bounded as

\[
\varepsilon_{\mathrm{opt}} \leq \big(LR+\sigma(2R+\eta L)\big)\sqrt{\frac{2\log(4/\theta)}{T}} + \frac{R^2}{2\eta T} + \eta\Big(\frac{L^2}{2}+d\sigma^2\Big).
\]
Proof.

Let $x^\ast_S\in\arg\min_{x\in\mathcal{X}}F_S(x)$. By convexity of the empirical loss, we have

\begin{align*}
F_S(\overline{x}^T)-F_S(x^\ast_S) &\leq \frac{1}{T}\sum_{t\in[T]}F_S(x^t)-F_S(x^\ast_S)\\
&= \frac{1}{T}\sum_{t\in[T]}[F_S(x^t)-f(x^t,\xi_t)] + \frac{1}{T}\sum_{t\in[T]}[f(x^\ast_S,\xi_t)-F_S(x^\ast_S)] + \frac{1}{T}\sum_{t\in[T]}[f(x^t,\xi_t)-f(x^\ast_S,\xi_t)]. \tag{14}
\end{align*}

Since $(\xi_t)_{t\in[T]}$ are sampled uniformly with replacement from $S$, we have

\[
\underset{\xi_t}{\mathbb{E}}\big[f(x^t,\xi_t) \,\big|\, x^1,\ldots,x^{t-1},\,x^t=v\big] = F_S(v),
\]

for all $v\in\mathcal{X}$, $t\in[T]$. Moreover, since the range of $f$ lies in $[-LR,LR]$, it follows that $Y_t:=\sum_{j=1}^{t}[F_S(x^j)-f(x^j,\xi_j)]$, $t\in[T]$, is a martingale with bounded differences (namely, bounded by $2LR$). Therefore, by Azuma's inequality, the first term in (14) satisfies

\[
\mathbb{P}\Bigg[\frac{1}{T}\sum_{t\in[T]}[F_S(x^t)-f(x^t,\xi_t)] > LR\sqrt{\frac{2\log\frac{4}{\theta}}{T}}\Bigg] \leq \frac{\theta}{4}. \tag{15}
\]

By Hoeffding’s inequality, the second term in (14) also satisfies the same bound; namely,

\[
\mathbb{P}\Bigg[\frac{1}{T}\sum_{t\in[T]}[f(x^\ast_S,\xi_t)-F_S(x^\ast_S)] > LR\sqrt{\frac{2\log\frac{4}{\theta}}{T}}\Bigg] \leq \frac{\theta}{4}. \tag{16}
\]

Using an analysis similar to that of standard online gradient descent (Zinkevich, 2003), the last term in (14) can be bounded as

\begin{align*}
\frac{1}{T}\sum_{t\in[T]}[f(x^t,\xi_t)-f(x^\ast_S,\xi_t)] &\leq \frac{R^2}{2T\eta} + \frac{1}{T}\sum_{t\in[T]}\langle\nabla f(x^t,\xi_t)-\mathbf{g}(x^t,\xi_t),\,x^t-x^\ast_S\rangle + \frac{\eta}{2T}\sum_{t\in[T]}\|\mathbf{g}(x^t,\xi_t)\|^2\\
&= \frac{R^2}{2T\eta} + \frac{1}{T}\sum_{t\in[T]}\langle\nabla f(x^t,\xi_t)-\mathbf{g}(x^t,\xi_t),\,x^t-x^\ast_S-\eta\nabla f(x^t,\xi_t)\rangle\\
&\quad + \frac{\eta}{2T}\sum_{t\in[T]}\|\mathbf{g}(x^t,\xi_t)-\nabla f(x^t,\xi_t)\|^2 + \frac{\eta}{2T}\sum_{t\in[T]}\|\nabla f(x^t,\xi_t)\|^2\\
&\leq \frac{R^2}{2T\eta} + \frac{\eta L^2}{2} + \frac{1}{T}\sum_{t\in[T]}\langle\nabla f(x^t,\xi_t)-\mathbf{g}(x^t,\xi_t),\,x^t-x^\ast_S-\eta\nabla f(x^t,\xi_t)\rangle\\
&\quad + \frac{\eta}{2T}\sum_{t\in[T]}\|\mathbf{g}(x^t,\xi_t)-\nabla f(x^t,\xi_t)\|^2. \tag{17}
\end{align*}

By the properties of the gradient oracle stated in the lemma, we can see that for any fixed realization of $(x^t,\xi_t)_{t\in[T]}$, the second term in (17) is a $(2R+\eta L)^2\frac{\sigma^2}{T}$-sub-Gaussian random variable. Hence,

\[
\mathbb{P}\Bigg[\frac{1}{T}\sum_{t\in[T]}\langle\nabla f(x^t,\xi_t)-\mathbf{g}(x^t,\xi_t),\,x^t-x^\ast_S-\eta\nabla f(x^t,\xi_t)\rangle > (2R+\eta L)\sigma\sqrt{\frac{2\log(4/\theta)}{T}}\Bigg] \leq \frac{\theta}{4}. \tag{18}
\]

Let $U_t:=\|\mathbf{g}(x^t,\xi_t)-\nabla f(x^t,\xi_t)\|^2$, $t\in[T]$. Note that $\mathbb{E}[U_t]\leq d\sigma^2$. Moreover, observe (e.g., by (Rigollet, 2015, Lemma 1.12)) that for any fixed realization of $x^t,\xi_t$, $V_t:=U_t-\mathbb{E}[U_t]$ is a sub-exponential random variable with parameter $16d\sigma^2$; namely, $\mathbb{E}[\exp(\lambda V_t)]\leq\exp(128\lambda^2\sigma^4 d^2)$ for all $\lambda\leq\frac{1}{16\sigma^2 d}$. Hence, by a standard concentration argument (e.g., Bernstein's inequality), we have

\[
\mathbb{P}\Bigg[\frac{\eta}{2T}\sum_{t\in[T]}\|\mathbf{g}(x^t,\xi_t)-\nabla f(x^t,\xi_t)\|^2 > \frac{\eta}{2}d\sigma^2 + 16\eta d\sigma^2\,\frac{\log(4/\theta)}{T}\Bigg] \leq \frac{\theta}{4}. \tag{19}
\]

Putting (18) and (19) together, and noticing that $T>32\log(4/\theta)$, we conclude that, with probability at least $1-\theta/2$, the third term of (14) is bounded as

\[
\frac{1}{T}\sum_{t\in[T]}[f(x^t,\xi_t)-f(x^\ast_S,\xi_t)] \leq \frac{R^2}{2T\eta} + \frac{\eta L^2}{2} + \eta\sigma^2 d + (2R+\eta L)\sigma\sqrt{\frac{2\log(4/\theta)}{T}}.
\]

Hence, by the union bound, we conclude that, with probability at least $1-\theta$, the excess empirical risk of the stochastic subgradient method is bounded as

\[
\varepsilon_{\mathrm{opt}} \leq \big(LR+\sigma(2R+\eta L)\big)\sqrt{\frac{2\log(4/\theta)}{T}} + \frac{R^2}{2\eta T} + \eta\Big(\frac{L^2}{2}+d\sigma^2\Big). \qquad ∎
\]
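A minimal one-dimensional sketch of the stochastic subgradient method with a noisy gradient oracle may help make the setting concrete. Here $f(x,z)=|x-z|$, so $F_S$ is minimized at a median of $S$, and the oracle adds Gaussian noise (which is sub-Gaussian) to a subgradient; the instance, parameters, and the loose sanity threshold are our own illustrative choices, not the lemma's constants.

```python
import random

# Minimal 1-D sketch of the noisy-oracle stochastic subgradient method of
# Lemma H.1 (illustrative instance of our choosing): f(x, z) = |x - z|, so
# F_S is minimized at a median of S; the oracle adds Gaussian noise.
rng = random.Random(0)
n, T, eta, sigma, R = 50, 5000, 0.02, 0.05, 1.0
S = sorted(rng.uniform(-R, R) for _ in range(n))
F = lambda x: sum(abs(x - z) for z in S) / n
F_star = F(S[n // 2])  # F attains its minimum at a median of S

x, avg = 0.0, 0.0
for _ in range(T):
    z = S[rng.randrange(n)]                                  # with-replacement sample
    g = (1.0 if x > z else -1.0 if x < z else 0.0) + rng.gauss(0.0, sigma)
    x = max(-R, min(R, x - eta * g))                         # projected step
    avg += x / T                                             # averaged iterate
eps_opt = F(avg) - F_star                                    # excess empirical risk

assert 0.0 <= eps_opt < 0.2  # loose sanity threshold for this illustrative run
```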

Appendix I Empirical Risk of Fixed-Permutation SGD

Our optimization error analysis is based on (Nedic and Bertsekas, 2001, Lemma 2.1).

Lemma I.1.

Consider the fixed-permutation stochastic gradient descent (Algorithm 3) with an arbitrary (i.e., not necessarily random) permutation and constant step size within each epoch (i.e., $\eta_{(k-1)n+t}\equiv\eta_k$ for all $t\in[n]$, $k\in[K]$). Then

\[
\eta_k[F_S(x^k)-F_S(y)] \leq \frac{1}{2n}\big[\|x^k-y\|^2-\|x^{k+1}-y\|^2\big] + \frac{\eta_k^2 L^2(n+2)}{2} \qquad (\forall y\in\mathcal{X}).
\]
Proof.

First, since the permutation is arbitrary, we may assume w.l.o.g. that $\bm{\pi}$ is the identity (a choice made purely for notational convenience). Let $y\in\mathcal{X}$. At each round, the recursion of SGD implies that

\begin{align*}
\|x_{t+1}^k-y\|^2 &= \|\mathsf{Proj}_{\mathcal{X}}[x_t^k-\eta_k\nabla f(x_t^k,z_t)]-\mathsf{Proj}_{\mathcal{X}}(y)\|^2\\
&\leq \|x_t^k-\eta_k\nabla f(x_t^k,z_t)-y\|^2\\
&\leq \|x_t^k-y\|^2 + \eta_k^2 L^2 - 2\eta_k\langle\nabla f(x_t^k,z_t),\,x_t^k-y\rangle\\
&\leq \|x_t^k-y\|^2 + \eta_k^2 L^2 - 2\eta_k[f(x_t^k,z_t)-f(y,z_t)],
\end{align*}
where the third line uses $\|\nabla f(x_t^k,z_t)\|\leq L$ and the last line uses convexity.

Let $r_t:=\|x_t^k-y\|$. Summing these inequalities over $t=1,\ldots,n$ gives

\begin{align*}
r_{n+1}^2-r_1^2 &\leq nL^2\eta_k^2 + 2\eta_k\sum_{t=1}^{n}[f(x^k,z_t)-f(x_t^k,z_t)] - 2\eta_k n[F_S(x^k)-F_S(y)]\\
&\leq nL^2\eta_k^2 + 2\eta_k\sum_{t=1}^{n}L\|x^k-x_t^k\| - 2\eta_k n[F_S(x^k)-F_S(y)]\\
&\leq nL^2\eta_k^2 + 2\eta_k^2 L^2\sum_{t=1}^{n}t - 2\eta_k n[F_S(x^k)-F_S(y)]\\
&= \eta_k^2 L^2 n + \eta_k^2 L^2 n(n+1) - 2\eta_k n[F_S(x^k)-F_S(y)].
\end{align*}

Rearranging terms, we obtain the result. ∎
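The per-epoch inequality can be checked numerically on a toy instance; the sketch below uses $f(x,z)=|x-z|$ in one dimension (so $L=1$), the identity permutation, and a single epoch, all of which are our own illustrative choices.

```python
import random

# Numeric check (illustrative): the per-epoch inequality of Lemma I.1 in one
# dimension with f(x, z) = |x - z| (so L = 1), identity permutation, one epoch.
rng = random.Random(1)
n, L, R, eta_k = 20, 1.0, 1.0, 0.01
S = [rng.uniform(-R, R) for _ in range(n)]
F = lambda x: sum(abs(x - z) for z in S) / n
proj = lambda x: max(-R, min(R, x))

x_start = 0.3  # x^k, the iterate at the start of the epoch
x = x_start
for z in S:    # one pass over the data in the fixed (identity) order
    g = 1.0 if x > z else -1.0 if x < z else 0.0   # subgradient of f(., z)
    x = proj(x - eta_k * g)
x_end = x      # x^{k+1}

for y in (-0.9, -0.2, 0.0, 0.5, 0.95):
    lhs = eta_k * (F(x_start) - F(y))
    rhs = ((x_start - y) ** 2 - (x_end - y) ** 2) / (2 * n) \
          + eta_k ** 2 * L ** 2 * (n + 2) / 2
    assert lhs <= rhs + 1e-12  # the inequality of Lemma I.1
```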

Using the previous lemma, it is straightforward to derive the optimization accuracy of the method.

Corollary I.2.

The fixed-permutation stochastic gradient descent (Algorithm 3), with an arbitrary (i.e., not necessarily random) permutation and constant step size within each epoch (i.e., $\eta_{(k-1)n+t}\equiv\eta_k$ for all $t\in[n]$, $k\in[K]$), satisfies

\[
\varepsilon_{\mathrm{opt}} \leq \frac{\|x^1-x^\ast(S)\|^2}{2n\sum_k\eta_k} + \frac{L^2(n+2)}{2}\cdot\frac{\sum_k\eta_k^2}{\sum_k\eta_k}.
\]
Proof.

By convexity and Lemma I.1, we have

\begin{align*}
F_S(\overline{x}^K)-F_S(x^\ast(S)) &\leq \frac{1}{\sum_{k=1}^{K}\eta_k}\sum_{k=1}^{K}\eta_k[F_S(x^k)-F_S(x^\ast(S))]\\
&\leq \frac{1}{\sum_{k=1}^{K}\eta_k}\Big[\frac{1}{2n}\|x^1-x^\ast(S)\|^2 + \frac{L^2(n+2)}{2}\sum_{k=1}^{K}\eta_k^2\Big],
\end{align*}

which proves the result. ∎