
Stability of Stochastic Gradient Descent
on Nonsmooth Convex Losses

Raef Bassily Department of Computer Science & Engineering, The Ohio State University. [email protected]    Vitaly Feldman Work done while at Google Research. [email protected]    Cristóbal Guzmán Institute for Mathematical and Computational Engineering, Pontificia Universidad Católica de Chile. [email protected]    Kunal Talwar Work done while at Google Research. [email protected]
Abstract

Uniform stability is a notion of algorithmic stability that bounds the worst-case change in the model output by the algorithm when a single data point in the dataset is replaced. An influential work of Hardt et al. (2016) provides strong upper bounds on the uniform stability of the stochastic gradient descent (SGD) algorithm on sufficiently smooth convex losses. These results led to important progress in understanding the generalization properties of SGD and to several applications in differentially private convex optimization for smooth losses.

Our work is the first to address uniform stability of SGD on nonsmooth convex losses. Specifically, we provide sharp upper and lower bounds for several forms of SGD and full-batch GD on arbitrary Lipschitz nonsmooth convex losses. Our lower bounds show that, in the nonsmooth case, (S)GD can be inherently less stable than in the smooth case. On the other hand, our upper bounds show that (S)GD is sufficiently stable for deriving new and useful bounds on generalization error. Most notably, we obtain the first dimension-independent generalization bounds for multi-pass SGD in the nonsmooth case. In addition, our bounds allow us to derive a new algorithm for differentially private nonsmooth stochastic convex optimization with optimal excess population risk. Our algorithm is simpler and more efficient than the best known algorithm for the nonsmooth case (Feldman et al., 2020).

1 Introduction

Successful applications of a machine learning algorithm require the algorithm to generalize well to unseen data. Thus understanding and bounding the generalization error of machine learning algorithms is an area of intense theoretical interest and practical importance. The single most popular approach to modern machine learning relies on the use of continuous optimization techniques to optimize the appropriate loss function, most notably the stochastic (sub)gradient descent (SGD) method. Yet the generalization properties of SGD are still not well understood.

Consider the setting of stochastic convex optimization (SCO). In this problem, we are interested in the minimization of the population risk F𝒟(x):=𝔼𝐳𝒟[f(x,𝐳)]F_{\mathcal{D}}(x):=\mathbb{E}_{\mathbf{z}\sim\mathcal{D}}[f(x,\mathbf{z})], where 𝒟\mathcal{D} is an arbitrary and unknown distribution, for which we have access to an i.i.d. sample of size nn, 𝐒=(𝐳1,,𝐳n)\mathbf{S}=(\mathbf{z}_{1},\ldots,\mathbf{z}_{n}); and f(,z)f(\cdot,z) is convex and Lipschitz for all zz. The performance of an algorithm 𝒜{\cal A} is quantified by its expected excess population risk,

εrisk(𝒜):=𝔼[F𝒟(𝒜(𝐒))]minx𝒳F𝒟(x),\varepsilon_{\mbox{\footnotesize{risk}}}({\mathcal{A}}):=\mathbb{E}[F_{\mathcal{D}}({\mathcal{A}}(\mathbf{S}))]-\min_{x\in\mathcal{X}}F_{\mathcal{D}}(x),

where the expectation is taken with respect to the randomness of the sample 𝐒\mathbf{S} and internal randomness of 𝒜{\mathcal{A}}. A standard way to bound the excess risk is given by its decomposition into optimization error (a.k.a. training error) and generalization error (see eqn. (3) in Sec. 2). The optimization error can be easily measured empirically but assessing the generalization error requires access to fresh samples from the same distribution. Thus bounds on the generalization error lead directly to provable guarantees on the excess population risk.

The classical analysis of SGD yields bounds on the excess population risk of one-pass SGD. In particular, with an appropriately chosen step size, SGD gives a solution with expected excess population risk of O(1/n)O(1/\sqrt{n}), and this rate is optimal (Nemirovsky and Yudin, 1983). However, this analysis does not apply to multi-pass SGD, which is ubiquitous in practice.

In an influential work, Hardt et al. (2016) gave the first bounds on the generalization error of general forms of SGD (such as those that make multiple passes over the data). Their analysis relies on algorithmic stability, a classical tool for proving bounds on the generalization error. Specifically, they gave strong bounds on the uniform stability of several variants of SGD on convex and smooth losses (with 2/η2/\eta-smoothness sufficing when all the step sizes are at most η\eta). Uniform stability bounds the worst-case change in the loss of the model output by the algorithm on a worst-case point when a single data point in the dataset is replaced (Bousquet and Elisseeff, 2002). Formally, for a randomized algorithm 𝒜{\cal A}, datasets SSS\simeq S^{\prime}, and a point z𝒵z\in\mathcal{Z}, let γ𝒜(S,S,z):=f(𝒜(S),z)f(𝒜(S),z)\gamma_{\cal A}(S,S^{\prime},z):=f({\mathcal{A}}(S),z)-f({\mathcal{A}}(S^{\prime}),z), where SSS\simeq S^{\prime} denotes that the two datasets differ only in a single data point. We say 𝒜{\cal A} is γ\gamma-uniformly stable if

supSS,z𝔼[γ𝒜(S,S,z)]γ,\sup_{S\simeq S^{\prime},z}\mathbb{E}[\gamma_{\cal A}(S,S^{\prime},z)]\leq\gamma,

where the expectation is over the internal randomness of 𝒜\cal{A}. Stronger notions of stability can also be considered, e.g., bounding the probability – over the internal randomness of 𝒜\cal{A} – that γ𝒜(S,S,z)>γ\gamma_{\cal A}(S,S^{\prime},z)>\gamma. Using stability, Hardt et al. (2016) showed that several variants of SGD simultaneously achieve the optimal tradeoff between the excess empirical risk and stability, with both being O(1/n)O(1/\sqrt{n}). Several works have used this approach to derive new generalization properties of SGD (London, 2017; Chen et al., 2018; Feldman and Vondrák, 2019).

The key insight of Hardt et al. (2016) is that a gradient step on a sufficiently smooth convex function is a nonexpansive operator (that is, it does not increase the 2\ell_{2} distance between points). Unfortunately, this property does not hold for nonsmooth losses such as the hinge loss. As a result, no non-trivial bounds on the uniform stability of SGD were previously known in this case.
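To make this concrete, here is a minimal numeric illustration of our own (not from the paper): a gradient step on a 1-smooth convex function contracts distances for any step size at most 2, while a subgradient step on the nonsmooth 1-Lipschitz loss f(x) = |x| can push two nearby points far apart.

```python
import numpy as np

eta = 0.5  # step size

# Smooth convex loss: f(x) = x^2 / 2 (1-smooth, gradient x).
# For eta <= 2, the gradient step x -> (1 - eta) * x is nonexpansive.
def smooth_step(x):
    return x - eta * x

# Nonsmooth convex loss: f(x) = |x| (1-Lipschitz, subgradient sign(x)).
def nonsmooth_step(x):
    return x - eta * np.sign(x)

x, y = 0.01, -0.01
print(abs(smooth_step(x) - smooth_step(y)))        # 0.01 <= |x - y| = 0.02
print(abs(nonsmooth_step(x) - nonsmooth_step(y)))  # about 0.98: the gap expanded 49x
```

The two points straddling the kink at the origin are sent in opposite directions, which is exactly the expansive behavior that breaks the smooth-case analysis.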

Uniform stability is also closely related to the notion of differential privacy (DP). DP upper bounds the worst-case change in the output distribution of an algorithm when a single data point in the dataset is replaced (Dwork et al., 2006b). This connection has been exploited in the design of several DP algorithms for SCO. In particular, bounds on the uniform stability of SGD from (Hardt et al., 2016) have been crucial in the design and analysis of new DP-SCO algorithms (Wu et al., 2017; Dwork and Feldman, 2018; Feldman et al., 2018; Bassily et al., 2019; Feldman et al., 2020).

1.1 Our Results

We establish tight bounds on the uniform stability of the (stochastic) subgradient descent method on nonsmooth convex losses. These results demonstrate that, in the nonsmooth case, SGD can be substantially less stable. At the same time, we show that SGD has strong stability properties even in the regime where its iterations can be expansive.

For convenience, we describe our results in terms of uniform argument stability (UAS), which bounds the output sensitivity in 2\ell_{2}-norm w.r.t. an arbitrary change in a single data point. Formally, a (randomized) algorithm has δ\delta-UAS if

supSS𝔼𝒜(S)𝒜(S)2δ.\sup_{S\simeq S^{\prime}}\mathbb{E}\left\|{\mathcal{A}}(S)-{\mathcal{A}}(S^{\prime})\right\|_{2}\leq\delta. (1)

This notion is implicit in existing analyses of uniform stability (Bousquet and Elisseeff, 2002; Shalev-Shwartz et al., 2010; Hardt et al., 2016) and was explicitly defined by Liu et al. (2017). In this work, we prove stronger, high-probability upper bounds on the random variable δ𝒜(S,S):=𝒜(S)𝒜(S)\delta_{\cal A}(S,S^{\prime}):=\left\|{\mathcal{A}}(S)-{\mathcal{A}}(S^{\prime})\right\|. (In fact, for both GD and fixed-permutation SGD we obtain upper bounds on δ𝒜(S,S)\delta_{\cal A}(S,S^{\prime}) that hold with probability 1, whereas for sampling-with-replacement SGD we obtain a high-probability upper bound.) We also provide matching lower bounds for the weaker, in-expectation notion of UAS (1). A summary of our bounds is in Table 1. For simplicity, they are stated for constant step size; upper bounds for general step sizes are provided in Section 3.

Algorithm | H.p. upper bound | Exp. upper bound | Exp. lower bound
GD (full batch) | 4\big{(}\eta\sqrt{T}+\frac{\eta T}{n}\big{)} | 4\big{(}\eta\sqrt{T}+\frac{\eta T}{n}\big{)} | \Omega\big{(}\eta\sqrt{T}+\frac{\eta T}{n}\big{)}
SGD (w/replacement) | 4\big{(}\eta\sqrt{T}+\frac{\eta T}{n}\big{)} | \min\{1,\frac{T}{n}\}4\eta\sqrt{T}+4\frac{\eta T}{n} | \Omega\Big{(}\min\{1,\frac{T}{n}\}\eta\sqrt{T}+\frac{\eta T}{n}\Big{)}
SGD (fixed permutation) | 2\eta\sqrt{T}+4\frac{\eta T}{n} | \min\{1,\frac{T}{n}\}2\eta\sqrt{T}+4\frac{\eta T}{n} | \Omega\Big{(}\min\{1,\frac{T}{n}\}\eta\sqrt{T}+\frac{\eta T}{n}\Big{)}
Table 1: UAS bounds for variants of GD/SGD, with normalized radius and Lipschitz constant. Here TT is the number of iterations and η>0\eta>0 is the step size. Each bound should be read as \min\{2,(\cdot)\} of the stated quantity, due to the bounded radius of the feasible domain.

Compared to the smooth case (Hardt et al., 2016), the main difference is the presence of the additional ηT\eta\sqrt{T} term. This term has important implications for the generalization bounds derived from UAS. The first is that the standard step size η=Θ(1/n)\eta=\Theta(1/\sqrt{n}) used in single-pass SGD leads to a vacuous stability bound. Unfortunately, as shown by our lower bounds, this is unavoidable (at least in high dimension). However, by decreasing the step size and increasing the number of steps, one obtains a variant of SGD with a nearly optimal balance between UAS and the excess empirical risk.

We highlight two major consequences of our bounds:

  • Generalization bounds for multi-pass nonsmooth SGD. We prove that the generalization error of multi-pass SGD with KK passes is bounded by O((Kn+K)η)O((\sqrt{Kn}+K)\eta). This result can be easily combined with training-error guarantees to provide excess risk bounds for this algorithm. Since the training error can be measured directly, our generalization bounds immediately yield strong guarantees on the excess risk in practical scenarios where a small training error can be certified.

  • Differentially private stochastic convex optimization for nonsmooth losses. We show that a variant of standard noisy SGD (Bassily et al., 2014) with constant step size and n2n^{2} iterations yields the optimal excess population risk O(1n+dlog(1/β)αn)O\big{(}\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log(1/\beta)}}{\alpha n}\big{)} for convex nonsmooth losses under (α,β)(\alpha,\beta)-differential privacy. The best previous algorithm for this problem is substantially more involved: it relies on a multi-phase regularized SGD with decreasing step sizes and variable noise rates, and uses O(n2log(1/β))O(n^{2}\sqrt{\log(1/\beta)}) gradient computations (Feldman et al., 2020).
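The shape of such a noisy SGD variant can be sketched as follows. This is our own schematic illustration, not the paper's algorithm: the interface (`grad_f`, `project_ball`) is hypothetical, the toy loss is ours, and the Gaussian multiplier `sigma` is a placeholder whose calibration to a target (α, β)-DP guarantee is omitted.

```python
import numpy as np

def project_ball(x, R):
    """Euclidean projection onto the ball B(0, R)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= R else (R / nrm) * x

def noisy_sgd(data, grad_f, d, R=1.0, eta=0.01, T=None, sigma=0.1, seed=0):
    """Schematic noisy projected SGD. grad_f(x, z) returns a subgradient
    of f(., z) at x; sigma is an illustrative noise multiplier (its DP
    calibration is not shown here)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    T = T if T is not None else n * n       # constant step size, n^2 iterations
    x = np.zeros(d)
    avg = np.zeros(d)
    for _ in range(T):
        z = data[rng.integers(n)]                           # sample w/replacement
        g = grad_f(x, z) + sigma * rng.standard_normal(d)   # noisy subgradient
        x = project_ball(x - eta * g, R)
        avg += x / T
    return avg                                              # average iterate

# Toy nonsmooth 1-Lipschitz loss: f(x, z) = ||x - z||.
def grad_f(x, z):
    g = x - z
    nrm = np.linalg.norm(g)
    return g / nrm if nrm > 0 else np.zeros_like(g)

data = 0.1 * np.random.default_rng(1).normal(size=(50, 3))
x_hat = noisy_sgd(data, grad_f, d=3, T=2500)
```

The point of the result above is that this simple single-phase scheme, run with constant step size for n² iterations, already achieves the optimal rate.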

1.2 Overview of Techniques

  • Upper bounds. When gradient steps are nonexpansive, upper-bounding UAS requires simply summing the differences between the gradients on the neighboring datasets at the steps where the replaced data point is used (Hardt et al., 2016). This gives the bound of ηT/n\eta T/n in the smooth case.

    By contrast, in the nonsmooth case, UAS may increase even when the gradient step is performed on the same function. As a result, it may increase in every single iteration. However, we use the fact that the difference in the subgradients has a nonpositive inner product with the difference between the iterates themselves (by monotonicity of the subgradient). Thus, the increase in distance satisfies a recurrence with a quadratic and a linear term. Solving this recurrence leads to our upper bounds.

  • Lower bounds. The lower bounds are based on a function with a highly nonsmooth behavior around the origin. More precisely, it is the maximum of linear functions plus a small linear drift that is controlled by a single data point. We show that, when starting the algorithm from the origin, the presence of the linear drift pushes the iterate into a trajectory in which each subgradient step is orthogonal to the current iterate. Thus, if dmin{T,1/η2}d\geq\min\{T,1/\eta^{2}\}, we get the Tη\sqrt{T}\eta increase in UAS. Our lower bounds are also robust to averaging of the iterates.
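The flavor of the recurrence from the upper-bound argument can be seen numerically. In the case a_t = 0 (both trajectories see the same functions after the swap), the worst case of the recurrence δ_{t+1}² ≤ δ_t² + 4L²η² alone already forces growth of order 2Lη√T, which is exactly the extra √T term in our bounds. The constants below are arbitrary choices of ours.

```python
import math

L, eta, T = 1.0, 0.01, 10_000

# Iterate the worst case of delta_{t+1}^2 <= delta_t^2 + 4 * L^2 * eta^2,
# i.e., the a_t = 0 branch of the recurrence.
delta = 0.0
for _ in range(T):
    delta = math.sqrt(delta ** 2 + 4 * L ** 2 * eta ** 2)

print(delta)                        # 2.0 (up to rounding)
print(2 * L * eta * math.sqrt(T))   # 2.0: growth is exactly 2 * L * eta * sqrt(T)
```

With the standard single-pass choice η = Θ(1/√n) and T = n this term is a constant, which is why that step size gives a vacuous stability bound in the nonsmooth case.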

1.3 Other Related Work

Stability is a classical approach to proving generalization bounds, pioneered by Rogers and Wagner (1978) and Devroye and Wagner (1979a, 1979b). It is based on analysis of the sensitivity of the learning algorithm to changes in the dataset, such as leaving one of the data points out or replacing it with a different one. The choice of how to measure the effect of the change and various ways to average over multiple changes give rise to a variety of stability notions (e.g., (Bousquet and Elisseeff, 2002; Mukherjee et al., 2006; Shalev-Shwartz et al., 2010)). Uniform stability was introduced by Bousquet and Elisseeff (2002) in order to derive general bounds on the generalization error that hold with high probability. These bounds have been significantly improved in a recent sequence of works (Feldman and Vondrák, 2018, 2019; Bousquet et al., 2019). A long line of work focuses on the relationship between various notions of stability and learnability in the supervised setting (see (Kearns and Ron, 1999; Poggio et al., 2004; Shalev-Shwartz et al., 2010) for an overview). These works employ relatively weak notions of average stability and derive a variety of asymptotic equivalence results. Chen et al. (2018) establish limits of stability in the smooth convex setting, proving that accelerated methods must satisfy strong stability lower bounds. Stability-based data-dependent generalization bounds for continuous losses were studied in (Maurer, 2017; Kuzborskij and Lampert, 2018).

The first applications of uniform stability in the context of stochastic convex optimization relied on the stability of the empirical minimizer for strongly convex losses (Bousquet and Elisseeff, 2002). Therefore, a natural approach to achieving uniform stability (and also UAS) is to add a strongly convex regularizer and solve the ERM to high accuracy (Shalev-Shwartz et al., 2010). Recent applications of this approach can be found, for example, in (Koren and Levy, 2015; Charles and Papailiopoulos, 2018; Feldman et al., 2020). In contrast, our approach does not require strong convexity and applies to all iterates of SGD, not only to a highly accurate empirical minimizer.

The classical approach to generalization relies on uniform convergence of the empirical risk to the population risk. Unfortunately, without additional structural assumptions on the convex functions, a lower bound of Ω(d/n)\Omega(\sqrt{d/n}) on the rate of uniform convergence for convex SCO is known (Shalev-Shwartz et al., 2010; Feldman, 2016). The dependence on the dimension dd makes bounds obtained via the uniform-convergence approach vacuous in the high-dimensional settings common in modern applications.

Differentially private convex optimization has been studied extensively for over a decade (see, e.g., (Chaudhuri and Monteleoni, 2008; Chaudhuri et al., 2011; Jain et al., 2012; Kifer et al., 2012; Smith and Thakurta, 2013; Bassily et al., 2014; Ullman, 2015; Jain and Thakurta, 2014; Talwar et al., 2015; Bassily et al., 2019; Feldman et al., 2020)). However, until recently, the research focused on minimization of the empirical risk. Population risk for DP-SCO was first studied by Bassily et al. (2014), who gave an upper bound of max(d14n,dαn)\max\left(\tfrac{d^{\frac{1}{4}}}{\sqrt{n}},\tfrac{\sqrt{d}}{\alpha n}\right) on the excess risk (Bassily et al., 2014, Sec. F). A recent work of Bassily et al. (2019) established that the optimal rate of the excess population risk for (α,β)(\alpha,\beta)-DP SCO algorithms is O(1n+dlog(1/β)αn)O\big{(}\frac{1}{\sqrt{n}}+\frac{\sqrt{d\log(1/\beta)}}{\alpha n}\big{)}. Their algorithms are relatively inefficient, especially in the nonsmooth case. Subsequently, Feldman et al. (2020) gave several new algorithms for DP-SCO with the optimal population risk. For sufficiently smooth losses, their algorithms use a linear number of gradient computations. In the nonsmooth case, as mentioned earlier, their algorithm requires O(n2log(1/β))O(n^{2}\sqrt{\log(1/\beta)}) gradient computations and is significantly more involved than the algorithm given here.

2 Notation and Preliminaries

Throughout, we work in the Euclidean space (d,2)(\mathbb{R}^{d},\|\cdot\|_{2}); accordingly, we write =2\|\cdot\|=\|\cdot\|_{2} unambiguously. Vectors are denoted by lower-case letters, e.g., x,yx,y. Random variables (either scalar or vector) are denoted by boldface letters, e.g., 𝐳,𝐮\mathbf{z},\mathbf{u}. We denote the Euclidean ball of radius r>0r>0 centered at xdx\in\mathbb{R}^{d} by (x,r)\mathcal{B}(x,r). In what follows, 𝒳d\mathcal{X}\subseteq\mathbb{R}^{d} is a compact convex set, and we assume we know a bound R>0R>0 on its Euclidean radius, i.e., 𝒳(0,R)\mathcal{X}\subseteq\mathcal{B}(0,R). Let 𝖯𝗋𝗈𝗃𝒳\mathsf{Proj}_{\mathcal{X}} be the Euclidean projection onto 𝒳\mathcal{X}, which is nonexpansive: 𝖯𝗋𝗈𝗃𝒳(x)𝖯𝗋𝗈𝗃𝒳(y)xy\|\mathsf{Proj}_{\mathcal{X}}(x)-\mathsf{Proj}_{\mathcal{X}}(y)\|\leq\|x-y\|. A convex function f:𝒳f:\mathcal{X}\mapsto\mathbb{R} is LL-Lipschitz if

f(x)f(y)Lxy(x,y𝒳).f(x)-f(y)\leq L\|x-y\|\qquad(\forall x,y\in\mathcal{X}). (2)

Functions with these properties are guaranteed to be subdifferentiable. Moreover, in the convex case, property (2) is “almost” equivalent to having subgradients bounded as f(x)(0,L)\partial f(x)\subseteq\mathcal{B}(0,L), for all x𝒳x\in\mathcal{X}. (For this equivalence to hold, it is necessary that the function is well-defined and satisfies (2) over an open set containing 𝒳\mathcal{X}; see Thm. 3.61 in Beck (2017). We assume this is the case, which can be done w.l.o.g.) We denote the class of convex LL-Lipschitz functions by 𝒳0(L)\mathcal{F}_{\mathcal{X}}^{0}(L). With a slight abuse of notation, given a function f𝒳0(L)f\in\mathcal{F}_{\mathcal{X}}^{0}(L), we denote by f(x)\nabla f(x) an arbitrary choice of gf(x)g\in\partial f(x). In this work, we focus on the class 𝒳0(L)\mathcal{F}_{\mathcal{X}}^{0}(L) defined over a compact convex set 𝒳\mathcal{X}. Since the Euclidean radius of 𝒳\mathcal{X} is bounded by RR, we assume that the range of these functions lies in [RL,RL][-RL,RL].

Nonsmooth stochastic convex optimization: We study the standard setting of nonsmooth stochastic convex optimization

xargmin{F𝒟(x):=𝔼𝐳𝒟[f(x,𝐳)]:x𝒳}.x^{\ast}\in\arg\min\{F_{\mathcal{D}}(x):=\mathbb{E}_{\mathbf{z}\sim\mathcal{D}}[f(x,\mathbf{z})]:\,\,x\in\mathcal{X}\}.

Here, 𝒟\mathcal{D} is an unknown distribution supported on a set 𝒵\mathcal{Z}, and f(,z)𝒳0(L)f(\cdot,z)\in\mathcal{F}_{\mathcal{X}}^{0}(L) for all z𝒵z\in\mathcal{Z}. In the stochastic setting, we assume access to an i.i.d. sample from 𝒟\mathcal{D}, denoted as 𝐒=(𝐳1,,𝐳n)𝒟n\mathbf{S}=(\mathbf{z}_{1},\ldots,\mathbf{z}_{n})\sim\mathcal{D}^{n}. Here, we will use the bold symbol 𝐒\mathbf{S} to denote a random sample from the unknown distribution. A fixed (not random) dataset from 𝒵n\mathcal{Z}^{n} will be denoted as S=(z1,,zn)𝒵nS=(z_{1},\ldots,z_{n})\in\mathcal{Z}^{n}.

A stochastic optimization algorithm is a (randomized) mapping 𝒜:𝒵n𝒳{\cal A}:\mathcal{Z}^{n}\mapsto\mathcal{X}. When the algorithm is randomized, 𝒜(𝐒){\cal A}(\mathbf{S}) is a random variable depending on both the sample 𝐒𝒟n\mathbf{S}\sim\mathcal{D}^{n} and its own random coins. The performance of 𝒜{\cal A} is quantified by its excess population risk

εrisk(𝒜):=F𝒟(𝒜(𝐒))F𝒟(x).\varepsilon_{\mbox{\footnotesize{risk}}}({\mathcal{A}}):=F_{\mathcal{D}}({\mathcal{A}}(\mathbf{S}))-F_{\mathcal{D}}(x^{\ast}).

Note that εrisk(𝒜)\varepsilon_{\mbox{\footnotesize{risk}}}({\mathcal{A}}) is a random variable (due to randomness in the sample 𝐒\mathbf{S} and any possible internal randomness of the algorithm). Our guarantees on the excess population risk will be expressed in terms of upper bounds on this quantity that hold with high probability over the randomness of both 𝐒\mathbf{S} and the random coins of the algorithm.

Empirical risk minimization (ERM) is one of the most standard approaches to stochastic convex optimization. In the ERM problem, we are given a sample 𝐒=(𝐳1,,𝐳n)\mathbf{S}=(\mathbf{z}_{1},\ldots,\mathbf{z}_{n}), and the goal is to find

x(𝐒)argmin{F𝐒(x):=1ni=1nf(x,𝐳i):x𝒳}.x^{\ast}(\mathbf{S})\in\arg\min\Big{\{}F_{\mathbf{S}}(x):=\frac{1}{n}\sum_{i=1}^{n}f(x,\mathbf{z}_{i}):\,\,x\in\mathcal{X}\Big{\}}.

One way to bound the excess population risk is to solve the ERM problem and appeal to uniform convergence; however, uniform convergence rates in this case are dimension-dependent: Ω(d/n)\Omega(\sqrt{d/n}) (Feldman, 2016).

Risk decomposition: Guaranteeing low excess population risk for a general algorithm is a nontrivial task. A common way to bound it is by decomposing it into generalization, optimization and approximation error:

εrisk(𝒜)F𝒟(𝒜(𝐒))F𝐒(𝒜(𝐒))εgen(𝒜)+F𝐒(𝒜(𝐒))F𝐒(x(𝐒))εopt(𝒜)+F𝐒(x(𝐒))F𝒟(x)εapprox.\varepsilon_{\mbox{\footnotesize{risk}}}({\mathcal{A}})\leq\underbrace{F_{\mathcal{D}}({\mathcal{A}}(\mathbf{S}))-F_{\mathbf{S}}({\mathcal{A}}(\mathbf{S}))}_{\varepsilon_{\mbox{\tiny gen}}({\mathcal{A}})}+\underbrace{F_{\mathbf{S}}({\mathcal{A}}(\mathbf{S}))-F_{\mathbf{S}}(x^{\ast}(\mathbf{S}))}_{\varepsilon_{\mbox{\tiny opt}}({\mathcal{A}})}+\underbrace{F_{\mathbf{S}}(x^{\ast}(\mathbf{S}))-F_{\mathcal{D}}(x^{\ast})}_{\varepsilon_{\mbox{\tiny approx}}}. (3)

Here, the optimization error corresponds to the empirical optimization gap, which can be bounded by standard optimization convergence analysis. The expected value of the approximation error is at most zero. One can show, e.g., by Hoeffding’s inequality, that the approximation error is bounded by O~(LR/n)\tilde{O}(LR/\sqrt{n}) with high probability (see Lemma 2.1 below). Therefore, to establish bounds on the excess risk, it suffices to upper bound the optimization and generalization errors.
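The decomposition can be estimated term by term on a synthetic problem; the following toy sketch is ours (the learner and loss are arbitrary choices, and a large fresh sample stands in for the population risk F_D):

```python
import numpy as np

rng = np.random.default_rng(0)

def risk(x, Z):
    """Empirical risk F_Z(x) for the toy loss f(x, z) = |x - z|."""
    return np.mean(np.abs(x - Z))

def train(S, T=200, eta=0.05):
    """Toy learner A: subgradient descent on F_S, averaged iterate."""
    x, avg = 0.0, 0.0
    for _ in range(T):
        g = np.mean(np.sign(x - S))   # subgradient of F_S at x
        x -= eta * g
        avg += x / T
    return avg

S = rng.normal(size=100)              # training sample
fresh = rng.normal(size=100_000)      # proxy for the population distribution D

x_hat = train(S)
x_erm = np.median(S)                  # exact ERM for this loss

eps_opt = risk(x_hat, S) - risk(x_erm, S)       # optimization error (>= 0)
eps_gen = risk(x_hat, fresh) - risk(x_hat, S)   # generalization error estimate
print(eps_opt, eps_gen)
```

As the text notes, eps_opt is directly measurable from the data, whereas eps_gen requires fresh samples, which is why stability-based bounds on it are valuable.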

Lemma 2.1.

For any θ(0,1),\theta\in(0,1), with probability at least 1θ1-\theta, the approximation error is bounded as

εapproxRL2log(1/θ)n.\varepsilon_{\mathrm{\tiny approx}}\leq\frac{RL\sqrt{2\log(1/\theta)}}{\sqrt{n}}.
Proof.

First, note that F𝐒(x)F_{\mathbf{S}}(x^{\ast}) is the average of the nn independent random variables f(x,𝐳i)f(x^{\ast},\mathbf{z}_{i}), and that 𝔼𝐒[F𝐒(x)]=F𝒟(x)\underset{\mathbf{S}}{\mathbb{E}}\left[F_{\mathbf{S}}(x^{\ast})\right]=F_{\mathcal{D}}(x^{\ast}). Hence, by independence and the fact that f(x,𝐳i)[RL,RL]f(x^{\ast},\mathbf{z}_{i})\in[-RL,RL] with probability 11 for all i[n]i\in[n], Hoeffding’s inequality gives:

𝐒𝒟n[F𝐒(x)F𝒟(x)RL2log(1/θ)n]θ.\underset{\mathbf{S}\sim\mathcal{D}^{n}}{\mathbb{P}}\left[F_{\mathbf{S}}(x^{\ast})-F_{\mathcal{D}}(x^{\ast})\geq\frac{RL\sqrt{2\log(1/\theta)}}{\sqrt{n}}\right]\leq\theta.

Finally, note that by definition of x(𝐒)x^{\ast}(\mathbf{S}), we have F𝐒(x(𝐒))F𝐒(x)0F_{\mathbf{S}}(x^{\ast}(\mathbf{S}))-F_{\mathbf{S}}(x^{\ast})\leq 0. Combining this with the above bound completes the proof. ∎

We say that two datasets S,SS,S^{\prime} are neighboring, denoted SSS\simeq S^{\prime}, if they differ in at most a single entry; i.e., there exists i[n]i\in[n] such that zk=zkz_{k}=z_{k}^{\prime} for all kik\neq i.

Uniform argument stability (UAS): Given an algorithm 𝒜{\cal A} and datasets SSS\simeq S^{\prime}, we define the uniform argument stability (UAS) random variable as

δ𝒜(S,S):=𝒜(S)𝒜(S).\delta_{\cal A}(S,S^{\prime}):=\|{\cal A}(S)-{\cal A}(S^{\prime})\|.

The randomness here is due to any possible internal randomness of 𝒜{\cal A}. For any LL-Lipschitz function ff, we have that f(𝒜(S),z)f(𝒜(S),z)Lδ𝒜(S,S).f\left({\mathcal{A}}(S),z\right)-f\left({\mathcal{A}}(S^{\prime}),z\right)\leq L\,\delta_{\cal A}(S,S^{\prime}). Hence, upper bounds on UAS can be easily transformed into upper bounds on uniform stability.

In this work, we will consider two types of bounds on UAS.

2.1 High-probability guarantees on UAS

In Section 3, we give upper bounds on UAS for three variants of the (stochastic) gradient descent algorithm, namely, (i) full-batch gradient descent, (ii) sampling-with-replacement stochastic gradient descent, and (iii) fixed-permutation stochastic gradient descent. Variant (i) is deterministic (and hence its UAS is a deterministic quantity). For variant (ii), for any pair of neighboring datasets S,SS,S^{\prime}, we give an upper bound on the UAS random variable that holds with high probability over the algorithm’s internal randomness (the sampling with replacement). For variant (iii), we give an upper bound on UAS that holds for an arbitrary choice of permutation; in particular, for any random permutation, our upper bound on the UAS random variable holds with probability 1.

High-probability upper bounds on UAS lead to high-probability upper bounds on the generalization error εgen\varepsilon_{\mathrm{\tiny gen}}. We will use the following theorem, which follows in a straightforward fashion from (Feldman and Vondrák, 2019, Theorem 1.1), to derive generalization-error guarantees for our results in Sections 5 and 6 based on our UAS upper bounds in Section 3.

Theorem 2.2 (follows from Theorem 1.1 in Feldman and Vondrák (2019)).

Let 𝒜:𝒵n𝒳{\mathcal{A}}:\mathcal{Z}^{n}\rightarrow\mathcal{X} be a randomized algorithm. For any pair of neighboring datasets S,SS,S^{\prime}, suppose that the UAS random variable of 𝒜{\mathcal{A}} satisfies:

𝒜[δ𝒜(S,S)γ]\displaystyle\underset{{\mathcal{A}}}{\mathbb{P}}\left[\delta_{{\mathcal{A}}}(S,S^{\prime})\geq\gamma\right] θ0.\displaystyle\leq\theta_{0}.

Then there is a constant cc such that for any distribution 𝒟\mathcal{D} over 𝒵\mathcal{Z} and any θ(0,1)\theta\in(0,1), we have

𝐒𝒟n,𝒜[|εgen(𝒜)|c(Lγlog(n)log(n/θ)+LRlog(1/θ)n)]\displaystyle\underset{\mathbf{S}\sim\mathcal{D}^{n},\,{\mathcal{A}}}{\mathbb{P}}\left[\lvert\varepsilon_{\mathrm{\tiny gen}}({\mathcal{A}})\rvert\geq c\left(L\gamma\log(n)\log(n/\theta)+LR\sqrt{\frac{\log(1/\theta)}{n}}\right)\right] θ+θ0,\displaystyle\leq\theta+\theta_{0},

where εgen(𝒜)=F𝒟(𝒜(𝐒))F𝐒(𝒜(𝐒))\varepsilon_{\mathrm{\tiny gen}}({\mathcal{A}})=F_{\mathcal{D}}({\mathcal{A}}(\mathbf{S}))-F_{\mathbf{S}}({\mathcal{A}}(\mathbf{S})) as defined earlier.

2.2 Expectation guarantees on UAS

Our results also include upper and lower bounds on supSS𝔼𝒜[δ𝒜(S,S)]\sup\limits_{S\simeq S^{\prime}}\underset{{\mathcal{A}}}{\mathbb{E}}\left[\delta_{{\mathcal{A}}}(S,S^{\prime})\right]; that is, the supremum of the expected value of the UAS random variable, where the supremum is taken over all pairs of neighboring datasets. In Section 3.3.1, we provide an upper bound on this quantity for sampling-with-replacement stochastic gradient descent. The upper bounds for the other two variants of the gradient descent method hold in the strongest sense (they hold with probability 11). Moreover, in Appendix F, we give slightly tighter expectation guarantees on UAS for both sampling-with-replacement SGD and fixed-permutation SGD with a uniformly random permutation.

In Section 4, we give lower bounds on this quantity for the two variants of the stochastic subgradient method, together with a deterministic lower bound for the full-batch variant.

3 Upper Bounds on Uniform Argument Stability

3.1 The Basic Lemma

We begin by stating a key lemma that encompasses the UAS analysis of multiple variants of (S)GD. In particular, all of our UAS upper bounds are obtained by an almost direct application of this lemma. In the lemma, we consider two gradient descent trajectories associated with different sequences of objective functions. The degree of concordance of the two sequences, quantified by the distance between the subgradients at the current iterate, controls the deviation between the trajectories. We note that this distance condition is satisfied by all (S)GD variants we study in this work.

Lemma 3.1.

Let (xt)t[T](x^{t})_{t\in[T]} and (yt)t[T](y^{t})_{t\in[T]}, with x1=y1x^{1}=y^{1}, be online gradient descent trajectories for convex LL-Lipschitz objectives (ft)t[T1](f_{t})_{t\in[T-1]} and (ft)t[T1](f_{t}^{\prime})_{t\in[T-1]}, respectively; i.e.,

xt+1\displaystyle x^{t+1} =\displaystyle= 𝖯𝗋𝗈𝗃𝒳[xtηtft(xt)]\displaystyle\mathsf{Proj}_{\mathcal{X}}[x^{t}-\eta_{t}\nabla f_{t}(x^{t})]
yt+1\displaystyle y^{t+1} =\displaystyle= 𝖯𝗋𝗈𝗃𝒳[ytηtft(yt)],\displaystyle\mathsf{Proj}_{\mathcal{X}}[y^{t}-\eta_{t}\nabla f_{t}^{\prime}(y^{t})],

for all t[T1]t\in[T-1]. Suppose for every t[T1]t\in[T-1], ft(xt)ft(xt)at\|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(x^{t})\|\leq a_{t}, for scalars 0at2L0\leq a_{t}\leq 2L. Then, if t0=inf{t:ftft},t_{0}=\inf\{t:f_{t}\neq f_{t}^{\prime}\},

xTyT2Lt=t0T1ηt2+2t=t0+1T1ηtat.\|x^{T}-y^{T}\|\leq 2L\sqrt{\sum_{t=t_{0}}^{T-1}\eta_{t}^{2}}+2\sum_{t=t_{0}+1}^{T-1}\eta_{t}a_{t}.
Proof.

Let δt=xtyt\delta_{t}=\|x^{t}-y^{t}\|. By definition of t0t_{0}, it is clear that δ1==δt0=0\delta_{1}=\ldots=\delta_{t_{0}}=0. For t=t0+1t=t_{0}+1, since xt0=yt0x^{t_{0}}=y^{t_{0}} and the projection is nonexpansive, we have δt0+1ηt0(ft0(xt0)ft0(xt0))2Lηt0\delta_{t_{0}+1}\leq\|\eta_{t_{0}}(\nabla f_{t_{0}}(x^{t_{0}})-\nabla f_{t_{0}}^{\prime}(x^{t_{0}}))\|\leq 2L\eta_{t_{0}}.

Now, we derive a recurrence for (δt)t[T](\delta_{t})_{t\in[T]}:

δt+12=𝖯𝗋𝗈𝗃𝒳[xtηtft(xt)]𝖯𝗋𝗈𝗃𝒳[ytηtft(yt)]2xtytηt(ft(xt)ft(yt))2\displaystyle\delta_{t+1}^{2}=\|\mathsf{Proj}_{\mathcal{X}}[x^{t}-\eta_{t}\nabla f_{t}(x^{t})]-\mathsf{Proj}_{\mathcal{X}}[y^{t}-\eta_{t}\nabla f_{t}^{\prime}(y^{t})]\|^{2}\,\,\leq\,\,\|x^{t}-y^{t}-\eta_{t}(\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(y^{t}))\|^{2}
=δt2+ηt2ft(xt)ft(yt)22ηtft(xt)ft(yt),xtyt\displaystyle=\delta_{t}^{2}+\eta_{t}^{2}\|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(y^{t})\|^{2}-2\eta_{t}\langle\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(y^{t}),x^{t}-y^{t}\rangle
δt2+ηt2ft(xt)ft(yt)22ηtft(xt)ft(xt),xtyt2ηtft(xt)ft(yt),xtyt\displaystyle\leq\delta_{t}^{2}+\eta_{t}^{2}\|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(y^{t})\|^{2}-2\eta_{t}\langle\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(x^{t}),x^{t}-y^{t}\rangle-2\eta_{t}\langle\nabla f_{t}^{\prime}(x^{t})-\nabla f_{t}^{\prime}(y^{t}),x^{t}-y^{t}\rangle
δt2+ηt2ft(xt)ft(yt)2+2ηtft(xt)ft(xt)δt2ηtft(xt)ft(yt),xtyt\displaystyle\leq\delta_{t}^{2}+\eta_{t}^{2}\|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(y^{t})\|^{2}+2\eta_{t}\|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(x^{t})\|\delta_{t}-2\eta_{t}\langle\nabla f_{t}^{\prime}(x^{t})-\nabla f_{t}^{\prime}(y^{t}),x^{t}-y^{t}\rangle
δt2+4L2ηt2+2ηtatδt,\displaystyle\leq\delta_{t}^{2}+4L^{2}\eta_{t}^{2}+2\eta_{t}a_{t}\delta_{t},

where at the last step we use the monotonicity of the subgradient. Note that

δt0+1ηt0ft0(xt0)ft0(xt0)2Lηt0.\delta_{t_{0}+1}\leq\eta_{t_{0}}\|\nabla f_{t_{0}}(x^{t_{0}})-\nabla f_{t_{0}}^{\prime}(x^{t_{0}})\|\leq 2L\eta_{t_{0}}.

Hence,

δt2\displaystyle\delta_{t}^{2} \displaystyle\leq δt0+12+4L2s=t0+1t1ηs2+2s=t0+1t1ηsasδs\displaystyle\textstyle\delta_{t_{0}+1}^{2}+4L^{2}\sum_{s=t_{0}+1}^{t-1}\eta_{s}^{2}+2\sum_{s=t_{0}+1}^{t-1}\eta_{s}a_{s}\delta_{s} (4)
\displaystyle\leq 4L2s=t0t1ηs2+2s=t0+1t1ηsasδs.\displaystyle\textstyle 4L^{2}\sum_{s=t_{0}}^{t-1}\eta_{s}^{2}+2\sum_{s=t_{0}+1}^{t-1}\eta_{s}a_{s}\delta_{s}.

Now we prove the following bound by induction (notice this claim proves the result):

\textstyle\delta_{t}\leq 2L\sqrt{\sum_{s=t_{0}}^{t-1}\eta_{s}^{2}}+2\sum_{s=t_{0}+1}^{t-1}\eta_{s}a_{s}\qquad(\forall t\in[T]).

Indeed, the claim is clearly true for t=t0t=t_{0}. For the inductive step, we assume it holds for some t[T1]t\in[T-1]. To prove the result we consider two cases: first, when δt+1maxs[t]δs\delta_{t+1}\leq\max_{s\in[t]}\delta_{s}, by induction hypothesis we have

\textstyle\delta_{t+1}\leq\max_{s\in[t]}\delta_{s}\leq 2L\sqrt{\sum_{s=t_{0}}^{t-1}\eta_{s}^{2}}+2\sum_{s=t_{0}+1}^{t-1}\eta_{s}a_{s}\leq 2L\sqrt{\sum_{s=t_{0}}^{t}\eta_{s}^{2}}+2\sum_{s=t_{0}+1}^{t}\eta_{s}a_{s}.

In the other case, \delta_{t+1}>\max_{s\in[t]}\delta_{s}, we use (4):

\textstyle\delta_{t+1}^{2}\,\leq\,4L^{2}\sum_{s=t_{0}}^{t}\eta_{s}^{2}+2\sum_{s=t_{0}+1}^{t}\eta_{s}a_{s}\delta_{s}\,\,\leq\,\,4L^{2}\sum_{s=t_{0}}^{t}\eta_{s}^{2}+2\delta_{t+1}\sum_{s=t_{0}+1}^{t}\eta_{s}a_{s},

which is equivalent to

\textstyle\Big{(}\delta_{t+1}-\sum_{s=t_{0}+1}^{t}a_{s}\eta_{s}\Big{)}^{2}\,\,\leq\,\,4L^{2}\sum_{s=t_{0}}^{t}\eta_{s}^{2}+\Big{(}\sum_{s=t_{0}+1}^{t}\eta_{s}a_{s}\Big{)}^{2}.

Taking the square root of this inequality and using the subadditivity of the square root, we obtain the inductive step, and therefore the result. ∎
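As a quick numerical sanity check of Lemma 3.1 (an illustration of ours, not part of the formal argument), one can drive the recursion (4) with equality at every step and verify that the closed-form bound still dominates the resulting sequence. The sketch below does this in Python with L=1 and arbitrary step sizes, taking deviations a_t in {0, 2L}:

```python
import math
import random

def recursion_vs_bound(etas, avals, L=1.0):
    """Drive the recursion (4) with equality,
        delta_{t+1}^2 = delta_t^2 + 4 L^2 eta_t^2 + 2 eta_t a_t delta_t,
    and check it against the closed-form bound of Lemma 3.1,
        delta_t <= 2 L sqrt(sum_s eta_s^2) + 2 sum_s eta_s a_s."""
    delta, sum_sq, sum_lin, bound = 0.0, 0.0, 0.0, 0.0
    for eta, a in zip(etas, avals):
        delta = math.sqrt(delta ** 2 + 4 * L ** 2 * eta ** 2 + 2 * eta * a * delta)
        sum_sq += eta ** 2
        sum_lin += eta * a
        bound = 2 * L * math.sqrt(sum_sq) + 2 * sum_lin
        assert delta <= bound + 1e-9, (delta, bound)
    return delta, bound

random.seed(0)
etas = [random.uniform(0.0, 0.1) for _ in range(200)]
avals = [random.choice([0.0, 2.0]) for _ in range(200)]  # a_t in {0, 2L}
recursion_vs_bound(etas, avals)
```

When all a_t = 0 the two sides coincide (the recursion telescopes to delta_t = 2L sqrt(sum eta_s^2)), so the bound is tight in that regime.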

3.2 Upper Bounds for the Full Batch GD

Algorithm 1 𝒜𝖦𝖣{\mathcal{A}}_{\sf GD}: Full-batch Gradient Descent
0:  Dataset: S=(z1,,zn)𝒵nS=(z_{1},\ldots,z_{n})\in\mathcal{Z}^{n}, # iterations TT,  step sizes {ηt:t[T]}\{\eta_{t}:t\in[T]\}
1:  Choose arbitrary initial point x1𝒳x^{1}\in\mathcal{X}
2:  for t=1t=1 to T1T-1  do
3:     xt+1:=𝖯𝗋𝗈𝗃𝒳(xtηtFS(xt)),x^{t+1}:=\mathsf{Proj}_{\mathcal{X}}\left(x^{t}-\eta_{t}\cdot\nabla F_{S}(x^{t})\right),
4:  return  x¯T=1t[T]ηtt[T]ηtxt\overline{x}^{T}=\frac{1}{\sum_{t\in[T]}\eta_{t}}\sum_{t\in[T]}\eta_{t}x^{t}

As a direct corollary of Lemma 3.1, we derive the following upper bound on UAS for the batch gradient descent algorithm.

Theorem 3.2.

Let 𝒳(0,R)\mathcal{X}\subseteq\mathcal{B}(0,R) and =𝒳0(L)\mathcal{F}=\mathcal{F}_{\mathcal{X}}^{0}(L). The full-batch gradient descent (Algorithm 1) has uniform argument stability

supSSδ𝒜𝖦𝖣(S,S)min{2R,4L(1nt=1T1ηt+t=1T1ηt2)}.\sup\limits_{S\simeq S^{\prime}}\delta_{{\mathcal{A}}_{\sf GD}}(S,S^{\prime})\leq\min\left\{2R,~{}4L\,\Big{(}\frac{1}{n}\sum_{t=1}^{T-1}\eta_{t}+\sqrt{\sum_{t=1}^{T-1}\eta_{t}^{2}}\Big{)}\right\}.
Proof.

The bound of 2R2R is obtained directly from the diameter bound on 𝒳\mathcal{X}. Therefore, we focus exclusively on the second term. Let SSS\simeq S^{\prime} be arbitrary neighboring datasets, x1=y1x^{1}=y^{1}, and consider the trajectories (xt)t,(yt)t(x^{t})_{t},(y^{t})_{t} associated with the batch GD method on datasets SS and SS^{\prime}, respectively. We use Lemma 3.1 with ft=FSf_{t}=F_{S} and ft=FSf_{t}^{\prime}=F_{S^{\prime}}, for all t[T1]t\in[T-1]. Notice that

supx𝒳FS(x)FS(x)2L/n,\sup_{x\in{\cal X}}\|\nabla F_{S}(x)-\nabla F_{S^{\prime}}(x)\|\leq 2L/n,

since SSS\simeq S^{\prime}; in particular, ft(xt)ft(xt)at\|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(x^{t})\|\leq a_{t}, with at=2L/na_{t}=2L/n. We conclude by Lemma 3.1 that for all t[T]t\in[T]

xtyt2Ls=1t1ηs2+4Lns=2t1ηs.\|x^{t}-y^{t}\|\leq 2L\sqrt{\sum_{s=1}^{t-1}\eta_{s}^{2}}+\frac{4L}{n}\sum_{s=2}^{t-1}\eta_{s}.

Hence, the stability bound holds for all the iterates, and thus for x¯T\overline{x}^{T} by the triangle inequality. ∎
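To make Theorem 3.2 concrete, the following Python sketch (ours; the toy nonsmooth loss f(x,z)=|⟨x,z⟩| with ‖z‖≤1, hence L=1, and all parameter values are illustrative assumptions) runs Algorithm 1 on two neighboring datasets from the same initial point and checks the gap between the average iterates against the stated UAS bound:

```python
import numpy as np

def proj_ball(x, R):
    """Euclidean projection onto the ball B(0, R)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= R else x * (R / nrm)

def subgrad(x, z):
    """A subgradient of the toy nonsmooth loss f(x, z) = |<x, z>|."""
    return np.sign(x @ z) * z

def full_batch_gd(S, T, eta, R, x1):
    """Algorithm 1 with constant step size: projected full-batch
    subgradient descent, returning the (uniform) average iterate."""
    x = x1.copy()
    iterates = [x.copy()]
    for _ in range(T - 1):
        g = np.mean([subgrad(x, z) for z in S], axis=0)  # subgradient of F_S
        x = proj_ball(x - eta * g, R)
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)

rng = np.random.default_rng(0)
n, d, T, eta, R, L = 10, 5, 50, 0.05, 1.0, 1.0
S = [z / max(1.0, np.linalg.norm(z)) for z in rng.normal(size=(n, d))]
S2 = list(S)
S2[0] = -S[0]                      # neighboring dataset: one point replaced
x1 = proj_ball(rng.normal(size=d), R)
xbar = full_batch_gd(S, T, eta, R, x1)
ybar = full_batch_gd(S2, T, eta, R, x1)
gap = np.linalg.norm(xbar - ybar)
bound = min(2 * R, 4 * L * ((T - 1) * eta / n + eta * np.sqrt(T - 1)))
assert gap <= bound + 1e-9
```

Both runs share the same deterministic subgradient oracle, as required by the coupling argument in the proof.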

3.3 Upper Bounds for SGD

Next, we state and prove upper bounds on UAS for two variants of stochastic gradient descent: sampling-with-replacement SGD (Section 3.3.1) and fixed-permutation SGD (Section 3.3.2). Here, we give strong upper bounds that hold with high probability (for sampling-with-replacement SGD) and with probability 1 (for fixed-permutation SGD). In Appendix F, we derive tighter upper bounds for these two variants in the regime where the number of iterations T is smaller than the number of samples n; however, the bounds in that regime hold only in expectation.

3.3.1 Sampling-with-replacement SGD

Next, we study the uniform argument stability of the sampling-with-replacement stochastic gradient descent (Algorithm 2). This algorithm has the benefit that each iteration is extremely cheap compared to an iteration of Algorithm 1. Despite these savings, we will show that the same bound on UAS holds with high probability.

Algorithm 2 𝒜𝗋𝖲𝖦𝖣{\mathcal{A}}_{\sf rSGD}: Sampling with replacement SGD
0:  Dataset: S=(z1,,zn)𝒵nS=(z_{1},\ldots,z_{n})\in\mathcal{Z}^{n}, # iterations TT,  stepsizes {ηt:t[T]}\{\eta_{t}:t\in[T]\}
1:  Choose arbitrary initial point x1𝒳x^{1}\in\mathcal{X}
2:  for t=1t=1 to T1T-1  do
3:     Sample 𝐢tUnif([n])\mathbf{i}_{t}\sim\mbox{Unif}([n])
4:     xt+1:=𝖯𝗋𝗈𝗃𝒳(xtηtf(xt,z𝐢t))x^{t+1}:=\mathsf{Proj}_{\mathcal{X}}\left(x^{t}-\eta_{t}\cdot\nabla f(x^{t},z_{\mathbf{i}_{t}})\right)
5:  return  x¯T=1t[T]ηtt[T]ηtxt\overline{x}^{T}=\frac{1}{\sum_{t\in[T]}\eta_{t}}\sum_{t\in[T]}\eta_{t}x^{t}

We now state and prove our upper bound for sampling-with-replacement SGD.

Theorem 3.3.

Let 𝒳(0,R)\mathcal{X}\subseteq\mathcal{B}(0,R) and =𝒳0(L)\mathcal{F}=\mathcal{F}_{\mathcal{X}}^{0}(L). The uniform argument stability of the sampling-with-replacement SGD (Algorithm 2) satisfies:

supSS𝔼𝒜𝗋𝖲𝖦𝖣[δ𝒜𝗋𝖲𝖦𝖣(S,S)]min(2R,4L(t=1T1ηt2+1nt=1T1ηt)).\sup\limits_{S\simeq S^{\prime}}\underset{{\mathcal{A}}_{\sf rSGD}}{\mathbb{E}}\left[\delta_{{\mathcal{A}}_{\sf rSGD}}(S,S^{\prime})\right]\leq\min\left(2R,~{}4L\left(\sqrt{\sum_{t=1}^{T-1}\eta_{t}^{2}}+\frac{1}{n}\sum_{t=1}^{T-1}\eta_{t}\right)\right).

Moreover, if ηt=η>0t\eta_{t}=\eta>0~{}\forall t then, for any pair (S,S)(S,S^{\prime}) of neighboring datasets, with probability at least 1exp(n/2)1-\exp\left(-n/2\right) (over the algorithm’s internal randomness), the UAS random variable is bounded as

δ𝒜𝗋𝖲𝖦𝖣(S,S)min(2R,4L(ηT1+ηT1n)).\delta_{{\mathcal{A}}_{\sf rSGD}}(S,S^{\prime})\leq\min\left(2R,~{}4L\left(\eta\,\sqrt{T-1}+\eta\,\frac{T-1}{n}\right)\right).
Proof.

The bound of 2R trivially follows from the diameter bound on \mathcal{X}. We thus focus on the second term of the bound. Let S\simeq S^{\prime} be arbitrary neighboring datasets, x^{1}=y^{1}, and consider the trajectories (x^{t})_{t\in[T]},(y^{t})_{t\in[T]} associated with the sampling-with-replacement stochastic subgradient method on datasets S and S^{\prime}, respectively. We use Lemma 3.1 with f_{t}(\cdot)=f(\cdot,\mathbf{z}_{\mathbf{i}_{t}}) and f_{t}^{\prime}(\cdot)=f(\cdot,\mathbf{z}_{\mathbf{i}_{t}}^{\prime}). Let us define \mathbf{r}_{t}\triangleq\mathbf{1}_{\{\mathbf{z}_{\mathbf{i}_{t}}\neq\mathbf{z}_{\mathbf{i}_{t}}^{\prime}\}}. Since S and S^{\prime} differ in a single element, at every step t we have \mathbf{r}_{t}=1 with probability 1/n, and \mathbf{r}_{t}=0 otherwise. Moreover, note that \{\mathbf{r}_{t}:~t\in[T]\} is an independent sequence of Bernoulli random variables. Finally, note that \|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(x^{t})\|\leq 2L\mathbf{r}_{t}.

Hence, by Lemma 3.1, for any realization of the trajectories of the SGD method, we have

t[T]:xtyt\displaystyle\forall t\in[T]:\quad\|x^{t}-y^{t}\| 2Ls=1t1ηs2+4Ls=1t1𝐫sηsΔT,\displaystyle\leq 2L\sqrt{\sum_{s=1}^{t-1}\eta_{s}^{2}}+4L\sum_{s=1}^{t-1}\mathbf{r}_{s}\eta_{s}\leq\Delta_{T}, (5)

where ΔT2Ls=1T1ηs2+4Ls=1T1𝐫sηs\Delta_{T}\triangleq 2L\sqrt{\sum_{s=1}^{T-1}\eta_{s}^{2}}+4L\sum_{s=1}^{T-1}\mathbf{r}_{s}\eta_{s}. Taking expectation of (5), we have

t[T]:𝔼[xtyt]𝔼[ΔT]=2Ls=1T1ηs2+4Lns=1T1ηs.\forall t\in[T]:\quad\mathbb{E}\left[\|x^{t}-y^{t}\|\right]\leq\mathbb{E}\left[\Delta_{T}\right]=2L\sqrt{\sum_{s=1}^{T-1}\eta_{s}^{2}}+\frac{4L}{n}\sum_{s=1}^{T-1}\eta_{s}.

This establishes the upper bound on UAS, but only in expectation. We now proceed to prove the high-probability bound. Here, we assume that the step size is fixed; that is, \eta_{t}=\eta>0 for all t\in[T-1]. Note that each \mathbf{r}_{s}, s\in[T], has variance \frac{1}{n}\left(1-\frac{1}{n}\right)<\frac{1}{n}. Hence, by Chernoff's bound (applied to scaled Bernoulli random variables, with the exponent expressed in terms of the variance), we have

[ηs=1T1𝐫sηT1n+ηT1]\displaystyle\mathbb{P}\left[\eta\sum_{s=1}^{T-1}\mathbf{r}_{s}\geq\eta\frac{T-1}{n}+\eta\sqrt{T-1}\right] exp(η2(T1)2η2T1n)=exp(n2).\displaystyle\leq\exp\left(-\,\frac{\eta^{2}(T-1)}{2\eta^{2}\,\frac{T-1}{n}}\right)=\exp\left(-\frac{n}{2}\right).

Therefore, with probability at least 1exp(n/2)1-\exp\left(-n/2\right), we have

\Delta_{T}\leq 6L\eta\sqrt{T-1}+\frac{4L}{n}\eta(T-1).

Putting this together with (5), with probability at least 1exp(n/2)1-\exp\left(-n/2\right), we have

\forall t\in[T]:\quad\|x^{t}-y^{t}\|\leq 6L\eta\sqrt{T-1}+\frac{4L}{n}\eta(T-1).

Finally, by the triangle inequality, we get that with probability at least 1exp(n/2)1-\exp\left(-n/2\right), the same stability bound holds for the average of the iterates x¯T\overline{x}^{T}, y¯T\overline{y}^{T}. ∎
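The coupling used in this proof is easy to reproduce numerically. The sketch below (ours; the toy loss f(x,z)=|⟨x,z⟩| and the parameter values are illustrative assumptions) couples the two trajectories through the shared index sequence, exactly as in the proof, and checks the per-realization bound (5):

```python
import numpy as np

def proj_ball(x, R):
    """Euclidean projection onto the ball B(0, R)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= R else x * (R / nrm)

def subgrad(x, z):
    """A subgradient of f(x, z) = |<x, z>|; Lipschitz constant ||z||."""
    return np.sign(x @ z) * z

rng = np.random.default_rng(1)
n, d, T, eta, R, L = 8, 4, 200, 0.02, 1.0, 1.0
S = [z / max(1.0, np.linalg.norm(z)) for z in rng.normal(size=(n, d))]
S2 = list(S)
S2[3] = -S[3]                          # neighboring dataset
idx = rng.integers(n, size=T - 1)      # shared sampling randomness i_t
x = np.zeros(d)
y = np.zeros(d)
max_gap, r = 0.0, []
for t in range(T - 1):
    i = idx[t]
    r.append(1.0 if not np.allclose(S[i], S2[i]) else 0.0)  # r_t indicator
    x = proj_ball(x - eta * subgrad(x, S[i]), R)
    y = proj_ball(y - eta * subgrad(y, S2[i]), R)
    max_gap = max(max_gap, np.linalg.norm(x - y))
# per-realization bound (5): Delta_T = 2 L eta sqrt(T-1) + 4 L eta sum_s r_s
Delta_T = 2 * L * eta * np.sqrt(T - 1) + 4 * L * eta * sum(r)
assert max_gap <= Delta_T + 1e-9
```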

3.3.2 Upper Bounds for the Fixed Permutation SGD

In Algorithm 3, we describe the fixed-permutation stochastic gradient descent. This algorithm works in epochs, where each epoch is a single pass on the data. The order in which data is used is the same across epochs, and is given by a permutation π\pi. The algorithm can be alternatively described without the epoch loop simply by

x^{t+1}=\mathsf{Proj}_{\mathcal{X}}\left(x^{t}-\eta_{t}\cdot\nabla f(x^{t},z_{\bm{\pi}(t\,\mbox{\footnotesize mod }n)})\right)\qquad(\forall t\in[nK]). (6)
Algorithm 3 𝒜𝖯𝖾𝗋𝖲𝖦𝖣{\mathcal{A}}_{\sf PerSGD}: Fixed Permutation SGD
0:  Dataset S=(z_{1},\ldots,z_{n})\in\mathcal{Z}^{n}, # rounds K, total # steps T\triangleq nK, step sizes \{\eta_{t}\}_{t\in[nK]}, permutation \pi:[n]\rightarrow[n]
1:  Choose arbitrary initial point xn+10𝒳x_{n+1}^{0}\in\mathcal{X}
2:  for k=1,,Kk=1,\ldots,K do
3:     x1k=xn+1k1x_{1}^{k}=x_{n+1}^{k-1}
4:     for t=1t=1 to nn  do
5:        xt+1k:=𝖯𝗋𝗈𝗃𝒳(xtkη(k1)n+tf(xtk,zπ(t)))x^{k}_{t+1}:=\mathsf{Proj}_{\mathcal{X}}\left(x^{k}_{t}-\eta_{(k-1)n+t}\cdot\nabla f(x^{k}_{t},z_{\pi(t)})\right)
6:     η¯k=t=1nη(k1)n+t\overline{\eta}_{k}=\sum_{t=1}^{n}\eta_{(k-1)n+t}
7:  return  x¯K=1k[K]η¯kk[K]η¯kx1k\overline{x}^{K}=\frac{1}{\sum_{k\in[K]}\overline{\eta}_{k}}\sum_{k\in[K]}\overline{\eta}_{k}\cdot x_{1}^{k}

We show that the same UAS bound as for batch gradient descent and sampling-with-replacement SGD holds for the fixed-permutation SGD. We also observe that a slightly tighter bound can be achieved for the expectation guarantee on UAS when \bm{\pi} is chosen uniformly at random; we defer the details to Theorem F.2 in the Appendix.

In the next result, we assume that the sequence of step sizes (ηt)t[T]\left(\eta_{t}\right)_{t\in[T]} is non-increasing, which is indeed the case for almost all known variants of SGD.

Theorem 3.4.

Let 𝒳(0,R)\mathcal{X}\subseteq\mathcal{B}(0,R), =𝒳0(L)\mathcal{F}=\mathcal{F}_{\mathcal{X}}^{0}(L), and π\pi be any permutation over [n][n]. Suppose the step sizes (ηt)t[T]\left(\eta_{t}\right)_{t\in[T]} form a non-increasing sequence. Then the uniform argument stability of the fixed-permutation SGD (Algorithm 3) is bounded as

supSSδ𝒜𝖯𝖾𝗋𝖲𝖦𝖣(S,S)min{2R,2L(t=1T1ηt2+2nt=1T1ηt)}.\sup\limits_{S\simeq S^{\prime}}\delta_{{\mathcal{A}}_{\sf PerSGD}}(S,S^{\prime})\leq\min\left\{2R,~{}2L\left(\sqrt{\sum_{t=1}^{T-1}\eta_{t}^{2}}+\frac{2}{n}\sum_{t=1}^{T-1}\eta_{t}\right)\right\}.
Proof.

Again, the bound of 2R2R is trivial. Now, we show the second term of the bound. Let SSS\simeq S^{\prime} be arbitrary neighboring datasets, x1=y1x^{1}=y^{1}, and consider the trajectories (xt)t[T],(yt)t[T](x^{t})_{t\in[T]},(y^{t})_{t\in[T]} associated with the fixed permutation stochastic subgradient method on datasets SS and SS^{\prime}, respectively. Since the datasets SSS\simeq S^{\prime} are arbitrary, we may assume without loss of generality that π\pi is the identity, whereas the perturbed coordinate 𝐢=i\mathbf{i}=i is arbitrary. We use Lemma 3.1 with ft()=f(,𝐳(tmod n))f_{t}(\cdot)=f\left(\cdot,\mathbf{z}_{\left(t~{}\mbox{\footnotesize mod }n\right)}\right) and ft()=f(,𝐳(tmod n))f_{t}^{\prime}(\cdot)=f\left(\cdot,\mathbf{z}^{\prime}_{\left(t~{}\mbox{\footnotesize mod }n\right)}\right). It is easy to see then that ft(xt)ft(xt)at\|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(x^{t})\|\leq a_{t}, with at=2L𝟏{(tmod n)=i}a_{t}=2L\cdot\mathbf{1}_{\{(t~{}\mbox{\footnotesize mod }n)=i\}}, where 𝟏{𝖼𝗈𝗇𝖽𝗂𝗍𝗂𝗈𝗇}\mathbf{1}_{\{\mathsf{condition}\}} is the indicator of 𝖼𝗈𝗇𝖽𝗂𝗍𝗂𝗈𝗇\mathsf{condition}. Hence, by Lemma 3.1, we have

xtyt\displaystyle\|x^{t}-y^{t}\| \displaystyle\leq 2Ls=1t1ηs2+4Lr=1(t1)/nηrn+i\displaystyle 2L\sqrt{\sum_{s=1}^{t-1}\eta_{s}^{2}}+4L\sum_{r=1}^{\lfloor(t-1)/n\rfloor}\eta_{rn+i}
\displaystyle\leq \displaystyle 2L\sqrt{\sum_{s=1}^{t-1}\eta_{s}^{2}}+\frac{4L}{n}\sum_{s=1}^{t-1}\eta_{s},

where at the last step we used the fact that (ηt)t[T](\eta_{t})_{t\in[T]} is non-increasing; namely, for any r1r\geq 1

ηrn+i1ns=(r1)n+i+1rn+iηs.\eta_{rn+i}\leq\frac{1}{n}\sum_{s=(r-1)n+i+1}^{rn+i}\eta_{s}.

Since the bound holds for all the iterates, the triangle inequality shows that it also holds for the output \overline{x}^{K}, which averages the iterates x_{1}^{k} over the K epochs. ∎
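Since the bound of Theorem 3.4 is deterministic, it can be checked directly on a toy instance. The sketch below (ours; the loss f(x,z)=|⟨x,z⟩| and the parameter values are illustrative assumptions) runs Algorithm 3 with constant step size on two neighboring datasets and verifies the stated bound on the output gap:

```python
import numpy as np

def proj_ball(x, R):
    """Euclidean projection onto the ball B(0, R)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= R else x * (R / nrm)

def subgrad(x, z):
    """A subgradient of f(x, z) = |<x, z>|."""
    return np.sign(x @ z) * z

def fixed_perm_sgd(S, K, eta, R, perm):
    """Algorithm 3 with constant step size: K epochs over a fixed
    permutation; returns the average of the epoch-start iterates x_1^k."""
    n, d = len(S), len(S[0])
    x = np.zeros(d)
    epoch_starts = []
    for _ in range(K):
        epoch_starts.append(x.copy())
        for t in range(n):
            x = proj_ball(x - eta * subgrad(x, S[perm[t]]), R)
    return np.mean(epoch_starts, axis=0)

rng = np.random.default_rng(2)
n, d, K, eta, R, L = 6, 3, 5, 0.03, 1.0, 1.0
S = [z / max(1.0, np.linalg.norm(z)) for z in rng.normal(size=(n, d))]
S2 = list(S)
S2[2] = -S[2]                       # neighboring dataset
perm = list(range(n))               # any fixed permutation
xbar = fixed_perm_sgd(S, K, eta, R, perm)
ybar = fixed_perm_sgd(S2, K, eta, R, perm)
T = n * K
bound = min(2 * R, 2 * L * (eta * np.sqrt(T - 1) + 2 * (T - 1) * eta / n))
assert np.linalg.norm(xbar - ybar) <= bound + 1e-9
```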

3.4 Discussion of the upper bounds: examples of specific instantiations

The upper bounds on stability from this section all behave very similarly. Let us explore the consequences of the obtained rates in terms of generalization bounds for different choices of the step size sequence. As a case study, we will consider excess risk bounds for the full-batch subgradient method (Algorithm 1), but similar conclusions hold for all the variants that we studied. We emphasize that, prior to this work, no dimension-independent bounds on the excess risk were known for this method (specifically, for nonsmooth losses and without explicit regularization).

To bound the excess risk, we will use the risk decomposition, eqn. (3). For simplicity, we will only study excess risk bounds in expectation (in Section 6 we consider stronger, high-probability, bounds). In this case, the stability-implies-generalization result (Theorem 2.2) simplifies to the following (Bousquet and Elisseeff, 2002; Hardt et al., 2016):

𝔼𝐒[F𝐒(𝒜(𝐒))F𝒟(𝒜(𝐒))]supSSδ𝒜(S,S).\mathbb{E}_{\mathbf{S}}[F_{\mathbf{S}}({\cal A}(\mathbf{S}))-F_{\cal D}({\cal A}(\mathbf{S}))]\leq\sup_{S\simeq S^{\prime}}\delta_{{\cal A}}(S,S^{\prime}).

Finally, the approximation error (Lemma 2.1) simplifies as well: it is upper bounded by 0 in expectation.

  • Fixed stepsize: Let ηtη>0\eta_{t}\equiv\eta>0. By Thm. 3.2, UAS is bounded by 4LTη+4LTηn4L\sqrt{T}\eta+\frac{4LT\eta}{n}. On the other hand, the standard analysis of subgradient descent guarantees that εopt(𝒜𝖦𝖣)R22ηT+ηL22\varepsilon_{\mbox{\footnotesize opt}}({\mathcal{A}}_{\sf GD})\leq\frac{R^{2}}{2\eta T}+\frac{\eta L^{2}}{2}. Therefore, by the expected risk decomposition (3)

    𝔼𝐒[εrisk(𝒜𝖦𝖣)]\displaystyle\mathbb{E}_{\mathbf{S}}[\varepsilon_{\mbox{\footnotesize{risk}}}({\mathcal{A}}_{\sf GD})] \displaystyle\leq 𝔼𝐒[εgen(𝒜𝖦𝖣)]+𝔼𝐒[εopt(𝒜𝖦𝖣)]4L2Tη+4L2Tηn+R22ηT+ηL22.\displaystyle\mathbb{E}_{\mathbf{S}}[\varepsilon_{\mathrm{\tiny gen}}({\mathcal{A}}_{\sf GD})]+\mathbb{E}_{\mathbf{S}}[\varepsilon_{\mathrm{\tiny opt}}({\mathcal{A}}_{\sf GD})]\leq 4L^{2}\sqrt{T}\eta+\frac{4L^{2}T\eta}{n}+\frac{R^{2}}{2\eta T}+\frac{\eta L^{2}}{2}.

    If we consider the standard method choice, η=R/[Ln]\eta=R/[L\sqrt{n}] and T=nT=n, the bound above is at least 4LR4LR (due to the first term). Consequently, upper bounds obtained from this approach are vacuous.

    In order to deal with the L\sqrt{T}\eta term, we need to substantially decrease the stepsize, together with running the algorithm for longer. For example, \eta=\frac{R}{4L\sqrt{Tn}} gives \mathbb{E}_{\mathbf{S}}[\varepsilon_{\mbox{\footnotesize{risk}}}({\mathcal{A}}_{\sf GD})]\leq\frac{2LR}{\sqrt{n}}+\frac{2LR\sqrt{n}}{\sqrt{T}}+\frac{LR\sqrt{T}}{n^{3/2}}, so by choosing T=n^{2} we obtain an expected excess risk bound of O(LR/\sqrt{n}), which is optimal. We will see next that it is not possible to obtain the same rate from this bound if T=o(n^{2}), for any choice of \eta>0. It is also easy to observe that, at least for constant stepsize, it is not possible to recover the optimal excess risk if T=\omega(n^{2}).

  • Varying stepsize: For a general sequence of stepsizes the optimization guarantees of Algorithm 1 are the following

    \mathbb{E}_{\mathbf{S}}[\varepsilon_{\mathrm{\tiny opt}}({\mathcal{A}}_{\sf GD})]\leq\frac{R^{2}}{2\sum_{t=1}^{T-1}\eta_{t}}+\frac{L^{2}\sum_{t=1}^{T-1}\eta_{t}^{2}}{2\sum_{t=1}^{T-1}\eta_{t}}.

    From the risk decomposition, we have

    𝔼𝐒[εrisk(𝒜𝖦𝖣)]\displaystyle\mathbb{E}_{\mathbf{S}}[\varepsilon_{\mbox{\footnotesize{risk}}}({\mathcal{A}}_{\sf GD})] \displaystyle\leq 𝔼𝐒[εgen(𝒜𝖦𝖣)]+𝔼𝐒[εopt(𝒜𝖦𝖣)]\displaystyle\mathbb{E}_{\mathbf{S}}[\varepsilon_{\mathrm{\tiny gen}}({\mathcal{A}}_{\sf GD})]+\mathbb{E}_{\mathbf{S}}[\varepsilon_{\mathrm{\tiny opt}}({\mathcal{A}}_{\sf GD})]
    \displaystyle\leq \displaystyle 4L^{2}\sqrt{\sum_{t=1}^{T-1}\eta_{t}^{2}}+\frac{4L^{2}}{n}\sum_{t=1}^{T-1}\eta_{t}+\frac{R^{2}}{2\sum_{t=1}^{T-1}\eta_{t}}+\frac{L^{2}\sum_{t=1}^{T-1}\eta_{t}^{2}}{2\sum_{t=1}^{T-1}\eta_{t}}.

    In fact, we can show that any choice of step sizes that makes the quantity above O(LR/n)O(LR/\sqrt{n}) must necessarily have T=Ω(n2)T=\Omega(n^{2}). Indeed, notice that in such case

    R22t=1T1ηt=O(LRn);\displaystyle\frac{R^{2}}{2\sum_{t=1}^{T-1}\eta_{t}}=O\Big{(}\frac{LR}{\sqrt{n}}\Big{)}; 4L2t=1T1ηt2=O(LRn)\displaystyle 4L^{2}\sqrt{\sum_{t=1}^{T-1}\eta_{t}^{2}}=O\Big{(}\frac{LR}{\sqrt{n}}\Big{)}
    t=1T1ηt=Ω(RnL);\displaystyle\Longleftrightarrow\qquad\sum_{t=1}^{T-1}\eta_{t}=\Omega\Big{(}\frac{R\sqrt{n}}{L}\Big{)}; t=1T1ηt2=O(RLn).\displaystyle\sqrt{\sum_{t=1}^{T-1}\eta_{t}^{2}}=O\Big{(}\frac{R}{L\sqrt{n}}\Big{)}.

    Therefore, by Cauchy-Schwarz inequality,

    Ω(RnL)=tηtTtηt2=O(RTLn)T=Ω(n2).\Omega\Big{(}\frac{R\sqrt{n}}{L}\Big{)}=\sum_{t}\eta_{t}\leq\sqrt{T}\sqrt{\sum_{t}\eta_{t}^{2}}=O\Big{(}\frac{R\sqrt{T}}{L\sqrt{n}}\Big{)}\quad\Longrightarrow\quad T=\Omega(n^{2}).
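The two step-size regimes above are easy to tabulate numerically. The following sketch (ours; the choice of n and the constant in the final assertion are illustrative) evaluates the terms of the expected excess-risk bound for full-batch GD with constant step size, confirming that the standard one-pass choice is vacuous while \eta=R/(4L\sqrt{Tn}) with T=n^{2} gives O(LR/\sqrt{n}):

```python
import math

def gd_risk_bound(n, T, eta, L=1.0, R=1.0):
    """Evaluate the expected excess-risk bound for full-batch GD with
    constant step size: stability terms plus optimization terms."""
    stability = 4 * L**2 * math.sqrt(T) * eta + 4 * L**2 * T * eta / n
    optimization = R**2 / (2 * eta * T) + eta * L**2 / 2
    return stability + optimization

n = 10_000
L = R = 1.0

# Standard one-pass choice: eta = R/(L sqrt(n)), T = n -> bound >= 4 L R (vacuous).
vacuous = gd_risk_bound(n, T=n, eta=R / (L * math.sqrt(n)))
assert vacuous >= 4 * L * R

# Smaller steps, longer run: eta = R/(4 L sqrt(T n)), T = n^2 -> O(L R / sqrt(n)).
T = n**2
tuned = gd_risk_bound(n, T=T, eta=R / (4 * L * math.sqrt(T * n)))
assert tuned <= 5 * L * R / math.sqrt(n)
```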

The high iteration complexity required to obtain optimal bounds motivates studying whether it is possible to improve our uniform argument stability bounds. We will show that, unfortunately, they are sharp up to absolute constant factors.

4 Lower Bounds on Uniform Argument Stability

In this section we provide matching lower bounds for the previously studied first-order methods. These lower bounds show that our analyses are tight, up to absolute constant factors.

We note that it is possible to prove a general-purpose lower bound on stability by appealing to sample complexity lower bounds for stochastic convex optimization (Nemirovsky and Yudin,, 1983). This approach was first studied for the smooth convex case in (Chen et al.,, 2018), where the resulting lower bounds are sharp; in the nonsmooth case, however, they are very far from the bounds of the previous section. The idea is that, for a sufficiently small step size, a first-order method must incur \Omega(LT\eta/n) uniform argument stability; this term can be added to any other lower bound we prove for specific algorithms that enjoy rates comparable to those of gradient descent. Details of this lower bound can be found in Appendix C.

Next, we prove finer lower bounds on the UAS of specific algorithms. For this, note that the objective functions we use are polyhedral, so the subdifferential is a polytope at any point. Since the algorithm should work for any oracle, we let the subgradients provided be extreme points, \nabla f(x,z)\in\mbox{ext}(\partial f(x,z)). Moreover, we can make adversarial choices of the chosen subgradient.

4.1 Lower Bounds for Full Batch GD

Theorem 4.1.

Let \mathcal{X}={\cal B}(0,1), {\cal F}={\cal F}_{\mathcal{X}}^{0}(1) and d\geq\min\{T,1/\eta^{2}\}. For the full-batch gradient descent (Alg. 1) with constant step size \eta>0, there exist S\simeq S^{\prime} such that the UAS is lower bounded as \delta_{{\mathcal{A}}_{\sf GD}}(S,S^{\prime})=\Omega(\min\{1,\eta\sqrt{T}+\eta T/n\}).

The proof of this result is deferred to Appendix D, due to space considerations.

4.2 Lower Bounds for SGD Sampled with Replacement

We use a similar construction as from the previous result to prove a sharp lower bound on the uniform argument stability for stochastic gradient descent where the sampling is with replacement.

Theorem 4.2.

Let 𝒳=(0,1)\mathcal{X}={\cal B}(0,1), =𝒳0(1){\cal F}={\cal F}_{\mathcal{X}}^{0}(1), and dmin{T,1/η2}d\geq\min\{T,1/\eta^{2}\}. For the sampled with replacement stochastic gradient descent (Algorithm 2) with constant step size η>0\eta>0, there exist SSS\simeq S^{\prime} such that the uniform argument stability satisfies 𝔼[δ𝒜𝗋𝖲𝖦𝖣(S,S)]=Ω(min{1,Tn}ηT+ηTn).\mathbb{E}[\delta_{{\mathcal{A}}_{\sf rSGD}}(S,S^{\prime})]=\Omega\Big{(}\min\Big{\{}1,\frac{T}{n}\Big{\}}\eta\sqrt{T}+\frac{\eta T}{n}\Big{)}.

Proof.

Let D\triangleq\min\{T,1/\eta^{2}\}\leq d, let \nu>0, and let K\geq\sqrt{D}. Consider {\cal Z}=\{0,1\} and define

f(x,z)={max{0,x1ν,,xDν} if z=0r,x/K if z=1,f(x,z)=\left\{\begin{array}[]{ll}\max\{0,x_{1}-\nu,\ldots,x_{D}-\nu\}&\mbox{ if }z=0\\ \langle r,x\rangle/K&\mbox{ if }z=1,\end{array}\right.

where r=(-1,\ldots,-1,0,\ldots,0) (i.e., supported on the first D coordinates). Let (\mathbf{i}_{t})_{t\geq 0}\stackrel{{\scriptstyle i.i.d.}}{{\sim}}\mbox{Unif}([n]) be the random sequence of indices used by the algorithm. Let S=(1,0,\ldots,0) and S^{\prime}=(0,0,\ldots,0) be neighboring datasets, and denote by (x^{t})_{t} and (y^{t})_{t} the respective stochastic gradient descent trajectories on S and S^{\prime}, initialized at x^{1}=y^{1}=0. It is easy to see that under S^{\prime}, we have y^{t}=0 for all t\in[T]. Now, suppose that \nu<\eta/K. Then x^{t}=0 precisely for t\leq\tau, where \tau:=\inf\{t\geq 1:\,\mathbf{i}_{t}=1\}. After time \tau, x^{\tau+1}=-\eta r/K, and consequently x^{\tau+1+j}=-\frac{\eta\mathbf{k}(\tau+j)}{K}r-\eta\sum_{s=1}^{j-\mathbf{k}(\tau+j)+1}e_{s} for all j\in[D+\mathbf{k}(\tau+j)-1], where \mathbf{k}(t)\triangleq|\{s\in[t]:\mathbf{i}_{s}=1\}|. Note that, conditioned on any fixed value of \tau, \mathbf{k}(\tau+j)\leq j+1.

Let δT=xTyT=xT\delta_{T}=\|x^{T}-y^{T}\|=\|x^{T}\|. Hence, we have δTηs=1Tτ𝐤(T1)esη𝐤(T1)D/KηT𝐤(T1)τηTD/K\delta_{T}\geq\eta\|\sum_{s=1}^{T-\tau-\mathbf{k}(T-1)}e_{s}\|-\eta\mathbf{k}(T-1)\sqrt{D}/K\geq\eta\sqrt{T-\mathbf{k}(T-1)-\tau}-\eta T\sqrt{D}/K. Let 𝐤=𝐤(T1)\mathbf{k}=\mathbf{k}(T-1) from now on. Note that conditioned on any value for τ,\tau,  𝐤1\mathbf{k}-1 is a binomial random variable taking values in {0,,T1τ}\{0,\ldots,T-1-\tau\}. Hence, conditioned on τ=t\tau=t, by the binomial tail, we always have [𝐤>T/2|τ=t]exp(T/4)\mathbb{P}[\mathbf{k}>T/2~{}|~{}\tau=t]\leq\exp(-T/4) for all t[T]t\in[T] (in particular, this conditional probability is zero when tT/2t\geq T/2). Also, note that the same upper bound is valid without conditioning on τ\tau. Hence, by the law of total expectation, we have

𝔼[δT]\displaystyle\mathbb{E}[\delta_{T}] =\displaystyle= 𝔼[δT|𝐤T/2][𝐤T/2]+𝔼[δT|𝐤>T/2][𝐤>T/2]c𝔼[δT|𝐤T/2]\displaystyle\mathbb{E}[\delta_{T}|~{}\mathbf{k}\leq T/2]\cdot\mathbb{P}[\mathbf{k}\leq T/2]+\mathbb{E}[\delta_{T}|~{}\mathbf{k}>T/2]\cdot\mathbb{P}[\mathbf{k}>T/2]\geq c\,\mathbb{E}[\delta_{T}|~{}\mathbf{k}\leq T/2]

where c=(1exp(T/4))=Ω(1)c=(1-\exp(-T/4))=\Omega(1). Hence,

𝔼[δT]\displaystyle\mathbb{E}[\delta_{T}] \displaystyle\geq ct=1T/2𝔼[δT|τ=t,𝐤T/2][τ=t|𝐤T/2]\displaystyle c\sum_{t=1}^{T/2}\mathbb{E}[\delta_{T}|\tau=t,~{}\mathbf{k}\leq T/2]\,\mathbb{P}[\tau=t|~{}\mathbf{k}\leq T/2]
\displaystyle\geq c2t=1T/2𝔼[δT|τ=t,𝐤T/2][τ=t]\displaystyle c^{2}\sum_{t=1}^{T/2}\mathbb{E}[\delta_{T}|\tau=t,~{}\mathbf{k}\leq T/2]\,\mathbb{P}[\tau=t]
\displaystyle\geq c2ηnt=1T/2TT/2t(11n)t1c2ηDT/K.\displaystyle c^{2}\frac{\eta}{n}\sum_{t=1}^{T/2}\sqrt{T-T/2-t}\big{(}1-\frac{1}{n}\big{)}^{t-1}-c^{2}\eta\sqrt{D}T/K.

We choose KK sufficiently large such that ηDT/K=o(ηmin{T3/2/n,T})\eta\sqrt{D}T/K=o(\eta\min\{T^{3/2}/n,\sqrt{T}\}). Hence, we have

𝔼[δT]{c2ηnt=1T/2t(11n)n2c2ηDTKc2ηe1nt=1T/2to(ηT3/2n)=Ω(ηT3/2n) if Tnc2ηnt=1n/4T/2n/4e1o(ηT)=Ω(ηT) if T>n.\mathbb{E}[\delta_{T}]\geq\left\{\begin{array}[]{ll}\frac{c^{2}\eta}{n}\sum_{t=1}^{T/2}\sqrt{t}\big{(}1-\frac{1}{n}\big{)}^{n-2}-\frac{c^{2}\eta\sqrt{D}T}{K}\geq\frac{c^{2}\eta e^{-1}}{n}\sum_{t=1}^{T/2}\sqrt{t}-o(\frac{\eta T^{3/2}}{n})=\Omega\Big{(}\frac{\eta T^{3/2}}{n}\Big{)}&\mbox{ if }T\leq n\\ \frac{c^{2}\eta}{n}\sum_{t=1}^{n/4}\sqrt{T/2-n/4}\,e^{-1}-o(\eta\sqrt{T})=\Omega(\eta\sqrt{T})&\mbox{ if }T>n.\end{array}\right.

This gives a lower bound on 𝔼[δT]\mathbb{E}[\delta_{T}]. Proving that x¯T\overline{x}^{T} satisfies the same lower bound is analogous to the proof in Theorem 4.1. Finally, Ω(ηT/n)\Omega(\eta T/n) can be added to the lower bound by Appendix C.∎
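The construction in this proof can be simulated directly. The sketch below (ours; all parameter values are illustrative assumptions) implements the adversarial subgradient oracle for the hard instance and observes the \Omega(\eta\sqrt{T}) growth of the trajectory gap in the T>n regime; projection onto the unit ball is omitted since, for these parameters, the iterate norm stays well below 1:

```python
import numpy as np

def hard_subgrad(x, z, r, K, nu):
    """Adversarial subgradient oracle for the lower-bound instance:
    f(x, 0) = max{0, x_1 - nu, ..., x_D - nu},  f(x, 1) = <r, x>/K."""
    if z == 1:
        return r / K
    g = np.zeros_like(x)
    if x.max() > nu:                 # max branch active: return the
        g[int(np.argmax(x))] = 1.0   # extreme point e_{argmax} (ties -> smallest index)
    return g

rng = np.random.default_rng(3)
n, T, eta, K = 5, 200, 0.02, 1e9
D = min(T, int(1 / eta**2))          # here D = T
nu = eta / (2 * K)                   # requires nu < eta / K
r = -np.ones(D)
S = np.zeros(n)
S[0] = 1                             # neighboring S' = all zeros keeps y^t = 0
x = np.zeros(D)
for t in range(T - 1):
    z = S[rng.integers(n)]
    x = x - eta * hard_subgrad(x, z, r, K, nu)
delta_T = np.linalg.norm(x)          # = ||x^T - y^T|| since y^T = 0
# each non-bumped step after tau pushes a fresh coordinate to about -eta,
# so the gap grows like eta * sqrt(T), matching the Omega(eta sqrt(T)) regime
assert delta_T >= 0.5 * eta * np.sqrt(T / 2)
```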

4.3 Lower Bounds for the Fixed Permutation Stochastic Gradient Descent

Finally, we study fixed permutation SGD. The proof of this result is deferred to Appendix E.

Theorem 4.3.

Let 𝒳=(0,1)\mathcal{X}={\cal B}(0,1), =𝒳0(1){\cal F}={\cal F}_{\mathcal{X}}^{0}(1) and dmin{T,1/η2}d\geq\min\{T,1/\eta^{2}\}. For the fixed permutation stochastic gradient descent (Algorithm 3) with constant step size η>0\eta>0, there exist SSS\simeq S^{\prime} such that the uniform argument stability is lower bounded by 𝔼[δ𝒜𝖯𝖾𝗋𝖲𝖦𝖣(S,S)]=Ω(min{1,Tn}ηT+ηTn).\mathbb{E}[\delta_{{\mathcal{A}}_{\sf PerSGD}}(S,S^{\prime})]=\Omega\Big{(}\min\Big{\{}1,\frac{T}{n}\Big{\}}\eta\sqrt{T}+\frac{\eta T}{n}\Big{)}.

5 Generalization Guarantees for Multi-pass SGD

One important implication of our stability bounds is that they provide non-trivial generalization error guarantees for multi-pass SGD on nonsmooth losses. Multi-pass SGD is one of the most extensively used settings of SGD in practice, where SGD is run for K passes (epochs) over the dataset (namely, the number of iterations is T=Kn). To the best of our knowledge, aside from the dimension-dependent bounds based on uniform convergence (Shalev-Shwartz et al.,, 2010), no generalization error guarantees were previously known for the multi-pass setting on general nonsmooth convex losses. Given our uniform stability upper bounds, we can prove the following generalization error guarantees for the multi-pass setting of sampling-with-replacement SGD. Analogous results can be obtained for fixed-permutation SGD.

Theorem 5.1.

Running Algorithm 2 for KK passes (i.e., for T=KnT=Kn iterations) with constant stepsize ηt=η>0\eta_{t}=\eta>0 yields the following generalization error guarantees:

|𝔼𝒜𝗋𝖲𝖦𝖣[εgen(𝒜𝗋𝖲𝖦𝖣)]|4L2η(Kn+K),|\mathbb{E}_{{\mathcal{A}}_{\sf rSGD}}[\varepsilon_{\mathrm{\tiny gen}}({\mathcal{A}}_{\sf rSGD})]|\leq 4L^{2}\eta\left(\sqrt{Kn}+K\right),

and there exists c>0c>0, such that for any 0<θ<10<\theta<1, with probability 1θexp(n/2)\geq 1-\theta-\exp(-n/2),

|εgen(𝒜𝗋𝖲𝖦𝖣)|c(L2η(Kn+K)log(n)log(n/θ)+LRlog(1/θ)n).|\varepsilon_{\mathrm{\tiny gen}}({\mathcal{A}}_{\sf rSGD})|\leq c~{}\Bigg{(}L^{2}\eta\left(\sqrt{Kn}+K\right)\log(n)\log(n/\theta)+LR\sqrt{\frac{\log(1/\theta)}{n}}~{}\Bigg{)}.
Proof.

First, by the expectation guarantee on UAS given in Theorem 3.3 together with the fact that the losses are LL-Lipschitz, it follows that Algorithm 2 (when run for KK passes with constant stepsize η\eta) is γ\gamma-uniformly stable, where γ=4L2(ηKn+ηK)\gamma=4L^{2}\left(\eta\sqrt{Kn}+\eta K\right). Then, by (Hardt et al.,, 2016, Thm. 2.2), we have

|𝔼𝒜𝗋𝖲𝖦𝖣[εgen(𝒜𝗋𝖲𝖦𝖣)]|γ.|\mathbb{E}_{{\mathcal{A}}_{\sf rSGD}}[\varepsilon_{\mathrm{\tiny gen}}({\mathcal{A}}_{\sf rSGD})]|\leq\gamma.

For the high-probability bound, we combine the high-probability guarantee on UAS given in Theorem 3.3 with Theorem 2.2 to get the claimed bound. ∎

These bounds on generalization error can be used to obtain excess risk bounds using the standard risk decomposition (see (3)). In practical scenarios where one can certify small optimization error for multi-pass SGD, Thm. 5.1 can be used to readily estimate the excess risk. In Section 6.2 we provide worst-case analysis showing that multi-pass SGD is guaranteed to attain the optimal excess risk of LR/n\approx LR/\sqrt{n} within nn passes (with appropriately chosen constant stepsize).
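As a back-of-the-envelope consequence (a sketch of ours, assuming the step-size choice of Section 3.4 carries over with T=Kn), the stability parameter \gamma of Theorem 5.1 evaluates to LR(1/\sqrt{n}+\sqrt{K}/n), so K=n passes give \gamma=2LR/\sqrt{n}, consistent with the optimal-rate claim above:

```python
import math

def multipass_gamma(K, n, eta, L=1.0):
    """Uniform-stability parameter of K-pass SGD with constant step eta
    (Theorem 5.1): gamma = 4 L^2 eta (sqrt(K n) + K)."""
    return 4 * L**2 * eta * (math.sqrt(K * n) + K)

# With T = K n and eta = R / (4 L sqrt(T n)), gamma = L R (1/sqrt(n) + sqrt(K)/n).
L = R = 1.0
n = 10_000
K = n                                   # n passes
eta = R / (4 * L * math.sqrt(K * n * n))
gamma = multipass_gamma(K, n, eta, L)
assert abs(gamma - 2 * L * R / math.sqrt(n)) < 1e-12
```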

6 Implications of Our Stability Bounds

6.1 Differentially Private Nonsmooth Stochastic Convex Optimization

Now we show an application of our stability upper bound to differentially private stochastic convex optimization (DP-SCO). Here, the input sample to the stochastic convex optimization algorithm is a sensitive and private data set, thus the algorithm is required to satisfy the notion of (α,β)(\alpha,\beta)-differential privacy. A randomized algorithm 𝒜{\mathcal{A}} is (α,β)(\alpha,\beta)-differentially private if, for any pair of datasets SSS\simeq S^{\prime}, and for all events 𝒪\mathcal{O} in the output range of 𝒜{\mathcal{A}}, we have [𝒜(S)𝒪]eα[𝒜(S)𝒪]+β,\underset{}{\mathbb{P}}\left[{\mathcal{A}}(S)\in\mathcal{O}\right]\leq e^{\alpha}\cdot\underset{}{\mathbb{P}}\left[{\mathcal{A}}(S^{\prime})\in\mathcal{O}\right]+\beta, where the probability is taken over the random coins of 𝒜{\mathcal{A}} (Dwork et al., 2006b, ; Dwork et al., 2006a, ). For meaningful privacy guarantees, the typical settings of the privacy parameters are α<1\alpha<1 and β1/n\beta\ll 1/n.

Using our UAS upper bounds, we show that a simple variant of noisy SGD (Bassily et al.,, 2014), that requires only n2n^{2} gradient computations, yields the optimal excess population risk for DP-SCO. In terms of running time, this is a small improvement over the algorithm of Feldman et al., (2020) for the nonsmooth case, which requires O(n2log1/β)O(n^{2}\sqrt{\log 1/\beta}) gradient computations. More importantly, our algorithm is substantially simpler. For comparison, the algorithm in (Feldman et al.,, 2020) is based on a multi-phase SGD, where in each phase a separate regularized ERM problem is solved. To ensure privacy, the output of each phase is perturbed with an appropriately chosen amount of noise before being used as the initial point for the next phase.

The description of the algorithm is given in Algorithm 4.

Algorithm 4 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD}: Noisy SGD for convex losses
0:  Private dataset S=(z1,,zn)𝒵nS=(z_{1},\ldots,z_{n})\in\mathcal{Z}^{n}, step size η\eta;  privacy parameters α1,β1/n\alpha\leq 1,~{}\beta\ll 1/n
1:  Set noise variance σ2:=8L2log(1/β)α2\sigma^{2}:=\frac{8\,L^{2}\,\log(1/\beta)}{\alpha^{2}}
2:  Choose an arbitrary initial point x1𝒳x^{1}\in\mathcal{X}
3:  for t=1t=1 to n21n^{2}-1  do
4:     Sample 𝐢tUnif([n])\mathbf{i}_{t}\sim\mbox{Unif}([n])
5:     x^{t+1}:=\mathsf{Proj}_{\mathcal{X}}\left(x^{t}-\eta\cdot\left(\nabla f(x^{t},z_{\mathbf{i}_{t}})+\mathbf{G}_{t}\right)\right), where \mathbf{G}_{t}\sim\mathcal{N}\left(\mathbf{0},\sigma^{2}\mathbb{I}_{d}\right) is drawn independently at each iteration
6:  return  x¯=1n2t=1n2xt\overline{x}=\frac{1}{n^{2}}\sum_{t=1}^{n^{2}}x^{t}
Theorem 6.1 (Privacy guarantee of 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD}).

Algorithm 4 is (α,β)(\alpha,\beta)-differentially private.

The proof of the theorem follows the same lines as (Bassily et al.,, 2014, Theorem 2.1), but we replace their privacy analysis of the Gaussian mechanism with the tighter Moments Accountant analysis of Abadi et al., (2016).
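For concreteness, the following Python sketch (ours; the toy loss, dataset, and dimensions are illustrative assumptions, and no formal privacy accounting is performed here) implements the noisy projected SGD of Algorithm 4 with the stated noise variance and step size:

```python
import numpy as np

def proj_ball(x, R):
    """Euclidean projection onto the ball B(0, R)."""
    nrm = np.linalg.norm(x)
    return x if nrm <= R else x * (R / nrm)

def noisy_sgd(S, subgrad, eta, R, L, alpha, beta, rng):
    """Algorithm 4 sketch: projected SGD over T = n^2 steps with Gaussian
    noise of variance sigma^2 = 8 L^2 log(1/beta) / alpha^2 per coordinate."""
    n, d = len(S), len(S[0])
    sigma = np.sqrt(8 * L**2 * np.log(1 / beta) / alpha**2)
    x = np.zeros(d)
    iterates = [x.copy()]
    for _ in range(n**2 - 1):
        z = S[rng.integers(n)]
        g = subgrad(x, z) + sigma * rng.standard_normal(d)  # noisy subgradient
        x = proj_ball(x - eta * g, R)
        iterates.append(x.copy())
    return np.mean(iterates, axis=0)

rng = np.random.default_rng(4)
n, d, R, L, alpha, beta = 20, 5, 1.0, 1.0, 1.0, 1e-4
S = [z / max(1.0, np.linalg.norm(z)) for z in rng.normal(size=(n, d))]
# step size from Theorem 6.2
eta = R / (L * n * max(np.sqrt(n), np.sqrt(d * np.log(1 / beta)) / alpha))
subgrad = lambda x, z: np.sign(x @ z) * z     # toy loss |<x, z>|, L = 1
xbar = noisy_sgd(S, subgrad, eta, R, L, alpha, beta, rng)
assert np.linalg.norm(xbar) <= R + 1e-9
```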

Theorem 6.2 (Excess risk of {\mathcal{A}}_{\sf NSGD}).

In Algorithm 4, let η=R/(Lnmax(n,dlog(1/β)α))\eta=R/\Big{(}L\cdot n\cdot\max\big{(}\sqrt{n},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha}\big{)}\Big{)}. Then, for any θ(6exp(n/2),1)\theta\in(6\exp(-n/2),1), with probability at least 1θ1-\theta over the randomness in both the sample and the algorithm, we have

ε𝗋𝗂𝗌𝗄(𝒜𝖭𝖲𝖦𝖣)\displaystyle\varepsilon_{\mathsf{\footnotesize{risk}}}\left({\mathcal{A}}_{\sf NSGD}\right) =RLO(max(log(n)log(n/θ)n,dlog(1/β)αn))\displaystyle=RL\cdot O\left(\max\left(\frac{\log(n)\log(n/\theta)}{\sqrt{n}},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha\,n}\right)\right)
Proof.

Fix any confidence parameter θ6exp(n/2)\theta\geq 6\exp(-n/2). First, for any data set S𝒵nS\in\mathcal{Z}^{n} and any step size η>0,\eta>0, by Lemma H.1 in Appendix H, we have the following high-probability guarantee on the training error of 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD}:

With probability at least 1θ/3,1-\theta/3, we have

εopt(𝒜𝖭𝖲𝖦𝖣)FS(x¯)minx𝒳FS(x)\displaystyle\varepsilon_{\mbox{\footnotesize opt}}({\mathcal{A}}_{\sf NSGD})\triangleq F_{S}(\overline{x})-\min\limits_{x\in\mathcal{X}}F_{S}(x) R2ηn2+7RLlog(1/β)log(12/θ)αn+ηL2(32dlog(1/β)α2+1)\displaystyle\leq\frac{R^{2}}{\,\eta\,n^{2}}+7RL\frac{\sqrt{\log(1/\beta)\log(12/\theta)}}{\alpha n}+\eta\,L^{2}\Big{(}32\frac{d\,\log(1/\beta)}{\alpha^{2}}+1\Big{)}

where the probability is over the sampling in step 4 and the independent Gaussian noise vectors 𝐆1,,𝐆n2\mathbf{G}_{1},\ldots,\mathbf{G}_{n^{2}}. Given the setting of η\eta in the theorem, we get

εopt(𝒜𝖭𝖲𝖦𝖣)\displaystyle\varepsilon_{\mbox{\footnotesize opt}}({\mathcal{A}}_{\sf NSGD}) 8RLmax(1n,dlog(1/β)αn)+33RLdlog(1/β)α2nmax(n,dlog(1/β)α)\displaystyle\leq 8RL\max\Big{(}\frac{1}{\sqrt{n}},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha\,n}\Big{)}+33RL\,\frac{d\,\frac{\log(1/\beta)}{\alpha^{2}}}{n\cdot\max\Big{(}\sqrt{n},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha}\Big{)}}
8RLmax(1n,dlog(1/β)αn)+33RLdlog(1/β)nα\displaystyle\leq 8RL\max\Big{(}\frac{1}{\sqrt{n}},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha\,n}\Big{)}+33RL\,\frac{\sqrt{d\,\log(1/\beta)}}{n\,\alpha}
=RLO(max(1n,dlog(1/β)nα)).\displaystyle=RL\cdot O\Bigg{(}\max\Big{(}\frac{1}{\sqrt{n}},~{}\frac{\sqrt{d\,\log(1/\beta)}}{n\,\alpha}\Big{)}\Bigg{)}. (7)

Next, it is not hard to show that 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD} attains the same UAS bound as 𝒜𝗋𝖲𝖦𝖣{\mathcal{A}}_{\sf rSGD} (Theorem 3.3). Indeed, the only difference is the noise addition in the gradient step; however, this does not impact the stability analysis. This is because the sequence of noise vectors {𝐆1,,𝐆n2}\{\mathbf{G}_{1},\ldots,\mathbf{G}_{n^{2}}\} is the same for the trajectories corresponding to the pair S,SS,~{}S^{\prime} of neighboring datasets. Hence, the argument follows the same lines as the proof of Theorem 3.3 since the noise terms cancel out. Thus, we conclude that for any pair SSS\simeq S^{\prime} of neighboring datasets, with probability at least 1exp(n/2)1θ/61-\exp(-n/2)\geq 1-\theta/6 (over the randomness of 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD}), the uniform argument stability of 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD} is bounded as: δ𝒜𝖭𝖲𝖦𝖣4Lη(T+Tn),\delta_{{\mathcal{A}}_{\sf NSGD}}\leq 4L\eta\left(\sqrt{T}+\frac{T}{n}\right), where T=n2T=n^{2}. Given the setting of η\eta in the theorem, this bound reduces to 8R/max(n,dlog(1/β)α)8R/\max\big{(}\sqrt{n},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha}\big{)}.

Hence, by Theorem 2.2, with probability at least 1θ/31-\theta/3 (over the randomness in both the i.i.d. dataset SS and the algorithm), the generalization error of 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD} is bounded as

εgen(𝒜𝖭𝖲𝖦𝖣)\displaystyle\varepsilon_{\mbox{\footnotesize gen}}({\mathcal{A}}_{\sf NSGD}) 8cRLlog(n)log(6n/θ)max(n,dlog(1/β)α)+clog(6/θ)n=RLO(log(n)log(n/θ)n),\displaystyle\leq\frac{8c\,RL\,\log(n)\log(6n/\theta)}{\max\Big{(}\sqrt{n},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha}\Big{)}}+\frac{c\,\sqrt{\log(6/\theta)}}{\sqrt{n}}=RL\cdot O\left(\frac{\log(n)\log(n/\theta)}{\sqrt{n}}\right), (8)

where cc in the first bound is a universal constant.

Now, using (7), (8), and Lemma 2.1, we finally conclude that with probability at least 1θ1-\theta (over the randomness in the sample SS and the internal randomness of 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD}), the excess population risk of 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD} is bounded as

ε𝗋𝗂𝗌𝗄(𝒜𝖭𝖲𝖦𝖣)\displaystyle\varepsilon_{\mathsf{\footnotesize{risk}}}({\mathcal{A}}_{\sf NSGD}) εopt(𝒜𝖭𝖲𝖦𝖣)+εgen(𝒜𝖭𝖲𝖦𝖣)+εapprox\displaystyle\leq\varepsilon_{\mbox{\footnotesize opt}}({\mathcal{A}}_{\sf NSGD})+\varepsilon_{\mbox{\footnotesize gen}}({\mathcal{A}}_{\sf NSGD})+\varepsilon_{\mbox{\tiny approx}}
=RLO(max(1n,dlog(1/β)αn)+log(n)log(n/θ)n+log(1/θ)n)\displaystyle=RL\cdot O\Bigg{(}\max\Big{(}\frac{1}{\sqrt{n}},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha\,n}\Big{)}+\frac{\log(n)\log(n/\theta)}{\sqrt{n}}+\frac{\sqrt{\log(1/\theta)}}{\sqrt{n}}\Bigg{)}
=RLO(max(log(n)log(n/θ)n,dlog(1/β)αn)),\displaystyle=RL\cdot O\Bigg{(}\max\left(\frac{\log(n)\log(n/\theta)}{\sqrt{n}},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha\,n}\right)\Bigg{)},

which completes the proof. ∎
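The noise-cancellation step in the stability argument above can be seen concretely in a purely illustrative unconstrained linear setting, where gradients do not depend on the iterate: coupling two trajectories with the same index sequence and the same Gaussian draws leaves a trajectory gap identical to the noiseless one. The setup below is a hypothetical sanity check (our own instance), not part of the proof.

```python
import numpy as np

# For linear losses f(x, z) = <z, x>, the gradient is z regardless of x, so two
# trajectories run with the SAME noise sequence and the SAME index sequence on
# neighboring datasets differ only through the replaced data point: the shared
# noise cancels exactly in the difference of the trajectories.
rng = np.random.default_rng(0)
n, d, T, eta = 10, 4, 50, 0.05
S = rng.normal(size=(n, d))
Sp = S.copy()
Sp[0] = rng.normal(size=d)              # neighboring dataset: one point replaced
idx = rng.integers(n, size=T)           # shared sampling i_1, ..., i_T
noise = rng.normal(size=(T, d))         # shared Gaussian noise G_1, ..., G_T

def run(data, with_noise):
    x = np.zeros(d)
    for t in range(T):
        g = data[idx[t]] + (noise[t] if with_noise else 0.0)
        x = x - eta * g
    return x

gap_noisy = run(S, True) - run(Sp, True)
gap_clean = run(S, False) - run(Sp, False)
assert np.allclose(gap_noisy, gap_clean)  # the shared noise cancels out
```

In the constrained case the projection prevents exact cancellation, but since projections are nonexpansive the same one-step stability bounds go through, which is precisely the observation used above.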

Remark 6.3.

Using the expectation guarantee on UAS given in Theorem 3.3 and following steps similar to the analysis above, we can also show that the expected excess population risk of 𝒜𝖭𝖲𝖦𝖣{\mathcal{A}}_{\sf NSGD} is bounded as:

𝔼[ε𝗋𝗂𝗌𝗄(𝒜𝖭𝖲𝖦𝖣)]=RLO(max(1n,dlog(1/β)αn)).\underset{}{\mathbb{E}}\left[\varepsilon_{\mathsf{\footnotesize{risk}}}\left({\mathcal{A}}_{\sf NSGD}\right)\right]=RL\cdot O\left(\max\left(\frac{1}{\sqrt{n}},~{}\frac{\sqrt{d\,\log(1/\beta)}}{\alpha\,n}\right)\right).

6.2 Nonsmooth Stochastic Convex Optimization with Multi-pass SGD

Another application of our results concerns obtaining optimal excess risk for stochastic nonsmooth convex optimization via multi-pass SGD. It is known that one-pass SGD is guaranteed to have optimal excess risk, which can be shown via martingale arguments that trace back to the stochastic approximation literature (Robbins and Monro,, 1951; Kiefer and Wolfowitz,, 1952).

Using our UAS bound, we show that Algorithms 2 and 3 can recover nearly-optimal high-probability excess risk bounds by making nn passes over the data. Analogous bounds hold for Algorithm 1; however, these are less interesting from a computational efficiency perspective.

In short, we prove the following excess population risk bounds; their analyses are deferred to Appendix G:

εrisk(𝒜𝗋𝖲𝖦𝖣)4RLn;εrisk(𝒜𝖯𝖾𝗋𝖲𝖦𝖣)8LRn.\varepsilon_{\mbox{\footnotesize{risk}}}({\mathcal{A}}_{\sf rSGD})\leq\frac{4RL}{\sqrt{n}}\qquad;\qquad\varepsilon_{\mbox{\footnotesize risk}}({\mathcal{A}}_{\sf PerSGD})\leq\frac{8LR}{\sqrt{n}}.
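The shape of these bounds comes from balancing optimization error against generalization error; a heuristic sketch with constants suppressed (the precise step size choices are in Appendix G) is:

```latex
% With T = n^2 steps and a constant step size \eta:
\varepsilon_{\mathrm{opt}} \lesssim \frac{R^{2}}{\eta T}+\eta L^{2},
\qquad
\delta_{\mathrm{UAS}} \lesssim L\eta\Big(\sqrt{T}+\frac{T}{n}\Big)=2L\eta n,
\qquad
\varepsilon_{\mathrm{gen}} \lesssim L\,\delta_{\mathrm{UAS}}.
% Choosing \eta = R/(L n^{3/2}) balances the two terms:
\varepsilon_{\mathrm{opt}} \lesssim \frac{RL}{\sqrt{n}},
\qquad
\varepsilon_{\mathrm{gen}} \lesssim \frac{RL}{\sqrt{n}},
\qquad\Longrightarrow\qquad
\varepsilon_{\mathrm{risk}} = O\Big(\frac{RL}{\sqrt{n}}\Big).
```

That is, running for T=n2T=n^{2} iterations with a step size smaller by a factor n\sqrt{n} than the single-pass choice keeps the UAS of order 1/n1/\sqrt{n} while driving the optimization error down to the statistical rate.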

7 Discussion and Open Problems

In this work we provide sharp upper and lower bounds on uniform argument stability for the (stochastic) subgradient method in stochastic nonsmooth convex optimization. Our lower bounds show inherent limitations of stability bounds compared to the smooth convex case; however, we can still derive optimal population risk bounds by reducing the step size and running the algorithms for a larger number of iterations. We provide applications of this idea to differentially private noisy SGD, and to two versions of SGD (sampling-with-replacement and fixed-permutation SGD).

The first open problem concerns lower bounds that are robust to general forms of algorithmic randomization. Unfortunately, the methods presented here are not robust in this respect, since random initialization would prevent the trajectories from reaching the region where the objective is highly nonsmooth (or would reach it in a way that does not increase the UAS). One may try to strengthen the lower bound by using a random rotation of the objective; however, this leads to an uninformative lower bound. Finding distributional constructions for lower bounds against randomization is a very interesting future direction.

Our privacy application provides optimal risk for an algorithm that runs for n2n^{2} steps, which is impractical for large datasets. Other algorithms, e.g. in (Feldman et al.,, 2020), run into similar limitations. Proving that quadratic running time is necessary for general nonsmooth DP-SCO is a very interesting open problem that can be formalized in terms of the oracle complexity of stochastic convex optimization (Nemirovsky and Yudin,, 1983) under stability and/or privacy constraints.

Acknowledgements

Part of this work was done while the authors were visiting the Simons Institute for the Theory of Computing during the “Data Privacy: Foundations and Applications” program. RB’s research is supported by NSF Awards AF-1908281, SHF-1907715, Google Faculty Research Award, and OSU faculty start-up support. Work by CG was partially funded by the Millennium Science Initiative of the Ministry of Economy, Development, and Tourism, grant “Millennium Nucleus Center for the Discovery of Structures in Complex Data.” CG would like to thank Nicolas Flammarion and Juan Peypouquet for extremely valuable discussions at early stages of this work.

References

  • Abadi et al., (2016) Abadi, M., Chu, A., Goodfellow, I., McMahan, H. B., Mironov, I., Talwar, K., and Zhang, L. (2016). Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 308–318. ACM.
  • Bassily et al., (2019) Bassily, R., Feldman, V., Talwar, K., and Thakurta, A. G. (2019). Private stochastic convex optimization with optimal rates. In Advances in Neural Information Processing Systems, pages 11279–11288.
  • Bassily et al., (2014) Bassily, R., Smith, A., and Thakurta, A. (2014). Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science (full version available at arXiv:1405.7085), pages 464–473. IEEE.
  • Beck, (2017) Beck, A. (2017). First-Order Methods in Optimization. MOS-SIAM Series on Optimization. Society for Industrial and Applied Mathematics.
  • Bousquet and Elisseeff, (2002) Bousquet, O. and Elisseeff, A. (2002). Stability and generalization. JMLR, 2:499–526.
  • Bousquet et al., (2019) Bousquet, O., Klochkov, Y., and Zhivotovskiy, N. (2019). Sharper bounds for uniformly stable algorithms. CoRR, abs/1910.07833.
  • Charles and Papailiopoulos, (2018) Charles, Z. and Papailiopoulos, D. (2018). Stability and generalization of learning algorithms that converge to global optima. In Dy, J. and Krause, A., editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 745–754, Stockholmsmässan, Stockholm Sweden. PMLR.
  • Chaudhuri and Monteleoni, (2008) Chaudhuri, K. and Monteleoni, C. (2008). Privacy-preserving logistic regression. In NIPS.
  • Chaudhuri et al., (2011) Chaudhuri, K., Monteleoni, C., and Sarwate, A. D. (2011). Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(Mar):1069–1109.
  • Chen et al., (2018) Chen, Y., Jin, C., and Yu, B. (2018). Stability and convergence trade-off of iterative optimization algorithms. CoRR, abs/1804.01619.
  • Devroye and Wagner, (1979a) Devroye, L. and Wagner, T. J. (1979a). Distribution-free inequalities for the deleted and holdout error estimates. IEEE Trans. Information Theory, 25(2):202–207.
  • Devroye and Wagner, (1979b) Devroye, L. and Wagner, T. J. (1979b). Distribution-free performance bounds with the resubstitution error estimate (corresp.). IEEE Trans. Information Theory, 25(2):208–210.
  • Dwork and Feldman, (2018) Dwork, C. and Feldman, V. (2018). Privacy-preserving prediction. CoRR, abs/1803.10266. Extended abstract in COLT 2018.
  • Dwork et al., (2006a) Dwork, C., Kenthapadi, K., McSherry, F., Mironov, I., and Naor, M. (2006a). Our data, ourselves: Privacy via distributed noise generation. In EUROCRYPT.
  • Dwork et al., (2006b) Dwork, C., McSherry, F., Nissim, K., and Smith, A. (2006b). Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265–284. Springer.
  • Feldman, (2016) Feldman, V. (2016). Generalization of ERM in stochastic convex optimization: The dimension strikes back. CoRR, abs/1608.04414. Extended abstract in NIPS 2016.
  • Feldman et al., (2020) Feldman, V., Koren, T., and Talwar, K. (2020). Private stochastic convex optimization: Optimal rates in linear time. Extended abstract in STOC 2020.
  • Feldman et al., (2018) Feldman, V., Mironov, I., Talwar, K., and Thakurta, A. (2018). Privacy amplification by iteration. In FOCS, pages 521–532.
  • Feldman and Vondrák, (2018) Feldman, V. and Vondrák, J. (2018). Generalization bounds for uniformly stable algorithms. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, 3-8 December 2018, Montréal, Canada, pages 9770–9780.
  • Feldman and Vondrák, (2019) Feldman, V. and Vondrák, J. (2019). High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on Learning Theory, COLT 2019, 25-28 June 2019, Phoenix, AZ, USA, pages 1270–1279.
  • Hardt et al., (2016) Hardt, M., Recht, B., and Singer, Y. (2016). Train faster, generalize better: Stability of stochastic gradient descent. In ICML, pages 1225–1234.
  • Jain et al., (2012) Jain, P., Kothari, P., and Thakurta, A. (2012). Differentially private online learning. In 25th Annual Conference on Learning Theory (COLT), pages 24.1–24.34.
  • Jain and Thakurta, (2014) Jain, P. and Thakurta, A. (2014). (near) dimension independent risk bounds for differentially private learning. In ICML.
  • Kearns and Ron, (1999) Kearns, M. J. and Ron, D. (1999). Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural Computation, 11(6):1427–1453.
  • Kiefer and Wolfowitz, (1952) Kiefer, J. and Wolfowitz, J. (1952). Stochastic estimation of the maximum of a regression function. Ann. Math. Statist., 23(3):462–466.
  • Kifer et al., (2012) Kifer, D., Smith, A., and Thakurta, A. (2012). Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pages 25–1.
  • Koren and Levy, (2015) Koren, T. and Levy, K. (2015). Fast rates for exp-concave empirical risk minimization. In NIPS, pages 1477–1485.
  • Kuzborskij and Lampert, (2018) Kuzborskij, I. and Lampert, C. H. (2018). Data-dependent stability of stochastic gradient descent. In ICML, volume 80 of Proceedings of Machine Learning Research, pages 2820–2829. PMLR.
  • Liu et al., (2017) Liu, T., Lugosi, G., Neu, G., and Tao, D. (2017). Algorithmic stability and hypothesis complexity. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, pages 2159–2167.
  • London, (2017) London, B. (2017). A pac-bayesian analysis of randomized learning with application to stochastic gradient descent. In Guyon, I., Luxburg, U. V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., and Garnett, R., editors, Advances in Neural Information Processing Systems 30, pages 2931–2940. Curran Associates, Inc.
  • Maurer, (2017) Maurer, A. (2017). A second-order look at stability and generalization. In COLT, pages 1461–1475.
  • Mukherjee et al., (2006) Mukherjee, S., Niyogi, P., Poggio, T., and Rifkin, R. (2006). Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1-3):161–193.
  • Nedic and Bertsekas, (2001) Nedic, A. and Bertsekas, D. P. (2001). Incremental subgradient methods for nondifferentiable optimization. SIAM Journal on Optimization, 12(1):109–138.
  • Nemirovsky and Yudin, (1983) Nemirovsky, A. and Yudin, D. (1983). Problem Complexity and Method Efficiency in Optimization. J. Wiley & Sons, New York.
  • Poggio et al., (2004) Poggio, T., Rifkin, R., Mukherjee, S., and Niyogi, P. (2004). General conditions for predictivity in learning theory. Nature, 428(6981):419–422.
  • Rigollet, (2015) Rigollet, P. (2015). Lecture Notes, 18.S997: High Dimensional Statistics. MIT Courses/Mathematics. https://ocw.mit.edu/courses/mathematics/18-s997-high-dimensional-statistics-spring-2015
  • Robbins and Monro, (1951) Robbins, H. and Monro, S. (1951). A stochastic approximation method. Ann. Math. Statist., 22(3):400–407.
  • Rogers and Wagner, (1978) Rogers, W. H. and Wagner, T. J. (1978). A finite sample distribution-free performance bound for local discrimination rules. The Annals of Statistics, 6(3):506–514.
  • Shalev-Shwartz et al., (2010) Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. (2010). Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670.
  • Smith and Thakurta, (2013) Smith, A. and Thakurta, A. (2013). Differentially private feature selection via stability arguments, and the robustness of the LASSO. In Conference on Learning Theory (COLT), pages 819–850.
  • Talwar et al., (2015) Talwar, K., Thakurta, A., and Zhang, L. (2015). Nearly optimal private LASSO. In Proceedings of the 28th International Conference on Neural Information Processing Systems, volume 2, pages 3025–3033.
  • Ullman, (2015) Ullman, J. (2015). Private multiplicative weights beyond linear queries. In Proceedings of the 34th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, pages 303–312. ACM.
  • Wu et al., (2017) Wu, X., Li, F., Kumar, A., Chaudhuri, K., Jha, S., and Naughton, J. (2017). Bolt-on differential privacy for scalable stochastic gradient descent-based analytics. In SIGMOD. ACM.
  • Zinkevich, (2003) Zinkevich, M. (2003). Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the Twentieth International Conference on International Conference on Machine Learning, ICML’03, page 928–935. AAAI Press.

Appendix A Proof of Theorem 3.2

Proof.

The bound of 2R2R is obtained directly from the diameter bound on 𝒳\mathcal{X}. Therefore, we focus exclusively on the second term. Let SSS\simeq S^{\prime} be arbitrary neighboring datasets, x1=y1x^{1}=y^{1}, and consider the trajectories (xt)t,(yt)t(x^{t})_{t},(y^{t})_{t} associated with the batch GD method on datasets SS and SS^{\prime}, respectively. We use Lemma 3.1 with ft=FSf_{t}=F_{S} and ft=FSf_{t}^{\prime}=F_{S^{\prime}}, for all t[T1]t\in[T-1]. Notice that

supx𝒳FS(x)FS(x)2L/n,\sup_{x\in{\cal X}}\|\nabla F_{S}(x)-\nabla F_{S^{\prime}}(x)\|\leq 2L/n,

since SSS\simeq S^{\prime}; in particular, ft(xt)ft(xt)at\|\nabla f_{t}(x^{t})-\nabla f_{t}^{\prime}(x^{t})\|\leq a_{t}, with at=2L/na_{t}=2L/n. We conclude by Lemma 3.1 that for all t[T]t\in[T]

xtyt2Ls=1t1ηs2+4Lns=2t1ηs.\|x^{t}-y^{t}\|\leq 2L\sqrt{\sum_{s=1}^{t-1}\eta_{s}^{2}}+\frac{4L}{n}\sum_{s=2}^{t-1}\eta_{s}.

Hence, the stability bound holds for all the iterates, and thus for x¯T\overline{x}^{T} by the triangle inequality. ∎
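As an illustrative numerical companion to this bound (not part of the proof), one can run full-batch subgradient descent on a pair of neighboring datasets for the simple one-dimensional loss f(x,z)=L|xz|f(x,z)=L|x-z| and check the claimed inequality along the trajectory; the instance and constants below are our own choices.

```python
import numpy as np

# Empirical check of the Theorem 3.2-style bound for full-batch subgradient
# descent on neighboring datasets, with the 1-d L-Lipschitz loss f(x,z) = L|x-z|.
# Here the empirical gradients of F_S and F_{S'} differ by at most 2L/n.
L_const, n, T = 1.0, 20, 200
rng = np.random.default_rng(1)
S = rng.uniform(-1, 1, size=n)
Sp = S.copy()
Sp[0] = rng.uniform(-1, 1)                        # replace one data point
eta = 0.1 / np.sqrt(np.arange(2, T + 1))          # non-increasing step sizes

def full_batch_grad(x, data):
    # Subgradient of F(x) = (L/n) * sum_i |x - z_i|.
    return L_const * np.mean(np.sign(x - data))

x = y = 0.5                                       # common initialization
for t in range(T - 1):
    x = x - eta[t] * full_batch_grad(x, S)
    y = y - eta[t] * full_batch_grad(y, Sp)
    # Bound (slightly loosened: both sums start at s=1) guaranteed by the theorem.
    bound = (2 * L_const * np.sqrt(np.sum(eta[: t + 1] ** 2))
             + 4 * L_const / n * np.sum(eta[: t + 1]))
    assert abs(x - y) <= bound + 1e-12
```

In this benign instance the observed gap is far below the bound; the 2LΣsηs22L\sqrt{\Sigma_{s}\eta_{s}^{2}} term is needed for worst-case nonsmooth instances such as those in Section 4.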

Appendix B Proof of Theorem 3.4

Proof.

The stability bound of 2R2R is implied directly by the diameter of the feasible set. Let SSS\simeq S^{\prime}, and let (xt)t[T],(yt)t[T](x^{t})_{t\in[T]},(y^{t})_{t\in[T]} be the trajectories of Algorithm 3 on SS and SS^{\prime}, respectively, with x1=y1x^{1}=y^{1}.

Notice that since 𝝅\bm{\pi} is a random permutation, we may assume w.l.o.g. that 𝝅\bm{\pi} is the identity, whereas the perturbed coordinate between S,SS,S^{\prime} is 𝐢Unif([n])\mathbf{i}\sim\mbox{Unif}([n]). The rest of the proof is a stability analysis conditioned on 𝝅\bm{\pi} (which fixes all the randomness of the algorithm), but from the observation above this is the same as conditioning on the random perturbed coordinate 𝐢\mathbf{i}.

Let TnT\leq n, and δt=xtyt\delta_{t}=\|x^{t}-y^{t}\| so that δ1=0\delta_{1}=0. Conditioned on 𝐢=i\mathbf{i}=i, we have that for all tTt\leq T,

δt+12{0t<i4ηt2L2t=iδt2+4ηt2L2i<tT\delta_{t+1}^{2}\leq\left\{\begin{array}[]{ll}0&t<i\\ 4\eta_{t}^{2}L^{2}&t=i\\ \delta_{t}^{2}+4\eta_{t}^{2}L^{2}&i<t\leq T\end{array}\right.

Indeed, for all tit\leq i, δt=0\delta_{t}=0. For t=it=i, we have

δi+1\displaystyle\delta_{i+1} =\displaystyle= 𝖯𝗋𝗈𝗃𝒳[xiηif(xi,zi)]𝖯𝗋𝗈𝗃𝒳[yiηif(yi,zi)]\displaystyle\|\mathsf{Proj}_{\mathcal{X}}[x^{i}-\eta_{i}\nabla f(x^{i},z_{i})]-\mathsf{Proj}_{\mathcal{X}}[y^{i}-\eta_{i}\nabla f(y^{i},z_{i}^{\prime})]\|
\displaystyle\leq xiyiηi[f(xi,zi)f(yi,zi)]\displaystyle\|x^{i}-y^{i}-\eta_{i}[\nabla f(x^{i},z_{i})-\nabla f(y^{i},z_{i}^{\prime})]\|
\displaystyle\leq 2Lηi,\displaystyle 2L\eta_{i},

where we used xi=yix^{i}=y^{i}, and that both gradients are bounded in norm by LL. Finally, when t>it>i, we have zt=ztz_{t}=z_{t}^{\prime}, and therefore we can leverage the monotonicity of the subgradients

δt+12\displaystyle\delta_{t+1}^{2} =\displaystyle= 𝖯𝗋𝗈𝗃𝒳[xtηtf(xt,zt)]𝖯𝗋𝗈𝗃𝒳[ytηtf(yt,zt)]2\displaystyle\|\mathsf{Proj}_{\mathcal{X}}[x^{t}-\eta_{t}\nabla f(x^{t},z_{t})]-\mathsf{Proj}_{\mathcal{X}}[y^{t}-\eta_{t}\nabla f(y^{t},z_{t})]\|^{2}
\displaystyle\leq δt2+4L2ηt22ηtf(xt,zt)f(yt,zt),xtyt\displaystyle\delta_{t}^{2}+4L^{2}\eta_{t}^{2}-2\eta_{t}\langle\nabla f(x^{t},z_{t})-\nabla f(y^{t},z_{t}),x^{t}-y^{t}\rangle
\displaystyle\leq δt2+4L2ηt2.\displaystyle\delta_{t}^{2}+4L^{2}\eta_{t}^{2}.

Unravelling this recursion, we get 𝔼[δt+12|𝐢=i]4L2s=itηs2\mathbb{E}[\delta_{t+1}^{2}|\mathbf{i}=i]\leq 4L^{2}\sum_{s=i}^{t}\eta_{s}^{2}, so by the law of total expectation:

𝔼[δt]\displaystyle\mathbb{E}[\delta_{t}] =\displaystyle= 1ni=1n𝔼[δt|𝐢=i]1ni=1n𝔼[δt2|𝐢=i]2Lni=1t1s=it1ηs2\displaystyle\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[\delta_{t}|\mathbf{i}=i]\leq\frac{1}{n}\sum_{i=1}^{n}\sqrt{\mathbb{E}[\delta_{t}^{2}|\mathbf{i}=i]}\leq\frac{2L}{n}\sum_{i=1}^{t-1}\sqrt{\sum_{s=i}^{t-1}\eta_{s}^{2}}
\displaystyle\leq 2Lni=1t1(ti)ηi2Ln(i=1t1(ti))(i=1t1ηi2)\displaystyle\frac{2L}{n}\sum_{i=1}^{t-1}\sqrt{(t-i)}\eta_{i}\leq\frac{2L}{n}\sqrt{\Big{(}\sum_{i=1}^{t-1}(t-i)\Big{)}\Big{(}\sum_{i=1}^{t-1}\eta_{i}^{2}}\Big{)}
\displaystyle\leq 2Lnt22i=1t1ηi2=2Ltni=1t1ηi2.\frac{2L}{n}\sqrt{\frac{t^{2}}{2}\sum_{i=1}^{t-1}\eta_{i}^{2}}=\frac{\sqrt{2}\,L\,t}{n}\sqrt{\sum_{i=1}^{t-1}\eta_{i}^{2}}.

where the first inequality holds by Jensen's inequality, the second inequality comes from the bound on the conditional expectation, the third inequality from the non-increasing step size assumption, and the fourth inequality from Cauchy-Schwarz. Since averaging the iterates can only improve stability, the result follows. ∎
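The one-step inequality δt+12δt2+4L2ηt2\delta_{t+1}^{2}\leq\delta_{t}^{2}+4L^{2}\eta_{t}^{2}, which drives the recursion above, can be checked numerically. The sketch below uses the convex LL-Lipschitz loss f(x,z)=Lxzf(x,z)=L\|x-z\| (our choice, purely for illustration) and projection onto a Euclidean ball; nonexpansiveness of the projection and monotonicity of the subgradients guarantee the inequality for every pair of points.

```python
import numpy as np

# Check: for a SHARED example z, one projected subgradient step satisfies
#   ||x' - y'||^2 <= ||x - y||^2 + 4 L^2 eta^2,
# using monotonicity of subgradients of the convex loss f(x, z) = L * ||x - z||.
rng = np.random.default_rng(2)
L_const, eta, R, d = 1.0, 0.3, 1.0, 5

def subgrad(x, z):
    # Subgradient of L * ||x - z|| (zero at the kink x = z).
    diff = x - z
    nrm = np.linalg.norm(diff)
    return L_const * diff / nrm if nrm > 0 else np.zeros_like(diff)

def proj_step(x, z):
    # One projected subgradient step onto the Euclidean ball of radius R.
    v = x - eta * subgrad(x, z)
    nrm = np.linalg.norm(v)
    return v * (R / nrm) if nrm > R else v

for _ in range(1000):
    x, y = rng.normal(size=d), rng.normal(size=d)
    z = rng.normal(size=d)
    lhs = np.linalg.norm(proj_step(x, z) - proj_step(y, z)) ** 2
    rhs = np.linalg.norm(x - y) ** 2 + 4 * L_const**2 * eta**2
    assert lhs <= rhs + 1e-10
```

The inner-product term 2ηf(x,z)f(y,z),xy-2\eta\langle\nabla f(x,z)-\nabla f(y,z),x-y\rangle is nonpositive by monotonicity, which is exactly why the same data point never expands the gap beyond the additive 4L2η24L^{2}\eta^{2}.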

Appendix C General Lower Bound on Stability of SGD

Let 𝒜{\cal A} be a γ\gamma-uniformly stable stochastic convex optimization algorithm with γ=s(T)/n\gamma=s(T)/n, where s(T)s(T) is increasing and limT+s(T)=+\lim_{T\rightarrow+\infty}s(T)=+\infty. By the lower bound on the optimal risk of nonsmooth convex optimization, εriskLRC1n\varepsilon_{\mbox{\footnotesize risk}}\geq\frac{LR}{C_{1}\sqrt{n}}, where C1>0C_{1}>0 is a universal constant (Nemirovsky and Yudin,, 1983). This, combined with the risk decomposition (3), implies that

εoptLRC1ns(T)n=s(T)(1nLR2C1s(T))2+(LR)24C12s(T).\varepsilon_{\mbox{\footnotesize opt}}\geq\frac{LR}{C_{1}\sqrt{n}}-\frac{s(T)}{n}=-s(T)\Big{(}\frac{1}{\sqrt{n}}-\frac{LR}{2C_{1}s(T)}\Big{)}^{2}+\frac{(LR)^{2}}{4C_{1}^{2}s(T)}.

By our assumption on s(T)s(T), for TT sufficiently large, there always exists nn such that

4C1s(T)3LRn4C1s(T)LR\frac{4C_{1}s(T)}{3LR}\leq\sqrt{n}\leq\frac{4C_{1}s(T)}{LR}

which leads to εopt(LR)2C2s(T)\varepsilon_{\mbox{\footnotesize opt}}\geq\frac{(LR)^{2}}{C_{2}s(T)}, where C2>0C_{2}>0 is a universal constant.

If algorithm 𝒜{\cal A} is based on TT subgradient iterations with constant step size η>0\eta>0 (these could be either stochastic, batch, or minibatch), then by standard analysis the optimization guarantee of such an algorithm is εopt12(R2ηT+ηL2)\varepsilon_{\mbox{\footnotesize opt}}\leq\frac{1}{2}(\frac{R^{2}}{\eta T}+\eta L^{2}). Both bounds in combination give

s(T)2(LR)2C2(ηL2+R2/[ηT])=2(LR)2ηTC2(η2TL2+R2).s(T)\geq\frac{2(LR)^{2}}{C_{2}(\eta L^{2}+R^{2}/[\eta T])}=\frac{2(LR)^{2}\eta T}{C_{2}(\eta^{2}TL^{2}+R^{2})}.

If we further assume that η(R/L)/T\eta\leq(R/L)/\sqrt{T} (notice that η=(R/L)/T\eta=(R/L)/\sqrt{T} minimizes the optimization error), then s(T)L2ηT/C2s(T)\geq L^{2}\eta T/C_{2}. We also emphasize that all the choices of step size we make to control the generalization error lie in this range.
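For concreteness, substituting the risk-optimal step size into the last bound makes the growth rate explicit (a direct substitution, with the same constants as above):

```latex
\text{With } \eta = \frac{R}{L\sqrt{T}}:\qquad
s(T) \;\ge\; \frac{L^{2}\eta T}{C_{2}}
     \;=\; \frac{LR\sqrt{T}}{C_{2}}.
```

That is, any uniformly stable constant-step subgradient method run with the optimization-optimal step size must have a stability parameter growing as Ω(T)\Omega(\sqrt{T}), which is consistent with the T\sqrt{T} dependence in the UAS lower bounds of Section 4.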

Appendix D Proof of Theorem 4.1

Proof.

Let Dmin{T,1/η2}dD\triangleq\min\{T,1/\eta^{2}\}\leq d, and ν,K>0\nu,K>0. We consider 𝒵={0,1}\mathcal{Z}=\{0,1\}, and the objective function

f(x,z)={max{0,x1ν,,xDν} if z=0r,x/K if z=1,f(x,z)=\left\{\begin{array}[]{ll}\max\{0,x_{1}-\nu,\ldots,x_{D}-\nu\}&\mbox{ if }z=0\\ \langle r,x\rangle/K&\mbox{ if }z=1,\end{array}\right.

where r=(1,,1,0,,0)r=(-1,\ldots,-1,0,\ldots,0) (i.e., supported on the first DD coordinates). Notice that for normalization purposes, we need KDK\geq\sqrt{D}; furthermore, we will choose KK sufficiently large such that TD/[nK]=o(1)T\sqrt{D}/[nK]=o(1). Consider the data sets SSS\simeq S^{\prime}, leading to the empirical objectives:

FS(x)=1nKr,x+n1nmax{0,x1ν,,xDν}andFS(x)=max{0,x1ν,,xDν}.F_{S}(x)=\frac{1}{nK}\langle r,x\rangle+\frac{n-1}{n}\max\{0,x_{1}-\nu,\ldots,x_{D}-\nu\}\,\,\mbox{and}\,\,F_{S^{\prime}}(x)=\max\{0,x_{1}-\nu,\ldots,x_{D}-\nu\}.

Let (xt)t[T](x^{t})_{t\in[T]} and (yt)t[T](y^{t})_{t\in[T]} be the trajectories of the algorithm over datasets SS and SS^{\prime}, respectively, initialized from x1=y1=0x^{1}=y^{1}=0. Clearly, yt=0y^{t}=0 for all tt. Now x2=ηnKrx^{2}=-\frac{\eta}{nK}\,r; choosing ν<η/(nK)\nu<\eta/(nK), we have f(x2,z)=ηnKr+n1ne1\nabla f(x^{2},z)=-\frac{\eta}{nK}\,r+\frac{n-1}{n}e_{1}, and hence x3=2ηnKrηn1ne1.x^{3}=-\frac{2\eta}{nK}\,r-\eta\frac{n-1}{n}e_{1}. Sequentially, the method will perform cumulative subgradient steps on e2,e3,eDe_{2},e_{3}\ldots,e_{D}. In particular, for any t[D+1],t\in[D+1], we have xt+1=tηnKrηn1ns=1t1es.x^{t+1}=-t\frac{\eta}{nK}\,r-\eta\frac{n-1}{n}\sum_{s=1}^{t-1}e_{s}.

By orthogonality of the subgradients and given our choice of KK, we conclude that

xD+2yD+2\displaystyle\|x^{D+2}-y^{D+2}\| =\displaystyle= xD+2η2t=1DetηDDnK\displaystyle\|x^{D+2}\|\geq\frac{\eta}{2}\Big{\|}\sum_{t=1}^{D}e_{t}\Big{\|}-\eta\frac{D\sqrt{D}}{nK}
\displaystyle\geq η2Dηo(1)=Ω(ηD)\displaystyle\frac{\eta}{2}\sqrt{D}-\eta\cdot o(1)=\Omega(\eta\sqrt{D})
=\displaystyle= Ω(min{1,ηT}),\displaystyle\Omega(\min\{1,\eta\sqrt{T}\}),

and further subgradient steps t=D+1,,Tt=D+1,\ldots,T are only given by the linear term, r/[nK]r/[nK], which are negligible perturbations.

We finish by arguing that averaging does not help. First, in the case D=TD=T:

δ(𝒜𝖦𝖣)\displaystyle\delta({\mathcal{A}}_{\sf GD}) =\displaystyle= x¯Tη21Tt=1Ts=1t2eso(1)=η2s=1TTs2Teso(1)\displaystyle\|\overline{x}^{T}\|\geq\frac{\eta}{2}\Big{\|}\frac{1}{T}\sum_{t=1}^{T}\sum_{s=1}^{t-2}e_{s}\Big{\|}-o(1)=\frac{\eta}{2}\Big{\|}\sum_{s=1}^{T}\frac{T-s-2}{T}e_{s}\Big{\|}-o(1)
\displaystyle\geq η4sT/22eso(1)=Ω(ηT).\displaystyle\frac{\eta}{4}\Big{\|}\sum_{s\leq T/2-2}e_{s}\Big{\|}-o(1)=\Omega(\eta\sqrt{T}).

And second, in the case D=1/η2D=1/\eta^{2}:

δ(𝒜𝖦𝖣)\displaystyle\delta({\mathcal{A}}_{\sf GD}) =\displaystyle= x¯Tη2Tt=1D+2s=1t2es+t=D+3Ts=1Deso(1)\displaystyle\|\overline{x}^{T}\|\geq\frac{\eta}{2T}\Big{\|}\sum_{t=1}^{D+2}\sum_{s=1}^{t-2}e_{s}+\sum_{t=D+3}^{T}\sum_{s=1}^{D}e_{s}\Big{\|}-o(1)
=\displaystyle= η2Ts=1D(Ds1)es+s=1D(TD+2)eso(1)\displaystyle\frac{\eta}{2T}\Big{\|}\sum_{s=1}^{D}(D-s-1)e_{s}+\sum_{s=1}^{D}(T-D+2)e_{s}\Big{\|}-o(1)
=\displaystyle= η2s=1D1Ts+1Teso(1)η4s=1D/2eso(1)=Ω(Dη)=Ω(1).\displaystyle\frac{\eta}{2}\Big{\|}\sum_{s=1}^{D-1}\frac{T-s+1}{T}e_{s}\Big{\|}-o(1)\geq\frac{\eta}{4}\Big{\|}\sum_{s=1}^{D/2}e_{s}\Big{\|}-o(1)=\Omega(\sqrt{D}\eta)=\Omega(1).

Finally, the additional term Ω(ηT/n)\Omega(\eta T/n) in the lower bound is obtained by Appendix C. ∎
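The construction above can be simulated directly. The following sketch (our own instantiation of the parameters, with a subgradient oracle that breaks ties toward the smallest maximizing coordinate, as in the proof) runs full-batch subgradient descent on FSF_{S} starting from 00; the trajectory on FSF_{S^{\prime}} stays at 00, so the final norm is the stability gap.

```python
import numpy as np

# Simulation of the lower-bound instance: subgradient descent on F_S sweeps one
# coordinate per step while the trajectory on F_{S'} stays at 0, giving a gap
# of order eta * sqrt(D). All parameter values here are our own choices.
eta, D, n, K = 0.1, 100, 50, 1e6
nu = eta / (2 * n * K)                  # nu < eta/(nK), as required in the proof
r = -np.ones(D)                         # r supported on the first D coordinates

def grad_FS(x):
    # Subgradient of F_S(x) = <r, x>/(nK) + ((n-1)/n) * max{0, x_1-nu, ..., x_D-nu}.
    g = r / (n * K)
    vals = x - nu
    if vals.max() > 0:                  # max-term active: pick e_{argmax}
        e = np.zeros(D)
        e[np.argmax(vals)] = 1.0        # argmax breaks ties toward lowest index
        g = g + (n - 1) / n * e
    return g

x = np.zeros(D)                         # trajectory on S; on S' it stays at 0
for _ in range(D + 1):                  # D+1 updates produce x^{D+2}
    x = x - eta * grad_FS(x)

gap = np.linalg.norm(x)                 # equals ||x^{D+2} - y^{D+2}||
assert gap >= 0.5 * eta * np.sqrt(D)    # Omega(eta * sqrt(D)) separation
```

After a single warm-up step driven by the linear term, each subsequent step pushes a fresh coordinate by roughly η-\eta, so after D+1D+1 updates the gap is close to ηD\eta\sqrt{D}, matching the Ω(min{1,ηT})\Omega(\min\{1,\eta\sqrt{T}\}) bound derived above.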

Appendix E Proof of Theorem 4.3

Proof.

We consider the same function class as in Thm. 4.1, and neighboring datasets S=(0,0,,0)S^{\prime}=(0,0,\ldots,0), S=(1,0,,0)S=(1,0,\ldots,0). We will assume in what follows that D=min{T,1/η2}D=\min\{T,1/\eta^{2}\}, KK is sufficiently large, and ν<ηr/K\nu<\eta\|r\|/K. Let (xt)t[T](x^{t})_{t\in[T]} and (yt)t[T](y^{t})_{t\in[T]} be the trajectories of Algorithm 3 over datasets S,SS,S^{\prime} respectively, both initialized at x1=y1=0x^{1}=y^{1}=0. Let now τ=𝝅1(1)Unif[n]\tau=\bm{\pi}^{-1}(1)\sim\mbox{Unif}[n]. Arguing as in Thm. 4.1, we have that yt=0y^{t}=0 for all tt, whereas

xt+1={0t<τη(1+t/n)rKηs=1tτ(1+t/n)esτtτ+D.x^{t+1}=\left\{\begin{array}[]{ll}0&t<\tau\\ -\frac{\eta(1+\lfloor t/n\rfloor)r}{K}-\eta\sum_{s=1}^{t-\tau-(1+\lfloor t/n\rfloor)}e_{s}&\tau\leq t\leq\tau+D.\end{array}\right.

Later iterations will satisfy xt=1o(1)\|x^{t}\|=1-o(1) if D=1/η2D=1/\eta^{2} (and otherwise the algorithm stops earlier). Therefore, for all t[T]t\in[T],

𝔼𝝅[xtyt]\displaystyle\mathbb{E}_{\bm{\pi}}[\|x^{t}-y^{t}\|] =\displaystyle= s=1n𝔼[xtyt|τ=s][τ=s]\displaystyle\sum_{s=1}^{n}\mathbb{E}[\|x^{t}-y^{t}\|\,|\,\tau=s]\mathbb{P}[\tau=s]
\displaystyle\geq η2ns=1min{t,n}tsηo(1)\displaystyle\frac{\eta}{2n}\sum_{s=1}^{\min\{t,n\}}\sqrt{t-s}-\eta\cdot o(1)
=\displaystyle= {Ω(ηt3/2n) if tnΩ(ηt) if t>n.\displaystyle\left\{\begin{array}[]{ll}\Omega(\frac{\eta t^{3/2}}{n})&\mbox{ if }t\leq n\\ \Omega(\eta\sqrt{t})&\mbox{ if }t>n.\end{array}\right.

Notice that we used above that KK is such that TD/nK=o(1)T\sqrt{D}/nK=o(1). As in Thm. 4.1, we can obtain the same conclusion for x¯T\overline{x}^{T}. The additional lower bound of ηT/n\eta T/n follows from Appendix C, so the result follows. ∎

Appendix F Upper bounds on UAS of SGD when TnT\leq n

Theorem F.1.

Let 𝒳(0,R)\mathcal{X}\subseteq\mathcal{B}(0,R) and =𝒳0(L)\mathcal{F}=\mathcal{F}_{\mathcal{X}}^{0}(L). Suppose TnT\leq n. Then the uniform argument stability of sampling-with-replacement stochastic gradient descent (Algorithm 2) satisfies

𝔼[δ𝒜𝗋𝖲𝖦𝖣]min(2R,3LT1n(t=1T1ηt2+1nt=1T1ηt)).\mathbb{E}\left[\delta_{{\mathcal{A}}_{\sf rSGD}}\right]\leq\min\left(2R,~{}3L\,\frac{T-1}{n}\,\left(\sqrt{\sum_{t=1}^{T-1}\eta_{t}^{2}}+\frac{1}{n}\sum_{t=1}^{T-1}\eta_{t}\right)\right).
Proof.

The bound of 2R2R is obtained directly from the diameter bound on 𝒳\mathcal{X}. Therefore, we focus exclusively on the second term. Let SSS\simeq S^{\prime}, and let k[n]k\in[n] be the entry where both datasets differ. Let (xt)t[T],(yt)t[T](x^{t})_{t\in[T]},(y^{t})_{t\in[T]} be the trajectories of Algorithm 2 on SS and SS^{\prime}, respectively, with x1=y1x^{1}=y^{1}.

Let BtB_{t} denote the event that 𝐢j=k\mathbf{i}_{j}=k for some jtj\leq t; that is, BtB_{t} is the event that the index kk is sampled at least once in the first tt iterations. We note that

[Bt]\displaystyle\underset{}{\mathbb{P}}\left[B_{t}\right] =\displaystyle= 1nj=0t1(11n)j=1(11n)tmin(1,tn)\displaystyle\frac{1}{n}\sum_{j=0}^{t-1}\big{(}1-\frac{1}{n}\big{)}^{j}=1-\big{(}1-\frac{1}{n}\big{)}^{t}\leq\min\left(1,~{}\frac{t}{n}\right)
𝔼[δT]\displaystyle\Longrightarrow\quad\underset{}{\mathbb{E}}\left[\delta_{T}\right] \displaystyle\leq min(1,T1n)𝔼[δT|BT1].\displaystyle\min\left(1,~{}\frac{T-1}{n}\right)\cdot\underset{}{\mathbb{E}}\left[\delta_{T}|~{}B_{T-1}\right]. (10)

For the rest of the proof we bound 𝔼[δT|BT1]\underset{}{\mathbb{E}}\left[\delta_{T}|B_{T-1}\right]. To do this, we derive a recurrence for 𝔼[δt+1|Bt]\underset{}{\mathbb{E}}\left[\delta_{t+1}|B_{t}\right]. Note that BtB_{t} is the union of two mutually exclusive events: {𝐢t=k}Bt1¯\{\mathbf{i}_{t}=k\}\cap\overline{B_{t-1}} and Bt1B_{t-1}, where Bt1¯\overline{B_{t-1}} is the complement of Bt1B_{t-1}, i.e., the event “index kk is never sampled in the first t1t-1 iterations.” Hence,

𝔼[δt+12|Bt]\displaystyle\underset{}{\mathbb{E}}\left[\delta^{2}_{t+1}~{}|~{}B_{t}\right] =\displaystyle= [𝐢t=k,Bt1¯|Bt]𝔼[δt+12|𝐢t=k,Bt1¯]+[Bt1|Bt]𝔼[δt+12|Bt1]\displaystyle\underset{}{\mathbb{P}}\left[\mathbf{i}_{t}=k,~{}\overline{B_{t-1}}~{}|~{}B_{t}\right]\underset{}{\mathbb{E}}\left[\delta^{2}_{t+1}~{}|~{}\mathbf{i}_{t}=k,~{}\overline{B_{t-1}}\right]+\underset{}{\mathbb{P}}\left[B_{t-1}~{}|~{}B_{t}\right]\underset{}{\mathbb{E}}\left[\delta^{2}_{t+1}~{}|~{}B_{t-1}\right] (11)
\displaystyle\leq 𝔼[δt+12|𝐢t=k,Bt1¯]+𝔼[δt+12|Bt1].\displaystyle\underset{}{\mathbb{E}}\left[\delta^{2}_{t+1}~{}|~{}\mathbf{i}_{t}=k,~{}\overline{B_{t-1}}\right]+\underset{}{\mathbb{E}}\left[\delta^{2}_{t+1}~{}|~{}B_{t-1}\right].

Now, conditioned on the past sampled coordinates 𝐢1,,𝐢t1\mathbf{i}_{1},\ldots,\mathbf{i}_{t-1}, we have

\begin{align*}
\delta_{t+1} &= \|\mathsf{Proj}_{\mathcal{X}}[x^t-\eta_t\nabla f(x^t,z_{\mathbf{i}_t})] - \mathsf{Proj}_{\mathcal{X}}[y^t-\eta_t\nabla f(y^t,z'_{\mathbf{i}_t})]\|\\
&\leq \|x^t-y^t-\eta_t[\nabla f(x^t,z_{\mathbf{i}_t})-\nabla f(y^t,z'_{\mathbf{i}_t})]\|\\
&\leq \mathbf{1}_{\{\mathbf{i}_t=k\}}(\delta_t+2L\eta_t) + \mathbf{1}_{\{\mathbf{i}_t\neq k\}}\sqrt{\delta_t^2+4L^2\eta_t^2},
\end{align*}

where the last inequality follows from convexity and Lipschitzness of the objective. Squaring, we get

\[
\delta_{t+1}^2 \leq \mathbf{1}_{\{\mathbf{i}_t=k\}}(\delta_t+2L\eta_t)^2 + \mathbf{1}_{\{\mathbf{i}_t\neq k\}}(\delta_t^2+4L^2\eta_t^2) = \delta_t^2+4\eta_t^2L^2+\mathbf{1}_{\{\mathbf{i}_t=k\}}4L\eta_t\delta_t.
\]

From this formula we derive bounds for the two conditional expectations:

\begin{align*}
\mathbb{E}[\delta_{t+1}^2 \mid B_{t-1}] &\leq \mathbb{E}[\delta_t^2 \mid B_{t-1}] + 4L^2\eta_t^2 + \frac{4L}{n}\eta_t\,\mathbb{E}[\delta_t \mid B_{t-1}] \tag{12}\\
\mathbb{E}[\delta^2_{t+1} \mid \mathbf{i}_t=k,\,\overline{B_{t-1}}] &\leq 4L^2\eta_t^2, \tag{13}
\end{align*}

where (12) holds by independence of $\mathbf{i}_t$ and $B_{t-1}$, and in (13) we used that $\delta_t=0$ conditioned on $\overline{B_{t-1}}$.

Combining (12) and (13) in (11), we get

\begin{align*}
\mathbb{E}[\delta_{t+1}^2 \mid B_t] &\leq \mathbb{E}[\delta_t^2 \mid B_{t-1}] + 8L^2\eta_t^2 + \frac{4L}{n}\eta_t\,\mathbb{E}[\delta_t \mid B_{t-1}]\\
\Longrightarrow\quad \mathbb{E}[\delta_T^2 \mid B_{T-1}] &\leq 8L^2\sum_{t=1}^{T-1}\eta_t^2 + \frac{4L}{n}\sum_{t=1}^{T-1}\eta_t\,\mathbb{E}[\delta_t \mid B_{t-1}].
\end{align*}

With this last bound we can proceed inductively to show that

\[
\mathbb{E}[\delta_T \mid B_{T-1}] \leq \sqrt{8}\,L\sqrt{\sum_{t=1}^{T-1}\eta_t^2} + \frac{2L}{n}\sum_{t=1}^{T-1}\eta_t.
\]

The base case, $T=0$, is evident, and the inductive step splits into two cases: the case $\mathbb{E}[\delta_T \mid B_{T-1}] \leq \max_{t\in[T-1]}\mathbb{E}[\delta_t \mid B_{t-1}]$, which follows directly from the induction hypothesis; and the case $\mathbb{E}[\delta_T \mid B_{T-1}] > \max_{t\in[T-1]}\mathbb{E}[\delta_t \mid B_{t-1}]$, for which

\[
\mathbb{E}[\delta_T^2 \mid B_{T-1}] \leq 8L^2\sum_{t=1}^{T-1}\eta_t^2 + \frac{4L}{n}\sum_{t=1}^{T-1}\eta_t\,\mathbb{E}[\delta_t \mid B_{t-1}] \,\leq\, 8L^2\sum_{t=1}^{T-1}\eta_t^2 + \frac{4L}{n}\sum_{t=1}^{T-1}\eta_t\,\mathbb{E}[\delta_T \mid B_{T-1}].
\]

Then, completing the square,

\[
\mathbb{E}_{\mathbf{i}}\Big[\Big(\delta_T-\frac{2L}{n}\sum_{t=1}^{T-1}\eta_t\Big)^2 \,\Big|\, B_{T-1}\Big] \leq 8L^2\sum_{t=1}^{T-1}\eta_t^2 + \Big(\frac{2L}{n}\sum_{t=1}^{T-1}\eta_t\Big)^2,
\]

and by Jensen's inequality,

\[
\mathbb{E}_{\mathbf{i}}\Big[\delta_T-\frac{2L}{n}\sum_{t=1}^{T-1}\eta_t \,\Big|\, B_{T-1}\Big] \leq \sqrt{8L^2\sum_{t=1}^{T-1}\eta_t^2+\Big(\frac{2L}{n}\sum_{t=1}^{T-1}\eta_t\Big)^2} \,\leq\, \sqrt{8}\,L\sqrt{\sum_{t=1}^{T-1}\eta_t^2}+\frac{2L}{n}\sum_{t=1}^{T-1}\eta_t,
\]

proving the inductive step. Finally, putting this together with (10) completes the proof. ∎
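To illustrate the flavor of this stability bound, the following self-contained Python sketch simulates two sampling-with-replacement SGD trajectories on neighboring datasets for the nonsmooth loss $f(x,z)=L|x-z|$ in one dimension. This is our own illustrative instance (the loss, dimension, and all parameter values are assumptions for the demo, not taken from the paper); it checks that the averaged final distance stays below $\min\{2R,\text{bound}\}$.

```python
import math, random

# Minimal 1-D simulation sketch (an illustration, not the paper's Algorithm 2
# verbatim): two SGD runs on neighboring datasets S, S' differing only in
# entry k = 0, with f(x, z) = L*|x - z| and projection onto X = [-R, R].
L, R, n, T = 1.0, 1.0, 50, 200
eta = R / (L * math.sqrt(T))  # an illustrative constant step size
proj = lambda x: max(-R, min(R, x))
subgrad = lambda x, z: L * (1.0 if x > z else -1.0 if x < z else 0.0)

rng = random.Random(0)
S = [rng.uniform(-R, R) for _ in range(n)]
Sp = list(S)
Sp[0] = rng.uniform(-R, R)  # neighboring dataset S'

deltas = []
for _ in range(100):  # average the final distance delta_T over repeated runs
    x = y = 0.0
    for _ in range(T):
        i = rng.randrange(n)  # same with-replacement index for both runs
        x = proj(x - eta * subgrad(x, S[i]))
        y = proj(y - eta * subgrad(y, Sp[i]))
    deltas.append(abs(x - y))
avg_delta = sum(deltas) / len(deltas)

# Bound from the proof, for constant step size (using T in place of T - 1):
bound = min(1.0, T / n) * (math.sqrt(8) * L * eta * math.sqrt(T)
                           + 2 * L * eta * T / n)
assert 0.0 <= avg_delta <= min(2 * R, bound) + 1e-12
```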

Theorem F.2.

Let $\mathcal{X}\subseteq\mathcal{B}(0,R)$, $\mathcal{F}=\mathcal{F}_{\mathcal{X}}^{0}(L)$, let $\bm{\pi}$ be a uniformly random permutation over $[n]$, and let $(\eta_t)_{t\in[T]}$ be a non-increasing sequence with $T\leq n$. Then the fixed-permutation stochastic gradient descent (Algorithm 3) satisfies uniform argument stability
\[
\mathbb{E}[\delta_{\mathcal{A}_{\sf PerSGD}}] \leq \min\Big\{2R,\ \sqrt{2}\,L\,\frac{T-1}{n}\sqrt{\sum_{t=1}^{T-1}\eta_t^2}\Big\}.
\]

Notice that the UAS bounds for Algorithms 2 and 3 are of the same order. The proof is in Appendix B.

Appendix G Generalization in Stochastic Optimization with Data Reuse

G.0.1 Risk Bounds for Sampling-with-Replacement Stochastic Gradient Descent

Theorem G.1.

Let $\mathcal{X}\subseteq\mathcal{B}(0,R)$ and $\mathcal{F}=\mathcal{F}_{\mathcal{X}}^{0}(L)$. The sampling-with-replacement stochastic gradient descent (Algorithm 2) with $T=n^2$ iterations and $\eta=\frac{R}{Ln^{3/2}}$ satisfies, for any $12\exp\{-n^2/32\}<\theta<1$,

\[
\mathbb{P}\Big[\varepsilon_{\mathrm{risk}}({\mathcal{A}}_{\sf rSGD}) > \frac{cLR}{\sqrt{n}}\log(n)\log\Big(\frac{n}{\theta}\Big)\Big] \leq \theta,
\]
where $c>0$ is an absolute constant.

It should be noted that, similarly to Remark 6.3, if we are only interested in bounds on the expected risk, the polylogarithmic factor above can be shaved off, which yields the optimal expected excess risk.

Proof.

Let $\mathbf{S}\sim\mathcal{D}^n$ be an i.i.d. random sample for the stochastic convex program, and run the algorithm $\mathcal{A}_{\sf rSGD}$ on these data with constant step size $\eta>0$ and $T$ iterations.

We consider $\theta>0$ such that $\theta>12\exp\{-T/32\}$. Notice that the sampling-with-replacement stochastic gradient is a bounded first-order stochastic oracle for the empirical objective, and it is direct to verify that the assumptions of Lemma H.1 are satisfied with $\sigma=0$. Hence, by Lemma H.1, we have that, with probability at least $1-\theta/3$,

\[
\varepsilon_{\mathrm{opt}}({\mathcal{A}}_{\sf rSGD}) \leq O\Big(LR\sqrt{\frac{2\log(12/\theta)}{T}} + \frac{R^2}{\eta T} + \eta L^2\Big).
\]

On the other hand, Theorem 3.3 together with Theorem 2.2 guarantees that, with probability at least $1-\theta/3$, we have

\[
|\varepsilon_{\mathrm{gen}}({\mathcal{A}}_{\sf rSGD})| \leq O\Big(L^2\Big[\sqrt{T}\eta+\frac{T\eta}{n}\Big]\log n\log(6n/\theta) + LR\sqrt{\frac{\log(6/\theta)}{n}}\Big).
\]

Finally, Lemma 2.1 ensures that, with probability at least $1-\theta/3$,

\[
\varepsilon_{\mathrm{approx}} \leq LR\sqrt{\frac{2\log(3/\theta)}{n}}.
\]

By the union bound and the excess risk decomposition (3), we have that, with probability at least $1-\theta$,

\begin{align*}
\varepsilon_{\mathrm{risk}}({\mathcal{A}}) &= O\Big(LR\sqrt{\frac{\log(1/\theta)}{T}} + \frac{R^2}{\eta T} + \eta L^2 + L^2\eta\Big(\sqrt{T}+\frac{T}{n}\Big)\log(n)\log\Big(\frac{6n}{\theta}\Big) + LR\sqrt{\frac{\log(6/\theta)}{n}} + LR\sqrt{\frac{\log(3/\theta)}{n}}\Big)\\
&= O\Big(\frac{LR}{\sqrt{n}}\log(n)\log\Big(\frac{n}{\theta}\Big)\Big),
\end{align*}

where only in the last step did we substitute the choice of step size and number of iterations from the statement. ∎
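The last substitution can be verified term by term. The sketch below, with arbitrary illustrative constants $L$ and $R$ of our choosing, checks that with $T=n^2$ and $\eta=R/(Ln^{3/2})$ every non-logarithmic term in the decomposition is at most $LR/\sqrt{n}$.

```python
import math

# Sanity check (not part of the proof): with T = n^2 and eta = R/(L*n^{3/2}),
# each non-logarithmic term in the risk decomposition is at most L*R/sqrt(n).
L, R = 2.0, 3.0  # arbitrary illustrative constants
for n in (10, 100, 1000):
    T = n ** 2
    eta = R / (L * n ** 1.5)
    target = L * R / math.sqrt(n)
    terms = {
        "LR/sqrt(T)": L * R / math.sqrt(T),            # = LR/n
        "R^2/(eta*T)": R ** 2 / (eta * T),             # = LR/sqrt(n) exactly
        "eta*L^2": eta * L ** 2,                       # = LR/n^{3/2}
        "L^2*eta*sqrt(T)": L ** 2 * eta * math.sqrt(T),# = LR/sqrt(n) exactly
        "L^2*eta*T/n": L ** 2 * eta * T / n,           # = LR/sqrt(n) exactly
    }
    for name, value in terms.items():
        assert value <= target * (1 + 1e-9), name
```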

G.0.2 Risk Bounds for Fixed-Permutation Stochastic Gradient Descent

As a final application, we provide a population risk bound based on the UAS of Algorithm 3. As in the case of sampling-with-replacement SGD, we need an optimization error analysis, which for completeness is provided in Appendix I; it is based on the analysis of the incremental subgradient method (Nedic and Bertsekas, 2001).

Interestingly, combining the incremental-method analysis, which holds for an arbitrary permutation (Nedic and Bertsekas, 2001), with our novel stability bounds, which also hold for an arbitrary permutation, yields generalization bounds for fixed-permutation SGD without the need for reshuffling, or indeed any form of randomization. We believe this could be of independent interest.

Theorem G.2.

Algorithm 3 with constant step size $\eta_k\equiv\eta=R/[Ln\sqrt{K}]$ and $K=n$ epochs satisfies, for every $0<\theta<1$,

\[
\mathbb{P}\Big[\varepsilon_{\mathrm{risk}}({\mathcal{A}}_{\sf PerSGD}) > \frac{cLR}{\sqrt{n}}\log n\log\Big(\frac{n}{\theta}\Big)\Big] \leq \theta,
\]

where $c>0$ is an absolute constant.

As in the previous result, we can remove the polylogarithmic factor if we are only interested in expected excess risk guarantees.

Proof.

By Corollary I.2,

\[
\varepsilon_{\mathrm{opt}}({\mathcal{A}}_{\sf PerSGD}) \leq \frac{R^2}{nK\eta} + \frac{L^2(n+2)\eta}{2} = O\Big(\frac{LR}{\sqrt{n}}\Big),
\]

by our choice of $K$ and $\eta$. On the other hand, Theorem 3.4 guarantees that the algorithm is $\delta$-UAS with probability 1, where $\delta=O(R/\sqrt{n})$. Therefore, by Theorem 2.2, we have that, with probability at least $1-\theta/2$,

\[
|\varepsilon_{\mathrm{gen}}({\mathcal{A}}_{\sf PerSGD})| \leq c\Big(\frac{LR}{\sqrt{n}}\log n\log(2n/\theta) + LR\sqrt{\frac{\log(2/\theta)}{n}}\Big).
\]

Finally, Lemma 2.1 ensures that, with probability at least $1-\theta/2$,

\[
\varepsilon_{\mathrm{approx}} \leq LR\sqrt{\frac{2\log(2/\theta)}{n}}.
\]

By the union bound and the excess risk decomposition (3), we have that, with probability at least $1-\theta$,

\begin{align*}
\varepsilon_{\mathrm{risk}}({\mathcal{A}}_{\sf PerSGD}) &\leq \varepsilon_{\mathrm{opt}}({\mathcal{A}}_{\sf PerSGD}) + \varepsilon_{\mathrm{gen}}({\mathcal{A}}_{\sf PerSGD}) + \varepsilon_{\mathrm{approx}}\\
&= O\Big(\frac{LR}{\sqrt{n}} + \frac{LR}{\sqrt{n}}\log n\log(n/\theta) + LR\sqrt{\frac{\log(1/\theta)}{n}} + LR\sqrt{\frac{\log(2/\theta)}{n}}\Big)\\
&= O\Big(\frac{LR}{\sqrt{n}}\log n\log\Big(\frac{n}{\theta}\Big)\Big). \qquad ∎
\end{align*}
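The choice of $K$ and $\eta$ can likewise be checked numerically; the sketch below, with arbitrary illustrative constants of our choosing, verifies that both terms of the Corollary I.2 bound are $O(LR/\sqrt{n})$ under $\eta=R/[Ln\sqrt{K}]$ and $K=n$.

```python
import math

# Numeric check (illustrative, not part of the proof): with eta = R/(L*n*sqrt(K))
# and K = n epochs, both terms of the Corollary I.2 bound are O(L*R/sqrt(n)).
L, R = 1.5, 2.0  # arbitrary illustrative constants
for n in (10, 100, 10000):
    K = n
    eta = R / (L * n * math.sqrt(K))
    term1 = R ** 2 / (n * K * eta)      # = LR/sqrt(K) = LR/sqrt(n) exactly
    term2 = L ** 2 * (n + 2) * eta / 2  # = LR/sqrt(n) * (n+2)/(2n)
    target = L * R / math.sqrt(n)
    assert abs(term1 - target) <= 1e-9 * target
    assert term2 <= target              # since (n+2)/(2n) <= 1 for n >= 2
```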

Appendix H High-probability Bound on Optimization Error of SGD with Noisy Gradient Oracle

It is known that the standard online-to-batch conversion technique can be used to obtain a high-probability bound on the optimization error (i.e., the excess empirical risk) of stochastic gradient descent. For the sake of completeness and to make the paper more self-contained, we present this technique here for stochastic gradient descent with a noisy gradient oracle. This is done in the following lemma, which is used in the proofs of our results in Section 6.

Lemma H.1 (Optimization error of SGD with noisy gradient oracle).

Let $S=(z_1,\ldots,z_n)\in\mathcal{Z}^n$ be a dataset. Let $F_S(x)=\frac{1}{n}\sum_{i\in[n]}f(x,z_i)$ be the empirical risk associated with $S$, where for every $z\in\mathcal{Z}$, $f(\cdot,z)$ is a convex, $L$-Lipschitz function over $\mathcal{X}\subseteq\mathcal{B}(0,R)$ for some $L,R>0$. Consider the stochastic (sub)gradient method

\[
x^{t+1} = x^t - \eta\cdot\mathbf{g}(x^t,\xi_t) \qquad (\forall t=0,\ldots,T-1),
\]

with output $\overline{x}^T=\frac{1}{T}\sum_{t\in[T]}x^t$, where $\xi_1,\ldots,\xi_T$ are drawn uniformly from $S$ with replacement, and for all $z\in\mathcal{Z}$, $\mathbf{g}(\cdot,z):\mathcal{X}\rightarrow\mathbb{R}^d$ is a random map (referred to as a noisy gradient oracle) that satisfies the following conditions:

  1. Unbiasedness: For every $x\in\mathcal{X}$, $z\in\mathcal{Z}$, $\mathbb{E}[\mathbf{g}(x,z)]=\nabla f(x,z)$, where the expectation is taken over the randomness of the gradient oracle $\mathbf{g}(\cdot,z)$.

  2. Sub-Gaussian gradient noise: There is $\sigma^2\geq 0$ such that for every $x\in\mathcal{X}$, $z\in\mathcal{Z}$, $\mathbf{g}(x,z)-\nabla f(x,z)$ is a $\sigma^2$-sub-Gaussian random vector; that is, $\langle\mathbf{g}(x,z)-\nabla f(x,z),\,u\rangle$ is a $\sigma^2$-sub-Gaussian random variable for every $u\in\mathcal{B}(0,1)$.

  3. Independence of the gradient noise across iterations: Conditioned on any fixed realization of $(\xi_t)_{t\in[T]}$, the sequence of random maps $\mathbf{g}(\cdot,\xi_1),\ldots,\mathbf{g}(\cdot,\xi_T)$ is independent. (Here, the randomness comes only from the gradient oracle.)

Then, for any $\theta\in(4e^{-T/32},1)$, with probability at least $1-\theta$, the optimization error (i.e., the excess empirical risk) of this method is bounded as

\[
\varepsilon_{\mathrm{opt}} \leq \big(LR+\sigma(2R+\eta L)\big)\sqrt{\frac{2\log(4/\theta)}{T}} + \frac{R^2}{2\eta T} + \eta\Big(\frac{L^2}{2}+d\sigma^2\Big).
\]
Proof.

Let $x^\ast_S\in\arg\min_{x\in\mathcal{X}}F_S(x)$. By convexity of the empirical loss, we have

\begin{align*}
F_S(\overline{x}^T)-F_S(x^\ast_S) &\leq \frac{1}{T}\sum_{t\in[T]}F_S(x^t)-F_S(x^\ast_S)\\
&= \frac{1}{T}\sum_{t\in[T]}[F_S(x^t)-f(x^t,\xi_t)] + \frac{1}{T}\sum_{t\in[T]}[f(x^\ast_S,\xi_t)-F_S(x^\ast_S)] + \frac{1}{T}\sum_{t\in[T]}[f(x^t,\xi_t)-f(x^\ast_S,\xi_t)]. \tag{14}
\end{align*}

Since $(\xi_t)_{t\in[T]}$ are sampled uniformly with replacement from $S$, we have

\[
\underset{\xi_t}{\mathbb{E}}\big[f(x^t,\xi_t) \,\big|\, x^1,\ldots,x^{t-1},\,x^t=v\big] = F_S(v),
\]

for all $v\in\mathcal{X}$, $t\in[T]$. Moreover, since the range of $f$ lies in $[-LR,LR]$, it follows that $Y_t:=\sum_{j=1}^{t}[F_S(x^j)-f(x^j,\xi_j)]$, $t\in[T]$, is a martingale with bounded differences (namely, bounded by $2LR$). Therefore, by Azuma's inequality, the first term in (14) satisfies

\[
\mathbb{P}\Bigg[\frac{1}{T}\sum_{t\in[T]}[F_S(x^t)-f(x^t,\xi_t)] > LR\sqrt{\frac{2\log\frac{4}{\theta}}{T}}\Bigg] \leq \frac{\theta}{4}. \tag{15}
\]

By Hoeffding’s inequality, the second term in (14) also satisfies the same bound; namely,

\[
\mathbb{P}\Bigg[\frac{1}{T}\sum_{t\in[T]}[f(x^\ast_S,\xi_t)-F_S(x^\ast_S)] > LR\sqrt{\frac{2\log\frac{4}{\theta}}{T}}\Bigg] \leq \frac{\theta}{4}. \tag{16}
\]

Using an analysis similar to that of standard online gradient descent (Zinkevich, 2003), the last term in (14) can be bounded as

\begin{align*}
\frac{1}{T}\sum_{t\in[T]}[f(x^t,\xi_t)-f(x^\ast_S,\xi_t)] &\leq \frac{R^2}{2T\eta} + \frac{1}{T}\sum_{t\in[T]}\langle\nabla f(x^t,\xi_t)-\mathbf{g}(x^t,\xi_t),\,x^t-x^\ast_S\rangle + \frac{\eta}{2T}\sum_{t\in[T]}\|\mathbf{g}(x^t,\xi_t)\|^2\\
&= \frac{R^2}{2T\eta} + \frac{1}{T}\sum_{t\in[T]}\langle\nabla f(x^t,\xi_t)-\mathbf{g}(x^t,\xi_t),\,x^t-x^\ast_S-\eta\nabla f(x^t,\xi_t)\rangle\\
&\quad + \frac{\eta}{2T}\sum_{t\in[T]}\|\mathbf{g}(x^t,\xi_t)-\nabla f(x^t,\xi_t)\|^2 + \frac{\eta}{2T}\sum_{t\in[T]}\|\nabla f(x^t,\xi_t)\|^2\\
&\leq \frac{R^2}{2T\eta} + \frac{\eta L^2}{2} + \frac{1}{T}\sum_{t\in[T]}\langle\nabla f(x^t,\xi_t)-\mathbf{g}(x^t,\xi_t),\,x^t-x^\ast_S-\eta\nabla f(x^t,\xi_t)\rangle\\
&\quad + \frac{\eta}{2T}\sum_{t\in[T]}\|\mathbf{g}(x^t,\xi_t)-\nabla f(x^t,\xi_t)\|^2. \tag{17}
\end{align*}

By the properties of the gradient oracle stated in the lemma, we can see that for any fixed realization of $(x^t,\xi_t)_{t\in[T]}$, the second term in (17) is a $(2R+\eta L)^2\frac{\sigma^2}{T}$-sub-Gaussian random variable. Hence,

\[
\mathbb{P}\Bigg[\frac{1}{T}\sum_{t\in[T]}\langle\nabla f(x^t,\xi_t)-\mathbf{g}(x^t,\xi_t),\,x^t-x^\ast_S-\eta\nabla f(x^t,\xi_t)\rangle > (2R+\eta L)\sigma\sqrt{\frac{2\log(4/\theta)}{T}}\Bigg] \leq \frac{\theta}{4}. \tag{18}
\]

Let $U_t:=\|\mathbf{g}(x^t,\xi_t)-\nabla f(x^t,\xi_t)\|^2$, $t\in[T]$. Note that $\mathbb{E}[U_t]\leq d\sigma^2$. Moreover, observe (e.g., by (Rigollet, 2015, Lemma 1.12)) that for any fixed realization of $x^t,\xi_t$, $V_t:=U_t-\mathbb{E}[U_t]$ is a sub-exponential random variable with parameter $16d\sigma^2$; namely, $\mathbb{E}[\exp(\lambda V_t)]\leq\exp(128\lambda^2\sigma^4 d^2)$ for all $\lambda\leq\frac{1}{16\sigma^2 d}$. Hence, by a standard concentration argument (e.g., Bernstein's inequality), we have

\[
\mathbb{P}\Bigg[\frac{\eta}{2T}\sum_{t\in[T]}\|\mathbf{g}(x^t,\xi_t)-\nabla f(x^t,\xi_t)\|^2 > \frac{\eta}{2}d\sigma^2 + 16\eta d\sigma^2\,\frac{\log(4/\theta)}{T}\Bigg] \leq \frac{\theta}{4}. \tag{19}
\]

Putting (18) and (19) together, and noticing that $T>32\log(4/\theta)$, we conclude that, with probability at least $1-\theta/2$, the third term of (14) is bounded as

\[
\frac{1}{T}\sum_{t\in[T]}[f(x^t,\xi_t)-f(x^\ast_S,\xi_t)] \leq \frac{R^2}{2T\eta} + \frac{\eta L^2}{2} + \eta\sigma^2 d + (2R+\eta L)\sigma\sqrt{\frac{2\log(4/\theta)}{T}}.
\]

Hence, by the union bound, we conclude that, with probability at least $1-\theta$, the excess empirical risk of the stochastic subgradient method is bounded as

\[
\varepsilon_{\mathrm{opt}} \leq \big(LR+\sigma(2R+\eta L)\big)\sqrt{\frac{2\log(4/\theta)}{T}} + \frac{R^2}{2\eta T} + \eta\Big(\frac{L^2}{2}+d\sigma^2\Big). \qquad ∎
\]
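A minimal one-dimensional sketch of the stochastic subgradient method with a noisy gradient oracle may help make the setting concrete. Here $f(x,z)=|x-z|$, so $F_S$ is minimized at a median of $S$, and the oracle adds Gaussian noise (which is sub-Gaussian) to a subgradient; the instance, parameters, and the loose sanity threshold are our own illustrative choices, not the lemma's constants.

```python
import random

# Minimal 1-D sketch of the noisy-oracle stochastic subgradient method of
# Lemma H.1 (illustrative instance of our choosing): f(x, z) = |x - z|, so
# F_S is minimized at a median of S; the oracle adds Gaussian noise.
rng = random.Random(0)
n, T, eta, sigma, R = 50, 5000, 0.02, 0.05, 1.0
S = sorted(rng.uniform(-R, R) for _ in range(n))
F = lambda x: sum(abs(x - z) for z in S) / n
F_star = F(S[n // 2])  # F attains its minimum at a median of S

x, avg = 0.0, 0.0
for _ in range(T):
    z = S[rng.randrange(n)]                                  # with-replacement sample
    g = (1.0 if x > z else -1.0 if x < z else 0.0) + rng.gauss(0.0, sigma)
    x = max(-R, min(R, x - eta * g))                         # projected step
    avg += x / T                                             # averaged iterate
eps_opt = F(avg) - F_star                                    # excess empirical risk

assert 0.0 <= eps_opt < 0.2  # loose sanity threshold for this illustrative run
```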

Appendix I Empirical Risk of Fixed-Permutation SGD

Our optimization error analysis is based on (Nedic and Bertsekas, 2001, Lemma 2.1).

Lemma I.1.

Consider the fixed-permutation stochastic gradient descent (Algorithm 3) with an arbitrary (i.e., not necessarily random) permutation and constant step size within each epoch (i.e., $\eta_{(k-1)n+t}\equiv\eta_k$ for all $t\in[n]$, $k\in[K]$). Then

\[
\eta_k[F_S(x^k)-F_S(y)] \leq \frac{1}{2n}\big[\|x^k-y\|^2-\|x^{k+1}-y\|^2\big] + \frac{\eta_k^2 L^2(n+2)}{2} \qquad (\forall y\in\mathcal{X}).
\]
Proof.

First, since the permutation is arbitrary, we may assume w.l.o.g. that $\bm{\pi}$ is the identity (a choice made purely for notational convenience). Let $y\in\mathcal{X}$. At each round, the recursion of SGD implies that

\begin{align*}
\|x_{t+1}^k-y\|^2 &= \|\mathsf{Proj}_{\mathcal{X}}[x_t^k-\eta_k\nabla f(x_t^k,z_t)]-\mathsf{Proj}_{\mathcal{X}}(y)\|^2\\
&\leq \|x_t^k-\eta_k\nabla f(x_t^k,z_t)-y\|^2\\
&\leq \|x_t^k-y\|^2 + \eta_k^2 L^2 - 2\eta_k\langle\nabla f(x_t^k,z_t),\,x_t^k-y\rangle\\
&\leq \|x_t^k-y\|^2 + \eta_k^2 L^2 - 2\eta_k[f(x_t^k,z_t)-f(y,z_t)],
\end{align*}
where the third line uses $\|\nabla f(x_t^k,z_t)\|\leq L$ and the last line uses convexity.

Let $r_t:=\|x_t^k-y\|$. Summing these inequalities over $t=1,\ldots,n$ gives

\begin{align*}
r_{n+1}^2-r_1^2 &\leq nL^2\eta_k^2 + 2\eta_k\sum_{t=1}^{n}[f(x^k,z_t)-f(x_t^k,z_t)] - 2\eta_k n[F_S(x^k)-F_S(y)]\\
&\leq nL^2\eta_k^2 + 2\eta_k\sum_{t=1}^{n}L\|x^k-x_t^k\| - 2\eta_k n[F_S(x^k)-F_S(y)]\\
&\leq nL^2\eta_k^2 + 2\eta_k^2 L^2\sum_{t=1}^{n}t - 2\eta_k n[F_S(x^k)-F_S(y)]\\
&= \eta_k^2 L^2 n + \eta_k^2 L^2 n(n+1) - 2\eta_k n[F_S(x^k)-F_S(y)].
\end{align*}

Rearranging terms, we obtain the result. ∎
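The per-epoch inequality can be checked numerically on a toy instance; the sketch below uses $f(x,z)=|x-z|$ in one dimension (so $L=1$), the identity permutation, and a single epoch, all of which are our own illustrative choices.

```python
import random

# Numeric check (illustrative): the per-epoch inequality of Lemma I.1 in one
# dimension with f(x, z) = |x - z| (so L = 1), identity permutation, one epoch.
rng = random.Random(1)
n, L, R, eta_k = 20, 1.0, 1.0, 0.01
S = [rng.uniform(-R, R) for _ in range(n)]
F = lambda x: sum(abs(x - z) for z in S) / n
proj = lambda x: max(-R, min(R, x))

x_start = 0.3  # x^k, the iterate at the start of the epoch
x = x_start
for z in S:    # one pass over the data in the fixed (identity) order
    g = 1.0 if x > z else -1.0 if x < z else 0.0   # subgradient of f(., z)
    x = proj(x - eta_k * g)
x_end = x      # x^{k+1}

for y in (-0.9, -0.2, 0.0, 0.5, 0.95):
    lhs = eta_k * (F(x_start) - F(y))
    rhs = ((x_start - y) ** 2 - (x_end - y) ** 2) / (2 * n) \
          + eta_k ** 2 * L ** 2 * (n + 2) / 2
    assert lhs <= rhs + 1e-12  # the inequality of Lemma I.1
```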

Using the previous lemma, it is straightforward to derive the optimization accuracy of the method.

Corollary I.2.

The fixed-permutation stochastic gradient descent (Algorithm 3), with an arbitrary (i.e., not necessarily random) permutation and constant step size within each epoch (i.e., $\eta_{(k-1)n+t}\equiv\eta_k$ for all $t\in[n]$, $k\in[K]$), satisfies

\[
\varepsilon_{\mathrm{opt}} \leq \frac{\|x^1-x^\ast(S)\|^2}{2n\sum_k\eta_k} + \frac{L^2(n+2)}{2}\cdot\frac{\sum_k\eta_k^2}{\sum_k\eta_k}.
\]
Proof.

By convexity and Lemma I.1, we have

\begin{align*}
F_S(\overline{x}^K)-F_S(x^\ast(S)) &\leq \frac{1}{\sum_{k=1}^{K}\eta_k}\sum_{k=1}^{K}\eta_k[F_S(x^k)-F_S(x^\ast(S))]\\
&\leq \frac{1}{\sum_{k=1}^{K}\eta_k}\Big[\frac{1}{2n}\|x^1-x^\ast(S)\|^2 + \frac{L^2(n+2)}{2}\sum_{k=1}^{K}\eta_k^2\Big],
\end{align*}

which proves the result. ∎