
Uniform-in-Time Wasserstein Stability Bounds
for (Noisy) Stochastic Gradient Descent

Lingjiong Zhu
Department of Mathematics, Florida State University, Tallahassee, FL, USA

Mert Gürbüzbalaban
Department of Management Science and Information Systems, Rutgers Business School, Piscataway, NJ, USA,
and Center for Statistics and Machine Learning, Princeton University, Princeton, NJ, USA

Anant Raj
Coordinated Science Laboratory, University of Illinois Urbana-Champaign, IL, USA,
and Inria, Ecole Normale Supérieure, PSL Research University, Paris, France

Umut Şimşekli
Inria, CNRS, Ecole Normale Supérieure, PSL Research University, Paris, France
Abstract

Algorithmic stability is an important notion that has proven powerful for deriving generalization bounds for practical algorithms. The last decade has witnessed an increasing number of stability bounds for different algorithms applied on different classes of loss functions. While these bounds have illuminated various properties of optimization algorithms, the analysis of each case typically required a different proof technique with significantly different mathematical tools. In this study, we make a novel connection between learning theory and applied probability and introduce a unified guideline for proving Wasserstein stability bounds for stochastic optimization algorithms. We illustrate our approach on stochastic gradient descent (SGD) and we obtain time-uniform stability bounds (i.e., the bound does not increase with the number of iterations) for strongly convex losses and non-convex losses with additive noise, where we recover similar results to the prior art or extend them to more general cases by using a single proof technique. Our approach is flexible and can be generalized to other popular optimizers, as it mainly requires developing Lyapunov functions, which are often readily available in the literature. It also illustrates that ergodicity is an important component for obtaining time-uniform bounds, which might not be achieved for convex or non-convex losses unless additional noise is injected into the iterates. Finally, we slightly stretch our analysis technique and prove time-uniform bounds for SGD under convex and non-convex losses (without additional additive noise), which, to our knowledge, is novel.

Keywords: Stochastic gradient descent, algorithmic stability, Wasserstein perturbation.

1 Introduction

With the development of modern machine learning applications, understanding the generalization properties of stochastic gradient descent (SGD) has become a major challenge in statistical learning theory. In this context, the main goal is to obtain computable upper bounds on the population risk associated with the output of the SGD algorithm, defined as follows: F(\theta):=\mathbb{E}_{x\sim\mathcal{D}}[f(\theta,x)], where x\in\mathcal{X} denotes a random data point, \mathcal{D} is the (unknown) data distribution defined on the data space \mathcal{X}, \theta denotes the parameter vector, and f:\mathbb{R}^{d}\times\mathcal{X}\to\mathbb{R} is an instantaneous loss function.

In a practical setting, directly minimizing F(\theta) is typically not possible as \mathcal{D} is unknown; yet one typically has access to a finite data set X_{n}=\{x_{1},\ldots,x_{n}\}\in\mathcal{X}^{n}, where we assume each x_{i} is independent and identically distributed (i.i.d.) with the common distribution \mathcal{D}. Hence, given X_{n}, one can then attempt to minimize the empirical risk \hat{F}(\theta,X_{n}):=\frac{1}{n}\sum_{i=1}^{n}f(\theta,x_{i}) as a proxy for F(\theta). In this setting, SGD has been one of the most popular optimization algorithms for minimizing \hat{F}(\theta,X_{n}) and is based on the following recursion:

\theta_{k}=\theta_{k-1}-\eta\tilde{\nabla}\hat{F}_{k}(\theta_{k-1},X_{n}),\qquad\tilde{\nabla}\hat{F}_{k}(\theta_{k-1},X_{n}):=\frac{1}{b}\sum_{i\in\Omega_{k}}\nabla f(\theta_{k-1},x_{i}), (1)

where \eta is the step-size, b is the batch-size, \Omega_{k} is the minibatch that is chosen randomly from the set \{1,2,\ldots,n\}, and its cardinality satisfies |\Omega_{k}|=b.
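To make the recursion concrete, below is a minimal sketch of (1) in Python; the function grad_f and the calling conventions are illustrative placeholders and not part of the paper's formal setup.

import numpy as np

def sgd(grad_f, X, theta0, eta=0.01, b=8, n_iters=1000, rng=None):
    # Run the SGD recursion (1) on the empirical risk of the data set X.
    # grad_f(theta, x) returns the gradient of the instantaneous loss f(theta, x) in theta.
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iters):
        omega = rng.choice(n, size=b, replace=False)   # minibatch Omega_k of size b
        g = np.mean([grad_f(theta, X[i]) for i in omega], axis=0)
        theta = theta - eta * g
    return theta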

One fruitful approach for estimating the population risk attained by SGD, i.e., F(\theta_{k}), is based on the following simple decomposition:

F(\theta_{k})\leq\hat{F}(\theta_{k})+|\hat{F}(\theta_{k})-F(\theta_{k})|, (2)

where the last term is called the generalization error. Once a computable upper bound for the generalization error can be obtained, this decomposition directly leads to a computable upper bound for the population risk F(\theta_{k}), since \hat{F}(\theta_{k}) can be computed thanks to the availability of X_{n}. Hence, the challenge here is reduced to deriving upper bounds on |\hat{F}(\theta_{k})-F(\theta_{k})|, typically referred to as generalization bounds.

Among many approaches for deriving generalization bounds, algorithmic stability (Bousquet and Elisseeff, 2002) has been one of the most fruitful notions that have paved the way to numerous generalization bounds for stochastic optimization algorithms (Hardt et al., 2016; Chen et al., 2018; Mou et al., 2018; Feldman and Vondrak, 2019; Lei and Ying, 2020; Zhang et al., 2022). In a nutshell, algorithmic stability measures how much the algorithm output differs if we replace one data point in X_{n} with a new sample. More precisely, in the context of SGD, given another data set \hat{X}_{n}=\{\hat{x}_{1},\ldots,\hat{x}_{n}\}=\{x_{1},\ldots,x_{i-1},\hat{x}_{i},x_{i+1},\ldots,x_{n}\}\in\mathcal{X}^{n} that differs from X_{n} by at most one element, we (theoretically) consider running SGD on \hat{X}_{n}, i.e.,

\hat{\theta}_{k}=\hat{\theta}_{k-1}-\eta\tilde{\nabla}\hat{F}_{k}(\hat{\theta}_{k-1},\hat{X}_{n}),\qquad\tilde{\nabla}\hat{F}_{k}(\hat{\theta}_{k-1},\hat{X}_{n}):=\frac{1}{b}\sum_{i\in\Omega_{k}}\nabla f(\hat{\theta}_{k-1},\hat{x}_{i}), (3)

and we are interested in the discrepancy between \theta_{k} and \hat{\theta}_{k} in some precise sense (to be formally defined in the next section). The wisdom of algorithmic stability indicates that a smaller discrepancy between \theta_{k} and \hat{\theta}_{k} implies a smaller generalization error.
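As an illustration of this setup, the following sketch (our own; it assumes a gradient oracle grad_f as in the sketch above) runs the coupled recursions (1) and (3) on two data sets that differ in a single point, sharing the minibatch indices \Omega_{k}, and reports the resulting parameter discrepancy.

import numpy as np

def coupled_sgd(grad_f, X, X_hat, theta0, eta=0.01, b=8, n_iters=1000, seed=0):
    # Run SGD on two neighboring data sets with shared minibatch indices Omega_k.
    rng = np.random.default_rng(seed)
    n = len(X)
    theta = np.array(theta0, dtype=float)
    theta_hat = np.array(theta0, dtype=float)
    for _ in range(n_iters):
        omega = rng.choice(n, size=b, replace=False)
        theta = theta - eta * np.mean([grad_f(theta, X[i]) for i in omega], axis=0)
        theta_hat = theta_hat - eta * np.mean([grad_f(theta_hat, X_hat[i]) for i in omega], axis=0)
    return theta, theta_hat

# Example with the quadratic loss f(theta, (a, y)) = (a^T theta - y)^2 / 2.
grad_quad = lambda theta, x: x[0] * (x[0] @ theta - x[1])
rng = np.random.default_rng(1)
A, y = rng.normal(size=(100, 5)) / np.sqrt(5), rng.normal(size=100)
X = [(A[i], y[i]) for i in range(100)]
X_hat = list(X)
X_hat[0] = (rng.normal(size=5) / np.sqrt(5), rng.normal())   # replace a single data point
theta_k, theta_hat_k = coupled_sgd(grad_quad, X, X_hat, np.zeros(5))
print(np.linalg.norm(theta_k - theta_hat_k))                  # discrepancy after k iterations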

The last decade has witnessed an increasing number of stability bounds for different algorithms applied on different classes of loss functions. In a pioneering study, Hardt et al. (2016) proved a variety of stability bounds for SGD, for strongly convex, convex, and non-convex problems. Their analysis showed that, under strong convexity and bounded gradient assumptions, the generalization error of SGD with constant step-size is of order n^{-1}; whereas for general convex and non-convex problems, their bounds diverged with the number of iterations (even with a projection step), unless a decreasing step-size is used. Subsequent studies (Lei and Ying, 2020; Kozachkov et al., 2023) extended the results of Hardt et al. (2016) by either relaxing the assumptions or generalizing the setting to more general algorithms. However, their bounds still diverged for constant step-sizes, unless strong convexity is assumed. In a recent study, Bassily et al. (2020) proved stability lower-bounds for projected SGD when the loss is convex and non-smooth. Their results showed that for general non-smooth loss functions we cannot expect to prove time-uniform (i.e., non-divergent with the number of iterations) stability bounds for SGD, even when a projection step is appended.

In a related line of research, several studies investigated the algorithmic stability of the stochastic gradient Langevin dynamics (SGLD) algorithm (Welling and Teh, 2011), which is essentially a ‘noisy’ version of SGD that uses the following recursion: \theta_{k}=\theta_{k-1}-\eta\tilde{\nabla}\hat{F}_{k}(\theta_{k-1},X_{n})+\xi_{k}, where (\xi_{k})_{k\geq 0} is a sequence of i.i.d. Gaussian vectors, independent of \theta_{k-1} and \Omega_{k}. The authors of Raginsky et al. (2017); Mou et al. (2018) proved stability bounds for SGLD for non-convex losses, which were then extended to more general (non-Gaussian) noise settings in Li et al. (2019). While these bounds hinted at the benefits of additional noise in terms of stability, they still increased with the number of iterations, which limited the impact of their results. More recently, Farghly and Rebeschini (2021) proved the first time-uniform stability bounds for SGLD under non-convexity, indicating that, with the presence of additive Gaussian noise, better stability bounds can be achieved. Their time-uniform results were then extended to non-Gaussian, heavy-tailed perturbations in Raj et al. (2023a, b) for quadratic and a class of non-convex problems.

While these bounds have illuminated various properties of optimization algorithms, the analysis of each case typically required a different proof technique with significantly different mathematical tools. Hence, it is not straightforward to extend the existing techniques to different algorithms with different classes of loss functions. Moreover, currently, it is not clear how the noisy perturbations affect algorithmic stability so that time-uniform bounds can be achieved, and more generally, it is not clear in which circumstances one might hope for time-uniform stability bounds.

In this study, we contribute to this line of research and prove novel time-uniform algorithmic stability bounds for SGD and its noisy versions. Our main contributions are as follows:

  • We make a novel connection between learning theory and applied probability, and introduce a unified guideline for proving Wasserstein stability bounds for stochastic optimization algorithms with a constant step-size. Our approach is based on Markov chain perturbation theory (Rudolf and Schweizer, 2018), which offers a three-step proof technique for deriving stability bounds: (i) showing the optimizer is geometrically ergodic, (ii) obtaining a Lyapunov function for the optimizer and the loss, and (iii) bounding the discrepancy between the Markov transition kernels associated with the chains (\theta_{k})_{k\geq 0} and (\hat{\theta}_{k})_{k\geq 0}. We illustrate this approach on SGD and show that time-uniform stability bounds can be obtained under a pseudo-Lipschitz-like condition for smooth strongly-convex losses (we recover similar results to the ones of Hardt et al. (2016)) and a class of non-convex losses (that satisfy a dissipativity condition) when a noisy perturbation with finite variance (not necessarily Gaussian, hence more general than Farghly and Rebeschini (2021)) is introduced. Our results shed more light on the role of the additional noise in terms of obtaining time-uniform bounds: in the non-convex case the optimizer might not be geometrically ergodic unless additional noise is introduced, hence the bound cannot be obtained. Moreover, our approach is flexible and can be generalized to other popular optimizers, as it mainly requires developing Lyapunov functions, which are often readily available in the literature (Aybat et al., 2020; Lessard et al., 2016; Fallah et al., 2022; Gürbüzbalaban et al., 2022; Liu et al., 2020; Aybat et al., 2019).

  • We then investigate the case where no additional noise is introduced to the SGD recursion and the geometric ergodicity condition does not hold. First, for non-convex losses, we prove a time-uniform stability bound, where the bound converges to a positive number (instead of zero) as n\to\infty, and this limit depends on the ‘level of non-convexity’. Then, we consider a class of (non-strongly) convex functions and prove stability bounds for the stationary distribution of (\theta_{k})_{k\geq 0}, which vanish as n increases. To the best of our knowledge, these results are novel, and indicate that the stability bounds do not need to increase with time even under non-convexity and without additional perturbations; yet, they might have a different nature depending on the problem class.

One limitation of our analysis is that it requires Lipschitz surrogate loss functions and does not directly handle the original loss function, due to the use of the Wasserstein distance (Raginsky et al., 2016). Yet, surrogate losses have been readily utilized in the recent stability literature (e.g., Raj et al. (2023a, b)) and we believe that our analysis might illuminate uncovered aspects of SGD even with this requirement. All the proofs are provided in the Appendix.

2 Technical Background

2.1 The Wasserstein distance and Wasserstein algorithmic stability

Wasserstein distance. For p\geq 1, the p-Wasserstein distance between two probability measures \mu and \nu on \mathbb{R}^{d} is defined as (Villani, 2009):

\mathcal{W}_{p}(\mu,\nu)=\left\{\inf\mathbb{E}\|X-Y\|^{p}\right\}^{1/p}, (4)

where the infimum is taken over all couplings of X\sim\mu and Y\sim\nu. In particular, the dual representation for the 1-Wasserstein distance is given as (Villani, 2009):

\mathcal{W}_{1}(\mu,\nu)=\sup_{h\in\text{Lip}(1)}\left|\int_{\mathbb{R}^{d}}h(x)\mu(dx)-\int_{\mathbb{R}^{d}}h(x)\nu(dx)\right|, (5)

where \text{Lip}(1) consists of the functions h:\mathbb{R}^{d}\rightarrow\mathbb{R} that are 1-Lipschitz.
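In one dimension, both (4) and its dual form (5) can be evaluated directly from samples; the snippet below is a small illustration using SciPy's wasserstein_distance, which computes the 1-Wasserstein distance between two empirical distributions on the real line.

import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
mu_samples = rng.normal(loc=0.0, scale=1.0, size=100_000)   # samples from mu
nu_samples = rng.normal(loc=0.5, scale=1.0, size=100_000)   # samples from nu

# For two Gaussians with equal variance, W_1 equals the distance between the means.
print(wasserstein_distance(mu_samples, nu_samples))          # approximately 0.5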

Wasserstein algorithmic stability. Algorithmic stability is a crucial concept in learning theory that has led to numerous significant theoretical breakthroughs (Bousquet and Elisseeff, 2002; Hardt et al., 2016). To begin, we will present the definition of algorithmic stability as stated in Hardt et al. (2016):

Definition 1 (Hardt et al. (2016), Definition 2.1)

Let \mathcal{RV}(\mathbb{R}^{d}) denote the set of \mathbb{R}^{d}-valued random vectors. For a (surrogate) loss function \ell:\mathbb{R}^{d}\times\mathcal{X}\rightarrow\mathbb{R}, an algorithm \mathcal{A}:\bigcup_{n=1}^{\infty}\mathcal{X}^{n}\to\mathcal{RV}(\mathbb{R}^{d}) is \varepsilon-uniformly stable if

\sup_{X\cong\hat{X}}\sup_{z\in\mathcal{X}}\mathbb{E}\left[\ell(\mathcal{A}(X),z)-\ell(\mathcal{A}(\hat{X}),z)\right]\leq\varepsilon, (6)

where the first supremum is taken over data X,\hat{X}\in\mathcal{X}^{n} that differ by one element, denoted by X\cong\hat{X}.

In this context, we purposefully employ a distinct notation for the loss function \ell (in contrast to f) since our theoretical framework necessitates measuring algorithmic stability through a surrogate loss function, which may differ from the original loss f. More precisely, our bounds will be based on the 1-Wasserstein distance; hence, we will need the surrogate loss \ell to be a Lipschitz continuous function, as we will detail in (5). On the other hand, for the original loss f we will need some form of convexity (e.g., strongly convex, convex, or dissipative) and we will need the gradient of f to be Lipschitz continuous, in order to derive Wasserstein bounds. Unfortunately, under these assumptions, we cannot further impose f itself to be Lipschitz, hence the need for surrogate losses. Nevertheless, the usage of surrogate losses is common in learning theory, see e.g., Farghly and Rebeschini (2021); Raj et al. (2023b), and we present concrete practical examples in the Appendix.

Now, we present a result from Hardt et al. (2016) that establishes a connection between algorithmic stability and the generalization performance of a randomized algorithm. Prior to presenting the result, we define the empirical and population risks with respect to the loss function \ell as follows:

\hat{R}(\theta,X_{n}):=\frac{1}{n}\sum_{i=1}^{n}\ell(\theta,x_{i}),\qquad R(\theta):=\mathbb{E}_{x\sim\mathcal{D}}[\ell(\theta,x)].
Theorem 2 (Hardt et al. (2016), Theorem 2.2)

Suppose that \mathcal{A} is an \varepsilon-uniformly stable algorithm; then the expected generalization error is bounded by

\left|\mathbb{E}_{\mathcal{A},X_{n}}\left[\hat{R}(\mathcal{A}(X_{n}),X_{n})-R(\mathcal{A}(X_{n}))\right]\right|\leq\varepsilon. (7)

For a randomized algorithm, if \nu and \hat{\nu} denote the laws of \mathcal{A}(X) and \mathcal{A}(\hat{X}), then for an \mathcal{L}-Lipschitz surrogate loss function \ell, we have the following generalization error guarantee:

\left|\mathbb{E}_{\mathcal{A},X_{n}}\left[\hat{R}(\mathcal{A}(X_{n}),X_{n})-R(\mathcal{A}(X_{n}))\right]\right|\leq\mathcal{L}\sup_{X\cong\hat{X}}\mathcal{W}_{1}(\nu,\hat{\nu}). (8)

The above result can be directly obtained from the combination of the results given in (5), Definition 1, and Theorem 2 (see also Raginsky et al. (2016)).
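To spell out the step behind (8) (a short derivation based on the dual form (5)): for an \mathcal{L}-Lipschitz surrogate loss \ell and any fixed z\in\mathcal{X}, the map \theta\mapsto\ell(\theta,z)/\mathcal{L} is 1-Lipschitz, so

\mathbb{E}\left[\ell(\mathcal{A}(X),z)-\ell(\mathcal{A}(\hat{X}),z)\right]=\int_{\mathbb{R}^{d}}\ell(\theta,z)\nu(d\theta)-\int_{\mathbb{R}^{d}}\ell(\theta,z)\hat{\nu}(d\theta)\leq\mathcal{L}\,\mathcal{W}_{1}(\nu,\hat{\nu}).

Taking suprema over z and over neighboring data sets X\cong\hat{X} shows that \mathcal{A} is \varepsilon-uniformly stable with \varepsilon=\mathcal{L}\sup_{X\cong\hat{X}}\mathcal{W}_{1}(\nu,\hat{\nu}), and Theorem 2 then gives (8).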

2.2 Perturbation theory for Markov chains

Next, we recall the Wasserstein perturbation bound for Markov chains from Rudolf and Schweizer (2018). Let (\theta_{n})_{n=0}^{\infty} be a Markov chain with transition kernel P and initial distribution p_{0}, i.e., we have almost surely

\mathbb{P}(\theta_{n}\in A|\theta_{0},\cdots,\theta_{n-1})=\mathbb{P}(\theta_{n}\in A|\theta_{n-1})=P(\theta_{n-1},A), (9)

and p_{0}(A)=\mathbb{P}(\theta_{0}\in A) for any measurable set A\subseteq\mathbb{R}^{d} and n\in\mathbb{N}. We assume that (\hat{\theta}_{n})_{n=0}^{\infty} is another Markov chain with transition kernel \hat{P} and initial distribution \hat{p}_{0}. We denote by p_{n} the distribution of \theta_{n} and by \hat{p}_{n} the distribution of \hat{\theta}_{n}. By \delta_{\theta}, we denote the Dirac delta distribution at \theta, i.e., the probability measure concentrated at \theta. For a measurable set A\subseteq\mathbb{R}^{d}, we also use the notation \delta_{\theta}P(A):=P(\theta,A).

Lemma 3 (Rudolf and Schweizer (2018), Theorem 3.1)

Assume that there exist some \rho\in[0,1) and C\in(0,\infty) such that

\sup_{\theta,\tilde{\theta}\in\mathbb{R}^{d}:\theta\neq\tilde{\theta}}\frac{\mathcal{W}_{1}(P^{n}(\theta,\cdot),P^{n}(\tilde{\theta},\cdot))}{\|\theta-\tilde{\theta}\|}\leq C\rho^{n}, (10)

for any n\in\mathbb{N}. Further assume that there exist some \delta\in(0,1) and L\in(0,\infty) and a measurable Lyapunov function \hat{V}:\mathbb{R}^{d}\rightarrow[1,\infty) of \hat{P} such that for any \theta\in\mathbb{R}^{d}:

(\hat{P}\hat{V})(\theta)\leq\delta\hat{V}(\theta)+L, (11)

where (\hat{P}\hat{V})(\theta):=\int_{\mathbb{R}^{d}}\hat{V}(\hat{\theta})\hat{P}(\theta,d\hat{\theta}). Then, we have

\mathcal{W}_{1}(p_{n},\hat{p}_{n})\leq C\left(\rho^{n}\mathcal{W}_{1}(p_{0},\hat{p}_{0})+(1-\rho^{n})\frac{\gamma\kappa}{1-\rho}\right), (12)

where \gamma:=\sup_{\theta\in\mathbb{R}^{d}}\frac{\mathcal{W}_{1}(\delta_{\theta}P,\delta_{\theta}\hat{P})}{\hat{V}(\theta)},\quad\kappa:=\max\left\{\int_{\mathbb{R}^{d}}\hat{V}(\theta)d\hat{p}_{0}(\theta),\frac{L}{1-\delta}\right\}.

Lemma 3 provides a sufficient condition for the distributions p_{n} and \hat{p}_{n} after n iterations to stay close to each other, given the initial distributions p_{0} and \hat{p}_{0}. Lemma 3 will play a key role in helping us derive the main results in Section 3.1 and Section 3.2. Later, in the Appendix, we will state and prove a modification of Lemma 3 (see Lemma 23 in the Appendix) that will be crucial for obtaining the main result in Section 3.3.
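As a sanity check of Lemma 3 (our own toy illustration, not from the paper), consider two scalar AR(1) chains \theta_{n}=a\theta_{n-1}+c+\xi_{n} and \hat{\theta}_{n}=a\hat{\theta}_{n-1}+\hat{c}+\xi_{n} with |a|<1 and standard Gaussian \xi_{n}. The contraction (10) holds with C=1 and \rho=a, the Lyapunov function \hat{V}(\theta)=1+|\theta| satisfies (11) with \delta=a and L=1-a+|\hat{c}|+\mathbb{E}|\xi_{1}|, and \gamma\leq|c-\hat{c}|, so the right-hand side of (12) can be evaluated explicitly and compared with an empirical estimate of \mathcal{W}_{1}(p_{n},\hat{p}_{n}):

import numpy as np
from scipy.stats import wasserstein_distance

a, c, c_hat, n_steps, n_paths = 0.9, 1.0, 1.1, 200, 100_000
rng = np.random.default_rng(0)
theta = np.zeros(n_paths)       # p_0 is a point mass at 0
theta_hat = np.zeros(n_paths)   # hat{p}_0 is the same point mass
for _ in range(n_steps):
    xi = rng.normal(size=n_paths)
    theta = a * theta + c + xi
    theta_hat = a * theta_hat + c_hat + xi

empirical_W1 = wasserstein_distance(theta, theta_hat)

# Quantities of Lemma 3 for this toy chain: C = 1, rho = delta = a, gamma <= |c - c_hat|,
# L = 1 - a + |c_hat| + E|xi_1|, and kappa = max(1, L / (1 - a)) since V_hat(0) = 1.
L = 1 - a + abs(c_hat) + np.sqrt(2 / np.pi)
kappa = max(1.0, L / (1 - a))
bound = (1 - a ** n_steps) * abs(c - c_hat) * kappa / (1 - a)
print(empirical_W1, bound)      # the empirical distance stays below the (conservative) bound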

3 Wasserstein Stability of SGD via Markov Chain Perturbations

In this section, we will derive time-uniform Wasserstein stability bounds for SGD by using the perturbation theory presented in Rudolf and Schweizer (2018). Before considering general losses that can be non-convex, we first consider the simpler case of quadratic losses to illustrate our key ideas.

3.1 Warm up: quadratic case

To illustrate the proof technique, we start by considering a quadratic loss of the form f(\theta,x_{i}):=(a_{i}^{\top}\theta-y_{i})^{2}/2, where x_{i}:=(a_{i},y_{i}) and \nabla f(\theta,x_{i})=a_{i}(a_{i}^{\top}\theta-y_{i}). In this setting, the SGD recursion takes the following form:

\theta_{k}=\left(I-\frac{\eta}{b}H_{k}\right)\theta_{k-1}+\frac{\eta}{b}q_{k},\quad\text{where}\quad H_{k}:=\sum_{i\in\Omega_{k}}a_{i}a_{i}^{\top},\quad q_{k}:=\sum_{i\in\Omega_{k}}a_{i}y_{i}. (13)

The pairs (H_{k},q_{k}) are i.i.d. and, for every k, (H_{k},q_{k}) is independent of \theta_{k-1}.

Similarly, we can write down the iterates of SGD with a different data set \hat{X}_{n}:=\{\hat{x}_{1},\ldots,\hat{x}_{n}\} with \hat{x}_{i}=(\hat{a}_{i},\hat{y}_{i}), where \hat{X}_{n} differs from X_{n} by at most one element:

\hat{\theta}_{k}=\left(I-\frac{\eta}{b}\hat{H}_{k}\right)\hat{\theta}_{k-1}+\frac{\eta}{b}\hat{q}_{k},\quad\text{where}\quad\hat{H}_{k}:=\sum_{i\in\Omega_{k}}\hat{a}_{i}\hat{a}_{i}^{\top},\quad\hat{q}_{k}:=\sum_{i\in\Omega_{k}}\hat{a}_{i}\hat{y}_{i}. (14)

Our goal is to obtain an algorithmic stability bound by estimating the 1-Wasserstein distance between the distributions of \theta_{k} and \hat{\theta}_{k}, and we will now illustrate the three-step proof technique that we described in Section 1. To be able to apply the perturbation theory (Rudolf and Schweizer, 2018), we start by establishing the geometric ergodicity of the Markov process (\theta_{k})_{k\geq 0} with transition kernel P(\theta,\cdot), given in the following lemma.

Lemma 4

Assume that \rho:=\mathbb{E}\left\|I-\frac{\eta}{b}H_{1}\right\|<1. Then, for any k\in\mathbb{N}, we have the following inequality: \mathcal{W}_{1}\left(P^{k}(\theta,\cdot),P^{k}(\tilde{\theta},\cdot)\right)\leq\rho^{k}\|\theta-\tilde{\theta}\|.

We note that since H_{1}\succeq 0, the assumption in Lemma 4 can be satisfied under mild conditions, for example when H_{1}\succ 0 with positive probability and \eta is sufficiently small.
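The contraction factor \rho=\mathbb{E}\left\|I-\frac{\eta}{b}H_{1}\right\| of Lemma 4 is straightforward to estimate by Monte Carlo for a given data set; the sketch below (with illustrative synthetic features) can be used to check numerically whether \rho<1 for a candidate step-size:

import numpy as np

def estimate_rho(A, eta, b, n_trials=2000, seed=0):
    # Monte Carlo estimate of rho = E || I - (eta / b) H_1 ||, where H_1 = sum_{i in Omega_1} a_i a_i^T
    # and ||.|| is the spectral norm; the rows of A are the feature vectors a_i.
    rng = np.random.default_rng(seed)
    n, d = A.shape
    norms = []
    for _ in range(n_trials):
        omega = rng.choice(n, size=b, replace=False)
        H = A[omega].T @ A[omega]
        norms.append(np.linalg.norm(np.eye(d) - (eta / b) * H, ord=2))
    return float(np.mean(norms))

A = np.random.default_rng(1).normal(size=(200, 10)) / np.sqrt(10)   # synthetic features
print(estimate_rho(A, eta=0.5, b=16))                               # strictly below 1 in this example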

In the second step, we construct a Lyapunov function \hat{V} that satisfies the conditions of Lemma 3.

Lemma 5

Let \hat{V}(\theta):=1+\|\theta\|. Assume that \hat{\rho}:=\mathbb{E}\left\|I-\frac{\eta}{b}\hat{H}_{1}\right\|<1. Then, we have

(\hat{P}\hat{V})(\theta)\leq\hat{\rho}\hat{V}(\theta)+1-\hat{\rho}+\frac{\eta}{b}\mathbb{E}\|\hat{q}_{1}\|. (15)

In our third and last step, we estimate the perturbation gap \gamma (defined below (12)) based on the Lyapunov function \hat{V}, assuming that the data is bounded. Such bounded data assumptions have been commonly made in the literature (Bach, 2014; Bach and Moulines, 2013).

Lemma 6

If \sup_{x\in\mathcal{X}}\|x\|\leq D for some D<\infty, then we have \sup_{\theta\in\mathbb{R}^{d}}\frac{\mathcal{W}_{1}(\delta_{\theta}P,\delta_{\theta}\hat{P})}{\hat{V}(\theta)}\leq\frac{2\eta D^{2}}{n}.

Note that Lemma 3 relies on three conditions: the Wasserstein contraction in (10), which is obtained through Lemma 4; the drift condition for the Lyapunov function in (11), which is obtained in Lemma 5; and finally the estimate on \gamma in (12), which concerns the one-step 1-Wasserstein distance between two semi-groups that, in our context, are associated with two datasets that differ by at most one element, and which is obtained in Lemma 6. The only place the neighborhood assumption (\sup_{x\in\mathcal{X}}\|x\|\leq D) is used is in the expression of \gamma in equation (12). Now, having all the ingredients, we can invoke Lemma 3 and we obtain the following result, which provides a 1-Wasserstein bound between the distributions of the iterates when applied to datasets that differ by one point.

For Y\in\bigcup_{n=1}^{\infty}\mathcal{X}^{n} and k\geq 0, let \nu(Y,k) denote the law of the k-th SGD iterate when Y is used as the dataset; i.e., \nu(X,k) and \nu(\hat{X},k) denote the distributions of \theta_{k} and \hat{\theta}_{k} obtained by the recursions (13) and (14), respectively. As shorthand notation, set \nu_{k}:=\nu(X,k) and \hat{\nu}_{k}:=\nu(\hat{X},k).

Theorem 7

Assume \theta_{0}=\hat{\theta}_{0}=\theta. We also assume that \rho:=\mathbb{E}\left\|I-\frac{\eta}{b}H_{1}\right\|<1, \hat{\rho}:=\mathbb{E}\left\|I-\frac{\eta}{b}\hat{H}_{1}\right\|<1, and \sup_{x\in\mathcal{X}}\|x\|\leq D for some D<\infty. Then, we have

\mathcal{W}_{1}(\nu_{k},\hat{\nu}_{k})\leq\frac{1-\rho^{k}}{1-\rho}\frac{2\eta D^{2}}{n}\max\left\{1+\|\theta\|,\frac{1-\hat{\rho}+\frac{\eta}{b}\mathbb{E}\|\hat{q}_{1}\|}{1-\hat{\rho}}\right\}. (16)

Proof  The result directly follows from Lemma 4, Lemma 5, Lemma 6 and Lemma 3.  
By a direct application of (8), we can obtain a generalization bound for an \mathcal{L}-Lipschitz surrogate loss function, as follows:

\left|\mathbb{E}_{\mathcal{A},X_{n}}\left[\hat{R}(\mathcal{A}(X_{n}),X_{n})-R(\mathcal{A}(X_{n}))\right]\right|\leq\frac{\mathcal{L}}{1-\rho_{0}}\frac{2\eta D^{2}}{n}\max\left\{1+\|\theta\|,\frac{1-\rho_{0}+\frac{\eta}{b}\mathbb{E}\|\hat{q}_{1}\|}{1-\rho_{0}}\right\},

where \rho_{0}=\sup_{X}\|I-\frac{\eta}{b}H_{X}\|, H_{X}=\sum_{i\in\Omega_{k},\,a_{i}\in X}a_{i}a_{i}^{\top}, and X is a random set of n data points from the data-generating distribution. The generalization bound obtained above does not cover the mean squared error in the unbounded case, but it covers a larger class of surrogate loss functions. Because of this incompatibility, a direct comparison is not possible; however, the rate obtained in the equation above has the same dependence on the number of samples as the rates obtained in previous works (Lei and Ying, 2020). For least squares, there are other works using integral operators that develop generalization bounds for SGD under a capacity condition (Lin and Rosasco, 2017; Pillaud-Vivien et al., 2018). However, these bounds only hold for the least-squares loss.
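To illustrate how the quantities entering Theorem 7 and the bound above can be estimated in practice, here is a small self-contained sketch on synthetic data (our own illustration); \rho, \hat{\rho} and \mathbb{E}\|\hat{q}_{1}\| are estimated by Monte Carlo and plugged into the right-hand side of (16):

import numpy as np

rng = np.random.default_rng(2)
n, d, b, eta, k = 200, 10, 16, 0.5, 1000
A, y = rng.normal(size=(n, d)) / np.sqrt(d), rng.normal(size=n)
A_hat, y_hat = A.copy(), y.copy()
A_hat[0], y_hat[0] = rng.normal(size=d) / np.sqrt(d), rng.normal()   # neighboring data set

def mc(stat, A_all, y_all, trials=2000):
    # Monte Carlo average of a minibatch statistic stat(A_batch, y_batch).
    vals = []
    for _ in range(trials):
        idx = rng.choice(n, size=b, replace=False)
        vals.append(stat(A_all[idx], y_all[idx]))
    return float(np.mean(vals))

spec = lambda Ab, yb: np.linalg.norm(np.eye(d) - (eta / b) * (Ab.T @ Ab), ord=2)
rho, rho_hat = mc(spec, A, y), mc(spec, A_hat, y_hat)
Eq_hat = mc(lambda Ab, yb: np.linalg.norm(Ab.T @ yb), A_hat, y_hat)
D = max(np.sqrt((A ** 2).sum(1) + y ** 2).max(), np.sqrt((A_hat ** 2).sum(1) + y_hat ** 2).max())

theta_norm = 0.0   # theta_0 = 0
bound = ((1 - rho ** k) / (1 - rho)) * (2 * eta * D ** 2 / n) \
        * max(1 + theta_norm, (1 - rho_hat + (eta / b) * Eq_hat) / (1 - rho_hat))
print(rho, rho_hat, bound)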

3.2 Strongly convex case

Next, we consider strongly convex losses. In the remainder of the paper, we will always assume that for every x\in\mathcal{X}, f(\cdot,x) is differentiable.

Before proceeding to the stability bound, we first introduce the following assumptions.

Assumption 1

There exist constants K_{1},K_{2}>0 such that for any \theta,\hat{\theta}\in\mathbb{R}^{d} and every x,\hat{x}\in\mathcal{X},

\|\nabla f(\theta,x)-\nabla f(\hat{\theta},\hat{x})\|\leq K_{1}\|\theta-\hat{\theta}\|+K_{2}\|x-\hat{x}\|(\|\theta\|+\|\hat{\theta}\|+1). (17)

This assumption is a pseudo-Lipschitz-like condition on \nabla f and is satisfied for various problems such as GLMs (Bach, 2014). Next, we assume that the loss function f is strongly convex.

Assumption 2

There exists a universal constant \mu>0 such that for any \theta_{1},\theta_{2}\in\mathbb{R}^{d} and x\in\mathcal{X},

\left\langle\nabla f(\theta_{1},x)-\nabla f(\theta_{2},x),\theta_{1}-\theta_{2}\right\rangle\geq\mu\|\theta_{1}-\theta_{2}\|^{2}.

By using the same recipe as we used for quadratic losses, we obtain the following stability result.

Theorem 8

Let \theta_{0}=\hat{\theta}_{0}=\theta. Assume that Assumption 1 and Assumption 2 hold. We also assume that \eta<\min\left\{\frac{1}{\mu},\frac{\mu}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\}, \sup_{x\in\mathcal{X}}\|x\|\leq D for some D<\infty, and \sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\leq E for some E<\infty. Let \nu_{k} and \hat{\nu}_{k} denote the distributions of \theta_{k} and \hat{\theta}_{k}, respectively. Then, we have

\mathcal{W}_{1}(\nu_{k},\hat{\nu}_{k})\leq\frac{8DK_{2}(1-(1-\frac{\eta\mu}{2})^{k})}{n\mu}\left(\frac{2E}{\mu}+1\right)\cdot\max\left\{1+2\|\theta\|^{2}+\frac{2E^{2}}{\mu^{2}},2-\frac{\eta}{\mu}K_{1}^{2}-\frac{56\eta}{\mu}D^{2}K_{2}^{2}+\frac{64\eta}{\mu^{3}}D^{2}K_{2}^{2}E^{2}\right\}. (18)

Similarly to the quadratic case, we can now directly obtain a bound on the expected generalization error using (8). More precisely, for an \mathcal{L}-Lipschitz surrogate loss function \ell, we have

\left|\mathbb{E}_{\mathcal{A},X_{n}}\left[\hat{R}(\mathcal{A}(X_{n}),X_{n})-R(\mathcal{A}(X_{n}))\right]\right|\leq\mathcal{L}\cdot\frac{8DK_{2}(1-(1-\frac{\eta\mu}{2})^{k})}{n\mu}\left(\frac{2E}{\mu}+1\right)\cdot\max\left\{1+2\|\theta\|^{2}+\frac{2E^{2}}{\mu^{2}},2-\frac{\eta}{\mu}K_{1}^{2}-\frac{56\eta}{\mu}D^{2}K_{2}^{2}+\frac{64\eta}{\mu^{3}}D^{2}K_{2}^{2}E^{2}\right\}.

The bound above has the same dependence on the number of samples as the previous stability analyses of (projected) SGD for strongly convex functions (Hardt et al., 2016; Liu et al., 2017; Lei and Ying, 2020). However, we have a worse dependence on the strong convexity parameter \mu.

3.3 Non-convex case with additive noise

Finally, we consider a class of non-convex loss functions. We assume that the loss function satisfies the following dissipativity condition.

Assumption 3

There exist constants m>0 and K>0 such that for any \theta_{1},\theta_{2}\in\mathbb{R}^{d} and x\in\mathcal{X},

\left\langle\nabla f(\theta_{1},x)-\nabla f(\theta_{2},x),\theta_{1}-\theta_{2}\right\rangle\geq m\|\theta_{1}-\theta_{2}\|^{2}-K.

The class of dissipative functions satisfying this assumption consists of functions that admit some gradient growth in radial directions outside a compact set. Inside the compact set, though, they can have quite general non-convexity patterns. As concrete examples, they include certain one-hidden-layer neural networks (Akiyama and Suzuki, 2023); they arise in non-convex formulations of classification problems (e.g., in logistic regression with a sigmoid/non-convex link function); and they can also arise in robust regression problems, see e.g. Gao et al. (2022). Also, any function that is strongly convex outside of a ball of some finite radius will satisfy this assumption. Consequently, regularized regression problems where the loss is a strongly convex quadratic plus a smooth penalty that grows slower than a quadratic will belong to this class; a concrete example would be smoothed Lasso regression, and many other examples are given in Erdogdu et al. (2022). Dissipative functions also arise frequently in the sampling, Bayesian learning, and global non-convex optimization literature (Raginsky et al., 2017; Gao et al., 2022).
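As a simple worked example (ours, not from the paper) of how Assumption 3 can be verified: suppose f(\theta,x)=g(\theta,x)+h(\theta,x), where g(\cdot,x) is \mu-strongly convex and \|\nabla h(\theta,x)\|\leq C for all \theta. Then, by the Cauchy-Schwarz and Young inequalities,

\left\langle\nabla f(\theta_{1},x)-\nabla f(\theta_{2},x),\theta_{1}-\theta_{2}\right\rangle\geq\mu\|\theta_{1}-\theta_{2}\|^{2}-2C\|\theta_{1}-\theta_{2}\|\geq\frac{\mu}{2}\|\theta_{1}-\theta_{2}\|^{2}-\frac{2C^{2}}{\mu},

so Assumption 3 holds with m=\mu/2 and K=2C^{2}/\mu; the strongly convex case is recovered when C=0.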

Unlike the strongly-convex case, we can no longer obtain a Wasserstein contraction bound using the synchronous coupling technique as we did in the proof of Theorem 8. To circumvent this problem, in this setting, we consider a noisy version of SGD, with the following recursion:

\theta_{k}=\theta_{k-1}-\eta\tilde{\nabla}\hat{F}_{k}(\theta_{k-1},X_{n})+\eta\xi_{k},\qquad\tilde{\nabla}\hat{F}_{k}(\theta_{k-1},X_{n}):=\frac{1}{b}\sum_{i\in\Omega_{k}}\nabla f(\theta_{k-1},x_{i}), (19)

where the \xi_{k} are additional i.i.d. random vectors in \mathbb{R}^{d}, independent of \theta_{k-1} and \Omega_{k}, satisfying the following assumption.

Assumption 4

\xi_{1} is a random vector on \mathbb{R}^{d} with a continuous density p(x) that is positive everywhere, i.e., p(x)>0 for any x\in\mathbb{R}^{d}, and \mathbb{E}[\xi_{1}]=0, \sigma^{2}:=\mathbb{E}\left[\|\xi_{1}\|^{2}\right]<\infty.

Note that the SGLD algorithm (Welling and Teh, 2011) is a special case of this recursion, whilst our noise model can accommodate non-Gaussian distributions with finite second-order moment.
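A minimal sketch of the noisy recursion (19) (ours, for illustration), using zero-mean Laplace noise as one example of a non-Gaussian perturbation with a positive, continuous density and finite variance, as allowed by Assumption 4:

import numpy as np

def noisy_sgd(grad_f, X, theta0, eta=0.01, b=8, noise_scale=0.1, n_iters=1000, seed=0):
    # SGD recursion (19): the usual minibatch gradient step plus additive i.i.d. noise eta * xi_k.
    rng = np.random.default_rng(seed)
    n = len(X)
    theta = np.array(theta0, dtype=float)
    for _ in range(n_iters):
        omega = rng.choice(n, size=b, replace=False)
        g = np.mean([grad_f(theta, X[i]) for i in omega], axis=0)
        xi = rng.laplace(loc=0.0, scale=noise_scale, size=theta.shape)   # zero mean, finite variance
        theta = theta - eta * g + eta * xi
    return theta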

Analogously, let us define the (noisy) SGD recursion with the data set \hat{X}_{n} as

\hat{\theta}_{k}=\hat{\theta}_{k-1}-\eta\tilde{\nabla}\hat{F}_{k}(\hat{\theta}_{k-1},\hat{X}_{n})+\eta\xi_{k},

and let p(\theta,\theta_{1}) denote the probability density function of \theta_{1}=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i})+\eta\xi_{1}. Further, let \theta_{*} be a minimizer of \hat{F}(\cdot,X_{n}). Then, by following the same three-step recipe, we obtain the following stability bound. Here, we do not provide all the constants explicitly for the sake of clarity; the complete theorem statement is given in Theorem 25 (Appendix E.1).

Theorem 9

Let \theta_{0}=\hat{\theta}_{0}=\theta. Assume that Assumption 1, Assumption 3 and Assumption 4 hold. We also assume that \eta<\min\left\{\frac{1}{m},\frac{m}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\}, \sup_{x\in\mathcal{X}}\|x\|\leq D for some D<\infty, and \sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\leq E for some E<\infty. For any \hat{\eta}\in(0,1), define M>0 so that \int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}\geq\sqrt{\hat{\eta}}, and take any R>\frac{2K_{0}}{m}, where K_{0} is defined in (67), such that

\inf_{\theta,\theta_{1}\in\mathbb{R}^{d}:V(\theta)\leq R,\|\theta_{1}-\theta_{\ast}\|\leq M}\frac{p(\theta,\theta_{1})}{p(\theta_{\ast},\theta_{1})}\geq\sqrt{\hat{\eta}}. (20)

Let \nu_{k} and \hat{\nu}_{k} denote the distributions of \theta_{k} and \hat{\theta}_{k}, respectively. Then, we have

\mathcal{W}_{1}(\nu_{k},\hat{\nu}_{k})\leq\frac{C_{1}(1-\bar{\eta}^{k})}{2\sqrt{\psi(1+\psi)}(1-\bar{\eta})}\cdot\frac{2b}{n}, (21)

where for any \eta_{0}\in(0,\hat{\eta}) and \gamma_{0}\in\left(1-m\eta+\frac{2\eta K_{0}}{R},1\right), \psi=\frac{\eta_{0}}{\eta K_{0}} and \bar{\eta}=(1-(\hat{\eta}-\eta_{0}))\vee\frac{2+R\psi\gamma_{0}}{2+R\psi}. The constant C_{1}\equiv C_{1}(\psi,\eta,\hat{\eta},R,\gamma_{0},K_{1},K_{2},\sigma^{2},D,E) is explicitly stated in the proof.

Contrary to our previous results, the proof technique for showing Wasserstein contraction (as in Lemma 4) for this theorem relies on verifying the drift condition (Assumption 7) and the minorization condition (Assumption 8) as given in Hairer and Mattingly (2011). Once these conditions are satisfied, we invoke the explicitly computable bounds on the convergence of Markov chains developed in Hairer and Mattingly (2011).

From equation (8), we directly obtain the following generalization error bound for an \mathcal{L}-Lipschitz surrogate loss function,

\left|\mathbb{E}_{\mathcal{A},X_{n}}\left[\hat{R}(\mathcal{A}(X_{n}),X_{n})-R(\mathcal{A}(X_{n}))\right]\right|\leq\mathcal{L}\cdot\frac{C_{1}(1-\bar{\eta}^{k})}{2\sqrt{\psi(1+\psi)}(1-\bar{\eta})}\cdot\frac{2b}{n},

where the constants are defined in Theorem 9. (By using the decomposition (2), we can obtain excess risk bounds for SGLD by combining our results with Xu and Raginsky (2017): it was shown there that gradient Langevin dynamics attains an optimization error of O(\varepsilon+d^{3/2}b^{-1/4}\lambda^{-1}\log 1/\varepsilon) after K=O(d\varepsilon^{-1}\lambda^{-1}\log 1/\varepsilon) iterations, where b is the mini-batch size and \lambda is the uniform spectral gap of the continuous-time Langevin dynamics; similar results are given for SGLD in (Xu and Raginsky, 2017, Theorem 3.6).) The above result can be directly compared with the result in (Farghly and Rebeschini, 2021, Theorem 4.1), which has the same dependence on n and b. However, our result is more general in the sense that we do not assume the noise to be Gaussian. Note that Li et al. (2019) can also accommodate non-Gaussian noise; however, their bounds increase with the number of iterations.

Remark 10

In Theorem 9, we can take R=\frac{2K_{0}}{m}(1+\epsilon) for some fixed \epsilon\in(0,1), so we can take

\hat{\eta}=\left(\max_{M>0}\left\{\min\left\{\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1},\inf_{\begin{subarray}{c}\theta,\theta_{1}\in\mathbb{R}^{d}:\|\theta_{1}-\theta_{\ast}\|\leq M\\ \|\theta-\theta_{\ast}\|^{2}\leq\frac{2K_{0}}{m}(1+\epsilon)-1\end{subarray}}\frac{p(\theta,\theta_{1})}{p(\theta_{\ast},\theta_{1})}\right\}\right\}\right)^{2}. (22)

Moreover, we can take \eta_{0}=\frac{\hat{\eta}}{2}, \gamma_{0}=1-\frac{m\eta\epsilon}{2}, and \psi=\frac{\hat{\eta}}{2\eta K_{0}}, so that

\bar{\eta}=\max\left\{1-\frac{\hat{\eta}}{2},\frac{2+\frac{(1+\epsilon)\hat{\eta}}{m}(1-\frac{m\eta\epsilon}{2})}{2+\frac{(1+\epsilon)\hat{\eta}}{m}}\right\}=1-\frac{m\eta\epsilon(1+\epsilon)\hat{\eta}}{4m+2(1+\epsilon)\hat{\eta}}, (23)

provided that \eta\leq 1. Note that the parameter \hat{\eta} in (22) appears in the upper bound (21) that controls the 1-Wasserstein algorithmic stability of SGD. It is easy to see from equation (21) that the smaller \bar{\eta} is, the smaller the 1-Wasserstein bound; and by the definition of \bar{\eta}, the larger \hat{\eta} is, the smaller the 1-Wasserstein bound. As a result, we would like to choose \hat{\eta} as large as possible, and equation (22) provides an explicit value for \hat{\eta}, which is the largest possible choice.

Next, let us provide some explicitly computable lower bounds for \hat{\eta} in (22). This is achievable if we further specify the noise assumption. Under the assumption that the \xi_{k} are i.i.d. Gaussian, we have the following corollary.

Corollary 11

Under the assumptions in Theorem 9, we further assume that the noise vectors \xi_{k} are i.i.d. Gaussian \mathcal{N}(0,\Sigma) so that \mathbb{E}[\|\xi_{1}\|^{2}]=\text{tr}(\Sigma)=\sigma^{2}. We also assume that \Sigma\prec I_{d}. Then, we have

\hat{\eta}\geq\max_{M\geq\eta\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|}\Bigg\{\min\Bigg\{\Bigg(1-\frac{\exp(-\frac{1}{2}(\frac{M}{\eta}-\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|)^{2})}{\sqrt{\det(I_{d}-\Sigma)}}\Bigg)^{2},
\exp\Bigg\{-\frac{(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}}{\eta^{2}}\|\Sigma^{-1}\|
\cdot\Bigg((1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}+2\left(M+\eta\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|\right)\Bigg)\Bigg\}\Bigg\}\Bigg\}. (24)

The above corollary provides an explicit lower bound for \hat{\eta} (instead of the less transparent inequality constraints in Theorem 9) and, combined with Remark 10 (see equation (23)), it leads to an explicit formula for \bar{\eta}, which is essential to characterize the Wasserstein upper bound (21) in Theorem 9.

4 Wasserstein Stability of SGD without Geometric Ergodicity

While the Markov chain perturbation theory enabled us to develop stability bounds for the case where we can ensure geometric ergodicity in the Wasserstein sense (i.e., proving contraction bounds), we have observed that such a strong ergodicity notion might not hold for non-strongly convex losses. In this section, we will prove two more stability bounds for SGD, without relying on Rudolf and Schweizer (2018), hence without requiring geometric ergodicity. To the best of our knowledge, these are the first time-uniform stability bounds for the considered classes of convex and non-convex problems.

4.1 Non-convex case without additive noise

The stability result we obtained in Theorem 9 required us to introduce additional noise (Assumption 4) to be able to invoke Lemma 3. We will now show that it is possible to use a more direct approach to obtain 2-Wasserstein algorithmic stability in the non-convex case under Assumption 3 without relying on Rudolf and Schweizer (2018). However, we will observe that, without geometric ergodicity, the bound will contain a non-vanishing bias term. Note that, since \mathcal{W}_{1}(\nu_{k},\hat{\nu}_{k})\leq\mathcal{W}_{p}(\nu_{k},\hat{\nu}_{k}) for all p\geq 1, the following bound still yields a generalization bound by (8).

Theorem 12

Assume \theta_{0}=\hat{\theta}_{0}=\theta. We also assume that Assumption 1 and Assumption 3 hold, \eta<\min\left\{\frac{1}{m},\frac{m}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\}, \sup_{x\in\mathcal{X}}\|x\|\leq D for some D<\infty, and \sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\leq E for some E<\infty. Let \nu_{k} and \hat{\nu}_{k} denote the distributions of \theta_{k} and \hat{\theta}_{k} respectively. Then, we have

\mathcal{W}_{2}^{2}(\nu_{k},\hat{\nu}_{k})\leq\left(1-(1-\eta m)^{k}\right)\cdot\left(\frac{4D^{2}K_{2}^{2}\eta(8B+2)}{bnm}+\frac{4K_{2}D(1+K_{1}\eta)}{nm}\left(1+5B\right)+\frac{2K}{m}\right), (25)

where the constant B is explicitly defined in the proof.

While the bound (25) does not increase with the number of iterations, it is easy to see that it does not vanish as n\rightarrow\infty, and it is small only when K from the dissipativity condition (Assumption 3) is small. In other words, if we consider K to be the level of non-convexity (e.g., K=0 corresponds to strong convexity), the persistent term in the bound grows as the function becomes ‘more non-convex’. While this persistent term might make the bound vacuous when n\to\infty, for moderate n the bound can still be informative, as the persistent term might be dominated by the first two terms.

Moreover, discarding the persistent bias term, this bound leads to a generalization bound with rate n^{-1/2}, rather than n^{-1} as before. This indicates that it is beneficial to add the additional noise \xi_{k} in SGD, as in Theorem 9, so that the dynamics become geometrically ergodic, which leads to a sharp bound as n\rightarrow\infty. Finally, we note that as Theorem 12 involves the 2-Wasserstein distance, it can pave the way for generalization bounds without requiring a surrogate loss. Yet, this is not immediate and would require deriving uniform L^{2} bounds for the iterates, e.g., as in Raginsky et al. (2017).

4.2 Convex case with additional geometric structure

We now present our final stability bound, where we consider relaxing the strong convexity assumption (Assumption 2) to the following milder assumption.

Assumption 5

There exist universal constants \mu>0 and p\in(1,2) such that for any \theta_{1},\theta_{2}\in\mathbb{R}^{d} and x\in\mathcal{X}, \left\langle\nabla f(\theta_{1},x)-\nabla f(\theta_{2},x),\theta_{1}-\theta_{2}\right\rangle\geq\mu\|\theta_{1}-\theta_{2}\|^{p}.

Note that as p<2, the function class can be seen as an intermediate class between convex and strongly convex functions, and such a class of functions has been studied in the optimization literature (Dunn, 1981; Bertsekas, 2015).

We analogously modify Assumption 1 and consider the following assumption.

Assumption 6

There exist constants K_{1},K_{2}>0 and p\in(1,2) such that for any \theta,\hat{\theta}\in\mathbb{R}^{d} and every x,\hat{x}\in\mathcal{X}, \|\nabla f(\theta,x)-\nabla f(\hat{\theta},\hat{x})\|\leq K_{1}\|\theta-\hat{\theta}\|^{\frac{p}{2}}+K_{2}\|x-\hat{x}\|(\|\theta\|^{p-1}+\|\hat{\theta}\|^{p-1}+1).

The next theorem establishes a stability bound for the considered class of convex losses in the stationary regime of SGD.

Theorem 13

Let \theta_{0}=\hat{\theta}_{0}=\theta. Suppose Assumption 5 and Assumption 6 hold (with p\in(1,2)), \eta\leq\frac{\mu}{K_{1}^{2}+2^{p+4}D^{2}K_{2}^{2}}, \sup_{x\in\mathcal{X}}\|x\|\leq D for some D<\infty, and \sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\leq E for some E<\infty. Then \nu_{k} and \hat{\nu}_{k} converge to the unique stationary distributions \nu_{\infty} and \hat{\nu}_{\infty}, respectively, and moreover, we have

\mathcal{W}_{p}^{p}(\nu_{\infty},\hat{\nu}_{\infty})\leq\frac{C_{2}}{bn}+\frac{C_{3}}{n}, (26)

where the constants C_{2}\equiv C_{2}(\eta,\mu,K_{2},D,E) and C_{3}\equiv C_{3}(\eta,\mu,K_{1},K_{2},D,E) are explicitly stated in the proof.

While we have relaxed the geometric ergodicity condition for this case, in the proof of Theorem 13 we show that the Markov chain (\theta_{k})_{k\geq 0} is simply ergodic, i.e., \lim_{k\to\infty}\mathcal{W}_{p}(\nu_{k},\nu_{\infty})=0. Hence, even though we still obtain a time-uniform bound, our bound holds asymptotically in k, due to the lack of an explicit convergence rate for \mathcal{W}_{p}(\nu_{k},\nu_{\infty}). On the other hand, the lack of strong convexity here results in a generalization bound with rate n^{-1/p}, whereas for the strongly convex case, i.e., p=2, we previously obtained a rate of n^{-1}. This might be an indicator that there is still room for improvement in terms of the rate, at least for this class of loss functions.

5 Conclusion

We proved time-uniform Wasserstein-stability bounds for SGD and its noisy versions under different strongly convex, convex, and non-convex classes of functions. By making a connection to Markov chain perturbation results (Rudolf and Schweizer, 2018), we introduced a three-step guideline for proving stability bounds for stochastic optimizers. As this approach required geometric ergodicity, we finally relaxed this condition and proved two other stability bounds for a large class of loss functions.

The main limitation of our approach is that it requires Lipschitz surrogate loss functions, as it is based on the Wasserstein distance. Hence, our natural next step will be to extend our analysis without such a requirement. Finally, due to the theoretical nature of this study, it does not contain any direct potential societal impacts.

Acknowledgments

Lingjiong Zhu is partially supported by the grants NSF DMS-2053454, NSF DMS-2208303, and a Simons Foundation Collaboration Grant. Mert Gürbüzbalaban’s research is supported in part by the grants Office of Naval Research Award Number N00014-21-1-2244, National Science Foundation (NSF) CCF-1814888, and NSF DMS-2053485. Anant Raj is supported by a Marie Sklodowska-Curie Fellowship (project NN-OVEROPT 101030817). Umut Şimşekli’s research is supported by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute), and the European Research Council Starting Grant DYNASTY – 101039676.

References

  • Akiyama and Suzuki (2023) S. Akiyama and T. Suzuki. Excess risk of two-layer ReLU neural networks in teacher-student settings and its superiority to kernel methods. In International Conference on Learning Representations, 2023.
  • Aybat et al. (2019) N. S. Aybat, A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar. A universally optimal multistage accelerated stochastic gradient method. In Advances in Neural Information Processing Systems, volume 32, 2019.
  • Aybat et al. (2020) N. S. Aybat, A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar. Robust accelerated gradient methods for smooth strongly convex functions. SIAM Journal on Optimization, 30(1):717–751, 2020.
  • Bach (2014) F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15(1):595–627, 2014.
  • Bach and Moulines (2013) F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate \mathcal{O}(1/n). In Advances in Neural Information Processing Systems, volume 26, 2013.
  • Bassily et al. (2020) R. Bassily, V. Feldman, C. Guzmán, and K. Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. In Advances in Neural Information Processing Systems, volume 33, pages 4381–4391, 2020.
  • Bertsekas (2015) D. Bertsekas. Convex Optimization Algorithms. Athena Scientific, 2015.
  • Bousquet and Elisseeff (2002) O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
  • Can et al. (2019) B. Can, M. Gürbüzbalaban, and L. Zhu. Accelerated linear convergence of stochastic momentum methods in Wasserstein distances. In International Conference on Machine Learning, pages 891–901. PMLR, 2019.
  • Chen et al. (2018) Y. Chen, C. Jin, and B. Yu. Stability and convergence trade-off of iterative optimization algorithms. arXiv preprint arXiv:1804.01619, 2018.
  • Dunn (1981) J. C. Dunn. Global and asymptotic convergence rate estimates for a class of projected gradient processes. SIAM Journal on Control and Optimization, 19(3):368–400, 1981.
  • Erdogdu et al. (2022) M. A. Erdogdu, R. Hosseinzadeh, and M. S. Zhang. Convergence of Langevin Monte Carlo in Chi-squared and Rényi divergence. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 151. PMLR, 2022.
  • Fallah et al. (2022) A. Fallah, M. Gürbüzbalaban, A. Ozdaglar, U. Şimşekli, and L. Zhu. Robust distributed accelerated stochastic gradient methods for multi-agent networks. Journal of Machine Learning Research, 23(1):9893–9988, 2022.
  • Farghly and Rebeschini (2021) T. Farghly and P. Rebeschini. Time-independent generalization bounds for SGLD in non-convex settings. In Advances in Neural Information Processing Systems, volume 34, pages 19836–19846, 2021.
  • Feldman and Vondrak (2019) V. Feldman and J. Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on Learning Theory, pages 1270–1279. PMLR, 2019.
  • Gao et al. (2022) X. Gao, M. Gürbüzbalaban, and L. Zhu. Global convergence of stochastic gradient Hamiltonian Monte Carlo for nonconvex stochastic optimization: Nonasymptotic performance bounds and momentum-based acceleration. Operations Research, 70(5):2931–2947, 2022.
  • Garrigos and Gower (2023) G. Garrigos and R. M. Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235, 2023.
  • Gürbüzbalaban et al. (2022) M. Gürbüzbalaban, A. Ruszczyński, and L. Zhu. A stochastic subgradient method for distributionally robust non-convex and non-smooth learning. Journal of Optimization Theory and Applications, 194(3):1014–1041, 2022.
  • Hairer and Mattingly (2011) M. Hairer and J. C. Mattingly. Yet another look at Harris’ ergodic theorem for Markov chains. In Seminar on Stochastic Analysis, Random Fields and Applications VI, pages 109–118, Basel, 2011.
  • Hardt et al. (2016) M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234. PMLR, 2016.
  • Kozachkov et al. (2023) L. Kozachkov, P. M. Wensing, and J.-J. Slotine. Generalization as dynamical robustness–The role of Riemannian contraction in supervised learning. Transactions on Machine Learning Research, 4:1–25, 2023.
  • Lei and Ying (2020) Y. Lei and Y. Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, volume 119, pages 5809–5819. PMLR, 2020.
  • Lessard et al. (2016) L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
  • Li et al. (2019) J. Li, X. Luo, and M. Qiao. On generalization error bounds of noisy gradient methods for non-convex learning. arXiv preprint arXiv:1902.00621, 2019.
  • Lin and Rosasco (2017) J. Lin and L. Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(1):3375–3421, 2017.
  • Liu et al. (2017) T. Liu, G. Lugosi, G. Neu, and D. Tao. Algorithmic stability and hypothesis complexity. In International Conference on Machine Learning, pages 2159–2167. PMLR, 2017.
  • Liu et al. (2020) Y. Liu, Y. Gao, and W. Yin. An improved analysis of stochastic gradient descent with momentum. In Advances in Neural Information Processing Systems, volume 33, pages 18261–18271, 2020.
  • Meyn and Tweedie (1993) S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Communications and Control Engineering Series. Springer-Verlag, London, 1993.
  • Meyn and Tweedie (1994) S. P. Meyn and R. L. Tweedie. Computable bounds for geometric convergence rates of Markov chains. Annals of Applied Probability, 4(4):981–1011, 1994.
  • Mou et al. (2018) W. Mou, L. Wang, X. Zhai, and K. Zheng. Generalization bounds of SGLD for non-convex learning: Two theoretical viewpoints. In Conference on Learning Theory, pages 605–638. PMLR, 2018.
  • Pillaud-Vivien et al. (2018) L. Pillaud-Vivien, A. Rudi, and F. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems, volume 31, 2018.
  • Raginsky et al. (2016) M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu. Information-theoretic analysis of stability and bias of learning algorithms. In 2016 IEEE Information Theory Workshop (ITW), pages 26–30. IEEE, 2016.
  • Raginsky et al. (2017) M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. In Conference on Learning Theory, pages 1674–1703. PMLR, 2017.
  • Raj et al. (2023a) A. Raj, M. Barsbey, M. Gürbüzbalaban, L. Zhu, and U. Şimşekli. Algorithmic stability of heavy-tailed stochastic gradient descent on least squares. In International Conference on Algorithmic Learning Theory, volume 201, pages 1292–1342. PMLR, 2023a.
  • Raj et al. (2023b) A. Raj, L. Zhu, M. Gürbüzbalaban, and U. Şimşekli. Algorithmic stability of heavy-tailed SGD with general loss functions. In International Conference on Machine Learning, volume 202, pages 28578–28597. PMLR, 2023b.
  • Rudolf and Schweizer (2018) D. Rudolf and N. Schweizer. Perturbation theory for Markov chains via Wasserstein distance. Bernoulli, 24(4A):2610–2639, 2018.
  • Villani (2009) C. Villani. Optimal Transport: Old and New. Springer, Berlin, 2009.
  • Welling and Teh (2011) M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
  • Xu and Raginsky (2017) A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, volume 30, 2017.
  • Zhang et al. (2022) Y. Zhang, W. Zhang, S. Bald, V. Pingali, C. Chen, and M. Goswami. Stability of SGD: Tightness analysis and improved bounds. In Uncertainty in Artificial Intelligence, pages 2364–2373. PMLR, 2022.

Uniform-in-Time Wasserstein Stability Bounds
for (Noisy) Stochastic Gradient Descent
APPENDIX

The Appendix is organized as follows:

  • In Section A, we provide further details and examples about the usage of surrogate losses.

  • In Section B, we provide technical background for the computable bounds for the convergence of Markov chains which will be used to prove the results in Section 3.3 in the main paper.

  • In Section C, we provide technical proofs for 1-Wasserstein perturbation results for the quadratic loss in Section 3.1 in the main paper.

  • In Section D, we provide technical proofs for 1-Wasserstein perturbation results for the strongly-convex loss in Section 3.2 in the main paper.

  • In Section E, we provide technical proofs for 1-Wasserstein perturbation results for the non-convex loss (with additive noise) in Section 3.3 in the main paper.

  • In Section F, we provide technical proofs for 2-Wasserstein stability bounds for the non-convex loss without additive noise in Section 4.1 in the main paper.

  • In Section G, we provide technical proofs for p-Wasserstein stability bounds for the convex loss with additional geometric structure in Section 4.2 in the main paper.

A On the Usage of Surrogate Losses

While the requirement of surrogate losses is a drawback of our framework, our setup nevertheless covers several practical settings. In this section, we provide two such examples.

Example 1.

We can choose the surrogate loss as the truncated loss, such that:

(θ,x)=min(f(θ,x),C),\ell(\theta,x)=\min(f(\theta,x),C),

where C>0C>0 is a chosen constant. This can be seen as a “robust” version of the original loss, which has been widely used in robust optimization and is conceptually similar to adding a projection step to the optimizer.

Example 2.

Another natural setup for our framework is the 2\ell_{2}-regularized Lipschitz loss that was also used in Farghly and Rebeschini (2021). As opposed to the previous case, for the sake of this example, let us consider \ell as the true loss and ff as the surrogate loss. Then, we can choose the pair ff and \ell as follows:

f(θ,x)=(θ,x)+μ2θ22,f(\theta,x)=\ell(\theta,x)+\frac{\mu}{2}\|\theta\|_{2}^{2},

where μ>0\mu>0. Intuitively, this setting means that we have a true loss \ell, which can be merely Lipschitz, but in the optimization framework we consider a regularized version of it: we run the algorithm on the regularized loss ff to obtain better convergence properties, and we would like to understand whether the algorithm generalizes with respect to \ell; we are typically not interested in whether it generalizes well on the regularized loss ff.
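
For concreteness, the two constructions above can be written down directly. The following Python sketch, with a hypothetical data format x=(a,y) and an absolute-error loss standing in for \ell, only illustrates the truncation of Example 1 and the regularization of Example 2; it is not part of our analysis.

import numpy as np

def ell(theta, x):
    # A hypothetical Lipschitz "true" loss on a data point x = (a, y): absolute error.
    a, y = x
    return abs(np.dot(a, theta) - y)

def truncated_surrogate(loss, theta, x, C=10.0):
    # Example 1: surrogate loss min(loss(theta, x), C) for a chosen truncation level C.
    return min(loss(theta, x), C)

def regularized_surrogate(theta, x, mu=0.1):
    # Example 2: the algorithm is run on f = ell + (mu/2)||theta||^2,
    # while generalization is measured with respect to the Lipschitz loss ell.
    return ell(theta, x) + 0.5 * mu * float(np.dot(theta, theta))

# Tiny usage example on a single data point.
theta = np.array([1.0, -2.0])
x = (np.array([0.5, 0.3]), 1.0)
print(truncated_surrogate(ell, theta, x, C=2.0), regularized_surrogate(theta, x))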

Next, we illustrate how a generalization bound for the loss ff, i.e., a bound on |𝔼[F^(θ)F(θ)]|\left|\mathbb{E}[\hat{F}(\theta)-F(\theta)]\right|, can be obtained for this example by building on our analysis. To obtain such a bound, in addition to the bounds that we developed on |𝔼[R^(θ)R(θ)]|\left|\mathbb{E}[\hat{R}(\theta)-R(\theta)]\right|, we would need to estimate the following quantity:

|𝔼θ,Xn[1ni=1n(f(θ,xi)(θ,xi))]|.\left|\mathbb{E}_{\theta,X_{n}}\left[\frac{1}{n}\sum_{i=1}^{n}\left(f\left(\theta,x_{i}\right)-\ell\left(\theta,x_{i}\right)\right)\right]\right|.

For illustration purposes, assume that \ell is convex and Lipschitz in the first parameter. Then, ff is μ\mu-strongly convex. Further consider that we initialize SGD from 0, i.e., θ0=0\theta_{0}=0 and set the batch size bb to 1. Denote θ=θk\theta=\theta_{k} as the kk-th iterate of SGD when applied on F^(θ,Xn)\hat{F}\left(\theta,X_{n}\right), i.e., (1). Further define the minimum:

θXn=argminθF^(θ,Xn).\theta_{X_{n}}^{\star}=\arg\min_{\theta}\hat{F}\left(\theta,X_{n}\right).

We can now analyze the error induced by the surrogate loss as follows:

|𝔼θ,Xn[1ni=1n(f(θ,xi)(θ,xi))]|\displaystyle\left|\mathbb{E}_{\theta,X_{n}}\left[\frac{1}{n}\sum_{i=1}^{n}\left(f\left(\theta,x_{i}\right)-\ell\left(\theta,x_{i}\right)\right)\right]\right| =μ2𝔼θ,Xnθ2=μ2𝔼θ,XnθθXn+θXn2\displaystyle=\frac{\mu}{2}\mathbb{E}_{\theta,X_{n}}\|\theta\|^{2}=\frac{\mu}{2}\mathbb{E}_{\theta,X_{n}}\left\|\theta-\theta_{X_{n}}^{\star}+\theta_{X_{n}}^{\star}\right\|^{2}
μ𝔼θ,XnθθXn2+μ𝔼XnθXn2\displaystyle\leq\mu\mathbb{E}_{\theta,X_{n}}\left\|\theta-\theta_{X_{n}}^{\star}\right\|^{2}+\mu\mathbb{E}_{X_{n}}\left\|\theta_{X_{n}}^{\star}\right\|^{2}
μ𝔼Xn[(1ημ)kθXn2+2ημσXn]+μ𝔼XnθXn2\displaystyle\leq\mu\mathbb{E}_{X_{n}}\left[(1-\eta\mu)^{k}\left\|\theta_{X_{n}}^{\star}\right\|^{2}+\frac{2\eta}{\mu}\sigma_{X_{n}}\right]+\mu\mathbb{E}_{X_{n}}\left\|\theta_{X_{n}}^{\star}\right\|^{2}
=μ((1ημ)k+1)𝔼XnθXn2+2η𝔼Xn[σXn].\displaystyle=\mu\left((1-\eta\mu)^{k}+1\right)\mathbb{E}_{X_{n}}\left\|\theta_{X_{n}}^{\star}\right\|^{2}+2\eta\mathbb{E}_{X_{n}}\left[\sigma_{X_{n}}\right].

Here, the second inequality follows from standard convergence analysis for SGD (Garrigos and Gower, 2023, Theorem 5.7) and we define σXn\sigma_{X_{n}} as the stochastic gradient noise variance:

σXn:=Var[f(θXn,xi)],\sigma_{X_{n}}:=\operatorname{Var}\left[\nabla f\left(\theta_{X_{n}}^{\star},x_{i}\right)\right],

where for a random vector VV we define Var[V]:=𝔼V𝔼[V]2\operatorname{Var}[V]:=\mathbb{E}\|V-\mathbb{E}[V]\|^{2}. Hence, we can see that the error induced by the surrogate loss depends on the following factors:

  • The regularization parameter μ\mu,

  • The expected norm of the minimizers,

  • The step-size η\eta,

  • The expected stochastic gradient noise variance.

These terms can be controlled by adjusting μ\mu and η\eta.
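
As a numerical sanity check of the above estimate, the following Python sketch runs single-sample SGD on an \ell_{2}-regularized logistic loss (a hypothetical instance in which \ell is convex and Lipschitz, so that ff is strongly convex) over a single synthetic dataset, and compares a Monte Carlo estimate of the surrogate error with the bound on the last line of the display above. The data-generating process and all numerical constants below are arbitrary choices made for illustration only.

import numpy as np

rng = np.random.default_rng(0)
n, d, mu, eta, k_steps, n_runs = 200, 5, 0.5, 0.05, 400, 200

# Synthetic bounded features and labels; ell is the logistic loss (Lipschitz),
# and f = ell + (mu/2)||theta||^2 is mu-strongly convex.
A = rng.normal(size=(n, d))
A /= np.maximum(1.0, np.linalg.norm(A, axis=1, keepdims=True))   # ||a_i|| <= 1
w_true = rng.normal(size=d)
y = np.sign(A @ w_true + 0.1 * rng.normal(size=n))

def grad_f(theta, i):
    s = -y[i] / (1.0 + np.exp(y[i] * (A[i] @ theta)))             # logistic part
    return s * A[i] + mu * theta

def full_grad(theta):
    s = -y / (1.0 + np.exp(y * (A @ theta)))
    return (s[:, None] * A).mean(axis=0) + mu * theta

# Empirical risk minimizer theta_*^{X_n} via full gradient descent.
theta_star = np.zeros(d)
for _ in range(2000):
    theta_star -= 0.5 * full_grad(theta_star)
norm2_star = float(theta_star @ theta_star)

# Stochastic gradient noise variance sigma_{X_n} at the minimizer.
G = np.array([grad_f(theta_star, i) for i in range(n)])
sigma = float(np.mean(np.sum((G - G.mean(axis=0)) ** 2, axis=1)))

# Monte Carlo estimate of (mu/2) E||theta_k||^2 for SGD started at theta_0 = 0, b = 1.
gaps = []
for _ in range(n_runs):
    theta = np.zeros(d)
    for _ in range(k_steps):
        theta -= eta * grad_f(theta, rng.integers(n))
    gaps.append(0.5 * mu * float(theta @ theta))

bound = mu * ((1 - eta * mu) ** k_steps + 1) * norm2_star + 2 * eta * sigma
print("empirical surrogate gap:", np.mean(gaps), " theoretical bound:", bound)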

B Technical Background

B.1 Computable bounds for the convergence of Markov chains

Geometric ergodicity and convergence rates of Markov chains have been well studied in the literature (Meyn and Tweedie, 1993, 1994; Hairer and Mattingly, 2011). In this section, we state a result from Hairer and Mattingly (2011) that provides an explicitly computable bound on the Wasserstein contraction for Markov chains that satisfy a drift condition, which relies on the construction of an appropriate Lyapunov function, together with a minorization condition.

Let 𝒫(θ,)\mathcal{P}(\theta,\cdot) be a Markov transition kernel for a Markov chain (θk)(\theta_{k}) on d\mathbb{R}^{d}. For any measurable function φ:d[0,+]\varphi:\mathbb{R}^{d}\rightarrow[0,+\infty], we define:

(𝒫φ)(θ)=dφ(θ~)𝒫(θ,dθ~).(\mathcal{P}\varphi)(\theta)=\int_{\mathbb{R}^{d}}\varphi(\tilde{\theta})\mathcal{P}(\theta,d\tilde{\theta}).
Assumption 7 (Drift Condition)

There exists a function V:d[0,)V:\mathbb{R}^{d}\rightarrow[0,\infty) and some constants K0K\geq 0 and γ(0,1)\gamma\in(0,1) so that

(𝒫V)(θ)γV(θ)+K,(\mathcal{P}V)(\theta)\leq\gamma V(\theta)+K,

for all θd\theta\in\mathbb{R}^{d}.

Assumption 8 (Minorization Condition)

There exists some constant η^(0,1)\hat{\eta}\in(0,1) and a probability measure ν\nu so that

infθd:V(θ)R𝒫(θ,)η^ν(),\inf_{\theta\in\mathbb{R}^{d}:V(\theta)\leq R}\mathcal{P}(\theta,\cdot)\geq\hat{\eta}\nu(\cdot),

for some R>2K/(1γ)R>2K/(1-\gamma).

We define the weighted total variation distance:

dψ(μ1,μ2)=d(1+ψV(θ))|μ1μ2|(dθ),d_{\psi}(\mu_{1},\mu_{2})=\int_{\mathbb{R}^{d}}(1+\psi V(\theta))|\mu_{1}-\mu_{2}|(d\theta),

where ψ>0\psi>0 and V(θ)V(\theta) is the Lyapunov function that satisfies the drift condition (Assumption 7). It is known that dψd_{\psi} has the following alternative expression (Hairer and Mattingly, 2011):

dψ(μ1,μ2)=supφ:φψ1dφ(θ)(μ1μ2)(dθ),d_{\psi}(\mu_{1},\mu_{2})=\sup_{\varphi:\|\varphi\|_{\psi}\leq 1}\int_{\mathbb{R}^{d}}\varphi(\theta)(\mu_{1}-\mu_{2})(d\theta),

where ψ\|\cdot\|_{\psi} is the weighted supremum norm such that for any ψ>0\psi>0:

φψ:=supθd|φ(θ)|1+ψV(θ).\|\varphi\|_{\psi}:=\sup_{\theta\in\mathbb{R}^{d}}\frac{|\varphi(\theta)|}{1+\psi V(\theta)}.

It is also noted in Hairer and Mattingly (2011) that dψd_{\psi} has yet another equivalent expression:

dψ(μ1,μ2)=supφ:|φ|ψ1dφ(θ)(μ1μ2)(dθ),d_{\psi}(\mu_{1},\mu_{2})=\sup_{\varphi:|\|\varphi\||_{\psi}\leq 1}\int_{\mathbb{R}^{d}}\varphi(\theta)(\mu_{1}-\mu_{2})(d\theta),

where

|φ|ψ:=supθθ~|φ(θ)φ(θ~)|2+ψV(θ)+ψV(θ~).|\|\varphi\||_{\psi}:=\sup_{\theta\neq\tilde{\theta}}\frac{|\varphi(\theta)-\varphi(\tilde{\theta})|}{2+\psi V(\theta)+\psi V(\tilde{\theta})}.
Lemma 14 (Theorem 1.3 of Hairer and Mattingly (2011))

If the drift condition (Assumption 7) and minorization condition (Assumption 8) hold, then there exists η¯(0,1)\bar{\eta}\in(0,1) and ψ>0\psi>0 so that

dψ(𝒫μ1,𝒫μ2)η¯dψ(μ1,μ2)d_{\psi}(\mathcal{P}\mu_{1},\mathcal{P}\mu_{2})\leq\bar{\eta}d_{\psi}(\mu_{1},\mu_{2})

for any probability measures μ1,μ2\mu_{1},\mu_{2} on d\mathbb{R}^{d}. In particular, for any η0(0,η^)\eta_{0}\in(0,\hat{\eta}) and γ0(γ+2K/R,1)\gamma_{0}\in(\gamma+2K/R,1) one can choose ψ=η0/K\psi=\eta_{0}/K and η¯=(1(η^η0))(2+Rψγ0)/(2+Rψ)\bar{\eta}=(1-(\hat{\eta}-\eta_{0}))\vee(2+R\psi\gamma_{0})/(2+R\psi).
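
The constants in Lemma 14 are fully explicit once γ\gamma, KK, η^\hat{\eta} and RR are available. The following Python sketch simply evaluates the admissible choices of \psi and \bar{\eta} stated above; the numerical inputs in the usage line are hypothetical and only illustrate how the contraction factor is obtained.

def hairer_mattingly_constants(gamma, K, eta_hat, R, eta0=None, gamma0=None):
    # Drift condition: (P V)(theta) <= gamma * V(theta) + K with gamma in (0,1), K > 0.
    # Minorization holds on {V <= R} with constant eta_hat, where R > 2K / (1 - gamma).
    assert 0 < gamma < 1 and K > 0 and 0 < eta_hat < 1 and R > 2 * K / (1 - gamma)
    if eta0 is None:
        eta0 = eta_hat / 2.0                       # any eta0 in (0, eta_hat)
    if gamma0 is None:
        gamma0 = (gamma + 2 * K / R + 1.0) / 2.0   # any gamma0 in (gamma + 2K/R, 1)
    psi = eta0 / K
    eta_bar = max(1 - (eta_hat - eta0), (2 + R * psi * gamma0) / (2 + R * psi))
    return psi, eta_bar

# Hypothetical constants, chosen only to illustrate the computation:
psi, eta_bar = hairer_mattingly_constants(gamma=0.9, K=1.0, eta_hat=0.3, R=25.0)
print(psi, eta_bar)    # eta_bar < 1 is the geometric contraction rate in d_psi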

C Proofs of Wasserstein Perturbation Results: Quadratic Case

C.1 Proof of Lemma 4

Proof  Let Pk(θ,)P^{k}(\theta,\cdot) denote the law of θk\theta_{k} starting with θ0=θ\theta_{0}=\theta and Pk(θ~,)P^{k}(\tilde{\theta},\cdot) the law of θ~k\tilde{\theta}_{k}:

θ~k=(IηbHk)θ~k1+ηbqk,\tilde{\theta}_{k}=\left(I-\frac{\eta}{b}H_{k}\right)\tilde{\theta}_{k-1}+\frac{\eta}{b}q_{k}, (27)

with θ~0=θ~\tilde{\theta}_{0}=\tilde{\theta}. Note that

θk=(IηbHk)θk1+ηbqk,\displaystyle\theta_{k}=\left(I-\frac{\eta}{b}H_{k}\right)\theta_{k-1}+\frac{\eta}{b}q_{k}, (28)
θ~k=(IηbHk)θ~k1+ηbqk,\displaystyle\tilde{\theta}_{k}=\left(I-\frac{\eta}{b}H_{k}\right)\tilde{\theta}_{k-1}+\frac{\eta}{b}q_{k}, (29)

which implies that

𝔼θkθ~k\displaystyle\mathbb{E}\left\|\theta_{k}-\tilde{\theta}_{k}\right\| =𝔼(IηbHk)(θk1θ~k1)\displaystyle=\mathbb{E}\left\|\left(I-\frac{\eta}{b}H_{k}\right)\left(\theta_{k-1}-\tilde{\theta}_{k-1}\right)\right\|
𝔼[IηbHkθk1θ~k1]=ρ𝔼θk1θ~k1.\displaystyle\leq\mathbb{E}\left[\left\|I-\frac{\eta}{b}H_{k}\right\|\left\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\|\right]=\rho\mathbb{E}\left\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\|. (30)

By iterating over j=k,k1,1j=k,k-1,\ldots 1, we conclude that

\mathcal{W}_{1}\left(P^{k}(\theta,\cdot),P^{k}(\tilde{\theta},\cdot)\right)\leq\mathbb{E}\|\theta_{k}-\tilde{\theta}_{k}\|\leq\rho^{k}\|\theta_{0}-\tilde{\theta}_{0}\|=\rho^{k}\|\theta-\tilde{\theta}\|. (31)

This completes the proof.  
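
The contraction argument above can also be checked numerically: two SGD chains on a synthetic least-squares problem are driven by the same minibatches (a synchronous coupling), and the average distance between them is compared with \rho^{k}\|\theta-\tilde{\theta}\|. In the Python sketch below, \rho is estimated as the Monte Carlo average of the operator norm of I-(\eta/b)H_{1}, which is how it enters (30); the data and the step-size are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(1)
n, d, b, eta, k_steps, n_runs = 100, 3, 10, 0.1, 50, 500

A = rng.normal(size=(n, d))
A /= np.maximum(1.0, np.linalg.norm(A, axis=1, keepdims=True))   # ||a_i|| <= 1
yv = rng.normal(size=n)

def batch_Hq(idx):
    Ab = A[idx]
    return Ab.T @ Ab, Ab.T @ yv[idx]

# Monte Carlo estimate of rho = E||I - (eta/b) H_1|| (operator norm), as used in (30).
rho = np.mean([np.linalg.norm(np.eye(d) - (eta / b) * batch_Hq(rng.choice(n, b, replace=False))[0], 2)
               for _ in range(2000)])

theta0, theta0_tilde = np.ones(d), -np.ones(d)
dists = np.zeros(k_steps)
for _ in range(n_runs):
    th, th_t = theta0.copy(), theta0_tilde.copy()
    for k in range(k_steps):
        idx = rng.choice(n, b, replace=False)        # same minibatch for both chains
        H, q = batch_Hq(idx)
        th = th - (eta / b) * (H @ th) + (eta / b) * q
        th_t = th_t - (eta / b) * (H @ th_t) + (eta / b) * q
        dists[k] += np.linalg.norm(th - th_t)
dists /= n_runs

for k in (0, 9, 24, 49):
    print(k + 1, dists[k], rho ** (k + 1) * np.linalg.norm(theta0 - theta0_tilde))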

C.2 Proof of Lemma 5

Proof  First, we recall that

θ^k=(IηbH^k)θ^k1+ηbq^k,\hat{\theta}_{k}=\left(I-\frac{\eta}{b}\hat{H}_{k}\right)\hat{\theta}_{k-1}+\frac{\eta}{b}\hat{q}_{k}, (32)

where H^k:=iΩka^ia^i\hat{H}_{k}:=\sum_{i\in\Omega_{k}}\hat{a}_{i}\hat{a}_{i}^{\top} and q^k:=iΩka^iy^i\hat{q}_{k}:=\sum_{i\in\Omega_{k}}\hat{a}_{i}\hat{y}_{i}. Therefore, starting with θ^0=θ\hat{\theta}_{0}=\theta, we have

θ^1=(IηbH^1)θ+ηbq^1,\hat{\theta}_{1}=\left(I-\frac{\eta}{b}\hat{H}_{1}\right)\theta+\frac{\eta}{b}\hat{q}_{1}, (33)

which implies that

(P^V^)(θ)=𝔼V^(θ^1)=1+𝔼θ^11+ρ^θ+ηb𝔼q^1=ρ^V^(θ)+1ρ^+ηb𝔼q^1.\displaystyle(\hat{P}\hat{V})(\theta)=\mathbb{E}\hat{V}(\hat{\theta}_{1})=1+\mathbb{E}\|\hat{\theta}_{1}\|\leq 1+\hat{\rho}\|\theta\|+\frac{\eta}{b}\mathbb{E}\|\hat{q}_{1}\|=\hat{\rho}\hat{V}(\theta)+1-\hat{\rho}+\frac{\eta}{b}\mathbb{E}\|\hat{q}_{1}\|. (34)

This completes the proof.  

C.3 Proof of Lemma 6

Proof  Let us recall that

θ1=(IηbH1)θ+ηbq1,\displaystyle\theta_{1}=\left(I-\frac{\eta}{b}H_{1}\right)\theta+\frac{\eta}{b}q_{1}, (35)
θ^1=(IηbH^1)θ+ηbq^1,\displaystyle\hat{\theta}_{1}=\left(I-\frac{\eta}{b}\hat{H}_{1}\right)\theta+\frac{\eta}{b}\hat{q}_{1}, (36)

which implies that

𝒲1(δθP,δθP^)𝔼H1H^1ηbθ+ηb𝔼q1q^1.\mathcal{W}_{1}\left(\delta_{\theta}P,\delta_{\theta}\hat{P}\right)\leq\mathbb{E}\left\|H_{1}-\hat{H}_{1}\right\|\frac{\eta}{b}\|\theta\|+\frac{\eta}{b}\mathbb{E}\left\|q_{1}-\hat{q}_{1}\right\|. (37)

Since XnX_{n} and X^n\hat{X}_{n} differ by at most one element and supx𝒳xD\sup_{x\in\mathcal{X}}\|x\|\leq D for some D<D<\infty, we have (H1,q1)=(H^1,q^1)(H_{1},q_{1})=(\hat{H}_{1},\hat{q}_{1}) with probability nbn\frac{n-b}{n} and (H1,q1)(H^1,q^1)(H_{1},q_{1})\neq(\hat{H}_{1},\hat{q}_{1}) with probability bn\frac{b}{n} and moreover

𝔼H1H^1bnmax1inaiaia^ia^ibnmax1in(ai2+a^i2)2bD2n,\mathbb{E}\left\|H_{1}-\hat{H}_{1}\right\|\leq\frac{b}{n}\max_{1\leq i\leq n}\left\|a_{i}a_{i}^{\top}-\hat{a}_{i}\hat{a}_{i}^{\top}\right\|\leq\frac{b}{n}\max_{1\leq i\leq n}\left(\|a_{i}\|^{2}+\|\hat{a}_{i}\|^{2}\right)\leq\frac{2bD^{2}}{n}, (38)

and

\mathbb{E}\left\|q_{1}-\hat{q}_{1}\right\|\leq\frac{b}{n}\max_{1\leq i\leq n}\left\|a_{i}y_{i}-\hat{a}_{i}\hat{y}_{i}\right\|\leq\frac{b}{n}\max_{1\leq i\leq n}\left(\|a_{i}\||y_{i}|+\|\hat{a}_{i}\||\hat{y}_{i}|\right)\leq\frac{2bD^{2}}{n}. (39)

Hence, we conclude that

supθd𝒲1(δθP,δθP^)V^(θ)supθdηb(θ+1)2bD2n1+θ=2ηD2n.\sup_{\theta\in\mathbb{R}^{d}}\frac{\mathcal{W}_{1}(\delta_{\theta}P,\delta_{\theta}\hat{P})}{\hat{V}(\theta)}\leq\sup_{\theta\in\mathbb{R}^{d}}\frac{\frac{\eta}{b}(\|\theta\|+1)\frac{2bD^{2}}{n}}{1+\|\theta\|}=\frac{2\eta D^{2}}{n}. (40)

This completes the proof.  
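
The one-step perturbation estimates (38)–(40) can also be verified numerically. The Python sketch below builds a synthetic dataset and a neighbouring dataset that differ in a single (bounded) sample, and compares Monte Carlo estimates of \mathbb{E}\|H_{1}-\hat{H}_{1}\| and \mathbb{E}\|q_{1}-\hat{q}_{1}\| with 2bD^{2}/n; the data and the index of the replaced sample are arbitrary choices for illustration.

import numpy as np

rng = np.random.default_rng(2)
n, d, b, D, trials = 100, 3, 10, 1.0, 10000

def bounded(v, D):
    return D * v / max(D, np.linalg.norm(v))          # project onto the ball of radius D

A = np.array([bounded(rng.normal(size=d), D) for _ in range(n)])
yv = np.clip(rng.normal(size=n), -D, D)

# Neighbouring dataset: replace a single data point (index 0), keeping it bounded by D.
A_hat, y_hat = A.copy(), yv.copy()
A_hat[0] = bounded(rng.normal(size=d), D)
y_hat[0] = -yv[0]

dH, dq = 0.0, 0.0
for _ in range(trials):
    idx = rng.choice(n, b, replace=False)
    H, q = A[idx].T @ A[idx], A[idx].T @ yv[idx]
    H_hat, q_hat = A_hat[idx].T @ A_hat[idx], A_hat[idx].T @ y_hat[idx]
    dH += np.linalg.norm(H - H_hat, 2)
    dq += np.linalg.norm(q - q_hat)

print("E||H1 - H1_hat|| ~", dH / trials,
      "  E||q1 - q1_hat|| ~", dq / trials,
      "  bound 2bD^2/n =", 2 * b * D ** 2 / n)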

D Proofs of Wasserstein Perturbation Results: Strongly Convex Case

In order to obtain the algorithmic stability bound, that is, a bound on the 1-Wasserstein distance between the distributions of θk\theta_{k} and θ^k\hat{\theta}_{k}, we need to establish a sequence of technical lemmas. First, we show a 1-Wasserstein contraction rate in the following lemma.

Lemma 15

Assume that Assumption 1 and Assumption 2 hold, and further assume that η<min{1μ,μK12}\eta<\min\left\{\frac{1}{\mu},\frac{\mu}{K_{1}^{2}}\right\}. Then, for any nn\in\mathbb{N},

𝒲1(Pn(θ,),Pn(θ~,))(1ημ2)nθθ~.\mathcal{W}_{1}\left(P^{n}(\theta,\cdot),P^{n}\left(\tilde{\theta},\cdot\right)\right)\leq\left(1-\frac{\eta\mu}{2}\right)^{n}\|\theta-\tilde{\theta}\|. (41)

Proof  Let Pk(θ,)P^{k}(\theta,\cdot) denote the law of θk\theta_{k} starting with θ0=θ\theta_{0}=\theta:

θk=θk1η~F^k(θk1,Xn),\theta_{k}=\theta_{k-1}-\eta\tilde{\nabla}\hat{F}_{k}(\theta_{k-1},X_{n}), (42)

and Pk(θ~,)P^{k}(\tilde{\theta},\cdot) the law of θ~k\tilde{\theta}_{k}:

θ~k=θ~k1η~F^k(θ~k1,Xn),\tilde{\theta}_{k}=\tilde{\theta}_{k-1}-\eta\tilde{\nabla}\hat{F}_{k}\left(\tilde{\theta}_{k-1},X_{n}\right), (43)

with θ~0=θ~\tilde{\theta}_{0}=\tilde{\theta}. Note that

θk=θk1ηbiΩkf(θk1,xi),\displaystyle\theta_{k}=\theta_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f(\theta_{k-1},x_{i}), (44)
θ~k=θ~k1ηbiΩkf(θ~k1,xi).\displaystyle\tilde{\theta}_{k}=\tilde{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\tilde{\theta}_{k-1},x_{i}\right). (45)

Therefore, we have

𝔼θkθ~k2\displaystyle\mathbb{E}\left\|\theta_{k}-\tilde{\theta}_{k}\right\|^{2} =𝔼θk1θ~k1ηbiΩk(f(θk1,xi)f(θ~k1,xi))2\displaystyle=\mathbb{E}\left\|\theta_{k-1}-\tilde{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\tilde{\theta}_{k-1},x_{i}\right)\right)\right\|^{2}
=𝔼θk1θ~k12+η2b2𝔼iΩk(f(θk1,xi)f(θ~k1,xi))2\displaystyle=\mathbb{E}\left\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\|^{2}+\frac{\eta^{2}}{b^{2}}\mathbb{E}\left\|\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\tilde{\theta}_{k-1},x_{i}\right)\right)\right\|^{2}
2ηb𝔼θk1θ~k1,iΩk(f(θk1,xi)f(θ~k1,xi)).\displaystyle\qquad\qquad-\frac{2\eta}{b}\mathbb{E}\left\langle\theta_{k-1}-\tilde{\theta}_{k-1},\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\tilde{\theta}_{k-1},x_{i}\right)\right)\right\rangle. (46)

By applying Assumption 1 and Assumption 2, we get

𝔼θkθ~k2\displaystyle\mathbb{E}\left\|\theta_{k}-\tilde{\theta}_{k}\right\|^{2}
(12ημ)𝔼θk1θ~k12+η2b2𝔼[(iΩkf(θk1,xi)f(θ~k1,xi))2]\displaystyle\leq(1-2\eta\mu)\mathbb{E}\left\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\|^{2}+\frac{\eta^{2}}{b^{2}}\mathbb{E}\left[\left(\sum_{i\in\Omega_{k}}\left\|\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\tilde{\theta}_{k-1},x_{i}\right)\right\|\right)^{2}\right]
(12ημ)𝔼θk1θ~k12+η2K12𝔼θk1θ~k12\displaystyle\leq(1-2\eta\mu)\mathbb{E}\left\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\|^{2}+\eta^{2}K_{1}^{2}\mathbb{E}\left\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\|^{2}
(1ημ)𝔼θk1θ~k12,\displaystyle\leq(1-\eta\mu)\mathbb{E}\left\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\|^{2}, (47)

provided that ημK12\eta\leq\frac{\mu}{K_{1}^{2}}. By iterating over k=n,n1,1k=n,n-1,\ldots 1, we conclude that

(𝒲1(Pn(θ,),Pn(θ~,)))2\displaystyle\left(\mathcal{W}_{1}\left(P^{n}(\theta,\cdot),P^{n}(\tilde{\theta},\cdot)\right)\right)^{2} (𝒲2(Pn(θ,),Pn(θ~,)))2\displaystyle\leq\left(\mathcal{W}_{2}\left(P^{n}(\theta,\cdot),P^{n}(\tilde{\theta},\cdot)\right)\right)^{2}
𝔼θnθ~n2(1ημ)nθ0θ~02(1ημ2)2nθθ~2.\displaystyle\leq\mathbb{E}\left\|\theta_{n}-\tilde{\theta}_{n}\right\|^{2}\leq(1-\eta\mu)^{n}\left\|\theta_{0}-\tilde{\theta}_{0}\right\|^{2}\leq\left(1-\frac{\eta\mu}{2}\right)^{2n}\left\|\theta-\tilde{\theta}\right\|^{2}. (48)

This completes the proof.  

Next, we construct a Lyapunov function V^\hat{V} and obtain a drift condition for the SGD (θ^k)k=0(\hat{\theta}_{k})_{k=0}^{\infty}.

Lemma 16

Assume that Assumption 1 and Assumption 2 hold. Let V^(θ):=1+θθ^2\hat{V}(\theta):=1+\|\theta-\hat{\theta}_{\ast}\|^{2}, where θ^\hat{\theta}_{\ast} is the minimizer of F^(θ,X^n):=1ni=1nf(θ,x^i)\hat{F}(\theta,\hat{X}_{n}):=\frac{1}{n}\sum_{i=1}^{n}f(\theta,\hat{x}_{i}). Assume that η<min{1μ,μK12+64D2K22}\eta<\min\left\{\frac{1}{\mu},\frac{\mu}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\} and supx𝒳xD\sup_{x\in\mathcal{X}}\|x\|\leq D for some D<D<\infty. Then, we have

(P^V^)(θ)(1ημ)V^(θ)+2ημη2K1256η2D2K22+64η2D2K22θ^2.(\hat{P}\hat{V})(\theta)\leq(1-\eta\mu)\hat{V}(\theta)+2\eta\mu-\eta^{2}K_{1}^{2}-56\eta^{2}D^{2}K_{2}^{2}+64\eta^{2}D^{2}K_{2}^{2}\|\hat{\theta}_{\ast}\|^{2}. (49)

Proof  First, we recall that

θ^k=θ^k1ηbiΩkf(θ^k1,x^i).\hat{\theta}_{k}=\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right). (50)

Therefore, starting with θ^0=θ\hat{\theta}_{0}=\theta, we have

θ^1\displaystyle\hat{\theta}_{1} =θηbiΩ1f(θ,x^i)\displaystyle=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})
=θηni=1nf(θ,x^i)+η(1ni=1nf(θ,x^i)1biΩ1f(θ,x^i)).\displaystyle=\theta-\frac{\eta}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})+\eta\left(\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})-\frac{1}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})\right). (51)

Moreover, we have

𝔼[1biΩ1f(θ,x^i)|θ]=1ni=1nf(θ,x^i).\mathbb{E}\left[\frac{1}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})\Big{|}\theta\right]=\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i}). (52)

This implies that

(P^V^)(θ)\displaystyle(\hat{P}\hat{V})(\theta)
=𝔼V^(θ^1)=1+𝔼θ^1θ^2\displaystyle=\mathbb{E}\hat{V}(\hat{\theta}_{1})=1+\mathbb{E}\left\|\hat{\theta}_{1}-\hat{\theta}_{\ast}\right\|^{2}
\displaystyle=1+\left\|\theta-\hat{\theta}_{\ast}-\frac{\eta}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})\right\|^{2}+\eta^{2}\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})-\frac{1}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})\right\|^{2}. (53)

We can compute that

\displaystyle\left\|\theta-\hat{\theta}_{\ast}-\frac{\eta}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})\right\|^{2}
\displaystyle=\left\|\theta-\hat{\theta}_{\ast}-\frac{\eta}{n}\sum_{i=1}^{n}\left(\nabla f(\theta,\hat{x}_{i})-\nabla f\left(\hat{\theta}_{\ast},\hat{x}_{i}\right)\right)\right\|^{2}
(12ημ)θθ^2+η2n2i=1n(f(θ,x^i)f(θ^,x^i))2\displaystyle\leq(1-2\eta\mu)\|\theta-\hat{\theta}_{\ast}\|^{2}+\frac{\eta^{2}}{n^{2}}\left\|\sum_{i=1}^{n}\left(\nabla f(\theta,\hat{x}_{i})-\nabla f\left(\hat{\theta}_{\ast},\hat{x}_{i}\right)\right)\right\|^{2}
(12ημ+η2K12)θθ^2.\displaystyle\leq(1-2\eta\mu+\eta^{2}K_{1}^{2})\|\theta-\hat{\theta}_{\ast}\|^{2}. (54)

Moreover, we can compute that

𝔼1ni=1nf(θ,x^i)1biΩ1f(θ,x^i)2\displaystyle\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})-\frac{1}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})\right\|^{2} =𝔼1biΩ1(1nj=1nf(θ,x^j)f(θ,x^i))2\displaystyle=\mathbb{E}\left\|\frac{1}{b}\sum_{i\in\Omega_{1}}\left(\frac{1}{n}\sum_{j=1}^{n}\nabla f(\theta,\hat{x}_{j})-\nabla f(\theta,\hat{x}_{i})\right)\right\|^{2}
\displaystyle\leq\mathbb{E}\left(\frac{1}{b}\sum_{i\in\Omega_{1}}K_{2}\frac{1}{n}\sum_{j=1}^{n}\|\hat{x}_{i}-\hat{x}_{j}\|(2\|\theta\|+1)\right)^{2}
4D2K22(2θ+1)2\displaystyle\leq 4D^{2}K_{2}^{2}(2\|\theta\|+1)^{2}
8D2K22(4θ2+1)\displaystyle\leq 8D^{2}K_{2}^{2}(4\|\theta\|^{2}+1)
8D2K22(8θθ^2+8θ^2+1).\displaystyle\leq 8D^{2}K_{2}^{2}\left(8\|\theta-\hat{\theta}_{\ast}\|^{2}+8\|\hat{\theta}_{\ast}\|^{2}+1\right). (55)

Hence, we conclude that

(P^V^)(θ)\displaystyle(\hat{P}\hat{V})(\theta) (12ημ+η2K12+64η2D2K22)θθ^2+1+8η2D2K22(8θ^2+1)\displaystyle\leq\left(1-2\eta\mu+\eta^{2}K_{1}^{2}+64\eta^{2}D^{2}K_{2}^{2}\right)\left\|\theta-\hat{\theta}_{\ast}\right\|^{2}+1+8\eta^{2}D^{2}K_{2}^{2}\left(8\|\hat{\theta}_{\ast}\|^{2}+1\right)
=(12ημ+η2K12+64η2D2K22)V^(θ)\displaystyle=\left(1-2\eta\mu+\eta^{2}K_{1}^{2}+64\eta^{2}D^{2}K_{2}^{2}\right)\hat{V}(\theta)
+2ημη2K1256η2D2K22+64η2D2K22θ^2\displaystyle\qquad\qquad\qquad+2\eta\mu-\eta^{2}K_{1}^{2}-56\eta^{2}D^{2}K_{2}^{2}+64\eta^{2}D^{2}K_{2}^{2}\|\hat{\theta}_{\ast}\|^{2}
(1ημ)V^(θ)+2ημη2K1256η2D2K22+64η2D2K22θ^2,\displaystyle\leq(1-\eta\mu)\hat{V}(\theta)+2\eta\mu-\eta^{2}K_{1}^{2}-56\eta^{2}D^{2}K_{2}^{2}+64\eta^{2}D^{2}K_{2}^{2}\|\hat{\theta}_{\ast}\|^{2}, (56)

provided that ημK12+64D2K22\eta\leq\frac{\mu}{K_{1}^{2}+64D^{2}K_{2}^{2}}. This completes the proof.  

Next, we estimate the perturbation gap based on the Lyapunov function V^\hat{V}.

Lemma 17

Assume that Assumption 1 holds. Assume that supx𝒳xD\sup_{x\in\mathcal{X}}\|x\|\leq D for some D<D<\infty. Then, we have

supθd𝒲1(δθP,δθP^)V^(θ)4DK2ηn(2θ^+1),\sup_{\theta\in\mathbb{R}^{d}}\frac{\mathcal{W}_{1}(\delta_{\theta}P,\delta_{\theta}\hat{P})}{\hat{V}(\theta)}\leq\frac{4DK_{2}\eta}{n}(2\|\hat{\theta}_{\ast}\|+1), (57)

where θ^\hat{\theta}_{\ast} is the minimizer of F^(θ,X^n):=1ni=1nf(θ,x^i)\hat{F}(\theta,\hat{X}_{n}):=\frac{1}{n}\sum_{i=1}^{n}f(\theta,\hat{x}_{i}).

Proof  Let us recall that

θ1=θηbiΩ1f(θ,xi),\displaystyle\theta_{1}=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i}), (58)
θ^1=θηbiΩ1f(θ,x^i),\displaystyle\hat{\theta}_{1}=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i}), (59)

which implies that

𝒲1(δθP,δθP^)\displaystyle\mathcal{W}_{1}\left(\delta_{\theta}P,\delta_{\theta}\hat{P}\right) ηb𝔼iΩ1f(θ,xi)f(θ,x^i)\displaystyle\leq\frac{\eta}{b}\mathbb{E}\left\|\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i})-\nabla f(\theta,\hat{x}_{i})\right\|
ηb𝔼[iΩ1f(θ,xi)f(θ,x^i)]\displaystyle\leq\frac{\eta}{b}\mathbb{E}\left[\sum_{i\in\Omega_{1}}\left\|\nabla f(\theta,x_{i})-\nabla f(\theta,\hat{x}_{i})\right\|\right]
ηb𝔼[iΩ1K2xix^i(2θ+1)]\displaystyle\leq\frac{\eta}{b}\mathbb{E}\left[\sum_{i\in\Omega_{1}}K_{2}\left\|x_{i}-\hat{x}_{i}\right\|(2\|\theta\|+1)\right] (60)

Since XnX_{n} and X^n\hat{X}_{n} differ by at most one element and supx𝒳xD\sup_{x\in\mathcal{X}}\|x\|\leq D for some D<D<\infty, we have xi=x^ix_{i}=\hat{x}_{i} for any iΩ1i\in\Omega_{1} with probability nbn\frac{n-b}{n} and xix^ix_{i}\neq\hat{x}_{i} for exactly one iΩ1i\in\Omega_{1} with probability bn\frac{b}{n} and therefore

𝒲1(δθP,δθP^)ηbbn2K2D(2θ+1)=2DK2ηn(2θ+1).\mathcal{W}_{1}\left(\delta_{\theta}P,\delta_{\theta}\hat{P}\right)\leq\frac{\eta}{b}\frac{b}{n}2K_{2}D(2\|\theta\|+1)=\frac{2DK_{2}\eta}{n}(2\|\theta\|+1). (61)

Hence, we conclude that

supθd𝒲1(δθP,δθP^)V^(θ)\displaystyle\sup_{\theta\in\mathbb{R}^{d}}\frac{\mathcal{W}_{1}(\delta_{\theta}P,\delta_{\theta}\hat{P})}{\hat{V}(\theta)} supθd4DK2ηnθ+121+θθ^2\displaystyle\leq\sup_{\theta\in\mathbb{R}^{d}}\frac{4DK_{2}\eta}{n}\frac{\|\theta\|+\frac{1}{2}}{1+\|\theta-\hat{\theta}_{\ast}\|^{2}}
\displaystyle\leq\sup_{\theta\in\mathbb{R}^{d}}\frac{4DK_{2}\eta}{n}\frac{\|\theta-\hat{\theta}_{\ast}\|+\|\hat{\theta}_{\ast}\|+\frac{1}{2}}{1+\|\theta-\hat{\theta}_{\ast}\|^{2}}
\displaystyle\leq\frac{4DK_{2}\eta}{n}\left(\sup_{\theta\in\mathbb{R}^{d}}\frac{\|\theta-\hat{\theta}_{\ast}\|}{1+\|\theta-\hat{\theta}_{\ast}\|^{2}}+\|\hat{\theta}_{\ast}\|+\frac{1}{2}\right)
\displaystyle\leq\frac{4DK_{2}\eta}{n}\left(\|\hat{\theta}_{\ast}\|+1\right)\leq\frac{4DK_{2}\eta}{n}\left(2\|\hat{\theta}_{\ast}\|+1\right). (62)

where the second inequality uses \|\theta\|\leq\|\theta-\hat{\theta}_{\ast}\|+\|\hat{\theta}_{\ast}\|, and the last line uses t/(1+t^{2})\leq\frac{1}{2} for t\geq 0 together with 1/(1+t^{2})\leq 1.

This completes the proof.  

Next, let us provide a technical lemma that upper bounds the norm of θ\theta_{\ast} and θ^\hat{\theta}_{\ast}, which are the minimizers of F^(θ,Xn):=1ni=1nf(θ,xi)\hat{F}(\theta,X_{n}):=\frac{1}{n}\sum_{i=1}^{n}f(\theta,x_{i}) and F^(θ,X^n):=1ni=1nf(θ,x^i)\hat{F}(\theta,\hat{X}_{n}):=\frac{1}{n}\sum_{i=1}^{n}f(\theta,\hat{x}_{i}) respectively.

Lemma 18

Under Assumption 2, we have

θ1μsupx𝒳f(0,x),\|\theta_{\ast}\|\leq\frac{1}{\mu}\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|,

and

θ^1μsupx𝒳f(0,x).\|\hat{\theta}_{\ast}\|\leq\frac{1}{\mu}\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|.

Proof  Since f(θ,x)f(\theta,x) is μ\mu-strongly convex in θ\theta for every x𝒳x\in\mathcal{X}, we have

F^(0,Xn)F^(θ,Xn),0θ\displaystyle\left\langle\nabla\hat{F}(0,X_{n})-\nabla\hat{F}(\theta_{\ast},X_{n}),0-\theta_{\ast}\right\rangle =1ni=1nf(0,xi),θ\displaystyle=-\frac{1}{n}\sum_{i=1}^{n}\langle\nabla f(0,x_{i}),\theta_{\ast}\rangle
=1ni=1nf(0,xi)f(θ,xi),0θμθ2,\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\langle\nabla f(0,x_{i})-\nabla f(\theta_{\ast},x_{i}),0-\theta_{\ast}\rangle\geq\mu\|\theta_{\ast}\|^{2}, (63)

which implies that

μθ2supx𝒳f(0,x)θ,\mu\|\theta_{\ast}\|^{2}\leq\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\cdot\|\theta_{\ast}\|, (64)

which yields that θ1μsupx𝒳f(0,x)\|\theta_{\ast}\|\leq\frac{1}{\mu}\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|. Similarly, one can show that θ^1μsupx𝒳f(0,x)\|\hat{\theta}_{\ast}\|\leq\frac{1}{\mu}\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|. This completes the proof.  

D.1 Proof of Theorem 8

Proof  By applying Lemma 15, Lemma 16, Lemma 17 and Lemma 3, we obtain

𝒲1(νk,ν^k)\displaystyle\mathcal{W}_{1}(\nu_{k},\hat{\nu}_{k}) 8DK2(1(1ημ2)k)nμ(2θ^+1)\displaystyle\leq\frac{8DK_{2}(1-(1-\frac{\eta\mu}{2})^{k})}{n\mu}\left(2\|\hat{\theta}_{\ast}\|+1\right)
max{1+θθ^2,2ημK1256ημD2K22+64ημD2K22θ^2},\displaystyle\qquad\cdot\max\left\{1+\left\|\theta-\hat{\theta}_{\ast}\right\|^{2},2-\frac{\eta}{\mu}K_{1}^{2}-\frac{56\eta}{\mu}D^{2}K_{2}^{2}+\frac{64\eta}{\mu}D^{2}K_{2}^{2}\|\hat{\theta}_{\ast}\|^{2}\right\}, (65)

where θ^\hat{\theta}_{\ast} is the minimizer of F^(θ,X^n):=1ni=1nf(θ,x^i)\hat{F}(\theta,\hat{X}_{n}):=\frac{1}{n}\sum_{i=1}^{n}f(\theta,\hat{x}_{i}). Finally, noticing that θθ^22θ2+2θ^2\|\theta-\hat{\theta}_{\ast}\|^{2}\leq 2\|\theta\|^{2}+2\|\hat{\theta}_{\ast}\|^{2} and applying Lemma 18, we complete the proof.  
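
To see the time-uniform behaviour of the resulting bound concretely, the following Python sketch evaluates the right-hand side of (65) after replacing \|\hat{\theta}_{\ast}\| by its Lemma 18 bound E/\mu and using \|\theta-\hat{\theta}_{\ast}\|^{2}\leq 2\|\theta\|^{2}+2\|\hat{\theta}_{\ast}\|^{2}. The numerical constants are hypothetical placeholders chosen to satisfy the step-size condition; the point is only that the bound saturates in the number of iterations k and decays at rate 1/n in the sample size.

def theorem8_style_bound(k, n, D=1.0, K1=1.0, K2=1.0, mu=0.5, eta=0.005, E=1.0,
                         theta0_norm=0.0):
    # Requires eta < min(1/mu, mu / (K1**2 + 64 * D**2 * K2**2)) for the lemmas to apply.
    theta_hat = E / mu                                  # Lemma 18 bound on ||theta_hat_*||
    prefactor = (8 * D * K2 * (1 - (1 - eta * mu / 2) ** k) / (n * mu)) * (2 * theta_hat + 1)
    term1 = 1 + 2 * theta0_norm ** 2 + 2 * theta_hat ** 2
    term2 = (2 - (eta / mu) * K1 ** 2 - (56 * eta / mu) * D ** 2 * K2 ** 2
             + (64 * eta / mu) * D ** 2 * K2 ** 2 * theta_hat ** 2)
    return prefactor * max(term1, term2)

for k in (10, 100, 1000, 10 ** 6):
    print(k, theorem8_style_bound(k, n=10 ** 4))   # the bound stabilizes as k grows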

E Proofs of Wasserstein Perturbation Bounds: Non-Convex Case

Lemma 19

Assume that Assumption 1, Assumption 3 and Assumption 4 hold. For any η^(0,1)\hat{\eta}\in(0,1), define M>0M>0 so that θ1θMp(θ,θ1)𝑑θ1η^\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}\geq\sqrt{\hat{\eta}} and take any R>0R>0 so that R>2K0mR>\frac{2K_{0}}{m} and

\inf_{\theta,\theta_{1}\in\mathbb{R}^{d}:V(\theta)\leq R,\|\theta_{1}-\theta_{\ast}\|\leq M}\frac{p(\theta,\theta_{1})}{p(\theta_{\ast},\theta_{1})}\geq\sqrt{\hat{\eta}}, (66)

where

K0:=2mηK1256ηD2K22+64ηD2K22θ2+2K+ησ2.K_{0}:=2m-\eta K_{1}^{2}-56\eta D^{2}K_{2}^{2}+64\eta D^{2}K_{2}^{2}\|\theta_{\ast}\|^{2}+2K+\eta\sigma^{2}. (67)

Then, for any nn\in\mathbb{N},

𝒲1(Pn(θ,),Pn(θ~,))12ψ(1+ψ)η¯ndψ(δθ,δθ~),\mathcal{W}_{1}\left(P^{n}(\theta,\cdot),P^{n}(\tilde{\theta},\cdot)\right)\leq\frac{1}{2\sqrt{\psi(1+\psi)}}\bar{\eta}^{n}d_{\psi}(\delta_{\theta},\delta_{\tilde{\theta}}), (68)

for any θ,θ~\theta,\tilde{\theta} in d\mathbb{R}^{d}, where for any η0(0,η^)\eta_{0}\in(0,\hat{\eta}) and γ0(1mη+2ηK0R,1)\gamma_{0}\in\left(1-m\eta+\frac{2\eta K_{0}}{R},1\right) one can choose ψ=η0ηK0\psi=\frac{\eta_{0}}{\eta K_{0}} and η¯=(1(η^η0))2+Rψγ02+Rψ\bar{\eta}=(1-(\hat{\eta}-\eta_{0}))\vee\frac{2+R\psi\gamma_{0}}{2+R\psi} and dψd_{\psi} is the weighted total variation distance defined in Section B.1.

Proof  Our proof relies on a computable bound on the Wasserstein contraction for Markov chains due to Hairer and Mattingly (2011), which applies to chains that satisfy a drift condition (Assumption 7), built upon an appropriate Lyapunov function, together with a minorization condition (Assumption 8).

By applying Lemma 21, we can immediately show that the following drift condition holds. Let V(θ):=1+θθ2V(\theta):=1+\|\theta-\theta_{\ast}\|^{2}, where θ\theta_{\ast} is the minimizer of 1ni=1nf(θ,xi)\frac{1}{n}\sum_{i=1}^{n}f(\theta,x_{i}). Assume that η<min{1m,mK12+64D2K22}\eta<\min\left\{\frac{1}{m},\frac{m}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\} and supx𝒳xD\sup_{x\in\mathcal{X}}\|x\|\leq D for some D<D<\infty. Then, we have

(PV)(θ)(1mη)V(θ)+ηK0,(PV)(\theta)\leq(1-m\eta)V(\theta)+\eta K_{0}, (69)

where

K0:=2mηK1256ηD2K22+64ηD2K22θ2+2K+ησ2.K_{0}:=2m-\eta K_{1}^{2}-56\eta D^{2}K_{2}^{2}+64\eta D^{2}K_{2}^{2}\|\theta_{\ast}\|^{2}+2K+\eta\sigma^{2}. (70)

Thus, the drift condition (Assumption 7) holds.

Next, let us show that the minorization condition (Assumption 8) also holds. Let us recall that

θ1=θηbiΩ1f(θ,xi)+ηξ1.\displaystyle\theta_{1}=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i})+\eta\xi_{1}. (71)

We denote by p(θ,θ1)p(\theta,\theta_{1}) the probability density function of θ1\theta_{1}, with emphasis on the dependence on the initial point θ\theta. Then, to check that the minorization condition (Assumption 8) holds, it suffices to show that there exists some constant η^(0,1)\hat{\eta}\in(0,1) such that

infθd:V(θ)Rp(θ,θ1)η^q(θ1),for any θ1d,\inf_{\theta\in\mathbb{R}^{d}:V(\theta)\leq R}p(\theta,\theta_{1})\geq\hat{\eta}q(\theta_{1}),\qquad\text{for any $\theta_{1}\in\mathbb{R}^{d}$}, (72)

for some R>2ηK01(1mη)=2K0mR>\frac{2\eta K_{0}}{1-(1-m\eta)}=\frac{2K_{0}}{m}, where q(θ1)q(\theta_{1}) is the density function of a probability distribution on d\mathbb{R}^{d}, and (72) follows from Lemma 20. Hence, by Lemma 14, we have

dψ(Pn(θ,),Pn(θ~,))η¯ndψ(δθ,δθ~),d_{\psi}\left(P^{n}(\theta,\cdot),P^{n}(\tilde{\theta},\cdot)\right)\leq\bar{\eta}^{n}d_{\psi}(\delta_{\theta},\delta_{\tilde{\theta}}),

for any θ,θ~\theta,\tilde{\theta} in d\mathbb{R}^{d}, where for any η0(0,η^)\eta_{0}\in(0,\hat{\eta}) and γ0(1mη+2ηK0R,1)\gamma_{0}\in\left(1-m\eta+\frac{2\eta K_{0}}{R},1\right) one can choose ψ=η0ηK0\psi=\frac{\eta_{0}}{\eta K_{0}} and η¯=(1(η^η0))2+Rψγ02+Rψ\bar{\eta}=(1-(\hat{\eta}-\eta_{0}))\vee\frac{2+R\psi\gamma_{0}}{2+R\psi}.

Finally, by the Kantorovich-Rubinstein duality for the Wasserstein metric, we get for any two probability measures μ1,μ2\mu_{1},\mu_{2} on d\mathbb{R}^{d}:

𝒲1(μ1,μ2)\displaystyle\mathcal{W}_{1}(\mu_{1},\mu_{2}) =sup{dϕ(θ)(μ1μ2)(dθ):ϕ is 1-Lipschitz}\displaystyle=\sup\left\{\int_{\mathbb{R}^{d}}\phi(\theta)(\mu_{1}-\mu_{2})(d\theta):\text{$\phi$ is $1$-Lipschitz}\right\}
=sup{d(ϕ(θ)ϕ(θ))(μ1μ2)(dθ):ϕ is 1-Lipschitz}\displaystyle=\sup\left\{\int_{\mathbb{R}^{d}}(\phi(\theta)-\phi(\theta_{\ast}))(\mu_{1}-\mu_{2})(d\theta):\text{$\phi$ is $1$-Lipschitz}\right\}
dθθ|μ1μ2|(dθ)\displaystyle\leq\int_{\mathbb{R}^{d}}\|\theta-\theta_{\ast}\||\mu_{1}-\mu_{2}|(d\theta)
12ψ(1+ψ)d(1+ψV(θ))|μ1μ2|(dθ)\displaystyle\leq\frac{1}{2\sqrt{\psi(1+\psi)}}\int_{\mathbb{R}^{d}}(1+\psi V(\theta))|\mu_{1}-\mu_{2}|(d\theta)
=12ψ(1+ψ)dψ(μ1,μ2).\displaystyle=\frac{1}{2\sqrt{\psi(1+\psi)}}d_{\psi}(\mu_{1},\mu_{2}). (73)

Hence, we conclude that

𝒲1(Pn(θ,),Pn(θ~,))12ψ(1+ψ)η¯ndψ(δθ,δθ~).\mathcal{W}_{1}\left(P^{n}(\theta,\cdot),P^{n}(\tilde{\theta},\cdot)\right)\leq\frac{1}{2\sqrt{\psi(1+\psi)}}\bar{\eta}^{n}d_{\psi}(\delta_{\theta},\delta_{\tilde{\theta}}).

This completes the proof.  

The proof of Lemma 19 relies on the following technical lemma, which is a reformulation of Lemma 35 in Can et al. (2019) that helps establish the minorization condition (Assumption 8).

Lemma 20

For any η^(0,1)\hat{\eta}\in(0,1), let M>0M>0 be such that θ1θMp(θ,θ1)𝑑θ1η^\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}\geq\sqrt{\hat{\eta}} and let R>0R>0 be such that

infθ,θ1d:V(θ)R,θ1θMp(θ,θ1)p(θ,θ1)η^.\inf_{\theta,\theta_{1}\in\mathbb{R}^{d}:V(\theta)\leq R,\|\theta_{1}-\theta_{\ast}\|\leq M}\frac{p(\theta,\theta_{1})}{p(\theta_{\ast},\theta_{1})}\geq\sqrt{\hat{\eta}}. (74)

Then, we have

infθd:V(θ)Rp(θ,θ1)η^q(θ1),for any θ1d,\inf_{\theta\in\mathbb{R}^{d}:V(\theta)\leq R}p(\theta,\theta_{1})\geq\hat{\eta}q(\theta_{1}),\qquad\text{for any $\theta_{1}\in\mathbb{R}^{d}$}, (75)

where

q(θ1)=p(θ,θ1)1θ1θMθ1θMp(θ,θ1)𝑑θ1.q(\theta_{1})=p(\theta_{\ast},\theta_{1})\cdot\frac{1_{\|\theta_{1}-\theta_{\ast}\|\leq M}}{\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}}. (76)

Proof  The proof is an adaptation of the proof of Lemma 35 in Can et al. (2019). Let us take:

q(θ1)=p(θ,θ1)1θ1θMθ1θMp(θ,θ1)𝑑θ1.q(\theta_{1})=p(\theta_{\ast},\theta_{1})\cdot\frac{1_{\|\theta_{1}-\theta_{\ast}\|\leq M}}{\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}}. (77)

Then, it is clear that q(θ1)q(\theta_{1}) is a probability density function on d\mathbb{R}^{d}. It follows that (72) automatically holds for θ1θ>M\|\theta_{1}-\theta_{\ast}\|>M. Thus, we only need to show that (72) holds for θ1θM\|\theta_{1}-\theta_{\ast}\|\leq M. Since ξ1\xi_{1} has a continuous density, p(θ,θ1)p(\theta,\theta_{1}) is continuous in both θ\theta and θ1\theta_{1}. Hence, for fixed MM, there exists some η(0,1)\eta^{\prime}\in(0,1) such that uniformly in θ1θM\|\theta_{1}-\theta_{\ast}\|\leq M,

infθd:V(θ)Rp(θ,θ1)ηp(θ,θ1)=η^q(θ1),\inf_{\theta\in\mathbb{R}^{d}:V(\theta)\leq R}p(\theta,\theta_{1})\geq\eta^{\prime}p(\theta_{\ast},\theta_{1})=\hat{\eta}q(\theta_{1}), (78)

where we can take

η^:=ηθ1θMp(θ,θ1)𝑑θ1.\hat{\eta}:=\eta^{\prime}\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}. (79)

In particular, for any fixed η^\hat{\eta}, we can take M>0M>0 such that

θ1θMp(θ,θ1)𝑑θ1η^,\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}\geq\sqrt{\hat{\eta}}, (80)

and with η^\hat{\eta} and MM fixed, we take R>0R>0 such that uniformly in θ1θM\|\theta_{1}-\theta_{\ast}\|\leq M,

infθd:V(θ)Rp(θ,θ1)η^p(θ,θ1).\inf_{\theta\in\mathbb{R}^{d}:V(\theta)\leq R}p(\theta,\theta_{1})\geq\sqrt{\hat{\eta}}p(\theta_{\ast},\theta_{1}). (81)

This completes the proof.  

Next, we construct a Lyapunov function V^\hat{V} and obtain a drift condition for the SGD (θ^k)k=0(\hat{\theta}_{k})_{k=0}^{\infty}.

Lemma 21

Assume that Assumption 1, Assumption 3 and Assumption 4 hold. Let V^(θ):=1+θθ^2\hat{V}(\theta):=1+\|\theta-\hat{\theta}_{\ast}\|^{2}, where θ^\hat{\theta}_{\ast} is the minimizer of F^(θ,X^n):=1ni=1nf(θ,x^i)\hat{F}(\theta,\hat{X}_{n}):=\frac{1}{n}\sum_{i=1}^{n}f(\theta,\hat{x}_{i}). Assume that η<min{1m,mK12+64D2K22}\eta<\min\left\{\frac{1}{m},\frac{m}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\} and supx𝒳xD\sup_{x\in\mathcal{X}}\|x\|\leq D for some D<D<\infty. Then, we have

(P^V^)(θ)(1mη)V^(θ)+2mηη2K1256η2D2K22+64η2D2K22θ^2+2ηK+η2σ2.(\hat{P}\hat{V})(\theta)\leq(1-m\eta)\hat{V}(\theta)+2m\eta-\eta^{2}K_{1}^{2}-56\eta^{2}D^{2}K_{2}^{2}+64\eta^{2}D^{2}K_{2}^{2}\|\hat{\theta}_{\ast}\|^{2}+2\eta K+\eta^{2}\sigma^{2}. (82)

Proof  First, we recall that

θ^k=θ^k1ηbiΩkf(θ^k1,x^i)+ηξk.\hat{\theta}_{k}=\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)+\eta\xi_{k}. (83)

Therefore, starting with θ^0=θ\hat{\theta}_{0}=\theta, we have

θ^1\displaystyle\hat{\theta}_{1} =θηbiΩ1f(θ,x^i)+ηξ1\displaystyle=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})+\eta\xi_{1}
=θηni=1nf(θ,x^i)+η(1ni=1nf(θ,x^i)1biΩ1f(θ,x^i))+ηξ1.\displaystyle=\theta-\frac{\eta}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})+\eta\left(\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})-\frac{1}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})\right)+\eta\xi_{1}. (84)

Moreover, we have

𝔼[1biΩ1f(θ,x^i)|θ]=1ni=1nf(θ,x^i).\mathbb{E}\left[\frac{1}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})\Big{|}\theta\right]=\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i}). (85)

This implies that

(P^V^)(θ)\displaystyle(\hat{P}\hat{V})(\theta)
=𝔼V^(θ^1)=1+𝔼θ^1θ^2\displaystyle=\mathbb{E}\hat{V}(\hat{\theta}_{1})=1+\mathbb{E}\left\|\hat{\theta}_{1}-\hat{\theta}_{\ast}\right\|^{2}
=1+θθ^ηni=1nf(θ,x^i)2+η2𝔼1ni=1nf(θ,x^i)1biΩ1f(θ,x^i)2+η2σ2.\displaystyle=1+\left\|\theta-\hat{\theta}_{\ast}-\frac{\eta}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})\right\|^{2}+\eta^{2}\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})-\frac{1}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})\right\|^{2}+\eta^{2}\sigma^{2}. (86)

We can compute that

θθ^ηni=1nf(θ,x^i)2\displaystyle\left\|\theta-\hat{\theta}_{\ast}-\frac{\eta}{n}\sum_{i=1}^{n}\nabla f\left(\theta,\hat{x}_{i}\right)\right\|^{2}
=θθ^ηni=1n(f(θ,x^i)f(θ^,x^i))2\displaystyle=\left\|\theta-\hat{\theta}_{\ast}-\frac{\eta}{n}\sum_{i=1}^{n}\left(\nabla f\left(\theta,\hat{x}_{i}\right)-\nabla f\left(\hat{\theta}_{\ast},\hat{x}_{i}\right)\right)\right\|^{2}
(12mη)θθ^2+2ηK+η2n2i=1n(f(θ,x^i)f(θ^,x^i))2\displaystyle\leq(1-2m\eta)\left\|\theta-\hat{\theta}_{\ast}\right\|^{2}+2\eta K+\frac{\eta^{2}}{n^{2}}\left\|\sum_{i=1}^{n}\left(\nabla f(\theta,\hat{x}_{i})-\nabla f\left(\hat{\theta}_{\ast},\hat{x}_{i}\right)\right)\right\|^{2}
(12mη+η2K12)θθ^2+2ηK.\displaystyle\leq\left(1-2m\eta+\eta^{2}K_{1}^{2}\right)\left\|\theta-\hat{\theta}_{\ast}\right\|^{2}+2\eta K. (87)

Moreover, by following the same arguments as in the proof of Lemma 16, we have

𝔼1ni=1nf(θ,x^i)1biΩ1f(θ,x^i)28D2K22(8θθ^2+8θ^2+1).\displaystyle\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})-\frac{1}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})\right\|^{2}\leq 8D^{2}K_{2}^{2}\left(8\left\|\theta-\hat{\theta}_{\ast}\right\|^{2}+8\|\hat{\theta}_{\ast}\|^{2}+1\right). (88)

Hence, we conclude that

(P^V^)(θ)\displaystyle(\hat{P}\hat{V})(\theta) (12mη+η2K12+64η2D2K22)θθ^2\displaystyle\leq\left(1-2m\eta+\eta^{2}K_{1}^{2}+64\eta^{2}D^{2}K_{2}^{2}\right)\left\|\theta-\hat{\theta}_{\ast}\right\|^{2}
+1+8η2D2K22(8θ^2+1)+2ηK+η2σ2\displaystyle\qquad\qquad\qquad+1+8\eta^{2}D^{2}K_{2}^{2}\left(8\|\hat{\theta}_{\ast}\|^{2}+1\right)+2\eta K+\eta^{2}\sigma^{2}
=(12mη+η2K12+64η2D2K22)V^(θ)\displaystyle=\left(1-2m\eta+\eta^{2}K_{1}^{2}+64\eta^{2}D^{2}K_{2}^{2}\right)\hat{V}(\theta)
+2mηη2K1256η2D2K22+64η2D2K22θ^2+2ηK+η2σ2\displaystyle\qquad\qquad\qquad+2m\eta-\eta^{2}K_{1}^{2}-56\eta^{2}D^{2}K_{2}^{2}+64\eta^{2}D^{2}K_{2}^{2}\|\hat{\theta}_{\ast}\|^{2}+2\eta K+\eta^{2}\sigma^{2}
(1mη)V^(θ)+2mηη2K1256η2D2K22+64η2D2K22θ^2+2ηK+η2σ2,\displaystyle\leq(1-m\eta)\hat{V}(\theta)+2m\eta-\eta^{2}K_{1}^{2}-56\eta^{2}D^{2}K_{2}^{2}+64\eta^{2}D^{2}K_{2}^{2}\|\hat{\theta}_{\ast}\|^{2}+2\eta K+\eta^{2}\sigma^{2}, (89)

provided that ηmK12+64D2K22\eta\leq\frac{m}{K_{1}^{2}+64D^{2}K_{2}^{2}}. This completes the proof.  

Next, we estimate the perturbation gap based on the Lyapunov function V^\hat{V}.

Lemma 22

Assume that Assumption 1 and Assumption 4 hold. Assume that supx𝒳xD\sup_{x\in\mathcal{X}}\|x\|\leq D for some D<D<\infty. Then, we have

supθddψ(δθP,δθP^)V^(θ)\displaystyle\sup_{\theta\in\mathbb{R}^{d}}\frac{d_{\psi}(\delta_{\theta}P,\delta_{\theta}\hat{P})}{\hat{V}(\theta)}
2bnmax{ψ(4+8η2K12),\displaystyle\leq\frac{2b}{n}\max\Bigg{\{}\psi\cdot(4+8\eta^{2}K_{1}^{2}),
1+ψ(1+η2σ2+(4+8η2K12)θθ^2+4η2supx𝒳f(θ,x)2)},\displaystyle\qquad\qquad 1+\psi\cdot\left(1+\eta^{2}\sigma^{2}+(4+8\eta^{2}K_{1}^{2})\|\theta_{\ast}-\hat{\theta}_{\ast}\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|^{2}\right)\Bigg{\}}, (90)

where θ^\hat{\theta}_{\ast} is the minimizer of 1ni=1nf(θ,x^i)\frac{1}{n}\sum_{i=1}^{n}f(\theta,\hat{x}_{i}) and θ\theta_{\ast} is the minimizer of 1ni=1nf(θ,xi)\frac{1}{n}\sum_{i=1}^{n}f(\theta,x_{i}), and dψd_{\psi} is the weighted total variation distance defined in Section B.1.

Proof  Let us recall that

θ1=θηbiΩ1f(θ,xi)+ηξ1,\displaystyle\theta_{1}=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i})+\eta\xi_{1}, (91)
θ^1=θηbiΩ1f(θ,x^i)+ηξ1,\displaystyle\hat{\theta}_{1}=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})+\eta\xi_{1}, (92)

which implies that

𝔼ξ1[V(θ1)]=1+𝔼ξ1[θ1θ2]=1+η2σ2+θθηbiΩ1f(θ,xi)2,\displaystyle\mathbb{E}_{\xi_{1}}[V(\theta_{1})]=1+\mathbb{E}_{\xi_{1}}\left[\|\theta_{1}-\theta_{\ast}\|^{2}\right]=1+\eta^{2}\sigma^{2}+\left\|\theta-\theta_{\ast}-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i})\right\|^{2}, (93)

where 𝔼ξ1\mathbb{E}_{\xi_{1}} denotes expectations w.r.t. ξ1\xi_{1} only, and we can further compute that

θθηbiΩ1f(θ,xi)2\displaystyle\left\|\theta-\theta_{\ast}-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i})\right\|^{2}
2θθ2+2η2b2iΩ1f(θ,xi)2\displaystyle\leq 2\|\theta-\theta_{\ast}\|^{2}+\frac{2\eta^{2}}{b^{2}}\left\|\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i})\right\|^{2}
2θθ2+2η2b2(iΩ1f(θ,xi))2\displaystyle\leq 2\|\theta-\theta_{\ast}\|^{2}+\frac{2\eta^{2}}{b^{2}}\left(\sum_{i\in\Omega_{1}}\|\nabla f(\theta,x_{i})\|\right)^{2}
2θθ2+2η2b2(iΩ1f(θ,xi)f(θ,xi)+f(θ,xi))2\displaystyle\leq 2\|\theta-\theta_{\ast}\|^{2}+\frac{2\eta^{2}}{b^{2}}\left(\sum_{i\in\Omega_{1}}\|\nabla f(\theta,x_{i})-\nabla f(\theta_{\ast},x_{i})\|+\|\nabla f(\theta_{\ast},x_{i})\|\right)^{2}
2θθ2+2η2b2(bK1θθ+bsupx𝒳f(θ,x))2\displaystyle\leq 2\|\theta-\theta_{\ast}\|^{2}+\frac{2\eta^{2}}{b^{2}}\left(bK_{1}\|\theta-\theta_{\ast}\|+b\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|\right)^{2}
2θθ2+4η2K12θθ2+4η2supx𝒳f(θ,x)2.\displaystyle\leq 2\|\theta-\theta_{\ast}\|^{2}+4\eta^{2}K_{1}^{2}\|\theta-\theta_{\ast}\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|^{2}. (94)

Therefore, we have

𝔼ξ1[V(θ1)]1+η2σ2+2θθ2+4η2K12θθ2+4η2supx𝒳f(θ,x)2.\displaystyle\mathbb{E}_{\xi_{1}}[V(\theta_{1})]\leq 1+\eta^{2}\sigma^{2}+2\|\theta-\theta_{\ast}\|^{2}+4\eta^{2}K_{1}^{2}\|\theta-\theta_{\ast}\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|^{2}. (95)

Similarly, we have

𝔼ξ1[V(θ^1)]1+η2σ2+2θθ2+4η2K12θθ2+4η2supx𝒳f(θ,x)2.\displaystyle\mathbb{E}_{\xi_{1}}[V(\hat{\theta}_{1})]\leq 1+\eta^{2}\sigma^{2}+2\|\theta-\theta_{\ast}\|^{2}+4\eta^{2}K_{1}^{2}\|\theta-\theta_{\ast}\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|^{2}. (96)

Since XnX_{n} and X^n\hat{X}_{n} differ by at most one element and supx𝒳xD\sup_{x\in\mathcal{X}}\|x\|\leq D for some D<D<\infty, we have xi=x^ix_{i}=\hat{x}_{i} for any iΩ1i\in\Omega_{1} with probability nbn\frac{n-b}{n} and xix^ix_{i}\neq\hat{x}_{i} for exactly one iΩ1i\in\Omega_{1} with probability bn\frac{b}{n}. Therefore, we have

dψ(δθP,δθP^)2bn(1+ψ(1+η2σ2+(2+4η2K12)θθ2+4η2supx𝒳f(θ,x)2)).d_{\psi}\left(\delta_{\theta}P,\delta_{\theta}\hat{P}\right)\leq\frac{2b}{n}\left(1+\psi\left(1+\eta^{2}\sigma^{2}+(2+4\eta^{2}K_{1}^{2})\|\theta-\theta_{\ast}\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|^{2}\right)\right). (97)

Hence, we conclude that

supθddψ(δθP,δθP^)V^(θ)\displaystyle\sup_{\theta\in\mathbb{R}^{d}}\frac{d_{\psi}(\delta_{\theta}P,\delta_{\theta}\hat{P})}{\hat{V}(\theta)}
2bnsupθd1+ψ(1+η2σ2+(2+4η2K12)θθ2+4η2supx𝒳f(θ,x)2)1+θθ^2\displaystyle\leq\frac{2b}{n}\sup_{\theta\in\mathbb{R}^{d}}\frac{1+\psi\left(1+\eta^{2}\sigma^{2}+(2+4\eta^{2}K_{1}^{2})\|\theta-\theta_{\ast}\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|^{2}\right)}{1+\|\theta-\hat{\theta}_{\ast}\|^{2}}
2bnsupθd{1+ψ(1+η2σ2+(4+8η2K12)θθ^2)1+θθ^2\displaystyle\leq\frac{2b}{n}\sup_{\theta\in\mathbb{R}^{d}}\Bigg{\{}\frac{1+\psi\left(1+\eta^{2}\sigma^{2}+(4+8\eta^{2}K_{1}^{2})\|\theta-\hat{\theta}_{\ast}\|^{2}\right)}{1+\|\theta-\hat{\theta}_{\ast}\|^{2}}
+ψ((4+8η2K12)θθ^2+4η2supx𝒳f(θ,x)2)1+θθ^2}\displaystyle\qquad\qquad\qquad\qquad+\frac{\psi\left((4+8\eta^{2}K_{1}^{2})\|\theta_{\ast}-\hat{\theta}_{\ast}\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|^{2}\right)}{1+\|\theta-\hat{\theta}_{\ast}\|^{2}}\Bigg{\}}
2bnmax{ψ(4+8η2K12),\displaystyle\leq\frac{2b}{n}\max\Bigg{\{}\psi(4+8\eta^{2}K_{1}^{2}),
1+ψ(1+η2σ2+(4+8η2K12)θθ^2+4η2supx𝒳f(θ,x)2)}.\displaystyle\qquad\qquad\qquad 1+\psi\left(1+\eta^{2}\sigma^{2}+(4+8\eta^{2}K_{1}^{2})\|\theta_{\ast}-\hat{\theta}_{\ast}\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|^{2}\right)\Bigg{\}}. (98)

This completes the proof.  

It is worth noting that the Wasserstein contraction bound we obtained in Lemma 19 in the non-convex case differs from the one we obtained in the strongly-convex case (Lemma 15) in the sense that the right hand side of (68) is no longer θθ~\|\theta-\tilde{\theta}\| so that Lemma 3 is not directly applicable. Instead, in the following, we will provide a modification of Lemma 3, which will be used in proving Theorem 9 in this paper. The definitions of the notations used in the following lemma can be found in Section 2.2.

Lemma 23

Assume that there exist some ρ[0,1)\rho\in[0,1) and C(0,)C\in(0,\infty) such that

supθ,θ~d:θθ~𝒲1(Pn(θ,),Pn(θ~,))dψ(δθ,δθ~)Cρn,\sup_{\theta,\tilde{\theta}\in\mathbb{R}^{d}:\theta\neq\tilde{\theta}}\frac{\mathcal{W}_{1}(P^{n}(\theta,\cdot),P^{n}(\tilde{\theta},\cdot))}{d_{\psi}(\delta_{\theta},\delta_{\tilde{\theta}})}\leq C\rho^{n}, (99)

for any nn\in\mathbb{N}. Further assume that there exist some δ(0,1)\delta\in(0,1) and L(0,)L\in(0,\infty) and a measurable Lyapunov function V^:d[1,)\hat{V}:\mathbb{R}^{d}\rightarrow[1,\infty) of P^\hat{P} such that for any θd\theta\in\mathbb{R}^{d}:

(P^V^)(θ)δV^(θ)+L.(\hat{P}\hat{V})(\theta)\leq\delta\hat{V}(\theta)+L. (100)

Then, we have

𝒲1(pn,p^n)C(ρndψ(p0,p^0)+(1ρn)γκ1ρ),\mathcal{W}_{1}(p_{n},\hat{p}_{n})\leq C\left(\rho^{n}d_{\psi}(p_{0},\hat{p}_{0})+(1-\rho^{n})\frac{\gamma\kappa}{1-\rho}\right), (101)

where

γ:=supθddψ(δθP,δθP^)V^(θ),κ:=max{dV^(θ)𝑑p^0(θ),L1δ}.\gamma:=\sup_{\theta\in\mathbb{R}^{d}}\frac{d_{\psi}(\delta_{\theta}P,\delta_{\theta}\hat{P})}{\hat{V}(\theta)},\qquad\kappa:=\max\left\{\int_{\mathbb{R}^{d}}\hat{V}(\theta)d\hat{p}_{0}(\theta),\frac{L}{1-\delta}\right\}. (102)

Proof  The proof is based on a modification of the proof of Lemma 3 (Theorem 3.1 in Rudolf and Schweizer (2018)); in the displays below we write P~\tilde{P}, p~i\tilde{p}_{i} and V~\tilde{V} for P^\hat{P}, p^i\hat{p}_{i} and V^\hat{V}, respectively. By induction we have

p~npn=(p~0p0)Pn+i=0n1p~i(P~P)Pni1,n.\tilde{p}_{n}-p_{n}=\left(\tilde{p}_{0}-p_{0}\right)P^{n}+\sum_{i=0}^{n-1}\tilde{p}_{i}\left(\tilde{P}-P\right)P^{n-i-1},\qquad n\in\mathbb{N}. (103)

We have

dψ(p~iP,p~iP~)ddψ(δθP,δθP~)𝑑p~i(θ)γdV~(θ)𝑑p~i(θ).d_{\psi}\left(\tilde{p}_{i}P,\tilde{p}_{i}\tilde{P}\right)\leq\int_{\mathbb{R}^{d}}d_{\psi}\left(\delta_{\theta}P,\delta_{\theta}\tilde{P}\right)d\tilde{p}_{i}(\theta)\leq\gamma\int_{\mathbb{R}^{d}}\tilde{V}(\theta)d\tilde{p}_{i}(\theta). (104)

Moreover, for any i=0,1,2,i=0,1,2,\ldots,

dV~(θ)𝑑p~i(θ)=dP~iV~(θ)𝑑p~0(θ)δip~0(V~)+L(1δi)1δmax{p~0(V~),L1δ},\int_{\mathbb{R}^{d}}\tilde{V}(\theta)d\tilde{p}_{i}(\theta)=\int_{\mathbb{R}^{d}}\tilde{P}^{i}\tilde{V}(\theta)d\tilde{p}_{0}(\theta)\leq\delta^{i}\tilde{p}_{0}(\tilde{V})+\frac{L(1-\delta^{i})}{1-\delta}\leq\max\left\{\tilde{p}_{0}(\tilde{V}),\frac{L}{1-\delta}\right\}, (105)

so that we obtain dψ(p~iP,p~iP~)γκd_{\psi}(\tilde{p}_{i}P,\tilde{p}_{i}\tilde{P})\leq\gamma\kappa. Therefore, we have

𝒲1(p~iP~Pni1,p~iPPni1)Cρni1dψ(p~iP,p~iP~)Cρni1γκ.\mathcal{W}_{1}\left(\tilde{p}_{i}\tilde{P}P^{n-i-1},\tilde{p}_{i}PP^{n-i-1}\right)\leq C\rho^{n-i-1}d_{\psi}\left(\tilde{p}_{i}P,\tilde{p}_{i}\tilde{P}\right)\leq C\rho^{n-i-1}\gamma\kappa. (106)

By the triangle inequality of the Wasserstein distance, we have

𝒲1(pn,p~n)\displaystyle\mathcal{W}_{1}(p_{n},\tilde{p}_{n}) 𝒲1(p0Pn,p~0Pn)+i=0n1𝒲1(p~iP~Pni1,p~iPPni1)\displaystyle\leq\mathcal{W}_{1}\left(p_{0}P^{n},\tilde{p}_{0}P^{n}\right)+\sum_{i=0}^{n-1}\mathcal{W}_{1}\left(\tilde{p}_{i}\tilde{P}P^{n-i-1},\tilde{p}_{i}PP^{n-i-1}\right)
Cρndψ(p0,p~0)+Ci=0n1ρni1γκ.\displaystyle\leq C\rho^{n}d_{\psi}(p_{0},\tilde{p}_{0})+C\sum_{i=0}^{n-1}\rho^{n-i-1}\gamma\kappa. (107)

This completes the proof.  

Next, let us provide a technical lemma that upper bounds the norm of θ\theta_{\ast} and θ^\hat{\theta}_{\ast}, which are the minimizers of F^(θ,Xn):=1ni=1nf(θ,xi)\hat{F}(\theta,X_{n}):=\frac{1}{n}\sum_{i=1}^{n}f(\theta,x_{i}) and F^(θ,X^n):=1ni=1nf(θ,x^i)\hat{F}(\theta,\hat{X}_{n}):=\frac{1}{n}\sum_{i=1}^{n}f(\theta,\hat{x}_{i}) respectively.

Lemma 24

Under Assumption 3, we have

θsupx𝒳f(0,x)+supx𝒳f(0,x)2+4mK2m,\displaystyle\|\theta_{\ast}\|\leq\frac{\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|+\sqrt{\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|^{2}+4mK}}{2m}, (108)
θ^supx𝒳f(0,x)+supx𝒳f(0,x)2+4mK2m.\displaystyle\|\hat{\theta}_{\ast}\|\leq\frac{\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|+\sqrt{\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|^{2}+4mK}}{2m}. (109)

Proof  By Assumption 3, we have

F^(0,Xn)F^(θ,Xn),0θ\displaystyle\left\langle\nabla\hat{F}(0,X_{n})-\nabla\hat{F}(\theta_{\ast},X_{n}),0-\theta_{\ast}\right\rangle =1ni=1nf(0,xi),θ\displaystyle=-\frac{1}{n}\sum_{i=1}^{n}\langle\nabla f(0,x_{i}),\theta_{\ast}\rangle
=1ni=1nf(0,xi)f(θ,xi),0θ\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\langle\nabla f(0,x_{i})-\nabla f(\theta_{\ast},x_{i}),0-\theta_{\ast}\rangle
mθ2K,\displaystyle\geq m\|\theta_{\ast}\|^{2}-K, (110)

which implies that

mθ2Ksupx𝒳f(0,x)θ,m\|\theta_{\ast}\|^{2}-K\leq\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\cdot\|\theta_{\ast}\|, (111)

which yields that

θsupx𝒳f(0,x)+supx𝒳f(0,x)2+4mK2m.\|\theta_{\ast}\|\leq\frac{\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|+\sqrt{\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|^{2}+4mK}}{2m}. (112)

Similarly, one can show that

θ^supx𝒳f(0,x)+supx𝒳f(0,x)2+4mK2m.\|\hat{\theta}_{\ast}\|\leq\frac{\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|+\sqrt{\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|^{2}+4mK}}{2m}. (113)

This completes the proof.  

E.1 Proof of Theorem 9

Before going to the proof, let us restate the full version of Theorem 9 that we provide below.

Theorem 25 (Complete version of Theorem 9)

Let θ0=θ^0=θ\theta_{0}=\hat{\theta}_{0}=\theta. Assume that Assumption 1, Assumption 3 and Assumption 4 hold. We also assume that η<min{1m,mK12+64D2K22}\eta<\min\left\{\frac{1}{m},\frac{m}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\} and supx𝒳xD\sup_{x\in\mathcal{X}}\|x\|\leq D for some D<D<\infty and supx𝒳f(0,x)E\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\leq E for some E<E<\infty. For any η^(0,1)\hat{\eta}\in(0,1), define M>0M>0 so that θ1θMp(θ,θ1)𝑑θ1η^\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}\geq\sqrt{\hat{\eta}} and take any R>2K0mR>\frac{2K_{0}}{m}, where K0K_{0} is defined in (67), so that

infθ,θ1d:V(θ)R,θ1θMp(θ,θ1)p(θ,θ1)η^.\inf_{\theta,\theta_{1}\in\mathbb{R}^{d}:V(\theta)\leq R,\|\theta_{1}-\theta_{\ast}\|\leq M}\frac{p(\theta,\theta_{1})}{p(\theta_{\ast},\theta_{1})}\geq\sqrt{\hat{\eta}}. (114)

Let νk\nu_{k} and ν^k\hat{\nu}_{k} denote the distributions of θk\theta_{k} and θ^k\hat{\theta}_{k} respectively. Then, we have

𝒲1(νk,ν^k)\displaystyle\mathcal{W}_{1}(\nu_{k},\hat{\nu}_{k})
1η¯k2ψ(1+ψ)(1η¯)\displaystyle\leq\frac{1-\bar{\eta}^{k}}{2\sqrt{\psi(1+\psi)}(1-\bar{\eta})}
2bnmax{ψ(4+8η2K12),\displaystyle\cdot\frac{2b}{n}\max\Bigg{\{}\psi(4+8\eta^{2}K_{1}^{2}),
1+ψ(1+η2σ2+16(1+2η2K12)(E+E2+4mK2m)2\displaystyle\qquad\qquad 1+\psi\Bigg{(}1+\eta^{2}\sigma^{2}+16(1+2\eta^{2}K_{1}^{2})\left(\frac{E+\sqrt{E^{2}+4mK}}{2m}\right)^{2}
+4η2(2E2+2K12(E+E2+4mK2m)2))}\displaystyle\qquad\qquad\qquad\qquad\qquad+4\eta^{2}\left(2E^{2}+2K_{1}^{2}\left(\frac{E+\sqrt{E^{2}+4mK}}{2m}\right)^{2}\right)\Bigg{)}\Bigg{\}}
\displaystyle\qquad\cdot\max\Bigg{\{}1+2\|\theta\|^{2}+2\left(\frac{E+\sqrt{E^{2}+4mK}}{2m}\right)^{2},
2ηmK1256ηmD2K22+64ηmD2K22(E+E2+4mK2m)2+2Km+ηmσ2},\displaystyle\qquad\qquad\qquad 2-\frac{\eta}{m}K_{1}^{2}-\frac{56\eta}{m}D^{2}K_{2}^{2}+\frac{64\eta}{m}D^{2}K_{2}^{2}\left(\frac{E+\sqrt{E^{2}+4mK}}{2m}\right)^{2}+\frac{2K}{m}+\frac{\eta}{m}\sigma^{2}\Bigg{\}}, (115)

where for any η0(0,η^)\eta_{0}\in(0,\hat{\eta}) and γ0(1mη+2ηK0R,1)\gamma_{0}\in\left(1-m\eta+\frac{2\eta K_{0}}{R},1\right) one can choose ψ=η0ηK0\psi=\frac{\eta_{0}}{\eta K_{0}} and η¯=(1(η^η0))2+Rψγ02+Rψ\bar{\eta}=(1-(\hat{\eta}-\eta_{0}))\vee\frac{2+R\psi\gamma_{0}}{2+R\psi} and dψd_{\psi} is the weighted total variation distance defined in Section B.1.

Proof  By applying Lemma 19, Lemma 21, Lemma 22 and Lemma 23, which is a modification of Lemma 3, we obtain

𝒲1(νk,ν^k)\displaystyle\mathcal{W}_{1}(\nu_{k},\hat{\nu}_{k})
1η¯k2ψ(1+ψ)(1η¯)\displaystyle\leq\frac{1-\bar{\eta}^{k}}{2\sqrt{\psi(1+\psi)}(1-\bar{\eta})}
2bnmax{ψ(4+8η2K12),\displaystyle\cdot\frac{2b}{n}\max\Bigg{\{}\psi(4+8\eta^{2}K_{1}^{2}),
1+ψ(1+η2σ2+(4+8η2K12)θθ^2+4η2supx𝒳f(θ,x)2)}\displaystyle\qquad\qquad 1+\psi\left(1+\eta^{2}\sigma^{2}+(4+8\eta^{2}K_{1}^{2})\|\theta_{\ast}-\hat{\theta}_{\ast}\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|^{2}\right)\Bigg{\}}
max{1+θθ^2,2ηmK1256ηmD2K22+64ηmD2K22θ^2+2Km+ηmσ2},\displaystyle\qquad\cdot\max\left\{1+\|\theta-\hat{\theta}_{\ast}\|^{2},2-\frac{\eta}{m}K_{1}^{2}-\frac{56\eta}{m}D^{2}K_{2}^{2}+\frac{64\eta}{m}D^{2}K_{2}^{2}\|\hat{\theta}_{\ast}\|^{2}+\frac{2K}{m}+\frac{\eta}{m}\sigma^{2}\right\}, (116)

where for any η0(0,η^)\eta_{0}\in(0,\hat{\eta}) and γ0(1mη+2ηK0R,1)\gamma_{0}\in\left(1-m\eta+\frac{2\eta K_{0}}{R},1\right) one can choose ψ=η0ηK0\psi=\frac{\eta_{0}}{\eta K_{0}} and η¯=(1(η^η0))2+Rψγ02+Rψ\bar{\eta}=(1-(\hat{\eta}-\eta_{0}))\vee\frac{2+R\psi\gamma_{0}}{2+R\psi} and dψd_{\psi} is the weighted total variation distance defined in Section B.1.

Finally, let us notice that θθ^22θ2+2θ^2\|\theta_{\ast}-\hat{\theta}_{\ast}\|^{2}\leq 2\|\theta_{\ast}\|^{2}+2\|\hat{\theta}_{\ast}\|^{2} and θθ^22θ2+2θ^2\|\theta-\hat{\theta}_{\ast}\|^{2}\leq 2\|\theta\|^{2}+2\|\hat{\theta}_{\ast}\|^{2} and for every x𝒳x\in\mathcal{X},

f(θ,x)22f(0,x)2+2f(θ,x)f(0,x)22f(0,x)2+2K12θ2.\|\nabla f(\theta_{\ast},x)\|^{2}\leq 2\|\nabla f(0,x)\|^{2}+2\|\nabla f(\theta_{\ast},x)-\nabla f(0,x)\|^{2}\leq 2\|\nabla f(0,x)\|^{2}+2K_{1}^{2}\|\theta_{\ast}\|^{2}. (117)

By applying Lemma 24, we complete the proof.  

E.2 Proof of Corollary 11

Proof  Under our assumptions, the noise terms ξk\xi_{k} are i.i.d. Gaussian 𝒩(0,Σ)\mathcal{N}(0,\Sigma) so that 𝔼[ξ12]=tr(Σ)=σ2\mathbb{E}[\|\xi_{1}\|^{2}]=\text{tr}(\Sigma)=\sigma^{2}. Moreover, we have ΣId\Sigma\prec I_{d}. Then p(θ,θ1)p(\theta_{\ast},\theta_{1}) is the probability density function of

θηbiΩ1f(θ,xi)+ηξ1.\theta_{\ast}-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta_{\ast},x_{i})+\eta\xi_{1}. (118)

Therefore,

θ1θMp(θ,θ1)𝑑θ1=(ηbiΩ1f(θ,xi)+ηξ1M).\displaystyle\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}=\mathbb{P}\left(\left\|-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta_{\ast},x_{i})+\eta\xi_{1}\right\|\leq M\right). (119)

Notice that for any Ω1\Omega_{1},

ηbiΩ1f(θ,xi)ηsupx𝒳f(θ,x).\left\|\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta_{\ast},x_{i})\right\|\leq\eta\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|. (120)

Thus, we have

θ1θMp(θ,θ1)𝑑θ1\displaystyle\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1} =1(ηbiΩ1f(θ,xi)+ηξ1>M)\displaystyle=1-\mathbb{P}\left(\left\|-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta_{\ast},x_{i})+\eta\xi_{1}\right\|>M\right)
1(ηξ1>Mηsupx𝒳f(θ,x))\displaystyle\geq 1-\mathbb{P}\left(\left\|\eta\xi_{1}\right\|>M-\eta\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|\right)
=1(ξ1>Mηsupx𝒳f(θ,x)).\displaystyle=1-\mathbb{P}\left(\left\|\xi_{1}\right\|>\frac{M}{\eta}-\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|\right). (121)

Since ξ1𝒩(0,Σ)\xi_{1}\sim\mathcal{N}(0,\Sigma) and ΣId\Sigma\prec I_{d}, for any γ12\gamma\leq\frac{1}{2}, we have

𝔼[eγξ12]=1det(Id2γΣ).\mathbb{E}\left[e^{\gamma\|\xi_{1}\|^{2}}\right]=\frac{1}{\sqrt{\det{(I_{d}-2\gamma\Sigma)}}}.

By Chebyshev’s inequality, letting γ=12\gamma=\frac{1}{2}, for any Mηsupx𝒳f(θ,x)M\geq\eta\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|, we get

θ1θMp(θ,θ1)𝑑θ111det(IdΣ)exp(12(Mηsupx𝒳f(θ,x))2).\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}\geq 1-\frac{1}{\sqrt{\det{(I_{d}-\Sigma)}}}\exp\left(-\frac{1}{2}\left(\frac{M}{\eta}-\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|\right)^{2}\right).

Next, for any θ,θ1d\theta,\theta_{1}\in\mathbb{R}^{d} such that θ1θM\|\theta_{1}-\theta_{\ast}\|\leq M and θθ22K0m(1+ϵ)1\|\theta-\theta_{\ast}\|^{2}\leq\frac{2K_{0}}{m}(1+\epsilon)-1, we have

p(θ,θ1)p(θ,θ1)=𝔼Ω1[pΩ1(θ,θ1)]𝔼Ω1[pΩ1(θ,θ1)],\displaystyle\frac{p(\theta,\theta_{1})}{p(\theta_{\ast},\theta_{1})}=\frac{\mathbb{E}_{\Omega_{1}}[p_{\Omega_{1}}(\theta,\theta_{1})]}{\mathbb{E}_{\Omega_{1}}[p_{\Omega_{1}}(\theta_{\ast},\theta_{1})]}, (122)

where 𝔼Ω1\mathbb{E}_{\Omega_{1}} denotes the expectation w.r.t. Ω1\Omega_{1}, and pΩ1p_{\Omega_{1}} denotes the probability density function conditional on Ω1\Omega_{1}. For any given Ω1\Omega_{1}, we can compute that

pΩ1(θ,θ1)pΩ1(θ,θ1)\displaystyle\frac{p_{\Omega_{1}}(\theta,\theta_{1})}{p_{\Omega_{1}}(\theta_{\ast},\theta_{1})}
=exp{12η2(θ1μ(θ))Σ1(θ1μ(θ))+12η2(θ1μ(θ))Σ1(θ1μ(θ))},\displaystyle=\exp\left\{-\frac{1}{2\eta^{2}}(\theta_{1}-\mu(\theta))^{\top}\Sigma^{-1}(\theta_{1}-\mu(\theta))+\frac{1}{2\eta^{2}}(\theta_{1}-\mu(\theta_{\ast}))^{\top}\Sigma^{-1}(\theta_{1}-\mu(\theta_{\ast}))\right\}, (123)

where

μ(θ):=θηbiΩ1f(θ,xi),μ(θ):=θηbiΩ1f(θ,xi).\mu(\theta):=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i}),\qquad\mu(\theta_{\ast}):=\theta_{\ast}-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta_{\ast},x_{i}). (124)

Therefore, for any $\theta,\theta_{1}\in\mathbb{R}^{d}$ such that $\|\theta_{1}-\theta_{\ast}\|\leq M$ and $\|\theta-\theta_{\ast}\|^{2}\leq\frac{2K_{0}}{m}(1+\epsilon)-1$, we have

pΩ1(θ,θ1)pΩ1(θ,θ1)\displaystyle\frac{p_{\Omega_{1}}(\theta,\theta_{1})}{p_{\Omega_{1}}(\theta_{\ast},\theta_{1})} exp{12η2μ(θ)μ(θ)Σ1(θ1μ(θ)+θ1μ(θ))}\displaystyle\geq\exp\left\{-\frac{1}{2\eta^{2}}\|\mu(\theta)-\mu(\theta_{\ast})\|\cdot\|\Sigma^{-1}\|\left(\|\theta_{1}-\mu(\theta)\|+\|\theta_{1}-\mu(\theta_{\ast})\|\right)\right\}
exp{12η2μ(θ)μ(θ)Σ1(μ(θ)μ(θ)+2θ1μ(θ))}.\displaystyle\geq\exp\left\{-\frac{1}{2\eta^{2}}\|\mu(\theta)-\mu(\theta_{\ast})\|\cdot\|\Sigma^{-1}\|\left(\|\mu(\theta)-\mu(\theta_{\ast})\|+2\|\theta_{1}-\mu(\theta_{\ast})\|\right)\right\}. (125)

We can further compute that

θ1μ(θ)θ1θ+ηbiΩ1f(θ,xi)M+ηsupx𝒳f(θ,x),\displaystyle\|\theta_{1}-\mu(\theta_{\ast})\|\leq\|\theta_{1}-\theta_{\ast}\|+\left\|\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta_{\ast},x_{i})\right\|\leq M+\eta\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|, (126)

and

μ(θ)μ(θ)\displaystyle\|\mu(\theta)-\mu(\theta_{\ast})\| θθ+ηbiΩ1f(θ,xi)f(θ,xi)\displaystyle\leq\|\theta-\theta_{\ast}\|+\frac{\eta}{b}\sum_{i\in\Omega_{1}}\left\|\nabla f(\theta,x_{i})-\nabla f(\theta_{\ast},x_{i})\right\|
(1+K1η)θθ\displaystyle\leq(1+K_{1}\eta)\|\theta-\theta_{\ast}\|
(1+K1η)(2K0m(1+ϵ)1)1/2.\displaystyle\leq(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}. (127)

Hence, we have

pΩ1(θ,θ1)pΩ1(θ,θ1)\displaystyle\frac{p_{\Omega_{1}}(\theta,\theta_{1})}{p_{\Omega_{1}}(\theta_{\ast},\theta_{1})} exp{(1+K1η)(2K0m(1+ϵ)1)1/22η2Σ1\displaystyle\geq\exp\Bigg{\{}-\frac{(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}}{2\eta^{2}}\|\Sigma^{-1}\|
((1+K1η)(2K0m(1+ϵ)1)1/2+2(M+ηsupx𝒳f(θ,x)))}.\displaystyle\qquad\cdot\Bigg{(}(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}+2\left(M+\eta\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|\right)\Bigg{)}\Bigg{\}}. (128)

Since this bound holds for every $\Omega_{1}$, we have

p(θ,θ1)p(θ,θ1)\displaystyle\frac{p(\theta,\theta_{1})}{p(\theta_{\ast},\theta_{1})} exp{(1+K1η)(2K0m(1+ϵ)1)1/22η2Σ1\displaystyle\geq\exp\Bigg{\{}-\frac{(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}}{2\eta^{2}}\|\Sigma^{-1}\|
((1+K1η)(2K0m(1+ϵ)1)1/2+2(M+ηsupx𝒳f(θ,x)))}.\displaystyle\qquad\cdot\Bigg{(}(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}+2\left(M+\eta\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|\right)\Bigg{)}\Bigg{\}}. (129)

Hence, we conclude that

η^\displaystyle\hat{\eta} (maxMηsupx𝒳f(θ,x){min{1exp(12(Mηsupx𝒳f(θ,x))2)det(IdΣ),\displaystyle\geq\Bigg{(}\max_{M\geq\eta\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|}\Bigg{\{}\min\Bigg{\{}1-\frac{\exp\left(-\frac{1}{2}\left(\frac{M}{\eta}-\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|\right)^{2}\right)}{\sqrt{\det{(I_{d}-\Sigma)}}},
exp{(1+K1η)(2K0m(1+ϵ)1)1/22η2Σ1\displaystyle\qquad\exp\Bigg{\{}-\frac{(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}}{2\eta^{2}}\|\Sigma^{-1}\|
((1+K1η)(2K0m(1+ϵ)1)1/2+2(M+ηsupx𝒳f(θ,x)))}}})2.\displaystyle\qquad\quad\cdot\Bigg{(}(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}+2\left(M+\eta\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|\right)\Bigg{)}\Bigg{\}}\Bigg{\}}\Bigg{\}}\Bigg{)}^{2}. (130)

This completes the proof.  

F Proofs of Non-Convex Case without Additive Noise

F.1 Proof of Theorem 12

Proof  Let us recall that $\theta_{0}=\hat{\theta}_{0}=\theta$ and for any $k\in\mathbb{N}$,

θk=θk1ηbiΩkf(θk1,xi),\displaystyle\theta_{k}=\theta_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f(\theta_{k-1},x_{i}), (131)
θ^k=θ^k1ηbiΩkf(θ^k1,x^i).\displaystyle\hat{\theta}_{k}=\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f(\hat{\theta}_{k-1},\hat{x}_{i}). (132)

Thus it follows that

θkθ^k\displaystyle\theta_{k}-\hat{\theta}_{k} =θk1θ^k1ηbiΩk(f(θk1,xi)f(θ^k1,xi))+ηbk,\displaystyle=\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right)+\frac{\eta}{b}\mathcal{E}_{k}, (133)

where

k:=iΩk(f(θ^k1,x^i)f(θ^k1,xi)).\mathcal{E}_{k}:=\sum_{i\in\Omega_{k}}\left(\nabla f(\hat{\theta}_{k-1},\hat{x}_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right). (134)

This implies that

θkθ^k2\displaystyle\left\|\theta_{k}-\hat{\theta}_{k}\right\|^{2} =θk1θ^k1ηbiΩk(f(θk1,xi)f(θ^k1,xi))2+η2b2k2\displaystyle=\left\|\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right)\right\|^{2}+\frac{\eta^{2}}{b^{2}}\left\|\mathcal{E}_{k}\right\|^{2}
+2θk1θ^k1ηbiΩk(f(θk1,xi)f(θ^k1,xi)),ηbk.\displaystyle\qquad+2\left\langle\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right),\frac{\eta}{b}\mathcal{E}_{k}\right\rangle. (135)

By Assumption 1 and Assumption 3, we have

θk1θ^k1ηbiΩk(f(θk1,xi)f(θ^k1,xi))2\displaystyle\left\|\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right)\right\|^{2}
(12ηm)θk1θ^k12+2ηK+η2b2(bK1θk1θ^k1)2\displaystyle\leq(1-2\eta m)\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|^{2}+2\eta K+\frac{\eta^{2}}{b^{2}}\left(bK_{1}\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|\right)^{2}
(1ηm)θk1θ^k12+2ηK,\displaystyle\leq(1-\eta m)\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|^{2}+2\eta K, (136)

provided that $\eta\leq\frac{m}{K_{1}^{2}}$, since then $\eta^{2}K_{1}^{2}\leq\eta m$.

Since $X_{n}$ and $\hat{X}_{n}$ differ by at most one element and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$, the differing index lies outside $\Omega_{k}$ with probability $\frac{n-b}{n}$, in which case $x_{i}=\hat{x}_{i}$ for every $i\in\Omega_{k}$, and it lies in $\Omega_{k}$ with probability $\frac{b}{n}$, in which case $x_{i}\neq\hat{x}_{i}$ for exactly one $i\in\Omega_{k}$. Therefore,

𝔼k2bn𝔼[(K22D(2θ^k1+1))2]4D2K22bn(8𝔼θ^k12+2),\mathbb{E}\left\|\mathcal{E}_{k}\right\|^{2}\leq\frac{b}{n}\mathbb{E}\left[\left(K_{2}2D\left(2\|\hat{\theta}_{k-1}\|+1\right)\right)^{2}\right]\leq\frac{4D^{2}K_{2}^{2}b}{n}\left(8\mathbb{E}\|\hat{\theta}_{k-1}\|^{2}+2\right), (137)

and moreover,

𝔼θk1θ^k1ηbiΩk(f(θk1,xi)f(θ^k1,xi)),ηbk\displaystyle\mathbb{E}\left\langle\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right),\frac{\eta}{b}\mathcal{E}_{k}\right\rangle
ηbbn𝔼[(1+K1η)θk1θ^k1K22D(2θ^k1+1)]\displaystyle\leq\frac{\eta}{b}\frac{b}{n}\mathbb{E}\left[(1+K_{1}\eta)\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|K_{2}2D\left(2\|\hat{\theta}_{k-1}\|+1\right)\right]
2K2Dηn(1+K1η)𝔼[(θk1+θ^k1)(2θ^k1+1)]\displaystyle\leq\frac{2K_{2}D\eta}{n}(1+K_{1}\eta)\mathbb{E}\left[\left(\|\theta_{k-1}\|+\|\hat{\theta}_{k-1}\|\right)\left(2\|\hat{\theta}_{k-1}\|+1\right)\right]
2K2Dηn(1+K1η)(1+32𝔼θk12+72𝔼θ^k12),\displaystyle\leq\frac{2K_{2}D\eta}{n}(1+K_{1}\eta)\left(1+\frac{3}{2}\mathbb{E}\|\theta_{k-1}\|^{2}+\frac{7}{2}\mathbb{E}\|\hat{\theta}_{k-1}\|^{2}\right), (138)

where we used the inequality that

(a+b)(2b+1)=2b2+2ab+a+b1+32a2+72b2,(a+b)(2b+1)=2b^{2}+2ab+a+b\leq 1+\frac{3}{2}a^{2}+\frac{7}{2}b^{2}, (139)

for any $a,b\in\mathbb{R}$. Therefore, we have

𝔼θkθ^k2\displaystyle\mathbb{E}\left\|\theta_{k}-\hat{\theta}_{k}\right\|^{2} (1ηm)𝔼θk1θ^k12+2ηK+4D2K22η2bn(8𝔼θ^k12+2)\displaystyle\leq(1-\eta m)\mathbb{E}\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|^{2}+2\eta K+\frac{4D^{2}K_{2}^{2}\eta^{2}}{bn}\left(8\mathbb{E}\|\hat{\theta}_{k-1}\|^{2}+2\right)
+4K2Dηn(1+K1η)(1+32𝔼θk12+72𝔼θ^k12).\displaystyle\qquad+\frac{4K_{2}D\eta}{n}(1+K_{1}\eta)\left(1+\frac{3}{2}\mathbb{E}\|\theta_{k-1}\|^{2}+\frac{7}{2}\mathbb{E}\|\hat{\theta}_{k-1}\|^{2}\right). (140)

In Lemma 21, we showed that under the assumption $\eta<\min\left\{\frac{1}{m},\frac{m}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\}$ and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$, we have that for every $k\in\mathbb{N}$,

𝔼V^(θ^k)(1ηm)𝔼V^(θ^k1)+2ηmη2K1256η2D2K22+64η2D2K22θ^2+2ηK,\displaystyle\mathbb{E}\hat{V}(\hat{\theta}_{k})\leq(1-\eta m)\mathbb{E}\hat{V}(\hat{\theta}_{k-1})+2\eta m-\eta^{2}K_{1}^{2}-56\eta^{2}D^{2}K_{2}^{2}+64\eta^{2}D^{2}K_{2}^{2}\|\hat{\theta}_{\ast}\|^{2}+2\eta K, (141)

where $\hat{V}(\theta):=1+\|\theta-\hat{\theta}_{\ast}\|^{2}$ and $\hat{\theta}_{\ast}$ is the minimizer of $\hat{F}(\theta,\hat{X}_{n}):=\frac{1}{n}\sum_{i=1}^{n}f(\theta,\hat{x}_{i})$. This implies that

𝔼[V^(θ^k)]\displaystyle\mathbb{E}\left[\hat{V}\left(\hat{\theta}_{k}\right)\right]
(1ηm)k𝔼[V^(θ^0)]+2ηmη2K1256η2D2K22+64η2D2K22θ^2+2ηK1(1ηm)\displaystyle\leq(1-\eta m)^{k}\mathbb{E}\left[\hat{V}\left(\hat{\theta}_{0}\right)\right]+\frac{2\eta m-\eta^{2}K_{1}^{2}-56\eta^{2}D^{2}K_{2}^{2}+64\eta^{2}D^{2}K_{2}^{2}\|\hat{\theta}_{\ast}\|^{2}+2\eta K}{1-(1-\eta m)}
1+θθ^2+2ηmK1256ηmD2K22+64ηmD2K22θ^2+2Km,\displaystyle\leq 1+\|\theta-\hat{\theta}_{\ast}\|^{2}+2-\frac{\eta}{m}K_{1}^{2}-\frac{56\eta}{m}D^{2}K_{2}^{2}+\frac{64\eta}{m}D^{2}K_{2}^{2}\|\hat{\theta}_{\ast}\|^{2}+\frac{2K}{m}, (142)

so that

𝔼θ^k2\displaystyle\mathbb{E}\|\hat{\theta}_{k}\|^{2} 2𝔼θ^kθ^2+2θ^2\displaystyle\leq 2\mathbb{E}\left\|\hat{\theta}_{k}-\hat{\theta}_{\ast}\right\|^{2}+2\|\hat{\theta}_{\ast}\|^{2}
2θθ^2+42ηmK12112ηmD2K22+128ηmD2K22θ^2+4Km+2θ^2\displaystyle\leq 2\left\|\theta-\hat{\theta}_{\ast}\right\|^{2}+4-\frac{2\eta}{m}K_{1}^{2}-\frac{112\eta}{m}D^{2}K_{2}^{2}+\frac{128\eta}{m}D^{2}K_{2}^{2}\|\hat{\theta}_{\ast}\|^{2}+\frac{4K}{m}+2\|\hat{\theta}_{\ast}\|^{2}
4θ2+4(E+E2+4mK2m)2+42ηmK12112ηmD2K22\displaystyle\leq 4\|\theta\|^{2}+4\left(\frac{E+\sqrt{E^{2}+4mK}}{2m}\right)^{2}+4-\frac{2\eta}{m}K_{1}^{2}-\frac{112\eta}{m}D^{2}K_{2}^{2}
+128ηmD2K22(E+E2+4mK2m)2+4Km+2(E+E2+4mK2m)2=:B,\displaystyle\qquad+\frac{128\eta}{m}D^{2}K_{2}^{2}\left(\frac{E+\sqrt{E^{2}+4mK}}{2m}\right)^{2}+\frac{4K}{m}+2\left(\frac{E+\sqrt{E^{2}+4mK}}{2m}\right)^{2}=:B, (143)

where we applied Lemma 24. Similarly, we can show that

𝔼θk2B.\displaystyle\mathbb{E}\|\theta_{k}\|^{2}\leq B. (144)

Since $\theta_{0}=\hat{\theta}_{0}=\theta$, it follows from (140), (143) and (144) that

𝒲22(νk,ν^k)\displaystyle\mathcal{W}_{2}^{2}(\nu_{k},\hat{\nu}_{k})
𝔼θkθ^k2\displaystyle\leq\mathbb{E}\left\|\theta_{k}-\hat{\theta}_{k}\right\|^{2}
(1(1ηm)k)(4D2K22ηbnm(8B+2)+4K2Dnm(1+K1η)(1+32B+72B)+2Km),\displaystyle\leq\left(1-(1-\eta m)^{k}\right)\left(\frac{4D^{2}K_{2}^{2}\eta}{bnm}(8B+2)+\frac{4K_{2}D}{nm}(1+K_{1}\eta)\left(1+\frac{3}{2}B+\frac{7}{2}B\right)+\frac{2K}{m}\right), (145)

provided that $\eta<\min\left\{\frac{m}{K_{1}^{2}},\frac{1}{m},\frac{m}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\}$. This completes the proof.  
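Before moving to the next case, the following minimal sketch illustrates the coupled recursions (131)-(132) behind this proof. It runs SGD with shared minibatches on two data sets that differ in a single sample, using the surrogate loss $f(\theta,x)=\frac{1}{2}\|\theta-x\|^{2}$ (a hypothetical choice made only to keep the example short; it is not claimed to satisfy all assumptions of Theorem 12), and monitors $\mathbb{E}\|\theta_{k}-\hat{\theta}_{k}\|^{2}$:

    import numpy as np

    rng = np.random.default_rng(1)
    n, b, d, eta, T, reps = 200, 10, 5, 0.05, 1000, 50

    X = rng.normal(size=(n, d))
    X_hat = X.copy()
    X_hat[0] = rng.normal(size=d)          # neighbouring data set: one sample replaced

    def grad(theta, batch):
        # gradient of the minibatch loss (1/b) * sum_i 0.5 * ||theta - x_i||^2
        return theta - batch.mean(axis=0)

    gaps = np.zeros(T)
    for _ in range(reps):
        theta = rng.normal(size=d)
        theta_hat = theta.copy()           # common initialization theta_0 = hat(theta)_0
        for k in range(T):
            idx = rng.choice(n, size=b, replace=False)   # shared minibatch Omega_{k+1}
            g, g_hat = grad(theta, X[idx]), grad(theta_hat, X_hat[idx])
            theta = theta - eta * g
            theta_hat = theta_hat - eta * g_hat
            gaps[k] += np.sum((theta - theta_hat) ** 2) / reps

    print("estimated E||theta_k - hat(theta)_k||^2 at k = 10, 100, 1000:",
          [float(round(gaps[k - 1], 6)) for k in (10, 100, 1000)])

For this surrogate, the estimated gap saturates at a level of order $1/n$ instead of growing with $k$, which mirrors the time-uniform behavior predicted by (145).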

G Proofs of Convex Case with Additional Geometric Structure

In the following technical lemma, we show that the $p$-th moments of $\theta_{k}$ and $\hat{\theta}_{k}$ can be bounded in the following sense.

Lemma 26

Let $\theta_{0}=\hat{\theta}_{0}=\theta$. Suppose Assumption 5 and Assumption 6 hold and $\eta\leq\frac{\mu}{K_{1}^{2}+2^{p+4}D^{2}K_{2}^{2}}$. Then, we have

1ki=1k𝔼θi1θpθθ2kημ+8ημD2K22(2p+1θp+5),\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\|\theta_{i-1}-\theta_{\ast}\right\|^{p}\leq\frac{\left\|\theta-\theta_{\ast}\right\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\|\theta_{\ast}\|^{p}+5\right), (146)
1ki=1k𝔼θ^i1θ^pθθ^2kημ+8ημD2K22(2p+1θ^p+5),\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\|\hat{\theta}_{i-1}-\hat{\theta}_{\ast}\right\|^{p}\leq\frac{\left\|\theta-\hat{\theta}_{\ast}\right\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\|\hat{\theta}_{\ast}\|^{p}+5\right), (147)

where $\hat{\theta}_{\ast}$ is the minimizer of $\frac{1}{n}\sum_{i=1}^{n}f(\theta,\hat{x}_{i})$ and $\theta_{\ast}$ is the minimizer of $\frac{1}{n}\sum_{i=1}^{n}f(\theta,x_{i})$.

Proof  First, we recall that

θ^k=θ^k1ηbiΩkf(θ^k1,x^i).\hat{\theta}_{k}=\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right). (148)

Therefore, we have

θ^k=θ^k1ηni=1nf(θ^k1,x^i)+η(1ni=1nf(θ^k1,x^i)1biΩkf(θ^k1,x^i)).\displaystyle\hat{\theta}_{k}=\hat{\theta}_{k-1}-\frac{\eta}{n}\sum_{i=1}^{n}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)+\eta\left(\frac{1}{n}\sum_{i=1}^{n}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)-\frac{1}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)\right). (149)

Moreover, we have

𝔼[1biΩkf(θ^k1,x^i)|θ^k1]=1ni=1nf(θ^k1,x^i).\mathbb{E}\left[\frac{1}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)\Big{|}\hat{\theta}_{k-1}\right]=\frac{1}{n}\sum_{i=1}^{n}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right). (150)

This implies that

\displaystyle\mathbb{E}\left\|\hat{\theta}_{k}-\hat{\theta}_{\ast}\right\|^{2}=\mathbb{E}\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}-\frac{\eta}{n}\sum_{i=1}^{n}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)\right\|^{2}
\displaystyle\qquad+\eta^{2}\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)-\frac{1}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)\right\|^{2}. (151)

We can compute that

\displaystyle\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}-\frac{\eta}{n}\sum_{i=1}^{n}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)\right\|^{2}
\displaystyle=\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}-\frac{\eta}{n}\sum_{i=1}^{n}\left(\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)-\nabla f\left(\hat{\theta}_{\ast},\hat{x}_{i}\right)\right)\right\|^{2}
θ^k1θ^22ημθ^k1θ^p+η2n2i=1n(f(θ^k1,x^i)f(θ^,x^i))2\displaystyle\leq\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\|^{2}-2\eta\mu\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\|^{p}+\frac{\eta^{2}}{n^{2}}\left\|\sum_{i=1}^{n}\left(\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)-\nabla f\left(\hat{\theta}_{\ast},\hat{x}_{i}\right)\right)\right\|^{2}
θ^k1θ^22ημθ^k1θ^p+η2K12θ^k1θ^p.\displaystyle\leq\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\|^{2}-2\eta\mu\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\|^{p}+\eta^{2}K_{1}^{2}\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\|^{p}. (152)

Moreover, we can compute that

𝔼1ni=1nf(θ^k1,x^i)1biΩkf(θ^k1,x^i)2\displaystyle\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)-\frac{1}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)\right\|^{2}
=𝔼1biΩk(1nj=1nf(θ^k1,x^j)f(θ^k1,x^i))2\displaystyle=\mathbb{E}\left\|\frac{1}{b}\sum_{i\in\Omega_{k}}\left(\frac{1}{n}\sum_{j=1}^{n}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{j}\right)-\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)\right)\right\|^{2}
\displaystyle\leq\mathbb{E}\left(\frac{1}{b}\sum_{i\in\Omega_{k}}K_{2}\frac{1}{n}\sum_{j=1}^{n}\|\hat{x}_{i}-\hat{x}_{j}\|\left(2\|\hat{\theta}_{k-1}\|^{p-1}+1\right)\right)^{2}
4D2K22𝔼[(2θ^k1p1+1)2]\displaystyle\leq 4D^{2}K_{2}^{2}\mathbb{E}\left[(2\|\hat{\theta}_{k-1}\|^{p-1}+1)^{2}\right]
8D2K22(4𝔼[θ^k12(p1)]+1)\displaystyle\leq 8D^{2}K_{2}^{2}\left(4\mathbb{E}\left[\|\hat{\theta}_{k-1}\|^{2(p-1)}\right]+1\right)
8D2K22(4𝔼[θ^k1p]+5)\displaystyle\leq 8D^{2}K_{2}^{2}\left(4\mathbb{E}\left[\|\hat{\theta}_{k-1}\|^{p}\right]+5\right)
8D2K22(2p+1𝔼θ^k1θ^p+2p+1θ^p+5).\displaystyle\leq 8D^{2}K_{2}^{2}\left(2^{p+1}\mathbb{E}\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\|^{p}+2^{p+1}\|\hat{\theta}_{\ast}\|^{p}+5\right). (153)

Hence, by applying (152) and (153) to (151), we conclude that

𝔼θ^kθ^2\displaystyle\mathbb{E}\left\|\hat{\theta}_{k}-\hat{\theta}_{\ast}\right\|^{2} 𝔼θ^k1θ^22ημ𝔼θ^k1θ^p+η2K12𝔼θ^k1θ^p\displaystyle\leq\mathbb{E}\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\|^{2}-2\eta\mu\mathbb{E}\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\|^{p}+\eta^{2}K_{1}^{2}\mathbb{E}\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\|^{p}
\displaystyle\qquad\qquad+8\eta^{2}D^{2}K_{2}^{2}\left(2^{p+1}\mathbb{E}\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\|^{p}+2^{p+1}\|\hat{\theta}_{\ast}\|^{p}+5\right)
𝔼θ^k1θ^2ημ𝔼θ^k1θ^p+8η2D2K22(2p+1θ^p+5),\displaystyle\leq\mathbb{E}\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\|^{2}-\eta\mu\mathbb{E}\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\|^{p}+8\eta^{2}D^{2}K_{2}^{2}\left(2^{p+1}\|\hat{\theta}_{\ast}\|^{p}+5\right), (154)

provided that $\eta\leq\frac{\mu}{K_{1}^{2}+2^{p+4}D^{2}K_{2}^{2}}$, since then $\eta^{2}K_{1}^{2}+2^{p+4}\eta^{2}D^{2}K_{2}^{2}\leq\eta\mu$. This implies that

𝔼θ^k1θ^p𝔼θ^k1θ^2𝔼θ^kθ^2ημ+8ημD2K22(2p+1θ^p+5),\mathbb{E}\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\|^{p}\leq\frac{\mathbb{E}\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\|^{2}-\mathbb{E}\left\|\hat{\theta}_{k}-\hat{\theta}_{\ast}\right\|^{2}}{\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\|\hat{\theta}_{\ast}\|^{p}+5\right), (155)

and hence

1ki=1k𝔼θ^i1θ^p\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\|\hat{\theta}_{i-1}-\hat{\theta}_{\ast}\right\|^{p} θ^0θ^2𝔼θ^kθ^2kημ+8ημD2K22(2p+1θ^p+5)\displaystyle\leq\frac{\left\|\hat{\theta}_{0}-\hat{\theta}_{\ast}\right\|^{2}-\mathbb{E}\left\|\hat{\theta}_{k}-\hat{\theta}_{\ast}\right\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\|\hat{\theta}_{\ast}\|^{p}+5\right)
θθ^2kημ+8ημD2K22(2p+1θ^p+5).\displaystyle\leq\frac{\left\|\theta-\hat{\theta}_{\ast}\right\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\|\hat{\theta}_{\ast}\|^{p}+5\right). (156)

Similarly, we can show that

1ki=1k𝔼θi1θpθθ2kημ+8ημD2K22(2p+1θp+5).\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\|\theta_{i-1}-\theta_{\ast}\right\|^{p}\leq\frac{\left\|\theta-\theta_{\ast}\right\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\|\theta_{\ast}\|^{p}+5\right). (157)

This completes the proof.  

Next, let us provide a technical lemma that upper bounds the norms of $\theta_{\ast}$ and $\hat{\theta}_{\ast}$, which are the minimizers of $\hat{F}(\theta,X_{n}):=\frac{1}{n}\sum_{i=1}^{n}f(\theta,x_{i})$ and $\hat{F}(\theta,\hat{X}_{n}):=\frac{1}{n}\sum_{i=1}^{n}f(\theta,\hat{x}_{i})$, respectively.

Lemma 27

Under Assumption 5, we have

θ1μ1p1supx𝒳f(0,x)1p1,\|\theta_{\ast}\|\leq\frac{1}{\mu^{\frac{1}{p-1}}}\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|^{\frac{1}{p-1}},

and

θ^1μ1p1supx𝒳f(0,x)1p1.\|\hat{\theta}_{\ast}\|\leq\frac{1}{\mu^{\frac{1}{p-1}}}\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|^{\frac{1}{p-1}}.

Proof  Under Assumption 5, we have

F^(0,Xn)F^(θ,Xn),0θ\displaystyle\left\langle\nabla\hat{F}(0,X_{n})-\nabla\hat{F}(\theta_{\ast},X_{n}),0-\theta_{\ast}\right\rangle =1ni=1nf(0,xi),θ\displaystyle=-\frac{1}{n}\sum_{i=1}^{n}\langle\nabla f(0,x_{i}),\theta_{\ast}\rangle
=1ni=1nf(0,xi)f(θ,xi),0θμθp,\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\langle\nabla f(0,x_{i})-\nabla f(\theta_{\ast},x_{i}),0-\theta_{\ast}\rangle\geq\mu\|\theta_{\ast}\|^{p}, (158)

where $p\in(1,2)$, which implies that

μθpsupx𝒳f(0,x)θ,\mu\|\theta_{\ast}\|^{p}\leq\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\cdot\|\theta_{\ast}\|, (159)

which yields that

θ1μ1p1supx𝒳f(0,x)1p1.\|\theta_{\ast}\|\leq\frac{1}{\mu^{\frac{1}{p-1}}}\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|^{\frac{1}{p-1}}.

Similarly, one can show that

θ^1μ1p1supx𝒳f(0,x)1p1.\|\hat{\theta}_{\ast}\|\leq\frac{1}{\mu^{\frac{1}{p-1}}}\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|^{\frac{1}{p-1}}.

This completes the proof.  

Now, we are ready to state the main result on Wasserstein algorithmic stability.

Theorem 28

Let $\theta_{0}=\hat{\theta}_{0}=\theta$. Suppose Assumption 5 and Assumption 6 hold, $\eta\leq\frac{\mu}{K_{1}^{2}+2^{p+4}D^{2}K_{2}^{2}}$, $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$, and $\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\leq E$ for some $E<\infty$. Let $\nu_{k}$ and $\hat{\nu}_{k}$ denote the distributions of $\theta_{k}$ and $\hat{\theta}_{k}$, respectively. Then, we have

\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathcal{W}_{p}^{p}(\nu_{i-1},\hat{\nu}_{i-1})
\displaystyle\leq\frac{4D^{2}K_{2}^{2}\eta}{bn\mu}\cdot 2^{p+2}\cdot\left(\frac{2\|\theta\|^{2}+2(E/\mu)^{\frac{2}{p-1}}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}(E/\mu)^{\frac{p}{p-1}}+5\right)\right)
\displaystyle\qquad\qquad\qquad+\frac{4D^{2}K_{2}^{2}\eta}{bn\mu}\cdot 2^{p+2}\left(2^{p+2}(E/\mu)^{\frac{p}{p-1}}+10\right)
\displaystyle\quad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\cdot 3\cdot 2^{p-1}\left(\frac{2\|\theta\|^{2}+2(E/\mu)^{\frac{2}{p-1}}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}(E/\mu)^{\frac{p}{p-1}}+5\right)\right)
\displaystyle\qquad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\cdot 7\cdot 2^{p-1}\left(\frac{2\|\theta\|^{2}+2(E/\mu)^{\frac{2}{p-1}}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}(E/\mu)^{\frac{p}{p-1}}+5\right)\right)
\displaystyle\qquad\quad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\left(10\cdot 2^{p-1}(E/\mu)^{\frac{p}{p-1}}+5\right). (160)

Proof  Let us recall that $\theta_{0}=\hat{\theta}_{0}=\theta$ and for any $k\in\mathbb{N}$,

θk=θk1ηbiΩkf(θk1,xi),\displaystyle\theta_{k}=\theta_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f(\theta_{k-1},x_{i}), (161)
θ^k=θ^k1ηbiΩkf(θ^k1,x^i).\displaystyle\hat{\theta}_{k}=\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right). (162)

Thus it follows that

θkθ^k\displaystyle\theta_{k}-\hat{\theta}_{k} =θk1θ^k1ηbiΩk(f(θk1,xi)f(θ^k1,xi))+ηbk,\displaystyle=\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right)+\frac{\eta}{b}\mathcal{E}_{k}, (163)

where

k:=iΩk(f(θ^k1,x^i)f(θ^k1,xi)).\mathcal{E}_{k}:=\sum_{i\in\Omega_{k}}\left(\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right). (164)

This implies that

θkθ^k2\displaystyle\left\|\theta_{k}-\hat{\theta}_{k}\right\|^{2} =θk1θ^k1ηbiΩk(f(θk1,xi)f(θ^k1,xi))2+η2b2k2\displaystyle=\left\|\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right)\right\|^{2}+\frac{\eta^{2}}{b^{2}}\left\|\mathcal{E}_{k}\right\|^{2}
+2θk1θ^k1ηbiΩk(f(θk1,xi)f(θ^k1,xi)),ηbk.\displaystyle\qquad+2\left\langle\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right),\frac{\eta}{b}\mathcal{E}_{k}\right\rangle. (165)

By Assumption 6 and Assumption 5, we have

θk1θ^k1ηbiΩk(f(θk1,xi)f(θ^k1,xi))2\displaystyle\left\|\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right)\right\|^{2}
θk1θ^k122ημθk1θ^k1p+η2b2(bK1θk1θ^k1p2)2\displaystyle\leq\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|^{2}-2\eta\mu\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|^{p}+\frac{\eta^{2}}{b^{2}}\left(bK_{1}\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|^{\frac{p}{2}}\right)^{2}
=θk1θ^k122ημθk1θ^k1p+η2K12θk1θ^k1p\displaystyle=\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|^{2}-2\eta\mu\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|^{p}+\eta^{2}K_{1}^{2}\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|^{p}
θk1θ^k12ημθk1θ^k1p,\displaystyle\leq\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|^{2}-\eta\mu\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|^{p}, (166)

provided that $\eta\leq\frac{\mu}{K_{1}^{2}}$, since then $\eta^{2}K_{1}^{2}\leq\eta\mu$.

Since $X_{n}$ and $\hat{X}_{n}$ differ by at most one element and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$, the differing index lies outside $\Omega_{k}$ with probability $\frac{n-b}{n}$, in which case $x_{i}=\hat{x}_{i}$ for every $i\in\Omega_{k}$, and it lies in $\Omega_{k}$ with probability $\frac{b}{n}$, in which case $x_{i}\neq\hat{x}_{i}$ for exactly one $i\in\Omega_{k}$. Therefore,

𝔼k2\displaystyle\mathbb{E}\left\|\mathcal{E}_{k}\right\|^{2} bn𝔼[(K22D(2θ^k1p1+1))2]\displaystyle\leq\frac{b}{n}\mathbb{E}\left[\left(K_{2}2D(2\|\hat{\theta}_{k-1}\|^{p-1}+1)\right)^{2}\right]
4D2K22bn(8𝔼[θ^k12(p1)]+2)\displaystyle\leq\frac{4D^{2}K_{2}^{2}b}{n}\left(8\mathbb{E}\left[\|\hat{\theta}_{k-1}\|^{2(p-1)}\right]+2\right)
4D2K22bn(8𝔼[θ^k1p]+10)\displaystyle\leq\frac{4D^{2}K_{2}^{2}b}{n}\left(8\mathbb{E}\left[\|\hat{\theta}_{k-1}\|^{p}\right]+10\right)
4D2K22bn(2p+2𝔼θ^k1θ^p+2p+2θ^p+10),\displaystyle\leq\frac{4D^{2}K_{2}^{2}b}{n}\left(2^{p+2}\mathbb{E}\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\|^{p}+2^{p+2}\|\hat{\theta}_{\ast}\|^{p}+10\right), (167)

and moreover,

𝔼θk1θ^k1ηbiΩk(f(θk1,xi)f(θ^k1,xi)),ηbk\displaystyle\mathbb{E}\left\langle\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right),\frac{\eta}{b}\mathcal{E}_{k}\right\rangle
ηbbn𝔼[(θk1θ^k1+K1ηθk1θ^k1p2)K22D(2θ^k1p1+1)]\displaystyle\leq\frac{\eta}{b}\frac{b}{n}\mathbb{E}\left[\left(\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|+K_{1}\eta\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|^{\frac{p}{2}}\right)K_{2}2D\left(2\|\hat{\theta}_{k-1}\|^{p-1}+1\right)\right]
2DK2ηn𝔼[((1+K1η)θk1θ^k1+K1η)(2θ^k1p1+1)]\displaystyle\leq\frac{2DK_{2}\eta}{n}\mathbb{E}\left[\left((1+K_{1}\eta)\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|+K_{1}\eta\right)\left(2\|\hat{\theta}_{k-1}\|^{p-1}+1\right)\right]
2DK2ηn(1+K1η)𝔼[(θk1+θ^k1+1)(2θ^k1p1+1)].\displaystyle\leq\frac{2DK_{2}\eta}{n}(1+K_{1}\eta)\mathbb{E}\left[\left(\|\theta_{k-1}\|+\|\hat{\theta}_{k-1}\|+1\right)\left(2\|\hat{\theta}_{k-1}\|^{p-1}+1\right)\right]. (168)

Notice that for any $x,y\geq 0$ and $p\in(1,2)$, we have $xy^{p-1}\leq x^{p}+y^{p}$, $y\leq y^{p}+1$, $x\leq x^{p}+1$ and $y^{p-1}\leq y^{p}+1$, which implies that

(x+y+1)(2yp1+1)=2xyp1+2yp+2yp1+x+y+13xp+7yp+5.(x+y+1)(2y^{p-1}+1)=2xy^{p-1}+2y^{p}+2y^{p-1}+x+y+1\leq 3x^{p}+7y^{p}+5. (169)
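As a quick numerical sanity check of (169) (an aside; the proof continues below), the inequality can be verified on a grid for several values of $p\in(1,2)$:

    import numpy as np

    # Grid check of (x + y + 1)(2 y^{p-1} + 1) <= 3 x^p + 7 y^p + 5 for x, y >= 0.
    xs = np.linspace(0.0, 10.0, 201)
    X, Y = np.meshgrid(xs, xs)
    for p in (1.1, 1.5, 1.9):
        lhs = (X + Y + 1.0) * (2.0 * Y ** (p - 1.0) + 1.0)
        rhs = 3.0 * X ** p + 7.0 * Y ** p + 5.0
        print(f"p={p}: max(lhs - rhs) = {np.max(lhs - rhs):.3e}  (expected <= 0)")
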

Therefore, by applying (169) to (168), we have

𝔼θk1θ^k1ηbiΩk(f(θk1,xi)f(θ^k1,xi)),ηbk\displaystyle\mathbb{E}\left\langle\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right),\frac{\eta}{b}\mathcal{E}_{k}\right\rangle
2DK2ηn(1+K1η)(3𝔼θk1p+7𝔼θ^k1p+5)\displaystyle\leq\frac{2DK_{2}\eta}{n}(1+K_{1}\eta)\left(3\mathbb{E}\|\theta_{k-1}\|^{p}+7\mathbb{E}\|\hat{\theta}_{k-1}\|^{p}+5\right)
2DK2ηn(1+K1η)(32p1𝔼θk1θp+32p1θp\displaystyle\leq\frac{2DK_{2}\eta}{n}(1+K_{1}\eta)\Big{(}3\cdot 2^{p-1}\mathbb{E}\|\theta_{k-1}-\theta_{\ast}\|^{p}+3\cdot 2^{p-1}\|\theta_{\ast}\|^{p}
+72p1𝔼θ^k1θ^p+72p1θ^p+5).\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad+7\cdot 2^{p-1}\mathbb{E}\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\|^{p}+7\cdot 2^{p-1}\|\hat{\theta}_{\ast}\|^{p}+5\Big{)}. (170)

Hence, by applying (166), (167) and (170) to (165), we conclude that

𝔼θkθ^k2\displaystyle\mathbb{E}\left\|\theta_{k}-\hat{\theta}_{k}\right\|^{2} 𝔼θk1θ^k12ημ𝔼θk1θ^k1p\displaystyle\leq\mathbb{E}\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|^{2}-\eta\mu\mathbb{E}\left\|\theta_{k-1}-\hat{\theta}_{k-1}\right\|^{p}
+4D2K22η2bn(2p+2𝔼θ^k1θ^p+2p+2θ^p+10)\displaystyle\qquad+\frac{4D^{2}K_{2}^{2}\eta^{2}}{bn}\left(2^{p+2}\mathbb{E}\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\|^{p}+2^{p+2}\|\hat{\theta}_{\ast}\|^{p}+10\right)
+4DK2ηn(1+K1η)(32p1𝔼θk1θp+32p1θp\displaystyle\qquad\qquad+\frac{4DK_{2}\eta}{n}(1+K_{1}\eta)\Big{(}3\cdot 2^{p-1}\mathbb{E}\|\theta_{k-1}-\theta_{\ast}\|^{p}+3\cdot 2^{p-1}\|\theta_{\ast}\|^{p}
+72p1𝔼θ^k1θ^p+72p1θ^p+5),\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad+7\cdot 2^{p-1}\mathbb{E}\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\|^{p}+7\cdot 2^{p-1}\|\hat{\theta}_{\ast}\|^{p}+5\Big{)}, (171)

provided that $\eta\leq\frac{\mu}{K_{1}^{2}}$. This, together with $\theta_{0}=\hat{\theta}_{0}=\theta$, implies that

1ki=1k𝔼θi1θ^i1p\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\|\theta_{i-1}-\hat{\theta}_{i-1}\right\|^{p}
4D2K22ηbnμ(2p+21ki=1k𝔼θ^i1θ^p+2p+2θ^p+10)\displaystyle\leq\frac{4D^{2}K_{2}^{2}\eta}{bn\mu}\left(2^{p+2}\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\|\hat{\theta}_{i-1}-\hat{\theta}_{\ast}\right\|^{p}+2^{p+2}\|\hat{\theta}_{\ast}\|^{p}+10\right)
+4DK2nμ(1+K1η)(32p11ki=1k𝔼θi1θp+32p1θp\displaystyle\qquad\qquad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\Big{(}3\cdot 2^{p-1}\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\|\theta_{i-1}-\theta_{\ast}\|^{p}+3\cdot 2^{p-1}\|\theta_{\ast}\|^{p}
+72p11ki=1k𝔼θ^i1θ^p+72p1θ^p+5).\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad+7\cdot 2^{p-1}\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\|\hat{\theta}_{i-1}-\hat{\theta}_{\ast}\|^{p}+7\cdot 2^{p-1}\|\hat{\theta}_{\ast}\|^{p}+5\Big{)}. (172)

In Lemma 26, we showed that under the assumption $\eta\leq\frac{\mu}{K_{1}^{2}+2^{p+4}D^{2}K_{2}^{2}}$, we have

1ki=1k𝔼θi1θpθθ2kημ+8ημD2K22(2p+1θp+5),\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\|\theta_{i-1}-\theta_{\ast}\right\|^{p}\leq\frac{\left\|\theta-\theta_{\ast}\right\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\|\theta_{\ast}\|^{p}+5\right), (173)
1ki=1k𝔼θ^i1θ^pθθ^2kημ+8ημD2K22(2p+1θ^p+5).\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\|\hat{\theta}_{i-1}-\hat{\theta}_{\ast}\right\|^{p}\leq\frac{\left\|\theta-\hat{\theta}_{\ast}\right\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\|\hat{\theta}_{\ast}\|^{p}+5\right). (174)

Hence, by plugging (173) and (174) into (172), we conclude that

1ki=1k𝔼θi1θ^i1p\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\|\theta_{i-1}-\hat{\theta}_{i-1}\right\|^{p}
4D2K22ηbnμ(2p+2(θθ^2kημ+8ημD2K22(2p+1θ^p+5))+2p+2θ^p+10)\displaystyle\leq\frac{4D^{2}K_{2}^{2}\eta}{bn\mu}\left(2^{p+2}\left(\frac{\left\|\theta-\hat{\theta}_{\ast}\right\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\|\hat{\theta}_{\ast}\|^{p}+5\right)\right)+2^{p+2}\|\hat{\theta}_{\ast}\|^{p}+10\right)
+4DK2nμ(1+K1η)32p1(θθ2kημ+8ημD2K22(2p+1θp+5))\displaystyle\qquad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\cdot 3\cdot 2^{p-1}\left(\frac{\left\|\theta-\theta_{\ast}\right\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\|\theta_{\ast}\|^{p}+5\right)\right)
+4DK2nμ(1+K1η)72p1(θθ^2kημ+8ημD2K22(2p+1θ^p+5))\displaystyle\qquad\qquad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\cdot 7\cdot 2^{p-1}\left(\frac{\left\|\theta-\hat{\theta}_{\ast}\right\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\|\hat{\theta}_{\ast}\|^{p}+5\right)\right)
+4DK2nμ(1+K1η)(32p1θp+72p1θ^p+5).\displaystyle\qquad\qquad\qquad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\left(3\cdot 2^{p-1}\|\theta_{\ast}\|^{p}+7\cdot 2^{p-1}\|\hat{\theta}_{\ast}\|^{p}+5\right). (175)

Moreover, $\|\theta-\hat{\theta}_{\ast}\|^{2}\leq 2\|\theta\|^{2}+2\|\hat{\theta}_{\ast}\|^{2}$ and $\|\theta-\theta_{\ast}\|^{2}\leq 2\|\theta\|^{2}+2\|\theta_{\ast}\|^{2}$, so by applying Lemma 27, we obtain

\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\|\theta_{i-1}-\hat{\theta}_{i-1}\right\|^{p}
\displaystyle\leq\frac{4D^{2}K_{2}^{2}\eta}{bn\mu}\Bigg(2^{p+2}\left(\frac{2\|\theta\|^{2}+2(E/\mu)^{\frac{2}{p-1}}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}(E/\mu)^{\frac{p}{p-1}}+5\right)\right)
\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad+2^{p+2}(E/\mu)^{\frac{p}{p-1}}+10\Bigg)
\displaystyle\qquad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\cdot 3\cdot 2^{p-1}\left(\frac{2\|\theta\|^{2}+2(E/\mu)^{\frac{2}{p-1}}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}(E/\mu)^{\frac{p}{p-1}}+5\right)\right)
\displaystyle\qquad\quad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\cdot 7\cdot 2^{p-1}\left(\frac{2\|\theta\|^{2}+2(E/\mu)^{\frac{2}{p-1}}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}(E/\mu)^{\frac{p}{p-1}}+5\right)\right)
\displaystyle\qquad\qquad\quad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\left(10\cdot 2^{p-1}(E/\mu)^{\frac{p}{p-1}}+5\right). (176)

Finally, since $(\theta_{i-1},\hat{\theta}_{i-1})$ is a coupling of $(\nu_{i-1},\hat{\nu}_{i-1})$, by the definition of the $p$-Wasserstein distance, we have

\frac{1}{k}\sum_{i=1}^{k}\mathcal{W}_{p}^{p}(\nu_{i-1},\hat{\nu}_{i-1})\leq\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\|\theta_{i-1}-\hat{\theta}_{i-1}\right\|^{p}. (177)

This completes the proof.  
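To make the dependence of (160) on $n$, $b$, $\eta$ and $k$ concrete, the following minimal sketch evaluates its right-hand side; all numerical values are placeholders standing in for the constants in Assumptions 5 and 6, not values derived in this paper:

    def theorem28_rhs(D, K1, K2, E, mu, p, eta, b, n, k, theta_norm):
        """Right-hand side of (160); every argument is a user-supplied placeholder."""
        A = (2 * theta_norm ** 2 + 2 * (E / mu) ** (2 / (p - 1))) / (k * eta * mu) \
            + (8 * eta / mu) * D ** 2 * K2 ** 2 * (2 ** (p + 1) * (E / mu) ** (p / (p - 1)) + 5)
        c1 = 4 * D ** 2 * K2 ** 2 * eta / (b * n * mu)
        c2 = 4 * D * K2 / (n * mu) * (1 + K1 * eta)
        return (c1 * 2 ** (p + 2) * A
                + c1 * 2 ** (p + 2) * (2 ** (p + 2) * (E / mu) ** (p / (p - 1)) + 10)
                + c2 * 3 * 2 ** (p - 1) * A
                + c2 * 7 * 2 ** (p - 1) * A
                + c2 * (10 * 2 ** (p - 1) * (E / mu) ** (p / (p - 1)) + 5))

    # Illustrative constants only.
    for n in (10 ** 3, 10 ** 4, 10 ** 5):
        val = theorem28_rhs(D=1.0, K1=2.0, K2=1.0, E=1.0, mu=1.0, p=1.5,
                            eta=0.01, b=16, n=n, k=10_000, theta_norm=1.0)
        print(f"n={n}: bound = {val:.3e}")

Increasing $n$ by an order of magnitude decreases the evaluated bound by roughly an order of magnitude, reflecting its $O(1/n)$ dependence.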

G.1 Proof of Theorem 13

Proof  We first show that the sequences $\theta_{k}$ and $\hat{\theta}_{k}$ are ergodic.

To this end, let us first show that the limit $\theta_{\infty}$, if it exists, is unique. Consider two sequences $\theta_{k}^{(1)}$ and $\theta_{k}^{(2)}$ starting at $\theta_{0}^{(1)}$ and $\theta_{0}^{(2)}$ respectively, with limits $\theta_{\infty}^{(1)}$ and $\theta_{\infty}^{(2)}$. For any $k\in\mathbb{N}$,

θk(1)=θk1(1)ηbiΩkf(θk1(1),xi),\displaystyle\theta_{k}^{(1)}=\theta_{k-1}^{(1)}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\theta_{k-1}^{(1)},x_{i}\right), (178)
θk(2)=θk1(2)ηbiΩkf(θk1(2),xi).\displaystyle\theta_{k}^{(2)}=\theta_{k-1}^{(2)}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\theta_{k-1}^{(2)},x_{i}\right). (179)

By Assumption 6 and Assumption 5, we have

θk(2)θk(1)2\displaystyle\left\|\theta_{k}^{(2)}-\theta_{k}^{(1)}\right\|^{2} =θk1(2)θk1(1)ηbiΩk(f(θk1(2),xi)f(θk1(1),xi))2\displaystyle=\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f\left(\theta_{k-1}^{(2)},x_{i}\right)-\nabla f\left(\theta_{k-1}^{(1)},x_{i}\right)\right)\right\|^{2}
θk1(2)θk1(1)22ημθk1(2)θk1(1)p+η2b2(bK1θk1(2)θk1(1)p2)2\displaystyle\leq\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{2}-2\eta\mu\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{p}+\frac{\eta^{2}}{b^{2}}\left(bK_{1}\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{\frac{p}{2}}\right)^{2}
=θk1(2)θk1(1)22ημθk1(2)θk1(1)p+η2K12θk1(2)θk1(1)p\displaystyle=\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{2}-2\eta\mu\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{p}+\eta^{2}K_{1}^{2}\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{p}
θk1(2)θk1(1)2ημθk1(2)θk1(1)p,\displaystyle\leq\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{2}-\eta\mu\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{p}, (180)

provided that $\eta\leq\frac{\mu}{K_{1}^{2}}$. This implies that

θk(2)θk(1)2θk1(2)θk1(1)2ημθk1(2)θk1(1)p,\left\|\theta_{k}^{(2)}-\theta_{k}^{(1)}\right\|^{2}\leq\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{2}-\eta\mu\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{p}, (181)

provided that $\eta\leq\frac{\mu}{K_{1}^{2}}$. Thus, $\left\|\theta_{k}^{(2)}-\theta_{k}^{(1)}\right\|^{2}\leq\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{2}$ for any $k\in\mathbb{N}$. Suppose that $\left\|\theta_{j-1}^{(2)}-\theta_{j-1}^{(1)}\right\|^{2}\geq 1$ for every $j=1,2,\ldots,k$; then $\left\|\theta_{j-1}^{(2)}-\theta_{j-1}^{(1)}\right\|^{p}\geq 1$ as well, and we have

θk(2)θk(1)2θ0(2)θ0(1)2kημ,\left\|\theta_{k}^{(2)}-\theta_{k}^{(1)}\right\|^{2}\leq\left\|\theta_{0}^{(2)}-\theta_{0}^{(1)}\right\|^{2}-k\eta\mu, (182)

such that

θk(2)θk(1)21,for any kk0:=θ0(2)θ0(1)21ημ.\left\|\theta_{k}^{(2)}-\theta_{k}^{(1)}\right\|^{2}\leq 1,\qquad\text{for any $k\geq k_{0}:=\frac{\left\|\theta_{0}^{(2)}-\theta_{0}^{(1)}\right\|^{2}-1}{\eta\mu}$}. (183)

Since $p\in(1,2)$ and $\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|\leq 1$ for $k\geq k_{0}+1$, we have $\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{p}\geq\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{2}$, so that

θk(2)θk(1)2(1ημ)θk1(2)θk1(1)2,\left\|\theta_{k}^{(2)}-\theta_{k}^{(1)}\right\|^{2}\leq(1-\eta\mu)\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{2}, (184)

for any $k\geq k_{0}+1$, which implies that $\theta_{k}^{(2)}-\theta_{k}^{(1)}\rightarrow 0$ as $k\rightarrow\infty$, so that $\theta_{\infty}^{(2)}=\theta_{\infty}^{(1)}$.
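The contraction argument above can also be observed numerically via the shared-minibatch coupling (178)-(179). The following minimal sketch runs two SGD sequences from different initializations on the same data set, sharing the minibatch $\Omega_{k}$ at every step; the loss $f(\theta,x)=\frac{1}{p}\|\theta-x\|^{p}$ is a hypothetical choice used only for illustration and is not claimed to satisfy Assumptions 5 and 6 globally:

    import numpy as np

    rng = np.random.default_rng(2)
    n, b, d, eta, p, T = 100, 10, 3, 0.05, 1.5, 400
    X = rng.normal(size=(n, d))

    def grad(theta, batch, eps=1e-12):
        # gradient of (1/b) * sum_i ||theta - x_i||^p / p
        v = theta - batch
        r = np.maximum(np.linalg.norm(v, axis=1, keepdims=True), eps)
        return np.mean(r ** (p - 2) * v, axis=0)

    th1 = rng.normal(size=d) * 5.0    # theta_0^(1)
    th2 = rng.normal(size=d) * 5.0    # theta_0^(2)
    for k in range(1, T + 1):
        idx = rng.choice(n, size=b, replace=False)   # shared minibatch Omega_k
        th1 = th1 - eta * grad(th1, X[idx])
        th2 = th2 - eta * grad(th2, X[idx])
        if k in (1, 50, 200, 400):
            print(f"k={k}: ||theta_k^(2) - theta_k^(1)||^2 = {np.sum((th1 - th2) ** 2):.3e}")
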

Next, let us show that the sequence $\theta_{k}$ converges in distribution to a limit. It follows from (184) that

θk(2)θk(1)2(1ημ)kk0θk0(2)θk0(1)2,\left\|\theta_{k}^{(2)}-\theta_{k}^{(1)}\right\|^{2}\leq(1-\eta\mu)^{k-\lceil k_{0}\rceil}\left\|\theta_{\lceil k_{0}\rceil}^{(2)}-\theta_{\lceil k_{0}\rceil}^{(1)}\right\|^{2}, (185)

for any $k\geq k_{0}+1$. Let $\theta_{0}^{(1)}$ be a fixed initial value in $\mathbb{R}^{d}$, and let $\theta_{0}^{(2)}=\theta_{1}^{(1)}$, which is random yet takes only finitely many values given $\theta_{0}^{(1)}$, so that $k_{0}$ is bounded. Therefore, it follows from (185) that

𝒲22(νk+1,νk)(1ημ)k𝔼[(1ημ)k0θk0(2)θk0(1)2],\mathcal{W}_{2}^{2}(\nu_{k+1},\nu_{k})\leq(1-\eta\mu)^{k}\mathbb{E}\left[(1-\eta\mu)^{-\lceil k_{0}\rceil}\left\|\theta_{\lceil k_{0}\rceil}^{(2)}-\theta_{\lceil k_{0}\rceil}^{(1)}\right\|^{2}\right], (186)

where $\nu_{k}$ denotes the distribution of $\theta_{k}$, which implies that

k=1𝒲22(νk+1,νk)<.\sum_{k=1}^{\infty}\mathcal{W}_{2}^{2}(\nu_{k+1},\nu_{k})<\infty. (187)

Since the right-hand side of (186) decays geometrically in $k$, we also have $\sum_{k=1}^{\infty}\mathcal{W}_{2}(\nu_{k+1},\nu_{k})<\infty$, and thus, by the triangle inequality, $(\nu_{k})$ is a Cauchy sequence in $\mathcal{P}_{2}(\mathbb{R}^{d})$ equipped with the metric $\mathcal{W}_{2}$. Hence there exists some $\nu_{\infty}$ such that $\mathcal{W}_{2}(\nu_{k},\nu_{\infty})\rightarrow 0$ as $k\rightarrow\infty$.

Hence, we have shown that the sequence $\theta_{k}$ is ergodic. Similarly, we can show that the sequence $\hat{\theta}_{k}$ is ergodic.

Finally, by the ergodic theorem and Fatou's lemma, we have

𝔼θθ^p=𝔼[limk1ki=1kθi1θ^i1p]lim supk1ki=1k𝔼θi1θ^i1p.\mathbb{E}\left\|\theta_{\infty}-\hat{\theta}_{\infty}\right\|^{p}=\mathbb{E}\left[\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{i=1}^{k}\left\|\theta_{i-1}-\hat{\theta}_{i-1}\right\|^{p}\right]\leq\limsup_{k\rightarrow\infty}\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\|\theta_{i-1}-\hat{\theta}_{i-1}\right\|^{p}. (188)

We can then apply (176) from the proof of Theorem 28 to obtain:

\displaystyle\mathcal{W}_{p}^{p}(\nu_{\infty},\hat{\nu}_{\infty})\leq\frac{C_{2}}{bn}+\frac{C_{3}}{n}, (189)

where

C2:=4D2K22ημ(2p+2(8ημD2K22(2p+1(E/μ)pp1+5))+2p+2(E/μ)pp1+10),\displaystyle C_{2}:=\frac{4D^{2}K_{2}^{2}\eta}{\mu}\left(2^{p+2}\left(\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}(E/\mu)^{\frac{p}{p-1}}+5\right)\right)+2^{p+2}(E/\mu)^{\frac{p}{p-1}}+10\right), (190)
C3:=32D3K23ημ2(1+K1η)102p1(2p+1(E/μ)pp1+5)\displaystyle C_{3}:=\frac{32D^{3}K_{2}^{3}\eta}{\mu^{2}}(1+K_{1}\eta)\cdot 10\cdot 2^{p-1}\left(2^{p+1}(E/\mu)^{\frac{p}{p-1}}+5\right)
+4DK2μ(1+K1η)(102p1(E/μ)pp1+5).\displaystyle\qquad\qquad\qquad\qquad\qquad+\frac{4DK_{2}}{\mu}(1+K_{1}\eta)\left(10\cdot 2^{p-1}(E/\mu)^{\frac{p}{p-1}}+5\right). (191)

This completes the proof.
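
To get a rough sense of the size of these constants, the following minimal sketch evaluates $C_{2}$ and $C_{3}$ from (190)-(191), together with the resulting limiting bound, for illustrative placeholder constants (not values taken from this paper):

    def stability_constants(D, K1, K2, E, mu, p, eta):
        """C_2 and C_3 as defined in (190)-(191); all arguments are placeholders."""
        Emu = (E / mu) ** (p / (p - 1))
        C2 = (4 * D ** 2 * K2 ** 2 * eta / mu) * (
            2 ** (p + 2) * (8 * eta / mu) * D ** 2 * K2 ** 2 * (2 ** (p + 1) * Emu + 5)
            + 2 ** (p + 2) * Emu + 10)
        C3 = (32 * D ** 3 * K2 ** 3 * eta / mu ** 2) * (1 + K1 * eta) * 10 * 2 ** (p - 1) \
            * (2 ** (p + 1) * Emu + 5) \
            + (4 * D * K2 / mu) * (1 + K1 * eta) * (10 * 2 ** (p - 1) * Emu + 5)
        return C2, C3

    # Illustrative placeholder constants.
    C2, C3 = stability_constants(D=1.0, K1=2.0, K2=1.0, E=1.0, mu=1.0, p=1.5, eta=0.01)
    b, n = 16, 10_000
    print(f"C2 = {C2:.3e}, C3 = {C3:.3e}, bound = {C2 / (b * n) + C3 / n:.3e}")
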