Uniform-in-Time Wasserstein Stability Bounds
for (Noisy) Stochastic Gradient Descent

\nameLingjiong Zhu \email[email protected]
\addrDepartment of Mathematics
Florida State University, Tallahassee, FL, USA. \AND\nameMert Gürbüzbalaban \email[email protected]
\addrDepartment of Management Science and Information Systems
Rutgers Business School, Piscataway, NJ, USA &
Center for Statistics and Machine Learning
Princeton University, Princeton, NJ, USA. \ANDAnant Raj \email[email protected]
\addrCoordinated Science Laboraotry
University of Illinois Urbana-Champaign, IL, USA &
Inria, Ecole Normale Supérieure
PSL Research University, Paris, France. \AND\nameUmut Şimşekli ^♣ \email[email protected]
\addrInria, CNRS, Ecole Normale Supérieure
PSL Research University, Paris, France.
\name^♣ \addrCorresponding author

Abstract

Algorithmic stability is an important notion that has proven powerful for deriving generalization bounds for practical algorithms. The last decade has witnessed an increasing number of stability bounds for different algorithms applied on different classes of loss functions. While these bounds have illuminated various properties of optimization algorithms, the analysis of each case typically required a different proof technique with significantly different mathematical tools. In this study, we make a novel connection between learning theory and applied probability and introduce a unified guideline for proving Wasserstein stability bounds for stochastic optimization algorithms. We illustrate our approach on stochastic gradient descent (SGD) and we obtain time-uniform stability bounds (i.e., the bound does not increase with the number of iterations) for strongly convex losses and non-convex losses with additive noise, where we recover similar results to the prior art or extend them to more general cases by using a single proof technique. Our approach is flexible and can be generalizable to other popular optimizers, as it mainly requires developing Lyapunov functions, which are often readily available in the literature. It also illustrates that ergodicity is an important component for obtaining time-uniform bounds – which might not be achieved for convex or non-convex losses unless additional noise is injected to the iterates. Finally, we slightly stretch our analysis technique and prove time-uniform bounds for SGD under convex and non-convex losses (without additional additive noise), which, to our knowledge, is novel.

Keywords: Stochastic gradient descent, algorithmic stability, Wasserstein perturbation.

1 Introduction

With the development of modern machine learning applications, understanding the generalization properties of stochastic gradient descent (SGD) has become a major challenge in statistical learning theory. In this context, the main goal is to obtain computable upper-bounds on the population risk associated with the output of the SGD algorithm that is given as follows: $F(\theta):=\mathbb{E}_{x\sim\mathcal{D}}[f(\theta,x)]$ , where $x\in\mathcal{X}$ denotes a random data point, $\mathcal{D}$ is the (unknown) data distribution defined on the data space $\mathcal{X}$ , $\theta$ denotes the parameter vector, and $f:\mathbb{R}^{d}\times\mathcal{X}\to\mathbb{R}$ is an instantaneous loss function.

In a practical setting, directly minimizing $F(\theta)$ is not typically possible as $\mathcal{D}$ is unknown; yet one typically has access to a finite data set $X_{n}=\{x_{1},\ldots,x_{n}\}\in\mathcal{X}^{n}$ , where we assume each $x_{i}$ is independent and identically distributed (i.i.d.) with the common distribution $\mathcal{D}$ . Hence, given $X_{n}$ , one can then attempt to minimize the empirical risk $\hat{F}(\theta,X_{n}):=\frac{1}{n}\sum_{i=1}^{n}f(\theta,x_{i})$ as a proxy for $F(\theta)$ . In this setting, SGD has been one of the most popular optimization algorithms for minimizing $\hat{F}(\theta)$ and is based on the following recursion:

\theta_{k}=\theta_{k-1}-\eta\tilde{\nabla}\hat{F}_{k}(\theta_{k-1},X_{n}),\qquad\tilde{\nabla}\hat{F}_{k}(\theta_{k-1},X_{n}):=\frac{1}{b}\sum_{i\in\Omega_{k}}\nabla f(\theta_{k-1},x_{i}),

(1)

where $\eta$ is the step-size, $b$ is the batch-size, $\Omega_{k}$ is the minibatch that is chosen randomly from the set $\{1,2,\ldots,n\}$ , and its cardinality satisfies $|\Omega_{k}|=b$ .

One fruitful approach for estimating the population risk attained by SGD, i.e., $F(\theta_{k})$ , is based on the following simple decomposition:

\displaystyle F(\theta_{k})\leq\hat{F}(\theta_{k})+|\hat{F}(\theta_{k})-F(\theta_{k})|,

(2)

where the last term is called the generalization error. Once a computable upper-bound for the generalization error can be obtained, this decomposition directly leads to a computable upper bound for the population risk $F(\theta_{k})$ , since $\hat{F}(\theta_{k})$ can be computed thanks to the availability of $X_{n}$ . Hence, the challenge here is reduced to derive upper-bounds on $|\hat{F}(\theta_{k})-F(\theta_{k})|$ , typically referred to as generalization bounds.

Among many approaches for deriving generalization bounds, algorithmic stability (Bousquet and Elisseeff, 2002) has been one of the most fruitful notions that have paved the way to numerous generalization bounds for stochastic optimization algorithms (Hardt et al., 2016; Chen et al., 2018; Mou et al., 2018; Feldman and Vondrak, 2019; Lei and Ying, 2020; Zhang et al., 2022). In a nutshell, algorithmic stability measures how much the algorithm output differs if we replace one data point in $X_{n}$ with a new sample. More precisely, in the context of SGD, given another data set $\hat{X}_{n}=\{\hat{x}_{1},\ldots,\hat{x}_{n}\}=\{x_{1},\ldots,x_{i-1},\hat{x}_{i},x_{i+1},\ldots{x}_{n}\}\in\mathcal{X}^{n}$ that differs from $X_{n}$ by at most one element, we (theoretically) consider running SGD on $\hat{X}_{n}$ , i.e.,

\hat{\theta}_{k}=\hat{\theta}_{k-1}-\eta\tilde{\nabla}\hat{F}_{k}(\hat{\theta}_{k-1},\hat{X}_{n}),\qquad\tilde{\nabla}\hat{F}_{k}(\hat{\theta}_{k-1},\hat{X}_{n}):=\frac{1}{b}\sum_{i\in\Omega_{k}}\nabla f(\hat{\theta}_{k-1},\hat{x}_{i}),

(3)

and we are interested in the discrepancy between $\theta_{k}$ and $\hat{\theta}_{k}$ in some precise sense (to be formally defined in the next section). The wisdom of algorithmic stability indicates that a smaller discrepancy between $\theta_{k}$ and $\hat{\theta}_{k}$ implies a smaller generalization error.

The last decade has witnessed an increasing number of stability bounds for different algorithms applied on different classes of loss functions. In a pioneering study, Hardt et al. (2016) proved a variety of stability bounds for SGD, for strongly convex, convex, and non-convex problems. Their analysis showed that, under strong convexity and bounded gradient assumptions, the generalization error of SGD with constant step-size is of order $n^{-1}$ ; whereas for general convex and non-convex problems, their bounds diverged with the number of iterations (even with a projection step), unless a decreasing step-size is used. In subsequent studies (Lei and Ying, 2020; Kozachkov et al., 2023) extended the results of Hardt et al. (2016), by either relaxing the assumptions or generalizing the setting to more general algorithms. However, their bounds still diverged for constant step-sizes, unless strong convexity is assumed. In a recent study, Bassily et al. (2020) proved stability lower-bounds for projected SGD when the loss is convex and non-smooth. Their results showed for general non-smooth loss functions we cannot expect to prove time-uniform (i.e., non-divergent with the number of iterations) stability bounds for SGD, even when a projection step is appended.

In a related line of research, several studies investigated the algorithmic stability of the stochastic gradient Langevin dynamics (SGLD) algorithm (Welling and Teh, 2011), which is essentially a ‘noisy’ version of SGD that uses the following recursion: $\theta_{k}=\theta_{k-1}-\eta\tilde{\nabla}\hat{F}_{k}(\theta_{k-1},X_{n})+\xi_{k}$ , where $(\xi_{k})_{k\geq 0}$ is a sequence of i.i.d. Gaussian vectors, independent of $\theta_{k-1}$ and $\Omega_{k}$ . The authors of Raginsky et al. (2017); Mou et al. (2018) proved stability bounds for SGLD for non-convex losses, which were then extended to more general (non-Gaussian) noise settings in Li et al. (2019). While these bounds hinted at the benefits of additional noise in terms of stability, they still increased with the number of iterations, which limited the impact of their results. More recently, Farghly and Rebeschini (2021) proved the first time-uniform stability bounds for SGLD under non-convexity, indicating that, with the presence of additive Gaussian noise, better stability bounds can be achieved. Their time-uniform results were then extended to non-Gaussian, heavy-tailed perturbations in Raj et al. (2023a, b) for quadratic and a class of non-convex problems.

While these bounds have illuminated various properties of optimization algorithms, the analysis of each case typically required a different proof technique with significantly different mathematical tools. Hence, it is not straightforward to extend the existing techniques to different algorithms with different classes of loss functions. Moreover, currently, it is not clear how the noisy perturbations affect algorithmic stability so that time-uniform bounds can be achieved, and more generally, it is not clear in which circumstances one might hope for time-uniform stability bounds.

In this study, we contribute to this line of research and prove novel time-uniform algorithmic stability bounds for SGD and its noisy versions. Our main contributions are as follows:

•

We make a novel connection between learning theory and applied probability, and introduce a unified guideline for proving Wasserstein stability bounds for stochastic optimization algorithms with a constant step-size. Our approach is based on Markov chain perturbation theory (Rudolf and Schweizer, 2018), which offers a three-step proof technique for deriving stability bounds: (i) showing the optimizer is geometrically ergodic, (ii) obtaining a Lyapunov function for the optimizer and the loss, and (iii) bounding the discrepancy between the Markov transition kernels associated with the chains $(\theta_{k})_{k\geq 0}$ and $(\hat{\theta}_{k})_{k\geq 0}$ . We illustrate this approach on SGD and show that time-uniform stability bounds can be obtained under a pseudo-Lipschitz-like condition for smooth strongly-convex losses (we recover similar results to the ones of Hardt et al. (2016)) and a class of non-convex losses (that satisfy a dissipativity condition) when a noisy perturbation with finite variance (not necessarily Gaussian, hence more general than Farghly and Rebeschini (2021)) is introduced. Our results shed more light on the role of the additional noise in terms of obtaining time-uniform bounds: in the non-convex case the optimizer might not be geometrically ergodic unless additional noise is introduced, hence the bound cannot be obtained. Moreover, our approach is flexible and can be generalizable to other popular optimizers, as it mainly requires developing Lyapunov functions, which are often readily available in the literature (Aybat et al., 2020; Lessard et al., 2016; Fallah et al., 2022; Gürbüzbalaban et al., 2022; Liu et al., 2020; Aybat et al., 2019).
•

We then investigate the case where no additional noise is introduced to the SGD recursion and the geometric ergodicity condition does not hold. First, for non-convex losses, we prove a time-uniform stability bound, where the bound converges to a positive number (instead of zero) as $n\to\infty$ , and this limit depends on the ‘level of non-convexity’. Then, we consider a class of (non-strongly) convex functions and prove stability bounds for the stationary distribution of $(\theta_{k})_{k\geq 0}$ , which vanish as $n$ increases. To the best of our knowledge, these results are novel, and indicate that the stability bounds do not need to increase with time even under non-convexity and without additional perturbations; yet, they might have a different nature depending on the problem class.

One limitation of our analysis is that it requires Lipschitz surrogate loss functions and does not directly handle the original loss function, due to the use of the Wasserstein distance (Raginsky et al., 2016). Yet, surrogate losses have been readily utilized in the recent stability literature (e.g., Raj et al. (2023a, b)) and we believe that our analysis might illuminate uncovered aspects of SGD even with this requirement. All the proofs are provided in the Appendix.

2 Technical Background

2.1 The Wasserstein distance and Wasserstein algorithmic stability

Wasserstein distance. For $p\geq 1$ , the $p$ -Wasserstein distance between two probability measures $\mu$ and $\nu$ on $\mathbb{R}^{d}$ is defined as Villani (2009):

\mathcal{W}_{p}(\mu,\nu)=\left\{\inf\mathbb{E}\|X-Y\|^{p}\right\}^{1/p},

(4)

where the infimum is taken over all couplings of $X\sim\mu$ and $Y\sim\nu$ . In particular, the dual representation for the $1$ -Wasserstein distance is given as Villani (2009):

\mathcal{W}_{1}(\mu,\nu)=\sup_{h\in\text{Lip}(1)}\left|\int_{\mathbb{R}^{d}}h(x)\mu(dx)-\int_{\mathbb{R}^{d}}h(x)\nu(dx)\right|,

(5)

where $\text{Lip}(1)$ consists of the functions $h:\mathbb{R}^{d}\rightarrow\mathbb{R}$ that are $1$ -Lipschitz.

Wasserstein algorithmic stability. Algorithmic stability is a crucial concept in learning theory that has led to numerous significant theoretical breakthroughs (Bousquet and Elisseeff, 2002; Hardt et al., 2016). To begin, we will present the definition of algorithmic stability as stated in Hardt et al. (2016):

Definition 1 (Hardt et al. (2016), Definition 2.1)

Let $\mathcal{RV}(\mathbb{R}^{d})$ denote the set of $\mathbb{R}^{d}$ -valued random vectors. For a (surrogate) loss function $\ell:\mathbb{R}^{d}\times\mathcal{X}\rightarrow\mathbb{R}$ , an algorithm $\mathcal{A}:\bigcup_{n=1}^{\infty}\mathcal{X}^{n}\to\mathcal{RV}(\mathbb{R}^{d})$ is $\varepsilon$ -uniformly stable if

\displaystyle\sup_{X\cong\hat{X}}\sup_{z\in\mathcal{X}}~{}\mathbb{E}\left[\ell(\mathcal{A}(X),z)-\ell(\mathcal{A}(\hat{X}),z)\right]\leq\varepsilon,

(6)

where the first supremum is taken over data $X,\hat{X}\in\mathcal{X}^{n}$ that differ by one element, denoted by $X\cong\hat{X}$ .

In this context, we purposefully employ a distinct notation for the loss function $\ell$ (in contrast to $f$ ) since our theoretical framework necessitates measuring algorithmic stability through a surrogate loss function, which may differ from the original loss $f$ . More precisely, our bounds will be based on the 1-Wasserstein distance, hence, we will need the surrogate loss $\ell$ to be a Lipschitz continuous function, as we will detail in (5). On the other hand, for the original loss $f$ we will need some form of convexity (e.g., strongly convex, convex, or dissipative) and we will need the gradient of $f$ to be Lipschitz continuous, in order to derive Wasserstein bounds. Unfortunately, under these assumptions, we cannot further impose $f$ itself to be Lipschitz, hence the need for surrogate losses. Nevertheless, the usage of surrogate losses is common in learning theory, see e.g, Farghly and Rebeschini (2021); Raj et al. (2023b), and we present concrete practical examples in the Appendix.

Now, we present a result from Hardt et al. (2016) that establishes a connection between algorithmic stability and the generalization performance of a randomized algorithm. Prior to presenting the result, we define the empirical and population risks with respect to the loss function $\ell$ as follows:

\displaystyle\hat{R}(\theta,X_{n}):=

\displaystyle\frac{1}{n}\sum_{i=1}^{n}\ell(\theta,x_{i}),\quad R(\theta):=\mathbb{E}_{x\sim\mathcal{D}}[\ell(\theta,x)].

Theorem 2 (Hardt et al. (2016), Theorem 2.2)

Suppose that $\mathcal{A}$ is an $\varepsilon$ -uniformly stable algorithm, then the expected generalization error is bounded by

\displaystyle\left|\mathbb{E}_{\mathcal{A},X_{n}}~{}\left[\hat{R}(\mathcal{A}(X_{n}),X_{n})-R(\mathcal{A}(X_{n}))\right]\right|\leq\varepsilon.

(7)

For a randomized algorithm, if $\nu$ and $\hat{\nu}$ denotes the law of $\mathcal{A}(X)$ and $\mathcal{A}(\hat{X})$ then for a $\mathcal{L}$ -Lipschitz surrogate loss function $\ell$ , we have the following generalization error guarantee,

\displaystyle\left|\mathbb{E}_{\mathcal{A},X_{n}}~{}\left[\hat{R}(\mathcal{A}(X_{n}),X_{n})-R(\mathcal{A}(X_{n}))\right]\right|\leq\mathcal{L}\sup_{X\cong\hat{X}}\mathcal{W}_{1}(\nu,\hat{\nu}).

(8)

The above result can be directly obtained from the combination of the results given in (5), Definition 1, and Theorem 2 (see also Raginsky et al. (2016)).

2.2 Perturbation theory for Markov chains

Next, we recall the Wasserstein perturbation bound for Markov chains from Rudolf and Schweizer (2018). Let $(\theta_{n})_{n=0}^{\infty}$ be a Markov chain with transition kernel $P$ and initial distribution $p_{0}$ , i.e., we have almost surely

\mathbb{P}(\theta_{n}\in A|\theta_{0},\cdots,\theta_{n-1})=\mathbb{P}(\theta_{n}\in A|\theta_{n-1})=P(\theta_{n-1},A),

(9)

and $p_{0}(A)=\mathbb{P}(\theta_{0}\in A)$ for any measurable set $A\subseteq\mathbb{R}^{d}$ and $n\in\mathbb{N}$ . We assume that $(\hat{\theta}_{n})_{n=0}^{\infty}$ is another Markov chain with transition kernel $\hat{P}$ and initial distribution $\hat{p}_{0}$ . We denote by $p_{n}$ the distribution of $\theta_{n}$ and by $\hat{p}_{n}$ the distribution of $\hat{\theta}_{n}$ . By $\delta_{\theta}$ , we denote the Dirac delta distribution at $\theta$ , i.e. the probability measure concentrated at $\theta$ . For a measurable set $A\subseteq\mathbb{R}^{d}$ , we also use the notation $\delta_{\theta}P(A):=P(\theta,A)$ .

Lemma 3 (Rudolf and Schweizer (2018), Theorem 3.1)

Assume that there exist some $\rho\in[0,1)$ and $C\in(0,\infty)$ such that

\sup_{\theta,\tilde{\theta}\in\mathbb{R}^{d}:\theta\neq\tilde{\theta}}\frac{\mathcal{W}_{1}(P^{n}(\theta,\cdot),P^{n}(\tilde{\theta},\cdot))}{\|\theta-\tilde{\theta}\|}\leq C\rho^{n},

(10)

for any $n\in\mathbb{N}$ . Further assume that there exist some $\delta\in(0,1)$ and $L\in(0,\infty)$ and a measurable Lyapunov function $\hat{V}:\mathbb{R}^{d}\rightarrow[1,\infty)$ of $\hat{P}$ such that for any $\theta\in\mathbb{R}^{d}$ :

(\hat{P}\hat{V})(\theta)\leq\delta\hat{V}(\theta)+L,

(11)

where $(\hat{P}\hat{V})(\theta):=\int_{\mathbb{R}^{d}}\hat{V}(\hat{\theta})\hat{P}(\theta,d\hat{\theta})$ . Then, we have

\mathcal{W}_{1}(p_{n},\hat{p}_{n})\leq C\left(\rho^{n}\mathcal{W}_{1}(p_{0},\hat{p}_{0})+(1-\rho^{n})\frac{\gamma\kappa}{1-\rho}\right),

(12)

where $\gamma:=\sup_{\theta\in\mathbb{R}^{d}}\frac{\mathcal{W}_{1}(\delta_{\theta}P,\delta_{\theta}\hat{P})}{\hat{V}(\theta)},\quad\kappa:=\max\left\{\int_{\mathbb{R}^{d}}\hat{V}(\theta)d\hat{p}_{0}(\theta),\frac{L}{1-\delta}\right\}$ .

Lemma 3 provides a sufficient condition for the distributions $p_{n}$ and $\hat{p}_{n}$ after $n$ iterations to stay close to each other given the initial distributions $p_{0}$ . Lemma 3 will provide a key role in helping us derive the main results in Section 3.1 and Section 3.2. Later, in the Appendix, we will state and prove a modification of Lemma 3 (see Lemma 23 in the Appendix) that will be crucial to obtaining the main result in Section 3.3.

3 Wasserstein Stability of SGD via Markov Chain Perturbations

In this section, we will derive time-uniform Wasserstein stability bounds for SGD by using the perturbation theory presented in Rudolf and Schweizer (2018). Before considering general losses that can be non-convex, we first consider the simpler case of quadratic losses to illustrate our key ideas.

3.1 Warm up: quadratic case

To illustrate the proof technique, we start by considering a quadratic loss of the form: $f(\theta,x_{i}):=(a_{i}^{\top}\theta-y_{i})^{2}/2$ where, $x_{i}:=(a_{i},y_{i})$ and $\nabla f(\theta,x_{i})=a_{i}(a_{i}^{\top}\theta-y_{i})$ . In this setting, the SGD recursion takes the following form:

\theta_{k}=\left(I-\frac{\eta}{b}H_{k}\right)\theta_{k-1}+\frac{\eta}{b}q_{k},\quad\text{where,}\quad H_{k}:=\sum_{i\in\Omega_{k}}a_{i}a_{i}^{\top},\quad q_{k}:=\sum_{i\in\Omega_{k}}a_{i}y_{i}\,.

(13)

The sequence $(H_{k},q_{k})$ are i.i.d. and for every $k$ , $(H_{k},q_{k})$ is independent of $\theta_{k-1}$ .

Similarly, we can write down the iterates of SGD with a different data set $\hat{X}_{n}:=\{\hat{x}_{1},\ldots,\hat{x}_{n}\}$ with $\hat{x}_{i}=(\hat{a}_{i},\hat{y}_{i})$ , where $\hat{X}_{n}$ differs from $X_{n}$ with at most one element:

\hat{\theta}_{k}=\left(I-\frac{\eta}{b}\hat{H}_{k}\right)\hat{\theta}_{k-1}+\frac{\eta}{b}\hat{q}_{k},\quad\text{where}\quad\hat{H}_{k}:=\sum_{i\in\Omega_{k}}\hat{a}_{i}\hat{a}_{i}^{\top},\quad\hat{q}_{k}:=\sum_{i\in\Omega_{k}}\hat{a}_{i}\hat{y}_{i}\,.

(14)

Our goal is to obtain an algorithmic stability bound, through estimating the $1$ -Wasserstein distance between the distribution of $\theta_{k}$ and $\hat{\theta}_{k}$ and we will now illustrate the three-step proof technique that we described in Section 1. To be able to apply the perturbation theory (Rudolf and Schweizer, 2018), we start by establishing the geometric ergodicity of the Markov process $(\theta_{k})_{k\geq 0}$ with transition kernel $P(\theta,\cdot)$ , given in the following lemma.

Lemma 4

Assume that $\rho:=\mathbb{E}\left\|I-\frac{\eta}{b}H_{1}\right\|<1$ . Then, for any $k\in\mathbb{N}$ , we have the following inequality: $\mathcal{W}_{1}\left(P^{k}(\theta,\cdot),P^{k}\left(\tilde{\theta},\cdot\right)\right)\leq\rho^{k}\|\theta-\tilde{\theta}\|.$

We note that since $H_{1}\succeq 0$ , the assumption in Lemma 4 can be satisfied under mild assumptions, for example when $H_{1}\succ 0$ with a positive probability, which is satisfied for $\eta$ small enough.

In the second step, we construct a Lyapunov function $\hat{V}$ that satisfies the conditions of Lemma 3.

Lemma 5

Let $\hat{V}(\theta):=1+\|\theta\|$ . Assume that $\hat{\rho}:=\mathbb{E}\left\|I-\frac{\eta}{b}\hat{H}_{1}\right\|<1$ . Then, we have

(\hat{P}\hat{V})(\theta)\leq\hat{\rho}\hat{V}(\theta)+1-\hat{\rho}+\frac{\eta}{b}\mathbb{E}\|\hat{q}_{1}\|.

(15)

In our third and last step, we estimate the perturbation gap based on the Lyapunov function $\hat{V}$ in the form of (10), assuming that the data is bounded. Such bounded data assumptions have been commonly made in the literature (Bach, 2014; Bach and Moulines, 2013).

Lemma 6

If $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ , then, we have $\sup_{\theta\in\mathbb{R}^{d}}\frac{\mathcal{W}_{1}(\delta_{\theta}P,\delta_{\theta}\hat{P})}{\hat{V}(\theta)}\leq\frac{2\eta D^{2}}{n}$ .

Note that Lemma 3 relies on three conditions: the Wasserstein contraction in (10), which is obtained through Lemma 4, the drift condition for the Lyapunov function in (11), which is obtained in Lemma 5 and finally the estimate on $\gamma$ in (12) which is about the one-step $1$ -Wasserstein distance between two semi-groups that in our context are associated with two datasets that differ by at most one element, which is obtained in Lemma 6. The only place the neighborhood assumption ( $\sup_{x\in\mathcal{X}}\|x\|\leq D$ ) is used is in the expression of $\gamma$ in equation (12). Now, having all the ingredients, we can invoke Lemma 3 and we obtain the following result which provides a 1-Wasserstein bound between the distribution of iterates when applied to datasets that differ by one point.

For $Y\in\bigcup_{n=1}^{\infty}\mathcal{X}^{n}$ and $k\geq 0$ , let $\nu(Y,k)$ denote the law of the $k$ -th the SGD iterate when $Y$ is used as the dataset, i.e., $\nu(X,k)$ and $\nu(\hat{X},k)$ denote the distributions of $\theta_{k}$ and $\hat{\theta}_{k}$ obtained by the recursions (13) and (14) respectively. As shorthand notation, set $\nu_{k}:=\nu(X,k)$ and $\hat{\nu}_{k}:=\nu(\hat{X},k)$ .

Theorem 7

Assume $\theta_{0}=\hat{\theta}_{0}=\theta$ . We also assume that $\rho:=\mathbb{E}\left\|I-\frac{\eta}{b}H_{1}\right\|<1$ and $\hat{\rho}:=\mathbb{E}\left\|I-\frac{\eta}{b}\hat{H}_{1}\right\|<1$ and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ . Then, we have

\mathcal{W}_{1}(\nu_{k},\hat{\nu}_{k})\leq\frac{1-\rho^{k}}{1-\rho}\frac{2\eta D^{2}}{n}\max\left\{1+\|\theta\|,\frac{1-\hat{\rho}+\frac{\eta}{b}\mathbb{E}\|\hat{q}_{1}\|}{1-\hat{\rho}}\right\}.

(16)

Proof The result directly follows from Lemma 4, Lemma 5, Lemma 6 and Lemma 3.
By a direct application of (8), we can obtain a generalization bound for an $\mathcal{L}$ -Lipschitz surrogate loss function, as follows:

\displaystyle\left|\mathbb{E}_{\mathcal{A},X_{n}}~{}\left[\hat{R}(\mathcal{A}(X_{n}),X_{n})-R(\mathcal{A}(X_{n}))\right]\right|\leq\frac{\mathcal{L}}{1-\rho_{0}}\frac{2\eta D^{2}}{n}\max\left\{1+\|\theta\|,\frac{1-{\rho_{0}}+\frac{\eta}{b}\mathbb{E}\|\hat{q}_{1}\|}{1-{\rho_{0}}}\right\},

where $\rho_{0}=\sup_{X}\|1-\frac{\eta}{b}H_{X}\|$ , $H_{X}=\sum_{i\in\Omega_{k},a_{j}\in X}a_{j}a_{j}^{\top}$ and $X$ is a random set of $n$ -data points from the data generating distribution. The generalization bound obtained above does not include the mean square error in the unbounded case but covers a larger class of surrogate loss functions. Because of this incompatibility, a direct comparison is not possible; however, the rate obtained in the equation above has the same dependence on the number of samples that were obtained in the previous works (Lei and Ying, 2020). For least squares, there are other works using integral operators that develop generalization bounds for SGD under a capacity condition (Lin and Rosasco, 2017; Pillaud-Vivien et al., 2018). However, these bounds only hold for the least square loss.

3.2 Strongly convex case

Next, we consider strongly convex losses. In the remainder of the paper, we will always assume that for every $x\in\mathcal{X}$ , $f(\cdot,x)$ is differentiable.

Before proceeding to the stability bound, we first introduce the following assumptions.

Assumption 1

There exist constants $K_{1},K_{2}>0$ such that for any $\theta,\hat{\theta}\in\mathbb{R}^{d}$ and every $x\in\mathcal{X}$ ,

\displaystyle\|\nabla f(\theta,x)-\nabla f(\hat{\theta},\hat{x})\|\leq K_{1}\|\theta-\hat{\theta}\|+K_{2}\|x-\hat{x}\|(\|\theta\|+\|\hat{\theta}\|+1).

(17)

This assumption is a pseudo-Lipschitz-like condition on $\nabla f$ and is satisfied for various problems such as GLMs (Bach, 2014). Next, we assume that the loss function $f$ is strongly convex.

Assumption 2

There exists a universal constant $\mu>0$ such that for any $\theta_{1},\theta_{2}\in\mathbb{R}^{d}$ and $x\in\mathcal{X}$ ,

\displaystyle\left\langle\nabla f(\theta_{1},x)-\nabla f(\theta_{2},x),\theta_{1}-\theta_{2}\right\rangle\geq\mu\|\theta_{1}-\theta_{2}\|^{2}.

By using the same recipe as we used for quadratic losses, we obtain the following stability result.

Theorem 8

Let $\theta_{0}=\hat{\theta}_{0}=\theta$ . Assume that Assumption 1 and Assumption 2 hold. We also assume that $\eta<\min\left\{\frac{1}{\mu},\frac{\mu}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\}$ , $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ , and $\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\leq E$ for some $E<\infty$ . Let $\nu_{k}$ and $\hat{\nu}_{k}$ denote the distributions of $\theta_{k}$ and $\hat{\theta}_{k}$ respectively. Then, we have

	$\displaystyle\mathcal{W}_{1}(\nu_{k},\hat{\nu}_{k})$	$\displaystyle\leq\frac{8DK_{2}(1-(1-\frac{\eta\mu}{2})^{k})}{n\mu}\left(\frac{2E}{\mu}+1\right)$
		$\displaystyle\qquad\cdot\max\left\{1+2\\|\theta\\|^{2}+\frac{2E^{2}}{\mu^{2}},2-\frac{\eta}{\mu}K_{1}^{2}-\frac{56\eta}{\mu}D^{2}K_{2}^{2}+\frac{64\eta}{\mu^{3}}D^{2}K_{2}^{2}E^{2}\right\}.$		(18)

Similarly to the quadratic case, we can now directly obtain a bound on expected generalization error using (8). More precisely, for an $\mathcal{L}$ -Lipschitz surrogate loss function $\ell$ , we have

	$\displaystyle\Bigl{\|}\mathbb{E}_{\mathcal{A},X_{n}}~{}\Bigl{[}\hat{R}(\mathcal{A}(X_{n}),X_{n})-$	$\displaystyle R(\mathcal{A}(X_{n}))\Bigr{]}\Bigr{\|}\leq\mathcal{L}\cdot\frac{8DK_{2}(1-(1-\frac{\eta\mu}{2})^{k})}{n\mu}\left(\frac{2E}{\mu}+1\right)$
		$\displaystyle\cdot\max\left\{1+2\\|\theta\\|^{2}+\frac{2E^{2}}{\mu^{2}},2-\frac{\eta}{\mu}K_{1}^{2}-\frac{56\eta}{\mu}D^{2}K_{2}^{2}+\frac{64\eta}{\mu^{3}}D^{2}K_{2}^{2}E^{2}\right\}.$

The bound above has the same dependence on the number of samples as the ones of the previous stability analysis of (projected) SGD for strongly convex functions (Hardt et al., 2016; Liu et al., 2017; Lei and Ying, 2020). However, we have a worse dependence on the strong convexity parameter $\mu$ .

3.3 Non-convex case with additive noise

Finally, we consider a class of non-convex loss functions. We assume that the loss function satisfies the following dissipativity condition.

Assumption 3

There exist constants $m>0$ and $K>0$ such that for any $\theta_{1},\theta_{2}\in\mathbb{R}^{d}$ and $x\in\mathcal{X}$ ,

\displaystyle\left\langle\nabla f(\theta_{1},x)-\nabla f(\theta_{2},x),\theta_{1}-\theta_{2}\right\rangle\geq m\|\theta_{1}-\theta_{2}\|^{2}-K.

The class of dissipative functions satisfying this assumption are the ones that admit some gradient growth in radial directions outside a compact set. Inside the compact set though, they can have quite general non-convexity patterns. As concrete examples, they include certain one-hidden-layer neural networks (Akiyama and Suzuki, 2023); they arise in non-convex formulations of classification problems (e.g. in logistic regression with a sigmoid/non-convex link function); they can also arise in robust regression problems, see e.g. Gao et al. (2022). Also, any function that is strongly convex outside of a ball of radius for some will satisfy this assumption. Consequently, regularized regression problems where the loss is a strongly convex quadratic plus a smooth penalty that grows slower than a quadratic will belong to this class; a concrete example would be smoothed Lasso regression; many other examples are also given in Erdogdu et al. (2022). Dissipative functions also arise frequently in the sampling and Bayesian learning and global convergence in non-convex optimization literature (Raginsky et al., 2017; Gao et al., 2022).

Unlike the strongly-convex case, we can no longer obtain a Wasserstein contraction bound using the synchronous coupling technique as we did in the proof of Theorem 8. To circumvent this problem, in this setting, we consider a noisy version of SGD, with the following recursion:

\theta_{k}=\theta_{k-1}-\eta\tilde{\nabla}\hat{F}_{k}(\theta_{k-1},X_{n})+\eta\xi_{k},\qquad\tilde{\nabla}\hat{F}_{k}(\theta_{k-1},X_{n}):=\frac{1}{b}\sum_{i\in\Omega_{k}}\nabla f(\theta_{k-1},x_{i}),

(19)

where $\xi_{k}$ are additional i.i.d. random vectors in $\mathbb{R}^{d}$ , independent of $\theta_{k-1}$ and $\Omega_{k}$ , satisfying the following assumption.

Assumption 4

$\xi_{1}$ is random vector on $\mathbb{R}^{d}$ with a continuous density $p(x)$ that is positive everywhere, i.e. $p(x)>0$ for any $x\in\mathbb{R}^{d}$ and $\mathbb{E}[\xi_{1}]=0,~{}~{}\sigma^{2}:=\mathbb{E}\left[\|\xi_{1}\|^{2}\right]<\infty.$

Note that the SGLD algorithm (Welling and Teh, 2011) is a special case of this recursion, whilst our noise model can accommodate non-Gaussian distributions with finite second-order moment.

Analogously, let us define the (noisy) SGD recursion with the data set $\hat{X}_{n}$ as

\hat{\theta}_{k}=\hat{\theta}_{k-1}-\eta\tilde{\nabla}\hat{F}_{k}(\hat{\theta}_{k-1},\hat{X}_{n})+\eta\xi_{k},

and let $p(\theta,\theta_{1})$ denote the probability density function of $\theta_{1}=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i})+\eta\xi_{1}$ . Further let $\theta_{*}$ be a minimizer of $\hat{F}(\cdot,X_{n})$ . Then, by following the same three-step recipe, we obtain the following stability bound. Here, we do not provide all the constants explicitly for the sake of clarity; the complete theorem statement is given in Theorem 25 (Appendix E.1).

Theorem 9

Let $\theta_{0}=\hat{\theta}_{0}=\theta$ . Assume that Assumption 1, Assumption 3 and Assumption 4 hold. We also assume that $\eta<\min\left\{\frac{1}{m},\frac{m}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\}$ and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ and $\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\leq E$ for some $E<\infty$ . For any $\hat{\eta}\in(0,1)$ , define $M>0$ so that $\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}\geq\sqrt{\hat{\eta}}$ and for any $R>\frac{2K_{0}}{m}$ where $K_{0}$ is defined in (67) so that

\inf_{\theta,\theta_{1}\in\mathbb{R}^{d}:V(\theta)\leq R,\|\theta_{1}-\theta_{\ast}\|\leq M}\frac{p(\theta,\theta_{1})}{p(\theta_{\ast},\theta_{1})}\geq\sqrt{\hat{\eta}}.

(20)

Let $\nu_{k}$ and $\hat{\nu}_{k}$ denote the distributions of $\theta_{k}$ and $\hat{\theta}_{k}$ respectively. Then, we have

\displaystyle\mathcal{W}_{1}(\nu_{k},\hat{\nu}_{k})\leq\frac{C_{1}(1-\bar{\eta}^{k})}{2\sqrt{\psi(1+\psi)}(1-\bar{\eta})}\cdot\frac{2b}{n},

(21)

where for any $\eta_{0}\in(0,\hat{\eta})$ and $\gamma_{0}\in\left(1-m\eta+\frac{2\eta K_{0}}{R},1\right)$ , $\psi=\frac{\eta_{0}}{\eta K_{0}}$ and $\bar{\eta}=(1-(\hat{\eta}-\eta_{0}))\vee\frac{2+R\psi\gamma_{0}}{2+R\psi}$ . The constant $C_{1}\equiv C_{1}(\psi,\eta,\hat{\eta},R,\gamma_{0},K_{1},K_{2},\sigma^{2},D,E)$ is explicitly stated in the proof.

Contrary to our previous results, the proof technique for showing Wasserstein contraction (as in Lemma 4) for this theorem relies on verifying the drift condition (Assumption 7) and the minorization condition (Assumption 8) as given in Hairer and Mattingly (2011). Once these conditions are satisfied, we invoke the explicitly computable bounds on the convergence of Markov chains developed in Hairer and Mattingly (2011).

From equation (8), we directly obtain the following generalization error bound for $\mathcal{L}$ -Lipschitz surrogate loss function,

\displaystyle\left|\mathbb{E}_{\mathcal{A},X_{n}}~{}\left[\hat{R}(\mathcal{A}(X_{n}),X_{n})-R(\mathcal{A}(X_{n}))\right]\right|\leq\mathcal{L}\cdot\frac{C_{1}(1-\bar{\eta}^{k})}{2\sqrt{\psi(1+\psi)}(1-\bar{\eta})}\cdot\frac{2b}{n},

where the constants are defined in Theorem 9¹¹1By using the decomposition (2), we can obtain excess risk bounds for SGLD by combining our results with Xu and Raginsky (2017): it was shown that gradient Langevin dynamics has the following optimization error is $O(\varepsilon+d^{3/2}b^{-1/4}\lambda^{-1}\log 1/\varepsilon)$ after $K=O(d\varepsilon^{-1}\lambda^{-1}\log 1/\varepsilon)$ iterations, where $b$ is the mini-batch size and $\lambda$ is the uniform spectral gap of the continuous-time Langevin dynamics. Similar results are given for SGLD in (Xu and Raginsky, 2017, Theorem 3.6). . The above result can be directly compared with the result in (Farghly and Rebeschini, 2021, Theorem 4.1) that has the same dependence on $n$ and $b$ . However, our result is more general in the sense that we do not assume our noise to be Gaussian noise. Note that Li et al. (2019) can also accommodate non-Gaussian noise; however, their bounds increase with the number of iterations.

Remark 10

In Theorem 9, we can take $R=\frac{2K_{0}}{m}(1+\epsilon)$ for some fixed $\epsilon\in(0,1)$ so we can take

\hat{\eta}=\left(\max_{M>0}\left\{\min\left\{\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1},\inf_{\begin{subarray}{c}\theta,\theta_{1}\in\mathbb{R}^{d}:\|\theta_{1}-\theta_{\ast}\|\leq M\\ \|\theta-\theta_{\ast}\|^{2}\leq\frac{2K_{0}}{m}(1+\epsilon)-1\end{subarray}}\frac{p(\theta,\theta_{1})}{p(\theta_{\ast},\theta_{1})}\right\}\right\}\right)^{2}.

(22)

Moreover, we can take $\eta_{0}=\frac{\hat{\eta}}{2}$ , $\gamma_{0}=1-\frac{m\eta\epsilon}{2}$ , and $\psi=\frac{\hat{\eta}}{2\eta K_{0}}$ , so that

\bar{\eta}=\max\left\{1-\frac{\hat{\eta}}{2},\frac{2+\frac{(1+\epsilon)\hat{\eta}}{m}(1-\frac{m\eta\epsilon}{2})}{2+\frac{(1+\epsilon)\hat{\eta}}{m}}\right\}=1-\frac{m\eta\epsilon(1+\epsilon)\hat{\eta}}{4m+2(1+\epsilon)\hat{\eta}},

(23)

provided that $\eta\leq 1$ . Note that the parameter $\hat{\eta}$ in (22) appears in the upper bound in equation (21) that controls the $1$ -Wasserstein algorithmic stability of the SGD. It is easy to see from equation (21) that the smaller $\bar{\eta}$ , the smaller the $1$ -Wasserstein bound. By the defintion of $\bar{\eta}$ , the larger $\hat{\eta}$ , the smaller the $1$ -Wasserstein bound. As a result, we would like to choose $\hat{\eta}$ to be as large as possible, and the equation (22) provides an explicit value that $\hat{\eta}$ can take, which is already the largest as possible.

Next, let us provide some explicitly computable lower bounds for $\hat{\eta}$ in (22). This is achievable if we specify further the noise assumption. Under the assumption that $\xi_{k}$ are i.i.d. Gaussian distributed, we have the following corollary.

Corollary 11

Under the assumptions in Theorem 9, we further assume the noise $\xi_{k}$ are i.i.d. Gaussian $\mathcal{N}(0,\Sigma)$ so that $\mathbb{E}[\|\xi_{1}\|^{2}]=\text{tr}(\Sigma)=\sigma^{2}$ . We also assume that $\Sigma\prec I_{d}$ . Then, we have

$\displaystyle\hat{\eta}$	$\displaystyle\geq\max_{M\geq\eta\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|}\Bigg{\{}\min\Bigg{\{}\Bigg{(}1-\frac{\exp(-\frac{1}{2}(\frac{M}{\eta}-\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|)^{2})}{\sqrt{\det{(I_{d}-\Sigma)}}}\Bigg{)}^{2},$
	$\displaystyle\qquad\exp\Bigg{\{}-\frac{(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}}{\eta^{2}}\\|\Sigma^{-1}\\|$
	$\displaystyle\qquad\quad\cdot\Bigg{(}(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}+2\left(M+\eta\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|\right)\Bigg{)}\Bigg{\}}\Bigg{\}}\Bigg{\}}.$	(24)

The above corollary provides an explicit lower bound for $\hat{\eta}$ (instead of the less transparent inequality constraints in Theorem 9), and by combining with Remark 10 (see equation (23)) leads to an explicit formula for $\bar{\eta}$ which is essential to characterize the Wasserstein upper bound in (21) in Theorem 9.

4 Wasserstein Stability of SGD without Geometric Ergodicity

While the Markov chain perturbation theory enabled us to develop stability bounds for the case where we can ensure geometric ergodicity in the Wasserstein sense (i.e., proving contraction bounds), we have observed that such a strong ergodicity notion might not hold for non-strongly convex losses. In this section, we will prove two more stability bounds for SGD, without relying on Rudolf and Schweizer (2018), hence without requiring geometric ergodicity. To the best of our knowledge, these are the first uniform-time stability bounds for the considered classes of convex and non-convex problems.

4.1 Non-convex case without additive noise

The stability result we obtained in Theorem 9 required us to introduce an additional noise (Assumption 4) to be able to invoke Lemma 3. We will now show that it is possible to use a more direct approach to obtain 2-Wasserstein algorithmic stability in the non-convex case under Assumption 3 without relying on Rudolf and Schweizer (2018). However, we will observe that without geometric ergodicity will have a non-vanishing bias term in the bound. Note that, since $\mathcal{W}_{1}(\nu_{k},\hat{\nu}_{k})\leq\mathcal{W}_{p}(\nu_{k},\hat{\nu}_{k})$ for all $p\geq 1$ , the following bound still yields a generalization bound by (8).

Theorem 12

Assume $\theta_{0}=\hat{\theta}_{0}=\theta$ . We also assume that Assumption 1 and Assumption 3 hold and $\eta<\min\left\{\frac{1}{m},\frac{m}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\}$ and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ and $\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\leq E$ for some $E<\infty$ . Let $\nu_{k}$ and $\hat{\nu}_{k}$ denote the distributions of $\theta_{k}$ and $\hat{\theta}_{k}$ respectively. Then, we have

\displaystyle\mathcal{W}_{2}^{2}(\nu_{k},\hat{\nu}_{k})\leq\left(1-(1-\eta m)^{k}\right)\cdot\left(\frac{4D^{2}K_{2}^{2}\eta(8B+2)}{bnm}+\frac{4K_{2}D(1+K_{1}\eta)}{nm}\left(1+5B\right)+\frac{2K}{m}\right),

(25)

where the constant $B$ is explicitly defined in the proof.

While the bound (25) does not increase with the number of iterations, it is easy to see that it does not vanish as $n\rightarrow\infty$ , and it is small only when $K$ from the dissipativity condition (Assumption 3) is small. In other words, if we consider $K$ to be the level of non-convexity (e.g., $K=0$ corresponds to strong convexity), as the function becomes ‘more non-convex’ the persistent term in the bound will get larger. While this persistent term might make the bound vacuous when $n\to\infty$ , for moderate $n$ the bound can be still informative as the persistent term might be dominated by the first two terms.

Moreover, discarding the persistent bias term, this bound leads to a generalization bound with rate $n^{-1/2}$ , rather than $n^{-1}$ as before. This indicates that it is beneficial to add additional noise $\xi_{k}$ in SGD as in Theorem 9 in order for the dynamics to be geometrically ergodic that can lead to a sharp bound as $n\rightarrow\infty$ . Finally, we note that as Theorem 12 involves 2-Wasserstein distance, it can pave the way for generalization bounds without requiring a surrogate loss. Yet, this is not immediate and would require deriving uniform $L^{2}$ bounds for the iterates, e.g., Raginsky et al. (2017).

4.2 Convex case with additional geometric structure

We now present our final stability bound, where we consider relaxing the strong convexity assumption (Assumption 2) to the following milder assumption.

Assumption 5

There exists universal constants $\mu>0$ and $p\in(1,2)$ such that for any $\theta_{1},\theta_{2}\in\mathbb{R}^{d}$ and $x\in\mathcal{X}$ , $\left\langle\nabla f(\theta_{1},x)-\nabla f(\theta_{2},x),\theta_{1}-\theta_{2}\right\rangle\geq\mu\|\theta_{1}-\theta_{2}\|^{p}$ .

Note that as $p<2$ , the function class can be seen as an intermediate class between convex and strongly convex functions, and such a class of functions has been studied in the optimization literature (Dunn, 1981; Bertsekas, 2015).

We analogously modify Assumption 1 and consider the following assumption.

Assumption 6

There exist constants $K_{1},K_{2}>0$ and $p\in(1,2)$ such that for any $\theta,\hat{\theta}\in\mathbb{R}^{d}$ and every $x\in\mathcal{X}$ , $\|\nabla f(\theta,x)-\nabla f(\hat{\theta},\hat{x})\|\leq K_{1}\|\theta-\hat{\theta}\|^{\frac{p}{2}}+K_{2}\|x-\hat{x}\|(\|\theta\|^{p-1}+\|\hat{\theta}\|^{p-1}+1)$ .

The next theorem establishes a stability bound for the considered class of convex losses in the stationary regime of SGD.

Theorem 13

Let $\theta_{0}=\hat{\theta}_{0}=\theta$ . Suppose Assumption 5 and Assumption 6 hold (with $p\in(1,2)$ ) and $\eta\leq\frac{\mu}{K_{1}^{2}+2^{p+4}D^{2}K_{2}^{2}}$ and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ and $\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\leq E$ for some $E<\infty$ . Then $\nu_{k}$ and $\hat{\nu}_{k}$ converge to the unique stationary distributions $\nu_{\infty}$ and $\hat{\nu}_{\infty}$ respectively and moreover, we have

\displaystyle\mathcal{W}_{p}^{p}(\nu_{\infty},\hat{\nu}_{\infty})\leq\frac{C_{2}}{bn}+\frac{C_{3}}{n},

(26)

where the constants $C_{2}\equiv C_{2}(\eta,\mu,K_{2},D,E)$ and $C_{3}\equiv C_{3}(\eta,\mu,K_{1},K_{2},D,E)$ are explicitly stated in the proof.

While we have relaxed the geometric ergodicity condition for this case, in the proof of Theorem 13, we show that the Markov chain $(\theta_{k})_{k\geq 0}$ is simply ergodic, i.e., $\lim_{k\to\infty}\mathcal{W}_{p}(\nu_{k},\nu_{\infty})=0$ . Hence, even though we still obtain a time-uniform bound, our bound holds asymptotically in $k$ , due to the lack of an explicit convergence rate for $\mathcal{W}_{p}(\nu_{k},\nu_{\infty})$ . On the other hand, the lack of strong convexity here results in a generalization bound with rate $n^{-1/p}$ , whereas for the strongly convex case, i.e., $p=2$ , we previously obtained a rate of $n^{-1}$ . This might be an indicator that there might be still room for improvement in terms of the rate, at least for this class of loss functions.

5 Conclusion

We proved time-uniform Wasserstein-stability bounds for SGD and its noisy versions under different strongly convex, convex, and non-convex classes of functions. By making a connection to Markov chain perturbation results (Rudolf and Schweizer, 2018), we introduced a three-step guideline for proving stability bounds for stochastic optimizers. As this approach required geometric ergodicity, we finally relaxed this condition and proved two other stability bounds for a large class of loss functions.

The main limitation of our approach is that it requires Lipschitz surrogate loss functions, as it is based on the Wasserstein distance. Hence, our natural next step will be to extend our analysis without such a requirement. Finally, due to the theoretical nature of this study, it does not contain any direct potential societal impacts.

Acknowledgments

Lingjiong Zhu is partially supported by the grants NSF DMS-2053454, NSF DMS-2208303, and a Simons Foundation Collaboration Grant. Mert Gürbüzbalaban’s research are supported in part by the grants Office of Naval Research Award Number N00014-21-1-2244, National Science Foundation (NSF) CCF-1814888, NSF DMS-2053485. Anant Raj is supported by the a Marie Sklodowska-Curie Fellowship (project NN-OVEROPT 101030817). Umut Şimşekli’s research is supported by the French government under management of Agence Nationale de la Recherche as part of the “Investissements d’avenir” program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute) and the European Research Council Starting Grant DYNASTY – 101039676.

References

Akiyama and Suzuki (2023) S. Akiyama and T. Suzuki. Excess risk of two-layer ReLU neural networks in teacher-student settings and its superiority to kernel methods. In International Conference on Learning Representations, 2023.
Aybat et al. (2019) N. S. Aybat, A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar. A universally optimal multistage accelerated stochastic gradient method. In Advances in Neural Information Processing Systems, volume 32, 2019.
Aybat et al. (2020) N. S. Aybat, A. Fallah, M. Gurbuzbalaban, and A. Ozdaglar. Robust accelerated gradient methods for smooth strongly convex functions. SIAM Journal on Optimization, 30(1):717–751, 2020.
Bach (2014) F. Bach. Adaptivity of averaged stochastic gradient descent to local strong convexity for logistic regression. Journal of Machine Learning Research, 15(1):595–627, 2014.
Bach and Moulines (2013) F. Bach and E. Moulines. Non-strongly-convex smooth stochastic approximation with convergence rate $\mathcal{O}(1/n)$ . In Advances in Neural Information Processing Systems, volume 26, 2013.
Bassily et al. (2020) R. Bassily, V. Feldman, C. Guzmán, and K. Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. In Advances in Neural Information Processing Systems, volume 33, pages 4381–4391, 2020.
Bertsekas (2015) D. Bertsekas. Convex Optimization Algorithms. Athena Scientific, 2015.
Bousquet and Elisseeff (2002) O. Bousquet and A. Elisseeff. Stability and generalization. Journal of Machine Learning Research, 2(Mar):499–526, 2002.
Can et al. (2019) B. Can, M. Gürbüzbalaban, and L. Zhu. Accelerated linear convergence of stochastic momentum methods in Wasserstein distances. In International Conference on Machine Learning, pages 891–901. PMLR, 2019.
Chen et al. (2018) Y. Chen, C. Jin, and B. Yu. Stability and convergence trade-off of iterative optimization algorithms. arXiv preprint arXiv:1804.01619, 2018.
Dunn (1981) J. C. Dunn. Global and asymptotic convergence rate estimates for a class of projected gradient processes. SIAM Journal on Control and Optimization, 19(3):368–400, 1981.
Erdogdu et al. (2022) M. A. Erdogdu, R. Hosseinzadeh, and M. S. Zhang. Convergence of Langevin Monte Carlo in Chi-squred and Rényi divergence. In Proceedings of the 25th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 151. PMLR, 2022.
Fallah et al. (2022) A. Fallah, M. Gürbüzbalaban, A. Ozdaglar, U. Şimşekli, and L. Zhu. Robust distributed accelerated stochastic gradient methods for multi-agent networks. Journal of Machine Learning Research, 23(1):9893–9988, 2022.
Farghly and Rebeschini (2021) T. Farghly and P. Rebeschini. Time-independent generalization bounds for SGLD in non-convex settings. In Advances in Neural Information Processing Systems, volume 34, pages 19836–19846, 2021.
Feldman and Vondrak (2019) V. Feldman and J. Vondrak. High probability generalization bounds for uniformly stable algorithms with nearly optimal rate. In Conference on Learning Theory, pages 1270–1279. PMLR, 2019.
Gao et al. (2022) X. Gao, M. Gürbüzbalaban, and L. Zhu. Global convergence of stochastic gradient Hamiltonian Monte Carlo for nonconvex stochastic optimization: Nonasymptotic performance bounds and momentum-based acceleration. Operations Research, 70(5):2931–2947, 2022.
Garrigos and Gower (2023) G. Garrigos and R. M. Gower. Handbook of convergence theorems for (stochastic) gradient methods. arXiv preprint arXiv:2301.11235, 2023.
Gürbüzbalaban et al. (2022) M. Gürbüzbalaban, A. Ruszczyński, and L. Zhu. A stochastic subgradient method for distributionally robust non-convex and non-smooth learning. Journal of Optimization Theory and Applications, 194(3):1014–1041, 2022.
Hairer and Mattingly (2011) M. Hairer and J. C. Mattingly. Yet another look at Harris’ ergodic theorem for Markov chains. In Seminar on Stochastic Analysis, Random Fields and Applications VI, pages 109–118, Basel, 2011.
Hardt et al. (2016) M. Hardt, B. Recht, and Y. Singer. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pages 1225–1234. PMLR, 2016.
Kozachkov et al. (2023) L. Kozachkov, P. M. Wensing, and J.-J. Slotine. Generalization as dynamical robustness–The role of Riemannian contraction in supervised learning. Transactions on Machine Learning Research, 4:1–25, 2023.
Lei and Ying (2020) Y. Lei and Y. Ying. Fine-grained analysis of stability and generalization for stochastic gradient descent. In International Conference on Machine Learning, volume 119, pages 5809–5819. PMLR, 2020.
Lessard et al. (2016) L. Lessard, B. Recht, and A. Packard. Analysis and design of optimization algorithms via integral quadratic constraints. SIAM Journal on Optimization, 26(1):57–95, 2016.
Li et al. (2019) J. Li, X. Luo, and M. Qiao. On generalization error bounds of noisy gradient methods for non-convex learning. arXiv preprint arXiv:1902.00621, 2019.
Lin and Rosasco (2017) J. Lin and L. Rosasco. Optimal rates for multi-pass stochastic gradient methods. Journal of Machine Learning Research, 18(1):3375–3421, 2017.
Liu et al. (2017) T. Liu, G. Lugosi, G. Neu, and D. Tao. Algorithmic stability and hypothesis complexity. In International Conference on Machine Learning, pages 2159–2167. PMLR, 2017.
Liu et al. (2020) Y. Liu, Y. Gao, and W. Yin. An improved analysis of stochastic gradient descent with momentum. In Advances in Neural Information Processing Systems, volume 33, pages 18261–18271, 2020.
Meyn and Tweedie (1993) S. P. Meyn and R. L. Tweedie. Markov Chains and Stochastic Stability. Communications and Control Engineering Series. Springer-Verlag, London, 1993.
Meyn and Tweedie (1994) S. P. Meyn and R. L. Tweedie. Computable bounds for geometric convergence rates of Markov chains. Annals of Applied Probability, 4(4):981–1011, 1994.
Mou et al. (2018) W. Mou, L. Wang, X. Zhai, and K. Zheng. Generalization bounds of SGLD for non-convex learning: Two theoretical viewpoints. In Conference on Learning Theory, pages 605–638. PMLR, 2018.
Pillaud-Vivien et al. (2018) L. Pillaud-Vivien, A. Rudi, and F. Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. In Advances in Neural Information Processing Systems, volume 31, 2018.
Raginsky et al. (2016) M. Raginsky, A. Rakhlin, M. Tsao, Y. Wu, and A. Xu. Information-theoretic analysis of stability and bias of learning algorithms. In 2016 IEEE Information Theory Workshop (ITW), pages 26–30. IEEE, 2016.
Raginsky et al. (2017) M. Raginsky, A. Rakhlin, and M. Telgarsky. Non-convex learning via stochastic gradient Langevin dynamics: A nonasymptotic analysis. In Conference on Learning Theory, pages 1674–1703. PMLR, 2017.
Raj et al. (2023a) A. Raj, M. Barsbey, M. Gürbüzbalaban, L. Zhu, and U. Şimşekli. Algorithmic stability of heavy-tailed stochastic gradient descent on least squares. In International Conference on Algorithmic Learning Theory, volume 201, pages 1292–1342. PMLR, 2023a.
Raj et al. (2023b) A. Raj, L. Zhu, M. Gürbüzbalaban, and U. Şimşekli. Algorithmic stability of heavy-tailed SGD with general loss functions. In International Conference on Machine Learning, volume 202, pages 28578–28597. PMLR, 2023b.
Rudolf and Schweizer (2018) D. Rudolf and N. Schweizer. Perturbation theory for Markov chains via Wasserstein distance. Bernoulli, 24(4A):2610–2639, 2018.
Villani (2009) C. Villani. Optimal Transport: Old and New. Springer, Berlin, 2009.
Welling and Teh (2011) M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 681–688, 2011.
Xu and Raginsky (2017) A. Xu and M. Raginsky. Information-theoretic analysis of generalization capability of learning algorithms. In Advances in Neural Information Processing Systems, volume 30, 2017.
Zhang et al. (2022) Y. Zhang, W. Zhang, S. Bald, V. Pingali, C. Chen, and M. Goswami. Stability of SGD: Tightness analysis and improved bounds. In Uncertainty in Artificial Intelligence, pages 2364–2373. PMLR, 2022.

Uniform-in-Time Wasserstein Stability Bounds
for (Noisy) Stochastic Gradient Descent
APPENDIX

The Appendix is organized as follows:

•

In Section A, we provide further details and examples about the usage of surrogate losses.
•

In Section B, we provide technical background for the computable bounds for the convergence of Markov chains which will be used to prove the results in Section 3.3 in the main paper.
•

In Section C, we provide technical proofs for 1-Wasserstein perturbation results for the quadratic loss in Section 3.1 in the main paper.
•

In Section D, we provide technical proofs for 1-Wasserstein perturbation results for the strongly-convex loss in Section 3.2 in the main paper.
•

In Section E, we provide technical proofs for 1-Wasserstein perturbation results for the non-convex loss (with additive noise) in Section 3.3 in the main paper.
•

In Section F, we provide technical proofs for $2$ -Wasserstein stability bounds for the non-convex loss without additive noise in Section 4.1 in the main paper.
•

In Section G, we provide technical proofs for $p$ -Wasserstein stability bounds for the convex loss with additional geometric structure in Section 4.2 in the main paper.

A On the Usage of Surrogate Losses

While the requirement of surrogate losses is a drawback of our framework, nevertheless our setup can cover several practical settings. In this section, we will provide two such examples.

Example 1.

We can choose the surrogate loss as the truncated loss, such that:

\ell(\theta,x)=\min(f(\theta,x),C),

where $C>0$ is a chosen constant. This can be seen as a “robust” version of the original loss, which has been widely used in robust optimization and is conceptually similar to adding a projection step to the optimizer.

Example 2.

Another natural setup for our framework is the $\ell_{2}$ -regularized Lipschitz loss that was also used in Farghly and Rebeschini (2021). As opposed to the previous case, for the sake of this example, let us consider $\ell$ as the true loss and $f$ as the surrogate loss. Then, we can choose the pair $f$ and $\ell$ as follows:

f(\theta,x)=\ell(\theta,x)+\frac{\mu}{2}\|\theta\|_{2}^{2},

where $\mu>0$ . Intuitively, this setting means that, we have a true loss $\ell$ which can be Lipschitz, but in the optimization framework we consider a regularized version of the loss. In other words, we have a loss $\ell$ ; however, we run the algorithm on the regularized loss $f$ to have better convergence properties, and finally, we would like to understand if the algorithm generalizes on $\ell$ or not, and we are typically not interested if the algorithm generalizes well on the regularized loss $f$ .

Next, we illustrate how a generalization bound for the loss $f$ , i.e., $\left|\mathbb{E}[\hat{F}(\theta)-F(\theta)]\right|$ . For this example, a bound on the quantity can be obtained by building on our analysis. To obtain such a bound, in addition to the bounds that we developed on $\left|\mathbb{E}[\hat{R}(\theta)-R(\theta)]\right|$ , we would need to estimate the following quantity:

\left|\mathbb{E}_{\theta,X_{n}}\left[\frac{1}{n}\sum_{i=1}^{n}\left(f\left(\theta,x_{i}\right)-\ell\left(\theta,x_{i}\right)\right)\right]\right|.

For illustration purposes, assume that $\ell$ is convex and Lipschitz in the first parameter. Then, $f$ is $\mu$ -strongly convex. Further consider that we initialize SGD from $0$ , i.e., $\theta_{0}=0$ and set the batch size $b$ to 1. Denote $\theta=\theta_{k}$ as the $k$ -th iterate of SGD when applied on $\hat{F}\left(\theta,X_{n}\right)$ , i.e., (1). Further define the minimum:

\theta_{X_{n}}^{\star}=\arg\min_{\theta}\hat{F}\left(\theta,X_{n}\right).

We can now analyze the error induced by the surrogate loss as follows:

	$\displaystyle\left\|\mathbb{E}_{\theta,X_{n}}\left[\frac{1}{n}\sum_{i=1}^{n}\left(f\left(\theta,x_{i}\right)-\ell\left(\theta,x_{i}\right)\right)\right]\right\|$	$\displaystyle=\frac{\mu}{2}\mathbb{E}_{\theta,X_{n}}\\|\theta\\|^{2}=\frac{\mu}{2}\mathbb{E}_{\theta,X_{n}}\left\\|\theta-\theta_{X_{n}}^{\star}+\theta_{X_{n}}^{\star}\right\\|^{2}$
		$\displaystyle\leq\mu\mathbb{E}_{\theta,X_{n}}\left\\|\theta-\theta_{X_{n}}^{\star}\right\\|^{2}+\mu\mathbb{E}_{X_{n}}\left\\|\theta_{X_{n}}^{\star}\right\\|^{2}$
		$\displaystyle\leq\mu\mathbb{E}_{X_{n}}\left[(1-\eta\mu)^{k}\left\\|\theta_{X_{n}}^{\star}\right\\|^{2}+\frac{2\eta}{\mu}\sigma_{X_{n}}\right]+\mu\mathbb{E}_{X_{n}}\left\\|\theta_{X_{n}}^{\star}\right\\|^{2}$
		$\displaystyle=\mu\left((1-\eta\mu)^{k}+1\right)\mathbb{E}_{X_{n}}\left\\|\theta_{X_{n}}^{\star}\right\\|^{2}+2\eta\mathbb{E}_{X_{n}}\left[\sigma_{X_{n}}\right].$

Here, the second inequality follows from standard convergence analysis for SGD (Garrigos and Gower, 2023, Theorem 5.7) and we define $\sigma_{X_{n}}$ as the stochastic gradient noise variance:

\sigma_{X_{n}}:=\operatorname{Var}\left[\nabla f\left(\theta_{X_{n}}^{\star},x_{i}\right)\right],

where for a random vector $V$ we define $\operatorname{Var}[V]:=\mathbb{E}\|V-\mathbb{E}[V]\|^{2}$ . Hence, we can see that the error induced by the surrogate loss depends on the following factors:

•

The regularization parameter $\mu$ ,
•

The expected norm of the minimizers,
•

The step-size $\eta$ ,
•

The expected stochastic gradient noise variance.

These terms can be controlled by adjusting $\mu$ and $\eta$ .

B Technical Background

B.1 Computable bounds for the convergence of Markov chains

Geometric ergodicity and convergence rate of Markov chains has been well studied in the literature (Meyn and Tweedie, 1993, 1994; Hairer and Mattingly, 2011). In this section, we state a result from Hairer and Mattingly (2011) that provides an explicitly computable bound on the Wasserstein contraction for the Markov chains that satisfies a drift condition that relies on the construction of an appropriate Lyapunov function and a minorization condition.

Let $\mathcal{P}(\theta,\cdot)$ be a Markov transition kernel for a Markov chain $(\theta_{k})$ on $\mathbb{R}^{d}$ . For any measurable function $\varphi:\mathbb{R}^{d}\rightarrow[0,+\infty]$ , we define:

(\mathcal{P}\varphi)(\theta)=\int_{\mathbb{R}^{d}}\varphi(\tilde{\theta})\mathcal{P}(\theta,d\tilde{\theta}).

Assumption 7 (Drift Condition)

There exists a function $V:\mathbb{R}^{d}\rightarrow[0,\infty)$ and some constants $K\geq 0$ and $\gamma\in(0,1)$ so that

(\mathcal{P}V)(\theta)\leq\gamma V(\theta)+K,

for all $\theta\in\mathbb{R}^{d}$ .

Assumption 8 (Minorization Condition)

There exists some constant $\hat{\eta}\in(0,1)$ and a probability measure $\nu$ so that

\inf_{\theta\in\mathbb{R}^{d}:V(\theta)\leq R}\mathcal{P}(\theta,\cdot)\geq\hat{\eta}\nu(\cdot),

for some $R>2K/(1-\gamma)$ .

We define the weighted total variation distance:

d_{\psi}(\mu_{1},\mu_{2})=\int_{\mathbb{R}^{d}}(1+\psi V(\theta))|\mu_{1}-\mu_{2}|(d\theta),

where $\psi>0$ and $V(\theta)$ is the Lyapunov function that satisfies the drift condition (Assumption 7). It is known that $d_{\psi}$ has the following alternative expression (Hairer and Mattingly, 2011):

d_{\psi}(\mu_{1},\mu_{2})=\sup_{\varphi:\|\varphi\|_{\psi}\leq 1}\int_{\mathbb{R}^{d}}\varphi(\theta)(\mu_{1}-\mu_{2})(d\theta),

where $\|\cdot\|_{\psi}$ is the weighted supremum norm such that for any $\psi>0$ :

\|\varphi\|_{\psi}:=\sup_{\theta\in\mathbb{R}^{d}}\frac{|\varphi(\theta)|}{1+\psi V(\theta)}.

It is also noted in Hairer and Mattingly (2011) that $d_{\psi}$ has yet another equivalent expression:

d_{\psi}(\mu_{1},\mu_{2})=\sup_{\varphi:|\|\varphi\||_{\psi}\leq 1}\int_{\mathbb{R}^{d}}\varphi(\theta)(\mu_{1}-\mu_{2})(d\theta),

where

|\|\varphi\||_{\psi}:=\sup_{\theta\neq\tilde{\theta}}\frac{|\varphi(\theta)-\varphi(\tilde{\theta})|}{2+\psi V(\theta)+\psi V(\tilde{\theta})}.

Lemma 14 (Theorem 1.3. Hairer and Mattingly (2011))

If the drift condition (Assumption 7) and minorization condition (Assumption 8) hold, then there exists $\bar{\eta}\in(0,1)$ and $\psi>0$ so that

d_{\psi}(\mathcal{P}\mu_{1},\mathcal{P}\mu_{2})\leq\bar{\eta}d_{\psi}(\mu_{1},\mu_{2})

for any probability measures $\mu_{1},\mu_{2}$ on $\mathbb{R}^{d}$ . In particular, for any $\eta_{0}\in(0,\hat{\eta})$ and $\gamma_{0}\in(\gamma+2K/R,1)$ one can choose $\psi=\eta_{0}/K$ and $\bar{\eta}=(1-(\hat{\eta}-\eta_{0}))\vee(2+R\psi\gamma_{0})/(2+R\psi)$ .

C Proofs of Wasserstein Perturbation Results: Quadratic Case

C.1 Proof of Lemma 4

Proof Let $P^{k}(\theta,\cdot)$ denote the law of $\theta_{k}$ starting with $\theta_{0}=\theta$ and $P^{k}(\tilde{\theta},\cdot)$ the law of $\tilde{\theta}_{k}$ :

\tilde{\theta}_{k}=\left(I-\frac{\eta}{b}H_{k}\right)\tilde{\theta}_{k-1}+\frac{\eta}{b}q_{k},

(27)

with $\tilde{\theta}_{0}=\tilde{\theta}$ . Note that

	$\displaystyle\theta_{k}=\left(I-\frac{\eta}{b}H_{k}\right)\theta_{k-1}+\frac{\eta}{b}q_{k},$		(28)
	$\displaystyle\tilde{\theta}_{k}=\left(I-\frac{\eta}{b}H_{k}\right)\tilde{\theta}_{k-1}+\frac{\eta}{b}q_{k},$		(29)

which implies that

	$\displaystyle\mathbb{E}\left\\|\theta_{k}-\tilde{\theta}_{k}\right\\|$	$\displaystyle=\mathbb{E}\left\\|\left(I-\frac{\eta}{b}H_{k}\right)\left(\theta_{k-1}-\tilde{\theta}_{k-1}\right)\right\\|$
		$\displaystyle\leq\mathbb{E}\left[\left\\|I-\frac{\eta}{b}H_{k}\right\\|\left\\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\\|\right]=\rho\mathbb{E}\left\\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\\|.$		(30)

By iterating over $j=k,k-1,\ldots 1$ , we conclude that

\mathcal{W}_{1}\left(P^{k}(\theta,\cdot),P^{k}(\tilde{\theta},\cdot)\right)\leq\mathbb{E}\|\theta_{k}-\tilde{\theta}_{k}\|\leq\rho^{n}\|\theta_{0}-\tilde{\theta}_{0}\|=\rho^{k}\|\theta-\tilde{\theta}\|.

(31)

This completes the proof.

C.2 Proof of Lemma 5

Proof First, we recall that

\hat{\theta}_{k}=\left(I-\frac{\eta}{b}\hat{H}_{k}\right)\hat{\theta}_{k-1}+\frac{\eta}{b}\hat{q}_{k},

(32)

where $\hat{H}_{k}:=\sum_{i\in\Omega_{k}}\hat{a}_{i}\hat{a}_{i}^{\top}$ and $\hat{q}_{k}:=\sum_{i\in\Omega_{k}}\hat{a}_{i}\hat{y}_{i}$ . Therefore, starting with $\hat{\theta}_{0}=\theta$ , we have

\hat{\theta}_{1}=\left(I-\frac{\eta}{b}\hat{H}_{1}\right)\theta+\frac{\eta}{b}\hat{q}_{1},

(33)

which implies that

\displaystyle(\hat{P}\hat{V})(\theta)=\mathbb{E}\hat{V}(\hat{\theta}_{1})=1+\mathbb{E}\|\hat{\theta}_{1}\|\leq 1+\hat{\rho}\|\theta\|+\frac{\eta}{b}\mathbb{E}\|\hat{q}_{1}\|=\hat{\rho}\hat{V}(\theta)+1-\hat{\rho}+\frac{\eta}{b}\mathbb{E}\|\hat{q}_{1}\|.

(34)

This completes the proof.

C.3 Proof of Lemma 6

Proof Let us recall that

	$\displaystyle\theta_{1}=\left(I-\frac{\eta}{b}H_{1}\right)\theta+\frac{\eta}{b}q_{1},$		(35)
	$\displaystyle\hat{\theta}_{1}=\left(I-\frac{\eta}{b}\hat{H}_{1}\right)\theta+\frac{\eta}{b}\hat{q}_{1},$		(36)

which implies that

\mathcal{W}_{1}\left(\delta_{\theta}P,\delta_{\theta}\hat{P}\right)\leq\mathbb{E}\left\|H_{1}-\hat{H}_{1}\right\|\frac{\eta}{b}\|\theta\|+\frac{\eta}{b}\mathbb{E}\left\|q_{1}-\hat{q}_{1}\right\|.

(37)

Since $X_{n}$ and $\hat{X}_{n}$ differ by at most one element and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ , we have $(H_{1},q_{1})=(\hat{H}_{1},q_{1})$ with probability $\frac{n-b}{n}$ and $(H_{1},q_{1})\neq(\hat{H}_{1},q_{1})$ with probability $\frac{b}{n}$ and moreover

\mathbb{E}\left\|H_{1}-\hat{H}_{1}\right\|\leq\frac{b}{n}\max_{1\leq i\leq n}\left\|a_{i}a_{i}^{\top}-\hat{a}_{i}\hat{a}_{i}^{\top}\right\|\leq\frac{b}{n}\max_{1\leq i\leq n}\left(\|a_{i}\|^{2}+\|\hat{a}_{i}\|^{2}\right)\leq\frac{2bD^{2}}{n},

(38)

and

\mathbb{E}\left\|q_{1}-\hat{q}_{1}\right\|\leq\frac{b}{n}\max_{1\leq i\leq n}\left\|a_{i}y_{i}-\hat{a}_{i}\hat{y}_{i}\right\|\leq\frac{b}{n}\max_{1\leq i\leq n}\left(\|a_{i}\|\|q_{i}\|+\|\hat{a}_{i}\|\|\hat{q}_{i}\|\right)\leq\frac{2bD^{2}}{n}.

(39)

Hence, we conclude that

\sup_{\theta\in\mathbb{R}^{d}}\frac{\mathcal{W}_{1}(\delta_{\theta}P,\delta_{\theta}\hat{P})}{\hat{V}(\theta)}\leq\sup_{\theta\in\mathbb{R}^{d}}\frac{\frac{\eta}{b}(\|\theta\|+1)\frac{2bD^{2}}{n}}{1+\|\theta\|}=\frac{2\eta D^{2}}{n}.

(40)

This completes the proof.

D Proofs of Wasserstein Perturbation Results: Strongly Convex Case

In order to obtain the algorithmic stability bound, that is a $1$ -Wasserstein distance between the distribution of $\theta_{k}$ and $\hat{\theta}_{k}$ , we need to establish a sequence of technical lemmas. First, we show a $1$ -Wasserstein contraction rate in the following lemma.

Lemma 15

Assume that Assumption 1 and Assumption 2 hold, and further assume that $\eta<\min\left\{\frac{1}{\mu},\frac{\mu}{K_{1}^{2}}\right\}$ . Then, for any $n\in\mathbb{N}$ ,

\mathcal{W}_{1}\left(P^{n}(\theta,\cdot),P^{n}\left(\tilde{\theta},\cdot\right)\right)\leq\left(1-\frac{\eta\mu}{2}\right)^{n}\|\theta-\tilde{\theta}\|.

(41)

Proof Let $P^{k}(\theta,\cdot)$ denote the law of $\theta_{k}$ starting with $\theta_{0}=\theta$ :

\theta_{k}=\theta_{k-1}-\eta\tilde{\nabla}\hat{F}_{k}(\theta_{k-1},X_{n}),

(42)

and $P^{k}(\tilde{\theta},\cdot)$ the law of $\tilde{\theta}_{k}$ :

\tilde{\theta}_{k}=\tilde{\theta}_{k-1}-\eta\tilde{\nabla}\hat{F}_{k}\left(\tilde{\theta}_{k-1},X_{n}\right),

(43)

with $\tilde{\theta}_{0}=\tilde{\theta}$ . Note that

	$\displaystyle\theta_{k}=\theta_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f(\theta_{k-1},x_{i}),$		(44)
	$\displaystyle\tilde{\theta}_{k}=\tilde{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\tilde{\theta}_{k-1},x_{i}\right).$		(45)

Therefore, we have

$\displaystyle\mathbb{E}\left\\|\theta_{k}-\tilde{\theta}_{k}\right\\|^{2}$	$\displaystyle=\mathbb{E}\left\\|\theta_{k-1}-\tilde{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\tilde{\theta}_{k-1},x_{i}\right)\right)\right\\|^{2}$
	$\displaystyle=\mathbb{E}\left\\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\\|^{2}+\frac{\eta^{2}}{b^{2}}\mathbb{E}\left\\|\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\tilde{\theta}_{k-1},x_{i}\right)\right)\right\\|^{2}$
	$\displaystyle\qquad\qquad-\frac{2\eta}{b}\mathbb{E}\left\langle\theta_{k-1}-\tilde{\theta}_{k-1},\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\tilde{\theta}_{k-1},x_{i}\right)\right)\right\rangle.$	(46)

By applying Assumption 1 and Assumption 2, we get

	$\displaystyle\mathbb{E}\left\\|\theta_{k}-\tilde{\theta}_{k}\right\\|^{2}$
	$\displaystyle\leq(1-2\eta\mu)\mathbb{E}\left\\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\\|^{2}+\frac{\eta^{2}}{b^{2}}\mathbb{E}\left[\left(\sum_{i\in\Omega_{k}}\left\\|\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\tilde{\theta}_{k-1},x_{i}\right)\right\\|\right)^{2}\right]$
	$\displaystyle\leq(1-2\eta\mu)\mathbb{E}\left\\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\\|^{2}+\eta^{2}K_{1}^{2}\mathbb{E}\left\\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\\|^{2}$
	$\displaystyle\leq(1-\eta\mu)\mathbb{E}\left\\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\\|^{2},$		(47)

provided that $\eta\leq\frac{\mu}{K_{1}^{2}}$ . By iterating over $k=n,n-1,\ldots 1$ , we conclude that

	$\displaystyle\left(\mathcal{W}_{1}\left(P^{n}(\theta,\cdot),P^{n}(\tilde{\theta},\cdot)\right)\right)^{2}$	$\displaystyle\leq\left(\mathcal{W}_{2}\left(P^{n}(\theta,\cdot),P^{n}(\tilde{\theta},\cdot)\right)\right)^{2}$
		$\displaystyle\leq\mathbb{E}\left\\|\theta_{n}-\tilde{\theta}_{n}\right\\|^{2}\leq(1-\eta\mu)^{n}\left\\|\theta_{0}-\tilde{\theta}_{0}\right\\|^{2}\leq\left(1-\frac{\eta\mu}{2}\right)^{2n}\left\\|\theta-\tilde{\theta}\right\\|^{2}.$		(48)

This completes the proof.

Next, we construct a Lyapunov function $\hat{V}$ and obtain a drift condition for the SGD $(\hat{\theta}_{k})_{k=0}^{\infty}$ .

Lemma 16

Assume that Assumption 1 and Assumption 2 hold. Let $\hat{V}(\theta):=1+\|\theta-\hat{\theta}_{\ast}\|^{2}$ , where $\hat{\theta}_{\ast}$ is the minimizer of $\hat{F}(\theta,\hat{X}_{n}):=\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})$ . Assume that $\eta<\min\left\{\frac{1}{\mu},\frac{\mu}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\}$ and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ . Then, we have

(\hat{P}\hat{V})(\theta)\leq(1-\eta\mu)\hat{V}(\theta)+2\eta\mu-\eta^{2}K_{1}^{2}-56\eta^{2}D^{2}K_{2}^{2}+64\eta^{2}D^{2}K_{2}^{2}\|\hat{\theta}_{\ast}\|^{2}.

(49)

Proof First, we recall that

\hat{\theta}_{k}=\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right).

(50)

Therefore, starting with $\hat{\theta}_{0}=\theta$ , we have

	$\displaystyle\hat{\theta}_{1}$	$\displaystyle=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})$
		$\displaystyle=\theta-\frac{\eta}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})+\eta\left(\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})-\frac{1}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})\right).$		(51)

Moreover, we have

\mathbb{E}\left[\frac{1}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})\Big{|}\theta\right]=\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i}).

(52)

This implies that

	$\displaystyle(\hat{P}\hat{V})(\theta)$
	$\displaystyle=\mathbb{E}\hat{V}(\hat{\theta}_{1})=1+\mathbb{E}\left\\|\hat{\theta}_{1}-\hat{\theta}_{\ast}\right\\|^{2}$
	$\displaystyle=1+\left\\|\theta-\hat{\theta}_{\ast}+\frac{\eta}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})\right\\|^{2}+\eta^{2}\mathbb{E}\left\\|\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})-\frac{1}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})\right\\|^{2}.$		(53)

We can compute that

	$\displaystyle\left\\|\theta-\hat{\theta}_{\ast}+\frac{\eta}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})\right\\|^{2}$
	$\displaystyle=\left\\|\theta-\hat{\theta}_{\ast}+\frac{\eta}{n}\sum_{i=1}^{n}\left(\nabla f(\theta,\hat{x}_{i})-\nabla f\left(\hat{\theta}_{\ast},\hat{x}_{i}\right)\right)\right\\|^{2}$
	$\displaystyle\leq(1-2\eta\mu)\\|\theta-\hat{\theta}_{\ast}\\|^{2}+\frac{\eta^{2}}{n^{2}}\left\\|\sum_{i=1}^{n}\left(\nabla f(\theta,\hat{x}_{i})-\nabla f\left(\hat{\theta}_{\ast},\hat{x}_{i}\right)\right)\right\\|^{2}$
	$\displaystyle\leq(1-2\eta\mu+\eta^{2}K_{1}^{2})\\|\theta-\hat{\theta}_{\ast}\\|^{2}.$		(54)

Moreover, we can compute that

$\displaystyle\mathbb{E}\left\\|\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})-\frac{1}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})\right\\|^{2}$	$\displaystyle=\mathbb{E}\left\\|\frac{1}{b}\sum_{i\in\Omega_{1}}\left(\frac{1}{n}\sum_{j=1}^{n}\nabla f(\theta,\hat{x}_{j})-\nabla f(\theta,\hat{x}_{i})\right)\right\\|^{2}$
	$\displaystyle=\mathbb{E}\left(\frac{1}{b}\sum_{i\in\Omega_{1}}K_{2}\frac{1}{n}\sum_{j=1}^{n}\\|\hat{x}_{i}-\hat{x}_{j}\\|(2\\|\theta\\|+1)\right)^{2}$
	$\displaystyle\leq 4D^{2}K_{2}^{2}(2\\|\theta\\|+1)^{2}$
	$\displaystyle\leq 8D^{2}K_{2}^{2}(4\\|\theta\\|^{2}+1)$
	$\displaystyle\leq 8D^{2}K_{2}^{2}\left(8\\|\theta-\hat{\theta}_{\ast}\\|^{2}+8\\|\hat{\theta}_{\ast}\\|^{2}+1\right).$	(55)

Hence, we conclude that

$\displaystyle(\hat{P}\hat{V})(\theta)$	$\displaystyle\leq\left(1-2\eta\mu+\eta^{2}K_{1}^{2}+64\eta^{2}D^{2}K_{2}^{2}\right)\left\\|\theta-\hat{\theta}_{\ast}\right\\|^{2}+1+8\eta^{2}D^{2}K_{2}^{2}\left(8\\|\hat{\theta}_{\ast}\\|^{2}+1\right)$
	$\displaystyle=\left(1-2\eta\mu+\eta^{2}K_{1}^{2}+64\eta^{2}D^{2}K_{2}^{2}\right)\hat{V}(\theta)$
	$\displaystyle\qquad\qquad\qquad+2\eta\mu-\eta^{2}K_{1}^{2}-56\eta^{2}D^{2}K_{2}^{2}+64\eta^{2}D^{2}K_{2}^{2}\\|\hat{\theta}_{\ast}\\|^{2}$
	$\displaystyle\leq(1-\eta\mu)\hat{V}(\theta)+2\eta\mu-\eta^{2}K_{1}^{2}-56\eta^{2}D^{2}K_{2}^{2}+64\eta^{2}D^{2}K_{2}^{2}\\|\hat{\theta}_{\ast}\\|^{2},$	(56)

provided that $\eta\leq\frac{\mu}{K_{1}^{2}+64D^{2}K_{2}^{2}}$ . This completes the proof.

Next, we estimate the perturbation gap based on the Lyapunov function $\hat{V}$ .

Lemma 17

Assume that Assumption 1 holds. Assume that $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ . Then, we have

\sup_{\theta\in\mathbb{R}^{d}}\frac{\mathcal{W}_{1}(\delta_{\theta}P,\delta_{\theta}\hat{P})}{\hat{V}(\theta)}\leq\frac{4DK_{2}\eta}{n}(2\|\hat{\theta}_{\ast}\|+1),

(57)

where $\hat{\theta}_{\ast}$ is the minimizer of $\hat{F}(\theta,\hat{X}_{n}):=\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})$ .

Proof Let us recall that

	$\displaystyle\theta_{1}=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i}),$		(58)
	$\displaystyle\hat{\theta}_{1}=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i}),$		(59)

which implies that

$\displaystyle\mathcal{W}_{1}\left(\delta_{\theta}P,\delta_{\theta}\hat{P}\right)$	$\displaystyle\leq\frac{\eta}{b}\mathbb{E}\left\\|\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i})-\nabla f(\theta,\hat{x}_{i})\right\\|$
	$\displaystyle\leq\frac{\eta}{b}\mathbb{E}\left[\sum_{i\in\Omega_{1}}\left\\|\nabla f(\theta,x_{i})-\nabla f(\theta,\hat{x}_{i})\right\\|\right]$
	$\displaystyle\leq\frac{\eta}{b}\mathbb{E}\left[\sum_{i\in\Omega_{1}}K_{2}\left\\|x_{i}-\hat{x}_{i}\right\\|(2\\|\theta\\|+1)\right]$	(60)

Since $X_{n}$ and $\hat{X}_{n}$ differ by at most one element and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ , we have $x_{i}=\hat{x}_{i}$ for any $i\in\Omega_{1}$ with probability $\frac{n-b}{n}$ and $x_{i}\neq\hat{x}_{i}$ for exactly one $i\in\Omega_{1}$ with probability $\frac{b}{n}$ and therefore

\mathcal{W}_{1}\left(\delta_{\theta}P,\delta_{\theta}\hat{P}\right)\leq\frac{\eta}{b}\frac{b}{n}2K_{2}D(2\|\theta\|+1)=\frac{2DK_{2}\eta}{n}(2\|\theta\|+1).

(61)

Hence, we conclude that

$\displaystyle\sup_{\theta\in\mathbb{R}^{d}}\frac{\mathcal{W}_{1}(\delta_{\theta}P,\delta_{\theta}\hat{P})}{\hat{V}(\theta)}$	$\displaystyle\leq\sup_{\theta\in\mathbb{R}^{d}}\frac{4DK_{2}\eta}{n}\frac{\\|\theta\\|+\frac{1}{2}}{1+\\|\theta-\hat{\theta}_{\ast}\\|^{2}}$
	$\displaystyle\leq\sup_{\theta\in\mathbb{R}^{d}}\frac{4DK_{2}\eta}{n}\frac{2\\|\theta-\hat{\theta}_{\ast}\\|+2\\|\hat{\theta}_{\ast}\\|+1}{1+\\|\theta-\hat{\theta}_{\ast}\\|^{2}}$
	$\displaystyle\leq\sup_{\theta\in\mathbb{R}^{d}}\frac{4DK_{2}\eta}{n}\frac{2\\|\theta-\hat{\theta}_{\ast}\\|(2\\|\hat{\theta}_{\ast}\\|+1)}{1+\\|\theta-\hat{\theta}_{\ast}\\|^{2}}$
	$\displaystyle\leq\frac{4DK_{2}\eta}{n}\left(2\\|\hat{\theta}_{\ast}\\|+1\right).$	(62)

This completes the proof.

Next, let us provide a technical lemma that upper bounds the norm of $\theta_{\ast}$ and $\hat{\theta}_{\ast}$ , which are the minimizers of $\hat{F}(\theta,X_{n}):=\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,x_{i})$ and $\hat{F}(\theta,\hat{X}_{n}):=\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})$ respectively.

Lemma 18

Under Assumption 2, we have

\|\theta_{\ast}\|\leq\frac{1}{\mu}\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|,

and

\|\hat{\theta}_{\ast}\|\leq\frac{1}{\mu}\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|.

Proof Since $f(\theta,x)$ is $\mu$ -strongly convex in $\theta$ for every $x\in\mathcal{X}$ , we have

	$\displaystyle\left\langle\nabla\hat{F}(0,X_{n})-\nabla\hat{F}(\theta_{\ast},X_{n}),0-\theta_{\ast}\right\rangle$	$\displaystyle=-\frac{1}{n}\sum_{i=1}^{n}\langle\nabla f(0,x_{i}),\theta_{\ast}\rangle$
		$\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\langle\nabla f(0,x_{i})-\nabla f(\theta_{\ast},x_{i}),0-\theta_{\ast}\rangle\geq\mu\\|\theta_{\ast}\\|^{2},$		(63)

which implies that

\mu\|\theta_{\ast}\|^{2}\leq\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\cdot\|\theta_{\ast}\|,

(64)

which yields that $\|\theta_{\ast}\|\leq\frac{1}{\mu}\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|$ . Similarly, one can show that $\|\hat{\theta}_{\ast}\|\leq\frac{1}{\mu}\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|$ . This completes the proof.

D.1 Proof of Theorem 8

Proof By applying Lemma 15, Lemma 16, Lemma 17 and Lemma 3, we obtain

	$\displaystyle\mathcal{W}_{1}(\nu_{k},\hat{\nu}_{k})$	$\displaystyle\leq\frac{8DK_{2}(1-(1-\frac{\eta\mu}{2})^{k})}{n\mu}\left(2\\|\hat{\theta}_{\ast}\\|+1\right)$
		$\displaystyle\qquad\cdot\max\left\{1+\left\\|\theta-\hat{\theta}_{\ast}\right\\|^{2},2-\frac{\eta}{\mu}K_{1}^{2}-\frac{56\eta}{\mu}D^{2}K_{2}^{2}+\frac{64\eta}{\mu}D^{2}K_{2}^{2}\\|\hat{\theta}_{\ast}\\|^{2}\right\},$		(65)

where $\hat{\theta}_{\ast}$ is the minimizer of $\hat{F}(\theta,\hat{X}_{n}):=\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})$ . Finally, $\|\theta-\hat{\theta}_{\ast}\|^{2}\leq 2\|\theta\|^{2}+2\|\hat{\theta}_{\ast}\|^{2}$ and by applying Lemma 18, we complete the proof.

E Proofs of Wasserstein Perturbation Bounds: Non-Convex Case

Lemma 19

Assume that Assumption 1, Assumption 3 and Assumption 4 hold. For any $\hat{\eta}\in(0,1)$ . Define $M>0$ so that $\int_{\|\theta_{1}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}\geq\sqrt{\hat{\eta}}$ and any $R>0$ so that $R>\frac{2K_{0}}{m}$ and

\inf_{\theta,\theta_{1}\in\mathbb{R}^{d}:V(\theta)\leq R,\|\theta_{1}\|\leq M}\frac{p(\theta,\theta_{1})}{p(\theta_{\ast},\theta_{1})}\geq\sqrt{\hat{\eta}},

(66)

where

K_{0}:=2m-\eta K_{1}^{2}-56\eta D^{2}K_{2}^{2}+64\eta D^{2}K_{2}^{2}\|\theta_{\ast}\|^{2}+2K+\eta\sigma^{2}.

(67)

Then, for any $n\in\mathbb{N}$ ,

\mathcal{W}_{1}\left(P^{n}(\theta,\cdot),P^{n}(\tilde{\theta},\cdot)\right)\leq\frac{1}{2\sqrt{\psi(1+\psi)}}\bar{\eta}^{n}d_{\psi}(\delta_{\theta},\delta_{\tilde{\theta}}),

(68)

for any $\theta,\tilde{\theta}$ in $\mathbb{R}^{d}$ , where for any $\eta_{0}\in(0,\hat{\eta})$ and $\gamma_{0}\in\left(1-m\eta+\frac{2\eta K{0}}{R},1\right)$ one can choose $\psi=\frac{\eta_{0}}{\eta K_{0}}$ and $\bar{\eta}=(1-(\hat{\eta}-\eta_{0}))\vee\frac{2+R\psi\gamma_{0}}{2+R\psi}$ and $d_{\psi}$ is the weighted total variation distance defined in Section B.1.

Proof Our proof relies on a computable bound on the Wasserstein contraction for the Markov chains by Hairer and Mattingly (2011) that satisfies a drift condition (Assumption 7) that relies on the construction of an appropriate Lyapunov function and a minorization condition (Assumption 8).

By applying Lemma 21, we can immediately show that the following drift condition holds. Let $V(\theta):=1+\|\theta-\theta_{\ast}\|^{2}$ , where $\theta_{\ast}$ is the minimizer of $\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,x_{i})$ . Assume that $\eta<\min\left\{\frac{1}{m},\frac{m}{K_{1}+64D^{2}K_{2}^{2}}\right\}$ and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ . Then, we have

(PV)(\theta)\leq(1-m\eta)V(\theta)+\eta K_{0},

(69)

where

K_{0}:=2m-\eta K_{1}^{2}-56\eta D^{2}K_{2}^{2}+64\eta D^{2}K_{2}^{2}\|\theta_{\ast}\|^{2}+2K+\eta\sigma^{2}.

(70)

Thus, the drift condition (Assumption 7) holds.

Next, let us show that the minorization condition (Assumption 8) also holds. Let us recall that

\displaystyle\theta_{1}=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i})+\eta\xi_{1}.

(71)

We denote $p(\theta,\theta_{1})$ the probability density function of $\theta_{1}$ with the emphasis on the dependence on the initial point $\theta$ . Then, to check that the minorization condition (Assumption 8) holds, it suffices to show that there exists some constant $\hat{\eta}\in(0,1)$

\inf_{\theta\in\mathbb{R}^{d}:V(\theta)\leq R}p(\theta,\theta_{1})\geq\hat{\eta}q(\theta_{1}),\qquad\text{for any $\theta_{1}\in\mathbb{R}^{d}$},

(72)

for some $R>\frac{2\eta K_{0}}{1-(1-m\eta)}=\frac{2K_{0}}{m}$ , where $q(\theta_{1})$ is the density function of a probability distribution function on $\mathbb{R}^{d}$ , and (72) follows from Lemma 20. Hence, by Lemma 14, we have

d_{\psi}\left(P^{n}(\theta,\cdot),P^{n}(\tilde{\theta},\cdot)\right)\leq\bar{\eta}^{n}d_{\psi}(\delta_{\theta},\delta_{\tilde{\theta}}),

Finally, by the Kantorovich-Rubinstein duality for the Wasserstein metric, we get for any two probability measures $\mu_{1},\mu_{2}$ on $\mathbb{R}^{d}$ :

$\displaystyle\mathcal{W}_{1}(\mu_{1},\mu_{2})$	$\displaystyle=\sup\left\{\int_{\mathbb{R}^{d}}\phi(\theta)(\mu_{1}-\mu_{2})(d\theta):\text{$\phi$ is $1$-Lipschitz}\right\}$
	$\displaystyle=\sup\left\{\int_{\mathbb{R}^{d}}(\phi(\theta)-\phi(\theta_{\ast}))(\mu_{1}-\mu_{2})(d\theta):\text{$\phi$ is $1$-Lipschitz}\right\}$
	$\displaystyle\leq\int_{\mathbb{R}^{d}}\\|\theta-\theta_{\ast}\\|\|\mu_{1}-\mu_{2}\|(d\theta)$
	$\displaystyle\leq\frac{1}{2\sqrt{\psi(1+\psi)}}\int_{\mathbb{R}^{d}}(1+\psi V(\theta))\|\mu_{1}-\mu_{2}\|(d\theta)$
	$\displaystyle=\frac{1}{2\sqrt{\psi(1+\psi)}}d_{\psi}(\mu_{1},\mu_{2}).$	(73)

Hence, we conclude that

\mathcal{W}_{1}\left(P^{n}(\theta,\cdot),P^{n}(\tilde{\theta},\cdot)\right)\leq\frac{1}{2\sqrt{\psi(1+\psi)}}\bar{\eta}^{n}d_{\psi}(\delta_{\theta},\delta_{\tilde{\theta}}).

This completes the proof.

The proof of Lemma 19 relies on the following technical lemma, which is a reformulation of Lemma 35 in Can et al. (2019) that helps establish the minorization condition (Assumption 8).

Lemma 20

For any $\hat{\eta}\in(0,1)$ and $M>0$ so that $\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}\geq\sqrt{\hat{\eta}}$ and any $R>0$ so that

\inf_{\theta,\theta_{1}\in\mathbb{R}^{d}:V(\theta)\leq R,\|\theta_{1}-\theta_{\ast}\|\leq M}\frac{p(\theta,\theta_{1})}{p(\theta_{\ast},\theta_{1})}\geq\sqrt{\hat{\eta}}.

(74)

Then, we have

\inf_{\theta\in\mathbb{R}^{d}:V(\theta)\leq R}p(\theta,\theta_{1})\geq\hat{\eta}q(\theta_{1}),\qquad\text{for any $\theta_{1}\in\mathbb{R}^{d}$},

(75)

where

q(\theta_{1})=p(\theta_{\ast},\theta_{1})\cdot\frac{1_{\|\theta_{1}-\theta_{\ast}\|\leq M}}{\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}}.

(76)

Proof The proof is an adaptation of the proof of Lemma 35 in Can et al. (2019). Let us take:

q(\theta_{1})=p(\theta_{\ast},\theta_{1})\cdot\frac{1_{\|\theta_{1}-\theta_{\ast}\|\leq M}}{\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}}.

(77)

Then, it is clear that $q(\theta_{1})$ is a probability density function on $\mathbb{R}^{d}$ . It follows that (72) automatically holds for $\|\theta_{1}-\theta_{\ast}\|>M$ . Thus, we only need to show that (72) holds for $\|\theta_{1}-\theta_{\ast}\|\leq M$ . Since $\xi_{1}$ has a continuous density, $p(\theta,\theta_{1})$ is continuous in both $\theta$ and $\theta_{1}$ . Fix $M$ , by continuity of $p(\theta,\theta_{1})$ in both $\theta$ and $\theta_{1}$ , there exists some $\eta^{\prime}\in(0,1)$ such that uniformly in $\|\theta_{1}-\theta_{\ast}\|\leq M$ ,

\inf_{\theta\in\mathbb{R}^{d}:V(\theta)\leq R}p(\theta,\theta_{1})\geq\eta^{\prime}p(\theta_{\ast},\theta_{1})=\hat{\eta}q(\theta_{1}),

(78)

where we can take

\hat{\eta}:=\eta^{\prime}\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}.

(79)

In particular, for any fixed $\hat{\eta}$ , we can take $M>0$ such that

\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}\geq\sqrt{\hat{\eta}},

(80)

and with fixed $\eta$ and $M$ , we take $R>0$ such that uniformly in $\|\theta_{1}-\theta_{\ast}\|\leq M$ ,

\inf_{\theta\in\mathbb{R}^{d}:V(\theta)\leq R}p(\theta,\theta_{1})\geq\sqrt{\hat{\eta}}p(\theta_{\ast},\theta_{1}).

(81)

This completes the proof.

Next, we construct a Lyapunov function $\hat{V}$ and obtain a drift condition for the SGD $(\hat{\theta}_{k})_{k=0}^{\infty}$ .

Lemma 21

Assume that Assumption 1, Assumption 3 and Assumption 4 hold. Let $\hat{V}(\theta):=1+\|\theta-\hat{\theta}_{\ast}\|^{2}$ , where $\hat{\theta}_{\ast}$ is the minimizer of $\hat{F}(\theta,\hat{X}_{n}):=\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})$ . Assume that $\eta<\min\left\{\frac{1}{m},\frac{m}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\}$ and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ . Then, we have

(\hat{P}\hat{V})(\theta)\leq(1-m\eta)\hat{V}(\theta)+2m\eta-\eta^{2}K_{1}^{2}-56\eta^{2}D^{2}K_{2}^{2}+64\eta^{2}D^{2}K_{2}^{2}\|\hat{\theta}_{\ast}\|^{2}+2\eta K+\eta^{2}\sigma^{2}.

(82)

Proof First, we recall that

\hat{\theta}_{k}=\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)+\eta\xi_{k}.

(83)

Therefore, starting with $\hat{\theta}_{0}=\theta$ , we have

	$\displaystyle\hat{\theta}_{1}$	$\displaystyle=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})+\eta\xi_{1}$
		$\displaystyle=\theta-\frac{\eta}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})+\eta\left(\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})-\frac{1}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})\right)+\eta\xi_{1}.$		(84)

Moreover, we have

\mathbb{E}\left[\frac{1}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})\Big{|}\theta\right]=\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i}).

(85)

This implies that

	$\displaystyle(\hat{P}\hat{V})(\theta)$
	$\displaystyle=\mathbb{E}\hat{V}(\hat{\theta}_{1})=1+\mathbb{E}\left\\|\hat{\theta}_{1}-\hat{\theta}_{\ast}\right\\|^{2}$
	$\displaystyle=1+\left\\|\theta-\hat{\theta}_{\ast}-\frac{\eta}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})\right\\|^{2}+\eta^{2}\mathbb{E}\left\\|\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})-\frac{1}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})\right\\|^{2}+\eta^{2}\sigma^{2}.$		(86)

We can compute that

	$\displaystyle\left\\|\theta-\hat{\theta}_{\ast}-\frac{\eta}{n}\sum_{i=1}^{n}\nabla f\left(\theta,\hat{x}_{i}\right)\right\\|^{2}$
	$\displaystyle=\left\\|\theta-\hat{\theta}_{\ast}-\frac{\eta}{n}\sum_{i=1}^{n}\left(\nabla f\left(\theta,\hat{x}_{i}\right)-\nabla f\left(\hat{\theta}_{\ast},\hat{x}_{i}\right)\right)\right\\|^{2}$
	$\displaystyle\leq(1-2m\eta)\left\\|\theta-\hat{\theta}_{\ast}\right\\|^{2}+2\eta K+\frac{\eta^{2}}{n^{2}}\left\\|\sum_{i=1}^{n}\left(\nabla f(\theta,\hat{x}_{i})-\nabla f\left(\hat{\theta}_{\ast},\hat{x}_{i}\right)\right)\right\\|^{2}$
	$\displaystyle\leq\left(1-2m\eta+\eta^{2}K_{1}^{2}\right)\left\\|\theta-\hat{\theta}_{\ast}\right\\|^{2}+2\eta K.$		(87)

Moreover, by following the same arguments as in the proof of Lemma 16, we have

\displaystyle\mathbb{E}\left\|\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})-\frac{1}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})\right\|^{2}\leq 8D^{2}K_{2}^{2}\left(8\left\|\theta-\hat{\theta}_{\ast}\right\|^{2}+8\|\hat{\theta}_{\ast}\|^{2}+1\right).

(88)

Hence, we conclude that

$\displaystyle(\hat{P}\hat{V})(\theta)$	$\displaystyle\leq\left(1-2m\eta+\eta^{2}K_{1}^{2}+64\eta^{2}D^{2}K_{2}^{2}\right)\left\\|\theta-\hat{\theta}_{\ast}\right\\|^{2}$
	$\displaystyle\qquad\qquad\qquad+1+8\eta^{2}D^{2}K_{2}^{2}\left(8\\|\hat{\theta}_{\ast}\\|^{2}+1\right)+2\eta K+\eta^{2}\sigma^{2}$
	$\displaystyle=\left(1-2m\eta+\eta^{2}K_{1}^{2}+64\eta^{2}D^{2}K_{2}^{2}\right)\hat{V}(\theta)$
	$\displaystyle\qquad\qquad\qquad+2m\eta-\eta^{2}K_{1}^{2}-56\eta^{2}D^{2}K_{2}^{2}+64\eta^{2}D^{2}K_{2}^{2}\\|\hat{\theta}_{\ast}\\|^{2}+2\eta K+\eta^{2}\sigma^{2}$
	$\displaystyle\leq(1-m\eta)\hat{V}(\theta)+2m\eta-\eta^{2}K_{1}^{2}-56\eta^{2}D^{2}K_{2}^{2}+64\eta^{2}D^{2}K_{2}^{2}\\|\hat{\theta}_{\ast}\\|^{2}+2\eta K+\eta^{2}\sigma^{2},$	(89)

provided that $\eta\leq\frac{m}{K_{1}^{2}+64D^{2}K_{2}^{2}}$ . This completes the proof.

Next, we estimate the perturbation gap based on the Lyapunov function $\hat{V}$ .

Lemma 22

Assume that Assumption 1 and Assumption 4 hold. Assume that $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ . Then, we have

	$\displaystyle\sup_{\theta\in\mathbb{R}^{d}}\frac{d_{\psi}(\delta_{\theta}P,\delta_{\theta}\hat{P})}{\hat{V}(\theta)}$
	$\displaystyle\leq\frac{2b}{n}\max\Bigg{\{}\psi\cdot(4+8\eta^{2}K_{1}^{2}),$
	$\displaystyle\qquad\qquad 1+\psi\cdot\left(1+\eta^{2}\sigma^{2}+(4+8\eta^{2}K_{1}^{2})\\|\theta_{\ast}-\hat{\theta}_{\ast}\\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|^{2}\right)\Bigg{\}},$		(90)

where $\hat{\theta}_{\ast}$ is the minimizer of $\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})$ and $\theta_{\ast}$ is the minimizer of $\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,x_{i})$ and $d_{\psi}$ is the weighted total variation distance defined in Section B.1.

Proof Let us recall that

	$\displaystyle\theta_{1}=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i})+\eta\xi_{1},$		(91)
	$\displaystyle\hat{\theta}_{1}=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,\hat{x}_{i})+\eta\xi_{1},$		(92)

which implies that

\displaystyle\mathbb{E}_{\xi_{1}}[V(\theta_{1})]=1+\mathbb{E}_{\xi_{1}}\left[\|\theta_{1}-\theta_{\ast}\|^{2}\right]=1+\eta^{2}\sigma^{2}+\left\|\theta-\theta_{\ast}-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i})\right\|^{2},

(93)

where $\mathbb{E}_{\xi_{1}}$ denotes expectations w.r.t. $\xi_{1}$ only, and we can further compute that

	$\displaystyle\left\\|\theta-\theta_{\ast}-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i})\right\\|^{2}$
	$\displaystyle\leq 2\\|\theta-\theta_{\ast}\\|^{2}+\frac{2\eta^{2}}{b^{2}}\left\\|\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i})\right\\|^{2}$
	$\displaystyle\leq 2\\|\theta-\theta_{\ast}\\|^{2}+\frac{2\eta^{2}}{b^{2}}\left(\sum_{i\in\Omega_{1}}\\|\nabla f(\theta,x_{i})\\|\right)^{2}$
	$\displaystyle\leq 2\\|\theta-\theta_{\ast}\\|^{2}+\frac{2\eta^{2}}{b^{2}}\left(\sum_{i\in\Omega_{1}}\\|\nabla f(\theta,x_{i})-\nabla f(\theta_{\ast},x_{i})\\|+\\|\nabla f(\theta_{\ast},x_{i})\\|\right)^{2}$
	$\displaystyle\leq 2\\|\theta-\theta_{\ast}\\|^{2}+\frac{2\eta^{2}}{b^{2}}\left(bK_{1}\\|\theta-\theta_{\ast}\\|+b\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|\right)^{2}$
	$\displaystyle\leq 2\\|\theta-\theta_{\ast}\\|^{2}+4\eta^{2}K_{1}^{2}\\|\theta-\theta_{\ast}\\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|^{2}.$		(94)

Therefore, we have

\displaystyle\mathbb{E}_{\xi_{1}}[V(\theta_{1})]\leq 1+\eta^{2}\sigma^{2}+2\|\theta-\theta_{\ast}\|^{2}+4\eta^{2}K_{1}^{2}\|\theta-\theta_{\ast}\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|^{2}.

(95)

Similarly, we have

\displaystyle\mathbb{E}_{\xi_{1}}[V(\hat{\theta}_{1})]\leq 1+\eta^{2}\sigma^{2}+2\|\theta-\theta_{\ast}\|^{2}+4\eta^{2}K_{1}^{2}\|\theta-\theta_{\ast}\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|^{2}.

(96)

d_{\psi}\left(\delta_{\theta}P,\delta_{\theta}\hat{P}\right)\leq\frac{2b}{n}\left(1+\psi\left(1+\eta^{2}\sigma^{2}+(2+4\eta^{2}K_{1}^{2})\|\theta-\theta_{\ast}\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|^{2}\right)\right).

(97)

Hence, we conclude that

	$\displaystyle\sup_{\theta\in\mathbb{R}^{d}}\frac{d_{\psi}(\delta_{\theta}P,\delta_{\theta}\hat{P})}{\hat{V}(\theta)}$
	$\displaystyle\leq\frac{2b}{n}\sup_{\theta\in\mathbb{R}^{d}}\frac{1+\psi\left(1+\eta^{2}\sigma^{2}+(2+4\eta^{2}K_{1}^{2})\\|\theta-\theta_{\ast}\\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|^{2}\right)}{1+\\|\theta-\hat{\theta}_{\ast}\\|^{2}}$
	$\displaystyle\leq\frac{2b}{n}\sup_{\theta\in\mathbb{R}^{d}}\Bigg{\{}\frac{1+\psi\left(1+\eta^{2}\sigma^{2}+(4+8\eta^{2}K_{1}^{2})\\|\theta-\hat{\theta}_{\ast}\\|^{2}\right)}{1+\\|\theta-\hat{\theta}_{\ast}\\|^{2}}$
	$\displaystyle\qquad\qquad\qquad\qquad+\frac{\psi\left((4+8\eta^{2}K_{1}^{2})\\|\theta_{\ast}-\hat{\theta}_{\ast}\\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|^{2}\right)}{1+\\|\theta-\hat{\theta}_{\ast}\\|^{2}}\Bigg{\}}$
	$\displaystyle\leq\frac{2b}{n}\max\Bigg{\{}\psi(4+8\eta^{2}K_{1}^{2}),$
	$\displaystyle\qquad\qquad\qquad 1+\psi\left(1+\eta^{2}\sigma^{2}+(4+8\eta^{2}K_{1}^{2})\\|\theta_{\ast}-\hat{\theta}_{\ast}\\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|^{2}\right)\Bigg{\}}.$		(98)

This completes the proof.

It is worth noting that the Wasserstein contraction bound we obtained in Lemma 19 in the non-convex case differs from the one we obtained in the strongly-convex case (Lemma 15) in the sense that the right hand side of (68) is no longer $\|\theta-\tilde{\theta}\|$ so that Lemma 3 is not directly applicable. Instead, in the following, we will provide a modification of Lemma 3, which will be used in proving Theorem 9 in this paper. The definitions of the notations used in the following lemma can be found in Section 2.2.

Lemma 23

Assume that there exist some $\rho\in[0,1)$ and $C\in(0,\infty)$ such that

\sup_{\theta,\tilde{\theta}\in\mathbb{R}^{d}:\theta\neq\tilde{\theta}}\frac{\mathcal{W}_{1}(P^{n}(\theta,\cdot),P^{n}(\tilde{\theta},\cdot))}{d_{\psi}(\delta_{\theta},\delta_{\tilde{\theta}})}\leq C\rho^{n},

(99)

(\hat{P}\hat{V})(\theta)\leq\delta\hat{V}(\theta)+L.

(100)

Then, we have

\mathcal{W}_{1}(p_{n},\hat{p}_{n})\leq C\left(\rho^{n}d_{\psi}(p_{0},\hat{p}_{0})+(1-\rho^{n})\frac{\gamma\kappa}{1-\rho}\right),

(101)

where

\gamma:=\sup_{\theta\in\mathbb{R}^{d}}\frac{d_{\psi}(\delta_{\theta}P,\delta_{\theta}\hat{P})}{\hat{V}(\theta)},\qquad\kappa:=\max\left\{\int_{\mathbb{R}^{d}}\hat{V}(\theta)d\hat{p}_{0}(\theta),\frac{L}{1-\delta}\right\}.

(102)

Proof The proof is based on the modification of the proof of Lemma 3 (Theorem 3.1 in Rudolf and Schweizer (2018)). By induction we have

\tilde{p}_{n}-p_{n}=\left(\tilde{p}_{0}-p_{0}\right)P^{n}+\sum_{i=0}^{n-1}\tilde{p}_{i}\left(\tilde{P}-P\right)P^{n-i-1},\qquad n\in\mathbb{N}.

(103)

We have

d_{\psi}\left(\tilde{p}_{i}P,\tilde{p}_{i}\tilde{P}\right)\leq\int_{\mathbb{R}^{d}}d_{\psi}\left(\delta_{\theta}P,\delta_{\theta}\tilde{P}\right)d\tilde{p}_{i}(\theta)\leq\gamma\int_{\mathbb{R}^{d}}\tilde{V}(\theta)d\tilde{p}_{i}(\theta).

(104)

Moreover, for any $i=0,1,2,\ldots$ ,

\int_{\mathbb{R}^{d}}\tilde{V}(\theta)d\tilde{p}_{i}(\theta)=\int_{\mathbb{R}^{d}}\tilde{P}^{i}\tilde{V}(\theta)d\tilde{p}_{0}(\theta)\leq\delta^{i}\tilde{p}_{0}(\tilde{V})+\frac{L(1-\delta^{i})}{1-\delta}\leq\max\left\{\tilde{p}_{0}(\tilde{V}),\frac{L}{1-\delta}\right\},

(105)

so that we obtain $d_{\psi}(\tilde{p}_{i}P,\tilde{p}_{i}\tilde{P})\leq\gamma\kappa$ . Therefore, we have

\mathcal{W}_{1}\left(\tilde{p}_{i}\tilde{P}P^{n-i-1},\tilde{p}_{i}PP^{n-i-1}\right)\leq C\rho^{n-i-1}d_{\psi}\left(\tilde{p}_{i}P,\tilde{p}_{i}\tilde{P}\right)\leq C\rho^{n-i-1}\gamma\kappa.

(106)

By the triangle inequality of the Wasserstein distance, we have

	$\displaystyle\mathcal{W}_{1}(p_{n},\tilde{p}_{n})$	$\displaystyle\leq\mathcal{W}_{1}\left(p_{0}P^{n},\tilde{p}_{0}P^{n}\right)+\sum_{i=0}^{n-1}\mathcal{W}_{1}\left(\tilde{p}_{i}\tilde{P}P^{n-i-1},\tilde{p}_{i}PP^{n-i-1}\right)$
		$\displaystyle\leq C\rho^{n}d_{\psi}(p_{0},\tilde{p}_{0})+C\sum_{i=0}^{n-1}\rho^{n-i-1}\gamma\kappa.$		(107)

This completes the proof.

Lemma 24

Under Assumption 3, we have

	$\displaystyle\\|\theta_{\ast}\\|\leq\frac{\sup_{x\in\mathcal{X}}\\|\nabla f(0,x)\\|+\sqrt{\sup_{x\in\mathcal{X}}\\|\nabla f(0,x)\\|^{2}+4mK}}{2m},$		(108)
	$\displaystyle\\|\hat{\theta}_{\ast}\\|\leq\frac{\sup_{x\in\mathcal{X}}\\|\nabla f(0,x)\\|+\sqrt{\sup_{x\in\mathcal{X}}\\|\nabla f(0,x)\\|^{2}+4mK}}{2m}.$		(109)

Proof By Assumption 3, we have

$\displaystyle\left\langle\nabla\hat{F}(0,X_{n})-\nabla\hat{F}(\theta_{\ast},X_{n}),0-\theta_{\ast}\right\rangle$	$\displaystyle=-\frac{1}{n}\sum_{i=1}^{n}\langle\nabla f(0,x_{i}),\theta_{\ast}\rangle$
	$\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\langle\nabla f(0,x_{i})-\nabla f(\theta_{\ast},x_{i}),0-\theta_{\ast}\rangle$
	$\displaystyle\geq m\\|\theta_{\ast}\\|^{2}-K,$	(110)

which implies that

m\|\theta_{\ast}\|^{2}-K\leq\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\cdot\|\theta_{\ast}\|,

(111)

which yields that

\|\theta_{\ast}\|\leq\frac{\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|+\sqrt{\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|^{2}+4mK}}{2m}.

(112)

Similarly, one can show that

\|\hat{\theta}_{\ast}\|\leq\frac{\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|+\sqrt{\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|^{2}+4mK}}{2m}.

(113)

This completes the proof.

E.1 Proof of Theorem 9

Before going to the proof, let us restate the full version of Theorem 9 that we provide below.

Theorem 25 (Complete Theorem 3.3)

Let $\theta_{0}=\hat{\theta}_{0}=\theta$ . Assume that Assumption 1, Assumption 3 and Assumption 4 hold. We also assume that $\eta<\min\left\{\frac{1}{m},\frac{m}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\}$ and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ and $\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\leq E$ for some $E<\infty$ . For any $\hat{\eta}\in(0,1)$ . Define $M>0$ so that $\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}\geq\sqrt{\hat{\eta}}$ and any $R>\frac{2K_{0}}{m}$ where $K_{0}$ is defined in (67) so that

\inf_{\theta,\theta_{1}\in\mathbb{R}^{d}:V(\theta)\leq R,\|\theta_{1}-\theta_{\ast}\|\leq M}\frac{p(\theta,\theta_{1})}{p(\theta_{\ast},\theta_{1})}\geq\sqrt{\hat{\eta}}.

(114)

Let $\nu_{k}$ and $\hat{\nu}_{k}$ denote the distributions of $\theta_{k}$ and $\hat{\theta}_{k}$ respectively. Then, we have

	$\displaystyle\mathcal{W}_{1}(\nu_{k},\hat{\nu}_{k})$
	$\displaystyle\leq\frac{1-\bar{\eta}^{k}}{2\sqrt{\psi(1+\psi)}(1-\bar{\eta})}$
	$\displaystyle\cdot\frac{2b}{n}\max\Bigg{\{}\psi(4+8\eta^{2}K_{1}^{2}),$
	$\displaystyle\qquad\qquad 1+\psi\Bigg{(}1+\eta^{2}\sigma^{2}+16(1+2\eta^{2}K_{1}^{2})\left(\frac{E+\sqrt{E^{2}+4mK}}{2m}\right)^{2}$
	$\displaystyle\qquad\qquad\qquad\qquad\qquad+4\eta^{2}\left(2E^{2}+2K_{1}^{2}\left(\frac{E+\sqrt{E^{2}+4mK}}{2m}\right)^{2}\right)\Bigg{)}\Bigg{\}}$
	$\displaystyle\qquad\cdot\max\Bigg{\{}1+2\theta^{2}+2\left(\frac{E+\sqrt{E^{2}+4mK}}{2m}\right)^{2},$
	$\displaystyle\qquad\qquad\qquad 2-\frac{\eta}{m}K_{1}^{2}-\frac{56\eta}{m}D^{2}K_{2}^{2}+\frac{64\eta}{m}D^{2}K_{2}^{2}\left(\frac{E+\sqrt{E^{2}+4mK}}{2m}\right)^{2}+\frac{2K}{m}+\frac{\eta}{m}\sigma^{2}\Bigg{\}},$		(115)

where for any $\eta_{0}\in(0,\hat{\eta})$ and $\gamma_{0}\in\left(1-m\eta+\frac{2\eta K_{0}}{R},1\right)$ one can choose $\psi=\frac{\eta_{0}}{\eta K_{0}}$ and $\bar{\eta}=(1-(\hat{\eta}-\eta_{0}))\vee\frac{2+R\psi\gamma_{0}}{2+R\psi}$ and $d_{\psi}$ is the weighted total variation distance defined in Section B.1.

Proof By applying Lemma 19, Lemma 21, Lemma 22 and Lemma 23, which is a modification of Lemma 3, we obtain

	$\displaystyle\mathcal{W}_{1}(\nu_{k},\hat{\nu}_{k})$
	$\displaystyle\leq\frac{1-\bar{\eta}^{k}}{2\sqrt{\psi(1+\psi)}(1-\bar{\eta})}$
	$\displaystyle\cdot\frac{2b}{n}\max\Bigg{\{}\psi(4+8\eta^{2}K_{1}^{2}),$
	$\displaystyle\qquad\qquad 1+\psi\left(1+\eta^{2}\sigma^{2}+(4+8\eta^{2}K_{1}^{2})\\|\theta_{\ast}-\hat{\theta}_{\ast}\\|^{2}+4\eta^{2}\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|^{2}\right)\Bigg{\}}$
	$\displaystyle\qquad\cdot\max\left\{1+\\|\theta-\hat{\theta}_{\ast}\\|^{2},2-\frac{\eta}{m}K_{1}^{2}-\frac{56\eta}{m}D^{2}K_{2}^{2}+\frac{64\eta}{m}D^{2}K_{2}^{2}\\|\hat{\theta}_{\ast}\\|^{2}+\frac{2K}{m}+\frac{\eta}{m}\sigma^{2}\right\},$		(116)

Finally, let us notice that $\|\theta_{\ast}-\hat{\theta}_{\ast}\|^{2}\leq 2\|\theta_{\ast}\|^{2}+2\|\hat{\theta}_{\ast}\|^{2}$ and $\|\theta-\hat{\theta}_{\ast}\|^{2}\leq 2\|\theta\|^{2}+2\|\hat{\theta}_{\ast}\|^{2}$ and for every $x\in\mathcal{X}$ ,

\|\nabla f(\theta_{\ast},x)\|^{2}\leq 2\|\nabla f(0,x)\|^{2}+2\|\nabla f(\theta_{\ast},x)-\nabla f(0,x)\|^{2}\leq 2\|\nabla f(0,x)\|^{2}+2K_{1}^{2}\|\theta_{\ast}\|^{2}.

(117)

By applying Lemma 24, we complete the proof.

E.2 Proof of Corollary 11

Proof Under our assumptions, the noise $\xi_{k}$ are i.i.d. Gaussian $\mathcal{N}(0,\Sigma)$ so that $\mathbb{E}[\|\xi_{1}\|^{2}]=\text{tr}(\Sigma)=\sigma^{2}$ . Moreoever, we have $\Sigma\prec I_{d}$ . Then $p(\theta_{\ast},\theta_{1})$ is the probability density function of

\theta_{\ast}-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta_{\ast},x_{i})+\eta\xi_{1}.

(118)

Therefore,

\displaystyle\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}=\mathbb{P}\left(\left\|-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta_{\ast},x_{i})+\eta\xi_{1}\right\|\leq M\right).

(119)

Notice that for any $\Omega_{1}$ ,

\left\|\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta_{\ast},x_{i})\right\|\leq\eta\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|.

(120)

Thus, we have

$\displaystyle\int_{\\|\theta_{1}-\theta_{\ast}\\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}$	$\displaystyle=1-\mathbb{P}\left(\left\\|-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta_{\ast},x_{i})+\eta\xi_{1}\right\\|>M\right)$
	$\displaystyle\geq 1-\mathbb{P}\left(\left\\|\eta\xi_{1}\right\\|>M-\eta\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|\right)$
	$\displaystyle=1-\mathbb{P}\left(\left\\|\xi_{1}\right\\|>\frac{M}{\eta}-\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|\right).$	(121)

Since $\xi\sim\mathcal{N}(0,\Sigma)$ and $\Sigma\prec I_{d}$ , for any $\gamma\leq\frac{1}{2}$ , we have

\mathbb{E}\left[e^{\gamma\|\xi_{1}\|^{2}}\right]=\frac{1}{\sqrt{\det{(I_{d}-2\gamma\Sigma)}}}.

By Chebychev’s inequality, letting $\gamma=\frac{1}{2}$ , for any $M\geq\eta\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|$ , we get

\int_{\|\theta_{1}-\theta_{\ast}\|\leq M}p(\theta_{\ast},\theta_{1})d\theta_{1}\geq 1-\frac{1}{\sqrt{\det{(I_{d}-\Sigma)}}}\exp\left(-\frac{1}{2}\left(\frac{M}{\eta}-\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|\right)^{2}\right).

Next, for any $\theta,\theta_{1}\in\mathbb{R}^{d}$ such that $\|\theta_{1}-\theta_{\ast}\|\leq M$ and $\|\theta-\theta_{\ast}\|^{2}\leq\frac{2K_{0}}{m}(1+\epsilon)-1$ , we have

\displaystyle\frac{p(\theta,\theta_{1})}{p(\theta_{\ast},\theta_{1})}=\frac{\mathbb{E}_{\Omega_{1}}[p_{\Omega_{1}}(\theta,\theta_{1})]}{\mathbb{E}_{\Omega_{1}}[p_{\Omega_{1}}(\theta_{\ast},\theta_{1})]},

(122)

where $\mathbb{E}_{\Omega_{1}}$ denotes the expectation w.r.t. $\Omega_{1}$ , and $p_{\Omega_{1}}$ denotes the probability density function conditional on $\Omega_{1}$ . For any given $\Omega_{1}$ , we can compute that

	$\displaystyle\frac{p_{\Omega_{1}}(\theta,\theta_{1})}{p_{\Omega_{1}}(\theta_{\ast},\theta_{1})}$
	$\displaystyle=\exp\left\{-\frac{1}{2\eta^{2}}(\theta_{1}-\mu(\theta))^{\top}\Sigma^{-1}(\theta_{1}-\mu(\theta))+\frac{1}{2\eta^{2}}(\theta_{1}-\mu(\theta_{\ast}))^{\top}\Sigma^{-1}(\theta_{1}-\mu(\theta_{\ast}))\right\},$		(123)

where

\mu(\theta):=\theta-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta,x_{i}),\qquad\mu(\theta_{\ast}):=\theta_{\ast}-\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta_{\ast},x_{i}).

(124)

Therefore, for any $\theta,\theta_{1}\in\mathbb{R}^{d}$ such that $\|\theta_{1}-\theta_{\ast}\|\leq M$ and $\|\theta-\theta_{\ast}\|^{2}\leq\frac{2K_{0}}{m}(1+\epsilon)-1$ , we have

	$\displaystyle\frac{p_{\Omega_{1}}(\theta,\theta_{1})}{p_{\Omega_{1}}(\theta_{\ast},\theta_{1})}$	$\displaystyle\geq\exp\left\{-\frac{1}{2\eta^{2}}\\|\mu(\theta)-\mu(\theta_{\ast})\\|\cdot\\|\Sigma^{-1}\\|\left(\\|\theta_{1}-\mu(\theta)\\|+\\|\theta_{1}-\mu(\theta_{\ast})\\|\right)\right\}$
		$\displaystyle\geq\exp\left\{-\frac{1}{2\eta^{2}}\\|\mu(\theta)-\mu(\theta_{\ast})\\|\cdot\\|\Sigma^{-1}\\|\left(\\|\mu(\theta)-\mu(\theta_{\ast})\\|+2\\|\theta_{1}-\mu(\theta_{\ast})\\|\right)\right\}.$		(125)

We can further compute that

\displaystyle\|\theta_{1}-\mu(\theta_{\ast})\|\leq\|\theta_{1}-\theta_{\ast}\|+\left\|\frac{\eta}{b}\sum_{i\in\Omega_{1}}\nabla f(\theta_{\ast},x_{i})\right\|\leq M+\eta\sup_{x\in\mathcal{X}}\|\nabla f(\theta_{\ast},x)\|,

(126)

and

$\displaystyle\\|\mu(\theta)-\mu(\theta_{\ast})\\|$	$\displaystyle\leq\\|\theta-\theta_{\ast}\\|+\frac{\eta}{b}\sum_{i\in\Omega_{1}}\left\\|\nabla f(\theta,x_{i})-\nabla f(\theta_{\ast},x_{i})\right\\|$
	$\displaystyle\leq(1+K_{1}\eta)\\|\theta-\theta_{\ast}\\|$
	$\displaystyle\leq(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}.$	(127)

Hence, we have

	$\displaystyle\frac{p_{\Omega_{1}}(\theta,\theta_{1})}{p_{\Omega_{1}}(\theta_{\ast},\theta_{1})}$	$\displaystyle\geq\exp\Bigg{\{}-\frac{(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}}{2\eta^{2}}\\|\Sigma^{-1}\\|$
		$\displaystyle\qquad\cdot\Bigg{(}(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}+2\left(M+\eta\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|\right)\Bigg{)}\Bigg{\}}.$		(128)

Since it holds for every $\Omega_{1}$ , we have

	$\displaystyle\frac{p(\theta,\theta_{1})}{p(\theta_{\ast},\theta_{1})}$	$\displaystyle\geq\exp\Bigg{\{}-\frac{(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}}{2\eta^{2}}\\|\Sigma^{-1}\\|$
		$\displaystyle\qquad\cdot\Bigg{(}(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}+2\left(M+\eta\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|\right)\Bigg{)}\Bigg{\}}.$		(129)

Hence, we conclude that

$\displaystyle\hat{\eta}$	$\displaystyle\geq\Bigg{(}\max_{M\geq\eta\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|}\Bigg{\{}\min\Bigg{\{}1-\frac{\exp\left(-\frac{1}{2}\left(\frac{M}{\eta}-\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|\right)^{2}\right)}{\sqrt{\det{(I_{d}-\Sigma)}}},$
	$\displaystyle\qquad\exp\Bigg{\{}-\frac{(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}}{2\eta^{2}}\\|\Sigma^{-1}\\|$
	$\displaystyle\qquad\quad\cdot\Bigg{(}(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}+2\left(M+\eta\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|\right)\Bigg{)}\Bigg{\}}\Bigg{\}}\Bigg{\}}\Bigg{)}^{2}.$	(130)

This completes the proof.

F Proofs of Non-Convex Case without Additive Noise

F.1 Proof of Theorem 12

Proof Let us recall that $\theta_{0}=\hat{\theta}_{0}=\theta$ and for any $k\in\mathbb{N}$ ,

	$\displaystyle\theta_{k}=\theta_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f(\theta_{k-1},x_{i}),$		(131)
	$\displaystyle\hat{\theta}_{k}=\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f(\hat{\theta}_{k-1},\hat{x}_{i}).$		(132)

Thus it follows that

\displaystyle\theta_{k}-\hat{\theta}_{k}

\displaystyle=\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right)+\frac{\eta}{b}\mathcal{E}_{k},

(133)

where

\mathcal{E}_{k}:=\sum_{i\in\Omega_{k}}\left(\nabla f(\hat{\theta}_{k-1},\hat{x}_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right).

(134)

This implies that

	$\displaystyle\left\\|\theta_{k}-\hat{\theta}_{k}\right\\|^{2}$	$\displaystyle=\left\\|\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right)\right\\|^{2}+\frac{\eta^{2}}{b^{2}}\left\\|\mathcal{E}_{k}\right\\|^{2}$
		$\displaystyle\qquad+2\left\langle\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right),\frac{\eta}{b}\mathcal{E}_{k}\right\rangle.$		(135)

By Assumption 1 and Assumption 3, we have

	$\displaystyle\left\\|\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right)\right\\|^{2}$
	$\displaystyle\leq(1-2\eta m)\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|^{2}+2\eta K+\frac{\eta^{2}}{b^{2}}\left(bK_{1}\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|\right)^{2}$
	$\displaystyle\leq(1-\eta m)\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|^{2}+2\eta K,$		(136)

provided that $\eta\leq\frac{m}{K_{1}^{2}}$ .

Since $X_{n}$ and $\hat{X}_{n}$ differ by at most one element and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ , we have $x_{i}=\hat{x}_{i}$ for any $i\in\Omega_{k}$ with probability $\frac{n-b}{n}$ and $x_{i}\neq\hat{x}_{i}$ for exactly one $i\in\Omega_{k}$ with probability $\frac{b}{n}$ and therefore

\mathbb{E}\left\|\mathcal{E}_{k}\right\|^{2}\leq\frac{b}{n}\mathbb{E}\left[\left(K_{2}2D\left(2\|\hat{\theta}_{k-1}\|+1\right)\right)^{2}\right]\leq\frac{4D^{2}K_{2}^{2}b}{n}\left(8\mathbb{E}\|\hat{\theta}_{k-1}\|^{2}+2\right),

(137)

and moreover,

	$\displaystyle\mathbb{E}\left\langle\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right),\frac{\eta}{b}\mathcal{E}_{k}\right\rangle$
	$\displaystyle\leq\frac{\eta}{b}\frac{b}{n}\mathbb{E}\left[(1+K_{1}\eta)\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|K_{2}2D\left(2\\|\hat{\theta}_{k-1}\\|+1\right)\right]$
	$\displaystyle\leq\frac{2K_{2}D\eta}{n}(1+K_{1}\eta)\mathbb{E}\left[\left(\\|\theta_{k-1}\\|+\\|\hat{\theta}_{k-1}\\|\right)\left(2\\|\hat{\theta}_{k-1}\\|+1\right)\right]$
	$\displaystyle\leq\frac{2K_{2}D\eta}{n}(1+K_{1}\eta)\left(1+\frac{3}{2}\mathbb{E}\\|\theta_{k-1}\\|^{2}+\frac{7}{2}\mathbb{E}\\|\hat{\theta}_{k-1}\\|^{2}\right),$		(138)

where we used the inequality that

(a+b)(2b+1)=2b^{2}+2ab+a+b\leq 1+\frac{3}{2}a^{2}+\frac{7}{2}b^{2},

(139)

for any $a,b\in\mathbb{R}$ . Therefore, we have

	$\displaystyle\mathbb{E}\left\\|\theta_{k}-\hat{\theta}_{k}\right\\|^{2}$	$\displaystyle\leq(1-\eta m)\mathbb{E}\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|^{2}+2\eta K+\frac{4D^{2}K_{2}^{2}\eta^{2}}{bn}\left(8\mathbb{E}\\|\hat{\theta}_{k-1}\\|^{2}+2\right)$
		$\displaystyle\qquad+\frac{4K_{2}D\eta}{n}(1+K_{1}\eta)\left(1+\frac{3}{2}\mathbb{E}\\|\theta_{k-1}\\|^{2}+\frac{7}{2}\mathbb{E}\\|\hat{\theta}_{k-1}\\|^{2}\right).$		(140)

In Lemma 21, we showed that under the assumption $\eta<\min\left\{\frac{1}{m},\frac{m}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\}$ and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ , we have that for every $k\in\mathbb{N}$ ,

\displaystyle\mathbb{E}\hat{V}(\hat{\theta}_{k})\leq(1-\eta m)\mathbb{E}\hat{V}(\hat{\theta}_{k-1})+2\eta m-\eta^{2}K_{1}^{2}-56\eta^{2}D^{2}K_{2}^{2}+64\eta^{2}D^{2}K_{2}^{2}\|\hat{\theta}_{\ast}\|^{2}+2\eta K,

(141)

where $\hat{V}(\theta):=1+\|\theta-\hat{\theta}_{\ast}\|^{2}$ , where $\hat{\theta}_{\ast}$ is the minimizer of $\hat{F}(\theta,\hat{X}_{n}):=\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})$ . This implies that

	$\displaystyle\mathbb{E}\left[\hat{V}\left(\hat{\theta}_{k}\right)\right]$
	$\displaystyle\leq(1-\eta m)^{k}\mathbb{E}\left[\hat{V}\left(\hat{\theta}_{0}\right)\right]+\frac{2\eta m-\eta^{2}K_{1}^{2}-56\eta^{2}D^{2}K_{2}^{2}+64\eta^{2}D^{2}K_{2}^{2}\\|\hat{\theta}_{\ast}\\|^{2}+2\eta K}{1-(1-\eta m)}$
	$\displaystyle\leq 1+\\|\theta-\hat{\theta}_{\ast}\\|^{2}+2-\frac{\eta}{m}K_{1}^{2}-\frac{56\eta}{m}D^{2}K_{2}^{2}+\frac{64\eta}{m}D^{2}K_{2}^{2}\\|\hat{\theta}_{\ast}\\|^{2}+\frac{2K}{m},$		(142)

so that

$\displaystyle\mathbb{E}\\|\hat{\theta}_{k}\\|^{2}$	$\displaystyle\leq 2\mathbb{E}\left\\|\hat{\theta}_{k}-\hat{\theta}_{\ast}\right\\|^{2}+2\\|\hat{\theta}_{\ast}\\|^{2}$
	$\displaystyle\leq 2\left\\|\theta-\hat{\theta}_{\ast}\right\\|^{2}+4-\frac{2\eta}{m}K_{1}^{2}-\frac{112\eta}{m}D^{2}K_{2}^{2}+\frac{128\eta}{m}D^{2}K_{2}^{2}\\|\hat{\theta}_{\ast}\\|^{2}+\frac{4K}{m}+2\\|\hat{\theta}_{\ast}\\|^{2}$
	$\displaystyle\leq 4\\|\theta\\|^{2}+4\left(\frac{E+\sqrt{E^{2}+4mK}}{2m}\right)^{2}+4-\frac{2\eta}{m}K_{1}^{2}-\frac{112\eta}{m}D^{2}K_{2}^{2}$
	$\displaystyle\qquad+\frac{128\eta}{m}D^{2}K_{2}^{2}\left(\frac{E+\sqrt{E^{2}+4mK}}{2m}\right)^{2}+\frac{4K}{m}+2\left(\frac{E+\sqrt{E^{2}+4mK}}{2m}\right)^{2}=:B,$	(143)

where we applied Lemma 24. Similarly, we can show that

\displaystyle\mathbb{E}\|\theta_{k}\|^{2}\leq B.

(144)

Since $\theta_{0}=\hat{\theta}_{0}=\theta$ , it follows from (140), (143) and (144) that

	$\displaystyle\mathcal{W}_{2}^{2}(\nu_{k},\hat{\nu}_{k})$
	$\displaystyle\leq\mathbb{E}\left\\|\theta_{k}-\hat{\theta}_{k}\right\\|^{2}$
	$\displaystyle\leq\left(1-(1-\eta m)^{k}\right)\left(\frac{4D^{2}K_{2}^{2}\eta}{bnm}(8B+2)+\frac{4K_{2}D}{nm}(1+K_{1}\eta)\left(1+\frac{3}{2}B+\frac{7}{2}B\right)+\frac{2K}{m}\right),$		(145)

provided that $\eta<\min\left\{\frac{m}{K_{1}^{2}},\frac{1}{m},\frac{m}{K_{1}^{2}+64D^{2}K_{2}^{2}}\right\}$ . This completes the proof.

G Proofs of Convex Case with Additional Geometric Structure

In the following technical lemma, we show that the $p$ -th moment of $\theta_{k}$ and $\hat{\theta}_{k}$ can be bounded in the following sense.

Lemma 26

Let $\theta_{0}=\hat{\theta}_{0}=\theta$ . Suppose Assumption 5 and Assumption 6 hold and $\eta\leq\frac{\mu}{K_{1}^{2}+2^{p+4}D^{2}K_{2}^{2}}$ . Then, we have

	$\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\\|\theta_{i-1}-\theta_{\ast}\right\\|^{p}\leq\frac{\left\\|\theta-\theta_{\ast}\right\\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\\|\theta_{\ast}\\|^{p}+5\right),$		(146)
	$\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\\|\hat{\theta}_{i-1}-\hat{\theta}_{\ast}\right\\|^{p}\leq\frac{\left\\|\theta-\hat{\theta}_{\ast}\right\\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\\|\hat{\theta}_{\ast}\\|^{p}+5\right),$		(147)

where $\hat{\theta}_{\ast}$ is the minimizer of $\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,\hat{x}_{i})$ and $\theta_{\ast}$ is the minimizer of $\frac{1}{n}\sum_{i=1}^{n}\nabla f(\theta,x_{i})$ .

Proof First, we recall that

\hat{\theta}_{k}=\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right).

(148)

Therefore, we have

\displaystyle\hat{\theta}_{k}=\hat{\theta}_{k-1}-\frac{\eta}{n}\sum_{i=1}^{n}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)+\eta\left(\frac{1}{n}\sum_{i=1}^{n}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)-\frac{1}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)\right).

(149)

Moreover, we have

\mathbb{E}\left[\frac{1}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)\Big{|}\hat{\theta}_{k-1}\right]=\frac{1}{n}\sum_{i=1}^{n}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right).

(150)

This implies that

	$\displaystyle\mathbb{E}\left\\|\hat{\theta}_{k}-\hat{\theta}_{\ast}\right\\|^{2}$	$\displaystyle=\mathbb{E}\left\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}+\frac{\eta}{n}\sum_{i=1}^{n}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)\right\\|^{2}$
		$\displaystyle\qquad+\eta^{2}\mathbb{E}\left\\|\frac{1}{n}\sum_{i=1}^{n}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)-\frac{1}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)\right\\|^{2}.$		(151)

We can compute that

	$\displaystyle\left\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}+\frac{\eta}{n}\sum_{i=1}^{n}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)\right\\|^{2}$
	$\displaystyle=\left\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}+\frac{\eta}{n}\sum_{i=1}^{n}\left(\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)-\nabla f\left(\hat{\theta}_{\ast},\hat{x}_{i}\right)\right)\right\\|^{2}$
	$\displaystyle\leq\left\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\\|^{2}-2\eta\mu\left\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\\|^{p}+\frac{\eta^{2}}{n^{2}}\left\\|\sum_{i=1}^{n}\left(\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)-\nabla f\left(\hat{\theta}_{\ast},\hat{x}_{i}\right)\right)\right\\|^{2}$
	$\displaystyle\leq\left\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\\|^{2}-2\eta\mu\left\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\\|^{p}+\eta^{2}K_{1}^{2}\left\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\\|^{p}.$		(152)

Moreover, we can compute that

	$\displaystyle\mathbb{E}\left\\|\frac{1}{n}\sum_{i=1}^{n}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)-\frac{1}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)\right\\|^{2}$
	$\displaystyle=\mathbb{E}\left\\|\frac{1}{b}\sum_{i\in\Omega_{k}}\left(\frac{1}{n}\sum_{j=1}^{n}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{j}\right)-\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)\right)\right\\|^{2}$
	$\displaystyle=\mathbb{E}\left(\frac{1}{b}\sum_{i\in\Omega_{k}}K_{2}\frac{1}{n}\sum_{j=1}^{n}\\|\hat{x}_{i}-\hat{x}_{j}\\|(2\\|\hat{\theta}_{k-1}\\|^{p-1}+1)\right)^{2}$
	$\displaystyle\leq 4D^{2}K_{2}^{2}\mathbb{E}\left[(2\\|\hat{\theta}_{k-1}\\|^{p-1}+1)^{2}\right]$
	$\displaystyle\leq 8D^{2}K_{2}^{2}\left(4\mathbb{E}\left[\\|\hat{\theta}_{k-1}\\|^{2(p-1)}\right]+1\right)$
	$\displaystyle\leq 8D^{2}K_{2}^{2}\left(4\mathbb{E}\left[\\|\hat{\theta}_{k-1}\\|^{p}\right]+5\right)$
	$\displaystyle\leq 8D^{2}K_{2}^{2}\left(2^{p+1}\mathbb{E}\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\\|^{p}+2^{p+1}\\|\hat{\theta}_{\ast}\\|^{p}+5\right).$		(153)

Hence, by applying (152) and (153) to (151), we conclude that

$\displaystyle\mathbb{E}\left\\|\hat{\theta}_{k}-\hat{\theta}_{\ast}\right\\|^{2}$	$\displaystyle\leq\mathbb{E}\left\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\\|^{2}-2\eta\mu\mathbb{E}\left\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\\|^{p}+\eta^{2}K_{1}^{2}\mathbb{E}\left\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\\|^{p}$
	$\displaystyle\qquad\qquad+8\eta^{2}D^{2}K_{2}^{2}\left(2^{p+1}\mathbb{E}\left\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\\|^{2}+2^{p+1}\\|\hat{\theta}_{\ast}\\|^{2}+5\right)$
	$\displaystyle\leq\mathbb{E}\left\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\\|^{2}-\eta\mu\mathbb{E}\left\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\\|^{p}+8\eta^{2}D^{2}K_{2}^{2}\left(2^{p+1}\\|\hat{\theta}_{\ast}\\|^{p}+5\right),$	(154)

provided that $\eta\leq\frac{\mu}{K_{1}^{2}+2^{p+4}D^{2}K_{2}^{2}}$ . This implies that

\mathbb{E}\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\|^{p}\leq\frac{\mathbb{E}\left\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\|^{2}-\mathbb{E}\left\|\hat{\theta}_{k}-\hat{\theta}_{\ast}\right\|^{2}}{\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\|\hat{\theta}_{\ast}\|^{p}+5\right),

(155)

and hence

	$\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\\|\hat{\theta}_{i-1}-\hat{\theta}_{\ast}\right\\|^{p}$	$\displaystyle\leq\frac{\left\\|\hat{\theta}_{0}-\hat{\theta}_{\ast}\right\\|^{2}-\mathbb{E}\left\\|\hat{\theta}_{k}-\hat{\theta}_{\ast}\right\\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\\|\hat{\theta}_{\ast}\\|^{p}+5\right)$
		$\displaystyle\leq\frac{\left\\|\theta-\hat{\theta}_{\ast}\right\\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\\|\hat{\theta}_{\ast}\\|^{p}+5\right).$		(156)

Similarly, we can show that

\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\|\theta_{i-1}-\theta_{\ast}\right\|^{p}\leq\frac{\left\|\theta-\theta_{\ast}\right\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\|\theta_{\ast}\|^{p}+5\right).

(157)

This completes the proof.

Lemma 27

Under Assumption 5, we have

\|\theta_{\ast}\|\leq\frac{1}{\mu^{\frac{1}{p-1}}}\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|^{\frac{1}{p-1}},

and

\|\hat{\theta}_{\ast}\|\leq\frac{1}{\mu^{\frac{1}{p-1}}}\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|^{\frac{1}{p-1}}.

Proof Under Assumption 5, we have

	$\displaystyle\left\langle\nabla\hat{F}(0,X_{n})-\nabla\hat{F}(\theta_{\ast},X_{n}),0-\theta_{\ast}\right\rangle$	$\displaystyle=-\frac{1}{n}\sum_{i=1}^{n}\langle\nabla f(0,x_{i}),\theta_{\ast}\rangle$
		$\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\langle\nabla f(0,x_{i})-\nabla f(\theta_{\ast},x_{i}),0-\theta_{\ast}\rangle\geq\mu\\|\theta_{\ast}\\|^{p},$		(158)

where $p\in(1,2)$ , which implies that

\mu\|\theta_{\ast}\|^{p}\leq\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\cdot\|\theta_{\ast}\|,

(159)

which yields that

\|\theta_{\ast}\|\leq\frac{1}{\mu^{\frac{1}{p-1}}}\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|^{\frac{1}{p-1}}.

Similarly, one can show that

\|\hat{\theta}_{\ast}\|\leq\frac{1}{\mu^{\frac{1}{p-1}}}\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|^{\frac{1}{p-1}}.

This completes the proof.

Now, we are able to state the main result for the Wasserstein algorithmic stability.

Theorem 28

Let $\theta_{0}=\hat{\theta}_{0}=\theta$ . Suppose Assumption 5 and Assumption 6 hold and $\eta\leq\frac{\mu}{K_{1}^{2}+2^{p+4}D^{2}K_{2}^{2}}$ and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ and $\sup_{x\in\mathcal{X}}\|\nabla f(0,x)\|\leq E$ for some $E<\infty$ . Let $\nu_{k}$ and $\hat{\nu}_{k}$ denote the distributions of $\theta_{k}$ and $\hat{\theta}_{k}$ respectively. Then, we have

	$\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathcal{W}_{p}^{p}(\nu_{i-1},\hat{\nu}_{i-1})$
	$\displaystyle\leq\frac{4D^{2}K_{2}^{2}\eta}{bn\mu}\cdot 2^{p+2}\cdot\left(\frac{2\theta^{2}+2(E/\mu)^{\frac{2}{p-1}}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}(E/\mu)^{\frac{p}{p-1}}+5\right)\right)$
	$\displaystyle\qquad\qquad\qquad+\frac{4D^{2}K_{2}^{2}\eta}{bn\mu}\cdot 2^{p+2}\left(2^{p+2}(E/\mu)^{\frac{p}{p-1}}+10\right)$
	$\displaystyle\quad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\cdot 3\cdot 2^{p-1}\left(\frac{2\theta^{2}+2(E/\mu)^{\frac{2}{p-1}}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}(E/\mu)^{\frac{p}{p-1}}+5\right)\right)$
	$\displaystyle\qquad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\cdot 7\cdot 2^{p-1}\left(\frac{2\theta^{2}+2(E/\mu)^{\frac{2}{p-1}}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}(E/\mu)^{\frac{p}{p-1}}+5\right)\right)$
	$\displaystyle\qquad\quad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\left(10\cdot 2^{p-1}(E/\mu)^{\frac{p}{p-1}}+5\right).$		(160)

Proof Let us recall that $\theta_{0}=\hat{\theta}_{0}=\theta$ and for any $k\in\mathbb{N}$ ,

	$\displaystyle\theta_{k}=\theta_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f(\theta_{k-1},x_{i}),$		(161)
	$\displaystyle\hat{\theta}_{k}=\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right).$		(162)

Thus it follows that

\displaystyle\theta_{k}-\hat{\theta}_{k}

\displaystyle=\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right)+\frac{\eta}{b}\mathcal{E}_{k},

(163)

where

\mathcal{E}_{k}:=\sum_{i\in\Omega_{k}}\left(\nabla f\left(\hat{\theta}_{k-1},\hat{x}_{i}\right)-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right).

(164)

This implies that

	$\displaystyle\left\\|\theta_{k}-\hat{\theta}_{k}\right\\|^{2}$	$\displaystyle=\left\\|\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right)\right\\|^{2}+\frac{\eta^{2}}{b^{2}}\left\\|\mathcal{E}_{k}\right\\|^{2}$
		$\displaystyle\qquad+2\left\langle\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right),\frac{\eta}{b}\mathcal{E}_{k}\right\rangle.$		(165)

By Assumption 6 and Assumption 5, we have

	$\displaystyle\left\\|\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right)\right\\|^{2}$
	$\displaystyle\leq\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|^{2}-2\eta\mu\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|^{p}+\frac{\eta^{2}}{b^{2}}\left(bK_{1}\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|^{\frac{p}{2}}\right)^{2}$
	$\displaystyle=\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|^{2}-2\eta\mu\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|^{p}+\eta^{2}K_{1}^{2}\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|^{p}$
	$\displaystyle\leq\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|^{2}-\eta\mu\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|^{p},$		(166)

provided that $\eta\leq\frac{\mu}{K_{1}^{2}}$ .

Since $X_{n}$ and $\hat{X}_{n}$ differ by at most one element and $\sup_{x\in\mathcal{X}}\|x\|\leq D$ for some $D<\infty$ , we have $x_{i}=\hat{x}_{i}$ for any $i\in\Omega_{k}$ with probability $\frac{n-b}{n}$ and $x_{i}\neq\hat{x}_{i}$ for exactly one $i\in\Omega_{k}$ with probability $\frac{b}{n}$ and therefore

$\displaystyle\mathbb{E}\left\\|\mathcal{E}_{k}\right\\|^{2}$	$\displaystyle\leq\frac{b}{n}\mathbb{E}\left[\left(K_{2}2D(2\\|\hat{\theta}_{k-1}\\|^{p-1}+1)\right)^{2}\right]$
	$\displaystyle\leq\frac{4D^{2}K_{2}^{2}b}{n}\left(8\mathbb{E}\left[\\|\hat{\theta}_{k-1}\\|^{2(p-1)}\right]+2\right)$
	$\displaystyle\leq\frac{4D^{2}K_{2}^{2}b}{n}\left(8\mathbb{E}\left[\\|\hat{\theta}_{k-1}\\|^{p}\right]+10\right)$
	$\displaystyle\leq\frac{4D^{2}K_{2}^{2}b}{n}\left(2^{p+2}\mathbb{E}\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\\|^{p}+2^{p+2}\\|\hat{\theta}_{\ast}\\|^{p}+10\right),$	(167)

and moreover,

	$\displaystyle\mathbb{E}\left\langle\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right),\frac{\eta}{b}\mathcal{E}_{k}\right\rangle$
	$\displaystyle\leq\frac{\eta}{b}\frac{b}{n}\mathbb{E}\left[\left(\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|+K_{1}\eta\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|^{\frac{p}{2}}\right)K_{2}2D\left(2\\|\hat{\theta}_{k-1}\\|^{p-1}+1\right)\right]$
	$\displaystyle\leq\frac{2DK_{2}\eta}{n}\mathbb{E}\left[\left((1+K_{1}\eta)\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|+K_{1}\eta\right)\left(2\\|\hat{\theta}_{k-1}\\|^{p-1}+1\right)\right]$
	$\displaystyle\leq\frac{2DK_{2}\eta}{n}(1+K_{1}\eta)\mathbb{E}\left[\left(\\|\theta_{k-1}\\|+\\|\hat{\theta}_{k-1}\\|+1\right)\left(2\\|\hat{\theta}_{k-1}\\|^{p-1}+1\right)\right].$		(168)

Notice that for any $x,y\geq 0$ and $p\in(1,2)$ , we have $xy^{p-1}\leq x^{p}+y^{p}$ , $y\leq y^{p}+1$ , $x\leq x^{p}+1$ and $y^{p-1}\leq y^{p}+1$ , which implies that

(x+y+1)(2y^{p-1}+1)=2xy^{p-1}+2y^{p}+2y^{p-1}+x+y+1\leq 3x^{p}+7y^{p}+5.

(169)

Therefore, by applying (169) to (168), we have

	$\displaystyle\mathbb{E}\left\langle\theta_{k-1}-\hat{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\hat{\theta}_{k-1},x_{i}\right)\right),\frac{\eta}{b}\mathcal{E}_{k}\right\rangle$
	$\displaystyle\leq\frac{2DK_{2}\eta}{n}(1+K_{1}\eta)\left(3\mathbb{E}\\|\theta_{k-1}\\|^{p}+7\mathbb{E}\\|\hat{\theta}_{k-1}\\|^{p}+5\right)$
	$\displaystyle\leq\frac{2DK_{2}\eta}{n}(1+K_{1}\eta)\Big{(}3\cdot 2^{p-1}\mathbb{E}\\|\theta_{k-1}-\theta_{\ast}\\|^{p}+3\cdot 2^{p-1}\\|\theta_{\ast}\\|^{p}$
	$\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad+7\cdot 2^{p-1}\mathbb{E}\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\\|^{p}+7\cdot 2^{p-1}\\|\hat{\theta}_{\ast}\\|^{p}+5\Big{)}.$		(170)

Hence, by applying (166), (167), (170) into (165), we conclude that

$\displaystyle\mathbb{E}\left\\|\theta_{k}-\hat{\theta}_{k}\right\\|^{2}$	$\displaystyle\leq\mathbb{E}\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|^{2}-\eta\mu\mathbb{E}\left\\|\theta_{k-1}-\hat{\theta}_{k-1}\right\\|^{p}$
	$\displaystyle\qquad+\frac{4D^{2}K_{2}^{2}\eta^{2}}{bn}\left(2^{p+2}\mathbb{E}\left\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\right\\|^{p}+2^{p+2}\\|\hat{\theta}_{\ast}\\|^{p}+10\right)$
	$\displaystyle\qquad\qquad+\frac{4DK_{2}\eta}{n}(1+K_{1}\eta)\Big{(}3\cdot 2^{p-1}\mathbb{E}\\|\theta_{k-1}-\theta_{\ast}\\|^{p}+3\cdot 2^{p-1}\\|\theta_{\ast}\\|^{p}$
	$\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad+7\cdot 2^{p-1}\mathbb{E}\\|\hat{\theta}_{k-1}-\hat{\theta}_{\ast}\\|^{p}+7\cdot 2^{p-1}\\|\hat{\theta}_{\ast}\\|^{p}+5\Big{)},$	(171)

provided that $\eta\leq\frac{\mu}{K_{1}^{2}}$ . This, together with $\theta_{0}=\hat{\theta}_{0}=\theta$ , implies that

	$\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\\|\theta_{i-1}-\hat{\theta}_{i-1}\right\\|^{p}$
	$\displaystyle\leq\frac{4D^{2}K_{2}^{2}\eta}{bn\mu}\left(2^{p+2}\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\\|\hat{\theta}_{i-1}-\hat{\theta}_{\ast}\right\\|^{p}+2^{p+2}\\|\hat{\theta}_{\ast}\\|^{p}+10\right)$
	$\displaystyle\qquad\qquad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\Big{(}3\cdot 2^{p-1}\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\\|\theta_{i-1}-\theta_{\ast}\\|^{p}+3\cdot 2^{p-1}\\|\theta_{\ast}\\|^{p}$
	$\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad+7\cdot 2^{p-1}\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\\|\hat{\theta}_{i-1}-\hat{\theta}_{\ast}\\|^{p}+7\cdot 2^{p-1}\\|\hat{\theta}_{\ast}\\|^{p}+5\Big{)}.$		(172)

In Lemma 26, we showed that under the assumption $\eta\leq\frac{\mu}{K_{1}^{2}+2^{p+4}D^{2}K_{2}^{2}}$ , we have

	$\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\\|\theta_{i-1}-\theta_{\ast}\right\\|^{p}\leq\frac{\left\\|\theta-\theta_{\ast}\right\\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\\|\theta_{\ast}\\|^{p}+5\right),$		(173)
	$\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\\|\hat{\theta}_{i-1}-\hat{\theta}_{\ast}\right\\|^{p}\leq\frac{\left\\|\theta-\hat{\theta}_{\ast}\right\\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\\|\hat{\theta}_{\ast}\\|^{p}+5\right).$		(174)

Hence, by plugging (173) and (174) into (172), we conclude that

	$\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\\|\theta_{i-1}-\hat{\theta}_{i-1}\right\\|^{p}$
	$\displaystyle\leq\frac{4D^{2}K_{2}^{2}\eta}{bn\mu}\left(2^{p+2}\left(\frac{\left\\|\theta-\hat{\theta}_{\ast}\right\\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\\|\hat{\theta}_{\ast}\\|^{p}+5\right)\right)+2^{p+2}\\|\hat{\theta}_{\ast}\\|^{p}+10\right)$
	$\displaystyle\qquad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\cdot 3\cdot 2^{p-1}\left(\frac{\left\\|\theta-\theta_{\ast}\right\\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\\|\theta_{\ast}\\|^{p}+5\right)\right)$
	$\displaystyle\qquad\qquad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\cdot 7\cdot 2^{p-1}\left(\frac{\left\\|\theta-\hat{\theta}_{\ast}\right\\|^{2}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}\\|\hat{\theta}_{\ast}\\|^{p}+5\right)\right)$
	$\displaystyle\qquad\qquad\qquad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\left(3\cdot 2^{p-1}\\|\theta_{\ast}\\|^{p}+7\cdot 2^{p-1}\\|\hat{\theta}_{\ast}\\|^{p}+5\right).$		(175)

Moreover, $\|\theta-\hat{\theta}_{\ast}\|^{2}\leq 2\|\theta\|^{2}+2\|\hat{\theta}_{\ast}\|^{2}$ , $\|\theta-\theta_{\ast}\|^{2}\leq 2\|\theta\|^{2}+2\|\theta_{\ast}\|^{2}$ and by applying Lemma 27, we obtain

	$\displaystyle\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\\|\theta_{i-1}-\hat{\theta}_{i-1}\right\\|^{p}$
	$\displaystyle\leq\frac{4D^{2}K_{2}^{2}\eta}{bn\mu}\Bigg{(}2^{p+2}\left(\frac{2\theta^{2}+2(E/\mu)^{\frac{2}{p-1}}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}(E/\mu)^{\frac{p}{p-1}}+5\right)\right)$
	$\displaystyle\qquad\qquad\qquad\qquad\qquad\qquad+2^{p+2}(E/\mu)^{\frac{p}{p-1}}+10\Bigg{)}$
	$\displaystyle\qquad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\cdot 3\cdot 2^{p-1}\left(\frac{2\theta^{2}+2(E/\mu)^{\frac{2}{p-1}}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}(E/\mu)^{\frac{p}{p-1}}+5\right)\right)$
	$\displaystyle\qquad\quad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\cdot 7\cdot 2^{p-1}\left(\frac{2\theta^{2}+2(E/\mu)^{\frac{2}{p-1}}}{k\eta\mu}+\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}(E/\mu)^{\frac{p}{p-1}}+5\right)\right)$
	$\displaystyle\qquad\qquad\quad+\frac{4DK_{2}}{n\mu}(1+K_{1}\eta)\left(10\cdot 2^{p-1}(E/\mu)^{\frac{p}{p-1}}+5\right).$		(176)

Finally, by the definition of $p$ -Wasserstein distance, we have

\frac{1}{k}\sum_{i=1}^{k}\mathcal{W}_{p}^{p}(\nu_{i-1},\hat{\nu}_{i-1})\leq\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\|\theta_{i-1}-\theta_{\ast}\right\|^{p}.

(177)

This completes the proof.

G.1 Proof of Theorem 13

Proof First, let us show that the sequences $\theta_{k}$ and $\hat{\theta}_{k}$ are ergodic.

First, let us show that the limit $\theta_{\infty}$ , if exists, is unique. Consider two sequences $\theta_{k}^{(1)}$ and $\theta_{k}^{(2)}$ starting at $\theta_{0}^{(1)}$ and $\theta_{0}^{(2)}$ respectively with two limits $\theta_{\infty}^{(1)}$ and $\theta_{\infty}^{(2)}$ . For any $k\in\mathbb{N}$ ,

	$\displaystyle\theta_{k}^{(1)}=\theta_{k-1}^{(1)}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\theta_{k-1}^{(1)},x_{i}\right),$		(178)
	$\displaystyle\theta_{k}^{(2)}=\theta_{k-1}^{(2)}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\nabla f\left(\theta_{k-1}^{(2)},x_{i}\right).$		(179)

By Assumption 6 and Assumption 5, we have

$\displaystyle\left\\|\theta_{k}^{(2)}-\theta_{k}^{(1)}\right\\|^{2}$	$\displaystyle=\left\\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f\left(\theta_{k-1}^{(2)},x_{i}\right)-\nabla f\left(\theta_{k-1}^{(1)},x_{i}\right)\right)\right\\|^{2}$
	$\displaystyle\leq\left\\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\\|^{2}-2\eta\mu\left\\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\\|^{p}+\frac{\eta^{2}}{b^{2}}\left(bK_{1}\left\\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\\|^{\frac{p}{2}}\right)^{2}$
	$\displaystyle=\left\\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\\|^{2}-2\eta\mu\left\\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\\|^{p}+\eta^{2}K_{1}^{2}\left\\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\\|^{p}$
	$\displaystyle\leq\left\\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\\|^{2}-\eta\mu\left\\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\\|^{p},$	(180)

provided that $\eta\leq\frac{\mu}{K_{1}^{2}}$ . This implies that

\left\|\theta_{k}^{(2)}-\theta_{k}^{(1)}\right\|^{2}\leq\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{2}-\eta\mu\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{p},

(181)

provided that $\eta\leq\frac{\mu}{K_{1}^{2}}$ . Thus, $\left\|\theta_{k}^{(2)}-\theta_{k}^{(1)}\right\|^{2}\leq\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{2}$ for any $k\in\mathbb{N}$ . Suppose $\left\|\theta_{j-1}^{(2)}-\theta_{j-1}^{(1)}\right\|^{2}\geq 1$ for every $j=1,2,\ldots,k$ , then we have

\left\|\theta_{k}^{(2)}-\theta_{k}^{(1)}\right\|^{2}\leq\left\|\theta_{0}^{(2)}-\theta_{0}^{(1)}\right\|^{2}-k\eta\mu,

(182)

such that

\left\|\theta_{k}^{(2)}-\theta_{k}^{(1)}\right\|^{2}\leq 1,\qquad\text{for any $k\geq k_{0}:=\frac{\left\|\theta_{0}^{(2)}-\theta_{0}^{(1)}\right\|^{2}-1}{\eta\mu}$}.

(183)

Since $p\in(1,2)$ ,

\left\|\theta_{k}^{(2)}-\theta_{k}^{(1)}\right\|^{2}\leq(1-\eta\mu)\left\|\theta_{k-1}^{(2)}-\theta_{k-1}^{(1)}\right\|^{2},

(184)

for any $k\geq k_{0}+1$ , which implies that $\theta_{k}^{(2)}-\theta_{k}^{(1)}\rightarrow 0$ as $k\rightarrow\infty$ so that $\theta_{\infty}^{(2)}=\theta_{\infty}^{(1)}$ .

Next, let us show that for any sequence $\theta_{k}$ , it converges to a limit. It follows from (184) that

\left\|\theta_{k}^{(2)}-\theta_{k}^{(1)}\right\|^{2}\leq(1-\eta\mu)^{k-\lceil k_{0}\rceil}\left\|\theta_{\lceil k_{0}\rceil}^{(2)}-\theta_{\lceil k_{0}\rceil}^{(1)}\right\|^{2},

(185)

for any $k\geq k_{0}+1$ . Let $\theta_{0}^{(1)}$ be a fixed initial value in $\mathbb{R}^{d}$ , and let $\theta_{0}^{(2)}=\theta_{1}^{(1)}$ which is random yet takes only finitely many values given $\theta_{0}^{(1)}$ so that $k_{0}$ is bounded. Therefore, it follows from (185) that

\mathcal{W}_{2}^{2}(\nu_{k+1},\nu_{k})\leq(1-\eta\mu)^{k}\mathbb{E}\left[(1-\eta\mu)^{-\lceil k_{0}\rceil}\left\|\theta_{\lceil k_{0}\rceil}^{(2)}-\theta_{\lceil k_{0}\rceil}^{(1)}\right\|^{2}\right],

(186)

where $\nu_{k}$ denotes the distribution of $\theta_{k}$ , which implies that

\sum_{k=1}^{\infty}\mathcal{W}_{2}^{2}(\nu_{k+1},\nu_{k})<\infty.

(187)

Thus, $(\nu_{k})$ is a Cauchy sequence in $\mathcal{P}_{2}(\mathbb{R}^{d})$ equipped with metric $\mathcal{W}_{2}$ and hence there exists some $\nu_{\infty}$ such that $\mathcal{W}_{2}(\nu_{k},\nu_{\infty})\rightarrow 0$ as $k\rightarrow\infty$ .

Hence, we showed that the sequence $\theta_{k}$ is ergodic. Similarly, we can show that the sequence $\hat{\theta}_{k}$ is ergodic.

Finally, by ergodic theorem and Fatou’s lemma, we have

\mathbb{E}\left\|\theta_{\infty}-\hat{\theta}_{\infty}\right\|^{p}=\mathbb{E}\left[\lim_{k\rightarrow\infty}\frac{1}{k}\sum_{i=1}^{k}\left\|\theta_{i-1}-\hat{\theta}_{i-1}\right\|^{p}\right]\leq\limsup_{k\rightarrow\infty}\frac{1}{k}\sum_{i=1}^{k}\mathbb{E}\left\|\theta_{i-1}-\hat{\theta}_{i-1}\right\|^{p}.

(188)

We can then apply (176) from the proof of Theorem 28 to obtain:

\displaystyle\mathcal{W}_{p}^{p}(\nu_{\infty},\hat{\nu}_{\infty})\leq\frac{C_{2}}{bn\mu}+\frac{C_{3}}{n},

(189)

where

	$\displaystyle C_{2}:=\frac{4D^{2}K_{2}^{2}\eta}{\mu}\left(2^{p+2}\left(\frac{8\eta}{\mu}D^{2}K_{2}^{2}\left(2^{p+1}(E/\mu)^{\frac{p}{p-1}}+5\right)\right)+2^{p+2}(E/\mu)^{\frac{p}{p-1}}+10\right),$		(190)
	$\displaystyle C_{3}:=\frac{32D^{3}K_{2}^{3}\eta}{\mu^{2}}(1+K_{1}\eta)\cdot 10\cdot 2^{p-1}\left(2^{p+1}(E/\mu)^{\frac{p}{p-1}}+5\right)$
	$\displaystyle\qquad\qquad\qquad\qquad\qquad+\frac{4DK_{2}}{\mu}(1+K_{1}\eta)\left(10\cdot 2^{p-1}(E/\mu)^{\frac{p}{p-1}}+5\right).$		(191)

This completes the proof.

$\displaystyle\hat{\eta}$	$\displaystyle\geq\max_{M\geq\eta\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|}\Bigg{\{}\min\Bigg{\{}\Bigg{(}1-\frac{\exp(-\frac{1}{2}(\frac{M}{\eta}-\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|)^{2})}{\sqrt{\det{(I_{d}-\Sigma)}}}\Bigg{)}^{2},$
	$\displaystyle\qquad\exp\Bigg{\{}-\frac{(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}}{\eta^{2}}\\|\Sigma^{-1}\\|$
	$\displaystyle\qquad\quad\cdot\Bigg{(}(1+K_{1}\eta)\left(\frac{2K_{0}}{m}(1+\epsilon)-1\right)^{1/2}+2\left(M+\eta\sup_{x\in\mathcal{X}}\\|\nabla f(\theta_{\ast},x)\\|\right)\Bigg{)}\Bigg{\}}\Bigg{\}}\Bigg{\}}.$	(24)

	$\displaystyle\left\|\mathbb{E}_{\theta,X_{n}}\left[\frac{1}{n}\sum_{i=1}^{n}\left(f\left(\theta,x_{i}\right)-\ell\left(\theta,x_{i}\right)\right)\right]\right\|$	$\displaystyle=\frac{\mu}{2}\mathbb{E}_{\theta,X_{n}}\\|\theta\\|^{2}=\frac{\mu}{2}\mathbb{E}_{\theta,X_{n}}\left\\|\theta-\theta_{X_{n}}^{\star}+\theta_{X_{n}}^{\star}\right\\|^{2}$
		$\displaystyle\leq\mu\mathbb{E}_{\theta,X_{n}}\left\\|\theta-\theta_{X_{n}}^{\star}\right\\|^{2}+\mu\mathbb{E}_{X_{n}}\left\\|\theta_{X_{n}}^{\star}\right\\|^{2}$
		$\displaystyle\leq\mu\mathbb{E}_{X_{n}}\left[(1-\eta\mu)^{k}\left\\|\theta_{X_{n}}^{\star}\right\\|^{2}+\frac{2\eta}{\mu}\sigma_{X_{n}}\right]+\mu\mathbb{E}_{X_{n}}\left\\|\theta_{X_{n}}^{\star}\right\\|^{2}$
		$\displaystyle=\mu\left((1-\eta\mu)^{k}+1\right)\mathbb{E}_{X_{n}}\left\\|\theta_{X_{n}}^{\star}\right\\|^{2}+2\eta\mathbb{E}_{X_{n}}\left[\sigma_{X_{n}}\right].$

	$\displaystyle\mathbb{E}\left\\|\theta_{k}-\tilde{\theta}_{k}\right\\|$	$\displaystyle=\mathbb{E}\left\\|\left(I-\frac{\eta}{b}H_{k}\right)\left(\theta_{k-1}-\tilde{\theta}_{k-1}\right)\right\\|$
		$\displaystyle\leq\mathbb{E}\left[\left\\|I-\frac{\eta}{b}H_{k}\right\\|\left\\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\\|\right]=\rho\mathbb{E}\left\\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\\|.$		(30)

$\displaystyle\mathbb{E}\left\\|\theta_{k}-\tilde{\theta}_{k}\right\\|^{2}$	$\displaystyle=\mathbb{E}\left\\|\theta_{k-1}-\tilde{\theta}_{k-1}-\frac{\eta}{b}\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\tilde{\theta}_{k-1},x_{i}\right)\right)\right\\|^{2}$
	$\displaystyle=\mathbb{E}\left\\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\\|^{2}+\frac{\eta^{2}}{b^{2}}\mathbb{E}\left\\|\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\tilde{\theta}_{k-1},x_{i}\right)\right)\right\\|^{2}$
	$\displaystyle\qquad\qquad-\frac{2\eta}{b}\mathbb{E}\left\langle\theta_{k-1}-\tilde{\theta}_{k-1},\sum_{i\in\Omega_{k}}\left(\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\tilde{\theta}_{k-1},x_{i}\right)\right)\right\rangle.$	(46)

	$\displaystyle\mathbb{E}\left\\|\theta_{k}-\tilde{\theta}_{k}\right\\|^{2}$
	$\displaystyle\leq(1-2\eta\mu)\mathbb{E}\left\\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\\|^{2}+\frac{\eta^{2}}{b^{2}}\mathbb{E}\left[\left(\sum_{i\in\Omega_{k}}\left\\|\nabla f(\theta_{k-1},x_{i})-\nabla f\left(\tilde{\theta}_{k-1},x_{i}\right)\right\\|\right)^{2}\right]$
	$\displaystyle\leq(1-2\eta\mu)\mathbb{E}\left\\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\\|^{2}+\eta^{2}K_{1}^{2}\mathbb{E}\left\\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\\|^{2}$
	$\displaystyle\leq(1-\eta\mu)\mathbb{E}\left\\|\theta_{k-1}-\tilde{\theta}_{k-1}\right\\|^{2},$		(47)

Uniform-in-Time Wasserstein Stability Bounds for (Noisy) Stochastic Gradient Descent

Abstract

1 Introduction

2 Technical Background

2.1 The Wasserstein distance and Wasserstein algorithmic stability

Definition 1 (Hardt et al. (2016), Definition 2.1)

Theorem 2 (Hardt et al. (2016), Theorem 2.2)

2.2 Perturbation theory for Markov chains

Lemma 3 (Rudolf and Schweizer (2018), Theorem 3.1)

3 Wasserstein Stability of SGD via Markov Chain Perturbations

3.1 Warm up: quadratic case

Lemma 4

Lemma 5

Lemma 6

Theorem 7

3.2 Strongly convex case

Assumption 1

Assumption 2

Theorem 8

3.3 Non-convex case with additive noise

Assumption 3

Assumption 4

Theorem 9

Remark 10

Corollary 11

4 Wasserstein Stability of SGD without Geometric Ergodicity

4.1 Non-convex case without additive noise

Theorem 12

4.2 Convex case with additional geometric structure

Assumption 5

Assumption 6

Theorem 13

5 Conclusion

Acknowledgments

References

A On the Usage of Surrogate Losses

Example 1.

Example 2.

B Technical Background

B.1 Computable bounds for the convergence of Markov chains

Assumption 7 (Drift Condition)

Assumption 8 (Minorization Condition)

Lemma 14 (Theorem 1.3. Hairer and Mattingly (2011))

C Proofs of Wasserstein Perturbation Results: Quadratic Case

C.1 Proof of Lemma 4

C.2 Proof of Lemma 5

C.3 Proof of Lemma 6

D Proofs of Wasserstein Perturbation Results: Strongly Convex Case

Lemma 15

Lemma 16

Lemma 17

Lemma 18

D.1 Proof of Theorem 8

E Proofs of Wasserstein Perturbation Bounds: Non-Convex Case

Lemma 19

Lemma 20

Lemma 21

Lemma 22

Lemma 23

Lemma 24

E.1 Proof of Theorem 9

Theorem 25 (Complete Theorem 3.3)

E.2 Proof of Corollary 11

F Proofs of Non-Convex Case without Additive Noise

F.1 Proof of Theorem 12

G Proofs of Convex Case with Additional Geometric Structure

Lemma 26

Lemma 27

Theorem 28

G.1 Proof of Theorem 13

Uniform-in-Time Wasserstein Stability Bounds
for (Noisy) Stochastic Gradient Descent