Towards Optimal Problem Dependent Generalization Error Bounds in Statistical Learning Theory

Yunbei Xu (Graduate School of Business, Columbia University, New York, NY; Email: [email protected])
Assaf Zeevi (Graduate School of Business, Columbia University, New York, NY; Email: [email protected])
Abstract

We study problem-dependent rates, i.e., generalization errors that scale near-optimally with the variance, the effective loss, or the gradient norms evaluated at the “best hypothesis.” We introduce a principled framework dubbed “uniform localized convergence,” and characterize sharp problem-dependent rates for central statistical learning problems. From a methodological viewpoint, our framework resolves several fundamental limitations of existing uniform convergence and localization analysis approaches. It also provides improvements and some level of unification in the study of localized complexities, one-sided uniform inequalities, and sample-based iterative algorithms. In the so-called “slow rate” regime, we provide the first (moment-penalized) estimator that achieves the optimal variance-dependent rate for general “rich” classes; we also establish improved loss-dependent rates for standard empirical risk minimization. In the “fast rate” regime, we establish finite-sample problem-dependent bounds that are comparable to precise asymptotics. In addition, we show that iterative algorithms like gradient descent and first-order Expectation-Maximization can achieve optimal generalization error in several representative problems across the areas of non-convex learning, stochastic optimization, and learning with missing data.

1 Introduction

1.1 Background

Problem statement.

Consider the following statistical learning setting. Assume that a random sample $z$ follows an unknown distribution $\mathbb{P}$ with support $\pazocal{Z}$. For each realization of $z$, let $\ell(\cdot;z)$ be a real-valued loss function, defined over the hypothesis class $\pazocal{H}$. Let $h^{*}\in\pazocal{H}$ be the optimal hypothesis that minimizes the population risk

$$\mathbb{P}\ell(h;z):=\mathbb{E}[\ell(h;z)].$$

Given $n$ i.i.d. samples $\{z_{i}\}_{i=1}^{n}$ drawn from $\mathbb{P}$, our goal, roughly speaking, is to “learn” a hypothesis $\hat{h}\in\pazocal{H}$ that makes the generalization error

$$\mathscr{E}(\hat{h}):=\mathbb{P}\ell(\hat{h};z)-\mathbb{P}\ell(h^{*};z)$$

as small as possible. This pursuit is ubiquitous in machine learning, statistics and stochastic optimization.

We study problem-dependent rates, i.e., finite-sample generalization errors that scale tightly with problem-dependent parameters, such as the variance, the effective loss, or the gradient norms at the optimal hypothesis $h^{*}$. While the direct dependence of $\mathscr{E}(\hat{h})$ on the sample size $n$ is often well-understood, it typically only reflects an “asymptotic” perspective, placing less emphasis on the scale of problem-dependent parameters. Existing literature leaves several outstanding challenges in deriving problem-dependent rates. These can be broadly categorized into the so-called “slow rate” and “fast rate” regimes, as described below.

Main challenges in the “slow rate” regime.

In the absence of strong convexity and margin conditions, the direct dependence of $\mathscr{E}(\hat{h})$ on the sample size $n$ is typically no faster than $O(n^{-\frac{1}{2}})$. These settings are referred to as the “slow rate” regime. Here, the central challenge is to simultaneously achieve optimal dependence on problem-dependent parameters (the variance or the effective loss at the optimal $h^{*}$) and the sample size $n$. To the best of our knowledge, this has not been achieved in previous literature for “rich” classes (to be explained shortly).

Perhaps the most widely used framework to study problem-dependent rates is the traditional “local Rademacher complexity” analysis [3, 26, 79], which has become a standard tool in learning theory. However, as we will discuss later, this analysis leads to a dependence on the sample size $n$ which is sub-optimal for essentially all “rich” classes (with the exception of parametric classes). The absence of more precise localization analysis also challenges the design of more refined estimation procedures. For example, designing estimators to achieve variance-dependent rates requires penalizing the empirical second moment to achieve the “right” bias-variance trade-off. Most antecedent work is predicated on either the traditional “local Rademacher complexity” analysis [56, 13] or coarser approaches [45, 71]. Thus, to the best of our knowledge, the question of optimal problem-dependent rates for general rich classes is still open.

Main challenges in the “fast rate” regime.

Under suitable curvature or so-called margin conditions, the direct dependence of $\mathscr{E}(\hat{h})$ on $n$ is typically faster than $O(n^{-\frac{1}{2}})$; for that reason we refer to this as the “fast rate” regime. Here, the traditional localization analysis often provides correct dependence on the sample size $n$, but the complexity parameters (e.g., norms of gradients, Lipschitz constants, etc.) characterizing these generalization errors are not localized and hence are not problem-dependent.

Much progress on problem-dependent rates has been made under particular formulations, such as regression with structured strongly convex cost (e.g., square cost, Huber cost) [50, 53, 37], and binary classification under margin conditions [15, 84]. In particular, [50, 53] point out that the traditional “local Rademacher complexity” analysis fails to provide parameter localization for unbounded regression problems, and propose the so-called “learning without concentration” methodology to obtain problem-dependent rates that do not scale with the worst-case parameters. We are able to recover these results using a more direct concentration-based analysis and remove certain restrictions.

We will also focus on parametric models in the “fast rate” regime, which covers more “modern” non-convex learning problems whose analysis is not aligned with traditional generalization error analysis. For example, in non-convex learning problems one wants generalization error bounds for iterative optimization algorithms; and traditional localization frameworks (which mostly focus on supervised learning) do not cover general stochastic optimization and missing data problems. For many representative non-convex learning, stochastic optimization and missing data problems, it remains open to provide algorithmic solutions and problem-dependent generalization error bounds that are comparable to optimal asymptotic results.

1.2 Contributions and organization of the paper

The paper provides contributions both in the framework it develops, as well as its application to improving existing results in several problem areas. In particular, it suggests guidelines for designing estimators and learning algorithms and provides analysis tools to study them. In addition, it provides some level of unification across problem areas. Specifically, the main contributions are as follows.

(1) We introduce a principled framework to study localization in statistical learning, dubbed the “uniform localized convergence” procedure, which simultaneously provides optimal “direct dependence” on the sample size, correct scaling with problem-dependent parameters, and flexibility to unify various problem settings. Section 2 provides a description of the framework, and explains how it addresses several fundamental challenges.

(2) In the “slow rate” regime, we employ the above ideas to design the first estimator that achieves optimal variance-dependent rates for general function classes. The derivation is based on a novel procedure that optimally penalizes the empirical (centered) second moment. We also establish improved loss-dependent rates for standard empirical risk minimization, which has computational advantages. Section 3 presents these theoretical results (see Section 3.2 for the loss-dependent rate and Section 3.3 for the variance-dependent rate); and Section 4 will illustrate our improvements in non-parametric classes and VC classes.

(3) In the “fast rate” regime, we establish a “uniform localized convergence of gradient” framework for parametrized models, and characterize optimal problem-dependent rates for approximate stationary points and iterative optimization algorithms such as gradient descent and first-order Expectation-Maximization. Our results scale tightly with the gradient norms at the optimal parameter, and improve previous guarantees in non-convex learning, stochastic optimization and missing data problems. See Section 5 for the theoretical results and Sections 6-7 for illustrations of improvements over previous results.

(4) In the “fast rate” regime, we also study supervised learning with structured convex cost, where the hypothesis class can be non-parametric and heavy-tailed. This part of the work has a direct relationship to a stream of literature known as “learning without concentration.” Our contributions in this setting lie in technical refinements of the generalization error bounds and some unification between one-sided uniform inequalities and concentration of truncated functions; for this reason we defer its treatment to Section 8.

2 The “uniform localized convergence” procedure

2.1 The current blueprint

Denote the empirical risk

$$\mathbb{P}_{n}\ell(h;z):=\frac{1}{n}\sum_{i=1}^{n}\ell(h;z_{i}),$$

and consider the following straightforward decomposition of the generalization error

$$\mathscr{E}(\hat{h})=(\mathbb{P}-\mathbb{P}_{n})\ell(\hat{h};z)+\big(\mathbb{P}_{n}\ell(\hat{h};z)-\mathbb{P}_{n}\ell(h^{*};z)\big)+(\mathbb{P}_{n}-\mathbb{P})\ell(h^{*};z). \tag{2.1}$$

The main difficulty in studying $\mathscr{E}(\hat{h})$ comes from bounding the first term $(\mathbb{P}-\mathbb{P}_{n})\ell(\hat{h};z)$, since $\hat{h}$ depends on the $n$ samples. The simplest approach, which does not achieve problem-dependent rates, is to bound the uniform error

$$\sup_{h\in\pazocal{H}}(\mathbb{P}-\mathbb{P}_{n})\ell(h;z)$$

over the entire hypothesis class $\pazocal{H}$. In order to obtain problem-dependent rates, a natural modification is to consider uniform convergence over localized subsets of $\pazocal{H}$.
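To make the decomposition (2.1) concrete, the following minimal sketch checks it numerically on a toy problem. The setup is an assumption made only for illustration and does not come from the paper: Gaussian samples, constant predictors $h\in\mathbb{R}$, and the squared loss $\ell(h;z)=(h-z)^{2}$, for which the population risk has a closed form. All function names are hypothetical.

```python
import random

def loss(h, z):
    """Squared loss of the constant predictor h on the sample z (toy choice)."""
    return (h - z) ** 2

def population_risk(h, mean, var):
    """Closed form of E[(h - z)^2] when z has the given mean and variance."""
    return (h - mean) ** 2 + var

def empirical_risk(h, sample):
    """Average loss of h over the observed sample."""
    return sum(loss(h, z) for z in sample) / len(sample)

rng = random.Random(0)
mean, var = 1.0, 4.0
sample = [rng.gauss(mean, var ** 0.5) for _ in range(50)]

h_star = mean                       # minimizer of the population risk
h_hat = sum(sample) / len(sample)   # empirical risk minimizer (sample mean)

# Generalization error and the three terms of the decomposition (2.1).
excess = population_risk(h_hat, mean, var) - population_risk(h_star, mean, var)
term1 = population_risk(h_hat, mean, var) - empirical_risk(h_hat, sample)
term2 = empirical_risk(h_hat, sample) - empirical_risk(h_star, sample)
term3 = empirical_risk(h_star, sample) - population_risk(h_star, mean, var)
```

In this toy instance the middle term is non-positive because `h_hat` minimizes the empirical risk, and the last term involves only the fixed hypothesis `h_star`, so the sample-dependent first term is the one that calls for a uniform argument.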

We first give an overview of the traditional “local Rademacher complexity” analysis [3, 26, 79]. Consider a generic function class $\pazocal{F}$ that we wish to concentrate, which consists of real-valued functions defined on $\pazocal{Z}$ (e.g., one can set $f(z)=\ell(h;z)$). Denote

$$\mathbb{P}f:=\mathbb{E}[f(z)],\quad\mathbb{P}_{n}f:=\frac{1}{n}\sum_{i=1}^{n}f(z_{i}).$$

The notation $r>0$ will serve as a localization parameter, and $\delta>0$ will serve for high probability arguments. Let $\psi(r;\delta)$ denote a surrogate function that upper bounds the uniform error within a localized region $\{f\in\pazocal{F}:T(f)\leq r\}$, where we call $T:\pazocal{F}\rightarrow\mathbb{R}_{+}$ the “measurement functional.” Formally, let $\psi$ be a function that maps $[0,\infty)\times(0,1)$ to $[0,\infty)$, which possibly depends on the observed samples $\{z_{i}\}_{i=1}^{n}$. Assume $\psi$ satisfies for arbitrary fixed $\delta,r$, with probability at least $1-\delta$,

$$\sup_{f\in\pazocal{F}:T(f)\leq r}(\mathbb{P}-\mathbb{P}_{n})f\leq\psi(r;\delta). \tag{2.2}$$

By default we ask $\psi(r;\delta)$ to be a non-decreasing and non-negative function. (Footnote 3: Here and in what follows we will assume that such suprema are measurable, namely, the required regularity conditions on the underlying function classes are met; see, e.g., [61, Appendix C], [77, Section 1.7].) The main result of the traditional localization analysis can be summarized as follows. (The statement is obtained by adapting the proof from [3, Section 3.2], which is itself more general than the traditional meta-results [3, 26, 79].)

Statement 1 (current blueprint).

Assume that $\psi$ is a sub-root function, i.e., $\psi(r;\delta)/\sqrt{r}$ is non-increasing with respect to $r\in\mathbb{R}_{+}$. Assume the Bernstein condition $T(f)\leq B_{e}\mathbb{P}f$, $B_{e}>0$, $\forall f\in\pazocal{F}$. Then with probability at least $1-\delta$, for all $f\in\pazocal{F}$ and $K>1$,

$$(\mathbb{P}-\mathbb{P}_{n})f\leq\frac{1}{K}\mathbb{P}f+\frac{100(K-1)r^{*}}{B_{e}}, \tag{2.3}$$

where $r^{*}$ is the “fixed point” solution of the equation $r=B_{e}\psi(r;\delta)$.

Since its inception, Statement 1 has become a standard tool in learning theory. However, it requires a rather technical proof, and it appears to be loose when compared with the original assumption (2.2)—ideally, we would like to directly extend (2.2) to hold uniformly without sacrificing any accuracy. Moreover, some assumptions in the statement are restrictive and might not be necessary.

2.2 The “uniform localized convergence” principle

We provide a surprisingly simple analysis that greatly improves and simplifies Statement 1. Unlike Statement 1, the following proposition does not require the concentrated functions $g_{f}$ to satisfy the Bernstein condition, and the surrogate function $\psi$ need not be “sub-root.” Despite imposing fewer restrictions, Proposition 1 is able to establish results that are typically “sharper” than the current blueprint. Note that in the proposition, both the measurement functional $T$ and the surrogate function $\psi$ are allowed to be sample-dependent.

Proposition 1 (the “uniform localized convergence” argument).

For a function class $\pazocal{G}=\{g_{f}:f\in\pazocal{F}\}$ and functional $T:\pazocal{F}\rightarrow[0,R]$, assume there is a function $\psi(r;\delta)$, which is non-decreasing with respect to $r$ and satisfies that $\forall\delta\in(0,1)$, $\forall r\in[0,R]$, with probability at least $1-\delta$,

$$\sup_{f\in\pazocal{F}:T(f)\leq r}(\mathbb{P}-\mathbb{P}_{n})g_{f}\leq\psi(r;\delta). \tag{2.4}$$

Then, given any $\delta\in(0,1)$ and $r_{0}\in(0,R]$, with probability at least $1-\delta$, for all $f\in\pazocal{F}$, either $T(f)\leq r_{0}$ or

$$(\mathbb{P}-\mathbb{P}_{n})g_{f}\leq\psi\left(2T(f);\delta\left(\log_{2}\frac{2R}{r_{0}}\right)^{-1}\right). \tag{2.5}$$

The key intuition behind Proposition 1 is that the uniform restatement of the “localized” argument (2.4) is nearly cost-free, because the deviations $(\mathbb{P}-\mathbb{P}_{n})g_{f}$ can be controlled solely by the real-valued functional $T(f)$. As a result, we essentially only require uniform convergence over an interval $[r_{0},R]$. The “cost” of this uniform convergence, namely, the additional $\log_{2}(\frac{2R}{r_{0}})$ term in (2.5), will only appear in the form $\log(\log_{2}(\frac{2R}{r_{0}})/\delta)$ in high-probability bounds, which is of a negligible $O(\log\log n)$ order in general.

Formally, we apply a “peeling” technique: we take $r_{k}=2^{k}r_{0}$, where $k=1,2,\dots,\lceil\log_{2}\frac{R}{r_{0}}\rceil$, and we use the union bound to extend (2.4) to hold for all these $r_{k}$ simultaneously (applying (2.4) at each $r_{k}$ with $\delta$ replaced by $\delta(\log_{2}\frac{2R}{r_{0}})^{-1}$). Then for any $f\in\pazocal{F}$ such that $T(f)>r_{0}$, there exists a non-negative integer $k$ such that $2^{k}r_{0}<T(f)\leq 2^{k+1}r_{0}$. By the non-decreasing property of the $\psi$ function, we then have

$$(\mathbb{P}-\mathbb{P}_{n})g_{f}\leq\psi\left(r_{k+1};\delta\left(\log_{2}\frac{2R}{r_{0}}\right)^{-1}\right)\leq\psi\left(2T(f);\delta\left(\log_{2}\frac{2R}{r_{0}}\right)^{-1}\right),$$

which is exactly (2.5). Interestingly, the proof of the classical result (Statement 1) relies on relatively heavy machinery that includes more complicated peeling and re-weighting arguments (see [3, Section 3.1.4]). However, that analysis obscures the key intuition that we elucidate under inequality (2.5).
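The bookkeeping behind the peeling step can be sketched in a few lines. The code below is only an illustration of the dyadic grid $r_{k}=2^{k}r_{0}$ used above; the function names are hypothetical, and it checks the two facts the argument relies on: every value $T(f)\in(r_{0},R]$ is covered by a grid radius that is at most $2T(f)$, and the number of grid points is at most $\log_{2}(2R/r_{0})$.

```python
import math

def peeling_grid(r0, R):
    """Radii r_k = 2^k * r0 for k = 1, ..., ceil(log2(R / r0))."""
    num_shells = math.ceil(math.log2(R / r0))
    return [2 ** k * r0 for k in range(1, num_shells + 1)]

def covering_radius(t, r0, R):
    """Smallest grid radius that is >= t, for a value t with r0 < t <= R."""
    if not (r0 < t <= R):
        raise ValueError("t must satisfy r0 < t <= R")
    for radius in peeling_grid(r0, R):
        if t <= radius:
            return radius
    raise AssertionError("the largest grid radius is always >= R")

# Example: the localized bound at radius r_{k+1} applies to any f with T(f) = t,
# and monotonicity of psi then lets one replace r_{k+1} by 2 * t.
r0, R = 0.01, 1.0
grid = peeling_grid(r0, R)
radius_for_t = covering_radius(0.3, r0, R)
```

The union bound is taken over `len(grid)` events, which is why the confidence parameter is divided by $\log_{2}(2R/r_{0})$ in (2.5).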

In this paper, we prove localized generalization error bounds through a unified principle, summarized at a high level in the two-step template below.

Principle of uniform localized convergence. First, determine the concentrated functions, the measurement functional and the surrogate ψ\psi, and obtain a sharp “uniform localized convergence” argument. Then, perform localization analysis that is customized to the problem setting and the learning algorithm.

Distinct from the blueprint (2.3), the right-hand side of our “uniform localized convergence” argument (2.5) contains a “free” variable $T(f)$ rather than a fixed value $r^{*}$. The new principle strictly improves the current blueprint in many respects, and its merits will be illustrated in the sequel.

2.3 Merits of “uniform localized convergence”

Our improvements in the “slow rate” regime originate from the noticeable gap between Proposition 1 and Statement 1, illustrated by the following (informal) conclusion.

Statement 2 (improvements over the current blueprint–informal statement).

Setting $g_{f}=f$, under the assumptions of Statement 1, Proposition 1 provides a strict improvement over Statement 1. In particular, the slower $\psi$ grows, the larger the gap between the bounds in the two results, and the bounds are on par only when $\psi$ is proportional to $\sqrt{r}$, i.e., when the function class $\pazocal{F}$ is parametric and not “rich.”

Formalizing as well as providing rigorous justification for this conclusion is relatively straightforward: taking the “optimal choice” of $K$ in Statement 1, we can re-write its conclusion as

$$(\mathbb{P}-\mathbb{P}_{n})f\leq 20\sqrt{\frac{r^{*}\mathbb{P}f}{B_{e}}}-\frac{r^{*}}{B_{e}}\quad[\text{Statement 1}],$$

where the right-hand side is of order $\sqrt{r^{*}\mathbb{P}f/B_{e}}$ when $\mathbb{P}f>r^{*}/B_{e}$, and order $r^{*}/B_{e}$ when $\mathbb{P}f\leq r^{*}/B_{e}$. Our result (2.5) is also of order $r^{*}/B_{e}$ when $\mathbb{P}f\leq r^{*}/B_{e}$. However, for every $f$ such that $\mathbb{P}f>r^{*}/B_{e}$, it is straightforward to verify that under the assumptions in Statement 1,

$$\begin{aligned}
\psi(2T(f);\delta)\quad &\leq\quad \psi(2B_{e}\mathbb{P}f;\delta) && \text{[Bernstein condition: $T(f)\leq B_{e}\mathbb{P}f$]}\\
&\leq\quad \frac{\sqrt{2B_{e}\mathbb{P}f}}{\sqrt{r^{*}}}\psi(r^{*};\delta) && \text{[$\psi(r;\delta)$ is sub-root]}\\
&\leq\quad \sqrt{\frac{2r^{*}\mathbb{P}f}{B_{e}}} && \text{[$r^{*}$ is the fixed point of $B_{e}\psi(r;\delta)$]}.
\end{aligned} \tag{2.6}$$

Therefore, the argument $\psi(2T(f);\delta)\leq\sqrt{2r^{*}\mathbb{P}f/B_{e}}$ established by (2.6) shows that the “uniform localized convergence” argument (2.5) strictly improves over Statement 1 (ignoring negligible $O(\log\log n)$ factors). Statement 2 indicates that the folklore use of fixed values as straightforward complexity control is somewhat questionable. In the “slow rate” regime, the key point to achieve optimal problem-dependent rates is to bound the generalization error using the function $\psi$. Otherwise the bounds will have the “wrong” dependence on the sample size for all “rich” classes.
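As a rough numerical illustration of this gap, the sketch below compares the two bounds under assumptions that are made here purely for illustration and are not taken from the paper: a hypothetical power-law surrogate $\psi(r;\delta)=c\,r^{\alpha}$ with $0<\alpha\leq 1/2$ (so that $\psi$ is sub-root), the dependence on $\delta$ and the $\log\log$ factor suppressed, and the Bernstein condition holding with equality, $T(f)=B_{e}\mathbb{P}f$. Under these assumptions the fixed point of $r=B_{e}\psi(r;\delta)$ is $r^{*}=(B_{e}c)^{1/(1-\alpha)}$.

```python
import math

def fixed_point(c, alpha, B_e):
    """Solution of r = B_e * c * r**alpha for the assumed power-law surrogate."""
    return (B_e * c) ** (1.0 / (1.0 - alpha))

def statement1_bound(Pf, r_star, B_e):
    """Right-hand side of Statement 1 after optimizing over K."""
    return 20.0 * math.sqrt(r_star * Pf / B_e) - r_star / B_e

def proposition1_bound(Pf, c, alpha, B_e):
    """psi(2 T(f)) with T(f) = B_e * Pf (Bernstein condition with equality)."""
    return c * (2.0 * B_e * Pf) ** alpha

def chain_bound(Pf, r_star, B_e):
    """Last line of the chain (2.6): sqrt(2 r* Pf / B_e)."""
    return math.sqrt(2.0 * r_star * Pf / B_e)

c, alpha, B_e = 0.05, 0.25, 1.0        # alpha < 1/2 mimics a "rich" class
r_star = fixed_point(c, alpha, B_e)
# Evaluate at several values of Pf above the threshold r* / B_e.
rows = [(Pf,
         proposition1_bound(Pf, c, alpha, B_e),
         chain_bound(Pf, r_star, B_e),
         statement1_bound(Pf, r_star, B_e))
        for Pf in (0.05, 0.2, 1.0)]
```

In this toy instance the ratio between the chain bound and $\psi(2T(f))$ grows like $(\mathbb{P}f)^{1/2-\alpha}$, and the two coincide when $\alpha=1/2$, which matches the informal claim in Statement 2.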

Interestingly, the merits of our approach in the “fast rate” regime stem from very different perspectives: the removal of the “sub-root” requirement on $\psi$ allows one to achieve parameter localization; and the added flexibility in the choice of $g_{f}$ allows one to prove one-sided uniform inequalities and uniform convergence of gradient vectors. To better appreciate these, we provide the following informal discussion to help elucidate the key messages. Let $\mu>0$ be the curvature parameter in the “fast rate” regime (a common example is the strong convexity parameter for the loss function). The generalization error is often characterized by the fixed point of $\psi(r;\delta)/\mu$, where $\psi$ is the surrogate function that governs the uniform error of excess risk. The removal of the “sub-root” restriction is crucial because under curvature and smoothness conditions, the uniform error of excess loss typically grows “faster” than the square root function. Consider surrogate functions of the form

$$\psi(r;\delta)=\underbrace{\sqrt{a_{n}^{*}r}}_{\text{problem-dependent}}+\underbrace{c_{n}r}_{\text{super-root}}, \tag{2.7}$$

where $a_{n}^{*}$ is a problem-dependent rate, and $c_{n}$ satisfies $0<c_{n}<1/(2\mu)$ when the sample size $n$ satisfies mild conditions. We call $c_{n}r$ the benign “super-root” component in the sense that when solving the fixed point equation $r=\psi(r;\delta)/\mu$, that part can be dropped from both sides of the equation. As a result, the fixed point solution only depends on the order of the problem-dependent component $\sqrt{a_{n}^{*}r}$, and so one obtains problem-dependent rates. In contrast, worst-case parameters will be unavoidable if one wants to use a “sub-root” surrogate function to govern (2.7). In traditional analysis, a loose “sub-root” surrogate function is often obtained via two-sided concentration and Lipschitz contraction, making global Lipschitz constants unavoidable. Furthermore, the added flexibility to choose concentrated functions $g_{f}$ is also useful. In particular, we will show that: 1) our framework unifies traditional value-based uniform convergence and uniform convergence of gradient vectors, which is crucial to study non-convex learning problems and sample-based iterative algorithms; and 2) simple “truncated” functions can be used to establish one-sided uniform inequalities that are sharper than two-sided ones, which enable recovery of results in unbounded and heavy-tailed regression problems.

2.4 Unification and improvements over existing localization approaches

Beyond proving new bounds, an important objective of the paper is to provide some level of unification to the methodology of uniform convergence and localization. Here we present a historical review of uniform convergence and localization, and discuss how the “uniform localized convergence” principle unifies and improves existing approaches. We will overview four general settings where localization plays a crucial role in generalization error analysis and algorithm design: 1) the “slow rate” regime; 2) classification under margin conditions; 3) regression under curvature conditions; and 4) non-convex learning and stochastic optimization (the latter three settings belong to the “fast rate” regime).

The “slow rate” regime.

The traditional “local Rademacher complexity” analysis is the standard tool to study localized generalization error bounds in the “slow rate” regime; it also influences the design of regularization. Here, our framework resolves the fundamental limitation of the traditional analysis, leading to the first optimal loss-dependent and variance-dependent rates for general “rich” classes (see Sections 3-4).

Classification under margin conditions.

One early line of work in the “fast rate” regime focuses on exploiting the margin conditions to establish fast rates in binary classification (e.g., see [26, Section 5.3]). It can be shown that the Bernstein condition in Statement 1 subsumes standard margin conditions such as Tsybakov’s margin condition [74] and Massart’s noise condition [44]. Because the loss functions in binary classification are uniformly bounded, the “current blueprint” (Statement 1) already provides a good framework to study these questions. In an orthogonal direction, some recent works study more refined alternatives to the notion of “local Rademacher complexity.” While these results are important, they are within the scope of Statement 1 from the perspective of localization machinery. For example, one can consider better “sub-root” relaxation than direct Rademacher symmetrization [84]; and one may only need to consider some subset of $\pazocal{F}$ if using “empirical risk minimization over epsilon net” [83]. In this setting, the work presented in this paper contributes to unification rather than improvements over specific theoretical results.

Regression under curvature conditions.

The traditional localization analysis has been widely applied to regression problems with curvature to achieve “fast rates.” However, it fails to localize the complexity parameters (e.g., norms of gradients, Lipschitz constants, etc.) in the generalization error bounds. As a consequence, the traditional localization analysis is widely recognized as not being suitable for regression problems with unbounded losses; and it may be unfavorable even for uniformly bounded problems because it fails to adapt to the “effective noise level” at the optimal hypothesis. Important progress addressing the aforementioned limitations has been made during the last decade. Focusing on supervised learning problems with structured convex costs (square cost, Huber cost, etc.), the breakthrough works [50, 53] propose the so-called “learning without concentration” framework, where the central notions and proof techniques are quite different from the traditional concentration framework. An interesting direction is to recover these results through a “one-shot” concentration framework. By exploiting the intrinsic connections between one-sided uniform inequalities and the “uniform localized convergence” of truncated functions, we are able to establish such a unified analysis and illustrate systematic refinements of the generalization error bounds. This setting will be discussed in Section 8.

Non-convex learning and stochastic optimization.

Traditional value-based generalization error analysis relies on properties of global (regularized) empirical risk minimizers. However, non-convex learning problems and generalization error analysis of iterative optimization algorithms require one to consider uniform convergence of gradient vectors. Moreover, existing localization frameworks mostly focus on supervised learning problems and are unable to handle general stochastic optimization or missing data problems. Our framework is able to prove “uniform localized convergence of gradients,” which improves existing vector-based uniform convergence frameworks [47, 12, 2] and provides problem-dependent generalization error bounds for iterative algorithms (see Sections 5-7).

3 Problem-dependent rates in the “slow rate” regime

3.1 Preliminaries

Let $\pazocal{V}^{*}$ and $\pazocal{L}^{*}$ be the variance and the “effective loss” at the best hypothesis $h^{*}$:

$$\pazocal{V}^{*}:=\mathrm{Var}[\ell(h^{*};z)],\quad\pazocal{L}^{*}:=\mathbb{P}[\ell(h^{*};z)-\inf_{\pazocal{H}}\ell(h;z)].$$

In this section we study finite-sample generalization errors that scale tightly with $\pazocal{V}^{*}$ or $\pazocal{L}^{*}$, which we call problem-dependent rates, without invoking strong convexity or margin conditions.

In the slow rate regime, we assume the loss function takes values in $[-B,B]$, i.e., $|\ell(h;z)|\leq B$ for all $h\in\pazocal{H}$ and $z\in\pazocal{Z}$. This is a standard assumption used in almost all previous works that do not invoke curvature conditions or rely on other problem-specific structure; and our results in the slow rate regime essentially only require this boundedness assumption. Extensions to unbounded targets can be obtained via truncation techniques (see, e.g., [17]), and our problem-dependent results allow $B$ to be very large, potentially scaling with $n$.

We represent the complexity through a surrogate function $\psi(r;\delta)$ that satisfies for all $\delta\in(0,1)$,

$$\sup_{f\in\pazocal{F}:\mathbb{P}[f^{2}]\leq r}(\mathbb{P}-\mathbb{P}_{n})f\leq\psi(r;\delta), \tag{3.1}$$

with probability at least $1-\delta$, where $\pazocal{F}$ is taken to be the excess loss class

$$\ell\circ\pazocal{H}-\ell\circ h^{*}:=\{z\mapsto\left(\ell(h;z)-\ell(h^{*};z)\right):h\in\pazocal{H}\}. \tag{3.2}$$

(From the perspective of Section 2.2, we choose the excess losses as the “concentrated functions,” and use $T(f)=\mathbb{P}[f^{2}]$ as the “measurement functional”.) To achieve non-trivial complexity control (and ensure existence of the fixed point), we only consider “meaningful” surrogate functions, as stated below.

Definition 1 (meaningful surrogate function).

A bivariate function $\psi(r;\delta)$ defined over $[0,\infty)\times(0,1)$ is called a meaningful surrogate function if it is non-decreasing, non-negative and bounded with respect to $r$ for every fixed $\delta\in(0,1)$.

We note that the above does not place significant restrictions on the choice of the surrogate function. In particular, for the $\psi$ function defined in (3.1) and the excess loss class in (3.2), the left-hand side of (3.1) is itself non-decreasing and non-negative; and the boundedness requirement can always be met by setting $\psi(r;\delta)=\psi(4B^{2};\delta)$ for all $r\geq 4B^{2}$. We now give the formal definition of fixed points.

Definition 2 (fixed point).

Given a non-decreasing, non-negative and bounded function $\varphi(r)$ defined over $[0,\infty)$, we define the fixed point of $\varphi(r)$ to be $\sup\{r>0:\varphi(r)\geq r\}$.

It is well-known that a non-decreasing, non-negative and bounded function has at most countably many discontinuity points, all of which are “discontinuity points of the first kind” [65, Definition 4.26]. Therefore, it is easy to verify that the fixed point of $\varphi(r)$ is the maximal solution to the equation $\varphi(r)=r$.
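For intuition, a fixed point in the sense of Definition 2 can be approximated by bisection. The sketch below is illustrative only and rests on an extra assumption that is not part of the definition: the set $\{r>0:\varphi(r)\geq r\}$ is an interval $(0,r^{*}]$, which holds for example when $\varphi(r)/r$ is non-increasing. Without this assumption a non-decreasing function with jumps can cross the diagonal several times and bisection may return a non-maximal crossing. The function name and arguments are hypothetical.

```python
def fixed_point(phi, upper, tol=1e-12):
    """Approximate sup{r > 0 : phi(r) >= r} by bisection on [0, upper].

    Assumes phi is non-decreasing and bounded above by `upper`, and that
    {r : phi(r) >= r} is an interval (0, r*] (an assumption of this sketch).
    """
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if phi(mid) >= mid:
            lo = mid      # mid still satisfies phi(mid) >= mid: move right
        else:
            hi = mid      # phi(mid) < mid: the fixed point lies to the left
    return lo

# Example with a bounded square-root-type function; the exact fixed point is 1.
r_star = fixed_point(lambda r: min(r ** 0.5, 4.0), upper=4.0)
```

Since $\varphi$ is bounded by `upper`, every $r$ larger than `upper` satisfies $\varphi(r)<r$, so restricting the search to $[0,\texttt{upper}]$ loses nothing.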

Given a bounded class $\pazocal{F}$, empirical process theory provides a general way to construct a surrogate function by upper bounding the “local Rademacher complexity” $\mathfrak{R}\{f\in\pazocal{F}:\mathbb{P}[f^{2}]\leq r\}$ (see Lemma 5 in Appendix A.6). We give the definition of Rademacher complexity for completeness.

Definition 3 (Rademacher complexity).

For a function class $\pazocal{F}$ that consists of mappings from $\pazocal{Z}$ to $\mathbb{R}$, define

$$\mathfrak{R}\pazocal{F}:=\mathbb{E}_{z,\upsilon}\sup_{f\in\pazocal{F}}\frac{1}{n}\sum_{i=1}^{n}\upsilon_{i}f(z_{i}),\quad\mathfrak{R}_{n}\pazocal{F}:=\mathbb{E}_{\upsilon}\sup_{f\in\pazocal{F}}\frac{1}{n}\sum_{i=1}^{n}\upsilon_{i}f(z_{i}),$$

as the Rademacher complexity and the empirical Rademacher complexity of $\pazocal{F}$, respectively, where $\{\upsilon_{i}\}_{i=1}^{n}$ are i.i.d. Rademacher variables for which $\text{Prob}(\upsilon_{i}=1)=\text{Prob}(\upsilon_{i}=-1)=\frac{1}{2}$. In the above we explicitly denote expectation operators with subscripts that describe the distribution relative to which the expectation is computed: $\mathbb{E}_{z}$ means taking expectations over $\{z_{i}\}_{i=1}^{n}$ and $\mathbb{E}_{\upsilon}$ means taking expectations over $\{\upsilon_{i}\}_{i=1}^{n}$.
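For a finite class and a very small sample, the empirical Rademacher complexity $\mathfrak{R}_{n}\pazocal{F}$ can be computed exactly by enumerating all $2^{n}$ sign vectors. The sketch below does this; it is meant only to make the definition concrete. Representing each function by its tuple of values on the sample is a choice made here for illustration, and the function name is hypothetical.

```python
from itertools import product

def empirical_rademacher(function_values):
    """Exact empirical Rademacher complexity of a finite class.

    `function_values` is a list of tuples; each tuple holds the values
    (f(z_1), ..., f(z_n)) of one function f on the observed sample.
    The cost is exponential in n, so this is only usable for tiny n.
    """
    n = len(function_values[0])
    total = 0.0
    for signs in product((-1, 1), repeat=n):   # all 2^n Rademacher sign vectors
        total += max(
            sum(s * v for s, v in zip(signs, values)) / n
            for values in function_values
        )
    return total / 2 ** n

# Class {f, -f} with f = 1 on both sample points: the supremum equals
# |v_1 + v_2| / 2, whose average over the four sign vectors is 1/2.
symmetric_pair = empirical_rademacher([(1.0, 1.0), (-1.0, -1.0)])
# A single function has zero complexity, since each sign has mean zero.
single = empirical_rademacher([(1.0, 2.0, 3.0)])
```

For larger $n$ one would typically replace the enumeration by a Monte Carlo average over randomly drawn sign vectors.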

Furthermore, Dudley’s integral bound provides one general solution to construct a computable upper bound on the local Rademacher complexity via the covering number of $\pazocal{F}$. We give the definition of covering number and state Dudley’s integral bound for completeness as well.

Definition 4 (covering number).

Assume $(\pazocal{M},\textup{metr}(\cdot,\cdot))$ is a metric space, and $\pazocal{F}\subseteq\pazocal{M}$. The $\varepsilon$-covering number of the set $\pazocal{F}$ with respect to a metric $\textup{metr}(\cdot,\cdot)$ is the size of its smallest $\varepsilon$-net cover:

$$\pazocal{N}(\varepsilon,\pazocal{F},\textup{metr})=\min\{m:\exists f_{1},\dots,f_{m}\in\pazocal{F}\textup{ such that }\pazocal{F}\subseteq\cup_{j=1}^{m}\pazocal{B}(f_{j},\varepsilon)\},$$

where $\pazocal{B}(f,\varepsilon):=\{\tilde{f}:\textup{metr}(\tilde{f},f)\leq\varepsilon\}$.

We call $\log\pazocal{N}(\varepsilon,\pazocal{F},\textup{metr})$ the metric entropy of the set $\pazocal{F}$ with respect to a metric $\textup{metr}(\cdot,\cdot)$. Standard metrics include the $L_{p}(\mathbb{P}_{n})$ metric defined by $L_{p}(\mathbb{P}_{n})(f_{1},f_{2}):=\sqrt[p]{\mathbb{P}_{n}(f_{1}(z)-f_{2}(z))^{p}}$ for $p>0$. For function classes characterized by metric entropy conditions, the surrogate function constructed by Dudley’s integral bound is often optimal. We use the $L_{2}(\mathbb{P}_{n})$ metric for simplicity; the result trivially extends to $L_{p}(\mathbb{P}_{n})$ metrics for $p\geq 2$, since $\log\pazocal{N}(\varepsilon,\pazocal{F},L_{2}(\mathbb{P}_{n}))\leq\log\pazocal{N}(\varepsilon,\pazocal{F},L_{p}(\mathbb{P}_{n}))$ for any set $\pazocal{F}$ and discretization error $\varepsilon>0$.
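To make Definition 4 concrete, the sketch below computes an upper bound on the $L_{2}(\mathbb{P}_{n})$ covering number of a finite class by greedily building an $\varepsilon$-net. The greedy net is a valid cover with centers in the class, so its size upper bounds the covering number, but it need not be the smallest cover. The name `greedy_cover_size` is hypothetical.

```python
import numpy as np

def greedy_cover_size(values, eps):
    """Size of a greedily built eps-net in the L2(P_n) metric.

    values: array-like of shape (m, n); row j holds f_j on the n sample points.
    Returns an upper bound on N(eps, F, L2(P_n)) for this finite class.
    """
    values = np.asarray(values, dtype=float)
    uncovered = np.ones(len(values), dtype=bool)
    centers = 0
    while uncovered.any():
        j = int(np.argmax(uncovered))  # first uncovered function becomes a center
        # L2(P_n) distance from every function to the new center.
        dist = np.sqrt(np.mean((values - values[j]) ** 2, axis=1))
        uncovered &= dist > eps        # everything within eps is now covered
        centers += 1
    return centers
```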

Lemma 1 (Dudley’s integral bound, [69]).

Given $r>0$ and a class $\pazocal{F}$ that consists of functions defined on $\pazocal{Z}$,

$$\mathfrak{R}_{n}\{f\in\pazocal{F}:\mathbb{P}_{n}[f^{2}]\leq r\}\leq\inf_{\varepsilon_{0}>0}\left\{4\varepsilon_{0}+12\int_{\varepsilon_{0}}^{\sqrt{r}}\sqrt{\frac{\log\pazocal{N}(\varepsilon,\pazocal{F},L_{2}(\mathbb{P}_{n}))}{n}}\,d\varepsilon\right\}.$$
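The right-hand side of Lemma 1 can be approximated numerically once a metric entropy bound is available. The sketch below is a rough illustration under stated assumptions: `log_cover` is a user-supplied, vectorized upper bound on the metric entropy, the infimum over $\varepsilon_{0}$ is restricted to a finite grid, and the integral is approximated by the trapezoid rule. The name `dudley_bound` is hypothetical.

```python
import numpy as np

def dudley_bound(log_cover, r, n, grid=2000):
    """Grid approximation of
    inf_{eps0} { 4*eps0 + 12 * int_{eps0}^{sqrt(r)} sqrt(log N(eps) / n) d eps }.

    log_cover: callable mapping an array of eps values to metric entropy bounds.
    """
    upper = np.sqrt(r)
    eps = np.linspace(upper / grid, upper, grid)
    integrand = np.sqrt(log_cover(eps) / n)
    # Trapezoid areas on consecutive grid cells.
    seg = 0.5 * (integrand[1:] + integrand[:-1]) * np.diff(eps)
    # tail[i] approximates the integral from eps[i] to sqrt(r).
    tail = np.concatenate([np.cumsum(seg[::-1])[::-1], [0.0]])
    return float(np.min(4.0 * eps + 12.0 * tail))
```

For instance, `dudley_bound(lambda e: e ** (-1.0), r=1.0, n=10000)` evaluates the bound under a polynomial entropy condition with exponent $2\rho=1$.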

In what follows, when comparing different complexity parameters, we often use “$\lor$” to denote the maximum operator; to be interpreted correctly, it should be understood to take precedence over addition but not over multiplication. Throughout we will find it convenient to use “big-oh” notation to simplify various expressions and comparisons that capture order of magnitude effects. For two non-negative sequences $\{a_{n}\}$ and $\{b_{n}\}$, we write $a_{n}=O(b_{n})$ if $a_{n}$ can be upper bounded by $b_{n}$ up to an absolute constant for sufficiently large $n$. The same expression is often used also in the context of probabilistic statements, in which case it is interpreted as holding on an event which has a specified (typically “high”) probability. We write $a_{n}=\Omega(b_{n})$ if $b_{n}/a_{n}=O(1)$.

3.2 Loss-dependent rates via empirical risk minimization

In this section we are interested in loss-dependent rates, which should scale tightly with $\pazocal{L}^{*}:=\mathbb{P}[\ell(h^{*};z)-\inf_{\pazocal{H}}\ell(h;z)]$, the best achievable “effective loss” on $\pazocal{H}$. The following theorem characterizes the loss-dependent rate of empirical risk minimization (ERM) via a surrogate function $\psi$, its fixed point $r^{*}$, the effective loss $\pazocal{L}^{*}$ and $B$.

Theorem 1 (loss-dependent rate of ERM).

For the excess loss class $\pazocal{F}$ in (3.2), assume there is a meaningful surrogate function $\psi(r;\delta)$ that satisfies $\forall\delta\in(0,1)$ and $\forall r>0$, with probability at least $1-\delta$,

$$\sup_{f\in\pazocal{F}:\mathbb{P}[f^{2}]\leq r}(\mathbb{P}-\mathbb{P}_{n})f\leq\psi(r;\delta).$$

Then the empirical risk minimizer $\hat{h}_{\textup{ERM}}\in\operatorname*{arg\,min}_{\pazocal{H}}\{\mathbb{P}_{n}\ell(h;z)\}$ satisfies for any fixed $\delta\in(0,1)$ and $r_{0}\in(0,4B^{2})$, with probability at least $1-\delta$,

$$\mathscr{E}(\hat{h}_{\textup{ERM}})\leq\psi\left(24B\pazocal{L}^{*};\frac{\delta}{C_{r_{0}}}\right)\lor\frac{r^{*}}{6B}\lor\frac{r_{0}}{48B},$$

where $C_{r_{0}}=2\log_{2}\frac{8B^{2}}{r_{0}}$, and $r^{*}$ is the fixed point of $6B\psi\left(8r;\frac{\delta}{C_{r_{0}}}\right)$.
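The fixed point $r^{*}$ in Theorem 1 can be located numerically. The sketch below uses plain bisection and assumes that the map $\varphi(r)=6B\psi(8r;\cdot)$ satisfies $\varphi(r)\geq r$ exactly on an initial segment $[0,r^{*}]$, which holds when $\varphi$ is sub-root. The concrete surrogate $\psi(r)=\sqrt{r^{1-\rho}/n}$ (with the dependence on $\delta$ dropped) and the names `fixed_point` and `theorem1_map` are illustrative assumptions, not the paper's construction.

```python
def fixed_point(phi, upper, iters=200):
    """Bisection for the largest r in [0, upper] with phi(r) >= r.

    Assumes phi(r) >= r holds exactly on an initial segment [0, r*],
    as is the case for sub-root phi (assumption of this sketch).
    """
    if phi(upper) >= upper:
        return upper
    lo, hi = 0.0, upper
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi(mid) >= mid:
            lo = mid   # mid is at or below the fixed point
        else:
            hi = mid   # mid is above the fixed point
    return lo

def theorem1_map(B, n, rho, r):
    """Hypothetical instance of r -> 6*B*psi(8r) with psi(r) = sqrt(r**(1-rho)/n)."""
    return 6.0 * B * ((8.0 * r) ** (1.0 - rho) / n) ** 0.5
```

For this particular surrogate the fixed point also has the closed form $(36B^{2}8^{1-\rho}/n)^{1/(1+\rho)}$, which gives a direct check on the numerical answer.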

Remarks.

1) The term $r_{0}$ is negligible since it can be arbitrarily small. One can simply set $r_{0}=B^{2}/n^{4}$, which will be much smaller than $r^{*}$ in general ($r^{*}$ is at least of order $B^{2}\log\frac{1}{\delta}/n$ in the traditional “local Rademacher complexity” analysis, because this term is unavoidable in two-sided concentration inequalities).

2) In high-probability bounds, $C_{r_{0}}$ will only appear in the form $\log(C_{r_{0}}/\delta)$, which is of a negligible $O(\log\log n)$ order, so $C_{r_{0}}$ can be viewed as an absolute constant for all practical purposes. As a result, our generalization error bound can be viewed to be of the order

$$\mathscr{E}(\hat{h}_{\textup{ERM}})\leq O\left(\psi(B\pazocal{L}^{*};\delta)\lor\frac{r^{*}}{B}\right). \tag{3.3}$$

3) By using the empirical “effective loss,” $\mathbb{P}_{n}[\ell(\hat{h}_{\textup{ERM}};z)-\inf_{\pazocal{H}}\ell(h;z)]$, to estimate $\pazocal{L}^{*}$, the loss-dependent rate can be estimated from data without knowledge of $\pazocal{L}^{*}$. We defer the details to Appendix A.3.
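As a worked illustration of Remarks 1) and 2), the following short calculation evaluates the quantities in Theorem 1 under the choice $r_{0}=B^{2}/n^{4}$. It only substitutes this value into the definitions stated in the theorem.

```latex
\begin{align*}
C_{r_0} &= 2\log_2\frac{8B^2}{r_0} = 2\log_2\bigl(8n^4\bigr) = 6 + 8\log_2 n,\\
\log\frac{C_{r_0}}{\delta} &= \log\frac{1}{\delta} + \log\bigl(6 + 8\log_2 n\bigr)
  = \log\frac{1}{\delta} + O(\log\log n),\\
\frac{r_0}{48B} &= \frac{B}{48\,n^4}.
\end{align*}
```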

Comparison with existing results.

Under additional restrictions (to be explained later), the traditional analysis (2.3) leads to a loss-dependent rate of the order [3]

$$\mathscr{E}(\hat{h}_{\textup{ERM}})\leq O\left(\sqrt{\frac{\pazocal{L}^{*}r^{*}}{B}}\lor\frac{r^{*}}{B}\right), \tag{3.4}$$

which is strictly worse than our result (3.3) due to reasoning following Statement 2. When $B\pazocal{L}^{*}\leq O(r^{*})$, both (3.3) and (3.4) are dominated by the order $r^{*}/B$, so there is no difference between them. However, when $B\pazocal{L}^{*}\geq\Omega(r^{*})$, our result (3.3) will be of order $\psi(B\pazocal{L}^{*};\delta)$ and the previous result (3.4) will be of order $\sqrt{\pazocal{L}^{*}r^{*}/B}$. In this case, the square-root function $\sqrt{\pazocal{L}^{*}r^{*}/B}$ is only a coarse relaxation of $\psi(B\pazocal{L}^{*};\delta)$: as the traditional analysis requires $\psi$ to be sub-root, we can compare the two orders by

$$\psi\left(B\pazocal{L}^{*};\delta\right)\overset{\text{sub-root}}{\leq}\sqrt{\frac{B\pazocal{L}^{*}}{r^{*}}}\psi(r^{*};\delta)\overset{\text{fixed point}}{=}O\left(\sqrt{\frac{\pazocal{L}^{*}r^{*}}{B}}\right). \tag{3.5}$$

The “sub-root” inequality (the first inequality in (3.5)) becomes an equality when $\psi(r;\delta)=O(\sqrt{dr/n})$ in the parametric case, where $d$ is the parametric dimension. However, when $\pazocal{F}$ is “rich,” $\psi(r;\delta)/\sqrt{r}$ will be strictly decreasing, so that the “sub-root” inequality can become quite loose. For example, when $\pazocal{F}$ is a non-parametric class we often have $\psi(r;\delta)=O(\sqrt{r^{1-\rho}/n})$ for some $\rho\in(0,1)$. The richer $\pazocal{F}$ is (e.g., the larger $\rho$ is), the looser the “sub-root” inequality. This intuition will be validated via examples in Section 4.

Theorem 1 also applies to broader settings than previous results. For example, in [3] it is assumed that the loss is non-negative, and their original result only adapts to $\mathbb{P}\ell(h^{*};z)$ rather than the “effective loss” $\pazocal{L}^{*}$. Our proof (see Appendix A.2) is quite different, as we bypass the Bernstein condition (which is traditionally implied by non-negativity, but not satisfied by the class used here), bypass the sub-root assumption on $\psi$, and adapt to the “better” parameter $\pazocal{L}^{*}$.

3.3 Variance-dependent rates via moment penalization

The loss-dependent rate proved in Theorem 1 contains a complexity parameter $B\pazocal{L}^{*}$ within its $\psi$ function, which may still be much larger than the optimal variance $\pazocal{V}^{*}$. Despite its prevalent use in practice, standard empirical risk minimization is unable to achieve variance-dependent rates in general. An example is given in [56] where $\pazocal{V}^{*}=0$ and the optimal rate is at most $O(\log n/n)$, while $\mathscr{E}(\hat{h}_{\textup{ERM}})$ is proved to be slower than $n^{-\frac{1}{2}}$.

We follow the path of penalizing empirical second moments (or variance) [45, 71, 56, 13] to design an estimator that achieves the “right” bias-variance trade-off for general, potentially “rich,” classes. Our proposed estimator simultaneously achieves correct scaling on $\pazocal{V}^{*}$, along with minimax-optimal dependence on the sample size $n$. Besides empirical first and second moments, it only depends on the boundedness parameter $B$, a computable surrogate function $\psi$, and the confidence parameter $\delta$. All of these quantities are essentially assumed known in previous works: e.g., [45, 71] require the covering number of the loss class, which implies a computable surrogate $\psi$ via Dudley’s integral bound; and estimators in [56, 13] rely on the fixed point $r^{*}$ of a computable surrogate $\psi$.

In order to adapt to $\pazocal{V}^{*}$, we use a sample-splitting two-stage estimation procedure (this idea is inspired by the prior work [13]). Without loss of generality, we assume access to a data set of size $2n$. We split the data set into the “primary” data set $S$ and the “auxiliary” data set $S^{\prime}$, both of which are of size $n$. We denote by $\mathbb{P}_{n}$ the empirical distribution of the “primary” data set, and by $\mathbb{P}_{S^{\prime}}$ the empirical distribution of the “auxiliary” data set.

Strategy 1 (the two-stage sample-splitting estimation procedure).

At the first stage, we derive a preliminary estimate of $\pazocal{L}^{*}_{0}:=\mathbb{P}\ell(h^{*};z)$ via the “auxiliary” data set $S^{\prime}$, which we refer to as $\widehat{\pazocal{L}^{*}_{0}}$. Then, at the second stage, we perform regularized empirical risk minimization on the “primary” data set $S$, which penalizes the centered second moment $\mathbb{P}_{n}[(\ell(h;z)-\widehat{\pazocal{L}^{*}_{0}})^{2}]$.

As we will present later, it is rather trivial to obtain a qualified preliminary estimate $\widehat{\pazocal{L}^{*}_{0}}$ via empirical risk minimization. Therefore, we first introduce the second-stage moment-penalized estimator, which is more crucial and interesting.

Strategy 2 (the second-stage moment-penalized estimator).

Consider the excess loss class $\pazocal{F}$ in (3.2). Let $\psi(r;\delta)$ be a meaningful surrogate function that satisfies $\forall\delta\in(0,1)$, $\forall r>0$, with probability at least $1-\delta$,

$$4\mathfrak{R}_{n}\{f\in\pazocal{F}:\mathbb{P}_{n}[f^{2}]\leq 2r\}+\sqrt{\frac{2r\log\frac{8}{\delta}}{n}}+\frac{9B\log\frac{8}{\delta}}{n}\leq\psi(r;\delta).$$

Denote $C_{n}=2\log_{2}n+5$. Given a fixed $\delta\in(0,1)$, let the estimator $\hat{h}_{\textup{MP}}$ be

$$\hat{h}_{\textup{MP}}\in\arg\min_{\pazocal{H}}\left\{\mathbb{P}_{n}\ell(h;z)+\psi\left(16\mathbb{P}_{n}[(\ell(h;z)-\widehat{\pazocal{L}^{*}_{0}})^{2}];\frac{\delta}{C_{n}}\right)\right\}.$$

Given an arbitrary preliminary estimate $\widehat{\pazocal{L}^{*}_{0}}\in[-B,B]$, we can prove that the generalization error of the moment-penalized estimator $\hat{h}_{\textup{MP}}$ is at most

$$\mathscr{E}(\hat{h}_{\textup{MP}})\leq 2\psi\left(c_{0}\left[\pazocal{V}^{*}\lor(\widehat{\pazocal{L}^{*}_{0}}-\pazocal{L}^{*}_{0})^{2}\lor r^{*}\right];\frac{\delta}{C_{n}}\right), \tag{3.6}$$

with probability at least $1-\delta$, where $c_{0}$ is an absolute constant, and $r^{*}$ is the fixed point of $16B\psi(r;\frac{\delta}{C_{n}})$. Moreover, the first-stage estimation error will be negligible if

$$(\widehat{\pazocal{L}^{*}_{0}}-\pazocal{L}^{*}_{0})^{2}\leq O\left(r^{*}\right). \tag{3.7}$$

It is rather elementary to show that performing the standard empirical risk minimization on $S^{\prime}$ suffices to satisfy (3.7), provided an additional assumption that $\psi$ is a “sub-root” function. We now give our theorem on the generalization error following this two-stage procedure.

Theorem 2 (variance-dependent rate).

Let $\widehat{\pazocal{L}^{*}_{0}}=\inf_{\pazocal{H}}\mathbb{P}_{S^{\prime}}\ell(h;z)$ be attained via empirical risk minimization on the auxiliary data set $S^{\prime}$. Assume that the meaningful surrogate function $\psi(r;\delta)$ is “sub-root,” i.e., $\frac{\psi(r;\delta)}{\sqrt{r}}$ is non-increasing over $r\in[0,4B^{2}]$ for all fixed $\delta$. Then for any $\delta\in(0,\frac{1}{2})$, by performing the moment-penalized estimator in Strategy 2, with probability at least $1-2\delta$,

$$\mathscr{E}(\hat{h}_{\textup{MP}})\leq 2\psi\left(c_{1}\pazocal{V}^{*};\frac{\delta}{C_{n}}\right)\lor\frac{c_{1}r^{*}}{8B},$$

where $r^{*}$ is the fixed point of $B\psi(r;\frac{\delta}{C_{n}})$ and $c_{1}$ is an absolute constant.

Remarks.

1) In high-probability bounds, $C_{n}$ will only appear in the form $\log(C_{n}/\delta)$, which is of a negligible $O(\log\log n)$ order, so $C_{n}$ can effectively be viewed as a constant for all practical purposes.

2) The “sub-root” assumption in Theorem 2 is only used to bound the first-stage estimation error (see (3.7)). This assumption is not needed for the result (3.6) on the second-stage moment-penalized estimator.

3) Replacing $\pazocal{V}^{*}$ by an empirical centered second moment, we can prove a fully data-dependent generalization error bound that is computable from data without the knowledge of $\pazocal{V}^{*}$. We leave the full discussion to Appendix A.5.
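The two-stage procedure of Strategies 1 and 2 can be sketched for a finite hypothesis class as follows. This is a minimal illustration under stated assumptions: losses are supplied as matrices (one row per hypothesis), and `finite_class_psi` is a hypothetical Bernstein-type surrogate chosen for simplicity; it is not claimed to satisfy the exact condition required of $\psi$ in Strategy 2.

```python
import math
import numpy as np

def finite_class_psi(r, delta, n, m, B):
    """Illustrative surrogate for a class of m functions (assumption of this sketch)."""
    log_term = math.log(m / delta)
    return math.sqrt(2.0 * r * log_term / n) + B * log_term / n

def moment_penalized(loss_primary, loss_aux, B, delta):
    """Return (index of the selected hypothesis, first-stage estimate of L_0^*).

    loss_primary, loss_aux: array-likes of shape (m, n); row j holds the losses
    of hypothesis j on the primary set S and on the auxiliary set S'.
    """
    loss_primary = np.asarray(loss_primary, dtype=float)
    loss_aux = np.asarray(loss_aux, dtype=float)
    m, n = loss_primary.shape
    c_n = 2.0 * math.log2(n) + 5.0                  # C_n from Strategy 2
    # Stage 1: empirical risk minimization value on the auxiliary set.
    l0_hat = float(loss_aux.mean(axis=1).min())
    # Stage 2: penalize the centered empirical second moment on the primary set.
    second_moment = ((loss_primary - l0_hat) ** 2).mean(axis=1)
    penalty = np.array([finite_class_psi(16.0 * v, delta / c_n, n, m, B)
                        for v in second_moment])
    objective = loss_primary.mean(axis=1) + penalty
    return int(np.argmin(objective)), l0_hat
```

When two hypotheses have the same empirical risk, the penalty breaks the tie in favor of the one whose losses fluctuate less around the first-stage estimate.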

Comparison with existing results.

The best variance-dependent rate attained by existing estimators is of the order [13]

$$\sqrt{\frac{\pazocal{V}^{*}r^{*}}{B^{2}}}\lor\frac{r^{*}}{B},$$

which is strictly worse than the rate proved in Theorem 2. The reasoning is similar to Statement 2 and the explanation after Theorem 1: when $\pazocal{V}^{*}\leq O(r^{*})$ the two results are essentially identical, but our estimator can perform much better when $\pazocal{V}^{*}\geq\Omega(r^{*})$. Because $\psi$ is sub-root and $r^{*}$ is the fixed point, we can compare the orders of the rates

$$\psi(\pazocal{V}^{*};\delta)\overset{\text{sub-root}}{\leq}\sqrt{\frac{\pazocal{V}^{*}}{r^{*}}}\psi(r^{*};\delta)\overset{\text{fixed point}}{=}O\left(\sqrt{\frac{\pazocal{V}^{*}r^{*}}{B^{2}}}\right).$$

Since variance-dependent rates are generally used in applications that require robustness or exhibit a large worst-case boundedness parameter, $\pazocal{V}^{*}\geq r^{*}$ is the more critical regime, where one wants to ensure the estimation performance will not degrade.

Discussion.

Per our “uniform localized convergence” principle, the most obvious difficulty in proving Theorem 2 is in establishing (3.6): the empirical second moment is sample-dependent, whereas standard tools in empirical process theory (e.g., Talagrand’s concentration inequality, see Lemma 5) require the localized subsets to be independent of the samples. The core techniques in our proof essentially overcome this difficulty by concentrating data-dependent localized subsets to data-independent ones. This idea may be of independent interest; we defer details to Appendix A.4.

The tightness of our variance-dependent rates depends on the tightness of the computable surrogate function $\psi$. When covering numbers of the excess loss class are given, a direct choice is Dudley’s integral bound (see Lemma 1), which is known to be rate-optimal for many important classes.

Previous approaches usually take a simpler regularization term [45, 13] that is proportional to the square root of the empirical second moment (or empirical variance). That type of penalization is “too aggressive” for rich classes from our viewpoint. [56] propose a regularization term that preserves convexity of the empirical risk. However, based on an equivalence proved in their paper, their approach has limitations similar to those of approaches that penalize the square root of the empirical variance. Outside the study of variance-dependent rates, integral-based and local-Rademacher-complexity-based penalization is also used in model selection [42], but the setting and the goal of model selection are very different from the problem we study here.

4 Illustrative examples in the “slow rate” regime

4.1 Discussion and illustrative examples

Recall that our loss-dependent rates and variance-dependent (moment-penalized) rates are

$$\mathscr{E}(\hat{h}_{\textup{ERM}})\leq O\left(\psi(B\pazocal{L}^{*};\delta)\lor\frac{r^{*}}{B}\right)\quad[\text{Theorem 1}], \tag{4.1}$$
$$\mathscr{E}(\hat{h}_{\textup{MP}})\leq O\left(\psi(\pazocal{V}^{*};\delta)\lor\frac{r^{*}}{B}\right)\quad[\text{Theorem 2}], \tag{4.2}$$

respectively. In contrast to our results (4.1) and (4.2), the best known loss-dependent rates [3] and variance-dependent rates [13] are

$$\mathscr{E}(\hat{h}_{\textup{ERM}})\leq O\left(\sqrt{\frac{\pazocal{L}^{*}r^{*}}{B}}\lor\frac{r^{*}}{B}\right)\quad[\text{existing result [3]}], \tag{4.3}$$
$$\mathscr{E}(\hat{h}_{\text{previous}})\leq O\left(\sqrt{\frac{\pazocal{V}^{*}r^{*}}{B^{2}}}\lor\frac{r^{*}}{B}\right)\quad[\text{existing result [13]}], \tag{4.4}$$

respectively (we use $\hat{h}_{\text{previous}}$ to denote the previous best known moment-penalized estimator proposed in [13]). To illustrate the noticeable gaps between our new results and previously known ones, we compare the two different variance-dependent rates, (4.2) and (4.4), on two important families of “rich” classes: non-parametric classes of polynomial growth and VC classes. The implications of this comparison will similarly apply to loss-dependent rates.

Before presenting the advantages of the new problem-dependent rates, we would like to discuss how to compute them. In Theorem 1 and Theorem 2, the class of concentrated functions, $\pazocal{F}$, is the excess loss class $\ell\circ\pazocal{H}-\ell\circ h^{*}$ in (3.2). As we have mentioned in earlier sections, a general solution for the $\psi$ function is to use Dudley’s integral bound (Lemma 1). Knowledge of the metric entropy of the excess loss class,

$$\log\pazocal{N}(\varepsilon,\ell\circ\pazocal{H}-\ell\circ h^{*},L_{2}(\mathbb{P}_{n})),$$

can be used to calculate Dudley’s integral bound and construct the surrogate function $\psi$ needed in our theorems. Note that there is no difference between the metric entropy of the excess loss class and the metric entropy of the loss class itself: from the definition of covering number, one has

$$\pazocal{N}(\varepsilon,\ell\circ\pazocal{H}-\ell\circ h^{*},L_{2}(\mathbb{P}_{n}))=\pazocal{N}(\varepsilon,\ell\circ\pazocal{H},L_{2}(\mathbb{P}_{n})).$$

We comment that almost all existing theoretical literature that discusses general function classes and losses [3, 45, 71, 13] imposes metric entropy conditions on the loss class/excess loss class rather than the hypothesis class $\pazocal{H}$, and we follow that line as well to allow for a seamless comparison of the results. As a complement, we will discuss how to obtain such metric entropy conditions for practical applications in Section 4.2.

4.1.1 Non-parametric classes of polynomial growth

Example 1 (non-parametric classes of polynomial growth).

Consider a loss class $\ell\circ\pazocal{H}$ with the metric entropy condition

$$\log\pazocal{N}(\varepsilon,\ell\circ\pazocal{H},L_{2}(\mathbb{P}_{n}))\leq O\left(\varepsilon^{-2\rho}\right), \tag{4.5}$$

where $\rho\in(0,1)$ is a constant. Using Dudley’s integral bound to find $\psi$ and solving $r\leq O\left(B\psi(r;\delta)\right)$, it is not hard to verify that

$$\psi(r;\delta)\leq O\left(\sqrt{\frac{r^{1-\rho}}{n}}\right),\quad r^{*}\leq O\left(\frac{B^{\frac{2}{1+\rho}}}{n^{\frac{1}{1+\rho}}}\right).$$
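One way to carry out this verification is sketched below. Constants are not tracked, the entropy bound (4.5) is used with constant one, $\varepsilon_{0}$ is sent to zero in Dudley’s integral, and the additive concentration terms that a full surrogate would also carry are ignored; these simplifications are assumptions of the sketch.

```latex
\begin{align*}
\int_{0}^{\sqrt{r}}\sqrt{\frac{\varepsilon^{-2\rho}}{n}}\,d\varepsilon
  &= \frac{1}{\sqrt{n}}\cdot\frac{(\sqrt{r})^{1-\rho}}{1-\rho}
   = O\!\left(\sqrt{\frac{r^{1-\rho}}{n}}\right)
   \qquad (\text{finite since } \rho<1),\\
r = B\sqrt{\frac{r^{1-\rho}}{n}}
  \;&\Longleftrightarrow\; r^{\frac{1+\rho}{2}} = \frac{B}{\sqrt{n}}
  \;\Longleftrightarrow\; r = \frac{B^{\frac{2}{1+\rho}}}{n^{\frac{1}{1+\rho}}}.
\end{align*}
```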

As a result, our variance-dependent rate (4.2) is of the order

$$\mathscr{E}(\hat{h}_{\textup{MP}})\leq O\left({\pazocal{V}^{*}}^{\frac{1-\rho}{2}}n^{-\frac{1}{2}}\lor\frac{r^{*}}{B}\right), \tag{4.6}$$

which is $O\left({\pazocal{V}^{*}}^{\frac{1-\rho}{2}}n^{-\frac{1}{2}}\right)$ when $\pazocal{V}^{*}\geq\Omega(r^{*})$. In contrast, the previous best-known result (4.4) is of the order

$$\mathscr{E}(\hat{h}_{\textup{previous}})\leq O\left(\sqrt{\pazocal{V}^{*}}B^{-\frac{\rho}{1+\rho}}n^{-\frac{1}{2+2\rho}}\lor\frac{r^{*}}{B}\right), \tag{4.7}$$

which is $O\left(\sqrt{\pazocal{V}^{*}}B^{-\frac{\rho}{1+\rho}}n^{-\frac{1}{2+2\rho}}\right)$ when $\pazocal{V}^{*}\geq\Omega\left(r^{*}\right)$. Therefore, for an arbitrary choice of $n,\pazocal{V}^{*},B$, the “sub-optimality gap” is

$$\textup{ratio between (4.7) and (4.6)}:=\frac{\sqrt{\pazocal{V}^{*}}B^{-\frac{\rho}{1+\rho}}n^{-\frac{1}{2+2\rho}}\lor\frac{r^{*}}{B}}{{\pazocal{V}^{*}}^{\frac{1-\rho}{2}}n^{-\frac{1}{2}}\lor\frac{r^{*}}{B}}=1\lor\left(\pazocal{V}^{*}\left(\frac{n}{B^{2}}\right)^{\frac{1}{1+\rho}}\right)^{\frac{\rho}{2}}, \tag{4.8}$$

which can be arbitrarily large and grows polynomially with $n$.
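The identity in (4.8) can be checked numerically. The sketch below evaluates the orders in (4.6) and (4.7) with all hidden constants set to one (an assumption made purely for illustration) and compares their ratio with the closed-form gap; the function names are hypothetical.

```python
def example1_rates(V, B, n, rho):
    """Orders of (4.6) and (4.7) with hidden constants set to one."""
    r_star = B ** (2 / (1 + rho)) * n ** (-1 / (1 + rho))
    ours = max(V ** ((1 - rho) / 2) * n ** -0.5, r_star / B)
    # sqrt(V * r_star) / B equals sqrt(V) * B**(-rho/(1+rho)) * n**(-1/(2+2*rho)).
    previous = max((V * r_star) ** 0.5 / B, r_star / B)
    return ours, previous

def gap_formula(V, B, n, rho):
    """Right-hand side of (4.8)."""
    return max(1.0, (V * (n / B ** 2) ** (1 / (1 + rho))) ** (rho / 2))
```

Evaluating both functions on a few parameter choices, including one with $\pazocal{V}^{*}$ below $r^{*}$, shows the ratio collapsing to one in the small-variance case and growing with $n$ otherwise.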

We consider two stylized regimes as follows (we use the notation $\approx$ when the left hand side and the right hand side are of the same order).

  • The more “traditional” regime: $B\approx 1$, $\pazocal{V}^{*}\approx n^{-a}$ where $a>0$ is a fixed constant. This regime captures the traditional supervised learning problems where $B$ is not large, but one wants to use the relatively small order of $\pazocal{V}^{*}$ to achieve “faster” rates.

  • The “high-risk” regime: $B\approx n^{b}$ where $b>0$ is a fixed constant, and $\pazocal{V}^{*}\ll B^{2}$ (i.e., $\pazocal{V}^{*}$ is much smaller than order $n^{2b}$). This regime captures modern “high-risk” learning problems such as counterfactual risk minimization, policy learning, and supervised learning with a limited number of samples. In those settings, the worst-case boundedness parameter is considered to scale with $n$, so that one wants to avoid (or reduce) the dependence on $B$.

In both regimes, generalization errors via naive (non-localized) uniform convergence arguments will be worse than our approach by orders polynomial in $n$, so we only need to compare with previous variance-dependent rates.

The “traditional” regime.

The “sub-optimality gap” (4.8) is $1\lor(\pazocal{V}^{*}n^{\frac{1}{1+\rho}})^{\frac{\rho}{2}}$. It is quite clear that when $\pazocal{V}^{*}\approx n^{-a}$ where $0<a<\frac{1}{1+\rho}$, our variance-dependent rate improves over all previous generalization error rates by orders polynomial in $n$.

The “high-risk” regime.

We focus on the simple case $B^{\frac{2}{1+\rho}}\leq\pazocal{V}^{*}\ll 4B^{2}$ to gain some insight, where our result exhibits an improvement of order $O(n^{\frac{\rho}{2(1+\rho)}})$ relative to the previous result. Clearly the larger $\rho$, the more improvement we provide. By letting $\rho\rightarrow 1$ our improvement can be as large as $O(n^{\frac{1}{4}})$.

4.1.2 VC-type classes

Our next example considers VC-type classes. Although this classical example has been extensively studied in learning theory, our results provide strict improvements over antecedents.

Example 2 (VC-type classes).

One general definition of VC-type classes (which are not necessarily binary) uses the metric entropy condition. Consider a loss class $\ell\circ\pazocal{H}$ that satisfies

$$\log\pazocal{N}(\varepsilon,\ell\circ\pazocal{H},L_{2}(\mathbb{P}_{n}))\leq O\left(d\log\frac{1}{\varepsilon}\right),$$

where $d$ is the so-called Vapnik–Chervonenkis (VC) dimension [76]. Using Dudley’s integral bound to find the surrogate $\psi$ and solving $r\leq O(B\psi(r;\delta))$, it can be proven [28] that

$$\psi(r;\delta)\leq O\left(\sqrt{\frac{dr}{n}\log\frac{8B^{2}}{r}}\lor\frac{Bd}{n}\log\frac{8B^{2}}{r}\right),\quad r^{*}\leq O\left(\frac{B^{2}d\log n}{n}\right).$$

Recently, [13] proposed a moment-penalized estimator whose generalization error is of the rate

$$\mathscr{E}(\hat{h}_{\textup{previous}})\leq O\left(\sqrt{\frac{d\pazocal{V}^{*}\log n}{n}}+\frac{Bd\log n}{n}\right),$$

in the worst case without invoking other assumptions. This result has an $O(\log n)$ gap compared with the $\Omega(\sqrt{\frac{d\pazocal{V}^{*}}{n}})$ lower bound [10], which holds for arbitrary sample size. There is much recent interest focused on the question of when the sub-optimal $\log n$ factor can be removed [1, 13].

By applying Theorem 2, our refined moment-penalized estimator gives a generalization error bound of tighter rate

$$\mathscr{E}(\hat{h}_{\textup{MP}})\leq O\left(\sqrt{\frac{d\pazocal{V}^{*}\log\frac{8B^{2}}{\pazocal{V}^{*}}}{n}}\lor\frac{Bd\log n}{n}\right). \tag{4.9}$$

This closes the $O(\log n)$ gap in the regime $\pazocal{V}^{*}\geq\Omega(\frac{B^{2}}{(\log n)^{\alpha}})$, where $\alpha>0$ is an arbitrary positive constant. Though this is not the central regime, it is the first positive result that closes the notorious $O(\log n)$ gap without invoking any additional assumptions on the loss/hypothesis class (e.g., the rather complex “capacity function” assumption introduced in [13]). We anticipate additional improvements are possible under further assumptions on the hypothesis class and the loss function.
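The effect of this regime on the logarithmic factor in (4.9) can be seen from a one-line calculation, taking $\alpha$ to be a fixed constant and the $\Omega(\cdot)$ constant to be one for simplicity (both are assumptions of this illustration):

```latex
\[
\pazocal{V}^{*}\geq\frac{B^{2}}{(\log n)^{\alpha}}
\;\Longrightarrow\;
\log\frac{8B^{2}}{\pazocal{V}^{*}}
\leq\log\bigl(8(\log n)^{\alpha}\bigr)
=\log 8+\alpha\log\log n .
\]
```

Hence the factor multiplying $d\pazocal{V}^{*}/n$ inside the square root of (4.9) is at most of order $\log\log n$ in this regime, rather than $\log n$.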

4.2 Problem areas to which “localization” theory is applicable

In practical applications it is more standard to consider metric entropy conditions of the hypothesis class $\pazocal{H}$. In view of this, we introduce two important settings where the metric entropy of the loss/excess loss class can be obtained from metric entropy conditions on the hypothesis class $\pazocal{H}$. Thus, the improvements illustrated in Section 4.1 can be directly transferred to these application areas.

Supervised learning with Lipschitz continuous cost.

In supervised learning, the data $z$ is a feature-label pair $(x,y)$, and the loss $\ell(h;z)$ is of the form

$$\ell(h;z)=\ell_{\textup{sv}}(h(x),y),$$

where $\ell_{\textup{sv}}$ is a fixed cost function that is $L_{\textup{sv}}$-Lipschitz continuous with respect to its first argument. For hypothesis classes characterized by metric entropy conditions, the entropy properties are preserved because

$$\log\pazocal{N}(\varepsilon,\ell\circ\pazocal{H},L_{2}(\mathbb{P}_{n}))\leq\log\pazocal{N}\left(\frac{\varepsilon}{L_{\textup{sv}}},\pazocal{H},L_{2}(\mathbb{P}_{n})\right).$$

Note that $L_{\textup{sv}}$ only depends on the cost function and is usually of constant order. Our theory naturally applies to supervised learning problems where the cost function is Lipschitz continuous and not strongly convex (for example, the $\ell_{1}$ cost, the hinge cost, the ramp cost, etc.).

Counterfactual risk minimization.

Denote by $x\in\pazocal{X}$ the feature, by $t\in\pazocal{T}$ the treatment (e.g., $\pazocal{T}=\{0,1\}$ in binary treatment experimental design), and by $c(x,t)$ the unknown cost function. A hypothesis (policy) $h$ is a map from $\pazocal{X}\times\pazocal{T}$ to $[0,1]$ such that $\sum_{t\in\pazocal{T}}h(x,t)=1$. Thus, a hypothesis (policy) essentially maps features to a distribution over treatments. We consider the standard formulation of “learning with logged bandit feedback,” dubbed “counterfactual risk minimization” [71]: a batch of samples $\{(x_{i},t_{i},c_{i})\}_{i=1}^{n}$ is obtained by applying a known policy $h_{0}$, so that $t_{i}$ is sampled from $h_{0}(x_{i},\cdot)$ and one can only observe the cost $c_{i}$ associated with $t_{i}$. We write $z=(x,t,c)$ and let

$$\ell(h;z_{i})=\frac{c_{i}}{h_{0}(x_{i},t_{i})}h(x_{i},t_{i}) \tag{4.10}$$

be the “constructed loss” using importance sampling. It is straightforward to show that the population risk $\mathbb{P}\ell(h;z)$ is equal to the expected cost of policy $h$, so determining good policies requires one to minimize the generalization error $\mathscr{E}(\hat{h})$. It is usually convenient to obtain a metric entropy condition on the loss/excess loss class by using the linear structure of (4.10). In particular, from the Cauchy–Schwarz inequality we can prove that

$$\log\pazocal{N}(\varepsilon,\ell\circ\pazocal{H},L_{2}(\mathbb{P}_{n}))\leq\log\pazocal{N}\left(\frac{\varepsilon}{\gamma_{n}},\pazocal{H},L_{4}(\mathbb{P}_{n})\right), \tag{4.11}$$

where $\gamma_{n}:=\sqrt[4]{\mathbb{P}_{n}\left[\left(\frac{c(x,t)}{h_{0}(x,t)}\right)^{4}\right]}$ only depends on the functions $c$, $h_{0}$ in the given problem and on the samples, rather than on worst-case parameters. A systematic challenge in counterfactual risk minimization is that the worst-case boundedness parameter, $\sup_{h,z}|\ell(h;z)|$, is typically very large, since the inverse probability term $\frac{1}{h_{0}(x_{i},t_{i})}$ in (4.10) is typically large in the worst case.
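The constructed loss (4.10) and the quantity $\gamma_{n}$ in (4.11) are straightforward to compute from logged data. The sketch below assumes the logged costs, logging probabilities $h_{0}(x_{i},t_{i})$ and candidate policy probabilities $h(x_{i},t_{i})$ are given as arrays; the function names are hypothetical. The helper `l_p_distance` allows a numerical check of the Cauchy–Schwarz step behind (4.11), namely that the $L_{2}(\mathbb{P}_{n})$ distance between two constructed losses is at most $\gamma_{n}$ times the $L_{4}(\mathbb{P}_{n})$ distance between the two policies.

```python
import numpy as np

def constructed_loss(cost, logging_prob, policy_prob):
    """Per-sample importance-weighted loss c_i / h0(x_i, t_i) * h(x_i, t_i)."""
    cost = np.asarray(cost, dtype=float)
    logging_prob = np.asarray(logging_prob, dtype=float)
    policy_prob = np.asarray(policy_prob, dtype=float)
    return cost / logging_prob * policy_prob

def gamma_n(cost, logging_prob):
    """Empirical fourth-moment scale of the importance-weighted cost."""
    ratio = np.asarray(cost, dtype=float) / np.asarray(logging_prob, dtype=float)
    return float(np.mean(ratio ** 4) ** 0.25)

def l_p_distance(u, v, p):
    """L_p(P_n) distance between two vectors of function values."""
    diff = np.abs(np.asarray(u, dtype=float) - np.asarray(v, dtype=float))
    return float(np.mean(diff ** p) ** (1.0 / p))
```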

5 Problem-dependent rates in the parametric “fast rate” regime

5.1 Background

When assuming suitable curvature or margin conditions, the direct dependence of the generalization error on $n$ is typically faster than $O(n^{-\frac{1}{2}})$. We call this the “fast rate” regime. A well-known example is when the hypothesis class is parametrized and the loss is strongly convex with respect to the parameter, in which case the direct dependence of the generalization error on $n$ is typically the “parametric rate” $O(n^{-1})$. In the “fast rate” regime, existing problem-dependent rates are mostly studied in supervised learning with structured convex cost; see Section 2.4 for a historical review of existing localization approaches and problem-dependent rates. Through our proposed “uniform localized convergence” procedure, we can recover results in [50, 53] for supervised learning problems with structured convex cost, and our approach is able to systematically weaken some restrictions. (We defer the discussion to Section 8 as the contributions there mostly lie in unification and some technical improvements.)

In this and the next two sections we study more modern applications in the fast rate regime, focusing on computationally efficient estimators and algorithms, where the derivation of sharp problem-dependent rates remains an open question. A secondary objective is to provide unification to vector-based uniform convergence analysis. The main focal points are the following.

Non-convexity, stationary points, and iterative algorithms.

Classical generalization error analysis relies on properties of global empirical risk minimizers. However, many important machine learning problems are non-convex. For those problems, guarantees on global empirical risk minimizers are not sufficient, and therefore one typically targets guarantees on stationary points and the iterates produced by optimization algorithms. This motivates us to study uniform convergence of gradients and sample-based iterative algorithms. Existing generalization error bounds in this area are typically not localized, and connections to traditional localization frameworks are not fully understood.

General formulation of stochastic optimization.

Existing problem-dependent rates mostly focus on supervised learning settings, with specialized assumptions on the problem structure. Hence, it is important to characterize problem-dependent generalization error bounds for more general stochastic optimization problems, in which the classical asymptotic results do not depend on the parametric dimension dd and global parameters. Existing methods, however, typically give rise to dimension-dependent factors and “large” global parameters.

Organization of this section is as follows. In Section 5.2, we will strengthen the “uniform convergence of gradients” idea by developing a theory for “uniform localized convergence of gradients.” In Section 5.3 we will provide problem-dependent rates for approximate stationary points of empirical risk and iterates of the gradient descent algorithm.

5.2 Theoretical foundations

Recently, the idea of “uniform convergence of gradients” [47, 12, 9] has been applied successfully to many non-convex learning and stochastic optimization problems. These works do not consider problem-dependent rates, and their results typically rely on various global parameters, like global Lipschitz constants and the radius of the parametric set. In this subsection we strengthen these ideas by developing a theory of “uniform localized convergence of gradients.” This theory will prove to be more powerful in deriving problem-dependent rates. Before stating a key assumption, we introduce some additional notation.

Notation for the parametric “fast rate” regime.

We write the loss function as $\ell(\theta;z)$ where $\theta\in\mathbb{R}^{d}$ is the parameter representation of the hypothesis $h$. Consider a compact set $\Theta\subseteq\mathbb{R}^{d}$ and let $\theta^{*}$ be the best parameter within $\Theta$, which satisfies $\theta^{*}\in\operatorname*{arg\,min}_{\Theta}\mathbb{P}\ell(\theta;z)$. Let $\|\cdot\|$ denote the $L_{2}$ norm in $\mathbb{R}^{d}$; most of our results can be generalized to matrix learning problems by considering the Frobenius norm. We let $\pazocal{B}^{d}(\theta_{0},\rho):=\{\theta\in\mathbb{R}^{d}:\|\theta-\theta_{0}\|\leq\rho\}$ denote a ball with center $\theta_{0}\in\mathbb{R}^{d}$ and radius $\rho$. We assume that there are two radii $\Delta_{m},\Delta_{M}$ such that $\pazocal{B}^{d}(\theta^{*},\Delta_{m})\subseteq\Theta\subseteq\pazocal{B}^{d}(\theta^{*},\Delta_{M})$. We would like to provide guarantees for the generalization error

(θ^):=(θ^;z)(θ;z).\displaystyle\mathscr{E}(\hat{\theta}):=\mathbb{P}\ell(\hat{\theta};z)-\mathbb{P}\ell(\theta^{*};z).

We state a key assumption of our framework.

Assumption 1 (statistical noise of smooth population risk).

For all $\theta_{1},\theta_{2}\in\Theta$, $\frac{\nabla\ell(\theta_{1};z)-\nabla\ell(\theta_{2};z)}{\|\theta_{1}-\theta_{2}\|}$ is a $\beta$-sub-exponential random vector. Formally, there exists $\beta>0$ such that for any unit vector $u\in\pazocal{B}^{d}(0,1)$ and any $\theta_{1},\theta_{2}\in\Theta$,

$$\mathbb{E}\left\{\exp\left(\frac{|u^{T}(\nabla\ell(\theta_{1};z)-\nabla\ell(\theta_{2};z))|}{\beta\|\theta_{1}-\theta_{2}\|}\right)\right\}\leq 2.$$

From Jensen’s inequality and convexity of the exponential function, a simple consequence of Assumption 1 is that the population risk is β\beta-smooth: for any θ1,θ2Θ\theta_{1},\theta_{2}\in\Theta,

(θ1;z)(θ2;z)βθ1θ2.\displaystyle\|\mathbb{P}\nabla\ell(\theta_{1};z)-\mathbb{P}\nabla\ell(\theta_{2};z)\|\leq\beta\|\theta_{1}-\theta_{2}\|. (5.1)
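The Jensen step can be spelled out as follows. This is a short sketch: the shorthand $X=u^{T}(\nabla\ell(\theta_{1};z)-\nabla\ell(\theta_{2};z))$ and $K=\beta\|\theta_{1}-\theta_{2}\|$, as well as the intermediate constant $\log 2$, are our own bookkeeping and do not appear in the text.

```latex
\begin{align*}
\exp\left(\frac{\mathbb{E}|X|}{K}\right)
  &\leq \mathbb{E}\exp\left(\frac{|X|}{K}\right) \leq 2
  && \text{(Jensen, then Assumption 1)} \\
\Longrightarrow\quad |\mathbb{E}X| \leq \mathbb{E}|X|
  &\leq K\log 2 \leq \beta\|\theta_{1}-\theta_{2}\| , \\
\|\mathbb{P}\nabla\ell(\theta_{1};z)-\mathbb{P}\nabla\ell(\theta_{2};z)\|
  &= \sup_{\|u\|=1} u^{T}\,\mathbb{P}\left(\nabla\ell(\theta_{1};z)-\nabla\ell(\theta_{2};z)\right)
  \leq \beta\|\theta_{1}-\theta_{2}\| .
\end{align*}
```

The last line takes the supremum over unit vectors $u$ of the bound on $|\mathbb{E}X|$ from the second line.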

Smoothness of the population risk is a standard assumption in the optimization literature. Compared with the smoothness condition (5.1), Assumption 1 is a stronger distributional assumption: it requires the random vectors $\left\{\frac{\nabla\ell(\theta_{1};z)-\nabla\ell(\theta_{2};z)}{\|\theta_{1}-\theta_{2}\|}\right\}$ to be $\beta$-sub-exponential for all $\theta_{1},\theta_{2}\in\Theta$, while the smoothness condition (5.1) only concerns the expectation of these random vectors. Certain distributional assumptions are necessary to analyze the generalization performance of unbounded losses (e.g., see the “Hessian statistical noise” assumption in [47], and the “moment-equivalence” assumptions in [51, 54]). Assumption 1 is applicable to many smooth machine learning models studied in previous literature, and its verification is often no harder than verification of smoothness conditions.

Lemma 2 (Hessian statistical noise implies Assumption 1).

Assumption 1 is satisfied if for all $\theta\in\Theta$, the random Hessian matrix $\nabla^{2}\ell(\theta;z)$ is a $\beta$-sub-exponential matrix. Formally, Assumption 1 is satisfied when there exists $\beta>0$ such that for any unit vectors $u_{1},u_{2}\in\pazocal{B}^{d}(0,1)$ and any $\theta\in\Theta$,

$$\mathbb{E}\left\{\exp\left(\frac{1}{\beta}|u_{1}^{T}\nabla^{2}\ell(\theta;z)u_{2}|\right)\right\}\leq 2. \tag{5.2}$$

Therefore, one only needs to compute the Hessian matrices and verify they are sub-exponential over $\Theta$. For instance, many statistical estimation problems have $\nabla^{2}\ell(\theta;z)$ proportional to $zz^{T}$ (or $xx^{T}$ when the problem is a supervised learning problem and $z=(x,y)$ is the feature-label pair). By assuming the data $z$ (or $x$) is a $\tau$-sub-Gaussian vector (a random vector $z$ is called $\tau$-sub-Gaussian if for all unit vectors $u\in\pazocal{B}^{d}(0,1)$, $u^{T}z$ is a $\tau$-sub-Gaussian variable; see Definition 6 for details), $zz^{T}$ (or $xx^{T}$) becomes a $\tau^{2}$-sub-exponential matrix. If the remaining quantities in $\nabla^{2}\ell(\theta;z)$ can be uniformly bounded by some constant $C_{0}$, then Assumption 1 holds with $\beta=C_{0}\tau^{2}$.
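The sub-Gaussian-to-sub-exponential step can be checked in a few lines. The sketch below assumes that "$\tau$-sub-Gaussian" means $\mathbb{E}\exp((u^{T}z)^{2}/\tau^{2})\leq 2$ for every unit vector $u$; this is our assumption about the normalization and may differ from Definition 6 by an absolute constant.

```latex
\begin{align*}
\mathbb{E}\exp\left(\frac{|u_{1}^{T}zz^{T}u_{2}|}{\tau^{2}}\right)
 &\leq \mathbb{E}\left[\exp\left(\frac{(u_{1}^{T}z)^{2}}{2\tau^{2}}\right)
        \exp\left(\frac{(u_{2}^{T}z)^{2}}{2\tau^{2}}\right)\right]
 && \left(|ab|\leq \tfrac{1}{2}(a^{2}+b^{2})\right) \\
 &\leq \sqrt{\mathbb{E}\exp\left(\frac{(u_{1}^{T}z)^{2}}{\tau^{2}}\right)\,
             \mathbb{E}\exp\left(\frac{(u_{2}^{T}z)^{2}}{\tau^{2}}\right)}
 && \text{(Cauchy--Schwarz)} \\
 &\leq \sqrt{2\cdot 2} = 2 .
\end{align*}
```

If the Hessian equals $zz^{T}$ times a scalar factor bounded in absolute value by $C_{0}$, dividing by $C_{0}\tau^{2}$ instead of $\tau^{2}$ gives the same bound, which is condition (5.2) with $\beta=C_{0}\tau^{2}$.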

We carefully choose a function class G={g(θ,v):θΘ,vBd(0,max{ΔM,1n})}\pazocal{G}=\left\{g_{(\theta,v)}:\theta\in\Theta,v\in\pazocal{B}^{d}(0,\max\{\Delta_{M},\frac{1}{n}\})\right\} to apply concentration, where each element is a function g(θ,v):Zg_{(\theta,v)}:\pazocal{Z}\rightarrow\mathbb{R} defined by

g(θ,v)(z)=((θ;z)(θ;z))Tv.\displaystyle g_{(\theta,v)}(z)=(\nabla\ell(\theta;z)-\nabla\ell(\theta^{*};z))^{T}v.

By applying Proposition 1 with the “concentrated functions” $g_{(\theta,v)}$ and the “measurement functional” $T$ defined by $T((\theta,v))=\|\theta-\theta^{*}\|^{2}+\|v\|^{2}$, we obtain the key argument of “uniform localized convergence” for gradients.

Proposition 2 (uniform localized convergence of gradients).

Under Assumption 1, δ(0,1)\forall\delta\in(0,1), with probability at least 1δ1-\delta, for all θΘ\theta\in\Theta,

$$\left\|(\mathbb{P}-\mathbb{P}_{n})(\nabla\ell(\theta;z)-\nabla\ell(\theta^{*};z))\right\|\leq c\beta\max\left\{\|\theta-\theta^{*}\|,\frac{1}{n}\right\}\left(\sqrt{\frac{d+\log\frac{4\log_{2}(2n\Delta_{M}+2)}{\delta}}{n}}+\frac{d+\log\frac{4\log_{2}(2n\Delta_{M}+2)}{\delta}}{n}\right), \tag{5.3}$$

where cc is an absolute constant.

Distinct from previous results on “uniform convergence of gradients,” which give the same upper bound on $\|(\mathbb{P}-\mathbb{P}_{n})\nabla\ell(\theta;z)\|$ for all $\theta\in\Theta$, the right hand side of (5.3) in Proposition 2 scales linearly with $\|\theta-\theta^{*}\|$ for all $\theta$ such that $\|\theta-\theta^{*}\|\geq\frac{1}{n}$. Therefore, Proposition 2 provides a refined, “localized” upper bound on the concentration of gradients. This property makes Proposition 2 the key tool in deriving problem-dependent rates.
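The linear scaling is easiest to see in a toy case. The sketch below is an illustration under our own assumptions, not an experiment from the paper: it uses the squared loss $\frac{1}{2}(\theta^{T}x-y)^{2}$ with standard Gaussian features, for which $\nabla\ell(\theta;z)-\nabla\ell(\theta^{*};z)=xx^{T}(\theta-\theta^{*})$, so the labels cancel and the centered gradient deviation is exactly $(\Sigma-\Sigma_{n})(\theta-\theta^{*})$. All variable and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 500, 10

# Assumption: features are standard Gaussian, so the population covariance is the identity.
X = rng.standard_normal((n, d))
theta_star = np.zeros(d)

sigma_pop = np.eye(d)        # population second-moment matrix E[x x^T]
sigma_emp = X.T @ X / n      # empirical second-moment matrix


def gradient_gap(theta):
    """Norm of (P - P_n)(grad l(theta) - grad l(theta_star)) for the squared loss."""
    return float(np.linalg.norm((sigma_pop - sigma_emp) @ (theta - theta_star)))


# Move away from theta_star along one fixed direction at several radii.
direction = rng.standard_normal(d)
direction /= np.linalg.norm(direction)
radii = [0.01, 0.1, 1.0, 10.0]

gaps = [gradient_gap(theta_star + r * direction) for r in radii]
ratios = [g / r for g, r in zip(gaps, radii)]

# Spectral norm of the covariance deviation: a uniform bound on every ratio.
op_norm = float(np.linalg.norm(sigma_pop - sigma_emp, ord=2))

for r, g in zip(radii, gaps):
    print(f"radius={r:6.2f}  gap={g:.6f}  gap/radius={g / r:.6f}")
```

In this special case the deviation is exactly proportional to the distance from $\theta^{*}$, with a proportionality factor bounded by the operator norm of $\Sigma-\Sigma_{n}$. A non-localized bound would instead charge every $\theta$ the deviation at the largest radius.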

5.3 Main results

In order to obtain tight problem-dependent rates, we require a very mild assumption on the noise at the optimal point θ\theta^{*}.

Assumption 2 (noise at the optimal point).

The gradient at θ\theta^{*} satisfies the Bernstein condition: there exists G>0G_{*}>0 such that for all 2kn2\leq k\leq n,

𝔼[(θ;z)k]12k!𝔼[(θ;z)2]Gk2.\displaystyle\mathbb{E}[\|\nabla\ell(\theta^{*};z)\|^{k}]\leq\frac{1}{2}k!\mathbb{E}[\|\nabla\ell(\theta^{*};z)\|^{2}]G_{*}^{k-2}. (5.4)

We note that this assumption is very mild because $G_{*}$ only depends on gradients at $\theta^{*}$, and $G_{*}$ will only appear in the $O(n^{-2})$ terms in our theorems. Our approach also allows many other noise conditions at $\theta^{*}$ (e.g., those for heavy-tailed noise). At a high level, the order of our generalization error bounds only depends on the concentration of $\mathbb{P}_{n}\nabla\ell(\theta^{*};z)$ relative to $\mathbb{P}\nabla\ell(\theta^{*};z)$, which involves only the noise at the single point $\theta^{*}$ and can be analyzed under various types of noise conditions. We introduce Assumption 2 here because it leads to the asymptotically optimal problem-dependent parameter $\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]$ and simplifies comparison with previous literature.

Now we turn to establish problem-dependent rates under a curvature condition. While our methodology is widely-applicable without restriction to particular curvature conditions, we will focus on the Polyak-Lojasiewicz (PL) condition, which is known to be one of the weakest conditions that guarantee linear convergence of optimization algorithms [24] as well as fast-rate generalization error [12].

Assumption 3 (Polyak-Lojasiewicz condition).

There exists $\mu>0$ such that for all $\theta\in\Theta$,

(θ;z)(θ;z)(θ;z)22μ.\displaystyle\mathbb{P}\ell(\theta;z)-\mathbb{P}\ell(\theta^{*};z)\leq\frac{\|\mathbb{P}\nabla\ell(\theta;z)\|^{2}}{2\mu}.

The PL condition is weaker than many others recently introduced in the areas of matrix recovery, deep learning, and learning dynamical systems, such as “one-point convexity” [36, 25], “star convexity” [86], and “quasar convexity” [22, 20], not to mention the classical “strong convexity.” It is also referred to as “gradient dominance condition” in previous literature [12]. Under suitable assumptions on their inputs, many popular non-convex models have been shown to satisfy the PL condition (sometimes locally rather than globally). An incomplete list of these models includes: matrix factorization [39], neural networks with one hidden layer [36], ResNets with linear activations [19], binary linear classification [47], robust regression [47], phase retrieval [70], blind deconvolution [35], linear dynamical systems [20], mixture of two Gaussians [2], to name a few.
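As a reminder of why strong convexity is the strongest condition in this list, the standard one-line argument is sketched below. The shorthand $F(\theta):=\mathbb{P}\ell(\theta;z)$ is ours, and $\mu$-strong convexity is taken in the usual first-order form $F(\theta_{1})\geq F(\theta_{2})+\nabla F(\theta_{2})^{T}(\theta_{1}-\theta_{2})+\frac{\mu}{2}\|\theta_{1}-\theta_{2}\|^{2}$.

```latex
\begin{align*}
F(\theta^{*})
 &\geq F(\theta)+\nabla F(\theta)^{T}(\theta^{*}-\theta)+\frac{\mu}{2}\|\theta^{*}-\theta\|^{2} \\
 &\geq F(\theta)+\min_{v\in\mathbb{R}^{d}}\left\{\nabla F(\theta)^{T}v+\frac{\mu}{2}\|v\|^{2}\right\}
  = F(\theta)-\frac{\|\nabla F(\theta)\|^{2}}{2\mu},
\end{align*}
```

Rearranging gives $F(\theta)-F(\theta^{*})\leq\frac{\|\nabla F(\theta)\|^{2}}{2\mu}$, which is the inequality in Assumption 3. The converse fails in general, since the PL inequality does not require convexity.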

While the PL condition is known to be one of the weakest conditions that can be used to establish linear convergence to the global minimum (see [24] for its relationship with other common curvature conditions), the generalization aspects of such structural non-convex learning problems have not been fully understood. In particular, existing generalization error bounds often contain global Lipschitz parameters that can be large for unbounded smooth losses.

Our next theorem provides problem-dependent bounds for approximate stationary points of the empirical risk, under the PL condition of the population risk. The theorem implies that optimization procedures that find stationary points of the empirical risk are also learning algorithms for the population risk.

Theorem 3 (generalization error of the approximate stationary point).

Under Assumptions 1, 2 and 3, δ(0,1)\forall\delta\in(0,1), with probability at least 1δ1-\delta, we have the following results:

(a) there exist approximate stationary points of the empirical risk, $\hat{\theta}\in\Theta$, such that

$$\|\mathbb{P}_{n}\nabla\ell(\hat{\theta};z)\|\leq\sqrt{\frac{2\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{n}}+\frac{G_{*}\log\frac{4}{\delta}}{n}. \tag{5.5}$$

(b) for all θ^\hat{\theta} that satisfy (5.5), when ncβ2μ2(d+log4log(2nΔM+1)δ)n\geq\frac{c\beta^{2}}{\mu^{2}}\left(d+\log\frac{4\log(2n\Delta_{M}+1)}{\delta}\right), where cc is an absolute constant, we have

(θ^)64[(θ;z)2]log4δμn+32G2log24δ+4μ2μn2.\displaystyle\mathscr{E}(\hat{\theta})\leq\frac{64\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{\mu n}+\frac{32G_{*}^{2}\log^{2}\frac{4}{\delta}+4\mu^{2}}{\mu n^{2}}.

Ignoring higher order terms and absolute constants, Theorem 3 implies a problem-dependent bound

(θ^)O([(θ;z)2]log1δμn),\displaystyle\mathscr{E}(\hat{\theta})\leq O\left(\frac{\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{1}{\delta}}{\mu n}\right), (5.6)

which scales tightly with the problem-dependent parameter $\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]$. The proof of Theorem 3 can be found in Appendix B.4. Optimality and implications of this problem-dependent rate will be discussed shortly after we present an additional theorem.

Many optimization algorithms, including gradient descent [24], stochastic gradient descent [14], non-convex SVRG [63] and SCSG [34], can efficiently find (approximate) stationary points of the empirical risk. However, convergence of these optimization algorithms is mostly studied under assumptions on the empirical risk. The next theorem demonstrates that under assumptions on the population risk, gradient descent provably achieves “small” generalization error. Results of this type are challenging to prove because properties of the population risk may not transfer to the empirical risk.

Consider the gradient descent algorithm with fixed step size α\alpha and initialization θ0\theta^{0}, generating iterates according to

θt+1=θtαn(θt;z),t=0,1,\displaystyle\theta^{t+1}=\theta^{t}-\alpha\mathbb{P}_{n}\nabla\ell(\theta^{t};z),\quad t=0,1,\dots (5.7)
Theorem 4 (generalization error of gradient descent).

Suppose Assumptions 1, 2 and 3 hold. Then for an initialization $\theta^{0}\in\pazocal{B}^{d}(\theta^{*},\sqrt{\frac{\mu}{\beta}}\Delta_{m})$ and step size $\alpha=\frac{1}{\beta}$, the iterates of the gradient descent algorithm (5.7) satisfy, for any fixed $\delta\in(0,1)$, with probability at least $1-\delta$, for all $t=0,1,\dots$,

(θt)16[(θ;z)2]log4δμn+8G2log24δ+μ2μn2statistical error+(1μ2β)t(θ0),\displaystyle\mathscr{E}(\theta^{t})\leq\underbrace{\frac{16\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{\mu n}+\frac{8G_{*}^{2}\log^{2}\frac{4}{\delta}+\mu^{2}}{\mu n^{2}}}_{\textup{statistical error}}+(1-\frac{\mu}{2\beta})^{t}\mathscr{E}(\theta^{0}), (5.8)

provided that the sample size $n$ is large enough such that the “statistical error” term in (5.8) is smaller than $\frac{\mu}{2}\Delta_{m}^{2}$ and $n\geq\frac{c\beta^{2}}{\mu^{2}}\left(d+\log\frac{8\log_{2}(2n\Delta_{M}+2)}{\delta}\right)$, where $c$ is an absolute constant.

Theorem 4 is the first broad scope result on the generalization error of gradient descent under the PL condition. It implies that after a logarithmic number of iterations, gradient descent achieves the problem-dependent rate (5.6). Note that the algorithm only requires the initialization condition in the theorem, rather than any knowledge of Θ\Theta and the problem-dependent parameters. The proof of Theorem 4 can be found in Appendix B.4, and the main idea is applicable to many other optimization algorithms as well. For example, in Section 7 we provide a similar analysis to the first-order Expectation-Maximization algorithm.
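The update (5.7) is straightforward to simulate. The following is a minimal sketch under our own assumptions rather than the authors' experimental setup: a well-specified least-squares problem with standard Gaussian features, so that the population risk has $\mu=\beta=1$ and the excess risk is $\frac{1}{2}\|\theta-\theta^{*}\|^{2}$. The function names and the problem sizes are hypothetical, and the constraint set $\Theta$ is ignored.

```python
import numpy as np


def gradient_descent(grad_fn, theta0, step_size, num_iters):
    """Iterate theta <- theta - step_size * grad_fn(theta), as in update (5.7)."""
    theta = np.asarray(theta0, dtype=float).copy()
    iterates = [theta.copy()]
    for _ in range(num_iters):
        theta = theta - step_size * grad_fn(theta)
        iterates.append(theta.copy())
    return iterates


rng = np.random.default_rng(0)
n, d = 2000, 5
theta_star = np.ones(d)

# Assumption: standard Gaussian features and small Gaussian label noise.
X = rng.standard_normal((n, d))
y = X @ theta_star + 0.1 * rng.standard_normal(n)


def empirical_grad(theta):
    """Gradient of the empirical risk for the loss 0.5 * (theta^T x - y)^2."""
    return X.T @ (X @ theta - y) / n


def excess_risk(theta):
    """Population excess risk; equals 0.5 * ||theta - theta_star||^2 for identity covariance."""
    diff = theta - theta_star
    return 0.5 * float(diff @ diff)


beta = 1.0  # smoothness of the population risk in this toy model
iterates = gradient_descent(empirical_grad, np.zeros(d), 1.0 / beta, 50)
risks = [excess_risk(t) for t in iterates]

print(f"initial excess risk: {risks[0]:.4f}")
print(f"final excess risk:   {risks[-1]:.2e}")
```

The optimization error contracts geometrically, after which the excess risk plateaus at a level set by the noise at $\theta^{*}$ and the sample size. This mirrors the split in (5.8) into a statistical error term and a term decaying in $t$.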

Optimality of the problem-dependent rates in Theorem 3 and Theorem 4.

It is well-known in the asymptotic statistics literature [76] that when nn tends to infinity, under mild local conditions,

n(θ^ERMθ)𝑑N(𝟎,H1QH1),\displaystyle\sqrt{n}(\hat{\theta}_{\textup{ERM}}-\theta^{*})\overset{d}{\rightarrow}N({\bf 0},H^{-1}QH^{-1}), (5.9)

where H=2(θ;z)H=\mathbb{P}\nabla^{2}\ell(\theta^{*};z), Q=[(θ;z)(θ;z)T]Q=\mathbb{P}[\nabla\ell(\theta^{*};z)\nabla\ell(\theta^{*};z)^{T}], and 𝑑\overset{d}{\rightarrow} means convergence in distribution. The asymptotic rate (5.9) is often information theoretically optimal under suitable conditions [75] (e.g., it matches the Hájek-Le Cam asymptotic minimax lower bound [18, 31] when the loss (θ;z)\ell(\theta;z) is a log likelihood function). The generalization error bounds in Theorem 3 and Theorem 4, which are of the order

(θ^)O([(θ;z)2]log1δμn),\displaystyle\mathscr{E}(\hat{\theta})\leq O\left(\frac{\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{1}{\delta}}{\mu n}\right), (5.10)

are natural finite-sample versions of the “ideal” asymptotic benchmark (5.9).
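One informal way to see the correspondence between (5.9) and (5.10) is via a second-order Taylor expansion of the population risk around $\theta^{*}$. The sketch below is heuristic: it ignores remainder terms, uses $\mathbb{P}\nabla\ell(\theta^{*};z)=0$ (since $\theta^{*}$ is an interior minimizer), and assumes $H$ is positive definite with $\lambda_{\min}(H)\geq\mu$, as holds under strong convexity. The vector $Z$ denotes a draw from the limiting distribution in (5.9).

```latex
\begin{align*}
\mathscr{E}(\hat{\theta}_{\textup{ERM}})
 &\approx \tfrac{1}{2}(\hat{\theta}_{\textup{ERM}}-\theta^{*})^{T}H(\hat{\theta}_{\textup{ERM}}-\theta^{*}),
 \qquad Z\sim N(\mathbf{0},H^{-1}QH^{-1}), \\
n\,\mathscr{E}(\hat{\theta}_{\textup{ERM}})
 &\overset{d}{\rightarrow} \tfrac{1}{2}Z^{T}HZ,
 \qquad \mathbb{E}\left[\tfrac{1}{2}Z^{T}HZ\right]=\tfrac{1}{2}\operatorname{tr}(H^{-1}Q), \\
\tfrac{1}{2}\operatorname{tr}(H^{-1}Q)
 &\leq \frac{\operatorname{tr}(Q)}{2\lambda_{\min}(H)}
 \leq \frac{\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]}{2\mu}.
\end{align*}
```

The last step uses $\operatorname{tr}(Q)=\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]$, so the asymptotic excess risk is of order $\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]/(\mu n)$, matching (5.10) up to the $\log\frac{1}{\delta}$ factor.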

In both Theorem 3 and Theorem 4, the sample complexity required to make the generalization error smaller than a fixed ε>0\varepsilon>0 is

nΩ(β2dμ2+[(θ;z)2]log1δμε).\displaystyle n\geq\Omega\left(\frac{\beta^{2}d}{\mu^{2}}+\frac{\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{1}{\delta}}{\mu\varepsilon}\right). (5.11)

Here we only consider the “interesting” case of $\varepsilon\in(0,\frac{\mu\Delta_{m}^{2}}{2})$ in Theorem 4; otherwise the initialization point $\theta^{0}$ will satisfy $\mathscr{E}(\theta^{0})\leq\varepsilon$. Clearly, the sample complexity threshold $n\geq\Omega({\beta^{2}d}/{\mu^{2}})$ scales with the dimension $d$. This threshold is sharp up to absolute constants: there exist simple linear regression constructions where we require $\Omega({\beta^{2}d}/{\mu^{2}})$ samples [38] to make the empirical Hessian positive definite. As a result, our overall sample complexity (5.11) is essentially the sharpest result one can expect under the aforementioned assumptions.

6 Applications to non-convex learning and stochastic optimization

In this section we will compare our problem-dependent rates with previous results from two areas: non-convex learning and stochastic optimization. Another topic that nicely illustrates the advantages of our approach, Expectation-Maximization algorithms for missing data problems, is deferred to Section 7.

6.1 Non-convex learning under curvature conditions

In this subsection we discuss generalization error bounds for non-convex losses that satisfy the Polyak-Lojasiewicz (PL) condition. The PL condition is one of the weakest curvature conditions that have been rigorously and extensively studied in the areas of matrix recovery, deep learning, learning dynamical systems and learning mixture models. See our prior discussion under Assumption 3 and the references therein for representative models that satisfy this condition, and its relationship with other curvature conditions. The topic has attracted much recent attention because there is some empirical evidence suggesting that modern deep neural networks might satisfy this condition in large neighborhoods of global minimizers [25, 86].

For structured non-convex learning problems, a benchmark approach to prove generalization error bounds is “uniform convergence of gradients.” [47] presents the “uniform convergence of gradients” principle and proves dimension-dependent generalization error bounds to several representative problems; [12] extends this idea to obtain norm-based generalization error bounds. We will compare our problem-dependent rates with these results.

Comparison with the results in Mei et al. [47].

The main regularity assumptions imposed in [47] include: 1) an assumption on the statistical noise of the Hessian matrices, whose content is similar to our Assumption 1; and 2) an assumption that the random gradients $\nabla\ell(\theta;z)$ are $G$-sub-Gaussian for all $\theta\in\Theta$, which is not used in our framework (in contrast, we only impose a mild assumption on the gradient noise at $\theta^{*}$). They also assume the Hessian is Lipschitz continuous, which we view as redundant.

The theoretical foundation in [47] is the following result on the (global) uniform convergence of gradients: with probability at least 1δ1-\delta,

supθΘ(n)(θ;z)O(Gdlognlog1δn).\displaystyle\sup_{\theta\in\Theta}\|(\mathbb{P}-\mathbb{P}_{n})\nabla\ell(\theta;z)\|\leq O\left(G\sqrt{\frac{d\log n\log\frac{1}{\delta}}{n}}\right). (6.1)

The sub-Gaussian parameter $G$ is larger than the global Lipschitz constant, and can be quite large in practical applications. From (6.1), when the population risk satisfies the PL condition, the generalization error for a stationary point $\hat{\theta}$ of the empirical risk can be bounded as follows:

(θ^)O(G2dlognlog1δμn).\displaystyle\mathscr{E}(\hat{\theta})\leq O\left(\frac{G^{2}d\log n\log\frac{1}{\delta}}{\mu n}\right). (6.2)

[47] also provides guarantees for iterates of the gradient descent algorithm, but the analysis is specialized to the three applications considered in the paper. It is worth mentioning that [47] also studies the high-dimensional setting and provides a series of important results; we will not compare with those.

Our approach improves both the result (6.2) as well as the methodology (6.1) as follows.

  • Our Theorem 3 and Theorem 4 provide generalization error bounds for approximate stationary points and iterates of the gradient descent algorithm, which are of the order

    (θ^)O([(θ;z)2]log1δμn)\displaystyle\mathscr{E}(\hat{\theta})\leq O\left(\frac{\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{1}{\delta}}{\mu n}\right)

    Focusing on the dominating $O(n^{-1})$ term, our new result replaces $O(G^{2}d\log n)$ in the numerator with the typically much smaller localized quantity $\mathbb{P}\|\nabla\ell(\theta^{*};z)\|^{2}$. In fact, from the definition of sub-Gaussian vectors, one can prove (see, e.g., [78]) that

    [(θ;z)2]supΘ(θ;z)28G2d,\displaystyle\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\ll\sup_{\Theta}\mathbb{P}\|\nabla\ell(\theta;z)\|^{2}\leq 8G^{2}d,

    so our bounds are always more favorable than (6.2). In passing, we also remove a superfluous logn\log n factor by using generic chaining rather than simple discretization.

  • Our Proposition 2 on the localized uniform convergence of gradients,

    (n)((θ;z)(θ;z))O(βθθd+log2log(2nΔM)δn),\displaystyle\|(\mathbb{P}-\mathbb{P}_{n})(\nabla\ell(\theta;z)-\nabla\ell(\theta^{*};z))\|\leq O\left(\beta\|\theta-\theta^{*}\|\sqrt{\frac{d+\log\frac{2\log(2n\Delta_{M})}{\delta}}{n}}\right),

    strengthens (6.1) to a localized result, under fewer assumptions.

We illustrate our improvements on a particular non-convex learning example.

Example 3 (non-convex regression with non-linear activation).

Consider the model

(θ;z)=(η(θTx)y)2,\displaystyle\ell(\theta;z)=(\eta(\theta^{T}x)-y)^{2}, (6.3)

where η()\eta(\cdot) is a non-linear activation function, and there exists θΘ\theta^{*}\in\Theta such that 𝔼[y]=η(xTθ)\mathbb{E}[y]=\eta(x^{T}\theta^{*}). This model has been empirically shown to be superior relative to convex formulations in several applications [8, 30, 58], and is representative of the class of one-hidden-layer neural network models. Popular choices of η\eta include sigmoid link η(t)=(1+et)1\eta(t)=(1+e^{-t})^{-1} and probit link η(t)=Φ(t)\eta(t)=\Phi(t) where Φ\Phi is the Gaussian cumulative distribution function. Under mild regularity conditions, the population risk (θ;z)\mathbb{P}\ell(\theta;z) satisfies the PL condition and the statistical noise conditions.
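For concreteness, the loss (6.3) with the sigmoid link and its gradient $\nabla\ell(\theta;z)=2(\eta(\theta^{T}x)-y)\,\eta'(\theta^{T}x)\,x$ can be written in a few lines. This is an illustrative sketch only: the function names are hypothetical, and the finite-difference helper is included just to sanity-check the closed-form gradient.

```python
import numpy as np


def sigmoid(t):
    """Sigmoid link eta(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))


def loss(theta, x, y):
    """Squared loss (eta(theta^T x) - y)^2 from model (6.3)."""
    return (sigmoid(theta @ x) - y) ** 2


def grad_loss(theta, x, y):
    """Closed-form gradient, using eta'(t) = eta(t) * (1 - eta(t)) for the sigmoid."""
    p = sigmoid(theta @ x)
    return 2.0 * (p - y) * p * (1.0 - p) * x


def numerical_grad(theta, x, y, eps=1e-6):
    """Central finite-difference approximation of the gradient, for checking only."""
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (loss(theta + e, x, y) - loss(theta - e, x, y)) / (2.0 * eps)
    return g


rng = np.random.default_rng(0)
theta = rng.standard_normal(4)
x = rng.uniform(-1.0, 1.0, size=4)   # bounded features, in the spirit of a sup-norm bound
y = 1.0

print(grad_loss(theta, x, y))
print(numerical_grad(theta, x, y))
```

The factor $\eta'(\theta^{T}x)$ is what makes the problem non-convex in $\theta$, and its lower bound over the feasible set is the kind of quantity that enters a curvature constant such as $\mu$.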

Assumption 4 (regularity conditions for Example 3).

(a) $\|x\|_{\infty}$ is uniformly bounded by $\tau$, the feasible parameter set $\Theta$ is given by $\{\theta\in\mathbb{R}^{d}:\|\theta\|\leq\frac{\Delta_{M}}{2}\}$, and $B$ is the worst-case boundedness parameter of $(\eta(\theta^{T}x)-y)^{2}$ (which can scale with $n$). (b) There exist $C_{\eta}>0,c_{\eta}>0$ such that

sup|t|ΔMτdmax{η(t),η′′(t)}Cη,inf|t|ΔMτdη(t)cη.\sup_{|t|\leq\Delta_{M}\tau\sqrt{d}}\max\{\eta^{\prime}(t),\eta^{\prime\prime}(t)\}\leq C_{\eta},\quad\inf_{|t|\leq\Delta_{M}\tau\sqrt{d}}\eta^{\prime}(t)\geq c_{\eta}.

(c) The feature vector xx spans all directions in d\mathbb{R}^{d}, that is, 𝔼[xxT]γτ2Id×d\mathbb{E}[xx^{T}]\succeq\gamma\tau^{2}{I}_{d\times d} for some 0<γ<10<\gamma<1.

Under Assumption 4, all of the assumptions in Theorem 3 and Theorem 4 are satisfied. In particular: Assumption 1 holds with $\beta=2C_{\eta}(C_{\eta}+\sqrt{B})\tau^{2}$; Assumption 2 holds with $G_{*}=2C_{\eta}\tau\sqrt{Bd}$; and Assumption 3 holds with $\mu=\frac{2c_{\eta}^{3}\tau^{2}\gamma}{C_{\eta}}$ (see Appendix B.5 for details). Letting $\hat{\theta}$ be the approximate stationary point in Theorem 3, or the output of the gradient descent algorithm in Theorem 4 (after running sufficiently many iterations), we have the following corollary.

Corollary 5 (generalization error bound for Example 3).

Under Assumption 4,

(θ^)O(dL(τCη)2log1δμn),\displaystyle\mathscr{E}(\hat{\theta})\leq O\left(\frac{d\pazocal{L}^{*}(\tau C_{\eta})^{2}\log\frac{1}{\delta}}{\mu n}\right), (6.4)

where $\pazocal{L}^{*}:=\mathbb{P}[(y-\eta({\theta^{*}}^{T}x))^{2}]$ is the population risk at the optimal parameter $\theta^{*}$.

Since the sub-Gaussian parameter of the random gradient satisfies GO(τCηB)G\leq O(\tau C_{\eta}\sqrt{B}) under Assumption 4, the result (6.2) in Mei et al. [47] implies a generalization error bound of the order

$$\mathscr{E}(\hat{\theta})\leq O\left(\frac{dB(\tau C_{\eta})^{2}\log n\log\frac{1}{\delta}}{\mu n}\right)\quad[\text{existing result [47]}], \tag{6.5}$$

where $B=\sup_{\theta,x,y}(\eta(\theta^{T}x)-y)^{2}$ is the worst-case boundedness parameter. Let us now compare our result (6.4) with the existing result (6.5): 1) our result (6.4) improves the worst-case boundedness parameter $B$, replacing it with the much smaller optimal risk $\pazocal{L}^{*}$; and 2) it removes the superfluous logarithmic factor $\log n$.

Comparison with the norm-based generalization error bound in Foster et al. [12].

Let us now compare our problem-dependent rates with the norm-based bounds in [12] (it is worth mentioning that they also provide novel results in the infinite-dimensional and high-dimensional settings). Under the formulation in Example 3 and Assumption 4, the generalization error bound proved in [12] is of the order

$$\mathscr{E}(\hat{\theta})\leq O\left(\frac{d^{2}B(\tau C_{\eta})^{4}+dB(\tau C_{\eta})^{2}\log\frac{1}{\delta}}{\mu n}\right)\quad[\text{converted from [12]}], \tag{6.6}$$

for an approximate stationary point θ^\hat{\theta}. To be specific, their original result assumes x\|x\| to be uniformly bounded by DD and the generalization error scales with D4D^{4}. Under the standard assumption xτ\|x\|_{\infty}\leq\tau (or xx being a τ\tau-sub-Gaussian random vector), D4D^{4} is of order τ4d2\tau^{4}d^{2} so their result does not achieve optimal dependence on dd (in the original statements in [12] there is a potential misunderstanding of the dependence on dd). Besides improving the worst-case boundedness parameter BB to the optimal risk L\pazocal{L}^{*}, our result (6.4) further improves (6.6) by order d(τCη)2d(\tau C_{\eta})^{2}.

Lastly, we comment that there is no formal guarantee on how to find θ^\hat{\theta} by an optimization algorithm in [12]. They merely establish the generalization error bound for approximate stationary points, but analysis of an optimization algorithm is more challenging because properties of the population risk may not carry over to the empirical risk.

6.2 Stochastic optimization

The parametric learning setting we discussed is sometimes referred to as “stochastic optimization” [66, 67]. Beyond supervised learning, stochastic optimization also covers operations research and system control problems, where the dimension dd may no longer be pertinent in the generalization error bound for sufficiently large nn (i.e., the bound should be dimension-independent in the asymptotic regime). Therefore, it is preferable to prove norm-based generalization error bounds, which do not explicitly scale with dd after a certain sample complexity threshold.

We compare our results with previous ones in the area of stochastic optimization. Those results typically assume the population risk to be strongly convex, i.e., there exists μ>0\mu>0 such that θ1,θ2Θ\forall\theta_{1},\theta_{2}\in\Theta,

(θ1;z)(θ2;z)((θ2;z))T(θ1θ2)μ2θ1θ22.\displaystyle\mathbb{P}\ell(\theta_{1};z)-\mathbb{P}\ell(\theta_{2};z)-\left(\mathbb{P}\nabla\ell(\theta_{2};z)\right)^{T}(\theta_{1}-\theta_{2})\geq\frac{\mu}{2}\|{\theta_{1}}-{\theta_{2}}\|^{2}.

While this assumption is much more restrictive than our Assumption 3, we note that our problem-dependent rate and sample complexity results are novel even in this well-studied strongly convex setting.

Recall that our problem-dependent generalization error bounds in Theorem 3 and Theorem 4 are of the order

(θ^)O([(θ;z)2]log1δμn),\displaystyle\mathscr{E}(\hat{\theta})\leq O\left(\frac{\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{1}{\delta}}{\mu n}\right), (6.7)

provided nΩ(β2d/μ2)n\geq\Omega({\beta^{2}d}/{\mu^{2}}); and the sample complexity (to achieve ε\varepsilon generalization error) is

nΩ(β2dμ2+[(θ;z)2]log1δμε).\displaystyle n\geq\Omega\left(\frac{\beta^{2}d}{\mu^{2}}+\frac{\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{1}{\delta}}{\mu\varepsilon}\right). (6.8)

Our results are natural finite-sample extensions of the classical “asymptotic normality” result (5.9), and hence are the sharpest results one can expect under aforementioned assumptions (see the discussion at the end of Section 5).

Comparison with the classical result from Shapiro et al. [68].

Perhaps the most representative result on the generalization error of empirical risk minimization (also referred to as “sample average approximation”) in the stochastic optimization literature, is Corollary 5.20 from the monograph [68]. When the random gradient (θ;z)\nabla\ell(\theta;z) is GG-sub-Gaussian for all θΘ\theta\in\Theta, that result implies

(θ^ERM)O(G2dlognlog1δμn).\displaystyle\mathscr{E}(\hat{\theta}_{\textup{ERM}})\leq O\left({\frac{G^{2}{d}\log n\log\frac{1}{\delta}}{\mu n}}\right). (6.9)

One advantage of (6.9) is that it does not require the population risk to be smooth. However, the explicit dependence on dd and the global sub-Gaussian parameter GG in (6.9) make it less favorable for some operations research applications and M-estimation problems, where the asymptotic complexity (5.9) does not depend on dd. It is easy to show that our problem-dependent generalization error bound (6.7) strictly improves on this classical result. Specifically, under the sub-Gaussian distributional assumptions on gradients, one can prove that

[(θ;z)2]supθΘ[(θ;z)2]8dG2.\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\ll\sup_{\theta\in\Theta}\mathbb{P}[\|\nabla\ell(\theta;z)\|^{2}]\leq 8dG^{2}.

Plugging this into (6.7) and (6.9), we observe that our bound improves on (6.9) by removing dependence on the worst-case L2L_{2} norm of the gradient over the entire parameter space Θ\Theta.

Comparison with results obtained from the “online to batch conversion” [23].

By assuming the population risk is strongly convex and satisfies the following “uniform Lipschitz continuous” condition,

|(θ1;z)(θ2;z)|Lunifθ1θ2,zZ,θ1,θ2Θ,\displaystyle|\ell(\theta_{1};z)-\ell(\theta_{2};z)|\leq L_{\text{unif}}\|\theta_{1}-\theta_{2}\|,\quad\forall z\in\pazocal{Z},\quad\forall\theta_{1},\theta_{2}\in\Theta,

[23] proves an “online to batch conversion” that relates the regret of an algorithm (on past data) to the generalization performance (on future data). As a result, the output θ^SGD\hat{\theta}_{SGD} of certain stochastic gradient methods (also referred to as “stochastic approximation” in the stochastic optimization literature) can be proved to satisfy

$$\mathscr{E}(\hat{\theta}_{\text{SGD}})\leq O\left(\frac{L_{\textup{unif}}^{2}\log\frac{1}{\delta}}{n}\right). \tag{6.10}$$

Note that (6.10) does not require any sample size threshold. In contrast, our problem-dependent generalization error bound (6.7) provides an improved rate, but only as long as the sample size condition nΩ(β2d/μ2)n\geq\Omega({\beta^{2}d}/{\mu^{2}}) is satisfied, because in this case

[(θ;z)2]supθΘ,zZ(θ;z)2=Lunif2.\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\ll\sup_{\theta\in\Theta,z\in\pazocal{Z}}\|\nabla\ell(\theta;z)\|^{2}=L_{\textup{unif}}^{2}.

Plugging this into (6.7) and (6.10), this claimed improvement can be immediately verified.

Comparison with problem-dependent bounds in Zhang et al. [81].

By imposing both strong convexity and a uniform smoothness condition, [81] systematically provide dimension-independent generalization error bounds for empirical risk minimization. However, there are several limitations in their approach: 1) their sample complexity threshold to achieve dimension-independent generalization error is sub-optimal for many popular problems; and 2) many of their assumptions are restrictive and not required (as our analysis shows).

The main source of the limitations in [81] is the assumption that $\ell(\theta;z)$ admits a uniform smoothness parameter $\beta_{\text{unif}}$, i.e.,

(θ1,z)(θ2,z)βunifθ1θ2,zZ,θ1,θ2Θ.\displaystyle\|\nabla\ell(\theta_{1},z)-\nabla\ell(\theta_{2},z)\|\leq\beta_{\text{unif}}\|\theta_{1}-\theta_{2}\|,\quad\forall z\in\pazocal{Z},\quad\forall\theta_{1},\theta_{2}\in\Theta. (6.11)

This quantity serves as the main complexity proxy. With additional assumptions that (θ,z)\ell(\theta,z) is non-negative and convex for all zz, [81, Theorem 3] proves that when

nΩ(βunifdlognμ),\displaystyle n\geq\Omega\left(\frac{\beta_{\textup{unif}}d\log n}{\mu}\right),

empirical risk minimization achieves the problem-dependent bound

(θ^ERM)O(βunifLlog1δμn).\displaystyle\mathscr{E}(\hat{\theta}_{\textup{ERM}})\leq O\left(\frac{\beta_{\textup{unif}}\pazocal{L}^{*}\log\frac{1}{\delta}}{\mu n}\right).

However, since $\beta_{\textup{unif}}$ is effectively the largest operator norm of the Hessian, $\sup_{\theta,z}\|\nabla^{2}\ell(\theta;z)\|_{\text{op}}$, it scales with the dimension $d$ in most statistical estimation problems. As a result, the sample complexity threshold $\Omega(\beta_{\textup{unif}}d\log n/\mu)$ becomes sub-optimal for most statistical regression problems. For example, consider a simple linear regression setup:

(θ;(x,y))=yθTx2,y=xTθ+υ,υN(0,1),xN(0,τId×d).\displaystyle\ell(\theta;(x,y))=\|y-\theta^{T}x\|^{2},\quad y=x^{T}\theta^{*}+\upsilon,\quad\upsilon\sim N(0,1),\quad x\sim N(0,\tau{{I}}_{d\times d}). (6.12)

In this example, we have μ=1\mu=1 and βunif=Ω(τ2d)\beta_{\textup{unif}}=\Omega(\tau^{2}d) , so the sample complexity needed to achieve ε\varepsilon accuracy in [81] is

$$n\geq\Omega\left(\tau^{2}d^{2}\log n+\frac{d\pazocal{L}^{*}\tau^{2}}{\varepsilon}\right)\quad[\text{sample complexity [81]}]. \tag{6.13}$$

In contrast, our sample complexity (6.8) is

$$n\geq\Omega\left(\tau^{2}d+\frac{d\pazocal{L}^{*}\tau^{2}}{\varepsilon}\right)\quad[\text{sample complexity (6.8)}],$$

in this example. Therefore, the Ω(τ2d2logn){\Omega}(\tau^{2}d^{2}\log n) term in (6.13) is sub-optimal primarily because the analysis in [81] relies on the uniform smoothness parameter βunif\beta_{\textup{unif}}. Moreover, their assumptions that (θ;z)\ell(\theta;z) is non-negative and convex for all zz may rule out interesting stochastic optimization applications. These are not required by our framework and results.
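To make the gap between the two thresholds concrete, the following Python sketch evaluates both sample-size requirements for the linear regression example. It sets every constant hidden in the $\Omega(\cdot)$ notation to 1, which is an assumption made purely for illustration, and the function names are hypothetical.

```python
import math

def threshold_localized(d, tau, L_star, eps):
    """n >= tau^2 d + d L* tau^2 / eps, with hidden constants set to 1 (assumption)."""
    return math.ceil(tau**2 * d + d * L_star * tau**2 / eps)

def threshold_uniform_smoothness(d, tau, L_star, eps):
    """An n beyond which n >= tau^2 d^2 log n + d L* tau^2 / eps holds
    (hidden constants set to 1, an assumption)."""
    a = tau**2 * d**2
    b = d * L_star * tau**2 / eps
    n = a + b + 2.0
    for _ in range(200):
        # fixed-point iteration n <- a log n + b
        n = max(a * math.log(n) + b, 2.0)
    n = math.ceil(n)
    while n < a * math.log(n) + b:
        # guard against floating-point rounding
        n += 1
    return n

d, tau, L_star, eps = 100, 1.0, 1.0, 0.5
n_localized = threshold_localized(d, tau, L_star, eps)
n_uniform = threshold_uniform_smoothness(d, tau, L_star, eps)
print(n_localized, n_uniform)
```

With these illustrative values the first threshold is linear in $d$, whereas the second is dominated by the $d^{2}\log n$ term.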

Recently, the approach in [81] was also extended to analyze a variant of the stochastic gradient descent algorithm [82], but the results hold only in expectation rather than with high probability, and they have similar limitations to the results in [81]. This approach has also been extended to non-convex stochastic optimization problems [40], where the generalization error bounds are of the form (6.10): they contain the uniform Lipschitz parameter $L_{\text{unif}}$ and are not problem-dependent.

7 Learning with missing data and Expectation-Maximization algorithms

In this section we introduce a broad class of applicable non-convex and semi-supervised learning problems, in the area of “learning with missing data.” We again apply our proposed “uniform localized convergence” framework and prove a variant of Theorem 4, which gives the sharpest local convergence rate for first-order Expectation-Maximization (EM) algorithms in many widely studied problems. Our analysis improves the framework introduced recently in Balakrishnan et al. [2].

7.1 Background

Convex maximum likelihood estimation problems generally become non-convex when there is missing or unobserved data. Assume the data $(z,w)$ is generated from an unknown distribution specified by the true parameter $\theta^{*}\in\mathbb{R}^{d}$, where $z\in\pazocal{Z}$ corresponds to the observable data, and $w\in\pazocal{W}$ corresponds to the unobservable data (also referred to as the “latent variable”). For every $\theta\in\mathbb{R}^{d}$, let $f_{\theta}(z,w)$ be the joint likelihood of $(z,w)$ if the underlying distribution is specified by $\theta$, so that integrating out $w$ yields the marginal likelihood of $z$. (Throughout this section we will assume the existence of density functions for simplicity.) Consider the loss function

(θ;z)=log[Wfθ(z,w)𝑑w].\displaystyle\ell(\theta;z)=-\log\left[\int_{\pazocal{W}}f_{\theta}(z,w)d{w}\right]. (7.1)

Our goal is to estimate the true parameter θ\theta^{*}, which minimizes the population risk

(θ;z)=Z(θ;z)𝑑z\mathbb{P}\ell(\theta;z)=\int_{\pazocal{Z}}\ell(\theta;z)dz

over all θd\theta\in\mathbb{R}^{d}. (Equivalently, θ\theta^{*} maximizes the population log-likelihood function.) The main challenge is that (θ;z)\mathbb{P}\ell(\theta;z) is typically non-convex, despite the fact that the conditional log-likelihood function logfθ(z,w)\log f_{\theta}(z,w) would usually be convex with respect to θ\theta, if both zz and ww were observable.

The following θ(θ;z)\ell_{\theta^{\prime}}(\theta;z) function provides a convex upper bound on (θ;z)\ell(\theta;z), and can be interpreted as the conditional expectation of the loss, as if θ\theta^{\prime} is the true parameter θ\theta^{*}:

θ(θ;z)=Wkθ(w|z)logfθ(z,w)𝑑w,\displaystyle\ell_{\theta^{\prime}}(\theta;z)=-\int_{\pazocal{W}}k_{\theta^{\prime}}(w|z)\log f_{\theta}(z,w)dw,

where $k_{\theta^{\prime}}(w|z)$ is the conditional density of $w$ given $z$. Denote by $\nabla_{\theta}\ell_{\theta^{\prime}}(\theta;z)$ the gradient of $\ell_{\theta^{\prime}}(\theta;z)$ with respect to $\theta$ when $\theta^{\prime}$ is held fixed. From the definition of $k_{\theta^{\prime}}(w|z)$, it is easy to verify that the value of $\nabla_{\theta}\ell_{\theta^{\prime}}(\theta;z)$ at $\theta=\theta^{\prime}$ is exactly the gradient of $\ell(\cdot;z)$ at $\theta^{\prime}$: that is, for all $\theta^{\prime}\in\Theta$,

θθ(θ;z)|θ=θ=(θ;z).\displaystyle\nabla_{\theta}\ell_{\theta^{\prime}}(\theta;z)|_{\theta=\theta^{\prime}}=\nabla\ell(\theta^{\prime};z). (7.2)

In view of the identity (7.2), it is known [2] that gradient descent on the empirical risk n(θ;z)\mathbb{P}_{n}\ell(\theta;z) is equivalent to the first-order Expectation-Maximization algorithm: at the tt-th iteration, the “expectation” step calculates the sample average nθt(θ;z)\mathbb{P}_{n}\ell_{\theta^{t}}(\theta;z), and the “maximization” step executes the first-order update

$$\theta^{t+1}=\theta^{t}-\alpha\mathbb{P}_{n}\nabla\ell_{\theta^{t}}(\theta^{t};z)=\theta^{t}-\alpha\mathbb{P}_{n}\nabla\ell(\theta^{t};z), \tag{7.3}$$

where $\alpha>0$ is the step size. First-order EM is known to be more computationally efficient than standard EM, and more amenable to analysis [2].

Examples of learning with missing data problems for which the above observations apply include the following.

Example 4 (Mixture of two Gaussians).

In this problem, the missing variable w{1,1}w\in\{-1,1\} is an indicator of the underlying mixture component, which has 12\frac{1}{2} probability to be 11 and the other 12\frac{1}{2} probability to be 1-1. Conditioned on ww, the observable variable zz is generated as follows.

(z|w=1)N(θ,σ2Id×d),(z|w=1)N(θ,σ2Id×d).\displaystyle(z|w=1)\sim N(\theta^{*},\sigma^{2}I_{d\times d}),\quad(z|w=-1)\sim N(-\theta^{*},\sigma^{2}I_{d\times d}).

For this problem, we have

$$\ell_{\theta^{\prime}}(\theta;z)=\frac{w_{\theta^{\prime}}(z)}{2}\|z-\theta\|^{2}+\frac{1-w_{\theta^{\prime}}(z)}{2}\|z+\theta\|^{2},$$

where wθ(z)=eθz22σ2[eθz22σ2+eθ+z22σ2]1w_{\theta^{\prime}}(z)=e^{-\frac{\|\theta^{\prime}-z\|^{2}}{2\sigma^{2}}}[e^{-\frac{\|\theta^{\prime}-z\|^{2}}{2\sigma^{2}}}+e^{-\frac{\|\theta^{\prime}+z\|^{2}}{2\sigma^{2}}}]^{-1}.
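As an illustration, the first-order EM update (7.3) for this Gaussian mixture can be sketched in Python as follows. The sketch relies on the identity $2w_{\theta}(z)-1=\tanh(\langle\theta,z\rangle/\sigma^{2})$, which follows from the definition of $w_{\theta}$, so that $\nabla_{\theta}\ell_{\theta^{\prime}}(\theta;z)|_{\theta=\theta^{\prime}}=\theta^{\prime}-(2w_{\theta^{\prime}}(z)-1)z$. The function name, the synthetic data, and all numerical values are assumptions chosen for illustration only.

```python
import numpy as np

def first_order_em_gmm(Z, theta0, sigma, step=1.0, iters=50):
    """First-order EM for the symmetric two-component Gaussian mixture (illustrative sketch)."""
    theta = np.array(theta0, dtype=float)
    for _ in range(iters):
        # 2 * w_theta(z) - 1 = tanh(<theta, z> / sigma^2)
        signed_weight = np.tanh(Z @ theta / sigma**2)
        # empirical gradient: theta - P_n[(2 w_theta(z) - 1) z]
        grad = theta - (signed_weight[:, None] * Z).mean(axis=0)
        theta = theta - step * grad
    return theta

# Synthetic data (assumed values): d = 5, n = 5000, sigma = 1, ||theta*|| = 3.
rng = np.random.default_rng(0)
d, n, sigma = 5, 5000, 1.0
theta_star = np.zeros(d)
theta_star[0] = 3.0
labels = rng.choice([-1.0, 1.0], size=n)
Z = labels[:, None] * theta_star + sigma * rng.standard_normal((n, d))

# Initialization at distance 0.5 from theta*, inside the ball of radius ||theta*|| / 4.
theta0 = theta_star + 0.5 * np.ones(d) / np.sqrt(d)
theta_hat = first_order_em_gmm(Z, theta0, sigma)
err = np.linalg.norm(theta_hat - theta_star)
print(err)
```

With step size 1 the update reduces to $\theta\leftarrow\mathbb{P}_{n}[(2w_{\theta}(z)-1)z]$, which is why a single averaging pass per iteration suffices in this model.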

Example 5 (Mixture of two component linear regression).

In this problem, xN(0,Id×d)x\sim N(0,I_{d\times d}) is a random feature vector, and w{1,1}w\in\{-1,1\} is a missing indicator variable that has 12\frac{1}{2} probability to be 11 and 12\frac{1}{2} probability to be 1-1. Conditioned on ww and xx, the label variable yy is generated as follows.

(y|w=1,x)N(xTθ,σ2),(y|w=1,x)N(xTθ,σ2).\displaystyle(y|w=1,x)\sim N(x^{T}\theta^{*},\sigma^{2}),\quad(y|w=-1,x)\sim N(-x^{T}\theta^{*},\sigma^{2}).

In this problem, the observable variable zz is the feature-label pair (x,y)(x,y), and we have

θ(θ;z)=wθ(x,y)2(yxTθ)2+1wθ(x,y)2(y+xTθ)2,\displaystyle\ell_{\theta^{\prime}}(\theta;z)=\frac{w_{\theta^{\prime}}(x,y)}{2}(y-x^{T}\theta)^{2}+\frac{1-w_{\theta^{\prime}}(x,y)}{2}(y+x^{T}\theta)^{2},

where wθ(x,y)=e(xTθy)22σ2[e(xTθy)22σ2+e(xTθ+y)22σ2]1w_{\theta^{\prime}}(x,y)=e^{-\frac{(x^{T}\theta^{\prime}-y)^{2}}{2\sigma^{2}}}[e^{-\frac{(x^{T}\theta^{\prime}-y)^{2}}{2\sigma^{2}}}+e^{-\frac{(x^{T}\theta^{\prime}+y)^{2}}{2\sigma^{2}}}]^{-1}.
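The same update can be sketched for the mixture of linear regressions. Here $2w_{\theta}(x,y)-1=\tanh(y\,x^{T}\theta/\sigma^{2})$ and the gradient of $\ell_{\theta^{\prime}}(\theta;z)$ at $\theta=\theta^{\prime}$ equals $x\,(x^{T}\theta^{\prime}-(2w_{\theta^{\prime}}(x,y)-1)y)$. The function name, the synthetic data, and the parameter values below are hypothetical choices for illustration.

```python
import numpy as np

def first_order_em_mlr(X, y, theta0, sigma, step=1.0, iters=100):
    """First-order EM for the symmetric mixture of two linear regressions (illustrative sketch)."""
    theta = np.array(theta0, dtype=float)
    for _ in range(iters):
        pred = X @ theta
        # 2 * w_theta(x, y) - 1 = tanh(y * x^T theta / sigma^2)
        signed_weight = np.tanh(y * pred / sigma**2)
        # empirical gradient: P_n[x (x^T theta - (2w - 1) y)]
        grad = (X * (pred - signed_weight * y)[:, None]).mean(axis=0)
        theta = theta - step * grad
    return theta

# Synthetic data (assumed values): d = 5, n = 20000, sigma = 0.5, ||theta*|| = 2.
rng = np.random.default_rng(1)
d, n, sigma = 5, 20000, 0.5
theta_star = np.zeros(d)
theta_star[0] = 2.0
X = rng.standard_normal((n, d))
labels = rng.choice([-1.0, 1.0], size=n)
y = labels * (X @ theta_star) + sigma * rng.standard_normal(n)

# Initialization at distance 0.05 from theta*, inside the ball of radius ||theta*|| / 32.
theta0 = theta_star + 0.05 * np.ones(d) / np.sqrt(d)
theta_hat = first_order_em_mlr(X, y, theta0, sigma)
err = np.linalg.norm(theta_hat - theta_star)
print(err)
```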

7.2 Problem-dependent rates for first-order EM

Motivated by the breakthrough work of Balakrishnan et al. [2], we assume that the feasible parameter space $\Theta$ contains the true parameter $\theta^{*}$ and satisfies the following two assumptions.

Assumption 5 (strong convexity of θ(θ;z)\bm{\mathbb{P}\ell_{\theta^{*}}(\theta;z)}).

There exists $\mu_{1}>0$ such that $\forall\theta\in\Theta$,

θ(θ;z)θ(θ;z)θ(θ;z)22μ1.\displaystyle\mathbb{P}\ell_{\theta^{*}}(\theta;z)-\mathbb{P}\ell_{\theta^{*}}(\theta^{*};z)\leq\frac{\|\mathbb{P}\nabla\ell_{\theta^{*}}(\theta;z)\|^{2}}{2\mu_{1}}.

Recall that $\mathbb{P}\ell_{\theta^{*}}(\theta;z)$ is the underlying “true” log likelihood with respect to the parameter $\theta$, which is unknown due to lack of information on $\theta^{*}$. It is standard to assume that $\mathbb{P}\ell_{\theta^{*}}(\theta;z)$ is strongly convex when there is no missing data [46, 2].

Assumption 6 (gradient smoothness).

There exists 0<μ2<μ10<\mu_{2}<\mu_{1} such that θΘ\forall\theta\in\Theta

θ(θ;z)θ(θ;z)μ2θθ.\displaystyle\|\mathbb{P}\nabla\ell_{\theta}(\theta;z)-\mathbb{P}\nabla\ell_{\theta^{*}}(\theta;z)\|\leq\mu_{2}\|\theta-\theta^{*}\|.

Assumption 6 is also imposed in [2]. While this assumption does not typically hold over all $\theta\in\mathbb{R}^{d}$, it is often satisfied with small enough $\mu_{2}$ over a local region around the true parameter $\theta^{*}$ [2, 80]. Under the above two assumptions, according to the identity (7.2), the gradient of the population risk,

(θ;z)=θ(θ;z),\displaystyle\mathbb{P}\nabla\ell(\theta;z)=\mathbb{P}\nabla\ell_{\theta}(\theta;z),

can be viewed as a perturbation of θ(θ;z)\nabla\ell_{\theta^{*}}(\theta;z)—the gradient of the strongly convex function θ(θ;z)\mathbb{P}\ell_{\theta^{*}}(\theta;z). Therefore, Assumption 5 and Assumption 6 play a similar role to that of the PL condition that we have analyzed in Section 5.3. The following theorem can be viewed as a modification of our previous Theorem 4, where the proof is tailored to these new assumptions but the key ideas remain mostly similar.

Theorem 6 (generalization error of first-order EM).

Assume Assumptions 1, 2, 5, 6, and assume access to an initialization θ0Bd(θ,Δm)\theta^{0}\in\pazocal{B}^{d}(\theta^{*},\Delta_{m}). For any fixed δ(0,1)\delta\in(0,1), iterates of the first-order EM algorithm {θt}\{\theta^{t}\} generated by (7.3) with the fixed step size 2β+μ1\frac{2}{\beta+\mu_{1}} satisfy with probability at least 1δ1-\delta and all t=0,1,t=0,1,\dots,

$$\begin{aligned}
\mathscr{E}(\theta^{t})&\leq\frac{16\beta}{\mu_{1}^{2}}\left(\sqrt{\frac{2\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{n}}+\frac{G_{*}\log\frac{4}{\delta}+\mu_{1}}{n}\right)^{2}+\left(1-\frac{2\mu_{1}-\mu_{2}}{2(\beta+\mu_{1})}\right)^{2t}\beta\|\theta^{0}-\theta^{*}\|^{2},\quad\text{and}\\
\|\theta^{t}-\theta^{*}\|&\leq\frac{4}{\mu_{1}}\left(\sqrt{\frac{2\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{n}}+\frac{G_{*}\log\frac{4}{\delta}+\mu_{1}}{n}\right)+\left(1-\frac{2\mu_{1}-\mu_{2}}{2(\beta+\mu_{1})}\right)^{t}\|\theta^{0}-\theta^{*}\|,
\end{aligned} \tag{7.4}$$

provided the sample size condition nmax{cβ2μ12(d+log8log2(2nΔM+2)δ),128[(θ;z)2]log4δμ1ΔM,8Glog4δ+8μ1μ1ΔM}n\geq\max\left\{\frac{c\beta^{2}}{\mu_{1}^{2}}\left(d+\log\frac{8\log_{2}(2n\Delta_{M}+2)}{\delta}\right),\frac{128\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{\mu_{1}\Delta_{M}},\frac{8G_{*}\log\frac{4}{\delta}+8\mu_{1}}{\mu_{1}\Delta_{M}}\right\} holds, where cc is an absolute constant.
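To see how the statistical and optimization terms in (7.4) trade off, the right-hand sides can be evaluated numerically. The Python sketch below simply transcribes the two bounds; it does not check the sample size condition, and the function name and the example values are hypothetical.

```python
import math

def em_error_bound(grad_sq_moment, G_star, mu1, mu2, beta, n, delta, t, init_dist):
    """Evaluate the right-hand sides of (7.4); returns (excess risk bound, distance bound)."""
    log_term = math.log(4.0 / delta)
    # statistical term: depends only on quantities evaluated at theta*
    stat = math.sqrt(2.0 * grad_sq_moment * log_term / n) + (G_star * log_term + mu1) / n
    # contraction factor of the optimization term
    rate = 1.0 - (2.0 * mu1 - mu2) / (2.0 * (beta + mu1))
    dist_bound = 4.0 / mu1 * stat + rate**t * init_dist
    risk_bound = 16.0 * beta / mu1**2 * stat**2 + rate**(2 * t) * beta * init_dist**2
    return risk_bound, dist_bound

# Example values (assumed): sigma^2 d = 5, G_* = sqrt(5), beta = mu1 = 1, mu2 = 1/4.
risk_bound, dist_bound = em_error_bound(5.0, math.sqrt(5.0), 1.0, 0.25, 1.0,
                                        n=10_000, delta=0.05, t=200, init_dist=0.5)
print(risk_bound, dist_bound)
```

For large $t$ the geometric term vanishes and only the statistical term, which scales with $\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]/n$, remains.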

We comment that it is usually straightforward to verify Assumption 1 for specific missing-data applications (no harder than verifying it in the completely observable case). From Lemma 2, we only need to show that the Hessian matrices $\{\frac{\partial^{2}}{\partial\theta^{2}}[-\log f_{\theta}(z,w)]\}_{\theta\in\Theta}$ are sub-exponential for all fixed $w$ and $\theta$. That is,

𝔼{exp(1β|u1T(2[logfθ(z,w)]θ2)u2|)}2,\displaystyle\mathbb{E}\left\{\exp\left(\frac{1}{\beta}\left|u_{1}^{T}\left(\frac{\partial^{2}[\log f_{\theta}(z,w)]}{\partial\theta^{2}}\right)u_{2}\right|\right)\right\}\leq 2, (7.5)

for any unit vectors $u_{1},u_{2}$, any $w$ and any $\theta\in\Theta$. Typically, condition (7.5) simply requires that the observable data $z$ be a sub-Gaussian vector, regardless of the (fixed) values that the unobservable data $w$ take.

As a result, Theorem 6 applies to a broad class of “learning with missing data” problems, including Example 4 and Example 5. In order to validate Assumption 6 on these two examples, a common strategy [2] is to assume the signal-to-noise ratio (SNR) to be lower bounded as

θση,\displaystyle\frac{\|\theta^{*}\|}{\sigma}\geq\eta, (7.6)

for some absolute constant η>0\eta>0. The following corollary holds under identical assumptions on η\eta as in [2].

Corollary 7 (Theorem 6 applied to Example 4 and Example 5).

In both Example 4 and Example 5, after sufficiently many iterations, the first-order EM algorithm with step size 11 satisfies the generalization error bound

(θt)O(σ2dlog1δn).\displaystyle\mathscr{E}(\theta^{t})\leq O\left(\frac{\sigma^{2}d\log\frac{1}{\delta}}{n}\right).

Specifically,

  • For the Gaussian mixture model (Example 4), assuming the signal-to-noise condition (7.6) holds and the initialization point satisfies $\theta^{0}\in\pazocal{B}^{d}(\theta^{*},\frac{\|\theta^{*}\|}{4})$, the result of Theorem 6 holds with $\beta=1$, $G_{*}=\sigma\sqrt{d}$, $\mu_{1}=1$ and $\mu_{2}=c_{1}(1+\frac{1}{\eta^{2}}+\eta^{2})e^{-c_{2}\eta^{2}}$, where $c_{1},c_{2}$ are absolute constants.

  • For the mixture of linear regression model (Example 5), assuming the signal-to-noise condition (7.6) holds and the initialization point satisfies $\theta^{0}\in\pazocal{B}^{d}(\theta^{*},\frac{\|\theta^{*}\|}{32})$, the result of Theorem 6 holds with $\beta=1$, $G_{*}=\sigma\sqrt{d}$, $\mu_{1}=1$ and $\mu_{2}=\frac{1}{4}$.

Proofs can be found in Appendix B.7. Notably, our results do not depend on $\|\theta^{*}\|$, and hence refine those in [2].

7.3 Discussion and improvements over previous results

In this subsection, we will first compare our general theory with the methodology in Balakrishnan et al. [2]. Then, we will discuss the improvement over [2] as illustrated in Example 4 and Example 5. Lastly, we will provide some intuition pertaining to this improvement.

Improvements in the methodology.

We now restate the theoretical result from Balakrishnan et al. [2] on the estimation error of first-order EM. Assume with probability at least 1δ1-\delta,

supΘ(n)θ(θ;z)εunif(n,δ).\displaystyle\sup_{\Theta}\|(\mathbb{P}-\mathbb{P}_{n})\nabla\ell_{\theta}(\theta;z)\|\leq\varepsilon_{\textup{unif}}(n,\delta). (7.7)

When the sample size nn is large enough, [2] proves that the first-order EM iterates {θt}t=0\{\theta^{t}\}_{t=0}^{\infty} satisfy

θtθO(εunif(n,δ)μ1μ2)+(12μ1μ2β+μ1)tθ0θ.\displaystyle\|\theta^{t}-\theta^{*}\|\leq O\left(\frac{\varepsilon_{\textup{unif}}(n,\delta)}{\mu_{1}-\mu_{2}}\right)+\left(1-\frac{2\mu_{1}-\mu_{2}}{\beta+\mu_{1}}\right)^{t}\|\theta^{0}-\theta^{*}\|. (7.8)

Compared with (7.4) in Theorem 6, the approach in [2] has two main limitations.

  • The result (7.8) contains a loose, global uniform convergence term $\varepsilon_{\textup{unif}}(n,\delta)$ defined via (7.7). In contrast, our Theorem 6 shows that the statistical error only depends on the concentration of $\mathbb{P}_{n}\nabla\ell(\theta^{*};z)$ relative to $\mathbb{P}\nabla\ell(\theta^{*};z)$ at the single point $\theta^{*}$. The precise improvement will be illustrated on Example 4 and Example 5 shortly.

  • [2] does not discuss how to calculate the complex uniform convergence term $\varepsilon_{\textup{unif}}(n,\delta)$ for general models. In fact, [2] only calculates this term for Example 4. For the rest of the applications they consider, they turn to analyzing sample-splitting heuristics. Although these heuristics are easier to analyze, they are less common in practice. In contrast, our Theorem 6 applies to general models without leaving the uniform convergence term unspecified.

Improvements on the examples.

For the mixture of two Gaussians (Example 4), [2] proves that after sufficiently many iterations, the first-order EM algorithm satisfies the generalization error bound

$$\mathscr{E}(\theta^{t})\leq O\left(\frac{\|\theta^{*}\|^{2}(1+\frac{\|\theta^{*}\|^{2}}{\sigma^{2}})d\log\frac{1}{\delta}}{n}\right)\quad[\text{GMM result [2]}], \tag{7.9}$$

and for the mixture of regressions (Example 5), [2] proves that after sufficient iterations, the first-order EM algorithm satisfies the generalization error bound

$$\mathscr{E}(\theta^{t})\leq O\left(\frac{(\sigma^{2}+\|\theta^{*}\|^{2})d\log\frac{1}{\delta}}{n}\right)\quad[\text{regression result [2]}]. \tag{7.10}$$

In contrast, our problem-dependent generalization error bounds given by Corollary 7 are of the order

$$\mathscr{E}(\theta^{t})\leq O\left(\frac{\sigma^{2}d\log\frac{1}{\delta}}{n}\right)\quad[\text{Corollary 7}],$$

which exhibits an improvement over the previous results (7.9) and (7.10) from [2], under identical assumptions on the signal-to-noise ratio ($\|\theta^{*}\|/\sigma\geq\eta$, where $\eta$ is a sufficiently large absolute constant specified in [2]). In particular, in the high signal-to-noise ratio regime, $\|\theta^{*}\|^{2}\gg\sigma^{2}$, so our improvements are significant.

Tight characterization of the statistical error is traditionally considered challenging in the area of mixture models. Only recently, [29] provided a refined analysis of the mixture of regression problem (Example 5), and proved that the achievable generalization error is indeed of the order (σ2dlog1δ/n)\big{(}{\sigma^{2}d\log\frac{1}{\delta}}/{n}\big{)}. However, the analysis in [29] is fairly involved and customized to the specifics of the mixture of regression setting, and it is not clear how to extend the analysis to more general problems. Our theory can be applied to quite general settings, and moreover simplifies existing approaches.

From our theory, the optimal O(σ2dlog1δ/n)O\big{(}{\sigma^{2}d\log\frac{1}{\delta}}/{n}\big{)} characterization is very natural. Theorem 6 indicates the crucial fact that statistical error of the first-order EM algorithm only relies on [(θ;z)2]\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}], a quantity that depends only on the optimal parameter θ\theta^{*}.

We now use Example 4 to illustrate the simplicity of our analysis. Define the function $g:\mathbb{R}^{d}\rightarrow\mathbb{R}^{+}$ as

g(u)=2e2θu22σ2eu22σ2+e2θu22σ2,\displaystyle g(u)=\frac{2e^{-\frac{\|2\theta^{*}-u\|^{2}}{2\sigma^{2}}}}{e^{-\frac{\|u\|^{2}}{2\sigma^{2}}}+e^{-\frac{\|2\theta^{*}-u\|^{2}}{2\sigma^{2}}}},

where uu is a random vector drawn from N(0,σ2Id×d)N(0,\sigma^{2}I_{d\times d}). In the high SNR regime, g(u)g(u) is anticipated to be very close to zero with high probability, due to the fact that

$$\frac{\|u\|^{2}}{2\sigma^{2}}\ll\frac{\|2\theta^{*}-u\|^{2}}{2\sigma^{2}}.$$

In the Gaussian mixture model, whether w=1w=1 or w=1w=-1, it is straightforward to show that when conditioned on ww, the random vector ((θ;z)|w)(\nabla\ell(\theta^{*};z)|w) has the same distribution as u(1g(u))+θg(u)u(1-g(u))+\theta^{*}g(u). As a result, we have

[(θ;z)2]=𝔼u[u(1g(u))+θg(u)2].\displaystyle\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]=\mathbb{E}_{u}[\|u\cdot(1-g(u))+\theta^{*}\cdot g(u)\|^{2}].

As g(u)g(u) is very close to 0 with high probability, [(θ;z)2]\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}] should only scale with 𝔼u[u2]=σ2d\mathbb{E}_{u}[\|u\|^{2}]=\sigma^{2}d rather than θ\|\theta^{*}\|. This intuition also applies to other examples like the mixture of linear regression model.
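This intuition can be checked numerically. The Python sketch below draws $u\sim N(0,\sigma^{2}I_{d\times d})$, evaluates $g(u)$ in the numerically stable form $g(u)=2/(1+e^{\Delta(u)})$ with $\Delta(u)=(\|2\theta^{*}-u\|^{2}-\|u\|^{2})/(2\sigma^{2})$, and estimates $\mathbb{E}_{u}[\|u(1-g(u))+\theta^{*}g(u)\|^{2}]$ by Monte Carlo. The function name, sample size, and parameter values are assumptions made for illustration.

```python
import numpy as np

def grad_second_moment_gmm(theta_star, sigma, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E_u ||u (1 - g(u)) + theta* g(u)||^2 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    d = theta_star.shape[0]
    u = sigma * rng.standard_normal((n_samples, d))
    # gap = (||2 theta* - u||^2 - ||u||^2) / (2 sigma^2)
    gap = (np.sum((2.0 * theta_star - u) ** 2, axis=1) - np.sum(u ** 2, axis=1)) / (2.0 * sigma**2)
    # g(u) = 2 / (1 + exp(gap)), computed via logaddexp to avoid overflow
    g = 2.0 * np.exp(-np.logaddexp(0.0, gap))
    grad = u * (1.0 - g)[:, None] + theta_star * g[:, None]
    return np.mean(np.sum(grad ** 2, axis=1))

d, sigma = 5, 1.0
for norm in (4.0, 8.0):
    theta_star = np.zeros(d)
    theta_star[0] = norm
    ratio = grad_second_moment_gmm(theta_star, sigma) / (sigma**2 * d)
    print(norm, ratio)
```

In this high-SNR configuration the ratio to $\sigma^{2}d$ is expected to stay close to 1 as $\|\theta^{*}\|$ grows, consistent with the discussion above.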

8 Fast rates in supervised learning with structured convex cost

The main purpose of this section is to recover the problem-dependent rates in [53, 50] for (possibly non-parametric and heavy-tailed) supervised learning problems with structured convex cost functions. While [53, 50] propose an approach they call “learning without concentration,” our approach emphasizes the use of surrogate functions that are not “sub-root,” and relates one-sided uniform inequalities to two-sided concentration of “truncated” functions. Besides providing a unification, there are some technical improvements as well. For example, our approach does not require the “star-hull” of the hypothesis class that may increase complexity, and there are concrete examples showing that the improvement may be meaningful for non-convex classes. See Section 8.3 for contributions of our method, and detailed comparison with existing approaches.

8.1 Background

Problem formulation and assumptions.

Let the data zz be a feature-label pair (x,y)(x,y) where xXx\in\pazocal{X} and yYy\in\pazocal{Y}\subseteq\mathbb{R}. We assume every hypothesis hh in the hypothesis class H\pazocal{H} is a mapping from X\pazocal{X} to \mathbb{R}. In supervised learning, the loss function is of the form (h;(x,y))=sv(h(x),y)\ell(h;(x,y))=\ell_{\textup{{sv}}}(h(x),y) where the deterministic bivariate function sv:×\ell_{\textup{{sv}}}:\mathbb{R}\times\mathbb{R}\rightarrow\mathbb{R} is called the cost function. We assume that the cost function is differentiable, globally convex with respect to its first argument, and the population risk is smooth.

Assumption 7 (differentiability, convexity and smoothness).

The partial derivative of $\ell_{\textup{sv}}$ with respect to its first argument, denoted $\partial_{1}\ell_{\textup{sv}}$, exists and is continuous everywhere, and $\ell_{\textup{sv}}$ is a convex function with respect to its first argument, i.e., $\forall u_{1},u_{2},y\in\mathbb{R}$,

$$\ell_{\textup{sv}}(u_{1},y)-\ell_{\textup{sv}}(u_{2},y)-\partial_{1}\ell_{\textup{sv}}(u_{2},y)(u_{1}-u_{2})\geq 0.$$

In addition, the population risk is smooth, i.e., there exists a constant βsv>0\beta_{\textup{{sv}}}>0 such that h1,h2H\forall h_{1},h_{2}\in\pazocal{H},

$$\mathbb{P}\ell_{\textup{sv}}(h_{1}(x),y)-\mathbb{P}\ell_{\textup{sv}}(h_{2}(x),y)-\mathbb{P}[\partial_{1}\ell_{\textup{sv}}(h_{2}(x),y)(h_{1}(x)-h_{2}(x))]\leq\frac{\beta_{\textup{sv}}}{2}\mathbb{P}[(h_{1}(x)-h_{2}(x))^{2}].$$

Given a cost function that is globally convex and locally strongly convex, we define {α(v)}v0\{\alpha(v)\}_{v\geq 0} as follows.

Definition 5 (strong convexity parameter).

For a fixed v>0v>0, let α(v)\alpha(v) be the largest constant such that for all yYy\in\pazocal{Y}, sv(u+y,y)\ell_{\textup{{sv}}}(u+y,y) is α(v)\alpha(v)-strongly convex with respect to uu when u[v,v]u\in[-v,v]. That is,

sv(u1+y,y)sv(u2+y,y)1sv(u2+y,y)(u1u2)α(v)2(u1u2)2,u1,u2[v,v],yY.\displaystyle\ell_{\textup{{sv}}}(u_{1}+y,y)-\ell_{\textup{{sv}}}(u_{2}+y,y)-\partial_{1}\ell_{\textup{{sv}}}(u_{2}+y,y)(u_{1}-u_{2})\geq\frac{\alpha(v)}{2}(u_{1}-u_{2})^{2},\quad\forall u_{1},u_{2}\in[-v,v],\forall y\in\pazocal{Y}.

Clearly {α(v)}v0\{\alpha(v)\}_{v\geq 0} is non-increasing with respect to vv, and we denote α(0)=lim supv0α(v)\alpha(0)=\limsup_{v\rightarrow 0}\alpha(v).

When sv\ell_{\textup{{sv}}} is second-order continuously differentiable, we have the simple relation

$$\alpha(v)=\inf_{|u|\leq v,\,y\in\pazocal{Y}}\partial^{2}_{1,1}\ell_{\textup{sv}}(u+y,y),\quad\forall v\geq 0,$$

where 1,12sv\partial^{2}_{1,1}\ell_{\textup{{sv}}} is the second order partial derivative of sv\ell_{\textup{{sv}}} with respect to its first argument. Moreover, to accommodate popular choices of robust costs, Definition 5 also allows 1sv\partial_{1}\ell_{\textup{{sv}}} to be non-differentiable at certain points in its domain. We list three widely used cost functions, their strong convexity parameters {α(v)}v0\{\alpha(v)\}_{v\geq 0}, and the smoothness parameters βsv\beta_{\textup{{sv}}} of the corresponding population risks.

  • Square cost: consider the regression setting 𝔼[y|x]=htrue(x)\mathbb{E}[y|x]=h_{\text{true}}(x), where htrueh_{\text{true}} is the function we want to estimate (not necessarily in H\pazocal{H}). It is natural to consider the square cost function

    sv(h(x),y)=12(h(x)y)2.\displaystyle\ell_{\textup{{sv}}}(h(x),y)=\frac{1}{2}(h(x)-y)^{2}.

    Here sv(u+y,y)=u2\ell_{\textup{{sv}}}(u+y,y)=u^{2}. Thus α(v)=12,v0\alpha(v)=\frac{1}{2},\forall v\geq 0. The smoothness parameter of the population risk is βsv=12\beta_{\textup{{sv}}}=\frac{1}{2}. In this example, one does not need to localize the strong convexity parameter α(v)\alpha(v) as it is a constant.

  • Huber cost: consider the regression setting 𝔼[y|x]=htrue(x)\mathbb{E}[y|x]=h_{\text{true}}(x), where htrueh_{\text{true}} is the function we want to estimate (not necessarily in H\pazocal{H}). When the conditional distribution of yy is “heavy tailed,” one often considers the Huber cost function as follows. For γ>0\gamma>0, let

    sv,γ(h(x),y)={12(h(x)y)2 for |h(x)y|γ,γ|h(x)y|γ22 for |h(x)y|>γ.\displaystyle\ell_{\textup{{sv}},\gamma}(h(x),y)=\left\{\begin{aligned} &\frac{1}{2}(h(x)-y)^{2}&\text{ for }|h(x)-y|\leq\gamma,\\ &\gamma|h(x)-y|-\frac{\gamma^{2}}{2}&\text{ for }|h(x)-y|>\gamma.\end{aligned}\right. (8.1)

    Here α(v)=12\alpha(v)=\frac{1}{2} whenever vγv\leq\gamma but α(v)=0\alpha(v)=0 for all v>γv>\gamma. The smoothness parameter of the population risk is βsv=12\beta_{\textup{{sv}}}=\frac{1}{2}. Localization analysis of α(v)\alpha(v) is required for this loss, and the key is to avoid its inverse diverging to infinity.

  • Logistic cost: consider the standard logistic regression setting, where $y\in\{-1,1\}$ and one models the “log odds ratio” as

    log(Prob(y=1|x)/Prob(y=1|x))=htrue(x).\displaystyle\log\left(\text{Prob}(y=1|x)/\text{Prob}(y=-1|x)\right)=h_{\text{true}}(x). (8.2)

    Here htrueh_{\text{true}} is the discriminant function to be estimated (not necessarily in H\pazocal{H}). The maximum likelihood estimation problem corresponds to using the cost function

    sv(h(x),y)=log(1+exp(yh(x))).\displaystyle\ell_{\textup{{sv}}}(h(x),y)=\log\Big{(}1+\exp(-yh(x))\Big{)}.

    Here $\partial_{1,1}^{2}\ell_{\textup{sv}}(u+y,y)=\frac{\exp(1+uy)}{(1+\exp(1+uy))^{2}}$, so we have $\alpha(v)=\frac{\exp(v+1)}{(\exp(v+1)+1)^{2}}$, $\forall v\geq 0$, and the smoothness parameter of the population risk is $\beta_{\textup{sv}}=\frac{1}{4}$. The issue is that $\frac{1}{\alpha(v)}$, a complexity constant that will appear in the generalization error bound, grows exponentially with $v$ [21, 43]. This issue strongly motivates us to localize the parameter $v$ within $\alpha(v)$ to avoid large exponential constants.
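The exponential growth of $1/\alpha(v)$ for the logistic cost can be illustrated with a short Python sketch. It compares the closed form stated above with a brute-force minimum of the second derivative over $|u|\leq v$ and $y\in\{-1,1\}$; the function names and the grid size are hypothetical choices.

```python
import math

def alpha_logistic(v):
    """Closed form alpha(v) = exp(v + 1) / (exp(v + 1) + 1)^2 for the logistic cost."""
    e = math.exp(v + 1.0)
    return e / (e + 1.0) ** 2

def alpha_logistic_numeric(v, grid=2001):
    """Brute-force minimum of the second derivative over |u| <= v and y in {-1, 1}."""
    best = float("inf")
    for y in (-1.0, 1.0):
        for k in range(grid):
            u = -v + 2.0 * v * k / (grid - 1)
            e = math.exp(1.0 + u * y)
            best = min(best, e / (1.0 + e) ** 2)
    return best

for v in (0.0, 2.0, 5.0, 10.0):
    print(v, alpha_logistic(v), 1.0 / alpha_logistic(v))
```

The minimum is attained at $uy=v$, where $|1+uy|$ is largest, so localizing $v$ keeps $1/\alpha(v)$ from blowing up.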

The following assumption is usually invoked in the most representative literature on this topic [50, 53].

Assumption 8 (optimality condition).

Recall that $h^{*}\in\arg\min_{h\in\pazocal{H}}\mathbb{P}\ell_{\textup{sv}}(h(x),y)$ is the population risk minimizer. Assume for all $h\in\pazocal{H}$,

[1sv(h(x),y)(h(x)h(x))]0.\displaystyle\mathbb{P}[\partial_{1}\ell_{\textup{{sv}}}(h^{*}(x),y)(h(x)-h^{*}(x))]\geq 0.

We summarize the two primary settings where Assumption 8 holds true.

  • Well-specified models: for certain problems, as long as the model is well-specified, $\partial_{1}\ell_{\textup{sv}}(h^{*}(x),y)$ is independent of $x$ and $\mathbb{E}\,\partial_{1}\ell_{\textup{sv}}(h^{*}(x),y)=0$. Thus Assumption 8 will hold. Examples include 1) the settings studied in [53] where $\ell_{\textup{sv}}$ is a univariate function of $(h(x)-y)$ and $\partial_{1}\ell_{\textup{sv}}(h^{*}(x),y)$ is odd with respect to $y$, such as applications that use the square cost or the Huber cost; and 2) generalized linear models where the conditional distribution of $y$ belongs to the exponential family, such as the logistic regression problem (8.2).

  • H\pazocal{H} is a convex class of functions: in this case, we verify Assumption 8 as follows. If there exists some h1Hh_{1}\in\pazocal{H} such that Assumption 8 is not true, then by considering hλ=λh1+(1λ)hHh_{\lambda}=\lambda h_{1}+(1-\lambda)h^{*}\in\pazocal{H} with λ\lambda sufficiently close to 0, we find sv(hλ(x),y)<sv(h(x),y)\mathbb{P}\ell_{\textup{{sv}}}(h_{\lambda}(x),y)<\mathbb{P}\ell_{\textup{{sv}}}(h^{*}(x),y), in contradiction, as hh^{*} is the population risk minimizer.

We call the random variable $\partial_{1}\ell_{\textup{sv}}(h^{*}(x),y)$ the “noise multiplier” as it often characterizes the “effective noise” of the learning problem when using a particular cost function. We define another random variable $\xi:=h^{*}(x)-y$. In some applications, $\xi$ is closely related to the “noise multiplier” (e.g., they are equivalent when one uses the square cost). The notation $\xi$ is useful in other applications as well, because one always seeks to localize the parameter $v$ in $\alpha(v)$ to the order of $\|\xi\|_{L_{2}}$. We denote by $\Delta=\sup_{h\in\pazocal{H}}\|h(x)-y\|_{L_{2}}$ and $\Delta_{\infty}=\sup_{h,x,y}|h(x)-y|$ the worst-case $L_{2}$ distance and $L_{\infty}$ distance between $h(x)$ and $y$, respectively. We typically have $\|\xi\|_{L_{2}}\ll\Delta\ll\Delta_{\infty}$ in practical applications.

Our analysis requires a very weak distributional assumption:

Assumption 9 (“small ball” property).

There exist constants κ>0\kappa>0 and cκ(0,1)c_{\kappa}\in(0,1) such that for all hHh\in\pazocal{H},

Prob(|h(x)h(x)|κhhL2)cκ.\displaystyle\textup{Prob}\left(|h(x)-h^{*}(x)|\geq\kappa\|h-h^{*}\|_{L_{2}}\right)\geq c_{\kappa}.

Assumption 9 is often referred to as “minimal” in the literature, and there are many examples in which it can be verified for κ\kappa and cκc_{\kappa} that are absolute constants [50, 53, 32, 27, 64, 33]. The scope of Assumption 9 subsumes and is much broader than the “sub-Gaussian” setting. For example, it is naturally satisfied when the class {hh:hH}\{h-h^{*}:h\in\pazocal{H}\} satisfies any sort of moment equivalence (see, e.g., [50, Lemma 4.1]).
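As a simple illustration of Assumption 9, the Python sketch below estimates the small-ball probability $\textup{Prob}(|h(x)-h^{*}(x)|\geq\kappa\|h-h^{*}\|_{L_{2}})$ for a linear difference $h(x)-h^{*}(x)=\langle a,x\rangle$, once with Gaussian features and once with heavy-tailed Student-$t$ features. The linear class, the direction $a$, the value $\kappa=0.5$, and the choice of 3 degrees of freedom are all assumptions made for this example.

```python
import math
import numpy as np

def small_ball_probability(samples, kappa):
    """Empirical Prob(|f(x)| >= kappa * ||f||_{L2}), with the L2 norm estimated from the samples."""
    l2_norm = np.sqrt(np.mean(samples ** 2))
    return np.mean(np.abs(samples) >= kappa * l2_norm)

rng = np.random.default_rng(0)
n, d, kappa = 200_000, 10, 0.5
a = rng.standard_normal(d)

gauss = rng.standard_normal((n, d)) @ a          # Gaussian features
heavy = rng.standard_t(df=3, size=(n, d)) @ a    # heavy-tailed features

p_gauss = small_ball_probability(gauss, kappa)
p_heavy = small_ball_probability(heavy, kappa)
# For Gaussian features the exact value is Prob(|N(0,1)| >= kappa) = erfc(kappa / sqrt(2)).
p_exact = math.erfc(kappa / math.sqrt(2.0))
print(p_gauss, p_exact, p_heavy)
```

In both cases the estimated probability is bounded away from zero, even though the Student-$t$ features are far from sub-Gaussian.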

Main challenges.

Let us first examine limitations of the results obtained using the traditional “local Rademacher complexity” analysis (Statement 1), which includes the results from [3, 79, 13] in the fast-rate regime. Assuming the cost function to be $L_{\textup{sv}}$-Lipschitz continuous with respect to its first argument and setting $f(z)=\ell_{\textup{sv}}(h(x),y)-\ell_{\textup{sv}}(h^{*}(x),y)$, $T(f)=\mathbb{P}[f^{2}]$, and $B_{e}=L_{\textup{sv}}^{2}/\alpha(\Delta_{\infty})$, following Statement 1, one can prove that the empirical risk minimizer $\hat{h}$ satisfies

$$\mathscr{E}(\hat{h})\leq O\left(\frac{r^{*}}{B_{e}}\right), \tag{8.3}$$

where $r^{*}$ is the fixed point of $B_{e}\psi$, and $\psi$ is a sub-root surrogate function that governs $\sup_{\mathbb{P}[f^{2}]\leq r}(\mathbb{P}-\mathbb{P}_{n})f$. Denote by $r_{1}^{*}$ the fixed point of $\psi$. From the sub-root property of $\psi$ we know that $r^{*}\geq B_{e}^{2}r_{1}^{*}$, so the generalization error bound (8.3) is at least of order

$$\frac{r^{*}}{B_{e}}\geq B_{e}r_{1}^{*}=\frac{L_{\textup{sv}}^{2}}{\alpha(\Delta_{\infty})}r_{1}^{*}. \tag{8.4}$$

The main message here is that the traditional result (8.3) is often loose and not problem-dependent. As indicated by Mendelson in a series of papers [50, 53], the traditional result (8.3) has the following limitations.

  • The global Lipschitz constant $L_{\textup{sv}}$ is not problem-dependent and potentially unbounded. $L_{\textup{sv}}$ is effectively the worst-case value $\sup_{h,x,y}|\partial_{1}\ell(h(x),y)|$. For the square cost, this is $\Delta_{\infty}=\sup_{h,x,y}|h(x)-y|$, which is unbounded when either the hypothesis class or the noise is unbounded. It would be beneficial to have a bound that only scales with a measure related to the “noise multiplier” $\partial_{1}\ell(h^{*}(x),y)$, because we usually have $|\partial_{1}\ell(h^{*}(x),y)|\ll L_{\textup{sv}}$ in practical applications.

  • The global strong convexity parameter α(Δ)\alpha(\Delta_{\infty}) is often very small for the logistic cost and the Huber cost, so its inverse is often large (and potentially unbounded). The challenge here is to sharpen this to the inverse of a localized strongly convex parameter α(O(ξL2))\alpha(O(\|\xi\|_{L_{2}})). Since we usually have σΔ\sigma\ll\Delta_{\infty}, the inverse of α(O(ξL2))\alpha(O(\|\xi\|_{L_{2}})) can be much smaller than the inverse of α(Δ)\alpha(\Delta_{\infty}).
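To make the scaling in (8.4) tangible, the sketch below computes the two fixed points for one hypothetical sub-root function, $\psi(r)=a\sqrt{r}$. For this particular choice the relation $r^{*}=B_{e}^{2}r_{1}^{*}$ holds with equality, so $r^{*}/B_{e}=B_{e}r_{1}^{*}$ grows linearly in the worst-case constant $B_{e}$. The values of `a` and `B_e` are placeholders chosen for illustration only.

```python
import math


def fixed_point(func, r_init=1.0, iters=200):
    """Approximate a positive fixed point of func by plain iteration r <- func(r)."""
    r = r_init
    for _ in range(iters):
        r = func(r)
    return r


a = 0.05      # hypothetical complexity scale, e.g. of order sqrt(d / n)
B_e = 40.0    # hypothetical worst-case constant L_sv^2 / alpha(Delta_inf)


def psi(r):
    """A simple sub-root surrogate: psi(r) / sqrt(r) is constant in r."""
    return a * math.sqrt(r)


r1_star = fixed_point(psi)                        # fixed point of psi, equals a^2
r_star = fixed_point(lambda r: B_e * psi(r))      # fixed point of B_e * psi, equals (B_e * a)^2

print(f"r_1^* = {r1_star:.6f}, r^* = {r_star:.6f}")
print(f"r^*/B_e = {r_star / B_e:.6f}, B_e * r_1^* = {B_e * r1_star:.6f}")
```

The iteration converges here because the map $r\mapsto a\sqrt{r}$ is a contraction in $\log r$; other surrogate functions may need a different solver.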

The “small ball method” and beyond.

The breakthrough papers [50, 53] propose the “small ball method” (also referred to as “learning without concentration”) to provide problem-dependent rates that overcome the limitations mentioned above. Their proofs build on structural results for $0$-$1$ valued indicator functions under the small-ball condition, whose connection to the traditional localization analysis may not be completely obvious. Moving the focal point from indicator functions to “truncated” functions, we provide the following perspectives.

1) A simple interpretation of the “small-ball” condition is that suitably “truncated” quadratic forms are of the same scale as the original quadratic forms. Under the “small-ball” condition, one can show that uniformly over all $h\in\pazocal{H}$,

$$\mathbb{P}\left[\min\{(h(x)-h^{*}(x))^{2},\kappa^{2}\|h-h^{*}\|_{L_{2}}^{2}\}\right]\geq\textup{Prob}\left(|h(x)-h^{*}(x)|\geq\kappa\|h-h^{*}\|_{L_{2}}\right)\kappa^{2}\|h-h^{*}\|_{L_{2}}^{2}\geq c_{\kappa}\kappa^{2}\,\mathbb{P}[(h(x)-h^{*}(x))^{2}].$$

This suggests that one only needs to concentrate simple “truncated” functions to derive generalization error bounds.
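The first inequality in the display above is a pointwise fact: $\min\{u^{2},c\}\geq c\cdot\mathbb{1}\{u^{2}\geq c\}$ for any $c\geq 0$. It therefore also holds exactly for empirical averages, whatever the distribution. The following sketch checks this on a heavy-tailed sample; the symmetric Pareto distribution, the value of `kappa`, and the helper name are assumptions made for illustration.

```python
import random


def truncated_moment_bounds(diffs, kappa):
    """Return (truncated second moment, small-ball lower bound, plain second moment).

    diffs plays the role of h(x_i) - h*(x_i); all averages are empirical.
    """
    n = len(diffs)
    second_moment = sum(d * d for d in diffs) / n
    cap = kappa ** 2 * second_moment                  # truncation level kappa^2 * ||h - h*||^2
    truncated = sum(min(d * d, cap) for d in diffs) / n
    small_ball_freq = sum(1 for d in diffs if d * d >= cap) / n
    return truncated, small_ball_freq * cap, second_moment


rng = random.Random(1)
# Symmetric Pareto sample: heavy tails, so the plain second moment is dominated by outliers.
diffs = [rng.choice((-1.0, 1.0)) * rng.paretovariate(2.5) for _ in range(5000)]
truncated, lower_bound, second_moment = truncated_moment_bounds(diffs, kappa=0.5)
print(f"truncated = {truncated:.4f}, lower bound = {lower_bound:.4f}, "
      f"second moment = {second_moment:.4f}")
```

The truncated average is bounded by the truncation level, which is why it concentrates well even when the untruncated quadratic form does not.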

2) One-sided uniform inequalities are contained in the “uniform localized convergence” framework and are often derived from concentration of truncated functions. Many one-sided uniform inequalities can be equivalently written as “uniform localized convergence” arguments. Consider the uniform “lower isomorphic bound” (which plays a central role in the “small-ball” method): for some constant $c>0$, with high probability, uniformly over all $h\in\pazocal{H}$,

$$\mathbb{P}_{n}[(h(x)-h^{*}(x))^{2}]\geq c\,\mathbb{P}[(h(x)-h^{*}(x))^{2}].$$

The above argument is equivalent to the following “uniform localized convergence” argument:

$$(\mathbb{P}-\mathbb{P}_{n})[(h(x)-h^{*}(x))^{2}]\leq(1-c)T(h),\quad\forall h\in\pazocal{H},$$

where the measurement functional $T(h)$ is set to be $\|h-h^{*}\|^{2}_{L_{2}}$. A more flexible perspective may directly view the truncated quadratic forms as the concentrated functions, making traditional two-sided uniform convergence tools applicable in a straightforward manner.

Motivated by the above observations, an interesting question is to recover the results in [50, 53] through a “one-shot” concentration framework, explicitly figuring out which component of the excess loss contributes to which part of the surrogate function. In what follows, we will present such an analysis. While our error bounds roughly follow the same form as the results in [50, 53], we obtain several technical improvements; see Section 8.3 for the novel implications and methodological contributions of our approach.

8.2 Main results and illustrative examples

We assume some regularity conditions that hold for non-pathological choices of surrogate functions.

Assumption 10 (regularity conditions on surrogate functions).

Assume there is a non-decreasing, non-negative and bounded function φ(r)\varphi(r) such that r>0\forall r>0,

{hh:hH,hhL22r}φ(r);\displaystyle\mathfrak{R}\{h-h^{*}:h\in\pazocal{H},\|h-h^{*}\|_{L_{2}}^{2}\leq r\}\leq\varphi(r); (8.5)

and there is a meaningful surrogate function φnoise(r,δ)\varphi_{\textup{noise}}(r,\delta) that is non-decreasing w.r.t. rr, and satisfies that δ(0,1)\forall\delta\in(0,1), with probability at least 1δ1-\delta,

suphH,hhL22r{(n)[1sv(h(x),y)(hh)]}φnoise(r,δ).\displaystyle\sup_{h\in\pazocal{H},\|h-h^{*}\|_{L_{2}}^{2}\leq r}\left\{(\mathbb{P}-\mathbb{P}_{n})[\partial_{1}\ell_{\textup{sv}}(h^{*}(x),y)(h-h^{*})]\right\}\leq\varphi_{\textup{noise}}(r,\delta). (8.6)

Given any fixed δ(0,1)\delta\in(0,1) and r0(0,4Δ2)r_{0}\in(0,4\Delta^{2}), denote Cr0=2+(16cκ+2)log4Δ2r0C_{r_{0}}=2+\left(\frac{16}{c_{\kappa}}+2\right)\log\frac{4\Delta^{2}}{r_{0}}. Assume there is a positive integer N¯δ,r0\bar{N}_{\delta,r_{0}} such that for all nN¯δ,r0n\geq\bar{N}_{\delta,r_{0}},

φnoise(8Δ2;δCr0)α(4ξL2/cκ)ξL222andφ(8Δ2)2cκξL2216Δ.\displaystyle\varphi_{\textup{noise}}\left(8\Delta^{2};\frac{\delta}{C_{r_{0}}}\right)\leq\frac{\alpha(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}})\|\xi\|_{L_{2}}^{2}}{2}\quad\text{and}\quad\varphi\left(8\Delta^{2}\right)\leq\frac{\sqrt{2c_{\kappa}}\|\xi\|^{2}_{L_{2}}}{16\Delta}. (8.7)

We note that the requirements do not place meaningful restrictions on the choice of surrogate function. The main requirement, condition (8.7), asks for uniform errors over H\pazocal{H} to be smaller than some fixed values that are independent of nn. For non-pathological choices of surrogate functions, this will always be satisfied as long as the sample size nn is larger than some positive integer N¯δ,r0\bar{N}_{\delta,r_{0}}. The boundedness requirement for φ\varphi (and φnoise\varphi_{\text{noise}}) can always be met by setting φ(r)=φ(4Δ2)\varphi(r)=\varphi(4\Delta^{2}) (and φnoise(r;δ)=φnoise(4Δ2;δ)\varphi_{\text{noise}}(r;\delta)=\varphi_{\text{noise}}(4\Delta^{2};\delta)) for all r4Δ2r\geq 4\Delta^{2}, because hhL22Δ\|h-h^{*}\|_{L_{2}}\leq 2\Delta for all hHh\in\pazocal{H}.

Theorem 8 (supervised learning with structured convex cost).

Let Assumptions 7, 8, 9, 10 hold and $\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)>0$. Let $r_{\text{ver}}^{*}$ be the fixed point of the function

$$\frac{8}{c_{\kappa}\kappa}\sqrt{2r}\,\varphi(2r). \tag{8.8}$$

Given any fixed $\delta\in(0,1)$ and $r_{0}\in(0,4\Delta^{2})$, let $r_{\text{noise}}^{*}$ be the fixed point of the function

$$\frac{4}{c_{\kappa}\kappa^{2}\cdot\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)}\varphi_{\text{noise}}\left(2r;\frac{\delta}{C_{r_{0}}}\right). \tag{8.9}$$

Then with probability at least 1δ1-\delta, the empirical risk minimizer h^\hat{h} satisfies

h^hL2()2max{rnoise,rver,r0}and\displaystyle\|\hat{h}-h^{*}\|^{2}_{L_{2}(\mathbb{P})}\leq\max\left\{r^{*}_{\textup{noise}},\ r^{*}_{\textup{ver}},\ r_{0}\right\}\quad\text{and}
(h^)βsv2max{rnoise,rver,r0},\displaystyle\quad\mathscr{E}(\hat{h})\leq\frac{\beta_{\textup{{sv}}}}{2}\max\left\{r^{*}_{\textup{noise}},\ r^{*}_{\textup{ver}},\ r_{0}\right\},

provided that n>max{N¯δ,r0,72cκ2logCr0δ}n>\max\left\{\bar{N}_{\delta,r_{0}},\frac{72}{c_{\kappa}^{2}}\log\frac{C_{r_{0}}}{\delta}\right\}.

Remarks.

1) The term $r_{0}$ is negligible since it can be arbitrarily small. One can simply set $r_{0}=1/n^{4}$, which will be much smaller than $r^{*}_{\text{noise}}$ for typical applications. In high-probability bounds, $C_{r_{0}}$ will only appear in the form $\log(C_{r_{0}}/\delta)$, which is of a negligible $O(\log\log n)$ order. In the subsequent discussion, we will hide parameters that only depend on $\kappa$ and $c_{\kappa}$ in the big $O$ notation, as they are often absolute constants in practical applications.

2) The two fixed points $r^{*}_{\text{noise}}$ and $r^{*}_{\text{ver}}$ correspond to the two sources of complexity: the uniform errors characterized by the two surrogate functions in (8.8) and (8.9). Recall that a fundamental limitation of the traditional “local Rademacher complexity” analysis is that it requires a “sub-root” surrogate function that cannot differentiate the two sources of complexity. In contrast, the surrogate function in (8.8) (which we write as $O(\sqrt{r}\varphi(r))$ for simplicity) is a “super-root” function, so our analysis overcomes that limitation and provides more precise upper bounds. The key point is that $O(\sqrt{r}\varphi(r))$ is a benign “super-root” surrogate function, in the sense that its fixed point $r^{*}_{\text{ver}}$ is “very small” when the sample size is large enough; in other words, when the problem is learnable. For example, for a $d$-dimensional linear class, where $\varphi=O(\sqrt{dr/n})$, $r^{*}_{\text{ver}}$ will be the fixed point of $O(dr/n)$. Thus $r^{*}_{\text{ver}}$ will be $0$ as long as the sample size $n$ is larger than $O(d)$. Therefore, the typical generalization error derived by Theorem 8 is of order

(h^)βsv2rnoise,\displaystyle\mathscr{E}(\hat{h})\leq\frac{\beta_{\textup{{sv}}}}{2}r^{*}_{\text{noise}},

where rnoiser^{*}_{\text{noise}} is the fixed point of the function in (8.9). Clearly, rnoiser^{*}_{\textup{noise}} only depends on the noise multiplier at hh^{*} and the local strong convexity parameter, and it is independent of the worst-case parameters of the cost function.
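The claim in Remark 2 that $r^{*}_{\text{ver}}=0$ for a $d$-dimensional linear class once $n$ exceeds a multiple of $d$ can be seen from a one-line iteration: the surrogate $r\mapsto C\,dr/n$ is linear, so its only fixed point is $0$ whenever $Cd/n<1$, and iterating the map shrinks any starting value geometrically. In the sketch below, `const` stands in for the unspecified absolute constant hidden in the $O(\cdot)$ notation, and the values of `d` and `n` are illustrative.

```python
def iterate_linear_surrogate(d, n, const=1.0, r_init=1.0, iters=500):
    """Iterate r <- const * d * r / n, the O(dr/n) surrogate for a linear class."""
    r = r_init
    for _ in range(iters):
        r = const * d * r / n
    return r


# Enough samples (n > const * d): the iterates collapse to 0, i.e. r_ver^* = 0.
r_learnable = iterate_linear_surrogate(d=50, n=200)

# Too few samples (n < const * d): the iterates blow up, so the localization gives no control.
r_unlearnable = iterate_linear_surrogate(d=50, n=40)

print(f"n=200: r after iteration = {r_learnable:.3e}")
print(f"n=40:  r after iteration = {r_unlearnable:.3e}")
```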

At a high level, the subscripts “ver” and “noise” stand for “version space” and “noise multiplier,” respectively. Intuitively, $r^{*}_{\text{ver}}$ is the estimation error of the noise-free realizable problem, which reflects the complexity of the version space, i.e., the random subset of $\pazocal{H}$ that consists of all $h$ that agree with $h^{*}$ on $\{x_{i}\}_{i=1}^{n}$. On the other hand, $r^{*}_{\textup{noise}}$ is the estimation error induced by the interaction of $\pazocal{H}$ and the noise multiplier $\partial_{1}\ell_{\textup{sv}}(h^{*}(x),y)$. We refer to [50] for a more detailed discussion on the source of these two fixed points.

Now we present some representative applications of Theorem 8.

Example 6 (localization of unfavorable parameters).

In practical applications, one often wants to avoid the global Lipschitz constant and the inverse of the global strong convexity parameter. For example, in regression with square cost, the global Lipschitz constant is equal to $\Delta_{\infty}$ and is often unbounded, so it is desirable to convert it to $\|\xi\|_{L_{2}}$; and in logistic regression, the inverse of the global strong convexity parameter is an exponential constant $e^{O(\Delta_{\infty})}$, which we hope to convert to $e^{O(\|\xi\|_{L_{2}})}$. These goals are achieved in Theorem 8: since the right hand side of (8.8) contains an extra $\sqrt{r}$ factor, $r^{*}_{\text{ver}}$ is typically much smaller than $r^{*}_{\textup{noise}}$ for sufficiently large $n$ (see Remark 2 after Theorem 8). Therefore, the generalization error bound will be determined by the fixed point $r^{*}_{\textup{noise}}$, which only depends on the noise multiplier at $h^{*}$ and the local strong convexity parameter.

Example 7 (regression with heavy-tailed noise).

We consider the problem of predicting yy using h(x)h(x), and allow the “noise” ξ=h(x)y\xi=h^{*}(x)-y to be heavy-tailed. To illustrate the main message of this example, we consider the dd-dimensional linear class with sub-Gaussian features. That is, h(x)=θTxh(x)=\theta^{T}x where θd\theta\in\mathbb{R}^{d}, and the random feature xdx\in\mathbb{R}^{d} is sub-Gaussian. In this setting, the Huber cost is preferred to the square cost.

  • For the Huber cost with truncation parameter $\gamma=O(\|\xi\|_{L_{2}})$ in its definition, Theorem 8 implies that the parameter $v$ will be localized to the region where the strong convexity parameter $\alpha(v)$ is non-zero. As a result, the strong convexity parameter in the generalization error bound will be $\frac{1}{2}$ rather than the problematic value $0$ (since the generalization error scales with the inverse of $\alpha(v)$, the value $0$ would make the bound vacuous). For the $d$-dimensional linear class, $r^{*}_{\text{ver}}$ will be $0$ as long as $n\geq O(d)$. Since $\partial_{1}\ell_{\textup{sv}}(h^{*}(x),y)$ will be uniformly bounded by $O(\sigma)$, we obtain

    rnoiseO(ξL22(d+log1δ)n),\displaystyle r_{\textup{noise}}^{*}\leq O\left(\frac{\|\xi\|_{L_{2}}^{2}(d+\log\frac{1}{\delta})}{n}\right),

    which recovers the problem-dependent rate in [53].

  • For the square cost, the fixed point $r^{*}_{\textup{noise}}$ will often cause the generalization error to be sub-optimal. For the $d$-dimensional linear class, $r^{*}_{\textup{noise}}$ will have a polynomial dependence on $1/\delta$, as explained in [53]. The reason is that in the definition of $\varphi_{\textup{noise}}(r,\delta)$ in (8.6), the noise multiplier $\partial_{1}\ell_{\textup{sv}}(h^{*}(x),y)$ is equal to $\xi$ for the square cost. For “heavy-tailed” $\xi$, this will cause the rate $r^{*}_{\textup{noise}}$ to be sub-optimal.
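The contrast between the two bullets above comes down to the size of the noise multiplier. The sketch below assumes the standard form of the Huber cost, whose derivative in the residual $u$ is $u$ clipped to $[-\gamma,\gamma]$; the paper's own normalization may differ by constants. It compares this with the square-cost multiplier, taken to be $\xi$ as stated in the text, on a heavy-tailed sample. The symmetric Pareto noise with infinite variance and the value of `gamma` are illustrative assumptions.

```python
import random


def huber_multiplier(u, gamma):
    """Derivative of the (standard-form) Huber cost in the residual u: clip to [-gamma, gamma]."""
    return max(-gamma, min(gamma, u))


def square_multiplier(u):
    """Noise multiplier for the square cost, taken to be the residual itself (up to constants)."""
    return u


rng = random.Random(2)
gamma = 2.0   # hypothetical truncation parameter, meant to be of the order of the noise scale
# Symmetric Pareto noise with shape 1.5: the variance is infinite.
xi = [rng.choice((-1.0, 1.0)) * rng.paretovariate(1.5) for _ in range(10000)]

max_huber = max(abs(huber_multiplier(u, gamma)) for u in xi)
max_square = max(abs(square_multiplier(u)) for u in xi)
print(f"max |Huber multiplier|  = {max_huber:.3f}")
print(f"max |square multiplier| = {max_square:.3f}")
```

The Huber multiplier is uniformly bounded by `gamma`, so its empirical averages admit sub-Gaussian-type deviation bounds, whereas the square-cost multiplier inherits the heavy tail of $\xi$.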

We note that the condition that h^\hat{h} is the empirical risk minimizer is not essential to the proof of Theorem 8. Similar to the prior work [33], we can extend the result to more general learning rules that are based on regularization (e.g., LASSO [73], SLOPE [7], etc.) as follows.

Corollary 9 (extension to general regularized learning rules).

Let Assumptions 7, 8, 9 hold. Let $\hat{h}$ be the solution of

minHnsv(h(x),y)+Ψ(h),\displaystyle\min_{\pazocal{H}}\mathbb{P}_{n}\ell_{\textup{{sv}}}(h(x),y)+\Psi(h), (8.10)

where Ψ(h)\Psi(h) is a non-negative regularization term. Let H0{\pazocal{H}_{0}} be a subset of H\pazocal{H} that is independent of the samples. If inequality (8.5) is modified to

{hh:hH0,hhL22r}φ(r),\displaystyle\mathfrak{R}\{h-h^{*}:h\in\pazocal{H}_{0},\|h-h^{*}\|_{L_{2}}^{2}\leq r\}\leq\varphi(r),

and inequality (8.6) is modified to

$$\sup_{h\in\pazocal{H}_{0},\|h-h^{*}\|_{L_{2}}^{2}\leq r}\left\{(\mathbb{P}-\mathbb{P}_{n})[\partial_{1}\ell_{\textup{sv}}(h^{*}(x),y)(h-h^{*})]\right\}+\Psi(h^{*})\leq\varphi_{\textup{noise}}(r;\delta),$$

then under Assumption 10, conditioned on the event {h^H0}\{\hat{h}\in\pazocal{H}_{0}\}, the conclusion of Theorem 8 remains true.

As illustrated in the following example, Corollary 9 is able to recover several important results in the high-dimensional statistics literature.

Example 8 (high-dimensional estimation and LASSO).

Consider the linear regression set-up $\mathbb{E}[y|x]=x^{T}\theta^{*}$ where $\theta^{*}\in\Theta\subseteq\mathbb{R}^{d}$, $d\gg n$ and $\|\theta^{*}\|_{0}\leq s\ll d$. Consider the LASSO estimator $\hat{\theta}$, which is the solution of the $\ell_{1}$-norm regularized risk minimization problem, where the regularization term is $\Psi(h)=\lambda\|\theta\|_{1}$ and $\lambda>0$ is the regularization parameter, i.e.,

θ^argminΘnsv(θTx,y)+λθ1.\displaystyle\hat{\theta}\in\operatorname*{arg\,min}_{\Theta}\mathbb{P}_{n}\ell_{\textup{{sv}}}(\theta^{T}x,y)+\lambda\|\theta\|_{1}.

Assume $\ell_{\textup{sv}}$ is the square cost and $\xi$ is $\sigma$-sub-Gaussian, or $\ell_{\textup{sv}}$ is the Huber cost with truncation parameter $\gamma=O(\sigma)$. Assume the feature $x\in\mathbb{R}^{d}$ is sub-Gaussian. Following standard analysis (see, e.g., [57, Lemma 1]), by setting $\lambda$ to be of order $\sqrt{\sigma^{2}\log(d/\delta)/n}$, the LASSO estimator $\hat{\theta}$ will lie in a sparse cone $\Theta_{S}$ (with high probability), where it can be proven [41] that $\varphi(r)=O(\sqrt{rs\log d/n})$ and $\varphi_{\text{noise}}(r;\delta)=O(\sqrt{r\sigma^{2}s\log(d/\delta)/n})$ (ignoring dependence on the parameters $\kappa$ and $c_{\kappa}$ described in Assumption 9). Applying Corollary 9 with $\pazocal{H}_{0}=\{x\mapsto\theta^{T}x:\theta\in\Theta_{S}\}$ and $n\geq\Omega(s\log d)$, we have $r^{*}_{\text{ver}}=0$ and

rnoiseO(σ2slogdδn).\displaystyle r^{*}_{\textup{noise}}\leq O\left(\frac{\sigma^{2}s\log\frac{d}{\delta}}{n}\right).
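For a sense of scale, the orders stated in Example 8 can be evaluated numerically. The sketch below sets every hidden absolute constant to $1$, which is an assumption made purely for illustration, so the numbers indicate how the quantities scale with $(\sigma,s,d,n,\delta)$ and are not actual bounds. The function names and the parameter values are hypothetical.

```python
import math


def lasso_lambda_order(sigma, d, n, delta):
    """Order of the regularization parameter: sqrt(sigma^2 * log(d / delta) / n)."""
    return math.sqrt(sigma ** 2 * math.log(d / delta) / n)


def lasso_rate_order(sigma, s, d, n, delta):
    """Order of r_noise^*: sigma^2 * s * log(d / delta) / n."""
    return sigma ** 2 * s * math.log(d / delta) / n


sigma, s, d, delta = 1.0, 10, 10_000, 0.05   # illustrative values with s << d
for n in (500, 1000, 2000):
    lam = lasso_lambda_order(sigma, d, n, delta)
    rate = lasso_rate_order(sigma, s, d, n, delta)
    print(f"n={n:5d}: lambda ~ {lam:.4f}, r_noise^* ~ {rate:.4f}")
```

With constants set to $1$, the rate equals $s\lambda^{2}$, and it depends on the ambient dimension $d$ only logarithmically.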

8.3 Contributions relative to previous approaches

So far we have recovered the main results in the prior works [50, 53], which are valid for unbounded regression problems and thus improve the traditional “local Rademacher complexity” analysis. Now we would like to illustrate how Theorem 8 improves the results in [50, 53] by removing a “star-shape” requirement. That is, we do not need to assume the hypothesis class is star-shaped/convex, or consider the star-hull of it which may increase complexity.

To be specific, [50, 53] assume that $\pazocal{H}$ is a convex class (and thus star-shaped). When $\pazocal{H}$ is not star-shaped, the results in [50, 53] are still valid by taking the star-hull of $\pazocal{H}$ and considering the local Rademacher/Gaussian complexity of the star-hull. The increase in complexity is quite moderate for traditional hypothesis classes (e.g., those characterized by covering number conditions [49, Lemma 4.6]). However, taking the star-hull may significantly increase the local Rademacher complexity of modern non-convex and overparameterized classes. Here we show that, even for very simple function classes (e.g., linear classes with non-convex support), our approach improves on what can be achieved using the star-hulls.

Note that the improvement brought by our approach is systematic and may carry over to more complicated learning procedures as well. A more comprehensive comparison with existing localization approaches will be presented after the following example.

Example 9 (overparameterized linear class with growing sparsity).

Consider the linear regression model

yN(xTθ,σ2),xN(0,Id×d),\displaystyle y\sim N(x^{T}\theta^{*},\sigma^{2}),\quad x\sim N(0,I_{d\times d}),

where $\theta^{*}\in\Theta\subseteq\mathbb{R}^{d}$ and $d\gg n$ (i.e., the model is overparameterized). Assume the feasible parameter set $\Theta$ satisfies that for all $\theta\in\Theta$,

θθ0θθ22.\displaystyle\|\theta-\theta^{*}\|_{0}\leq\lfloor\|\theta-\theta^{*}\|_{2}^{2}\rfloor. (8.11)

In other words, $\theta$ is allowed to differ from $\theta^{*}$ in more coordinates the farther it is from $\theta^{*}$. The maximum likelihood estimation problem corresponds to minimizing the empirical average of the square cost with respect to $\pazocal{H}=\{x\mapsto x^{T}\theta:\theta\in\Theta\}$. For this problem, the surrogate function $\varphi_{\text{noise}}$ needs to satisfy (with probability at least $1-\delta$)

supθΘ,θθ22r(n)[ξxT(θθ)]φnoise(r;δ),\displaystyle\sup_{\theta\in\Theta,\|\theta-\theta^{*}\|_{2}^{2}\leq r}(\mathbb{P}-\mathbb{P}_{n})[\xi\cdot x^{T}(\theta-\theta^{*})]\leq\varphi_{\text{noise}}(r;\delta), (8.12)

where the left hand side of (8.12) is the localized Gaussian complexity of H\pazocal{H}. Thanks to the sparsity condition (8.11), it can be tightly controlled by

φnoise(r;δ)=O(σ2(θ0+r)rlogdδn)=O(σ2θ0rlogdδn)problem-dependent component+O(σ2logdδnr)benign “super-root” component.\displaystyle\varphi_{\text{noise}}(r;\delta)=O\left(\sqrt{\frac{\sigma^{2}(\|\theta^{*}\|_{0}+r)r\log\frac{d}{\delta}}{n}}\right)=\underbrace{O\left(\sqrt{\frac{\sigma^{2}\|\theta^{*}\|_{0}r\log\frac{d}{\delta}}{n}}\right)}_{\text{problem-dependent component}}+\underbrace{O\left(\sqrt{\frac{\sigma^{2}\log\frac{d}{\delta}}{n}}\cdot r\right)}_{\text{benign ``super-root" component}}. (8.13)

Here, the benign “super-root” component in $\varphi_{\text{noise}}(r;\delta)$ does not affect the order of its fixed point $r_{\text{noise}}^{*}$: when $n\geq\Omega(4\sigma^{2}\log\frac{d}{\delta})$, the “super-root” component in (8.13) will be less than $\frac{1}{2}r$, so that $r_{\text{noise}}^{*}$ is of order $\sigma^{2}\|\theta^{*}\|_{0}\log\frac{d}{\delta}/n$. In other words, only the problem-dependent component in $\varphi_{\text{noise}}(r;\delta)$ matters.
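The effect of the “super-root” component on the fixed point can be worked out explicitly if one drops the $O(\cdot)$ and takes $\varphi_{\text{noise}}(r;\delta)=\sqrt{a(s+r)r}$ with $a=\sigma^{2}\log(d/\delta)/n$ and $s=\|\theta^{*}\|_{0}$, which is an assumption on the constants. Solving $r=\sqrt{a(s+r)r}$ gives $r=as/(1-a)$ for $a<1$, so for $a\leq 1/4$ the fixed point is within a factor $4/3$ of the problem-dependent value $as$. The sketch below confirms this numerically; the parameter values are illustrative.

```python
import math


def fixed_point(func, r_init=1.0, iters=500):
    """Approximate a positive fixed point of func by plain iteration r <- func(r)."""
    r = r_init
    for _ in range(iters):
        r = func(r)
    return r


sigma, s, d, delta, n = 1.0, 5, 10_000, 0.05, 200   # illustrative values
a = sigma ** 2 * math.log(d / delta) / n            # here a is about 0.06, below 1/4


def phi_noise(r):
    """Surrogate from (8.13) with all constants set to 1: sqrt(a * (s + r) * r)."""
    return math.sqrt(a * (s + r) * r)


r_numeric = fixed_point(phi_noise)
r_closed_form = a * s / (1.0 - a)      # exact solution of r = sqrt(a * (s + r) * r)
r_problem_dependent = a * s            # fixed point when the super-root part is dropped

print(f"numeric fixed point      = {r_numeric:.6f}")
print(f"closed form a*s/(1-a)    = {r_closed_form:.6f}")
print(f"problem-dependent a*s    = {r_problem_dependent:.6f}")
```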

In contrast, if one takes the star-hull (e.g., expanding $\Theta$ to $\text{star}(\Theta)=\{\theta^{*}+\lambda(\theta-\theta^{*}):\theta\in\Theta,\lambda\in[0,1]\}$), then it is straightforward to verify that $\varphi_{\text{noise}}$ has to be a “sub-root” function. A sub-root surrogate function that governs (8.13) will be at least of order

φnoise¯(r;δ)=O(σ2(θ0+Δ)rlogdδn),\displaystyle\overline{\varphi_{\text{noise}}}(r;\delta)=O\left(\sqrt{\frac{\sigma^{2}(\|\theta^{*}\|_{0}+\Delta)r\log\frac{d}{\delta}}{n}}\right),

whose fixed point unavoidably scales with the worst-case $L_{2}$ distance $\Delta$. Here we do not consider computational issues, and the key message is that if the complexity (e.g., the “effective dimension”) of an overparameterized non-convex class grows very rapidly with respect to its localization scale, then some “fast growing components” may still be benign and they may not necessarily increase the complexity. It is an open question whether such phenomena manifest in more practical applications.

Comparison with the “small ball method.”

In a series of pioneering works, Mendelson [50, 53, 51, 52] proposes the “small ball method” as an alternative approach to the traditional “concentration-contraction” framework. Under the “small ball” condition, that approach establishes one-sided uniform inequalities through structural results on binary valued indicator functions. Motivated by these works, we seek to refine the traditional concentration framework. Our approach brings added flexibility to concentration by emphasizing the use of surrogate functions that are not “sub-root,” and relates one-sided uniform inequalities to two-sided concentration of simple “truncated” functions. Following are the main contributions relative to the “small ball method.”

First, our approach does not require the hypothesis class to be star-shaped/convex (or to consider the star-hull of the hypothesis class). This improvement is particularly relevant for non-convex hypothesis classes whose complexity can grow rapidly when “away” from the optimal hypothesis. In Example 9 (and its discussion) we show that the improvement may be meaningful for some non-convex, overparametrized classes; and the phenomenon of “benign fast growing” components in overparameterized models may be of independent interest.

To the best of our knowledge, the “small ball method” cannot overcome the star-shape requirement in a straightforward manner, without additional uniform convergence arguments. The “small ball method” is able to prove one-sided inequalities that hold uniformly over a fixed sphere {hH:hhL22=r}\{h\in\pazocal{H}:\|h-h^{*}\|_{L_{2}}^{2}=r\}, and by assuming the class H\pazocal{H} to be star-shaped around hh^{*}, it circumvents the need to have a uniform bound that holds simultaneously for all possible values of rr. However, without the star-shape assumption and additional uniform convergence arguments, it is not clear how to uniformly extend the bound to all rr using peeling. In our analysis, we introduce some new tricks to address this issue. In particular, we use “adaptive truncation levels” and concentration over “rings.” Combining these with the “uniform localized convergence” procedure, we completely circumvent the need for star-hulls (see “Part II” in Appendix C.1 for details).

The discussion here is orthogonal to lifting the star-shape/convexity assumptions using aggregation [37], whose primary goal is to remove Assumption 8 (recall that this assumption implicitly asks the hypothesis class to be convex/star-shaped when the model is mis-specified). When using aggregation and improper learning procedures, it is natural to consider the complexity of the enlarged class. Still, we suspect that taking the star-hull may be unnecessary if the enlarged class need not be star-shaped [51, 52], and our analysis may be useful there as well. We note in passing that aggregation procedures are often computationally demanding.

Lastly, the formulation of supervised costs is slightly broader here compared with [53]. In that paper, the loss is assumed to be a univariate function of (h(x)y)(h(x)-y), so costs involving the term yh(x)yh(x) (e.g., the canonical logistic cost and the costs in some other generalized linear models) are not permitted ([53] instead analyzes a modified version of the logistic cost).

Comparison with offset Rademacher complexity.

Under the square cost and assuming the so-called “lower isometry bound” as an a priori condition (see [37, Definition 5]), offset Rademacher complexity [37] is also able to provide problem-dependent rates. However, establishing such a “lower isometry bound” is typically challenging, so this approach may still need to rely on the “small ball method” (or our analysis) for unbounded regression problems. Moreover, this tool is tailored to the setting of supervised learning with square cost, and it is unclear how to extend the analysis to more general losses.

Comparison with the “restricted strong convexity” framework in high-dimensional statistics.

In the high-dimensional statistics literature, the “restricted strong convexity” framework [57, 79] provides analytical tools to prove problem-dependent rates, but only when such a condition is assumed a priori (see [57, Definition 2]). To establish the condition, [62, 57, 41] develop a truncation-based analysis that can establish “restricted strong convexity” for sparse kernel regression and sparse generalized linear models. Those works also indicate that one-sided uniform inequalities can be established by two-sided concentration of the “truncated” functions. There are several differences between their analysis and ours. First, those proofs rely on linearity/star-shape of the hypothesis class and thus only need to prove the “restricted strong convexity” on a fixed sphere (similar to what we have discussed in comparison with the “small-ball method”). In contrast, our framework does not put any geometric restriction on the hypothesis class, bypassing this through the use of “adaptive truncation levels” and concentration over “rings,” tools that may be of independent interest from a technical perspective. Second, when seeking problem-dependent generalization error bounds, the proposed $L_{2}$-$L_{4}$ moment equivalence condition [57, 41] is stronger than the “small ball” condition used in our analysis. Third, the analysis does not fully localize the strong convexity parameter, and does not cover interesting supervised costs that may have zero curvature, e.g., the Huber cost.

9 Concluding remarks

This paper provides contributions both in the “uniform localized convergence” approach it develops and in the applications thereof to various problem areas. Below we highlight some key implications.

From a methodological viewpoint, our approach resolves some fundamental limitations of the existing uniform convergence and localization analysis methods, such as the traditional “local Rademacher complexity” analysis and the “uniform convergence of gradients.” At a high-level, it provides some general guidelines to derive generalization error bounds that are sharper than the worst-case uniform error. In particular, the following observations are of particular interest: 1) problem-dependent rates can often be explained by uniform inequalities whose right hand side is a function of the “free” variable T(f)T(f); 2) the choice of surrogate function and concentrated function are flexible, and our proposed framework brings some level of unification to localized complexities, vector-based uniform convergence results and one-sided uniform inequalities; and 3) “uniform localized convergence” arguments are also suitable to study regularization and iterative algorithms. These observations lead to a unified perspective on problem-dependent rates in various problem settings studied in the paper.

Many problem-dependent generalization error bounds proved in the paper may be of independent interest. Their study also informs the design of optimal procedures. For example: in the “slow rate” regime, we propose the first (moment-penalized) estimator that achieves optimal variance-dependent rates for general “rich” classes; and in the parametric “fast rate” regime, we show that efficient algorithms like gradient descent and the first-order Expectation-Maximization algorithm can achieve optimal problem-dependent rates in several representative problems from non-convex learning, stochastic optimization, and learning with missing data.

There are several future directions for this line of research. Applications to causal inference and machine learning problems with unobservable components may be very promising, as the focus there is typically on avoiding the worst-case parameter-dependence (see also the discussion in Section 4 and Section 7). Another interesting topic is applying the “uniform localized convergence” principle to study distributional robustness, since there are profound connections between the latter and variation-based regularization [56, 6]. Lastly, extension of our framework to overparameterized models is interesting from both the theoretical and practical viewpoints. Our results in the slow rate regime may be directly applicable to the study of overparameterized neural networks, in particular, if one has sharp upper bounds on the local Rademacher complexity. While there has been much recent interest in proving norm-based upper bounds on the global Rademacher complexity for neural network models [4, 16], proving meaningful upper bounds on the local Rademacher complexity remains largely open. It is worth mentioning that the recent negative results in [55, 5] apply neither to our general framework nor to the notion of problem-dependency we suggest (distribution-dependent quantities that depend on the best hypothesis). It is possible that combining our framework with more suitable concentrated functions and localized subsets (e.g., generalizing the data-dependent subset considered in [85]) may also shed light on the study of some overparameterized models.

References

  • [1] Susan Athey and Stefan Wager. Efficient policy learning. arXiv preprint arXiv:1702.02896, 2017.
  • [2] Sivaraman Balakrishnan, Martin J Wainwright, and Bin Yu. Statistical guarantees for the em algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1):77–120, 2017.
  • [3] Peter L Bartlett, Olivier Bousquet, and Shahar Mendelson. Local rademacher complexities. The Annals of Statistics, 33(4):1497–1537, 2005.
  • [4] Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249, 2017.
  • [5] Peter L Bartlett and Philip M Long. Failures of model-dependent generalization bounds for least-norm interpolation. arXiv preprint arXiv:2010.08479, 2020.
  • [6] Jose Blanchet, Yang Kang, and Karthyek Murthy. Robust wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857, 2019.
  • [7] Małgorzata Bogdan, Ewout Van Den Berg, Chiara Sabatti, Weijie Su, and Emmanuel J Candès. Slope—adaptive variable selection via convex optimization. The annals of applied statistics, 9(3):1103, 2015.
  • [8] Olivier Chapelle, Chuong B Do, Choon H Teo, Quoc V Le, and Alex J Smola. Tighter bounds for structured estimation. In Advances in neural information processing systems, pages 281–288, 2009.
  • [9] Damek Davis and Dmitriy Drusvyatskiy. Graphical convergence of subgradients in nonconvex optimization and learning. arXiv preprint arXiv:1810.07590, 2018.
  • [10] Luc Devroye and Gábor Lugosi. Lower bounds in pattern recognition and learning. Pattern recognition, 28(7):1011–1018, 1995.
  • [11] Sjoerd Dirksen. Tail bounds via generic chaining. Electronic Journal of Probability, 20, 2015.
  • [12] Dylan J Foster, Ayush Sekhari, and Karthik Sridharan. Uniform convergence of gradients for non-convex learning and optimization. In Advances in Neural Information Processing Systems, pages 8745–8756, 2018.
  • [13] Dylan J Foster and Vasilis Syrgkanis. Orthogonal statistical learning. arXiv preprint arXiv:1901.09036, 2019.
  • [14] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
  • [15] Evarist Giné and Vladimir Koltchinskii. Concentration inequalities and asymptotic results for ratio type empirical processes. The Annals of Probability, 34(3):1143–1216, 2006.
  • [16] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. In Conference On Learning Theory, pages 297–299. PMLR, 2018.
  • [17] László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A distribution-free theory of nonparametric regression. Springer Science & Business Media, 2006.
  • [18] Jaroslav Hájek. Local asymptotic minimax and admissibility in estimation. In Proceedings of the sixth Berkeley symposium on mathematical statistics and probability, volume 1, pages 175–194, 1972.
  • [19] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. arXiv preprint arXiv:1611.04231, 2016.
  • [20] Moritz Hardt, Tengyu Ma, and Benjamin Recht. Gradient descent learns linear dynamical systems. The Journal of Machine Learning Research, 19(1):1025–1068, 2018.
  • [21] Elad Hazan, Tomer Koren, and Kfir Y Levy. Logistic regression: Tight bounds for stochastic and online optimization. In Conference on Learning Theory, pages 197–209, 2014.
  • [22] Oliver Hinder, Aaron Sidford, and Nimit Sohoni. Near-optimal methods for minimizing star-convex functions and beyond. In Conference on Learning Theory, pages 1894–1938. PMLR, 2020.
  • [23] Sham M Kakade and Ambuj Tewari. On the generalization ability of online strongly convex programming algorithms. In Advances in Neural Information Processing Systems, pages 801–808, 2009.
  • [24] Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak-łojasiewicz condition. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 795–811. Springer, 2016.
  • [25] Robert Kleinberg, Yuanzhi Li, and Yang Yuan. An alternative view: When does sgd escape local minima? arXiv preprint arXiv:1802.06175, 2018.
  • [26] Vladimir Koltchinskii. Oracle Inequalities in Empirical Risk Minimization and Sparse Recovery Problems: Ecole d’Eté de Probabilités de Saint-Flour XXXVIII-2008, volume 2033. Springer Science & Business Media, 2011.
  • [27] Vladimir Koltchinskii and Shahar Mendelson. Bounding the smallest singular value of a random matrix without concentration. International Mathematics Research Notices, 2015(23):12991–13008, 2015.
  • [28] Vladimir Koltchinskii and Dmitriy Panchenko. Rademacher processes and bounding the risk of function learning. In High dimensional probability II, pages 443–457. Springer, 2000.
  • [29] Jeongyeol Kwon, Wei Qian, Constantine Caramanis, Yudong Chen, and Damek Davis. Global convergence of em algorithm for mixtures of two component linear regression. arXiv preprint arXiv:1810.05752, 2018.
  • [30] Jason N Laska and Richard G Baraniuk. Regime change: Bit-depth versus measurement-rate in compressive sensing. IEEE Transactions on Signal Processing, 60(7):3496–3505, 2012.
  • [31] L Le Cam. Limits of experiments. In Proceedings of the Sixth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Theory of Statistics. The Regents of the University of California, 1972.
  • [32] Guillaume Lecué and Shahar Mendelson. Sparse recovery under weak moment assumptions. arXiv preprint arXiv:1401.2188, 2014.
  • [33] Guillaume Lecué and Shahar Mendelson. Regularization and the small-ball method i: sparse recovery. The Annals of Statistics, 46(2):611–641, 2018.
  • [34] Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via scsg methods. In Advances in Neural Information Processing Systems, pages 2348–2358, 2017.
  • [35] Xiaodong Li, Shuyang Ling, Thomas Strohmer, and Ke Wei. Rapid, robust, and reliable blind deconvolution via nonconvex optimization. Applied and computational harmonic analysis, 47(3):893–934, 2019.
  • [36] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pages 597–607, 2017.
  • [37] Tengyuan Liang, Alexander Rakhlin, and Karthik Sridharan. Learning with square loss: Localization through offset rademacher complexity. In Conference on Learning Theory, pages 1260–1285, 2015.
  • [38] Alexander E Litvak, Alain Pajor, Mark Rudelson, and Nicole Tomczak-Jaegermann. Smallest singular value of random matrices and geometry of random polytopes. Advances in Mathematics, 195(2):491–523, 2005.
  • [39] Huikang Liu, Weijie Su, and Anthony Man-Cho So. Quadratic optimization with orthogonality constraints: Explicit lojasiewicz exponent and linear convergence of line-search methods. In International Conference on Machine Learning, pages 1158–1167, 2016.
  • [40] Mingrui Liu, Xiaoxuan Zhang, Lijun Zhang, Rong Jin, and Tianbao Yang. Fast rates of erm and stochastic approximation: Adaptive to error bound conditions. In Advances in Neural Information Processing Systems, pages 4678–4689, 2018.
  • [41] Po-Ling Loh and Martin J Wainwright. Regularized m-estimators with nonconvexity: Statistical and algorithmic theory for local optima. In Advances in Neural Information Processing Systems, pages 476–484, 2013.
  • [42] Gábor Lugosi. Pattern classification and learning theory. In Principles of nonparametric learning, pages 1–56. Springer, 2002.
  • [43] Ulysse Marteau-Ferey, Dmitrii Ostrovskii, Francis Bach, and Alessandro Rudi. Beyond least-squares: Fast rates for regularized empirical risk minimization through self-concordance. arXiv preprint arXiv:1902.03046, 2019.
  • [44] Pascal Massart and Élodie Nédélec. Risk bounds for statistical learning. The Annals of Statistics, 34(5):2326–2366, 2006.
  • [45] Andreas Maurer and Massimiliano Pontil. Empirical bernstein bounds and sample variance penalization. arXiv preprint arXiv:0907.3740, 2009.
  • [46] Geoffrey J McLachlan and Thriyambakam Krishnan. The EM algorithm and extensions, volume 382. John Wiley & Sons, 2007.
  • [47] Song Mei, Yu Bai, and Andrea Montanari. The landscape of empirical risk for nonconvex losses. The Annals of Statistics, 46(6A):2747–2774, 2018.
  • [48] Ron Meir and Tong Zhang. Generalization error bounds for bayesian mixture algorithms. Journal of Machine Learning Research, 4(Oct):839–860, 2003.
  • [49] Shahar Mendelson. Improving the sample complexity using global data. IEEE transactions on Information Theory, 48(7):1977–1991, 2002.
  • [50] Shahar Mendelson. Learning without concentration. In Conference on Learning Theory, pages 25–39, 2014.
  • [51] Shahar Mendelson. On aggregation for heavy-tailed classes. Probability Theory and Related Fields, 168(3-4):641–674, 2017.
  • [52] Shahar Mendelson. An optimal unrestricted learning procedure. arXiv preprint arXiv:1707.05342, 2017.
  • [53] Shahar Mendelson. Learning without concentration for general loss functions. Probability Theory and Related Fields, 171(1-2):459–502, 2018.
  • [54] Shahar Mendelson and Nikita Zhivotovskiy. Robust covariance estimation under $l_{4}$-$l_{2}$ norm equivalence. arXiv preprint arXiv:1809.10462, 2018.
  • [55] Vaishnavh Nagarajan and J Zico Kolter. Uniform convergence may be unable to explain generalization in deep learning. In Advances in Neural Information Processing Systems, pages 11611–11622, 2019.
  • [56] Hongseok Namkoong and John C Duchi. Variance-based regularization with convex objectives. In Advances in Neural Information Processing Systems, pages 2971–2980, 2017.
  • [57] Sahand N Negahban, Pradeep Ravikumar, Martin J Wainwright, and Bin Yu. A unified framework for high-dimensional analysis of $m$-estimators with decomposable regularizers. Statistical Science, 27(4):538–557, 2012.
  • [58] Tan Nguyen and Scott Sanner. Algorithms for direct 0–1 loss optimization in binary classification. In International Conference on Machine Learning, pages 1085–1093, 2013.
  • [59] Iosif Pinelis. Optimum bounds for the distributions of martingales in banach spaces. The Annals of Probability, 22(4):1679–1706, 1994.
  • [60] Iosif Pinelis. Correction: “optimum bounds for the distributions of martingales in banach spaces” [ann. probab. 22 (1994), no. 4, 1679–1706; mr 96b:60010]. The Annals of Probability, 27(4):2119–2119, 1999.
  • [61] David Pollard. Convergence of stochastic processes. Springer Science & Business Media, 2012.
  • [62] Garvesh Raskutti, Martin J Wainwright, and Bin Yu. Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research, 13(Feb):389–427, 2012.
  • [63] Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabás Póczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pages 314–323, 2016.
  • [64] Mark Rudelson and Roman Vershynin. Small ball probabilities for linear images of high-dimensional distributions. International Mathematics Research Notices, 2015(19):9594–9617, 2015.
  • [65] Walter Rudin. Principles of mathematical analysis, volume 3. McGraw-hill New York, 1964.
  • [66] Shai Shalev-Shwartz. Stochastic convex optimization. In Conference on Learning Theory, 2009.
  • [67] Shai Shalev-Shwartz, Ohad Shamir, Nathan Srebro, and Karthik Sridharan. Learnability, stability and uniform convergence. Journal of Machine Learning Research, 11(Oct):2635–2670, 2010.
  • [68] Alexander Shapiro, Darinka Dentcheva, and Andrzej Ruszczyński. Lectures on stochastic programming: modeling and theory. SIAM, 2009.
  • [69] Karthik Sridharan. Note on refined dudley integral covering number bound. Unpublished Manuscript, https://www.cs.cornell.edu/ sridharan/dudley.pdf, 2010.
  • [70] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. Foundations of Computational Mathematics, 18(5):1131–1198, 2018.
  • [71] Adith Swaminathan and Thorsten Joachims. Counterfactual risk minimization: Learning from logged bandit feedback. In International Conference on Machine Learning, pages 814–823, 2015.
  • [72] Michel Talagrand. Majorizing measures: the generic chaining. The Annals of Probability, 24(3):1049–1103, 1996.
  • [73] Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.
  • [74] Alexandre B Tsybakov. Optimal rates of aggregation. In Learning theory and kernel machines, pages 303–313. Springer, 2003.
  • [75] Aad van der Vaart. On the asymptotic information bound. The Annals of Statistics, pages 1487–1500, 1989.
  • [76] Aad W Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.
  • [77] Aad W Van der Vaart and Jon A Wellner. Weak convergence and empirical processes. Springer, 1996.
  • [78] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge University Press, 2018.
  • [79] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
  • [80] Zhaoran Wang, Quanquan Gu, Yang Ning, and Han Liu. High dimensional expectation-maximization algorithm: Statistical optimization and asymptotic normality. arXiv preprint arXiv:1412.8729, 2014.
  • [81] Lijun Zhang, Tianbao Yang, and Rong Jin. Empirical risk minimization for stochastic convex optimization: $o(1/n)$- and $o(1/n^{2})$-type of risk bounds. arXiv preprint arXiv:1702.02030, 2017.
  • [82] Lijun Zhang and Zhi-Hua Zhou. Stochastic approximation of smooth and strongly convex functions: Beyond the $o(1/t)$ convergence rate. arXiv preprint arXiv:1901.09344, 2019.
  • [83] Nikita Zhivotovskiy. Optimal learning via local entropies and sample compression. arXiv preprint arXiv:1706.01124, 2017.
  • [84] Nikita Zhivotovskiy and Steve Hanneke. Localization of vc classes: Beyond local rademacher complexities. In International Conference on Algorithmic Learning Theory, pages 18–33. Springer, 2016.
  • [85] Lijia Zhou, D.J. Sutherland, and Nathan Srebro. On uniform convergence and low-norm interpolation learning. arXiv preprint arXiv:2006.05942, 2020.
  • [86] Yi Zhou, Junjie Yang, Huishuai Zhang, Yingbin Liang, and Vahid Tarokh. Sgd converges to global minimum in deep learning via star-convex path. arXiv preprint arXiv:1901.00451, 2019.

Appendix A Proofs for Section 2 and Section 3

In all the proofs we consider a fixed sample size $n$. In order to distinguish “probability of events” and “expectation with respect to $\mathbb{P}$,” we will use the notation $\text{Prob}(\mathscr{A})$ to denote the probability of the event $\mathscr{A}$.

A.1 Variants of Proposition 1

We prove a more general version of Proposition 1. The differences are that 1) here we use a more general “peeling scale” $\lambda$, which can be any value larger than $1$, while in Proposition 1 we simply set $\lambda$ to be $2$; and 2) we only ask $\psi(r;\delta)$ to be a high-probability surrogate function of the uniform error over the “ring” $\{f\in\mathcal{F}:r/\lambda\leq T(f)\leq r\}$ rather than the “bigger” localized area $\{f\in\mathcal{F}:0\leq T(f)\leq r\}$.

Proposition 3 (a more general “uniform localized convergence” argument).

For a function class $\mathcal{G}=\{g_{f}:f\in\mathcal{F}\}$ and functional $T:\mathcal{F}\rightarrow[0,R]$, assume there is a function $\psi(r;\delta)$ (possibly depending on the samples), which is non-decreasing with respect to $r$ and satisfies that $\forall\delta\in(0,1)$, $\forall r\in[0,R]$, with probability at least $1-\delta$,

\[ \sup_{f\in\mathcal{F}:\frac{r}{\lambda}\leq T(f)\leq r}(\mathbb{P}-\mathbb{P}_{n})g_{f}\leq\psi(r;\delta). \]

Then, given any $\delta\in(0,1)$, $r_{0}\in(0,R]$ and $\lambda>1$, with probability at least $1-\delta$, for all $f\in\mathcal{F}$, either $T(f)\leq r_{0}$ or

\[ (\mathbb{P}-\mathbb{P}_{n})g_{f}\leq\psi\left(\lambda T(f);\delta\left(\log_{\lambda}\frac{\lambda R}{r_{0}}\right)^{-1}\right). \]
Proof of Proposition 3:

Given any $r_{0}\in(0,R]$, take $r_{k}=\lambda^{k}r_{0}$, $k=1,\cdots,\lceil\log_{\lambda}\frac{R}{r_{0}}\rceil$. Note that $\lceil\log_{\lambda}\frac{R}{r_{0}}\rceil\leq\log_{\lambda}\frac{\lambda R}{r_{0}}$.

We use a union bound to establish that $\sup_{f\in\mathcal{F}:\frac{r}{\lambda}\leq T(f)\leq r}(\mathbb{P}-\mathbb{P}_{n})g_{f}\leq\psi(r;\delta)$ holds for all these $r_{k}$ simultaneously: $\forall\delta\in(0,1)$, with probability at least $1-\delta$,

\[ \sup_{f\in\mathcal{F}:r_{k-1}\leq T(f)\leq r_{k}}(\mathbb{P}-\mathbb{P}_{n})g_{f}\leq\psi\left(r_{k};\frac{\delta}{\log_{\lambda}\frac{\lambda R}{r_{0}}}\right),\quad k=1,\cdots,\left\lceil\log_{\lambda}\frac{R}{r_{0}}\right\rceil. \]

For any fixed $f\in\mathcal{F}$, if $T(f)\leq r_{0}$ is false, then let $k$ be the non-negative integer such that $\lambda^{k}r_{0}<T(f)\leq\lambda^{k+1}r_{0}$, and we further know that $r_{k+1}=\lambda^{k+1}r_{0}\leq\lambda T(f)$. Therefore, with probability at least $1-\delta$,

\[ \begin{aligned} (\mathbb{P}-\mathbb{P}_{n})g_{f} &\leq\sup_{\tilde{f}\in\mathcal{F}:r_{k}\leq T(\tilde{f})\leq r_{k+1}}(\mathbb{P}-\mathbb{P}_{n})g_{\tilde{f}} \\ &\leq\psi\left(r_{k+1};\frac{\delta}{\log_{\lambda}\frac{\lambda R}{r_{0}}}\right) \\ &\leq\psi\left(\lambda T(f);\frac{\delta}{\log_{\lambda}\frac{\lambda R}{r_{0}}}\right). \end{aligned} \]

Therefore, with probability at least $1-\delta$, $\forall f\in\mathcal{F}$, either $T(f)\leq r_{0}$ or

\[ (\mathbb{P}-\mathbb{P}_{n})g_{f}\leq\psi\left(\lambda T(f);\frac{\delta}{\log_{\lambda}\frac{\lambda R}{r_{0}}}\right). \]

This completes the proof of Proposition 3. $\square$
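The peeling construction in the proof above can be made concrete with a short numerical sketch. The function names `peeling_schedule` and `shell_index` are hypothetical and not part of the paper; the code only builds the geometric grid $r_{k}=\lambda^{k}r_{0}$, the per-shell confidence level $\delta/\log_{\lambda}\frac{\lambda R}{r_{0}}$, and the shell that contains a given value of $T(f)$.

```python
import math


def peeling_schedule(r0, R, lam=2.0, delta=0.05):
    """Return the radii r_k = lam**k * r0 (k = 0, 1, ...) and the per-shell level.

    The grid stops at the first radius that reaches R, so it has
    ceil(log_lam(R / r0)) shells, as in the proof of Proposition 3.
    """
    if not (0.0 < r0 <= R) or lam <= 1.0 or not (0.0 < delta < 1.0):
        raise ValueError("need 0 < r0 <= R, lam > 1 and 0 < delta < 1")
    radii = [r0]
    while radii[-1] < R:
        radii.append(radii[-1] * lam)
    # Union-bound budget log_lam(lam * R / r0), which dominates the shell count.
    budget = math.log(lam * R / r0) / math.log(lam)
    return radii, delta / budget


def shell_index(t, r0, lam=2.0):
    """Return the k >= 0 with lam**k * r0 < t <= lam**(k + 1) * r0, for t > r0."""
    if t <= r0:
        raise ValueError("the shell index is only defined for t > r0")
    k = 0
    while lam ** (k + 1) * r0 < t:
        k += 1
    return k


if __name__ == "__main__":
    radii, level = peeling_schedule(r0=0.01, R=1.0, lam=2.0, delta=0.05)
    k = shell_index(0.3, r0=0.01, lam=2.0)
    print("shells:", len(radii) - 1, "per-shell level:", level)
    print("T(f) = 0.3 lies in shell", k, "with outer radius", radii[k + 1])
```

The outer radius of the selected shell never exceeds $\lambda T(f)$, which is exactly the monotonicity step used in the last line of the chain of inequalities.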

Clearly, Proposition 1 can be viewed as a corollary of Proposition 3. We now present an implication of Proposition 1, which may be more convenient to use for some problems.

Proposition 4 (a variant of the “uniform localized convergence” argument).

For a function class $\mathcal{G}=\{g_{f}:f\in\mathcal{F}\}$ and functional $T:\mathcal{F}\rightarrow[0,R]$, assume there is a function $\psi(r;\delta)$ (possibly depending on the samples), which is non-decreasing with respect to $r$ and satisfies that $\forall\delta\in(0,1)$, $\forall r\in[0,R]$, with probability at least $1-\delta$,

\[ \sup_{f\in\mathcal{F}:T(f)\leq r}(\mathbb{P}-\mathbb{P}_{n})g_{f}\leq\psi(r;\delta). \]

Then, given any $\delta\in(0,1)$ and $r_{0}\in(0,R]$, with probability at least $1-\delta$, for all $f\in\mathcal{F}$,

\[ (\mathbb{P}-\mathbb{P}_{n})g_{f}\leq\psi\left(2T(f)\lor r_{0};\frac{\delta}{C_{r_{0}}}\right), \]

where $C_{r_{0}}=2\log_{2}\frac{2R}{r_{0}}$.

Proof of Proposition 4:

From Proposition 1 we know that with probability at least $1-\frac{\delta}{2}$, for all $f\in\mathcal{F}$, either $T(f)\leq r_{0}$ or

\[ (\mathbb{P}-\mathbb{P}_{n})g_{f}\leq\psi\left(2T(f);\frac{\delta}{2}\left(\log_{2}\frac{2R}{r_{0}}\right)^{-1}\right)=\psi\left(2T(f);\frac{\delta}{C_{r_{0}}}\right). \tag{A.1} \]

We denote the event

\[ \mathscr{A}_{1}=\left\{\text{there exists $f\in\mathcal{F}$ such that $T(f)>r_{0}$ and $(\mathbb{P}-\mathbb{P}_{n})g_{f}>\psi\left(2T(f);\frac{\delta}{C_{r_{0}}}\right)$}\right\}. \]

Then from (A.1), we have

\[ \text{Prob}(\mathscr{A}_{1})\leq\frac{\delta}{2}. \tag{A.2} \]

We denote the event

\[ \mathscr{A}_{2}=\left\{\text{there exists $f\in\mathcal{F}$ such that $T(f)\leq r_{0}$ and $(\mathbb{P}-\mathbb{P}_{n})g_{f}>\psi\left(r_{0};\frac{\delta}{C_{r_{0}}}\right)$}\right\}. \]

Then from the surrogate property of $\psi$ (applied with $r=r_{0}$ and confidence level $\frac{\delta}{C_{r_{0}}}$) and the fact $C_{r_{0}}\geq 2$, we have

\[ \text{Prob}(\mathscr{A}_{2})\leq\frac{\delta}{C_{r_{0}}}\leq\frac{\delta}{2}. \tag{A.3} \]

Combining (A.2) and (A.3) by a union bound, we have

\[ \text{Prob}(\mathscr{A}_{1}\cup\mathscr{A}_{2})\leq\text{Prob}(\mathscr{A}_{1})+\text{Prob}(\mathscr{A}_{2})\leq\delta. \]

On the complement of $\mathscr{A}_{1}\cup\mathscr{A}_{2}$, every $f$ with $T(f)>r_{0}$ satisfies $(\mathbb{P}-\mathbb{P}_{n})g_{f}\leq\psi\left(2T(f);\frac{\delta}{C_{r_{0}}}\right)$, and every $f$ with $T(f)\leq r_{0}$ satisfies $(\mathbb{P}-\mathbb{P}_{n})g_{f}\leq\psi\left(r_{0};\frac{\delta}{C_{r_{0}}}\right)$. Since $\psi$ is non-decreasing in its first argument, both bounds are at most $\psi\left(2T(f)\lor r_{0};\frac{\delta}{C_{r_{0}}}\right)$. Hence, with probability at least $1-\delta$, for all $f\in\mathcal{F}$,

\[ (\mathbb{P}-\mathbb{P}_{n})g_{f}\leq\psi\left(2T(f)\lor r_{0};\frac{\delta}{C_{r_{0}}}\right). \]

This completes the proof of Proposition 4. $\square$

A.2 Proof of Theorem 1

Let $\mathcal{F}$ be the excess loss class in (3.2), and define its member $f$ by $f(z)=\ell(h;z)-\ell(h^{*};z)$, $\forall z\in\mathcal{Z}$. Clearly, $\mathcal{F}$ is uniformly bounded in $[-2B,2B]$. Let $T(f)=\mathbb{P}[f^{2}]$. Define $\hat{f}$ by $\hat{f}(z)=\ell(\hat{h}_{\textup{ERM}};z)-\ell(h^{*};z)$, $\forall z\in\mathcal{Z}$.

For a fixed $r_{0}\in(0,4B^{2})$, denote $C_{r_{0}}=2\log_{2}\frac{8B^{2}}{r_{0}}$. From now to the end of this proof, we will prove the generalization error bound on the event

\[ \mathscr{A}=\left\{\text{for all $f\in\mathcal{F}$, $(\mathbb{P}-\mathbb{P}_{n})f\leq\psi\left(2T(f)\lor r_{0};\frac{\delta}{C_{r_{0}}}\right)$}\right\}. \tag{A.4} \]

From Proposition 4 we know that

\[ \text{Prob}(\mathscr{A})\geq 1-\delta. \]

This means that proving the generalization error bound on the event $\mathscr{A}$ suffices to prove the theorem.

Denote $g(z)=\ell(h;z)-\inf_{\mathcal{H}}\ell(h;z)$ and $\hat{g}(z)=\ell(\hat{h}_{\textup{ERM}};z)-\inf_{\mathcal{H}}\ell(h;z)$. Let $T(g)=\mathbb{P}[g^{2}]$. We have

\[ f(z)=g(z)-\left(\ell(h^{*};z)-\inf_{\mathcal{H}}\ell(h;z)\right),\quad\forall z, \]

which implies that

\[ \begin{aligned} \mathbb{P}[f^{2}] &\leq 2\mathbb{P}[g^{2}]+2\mathbb{P}\left[\left(\ell(h^{*};z)-\inf_{\mathcal{H}}\ell(h;z)\right)^{2}\right] \\ &\leq 2\mathbb{P}[g^{2}]+4B\mathcal{L}^{*}\leq 4\mathbb{P}[g^{2}]\lor 8B\mathcal{L}^{*}. \end{aligned} \]

Therefore, we have

\[ T(\hat{f})\leq 4T(\hat{g})\lor 8B\mathcal{L}^{*}. \tag{A.5} \]
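The steps behind the variance comparison (A.5) can be spelled out as follows. This is only a sketch: the shorthand $u(z)=\ell(h^{*};z)-\inf_{\mathcal{H}}\ell(h;z)$ is introduced here for illustration and is not the paper's notation. It uses that $u$ takes values in $[0,2B]$ and that $\mathcal{L}^{*}=\mathbb{P}[u]$.

```latex
\begin{aligned}
f^{2} = (g-u)^{2} &\leq 2g^{2}+2u^{2}
  && \text{(since $(a-b)^{2}\leq 2a^{2}+2b^{2}$)}, \\
\mathbb{P}[u^{2}] &\leq 2B\,\mathbb{P}[u] = 2B\mathcal{L}^{*}
  && \text{(since $0\leq u\leq 2B$)}, \\
2\mathbb{P}[g^{2}]+4B\mathcal{L}^{*}
  &\leq 2\left(2\mathbb{P}[g^{2}]\lor 4B\mathcal{L}^{*}\right)
  = 4\mathbb{P}[g^{2}]\lor 8B\mathcal{L}^{*}
  && \text{(since $a+b\leq 2(a\lor b)$)}.
\end{aligned}
```

Taking $h=\hat{h}_{\textup{ERM}}$ in these inequalities gives (A.5).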

From the property of ERM, we have $\mathbb{P}_{n}\hat{f}\leq 0$, which implies that

\[ \mathscr{E}(\hat{h}_{\textup{ERM}})\leq(\mathbb{P}-\mathbb{P}_{n})\hat{f}\leq\psi\left(2T(\hat{f})\lor r_{0};\frac{\delta}{C_{r_{0}}}\right). \tag{A.6} \]

From (A.5) and (A.6) we have

\[ \mathbb{P}\hat{g}-\mathcal{L}^{*}=\mathscr{E}(\hat{h}_{\textup{ERM}})\leq\psi\left(8T(\hat{g})\lor 16B\mathcal{L}^{*}\lor r_{0};\frac{\delta}{C_{r_{0}}}\right). \tag{A.7} \]

Since $\hat{g}(z)\in[0,2B]$ for all $z$, we have $T(\hat{g})\leq 2B\mathbb{P}\hat{g}$. From this fact and (A.7) we obtain

\[ \begin{aligned} T(\hat{g}) &\leq 2B\mathbb{P}\hat{g} \\ &\leq 2B\left(\mathcal{L}^{*}+\psi\left(8T(\hat{g})\lor 16B\mathcal{L}^{*}\lor r_{0};\frac{\delta}{C_{r_{0}}}\right)\right) \\ &=2B\mathcal{L}^{*}+2B\psi\left(8T(\hat{g})\lor 16B\mathcal{L}^{*}\lor r_{0};\frac{\delta}{C_{r_{0}}}\right). \end{aligned} \]

Whether $B\mathcal{L}^{*}\leq 2B\psi\left(8T(\hat{g})\lor 16B\mathcal{L}^{*}\lor r_{0};\frac{\delta}{C_{r_{0}}}\right)$ or $B\mathcal{L}^{*}>2B\psi\left(8T(\hat{g})\lor 16B\mathcal{L}^{*}\lor r_{0};\frac{\delta}{C_{r_{0}}}\right)$, the above inequality always implies that

\[ \begin{aligned} T(\hat{g}) &\leq 3B\mathcal{L}^{*}\lor 6B\psi\left(8T(\hat{g})\lor 16B\mathcal{L}^{*}\lor r_{0};\frac{\delta}{C_{r_{0}}}\right) \\ &\leq 3B\mathcal{L}^{*}\lor 6B\psi\left(8T(\hat{g});\frac{\delta}{C_{r_{0}}}\right)\lor 6B\psi\left(16B\mathcal{L}^{*}\lor r_{0};\frac{\delta}{C_{r_{0}}}\right). \end{aligned} \tag{A.8} \]

Let $r^{*}$ be the fixed point of $6B\psi\left(8r;\frac{\delta}{C_{r_{0}}}\right)$. From the definition of fixed points, whether $2B\mathcal{L}^{*}\lor\frac{r_{0}}{8}\leq r^{*}$ or $2B\mathcal{L}^{*}\lor\frac{r_{0}}{8}>r^{*}$, we always have

\[ 6B\psi\left(16B\mathcal{L}^{*}\lor r_{0};\frac{\delta}{C_{r_{0}}}\right)\leq r^{*}\lor 2B\mathcal{L}^{*}\lor\frac{r_{0}}{8}. \]

Combining the above inequality with (A.8), we have

\[ T(\hat{g})\leq 3B\mathcal{L}^{*}\lor 6B\psi\left(8T(\hat{g});\frac{\delta}{C_{r_{0}}}\right)\lor r^{*}\lor\frac{r_{0}}{8}. \]

From the above inequality and again the definition of fixed points, it is straightforward to prove that

\[ T(\hat{g})\leq 3B\mathcal{L}^{*}\lor r^{*}\lor\frac{r_{0}}{8}. \]

Combining the above inequality with (A.5), we have

\[ T(\hat{f})\leq 12B\mathcal{L}^{*}\lor 4r^{*}\lor\frac{r_{0}}{2}. \]

From the above inequality and (A.6) we have

\[ \mathscr{E}(\hat{h}_{\textup{ERM}})\leq(\mathbb{P}-\mathbb{P}_{n})\hat{f}\leq\psi\left(24B\mathcal{L}^{*}\lor 8r^{*}\lor r_{0};\frac{\delta}{C_{r_{0}}}\right), \tag{A.9} \]

which implies that

\[ \mathscr{E}(\hat{h}_{\textup{ERM}})\leq\psi\left(24B\mathcal{L}^{*};\frac{\delta}{C_{r_{0}}}\right)\lor\psi\left(8r^{*}\lor r_{0};\frac{\delta}{C_{r_{0}}}\right). \]

Recall that $r^{*}$ is the fixed point of $6B\psi\left(8r;\frac{\delta}{C_{r_{0}}}\right)$. Since $r^{*}\lor\frac{r_{0}}{8}\geq r^{*}$, from the definition of fixed points we have

\[ 6B\psi\left(8r^{*}\lor r_{0};\frac{\delta}{C_{r_{0}}}\right)\leq r^{*}\lor\frac{r_{0}}{8}. \]

So we finally obtain

\[ \mathscr{E}(\hat{h}_{\textup{ERM}})\leq\psi\left(24B\mathcal{L}^{*};\frac{\delta}{C_{r_{0}}}\right)\lor\frac{r^{*}}{6B}\lor\frac{r_{0}}{48B}. \]

Recall that the generalization error bound holds true on the event $\mathscr{A}$ defined in (A.4), whose measure is at least $1-\delta$. This completes the proof. $\square$
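To see how the final bound of this proof could be evaluated numerically, the sketch below computes the fixed point $r^{*}$ of $6B\psi\left(8r;\frac{\delta}{C_{r_{0}}}\right)$ by bisection and then takes the maximum of the three terms. The surrogate `psi` is a hypothetical parametric-type choice made purely for illustration (it is not taken from the paper), and the function names and the parameters `n` and `d` are likewise assumptions. The bisection relies on $r\mapsto 6B\psi(8r;\cdot)$ being sub-root, so that it lies above the identity before $r^{*}$ and below it afterwards.

```python
import math


def psi(r, delta, n, d):
    """Hypothetical surrogate: sqrt(r * c) + c with c = (d + log(1/delta)) / n."""
    c = (d + math.log(1.0 / delta)) / n
    return math.sqrt(r * c) + c


def fixed_point(phi, iters=200):
    """Bisection for the positive solution of phi(r) = r, assuming phi is sub-root."""
    hi = 1.0
    while phi(hi) > hi:  # grow the bracket until phi(hi) <= hi
        hi *= 2.0
    lo = 0.0
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if phi(mid) > mid:
            lo = mid
        else:
            hi = mid
    return hi


def theorem1_style_bound(L_star, B, r0, delta, n, d):
    """Evaluate psi(24 B L*; delta/C) v r*/(6B) v r0/(48B) for the surrogate above."""
    C = 2.0 * math.log2(8.0 * B ** 2 / r0)  # C_{r0} from the proof
    level = delta / C
    r_star = fixed_point(lambda r: 6.0 * B * psi(8.0 * r, level, n, d))
    bound = max(
        psi(24.0 * B * L_star, level, n, d),
        r_star / (6.0 * B),
        r0 / (48.0 * B),
    )
    return bound, r_star


if __name__ == "__main__":
    for L_star in (0.0, 0.1, 0.5):
        bound, r_star = theorem1_style_bound(L_star, B=1.0, r0=1e-4, delta=0.05, n=10_000, d=5)
        print(f"L* = {L_star}: r* = {r_star:.4f}, bound = {bound:.4f}")
```

With this particular surrogate the term $r^{*}/(6B)$ can dominate for small $\mathcal{L}^{*}$; a sharper choice of $\psi$ for a specific class would change which term is active.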

A.3 Estimating the loss-dependent rate from data

In the remarks following Theorem 1, we comment that fully data-dependent loss-dependent bounds can be derived using the empirical “effective loss,” $\mathbb{P}_{n}[\ell(\hat{h}_{\textup{ERM}};z)-\inf_{\mathcal{H}}\ell(h;z)]$, to estimate the unknown parameter $\mathcal{L}^{*}$. Here we present the full details and some discussion of this approach.

Theorem 10 (estimate of the loss-dependent rate from data).

Recall that the term $\mathcal{L}^{*}$ is $\mathbb{P}[\ell(h^{*};z)-\inf_{\mathcal{H}}\ell(h;z)]$, and denote $\widehat{\mathcal{L}^{*}}=\mathbb{P}_{n}[\ell(\hat{h}_{\textup{ERM}};z)-\inf_{\mathcal{H}}\ell(h;z)]$. Under the conditions of Theorem 1, set $C_{n}=2\log_{2}n+6$. Then for any fixed $\delta\in(0,\frac{1}{2})$, with probability at least $1-2\delta$, we have

\[ \mathscr{E}(\hat{h}_{\textup{ERM}})\leq\psi\left(cB\widehat{\mathcal{L}^{*}};\frac{\delta}{C_{n}}\right)\lor\frac{cr^{*}}{B}\lor\frac{cB\log\frac{2}{\delta}}{n} \tag{A.10} \]

and

\[ \mathcal{L}^{*}\leq c_{1}\left(\widehat{\mathcal{L}^{*}}\lor\frac{r^{*}}{B}\lor\frac{B\log\frac{2}{\delta}}{n}\right)\leq c_{2}\left(\mathcal{L}^{*}\lor\frac{r^{*}}{B}\lor\frac{B\log\frac{2}{\delta}}{n}\right), \tag{A.11} \]

where $c,c_{1},c_{2}$ are absolute constants.

Remarks.

1) The $B\log\frac{2}{\delta}/n$ terms in (A.10) and (A.11) are negligible, because $r^{*}$ is at least of order $B^{2}\log\frac{1}{\delta}/n$ for most practical applications. This order is unavoidable in traditional “local Rademacher complexity” analysis and two-sided concentration inequalities.

2) The generalization error bound (A.10) shows that without knowledge of $\mathcal{L}^{*}$, one can estimate the order of our loss-dependent rate by using $\widehat{\mathcal{L}^{*}}=\mathbb{P}_{n}[\ell(\hat{h}_{\textup{ERM}};z)-\inf_{\mathcal{H}}\ell(h;z)]$ as a proxy. Despite replacing $\mathcal{L}^{*}$ by $\widehat{\mathcal{L}^{*}}$, other quantities in the bound remain unchanged in order.

3) The inequality (A.11) shows that the estimation of $\mathcal{L}^{*}$ is tight.
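For a finite hypothesis class, the proxy $\widehat{\mathcal{L}^{*}}$ and the factor $C_{n}$ in Theorem 10 are simple to compute from a table of per-sample losses. The sketch below assumes such a table is available as a list of rows (one row per sample, one column per hypothesis); the function names are hypothetical and the finite-class setting is an illustrative simplification, not the paper's general setup.

```python
import math


def empirical_effective_loss(losses):
    """Return (index of the ERM hypothesis, empirical effective loss).

    losses[i][j] is the loss of hypothesis j on sample i. The effective loss
    averages loss(ERM; z_i) - min_j loss(j; z_i) over the samples.
    """
    n = len(losses)
    m = len(losses[0])
    emp_risk = [sum(row[j] for row in losses) / n for j in range(m)]
    erm = min(range(m), key=emp_risk.__getitem__)
    l_hat = sum(row[erm] - min(row) for row in losses) / n
    return erm, l_hat


def union_bound_factor(n):
    """C_n = 2 * log2(n) + 6, i.e. C_{r0} = 2 * log2(8 B^2 / r0) with r0 = B^2 / n."""
    return 2.0 * math.log2(n) + 6.0


if __name__ == "__main__":
    # Three samples, two hypotheses; values are made up for illustration.
    table = [[0.2, 0.5], [0.9, 0.1], [0.3, 0.4]]
    erm, l_hat = empirical_effective_loss(table)
    print("ERM index:", erm, "effective loss estimate:", l_hat)
    print("C_n for n = 1024:", union_bound_factor(1024))
```

The estimate is always non-negative and never exceeds the empirical risk of the ERM hypothesis when the losses themselves are non-negative.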

Proof of Theorem 10:

From the definitions, we know that $\mathcal{L}^{*}=\mathbb{P}[\ell(h^{*};z)-\inf_{\mathcal{H}}\ell(h;z)]$, $\widehat{\mathcal{L}^{*}}=\mathbb{P}_{n}[\ell(\hat{h}_{\textup{ERM}};z)-\inf_{\mathcal{H}}\ell(h;z)]$ and $\mathbb{P}\ell(h^{*};z)\leq\mathbb{P}\ell(\hat{h}_{\textup{ERM}};z)$. As a result, we have

\[ \begin{aligned} \mathcal{L}^{*}-\widehat{\mathcal{L}^{*}} &=\mathbb{P}\ell(h^{*};z)-\mathbb{P}_{n}\ell(\hat{h}_{\textup{ERM}};z)-(\mathbb{P}-\mathbb{P}_{n})[\inf_{\mathcal{H}}\ell(h;z)] \\ &\leq(\mathbb{P}-\mathbb{P}_{n})\ell(\hat{h}_{\textup{ERM}};z)-(\mathbb{P}-\mathbb{P}_{n})[\inf_{\mathcal{H}}\ell(h;z)] \\ &=(\mathbb{P}-\mathbb{P}_{n})\hat{f}+(\mathbb{P}-\mathbb{P}_{n})[\ell(h^{*};z)-\inf_{\mathcal{H}}\ell(h;z)], \end{aligned} \tag{A.12} \]

where $\hat{f}$ is defined by $\hat{f}(z)=\ell(\hat{h}_{\textup{ERM}};z)-\ell(h^{*};z)$, $\forall z\in\mathcal{Z}$.

We take $r_{0}=\frac{B^{2}}{n}$ in Theorem 1, and denote $C_{n}:=C_{r_{0}}=2\log_{2}n+6$. From (A.9) in the proof of Theorem 1, on the event $\mathscr{A}$ defined in (A.4) (whose measure is at least $1-\delta$),

\[ \mathscr{E}(\hat{h}_{\textup{ERM}})\leq(\mathbb{P}-\mathbb{P}_{n})\hat{f}\leq\psi\left(24B\mathcal{L}^{*}\lor 8r^{*}\lor\frac{B^{2}}{n};\frac{\delta}{C_{n}}\right), \tag{A.13} \]

with $\hat{f}$ as defined after (A.12).

Since $3B\mathcal{L}^{*}\lor r^{*}\lor\frac{B^{2}}{8n}\geq r^{*}$, from the definition of fixed points we have

\[ \begin{aligned} (\mathbb{P}-\mathbb{P}_{n})\hat{f} &\leq\psi\left(8\left(3B\mathcal{L}^{*}\lor r^{*}\lor\frac{B^{2}}{8n}\right);\frac{\delta}{C_{n}}\right) \\ &\leq\frac{3B\mathcal{L}^{*}\lor r^{*}\lor\frac{B^{2}}{8n}}{6B}\leq\frac{\mathcal{L}^{*}}{2}+\frac{r^{*}}{6B}+\frac{B}{48n}. \end{aligned} \tag{A.14} \]

This result holds together with the result of Theorem 1 on the event $\mathscr{A}$.

The random variable (h;z)infH(h;z)\ell(h^{*};z)-\inf_{\pazocal{H}}\ell(h;z) is uniformly bounded by [0,2B][0,2B]. From Bernstein’s inequality and the fact Var[(h;z)infH(h;z)]2BL\textup{Var}[\ell(h^{*};z)-\inf_{\pazocal{H}}\ell(h;z)]\leq 2B\pazocal{L}^{*}, with probability at least 1δ1-\delta,

|(n)[(h;z)infH(h;z)]|4BLlog2δn+2Blog2δnL4+3Blog2δn.\displaystyle\left|(\mathbb{P}-\mathbb{P}_{n})[\ell(h^{*};z)-\inf_{\pazocal{H}}\ell(h;z)]\right|\leq\sqrt{\frac{4B\pazocal{L}^{*}\log\frac{2}{\delta}}{n}}+\frac{2B\log\frac{2}{\delta}}{n}\leq\frac{\pazocal{L}^{*}}{4}+\frac{3B\log\frac{2}{\delta}}{n}. (A.15)

Consider the event

$$\mathscr{A}_{3}=\mathscr{A}\cap\{\text{inequality (A.15) holds true}\},$$

whose measure is at least $1-2\delta$. On the event $\mathscr{A}_{3}$, from inequalities (A.13), (A.14), and (A.15), it is straightforward to show that

$$\pazocal{L}^{*}-\widehat{\pazocal{L}^{*}}\leq\frac{3}{4}\pazocal{L}^{*}+\frac{r^{*}}{6B}+\frac{4B\log\frac{2}{\delta}}{n},$$

which implies

$$\pazocal{L}^{*}\leq 4\widehat{\pazocal{L}^{*}}+\frac{2r^{*}}{3B}+\frac{16B\log\frac{2}{\delta}}{n}. \tag{A.16}$$

From this result and (A.13), it is straightforward to show that

$$\mathscr{E}(\hat{h}_{\textup{ERM}})\leq\psi\left(cB\widehat{\pazocal{L}^{*}};\frac{\delta}{C_{n}}\right)\lor\frac{cr^{*}}{B}\lor\frac{cB\log\frac{2}{\delta}}{n},$$

where cc is an absolute constant.

We also have

$$\begin{aligned}\widehat{\pazocal{L}^{*}}-\pazocal{L}^{*}&=\mathbb{P}_{n}\ell(\hat{h}_{\textup{ERM}};z)-\mathbb{P}\ell(h^{*};z)-(\mathbb{P}_{n}-\mathbb{P})[\inf_{\pazocal{H}}\ell(h;z)]\\&\leq(\mathbb{P}_{n}-\mathbb{P})\ell(h^{*};z)-(\mathbb{P}_{n}-\mathbb{P})[\inf_{\pazocal{H}}\ell(h;z)]\\&=(\mathbb{P}_{n}-\mathbb{P})[\ell(h^{*};z)-\inf_{\pazocal{H}}\ell(h;z)],\end{aligned}$$

where the inequality holds because $\hat{h}_{\textup{ERM}}$ minimizes the empirical risk.

From this result and (A.15), on the event $\mathscr{A}_{3}$,

$$\widehat{\pazocal{L}^{*}}\leq\frac{5}{4}\pazocal{L}^{*}+\frac{3B\log\frac{2}{\delta}}{n}. \tag{A.17}$$

Combining (A.16) and (A.17), we obtain

$$\pazocal{L}^{*}\leq c_{1}\left(\widehat{\pazocal{L}^{*}}\lor\frac{r^{*}}{B}\lor\frac{B\log\frac{2}{\delta}}{n}\right)\leq c_{2}\left(\pazocal{L}^{*}\lor\frac{r^{*}}{B}\lor\frac{B\log\frac{2}{\delta}}{n}\right),$$

where $c_{1}$ and $c_{2}$ are absolute constants. This completes the proof. $\square$

A.4 Proof of Theorem 2

The main goal of this subsection is to prove Theorem 2. We first prove Theorem 11 (the bound (3.6) in the main paper), a guarantee for the second-stage moment-penalized estimator $\hat{h}_{\textup{MP}}$. To prove Theorem 2, we then combine Theorem 11 with a guarantee for the first-stage empirical risk minimization (ERM) estimator.

A.4.1 Analysis for the second-stage moment-penalized estimator

Theorem 11 (variance-dependent rate of the second-stage estimator).

Given an arbitrary preliminary estimate $\widehat{\pazocal{L}^{*}_{0}}\in[-B,B]$, the generalization error of the moment-penalized estimator $\hat{h}_{\textup{MP}}$ in Strategy 2 is bounded by

$$\mathscr{E}(\hat{h}_{\textup{MP}})\leq 2\psi\left(c_{0}\left[\pazocal{V}^{*}\lor(\widehat{\pazocal{L}^{*}_{0}}-\pazocal{L}^{*}_{0})^{2}\lor r^{*}\right];\frac{\delta}{C_{n}}\right),$$

with probability at least $1-\delta$, where $c_{0}$ is an absolute constant and $r^{*}$ is the fixed point of $16B\psi(r;\frac{\delta}{C_{n}})$.
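The fixed point $r^{*}$ in the statement above is rarely available in closed form, but for a sub-root function it can be approximated by plain iteration. The following is a minimal sketch under stated assumptions: `toy_psi` is a hypothetical surrogate that keeps only the two concentration terms appearing in Part I of the proof (the Rademacher term is set to zero), and `fixed_point` is an illustrative helper, not a routine from the paper.

```python
import math


def toy_psi(r, delta, n, B):
    """Toy surrogate (assumption): concentration terms only, Rademacher term = 0."""
    log_term = math.log(8.0 / delta)
    return math.sqrt(2.0 * r * log_term / n) + 9.0 * B * log_term / n


def fixed_point(phi, r_init=1.0, tol=1e-12, max_iter=10_000):
    """Iterate r <- phi(r) until the relative change drops below tol.

    For a sub-root phi (non-negative, non-decreasing, phi(r)/sqrt(r)
    non-increasing) the iterates move monotonically toward the unique
    positive fixed point.
    """
    r = r_init
    for _ in range(max_iter):
        r_next = phi(r)
        if abs(r_next - r) <= tol * max(1.0, abs(r)):
            return r_next
        r = r_next
    return r


# Hypothetical parameter values, chosen only for illustration.
B, n, delta_1 = 1.0, 1000, 0.05
r_star = fixed_point(lambda r: 16.0 * B * toy_psi(r, delta_1, n, B))
lower_bound = 144.0 * B**2 * math.log(8.0 / delta_1) / n
```

In this toy setting the computed `r_star` exceeds `lower_bound`, in line with the lower bound (A.37) used in Part IV.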

Proof of Theorem 11:

The proof of Theorem 11 consists of four parts.

Part I: use $\psi$ to upper bound localized empirical processes

Let $\pazocal{F}$ be the excess loss class in (3.2), whose members $f$ are defined by $f(z)=\ell(h;z)-\ell(h^{*};z)$ for all $z\in\pazocal{Z}$. We have the following lemma.

Lemma 3 (bound on localized empirical processes).

Given a fixed $\delta_{1}\in(0,1)$, let $r_{1}^{*}(\delta_{1})$ be the fixed point of $16B\psi(r;\delta_{1})$, where $\psi$ is defined in Strategy 2. Then with probability at least $1-\delta_{1}$, for all $r>0$,

$$\sup_{\mathbb{P}[f^{2}]\leq r}(\mathbb{P}-\mathbb{P}_{n})f\leq\psi\left(r\lor r_{1}^{*}(\delta_{1});\delta_{1}\right). \tag{A.18}$$
Proof of Lemma 3:

Clearly, $\pazocal{F}$ is uniformly bounded in $[-2B,2B]$. When $\mathbb{P}[f^{2}]\leq r$, we have $\mathbb{P}[f^{4}]\leq 4B^{2}r$. From Lemma 5 (the two-sided version of its second inequality), with probability at least $1-\frac{\delta_{1}}{2}$,

$$\begin{aligned}\sup_{\mathbb{P}[f^{2}]\leq r}\left|(\mathbb{P}-\mathbb{P}_{n})f^{2}\right|&\leq 4\mathfrak{R}_{n}\{f^{2}:\mathbb{P}[f^{2}]\leq r\}+2B\sqrt{\frac{2r\log\frac{8}{\delta_{1}}}{n}}+\frac{18B^{2}\log\frac{8}{\delta_{1}}}{n}\\&\leq 16B\mathfrak{R}_{n}\{f:\mathbb{P}[f^{2}]\leq r\}+2B\sqrt{\frac{2r\log\frac{8}{\delta_{1}}}{n}}+\frac{18B^{2}\log\frac{8}{\delta_{1}}}{n},\end{aligned}$$

where the last inequality follows from the Lipschitz contraction property of Rademacher complexity (see, e.g., [48, Theorem 7]), and the fact that for all $f_{1},f_{2}\in\pazocal{F}$, $|f_{1}^{2}(z)-f_{2}^{2}(z)|\leq 4B|f_{1}(z)-f_{2}(z)|$. We conclude that with probability at least $1-\frac{\delta_{1}}{2}$,

$$\sup_{\mathbb{P}[f^{2}]\leq r}\left|(\mathbb{P}-\mathbb{P}_{n})f^{2}\right|\leq\varphi_{\delta_{1}}(r), \tag{A.19}$$

where $\varphi_{\delta_{1}}(r):=16B\mathfrak{R}_{n}\{f:\mathbb{P}[f^{2}]\leq r\}+2B\sqrt{\frac{2r\log\frac{8}{\delta_{1}}}{n}}+\frac{18B^{2}\log\frac{8}{\delta_{1}}}{n}$.

Denote by $r_{2}^{*}(\delta_{1})$ the fixed point of $4\varphi_{\delta_{1}}(r)$ (the fixed point must exist as $4\varphi_{\delta_{1}}(r)$ is a non-decreasing, non-negative and bounded function). From (A.19) and the fact that $r_{2}^{*}(\delta_{1})$ is the fixed point of $4\varphi_{\delta_{1}}(r)$, if $r>r_{2}^{*}(\delta_{1})$, then with probability at least $1-\frac{\delta_{1}}{2}$,

$$\sup_{\mathbb{P}[f^{2}]\leq r}\left|(\mathbb{P}-\mathbb{P}_{n})f^{2}\right|\leq\frac{r}{4}. \tag{A.20}$$

(A.20) implies that with probability at least $1-\frac{\delta_{1}}{2}$, for all $r>r_{2}^{*}(\delta_{1})$, the condition $\mathbb{P}[f^{2}]\leq r$ implies that

$$\mathbb{P}_{n}[f^{2}]\leq\frac{5}{4}r\leq 2r. \tag{A.21}$$

Again from the two-sided version of the second inequality in Lemma 5, we know that with probability at least $1-\frac{\delta_{1}}{2}$,

$$\sup_{\mathbb{P}[f^{2}]\leq r}\left|(\mathbb{P}-\mathbb{P}_{n})f\right|\leq 4\mathfrak{R}_{n}\{f:\mathbb{P}[f^{2}]\leq r\}+\sqrt{\frac{2r\log\frac{8}{\delta_{1}}}{n}}+\frac{9B\log\frac{8}{\delta_{1}}}{n}.$$

Combining the above inequality and (A.21) using a union bound, we know that with probability at least $1-\frac{\delta_{1}}{2}-\frac{\delta_{1}}{2}=1-\delta_{1}$, if $r>r_{2}^{*}(\delta_{1})$, then

$$\begin{aligned}\sup_{\mathbb{P}[f^{2}]\leq r}(\mathbb{P}-\mathbb{P}_{n})f&\leq 4\mathfrak{R}_{n}\{f:\mathbb{P}[f^{2}]\leq r\}+\sqrt{\frac{2r\log\frac{8}{\delta_{1}}}{n}}+\frac{9B\log\frac{8}{\delta_{1}}}{n}\\&\leq 4\mathfrak{R}_{n}\{f:\mathbb{P}_{n}[f^{2}]\leq 2r\}+\sqrt{\frac{2r\log\frac{8}{\delta_{1}}}{n}}+\frac{9B\log\frac{8}{\delta_{1}}}{n}.\end{aligned} \tag{A.22}$$

Recall that the $\psi$ function satisfies, for all $r>0$,

$$4\mathfrak{R}_{n}\{f:\mathbb{P}_{n}[f^{2}]\leq 2r\}+\sqrt{\frac{2r\log\frac{8}{\delta_{1}}}{n}}+\frac{9B\log\frac{8}{\delta_{1}}}{n}\leq\psi(r;\delta_{1}).$$

From this fact and (A.22), we see that with probability at least $1-\delta_{1}$, for all $r>0$,

$$\sup_{\mathbb{P}[f^{2}]\leq r}(\mathbb{P}-\mathbb{P}_{n})f\leq\psi\left(r\lor r_{2}^{*}(\delta_{1});\delta_{1}\right). \tag{A.23}$$

From (A.23), in order to prove the result (A.18) in Lemma 3, we only need to prove that

$$r_{2}^{*}(\delta_{1})\leq r_{1}^{*}(\delta_{1}). \tag{A.24}$$

Assume this is not true, i.e., $r_{2}^{*}(\delta_{1})>r_{1}^{*}(\delta_{1})$. Since $r_{1}^{*}(\delta_{1})$ is the fixed point of $16B\psi(r;\delta_{1})$, from the definition of fixed points we have

$$r_{2}^{*}(\delta_{1})>16B\psi(r_{2}^{*}(\delta_{1});\delta_{1}).$$

From the definitions of $\psi$ and $\varphi_{\delta_{1}}$, for all $r>r_{1}^{*}(\delta_{1})$,

$$4\varphi_{\delta_{1}}(r)\leq 16B\psi(r;\delta_{1}).$$

From the above two inequalities and $r_{2}^{*}(\delta_{1})>r_{1}^{*}(\delta_{1})$, we have

$$r_{2}^{*}(\delta_{1})>16B\psi(r_{2}^{*}(\delta_{1});\delta_{1})\geq 4\varphi_{\delta_{1}}(r_{2}^{*}(\delta_{1})). \tag{A.25}$$

From the fact that $r_{2}^{*}(\delta_{1})$ is the fixed point of $4\varphi_{\delta_{1}}$, we have

$$4\varphi_{\delta_{1}}(r_{2}^{*}(\delta_{1}))=r_{2}^{*}(\delta_{1}). \tag{A.26}$$

The relations (A.25) and (A.26) contradict each other, so the assumption $r_{2}^{*}(\delta_{1})>r_{1}^{*}(\delta_{1})$ is false. Therefore $r_{2}^{*}(\delta_{1})\leq r_{1}^{*}(\delta_{1})$, and this completes the proof of Lemma 3. $\square$

Part II: a “uniform localized convergence” argument with data-dependent measurement.

Based on Lemma 3, we will modify the proof of Proposition 1 to obtain a “uniform localized convergence” argument with the data-dependent “measurement” functional $\mathbb{P}_{n}[f^{2}]$.

Lemma 4 (a “uniform localized convergence” argument with the data-dependent “measurement” functional).

Given a fixed $\delta_{1}\in(0,1)$, let $r_{1}^{*}(\delta_{1})$ be the fixed point of $16B\psi(r;\delta_{1})$, where $\psi$ is defined in Strategy 2. Then with probability at least $1-\left(\log_{2}\frac{8B^{2}\lor 2r_{1}^{*}(\delta_{1})}{r_{1}^{*}(\delta_{1})}+\frac{1}{2}\right)\delta_{1}$, for all $f\in\pazocal{F}$, either $\mathbb{P}[f^{2}]\leq r_{1}^{*}(\delta_{1})$, or

$$(\mathbb{P}-\mathbb{P}_{n})f\leq\psi\left(4\mathbb{P}_{n}[f^{2}];\delta_{1}\right). \tag{A.27}$$
Proof of Lemma 4:

From the definition of $\psi$ and the fact that $r_{1}^{*}(\delta_{1})$ is the fixed point of $16B\psi(r;\delta_{1})$, we know that $r_{1}^{*}(\delta_{1})\geq\frac{144B^{2}\log\frac{8}{\delta_{1}}}{n}>0$. Take $r_{0}=r_{1}^{*}(\delta_{1})$.

Take $R=4B^{2}\lor r_{0}$ to be a uniform upper bound for $\mathbb{P}[f^{2}]$, and take $r_{k}=2^{k}r_{0}$, $k=1,\cdots,\lceil\log_{2}\frac{R}{r_{0}}\rceil$. Note that $\lceil\log_{2}\frac{R}{r_{0}}\rceil\leq\log_{2}\frac{2R}{r_{0}}$. We use the union bound to establish that $\sup_{\mathbb{P}[f^{2}]\leq r}(\mathbb{P}-\mathbb{P}_{n})f\leq\psi(r;\delta_{1})$ holds for all $\{r_{k}\}$ simultaneously: with probability at least $1-\log_{2}\frac{2R}{r_{0}}\delta_{1}$,

$$\sup_{\mathbb{P}[f^{2}]\leq r_{k}}(\mathbb{P}-\mathbb{P}_{n})f\leq\psi(r_{k};\delta_{1}),\quad k=1,\cdots,\left\lceil\log_{2}\frac{R}{r_{0}}\right\rceil.$$

For any fixed $f\in\pazocal{F}$, if $\mathbb{P}[f^{2}]\leq r_{0}$ is false, let $k$ be the non-negative integer such that $2^{k}r_{0}<\mathbb{P}[f^{2}]\leq 2^{k+1}r_{0}$. We further have that $r_{k+1}=2^{k+1}r_{0}\leq 2\mathbb{P}[f^{2}]$. Therefore, with probability at least $1-\log_{2}\frac{2R}{r_{0}}\delta_{1}$,

$$\begin{aligned}\mathbb{P}f&\leq\mathbb{P}_{n}f+\sup_{\tilde{f}\in\pazocal{F}:\mathbb{P}[\tilde{f}^{2}]\leq r_{k+1}}(\mathbb{P}-\mathbb{P}_{n})\tilde{f}\\&\leq\mathbb{P}_{n}f+\psi(r_{k+1};\delta_{1}).\end{aligned} \tag{A.28}$$

By (A.19) we know that with probability at least $1-\frac{\delta_{1}}{2}$,

$$\sup_{\mathbb{P}[f^{2}]\leq r}\left(\mathbb{P}[f^{2}]-\mathbb{P}_{n}[f^{2}]\right)\leq\frac{r}{4}$$

for all $r>r_{0}$ (here we have used the fact $r_{0}=r_{1}^{*}(\delta_{1})\geq r_{2}^{*}(\delta_{1})$, which is the result (A.24) in the proof of Lemma 3). From the union bound, with probability at least $1-(\log_{2}\frac{2R}{r_{0}}+\frac{1}{2})\delta_{1}$, the condition $r_{k+1}\geq\mathbb{P}[f^{2}]>r_{k}$ implies

$$\mathbb{P}_{n}[f^{2}]\geq\mathbb{P}[f^{2}]-\frac{1}{4}r_{k+1}\geq\frac{1}{4}r_{k+1},$$

so

$$r_{k+1}\leq 4\mathbb{P}_{n}[f^{2}].$$

Combining this result with (A.28), we have that for all $f$ such that $\mathbb{P}[f^{2}]>r_{0}$, with probability at least $1-\left(\log_{2}\frac{2R}{r_{0}}+\frac{1}{2}\right)\delta_{1}$,

$$\begin{aligned}\mathbb{P}f&\leq\mathbb{P}_{n}f+\psi(r_{k+1};\delta_{1})\\&\leq\mathbb{P}_{n}f+\psi\left(4\mathbb{P}_{n}[f^{2}];\delta_{1}\right).\end{aligned}$$

We conclude that with probability at least $1-\left(\log_{2}\frac{2R}{r_{0}}+\frac{1}{2}\right)\delta_{1}$, for all $f\in\pazocal{F}$, either $\mathbb{P}[f^{2}]\leq r_{1}^{*}(\delta_{1})$, or

$$(\mathbb{P}-\mathbb{P}_{n})f\leq\psi\left(4\mathbb{P}_{n}[f^{2}];\delta_{1}\right).$$

This completes the proof of Lemma 4. $\square$
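The peeling construction in the proof of Lemma 4 can be sketched numerically. The block below is only an illustration under assumed values ($r_{0}=0.01$ and $R=4$, as if $B=1$); the helper names `peeling_grid` and `shell_index` are hypothetical. It builds the dyadic radii $r_{k}=2^{k}r_{0}$ and, for a given value $t$ standing in for $\mathbb{P}[f^{2}]$, locates the shell index $k$ with $2^{k}r_{0}<t\leq 2^{k+1}r_{0}$.

```python
import math


def peeling_grid(r0, R):
    """Radii r_k = 2**k * r0 for k = 1, ..., ceil(log2(R / r0))."""
    K = math.ceil(math.log2(R / r0))
    return [2**k * r0 for k in range(1, K + 1)]


def shell_index(t, r0):
    """Non-negative integer k with 2**k * r0 < t <= 2**(k + 1) * r0 (needs t > r0)."""
    if t <= r0:
        raise ValueError("t must exceed r0; smaller values fall in the base ball")
    k = max(math.ceil(math.log2(t / r0)) - 1, 0)
    # Guard against floating-point rounding at shell boundaries.
    while 2 ** (k + 1) * r0 < t:
        k += 1
    while k > 0 and 2**k * r0 >= t:
        k -= 1
    return k


r0, R = 0.01, 4.0  # hypothetical values, e.g. B = 1 so that R = 4 * B**2
grid = peeling_grid(r0, R)
for t in (0.011, 0.03, 0.5, 4.0):
    k = shell_index(t, r0)
    r_next = 2 ** (k + 1) * r0
    # r_{k+1} dominates t, is at most 2t, and belongs to the grid.
    assert t <= r_next <= 2 * t and k + 1 <= len(grid)
```

The number of radii is at most $\log_{2}\frac{2R}{r_{0}}$, which is the factor multiplying $\delta_{1}$ in the union bound.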

Part III: specify the moment-penalized estimator and its error bound.

We define the event

$$\mathscr{A}_{1}=\left\{\text{there exists $f\in\pazocal{F}$ such that $\mathbb{P}[f^{2}]>r_{0}$ and $(\mathbb{P}-\mathbb{P}_{n})f>\psi\left(4\mathbb{P}_{n}[f^{2}];\delta_{1}\right)$}\right\},$$

where $r_{0}=r_{1}^{*}(\delta_{1})$ as in the proof of Lemma 4. Lemma 4 has proven that

$$\textup{Prob}(\mathscr{A}_{1})\leq\left(\log_{2}\frac{8B^{2}\lor 2r_{1}^{*}(\delta_{1})}{r_{1}^{*}(\delta_{1})}+\frac{1}{2}\right)\delta_{1}. \tag{A.29}$$

We denote the event

$$\mathscr{A}_{2}=\left\{\text{there exists $f\in\pazocal{F}$ such that $\mathbb{P}[f^{2}]\leq r_{0}$ and $(\mathbb{P}-\mathbb{P}_{n})f>\psi(r_{0};\delta_{1})$}\right\}.$$

Due to the surrogate property of $\psi$, we have

$$\textup{Prob}(\mathscr{A}_{2})\leq\delta_{1}. \tag{A.30}$$

Denote the event

$$\mathscr{A}=\left\{\text{for all $f\in\pazocal{F}$, }(\mathbb{P}-\mathbb{P}_{n})f\leq\psi\left(4\mathbb{P}_{n}[f^{2}]\lor r_{1}^{*}(\delta_{1});\delta_{1}\right)\right\}.$$

From (A.29) and (A.30), it is straightforward to prove that

$$\begin{aligned}\textup{Prob}(\mathscr{A})&\geq 1-\textup{Prob}(\mathscr{A}_{1})-\textup{Prob}(\mathscr{A}_{2})\\&\geq 1-\left(\log_{2}\frac{8B^{2}\lor 2r_{1}^{*}(\delta_{1})}{r_{1}^{*}(\delta_{1})}+\frac{1}{2}\right)\delta_{1}-\delta_{1}\\&=1-\left(\log_{2}\frac{8B^{2}\lor 2r_{1}^{*}(\delta_{1})}{r_{1}^{*}(\delta_{1})}+\frac{3}{2}\right)\delta_{1}.\end{aligned} \tag{A.31}$$

Denote $w(h;z)=\ell(h;z)-\widehat{\pazocal{L}^{*}_{0}}$. Then $f(z)=w(h;z)-w(h^{*};z)$ for all $z\in\pazocal{Z}$, and we have that

$$\begin{aligned}4\mathbb{P}_{n}[f^{2}]&\leq 8\mathbb{P}_{n}[w(h;z)^{2}]+8\mathbb{P}_{n}[w(h^{*};z)^{2}]\\&\leq 16\mathbb{P}_{n}[w(h;z)^{2}]\lor 16\mathbb{P}_{n}[w(h^{*};z)^{2}].\end{aligned}$$

From the above conclusion and the definition of the event $\mathscr{A}$, we obtain that on the event $\mathscr{A}$, for all $h\in\pazocal{H}$,

$$\begin{aligned}\mathscr{E}(h)+\mathbb{P}_{n}\ell(h^{*};z)&\leq\mathbb{P}_{n}\ell(h;z)+\psi\left(4\mathbb{P}_{n}[f^{2}]\lor r_{1}^{*}(\delta_{1});\delta_{1}\right)\\&\leq\mathbb{P}_{n}\ell(h;z)+\psi\left(16\mathbb{P}_{n}[w(h;z)^{2}]\lor 16\mathbb{P}_{n}[w(h^{*};z)^{2}]\lor r_{1}^{*}(\delta_{1});\delta_{1}\right)\\&\leq\mathbb{P}_{n}\ell(h;z)+\psi\left(16\mathbb{P}_{n}[w(h;z)^{2}];\delta_{1}\right)+\psi\left(16\mathbb{P}_{n}[w(h^{*};z)^{2}]\lor r_{1}^{*}(\delta_{1});\delta_{1}\right).\end{aligned} \tag{A.32}$$

We specify the moment-penalized estimator to be

$$\hat{h}_{\textup{MP}}=\operatorname*{arg\,min}_{\pazocal{H}}\left\{\mathbb{P}_{n}\ell(h;z)+\psi\left(16\mathbb{P}_{n}[(\ell(h;z)-\widehat{\pazocal{L}^{*}_{0}})^{2}];\delta_{1}\right)\right\}.$$

Then we have

$$\mathbb{P}_{n}\ell(\hat{h}_{\textup{MP}};z)+\psi\left(16\mathbb{P}_{n}[w(\hat{h}_{\textup{MP}};z)^{2}];\delta_{1}\right)\leq\mathbb{P}_{n}\ell(h^{*};z)+\psi\left(16\mathbb{P}_{n}[w(h^{*};z)^{2}];\delta_{1}\right). \tag{A.33}$$

Therefore, on the event $\mathscr{A}$,

$$\begin{aligned}\mathscr{E}(\hat{h}_{\textup{MP}})&\leq\mathbb{P}_{n}\ell(\hat{h}_{\textup{MP}};z)+\psi\left(16\mathbb{P}_{n}[w(\hat{h}_{\textup{MP}};z)^{2}];\delta_{1}\right)+\psi\left(16\mathbb{P}_{n}[w(h^{*};z)^{2}]\lor r_{1}^{*}(\delta_{1});\delta_{1}\right)-\mathbb{P}_{n}\ell(h^{*};z)\\&=\min_{\pazocal{H}}\left\{\mathbb{P}_{n}\ell(h;z)+\psi\left(16\mathbb{P}_{n}[w(h;z)^{2}];\delta_{1}\right)\right\}-\mathbb{P}_{n}\ell(h^{*};z)+\psi\left(16\mathbb{P}_{n}[w(h^{*};z)^{2}]\lor r_{1}^{*}(\delta_{1});\delta_{1}\right)\\&\leq\psi\left(16\mathbb{P}_{n}[w(h^{*};z)^{2}];\delta_{1}\right)+\psi\left(16\mathbb{P}_{n}[w(h^{*};z)^{2}]\lor r_{1}^{*}(\delta_{1});\delta_{1}\right)\\&\leq 2\psi\left(16\mathbb{P}_{n}[w(h^{*};z)^{2}]\lor r_{1}^{*}(\delta_{1});\delta_{1}\right),\end{aligned} \tag{A.34}$$

where the first inequality is due to (A.32) and the second inequality is due to (A.33).

From Bernstein’s inequality at the single element $h^{*}$, for any fixed $\delta_{2}\in(0,1)$, with probability at least $1-\delta_{2}$,

$$\begin{aligned}\mathbb{P}_{n}[w(h^{*};z)^{2}]&\leq\mathbb{P}[w(h^{*};z)^{2}]+2B\sqrt{\frac{2\mathbb{P}[w(h^{*};z)^{2}]\log\frac{2}{\delta_{2}}}{n}}+\frac{4B^{2}\log\frac{2}{\delta_{2}}}{n}\\&\leq 2\mathbb{P}[w(h^{*};z)^{2}]+\frac{6B^{2}\log\frac{2}{\delta_{2}}}{n}.\end{aligned} \tag{A.35}$$

From (A.31), (A.34), and (A.35), with probability at least

$$\textup{Prob}(\mathscr{A})-\delta_{2}\geq 1-\left(\log_{2}\frac{8B^{2}\lor 2r_{1}^{*}(\delta_{1})}{r_{1}^{*}(\delta_{1})}+\frac{3}{2}\right)\delta_{1}-\delta_{2},$$

we have

$$\begin{aligned}\mathscr{E}(\hat{h}_{\textup{MP}})&\leq 2\psi\left(16\mathbb{P}_{n}[w(h^{*};z)^{2}]\lor r_{1}^{*}(\delta_{1})\lor\frac{B^{2}}{n};\delta_{1}\right)\\&\leq 2\psi\left(\left(32\mathbb{P}[w(h^{*};z)^{2}]+\frac{96B^{2}\log\frac{2}{\delta_{2}}}{n}\right)\lor r_{1}^{*}(\delta_{1})\lor\frac{B^{2}}{n};\delta_{1}\right),\end{aligned} \tag{A.36}$$

where the first inequality is due to (A.34) and the second inequality is due to (A.35).
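The second inequality in (A.35) is an instance of the AM-GM inequality $2\sqrt{ab}\leq a+b$. As a worked check, write $a=\mathbb{P}[w(h^{*};z)^{2}]$ and $x=\frac{\log\frac{2}{\delta_{2}}}{n}$ (shorthand introduced here only for brevity):

```latex
\begin{align*}
2B\sqrt{2ax} &= 2\sqrt{a\cdot 2B^{2}x} \;\leq\; a + 2B^{2}x, \\
a + 2B\sqrt{2ax} + 4B^{2}x &\leq 2a + 6B^{2}x, \\
16\left(2a + 6B^{2}x\right) &= 32a + 96B^{2}x .
\end{align*}
```

The last line gives the constants $32$ and $96$ that appear inside $\psi$ in (A.36).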

Part IV: final steps.

From the definition of $\psi$ and the fact that $r_{1}^{*}(\delta_{1})$ is the fixed point of $16B\psi(r;\delta_{1})$, we know that

$$r_{1}^{*}(\delta_{1})\geq\frac{144B^{2}\log\frac{8}{\delta_{1}}}{n}. \tag{A.37}$$

Denote $C_{n}:=2\log_{2}n+5$ and take

$$\delta_{1}=\frac{\delta}{C_{n}},$$

then we have

$$\begin{aligned}2\log_{2}\frac{8B^{2}\lor 2r_{1}^{*}(\delta_{1})}{r_{1}^{*}(\delta_{1})}+3&\leq\max\left\{2\log_{2}\frac{8n}{144\log 8}+3,\;2+3\right\}\\&\leq\max\{2\log_{2}n,5\}\leq C_{n},\end{aligned}$$

so

$$\left(\log_{2}\frac{8B^{2}\lor 2r_{1}^{*}(\delta_{1})}{r_{1}^{*}(\delta_{1})}+\frac{3}{2}\right)\delta_{1}\leq\frac{\delta}{2}. \tag{A.38}$$
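The counting bound behind (A.38) can be sanity-checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the parameter grid is an arbitrary choice. Because the left-hand side is non-increasing in $r_{1}^{*}(\delta_{1})$, it suffices to evaluate it at the smallest value permitted by (A.37).

```python
import math


def peeling_factor(B, r_star):
    """Left-hand side of the counting bound: 2*log2((8B^2 v 2r*) / r*) + 3."""
    return 2.0 * math.log2(max(8.0 * B**2, 2.0 * r_star) / r_star) + 3.0


def c_n(n):
    """C_n = 2*log2(n) + 5, as chosen in Part IV."""
    return 2.0 * math.log2(n) + 5.0


def counting_bound_holds(n, B, delta):
    """Check the bound at the smallest r* allowed by (A.37), the worst case."""
    delta_1 = delta / c_n(n)
    r_min = 144.0 * B**2 * math.log(8.0 / delta_1) / n
    return peeling_factor(B, r_min) <= c_n(n)


# Hypothetical grid of sample sizes and loss bounds.
results = {
    (n, B): counting_bound_holds(n, B, delta=0.05)
    for n in (2, 10, 1000, 10**6)
    for B in (0.1, 1.0, 50.0)
}
```

Every entry of `results` is expected to be `True`, since the bound does not depend on $B$ after normalization and holds for all $n\geq 1$.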

Set $r^{*}=r_{1}^{*}(\delta_{1})$ and take $\delta_{2}=\frac{\delta}{2}$. From (A.36), we obtain that with probability at least $1-\delta$, the generalization error of $\hat{h}_{\textup{MP}}$ is upper bounded by

$$\mathscr{E}(\hat{h}_{\textup{MP}})\leq 2\psi\left(c\left[\mathbb{P}[w(h^{*};z)^{2}]\lor r^{*}\lor\frac{B^{2}\log\frac{4}{\delta}}{n}\right];\frac{\delta}{C_{n}}\right), \tag{A.39}$$

where $c$ is an absolute constant. From (A.37) we have $r_{1}^{*}(\delta_{1})\geq\frac{144B^{2}\log\frac{8C_{n}}{\delta}}{n}\geq\frac{B^{2}\log\frac{4}{\delta}}{n}$. Combining this fact with the inequality (A.39), we obtain that

$$\begin{aligned}\mathscr{E}(\hat{h}_{\textup{MP}})&\leq 2\psi\left(c\left[\mathbb{P}[(\ell(h^{*};z)-\widehat{\pazocal{L}^{*}_{0}})^{2}]\lor r^{*}\right];\frac{\delta}{C_{n}}\right)\\&\leq 2\psi\left(c_{0}\left[\pazocal{V}^{*}\lor r^{*}\lor(\widehat{\pazocal{L}^{*}_{0}}-\pazocal{L}^{*}_{0})^{2}\right];\frac{\delta}{C_{n}}\right),\end{aligned} \tag{A.40}$$

where $c_{0}$ is an absolute constant. This completes the proof of Theorem 11. $\square$
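The last step in (A.40) rests on a bias-variance decomposition. The derivation below is a sketch under assumptions not restated in this appendix: that $\pazocal{L}^{*}_{0}=\mathbb{P}\ell(h^{*};z)$ and $\pazocal{V}^{*}=\textup{Var}[\ell(h^{*};z)]$ as in the main text, and that $\widehat{\pazocal{L}^{*}_{0}}$ is treated as a fixed number (for instance, by conditioning on the auxiliary sample).

```latex
\begin{align*}
\mathbb{P}\big[(\ell(h^{*};z)-\widehat{\pazocal{L}^{*}_{0}})^{2}\big]
&= \mathbb{P}\big[(\ell(h^{*};z)-\pazocal{L}^{*}_{0})^{2}\big]
 + 2(\pazocal{L}^{*}_{0}-\widehat{\pazocal{L}^{*}_{0}})\,
   \mathbb{P}\big[\ell(h^{*};z)-\pazocal{L}^{*}_{0}\big]
 + (\pazocal{L}^{*}_{0}-\widehat{\pazocal{L}^{*}_{0}})^{2} \\
&= \pazocal{V}^{*} + (\widehat{\pazocal{L}^{*}_{0}}-\pazocal{L}^{*}_{0})^{2}
 \;\leq\; 2\Big(\pazocal{V}^{*}\lor(\widehat{\pazocal{L}^{*}_{0}}-\pazocal{L}^{*}_{0})^{2}\Big).
\end{align*}
```

Under these assumptions the cross term vanishes, and since $\psi$ is non-decreasing in its first argument, one admissible choice is $c_{0}=2c$.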

A.4.2 Analysis of the first-stage ERM estimator

After proving Theorem 11, the remaining part needed to prove Theorem 2 is to bound $(\widehat{\pazocal{L}^{*}_{0}}-\pazocal{L}^{*}_{0})^{2}$, that is, the error of the first-stage ERM estimator.

The remaining steps in the proof of Theorem 2:

We will give a guarantee on the first-stage ERM estimator, and combine this guarantee with Theorem 11 to prove Theorem 2. Recall that $\mathbb{P}_{S^{\prime}}$ is the empirical distribution of the “auxiliary” data set. Denote $\hat{h}_{\textup{ERM}}\in\operatorname*{arg\,min}_{\pazocal{H}}\mathbb{P}_{S^{\prime}}\ell(h;z)$.

From Part I in the proof of Theorem 11, for all $\delta\in(0,\frac{1}{2})$, with probability at least $1-\delta$,

$$\sup_{\pazocal{F}}|(\mathbb{P}-\mathbb{P}_{n})f|\leq\psi(4B^{2};\delta)\leq\psi\left(4B^{2};\frac{\delta}{C_{n}}\right).$$

Since $\psi$ is sub-root with respect to its first argument, we have

$$\frac{\psi(4B^{2};\frac{\delta}{C_{n}})}{\sqrt{4B^{2}}}\leq\frac{\psi(r^{*};\frac{\delta}{C_{n}})}{\sqrt{r^{*}}}=\frac{\sqrt{r^{*}}}{16B},$$

where $r^{*}$ is the fixed point of $16B\psi(r;\frac{\delta}{C_{n}})$. So we have proved that $\psi(4B^{2};\frac{\delta}{C_{n}})\leq\frac{\sqrt{r^{*}}}{8}$. Therefore,

$$\sup_{\pazocal{F}}\left|(\mathbb{P}-\mathbb{P}_{n})f\right|\leq\frac{\sqrt{r^{*}}}{8}.$$

Because $\hat{h}_{\textup{ERM}}\in\operatorname*{arg\,min}_{\pazocal{H}}\mathbb{P}_{S^{\prime}}\ell(h;z)$ and $\mathbb{P}_{S^{\prime}}\ell(\hat{h}_{\textup{ERM}};z)=\widehat{\pazocal{L}^{*}_{0}}$, we have

$$\begin{aligned}\widehat{\pazocal{L}^{*}_{0}}-\pazocal{L}^{*}_{0}&=\left(\mathbb{P}_{S^{\prime}}\ell(\hat{h}_{\textup{ERM}};z)-\mathbb{P}_{S^{\prime}}\ell(h^{*};z)\right)+\left(\mathbb{P}_{S^{\prime}}\ell(h^{*};z)-\mathbb{P}\ell(h^{*};z)\right)\\&\leq\mathbb{P}_{S^{\prime}}\ell(h^{*};z)-\mathbb{P}\ell(h^{*};z)\leq\sup_{\pazocal{F}}|(\mathbb{P}-\mathbb{P}_{n})f|,\end{aligned}$$

and

$$\begin{aligned}\widehat{\pazocal{L}^{*}_{0}}-\pazocal{L}^{*}_{0}&=\left(\mathbb{P}_{S^{\prime}}\ell(\hat{h}_{\textup{ERM}};z)-\mathbb{P}\ell(\hat{h}_{\textup{ERM}};z)\right)+\left(\mathbb{P}\ell(\hat{h}_{\textup{ERM}};z)-\mathbb{P}\ell(h^{*};z)\right)\\&\geq\mathbb{P}_{S^{\prime}}\ell(\hat{h}_{\textup{ERM}};z)-\mathbb{P}\ell(\hat{h}_{\textup{ERM}};z)\geq-\sup_{\pazocal{F}}|(\mathbb{P}-\mathbb{P}_{n})f|.\end{aligned}$$

Hence we have

$$(\widehat{\pazocal{L}^{*}_{0}}-\pazocal{L}^{*}_{0})^{2}\leq\left(\sup_{\pazocal{F}}|(\mathbb{P}-\mathbb{P}_{n})f|\right)^{2}\leq\frac{r^{*}}{64}.$$

Combining this result with (A.40), we have, with probability at least $1-2\delta$,

$$\begin{aligned}\mathscr{E}(\hat{h}_{\textup{MP}})&\leq 2\psi\left(c_{1}\left(\pazocal{V}^{*}\lor r^{*}\right);\frac{\delta}{C_{n}}\right)\\&\leq 2\left(\psi\left(c_{1}\pazocal{V}^{*};\frac{\delta}{C_{n}}\right)\lor\psi\left(c_{1}r^{*};\frac{\delta}{C_{n}}\right)\right)\\&\leq 2\psi\left(c_{1}\pazocal{V}^{*};\frac{\delta}{C_{n}}\right)\lor\frac{c_{1}r^{*}}{8B},\end{aligned}$$

where $c_{1}=\max\{c_{0},16\}$ is an absolute constant, and the last inequality follows from the fact that $c_{1}r^{*}\geq r^{*}$ and the definition of fixed points. This completes the proof of Theorem 2. $\square$

A.5 Estimating the variance-dependent rate from data

In the remark following Theorem 2, we comment that fully data-dependent variance-dependent bounds can be derived by employing an empirical estimate of the unknown parameter $\pazocal{V}^{*}$. Here we present the full details and some discussion of this approach.

Theorem 12 (estimate of the variance-dependent rate from data).

Consider the empirical centered second moment

$$\widehat{\pazocal{V}^{*}}:=\mathbb{P}_{n}\left[(\ell(\hat{h}_{\textup{NMP}};z)-\widehat{\pazocal{L}^{*}_{0}})^{2}\right],$$

where $\widehat{\pazocal{L}^{*}_{0}}\in[-B,B]$ is the preliminary estimate of $\pazocal{L}^{*}$ obtained in the first stage, $\psi$ is defined in Strategy 2, and

$$\hat{h}_{\textup{NMP}}\in\operatorname*{arg\,min}_{\pazocal{H}}\left\{\mathbb{P}_{n}\ell(h;z)-2\psi\left(16\mathbb{P}_{n}\left[(\ell(h;z)-\widehat{\pazocal{L}^{*}_{0}})^{2}\right];\frac{\delta}{C_{n}}\right)\right\}.$$

For any fixed $\delta\in(0,1)$, by performing the moment-penalized estimator in Strategy 2, with probability at least $1-\frac{\delta}{2}$,

$$\mathscr{E}(\hat{h}_{\textup{MP}})\leq 4\psi\left(16\widehat{\pazocal{V}^{*}};\frac{\delta}{C_{n}}\right)\lor\frac{r^{*}}{8B}, \tag{A.41}$$

where $r^{*}$ is the fixed point of $16B\psi(r;\frac{\delta}{C_{n}})$.

Remarks.

1) The subscript “NMP” within $\hat{h}_{\textup{NMP}}$ means “negative moment penalization.” Note that $\hat{h}_{\textup{NMP}}$ may not have good generalization performance; it is only used to compute $\widehat{\pazocal{V}^{*}}$ so that we can evaluate the estimator $\hat{h}_{\textup{MP}}$ proposed in Strategy 2.

2) While the fully data-dependent generalization error bound (A.41) provides a way to evaluate the moment-penalized estimator in Strategy 2 from training data, $\widehat{\pazocal{V}^{*}}$ and $\pazocal{V}^{*}$ are not necessarily of the same order. Therefore, (A.41) may not be as tight as the original variance-dependent rate in Theorem 2, and one should view (A.41) as a relaxation of that rate.

3) We also comment that the “sub-root” assumption in Theorem 2 is not needed here, as we do not discuss the precision of $\widehat{\pazocal{L}^{*}_{0}}$. It is easy to combine Theorem 12 with the guarantee on $\widehat{\pazocal{L}^{*}_{0}}$ proved in Appendix A.4.2.

Proof of Theorem 12:

Define $\hat{f}_{\text{NMP}}$ by $\hat{f}_{\text{NMP}}(z)=\ell(\hat{h}_{\text{NMP}};z)-\ell(h^{*};z)$ for all $z\in\pazocal{Z}$, and $w(h;z)=\ell(h;z)-\widehat{\pazocal{L}^{*}_{0}}$. From the results (A.31), (A.34), and (A.38) in the proof of Theorem 11, we have with probability at least $1-\frac{\delta}{2}$,

$$(\mathbb{P}-\mathbb{P}_{n})f\leq\psi\left(4\mathbb{P}_{n}[f^{2}]\lor r^{*};\frac{\delta}{C_{n}}\right),\quad\forall f\in\pazocal{F} \tag{A.42}$$

and

$$\mathscr{E}(\hat{h}_{\textup{MP}})\leq 2\psi\left(16\mathbb{P}_{n}[w(h^{*};z)^{2}]\lor r^{*};\frac{\delta}{C_{n}}\right). \tag{A.43}$$

From the definition of $\hat{h}_{\textup{NMP}}$,

$$\mathbb{P}_{n}\ell(\hat{h}_{\text{NMP}};z)-2\psi\left(16\mathbb{P}_{n}[w(\hat{h}_{\text{NMP}};z)^{2}];\frac{\delta}{C_{n}}\right)\leq\mathbb{P}_{n}\ell(h^{*};z)-2\psi\left(16\mathbb{P}_{n}[w(h^{*};z)^{2}];\frac{\delta}{C_{n}}\right). \tag{A.44}$$

Therefore, with probability at least $1-\frac{\delta}{2}$, we have

$$\begin{aligned}&2\psi\left(16\mathbb{P}_{n}[w(h^{*};z)^{2}];\frac{\delta}{C_{n}}\right)\\&\leq 2\psi\left(16\mathbb{P}_{n}[w(\hat{h}_{\text{NMP}};z)^{2}];\frac{\delta}{C_{n}}\right)+\mathbb{P}_{n}\ell(h^{*};z)-\mathbb{P}_{n}\ell(\hat{h}_{\text{NMP}};z)\\&=2\psi\left(16\mathbb{P}_{n}[w(\hat{h}_{\text{NMP}};z)^{2}];\frac{\delta}{C_{n}}\right)+\mathbb{P}[\ell(h^{*};z)-\ell(\hat{h}_{\text{NMP}};z)]+(\mathbb{P}_{n}-\mathbb{P})[\ell(h^{*};z)-\ell(\hat{h}_{\text{NMP}};z)]\\&\leq 2\psi\left(16\mathbb{P}_{n}[w(\hat{h}_{\text{NMP}};z)^{2}];\frac{\delta}{C_{n}}\right)+(\mathbb{P}-\mathbb{P}_{n})\hat{f}_{\text{NMP}}\\&\leq 2\psi\left(16\mathbb{P}_{n}[w(\hat{h}_{\text{NMP}};z)^{2}];\frac{\delta}{C_{n}}\right)+\psi\left(4\mathbb{P}_{n}[\hat{f}_{\text{NMP}}^{2}];\frac{\delta}{C_{n}}\right),\end{aligned} \tag{A.45}$$

where the first inequality is due to (A.44), the second inequality is due to the fact that $h^{*}$ minimizes the population risk, and the last inequality is due to (A.42).

Note that

$$\begin{aligned}4\mathbb{P}_{n}[\hat{f}_{\text{NMP}}^{2}]&\leq 8\mathbb{P}_{n}[w(\hat{h}_{\text{NMP}};z)^{2}]+8\mathbb{P}_{n}[w(h^{*};z)^{2}]\\&\leq 16\mathbb{P}_{n}[w(\hat{h}_{\text{NMP}};z)^{2}]\lor 16\mathbb{P}_{n}[w(h^{*};z)^{2}].\end{aligned}$$

From the above inequality and (A.45), with probability at least $1-\frac{\delta}{2}$, we have

$$\begin{aligned}&2\psi\left(16\mathbb{P}_{n}[w(h^{*};z)^{2}];\frac{\delta}{C_{n}}\right)\\&\leq 2\psi\left(16\mathbb{P}_{n}[w(\hat{h}_{\text{NMP}};z)^{2}];\frac{\delta}{C_{n}}\right)+\psi\left(16\mathbb{P}_{n}[w(\hat{h}_{\text{NMP}};z)^{2}];\frac{\delta}{C_{n}}\right)\lor\psi\left(16\mathbb{P}_{n}[w(h^{*};z)^{2}];\frac{\delta}{C_{n}}\right).\end{aligned} \tag{A.46}$$

Whether $\mathbb{P}_{n}[w(h^{*};z)^{2}]\leq\mathbb{P}_{n}[w(\hat{h}_{\text{NMP}};z)^{2}]$ or $\mathbb{P}_{n}[w(h^{*};z)^{2}]>\mathbb{P}_{n}[w(\hat{h}_{\text{NMP}};z)^{2}]$, the inequality (A.46) always implies

$$\psi\left(16\mathbb{P}_{n}[w(h^{*};z)^{2}];\frac{\delta}{C_{n}}\right)\leq 2\psi\left(16\mathbb{P}_{n}[w(\hat{h}_{\text{NMP}};z)^{2}];\frac{\delta}{C_{n}}\right)=2\psi\left(16\widehat{\pazocal{V}^{*}};\frac{\delta}{C_{n}}\right). \tag{A.47}$$

(Note that $\widehat{\pazocal{V}^{*}}:=\mathbb{P}_{n}[w(\hat{h}_{\text{NMP}};z)^{2}]$.) We conclude that with probability at least $1-\frac{\delta}{2}$,

$$\begin{aligned}\mathscr{E}(\hat{h}_{\textup{MP}})&\leq 2\psi\left(16\mathbb{P}_{n}[w(h^{*};z)^{2}]\lor r^{*};\frac{\delta}{C_{n}}\right)\\&=2\psi\left(16\mathbb{P}_{n}[w(h^{*};z)^{2}];\frac{\delta}{C_{n}}\right)\lor 2\psi\left(r^{*};\frac{\delta}{C_{n}}\right)\\&\leq 4\psi\left(16\widehat{\pazocal{V}^{*}};\frac{\delta}{C_{n}}\right)\lor\frac{r^{*}}{8B},\end{aligned}$$

where the first inequality is due to (A.43) and the last inequality is due to (A.47). This completes the proof. $\square$
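The case analysis leading to (A.47) can be written out explicitly. With the shorthand $a^{*}=\psi(16\mathbb{P}_{n}[w(h^{*};z)^{2}];\frac{\delta}{C_{n}})$ and $\hat{a}=\psi(16\mathbb{P}_{n}[w(\hat{h}_{\text{NMP}};z)^{2}];\frac{\delta}{C_{n}})$ (notation introduced here for brevity), inequality (A.46) reads $2a^{*}\leq 2\hat{a}+(\hat{a}\lor a^{*})$, and the two cases are:

```latex
\begin{align*}
\text{Case 1: } a^{*}\leq\hat{a}
  &\;\Longrightarrow\; a^{*}\leq\hat{a}\leq 2\hat{a}, \\
\text{Case 2: } a^{*}>\hat{a}
  &\;\Longrightarrow\; 2a^{*}\leq 2\hat{a}+a^{*}
   \;\Longrightarrow\; a^{*}\leq 2\hat{a}.
\end{align*}
```

In both cases $a^{*}\leq 2\hat{a}$, which is (A.47); since $\psi$ is non-decreasing in its first argument, comparing the $\psi$ values covers the comparison of the empirical second moments.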

A.6 Auxiliary lemmata

Lemma 5 (Talagrand’s concentration inequality for empirical processes, [3]).

Let F\pazocal{F} be a class of functions that map Z\pazocal{Z} into [B1,B2][B_{1},B_{2}]. Assume that there is some r>0r>0 such that for every fFf\in\pazocal{F}, Var[f(zi)]r\textup{Var}[f(z_{i})]\leq r. Then, for every δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta,

supfF(n)f3F+2rlog1δn+(B2B1)log1δn,\displaystyle\sup_{f\in\pazocal{F}}(\mathbb{P}-\mathbb{P}_{n})f\leq 3\mathfrak{R}\pazocal{F}+\sqrt{\frac{2r\log\frac{1}{\delta}}{n}}+(B_{2}-B_{1})\frac{\log\frac{1}{\delta}}{n},

and with probability at least 1δ1-\delta,

supfF(n)f4nF+2rlog2δn+92(B2B1)log2δn.\displaystyle\sup_{f\in\pazocal{F}}(\mathbb{P}-\mathbb{P}_{n})f\leq 4\mathfrak{R}_{n}\pazocal{F}+\sqrt{\frac{2r\log\frac{2}{\delta}}{n}}+\frac{9}{2}(B_{2}-B_{1})\frac{\log\frac{2}{\delta}}{n}.

Moreover, the same results hold for the quantity $\sup_{f\in\pazocal{F}}(\mathbb{P}_{n}-\mathbb{P})f$.

Lemma 6 (Bernstein’s inequality, [11]).

Let X1,,XnX_{1},\cdots,X_{n} be real-valued, independent, mean-zero random variables and suppose that for some constants σ,B>0\sigma,B>0,

1ni=1n𝔼|Xi|kk!2σ2Bk2,k=2,3,\displaystyle\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}|X_{i}|^{k}\leq\frac{k!}{2}\sigma^{2}B^{k-2},\quad k=2,3,\cdots

Then, δ(0,1)\forall\delta\in(0,1), with probability at least 1δ1-\delta

|1ni=1nXi|2σ2log2δn+Blog2δn.\displaystyle\left|\frac{1}{n}\sum_{i=1}^{n}X_{i}\right|\leq\sqrt{\frac{2\sigma^{2}\log\frac{2}{\delta}}{n}}+\frac{B\log\frac{2}{\delta}}{n}. (A.48)
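As an informal sanity check of (A.48), the sketch below simulates averages of i.i.d. Uniform[-1, 1] variables, which are bounded by 1 and have variance 1/3, so the moment condition of Lemma 6 holds with $\sigma^{2}=1/3$ and $B=1$. The distribution, sample size, and number of trials are illustrative assumptions and are not taken from the paper.

```python
import math
import random

def bernstein_radius(sigma2, B, n, delta):
    """Right-hand side of (A.48) for given sigma^2, B, n and delta."""
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * sigma2 * log_term / n) + B * log_term / n

def empirical_failure_rate(n, delta, trials, seed=0):
    """Fraction of trials in which |sample mean| exceeds the Bernstein radius."""
    rng = random.Random(seed)
    sigma2, B = 1.0 / 3.0, 1.0  # Uniform[-1, 1]: variance 1/3, |X| <= 1 (assumed example)
    radius = bernstein_radius(sigma2, B, n, delta)
    failures = 0
    for _ in range(trials):
        mean = sum(rng.uniform(-1.0, 1.0) for _ in range(n)) / n
        if abs(mean) > radius:
            failures += 1
    return failures / trials

rate = empirical_failure_rate(n=200, delta=0.05, trials=2000)
print(rate)  # should not exceed delta; the bound is typically conservative
```

The observed failure frequency is expected to fall well below $\delta$, since Bernstein's inequality is not tight for this distribution.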

Appendix B Proofs for Section 5, Section 6 and Section 7

B.1 Proof of Lemma 2

Fix $u\in\pazocal{B}^{d}(0,1)$ and $\theta_{1},\theta_{2}\in\Theta$. Then we have

\begin{align*}
u^{T}(\nabla\ell(\theta_{1};z)-\nabla\ell(\theta_{2};z))=\int_{0}^{1}u^{T}[\nabla^{2}\ell(\theta_{2}+v(\theta_{1}-\theta_{2});z)](\theta_{1}-\theta_{2})\,dv.
\end{align*}

By Jensen’s inequality,

\begin{align*}
\exp\left(\frac{u^{T}(\nabla\ell(\theta_{1};z)-\nabla\ell(\theta_{2};z))}{\beta\|\theta_{1}-\theta_{2}\|}\right)&=\exp\left(\int_{0}^{1}\frac{u^{T}[\nabla^{2}\ell(\theta_{2}+v(\theta_{1}-\theta_{2});z)](\theta_{1}-\theta_{2})}{\beta\|\theta_{1}-\theta_{2}\|}\,dv\right)\\
&\leq\int_{0}^{1}\exp\left(\frac{u^{T}[\nabla^{2}\ell(\theta_{2}+v(\theta_{1}-\theta_{2});z)](\theta_{1}-\theta_{2})}{\beta\|\theta_{1}-\theta_{2}\|}\right)dv.
\end{align*}

It is then straightforward to prove the lemma by taking expectation with respect to zz in the above inequality and using the condition (5.2). \square

B.2 Proof of Proposition 2

Take V={vd:vmax{ΔM,1n}}V=\{v\in\mathbb{R}^{d}:\|v\|\leq\max\{\Delta_{M},\frac{1}{n}\}\}. We will first prove a “uniform localized convergence” argument over all θΘ\theta\in\Theta and vVv\in V.

Proposition 5 (directional “uniform localized convergence” of gradient).

Under Assumption 1, $\forall\delta\in(0,1)$, with probability at least $1-\delta$, for all $\theta\in\Theta$, $v\in V$, either $\|\theta-\theta^{*}\|^{2}+\|v\|^{2}\leq\frac{2}{n^{2}}$, or

\begin{align*}
&(\mathbb{P}-\mathbb{P}_{n})\left[(\nabla\ell(\theta;z)-\nabla\ell(\theta^{*};z))^{T}v\right]\\
&\leq c_{1}\beta\max\left\{\|\theta-\theta^{*}\|^{2}+\|v\|^{2},\frac{2}{n^{2}}\right\}\left(\sqrt{\frac{d+\log\frac{2\log_{2}(2n^{2}\Delta_{M}^{2}+2)}{\delta}}{n}}+\frac{d+\log\frac{2\log_{2}(2n^{2}\Delta_{M}^{2}+2)}{\delta}}{n}\right),
\end{align*}

where c1c_{1} is an absolute constant.

Proof of Proposition 5:

For $(\theta,v)\in\Theta\times V$, let $g_{(\theta,v)}=(\nabla\ell(\theta;z)-\nabla\ell(\theta^{*};z))^{T}v$. For $(\theta_{1},v_{1}),(\theta_{2},v_{2})\in\Theta\times V$, define the norm on the product space $\Theta\times V$ as

(θ1,v1)(θ2,v2)pro=θ1θ22+v1v22.\displaystyle\|(\theta_{1},v_{1})-(\theta_{2},v_{2})\|_{\textup{pro}}=\sqrt{\|\theta_{1}-\theta_{2}\|^{2}+\|v_{1}-v_{2}\|^{2}}. (B.1)

Denote B(r):={(θ,v)Θ×V:θθ2+v2r}\pazocal{B}(\sqrt{r}):=\{(\theta,v)\in\Theta\times V:\|\theta-\theta^{*}\|^{2}+\|v\|^{2}\leq r\}. Given (θ1,v1),(θ2,v2)B(r)(\theta_{1},v_{1}),(\theta_{2},v_{2})\in\pazocal{B}(\sqrt{r}), we perform the following re-arrangement and decomposition steps:

g(θ1,v1)(z)g(θ2,v2)(z)\displaystyle g_{(\theta_{1},v_{1})}(z)-g_{(\theta_{2},v_{2})}(z)
=((θ1;z)(θ;z))Tv1((θ2;z)(θ;z))Tv2\displaystyle=(\nabla\ell(\theta_{1};z)-\nabla\ell(\theta^{*};z))^{T}v_{1}-(\nabla\ell(\theta_{2};z)-\nabla\ell(\theta^{*};z))^{T}v_{2}
=((θ1;z)(θ;z))T(v1v2)+((θ1;z)(θ;z))Tv2+((θ;z)(θ2;z))Tv2\displaystyle=(\nabla\ell(\theta_{1};z)-\nabla\ell(\theta^{*};z))^{T}(v_{1}-v_{2})+(\nabla\ell(\theta_{1};z)-\nabla\ell(\theta^{*};z))^{T}v_{2}+(\nabla\ell(\theta^{*};z)-\nabla\ell(\theta_{2};z))^{T}v_{2}
=((θ1;z)(θ;z))T(v1v2)+((θ1;z)(θ2;z))Tv2\displaystyle=(\nabla\ell(\theta_{1};z)-\nabla\ell(\theta^{*};z))^{T}(v_{1}-v_{2})+(\nabla\ell(\theta_{1};z)-\nabla\ell(\theta_{2};z))^{T}v_{2} (B.2)

When (θ1,v1),(θ2,v2)B(r)(\theta_{1},v_{1}),(\theta_{2},v_{2})\in\pazocal{B}(\sqrt{r}), we have

θ1θv1v2rv1v2r(θ1,v1)(θ2,v2)pro,\displaystyle\|\theta_{1}-\theta^{*}\|\|v_{1}-v_{2}\|\leq\sqrt{r}\|v_{1}-v_{2}\|\leq\sqrt{r}\|(\theta_{1},v_{1})-(\theta_{2},v_{2})\|_{\textup{pro}},

so from Assumption 1, ((θ1;z)(θ;z))T(v1v2)(\nabla\ell(\theta_{1};z)-\nabla\ell(\theta^{*};z))^{T}(v_{1}-v_{2}) is βr(θ1,v1)(θ2,v2)pro\beta\sqrt{r}\|(\theta_{1},v_{1})-(\theta_{2},v_{2})\|_{\textup{pro}}-sub-exponential. Similarly, we can prove ((θ1;z)(θ2;z))Tv2(\nabla\ell(\theta_{1};z)-\nabla\ell(\theta_{2};z))^{T}v_{2} to be βr(θ1,v1)(θ2,v2)pro\beta\sqrt{r}\|(\theta_{1},v_{1})-(\theta_{2},v_{2})\|_{\textup{pro}}-sub-exponential. From the decomposition (B.2) and Jensen’s inequality, for all (θ1,v1),(θ2,v2)B(r)(\theta_{1},v_{1}),(\theta_{2},v_{2})\in\pazocal{B}(\sqrt{r}), we have

exp(g(θ1,v1)(z)g(θ2,v2)(z)2βr(θ1,v1)(θ2,v2)pro)\displaystyle\exp\left(\frac{g_{(\theta_{1},v_{1})}(z)-g_{(\theta_{2},v_{2})}(z)}{2\beta\sqrt{r}\|(\theta_{1},v_{1})-(\theta_{2},v_{2})\|_{\textup{pro}}}\right)
12exp(((θ1;z)(θ;z))T(v1v2)βr(θ1,v1)(θ2,v2)pro)+12exp(((θ1;z)(θ2;z))Tv2βr(θ1,v1)(θ2,v2)pro).\displaystyle\leq\frac{1}{2}\exp\left(\frac{(\nabla\ell(\theta_{1};z)-\nabla\ell(\theta^{*};z))^{T}(v_{1}-v_{2})}{\beta\sqrt{r}\|(\theta_{1},v_{1})-(\theta_{2},v_{2})\|_{\textup{pro}}}\right)+\frac{1}{2}\exp\left(\frac{(\nabla\ell(\theta_{1};z)-\nabla\ell(\theta_{2};z))^{T}v_{2}}{\beta\sqrt{r}\|(\theta_{1},v_{1})-(\theta_{2},v_{2})\|_{\textup{pro}}}\right).

By taking expectation with respect to zz in the above inequality, we prove that g(θ1,v1)(z)g(θ2,v2)(z)g_{(\theta_{1},v_{1})}(z)-g_{(\theta_{2},v_{2})}(z) is a 2βr(θ1,v1)(θ2,v2)pro2\beta\sqrt{r}\|(\theta_{1},v_{1})-(\theta_{2},v_{2})\|_{\textup{pro}}-sub-exponential random variable, i.e.,

g(θ1,v1)(z)g(θ2,v2)(z)Orlicz12βr(θ1,v1)(θ2,v2)pro.\displaystyle\|g_{(\theta_{1},v_{1})}(z)-g_{(\theta_{2},v_{2})}(z)\|_{\text{Orlicz}_{1}}\leq 2\beta\sqrt{r}\|(\theta_{1},v_{1})-(\theta_{2},v_{2})\|_{\textup{pro}}.

From Bernstein’s inequality for sub-exponential variables (Lemma 10), for any fixed $u\geq 0$ and $(\theta_{1},v_{1}),(\theta_{2},v_{2})\in\Theta\times V$,

\begin{align*}
\text{Prob}\bigg\{\left|(\mathbb{P}-\mathbb{P}_{n})[g_{(\theta_{1},v_{1})}(z)-g_{(\theta_{2},v_{2})}(z)]\right|\geq{}&2\beta\sqrt{r}\|(\theta_{1},v_{1})-(\theta_{2},v_{2})\|_{\textup{pro}}\sqrt{\frac{2}{n}}\sqrt{u}\\
&+\frac{2\beta\sqrt{r}\|(\theta_{1},v_{1})-(\theta_{2},v_{2})\|_{\textup{pro}}}{n}u\bigg\}\leq 2\exp(-u).
\end{align*}

The above inequality implies that the empirical process $(\mathbb{P}-\mathbb{P}_{n})g_{(\theta,v)}$ has mixed sub-Gaussian-sub-exponential increments with respect to the metrics $\left(\frac{2\beta\sqrt{r}}{n}\|\cdot\|_{\textup{pro}},\frac{2\sqrt{2}\beta\sqrt{r}}{\sqrt{n}}\|\cdot\|_{\textup{pro}}\right)$ (see Definition 8).

From Lemma 13, there exists an absolute constant $C$ such that $\forall\delta\in(0,1)$, with probability at least $1-\delta$,

\begin{align*}
\sup_{\|\theta-\theta^{*}\|^{2}+\|v\|^{2}\leq r}(\mathbb{P}-\mathbb{P}_{n})g_{(\theta,v)}\leq C\Bigg(&\gamma_{2}\left(\pazocal{B}(\sqrt{r}),\frac{2\sqrt{2}\beta\sqrt{r}}{\sqrt{n}}\|\cdot\|_{\textup{pro}}\right)+\gamma_{1}\left(\pazocal{B}(\sqrt{r}),\frac{2\beta\sqrt{r}}{n}\|\cdot\|_{\textup{pro}}\right)\\
&+\beta r\sqrt{\frac{\log\frac{1}{\delta}}{n}}+\beta r\frac{\log\frac{1}{\delta}}{n}\Bigg).
\end{align*}

Using Dudley’s integral (Lemma 12) to bound the $\gamma_{1}$ functional and the $\gamma_{2}$ functional, we obtain that there exists an absolute constant $c_{1}$ such that $\forall\delta\in(0,1)$, with probability at least $1-\delta$,

\begin{align}
\sup_{\|\theta-\theta^{*}\|^{2}+\|v\|^{2}\leq r}\left|(\mathbb{P}-\mathbb{P}_{n})g_{(\theta,v)}\right|\leq c_{1}\beta r\left(\sqrt{\frac{d+\log\frac{1}{\delta}}{n}}+\frac{d+\log\frac{1}{\delta}}{n}\right). \tag{B.3}
\end{align}

We set

ψ(r;δ)=c1βr(d+log1δn+d+log1δn).\displaystyle\psi(r;\delta)=c_{1}\beta r\left(\sqrt{\frac{d+\log\frac{1}{\delta}}{n}}+\frac{d+\log\frac{1}{\delta}}{n}\right).

Denote R=2(ΔM2+1n2)R=2(\Delta_{M}^{2}+\frac{1}{n^{2}}) and r0=2n2r_{0}=\frac{2}{n^{2}}. Since VV is a dd-dimensional ball centered at the origin with radius max{ΔM,1n}\max\{\Delta_{M},\frac{1}{n}\}, we know that θθ2+v22ΔM2+1n2R\|\theta-\theta^{*}\|^{2}+\|v\|^{2}\leq 2\Delta_{M}^{2}+\frac{1}{n^{2}}\leq R. We apply Proposition 4 and obtain: for any fixed δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta, for all θΘ\theta\in\Theta and vVv\in V,

(n)[((θ;z)(θ;z))Tv]=(n)g(θ,v)\displaystyle(\mathbb{P}-\mathbb{P}_{n})\left[(\nabla\ell(\theta;z)-\nabla\ell(\theta^{*};z))^{T}v\right]=(\mathbb{P}-\mathbb{P}_{n})g_{(\theta,v)}
ψ(max{θθ2+v2,2n2};δ2log2(2R/2n2))\displaystyle\leq\psi\left(\max\left\{\|\theta-\theta^{*}\|^{2}+\|v\|^{2},\frac{2}{n^{2}}\right\};\frac{\delta}{2\log_{2}({2R}/{\frac{2}{n^{2}}})}\right)
=c1βmax{θθ2+v2,2n2}(d+log2log2(n2R)δn+d+log2log2(n2R)δn).\displaystyle=c_{1}\beta\max\left\{\|\theta-\theta^{*}\|^{2}+\|v\|^{2},\frac{2}{n^{2}}\right\}\left(\sqrt{\frac{d+\log\frac{2\log_{2}(n^{2}R)}{\delta}}{n}}+\frac{d+\log\frac{{2\log_{2}(n^{2}R)}}{\delta}}{n}\right).

This completes the proof of Proposition 5. \square

Proof of Proposition 2:

in order to uniformly bound (n)((θ;z)(θ;z))\|(\mathbb{P}-\mathbb{P}_{n})(\nabla\ell(\theta;z)-\nabla\ell(\theta^{*};z))\| for all θΘ\theta\in\Theta, we take

v=max{θθ,1n}(n)((θ;z)(θ;z))(n)((θ;z)(θ;z))v=\max\left\{\|\theta-\theta^{*}\|,\frac{1}{n}\right\}\cdot\frac{(\mathbb{P}-\mathbb{P}_{n})(\nabla\ell(\theta;z)-\nabla\ell(\theta^{*};z))}{\|(\mathbb{P}-\mathbb{P}_{n})(\nabla\ell(\theta;z)-\nabla\ell(\theta^{*};z))\|}

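in Proposition 5. Clearly $\|v\|=\max\{\|\theta-\theta^{*}\|,\frac{1}{n}\}$, so the left-hand side in Proposition 5 equals $\|v\|\cdot\|(\mathbb{P}-\mathbb{P}_{n})(\nabla\ell(\theta;z)-\nabla\ell(\theta^{*};z))\|$, while $\max\{\|\theta-\theta^{*}\|^{2}+\|v\|^{2},\frac{2}{n^{2}}\}\leq 2\|v\|^{2}$. Dividing both sides by $\|v\|$, we obtain from Proposition 5 that there exists an absolute constant $c$ such that $\forall\delta\in(0,1)$, with probability at least $1-\delta$, for all $\theta\in\Theta$,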

(n)((θ;z)(θ;z))\displaystyle\|(\mathbb{P}-\mathbb{P}_{n})(\nabla\ell(\theta;z)-\nabla\ell(\theta^{*};z))\|
cβmax{θθ,1n}(d+log2log2(2n2ΔM2+2)δn+d+log2log2(2n2ΔM2+2)δn)\displaystyle\leq c\beta\max\left\{\|\theta-\theta^{*}\|,\frac{1}{n}\right\}\left(\sqrt{\frac{d+\log\frac{2\log_{2}(2n^{2}\Delta_{M}^{2}+2)}{\delta}}{n}}+\frac{d+\log\frac{2\log_{2}(2n^{2}\Delta_{M}^{2}+2)}{\delta}}{n}\right)
cβmax{θθ,1n}(d+log4log2(2nΔM+2)δn+d+log4log2(2nΔM+2)δn).\displaystyle\leq c\beta\max\left\{\|\theta-\theta^{*}\|,\frac{1}{n}\right\}\left(\sqrt{\frac{d+\log\frac{4\log_{2}(2n\Delta_{M}+2)}{\delta}}{n}}+\frac{d+\log\frac{4\log_{2}(2n\Delta_{M}+2)}{\delta}}{n}\right).

This completes the proof of Proposition 2.

\square
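The last inequality in the display above replaces $2\log_{2}(2n^{2}\Delta_{M}^{2}+2)$ by $4\log_{2}(2n\Delta_{M}+2)$. A short worked justification, written here as a sketch of the elementary step that the proof leaves implicit:

```latex
\begin{align*}
2n^{2}\Delta_{M}^{2}+2
  &\leq 4n^{2}\Delta_{M}^{2}+8n\Delta_{M}+4
   =(2n\Delta_{M}+2)^{2},\\
\log_{2}(2n^{2}\Delta_{M}^{2}+2)
  &\leq 2\log_{2}(2n\Delta_{M}+2),\\
\log\frac{2\log_{2}(2n^{2}\Delta_{M}^{2}+2)}{\delta}
  &\leq\log\frac{4\log_{2}(2n\Delta_{M}+2)}{\delta}.
\end{align*}
```

The first line only uses $n\Delta_{M}\geq 0$; the remaining lines follow from the monotonicity of the logarithm.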

B.3 Proof of Theorem 3

We first prove a proposition on the uniform localized convergence of gradients under Assumption 1 and Assumption 2.

Proposition 6 (uniform localized convergence of gradients).

Let Assumption 1 and Assumption 2 hold along with the optimality condition $\mathbb{P}\nabla\ell(\theta^{*};z)=0$. Given $\delta\in(0,1)$, denote

\begin{align*}
\textup{term I}&:=\sqrt{\frac{2\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{n}}+\frac{G_{*}\log\frac{4}{\delta}}{n},\\
\textup{term II}&:=\sqrt{\frac{d+\log\frac{8\log_{2}(2n\Delta_{M}+2)}{\delta}}{n}}+\frac{d+\log\frac{8\log_{2}(2n\Delta_{M}+2)}{\delta}}{n}.
\end{align*}

Then with probability at least 1δ1-\delta, we have the following:

(n)(θ;z)term I,\displaystyle\|(\mathbb{P}_{n}-\mathbb{P})\nabla\ell(\theta^{*};z)\|\leq\textup{term I}, (B.4)

and

(n)(θ;z)term I+c0βmax{1n,θθ}term II,θΘ,\displaystyle\|(\mathbb{P}_{n}-\mathbb{P})\nabla\ell(\theta;z)\|\leq{\textup{term I}}+c_{0}\beta\max\left\{\frac{1}{n},\|\theta-\theta^{*}\|\right\}\cdot\textup{term II},\quad\forall\theta\in\Theta, (B.5)

where c0c_{0} is an absolute constant.

Proof of Proposition 6:

From Proposition 2 (applied with $\frac{\delta}{2}$ in place of $\delta$) and the triangle inequality, there exists an absolute constant $c_{0}$ such that, with probability at least $1-\frac{\delta}{2}$, for all $\theta\in\Theta$,

(n)(θ;z)\displaystyle\|(\mathbb{P}_{n}-\mathbb{P})\nabla\ell(\theta;z)\|
(n)(θ;z)+c0βmax{θθ,1n}term II.\displaystyle\leq\|(\mathbb{P}_{n}-\mathbb{P})\nabla\ell(\theta^{*};z)\|+c_{0}\beta\max\left\{\|\theta-\theta^{*}\|,\frac{1}{n}\right\}\cdot\textup{term II}. (B.6)

From Bernstein’s inequality for vectors (Lemma 11), we have with probability at least $1-\frac{\delta}{2}$,

\begin{align}
\|\mathbb{P}\nabla\ell(\theta^{*};z)-\mathbb{P}_{n}\nabla\ell(\theta^{*};z)\|\leq\sqrt{\frac{2\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{n}}+\frac{G_{*}\log\frac{4}{\delta}}{n}=\textup{term I}. \tag{B.7}
\end{align}

Combining (B.6) and (B.7) by a union bound, we complete the proof of Proposition 6. \square

Let FF be β\beta-smooth, i.e., for all θ1,θ2Θ\theta_{1},\theta_{2}\in\Theta,

F(θ1)F(θ2)βθ1θ2.\displaystyle\|\nabla F(\theta_{1})-\nabla F(\theta_{2})\|\leq\beta\|\theta_{1}-\theta_{2}\|.

Under this condition and the PL condition, the population risk minimizer θ\theta^{*} will be the unique minimizer of FF on Θ\Theta. We first present the following lemma.

Lemma 7 (relationship between curvature conditions).

For a β\beta-smooth function FF, consider the following conditions:

1. Strong convexity (SC): for all θ1,θ2Θ\theta_{1},\theta_{2}\in\Theta we have

F(θ1)F(θ2)+F(θ2)T(θ1θ2)+μ2θ1θ22.\displaystyle F(\theta_{1})\geq F(\theta_{2})+\nabla F(\theta_{2})^{T}(\theta_{1}-\theta_{2})+\frac{\mu}{2}\|\theta_{1}-\theta_{2}\|^{2}.

2. Polyak-Łojasiewicz (PL): for all $\theta\in\Theta$ we have

F(θ)F(θ)12μF(θ)2.\displaystyle F(\theta)-F(\theta^{*})\leq\frac{1}{2\mu}\|\nabla F(\theta)\|^{2}.

3. Error Bound (EB): for all θΘ\theta\in\Theta we have

F(θ)μθθ.\displaystyle\|\nabla F(\theta)\|\geq\mu\|\theta-\theta^{*}\|.

4. Quadratic Growth (QG): for all θΘ\theta\in\Theta we have

F(θ)F(θ)μ2θθ2.\displaystyle F(\theta)-F(\theta^{*})\geq\frac{\mu}{2}\|\theta-\theta^{*}\|^{2}.

Then, the following hold:

$(\text{SC})\implies(\text{PL})\implies(\text{EB})\implies(\text{QG})$,

(EB) with parameter $\mu$ $\implies$ (PL) with parameter $\frac{\mu^{2}}{\beta}$.

The proof of Lemma 7 can be adapted from [24, Appendix A]. Note that some parameters in the original statements in [24] contain typos, though the proof ideas are correct; Lemma 7 states the corrected parameters. As argued in [24], (PL) and the equivalent (QG) (under the smoothness condition and a change of parameters) are the most general conditions that allow linear convergence to a global minimizer.
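To make the four conditions of Lemma 7 concrete, the following sketch checks them numerically on a diagonal quadratic $F(\theta)=\frac{1}{2}\sum_{i}a_{i}\theta_{i}^{2}$ with $\theta^{*}=0$, for which $\mu=\min_{i}a_{i}$ and $\beta=\max_{i}a_{i}$. The quadratic, the coefficients, and the sampling range are illustrative assumptions, not objects from the paper; a strongly convex example is used only because every condition can then be verified by hand.

```python
import random

# Diagonal quadratic F(theta) = 0.5 * sum_i a_i * theta_i^2, minimized at theta* = 0.
A = [0.5, 1.0, 4.0]           # hypothetical curvatures
MU, BETA = min(A), max(A)     # strong convexity and smoothness parameters

def F(theta):
    return 0.5 * sum(a * t * t for a, t in zip(A, theta))

def grad(theta):
    return [a * t for a, t in zip(A, theta)]

def sq_norm(v):
    return sum(x * x for x in v)

def conditions_hold(theta1, theta2, tol=1e-9):
    """Check SC, PL, EB, QG of Lemma 7, plus 'EB with mu => PL with mu^2/beta'."""
    diff = [s - t for s, t in zip(theta1, theta2)]
    inner = sum(g * d for g, d in zip(grad(theta2), diff))
    g1 = sq_norm(grad(theta1))
    sc = F(theta1) >= F(theta2) + inner + 0.5 * MU * sq_norm(diff) - tol
    pl = F(theta1) <= g1 / (2.0 * MU) + tol            # F(theta*) = 0 here
    eb = g1 >= MU ** 2 * sq_norm(theta1) - tol         # squared form of EB
    qg = F(theta1) >= 0.5 * MU * sq_norm(theta1) - tol
    pl_from_eb = F(theta1) <= g1 / (2.0 * MU ** 2 / BETA) + tol
    return sc and pl and eb and qg and pl_from_eb

rng = random.Random(0)
points = [[rng.uniform(-5, 5) for _ in A] for _ in range(2000)]
all_hold = all(conditions_hold(p, q) for p, q in zip(points, points[1:]))
print(all_hold)
```

The check confirms consistency of the stated parameters on this example; it does not illustrate the gap between (PL) and (SC), which only appears for non-convex functions.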

We now prove Theorem 3.

Proof of Theorem 3:

We prove the results on the event

\begin{align*}
\mathscr{A}:=\{\text{the results (B.4) and (B.5) in Proposition 6 hold true}\},
\end{align*}

whose measure is at least $1-\delta$. We keep the notations “term I” and “term II” used in Proposition 6, which are defined by

\begin{align*}
\textup{term I}&:=\sqrt{\frac{2\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{n}}+\frac{G_{*}\log\frac{4}{\delta}}{n},\\
\textup{term II}&:=\sqrt{\frac{d+\log\frac{8\log_{2}(2n\Delta_{M}+2)}{\delta}}{n}}+\frac{d+\log\frac{8\log_{2}(2n\Delta_{M}+2)}{\delta}}{n}.
\end{align*}

The PL condition (Assumption 3) implies that (θ;z)=0\mathbb{P}\nabla\ell(\theta^{*};z)=0. From the result (B.4) in Proposition 6,

n(θ;z)=(n)(θ;z)term I.\displaystyle\|\mathbb{P}_{n}\nabla\ell(\theta^{*};z)\|=\|(\mathbb{P}_{n}-\mathbb{P})\nabla\ell(\theta^{*};z)\|\leq\textup{term I}.

So we know that the inequality

\begin{align}
\|\mathbb{P}_{n}\nabla\ell(\theta;z)\|\leq\textup{term I} \tag{B.8}
\end{align}

must have a solution within $\Theta$ (for instance, $\theta^{*}$ itself).

The result (B.5) implies that for all $\theta\in\Theta$ such that $\|\theta-\theta^{*}\|\geq\frac{1}{n}$,

\begin{align*}
\|\mathbb{P}\nabla\ell(\theta;z)\|\leq\|\mathbb{P}_{n}\nabla\ell(\theta;z)\|+\textup{term I}+c_{0}\beta\|\theta-\theta^{*}\|\cdot\textup{term II}.
\end{align*}

Since the PL condition implies (see Lemma 7)

\begin{align*}
\|\mathbb{P}\nabla\ell(\theta;z)\|\geq\mu\|\theta-\theta^{*}\|,
\end{align*}

for all $\theta\in\Theta$ such that $\|\theta-\theta^{*}\|\geq\frac{1}{n}$, we have

\begin{align*}
\mu\|\theta-\theta^{*}\|\leq\|\mathbb{P}\nabla\ell(\theta;z)\|\leq\|\mathbb{P}_{n}\nabla\ell(\theta;z)\|+\textup{term I}+c_{0}\beta\|\theta-\theta^{*}\|\cdot\textup{term II},
\end{align*}

where $c_{0}$ is an absolute constant. Therefore, for all $\theta\in\Theta$, we must have

\begin{align}
\mu\|\theta-\theta^{*}\|\leq\|\mathbb{P}\nabla\ell(\theta;z)\|\leq\|\mathbb{P}_{n}\nabla\ell(\theta;z)\|+\textup{term I}+c_{0}\beta\|\theta-\theta^{*}\|\cdot\textup{term II}+\frac{\mu}{n}. \tag{B.9}
\end{align}

Let $\hat{\theta}\in\Theta$ be an arbitrary solution that satisfies (B.8). From (B.9), we obtain the inequalities for $\|\hat{\theta}-\theta^{*}\|$:

\begin{align}
\mu\|\hat{\theta}-\theta^{*}\|\leq\|\mathbb{P}\nabla\ell(\hat{\theta};z)\|\leq 2\cdot\textup{term I}+c_{0}\beta\cdot\textup{term II}\cdot\|\hat{\theta}-\theta^{*}\|+\frac{\mu}{n}. \tag{B.10}
\end{align}

Let c=max{4c02,1}c=\max\{4c_{0}^{2},1\}. When

ncβ2(d+log4log(2nΔM+1)δ)μ2,\displaystyle n\geq\frac{c\beta^{2}(d+\log\frac{4\log(2n\Delta_{M}+1)}{\delta})}{\mu^{2}},

we have $c_{0}\beta\cdot\textup{term II}\leq\frac{\mu}{2}$, so that from (B.10), on the event $\mathscr{A}$,

\begin{align*}
\|\hat{\theta}-\theta^{*}\|\leq\frac{2}{\mu}\left(2\cdot\textup{term I}+\frac{\mu}{n}\right).
\end{align*}

Plugging “$c_{0}\beta\cdot\textup{term II}\leq\frac{\mu}{2}$” and “$\|\hat{\theta}-\theta^{*}\|\leq\frac{2}{\mu}(2\cdot\textup{term I}+\frac{\mu}{n})$” into the second inequality within (B.10), we further have

(θ^;z)2term I+μn+μ2θ^θ\displaystyle\|\mathbb{P}\nabla\ell(\hat{\theta};z)\|\leq 2\cdot\textup{term I}+\frac{\mu}{n}+\frac{\mu}{2}\|\hat{\theta}-\theta^{*}\|
4term I+2μn.\displaystyle\leq 4\cdot\textup{term I}+\frac{2\mu}{n}.

Lastly, since the PL condition implies (see Lemma 7)

\begin{align*}
\mathbb{P}\ell(\hat{\theta};z)-\mathbb{P}\ell(\theta^{*};z)\leq\frac{\|\mathbb{P}\nabla\ell(\hat{\theta};z)\|^{2}}{2\mu},
\end{align*}

by plugging in “$\|\mathbb{P}\nabla\ell(\hat{\theta};z)\|\leq 4\cdot\textup{term I}+\frac{2\mu}{n}$” and using $(a+b)^{2}\leq 2a^{2}+2b^{2}$ twice, we have

\begin{align*}
\mathbb{P}\ell(\hat{\theta};z)-\mathbb{P}\ell(\theta^{*};z)&\leq\frac{\|\mathbb{P}\nabla\ell(\hat{\theta};z)\|^{2}}{2\mu}\\
&\leq\frac{16}{\mu}(\textup{term I})^{2}+\frac{4\mu}{n^{2}}\\
&\leq\frac{64\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{\mu n}+\frac{32G_{*}^{2}\log^{2}\frac{4}{\delta}+4\mu^{2}}{\mu n^{2}}.
\end{align*}

This completes the proof of Theorem 3. \square

B.4 Proof of Theorem 4

We first prove a simple proposition, which studies how the accumulation of sample approximation errors at every step influences the convergence of the algorithm.

Proposition 7 (localized statistical error of a linearly convergent iterative algorithm).

Consider a function $F$ (which we call the “Lyapunov function”) and a parameter $\gamma\in(0,1)$. Assume an algorithm satisfies, for all $t=0,1,\dots$,

F(θt+1)(1γ)F(θt)+εt(n),\displaystyle F(\theta^{t+1})\leq(1-\gamma)F(\theta^{t})+\varepsilon^{t}(n),
εt(n)α(n)F(θt)+ε(n),\displaystyle\varepsilon^{t}(n)\leq\alpha(n)F(\theta^{t})+\varepsilon^{*}(n),
andθtΘ.\displaystyle\text{and}\quad\theta^{t}\in\Theta.

When the sample size nn is large enough such that α(n)γ2\alpha(n)\leq\frac{\gamma}{2}, we have

F(θt)(1γ2)tF(θ0)+2γε(n),t=0,1,.\displaystyle F(\theta^{t})\leq\left(1-\frac{\gamma}{2}\right)^{t}F(\theta^{0})+\frac{2}{\gamma}\varepsilon^{*}(n),\quad t=0,1,\cdots.
Proof of Proposition 7:

We have

\begin{align*}
F(\theta^{t+1})&\leq(1-\gamma+\alpha(n))F(\theta^{t})+\varepsilon^{*}(n)\\
&\leq\left(1-\frac{\gamma}{2}\right)F(\theta^{t})+\varepsilon^{*}(n).
\end{align*}

Then by induction we have

F(θt)(1γ2)tF(θ0)+2γε(n),t=0,1,.\displaystyle F(\theta^{t})\leq\left(1-\frac{\gamma}{2}\right)^{t}F(\theta^{0})+\frac{2}{\gamma}\varepsilon^{*}(n),\quad t=0,1,\cdots.

This completes the proof of Proposition 7. \square
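The induction in Proposition 7 can be illustrated numerically. The sketch below iterates the recursion with equality at every step (the worst case allowed by the assumptions) and compares each iterate with the stated bound. All numerical values are hypothetical and chosen only for illustration; `alpha` is set to the boundary value $\gamma/2$.

```python
def worst_case_trajectory(F0, gamma, alpha, eps_star, steps):
    """Iterate F_{t+1} = (1 - gamma + alpha) * F_t + eps_star, i.e. the
    recursion of Proposition 7 with both inequalities holding with equality."""
    values = [F0]
    for _ in range(steps):
        values.append((1.0 - gamma + alpha) * values[-1] + eps_star)
    return values

def proposition7_bound(F0, gamma, eps_star, t):
    """Right-hand side of Proposition 7 at iteration t."""
    return (1.0 - gamma / 2.0) ** t * F0 + 2.0 * eps_star / gamma

# Hypothetical parameters; alpha = gamma / 2 is the largest value allowed.
gamma, alpha, eps_star, F0 = 0.2, 0.1, 0.01, 5.0
traj = worst_case_trajectory(F0, gamma, alpha, eps_star, steps=100)
bound_holds = all(
    v <= proposition7_bound(F0, gamma, eps_star, t) + 1e-12
    for t, v in enumerate(traj)
)
print(bound_holds)
```

In this example the iterates settle near $\frac{2}{\gamma}\varepsilon^{*}(n)=0.1$, the statistical error floor of the bound, while the optimization error term decays geometrically.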

We now prove Theorem 4.

Proof of Theorem 4:

Assumption 1 implies that the population risk is β\beta-smooth. Consider the gradient descent algorithm (5.7) with fixed step size 1β\frac{1}{\beta}. We have for all t=0,1,t=0,1,\cdots,

θt+1=θt1βn(θt;z).\displaystyle\theta^{t+1}=\theta^{t}-\frac{1}{\beta}\mathbb{P}_{n}\nabla\ell(\theta^{t};z).

So we have

(θt+1;z)(θt;z)((θt;z))T(θt+1θt)+β2θt+1θt2\displaystyle\mathbb{P}\ell(\theta^{t+1};z)-\mathbb{P}\ell(\theta^{t};z)\leq(\mathbb{P}\nabla\ell(\theta^{t};z))^{T}(\theta^{t+1}-\theta^{t})+\frac{\beta}{2}\|\theta^{t+1}-\theta^{t}\|^{2}
=1β((θt;z))T(n(θt;z))+12βn(θt;z)2\displaystyle=-\frac{1}{\beta}(\mathbb{P}\nabla\ell(\theta^{t};z))^{T}(\mathbb{P}_{n}\nabla\ell(\theta^{t};z))+\frac{1}{2\beta}\|\mathbb{P}_{n}\nabla\ell(\theta^{t};z)\|^{2}
=1β(θt;z)21β((θt;z))T(n(θt;z)(θt;z))\displaystyle=-\frac{1}{\beta}\|\mathbb{P}\nabla\ell(\theta^{t};z)\|^{2}-\frac{1}{\beta}(\mathbb{P}\nabla\ell(\theta^{t};z))^{T}(\mathbb{P}_{n}\nabla\ell(\theta^{t};z)-\mathbb{P}\nabla\ell(\theta^{t};z))
+12β(θt;z)+(n(θt;z)(θt;z))2\displaystyle+\frac{1}{2\beta}\|\mathbb{P}\nabla\ell(\theta^{t};z)+(\mathbb{P}_{n}\nabla\ell(\theta^{t};z)-\mathbb{P}\nabla\ell(\theta^{t};z))\|^{2}
=12β(θt;z)2+12βn(θt;z)(θt;z)2\displaystyle=-\frac{1}{2\beta}\|\mathbb{P}\nabla\ell(\theta^{t};z)\|^{2}+\frac{1}{2\beta}\|\mathbb{P}_{n}\nabla\ell(\theta^{t};z)-\mathbb{P}\nabla\ell(\theta^{t};z)\|^{2}
μβ((θt;z)(θ;z))+12βn(θt;z)(θt;z)2.\displaystyle\leq-\frac{\mu}{\beta}(\mathbb{P}\ell(\theta^{t};z)-\mathbb{P}\ell(\theta^{*};z))+\frac{1}{2\beta}\|\mathbb{P}_{n}\nabla\ell(\theta^{t};z)-\mathbb{P}\nabla\ell(\theta^{t};z)\|^{2}.

Rearranging the above inequality, and subtracting (θ;z)\mathbb{P}\ell(\theta^{*};z) from both sides, we obtain

(θt+1;z)(θ;z)(1μβ)((θt;z)(θ;z))+12βn(θt;z)(θt;z)2.\displaystyle\mathbb{P}\ell(\theta^{t+1};z)-\mathbb{P}\ell(\theta^{*};z)\leq\left(1-\frac{\mu}{\beta}\right)(\mathbb{P}\ell(\theta^{t};z)-\mathbb{P}\ell(\theta^{*};z))+\frac{1}{2\beta}\|\mathbb{P}_{n}\nabla\ell(\theta^{t};z)-\mathbb{P}\nabla\ell(\theta^{t};z)\|^{2}. (B.11)

Applying Proposition 6, we continue the proof on the event

\begin{align*}
\mathscr{A}:=\{\text{the results (B.4) and (B.5) in Proposition 6 hold true}\},
\end{align*}

whose measure is at least $1-\delta$. We keep the notations “term I” and “term II” used in Proposition 6, which are defined by

\begin{align*}
\textup{term I}&:=\sqrt{\frac{2\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{n}}+\frac{G_{*}\log\frac{4}{\delta}}{n},\\
\textup{term II}&:=\sqrt{\frac{d+\log\frac{8\log_{2}(2n\Delta_{M}+2)}{\delta}}{n}}+\frac{d+\log\frac{8\log_{2}(2n\Delta_{M}+2)}{\delta}}{n}.
\end{align*}

The result (B.5) in Proposition 6 implies that θΘ\forall\theta\in\Theta,

n(θ;z)(θ;z)term I+c0βmax{θθ,1n}term II\displaystyle\|\mathbb{P}_{n}\nabla\ell(\theta;z)-\mathbb{P}\nabla\ell(\theta;z)\|\leq{\text{term I}}+c_{0}\beta\max\left\{\|\theta-\theta^{*}\|,\frac{1}{n}\right\}\cdot{\text{term II}}
(term I+c0βnterm II)+c0βterm IIθθ,\displaystyle\leq\left(\text{term I}+\frac{c_{0}\beta}{n}\cdot\text{term II}\right)+c_{0}\beta\cdot\text{term II}\cdot\|\theta-\theta^{*}\|,

where c0c_{0} is an absolute constant. Since the PL condition implies (see Lemma 7) that

(θ;z)(θ;z)μ2θθ2,θΘ,\displaystyle\mathbb{P}\ell(\theta;z)-\mathbb{P}\ell(\theta^{*};z)\geq\frac{\mu}{2}\|\theta-\theta^{*}\|^{2},\quad\forall\theta\in\Theta,

we have

n(θ;z)(θ;z)22(term I+c0βnterm II)2\displaystyle\|\mathbb{P}_{n}\nabla\ell(\theta;z)-\mathbb{P}\nabla\ell(\theta;z)\|^{2}\leq 2\left(\text{term I}+\frac{c_{0}\beta}{n}\cdot\text{term II}\right)^{2}
+4c02β2μ((θ;z)(θ;z))(term II)2.\displaystyle+\frac{4c_{0}^{2}\beta^{2}}{\mu}\left(\mathbb{P}\ell(\theta;z)-\mathbb{P}\ell(\theta^{*};z)\right)(\text{term II})^{2}. (B.12)

Combining (B.11) and (B.12), we have that for all $t=0,1,\dots$,

(θt+1)(1μβ)(θt)+εt(n),\displaystyle\mathscr{E}(\theta^{t+1})\leq\left(1-\frac{\mu}{\beta}\right)\mathscr{E}(\theta^{t})+\varepsilon^{t}(n),
εt(n)α(n)(θt)+ε(n),\displaystyle\varepsilon^{t}(n)\leq\alpha(n)\mathscr{E}(\theta^{t})+\varepsilon^{*}(n),

where

εt(n)=12βn(θt;z)(θt;z)2,\displaystyle\varepsilon^{t}(n)=\frac{1}{2\beta}\|\mathbb{P}_{n}\nabla\ell(\theta^{t};z)-\mathbb{P}\nabla\ell(\theta^{t};z)\|^{2},
α(n)=2c02βμ(term II)2,\displaystyle\alpha(n)=\frac{2c_{0}^{2}\beta}{\mu}(\text{term II})^{2},
ε(n)=1β(term I+c0βnterm II)2.\displaystyle\varepsilon^{*}(n)=\frac{1}{\beta}\left(\text{term I}+\frac{c_{0}\beta}{n}\cdot\text{term II}\right)^{2}.

Consider the following two conditions on the sample size (note that they will be satisfied as long as nn is large enough):

α(n)μ2β,\displaystyle\alpha(n)\leq\frac{\mu}{2\beta}, (B.13)
ε(n)μ24βΔm2.\displaystyle\varepsilon^{*}(n)\leq\frac{\mu^{2}}{4\beta}\Delta_{m}^{2}. (B.14)

Now consider the condition

ncβ2μ2(d+log8log2(2nΔM+2)δ),\displaystyle n\geq\frac{c\beta^{2}}{\mu^{2}}\left(d+\log\frac{8\log_{2}(2n\Delta_{M}+2)}{\delta}\right),

where $c=\max\{16c_{0}^{2},1\}$ is an absolute constant. Then (B.13) holds. Since we also require the sample size $n$ to be large enough such that the “statistical error” term in (5.8) is smaller than $\frac{\mu}{2}\Delta_{m}^{2}$, the condition (B.14) is also true because

\begin{align*}
\frac{\mu}{2}\Delta_{m}^{2}\geq\frac{16\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{\mu n}+\frac{8G_{*}^{2}\log^{2}\frac{4}{\delta}+\mu^{2}}{\mu n^{2}}\geq\frac{2}{\mu}\left(\textup{term I}+\frac{\mu}{2n}\right)^{2}\geq\frac{2\beta}{\mu}\cdot\varepsilon^{*}(n),
\end{align*}

where the last inequality uses $c_{0}\beta\cdot\textup{term II}\leq\frac{\mu}{2}$, which follows from (B.13).

Therefore, both condition (B.13) and condition (B.14) hold true under Theorem 4’s requirement on the sample size.

Since both (B.13) and (B.14) are true, we can use induction to prove that with probability at least 1δ1-\delta, for all t=0,1,t=0,1,\dots,

(θt)μ2Δm2.\mathscr{E}(\theta^{t})\leq\frac{\mu}{2}\Delta_{m}^{2}.

Therefore, for all t=0,1,,t=0,1,\dots,

θtBd(θ,Δm)Θ.\displaystyle\theta^{t}\in\pazocal{B}^{d}(\theta^{*},\Delta_{m})\subseteq\Theta.

We choose the “Lyapunov function” in Proposition 7 to be the excess risk function $\mathscr{E}(\theta)$. Applying Proposition 7, we obtain that, when the sample size $n$ is large enough such that the conditions (B.13) and (B.14) hold true, we have

\begin{align*}
\mathbb{P}\ell(\theta^{t};z)-\mathbb{P}\ell(\theta^{*};z)&\leq\left(1-\frac{\mu}{2\beta}\right)^{t}\mathscr{E}(\theta^{0})+\frac{2\beta}{\mu}\cdot\varepsilon^{*}(n)\\
&\leq\left(1-\frac{\mu}{2\beta}\right)^{t}\mathscr{E}(\theta^{0})+\frac{2}{\mu}\left(\textup{term I}+\frac{\mu}{2n}\right)^{2}\\
&\leq\left(1-\frac{\mu}{2\beta}\right)^{t}\mathscr{E}(\theta^{0})+\frac{16\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{\mu n}+\frac{8G_{*}^{2}\log^{2}\frac{4}{\delta}+\mu^{2}}{\mu n^{2}}.
\end{align*}

This completes the proof of Theorem 4. \square
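The one-step inequality (B.11) is deterministic: it holds for any perturbation of the population gradient, not only for the sampling error $\mathbb{P}_{n}\nabla\ell-\mathbb{P}\nabla\ell$. The sketch below checks it on a diagonal quadratic population risk with Gaussian perturbations standing in for the sampling error. The quadratic, the noise model, and all constants are assumptions made for illustration and are not part of the paper's setting.

```python
import random

# Population risk F(theta) = 0.5 * sum_i a_i * theta_i^2 (theta* = 0, F(theta*) = 0).
A = [0.5, 1.0, 4.0]          # hypothetical curvatures
MU, BETA = min(A), max(A)    # PL and smoothness parameters of this quadratic

def excess_risk(theta):
    return 0.5 * sum(a * t * t for a, t in zip(A, theta))

def noisy_gd_step(theta, noise):
    """Gradient step with step size 1/beta on a perturbed gradient; `noise`
    plays the role of P_n grad - P grad in (B.11)."""
    return [t - (a * t + e) / BETA for a, t, e in zip(A, theta, noise)]

def check_one_step_inequality(steps=50, noise_scale=0.1, seed=0):
    rng = random.Random(seed)
    theta = [3.0, -2.0, 1.0]
    holds = True
    for _ in range(steps):
        noise = [rng.gauss(0.0, noise_scale) for _ in A]
        nxt = noisy_gd_step(theta, noise)
        # Right-hand side of (B.11): contraction plus squared gradient error.
        rhs = (1.0 - MU / BETA) * excess_risk(theta) \
            + sum(e * e for e in noise) / (2.0 * BETA)
        holds = holds and excess_risk(nxt) <= rhs + 1e-12
        theta = nxt
    return holds, excess_risk(theta)

holds, final_risk = check_one_step_inequality()
print(holds, final_risk)
```

The localization argument of the proof then bounds the perturbation term by a constant plus a multiple of the current excess risk, which is what allows Proposition 7 to be applied.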

B.5 Proof of Corollary 5

We first verify Assumption 1. We have

\begin{align*}
\nabla^{2}\ell(\theta;z)=2\left(\eta^{\prime}(\theta^{T}x)^{2}+(\eta(\theta^{T}x)-y)\eta^{\prime\prime}(\theta^{T}x)\right)xx^{T}.
\end{align*}

Since $x$ is $\tau$-sub-Gaussian, $xx^{T}$ is $\tau^{2}$-sub-exponential. From the fact

\begin{align*}
\left|2\left(\eta^{\prime}(\theta^{T}x)^{2}+(\eta(\theta^{T}x)-y)\eta^{\prime\prime}(\theta^{T}x)\right)\right|\leq 2C_{\eta}(C_{\eta}+\sqrt{B}),
\end{align*}

Assumption 1 holds with $\beta=2C_{\eta}(C_{\eta}+\sqrt{B})\tau^{2}$.

We then verify Assumption 2. We know

(θ;z)=2(η(xTθ)y)η(xTθ)x.\displaystyle\nabla\ell(\theta^{*};z)=2(\eta(x^{T}\theta^{*})-y)\eta^{\prime}(x^{T}\theta^{*})x.

So we have for all zz,

(θ;z)d(θ;z)2CηBd.\displaystyle\|\nabla\ell(\theta^{*};z)\|\leq\sqrt{d}\|\nabla\ell(\theta^{*};z)\|_{\infty}\leq 2C_{\eta}\sqrt{Bd}.

So Assumption 2 holds with G=2CηBdG_{*}=2C_{\eta}\sqrt{Bd}.

Lastly, by [12, Lemma 5, inequality (16)], Assumption 3 holds with μ=2cη3τ2γCη\mu=\frac{2c_{\eta}^{3}\tau^{2}\gamma}{C_{\eta}}. This completes the proof. \square

B.6 Proof of Theorem 6

Before proving Theorem 6, we refer to [2, Theorem 1] for the following result on the population-based first-order EM update.

Lemma 8 (linear convergence of population-based first-order EM).

Under Assumption 5, Assumption 6 and the condition that (θ;z)\mathbb{P}\ell(\theta;z) is β\beta-smooth, the following update,

θ+=θ2β+μ1θ(θ;z)\displaystyle\theta^{+}=\theta-\frac{2}{\beta+\mu_{1}}\mathbb{P}\nabla\ell_{\theta}(\theta;z)

satisfies that

θ+θ(12μ1μ2β+μ1)θθ.\displaystyle\|\theta^{+}-\theta^{*}\|\leq\left(1-\frac{2\mu_{1}-\mu_{2}}{\beta+\mu_{1}}\right)\|\theta-\theta^{*}\|.

We now prove Theorem 6.

Proof of Theorem 6:

Assumption 1 implies that (θ;z)\mathbb{P}\ell(\theta;z) is β\beta-smooth, so Lemma 8 holds under the assumptions of Theorem 6. Now we turn to analyze the sample-based first-order EM. Consider the update of sample-based first-order EM,

θt+1=θt2β+μ1nθt(θt;z),t=0,1,\displaystyle\theta^{t+1}=\theta^{t}-\frac{2}{\beta+\mu_{1}}\mathbb{P}_{n}\nabla\ell_{\theta^{t}}(\theta^{t};z),\quad t=0,1,\dots

Fix $t\geq 0$. By the triangle inequality and Lemma 8, we have

\begin{align*}
\|\theta^{t+1}-\theta^{*}\|&\leq\left\|\theta^{t}-\frac{2}{\beta+\mu_{1}}\mathbb{P}\nabla\ell_{\theta^{t}}(\theta^{t};z)-\theta^{*}\right\|+\frac{2}{\beta+\mu_{1}}\|(\mathbb{P}-\mathbb{P}_{n})\nabla\ell_{\theta^{t}}(\theta^{t};z)\|\\
&\leq\left(1-\frac{2\mu_{1}-\mu_{2}}{\beta+\mu_{1}}\right)\|\theta^{t}-\theta^{*}\|+\frac{2}{\beta+\mu_{1}}\|(\mathbb{P}-\mathbb{P}_{n})\nabla\ell(\theta^{t};z)\|.
\end{align*}

Applying Proposition 6, we continue the proof on the event

\begin{align*}
\pazocal{A}:=\{\text{the results (B.4) and (B.5) in Proposition 6 hold true}\},
\end{align*}

whose measure is at least $1-\delta$. We keep the notations “term I” and “term II” used in Proposition 6, which are defined by

\begin{align*}
\textup{term I}&:=\sqrt{\frac{2\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{n}}+\frac{G_{*}\log\frac{4}{\delta}}{n},\\
\textup{term II}&:=\sqrt{\frac{d+\log\frac{8\log_{2}(2n\Delta_{M}+2)}{\delta}}{n}}+\frac{d+\log\frac{8\log_{2}(2n\Delta_{M}+2)}{\delta}}{n}.
\end{align*}

Note that we have the optimality condition $\mathbb{P}\nabla\ell(\theta^{*};z)=0$, because the true parameter $\theta^{*}$ is assumed to minimize the population risk over $\mathbb{R}^{d}$ in the problem setting. The result (B.5) in Proposition 6 implies that $\forall\theta\in\Theta$,

n(θ;z)(θ;z)term I+c0βmax{θθ,1n}term II\displaystyle\|\mathbb{P}_{n}\nabla\ell(\theta;z)-\mathbb{P}\nabla\ell(\theta;z)\|\leq{\text{term I}}+c_{0}\beta\max\left\{\|\theta-\theta^{*}\|,\frac{1}{n}\right\}\cdot{\text{term II}} (B.15)
(term I+c0βnterm II)+c0βterm IIθθ,\displaystyle\leq\left(\text{term I}+\frac{c_{0}\beta}{n}\cdot\text{term II}\right)+c_{0}\beta\cdot\text{term II}\cdot\|\theta-\theta^{*}\|, (B.16)

where c0c_{0} is an absolute constant. Therefore, we have that for all t=0,1,t=0,1,\dots,

(θt+1)\displaystyle\mathscr{E}(\theta^{t+1}) (12μ1μ2β+μ1)(θt)+εt(n),\displaystyle\leq\left(1-\frac{2\mu_{1}-\mu_{2}}{\beta+\mu_{1}}\right)\mathscr{E}(\theta^{t})+\varepsilon^{t}(n),
εt(n)\displaystyle\varepsilon^{t}(n) α(n)(θt)+ε(n),\displaystyle\leq\alpha(n)\mathscr{E}(\theta^{t})+\varepsilon^{*}(n),

where

εt(n)=2β+μ1(n)(θt;z),\displaystyle\varepsilon^{t}(n)=\frac{2}{\beta+\mu_{1}}\|(\mathbb{P}-\mathbb{P}_{n})\nabla\ell(\theta^{t};z)\|,
α(n)=2c0ββ+μ1term II,\displaystyle\alpha(n)=\frac{2c_{0}\beta}{\beta+\mu_{1}}\cdot\text{term II},
ε(n)=2β+μ1(term I+c0βnterm II).\displaystyle\varepsilon^{*}(n)=\frac{2}{\beta+\mu_{1}}\left(\text{term I}+\frac{c_{0}\beta}{n}\cdot\text{term II}\right).

Consider the following two conditions on the sample size (note that they will be satisfied as long as nn is large enough):

α(n)2μ1μ22(β+μ1),\displaystyle\alpha(n)\leq\frac{2\mu_{1}-\mu_{2}}{2(\beta+\mu_{1})}, (B.17)
ε(n)2μ1μ22(β+μ1)Δm.\displaystyle\varepsilon^{*}(n)\leq\frac{2\mu_{1}-\mu_{2}}{2(\beta+\mu_{1})}\Delta_{m}. (B.18)

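When the sample size $n$ is large enough so that both (B.17) and (B.18) are true, we can use induction to prove that with probability at least $1-\delta$, for all $t=0,1,\dots$,

\[
\|\theta^{t}-\theta^{*}\|\leq\Delta_{m}.
\]

Indeed, writing $\rho=\frac{2\mu_{1}-\mu_{2}}{\beta+\mu_{1}}$, the inductive step reads: if $\|\theta^{t}-\theta^{*}\|\leq\Delta_{m}$, then the recursion above together with (B.17) and (B.18) gives $\|\theta^{t+1}-\theta^{*}\|\leq\left(1-\rho+\frac{\rho}{2}\right)\Delta_{m}+\frac{\rho}{2}\Delta_{m}=\Delta_{m}$. Therefore, for all $t=0,1,\dots$,

\[
\theta^{t}\in\pazocal{B}^{d}(\theta^{*},\Delta_{m})\subseteq\Theta.
\]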

We choose the “Lyapunov function” in Proposition 7 to be $\|\theta-\theta^{*}\|$. Applying Proposition 7, we obtain: when the sample size $n$ is large enough that the conditions (B.17) and (B.18) hold true, we have

θtθ(12μ1μ22(β+μ1))tθ0θ+2(β+μ1)2μ1μ2ε(n)\displaystyle\|\theta^{t}-\theta^{*}\|\leq\left(1-\frac{2\mu_{1}-\mu_{2}}{2(\beta+\mu_{1})}\right)^{t}\|\theta^{0}-\theta^{*}\|+\frac{2(\beta+\mu_{1})}{2\mu_{1}-\mu_{2}}\cdot\varepsilon^{*}(n)
(12μ1μ22(β+μ1))tθ0θ+42μ1μ2term I+2n.\displaystyle\leq\left(1-\frac{2\mu_{1}-\mu_{2}}{2(\beta+\mu_{1})}\right)^{t}\|\theta^{0}-\theta^{*}\|+\frac{4}{2\mu_{1}-\mu_{2}}\cdot\text{term I}+\frac{2}{n}. (B.19)
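The step from the one-step recursion to the unrolled bound can be checked numerically. The following is a minimal sketch, not part of the original proof: it iterates the worst case of the recursion, $e_{t+1}=(1-\rho+\alpha)e_{t}+\varepsilon^{*}$ with $\rho=\frac{2\mu_{1}-\mu_{2}}{\beta+\mu_{1}}$ and $\alpha\leq\rho/2$, and compares it with $(1-\rho/2)^{t}e_{0}+\frac{2}{\rho}\varepsilon^{*}$. The function names and all numerical constants are hypothetical and chosen only for illustration.

```python
def unroll_recursion(rho, alpha, eps_star, e0, steps):
    """Iterate the worst case e_{t+1} = (1 - rho + alpha) * e_t + eps_star."""
    errors = [e0]
    for _ in range(steps):
        errors.append((1.0 - rho + alpha) * errors[-1] + eps_star)
    return errors


def closed_form_bound(rho, eps_star, e0, t):
    """Unrolled bound: (1 - rho/2)^t * e0 + (2 / rho) * eps_star."""
    return (1.0 - rho / 2.0) ** t * e0 + 2.0 * eps_star / rho


# Hypothetical constants (assumptions for illustration only).
rho = 0.4        # plays the role of (2*mu_1 - mu_2) / (beta + mu_1)
alpha = 0.2      # satisfies alpha <= rho / 2, as required by (B.17)
eps_star = 0.01  # plays the role of the statistical error eps^*(n)

errors = unroll_recursion(rho, alpha, eps_star, e0=1.0, steps=50)
bounds = [closed_form_bound(rho, eps_star, 1.0, t) for t in range(51)]
```

The geometric term vanishes as $t$ grows, so the iterates settle at the statistical error level $\frac{2}{\rho}\varepsilon^{*}$, which is the structure of (B.19).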

When the sample size is large enough such that

ncβ2(2μ1μ2)2(d+log8log2(2nΔM+2)δ) and term I+2μ1μ22n(2μ1μ2)Δm4,\displaystyle n\geq\frac{c\beta^{2}}{(2\mu_{1}-\mu_{2})^{2}}\left(d+\log\frac{8\log_{2}(2n\Delta_{M}+2)}{\delta}\right)\text{\quad and \quad}\text{term I}+\frac{2\mu_{1}-\mu_{2}}{2n}\leq\frac{(2\mu_{1}-\mu_{2})\Delta_{m}}{4}, (B.20)

where c=max{64c02,1}c=\max\{64c_{0}^{2},1\} is an absolute constant, we have

term I2μ1μ24(Δm2n) and term II2μ1μ24c0β,\displaystyle\text{term I}\leq\frac{2\mu_{1}-\mu_{2}}{4}\left(\Delta_{m}-\frac{2}{n}\right)\text{\quad and \quad}\text{term II}\leq\frac{2\mu_{1}-\mu_{2}}{4c_{0}\beta},

which further guarantee that both the condition (B.17) and the condition (B.18) are true. We conclude that when the sample size condition (B.20) is true, we have the bound (B.19).

Now we use the fact $\mu_{1}\leq 2\mu_{1}-\mu_{2}\leq 2\mu_{1}$ to simplify the sample size condition (B.20) and the bound (B.19). It is straightforward to verify that the sample size condition (B.20) will be satisfied when

nmax{cβ2μ12(d+log8log2(2nΔM+2)δ),128[(θ;z)2]log4δμ1ΔM,8Glog4δ+8μ1μ1ΔM};\displaystyle n\geq\max\left\{\frac{c\beta^{2}}{\mu_{1}^{2}}\left(d+\log\frac{8\log_{2}(2n\Delta_{M}+2)}{\delta}\right),\quad\frac{128\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{\mu_{1}\Delta_{M}},\quad\frac{8G_{*}\log\frac{4}{\delta}+8\mu_{1}}{\mu_{1}\Delta_{M}}\right\}; (B.21)

and the bound (B.19) implies

θtθ4μ1(2[(θ;z)2]log4δn+Glog4δ+μ1n)+(12μ1μ22(β+μ1))tθ0θ.\displaystyle\|\theta^{t}-\theta^{*}\|\leq\frac{4}{\mu_{1}}\left(\sqrt{\frac{2\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{n}}+\frac{G_{*}\log\frac{4}{\delta}+\mu_{1}}{n}\right)+\left(1-\frac{2\mu_{1}-\mu_{2}}{2(\beta+\mu_{1})}\right)^{t}\|\theta^{0}-\theta^{*}\|. (B.22)

Since we always have

(θt)β2θtθ2,\mathscr{E}(\theta^{t})\leq\frac{\beta}{2}\|\theta^{t}-\theta^{*}\|^{2},

the bound (B.22) will imply

(θt)16βμ12(2[(θ;z)2]log4δn+Glog4δ+μ1n)2+(12μ1μ22(β+μ1))2tβθ0θ2.\displaystyle\mathscr{E}(\theta^{t})\leq\frac{16\beta}{\mu_{1}^{2}}\left(\sqrt{\frac{2\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\log\frac{4}{\delta}}{n}}+\frac{G_{*}\log\frac{4}{\delta}+\mu_{1}}{n}\right)^{2}+\left(1-\frac{2\mu_{1}-\mu_{2}}{2(\beta+\mu_{1})}\right)^{2t}\beta\|\theta^{0}-\theta^{*}\|^{2}. (B.23)

Clearly, the sample size condition (B.21) and the bounds (B.22) (B.23) are identical to those presented in the statement of the theorem. This completes the proof. \square

B.7 Proof of Corollary 7

For both examples, verification of Assumptions 1, 2 and 5 is trivial. The parameters can be specified as β=1\beta=1, G=σdG_{*}=\sigma\sqrt{d}, and μ1=1\mu_{1}=1.
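To make the update analyzed in Theorem 6 concrete, here is a minimal sketch of sample-based first-order EM. It assumes that Example 4 is the balanced symmetric two-component Gaussian mixture $z=w\theta^{*}+\text{noise}$ with $w$ uniform on $\{-1,+1\}$, for which the gradient is $\nabla\ell_{\theta}(\theta;z)=\theta-\tanh(\theta^{T}z/\sigma^{2})\,z$; this form is inferred from the conditional gradient expressions used later in this proof and is not quoted from the paper. The function names, the initialization, the sample size, and the seed are all hypothetical.

```python
import math
import random


def sample_mixture(theta_star, sigma, n, rng):
    """Draw n points z = w * theta_star + sigma * noise, w uniform on {-1, +1}."""
    data = []
    for _ in range(n):
        w = 1.0 if rng.random() < 0.5 else -1.0
        data.append([w * c + sigma * rng.gauss(0.0, 1.0) for c in theta_star])
    return data


def empirical_gradient(theta, data, sigma):
    """theta - (1/n) * sum_i tanh(<theta, z_i> / sigma^2) * z_i (assumed gradient form)."""
    d = len(theta)
    avg = [0.0] * d
    for z in data:
        weight = math.tanh(sum(t * c for t, c in zip(theta, z)) / sigma ** 2)
        for j in range(d):
            avg[j] += weight * z[j] / len(data)
    return [theta[j] - avg[j] for j in range(d)]


def first_order_em(theta0, data, sigma, step, iters):
    """Run theta <- theta - step * (empirical gradient at theta)."""
    theta = list(theta0)
    for _ in range(iters):
        grad = empirical_gradient(theta, data, sigma)
        theta = [t - step * g for t, g in zip(theta, grad)]
    return theta


rng = random.Random(0)
theta_star = [3.0, 0.0]  # hypothetical true parameter with a large SNR
data = sample_mixture(theta_star, sigma=1.0, n=2000, rng=rng)
# With beta = mu_1 = 1 the step size 2 / (beta + mu_1) equals 1.
theta_hat = first_order_em([2.0, 1.0], data, sigma=1.0, step=1.0, iters=30)
error = math.dist(theta_hat, theta_star)
```

With step size $1$ the iteration coincides with the standard EM update for this model, which is why the step size $\frac{2}{\beta+\mu_{1}}$ is a natural choice when $\beta=\mu_{1}=1$.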

As for verification of Assumption 6, we refer to the following results, which are direct consequences of [2].

Lemma 9 (verification of Assumption 6).

(a) [2, Lemma 2]: Consider Example 4 under the SNR condition (7.6), where η\eta is a sufficiently large constant such that η>433\eta>\frac{4\sqrt{3}}{3} and c1(1+1η2+η2)ec2η2<1c_{1}(1+\frac{1}{\eta^{2}}+\eta^{2})e^{-c_{2}\eta^{2}}<1. Then Assumption 6 holds with μ2=c1(1+1η2+η2)ec2η2\mu_{2}=c_{1}(1+\frac{1}{\eta^{2}}+\eta^{2})e^{-c_{2}\eta^{2}}. Here c1c_{1} and c2c_{2} denote the same absolute constants as in the proof of [2, Lemma 2]. Clearly, we can verify Assumption 6 for all η\eta larger than a certain absolute constant.

(b) [2, Lemma 3]: Consider Example 5 under the SNR condition (7.6), where η\eta is a sufficiently large constant such that

cη1cτ22+cτ2logηη+2η18,\displaystyle c\eta^{1-\frac{c_{\tau}^{2}}{2}}+c_{\tau}^{2}\frac{\log\eta}{\eta}+\frac{2}{\eta}\leq\frac{1}{8},
θ8η+(4+231)Cτ2logηη+3η1Cτ2218\displaystyle\sqrt{\frac{\|\theta^{*}\|}{8\eta}+(4+\frac{2}{31})\frac{C_{\tau}^{2}\log\eta}{\eta}+3\eta^{1-\frac{C_{\tau}^{2}}{2}}}\leq\frac{1}{8}

hold true for some sufficiently large constants $c_{\tau},C_{\tau}$ and an absolute constant $c$. Then Assumption 6 holds with $\mu_{2}=\frac{1}{4}$. Here $c,c_{\tau},C_{\tau}$ denote the same quantities as in the proof of [2, Lemma 3]. Clearly, we can verify Assumption 6 for all $\eta$ larger than a certain absolute constant.

To prove the generalization error bound in this corollary, we need to upper bound the problem-dependent parameter [(θ;z)2]\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}] for the two examples.

Bounding [(θ;z)2]\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}] for Example 4:

we define the function g:d(0,2)g:\mathbb{R}^{d}\rightarrow(0,2) as

g(u)=2e2θu22σ2eu22σ2+e2θu22σ2=2e2θ22uTθσ2+1.\displaystyle g(u)=\frac{2e^{-\frac{\|2\theta^{*}-u\|^{2}}{2\sigma^{2}}}}{e^{-\frac{\|u\|^{2}}{2\sigma^{2}}}+e^{-\frac{\|2\theta^{*}-u\|^{2}}{2\sigma^{2}}}}=\frac{2}{e^{\frac{2\|\theta^{*}\|^{2}-2u^{T}\theta^{*}}{\sigma^{2}}}+1}.
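The two expressions for $g(u)$ above agree because $\|2\theta^{*}-u\|^{2}-\|u\|^{2}=4\|\theta^{*}\|^{2}-4u^{T}\theta^{*}$. As a quick sanity check, the sketch below evaluates both forms at random points; the helper names and the test values of $\theta^{*}$ and $\sigma$ are hypothetical.

```python
import math
import random


def g_ratio(u, theta_star, sigma):
    """First expression for g(u): a ratio of two Gaussian kernels."""
    a = math.exp(-sum(x * x for x in u) / (2 * sigma ** 2))
    b = math.exp(-sum((2 * t - x) ** 2 for t, x in zip(theta_star, u)) / (2 * sigma ** 2))
    return 2 * b / (a + b)


def g_logistic(u, theta_star, sigma):
    """Second expression for g(u): a logistic function of <u, theta_star>."""
    norm_sq = sum(t * t for t in theta_star)
    inner = sum(t * x for t, x in zip(theta_star, u))
    return 2 / (math.exp((2 * norm_sq - 2 * inner) / sigma ** 2) + 1)


rng = random.Random(1)
theta_star, sigma = [1.0, 0.5], 1.0  # hypothetical values
points = [[rng.gauss(0.0, sigma) for _ in range(2)] for _ in range(200)]
max_gap = max(abs(g_ratio(u, theta_star, sigma) - g_logistic(u, theta_star, sigma))
              for u in points)
in_range = all(0 < g_logistic(u, theta_star, sigma) < 2 for u in points)
```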

In Example 4, when conditioned on w=1w=1 (i.e., when zz is drawn from N(θ,σ2Id×d)N(\theta^{*},\sigma^{2}I_{d\times d})), the random gradient (θ;z)\nabla\ell(\theta^{*};z) at θ\theta^{*} can be shown to be equal to

((θ;z)|w=1)=u(eu22σ2e2θu22σ2eu22σ2+e2θu22σ2)1g(u)+θ(2e2θu22σ2eu22σ2+e2θu22σ2)g(u),\displaystyle\left(\nabla\ell(\theta^{*};z)|w=1\right)=u\underbrace{\bigg{(}\frac{e^{-\frac{\|u\|^{2}}{2\sigma^{2}}}-e^{-\frac{\|2\theta^{*}-u\|^{2}}{2\sigma^{2}}}}{e^{-\frac{\|u\|^{2}}{2\sigma^{2}}}+e^{-\frac{\|2\theta^{*}-u\|^{2}}{2\sigma^{2}}}}\bigg{)}}_{1-g(u)}+\theta^{*}\underbrace{\bigg{(}\frac{2e^{-\frac{\|2\theta^{*}-u\|^{2}}{2\sigma^{2}}}}{e^{-\frac{\|u\|^{2}}{2\sigma^{2}}}+e^{-\frac{\|2\theta^{*}-u\|^{2}}{2\sigma^{2}}}}\bigg{)}}_{g(u)},

where $u=\theta^{*}-z$ is a random vector drawn from $N(0,\sigma^{2}I_{d\times d})$. And when conditioned on $w=-1$ (i.e., when $z$ is drawn from $N(-\theta^{*},\sigma^{2}I_{d\times d})$), $\nabla\ell(\theta^{*};z)$ can be shown to be equal to

((θ;z)|w=1)=v(ev22σ2e2θv22σ2ev22σ2+e2θv22σ2)1g(v)+θ(2e2θv22σ2ev22σ2+e2θv22σ2)g(v),\displaystyle\left(\nabla\ell(\theta^{*};z)|w=-1\right)=v\underbrace{\bigg{(}\frac{e^{-\frac{\|v\|^{2}}{2\sigma^{2}}}-e^{-\frac{\|2\theta^{*}-v\|^{2}}{2\sigma^{2}}}}{e^{-\frac{\|v\|^{2}}{2\sigma^{2}}}+e^{-\frac{\|2\theta^{*}-v\|^{2}}{2\sigma^{2}}}}\bigg{)}}_{1-g(v)}+\theta^{*}\underbrace{\bigg{(}\frac{2e^{-\frac{\|2\theta^{*}-v\|^{2}}{2\sigma^{2}}}}{e^{-\frac{\|v\|^{2}}{2\sigma^{2}}}+e^{-\frac{\|2\theta^{*}-v\|^{2}}{2\sigma^{2}}}}\bigg{)}}_{g(v)},

where v=θ+zv=\theta^{*}+z is a random vector drawn from N(0,σ2Id×d)N(0,\sigma^{2}I_{d\times d}).

Therefore, we have

[(θ;z)2]\displaystyle\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]
=12𝔼[(θ;z)2|w=1]+12𝔼[(θ;z)2|w=1]\displaystyle=\frac{1}{2}\mathbb{E}\left[\|\nabla\ell(\theta^{*};z)\|^{2}|w=1\right]+\frac{1}{2}{\mathbb{E}\left[\|\nabla\ell(\theta^{*};z)\|^{2}|w=-1\right]}
=12𝔼u[u(1g(u))+θg(u)2]+12𝔼v[v(1g(v))+θg(v)2]\displaystyle=\frac{1}{2}\mathbb{E}_{u}[\|u\cdot(1-g(u))+\theta^{*}\cdot g(u)\|^{2}]+\frac{1}{2}\mathbb{E}_{v}[\|v\cdot(1-g(v))+\theta^{*}\cdot g(v)\|^{2}]
=𝔼u[u(1g(u))+θg(u)2],\displaystyle=\mathbb{E}_{u}[\|u\cdot(1-g(u))+\theta^{*}\cdot g(u)\|^{2}], (B.24)

where the notation 𝔼u\mathbb{E}_{u} means taking expectation with respect to uN(0,σ2Id×d)u\sim N(0,\sigma^{2}I_{d\times d}), and the notation 𝔼v\mathbb{E}_{v} means taking expectation with respect to vN(0,σ2Id×d)v\sim N(0,\sigma^{2}I_{d\times d}).

Since $0<g(u)<2$, we have $|1-g(u)|\leq 1$. Thus

\[
\begin{aligned}
\|u\cdot(1-g(u))+\theta^{*}\cdot g(u)\|^{2} &\leq 2\|u\|^{2}\cdot|1-g(u)|^{2}+2\|\theta^{*}\|^{2}\cdot|g(u)|^{2} \\
&\leq 2\|u\|^{2}+2\|\theta^{*}\|^{2}\cdot g(u)^{2}.
\end{aligned}
\tag{B.25}
\]

From (B.24) and (B.25), we have

[(θ;z)2]2𝔼u[u2]+2θ2𝔼u[g(u)2]\displaystyle\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\leq 2\mathbb{E}_{u}[\|u\|^{2}]+2\|\theta^{*}\|^{2}\mathbb{E}_{u}[g(u)^{2}]
=2σ2d+2θ2𝔼u[g(u)2].\displaystyle=2\sigma^{2}d+2\|\theta^{*}\|^{2}\mathbb{E}_{u}[g(u)^{2}]. (B.26)

Now we know that $u^{T}\theta^{*}$ is a $\|\theta^{*}\|\sigma$-sub-Gaussian random variable with mean $0$. From the sub-Gaussian tail bound (a consequence of Markov’s inequality applied to the exponential moment),

Prob(|uTθ|>12θ2)2exp(14θ4θ2σ2)\displaystyle\textup{Prob}\left(|u^{T}\theta^{*}|>\frac{1}{2}\|\theta^{*}\|^{2}\right)\leq 2\exp(-\frac{\frac{1}{4}\|\theta^{*}\|^{4}}{\|\theta^{*}\|^{2}\sigma^{2}})
=2exp(θ24σ2)8σ2θ2.\displaystyle=\frac{2}{\exp(\frac{\|\theta^{*}\|^{2}}{4\sigma^{2}})}\leq\frac{8\sigma^{2}}{\|\theta^{*}\|^{2}}. (B.27)

When $|u^{T}\theta^{*}|\leq\frac{1}{2}\|\theta^{*}\|^{2}$, we have

\[
g(u)=\frac{2}{e^{\frac{2\|\theta^{*}\|^{2}-2u^{T}\theta^{*}}{\sigma^{2}}}+1}\leq\frac{2}{e^{\frac{\|\theta^{*}\|^{2}}{\sigma^{2}}}}\leq\frac{2}{e^{\frac{\|\theta^{*}\|^{2}}{2\sigma^{2}}}}\leq\frac{4\sigma^{2}}{\|\theta^{*}\|^{2}}.
\]

Since 0<g(u)<20<g(u)<2, when |uTθ|12θ2|u^{T}\theta^{*}|\leq\frac{1}{2}\|\theta^{*}\|^{2}, we have

g(u)28σ2θ2.\displaystyle g(u)^{2}\leq\frac{8\sigma^{2}}{\|\theta^{*}\|^{2}}. (B.28)

As a result,

𝔼u[g(u)2]\displaystyle\mathbb{E}_{u}[g(u)^{2}]
Prob(|uTθ|>12θ2)𝔼[g(u)2||uTθ|>12θ2]\displaystyle\leq\textup{Prob}\left(|u^{T}\theta^{*}|>\frac{1}{2}\|\theta^{*}\|^{2}\right)\mathbb{E}\left[g(u)^{2}\bigg{|}|u^{T}\theta^{*}|>\frac{1}{2}\|\theta^{*}\|^{2}\right]
+Prob(|uTθ|12θ2)𝔼[g(u)2||uTθ|12θ2]\displaystyle\ \ \ +\textup{Prob}\left(|u^{T}\theta^{*}|\leq\frac{1}{2}\|\theta^{*}\|^{2}\right)\mathbb{E}\left[g(u)^{2}\bigg{|}|u^{T}\theta^{*}|\leq\frac{1}{2}\|\theta^{*}\|^{2}\right]
4Prob(|uTθ|>12θ2)+8σ2θ2\displaystyle\leq 4\cdot\textup{Prob}\left(|u^{T}\theta^{*}|>\frac{1}{2}\|\theta^{*}\|^{2}\right)+\frac{8\sigma^{2}}{\|\theta^{*}\|^{2}}
40σ2θ2,\displaystyle\leq\frac{40\sigma^{2}}{\|\theta^{*}\|^{2}},

where the second inequality is due to the fact $0<g(u)<2$ and (B.28), and the last inequality is due to (B.27). Combining the above result with (B.26), we have

[(θ;z)2](2d+40)σ2.\displaystyle\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\leq(2d+40)\sigma^{2}. (B.29)
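The bound (B.29) can be compared against a Monte Carlo estimate of $\mathbb{E}_{u}[\|u\cdot(1-g(u))+\theta^{*}\cdot g(u)\|^{2}]$ from (B.24). The sketch below is only an illustrative check under hypothetical choices of $\theta^{*}$, $\sigma$, sample size, and seed; it does not replace the analytic argument, and the helper name is not from the paper.

```python
import math
import random


def gradient_sq_norm(u, theta_star, sigma):
    """||u * (1 - g(u)) + theta_star * g(u)||^2 with g in its logistic form."""
    norm_sq = sum(t * t for t in theta_star)
    inner = sum(t * x for t, x in zip(theta_star, u))
    g = 2 / (math.exp((2 * norm_sq - 2 * inner) / sigma ** 2) + 1)
    return sum((x * (1 - g) + t * g) ** 2 for t, x in zip(theta_star, u))


rng = random.Random(2)
theta_star, sigma = [3.0, 0.0], 1.0  # hypothetical values
d = len(theta_star)
num_samples = 20000

total = 0.0
for _ in range(num_samples):
    u = [rng.gauss(0.0, sigma) for _ in range(d)]  # u ~ N(0, sigma^2 I)
    total += gradient_sq_norm(u, theta_star, sigma)

estimate = total / num_samples
analytic_bound = (2 * d + 40) * sigma ** 2  # right-hand side of (B.29)
```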
Bounding [(θ;z)2]\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}] for Example 5:

we define the function g:×d(0,2)g:\mathbb{R}\times\mathbb{R}^{d}\rightarrow(0,2) as

g(u,x)=2e(2xTθu)22σ2eu22σ2+e(2xTθu)22σ2=2e2(xTθ)22u(xTθ)σ2+1.\displaystyle g(u,x)=\frac{2e^{-\frac{(2x^{T}\theta^{*}-u)^{2}}{2\sigma^{2}}}}{e^{-\frac{u^{2}}{2\sigma^{2}}}+e^{-\frac{(2x^{T}\theta^{*}-u)^{2}}{2\sigma^{2}}}}=\frac{2}{e^{\frac{2(x^{T}\theta^{*})^{2}-2u(x^{T}\theta^{*})}{\sigma^{2}}}+1}.

In Example 5, we have

((θ;z)|w=1,x)=[u(eu22σ2e(2xTθu)22σ2eu22σ2+e(2xTθu)22σ2)1g(u,x)+(xTθ)(2e(2xTθu)22σ2eu22σ2+e(2xTθu)22σ2)g(u,x)]x,\displaystyle\left(\nabla\ell(\theta^{*};z)|w=1,x\right)=\left[u\underbrace{\bigg{(}\frac{e^{-\frac{u^{2}}{2\sigma^{2}}}-e^{-\frac{(2x^{T}\theta^{*}-u)^{2}}{2\sigma^{2}}}}{e^{-\frac{u^{2}}{2\sigma^{2}}}+e^{-\frac{(2x^{T}\theta^{*}-u)^{2}}{2\sigma^{2}}}}\bigg{)}}_{1-g(u,x)}+(x^{T}\theta^{*})\underbrace{\bigg{(}\frac{2e^{-\frac{(2x^{T}\theta^{*}-u)^{2}}{2\sigma^{2}}}}{e^{-\frac{u^{2}}{2\sigma^{2}}}+e^{-\frac{(2x^{T}\theta^{*}-u)^{2}}{2\sigma^{2}}}}\bigg{)}}_{g(u,x)}\right]x,

where $u=x^{T}\theta^{*}-y$ is a random variable drawn from $N(0,\sigma^{2})$. And we have

\[
\left(\nabla\ell(\theta^{*};z)\,|\,w=-1,x\right)=\left[v\underbrace{\bigg(\frac{e^{-\frac{v^{2}}{2\sigma^{2}}}-e^{-\frac{(2x^{T}\theta^{*}-v)^{2}}{2\sigma^{2}}}}{e^{-\frac{v^{2}}{2\sigma^{2}}}+e^{-\frac{(2x^{T}\theta^{*}-v)^{2}}{2\sigma^{2}}}}\bigg)}_{1-g(v,x)}+(x^{T}\theta^{*})\underbrace{\bigg(\frac{2e^{-\frac{(2x^{T}\theta^{*}-v)^{2}}{2\sigma^{2}}}}{e^{-\frac{v^{2}}{2\sigma^{2}}}+e^{-\frac{(2x^{T}\theta^{*}-v)^{2}}{2\sigma^{2}}}}\bigg)}_{g(v,x)}\right]x,
\]

where $v=x^{T}\theta^{*}+y$ is a random variable drawn from $N(0,\sigma^{2})$.

Therefore, we have

\[
\begin{aligned}
\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}] &=\frac{1}{2}\mathbb{E}\left[\|\nabla\ell(\theta^{*};z)\|^{2}\,|\,w=1\right]+\frac{1}{2}\mathbb{E}\left[\|\nabla\ell(\theta^{*};z)\|^{2}\,|\,w=-1\right] \\
&=\frac{1}{2}\mathbb{E}_{x}\left[\|x\|^{2}\,\mathbb{E}_{u}[(u\cdot(1-g(u,x))+(x^{T}\theta^{*})\cdot g(u,x))^{2}\,|\,x]\right]+\frac{1}{2}\mathbb{E}_{x}\left[\|x\|^{2}\,\mathbb{E}_{v}[(v\cdot(1-g(v,x))+(x^{T}\theta^{*})\cdot g(v,x))^{2}\,|\,x]\right] \\
&=\mathbb{E}_{x}\left[\|x\|^{2}\,\mathbb{E}_{u}[(u\cdot(1-g(u,x))+(x^{T}\theta^{*})\cdot g(u,x))^{2}\,|\,x]\right],
\end{aligned}
\tag{B.30}
\]

where the notation $\mathbb{E}_{u}$ means taking expectation with respect to $u\sim N(0,\sigma^{2})$, the notation $\mathbb{E}_{v}$ means taking expectation with respect to $v\sim N(0,\sigma^{2})$, and the notation $\mathbb{E}_{x}$ means taking expectation with respect to $x\sim N(0,I_{d\times d})$.

Similar to the last part (i.e., the proof of (B.29)), we can prove

\[
\mathbb{E}_{u}[(u\cdot(1-g(u,x))+(x^{T}\theta^{*})\cdot g(u,x))^{2}\,|\,x]\leq 42\sigma^{2},\quad\forall x.
\]

Combining this result with (B.30), we obtain

[(θ;z)2]42σ2𝔼x[x2]=42σ2d.\displaystyle\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}]\leq 42\sigma^{2}\mathbb{E}_{x}[\|x\|^{2}]=42\sigma^{2}d.

This gives an upper bound on [(θ;z)2]\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}] in Example 5.

Given that we have upper bounded [(θ;z)2]\mathbb{P}[\|\nabla\ell(\theta^{*};z)\|^{2}] by 42σ2d42\sigma^{2}d in both Example 4 and Example 5, it is straightforward to prove the generalization error bound in Corollary 7. \square

B.8 Auxiliary definitions and lemmata

Definition 6 (Orlicz norms, sub-Gaussian, sub-exponential).

For every $\alpha\in(0,+\infty)$ we define the Orlicz-$\alpha$ norm of a random variable $u$:

uOrliczα=inf{K>0:𝔼exp((|u|/K)α)2}.\displaystyle\|u\|_{\text{Orlicz}_{\alpha}}=\inf\{K>0:\mathbb{E}\exp\big{(}(|u|/K)^{\alpha}\big{)}\leq 2\}.

A random variable/vector XdX\in\mathbb{R}^{d} is KK-sub-Gaussian if λd\forall\lambda\in\mathbb{R}^{d}, we have

λTXOrlicz2Kλ2.\displaystyle\|\lambda^{T}X\|_{\text{Orlicz}_{2}}\leq K\|\lambda\|_{2}.

A random variable/vector XdX\in\mathbb{R}^{d} is KK-sub-exponential if λd\forall\lambda\in\mathbb{R}^{d}, we have

λTXOrlicz1Kλ2.\displaystyle\|\lambda^{T}X\|_{\text{Orlicz}_{1}}\leq K\|\lambda\|_{2}.
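For a random variable with finite support, the Orlicz-$\alpha$ norm in Definition 6 can be computed numerically, since $K\mapsto\mathbb{E}\exp((|u|/K)^{\alpha})$ is decreasing in $K$. The following is a minimal sketch using bisection; the function name `orlicz_norm` is hypothetical, and the code assumes that at least one support point is nonzero. For a Rademacher variable and $\alpha=2$ the defining condition reads $\exp(1/K^{2})\leq 2$, so the norm equals $1/\sqrt{\log 2}$.

```python
import math


def orlicz_norm(values, probs, alpha, tol=1e-12):
    """Orlicz-alpha norm of a finite-support random variable, via bisection on K.

    Assumes at least one entry of `values` is nonzero.
    """
    def moment(K):
        return sum(p * math.exp((abs(v) / K) ** alpha) for v, p in zip(values, probs))

    # Bracket the smallest K with moment(K) <= 2.
    hi = max(abs(v) for v in values)
    while moment(hi) > 2.0:
        hi *= 2.0
    lo = hi
    while moment(lo) <= 2.0:
        lo /= 2.0

    # Invariant: moment(lo) > 2 >= moment(hi).
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if moment(mid) <= 2.0:
            hi = mid
        else:
            lo = mid
    return hi


# Rademacher variable: +1 or -1 with probability 1/2 each.
rademacher_norm = orlicz_norm([-1.0, 1.0], [0.5, 0.5], alpha=2)
```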
Lemma 10 (Bernstein’s inequality for sub-exponential random variables).

If $X_{1},\cdots,X_{n}$ are sub-exponential random variables, then Bernstein’s inequality (the inequality (A.48) in Lemma 6) holds with

σ2=1ni=1nXiOrlicz12,B=max1inXiOrlicz1.\displaystyle\sigma^{2}=\frac{1}{n}\sum_{i=1}^{n}\|X_{i}\|_{\textup{Orlicz}_{1}}^{2},\quad B=\max_{1\leq i\leq n}\|X_{i}\|_{\textup{Orlicz}_{1}}.
Lemma 11 (vector Bernstein’s inequality, [59, 60] ).

Let $\{X_{i}\}_{i=1}^{n}$ be a sequence of i.i.d. random variables taking values in a real separable Hilbert space. Assume that $\mathbb{E}[X_{i}]=\mu$ and $\mathbb{E}[\|X_{i}-\mu\|^{2}]=\sigma^{2}$ for all $1\leq i\leq n$. We say that the vector Bernstein condition with parameter $B$ holds if for all $1\leq i\leq n$,

\[
\mathbb{E}[\|X_{i}-\mu\|^{k}]\leq\frac{1}{2}k!\,\sigma^{2}B^{k-2},\quad\forall k\geq 2.
\]

If this condition holds, then for all δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta we have

1ni=1nXiμ2σ2log2δn+Blog2δn.\displaystyle\left\|\frac{1}{n}\sum_{i=1}^{n}X_{i}-\mu\right\|\leq\sqrt{\frac{2\sigma^{2}\log{\frac{2}{\delta}}}{n}}+\frac{B\log{\frac{2}{\delta}}}{n}.
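As an illustration of Lemma 11, the sketch below simulates i.i.d. vectors drawn uniformly from the unit circle in $\mathbb{R}^{2}$. For these vectors $\mu=0$, $\sigma^{2}=1$, and $\|X_{i}\|=1$, so the moment condition holds with $B=1$. The code estimates how often the norm of the sample mean exceeds the radius given by the lemma. The function names, sample size, number of trials, and seed are hypothetical.

```python
import math
import random


def bernstein_radius(sigma_sq, B, n, delta):
    """Right-hand side of the vector Bernstein bound."""
    log_term = math.log(2.0 / delta)
    return math.sqrt(2.0 * sigma_sq * log_term / n) + B * log_term / n


def mean_norm_of_unit_vectors(n, rng):
    """Norm of the average of n vectors drawn uniformly from the unit circle."""
    sx = sy = 0.0
    for _ in range(n):
        phi = rng.uniform(0.0, 2.0 * math.pi)
        sx += math.cos(phi)
        sy += math.sin(phi)
    return math.hypot(sx / n, sy / n)


rng = random.Random(0)
n, delta, trials = 200, 0.1, 500
radius = bernstein_radius(sigma_sq=1.0, B=1.0, n=n, delta=delta)
failures = sum(mean_norm_of_unit_vectors(n, rng) > radius for _ in range(trials))
failure_rate = failures / trials
```

The lemma guarantees that the true failure probability is at most $\delta$; the simulated frequency is only an estimate of it.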

The following definitions and lemmata provide some background on generic chaining.

Definition 7 (Orlicz-α\alpha processes).

Let {Xf}fF\{X_{f}\}_{f\in\pazocal{F}} be a sequence of random variables. {Xf}fF\{X_{f}\}_{f\in\pazocal{F}} is called an Orlicz-α\alpha process for a metric metr(,)\texttt{metr}(\cdot,\cdot) on F\pazocal{F} if

Xf1Xf2Orliczαmetr(f1,f2),f1,f2F.\displaystyle\|X_{f_{1}}-X_{f_{2}}\|_{\text{Orlicz}_{\alpha}}\leq\texttt{metr}(f_{1},f_{2}),\forall f_{1},f_{2}\in\pazocal{F}.

In particular, an Orlicz-2 process is called a “process with sub-Gaussian increments” and an Orlicz-1 process is called a “process with sub-exponential increments”.

Definition 8 (mixed sub-Gaussian-sub-exponential increments, [11]).

We say a process (Xθ)θΘ(X_{\theta})_{\theta\in\Theta} has mixed sub-Gaussian-sub-exponential increments with respect to the pair (metr1,metr2)(\texttt{metr}_{1},\texttt{metr}_{2}) if for all θ1,θ2Θ\theta_{1},\theta_{2}\in\Theta,

Prob(Xθ1Xθ2umetr2(θ1,θ2)+umetr1(θ1,θ2))2eu,u0.\displaystyle\text{Prob}\left(\|X_{\theta_{1}}-X_{\theta_{2}}\|\geq\sqrt{u}\cdot\texttt{metr}_{2}(\theta_{1},\theta_{2})+u\cdot\texttt{metr}_{1}(\theta_{1},\theta_{2})\right)\leq 2e^{-u},\forall u\geq 0.
Definition 9 (Talagrand’s γα\gamma_{\alpha}-functional).

A sequence $F=(\pazocal{F}_{n})_{n\geq 0}$ of subsets of $\pazocal{F}$ is called admissible if $|\pazocal{F}_{0}|=1$ and $|\pazocal{F}_{n}|\leq 2^{2^{n}}$ for all $n\geq 1$. For any $0<\alpha<\infty$, the $\gamma_{\alpha}$-functional of $(\pazocal{F},\texttt{metr})$ is defined by

\[
\gamma_{\alpha}(\pazocal{F},\texttt{metr})=\inf_{F}\sup_{f\in\pazocal{F}}\sum_{n=0}^{\infty}2^{\frac{n}{\alpha}}\,\texttt{metr}(f,\pazocal{F}_{n}),
\]

where the infimum is taken over all admissible sequences and we write $\texttt{metr}(f,\pazocal{F}_{n})=\inf_{s\in\pazocal{F}_{n}}\texttt{metr}(f,s)$.

Lemma 12 (Dudley’s integral bound for γα\gamma_{\alpha} functional, [72]).

There exists a constant $C_{\alpha}$ depending only on $\alpha$ such that

γα(F,metr)Cα0+(logN(ε,F,metr))1αdε.\displaystyle\gamma_{\alpha}(\pazocal{F},\texttt{{metr}})\leq C_{\alpha}\int_{0}^{+\infty}(\log N(\varepsilon,\pazocal{F},\texttt{{metr}}))^{\frac{1}{\alpha}}d\varepsilon.
Lemma 13 (generic chaining for a process with mixed tail increments, [11]).

If $(X_{f})_{f\in\pazocal{F}}$ has mixed sub-Gaussian-sub-exponential increments with respect to the pair $(\texttt{metr}_{1},\texttt{metr}_{2})$, then there are absolute constants $c,C>0$ such that for all $\delta\in(0,1)$,

\[
\begin{aligned}
\sup_{f\in\pazocal{F}}\|X_{f}-X_{f_{0}}\|\leq{}&C\left(\gamma_{2}(\pazocal{F},\texttt{metr}_{2})+\gamma_{1}(\pazocal{F},\texttt{metr}_{1})\right) \\
&+c\left(\sqrt{\log\frac{1}{\delta}}\sup_{f_{1},f_{2}\in\pazocal{F}}[\texttt{metr}_{2}(f_{1},f_{2})]+\log\frac{1}{\delta}\sup_{f_{1},f_{2}\in\pazocal{F}}[\texttt{metr}_{1}(f_{1},f_{2})]\right),
\end{aligned}
\]

with probability at least $1-\delta$.

Appendix C Proofs for Section 8

C.1 Proof of Theorem 8

The proof consists of five parts. Among them, the main purpose of Part I and Part IV is to localize the strong convexity parameter. When there is no need to localize the strong convexity parameter (e.g., when one uses the square cost), the proof can be simplified: Part I and Part IV will be quite straightforward, and all the “upper-side” truncation analysis related to $\frac{2\|\xi\|_{L_{2}}}{\sqrt{c_{\kappa}}}$, $\frac{4\|\xi\|^{2}_{L_{2}}}{c_{\kappa}}$ or $\frac{4\|\xi\|^{2}_{L_{2}}}{\kappa^{2}c_{\kappa}}$ will be unnecessary.

Part I: analysis of the concentrated functions.

Denote T(h)=hhL22T(h)=\|h-h^{*}\|^{2}_{L_{2}} and

vh=min{κhhL2,2ξL2cκ}.\displaystyle v_{h}=\min\left\{\kappa\|h-h^{*}\|_{L_{2}},\frac{2\|\xi\|_{L_{2}}}{\sqrt{c_{\kappa}}}\right\}.

For every hHh\in\pazocal{H}, define

fh(x,y)=2α(4ξL2/cκ)1sv(h(x),y)(h(x)h(x)),\displaystyle f_{h}(x,y)=\frac{2}{\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)}\partial_{1}\ell_{\textup{{sv}}}(h^{*}(x),y)(h(x)-h^{*}(x)),
gh(x,y)=min{(h(x)h(x))2,vh2}𝟙{|ξ|2ξL2cκ}.\displaystyle g_{h}(x,y)=\min\left\{(h(x)-h^{*}(x))^{2},v_{h}^{2}\right\}\cdot\mathds{1}\left\{|\xi|\leq\frac{2\|\xi\|_{L_{2}}}{\sqrt{c_{\kappa}}}\right\}.

One can view ghg_{h} as a truncated version of the quadratic form (h(x)h(x))2(h(x)-h^{*}(x))^{2}. Later we will use concentration to control (n)(fh+gh)(\mathbb{P}-\mathbb{P}_{n})(f_{h}+g_{h}) uniformly.

From Lemma 14 (whose statement we defer to the end of Section C.1), we can show

sv(h(x),y)sv(h(x),y)1sv(h(x),y)(h(x)h(x))α(2vh)2min{(h(x)h(x))2,vh2}\displaystyle\ell_{\textup{{sv}}}(h(x),y)-\ell_{\textup{{sv}}}(h^{*}(x),y)-\partial_{1}\ell_{\textup{{sv}}}(h^{*}(x),y)(h(x)-h^{*}(x))\geq\frac{{\alpha(2v_{h})}}{2}\min\left\{(h(x)-h^{*}(x))^{2},v_{h}^{2}\right\}
α(4ξL2/cκ)2gh(x,y).\displaystyle\geq\frac{\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)}{2}g_{h}(x,y).

The above inequality implies that

n(fh+gh)2α(4ξL2/cκ)n[sv(h(x),y)sv(h(x),y)].\displaystyle\mathbb{P}_{n}(f_{h}+g_{h})\leq\frac{2}{\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)}\mathbb{P}_{n}[\ell_{\textup{{sv}}}(h(x),y)-\ell_{\textup{{sv}}}(h^{*}(x),y)]. (C.1)

Recall that $\xi=h^{*}(x)-y$. By Markov’s inequality applied to $\xi^{2}$,

\[
\text{Prob}\left(|\xi|\geq\frac{2\|\xi\|_{L_{2}}}{\sqrt{c_{\kappa}}}\right)\leq\frac{c_{\kappa}}{4}.
\tag{C.2}
\]

From the definition of $g_{h}$ and $v_{h}$, it is straightforward to show that

\[
\begin{aligned}
\mathbb{P}g_{h} &=\mathbb{P}\left[\min\left\{(h(x)-h^{*}(x))^{2},v_{h}^{2}\right\}\cdot\mathds{1}\left\{|\xi|\leq\frac{2\|\xi\|_{L_{2}}}{\sqrt{c_{\kappa}}}\right\}\right] \\
&\geq\mathbb{P}\left[\min\left\{(h(x)-h^{*}(x))^{2},v_{h}^{2}\right\}\cdot\mathds{1}\left\{|h(x)-h^{*}(x)|\geq\kappa\|h-h^{*}\|_{L_{2}}\right\}\cdot\mathds{1}\left\{|\xi|\leq\frac{2\|\xi\|_{L_{2}}}{\sqrt{c_{\kappa}}}\right\}\right] \\
&\geq\mathbb{P}\left[v_{h}^{2}\cdot\mathds{1}\left\{|h(x)-h^{*}(x)|\geq\kappa\|h-h^{*}\|_{L_{2}}\right\}\cdot\mathds{1}\left\{|\xi|\leq\frac{2\|\xi\|_{L_{2}}}{\sqrt{c_{\kappa}}}\right\}\right] \\
&=v_{h}^{2}\cdot\text{Prob}\left(|h(x)-h^{*}(x)|\geq\kappa\|h-h^{*}\|_{L_{2}},\ |\xi|\leq\frac{2\|\xi\|_{L_{2}}}{\sqrt{c_{\kappa}}}\right) \\
&\geq v_{h}^{2}\cdot\left(\text{Prob}\left(|h(x)-h^{*}(x)|\geq\kappa\|h-h^{*}\|_{L_{2}}\right)-\text{Prob}\left(|\xi|>\frac{2\|\xi\|_{L_{2}}}{\sqrt{c_{\kappa}}}\right)\right) \\
&\geq\frac{3c_{\kappa}}{4}v_{h}^{2},
\end{aligned}
\]

where the first inequality is due to 1𝟙{|h(x)h(x)|κhhL2}1\geq\mathds{1}\left\{|h(x)-h^{*}(x)|\geq\kappa\|h-h^{*}\|_{L_{2}}\right\}; the second inequality is due to the definition of vhv_{h}; and the last inequality is due to Assumption 9 and (C.2). From Assumption 8, we have

(fh+gh)gh3cκ4vh2.\displaystyle\mathbb{P}(f_{h}+g_{h})\geq\mathbb{P}g_{h}\geq\frac{3c_{\kappa}}{4}v_{h}^{2}. (C.3)

Let us summarize the results from this part. We use the empirical average of the excess loss to upper bound n(fh+gh)\mathbb{P}_{n}(f_{h}+g_{h}) in (C.1), and use the (truncated) quadratic form to lower bound (fh+gh)\mathbb{P}(f_{h}+g_{h}) in (C.3). The next steps are to prove concentration of fhf_{h} and ghg_{h} and establish a “uniform localized convergence” argument.

Part II: bound the localized empirical process.

Given r>0r>0, we want to bound the localized empirical process

suprλT(h)r(n)(fh+gh)\sup_{\frac{r}{\lambda}\leq T(h)\leq r}(\mathbb{P}-\mathbb{P}_{n})(f_{h}+g_{h})

where $\lambda>1$ is a fixed value that we will specify later. From the definition of $\varphi_{\text{noise}}(r;\delta)$ in Assumption 10, for any $\delta\in(0,1)$, with probability at least $1-\frac{\delta}{2}$, we have

suprλT(h)r(n)(fh+gh)suprλT(h)r(n)gh+2α(4ξL2/cκ)φnoise(r;δ2).\displaystyle\sup_{\frac{r}{\lambda}\leq T(h)\leq r}(\mathbb{P}-\mathbb{P}_{n})(f_{h}+g_{h})\leq\sup_{\frac{r}{\lambda}\leq T(h)\leq r}(\mathbb{P}-\mathbb{P}_{n})g_{h}+\frac{2}{\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)}\varphi_{\text{noise}}\left(r;\frac{\delta}{2}\right). (C.4)

Given r>0r>0, denote the hypothesis class H(rλ,r)={hH:rλT(h)r}\pazocal{H}(\frac{r}{\lambda},r)=\{h\in\pazocal{H}:\frac{r}{\lambda}\leq T(h)\leq r\}, and define the function gh,rg_{h,r} as

gh,r(x,y)=min{(h(x)h(x))2,κ2r,4ξL22cκ}𝟙{|ξ|2ξL2cκ}.g_{h,r}(x,y)=\min\left\{(h(x)-h^{*}(x))^{2},\kappa^{2}r,\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}\cdot\mathds{1}\left\{|\xi|\leq\frac{2\|\xi\|_{L_{2}}}{\sqrt{c_{\kappa}}}\right\}.

Recall that ghg_{h} is defined by

gh(x,y)=min{(h(x)h(x))2,κ2T(h),4ξL22cκ}𝟙{|ξ|2ξL2cκ}.\displaystyle g_{h}(x,y)=\min\left\{(h(x)-h^{*}(x))^{2},\kappa^{2}T(h),\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}\cdot\mathds{1}\left\{|\xi|\leq\frac{2\|\xi\|_{L_{2}}}{\sqrt{c_{\kappa}}}\right\}.

For every hH(rλ,r)h\in\pazocal{H}(\frac{r}{\lambda},r) and any (x,y)X×Y(x,y)\in\pazocal{X}\times\pazocal{Y}, we have

gh,rλ(x,y)gh(x,y)gh,r(x,y),\displaystyle g_{h,\frac{r}{\lambda}}(x,y)\leq g_{h}(x,y)\leq g_{h,r}(x,y), (C.5)

and

\[
\begin{aligned}
g_{h,r}(x,y)-g_{h,\frac{r}{\lambda}}(x,y) &\leq\min\left\{(h(x)-h^{*}(x))^{2},\kappa^{2}r,\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}-\min\left\{(h(x)-h^{*}(x))^{2},\frac{\kappa^{2}r}{\lambda},\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\} \\
&\leq\min\left\{\kappa^{2}r,\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}-\min\left\{\frac{\kappa^{2}r}{\lambda},\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\} \\
&\leq\left(1-\frac{1}{\lambda}\right)\min\left\{\kappa^{2}r,\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}.
\end{aligned}
\tag{C.6}
\]
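The last inequality in (C.6) rests on the elementary fact that $\min\{a/\lambda,A\}\geq\frac{1}{\lambda}\min\{a,A\}$ for $\lambda>1$ and $a,A>0$, where $a$ stands for $\kappa^{2}r$ and $A$ for $\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}$. The following sketch checks the resulting bound on random inputs; the function names and sampling ranges are hypothetical.

```python
import random


def truncation_gap(a, A, lam):
    """min{a, A} - min{a / lam, A}, the middle expression in (C.6)."""
    return min(a, A) - min(a / lam, A)


def gap_bound(a, A, lam):
    """(1 - 1/lam) * min{a, A}, the final bound in (C.6)."""
    return (1.0 - 1.0 / lam) * min(a, A)


rng = random.Random(3)
violations = 0
for _ in range(10000):
    a = rng.uniform(0.01, 10.0)    # stands for kappa^2 * r
    A = rng.uniform(0.01, 10.0)    # stands for 4 * ||xi||^2 / c_kappa
    lam = rng.uniform(1.01, 5.0)   # any lambda > 1
    if truncation_gap(a, A, lam) > gap_bound(a, A, lam) + 1e-12:
        violations += 1
```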

From (C.5) and (C.6), for every $h\in\pazocal{H}(\frac{r}{\lambda},r)$ and any $(x,y)\in\pazocal{X}\times\pazocal{Y}$,

(11λ)min{κ2r,4ξL22cκ}gh(x,y)gh,r(x,y)0,-\left(1-\frac{1}{\lambda}\right){\min\left\{\kappa^{2}r,\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}}\leq g_{h}(x,y)-g_{h,r}(x,y)\leq 0,

which implies

(n)gh(n)gh,r+(11λ)min{κ2r,4ξL22cκ}.\displaystyle(\mathbb{P}-\mathbb{P}_{n})g_{h}\leq(\mathbb{P}-\mathbb{P}_{n})g_{h,r}+\left(1-\frac{1}{\lambda}\right){\min\left\{\kappa^{2}r,\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}}.

As a result, we have

\[
\begin{aligned}
\sup_{\frac{r}{\lambda}\leq T(h)\leq r}(\mathbb{P}-\mathbb{P}_{n})g_{h} &\leq\sup_{\frac{r}{\lambda}\leq T(h)\leq r}(\mathbb{P}-\mathbb{P}_{n})g_{h,r}+\left(1-\frac{1}{\lambda}\right)\min\left\{\kappa^{2}r,\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\} \\
&\leq\sup_{T(h)\leq r}(\mathbb{P}-\mathbb{P}_{n})g_{h,r}+\left(1-\frac{1}{\lambda}\right)\min\left\{\kappa^{2}r,\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}.
\end{aligned}
\tag{C.7}
\]

We know that $g_{h,r}$ takes values in $\left[0,\min\left\{\kappa^{2}r,\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}\right]$. From the standard bound for global Rademacher complexity [79], for all $\delta\in(0,1)$, with probability at least $1-\frac{\delta}{2}$,

supT(h)r(n)gh,r2{gh,r:T(h)r}+min{κ2r,4ξL22cκ}log2δ2n.\displaystyle\sup_{T(h)\leq r}(\mathbb{P}-\mathbb{P}_{n})g_{h,r}\leq 2\mathfrak{R}\{g_{h,r}:T(h)\leq r\}+\min\left\{\kappa^{2}r,\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}\sqrt{\frac{\log\frac{2}{\delta}}{2n}}. (C.8)

It is straightforward to verify that for all h1,h2Hh_{1},h_{2}\in\pazocal{H} and (x,y)X×Y(x,y)\in\pazocal{X}\times\pazocal{Y},

|gh1,r(x)gh2,r(x)|2κr|h1(x)h2(x)|.\displaystyle|g_{h_{1},r}(x)-g_{h_{2},r}(x)|\leq 2\kappa\sqrt{r}|h_{1}(x)-h_{2}(x)|.

From the Lipschitz contraction property of Rademacher complexity (see, e.g., [48, Theorem 7]), we have

{gh,r}2κr{h:T(h)r}2κrφ(r),\displaystyle\mathfrak{R}\{g_{h,r}\}\leq 2\kappa\sqrt{r}\mathfrak{R}\{h:T(h)\leq r\}\leq 2\kappa\sqrt{r}\varphi(r), (C.9)

where φ(r)\varphi(r) is defined in Assumption 10. Define the ψ\psi function as

ψ(r;δ)=4κrφ(r)+(log2δ2n+11λ)min{κ2r,4ξL22cκ}+2α(4ξL2/cκ)φnoise(r;δ2).\displaystyle\psi(r;\delta)=4\kappa\sqrt{r}\varphi(r)+\left(\sqrt{\frac{\log\frac{2}{\delta}}{2n}}+1-\frac{1}{\lambda}\right)\min\left\{{\kappa^{2}}r,\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}+\frac{2}{\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)}\varphi_{\text{noise}}\left(r;\frac{\delta}{2}\right). (C.10)

Combining the definition (C.10) with (C.4), (C.7), (C.8) and (C.9), for any $\delta\in(0,1)$, with probability at least $1-\delta$, we have

\[
\sup_{\frac{r}{\lambda}\leq T(h)\leq r}(\mathbb{P}-\mathbb{P}_{n})(f_{h}+g_{h})\leq\psi(r;\delta).
\tag{C.11}
\]
Part III: the “uniform localized convergence” argument.

Applying Proposition 3, for any $\delta_{1}\in(0,1)$ and $r_{0}\in(0,4\Delta^{2})$, with probability at least $1-\delta_{1}$, for all $h\in\pazocal{H}$, either $T(h)\leq r_{0}$ or

\[
\begin{aligned}
(\mathbb{P}-\mathbb{P}_{n})(f_{h}+g_{h}) &\leq\psi\left(\lambda T(h);\delta_{1}\Big(\log_{\lambda}\frac{4\lambda\Delta^{2}}{r_{0}}\Big)^{-1}\right) \\
&=4\kappa\sqrt{\lambda T(h)}\varphi(\lambda T(h))+\frac{2}{\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)}\varphi_{\text{noise}}\left(\lambda T(h);\frac{\delta_{1}}{2\log_{\lambda}\frac{4\lambda\Delta^{2}}{r_{0}}}\right) \\
&\quad+\underbrace{\min\left\{\lambda\kappa^{2}T(h),\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}\left(\sqrt{\frac{\log\frac{2\log_{\lambda}\frac{4\lambda\Delta^{2}}{r_{0}}}{\delta_{1}}}{2n}}+1-\frac{1}{\lambda}\right)}_{\text{last term in (C.12)}}.
\end{aligned}
\tag{C.12}
\]

We specify

\[
\lambda=\frac{8+2c_{\kappa}}{8+c_{\kappa}}.
\]

Then when $n>\frac{32}{c_{\kappa}^{2}}\log\frac{2\log_{\lambda}\frac{4\lambda\Delta^{2}}{r_{0}}}{\delta_{1}}$, for all $h\in\pazocal{H}$ we have

\[
\lambda\left(\sqrt{\frac{\log\frac{2\log_{\lambda}\frac{4\lambda\Delta^{2}}{r_{0}}}{\delta_{1}}}{2n}}+1-\frac{1}{\lambda}\right)<\frac{c_{\kappa}}{4},
\]

which implies that, when $T(h)>0$,

\[
\text{last term in (C.12)}<\frac{c_{\kappa}}{4}\min\left\{\kappa^{2}T(h),\frac{4\|\xi\|_{L_{2}}^{2}}{\lambda c_{\kappa}}\right\}\leq\frac{c_{\kappa}}{4}\min\left\{\kappa^{2}T(h),\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}. \tag{C.13}
\]
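The bound on the bracketed factor can be checked directly from the choice of $\lambda$. The following worked step is a sketch, where $L$ is shorthand (introduced here only for illustration) for the logarithmic term $\log\frac{2\log_{\lambda}\frac{4\lambda\Delta^{2}}{r_{0}}}{\delta_{1}}$. The sample-size condition $n>\frac{32}{c_{\kappa}^{2}}L$ gives $\sqrt{L/(2n)}<c_{\kappa}/8$, and $\lambda-1=\frac{c_{\kappa}}{8+c_{\kappa}}$, so

\[
\lambda\left(\sqrt{\frac{L}{2n}}+1-\frac{1}{\lambda}\right)
=\lambda\sqrt{\frac{L}{2n}}+(\lambda-1)
<\frac{8+2c_{\kappa}}{8+c_{\kappa}}\cdot\frac{c_{\kappa}}{8}+\frac{c_{\kappa}}{8+c_{\kappa}}
=\frac{c_{\kappa}(16+2c_{\kappa})}{8(8+c_{\kappa})}
=\frac{c_{\kappa}}{4}.
\]

Dividing by $\lambda$ and multiplying by $\min\{\lambda\kappa^{2}T(h),4\|\xi\|_{L_{2}}^{2}/c_{\kappa}\}$ then yields the first inequality in (C.13), since $\frac{1}{\lambda}\min\{\lambda a,b\}=\min\{a,b/\lambda\}$ for $a,b\geq 0$.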

Denote $C_{r_{0}}=2+\left(\frac{16}{c_{\kappa}}+2\right)\log\frac{4\Delta^{2}}{r_{0}}$. Then

\[
2\log_{\lambda}\frac{4\lambda\Delta^{2}}{r_{0}}=2+\frac{\log\frac{4\Delta^{2}}{r_{0}}}{\log\lambda}\leq 2+\left(\frac{16}{c_{\kappa}}+2\right)\log\frac{4\Delta^{2}}{r_{0}}=C_{r_{0}}.
\]

For any $\delta\in(0,1)$, taking $\delta_{1}=\frac{2\log_{\lambda}\frac{4\lambda\Delta^{2}}{r_{0}}}{C_{r_{0}}}\delta$, from (C.12), (C.13) and the fact that $\lambda<2$, we reach the following conclusion: when $n>\frac{32}{c_{\kappa}^{2}}\log\frac{C_{r_{0}}}{\delta}$, with probability at least $1-\delta$, for all $h\in\pazocal{H}$, either $T(h)\leq r_{0}$ or

\[
\begin{aligned}
(\mathbb{P}-\mathbb{P}_{n})(f_{h}+g_{h})
&<4\kappa\sqrt{2T(h)}\varphi(2T(h))+\frac{2}{\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)}\varphi_{\text{noise}}\left(2T(h);\frac{\delta}{C_{r_{0}}}\right)\\
&\quad+\frac{c_{\kappa}}{4}\min\left\{\kappa^{2}T(h),\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}.
\end{aligned} \tag{C.14}
\]

Let $\hat{h}\in\operatorname*{arg\,min}\mathbb{P}_{n}\ell_{\textup{sv}}(h(x),y)$ be the empirical risk minimizer. From (C.1) and the property of $\hat{h}$, we have

\[
\mathbb{P}_{n}(f_{\hat{h}}+g_{\hat{h}})\leq\frac{2}{\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)}\mathbb{P}_{n}\left[\ell_{\textup{sv}}(\hat{h}(x),y)-\ell_{\textup{sv}}(h^{*}(x),y)\right]\leq 0. \tag{C.15}
\]

Recall the result (C.3) proved in Part I,

\[
\mathbb{P}(f_{h}+g_{h})\geq\frac{3c_{\kappa}}{4}v_{h}^{2}=\frac{3c_{\kappa}}{4}\min\left\{\kappa^{2}T(h),\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}. \tag{C.16}
\]

From (C.14), (C.15) and (C.16), when $n>\frac{32}{c_{\kappa}^{2}}\log\frac{C_{r_{0}}}{\delta}$, with probability at least $1-\delta$, either $T(\hat{h})\leq r_{0}$ or

\[
\begin{aligned}
\frac{3c_{\kappa}}{4}\min\left\{\kappa^{2}T(\hat{h}),\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}
&<4\kappa\sqrt{2T(\hat{h})}\varphi(2T(\hat{h}))+\frac{2}{\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)}\varphi_{\text{noise}}\left(2T(\hat{h});\frac{\delta}{C_{r_{0}}}\right)\\
&\quad+\frac{c_{\kappa}}{4}\min\left\{\kappa^{2}T(\hat{h}),\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\},
\end{aligned}
\]

i.e.,

\[
\frac{c_{\kappa}}{2}\min\left\{\kappa^{2}T(\hat{h}),\frac{4\|\xi\|_{L_{2}}^{2}}{c_{\kappa}}\right\}
<4\kappa\sqrt{2T(\hat{h})}\varphi(2T(\hat{h}))+\frac{2}{\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)}\varphi_{\text{noise}}\left(2T(\hat{h});\frac{\delta}{C_{r_{0}}}\right). \tag{C.17}
\]

The theorem requires $n>\frac{72}{c_{\kappa}^{2}}\log\frac{C_{r_{0}}}{\delta}$. Denote the event

\[
\mathscr{A}:=\left\{\text{either } T(\hat{h})\leq r_{0} \text{ or (C.17) is true}\right\}.
\]

Then we have $\text{Prob}(\mathscr{A})\geq 1-\delta$.
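To make the constants of Part III concrete, the following is a minimal numerical sketch. The function names and the parameter values in the usage lines are illustrative assumptions, not part of the paper; only the formulas for $\lambda$, $C_{r_{0}}$ and the sample-size thresholds are taken from the text above.

```python
import math


def peeling_ratio(c_kappa: float) -> float:
    """lambda = (8 + 2 c_kappa) / (8 + c_kappa), as specified in Part III."""
    return (8.0 + 2.0 * c_kappa) / (8.0 + c_kappa)


def c_r0(c_kappa: float, big_delta: float, r0: float) -> float:
    """C_{r0} = 2 + (16 / c_kappa + 2) * log(4 Delta^2 / r0)."""
    if not 0.0 < r0 < 4.0 * big_delta ** 2:
        raise ValueError("r0 must lie in (0, 4 Delta^2)")
    return 2.0 + (16.0 / c_kappa + 2.0) * math.log(4.0 * big_delta ** 2 / r0)


def sample_threshold(c_kappa: float, big_delta: float, r0: float,
                     delta: float, const: float = 72.0) -> float:
    """Threshold (const / c_kappa^2) * log(C_{r0} / delta) on the sample size n."""
    return const / c_kappa ** 2 * math.log(c_r0(c_kappa, big_delta, r0) / delta)


def bracket_factor(c_kappa: float, n: int, log_term: float) -> float:
    """lambda * (sqrt(log_term / (2n)) + 1 - 1/lambda), the factor bounded by c_kappa / 4."""
    lam = peeling_ratio(c_kappa)
    return lam * (math.sqrt(log_term / (2.0 * n)) + 1.0 - 1.0 / lam)


# Hypothetical parameter values, chosen only for illustration.
c_kappa, big_delta, r0, delta = 0.5, 1.0, 0.01, 0.05
log_term = math.log(c_r0(c_kappa, big_delta, r0) / delta)
n = math.floor(sample_threshold(c_kappa, big_delta, r0, delta, const=32.0)) + 1
print(bracket_factor(c_kappa, n, log_term) < c_kappa / 4.0)
```

The usage lines take the smallest integer $n$ above the threshold with constant 32 and check that the bracketed factor falls below $c_{\kappa}/4$; the theorem's constant 72 gives a larger threshold, so the same check also passes there.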

Part IV: preliminary localization.

We first prove a preliminary localization result $T(\hat{h})\leq\max\left\{\frac{4\|\xi\|_{L_{2}}^{2}}{\kappa^{2}c_{\kappa}},r_{0}\right\}$ on the event $\mathscr{A}$. The essential purpose of this step is to localize the strong convexity parameter. If $T(\hat{h})\in\left(\max\left\{\frac{4\|\xi\|_{L_{2}}^{2}}{\kappa^{2}c_{\kappa}},r_{0}\right\},4\Delta^{2}\right]$, then on the event $\mathscr{A}$ we have

\[
\text{RHS of (C.17)}>2\|\xi\|_{L_{2}}^{2}. \tag{C.18}
\]

The theorem requires $n>\max\left\{\bar{N}_{\delta,r_{0}},\frac{72}{c_{\kappa}^{2}}\log\frac{C_{r_{0}}}{\delta}\right\}$. According to Assumption 10, this implies that

\[
\varphi_{\text{noise}}\left(8\Delta^{2};\frac{\delta}{C_{r_{0}}}\right)\leq\frac{\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)\|\xi\|_{L_{2}}^{2}}{2}, \tag{C.19}
\]

\[
\varphi\left(8\Delta^{2}\right)\leq\frac{\sqrt{2c_{\kappa}}\|\xi\|_{L_{2}}^{2}}{16\Delta}, \tag{C.20}
\]

which further imply

\[
\text{RHS of (C.17)}\leq\|\xi\|_{L_{2}}^{2}+\|\xi\|_{L_{2}}^{2}=2\|\xi\|_{L_{2}}^{2}. \tag{C.21}
\]

Inequalities (C.18) and (C.21) contradict each other. Therefore, $T(\hat{h})$ must be bounded by $\max\left\{\frac{4\|\xi\|_{L_{2}}^{2}}{\kappa^{2}c_{\kappa}},r_{0}\right\}$. Then on the event $\mathscr{A}$, either $T(\hat{h})\leq r_{0}$ or

\[
\frac{c_{\kappa}\kappa^{2}}{2}T(\hat{h})<\text{RHS of (C.17)}. \tag{C.22}
\]
Part V: final steps.

Let $r_{\text{noise}}^{*}$ be the fixed point of

\[
\frac{4}{c_{\kappa}\kappa^{2}\cdot\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)}\varphi_{\text{noise}}\left(2r;\frac{\delta}{C_{r_{0}}}\right),
\]

and $r_{\text{ver}}^{*}$ be the fixed point of

\[
\frac{8}{c_{\kappa}\kappa}\sqrt{2r}\varphi(2r).
\]
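As an illustration of how such fixed points can be evaluated numerically, here is a minimal sketch using bisection. It assumes the map is sub-root (nonnegative, nondecreasing, with $\psi(r)/\sqrt{r}$ nonincreasing), so that $\psi(r)>r$ below the positive fixed point and $\psi(r)\leq r$ above it. The constant envelope $\varphi(r)=\sqrt{d/n}$ and all parameter values are hypothetical and chosen only so that the answer has a closed form.

```python
import math
from typing import Callable


def fixed_point(psi: Callable[[float], float], hi: float = 1.0,
                tol: float = 1e-12) -> float:
    """Positive fixed point of a sub-root map psi, located by bisection on psi(r) - r."""
    # Grow the bracket until psi(hi) <= hi, i.e. hi is at or above the fixed point.
    while psi(hi) > hi:
        hi *= 2.0
    lo = 0.0
    while hi - lo > tol * max(1.0, hi):
        mid = 0.5 * (lo + hi)
        if psi(mid) > mid:
            lo = mid  # still below the fixed point
        else:
            hi = mid  # at or above the fixed point
    return hi


def r_ver_map(r: float, c_kappa: float, kappa: float,
              phi: Callable[[float], float]) -> float:
    """The map r -> (8 / (c_kappa * kappa)) * sqrt(2r) * phi(2r) whose fixed point is r_ver^*."""
    return 8.0 / (c_kappa * kappa) * math.sqrt(2.0 * r) * phi(2.0 * r)


# Hypothetical envelope and parameters (assumptions, for illustration only).
d, n, c_kappa, kappa = 10, 1000, 0.5, 1.0
phi = lambda r: math.sqrt(d / n)
r_ver = fixed_point(lambda r: r_ver_map(r, c_kappa, kappa, phi))
print(r_ver)
```

With this constant envelope the map is a multiple of $\sqrt{r}$, so the fixed point equals $128d/(c_{\kappa}^{2}\kappa^{2}n)$ in closed form, which gives a quick sanity check for the solver.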

From the definition of fixed points, when $T(\hat{h})>\max\{r^{*}_{\text{ver}},\ r^{*}_{\text{noise}}\}$, we have

\[
\frac{c_{\kappa}\kappa^{2}}{4}T(\hat{h})>\frac{2}{\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)}\varphi_{\text{noise}}\left(2T(\hat{h});\frac{\delta}{C_{r_{0}}}\right)
\]

and

\[
\frac{c_{\kappa}\kappa^{2}}{4}T(\hat{h})>4\kappa\sqrt{2T(\hat{h})}\varphi(2T(\hat{h})).
\]

Contrasting the above two inequalities with our previous result (C.22), on the event $\mathscr{A}$ we have

\[
T(\hat{h})\leq\max\{r^{*}_{\text{ver}},\ r^{*}_{\text{noise}},\ r_{0}\}.
\]

We conclude that when $n>\max\left\{\bar{N}_{\delta,r_{0}},\frac{72}{c_{\kappa}^{2}}\log\frac{C_{r_{0}}}{\delta}\right\}$, with probability at least $1-\delta$,

\[
\|\hat{h}-h^{*}\|^{2}_{L_{2}}\leq\max\{r^{*}_{\text{noise}},\ r^{*}_{\text{ver}},\ r_{0}\}. \tag{C.23}
\]

Finally, from the optimality condition on $h^{*}$ (Assumption 8), it is straightforward to prove that for all $h\in\pazocal{H}$,

\[
\mathscr{E}(h)\leq\frac{\beta_{\textup{sv}}}{2}\|h-h^{*}\|^{2}_{L_{2}}.
\]

Combining the above inequality with (C.23), we have

\[
\mathscr{E}(\hat{h})\leq\frac{\beta_{\textup{sv}}}{2}\max\left\{r^{*}_{\text{noise}},\ r^{*}_{\text{ver}},\ r_{0}\right\}.
\]

This completes the proof. $\square$

Lemma 14 (lower bound of the residual of the Taylor expansion).

Let $\ell_{\textup{sv}}$ be convex with respect to its first argument. Given $v>0$, for all $u_{1},u_{2}\in\mathbb{R}$ and $y\in\pazocal{Y}$, we have

\[
\ell_{\textup{sv}}(u_{1},y)-\ell_{\textup{sv}}(u_{2},y)-\partial_{1}\ell_{\textup{sv}}(u_{2},y)(u_{1}-u_{2})\geq\frac{\alpha(2v)}{2}\min\{|u_{1}-u_{2}|^{2},v^{2}\}\cdot\mathds{1}\{|u_{2}-y|\leq v\}. \tag{C.24}
\]
Proof of Lemma 14:

We consider the following four cases: (1) $|u_{2}-y|>v$; (2) $|u_{2}-y|\leq v$ and $|u_{1}-u_{2}|\leq v$; (3) $|u_{2}-y|\leq v$ and $u_{1}-u_{2}>v$; and (4) $|u_{2}-y|\leq v$ and $u_{1}-u_{2}<-v$. It is straightforward to prove (C.24) in case (1) and case (2). In case (3), because

\[
\ell_{\textup{sv}}(u_{1},y)-\ell_{\textup{sv}}(u_{2},y)-\partial_{1}\ell_{\textup{sv}}(u_{2},y)(u_{1}-u_{2})=\int_{0}^{1}\left(\partial_{1}\ell_{\textup{sv}}(u_{2}+t(u_{1}-u_{2}),y)-\partial_{1}\ell_{\textup{sv}}(u_{2},y)\right)(u_{1}-u_{2})\,dt,
\]

and $\left(\partial_{1}\ell_{\textup{sv}}(u_{2}+t(u_{1}-u_{2}),y)-\partial_{1}\ell_{\textup{sv}}(u_{2},y)\right)(u_{1}-u_{2})\geq 0$ for all $t\in[0,1]$, we have

\[
\begin{aligned}
\ell_{\textup{sv}}(u_{1},y)-\ell_{\textup{sv}}(u_{2},y)-\partial_{1}\ell_{\textup{sv}}(u_{2},y)(u_{1}-u_{2})
&\geq\int_{0}^{\frac{v}{u_{1}-u_{2}}}\left(\partial_{1}\ell_{\textup{sv}}(u_{2}+t(u_{1}-u_{2}),y)-\partial_{1}\ell_{\textup{sv}}(u_{2},y)\right)(u_{1}-u_{2})\,dt\\
&=\ell_{\textup{sv}}(u_{2}+v,y)-\ell_{\textup{sv}}(u_{2},y)-\partial_{1}\ell_{\textup{sv}}(u_{2},y)v\geq\frac{\alpha(2v)}{2}v^{2}.
\end{aligned}
\]

Similarly, we can prove (C.24) in case (4). This completes the proof of Lemma 14. $\square$
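The inequality (C.24) can be sanity-checked numerically on a concrete loss. The sketch below uses the Huber loss with threshold `tau` and assumes that $\alpha(t)$ denotes the infimum of the second derivative of the loss over $|u-y|\leq t$, which is 1 if $t\leq\tau$ and 0 otherwise for the Huber loss. This reading of $\alpha$, the choice of loss, and the grid values are assumptions made only for illustration.

```python
def huber(u: float, y: float, tau: float = 1.0) -> float:
    """Huber loss: quadratic for |u - y| <= tau, linear beyond."""
    r = abs(u - y)
    return 0.5 * r * r if r <= tau else tau * (r - 0.5 * tau)


def huber_grad(u: float, y: float, tau: float = 1.0) -> float:
    """Derivative of the Huber loss with respect to its first argument."""
    return max(-tau, min(tau, u - y))


def alpha(t: float, tau: float = 1.0) -> float:
    """Assumed local strong convexity modulus of the Huber loss on |u - y| <= t."""
    return 1.0 if t <= tau else 0.0


def slack(u1: float, u2: float, y: float, v: float, tau: float = 1.0) -> float:
    """Left-hand side of (C.24) minus its right-hand side."""
    residual = huber(u1, y, tau) - huber(u2, y, tau) - huber_grad(u2, y, tau) * (u1 - u2)
    indicator = 1.0 if abs(u2 - y) <= v else 0.0
    bound = 0.5 * alpha(2.0 * v, tau) * min((u1 - u2) ** 2, v * v) * indicator
    return residual - bound


def min_slack(tau: float = 1.0) -> float:
    """Smallest slack over a grid of (u1, u2, y, v); nonnegative up to rounding if (C.24) holds."""
    us = [i / 10.0 for i in range(-30, 31)]
    ys = [-0.5, 0.0, 0.7]
    vs = [0.1, 0.25, 0.5, 1.0, 2.0]
    return min(slack(u1, u2, y, v, tau) for u1 in us for u2 in us for y in ys for v in vs)


print(min_slack() >= -1e-12)
```

The tolerance accounts for floating-point rounding, since the bound holds with equality whenever both points lie in the quadratic region and $|u_{1}-u_{2}|\leq v$.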

C.2 Proof of Corollary 9

The proof is nearly identical to the proof of Theorem 8, but with the following modifications. First, we only need to consider the hypothesis set $\pazocal{H}_{0}$. Second, based on the definition of $\varphi_{\text{noise}}$ in Corollary 9, we modify (C.4) to

\[
\sup_{h\in\pazocal{H}_{0},\,\frac{r}{\lambda}\leq T(h)\leq r}(\mathbb{P}-\mathbb{P}_{n})(f_{h}+g_{h})\leq\sup_{\frac{r}{\lambda}\leq T(h)\leq r}(\mathbb{P}-\mathbb{P}_{n})g_{h}+\frac{2}{\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)}\left(\varphi_{\text{noise}}\left(r;\frac{\delta}{2}\right)-\Phi(h^{*})\right). \tag{C.25}
\]

We also make similar modifications to (C.10) (C.11) (C.1) (C.1). Third, we modify (C.15) (note that this is the only place where we use the property of empirical risk minimization) as follows:

\[
\begin{aligned}
\mathbb{P}_{n}(f_{\hat{h}}+g_{\hat{h}})
&\leq\frac{2}{\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)}\mathbb{P}_{n}\left[\ell_{\textup{sv}}(\hat{h}(x),y)-\ell_{\textup{sv}}(h^{*}(x),y)\right]\\
&\leq\frac{2}{\alpha\left(4\|\xi\|_{L_{2}}/\sqrt{c_{\kappa}}\right)}\Phi(h^{*}),
\end{aligned} \tag{C.26}
\]

where the first inequality is due to (C.1) and the second inequality is due to the definition (8.10) of the estimator $\hat{h}$. After all these modifications, the inequality (C.17) still holds, and the remaining proof is identical to that of Theorem 8. $\square$