
Early stopping and polynomial smoothing in regression with reproducing kernels

Yaroslav Averyanov (Inria MODAL project-team) and Alain Celisse (Laboratoire SAMM, Paris 1 Panthéon-Sorbonne University)
Abstract

In this paper, we study the problem of early stopping for iterative learning algorithms in a reproducing kernel Hilbert space (RKHS) in the nonparametric regression framework. In particular, we work with the gradient descent and (iterative) kernel ridge regression algorithms. We present a data-driven rule to perform early stopping without a validation set that is based on the so-called minimum discrepancy principle. This method relies on only one assumption on the regression function: it belongs to the RKHS. The proposed rule is proved to be minimax-optimal over different types of kernel spaces, including finite-rank and Sobolev smoothness classes. The proof is derived from a fixed-point analysis of the localized Rademacher complexities, which is a standard technique for obtaining optimal rates in the nonparametric regression literature. In addition, we present simulation results on artificial datasets showing that the designed rule performs comparably to other stopping rules, such as the one determined by V-fold cross-validation.

MSC subject classifications: 62G05, 62G08
Keywords: Nonparametric regression, Reproducing kernels, Early stopping, Localized Rademacher complexities
arXiv: 2007.06827


1 Introduction

An early stopping rule (ESR) is a form of regularization that chooses when to stop an iterative algorithm according to some design criterion. Its main idea is to lower the computational complexity of an iterative algorithm while preserving its statistical optimality. This approach is quite old and was initially developed in the 1970s for Landweber iterations solving ill-posed matrix problems [20, 36]. Recent papers provided insights into the connection between early stopping and boosting methods [6, 14, 40, 43], gradient descent, and Tikhonov regularization in a reproducing kernel Hilbert space (RKHS) [7, 29, 42]. For instance, [14] established the first optimal in-sample convergence rate of L^{2}-boosting with early stopping. Raskutti et al. [29] provided a result on a stopping rule that achieves the minimax-optimal rate for kernelized gradient descent and ridge regression over different smoothness classes. This work established an important connection between early stopping and the localized Rademacher complexities [5, 24, 38], which characterize the size of the explored function space. The main drawback of the result is that one needs to know the RKHS-norm of the regression function, or a tight upper bound on it, in order to apply this early stopping rule in practice. Besides that, this rule is design-dependent, which limits its practical application. In subsequent work, [40] showed how to control early stopping optimality via the localized Gaussian complexities in RKHS for different boosting algorithms (L^{2}-boosting, LogitBoost, and AdaBoost). Another theoretical result for a non-data-driven ESR was established by [11], where the authors proved a minimax-optimal (in the L_{2}(\mathbb{P}_{X}) out-of-sample norm) stopping rule for conjugate gradient descent in the nonparametric regression setting. [2] proposed a different approach, focusing on both time and memory computational savings by combining early stopping with the Nyström subsampling technique.

Some stopping rules that could (potentially) be applied in practice were provided by [9, 10] and [33]; they were based on the so-called minimum discrepancy principle [11, 13, 20, 23]. This principle consists of monitoring the empirical risk and determining the first time at which a given learning algorithm starts to fit the noise. In the papers mentioned, the authors considered spectral filter estimators such as gradient descent, Tikhonov (ridge) regularization, and spectral cut-off regression for the linear Gaussian sequence model, and derived several oracle-type inequalities for the proposed ESR. The main deficiency of the works [9, 10, 33] is that the authors dealt only with the linear Gaussian sequence model, and the minimax optimality result was restricted to the spectral cut-off estimator. It is worth mentioning that [33] introduced the so-called polynomial smoothing strategy to achieve the optimality of the minimum discrepancy principle ESR over Sobolev balls for the spectral cut-off estimator. More recently, [18] studied a minimum discrepancy principle stopping rule and a modified (also called smoothed) version of it, providing the range of regularities of the regression function for which these stopping rules are optimal for different spectral filter estimators in RKHS.

Contribution. Hence, to the best of our knowledge, there is no fully data-driven stopping rule for gradient descent or ridge regression in RKHS that does not use a validation set, does not depend on parameters of the model such as the RKHS-norm of the regression function, and comes with a proof of statistical optimality. In our paper, we combine techniques from [9], [29], and [33] to construct such an ESR. Our analysis is based on the bias-variance trade-off of an estimator, and we aim to detect the iteration at which the bias and variance intersect by means of the minimum discrepancy principle [9, 13, 18] and the localized Rademacher complexities [5, 24, 27, 38]. In particular, for kernels with infinite rank, we propose to use a special technique [13, 33] for the empirical risk in order to reduce its variance. Further, we introduce the new notions of smoothed empirical Rademacher complexity and smoothed critical radius to achieve minimax optimality bounds for the functional estimator based on the proposed rule. The smoothed critical radius is obtained by solving the associated fixed-point equation, which implies that the bounds in our analysis cannot be improved (up to numeric constants). It is important to note that the present paper establishes a connection between early stopping and a smoothed version of the statistical dimension of the n-dimensional kernel matrix, introduced by [41] for randomized projections in kernel ridge regression (see Section 4.3 for more details). We also show how to estimate the variance \sigma^{2} of the model, in particular for infinite-rank kernels. Finally, we provide experimental results on artificial data indicating the consistent performance of the proposed rules.

Outline of the paper. The organization of the paper is as follows. In Section 2, we introduce the background on nonparametric regression and reproducing kernel Hilbert spaces. There, we explain the updates of the two spectral filter iterative algorithms that will be studied: gradient descent and (iterative) kernel ridge regression. In Section 3, we clarify how to compute our first early stopping rule for finite-rank kernels and provide an oracle-type inequality (Theorem 3.1) and an upper bound on the risk error of this stopping rule with fixed covariates (Corollary 3.2). After that, we present a similar upper bound on the risk error with random covariates (Theorem 3.3) that is proved to be minimax-rate optimal. By contrast, Section 4 is devoted to the development of a new stopping rule for infinite-rank kernels based on the polynomial smoothing strategy [13, 33]. There, Theorem 4.2 shows, under a quite general assumption on the eigenvalues of the kernel operator, a high-probability upper bound on the performance of this stopping rule measured in the L_{2}(\mathbb{P}_{n}) in-sample norm. In particular, this upper bound leads to minimax optimality over Sobolev smoothness classes. In Section 5, we compare our stopping rules to other rules, such as methods using hold-out data and V-fold cross-validation. After that, we propose a strategy for estimating the variance \sigma^{2} of the regression model. Section 6 summarizes the content of the paper and describes some perspectives. Supplementary material and more technical proofs are deferred to the Appendix.

2 Nonparametric regression and reproducing kernel framework

2.1 Probabilistic model and notation

The context of the present work is that of nonparametric regression, where an i.i.d. sample \{(x_{i},y_{i}),\ i=1,\ldots,n\} of cardinality n is given, with x_{i}\in\mathcal{X} (the feature space) and y_{i}\in\mathbb{R}. The goal is to estimate the regression function f^{*}:\mathcal{X}\to\mathbb{R} from the model

y_{i}=f^{*}(x_{i})+\overline{\varepsilon}_{i},\qquad i=1,\ldots,n, (1)

where the error variables \overline{\varepsilon}_{i} are i.i.d. zero-mean Gaussian random variables \mathcal{N}(0,\sigma^{2}), with \sigma>0. In all that follows (except for Section 5, where results of empirical experiments are reported), the value of \sigma^{2} is assumed to be known, as in [29] and [40].

Throughout the paper, calculations are mainly derived in the fixed-design context, where the \{x_{i}\}_{i=1}^{n} are assumed to be fixed and only the error variables \{\overline{\varepsilon}_{i}\}_{i=1}^{n} are random. In this context, the performance of any estimator \widehat{f} of the regression function f^{*} is measured in terms of the so-called empirical norm, that is, the L_{2}(\mathbb{P}_{n})-norm defined by

\lVert\widehat{f}-f^{*}\rVert_{n}^{2}\coloneqq\frac{1}{n}\sum_{i=1}^{n}\Big{[}\widehat{f}(x_{i})-f^{*}(x_{i})\Big{]}^{2},

where \lVert h\rVert_{n}\coloneqq\sqrt{1/n\sum_{i=1}^{n}h(x_{i})^{2}} for any bounded function h over \mathcal{X}, and \langle\cdot,\cdot\rangle_{n} denotes the related inner product defined by \langle h_{1},h_{2}\rangle_{n}\coloneqq 1/n\sum_{i=1}^{n}h_{1}(x_{i})h_{2}(x_{i}) for any functions h_{1} and h_{2} bounded over \mathcal{X}. In this context, \mathbb{P}_{\varepsilon} and \mathbb{E}_{\varepsilon} denote the probability and expectation, respectively, with respect to the \{\overline{\varepsilon}_{i}\}_{i=1}^{n}.

By contrast, Section 3.1.2 discusses some extensions of the previous results to the random design context, where both the covariates \{x_{i}\}_{i=1}^{n} and the responses \{y_{i}\}_{i=1}^{n} are random variables. In this random design context, the performance of an estimator \widehat{f} of f^{*} is measured in terms of the L_{2}(\mathbb{P}_{X})-norm defined by

\lVert\widehat{f}-f^{*}\rVert_{2}^{2}\coloneqq\mathbb{E}_{X}\Big{[}(\widehat{f}(X)-f^{*}(X))^{2}\Big{]},

where \mathbb{P}_{X} denotes the probability distribution of the \{x_{i}\}_{i=1}^{n}. In what follows, \mathbb{P} and \mathbb{E}, respectively, stand for the probability and expectation with respect to the couples \{(x_{i},y_{i})\}_{i=1}^{n}.

Notation.

Throughout the paper, \lVert\cdot\rVert and \langle\cdot,\cdot\rangle are the usual Euclidean norm and inner product in \mathbb{R}^{n}. We write a_{n}\lesssim b_{n} whenever a_{n}\leq Cb_{n} for some numeric constant C>0 and all n\geq 1, and a_{n}\gtrsim b_{n} whenever a_{n}\geq Cb_{n} for some numeric constant C>0 and all n\geq 1. Similarly, a_{n}\asymp b_{n} means a_{n}\lesssim b_{n} and b_{n}\lesssim a_{n}. We set [M]\equiv\{1,\ldots,M\} for any M\in\mathbb{N}. For a\geq 0, we denote by \left\lfloor a\right\rfloor the largest natural number that is smaller than or equal to a, and by \left\lceil a\right\rceil the smallest natural number that is greater than or equal to a. Throughout the paper, we use the notation c,c_{1},\widetilde{c},C,\widetilde{C},\ldots for numeric constants that do not depend on the parameters considered; their values may change from line to line.

2.2 Statistical model and assumptions

2.2.1 Reproducing Kernel Hilbert Space (RKHS)

Let us start by introducing a reproducing kernel Hilbert space (RKHS), denoted by \mathcal{H} [4, 8, 22, 37]. Such an RKHS \mathcal{H} is a class of functions associated with a reproducing kernel \mathbb{K}:\mathcal{X}^{2}\to\mathbb{R} and endowed with an inner product \langle\cdot,\cdot\rangle_{\mathcal{H}} satisfying \langle\mathbb{K}(\cdot,x),\mathbb{K}(\cdot,y)\rangle_{\mathcal{H}}=\mathbb{K}(x,y) for all x,y\in\mathcal{X}. Each function within \mathcal{H} admits a representation as an element of L_{2}(\mathbb{P}_{X}), which justifies the slight abuse of notation when writing \mathcal{H}\subset L_{2}(\mathbb{P}_{X}) (see [19] and [18, Assumption 3]).

Assuming the RKHS \mathcal{H} is separable, under suitable regularity conditions (e.g., a continuous positive-semidefinite kernel), Mercer’s theorem [31] guarantees that the kernel can be expanded as

\mathbb{K}(x,x^{\prime})=\sum_{k=1}^{+\infty}\mu_{k}\phi_{k}(x)\phi_{k}(x^{\prime}),\quad\forall x,x^{\prime}\in\mathcal{X},

where \mu_{1}\geq\mu_{2}\geq\ldots\geq 0 and \{\phi_{k}\}_{k=1}^{+\infty} are, respectively, the eigenvalues and corresponding eigenfunctions of the kernel integral operator T_{\mathbb{K}}, given by

T_{\mathbb{K}}(f)(x)=\int_{\mathcal{X}}\mathbb{K}(x,u)f(u)d\mathbb{P}_{X}(u),\quad\forall f\in L_{2}(\mathbb{P}_{X}),\ x\in\mathcal{X}. (2)

It is then known that the family \{\phi_{k}\}_{k=1}^{+\infty} is an orthonormal basis of L_{2}(\mathbb{P}_{X}), while \{\sqrt{\mu_{k}}\phi_{k}\}_{k=1}^{+\infty} is an orthonormal basis of \mathcal{H}. Then, any function f\in\mathcal{H}\subset L_{2}(\mathbb{P}_{X}) can be expanded as f=\sum_{k=1}^{+\infty}\sqrt{\mu_{k}}\theta_{k}\phi_{k}, where, for all k such that \mu_{k}>0, the coefficients \{\theta_{k}\}_{k=1}^{\infty} are

\theta_{k}=\langle f,\sqrt{\mu_{k}}\phi_{k}\rangle_{\mathcal{H}}=\frac{1}{\sqrt{\mu_{k}}}\langle f,\phi_{k}\rangle_{L_{2}(\mathbb{P}_{X})}=\int_{\mathcal{X}}\frac{f(x)\phi_{k}(x)}{\sqrt{\mu_{k}}}d\mathbb{P}_{X}(x). (3)

Therefore, any two functions f,g\in\mathcal{H} can be represented by respective sequences \{a_{k}\}_{k=1}^{+\infty},\{b_{k}\}_{k=1}^{+\infty}\in\ell_{2}(\mathbb{N}) such that

f=\sum_{k=1}^{+\infty}a_{k}\phi_{k},\quad\mbox{and}\quad g=\sum_{k=1}^{+\infty}b_{k}\phi_{k},

with the inner product in the Hilbert space \mathcal{H} given by \langle f,g\rangle_{\mathcal{H}}=\sum_{k=1}^{+\infty}\frac{a_{k}b_{k}}{\mu_{k}}. This leads to the following representation of \mathcal{H} as an ellipsoid:

\mathcal{H}=\left\{f=\sum_{k=1}^{+\infty}a_{k}\phi_{k},\quad\sum_{k=1}^{+\infty}a_{k}^{2}<+\infty,\mbox{ and }\sum_{k=1}^{+\infty}\frac{a_{k}^{2}}{\mu_{k}}<+\infty\right\}.

2.2.2 Main assumptions

From the initial model given by Eq. (1), we make the following assumption.

Assumption 1 (Statistical model).

Let \mathbb{K}(\cdot,\cdot) denote a reproducing kernel as defined above, and let \mathcal{H} be the induced separable RKHS. Then, there exists a constant R>0 such that the n-sample (x_{1},y_{1}),\ldots,(x_{n},y_{n})\in\mathcal{X}^{n}\times\mathbb{R}^{n} satisfies the statistical model

y_{i}=f^{*}(x_{i})+\overline{\varepsilon}_{i},\quad\mbox{with}\quad f^{*}\in\mathbb{B}_{\mathcal{H}}(R)=\{f\in\mathcal{H}:\lVert f\rVert_{\mathcal{H}}\leq R\}, (4)

where the \{\overline{\varepsilon}_{i}\}_{i=1}^{n} are i.i.d. Gaussian random variables with \mathbb{E}[\overline{\varepsilon}_{i}\mid x_{i}]=0 and \mathbb{V}[\overline{\varepsilon}_{i}\mid x_{i}]=\sigma^{2}.

The model from Assumption 1 can be vectorized as

Y=[y_{1},\ldots,y_{n}]^{\top}=F^{*}+\overline{\varepsilon}\in\mathbb{R}^{n}, (5)

where F^{*}=[f^{*}(x_{1}),\ldots,f^{*}(x_{n})]^{\top} and \overline{\varepsilon}=[\overline{\varepsilon}_{1},\ldots,\overline{\varepsilon}_{n}]^{\top}, a form that turns out to be useful throughout the paper.

In the present paper, we make a boundedness assumption on the reproducing kernel \mathbb{K}(\cdot,\cdot).

Assumption 2.

Let us assume that the measurable reproducing kernel \mathbb{K}(\cdot,\cdot) is uniformly bounded on its support, meaning that there exists a constant B>0 such that

\underset{x\in\mathcal{X}}{\sup}\Big{[}\mathbb{K}(x,x)\Big{]}=\underset{x\in\mathcal{X}}{\sup}\lVert\mathbb{K}(\cdot,x)\rVert_{\mathcal{H}}^{2}\leq B.

Moreover, in what follows, we assume that B=1 without loss of generality.

Assumption 2 holds for many kernels. On the one hand, it is fulfilled by bounded kernels on an unbounded domain \mathcal{X} (e.g., the Gaussian and Laplace kernels). On the other hand, for unbounded kernels such as the polynomial or Sobolev kernels [31], it amounts to assuming that the domain \mathcal{X} is bounded. Let us also mention that Assumptions 1 and 2 (combined with the reproducing property) imply that f^{*} is uniformly bounded, since

\lVert f^{*}\rVert_{\infty}=\underset{x\in\mathcal{X}}{\sup}\left|\langle f^{*},\mathbb{K}(\cdot,x)\rangle_{\mathcal{H}}\right|\leq\lVert f^{*}\rVert_{\mathcal{H}}\underset{x\in\mathcal{X}}{\sup}\lVert\mathbb{K}(\cdot,x)\rVert_{\mathcal{H}}\leq R. (6)

Considering now the Gram matrix K=\{\mathbb{K}(x_{i},x_{j})\}_{1\leq i,j\leq n}, the related normalized Gram matrix K_{n}=\{\mathbb{K}(x_{i},x_{j})/n\}_{1\leq i,j\leq n} turns out to be symmetric and positive semidefinite. This entails the existence of the empirical eigenvalues \widehat{\mu}_{1},\ldots,\widehat{\mu}_{n} (respectively, eigenvectors \widehat{u}_{1},\ldots,\widehat{u}_{n}) such that K_{n}\widehat{u}_{i}=\widehat{\mu}_{i}\cdot\widehat{u}_{i} for all i\in[n]. Remark that Assumption 2 implies 0\leq\max(\widehat{\mu}_{1},\mu_{1})\leq 1.

For technical convenience, it turns out to be useful to rephrase the model (5) using the SVD of the normalized Gram matrix K_{n}. This leads to the new (rotated) model

Z_{i}=\langle\widehat{u}_{i},Y\rangle=G_{i}^{*}+\varepsilon_{i},\quad i=1,\ldots,n, (7)

where G_{i}^{*}=\langle\widehat{u}_{i},F^{*}\rangle, and \varepsilon_{i}=\langle\widehat{u}_{i},\overline{\varepsilon}\rangle is a zero-mean Gaussian random variable with variance \sigma^{2}.
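To make the rotated model (7) concrete, here is a minimal NumPy sketch (ours, not part of the original paper; the function name rotated_model and the toy data are illustrative assumptions) that builds the normalized Gram matrix K_{n} for the first-order Sobolev kernel \mathbb{K}(x_{1},x_{2})=\min\{x_{1},x_{2}\} used later in Section 3.2, eigendecomposes it, and forms the rotated observations Z_{i}=\langle\widehat{u}_{i},Y\rangle.

```python
import numpy as np

def rotated_model(x, y, kernel):
    """Eigendecompose the normalized Gram matrix K_n and rotate the
    observations as in Eq. (7): Z_i = <u_i, Y>."""
    n = len(x)
    K = kernel(x[:, None], x[None, :])      # Gram matrix K
    K_n = K / n                             # normalized Gram matrix K_n
    mu_hat, U = np.linalg.eigh(K_n)         # K_n is symmetric PSD
    order = np.argsort(mu_hat)[::-1]        # sort eigenvalues decreasingly
    mu_hat = np.clip(mu_hat[order], 0.0, None)
    U = U[:, order]
    Z = U.T @ y                             # rotated observations Z_i
    return mu_hat, U, Z

# toy data from model (1) with the first-order Sobolev kernel min(x, x')
rng = np.random.default_rng(0)
n, sigma = 200, 0.15
x = rng.uniform(0.0, 1.0, n)
f_star = np.abs(x - 0.5) - 0.5              # piecewise-linear signal
y = f_star + sigma * rng.normal(size=n)
mu_hat, U, Z = rotated_model(x, y, np.minimum)
```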

2.3 Spectral filter algorithms

Spectral filter algorithms were first introduced for solving ill-posed inverse problems with deterministic noise [20]. Among others, one typical example of such an algorithm is gradient descent (also known as L^{2}-boosting [14]). They were more recently brought to the supervised learning community, for instance, by [7, 15, 21, 42]. For estimating the vector F^{*} from Eq. (5) in the fixed-design context, such a spectral filter estimator is a linear estimator, which can be expressed as

F^{\lambda}\coloneqq\left(f^{\lambda}(x_{1}),\ldots,f^{\lambda}(x_{n})\right)^{\top}=K_{n}g_{\lambda}(K_{n})Y, (8)

where g_{\lambda}:[0,1]\to\mathbb{R} is called an admissible spectral filter function [7, 21]. For example, the choice g_{\lambda}(\xi)=\frac{1}{\xi+\lambda} corresponds to the kernel ridge estimator with regularization parameter \lambda>0 (see [9, 18] for other possible choices).

From the model expressed in the empirical eigenvector basis (7), the resulting spectral filter estimator (8) can be expressed as

G^{\lambda(t)}_{i}=\langle\widehat{u}_{i},F^{\lambda(t)}\rangle=\gamma_{i}^{(t)}Z_{i},\quad\forall i=1,\ldots,n, (9)

where t\mapsto\lambda(t)>0 is a decreasing function mapping t to a regularization parameter value at time t, and t\mapsto\gamma_{i}^{(t)} is defined by

\gamma_{i}^{(t)}=\widehat{\mu}_{i}g_{\lambda(t)}(\widehat{\mu}_{i}),\quad\forall i=1,\ldots,n.

Under the assumption that \underset{t\to 0}{\lim}g_{\lambda(t)}(\mu)=0 for \mu\in(0,1], it can be proved that \gamma_{i}^{(t)} is a non-decreasing function of t, with \gamma_{i}^{(0)}=0 and \underset{t\to\infty}{\lim}\gamma_{i}^{(t)}=1. Moreover, \widehat{\mu}_{i}=0 implies \gamma_{i}^{(t)}=0, as is the case for kernels with finite rank, that is, when \mathrm{rk}(K_{n})\leq r almost surely.

Thanks to the remark above, we define the convenient notations f^{t}\coloneqq f^{\lambda(t)} (for functions) and F^{t}\coloneqq F^{\lambda(t)} (for vectors), with a continuous time t\geq 0, by

f^{t}=g_{\lambda(t)}(S_{n}^{*}S_{n})S_{n}^{*}Y, (10)

where S_{n}:\mathcal{H}\to\mathbb{R}^{n} is the sampling operator and S_{n}^{*} is its adjoint, i.e., (S_{n}f)_{i}=f(x_{i}) and K_{n}=S_{n}S_{n}^{*}.

In what follows, we introduce an assumption on the \gamma_{i}^{(t)} functions that will play a crucial role in our analysis.

Assumption 3.
c\min\{1,\eta t\widehat{\mu}_{i}\}\leq\gamma_{i}^{(t)}\leq\min\{1,\eta t\widehat{\mu}_{i}\},\quad i=1,\ldots,n,

for some constants c\in(0,1) and \eta>0.

Let us mention two famous examples of spectral filter estimators that satisfy Assumption 3 with c=1/2 (see Lemma A.1 in the Appendix). These examples will be further studied in the present paper.

  • Gradient descent (GD) with a constant step-size 0<\eta<1/\widehat{\mu}_{1} and \eta t\to+\infty as t\to+\infty:

    \gamma_{i}^{(t)}=1-(1-\eta\widehat{\mu}_{i})^{t},\quad\forall t\geq 0,\ \forall i=1,\ldots,n. (11)

    The constant step-size \eta can be replaced by any non-increasing sequence \{\eta(t)\}_{t=0}^{+\infty} satisfying [29]

    • (\widehat{\mu}_{1})^{-1}\geq\eta(t)\geq\eta(t+1)\geq\dots, for t=0,1,\ldots,

    • \sum_{s=0}^{t-1}\eta(s)\to+\infty as t\to+\infty.

  • Kernel ridge regression (KRR) with the regularization parameter \lambda(t)=1/(\eta t), where \eta>0:

    \gamma_{i}^{(t)}=\frac{\widehat{\mu}_{i}}{\widehat{\mu}_{i}+\lambda(t)},\quad\forall t>0,\ \forall i=1,\ldots,n. (12)

    The linear parameterization \lambda(t)=1/(\eta t) is chosen for theoretical convenience.

The examples of \gamma_{i}^{(t)} above were derived with the initialization F^{0}=[f^{0}(x_{1}),\ldots,f^{0}(x_{n})]^{\top}=[0,\ldots,0]^{\top}, without loss of generality.
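As a quick numerical illustration (ours, under the synthetic spectrum assumed below), the following sketch computes \gamma_{i}^{(t)} for both filters from Eqs. (11) and (12) and checks the two-sided bound of Assumption 3 with c=1/2.

```python
import numpy as np

def gamma_gd(mu_hat, t, eta):
    """Gradient descent filter, Eq. (11)."""
    return 1.0 - (1.0 - eta * mu_hat) ** t

def gamma_krr(mu_hat, t, eta):
    """Kernel ridge regression filter, Eq. (12), with lambda(t) = 1/(eta t)."""
    lam = 1.0 / (eta * t)
    return mu_hat / (mu_hat + lam)

mu_hat = 1.0 / np.arange(1, 201) ** 2.0     # synthetic decaying spectrum
eta = 1.0 / (1.2 * mu_hat[0])               # constant step-size < 1/mu_1
for t in (1, 10, 100, 1000):
    upper = np.minimum(1.0, eta * t * mu_hat)
    for gamma in (gamma_gd(mu_hat, t, eta), gamma_krr(mu_hat, t, eta)):
        assert np.all(gamma <= upper + 1e-12)          # Assumption 3, upper bound
        assert np.all(gamma >= 0.5 * upper - 1e-12)    # Assumption 3, c = 1/2
```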

2.4 Key quantities

Given a set of parameters (stopping times) \mathcal{T}\coloneqq\{t\geq 0\} for an iterative learning algorithm, the present goal is to design \widehat{t}=\widehat{t}(\{x_{i},y_{i}\}_{i=1}^{n}) from the data \{x_{i},y_{i}\}_{i=1}^{n} such that the functional estimator f^{\widehat{t}} is as close as possible to the optimal one among \mathcal{T}.

Numerous classical model selection procedures for choosing \widehat{t} already exist, e.g., (generalized) cross-validation [35], the AIC and BIC criteria [1, 32], unbiased risk estimation [17], or Lepski’s balancing principle [26]. Their main drawback in the present context is that they require the practitioner to compute all the estimators \{f^{t},\ t\in\mathcal{T}\} in a first step, and then choose the optimal estimator among the candidates in a second step, which can be computationally demanding.

By contrast, early stopping is a less time-consuming approach. It is based on observing one estimator at each t\in\mathcal{T} and deciding to stop the learning process according to some criterion. Its aim is to reduce the computational cost induced by this selection procedure while preserving the statistical optimality properties of the output estimator.

The prediction error (risk) of an estimator f^{t} at time t is split into a bias and a variance term [29] as

R(t)=\mathbb{E}_{\varepsilon}\lVert f^{t}-f^{*}\rVert_{n}^{2}=\lVert\mathbb{E}_{\varepsilon}f^{t}-f^{*}\rVert_{n}^{2}+\mathbb{E}_{\varepsilon}\lVert f^{t}-\mathbb{E}_{\varepsilon}f^{t}\rVert_{n}^{2}=B^{2}(t)+V(t)

with

B^{2}(t)=\frac{1}{n}\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}(G_{i}^{*})^{2},\qquad V(t)=\frac{\sigma^{2}}{n}\sum_{i=1}^{n}(\gamma_{i}^{(t)})^{2}. (13)

The bias term is a non-increasing function of t converging to zero, while the variance term is a non-decreasing function of t. Assume further that \textnormal{rk}(T_{\mathbb{K}})\leq r, which implies that \textnormal{rk}(K_{n})\leq r almost surely; then the empirical risk R_{t} is introduced, with the notation of Eq. (7), as

R_{t}=\frac{1}{n}\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}Z_{i}^{2}=\frac{1}{n}\sum_{i=1}^{r}(1-\gamma_{i}^{(t)})^{2}Z_{i}^{2}+\frac{1}{n}\sum_{i=r+1}^{n}Z_{i}^{2}. (14)
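For reference, a direct transcription of the bias, variance, and reduced empirical risk of Eqs. (13), (14), and (18) could look as follows (a sketch with our own naming; gamma, G_star, and Z stand for the filter values and the rotated quantities of Eq. (7)).

```python
import numpy as np

def bias2_variance(gamma, G_star, sigma):
    """Squared bias and variance terms of Eq. (13); the risk R(t) is their sum."""
    n = len(gamma)
    bias2 = np.sum((1.0 - gamma) ** 2 * G_star ** 2) / n
    var = sigma ** 2 * np.sum(gamma ** 2) / n
    return bias2, var

def reduced_empirical_risk(gamma, Z, r):
    """Reduced empirical risk: the first r terms of Eq. (14), cf. Eq. (18)."""
    n = len(Z)
    return np.sum((1.0 - gamma[:r]) ** 2 * Z[:r] ** 2) / n
```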

An illustration of the typical behavior of the risk, empirical risk, bias, and variance is displayed by Figure 1.

Figure 1: Bias, variance, risk, and empirical risk behavior.

Our main concern is formulating a data-driven stopping rule (a mapping from the data \{(x_{i},y_{i})\}_{i=1}^{n} to a positive time \widehat{t}) so that the prediction errors \mathbb{E}_{\varepsilon}\lVert f^{\widehat{t}}-f^{*}\rVert_{n}^{2} or, equivalently, \mathbb{E}\lVert f^{\widehat{t}}-f^{*}\rVert_{2}^{2} are as small as possible.

The analysis of the forthcoming early stopping rules involves a model complexity measure known as the localized empirical Rademacher complexity [5, 24, 38], which we generalize to its \alpha-smoothed version, for \alpha\in[0,1].

Definition 2.1.

For any \epsilon>0 and \alpha\in[0,1], consider the localized smoothed empirical Rademacher complexity of \mathcal{H}, defined as

\widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})=R\left[\frac{1}{n}\sum_{j=1}^{r}\widehat{\mu}_{j}^{\alpha}\min\{\epsilon^{2},\widehat{\mu}_{j}\}\right]^{1/2}. (15)

It corresponds to a rescaled sum of the empirical eigenvalues, truncated at \epsilon^{2} and smoothed by \{\widehat{\mu}_{i}^{\alpha}\}_{i=1}^{r}.

For a given RKHS \mathcal{H} and noise level \sigma, let us finally define the empirical smoothed critical radius \widehat{\epsilon}_{n,\alpha} as the smallest positive value \epsilon such that

\frac{\widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})}{\epsilon R}\leq\frac{2R\epsilon^{1+\alpha}}{\sigma}. (16)

There is an extensive literature on the empirical critical equation and the related empirical critical radius [5, 27, 29], and providing an exhaustive review of this topic is beyond the scope of the present paper. Nevertheless, Appendix G establishes that the smoothed critical radius \widehat{\epsilon}_{n,\alpha} exists, is unique, and achieves equality in Ineq. (16). The constant 2 in Ineq. (16) is chosen for theoretical convenience only. If \alpha=0, then \widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})\equiv\widehat{\mathcal{R}}_{n}(\epsilon,\mathcal{H}) and \widehat{\epsilon}_{n,\alpha}\equiv\widehat{\epsilon}_{n}.
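Since \widehat{\epsilon}_{n,\alpha} is defined through the fixed-point inequality (16), it can be approximated numerically from the empirical eigenvalues, for instance by bisection, as in the following sketch (ours; it only relies on the existence and uniqueness guaranteed in Appendix G and on \widehat{\mu}_{i}\leq 1 from Assumption 2).

```python
import numpy as np

def smoothed_rademacher(eps, mu_hat, R, alpha):
    """Localized smoothed empirical Rademacher complexity, Eq. (15)."""
    n = len(mu_hat)
    return R * np.sqrt(np.sum(mu_hat ** alpha * np.minimum(eps ** 2, mu_hat)) / n)

def smoothed_critical_radius(mu_hat, R, sigma, alpha=0.0, tol=1e-10):
    """Smallest eps > 0 achieving equality in Ineq. (16), found by bisection."""
    def excess(eps):   # positive while Ineq. (16) is violated
        lhs = smoothed_rademacher(eps, mu_hat, R, alpha) / (eps * R)
        rhs = 2.0 * R * eps ** (1.0 + alpha) / sigma
        return lhs - rhs
    lo, hi = tol, 1.0                   # mu_hat <= 1 under Assumption 2
    while excess(hi) > 0.0:             # enlarge the bracket if necessary
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if excess(mid) > 0.0 else (lo, mid)
    return hi
```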

3 Data-driven early stopping rule and minimum discrepancy principle

Let us start by recalling that the expression of the empirical risk in Eq. (14) shows that the empirical risk is a non-increasing function of t (as illustrated in Fig. 1). This is consistent with the intuition that the amount of available information within the residuals decreases as t grows. If there exists a time t such that f^{t}\approx f^{*}, then the empirical risk is approximately equal to \sigma^{2} (the noise level), that is,

\mathbb{E}_{\varepsilon}R_{t}=\mathbb{E}_{\varepsilon}\Big{[}\lVert F^{t}-Y\rVert_{n}^{2}\Big{]}\approx\mathbb{E}_{\varepsilon}\Big{[}\lVert F^{*}-Y\rVert_{n}^{2}\Big{]}=\mathbb{E}_{\varepsilon}\Big{[}\lVert\varepsilon\rVert_{n}^{2}\Big{]}=\sigma^{2}. (17)

Introducing the reduced empirical risk \widetilde{R}_{t},\ t\geq 0, and recalling that \textnormal{rk}(K_{n})\leq r, we get

\mathbb{E}_{\varepsilon}R_{t}=\mathbb{E}_{\varepsilon}\left[\frac{1}{n}\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}Z_{i}^{2}\right]=\mathbb{E}_{\varepsilon}\underbrace{\left[\frac{1}{n}\sum_{i=1}^{r}(1-\gamma_{i}^{(t)})^{2}Z_{i}^{2}\right]}_{\coloneqq\widetilde{R}_{t}}+\frac{n-r}{n}\sigma^{2}\overset{(\textnormal{i})}{\approx}\sigma^{2}, (18)

where (\textnormal{i}) is due to Eq. (17). This heuristic argument gives rise to a first deterministic stopping rule t^{*} involving the reduced empirical risk and given by

t^{*}=\inf\left\{t>0\ |\ \mathbb{E}_{\varepsilon}\widetilde{R}_{t}\leq\frac{r\sigma^{2}}{n}\right\}. (19)

Since t^{*} is not achievable in practice, an estimator of t^{*} is given by the data-driven stopping rule \tau based on the so-called minimum discrepancy principle:

\tau=\inf\left\{t>0\ |\ \widetilde{R}_{t}\leq\frac{r\sigma^{2}}{n}\right\}. (20)

The existing literature on MDP-based stopping rules usually defines \tau by the event \{R_{t}\leq\sigma^{2}\} [9, 11, 13, 20, 23, 33]. Notice that with a full-rank kernel matrix, the reduced empirical risk \widetilde{R}_{t} is equal to the classical empirical risk R_{t}, leading then to the same stopping rule. From a practical perspective, knowing the rank of the Gram matrix avoids estimating the last n-r components of the vector G^{*}, which are already known to be zero (see [29, Section 4.1] for more details).
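In practice, \tau is obtained by monitoring the reduced empirical risk along the iterations. A minimal sketch for kernel gradient descent over a discrete time grid (ours; Z and \widehat{\mu}_{i} denote the rotated observations and empirical eigenvalues of Eq. (7)) reads as follows.

```python
import numpy as np

def mdp_stopping_time(Z, mu_hat, sigma, eta, r=None, t_max=100_000):
    """Minimum discrepancy stopping rule tau of Eq. (20) for kernel gradient
    descent: the first t with reduced empirical risk <= r * sigma^2 / n."""
    n = len(Z)
    if r is None:                            # numerical rank of K_n
        r = int(np.sum(mu_hat > 1e-12))
    threshold = r * sigma ** 2 / n
    for t in range(1, t_max + 1):
        gamma = 1.0 - (1.0 - eta * mu_hat[:r]) ** t             # Eq. (11)
        R_tilde = np.sum((1.0 - gamma) ** 2 * Z[:r] ** 2) / n   # Eq. (18)
        if R_tilde <= threshold:
            return t
    return t_max
```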

3.1 Finite-rank kernels

3.1.1 Fixed-design framework

Let us start by discussing our results in the case of RKHSs generated by finite-rank kernels with rank r<n: \mu_{i}=0 and \widehat{\mu}_{i}=0 for i>r. Examples of such kernels include the linear kernel \mathbb{K}(x_{1},x_{2})=x_{1}^{\top}x_{2} and the polynomial kernel of degree d\in\mathbb{N}, \mathbb{K}(x_{1},x_{2})=(1+x_{1}^{\top}x_{2})^{d}.

The following theorem applies to any functional estimator \{f^{t}\}_{t\in[0,T]} generated by (10) and initialized at f^{0}=0. The main part of the proof consists in properly upper bounding \mathbb{E}_{\varepsilon}|\mathbb{E}_{\varepsilon}\widetilde{R}_{t^{*}}-\widetilde{R}_{t^{*}}| and follows the same lines as Proposition 3.1 in [9].

Theorem 3.1.

Under Assumptions 1 and 2, given the stopping rule (20),

\mathbb{E}_{\varepsilon}\lVert f^{\tau}-f^{*}\rVert_{n}^{2}\leq 2(1+\theta^{-1})\mathbb{E}_{\varepsilon}\lVert f^{t^{*}}-f^{*}\rVert_{n}^{2}+2(\sqrt{3}+\theta)\frac{\sqrt{r}\sigma^{2}}{n} (21)

for any positive \theta.

Proof of Theorem 3.1.

In this proof, we will use the following inequalities: for any a,b\geq 0, (a-b)^{2}\leq|a^{2}-b^{2}|, and 2ab\leq\theta a^{2}+\frac{1}{\theta}b^{2} for any \theta>0.

Let us first prove the subsequent oracle-type inequality for the difference between f^{\tau} and f^{t^{*}}. Consider

\lVert f^{t^{*}}-f^{\tau}\rVert_{n}^{2}=\frac{1}{n}\sum_{i=1}^{r}\Big{(}\gamma_{i}^{(t^{*})}-\gamma_{i}^{(\tau)}\Big{)}^{2}Z_{i}^{2}\leq\frac{1}{n}\sum_{i=1}^{r}|(1-\gamma_{i}^{(t^{*})})^{2}-(1-\gamma_{i}^{(\tau)})^{2}|Z_{i}^{2}
=(\widetilde{R}_{t^{*}}-\widetilde{R}_{\tau})\mathbb{I}\left\{\tau\geq t^{*}\right\}+(\widetilde{R}_{\tau}-\widetilde{R}_{t^{*}})\mathbb{I}\left\{\tau<t^{*}\right\}
\leq(\widetilde{R}_{t^{*}}-\mathbb{E}_{\varepsilon}\widetilde{R}_{t^{*}})\mathbb{I}\left\{\tau\geq t^{*}\right\}+(\mathbb{E}_{\varepsilon}\widetilde{R}_{t^{*}}-\widetilde{R}_{t^{*}})\mathbb{I}\left\{\tau<t^{*}\right\}
\leq|\widetilde{R}_{t^{*}}-\mathbb{E}_{\varepsilon}\widetilde{R}_{t^{*}}|.

From the definition of \widetilde{R}_{t} in (18), one notices that

|\widetilde{R}_{t^{*}}-\mathbb{E}_{\varepsilon}\widetilde{R}_{t^{*}}|=\left|\sum_{i=1}^{r}(1-\gamma_{i}^{(t^{*})})^{2}\Big{[}\frac{1}{n}(\varepsilon_{i}^{2}-\sigma^{2})+\frac{2}{n}\varepsilon_{i}G_{i}^{*}\Big{]}\right|.

From \mathbb{E}_{\varepsilon}|X(\varepsilon)|\leq\sqrt{\text{var}_{\varepsilon}X(\varepsilon)} for X(\varepsilon) centered, \sqrt{a+b}\leq\sqrt{a}+\sqrt{b} for any a,b\geq 0, and \mathbb{E}_{\varepsilon}\left(\varepsilon^{4}\right)\leq 3\sigma^{4}, it follows that

\mathbb{E}_{\varepsilon}|\widetilde{R}_{t^{*}}-\mathbb{E}_{\varepsilon}\widetilde{R}_{t^{*}}|\leq\sqrt{\frac{2\sigma^{2}}{n^{2}}\sum_{i=1}^{r}(1-\gamma_{i}^{(t^{*})})^{4}\left[\frac{3}{2}\sigma^{2}+2(G_{i}^{*})^{2}\right]}
\leq\sqrt{\frac{3\sigma^{4}}{n^{2}}\sum_{i=1}^{r}(1-\gamma_{i}^{(t^{*})})^{2}}+\sqrt{\frac{4\sigma^{2}}{n^{2}}\sum_{i=1}^{r}(1-\gamma_{i}^{(t^{*})})^{2}(G_{i}^{*})^{2}}
\leq\frac{\sqrt{3}\sigma^{2}\sqrt{r}}{n}+\theta\frac{\sigma^{2}}{n}+\theta^{-1}B^{2}(t^{*})
\leq\theta^{-1}B^{2}(t^{*})+(\sqrt{3}+\theta)\frac{\sqrt{r}\sigma^{2}}{n}.

Applying the inequalities (a+b)^{2}\leq 2a^{2}+2b^{2} for any a,b\geq 0 and B^{2}(t^{*})\leq\mathbb{E}_{\varepsilon}\lVert f^{t^{*}}-f^{*}\rVert_{n}^{2}, we arrive at

\mathbb{E}_{\varepsilon}\lVert f^{\tau}-f^{*}\rVert_{n}^{2}\leq 2\mathbb{E}_{\varepsilon}\lVert f^{t^{*}}-f^{*}\rVert_{n}^{2}+2\mathbb{E}_{\varepsilon}\lVert f^{\tau}-f^{t^{*}}\rVert_{n}^{2}
\leq 2(1+\theta^{-1})\mathbb{E}_{\varepsilon}\lVert f^{t^{*}}-f^{*}\rVert_{n}^{2}+2(\sqrt{3}+\theta)\frac{\sqrt{r}\sigma^{2}}{n}.

First of all, it is worth noting that the risk of the estimator f^{t^{*}} is proved to be optimal for gradient descent and kernel ridge regression regardless of the kernel used (see Appendix C for the proof), so it remains to focus on the remainder term on the right-hand side of Ineq. (21). Theorem 3.1 applies to any reproducing kernel, but one remarks that for infinite-rank kernels, r=n, and we achieve only the rate \mathcal{O}\left(1/\sqrt{n}\right). This rate is suboptimal since, for instance, RKHSs with polynomial eigenvalue decay kernels (considered in the next subsection) have a minimax-optimal risk rate of order \mathcal{O}\left(n^{-\frac{\beta}{\beta+1}}\right), with \beta>1. Therefore, the oracle-type inequality (21) is useful mainly for finite-rank kernels, thanks to the fast \mathcal{O}(\sqrt{r}/n) rate of the remainder term.

Notice that, in order to artificially make the term \mathcal{O}(\sqrt{r}/n) a remainder term (even in cases corresponding to infinite-rank kernels), [9, 10] introduced in the definitions of their stopping rules a restriction on the ”starting time” t_{0}. However, in the mentioned work, this restriction came at the price of possibly missing the designed time \tau. Besides that, [10] developed an additional procedure, based on standard model selection criteria such as the AIC criterion, for the spectral cut-off estimator to recover the ”missing” stopping rule and achieve optimality over Sobolev-type ellipsoids. In our work, we remove such a strong assumption.

As a corollary of Theorem 3.1, one can prove that f^{\tau} provides a minimax estimator of f^{*} over the ball of radius R.

Corollary 3.2.

Under Assumptions 1, 2, 3, if a kernel has finite rank r, then

\mathbb{E}_{\varepsilon}\lVert f^{\tau}-f^{*}\rVert_{n}^{2}\leq c_{u}R^{2}\widehat{\epsilon}_{n}^{2}, (22)

where the constant c_{u} is numeric.

Proof of Corollary 3.2.

From Theorem 3.1 and Lemma C.2 in the Appendix,

\mathbb{E}_{\varepsilon}\lVert f^{\tau}-f^{*}\rVert_{n}^{2}\leq 16(1+\theta^{-1})R^{2}\widehat{\epsilon}_{n}^{2}+2(\sqrt{3}+\theta)\frac{\sqrt{r}\sigma^{2}}{n}. (23)

Further, applying [29, Section 4.3], \widehat{\epsilon}_{n}^{2}=c\frac{r\sigma^{2}}{nR^{2}}, which implies that

\mathbb{E}_{\varepsilon}\lVert f^{\tau}-f^{*}\rVert_{n}^{2}\leq\Big{[}16(1+\theta^{-1})+\frac{2(\sqrt{3}+\theta)}{c}\Big{]}R^{2}\widehat{\epsilon}_{n}^{2}. (24)

Note that the critical radius \widehat{\epsilon}_{n} cannot be arbitrarily small since it should satisfy Ineq. (16). As will be clarified later, the squared empirical critical radius is essentially optimal.

3.1.2 Random-design framework

We would like to transfer the minimax optimality bound for the estimator f^{\tau} from the empirical L_{2}(\mathbb{P}_{n})-norm to the population L_{2}(\mathbb{P}_{X})-norm by means of the so-called localized population Rademacher complexity. This complexity measure has become a standard tool in empirical processes and nonparametric regression [5, 24, 29, 38].

For any kernel function class studied in the paper, we consider the localized Rademacher complexity that can be seen as a population counterpart of the empirical Rademacher complexity (15) introduced earlier:

\overline{\mathcal{R}}_{n}(\epsilon,\mathcal{H})=R\left[\frac{1}{n}\sum_{i=1}^{+\infty}\min\{\mu_{i},\epsilon^{2}\}\right]^{1/2}. (25)

Using the localized population Rademacher complexity, we define the associated population critical radius \epsilon_{n}>0 as the smallest positive solution \epsilon of the inequality

\frac{\overline{\mathcal{R}}_{n}(\epsilon,\mathcal{H})}{\epsilon R}\leq\frac{2\epsilon R}{\sigma}. (26)

In contrast to the empirical critical radius \widehat{\epsilon}_{n}, this quantity is not data-dependent, since it is specified by the population eigenvalues of the kernel operator T_{\mathbb{K}} underlying the RKHS.

Theorem 3.3.

Under Assumptions 1, 2, and 3, given the stopping time (20), there is a positive numeric constant \widetilde{c}_{u} such that, for finite-rank kernels with rank r, with probability at least 1-c\exp(-c_{1}n\epsilon_{n}^{2}),

\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\leq\widetilde{c}_{u}R^{2}\epsilon_{n}^{2}. (27)

In addition, the risk error of \tau is bounded as

\mathbb{E}\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\leq\frac{\widetilde{c}r\sigma^{2}}{n}+\underbrace{C(\sigma,R)\exp(-cr)}_{\textnormal{remainder term}}, (28)

where the constant C(\sigma,R) depends only on \sigma and R.

Remark.

The full proof is deferred to Section F. Regarding Ineq. (27), \epsilon_{n}^{2} is proven to be the minimax-optimal rate for the L_{2}(\mathbb{P}_{X}) norm in an RKHS (see [5, 27, 29]). As for the risk error in Ineq. (28), the (exponential) remainder term should decrease to zero faster than \frac{r\sigma^{2}}{n}, and Theorem 3.3 provides a rate \mathcal{O}\left(\frac{r\sigma^{2}}{n}\right) that matches, up to a constant, the minimax bound (see, e.g., [28, Theorem 2(a)] with s=1) when f^{*} belongs to the \mathcal{H}-norm ball of a fixed radius R, and is thus not improvable in general. A similar bound for finite-rank kernels was achieved in [29, Corollary 4].

We summarize our findings in the following corollary.

Corollary 3.4.

Under Assumptions 1, 2, 3 and a finite-rank kernel, the early stopping rule \tau satisfies

\mathbb{E}\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\asymp\underset{\widehat{f}}{\inf}\underset{\lVert f^{*}\rVert_{\mathcal{H}}\leq R}{\sup}\mathbb{E}\lVert\widehat{f}-f^{*}\rVert_{2}^{2}, (29)

where the infimum is taken over all measurable functions of the input data.

3.2 Practical behavior of \tau with infinite-rank kernels

A typical example of an RKHS with an infinite-rank kernel is the k^{\textnormal{th}}-order Sobolev space, for some fixed integer k\geq 1, with the Lebesgue measure on a bounded domain. We consider Sobolev spaces consisting of functions whose k^{\textnormal{th}}-order weak derivatives f^{(k)} are Lebesgue integrable and which satisfy f^{(0)}(0)=f^{(1)}(0)=\ldots=f^{(k-1)}(0)=0. It is worth mentioning that for such classes, the eigenvalues of the kernel operator satisfy \mu_{i}\asymp i^{-\beta},\ i=1,2,\ldots, with \beta=2k. Another example of a kernel with this eigenvalue decay is the Laplace kernel \mathbb{K}(x_{1},x_{2})=e^{-|x_{1}-x_{2}|},\ x_{1},x_{2}\in\mathbb{R} (see [31, p. 402]).

First, let us illustrate the practical behavior of the ESR (20) (via its histogram) for gradient descent (9) with the step-size \eta=1/(1.2\widehat{\mu}_{1}) and the one-dimensional Sobolev kernel \mathbb{K}(x_{1},x_{2})=\min\{x_{1},x_{2}\}, which generates the reproducing space

\mathcal{H}=\left\{f:[0,1]\to\mathbb{R}\ |\ f(0)=0,\int_{0}^{1}(f^{\prime}(x))^{2}dx<\infty\right\}. (30)

We deal with the model (1) with two regression functions: a smooth piecewise-linear function f^{*}(x)=|x-1/2|-1/2 and a nonsmooth heavisine function f^{*}(x)=0.093\ [4\sin(4\pi x)-\textnormal{sign}(x-0.3)-\textnormal{sign}(0.72-x)]. The design points are random, x_{i}\overset{\textnormal{i.i.d.}}{\sim}\mathbb{U}[0,1]. The number of observations is n=200. For both functions, \lVert f^{*}\rVert_{n}\approx 0.28, and we set a middle-difficulty noise level \sigma=0.15. The number of repetitions is N=200.
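A compact, self-contained sketch of this simulation setup (ours; the helper tau_gd is an illustrative implementation of the rule (20) for gradient descent, and the heavisine case is obtained by swapping the definition of f_star) is given below.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, n_rep = 200, 0.15, 200

def heavisine(x):
    return 0.093 * (4.0 * np.sin(4.0 * np.pi * x)
                    - np.sign(x - 0.3) - np.sign(0.72 - x))

def tau_gd(x, y, sigma, t_max=10_000):
    """Rule (20) for gradient descent with the Sobolev kernel min(x, x')."""
    n = len(x)
    K_n = np.minimum(x[:, None], x[None, :]) / n      # normalized Gram matrix
    mu_hat, U = np.linalg.eigh(K_n)
    mu_hat, U = mu_hat[::-1].clip(0.0), U[:, ::-1]    # sort decreasingly
    Z = U.T @ y                                       # rotated observations
    eta = 1.0 / (1.2 * mu_hat[0])                     # step-size of Section 3.2
    for t in range(1, t_max + 1):
        gamma = 1.0 - (1.0 - eta * mu_hat) ** t       # Eq. (11)
        if np.sum((1.0 - gamma) ** 2 * Z ** 2) / n <= sigma ** 2:
            return t                                  # full-rank case: r = n
    return t_max

taus = []
for _ in range(n_rep):
    x = rng.uniform(0.0, 1.0, n)
    f_star = np.abs(x - 0.5) - 0.5            # or: f_star = heavisine(x)
    y = f_star + sigma * rng.normal(size=n)
    taus.append(tau_gd(x, y, sigma))
# a histogram of `taus` corresponds to the tau-histograms displayed in Figure 2
```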

Figure 2: Histograms of \tau vs t^{*} vs t^{b}\coloneqq\inf\{t>0\ |\ B^{2}(t)\leq V(t)\} vs t_{\textnormal{or}}\coloneqq\underset{t>0}{\textnormal{argmin}}\left[\mathbb{E}_{\varepsilon}\lVert f^{t}-f^{*}\rVert_{n}^{2}\right] for kernel gradient descent with the step-size \eta=1/(1.2\widehat{\mu}_{1}), for the piecewise-linear f^{*}(x)=|x-1/2|-1/2 (panel (a)) and heavisine f^{*}(x)=0.093\ [4\sin(4\pi x)-\textnormal{sign}(x-0.3)-\textnormal{sign}(0.72-x)] (panel (b)) regression functions, and the first-order Sobolev kernel \mathbb{K}(x_{1},x_{2})=\min\{x_{1},x_{2}\}.

In panel (a) of Figure 2, we observe that our stopping rule \tau has a high variance. However, if we change the signal f^{*} from the smooth to the nonsmooth one, the regression function no longer belongs to the space \mathcal{H} defined in (30). In this case (panel (b) of Figure 2), the stopping rule \tau performs much better than for the previous regression function. In order to obtain a stable early stopping rule that is close to t^{*}, we propose using a special smoothing technique for the empirical risk.

4 Polynomial smoothing

As was discussed earlier, the main issue behind the poor behavior of the stopping rule \tau for infinite-rank kernels is the variability of the empirical risk around its expectation. The solution we propose is to smooth the empirical risk by means of the eigenvalues of the normalized Gram matrix.

4.1 Polynomial smoothing and minimum discrepancy principle rule

We start by defining the squared \alpha-norm as \lVert f\rVert_{n,\alpha}^{2}\coloneqq\langle K_{n}^{\alpha}F,F\rangle_{n} for all F=\left[f(x_{1}),\ldots,f(x_{n})\right]^{\top}\in\mathbb{R}^{n} and \alpha\in[0,1], from which we also introduce the smoothed risk, bias, and variance of a spectral filter estimator as

R_{\alpha}(t)=\mathbb{E}_{\varepsilon}\lVert f^{t}-f^{*}\rVert_{n,\alpha}^{2}=\lVert\mathbb{E}_{\varepsilon}f^{t}-f^{*}\rVert_{n,\alpha}^{2}+\mathbb{E}_{\varepsilon}\lVert f^{t}-\mathbb{E}_{\varepsilon}f^{t}\rVert_{n,\alpha}^{2}=B^{2}_{\alpha}(t)+V_{\alpha}(t),

with

B^{2}_{\alpha}(t)=\frac{1}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(t)})^{2}(G_{i}^{*})^{2},\qquad V_{\alpha}(t)=\frac{\sigma^{2}}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}(\gamma_{i}^{(t)})^{2}. (31)

The smoothed empirical risk is

R_{\alpha,t}=\lVert F^{t}-Y\rVert_{n,\alpha}^{2}=\lVert G^{t}-Z\rVert_{n,\alpha}^{2}=\frac{1}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(t)})^{2}Z_{i}^{2},\quad\textnormal{ for }t>0. (32)

Recall that the kernel is bounded by B=1, so that \widehat{\mu}_{i}\leq 1 for all i=1,\ldots,n; hence the smoothed bias B_{\alpha}^{2}(t) and smoothed variance V_{\alpha}(t) are smaller than their non-smoothed counterparts.

Analogously to the heuristic derivation leading to the stopping rule (20), the new stopping rule is based on the discrepancy principle applied to the \alpha-smoothed empirical risk, that is,

\tau_{\alpha}=\inf\left\{t>0\ |\ R_{\alpha,t}\leq\sigma^{2}\frac{\mathrm{tr}(K_{n}^{\alpha})}{n}\right\}, (33)

where \sigma^{2}\mathrm{tr}(K_{n}^{\alpha})/n=\sigma^{2}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}/n is the natural counterpart of r\sigma^{2}/n in the case of a full-rank kernel matrix and the \alpha-norm.
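Computationally, \tau_{\alpha} only requires reweighting the terms of the empirical risk by \widehat{\mu}_{i}^{\alpha} and changing the threshold accordingly, as in the following sketch for kernel gradient descent (ours, with the same rotated quantities Z and \widehat{\mu}_{i} as before).

```python
import numpy as np

def smoothed_mdp_stopping_time(Z, mu_hat, sigma, eta, alpha, t_max=100_000):
    """Polynomially smoothed discrepancy rule tau_alpha of Eq. (33) for kernel
    gradient descent: stop once the smoothed empirical risk of Eq. (32) drops
    below sigma^2 * tr(K_n^alpha) / n."""
    n = len(Z)
    weights = mu_hat ** alpha                      # spectral smoothing weights
    threshold = sigma ** 2 * np.sum(weights) / n   # sigma^2 tr(K_n^alpha) / n
    for t in range(1, t_max + 1):
        gamma = 1.0 - (1.0 - eta * mu_hat) ** t    # Eq. (11)
        R_alpha = np.sum(weights * (1.0 - gamma) ** 2 * Z ** 2) / n
        if R_alpha <= threshold:
            return t
    return t_max
```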

4.2 Related work

The idea of smoothing the empirical risk (the residuals) is not new in the literature. For instance, [11, 12, 13] discussed various smoothing strategies applied to (kernelized) conjugate gradient descent, and [18] considered spectral regularization with spectral filter estimators. More closely related to the present work, [33] studied a statistical performance improvement allowed by polynomial smoothing of the residuals (as we do here) but restricted to the spectral cut-off estimator.

In [12, 13], the authors considered the following statistical inverse problem: z=Ax+\sigma\zeta, where A is a self-adjoint operator and \zeta is Gaussian noise. In their case, for the purpose of achieving optimal rates, the usual discrepancy principle rule \lVert Ax_{m}-z\rVert\leq\vartheta\delta (m is the iteration number, \vartheta is a parameter) was modified to \lVert\rho_{\lambda}(A)(Ax_{m}-z)\rVert\leq\vartheta\delta, where \rho_{\lambda}(t)=\frac{1}{\sqrt{t+\lambda}} and \delta is the normalized variance of the Gaussian noise.

In [11], the minimum discrepancy principle was modified as follows: each iteration m of conjugate gradient descent was represented by a vector \widehat{\alpha}_{m}=K_{n}^{\dagger}Y, where K_{n}^{\dagger} is the pseudo-inverse of the normalized Gram matrix, and the learning process was stopped if \lVert Y-K_{n}\widehat{\alpha}_{m}\rVert_{K_{n}}<\Omega for some positive \Omega, where \lVert\alpha\rVert_{K_{n}}^{2}=\langle\alpha,K_{n}\alpha\rangle. Thus, this method corresponds (up to a threshold) to the stopping rule (33) with \alpha=1.

In the work [33], the authors concentrated on the inverse problem Y=A\xi+\delta W and its corresponding Gaussian vector observation model Y_{i}=\tilde{\mu}_{i}\xi_{i}+\delta\varepsilon_{i},\ i\in[r], where \{\tilde{\mu}_{i}\}_{i=1}^{r} are the singular values of the linear bounded operator A and \{\varepsilon_{i}\}_{i=1}^{r} are Gaussian noise variables. They recovered the signal \{\xi_{i}\}_{i=1}^{r} by a cut-off estimator of the form \widehat{\xi}_{i}^{(t)}=\mathbb{I}\{i\leq t\}\widetilde{\mu}_{i}^{-1}Y_{i},\ i\in[r]. The discrepancy principle in this case was \lVert(AA^{\top})^{\alpha/2}(Y-A\widehat{\xi}^{(t)})\rVert^{2}\leq\kappa for some positive \kappa. They found that, if the smoothing parameter \alpha lies in the interval [\frac{1}{4p},\frac{1}{2p}), where p is the polynomial decay rate of the singular values \{\widetilde{\mu}_{i}\}_{i=1}^{r}, then the cut-off estimator is adaptive over Sobolev ellipsoids. Therefore, our work can be considered as an extension of [33] that generalizes the polynomial smoothing strategy to more complex filter estimators, such as gradient descent and (Tikhonov) ridge regression, in the reproducing kernel framework.

4.3 Optimality result (fixed-design)

We pursue the analogy a bit further by defining the smoothed statistical dimension as

d_{n,\alpha}\coloneqq\inf\left\{j\in[n]:\widehat{\mu}_{j}\leq\widehat{\epsilon}_{n,\alpha}^{2}\right\}, (34)

and d_{n,\alpha}=n if no such index exists. Combined with (15), this implies that

\widehat{\mathcal{R}}_{n,\alpha}^{2}(\widehat{\epsilon}_{n,\alpha},\mathcal{H})\geq\frac{\sum_{j=1}^{d_{n,\alpha}}\widehat{\mu}_{j}^{\alpha}}{n}R^{2}\widehat{\epsilon}_{n,\alpha}^{2},\ \ \textnormal{ and }\ \ \widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\geq\frac{\sigma^{2}\sum_{j=1}^{d_{n,\alpha}}\widehat{\mu}_{j}^{\alpha}}{4R^{2}n}. (35)
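Given the sorted empirical eigenvalues and a value of \widehat{\epsilon}_{n,\alpha} (for instance from the bisection sketch of Section 2.4), the smoothed statistical dimension of Eq. (34) is immediate to compute; the small helper below (ours) makes the definition explicit.

```python
import numpy as np

def smoothed_statistical_dimension(mu_hat, eps_n_alpha):
    """d_{n,alpha} of Eq. (34): the first (1-based) index j with
    mu_hat_j <= eps_n_alpha^2, or n if no such index exists."""
    below = np.flatnonzero(mu_hat <= eps_n_alpha ** 2)
    return int(below[0]) + 1 if below.size else len(mu_hat)
```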

Let us emphasize that [41] already introduced the so-called statistical dimension (corresponding to d_{n,0} in our notation). It appeared that the statistical dimension provides an upper bound on the minimax-optimal dimension of randomized projections for kernel ridge regression (see [41, Theorem 2, Corollary 1]). In our case, d_{n,\alpha} can be seen as an \alpha-smoothed version of the statistical dimension.

The purpose of the following result is to give more insight into the meaning of Eq. (34) with regard to the minimax risk.

Theorem 4.1 (Lower bound from Theorem 1 in [41]).

For any regular kernel class, meaning that for any k=1,\ldots,n, \widehat{\mu}_{k+1}^{-1}\sum_{i=k+1}^{n}\widehat{\mu}_{i}\lesssim k, and any estimator \widetilde{f} of f^{*}\in\mathbb{B}_{\mathcal{H}}(R) satisfying the nonparametric model defined in Eq. (1), we get

\underset{\lVert f^{*}\rVert_{\mathcal{H}}\leq R}{\sup}\mathbb{E}_{\varepsilon}\lVert\widetilde{f}-f^{*}\rVert_{n}^{2}\geq c_{l}R^{2}\widehat{\epsilon}_{n}^{2},

for some numeric constant c_{l}>0.

First, in [41], the regularity assumption was formulated as \sum_{i=d_{n,0}+1}^{n}\widehat{\mu}_{i}\lesssim d_{n,0}\widehat{\epsilon}_{n}^{2}, which directly stems from the assumption in Theorem 4.1. Let us remark that the same assumption (as in Theorem 4.1) has already been made by [18, Assumption 6]. Second, Theorem 4.1 applies to any kernel, as long as the condition on the tail of the eigenvalues is fulfilled, which is in particular true for the reproducing kernels from Section 3.2. Thus, the fastest rate achievable by an estimator of f^{*} is \widehat{\epsilon}_{n}^{2}.

A key property for the smoothing to yield optimal results is that the value of \alpha has to be large enough to control the tail sum of the smoothed eigenvalues by the corresponding cumulative sum, which is the purpose of the assumption below.

Assumption 4.

There exists \Upsilon=[\alpha_{0},1],\ \alpha_{0}\geq 0, such that for all \alpha\in\Upsilon and k\in\{1,\ldots,n\},

\sum_{i=k+1}^{+\infty}\mu_{i}^{2\alpha}\leq\mathcal{M}\sum_{i=1}^{k}\mu_{i}^{2\alpha}, (36)

where \mathcal{M}\geq 1 denotes a numeric constant.

We enumerate several classical examples for which this assumption holds.

Example 1 (\beta-polynomial eigenvalue decay kernels).

Let us assume that the eigenvalues of the kernel operator satisfy, for some numeric constants 0<c\leq C,

ci^{-\beta}\leq\mu_{i}\leq Ci^{-\beta},\ \ i=1,2,\ldots. (37)

For the polynomial eigenvalue-decay kernels, Assumption 4 holds with

\mathcal{M}=2^{2\beta-1}\left(\frac{C}{c}\right)^{2}\quad\textnormal{and}\quad 1\geq\alpha\geq\frac{1}{\beta+1}=\alpha_{0}. (38)

Example 2 (\gamma-exponential eigenvalue-decay kernels).

Let us assume that the eigenvalues of the kernel operator satisfy, for some numeric constants 0<c\leq C and a constant \gamma>0,

ce^{-i^{\gamma}}\leq\mu_{i}\leq Ce^{-i^{\gamma}},\quad i=1,2,\ldots.

Instances of kernels within this class include the Gaussian kernel with respect to the Lebesgue measure on the real line (with \gamma=2) or on a compact domain (with \gamma=1) (up to a \log factor in the exponent, see [38, Example 13.21]). Then, Assumption 4 holds with

\mathcal{M}=\Big{(}\frac{C}{c}\Big{)}^{2}\frac{\int_{0}^{\infty}e^{-y^{\gamma}}dy}{\int_{2^{-1/\gamma}}^{2/(2\alpha_{0})^{1/\gamma}}e^{-y^{\gamma}}dy}\quad\textnormal{and}\quad\alpha\in[\alpha_{0},1],\quad\mbox{for any}\quad\alpha_{0}\in(0,1).
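Assumption 4 is also easy to probe numerically for a given spectrum: the sketch below (ours; truncating the infinite tail to a finite number of terms is an approximation) estimates the smallest constant \mathcal{M} satisfying Eq. (36) for a \beta-polynomial spectrum (37) and a few values of \alpha\geq(\beta+1)^{-1}.

```python
import numpy as np

def assumption_4_constant(mu, alpha):
    """Smallest M with sum_{i>k} mu_i^(2 alpha) <= M * sum_{i<=k} mu_i^(2 alpha)
    for all k, cf. Eq. (36) (the infinite tail is truncated to len(mu) terms)."""
    w = mu ** (2.0 * alpha)
    head = np.cumsum(w)
    tail = w.sum() - head
    return float(np.max(tail[:-1] / head[:-1]))

beta = 2.0
mu = 1.0 / np.arange(1, 100_001) ** beta       # beta-polynomial decay, Eq. (37)
for alpha in (1.0 / (beta + 1.0), 0.45, 1.0):
    print(alpha, assumption_4_constant(mu, alpha))   # finite, moderate constants
```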

For any regular kernel class satisfying the above assumption, the next theorem provides a high-probability bound on the performance of f^{\tau_{\alpha}} (measured in terms of the L_{2}(\mathbb{P}_{n})-norm), which depends on the smoothed empirical critical radius.

Theorem 4.2 (Upper bound on the empirical norm).

Under Assumptions 1, 2, 3, and 4, for any regular kernel and \alpha\leq\frac{1}{2}, the stopping time (33) satisfies

\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}\leq c_{u}R^{2}\widehat{\epsilon}_{n,\alpha}^{2} (39)

with probability at least 1-c\exp\Big{[}-c_{1}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\Big{]} for some positive constants c_{1} and c_{u}, where c_{1} depends only on \mathcal{M}, while c_{u} and c are numeric. Moreover,

\mathbb{E}_{\varepsilon}\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}\leq CR^{2}\widehat{\epsilon}_{n,\alpha}^{2}+20\max\{\sigma^{2},R^{2}\}\exp\left[-c_{3}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right], (40)

where the constant C is numeric and the constant c_{3} depends only on \mathcal{M}.

The complete proof of Theorem 4.2 is given in Appendix D. The main message is that the final performance of the estimator f^{\tau_{\alpha}} is controlled by the smoothed critical radius \widehat{\epsilon}_{n,\alpha}^{2}. From the existing literature on the empirical critical radius [28, 29, 38, 41], it is already known that the non-smoothed version \widehat{\epsilon}_{n}^{2} is the typical quantity that leads to minimax rates in the RKHS (see also Theorem 4.1). The behavior of \widehat{\epsilon}_{n,\alpha}^{2} with respect to n is likely to depend on \alpha, as emphasized by the notation. Intuitively, this suggests that there could exist a range of values of \alpha for which \widehat{\epsilon}_{n,\alpha}^{2} is of the same order as (or faster than) \widehat{\epsilon}_{n}^{2}, leading therefore to optimal rates.

Another striking aspect of Ineq. (40) is the additional term involving the exponential function. Since (39) is a ”high probability” statement, this term is expected to converge to 0 at a rate depending on n\widehat{\epsilon}_{n,\alpha}^{2}. Therefore, the final convergence rate, as well as whether this term is negligible, depends on \alpha.

As a consequence of Theorem 4.1, as long as there exist values of \alpha such that \widehat{\epsilon}_{n,\alpha}^{2} is at most as large as \widehat{\epsilon}_{n}^{2}, the estimator f^{\tau_{\alpha}} is optimal.

4.4 Consequences for \beta-polynomial eigenvalue-decay kernels

The leading idea of the present section is to identify values of \alpha for which the bound (39) from Theorem 4.2 scales as R^{2}\widehat{\epsilon}_{n}^{2}.

Let us recall the definition of a polynomial decay kernel from (37):

ciβμiCiβ,i=1,2,, for β>1 and numeric constants c,C>0.ci^{-\beta}\leq\mu_{i}\leq Ci^{-\beta},\ i=1,2,\ldots,\ \ \textnormal{ for }\beta>1\textnormal{ and numeric constants }c,C>0.

One typical example of a reproducing kernel satisfying this condition is the Sobolev kernel on [0,1]\times[0,1] given by \mathbb{K}(x,x^{\prime})=\min\{x,x^{\prime}\}, with \beta=2 [29]. The corresponding RKHS is the first-order Sobolev class, that is, the class of functions that are almost everywhere differentiable with derivative in L_{2}[0,1].

Lemma 4.3.

For any β\beta-polynomial eigenvalue decay kernel, there exist numeric constants c1,c2>0c_{1},c_{2}>0 such that for α<1/β\alpha<1/\beta, one has

c_{1}\widehat{\epsilon}_{n}^{2}\leq\widehat{\epsilon}_{n,\alpha}^{2}\leq c_{2}\widehat{\epsilon}_{n}^{2}\asymp\left(\frac{\sigma^{2}}{2R^{2}n}\right)^{\frac{\beta}{\beta+1}}.

The proof of Lemma 4.3 is deferred to Lemma A.2 in Appendix A and is not reproduced here. Therefore, if \alpha\beta<1, then \widehat{\epsilon}_{n,\alpha}^{2}\asymp\widehat{\epsilon}_{n}^{2}\asymp\left(\frac{\sigma^{2}}{2R^{2}n}\right)^{\frac{\beta}{\beta+1}}. Let us now recall from (38) that Assumption 4 holds for \alpha\geq(\beta+1)^{-1}. All these arguments lead to the next result, which establishes the minimax optimality of \tau_{\alpha} for any kernel satisfying the \beta-polynomial eigenvalue-decay assumption, as long as \alpha\in\left[\frac{1}{\beta+1},\min\left\{\frac{1}{\beta},\frac{1}{2}\right\}\right).

Corollary 4.4.

Under Assumptions 1, 2, 3, and the \beta-polynomial eigenvalue decay (37), for any \alpha\in\left[\frac{1}{\beta+1},\min\left\{\frac{1}{\beta},\frac{1}{2}\right\}\right), the early stopping rule \tau_{\alpha} satisfies

\mathbb{E}_{\varepsilon}\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}\asymp\inf_{\widehat{f}}\ \sup_{\lVert f^{*}\rVert_{\mathcal{H}}\leq R}\mathbb{E}_{\varepsilon}\lVert\widehat{f}-f^{*}\rVert_{n}^{2}, (41)

where the infimum is taken over all measurable functions of the input data.

Corollary 4.4 establishes an optimality result in the fixed-design framework: as long as (\beta+1)^{-1}\leq\alpha<\min\left\{\beta^{-1},\frac{1}{2}\right\}, the upper bound matches the lower bound up to multiplicative constants. Moreover, this property holds uniformly with respect to \beta>1, provided the value of \alpha is chosen appropriately. An interesting feature of this bound is that the optimal value of \alpha only depends on the (polynomial) decay rate of the empirical eigenvalues of the normalized Gram matrix. This suggests that any effective estimator of the unknown parameter \beta could be plugged into the above (fixed-design) result and would lead to an optimal rate. Note that [33] has emphasized a similar trade-off for the smoothing parameter \alpha (polynomial smoothing) when considering the spectral cut-off estimator in the Gaussian sequence model. Regarding convergence rates, Corollary 4.4 combined with Lemma 4.3 shows that the expected risk converges at the rate \mathcal{O}\left(n^{-\frac{\beta}{\beta+1}}\right). This matches the known rate for nonparametric regression in the random design framework [29, 34], which is minimax-optimal as long as f^{*} belongs to the RKHS \mathcal{H}.
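To illustrate the scaling claimed by Lemma 4.3 and exploited in Corollary 4.4, the following numerical sketch (our own illustration, not part of the experiments of Section 5) computes the critical radii directly from a polynomially decaying spectrum. It assumes the closed-form expression of the smoothed localized complexity appearing in the proofs (Lemma A.2 and Appendix D), namely \widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})=R\sqrt{n^{-1}\sum_{j}\widehat{\mu}_{j}^{\alpha}\min\{\widehat{\mu}_{j},\epsilon^{2}\}}, and takes \widehat{\epsilon}_{n,\alpha} as the smallest grid point satisfying the smoothed critical inequality \widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})\leq\frac{2R^{2}}{\sigma}\epsilon^{2+\alpha}.

```python
import numpy as np

def smoothed_critical_radius(mu_hat, sigma, R=1.0, alpha=0.0, grid_size=2000):
    """Smallest epsilon on a grid satisfying the smoothed critical inequality
    R_hat_{n,alpha}(eps) <= (2 R^2 / sigma) * eps^(2+alpha), assuming the closed form
    R_hat_{n,alpha}(eps)^2 = (R^2 / n) * sum_j mu_hat_j^alpha * min(mu_hat_j, eps^2)."""
    n = len(mu_hat)
    for eps in np.logspace(-4, 0, grid_size):
        complexity = R * np.sqrt(np.sum(mu_hat**alpha * np.minimum(mu_hat, eps**2)) / n)
        if complexity <= (2.0 * R**2 / sigma) * eps ** (2.0 + alpha):
            return eps                      # fixed point: the inequality holds from here on
    return 1.0

# Polynomial eigenvalue decay mu_hat_i = i^{-beta} with beta = 2 (Sobolev-like spectrum).
beta, sigma = 2.0, 0.15
for n in [100, 400, 1600]:
    mu_hat = np.arange(1, n + 1, dtype=float) ** (-beta)
    eps2_plain = smoothed_critical_radius(mu_hat, sigma, alpha=0.0) ** 2
    eps2_smooth = smoothed_critical_radius(mu_hat, sigma, alpha=1.0 / (beta + 1)) ** 2
    rate = (sigma**2 / (2.0 * n)) ** (beta / (beta + 1))   # reference rate from Lemma 4.3 (R = 1)
    print(n, eps2_plain, eps2_smooth, rate)
```

Both squared radii should decrease at the same n^{-\beta/(\beta+1)} rate, in line with Lemma 4.3.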

5 Empirical comparison with existing stopping rules

The present section illustrates the practical behavior of the stopping rules discussed along the paper and compares them with existing alternatives.

5.1 Stopping rules involved

The empirical comparison is carried out between the stopping rules \tau (20) and \tau_{\alpha} (33) with \alpha\in\left[\frac{1}{\beta+1},\min\left\{\frac{1}{\beta},\frac{1}{2}\right\}\right), and four alternative stopping rules that are briefly described in what follows. For the sake of comparison, most of them correspond to early stopping rules already considered in [29].
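Although the precise definitions of \tau (20) and \tau_{\alpha} (33) are given earlier in the paper, the minimal sketch below indicates how such a smoothed discrepancy-based rule can be evaluated in practice. Everything in it is an explicit assumption of the sketch rather than a verbatim transcription of (33): K is the normalized Gram matrix with eigendecomposition K=U\,\textnormal{diag}(\widehat{\mu})\,U^{\top}, the iterates are spectral-filter estimators with gradient-descent coefficients \gamma_{i}^{(t)}=1-(1-\eta\widehat{\mu}_{i})^{t} (consistent with Lemma A.1), the smoothed empirical risk is computed in the rotated basis Z=U^{\top}Y as R_{\alpha,t}=\frac{1}{n}\sum_{i}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(t)})^{2}Z_{i}^{2} (the form used in Appendix D), and the rule stops at the first t with R_{\alpha,t}\leq\sigma^{2}\textnormal{tr}(K^{\alpha})/n; the rule \tau corresponds to \alpha=0.

```python
import numpy as np

def mdp_stopping_time(K, Y, sigma2, eta, alpha=0.0, t_max=5000):
    """Sketch of a smoothed minimum-discrepancy stopping time: stop as soon as the
    smoothed empirical risk drops below sigma2 * tr(K^alpha) / n."""
    n = K.shape[0]
    mu_hat, U = np.linalg.eigh(K)
    mu_hat, U = mu_hat[::-1].clip(min=0.0), U[:, ::-1]     # decreasing eigenvalues
    Z = U.T @ Y                                            # rotated observations
    threshold = sigma2 * np.sum(mu_hat**alpha) / n
    for t in range(1, t_max + 1):
        gamma = 1.0 - (1.0 - eta * mu_hat) ** t            # gradient-descent filter coefficients
        R_alpha_t = np.sum(mu_hat**alpha * (1.0 - gamma) ** 2 * Z**2) / n
        if R_alpha_t <= threshold:
            return t
    return t_max
```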

Hold-out stopping rule

We consider a procedure based on the hold-out idea [3]. The data \{(x_{i},y_{i})\}_{i=1}^{n} are split into two parts: the training sample S_{\textnormal{train}}=(x_{\textnormal{train}},y_{\textnormal{train}}) and the test sample S_{\textnormal{test}}=(x_{\textnormal{test}},y_{\textnormal{test}}), each containing half of the whole dataset. We train the learning algorithm for t=0,1,\ldots and estimate the risk at each t by R_{\textnormal{ho}}(f^{t})=\frac{1}{n}\sum_{i\in S_{\textnormal{test}}}((\widehat{y}_{\textnormal{test}})_{i}-y_{i})^{2}, where (\widehat{y}_{\textnormal{test}})_{i} denotes the output of the algorithm trained for t iterations on S_{\textnormal{train}} and evaluated at the point x_{i} of the test sample. The final stopping rule is defined as

\widehat{\textnormal{T}}_{\textnormal{HO}}=\textnormal{argmin}\Big\{t\in\mathbb{N}\ |\ R_{\textnormal{ho}}(f^{t+1})>R_{\textnormal{ho}}(f^{t})\Big\}-1. (42)

Although it does not use all the data for training (loss of information), the hold-out strategy has been proved to output minimax-optimal estimators in various contexts (see, for instance, [15, 16] with Sobolev spaces and \beta\leq 2).
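For concreteness, a minimal sketch of (42) is given below; the callable train_and_predict is a hypothetical stand-in for the learning algorithm run for t iterations on the training half and evaluated on the test half, and the 1/n normalization of R_{\textnormal{ho}} is replaced by a mean over the test set, since a constant rescaling does not affect the location of the first increase.

```python
import numpy as np

def first_increase_rule(risk_curve):
    """Common to (42) and (43): argmin{t : risk(t+1) > risk(t)} - 1, floored at 0."""
    for t in range(len(risk_curve) - 1):
        if risk_curve[t + 1] > risk_curve[t]:
            return max(t - 1, 0)
    return len(risk_curve) - 1            # the curve never increased within the horizon

def holdout_stopping_time(train_and_predict, x, y, t_max, rng):
    """Hold-out rule (42) on a random half/half split of the data."""
    n = len(y)
    perm = rng.permutation(n)
    tr, te = perm[: n // 2], perm[n // 2:]
    risks = [np.mean((train_and_predict(x[tr], y[tr], x[te], t) - y[te]) ** 2)
             for t in range(t_max + 1)]
    return first_increase_rule(risks)
```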

V-fold stopping rule

The observations \{(x_{i},y_{i})\}_{i=1}^{n} are randomly split into V=4 equal-sized blocks. At each of the V rounds, V-1 blocks are devoted to the training sample S_{\textnormal{train}}=(x_{\textnormal{train}},y_{\textnormal{train}}), and the remaining one serves as the test sample S_{\textnormal{test}}=(x_{\textnormal{test}},y_{\textnormal{test}}). At each iteration t=1,\ldots, the risk is estimated by R_{\textnormal{VFCV}}(f^{t})=\frac{1}{V-1}\sum_{j=1}^{V-1}\frac{1}{n/V}\sum_{i\in S_{\textnormal{test}}(j)}((\widehat{y}_{\textnormal{test}})_{i}-y_{i})^{2}, where \widehat{y}_{\textnormal{test}} is defined as for the hold-out stopping rule. The final stopping time is

\widehat{\textnormal{T}}_{\textnormal{VFCV}}=\textnormal{argmin}\big\{t\in\mathbb{N}\ |\ R_{\textnormal{VFCV}}(f^{t+1})>R_{\textnormal{VFCV}}(f^{t})\big\}-1. (43)

V-fold cross-validation is widely used in practice since, on the one hand, it is more computationally tractable than other splitting-based methods such as leave-one-out or leave-p-out (see the survey [3]), and, on the other hand, it enjoys a better statistical performance than the hold-out (lower variability).
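A corresponding sketch of (43) follows, reusing first_increase_rule from the previous block; it aggregates the test errors over all V folds, which is a standard V-fold average (constant factors in the normalization do not change the location of the first increase).

```python
import numpy as np

def vfold_stopping_time(train_and_predict, x, y, t_max, rng, V=4):
    """V-fold rule (43): average the per-fold test errors at each iteration t and
    apply the same first-increase rule as for the hold-out."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), V)
    risks = []
    for t in range(t_max + 1):
        fold_errors = []
        for j in range(V):
            te = folds[j]
            tr = np.concatenate([folds[k] for k in range(V) if k != j])
            pred = train_and_predict(x[tr], y[tr], x[te], t)
            fold_errors.append(np.mean((pred - y[te]) ** 2))
        risks.append(np.mean(fold_errors))
    return first_increase_rule(risks)
```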

Raskutti-Wainwright-Yu stopping rule (from [29])

The use of this stopping rule heavily relies on the assumption that \lVert f^{*}\rVert_{\mathcal{H}}^{2} is known, which is a strong requirement in practice. It controls the bias-variance trade-off by using upper bounds on the bias and variance terms, the latter involving the localized empirical Rademacher complexity \widehat{\mathcal{R}}_{n}\left(\frac{1}{\sqrt{\eta t}},\mathcal{H}\right). It stops as soon as the (upper bound on the) bias term becomes smaller than the (upper bound on the) variance term, which leads to

\widehat{\textnormal{T}}_{\textnormal{RWY}}=\textnormal{argmin}\Big\{t\in\mathbb{N}\ |\ \widehat{\mathcal{R}}_{n}\Big(\frac{1}{\sqrt{\eta t}},\mathcal{H}\Big)>(2e\sigma\eta t)^{-1}\Big\}-1. (44)
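A minimal sketch of (44) is given below; it assumes the unit-ball closed form of the localized empirical kernel complexity used in [29], \widehat{\mathcal{R}}_{n}(\epsilon,\mathcal{H})=\sqrt{n^{-1}\sum_{i}\min\{\widehat{\mu}_{i},\epsilon^{2}\}}, so it illustrates the mechanism rather than providing a definitive implementation.

```python
import numpy as np

def rwy_stopping_time(mu_hat, sigma, eta, t_max=5000):
    """Rule (44): stop one step before the localized kernel complexity at scale
    1/sqrt(eta*t) exceeds (2*e*sigma*eta*t)^{-1}."""
    n = len(mu_hat)
    for t in range(1, t_max + 1):
        eps2 = 1.0 / (eta * t)
        complexity = np.sqrt(np.sum(np.minimum(mu_hat, eps2)) / n)
        if complexity > 1.0 / (2.0 * np.e * sigma * eta * t):
            return max(t - 1, 1)      # the "- 1" of (44), floored at one iteration
    return t_max
```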

Theoretical minimum discrepancy-based stopping rule t^{*}

The fourth stopping rule is the one introduced in (19). It relies on the minimum discrepancy principle and involves the (theoretical) expected empirical risk 𝔼εRt\mathbb{E}_{\varepsilon}R_{t}:

t^{*}=\inf\left\{t\in\mathbb{N}\ |\ \mathbb{E}_{\varepsilon}R_{t}\leq\sigma^{2}\right\}.

This stopping time is introduced for comparison purposes only since it cannot be computed in practice. This rule is proved to be optimal (see Appendix C) for any bounded reproducing kernel, so it could serve as a reference in the present empirical comparison.

Oracle stopping rule

The "oracle" stopping rule is defined as the first time the risk curve starts to increase:

t_{\textnormal{or}}=\textnormal{argmin}\big\{t\in\mathbb{N}\ |\ \mathbb{E}_{\varepsilon}\lVert f^{t+1}-f^{*}\rVert_{n}^{2}>\mathbb{E}_{\varepsilon}\lVert f^{t}-f^{*}\rVert_{n}^{2}\big\}-1. (45)

In situations where the risk has a unique global minimum, this rule coincides with the location of that minimum. Its formulation reflects the realistic constraint that we do not have access to the whole risk curve (unlike in the classical model selection setup).

5.2 Simulation design

Artificial data are generated according to the regression model y_{j}=f^{*}(x_{j})+\varepsilon_{j}, j=1,\ldots,n, where the \varepsilon_{j} are i.i.d. \mathcal{N}(0,\sigma^{2}) with \sigma=0.15 and the design points are equidistant, x_{j}=j/n, j=1,\ldots,n. The same experiments have also been carried out with uniform x_{i}\sim\mathbb{U}[0,1] (not reported here) without any change in the conclusions. The sample size n varies from 40 to 400.

The gradient descent algorithm (9) has been used with the step-size \eta=(1.2\,\widehat{\mu}_{1})^{-1} and initialization F^{0}=[0,\ldots,0]^{\top}.

The present comparison involves two regression functions with the same L_{2}(\mathbb{P}_{n})-norm of the signal, \lVert f^{*}\rVert_{n}\approx 0.28: (i) a piecewise linear function called "smooth", f^{*}(x)=|x-1/2|-1/2, and (ii) a "sinus" function, f^{*}(x)=0.4\,\sin(4\pi x).

To ease the comparison, the piecewise linear regression function was set up as in [29, Figure 3].

The case of finite-rank kernels is addressed in Section 5.3.1 with the so-called polynomial kernel of degree 3 defined by \mathbb{K}(x_{1},x_{2})=(1+x_{1}^{\top}x_{2})^{3} on the unit square [0,1]\times[0,1]. By contrast, Section 5.3.2 tackles polynomial decay kernels with the first-order Sobolev kernel \mathbb{K}(x_{1},x_{2})=\min\{x_{1},x_{2}\} on the same domain.

The performance of the early stopping rules is measured in terms of the squared L_{2}(\mathbb{P}_{n})-norm \lVert f^{t}-f^{*}\rVert_{n}^{2}, averaged over N=100 independent trials.

For our simulations, we use a variance estimation method that is described in Section 5.4. This method is asymptotically unbiased, which is sufficient for our purposes.
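The sketch below reproduces this simulation design in a spectral-filter form: the normalized Gram matrix K_{n}=K/n is diagonalized once, and the gradient-descent iterates are represented by the filter coefficients \gamma_{i}^{(t)}=1-(1-\eta\widehat{\mu}_{i})^{t} with \eta=1/(1.2\,\widehat{\mu}_{1}). This representation is consistent with Lemma A.1 but is an assumption of the sketch, not a verbatim implementation of iteration (9).

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 0.15
x = np.arange(1, n + 1) / n                       # equidistant design x_j = j/n
f_star = np.abs(x - 0.5) - 0.5                    # "smooth" piecewise linear signal
y = f_star + sigma * rng.standard_normal(n)

# Normalized Gram matrices of the two kernels used in Section 5.
K_poly = (1.0 + np.outer(x, x)) ** 3 / n          # finite-rank polynomial kernel (Section 5.3.1)
K_sob = np.minimum.outer(x, x) / n                # first-order Sobolev kernel (Section 5.3.2)

def gd_error_curve(K, y, f_star, t_max=500):
    """L2(P_n) error curve t -> ||f^t - f*||_n^2 of kernel gradient descent written
    as a spectral filter: the fitted values at iteration t are U diag(gamma^(t)) U^T y."""
    mu_hat, U = np.linalg.eigh(K)
    mu_hat, U = mu_hat[::-1].clip(min=0.0), U[:, ::-1]
    eta = 1.0 / (1.2 * mu_hat[0])
    Z = U.T @ y
    errors = []
    for t in range(t_max + 1):
        gamma = 1.0 - (1.0 - eta * mu_hat) ** t
        f_t = U @ (gamma * Z)
        errors.append(np.mean((f_t - f_star) ** 2))
    return np.array(errors)

print(gd_error_curve(K_sob, y, f_star)[:5])       # first few values of the error curve
```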

5.3 Results of the simulation experiments

5.3.1 Finite-rank kernels

Figure 3: Kernel gradient descent with the step-size \eta=1/(1.2\widehat{\mu}_{1}) and the polynomial kernel \mathbb{K}(x_{1},x_{2})=(1+x_{1}^{\top}x_{2})^{3}, x_{1},x_{2}\in[0,1], for the estimation of two noisy regression functions: the smooth f^{*}(x)=|x-1/2|-1/2 in panel (a), and the "sinus" f^{*}(x)=0.4\sin(4\pi x) in panel (b), with equidistant covariates x_{j}=j/n. Each curve shows the squared L_{2}(\mathbb{P}_{n})-norm error of the stopping rules (45), (19), (44), (43), and (20), averaged over 100 independent trials, versus the sample size n\in\{40,80,120,200,320,400\}.

Figure 3 displays the (averaged) L_{2}(\mathbb{P}_{n})-norm error of the oracle stopping rule (45), our stopping rule \tau (20), t^{*} (19), the minimax-optimal stopping rule \widehat{\textnormal{T}}_{\textnormal{RWY}} (44), and the 4-fold cross-validation stopping time \widehat{\textnormal{T}}_{\textnormal{VFCV}} (43) versus the sample size. Figure 3(a) shows the results for the piecewise linear regression function, whereas Figure 3(b) corresponds to the "sinus" regression function.

All the curves decrease as n grows. From these graphs, the overall worst performance is achieved by \widehat{\textnormal{T}}_{\textnormal{VFCV}}, especially for small sample sizes, which can be attributed to the additional randomness induced by the preliminary random splitting of 4-FCV. By contrast, the minimum discrepancy-based stopping rules (\tau and t^{*}) exhibit the best performance compared with \widehat{\textnormal{T}}_{\textnormal{VFCV}} and \widehat{\textnormal{T}}_{\textnormal{RWY}}. The averaged mean-squared error of \tau gets closer to that of t^{*} as the sample size n increases, which was expected from the theory and also intuitively, since \tau was introduced as an estimator of t^{*}. From Figure 3(a), \widehat{\textnormal{T}}_{\textnormal{RWY}} is less accurate for small sample sizes but improves considerably as n grows, eventually achieving a performance similar to that of \tau. This can result from the fact that \widehat{\textnormal{T}}_{\textnormal{RWY}} is built from upper bounds on the bias and variance terms, which are likely to be looser for small sample sizes but achieve an optimal convergence rate as n increases. In Figure 3(b), the reason why \tau exhibits (strongly) better results than \widehat{\textnormal{T}}_{\textnormal{RWY}} is the main assumption on the regression function, namely that \lVert f^{*}\rVert_{\mathcal{H}}\leq 1, which could be violated for the "sinus" function.

5.3.2 Polynomial eigenvalue decay kernels

Figure 4 displays the resulting (averaged over 100 repetitions) L_{2}(\mathbb{P}_{n})-error of \tau_{\alpha} (33) with \alpha=(\beta+1)^{-1}=0.33, \widehat{\textnormal{T}}_{\textnormal{RWY}} (44), t^{*} (19), and \widehat{\textnormal{T}}_{\textnormal{HO}} (42) versus the sample size.

Figure 4: Kernel gradient descent (9) with the step-size \eta=1/(1.2\widehat{\mu}_{1}) and the Sobolev kernel \mathbb{K}(x_{1},x_{2})=\min\{x_{1},x_{2}\}, x_{1},x_{2}\in[0,1], for the estimation of two noisy regression functions: the smooth f^{*}(x)=|x-1/2|-1/2 in panel (a) and the "sinus" f^{*}(x)=0.4\sin(4\pi x) in panel (b), with equidistant covariates x_{j}=j/n. Each curve shows the squared L_{2}(\mathbb{P}_{n})-norm error of the stopping times (45), (19), (44), (42), and (33) with \alpha=0.33, averaged over 100 independent trials, versus the sample size n\in\{40,80,120,200,320,400\}.

Figure 4(a) shows that all stopping rules work comparably well, with a slight advantage for \widehat{\textnormal{T}}_{\textnormal{HO}} and \widehat{\textnormal{T}}_{\textnormal{RWY}} over t^{*} and \tau_{\alpha}. However, as n grows to 400, the performances of all stopping rules become very close to each other. Let us emphasize that the true value of \beta is not known in these experiments; the value (\beta+1)^{-1}=0.33 has been estimated from the decay of the empirical eigenvalues of the normalized Gram matrix. This can explain why the performance of \tau_{\alpha} remains worse than that of \widehat{\textnormal{T}}_{\textnormal{RWY}}.

The story described by Figure 4(b) is somewhat different. The first striking observation is that \widehat{\textnormal{T}}_{\textnormal{RWY}} completely fails on this example, which again stems from the (unsatisfied) constraint on the \mathcal{H}-norm of f^{*}. The best performance is still achieved by the hold-out stopping rule, although \tau_{\alpha} and t^{*} remain very close to it. The fact that t^{*} remains close to the oracle stopping rule (without any need for smoothing) supports the idea that the minimum discrepancy principle is a reliable way of designing an effective stopping rule. The deficiency of \tau (by contrast with \tau_{\alpha}) results from the variability of the empirical risk, which does not remain close enough to its expectation. This bad behavior is balanced by introducing the polynomial smoothing at level \alpha within the definition of \tau_{\alpha}, which enjoys a close-to-optimal practical performance.

Let us also mention that \widehat{\textnormal{T}}_{\textnormal{HO}} exhibits some variability, in particular for small sample sizes, as illustrated by Figures 4(a) and 4(b).

The overall conclusion is that the smoothed minimum discrepancy-based stopping time \tau_{\alpha} leads to almost optimal performance provided \alpha=(\beta+1)^{-1}, where \beta quantifies the polynomial decay of the empirical eigenvalues \{\widehat{\mu}_{i}\}_{i=1}^{n}.

5.4 Estimation of variance and decay rate for polynomial eigenvalue decay kernels

The purpose of the present section is to describe two strategies for estimating (i) the decay rate of the empirical eigenvalues of the normalized Gram matrix, and (ii) the variance parameter \sigma^{2}.

5.4.1 Polynomial decay parameter estimation

From the empirical version of the polynomial decay assumption (37), one can easily derive upper and lower bounds for \beta as \frac{\log(\widehat{\mu}_{i}/\widehat{\mu}_{i+1})-\log(C/c)}{\log(1+1/i)}\leq\beta\leq\frac{\log(\widehat{\mu}_{i}/\widehat{\mu}_{i+1})+\log(C/c)}{\log(1+1/i)}. The gap between these bounds equals \frac{2\log(C/c)}{\log(1+1/i)}, which is minimized at i=1. The best precision on the estimated value of \beta is therefore reached with i=1, which yields the estimator \widehat{\beta}=\frac{\log(\widehat{\mu}_{1}/\widehat{\mu}_{2})}{\log 2}.
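In code, this estimator and the resulting smoothing level used in Section 5.3.2 read as follows, where mu_hat contains the eigenvalues of the normalized Gram matrix sorted in decreasing order.

```python
import numpy as np

def estimate_beta_and_alpha(mu_hat):
    """Decay-rate estimator of Section 5.4.1 and the corresponding smoothing level:
    beta_hat = log(mu_hat_1 / mu_hat_2) / log(2) and alpha = 1 / (beta_hat + 1)."""
    beta_hat = np.log(mu_hat[0] / mu_hat[1]) / np.log(2.0)
    return beta_hat, 1.0 / (beta_hat + 1.0)

# Example: an exactly i^{-2} spectrum gives beta_hat = 2 and alpha = 1/3.
print(estimate_beta_and_alpha(np.arange(1.0, 101.0) ** (-2.0)))
```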

5.4.2 Variance parameter estimation

There are numerous suggestions for variance estimation with linear smoothers; see, e.g., Section 5.6 in the book [39]. In our simulation experiments, two cases are distinguished: the situation where the reproducing kernel has finite rank r, and the situation where \textnormal{rk}(T_{\mathbb{K}})=\infty. In both cases, an asymptotically unbiased estimator of \sigma^{2} is designed.

Finite-rank kernel.

With such a finite-rank kernel, the estimation of the noise is made from the coordinates \{Z_{i}\}_{i=r+1}^{n}, which correspond to the situation where G_{i}^{*}=0, i>r (see Section 4.1.1 in [29]). These coordinates (which are pure noise) are exploited to build an easy-to-compute estimator of \sigma^{2}, namely

\widehat{\sigma}^{2}=\frac{\sum_{i=r+1}^{n}Z_{i}^{2}}{n-r}. (46)
Infinite-rank kernel.

If \textnormal{rk}(T_{\mathbb{K}})=\infty, we suggest using the following result.

Lemma 5.1.

For any regular kernel (see Theorem 4.1), any value of t satisfying \eta t\cdot\widehat{\epsilon}_{n}^{2}\to+\infty as n\to+\infty yields that \widehat{\sigma}^{2}=\frac{R_{t}}{\frac{1}{n}\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}} is an asymptotically unbiased estimator of \sigma^{2}.

A sketch of the proof of Lemma 5.1 is given in Appendix H. Based on this lemma, we suggest taking t=T, where T is the maximum number of iterations allowed by the computational constraints. Notice that, as long as closed-form expressions of the estimator are available, there is no need to compute the estimators for all t between 1 and T. The final estimator of \sigma^{2} used in the experiments of Section 5.3 is given by

\widehat{\sigma}^{2}=\frac{R_{T}}{\frac{1}{n}\sum_{i=1}^{n}(1-\gamma_{i}^{(T)})^{2}}. (47)
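Minimal sketches of the two estimators (46) and (47) are given below, written in the same rotated basis as the earlier sketches: Z=U^{\top}Y with the eigenvalues of the normalized Gram matrix in decreasing order, gradient-descent filter coefficients \gamma_{i}^{(T)}=1-(1-\eta\widehat{\mu}_{i})^{T} (an assumption of the sketch), and R_{T}=\frac{1}{n}\sum_{i}(1-\gamma_{i}^{(T)})^{2}Z_{i}^{2}, consistently with (55).

```python
import numpy as np

def sigma2_finite_rank(Z, r):
    """Estimator (46): average the squared rotated observations that are pure noise
    when the kernel has finite rank r (the last n - r coordinates carry no signal)."""
    n = len(Z)
    return np.sum(Z[r:] ** 2) / (n - r)

def sigma2_infinite_rank(Z, mu_hat, eta, T):
    """Estimator (47): empirical risk at a late iteration T divided by
    (1/n) * sum_i (1 - gamma_i^(T))^2."""
    n = len(Z)
    gamma_T = 1.0 - (1.0 - eta * mu_hat) ** T
    R_T = np.sum((1.0 - gamma_T) ** 2 * Z**2) / n
    return R_T / (np.sum((1.0 - gamma_T) ** 2) / n)
```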

6 Conclusion

In this paper, we describe spectral filter estimators (e.g., gradient descent, kernel ridge regression) for nonparametric estimation of the regression function in an RKHS. Two new data-driven early stopping rules, \tau (20) and \tau_{\alpha} (33), are designed for these iterative algorithms. In more detail, we show that for infinite-rank reproducing kernels, \tau has a high variance due to the variability of the empirical risk around its expectation, and we propose a way to reduce this variability by smoothing the empirical L_{2}(\mathbb{P}_{n})-norm (and, as a consequence, the empirical risk) by means of the eigenvalues of the normalized kernel matrix. We demonstrate in Corollaries 3.4 and 4.4 that our stopping times \tau and \tau_{\alpha} yield minimax-optimal rates, in particular for finite-rank kernel classes and Sobolev spaces. It is worth emphasizing that computing the stopping times requires only an estimate of the variance \sigma^{2} and the eigenvalues (\widehat{\mu}_{1},\ldots,\widehat{\mu}_{n}). The theoretical results are confirmed empirically: \tau and \tau_{\alpha} with the smoothing parameter \alpha=(\beta+1)^{-1}, where \beta is the polynomial decay rate of the eigenvalues of the normalized Gram matrix, perform favorably in comparison with stopping rules based on hold-out data and 4-fold cross-validation.

There are various open questions that could be tackled following our results. A deficiency of our strategy is that the construction of \tau and \tau_{\alpha} relies on the assumption that the regression function belongs to a known RKHS, which (mildly) restricts its smoothness. We would like to understand how our results extend to other loss functions besides the squared loss (for example, in the classification framework), as was done in [40]. Another research direction is to combine early stopping with fast kernel approximation techniques [30] in order to avoid computing all the eigenvalues of the normalized Gram matrix, which can be prohibitive for large-scale problems.

Appendix A Useful results

In this section, we present several auxiliary lemmas that are repeatedly used in the paper.

Lemma A.1.

[29, ηt=ηt\eta_{t}=\eta t in Lemma 8 and ν=ηt\nu=\eta t in Lemma 13] For any bounded kernel, with γi(t)\gamma_{i}^{(t)} corresponding to gradient descent or kernel ridge regression, for every t0t\geq 0,

12min{1,ηtμ^i}γi(t)min{1,ηtμ^i},i=1,,n.\displaystyle\frac{1}{2}\min\{1,\eta t\widehat{\mu}_{i}\}\leq\gamma_{i}^{(t)}\leq\min\{1,\eta t\widehat{\mu}_{i}\},\ \ i=1,\ldots,n. (48)

The following result shows the magnitude of the smoothed critical radius for polynomial eigenvalue decay kernels.

Lemma A.2.

Assume that \widehat{\mu}_{i}\leq Ci^{-\beta},\ i=1,2,\ldots,n. Then, for \alpha\beta<1, one has

ϵ^n,α2[Cα1αβ+C1+αβ(1+α)1]2ββ+1[σ22R2n]ββ+1.\widehat{\epsilon}_{n,\alpha}^{2}\asymp\left[\sqrt{\frac{C^{\alpha}}{1-\alpha\beta}}+\sqrt{\frac{C^{1+\alpha}}{\beta(1+\alpha)-1}}\right]^{\frac{2\beta}{\beta+1}}\left[\frac{\sigma^{2}}{2R^{2}n}\right]^{\frac{\beta}{\beta+1}}.
Proof of Lemma A.2.

For every M(ϵ)(0,n]M(\epsilon)\in(0,n] and αβ<1\alpha\beta<1, we have

^n,α(ϵ,)\displaystyle\widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H}) R1nj=1nmin{Cjβ,ϵ2}Cαjβα\displaystyle\leq R\sqrt{\frac{1}{n}}\sqrt{\sum_{j=1}^{n}\min\{Cj^{-\beta},\epsilon^{2}\}C^{\alpha}j^{-\beta\alpha}}
RCαnj=1M(ϵ)jβαϵ+RC1+αnj=M(ϵ)njββα\displaystyle\leq R\sqrt{\frac{C^{\alpha}}{n}}\sqrt{\sum_{j=1}^{\left\lfloor M(\epsilon)\right\rfloor}j^{-\beta\alpha}}\epsilon+R\sqrt{\frac{C^{1+\alpha}}{n}}\sqrt{\sum_{j=\left\lceil M(\epsilon)\right\rceil}^{n}j^{-\beta-\beta\alpha}}
RCα1αβM(ϵ)1αβnϵ+RC1+αn1β(1+α)11M(ϵ)β(1+α)1\displaystyle\leq R\sqrt{\frac{C^{\alpha}}{1-\alpha\beta}\frac{M(\epsilon)^{1-\alpha\beta}}{n}}\epsilon+R\sqrt{\frac{C^{1+\alpha}}{n}}\sqrt{\frac{1}{\beta(1+\alpha)-1}\frac{1}{M(\epsilon)^{\beta(1+\alpha)-1}}}

Set M(\epsilon)=\epsilon^{-2/\beta}, which implies \sqrt{M(\epsilon)^{1-\alpha\beta}}\epsilon=\epsilon^{1-\frac{1-\alpha\beta}{\beta}}, and

^n,α(ϵ,)R[Cα1αβ+C1+αβ(1+α)1]ϵ11αββ1n.\widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})\leq R\left[\sqrt{\frac{C^{\alpha}}{1-\alpha\beta}}+\sqrt{\frac{C^{1+\alpha}}{\beta(1+\alpha)-1}}\right]\epsilon^{1-\frac{1-\alpha\beta}{\beta}}\frac{1}{\sqrt{n}}.

Therefore, the smoothed critical inequality ^n,α(ϵ,)2R2σϵ2+α\widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})\leq\frac{2R^{2}}{\sigma}\epsilon^{2+\alpha} is satisfied for

ϵ^n,α2=c~[Cα1αβ+C1+αβ(1+α)1]2ββ+1[σ22R2n]ββ+1.\widehat{\epsilon}_{n,\alpha}^{2}=\widetilde{c}\left[\sqrt{\frac{C^{\alpha}}{1-\alpha\beta}}+\sqrt{\frac{C^{1+\alpha}}{\beta(1+\alpha)-1}}\right]^{\frac{2\beta}{\beta+1}}\left[\frac{\sigma^{2}}{2R^{2}n}\right]^{\frac{\beta}{\beta+1}}. (49)

Notice that M(ϵ^n,α)(R2σ2)1β+1n1β+1(R2σ2)1β+1nM(\widehat{\epsilon}_{n,\alpha})\asymp\left(\frac{R^{2}}{\sigma^{2}}\right)^{\frac{1}{\beta+1}}n^{\frac{1}{\beta+1}}\lesssim\left(\frac{R^{2}}{\sigma^{2}}\right)^{\frac{1}{\beta+1}}n. Besides that, due to Lemma G.1, one can choose a positive constant c~\widetilde{c} in Eq. (49) such that M(ϵ^n,α)nM(\widehat{\epsilon}_{n,\alpha})\leq n. ∎

For the next two lemmas define the positive self-adjoint trace-class covariance operator

Σ𝔼X[𝕂(,X)𝕂(,X)],\Sigma\coloneqq\mathbb{E}_{X}\left[\mathbb{K}(\cdot,X)\otimes\mathbb{K}(\cdot,X)\right],

where \otimes is the Kronecker product between two elements in \mathcal{H} such that (ab)u=ab,u(a\otimes b)u=a\langle b,u\rangle_{\mathcal{H}}, for every uu\in\mathcal{H}. We know that Σ\Sigma and T𝕂T_{\mathbb{K}} have the same eigenvalues {μj}j=1\{\mu_{j}\}_{j=1}^{\infty}. Moreover, we introduce the smoothed empirical covariance operator as

Σ^n,α1nj=1nμ^j2α𝕂(,xj)𝕂(,xj).\widehat{\Sigma}_{n,\alpha}\coloneqq\frac{1}{n}\sum_{j=1}^{n}\widehat{\mu}_{j}^{2\alpha}\mathbb{K}(\cdot,x_{j})\otimes\mathbb{K}(\cdot,x_{j}). (50)
Lemma A.3.

For each a>0a>0, any 1kn1\leq k\leq n, α[0,1/2]\alpha\in[0,1/2], and θ>1\theta>1, one has

X(j=1kμj2α>θθ1j=1kμ^j2α+a(1+3θ)θ3(θ1)n)2exp(a)\mathbb{P}_{X}\left(\sum_{j=1}^{k}\mu_{j}^{2\alpha}>\frac{\theta}{\theta-1}\sum_{j=1}^{k}\widehat{\mu}_{j}^{2\alpha}+\frac{a\left(1+3\theta\right)\theta}{3(\theta-1)n}\right)\leq 2\exp(-a)
Proof.

Let Πk\Pi_{k} be the orthogonal projection from \mathcal{H} onto the span of the eigenfunctions (ϕj:j=1,,k)(\phi_{j}:j=1,\ldots,k). Then by the variational characterization of partial traces, one has j=1kμj2α=tr(ΠkΣ2α)\sum_{j=1}^{k}\mu_{j}^{2\alpha}=\textnormal{tr}\left(\Pi_{k}\Sigma^{2\alpha}\right) and j=1kμ^j2αtr(ΠkΣ^n,α)\sum_{j=1}^{k}\widehat{\mu}_{j}^{2\alpha}\geq\textnormal{tr}\left(\Pi_{k}\widehat{\Sigma}_{n,\alpha}\right). One concludes that

j=1kμj2αj=1kμ^j2αtr(Πk(Σ2αΣ^n,α)).\sum_{j=1}^{k}\mu_{j}^{2\alpha}-\sum_{j=1}^{k}\widehat{\mu}_{j}^{2\alpha}\leq\textnormal{tr}\left(\Pi_{k}\left(\Sigma^{2\alpha}-\widehat{\Sigma}_{n,\alpha}\right)\right).

By reproducing property and Mercer’s theorem, Πk𝕂(,X)2=i=1kμiϕi2(X)\lVert\Pi_{k}\mathbb{K}\left(\cdot,X\right)\rVert_{\mathcal{H}}^{2}=\sum_{i=1}^{k}\mu_{i}\phi_{i}^{2}(X), and

j=1kμj2αj=1kμ^j2α\displaystyle\sum_{j=1}^{k}\mu_{j}^{2\alpha}-\sum_{j=1}^{k}\widehat{\mu}_{j}^{2\alpha} 𝔼XΠkΣα12𝕂(,X)21nj=1nμ^j2αΠk𝕂(,xj)2\displaystyle\leq\mathbb{E}_{X}\lVert\Pi_{k}\Sigma^{\alpha-\frac{1}{2}}\mathbb{K}\left(\cdot,X\right)\rVert_{\mathcal{H}}^{2}-\frac{1}{n}\sum_{j=1}^{n}\widehat{\mu}_{j}^{2\alpha}\lVert\Pi_{k}\mathbb{K}\left(\cdot,x_{j}\right)\rVert_{\mathcal{H}}^{2}
𝔼XΠkΣα12𝕂(,X)21nj=1nμ^j2αΠk𝕂(,xj)2.\displaystyle\leq\mid\mathbb{E}_{X}\lVert\Pi_{k}\Sigma^{\alpha-\frac{1}{2}}\mathbb{K}\left(\cdot,X\right)\rVert_{\mathcal{H}}^{2}-\frac{1}{n}\sum_{j=1}^{n}\widehat{\mu}_{j}^{2\alpha}\lVert\Pi_{k}\mathbb{K}\left(\cdot,x_{j}\right)\rVert_{\mathcal{H}}^{2}\mid.

Since μ^j2αΠk𝕂(,xj)21\widehat{\mu}_{j}^{2\alpha}\lVert\Pi_{k}\mathbb{K}\left(\cdot,x_{j}\right)\rVert_{\mathcal{H}}^{2}\leq 1, one has 𝔼X[μ^j4αΠk𝕂(,xj)4]i=1kμi\mathbb{E}_{X}\left[\widehat{\mu}_{j}^{4\alpha}\lVert\Pi_{k}\mathbb{K}\left(\cdot,x_{j}\right)\rVert_{\mathcal{H}}^{4}\right]\leq\sum_{i=1}^{k}\mu_{i}, and by Bernstein’s inequality, for any a>0a>0,

X(j=1kμj2α>j=1kμ^j2α+2a(j=1kμj)n+a3n)2exp(a).\mathbb{P}_{X}\left(\sum_{j=1}^{k}\mu_{j}^{2\alpha}>\sum_{j=1}^{k}\widehat{\mu}_{j}^{2\alpha}+\sqrt{\frac{2a\left(\sum_{j=1}^{k}\mu_{j}\right)}{n}}+\frac{a}{3n}\right)\leq 2\exp(-a).

Then, by using j=1kμjj=1kμj2α\sum_{j=1}^{k}\mu_{j}\leq\sum_{j=1}^{k}\mu_{j}^{2\alpha} when α[0,1/2]\alpha\in[0,1/2], and 2xyθx+yθ\sqrt{2xy}\leq\theta x+\frac{y}{\theta} for any θ>0\theta>0, one gets

X((11θ)j=1kμj2α>j=1kμ^j2α+a(1+3θ)3n)2exp(a),\mathbb{P}_{X}\left(\left(1-\frac{1}{\theta}\right)\sum_{j=1}^{k}\mu_{j}^{2\alpha}>\sum_{j=1}^{k}\widehat{\mu}_{j}^{2\alpha}+\frac{a\left(1+3\theta\right)}{3n}\right)\leq 2\exp(-a),

for any a>0a>0. ∎

Lemma A.4.

For each a>0a>0, any 0kn0\leq k\leq n, α[0,1/2]\alpha\in[0,1/2], and θ>1\theta>1, one has

X(j>kμ^j2α>θ+1θj>kμj2α+a(1+3θ)3n)exp(a).\mathbb{P}_{X}\left(\sum_{j>k}\widehat{\mu}_{j}^{2\alpha}>\frac{\theta+1}{\theta}\sum_{j>k}\mu_{j}^{2\alpha}+\frac{a\left(1+3\theta\right)}{3n}\right)\leq\exp(-a).
Proof.

The proof of [18, Lemma 33] could be easily generalized to the smoothed version by using the proof of Lemma A.3. Let Πk\Pi_{k} be the orthogonal projection from \mathcal{H} onto the span of the population eigenfunctions (ϕj:j>k)\left(\phi_{j}:j>k\right). Then by the variational characterization of partial traces, one has j>kμj2α=tr(ΠkΣ2α)\sum_{j>k}\mu_{j}^{2\alpha}=\textnormal{tr}\left(\Pi_{k}\Sigma^{2\alpha}\right) and j>kμ^j2αtr(ΠkΣ^n,α)\sum_{j>k}\widehat{\mu}_{j}^{2\alpha}\leq\textnormal{tr}\left(\Pi_{k}\widehat{\Sigma}_{n,\alpha}\right). One concludes that

j>kμ^j2αj>kμj2αtr(Πk(Σ^n,αΣ2α)).\sum_{j>k}\widehat{\mu}_{j}^{2\alpha}-\sum_{j>k}\mu_{j}^{2\alpha}\leq\textnormal{tr}\left(\Pi_{k}\left(\widehat{\Sigma}_{n,\alpha}-\Sigma^{2\alpha}\right)\right).

Hence,

j>kμ^j2αj>kμj2α1nj=1nμ^j2αΠk𝕂(,xj)2𝔼XΠkΣα1/2𝕂(,X)2.\sum_{j>k}\widehat{\mu}_{j}^{2\alpha}-\sum_{j>k}\mu_{j}^{2\alpha}\leq\frac{1}{n}\sum_{j=1}^{n}\widehat{\mu}_{j}^{2\alpha}\lVert\Pi_{k}\mathbb{K}\left(\cdot,x_{j}\right)\rVert_{\mathcal{H}}^{2}-\mathbb{E}_{X}\lVert\Pi_{k}\Sigma^{\alpha-1/2}\mathbb{K}\left(\cdot,X\right)\rVert_{\mathcal{H}}^{2}.

Since μ^j2αΠk𝕂(,xj)21\widehat{\mu}_{j}^{2\alpha}\lVert\Pi_{k}\mathbb{K}(\cdot,x_{j})\rVert_{\mathcal{H}}^{2}\leq 1 and by using the reproducing property and Mercer’s theorem, Πk𝕂(,X)2=j>kμjϕj2(X)\lVert\Pi_{k}\mathbb{K}\left(\cdot,X\right)\rVert_{\mathcal{H}}^{2}=\sum_{j>k}\mu_{j}\phi_{j}^{2}(X), one has

𝔼X[μ^j4αΠk𝕂(,xj)4]j>kμj.\mathbb{E}_{X}\left[\widehat{\mu}_{j}^{4\alpha}\lVert\Pi_{k}\mathbb{K}\left(\cdot,x_{j}\right)\rVert_{\mathcal{H}}^{4}\right]\leq\sum_{j>k}\mu_{j}.

Bernstein’s inequality yields that for any a>0a>0,

X(j>kμ^j2α>j>kμj2α+2a(j>kμj)n+a3n)exp(a).\mathbb{P}_{X}\left(\sum_{j>k}\widehat{\mu}_{j}^{2\alpha}>\sum_{j>k}\mu_{j}^{2\alpha}+\sqrt{\frac{2a\left(\sum_{j>k}\mu_{j}\right)}{n}}+\frac{a}{3n}\right)\leq\exp(-a).

Using the inequalities j>kμjj>kμj2α\sum_{j>k}\mu_{j}\leq\sum_{j>k}\mu_{j}^{2\alpha}, when α[0,1/2]\alpha\in[0,1/2], and

2a(j>kμj)n1θj>kμj+aθn,\sqrt{\frac{2a\left(\sum_{j>k}\mu_{j}\right)}{n}}\leq\frac{1}{\theta}\sum_{j>k}\mu_{j}+\frac{a\theta}{n},

one gets

X(j>kμ^j2α>(1+1θ)j>kμj2α+a(1+3θ)3n)exp(a),\mathbb{P}_{X}\left(\sum_{j>k}\widehat{\mu}_{j}^{2\alpha}>\left(1+\frac{1}{\theta}\right)\sum_{j>k}\mu_{j}^{2\alpha}+\frac{a\left(1+3\theta\right)}{3n}\right)\leq\exp(-a),

for any a>0a>0 and θ>1\theta>1. ∎

Corollary A.5.

Assumption 4, Lemma A.3, and Lemma A.4 imply that for any 1kn1\leq k\leq n, a>0a>0, θ>1\theta>1, and α[α0,1/2]\alpha\in[\alpha_{0},1/2],

j=k+1nμ^j2α(θ+1)θ1j=1kμ^j2α+a(1+3θ)3n((θ+1)θ1+1)\sum_{j=k+1}^{n}\widehat{\mu}_{j}^{2\alpha}\leq\frac{(\theta+1)\mathcal{M}}{\theta-1}\sum_{j=1}^{k}\widehat{\mu}_{j}^{2\alpha}+\frac{a(1+3\theta)}{3n}\left(\frac{\mathcal{M}(\theta+1)}{\theta-1}+1\right) (51)

with probability (over {xi}i=1n\{x_{i}\}_{i=1}^{n}) at least 13exp(a)1-3\exp(-a).

Appendix B Handling the smoothed bias and variance

Lemma B.1.

Under Assumptions 1, 2,

Bα2(t)R2(ηt)1+α,α[0,1].B_{\alpha}^{2}(t)\leq\frac{R^{2}}{(\eta t)^{1+\alpha}},\quad\alpha\in[0,1]. (52)
Proof of Lemma B.1.

Proof of [29, Lemma 7] can be easily generalized to obtain the result. ∎

Here, we recall one concentration result from [29, Section 4.1.2]. For any t>0t>0 and δ>0\delta>0, one has V(t)=𝔼ε[v(t)]V(t)=\mathbb{E}_{\varepsilon}\left[v(t)\right], and

ε(|v(t)V(t)|δ)2exp[cnδσ2min{1,R2δσ2ηt^n2(1ηt,)}].\mathbb{P}_{\varepsilon}\Big{(}|v(t)-V(t)|\geq\delta\Big{)}\leq 2\ \exp\left[-\frac{cn\delta}{\sigma^{2}}\min\left\{1,\frac{R^{2}\delta}{\sigma^{2}\eta t\widehat{\mathcal{R}}_{n}^{2}(\frac{1}{\sqrt{\eta t}},\mathcal{H})}\right\}\right]. (53)

Appendix C Auxiliary lemma for finite-rank kernels

Let us first transfer the critical inequality (16) from ϵ\epsilon to tt.

Definition C.1.

Set ϵ=1ηt\epsilon=\frac{1}{\sqrt{\eta t}} in (16), and let us define t^ϵ\widehat{t}_{\epsilon} as the largest positive solution to the following fixed-point equation

σ2ηtR2^n2(1ηt,)4R2ηt.\frac{\sigma^{2}\eta t}{R^{2}}\widehat{\mathcal{R}}_{n}^{2}\left(\frac{1}{\sqrt{\eta t}},\mathcal{H}\right)\leq\frac{4R^{2}}{\eta t}. (54)

Note that the empirical critical radius ϵ^n=1ηt^ϵ\widehat{\epsilon}_{n}=\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon}}}, and such a point t^ϵ\widehat{t}_{\epsilon} exists since ϵ^n\widehat{\epsilon}_{n} exists and is unique [27, 5, 29]. Moreover, t^ϵ\widehat{t}_{\epsilon} provides the equality in Ineq. (54).

Remark that at t=t^{*} one has B^{2}(t)=\frac{2\sigma^{2}}{n}\sum_{i=1}^{r}\gamma_{i}^{(t)}-V(t)\geq\frac{\sigma^{2}}{n}\sum_{i=1}^{r}\gamma_{i}^{(t)}. Thus, due to the construction of \widehat{t}_{\epsilon} (\widehat{t}_{\epsilon} is the point of intersection of an upper bound on the bias and a lower bound on \frac{\sigma^{2}}{2n}\sum_{i=1}^{r}\gamma_{i}^{(t)}) and the monotonicity (in t) of all the terms involved, we get t^{*}\leq\widehat{t}_{\epsilon}.

Lemma C.2.

Recall the definition of the stopping rule tt^{*} (19). Under Assumptions 1, 2, and 3, the following holds for any reproducing kernel:

𝔼εftfn28R2ϵ^n2.\mathbb{E}_{\varepsilon}\lVert f^{t^{*}}-f^{*}\rVert_{n}^{2}\leq 8R^{2}\widehat{\epsilon}_{n}^{2}.
Proof of Lemma C.2.

Let us define a proxy version of the variance term: V~(t)σ2ni=1rγi(t)\widetilde{V}(t)\coloneqq\frac{\sigma^{2}}{n}\sum_{i=1}^{r}\gamma_{i}^{(t)}. Moreover, for all t>0t>0,

𝔼εRt=B2(t)+σ2ni=1n(1γi(t))2.\mathbb{E}_{\varepsilon}R_{t}=B^{2}(t)+\frac{\sigma^{2}}{n}\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}. (55)

From the fact that 𝔼εRt=σ2\mathbb{E}_{\varepsilon}R_{t^{*}}=\sigma^{2}, 𝔼εftfn2=B2(t)+V(t)=2V~(t)\mathbb{E}_{\varepsilon}\lVert f^{t^{*}}-f^{*}\rVert_{n}^{2}=B^{2}(t^{*})+V(t^{*})=2\widetilde{V}(t^{*}).

Therefore, in order to prove the lemma, our goal is to get an upper bound on V~(t)\widetilde{V}(t^{*}). Since the function ηt^n2(1ηt,)\eta t\widehat{\mathcal{R}}_{n}^{2}(\frac{1}{\sqrt{\eta t}},\mathcal{H}) is monotonic in tt (see, for example, Lemma G.1), and tt^ϵt^{*}\leq\widehat{t}_{\epsilon}, we conclude that

V~(t)σ2ηtR2^n2(1ηt,)σ2ηt^ϵR2^n2(1ηt^ϵ,)=4R2ϵ^n2.\widetilde{V}(t^{*})\leq\frac{\sigma^{2}\eta t^{*}}{R^{2}}\widehat{\mathcal{R}}_{n}^{2}\left(\frac{1}{\sqrt{\eta t^{*}}},\mathcal{H}\right)\leq\frac{\sigma^{2}\eta\widehat{t}_{\epsilon}}{R^{2}}\widehat{\mathcal{R}}_{n}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon}}},\mathcal{H}\right)=4R^{2}\widehat{\epsilon}_{n}^{2}.

Appendix D Proofs for polynomial smoothing (fixed design)

In the proofs, we will need three additional definitions below.

Definition D.1.

In Definition 15, set ϵ=1ηt\epsilon=\frac{1}{\sqrt{\eta t}}, then for any α[0,1]\alpha\in[0,1], the smoothed critical inequality (15) is equivalent to

σ2ηt4^n,α2(1ηt,)R4(ηt)1+α.\frac{\sigma^{2}\eta t}{4}\widehat{\mathcal{R}}_{n,\alpha}^{2}\Big{(}\frac{1}{\sqrt{\eta t}},\mathcal{H}\Big{)}\leq\frac{R^{4}}{(\eta t)^{1+\alpha}}. (56)

Due to Lemma G.1, the left-hand side of (56) is non-decreasing in tt, and the right-hand side is non-increasing in tt.

Definition D.2.

For any α[0,1]\alpha\in[0,1], define the stopping rule t^ϵ,α\widehat{t}_{\epsilon,\alpha} such that

ϵ^n,α2=1ηt^ϵ,α,\widehat{\epsilon}_{n,\alpha}^{2}=\frac{1}{\eta\widehat{t}_{\epsilon,\alpha}}, (57)

then Ineq. (56) becomes the equality at t=t^ϵ,αt=\widehat{t}_{\epsilon,\alpha} thanks to the monotonicity and continuity of both terms in the inequality.

Further, we define the stopping time t~ϵ,α\widetilde{t}_{\epsilon,\alpha} and t¯ϵ,α\overline{t}_{\epsilon,\alpha}, a lower bound and an upper bound on tαinf{t>0|𝔼εRα,tσ2ni=1nμ^iα}t_{\alpha}^{*}\coloneqq\inf\left\{t>0\ |\ \mathbb{E}_{\varepsilon}R_{\alpha,t}\leq\frac{\sigma^{2}}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}\right\}, α[0,1]\forall\alpha\in[0,1].

Definition D.3.

Define the smoothed proxy variance V~α(t)σ2ni=1nμ^iαγi(t)\widetilde{V}_{\alpha}(t)\coloneqq\frac{\sigma^{2}}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}\gamma_{i}^{(t)} and the following stopping times

t¯ϵ,α=inf{t>0|Bα2(t)=12V~α(t)},t~ϵ,α=inf{t>0|Bα2(t)=3V~α(t)}.\displaystyle\begin{split}\overline{t}_{\epsilon,\alpha}&=\inf\big{\{}t>0\ |\ B_{\alpha}^{2}(t)=\frac{1}{2}\widetilde{V}_{\alpha}(t)\big{\}},\\ \widetilde{t}_{\epsilon,\alpha}&=\inf\big{\{}t>0\ |\ B_{\alpha}^{2}(t)=3\widetilde{V}_{\alpha}(t)\big{\}}.\end{split} (58)

Notice that at t=t~ϵ,αt=\widetilde{t}_{\epsilon,\alpha}:

6R2(ηt)1+αR2(ηt)1+αBα2(t)=3V~α(t)32σ2R2ηt^n,α2(1ηt,).\frac{6R^{2}}{(\eta t)^{1+\alpha}}\geq\frac{R^{2}}{(\eta t)^{1+\alpha}}\geq B_{\alpha}^{2}(t)=3\widetilde{V}_{\alpha}(t)\geq\frac{3}{2}\frac{\sigma^{2}}{R^{2}}\eta t\widehat{\mathcal{R}}_{n,\alpha}^{2}\Big{(}\frac{1}{\sqrt{\eta t}},\mathcal{H}\Big{)}.

At t=t¯ϵ,αt=\overline{t}_{\epsilon,\alpha}:

R2(ηt)1+αBα2(t)=12V~α(t)σ2ηt4R2^n,α2(1ηt,).\frac{R^{2}}{(\eta t)^{1+\alpha}}\geq B_{\alpha}^{2}(t)=\frac{1}{2}\widetilde{V}_{\alpha}(t)\geq\frac{\sigma^{2}\eta t}{4R^{2}}\widehat{\mathcal{R}}_{n,\alpha}^{2}\Big{(}\frac{1}{\sqrt{\eta t}},\mathcal{H}\Big{)}.

Thus, t~ϵ,α\widetilde{t}_{\epsilon,\alpha} and t¯ϵ,α\overline{t}_{\epsilon,\alpha} satisfy the smoothed critical inequality (56). Moreover, t^ϵ,α\widehat{t}_{\epsilon,\alpha} is always greater than or equal to t¯ϵ,α\overline{t}_{\epsilon,\alpha} and t~ϵ,α\widetilde{t}_{\epsilon,\alpha} since t^ϵ,α\widehat{t}_{\epsilon,\alpha} is the largest value satisfying Ineq. (56). As a consequence of Lemma G.1 and continuity of (56) in tt, one has 1ηt~ϵ,α1ηt¯ϵ,α1ηt^ϵ,α=ϵ^n,α2\frac{1}{\eta\widetilde{t}_{\epsilon,\alpha}}\asymp\frac{1}{\eta\overline{t}_{\epsilon,\alpha}}\asymp\frac{1}{\eta\widehat{t}_{\epsilon,\alpha}}=\widehat{\epsilon}_{n,\alpha}^{2}. We assume for simplicity that

ϵ¯n,α2\displaystyle\overline{\epsilon}_{n,\alpha}^{2} 1ηt¯ϵ,α=c1ηt^ϵ,α=cϵ^n,α2,\displaystyle\coloneqq\frac{1}{\eta\overline{t}_{\epsilon,\alpha}}=c^{\prime}\frac{1}{\eta\widehat{t}_{\epsilon,\alpha}}=c^{\prime}\widehat{\epsilon}_{n,\alpha}^{2},
ϵ~n,α2\displaystyle\widetilde{\epsilon}_{n,\alpha}^{2} 1ηt~ϵ,α=c′′1ηt^ϵ,α=c′′ϵ^n,α2\displaystyle\coloneqq\frac{1}{\eta\widetilde{t}_{\epsilon,\alpha}}=c^{\prime\prime}\frac{1}{\eta\widehat{t}_{\epsilon,\alpha}}=c^{\prime\prime}\widehat{\epsilon}_{n,\alpha}^{2}

for some positive numeric constants c,c′′1c^{\prime},c^{\prime\prime}\geq 1, that do not depend on nn, due to the fact that t^ϵ,αt¯ϵ,α\widehat{t}_{\epsilon,\alpha}\geq\overline{t}_{\epsilon,\alpha} and t^ϵ,αt~ϵ,α\widehat{t}_{\epsilon,\alpha}\geq\widetilde{t}_{\epsilon,\alpha}.

The following lemma decomposes the risk error into several parts that will be further analyzed in subsequent Lemmas D.7, D.8.

Lemma D.4.

Recall the definition of τα\tau_{\alpha} (33), then

fταfn22B2(τα)+2v(τα),\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}\leq 2B^{2}(\tau_{\alpha})+2v(\tau_{\alpha}),

where v(t)=1ni=1n(γi(t))2εi2,t>0v(t)=\frac{1}{n}\sum_{i=1}^{n}(\gamma_{i}^{(t)})^{2}\varepsilon_{i}^{2},\ t>0, is the stochastic part of the variance.

Proof of Lemma D.4.

Let us define the noise vector ε[ε1,,εn]\varepsilon\coloneqq[\varepsilon_{1},...,\varepsilon_{n}]^{\top} and, for each t>0t>0, two vectors that correspond to the bias and variance parts:

b~2(t)(gt(Kn)KnIn)F,v~(t)gt(Kn)Knε.\tilde{b}^{2}(t)\coloneqq(g_{t}(K_{n})K_{n}-I_{n})F^{*},\ \ \ \ \tilde{v}(t)\coloneqq g_{t}(K_{n})K_{n}\varepsilon. (59)

It gives the following expressions for the stochastic part of the variance and bias:

v(t)=v~(t),v~(t)n,B2(t)=b~2(t),b~2(t)n.v(t)=\langle\tilde{v}(t),\tilde{v}(t)\rangle_{n},\ \ B^{2}(t)=\langle\tilde{b}^{2}(t),\tilde{b}^{2}(t)\rangle_{n}. (60)

General expression for the L2(n)L_{2}(\mathbb{P}_{n})-norm error at τα\tau_{\alpha} takes the form

fταfn2=B2(τα)+v(τα)+2b~2(τα),v~(τα)n.\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}=B^{2}(\tau_{\alpha})+v(\tau_{\alpha})+2\langle\tilde{b}^{2}(\tau_{\alpha}),\tilde{v}(\tau_{\alpha})\rangle_{n}. (61)

Therefore, applying the inequality 2|x,yn|xn2+yn22\left|\langle x,y\rangle_{n}\right|\leq\lVert x\rVert_{n}^{2}+\lVert y\rVert_{n}^{2} for any x,ynx,y\in\mathbb{R}^{n}, and (60), we obtain

fταfn22B2(τα)+2v(τα).\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}\leq 2B^{2}(\tau_{\alpha})+2v(\tau_{\alpha}). (62)

D.1 Two deviation inequalities for τα\tau_{\alpha}

This is the first deviation inequality for τα\tau_{\alpha} that will be used in Lemma D.7 to control the variance term.

Lemma D.5.

Recall Definition D.3 of t¯ϵ,α\overline{t}_{\epsilon,\alpha}, then under Assumptions 1, 2, 3, and 4,

ε(τα>t¯ϵ,α)5exp[c1R2σ2nϵ^n,α2(1+α)],\mathbb{P}_{\varepsilon}\left(\tau_{\alpha}>\overline{t}_{\epsilon,\alpha}\right)\leq 5\exp\left[-c_{1}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right],

where a positive constant c1c_{1} depends only on \mathcal{M}.

Proof of Lemma D.5.

Set κασ2trKnα/n\kappa_{\alpha}\coloneqq\sigma^{2}\textnormal{tr}K_{n}^{\alpha}/n, then due to the monotonicity of the smoothed empirical risk, for all ttαt\geq t_{\alpha}^{*},

ε(τα>t)=ε(Rα,t𝔼εRα,t>κα𝔼εRα,t).\mathbb{P}_{\varepsilon}\left(\tau_{\alpha}>t\right)=\mathbb{P}_{\varepsilon}\left(R_{\alpha,t}-\mathbb{E}_{\varepsilon}R_{\alpha,t}>\kappa_{\alpha}-\mathbb{E}_{\varepsilon}R_{\alpha,t}\right).

Consider

Rα,t𝔼εRα,t=σ2ni=1nμ^iα(1γi(t))2(εi2σ21)Σ1+2ni=1nμ^iα(1γi(t))2GiεiΣ2.R_{\alpha,t}-\mathbb{E}_{\varepsilon}R_{\alpha,t}=\underbrace{\frac{\sigma^{2}}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(t)})^{2}\left(\frac{\varepsilon_{i}^{2}}{\sigma^{2}}-1\right)}_{\Sigma_{1}}+\underbrace{\frac{2}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(t)})^{2}G_{i}^{*}\varepsilon_{i}}_{\Sigma_{2}}. (63)

Define

Δt,ακα𝔼εRα,t=Bα2(t)Vα(t)+2V~α(t),\Delta_{t,\alpha}\coloneqq\kappa_{\alpha}-\mathbb{E}_{\varepsilon}R_{\alpha,t}=-B_{\alpha}^{2}(t)-V_{\alpha}(t)+2\widetilde{V}_{\alpha}(t),

where V~α(t)=σ2ni=1nμ^iαγi(t)\widetilde{V}_{\alpha}(t)=\frac{\sigma^{2}}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}\gamma_{i}^{(t)}.

Further, set t=t¯ϵ,αt=\overline{t}_{\epsilon,\alpha}, and recall that ηt¯ϵ,α=ηt^ϵ,αc\eta\overline{t}_{\epsilon,\alpha}=\frac{\eta\widehat{t}_{\epsilon,\alpha}}{c^{\prime}} for c1c^{\prime}\geq 1. This implies

Δt¯ϵ,α,α12V~α(t¯ϵ,α)\displaystyle\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}\geq\frac{1}{2}\widetilde{V}_{\alpha}(\overline{t}_{\epsilon,\alpha}) σ24ni=1nμ^iαmin{1,ηt^ϵ,αcμ^i}\displaystyle\geq\frac{\sigma^{2}}{4n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}\min\left\{1,\frac{\eta\widehat{t}_{\epsilon,\alpha}}{c^{\prime}}\widehat{\mu}_{i}\right\}
=σ2ηt^ϵ,α4nci=1nμ^iαmin{cηt^ϵ,α,μ^i}\displaystyle=\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{4nc^{\prime}}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}\min\left\{\frac{c^{\prime}}{\eta\widehat{t}_{\epsilon,\alpha}},\widehat{\mu}_{i}\right\}
σ2ηt^ϵ,α4cR2^n,α2(1ηt^ϵ,α,)\displaystyle\geq\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{4c^{\prime}R^{2}}\widehat{\mathcal{R}}_{n,\alpha}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)
=R2cϵ^n,α2(1+α).\displaystyle=\frac{R^{2}}{c^{\prime}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}.

Then for the event AA from Corollary A.5, by standard concentration results on linear and quadratic sums of Gaussian random variables (see, e.g., [25, Lemma 1]),

ε(Σ1>Δt¯ϵ,α,α2A)\displaystyle\mathbb{P}_{\varepsilon}\left(\Sigma_{1}>\frac{\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}}{2}\mid A\right) exp[Δt¯ϵ,α,α216(a(t¯ϵ,α)2+Δt¯ϵ,α,α2a(t¯ϵ,α))],\displaystyle\leq\exp\left[-\frac{\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}^{2}}{16(\lVert a(\overline{t}_{\epsilon,\alpha})\rVert^{2}+\frac{\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}}{2}\lVert a(\overline{t}_{\epsilon,\alpha})\rVert_{\infty})}\right], (64)
ε(Σ2>Δt¯ϵ,α,α2)\displaystyle\mathbb{P}_{\varepsilon}\left(\Sigma_{2}>\frac{\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}}{2}\right) exp[nΔt¯ϵ,α,α232σ2Bα2(t¯ϵ,α)],\displaystyle\leq\exp\left[-\frac{n\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}^{2}}{32\sigma^{2}B_{\alpha}^{2}(\overline{t}_{\epsilon,\alpha})}\right], (65)

where ai(t¯ϵ,α)=σ2nμ^iα(1γi(t¯ϵ,α))2,i[n]a_{i}(\overline{t}_{\epsilon,\alpha})=\frac{\sigma^{2}}{n}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(\overline{t}_{\epsilon,\alpha})})^{2},\ i\in[n].

In what follows, we simplify the bounds above.

Firstly, recall that B=1B=1, which implies μ^11\widehat{\mu}_{1}\leq 1, and a(t¯ϵ,α)σ2n\lVert a(\overline{t}_{\epsilon,\alpha})\rVert_{\infty}\leq\frac{\sigma^{2}}{n}, and

12Δt¯ϵ,α,α34V~α(t¯ϵ,α)34V~α(t^ϵ,α)\displaystyle\frac{1}{2}\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}\leq\frac{3}{4}\widetilde{V}_{\alpha}(\overline{t}_{\epsilon,\alpha})\leq\frac{3}{4}\widetilde{V}_{\alpha}(\widehat{t}_{\epsilon,\alpha}) 34R2σ2ηt^ϵ,α^n,α2(1ηt^ϵ,α,)\displaystyle\leq\frac{3}{4R^{2}}\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}\widehat{\mathcal{R}}_{n,\alpha}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)
=3R2ϵ^n,α2(1+α).\displaystyle=3R^{2}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}.

Secondly, we will upper bound the Euclidean norm of a(t¯ϵ,α)a(\overline{t}_{\epsilon,\alpha}). Recall Corollary A.5 with a=R2σ2nϵ^n,α2(1+α)a=\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)} and θ=2\theta=2, the definition of the smoothed statistical dimension dn,α=min{j[n]:μ^jϵ^n,α2}d_{n,\alpha}=\min\{j\in[n]:\widehat{\mu}_{j}\leq\widehat{\epsilon}_{n,\alpha}^{2}\}, and Ineq. (35): ϵ^n,α2(1+α)σ2i=1dn,αμ^iα4R2n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\geq\frac{\sigma^{2}\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha}}{4R^{2}n}, which implies

a(t¯ϵ,α)2\displaystyle\lVert a(\overline{t}_{\epsilon,\alpha})\rVert^{2} =σ4n2i=1nμ^i2α(1γi(t¯ϵ,α))4σ4n2[i=1dn,αμ^iα+i=dn,α+1nμ^i2α]\displaystyle=\frac{\sigma^{4}}{n^{2}}\sum_{i=1}^{n}\widehat{\mu}_{i}^{2\alpha}\left(1-\gamma_{i}^{(\overline{t}_{\epsilon,\alpha})}\right)^{4}\leq\frac{\sigma^{4}}{n^{2}}\left[\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha}+\sum_{i=d_{n,\alpha}+1}^{n}\widehat{\mu}_{i}^{2\alpha}\right]
σ4n2[4nR2ϵ^n,α2(1+α)σ2+3i=1dn,αμ^i2α+7(3+1)R23σ2ϵ^n,α2(1+α)]\displaystyle\leq\frac{\sigma^{4}}{n^{2}}\left[\frac{4nR^{2}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}}{\sigma^{2}}+3\mathcal{M}\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{2\alpha}+\frac{7(3\mathcal{M}+1)R^{2}}{3\sigma^{2}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right]
σ2R2n[4+12+3(3+1)]ϵ^n,α2(1+α).\displaystyle\leq\frac{\sigma^{2}R^{2}}{n}\left[4+12\mathcal{M}+3(3\mathcal{M}+1)\right]\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}.

Finally, using the upper bound Bα2(t¯ϵ,α)R2(ηt¯ϵ,α)1+αR2(c)2ϵ^n,α2(1+α)B_{\alpha}^{2}(\overline{t}_{\epsilon,\alpha})\leq\frac{R^{2}}{(\eta\overline{t}_{\epsilon,\alpha})^{1+\alpha}}\leq R^{2}(c^{\prime})^{2}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)} for all α[0,1]\alpha\in[0,1] and the fact that ε(A)=X1,,Xn(𝕀(A))=X1,,Xn(A)\mathbb{P}_{\varepsilon}(A)=\mathbb{P}_{X_{1},\ldots,X_{n}}\left(\mathbb{I}(A)\right)=\mathbb{P}_{X_{1},\ldots,X_{n}}(A) for the event AA from Corollary A.5, one gets

ε(Σ1>Δt¯ϵ,α,α2)\displaystyle\mathbb{P}_{\varepsilon}\left(\Sigma_{1}>\frac{\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}}{2}\right) ε(Σ1>Δt¯ϵ,α,α2A)+X1,,Xn(Ac),\displaystyle\leq\mathbb{P}_{\varepsilon}\left(\Sigma_{1}>\frac{\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}}{2}\mid A\right)+\mathbb{P}_{X_{1},\ldots,X_{n}}\left(A^{c}\right),
ε(τα>t¯ϵ,α)\displaystyle\mathbb{P}_{\varepsilon}\left(\tau_{\alpha}>\overline{t}_{\epsilon,\alpha}\right) 5exp[c1R2σ2nϵ^n,α2(1+α)],\displaystyle\leq 5\ \exp\left[-c_{1}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right],

for some positive constant c_{1}>0 that depends only on \mathcal{M}.

What follows is the second deviation inequality for τα\tau_{\alpha} that will be further used in Lemma D.8 to control the bias term.

Lemma D.6.

Recall Definition D.3 of t~ϵ,α\widetilde{t}_{\epsilon,\alpha}, then under Assumptions 1, 2, 3, and 4,

ε(τα<t~ϵ,α)5exp[c2R2σ2nϵ^n,α2(1+α)]\mathbb{P}_{\varepsilon}\left(\tau_{\alpha}<\widetilde{t}_{\epsilon,\alpha}\right)\leq 5\ \exp\left[-c_{2}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right] (66)

for a positive constant c2c_{2} that depends only on \mathcal{M}.

Proof of Lemma D.6.

Set κασ2trKnα/n\kappa_{\alpha}\coloneqq\sigma^{2}\textnormal{tr}K_{n}^{\alpha}/n. Note that t~ϵ,αtα\widetilde{t}_{\epsilon,\alpha}\leq t_{\alpha}^{*} by construction.

Further, for all ttαt\leq t_{\alpha}^{*}, due to the monotonicity of Rα,tR_{\alpha,t},

ε(τα<t)\displaystyle\mathbb{P}_{\varepsilon}\Big{(}\tau_{\alpha}<t\Big{)} =ε(Rα,t𝔼εRα,t(𝔼εRα,tκα))\displaystyle=\mathbb{P}_{\varepsilon}\Big{(}R_{\alpha,t}-\mathbb{E}_{\varepsilon}R_{\alpha,t}\leq-(\mathbb{E}_{\varepsilon}R_{\alpha,t}-\kappa_{\alpha})\Big{)}
ε(σ2ni=1nμ^iα(1γi(t))2(εi2σ21)Σ1𝔼εRα,tκα2)\displaystyle\leq\mathbb{P}_{\varepsilon}\bigg{(}\underbrace{\frac{\sigma^{2}}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(t)})^{2}\Big{(}\frac{\varepsilon_{i}^{2}}{\sigma^{2}}-1\Big{)}}_{\Sigma_{1}}\leq-\frac{\mathbb{E}_{\varepsilon}R_{\alpha,t}-\kappa_{\alpha}}{2}\bigg{)}
+ε(2ni=1nμ^iα(1γi(t))2GiεiΣ2𝔼εRα,tκα2).\displaystyle+\mathbb{P}_{\varepsilon}\bigg{(}\underbrace{\frac{2}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(t)})^{2}G_{i}^{*}\varepsilon_{i}}_{\Sigma_{2}}\leq-\frac{\mathbb{E}_{\varepsilon}R_{\alpha,t}-\kappa_{\alpha}}{2}\bigg{)}.

Consider Δt,α𝔼εRα,tκα=Bα2(t)+Vα(t)2V~α(t)\Delta_{t,\alpha}\coloneqq\mathbb{E}_{\varepsilon}R_{\alpha,t}-\kappa_{\alpha}=B_{\alpha}^{2}(t)+V_{\alpha}(t)-2\widetilde{V}_{\alpha}(t). At t=t~ϵ,αt=\widetilde{t}_{\epsilon,\alpha}, we have Bα2(t)=3V~α(t)B_{\alpha}^{2}(t)=3\widetilde{V}_{\alpha}(t), thus

Δt~ϵ,α,αV~α(t~ϵ,α).\Delta_{\widetilde{t}_{\epsilon,\alpha},\alpha}\geq\widetilde{V}_{\alpha}(\widetilde{t}_{\epsilon,\alpha}).

Then for the event AA from Corollary A.5, by standard concentration results on linear and quadratic sums of Gaussian random variables (see, e.g., [25, Lemma 1]),

\displaystyle\begin{split}\mathbb{P}_{\varepsilon}\left(\Sigma_{1}\leq-\frac{\Delta_{\widetilde{t}_{\epsilon,\alpha},\alpha}}{2}\mid A\right)&\leq\exp\left[-\frac{\widetilde{V}_{\alpha}^{2}(\widetilde{t}_{\epsilon,\alpha})}{16\lVert a(\widetilde{t}_{\epsilon,\alpha})\rVert^{2}}\right],\\ \mathbb{P}_{\varepsilon}\left(\Sigma_{2}\leq-\frac{\Delta_{\widetilde{t}_{\epsilon,\alpha},\alpha}}{2}\right)&\leq\exp\left[-\frac{n\widetilde{V}_{\alpha}^{2}(\widetilde{t}_{\epsilon,\alpha})}{32\sigma^{2}B_{\alpha}^{2}(\widetilde{t}_{\epsilon,\alpha})}\right],\end{split} (67)

where a_{i}(\widetilde{t}_{\epsilon,\alpha})=\frac{\sigma^{2}}{n}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(\widetilde{t}_{\epsilon,\alpha})})^{2},\ i\in[n].

In what follows, we simplify the bounds above.

First, we deal with the Euclidean norm of ai(t~ϵ,α),i[n]a_{i}(\widetilde{t}_{\epsilon,\alpha}),\ i\in[n]. By μ^11\widehat{\mu}_{1}\leq 1 and Corollary A.5 with a=R2σ2nϵ^n,α2(1+α)a=\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)} and θ=2\theta=2, and Ineq. (35), it gives us

a(t~ϵ,α)2=σ4n2i=1nμ^i2α(1γi(t~ϵ,α))4σ4n2[i=1dn,αμ^iα+i=dn,α+1nμ^i2α][4+12+3(3+1)]σ2nR2ϵ^n,α2(1+α).\displaystyle\begin{split}\lVert a(\widetilde{t}_{\epsilon,\alpha})\rVert^{2}=\frac{\sigma^{4}}{n^{2}}\sum_{i=1}^{n}\widehat{\mu}_{i}^{2\alpha}(1-\gamma_{i}^{(\widetilde{t}_{\epsilon,\alpha})})^{4}&\leq\frac{\sigma^{4}}{n^{2}}\left[\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha}+\sum_{i=d_{n,\alpha}+1}^{n}\widehat{\mu}_{i}^{2\alpha}\right]\\ &\leq\left[4+12\mathcal{M}+3(3\mathcal{M}+1)\right]\frac{\sigma^{2}}{n}R^{2}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}.\end{split} (68)

Recall that ηt~ϵ,α=ηt^ϵ,αc′′\eta\widetilde{t}_{\epsilon,\alpha}=\frac{\eta\widehat{t}_{\epsilon,\alpha}}{c^{\prime\prime}} for c′′1c^{\prime\prime}\geq 1. Therefore, it is sufficient to lower bound V~α(t~ϵ,α)\widetilde{V}_{\alpha}(\widetilde{t}_{\epsilon,\alpha}) as follows.

V~α(t~ϵ,α)σ22ni=1nμ^iαmin{1,ηt^ϵ,αc′′μ^i}\displaystyle\widetilde{V}_{\alpha}(\widetilde{t}_{\epsilon,\alpha})\geq\frac{\sigma^{2}}{2n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}\min\{1,\frac{\eta\widehat{t}_{\epsilon,\alpha}}{c^{\prime\prime}}\widehat{\mu}_{i}\} =σ2ηt^ϵ,α2nc′′i=1nμ^iαmin{c′′ηt^ϵ,α,μ^i}\displaystyle=\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{2nc^{\prime\prime}}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}\min\left\{\frac{c^{\prime\prime}}{\eta\widehat{t}_{\epsilon,\alpha}},\widehat{\mu}_{i}\right\}
σ2ηt^ϵ,α2R2c′′^n,α2(1ηt^ϵ,α,)\displaystyle\geq\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{2R^{2}c^{\prime\prime}}\widehat{\mathcal{R}}_{n,\alpha}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)
=2R2c′′ϵ^n,α2(1+α).\displaystyle=\frac{2R^{2}}{c^{\prime\prime}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}.

By using the bound Bα2(t~ϵ,α)R2(ηt~ϵ,α)1+αR2(c′′)2ϵ^n,α2(1+α)B_{\alpha}^{2}(\widetilde{t}_{\epsilon,\alpha})\leq\frac{R^{2}}{(\eta\widetilde{t}_{\epsilon,\alpha})^{1+\alpha}}\leq R^{2}(c^{\prime\prime})^{2}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}, inserting this expression with (68) into (67), and using the fact that ε(A)=X1,,Xn(𝕀(A))=X1,,Xn(A)\mathbb{P}_{\varepsilon}(A)=\mathbb{P}_{X_{1},\ldots,X_{n}}\left(\mathbb{I}(A)\right)=\mathbb{P}_{X_{1},\ldots,X_{n}}(A) for the event AA from Corollary A.5, we have

ε(Σ1Δt~ϵ,α,α2)\displaystyle\mathbb{P}_{\varepsilon}\left(\Sigma_{1}\leq-\frac{\Delta_{\widetilde{t}_{\epsilon,\alpha},\alpha}}{2}\right) ε(Σ1Δt~ϵ,α,α2A)+X1,,Xn(Ac),\displaystyle\leq\mathbb{P}_{\varepsilon}\left(\Sigma_{1}\leq-\frac{\Delta_{\widetilde{t}_{\epsilon,\alpha},\alpha}}{2}\mid A\right)+\mathbb{P}_{X_{1},\ldots,X_{n}}\left(A^{c}\right),
ε(τα<t~ϵ,α)\displaystyle\mathbb{P}_{\varepsilon}\left(\tau_{\alpha}<\widetilde{t}_{\epsilon,\alpha}\right) 5exp[c2R2σ2nϵ^n,α2(1+α)],\displaystyle\leq 5\exp\left[-c_{2}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right],

where c2c_{2} depends only on \mathcal{M}. ∎

D.2 Bounding the stochastic part of the variance term at τα\tau_{\alpha}

Lemma D.7.

Under Assumptions 1, 2, 3, and 4, for any regular kernel, the stochastic part of the variance at τα\tau_{\alpha} is bounded as follows.

v(τα)8(1+C)R2ϵ^n,α2v(\tau_{\alpha})\leq 8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}

with probability at least 16exp[c1nR2σ2ϵ^n,α2(1+α)]1-6\exp\Big{[}-c_{1}n\frac{R^{2}}{\sigma^{2}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\Big{]}, where a constant c1c_{1} depends only on \mathcal{M}.

Proof of Lemma D.7.

ε(τα>t¯ϵ,α)5exp[c1R2σ2nϵ^n,α2(1+α)]\mathbb{P}_{\varepsilon}\left(\tau_{\alpha}>\overline{t}_{\epsilon,\alpha}\right)\leq 5\exp\left[-c_{1}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right] due to Lemma D.5. Therefore, thanks to the monotonicity of γi(t)\gamma_{i}^{(t)} in tt, with probability at least 15exp[c1R2σ2nϵ^n,α2(1+α)]1-5\ \exp\left[-c_{1}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right], v(τα)v(t¯ϵ,α)v(\tau_{\alpha})\leq v(\overline{t}_{\epsilon,\alpha}).

After that, due to the concentration inequality (53),

ε(|v(t¯ϵ,α)V(t¯ϵ,α)|δ)2exp[cnδσ2min{1,R2δσ2ηt¯ϵ,α^n2(1ηt¯ϵ,α,)}].\mathbb{P}_{\varepsilon}\Big{(}|v(\overline{t}_{\epsilon,\alpha})-V(\overline{t}_{\epsilon,\alpha})|\geq\delta\Big{)}\leq 2\exp\left[-\frac{cn\delta}{\sigma^{2}}\min\left\{1,\frac{R^{2}\delta}{\sigma^{2}\eta\overline{t}_{\epsilon,\alpha}\widehat{\mathcal{R}}_{n}^{2}(\frac{1}{\sqrt{\eta\overline{t}_{\epsilon,\alpha}}},\mathcal{H})}\right\}\right].

Now, by setting δ=σ2ηt^ϵ,αR2^n2(1ηt^ϵ,α,)σ2ηt^ϵ,αR2^n,α2(1ηt^ϵ,α,)\delta=\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{R^{2}}\widehat{\mathcal{R}}_{n}^{2}\Big{(}\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\Big{)}\geq\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{R^{2}}\widehat{\mathcal{R}}_{n,\alpha}^{2}\Big{(}\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\Big{)} and recalling Lemma G.2, it yields

v(t¯ϵ,α)V(t¯ϵ,α)+δV~(t^ϵ,α)+4(1+C)R2ϵ^n,α2σ2ηt^ϵ,αR2^n2(1ηt^ϵ,α,)+4(1+C)R2ϵ^n,α28(1+C)R2ϵ^n,α2\displaystyle\begin{split}v(\overline{t}_{\epsilon,\alpha})&\leq V(\overline{t}_{\epsilon,\alpha})+\delta\\ &\leq\widetilde{V}(\widehat{t}_{\epsilon,\alpha})+4(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}\\ &\leq\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{R^{2}}\widehat{\mathcal{R}}_{n}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)+4(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}\\ &\leq 8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}\end{split} (69)

with probability at least 1-\exp\Big{[}-cn\frac{4R^{2}}{\sigma^{2}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\Big{]}. Combining all the pieces together, we get

v(\tau_{\alpha})\leq 8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2} (70)

with probability at least 1-6\exp\Big{[}-c_{1}n\frac{R^{2}}{\sigma^{2}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\Big{]}. ∎

D.3 Bounding the bias term at \tau_{\alpha}

Lemma D.8.

Under Assumptions 1, 2, 3, and 4,

B^{2}(\tau_{\alpha})\leq c^{\prime\prime}R^{2}\widehat{\epsilon}_{n,\alpha}^{2} (71)

with probability at least 1-5\exp\Big{[}-c_{2}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\Big{]} for a positive numeric constant c^{\prime\prime}\geq 1 and a constant c_{2} that depends only on \mathcal{M}.

Proof of Lemma D.8.

Due to Lemma D.6, \mathbb{P}_{\varepsilon}\left(\tau_{\alpha}<\widetilde{t}_{\epsilon,\alpha}\right)\leq 5\exp\left[-c_{2}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right]. Therefore, thanks to the monotonicity of the bias term, with probability at least 1-5\exp\left[-c_{2}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right], B^{2}(\tau_{\alpha})\leq B^{2}(\widetilde{t}_{\epsilon,\alpha})\leq\frac{R^{2}}{\eta\widetilde{t}_{\epsilon,\alpha}}=c^{\prime\prime}R^{2}\widehat{\epsilon}_{n,\alpha}^{2}.

Appendix E Proof of Theorem 4.2

From Lemmas D.4, D.7, and D.8, we get

\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}\leq 2c^{\prime\prime}R^{2}\widehat{\epsilon}_{n,\alpha}^{2}+16(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2} (72)

with probability at least 1-11\exp\left[-c_{1}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right], where c_{1} depends only on \mathcal{M}. Moreover, taking the expectation in Ineq. (62) yields

\mathbb{E}_{\varepsilon}\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}\leq 2\mathbb{E}_{\varepsilon}[B^{2}(\tau_{\alpha})]+2\mathbb{E}_{\varepsilon}[v(\tau_{\alpha})].

Let us upper bound \mathbb{E}_{\varepsilon}\left[B^{2}(\tau_{\alpha})\right] and \mathbb{E}_{\varepsilon}\left[v(\tau_{\alpha})\right]. First, define \widetilde{a}\coloneqq B^{2}(\widetilde{t}_{\epsilon,\alpha}); thus

\displaystyle\begin{split}\mathbb{E}_{\varepsilon}\left[B^{2}(\tau_{\alpha})\right]&=\mathbb{P}_{\varepsilon}\Big{(}B^{2}(\tau_{\alpha})>\widetilde{a}\Big{)}\mathbb{E}_{\varepsilon}\Big{[}B^{2}(\tau_{\alpha})\mid B^{2}(\tau_{\alpha})>\widetilde{a}\Big{]}\\ &+\mathbb{P}_{\varepsilon}\Big{(}B^{2}(\tau_{\alpha})\leq\widetilde{a}\Big{)}\mathbb{E}_{\varepsilon}\Big{[}B^{2}(\tau_{\alpha})\mid B^{2}(\tau_{\alpha})\leq\widetilde{a}\Big{]}.\end{split} (73)

Defining \delta_{1}\coloneqq 5\exp\left[-c_{2}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right] from Lemma D.8 and using the upper bound B^{2}(t)\leq R^{2} for any t>0 gives

\mathbb{E}_{\varepsilon}\left[B^{2}(\tau_{\alpha})\right]\leq R^{2}\delta_{1}+B^{2}(\widetilde{t}_{\epsilon,\alpha})\leq R^{2}\left(\delta_{1}+c^{\prime\prime}\widehat{\epsilon}_{n,\alpha}^{2}\right). (74)

As for \mathbb{E}_{\varepsilon}\left[v(\tau_{\alpha})\right],

\displaystyle\begin{split}\mathbb{E}_{\varepsilon}\left[v(\tau_{\alpha})\right]&=\mathbb{E}_{\varepsilon}\left[v(\tau_{\alpha})\mathbb{I}\left\{v(\tau_{\alpha})\leq 8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}\right\}\right]\\ &+\mathbb{E}_{\varepsilon}\left[v(\tau_{\alpha})\mathbb{I}\left\{v(\tau_{\alpha})>8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}\right\}\right],\end{split} (75)

and due to Lemma D.7 and the Cauchy-Schwarz inequality,

\displaystyle\mathbb{E}_{\varepsilon}\left[v(\tau_{\alpha})\right]\leq 8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}+\mathbb{E}_{\varepsilon}\Big{[}v(\tau_{\alpha})\mathbb{I}\Big{\{}v(\tau_{\alpha})>8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}\Big{\}}\Big{]}
\leq 8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}+\sqrt{\mathbb{E}_{\varepsilon}v^{2}(\tau_{\alpha})}\sqrt{\mathbb{E}_{\varepsilon}\left[\mathbb{I}\left\{v(\tau_{\alpha})>8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}\right\}\right]}. (76)

Notice that v^{2}(\tau_{\alpha})\leq\frac{1}{n^{2}}\left[\sum_{i=1}^{n}\varepsilon_{i}^{2}\right]^{2}, and

\mathbb{E}_{\varepsilon}\left[v^{2}(\tau_{\alpha})\right]\leq\frac{1}{n^{2}}\left[\sum_{i=1}^{n}\mathbb{E}_{\varepsilon}\varepsilon_{i}^{4}+2\sum_{i<j}\mathbb{E}_{\varepsilon}\left(\varepsilon_{i}^{2}\varepsilon_{j}^{2}\right)\right]\leq\frac{3\sigma^{4}}{n^{2}}n^{2}\leq 3\sigma^{4}. (77)
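For completeness, the middle inequality in (77) can be spelled out as follows, assuming for this illustration that the \varepsilon_{i} are i.i.d. Gaussian with variance \sigma^{2} (more generally, it suffices that \mathbb{E}_{\varepsilon}\varepsilon_{i}^{4}\leq 3\sigma^{4}):

\sum_{i=1}^{n}\mathbb{E}_{\varepsilon}\varepsilon_{i}^{4}+2\sum_{i<j}\mathbb{E}_{\varepsilon}\left(\varepsilon_{i}^{2}\varepsilon_{j}^{2}\right)=3n\sigma^{4}+n(n-1)\sigma^{4}\leq 3n^{2}\sigma^{4}.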

At the same time, thanks to Lemma D.7,

\mathbb{E}_{\varepsilon}\left[\mathbb{I}\left\{v(\tau_{\alpha})>8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}\right\}\right]\leq 6\exp\left(-c_{1}n\frac{R^{2}}{\sigma^{2}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right).

Thus, inserting the last two inequalities into (76) gives

\mathbb{E}_{\varepsilon}\left[v(\tau_{\alpha})\right]\leq 8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}+5\sigma^{2}\exp\left(-c_{1}n\frac{R^{2}}{\sigma^{2}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right).

Finally, summing up all the terms together,

\displaystyle\mathbb{E}_{\varepsilon}\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}\leq\left[16(1+C)+2c^{\prime\prime}\right]R^{2}\widehat{\epsilon}_{n,\alpha}^{2}
+20\max\{\sigma^{2},R^{2}\}\exp\left(-c_{1}n\frac{R^{2}}{\sigma^{2}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right),

where the constant c_{1} depends only on \mathcal{M} and c^{\prime\prime} is a numeric constant.

Appendix F Proof of Theorem 3.3

We will use the definition of \tau (20) with the threshold \kappa\coloneqq\frac{r\sigma^{2}}{n} so that, due to the monotonicity of the "reduced" empirical risk \widetilde{R}_{t},

\mathbb{P}_{\varepsilon}\left(\tau>t\right)=\mathbb{P}_{\varepsilon}\Big{(}\widetilde{R}_{t}-\mathbb{E}_{\varepsilon}\widetilde{R}_{t}>\underbrace{\kappa-\mathbb{E}_{\varepsilon}\widetilde{R}_{t}}_{\Delta_{t}}\Big{)},

where

\Delta_{t}=-B^{2}(t)-V(t)+\underbrace{\frac{2\sigma^{2}}{n}\sum_{i=1}^{r}\gamma_{i}^{(t)}}_{2\widetilde{V}(t)}. (78)

Assume that \Delta_{t}\geq 0. Remark that

\widetilde{R}_{t}-\mathbb{E}_{\varepsilon}\widetilde{R}_{t}=\underbrace{\frac{\sigma^{2}}{n}\sum_{i=1}^{r}(1-\gamma_{i}^{(t)})^{2}\left(\frac{\varepsilon_{i}^{2}}{\sigma^{2}}-1\right)}_{\Sigma_{1}}+\underbrace{\frac{2}{n}\sum_{i=1}^{r}(1-\gamma_{i}^{(t)})^{2}G_{i}^{*}\varepsilon_{i}}_{\Sigma_{2}}. (79)

Applying [25, Lemma 1] to \Sigma_{1} yields

\mathbb{P}_{\varepsilon}\left(\Sigma_{1}>\frac{\Delta_{t}}{2}\right)\leq\exp\left[\frac{-\Delta_{t}^{2}/4}{4(\lVert a(t)\rVert^{2}+\frac{\Delta_{t}}{2}\lVert a(t)\rVert_{\infty})}\right], (80)

where a_{i}(t)\coloneqq\frac{\sigma^{2}}{n}(1-\gamma_{i}^{(t)})^{2},\ i\in[r]. In addition, [38, Proposition 2.5] gives us

\mathbb{P}_{\varepsilon}\left(\Sigma_{2}>\frac{\Delta_{t}}{2}\right)\leq\exp\left[-\frac{n\Delta_{t}^{2}}{32\sigma^{2}B^{2}(t)}\right]. (81)

Define a stopping time \overline{t}_{\epsilon} as follows.

\overline{t}_{\epsilon}\coloneqq\inf\left\{t>0:B^{2}(t)=\frac{1}{2}\widetilde{V}(t)\right\}. (82)

Note that \overline{t}_{\epsilon} serves as an upper bound on t^{*} and as a lower bound on \widehat{t}_{\epsilon}. Moreover, \overline{t}_{\epsilon} satisfies the critical inequality (54). Therefore, due to Lemma G.1 and the continuity of (54) in t, there is a positive numeric constant c^{\prime}\geq 1, which does not depend on n, such that \frac{1}{\eta\overline{t}_{\epsilon}}=c^{\prime}\frac{1}{\eta\widehat{t}_{\epsilon}}.

In what follows, we simplify the two high-probability bounds (80) and (81) at t=\overline{t}_{\epsilon}.

Since, by [29, Section 4.3], \widehat{\epsilon}_{n}^{2}=c\frac{r\sigma^{2}}{nR^{2}}, one can bound \lVert a(\overline{t}_{\epsilon})\rVert^{2} as follows.

\lVert a(\overline{t}_{\epsilon})\rVert^{2}=\frac{\sigma^{4}}{n^{2}}\sum_{i=1}^{r}(1-\gamma_{i}^{(\overline{t}_{\epsilon})})^{4}\leq\frac{r\sigma^{4}}{n^{2}}=\frac{R^{2}\sigma^{2}\widehat{\epsilon}_{n}^{2}}{cn}. (83)

Remark that in (80), \lVert a(\overline{t}_{\epsilon})\rVert_{\infty}=\frac{\sigma^{2}}{n}\underset{i\in[r]}{\max}\Big{[}(1-\gamma_{i}^{(\overline{t}_{\epsilon})})\Big{]}\leq\frac{\sigma^{2}}{n}, and

\frac{\Delta_{\overline{t}_{\epsilon}}}{2}\leq\frac{3}{4}\widetilde{V}(\overline{t}_{\epsilon})\leq\frac{3}{4}\widetilde{V}(\widehat{t}_{\epsilon})\leq\frac{3}{4}\frac{\sigma^{2}}{R^{2}}\eta\widehat{t}_{\epsilon}\widehat{\mathcal{R}}_{n}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon}}},\mathcal{H}\right)=3R^{2}\widehat{\epsilon}_{n}^{2}.

As for a lower bound on \Delta_{\overline{t}_{\epsilon}},

\displaystyle\Delta_{\overline{t}_{\epsilon}}\geq\frac{1}{2}\widetilde{V}(\overline{t}_{\epsilon})\geq\frac{\sigma^{2}}{4n}\sum_{i=1}^{r}\min\left\{1,\frac{\eta\widehat{t}_{\epsilon}}{c^{\prime}}\widehat{\mu}_{i}\right\}=\frac{\sigma^{2}\eta\widehat{t}_{\epsilon}}{4nc^{\prime}}\sum_{i=1}^{r}\min\left\{\frac{c^{\prime}}{\eta\widehat{t}_{\epsilon}},\widehat{\mu}_{i}\right\}\geq\frac{R^{2}}{c^{\prime}}\widehat{\epsilon}_{n}^{2}.

Using the bound B^{2}(\overline{t}_{\epsilon})\leq\frac{R^{2}}{\eta\overline{t}_{\epsilon}}=c^{\prime}R^{2}\widehat{\epsilon}_{n}^{2} and summing up the bounds (80) and (81) with t=\overline{t}_{\epsilon}, we obtain the following.

\mathbb{P}_{\varepsilon}\left(\tau>\overline{t}_{\epsilon}\right)\leq 2\exp\left[-C\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n}^{2}\right]. (84)

From [29, Lemma 9], \lVert f^{\overline{t}_{\epsilon}}\rVert_{\mathcal{H}}\leq\sqrt{7}R with probability at least 1-4\exp\Big{[}-c_{3}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n}^{2}\Big{]}. Thus, Ineq. (84) allows us to conclude:

\lVert f^{\tau}\rVert_{\mathcal{H}}\leq\sqrt{7}R\textnormal{ with probability at least }1-6\exp\left(-\tilde{c}_{3}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n}^{2}\right)\textnormal{ for }\tilde{c}_{3}>0.

This implies that \lVert f^{\tau}-f^{*}\rVert_{\mathcal{H}}\leq\lVert f^{\tau}\rVert_{\mathcal{H}}+\lVert f^{*}\rVert_{\mathcal{H}}\leq\left(1+\sqrt{7}\right)R with the same probability. Thus, according to [38, Theorem 14.1], for some positive numeric constants c_{1},\tilde{c}_{4},\tilde{c}_{5}:

\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\leq 2\lVert f^{\tau}-f^{*}\rVert_{n}^{2}+c_{1}R^{2}\epsilon_{n}^{2}

with probability (w.r.t. \varepsilon) at least 1-6\exp\left[-\tilde{c}_{3}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n}^{2}\right] and with probability (w.r.t. \{x_{i}\}_{i=1}^{n}) at least 1-\tilde{c}_{4}\exp\left[-\tilde{c}_{5}\frac{R^{2}}{\sigma^{2}}n\epsilon_{n}^{2}\right].

Moreover, the same arguments (with \alpha=0 and without Assumption 4) as in the proof of Theorem 4.2, together with [38, Proposition 14.25] and [29, Section 4.3.1], yield

\lVert f^{\tau}-f^{*}\rVert_{n}^{2}\leq c_{u}R^{2}\widehat{\epsilon}_{n}^{2}\leq\widetilde{c}_{u}R^{2}\epsilon_{n}^{2}\lesssim\frac{r\sigma^{2}}{n} (85)

with probability at least 1-c_{1}\exp\left[-c_{2}\frac{R^{2}}{\sigma^{2}}n\epsilon_{n}^{2}\right]. Then, by the Cauchy-Schwarz inequality,

\displaystyle\mathbb{E}\lVert f^{\tau}-f^{*}\rVert_{2}^{2}=\mathbb{E}\left[\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\mathbb{I}\left\{\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\leq\frac{cr\sigma^{2}}{n}\right\}\right]
+\mathbb{E}\left[\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\mathbb{I}\left\{\lVert f^{\tau}-f^{*}\rVert_{2}^{2}>\frac{cr\sigma^{2}}{n}\right\}\right]
\leq\frac{cr\sigma^{2}}{n}+\sqrt{\mathbb{E}\lVert f^{\tau}-f^{*}\rVert_{2}^{4}}\sqrt{\mathbb{P}\left(\lVert f^{\tau}-f^{*}\rVert_{2}^{2}>\frac{cr\sigma^{2}}{n}\right)}.

Since f^{\tau}=g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}Y, where the empirical covariance operator is

\Sigma_{n}=\frac{1}{n}\sum_{i=1}^{n}\mathbb{K}(\cdot,x_{i})\otimes\mathbb{K}(\cdot,x_{i})=S_{n}^{*}S_{n},

and \gamma_{i}^{(\tau)}=\widehat{\mu}_{i}g_{\lambda(\tau)}(\widehat{\mu}_{i})\leq 1, one has

f^{*}-f^{\tau}=(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*}-g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon,

and, due to the definition of \tau,

\sigma^{2}=\lVert(I-S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*})Y\rVert_{n}^{2}.

We know that

\displaystyle\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\leq\mu_{1}\lVert(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*}-g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rVert_{\mathcal{H}}^{2}
\leq\lVert(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*}-g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rVert_{\mathcal{H}}^{2},

and

\displaystyle\sigma^{2}=\lVert(I-S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*})S_{n}f^{*}\rVert_{n}^{2}+\lVert(I-S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*})\varepsilon\rVert_{n}^{2}
+\underbrace{2\langle(I-S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*})S_{n}f^{*},(I-S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*})\varepsilon\rangle_{n}}_{\mathcal{A}_{n}}.

Further,

\displaystyle\lVert(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*}-g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rVert_{\mathcal{H}}^{2}=\lVert(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*}\rVert_{\mathcal{H}}^{2}
+\lVert g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rVert_{\mathcal{H}}^{2}
-\underbrace{2\langle(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*},g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rangle_{\mathcal{H}}}_{\mathcal{A}_{\mathcal{H}}}.

Thus, subtracting the empirical term from the RKHS term, one gets

\lVert(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*}-g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rVert_{\mathcal{H}}^{2}-\sigma^{2}=\underbrace{-(\mathcal{A}_{\mathcal{H}}+\mathcal{A}_{n})}_{\Delta\mathcal{A}}+\text{norm discrepancy},

where \text{norm discrepancy}=\lVert g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rVert_{\mathcal{H}}^{2}-\lVert(I-S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*})\varepsilon\rVert_{n}^{2}.

Firstly, S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}=K_{n}g_{\lambda(\tau)}(K_{n}), and

\lVert g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rVert_{\mathcal{H}}^{2}=\frac{1}{n}\varepsilon^{\top}K_{n}^{2}[g_{\lambda(\tau)}(K_{n})]^{2}\varepsilon.

Secondly,

\displaystyle\Delta\mathcal{A}=-2\langle(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*},g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rangle_{\mathcal{H}}
-2\langle(I-S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*})S_{n}f^{*},(I-S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*})\varepsilon\rangle_{n}.

Thirdly, since \langle h,S_{n}^{*}\mathbf{l}\rangle_{\mathcal{H}}=\langle S_{n}h,\mathbf{l}\rangle_{n} for any h\in\mathcal{H} and \mathbf{l}\in\mathbb{R}^{n}, and \langle h,h\rangle_{\mathcal{H}}=\langle S_{n}h,S_{n}h\rangle_{n}, one gets

\displaystyle\langle(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*},g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rangle_{\mathcal{H}}=\langle S_{n}(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*},S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rangle_{n}
=\langle\underbrace{(I-K_{n}g_{\lambda(\tau)}(K_{n}))F^{*}}_{\tilde{b}_{\lambda(\tau)}^{2}},K_{n}g_{\lambda(\tau)}(K_{n})\varepsilon\rangle_{n}
\leq\lVert\tilde{b}_{\lambda(\tau)}^{2}\rVert_{n}\lVert\varepsilon\rVert_{n}
\leq R\lVert\varepsilon\rVert_{n}.

Fourthly,

2\langle(I-K_{n}g_{\lambda(\tau)}(K_{n}))F^{*},(I-K_{n}g_{\lambda(\tau)}(K_{n}))\varepsilon\rangle_{n}\leq 2\lVert\tilde{b}_{\lambda(\tau)}^{2}\rVert_{n}\lVert\varepsilon\rVert_{n}\leq 2R\lVert\varepsilon\rVert_{n}.

Combining everything together and using \gamma_{i}^{(\tau)}\leq 1 for each i,

\lVert(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*}-g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rVert_{\mathcal{H}}^{2}-\sigma^{2}\leq\lVert\varepsilon\rVert_{n}^{2}+4R\lVert\varepsilon\rVert_{n}.

The last inequality implies that

\displaystyle\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\leq\sigma^{2}+\lVert\varepsilon\rVert_{n}^{2}+4R\lVert\varepsilon\rVert_{n},
\displaystyle\mathbb{E}\lVert f^{\tau}-f^{*}\rVert_{2}^{4}\leq 3\sigma^{4}+16R^{2}\sigma^{2}+\mathbb{E}\lVert\varepsilon\rVert_{n}^{4}+8R\mathbb{E}\lVert\varepsilon\rVert_{n}^{3}+8R\sigma^{2}\mathbb{E}\lVert\varepsilon\rVert_{n}
\leq 6\sigma^{4}+24R\sigma^{3}+16R^{2}\sigma^{2}.

As a consequence of the last inequality,

\mathbb{E}\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\leq\frac{\widetilde{c}r\sigma^{2}}{n}+C(\sigma,R)\exp(-cr).
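For intuition, the stopping rule analyzed above can be evaluated numerically from the spectrum of the kernel matrix. The following minimal Python sketch is purely illustrative: it assumes kernel ridge regression with the spectral filter \gamma_{i}^{(t)}=\frac{t\widehat{\mu}_{i}}{1+t\widehat{\mu}_{i}} (\eta=1), a Gaussian kernel, a synthetic regression function, and it monitors the full empirical risk with threshold \sigma^{2} instead of the reduced risk \widetilde{R}_{t} with threshold \kappa=\frac{r\sigma^{2}}{n} used in this proof.

import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 0.5
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2 * np.pi * x) + sigma * rng.normal(size=n)      # synthetic data (illustrative)

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.05) / n      # normalized Gram matrix (illustrative normalization)
mu_hat, U = np.linalg.eigh(K)                               # empirical eigenvalues and eigenvectors
coef = U.T @ y

def empirical_risk(t):
    # Kernel ridge regression filter gamma_i^{(t)} = t * mu_i / (1 + t * mu_i), eta = 1.
    gamma = t * mu_hat / (1.0 + t * mu_hat)
    return np.mean((y - U @ (gamma * coef)) ** 2)

# Minimum discrepancy principle: stop once the empirical risk drops below sigma^2.
grid = np.geomspace(1e-2, 1e6, 400)
tau = next((t for t in grid if empirical_risk(t) <= sigma ** 2), grid[-1])
print(f"stopping time tau = {tau:.3g}, empirical risk at tau = {empirical_risk(tau):.4f}")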

Appendix G Auxiliary results

Lemma G.1.

[29, Appendix D] Under Assumptions 1 and 2, for any \alpha\in[0,1], the function \epsilon\mapsto\frac{\widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})}{\epsilon} is non-increasing (as a function of \epsilon) on the interval (0,+\infty), and consequently, for any numeric constant c>0,

\frac{\widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})}{\epsilon}\leq c\frac{R^{2}}{\sigma}\epsilon^{1+\alpha} (86)

has a smallest positive solution. In addition to that, \widehat{\epsilon}_{n,\alpha} (15) exists, is unique, and satisfies equality in Eq. (86).
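As an illustration, the smallest positive solution of the equality version of (86) can be computed by bisection, using the monotonicity stated in the lemma. The Python sketch below is not part of the proof; the empirical eigenvalues mu_hat (polynomially decaying), the smoothing parameter alpha, and the constants c, R, sigma are all illustrative assumptions.

import numpy as np

n, alpha, c, R, sigma = 500, 0.5, 1.0, 1.0, 1.0
mu_hat = 1.0 / np.arange(1, n + 1) ** 2                     # illustrative eigenvalue decay

def smoothed_complexity(eps):
    # Empirical counterpart of R_hat_{n,alpha}(eps, H) = sqrt((1/n) sum mu_i^alpha * min(mu_i, eps^2)).
    return np.sqrt(np.mean(mu_hat ** alpha * np.minimum(mu_hat, eps ** 2)))

def gap(eps):
    # Left-hand side minus right-hand side of (86); non-increasing in eps by Lemma G.1.
    return smoothed_complexity(eps) / eps - c * (R ** 2 / sigma) * eps ** (1 + alpha)

lo, hi = 1e-8, 1.0                                          # gap(lo) > 0 > gap(hi) for this spectrum
for _ in range(100):                                        # bisection for the unique crossing point
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if gap(mid) > 0 else (lo, mid)
print(f"estimated critical radius eps_hat: {hi:.4f}")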

Lemma G.2.

Under Assumptions 1, 2, and 3, any regular kernel and \widehat{t}_{\epsilon,\alpha} from Definition D.2 satisfy

\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{4R^{2}}\widehat{\mathcal{R}}_{n}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)\leq\frac{(1+C)R^{2}}{\eta\widehat{t}_{\epsilon,\alpha}}. (87)

Thus, \widehat{t}_{\epsilon,\alpha} provides a smallest positive solution to the non-smooth version of the critical inequality.

Proof of Lemma G.2.

First, we recall that \frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{4R^{2}}\widehat{\mathcal{R}}_{n,\alpha}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)=R^{2}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}. Then, for d_{n,\alpha}=\min\{j\in[n]:\widehat{\mu}_{j}\leq\widehat{\epsilon}_{n,\alpha}^{2}\},

\displaystyle\begin{split}\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{4R^{2}}\widehat{\mathcal{R}}_{n,\alpha}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)&=\frac{\sigma^{2}}{4n\widehat{\epsilon}_{n,\alpha}^{2}}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}\min\{\widehat{\mu}_{i},\widehat{\epsilon}_{n,\alpha}^{2}\}\\ &=\frac{\sigma^{2}}{4n\widehat{\epsilon}_{n,\alpha}^{2}}\left[\widehat{\epsilon}_{n,\alpha}^{2}\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha}+\sum_{i=d_{n,\alpha}+1}^{n}\widehat{\mu}_{i}^{1+\alpha}\right]\\ &=R^{2}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}.\end{split} (88)

The last two lines of (88) yield \frac{\sigma^{2}}{4n\widehat{\epsilon}_{n,\alpha}^{2}}=\frac{R^{2}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}}{\widehat{\epsilon}_{n,\alpha}^{2}\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha}+\sum_{i=d_{n,\alpha}+1}^{n}\widehat{\mu}_{i}^{1+\alpha}}.

Second, consider the left-hand side of the non-smooth version of the critical inequality (56) at t=\widehat{t}_{\epsilon,\alpha}:

\displaystyle\begin{split}\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{4R^{2}}\widehat{\mathcal{R}}_{n}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)&=\frac{\sigma^{2}}{4n\widehat{\epsilon}_{n,\alpha}^{2}}\sum_{i=1}^{n}\min\{\widehat{\mu}_{i},\widehat{\epsilon}_{n,\alpha}^{2}\}\\ &\leq R^{2}\frac{\sum_{i=1}^{d_{n,\alpha}}\widehat{\epsilon}_{n,\alpha}^{4+2\alpha}+\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\sum_{i=d_{n,\alpha}+1}^{n}\widehat{\mu}_{i}}{\widehat{\epsilon}_{n,\alpha}^{2}\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha}}.\end{split} (89)

Notice that \widehat{\mu}_{i}\geq\widehat{\epsilon}_{n,\alpha}^{2} and \widehat{\mu}_{i}^{\alpha}\geq\widehat{\epsilon}_{n,\alpha}^{2\alpha} for i\leq d_{n,\alpha}. This implies \sum_{i=1}^{d_{n,\alpha}}\widehat{\epsilon}_{n,\alpha}^{4+2\alpha}\leq\widehat{\epsilon}_{n,\alpha}^{4}\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha}, and also that \sum_{i=d_{n,\alpha}+1}^{n}\widehat{\mu}_{i}\leq C\widehat{\epsilon}_{n,\alpha}^{2(1-\alpha)}\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha} since the kernel is regular. Hence,

\displaystyle\widehat{\epsilon}_{n,\alpha}^{2\alpha}\sum_{i=d_{n,\alpha}+1}^{n}\widehat{\mu}_{i}\leq C\widehat{\epsilon}_{n,\alpha}^{2}\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha},

which leads to the desired upper bound with \widehat{\epsilon}_{n,\alpha}^{2}=(\eta\widehat{t}_{\epsilon,\alpha})^{-1}:

\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{4R^{2}}\widehat{\mathcal{R}}_{n}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)\leq(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}.

Appendix H Proof of Lemma 5.1

Let us prove the lemma only for kernel ridge regression. W.l.o.g. assume that \eta=R=\sigma=1 and notice that

\displaystyle\begin{split}\mathbb{E}_{\varepsilon}\left[\frac{R_{t}}{1/n\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}}\right]=\sigma^{2}+\frac{B^{2}(t)}{1/n\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}}.\end{split} (90)

From Lemma B.1, B^{2}(t)\leq\frac{1}{t}. As for the denominator,

\frac{1}{n}\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{(1+t\widehat{\mu}_{i})^{2}}.

With the parameterization t=\frac{1}{\epsilon^{2}} and d_{n,0}=\min\left\{j\in[n]:\widehat{\mu}_{j}\leq\widehat{\epsilon}_{n}^{2}\right\}, since each \gamma_{i}^{(t)}, i=1,\ldots,n, is non-decreasing in t,

\displaystyle\frac{B^{2}(t)}{1/n\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}}\leq\frac{1}{\frac{1}{n\widehat{\epsilon}_{n}^{2}}\sum_{i=1}^{n}\left(\frac{\widehat{\epsilon}_{n}^{2}}{\widehat{\mu}_{i}+\widehat{\epsilon}_{n}^{2}}\right)^{2}}
\leq\frac{1}{\frac{n-d_{n,0}}{4n\widehat{\epsilon}_{n}^{2}}}.

From [41, Section 2.3], d_{n,0}=cn\widehat{\epsilon}_{n}^{2}, which implies

\frac{B^{2}(t)}{1/n\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}}\leq\frac{4\widehat{\epsilon}_{n}^{2}}{1-c\widehat{\epsilon}_{n}^{2}}\to 0.
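The bound above can also be checked numerically. The following minimal Python sketch is illustrative only: it evaluates an upper bound on the second term on the right-hand side of (90) for kernel ridge regression with \eta=R=\sigma=1, at the parameterization t=1/\widehat{\epsilon}_{n}^{2}, using B^{2}(t)\leq 1/t and a polynomially decaying empirical spectrum as an assumption.

import numpy as np

n = 2000
mu_hat = 1.0 / np.arange(1, n + 1) ** 2                     # illustrative empirical eigenvalues

def gap(eps):
    # Equality version of the critical inequality with alpha = 0 and c = R = sigma = 1.
    return np.sqrt(np.mean(np.minimum(mu_hat, eps ** 2))) / eps - eps

lo, hi = 1e-8, 1.0
for _ in range(100):                                        # bisection for eps_hat_n
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if gap(mid) > 0 else (lo, mid)
eps_hat_sq = hi ** 2

t = 1.0 / eps_hat_sq                                        # parameterization t = 1 / eps^2
denominator = np.mean(1.0 / (1.0 + t * mu_hat) ** 2)        # (1/n) sum (1 - gamma_i^{(t)})^2
bias_ratio = (1.0 / t) / denominator                        # uses B^2(t) <= 1/t
print(f"eps_hat_n^2 = {eps_hat_sq:.2e}, bias ratio bound = {bias_ratio:.2e}")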
Acknowledgments

The authors would like to thank the anonymous referees, an Associate Editor and the Editor for their constructive comments that improved the quality of this paper.

References

  • [1] Akaike, H. (1998). Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, 199–213. Springer.
  • [2] Angles, T., Camoriano, R., Rudi, A. and Rosasco, L. (2015). NYTRO: When Subsampling Meets Early Stopping. arXiv e-prints, arXiv:1510.05684.
  • [3] Arlot, S., Celisse, A. et al. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys 4, 40–79.
  • [4] Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society 68, 337–404.
  • [5] Bartlett, P. L., Bousquet, O., Mendelson, S. et al. (2005). Local Rademacher complexities. The Annals of Statistics 33, 1497–1537.
  • [6] Bartlett, P. L. and Traskin, M. (2007). AdaBoost is consistent. Journal of Machine Learning Research 8, 2347–2368.
  • [7] Bauer, F., Pereverzev, S. and Rosasco, L. (2007). On regularization algorithms in learning theory. Journal of Complexity 23, 52–72.
  • [8] Berlinet, A. and Thomas-Agnan, C. (2011). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media.
  • [9] Blanchard, G., Hoffmann, M. and Reiß, M. (2018). Optimal adaptation for early stopping in statistical inverse problems. SIAM/ASA Journal on Uncertainty Quantification 6, 1043–1075.
  • [10] Blanchard, G., Hoffmann, M., Reiß, M. et al. (2018). Early stopping for statistical inverse problems via truncated SVD estimation. Electronic Journal of Statistics 12, 3204–3231.
  • [11] Blanchard, G. and Krämer, N. (2016). Convergence rates of kernel conjugate gradient for random design regression. Analysis and Applications 14, 763–794.
  • [12] Blanchard, G. and Mathé, P. (2010). Conjugate gradient regularization under general smoothness and noise assumptions. Journal of Inverse and Ill-posed Problems 18, 701–726.
  • [13] Blanchard, G. and Mathé, P. (2012). Discrepancy principle for statistical inverse problems with application to conjugate gradient iteration. Inverse Problems 28, 115011.
  • [14] Bühlmann, P. and Yu, B. (2003). Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association 98, 324–339.
  • [15] Caponnetto, A. (2006). Optimal Rates for Regularization Operators in Learning Theory.
  • [16] Caponnetto, A. and Yao, Y. (2010). Cross-validation based adaptation for regularization operators in learning theory. Analysis and Applications 8, 161–183.
  • [17] Cavalier, L., Golubev, G. K., Picard, D., Tsybakov, A. B. et al. (2002). Oracle inequalities for inverse problems. The Annals of Statistics 30, 843–874.
  • [18] Celisse, A. and Wahl, M. (2021). Analyzing the discrepancy principle for kernelized spectral filter learning algorithms. Journal of Machine Learning Research 22, 1–59.
  • [19] Cucker, F. and Smale, S. (2002). On the mathematical foundations of learning. Bulletin of the American Mathematical Society 39, 1–49.
  • [20] Engl, H. W., Hanke, M. and Neubauer, A. (1996). Regularization of Inverse Problems, vol. 375. Springer Science & Business Media.
  • [21] Lo Gerfo, L., Rosasco, L., Odone, F., De Vito, E. and Verri, A. (2008). Spectral algorithms for supervised learning. Neural Computation 20, 1873–1897.
  • [22] Gu, C. (2013). Smoothing Spline ANOVA Models, vol. 297. Springer Science & Business Media.
  • [23] Hansen, P. C. (2010). Discrete Inverse Problems: Insight and Algorithms, vol. 7. SIAM.
  • [24] Koltchinskii, V. et al. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics 34, 2593–2656.
  • [25] Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 1302–1338.
  • [26] Mathé, P. and Pereverzev, S. V. (2003). Geometry of linear ill-posed problems in variable Hilbert scales. Inverse Problems 19, 789.
  • [27] Mendelson, S. (2002). Geometric parameters of kernel machines. In International Conference on Computational Learning Theory, 29–43. Springer.
  • [28] Raskutti, G., Wainwright, M. J. and Yu, B. (2012). Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research 13, 389–427.
  • [29] Raskutti, G., Wainwright, M. J. and Yu, B. (2014). Early stopping and non-parametric regression: an optimal data-dependent stopping rule. Journal of Machine Learning Research 15, 335–366.
  • [30] Rudi, A., Camoriano, R. and Rosasco, L. (2015). Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, 1657–1665.
  • [31] Scholkopf, B. and Smola, A. J. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
  • [32] Schwarz, G. et al. (1978). Estimating the dimension of a model. The Annals of Statistics 6, 461–464.
  • [33] Stankewitz, B. (2019). Smoothed residual stopping for statistical inverse problems via truncated SVD estimation.
  • [34] Stone, C. J. et al. (1985). Additive regression and other nonparametric models. The Annals of Statistics 13, 689–705.
  • [35] Wahba, G. (1977). Practical approximate solutions to linear operator equations when the data are noisy. SIAM Journal on Numerical Analysis 14, 651–667.
  • [36] Wahba, G. (1987). Three topics in ill-posed problems. In Inverse and Ill-posed Problems, 37–51. Elsevier.
  • [37] Wahba, G. (1990). Spline Models for Observational Data, vol. 59. SIAM.
  • [38] Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint, vol. 48. Cambridge University Press.
  • [39] Wasserman, L. (2006). All of Nonparametric Statistics. Springer Science & Business Media.
  • [40] Wei, Y., Yang, F. and Wainwright, M. J. (2017). Early stopping for kernel boosting algorithms: A general analysis with localized complexities. In Advances in Neural Information Processing Systems, 6067–6077.
  • [41] Yang, Y., Pilanci, M. and Wainwright, M. J. (2017). Randomized sketches for kernels: Fast and optimal nonparametric regression. The Annals of Statistics 45, 991–1023.
  • [42] Yao, Y., Rosasco, L. and Caponnetto, A. (2007). On early stopping in gradient descent learning. Constructive Approximation 26, 289–315. doi:10.1007/s00365-006-0663-2
  • [43] Zhang, T., Yu, B. et al. (2005). Boosting with early stopping: Convergence and consistency. The Annals of Statistics 33, 1538–1579.