
Early stopping and polynomial smoothing in regression with reproducing kernels

Yaroslav Averyanov (Inria MODAL project-team) and Alain Celisse (Laboratoire SAMM, Paris 1 Panthéon-Sorbonne University)
Abstract

In this paper, we study the problem of early stopping for iterative learning algorithms in a reproducing kernel Hilbert space (RKHS) in the nonparametric regression framework. In particular, we work with the gradient descent and (iterative) kernel ridge regression algorithms. We present a data-driven rule to perform early stopping without a validation set that is based on the so-called minimum discrepancy principle. This method relies on only one assumption on the regression function: it belongs to the RKHS. The proposed rule is proved to be minimax-optimal over different types of kernel spaces, including finite-rank and Sobolev smoothness classes. The proof is derived from a fixed-point analysis of the localized Rademacher complexities, which is a standard technique for obtaining optimal rates in the nonparametric regression literature. In addition, we present simulation results on artificial datasets showing that the designed rule performs comparably to other stopping rules, such as the one determined by V-fold cross-validation.

MSC subject classifications: 62G05, 62G08
Keywords: Nonparametric regression, Reproducing kernels, Early stopping, Localized Rademacher complexities
arXiv: 2007.06827


1 Introduction

An early stopping rule (ESR) is a form of regularization that chooses when to stop an iterative algorithm according to some design criterion. Its main idea is to lower the computational complexity of an iterative algorithm while preserving its statistical optimality. This approach is quite old and was initially developed in the 1970s for Landweber iterations solving ill-posed matrix problems [20, 36]. Recent papers provided insights into the connection between early stopping and boosting methods [6, 14, 40, 43], gradient descent, and Tikhonov regularization in a reproducing kernel Hilbert space (RKHS) [7, 29, 42]. For instance, [14] established the first optimal in-sample convergence rate of L^{2}-boosting with early stopping. Raskutti et al. [29] provided a result on a stopping rule that achieves the minimax-optimal rate for kernelized gradient descent and ridge regression over different smoothness classes. This work established an important connection between early stopping and the localized Rademacher complexities [5, 24, 38], which characterize the size of the explored function space. The main drawback of the result is that one needs to know the RKHS-norm of the regression function, or a tight upper bound on it, in order to apply this early stopping rule in practice. Besides that, this rule is design-dependent, which limits its practical application. In subsequent work, [40] showed how to control early stopping optimality via the localized Gaussian complexities in RKHS for different boosting algorithms (L^{2}-boosting, LogitBoost, and AdaBoost). Another theoretical result for a non-data-driven ESR was established by [11], where the authors proved a minimax-optimal (in the L_{2}(\mathbb{P}_{X}) out-of-sample norm) stopping rule for conjugate gradient descent in the nonparametric regression setting. [2] proposed a different approach, focusing on both time and memory computational savings by combining early stopping with the Nyström subsampling technique.

Some stopping rules that could (potentially) be applied in practice were provided by [9, 10] and [33]; they were based on the so-called minimum discrepancy principle [11, 13, 20, 23]. This principle consists of monitoring the empirical risk and determining the first time at which a given learning algorithm starts to fit the noise. In the papers mentioned, the authors considered spectral filter estimators such as gradient descent, Tikhonov (ridge) regularization, and spectral cut-off regression for the linear Gaussian sequence model, and derived several oracle-type inequalities for the proposed ESR. The main deficiency of the works [9, 10, 33] is that the authors dealt only with the linear Gaussian sequence model, and the minimax optimality result was restricted to the spectral cut-off estimator. It is worth mentioning that [33] introduced the so-called polynomial smoothing strategy to achieve the optimality of the minimum discrepancy principle ESR over Sobolev balls for the spectral cut-off estimator. More recently, [18] studied a minimum discrepancy principle stopping rule and a modified (also called smoothed) version of it, providing the range of regularities of the regression function for which these stopping rules are optimal for different spectral filter estimators in RKHS.

Contribution. Hence, to the best of our knowledge, there is no fully data-driven stopping rule for gradient descent or ridge regression in RKHS that does not use a validation set, does not depend on parameters of the model such as the RKHS-norm of the regression function, and comes with a proof of statistical optimality. In our paper, we combine techniques from [9], [29], and [33] to construct such an ESR. Our analysis is based on the bias-variance trade-off of an estimator, and we aim to detect the iteration at which the bias and variance intersect by means of the minimum discrepancy principle [9, 13, 18] and the localized Rademacher complexities [5, 24, 27, 38]. In particular, for kernels with infinite rank, we propose to use a special technique [13, 33] for the empirical risk in order to reduce its variance. Further, we introduce the new notions of smoothed empirical Rademacher complexity and smoothed critical radius to achieve minimax optimality bounds for the functional estimator based on the proposed rule. The smoothed critical radius is obtained by solving the associated fixed-point equation, which implies that the bounds in our analysis cannot be improved (up to numeric constants). It is important to note that the present paper establishes a connection between early stopping and a smoothed version of the statistical dimension of the n-dimensional kernel matrix, introduced by [41] for randomized projections in kernel ridge regression (see Section 4.3 for more details). We also show how to estimate the variance \sigma^{2} of the model, in particular for infinite-rank kernels. Finally, we provide experimental results on artificial data indicating the consistent performance of the proposed rules.

Outline of the paper. The organization of the paper is as follows. In Section 2, we introduce the background on nonparametric regression and reproducing kernel Hilbert spaces. There, we explain the updates of the two spectral filter iterative algorithms that will be studied: gradient descent and (iterative) kernel ridge regression. In Section 3, we clarify how to compute our first early stopping rule for finite-rank kernels and provide an oracle-type inequality (Theorem 3.1) and an upper bound on the risk error of this stopping rule with fixed covariates (Corollary 3.2). After that, we present a similar upper bound on the risk error with random covariates (Theorem 3.3) that is proved to be minimax-rate optimal. By contrast, Section 4 is devoted to the development of a new stopping rule for infinite-rank kernels based on the polynomial smoothing strategy [13, 33]. There, Theorem 4.2 shows, under a quite general assumption on the eigenvalues of the kernel operator, a high-probability upper bound on the performance of this stopping rule measured in the L_{2}(\mathbb{P}_{n}) in-sample norm. In particular, this upper bound leads to minimax optimality over Sobolev smoothness classes. In Section 5, we compare our stopping rules to other rules, such as methods using hold-out data and V-fold cross-validation. After that, we propose a strategy for estimating the variance \sigma^{2} of the regression model. Section 6 summarizes the content of the paper and describes some perspectives. Supplementary material and more technical proofs are deferred to the Appendix.

2 Nonparametric regression and reproducing kernel framework

2.1 Probabilistic model and notation

The context of the present work is that of nonparametric regression, where an i.i.d. sample \{(x_{i},y_{i}),\ i=1,\ldots,n\} of cardinality n is given, with x_{i}\in\mathcal{X} (the feature space) and y_{i}\in\mathbb{R}. The goal is to estimate the regression function f^{*}:\mathcal{X}\to\mathbb{R} from the model

y_{i}=f^{*}(x_{i})+\overline{\varepsilon}_{i},\qquad i=1,\ldots,n, (1)

where the error variables \overline{\varepsilon}_{i} are i.i.d. zero-mean Gaussian random variables \mathcal{N}(0,\sigma^{2}), with \sigma>0. In all that follows (except for Section 5, where results of empirical experiments are reported), the value of \sigma^{2} is assumed to be known, as in [29] and [40].

Throughout the paper, calculations are mainly derived in the fixed-design context, where the \{x_{i}\}_{i=1}^{n} are assumed to be fixed and only the error variables \{\overline{\varepsilon}_{i}\}_{i=1}^{n} are random. In this context, the performance of any estimator \widehat{f} of the regression function f^{*} is measured in terms of the so-called empirical norm, that is, the L_{2}(\mathbb{P}_{n})-norm defined by

\lVert\widehat{f}-f^{*}\rVert_{n}^{2}\coloneqq\frac{1}{n}\sum_{i=1}^{n}\Big{[}\widehat{f}(x_{i})-f^{*}(x_{i})\Big{]}^{2},

where \lVert h\rVert_{n}\coloneqq\sqrt{1/n\sum_{i=1}^{n}h(x_{i})^{2}} for any bounded function h over \mathcal{X}, and \langle\cdot,\cdot\rangle_{n} denotes the related inner product defined by \langle h_{1},h_{2}\rangle_{n}\coloneqq 1/n\sum_{i=1}^{n}h_{1}(x_{i})h_{2}(x_{i}) for any functions h_{1} and h_{2} bounded over \mathcal{X}. In this context, \mathbb{P}_{\varepsilon} and \mathbb{E}_{\varepsilon} denote the probability and expectation, respectively, with respect to the \{\overline{\varepsilon}_{i}\}_{i=1}^{n}.

By contrast, Section 3.1.2 discusses some extensions of the previous results to the random design context, where both the covariates \{x_{i}\}_{i=1}^{n} and the responses \{y_{i}\}_{i=1}^{n} are random variables. In this random design context, the performance of an estimator \widehat{f} of f^{*} is measured in terms of the L_{2}(\mathbb{P}_{X})-norm defined by

\lVert\widehat{f}-f^{*}\rVert_{2}^{2}\coloneqq\mathbb{E}_{X}\Big{[}(\widehat{f}(X)-f^{*}(X))^{2}\Big{]},

where \mathbb{P}_{X} denotes the probability distribution of the \{x_{i}\}_{i=1}^{n}. In what follows, \mathbb{P} and \mathbb{E}, respectively, stand for the probability and expectation with respect to the couples \{(x_{i},y_{i})\}_{i=1}^{n}.

Notation.

Throughout the paper, \lVert\cdot\rVert and \langle\cdot,\cdot\rangle are the usual Euclidean norm and inner product in \mathbb{R}^{n}. We write a_{n}\lesssim b_{n} whenever a_{n}\leq Cb_{n} for some numeric constant C>0 and all n\geq 1, and a_{n}\gtrsim b_{n} whenever a_{n}\geq Cb_{n} for some numeric constant C>0 and all n\geq 1. Similarly, a_{n}\asymp b_{n} means a_{n}\lesssim b_{n} and b_{n}\lesssim a_{n}. We set [M]\equiv\{1,\ldots,M\} for any M\in\mathbb{N}. For a\geq 0, we denote by \left\lfloor a\right\rfloor the largest natural number that is smaller than or equal to a, and by \left\lceil a\right\rceil the smallest natural number that is greater than or equal to a. Throughout the paper, we use the notation c,c_{1},\widetilde{c},C,\widetilde{C},\ldots for numeric constants that do not depend on the parameters considered; their values may change from line to line.

2.2 Statistical model and assumptions

2.2.1 Reproducing Kernel Hilbert Space (RKHS)

Let us start by introducing a reproducing kernel Hilbert space (RKHS), denoted by \mathcal{H} [4, 8, 22, 37]. Such an RKHS \mathcal{H} is a class of functions associated with a reproducing kernel \mathbb{K}:\mathcal{X}^{2}\to\mathbb{R} and endowed with an inner product \langle\cdot,\cdot\rangle_{\mathcal{H}} satisfying \langle\mathbb{K}(\cdot,x),\mathbb{K}(\cdot,y)\rangle_{\mathcal{H}}=\mathbb{K}(x,y) for all x,y\in\mathcal{X}. Each function within \mathcal{H} admits a representation as an element of L_{2}(\mathbb{P}_{X}), which justifies the slight abuse of notation when writing \mathcal{H}\subset L_{2}(\mathbb{P}_{X}) (see [19] and [18, Assumption 3]).

Assuming the RKHS \mathcal{H} is separable, under suitable regularity conditions (e.g., a continuous positive-semidefinite kernel), Mercer’s theorem [31] guarantees that the kernel can be expanded as

\mathbb{K}(x,x^{\prime})=\sum_{k=1}^{+\infty}\mu_{k}\phi_{k}(x)\phi_{k}(x^{\prime}),\quad\forall x,x^{\prime}\in\mathcal{X},

where \mu_{1}\geq\mu_{2}\geq\ldots\geq 0 and \{\phi_{k}\}_{k=1}^{+\infty} are, respectively, the eigenvalues and corresponding eigenfunctions of the kernel integral operator T_{\mathbb{K}}, given by

T_{\mathbb{K}}(f)(x)=\int_{\mathcal{X}}\mathbb{K}(x,u)f(u)d\mathbb{P}_{X}(u),\quad\forall f\in L_{2}(\mathbb{P}_{X}),\ x\in\mathcal{X}. (2)

It is then known that the family \{\phi_{k}\}_{k=1}^{+\infty} is an orthonormal basis of L_{2}(\mathbb{P}_{X}), while \{\sqrt{\mu_{k}}\phi_{k}\}_{k=1}^{+\infty} is an orthonormal basis of \mathcal{H}. Then, any function f\in\mathcal{H}\subset L_{2}(\mathbb{P}_{X}) can be expanded as f=\sum_{k=1}^{+\infty}\sqrt{\mu_{k}}\theta_{k}\phi_{k}, where, for all k such that \mu_{k}>0, the coefficients \{\theta_{k}\}_{k=1}^{\infty} are

\theta_{k}=\langle f,\sqrt{\mu_{k}}\phi_{k}\rangle_{\mathcal{H}}=\frac{1}{\sqrt{\mu_{k}}}\langle f,\phi_{k}\rangle_{L_{2}(\mathbb{P}_{X})}=\int_{\mathcal{X}}\frac{f(x)\phi_{k}(x)}{\sqrt{\mu_{k}}}d\mathbb{P}_{X}(x). (3)

Therefore, any two functions f,g\in\mathcal{H} can be represented by respective sequences \{a_{k}\}_{k=1}^{+\infty},\{b_{k}\}_{k=1}^{+\infty}\in\ell_{2}(\mathbb{N}) such that

f=\sum_{k=1}^{+\infty}a_{k}\phi_{k},\quad\mbox{and}\quad g=\sum_{k=1}^{+\infty}b_{k}\phi_{k},

with the inner product in the Hilbert space \mathcal{H} given by \langle f,g\rangle_{\mathcal{H}}=\sum_{k=1}^{+\infty}\frac{a_{k}b_{k}}{\mu_{k}}. This leads to the following representation of \mathcal{H} as an ellipsoid:

\mathcal{H}=\left\{f=\sum_{k=1}^{+\infty}a_{k}\phi_{k},\quad\sum_{k=1}^{+\infty}a_{k}^{2}<+\infty,\mbox{ and }\sum_{k=1}^{+\infty}\frac{a_{k}^{2}}{\mu_{k}}<+\infty\right\}.

2.2.2 Main assumptions

From the initial model given by Eq. (1), we make the following assumption.

Assumption 1 (Statistical model).

Let \mathbb{K}(\cdot,\cdot) denote a reproducing kernel as defined above, and let \mathcal{H} be the induced separable RKHS. Then, there exists a constant R>0 such that the n-sample (x_{1},y_{1}),\ldots,(x_{n},y_{n})\in\mathcal{X}^{n}\times\mathbb{R}^{n} satisfies the statistical model

y_{i}=f^{*}(x_{i})+\overline{\varepsilon}_{i},\quad\mbox{with}\quad f^{*}\in\mathbb{B}_{\mathcal{H}}(R)=\{f\in\mathcal{H}:\lVert f\rVert_{\mathcal{H}}\leq R\}, (4)

where the \{\overline{\varepsilon}_{i}\}_{i=1}^{n} are i.i.d. Gaussian random variables with \mathbb{E}[\overline{\varepsilon}_{i}\mid x_{i}]=0 and \mathbb{V}[\overline{\varepsilon}_{i}\mid x_{i}]=\sigma^{2}.

The model from Assumption 1 can be vectorized as

Y=[y_{1},\ldots,y_{n}]^{\top}=F^{*}+\overline{\varepsilon}\in\mathbb{R}^{n}, (5)

where F^{*}=[f^{*}(x_{1}),\ldots,f^{*}(x_{n})]^{\top} and \overline{\varepsilon}=[\overline{\varepsilon}_{1},\ldots,\overline{\varepsilon}_{n}]^{\top}, a form that turns out to be useful throughout the paper.

In the present paper, we make a boundedness assumption on the reproducing kernel \mathbb{K}(\cdot,\cdot).

Assumption 2.

Let us assume that the measurable reproducing kernel \mathbb{K}(\cdot,\cdot) is uniformly bounded on its support, meaning that there exists a constant B>0 such that

\underset{x\in\mathcal{X}}{\sup}\Big{[}\mathbb{K}(x,x)\Big{]}=\underset{x\in\mathcal{X}}{\sup}\lVert\mathbb{K}(\cdot,x)\rVert_{\mathcal{H}}^{2}\leq B.

Moreover, in what follows, we assume that B=1 without loss of generality.

Assumption 2 holds for many kernels. On the one hand, it is fulfilled by bounded kernels on an unbounded domain \mathcal{X} (e.g., the Gaussian and Laplace kernels). On the other hand, for unbounded kernels such as the polynomial or Sobolev kernels [31], it amounts to assuming that the domain \mathcal{X} is bounded. Let us also mention that Assumptions 1 and 2 (combined with the reproducing property) imply that f^{*} is uniformly bounded, since

\lVert f^{*}\rVert_{\infty}=\underset{x\in\mathcal{X}}{\sup}\left|\langle f^{*},\mathbb{K}(\cdot,x)\rangle_{\mathcal{H}}\right|\leq\lVert f^{*}\rVert_{\mathcal{H}}\underset{x\in\mathcal{X}}{\sup}\lVert\mathbb{K}(\cdot,x)\rVert_{\mathcal{H}}\leq R. (6)

Considering now the Gram matrix K=\{\mathbb{K}(x_{i},x_{j})\}_{1\leq i,j\leq n}, the related normalized Gram matrix K_{n}=\{\mathbb{K}(x_{i},x_{j})/n\}_{1\leq i,j\leq n} turns out to be symmetric and positive semidefinite. This entails the existence of the empirical eigenvalues \widehat{\mu}_{1},\ldots,\widehat{\mu}_{n} (respectively, eigenvectors \widehat{u}_{1},\ldots,\widehat{u}_{n}) such that K_{n}\widehat{u}_{i}=\widehat{\mu}_{i}\cdot\widehat{u}_{i} for all i\in[n]. Remark that Assumption 2 implies 0\leq\max(\widehat{\mu}_{1},\mu_{1})\leq 1.

For technical convenience, it turns out to be useful to rephrase the model (5) using the SVD of the normalized Gram matrix K_{n}. This leads to the new (rotated) model

Z_{i}=\langle\widehat{u}_{i},Y\rangle=G_{i}^{*}+\varepsilon_{i},\quad i=1,\ldots,n, (7)

where G_{i}^{*}=\langle\widehat{u}_{i},F^{*}\rangle, and \varepsilon_{i}=\langle\widehat{u}_{i},\overline{\varepsilon}\rangle is a zero-mean Gaussian random variable with variance \sigma^{2}.
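To make the rotated model (7) concrete, here is a minimal NumPy sketch (ours, not part of the original paper; the function name rotated_model and the toy data are illustrative assumptions) that builds the normalized Gram matrix K_{n} for the first-order Sobolev kernel \mathbb{K}(x_{1},x_{2})=\min\{x_{1},x_{2}\} used later in Section 3.2, eigendecomposes it, and forms the rotated observations Z_{i}=\langle\widehat{u}_{i},Y\rangle.

```python
import numpy as np

def rotated_model(x, y, kernel):
    """Eigendecompose the normalized Gram matrix K_n and rotate the
    observations as in Eq. (7): Z_i = <u_i, Y>."""
    n = len(x)
    K = kernel(x[:, None], x[None, :])      # Gram matrix K
    K_n = K / n                             # normalized Gram matrix K_n
    mu_hat, U = np.linalg.eigh(K_n)         # K_n is symmetric PSD
    order = np.argsort(mu_hat)[::-1]        # sort eigenvalues decreasingly
    mu_hat = np.clip(mu_hat[order], 0.0, None)
    U = U[:, order]
    Z = U.T @ y                             # rotated observations Z_i
    return mu_hat, U, Z

# toy data from model (1) with the first-order Sobolev kernel min(x, x')
rng = np.random.default_rng(0)
n, sigma = 200, 0.15
x = rng.uniform(0.0, 1.0, n)
f_star = np.abs(x - 0.5) - 0.5              # piecewise-linear signal
y = f_star + sigma * rng.normal(size=n)
mu_hat, U, Z = rotated_model(x, y, np.minimum)
```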

2.3 Spectral filter algorithms

Spectral filter algorithms were first introduced for solving ill-posed inverse problems with deterministic noise [20]. Among others, one typical example of such an algorithm is gradient descent (also known as L^{2}-boosting [14]). They were more recently brought to the supervised learning community, for instance, by [7, 15, 21, 42]. For estimating the vector F^{*} from Eq. (5) in the fixed-design context, such a spectral filter estimator is a linear estimator, which can be expressed as

F^{\lambda}\coloneqq\left(f^{\lambda}(x_{1}),\ldots,f^{\lambda}(x_{n})\right)^{\top}=K_{n}g_{\lambda}(K_{n})Y, (8)

where g_{\lambda}:[0,1]\to\mathbb{R} is called an admissible spectral filter function [7, 21]. For example, the choice g_{\lambda}(\xi)=\frac{1}{\xi+\lambda} corresponds to the kernel ridge estimator with regularization parameter \lambda>0 (see [9, 18] for other possible choices).

From the model expressed in the empirical eigenvector basis (7), the resulting spectral filter estimator (8) can be expressed as

G^{\lambda(t)}_{i}=\langle\widehat{u}_{i},F^{\lambda(t)}\rangle=\gamma_{i}^{(t)}Z_{i},\quad\forall i=1,\ldots,n, (9)

where t\mapsto\lambda(t)>0 is a decreasing function mapping t to a regularization parameter value at time t, and t\mapsto\gamma_{i}^{(t)} is defined by

\gamma_{i}^{(t)}=\widehat{\mu}_{i}g_{\lambda(t)}(\widehat{\mu}_{i}),\quad\forall i=1,\ldots,n.

Under the assumption that \underset{t\to 0}{\lim}g_{\lambda(t)}(\mu)=0 for \mu\in(0,1], it can be proved that \gamma_{i}^{(t)} is a non-decreasing function of t, with \gamma_{i}^{(0)}=0 and \underset{t\to\infty}{\lim}\gamma_{i}^{(t)}=1. Moreover, \widehat{\mu}_{i}=0 implies \gamma_{i}^{(t)}=0, as is the case for kernels with finite rank, that is, when \mathrm{rk}(K_{n})\leq r almost surely.

Thanks to the remark above, we define the convenient notations f^{t}\coloneqq f^{\lambda(t)} (for functions) and F^{t}\coloneqq F^{\lambda(t)} (for vectors), with a continuous time t\geq 0, by

f^{t}=g_{\lambda(t)}(S_{n}^{*}S_{n})S_{n}^{*}Y, (10)

where S_{n}:\mathcal{H}\to\mathbb{R}^{n} is the sampling operator and S_{n}^{*} is its adjoint, i.e., (S_{n}f)_{i}=f(x_{i}) and K_{n}=S_{n}S_{n}^{*}.

In what follows, we introduce an assumption on the \gamma_{i}^{(t)} functions that will play a crucial role in our analysis.

Assumption 3.
c\min\{1,\eta t\widehat{\mu}_{i}\}\leq\gamma_{i}^{(t)}\leq\min\{1,\eta t\widehat{\mu}_{i}\},\quad i=1,\ldots,n,

for some constants c\in(0,1) and \eta>0.

Let us mention two famous examples of spectral filter estimators that satisfy Assumption 3 with c=1/2 (see Lemma A.1 in the Appendix). These examples will be further studied in the present paper.

  • Gradient descent (GD) with a constant step-size 0<\eta<1/\widehat{\mu}_{1} and \eta t\to+\infty as t\to+\infty:

    \gamma_{i}^{(t)}=1-(1-\eta\widehat{\mu}_{i})^{t},\quad\forall t\geq 0,\ \forall i=1,\ldots,n. (11)

    The constant step-size \eta can be replaced by any non-increasing sequence \{\eta(t)\}_{t=0}^{+\infty} satisfying [29]

    • (\widehat{\mu}_{1})^{-1}\geq\eta(t)\geq\eta(t+1)\geq\dots, for t=0,1,\ldots,

    • \sum_{s=0}^{t-1}\eta(s)\to+\infty as t\to+\infty.

  • Kernel ridge regression (KRR) with the regularization parameter \lambda(t)=1/(\eta t), where \eta>0:

    \gamma_{i}^{(t)}=\frac{\widehat{\mu}_{i}}{\widehat{\mu}_{i}+\lambda(t)},\quad\forall t>0,\ \forall i=1,\ldots,n. (12)

    The linear parameterization \lambda(t)=1/(\eta t) is chosen for theoretical convenience.

The examples of \gamma_{i}^{(t)} above were derived with the initialization F^{0}=[f^{0}(x_{1}),\ldots,f^{0}(x_{n})]^{\top}=[0,\ldots,0]^{\top}, without loss of generality.
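As a quick numerical illustration (ours, under the synthetic spectrum assumed below), the following sketch computes \gamma_{i}^{(t)} for both filters from Eqs. (11) and (12) and checks the two-sided bound of Assumption 3 with c=1/2.

```python
import numpy as np

def gamma_gd(mu_hat, t, eta):
    """Gradient descent filter, Eq. (11)."""
    return 1.0 - (1.0 - eta * mu_hat) ** t

def gamma_krr(mu_hat, t, eta):
    """Kernel ridge regression filter, Eq. (12), with lambda(t) = 1/(eta t)."""
    lam = 1.0 / (eta * t)
    return mu_hat / (mu_hat + lam)

mu_hat = 1.0 / np.arange(1, 201) ** 2.0     # synthetic decaying spectrum
eta = 1.0 / (1.2 * mu_hat[0])               # constant step-size < 1/mu_1
for t in (1, 10, 100, 1000):
    upper = np.minimum(1.0, eta * t * mu_hat)
    for gamma in (gamma_gd(mu_hat, t, eta), gamma_krr(mu_hat, t, eta)):
        assert np.all(gamma <= upper + 1e-12)          # Assumption 3, upper bound
        assert np.all(gamma >= 0.5 * upper - 1e-12)    # Assumption 3, c = 1/2
```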

2.4 Key quantities

Given a set of parameters (stopping times) \mathcal{T}\coloneqq\{t\geq 0\} for an iterative learning algorithm, the present goal is to design \widehat{t}=\widehat{t}(\{x_{i},y_{i}\}_{i=1}^{n}) from the data \{x_{i},y_{i}\}_{i=1}^{n} such that the functional estimator f^{\widehat{t}} is as close as possible to the optimal one among \mathcal{T}.

Numerous classical model selection procedures for choosing \widehat{t} already exist, e.g., (generalized) cross-validation [35], the AIC and BIC criteria [1, 32], unbiased risk estimation [17], or Lepski’s balancing principle [26]. Their main drawback in the present context is that they require the practitioner to compute all the estimators \{f^{t},\ t\in\mathcal{T}\} in a first step, and then choose the optimal estimator among the candidates in a second step, which can be computationally demanding.

By contrast, early stopping is a less time-consuming approach. It is based on observing one estimator at each t\in\mathcal{T} and deciding to stop the learning process according to some criterion. Its aim is to reduce the computational cost induced by this selection procedure while preserving the statistical optimality properties of the output estimator.

The prediction error (risk) of an estimator f^{t} at time t is split into a bias and a variance term [29] as

R(t)=\mathbb{E}_{\varepsilon}\lVert f^{t}-f^{*}\rVert_{n}^{2}=\lVert\mathbb{E}_{\varepsilon}f^{t}-f^{*}\rVert_{n}^{2}+\mathbb{E}_{\varepsilon}\lVert f^{t}-\mathbb{E}_{\varepsilon}f^{t}\rVert_{n}^{2}=B^{2}(t)+V(t)

with

B^{2}(t)=\frac{1}{n}\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}(G_{i}^{*})^{2},\qquad V(t)=\frac{\sigma^{2}}{n}\sum_{i=1}^{n}(\gamma_{i}^{(t)})^{2}. (13)

The bias term is a non-increasing function of t converging to zero, while the variance term is a non-decreasing function of t. Assume further that \textnormal{rk}(T_{\mathbb{K}})\leq r, which implies that \textnormal{rk}(K_{n})\leq r almost surely; then the empirical risk R_{t} is introduced, with the notation of Eq. (7), as

R_{t}=\frac{1}{n}\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}Z_{i}^{2}=\frac{1}{n}\sum_{i=1}^{r}(1-\gamma_{i}^{(t)})^{2}Z_{i}^{2}+\frac{1}{n}\sum_{i=r+1}^{n}Z_{i}^{2}. (14)
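For reference, a direct transcription of the bias, variance, and reduced empirical risk of Eqs. (13), (14), and (18) could look as follows (a sketch with our own naming; gamma, G_star, and Z stand for the filter values and the rotated quantities of Eq. (7)).

```python
import numpy as np

def bias2_variance(gamma, G_star, sigma):
    """Squared bias and variance terms of Eq. (13); the risk R(t) is their sum."""
    n = len(gamma)
    bias2 = np.sum((1.0 - gamma) ** 2 * G_star ** 2) / n
    var = sigma ** 2 * np.sum(gamma ** 2) / n
    return bias2, var

def reduced_empirical_risk(gamma, Z, r):
    """Reduced empirical risk: the first r terms of Eq. (14), cf. Eq. (18)."""
    n = len(Z)
    return np.sum((1.0 - gamma[:r]) ** 2 * Z[:r] ** 2) / n
```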

An illustration of the typical behavior of the risk, empirical risk, bias, and variance is displayed by Figure 1.

Figure 1: Bias, variance, risk, and empirical risk behavior.

Our main concern is formulating a data-driven stopping rule (a mapping from the data \{(x_{i},y_{i})\}_{i=1}^{n} to a positive time \widehat{t}) so that the prediction errors \mathbb{E}_{\varepsilon}\lVert f^{\widehat{t}}-f^{*}\rVert_{n}^{2} or, equivalently, \mathbb{E}\lVert f^{\widehat{t}}-f^{*}\rVert_{2}^{2} are as small as possible.

The analysis of the forthcoming early stopping rules involves a model complexity measure known as the localized empirical Rademacher complexity [5, 24, 38], which we generalize to its \alpha-smoothed version, for \alpha\in[0,1].

Definition 2.1.

For any \epsilon>0 and \alpha\in[0,1], consider the localized smoothed empirical Rademacher complexity of \mathcal{H}, defined as

\widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})=R\left[\frac{1}{n}\sum_{j=1}^{r}\widehat{\mu}_{j}^{\alpha}\min\{\epsilon^{2},\widehat{\mu}_{j}\}\right]^{1/2}. (15)

It corresponds to a rescaled sum of the empirical eigenvalues, truncated at \epsilon^{2} and smoothed by \{\widehat{\mu}_{i}^{\alpha}\}_{i=1}^{r}.

For a given RKHS \mathcal{H} and noise level \sigma, let us finally define the empirical smoothed critical radius \widehat{\epsilon}_{n,\alpha} as the smallest positive value \epsilon such that

\frac{\widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})}{\epsilon R}\leq\frac{2R\epsilon^{1+\alpha}}{\sigma}. (16)

There is an extensive literature on the empirical critical equation and the related empirical critical radius [5, 27, 29], and providing an exhaustive review of this topic is beyond the scope of the present paper. Nevertheless, Appendix G establishes that the smoothed critical radius \widehat{\epsilon}_{n,\alpha} exists, is unique, and achieves equality in Ineq. (16). The constant 2 in Ineq. (16) is chosen for theoretical convenience only. If \alpha=0, then \widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})\equiv\widehat{\mathcal{R}}_{n}(\epsilon,\mathcal{H}) and \widehat{\epsilon}_{n,\alpha}\equiv\widehat{\epsilon}_{n}.
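Since \widehat{\epsilon}_{n,\alpha} is defined through the fixed-point inequality (16), it can be approximated numerically from the empirical eigenvalues, for instance by bisection, as in the following sketch (ours; it only relies on the existence and uniqueness guaranteed in Appendix G and on \widehat{\mu}_{i}\leq 1 from Assumption 2).

```python
import numpy as np

def smoothed_rademacher(eps, mu_hat, R, alpha):
    """Localized smoothed empirical Rademacher complexity, Eq. (15)."""
    n = len(mu_hat)
    return R * np.sqrt(np.sum(mu_hat ** alpha * np.minimum(eps ** 2, mu_hat)) / n)

def smoothed_critical_radius(mu_hat, R, sigma, alpha=0.0, tol=1e-10):
    """Smallest eps > 0 achieving equality in Ineq. (16), found by bisection."""
    def excess(eps):   # positive while Ineq. (16) is violated
        lhs = smoothed_rademacher(eps, mu_hat, R, alpha) / (eps * R)
        rhs = 2.0 * R * eps ** (1.0 + alpha) / sigma
        return lhs - rhs
    lo, hi = tol, 1.0                   # mu_hat <= 1 under Assumption 2
    while excess(hi) > 0.0:             # enlarge the bracket if necessary
        hi *= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if excess(mid) > 0.0 else (lo, mid)
    return hi
```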

3 Data-driven early stopping rule and minimum discrepancy principle

Let us start by recalling that the expression of the empirical risk in Eq. (14) shows that the empirical risk is a non-increasing function of t (as illustrated in Fig. 1). This is consistent with the intuition that the amount of available information within the residuals decreases as t grows. If there exists a time t such that f^{t}\approx f^{*}, then the empirical risk is approximately equal to \sigma^{2} (the noise level), that is,

\mathbb{E}_{\varepsilon}R_{t}=\mathbb{E}_{\varepsilon}\Big{[}\lVert F^{t}-Y\rVert_{n}^{2}\Big{]}\approx\mathbb{E}_{\varepsilon}\Big{[}\lVert F^{*}-Y\rVert_{n}^{2}\Big{]}=\mathbb{E}_{\varepsilon}\Big{[}\lVert\varepsilon\rVert_{n}^{2}\Big{]}=\sigma^{2}. (17)

Introducing the reduced empirical risk \widetilde{R}_{t},\ t\geq 0, and recalling that \textnormal{rk}(K_{n})\leq r, we get

\mathbb{E}_{\varepsilon}R_{t}=\mathbb{E}_{\varepsilon}\left[\frac{1}{n}\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}Z_{i}^{2}\right]=\mathbb{E}_{\varepsilon}\underbrace{\left[\frac{1}{n}\sum_{i=1}^{r}(1-\gamma_{i}^{(t)})^{2}Z_{i}^{2}\right]}_{\coloneqq\widetilde{R}_{t}}+\frac{n-r}{n}\sigma^{2}\overset{(\textnormal{i})}{\approx}\sigma^{2}, (18)

where (\textnormal{i}) is due to Eq. (17). This heuristic argument gives rise to a first deterministic stopping rule t^{*} involving the reduced empirical risk and given by

t^{*}=\inf\left\{t>0\ |\ \mathbb{E}_{\varepsilon}\widetilde{R}_{t}\leq\frac{r\sigma^{2}}{n}\right\}. (19)

Since t^{*} is not achievable in practice, an estimator of t^{*} is given by the data-driven stopping rule \tau based on the so-called minimum discrepancy principle:

\tau=\inf\left\{t>0\ |\ \widetilde{R}_{t}\leq\frac{r\sigma^{2}}{n}\right\}. (20)

The existing literature on MDP-based stopping rules usually defines \tau by the event \{R_{t}\leq\sigma^{2}\} [9, 11, 13, 20, 23, 33]. Notice that with a full-rank kernel matrix, the reduced empirical risk \widetilde{R}_{t} is equal to the classical empirical risk R_{t}, leading then to the same stopping rule. From a practical perspective, knowing the rank of the Gram matrix avoids estimating the last n-r components of the vector G^{*}, which are already known to be zero (see [29, Section 4.1] for more details).
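In practice, \tau is obtained by monitoring the reduced empirical risk along the iterations. A minimal sketch for kernel gradient descent over a discrete time grid (ours; Z and \widehat{\mu}_{i} denote the rotated observations and empirical eigenvalues of Eq. (7)) reads as follows.

```python
import numpy as np

def mdp_stopping_time(Z, mu_hat, sigma, eta, r=None, t_max=100_000):
    """Minimum discrepancy stopping rule tau of Eq. (20) for kernel gradient
    descent: the first t with reduced empirical risk <= r * sigma^2 / n."""
    n = len(Z)
    if r is None:                            # numerical rank of K_n
        r = int(np.sum(mu_hat > 1e-12))
    threshold = r * sigma ** 2 / n
    for t in range(1, t_max + 1):
        gamma = 1.0 - (1.0 - eta * mu_hat[:r]) ** t             # Eq. (11)
        R_tilde = np.sum((1.0 - gamma) ** 2 * Z[:r] ** 2) / n   # Eq. (18)
        if R_tilde <= threshold:
            return t
    return t_max
```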

3.1 Finite-rank kernels

3.1.1 Fixed-design framework

Let us start by discussing our results in the case of RKHSs generated by finite-rank kernels with rank r<n: \mu_{i}=0 and \widehat{\mu}_{i}=0 for i>r. Examples of such kernels include the linear kernel \mathbb{K}(x_{1},x_{2})=x_{1}^{\top}x_{2} and the polynomial kernel of degree d\in\mathbb{N}, \mathbb{K}(x_{1},x_{2})=(1+x_{1}^{\top}x_{2})^{d}.

The following theorem applies to any functional estimator \{f^{t}\}_{t\in[0,T]} generated by (10) and initialized at f^{0}=0. The main part of the proof consists in properly upper bounding \mathbb{E}_{\varepsilon}|\mathbb{E}_{\varepsilon}\widetilde{R}_{t^{*}}-\widetilde{R}_{t^{*}}| and follows the same lines as Proposition 3.1 in [9].

Theorem 3.1.

Under Assumptions 1 and 2, given the stopping rule (20),

\mathbb{E}_{\varepsilon}\lVert f^{\tau}-f^{*}\rVert_{n}^{2}\leq 2(1+\theta^{-1})\mathbb{E}_{\varepsilon}\lVert f^{t^{*}}-f^{*}\rVert_{n}^{2}+2(\sqrt{3}+\theta)\frac{\sqrt{r}\sigma^{2}}{n} (21)

for any positive \theta.

Proof of Theorem 3.1.

In this proof, we will use the following inequalities: for any a,b\geq 0, (a-b)^{2}\leq|a^{2}-b^{2}|, and 2ab\leq\theta a^{2}+\frac{1}{\theta}b^{2} for any \theta>0.

Let us first prove the subsequent oracle-type inequality for the difference between f^{\tau} and f^{t^{*}}. Consider

\lVert f^{t^{*}}-f^{\tau}\rVert_{n}^{2}=\frac{1}{n}\sum_{i=1}^{r}\Big{(}\gamma_{i}^{(t^{*})}-\gamma_{i}^{(\tau)}\Big{)}^{2}Z_{i}^{2}\leq\frac{1}{n}\sum_{i=1}^{r}|(1-\gamma_{i}^{(t^{*})})^{2}-(1-\gamma_{i}^{(\tau)})^{2}|Z_{i}^{2}
=(\widetilde{R}_{t^{*}}-\widetilde{R}_{\tau})\mathbb{I}\left\{\tau\geq t^{*}\right\}+(\widetilde{R}_{\tau}-\widetilde{R}_{t^{*}})\mathbb{I}\left\{\tau<t^{*}\right\}
\leq(\widetilde{R}_{t^{*}}-\mathbb{E}_{\varepsilon}\widetilde{R}_{t^{*}})\mathbb{I}\left\{\tau\geq t^{*}\right\}+(\mathbb{E}_{\varepsilon}\widetilde{R}_{t^{*}}-\widetilde{R}_{t^{*}})\mathbb{I}\left\{\tau<t^{*}\right\}
\leq|\widetilde{R}_{t^{*}}-\mathbb{E}_{\varepsilon}\widetilde{R}_{t^{*}}|.

From the definition of \widetilde{R}_{t} in (18), one notices that

|\widetilde{R}_{t^{*}}-\mathbb{E}_{\varepsilon}\widetilde{R}_{t^{*}}|=\left|\sum_{i=1}^{r}(1-\gamma_{i}^{(t^{*})})^{2}\Big{[}\frac{1}{n}(\varepsilon_{i}^{2}-\sigma^{2})+\frac{2}{n}\varepsilon_{i}G_{i}^{*}\Big{]}\right|.

From \mathbb{E}_{\varepsilon}|X(\varepsilon)|\leq\sqrt{\text{var}_{\varepsilon}X(\varepsilon)} for X(\varepsilon) centered, \sqrt{a+b}\leq\sqrt{a}+\sqrt{b} for any a,b\geq 0, and \mathbb{E}_{\varepsilon}\left(\varepsilon^{4}\right)\leq 3\sigma^{4}, it follows that

\mathbb{E}_{\varepsilon}|\widetilde{R}_{t^{*}}-\mathbb{E}_{\varepsilon}\widetilde{R}_{t^{*}}|\leq\sqrt{\frac{2\sigma^{2}}{n^{2}}\sum_{i=1}^{r}(1-\gamma_{i}^{(t^{*})})^{4}\left[\frac{3}{2}\sigma^{2}+2(G_{i}^{*})^{2}\right]}
\leq\sqrt{\frac{3\sigma^{4}}{n^{2}}\sum_{i=1}^{r}(1-\gamma_{i}^{(t^{*})})^{2}}+\sqrt{\frac{4\sigma^{2}}{n^{2}}\sum_{i=1}^{r}(1-\gamma_{i}^{(t^{*})})^{2}(G_{i}^{*})^{2}}
\leq\frac{\sqrt{3}\sigma^{2}\sqrt{r}}{n}+\theta\frac{\sigma^{2}}{n}+\theta^{-1}B^{2}(t^{*})
\leq\theta^{-1}B^{2}(t^{*})+(\sqrt{3}+\theta)\frac{\sqrt{r}\sigma^{2}}{n}.

Applying the inequalities (a+b)^{2}\leq 2a^{2}+2b^{2} for any a,b\geq 0 and B^{2}(t^{*})\leq\mathbb{E}_{\varepsilon}\lVert f^{t^{*}}-f^{*}\rVert_{n}^{2}, we arrive at

\mathbb{E}_{\varepsilon}\lVert f^{\tau}-f^{*}\rVert_{n}^{2}\leq 2\mathbb{E}_{\varepsilon}\lVert f^{t^{*}}-f^{*}\rVert_{n}^{2}+2\mathbb{E}_{\varepsilon}\lVert f^{\tau}-f^{t^{*}}\rVert_{n}^{2}
\leq 2(1+\theta^{-1})\mathbb{E}_{\varepsilon}\lVert f^{t^{*}}-f^{*}\rVert_{n}^{2}+2(\sqrt{3}+\theta)\frac{\sqrt{r}\sigma^{2}}{n}.

First of all, it is worth noting that the risk of the estimator f^{t^{*}} is proved to be optimal for gradient descent and kernel ridge regression regardless of the kernel used (see Appendix C for the proof), so it remains to focus on the remainder term on the right-hand side of Ineq. (21). Theorem 3.1 applies to any reproducing kernel, but one remarks that for infinite-rank kernels, r=n, and we achieve only the rate \mathcal{O}\left(1/\sqrt{n}\right). This rate is suboptimal since, for instance, RKHSs with polynomial eigenvalue decay kernels (considered in the next subsection) have a minimax-optimal risk rate of order \mathcal{O}\left(n^{-\frac{\beta}{\beta+1}}\right), with \beta>1. Therefore, the oracle-type inequality (21) is useful mainly for finite-rank kernels, thanks to the fast \mathcal{O}(\sqrt{r}/n) rate of the remainder term.

Notice that, in order to artificially make the term \mathcal{O}(\sqrt{r}/n) a remainder term (even in cases corresponding to infinite-rank kernels), [9, 10] introduced in the definitions of their stopping rules a restriction on the ”starting time” t_{0}. However, in the mentioned work, this restriction came at the price of possibly missing the designed time \tau. Besides that, [10] developed an additional procedure, based on standard model selection criteria such as the AIC criterion, for the spectral cut-off estimator to recover the ”missing” stopping rule and achieve optimality over Sobolev-type ellipsoids. In our work, we remove such a strong assumption.

As a corollary of Theorem 3.1, one can prove that f^{\tau} provides a minimax estimator of f^{*} over the ball of radius R.

Corollary 3.2.

Under Assumptions 1, 2, 3, if a kernel has finite rank r, then

\mathbb{E}_{\varepsilon}\lVert f^{\tau}-f^{*}\rVert_{n}^{2}\leq c_{u}R^{2}\widehat{\epsilon}_{n}^{2}, (22)

where the constant c_{u} is numeric.

Proof of Corollary 3.2.

From Theorem 3.1 and Lemma C.2 in the Appendix,

\mathbb{E}_{\varepsilon}\lVert f^{\tau}-f^{*}\rVert_{n}^{2}\leq 16(1+\theta^{-1})R^{2}\widehat{\epsilon}_{n}^{2}+2(\sqrt{3}+\theta)\frac{\sqrt{r}\sigma^{2}}{n}. (23)

Further, applying [29, Section 4.3], \widehat{\epsilon}_{n}^{2}=c\frac{r\sigma^{2}}{nR^{2}}, which implies that

\mathbb{E}_{\varepsilon}\lVert f^{\tau}-f^{*}\rVert_{n}^{2}\leq\Big{[}16(1+\theta^{-1})+\frac{2(\sqrt{3}+\theta)}{c}\Big{]}R^{2}\widehat{\epsilon}_{n}^{2}. (24)

Note that the critical radius \widehat{\epsilon}_{n} cannot be arbitrarily small since it should satisfy Ineq. (16). As will be clarified later, the squared empirical critical radius is essentially optimal.

3.1.2 Random-design framework

We would like to transfer the minimax optimality bound for the estimator f^{\tau} from the empirical L_{2}(\mathbb{P}_{n})-norm to the population L_{2}(\mathbb{P}_{X})-norm by means of the so-called localized population Rademacher complexity. This complexity measure has become a standard tool in empirical processes and nonparametric regression [5, 24, 29, 38].

For any kernel function class studied in the paper, we consider the localized Rademacher complexity that can be seen as a population counterpart of the empirical Rademacher complexity (15) introduced earlier:

\overline{\mathcal{R}}_{n}(\epsilon,\mathcal{H})=R\left[\frac{1}{n}\sum_{i=1}^{+\infty}\min\{\mu_{i},\epsilon^{2}\}\right]^{1/2}. (25)

Using the localized population Rademacher complexity, we define the associated population critical radius \epsilon_{n}>0 as the smallest positive solution \epsilon of the inequality

\frac{\overline{\mathcal{R}}_{n}(\epsilon,\mathcal{H})}{\epsilon R}\leq\frac{2\epsilon R}{\sigma}. (26)

In contrast to the empirical critical radius \widehat{\epsilon}_{n}, this quantity is not data-dependent, since it is specified by the population eigenvalues of the kernel operator T_{\mathbb{K}} underlying the RKHS.

Theorem 3.3.

Under Assumptions 1, 2, and 3, given the stopping time (20), there is a positive numeric constant \widetilde{c}_{u} such that, for finite-rank kernels with rank r, with probability at least 1-c\exp(-c_{1}n\epsilon_{n}^{2}),

\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\leq\widetilde{c}_{u}R^{2}\epsilon_{n}^{2}. (27)

In addition, the risk error of \tau is bounded as

\mathbb{E}\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\leq\frac{\widetilde{c}r\sigma^{2}}{n}+\underbrace{C(\sigma,R)\exp(-cr)}_{\textnormal{remainder term}}, (28)

where the constant C(\sigma,R) depends only on \sigma and R.

Remark.

The full proof is deferred to Section F. Regarding Ineq. (27), \epsilon_{n}^{2} is proven to be the minimax-optimal rate for the L_{2}(\mathbb{P}_{X}) norm in an RKHS (see [5, 27, 29]). As for the risk error in Ineq. (28), the (exponential) remainder term should decrease to zero faster than \frac{r\sigma^{2}}{n}, and Theorem 3.3 provides a rate \mathcal{O}\left(\frac{r\sigma^{2}}{n}\right) that matches, up to a constant, the minimax bound (see, e.g., [28, Theorem 2(a)] with s=1) when f^{*} belongs to the \mathcal{H}-norm ball of a fixed radius R, and is thus not improvable in general. A similar bound for finite-rank kernels was achieved in [29, Corollary 4].

We summarize our findings in the following corollary.

Corollary 3.4.

Under Assumptions 1, 2, 3 and a finite-rank kernel, the early stopping rule \tau satisfies

\mathbb{E}\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\asymp\underset{\widehat{f}}{\inf}\underset{\lVert f^{*}\rVert_{\mathcal{H}}\leq R}{\sup}\mathbb{E}\lVert\widehat{f}-f^{*}\rVert_{2}^{2}, (29)

where the infimum is taken over all measurable functions of the input data.

3.2 Practical behavior of \tau with infinite-rank kernels

A typical example of an RKHS with an infinite-rank kernel is the k^{\textnormal{th}}-order Sobolev space, for some fixed integer k\geq 1, with the Lebesgue measure on a bounded domain. We consider Sobolev spaces consisting of functions whose k^{\textnormal{th}}-order weak derivatives f^{(k)} are Lebesgue integrable and which satisfy f^{(0)}(0)=f^{(1)}(0)=\ldots=f^{(k-1)}(0)=0. It is worth mentioning that for such classes, the eigenvalues of the kernel operator satisfy \mu_{i}\asymp i^{-\beta},\ i=1,2,\ldots, with \beta=2k. Another example of a kernel with this eigenvalue decay is the Laplace kernel \mathbb{K}(x_{1},x_{2})=e^{-|x_{1}-x_{2}|},\ x_{1},x_{2}\in\mathbb{R} (see [31, p. 402]).

First, let us illustrate the practical behavior of the ESR (20) (via its histogram) for gradient descent (9) with the step-size \eta=1/(1.2\widehat{\mu}_{1}) and the one-dimensional Sobolev kernel \mathbb{K}(x_{1},x_{2})=\min\{x_{1},x_{2}\}, which generates the reproducing space

\mathcal{H}=\left\{f:[0,1]\to\mathbb{R}\ |\ f(0)=0,\int_{0}^{1}(f^{\prime}(x))^{2}dx<\infty\right\}. (30)

We deal with the model (1) with two regression functions: a smooth piecewise-linear function f^{*}(x)=|x-1/2|-1/2 and a nonsmooth heavisine function f^{*}(x)=0.093\ [4\sin(4\pi x)-\textnormal{sign}(x-0.3)-\textnormal{sign}(0.72-x)]. The design points are random, x_{i}\overset{\textnormal{i.i.d.}}{\sim}\mathbb{U}[0,1]. The number of observations is n=200. For both functions, \lVert f^{*}\rVert_{n}\approx 0.28, and we set a middle-difficulty noise level \sigma=0.15. The number of repetitions is N=200.
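A compact, self-contained sketch of this simulation setup (ours; the helper tau_gd is an illustrative implementation of the rule (20) for gradient descent, and the heavisine case is obtained by swapping the definition of f_star) is given below.

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma, n_rep = 200, 0.15, 200

def heavisine(x):
    return 0.093 * (4.0 * np.sin(4.0 * np.pi * x)
                    - np.sign(x - 0.3) - np.sign(0.72 - x))

def tau_gd(x, y, sigma, t_max=10_000):
    """Rule (20) for gradient descent with the Sobolev kernel min(x, x')."""
    n = len(x)
    K_n = np.minimum(x[:, None], x[None, :]) / n      # normalized Gram matrix
    mu_hat, U = np.linalg.eigh(K_n)
    mu_hat, U = mu_hat[::-1].clip(0.0), U[:, ::-1]    # sort decreasingly
    Z = U.T @ y                                       # rotated observations
    eta = 1.0 / (1.2 * mu_hat[0])                     # step-size of Section 3.2
    for t in range(1, t_max + 1):
        gamma = 1.0 - (1.0 - eta * mu_hat) ** t       # Eq. (11)
        if np.sum((1.0 - gamma) ** 2 * Z ** 2) / n <= sigma ** 2:
            return t                                  # full-rank case: r = n
    return t_max

taus = []
for _ in range(n_rep):
    x = rng.uniform(0.0, 1.0, n)
    f_star = np.abs(x - 0.5) - 0.5            # or: f_star = heavisine(x)
    y = f_star + sigma * rng.normal(size=n)
    taus.append(tau_gd(x, y, sigma))
# a histogram of `taus` corresponds to the tau-histograms displayed in Figure 2
```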

Figure 2: Histograms of \tau vs t^{*} vs t^{b}\coloneqq\inf\{t>0\ |\ B^{2}(t)\leq V(t)\} vs t_{\textnormal{or}}\coloneqq\underset{t>0}{\textnormal{argmin}}\left[\mathbb{E}_{\varepsilon}\lVert f^{t}-f^{*}\rVert_{n}^{2}\right] for kernel gradient descent with the step-size \eta=1/(1.2\widehat{\mu}_{1}), for the piecewise-linear f^{*}(x)=|x-1/2|-1/2 (panel (a)) and heavisine f^{*}(x)=0.093\ [4\sin(4\pi x)-\textnormal{sign}(x-0.3)-\textnormal{sign}(0.72-x)] (panel (b)) regression functions, and the first-order Sobolev kernel \mathbb{K}(x_{1},x_{2})=\min\{x_{1},x_{2}\}.

In panel (a) of Figure 2, we observe that our stopping rule \tau has a high variance. However, if we change the signal f^{*} from the smooth to the nonsmooth one, the regression function no longer belongs to the space \mathcal{H} defined in (30). In this case (panel (b) of Figure 2), the stopping rule \tau performs much better than for the previous regression function. In order to obtain a stable early stopping rule that is close to t^{*}, we propose using a special smoothing technique for the empirical risk.

4 Polynomial smoothing

As was discussed earlier, the main issue behind the poor behavior of the stopping rule \tau for infinite-rank kernels is the variability of the empirical risk around its expectation. The solution we propose is to smooth the empirical risk by means of the eigenvalues of the normalized Gram matrix.

4.1 Polynomial smoothing and minimum discrepancy principle rule

We start by defining the squared \alpha-norm as \lVert f\rVert_{n,\alpha}^{2}\coloneqq\langle K_{n}^{\alpha}F,F\rangle_{n} for all F=\left[f(x_{1}),\ldots,f(x_{n})\right]^{\top}\in\mathbb{R}^{n} and \alpha\in[0,1], from which we also introduce the smoothed risk, bias, and variance of a spectral filter estimator as

R_{\alpha}(t)=\mathbb{E}_{\varepsilon}\lVert f^{t}-f^{*}\rVert_{n,\alpha}^{2}=\lVert\mathbb{E}_{\varepsilon}f^{t}-f^{*}\rVert_{n,\alpha}^{2}+\mathbb{E}_{\varepsilon}\lVert f^{t}-\mathbb{E}_{\varepsilon}f^{t}\rVert_{n,\alpha}^{2}=B^{2}_{\alpha}(t)+V_{\alpha}(t),

with

B^{2}_{\alpha}(t)=\frac{1}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(t)})^{2}(G_{i}^{*})^{2},\qquad V_{\alpha}(t)=\frac{\sigma^{2}}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}(\gamma_{i}^{(t)})^{2}. (31)

The smoothed empirical risk is

R_{\alpha,t}=\lVert F^{t}-Y\rVert_{n,\alpha}^{2}=\lVert G^{t}-Z\rVert_{n,\alpha}^{2}=\frac{1}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(t)})^{2}Z_{i}^{2},\quad\textnormal{ for }t>0. (32)

Recall that the kernel is bounded by B=1, so that \widehat{\mu}_{i}\leq 1 for all i=1,\ldots,n; hence the smoothed bias B_{\alpha}^{2}(t) and smoothed variance V_{\alpha}(t) are smaller than their non-smoothed counterparts.

Analogously to the heuristic derivation leading to the stopping rule (20), the new stopping rule is based on the discrepancy principle applied to the \alpha-smoothed empirical risk, that is,

\tau_{\alpha}=\inf\left\{t>0\ |\ R_{\alpha,t}\leq\sigma^{2}\frac{\mathrm{tr}(K_{n}^{\alpha})}{n}\right\}, (33)

where \sigma^{2}\mathrm{tr}(K_{n}^{\alpha})/n=\sigma^{2}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}/n is the natural counterpart of r\sigma^{2}/n in the case of a full-rank kernel matrix and the \alpha-norm.
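Computationally, \tau_{\alpha} only requires reweighting the terms of the empirical risk by \widehat{\mu}_{i}^{\alpha} and changing the threshold accordingly, as in the following sketch for kernel gradient descent (ours, with the same rotated quantities Z and \widehat{\mu}_{i} as before).

```python
import numpy as np

def smoothed_mdp_stopping_time(Z, mu_hat, sigma, eta, alpha, t_max=100_000):
    """Polynomially smoothed discrepancy rule tau_alpha of Eq. (33) for kernel
    gradient descent: stop once the smoothed empirical risk of Eq. (32) drops
    below sigma^2 * tr(K_n^alpha) / n."""
    n = len(Z)
    weights = mu_hat ** alpha                      # spectral smoothing weights
    threshold = sigma ** 2 * np.sum(weights) / n   # sigma^2 tr(K_n^alpha) / n
    for t in range(1, t_max + 1):
        gamma = 1.0 - (1.0 - eta * mu_hat) ** t    # Eq. (11)
        R_alpha = np.sum(weights * (1.0 - gamma) ** 2 * Z ** 2) / n
        if R_alpha <= threshold:
            return t
    return t_max
```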

4.2 Related work

The idea of smoothing the empirical risk (the residuals) is not new in the literature. For instance, [11, 12, 13] discussed various smoothing strategies applied to (kernelized) conjugate gradient descent, and [18] considered spectral regularization with spectral filter estimators. More closely related to the present work, [33] studied a statistical performance improvement allowed by polynomial smoothing of the residuals (as we do here) but restricted to the spectral cut-off estimator.

In [12, 13], the authors considered the following statistical inverse problem: z=Ax+\sigma\zeta, where A is a self-adjoint operator and \zeta is Gaussian noise. In their case, for the purpose of achieving optimal rates, the usual discrepancy principle rule \lVert Ax_{m}-z\rVert\leq\vartheta\delta (m is the iteration number, \vartheta is a parameter) was modified to \lVert\rho_{\lambda}(A)(Ax_{m}-z)\rVert\leq\vartheta\delta, where \rho_{\lambda}(t)=\frac{1}{\sqrt{t+\lambda}} and \delta is the normalized variance of the Gaussian noise.

In [11], the minimum discrepancy principle was modified as follows: each iteration m of conjugate gradient descent was represented by a vector \widehat{\alpha}_{m}=K_{n}^{\dagger}Y, where K_{n}^{\dagger} is the pseudo-inverse of the normalized Gram matrix, and the learning process was stopped if \lVert Y-K_{n}\widehat{\alpha}_{m}\rVert_{K_{n}}<\Omega for some positive \Omega, where \lVert\alpha\rVert_{K_{n}}^{2}=\langle\alpha,K_{n}\alpha\rangle. Thus, this method corresponds (up to a threshold) to the stopping rule (33) with \alpha=1.

In the work [33], the authors concentrated on the inverse problem Y=A\xi+\delta W and its corresponding Gaussian vector observation model Y_{i}=\tilde{\mu}_{i}\xi_{i}+\delta\varepsilon_{i},\ i\in[r], where \{\tilde{\mu}_{i}\}_{i=1}^{r} are the singular values of the linear bounded operator A and \{\varepsilon_{i}\}_{i=1}^{r} are Gaussian noise variables. They recovered the signal \{\xi_{i}\}_{i=1}^{r} by a cut-off estimator of the form \widehat{\xi}_{i}^{(t)}=\mathbb{I}\{i\leq t\}\widetilde{\mu}_{i}^{-1}Y_{i},\ i\in[r]. The discrepancy principle in this case was \lVert(AA^{\top})^{\alpha/2}(Y-A\widehat{\xi}^{(t)})\rVert^{2}\leq\kappa for some positive \kappa. They found that, if the smoothing parameter \alpha lies in the interval [\frac{1}{4p},\frac{1}{2p}), where p is the polynomial decay rate of the singular values \{\widetilde{\mu}_{i}\}_{i=1}^{r}, then the cut-off estimator is adaptive over Sobolev ellipsoids. Therefore, our work can be considered as an extension of [33] that generalizes the polynomial smoothing strategy to more complex filter estimators, such as gradient descent and (Tikhonov) ridge regression, in the reproducing kernel framework.

4.3 Optimality result (fixed-design)

We pursue the analogy a bit further by defining the smoothed statistical dimension as

d_{n,\alpha}\coloneqq\inf\left\{j\in[n]:\widehat{\mu}_{j}\leq\widehat{\epsilon}_{n,\alpha}^{2}\right\}, (34)

and d_{n,\alpha}=n if no such index exists. Combined with (15), this implies that

\widehat{\mathcal{R}}_{n,\alpha}^{2}(\widehat{\epsilon}_{n,\alpha},\mathcal{H})\geq\frac{\sum_{j=1}^{d_{n,\alpha}}\widehat{\mu}_{j}^{\alpha}}{n}R^{2}\widehat{\epsilon}_{n,\alpha}^{2},\ \ \textnormal{ and }\ \ \widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\geq\frac{\sigma^{2}\sum_{j=1}^{d_{n,\alpha}}\widehat{\mu}_{j}^{\alpha}}{4R^{2}n}. (35)
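Given the sorted empirical eigenvalues and a value of \widehat{\epsilon}_{n,\alpha} (for instance from the bisection sketch of Section 2.4), the smoothed statistical dimension of Eq. (34) is immediate to compute; the small helper below (ours) makes the definition explicit.

```python
import numpy as np

def smoothed_statistical_dimension(mu_hat, eps_n_alpha):
    """d_{n,alpha} of Eq. (34): the first (1-based) index j with
    mu_hat_j <= eps_n_alpha^2, or n if no such index exists."""
    below = np.flatnonzero(mu_hat <= eps_n_alpha ** 2)
    return int(below[0]) + 1 if below.size else len(mu_hat)
```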

Let us emphasize that [41] already introduced the so-called statistical dimension (corresponding to d_{n,0} in our notation). It appeared that the statistical dimension provides an upper bound on the minimax-optimal dimension of randomized projections for kernel ridge regression (see [41, Theorem 2, Corollary 1]). In our case, d_{n,\alpha} can be seen as an \alpha-smoothed version of the statistical dimension.

The purpose of the following result is to give more insight into the meaning of Eq. (34) with regard to the minimax risk.

Theorem 4.1 (Lower bound from Theorem 1 in [41]).

For any regular kernel class, meaning that for any k=1,\ldots,n, \widehat{\mu}_{k+1}^{-1}\sum_{i=k+1}^{n}\widehat{\mu}_{i}\lesssim k, and any estimator \widetilde{f} of f^{*}\in\mathbb{B}_{\mathcal{H}}(R) satisfying the nonparametric model defined in Eq. (1), we get

\underset{\lVert f^{*}\rVert_{\mathcal{H}}\leq R}{\sup}\mathbb{E}_{\varepsilon}\lVert\widetilde{f}-f^{*}\rVert_{n}^{2}\geq c_{l}R^{2}\widehat{\epsilon}_{n}^{2},

for some numeric constant c_{l}>0.

First, in [41], the regularity assumption was formulated as \sum_{i=d_{n,0}+1}^{n}\widehat{\mu}_{i}\lesssim d_{n,0}\widehat{\epsilon}_{n}^{2}, which directly stems from the assumption in Theorem 4.1. Let us remark that the same assumption (as in Theorem 4.1) has already been made by [18, Assumption 6]. Second, Theorem 4.1 applies to any kernel, as long as the condition on the tail of the eigenvalues is fulfilled, which is in particular true for the reproducing kernels from Section 3.2. Thus, the fastest rate achievable by an estimator of f^{*} is \widehat{\epsilon}_{n}^{2}.

A key property for the smoothing to yield optimal results is that the value of \alpha has to be large enough to control the tail sum of the smoothed eigenvalues by the corresponding cumulative sum, which is the purpose of the assumption below.

Assumption 4.

There exists \Upsilon=[\alpha_{0},1],\ \alpha_{0}\geq 0, such that for all \alpha\in\Upsilon and k\in\{1,\ldots,n\},

\sum_{i=k+1}^{+\infty}\mu_{i}^{2\alpha}\leq\mathcal{M}\sum_{i=1}^{k}\mu_{i}^{2\alpha}, (36)

where \mathcal{M}\geq 1 denotes a numeric constant.

We enumerate several classical examples for which this assumption holds.

Example 1 (\beta-polynomial eigenvalue decay kernels).

Let us assume that the eigenvalues of the kernel operator satisfy, for some numeric constants 0<c\leq C,

ci^{-\beta}\leq\mu_{i}\leq Ci^{-\beta},\ \ i=1,2,\ldots. (37)

For the polynomial eigenvalue-decay kernels, Assumption 4 holds with

\mathcal{M}=2^{2\beta-1}\left(\frac{C}{c}\right)^{2}\quad\textnormal{and}\quad 1\geq\alpha\geq\frac{1}{\beta+1}=\alpha_{0}. (38)

Example 2 (\gamma-exponential eigenvalue-decay kernels).

Let us assume that the eigenvalues of the kernel operator satisfy, for some numeric constants 0<c\leq C and a constant \gamma>0,

ce^{-i^{\gamma}}\leq\mu_{i}\leq Ce^{-i^{\gamma}},\quad i=1,2,\ldots.

Instances of kernels within this class include the Gaussian kernel with respect to the Lebesgue measure on the real line (with \gamma=2) or on a compact domain (with \gamma=1) (up to a \log factor in the exponent, see [38, Example 13.21]). Then, Assumption 4 holds with

\mathcal{M}=\Big{(}\frac{C}{c}\Big{)}^{2}\frac{\int_{0}^{\infty}e^{-y^{\gamma}}dy}{\int_{2^{-1/\gamma}}^{2/(2\alpha_{0})^{1/\gamma}}e^{-y^{\gamma}}dy}\quad\textnormal{and}\quad\alpha\in[\alpha_{0},1],\quad\mbox{for any}\quad\alpha_{0}\in(0,1).
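Assumption 4 is also easy to probe numerically for a given spectrum: the sketch below (ours; truncating the infinite tail to a finite number of terms is an approximation) estimates the smallest constant \mathcal{M} satisfying Eq. (36) for a \beta-polynomial spectrum (37) and a few values of \alpha\geq(\beta+1)^{-1}.

```python
import numpy as np

def assumption_4_constant(mu, alpha):
    """Smallest M with sum_{i>k} mu_i^(2 alpha) <= M * sum_{i<=k} mu_i^(2 alpha)
    for all k, cf. Eq. (36) (the infinite tail is truncated to len(mu) terms)."""
    w = mu ** (2.0 * alpha)
    head = np.cumsum(w)
    tail = w.sum() - head
    return float(np.max(tail[:-1] / head[:-1]))

beta = 2.0
mu = 1.0 / np.arange(1, 100_001) ** beta       # beta-polynomial decay, Eq. (37)
for alpha in (1.0 / (beta + 1.0), 0.45, 1.0):
    print(alpha, assumption_4_constant(mu, alpha))   # finite, moderate constants
```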

For any regular kernel class satisfying the above assumption, the next theorem provides a high-probability bound on the performance of f^{\tau_{\alpha}} (measured in terms of the L_{2}(\mathbb{P}_{n})-norm), which depends on the smoothed empirical critical radius.

Theorem 4.2 (Upper bound on the empirical norm).

Under Assumptions 1, 2, 3, and 4, for any regular kernel and \alpha\leq\frac{1}{2}, the stopping time (33) satisfies

\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}\leq c_{u}R^{2}\widehat{\epsilon}_{n,\alpha}^{2} (39)

with probability at least 1-c\exp\Big{[}-c_{1}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\Big{]} for some positive constants c_{1} and c_{u}, where c_{1} depends only on \mathcal{M}, while c_{u} and c are numeric. Moreover,

\mathbb{E}_{\varepsilon}\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}\leq CR^{2}\widehat{\epsilon}_{n,\alpha}^{2}+20\max\{\sigma^{2},R^{2}\}\exp\left[-c_{3}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right], (40)

where the constant C is numeric and the constant c_{3} depends only on \mathcal{M}.

The complete proof of Theorem 4.2 is given in Appendix D. The main message is that the final performance of the estimator f^{\tau_{\alpha}} is controlled by the smoothed critical radius \widehat{\epsilon}_{n,\alpha}^{2}. From the existing literature on the empirical critical radius [28, 29, 38, 41], it is already known that the non-smoothed version \widehat{\epsilon}_{n}^{2} is the typical quantity that leads to minimax rates in the RKHS (see also Theorem 4.1). The behavior of \widehat{\epsilon}_{n,\alpha}^{2} with respect to n is likely to depend on \alpha, as emphasized by the notation. Intuitively, this suggests that there could exist a range of values of \alpha for which \widehat{\epsilon}_{n,\alpha}^{2} is of the same order as (or faster than) \widehat{\epsilon}_{n}^{2}, leading therefore to optimal rates.

Another striking aspect of Ineq. (40) is the additional term involving the exponential function. Since (39) is a ”high probability” statement, this term is expected to converge to 0 at a rate depending on n\widehat{\epsilon}_{n,\alpha}^{2}. Therefore, the final convergence rate, as well as whether this term is negligible, depends on \alpha.

As a consequence of Theorem 4.1, as long as there exist values of \alpha such that \widehat{\epsilon}_{n,\alpha}^{2} is at most as large as \widehat{\epsilon}_{n}^{2}, the estimator f^{\tau_{\alpha}} is optimal.

4.4 Consequences for \beta-polynomial eigenvalue-decay kernels

The leading idea of the present section is to identify values of \alpha for which the bound (39) from Theorem 4.2 scales as R^{2}\widehat{\epsilon}_{n}^{2}.

Let us recall the definition of a polynomial decay kernel from (37):

ciβμiCiβ,i=1,2,, for β>1 and numeric constants c,C>0.ci^{-\beta}\leq\mu_{i}\leq Ci^{-\beta},\ i=1,2,\ldots,\ \ \textnormal{ for }\beta>1\textnormal{ and numeric constants }c,C>0.

One typical example of a reproducing kernel satisfying this condition is the Sobolev kernel on [0,1]\times[0,1] given by \mathbb{K}(x,x^{\prime})=\min\{x,x^{\prime}\}, with \beta=2 [29]. The corresponding RKHS is the first-order Sobolev class, that is, the class of functions that are almost everywhere differentiable with derivative in L_{2}[0,1].

Lemma 4.3.

For any β\beta-polynomial eigenvalue decay kernel, there exist numeric constants c1,c2>0c_{1},c_{2}>0 such that for α<1/β\alpha<1/\beta, one has

c_{1}\widehat{\epsilon}_{n}^{2}\leq\widehat{\epsilon}_{n,\alpha}^{2}\leq c_{2}\widehat{\epsilon}_{n}^{2}\asymp\left(\frac{\sigma^{2}}{2R^{2}n}\right)^{\frac{\beta}{\beta+1}}.

The proof of Lemma 4.3 is deferred to Lemma A.2 in Appendix A and is not reproduced here. Therefore, if \alpha\beta<1, then \widehat{\epsilon}_{n,\alpha}^{2}\asymp\widehat{\epsilon}_{n}^{2}\asymp\left(\frac{\sigma^{2}}{2R^{2}n}\right)^{\frac{\beta}{\beta+1}}. Let us now recall from (38) that Assumption 4 holds for \alpha\geq(\beta+1)^{-1}. All these arguments lead to the next result, which establishes the minimax optimality of \tau_{\alpha} for any kernel satisfying the \beta-polynomial eigenvalue-decay assumption, as long as \alpha\in\left[\frac{1}{\beta+1},\min\left\{\frac{1}{\beta},\frac{1}{2}\right\}\right).

Corollary 4.4.

Under Assumptions 1, 2, 3, and the \beta-polynomial eigenvalue decay (37), for any \alpha\in\left[\frac{1}{\beta+1},\min\left\{\frac{1}{\beta},\frac{1}{2}\right\}\right), the early stopping rule \tau_{\alpha} satisfies

\mathbb{E}_{\varepsilon}\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}\asymp\inf_{\widehat{f}}\ \sup_{\lVert f^{*}\rVert_{\mathcal{H}}\leq R}\mathbb{E}_{\varepsilon}\lVert\widehat{f}-f^{*}\rVert_{n}^{2}, (41)

where the infimum is taken over all measurable functions of the input data.

Corollary 4.4 establishes an optimality result in the fixed-design framework: as long as (\beta+1)^{-1}\leq\alpha<\min\left\{\beta^{-1},\frac{1}{2}\right\}, the upper bound matches the lower bound up to multiplicative constants. Moreover, this property holds uniformly with respect to \beta>1, provided the value of \alpha is chosen appropriately. An interesting feature of this bound is that the optimal value of \alpha only depends on the (polynomial) decay rate of the empirical eigenvalues of the normalized Gram matrix. This suggests that any effective estimator of the unknown parameter \beta could be plugged into the above (fixed-design) result and would lead to an optimal rate. Note that [33] has emphasized a similar trade-off for the smoothing parameter \alpha (polynomial smoothing) when considering the spectral cut-off estimator in the Gaussian sequence model. Regarding convergence rates, Corollary 4.4 combined with Lemma 4.3 shows that the expected risk converges at the rate \mathcal{O}\left(n^{-\frac{\beta}{\beta+1}}\right). This matches the known rate for nonparametric regression in the random design framework [29, 34], which is minimax-optimal as long as f^{*} belongs to the RKHS \mathcal{H}.
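To illustrate the scaling claimed by Lemma 4.3 and exploited in Corollary 4.4, the following numerical sketch (our own illustration, not part of the experiments of Section 5) computes the critical radii directly from a polynomially decaying spectrum. It assumes the closed-form expression of the smoothed localized complexity appearing in the proofs (Lemma A.2 and Appendix D), namely \widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})=R\sqrt{n^{-1}\sum_{j}\widehat{\mu}_{j}^{\alpha}\min\{\widehat{\mu}_{j},\epsilon^{2}\}}, and takes \widehat{\epsilon}_{n,\alpha} as the smallest grid point satisfying the smoothed critical inequality \widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})\leq\frac{2R^{2}}{\sigma}\epsilon^{2+\alpha}.

```python
import numpy as np

def smoothed_critical_radius(mu_hat, sigma, R=1.0, alpha=0.0, grid_size=2000):
    """Smallest epsilon on a grid satisfying the smoothed critical inequality
    R_hat_{n,alpha}(eps) <= (2 R^2 / sigma) * eps^(2+alpha), assuming the closed form
    R_hat_{n,alpha}(eps)^2 = (R^2 / n) * sum_j mu_hat_j^alpha * min(mu_hat_j, eps^2)."""
    n = len(mu_hat)
    for eps in np.logspace(-4, 0, grid_size):
        complexity = R * np.sqrt(np.sum(mu_hat**alpha * np.minimum(mu_hat, eps**2)) / n)
        if complexity <= (2.0 * R**2 / sigma) * eps ** (2.0 + alpha):
            return eps                      # fixed point: the inequality holds from here on
    return 1.0

# Polynomial eigenvalue decay mu_hat_i = i^{-beta} with beta = 2 (Sobolev-like spectrum).
beta, sigma = 2.0, 0.15
for n in [100, 400, 1600]:
    mu_hat = np.arange(1, n + 1, dtype=float) ** (-beta)
    eps2_plain = smoothed_critical_radius(mu_hat, sigma, alpha=0.0) ** 2
    eps2_smooth = smoothed_critical_radius(mu_hat, sigma, alpha=1.0 / (beta + 1)) ** 2
    rate = (sigma**2 / (2.0 * n)) ** (beta / (beta + 1))   # reference rate from Lemma 4.3 (R = 1)
    print(n, eps2_plain, eps2_smooth, rate)
```

Both squared radii should decrease at the same n^{-\beta/(\beta+1)} rate, in line with Lemma 4.3.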

5 Empirical comparison with existing stopping rules

The present section illustrates the practical behavior of the stopping rules discussed along the paper and compares them with existing alternatives.

5.1 Stopping rules involved

The empirical comparison is carried out between the stopping rules \tau (20) and \tau_{\alpha} (33) with \alpha\in\left[\frac{1}{\beta+1},\min\left\{\frac{1}{\beta},\frac{1}{2}\right\}\right), and four alternative stopping rules that are briefly described in what follows. For the sake of comparison, most of them correspond to early stopping rules already considered in [29].
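Although the precise definitions of \tau (20) and \tau_{\alpha} (33) are given earlier in the paper, the minimal sketch below indicates how such a smoothed discrepancy-based rule can be evaluated in practice. Everything in it is an explicit assumption of the sketch rather than a verbatim transcription of (33): K is the normalized Gram matrix with eigendecomposition K=U\,\textnormal{diag}(\widehat{\mu})\,U^{\top}, the iterates are spectral-filter estimators with gradient-descent coefficients \gamma_{i}^{(t)}=1-(1-\eta\widehat{\mu}_{i})^{t} (consistent with Lemma A.1), the smoothed empirical risk is computed in the rotated basis Z=U^{\top}Y as R_{\alpha,t}=\frac{1}{n}\sum_{i}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(t)})^{2}Z_{i}^{2} (the form used in Appendix D), and the rule stops at the first t with R_{\alpha,t}\leq\sigma^{2}\textnormal{tr}(K^{\alpha})/n; the rule \tau corresponds to \alpha=0.

```python
import numpy as np

def mdp_stopping_time(K, Y, sigma2, eta, alpha=0.0, t_max=5000):
    """Sketch of a smoothed minimum-discrepancy stopping time: stop as soon as the
    smoothed empirical risk drops below sigma2 * tr(K^alpha) / n."""
    n = K.shape[0]
    mu_hat, U = np.linalg.eigh(K)
    mu_hat, U = mu_hat[::-1].clip(min=0.0), U[:, ::-1]     # decreasing eigenvalues
    Z = U.T @ Y                                            # rotated observations
    threshold = sigma2 * np.sum(mu_hat**alpha) / n
    for t in range(1, t_max + 1):
        gamma = 1.0 - (1.0 - eta * mu_hat) ** t            # gradient-descent filter coefficients
        R_alpha_t = np.sum(mu_hat**alpha * (1.0 - gamma) ** 2 * Z**2) / n
        if R_alpha_t <= threshold:
            return t
    return t_max
```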

Hold-out stopping rule

We consider a procedure based on the hold-out idea [3]. The data \{(x_{i},y_{i})\}_{i=1}^{n} are split into two parts: the training sample S_{\textnormal{train}}=(x_{\textnormal{train}},y_{\textnormal{train}}) and the test sample S_{\textnormal{test}}=(x_{\textnormal{test}},y_{\textnormal{test}}), each containing half of the whole dataset. We train the learning algorithm for t=0,1,\ldots and estimate the risk at each t by R_{\textnormal{ho}}(f^{t})=\frac{1}{n}\sum_{i\in S_{\textnormal{test}}}((\widehat{y}_{\textnormal{test}})_{i}-y_{i})^{2}, where (\widehat{y}_{\textnormal{test}})_{i} denotes the output of the algorithm trained for t iterations on S_{\textnormal{train}} and evaluated at the point x_{i} of the test sample. The final stopping rule is defined as

\widehat{\textnormal{T}}_{\textnormal{HO}}=\textnormal{argmin}\Big\{t\in\mathbb{N}\ |\ R_{\textnormal{ho}}(f^{t+1})>R_{\textnormal{ho}}(f^{t})\Big\}-1. (42)

Although it does not use all the data for training (loss of information), the hold-out strategy has been proved to output minimax-optimal estimators in various contexts (see, for instance, [15, 16] with Sobolev spaces and \beta\leq 2).
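For concreteness, a minimal sketch of (42) is given below; the callable train_and_predict is a hypothetical stand-in for the learning algorithm run for t iterations on the training half and evaluated on the test half, and the 1/n normalization of R_{\textnormal{ho}} is replaced by a mean over the test set, since a constant rescaling does not affect the location of the first increase.

```python
import numpy as np

def first_increase_rule(risk_curve):
    """Common to (42) and (43): argmin{t : risk(t+1) > risk(t)} - 1, floored at 0."""
    for t in range(len(risk_curve) - 1):
        if risk_curve[t + 1] > risk_curve[t]:
            return max(t - 1, 0)
    return len(risk_curve) - 1            # the curve never increased within the horizon

def holdout_stopping_time(train_and_predict, x, y, t_max, rng):
    """Hold-out rule (42) on a random half/half split of the data."""
    n = len(y)
    perm = rng.permutation(n)
    tr, te = perm[: n // 2], perm[n // 2:]
    risks = [np.mean((train_and_predict(x[tr], y[tr], x[te], t) - y[te]) ** 2)
             for t in range(t_max + 1)]
    return first_increase_rule(risks)
```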

V-fold stopping rule

The observations \{(x_{i},y_{i})\}_{i=1}^{n} are randomly split into V=4 equal-sized blocks. At each of the V rounds, V-1 blocks are devoted to the training sample S_{\textnormal{train}}=(x_{\textnormal{train}},y_{\textnormal{train}}), and the remaining one serves as the test sample S_{\textnormal{test}}=(x_{\textnormal{test}},y_{\textnormal{test}}). At each iteration t=1,\ldots, the risk is estimated by R_{\textnormal{VFCV}}(f^{t})=\frac{1}{V-1}\sum_{j=1}^{V-1}\frac{1}{n/V}\sum_{i\in S_{\textnormal{test}}(j)}((\widehat{y}_{\textnormal{test}})_{i}-y_{i})^{2}, where \widehat{y}_{\textnormal{test}} is defined as for the hold-out stopping rule. The final stopping time is

\widehat{\textnormal{T}}_{\textnormal{VFCV}}=\textnormal{argmin}\big\{t\in\mathbb{N}\ |\ R_{\textnormal{VFCV}}(f^{t+1})>R_{\textnormal{VFCV}}(f^{t})\big\}-1. (43)

V-fold cross-validation is widely used in practice since, on the one hand, it is more computationally tractable than other splitting-based methods such as leave-one-out or leave-p-out (see the survey [3]), and, on the other hand, it enjoys a better statistical performance than the hold-out (lower variability).
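A corresponding sketch of (43) follows, reusing first_increase_rule from the previous block; it aggregates the test errors over all V folds, which is a standard V-fold average (constant factors in the normalization do not change the location of the first increase).

```python
import numpy as np

def vfold_stopping_time(train_and_predict, x, y, t_max, rng, V=4):
    """V-fold rule (43): average the per-fold test errors at each iteration t and
    apply the same first-increase rule as for the hold-out."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), V)
    risks = []
    for t in range(t_max + 1):
        fold_errors = []
        for j in range(V):
            te = folds[j]
            tr = np.concatenate([folds[k] for k in range(V) if k != j])
            pred = train_and_predict(x[tr], y[tr], x[te], t)
            fold_errors.append(np.mean((pred - y[te]) ** 2))
        risks.append(np.mean(fold_errors))
    return first_increase_rule(risks)
```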

Raskutti-Wainwright-Yu stopping rule (from [29])

The use of this stopping rule heavily relies on the assumption that \lVert f^{*}\rVert_{\mathcal{H}}^{2} is known, which is a strong requirement in practice. It controls the bias-variance trade-off by using upper bounds on the bias and variance terms, the latter involving the localized empirical Rademacher complexity \widehat{\mathcal{R}}_{n}\left(\frac{1}{\sqrt{\eta t}},\mathcal{H}\right). It stops as soon as the (upper bound on the) bias term becomes smaller than the (upper bound on the) variance term, which leads to

\widehat{\textnormal{T}}_{\textnormal{RWY}}=\textnormal{argmin}\Big\{t\in\mathbb{N}\ |\ \widehat{\mathcal{R}}_{n}\Big(\frac{1}{\sqrt{\eta t}},\mathcal{H}\Big)>(2e\sigma\eta t)^{-1}\Big\}-1. (44)
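A minimal sketch of (44) is given below; it assumes the unit-ball closed form of the localized empirical kernel complexity used in [29], \widehat{\mathcal{R}}_{n}(\epsilon,\mathcal{H})=\sqrt{n^{-1}\sum_{i}\min\{\widehat{\mu}_{i},\epsilon^{2}\}}, so it illustrates the mechanism rather than providing a definitive implementation.

```python
import numpy as np

def rwy_stopping_time(mu_hat, sigma, eta, t_max=5000):
    """Rule (44): stop one step before the localized kernel complexity at scale
    1/sqrt(eta*t) exceeds (2*e*sigma*eta*t)^{-1}."""
    n = len(mu_hat)
    for t in range(1, t_max + 1):
        eps2 = 1.0 / (eta * t)
        complexity = np.sqrt(np.sum(np.minimum(mu_hat, eps2)) / n)
        if complexity > 1.0 / (2.0 * np.e * sigma * eta * t):
            return max(t - 1, 1)      # the "- 1" of (44), floored at one iteration
    return t_max
```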

Theoretical minimum discrepancy-based stopping rule t^{*}

The fourth stopping rule is the one introduced in (19). It relies on the minimum discrepancy principle and involves the (theoretical) expected empirical risk 𝔼εRt\mathbb{E}_{\varepsilon}R_{t}:

t^{*}=\inf\left\{t\in\mathbb{N}\ |\ \mathbb{E}_{\varepsilon}R_{t}\leq\sigma^{2}\right\}.

This stopping time is introduced for comparison purposes only since it cannot be computed in practice. This rule is proved to be optimal (see Appendix C) for any bounded reproducing kernel, so it could serve as a reference in the present empirical comparison.

Oracle stopping rule

The "oracle" stopping rule is defined as the first time the risk curve starts to increase:

t_{\textnormal{or}}=\textnormal{argmin}\big\{t\in\mathbb{N}\ |\ \mathbb{E}_{\varepsilon}\lVert f^{t+1}-f^{*}\rVert_{n}^{2}>\mathbb{E}_{\varepsilon}\lVert f^{t}-f^{*}\rVert_{n}^{2}\big\}-1. (45)

In situations where the risk has a unique global minimum, this rule coincides with the location of that minimum. Its formulation reflects the realistic constraint that we do not have access to the whole risk curve (unlike in the classical model selection setup).

5.2 Simulation design

Artificial data are generated according to the regression model y_{j}=f^{*}(x_{j})+\varepsilon_{j}, j=1,\ldots,n, where the \varepsilon_{j} are i.i.d. \mathcal{N}(0,\sigma^{2}) with \sigma=0.15 and the design points are equidistant, x_{j}=j/n, j=1,\ldots,n. The same experiments have also been carried out with uniform x_{i}\sim\mathbb{U}[0,1] (not reported here) without any change in the conclusions. The sample size n varies from 40 to 400.

The gradient descent algorithm (9) has been used with the step-size \eta=(1.2\,\widehat{\mu}_{1})^{-1} and initialization F^{0}=[0,\ldots,0]^{\top}.

The present comparison involves two regression functions with the same L_{2}(\mathbb{P}_{n})-norm of the signal, \lVert f^{*}\rVert_{n}\approx 0.28: (i) a piecewise linear function called "smooth", f^{*}(x)=|x-1/2|-1/2, and (ii) a "sinus" function, f^{*}(x)=0.4\,\sin(4\pi x).

To ease the comparison, the piecewise linear regression function was set up as in [29, Figure 3].

The case of finite-rank kernels is addressed in Section 5.3.1 with the so-called polynomial kernel of degree 3 defined by \mathbb{K}(x_{1},x_{2})=(1+x_{1}^{\top}x_{2})^{3} on the unit square [0,1]\times[0,1]. By contrast, Section 5.3.2 tackles polynomial decay kernels with the first-order Sobolev kernel \mathbb{K}(x_{1},x_{2})=\min\{x_{1},x_{2}\} on the same domain.

The performance of the early stopping rules is measured in terms of the squared L_{2}(\mathbb{P}_{n})-norm \lVert f^{t}-f^{*}\rVert_{n}^{2}, averaged over N=100 independent trials.

For our simulations, we use a variance estimation method that is described in Section 5.4. This method is asymptotically unbiased, which is sufficient for our purposes.
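The sketch below reproduces this simulation design in a spectral-filter form: the normalized Gram matrix K_{n}=K/n is diagonalized once, and the gradient-descent iterates are represented by the filter coefficients \gamma_{i}^{(t)}=1-(1-\eta\widehat{\mu}_{i})^{t} with \eta=1/(1.2\,\widehat{\mu}_{1}). This representation is consistent with Lemma A.1 but is an assumption of the sketch, not a verbatim implementation of iteration (9).

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 0.15
x = np.arange(1, n + 1) / n                       # equidistant design x_j = j/n
f_star = np.abs(x - 0.5) - 0.5                    # "smooth" piecewise linear signal
y = f_star + sigma * rng.standard_normal(n)

# Normalized Gram matrices of the two kernels used in Section 5.
K_poly = (1.0 + np.outer(x, x)) ** 3 / n          # finite-rank polynomial kernel (Section 5.3.1)
K_sob = np.minimum.outer(x, x) / n                # first-order Sobolev kernel (Section 5.3.2)

def gd_error_curve(K, y, f_star, t_max=500):
    """L2(P_n) error curve t -> ||f^t - f*||_n^2 of kernel gradient descent written
    as a spectral filter: the fitted values at iteration t are U diag(gamma^(t)) U^T y."""
    mu_hat, U = np.linalg.eigh(K)
    mu_hat, U = mu_hat[::-1].clip(min=0.0), U[:, ::-1]
    eta = 1.0 / (1.2 * mu_hat[0])
    Z = U.T @ y
    errors = []
    for t in range(t_max + 1):
        gamma = 1.0 - (1.0 - eta * mu_hat) ** t
        f_t = U @ (gamma * Z)
        errors.append(np.mean((f_t - f_star) ** 2))
    return np.array(errors)

print(gd_error_curve(K_sob, y, f_star)[:5])       # first few values of the error curve
```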

5.3 Results of the simulation experiments

5.3.1 Finite-rank kernels

Figure 3: Kernel gradient descent with the step-size \eta=1/(1.2\widehat{\mu}_{1}) and the polynomial kernel \mathbb{K}(x_{1},x_{2})=(1+x_{1}^{\top}x_{2})^{3}, x_{1},x_{2}\in[0,1], for the estimation of two noisy regression functions: the smooth f^{*}(x)=|x-1/2|-1/2 in panel (a), and the "sinus" f^{*}(x)=0.4\sin(4\pi x) in panel (b), with equidistant covariates x_{j}=j/n. Each curve shows the squared L_{2}(\mathbb{P}_{n})-norm error of the stopping rules (45), (19), (44), (43), and (20), averaged over 100 independent trials, versus the sample size n\in\{40,80,120,200,320,400\}.

Figure 3 displays the (averaged) L_{2}(\mathbb{P}_{n})-norm error of the oracle stopping rule (45), our stopping rule \tau (20), t^{*} (19), the minimax-optimal stopping rule \widehat{\textnormal{T}}_{\textnormal{RWY}} (44), and the 4-fold cross-validation stopping time \widehat{\textnormal{T}}_{\textnormal{VFCV}} (43) versus the sample size. Figure 3(a) shows the results for the piecewise linear regression function, whereas Figure 3(b) corresponds to the "sinus" regression function.

All the curves decrease as n grows. From these graphs, the overall worst performance is achieved by \widehat{\textnormal{T}}_{\textnormal{VFCV}}, especially for small sample sizes, which can be attributed to the additional randomness induced by the preliminary random splitting of 4-FCV. By contrast, the minimum discrepancy-based stopping rules (\tau and t^{*}) exhibit the best performance compared with \widehat{\textnormal{T}}_{\textnormal{VFCV}} and \widehat{\textnormal{T}}_{\textnormal{RWY}}. The averaged mean-squared error of \tau gets closer to that of t^{*} as the sample size n increases, which was expected from the theory and also intuitively, since \tau was introduced as an estimator of t^{*}. From Figure 3(a), \widehat{\textnormal{T}}_{\textnormal{RWY}} is less accurate for small sample sizes but improves considerably as n grows, eventually achieving a performance similar to that of \tau. This can result from the fact that \widehat{\textnormal{T}}_{\textnormal{RWY}} is built from upper bounds on the bias and variance terms, which are likely to be looser for small sample sizes but achieve an optimal convergence rate as n increases. In Figure 3(b), the reason why \tau exhibits (strongly) better results than \widehat{\textnormal{T}}_{\textnormal{RWY}} is the main assumption on the regression function, namely that \lVert f^{*}\rVert_{\mathcal{H}}\leq 1, which could be violated for the "sinus" function.

5.3.2 Polynomial eigenvalue decay kernels

Figure 4 displays the resulting (averaged over 100 repetitions) L_{2}(\mathbb{P}_{n})-error of \tau_{\alpha} (33) with \alpha=(\beta+1)^{-1}=0.33, \widehat{\textnormal{T}}_{\textnormal{RWY}} (44), t^{*} (19), and \widehat{\textnormal{T}}_{\textnormal{HO}} (42) versus the sample size.

Figure 4: Kernel gradient descent (9) with the step-size \eta=1/(1.2\widehat{\mu}_{1}) and the Sobolev kernel \mathbb{K}(x_{1},x_{2})=\min\{x_{1},x_{2}\}, x_{1},x_{2}\in[0,1], for the estimation of two noisy regression functions: the smooth f^{*}(x)=|x-1/2|-1/2 in panel (a) and the "sinus" f^{*}(x)=0.4\sin(4\pi x) in panel (b), with equidistant covariates x_{j}=j/n. Each curve shows the squared L_{2}(\mathbb{P}_{n})-norm error of the stopping times (45), (19), (44), (42), and (33) with \alpha=0.33, averaged over 100 independent trials, versus the sample size n\in\{40,80,120,200,320,400\}.

Figure 4(a) shows that all stopping rules work comparably well, with a slight advantage for \widehat{\textnormal{T}}_{\textnormal{HO}} and \widehat{\textnormal{T}}_{\textnormal{RWY}} over t^{*} and \tau_{\alpha}. However, as n grows to 400, the performances of all stopping rules become very close to each other. Let us emphasize that the true value of \beta is not known in these experiments; the value (\beta+1)^{-1}=0.33 has been estimated from the decay of the empirical eigenvalues of the normalized Gram matrix. This can explain why the performance of \tau_{\alpha} remains worse than that of \widehat{\textnormal{T}}_{\textnormal{RWY}}.

The story described by Figure 4(b) is somewhat different. The first striking observation is that \widehat{\textnormal{T}}_{\textnormal{RWY}} completely fails on this example, which again stems from the (unsatisfied) constraint on the \mathcal{H}-norm of f^{*}. The best performance is still achieved by the hold-out stopping rule, although \tau_{\alpha} and t^{*} remain very close to it. The fact that t^{*} remains close to the oracle stopping rule (without any need for smoothing) supports the idea that the minimum discrepancy principle is a reliable way of designing an effective stopping rule. The deficiency of \tau (by contrast with \tau_{\alpha}) results from the variability of the empirical risk, which does not remain close enough to its expectation. This bad behavior is balanced by introducing the polynomial smoothing at level \alpha within the definition of \tau_{\alpha}, which enjoys a close-to-optimal practical performance.

Let us also mention that \widehat{\textnormal{T}}_{\textnormal{HO}} exhibits some variability, in particular for small sample sizes, as illustrated by Figures 4(a) and 4(b).

The overall conclusion is that the smoothed minimum discrepancy-based stopping time \tau_{\alpha} leads to almost optimal performance provided \alpha=(\beta+1)^{-1}, where \beta quantifies the polynomial decay of the empirical eigenvalues \{\widehat{\mu}_{i}\}_{i=1}^{n}.

5.4 Estimation of variance and decay rate for polynomial eigenvalue decay kernels

The purpose of the present section is to describe two strategies for estimating (i) the decay rate of the empirical eigenvalues of the normalized Gram matrix, and (ii) the variance parameter \sigma^{2}.

5.4.1 Polynomial decay parameter estimation

From the empirical version of the polynomial decay assumption (37), one can easily derive upper and lower bounds for \beta as \frac{\log(\widehat{\mu}_{i}/\widehat{\mu}_{i+1})-\log(C/c)}{\log(1+1/i)}\leq\beta\leq\frac{\log(\widehat{\mu}_{i}/\widehat{\mu}_{i+1})+\log(C/c)}{\log(1+1/i)}. The gap between these bounds equals \frac{2\log(C/c)}{\log(1+1/i)}, which is minimized at i=1. The best precision on the estimated value of \beta is therefore reached with i=1, which yields the estimator \widehat{\beta}=\frac{\log(\widehat{\mu}_{1}/\widehat{\mu}_{2})}{\log 2}.
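In code, this estimator and the resulting smoothing level used in Section 5.3.2 read as follows, where mu_hat contains the eigenvalues of the normalized Gram matrix sorted in decreasing order.

```python
import numpy as np

def estimate_beta_and_alpha(mu_hat):
    """Decay-rate estimator of Section 5.4.1 and the corresponding smoothing level:
    beta_hat = log(mu_hat_1 / mu_hat_2) / log(2) and alpha = 1 / (beta_hat + 1)."""
    beta_hat = np.log(mu_hat[0] / mu_hat[1]) / np.log(2.0)
    return beta_hat, 1.0 / (beta_hat + 1.0)

# Example: an exactly i^{-2} spectrum gives beta_hat = 2 and alpha = 1/3.
print(estimate_beta_and_alpha(np.arange(1.0, 101.0) ** (-2.0)))
```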

5.4.2 Variance parameter estimation

There are numerous suggestions for variance estimation with linear smoothers; see, e.g., Section 5.6 in the book [39]. In our simulation experiments, two cases are distinguished: the situation where the reproducing kernel has finite rank r, and the situation where \textnormal{rk}(T_{\mathbb{K}})=\infty. In both cases, an asymptotically unbiased estimator of \sigma^{2} is designed.

Finite-rank kernel.

With such a finite-rank kernel, the estimation of the noise is made from the coordinates \{Z_{i}\}_{i=r+1}^{n}, which correspond to the situation where G_{i}^{*}=0, i>r (see Section 4.1.1 in [29]). These coordinates (which are pure noise) are exploited to build an easy-to-compute estimator of \sigma^{2}, namely

\widehat{\sigma}^{2}=\frac{\sum_{i=r+1}^{n}Z_{i}^{2}}{n-r}. (46)
Infinite-rank kernel.

If \textnormal{rk}(T_{\mathbb{K}})=\infty, we suggest using the following result.

Lemma 5.1.

For any regular kernel (see Theorem 4.1), any value of t satisfying \eta t\cdot\widehat{\epsilon}_{n}^{2}\to+\infty as n\to+\infty yields that \widehat{\sigma}^{2}=\frac{R_{t}}{\frac{1}{n}\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}} is an asymptotically unbiased estimator of \sigma^{2}.

A sketch of the proof of Lemma 5.1 is given in Appendix H. Based on this lemma, we suggest taking t=T, where T is the maximum number of iterations allowed by the computational constraints. Notice that, as long as closed-form expressions of the estimator are available, there is no need to compute the estimators for all t between 1 and T. The final estimator of \sigma^{2} used in the experiments of Section 5.3 is given by

\widehat{\sigma}^{2}=\frac{R_{T}}{\frac{1}{n}\sum_{i=1}^{n}(1-\gamma_{i}^{(T)})^{2}}. (47)
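Minimal sketches of the two estimators (46) and (47) are given below, written in the same rotated basis as the earlier sketches: Z=U^{\top}Y with the eigenvalues of the normalized Gram matrix in decreasing order, gradient-descent filter coefficients \gamma_{i}^{(T)}=1-(1-\eta\widehat{\mu}_{i})^{T} (an assumption of the sketch), and R_{T}=\frac{1}{n}\sum_{i}(1-\gamma_{i}^{(T)})^{2}Z_{i}^{2}, consistently with (55).

```python
import numpy as np

def sigma2_finite_rank(Z, r):
    """Estimator (46): average the squared rotated observations that are pure noise
    when the kernel has finite rank r (the last n - r coordinates carry no signal)."""
    n = len(Z)
    return np.sum(Z[r:] ** 2) / (n - r)

def sigma2_infinite_rank(Z, mu_hat, eta, T):
    """Estimator (47): empirical risk at a late iteration T divided by
    (1/n) * sum_i (1 - gamma_i^(T))^2."""
    n = len(Z)
    gamma_T = 1.0 - (1.0 - eta * mu_hat) ** T
    R_T = np.sum((1.0 - gamma_T) ** 2 * Z**2) / n
    return R_T / (np.sum((1.0 - gamma_T) ** 2) / n)
```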

6 Conclusion

In this paper, we describe spectral filter estimators (e.g., gradient descent, kernel ridge regression) for nonparametric estimation of the regression function in an RKHS. Two new data-driven early stopping rules, \tau (20) and \tau_{\alpha} (33), are designed for these iterative algorithms. In more detail, we show that for infinite-rank reproducing kernels, \tau has a high variance due to the variability of the empirical risk around its expectation, and we propose a way to reduce this variability by smoothing the empirical L_{2}(\mathbb{P}_{n})-norm (and, as a consequence, the empirical risk) by means of the eigenvalues of the normalized kernel matrix. We demonstrate in Corollaries 3.4 and 4.4 that our stopping times \tau and \tau_{\alpha} yield minimax-optimal rates, in particular for finite-rank kernel classes and Sobolev spaces. It is worth emphasizing that computing the stopping times requires only an estimate of the variance \sigma^{2} and the eigenvalues (\widehat{\mu}_{1},\ldots,\widehat{\mu}_{n}). The theoretical results are confirmed empirically: \tau and \tau_{\alpha} with the smoothing parameter \alpha=(\beta+1)^{-1}, where \beta is the polynomial decay rate of the eigenvalues of the normalized Gram matrix, perform favorably in comparison with stopping rules based on hold-out data and 4-fold cross-validation.

There are various open questions that could be tackled following our results. A deficiency of our strategy is that the construction of \tau and \tau_{\alpha} relies on the assumption that the regression function belongs to a known RKHS, which (mildly) restricts its smoothness. We would like to understand how our results extend to other loss functions besides the squared loss (for example, in the classification framework), as was done in [40]. Another research direction is to combine early stopping with fast kernel approximation techniques [30] in order to avoid computing all the eigenvalues of the normalized Gram matrix, which can be prohibitive for large-scale problems.

Appendix A Useful results

In this section, we present several auxiliary lemmas that are repeatedly used in the paper.

Lemma A.1.

[29, ηt=ηt\eta_{t}=\eta t in Lemma 8 and ν=ηt\nu=\eta t in Lemma 13] For any bounded kernel, with γi(t)\gamma_{i}^{(t)} corresponding to gradient descent or kernel ridge regression, for every t0t\geq 0,

12min{1,ηtμ^i}γi(t)min{1,ηtμ^i},i=1,,n.\displaystyle\frac{1}{2}\min\{1,\eta t\widehat{\mu}_{i}\}\leq\gamma_{i}^{(t)}\leq\min\{1,\eta t\widehat{\mu}_{i}\},\ \ i=1,\ldots,n. (48)

The following result shows the magnitude of the smoothed critical radius for polynomial eigenvalue decay kernels.

Lemma A.2.

Assume that \widehat{\mu}_{i}\leq Ci^{-\beta},\ i=1,2,\ldots,n. Then, for \alpha\beta<1, one has

ϵ^n,α2[Cα1αβ+C1+αβ(1+α)1]2ββ+1[σ22R2n]ββ+1.\widehat{\epsilon}_{n,\alpha}^{2}\asymp\left[\sqrt{\frac{C^{\alpha}}{1-\alpha\beta}}+\sqrt{\frac{C^{1+\alpha}}{\beta(1+\alpha)-1}}\right]^{\frac{2\beta}{\beta+1}}\left[\frac{\sigma^{2}}{2R^{2}n}\right]^{\frac{\beta}{\beta+1}}.
Proof of Lemma A.2.

For every M(ϵ)(0,n]M(\epsilon)\in(0,n] and αβ<1\alpha\beta<1, we have

^n,α(ϵ,)\displaystyle\widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H}) R1nj=1nmin{Cjβ,ϵ2}Cαjβα\displaystyle\leq R\sqrt{\frac{1}{n}}\sqrt{\sum_{j=1}^{n}\min\{Cj^{-\beta},\epsilon^{2}\}C^{\alpha}j^{-\beta\alpha}}
RCαnj=1M(ϵ)jβαϵ+RC1+αnj=M(ϵ)njββα\displaystyle\leq R\sqrt{\frac{C^{\alpha}}{n}}\sqrt{\sum_{j=1}^{\left\lfloor M(\epsilon)\right\rfloor}j^{-\beta\alpha}}\epsilon+R\sqrt{\frac{C^{1+\alpha}}{n}}\sqrt{\sum_{j=\left\lceil M(\epsilon)\right\rceil}^{n}j^{-\beta-\beta\alpha}}
RCα1αβM(ϵ)1αβnϵ+RC1+αn1β(1+α)11M(ϵ)β(1+α)1\displaystyle\leq R\sqrt{\frac{C^{\alpha}}{1-\alpha\beta}\frac{M(\epsilon)^{1-\alpha\beta}}{n}}\epsilon+R\sqrt{\frac{C^{1+\alpha}}{n}}\sqrt{\frac{1}{\beta(1+\alpha)-1}\frac{1}{M(\epsilon)^{\beta(1+\alpha)-1}}}

Set M(\epsilon)=\epsilon^{-2/\beta}, which implies \sqrt{M(\epsilon)^{1-\alpha\beta}}\epsilon=\epsilon^{1-\frac{1-\alpha\beta}{\beta}}, and

^n,α(ϵ,)R[Cα1αβ+C1+αβ(1+α)1]ϵ11αββ1n.\widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})\leq R\left[\sqrt{\frac{C^{\alpha}}{1-\alpha\beta}}+\sqrt{\frac{C^{1+\alpha}}{\beta(1+\alpha)-1}}\right]\epsilon^{1-\frac{1-\alpha\beta}{\beta}}\frac{1}{\sqrt{n}}.

Therefore, the smoothed critical inequality ^n,α(ϵ,)2R2σϵ2+α\widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})\leq\frac{2R^{2}}{\sigma}\epsilon^{2+\alpha} is satisfied for

ϵ^n,α2=c~[Cα1αβ+C1+αβ(1+α)1]2ββ+1[σ22R2n]ββ+1.\widehat{\epsilon}_{n,\alpha}^{2}=\widetilde{c}\left[\sqrt{\frac{C^{\alpha}}{1-\alpha\beta}}+\sqrt{\frac{C^{1+\alpha}}{\beta(1+\alpha)-1}}\right]^{\frac{2\beta}{\beta+1}}\left[\frac{\sigma^{2}}{2R^{2}n}\right]^{\frac{\beta}{\beta+1}}. (49)

Notice that M(ϵ^n,α)(R2σ2)1β+1n1β+1(R2σ2)1β+1nM(\widehat{\epsilon}_{n,\alpha})\asymp\left(\frac{R^{2}}{\sigma^{2}}\right)^{\frac{1}{\beta+1}}n^{\frac{1}{\beta+1}}\lesssim\left(\frac{R^{2}}{\sigma^{2}}\right)^{\frac{1}{\beta+1}}n. Besides that, due to Lemma G.1, one can choose a positive constant c~\widetilde{c} in Eq. (49) such that M(ϵ^n,α)nM(\widehat{\epsilon}_{n,\alpha})\leq n. ∎

For the next two lemmas define the positive self-adjoint trace-class covariance operator

Σ𝔼X[𝕂(,X)𝕂(,X)],\Sigma\coloneqq\mathbb{E}_{X}\left[\mathbb{K}(\cdot,X)\otimes\mathbb{K}(\cdot,X)\right],

where \otimes is the Kronecker product between two elements in \mathcal{H} such that (ab)u=ab,u(a\otimes b)u=a\langle b,u\rangle_{\mathcal{H}}, for every uu\in\mathcal{H}. We know that Σ\Sigma and T𝕂T_{\mathbb{K}} have the same eigenvalues {μj}j=1\{\mu_{j}\}_{j=1}^{\infty}. Moreover, we introduce the smoothed empirical covariance operator as

Σ^n,α1nj=1nμ^j2α𝕂(,xj)𝕂(,xj).\widehat{\Sigma}_{n,\alpha}\coloneqq\frac{1}{n}\sum_{j=1}^{n}\widehat{\mu}_{j}^{2\alpha}\mathbb{K}(\cdot,x_{j})\otimes\mathbb{K}(\cdot,x_{j}). (50)
Lemma A.3.

For each a>0a>0, any 1kn1\leq k\leq n, α[0,1/2]\alpha\in[0,1/2], and θ>1\theta>1, one has

X(j=1kμj2α>θθ1j=1kμ^j2α+a(1+3θ)θ3(θ1)n)2exp(a)\mathbb{P}_{X}\left(\sum_{j=1}^{k}\mu_{j}^{2\alpha}>\frac{\theta}{\theta-1}\sum_{j=1}^{k}\widehat{\mu}_{j}^{2\alpha}+\frac{a\left(1+3\theta\right)\theta}{3(\theta-1)n}\right)\leq 2\exp(-a)
Proof.

Let Πk\Pi_{k} be the orthogonal projection from \mathcal{H} onto the span of the eigenfunctions (ϕj:j=1,,k)(\phi_{j}:j=1,\ldots,k). Then by the variational characterization of partial traces, one has j=1kμj2α=tr(ΠkΣ2α)\sum_{j=1}^{k}\mu_{j}^{2\alpha}=\textnormal{tr}\left(\Pi_{k}\Sigma^{2\alpha}\right) and j=1kμ^j2αtr(ΠkΣ^n,α)\sum_{j=1}^{k}\widehat{\mu}_{j}^{2\alpha}\geq\textnormal{tr}\left(\Pi_{k}\widehat{\Sigma}_{n,\alpha}\right). One concludes that

j=1kμj2αj=1kμ^j2αtr(Πk(Σ2αΣ^n,α)).\sum_{j=1}^{k}\mu_{j}^{2\alpha}-\sum_{j=1}^{k}\widehat{\mu}_{j}^{2\alpha}\leq\textnormal{tr}\left(\Pi_{k}\left(\Sigma^{2\alpha}-\widehat{\Sigma}_{n,\alpha}\right)\right).

By reproducing property and Mercer’s theorem, Πk𝕂(,X)2=i=1kμiϕi2(X)\lVert\Pi_{k}\mathbb{K}\left(\cdot,X\right)\rVert_{\mathcal{H}}^{2}=\sum_{i=1}^{k}\mu_{i}\phi_{i}^{2}(X), and

j=1kμj2αj=1kμ^j2α\displaystyle\sum_{j=1}^{k}\mu_{j}^{2\alpha}-\sum_{j=1}^{k}\widehat{\mu}_{j}^{2\alpha} 𝔼XΠkΣα12𝕂(,X)21nj=1nμ^j2αΠk𝕂(,xj)2\displaystyle\leq\mathbb{E}_{X}\lVert\Pi_{k}\Sigma^{\alpha-\frac{1}{2}}\mathbb{K}\left(\cdot,X\right)\rVert_{\mathcal{H}}^{2}-\frac{1}{n}\sum_{j=1}^{n}\widehat{\mu}_{j}^{2\alpha}\lVert\Pi_{k}\mathbb{K}\left(\cdot,x_{j}\right)\rVert_{\mathcal{H}}^{2}
𝔼XΠkΣα12𝕂(,X)21nj=1nμ^j2αΠk𝕂(,xj)2.\displaystyle\leq\mid\mathbb{E}_{X}\lVert\Pi_{k}\Sigma^{\alpha-\frac{1}{2}}\mathbb{K}\left(\cdot,X\right)\rVert_{\mathcal{H}}^{2}-\frac{1}{n}\sum_{j=1}^{n}\widehat{\mu}_{j}^{2\alpha}\lVert\Pi_{k}\mathbb{K}\left(\cdot,x_{j}\right)\rVert_{\mathcal{H}}^{2}\mid.

Since μ^j2αΠk𝕂(,xj)21\widehat{\mu}_{j}^{2\alpha}\lVert\Pi_{k}\mathbb{K}\left(\cdot,x_{j}\right)\rVert_{\mathcal{H}}^{2}\leq 1, one has 𝔼X[μ^j4αΠk𝕂(,xj)4]i=1kμi\mathbb{E}_{X}\left[\widehat{\mu}_{j}^{4\alpha}\lVert\Pi_{k}\mathbb{K}\left(\cdot,x_{j}\right)\rVert_{\mathcal{H}}^{4}\right]\leq\sum_{i=1}^{k}\mu_{i}, and by Bernstein’s inequality, for any a>0a>0,

X(j=1kμj2α>j=1kμ^j2α+2a(j=1kμj)n+a3n)2exp(a).\mathbb{P}_{X}\left(\sum_{j=1}^{k}\mu_{j}^{2\alpha}>\sum_{j=1}^{k}\widehat{\mu}_{j}^{2\alpha}+\sqrt{\frac{2a\left(\sum_{j=1}^{k}\mu_{j}\right)}{n}}+\frac{a}{3n}\right)\leq 2\exp(-a).

Then, by using j=1kμjj=1kμj2α\sum_{j=1}^{k}\mu_{j}\leq\sum_{j=1}^{k}\mu_{j}^{2\alpha} when α[0,1/2]\alpha\in[0,1/2], and 2xyθx+yθ\sqrt{2xy}\leq\theta x+\frac{y}{\theta} for any θ>0\theta>0, one gets

X((11θ)j=1kμj2α>j=1kμ^j2α+a(1+3θ)3n)2exp(a),\mathbb{P}_{X}\left(\left(1-\frac{1}{\theta}\right)\sum_{j=1}^{k}\mu_{j}^{2\alpha}>\sum_{j=1}^{k}\widehat{\mu}_{j}^{2\alpha}+\frac{a\left(1+3\theta\right)}{3n}\right)\leq 2\exp(-a),

for any a>0a>0. ∎

Lemma A.4.

For each a>0a>0, any 0kn0\leq k\leq n, α[0,1/2]\alpha\in[0,1/2], and θ>1\theta>1, one has

X(j>kμ^j2α>θ+1θj>kμj2α+a(1+3θ)3n)exp(a).\mathbb{P}_{X}\left(\sum_{j>k}\widehat{\mu}_{j}^{2\alpha}>\frac{\theta+1}{\theta}\sum_{j>k}\mu_{j}^{2\alpha}+\frac{a\left(1+3\theta\right)}{3n}\right)\leq\exp(-a).
Proof.

The proof of [18, Lemma 33] could be easily generalized to the smoothed version by using the proof of Lemma A.3. Let Πk\Pi_{k} be the orthogonal projection from \mathcal{H} onto the span of the population eigenfunctions (ϕj:j>k)\left(\phi_{j}:j>k\right). Then by the variational characterization of partial traces, one has j>kμj2α=tr(ΠkΣ2α)\sum_{j>k}\mu_{j}^{2\alpha}=\textnormal{tr}\left(\Pi_{k}\Sigma^{2\alpha}\right) and j>kμ^j2αtr(ΠkΣ^n,α)\sum_{j>k}\widehat{\mu}_{j}^{2\alpha}\leq\textnormal{tr}\left(\Pi_{k}\widehat{\Sigma}_{n,\alpha}\right). One concludes that

j>kμ^j2αj>kμj2αtr(Πk(Σ^n,αΣ2α)).\sum_{j>k}\widehat{\mu}_{j}^{2\alpha}-\sum_{j>k}\mu_{j}^{2\alpha}\leq\textnormal{tr}\left(\Pi_{k}\left(\widehat{\Sigma}_{n,\alpha}-\Sigma^{2\alpha}\right)\right).

Hence,

j>kμ^j2αj>kμj2α1nj=1nμ^j2αΠk𝕂(,xj)2𝔼XΠkΣα1/2𝕂(,X)2.\sum_{j>k}\widehat{\mu}_{j}^{2\alpha}-\sum_{j>k}\mu_{j}^{2\alpha}\leq\frac{1}{n}\sum_{j=1}^{n}\widehat{\mu}_{j}^{2\alpha}\lVert\Pi_{k}\mathbb{K}\left(\cdot,x_{j}\right)\rVert_{\mathcal{H}}^{2}-\mathbb{E}_{X}\lVert\Pi_{k}\Sigma^{\alpha-1/2}\mathbb{K}\left(\cdot,X\right)\rVert_{\mathcal{H}}^{2}.

Since μ^j2αΠk𝕂(,xj)21\widehat{\mu}_{j}^{2\alpha}\lVert\Pi_{k}\mathbb{K}(\cdot,x_{j})\rVert_{\mathcal{H}}^{2}\leq 1 and by using the reproducing property and Mercer’s theorem, Πk𝕂(,X)2=j>kμjϕj2(X)\lVert\Pi_{k}\mathbb{K}\left(\cdot,X\right)\rVert_{\mathcal{H}}^{2}=\sum_{j>k}\mu_{j}\phi_{j}^{2}(X), one has

𝔼X[μ^j4αΠk𝕂(,xj)4]j>kμj.\mathbb{E}_{X}\left[\widehat{\mu}_{j}^{4\alpha}\lVert\Pi_{k}\mathbb{K}\left(\cdot,x_{j}\right)\rVert_{\mathcal{H}}^{4}\right]\leq\sum_{j>k}\mu_{j}.

Bernstein’s inequality yields that for any a>0a>0,

X(j>kμ^j2α>j>kμj2α+2a(j>kμj)n+a3n)exp(a).\mathbb{P}_{X}\left(\sum_{j>k}\widehat{\mu}_{j}^{2\alpha}>\sum_{j>k}\mu_{j}^{2\alpha}+\sqrt{\frac{2a\left(\sum_{j>k}\mu_{j}\right)}{n}}+\frac{a}{3n}\right)\leq\exp(-a).

Using the inequalities j>kμjj>kμj2α\sum_{j>k}\mu_{j}\leq\sum_{j>k}\mu_{j}^{2\alpha}, when α[0,1/2]\alpha\in[0,1/2], and

2a(j>kμj)n1θj>kμj+aθn,\sqrt{\frac{2a\left(\sum_{j>k}\mu_{j}\right)}{n}}\leq\frac{1}{\theta}\sum_{j>k}\mu_{j}+\frac{a\theta}{n},

one gets

X(j>kμ^j2α>(1+1θ)j>kμj2α+a(1+3θ)3n)exp(a),\mathbb{P}_{X}\left(\sum_{j>k}\widehat{\mu}_{j}^{2\alpha}>\left(1+\frac{1}{\theta}\right)\sum_{j>k}\mu_{j}^{2\alpha}+\frac{a\left(1+3\theta\right)}{3n}\right)\leq\exp(-a),

for any a>0a>0 and θ>1\theta>1. ∎

Corollary A.5.

Assumption 4, Lemma A.3, and Lemma A.4 imply that for any 1kn1\leq k\leq n, a>0a>0, θ>1\theta>1, and α[α0,1/2]\alpha\in[\alpha_{0},1/2],

j=k+1nμ^j2α(θ+1)θ1j=1kμ^j2α+a(1+3θ)3n((θ+1)θ1+1)\sum_{j=k+1}^{n}\widehat{\mu}_{j}^{2\alpha}\leq\frac{(\theta+1)\mathcal{M}}{\theta-1}\sum_{j=1}^{k}\widehat{\mu}_{j}^{2\alpha}+\frac{a(1+3\theta)}{3n}\left(\frac{\mathcal{M}(\theta+1)}{\theta-1}+1\right) (51)

with probability (over {xi}i=1n\{x_{i}\}_{i=1}^{n}) at least 13exp(a)1-3\exp(-a).

Appendix B Handling the smoothed bias and variance

Lemma B.1.

Under Assumptions 1, 2,

Bα2(t)R2(ηt)1+α,α[0,1].B_{\alpha}^{2}(t)\leq\frac{R^{2}}{(\eta t)^{1+\alpha}},\quad\alpha\in[0,1]. (52)
Proof of Lemma B.1.

Proof of [29, Lemma 7] can be easily generalized to obtain the result. ∎

Here, we recall one concentration result from [29, Section 4.1.2]. For any t>0t>0 and δ>0\delta>0, one has V(t)=𝔼ε[v(t)]V(t)=\mathbb{E}_{\varepsilon}\left[v(t)\right], and

ε(|v(t)V(t)|δ)2exp[cnδσ2min{1,R2δσ2ηt^n2(1ηt,)}].\mathbb{P}_{\varepsilon}\Big{(}|v(t)-V(t)|\geq\delta\Big{)}\leq 2\ \exp\left[-\frac{cn\delta}{\sigma^{2}}\min\left\{1,\frac{R^{2}\delta}{\sigma^{2}\eta t\widehat{\mathcal{R}}_{n}^{2}(\frac{1}{\sqrt{\eta t}},\mathcal{H})}\right\}\right]. (53)

Appendix C Auxiliary lemma for finite-rank kernels

Let us first transfer the critical inequality (16) from ϵ\epsilon to tt.

Definition C.1.

Set ϵ=1ηt\epsilon=\frac{1}{\sqrt{\eta t}} in (16), and let us define t^ϵ\widehat{t}_{\epsilon} as the largest positive solution to the following fixed-point equation

σ2ηtR2^n2(1ηt,)4R2ηt.\frac{\sigma^{2}\eta t}{R^{2}}\widehat{\mathcal{R}}_{n}^{2}\left(\frac{1}{\sqrt{\eta t}},\mathcal{H}\right)\leq\frac{4R^{2}}{\eta t}. (54)

Note that the empirical critical radius ϵ^n=1ηt^ϵ\widehat{\epsilon}_{n}=\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon}}}, and such a point t^ϵ\widehat{t}_{\epsilon} exists since ϵ^n\widehat{\epsilon}_{n} exists and is unique [27, 5, 29]. Moreover, t^ϵ\widehat{t}_{\epsilon} provides the equality in Ineq. (54).

Remark that at t=t^{*} one has B^{2}(t)=\frac{2\sigma^{2}}{n}\sum_{i=1}^{r}\gamma_{i}^{(t)}-V(t)\geq\frac{\sigma^{2}}{n}\sum_{i=1}^{r}\gamma_{i}^{(t)}. Thus, due to the construction of \widehat{t}_{\epsilon} (\widehat{t}_{\epsilon} is the point of intersection of an upper bound on the bias and a lower bound on \frac{\sigma^{2}}{2n}\sum_{i=1}^{r}\gamma_{i}^{(t)}) and the monotonicity (in t) of all the terms involved, we get t^{*}\leq\widehat{t}_{\epsilon}.

Lemma C.2.

Recall the definition of the stopping rule tt^{*} (19). Under Assumptions 1, 2, and 3, the following holds for any reproducing kernel:

𝔼εftfn28R2ϵ^n2.\mathbb{E}_{\varepsilon}\lVert f^{t^{*}}-f^{*}\rVert_{n}^{2}\leq 8R^{2}\widehat{\epsilon}_{n}^{2}.
Proof of Lemma C.2.

Let us define a proxy version of the variance term: V~(t)σ2ni=1rγi(t)\widetilde{V}(t)\coloneqq\frac{\sigma^{2}}{n}\sum_{i=1}^{r}\gamma_{i}^{(t)}. Moreover, for all t>0t>0,

𝔼εRt=B2(t)+σ2ni=1n(1γi(t))2.\mathbb{E}_{\varepsilon}R_{t}=B^{2}(t)+\frac{\sigma^{2}}{n}\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}. (55)

From the fact that 𝔼εRt=σ2\mathbb{E}_{\varepsilon}R_{t^{*}}=\sigma^{2}, 𝔼εftfn2=B2(t)+V(t)=2V~(t)\mathbb{E}_{\varepsilon}\lVert f^{t^{*}}-f^{*}\rVert_{n}^{2}=B^{2}(t^{*})+V(t^{*})=2\widetilde{V}(t^{*}).

Therefore, in order to prove the lemma, our goal is to get an upper bound on V~(t)\widetilde{V}(t^{*}). Since the function ηt^n2(1ηt,)\eta t\widehat{\mathcal{R}}_{n}^{2}(\frac{1}{\sqrt{\eta t}},\mathcal{H}) is monotonic in tt (see, for example, Lemma G.1), and tt^ϵt^{*}\leq\widehat{t}_{\epsilon}, we conclude that

V~(t)σ2ηtR2^n2(1ηt,)σ2ηt^ϵR2^n2(1ηt^ϵ,)=4R2ϵ^n2.\widetilde{V}(t^{*})\leq\frac{\sigma^{2}\eta t^{*}}{R^{2}}\widehat{\mathcal{R}}_{n}^{2}\left(\frac{1}{\sqrt{\eta t^{*}}},\mathcal{H}\right)\leq\frac{\sigma^{2}\eta\widehat{t}_{\epsilon}}{R^{2}}\widehat{\mathcal{R}}_{n}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon}}},\mathcal{H}\right)=4R^{2}\widehat{\epsilon}_{n}^{2}.

Appendix D Proofs for polynomial smoothing (fixed design)

In the proofs, we will need three additional definitions below.

Definition D.1.

In Definition 15, set ϵ=1ηt\epsilon=\frac{1}{\sqrt{\eta t}}, then for any α[0,1]\alpha\in[0,1], the smoothed critical inequality (15) is equivalent to

σ2ηt4^n,α2(1ηt,)R4(ηt)1+α.\frac{\sigma^{2}\eta t}{4}\widehat{\mathcal{R}}_{n,\alpha}^{2}\Big{(}\frac{1}{\sqrt{\eta t}},\mathcal{H}\Big{)}\leq\frac{R^{4}}{(\eta t)^{1+\alpha}}. (56)

Due to Lemma G.1, the left-hand side of (56) is non-decreasing in tt, and the right-hand side is non-increasing in tt.

Definition D.2.

For any α[0,1]\alpha\in[0,1], define the stopping rule t^ϵ,α\widehat{t}_{\epsilon,\alpha} such that

ϵ^n,α2=1ηt^ϵ,α,\widehat{\epsilon}_{n,\alpha}^{2}=\frac{1}{\eta\widehat{t}_{\epsilon,\alpha}}, (57)

then Ineq. (56) becomes the equality at t=t^ϵ,αt=\widehat{t}_{\epsilon,\alpha} thanks to the monotonicity and continuity of both terms in the inequality.

Further, we define the stopping time t~ϵ,α\widetilde{t}_{\epsilon,\alpha} and t¯ϵ,α\overline{t}_{\epsilon,\alpha}, a lower bound and an upper bound on tαinf{t>0|𝔼εRα,tσ2ni=1nμ^iα}t_{\alpha}^{*}\coloneqq\inf\left\{t>0\ |\ \mathbb{E}_{\varepsilon}R_{\alpha,t}\leq\frac{\sigma^{2}}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}\right\}, α[0,1]\forall\alpha\in[0,1].

Definition D.3.

Define the smoothed proxy variance V~α(t)σ2ni=1nμ^iαγi(t)\widetilde{V}_{\alpha}(t)\coloneqq\frac{\sigma^{2}}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}\gamma_{i}^{(t)} and the following stopping times

t¯ϵ,α=inf{t>0|Bα2(t)=12V~α(t)},t~ϵ,α=inf{t>0|Bα2(t)=3V~α(t)}.\displaystyle\begin{split}\overline{t}_{\epsilon,\alpha}&=\inf\big{\{}t>0\ |\ B_{\alpha}^{2}(t)=\frac{1}{2}\widetilde{V}_{\alpha}(t)\big{\}},\\ \widetilde{t}_{\epsilon,\alpha}&=\inf\big{\{}t>0\ |\ B_{\alpha}^{2}(t)=3\widetilde{V}_{\alpha}(t)\big{\}}.\end{split} (58)

Notice that at t=t~ϵ,αt=\widetilde{t}_{\epsilon,\alpha}:

6R2(ηt)1+αR2(ηt)1+αBα2(t)=3V~α(t)32σ2R2ηt^n,α2(1ηt,).\frac{6R^{2}}{(\eta t)^{1+\alpha}}\geq\frac{R^{2}}{(\eta t)^{1+\alpha}}\geq B_{\alpha}^{2}(t)=3\widetilde{V}_{\alpha}(t)\geq\frac{3}{2}\frac{\sigma^{2}}{R^{2}}\eta t\widehat{\mathcal{R}}_{n,\alpha}^{2}\Big{(}\frac{1}{\sqrt{\eta t}},\mathcal{H}\Big{)}.

At t=t¯ϵ,αt=\overline{t}_{\epsilon,\alpha}:

R2(ηt)1+αBα2(t)=12V~α(t)σ2ηt4R2^n,α2(1ηt,).\frac{R^{2}}{(\eta t)^{1+\alpha}}\geq B_{\alpha}^{2}(t)=\frac{1}{2}\widetilde{V}_{\alpha}(t)\geq\frac{\sigma^{2}\eta t}{4R^{2}}\widehat{\mathcal{R}}_{n,\alpha}^{2}\Big{(}\frac{1}{\sqrt{\eta t}},\mathcal{H}\Big{)}.

Thus, t~ϵ,α\widetilde{t}_{\epsilon,\alpha} and t¯ϵ,α\overline{t}_{\epsilon,\alpha} satisfy the smoothed critical inequality (56). Moreover, t^ϵ,α\widehat{t}_{\epsilon,\alpha} is always greater than or equal to t¯ϵ,α\overline{t}_{\epsilon,\alpha} and t~ϵ,α\widetilde{t}_{\epsilon,\alpha} since t^ϵ,α\widehat{t}_{\epsilon,\alpha} is the largest value satisfying Ineq. (56). As a consequence of Lemma G.1 and continuity of (56) in tt, one has 1ηt~ϵ,α1ηt¯ϵ,α1ηt^ϵ,α=ϵ^n,α2\frac{1}{\eta\widetilde{t}_{\epsilon,\alpha}}\asymp\frac{1}{\eta\overline{t}_{\epsilon,\alpha}}\asymp\frac{1}{\eta\widehat{t}_{\epsilon,\alpha}}=\widehat{\epsilon}_{n,\alpha}^{2}. We assume for simplicity that

ϵ¯n,α2\displaystyle\overline{\epsilon}_{n,\alpha}^{2} 1ηt¯ϵ,α=c1ηt^ϵ,α=cϵ^n,α2,\displaystyle\coloneqq\frac{1}{\eta\overline{t}_{\epsilon,\alpha}}=c^{\prime}\frac{1}{\eta\widehat{t}_{\epsilon,\alpha}}=c^{\prime}\widehat{\epsilon}_{n,\alpha}^{2},
ϵ~n,α2\displaystyle\widetilde{\epsilon}_{n,\alpha}^{2} 1ηt~ϵ,α=c′′1ηt^ϵ,α=c′′ϵ^n,α2\displaystyle\coloneqq\frac{1}{\eta\widetilde{t}_{\epsilon,\alpha}}=c^{\prime\prime}\frac{1}{\eta\widehat{t}_{\epsilon,\alpha}}=c^{\prime\prime}\widehat{\epsilon}_{n,\alpha}^{2}

for some positive numeric constants c,c′′1c^{\prime},c^{\prime\prime}\geq 1, that do not depend on nn, due to the fact that t^ϵ,αt¯ϵ,α\widehat{t}_{\epsilon,\alpha}\geq\overline{t}_{\epsilon,\alpha} and t^ϵ,αt~ϵ,α\widehat{t}_{\epsilon,\alpha}\geq\widetilde{t}_{\epsilon,\alpha}.

The following lemma decomposes the risk error into several parts that will be further analyzed in subsequent Lemmas D.7, D.8.

Lemma D.4.

Recall the definition of τα\tau_{\alpha} (33), then

fταfn22B2(τα)+2v(τα),\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}\leq 2B^{2}(\tau_{\alpha})+2v(\tau_{\alpha}),

where v(t)=1ni=1n(γi(t))2εi2,t>0v(t)=\frac{1}{n}\sum_{i=1}^{n}(\gamma_{i}^{(t)})^{2}\varepsilon_{i}^{2},\ t>0, is the stochastic part of the variance.

Proof of Lemma D.4.

Let us define the noise vector ε[ε1,,εn]\varepsilon\coloneqq[\varepsilon_{1},...,\varepsilon_{n}]^{\top} and, for each t>0t>0, two vectors that correspond to the bias and variance parts:

b~2(t)(gt(Kn)KnIn)F,v~(t)gt(Kn)Knε.\tilde{b}^{2}(t)\coloneqq(g_{t}(K_{n})K_{n}-I_{n})F^{*},\ \ \ \ \tilde{v}(t)\coloneqq g_{t}(K_{n})K_{n}\varepsilon. (59)

It gives the following expressions for the stochastic part of the variance and bias:

v(t)=v~(t),v~(t)n,B2(t)=b~2(t),b~2(t)n.v(t)=\langle\tilde{v}(t),\tilde{v}(t)\rangle_{n},\ \ B^{2}(t)=\langle\tilde{b}^{2}(t),\tilde{b}^{2}(t)\rangle_{n}. (60)

General expression for the L2(n)L_{2}(\mathbb{P}_{n})-norm error at τα\tau_{\alpha} takes the form

fταfn2=B2(τα)+v(τα)+2b~2(τα),v~(τα)n.\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}=B^{2}(\tau_{\alpha})+v(\tau_{\alpha})+2\langle\tilde{b}^{2}(\tau_{\alpha}),\tilde{v}(\tau_{\alpha})\rangle_{n}. (61)

Therefore, applying the inequality 2|x,yn|xn2+yn22\left|\langle x,y\rangle_{n}\right|\leq\lVert x\rVert_{n}^{2}+\lVert y\rVert_{n}^{2} for any x,ynx,y\in\mathbb{R}^{n}, and (60), we obtain

fταfn22B2(τα)+2v(τα).\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}\leq 2B^{2}(\tau_{\alpha})+2v(\tau_{\alpha}). (62)

D.1 Two deviation inequalities for τα\tau_{\alpha}

This is the first deviation inequality for τα\tau_{\alpha} that will be used in Lemma D.7 to control the variance term.

Lemma D.5.

Recall Definition D.3 of t¯ϵ,α\overline{t}_{\epsilon,\alpha}, then under Assumptions 1, 2, 3, and 4,

ε(τα>t¯ϵ,α)5exp[c1R2σ2nϵ^n,α2(1+α)],\mathbb{P}_{\varepsilon}\left(\tau_{\alpha}>\overline{t}_{\epsilon,\alpha}\right)\leq 5\exp\left[-c_{1}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right],

where a positive constant c1c_{1} depends only on \mathcal{M}.

Proof of Lemma D.5.

Set κασ2trKnα/n\kappa_{\alpha}\coloneqq\sigma^{2}\textnormal{tr}K_{n}^{\alpha}/n, then due to the monotonicity of the smoothed empirical risk, for all ttαt\geq t_{\alpha}^{*},

ε(τα>t)=ε(Rα,t𝔼εRα,t>κα𝔼εRα,t).\mathbb{P}_{\varepsilon}\left(\tau_{\alpha}>t\right)=\mathbb{P}_{\varepsilon}\left(R_{\alpha,t}-\mathbb{E}_{\varepsilon}R_{\alpha,t}>\kappa_{\alpha}-\mathbb{E}_{\varepsilon}R_{\alpha,t}\right).

Consider

Rα,t𝔼εRα,t=σ2ni=1nμ^iα(1γi(t))2(εi2σ21)Σ1+2ni=1nμ^iα(1γi(t))2GiεiΣ2.R_{\alpha,t}-\mathbb{E}_{\varepsilon}R_{\alpha,t}=\underbrace{\frac{\sigma^{2}}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(t)})^{2}\left(\frac{\varepsilon_{i}^{2}}{\sigma^{2}}-1\right)}_{\Sigma_{1}}+\underbrace{\frac{2}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(t)})^{2}G_{i}^{*}\varepsilon_{i}}_{\Sigma_{2}}. (63)

Define

Δt,ακα𝔼εRα,t=Bα2(t)Vα(t)+2V~α(t),\Delta_{t,\alpha}\coloneqq\kappa_{\alpha}-\mathbb{E}_{\varepsilon}R_{\alpha,t}=-B_{\alpha}^{2}(t)-V_{\alpha}(t)+2\widetilde{V}_{\alpha}(t),

where V~α(t)=σ2ni=1nμ^iαγi(t)\widetilde{V}_{\alpha}(t)=\frac{\sigma^{2}}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}\gamma_{i}^{(t)}.

Further, set t=t¯ϵ,αt=\overline{t}_{\epsilon,\alpha}, and recall that ηt¯ϵ,α=ηt^ϵ,αc\eta\overline{t}_{\epsilon,\alpha}=\frac{\eta\widehat{t}_{\epsilon,\alpha}}{c^{\prime}} for c1c^{\prime}\geq 1. This implies

Δt¯ϵ,α,α12V~α(t¯ϵ,α)\displaystyle\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}\geq\frac{1}{2}\widetilde{V}_{\alpha}(\overline{t}_{\epsilon,\alpha}) σ24ni=1nμ^iαmin{1,ηt^ϵ,αcμ^i}\displaystyle\geq\frac{\sigma^{2}}{4n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}\min\left\{1,\frac{\eta\widehat{t}_{\epsilon,\alpha}}{c^{\prime}}\widehat{\mu}_{i}\right\}
=σ2ηt^ϵ,α4nci=1nμ^iαmin{cηt^ϵ,α,μ^i}\displaystyle=\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{4nc^{\prime}}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}\min\left\{\frac{c^{\prime}}{\eta\widehat{t}_{\epsilon,\alpha}},\widehat{\mu}_{i}\right\}
σ2ηt^ϵ,α4cR2^n,α2(1ηt^ϵ,α,)\displaystyle\geq\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{4c^{\prime}R^{2}}\widehat{\mathcal{R}}_{n,\alpha}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)
=R2cϵ^n,α2(1+α).\displaystyle=\frac{R^{2}}{c^{\prime}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}.

Then for the event AA from Corollary A.5, by standard concentration results on linear and quadratic sums of Gaussian random variables (see, e.g., [25, Lemma 1]),

ε(Σ1>Δt¯ϵ,α,α2A)\displaystyle\mathbb{P}_{\varepsilon}\left(\Sigma_{1}>\frac{\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}}{2}\mid A\right) exp[Δt¯ϵ,α,α216(a(t¯ϵ,α)2+Δt¯ϵ,α,α2a(t¯ϵ,α))],\displaystyle\leq\exp\left[-\frac{\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}^{2}}{16(\lVert a(\overline{t}_{\epsilon,\alpha})\rVert^{2}+\frac{\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}}{2}\lVert a(\overline{t}_{\epsilon,\alpha})\rVert_{\infty})}\right], (64)
ε(Σ2>Δt¯ϵ,α,α2)\displaystyle\mathbb{P}_{\varepsilon}\left(\Sigma_{2}>\frac{\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}}{2}\right) exp[nΔt¯ϵ,α,α232σ2Bα2(t¯ϵ,α)],\displaystyle\leq\exp\left[-\frac{n\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}^{2}}{32\sigma^{2}B_{\alpha}^{2}(\overline{t}_{\epsilon,\alpha})}\right], (65)

where ai(t¯ϵ,α)=σ2nμ^iα(1γi(t¯ϵ,α))2,i[n]a_{i}(\overline{t}_{\epsilon,\alpha})=\frac{\sigma^{2}}{n}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(\overline{t}_{\epsilon,\alpha})})^{2},\ i\in[n].

In what follows, we simplify the bounds above.

Firstly, recall that B=1B=1, which implies μ^11\widehat{\mu}_{1}\leq 1, and a(t¯ϵ,α)σ2n\lVert a(\overline{t}_{\epsilon,\alpha})\rVert_{\infty}\leq\frac{\sigma^{2}}{n}, and

12Δt¯ϵ,α,α34V~α(t¯ϵ,α)34V~α(t^ϵ,α)\displaystyle\frac{1}{2}\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}\leq\frac{3}{4}\widetilde{V}_{\alpha}(\overline{t}_{\epsilon,\alpha})\leq\frac{3}{4}\widetilde{V}_{\alpha}(\widehat{t}_{\epsilon,\alpha}) 34R2σ2ηt^ϵ,α^n,α2(1ηt^ϵ,α,)\displaystyle\leq\frac{3}{4R^{2}}\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}\widehat{\mathcal{R}}_{n,\alpha}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)
=3R2ϵ^n,α2(1+α).\displaystyle=3R^{2}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}.

Secondly, we will upper bound the Euclidean norm of a(t¯ϵ,α)a(\overline{t}_{\epsilon,\alpha}). Recall Corollary A.5 with a=R2σ2nϵ^n,α2(1+α)a=\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)} and θ=2\theta=2, the definition of the smoothed statistical dimension dn,α=min{j[n]:μ^jϵ^n,α2}d_{n,\alpha}=\min\{j\in[n]:\widehat{\mu}_{j}\leq\widehat{\epsilon}_{n,\alpha}^{2}\}, and Ineq. (35): ϵ^n,α2(1+α)σ2i=1dn,αμ^iα4R2n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\geq\frac{\sigma^{2}\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha}}{4R^{2}n}, which implies

a(t¯ϵ,α)2\displaystyle\lVert a(\overline{t}_{\epsilon,\alpha})\rVert^{2} =σ4n2i=1nμ^i2α(1γi(t¯ϵ,α))4σ4n2[i=1dn,αμ^iα+i=dn,α+1nμ^i2α]\displaystyle=\frac{\sigma^{4}}{n^{2}}\sum_{i=1}^{n}\widehat{\mu}_{i}^{2\alpha}\left(1-\gamma_{i}^{(\overline{t}_{\epsilon,\alpha})}\right)^{4}\leq\frac{\sigma^{4}}{n^{2}}\left[\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha}+\sum_{i=d_{n,\alpha}+1}^{n}\widehat{\mu}_{i}^{2\alpha}\right]
σ4n2[4nR2ϵ^n,α2(1+α)σ2+3i=1dn,αμ^i2α+7(3+1)R23σ2ϵ^n,α2(1+α)]\displaystyle\leq\frac{\sigma^{4}}{n^{2}}\left[\frac{4nR^{2}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}}{\sigma^{2}}+3\mathcal{M}\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{2\alpha}+\frac{7(3\mathcal{M}+1)R^{2}}{3\sigma^{2}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right]
σ2R2n[4+12+3(3+1)]ϵ^n,α2(1+α).\displaystyle\leq\frac{\sigma^{2}R^{2}}{n}\left[4+12\mathcal{M}+3(3\mathcal{M}+1)\right]\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}.

Finally, using the upper bound Bα2(t¯ϵ,α)R2(ηt¯ϵ,α)1+αR2(c)2ϵ^n,α2(1+α)B_{\alpha}^{2}(\overline{t}_{\epsilon,\alpha})\leq\frac{R^{2}}{(\eta\overline{t}_{\epsilon,\alpha})^{1+\alpha}}\leq R^{2}(c^{\prime})^{2}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)} for all α[0,1]\alpha\in[0,1] and the fact that ε(A)=X1,,Xn(𝕀(A))=X1,,Xn(A)\mathbb{P}_{\varepsilon}(A)=\mathbb{P}_{X_{1},\ldots,X_{n}}\left(\mathbb{I}(A)\right)=\mathbb{P}_{X_{1},\ldots,X_{n}}(A) for the event AA from Corollary A.5, one gets

ε(Σ1>Δt¯ϵ,α,α2)\displaystyle\mathbb{P}_{\varepsilon}\left(\Sigma_{1}>\frac{\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}}{2}\right) ε(Σ1>Δt¯ϵ,α,α2A)+X1,,Xn(Ac),\displaystyle\leq\mathbb{P}_{\varepsilon}\left(\Sigma_{1}>\frac{\Delta_{\overline{t}_{\epsilon,\alpha},\alpha}}{2}\mid A\right)+\mathbb{P}_{X_{1},\ldots,X_{n}}\left(A^{c}\right),
ε(τα>t¯ϵ,α)\displaystyle\mathbb{P}_{\varepsilon}\left(\tau_{\alpha}>\overline{t}_{\epsilon,\alpha}\right) 5exp[c1R2σ2nϵ^n,α2(1+α)],\displaystyle\leq 5\ \exp\left[-c_{1}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right],

for some positive constant c_{1}>0 that depends only on \mathcal{M}.

What follows is the second deviation inequality for τα\tau_{\alpha} that will be further used in Lemma D.8 to control the bias term.

Lemma D.6.

Recall Definition D.3 of t~ϵ,α\widetilde{t}_{\epsilon,\alpha}, then under Assumptions 1, 2, 3, and 4,

ε(τα<t~ϵ,α)5exp[c2R2σ2nϵ^n,α2(1+α)]\mathbb{P}_{\varepsilon}\left(\tau_{\alpha}<\widetilde{t}_{\epsilon,\alpha}\right)\leq 5\ \exp\left[-c_{2}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right] (66)

for a positive constant c2c_{2} that depends only on \mathcal{M}.

Proof of Lemma D.6.

Set κασ2trKnα/n\kappa_{\alpha}\coloneqq\sigma^{2}\textnormal{tr}K_{n}^{\alpha}/n. Note that t~ϵ,αtα\widetilde{t}_{\epsilon,\alpha}\leq t_{\alpha}^{*} by construction.

Further, for all ttαt\leq t_{\alpha}^{*}, due to the monotonicity of Rα,tR_{\alpha,t},

ε(τα<t)\displaystyle\mathbb{P}_{\varepsilon}\Big{(}\tau_{\alpha}<t\Big{)} =ε(Rα,t𝔼εRα,t(𝔼εRα,tκα))\displaystyle=\mathbb{P}_{\varepsilon}\Big{(}R_{\alpha,t}-\mathbb{E}_{\varepsilon}R_{\alpha,t}\leq-(\mathbb{E}_{\varepsilon}R_{\alpha,t}-\kappa_{\alpha})\Big{)}
ε(σ2ni=1nμ^iα(1γi(t))2(εi2σ21)Σ1𝔼εRα,tκα2)\displaystyle\leq\mathbb{P}_{\varepsilon}\bigg{(}\underbrace{\frac{\sigma^{2}}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(t)})^{2}\Big{(}\frac{\varepsilon_{i}^{2}}{\sigma^{2}}-1\Big{)}}_{\Sigma_{1}}\leq-\frac{\mathbb{E}_{\varepsilon}R_{\alpha,t}-\kappa_{\alpha}}{2}\bigg{)}
+ε(2ni=1nμ^iα(1γi(t))2GiεiΣ2𝔼εRα,tκα2).\displaystyle+\mathbb{P}_{\varepsilon}\bigg{(}\underbrace{\frac{2}{n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(t)})^{2}G_{i}^{*}\varepsilon_{i}}_{\Sigma_{2}}\leq-\frac{\mathbb{E}_{\varepsilon}R_{\alpha,t}-\kappa_{\alpha}}{2}\bigg{)}.

Consider Δt,α𝔼εRα,tκα=Bα2(t)+Vα(t)2V~α(t)\Delta_{t,\alpha}\coloneqq\mathbb{E}_{\varepsilon}R_{\alpha,t}-\kappa_{\alpha}=B_{\alpha}^{2}(t)+V_{\alpha}(t)-2\widetilde{V}_{\alpha}(t). At t=t~ϵ,αt=\widetilde{t}_{\epsilon,\alpha}, we have Bα2(t)=3V~α(t)B_{\alpha}^{2}(t)=3\widetilde{V}_{\alpha}(t), thus

Δt~ϵ,α,αV~α(t~ϵ,α).\Delta_{\widetilde{t}_{\epsilon,\alpha},\alpha}\geq\widetilde{V}_{\alpha}(\widetilde{t}_{\epsilon,\alpha}).

Then for the event AA from Corollary A.5, by standard concentration results on linear and quadratic sums of Gaussian random variables (see, e.g., [25, Lemma 1]),

\displaystyle\begin{split}\mathbb{P}_{\varepsilon}\left(\Sigma_{1}\leq-\frac{\Delta_{\widetilde{t}_{\epsilon,\alpha},\alpha}}{2}\mid A\right)&\leq\exp\left[-\frac{\widetilde{V}_{\alpha}^{2}(\widetilde{t}_{\epsilon,\alpha})}{16\lVert a(\widetilde{t}_{\epsilon,\alpha})\rVert^{2}}\right],\\ \mathbb{P}_{\varepsilon}\left(\Sigma_{2}\leq-\frac{\Delta_{\widetilde{t}_{\epsilon,\alpha},\alpha}}{2}\right)&\leq\exp\left[-\frac{n\widetilde{V}_{\alpha}^{2}(\widetilde{t}_{\epsilon,\alpha})}{32\sigma^{2}B_{\alpha}^{2}(\widetilde{t}_{\epsilon,\alpha})}\right],\end{split} (67)

where a_{i}(\widetilde{t}_{\epsilon,\alpha})=\frac{\sigma^{2}}{n}\widehat{\mu}_{i}^{\alpha}(1-\gamma_{i}^{(\widetilde{t}_{\epsilon,\alpha})})^{2},\ i\in[n].

In what follows, we simplify the bounds above.

First, we deal with the Euclidean norm of ai(t~ϵ,α),i[n]a_{i}(\widetilde{t}_{\epsilon,\alpha}),\ i\in[n]. By μ^11\widehat{\mu}_{1}\leq 1 and Corollary A.5 with a=R2σ2nϵ^n,α2(1+α)a=\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)} and θ=2\theta=2, and Ineq. (35), it gives us

a(t~ϵ,α)2=σ4n2i=1nμ^i2α(1γi(t~ϵ,α))4σ4n2[i=1dn,αμ^iα+i=dn,α+1nμ^i2α][4+12+3(3+1)]σ2nR2ϵ^n,α2(1+α).\displaystyle\begin{split}\lVert a(\widetilde{t}_{\epsilon,\alpha})\rVert^{2}=\frac{\sigma^{4}}{n^{2}}\sum_{i=1}^{n}\widehat{\mu}_{i}^{2\alpha}(1-\gamma_{i}^{(\widetilde{t}_{\epsilon,\alpha})})^{4}&\leq\frac{\sigma^{4}}{n^{2}}\left[\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha}+\sum_{i=d_{n,\alpha}+1}^{n}\widehat{\mu}_{i}^{2\alpha}\right]\\ &\leq\left[4+12\mathcal{M}+3(3\mathcal{M}+1)\right]\frac{\sigma^{2}}{n}R^{2}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}.\end{split} (68)

Recall that ηt~ϵ,α=ηt^ϵ,αc′′\eta\widetilde{t}_{\epsilon,\alpha}=\frac{\eta\widehat{t}_{\epsilon,\alpha}}{c^{\prime\prime}} for c′′1c^{\prime\prime}\geq 1. Therefore, it is sufficient to lower bound V~α(t~ϵ,α)\widetilde{V}_{\alpha}(\widetilde{t}_{\epsilon,\alpha}) as follows.

V~α(t~ϵ,α)σ22ni=1nμ^iαmin{1,ηt^ϵ,αc′′μ^i}\displaystyle\widetilde{V}_{\alpha}(\widetilde{t}_{\epsilon,\alpha})\geq\frac{\sigma^{2}}{2n}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}\min\{1,\frac{\eta\widehat{t}_{\epsilon,\alpha}}{c^{\prime\prime}}\widehat{\mu}_{i}\} =σ2ηt^ϵ,α2nc′′i=1nμ^iαmin{c′′ηt^ϵ,α,μ^i}\displaystyle=\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{2nc^{\prime\prime}}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}\min\left\{\frac{c^{\prime\prime}}{\eta\widehat{t}_{\epsilon,\alpha}},\widehat{\mu}_{i}\right\}
σ2ηt^ϵ,α2R2c′′^n,α2(1ηt^ϵ,α,)\displaystyle\geq\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{2R^{2}c^{\prime\prime}}\widehat{\mathcal{R}}_{n,\alpha}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)
=2R2c′′ϵ^n,α2(1+α).\displaystyle=\frac{2R^{2}}{c^{\prime\prime}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}.

By using the bound Bα2(t~ϵ,α)R2(ηt~ϵ,α)1+αR2(c′′)2ϵ^n,α2(1+α)B_{\alpha}^{2}(\widetilde{t}_{\epsilon,\alpha})\leq\frac{R^{2}}{(\eta\widetilde{t}_{\epsilon,\alpha})^{1+\alpha}}\leq R^{2}(c^{\prime\prime})^{2}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}, inserting this expression with (68) into (67), and using the fact that ε(A)=X1,,Xn(𝕀(A))=X1,,Xn(A)\mathbb{P}_{\varepsilon}(A)=\mathbb{P}_{X_{1},\ldots,X_{n}}\left(\mathbb{I}(A)\right)=\mathbb{P}_{X_{1},\ldots,X_{n}}(A) for the event AA from Corollary A.5, we have

ε(Σ1Δt~ϵ,α,α2)\displaystyle\mathbb{P}_{\varepsilon}\left(\Sigma_{1}\leq-\frac{\Delta_{\widetilde{t}_{\epsilon,\alpha},\alpha}}{2}\right) ε(Σ1Δt~ϵ,α,α2A)+X1,,Xn(Ac),\displaystyle\leq\mathbb{P}_{\varepsilon}\left(\Sigma_{1}\leq-\frac{\Delta_{\widetilde{t}_{\epsilon,\alpha},\alpha}}{2}\mid A\right)+\mathbb{P}_{X_{1},\ldots,X_{n}}\left(A^{c}\right),
ε(τα<t~ϵ,α)\displaystyle\mathbb{P}_{\varepsilon}\left(\tau_{\alpha}<\widetilde{t}_{\epsilon,\alpha}\right) 5exp[c2R2σ2nϵ^n,α2(1+α)],\displaystyle\leq 5\exp\left[-c_{2}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right],

where c2c_{2} depends only on \mathcal{M}. ∎

D.2 Bounding the stochastic part of the variance term at τα\tau_{\alpha}

Lemma D.7.

Under Assumptions 1, 2, 3, and 4, for any regular kernel, the stochastic part of the variance at τα\tau_{\alpha} is bounded as follows.

v(τα)8(1+C)R2ϵ^n,α2v(\tau_{\alpha})\leq 8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}

with probability at least 16exp[c1nR2σ2ϵ^n,α2(1+α)]1-6\exp\Big{[}-c_{1}n\frac{R^{2}}{\sigma^{2}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\Big{]}, where a constant c1c_{1} depends only on \mathcal{M}.

Proof of Lemma D.7.

ε(τα>t¯ϵ,α)5exp[c1R2σ2nϵ^n,α2(1+α)]\mathbb{P}_{\varepsilon}\left(\tau_{\alpha}>\overline{t}_{\epsilon,\alpha}\right)\leq 5\exp\left[-c_{1}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right] due to Lemma D.5. Therefore, thanks to the monotonicity of γi(t)\gamma_{i}^{(t)} in tt, with probability at least 15exp[c1R2σ2nϵ^n,α2(1+α)]1-5\ \exp\left[-c_{1}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right], v(τα)v(t¯ϵ,α)v(\tau_{\alpha})\leq v(\overline{t}_{\epsilon,\alpha}).

After that, due to the concentration inequality (53),

ε(|v(t¯ϵ,α)V(t¯ϵ,α)|δ)2exp[cnδσ2min{1,R2δσ2ηt¯ϵ,α^n2(1ηt¯ϵ,α,)}].\mathbb{P}_{\varepsilon}\Big{(}|v(\overline{t}_{\epsilon,\alpha})-V(\overline{t}_{\epsilon,\alpha})|\geq\delta\Big{)}\leq 2\exp\left[-\frac{cn\delta}{\sigma^{2}}\min\left\{1,\frac{R^{2}\delta}{\sigma^{2}\eta\overline{t}_{\epsilon,\alpha}\widehat{\mathcal{R}}_{n}^{2}(\frac{1}{\sqrt{\eta\overline{t}_{\epsilon,\alpha}}},\mathcal{H})}\right\}\right].

Now, by setting δ=σ2ηt^ϵ,αR2^n2(1ηt^ϵ,α,)σ2ηt^ϵ,αR2^n,α2(1ηt^ϵ,α,)\delta=\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{R^{2}}\widehat{\mathcal{R}}_{n}^{2}\Big{(}\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\Big{)}\geq\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{R^{2}}\widehat{\mathcal{R}}_{n,\alpha}^{2}\Big{(}\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\Big{)} and recalling Lemma G.2, it yields

v(t¯ϵ,α)V(t¯ϵ,α)+δV~(t^ϵ,α)+4(1+C)R2ϵ^n,α2σ2ηt^ϵ,αR2^n2(1ηt^ϵ,α,)+4(1+C)R2ϵ^n,α28(1+C)R2ϵ^n,α2\displaystyle\begin{split}v(\overline{t}_{\epsilon,\alpha})&\leq V(\overline{t}_{\epsilon,\alpha})+\delta\\ &\leq\widetilde{V}(\widehat{t}_{\epsilon,\alpha})+4(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}\\ &\leq\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{R^{2}}\widehat{\mathcal{R}}_{n}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)+4(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}\\ &\leq 8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}\end{split} (69)

with probability at least 1-\exp\Big{[}-cn\frac{4R^{2}}{\sigma^{2}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\Big{]}. Combining all the pieces together, we get

v(\tau_{\alpha})\leq 8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2} (70)

with probability at least 1-6\exp\Big{[}-c_{1}n\frac{R^{2}}{\sigma^{2}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\Big{]}. ∎

D.3 Bounding the bias term at \tau_{\alpha}

Lemma D.8.

Under Assumptions 1, 2, 3, and 4,

B^{2}(\tau_{\alpha})\leq c^{\prime\prime}R^{2}\widehat{\epsilon}_{n,\alpha}^{2} (71)

with probability at least 1-5\exp\Big{[}-c_{2}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\Big{]} for a positive numeric constant c^{\prime\prime}\geq 1 and a constant c_{2} that depends only on \mathcal{M}.

Proof of Lemma D.8.

Due to Lemma D.6, \mathbb{P}_{\varepsilon}\left(\tau_{\alpha}<\widetilde{t}_{\epsilon,\alpha}\right)\leq 5\exp\left[-c_{2}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right]. Therefore, thanks to the monotonicity of the bias term, with probability at least 1-5\exp\left[-c_{2}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right], B^{2}(\tau_{\alpha})\leq B^{2}(\widetilde{t}_{\epsilon,\alpha})\leq\frac{R^{2}}{\eta\widetilde{t}_{\epsilon,\alpha}}=c^{\prime\prime}R^{2}\widehat{\epsilon}_{n,\alpha}^{2}.

Appendix E Proof of Theorem 4.2

From Lemmas D.4, D.7, and D.8, we get

\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}\leq 2c^{\prime\prime}R^{2}\widehat{\epsilon}_{n,\alpha}^{2}+16(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2} (72)

with probability at least 1-11\exp\left[-c_{1}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right], where c_{1} depends only on \mathcal{M}. Moreover, taking the expectation in Ineq. (62) yields

\mathbb{E}_{\varepsilon}\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}\leq 2\mathbb{E}_{\varepsilon}[B^{2}(\tau_{\alpha})]+2\mathbb{E}_{\varepsilon}[v(\tau_{\alpha})].

Let us upper bound \mathbb{E}_{\varepsilon}\left[B^{2}(\tau_{\alpha})\right] and \mathbb{E}_{\varepsilon}\left[v(\tau_{\alpha})\right]. First, define \widetilde{a}\coloneqq B^{2}(\widetilde{t}_{\epsilon,\alpha}); thus

\displaystyle\begin{split}\mathbb{E}_{\varepsilon}\left[B^{2}(\tau_{\alpha})\right]&=\mathbb{P}_{\varepsilon}\Big{(}B^{2}(\tau_{\alpha})>\widetilde{a}\Big{)}\mathbb{E}_{\varepsilon}\Big{[}B^{2}(\tau_{\alpha})\mid B^{2}(\tau_{\alpha})>\widetilde{a}\Big{]}\\ &+\mathbb{P}_{\varepsilon}\Big{(}B^{2}(\tau_{\alpha})\leq\widetilde{a}\Big{)}\mathbb{E}_{\varepsilon}\Big{[}B^{2}(\tau_{\alpha})\mid B^{2}(\tau_{\alpha})\leq\widetilde{a}\Big{]}.\end{split} (73)

Defining \delta_{1}\coloneqq 5\exp\left[-c_{2}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right] from Lemma D.8 and using the upper bound B^{2}(t)\leq R^{2} for any t>0 gives

\mathbb{E}_{\varepsilon}\left[B^{2}(\tau_{\alpha})\right]\leq R^{2}\delta_{1}+B^{2}(\widetilde{t}_{\epsilon,\alpha})\leq R^{2}\left(\delta_{1}+c^{\prime\prime}\widehat{\epsilon}_{n,\alpha}^{2}\right). (74)

As for \mathbb{E}_{\varepsilon}\left[v(\tau_{\alpha})\right],

\displaystyle\begin{split}\mathbb{E}_{\varepsilon}\left[v(\tau_{\alpha})\right]&=\mathbb{E}_{\varepsilon}\left[v(\tau_{\alpha})\mathbb{I}\left\{v(\tau_{\alpha})\leq 8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}\right\}\right]\\ &+\mathbb{E}_{\varepsilon}\left[v(\tau_{\alpha})\mathbb{I}\left\{v(\tau_{\alpha})>8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}\right\}\right],\end{split} (75)

and due to Lemma D.7 and the Cauchy-Schwarz inequality,

\displaystyle\mathbb{E}_{\varepsilon}\left[v(\tau_{\alpha})\right]\leq 8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}+\mathbb{E}_{\varepsilon}\Big{[}v(\tau_{\alpha})\mathbb{I}\Big{\{}v(\tau_{\alpha})>8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}\Big{\}}\Big{]}
\leq 8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}+\sqrt{\mathbb{E}_{\varepsilon}v^{2}(\tau_{\alpha})}\sqrt{\mathbb{E}_{\varepsilon}\left[\mathbb{I}\left\{v(\tau_{\alpha})>8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}\right\}\right]}. (76)

Notice that v^{2}(\tau_{\alpha})\leq\frac{1}{n^{2}}\left[\sum_{i=1}^{n}\varepsilon_{i}^{2}\right]^{2}, and

\mathbb{E}_{\varepsilon}\left[v^{2}(\tau_{\alpha})\right]\leq\frac{1}{n^{2}}\left[\sum_{i=1}^{n}\mathbb{E}_{\varepsilon}\varepsilon_{i}^{4}+2\sum_{i<j}\mathbb{E}_{\varepsilon}\left(\varepsilon_{i}^{2}\varepsilon_{j}^{2}\right)\right]\leq\frac{3\sigma^{4}}{n^{2}}n^{2}\leq 3\sigma^{4}. (77)
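For completeness, the middle inequality in (77) can be spelled out as follows, assuming for this illustration that the \varepsilon_{i} are i.i.d. Gaussian with variance \sigma^{2} (more generally, it suffices that \mathbb{E}_{\varepsilon}\varepsilon_{i}^{4}\leq 3\sigma^{4}):

\sum_{i=1}^{n}\mathbb{E}_{\varepsilon}\varepsilon_{i}^{4}+2\sum_{i<j}\mathbb{E}_{\varepsilon}\left(\varepsilon_{i}^{2}\varepsilon_{j}^{2}\right)=3n\sigma^{4}+n(n-1)\sigma^{4}\leq 3n^{2}\sigma^{4}.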

At the same time, thanks to Lemma D.7,

\mathbb{E}_{\varepsilon}\left[\mathbb{I}\left\{v(\tau_{\alpha})>8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}\right\}\right]\leq 6\exp\left(-c_{1}n\frac{R^{2}}{\sigma^{2}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right).

Thus, inserting the last two inequalities into (76) gives

\mathbb{E}_{\varepsilon}\left[v(\tau_{\alpha})\right]\leq 8(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}+5\sigma^{2}\exp\left(-c_{1}n\frac{R^{2}}{\sigma^{2}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right).

Finally, summing up all the terms together,

\displaystyle\mathbb{E}_{\varepsilon}\lVert f^{\tau_{\alpha}}-f^{*}\rVert_{n}^{2}\leq\left[16(1+C)+2c^{\prime\prime}\right]R^{2}\widehat{\epsilon}_{n,\alpha}^{2}
+20\max\{\sigma^{2},R^{2}\}\exp\left(-c_{1}n\frac{R^{2}}{\sigma^{2}}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\right),

where the constant c_{1} depends only on \mathcal{M} and c^{\prime\prime} is a numeric constant.

Appendix F Proof of Theorem 3.3

We will use the definition of \tau (20) with the threshold \kappa\coloneqq\frac{r\sigma^{2}}{n} so that, due to the monotonicity of the "reduced" empirical risk \widetilde{R}_{t},

\mathbb{P}_{\varepsilon}\left(\tau>t\right)=\mathbb{P}_{\varepsilon}\Big{(}\widetilde{R}_{t}-\mathbb{E}_{\varepsilon}\widetilde{R}_{t}>\underbrace{\kappa-\mathbb{E}_{\varepsilon}\widetilde{R}_{t}}_{\Delta_{t}}\Big{)},

where

\Delta_{t}=-B^{2}(t)-V(t)+\underbrace{\frac{2\sigma^{2}}{n}\sum_{i=1}^{r}\gamma_{i}^{(t)}}_{2\widetilde{V}(t)}. (78)

Assume that \Delta_{t}\geq 0. Remark that

\widetilde{R}_{t}-\mathbb{E}_{\varepsilon}\widetilde{R}_{t}=\underbrace{\frac{\sigma^{2}}{n}\sum_{i=1}^{r}(1-\gamma_{i}^{(t)})^{2}\left(\frac{\varepsilon_{i}^{2}}{\sigma^{2}}-1\right)}_{\Sigma_{1}}+\underbrace{\frac{2}{n}\sum_{i=1}^{r}(1-\gamma_{i}^{(t)})^{2}G_{i}^{*}\varepsilon_{i}}_{\Sigma_{2}}. (79)

Applying [25, Lemma 1] to \Sigma_{1} yields

\mathbb{P}_{\varepsilon}\left(\Sigma_{1}>\frac{\Delta_{t}}{2}\right)\leq\exp\left[\frac{-\Delta_{t}^{2}/4}{4(\lVert a(t)\rVert^{2}+\frac{\Delta_{t}}{2}\lVert a(t)\rVert_{\infty})}\right], (80)

where a_{i}(t)\coloneqq\frac{\sigma^{2}}{n}(1-\gamma_{i}^{(t)})^{2},\ i\in[r]. In addition, [38, Proposition 2.5] gives us

\mathbb{P}_{\varepsilon}\left(\Sigma_{2}>\frac{\Delta_{t}}{2}\right)\leq\exp\left[-\frac{n\Delta_{t}^{2}}{32\sigma^{2}B^{2}(t)}\right]. (81)

Define a stopping time \overline{t}_{\epsilon} as follows.

\overline{t}_{\epsilon}\coloneqq\inf\left\{t>0:B^{2}(t)=\frac{1}{2}\widetilde{V}(t)\right\}. (82)

Note that \overline{t}_{\epsilon} serves as an upper bound on t^{*} and as a lower bound on \widehat{t}_{\epsilon}. Moreover, \overline{t}_{\epsilon} satisfies the critical inequality (54). Therefore, due to Lemma G.1 and the continuity of (54) in t, there is a positive numeric constant c^{\prime}\geq 1, which does not depend on n, such that \frac{1}{\eta\overline{t}_{\epsilon}}=c^{\prime}\frac{1}{\eta\widehat{t}_{\epsilon}}.

In what follows, we simplify the two high-probability bounds (80) and (81) at t=\overline{t}_{\epsilon}.

Since, by [29, Section 4.3], \widehat{\epsilon}_{n}^{2}=c\frac{r\sigma^{2}}{nR^{2}}, one can bound \lVert a(\overline{t}_{\epsilon})\rVert^{2} as follows.

\lVert a(\overline{t}_{\epsilon})\rVert^{2}=\frac{\sigma^{4}}{n^{2}}\sum_{i=1}^{r}(1-\gamma_{i}^{(\overline{t}_{\epsilon})})^{4}\leq\frac{r\sigma^{4}}{n^{2}}=\frac{R^{2}\sigma^{2}\widehat{\epsilon}_{n}^{2}}{cn}. (83)

Remark that in (80), \lVert a(\overline{t}_{\epsilon})\rVert_{\infty}=\frac{\sigma^{2}}{n}\underset{i\in[r]}{\max}\Big{[}(1-\gamma_{i}^{(\overline{t}_{\epsilon})})\Big{]}\leq\frac{\sigma^{2}}{n}, and

\frac{\Delta_{\overline{t}_{\epsilon}}}{2}\leq\frac{3}{4}\widetilde{V}(\overline{t}_{\epsilon})\leq\frac{3}{4}\widetilde{V}(\widehat{t}_{\epsilon})\leq\frac{3}{4}\frac{\sigma^{2}}{R^{2}}\eta\widehat{t}_{\epsilon}\widehat{\mathcal{R}}_{n}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon}}},\mathcal{H}\right)=3R^{2}\widehat{\epsilon}_{n}^{2}.

As for a lower bound on \Delta_{\overline{t}_{\epsilon}},

\displaystyle\Delta_{\overline{t}_{\epsilon}}\geq\frac{1}{2}\widetilde{V}(\overline{t}_{\epsilon})\geq\frac{\sigma^{2}}{4n}\sum_{i=1}^{r}\min\left\{1,\frac{\eta\widehat{t}_{\epsilon}}{c^{\prime}}\widehat{\mu}_{i}\right\}=\frac{\sigma^{2}\eta\widehat{t}_{\epsilon}}{4nc^{\prime}}\sum_{i=1}^{r}\min\left\{\frac{c^{\prime}}{\eta\widehat{t}_{\epsilon}},\widehat{\mu}_{i}\right\}\geq\frac{R^{2}}{c^{\prime}}\widehat{\epsilon}_{n}^{2}.

Using the bound B^{2}(\overline{t}_{\epsilon})\leq\frac{R^{2}}{\eta\overline{t}_{\epsilon}}=c^{\prime}R^{2}\widehat{\epsilon}_{n}^{2} and summing up the bounds (80) and (81) with t=\overline{t}_{\epsilon}, we obtain the following.

\mathbb{P}_{\varepsilon}\left(\tau>\overline{t}_{\epsilon}\right)\leq 2\exp\left[-C\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n}^{2}\right]. (84)

From [29, Lemma 9], \lVert f^{\overline{t}_{\epsilon}}\rVert_{\mathcal{H}}\leq\sqrt{7}R with probability at least 1-4\exp\Big{[}-c_{3}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n}^{2}\Big{]}. Thus, Ineq. (84) allows us to conclude:

\lVert f^{\tau}\rVert_{\mathcal{H}}\leq\sqrt{7}R\textnormal{ with probability at least }1-6\exp\left(-\tilde{c}_{3}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n}^{2}\right)\textnormal{ for }\tilde{c}_{3}>0.

This implies that \lVert f^{\tau}-f^{*}\rVert_{\mathcal{H}}\leq\lVert f^{\tau}\rVert_{\mathcal{H}}+\lVert f^{*}\rVert_{\mathcal{H}}\leq\left(1+\sqrt{7}\right)R with the same probability. Thus, according to [38, Theorem 14.1], for some positive numeric constants c_{1},\tilde{c}_{4},\tilde{c}_{5}:

\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\leq 2\lVert f^{\tau}-f^{*}\rVert_{n}^{2}+c_{1}R^{2}\epsilon_{n}^{2}

with probability (w.r.t. \varepsilon) at least 1-6\exp\left[-\tilde{c}_{3}\frac{R^{2}}{\sigma^{2}}n\widehat{\epsilon}_{n}^{2}\right] and with probability (w.r.t. \{x_{i}\}_{i=1}^{n}) at least 1-\tilde{c}_{4}\exp\left[-\tilde{c}_{5}\frac{R^{2}}{\sigma^{2}}n\epsilon_{n}^{2}\right].

Moreover, the same arguments (with \alpha=0 and without Assumption 4) as in the proof of Theorem 4.2, together with [38, Proposition 14.25] and [29, Section 4.3.1], yield

\lVert f^{\tau}-f^{*}\rVert_{n}^{2}\leq c_{u}R^{2}\widehat{\epsilon}_{n}^{2}\leq\widetilde{c}_{u}R^{2}\epsilon_{n}^{2}\lesssim\frac{r\sigma^{2}}{n} (85)

with probability at least 1-c_{1}\exp\left[-c_{2}\frac{R^{2}}{\sigma^{2}}n\epsilon_{n}^{2}\right]. Then, by the Cauchy-Schwarz inequality,

\displaystyle\mathbb{E}\lVert f^{\tau}-f^{*}\rVert_{2}^{2}=\mathbb{E}\left[\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\mathbb{I}\left\{\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\leq\frac{cr\sigma^{2}}{n}\right\}\right]
+\mathbb{E}\left[\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\mathbb{I}\left\{\lVert f^{\tau}-f^{*}\rVert_{2}^{2}>\frac{cr\sigma^{2}}{n}\right\}\right]
\leq\frac{cr\sigma^{2}}{n}+\sqrt{\mathbb{E}\lVert f^{\tau}-f^{*}\rVert_{2}^{4}}\sqrt{\mathbb{P}\left(\lVert f^{\tau}-f^{*}\rVert_{2}^{2}>\frac{cr\sigma^{2}}{n}\right)}.

Since f^{\tau}=g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}Y, where the empirical covariance operator is

\Sigma_{n}=\frac{1}{n}\sum_{i=1}^{n}\mathbb{K}(\cdot,x_{i})\otimes\mathbb{K}(\cdot,x_{i})=S_{n}^{*}S_{n},

and \gamma_{i}^{(\tau)}=\widehat{\mu}_{i}g_{\lambda(\tau)}(\widehat{\mu}_{i})\leq 1, one has

f^{*}-f^{\tau}=(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*}-g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon,

and, due to the definition of \tau,

\sigma^{2}=\lVert(I-S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*})Y\rVert_{n}^{2}.

We know that

\displaystyle\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\leq\mu_{1}\lVert(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*}-g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rVert_{\mathcal{H}}^{2}
\leq\lVert(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*}-g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rVert_{\mathcal{H}}^{2},

and

\displaystyle\sigma^{2}=\lVert(I-S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*})S_{n}f^{*}\rVert_{n}^{2}+\lVert(I-S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*})\varepsilon\rVert_{n}^{2}
+\underbrace{2\langle(I-S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*})S_{n}f^{*},(I-S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*})\varepsilon\rangle_{n}}_{\mathcal{A}_{n}}.

Further,

\displaystyle\lVert(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*}-g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rVert_{\mathcal{H}}^{2}=\lVert(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*}\rVert_{\mathcal{H}}^{2}
+\lVert g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rVert_{\mathcal{H}}^{2}
-\underbrace{2\langle(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*},g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rangle_{\mathcal{H}}}_{\mathcal{A}_{\mathcal{H}}}.

Thus, subtracting the empirical term from the RKHS term, one gets

\lVert(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*}-g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rVert_{\mathcal{H}}^{2}-\sigma^{2}=\underbrace{-(\mathcal{A}_{\mathcal{H}}+\mathcal{A}_{n})}_{\Delta\mathcal{A}}+\text{norm discrepancy},

where \text{norm discrepancy}=\lVert g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rVert_{\mathcal{H}}^{2}-\lVert(I-S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*})\varepsilon\rVert_{n}^{2}.

Firstly, S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}=K_{n}g_{\lambda(\tau)}(K_{n}), and

\lVert g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rVert_{\mathcal{H}}^{2}=\frac{1}{n}\varepsilon^{\top}K_{n}^{2}[g_{\lambda(\tau)}(K_{n})]^{2}\varepsilon.

Secondly,

\displaystyle\Delta\mathcal{A}=-2\langle(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*},g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rangle_{\mathcal{H}}
-2\langle(I-S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*})S_{n}f^{*},(I-S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*})\varepsilon\rangle_{n}.

Thirdly, since \langle h,S_{n}^{*}\mathbf{l}\rangle_{\mathcal{H}}=\langle S_{n}h,\mathbf{l}\rangle_{n} for any h\in\mathcal{H} and \mathbf{l}\in\mathbb{R}^{n}, and \langle h,h\rangle_{\mathcal{H}}=\langle S_{n}h,S_{n}h\rangle_{n}, one gets

\displaystyle\langle(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*},g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rangle_{\mathcal{H}}=\langle S_{n}(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*},S_{n}g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rangle_{n}
=\langle\underbrace{(I-K_{n}g_{\lambda(\tau)}(K_{n}))F^{*}}_{\tilde{b}_{\lambda(\tau)}^{2}},K_{n}g_{\lambda(\tau)}(K_{n})\varepsilon\rangle_{n}
\leq\lVert\tilde{b}_{\lambda(\tau)}^{2}\rVert_{n}\lVert\varepsilon\rVert_{n}
\leq R\lVert\varepsilon\rVert_{n}.

Fourthly,

2\langle(I-K_{n}g_{\lambda(\tau)}(K_{n}))F^{*},(I-K_{n}g_{\lambda(\tau)}(K_{n}))\varepsilon\rangle_{n}\leq 2\lVert\tilde{b}_{\lambda(\tau)}^{2}\rVert_{n}\lVert\varepsilon\rVert_{n}\leq 2R\lVert\varepsilon\rVert_{n}.

Combining everything together and using \gamma_{i}^{(\tau)}\leq 1 for each i,

\lVert(I-g_{\lambda(\tau)}(\Sigma_{n})\Sigma_{n})f^{*}-g_{\lambda(\tau)}(\Sigma_{n})S_{n}^{*}\varepsilon\rVert_{\mathcal{H}}^{2}-\sigma^{2}\leq\lVert\varepsilon\rVert_{n}^{2}+4R\lVert\varepsilon\rVert_{n}.

The last inequality implies that

\displaystyle\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\leq\sigma^{2}+\lVert\varepsilon\rVert_{n}^{2}+4R\lVert\varepsilon\rVert_{n},
\displaystyle\mathbb{E}\lVert f^{\tau}-f^{*}\rVert_{2}^{4}\leq 3\sigma^{4}+16R^{2}\sigma^{2}+\mathbb{E}\lVert\varepsilon\rVert_{n}^{4}+8R\mathbb{E}\lVert\varepsilon\rVert_{n}^{3}+8R\sigma^{2}\mathbb{E}\lVert\varepsilon\rVert_{n}
\leq 6\sigma^{4}+24R\sigma^{3}+16R^{2}\sigma^{2}.

As a consequence of the last inequality,

\mathbb{E}\lVert f^{\tau}-f^{*}\rVert_{2}^{2}\leq\frac{\widetilde{c}r\sigma^{2}}{n}+C(\sigma,R)\exp(-cr).
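For intuition, the stopping rule analyzed above can be evaluated numerically from the spectrum of the kernel matrix. The following minimal Python sketch is purely illustrative: it assumes kernel ridge regression with the spectral filter \gamma_{i}^{(t)}=\frac{t\widehat{\mu}_{i}}{1+t\widehat{\mu}_{i}} (\eta=1), a Gaussian kernel, a synthetic regression function, and it monitors the full empirical risk with threshold \sigma^{2} instead of the reduced risk \widetilde{R}_{t} with threshold \kappa=\frac{r\sigma^{2}}{n} used in this proof.

import numpy as np

rng = np.random.default_rng(0)
n, sigma = 200, 0.5
x = np.sort(rng.uniform(0.0, 1.0, n))
y = np.sin(2 * np.pi * x) + sigma * rng.normal(size=n)      # synthetic data (illustrative)

K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.05) / n      # normalized Gram matrix (illustrative normalization)
mu_hat, U = np.linalg.eigh(K)                               # empirical eigenvalues and eigenvectors
coef = U.T @ y

def empirical_risk(t):
    # Kernel ridge regression filter gamma_i^{(t)} = t * mu_i / (1 + t * mu_i), eta = 1.
    gamma = t * mu_hat / (1.0 + t * mu_hat)
    return np.mean((y - U @ (gamma * coef)) ** 2)

# Minimum discrepancy principle: stop once the empirical risk drops below sigma^2.
grid = np.geomspace(1e-2, 1e6, 400)
tau = next((t for t in grid if empirical_risk(t) <= sigma ** 2), grid[-1])
print(f"stopping time tau = {tau:.3g}, empirical risk at tau = {empirical_risk(tau):.4f}")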

Appendix G Auxiliary results

Lemma G.1.

[29, Appendix D] Under Assumptions 1 and 2, for any \alpha\in[0,1], the function \epsilon\mapsto\frac{\widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})}{\epsilon} is non-increasing (as a function of \epsilon) on the interval (0,+\infty), and consequently, for any numeric constant c>0,

\frac{\widehat{\mathcal{R}}_{n,\alpha}(\epsilon,\mathcal{H})}{\epsilon}\leq c\frac{R^{2}}{\sigma}\epsilon^{1+\alpha} (86)

has a smallest positive solution. In addition to that, \widehat{\epsilon}_{n,\alpha} (15) exists, is unique, and satisfies equality in Eq. (86).
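As an illustration, the smallest positive solution of the equality version of (86) can be computed by bisection, using the monotonicity stated in the lemma. The Python sketch below is not part of the proof; the empirical eigenvalues mu_hat (polynomially decaying), the smoothing parameter alpha, and the constants c, R, sigma are all illustrative assumptions.

import numpy as np

n, alpha, c, R, sigma = 500, 0.5, 1.0, 1.0, 1.0
mu_hat = 1.0 / np.arange(1, n + 1) ** 2                     # illustrative eigenvalue decay

def smoothed_complexity(eps):
    # Empirical counterpart of R_hat_{n,alpha}(eps, H) = sqrt((1/n) sum mu_i^alpha * min(mu_i, eps^2)).
    return np.sqrt(np.mean(mu_hat ** alpha * np.minimum(mu_hat, eps ** 2)))

def gap(eps):
    # Left-hand side minus right-hand side of (86); non-increasing in eps by Lemma G.1.
    return smoothed_complexity(eps) / eps - c * (R ** 2 / sigma) * eps ** (1 + alpha)

lo, hi = 1e-8, 1.0                                          # gap(lo) > 0 > gap(hi) for this spectrum
for _ in range(100):                                        # bisection for the unique crossing point
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if gap(mid) > 0 else (lo, mid)
print(f"estimated critical radius eps_hat: {hi:.4f}")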

Lemma G.2.

Under Assumptions 1, 2, and 3, any regular kernel and \widehat{t}_{\epsilon,\alpha} from Definition D.2 satisfy

\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{4R^{2}}\widehat{\mathcal{R}}_{n}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)\leq\frac{(1+C)R^{2}}{\eta\widehat{t}_{\epsilon,\alpha}}. (87)

Thus, \widehat{t}_{\epsilon,\alpha} provides a smallest positive solution to the non-smooth version of the critical inequality.

Proof of Lemma G.2.

First, we recall that \frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{4R^{2}}\widehat{\mathcal{R}}_{n,\alpha}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)=R^{2}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}. Then, for d_{n,\alpha}=\min\{j\in[n]:\widehat{\mu}_{j}\leq\widehat{\epsilon}_{n,\alpha}^{2}\},

\displaystyle\begin{split}\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{4R^{2}}\widehat{\mathcal{R}}_{n,\alpha}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)&=\frac{\sigma^{2}}{4n\widehat{\epsilon}_{n,\alpha}^{2}}\sum_{i=1}^{n}\widehat{\mu}_{i}^{\alpha}\min\{\widehat{\mu}_{i},\widehat{\epsilon}_{n,\alpha}^{2}\}\\ &=\frac{\sigma^{2}}{4n\widehat{\epsilon}_{n,\alpha}^{2}}\left[\widehat{\epsilon}_{n,\alpha}^{2}\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha}+\sum_{i=d_{n,\alpha}+1}^{n}\widehat{\mu}_{i}^{1+\alpha}\right]\\ &=R^{2}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}.\end{split} (88)

The last two lines of (88) yield \frac{\sigma^{2}}{4n\widehat{\epsilon}_{n,\alpha}^{2}}=\frac{R^{2}\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}}{\widehat{\epsilon}_{n,\alpha}^{2}\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha}+\sum_{i=d_{n,\alpha}+1}^{n}\widehat{\mu}_{i}^{1+\alpha}}.

Second, consider the left-hand side of the non-smooth version of the critical inequality (56) at t=\widehat{t}_{\epsilon,\alpha}:

\displaystyle\begin{split}\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{4R^{2}}\widehat{\mathcal{R}}_{n}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)&=\frac{\sigma^{2}}{4n\widehat{\epsilon}_{n,\alpha}^{2}}\sum_{i=1}^{n}\min\{\widehat{\mu}_{i},\widehat{\epsilon}_{n,\alpha}^{2}\}\\ &\leq R^{2}\frac{\sum_{i=1}^{d_{n,\alpha}}\widehat{\epsilon}_{n,\alpha}^{4+2\alpha}+\widehat{\epsilon}_{n,\alpha}^{2(1+\alpha)}\sum_{i=d_{n,\alpha}+1}^{n}\widehat{\mu}_{i}}{\widehat{\epsilon}_{n,\alpha}^{2}\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha}}.\end{split} (89)

Notice that \widehat{\mu}_{i}\geq\widehat{\epsilon}_{n,\alpha}^{2} and \widehat{\mu}_{i}^{\alpha}\geq\widehat{\epsilon}_{n,\alpha}^{2\alpha} for i\leq d_{n,\alpha}. This implies \sum_{i=1}^{d_{n,\alpha}}\widehat{\epsilon}_{n,\alpha}^{4+2\alpha}\leq\widehat{\epsilon}_{n,\alpha}^{4}\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha}, and also that \sum_{i=d_{n,\alpha}+1}^{n}\widehat{\mu}_{i}\leq C\widehat{\epsilon}_{n,\alpha}^{2(1-\alpha)}\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha} since the kernel is regular. Hence,

\displaystyle\widehat{\epsilon}_{n,\alpha}^{2\alpha}\sum_{i=d_{n,\alpha}+1}^{n}\widehat{\mu}_{i}\leq C\widehat{\epsilon}_{n,\alpha}^{2}\sum_{i=1}^{d_{n,\alpha}}\widehat{\mu}_{i}^{\alpha},

which leads to the desired upper bound with \widehat{\epsilon}_{n,\alpha}^{2}=(\eta\widehat{t}_{\epsilon,\alpha})^{-1}:

\frac{\sigma^{2}\eta\widehat{t}_{\epsilon,\alpha}}{4R^{2}}\widehat{\mathcal{R}}_{n}^{2}\left(\frac{1}{\sqrt{\eta\widehat{t}_{\epsilon,\alpha}}},\mathcal{H}\right)\leq(1+C)R^{2}\widehat{\epsilon}_{n,\alpha}^{2}.

Appendix H Proof of Lemma 5.1

Let us prove the lemma only for kernel ridge regression. W.l.o.g. assume that \eta=R=\sigma=1 and notice that

\displaystyle\begin{split}\mathbb{E}_{\varepsilon}\left[\frac{R_{t}}{1/n\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}}\right]=\sigma^{2}+\frac{B^{2}(t)}{1/n\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}}.\end{split} (90)

From Lemma B.1, B^{2}(t)\leq\frac{1}{t}. As for the denominator,

\frac{1}{n}\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}=\frac{1}{n}\sum_{i=1}^{n}\frac{1}{(1+t\widehat{\mu}_{i})^{2}}.

With the parameterization t=\frac{1}{\epsilon^{2}} and d_{n,0}=\min\left\{j\in[n]:\widehat{\mu}_{j}\leq\widehat{\epsilon}_{n}^{2}\right\}, since each \gamma_{i}^{(t)}, i=1,\ldots,n, is non-decreasing in t,

\displaystyle\frac{B^{2}(t)}{1/n\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}}\leq\frac{1}{\frac{1}{n\widehat{\epsilon}_{n}^{2}}\sum_{i=1}^{n}\left(\frac{\widehat{\epsilon}_{n}^{2}}{\widehat{\mu}_{i}+\widehat{\epsilon}_{n}^{2}}\right)^{2}}
\leq\frac{1}{\frac{n-d_{n,0}}{4n\widehat{\epsilon}_{n}^{2}}}.

From [41, Section 2.3], d_{n,0}=cn\widehat{\epsilon}_{n}^{2}, which implies

\frac{B^{2}(t)}{1/n\sum_{i=1}^{n}(1-\gamma_{i}^{(t)})^{2}}\leq\frac{4\widehat{\epsilon}_{n}^{2}}{1-c\widehat{\epsilon}_{n}^{2}}\to 0.
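The bound above can also be checked numerically. The following minimal Python sketch is illustrative only: it evaluates an upper bound on the second term on the right-hand side of (90) for kernel ridge regression with \eta=R=\sigma=1, at the parameterization t=1/\widehat{\epsilon}_{n}^{2}, using B^{2}(t)\leq 1/t and a polynomially decaying empirical spectrum as an assumption.

import numpy as np

n = 2000
mu_hat = 1.0 / np.arange(1, n + 1) ** 2                     # illustrative empirical eigenvalues

def gap(eps):
    # Equality version of the critical inequality with alpha = 0 and c = R = sigma = 1.
    return np.sqrt(np.mean(np.minimum(mu_hat, eps ** 2))) / eps - eps

lo, hi = 1e-8, 1.0
for _ in range(100):                                        # bisection for eps_hat_n
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if gap(mid) > 0 else (lo, mid)
eps_hat_sq = hi ** 2

t = 1.0 / eps_hat_sq                                        # parameterization t = 1 / eps^2
denominator = np.mean(1.0 / (1.0 + t * mu_hat) ** 2)        # (1/n) sum (1 - gamma_i^{(t)})^2
bias_ratio = (1.0 / t) / denominator                        # uses B^2(t) <= 1/t
print(f"eps_hat_n^2 = {eps_hat_sq:.2e}, bias ratio bound = {bias_ratio:.2e}")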
Acknowledgments

The authors would like to thank the anonymous referees, an Associate Editor and the Editor for their constructive comments that improved the quality of this paper.

References

  • [1] Akaike, H. (1998). Information theory and an extension of the maximum likelihood principle. In Selected Papers of Hirotugu Akaike, 199–213. Springer.
  • [2] Angles, T., Camoriano, R., Rudi, A. and Rosasco, L. (2015). NYTRO: When Subsampling Meets Early Stopping. arXiv e-prints, arXiv:1510.05684.
  • [3] Arlot, S., Celisse, A. et al. (2010). A survey of cross-validation procedures for model selection. Statistics Surveys 4, 40–79.
  • [4] Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society 68, 337–404.
  • [5] Bartlett, P. L., Bousquet, O., Mendelson, S. et al. (2005). Local Rademacher complexities. The Annals of Statistics 33, 1497–1537.
  • [6] Bartlett, P. L. and Traskin, M. (2007). AdaBoost is consistent. Journal of Machine Learning Research 8, 2347–2368.
  • [7] Bauer, F., Pereverzev, S. and Rosasco, L. (2007). On regularization algorithms in learning theory. Journal of Complexity 23, 52–72.
  • [8] Berlinet, A. and Thomas-Agnan, C. (2011). Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media.
  • [9] Blanchard, G., Hoffmann, M. and Reiß, M. (2018). Optimal adaptation for early stopping in statistical inverse problems. SIAM/ASA Journal on Uncertainty Quantification 6, 1043–1075.
  • [10] Blanchard, G., Hoffmann, M., Reiß, M. et al. (2018). Early stopping for statistical inverse problems via truncated SVD estimation. Electronic Journal of Statistics 12, 3204–3231.
  • [11] Blanchard, G. and Krämer, N. (2016). Convergence rates of kernel conjugate gradient for random design regression. Analysis and Applications 14, 763–794.
  • [12] Blanchard, G. and Mathé, P. (2010). Conjugate gradient regularization under general smoothness and noise assumptions. Journal of Inverse and Ill-posed Problems 18, 701–726.
  • [13] Blanchard, G. and Mathé, P. (2012). Discrepancy principle for statistical inverse problems with application to conjugate gradient iteration. Inverse Problems 28, 115011.
  • [14] Bühlmann, P. and Yu, B. (2003). Boosting with the L2 loss: regression and classification. Journal of the American Statistical Association 98, 324–339.
  • [15] Caponnetto, A. (2006). Optimal Rates for Regularization Operators in Learning Theory.
  • [16] Caponnetto, A. and Yao, Y. (2010). Cross-validation based adaptation for regularization operators in learning theory. Analysis and Applications 8, 161–183.
  • [17] Cavalier, L., Golubev, G. K., Picard, D., Tsybakov, A. B. et al. (2002). Oracle inequalities for inverse problems. The Annals of Statistics 30, 843–874.
  • [18] Celisse, A. and Wahl, M. (2021). Analyzing the discrepancy principle for kernelized spectral filter learning algorithms. Journal of Machine Learning Research 22, 1–59.
  • [19] Cucker, F. and Smale, S. (2002). On the mathematical foundations of learning. Bulletin of the American Mathematical Society 39, 1–49.
  • [20] Engl, H. W., Hanke, M. and Neubauer, A. (1996). Regularization of Inverse Problems, vol. 375. Springer Science & Business Media.
  • [21] Lo Gerfo, L., Rosasco, L., Odone, F., De Vito, E. and Verri, A. (2008). Spectral algorithms for supervised learning. Neural Computation 20, 1873–1897.
  • [22] Gu, C. (2013). Smoothing Spline ANOVA Models, vol. 297. Springer Science & Business Media.
  • [23] Hansen, P. C. (2010). Discrete Inverse Problems: Insight and Algorithms, vol. 7. SIAM.
  • [24] Koltchinskii, V. et al. (2006). Local Rademacher complexities and oracle inequalities in risk minimization. The Annals of Statistics 34, 2593–2656.
  • [25] Laurent, B. and Massart, P. (2000). Adaptive estimation of a quadratic functional by model selection. The Annals of Statistics, 1302–1338.
  • [26] Mathé, P. and Pereverzev, S. V. (2003). Geometry of linear ill-posed problems in variable Hilbert scales. Inverse Problems 19, 789.
  • [27] Mendelson, S. (2002). Geometric parameters of kernel machines. In International Conference on Computational Learning Theory, 29–43. Springer.
  • [28] Raskutti, G., Wainwright, M. J. and Yu, B. (2012). Minimax-optimal rates for sparse additive models over kernel classes via convex programming. Journal of Machine Learning Research 13, 389–427.
  • [29] Raskutti, G., Wainwright, M. J. and Yu, B. (2014). Early stopping and non-parametric regression: an optimal data-dependent stopping rule. Journal of Machine Learning Research 15, 335–366.
  • [30] Rudi, A., Camoriano, R. and Rosasco, L. (2015). Less is more: Nyström computational regularization. In Advances in Neural Information Processing Systems, 1657–1665.
  • [31] Scholkopf, B. and Smola, A. J. (2001). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
  • [32] Schwarz, G. et al. (1978). Estimating the dimension of a model. The Annals of Statistics 6, 461–464.
  • [33] Stankewitz, B. (2019). Smoothed residual stopping for statistical inverse problems via truncated SVD estimation.
  • [34] Stone, C. J. et al. (1985). Additive regression and other nonparametric models. The Annals of Statistics 13, 689–705.
  • [35] Wahba, G. (1977). Practical approximate solutions to linear operator equations when the data are noisy. SIAM Journal on Numerical Analysis 14, 651–667.
  • [36] Wahba, G. (1987). Three topics in ill-posed problems. In Inverse and Ill-posed Problems, 37–51. Elsevier.
  • [37] Wahba, G. (1990). Spline Models for Observational Data, vol. 59. SIAM.
  • [38] Wainwright, M. J. (2019). High-Dimensional Statistics: A Non-Asymptotic Viewpoint, vol. 48. Cambridge University Press.
  • [39] Wasserman, L. (2006). All of Nonparametric Statistics. Springer Science & Business Media.
  • [40] Wei, Y., Yang, F. and Wainwright, M. J. (2017). Early stopping for kernel boosting algorithms: A general analysis with localized complexities. In Advances in Neural Information Processing Systems, 6067–6077.
  • [41] Yang, Y., Pilanci, M. and Wainwright, M. J. (2017). Randomized sketches for kernels: Fast and optimal nonparametric regression. The Annals of Statistics 45, 991–1023.
  • [42] Yao, Y., Rosasco, L. and Caponnetto, A. (2007). On early stopping in gradient descent learning. Constructive Approximation 26, 289–315. doi:10.1007/s00365-006-0663-2
  • [43] Zhang, T., Yu, B. et al. (2005). Boosting with early stopping: Convergence and consistency. The Annals of Statistics 33, 1538–1579.