On the Optimality of Misspecified Kernel Ridge Regression
Abstract
In the misspecified kernel ridge regression problem, researchers usually assume that the underlying true function $f_\rho^* \in [\mathcal{H}]^s$, a less-smooth interpolation space of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, for some $s \in (0,1)$. The existing minimax optimal results require $\|f_\rho^*\|_{L^\infty} < \infty$, which implicitly requires $s > \alpha_0$, where $\alpha_0 \in (0,1)$ is the embedding index, a constant depending on $\mathcal{H}$. Whether KRR is optimal for all $s \in (0,1)$ is an outstanding problem that has lasted for years. In this paper, we show that KRR is minimax optimal for any $s \in (0,1)$ when $\mathcal{H}$ is a Sobolev RKHS.
1 Introduction
Suppose that the samples $(x_1, y_1), \ldots, (x_n, y_n)$ are i.i.d. sampled from an unknown distribution $\rho$ on $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mathcal{Y} \subseteq \mathbb{R}$. The regression problem aims to find a function $\hat{f}$ such that the risk
\[
\mathcal{E}(\hat{f}) = \mathbb{E}_{(x,y) \sim \rho}\big[ (\hat{f}(x) - y)^2 \big]
\]
is relatively small. It is well known that the conditional mean function $f_\rho^*$ given by $f_\rho^*(x) = \mathbb{E}_\rho[\, y \mid x \,]$ minimizes the risk $\mathcal{E}(f)$. Therefore, we may focus on establishing the convergence rate (either in expectation or in probability) of the excess risk (generalization error)
\[
\mathcal{E}(\hat{f}) - \mathcal{E}(f_\rho^*) = \mathbb{E}_{x \sim \mu}\big[ \big(\hat{f}(x) - f_\rho^*(x)\big)^2 \big], \tag{1}
\]
where $\mu$ is the marginal distribution of $\rho$ on $\mathcal{X}$.
In the non-parametric regression settings, researchers often assume that $f_\rho^*$ falls into a class of functions with certain structures and develop non-parametric methods to obtain the estimator $\hat{f}$. One of the most popular non-parametric regression methods, the kernel method, aims to estimate $f_\rho^*$ using candidate functions from a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, a separable Hilbert space associated with a kernel function $k$ defined on $\mathcal{X}$; see, e.g., Kohler & Krzyżak (2001); Cucker & Smale (2001); Caponnetto & de Vito (2007); Steinwart & Christmann (2008). This paper focuses on kernel ridge regression (KRR), which constructs an estimator $\hat{f}_\lambda$ by solving the penalized least squares problem
\[
\hat{f}_{\lambda} = \operatorname*{arg\,min}_{f \in \mathcal{H}} \left\{ \frac{1}{n} \sum_{i=1}^{n} \big( y_i - f(x_i) \big)^2 + \lambda \, \| f \|_{\mathcal{H}}^{2} \right\}, \tag{2}
\]
where $\lambda > 0$ is referred to as the regularization parameter.
Since the minimax optimality of KRR has been proved for $1 \le s \le 2$ (Caponnetto & de Vito, 2007), a large body of literature has studied the convergence rate of the generalization error of misspecified KRR ($0 < s < 1$) and whether the rate is optimal in the minimax sense. It turns out that the eigenvalue decay rate ($\beta$), the source condition ($s$), and the embedding index ($\alpha_0$) of the RKHS jointly determine the convergence behavior of the misspecified KRR (see Section 2.2 for definitions). If we only assume that $f_\rho^*$ belongs to an interpolation space $[\mathcal{H}]^s$ of the RKHS for some $s \in (0,1)$, the well-known information-theoretic lower bound shows that the minimax lower bound is $n^{-\frac{s\beta}{s\beta+1}}$. The state-of-the-art work Fischer & Steinwart (2020) has already shown that when $s > \alpha_0$, the upper bound of the convergence rate of KRR is $n^{-\frac{s\beta}{s\beta+1}}$, and hence KRR is optimal. However, when $0 < s \le \alpha_0$, all the existing works need an additional boundedness assumption on $f_\rho^*$ to prove the same upper bound $n^{-\frac{s\beta}{s\beta+1}}$. The boundedness assumption results in a smaller function space, i.e., $[\mathcal{H}]^s \cap L^\infty(\mathcal{X}, \mu) \subsetneq [\mathcal{H}]^s$ when $s \le \alpha_0$. Fischer & Steinwart (2020) further establish a minimax lower bound for the excess risk over this smaller function space; this lower bound is smaller than the upper bound $n^{-\frac{s\beta}{s\beta+1}}$ of the convergence rate, and hence the minimax optimality of KRR cannot be deduced when $0 < s \le \alpha_0$.
It has been an outstanding problem for years whether KRR is minimax optimal for all $s \in (0,1)$ (Pillaud-Vivien et al., 2018; Fischer & Steinwart, 2020; Liu & Shi, 2022). This paper concludes that, for Sobolev RKHS, KRR is optimal for all $s \in (0,1)$; combined with the known results for $1 \le s \le 2$, KRR is therefore optimal for Sobolev RKHS and all $0 < s \le 2$. Together with a recent work on the saturation effect, which shows that KRR cannot be optimal for $s > 2$ (Li et al., 2023), the optimality of KRR for Sobolev RKHS is well understood.
1.1 Related work
Kernel ridge regression has been studied as a special kind of spectral regularization algorithm (Rosasco et al., 2005; Caponnetto, 2006; Bauer et al., 2007; Gerfo et al., 2008; Mendelson & Neeman, 2010). In a large part of the literature, the ‘hardness’ of the regression problem is determined by two parameters: 1. the source condition ($s$), which characterizes a function’s relative ‘smoothness’ with respect to the RKHS; 2. the eigenvalue decay rate ($\beta$) (or, equivalently, the capacity or effective dimension condition), which characterizes the RKHS itself. These two parameters divide the convergence behavior of KRR and other spectral regularization algorithms into different regimes and lead to different convergence rates (Dicker et al., 2017; Blanchard & Mücke, 2018; Lin et al., 2018; Lin & Cevher, 2020; Li et al., 2023, etc.). Caponnetto & de Vito (2007) first prove the optimality when $1 \le s \le 2$, and Lin et al. (2018) extend the desired upper bound of the convergence rate to a larger regime of $s$.
KRR in the misspecified case ($0 < s < 1$) has also been discussed by another important line of work, which considers the embedding index ($\alpha_0$) of the RKHS and performs a refined analysis (Steinwart et al., 2009; Dicker et al., 2017; Pillaud-Vivien et al., 2018; Fischer & Steinwart, 2020; Celisse & Wahl, 2020; Li et al., 2022). This line of work extends the desired upper bound of the convergence rate to a broader regime of $s$, and extends the minimax optimality to the regime $s > \alpha_0$. It is worth pointing out that when $f_\rho^*$ falls into a less-smooth interpolation space that does not imply the boundedness of the functions therein, all existing works (whether considering the embedding index or not) require an additional boundedness assumption, i.e., $f_\rho^* \in L^\infty(\mathcal{X}, \mu)$ (Lin & Cevher, 2020; Fischer & Steinwart, 2020; Talwai & Simchi-Levi, 2022; Li et al., 2022, etc.). As discussed in the introduction, this leads to the suboptimality in the regime $0 < s \le \alpha_0$.
This paper follows the line of work that considers the embedding index, refines the proof by handling the additional boundedness assumption, and solves the optimality problem for Sobolev RKHS and all $s \in (0,1)$. In addition, our technical improvement also sheds light on the optimality for general RKHS. Specifically, we replace the boundedness assumption with a far weaker $L^q$-integrability assumption, which turns out to be reasonable for many RKHS. Note that our results focus on the most frequently used $L^2$ convergence rate for KRR, and they can be easily extended to the $[\mathcal{H}]^{\gamma}$-norm convergence rate (e.g., Fischer & Steinwart 2020) and to general spectral regularization algorithms (e.g., Lin et al. 2018).
2 Preliminaries
2.1 Basic concepts
Let $\mathcal{X}$ be the input space and $\mathcal{Y} \subseteq \mathbb{R}$ be the output space. Let $\rho$ be an unknown probability distribution on $\mathcal{X} \times \mathcal{Y}$ satisfying $\int_{\mathcal{X} \times \mathcal{Y}} y^2 \, d\rho(x, y) < \infty$, and denote the corresponding marginal distribution on $\mathcal{X}$ as $\mu$. We use $L^p(\mathcal{X}, \mu)$ (in short, $L^p$) to represent the $L^p$-spaces. Then the generalization error (1) can be written as $\|\hat{f} - f_\rho^*\|_{L^2}^2$.
Throughout the paper, we denote by $\mathcal{H}$ a separable RKHS on $\mathcal{X}$ with respect to a continuous and bounded kernel function $k$ satisfying $\sup_{x \in \mathcal{X}} k(x, x) \le \kappa^2$.
Define the integral operator $T \colon L^2(\mathcal{X}, \mu) \to L^2(\mathcal{X}, \mu)$ as
\[
(T f)(x) = \int_{\mathcal{X}} k(x, x') \, f(x') \, d\mu(x'). \tag{3}
\]
It is well known that is a positive, self-adjoint, trace-class, and hence a compact operator (Steinwart & Scovel, 2012). The spectral theorem for self-adjoint compact operators and Mercer’s decomposition theorem yield that
where is an at most countable set, the eigenvalues is a non-increasing summable sequence, and are the corresponding eigenfunctions. Denote the samples as and . The representer theorem (see, e.g., Steinwart & Christmann 2008) gives an explicit expression of the KRR estimator defined by (2), i.e.,
where
and
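For concreteness, the following Python sketch shows how the representer-theorem solution above is computed in practice. It is only an illustration, not the paper's code: the kernel, the data, and all variable names are our own choices, and it assumes the standard closed form $\hat{f}_\lambda(x) = \sum_{i=1}^n \alpha_i k(x, x_i)$ with $\alpha = (\mathbb{K} + n\lambda I)^{-1} y$ for the objective (2) normalized by $1/n$.

```python
import numpy as np

def krr_fit(X, y, kernel, lam):
    """Solve (K + n*lam*I) alpha = y for the representer-theorem coefficients."""
    n = X.shape[0]
    K = kernel(X[:, None], X[None, :])  # Gram matrix K_ij = k(x_i, x_j)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def krr_predict(X_train, alpha, kernel, X_test):
    """Evaluate f_hat(x) = sum_i alpha_i * k(x, x_i) at the test points."""
    K_test = kernel(X_test[:, None], X_train[None, :])
    return K_test @ alpha

# Illustrative usage with a placeholder kernel k(x, x') = min(x, x') on [0, 1].
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=200)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(200)
alpha = krr_fit(X, y, np.minimum, lam=1e-3)
y_hat = krr_predict(X, alpha, np.minimum, np.linspace(0, 1, 5))
```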
We also need to introduce the interpolation spaces of RKHS. For any , the fractional power integral operator is defined as
and the interpolation space of is defined as
(4) |
with the inner product
(5) |
It is easy to show that is also a separable Hilbert space with orthogonal basis . Specially, we have and . For , the embeddings exist and are compact (Fischer & Steinwart, 2020). For the functions in with larger , we say they have higher regularity (smoothness) with respect to the RKHS.
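Because the displayed definitions (4) and (5) may be difficult to read in this form, we record one standard way of writing them explicitly, following the conventions of Fischer & Steinwart (2020); the notation below ($T$, $\lambda_i$, $e_i$) is the one introduced for the integral operator above, and should be understood as our reconstruction rather than a verbatim quotation.
\[
T^{\alpha/2} f = \sum_{i \in N} \lambda_i^{\alpha/2} \langle f, e_i \rangle_{L^2} \, e_i,
\qquad
[\mathcal{H}]^{\alpha} = \operatorname{Ran} T^{\alpha/2}
= \bigg\{ \sum_{i \in N} a_i \lambda_i^{\alpha/2} e_i \;:\; \sum_{i \in N} a_i^2 < \infty \bigg\},
\]
\[
\Big\langle \sum_{i} a_i \lambda_i^{\alpha/2} e_i, \; \sum_{i} b_i \lambda_i^{\alpha/2} e_i \Big\rangle_{[\mathcal{H}]^{\alpha}} = \sum_{i \in N} a_i b_i ,
\]
so that $\{\lambda_i^{\alpha/2} e_i\}_{i \in N}$ forms an orthonormal basis of $[\mathcal{H}]^{\alpha}$, $[\mathcal{H}]^1 = \mathcal{H}$, and $[\mathcal{H}]^0$ is (the closure of $\mathcal{H}$ in) $L^2(\mathcal{X}, \mu)$.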
As an example, the Sobolev space is an RKHS if , and its interpolation space is still a Sobolev space given by , see Section 3.1 for detailed discussions.
2.2 Assumptions
This subsection lists the standard assumptions that frequently appear in the related literature.
Assumption 2.1 (Eigenvalue decay rate (EDR)).
Suppose that the eigenvalue decay rate (EDR) of $\mathcal{H}$ with respect to $\mu$ is $\beta > 1$, i.e., there are positive constants $c$ and $C$ such that
\[
c \, i^{-\beta} \;\le\; \lambda_i \;\le\; C \, i^{-\beta}, \qquad i = 1, 2, \ldots
\]
Note that the eigenvalues and EDR are only determined by the marginal distribution and the RKHS . The polynomial eigenvalue decay rate assumption is standard in related literature and is also referred to as the capacity condition or effective dimension condition.
Assumption 2.2 (Embedding index).
We say that has the embedding property for some , if there is a constant such that
(6) |
where denotes the operator norm of the embedding. Then we define the embedding index of an RKHS as
In fact, for any , we can define as the smallest constant such that
if there is no such constant, set . Then Fischer & Steinwart (2020, Theorem 9) shows that for ,
The larger $\alpha$ is, the weaker the embedding property is. Note that since $\|f\|_{L^\infty} \le \kappa \|f\|_{\mathcal{H}}$ for all $f \in \mathcal{H}$, the embedding property is always true for $\alpha = 1$. In addition, Fischer & Steinwart (2020, Lemma 10) also shows that the embedding index $\alpha_0$ cannot be less than $1/\beta$.
Note that the embedding property (6) holds for any $\alpha > \alpha_0$. This directly implies that all the functions in $[\mathcal{H}]^{\alpha}$ are bounded when $\alpha > \alpha_0$. However, the embedding property may not hold for $\alpha = \alpha_0$.
Assumption 2.3 (Source condition).
For , there is a constant such that and
Functions in $[\mathcal{H}]^{s}$ with smaller $s$ are less smooth and hence harder for an algorithm to estimate.
Assumption 2.4 (Moment of error).
The noise satisfies that there are constants such that for any ,
This is a standard assumption to control the noise such that the tail probability decays fast (Lin & Cevher, 2020; Fischer & Steinwart, 2020). It is satisfied by, for instance, Gaussian noise with bounded variance or sub-Gaussian noise. Some literature (e.g., Steinwart et al. 2009; Pillaud-Vivien et al. 2018; Jun et al. 2019, etc.) also uses the stronger assumption that $|y|$ is almost surely bounded, which directly implies both Assumption 2.4 and the boundedness of $f_\rho^*$.
2.3 Review of state-of-the-art results
For the convenience of comparing our results with previous works, we review state-of-the-art upper and lower bounds of the convergence rate of KRR (Fischer & Steinwart, 2020).
Proposition 2.5 (Upper bound).
Suppose that Assumptions 2.1, 2.2, 2.3, and 2.4 hold for and . Furthermore, suppose that there exists a constant such that . Let be the KRR estimator defined by (2). Then in the case of , by choosing , for any fixed , when is sufficiently large, with probability at least , we have
where is a constant independent of and .
Remark 2.6.
Proposition 2.7 (Minimax lower bound).
Let be a probability distribution on such that Assumptions 2.1 and 2.2 are satisfied for . Let consist of all the distributions on satisfying Assumptions 2.3 and 2.4 for and with marginal distribution . For a constant , let consist of all the distributions on such that . Then for any , there exists a constant such that, for all learning methods, for any fixed , when is sufficiently large, there is a distribution such that, with probability at least , we have
Remark 2.8.
Under the precondition , (1) when $s > \alpha_0$, since one can always choose an $\alpha$ with $\alpha_0 < \alpha \le s$, the upper bound of the convergence rate in Proposition 2.5 coincides with the minimax lower bound in Proposition 2.7, and hence KRR is minimax optimal; (2) but when $0 < s \le \alpha_0$, existing results fail to show the optimality of KRR. The same gap between the upper and lower bounds exists for gradient descent and stochastic gradient descent with multiple passes (Pillaud-Vivien et al., 2018).
Note that in Proposition 2.7, we consider the distributions in . If we consider all the distributions in , we have the following minimax lower bound, which is often referred to as the information-theoretic lower bound (see, e.g., Rastogi & Sampath 2017).
Proposition 2.9 (Information-theoretic lower bound).
Let be a probability distribution on such that Assumption 2.1 is satisfied. Let consist of all the distributions on satisfying Assumptions 2.3 and 2.4 for and with marginal distribution . Then there exists a constant such that, for all learning methods, for any fixed , when is sufficiently large, there is a distribution such that, with probability at least , we have
3 Main Results
The main results of this paper aim to remove the boundedness assumption on $f_\rho^*$. We state the main theorem in terms of a general RKHS and discuss the Sobolev space in detail as a particular case.
Theorem 3.1.
Remark 3.2.
Remark 3.3.
RKHSs with uniformly bounded eigenfunctions, i.e., $\sup_{i} \|e_i\|_{L^\infty} < \infty$, are frequently considered (Mendelson & Neeman, 2010; Steinwart et al., 2009; Pillaud-Vivien et al., 2018). For this kind of RKHS, Assumption 2.2 holds for any $\alpha > 1/\beta$ (Fischer & Steinwart, 2020, Lemma 10), hence the embedding index is $\alpha_0 = 1/\beta$. In addition, the integrability assumption in Theorem 3.1 then takes a particularly simple form. Recalling that the functions in $[\mathcal{H}]^{s}$ are square integrable, it is reasonable to assume that the functions in $[\mathcal{H}]^{s}$ are $L^q$-integrable for some larger $q$.
Remark 3.4.
3.1 Optimality for Sobolev RKHS
Let us first introduce some concepts of (fractional) Sobolev space (see, e.g., Adams & Fournier 2003). In this section, we assume that is a bounded domain with smooth boundary and the Lebesgue measure . Denote as the corresponding space. For , we denote as the Sobolev space with smoothness and . Then the (fractional) Sobolev space for any real number can be defined through real interpolation:
where . (For details of real interpolation of Banach spaces, we refer to Sawano (2018, Chapter 4.2.2)). It is well known that when , is a separable RKHS with respect to a bounded kernel, and the corresponding EDR is (see, e.g., Edmunds & Triebel 1996)
Furthermore, for the interpolation space of under Lebesgue measure defined by (4), we have (see, e.g., Steinwart & Scovel 2012, Theorem 4.6), for ,
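In explicit form, and under the stated assumption that $\mathcal{X} \subseteq \mathbb{R}^d$ is a bounded domain with smooth boundary and $\mu$ is (equivalent to) the Lebesgue measure, the two displayed facts above can be summarized as follows; this is our reconstruction, using $r$ for the smoothness of the Sobolev space, and the identification of spaces should be read as holding with equivalent norms.
\[
\mathcal{H} = H^{r}(\mathcal{X}), \ r > \tfrac{d}{2}
\quad \Longrightarrow \quad
\beta = \frac{2r}{d},
\qquad
[\mathcal{H}]^{s} = \big[ H^{r}(\mathcal{X}) \big]^{s} = H^{rs}(\mathcal{X})
\quad \text{for } 0 < s \le 1,
\]
so that the rate $n^{-\frac{s\beta}{s\beta+1}}$ appearing throughout the paper becomes $n^{-\frac{2rs}{2rs+d}}$ in the Sobolev case.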
Now we begin to introduce the embedding theorem of (fractional) Sobolev space from 7.57 of Adams (1975), which is crucial when applying Theorem 3.1 to Sobolev RKHS.
Proposition 3.5 (Embedding theorem).
Let be defined as above, we have
-
(i)
if , then for .
-
(ii)
if , then for .
-
(iii)
if for some nonnegative integer , then ,
where denotes the Hölder space and denotes the continuous embedding.
On the one hand, (iii) of Proposition 3.5 shows that, for a Sobolev RKHS and any ,
where . So the Assumption 2.2 holds for , and thus . On the other hand, (i) of Proposition 3.5 shows that if ,
where . In addition, (ii) and (iii) of Proposition 3.5 show that also holds if . Therefore, for any , we have
hold simultaneously.
Now we are ready to state a theorem as the corollary of Theorem 3.1 and Proposition 3.5, which implies the optimality of KRR for Sobolev RKHS and any .
Theorem 3.6.
Let be a bounded domain with a smooth boundary. The RKHS is for some . Suppose that the distribution satisfies that for , the noise satisfies Assumption 2.4, and the marginal distribution on has Lebesgue density for two constants and . Let be the KRR estimator defined by (2). Then, by choosing , for any fixed , when is sufficiently large, with probability at least , we have
where is a constant independent of and .
Note that we say that the distribution has Lebesgue density , if is equivalent to the Lebesgue measure , i.e., , and there exist constants such that .
Remark 3.7 (Optimality for Sobolev RKHS).
Without the boundedness assumption, Theorem 3.6 proves the same upper bound of the convergence rate $n^{-\frac{s\beta}{s\beta+1}}$. Together with the information-theoretic lower bound in Proposition 2.9, we prove the optimality of KRR for all $s \in (0,1)$, while the state-of-the-art result Fischer & Steinwart (2020, Corollary 5) can only prove it for $s > \alpha_0$. Since the saturation effect of KRR for $s > 2$ has been proved in a recent work (Li et al., 2023), the optimality of KRR for Sobolev spaces is now well understood.
3.2 Sketch of proof
In this subsection, we present the sketch of the proof of Theorem 3.1. The rigorous proof of Theorem 3.1, Proposition 2.9, and Theorem 3.6 will be in the appendix. We refer to Fischer & Steinwart (2020, Chapter 6) for the proof of Proposition 2.5 and 2.7.
Our proofs are based on the standard integral operator techniques dating back to Smale & Zhou (2007), and we refine the proof to handle the unbounded case. Let us first introduce some frequently used notations. Define the sample operator as
and its adjoint operator
Next we define the sample covariance operator as
(7) |
and the sample basis function
Then the KRR estimator (2) can be expressed by (see, e.g., Caponnetto & de Vito (2007, Chapter 5))
Define as the unique minimizer given by
Note that the integral operator can also be seen as a bounded linear operator on , can be expressed by
(8) |
where is the expectation of given by
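To fix notation, the standard explicit forms of these sample and population quantities (our reconstruction, consistent with Caponnetto & de Vito, 2007; the symbols $T_X$, $g_Z$, and $g$ are ours) are:
\[
T_X = \frac{1}{n} \sum_{i=1}^{n} k(x_i, \cdot) \otimes k(x_i, \cdot),
\qquad
g_Z = \frac{1}{n} \sum_{i=1}^{n} y_i \, k(x_i, \cdot),
\qquad
\hat{f}_{\lambda} = (T_X + \lambda)^{-1} g_Z,
\]
\[
f_{\lambda} = (T + \lambda)^{-1} g,
\qquad
g = \mathbb{E}\big[ y \, k(x, \cdot) \big] = \int_{\mathcal{X}} f_{\rho}^{*}(x) \, k(x, \cdot) \, d\mu(x),
\]
so that $f_{\lambda}$ is the population (noise-free) counterpart of $\hat{f}_{\lambda}$.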
The first step in our proof is to decompose the generalization error into two terms, which are often referred to as the approximation error and estimation error,
(9) |
Then we will show that by choosing ,
(10) |
and for any fixed , when is sufficiently large, with probability at least , we have
(11) |
Plugging (10) and (11) into (9) finishes the proof of Theorem 3.1.
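Written out, a standard instantiation of the decomposition (9) and of the two bounds (10) and (11) (our reconstruction, matching the rate claimed for Theorem 3.1) reads: with $\lambda = n^{-\frac{\beta}{s\beta+1}}$,
\[
\big\| \hat{f}_{\lambda} - f_{\rho}^{*} \big\|_{L^2}^{2}
\;\le\; 2 \big\| f_{\lambda} - f_{\rho}^{*} \big\|_{L^2}^{2}
\;+\; 2 \big\| \hat{f}_{\lambda} - f_{\lambda} \big\|_{L^2}^{2},
\]
\[
\big\| f_{\lambda} - f_{\rho}^{*} \big\|_{L^2}^{2} \;\lesssim\; \lambda^{s} = n^{-\frac{s\beta}{s\beta+1}},
\qquad
\big\| \hat{f}_{\lambda} - f_{\lambda} \big\|_{L^2}^{2} \;\lesssim\; \big(\ln \tfrac{1}{\delta}\big)^{2} \, n^{-\frac{s\beta}{s\beta+1}}
\ \text{ with probability at least } 1 - \delta .
\]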
The approximation error
Suppose that . Then using the expression (8) and simple inequality, we will show that
The estimation error
Denote and . We will first rewrite the estimation error as follows
(12) |
For the first term in (3.2), we have
For the second term in (3.2), using the concentration result between and , it can also be bounded by a constant. The main challenge of the proof is bounding the third term. We will show that the third term in (3.2) can be rewritten as
(13) |
The form of (3.2) allows us to use the well-known Bernstein type inequality, which controls the difference between a sum of i.i.d. random variables and its expectation. Traditionally, this step requires the boundedness of . We refine this procedure by a truncation method and prove that with high probability,
Specifically, we consider two subsets of $\mathcal{X}$: one on which $|f_\rho^*|$ is bounded by a truncation threshold, and its complement, where the threshold is allowed to diverge as $n \to \infty$. When the threshold is chosen as a proper order of $n$, we show that, on the one hand, the norm restricted to the first subset is still upper bounded as desired; on the other hand, the norm restricted to the second subset vanishes with probability tending to 1 as $n \to \infty$. We argue that such a threshold exists by taking advantage of the $L^q$-integrability of $f_\rho^*$, which is required in the statement of Theorem 3.1.
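A compressed version of this truncation argument, with a threshold $t_n$ that we introduce here only for illustration (the exact choice and constants in the proof may differ), is the following: write $\Omega_1 = \{x \in \mathcal{X} : |f_{\rho}^{*}(x)| \le t_n\}$ and $\Omega_2 = \mathcal{X} \setminus \Omega_1$; then Markov's inequality and a union bound give
\[
\mu(\Omega_2) \;\le\; \frac{\|f_{\rho}^{*}\|_{L^q}^{q}}{t_n^{q}},
\qquad
\mathbb{P}\big( x_1, \ldots, x_n \in \Omega_1 \big) \;\ge\; 1 - n \, \mu(\Omega_2) \;\ge\; 1 - n \, t_n^{-q} \, \|f_{\rho}^{*}\|_{L^q}^{q},
\]
so that choosing $t_n$ as a suitable power of $n$ makes the event that all samples fall in $\Omega_1$ hold with probability tending to one, while on $\Omega_1$ the Bernstein-type argument only sees the truncated, bounded function.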
4 Experiments
In our experiments, we aim to verify that for functions in $[\mathcal{H}]^{s}$ but not in $L^{\infty}(\mathcal{X}, \mu)$, the KRR estimator can still achieve the convergence rate $n^{-\frac{s\beta}{s\beta+1}}$. We show the results both for a Sobolev RKHS and for a general RKHS with uniformly bounded eigenfunctions, as mentioned in Remark 3.3.
4.1 Experiments in Sobolev space
Suppose that $\mathcal{X} = [0,1]$ and the marginal distribution $\mu$ is the uniform distribution on $[0,1]$. We consider the RKHS $\mathcal{H}$ to be the Sobolev space $H^{1}([0,1])$ with smoothness 1. Section 3.1 shows that the EDR is $\beta = 2$ and the embedding index is $\alpha_0 = 1/2$. We construct a function in $[\mathcal{H}]^{s}$ by
(14) |
for some . We will show in Appendix F that the series in (14) converges on . In addition, since , we also have . The explicit formula of the kernel associated to is given by Thomas-Agnan (1996, Corollary 2), i.e., .
We consider the following data generation procedure:
where is numerically approximated by the first 1000 terms in (14) with , and . We choose the regularization parameter as for a fixed . The sample size is chosen from 1000 to 5000, with intervals of 100. We numerically compute the generalization error by Simpson’s formula with testing points. For each , we repeat the experiments 50 times and present the average generalization error as well as the region within one standard deviation. To visualize the convergence rate , we perform logarithmic least-squares to fit the generalization error with respect to the sample size and display the value of .
We try several choices of the regularization parameter; Figure 1 presents the convergence curve under the best choice. It can be concluded that the convergence rate of KRR's generalization error is indeed approximately equal to the theoretical rate $n^{-\frac{s\beta}{s\beta+1}}$, without the boundedness assumption on the true function $f_\rho^*$. We also performed another experiment that uses cross validation to choose the regularization parameter. Figure 3 in Appendix F shows a similar result to Figure 1. We refer to Appendix F for more details of the experiments.
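The "logarithmic least-squares" step is simply an ordinary least-squares fit of $\log(\text{error})$ against $\log n$. A minimal sketch, with made-up error values standing in for the measured generalization errors, is:

```python
import numpy as np

# Hypothetical measured generalization errors for each sample size n.
ns = np.arange(1000, 5001, 100)
errors = 0.5 * ns ** (-0.75) * np.exp(0.05 * np.random.default_rng(1).standard_normal(ns.size))

# Fit log(error) = -r * log(n) + c; the (negated) slope is the displayed rate r.
slope, intercept = np.polyfit(np.log(ns), np.log(errors), deg=1)
print(f"estimated convergence rate r = {-slope:.3f}")
```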
Remark 4.1.
This setting is similar to Pillaud-Vivien et al. (2018). The difference is that we choose the source condition to be smaller than so that .
4.2 Experiments in general RKHS
Suppose that $\mathcal{X} = [0,1]$ and the marginal distribution is the uniform distribution on $[0,1]$. It is well known that the RKHS
\[
\mathcal{H} = \big\{ f : [0,1] \to \mathbb{R} \;\big|\; f \text{ is absolutely continuous}, \; f(0) = 0, \; f' \in L^2([0,1]) \big\}
\]
is associated with the kernel $k(x, y) = \min(x, y)$ (Wainwright, 2019). Further, its eigenvalues and eigenfunctions can be written as
\[
\lambda_k = \left( \frac{(2k-1)\pi}{2} \right)^{-2}
\quad \text{and} \quad
e_k(x) = \sqrt{2} \, \sin\!\left( \frac{(2k-1)\pi x}{2} \right), \qquad k = 1, 2, \ldots
\]
It is easy to see that the EDR is $\beta = 2$, the eigenfunctions are uniformly bounded (by $\sqrt{2}$), and the embedding index is $\alpha_0 = 1/\beta = 1/2$ (see Remark 3.3). We construct a function in $[\mathcal{H}]^{s}$ by
(15) |
for some . We will show in Appendix F that the series in (15) converges on . Since , we also have .
We use the same data generation procedure as Section 4.1:
where is numerically approximated by the first 1000 terms in (15) with , and .
Figure 2 presents the convergence curve under the best choice of . It can also be concluded that the convergence rate of KRR’s generalization error is indeed approximately equal to . Figure 4 in Appendix F shows the result of using cross validation.
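A self-contained sketch of this second experiment is given below. It assumes the eigen-decomposition stated above ($\lambda_k = ((2k-1)\pi/2)^{-2}$, $e_k(x) = \sqrt{2}\sin((2k-1)\pi x/2)$ for the kernel $k(x,y)=\min(x,y)$), uses placeholder values $s = 0.5$ and noise level $0.1$, builds a stand-in target from the same eigenbasis rather than the paper's exact series (15), and approximates the $L^2$ error on a grid instead of using Simpson's formula; the paper's actual constants and target may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
s, beta = 0.5, 2.0            # placeholder source condition and the EDR of this RKHS
K_TRUNC, sigma = 1000, 0.1    # series truncation and noise level (placeholders)

ks = np.arange(1, K_TRUNC + 1)
lams = (2.0 / ((2 * ks - 1) * np.pi)) ** 2          # eigenvalues of k(x, y) = min(x, y)

def eig(x):                                          # eigenfunctions, shape (len(x), K_TRUNC)
    return np.sqrt(2.0) * np.sin((2 * ks - 1) * np.pi * np.asarray(x)[:, None] / 2)

coef = lams ** (s / 2) * ks ** (-0.51)               # stand-in coefficients; the sum lies in [H]^s
def f_star(x):
    return eig(x) @ coef

def krr_error(n):
    X = rng.uniform(0, 1, n)
    y = f_star(X) + sigma * rng.standard_normal(n)
    K = np.minimum(X[:, None], X[None, :])
    lam = n ** (-beta / (s * beta + 1))              # regularization parameter from the theory
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    Xt = np.linspace(0, 1, 1001)                     # test grid approximating the L2 error
    pred = np.minimum(Xt[:, None], X[None, :]) @ alpha
    return np.mean((pred - f_star(Xt)) ** 2)

ns = np.arange(1000, 5001, 1000)
errs = [krr_error(n) for n in ns]
slope, _ = np.polyfit(np.log(ns), np.log(errs), 1)
print(f"fitted rate = {-slope:.2f}  (theory: {s * beta / (s * beta + 1):.2f})")
```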
5 Conclusion and Discussion
This paper considers the convergence rate and the minimax optimality of kernel ridge regression, especially in the misspecified case. When the true function $f_\rho^*$ falls into a less-smooth ($0 < s < 1$) interpolation space $[\mathcal{H}]^{s}$ of the RKHS $\mathcal{H}$, it has been an outstanding problem for years whether the boundedness assumption on $f_\rho^*$ in the proof can be removed or weakened so that KRR is optimal for all $s \in (0,1)$. This paper proves that for Sobolev RKHS, the boundedness assumption can be directly removed. This result implies that, on the one hand, the desired upper bound of the convergence rate holds for all the functions in $[\mathcal{H}]^{s}$ (which could only be proved for bounded functions before); on the other hand, KRR is minimax optimal for all source conditions $s \in (0,1)$ (which could only be proved for $s > \alpha_0$ before). Together with the saturation effect of KRR (Li et al., 2023) when $s > 2$, all convergence behaviors of KRR for Sobolev RKHS are understood. For general RKHS, we also prove that the boundedness assumption can be replaced by an $L^q$-integrability assumption, which turns out to be reasonable, at least for RKHS with uniformly bounded eigenfunctions. We verify the results through experiments for both Sobolev RKHS and general RKHS.
If an RKHS has embedding index $\alpha_0 = 1$, considering the embedding index will just recover the results of Lin et al. (2018). Note that we assume $\alpha_0 < 1$ throughout this paper, based on which we perform the refined analysis. Technical tools in this paper can be extended to general spectral regularization algorithms and to the $[\mathcal{H}]^{\gamma}$-norm generalization error. Another direct open question is to discuss whether the $L^q$-integrability assumption in Theorem 3.1 holds for general RKHS, which we conjecture to be true.
We also notice a line of work which studies the learning curves of kernel ridge regression (Spigler et al., 2020; Bordelon et al., 2020; Cui et al., 2021) and crossovers between different noise magnitudes. At present, their results all rely on a Gaussian design assumption (or some variation), which is a very strong assumption. Nevertheless, their empirical results are enlightening and attract interest in the study of the learning curves of kernel ridge regression. We believe that studying the misspecified case in our paper is a crucial step to remove the Gaussian design assumption and draw complete conclusions about the learning curves of kernel ridge regression. KRR is also connected with Gaussian process regression (Kanagawa et al., 2018). Jin et al. (2021) claimed to establish the learning curves for Gaussian process regression and thus for KRR. However, there is a gap in their proof where an essential covering argument is missing: their Corollary 20 provides a high probability bound for fixed , while the proof of their Lemma 40 and Lemma 41 mistakenly use Corollary 20 to assert that the bound holds simultaneously for all .
Since Jacot et al. (2018) introduced the neural tangent kernel (NTK), kernel regression has become a natural surrogate for neural networks in the ‘lazy training regime’. Our work is also motivated by our empirical studies of neural networks and NTK regression. Specifically, we found that the source condition $s$ with respect to the neural tangent kernel (or other frequently used kernels) is relatively small for some frequently used real datasets (MNIST, CIFAR-10). This observation urged us to determine the optimal convergence rate when the RKHS is misspecified.
Acknowledgements
This research was partially supported by Beijing Natural Science Foundation (Grant Z190001), the National Natural Science Foundation of China (Grant 11971257), National Key R&D Program of China (2020AAA0105200), and Beijing Academy of Artificial Intelligence.
References
- Adams (1975) Adams, R. A. Sobolev Spaces. Pure and Applied Mathematics. Academic Press, 1975. URL https://books.google.co.uk/books?id=JxzpSAAACAAJ.
- Adams & Fournier (2003) Adams, R. A. and Fournier, J. J. Sobolev Spaces. Elsevier, 2003.
- Bauer et al. (2007) Bauer, F., Pereverzyev, S., and Rosasco, L. On regularization algorithms in learning theory. Journal of complexity, 23(1):52–72, 2007.
- Blanchard & Mücke (2018) Blanchard, G. and Mücke, N. Optimal rates for regularization of statistical inverse learning problems. Foundations of Computational Mathematics, 18:971–1013, 2018.
- Bordelon et al. (2020) Bordelon, B., Canatar, A., and Pehlevan, C. Spectrum dependent learning curves in kernel regression and wide neural networks. In ICML, 2020.
- Caponnetto (2006) Caponnetto, A. Optimal rates for regularization operators in learning theory. 2006.
- Caponnetto & de Vito (2007) Caponnetto, A. and de Vito, E. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331–368, 2007.
- Celisse & Wahl (2020) Celisse, A. and Wahl, M. Analyzing the discrepancy principle for kernelized spectral filter learning algorithms. J. Mach. Learn. Res., 22:76:1–76:59, 2020.
- Cucker & Smale (2001) Cucker, F. and Smale, S. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39:1–49, 2001.
- Cui et al. (2021) Cui, H., Loureiro, B., Krzakala, F., and Zdeborov’a, L. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. In NeurIPS, 2021.
- Dicker et al. (2017) Dicker, L., Foster, D. P., and Hsu, D. J. Kernel ridge vs. principal component regression: Minimax bounds and the qualification of regularization operators. Electronic Journal of Statistics, 11:1022–1047, 2017.
- Edmunds & Triebel (1996) Edmunds, D. E. and Triebel, H. Function Spaces, Entropy Numbers, Differential Operators. Cambridge Tracts in Mathematics. Cambridge University Press, 1996. doi: 10.1017/CBO9780511662201.
- Fischer & Steinwart (2020) Fischer, S.-R. and Steinwart, I. Sobolev norm learning rates for regularized least-squares algorithms. Journal of Machine Learning Research, 21:205:1–205:38, 2020.
- Gerfo et al. (2008) Gerfo, L. L., Rosasco, L., Odone, F., Vito, E. D., and Verri, A. Spectral algorithms for supervised learning. Neural Computation, 20(7):1873–1897, 2008.
- Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
- Jin et al. (2021) Jin, H., Banerjee, P. K., and Montúfar, G. Learning curves for gaussian process regression with power-law priors and targets. ArXiv, abs/2110.12231, 2021.
- Jun et al. (2019) Jun, K.-S., Cutkosky, A., and Orabona, F. Kernel truncated randomized ridge regression: Optimal rates and low noise acceleration. ArXiv, abs/1905.10680, 2019.
- Kanagawa et al. (2018) Kanagawa, M., Hennig, P., Sejdinovic, D., and Sriperumbudur, B. K. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv preprint arXiv:1807.02582, 2018.
- Kohler & Krzyżak (2001) Kohler, M. and Krzyżak, A. Nonparametric regression estimation using penalized least squares. IEEE Trans. Inf. Theory, 47:3054–3059, 2001.
- Li et al. (2023) Li, Y., Zhang, H., and Lin, Q. On the saturation effect of kernel ridge regression. In The Eleventh International Conference on Learning Representations, 2023.
- Li et al. (2022) Li, Z., Meunier, D., Mollenhauer, M., and Gretton, A. Optimal rates for regularized conditional mean embedding learning. ArXiv, abs/2208.01711, 2022.
- Lin & Cevher (2020) Lin, J. and Cevher, V. Optimal convergence for distributed learning with stochastic gradient methods and spectral algorithms. Journal of Machine Learning Research, 21:147–1, 2020.
- Lin et al. (2018) Lin, J., Rudi, A., Rosasco, L., and Cevher, V. Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Applied and Computational Harmonic Analysis, 48:868–890, 2018.
- Liu & Shi (2022) Liu, J. and Shi, L. Statistical optimality of divide and conquer kernel-based functional linear regression. ArXiv, abs/2211.10968, 2022.
- Mendelson & Neeman (2010) Mendelson, S. and Neeman, J. Regularization in kernel learning. The Annals of Statistics, 38(1):526–565, February 2010.
- Pillaud-Vivien et al. (2018) Pillaud-Vivien, L., Rudi, A., and Bach, F. R. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. ArXiv, abs/1805.10074, 2018.
- Rastogi & Sampath (2017) Rastogi, A. and Sampath, S. Optimal rates for the regularized learning algorithms under general source condition. Frontiers in Applied Mathematics and Statistics, 3, 2017.
- Rosasco et al. (2005) Rosasco, L., De Vito, E., and Verri, A. Spectral methods for regularization in learning theory. DISI, Universita degli Studi di Genova, Italy, Technical Report DISI-TR-05-18, 2005.
- Sawano (2018) Sawano, Y. Theory of Besov spaces, volume 56. Springer, 2018.
- Smale & Zhou (2007) Smale, S. and Zhou, D.-X. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153–172, 2007.
- Spigler et al. (2020) Spigler, S., Geiger, M., and Wyart, M. Asymptotic learning curves of kernel methods: empirical data versus teacher–student paradigm. Journal of Statistical Mechanics: Theory and Experiment, 2020, 2020.
- Steinwart & Christmann (2008) Steinwart, I. and Christmann, A. Support vector machines. In Information Science and Statistics, 2008.
- Steinwart & Scovel (2012) Steinwart, I. and Scovel, C. Mercer’s theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363–417, 2012.
- Steinwart et al. (2009) Steinwart, I., Hush, D., and Scovel, C. Optimal rates for regularized least squares regression. In COLT, pp. 79–93, 2009.
- Talwai & Simchi-Levi (2022) Talwai, P. M. and Simchi-Levi, D. Optimal learning rates for regularized least-squares with a fourier capacity condition. 2022.
- Thomas-Agnan (1996) Thomas-Agnan, C. Computing a family of reproducing kernels for statistical applications. Numerical Algorithms, 13:21–32, 1996.
- Tsybakov (2009) Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York ; London, 1st edition, 2009.
- Wainwright (2019) Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.
Appendix A Proof of Theorem 3.1
Throughout the proof, we denote
(16) |
where is the regularization parameter and is defined by (3), (7). We use to denote the operator norm of a bounded linear operator from a Banach space to , i.e., . Without bringing ambiguity, we will briefly denote the operator norm as . In addition, we use and to denote the trace and the trace norm of an operator. We use to denote the Hilbert-Schmidt norm. In addition, we denote as , as for brevity throughout the proof.
The following lemma will be frequently used, so we present it at the beginning of our proof.
Lemma A.1.
For any and , we have
(18) |
Proof.
Since for any and , the lemma follows from
(19) |
∎
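For the reader's convenience, the elementary inequality presumably being invoked here is the standard bound (stated in our notation):
\[
\sup_{t \ge 0} \frac{t^{a}}{t + \lambda} \;\le\; \lambda^{a-1}
\qquad \text{for any } \lambda > 0 \text{ and } 0 \le a \le 1,
\]
which follows from the weighted AM-GM inequality $t + \lambda \ge t^{a} \lambda^{1-a}$ and, applied to the eigenvalues of $T$, yields operator-norm bounds of the form $\big\| T^{a} (T + \lambda)^{-1} \big\| \le \lambda^{a-1}$.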
A.1 Proof of the approximation error
The following theorem gives the bound of -norm of when . As a special case, the approximation error follows from the result when .
Theorem A.2.
Suppose that Assumption 2.3 holds for . Denote , then for any and , we have
(20) |
A.2 Proof of the estimation error
Theorem A.3.
Proof.
First, rewrite the estimation error as follows
(23) |
For the first term in (23), we have
(24) |
For the second term in (23), for any , using Proposition D.5, we know that when satisfy that
(25) |
we have
(26) |
with probability at least . Therefore,
(27) |
with probability at least , where is the identity operator . Note that since we have , there always exists such that (25) is satisfied when and is sufficiently large.
A.3 Proof of Theorem 3.1
Appendix B Proof of Theorem 3.6
Note that the RKHS is defined as the (fractional) Sobolev space , which is regardless of the marginal distribution . But the definition of interpolation space (4) is dependent on . When has Lebesgue density , Fischer & Steinwart (2020, (14)) shows that
(31) |
and
(32) |
where we denote as the interpolation space of under marginal distribution . So we can also apply Proposition 3.5 to and the embedding property of it is the same as . Denote , since , for any , we have
(33) |
hold simultaneously.
Appendix C Proof of Proposition 2.9
We will construct a family of probability distributions on and apply Lemma E.6. Recall that is a probability distribution on such that Assumption 2.1 is satisfied. Denote the class of functions
(34) |
and for every , define the probability distribution on such that
(35) |
where and . It is easy to show that such falls into the family in Proposition 2.9. (Assumption 2.1 and 2.3 are satisfied obviously. Assumption 2.4 follows from results of moments of Gaussian random variables, see, e.g., Fischer & Steinwart (2020, Lemma 21)).
Using Lemma E.8, for , there exists for some such that
(36) |
For , define the functions as
(37) |
Since
(38) |
where in (38) represents the constant in Assumption 2.1. So if is small enough such that
(39) |
then we have
Denote as a family of probability distributions indexed by ; then (41) implies that the second condition in Lemma E.6 holds. Further, using (36), we have
(43) |
where is a constant independent of .
Applying Lemma E.6 to (41) and (43), we have
(44) |
When is sufficiently large so that is sufficiently large, the probability on the R.H.S. of (44) is larger than . For , choose ; without loss of generality, we assume . Then (44) shows that there exists a constant such that, for any estimator , we can find a function and the corresponding distribution such that, with probability at least ,
(45) |
So we finish the proof.
Appendix D Useful propositions for upper bounds
This proposition bounds the norm of when .
Proposition D.1.
Proof.
Suppose that . Since and , we have
where we use Lemma A.1 for the last inequality. Further, using Assumption 2.2, we have the desired bound.
∎
The following proposition is an application of the classical Bernstein type inequality but considering a truncation version of , which will bring refined analysis compared to previous work.
Proposition D.2.
Proof.
Note that can represent a -equivalence class in . When defining the set , we actually denote as the representative
To use Lemma E.4, we need to bound the m-th moment of .
(49) |
Using the inequality , we have
(50) |
Plugging (D) into (D), we have
(51) | ||||
(52) |
Now we begin to bound (52). Note that we have proved in Lemma E.1 that for -almost ,
(53) |
In addition, we also have
(54) |
So we have
(55) |
Using Assumption 2.4, we have
(56) |
so we get the upper bound of (52), i.e.,
(57) |
Now we begin to bound (51).
-
(1)
When , using the definition of and Proposition D.1, we have
(58) -
(2)
When , without loss of generality, we assume . Using Theorem A.2 for , we have
(59)
Therefore, for all we have
(60) |
In addition, using Theorem A.2 for , we also have
(61) |
So we get the upper bound of (51), i.e.,
(62) |
Denote
(63) |
then we have . Using Lemma E.4, we finish the proof. ∎
Remark D.3.
Based on Proposition D.2, the following Proposition will give an upper bound of .
Proposition D.4.
Proof.
Denote and , then (64) is equivalent to
(65) |
Consider the subset and . Since , we have
(66) |
Decomposing as , we have
(67) |
Given , we first fix an such that
(68) |
For the first term in (67), denoted as I, using Theorem D.2, for all , with probability at least , we have
(69) |
where . Recalling that , simple calculation shows that by choosing ,
Further calculations show that
(73) |
and
(74) |
To make when , letting , we have the first restriction of :
(75) |
That is to say, if we choose , we have
(76) |
For the second term in (67), denoted as II, we have
(77) |
Letting , we have , i.e. . This gives the second restriction of , i.e.,
(78) |
For the third term in (67), denoted as III, since we already know that , we have
III | ||||
(79) |
Using Cauchy-Schwarz and the bound of approximation error (Theorem A.2), we have
(80) |
In addition, we have
(81) |
Plugging (80) and (81) into (D), we have
(82) |
Comparing (82) with and in (69), we know that if , then (D) . So the third term will not give a further restriction on .
To sum up, if we choose such that restrictions (75) and (78) are satisfied, then we can prove that (65) is satisfied with probability at least . Since for a fixed , when is sufficiently large, is sufficiently small such that, e.g., . Without loss of generality, we say (65) is satisfied with probability at least .
Note that, such exists if
(83) |
Recalling that for (68), we only assume there exists satisfying , so such exists if and only if
(84) |
which is what we assume in the theorem. ∎
Proposition D.5.
Suppose that the embedding index is . Then for any and all , with probability at least , we have
(85) |
where
(86) |
Proof.
Appendix E Auxiliary lemma
E.1 Lemmas for upper bound
The following lemma is where we take advantage of the embedding index and embedding property in Assumption 2.2.
Lemma E.1.
Suppose that the embedding index is . Then for any , for -almost , we have
(91) |
Proof.
Recalling that , we have
(92) |
where we use Lemma A.1 for the last inequality, and we finish the proof. ∎
Lemma E.1 has a direct corollary.
Corollary E.2.
Suppose that the embedding index is . Then for any , for -almost , we have
(93) |
Proof.
The following concentration inequality about self-adjoint Hilbert-Schmidt operator valued random variables is frequently used in related literature, e.g., Fischer & Steinwart (2020, Theorem 27) and Lin & Cevher (2020, Lemma 26).
Lemma E.3.
Let be a probability space, be a separable Hilbert space. Suppose that are i.i.d. random variables with values in the set of self-adjoint Hilbert-Schmidt operators. If , and the operator norm , and there exists a self-adjoint positive semi-definite trace class operator with . Then for , with probability at least , we have
(95) |
The following Bernstein inequality about vector-valued random variables is frequently used, e.g., Caponnetto & de Vito (2007, Proposition 2) and Fischer & Steinwart (2020, Theorem 26).
Lemma E.4 (Bernstein inequality).
Let be a probability space, be a separable Hilbert space, and be a random variable with
for all . Then for , are i.i.d. random variables, with probability at least , we have
(96) |
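Since the displayed condition and conclusion are hard to read here, we record one common form of this Hilbert-space-valued Bernstein inequality; constants vary slightly across references such as Fischer & Steinwart (2020, Theorem 26), so the version below is indicative rather than the exact statement used.
\[
\text{If } \ \mathbb{E} \|\xi\|_{H}^{m} \;\le\; \tfrac{1}{2} \, m! \, \sigma^{2} L^{m-2} \ \text{ for all } m \ge 2, \ \text{ then with probability at least } 1 - \delta,
\]
\[
\bigg\| \frac{1}{n} \sum_{i=1}^{n} \xi_i - \mathbb{E}\xi \bigg\|_{H}
\;\le\; C \left( \frac{L \ln(2/\delta)}{n} + \sqrt{\frac{\sigma^{2} \ln(2/\delta)}{n}} \right)
\]
for an absolute constant $C$.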
Lemma E.5.
If , we have
(97) |
Proof.
Since , we have
(98) | ||||
(99) |
for some constant . Similarly, we have
(100) |
for some constant . ∎
E.2 Lemmas for minimax lower bound
The following lemma is a standard approach to derive the minimax lower bound, which can be found in Tsybakov (2009, Theorem 2.5).
Lemma E.6.
Suppose that there is a non-parametric class of functions and a (semi-)distance on . is a family of probability distributions indexed by . Assume that and suppose that contains elements such that,
-
(1)
;
-
(2)
, and
(101)
with and . Then
(102) |
Lemma E.7.
Suppose that is a distribution on and . Suppose that
(103) |
where are independent Gaussian random error. Denote the two corresponding distributions on as . The KL divergence of two probability distributions on is
(104) |
if and otherwise . Then we have
(105) |
where denotes the independent product of distributions .
Proof.
The lemma directly follows from the definition of KL divergence and the fact that
(106) |
∎
The following lemma is a result from Tsybakov (2009, Lemma 2.9)
Lemma E.8.
Denote . Let , there exists a subset of such that ,
(107) |
and .
Appendix F Details of experiments
F.1 Experiments in Sobolev RKHS
First, we prove that the series in (14) converges and is continuous on for .
We begin with the computation of the sum of first terms of , note that
(108) |
So we have
(109) |
Similarly, we have
(110) |
Note that (109) and (110) are uniformly bounded in for any and . In addition, is monotone and decreases to zero. Using the Dirichlet criterion, we know that the series in (14) is uniformly convergent in . Due to the arbitrariness of , we know that the series converges and is continuous on .
In Figure 3 (a), we present the results of different choices of for in the experiment of Section 4.1. Figure 1 corresponds to the curve in Figure 3 (a) which has the smallest generalization error. In Figure 3 (b), we use 5-fold cross validation to choose the regularization parameter in KRR and present the logarithmic errors and sample sizes. Again, we use logarithmic least-squares to compute the convergence rate , which is still approximately equal to .


F.2 Experiments in general RKHS
First, we prove that the series in (15) converges and is continuous on for .
We begin with the computation of the sum of first terms of ,
(111) |
So we have
(112) |
which is uniformly bounded in for any and .
Note that is monotone and decreases to zero. Using the Dirichlet criterion, we know that the series in (15) is uniformly convergent in . Due to the arbitrariness of , we know that the series converges and is continuous on .
In Figure 4 (a), we present the results of different choices of for in the experiment of Section 4.2. Figure 2 corresponds to the curve in Figure 4 (a), which has the smallest generalization error. In Figure 4 (b), we use 5-fold cross validation to choose the regularization parameter in KRR and present the logarithmic errors and sample sizes. Again, we use logarithmic least-squares to compute the convergence rate , which is still approximately equal to .

