
On the Optimality of Misspecified Kernel Ridge Regression

Haobo Zhang    Yicheng Li    Weihao Lu    Qian Lin
Abstract

In the misspecified kernel ridge regression problem, researchers usually assume the underlying true function $f_{\rho}^{*}\in[\mathcal{H}]^{s}$, a less-smooth interpolation space of a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$ for some $s\in(0,1)$. The existing minimax optimal results require $\|f_{\rho}^{*}\|_{L^{\infty}}<\infty$, which implicitly requires $s>\alpha_{0}$, where $\alpha_{0}\in(0,1)$ is the embedding index, a constant depending on $\mathcal{H}$. Whether KRR is optimal for all $s\in(0,1)$ has been an outstanding problem for years. In this paper, we show that KRR is minimax optimal for any $s\in(0,1)$ when $\mathcal{H}$ is a Sobolev RKHS.

Kernel ridge regression, Misspecified, Reproducing kernel Hilbert space, Sobolev space, minimax optimality

1 Introduction

Suppose that the samples $\{(x_{i},y_{i})\}_{i=1}^{n}$ are i.i.d. sampled from an unknown distribution $\rho$ on $\mathcal{X}\times\mathcal{Y}$, where $\mathcal{X}\subseteq\mathbb{R}^{d}$ and $\mathcal{Y}\subseteq\mathbb{R}$. The regression problem aims to find a function $\hat{f}$ such that the risk

\mathcal{E}(\hat{f})=\mathbb{E}_{(x,y)\sim\rho}\left[\left(\hat{f}(x)-y\right)^{2}\right]

is relatively small. It is well known that the conditional mean function given by $f_{\rho}^{*}(x)\coloneqq\mathbb{E}_{\rho}[\,y\mid x\,]=\int_{\mathcal{Y}}y\,\mathrm{d}\rho(y\mid x)$ minimizes the risk $\mathcal{E}(f)$. Therefore, we may focus on establishing the convergence rate (either in expectation or in probability) for the excess risk (generalization error)

\mathbb{E}_{x\sim\mu}\left[\left(\hat{f}(x)-f_{\rho}^{*}(x)\right)^{2}\right], (1)

where $\mu$ is the marginal distribution of $\rho$ on $\mathcal{X}$.

In the non-parametric regression setting, researchers often assume that $f_{\rho}^{*}$ falls into a class of functions with certain structure and develop non-parametric methods to obtain the estimator $\hat{f}$. One of the most popular non-parametric regression methods, the kernel method, aims to estimate $f_{\rho}^{*}$ using candidate functions from a reproducing kernel Hilbert space (RKHS) $\mathcal{H}$, a separable Hilbert space associated with a kernel function $k$ defined on $\mathcal{X}$; see, e.g., Kohler & Krzyżak (2001); Cucker & Smale (2001); Caponnetto & de Vito (2007); Steinwart & Christmann (2008). This paper focuses on kernel ridge regression (KRR), which constructs an estimator $\hat{f}_{\lambda}$ by solving the penalized least squares problem

\hat{f}_{\lambda}=\underset{f\in\mathcal{H}}{\arg\min}\left(\frac{1}{n}\sum_{i=1}^{n}\left(y_{i}-f\left(x_{i}\right)\right)^{2}+\lambda\|f\|_{\mathcal{H}}^{2}\right), (2)

where $\lambda>0$ is referred to as the regularization parameter.

Since the minimax optimality of KRR was proved for $f_{\rho}^{*}\in[\mathcal{H}]^{s}$, $1\leq s\leq 2$ (Caponnetto & de Vito, 2007), a large body of literature has studied the convergence rate of the generalization error of misspecified KRR ($f_{\rho}^{*}\notin\mathcal{H}$) and whether the rate is optimal in the minimax sense. It turns out that the eigenvalue decay rate ($\beta>1$), the source condition ($s>0$), and the embedding index ($\alpha_{0}<1$) of the RKHS jointly determine the convergence behavior of misspecified KRR (see Section 2.2 for definitions). If we only assume that $f_{\rho}^{*}$ belongs to an interpolation space $[\mathcal{H}]^{s}$ of the RKHS $\mathcal{H}$ for some $s>0$, the well-known information-theoretic lower bound shows that the minimax lower bound is $n^{-\frac{s\beta}{s\beta+1}}$. The state-of-the-art work Fischer & Steinwart (2020) has already shown that when $\alpha_{0}<s\leq 2$, the upper bound of the convergence rate of KRR is $n^{-\frac{s\beta}{s\beta+1}}$ and hence is optimal. However, when $f_{\rho}^{*}\in[\mathcal{H}]^{s}$ for some $0<s\leq\alpha_{0}$, all existing works need an additional boundedness assumption on $f_{\rho}^{*}$ to prove the same upper bound $n^{-\frac{s\beta}{s\beta+1}}$. The boundedness assumption results in a smaller function space, i.e., $[\mathcal{H}]^{s}\cap L^{\infty}(\mathcal{X},\mu)\subsetneqq[\mathcal{H}]^{s}$ when $s\leq\alpha_{0}$. Fischer & Steinwart (2020) further reveal that the minimax rate of the excess risk associated with this smaller function space is larger than $n^{-\frac{\alpha\beta}{\alpha\beta+1}}$ for any $\alpha>\alpha_{0}$. This lower bound on the minimax rate is smaller than the upper bound of the convergence rate, and hence the minimax optimality of KRR when $s\leq\alpha_{0}$ could not be concluded.

It has been an outstanding problem for years whether KRR is minimax optimal for all $s\in(0,1)$ (Pillaud-Vivien et al., 2018; Fischer & Steinwart, 2020; Liu & Shi, 2022). This paper concludes that, for Sobolev RKHS, KRR is optimal for all $0<s<1$. Thus KRR is optimal for Sobolev RKHS and all $0<s\leq 2$. Together with a recent work on the saturation effect showing that KRR cannot be optimal for $s>2$ (Li et al., 2023), the optimality of KRR for Sobolev RKHS is well understood.

1.1 Related work

Kernel ridge regression has been studied as a special kind of spectral regularization algorithm (Rosasco et al., 2005; Caponnetto, 2006; Bauer et al., 2007; Gerfo et al., 2008; Mendelson & Neeman, 2010). In a large part of the literature, the ‘hardness’ of the regression problem is determined by two parameters: 1. the source condition ($s$), which characterizes a function’s relative ‘smoothness’ with respect to the RKHS; 2. the eigenvalue decay rate ($\beta$) (or, equivalently, the capacity or effective dimension), which characterizes the RKHS itself. These two parameters divide the convergence behavior of KRR or spectral regularization algorithms into different regimes and lead to different convergence rates (Dicker et al., 2017; Blanchard & Mücke, 2018; Lin et al., 2018; Lin & Cevher, 2020; Li et al., 2023, etc.). Caponnetto & de Vito (2007) first prove the optimality when $1\leq s\leq 2$, and Lin et al. (2018) extend the desired upper bound of the convergence rate to the regime $s+\frac{1}{\beta}>1$.

KRR in the misspecified case ($f_{\rho}^{*}\notin\mathcal{H}$, or $s<1$) has also been discussed by another important line of work which considers the embedding index ($\alpha_{0}$) of the RKHS and performs a refined analysis (Steinwart et al., 2009; Dicker et al., 2017; Pillaud-Vivien et al., 2018; Fischer & Steinwart, 2020; Celisse & Wahl, 2020; Li et al., 2022). The desired upper bound $n^{-\frac{s\beta}{s\beta+1}}$ is extended to the regime $s+\frac{1}{\beta}>\alpha_{0}$, and the minimax optimality is extended to the regime $s>\alpha_{0}$. It is worth pointing out that when $f_{\rho}^{*}$ falls into a less-smooth interpolation space, which does not imply the boundedness of the functions therein, all existing works (whether or not they consider the embedding index) require an additional boundedness assumption, i.e., $\|f_{\rho}^{*}\|_{L^{\infty}(\mathcal{X},\mu)}\leq B_{\infty}<\infty$ (Lin & Cevher, 2020; Fischer & Steinwart, 2020; Talwai & Simchi-Levi, 2022; Li et al., 2022, etc.). As discussed in the introduction, this leads to suboptimality in the $s\leq\alpha_{0}$ regime.

This paper follows the line of work that considers the embedding index, refines the proof to remove the additional boundedness assumption, and solves the optimality problem for Sobolev RKHS and all $s\in(0,1)$. In addition, our technical improvement also sheds light on the optimality for general RKHS. Specifically, we replace the boundedness assumption with a far weaker $L^{q}$-integrability assumption, which turns out to be reasonable for many RKHS. Note that our results focus on the most frequently used $L^{2}$ convergence rate for KRR; they can easily be extended to the $[\mathcal{H}]^{\gamma}$, $\gamma\geq 0$, convergence rate (e.g., Fischer & Steinwart 2020) and to general spectral regularization algorithms (e.g., Lin et al. 2018).

2 Preliminaries

2.1 Basic concepts

Let $\mathcal{X}\subseteq\mathbb{R}^{d}$ be the input space and $\mathcal{Y}\subseteq\mathbb{R}$ be the output space. Let $\rho$ be an unknown probability distribution on $\mathcal{X}\times\mathcal{Y}$ satisfying $\int_{\mathcal{X}\times\mathcal{Y}}y^{2}\,\mathrm{d}\rho(x,y)<\infty$, and denote the corresponding marginal distribution on $\mathcal{X}$ as $\mu$. We use $L^{p}(\mathcal{X},\mu)$ (in short $L^{p}$) to represent the $L^{p}$-spaces. Then the generalization error (1) can be written as

\left\|\hat{f}-f_{\rho}^{*}\right\|_{L^{2}}^{2}.

Throughout the paper, we denote by $\mathcal{H}$ a separable RKHS on $\mathcal{X}$ with respect to a continuous and bounded kernel function $k$ satisfying

\sup_{x\in\mathcal{X}}k(x,x)\leq\kappa^{2}.

Define the integral operator $T:L^{2}(\mathcal{X},\mu)\to L^{2}(\mathcal{X},\mu)$ as

(Tf)(x):=\int_{\mathcal{X}}k(x,y)f(y)\,\mathrm{d}\mu(y). (3)

It is well known that $T$ is a positive, self-adjoint, trace-class, and hence compact operator (Steinwart & Scovel, 2012). The spectral theorem for self-adjoint compact operators and Mercer's decomposition theorem yield that

k(x,y)=\sum_{i\in N}\lambda_{i}e_{i}(x)e_{i}(y),\qquad T=\sum_{i\in N}\lambda_{i}\left\langle\cdot,e_{i}\right\rangle_{L^{2}}e_{i},

where $N$ is an at most countable index set, the eigenvalues $\{\lambda_{i}\}_{i\in N}\subseteq(0,\infty)$ form a non-increasing summable sequence, and $\{e_{i}\}_{i\in N}$ are the corresponding eigenfunctions. Denote the samples as $X=\left(x_{1},\cdots,x_{n}\right)$ and $\mathbf{y}=\left(y_{1},\cdots,y_{n}\right)^{\prime}$. The representer theorem (see, e.g., Steinwart & Christmann 2008) gives an explicit expression for the KRR estimator defined by (2), i.e.,

\hat{f}_{\lambda}(x)=\mathbb{K}(x,X)(\mathbb{K}(X,X)+n\lambda I)^{-1}\mathbf{y},

where

\mathbb{K}(X,X)=\left(k\left(x_{i},x_{j}\right)\right)_{n\times n},

and

\mathbb{K}(x,X)=\left(k\left(x,x_{1}\right),\cdots,k\left(x,x_{n}\right)\right).
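As a computational aside (not part of the original analysis), the closed-form expression above translates directly into a few lines of code. The following is a minimal sketch we add for concreteness; the Gaussian kernel and all numerical values are placeholder assumptions, and any bounded kernel $k$ works the same way.

```python
import numpy as np

def krr_fit(X, y, kernel, lam):
    """Return the KRR predictor f_hat(x) = K(x, X) (K(X, X) + n * lam * I)^{-1} y."""
    n = len(X)
    K = kernel(X[:, None], X[None, :])                   # Gram matrix K(X, X), shape (n, n)
    coef = np.linalg.solve(K + n * lam * np.eye(n), y)   # (K(X, X) + n * lam * I)^{-1} y

    def f_hat(x_new):
        return kernel(np.asarray(x_new)[:, None], X[None, :]) @ coef   # K(x, X) times coefficients

    return f_hat

# Toy usage with a placeholder Gaussian kernel.
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, 200)
y = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(200)
gauss = lambda s, t: np.exp(-((s - t) ** 2) / 0.1)
f_hat = krr_fit(X, y, gauss, lam=1e-3)
print(f_hat(np.linspace(0, 1, 5)))
```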

We also need to introduce the interpolation spaces of an RKHS. For any $s\geq 0$, the fractional power of the integral operator, $T^{s}:L^{2}(\mathcal{X},\mu)\to L^{2}(\mathcal{X},\mu)$, is defined as

T^{s}(f)=\sum_{i\in N}\lambda_{i}^{s}\left\langle f,e_{i}\right\rangle_{L^{2}}e_{i},

and the interpolation space $[\mathcal{H}]^{s}$, $s\geq 0$, of $\mathcal{H}$ is defined as

[\mathcal{H}]^{s}\coloneqq\operatorname{Ran}T^{\frac{s}{2}}=\left\{\sum_{i\in N}\lambda_{i}^{\frac{s}{2}}a_{i}e_{i}\mid\sum_{i\in N}a_{i}^{2}<\infty\right\}, (4)

with the inner product

\langle f,g\rangle_{[\mathcal{H}]^{s}}=\left\langle T^{-\frac{s}{2}}f,T^{-\frac{s}{2}}g\right\rangle_{L^{2}}. (5)

It is easy to show that $[\mathcal{H}]^{s}$ is also a separable Hilbert space with orthonormal basis $\{\lambda_{i}^{\frac{s}{2}}e_{i}\}_{i\in N}$. In particular, we have $[\mathcal{H}]^{0}\subseteq L^{2}(\mathcal{X},\mu)$ and $[\mathcal{H}]^{1}\subseteq\mathcal{H}$. For $0<s_{1}<s_{2}$, the embeddings $[\mathcal{H}]^{s_{2}}\hookrightarrow[\mathcal{H}]^{s_{1}}\hookrightarrow[\mathcal{H}]^{0}$ exist and are compact (Fischer & Steinwart, 2020). For functions in $[\mathcal{H}]^{s}$ with larger $s$, we say they have higher regularity (smoothness) with respect to the RKHS.

As an example, the Sobolev space $H^{m}(\mathcal{X})$ is an RKHS if $m>\frac{d}{2}$, and its interpolation spaces are still Sobolev spaces, given by $[H^{m}(\mathcal{X})]^{s}\cong H^{ms}(\mathcal{X})$ for all $s>0$; see Section 3.1 for a detailed discussion.

2.2 Assumptions

This subsection lists the standard assumptions that frequently appear in the related literature.

Assumption 2.1 (Eigenvalue decay rate (EDR)).

Suppose that the eigenvalue decay rate (EDR) of $\mathcal{H}$ is $\beta>1$, i.e., there are positive constants $c$ and $C$ such that

ci^{-\beta}\leq\lambda_{i}\leq Ci^{-\beta},\quad\forall i\in N.

Note that the eigenvalues $\lambda_{i}$ and the EDR are only determined by the marginal distribution $\mu$ and the RKHS $\mathcal{H}$. The polynomial eigenvalue decay rate assumption is standard in the related literature and is also referred to as the capacity condition or effective dimension condition.
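For intuition (an illustration we add, not part of the paper's argument), the EDR can be checked numerically: the eigenvalues of the scaled Gram matrix $\frac{1}{n}\mathbb{K}(X,X)$ approximate those of $T$. The sketch below uses the kernel $k(x,y)=\min(x,y)$ under the uniform distribution on $[0,1]$, whose exact eigenvalues $\lambda_{i}=\left(\frac{2i-1}{2}\pi\right)^{-2}$ decay with $\beta=2$ (see Section 4.2).

```python
import numpy as np

# Heuristic check of the eigenvalue decay rate for k(x, y) = min(x, y) under
# the uniform distribution on [0, 1]; the exact rate is beta = 2 (Section 4.2).
n = 2000
x = np.random.default_rng(0).uniform(0, 1, n)
K = np.minimum(x[:, None], x[None, :])
eigvals = np.linalg.eigvalsh(K / n)[::-1]        # eigenvalues of K/n approximate those of T
idx = np.arange(5, 200)                          # fit over a middle range of the spectrum
slope, _ = np.polyfit(np.log(idx + 1), np.log(eigvals[idx]), 1)
print("estimated EDR:", -slope)                  # should be close to beta = 2
```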

Assumption 2.2 (Embedding index).

We say that $[\mathcal{H}]^{\alpha}$ has the embedding property for some $\alpha\in[\frac{1}{\beta},1]$ if there is a constant $0<A<\infty$ such that

\left\|[\mathcal{H}]^{\alpha}\hookrightarrow L^{\infty}(\mathcal{X},\mu)\right\|\leq A, (6)

where $\|\cdot\|$ denotes the operator norm of the embedding. We then define the embedding index of an RKHS $\mathcal{H}$ as

\alpha_{0}=\inf\left\{\alpha:[\mathcal{H}]^{\alpha}~\text{has the embedding property}\right\}.

In fact, for any $\alpha>0$, we can define $M_{\alpha}$ as the smallest constant $A>0$ such that

\sum_{i\in N}\lambda_{i}^{\alpha}e_{i}^{2}(x)\leq A^{2},\quad\mu\text{-a.e. }x\in\mathcal{X};

if there is no such constant, set $M_{\alpha}=\infty$. Then Fischer & Steinwart (2020, Theorem 9) shows that for $\alpha>0$,

\left\|[\mathcal{H}]^{\alpha}\hookrightarrow L^{\infty}(\mathcal{X},\mu)\right\|=M_{\alpha}.

The larger $\alpha$ is, the weaker the embedding property is. Note that since $\sup_{x\in\mathcal{X}}k(x,x)\leq\kappa^{2}$, $M_{\alpha}\leq\kappa<\infty$ always holds for $\alpha\geq 1$. In addition, Fischer & Steinwart (2020, Lemma 10) also shows that $\alpha$ cannot be less than $\frac{1}{\beta}$.

Note that the embedding property (6) holds for any $\alpha>\alpha_{0}$. This directly implies that all the functions in $[\mathcal{H}]^{\alpha}$, $\alpha>\alpha_{0}$, are $\mu$-a.e. bounded. However, the embedding property may not hold for $\alpha=\alpha_{0}$.
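To make the embedding index concrete, one can evaluate truncations of $\sum_{i\in N}\lambda_{i}^{\alpha}e_{i}^{2}(x)$ for a kernel with a known spectrum. The sketch below is our own illustration, using the kernel $k(x,y)=\min(x,y)$ from Section 4.2, where $\lambda_{i}=\left(\frac{2i-1}{2}\pi\right)^{-2}$ and $e_{i}(x)=\sqrt{2}\sin\left(\frac{2i-1}{2}\pi x\right)$: the partial sums stay bounded for $\alpha>\frac{1}{\beta}=\frac{1}{2}$ and diverge (slowly) at $\alpha=\frac{1}{2}$, consistent with $\alpha_{0}=\frac{1}{\beta}$.

```python
import numpy as np

# Truncated evaluation of sum_i lambda_i^alpha * e_i(x)^2 for k(x, y) = min(x, y),
# with lambda_i = ((2i-1) pi / 2)^(-2) and e_i(x) = sqrt(2) sin((2i-1) pi x / 2).
i = np.arange(1, 1_000_001)
lam = ((2 * i - 1) * np.pi / 2.0) ** (-2.0)
grid = np.linspace(0.01, 1.0, 50)
for alpha in (0.75, 0.55, 0.50):
    w = lam ** alpha
    vals = [np.sum(w * 2 * np.sin((2 * i - 1) * np.pi * x / 2) ** 2) for x in grid]
    print(alpha, max(vals))   # bounded for alpha > 1/2; diverges (logarithmically) at alpha = 1/2
```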

Assumption 2.3 (Source condition).

For $s>0$, there is a constant $R>0$ such that $f_{\rho}^{*}\in[\mathcal{H}]^{s}$ and

\|f_{\rho}^{*}\|_{[\mathcal{H}]^{s}}\leq R.

Functions in $[\mathcal{H}]^{s}$ with smaller $s$ are less smooth and hence harder for an algorithm to estimate.

Assumption 2.4 (Moment of error).

The noise $\epsilon\coloneqq y-f_{\rho}^{*}(x)$ satisfies that there are constants $\sigma,L>0$ such that for any $m\geq 2$,

\mathbb{E}\left(|\epsilon|^{m}\mid x\right)\leq\frac{1}{2}m!\,\sigma^{2}L^{m-2},\quad\mu\text{-a.e. }x\in\mathcal{X}.

This is a standard assumption used to control the noise so that the tail probability decays fast (Lin & Cevher, 2020; Fischer & Steinwart, 2020). It is satisfied, for instance, by Gaussian noise with bounded variance or by sub-Gaussian noise. Some literature (e.g., Steinwart et al. 2009; Pillaud-Vivien et al. 2018; Jun et al. 2019) uses the stronger assumption $y\in[-L_{0},L_{0}]$, which directly implies both Assumption 2.4 and the boundedness of $f_{\rho}^{*}$.

2.3 Review of state-of-the-art results

For the convenience of comparing our results with previous works, we review state-of-the-art upper and lower bounds of the convergence rate of KRR (Fischer & Steinwart, 2020).

Proposition 2.5 (Upper bound).

Suppose that Assumptions 2.1, 2.2, 2.3 and 2.4 hold for $0<s\leq 2$ and $\frac{1}{\beta}\leq\alpha_{0}<1$. Furthermore, suppose that there exists a constant $B_{\infty}>0$ such that $\|f_{\rho}^{*}\|_{L^{\infty}(\mathcal{X},\mu)}\leq B_{\infty}$. Let $\hat{f}_{\lambda}$ be the KRR estimator defined by (2). Then in the case of $s+\frac{1}{\beta}>\alpha_{0}$, by choosing $\lambda\asymp n^{-\frac{\beta}{s\beta+1}}$, for any fixed $\delta\in(0,1)$, when $n$ is sufficiently large, with probability at least $1-\delta$, we have

\left\|\hat{f}_{\lambda}-f_{\rho}^{*}\right\|_{L^{2}}^{2}\leq\left(\ln\frac{4}{\delta}\right)^{2}Cn^{-\frac{s\beta}{s\beta+1}},

where $C$ is a constant independent of $n$ and $\delta$.

Remark 2.6.

When $s+\frac{1}{\beta}\leq\alpha_{0}$, Fischer & Steinwart (2020, Theorem 1) also proves the state-of-the-art upper bound of the convergence rate $(n/\log^{r}(n))^{-\frac{s}{\alpha}}$ for any $r>0$ and any $\alpha>\alpha_{0}$, where $\alpha$ can be arbitrarily close to $\alpha_{0}$. We note that $s+\frac{1}{\beta}>\alpha_{0}$ is always satisfied for Sobolev RKHS (see Section 3.1), so we focus on the case $s+\frac{1}{\beta}>\alpha_{0}$ in this paper.

Proposition 2.7 (Minimax lower bound).

Let $\mu$ be a probability distribution on $\mathcal{X}$ such that Assumptions 2.1 and 2.2 are satisfied for $\frac{1}{\beta}\leq\alpha_{0}<1$. Let $\mathcal{P}$ consist of all the distributions on $\mathcal{X}\times\mathcal{Y}$ satisfying Assumptions 2.3 and 2.4 for $0<s\leq 2$ and with marginal distribution $\mu$. For a constant $B_{\infty}>0$, let $\mathcal{P}_{\infty}$ consist of all the distributions on $\mathcal{X}\times\mathcal{Y}$ such that $\|f_{\rho}^{*}\|_{L^{\infty}(\mathcal{X},\mu)}\leq B_{\infty}$. Then for any $\alpha>\alpha_{0}$, there exists a constant $C$ such that, for all learning methods and any fixed $\delta\in(0,1)$, when $n$ is sufficiently large, there is a distribution $\rho\in\mathcal{P}\cap\mathcal{P}_{\infty}$ such that, with probability at least $1-\delta$, we have

\left\|\hat{f}-f_{\rho}^{*}\right\|_{L^{2}}^{2}\geq C\delta n^{-\frac{\max\{s,\alpha\}\beta}{\max\{s,\alpha\}\beta+1}}.

Remark 2.8.

Under the precondition $s+\frac{1}{\beta}>\alpha_{0}$: (1) when $s>\alpha_{0}$, since there always exists $\alpha$ with $\alpha_{0}<\alpha\leq s$, the upper bound of the convergence rate in Proposition 2.5 coincides with the minimax lower bound in Proposition 2.7 and hence is minimax optimal; (2) when $s\leq\alpha_{0}$, however, existing results fail to show the optimality of KRR. The same gap between the upper and lower bounds exists for gradient descent and stochastic gradient descent with multiple passes (Pillaud-Vivien et al., 2018).

Note that in Proposition 2.7, we consider the distributions in $\mathcal{P}\cap\mathcal{P}_{\infty}$. If we consider all the distributions in $\mathcal{P}$, we have the following minimax lower bound, which is often referred to as the information-theoretic lower bound (see, e.g., Rastogi & Sampath 2017).

Proposition 2.9 (Information-theoretic lower bound).

Let $\mu$ be a probability distribution on $\mathcal{X}$ such that Assumption 2.1 is satisfied. Let $\mathcal{P}$ consist of all the distributions on $\mathcal{X}\times\mathcal{Y}$ satisfying Assumptions 2.3 and 2.4 for $0<s\leq 2$ and with marginal distribution $\mu$. Then there exists a constant $C$ such that, for all learning methods and any fixed $\delta\in(0,1)$, when $n$ is sufficiently large, there is a distribution $\rho\in\mathcal{P}$ such that, with probability at least $1-\delta$, we have

\left\|\hat{f}-f_{\rho}^{*}\right\|_{L^{2}}^{2}\geq C\delta n^{-\frac{s\beta}{s\beta+1}}.

3 Main Results

The main results of this paper aim to remove the boundedness assumption $\|f_{\rho}^{*}\|_{L^{\infty}(\mathcal{X},\mu)}\leq B_{\infty}$. We state the main theorem for a general RKHS and discuss Sobolev spaces in detail as a particular case.

Theorem 3.1.

Suppose that Assumptions 2.1, 2.2, 2.3 and 2.4 hold for $0<s\leq 2$ and $\frac{1}{\beta}\leq\alpha_{0}<1$. Suppose that $f_{\rho}^{*}\in L^{q}(\mathcal{X},\mu)$ and $\|f_{\rho}^{*}\|_{L^{q}(\mathcal{X},\mu)}\leq C_{q}<\infty$ for some $q>\frac{2(s\beta+1)}{2+(s-\alpha_{0})\beta}$. Let $\hat{f}_{\lambda}$ be the KRR estimator defined by (2). Then, in the case of $s+\frac{1}{\beta}>\alpha_{0}$, by choosing $\lambda\asymp n^{-\frac{\beta}{s\beta+1}}$, for any fixed $\delta\in(0,1)$, when $n$ is sufficiently large, with probability at least $1-\delta$, we have

\left\|\hat{f}_{\lambda}-f_{\rho}^{*}\right\|_{L^{2}}^{2}\leq\left(\ln\frac{4}{\delta}\right)^{2}Cn^{-\frac{s\beta}{s\beta+1}},

where $C$ is a constant independent of $n$ and $\delta$.

Remark 3.2.

In Theorem 3.1, we replace the boundedness assumption with an $L^{q}$-integrability assumption and prove the same upper bound of the convergence rate as Proposition 2.5 in the case of $s+\frac{1}{\beta}>\alpha_{0}$. As shown below, both $s+\frac{1}{\beta}>\alpha_{0}$ and $q>\frac{2(s\beta+1)}{2+(s-\alpha_{0})\beta}$ are naturally satisfied for Sobolev RKHS.

Remark 3.3.

RKHS with uniformly bounded eigenfunctions, i.e., $\sup_{i\in N}\|e_{i}\|_{L^{\infty}}<\infty$, are frequently considered (Mendelson & Neeman, 2010; Steinwart et al., 2009; Pillaud-Vivien et al., 2018). For this kind of RKHS, Assumption 2.2 holds with $\alpha_{0}=\frac{1}{\beta}$ (Fischer & Steinwart, 2020, Lemma 10), hence $s+\frac{1}{\beta}>\alpha_{0}$ is satisfied. In addition, the assumption $q>\frac{2(s\beta+1)}{2+(s-\alpha_{0})\beta}$ in Theorem 3.1 turns into $q>2$. Recalling that $[\mathcal{H}]^{0}\subseteq L^{2}(\mathcal{X},\mu)$ when $s=0$, it is reasonable to assume that the functions in $[\mathcal{H}]^{s}$, $s>0$, are $L^{q}$-integrable for some $q>2$.

Remark 3.4.

If the RKHS $\mathcal{H}$ satisfies $[\mathcal{H}]^{s}\hookrightarrow L^{q}$, i.e., $[\mathcal{H}]^{s}\cap L^{q}=[\mathcal{H}]^{s}$, for some $q$ satisfying the integrability requirement of Theorem 3.1, then by Proposition 2.9 the minimax lower bound remains $n^{-\frac{s\beta}{s\beta+1}}$ even under the assumption $\|f_{\rho}^{*}\|_{L^{q}}<\infty$.

3.1 Optimality for Sobolev RKHS

Let us first introduce some concepts of (fractional) Sobolev spaces (see, e.g., Adams & Fournier 2003). In this section, we assume that $\mathcal{X}\subseteq\mathbb{R}^{d}$ is a bounded domain with smooth boundary, equipped with the Lebesgue measure $\nu$. Denote by $L^{2}(\mathcal{X})\coloneqq L^{2}(\mathcal{X},\nu)$ the corresponding $L^{2}$ space. For $m\in\mathbb{N}$, we denote by $H^{m}(\mathcal{X})$ the Sobolev space with smoothness $m$, and set $H^{0}(\mathcal{X})\coloneqq L^{2}(\mathcal{X})$. Then the (fractional) Sobolev space for any real number $r>0$ can be defined through real interpolation:

H^{r}(\mathcal{X}):=\left(L^{2}(\mathcal{X}),H^{m}(\mathcal{X})\right)_{\frac{r}{m},2},

where $m:=\min\{k\in\mathbb{N}:k>r\}$. (For details on real interpolation of Banach spaces, we refer to Sawano (2018, Chapter 4.2.2).) It is well known that when $r>\frac{d}{2}$, $H^{r}(\mathcal{X})$ is a separable RKHS with respect to a bounded kernel, and the corresponding EDR is (see, e.g., Edmunds & Triebel 1996)

\beta=\frac{2r}{d}.

Furthermore, for the interpolation space of $H^{r}(\mathcal{X})$ under the Lebesgue measure defined by (4), we have (see, e.g., Steinwart & Scovel 2012, Theorem 4.6), for $s>0$,

[H^{r}(\mathcal{X})]^{s}=H^{rs}(\mathcal{X}).

Now we introduce the embedding theorem for (fractional) Sobolev spaces from Adams (1975, 7.57), which is crucial when applying Theorem 3.1 to Sobolev RKHS.

Proposition 3.5 (Embedding theorem).

Let $H^{r}(\mathcal{X})$, $r>0$, be defined as above. We have:

  1. (i) if $d>2r$, then $H^{r}(\mathcal{X})\hookrightarrow L^{q}(\mathcal{X})$ for $2\leq q\leq\frac{2d}{d-2r}$;

  2. (ii) if $d=2r$, then $H^{r}(\mathcal{X})\hookrightarrow L^{q}(\mathcal{X})$ for $2\leq q<\infty$;

  3. (iii) if $d<2(r-j)$ for some nonnegative integer $j$, then $H^{r}(\mathcal{X})\hookrightarrow C^{j,\gamma}(\mathcal{X})$ with $\gamma=r-j-\frac{d}{2}$,

where $C^{j,\gamma}(\mathcal{X})$ denotes the Hölder space and $\hookrightarrow$ denotes a continuous embedding.

On the one hand, (iii) of Proposition 3.5 shows that, for a Sobolev RKHS $\mathcal{H}=H^{r}(\mathcal{X})$, $r>\frac{d}{2}$, and any $\alpha>\frac{1}{\beta}=\frac{d}{2r}$,

[H^{r}(\mathcal{X})]^{\alpha}=H^{r\alpha}(\mathcal{X})\hookrightarrow C^{0,\gamma}(\mathcal{X})\hookrightarrow L^{\infty}(\mathcal{X}),

where $\gamma>0$. So Assumption 2.2 holds for $\alpha_{0}=\frac{1}{\beta}$, and thus $\frac{2(s\beta+1)}{2+(s-\alpha_{0})\beta}=2$. On the other hand, (i) of Proposition 3.5 shows that if $d>2rs$,

[H^{r}(\mathcal{X})]^{s}=H^{rs}(\mathcal{X})\hookrightarrow L^{q}(\mathcal{X}),

where $q=\frac{2d}{d-2rs}=\frac{2}{1-s\beta}>2$. In addition, (ii) and (iii) of Proposition 3.5 show that $q>2$ also holds if $d\leq 2rs$. Therefore, for any $0<s\leq 2$, we have

\text{Assumption 2.2};\quad s+\frac{1}{\beta}>\alpha_{0};\quad q>\frac{2(s\beta+1)}{2+(s-\alpha_{0})\beta}=2

hold simultaneously.
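As a concrete numerical check (our own, using the numbers from the experiment in Section 4.1), take $d=1$, $\mathcal{H}=H^{1}([0,1])$ (so $r=1$, $\beta=2$, $\alpha_{0}=\frac{1}{2}$) and $s=0.4$. Then

s+\frac{1}{\beta}=0.9>\alpha_{0}=0.5,\qquad\frac{2(s\beta+1)}{2+(s-\alpha_{0})\beta}=\frac{2\times 1.8}{2-0.2}=2,

and since $d=1>2rs=0.8$, Proposition 3.5 (i) gives $[H^{1}]^{0.4}=H^{0.4}\hookrightarrow L^{q}$ with $q=\frac{2d}{d-2rs}=10>2$, so the conditions of Theorem 3.1 are met.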

Now we are ready to state a theorem as a corollary of Theorem 3.1 and Proposition 3.5, which implies the optimality of KRR for Sobolev RKHS and any $0<s\leq 2$.

Theorem 3.6.

Let $\mathcal{X}\subseteq\mathbb{R}^{d}$ be a bounded domain with a smooth boundary and let the RKHS be $\mathcal{H}=H^{r}(\mathcal{X})$ for some $r>d/2$. Suppose that the distribution $\rho$ satisfies $\|f_{\rho}^{*}\|_{[\mathcal{H}]^{s}}\leq R$ for some $0<s\leq 2$, the noise satisfies Assumption 2.4, and the marginal distribution $\mu$ on $\mathcal{X}$ has Lebesgue density $0<c\leq p(x)\leq C$ for two constants $c$ and $C$. Let $\hat{f}_{\lambda}$ be the KRR estimator defined by (2). Then, by choosing $\lambda\asymp n^{-\frac{\beta}{s\beta+1}}$, for any fixed $\delta\in(0,1)$, when $n$ is sufficiently large, with probability at least $1-\delta$, we have

\left\|\hat{f}_{\lambda}-f_{\rho}^{*}\right\|_{L^{2}}^{2}\leq\left(\ln\frac{4}{\delta}\right)^{2}Cn^{-\frac{s\beta}{s\beta+1}},

where $C$ is a constant independent of $n$ and $\delta$.

Note that we say the distribution $\mu$ has Lebesgue density $0<c\leq p(x)\leq C$ if $\mu$ is equivalent to the Lebesgue measure $\nu$, i.e., $\mu\ll\nu$ and $\nu\ll\mu$, and there exist constants $c,C>0$ such that $c\leq\frac{\mathrm{d}\mu}{\mathrm{d}\nu}\leq C$.

Remark 3.7 (Optimality for Sobolev RKHS).

Without the boundedness assumption, Theorem 3.6 proves the same upper bound of the convergence rate $n^{-\frac{s\beta}{s\beta+1}}$. Together with the information-theoretic lower bound in Proposition 2.9, this proves the optimality of KRR for all $0<s\leq 2$, while the state-of-the-art result Fischer & Steinwart (2020, Corollary 5) only covers $\frac{1}{\beta}<s\leq 2$. Since the saturation effect of KRR for $s>2$ has been proved in the recent work Li et al. (2023), the optimality of KRR for Sobolev spaces is well understood.

3.2 Sketch of proof

In this subsection, we present a sketch of the proof of Theorem 3.1. The rigorous proofs of Theorem 3.1, Proposition 2.9, and Theorem 3.6 are given in the appendix. We refer to Fischer & Steinwart (2020, Chapter 6) for the proofs of Propositions 2.5 and 2.7.

Our proofs are based on the standard integral operator technique dating back to Smale & Zhou (2007), and we refine the argument to handle the unbounded case. Let us first introduce some frequently used notation. Define the sample operator as

K_{x}:\mathbb{R}\rightarrow\mathcal{H},~~y\mapsto yk(x,\cdot),

and its adjoint operator

K_{x}^{*}:\mathcal{H}\rightarrow\mathbb{R},~~f\mapsto f(x).

Next we define the sample covariance operator $T_{X}:\mathcal{H}\to\mathcal{H}$ as

T_{X}:=\frac{1}{n}\sum_{i=1}^{n}K_{x_{i}}K_{x_{i}}^{*}, (7)

and the sample basis function

g_{Z}:=\frac{1}{n}\sum_{i=1}^{n}K_{x_{i}}y_{i}\in\mathcal{H}.

Then the KRR estimator (2) can be expressed as (see, e.g., Caponnetto & de Vito 2007, Chapter 5)

\hat{f}_{\lambda}=\left(T_{X}+\lambda\right)^{-1}g_{Z}.

Define $f_{\lambda}$ as the unique minimizer given by

f_{\lambda}=\underset{f\in\mathcal{H}}{\arg\min}\left(\int_{\mathcal{X}\times\mathcal{Y}}\left(f(x)-y\right)^{2}\mathrm{d}\rho(x,y)+\lambda\|f\|_{\mathcal{H}}^{2}\right).

Noting that the integral operator $T$ can also be viewed as a bounded linear operator on $\mathcal{H}$, $f_{\lambda}$ can be expressed as

f_{\lambda}=\left(T+\lambda\right)^{-1}g, (8)

where $g$ is the expectation of $g_{Z}$, given by

g=\mathbb{E}g_{Z}=\int_{\mathcal{X}}k(x,\cdot)f_{\rho}^{*}(x)\,\mathrm{d}\mu(x)=Tf_{\rho}^{*}\in\mathcal{H}.

The first step in our proof is to decompose the generalization error into two terms, which are often referred to as the approximation error and estimation error,

\left\|\hat{f}_{\lambda}-f_{\rho}^{*}\right\|_{L^{2}}\leq\left\|\hat{f}_{\lambda}-f_{\lambda}\right\|_{L^{2}}+\left\|f_{\lambda}-f_{\rho}^{*}\right\|_{L^{2}}. (9)

Then we will show that, by choosing $\lambda\asymp n^{-\frac{\beta}{s\beta+1}}$,

\left\|f_{\lambda}-f_{\rho}^{*}\right\|_{L^{2}}\leq CR\lambda^{\frac{s}{2}}=CRn^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}; (10)

and for any fixed $\delta\in(0,1)$, when $n$ is sufficiently large, with probability at least $1-\delta$, we have

\left\|\hat{f}_{\lambda}-f_{\lambda}\right\|_{L^{2}}\leq\ln\frac{4}{\delta}\,C\frac{\lambda^{-\frac{1}{2\beta}}}{\sqrt{n}}=\ln\frac{4}{\delta}\,Cn^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}. (11)

Plugging (10) and (11) into (9) finishes the proof of Theorem 3.1.
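The choice $\lambda\asymp n^{-\frac{\beta}{s\beta+1}}$ is precisely the one that balances the two bounds (10) and (11):

\lambda^{\frac{s}{2}}\asymp\frac{\lambda^{-\frac{1}{2\beta}}}{\sqrt{n}}\iff\lambda^{\frac{s}{2}+\frac{1}{2\beta}}\asymp n^{-\frac{1}{2}}\iff\lambda\asymp n^{-\frac{\beta}{s\beta+1}},

and at this $\lambda$ both terms are of order $n^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}$, whose square is the claimed rate $n^{-\frac{s\beta}{s\beta+1}}$.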

The approximation error

Suppose that $f_{\rho}^{*}=\sum_{i\in N}a_{i}e_{i}$. Then, using the expression (8) and a simple inequality, we will show that

\left\|f_{\lambda}-f_{\rho}^{*}\right\|_{L^{2}}^{2}=\sum_{i\in N}\left(\frac{\lambda\lambda_{i}^{s/2}}{\lambda+\lambda_{i}}\right)^{2}\lambda_{i}^{-s}a_{i}^{2}\leq\lambda^{s}\|f_{\rho}^{*}\|_{[\mathcal{H}]^{s}}^{2}.

The estimation error

Denote $T_{X\lambda}=T_{X}+\lambda$ and $T_{\lambda}=T+\lambda$. We first rewrite the estimation error as follows:

\left\|\hat{f}_{\lambda}-f_{\lambda}\right\|_{L^{2}}=\left\|T^{\frac{1}{2}}\left(\hat{f}_{\lambda}-f_{\lambda}\right)\right\|_{\mathcal{H}}\leq\left\|T^{\frac{1}{2}}T_{\lambda}^{-\frac{1}{2}}\right\|\cdot\left\|T_{\lambda}^{\frac{1}{2}}T_{X\lambda}^{-1}T_{\lambda}^{\frac{1}{2}}\right\|\cdot\left\|T_{\lambda}^{-\frac{1}{2}}\left(g_{Z}-T_{X\lambda}f_{\lambda}\right)\right\|_{\mathcal{H}}. (12)

For the first term in (12), we have

\left\|T^{\frac{1}{2}}T_{\lambda}^{-\frac{1}{2}}\right\|=\sup_{i\in N}\left(\frac{\lambda_{i}}{\lambda_{i}+\lambda}\right)^{\frac{1}{2}}\leq 1.

For the second term in (12), using a concentration result between $T_{X\lambda}$ and $T_{\lambda}$, it can also be bounded by a constant. The main challenge of the proof is bounding the third term. We will show that the third term in (12) can be rewritten as

\left\|T_{\lambda}^{-\frac{1}{2}}\left[g_{Z}-\left(T_{X}+\lambda+T-T\right)f_{\lambda}\right]\right\|_{\mathcal{H}}=\left\|T_{\lambda}^{-\frac{1}{2}}\left[\left(g_{Z}-T_{X}f_{\lambda}\right)-\left(T+\lambda\right)f_{\lambda}+Tf_{\lambda}\right]\right\|_{\mathcal{H}}=\left\|T_{\lambda}^{-\frac{1}{2}}\left[\left(g_{Z}-T_{X}f_{\lambda}\right)-\left(g-Tf_{\lambda}\right)\right]\right\|_{\mathcal{H}}. (13)

The form of (13) allows us to use the well-known Bernstein-type inequality, which controls the difference between a sum of i.i.d. random variables and its expectation. Traditionally, this step requires the boundedness of $f_{\rho}^{*}$. We refine this procedure by a truncation method and prove that, with high probability,

(13)\leq\ln\frac{2}{\delta}\,C\frac{\lambda^{-\frac{1}{2\beta}}}{\sqrt{n}}=\ln\frac{2}{\delta}\,Cn^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}.

Specifically, we consider two subsets of $\mathcal{X}$: $\Omega_{1}=\{x\in\mathcal{X}:|f_{\rho}^{*}(x)|\leq t\}$ and $\Omega_{2}=\mathcal{X}\backslash\Omega_{1}$, where $t$ is allowed to diverge as $n\to\infty$. When $t$ is chosen as a proper order of $n$, we show that, on the one hand, the norm restricted to $\Omega_{1}$ is still upper bounded by $n^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}$; on the other hand, the norm restricted to $\Omega_{2}$ vanishes with probability tending to 1 as $n\to\infty$. We argue that such a $t$ exists by taking advantage of the $L^{q}$-integrability of $f_{\rho}^{*}$, which is required in the statement of Theorem 3.1.
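A rough picture of how the $L^{q}$-integrability enters (the precise argument is given in the appendix): by Markov's inequality,

\mu(\Omega_{2})=\mu\left(\left\{x:|f_{\rho}^{*}(x)|>t\right\}\right)\leq C_{q}^{q}\,t^{-q},

so $t$ can grow polynomially in $n$ while keeping the contribution of $\Omega_{2}$ negligible; on $\Omega_{1}$, the truncated function is bounded by $t$, so the Bernstein-type argument goes through with constants depending on $t$. The integrability condition $q>\frac{2(s\beta+1)}{2+(s-\alpha_{0})\beta}$ is what guarantees that a single choice of $t$ satisfies both requirements.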

4 Experiments

In our experiments, we aim to verify that for functions $f_{\rho}^{*}\in[\mathcal{H}]^{s}$ but not in $L^{\infty}$, the KRR estimator can still achieve the convergence rate $n^{-\frac{s\beta}{s\beta+1}}$. We show the results both for Sobolev RKHS and for the general RKHS with uniformly bounded eigenfunctions mentioned in Remark 3.3.

4.1 Experiments in Sobolev space

Suppose that $\mathcal{X}=[0,1]$ and the marginal distribution $\mu$ is the uniform distribution on $[0,1]$. We consider the RKHS $\mathcal{H}=H^{1}(\mathcal{X})$ to be the Sobolev space with smoothness 1. Section 3.1 shows that the EDR is $\beta=2$ and the embedding index is $\alpha_{0}=\frac{1}{\beta}$. We construct a function in $[\mathcal{H}]^{s}\backslash L^{\infty}$ by

f^{*}(x)=\sum_{k=1}^{\infty}\frac{1}{k^{s+0.5}}\left(\sin\left(2k\pi x\right)+\cos\left(2k\pi x\right)\right), (14)

for some $0<s<\frac{1}{\beta}=0.5$. We will show in Appendix F that the series in (14) converges on $(0,1)$. In addition, since $\sin 2k\pi+\cos 2k\pi\equiv 1$, we also have $f^{*}\notin L^{\infty}(\mathcal{X})$. The explicit formula of the kernel associated with $H^{1}(\mathcal{X})$ is given by Thomas-Agnan (1996, Corollary 2), i.e., $k(x,y)=\frac{1}{\sinh 1}\cosh\left(1-\max(x,y)\right)\cosh\left(\min(x,y)\right)$.

We consider the following data generation procedure:

y=f^{*}(x)+\epsilon,

where $f^{*}$ is numerically approximated by the first 1000 terms of (14) with $s=0.4$, $x\sim\mathcal{U}[0,1]$ and $\epsilon\sim\mathcal{N}(0,1)$. We choose the regularization parameter as $\lambda=cn^{-\frac{\beta}{s\beta+1}}=cn^{-\frac{10}{9}}$ for a fixed $c$. The sample size $n$ is chosen from 1000 to 5000, with intervals of 100. We numerically compute the generalization error $\|\hat{f}-f^{*}\|_{L^{2}}$ by Simpson's formula with $N\gg n$ testing points. For each $n$, we repeat the experiment 50 times and report the average generalization error as well as the region within one standard deviation. To visualize the convergence rate $r$, we perform logarithmic least squares $\log\text{err}=r\log n+b$ to fit the generalization error with respect to the sample size and display the value of $r$.
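For readers who want to reproduce a scaled-down version of this experiment, the following sketch is our own condensed re-implementation, not the authors' released code: it uses the closed-form $H^{1}$ kernel above, a single trial per sample size, smaller $n$ than in the paper, a placeholder value of $c$, and a uniform-grid approximation of the $L^{2}$ error instead of Simpson's rule.

```python
import numpy as np

def kernel_h1(x, y):
    # Reproducing kernel of H^1([0,1]) (Thomas-Agnan, 1996, Corollary 2).
    lo, hi = np.minimum(x, y), np.maximum(x, y)
    return np.cosh(1 - hi) * np.cosh(lo) / np.sinh(1)

def f_star(x, s=0.4, terms=1000):
    k = np.arange(1, terms + 1)[:, None]
    return (k ** (-(s + 0.5)) * (np.sin(2 * k * np.pi * x) + np.cos(2 * k * np.pi * x))).sum(axis=0)

rng = np.random.default_rng(0)
beta, s, c = 2.0, 0.4, 0.1                      # c is a placeholder; the paper tunes it
x_test = np.linspace(1e-3, 1 - 1e-3, 2001)      # uniform grid for the L2 error
ns, errs = np.arange(500, 2501, 500), []
for n in ns:
    x = rng.uniform(0, 1, n)
    y = f_star(x) + rng.standard_normal(n)
    lam = c * n ** (-beta / (s * beta + 1))     # lambda = c * n^{-10/9}
    coef = np.linalg.solve(kernel_h1(x[:, None], x[None, :]) + n * lam * np.eye(n), y)
    f_hat = kernel_h1(x_test[:, None], x[None, :]) @ coef
    errs.append(np.mean((f_hat - f_star(x_test)) ** 2))
rate, _ = np.polyfit(np.log(ns), np.log(errs), 1)
print("fitted convergence rate:", rate)         # expected near -s*beta/(s*beta+1) = -4/9
```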


Figure 1: Error decay curve of Sobolev RKHS. Both axes are logarithmic. The curve shows the average generalization errors over 50 trials; the region within one standard deviation is shown in green. The dashed black line is computed using logarithmic least-squares, and the slope represents the convergence rate rr.

We try different values of $c$, and Figure 1 presents the convergence curve under the best choice of $c$. It can be concluded that the convergence rate of KRR's generalization error is indeed approximately equal to $n^{-\frac{s\beta}{s\beta+1}}=n^{-\frac{4}{9}}$, without the boundedness assumption on the true function $f^{*}$. We also ran another experiment that uses cross validation to choose the regularization parameter; Figure 3 in Appendix F shows a similar result to Figure 1. We refer to Appendix F for more details of the experiments.

Remark 4.1.

This setting is similar to Pillaud-Vivien et al. (2018). The difference is that we choose the source condition $s$ to be smaller than $\alpha_{0}$ so that $f^{*}\notin L^{\infty}(\mathcal{X})$.

4.2 Experiments in general RKHS

Suppose that $\mathcal{X}=[0,1]$ and the marginal distribution $\mu$ is the uniform distribution on $[0,1]$. It is well known that the following RKHS

\mathcal{H}:=\left\{f:[0,1]\rightarrow\mathbb{R}\mid f\text{ is absolutely continuous},\ f(0)=0,\ \int_{0}^{1}\left(f^{\prime}(x)\right)^{2}\,\mathrm{d}x<\infty\right\}

is associated with the kernel $k(x,y)=\min(x,y)$ (Wainwright, 2019). Further, its eigenvalues and eigenfunctions can be written as

\lambda_{n}=\left(\frac{2n-1}{2}\pi\right)^{-2},\quad n=1,2,\cdots

and

e_{n}(x)=\sqrt{2}\sin\left(\frac{2n-1}{2}\pi x\right),\quad n=1,2,\cdots

It is easy to see that the EDR is $\beta=2$, the eigenfunctions are uniformly bounded, and the embedding index is $\alpha_{0}=\frac{1}{\beta}$ (see Remark 3.3). We construct a function in $[\mathcal{H}]^{s}\backslash L^{\infty}$ by

f^{*}(x)=\sum_{k=1}^{\infty}\frac{1}{k^{s+0.5}}e_{2k-1}(x), (15)

for some $0<s<\frac{1}{\beta}=0.5$. We will show in Appendix F that the series in (15) converges on $(0,1)$. Since $e_{2k-1}(1)=\sqrt{2}$ for all $k$, we also have $f^{*}\notin L^{\infty}(\mathcal{X})$.
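A quick numerical illustration (ours) of this construction: truncating the series and evaluating it near $x=1$ already shows the slow divergence, of order $(1-x)^{-0.1}$ when $s=0.4$.

```python
import numpy as np

# Partial sums of f*(x) = sum_k k^{-(s+0.5)} e_{2k-1}(x), with e_n(x) = sqrt(2) sin((2n-1) pi x / 2).
s = 0.4
k = np.arange(1, 200001)
coef = np.sqrt(2) * k ** (-(s + 0.5))
for x in (0.5, 0.9, 0.999, 0.999999):
    print(x, np.sum(coef * np.sin((4 * k - 3) * np.pi * x / 2)))   # grows (slowly) without bound as x -> 1
```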

We use the same data generation procedure as in Section 4.1:

y=f^{*}(x)+\epsilon,

where $f^{*}$ is numerically approximated by the first 1000 terms of (15) with $s=0.4$, $x\sim\mathcal{U}[0,1]$ and $\epsilon\sim\mathcal{N}(0,1)$.

Figure 2 presents the convergence curve under the best choice of $c$. It can also be concluded that the convergence rate of KRR's generalization error is approximately equal to $n^{-\frac{s\beta}{s\beta+1}}=n^{-\frac{4}{9}}$. Figure 4 in Appendix F shows the result of using cross validation.


Figure 2: Error decay curve of general RKHS. Both axes are logarithmic. The curve shows the average generalization errors over 50 trials; the region within one standard deviation is shown in green. The dashed black line is computed using logarithmic least-squares, and the slope represents the convergence rate rr.

5 Conclusion and Discussion

This paper considers the convergence rate and the minimax optimality of kernel ridge regression, especially in the misspecified case. When the true function $f_{\rho}^{*}$ falls into a less-smooth ($s\leq\alpha_{0}$) interpolation space of the RKHS, $[\mathcal{H}]^{s}\nsubseteq L^{\infty}$, it has been an outstanding problem for years whether the boundedness assumption on $f_{\rho}^{*}$ in the proof can be removed or weakened so that KRR is optimal for all $0<s\leq 2$. This paper proves that for Sobolev RKHS the boundedness assumption can be directly removed. This result implies that, on the one hand, the desired upper bound of the convergence rate holds for all functions in $[\mathcal{H}]^{s}$ (which previously could only be proved for $[\mathcal{H}]^{s}\cap L^{\infty}$); on the other hand, KRR is minimax optimal for all source conditions $0<s\leq 2$ (which previously could only be proved for $\frac{1}{\beta}<s\leq 2$). Together with the saturation effect of KRR (Li et al., 2023) when $s>2$, all convergence behaviors of KRR for Sobolev RKHS are understood. For general RKHS, we also prove that the boundedness assumption can be replaced by an $L^{q}$-integrability assumption, which turns out to be reasonable, at least for RKHS with uniformly bounded eigenfunctions. We verify the results through experiments for both Sobolev RKHS and general RKHS.

If an RKHS $\mathcal{H}$ has embedding index $\alpha_{0}=1$, considering the embedding index will just recover the results of Lin et al. (2018). Note that we assume $\alpha_{0}\in(0,1)$ throughout this paper, under which we perform the refined analysis. The technical tools in this paper can be extended to general spectral regularization algorithms and to the $[\mathcal{H}]^{\gamma}$-generalization error. Another direct open question is whether the $L^{q}$-integrability assumption in Theorem 3.1 holds for general RKHS, which we conjecture to be true.

We also notice a line of work which studies the learning curves of kernel ridge regression (Spigler et al., 2020; Bordelon et al., 2020; Cui et al., 2021) and the crossovers between different noise magnitudes. At present, their results all rely on a Gaussian design assumption (or some variation thereof), which is a very strong assumption. Nevertheless, their empirical results are enlightening and attract interest in the study of the learning curves of kernel ridge regression. We believe that studying the misspecified case, as in this paper, is a crucial step towards removing the Gaussian design assumption and drawing complete conclusions about the learning curves of kernel ridge regression. KRR is also connected with Gaussian process regression (Kanagawa et al., 2018). Jin et al. (2021) claimed to establish the learning curves for Gaussian process regression and thus for KRR. However, there is a gap in their proof where an essential covering argument is missing: their Corollary 20 provides a high-probability bound for fixed $\nu$, while the proofs of their Lemma 40 and Lemma 41 mistakenly use Corollary 20 to assert that the bound holds simultaneously for all $\nu$.

Since Jacot et al. (2018) introduced the neural tangent kernel (NTK), kernel regression has become a natural surrogate for neural networks in the ‘lazy training’ regime. Our work is also motivated by our empirical studies of neural networks and NTK regression. Specifically, we found that the source condition with respect to the neural tangent kernel (or other frequently used kernels) is relatively small ($s\ll 1$) for some frequently used real datasets (MNIST, CIFAR-10). This observation urged us to determine the optimal convergence rate when the RKHS is misspecified.

Acknowledgements

This research was partially supported by Beijing Natural Science Foundation (Grant Z190001), the National Natural Science Foundation of China (Grant 11971257), National Key R&D Program of China (2020AAA0105200), and Beijing Academy of Artificial Intelligence.

References

  • Adams (1975) Adams, R. A. Sobolev Spaces. Pure and Applied Mathematics. Academic Press, 1975. URL https://books.google.co.uk/books?id=JxzpSAAACAAJ.
  • Adams & Fournier (2003) Adams, R. A. and Fournier, J. J. Sobolev Spaces. Elsevier, 2003.
  • Bauer et al. (2007) Bauer, F., Pereverzyev, S., and Rosasco, L. On regularization algorithms in learning theory. Journal of complexity, 23(1):52–72, 2007.
  • Blanchard & Mücke (2018) Blanchard, G. and Mücke, N. Optimal rates for regularization of statistical inverse learning problems. Foundations of Computational Mathematics, 18:971–1013, 2018.
  • Bordelon et al. (2020) Bordelon, B., Canatar, A., and Pehlevan, C. Spectrum dependent learning curves in kernel regression and wide neural networks. In ICML, 2020.
  • Caponnetto (2006) Caponnetto, A. Optimal rates for regularization operators in learning theory. 2006.
  • Caponnetto & de Vito (2007) Caponnetto, A. and de Vito, E. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331–368, 2007.
  • Celisse & Wahl (2020) Celisse, A. and Wahl, M. Analyzing the discrepancy principle for kernelized spectral filter learning algorithms. J. Mach. Learn. Res., 22:76:1–76:59, 2020.
  • Cucker & Smale (2001) Cucker, F. and Smale, S. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39:1–49, 2001.
  • Cui et al. (2021) Cui, H., Loureiro, B., Krzakala, F., and Zdeborov’a, L. Generalization error rates in kernel regression: The crossover from the noiseless to noisy regime. In NeurIPS, 2021.
  • Dicker et al. (2017) Dicker, L., Foster, D. P., and Hsu, D. J. Kernel ridge vs. principal component regression: Minimax bounds and the qualification of regularization operators. Electronic Journal of Statistics, 11:1022–1047, 2017.
  • Edmunds & Triebel (1996) Edmunds, D. E. and Triebel, H. Function Spaces, Entropy Numbers, Differential Operators. Cambridge Tracts in Mathematics. Cambridge University Press, 1996. doi: 10.1017/CBO9780511662201.
  • Fischer & Steinwart (2020) Fischer, S.-R. and Steinwart, I. Sobolev norm learning rates for regularized least-squares algorithms. Journal of Machine Learning Research, 21:205:1–205:38, 2020.
  • Gerfo et al. (2008) Gerfo, L. L., Rosasco, L., Odone, F., Vito, E. D., and Verri, A. Spectral algorithms for supervised learning. Neural Computation, 20(7):1873–1897, 2008.
  • Jacot et al. (2018) Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. Advances in neural information processing systems, 31, 2018.
  • Jin et al. (2021) Jin, H., Banerjee, P. K., and Montúfar, G. Learning curves for gaussian process regression with power-law priors and targets. ArXiv, abs/2110.12231, 2021.
  • Jun et al. (2019) Jun, K.-S., Cutkosky, A., and Orabona, F. Kernel truncated randomized ridge regression: Optimal rates and low noise acceleration. ArXiv, abs/1905.10680, 2019.
  • Kanagawa et al. (2018) Kanagawa, M., Hennig, P., Sejdinovic, D., and Sriperumbudur, B. K. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv preprint arXiv:1807.02582, 2018.
  • Kohler & Krzyżak (2001) Kohler, M. and Krzyżak, A. Nonparametric regression estimation using penalized least squares. IEEE Trans. Inf. Theory, 47:3054–3059, 2001.
  • Li et al. (2023) Li, Y., Zhang, H., and Lin, Q. On the saturation effect of kernel ridge regression. In The Eleventh International Conference on Learning Representations, 2023.
  • Li et al. (2022) Li, Z., Meunier, D., Mollenhauer, M., and Gretton, A. Optimal rates for regularized conditional mean embedding learning. ArXiv, abs/2208.01711, 2022.
  • Lin & Cevher (2020) Lin, J. and Cevher, V. Optimal convergence for distributed learning with stochastic gradient methods and spectral algorithms. Journal of Machine Learning Research, 21:147–1, 2020.
  • Lin et al. (2018) Lin, J., Rudi, A., Rosasco, L., and Cevher, V. Optimal rates for spectral algorithms with least-squares regression over Hilbert spaces. Applied and Computational Harmonic Analysis, 48:868–890, 2018.
  • Liu & Shi (2022) Liu, J. and Shi, L. Statistical optimality of divide and conquer kernel-based functional linear regression. ArXiv, abs/2211.10968, 2022.
  • Mendelson & Neeman (2010) Mendelson, S. and Neeman, J. Regularization in kernel learning. The Annals of Statistics, 38(1):526–565, February 2010.
  • Pillaud-Vivien et al. (2018) Pillaud-Vivien, L., Rudi, A., and Bach, F. R. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. ArXiv, abs/1805.10074, 2018.
  • Rastogi & Sampath (2017) Rastogi, A. and Sampath, S. Optimal rates for the regularized learning algorithms under general source condition. Frontiers in Applied Mathematics and Statistics, 3, 2017.
  • Rosasco et al. (2005) Rosasco, L., De Vito, E., and Verri, A. Spectral methods for regularization in learning theory. DISI, Universita degli Studi di Genova, Italy, Technical Report DISI-TR-05-18, 2005.
  • Sawano (2018) Sawano, Y. Theory of Besov spaces, volume 56. Springer, 2018.
  • Smale & Zhou (2007) Smale, S. and Zhou, D.-X. Learning theory estimates via integral operators and their approximations. Constructive Approximation, 26:153–172, 2007.
  • Spigler et al. (2020) Spigler, S., Geiger, M., and Wyart, M. Asymptotic learning curves of kernel methods: empirical data versus teacher–student paradigm. Journal of Statistical Mechanics: Theory and Experiment, 2020, 2020.
  • Steinwart & Christmann (2008) Steinwart, I. and Christmann, A. Support vector machines. In Information Science and Statistics, 2008.
  • Steinwart & Scovel (2012) Steinwart, I. and Scovel, C. Mercer’s theorem on general domains: On the interaction between measures, kernels, and RKHSs. Constructive Approximation, 35(3):363–417, 2012.
  • Steinwart et al. (2009) Steinwart, I., Hush, D., and Scovel, C. Optimal rates for regularized least squares regression. In COLT, pp.  79–93, 2009.
  • Talwai & Simchi-Levi (2022) Talwai, P. M. and Simchi-Levi, D. Optimal learning rates for regularized least-squares with a fourier capacity condition. 2022.
  • Thomas-Agnan (1996) Thomas-Agnan, C. Computing a family of reproducing kernels for statistical applications. Numerical Algorithms, 13:21–32, 1996.
  • Tsybakov (2009) Tsybakov, A. B. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer, New York ; London, 1st edition, 2009.
  • Wainwright (2019) Wainwright, M. J. High-Dimensional Statistics: A Non-Asymptotic Viewpoint. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, 2019.

Appendix A Proof of Theorem 3.1

Throughout the proof, we denote

T_{\lambda}=T+\lambda;~~T_{X\lambda}=T_{X}+\lambda, (16)

where $\lambda$ is the regularization parameter and $T,T_{X}$ are defined by (3) and (7). We use $\|\cdot\|_{\mathscr{B}(B_{1},B_{2})}$ to denote the operator norm of a bounded linear operator from a Banach space $B_{1}$ to $B_{2}$, i.e., $\|A\|_{\mathscr{B}(B_{1},B_{2})}=\sup_{\|f\|_{B_{1}}=1}\|Af\|_{B_{2}}$. When this brings no ambiguity, we briefly denote the operator norm as $\|\cdot\|$. In addition, we use $\operatorname{tr}A$ and $\|A\|_{1}$ to denote the trace and the trace norm of an operator, and $\|A\|_{2}$ to denote the Hilbert-Schmidt norm. Throughout the proof, we denote $L^{2}(\mathcal{X},\mu)$ as $L^{2}$ and $L^{\infty}(\mathcal{X},\mu)$ as $L^{\infty}$ for brevity.

In addition, denote the effective dimension

\mathcal{N}(\lambda)=\operatorname{tr}\big(T(T+\lambda)^{-1}\big)=\sum_{i\in N}\frac{\lambda_{i}}{\lambda_{i}+\lambda}, (17)

since the EDR of $\mathcal{H}$ is $\beta$, Lemma E.5 shows that $\mathcal{N}(\lambda)\asymp\lambda^{-\frac{1}{\beta}}$.
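For completeness, the rate $\mathcal{N}(\lambda)\asymp\lambda^{-\frac{1}{\beta}}$ follows from a standard integral comparison (this is essentially what Lemma E.5 provides): under Assumption 2.1,

\mathcal{N}(\lambda)=\sum_{i\in N}\frac{\lambda_{i}}{\lambda_{i}+\lambda}\asymp\sum_{i\geq 1}\frac{1}{1+\lambda i^{\beta}}\asymp\int_{0}^{\infty}\frac{\mathrm{d}t}{1+\lambda t^{\beta}}=\lambda^{-\frac{1}{\beta}}\int_{0}^{\infty}\frac{\mathrm{d}u}{1+u^{\beta}}\asymp\lambda^{-\frac{1}{\beta}},

where the last integral is finite because $\beta>1$.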

The following lemma will be frequently used, so we present it at the beginning of our proof.

Lemma A.1.

For any $\lambda>0$ and $s\in[0,1]$, we have

\sup_{t\geq 0}\frac{t^{s}}{t+\lambda}\leq\lambda^{s-1}. (18)
Proof.

Since $a^{s}\leq a+1$ for any $a\geq 0$ and $s\in[0,1]$, the lemma follows from

\left(\frac{t}{\lambda}\right)^{s}\leq\frac{t}{\lambda}+1=\frac{t+\lambda}{\lambda}. (19)

A.1 Proof of the approximation error

The following theorem gives a bound on the $[\mathcal{H}]^{\gamma}$-norm of $f_{\lambda}-f_{\rho}^{*}$ when $0\leq\gamma\leq s$. As a special case, the approximation error $\left\|f_{\lambda}-f_{\rho}^{*}\right\|_{L^{2}}$ follows from the result with $\gamma=0$.

Theorem A.2.

Suppose that Assumption 2.3 holds for $0<s\leq 2$. Denote $f_{\lambda}=(T+\lambda)^{-1}g$. Then for any $\lambda>0$ and $0\leq\gamma\leq s$, we have

\left\|f_{\lambda}-f_{\rho}^{*}\right\|_{[\mathcal{H}]^{\gamma}}\leq R\lambda^{\frac{s-\gamma}{2}}. (20)
Proof.

Suppose that $f_{\rho}^{*}=\sum_{i\in N}a_{i}e_{i}$. Recalling that $f_{\lambda}=\left(T+\lambda\right)^{-1}g=\left(T+\lambda\right)^{-1}Tf_{\rho}^{*}$, we have

\left\|f_{\lambda}-f_{\rho}^{*}\right\|_{[\mathcal{H}]^{\gamma}}^{2}=\left\|\sum_{i\in N}\frac{\lambda}{\lambda+\lambda_{i}}a_{i}e_{i}(\cdot)\right\|_{[\mathcal{H}]^{\gamma}}^{2}=\sum_{i\in N}\left(\frac{\lambda}{\lambda+\lambda_{i}}\right)^{2}\lambda_{i}^{-\gamma}a_{i}^{2}=\sum_{i\in N}\left(\frac{\lambda\lambda_{i}^{\frac{s-\gamma}{2}}}{\lambda+\lambda_{i}}\right)^{2}\lambda_{i}^{-s}a_{i}^{2}\leq\left(\lambda\sup_{i\in N}\frac{\lambda_{i}^{\frac{s-\gamma}{2}}}{\lambda+\lambda_{i}}\right)^{2}\sum_{i\in N}\lambda_{i}^{-s}a_{i}^{2}\leq\lambda^{s-\gamma}\|f_{\rho}^{*}\|_{[\mathcal{H}]^{s}}^{2},

where we use Lemma A.1 for the last inequality and the $[\mathcal{H}]^{s}$-norm defined by (5). The theorem then follows from Assumption 2.3, i.e., $\|f_{\rho}^{*}\|_{[\mathcal{H}]^{s}}\leq R$.

A.2 Proof of the estimation error

Theorem A.3.

Suppose that Assumptions 2.1, 2.2, 2.3 and 2.4 hold for $0<s\leq 2$ and $\frac{1}{\beta}\leq\alpha_{0}<1$. Suppose that $f_{\rho}^{*}\in L^{q}(\mathcal{X},\mu)$ and $\|f_{\rho}^{*}\|_{L^{q}(\mathcal{X},\mu)}\leq C_{q}<\infty$ for some $q>\frac{2(s\beta+1)}{2+(s-\alpha_{0})\beta}$. Then, in the case of $s+\frac{1}{\beta}>\alpha_{0}$, by choosing $\lambda\asymp n^{-\frac{\beta}{s\beta+1}}$, for any fixed $\delta\in(0,1)$, when $n$ is sufficiently large, with probability at least $1-\delta$, we have

\left\|\hat{f}_{\lambda}-f_{\lambda}\right\|_{L^{2}}\leq\ln\frac{4}{\delta}\,Cn^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}, (22)

where $C$ is a constant that only depends on $\kappa,R,L,\sigma,C_{q}$.

Proof.

First, rewrite the estimation error as follows:

\left\|\hat{f}_{\lambda}-f_{\lambda}\right\|_{L^{2}}=\left\|T^{\frac{1}{2}}\left(\hat{f}_{\lambda}-f_{\lambda}\right)\right\|_{\mathcal{H}}\leq\left\|T^{\frac{1}{2}}T_{\lambda}^{-\frac{1}{2}}\right\|\cdot\left\|T_{\lambda}^{\frac{1}{2}}T_{X\lambda}^{-1}T_{\lambda}^{\frac{1}{2}}\right\|\cdot\left\|T_{\lambda}^{-\frac{1}{2}}\left(g_{Z}-T_{X\lambda}f_{\lambda}\right)\right\|_{\mathcal{H}}. (23)

For the first term in (23), we have

\left\|T^{\frac{1}{2}}T_{\lambda}^{-\frac{1}{2}}\right\|=\sup_{i\in N}\left(\frac{\lambda_{i}}{\lambda_{i}+\lambda}\right)^{\frac{1}{2}}\leq 1. (24)

For the second term in (23), for any $\alpha>\alpha_{0}$, using Proposition D.5 we know that when $\lambda,n$ satisfy

u\coloneqq\frac{M_{\alpha}^{2}\lambda^{-\alpha}}{n}\ln\frac{4\kappa^{2}\mathcal{N}(\lambda)\left(\|T\|+\lambda\right)}{\frac{\delta}{2}\|T\|}\leq\frac{1}{8}, (25)

we have

a\coloneqq\|T_{\lambda}^{-\frac{1}{2}}(T-T_{X})T_{\lambda}^{-\frac{1}{2}}\|\leq\frac{4}{3}u+\sqrt{2u}\leq\frac{2}{3},

with probability at least $1-\frac{\delta}{2}$. Therefore,

\left\|T_{\lambda}^{\frac{1}{2}}T_{X\lambda}^{-1}T_{\lambda}^{\frac{1}{2}}\right\|=\left\|\left(T_{\lambda}^{-\frac{1}{2}}(T_{X}+\lambda)T_{\lambda}^{-\frac{1}{2}}\right)^{-1}\right\|=\left\|\left(I-T_{\lambda}^{-\frac{1}{2}}(T-T_{X})T_{\lambda}^{-\frac{1}{2}}\right)^{-1}\right\|\leq\sum_{k=0}^{\infty}\left\|T_{\lambda}^{-\frac{1}{2}}(T-T_{X})T_{\lambda}^{-\frac{1}{2}}\right\|^{k}\leq\sum_{k=0}^{\infty}\left(\frac{2}{3}\right)^{k}\leq 3, (27)

with probability at least $1-\frac{\delta}{2}$, where $I$ is the identity operator on $\mathcal{H}$. Note that since $s+\frac{1}{\beta}>\alpha_{0}$, there always exists $\alpha_{0}<\alpha<s+\frac{1}{\beta}$ such that (25) is satisfied when $\lambda\asymp n^{-\frac{\beta}{s\beta+1}}$ and $n$ is sufficiently large.

For the third term in (23), it can be rewritten as

\left\|T_{\lambda}^{-\frac{1}{2}}\left(g_{Z}-T_{X\lambda}f_{\lambda}\right)\right\|_{\mathcal{H}}=\left\|T_{\lambda}^{-\frac{1}{2}}\left[g_{Z}-\left(T_{X}+\lambda+T-T\right)f_{\lambda}\right]\right\|_{\mathcal{H}}=\left\|T_{\lambda}^{-\frac{1}{2}}\left[\left(g_{Z}-T_{X}f_{\lambda}\right)-\left(T+\lambda\right)f_{\lambda}+Tf_{\lambda}\right]\right\|_{\mathcal{H}}=\left\|T_{\lambda}^{-\frac{1}{2}}\left[\left(g_{Z}-T_{X}f_{\lambda}\right)-\left(g-Tf_{\lambda}\right)\right]\right\|_{\mathcal{H}}. (28)

Using Proposition D.4, with probability at least 1−δ/2, we have

Tλ12[(gZTXfλ)(gTfλ)]ln4δCλ12βn=ln4δCn12sβsβ+1.\left\|T_{\lambda}^{-\frac{1}{2}}\left[\left(g_{Z}-T_{X}f_{\lambda}\right)-\left(g-Tf_{\lambda}\right)\right]\right\|_{\mathcal{H}}\leq\ln{\frac{4}{\delta}}C\frac{\lambda^{-\frac{1}{2\beta}}}{\sqrt{n}}=\ln{\frac{4}{\delta}}Cn^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}. (29)

Plugging (24), (27) and (29) into (23), we finish the proof. ∎

A.3 Proof of Theorem 3.1

Using the approximation-estimation error decomposition,

f^λfρL2f^λfλL2+fλfρL2,\left\|\hat{f}_{\lambda}-f_{\rho}^{*}\right\|_{L^{2}}\leq\left\|\hat{f}_{\lambda}-f_{\lambda}\right\|_{L^{2}}+\left\|f_{\lambda}-f_{\rho}^{*}\right\|_{L^{2}}, (30)

together with Theorem A.2 and Theorem A.3, we finish the proof of Theorem 3.1.

Appendix B Proof of Theorem 3.6

Note that the RKHS ℋ is defined as the (fractional) Sobolev space H^r(𝒳), which does not depend on the marginal distribution μ, whereas the definition (4) of the interpolation space does depend on μ. When μ has a Lebesgue density p with 0<c≤p(x)≤C, Fischer & Steinwart (2020, (14)) shows that

L2(𝒳,μ)L2(𝒳,ν),L^{2}(\mathcal{X},\mu)\cong L^{2}(\mathcal{X},\nu), (31)

and

[Hr(𝒳)]μs(L2(𝒳,μ),[Hr(𝒳)]μ1)s,2(L2(𝒳,ν),[Hr(𝒳)]ν1)s,2[Hr(𝒳)]νsHrs(𝒳),\left[H^{r}(\mathcal{X})\right]_{\mu}^{s}\cong\left(L_{2}(\mathcal{X},\mu),\left[H^{r}(\mathcal{X})\right]^{1}_{\mu}\right)_{s,2}\cong\left(L_{2}(\mathcal{X},\nu),\left[H^{r}(\mathcal{X})\right]^{1}_{\nu}\right)_{s,2}\cong\left[H^{r}(\mathcal{X})\right]_{\nu}^{s}\cong H^{rs}(\mathcal{X}), (32)

where [H^r(𝒳)]_μ^s denotes the interpolation space of H^r(𝒳) under the marginal distribution μ. So we can also apply Proposition 3.5 to [H^r(𝒳)]_μ^s, and its embedding property is the same as that of [H^r(𝒳)]^s. Denoting β=2r/d, since α₀=1/β, for any 0<s≤2, the conditions

Assumption2.2;s+1β>α0;q>2(sβ+1)2+(sα0)β=2\text{Assumption}~{}\ref{assumption embedding};~{}s+\frac{1}{\beta}>\alpha_{0};~{}q>\frac{2(s\beta+1)}{2+(s-\alpha_{\tiny 0})\beta}=2 (33)

hold simultaneously.
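For concreteness, the equality in (33) is just the following arithmetic with α₀=1/β:

\frac{2(s\beta+1)}{2+(s-\alpha_{0})\beta}=\frac{2(s\beta+1)}{2+s\beta-1}=\frac{2(s\beta+1)}{s\beta+1}=2.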

Therefore, all the assumptions of Theorem 3.1 are satisfied, and Theorem 3.6 follows by applying Theorem 3.1.

Appendix C Proof of Proposition 2.9

We will construct a family of probability distributions on 𝒳×𝒴\mathcal{X}\times\mathcal{Y} and apply Lemma E.6. Recall that μ\mu is a probability distribution on 𝒳\mathcal{X} such that Assumption 2.1 is satisfied. Denote the class of functions

Bs(R)={f[]s:f[]sR},B^{s}(R)=\left\{f\in[\mathcal{H}]^{s}:\|f\|_{[\mathcal{H}]^{s}}\leq R\right\}, (34)

and for every fBs(R)f\in B^{s}(R), define the probability distribution ρf\rho_{f} on 𝒳×𝒴\mathcal{X}\times\mathcal{Y} such that

y=f(x)+ϵ,xμ,y=f(x)+\epsilon,~{}~{}x\sim\mu, (35)

where ϵ∼𝒩(0,σ̄²) and σ̄=min(σ,L). It is easy to show that such ρ_f falls into the family 𝒫 in Proposition 2.9: Assumptions 2.1 and 2.3 are obviously satisfied, and Assumption 2.4 follows from standard bounds on the moments of Gaussian random variables; see, e.g., Fischer & Steinwart (2020, Lemma 21).

Using Lemma E.8, for m=n^{1/(sβ+1)}, there exist ω^(0),⋯,ω^(M)∈{0,1}^m for some M≥2^{m/8} such that

k=1m|ωk(i)ωk(j)|m8,0i<jM.\sum_{k=1}^{m}\left|\omega_{k}^{(i)}-\omega_{k}^{(j)}\right|\geq\frac{m}{8},\quad\forall 0\leq i<j\leq M. (36)

For ϵ=C0msβ1\epsilon=C_{0}m^{-s\beta-1}, define the functions fi,i=1,2,,Mf_{i},i=1,2,\cdots,M as

fi:=ϵ1/2k=1mωk(i)em+k.f_{i}:=\epsilon^{1/2}\sum_{k=1}^{m}\omega_{k}^{(i)}e_{m+k}. (37)

Since

\left\|f_{i}\right\|_{[\mathcal{H}]^{s}}^{2}=\epsilon\sum_{k=1}^{m}\lambda_{m+k}^{-s}\left(\omega_{k}^{(i)}\right)^{2}\leq\epsilon\sum_{k=1}^{m}\lambda_{2m}^{-s}\leq 2^{s\beta}c\epsilon\sum_{k=1}^{m}m^{s\beta}\leq 2^{s\beta}c\epsilon m^{s\beta+1}=2^{s\beta}cC_{0}, (38)

where c in (38) denotes the constant in Assumption 2.1. So if C₀ is small enough that

2^{s\beta}cC_{0}\leq R^{2}, (39)

then we have fiBs(R),i=1,2,,M.f_{i}\in B^{s}(R),i=1,2,\cdots,M.

Using Lemma E.7, we have

KL(ρfin,ρf0n)\displaystyle\mathrm{KL}\left(\rho_{f_{i}}^{n},\rho_{f_{0}}^{n}\right) =n2σ¯2fiL2(𝒳,μ)2\displaystyle=\frac{n}{2\bar{\sigma}^{2}}\left\|f_{i}\right\|_{L^{2}(\mathcal{X},\mu)}^{2}
=nϵ2σ¯2k=1m(ωk(i))2\displaystyle=\frac{n\epsilon}{2\bar{\sigma}^{2}}\sum_{k=1}^{m}\left(\omega_{k}^{(i)}\right)^{2}
nϵm2σ¯2=n2σ¯2C0msβ.\displaystyle\leq\frac{n\epsilon m}{2\bar{\sigma}^{2}}=\frac{n}{2\bar{\sigma}^{2}}C_{0}m^{-s\beta}. (40)

Recall that M≥2^{m/8} implies ln M≥(ln 2/8)m. For a fixed a∈(0,1/8), since m=n^{1/(sβ+1)} (so that m^{sβ+1}=n), in order to guarantee

KL(ρfin,ρf0n)n2σ¯2C0msβaln28malnM,\mathrm{KL}\left(\rho_{f_{i}}^{n},\rho_{f_{0}}^{n}\right)\leq\frac{n}{2\bar{\sigma}^{2}}C_{0}m^{-s\beta}\leq a\frac{\ln 2}{8}m\leq a\ln M, (41)

it suffices to require

C0σ¯2ln24a.C_{0}\leq\frac{\bar{\sigma}^{2}\ln 2}{4}a. (42)

So we can choose C0=caC_{0}=c^{\prime}a such that (39) and (42) are satisfied.

Regard {ρ_{f_i}^n, f_i∈B^s(R)} as a family of probability distributions indexed by f_i; then (41) implies that the second condition in Lemma E.6 holds. Further, using (36), we have

d\left(f_{i},f_{j}\right)^{2}=\left\|f_{i}-f_{j}\right\|_{L^{2}}^{2}=\epsilon\sum_{k=1}^{m}\left(\omega_{k}^{(i)}-\omega_{k}^{(j)}\right)^{2}\geq\frac{\epsilon m}{8}=\frac{c^{\prime}a}{8}m^{-s\beta}=\frac{c^{\prime}a}{8}n^{-\frac{s\beta}{s\beta+1}}, (43)

where cc^{\prime} is a constant independent of nn.

Applying Lemma E.6 to (41) and (43), we have

\inf_{\hat{f}_{n}}\sup_{f\in B^{s}(R)}\mathbb{P}_{\rho_{f}}\left\{\left\|\hat{f}_{n}-f\right\|_{L^{2}}^{2}\geq\frac{c^{\prime}a}{32}n^{-\frac{s\beta}{s\beta+1}}\right\}\geq\frac{\sqrt{M}}{1+\sqrt{M}}\left(1-2a-\sqrt{\frac{2a}{\ln M}}\right). (44)

When n is sufficiently large (so that M is sufficiently large), the quantity on the R.H.S. of (44) is larger than 1−3a. For δ∈(0,1), choose a=δ/3; without loss of generality we may assume a∈(0,1/8). Then (44) shows that there exists a constant C such that, for any estimator f̂, we can find a function f∈B^s(R) and a corresponding distribution ρ_f∈𝒫 such that, with probability at least 1−δ,

f^fL22Cδnsβsβ+1.\left\|\hat{f}-f\right\|_{L^{2}}^{2}\geq C\delta n^{-\frac{s\beta}{s\beta+1}}. (45)

So we finish the proof.

Appendix D Useful propositions for upper bounds

The following proposition bounds the L^∞ norm of f_λ when s≤α₀.

Proposition D.1.

Suppose that Assumptions 2.1, 2.2 and 2.3 hold for 0<s≤α₀ and 1/β≤α₀<1. Denote f_λ=(T+λ)^{-1}g; then for any λ>0 and any α>α₀, we have

fλLMαfρ[]sλαs2.\|f_{\lambda}\|_{L^{\infty}}\leq M_{\alpha}\|f_{\rho}^{*}\|_{[\mathcal{H}]^{s}}\lambda^{-\frac{\alpha-s}{2}}. (46)
Proof.

Suppose that fρ=iNaieif_{\rho}^{*}=\sum_{i\in N}a_{i}e_{i}. Since sα0s\leq\alpha_{0} and α>α0\alpha>\alpha_{0}, we have

fλ[]α2\displaystyle\left\|f_{\lambda}\right\|_{[\mathcal{H}]^{\alpha}}^{2} =iN(λi1α2λ+λi)2ai2\displaystyle=\sum\limits_{i\in N}\left(\frac{\lambda_{i}^{1-\frac{\alpha}{2}}}{\lambda+\lambda_{i}}\right)^{2}a_{i}^{2}
=iN(λi1αs2λ+λi)2λisai2\displaystyle=\sum\limits_{i\in N}\left(\frac{\lambda_{i}^{1-\frac{\alpha-s}{2}}}{\lambda+\lambda_{i}}\right)^{2}\lambda_{i}^{-s}a_{i}^{2}
fρ[]s2λsα,\displaystyle\leq\|f_{\rho}^{*}\|_{[\mathcal{H}]^{s}}^{2}\lambda^{s-\alpha},

where we use Lemma A.1 for the last inequality. Further, using ‖[ℋ]^α↪L^∞(𝒳,μ)‖=M_α from Assumption 2.2, we have ‖f_λ‖_{L^∞}≤M_α‖f_λ‖_{[ℋ]^α}≤M_α‖f_ρ^*‖_{[ℋ]^s}λ^{-(α−s)/2}. ∎

The following proposition is an application of the classical Bernstein-type inequality, but it considers a truncated version of f_ρ^*, which yields a more refined analysis than previous work.

Proposition D.2.

Suppose that Assumptions 2.1, 2.2, 2.3 and 2.4 hold for 0<s≤2 and 1/β≤α₀<1. Denote ξ_i=ξ(x_i,y_i)=T_λ^{-1/2}(K_{x_i}y_i−T_{x_i}f_λ) and Ω₀={x∈𝒳:|f_ρ^*(x)|≤t}. Then for any α>α₀ and all δ∈(0,1), with probability at least 1−δ, we have

1ni=1nξiIxiΩ0𝔼ξxIxΩ0ln2δ(C1λα2nM~+C2𝒩12(λ)n+C3λαs2n),\left\|\frac{1}{n}\sum\limits_{i=1}^{n}\xi_{i}I_{x_{i}\in\Omega_{0}}-\mathbb{E}\xi_{x}I_{x\in\Omega_{0}}\right\|_{\mathcal{H}}\leq\ln\frac{2}{\delta}\left(\frac{C_{1}\lambda^{-\frac{\alpha}{2}}}{n}\cdot\tilde{M}+\frac{C_{2}\mathcal{N}^{\frac{1}{2}}(\lambda)}{\sqrt{n}}+\frac{C_{3}\lambda^{-\frac{\alpha-s}{2}}}{\sqrt{n}}\right), (48)

where M̃=M_αRλ^{-(α−s)/2}+t+L, L is the constant in Assumption 2.4, and C₁=8√2 M_α, C₂=8σ, C₃=8√2 M_αR.

Proof.

Note that f_ρ^* represents a μ-equivalence class of functions in L²(𝒳,μ). When defining the set Ω₀, we take f_ρ^* to be the representative f_ρ^*(x)=∫_𝒴 y dρ(y|x).

To use Lemma E.4, we need to bound the m-th moment of ξ(x,y)IxΩ0\xi(x,y)I_{x\in\Omega_{0}}.

𝔼ξ(x,y)IxΩ0m\displaystyle\mathbb{E}\left\|\xi(x,y)I_{x\in\Omega_{0}}\right\|_{\mathcal{H}}^{m} =𝔼Tλ12Kx(yfλ(x))IxΩ0m\displaystyle=\mathbb{E}\left\|T_{\lambda}^{-\frac{1}{2}}K_{x}(y-f_{\lambda}(x))I_{x\in\Omega_{0}}\right\|_{\mathcal{H}}^{m}
𝔼(Tλ12k(x,)m𝔼(|(yfλ(x))IxΩ0|m|x)).\displaystyle\leq\mathbb{E}\Big{(}\left\|T_{\lambda}^{-\frac{1}{2}}k(x,\cdot)\right\|_{\mathcal{H}}^{m}\mathbb{E}\big{(}\left|(y-f_{\lambda}(x))I_{x\in\Omega_{0}}\right|^{m}~{}\big{|}~{}x\big{)}\Big{)}. (49)

Using the inequality (a+b)m2m1(am+bm)(a+b)^{m}\leq 2^{m-1}\left(a^{m}+b^{m}\right), we have

|yfλ(x)|m\displaystyle\left|y-f_{\lambda}(x)\right|^{m} 2m1(|fλ(x)fρ(x)|m+|fρ(x)y|m)\displaystyle\leq 2^{m-1}\left(\left|f_{\lambda}(x)-f_{\rho}^{*}(x)\right|^{m}+\left|f_{\rho}^{*}(x)-y\right|^{m}\right)
=2m1(|fλ(x)fρ(x)|m+|ϵ|m).\displaystyle=2^{m-1}\left(\left|f_{\lambda}(x)-f_{\rho}^{*}(x)\right|^{m}+|\epsilon|^{m}\right). (50)

Plugging (50) into (49), we have

𝔼ξ(x,y)IxΩ0m\displaystyle\mathbb{E}\left\|\xi(x,y)I_{x\in\Omega_{0}}\right\|_{\mathcal{H}}^{m}~{}\leq~{} 2m1𝔼(Tλ12k(x,)m|(fλ(x)fρ(x))IxΩ0|m)\displaystyle 2^{m-1}\mathbb{E}\Big{(}\left\|T_{\lambda}^{-\frac{1}{2}}k(x,\cdot)\right\|_{\mathcal{H}}^{m}\left|(f_{\lambda}(x)-f_{\rho}^{*}(x))I_{x\in\Omega_{0}}\right|^{m}\Big{)} (51)
+2m1𝔼(Tλ12k(x,)m𝔼(|ϵIxΩ0|m|x))\displaystyle+2^{m-1}\mathbb{E}\Big{(}\left\|T_{\lambda}^{-\frac{1}{2}}k(x,\cdot)\right\|_{\mathcal{H}}^{m}\mathbb{E}\big{(}\left|\epsilon~{}I_{x\in\Omega_{0}}\right|^{m}~{}\big{|}~{}x\big{)}\Big{)} (52)

Now we bound (52). Note that we have proved in Lemma E.1 that, for μ-almost all x∈𝒳,

Tλ12k(x,)Mαλα2;\left\|T_{\lambda}^{-\frac{1}{2}}k(x,\cdot)\right\|_{\mathcal{H}}\leq M_{\alpha}\lambda^{-\frac{\alpha}{2}}; (53)

In addition, we also have

𝔼Tλ12k(x,)2\displaystyle\mathbb{E}\left\|T_{\lambda}^{-\frac{1}{2}}k(x,\cdot)\right\|_{\mathcal{H}}^{2} =𝔼iN(1λi+λ)12λiei(x)ei()2\displaystyle=\mathbb{E}\Big{\|}\sum\limits_{i\in N}(\frac{1}{\lambda_{i}+\lambda})^{\frac{1}{2}}\lambda_{i}e_{i}(x)e_{i}(\cdot)\Big{\|}_{\mathcal{H}}^{2}
=𝔼(iNλiλi+λei2(x))\displaystyle=\mathbb{E}\Big{(}\sum\limits_{i\in N}\frac{\lambda_{i}}{\lambda_{i}+\lambda}e_{i}^{2}(x)\Big{)}
=iNλiλi+λ\displaystyle=\sum\limits_{i\in N}\frac{\lambda_{i}}{\lambda_{i}+\lambda}
=𝒩(λ).\displaystyle=\mathcal{N}(\lambda). (54)

So we have

𝔼Tλ12k(x,)msupx𝒳Tλ12k(x,)m2𝔼Tλ12k(x,)2(Mαλα2)m2𝒩(λ).\mathbb{E}\left\|T_{\lambda}^{-\frac{1}{2}}k(x,\cdot)\right\|_{\mathcal{H}}^{m}\leq\sup_{x\in\mathcal{X}}\left\|T_{\lambda}^{-\frac{1}{2}}k(x,\cdot)\right\|_{\mathcal{H}}^{m-2}\cdot\mathbb{E}\left\|T_{\lambda}^{-\frac{1}{2}}k(x,\cdot)\right\|_{\mathcal{H}}^{2}\leq\big{(}M_{\alpha}\lambda^{-\frac{\alpha}{2}}\big{)}^{m-2}\mathcal{N}(\lambda). (55)

Using Assumption 2.4, we have

𝔼(|ϵIxΩ0|mx)𝔼(|ϵ|mx)12m!σ2Lm2,μ-a.e. x𝒳,\mathbb{E}\left(|\epsilon I_{x\in\Omega_{0}}|^{m}\mid x\right)\leq\mathbb{E}\left(|\epsilon|^{m}\mid x\right)\leq\frac{1}{2}m!\sigma^{2}L^{m-2},\quad\mu\text{-a.e. }x\in\mathcal{X}, (56)

so we get the upper bound of (52), i.e.,

(52)12m!(2σ𝒩12(λ))2(2Mαλα2L)m2.(\ref{proof of 4.9-3})\leq\frac{1}{2}m!\left(\sqrt{2}\sigma\mathcal{N}^{\frac{1}{2}}(\lambda)\right)^{2}(2M_{\alpha}\lambda^{-\frac{\alpha}{2}}L)^{m-2}. (57)

Now we begin to bound (51).

  1. (1)

    When sα0s\leq\alpha_{0}, using the definition of Ω0\Omega_{0} and Proposition D.1, we have

    (fλfρ)IxΩ0LfλL+fρIxΩ0LMαRλαs2+t:=M.\|(f_{\lambda}-f_{\rho}^{*})I_{x\in\Omega_{0}}\|_{L^{\infty}}\leq\|f_{\lambda}\|_{L^{\infty}}+\|f_{\rho}^{*}I_{x\in\Omega_{0}}\|_{L^{\infty}}\leq M_{\alpha}R\lambda^{-\frac{\alpha-s}{2}}+t:=M. (58)
  2. (2)

When s>α₀, without loss of generality, we assume α₀<α≤s. Using Theorem A.2 for γ=α, we have

    (fλfρ)IxΩ0LMαfλfρ[]αMαRλαs2<M.\|(f_{\lambda}-f_{\rho}^{*})I_{x\in\Omega_{0}}\|_{L^{\infty}}\leq M_{\alpha}\|f_{\lambda}-f_{\rho}^{*}\|_{[\mathcal{H}]^{\alpha}}\leq M_{\alpha}R\lambda^{-\frac{\alpha-s}{2}}<M. (59)

Therefore, for all 0<s20<s\leq 2 we have

(fλfρ)IxΩ0LM.\displaystyle\|(f_{\lambda}-f_{\rho}^{*})I_{x\in\Omega_{0}}\|_{L^{\infty}}\leq M. (60)

In addition, using Theorem A.2 for γ=0\gamma=0, we also have

𝔼|(fλ(x)fρ(x))IxΩ0|2𝔼|fλ(x)fρ(x)|2(Rλs2)2.\mathbb{E}|(f_{\lambda}(x)-f_{\rho}^{*}(x))I_{x\in\Omega_{0}}|^{2}\leq\mathbb{E}|f_{\lambda}(x)-f_{\rho}^{*}(x)|^{2}\leq(R\lambda^{\frac{s}{2}})^{2}. (61)

So we get the upper bound of (51), i.e.,

(51)\displaystyle(\ref{proof of 4.9-2}) 2m1(Mαλα2)m(fλfρ)IxΩ0Lm2𝔼|(fλ(x)fρ(x))IxΩ0|2\displaystyle\leq 2^{m-1}(M_{\alpha}\lambda^{-\frac{\alpha}{2}})^{m}\cdot\|(f_{\lambda}-f_{\rho}^{*})I_{x\in\Omega_{0}}\|_{L^{\infty}}^{m-2}\cdot\mathbb{E}|(f_{\lambda}(x)-f_{\rho}^{*}(x))I_{x\in\Omega_{0}}|^{2}
2m1(Mαλα2)mMm2(Rλs2)2\displaystyle\leq 2^{m-1}(M_{\alpha}\lambda^{-\frac{\alpha}{2}})^{m}\cdot M^{m-2}\cdot(R\lambda^{\frac{s}{2}})^{2}
12m!(2Mαλα2M)m2(2MαRλαs2)2.\displaystyle\leq\frac{1}{2}m!\big{(}2M_{\alpha}\lambda^{-\frac{\alpha}{2}}M\big{)}^{m-2}\big{(}2M_{\alpha}R\lambda^{-\frac{\alpha-s}{2}}\big{)}^{2}. (62)

Denote

L~\displaystyle\tilde{L} =2Mα(M+L)λα2\displaystyle=2M_{\alpha}(M+L)\lambda^{-\frac{\alpha}{2}}
σ~\displaystyle\tilde{\sigma} =2MαRλαs2+2σ𝒩12(λ),\displaystyle=2M_{\alpha}R\lambda^{-\frac{\alpha-s}{2}}+\sqrt{2}\sigma\mathcal{N}^{\frac{1}{2}}(\lambda), (63)

then we have 𝔼ξ(x,y)IxΩ0m12m!σ~2L~m2\mathbb{E}\left\|\xi(x,y)I_{x\in\Omega_{0}}\right\|_{\mathcal{H}}^{m}\leq\frac{1}{2}m!\tilde{\sigma}^{2}\tilde{L}^{m-2}. Using Lemma E.4, we finish the proof. ∎

Remark D.3.

In fact, when we later apply Proposition D.2 in the proof of Proposition D.4, the truncation method in this proposition is necessary only in the case s≤α₀. For completeness and consistency of the proof, we also include the case s>α₀ in this proposition.

Based on Proposition D.2, the following proposition gives an upper bound of ‖T_λ^{-1/2}[(g_Z−T_Xf_λ)−(g−Tf_λ)]‖_ℋ.

Proposition D.4.

Suppose that Assumptions 2.1, 2.2, 2.3 and 2.4 hold for 0<s≤2 and 1/β≤α₀<1. Suppose that f_ρ^*∈L^q(𝒳,μ) with ‖f_ρ^*‖_{L^q(𝒳,μ)}≤C_q<∞ for some q>2(sβ+1)/(2+(s−α₀)β). Then, in the case of s+1/β>α₀, by choosing λ≍n^{-β/(sβ+1)}, for any fixed δ∈(0,1), when n is sufficiently large, with probability at least 1−δ, we have

Tλ12[(gZTXfλ)(gTfλ)]ln2δCλ12βn=ln2δCn12sβsβ+1,\left\|T_{\lambda}^{-\frac{1}{2}}\left[\left(g_{Z}-T_{X}f_{\lambda}\right)-\left(g-Tf_{\lambda}\right)\right]\right\|_{\mathcal{H}}\leq\ln{\frac{2}{\delta}}C\frac{\lambda^{-\frac{1}{2\beta}}}{\sqrt{n}}=\ln{\frac{2}{\delta}}Cn^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}, (64)

where CC is a constant that only depends on κ,R,L,σ,Cq\kappa,R,L,\sigma,C_{q}.

Proof.

Denote ξi=ξ(xi,yi)=Tλ12(KxiyiTxifλ)\xi_{i}=\xi(x_{i},y_{i})=T_{\lambda}^{-\frac{1}{2}}(K_{x_{i}}y_{i}-T_{x_{i}}f_{\lambda}) and ξx=ξ(x,y)=Tλ12(KxyTxfλ)\xi_{x}=\xi(x,y)=T_{\lambda}^{-\frac{1}{2}}(K_{x}y-T_{x}f_{\lambda}), then (64) is equivalent to

1ni=1nξi𝔼ξxln2δCλ12βn=ln2δCn12sβsβ+1.\left\|\frac{1}{n}\sum\limits_{i=1}^{n}\xi_{i}-\mathbb{E}\xi_{x}\right\|_{\mathcal{H}}\leq\ln{\frac{2}{\delta}}C\frac{\lambda^{-\frac{1}{2\beta}}}{\sqrt{n}}=\ln{\frac{2}{\delta}}Cn^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}. (65)

Consider the subsets Ω₁={x∈𝒳:|f_ρ^*(x)|≤t} and Ω₂=𝒳\Ω₁. Since ‖f_ρ^*‖_{L^q(𝒳,μ)}≤C_q, by Markov's inequality we have

P(xΩ2)=P(|fρ(x)|>t)𝔼|fρ(x)|qtq(Cq)qtq.P(x\in\Omega_{2})=P\Big{(}|f_{\rho}^{*}(x)|>t\Big{)}\leq\frac{\mathbb{E}|f_{\rho}^{*}(x)|^{q}}{t^{q}}\leq\frac{(C_{q})^{q}}{t^{q}}. (66)

Decomposing ξi\xi_{i} as ξiIxiΩ1+ξiIxiΩ2\xi_{i}I_{x_{i}\in\Omega_{1}}+\xi_{i}I_{x_{i}\in\Omega_{2}}, we have

1ni=1nξi𝔼ξx1ni=1nξiIxiΩ1𝔼ξxIxΩ1+1ni=1nξiIxiΩ2+𝔼ξxIxΩ2.\displaystyle\left\|\frac{1}{n}\sum_{i=1}^{n}\xi_{i}-\mathbb{E}\xi_{x}\right\|_{\mathcal{H}}\leq\left\|\frac{1}{n}\sum_{i=1}^{n}\xi_{i}I_{x_{i}\in\Omega_{1}}-\mathbb{E}\xi_{x}I_{x\in\Omega_{1}}\right\|_{\mathcal{H}}+\left\|\frac{1}{n}\sum_{i=1}^{n}\xi_{i}I_{x_{i}\in\Omega_{2}}\right\|_{{}_{\mathcal{H}}}+\left\|\mathbb{E}\xi_{x}I_{x\in\Omega_{2}}\right\|_{{}_{\mathcal{H}}}. (67)

Given s+1/β>α₀, we first fix an α such that

α0<α<s+1β.\alpha_{0}<\alpha<s+\frac{1}{\beta}. (68)

For the first term in (67), denoted as I, using Proposition D.2, for all δ∈(0,1), with probability at least 1−δ, we have

Iln2δ(C1λα2nM~+C2𝒩12(λ)n+C3λαs2n),\text{I}\leq\ln\frac{2}{\delta}\left(\frac{C_{1}\lambda^{-\frac{\alpha}{2}}}{n}\cdot\tilde{M}+\frac{C_{2}\mathcal{N}^{\frac{1}{2}}(\lambda)}{\sqrt{n}}+\frac{C_{3}\lambda^{-\frac{\alpha-s}{2}}}{\sqrt{n}}\right), (69)

where M̃=M_αRλ^{-(α−s)/2}+t+L. Recalling that s+1/β>α₀, a simple calculation shows that, by choosing λ=n^{-β/(sβ+1)}:

  • the second term in (69):

    C2𝒩12(λ)nλ12βn=n12sβsβ+1;\displaystyle\frac{C_{2}\mathcal{N}^{\frac{1}{2}}(\lambda)}{\sqrt{n}}\asymp\frac{\lambda^{-\frac{1}{2\beta}}}{\sqrt{n}}=n^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}; (70)
  • the third term in (69):

    C3λαs2nn12(αs+1/β1)n12sβsβ+1n12sβsβ+1;\frac{C_{3}\lambda^{-\frac{\alpha-s}{2}}}{\sqrt{n}}\asymp n^{\frac{1}{2}(\frac{\alpha}{s+1/\beta}-1)}\cdot n^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}\lesssim n^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}; (71)
  • the first term in (69):

    C1λα2nM~\displaystyle\frac{C_{1}\lambda^{-\frac{\alpha}{2}}}{n}\cdot\tilde{M} λα2nλαs2+λα2nt+λα2nL.\displaystyle\asymp\frac{\lambda^{-\frac{\alpha}{2}}}{n}\lambda^{-\frac{\alpha-s}{2}}+\frac{\lambda^{-\frac{\alpha}{2}}}{n}\cdot t+\frac{\lambda^{-\frac{\alpha}{2}}}{n}\cdot L. (72)

Further calculations show that

λα2nλαs2=nαs+1/β1n12sβsβ+1n12sβsβ+1,\frac{\lambda^{-\frac{\alpha}{2}}}{n}\lambda^{-\frac{\alpha-s}{2}}=n^{\frac{\alpha}{s+1/\beta}-1}\cdot n^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}\lesssim n^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}, (73)

and

λα2n=n12αβsβ2sβ+1n12sβsβ+1n12sβsβ+1.\frac{\lambda^{-\frac{\alpha}{2}}}{n}=n^{\frac{1}{2}\frac{\alpha\beta-s\beta-2}{s\beta+1}}\cdot n^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}\lesssim n^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}. (74)

To make (72) ≲ n^{-sβ/(2(sβ+1))} when λ=n^{-β/(sβ+1)}, it remains to require (λ^{-α/2}/n)·t ≤ n^{-sβ/(2(sβ+1))}, i.e., t ≤ nλ^{α/2}n^{-sβ/(2(sβ+1))} = n^{(1/2)(1+(1−αβ)/(sβ+1))}, which gives the first restriction on t:

(R1):tn12(1+1αβsβ+1).\textbf{(R1)}:\quad t\leq n^{\frac{1}{2}(1+\frac{1-\alpha\beta}{s\beta+1})}. (75)

That is to say, if we choose tn12(1+1αβsβ+1)t\leq n^{\frac{1}{2}(1+\frac{1-\alpha\beta}{s\beta+1})}, we have

Iln2δCλ12βn=ln2δCn12sβsβ+1.\text{I}\leq\ln{\frac{2}{\delta}}C\frac{\lambda^{-\frac{1}{2\beta}}}{\sqrt{n}}=\ln{\frac{2}{\delta}}Cn^{-\frac{1}{2}\frac{s\beta}{s\beta+1}}. (76)

For the second term in (67), denoted as II, we have

τn:=P(II>λ12βn)\displaystyle\tau_{n}:=P(\text{II}>\frac{\lambda^{-\frac{1}{2\beta}}}{\sqrt{n}}) P(xis.t.xiΩ2)=1P(xiΩ2,i=1,2,,n)\displaystyle\leq P\Big{(}~{}\exists x_{i}~{}\text{s.t.}~{}x_{i}\in\Omega_{2}\Big{)}=1-P\Big{(}x_{i}\notin\Omega_{2},\forall i=1,2,\cdots,n\Big{)}
=1P(xΩ2)n\displaystyle=1-P\Big{(}x\notin\Omega_{2}\Big{)}^{n}
=1P(|fρ(x)|t)n\displaystyle=1-P\Big{(}|f_{\rho}^{*}(x)|\leq t\Big{)}^{n}
1(1(Cq)qtq)n.\displaystyle\leq 1-\Big{(}1-\frac{(C_{q})^{q}}{t^{q}}\Big{)}^{n}. (77)

To ensure that τ_n:=P(II>λ^{-1/(2β)}/√n)→0, we need t^q≫n, i.e., t≫n^{1/q}. This gives the second restriction on t, i.e.,

(R2):tn1q,orn1q=o(t).\textbf{(R2)}:\quad t\gg n^{\frac{1}{q}},~{}\text{or}~{}n^{\frac{1}{q}}=o(t). (78)

For the third term in (67), denoted as III: we have already shown in Lemma E.1 that ‖T_λ^{-1/2}k(x,·)‖_ℋ ≤ M_αλ^{-α/2} for μ-a.e. x∈𝒳, so (suppressing the constant M_α in the display below) we have

III 𝔼ξxIxΩ2𝔼[Tλ12k(x,)|(yfλ(x))IxΩ2|]\displaystyle\leq\mathbb{E}\|\xi_{x}I_{x\in\Omega_{2}}\|_{\mathcal{H}}\leq\mathbb{E}\Big{[}\|T_{\lambda}^{-\frac{1}{2}}k(x,\cdot)\|_{\mathcal{H}}\cdot\big{|}\big{(}y-f_{\lambda}(x)\big{)}I_{x\in\Omega_{2}}\big{|}\Big{]}
λα2𝔼|(yfλ(x))IxΩ2|\displaystyle\leq\lambda^{-\frac{\alpha}{2}}\mathbb{E}\big{|}\big{(}y-f_{\lambda}(x)\big{)}I_{x\in\Omega_{2}}\big{|}
λα2(𝔼|(fρ(x)fλ(x))IxΩ2|+𝔼|(fρ(x)y)IxΩ2|)\displaystyle\leq\lambda^{-\frac{\alpha}{2}}\Big{(}\mathbb{E}\big{|}\big{(}f_{\rho}^{*}(x)-f_{\lambda}(x)\big{)}I_{x\in\Omega_{2}}\big{|}+\mathbb{E}\big{|}\big{(}f_{\rho}^{*}(x)-y\big{)}I_{x\in\Omega_{2}}\big{|}\Big{)}
λα2(𝔼|(fρ(x)fλ(x))IxΩ2|+𝔼|ϵIxΩ2|).\displaystyle\leq\lambda^{-\frac{\alpha}{2}}\Big{(}\mathbb{E}\big{|}\big{(}f_{\rho}^{*}(x)-f_{\lambda}(x)\big{)}I_{x\in\Omega_{2}}\big{|}+\mathbb{E}\big{|}\epsilon\cdot I_{x\in\Omega_{2}}\big{|}\Big{)}. (79)

Using the Cauchy-Schwarz inequality and the bound on the approximation error (Theorem A.2), we have

\mathbb{E}\big{|}\big{(}f_{\rho}^{*}(x)-f_{\lambda}(x)\big{)}I_{x\in\Omega_{2}}\big{|}\leq\left\|f_{\rho}^{*}-f_{\lambda}\right\|_{L^{2}}\cdot\left(P(x\in\Omega_{2})\right)^{\frac{1}{2}}\leq R\lambda^{\frac{s}{2}}C_{q}^{\frac{q}{2}}t^{-\frac{q}{2}}. (80)

In addition, we have

𝔼|ϵIxΩ2|=𝔼(𝔼|ϵIxΩ2||x)σ𝔼|IxΩ2|σ(Cq)qtq.\displaystyle\mathbb{E}\big{|}\epsilon\cdot I_{x\in\Omega_{2}}\big{|}=\mathbb{E}\left(\mathbb{E}\big{|}\epsilon\cdot I_{x\in\Omega_{2}}\big{|}~{}\Big{|}~{}x\right)\leq\sigma\mathbb{E}\left|I_{x\in\Omega_{2}}\right|\leq\sigma(C_{q})^{q}t^{-q}. (81)

Plugging (80) and (81) into (79), we have

IIIRCqq2λαs2tq2+σ(Cq)qλα2tq.\text{III}\leq RC_{q}^{\frac{q}{2}}\lambda^{-\frac{\alpha-s}{2}}t^{-\frac{q}{2}}+\sigma(C_{q})^{q}\lambda^{-\frac{\alpha}{2}}t^{-q}. (82)

Comparing (82) with C₃λ^{-(α−s)/2}/√n and C₁λ^{-α/2}/n in (69), we see that if t≥n^{1/q}, then III ≤ Cλ^{-1/(2β)}/√n = Cn^{-sβ/(2(sβ+1))}. So the third term gives no further restriction on t.

To sum up, if we choose t such that restrictions (75) and (78) are satisfied, then (65) holds with probability at least 1−δ−τ_n, where τ_n→0. For a fixed δ∈(0,1), when n is sufficiently large, τ_n is small enough that, e.g., τ_n<δ/10, so without loss of generality we say that (65) holds with probability at least 1−δ.

Note that such t exists if

1q<12(1+1αβsβ+1)q>2(sβ+1)2+(sα)β.\frac{1}{q}<\frac{1}{2}(1+\frac{1-\alpha\beta}{s\beta+1})\Longleftrightarrow q>\frac{2(s\beta+1)}{2+(s-\alpha)\beta}. (83)

Recalling that in (68) we only require the existence of some α with α₀<α<s+1/β, such t exists if and only if

1q<12(1+1α0βsβ+1)q>2(sβ+1)2+(sα0)β.\frac{1}{q}<\frac{1}{2}(1+\frac{1-\alpha_{0}\beta}{s\beta+1})\Longleftrightarrow q>\frac{2(s\beta+1)}{2+(s-\alpha_{0})\beta}. (84)

This is exactly what we assume in the proposition. ∎

Proposition D.5.

Suppose that the embedding index is α0\alpha_{0}. Then for any α>α0\alpha>\alpha_{0} and all δ(0,1)\delta\in(0,1), with probability at least 1δ1-\delta, we have

Tλ12(TTX)Tλ124Mα2λα3nB+2Mα2λαnB,\|T_{\lambda}^{-\frac{1}{2}}(T-T_{X})T_{\lambda}^{-\frac{1}{2}}\|\leq\frac{4M_{\alpha}^{2}\lambda^{-\alpha}}{3n}B+\sqrt{\frac{2M_{\alpha}^{2}\lambda^{-\alpha}}{n}B}, (85)

where

B=ln4𝒩(λ)(T+λ)δT.B=\ln{\frac{4\mathcal{N}(\lambda)(\|T\|+\lambda)}{\delta\|T\|}}. (86)
Proof.

Denote A_i=T_λ^{-1/2}(T−T_{x_i})T_λ^{-1/2}; using Corollary E.2 we have

\|A_{i}\|\leq\|T_{\lambda}^{-\frac{1}{2}}TT_{\lambda}^{-\frac{1}{2}}\|+\|T_{\lambda}^{-\frac{1}{2}}T_{x_{i}}T_{\lambda}^{-\frac{1}{2}}\|\leq 2M_{\alpha}^{2}\lambda^{-\alpha}. (87)

We use A⪯B to denote that B−A is a positive semi-definite operator. Using the fact that 𝔼(B−𝔼B)²⪯𝔼B² for a self-adjoint operator B, we have

𝔼Ai2𝔼[Tλ12TxiTλ12]2.\mathbb{E}A_{i}^{2}\preceq\mathbb{E}\left[T_{\lambda}^{-\frac{1}{2}}T_{x_{i}}T_{\lambda}^{-\frac{1}{2}}\right]^{2}. (88)

In addition, Corollary E.2 shows that 0⪯T_λ^{-1/2}T_{x_i}T_λ^{-1/2}⪯M_α²λ^{-α} for μ-a.e. x_i∈𝒳. So we have

𝔼Ai2𝔼[Tλ12TxiTλ12]2𝔼[Mα2λαTλ12TxiTλ12]=Mα2λαTλ1T,\mathbb{E}A_{i}^{2}\preceq\mathbb{E}\left[T_{\lambda}^{-\frac{1}{2}}T_{x_{i}}T_{\lambda}^{-\frac{1}{2}}\right]^{2}\preceq\mathbb{E}\left[M_{\alpha}^{2}\lambda^{-\alpha}\cdot T_{\lambda}^{-\frac{1}{2}}T_{x_{i}}T_{\lambda}^{-\frac{1}{2}}\right]=M_{\alpha}^{2}\lambda^{-\alpha}T_{\lambda}^{-1}T, (89)

Defining an operator V:=Mα2λαTλ1TV:=M_{\alpha}^{2}\lambda^{-\alpha}T_{\lambda}^{-1}T, we have

V\displaystyle\|V\| =Mα2λαλ1λ1+λ=Mα2λαTT+λMα2λα;\displaystyle=M_{\alpha}^{2}\lambda^{-\alpha}\frac{\lambda_{1}}{\lambda_{1}+\lambda}=M_{\alpha}^{2}\lambda^{-\alpha}\frac{\|T\|}{\|T\|+\lambda}\leq M_{\alpha}^{2}\lambda^{-\alpha};
trV\displaystyle\text{tr}V =Mα2λα𝒩(λ);\displaystyle=M_{\alpha}^{2}\lambda^{-\alpha}\mathcal{N}(\lambda);
trVV\displaystyle\frac{\text{tr}V}{\|V\|} =𝒩(λ)(T+λ)T.\displaystyle=\frac{\mathcal{N}(\lambda)(\|T\|+\lambda)}{\|T\|}. (90)

Applying Lemma E.3 to A_i and V, we finish the proof. ∎

Appendix E Auxiliary lemmas

E.1 Lemmas for upper bound

The following lemma is where we take advantage of the embedding index and embedding property in Assumption 2.2.

Lemma E.1.

Suppose that the embedding index is α₀. Then for any α>α₀ and μ-almost all x∈𝒳, we have

Tλ12k(x,)2Mα2λα.\displaystyle\|T_{\lambda}^{-\frac{1}{2}}k(x,\cdot)\|_{\mathcal{H}}^{2}\leq M_{\alpha}^{2}\lambda^{-\alpha}. (91)
Proof.

Recalling that []αL(𝒳)=Mα\|[\mathcal{H}]^{\alpha}\hookrightarrow L^{\infty}(\mathcal{X})\|=M_{\alpha}, we have

Tλ12k(x,)2\displaystyle\|T_{\lambda}^{-\frac{1}{2}}k(x,\cdot)\|_{\mathcal{H}}^{2} =iN(1λi+λ)12λiei(x)ei()2\displaystyle=\Big{\|}\sum\limits_{i\in N}(\frac{1}{\lambda_{i}+\lambda})^{\frac{1}{2}}\lambda_{i}e_{i}(x)e_{i}(\cdot)\Big{\|}_{\mathcal{H}}^{2}
=iNλiλi+λei2(x)\displaystyle=\sum\limits_{i\in N}\frac{\lambda_{i}}{\lambda_{i}+\lambda}e_{i}^{2}(x)
\leq\big{[}\sum\limits_{i\in N}\lambda_{i}^{\alpha}e_{i}^{2}(x)\big{]}\sup\limits_{i\in N}\frac{\lambda_{i}^{1-\alpha}}{\lambda_{i}+\lambda}
Mα2λα,\displaystyle\leq M_{\alpha}^{2}\lambda^{-\alpha}, (92)

where the last inequality uses Lemma A.1 together with the embedding property recalled above, and we finish the proof. ∎

Lemma E.1 has a direct corollary.

Corollary E.2.

Suppose that the embedding index is α₀. Then for any α>α₀ and μ-almost all x∈𝒳, we have

Tλ12TxTλ12Mα2λα.\|T_{\lambda}^{-\frac{1}{2}}T_{x}T_{\lambda}^{-\frac{1}{2}}\|\leq M_{\alpha}^{2}\lambda^{-\alpha}. (93)
Proof.

Note that for any ff\in\mathcal{H},

Tλ12TxTλ12f\displaystyle T_{\lambda}^{-\frac{1}{2}}T_{x}T_{\lambda}^{-\frac{1}{2}}f =Tλ12KxKxTλ12f\displaystyle=T_{\lambda}^{-\frac{1}{2}}K_{x}K_{x}^{*}T_{\lambda}^{-\frac{1}{2}}f
=Tλ12Kxk(x,),Tλ12f\displaystyle=T_{\lambda}^{-\frac{1}{2}}K_{x}\langle k(x,\cdot),T_{\lambda}^{-\frac{1}{2}}f\rangle_{\mathcal{H}}
=Tλ12KxTλ12k(x,),f\displaystyle=T_{\lambda}^{-\frac{1}{2}}K_{x}\langle T_{\lambda}^{-\frac{1}{2}}k(x,\cdot),f\rangle_{\mathcal{H}}
=Tλ12k(x,),fTλ12k(x,).\displaystyle=\langle T_{\lambda}^{-\frac{1}{2}}k(x,\cdot),f\rangle_{\mathcal{H}}\cdot T_{\lambda}^{-\frac{1}{2}}k(x,\cdot). (94)

So ‖T_λ^{-1/2}T_xT_λ^{-1/2}‖ = sup_{‖f‖_ℋ=1}‖T_λ^{-1/2}T_xT_λ^{-1/2}f‖_ℋ = sup_{‖f‖_ℋ=1}|⟨T_λ^{-1/2}k(x,·),f⟩_ℋ|·‖T_λ^{-1/2}k(x,·)‖_ℋ = ‖T_λ^{-1/2}k(x,·)‖²_ℋ. Using Lemma E.1, we finish the proof. ∎

The following concentration inequality about self-adjoint Hilbert-Schmidt operator valued random variables is frequently used in related literature, e.g., Fischer & Steinwart (2020, Theorem 27) and Lin & Cevher (2020, Lemma 26).

Lemma E.3.

Let (𝒳,ℬ,μ) be a probability space and ℋ be a separable Hilbert space. Suppose that A₁,⋯,A_n are i.i.d. random variables taking values in the set of self-adjoint Hilbert-Schmidt operators on ℋ, such that 𝔼A_i=0, the operator norm satisfies ‖A_i‖≤L μ-almost surely, and there exists a self-adjoint positive semi-definite trace-class operator V with 𝔼A_i²⪯V. Then for δ∈(0,1), with probability at least 1−δ, we have

1ni=1nAi2Lβ3n+2Vβn,β=ln4trVδV.\displaystyle\left\|\frac{1}{n}\sum_{i=1}^{n}A_{i}\right\|\leq\frac{2L\beta}{3n}+\sqrt{\frac{2\|V\|\beta}{n}},\quad\beta=\ln\frac{4\rm{tr}V}{\delta\|V\|}. (95)

The following Bernstein inequality about vector-valued random variables is frequently used, e.g., Caponnetto & de Vito (2007, Proposition 2) and Fischer & Steinwart (2020, Theorem 26).

Lemma E.4 (Bernstein inequality).

Let (Ω,,P)(\Omega,\mathcal{B},P) be a probability space, HH be a separable Hilbert space, and ξ:ΩH\xi:\Omega\to H be a random variable with

𝔼ξHm12m!σ2Lm2,\mathbb{E}\|\xi\|_{H}^{m}\leq\frac{1}{2}m!\sigma^{2}L^{m-2},

for all m≥2. Then, if ξ₁,⋯,ξ_n are i.i.d. copies of ξ, for δ∈(0,1), with probability at least 1−δ, we have

1ni=1nξi𝔼ξH42ln2δ(Ln+σn).\left\|\frac{1}{n}\sum_{i=1}^{n}\xi_{i}-\mathbb{E}\xi\right\|_{H}\leq 4\sqrt{2}\ln{\frac{2}{\delta}}\left(\frac{L}{n}+\frac{\sigma}{\sqrt{n}}\right). (96)
Lemma E.5.

If λiiβ\lambda_{i}\asymp i^{-\beta}, we have

𝒩(λ)λ1β.\displaystyle\mathcal{N}(\lambda)\asymp\lambda^{-\frac{1}{\beta}}. (97)
Proof.

Since ciβλiCiβc~{}i^{-\beta}\leq\lambda_{i}\leq Ci^{-\beta}, we have

𝒩(λ)\displaystyle\mathcal{N}(\lambda) =i=1λiλi+λi=1CiβCiβ+λ=i=1CC+λiβ\displaystyle=\sum_{i=1}^{\infty}\frac{\lambda_{i}}{\lambda_{i}+\lambda}\leq\sum_{i=1}^{\infty}\frac{Ci^{-\beta}}{Ci^{-\beta}+\lambda}=\sum_{i=1}^{\infty}\frac{C}{C+\lambda i^{\beta}} (98)
0Cλxβ+Cdx=λ1β0Cyβ+CdyC1λ1β.\displaystyle\leq\int_{0}^{\infty}\frac{C}{\lambda x^{\beta}+C}\mathrm{d}x=\lambda^{-\frac{1}{\beta}}\int_{0}^{\infty}\frac{C}{y^{\beta}+C}\mathrm{d}y\leq C_{1}\lambda^{-\frac{1}{\beta}}. (99)

for some constant C1C_{1}. Similarly, we have

𝒩(λ)C2λ1β,\mathcal{N}(\lambda)\geq C_{2}\lambda^{-\frac{1}{\beta}}, (100)

for some constant C2C_{2}. ∎
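As a quick numerical sanity check of Lemma E.5 (not part of the proof), the following Python sketch compares N(λ)=∑_i λ_i/(λ_i+λ) with λ^{-1/β} for the model eigenvalues λ_i=i^{-β}; the value of β, the grid of λ and the truncation level are arbitrary illustrative choices.

import numpy as np

# Numerical sanity check of Lemma E.5: N(lambda) should scale like lambda^{-1/beta}
# when lambda_i = i^{-beta}. beta, the lambda grid and the truncation are illustrative.
beta = 2.0
i = np.arange(1, 10**6 + 1, dtype=float)
eig = i ** (-beta)                                   # model eigenvalues lambda_i = i^{-beta}
for lam in [1e-2, 1e-3, 1e-4]:
    N_lam = np.sum(eig / (eig + lam))                # effective dimension N(lambda)
    print(lam, N_lam, N_lam * lam ** (1.0 / beta))   # last column should stay roughly constant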

E.2 Lemmas for minimax lower bound

The following lemma is a standard tool for deriving minimax lower bounds; it can be found in Tsybakov (2009, Theorem 2.5).

Lemma E.6.

Suppose that there is a non-parametric class of functions Θ\Theta and a (semi-)distance d(,)d(\cdot,\cdot) on Θ\Theta. {Pθ,θΘ}\left\{P_{\theta},\theta\in\Theta\right\} is a family of probability distributions indexed by Θ\Theta. Assume that M2M\geq 2 and suppose that Θ\Theta contains elements θ0,θ1,,θM\theta_{0},\theta_{1},\cdots,\theta_{M} such that,

  1. (1)

    d(θj,θk)2s>0,0j<kMd\left(\theta_{j},\theta_{k}\right)\geq 2s>0,\quad\forall 0\leq j<k\leq M;

  2. (2)

    PjP0,j=1,,MP_{j}\ll P_{0},\quad\forall j=1,\cdots,M, and

    1Mj=1MK(Pj,P0)alogM,\frac{1}{M}\sum_{j=1}^{M}K\left(P_{j},P_{0}\right)\leq a\log M, (101)

with 0<a<1/80<a<1/8 and Pj=Pθj,j=0,1,,MP_{j}=P_{\theta_{j}},j=0,1,\cdots,M. Then

infθ^supθΘPθ(d(θ^,θ)s)M1+M(12a2alogM).\inf_{\hat{\theta}}\sup_{\theta\in\Theta}P_{\theta}(d(\hat{\theta},\theta)\geq s)\geq\frac{\sqrt{M}}{1+\sqrt{M}}\left(1-2a-\sqrt{\frac{2a}{\log M}}\right). (102)
Lemma E.7.

Suppose that μ\mu is a distribution on 𝒳\mathcal{X} and fiL2(𝒳,μ)f_{i}\in L^{2}(\mathcal{X},\mu). Suppose that

y=fi(x)+ϵ,i=1,2,y=f_{i}(x)+\epsilon,\quad i=1,2, (103)

where ϵ∼𝒩(0,σ²) is an independent Gaussian error. Denote the two corresponding distributions on 𝒳×𝒴 by ρ_i, i=1,2. The KL divergence of two probability distributions P₁, P₂ on Ω is defined as

K(P1,P2)Ωlog(dP1dP2)dP1,K\left(P_{1},P_{2}\right)\coloneqq\int_{\Omega}\log\left(\frac{\mathrm{d}P_{1}}{\mathrm{~{}d}P_{2}}\right)\mathrm{d}P_{1}, (104)

if P1P2P_{1}\ll P_{2} and otherwise K(P1,P2)K\left(P_{1},P_{2}\right)\coloneqq\infty. Then we have

KL(ρ1n,ρ2n)=nKL(ρ1,ρ2)=n2σ2f1f2L2(𝒳,dμ)2,\mathrm{KL}\left(\rho_{1}^{n},\rho_{2}^{n}\right)=n\mathrm{KL}\left(\rho_{1},\rho_{2}\right)=\frac{n}{2\sigma^{2}}\left\|f_{1}-f_{2}\right\|_{L^{2}(\mathcal{X},d\mu)}^{2}, (105)

where ρ_i^n denotes the n-fold product of the distribution ρ_i, i=1,2.

Proof.

The lemma directly follows from the definition of KL divergence and the fact that

KL(N(f1(x),σ2),N(f2(x),σ2))=12σ2|f1(x)f2(x)|2.\mathrm{KL}\left(N\left(f_{1}(x),\sigma^{2}\right),N\left(f_{2}(x),\sigma^{2}\right)\right)=\frac{1}{2\sigma^{2}}\left|f_{1}(x)-f_{2}(x)\right|^{2}. (106)

The following lemma (the Varshamov-Gilbert bound) is a result from Tsybakov (2009, Lemma 2.9).

Lemma E.8.

Denote Ω={ω=(ω₁,⋯,ω_m), ω_i∈{0,1}}={0,1}^m. Let m≥8. Then there exists a subset {ω^(0),⋯,ω^(M)} of Ω such that ω^(0)=(0,⋯,0),

dHam (ω(i),ω(j))k=1m|ωk(i)ωk(j)|m8,0i<jM,d_{\text{Ham }}\left(\omega^{(i)},\omega^{(j)}\right)\coloneqq\sum_{k=1}^{m}\left|\omega_{k}^{(i)}-\omega_{k}^{(j)}\right|\geq\frac{m}{8},\quad\forall 0\leq i<j\leq M, (107)

and M2m/8M\geq 2^{m/8}.
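As a small illustration of Lemma E.8 (a random search over candidates, not the combinatorial argument behind the lemma), the following Python sketch collects binary strings of length m whose pairwise Hamming distances are at least m/8; the value of m and the candidate budget are arbitrary.

import numpy as np

# Illustration in the spirit of Lemma E.8 (Varshamov-Gilbert): greedily collect binary
# strings of length m with pairwise Hamming distance at least m/8. Random search only;
# the lemma itself guarantees the existence of at least 2^{m/8} such strings.
rng = np.random.default_rng(0)
m = 64
codewords = [np.zeros(m, dtype=int)]                 # omega^(0) = (0, ..., 0)
for _ in range(1000):                                # illustrative candidate budget
    cand = rng.integers(0, 2, m)
    if all(np.abs(cand - w).sum() >= m / 8 for w in codewords):
        codewords.append(cand)
print(len(codewords), "strings found; the lemma guarantees at least", 2 ** (m // 8))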

Appendix F Details of experiments

F.1 Experiments in Sobolev RKHS

First, we prove that the series in (14) converges and f(x)f^{*}(x) is continuous on (0,1)(0,1) for 0<s<1β=0.50<s<\frac{1}{\beta}=0.5.

We begin with the sum of the first N terms of {sin 2kπx + cos 2kπx}. Note that

2sin(πx)(sin(2πx)+sin(4πx)++sin(2Nπx))\displaystyle-2\sin(\pi x)\left(\sin\left(2\pi x\right)+\sin\left(4\pi x\right)+\cdots+\sin\left(2N\pi x\right)\right)
=[cos(2π+π)xcos(2ππ)x]+[cos(4π+π)xcos(4ππ)x]\displaystyle=\left[\cos\left(2\pi+\pi\right)x-\cos\left(2\pi-\pi\right)x\right]+\left[\cos\left(4\pi+\pi\right)x-\cos\left(4\pi-\pi\right)x\right]
++[cos(2Nπ+π)xcos(2Nππ)x]\displaystyle\quad\quad+\cdots+\left[\cos\left(2N\pi+\pi\right)x-\cos\left(2N\pi-\pi\right)x\right]
=cos(2Nπ+π)xcosπx.\displaystyle=\cos\left(2N\pi+\pi\right)x-\cos\pi x. (108)

So we have

|(sin(2πx)+sin(4πx)++sin(2Nπx))|=|cos(2Nπ+π)xcosπx||2sin(πx)|;\displaystyle\left|\left(\sin\left(2\pi x\right)+\sin\left(4\pi x\right)+\cdots+\sin\left(2N\pi x\right)\right)\right|=\frac{\left|\cos\left(2N\pi+\pi\right)x-\cos\pi x\right|}{\left|2\sin(\pi x)\right|}; (109)

Similarly, we have

|(cos(2πx)+cos(4πx)++cos(2Nπx))|=|sin(2Nπ+π)xsinπx||2sin(πx)|.\displaystyle\left|\left(\cos\left(2\pi x\right)+\cos\left(4\pi x\right)+\cdots+\cos\left(2N\pi x\right)\right)\right|=\frac{\left|\sin\left(2N\pi+\pi\right)x-\sin\pi x\right|}{\left|2\sin(\pi x)\right|}. (110)

Note that (109) and (110) are uniformly bounded on [δ₀,1−δ₀] for any δ₀>0 and any N. In addition, {k^{-(s+0.5)}} is monotone and decreases to zero. By the Dirichlet criterion, the series in (14) converges uniformly on [δ₀,1−δ₀]. Since δ₀ is arbitrary, the series converges and f^*(x) is continuous on (0,1).

In Figure 3 (a), we present the results of different choices of c for λ=cn^{-β/(sβ+1)} in the experiment of Section 4.2. Figure 2 corresponds to the curve c=0.1 in Figure 3 (a), which has the smallest generalization error. In Figure 3 (b), we use 5-fold cross validation to choose the regularization parameter in KRR and present the logarithmic errors against the logarithmic sample sizes. Again, we use logarithmic least-squares to compute the convergence rate r, and the resulting rate is still approximately n^{-sβ/(sβ+1)}=n^{-4/9}.
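For readers who wish to reproduce the qualitative behavior in Figure 3, the following Python sketch mimics the setup described above. The kernel (the exponential kernel exp(−|x−x'|), whose RKHS is, up to equivalent norms, the Sobolev space H^1, so β=2), the noise level, the sample sizes and the truncation level of the series (14) are our own illustrative choices and may differ from the exact configuration of the experiments; s=0.4 and β=2 are the values suggested by the rate n^{-4/9} reported above.

import numpy as np

# Hypothetical re-implementation sketch of the Sobolev-RKHS experiment described above.
# Kernel, noise level, sample sizes and series truncation are illustrative assumptions.

def f_star(x, s=0.4, K=1000):
    # truncated version of the series in (14): sum_k k^{-(s+0.5)} (sin 2k pi x + cos 2k pi x)
    k = np.arange(1, K + 1)
    return (k ** (-(s + 0.5)) * (np.sin(2 * np.pi * np.outer(x, k))
                                 + np.cos(2 * np.pi * np.outer(x, k)))).sum(axis=1)

def krr_predict(x_tr, y_tr, x_te, lam):
    # KRR with the exponential kernel exp(-|x - x'|) (an H^1-equivalent RKHS)
    K_tr = np.exp(-np.abs(x_tr[:, None] - x_tr[None, :]))
    K_te = np.exp(-np.abs(x_te[:, None] - x_tr[None, :]))
    alpha = np.linalg.solve(K_tr + len(x_tr) * lam * np.eye(len(x_tr)), y_tr)
    return K_te @ alpha

rng = np.random.default_rng(0)
s, beta, c, sigma = 0.4, 2.0, 0.1, 0.3
ns = [200, 400, 800, 1600, 3200]
x_te = np.linspace(0.0, 1.0, 2000)
f_te = f_star(x_te, s)
errs = []
for n in ns:
    x_tr = rng.uniform(0.0, 1.0, n)
    y_tr = f_star(x_tr, s) + sigma * rng.standard_normal(n)
    lam = c * n ** (-beta / (s * beta + 1))          # lambda = c * n^{-beta/(s beta + 1)}
    errs.append(np.mean((krr_predict(x_tr, y_tr, x_te, lam) - f_te) ** 2))

# slope of log(error) against log(n); it should be roughly -s*beta/(s*beta+1) = -4/9
slope = np.polyfit(np.log(ns), np.log(errs), 1)[0]
print(list(zip(ns, errs)), slope)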

Figure 3: Error decay curves of Sobolev RKHS (two panels: (a) fixed c; (b) cross validation). Both axes are logarithmic. (a) The curves show the average generalization errors of different c over 50 trials; the regions within one standard deviation are shown in the corresponding colors. (b) The scatters show the average generalization errors obtained by 5-fold cross validation over 50 trials. In both (a) and (b), the dashed black lines are computed using logarithmic least-squares, and the slopes represent the convergence rates r.

F.2 Experiments in general RKHS

First, we prove that the series in (15) converges and f(x)f^{*}(x) is continuous on (0,1)(0,1) for 0<s<1β=0.50<s<\frac{1}{\beta}=0.5.

We begin with the sum of the first N terms of e_{2k−1}(x):

2sin(πx)(sin(πx2)+sin(5πx2)++sin((4N3)πx2))\displaystyle-2\sin(\pi x)\left(\sin\left(\frac{\pi x}{2}\right)+\sin\left(\frac{5\pi x}{2}\right)+\cdots+\sin\left(\frac{\left(4N-3\right)\pi x}{2}\right)\right)
=[cos(π+π2)xcos(ππ2)x]+[cos(5π2+π)xcos(5π2π)x]\displaystyle=\left[\cos\left(\pi+\frac{\pi}{2}\right)x-\cos\left(\pi-\frac{\pi}{2}\right)x\right]+\left[\cos\left(\frac{5\pi}{2}+\pi\right)x-\cos\left(\frac{5\pi}{2}-\pi\right)x\right]
++[cos((4N3)π2+π)xcos((4N3)π2π)x]\displaystyle\quad\quad+\cdots+\left[\cos\left(\frac{\left(4N-3\right)\pi}{2}+\pi\right)x-\cos\left(\frac{\left(4N-3\right)\pi}{2}-\pi\right)x\right]
=cos((4N1)π2)xcosπ2x.\displaystyle=\cos\left(\frac{\left(4N-1\right)\pi}{2}\right)x-\cos\frac{\pi}{2}x. (111)

So we have

|sin(πx2)+sin(5πx2)++sin((4N3)πx2)|=|cos((4N1)π2)xcosπ2x||2sin(πx)|,\displaystyle\left|\sin\left(\frac{\pi x}{2}\right)+\sin\left(\frac{5\pi x}{2}\right)+\cdots+\sin\left(\frac{\left(4N-3\right)\pi x}{2}\right)\right|=\frac{\left|\cos\left(\frac{\left(4N-1\right)\pi}{2}\right)x-\cos\frac{\pi}{2}x\right|}{\left|2\sin(\pi x)\right|}, (112)

which is uniformly bounded in [δ0,1δ0][\delta_{0},1-\delta_{0}] for any δ0>0\delta_{0}>0 and NN.

Note that {k^{-(s+0.5)}} is monotone and decreases to zero. By the Dirichlet criterion, the series in (15) converges uniformly on [δ₀,1−δ₀]. Since δ₀ is arbitrary, the series converges and f^*(x) is continuous on (0,1).

In Figure 4 (a), we present the results of different choices of c for λ=cn^{-β/(sβ+1)} in the experiment of Section 4.2. Figure 2 corresponds to the curve c=1.0 in Figure 4 (a), which has the smallest generalization error. In Figure 4 (b), we use 5-fold cross validation to choose the regularization parameter in KRR and present the logarithmic errors against the logarithmic sample sizes. Again, we use logarithmic least-squares to compute the convergence rate r, and the resulting rate is still approximately n^{-sβ/(sβ+1)}=n^{-4/9}.
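A minimal sketch of the 5-fold cross-validation used in panels (b) of Figures 3 and 4 is given below; the kernel and the candidate grid for λ are again illustrative assumptions rather than the exact configuration used in the experiments.

import numpy as np

# Minimal 5-fold cross-validation for the KRR regularization parameter (illustrative only).
def krr_predict(x_tr, y_tr, x_te, lam):
    # exponential kernel as a stand-in for a Sobolev-type kernel (assumption)
    K_tr = np.exp(-np.abs(x_tr[:, None] - x_tr[None, :]))
    K_te = np.exp(-np.abs(x_te[:, None] - x_tr[None, :]))
    alpha = np.linalg.solve(K_tr + len(x_tr) * lam * np.eye(len(x_tr)), y_tr)
    return K_te @ alpha

def cv_select_lambda(x, y, lambdas, n_folds=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(x))
    folds = np.array_split(idx, n_folds)
    scores = []
    for lam in lambdas:
        fold_errs = []
        for k in range(n_folds):
            te = folds[k]
            tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
            pred = krr_predict(x[tr], y[tr], x[te], lam)
            fold_errs.append(np.mean((pred - y[te]) ** 2))
        scores.append(np.mean(fold_errs))
    return lambdas[int(np.argmin(scores))]

# usage (with training data x_tr, y_tr generated as in the sketch of Appendix F.1):
# lam_cv = cv_select_lambda(x_tr, y_tr, np.logspace(-4, 0, 20))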

Figure 4: Error decay curves of general RKHS (two panels: (a) fixed c; (b) cross validation).