
Oversampling Divide-and-conquer for Response-skewed Kernel Ridge Regression

Jingyi Zhang
Tsinghua University
China
Xiaoxiao Sun
The University of Arizona
Abstract

The divide-and-conquer method has been widely used to compute large-scale kernel ridge regression estimates. Unfortunately, when the response variable is highly skewed, divide-and-conquer kernel ridge regression (dacKRR) may overlook the underrepresented region and produce unacceptable results. To overcome this limitation, we combine a novel response-adaptive partition strategy with the oversampling technique. Through the proposed novel algorithm, we allocate some carefully identified informative observations to multiple nodes (local processors). Although the oversampling technique has been widely used for addressing discrete label skewness, extending it to the dacKRR setting is nontrivial. We provide both theoretical and practical guidance on how to effectively over-sample the observations under the dacKRR setting. Furthermore, we show the proposed estimate has a smaller risk than that of the classical dacKRR estimate under mild conditions. Our theoretical findings are supported by both simulated and real-data analyses.

1 INTRODUCTION

We consider the problem of calculating large-scale kernel ridge regression (KRR) estimates in a nonparametric regression model. Although the theoretical properties of the KRR estimator are well-understood [8, 50, 38], in practice, the computation of KRR estimates may suffer from a large computational burden. In particular, for a sample of size N, it requires O(N^{3}) computational time to calculate a KRR estimate using the standard approach, as will be detailed in Section 2. Such a computational cost is prohibitive when the sample size N is considerable. The divide-and-conquer approach has been implemented pervasively to alleviate such computational burden [51, 52, 44, 45, 46]. Such an approach randomly partitions the full sample into k subsamples of equal sizes, then calculates a local estimate on an independent local processor (also called a local node) for each subsample. The local estimates are then averaged to obtain the global estimate. The divide-and-conquer approach reduces the computational cost of calculating KRR estimates from O(N^{3}) to O(N^{3}/k^{2}). Such savings may be substantial as k grows.

Despite the algorithmic benefits, the success of the divide-and-conquer approach highly depends on the assumption that the subsamples can well represent the observed full sample. Nevertheless, this assumption cannot be guaranteed in many real-world applications, where the response variable Y may have a highly skewed distribution. Specifically, the random variable Y has a highly skewed distribution f_{Y} if f_{Y} is nearly zero inside a large region A_{Y}\subset\mathbb{R}. Problems of this type arise in high energy physics, Bayesian inference, financial research, biomedical research, and environmental studies, among others [28, 1, 12]. In these applications, the responses Y\in A_{Y} occur with reduced frequency. However, such responses are often of more interest as they tend to have a more widespread impact. For example, Y\in A_{Y} may represent a rare signal for seismic activity or stock market fluctuations. Overlooking such signals could result in a substantial negative impact on society, either economically or in terms of human casualties.

Figure 1: An illustration of unacceptable results for the classical dacKRR estimate.

Recall that in the classical divide-and-conquer approach, a random subsample is processed on every local node. Under the aforementioned rare-event scenarios, such a subsample could fail to include observations from the potentially informative region A_{Y}. The local estimate based on such a subsample is thus very likely to overlook the region A_{Y}. Averaging these local estimates could lead to unreliable estimations and predictions over these informative regions. A synthetic example in Fig. 1 illustrates the scenario in which the response is highly skewed. In this example, the one-dimensional sample (gray points) is uniformly generated from [0,1]. The response variable Y has a heavy-tailed distribution, as illustrated by the histogram. The classical dacKRR method is used to estimate the true function (gray curve). We observe that almost all the local estimates (blue curves) miss the peaks of the true function. Averaging these local estimates thus results in a global estimate (red curve) with unacceptable estimation performance. Such an observation is due to the fact that the subsample in each node overlooks the informative region of the distribution of Y (the peaks) and thus fails to provide a successful global estimate.

Our contributions. To combat these obstacles, we develop a novel response-adaptive partition approach with oversampling to obtain large-scale kernel ridge regression estimates, especially when the response is highly skewed. Although the oversampling technique is widely used for addressing discrete label imbalance, extending such a technique to the continuous dacKRR setting is nontrivial. We bridge the gap by providing both theoretical and practical guidance on how to effectively over-sample the observations. Different from the classical divide-and-conquer approach, in which each observation is allocated to only one local node, we propose to allocate some informative data points to multiple nodes. Theoretically, we show the proposed estimates have a smaller risk than the classical dacKRR estimates under mild conditions. Furthermore, we show the number of strata l of the response vector and the number of nodes k regulate the trade-off between the computational time and the risk of the proposed estimator. In particular, when k is well-selected, a larger l is associated with a longer computational time as well as a smaller risk. Such results are novel to the best of our knowledge. Our theoretical findings are supported by both simulated and real-data analyses. In addition, we show the proposed method is not specific to the dacKRR method and has the potential to improve the estimation of other nonparametric regression estimators under the divide-and-conquer setting.

2 PRELIMINARIES

Model setup. Let {\mathcal{H}} be a reproducing kernel Hilbert space (RKHS). Such an RKHS is induced by the reproducing kernel K(\cdot,\cdot):\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R}, which is a symmetric nonnegative definite function, and an inner product \langle\cdot,\cdot\rangle_{\mathcal{H}} satisfying \langle g(\cdot),K(\bm{x},\cdot)\rangle_{\mathcal{H}}=g(\bm{x}) for any g\in{\mathcal{H}}. The RKHS {\mathcal{H}} is equipped with the norm ||g||_{\mathcal{H}}=\langle g(\cdot),g(\cdot)\rangle_{\mathcal{H}}^{1/2}. The well-known Mercer's theorem states that, under some regularity conditions, the kernel function K(\cdot,\cdot) can be written as K(\bm{x},\bm{z})=\sum_{j=1}^{\infty}\mu_{j}\phi_{j}(\bm{x})\phi_{j}(\bm{z}), where \{\mu_{j}\}_{j=1}^{\infty}\geq 0 is a sequence of decreasing eigenvalues and \{\phi_{j}(\cdot)\}_{j=1}^{\infty} is a family of orthonormal basis functions. The smoothness of a function g\in{\mathcal{H}} is characterized by the decay rate of the eigenvalues \{\mu_{j}\}_{j=1}^{\infty}. The major types of such decay rates include the finite rank type, the exponentially decaying type, and the polynomially decaying type. Representative kernel functions of these three major types are the polynomial kernel K(\bm{x},\bm{z})=(1+\bm{x}^{\intercal}\bm{z})^{r} (r is an integer), the Gaussian kernel K(\bm{x},\bm{z})=\exp(-||\bm{x}-\bm{z}||^{2}/\sigma^{2}) (\sigma>0 is the scale parameter and ||\cdot|| is the Euclidean norm), and the kernel K(x,z)=1+\min(x,z), respectively. We refer to [14, 36] for more details.
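To make the kernel notation concrete, the following is a minimal Python/NumPy sketch of the Gaussian kernel matrix referred to above; the vectorized pairwise-distance computation is an implementation choice of this sketch, not a prescription from the paper.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Kernel matrix with entries K(x_i, z_j) = exp(-||x_i - z_j||^2 / sigma^2)."""
    X, Z = np.atleast_2d(X), np.atleast_2d(Z)
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Z ** 2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / sigma ** 2)
```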

Consider the nonparametric regression model

y_{i}=\eta(\bm{x}_{i})+\epsilon_{i},\quad i=1,\ldots,N, \qquad (1)

where y_{i}\in\mathbb{R} is the response, \bm{x}_{i}\in\mathbb{R}^{d} is the vector of predictors, \eta is the unknown function to be estimated, and \{\epsilon_{i}\}_{i=1}^{N} are i.i.d. normal random errors with zero mean and unknown variance \sigma^{2}<\infty. The kernel ridge regression estimator aims to find a projection of \eta into the RKHS {\mathcal{H}}. Such an estimator \widehat{\eta} can be written as

\widehat{\eta}=\underset{\eta\in{\mathcal{H}}}{\mathrm{argmin}}\left\{\frac{1}{N}\sum_{i=1}^{N}\{y_{i}-\eta(\bm{x}_{i})\}^{2}+\lambda||\eta||_{\mathcal{H}}^{2}\right\}. \qquad (2)

Here, the regularization parameter \lambda controls the trade-off between the goodness of fit and the smoothness of \widehat{\eta}. A penalized least squares framework analogous to Equation (2) has been extensively studied in the literature on regression splines and smoothing splines [40, 13, 23, 16, 10, 33, 48, 37, 47, 24, 49, 29].

The well-known representer theorem [40] states that the minimizer of Equation (2) in the RKHS {\mathcal{H}} takes the form \widehat{\eta}(\bm{x})=\sum_{i=1}^{N}\beta_{i}K(\bm{x}_{i},\bm{x}). Let \bm{y}=(y_{1},\ldots,y_{N})^{\intercal} be the response vector, \bm{\beta}=(\beta_{1},\ldots,\beta_{N})^{\intercal} be the coefficient vector, and {\mathbf{K}} be the kernel matrix whose (i,j)-th element equals K(\bm{x}_{i},\bm{x}_{j}). The coefficient vector \bm{\beta} can be estimated by solving the following minimization problem,

\widehat{\bm{\beta}}=\underset{\bm{\beta}\in\mathbb{R}^{N}}{\mathrm{argmin}}\;\frac{1}{N}(\bm{y}-{\mathbf{K}}\bm{\beta})^{\intercal}(\bm{y}-{\mathbf{K}}\bm{\beta})+\lambda\bm{\beta}^{\intercal}{\mathbf{K}}\bm{\beta}. \qquad (3)

It is known that the solution of such a minimization problem has a closed form

\widehat{\bm{\beta}}=({\mathbf{K}}+N\lambda\mathbf{I}_{N})^{-1}\bm{y}, \qquad (4)

where the regularization parameter \lambda can be selected based on the cross-validation technique, Mallows's Cp method [27], or the generalized cross-validation (GCV) criterion [41].
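For reference, a minimal sketch of the closed-form solution (4) is given below; solving the linear system instead of forming the matrix inverse explicitly is a numerical choice of this sketch.

```python
import numpy as np

def krr_fit(K, y, lam):
    """Compute beta_hat = (K + N*lam*I)^{-1} y, as in Equation (4)."""
    N = K.shape[0]
    return np.linalg.solve(K + N * lam * np.eye(N), y)

def krr_predict(K_new, beta):
    """Evaluate eta_hat(x) = sum_i beta_i K(x_i, x); rows of K_new are K(x_new, x_i)."""
    return K_new @ beta
```

The linear solve above is the O(N^{3}) step discussed in the next paragraph.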

Although the solution of the minimization problem (3) has a closed form, the computational cost of calculating the solution using Equation (4) is of the order O(N^{3}), which is prohibitive when the sample size N is considerable. In this paper, we focus on the divide-and-conquer approach for alleviating such a computational burden. In recent decades, there have also been a large number of studies that aim to develop low-rank matrix approximation methods to accelerate the calculation of kernel ridge regression estimates [31, 42, 32, 26, 30]. These approaches are beyond the scope of this paper. In practice, one can combine the divide-and-conquer approach with the aforementioned methods to further accelerate the calculation.

Background of the divide-and-conquer approach. The classical divide-and-conquer approach is easy to describe. Rather than solving the minimization problem (3) using the full sample, the divide-and-conquer approach randomly partitions the full sample into k subsamples of equal sizes. Each subsample is then allocated to an independent local node. Next, k minimization problems are solved independently on each local node based on the corresponding subsamples. The final estimate is simply the average of all the local estimates. The divide-and-conquer approach has proven to be effective in linear models [4, 21], partially linear models [53], nonparametric regression models [51, 52, 19, 35, 11], principal component analysis [43, 6], matrix factorization [25], among others.

Algorithm 1 Classical divide-and-conquer kernel ridge regression
Input: The training set \{(\bm{x}_{i},y_{i})\}_{i=1}^{N}; the number of nodes k
Step 1: Randomly and evenly partition the sample \{(\bm{x}_{1},y_{1}),\ldots,(\bm{x}_{N},y_{N})\} into k disjoint subsamples, denoted by \mathcal{S}_{1},\ldots,\mathcal{S}_{k}. Let |\mathcal{S}_{i}| be the number of observations in \mathcal{S}_{i}.
Step 2:
for j in 1,\ldots,k do
     calculate the local kernel ridge regression estimate on the j-th local node
\hat{\eta}_{j}=\underset{\eta\in{\mathcal{H}}}{\mathrm{argmin}}\Big\{\frac{1}{|\mathcal{S}_{j}|}\sum_{(\bm{x},y)\in\mathcal{S}_{j}}(y-\eta(\bm{x}))^{2}+\lambda_{j}||\eta||_{\mathcal{H}}^{2}\Big\}.
end for
Output: Average the local estimates to obtain the final estimate \widetilde{\eta}=\frac{1}{k}\sum_{j=1}^{k}\hat{\eta}_{j}.

Algorithm 1 summarizes the classical divide-and-conquer method under the kernel ridge regression setting. Such an algorithm reduces the computational cost of the estimation from O(N^{3}) to O(N^{3}/k^{2}). The savings may be substantial as k grows.
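For illustration, a compact sketch of Algorithm 1 is given below, assuming the gaussian_kernel and krr_fit helpers sketched in this section are in scope; the common regularization parameter lam is taken as given, since its selection is discussed next.

```python
import numpy as np

def dac_krr_fit(X, y, k, lam, sigma=1.0, rng=None):
    """Classical dacKRR: random equal-sized partition, one local KRR per node."""
    rng = np.random.default_rng(rng)
    parts = np.array_split(rng.permutation(len(y)), k)   # Step 1: random partition
    local_fits = []
    for part in parts:                                   # Step 2: local estimates
        Xj, yj = X[part], y[part]
        local_fits.append((Xj, krr_fit(gaussian_kernel(Xj, Xj, sigma), yj, lam)))
    return local_fits

def dac_krr_predict(local_fits, X_new, sigma=1.0):
    """Output: average the k local estimates to form the global estimate."""
    preds = [gaussian_kernel(X_new, Xj, sigma) @ bj for Xj, bj in local_fits]
    return np.mean(preds, axis=0)
```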

One difficulty in Algorithm 1 is how to choose the regularization parameters \lambda_{j}. A natural way to determine these regularization parameters is to utilize the standard approaches, e.g., Mallows's Cp criterion or the GCV criterion, based only on the local subsample \mathcal{S}_{j}. It is known that such a simple strategy may lead to a global estimate that suffers from suboptimal performance [52, 45]. To overcome this challenge, a large number of studies in the recent decade have been dedicated to developing more effective methods for selecting the local regularization parameters. For example, [52] proposed to select the regularization parameter \lambda_{j} according to the order of the entire sample size N instead of the subsample size N/k. Recently, [45] proposed the distributed GCV method to select a global optimal regularization parameter for nonparametric regression under the classical divide-and-conquer setting.
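To make the order statement concrete, the display below sketches one common instance (kernels with \nu-polynomial eigendecay, using the rate that also appears in Appendix B); the explicit form is our illustration of the idea in [52] rather than a statement taken verbatim from that paper:

\lambda_{j}\asymp N^{-2\nu/(2\nu+1)}\ \text{(order of the full sample size)}\qquad\text{instead of}\qquad\lambda_{j}\asymp (N/k)^{-2\nu/(2\nu+1)}\ \text{(order of the subsample size)}.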

3 MOTIVATION AND METHODOLOGY

Motivation. To motivate the development of the proposed method, we first re-examine the oversampling strategy. Such a strategy is well known in imbalanced data analysis, where the labels can be viewed as skewed discrete response variables [17, 15, 18]. Classical statistical and machine learning algorithms assume that the number of observations in each class is roughly of the same scale. In many real-world cases, however, such an assumption may not hold, resulting in imbalanced data. Imbalanced data pose a difficulty for classical learning algorithms, as they will be biased towards the majority group. Despite its rareness, the minority class is usually more important from the data mining perspective, as it may carry useful and important information. An effective classification algorithm thus should take such imbalance into account. To tackle imbalanced data, the oversampling strategy supplements the training set with multiple copies of the minority classes and keeps all the observations in the majority classes to make the whole dataset suitable for a standard learning algorithm. Such a strategy has proven to be effective for achieving more robust results [17, 15, 18].

Intuitively, data with a continuous skewed response can be considered imbalanced data. In particular, if we divide the range of the skewed responses \{y_{i}\}_{i=1}^{N} into l equally-spaced disjoint intervals, denoted by {\mathcal{Y}}_{1},\ldots,{\mathcal{Y}}_{l}, then there will be a disproportionate ratio of observations in each interval. The intervals that contain more observations can be considered the majority classes, and the ones that contain fewer observations can be considered the minority classes. Recall that the classical divide-and-conquer approach utilizes a simple random subsample from the full sample to calculate each local estimate. Therefore, when the response variable is highly skewed, such a subsample could easily overlook the observations corresponding to the minority classes, resulting in poor estimates.

Main algorithm. Inspired by the oversampling strategy, we propose to supplement the training data with multiple copies of the minority classes before applying the divide-and-conquer approach. In addition, analogous to the downsampling strategy, the subsample in each node should keep most of the observations from the minority classes. Let |\mathcal{S}| be the number of observations in the set \mathcal{S}. Let [\cdot] be the rounding function that rounds a number down to its nearest integer. The proposed method, called oversampling divide-and-conquer kernel ridge regression, is summarized in Algorithm 2.

Figure 2: Illustration of Algorithm 2.

We use a toy example to illustrate Algorithm 2 in Fig. 2. Suppose there are three nodes (k=3), and the histogram of the response is divided into three slices (l=3). The left panel of Fig. 2 illustrates the oversampling process. In particular, the observations are duplicated a certain number of times, such that the total number of observations in each slice is roughly equal across slices. The original observations and the duplicated observations are marked by tilted lines and horizontal lines, respectively. The right panel of Fig. 2 shows the next step. For each slice, the observations within such a slice are randomly and evenly allocated to the nodes. The observations allocated to different nodes are marked by blue, red, and purple, respectively. Finally, similar to the classical dacKRR approach, the local estimates are calculated and then averaged to obtain the global estimate.

Algorithm 2 Oversampling divide-and-conquer kernel ridge regression
Input: The training set \{(\bm{x}_{i},y_{i})\}_{i=1}^{N}; the number of nodes k; the number of slices l
Step 1: Divide the range of the responses \{y_{i}\}_{i=1}^{N} into l disjoint equally-spaced intervals, denoted by {\mathcal{Y}}_{1},\ldots,{\mathcal{Y}}_{l}. Without loss of generality, we assume |{\mathcal{Y}}_{1}|\geq\cdots\geq|{\mathcal{Y}}_{l}|.
Step 2:
for j in 1,\ldots,l do
     Let {\mathcal{Y}}_{j}^{*} be the set that contains [|{\mathcal{Y}}_{1}|/|{\mathcal{Y}}_{j}|] copies of all the elements in {\mathcal{Y}}_{j}. Randomly and evenly partition the set \{(\bm{x},y)\,|\,y\in{\mathcal{Y}}_{j}^{*}\} into k disjoint subsamples, denoted by {\mathcal{A}}_{1,j},\ldots,{\mathcal{A}}_{k,j}.
end for
Step 3:
for i in 1,\ldots,k do
     The subsample for the i-th local node is \mathcal{S}_{i}=\mathcal{D}\{{\mathcal{A}}_{i,1}\cup\cdots\cup{\mathcal{A}}_{i,l}\}, where \mathcal{D}\{\cdot\} denotes de-duplicating. Calculate the local estimate on the i-th local node
\hat{\eta}_{i}=\underset{\eta\in{\mathcal{H}}}{\mathrm{argmin}}\Big\{\frac{1}{|\mathcal{S}_{i}|}\sum_{(\bm{x},y)\in\mathcal{S}_{i}}(y-\eta(\bm{x}))^{2}+\lambda_{i}||\eta||_{\mathcal{H}}^{2}\Big\}.
end for
Output: Average the local estimates to obtain the final estimate \bar{\eta}=\frac{1}{k}\sum_{i=1}^{k}\hat{\eta}_{i}.

Implementation details and computational cost. Analogous to Algorithm 1, a difficulty in Algorithm 2 is how to choose the regularization parameters \lambda_{i}. Throughout this paper, we opt to use the same approach to select the \lambda_{i}'s as the one proposed in [52]. The number of slices l can be set as a constant or be determined by Scott's rule; see [34] for more details. Our empirical studies indicate that the performance of the proposed algorithm is robust to a wide range of choices of l. In addition, to avoid the data dependency within a node introduced by oversampling, we conduct a de-duplicating process before calculating the local estimates.
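The following is a minimal Python/NumPy sketch of Steps 1-3 of Algorithm 2 (equally-spaced slicing, per-slice duplication, random allocation to nodes, and de-duplication); the returned index sets are then fitted exactly as in Algorithm 1 and averaged. The helper name and the index-based bookkeeping are choices of this sketch.

```python
import numpy as np

def oversampled_partition(y, k, l, rng=None):
    """Return k de-duplicated index sets, one per node, following Algorithm 2."""
    rng = np.random.default_rng(rng)
    edges = np.linspace(y.min(), y.max(), l + 1)
    slice_id = np.clip(np.digitize(y, edges[1:-1]), 0, l - 1)   # Step 1: l slices
    n_max = np.bincount(slice_id, minlength=l).max()            # |Y_1|, the largest slice
    nodes = [[] for _ in range(k)]
    for j in range(l):                                          # Step 2: oversample + split
        idx = np.flatnonzero(slice_id == j)
        if idx.size == 0:
            continue
        reps = int(n_max // idx.size)                           # [|Y_1| / |Y_j|] copies
        pool = rng.permutation(np.repeat(idx, reps))
        for i, part in enumerate(np.array_split(pool, k)):
            nodes[i].append(part)
    # Step 3: de-duplicate within each node before fitting the local estimates
    return [np.unique(np.concatenate(parts)) for parts in nodes]
```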

Consider the total number of observations in the training set after the oversampling and de-duplicating processes, and let \tilde{N} denote this number. When the response variable has a roughly uniform distribution, few observations would be copied, and thus one has \tilde{N}=O(N). In such a case, the computational cost of Algorithm 2 is of the same order as that of the classical divide-and-conquer method. In the most extreme case, where the largest class {\mathcal{Y}}_{1} contains almost all the observations, it can be shown that \tilde{N}\leq lN. Therefore, the computational cost of Algorithm 2 is of the order O(l^{3}N^{3}/k^{2}). Consequently, when l is a constant, again, the computational cost of Algorithm 2 is of the same order as that of the classical divide-and-conquer method. Otherwise, when l and k go to infinity, for example, when l is determined by Scott's rule [34] (l=O(N^{1/3})) and k=O(l), the computational cost of Algorithm 2 becomes O(N^{4}/k^{2}). However, such a higher computational cost is associated with theoretical benefits, as will be detailed in the next section.

4 THEORETICAL RESULTS

Main theorem. In this section, we obtain upper bounds on the risk of the proposed estimator \bar{\eta} and establish the main result in Theorem 4.1. We show our bounds contain a smaller asymptotic estimation variance term than the bounds of the classical dacKRR estimator [52, 20]. The following assumptions are required to bound the terms in Theorem 4.1. Assumption 1 guarantees the spectral decomposition of the kernel via Mercer's theorem. Assumption 2 is a regularity assumption on \eta_{0}. These two assumptions are fairly standard. Assumption 3 requires that the basis functions are uniformly bounded. [52] showed that one could easily verify this assumption for a given kernel K. For instance, this assumption holds for many classical kernel functions, e.g., the Gaussian kernel.

Assumption 1. The reproducing kernel K is symmetric, positive definite, and square integrable.

Assumption 2. The underlying function \eta_{0}\in\mathcal{H}.

Assumption 3. There exists some \omega^{2}>0 such that \sup_{\bm{x}\in\mathcal{X}}\sum_{j=1}^{\infty}\mu_{j}\phi_{j}(\bm{x})^{2}\leq\omega^{2}<\infty.

Let \{\tilde{{\mathbf{X}}}_{i},\tilde{\bm{y}}_{i}\} be the subset of size \tilde{n} allocated to the i-th node after de-duplicating, i=1,\ldots,k. We assume the observations are drawn i.i.d. from an unknown probability measure \mathcal{P}, and let \mathcal{P}_{X} denote the marginal distribution of \tilde{{\mathbf{X}}}_{i}. For brevity, we further assume the sample size for each node is the same. The bounds on the risk of the proposed estimator \bar{\eta} are given in Theorem 4.1, followed by the main corollary.

Theorem 4.1.

Let d_{\lambda}=\sum_{j\geq 1}(1+\lambda/\mu_{j})^{-1} be the effective dimensionality of the kernel K [50]. Under Assumptions 1-3, the risk of the proposed estimator \bar{\eta} is bounded by

\begin{split}\mathbb{E}\{\left\lVert\bar{\eta}-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}&\leq 8\lambda\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}+\frac{4d_{\lambda}\sigma^{2}}{\tilde{n}k}\\&\quad+4d_{\lambda}\Big(\mu_{1}^{2}\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}+\frac{\omega^{2}\sigma^{2}}{\lambda\tilde{n}k}\Big)\exp\bigg(-\frac{3\lambda\tilde{n}}{28\omega^{2}}\bigg).\end{split}
Corollary 4.2.

Assuming the first two terms of the bound in Theorem 4.1 are dominant, we have

\mathbb{E}\{\left\lVert\bar{\eta}-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}=O(1)\bigg\{\underbrace{\lambda\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}}_{\text{Squared bias}}+\underbrace{\frac{\sigma^{2}d_{\lambda}}{\tilde{N}}}_{\text{Variance}}\bigg\}. \qquad (5)

With a properly chosen regularization parameter, most of the commonly used kernels, e.g., the kernels with polynomial eigendecay, satisfy this assumption for Corollary 4.2 [9]. More discussions can be found in the Supplementary Material. The squared bias term for the classical dacKRR estimator is of the order O(\lambda\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}), which is the same as the one in Equation (5) [52]. However, different from the variance term in Equation (5), which equals O(\sigma^{2}d_{\lambda}/\tilde{N}), the variance term for the classical dacKRR estimator is of the order O(\sigma^{2}d_{\lambda}/N). Our rate depends on the eigenvalues \{\mu_{j}^{2}\} instead of the trace of the kernel derived in [52]. This rate is tighter than that in [2] and is considered the minimax rate.
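For concreteness, a short sketch of the rate implied by Equation (5) under \nu-polynomial eigendecay (writing the decay as \mu_{j}\asymp j^{-2\nu}, the convention that matches the rate stated in Appendix B; constants are omitted):

d_{\lambda}=\sum_{j\geq 1}(1+\lambda/\mu_{j})^{-1}\asymp\lambda^{-1/(2\nu)},\qquad \mathbb{E}\{\left\lVert\bar{\eta}-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}=O\Big(\lambda+\frac{\lambda^{-1/(2\nu)}}{\tilde{N}}\Big)=O\big(\tilde{N}^{-2\nu/(2\nu+1)}\big)\ \text{ for }\ \lambda\asymp\tilde{N}^{-2\nu/(2\nu+1)}.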

Practical guidance on the selection of the parameters l and k. We now compare the risk of the classical dacKRR estimator with that of the proposed one under three different scenarios. Furthermore, we provide some practical guidance on the selection of the number of slices l and the number of nodes k. We denote the total sample size after oversampling as \tilde{N}^{*}.

First, consider the scenario that N\approx\tilde{N}^{*}, as shown in Fig. 3(a). In this case, oversampling is unnecessary, and thus the proposed estimate \bar{\eta} has the same risk as the classical dacKRR estimator.

Second, consider the scenario that the response variable has a slightly skewed distribution, i.e., one has \tilde{N}^{*}=O(N), as shown in Fig. 3(b). Under this scenario, the total sample size after de-duplicating is also of the same order as the original sample size, i.e., \tilde{N}=O(N). Corollary 4.2 indicates the proposed estimate \bar{\eta} has the same risk as the classical dacKRR estimator.

Third, consider the scenario that the response variable has a highly skewed distribution. Under this scenario, the sample size in the largest slice is of the same order as N, and the sample size in the smallest slice, denoted as N_{l}, is of the order o(N). In these cases, one has \tilde{N}^{*}=O(lN). We then discuss two cases: (1) N_{l}=o(N/l), in particular when l is a constant; and (2) N_{l}=O(N/l). In case (1), the sample size \tilde{N} after de-duplicating is of the same order as N. The risk of \bar{\eta} then has the same order as that of the classical dacKRR estimator. In case (2), since N_{l}=o(N), l has to go to infinity as N\to\infty. In this scenario, we consider k=O(N/N_{l}), i.e., k=O(l); then we have \tilde{N}=O(\tilde{N}^{*}), and Equation (5) becomes

\mathbb{E}\{\left\lVert\bar{\eta}-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}=O(1)\bigg\{\lambda\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}+\frac{\sigma^{2}d_{\lambda}}{lN}\bigg\}. \qquad (6)

For example, when l is determined by Scott's rule, one has l=O(N^{1/3}), and we choose k=O(l). Equation (6) indicates the proposed estimator has a smaller risk than the classical dacKRR estimator. Corollary 4.2 indicates that the parameters l and k regulate the trade-off between the computational time and the risk. Specifically, when k is well-selected, a larger l is associated with a longer computational time as well as a smaller risk.

Figure 3: Three different scenarios for the “skewness”. The solid bars are original data and the lined ones are over-sampled data.

5 SIMULATION RESULTS

To show the effectiveness of the proposed estimator, we compared it with the classical dacKRR estimator in terms of the mean squared error (MSE). We calculated the MSE for each estimator based on 100 replicates. In particular, \mbox{MSE}=\sum_{i=1}^{100}\|\hat{\eta}^{(i)}-\eta_{0}\|^{2}/100, where \eta_{0} is the true function and \hat{\eta}^{(i)} represents the estimator in the i-th replication. Throughout all the experiments in this section, we set the number of nodes k=100. The Gaussian kernel was used for the kernel ridge regression. We followed the procedure in [52] to select the bandwidth for the kernel and the regularization parameters. For the proposed method, we divided the range of \{y_{i}\}_{i=1}^{N} into l slices according to Scott's rule [34].

Figure 4: Comparison of different estimators. Each row represents a different true function and each column represents a different dd.
Figure 5: Comparison for different choices of ll’s.

Recall that in Algorithm 2, each observation is copied a certain number of times, such that the total number of observations in each slice is roughly equal. A natural question is how the number of such copies affects the performance of the proposed method. In other words, if fewer copies are supplemented to the training data, would the proposed method perform better or worse? To answer this question, we introduced a constant \tau to control the oversampling size. Specifically, for the j-th slice, j=1,\ldots,l, we duplicated the observations within such a slice ([\tau|{\mathcal{Y}}_{1}|/|{\mathcal{Y}}_{j}|]-1) times, instead of ([|{\mathcal{Y}}_{1}|/|{\mathcal{Y}}_{j}|]-1) times as in Algorithm 2. We set \tau=\{1,1/2,1/3,1/4,1/5\}. This procedure is equivalent to Algorithm 2 when \tau=1.
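In terms of the hypothetical oversampled_partition sketch given in Section 3, this modification only changes the duplication count; a minimal illustration (tau denotes the constant \tau introduced here):

```python
# Replace the duplication count in the oversampling step so that slice j ends up
# with roughly [tau * |Y_1| / |Y_j|] copies in total; tau = 1 recovers Algorithm 2.
reps = max(1, int(tau * n_max // idx.size))
```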

Let \bm{1}^{d} be the d-dimensional vector whose elements are all equal to 1. We considered a function that has a highly skewed response, i.e.,

g(\bm{x},\bm{c})=\frac{0.1}{\|\bm{x}-\bm{c}\|+0.05}\sin\left(\frac{0.01\pi}{\|\bm{x}-\bm{c}\|+0.05}\right),

where \bm{x}\in[0,1]^{d} and \|\cdot\| represents the \mathbb{L}_{2} norm. We then simulated the data from Model (1) with N=\{2,4,6,8,10\}\times 10^{3}, d=1,2,4,6, and two different regression functions \eta_{0} (a small data-generation sketch follows the two settings below):

Uni-peak: \eta_{0}(\bm{x})=g(\bm{x},\bm{c}), with \bm{c}=0.4\times\bm{1}^{d};

Double-peak: \eta_{0}(\bm{x})=g(\bm{x},\bm{c}_{1})+0.4g(\bm{x},\bm{c}_{2}), with \bm{c}_{1}=0.4\times\bm{1}^{d} and \bm{c}_{2}=0.7\times\bm{1}^{d}.
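A minimal sketch of this data-generating process (in Python/NumPy): the uniform design on [0,1]^{d} and the noise standard deviation sigma are assumptions of the sketch, as the paper does not report them here.

```python
import numpy as np

def g(x, c):
    """The peak function g(x, c) defined above."""
    r = np.linalg.norm(x - c, axis=-1) + 0.05
    return 0.1 / r * np.sin(0.01 * np.pi / r)

def simulate(N, d, sigma=0.1, double_peak=False, rng=None):
    """Draw (x_i, y_i) from Model (1) with the uni- or double-peak eta_0."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(0.0, 1.0, size=(N, d))            # assumed uniform design on [0, 1]^d
    c1, c2 = 0.4 * np.ones(d), 0.7 * np.ones(d)
    eta0 = g(X, c1) + (0.4 * g(X, c2) if double_peak else 0.0)
    y = eta0 + sigma * rng.standard_normal(N)          # i.i.d. Gaussian noise
    return X, y, eta0
```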

Figure 4 shows the MSE versus the sample size N under various settings. Each column represents a different d, and each row represents a different \eta_{0}. The classical dacKRR estimator and the full-sample KRR estimator are labeled as blue lines and black lines, respectively. The proposed estimators are labeled as red lines, and the proposed method with \tau=1, i.e., Algorithm 2, is labeled as solid red lines. The vertical bars represent the standard errors obtained from 100 replications. In Fig. 4, we first observed that the classical dacKRR estimator, as expected, does not perform well. We then observed that the proposed estimators perform consistently better than the classical estimator. Such an observation indicates that when the response is highly skewed, oversampling the observations corresponding to the minority values of the response helps improve the estimation accuracy. Finally, we observed that a larger value of \tau is associated with better performance. In particular, as the number of copies increases, the proposed estimator tends to have a faster convergence rate. This observation supports Corollary 4.2, which states that a larger number of observations after the oversampling process is associated with a smaller estimation MSE.

Besides Scott's rule (l=O(N^{1/3})), we also considered other rules, e.g., Sturges' formula (l=O(\log_{2}N)) [39] and the Freedman-Diaconis choice (l=O(N^{1/2})) [7]. Figure 5 compares the empirical performance of the proposed estimator with respect to different choices of l. For the setting with an unskewed response (the first row), all methods share similar performance. For the response-skewed settings (the second and the third rows), all three aforementioned rules yield better performance than a fixed l. Among all, Scott's rule gives the best results. The simulation results are consistent with Corollary 4.2, which indicates that the proposed estimator shows advantages over the classical one.
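For reference, the number of slices l under such rules can be obtained directly from NumPy's histogram binning utilities; a minimal sketch (the helper name is ours):

```python
import numpy as np

def num_slices(y, rule="scott"):
    """Number of equally-spaced response slices l under a named binning rule."""
    # rule can be "scott", "sturges", or "fd" (Freedman-Diaconis).
    edges = np.histogram_bin_edges(np.asarray(y, dtype=float), bins=rule)
    return len(edges) - 1
```

For example, num_slices(y, "scott") gives one standard implementation of Scott's rule.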

Besides the impact of l, we also studied the impact of k. In addition, we have also applied the proposed strategy to other nonparametric regression estimators, e.g., smoothing splines, under the divide-and-conquer setting. Such simulation results are provided in the Supplementary Material. These results indicate that the performance of the proposed method is robust to the choice of k and has the potential to improve the estimation of other nonparametric regression estimators under the divide-and-conquer setting. It is also possible to consider unequally-spaced slicing methods in the proposed method; however, such extensions are beyond the scope of this paper.

6 REAL DATA ANALYSIS

We applied the proposed method to a real-world dataset called Melbourne-housing-market (data source: https://www.kaggle.com/anthonypino/melbourne-housing-market). The dataset includes housing clearance data in Melbourne from the year 2016 to 2017. Each observation represents a record for a sold house, including the total price and several predictors. Of interest is to predict the price per square meter (ranging from 20 to 30,000) for each sold house using the longitude and the latitude of the house, and the distance from the house to the central business district (CBD). Such a goal can be achieved by using kernel ridge regression.

Figure 6: Left panel: testing MSE versus different number of nodes. Right panel: CPU time (in seconds) versus different number of nodes. Vertical bars represent the standard errors.

We replicated the experiment one hundred times. In each replicate, we used stratified sampling to randomly pick 10\% of the observations as the testing set and used the remaining observations as the training set. The Gaussian kernel was used for the kernel ridge regression. We set the number of nodes k=\{10,30,50,70,90,110\}. We followed the procedure in [52] to select the bandwidth for the kernel and the regularization parameters. The performance of different estimators was evaluated by the MSE with respect to the prediction on the testing set. The experiment was conducted in R using a Windows computer with 16GB of memory and a single-threaded 3.5GHz CPU.
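For illustration, a minimal sketch of one way to carry out the stratified 90/10 split described above (the paper's experiments were run in R; this Python sketch stratifies on response quantiles, which is an assumption on our part since the stratification variable is not specified):

```python
import numpy as np

def stratified_split(X, y, test_frac=0.1, n_strata=10, rng=None):
    """Hold out test_frac of the observations within each response stratum."""
    rng = np.random.default_rng(rng)
    cuts = np.quantile(y, np.linspace(0, 1, n_strata + 1)[1:-1])
    strata = np.digitize(y, cuts)
    test_idx = []
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        n_test = max(1, int(round(test_frac * idx.size)))
        test_idx.append(rng.choice(idx, size=n_test, replace=False))
    test_idx = np.concatenate(test_idx)
    train = np.setdiff1d(np.arange(len(y)), test_idx)
    return X[train], y[train], X[test_idx], y[test_idx]
```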

The left panel of Fig. 6 shows the MSE versus the number of nodes k, and the right panel of Fig. 6 shows the CPU time (in seconds) versus k. To the best of our knowledge, our work is the first approach that addresses the response-skewed issue under the dacKRR setting, while other approaches only considered discrete-response settings or cannot be easily extended to the divide-and-conquer scenario. Thus we only compare our method with the classical dacKRR approach. In Fig. 6, the blue lines and the red lines represent the classical dacKRR approach and the proposed approach, respectively. Vertical bars represent the standard errors, which are obtained from one hundred replicates. The black lines represent the results of the full-sample KRR estimate, and the gray dashed lines represent its standard errors. We can see that the proposed estimate consistently outperforms the classical dacKRR estimate. In particular, we observe the MSE of the proposed estimate is comparable with the MSE of the full-sample KRR estimate. The MSE of the classical dacKRR estimate, however, almost doubles that of the full-sample KRR estimate when the number of nodes k exceeds one hundred. The right panel of Fig. 6 also shows the CPU time of the proposed approach is comparable with that of the classical divide-and-conquer approach. All these observations are consistent with the findings in the previous section. Such observations indicate the proposed approach outperforms the classical dacKRR approach for response-skewed data without requiring too much extra computing time.

Appendix A SIMULATION STUDIES

A.1 Impact of The Number of Nodes

We considered the same function as in the manuscript, i.e.,

g(\bm{x},\bm{c})=\frac{0.1}{\|\bm{x}-\bm{c}\|+0.05}\sin\left(\frac{0.01\pi}{\|\bm{x}-\bm{c}\|+0.05}\right),

where \bm{x}\in[0,1]^{d} and \|\cdot\| represents the \mathbb{L}_{2} norm. We then simulated the data from Model (1) with N=10^{4}, d=1,2,4,6, and two different regression functions \eta_{0}:

Uni-peak: \eta_{0}(\bm{x})=g(\bm{x},\bm{c}), with \bm{c}=0.4\times\bm{1}^{d};

Double-peak: \eta_{0}(\bm{x})=g(\bm{x},\bm{c}_{1})+0.4g(\bm{x},\bm{c}_{2}), with \bm{c}_{1}=0.4\times\bm{1}^{d} and \bm{c}_{2}=0.7\times\bm{1}^{d}.

Figure 7 studies the impact of the number of nodes k. In Fig. 7, we fix N and let k vary. We observe that the proposed estimator consistently outperforms the classical one under all scenarios.

Figure 7: Comparison for different choices of kk’s.

A.2 Application in Other Non-parametric Regression Methods

Besides the KRR estimator, we also applied the proposed response-adaptive partition strategy to other non-parametric regression methods. One example is smoothing splines under the divide-and-conquer setting. In the simulation, we again study the impact of the oversample size \tilde{n}, the number of slices l, and the number of nodes k. We set the dimension d=4; other settings are similar to the cases for KRR. The results are shown in Fig. 8. Each row of Fig. 8 shows the impact of \tilde{n}, l, and k, respectively. The odd columns represent the estimation MSE, and the even columns represent the CPU time in seconds. The first two columns represent the results of the uni-peak setting, and the last two columns represent the results of the double-peak setting. We can see that the results show a similar pattern as in the KRR cases. These results indicate that the performance of the proposed method is robust across different nonparametric regression methods.

Figure 8: Application of the proposed method in smoothing splines.

Appendix B PROOFS

B.1 Proof of Theorem 4.1

The proof of Theorem 4.1 relies on the following bias-variance decomposition

\begin{split}\mathbb{E}\{\left\lVert\bar{\eta}-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}&=\mathbb{E}\{\left\lVert\mathbb{E}(\bar{\eta})-\eta_{0}+\bar{\eta}-\mathbb{E}(\bar{\eta})\right\rVert^{2}_{\mathcal{P}_{X}}\}\\&=\mathbb{E}\{\left\lVert\mathbb{E}(\bar{\eta})-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}+\mathbb{E}\bigg\{\left\lVert\frac{1}{k}\sum_{j=1}^{k}(\hat{\eta}_{j}-\mathbb{E}(\hat{\eta}_{j}))\right\rVert^{2}_{\mathcal{P}_{X}}\bigg\}\\&=\mathbb{E}\{\left\lVert\mathbb{E}(\hat{\eta}_{1})-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}+\frac{1}{k}\mathbb{E}\{\left\lVert\hat{\eta}_{1}-\mathbb{E}(\hat{\eta}_{1})\right\rVert^{2}_{\mathcal{P}_{X}}\},\end{split}

where \mathbb{E}\{\left\lVert\bar{\eta}-\eta_{0}\right\rVert_{\mathcal{P}_{X}}^{2}\}=\mathbb{E}\{\int_{\mathcal{X}}(\bar{\eta}({\mathbf{X}}_{1})-\eta_{0}({\mathbf{X}}_{1}))^{2}d\mathcal{P}_{X}\}. The claim of Theorem 4.1 can be proved by combining the inequalities in Lemmas B.1 and B.2. The minimax bound for the risk is

\begin{split}\mathbb{E}\{\left\lVert\bar{\eta}-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}&\leq 8\lambda\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}+\frac{4d_{\lambda}\sigma^{2}}{\tilde{n}k}\\&\quad+4d_{\lambda}\Big(\mu_{1}^{2}\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}+\frac{\omega^{2}\sigma^{2}}{\lambda\tilde{n}k}\Big)\exp\bigg(-\frac{3\lambda\tilde{n}}{28\omega^{2}}\bigg).\end{split}

With a properly chosen smoothing parameter, the first two terms in the upper bound are typically dominant for most of the commonly used kernels, e.g., the kernels for the Sobolev spaces; see [9]. For any kernel with \nu-polynomial eigendecay, we obtain the optimal convergence rate for the risk, i.e., O((1/\tilde{N})^{2\nu/(2\nu+1)}), with \lambda=(1/\tilde{N})^{2\nu/(2\nu+1)} for a finite number of nodes. \square

Lemma B.1.

Under Assumptions 1-3, we have

\mathbb{E}\{\left\lVert\mathbb{E}(\hat{\eta}_{1})-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}\leq 8\lambda\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}+4d_{\lambda}\mu_{1}^{2}\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}\exp\bigg(-\frac{3\lambda\tilde{n}}{28\omega^{2}}\bigg).

Proof. The estimate η^1\hat{\eta}_{1} minimizes the following empirical objective,

\mathcal{L}(\eta)=\frac{1}{2\tilde{n}}\sum_{i=1}^{\tilde{n}}(\tilde{y}_{1i}-{\langle}K_{{\mathbf{X}}_{1i}},\eta{\rangle}_{\mathcal{H}})^{2}+\frac{1}{2}{\langle}J_{\lambda}\eta,\eta{\rangle}_{\mathcal{H}}, \qquad (7)

where (\tilde{y}_{1i}, {\mathbf{X}}_{1i}), i=1,\cdots,\tilde{n}, are the data assigned to the first node, J_{\lambda}:\mathcal{H}\rightarrow\mathcal{H} is a self-adjoint operator such that {\langle}J_{\lambda}\eta,\eta{\rangle}_{\mathcal{H}}=\lambda{\langle}\eta,\eta{\rangle}_{\mathcal{H}} for \eta\in\mathcal{H}, and \eta({\mathbf{X}}_{1i})={\langle}K_{{\mathbf{X}}_{1i}},\eta{\rangle}_{\mathcal{H}}. Note that the objective function (7) is Fréchet differentiable [22]. We can obtain

\begin{split}\frac{\partial\mathcal{L}(\eta)}{\partial\eta}&=-\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}(\tilde{y}_{1i}-{\langle}K_{{\mathbf{X}}_{1i}},\eta{\rangle}_{\mathcal{H}})K_{{\mathbf{X}}_{1i}}+\lambda\eta\\&=-\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}\tilde{y}_{1i}K_{{\mathbf{X}}_{1i}}+\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}{\langle}K_{{\mathbf{X}}_{1i}},\eta{\rangle}_{{\mathcal{H}}}K_{{\mathbf{X}}_{1i}}+\lambda\eta.\end{split}

For notational simplicity, we define

\begin{split}\Gamma_{{\mathbf{X}}_{1}}\eta&=\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}{\langle}K_{{\mathbf{X}}_{1i}},\eta{\rangle}_{{\mathcal{H}}}K_{{\mathbf{X}}_{1i}}=\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}\eta({\mathbf{X}}_{1i})K_{{\mathbf{X}}_{1i}},\\F_{{\mathbf{X}}_{1}}^{*}\bm{\tilde{y}}_{1}&=\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}\tilde{y}_{1i}K_{{\mathbf{X}}_{1i}},\end{split}

where F_{{\mathbf{X}}_{1}}^{*} is the adjoint of F_{{\mathbf{X}}_{1}}\eta=({\langle}\eta,K_{{\mathbf{X}}_{11}}{\rangle}_{{\mathcal{H}}},\cdots,{\langle}\eta,K_{{\mathbf{X}}_{1\tilde{n}}}{\rangle}_{{\mathcal{H}}})^{\intercal}=(\eta({\mathbf{X}}_{11}),\cdots,\eta({\mathbf{X}}_{1\tilde{n}}))^{\intercal}. We then have that

\frac{\partial\mathcal{L}(\eta)}{\partial\eta}=-F_{{\mathbf{X}}_{1}}^{*}\bm{\tilde{y}}_{1}+\Gamma_{{\mathbf{X}}_{1}}\eta+\lambda\eta,

which yields the estimate

\hat{\eta}_{1}=(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}F_{{\mathbf{X}}_{1}}^{*}\bm{\tilde{y}}_{1},

where \bm{\tilde{y}}_{1}=(\tilde{y}_{11},\cdots,\tilde{y}_{1\tilde{n}})^{\intercal}. Using the fact that \bm{\tilde{y}}_{1}=F_{{\mathbf{X}}_{1}}\eta_{0}+\bm{\epsilon}_{1} and plugging it into the equation above yields

\hat{\eta}_{1}=(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}F_{{\mathbf{X}}_{1}}^{*}F_{{\mathbf{X}}_{1}}\eta_{0}+(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}F_{{\mathbf{X}}_{1}}^{*}\bm{\epsilon}_{1}. \qquad (8)

Taking expectation on both sides of (8), we obtain

\mathbb{E}(\hat{\eta}_{1})=(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}F_{{\mathbf{X}}_{1}}^{*}F_{{\mathbf{X}}_{1}}\eta_{0}.

The difference between \mathbb{E}(\hat{\eta}_{1}) and \eta_{0} can be written as

\eta_{0}-\mathbb{E}(\hat{\eta}_{1})=\{I-(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}\Gamma_{{\mathbf{X}}_{1}}\}\eta_{0},

where \Gamma_{{\mathbf{X}}_{1}}\eta_{0}=\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}{\langle}K_{{\mathbf{X}}_{1i}},\eta_{0}{\rangle}_{{\mathcal{H}}}K_{{\mathbf{X}}_{1i}}=\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}\eta_{0}({\mathbf{X}}_{1i})K_{{\mathbf{X}}_{1i}}=F_{{\mathbf{X}}_{1}}^{*}F_{{\mathbf{X}}_{1}}\eta_{0}. Thus, we have

\mathbb{E}\{\left\lVert\mathbb{E}(\hat{\eta}_{1})-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}=\mathbb{E}\left\lVert\Gamma^{1/2}\{I-(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}\Gamma_{{\mathbf{X}}_{1}}\}\eta_{0}\right\rVert_{{\mathcal{H}}}^{2}, \qquad (9)

where \Gamma\eta=\int_{\mathcal{X}}{\langle}\eta,K_{\bm{x}}{\rangle}_{{\mathcal{H}}}K_{\bm{x}}\,d\mathcal{P}_{X}(\bm{x}) for \eta\in{\mathcal{H}}. The equation above uses the fact that one can move between the {\mathcal{H}} and L^{2}(\mathcal{P}_{X}) inner products as follows:

{\langle}f,g{\rangle}_{\mathcal{P}_{X}}={\langle}f,\Gamma g{\rangle}_{{\mathcal{H}}}.

Let \phi_{k}, k=1,2,\cdots, be the orthonormal basis for {\mathcal{H}} and \bm{\Phi}=(\phi_{k}({\mathbf{X}}_{1i}))_{1\leq i\leq\tilde{n};1\leq k<\infty} be the \tilde{n}\times\infty matrix based on the orthonormal basis. We then have \eta_{0}=\sum_{k=1}^{\infty}a_{k}\phi_{k}, and the norm in (9) can be written as

\mathbb{E}\left\lVert\bm{\Gamma}^{1/2}\{\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}}\}\bm{a}\right\rVert_{\ell^{2}(\mathbb{N})}^{2},

where \bm{\Gamma}=\text{diag}(\mu_{1}^{2},\mu_{2}^{2},\cdots), \bm{I}=\text{diag}(1,1,\cdots), \bm{\Gamma}_{\tilde{n}}=\frac{1}{\tilde{n}}\bm{\Phi}^{\intercal}\bm{\Phi}, and \bm{a}=(a_{1},a_{2},\cdots)^{\intercal}.

\begin{split}\mathbb{E}\left\lVert\bm{\Gamma}^{1/2}\{\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}}\}\bm{a}\right\rVert_{\ell^{2}(\mathbb{N})}^{2}&\leq\mathbb{E}\bigg\{\left\lVert\bm{\Gamma}^{1/2}\{\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}}\}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\mathbbm{1}(\mathcal{C}_{z}^{c})\bigg\}\left\lVert\bm{a}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\\&\quad+\mathbb{E}\bigg\{\left\lVert\bm{\Gamma}^{1/2}\{\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}}\}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\mathbbm{1}(\mathcal{C}_{z})\bigg\}\left\lVert\bm{a}\right\rVert^{2}_{\ell^{2}(\mathbb{N})},\end{split} \qquad (10)

where, for 0<z<1, the event

\mathcal{C}_{z}=\bigg\{\left\lVert(\bm{\Gamma}+\lambda\bm{I})^{-1/2}(\bm{\Gamma}_{\tilde{n}}-\bm{\Gamma})(\bm{\Gamma}+\lambda\bm{I})^{-1/2}\right\rVert_{\ell^{2}(\mathbb{N})}\geq z\bigg\},

and \mathcal{C}_{z}^{c} is the complement of \mathcal{C}_{z}. We use the fact that \mathbbm{1}(\mathcal{C}_{z}^{c})\,\mathbbm{1}(\mathcal{C}_{z})=0 to obtain the inequality in (10). We now bound the first term on the right-hand side of (10).

\begin{split}\left\lVert\bm{\Gamma}^{1/2}\{\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}}\}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}&=\left\lVert\bm{\Gamma}^{1/2}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})(\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}})\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\\&\leq\left\lVert(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1/2}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\left\lVert(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})(\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}})\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\left\lVert\bm{\Gamma}^{1/2}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1/2}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\\&\leq 4\lambda^{2}\left\lVert\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\right\rVert_{\ell^{2}(\mathbb{N})}\left\lVert(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\right\rVert_{\ell^{2}(\mathbb{N})}.\end{split}

The last inequality follows based on Definition 1 in [2]. For 0\leq\gamma\leq 1, we have

\begin{split}\left\lVert\bm{\Gamma}^{\gamma/2}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}^{\gamma/2}\right\rVert_{\ell^{2}(\mathbb{N})}&=\left\lVert\bm{\Gamma}^{\gamma/2}\{(\bm{\Gamma}_{\tilde{n}}-\bm{\Gamma})+(\bm{\Gamma}+\lambda\bm{I})\}^{-1}\bm{\Gamma}^{\gamma/2}\right\rVert_{\ell^{2}(\mathbb{N})}\\&\leq\left\lVert\bm{\Gamma}^{\gamma}(\bm{\Gamma}+\lambda\bm{I})^{-1}\right\rVert_{\ell^{2}(\mathbb{N})}\left\lVert(\bm{I}-(\bm{\Gamma}+\lambda\bm{I})^{-1/2}(\bm{\Gamma}-\bm{\Gamma}_{\tilde{n}})(\bm{\Gamma}+\lambda\bm{I})^{-1/2})^{-1}\right\rVert_{\ell^{2}(\mathbb{N})}\\&\leq\lambda^{\gamma-1}\left\lVert(\bm{I}-(\bm{\Gamma}+\lambda\bm{I})^{-1/2}(\bm{\Gamma}-\bm{\Gamma}_{\tilde{n}})(\bm{\Gamma}+\lambda\bm{I})^{-1/2})^{-1}\right\rVert_{\ell^{2}(\mathbb{N})},\end{split}

where the last inequality follows based on the spectral theorem. Detailed discussions can be found in [3] and [9, Chapter 9]. Applying the above inequality on the event \mathcal{C}_{z}^{c} when \gamma=1 and \gamma=0, we obtain

\left\lVert\bm{\Gamma}^{1/2}\{\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}}\}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\mathbbm{1}(\mathcal{C}_{z}^{c})\leq\frac{4\lambda}{(1-z)^{2}},

for any 0<z\leq 1/2.

For the second term in (10), we have that

\begin{split}\mathbb{E}\bigg\{\left\lVert\bm{\Gamma}^{1/2}\{\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}}\}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\mathbbm{1}(\mathcal{C}_{z})\bigg\}\left\lVert\bm{a}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}&\leq\left\lVert\bm{\Gamma}^{1/2}\right\rVert_{\ell^{2}(\mathbb{N})}^{2}\left\lVert\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}}\right\rVert_{\ell^{2}(\mathbb{N})}^{2}P(\mathcal{C}_{z})\left\lVert\bm{a}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\\&\leq\mu_{1}^{2}\left\lVert\bm{a}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}P(\mathcal{C}_{z}),\end{split}

where the last inequality follows based on Definition 1 in [2]. We complete the proof by applying Lemma 1 adapted from [5], which states that

P(\mathcal{C}_{z})\leq 4d_{\lambda}\exp\bigg(-\frac{3\lambda\tilde{n}}{28\omega^{2}}\bigg).

\square

Lemma B.2.

Under Assumptions 1-3, we have

\mathbb{E}\{\left\lVert\hat{\eta}_{1}-\mathbb{E}(\hat{\eta}_{1})\right\rVert^{2}_{\mathcal{P}_{X}}\}\leq\frac{4d_{\lambda}\sigma^{2}}{\tilde{n}}+4d_{\lambda}\frac{\omega^{2}\sigma^{2}}{\lambda\tilde{n}}\exp\bigg(-\frac{3\lambda\tilde{n}}{28\omega^{2}}\bigg).

Proof. Note that

\mathbb{E}(\hat{\eta}_{1})=(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}F_{{\mathbf{X}}_{1}}^{*}F_{{\mathbf{X}}_{1}}\eta_{0}.

From (8), we thus have

\mathbb{E}\{\left\lVert\hat{\eta}_{1}-\mathbb{E}(\hat{\eta}_{1})\right\rVert^{2}_{\mathcal{P}_{X}}\}=\mathbb{E}\left\lVert\Gamma^{1/2}(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}F_{{\mathbf{X}}_{1}}^{*}\bm{\epsilon}_{1}\right\rVert^{2}_{\mathcal{H}}. \qquad (11)

Following similar techniques as in the proof of Lemma B.1, the {\mathcal{H}}-norm in (11) can be written as

\mathbb{E}\left\lVert\Gamma^{1/2}(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}F_{{\mathbf{X}}_{1}}^{*}\bm{\epsilon}_{1}\right\rVert^{2}_{\mathcal{H}}=\frac{1}{\tilde{n}^{2}}\mathbb{E}\left\lVert\bm{\Gamma}^{1/2}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Phi}^{\intercal}\bm{\epsilon}_{1}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}.

By Von Neumann's inequality and the assumption that \mathbb{E}[\epsilon_{1i}^{2}]=\sigma^{2} for i=1,\cdots,\tilde{n}, we obtain

\frac{1}{\tilde{n}^{2}}\mathbb{E}\left\lVert\bm{\Gamma}^{1/2}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Phi}^{\intercal}\bm{\epsilon}_{1}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\leq\frac{\sigma^{2}}{\tilde{n}}\mathbb{E}\{\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}})\}.

The rest of the proof is similar to that of Lemma B.1. Following the same technique as in (10), we have

\begin{split}\frac{1}{\tilde{n}^{2}}\mathbb{E}\left\lVert\bm{\Gamma}^{1/2}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Phi}^{\intercal}\bm{\epsilon}_{1}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}&\leq\frac{\sigma^{2}}{\tilde{n}}\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}})P(\mathcal{C}_{z})\\&\quad+\frac{\sigma^{2}}{\tilde{n}}\mathbb{E}\{\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}})\mathbbm{1}(\mathcal{C}_{z}^{c})\}.\end{split} \qquad (12)

We now bound the first term on the right-hand side of (12). By Von Neumann's inequality, we have

\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}})\leq\operatorname{Tr}(\bm{\Gamma})\left\lVert(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}}\right\rVert_{\ell^{2}(\mathbb{N})}. \qquad (13)

From the Cauchy-Schwarz inequality, we have the following inequality for any \eta\in{\mathcal{H}}:

{\langle}\Gamma_{{\mathbf{X}}_{1}}(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}\eta,(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}\eta{\rangle}_{{\mathcal{H}}}\leq\left\lVert\Gamma_{{\mathbf{X}}_{1}}(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}\eta\right\rVert_{{\mathcal{H}}}\left\lVert(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}\eta\right\rVert_{{\mathcal{H}}}.

From Definition 1 in [2], we also have

\left\lVert\Gamma_{{\mathbf{X}}_{1}}(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}\eta\right\rVert_{{\mathcal{H}}}\leq\left\lVert\eta\right\rVert_{{\mathcal{H}}},

and

\left\lVert(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}\eta\right\rVert_{{\mathcal{H}}}\leq\frac{1}{\lambda}\left\lVert\eta\right\rVert_{{\mathcal{H}}}.

Thus, we have

\left\lVert(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}}\right\rVert_{\ell^{2}(\mathbb{N})}\leq\frac{1}{\lambda}. \qquad (14)

Plugging (14) into (13), we have

\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}})\leq\frac{\omega^{2}}{\lambda}.

For the second term on the right-hand side of (12), we have

\begin{split}\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}})&=\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}})\\&\leq\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1})\left\lVert(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}}\right\rVert_{\ell^{2}(\mathbb{N})},\end{split}

where the last inequality follows by Von Neumann's inequality. From (14) and Definition 1 in [2], it is easy to verify that

\left\lVert(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}}\right\rVert_{\ell^{2}(\mathbb{N})}\leq 2.

Applying Von Neumann's inequality again, we have

\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}})\leq 2d_{\lambda}\left\lVert(\bm{I}-(\bm{\Gamma}+\lambda\bm{I})^{-1/2}(\bm{\Gamma}-\bm{\Gamma}_{\tilde{n}})(\bm{\Gamma}+\lambda\bm{I})^{-1/2})^{-1}\right\rVert_{\ell^{2}(\mathbb{N})},

which is bounded by 2d_{\lambda}/(1-z) on the event \mathcal{C}_{z}^{c}. We complete the proof by setting z=1/2 and using the fact that

P(\mathcal{C}_{z})\leq 4d_{\lambda}\exp\bigg(-\frac{3\lambda\tilde{n}}{28\omega^{2}}\bigg).

\square

References

  • [1] A. A. Afifi, J. B. Kotlerman, S. L. Ettner, and M. Cowan. Methods for improving regression analysis for skewed continuous or counted responses. Annual Review of Public Health, 28:95–111, 2007.
  • [2] F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23(1):52–72, 2007.
  • [3] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
  • [4] C. P. Chen and C.-Y. Zhang. Data-intensive applications, challenges, techniques and technologies: A survey on big data. Information Sciences, 275:314–347, 2014.
  • [5] L. H. Dicker, D. P. Foster, and D. Hsu. Kernel ridge vs. principal component regression: Minimax bounds and the qualification of regularization operators. Electronic Journal of Statistics, 11(1):1022–1047, 2017.
  • [6] J. Fan, D. Wang, K. Wang, and Z. Zhu. Distributed estimation of principal eigenspaces. Annals of Statistics, 47(6):3009, 2019.
  • [7] D. Freedman and P. Diaconis. On the histogram as a density estimator: L2 theory. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 57(4):453–476, 1981.
  • [8] S. A. van de Geer. Empirical Processes in M-estimation. Cambridge University Press, 2000.
  • [9] C. Gu. Smoothing Spline ANOVA Models. Springer Science & Business Media, 2013.
  • [10] C. Gu and Y.-J. Kim. Penalized likelihood regression: General formulation and efficient approximation. Canadian Journal of Statistics, 30(4):619–628, 2002.
  • [11] Z.-C. Guo, L. Shi, and Q. Wu. Learning theory of distributed regression with bias corrected regularization kernel network. The Journal of Machine Learning Research, 18(1):4237–4261, 2017.
  • [12] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing. Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73:220–239, 2017.
  • [13] T. Hastie. Pseudosplines. Journal of the Royal Statistical Society. Series B (Methodological), pages 379–396, 1996.
  • [14] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, 2009.
  • [15] H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.
  • [16] X. He, L. Shen, and Z. Shen. A data-adaptive knot selection scheme for fitting splines. IEEE Signal Processing Letters, 8(5):137–139, 2001.
  • [17] N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429–449, 2002.
  • [18] B. Krawczyk. Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 5(4):221–232, 2016.
  • [19] S.-B. Lin, X. Guo, and D.-X. Zhou. Distributed learning with regularized least squares. The Journal of Machine Learning Research, 18(1):3202–3232, 2017.
  • [20] M. Liu, Z. Shang, and G. Cheng. How many machines can we use in parallel computing for kernel ridge regression? arXiv preprint arXiv:1805.09948, 2018.
  • [21] J. Lu, G. Cheng, and H. Liu. Nonparametric heterogeneity testing for massive data. arXiv preprint arXiv:1601.06212, 2016.
  • [22] D. G. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, 1997.
  • [23] Z. Luo and G. Wahba. Hybrid adaptive splines. Journal of the American Statistical Association, 92(437):107–116, 1997.
  • [24] P. Ma, J. Z. Huang, and N. Zhang. Efficient computation of smoothing splines via adaptive basis sampling. Biometrika, 102(3):631–645, 2015.
  • [25] L. Mackey, A. Talwalkar, and M. I. Jordan. Divide-and-conquer matrix factorization. Advances in Neural Information Processing Systems, 24, 2011.
  • [26] M. W. Mahoney. Lecture notes on randomized linear algebra. arXiv preprint arXiv:1608.04481, 2016.
  • [27] C. L. Mallows. Some comments on Cp. Technometrics, 42(1):87–94, 2000.
  • [28] D. McGuinness, S. Bennett, and E. Riley. Statistical analysis of highly skewed immune response data. Journal of Immunological Methods, 201(1):99–114, 1997.
  • [29] C. Meng, X. Zhang, J. Zhang, W. Zhong, and P. Ma. More efficient approximation of smoothing splines via space-filling basis selection. Biometrika, 107:723–735, 2020.
  • [30] C. Musco and C. Musco. Recursive sampling for the Nyström method. In Advances in Neural Information Processing Systems, pages 3833–3845, 2017.
  • [31] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
  • [32] A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization. Advances in Neural Information Processing Systems, 28:1657–1665, 2015.
  • [33] D. Ruppert. Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics, 11(4):735–757, 2002.
  • [34] D. W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, 2015.
  • [35] Z. Shang and G. Cheng. Computational limits of a distributed algorithm for smoothing spline. The Journal of Machine Learning Research, 18(1):3809–3845, 2017.
  • [36] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
  • [37] J. C. Sklar, J. Wu, W. Meiring, and Y. Wang. Nonparametric regression with basis selection from multiple libraries. Technometrics, 55(2):189–201, 2013.
  • [38] I. Steinwart, D. R. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In COLT, pages 79–93, 2009.
  • [39] H. A. Sturges. The choice of a class interval. Journal of the American Statistical Association, 21(153):65–66, 1926.
  • [40] G. Wahba. Spline Models for Observational Data. SIAM, 1990.
  • [41] G. Wahba and P. Craven. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377–404, 1978.
  • [42] S. Wang. A practical guide to randomized matrix computations with matlab implementations. arXiv preprint arXiv:1505.07570, 2015.
  • [43] S. X. Wu, H.-T. Wai, L. Li, and A. Scaglione. A review of distributed algorithms for principal component analysis. Proceedings of the IEEE, 106(8):1321–1340, 2018.
  • [44] C. Xu, Y. Zhang, R. Li, and X. Wu. On the feasibility of distributed kernel regression for big data. IEEE Transactions on Knowledge and Data Engineering, 28(11):3041–3052, 2016.
  • [45] D. Xu and Y. Wang. Divide and recombine approaches for fitting smoothing spline models with large datasets. Journal of Computational and Graphical Statistics, 27(3):677–683, 2018.
  • [46] G. Xu, Z. Shang, and G. Cheng. Distributed generalized cross-validation for divide-and-conquer kernel ridge regression and its asymptotic optimality. Journal of Computational and Graphical Statistics, 28(4):891–908, 2019.
  • [47] Y. Yuan, N. Chen, and S. Zhou. Adaptive B-spline knot selection using multi-resolution basis set. IIE Transactions, 45(12):1263–1277, 2013.
  • [48] H. H. Zhang, G. Wahba, Y. Lin, M. Voelker, M. Ferris, R. Klein, and B. Klein. Variable selection and model building via likelihood basis pursuit. Journal of the American Statistical Association, 99(467):659–672, 2004.
  • [49] J. Zhang, H. Jin, Y. Wang, X. Sun, P. Ma, and W. Zhong. Smoothing spline ANOVA models and their applications in complex and massive datasets. Topics in Splines and Applications, page 63, 2018.
  • [50] T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077–2098, 2005.
  • [51] Y. Zhang, J. Duchi, and M. Wainwright. Divide and conquer kernel ridge regression. In Conference on Learning Theory, pages 592–617, 2013.
  • [52] Y. Zhang, J. Duchi, and M. Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. The Journal of Machine Learning Research, 16(1):3299–3340, 2015.
  • [53] T. Zhao, G. Cheng, and H. Liu. A partially linear framework for massive heterogeneous data. Annals of Statistics, 44(4):1400, 2016.