
Oversampling Divide-and-conquer for Response-skewed Kernel Ridge Regression

Jingyi Zhang
Tsinghua University
China
Xiaoxiao Sun
The University of Arizona
Abstract

The divide-and-conquer method has been widely used to compute large-scale kernel ridge regression estimates. Unfortunately, when the response variable is highly skewed, divide-and-conquer kernel ridge regression (dacKRR) may overlook the underrepresented region and produce unacceptable results. To overcome this limitation, we combine a novel response-adaptive partition strategy with the oversampling technique. Through the proposed novel algorithm, we allocate some carefully identified informative observations to multiple nodes (local processors). Although the oversampling technique has been widely used for addressing discrete label skewness, extending it to the dacKRR setting is nontrivial. We provide both theoretical and practical guidance on how to effectively over-sample the observations under the dacKRR setting. Furthermore, we show the proposed estimate has a smaller risk than that of the classical dacKRR estimate under mild conditions. Our theoretical findings are supported by both simulated and real-data analyses.

1 INTRODUCTION

We consider the problem of calculating large-scale kernel ridge regression (KRR) estimates in a nonparametric regression model. Although the theoretical properties of the KRR estimator are well-understood [8, 50, 38], in practice, the computation of KRR estimates may suffer from a large computational burden. In particular, for a sample of size N, it requires O(N^{3}) computational time to calculate a KRR estimate using the standard approach, as will be detailed in Section 2. Such a computational cost is prohibitive when the sample size N is considerable. The divide-and-conquer approach has been implemented pervasively to alleviate such computational burden [51, 52, 44, 45, 46]. Such an approach randomly partitions the full sample into k subsamples of equal sizes, then calculates a local estimate on an independent local processor (also called a local node) for each subsample. The local estimates are then averaged to obtain the global estimate. The divide-and-conquer approach reduces the computational cost of calculating KRR estimates from O(N^{3}) to O(N^{3}/k^{2}). Such savings may be substantial as k grows.

Despite the algorithmic benefits, the success of the divide-and-conquer approach highly depends on the assumption that the subsamples can well represent the observed full sample. Nevertheless, this assumption cannot be guaranteed in many real-world applications, where the response variable Y may have a highly skewed distribution. Specifically, the random variable Y has a highly skewed distribution f_{Y} if f_{Y} is nearly zero inside a large region A_{Y}\subset\mathbb{R}. Problems of this type arise in high energy physics, Bayesian inference, financial research, biomedical research, and environmental studies, among others [28, 1, 12]. In these applications, the responses Y\in A_{Y} occur with reduced frequency. However, such responses are often of more interest as they tend to have a more widespread impact. For example, Y\in A_{Y} may represent a rare signal for seismic activity or stock market fluctuations. Overlooking such signals could result in a substantial negative impact on society, either economically or in terms of human casualties.

Figure 1: An illustration of unacceptable results for the classical dacKRR estimate.

Recall that in the classical divide-and-conquer approach, a random subsample is processed on every local node. Under the aforementioned rare-event scenarios, such a subsample could fail to include observations from the potentially informative region A_{Y}. The local estimate based on such a subsample is thus very likely to overlook the region A_{Y}. Averaging these local estimates could lead to unreliable estimations and predictions over these informative regions. A synthetic example in Fig. 1 illustrates the scenario in which the response is highly skewed. In this example, the one-dimensional sample (gray points) is uniformly generated from [0,1]. The response variable Y has a heavy-tailed distribution, as illustrated by the histogram. The classical dacKRR method is used to estimate the true function (gray curve). We observe that almost all the local estimates (blue curves) miss the peaks of the true function. Averaging these local estimates thus results in a global estimate (red curve) with unacceptable estimation performance. Such an observation is due to the fact that the subsample in each node overlooks the informative region of the distribution of Y (the peaks) and thus fails to provide a successful global estimate.

Our contributions. To combat these obstacles, we develop a novel response-adaptive partition approach with oversampling to obtain large-scale kernel ridge regression estimates, especially when the response is highly skewed. Although the oversampling technique is widely used for addressing discrete label imbalance, extending such a technique to the continuous dacKRR setting is nontrivial. We bridge the gap by providing both theoretical and practical guidance on how to effectively over-sample the observations. Different from the classical divide-and-conquer approach, in which each observation is allocated to only one local node, we propose to allocate some informative data points to multiple nodes. Theoretically, we show the proposed estimates have a smaller risk than the classical dacKRR estimates under mild conditions. Furthermore, we show the number of strata l of the response vector and the number of nodes k regulate the trade-off between the computational time and the risk of the proposed estimator. In particular, when k is well-selected, a larger l is associated with a longer computational time as well as a smaller risk. Such results are novel to the best of our knowledge. Our theoretical findings are supported by both simulated and real-data analyses. In addition, we show the proposed method is not specific to the dacKRR method and has the potential to improve the estimation of other nonparametric regression estimators under the divide-and-conquer setting.

2 PRELIMINARIES

Model setup. Let {\mathcal{H}} be a reproducing kernel Hilbert space (RKHS). Such an RKHS is induced by the reproducing kernel K(\cdot,\cdot):\mathbb{R}^{d}\times\mathbb{R}^{d}\rightarrow\mathbb{R}, which is a symmetric nonnegative definite function, and an inner product \langle\cdot,\cdot\rangle_{\mathcal{H}} satisfying \langle g(\cdot),K(\bm{x},\cdot)\rangle_{\mathcal{H}}=g(\bm{x}) for any g\in{\mathcal{H}}. The RKHS {\mathcal{H}} is equipped with the norm ||g||_{\mathcal{H}}=\langle g(\cdot),g(\cdot)\rangle_{\mathcal{H}}^{1/2}. The well-known Mercer's theorem states that, under some regularity conditions, the kernel function K(\cdot,\cdot) can be written as K(\bm{x},\bm{z})=\sum_{j=1}^{\infty}\mu_{j}\phi_{j}(\bm{x})\phi_{j}(\bm{z}), where \{\mu_{j}\}_{j=1}^{\infty}\geq 0 is a sequence of decreasing eigenvalues and \{\phi_{j}(\cdot)\}_{j=1}^{\infty} is a family of orthonormal basis functions. The smoothness of a function g\in{\mathcal{H}} is characterized by the decay rate of the eigenvalues \{\mu_{j}\}_{j=1}^{\infty}. The major types of such decay rates include the finite rank type, the exponentially decaying type, and the polynomially decaying type. Representative kernel functions of these three major types are the polynomial kernel K(\bm{x},\bm{z})=(1+\bm{x}^{\intercal}\bm{z})^{r} (r is an integer), the Gaussian kernel K(\bm{x},\bm{z})=\exp(-||\bm{x}-\bm{z}||^{2}/\sigma^{2}) (\sigma>0 is the scale parameter and ||\cdot|| is the Euclidean norm), and the kernel K(x,z)=1+\min(x,z), respectively. We refer to [14, 36] for more details.
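To make the kernel notation concrete, the following is a minimal Python/NumPy sketch of the Gaussian kernel matrix referred to above; the vectorized pairwise-distance computation is an implementation choice of this sketch, not a prescription from the paper.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Kernel matrix with entries K(x_i, z_j) = exp(-||x_i - z_j||^2 / sigma^2)."""
    X, Z = np.atleast_2d(X), np.atleast_2d(Z)
    sq_dists = (
        np.sum(X ** 2, axis=1)[:, None]
        + np.sum(Z ** 2, axis=1)[None, :]
        - 2.0 * X @ Z.T
    )
    # Clip tiny negative values caused by floating-point round-off.
    return np.exp(-np.maximum(sq_dists, 0.0) / sigma ** 2)
```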

Consider the nonparametric regression model

y_{i}=\eta(\bm{x}_{i})+\epsilon_{i},\quad i=1,\ldots,N, \qquad (1)

where y_{i}\in\mathbb{R} is the response, \bm{x}_{i}\in\mathbb{R}^{d} is the vector of predictors, \eta is the unknown function to be estimated, and \{\epsilon_{i}\}_{i=1}^{N} are i.i.d. normal random errors with zero mean and unknown variance \sigma^{2}<\infty. The kernel ridge regression estimator aims to find a projection of \eta into the RKHS {\mathcal{H}}. Such an estimator \widehat{\eta} can be written as

\widehat{\eta}=\underset{\eta\in{\mathcal{H}}}{\mathrm{argmin}}\left\{\frac{1}{N}\sum_{i=1}^{N}\{y_{i}-\eta(\bm{x}_{i})\}^{2}+\lambda||\eta||_{\mathcal{H}}^{2}\right\}. \qquad (2)

Here, the regularization parameter \lambda controls the trade-off between the goodness of fit and the smoothness of \widehat{\eta}. A penalized least squares framework analogous to Equation (2) has been extensively studied in the literature on regression splines and smoothing splines [40, 13, 23, 16, 10, 33, 48, 37, 47, 24, 49, 29].

The well-known representer theorem [40] states that the minimizer of Equation (2) in the RKHS {\mathcal{H}} takes the form \widehat{\eta}(\bm{x})=\sum_{i=1}^{N}\beta_{i}K(\bm{x}_{i},\bm{x}). Let \bm{y}=(y_{1},\ldots,y_{N})^{\intercal} be the response vector, \bm{\beta}=(\beta_{1},\ldots,\beta_{N})^{\intercal} be the coefficient vector, and {\mathbf{K}} be the kernel matrix whose (i,j)-th element equals K(\bm{x}_{i},\bm{x}_{j}). The coefficient vector \bm{\beta} can be estimated by solving the following minimization problem,

\widehat{\bm{\beta}}=\underset{\bm{\beta}\in\mathbb{R}^{N}}{\mathrm{argmin}}\;\frac{1}{N}(\bm{y}-{\mathbf{K}}\bm{\beta})^{\intercal}(\bm{y}-{\mathbf{K}}\bm{\beta})+\lambda\bm{\beta}^{\intercal}{\mathbf{K}}\bm{\beta}. \qquad (3)

It is known that the solution of such a minimization problem has a closed form

\widehat{\bm{\beta}}=({\mathbf{K}}+N\lambda\mathbf{I}_{N})^{-1}\bm{y}, \qquad (4)

where the regularization parameter \lambda can be selected based on the cross-validation technique, Mallows's Cp method [27], or the generalized cross-validation (GCV) criterion [41].
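For reference, a minimal sketch of the closed-form solution (4) is given below; solving the linear system instead of forming the matrix inverse explicitly is a numerical choice of this sketch.

```python
import numpy as np

def krr_fit(K, y, lam):
    """Compute beta_hat = (K + N*lam*I)^{-1} y, as in Equation (4)."""
    N = K.shape[0]
    return np.linalg.solve(K + N * lam * np.eye(N), y)

def krr_predict(K_new, beta):
    """Evaluate eta_hat(x) = sum_i beta_i K(x_i, x); rows of K_new are K(x_new, x_i)."""
    return K_new @ beta
```

The linear solve above is the O(N^{3}) step discussed in the next paragraph.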

Although the solution of the minimization problem (3) has a closed form, the computational cost of calculating the solution using Equation (4) is of the order O(N^{3}), which is prohibitive when the sample size N is considerable. In this paper, we focus on the divide-and-conquer approach for alleviating such a computational burden. In recent decades, there have also been a large number of studies that aim to develop low-rank matrix approximation methods to accelerate the calculation of kernel ridge regression estimates [31, 42, 32, 26, 30]. These approaches are beyond the scope of this paper. In practice, one can combine the divide-and-conquer approach with the aforementioned methods to further accelerate the calculation.

Background of the divide-and-conquer approach. The classical divide-and-conquer approach is easy to describe. Rather than solving the minimization problem (3) using the full sample, the divide-and-conquer approach randomly partitions the full sample into k subsamples of equal sizes. Each subsample is then allocated to an independent local node. Next, k minimization problems are solved independently on each local node based on the corresponding subsamples. The final estimate is simply the average of all the local estimates. The divide-and-conquer approach has proven to be effective in linear models [4, 21], partially linear models [53], nonparametric regression models [51, 52, 19, 35, 11], principal component analysis [43, 6], matrix factorization [25], among others.

Algorithm 1 Classical divide-and-conquer kernel ridge regression
Input: The training set \{(\bm{x}_{i},y_{i})\}_{i=1}^{N}; the number of nodes k
Step 1: Randomly and evenly partition the sample \{(\bm{x}_{1},y_{1}),\ldots,(\bm{x}_{N},y_{N})\} into k disjoint subsamples, denoted by \mathcal{S}_{1},\ldots,\mathcal{S}_{k}. Let |\mathcal{S}_{i}| be the number of observations in \mathcal{S}_{i}.
Step 2:
for j in 1,\ldots,k do
     calculate the local kernel ridge regression estimate on the j-th local node
\hat{\eta}_{j}=\underset{\eta\in{\mathcal{H}}}{\mathrm{argmin}}\Big\{\frac{1}{|\mathcal{S}_{j}|}\sum_{(\bm{x},y)\in\mathcal{S}_{j}}(y-\eta(\bm{x}))^{2}+\lambda_{j}||\eta||_{\mathcal{H}}^{2}\Big\}.
end for
Output: Average the local estimates to obtain the final estimate \widetilde{\eta}=\frac{1}{k}\sum_{j=1}^{k}\hat{\eta}_{j}.

Algorithm 1 summarizes the classical divide-and-conquer method under the kernel ridge regression setting. Such an algorithm reduces the computational cost of the estimation from O(N^{3}) to O(N^{3}/k^{2}). The savings may be substantial as k grows.
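For illustration, a compact sketch of Algorithm 1 is given below, assuming the gaussian_kernel and krr_fit helpers sketched in this section are in scope; the common regularization parameter lam is taken as given, since its selection is discussed next.

```python
import numpy as np

def dac_krr_fit(X, y, k, lam, sigma=1.0, rng=None):
    """Classical dacKRR: random equal-sized partition, one local KRR per node."""
    rng = np.random.default_rng(rng)
    parts = np.array_split(rng.permutation(len(y)), k)   # Step 1: random partition
    local_fits = []
    for part in parts:                                   # Step 2: local estimates
        Xj, yj = X[part], y[part]
        local_fits.append((Xj, krr_fit(gaussian_kernel(Xj, Xj, sigma), yj, lam)))
    return local_fits

def dac_krr_predict(local_fits, X_new, sigma=1.0):
    """Output: average the k local estimates to form the global estimate."""
    preds = [gaussian_kernel(X_new, Xj, sigma) @ bj for Xj, bj in local_fits]
    return np.mean(preds, axis=0)
```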

One difficulty in Algorithm 1 is how to choose the regularization parameters \lambda_{j}. A natural way to determine these regularization parameters is to utilize the standard approaches, e.g., Mallows's Cp criterion or the GCV criterion, based only on the local subsample \mathcal{S}_{j}. It is known that such a simple strategy may lead to a global estimate that suffers from suboptimal performance [52, 45]. To overcome this challenge, a large number of studies in the recent decade have been dedicated to developing more effective methods for selecting the local regularization parameters. For example, [52] proposed to select the regularization parameter \lambda_{j} according to the order of the entire sample size N instead of the subsample size N/k. Recently, [45] proposed the distributed GCV method to select a global optimal regularization parameter for nonparametric regression under the classical divide-and-conquer setting.
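To make the order statement concrete, the display below sketches one common instance (kernels with \nu-polynomial eigendecay, using the rate that also appears in Appendix B); the explicit form is our illustration of the idea in [52] rather than a statement taken verbatim from that paper:

\lambda_{j}\asymp N^{-2\nu/(2\nu+1)}\ \text{(order of the full sample size)}\qquad\text{instead of}\qquad\lambda_{j}\asymp (N/k)^{-2\nu/(2\nu+1)}\ \text{(order of the subsample size)}.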

3 MOTIVATION AND METHODOLOGY

Motivation. To motivate the development of the proposed method, we first re-examine the oversampling strategy. Such a strategy is well known in imbalanced data analysis, where the labels can be viewed as skewed discrete response variables [17, 15, 18]. Classical statistical and machine learning algorithms assume that the number of observations in each class is roughly of the same scale. In many real-world cases, however, such an assumption may not hold, resulting in imbalanced data. Imbalanced data pose a difficulty for classical learning algorithms, as they will be biased towards the majority group. Despite its rareness, the minority class is usually more important from the data mining perspective, as it may carry useful and important information. An effective classification algorithm thus should take such imbalance into account. To tackle imbalanced data, the oversampling strategy supplements the training set with multiple copies of the minority classes and keeps all the observations in the majority classes to make the whole dataset suitable for a standard learning algorithm. Such a strategy has proven to be effective for achieving more robust results [17, 15, 18].

Intuitively, data with a continuous skewed response can be considered imbalanced data. In particular, if we divide the range of the skewed responses \{y_{i}\}_{i=1}^{N} into l equally-spaced disjoint intervals, denoted by {\mathcal{Y}}_{1},\ldots,{\mathcal{Y}}_{l}, then there will be a disproportionate ratio of observations in each interval. The intervals that contain more observations can be considered the majority classes, and the ones that contain fewer observations can be considered the minority classes. Recall that the classical divide-and-conquer approach utilizes a simple random subsample from the full sample to calculate each local estimate. Therefore, when the response variable is highly skewed, such a subsample could easily overlook the observations corresponding to the minority classes, resulting in poor estimates.

Main algorithm. Inspired by the oversampling strategy, we propose to supplement the training data with multiple copies of the minority classes before applying the divide-and-conquer approach. In addition, analogous to the downsampling strategy, the subsample in each node should keep most of the observations from the minority classes. Let |\mathcal{S}| be the number of observations in the set \mathcal{S}. Let [\cdot] be the rounding function that rounds a number down to its nearest integer. The proposed method, called oversampling divide-and-conquer kernel ridge regression, is summarized in Algorithm 2.

Figure 2: Illustration of Algorithm 2.

We use a toy example to illustrate Algorithm 2 in Fig. 2. Suppose there are three nodes (k=3), and the histogram of the response is divided into three slices (l=3). The left panel of Fig. 2 illustrates the oversampling process. In particular, the observations are duplicated a certain number of times, such that the total number of observations in each slice is roughly equal across slices. The original observations and the duplicated observations are marked by tilted lines and horizontal lines, respectively. The right panel of Fig. 2 shows the next step. For each slice, the observations within such a slice are randomly and evenly allocated to the nodes. The observations allocated to different nodes are marked by blue, red, and purple, respectively. Finally, similar to the classical dacKRR approach, the local estimates are calculated and then averaged to obtain the global estimate.

Algorithm 2 Oversampling divide-and-conquer kernel ridge regression
Input: The training set \{(\bm{x}_{i},y_{i})\}_{i=1}^{N}; the number of nodes k; the number of slices l
Step 1: Divide the range of the responses \{y_{i}\}_{i=1}^{N} into l disjoint equally-spaced intervals, denoted by {\mathcal{Y}}_{1},\ldots,{\mathcal{Y}}_{l}. Without loss of generality, we assume |{\mathcal{Y}}_{1}|\geq\cdots\geq|{\mathcal{Y}}_{l}|.
Step 2:
for j in 1,\ldots,l do
     Let {\mathcal{Y}}_{j}^{*} be the set that contains [|{\mathcal{Y}}_{1}|/|{\mathcal{Y}}_{j}|] copies of all the elements in {\mathcal{Y}}_{j}. Randomly and evenly partition the set \{(\bm{x},y)\,|\,y\in{\mathcal{Y}}_{j}^{*}\} into k disjoint subsamples, denoted by {\mathcal{A}}_{1,j},\ldots,{\mathcal{A}}_{k,j}.
end for
Step 3:
for i in 1,\ldots,k do
     The subsample for the i-th local node is \mathcal{S}_{i}=\mathcal{D}\{{\mathcal{A}}_{i,1}\cup\cdots\cup{\mathcal{A}}_{i,l}\}, where \mathcal{D}\{\cdot\} denotes de-duplicating. Calculate the local estimate on the i-th local node
\hat{\eta}_{i}=\underset{\eta\in{\mathcal{H}}}{\mathrm{argmin}}\Big\{\frac{1}{|\mathcal{S}_{i}|}\sum_{(\bm{x},y)\in\mathcal{S}_{i}}(y-\eta(\bm{x}))^{2}+\lambda_{i}||\eta||_{\mathcal{H}}^{2}\Big\}.
end for
Output: Average the local estimates to obtain the final estimate \bar{\eta}=\frac{1}{k}\sum_{i=1}^{k}\hat{\eta}_{i}.

Implementation details and computational cost. Analogous to Algorithm 1, a difficulty in Algorithm 2 is how to choose the regularization parameters \lambda_{i}. Throughout this paper, we opt to use the same approach to select the \lambda_{i}'s as the one proposed in [52]. The number of slices l can be set as a constant or be determined by Scott's rule; see [34] for more details. Our empirical studies indicate that the performance of the proposed algorithm is robust to a wide range of choices of l. In addition, to avoid the data dependency within a node introduced by oversampling, we conduct a de-duplicating process before calculating the local estimates.
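The following is a minimal Python/NumPy sketch of Steps 1-3 of Algorithm 2 (equally-spaced slicing, per-slice duplication, random allocation to nodes, and de-duplication); the returned index sets are then fitted exactly as in Algorithm 1 and averaged. The helper name and the index-based bookkeeping are choices of this sketch.

```python
import numpy as np

def oversampled_partition(y, k, l, rng=None):
    """Return k de-duplicated index sets, one per node, following Algorithm 2."""
    rng = np.random.default_rng(rng)
    edges = np.linspace(y.min(), y.max(), l + 1)
    slice_id = np.clip(np.digitize(y, edges[1:-1]), 0, l - 1)   # Step 1: l slices
    n_max = np.bincount(slice_id, minlength=l).max()            # |Y_1|, the largest slice
    nodes = [[] for _ in range(k)]
    for j in range(l):                                          # Step 2: oversample + split
        idx = np.flatnonzero(slice_id == j)
        if idx.size == 0:
            continue
        reps = int(n_max // idx.size)                           # [|Y_1| / |Y_j|] copies
        pool = rng.permutation(np.repeat(idx, reps))
        for i, part in enumerate(np.array_split(pool, k)):
            nodes[i].append(part)
    # Step 3: de-duplicate within each node before fitting the local estimates
    return [np.unique(np.concatenate(parts)) for parts in nodes]
```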

Consider the total number of observations in the training set after the oversampling and de-duplicating processes, and let \tilde{N} denote this number. When the response variable has a roughly uniform distribution, few observations would be copied, and thus one has \tilde{N}=O(N). In such a case, the computational cost of Algorithm 2 is of the same order as that of the classical divide-and-conquer method. In the most extreme case, where the largest class {\mathcal{Y}}_{1} contains almost all the observations, it can be shown that \tilde{N}\leq lN. Therefore, the computational cost of Algorithm 2 is of the order O(l^{3}N^{3}/k^{2}). Consequently, when l is a constant, again, the computational cost of Algorithm 2 is of the same order as that of the classical divide-and-conquer method. Otherwise, when l and k go to infinity, for example, when l is determined by Scott's rule [34] (l=O(N^{1/3})) and k=O(l), the computational cost of Algorithm 2 becomes O(N^{4}/k^{2}). However, such a higher computational cost is associated with theoretical benefits, as will be detailed in the next section.

4 THEORETICAL RESULTS

Main theorem. In this section, we obtain upper bounds on the risk of the proposed estimator \bar{\eta} and establish the main result in Theorem 4.1. We show our bounds contain a smaller asymptotic estimation variance term than the bounds of the classical dacKRR estimator [52, 20]. The following assumptions are required to bound the terms in Theorem 4.1. Assumption 1 guarantees the spectral decomposition of the kernel via Mercer's theorem. Assumption 2 is a regularity assumption on \eta_{0}. These two assumptions are fairly standard. Assumption 3 requires that the basis functions are uniformly bounded. [52] showed that one could easily verify this assumption for a given kernel K. For instance, this assumption holds for many classical kernel functions, e.g., the Gaussian kernel.

Assumption 1. The reproducing kernel K is symmetric, positive definite, and square integrable.

Assumption 2. The underlying function \eta_{0}\in\mathcal{H}.

Assumption 3. There exists some \omega^{2}>0 such that \sup_{\bm{x}\in\mathcal{X}}\sum_{j=1}^{\infty}\mu_{j}\phi_{j}(\bm{x})^{2}\leq\omega^{2}<\infty.

Let \{\tilde{{\mathbf{X}}}_{i},\tilde{\bm{y}}_{i}\} be the subset of size \tilde{n} allocated to the i-th node after de-duplicating, i=1,\ldots,k. We assume the observations are drawn i.i.d. from an unknown probability measure \mathcal{P}, and let \mathcal{P}_{X} denote the marginal distribution of \tilde{{\mathbf{X}}}_{i}. For brevity, we further assume the sample size for each node is the same. The bounds on the risk of the proposed estimator \bar{\eta} are given in Theorem 4.1, followed by the main corollary.

Theorem 4.1.

Let d_{\lambda}=\sum_{j\geq 1}(1+\lambda/\mu_{j})^{-1} be the effective dimensionality of the kernel K [50]. Under Assumptions 1-3, the risk of the proposed estimator \bar{\eta} is bounded by

\begin{split}\mathbb{E}\{\left\lVert\bar{\eta}-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}&\leq 8\lambda\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}+\frac{4d_{\lambda}\sigma^{2}}{\tilde{n}k}\\&\quad+4d_{\lambda}\Big(\mu_{1}^{2}\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}+\frac{\omega^{2}\sigma^{2}}{\lambda\tilde{n}k}\Big)\exp\bigg(-\frac{3\lambda\tilde{n}}{28\omega^{2}}\bigg).\end{split}
Corollary 4.2.

Assuming the first two terms of the bound in Theorem 4.1 are dominant, we have

\mathbb{E}\{\left\lVert\bar{\eta}-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}=O(1)\bigg\{\underbrace{\lambda\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}}_{\text{Squared bias}}+\underbrace{\frac{\sigma^{2}d_{\lambda}}{\tilde{N}}}_{\text{Variance}}\bigg\}. \qquad (5)

With a properly chosen regularization parameter, most of the commonly used kernels, e.g., the kernels with polynomial eigendecay, satisfy this assumption for Corollary 4.2 [9]. More discussions can be found in the Supplementary Material. The squared bias term for the classical dacKRR estimator is of the order O(\lambda\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}), which is the same as the one in Equation (5) [52]. However, different from the variance term in Equation (5), which equals O(\sigma^{2}d_{\lambda}/\tilde{N}), the variance term for the classical dacKRR estimator is of the order O(\sigma^{2}d_{\lambda}/N). Our rate depends on the eigenvalues \{\mu_{j}^{2}\} instead of the trace of the kernel derived in [52]. This rate is tighter than that in [2] and is considered the minimax rate.
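For concreteness, a short sketch of the rate implied by Equation (5) under \nu-polynomial eigendecay (writing the decay as \mu_{j}\asymp j^{-2\nu}, the convention that matches the rate stated in Appendix B; constants are omitted):

d_{\lambda}=\sum_{j\geq 1}(1+\lambda/\mu_{j})^{-1}\asymp\lambda^{-1/(2\nu)},\qquad \mathbb{E}\{\left\lVert\bar{\eta}-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}=O\Big(\lambda+\frac{\lambda^{-1/(2\nu)}}{\tilde{N}}\Big)=O\big(\tilde{N}^{-2\nu/(2\nu+1)}\big)\ \text{ for }\ \lambda\asymp\tilde{N}^{-2\nu/(2\nu+1)}.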

Practical guidance on the selection of the parameters l and k. We now compare the risk of the classical dacKRR estimator with that of the proposed one under three different scenarios. Furthermore, we provide some practical guidance on the selection of the number of slices l and the number of nodes k. We denote the total sample size after oversampling as \tilde{N}^{*}.

First, consider the scenario that N\approx\tilde{N}^{*}, as shown in Fig. 3(a). In this case, oversampling is unnecessary, and thus the proposed estimate \bar{\eta} has the same risk as the classical dacKRR estimator.

Second, consider the scenario that the response variable has a slightly skewed distribution, i.e., one has \tilde{N}^{*}=O(N), as shown in Fig. 3(b). Under this scenario, the total sample size after de-duplicating is also of the same order as the original sample size, i.e., \tilde{N}=O(N). Corollary 4.2 indicates the proposed estimate \bar{\eta} has the same risk as the classical dacKRR estimator.

Third, consider the scenario that the response variable has a highly skewed distribution. Under this scenario, the sample size in the largest slice is of the same order as N, and the sample size in the smallest slice, denoted as N_{l}, is of the order o(N). In these cases, one has \tilde{N}^{*}=O(lN). We then discuss two cases: (1) N_{l}=o(N/l), in particular when l is a constant; and (2) N_{l}=O(N/l). In case (1), the sample size \tilde{N} after de-duplicating is of the same order as N. The risk of \bar{\eta} then has the same order as that of the classical dacKRR estimator. In case (2), since N_{l}=o(N), l has to go to infinity as N\to\infty. In this scenario, we consider k=O(N/N_{l}), i.e., k=O(l); then we have \tilde{N}=O(\tilde{N}^{*}), and Equation (5) becomes

\mathbb{E}\{\left\lVert\bar{\eta}-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}=O(1)\bigg\{\lambda\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}+\frac{\sigma^{2}d_{\lambda}}{lN}\bigg\}. \qquad (6)

For example, when l is determined by Scott's rule, one has l=O(N^{1/3}), and we choose k=O(l). Equation (6) indicates the proposed estimator has a smaller risk than the classical dacKRR estimator. Corollary 4.2 indicates that the parameters l and k regulate the trade-off between the computational time and the risk. Specifically, when k is well-selected, a larger l is associated with a longer computational time as well as a smaller risk.

Figure 3: Three different scenarios for the “skewness”. The solid bars are original data and the lined ones are over-sampled data.

5 SIMULATION RESULTS

To show the effectiveness of the proposed estimator, we compared it with the classical dacKRR estimator in terms of the mean squared error (MSE). We calculated the MSE for each estimator based on 100 replicates. In particular, \mbox{MSE}=\sum_{i=1}^{100}\|\hat{\eta}^{(i)}-\eta_{0}\|^{2}/100, where \eta_{0} is the true function and \hat{\eta}^{(i)} represents the estimator in the i-th replication. Throughout all the experiments in this section, we set the number of nodes k=100. The Gaussian kernel was used for the kernel ridge regression. We followed the procedure in [52] to select the bandwidth for the kernel and the regularization parameters. For the proposed method, we divided the range of \{y_{i}\}_{i=1}^{N} into l slices according to Scott's rule [34].

Figure 4: Comparison of different estimators. Each row represents a different true function and each column represents a different dd.
Figure 5: Comparison for different choices of ll’s.

Recall that in Algorithm 2, each observation is copied a certain number of times, such that the total number of observations in each slice is roughly equal. A natural question is how the number of such copies affects the performance of the proposed method. In other words, if fewer copies are supplemented to the training data, would the proposed method perform better or worse? To answer this question, we introduced a constant \tau to control the oversampling size. Specifically, for the j-th slice, j=1,\ldots,l, we duplicated the observations within such a slice ([\tau|{\mathcal{Y}}_{1}|/|{\mathcal{Y}}_{j}|]-1) times, instead of ([|{\mathcal{Y}}_{1}|/|{\mathcal{Y}}_{j}|]-1) times as in Algorithm 2. We set \tau=\{1,1/2,1/3,1/4,1/5\}. This procedure is equivalent to Algorithm 2 when \tau=1.
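In terms of the hypothetical oversampled_partition sketch given in Section 3, this modification only changes the duplication count; a minimal illustration (tau denotes the constant \tau introduced here):

```python
# Replace the duplication count in the oversampling step so that slice j ends up
# with roughly [tau * |Y_1| / |Y_j|] copies in total; tau = 1 recovers Algorithm 2.
reps = max(1, int(tau * n_max // idx.size))
```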

Let \bm{1}^{d} be the d-dimensional vector whose elements are all equal to 1. We considered a function that has a highly skewed response, i.e.,

g(\bm{x},\bm{c})=\frac{0.1}{\|\bm{x}-\bm{c}\|+0.05}\sin\left(\frac{0.01\pi}{\|\bm{x}-\bm{c}\|+0.05}\right),

where \bm{x}\in[0,1]^{d} and \|\cdot\| represents the \mathbb{L}_{2} norm. We then simulated the data from Model (1) with N=\{2,4,6,8,10\}\times 10^{3}, d=1,2,4,6, and two different regression functions \eta_{0} (a small data-generation sketch follows the two settings below):

Uni-peak: \eta_{0}(\bm{x})=g(\bm{x},\bm{c}), with \bm{c}=0.4\times\bm{1}^{d};

Double-peak: \eta_{0}(\bm{x})=g(\bm{x},\bm{c}_{1})+0.4g(\bm{x},\bm{c}_{2}), with \bm{c}_{1}=0.4\times\bm{1}^{d} and \bm{c}_{2}=0.7\times\bm{1}^{d}.
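A minimal sketch of this data-generating process (in Python/NumPy): the uniform design on [0,1]^{d} and the noise standard deviation sigma are assumptions of the sketch, as the paper does not report them here.

```python
import numpy as np

def g(x, c):
    """The peak function g(x, c) defined above."""
    r = np.linalg.norm(x - c, axis=-1) + 0.05
    return 0.1 / r * np.sin(0.01 * np.pi / r)

def simulate(N, d, sigma=0.1, double_peak=False, rng=None):
    """Draw (x_i, y_i) from Model (1) with the uni- or double-peak eta_0."""
    rng = np.random.default_rng(rng)
    X = rng.uniform(0.0, 1.0, size=(N, d))            # assumed uniform design on [0, 1]^d
    c1, c2 = 0.4 * np.ones(d), 0.7 * np.ones(d)
    eta0 = g(X, c1) + (0.4 * g(X, c2) if double_peak else 0.0)
    y = eta0 + sigma * rng.standard_normal(N)          # i.i.d. Gaussian noise
    return X, y, eta0
```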

Figure 4 shows the MSE versus the sample size N under various settings. Each column represents a different d, and each row represents a different \eta_{0}. The classical dacKRR estimator and the full-sample KRR estimator are labeled as blue lines and black lines, respectively. The proposed estimators are labeled as red lines, and the proposed method with \tau=1, i.e., Algorithm 2, is labeled as solid red lines. The vertical bars represent the standard errors obtained from 100 replications. In Fig. 4, we first observed that the classical dacKRR estimator, as expected, does not perform well. We then observed that the proposed estimators perform consistently better than the classical estimator. Such an observation indicates that when the response is highly skewed, oversampling the observations corresponding to the minority values of the response helps improve the estimation accuracy. Finally, we observed that a larger value of \tau is associated with better performance. In particular, as the number of copies increases, the proposed estimator tends to have a faster convergence rate. This observation supports Corollary 4.2, which states that a larger number of observations after the oversampling process is associated with a smaller estimation MSE.

Besides Scott's rule (l=O(N^{1/3})), we also considered other rules, e.g., Sturges' formula (l=O(\log_{2}N)) [39] and the Freedman-Diaconis choice (l=O(N^{1/2})) [7]. Figure 5 compares the empirical performance of the proposed estimator with respect to different choices of l. For the setting with an unskewed response (the first row), all methods share similar performance. For the response-skewed settings (the second and the third rows), all three aforementioned rules yield better performance than a fixed l. Among all, Scott's rule gives the best results. The simulation results are consistent with Corollary 4.2, which indicates that the proposed estimator shows advantages over the classical one.
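For reference, the number of slices l under such rules can be obtained directly from NumPy's histogram binning utilities; a minimal sketch (the helper name is ours):

```python
import numpy as np

def num_slices(y, rule="scott"):
    """Number of equally-spaced response slices l under a named binning rule."""
    # rule can be "scott", "sturges", or "fd" (Freedman-Diaconis).
    edges = np.histogram_bin_edges(np.asarray(y, dtype=float), bins=rule)
    return len(edges) - 1
```

For example, num_slices(y, "scott") gives one standard implementation of Scott's rule.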

Besides the impact of l, we also studied the impact of k. In addition, we have also applied the proposed strategy to other nonparametric regression estimators, e.g., smoothing splines, under the divide-and-conquer setting. Such simulation results are provided in the Supplementary Material. These results indicate that the performance of the proposed method is robust to the choice of k and has the potential to improve the estimation of other nonparametric regression estimators under the divide-and-conquer setting. It is also possible to consider unequally-spaced slicing methods in the proposed method; however, such extensions are beyond the scope of this paper.

6 REAL DATA ANALYSIS

We applied the proposed method to a real-world dataset called Melbourne-housing-market (data source: https://www.kaggle.com/anthonypino/melbourne-housing-market). The dataset includes housing clearance data in Melbourne from the year 2016 to 2017. Each observation represents a record for a sold house, including the total price and several predictors. Of interest is to predict the price per square meter (ranging from 20 to 30,000) for each sold house using the longitude and the latitude of the house, and the distance from the house to the central business district (CBD). Such a goal can be achieved by using kernel ridge regression.

Figure 6: Left panel: testing MSE versus different number of nodes. Right panel: CPU time (in seconds) versus different number of nodes. Vertical bars represent the standard errors.

We replicated the experiment one hundred times. In each replicate, we used stratified sampling to randomly pick 10\% of the observations as the testing set and used the remaining observations as the training set. The Gaussian kernel was used for the kernel ridge regression. We set the number of nodes k=\{10,30,50,70,90,110\}. We followed the procedure in [52] to select the bandwidth for the kernel and the regularization parameters. The performance of different estimators was evaluated by the MSE with respect to the prediction on the testing set. The experiment was conducted in R using a Windows computer with 16GB of memory and a single-threaded 3.5GHz CPU.
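For illustration, a minimal sketch of one way to carry out the stratified 90/10 split described above (the paper's experiments were run in R; this Python sketch stratifies on response quantiles, which is an assumption on our part since the stratification variable is not specified):

```python
import numpy as np

def stratified_split(X, y, test_frac=0.1, n_strata=10, rng=None):
    """Hold out test_frac of the observations within each response stratum."""
    rng = np.random.default_rng(rng)
    cuts = np.quantile(y, np.linspace(0, 1, n_strata + 1)[1:-1])
    strata = np.digitize(y, cuts)
    test_idx = []
    for s in np.unique(strata):
        idx = np.flatnonzero(strata == s)
        n_test = max(1, int(round(test_frac * idx.size)))
        test_idx.append(rng.choice(idx, size=n_test, replace=False))
    test_idx = np.concatenate(test_idx)
    train = np.setdiff1d(np.arange(len(y)), test_idx)
    return X[train], y[train], X[test_idx], y[test_idx]
```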

The left panel of Fig. 6 shows the MSE versus the number of nodes k, and the right panel of Fig. 6 shows the CPU time (in seconds) versus k. To the best of our knowledge, our work is the first approach that addresses the response-skewed issue under the dacKRR setting, while other approaches only considered discrete-response settings or cannot be easily extended to the divide-and-conquer scenario. Thus we only compare our method with the classical dacKRR approach. In Fig. 6, the blue lines and the red lines represent the classical dacKRR approach and the proposed approach, respectively. Vertical bars represent the standard errors, which are obtained from one hundred replicates. The black lines represent the results of the full-sample KRR estimate, and the gray dashed lines represent its standard errors. We can see that the proposed estimate consistently outperforms the classical dacKRR estimate. In particular, we observe the MSE of the proposed estimate is comparable with the MSE of the full-sample KRR estimate. The MSE of the classical dacKRR estimate, however, almost doubles that of the full-sample KRR estimate when the number of nodes k exceeds one hundred. The right panel of Fig. 6 also shows the CPU time of the proposed approach is comparable with that of the classical divide-and-conquer approach. All these observations are consistent with the findings in the previous section. Such observations indicate the proposed approach outperforms the classical dacKRR approach for response-skewed data without requiring too much extra computing time.

Appendix A SIMULATION STUDIES

A.1 Impact of The Number of Nodes

We considered the same function as in the manuscript, i.e.,

g(\bm{x},\bm{c})=\frac{0.1}{\|\bm{x}-\bm{c}\|+0.05}\sin\left(\frac{0.01\pi}{\|\bm{x}-\bm{c}\|+0.05}\right),

where \bm{x}\in[0,1]^{d} and \|\cdot\| represents the \mathbb{L}_{2} norm. We then simulated the data from Model (1) with N=10^{4}, d=1,2,4,6, and two different regression functions \eta_{0}:

Uni-peak: \eta_{0}(\bm{x})=g(\bm{x},\bm{c}), with \bm{c}=0.4\times\bm{1}^{d};

Double-peak: \eta_{0}(\bm{x})=g(\bm{x},\bm{c}_{1})+0.4g(\bm{x},\bm{c}_{2}), with \bm{c}_{1}=0.4\times\bm{1}^{d} and \bm{c}_{2}=0.7\times\bm{1}^{d}.

Figure 7 studies the impact of the number of nodes k. In Fig. 7, we fix N and let k vary. We observe that the proposed estimator consistently outperforms the classical one under all scenarios.

Figure 7: Comparison for different choices of kk’s.

A.2 Application in Other Non-parametric Regression Methods

Besides the KRR estimator, we also applied the proposed response-adaptive partition strategy to other non-parametric regression methods. One example is smoothing splines under the divide-and-conquer setting. In the simulation, we again study the impact of the oversample size \tilde{n}, the number of slices l, and the number of nodes k. We set the dimension d=4; other settings are similar to the cases for KRR. The results are shown in Fig. 8. Each row of Fig. 8 shows the impact of \tilde{n}, l, and k, respectively. The odd columns represent the estimation MSE, and the even columns represent the CPU time in seconds. The first two columns represent the results of the uni-peak setting, and the last two columns represent the results of the double-peak setting. We can see that the results show a similar pattern as in the KRR cases. These results indicate that the performance of the proposed method is robust across different nonparametric regression methods.

Figure 8: Application of the proposed method in smoothing splines.

Appendix B PROOFS

B.1 Proof of Theorem 4.1

The proof of Theorem 4.1 relies on the following bias-variance decomposition

\begin{split}\mathbb{E}\{\left\lVert\bar{\eta}-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}&=\mathbb{E}\{\left\lVert\mathbb{E}(\bar{\eta})-\eta_{0}+\bar{\eta}-\mathbb{E}(\bar{\eta})\right\rVert^{2}_{\mathcal{P}_{X}}\}\\&=\mathbb{E}\{\left\lVert\mathbb{E}(\bar{\eta})-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}+\mathbb{E}\bigg\{\left\lVert\frac{1}{k}\sum_{j=1}^{k}(\hat{\eta}_{j}-\mathbb{E}(\hat{\eta}_{j}))\right\rVert^{2}_{\mathcal{P}_{X}}\bigg\}\\&=\mathbb{E}\{\left\lVert\mathbb{E}(\hat{\eta}_{1})-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}+\frac{1}{k}\mathbb{E}\{\left\lVert\hat{\eta}_{1}-\mathbb{E}(\hat{\eta}_{1})\right\rVert^{2}_{\mathcal{P}_{X}}\},\end{split}

where \mathbb{E}\{\left\lVert\bar{\eta}-\eta_{0}\right\rVert_{\mathcal{P}_{X}}^{2}\}=\mathbb{E}\{\int_{\mathcal{X}}(\bar{\eta}({\mathbf{X}}_{1})-\eta_{0}({\mathbf{X}}_{1}))^{2}d\mathcal{P}_{X}\}. The claim of Theorem 4.1 can be proved by combining the inequalities in Lemmas B.1 and B.2. The minimax bound for the risk is

\begin{split}\mathbb{E}\{\left\lVert\bar{\eta}-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}&\leq 8\lambda\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}+\frac{4d_{\lambda}\sigma^{2}}{\tilde{n}k}\\&\quad+4d_{\lambda}\Big(\mu_{1}^{2}\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}+\frac{\omega^{2}\sigma^{2}}{\lambda\tilde{n}k}\Big)\exp\bigg(-\frac{3\lambda\tilde{n}}{28\omega^{2}}\bigg).\end{split}

With a properly chosen smoothing parameter, the first two terms in the upper bound are typically dominant for most of the commonly used kernels, e.g., the kernels for the Sobolev spaces; see [9]. For any kernel with \nu-polynomial eigendecay, we obtain the optimal convergence rate for the risk, i.e., O((1/\tilde{N})^{2\nu/(2\nu+1)}), with \lambda=(1/\tilde{N})^{2\nu/(2\nu+1)} for a finite number of nodes. \square

Lemma B.1.

Under Assumptions 1-3, we have

\mathbb{E}\{\left\lVert\mathbb{E}(\hat{\eta}_{1})-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}\leq 8\lambda\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}+4d_{\lambda}\mu_{1}^{2}\left\lVert\eta_{0}\right\rVert^{2}_{\mathcal{H}}\exp\bigg(-\frac{3\lambda\tilde{n}}{28\omega^{2}}\bigg).

Proof. The estimate η^1\hat{\eta}_{1} minimizes the following empirical objective,

\mathcal{L}(\eta)=\frac{1}{2\tilde{n}}\sum_{i=1}^{\tilde{n}}(\tilde{y}_{1i}-{\langle}K_{{\mathbf{X}}_{1i}},\eta{\rangle}_{\mathcal{H}})^{2}+\frac{1}{2}{\langle}J_{\lambda}\eta,\eta{\rangle}_{\mathcal{H}}, \qquad (7)

where (\tilde{y}_{1i}, {\mathbf{X}}_{1i}), i=1,\cdots,\tilde{n}, are the data assigned to the first node, J_{\lambda}:\mathcal{H}\rightarrow\mathcal{H} is a self-adjoint operator such that {\langle}J_{\lambda}\eta,\eta{\rangle}_{\mathcal{H}}=\lambda{\langle}\eta,\eta{\rangle}_{\mathcal{H}} for \eta\in\mathcal{H}, and \eta({\mathbf{X}}_{1i})={\langle}K_{{\mathbf{X}}_{1i}},\eta{\rangle}_{\mathcal{H}}. Note that the objective function (7) is Fréchet differentiable [22]. We can obtain

\begin{split}\frac{\partial\mathcal{L}(\eta)}{\partial\eta}&=-\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}(\tilde{y}_{1i}-{\langle}K_{{\mathbf{X}}_{1i}},\eta{\rangle}_{\mathcal{H}})K_{{\mathbf{X}}_{1i}}+\lambda\eta\\&=-\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}\tilde{y}_{1i}K_{{\mathbf{X}}_{1i}}+\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}{\langle}K_{{\mathbf{X}}_{1i}},\eta{\rangle}_{{\mathcal{H}}}K_{{\mathbf{X}}_{1i}}+\lambda\eta.\end{split}

For notational simplicity, we define

\begin{split}\Gamma_{{\mathbf{X}}_{1}}\eta&=\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}{\langle}K_{{\mathbf{X}}_{1i}},\eta{\rangle}_{{\mathcal{H}}}K_{{\mathbf{X}}_{1i}}=\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}\eta({\mathbf{X}}_{1i})K_{{\mathbf{X}}_{1i}},\\F_{{\mathbf{X}}_{1}}^{*}\bm{\tilde{y}}_{1}&=\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}\tilde{y}_{1i}K_{{\mathbf{X}}_{1i}},\end{split}

where F_{{\mathbf{X}}_{1}}^{*} is the adjoint of F_{{\mathbf{X}}_{1}}\eta=({\langle}\eta,K_{{\mathbf{X}}_{11}}{\rangle}_{{\mathcal{H}}},\cdots,{\langle}\eta,K_{{\mathbf{X}}_{1\tilde{n}}}{\rangle}_{{\mathcal{H}}})^{\intercal}=(\eta({\mathbf{X}}_{11}),\cdots,\eta({\mathbf{X}}_{1\tilde{n}}))^{\intercal}. We then have that

\frac{\partial\mathcal{L}(\eta)}{\partial\eta}=-F_{{\mathbf{X}}_{1}}^{*}\bm{\tilde{y}}_{1}+\Gamma_{{\mathbf{X}}_{1}}\eta+\lambda\eta,

which yields the estimate

\hat{\eta}_{1}=(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}F_{{\mathbf{X}}_{1}}^{*}\bm{\tilde{y}}_{1},

where \bm{\tilde{y}}_{1}=(\tilde{y}_{11},\cdots,\tilde{y}_{1\tilde{n}})^{\intercal}. Using the fact that \bm{\tilde{y}}_{1}=F_{{\mathbf{X}}_{1}}\eta_{0}+\bm{\epsilon}_{1} and plugging it into the equation above yields

\hat{\eta}_{1}=(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}F_{{\mathbf{X}}_{1}}^{*}F_{{\mathbf{X}}_{1}}\eta_{0}+(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}F_{{\mathbf{X}}_{1}}^{*}\bm{\epsilon}_{1}. \qquad (8)

Taking expectation on both sides of (8), we obtain

\mathbb{E}(\hat{\eta}_{1})=(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}F_{{\mathbf{X}}_{1}}^{*}F_{{\mathbf{X}}_{1}}\eta_{0}.

The difference between \mathbb{E}(\hat{\eta}_{1}) and \eta_{0} can be written as

\eta_{0}-\mathbb{E}(\hat{\eta}_{1})=\{I-(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}\Gamma_{{\mathbf{X}}_{1}}\}\eta_{0},

where \Gamma_{{\mathbf{X}}_{1}}\eta_{0}=\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}{\langle}K_{{\mathbf{X}}_{1i}},\eta_{0}{\rangle}_{{\mathcal{H}}}K_{{\mathbf{X}}_{1i}}=\frac{1}{\tilde{n}}\sum_{i=1}^{\tilde{n}}\eta_{0}({\mathbf{X}}_{1i})K_{{\mathbf{X}}_{1i}}=F_{{\mathbf{X}}_{1}}^{*}F_{{\mathbf{X}}_{1}}\eta_{0}. Thus, we have

\mathbb{E}\{\left\lVert\mathbb{E}(\hat{\eta}_{1})-\eta_{0}\right\rVert^{2}_{\mathcal{P}_{X}}\}=\mathbb{E}\left\lVert\Gamma^{1/2}\{I-(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}\Gamma_{{\mathbf{X}}_{1}}\}\eta_{0}\right\rVert_{{\mathcal{H}}}^{2}, \qquad (9)

where \Gamma\eta=\int_{\mathcal{X}}{\langle}\eta,K_{\bm{x}}{\rangle}_{{\mathcal{H}}}K_{\bm{x}}\,d\mathcal{P}_{X}(\bm{x}) for \eta\in{\mathcal{H}}. The equation above uses the fact that one can move between the {\mathcal{H}} and L^{2}(\mathcal{P}_{X}) inner products as follows:

{\langle}f,g{\rangle}_{\mathcal{P}_{X}}={\langle}f,\Gamma g{\rangle}_{{\mathcal{H}}}.

Let \phi_{k}, k=1,2,\cdots, be the orthonormal basis for {\mathcal{H}} and \bm{\Phi}=(\phi_{k}({\mathbf{X}}_{1i}))_{1\leq i\leq\tilde{n};1\leq k<\infty} be the \tilde{n}\times\infty matrix based on the orthonormal basis. We then have \eta_{0}=\sum_{k=1}^{\infty}a_{k}\phi_{k}, and the norm in (9) can be written as

\mathbb{E}\left\lVert\bm{\Gamma}^{1/2}\{\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}}\}\bm{a}\right\rVert_{\ell^{2}(\mathbb{N})}^{2},

where \bm{\Gamma}=\text{diag}(\mu_{1}^{2},\mu_{2}^{2},\cdots), \bm{I}=\text{diag}(1,1,\cdots), \bm{\Gamma}_{\tilde{n}}=\frac{1}{\tilde{n}}\bm{\Phi}^{\intercal}\bm{\Phi}, and \bm{a}=(a_{1},a_{2},\cdots)^{\intercal}.

\begin{split}\mathbb{E}\left\lVert\bm{\Gamma}^{1/2}\{\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}}\}\bm{a}\right\rVert_{\ell^{2}(\mathbb{N})}^{2}&\leq\mathbb{E}\bigg\{\left\lVert\bm{\Gamma}^{1/2}\{\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}}\}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\mathbbm{1}(\mathcal{C}_{z}^{c})\bigg\}\left\lVert\bm{a}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\\&\quad+\mathbb{E}\bigg\{\left\lVert\bm{\Gamma}^{1/2}\{\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}}\}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\mathbbm{1}(\mathcal{C}_{z})\bigg\}\left\lVert\bm{a}\right\rVert^{2}_{\ell^{2}(\mathbb{N})},\end{split} \qquad (10)

where, for 0<z<1, the event

\mathcal{C}_{z}=\bigg\{\left\lVert(\bm{\Gamma}+\lambda\bm{I})^{-1/2}(\bm{\Gamma}_{\tilde{n}}-\bm{\Gamma})(\bm{\Gamma}+\lambda\bm{I})^{-1/2}\right\rVert_{\ell^{2}(\mathbb{N})}\geq z\bigg\},

and \mathcal{C}_{z}^{c} is the complement of \mathcal{C}_{z}. We use the fact that \mathbbm{1}(\mathcal{C}_{z}^{c})\,\mathbbm{1}(\mathcal{C}_{z})=0 to obtain the inequality in (10). We now bound the first term on the right-hand side of (10).

\begin{split}\left\lVert\bm{\Gamma}^{1/2}\{\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}}\}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}&=\left\lVert\bm{\Gamma}^{1/2}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})(\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}})\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\\&\leq\left\lVert(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1/2}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\left\lVert(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})(\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}})\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\left\lVert\bm{\Gamma}^{1/2}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1/2}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\\&\leq 4\lambda^{2}\left\lVert\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\right\rVert_{\ell^{2}(\mathbb{N})}\left\lVert(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\right\rVert_{\ell^{2}(\mathbb{N})}.\end{split}

The last inequality follows based on Definition 1 in [2]. For 0\leq\gamma\leq 1, we have

\begin{split}\left\lVert\bm{\Gamma}^{\gamma/2}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}^{\gamma/2}\right\rVert_{\ell^{2}(\mathbb{N})}&=\left\lVert\bm{\Gamma}^{\gamma/2}\{(\bm{\Gamma}_{\tilde{n}}-\bm{\Gamma})+(\bm{\Gamma}+\lambda\bm{I})\}^{-1}\bm{\Gamma}^{\gamma/2}\right\rVert_{\ell^{2}(\mathbb{N})}\\&\leq\left\lVert\bm{\Gamma}^{\gamma}(\bm{\Gamma}+\lambda\bm{I})^{-1}\right\rVert_{\ell^{2}(\mathbb{N})}\left\lVert(\bm{I}-(\bm{\Gamma}+\lambda\bm{I})^{-1/2}(\bm{\Gamma}-\bm{\Gamma}_{\tilde{n}})(\bm{\Gamma}+\lambda\bm{I})^{-1/2})^{-1}\right\rVert_{\ell^{2}(\mathbb{N})}\\&\leq\lambda^{\gamma-1}\left\lVert(\bm{I}-(\bm{\Gamma}+\lambda\bm{I})^{-1/2}(\bm{\Gamma}-\bm{\Gamma}_{\tilde{n}})(\bm{\Gamma}+\lambda\bm{I})^{-1/2})^{-1}\right\rVert_{\ell^{2}(\mathbb{N})},\end{split}

where the last inequality follows based on the spectral theorem. Detailed discussions can be found in [3] and [9, Chapter 9]. Applying the above inequality on the event \mathcal{C}_{z}^{c} when \gamma=1 and \gamma=0, we obtain

\left\lVert\bm{\Gamma}^{1/2}\{\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}}\}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\mathbbm{1}(\mathcal{C}_{z}^{c})\leq\frac{4\lambda}{(1-z)^{2}},

for any 0<z\leq 1/2.

For the second term in (10), we have that

\begin{split}\mathbb{E}\bigg\{\left\lVert\bm{\Gamma}^{1/2}\{\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}}\}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\mathbbm{1}(\mathcal{C}_{z})\bigg\}\left\lVert\bm{a}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}&\leq\left\lVert\bm{\Gamma}^{1/2}\right\rVert_{\ell^{2}(\mathbb{N})}^{2}\left\lVert\bm{I}-(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Gamma}_{\tilde{n}}\right\rVert_{\ell^{2}(\mathbb{N})}^{2}P(\mathcal{C}_{z})\left\lVert\bm{a}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\\&\leq\mu_{1}^{2}\left\lVert\bm{a}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}P(\mathcal{C}_{z}),\end{split}

where the last inequality follows based on Definition 1 in [2]. We complete the proof by applying Lemma 1 adapted from [5], which states that

P(\mathcal{C}_{z})\leq 4d_{\lambda}\exp\bigg(-\frac{3\lambda\tilde{n}}{28\omega^{2}}\bigg).

\square

Lemma B.2.

Under Assumptions 1-3, we have

\mathbb{E}\{\left\lVert\hat{\eta}_{1}-\mathbb{E}(\hat{\eta}_{1})\right\rVert^{2}_{\mathcal{P}_{X}}\}\leq\frac{4d_{\lambda}\sigma^{2}}{\tilde{n}}+4d_{\lambda}\frac{\omega^{2}\sigma^{2}}{\lambda\tilde{n}}\exp\bigg(-\frac{3\lambda\tilde{n}}{28\omega^{2}}\bigg).

Proof. Note that

\mathbb{E}(\hat{\eta}_{1})=(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}F_{{\mathbf{X}}_{1}}^{*}F_{{\mathbf{X}}_{1}}\eta_{0}.

From (8), we thus have

\mathbb{E}\{\left\lVert\hat{\eta}_{1}-\mathbb{E}(\hat{\eta}_{1})\right\rVert^{2}_{\mathcal{P}_{X}}\}=\mathbb{E}\left\lVert\Gamma^{1/2}(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}F_{{\mathbf{X}}_{1}}^{*}\bm{\epsilon}_{1}\right\rVert^{2}_{\mathcal{H}}. \qquad (11)

Following similar techniques as in the proof of Lemma B.1, the {\mathcal{H}}-norm in (11) can be written as

\mathbb{E}\left\lVert\Gamma^{1/2}(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}F_{{\mathbf{X}}_{1}}^{*}\bm{\epsilon}_{1}\right\rVert^{2}_{\mathcal{H}}=\frac{1}{\tilde{n}^{2}}\mathbb{E}\left\lVert\bm{\Gamma}^{1/2}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Phi}^{\intercal}\bm{\epsilon}_{1}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}.

By Von Neumann's inequality and the assumption that \mathbb{E}[\epsilon_{1i}^{2}]=\sigma^{2} for i=1,\cdots,\tilde{n}, we obtain

\frac{1}{\tilde{n}^{2}}\mathbb{E}\left\lVert\bm{\Gamma}^{1/2}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Phi}^{\intercal}\bm{\epsilon}_{1}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}\leq\frac{\sigma^{2}}{\tilde{n}}\mathbb{E}\{\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}})\}.

The rest of the proof is similar to that of Lemma B.1. Following the same technique as in (10), we have

\begin{split}\frac{1}{\tilde{n}^{2}}\mathbb{E}\left\lVert\bm{\Gamma}^{1/2}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}\bm{\Phi}^{\intercal}\bm{\epsilon}_{1}\right\rVert^{2}_{\ell^{2}(\mathbb{N})}&\leq\frac{\sigma^{2}}{\tilde{n}}\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}})P(\mathcal{C}_{z})\\&\quad+\frac{\sigma^{2}}{\tilde{n}}\mathbb{E}\{\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}})\mathbbm{1}(\mathcal{C}_{z}^{c})\}.\end{split} \qquad (12)

We now bound the first term on the right-hand side of (12). By Von Neumann's inequality, we have

\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}})\leq\operatorname{Tr}(\bm{\Gamma})\left\lVert(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}}\right\rVert_{\ell^{2}(\mathbb{N})}. \qquad (13)

From the Cauchy-Schwarz inequality, we have the following inequality for any \eta\in{\mathcal{H}}:

{\langle}\Gamma_{{\mathbf{X}}_{1}}(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}\eta,(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}\eta{\rangle}_{{\mathcal{H}}}\leq\left\lVert\Gamma_{{\mathbf{X}}_{1}}(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}\eta\right\rVert_{{\mathcal{H}}}\left\lVert(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}\eta\right\rVert_{{\mathcal{H}}}.

From Definition 1 in [2], we also have

\left\lVert\Gamma_{{\mathbf{X}}_{1}}(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}\eta\right\rVert_{{\mathcal{H}}}\leq\left\lVert\eta\right\rVert_{{\mathcal{H}}},

and

\left\lVert(\Gamma_{{\mathbf{X}}_{1}}+\lambda I)^{-1}\eta\right\rVert_{{\mathcal{H}}}\leq\frac{1}{\lambda}\left\lVert\eta\right\rVert_{{\mathcal{H}}}.

Thus, we have

\left\lVert(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}}\right\rVert_{\ell^{2}(\mathbb{N})}\leq\frac{1}{\lambda}. \qquad (14)

Plugging (14) into (13), we have

\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}})\leq\frac{\omega^{2}}{\lambda}.

For the second term on the right-hand side of (12), we have

\begin{split}\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}})&=\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}})\\&\leq\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-1})\left\lVert(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}}\right\rVert_{\ell^{2}(\mathbb{N})},\end{split}

where the last inequality follows by Von Neumann's inequality. From (14) and Definition 1 in [2], it is easy to verify that

\left\lVert(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}}\right\rVert_{\ell^{2}(\mathbb{N})}\leq 2.

Applying Von Neumann's inequality again, we have

\operatorname{Tr}(\bm{\Gamma}(\bm{\Gamma}_{\tilde{n}}+\lambda\bm{I})^{-2}\bm{\Gamma}_{\tilde{n}})\leq 2d_{\lambda}\left\lVert(\bm{I}-(\bm{\Gamma}+\lambda\bm{I})^{-1/2}(\bm{\Gamma}-\bm{\Gamma}_{\tilde{n}})(\bm{\Gamma}+\lambda\bm{I})^{-1/2})^{-1}\right\rVert_{\ell^{2}(\mathbb{N})},

which is bounded by 2d_{\lambda}/(1-z) on the event \mathcal{C}_{z}^{c}. We complete the proof by setting z=1/2 and using the fact that

P(\mathcal{C}_{z})\leq 4d_{\lambda}\exp\bigg(-\frac{3\lambda\tilde{n}}{28\omega^{2}}\bigg).

\square

References

  • [1] A. A. Afifi, J. B. Kotlerman, S. L. Ettner, and M. Cowan. Methods for improving regression analysis for skewed continuous or counted responses. Annual Review of Public Health, 28:95–111, 2007.
  • [2] F. Bauer, S. Pereverzev, and L. Rosasco. On regularization algorithms in learning theory. Journal of Complexity, 23(1):52–72, 2007.
  • [3] A. Caponnetto and E. De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7(3):331–368, 2007.
  • [4] C. P. Chen and C.-Y. Zhang. Data-intensive applications, challenges, techniques and technologies: A survey on big data. Information Sciences, 275:314–347, 2014.
  • [5] L. H. Dicker, D. P. Foster, and D. Hsu. Kernel ridge vs. principal component regression: Minimax bounds and the qualification of regularization operators. Electronic Journal of Statistics, 11(1):1022–1047, 2017.
  • [6] J. Fan, D. Wang, K. Wang, and Z. Zhu. Distributed estimation of principal eigenspaces. Annals of Statistics, 47(6):3009, 2019.
  • [7] D. Freedman and P. Diaconis. On the histogram as a density estimator: L2 theory. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 57(4):453–476, 1981.
  • [8] S. A. van de Geer. Empirical Processes in M-estimation. Cambridge University Press, 2000.
  • [9] C. Gu. Smoothing Spline ANOVA Models. Springer Science & Business Media, 2013.
  • [10] C. Gu and Y.-J. Kim. Penalized likelihood regression: General formulation and efficient approximation. Canadian Journal of Statistics, 30(4):619–628, 2002.
  • [11] Z.-C. Guo, L. Shi, and Q. Wu. Learning theory of distributed regression with bias corrected regularization kernel network. The Journal of Machine Learning Research, 18(1):4237–4261, 2017.
  • [12] G. Haixiang, L. Yijing, J. Shang, G. Mingyun, H. Yuanyue, and G. Bing. Learning from class-imbalanced data: Review of methods and applications. Expert Systems with Applications, 73:220–239, 2017.
  • [13] T. Hastie. Pseudosplines. Journal of the Royal Statistical Society. Series B (Methodological), pages 379–396, 1996.
  • [14] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Science & Business Media, 2009.
  • [15] H. He and E. A. Garcia. Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9):1263–1284, 2009.
  • [16] X. He, L. Shen, and Z. Shen. A data-adaptive knot selection scheme for fitting splines. IEEE Signal Processing Letters, 8(5):137–139, 2001.
  • [17] N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent Data Analysis, 6(5):429–449, 2002.
  • [18] B. Krawczyk. Learning from imbalanced data: open challenges and future directions. Progress in Artificial Intelligence, 5(4):221–232, 2016.
  • [19] S.-B. Lin, X. Guo, and D.-X. Zhou. Distributed learning with regularized least squares. The Journal of Machine Learning Research, 18(1):3202–3232, 2017.
  • [20] M. Liu, Z. Shang, and G. Cheng. How many machines can we use in parallel computing for kernel ridge regression? arXiv preprint arXiv:1805.09948, 2018.
  • [21] J. Lu, G. Cheng, and H. Liu. Nonparametric heterogeneity testing for massive data. arXiv preprint arXiv:1601.06212, 2016.
  • [22] D. G. Luenberger. Optimization by Vector Space Methods. John Wiley & Sons, 1997.
  • [23] Z. Luo and G. Wahba. Hybrid adaptive splines. Journal of the American Statistical Association, 92(437):107–116, 1997.
  • [24] P. Ma, J. Z. Huang, and N. Zhang. Efficient computation of smoothing splines via adaptive basis sampling. Biometrika, 102(3):631–645, 2015.
  • [25] L. Mackey, A. Talwalkar, and M. I. Jordan. Divide-and-conquer matrix factorization. Advances in Neural Information Processing Systems, 24, 2011.
  • [26] M. W. Mahoney. Lecture notes on randomized linear algebra. arXiv preprint arXiv:1608.04481, 2016.
  • [27] C. L. Mallows. Some comments on Cp. Technometrics, 42(1):87–94, 2000.
  • [28] D. McGuinness, S. Bennett, and E. Riley. Statistical analysis of highly skewed immune response data. Journal of Immunological Methods, 201(1):99–114, 1997.
  • [29] C. Meng, X. Zhang, J. Zhang, W. Zhong, and P. Ma. More efficient approximation of smoothing splines via space-filling basis selection. Biometrika, 107:723–735, 2020.
  • [30] C. Musco and C. Musco. Recursive sampling for the Nyström method. In Advances in Neural Information Processing Systems, pages 3833–3845, 2017.
  • [31] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177–1184, 2008.
  • [32] A. Rudi, R. Camoriano, and L. Rosasco. Less is more: Nyström computational regularization. Advances in Neural Information Processing Systems, 28:1657–1665, 2015.
  • [33] D. Ruppert. Selecting the number of knots for penalized splines. Journal of Computational and Graphical Statistics, 11(4):735–757, 2002.
  • [34] D. W. Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, 2015.
  • [35] Z. Shang and G. Cheng. Computational limits of a distributed algorithm for smoothing spline. The Journal of Machine Learning Research, 18(1):3809–3845, 2017.
  • [36] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.
  • [37] J. C. Sklar, J. Wu, W. Meiring, and Y. Wang. Nonparametric regression with basis selection from multiple libraries. Technometrics, 55(2):189–201, 2013.
  • [38] I. Steinwart, D. R. Hush, and C. Scovel. Optimal rates for regularized least squares regression. In COLT, pages 79–93, 2009.
  • [39] H. A. Sturges. The choice of a class interval. Journal of the American Statistical Association, 21(153):65–66, 1926.
  • [40] G. Wahba. Spline Models for Observational Data. SIAM, 1990.
  • [41] G. Wahba and P. Craven. Smoothing noisy data with spline functions: Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377–404, 1978.
  • [42] S. Wang. A practical guide to randomized matrix computations with matlab implementations. arXiv preprint arXiv:1505.07570, 2015.
  • [43] S. X. Wu, H.-T. Wai, L. Li, and A. Scaglione. A review of distributed algorithms for principal component analysis. Proceedings of the IEEE, 106(8):1321–1340, 2018.
  • [44] C. Xu, Y. Zhang, R. Li, and X. Wu. On the feasibility of distributed kernel regression for big data. IEEE Transactions on Knowledge and Data Engineering, 28(11):3041–3052, 2016.
  • [45] D. Xu and Y. Wang. Divide and recombine approaches for fitting smoothing spline models with large datasets. Journal of Computational and Graphical Statistics, 27(3):677–683, 2018.
  • [46] G. Xu, Z. Shang, and G. Cheng. Distributed generalized cross-validation for divide-and-conquer kernel ridge regression and its asymptotic optimality. Journal of Computational and Graphical Statistics, 28(4):891–908, 2019.
  • [47] Y. Yuan, N. Chen, and S. Zhou. Adaptive B-spline knot selection using multi-resolution basis set. IIE Transactions, 45(12):1263–1277, 2013.
  • [48] H. H. Zhang, G. Wahba, Y. Lin, M. Voelker, M. Ferris, R. Klein, and B. Klein. Variable selection and model building via likelihood basis pursuit. Journal of the American Statistical Association, 99(467):659–672, 2004.
  • [49] J. Zhang, H. Jin, Y. Wang, X. Sun, P. Ma, and W. Zhong. Smoothing spline ANOVA models and their applications in complex and massive datasets. Topics in Splines and Applications, page 63, 2018.
  • [50] T. Zhang. Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9):2077–2098, 2005.
  • [51] Y. Zhang, J. Duchi, and M. Wainwright. Divide and conquer kernel ridge regression. In Conference on Learning Theory, pages 592–617, 2013.
  • [52] Y. Zhang, J. Duchi, and M. Wainwright. Divide and conquer kernel ridge regression: A distributed algorithm with minimax optimal rates. The Journal of Machine Learning Research, 16(1):3299–3340, 2015.
  • [53] T. Zhao, G. Cheng, and H. Liu. A partially linear framework for massive heterogeneous data. Annals of Statistics, 44(4):1400, 2016.