
A Kernel-Based Neural Network for High-dimensional Genetic Risk Prediction Analysis

Xiaoxi Shen, Xiaoran Tong and Qing Lu
Abstract

Risk prediction capitalizing on emerging human genome findings holds great promise for new prediction and prevention strategies. While the large amounts of genetic data generated from high-throughput technologies offer us a unique opportunity to study a deep catalog of genetic variants for risk prediction, the high dimensionality of genetic data and the complex relationships between genetic variants and disease outcomes bring tremendous challenges to risk prediction analysis. To address these challenges, we propose a kernel-based neural network (KNN) method. KNN inherits features from both linear mixed models (LMM) and classical neural networks and is designed for high-dimensional risk prediction analysis. To deal with datasets with millions of variants, KNN summarizes genetic data into kernel matrices and uses the kernel matrices as inputs. Based on the kernel matrices, KNN builds a single-layer feedforward neural network, which makes it feasible to consider complex relationships between genetic variants and disease outcomes. Parameter estimation in KNN is based on MINQUE, and we show that, under certain conditions, the average prediction error of KNN can be smaller than that of LMM. Simulation studies also confirm these results.


Keywords: Human genome, Complex relationships, MINQUE

1 Introduction

Linear mixed effect models are powerful tools for modeling complex data structures. By adding random effects into the model, it becomes feasible to model correlated observations. Moreover, linear mixed effect models also make it possible to obtain best predictions of the random effects. In genetic studies, further advantages of linear mixed effect models have been explored. For instance, in genome-wide association studies (GWAS) (Bush and Moore, 2012), a simple linear regression is conducted on each single-nucleotide polymorphism (SNP), so that a large number of hypotheses must be tested and the issue of multiple-testing correction must be addressed. On the other hand, if we treat the genetic effect as a random effect, the null hypothesis reduces to testing whether the variance component of the random effect is zero. Here are some well-known applications: in the sequence kernel association test (SKAT) proposed by Wu et al. (2011), a score-type test based on a mixed effect model is used to test the genetic effect; Yang et al. (2011) created the tool known as genome-wide complex trait analysis (GCTA), which is also based on a linear mixed model, to address the "missing heritability" problem.

In this paper, we propose a method, which we call the kernel neural network (KNN), for high-dimensional risk prediction analysis. As will be seen in the paper, under certain scenarios our model reduces to a linear mixed effect model. We call the method KNN because it also inherits an important property from neural networks: the model can be used to consider nonlinear effects. Due to the complex structure of the method, it is difficult to obtain estimators for its parameters; worse, not all of the parameters are identifiable. To address this issue, instead of using popular likelihood-based inference via the restricted maximum likelihood estimator (REML) (Corbeil and Searle, 1976), we use the minimum norm quadratic unbiased estimation (MINQUE) approach proposed by Rao (1970, 1971, 1972) to estimate the "variance components". We show both theoretically and empirically that the model has some interesting properties.

The remainder of the paper is organized as follows: Section 2 provides a basic description of KNN and the estimation procedure for its parameters; Section 3 discusses how to make predictions using KNN; Section 4 extends the model to include fixed effects; Section 5 presents simulation results; and Section 6 applies the method to real data. Before proceeding to the main text, we first summarize some notation that will be used frequently in this paper.

Notation: Throughout the paper, capital bold italic letters \boldsymbol{A},\boldsymbol{B},\ldots,\boldsymbol{\Gamma},\boldsymbol{\Theta},\ldots denote matrices; small bold italic letters \boldsymbol{a},\boldsymbol{b},\ldots,\boldsymbol{\alpha},\boldsymbol{\beta},\ldots denote vectors; and other small letters denote scalars. \boldsymbol{I}_{n} denotes the n\times n identity matrix, and the symbol "\precsim" denotes "asymptotically less than".

2 Methodologies

Kernel methods are widely used in modern machine learning due to their capability of capturing nonlinear features from the data, which can reduce the prediction error. As mentioned in Shawe-Taylor et al. (2004), given a kernel and a training set, the kernel matrix acts as an information bottleneck, as all the information available to a kernel algorithm must be extracted from this matrix. On the other hand, linear mixed effect models are also widely used in the area of genetic risk prediction (Yang et al., 2011). Therefore, it seems natural to combine these two methods. A naive way is to change the covariance matrix of the random effect in the linear mixed model to a kernel matrix. For instance, Yang et al. (2011) consider the following linear mixed model:

𝒚=𝒁𝜷+𝒂+ϵ,\boldsymbol{y}=\boldsymbol{Z\beta}+\boldsymbol{a}+\boldsymbol{\epsilon}, (1)

where \boldsymbol{y}\in\mathbb{R}^{n} is a vector of phenotypes; \boldsymbol{Z} is the design matrix for the fixed effects \boldsymbol{\beta}; \boldsymbol{a}\in\mathbb{R}^{n} is the vector of total genetic effects of the individuals, with \boldsymbol{a}\sim\mathcal{N}_{n}\left(\boldsymbol{0},\sigma_{a}^{2}\boldsymbol{K}\right), where \boldsymbol{K} can be interpreted as the genetic relationship matrix between subjects; and \boldsymbol{\epsilon}\sim\mathcal{N}_{n}(\boldsymbol{0},\sigma_{\epsilon}^{2}\boldsymbol{I}_{n}). This model can also be written in the following hierarchical form:

𝒚|𝒁,𝒂\displaystyle\boldsymbol{y}|\boldsymbol{Z},\boldsymbol{a} 𝒩n(𝒁𝜷+𝒂,σϵ2𝑰n)\displaystyle\sim\mathcal{N}_{n}\left(\boldsymbol{Z\beta}+\boldsymbol{a},\sigma_{\epsilon}^{2}\boldsymbol{I}_{n}\right)
𝒂\displaystyle\boldsymbol{a} 𝒩n(𝟎,σa2𝑲).\displaystyle\sim\mathcal{N}_{n}\left(\boldsymbol{0},\sigma_{a}^{2}\boldsymbol{K}\right).

To model more complex genotype-to-phenotype relationships, we go one step further: we start by creating latent features from the kernel matrix. Based on these latent features, higher-order kernel matrices can be constructed, which are then used as the covariance matrix for the random effect.

We now explain the model in more detail. The phenotype \boldsymbol{y} is modeled as a random effect model: given some latent variables \boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m},

𝒚|𝒁,𝒂\displaystyle\boldsymbol{y}|\boldsymbol{Z},\boldsymbol{a} 𝒩n(𝒁𝜷+𝒂,ϕ𝑰n)\displaystyle\sim\mathcal{N}_{n}\left(\boldsymbol{Z\beta}+\boldsymbol{a},\phi\boldsymbol{I}_{n}\right)
𝒂|𝒖1,,𝒖m\displaystyle\boldsymbol{a}|\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m} 𝒩n(𝟎,j=1Jτj𝑲j(𝑼)),\displaystyle\sim\mathcal{N}_{n}\left(\boldsymbol{0},\sum_{j=1}^{J}\tau_{j}\boldsymbol{K}_{j}(\boldsymbol{U})\right),

that is, the covariance matrix of the random effect \boldsymbol{a} is a positive combination of latent kernel matrices \boldsymbol{K}_{j}(\boldsymbol{U}) constructed from the latent variables \boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}, where we let \boldsymbol{U}=[\boldsymbol{u}_{1},\cdots,\boldsymbol{u}_{m}]\in\mathbb{R}^{n\times m}. Moreover, the latent variables \boldsymbol{u}_{i} are modeled as follows:

𝒖1,,𝒖m i.i.d. 𝒩n(𝟎,l=1Lξl𝑲l(𝑿)),\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}\sim\textrm{ i.i.d. }\mathcal{N}_{n}\left(\boldsymbol{0},\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})\right),

where n is the sample size; m is the number of hidden units in the network; and \boldsymbol{K}_{l}(\boldsymbol{X}),l=1,\ldots,L are kernel matrices constructed from the genetic variants. For instance, if we have p genetic variants (p can be greater than n), we can define \boldsymbol{K}(\boldsymbol{X})=p^{-1}\boldsymbol{XX}^{T}, which is known as the product kernel.
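For concreteness, a minimal sketch of how such a product kernel could be computed, assuming the genotypes are stored in an n\times p numpy array (the sizes and random genotypes below are purely illustrative):

```python
import numpy as np

def product_kernel(X):
    """Product (linear) kernel K(X) = X X^T / p for an n x p genotype matrix X."""
    n, p = X.shape
    return X @ X.T / p

# illustrative toy data: 100 subjects, 5000 SNPs coded 0/1/2 (not real genotypes)
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(100, 5000)).astype(float)
K = product_kernel(X)
print(K.shape)  # (100, 100)
```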

As an illustration, the basic hierarchical structure of the model is shown in Figure 1. Due to the similarity of its network structure to that of a conventional neural network, we call our model a kernel neural network (KNN). KNN has several nice features. First, by considering genetic effects as random effects, the method can simultaneously deal with millions of variants. This addresses the limitation of a conventional fixed-effect neural network, which is computationally prohibitive on such a large number of variables. Second, by using random genetic effects, the model complexity is also greatly reduced, as we no longer need to estimate a large number of fixed genetic effects. Third, KNN allows for a large number of hidden units without increasing model complexity. Finally, using hidden units, KNN can capture nonlinear and non-additive effects, and is therefore able to model complex functions beyond linear models.

Figure 1: An illustration of the hierarchical structure of the kernel neural network model.

In the remainder of this section and in Section 3, we focus on the scenario where there are no fixed effects, that is, \boldsymbol{\beta}=\boldsymbol{0}. In Section 4, we consider the general estimation procedure when \boldsymbol{\beta}\neq\boldsymbol{0} and also calculate the prediction error.

2.1 Quadratic Estimators for Variance Components

Popular estimation strategies for variance components in linear models are the maximum likelihood estimator (MLE) and the restricted maximum likelihood estimator (REML) (Corbeil and Searle, 1976). However, both methods depend on the marginal distribution of 𝒚\boldsymbol{y}. In our kernel neural network (KNN) model, it is generally difficult to obtain the marginal distribution of 𝒚\boldsymbol{y}, which involves high dimensional integration with respect to 𝒖1,,𝒖m\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}. Moreover, the 𝒖i\boldsymbol{u}_{i}’s are embedded in the kernel matrix 𝑲(𝑼)\boldsymbol{K}(\boldsymbol{U}), which makes the integration even more complicated.

On the other hand, given the model described above, the conditional distribution of \boldsymbol{y}|\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m} is available in closed form:

𝒚|𝒖1,,𝒖m𝒩n(𝟎,j=1Jτj𝑲j(𝑼)+ϕ𝑰n).\boldsymbol{y}|\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}\sim\mathcal{N}_{n}\left(\boldsymbol{0},\sum_{j=1}^{J}\tau_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\phi\boldsymbol{I}_{n}\right).

Then the marginal mean and variance of 𝒚\boldsymbol{y} can be obtained via conditioning arguments:

\mathbb{E}[\boldsymbol{y}] =\mathbb{E}\left(\mathbb{E}[\boldsymbol{y}|\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}]\right)=\boldsymbol{0}.
\textrm{Var}[\boldsymbol{y}] =\mathbb{E}\left[\textrm{Var}\left(\boldsymbol{y}|\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}\right)\right]+\textrm{Var}\left[\mathbb{E}\left(\boldsymbol{y}|\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}\right)\right]
=\mathbb{E}\left[\sum_{j=1}^{J}\tau_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\phi\boldsymbol{I}_{n}\right]
=\sum_{j=1}^{J}\tau_{j}\mathbb{E}[\boldsymbol{K}_{j}(\boldsymbol{U})]+\phi\boldsymbol{I}_{n}
:=\sum_{j=0}^{J}\tau_{j}\mathbb{E}\left[\boldsymbol{K}_{j}(\boldsymbol{U})\right],

where \tau_{0}=\phi and \boldsymbol{K}_{0}(\boldsymbol{U})=\boldsymbol{I}_{n}. Given the marginal mean and covariance matrix, the MInimum Norm Quadratic Unbiased Estimation (MINQUE) proposed by Rao (1970, 1971, 1972) can be used to estimate the variance components. The basic idea of MINQUE is to use a quadratic form \boldsymbol{y}^{T}\boldsymbol{\Theta}\boldsymbol{y} to estimate a linear combination of variance components. The MINQUE matrix \boldsymbol{\Theta} is obtained by minimizing a suitable matrix norm, typically the Frobenius norm, of the difference between \boldsymbol{\Theta} and the matrix that would appear in the quadratic estimator if the random components in the linear model were known. The constraint in the optimization problem guarantees the unbiasedness of the estimators. One advantage of MINQUE is that it has a closed-form solution, provided by Lemma 3.4 in Rao (1971), so it can be computed efficiently. However, MINQUE can also produce a negative estimate for a single variance component. When this occurs, we simply set the negative estimate to zero, as is commonly done for the MLE and REML of variance components, except for the error variance component. If the MINQUE estimate for the error variance component becomes negative, we project the MINQUE matrix \boldsymbol{\Theta} onto the positive semi-definite cone \mathcal{S}_{n}^{+}. Specifically, the modified estimate for the error component becomes

τ^0=𝒚T𝑶[max{λ1,0}max{λn,0}]𝑶T𝒚,\hat{\tau}_{0}=\boldsymbol{y}^{T}\boldsymbol{O}\begin{bmatrix}\max\{\lambda_{1},0\}&&\\ &\ddots&\\ &&\max\{\lambda_{n},0\}\end{bmatrix}\boldsymbol{O}^{T}\boldsymbol{y},

where \boldsymbol{O}\,\textrm{diag}\{\lambda_{1},\ldots,\lambda_{n}\}\boldsymbol{O}^{T} is the eigendecomposition of \boldsymbol{\Theta}.
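As a sketch, the projection of the MINQUE matrix onto \mathcal{S}_{n}^{+} and the resulting modified error-variance estimate could be computed as follows (the function names are ours, not from the paper):

```python
import numpy as np

def project_psd(Theta):
    """Project a symmetric matrix Theta onto the PSD cone S_n^+ by zeroing
    negative eigenvalues in its eigendecomposition Theta = O diag(lambda) O^T."""
    lam, O = np.linalg.eigh(Theta)
    return O @ np.diag(np.maximum(lam, 0.0)) @ O.T

def modified_error_variance(y, Theta):
    """Modified MINQUE estimate of the error variance: tau_0 = y^T P_{S_n^+}(Theta) y."""
    return float(y @ project_psd(Theta) @ y)
```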

2.2 MINQUE in KNN

For ease of theoretical justification, throughout the remainder of the paper we illustrate the method with a single kernel matrix of the form \boldsymbol{K}_{ij}(\boldsymbol{U})=f\left[\frac{1}{m}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}\right], where \boldsymbol{w}_{1},\ldots,\boldsymbol{w}_{n} are the rows of the matrix \boldsymbol{U}.

We start with the simplest case f(x)=x, in which the kernel matrix becomes \boldsymbol{K}(\boldsymbol{U})=\frac{1}{m}\boldsymbol{U}\boldsymbol{U}^{T} and the marginal variance of \boldsymbol{y} has an explicit form. Since \boldsymbol{U}\boldsymbol{U}^{T}\sim\mathcal{W}_{n}(m,\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})), we have

𝑽:=Var[𝒚]=τ𝔼[𝑲(𝑼)]+ϕ𝑰n=l=1Lτξl𝑲l(𝑿)+ϕ𝑰n.\boldsymbol{V}:=\textrm{Var}[\boldsymbol{y}]=\tau\mathbb{E}[\boldsymbol{K}(\boldsymbol{U})]+\phi\boldsymbol{I}_{n}=\sum_{l=1}^{L}\tau\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})+\phi\boldsymbol{I}_{n}. (2)

From equation (2), we can see that there is an identifiability issue if we directly estimate τ\tau and ξl\xi_{l}. To resolve this issue, we reparameterize the covariance matrix by letting θl=τξl\theta_{l}=\tau\xi_{l}, θ0=ϕ\theta_{0}=\phi, 𝑲0(𝑿)=𝑰n\boldsymbol{K}_{0}(\boldsymbol{X})=\boldsymbol{I}_{n} and rewrite Var[𝒚]\textrm{Var}[\boldsymbol{y}] as

𝑽=l=0Lθl𝑲l(𝑿)=l=0Lθl𝑺l(𝑿)𝑺lT(𝑿),\boldsymbol{V}=\sum_{l=0}^{L}\theta_{l}\boldsymbol{K}_{l}(\boldsymbol{X})=\sum_{l=0}^{L}\theta_{l}\boldsymbol{S}_{l}(\boldsymbol{X})\boldsymbol{S}_{l}^{T}(\boldsymbol{X}), (3)

where 𝑺0(𝑿),,𝑺L(𝑿)\boldsymbol{S}_{0}(\boldsymbol{X}),\ldots,\boldsymbol{S}_{L}(\boldsymbol{X}) are the Cholesky lower triangles for the kernel matrices 𝑲0(𝑿),,𝑲L(𝑿)\boldsymbol{K}_{0}(\boldsymbol{X}),\ldots,\boldsymbol{K}_{L}(\boldsymbol{X}), respectively. Then the parameters θ0,θ1,,θL\theta_{0},\theta_{1},\ldots,\theta_{L} can be estimated via MINQUE. Specifically

θ^0\displaystyle\hat{\theta}_{0} =𝒚T𝑨^0𝒚\displaystyle=\boldsymbol{y}^{T}\hat{\boldsymbol{A}}_{0}\boldsymbol{y}
θ^i\displaystyle\hat{\theta}_{i} =𝒚T𝑨i𝒚0,i=1,,L,\displaystyle=\boldsymbol{y}^{T}\boldsymbol{A}_{i}\boldsymbol{y}\vee 0,\quad i=1,\ldots,L,

where

\boldsymbol{A}_{i}=\sum_{l=0}^{L}\eta_{l}\left[\sum_{k=0}^{L}\boldsymbol{K}_{k}(\boldsymbol{X})\right]^{-1}\boldsymbol{K}_{l}(\boldsymbol{X})\left[\sum_{k=0}^{L}\boldsymbol{K}_{k}(\boldsymbol{X})\right]^{-1},\quad i=0,\ldots,L

and \eta_{0},\ldots,\eta_{L} are the solutions to \boldsymbol{\Gamma\eta}=\boldsymbol{e}_{i}, where \boldsymbol{e}_{i} is a vector of zeros except that the ith element is 1 and

𝚪ij=tr([l=0L𝑲l(𝑿)]1𝑲i(𝑿)[l=0L𝑲l(𝑿)]1𝑲j(𝑿)).\boldsymbol{\Gamma}_{ij}=\textrm{tr}\left(\left[\sum_{l=0}^{L}\boldsymbol{K}_{l}(\boldsymbol{X})\right]^{-1}\boldsymbol{K}_{i}(\boldsymbol{X})\left[\sum_{l=0}^{L}\boldsymbol{K}_{l}(\boldsymbol{X})\right]^{-1}\boldsymbol{K}_{j}(\boldsymbol{X})\right).
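A minimal sketch of this closed-form computation, assuming the kernel matrices (including \boldsymbol{K}_{0}(\boldsymbol{X})=\boldsymbol{I}_{n}) are supplied as numpy arrays and that their sum is invertible; the helper name is hypothetical:

```python
import numpy as np

def minque(y, kernels):
    """MINQUE estimates theta_i = y^T A_i y for V = sum_i theta_i K_i, following the
    closed form above; `kernels` = [K_0, ..., K_L] with K_0 = I_n, all numpy arrays."""
    S = np.linalg.inv(sum(kernels))                            # [sum_k K_k]^{-1}
    B = [S @ K @ S for K in kernels]                           # S K_l S for each l
    Gamma = np.array([[np.trace(Bi @ Kj) for Kj in kernels] for Bi in B])
    theta = np.empty(len(kernels))
    for i in range(len(kernels)):
        eta = np.linalg.solve(Gamma, np.eye(len(kernels))[i])  # Gamma eta = e_i
        A_i = sum(e * Bl for e, Bl in zip(eta, B))
        theta[i] = y @ A_i @ y
    theta[1:] = np.maximum(theta[1:], 0.0)   # truncate negative non-error components at zero
    # if theta[0] < 0, A_0 would instead be projected onto S_n^+ as described above
    return theta
```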

Moreover, \hat{\boldsymbol{A}}_{0}=P_{\mathcal{S}_{n}^{+}}\boldsymbol{A}_{0}, the projection mentioned above. We now consider a general kernel matrix of the form \boldsymbol{K}(\boldsymbol{U})=f\left[\frac{1}{m}\boldsymbol{U}\boldsymbol{U}^{T}\right], where f[\boldsymbol{B}] means that the map f:\mathbb{R}\to\mathbb{R} is applied elementwise to the matrix \boldsymbol{B}, and f is a function on \mathbb{R} satisfying the following property, which we call the Generalized Linear Separable Condition:

f(α=1kcαxα)=α=1kgα(c1,,ck)hα(x1,,xk),f\left(\sum_{\alpha=1}^{k}c_{\alpha}x_{\alpha}\right)=\sum_{\alpha=1}^{k^{\prime}}g_{\alpha}(c_{1},\ldots,c_{k})h_{\alpha}(x_{1},\ldots,x_{k}), (4)

where c_{1},\ldots,c_{k}\in\mathbb{R} are coefficients and g_{1},\ldots,g_{k^{\prime}},h_{1},\ldots,h_{k^{\prime}} are some functions. Polynomial kernels are examples of kernel functions satisfying this condition. We know that

𝑲st(𝑼)=f(𝒘sT𝒘tm),s,t=1,,n\boldsymbol{K}_{st}(\boldsymbol{U})=f\left(\frac{\boldsymbol{w}_{s}^{T}\boldsymbol{w}_{t}}{m}\right),\quad s,t=1,\ldots,n (5)

and

[𝒘i1𝒘j1],,[𝒘im𝒘jm]i.i.d. 𝒩2([00],[σiiσijσijσjj]),\begin{bmatrix}\boldsymbol{w}_{i1}\\ \boldsymbol{w}_{j1}\end{bmatrix},\ldots,\begin{bmatrix}\boldsymbol{w}_{im}\\ \boldsymbol{w}_{jm}\end{bmatrix}\sim\textrm{i.i.d. }\mathcal{N}_{2}\left(\begin{bmatrix}0\\ 0\end{bmatrix},\begin{bmatrix}\sigma_{ii}&\sigma_{ij}\\ \sigma_{ij}&\sigma_{jj}\end{bmatrix}\right), (6)

where σii,σjj,σij\sigma_{ii},\sigma_{jj},\sigma_{ij} are the corresponding elements in 𝚺=l=1Lξl𝑲l(𝑿)\boldsymbol{\Sigma}=\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X}). We can then use Taylor expansion to obtain an approximation of 𝔼[𝑲ij(𝑼)]\mathbb{E}[\boldsymbol{K}_{ij}(\boldsymbol{U})], which is given in Lemma 1. The proof of Lemma 1 can be found in Appendix B.

Lemma 1.

Let the random vectors 𝒘i,𝒘jm\boldsymbol{w}_{i},\boldsymbol{w}_{j}\in\mathbb{R}^{m} be as in (6) and the 𝑲ij(𝑼)\boldsymbol{K}_{ij}(\boldsymbol{U}) as defined in (5). Then if

|f′′(λσij+(1λ)𝒘iT𝒘jm)|M,a.s.,\left|f^{\prime\prime}\left(\lambda\sigma_{ij}+(1-\lambda)\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}\right)\right|\leq M,\quad\textrm{a.s.}, (7)

for some M>0M>0 and all λ[0,1]\lambda\in[0,1], we have

(|𝑲ij(𝑼)𝑲^ij(𝑼)|>δ)\displaystyle\mathbb{P}\left(\left|\boldsymbol{K}_{ij}(\boldsymbol{U})-\hat{\boldsymbol{K}}_{ij}(\boldsymbol{U})\right|>\delta\right) 4exp{m(1δ20MsijδMσij214|σij|2δM)},\displaystyle\leq 4\exp\left\{-m\left(1\wedge\frac{\delta}{20Ms_{ij}}\wedge\frac{\delta}{M\sigma_{ij}^{2}}\wedge\frac{1}{4|\sigma_{ij}|}\sqrt{\frac{2\delta}{M}}\right)\right\},

where 𝑲^ij(𝑼)=f(σij)+f(σij)(𝒘iT𝒘jmσij).\hat{\boldsymbol{K}}_{ij}(\boldsymbol{U})=f(\sigma_{ij})+f^{\prime}(\sigma_{ij})\left(\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}-\sigma_{ij}\right).

Lemma 1 and Remark 2 in Appendix B show that when we use \hat{\boldsymbol{K}}(\boldsymbol{U})=[\hat{\boldsymbol{K}}_{ij}(\boldsymbol{U})] to approximate \boldsymbol{K}(\boldsymbol{U}) and the number of hidden features \boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m} is large enough, the approximation error will be sufficiently small. Hence, we can write \boldsymbol{K}(\boldsymbol{U}) as follows:

𝑲(𝑼)=𝑲^(𝑼)+oP(1)=f[𝚺]+f[Σ](1m𝑼𝑼T𝚺)+oP(1),\boldsymbol{K}(\boldsymbol{U})=\hat{\boldsymbol{K}}(\boldsymbol{U})+o_{P}(1)=f[\boldsymbol{\Sigma}]+f^{\prime}[\Sigma]\odot\left(\frac{1}{m}\boldsymbol{U}\boldsymbol{U}^{T}-\boldsymbol{\Sigma}\right)+o_{P}(1), (8)

where \odot means the Hadamard product of two matrices. Moreover, the Strong Law of Large Numbers implies 1m𝑼𝑼T𝚺\frac{1}{m}\boldsymbol{U}\boldsymbol{U}^{T}\to\boldsymbol{\Sigma} a.s. Therefore, equation (8) can be further written as

𝑲(𝑼)=f[𝚺]+oP(1),\boldsymbol{K}(\boldsymbol{U})=f[\boldsymbol{\Sigma}]+o_{P}(1),

i.e., \boldsymbol{K}(\boldsymbol{U})\xrightarrow{P}f[\boldsymbol{\Sigma}] elementwise as m\to\infty. By the version of the Dominated Convergence Theorem for convergence in probability, the following lemma can be easily shown.

Lemma 2.

Under the assumptions of Lemma 1, if f′′(ηij)(𝒘iT𝒘jmσij)2L1()f^{\prime\prime}\left(\eta_{ij}\right)\left(\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}-\sigma_{ij}\right)^{2}\in L^{1}(\mathbb{P}), then

𝔼[12f′′(ηij)(𝒘iT𝒘jmσij)2]=o(1),\mathbb{E}\left[\frac{1}{2}f^{\prime\prime}\left(\eta_{ij}\right)\left(\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}-\sigma_{ij}\right)^{2}\right]=o(1),

where ηij=λσij+(1λ)𝒘iT𝒘jm\eta_{ij}=\lambda\sigma_{ij}+(1-\lambda)\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m} for some λ[0,1]\lambda\in[0,1].
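As a rough numerical illustration (not part of the paper) of the approximation \boldsymbol{K}(\boldsymbol{U})\approx f[\boldsymbol{\Sigma}] discussed above, one can simulate the latent variables for a moderately large m and compare the two matrices; the choices of n, m, \boldsymbol{\Sigma} and f below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 50000
A = rng.standard_normal((n, n))
Sigma = A @ A.T / n + np.eye(n)                             # an arbitrary positive definite Sigma
U = rng.multivariate_normal(np.zeros(n), Sigma, size=m).T   # n x m, columns u_1,...,u_m
f = lambda x: (1.0 + x) ** 2                                # an elementwise kernel function
K_U = f(U @ U.T / m)                                        # K(U) = f[(1/m) U U^T]
print(np.abs(K_U - f(Sigma)).max())                         # shrinks as m grows
```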

Based on Lemma 2, the marginal variance-covariance matrix 𝑽\boldsymbol{V} can be written as

𝑽\displaystyle\boldsymbol{V} =τ𝔼[𝑲(𝑼)]+ϕ𝑰n\displaystyle=\tau\mathbb{E}\left[\boldsymbol{K}(\boldsymbol{U})\right]+\phi\boldsymbol{I}_{n}
τ𝔼[𝑲^(𝑼)]+ϕ𝑰n\displaystyle\simeq\tau\mathbb{E}\left[\hat{\boldsymbol{K}}(\boldsymbol{U})\right]+\phi\boldsymbol{I}_{n}
τf[𝚺]+ϕ𝑰n\displaystyle\simeq\tau f[\boldsymbol{\Sigma}]+\phi\boldsymbol{I}_{n}
=τl=1Lgl(ξ1,,ξL)hl[𝑲1(𝑿),,𝑲L(𝑿)]+ϕ𝑰n\displaystyle=\tau\sum_{l=1}^{L^{\prime}}g_{l}(\xi_{1},\ldots,\xi_{L})h_{l}\left[\boldsymbol{K}_{1}(\boldsymbol{X}),\ldots,\boldsymbol{K}_{L}(\boldsymbol{X})\right]+\phi\boldsymbol{I}_{n}
=l=0Lθl𝑺l(𝑿)𝑺lT(𝑿),\displaystyle=\sum_{l=0}^{L^{\prime}}\theta_{l}\boldsymbol{S}_{l}(\boldsymbol{X})\boldsymbol{S}_{l}^{T}(\boldsymbol{X}),

where \theta_{0}=\phi, \theta_{l}=\tau g_{l}(\xi_{1},\ldots,\xi_{L}), l=1,\ldots,L^{\prime}, \boldsymbol{S}_{0}(\boldsymbol{X})=\boldsymbol{I}_{n}, and \boldsymbol{S}_{l}(\boldsymbol{X}) is the Cholesky lower triangle of the matrix h_{l}\left[\boldsymbol{K}_{1}(\boldsymbol{X}),\ldots,\boldsymbol{K}_{L}(\boldsymbol{X})\right], l=1,\ldots,L^{\prime}. As an example, consider f(x)=(1+x)^{2}, which corresponds to a polynomial output kernel. Then f[\boldsymbol{\Sigma}]=(\boldsymbol{J}+\xi_{1}\frac{1}{p}\boldsymbol{XX}^{T})^{\circ 2}, where the symbol \circ 2 denotes the elementwise (Hadamard) square. In this case, L=1 and \boldsymbol{K}_{1}(\boldsymbol{X})=p^{-1}\boldsymbol{XX}^{T}. Then note that

f[\boldsymbol{\Sigma}] =\boldsymbol{J}+\frac{2\xi_{1}}{p}\boldsymbol{J}\odot\boldsymbol{XX}^{T}+\frac{\xi_{1}^{2}}{p^{2}}(\boldsymbol{XX}^{T})^{\circ 2}
=\boldsymbol{J}+2\xi_{1}\frac{1}{p}\boldsymbol{XX}^{T}+\xi_{1}^{2}\frac{1}{p^{2}}(\boldsymbol{XX}^{T})^{\circ 2}.

This shows that for a polynomial output kernel and one input product kernel, L^{\prime}=3 with g_{1}(\xi_{1})=1,g_{2}(\xi_{1})=2\xi_{1},g_{3}(\xi_{1})=\xi_{1}^{2} and h_{1}[\boldsymbol{K}_{1}(\boldsymbol{X})]=\boldsymbol{J},h_{2}[\boldsymbol{K}_{1}(\boldsymbol{X})]=\boldsymbol{K}_{1}(\boldsymbol{X})=p^{-1}\boldsymbol{XX}^{T},h_{3}[\boldsymbol{K}_{1}(\boldsymbol{X})]=p^{-2}(\boldsymbol{XX}^{T})^{\circ 2}. The parameters \theta_{0},\ldots,\theta_{L^{\prime}} can be estimated via MINQUE as well. Based on the above discussion, we can see that the estimation of the variance components in KNN through MINQUE is an approximation: in essence, we use a complex linear mixed model to approximate the KNN.
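A small sketch of how the component matrices h_{l} for this polynomial output kernel could be assembled (assuming \boldsymbol{K}_{1}(\boldsymbol{X}) is already available as a numpy array):

```python
import numpy as np

def poly_output_components(K1):
    """Component matrices h_l for f(x) = (1 + x)^2 with a single product input kernel:
    f[Sigma] = J + 2*xi_1*K1 + xi_1^2*(K1 o K1), so h_1 = J, h_2 = K1, h_3 = K1 o K1."""
    n = K1.shape[0]
    J = np.ones((n, n))            # all-ones matrix
    return [J, K1, K1 * K1]        # '*' is the elementwise (Hadamard) product

# together with K_0 = I_n, these matrices play the role of the kernels passed to MINQUE
```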

3 Predictions

In this section, we make a theoretical comparison of prediction performance between KNN and LMM. Based on our model, the best predictor for 𝒂\boldsymbol{a} is given by

𝒚^\displaystyle\hat{\boldsymbol{y}} =𝔼[𝒂|𝒚]=𝔼[𝔼(𝒂|𝒚,𝒖1,,𝒖m)]\displaystyle=\mathbb{E}\left[\boldsymbol{a}|\boldsymbol{y}\right]=\mathbb{E}\left[\mathbb{E}\left(\boldsymbol{a}|\boldsymbol{y},\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}\right)\right]
=𝔼[(j=1Jτj𝑲j(𝑼))(j=1Jτj𝑲j(𝑼)+ϕ𝑰n)1]𝒚\displaystyle=\mathbb{E}\left[\left(\sum_{j=1}^{J}\tau_{j}\boldsymbol{K}_{j}(\boldsymbol{U})\right)\left(\sum_{j=1}^{J}\tau_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\phi\boldsymbol{I}_{n}\right)^{-1}\right]\boldsymbol{y}
=𝔼[(j=1Jϕ1τj𝑲j(𝑼))(j=1Jϕ1τj𝑲j(𝑼)+𝑰n)1]𝒚\displaystyle=\mathbb{E}\left[\left(\sum_{j=1}^{J}\phi^{-1}\tau_{j}\boldsymbol{K}_{j}(\boldsymbol{U})\right)\left(\sum_{j=1}^{J}\phi^{-1}\tau_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]\boldsymbol{y}
:=𝔼[(j=1Jτ~j𝑲j(𝑼))(j=1Jτ~j𝑲j(𝑼)+𝑰n)1]𝒚,\displaystyle:=\mathbb{E}\left[\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})\right)\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]\boldsymbol{y},

where \tilde{\tau}_{j}=\tau_{j}\phi^{-1},j=1,\ldots,J. The prediction error based on \hat{\boldsymbol{y}} is given by

R\displaystyle R =(𝒚𝒚^)T(𝒚𝒚^)\displaystyle=\left(\boldsymbol{y}-\hat{\boldsymbol{y}}\right)^{T}\left(\boldsymbol{y}-\hat{\boldsymbol{y}}\right)
=𝒚T(𝑰n𝔼[(j=1Jτ~j𝑲j(𝑼))(j=1Jτ~j𝑲j(𝑼)+𝑰n)1])T\displaystyle=\boldsymbol{y}^{T}\left(\boldsymbol{I}_{n}-\mathbb{E}\left[\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})\right)\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]\right)^{T}
(𝑰n𝔼[(j=1Jτ~j𝑲j(𝑼))(j=1Jτ~j𝑲j(𝑼)+𝑰n)1])𝒚.\displaystyle\hskip 85.35826pt\left(\boldsymbol{I}_{n}-\mathbb{E}\left[\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})\right)\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]\right)\boldsymbol{y}.

Note that

𝑰n𝔼[(j=1Jτ~j𝑲j(𝑼))(j=1Jτ~j𝑲j(𝑼)+𝑰n)1]\displaystyle\boldsymbol{I}_{n}-\mathbb{E}\left[\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})\right)\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]
=\displaystyle= 𝔼[(j=1Jτ~j𝑲j(𝑼)+𝑰nj=1Jτ~j𝑲j(𝑼))(j=1Jτ~j𝑲j(𝑼)+𝑰n)1]\displaystyle\mathbb{E}\left[\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}-\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})\right)\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]
=\displaystyle= 𝔼[(j=1Jτ~j𝑲j(𝑼)+𝑰n)1],\displaystyle\mathbb{E}\left[\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right],

we have

R=𝒚T(𝔼[(j=1Jτ~j𝑲j(𝑼)+𝑰n)1])2𝒚.R=\boldsymbol{y}^{T}\left(\mathbb{E}\left[\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]\right)^{2}\boldsymbol{y}.

Direct evaluation of the prediction error R is complicated. Instead, we approximate it based on asymptotic results. As above, we focus on the case where J=1 and \boldsymbol{K}(\boldsymbol{U})=f\left[\frac{1}{m}\boldsymbol{UU}^{T}\right]. The proof of the following lemma can be found in Appendix B.

Lemma 3 (Approximation of Prediction Error).
(i) When f(x)=x, then as m\to\infty,

R\simeq\boldsymbol{y}^{T}\left(\sum_{l=1}^{L}\tilde{\tau}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})+\boldsymbol{I}_{n}\right)^{-2}\boldsymbol{y}.

(ii) When f is continuous and f[\boldsymbol{\Sigma}]\in\mathcal{S}_{+}^{n}, then as m\to\infty,

R\simeq\boldsymbol{y}^{T}\left(\tilde{\tau}f\left[\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})\right]+\boldsymbol{I}_{n}\right)^{-2}\boldsymbol{y}.
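The approximation in Lemma 3(ii) suggests a simple plug-in computation of the predictor and its prediction error; the sketch below assumes estimates of \tilde{\tau} and the \xi_{l} are available and applies f elementwise (function and argument names are ours):

```python
import numpy as np

def knn_predict_and_risk(y, kernels, xi, tau_tilde, f):
    """Approximate best predictor and prediction error under Lemma 3(ii):
    y_hat ~ tau_tilde*f[Sigma](tau_tilde*f[Sigma]+I)^{-1} y and
    R ~ y^T (tau_tilde*f[Sigma]+I)^{-2} y, with Sigma = sum_l xi_l K_l(X)."""
    n = len(y)
    Sigma = sum(x * K for x, K in zip(xi, kernels))
    Minv = np.linalg.inv(tau_tilde * f(Sigma) + np.eye(n))
    y_hat = (np.eye(n) - Minv) @ y      # tau*f[Sigma](tau*f[Sigma]+I)^{-1} = I - Minv
    R = float(y @ Minv @ Minv @ y)
    return y_hat, R
```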

Now we compare the average prediction error of the kernel neural network with that of the linear mixed model. For a linear mixed model, the prediction error using the best predictor can be obtained as follows; the proof can be found in Appendix B.

Proposition 1 (Prediction Error for a Linear Mixed Model).

Consider the linear mixed effect model

𝒚\displaystyle\boldsymbol{y} =𝒂+ϵ;\displaystyle=\boldsymbol{a}+\boldsymbol{\epsilon};
𝒂\displaystyle\boldsymbol{a} 𝒩n(𝟎,σR2𝚺);\displaystyle\sim\mathcal{N}_{n}\left(\boldsymbol{0},\sigma_{R}^{2}\boldsymbol{\Sigma}\right);
ϵ\displaystyle\boldsymbol{\epsilon} 𝒩n(𝟎,ϕ𝑰n).\displaystyle\sim\mathcal{N}_{n}\left(\boldsymbol{0},\phi\boldsymbol{I}_{n}\right).

The prediction error based on quadratic loss and the best predictor 𝒚^=𝔼[𝒂|𝒚]=σ~R2𝚺(σ~R2𝚺+𝑰n)1𝒚\hat{\boldsymbol{y}}=\mathbb{E}[\boldsymbol{a}|\boldsymbol{y}]=\tilde{\sigma}_{R}^{2}\boldsymbol{\Sigma}\left(\tilde{\sigma}_{R}^{2}\boldsymbol{\Sigma}+\boldsymbol{I}_{n}\right)^{-1}\boldsymbol{y} (σ~R2=σR2ϕ1)(\tilde{\sigma}_{R}^{2}=\sigma_{R}^{2}\phi^{-1}) is given by

PELMM=ϕi=1n(σ~R2λi(𝚺)+1)1,PE_{LMM}=\phi\sum_{i=1}^{n}\left(\tilde{\sigma}_{R}^{2}\lambda_{i}(\boldsymbol{\Sigma})+1\right)^{-1},

where PE_{LMM} is the average prediction error for the linear mixed model.
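For reference, PE_{LMM} could be evaluated directly from the eigenvalues of \boldsymbol{\Sigma}, e.g. as in the following sketch:

```python
import numpy as np

def pe_lmm(Sigma, sigma_R2, phi):
    """Average prediction error of the LMM best predictor (Proposition 1):
    PE_LMM = phi * sum_i 1 / (sigma_R2/phi * lambda_i(Sigma) + 1)."""
    lam = np.linalg.eigvalsh(Sigma)
    return phi * np.sum(1.0 / ((sigma_R2 / phi) * lam + 1.0))
```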

Proposition 2.

Assuming that the error variance \sigma^{2} of the linear mixed model equals \phi and that \tilde{\sigma}_{R}^{2}\leq\tilde{\tau}\min_{1\leq l\leq L}\xi_{l}, we have

PEKNNPELMM,PE_{KNN}\precsim PE_{LMM},

where PEKNNPE_{KNN} stands for average prediction error for kernel neural network.

Proof.

For the kernel neural network, the average prediction error is given by

PEKNN\displaystyle PE_{KNN} =𝔼[𝒚T𝑨2𝒚]=𝔼[𝔼(𝒚T𝑨2𝒚|𝒖1,,𝒖m)]\displaystyle=\mathbb{E}\left[\boldsymbol{y}^{T}\boldsymbol{A}^{2}\boldsymbol{y}\right]=\mathbb{E}\left[\mathbb{E}\left(\boldsymbol{y}^{T}\boldsymbol{A}^{2}\boldsymbol{y}|\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}\right)\right]
=ϕ𝔼[tr(𝑨2(τ~𝑲(𝑼)+𝑰n))]\displaystyle=\phi\mathbb{E}\left[\textrm{tr}\left(\boldsymbol{A}^{2}\left(\tilde{\tau}\boldsymbol{K}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)\right)\right]
=ϕtr{(𝔼[(τ~𝑲(𝑼)+𝑰n)1])2𝔼[τ~𝑲(𝑼)+𝑰n]}\displaystyle=\phi\textrm{tr}\left\{\left(\mathbb{E}\left[\left(\tilde{\tau}\boldsymbol{K}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]\right)^{2}\mathbb{E}\left[\tilde{\tau}\boldsymbol{K}(\boldsymbol{U})+\boldsymbol{I}_{n}\right]\right\}
ϕtr{(τ~l=1Lξl𝑲l(𝑿)+𝑰n)2(τ~l=1Lξl𝑲l(𝑿)+𝑰n)}\displaystyle\simeq\phi\textrm{tr}\left\{\left(\tilde{\tau}\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})+\boldsymbol{I}_{n}\right)^{-2}\left(\tilde{\tau}\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})+\boldsymbol{I}_{n}\right)\right\}
=ϕtr{(τ~l=1Lξl𝑲l(𝑿)+𝑰n)1}\displaystyle=\phi\textrm{tr}\left\{\left(\tilde{\tau}\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})+\boldsymbol{I}_{n}\right)^{-1}\right\}
ϕi=1n(τ~min1lLξlλi(l=1L𝑲l(𝑿))+1)1\displaystyle\leq\phi\sum_{i=1}^{n}\left(\tilde{\tau}\min_{1\leq l\leq L}\xi_{l}\lambda_{i}\left(\sum_{l=1}^{L}\boldsymbol{K}_{l}(\boldsymbol{X})\right)+1\right)^{-1}

Under the assumptions in this proposition and the linear mixed model with 𝚺=l=1L𝑲l(𝑿)\boldsymbol{\Sigma}=\sum_{l=1}^{L}\boldsymbol{K}_{l}(\boldsymbol{X}), we have

\frac{\phi\left(\tilde{\tau}\min_{1\leq l\leq L}\xi_{l}\lambda_{i}(\boldsymbol{\Sigma})+1\right)^{-1}}{\sigma^{2}\left(\tilde{\sigma}_{R}^{2}\lambda_{i}(\boldsymbol{\Sigma})+1\right)^{-1}}=\frac{\phi}{\sigma^{2}}\frac{\tilde{\sigma}_{R}^{2}\lambda_{i}(\boldsymbol{\Sigma})+1}{\tilde{\tau}\min_{1\leq l\leq L}\xi_{l}\lambda_{i}(\boldsymbol{\Sigma})+1}\leq 1,

which implies that PEKNNPELMMPE_{KNN}\precsim PE_{LMM}. ∎

Remark 1.

The result of Proposition 2 can be illustrated using Figure 2. As shown in Figure 2 for the case L=1, there are two paths from the kernel matrix based on \boldsymbol{X} to the response \boldsymbol{y}: the kernel neural network path (solid line) and the linear mixed model path (dash-dotted line). The intuition behind the assumption \tilde{\sigma}_{R}^{2}\leq\tilde{\tau}\xi is that the kernel neural network should explain more variation than the linear mixed model, as its path from the data to the response consists of two stages.

Figure 2: The intuition under the assumption σ~R2τ~ξ\tilde{\sigma}_{R}^{2}\leq\tilde{\tau}\xi in Proposition 2.

We then extend the result to 𝑲(𝑼)=f[1m𝑼𝑼T]\boldsymbol{K}(\boldsymbol{U})=f\left[\frac{1}{m}\boldsymbol{UU}^{T}\right], where ff is as described in Lemma 3(ii).

Proposition 3.

Under the above notation, assume that \sigma^{2}=\phi, \tilde{\sigma}_{R}^{2}\leq\tilde{\tau}\min_{1\leq l\leq L}\xi_{l} and \lambda_{1}\left((f-\iota)\left[\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})\right]\right)\geq 0, with |f^{\prime\prime}(x)|\leq M for some M>0 and all x between \min_{i,j}\sigma_{ij} and \max_{i,j}\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}. Then we have

PEKNNPELMM,PE_{KNN}\precsim PE_{LMM},

where λ1(𝚺)\lambda_{1}(\boldsymbol{\Sigma}) is the smallest eigenvalue of the matrix 𝚺\boldsymbol{\Sigma}.

Proof.

Note that

PEKNN\displaystyle PE_{KNN} =ϕtr{(𝔼[(τ~𝑲(𝑼)+𝑰n)1])2𝔼[τ~𝑲(𝑼)+𝑰n]}\displaystyle=\phi\textrm{tr}\left\{\left(\mathbb{E}\left[\left(\tilde{\tau}\boldsymbol{K}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]\right)^{2}\mathbb{E}\left[\tilde{\tau}\boldsymbol{K}(\boldsymbol{U})+\boldsymbol{I}_{n}\right]\right\}
ϕtr{(τ~f[l=1Lξl𝑲l(𝑿)]+𝑰n)1}\displaystyle\simeq\phi\textrm{tr}\left\{\left(\tilde{\tau}f\left[\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})\right]+\boldsymbol{I}_{n}\right)^{-1}\right\}
=ϕi=1n1τ~λi(f[l=1Lξl𝑲l(𝑿)])+1\displaystyle=\phi\sum_{i=1}^{n}\frac{1}{\tilde{\tau}\lambda_{i}\left(f\left[\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})\right]\right)+1}
ϕi=1n1τ~λi((fι)[l=1Lξl𝑲l(𝑿)]+min1lLξll=1L𝑲l(𝑿))+1,\displaystyle\leq\phi\sum_{i=1}^{n}\frac{1}{\tilde{\tau}\lambda_{i}\left((f-\iota)\left[\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})\right]+\min_{1\leq l\leq L}\xi_{l}\sum_{l=1}^{L}\boldsymbol{K}_{l}(\boldsymbol{X})\right)+1},

where ι:𝚺𝚺\iota:\boldsymbol{\Sigma}\mapsto\boldsymbol{\Sigma} is the identity map. Corollary 4.3.15 in Horn and Johnson (2012) implies that

PE_{KNN}\precsim\phi\sum_{i=1}^{n}\frac{1}{\tilde{\tau}\min_{1\leq l\leq L}\xi_{l}\,\lambda_{i}\left(\sum_{l=1}^{L}\boldsymbol{K}_{l}(\boldsymbol{X})\right)+\tilde{\tau}\lambda_{1}\left((f-\iota)\left[\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})\right]\right)+1}.

Since \lambda_{1}\left((f-\iota)\left[\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})\right]\right)\geq 0 and \tilde{\sigma}_{R}^{2}\leq\tilde{\tau}\min_{1\leq l\leq L}\xi_{l}, the right-hand side is bounded above by PE_{LMM} with \boldsymbol{\Sigma}=\sum_{l=1}^{L}\boldsymbol{K}_{l}(\boldsymbol{X}), as in the proof of Proposition 2, which gives PE_{KNN}\precsim PE_{LMM}. ∎

Corollary 1.

If f[l=1Lξl𝑲l(𝑿)]l=1Lξl𝑲l(𝑿)f\left[\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})\right]-\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X}) is positive semidefinite, then

PE_{KNN}\precsim PE_{LMM}.
Example 1 (Polynomial Kernels).

For a polynomial kernel of degree dd, i.e., 𝑲ij(𝑼)=(c+𝒘iT𝒘jm)d\boldsymbol{K}_{ij}(\boldsymbol{U})=\left(c+\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}\right)^{d}, we have f(x)=(c+x)d=k=0d(dk)cdkxkf(x)=(c+x)^{d}=\sum_{k=0}^{d}\binom{d}{k}c^{d-k}x^{k} so that

(fι)(x)=cd+(dcd11)x+k=2d(dk)cdkxk.(f-\iota)(x)=c^{d}+(dc^{d-1}-1)x+\sum_{k=2}^{d}\binom{d}{k}c^{d-k}x^{k}.

Theorem 4.1 in Hiai (2009) states that a real function on (-\alpha,\alpha), 0<\alpha\leq\infty, is Schur positive (a real function f on (-\alpha,\alpha) is Schur-positive of order n, n\in\mathbb{N}, if f[\boldsymbol{A}] is positive semidefinite for all positive semidefinite \boldsymbol{A}\in\mathcal{M}_{n}(\mathbb{R}) with entries in (-\alpha,\alpha)) if and only if it is analytic and f^{(k)}(0)\geq 0 for all k\geq 0. Since f-\iota is a polynomial function, it is clearly analytic. We can then expand f(x) using a Taylor expansion around 0 and obtain

(dk)cdk=f(k)(0)k!f(k)(0)=d!(dk)!cdk,k=0,,d.\binom{d}{k}c^{d-k}=\frac{f^{(k)}(0)}{k!}\Rightarrow f^{(k)}(0)=\frac{d!}{(d-k)!}c^{d-k},\quad k=0,\ldots,d.

Hence, we have

(f-\iota)^{(k)}(0)=\left\{\begin{array}[]{ll}dc^{d-1}-1&\textrm{if }k=1\\ \frac{d!}{(d-k)!}c^{d-k}&\textrm{if }k\in\{0,1,\ldots,d\}\setminus\{1\}\\ 0&\textrm{if }k\geq d+1\end{array}\right..

To make fιf-\iota Schur positive, we only need to require c1dd10c\geq\sqrt[d-1]{\frac{1}{d}}\geq 0 so that the minimum eigenvalue condition of Proposition 3 holds.

4 Including Fixed Effects

In the previous sections, we focused on the case where the marginal distribution of the response variable \boldsymbol{y} has mean \boldsymbol{0}. In many applications, such as the ADNI real data application in Section 6, there are important covariates that may have large effects on the response variable. In this section, we extend the proposed KNN model to take covariates into account. As mentioned in Section 2, the general structure of KNN is

𝒚|𝒁,𝒂\displaystyle\boldsymbol{y}|\boldsymbol{Z},\boldsymbol{a} 𝒩n(𝒁𝜷+𝒂,ϕ𝑰n)\displaystyle\sim\mathcal{N}_{n}\left(\boldsymbol{Z\beta}+\boldsymbol{a},\phi\boldsymbol{I}_{n}\right)
𝒂|𝒖1,,𝒖m\displaystyle\boldsymbol{a}|\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m} 𝒩n(𝟎,j=1Jτj𝑲j(𝑼))\displaystyle\sim\mathcal{N}_{n}\left(\boldsymbol{0},\sum_{j=1}^{J}\tau_{j}\boldsymbol{K}_{j}(\boldsymbol{U})\right) (9)
𝒖1,,𝒖m\displaystyle\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}  i.i.d. 𝒩n(𝟎,l=1Lξl𝑲l(𝑿)).\displaystyle\sim\textrm{ i.i.d. }\mathcal{N}_{n}\left(\boldsymbol{0},\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})\right).

Similar to the situation considered before, given the latent variables 𝒖1,,𝒖m\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}, we have

𝒚|𝒁,𝒖1,,𝒖m𝒩n(𝒁𝜷,j=1Jτj𝑲j(𝑼)+ϕ𝑰n)\boldsymbol{y}|\boldsymbol{Z},\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}\sim\mathcal{N}_{n}\left(\boldsymbol{Z\beta},\sum_{j=1}^{J}\tau_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\phi\boldsymbol{I}_{n}\right)

and then the marginal mean and covariance matrix of 𝒚\boldsymbol{y} can be obtained as follow:

𝔼[𝒚|𝒁]\displaystyle\mathbb{E}\left[\boldsymbol{y}|\boldsymbol{Z}\right] =𝔼[𝔼[𝒚|𝒁,𝒖1,,𝒖m]]=𝔼[𝒁𝜷]=𝒁𝜷\displaystyle=\mathbb{E}\left[\mathbb{E}\left[\boldsymbol{y}|\boldsymbol{Z},\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}\right]\right]=\mathbb{E}\left[\boldsymbol{Z\beta}\right]=\boldsymbol{Z\beta}
\textrm{Var}\left[\boldsymbol{y}|\boldsymbol{Z}\right] =\textrm{Var}\left[\mathbb{E}\left[\boldsymbol{y}|\boldsymbol{Z},\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}\right]\right]+\mathbb{E}\left[\textrm{Var}\left[\boldsymbol{y}|\boldsymbol{Z},\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}\right]\right]
=Var[𝒁𝜷]+𝔼[j=1Jτj𝑲j(𝑼)+ϕ𝑰n]\displaystyle=\textrm{Var}[\boldsymbol{Z\beta}]+\mathbb{E}\left[\sum_{j=1}^{J}\tau_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\phi\boldsymbol{I}_{n}\right]
=j=0Jτj𝔼[𝑲j(𝑼)],\displaystyle=\sum_{j=0}^{J}\tau_{j}\mathbb{E}\left[\boldsymbol{K}_{j}(\boldsymbol{U})\right],

where \tau_{0}=\phi and \boldsymbol{K}_{0}(\boldsymbol{U})=\boldsymbol{I}_{n}. Again, we focus on the case where J=1 and \boldsymbol{K}(\boldsymbol{U}) is of the form f\left[\frac{1}{m}\boldsymbol{UU}^{T}\right] with f satisfying the generalized linear separable condition, so that after suitable reparameterization the variance components become estimable. A natural way to obtain the estimators for \boldsymbol{\beta} and the variance components \boldsymbol{\theta} is to first obtain a "good" estimate \hat{\boldsymbol{\theta}} of the variance components and then plug it into the Aitken equation to obtain the estimate \hat{\boldsymbol{\beta}} of the fixed-effect parameters. Kackar and Harville (1981) showed that as long as \hat{\boldsymbol{\theta}} is even and translation-invariant, \boldsymbol{p}^{T}\hat{\boldsymbol{\beta}}+\boldsymbol{q}^{T}\hat{\boldsymbol{a}} is an unbiased predictor of the quantity \boldsymbol{p}^{T}\boldsymbol{\beta}+\boldsymbol{q}^{T}\boldsymbol{a}.

Let \boldsymbol{R} be an r\times n matrix with r=n-\textrm{rank}(\boldsymbol{Z}) such that \boldsymbol{RZ}=\boldsymbol{O} and \boldsymbol{RR}^{T}=\boldsymbol{I}_{r}. Such a matrix can be found using the QR decomposition of \boldsymbol{Z} (McCulloch et al., 2008). The variance components can then be estimated based on the transformed model:

𝒚~|𝒂~\displaystyle\tilde{\boldsymbol{y}}|\tilde{\boldsymbol{a}} 𝒩r(𝒂~,ϕ𝑰r)\displaystyle\sim\mathcal{N}_{r}(\tilde{\boldsymbol{a}},\phi\boldsymbol{I}_{r})
𝒂~|𝒖1,,𝒖m\displaystyle\tilde{\boldsymbol{a}}|\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m} 𝒩r(𝟎,j=1Jτj𝑹𝑲j(𝑼)𝑹T)\displaystyle\sim\mathcal{N}_{r}\left(\boldsymbol{0},\sum_{j=1}^{J}\tau_{j}\boldsymbol{R}\boldsymbol{K}_{j}(\boldsymbol{U})\boldsymbol{R}^{T}\right)
𝒖1,,𝒖m\displaystyle\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}  i.i.d. 𝒩n(𝟎,l=1Lξl𝑲l(𝑿)),\displaystyle\sim\textrm{ i.i.d. }\mathcal{N}_{n}\left(\boldsymbol{0},\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})\right),

where \tilde{\boldsymbol{y}}=\boldsymbol{Ry} and \tilde{\boldsymbol{a}}=\boldsymbol{Ra}; this is the same model framework discussed in Section 2. As we have seen, the estimators for the variance components after reparameterization are of the form \tilde{\boldsymbol{y}}^{T}\boldsymbol{\Theta}\tilde{\boldsymbol{y}}=\boldsymbol{y}^{T}\boldsymbol{R}^{T}\boldsymbol{\Theta}\boldsymbol{R}\boldsymbol{y}. Since this is a quadratic form, it is clearly an even function of \boldsymbol{y}. To see that it is translation invariant, note that any \boldsymbol{\gamma}\in\mathcal{C}(\boldsymbol{Z}) can be written as \boldsymbol{\gamma}=\boldsymbol{Zb} for some vector \boldsymbol{b}, and we have

(𝒚𝜸)T𝑹T𝚯𝑹(𝒚𝜸)\displaystyle(\boldsymbol{y}-\boldsymbol{\gamma})^{T}\boldsymbol{R}^{T}\boldsymbol{\Theta R}(\boldsymbol{y}-\boldsymbol{\gamma}) =(𝒚𝒁𝒃)T𝑹T𝚯𝑹(𝒚𝒁𝒃)\displaystyle=(\boldsymbol{y}-\boldsymbol{Zb})^{T}\boldsymbol{R}^{T}\boldsymbol{\Theta R}(\boldsymbol{y}-\boldsymbol{Zb})
=𝒚T𝑹T𝚯𝑹𝒚2𝒚T𝑹T𝚯𝑹𝒁𝒃+𝒃T𝒁T𝑹T𝚯𝑹𝒁𝒃\displaystyle=\boldsymbol{y}^{T}\boldsymbol{R}^{T}\boldsymbol{\Theta Ry}-2\boldsymbol{y}^{T}\boldsymbol{R}^{T}\boldsymbol{\Theta RZb}+\boldsymbol{b}^{T}\boldsymbol{Z}^{T}\boldsymbol{R}^{T}\boldsymbol{\Theta RZb}
=𝒚T𝑹T𝚯𝑹𝒚,\displaystyle=\boldsymbol{y}^{T}\boldsymbol{R}^{T}\boldsymbol{\Theta Ry},

where the last equality follows since \boldsymbol{RZ}=\boldsymbol{O} by construction. Therefore, the variance component estimators are also translation-invariant and, based on the results in Kackar and Harville (1981), the estimator for \boldsymbol{\beta} obtained by plugging in the variance component estimators is unbiased.
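A sketch of how the matrix \boldsymbol{R} and the transformed data could be constructed in practice (using an SVD of \boldsymbol{Z}; a QR decomposition, as cited above, works equally well when \boldsymbol{Z} has full column rank):

```python
import numpy as np

def residual_contrast_matrix(Z):
    """Return an r x n matrix R with R Z = 0 and R R^T = I_r, r = n - rank(Z),
    built from the left singular vectors of Z orthogonal to its column space."""
    U, s, _ = np.linalg.svd(Z)                  # full SVD, U is n x n orthogonal
    rank = int(np.linalg.matrix_rank(Z))
    return U[:, rank:].T                        # rows orthonormal and orthogonal to C(Z)

# the transformed data y_tilde = R @ y and kernels R @ K_l @ R.T are then used for MINQUE
```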

For the prediction error, we note that when covariates are present, the predictor for \boldsymbol{y} is given by \hat{\boldsymbol{y}}=\boldsymbol{Z}\hat{\boldsymbol{\beta}}+\hat{\boldsymbol{a}}=\boldsymbol{Z}\hat{\boldsymbol{\beta}}+\mathbb{E}[\boldsymbol{a}|\boldsymbol{y}]. Based on the results for linear mixed models, we know that

𝒂^\displaystyle\hat{\boldsymbol{a}} =𝔼[(j=1Jτ~j𝑲j(𝑼))(j=1Jτ~j𝑲j(𝑼)+𝑰n)1](𝒚𝒁𝜷)\displaystyle=\mathbb{E}\left[\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})\right)\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]\left(\boldsymbol{y}-\boldsymbol{Z}\boldsymbol{\beta}\right)
=𝔼[(j=1Jτ~j𝑲j(𝑼))(j=1Jτ~j𝑲j(𝑼)+𝑰n)1][𝑰n𝒁(𝒁T𝑽1𝒁)𝒁T𝑽1]𝒚\displaystyle=\mathbb{E}\left[\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})\right)\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]\left[\boldsymbol{I}_{n}-\boldsymbol{Z}\left(\boldsymbol{Z}^{T}\boldsymbol{V}^{-1}\boldsymbol{Z}\right)^{-}\boldsymbol{Z}^{T}\boldsymbol{V}^{-1}\right]\boldsymbol{y}

where 𝑽=Var[𝒚]=j=1Jτj𝔼[𝑲j(𝑼)]+ϕ𝑰n\boldsymbol{V}=\textrm{Var}[\boldsymbol{y}]=\sum_{j=1}^{J}\tau_{j}\mathbb{E}[\boldsymbol{K}_{j}(\boldsymbol{U})]+\phi\boldsymbol{I}_{n} and then

𝒚𝒚^\displaystyle\boldsymbol{y}-\hat{\boldsymbol{y}} =[𝑰n𝒁(𝒁T𝑽1𝒁)𝒁T𝑽1]𝒚\displaystyle=\left[\boldsymbol{I}_{n}-\boldsymbol{Z}\left(\boldsymbol{Z}^{T}\boldsymbol{V}^{-1}\boldsymbol{Z}\right)^{-}\boldsymbol{Z}^{T}\boldsymbol{V}^{-1}\right]\boldsymbol{y}
𝔼[(j=1Jτ~j𝑲j(𝑼))(j=1Jτ~j𝑲j(𝑼)+𝑰n)1][𝑰n𝒁(𝒁T𝑽1𝒁)𝒁T𝑽1]𝒚\displaystyle\hskip 28.45274pt-\mathbb{E}\left[\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})\right)\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]\left[\boldsymbol{I}_{n}-\boldsymbol{Z}\left(\boldsymbol{Z}^{T}\boldsymbol{V}^{-1}\boldsymbol{Z}\right)^{-}\boldsymbol{Z}^{T}\boldsymbol{V}^{-1}\right]\boldsymbol{y}
=(𝑰n𝔼[(j=1Jτ~j𝑲j(𝑼))(j=1Jτ~j𝑲j(𝑼)+𝑰n)1])[𝑰n𝒁(𝒁T𝑽1𝒁)𝒁T𝑽1]𝒚\displaystyle=\left(\boldsymbol{I}_{n}-\mathbb{E}\left[\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})\right)\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]\right)\left[\boldsymbol{I}_{n}-\boldsymbol{Z}\left(\boldsymbol{Z}^{T}\boldsymbol{V}^{-1}\boldsymbol{Z}\right)^{-}\boldsymbol{Z}^{T}\boldsymbol{V}^{-1}\right]\boldsymbol{y}
=𝔼[(j=1Jτ~j𝑲j(𝑼)+𝑰n)1][𝑰n𝒁(𝒁T𝑽1𝒁)𝒁T𝑽1]𝒚,\displaystyle=\mathbb{E}\left[\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]\left[\boldsymbol{I}_{n}-\boldsymbol{Z}\left(\boldsymbol{Z}^{T}\boldsymbol{V}^{-1}\boldsymbol{Z}\right)^{-}\boldsymbol{Z}^{T}\boldsymbol{V}^{-1}\right]\boldsymbol{y},

where the last equality follows by noting that

𝑰n𝔼[(j=1Jτ~j𝑲j(𝑼))(j=1Jτ~j𝑲j(𝑼)+𝑰n)1]=𝔼[(j=1Jτ~j𝑲j(𝑼)+𝑰n)1]\boldsymbol{I}_{n}-\mathbb{E}\left[\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})\right)\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]=\mathbb{E}\left[\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]

as shown above. Therefore, letting \boldsymbol{A}=\mathbb{E}\left[\left(\sum_{j=1}^{J}\tilde{\tau}_{j}\boldsymbol{K}_{j}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right] and \boldsymbol{P}_{\boldsymbol{V}}=\boldsymbol{Z}\left(\boldsymbol{Z}^{T}\boldsymbol{V}^{-1}\boldsymbol{Z}\right)^{-}\boldsymbol{Z}^{T}\boldsymbol{V}^{-1}, the prediction error is obtained as follows:

(\boldsymbol{y}-\hat{\boldsymbol{y}})^{T}(\boldsymbol{y}-\hat{\boldsymbol{y}}) =\boldsymbol{y}^{T}\left(\boldsymbol{I}_{n}-\boldsymbol{P}_{\boldsymbol{V}}^{T}\right)\boldsymbol{A}^{2}\left(\boldsymbol{I}_{n}-\boldsymbol{P}_{\boldsymbol{V}}\right)\boldsymbol{y}.

When J=1, as we have shown in the proof of Lemma 3, \mathbb{E}\left[\left(\tilde{\tau}\boldsymbol{K}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]\simeq\left(\tilde{\tau}f\left[\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})\right]+\boldsymbol{I}_{n}\right)^{-1}, so that we can estimate the prediction error using simple plug-in estimators:

(𝒚𝒚^)T(𝒚𝒚^)^\displaystyle\widehat{(\boldsymbol{y}-\hat{\boldsymbol{y}})^{T}(\boldsymbol{y}-\hat{\boldsymbol{y}})} 𝒚T(𝑰n𝑷𝑽^T)(τ~^f[l=1Lξ^l𝑲l(𝑿)]+𝑰n)2(𝑰n𝑷𝑽^)𝒚.\displaystyle\simeq\boldsymbol{y}^{T}\left(\boldsymbol{I}_{n}-\boldsymbol{P}_{\hat{\boldsymbol{V}}}^{T}\right)\left(\hat{\tilde{\tau}}f\left[\sum_{l=1}^{L}\hat{\xi}_{l}\boldsymbol{K}_{l}(\boldsymbol{X})\right]+\boldsymbol{I}_{n}\right)^{-2}\left(\boldsymbol{I}_{n}-\boldsymbol{P}_{\hat{\boldsymbol{V}}}\right)\boldsymbol{y}.
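A sketch of this plug-in computation, with hypothetical argument names for the estimated variance components and f applied elementwise:

```python
import numpy as np

def plug_in_prediction_error(y, Z, kernels, xi_hat, tau_tilde_hat, phi_hat, f):
    """Plug-in estimate of (y - y_hat)^T (y - y_hat) with fixed effects, using
    V_hat = phi_hat*(tau_tilde_hat*f[Sigma_hat] + I) and
    P_V = Z (Z^T V^{-1} Z)^- Z^T V^{-1}."""
    n = len(y)
    fS = f(sum(x * K for x, K in zip(xi_hat, kernels)))       # f applied elementwise
    V = phi_hat * (tau_tilde_hat * fS + np.eye(n))
    Vinv = np.linalg.inv(V)
    P_V = Z @ np.linalg.pinv(Z.T @ Vinv @ Z) @ Z.T @ Vinv     # pinv as a generalized inverse
    A = np.linalg.inv(tau_tilde_hat * fS + np.eye(n))
    r = (np.eye(n) - P_V) @ y
    return float(r @ A @ A @ r)
```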

5 Simulations

In this section, we conducted simulations to compare the prediction performance of KNN with MINQUE estimation to that of the best linear unbiased predictor (BLUP) in the linear mixed model. All simulations are based on 100 individuals and 500 Monte Carlo iterations.

5.1 Nonlinear Random Effect

In this simulation, we examine the performances of both methods under the situation of nonlinear random effects. Specifically, we used the following model to simulate the response:

𝒚=𝟏n+2𝜻+f(𝒂)+ϵ,𝒂𝒩n(𝟎,1p𝑮𝑮T),\boldsymbol{y}=\boldsymbol{1}_{n}+2\boldsymbol{\zeta}+f(\boldsymbol{a})+\boldsymbol{\epsilon},\quad\boldsymbol{a}\sim\mathcal{N}_{n}\left(\boldsymbol{0},\frac{1}{p}\boldsymbol{GG}^{T}\right), (10)

where \boldsymbol{G} is an n\times p matrix containing the genetic information (SNPs) and \boldsymbol{\zeta},\boldsymbol{\epsilon}\sim\textrm{ i.i.d. }\mathcal{N}_{n}(\boldsymbol{0},\boldsymbol{I}_{n}). In the simulation, four types of functions f are considered: linear (f(x)=x), sine (f(x)=\sin(2\pi x)), inverse logistic (f(x)=1/(1+e^{-x})) and a polynomial function of order 2 (f(x)=x^{2}). When applying the kernel neural network, we set L=J=1 and choose \boldsymbol{K}(\boldsymbol{X}) and \boldsymbol{K}(\boldsymbol{U}) as either the product kernel or the polynomial kernel of order 2. We summarize the prediction errors of LMM and KNN via boxplots in Figure 3, which shows the results when f is a linear or a sine function; the cases where f is an inverse logistic function or a polynomial function of order 2 are summarized in Appendix C. As we can observe from the boxplots, when the output kernel is chosen to be the product kernel, the performance of KNN is similar to that of LMM, although with a polynomial input kernel KNN attains a slightly lower prediction error, which we think is not significantly different from that of LMM. However, KNN performs significantly better than LMM when the output kernel is chosen to be the polynomial kernel. As one can tell from the boxplots, when both the input and output kernels are polynomial, KNN has the best performance in terms of prediction error, and this is consistent across all nonlinear functions simulated in this section.
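For reproducibility, the simulation model (10) could be generated along the following lines (sizes, seed and the choice of f are illustrative; this is a sketch rather than the exact code used for the reported results):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 1000
G = rng.integers(0, 3, size=(n, p)).astype(float)      # placeholder SNP matrix coded 0/1/2
K = G @ G.T / p                                        # (1/p) G G^T
a = rng.multivariate_normal(np.zeros(n), K)            # random genetic effect
zeta = rng.standard_normal(n)
eps = rng.standard_normal(n)
f = lambda x: np.sin(2.0 * np.pi * x)                  # e.g. the sine setting
y = 1.0 + 2.0 * zeta + f(a) + eps
```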

Figure 3: The boxplots for linear mixed models (LMM) and kernel neural networks (KNN) in terms of prediction errors. The left panel shows the results when a linear function is used and the right panel shows the results when a sine function is used. In the horizontal axis, “1” corresponds to the LMM; “2” corresponds to the KNN with product input kernel and product output kernel; “3” corresponds to the KNN with product input and polynomial output; “4” corresponds to the KNN with polynomial input and product output and “5” corresponds to the polynomial input and polynomial output.

5.2 Nonadditive Effects

In this simulation, we explore the performance of both methods under non-additive effects. We conducted two simulations corresponding to two different types of non-additive effects. In the first simulation, we mainly focus on the interaction effect and generate the response using the following model:

𝒚=f(𝑮)+ϵ,\boldsymbol{y}=f(\boldsymbol{G})+\boldsymbol{\epsilon},

where 𝑮=[𝒈1,,𝒈p]n×p\boldsymbol{G}=[\boldsymbol{g}_{1},\ldots,\boldsymbol{g}_{p}]\in\mathbb{R}^{n\times p} is the SNP data and ϵ𝒩n(𝟎,𝑰n)\boldsymbol{\epsilon}\sim\mathcal{N}_{n}(\boldsymbol{0},\boldsymbol{I}_{n}). When applying both methods, the mean is adjusted so that the response has marginal mean 0. In the simulation, we randomly pick 10 causal SNPs, denoted by 𝒈i1,,𝒈i10\boldsymbol{g}_{i_{1}},\ldots,\boldsymbol{g}_{i_{10}} and consider the following function,

f(𝑮)=1j1<j210𝒈ij1𝒈ij2,f(\boldsymbol{G})=\sum_{1\leq j_{1}<j_{2}\leq 10}\boldsymbol{g}_{i_{j_{1}}}\odot\boldsymbol{g}_{i_{j_{2}}},

where \odot stands for the Hadamard product. For LMM, the product kernel was used as the covariance matrix of the random effect. The result is shown in Figure 4. It is interesting to note from the boxplots that both LMM and KNN have many outliers when the product kernel is used. Overall, LMM shows larger variation than KNN in this simulation. When the output kernel in KNN is the polynomial kernel, the performance of KNN is much better than that of LMM.
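A sketch of the interaction simulation described above (the data are random placeholders; only the structure follows the text):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 100, 1000
G = rng.integers(0, 3, size=(n, p)).astype(float)      # placeholder SNP matrix
idx = rng.choice(p, size=10, replace=False)            # causal SNPs g_{i_1},...,g_{i_10}
fG = np.zeros(n)
for j1 in range(10):
    for j2 in range(j1 + 1, 10):
        fG += G[:, idx[j1]] * G[:, idx[j2]]            # pairwise Hadamard products
y = fG + rng.standard_normal(n)
y = y - y.mean()                                       # mean-adjust the response
```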

Figure 4: The boxplots for linear mixed models (LMM) and kernel neural networks (KNN) in terms of prediction errors based on the simulation model focusing on the interaction effect. The vertical axis is scaled to 0-5 by removing some outliers to make the comparison visually clear. In the horizontal axis, “1” corresponds to the LMM; “2” corresponds to the KNN with product input kernel and product output kernel; “3” corresponds to the KNN with product input and polynomial output; “4” corresponds to the KNN with polynomial input and product output and “5” corresponds to the polynomial input and polynomial output.

There are three main modes of inheritance: additive, dominant and recessive. In many situations, the additive coding (AA=0, Aa=1, aa=2) is used. In the second simulation, we consider the dominant coding (AA=1, Aa=1, aa=0) and the recessive coding (AA=0, Aa=1, aa=1). The response was simulated based on the model:

𝒚=𝒂+ϵ,𝒂𝒩n(𝟎,1p𝑮𝑮T),ϵ𝒩n(𝟎,𝑰n),\boldsymbol{y}=\boldsymbol{a}+\boldsymbol{\epsilon},\quad\boldsymbol{a}\sim\mathcal{N}_{n}\left(\boldsymbol{0},\frac{1}{p}\boldsymbol{G}^{\prime}\boldsymbol{G}^{\prime T}\right),\quad\boldsymbol{\epsilon}\sim\mathcal{N}_{n}(\boldsymbol{0},\boldsymbol{I}_{n}),

where \boldsymbol{G}^{\prime} is a SNP data matrix based on dominant or recessive coding, so that each element in \boldsymbol{G}^{\prime} takes only the two possible values 0 and 1. Figure 5 summarizes the simulation results. Comparing the two boxplots in Figure 5, the performance is similar in both cases. As before, KNN with a polynomial input kernel and a polynomial output kernel achieves the lowest prediction error.

Figure 5: The boxplots for linear mixed models (LMM) and kernel neural networks (KNN) in terms of prediction errors based on the simulation model using dominant coding (left figure) and recessive coding (right figure) for SNPs. In the horizontal axis, “1” corresponds to the LMM; “2” corresponds to the KNN with product input kernel and product output kernel; “3” corresponds to the KNN with product input and polynomial output; “4” corresponds to the KNN with polynomial input and product output and “5” corresponds to the polynomial input and polynomial output.

5.3 Non-normal Error Distributions

In this simulation, we consider different types of error distributions and compare the prediction errors of LMM and KNN. Specifically, we focus on two error distributions: a t-distribution with 2 degrees of freedom, which is heavy-tailed, and a centered \chi_{1}^{2}-distribution, which is non-symmetric. The response was simulated from the model in Section 5.1, except that the function applied to the random effect is f(x)=x and the distribution of the error term \boldsymbol{\epsilon} is either the t_{2} or the centered \chi_{1}^{2} distribution. The results are summarized in Figure 6. From the results, we observe that KNN with polynomial input and output kernels is slightly more robust to non-normal error distributions than LMM and than KNN with the other combinations of input and output kernels considered in the simulations. Since the t_{2}-distribution is heavy-tailed, there are more outliers than in the normal case; in terms of the number of outliers, KNN with polynomial input and output kernels still performs the best.

Figure 6: Boxplots of prediction errors for linear mixed models (LMM) and kernel neural networks (KNN) under the simulation model with $t_{2}$-distributed errors (left) and centered $\chi_{1}^{2}$-distributed errors (right). On the horizontal axis, “1” corresponds to the LMM; “2” to the KNN with product input kernel and product output kernel; “3” to the KNN with product input and polynomial output; “4” to the KNN with polynomial input and product output; and “5” to the KNN with polynomial input and polynomial output.

6 Real Data Application

We applied our method to the whole genome sequencing data from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) and made predictions on the responses. A total of 808 individuals at the baseline of the ADNI1 and ADNI2 studies have whole genome sequencing data. We dropped single nucleotide polymorphisms (SNPs) with a low call rate ($<$0.9), a low minor allele frequency (MAF $<$0.01), or a failed Hardy-Weinberg exact test ($p$-value $<$1e-6), and non-European American samples were also removed. The data were then uploaded to the University of Michigan imputation server for posterior likelihood imputation (https://imputationserver.sph.umich.edu/index.html). From the imputed data, we extracted SNPs with allelic $R^{2}>0.9$, and the covariance kernel matrix and the normalized identity-by-state (IBS) kernel matrix were then constructed for analysis. Specifically, the $(i,j)$th element of each of the two kernel matrices is defined as follows:

\boldsymbol{K}^{\textrm{cov}}(\boldsymbol{g}_{i},\boldsymbol{g}_{j})=\frac{1}{p-1}\sum_{k=1}^{p}\left(g_{ik}-\frac{1}{p}\sum_{k}g_{ik}\right)\left(g_{jk}-\frac{1}{p}\sum_{k}g_{jk}\right)
\boldsymbol{K}^{\textrm{ibs}}(\boldsymbol{g}_{i},\boldsymbol{g}_{j})=\frac{1}{2p}\sum_{k=1}^{p}\left[2-|g_{ik}-g_{jk}|\right],

where $p$ denotes the number of SNPs in all expressions.
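The two kernel matrices can be computed directly from the genotype matrix. The sketch below (our own Python/NumPy illustration; the function names and the synthetic genotype matrix are assumptions) follows the two formulas above, with $\boldsymbol{G}$ an $n\times p$ matrix of additive genotype codes (0, 1, 2).

```python
import numpy as np

def covariance_kernel(G):
    """K_cov[i, j] = (1/(p-1)) * sum_k (g_ik - mean_i)(g_jk - mean_j),
    centering each subject's genotype vector by its own mean, as in the formula above."""
    p = G.shape[1]
    Gc = G - G.mean(axis=1, keepdims=True)
    return Gc @ Gc.T / (p - 1)

def ibs_kernel(G):
    """K_ibs[i, j] = (1/(2p)) * sum_k (2 - |g_ik - g_jk|)."""
    n, p = G.shape
    K = np.empty((n, n))
    for i in range(n):
        # |g_ik - g_jk| summed over SNPs, for all j at once
        K[i] = (2.0 * p - np.abs(G[i] - G).sum(axis=1)) / (2.0 * p)
    return K

# Example with synthetic genotypes:
rng = np.random.default_rng(3)
G = rng.integers(0, 3, size=(50, 500)).astype(float)
K_cov = covariance_kernel(G)
K_ibs = ibs_kernel(G)
```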

Four brain volume measures (hippocampus, ventricles, entorhinal cortex, and whole brain) were used as the phenotypes of interest. We chose these four regions because they play important roles in Alzheimer’s disease (AD): loss of whole brain, hippocampal, and entorhinal volume and enlargement of the ventricles can be detected among AD patients. When applying both the KNN and LMM methods, we only included subjects with both genetic and phenotypic information, which results in 513 individuals for hippocampus, 564 for ventricles, 516 for entorhinal cortex, and 570 for whole brain volume.

The response variable was the natural logarithm of each of the four volumes, and the covariates were age, gender, education status, and APOE4. The KNN is then based on the model

\boldsymbol{y}|\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}\sim\mathcal{N}_{n}\left(\boldsymbol{Z\beta},\tau\boldsymbol{K}(\boldsymbol{U})+\phi\boldsymbol{I}_{n}\right)
\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}\sim\mathcal{N}_{n}\left(\boldsymbol{0},\xi_{1}\boldsymbol{K}^{\textrm{cov}}+\xi_{2}\boldsymbol{K}^{\textrm{ibs}}\right),

and the LMM is based on the following model assumption:

\boldsymbol{y}\sim\mathcal{N}_{n}\left(\boldsymbol{Z\beta},\tau_{1}\boldsymbol{K}^{\textrm{cov}}+\tau_{2}\boldsymbol{K}^{\textrm{ibs}}+\tau_{3}\boldsymbol{I}_{n}\right). \quad (11)

The restricted maximum likelihood estimates of $\tau_{i}$, $i=1,2,3$, were computed via Fisher scoring, and the BLUP for the LMM was obtained from these estimates. Similarly, when applying the KNN method, the two kernel matrices $\boldsymbol{K}^{\textrm{cov}}$ and $\boldsymbol{K}^{\textrm{ibs}}$ were used as the input kernel matrices, and the output kernel matrix was chosen to be either the product kernel or the polynomial kernel of order 2. The average mean squared prediction errors of both methods are summarized in Table 1.

Table 1: Average mean squared prediction error of KNN with product output kernel matrix, KNN with output kernel matrix as polynomial of order 2 and the BLUP based on LMM.
                     KNN(prod)   KNN(poly)   LMM
Hippocampus          2.01e-06    4.17e-05    2.11e-02
Ventricles           2.44e-03    1.98e-03    2.24e-01
Entorhinal           1.01e-05    1.06e-04    4.05e-02
Whole Brain Volume   1.25e-07    3.04e-08    3.68e-11

As seen from the table, both KNN and LMM achieve small prediction errors, but we would argue that the prediction errors from KNN are more realistic. It is interesting to note that the average prediction errors of LMM for entorhinal and whole brain volume are extremely close to zero. We checked the estimates of the variance components in these cases and noticed that the variance component associated with the identity matrix, $\tau_{3}$ in (11), is estimated to be essentially zero. Since the prediction based on the BLUP of the random effect is given by

\textrm{BLUP}=\boldsymbol{Z}\hat{\boldsymbol{\beta}}+\left(\tau_{1}\boldsymbol{K}^{\textrm{cov}}+\tau_{2}\boldsymbol{K}^{\textrm{ibs}}\right)\left(\tau_{1}\boldsymbol{K}^{\textrm{cov}}+\tau_{2}\boldsymbol{K}^{\textrm{ibs}}+\tau_{3}\boldsymbol{I}_{n}\right)^{-1}(\boldsymbol{y}-\boldsymbol{Z}\hat{\boldsymbol{\beta}}),

it follows that if $\tau_{3}$ is very close to 0, the BLUP is very close to the response $\boldsymbol{y}$, which leads to the extremely low prediction error. For KNN, on the other hand, whenever the estimate of the error variance becomes negative we project the MINQUE matrix onto the positive semidefinite cone, so the estimate of the error variance component is always positive, which makes the resulting prediction error more realistic.
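The shrinkage behavior described above can be seen numerically with a small sketch (our own illustration; the synthetic kernels, covariates, and variance-component values are arbitrary and are not taken from the ADNI analysis) of the BLUP formula: as $\tau_{3}\to 0$, the prediction collapses onto the observed response.

```python
import numpy as np

def blup_prediction(y, Z, K_cov, K_ibs, tau1, tau2, tau3):
    """Z*beta_hat + V_g (V_g + tau3*I)^{-1} (y - Z*beta_hat), with
    V_g = tau1*K_cov + tau2*K_ibs and beta_hat the GLS estimate."""
    n = len(y)
    Vg = tau1 * K_cov + tau2 * K_ibs
    V = Vg + tau3 * np.eye(n)
    Vinv = np.linalg.inv(V)
    beta_hat = np.linalg.solve(Z.T @ Vinv @ Z, Z.T @ Vinv @ y)
    resid = y - Z @ beta_hat
    return Z @ beta_hat + Vg @ Vinv @ resid

# Synthetic example: as tau3 -> 0 the prediction error against y shrinks toward 0.
rng = np.random.default_rng(4)
n = 60
Z = np.column_stack([np.ones(n), rng.standard_normal(n)])
A = rng.standard_normal((n, n)); K1 = A @ A.T / n
B = rng.standard_normal((n, n)); K2 = B @ B.T / n
y = rng.standard_normal(n)

for tau3 in (1.0, 1e-3, 1e-8):
    pred = blup_prediction(y, Z, K1, K2, tau1=0.5, tau2=0.5, tau3=tau3)
    print(tau3, np.mean((y - pred) ** 2))
```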

7 Discussion

In this paper, a kernel-based neural network model was proposed for prediction analysis. The kernel-based neural network can be viewed as an extension of the linear mixed model, since it reduces to a linear mixed model when the product kernel matrix is chosen as the output kernel matrix and the model is reparameterized. A modified MINQUE strategy is used to estimate the variance components in the KNN model when the output kernel satisfies the generalized linear separable condition. Simulation studies and the real data application show that the KNN model can achieve better performance in terms of mean squared prediction error when the output kernel matrix is chosen as the polynomial kernel matrix. This is analogous to the classical neural network, where better prediction accuracy can be achieved when nonlinear activation functions are applied.

Many extensions can be made to give the KNN model more flexibility and broader applications. First, base kernel matrix selection could be considered. Although we do not study how to choose the number of base kernel matrices $L$ in this paper, using too many kernel matrices will certainly add redundant information, so it would be beneficial to propose a criterion for choosing the base kernel matrices. Another possible extension of the KNN model is more challenging: the theoretical properties discussed in this paper focus on the case where only one output kernel matrix is used and the kernel function has a specific form. It is natural to consider more general kernel functions, but the estimation of the variance components would then become more complex. Moreover, it would also be worthwhile to study the performance of KNN when deeper network structures are applied.

Appendix A Some Results from Concentration Inequality and Matrix Analysis

A.1 Sub-Gaussian and Sub-Exponential Inequalities

In this part, we present two basic concentration inequalities used in the main text. For more details on concentration inequality, readers may refer to Buldygin and Kozachenko (2000), Boucheron et al. (2004) and Ledoux (2005).

Definition 1.

  (i) A random variable $X$ with mean $\mu=\mathbb{E}[X]$ is called sub-Gaussian if there exists a number $\sigma\geq 0$ such that

    \mathbb{E}\left[e^{\lambda(X-\mu)}\right]\leq\exp\left\{\frac{1}{2}\lambda^{2}\sigma^{2}\right\},\quad\forall\lambda\in\mathbb{R}.

  The constant $\sigma$ is known as the sub-Gaussian parameter.

  (ii) A random variable $X$ with mean $\mu=\mathbb{E}[X]$ is called sub-exponential (also called pre-Gaussian) if there exist non-negative constants $(\beta,\alpha)$ such that

    \mathbb{E}\left[e^{\lambda(X-\mu)}\right]\leq\exp\left\{\frac{1}{2}\lambda^{2}\beta^{2}\right\},\quad\forall|\lambda|<\frac{1}{\alpha}.

  The pair of constants $(\beta,\alpha)$ is known as the sub-exponential parameter.

Based on the definition of sub-Gaussian random variables, it is clear that if $X\sim\mathcal{N}(\mu,\sigma^{2})$, then $X$ is sub-Gaussian with sub-Gaussian parameter $\sigma$. The tail probabilities of both sub-Gaussian and sub-exponential random variables can be bounded exponentially. For sub-Gaussian random variables, the result is the famous Hoeffding inequality.

Theorem 4.

(Hoeffding, 1963) Suppose that the random variables $X_{1},\ldots,X_{n}$ are independent and $X_{i}$ has mean $\mu_{i}$ and sub-Gaussian parameter $\sigma_{i}$. Then for all $t>0$,

\mathbb{P}\left(\left|\sum_{i=1}^{n}(X_{i}-\mu_{i})\right|>t\right)\leq 2\exp\left\{-\frac{t^{2}}{2\sum_{i=1}^{n}\sigma_{i}^{2}}\right\}.
Theorem 5.

Suppose that $X$ is a sub-exponential random variable with mean $\mu$ and sub-exponential parameters $(\beta,\alpha)$. Then for all $t>0$,

\mathbb{P}\left(|X-\mu|>t\right)\leq\left\{\begin{array}{ll}2\exp\left\{-\frac{t^{2}}{2\beta^{2}}\right\},&\textrm{if }0<t\leq\frac{\beta^{2}}{\alpha}\\ 2\exp\left\{-\frac{t}{2\alpha}\right\},&\textrm{if }t>\frac{\beta^{2}}{\alpha}\end{array}\right..
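As a quick numerical sanity check (our own addition; the degrees of freedom and thresholds are arbitrary), the sketch below compares the empirical tail of a $\chi_{m}^{2}$ variable with the bound of Theorem 5, using the sub-exponential parameters $(2\sqrt{m},4)$ derived in the proof of Lemma 1 below.

```python
import numpy as np

rng = np.random.default_rng(5)
m, reps = 20, 500_000

# chi^2_m is sub-exponential with parameters (beta, alpha) = (2*sqrt(m), 4).
X = rng.chisquare(df=m, size=reps)
beta, alpha = 2.0 * np.sqrt(m), 4.0

for t in (5.0, 15.0, 30.0):
    empirical = np.mean(np.abs(X - m) > t)
    if t <= beta**2 / alpha:
        bound = 2.0 * np.exp(-t**2 / (2.0 * beta**2))
    else:
        bound = 2.0 * np.exp(-t / (2.0 * alpha))
    print(f"t={t:5.1f}  empirical={empirical:.5f}  bound={bound:.5f}")
```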

A.2 Results on Matrix Analysis

Proposition 6 shows that the matrix inverse map is continuous, a fact that will be used frequently in later parts when we approximate the average prediction errors.

Proposition 6 (Gentle (2007)).

Let $\boldsymbol{\Psi}$ be an arbitrary invertible matrix. Then the map $f:\boldsymbol{\Psi}\mapsto\boldsymbol{\Psi}^{-1}$ is continuous.

Proof.

Note that $\boldsymbol{\Psi}^{-1}=|\boldsymbol{\Psi}|^{-1}\textrm{Adj}(\boldsymbol{\Psi})$, where $\textrm{Adj}(\boldsymbol{\Psi})$ is the adjugate matrix of $\boldsymbol{\Psi}$, i.e., $\textrm{Adj}(\boldsymbol{\Psi})=\boldsymbol{C}^{T}$ with $\boldsymbol{C}$ the cofactor matrix of $\boldsymbol{\Psi}$, $\boldsymbol{C}_{ij}=(-1)^{i+j}\boldsymbol{M}_{ij}$, and $\boldsymbol{M}_{ij}$ the $(i,j)$ minor of $\boldsymbol{\Psi}$. Based on the definition of the determinant, both $|\boldsymbol{\Psi}|$ and the entries of $\textrm{Adj}(\boldsymbol{\Psi})$ are polynomials in the entries of $\boldsymbol{\Psi}$, so the maps $g:\boldsymbol{\Psi}\mapsto|\boldsymbol{\Psi}|$ and $h:\boldsymbol{\Psi}\mapsto\textrm{Adj}(\boldsymbol{\Psi})$ are continuous. Therefore, the map $f:\boldsymbol{\Psi}\mapsto\boldsymbol{\Psi}^{-1}$ is continuous. ∎

Another result that will be used later is that any two matrix norms are equivalent, in the sense that for any given pair of matrix norms $\|\cdot\|_{s}$ and $\|\cdot\|_{t}$, there is a finite positive constant $C_{st}$ such that

\|\boldsymbol{A}\|_{s}\leq C_{st}\|\boldsymbol{A}\|_{t},\quad\forall\boldsymbol{A}\in\mathcal{M}_{n},

where $\mathcal{M}_{n}$ is the collection of all $n\times n$ matrices.

Appendix B Technical Proofs

B.1 Proof of Lemma 1

Proof.

Note that

\mathbb{E}\left[\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}\right]=\frac{1}{m}\sum_{k=1}^{m}\mathbb{E}\left[\boldsymbol{w}_{ik}\boldsymbol{w}_{jk}\right]=\sigma_{ij}.

Now we consider the Taylor expansion of $\boldsymbol{K}_{ij}(\boldsymbol{U})$ around $\sigma_{ij}$:

\boldsymbol{K}_{ij}(\boldsymbol{U})=f(\sigma_{ij})+f^{\prime}(\sigma_{ij})\left(\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}-\sigma_{ij}\right)+\frac{1}{2}f^{\prime\prime}(\eta_{ij})\left(\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}-\sigma_{ij}\right)^{2},

where $\eta_{ij}$ lies between $\sigma_{ij}$ and $\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}$. Then the truncation error can be evaluated as follows. For all $\delta>0$,

\mathbb{P}\left(\left|\boldsymbol{K}_{ij}(\boldsymbol{U})-\hat{\boldsymbol{K}}_{ij}(\boldsymbol{U})\right|>\delta\right)=\mathbb{P}\left(\frac{1}{2}|f^{\prime\prime}(\eta_{ij})|\left(\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}-\sigma_{ij}\right)^{2}>\delta\right)
\leq\mathbb{P}\left(\frac{1}{2}M\left(\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}-\sigma_{ij}\right)^{2}>\delta\right)
=\mathbb{P}\left(\left|\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}-\sigma_{ij}\right|>\sqrt{\frac{2\delta}{M}}\right).

So it suffices to evaluate the tail probability $\mathbb{P}\left(\left|\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}-\sigma_{ij}\right|>\sqrt{\frac{2\delta}{M}}\right)$. Since

\boldsymbol{w}_{j}|\boldsymbol{w}_{i}\sim\mathcal{N}_{m}\left(\frac{\sigma_{ij}}{\sigma_{ii}}\boldsymbol{w}_{i},\left(\sigma_{jj}-\frac{\sigma_{ij}^{2}}{\sigma_{ii}}\right)\boldsymbol{I}_{m}\right),

we get

\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}|\boldsymbol{w}_{i}\sim\mathcal{N}\left(\frac{\sigma_{ij}}{\sigma_{ii}}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i},\left(\sigma_{jj}-\frac{\sigma_{ij}^{2}}{\sigma_{ii}}\right)\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}\right).

Therefore, given $\boldsymbol{w}_{i}$, the random variable $\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}$ is sub-Gaussian with sub-Gaussian parameter $\sqrt{\sigma_{ii}^{-1}s_{ij}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}}$, where $s_{ij}=\sigma_{ii}\sigma_{jj}-\sigma_{ij}^{2}$. Since

\mathbb{P}\left(\left|\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}-\sigma_{ij}\right|>\sqrt{\frac{2\delta}{M}}\right)\leq\mathbb{P}\left(\left|\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}-\frac{\sigma_{ij}}{\sigma_{ii}}\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}}{m}\right|>\frac{1}{2}\sqrt{\frac{2\delta}{M}}\right)+\mathbb{P}\left(\left|\frac{\sigma_{ij}}{\sigma_{ii}}\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}}{m}-\sigma_{ij}\right|>\frac{1}{2}\sqrt{\frac{2\delta}{M}}\right)
:=(I)+(II),

it suffices to provide bounds for (I) and (II).

For $(II)$, since $\boldsymbol{w}_{i}\sim\mathcal{N}_{m}(\boldsymbol{0},\sigma_{ii}\boldsymbol{I}_{m})$, we have $\frac{1}{\sigma_{ii}}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}\sim\chi_{m}^{2}$, which implies that

\mathbb{E}\left[e^{\lambda\left(\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}}{\sigma_{ii}}-m\right)}\right]=e^{-\lambda m}\mathbb{E}\left[e^{\lambda\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}}{\sigma_{ii}}}\right]=e^{-\lambda m}(1-2\lambda)^{-\frac{m}{2}},\quad\textrm{for }\lambda<\frac{1}{2}
=\left(\frac{e^{-\lambda}}{\sqrt{1-2\lambda}}\right)^{m}
\leq e^{2m\lambda^{2}},\quad\textrm{for all }|\lambda|<\frac{1}{4}
=e^{\frac{4m\lambda^{2}}{2}},\quad\textrm{for all }|\lambda|<\frac{1}{4},

i.e., $\frac{1}{\sigma_{ii}}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}$ is sub-exponential with parameters $(2\sqrt{m},4)$. Hence, for $\sigma_{ij}\neq 0$, by Theorem 5 in Appendix A.1, we have

(II)=\mathbb{P}\left(\left|\frac{1}{\sigma_{ii}}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}-m\right|>\frac{2m}{|\sigma_{ij}|}\sqrt{\frac{2\delta}{M}}\right)
\leq 2\exp\left\{-\left(\frac{\delta m}{\sigma_{ij}^{2}M}\wedge\frac{m}{4|\sigma_{ij}|}\sqrt{\frac{2\delta}{M}}\right)\right\}. \quad (12)

If $\sigma_{ij}=0$, then $(II)=0$.

For $(I)$, by the Hoeffding inequality, we have

(I)=\mathbb{E}_{\boldsymbol{w}_{i}}\left[\mathbb{P}\left(\left.\left|\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}-\frac{\sigma_{ij}}{\sigma_{ii}}\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}}{m}\right|>\frac{1}{2}\sqrt{\frac{2\delta}{M}}\right|\boldsymbol{w}_{i}\right)\right]
=\mathbb{E}_{\boldsymbol{w}_{i}}\left[\mathbb{P}\left(\left.\left|\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}-\frac{\sigma_{ij}}{\sigma_{ii}}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}\right|>\frac{m}{2}\sqrt{\frac{2\delta}{M}}\right|\boldsymbol{w}_{i}\right)\right]
\leq\mathbb{E}_{\boldsymbol{w}_{i}}\left[2\exp\left\{-\frac{\sigma_{ii}m^{2}\delta}{4Ms_{ij}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}}\right\}\right]. \quad (13)

Theorem A in Inglot (2010) states that for a random variable $\chi\sim\chi_{m}^{2}$, the $100(1-\alpha)$th percentile is bounded above by $m+2\log(1/\alpha)+2\sqrt{m\log(1/\alpha)}$, which is of order $\mathcal{O}(m)$ as $m\to\infty$. Now, since $\sigma_{ii}^{-1}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}\sim\chi_{m}^{2}$, we get for any $\alpha\in(0,1)$,

\mathbb{P}\left(\sigma_{ii}^{-1}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}\geq m+2\log\frac{1}{\alpha}+2\sqrt{m\log\frac{1}{\alpha}}\right)\leq\alpha.

Let $q(\alpha,m)=m+2\log(1/\alpha)+2\sqrt{m\log(1/\alpha)}$. Since the function $\exp\{-a/x\}$ is increasing in $x$ for $a>0$, we can further bound (13) as follows:

(I)\leq\mathbb{E}_{\boldsymbol{w}_{i}}\left[2\exp\left\{-\frac{m^{2}\delta}{4Ms_{ij}\sigma_{ii}^{-1}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}}\right\}\mathbb{I}_{\{\sigma_{ii}^{-1}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}\leq q(\alpha,m)\}}\right]+\mathbb{E}_{\boldsymbol{w}_{i}}\left[2\exp\left\{-\frac{m^{2}\delta}{4Ms_{ij}\sigma_{ii}^{-1}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}}\right\}\mathbb{I}_{\{\sigma_{ii}^{-1}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}\geq q(\alpha,m)\}}\right]
\leq 2\exp\left\{-\frac{m^{2}\delta}{4Ms_{ij}q(\alpha,m)}\right\}+2\mathbb{P}\left(\sigma_{ii}^{-1}\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{i}\geq q(\alpha,m)\right)
\leq 2\exp\left\{-\frac{m^{2}\delta}{4Ms_{ij}q(\alpha,m)}\right\}+2\alpha.

By choosing $\alpha=\exp\{-m\}$, we get $q(\alpha,m)=m+2m+2\sqrt{m^{2}}=5m$, so that

(I)\leq 2\exp\left\{-\frac{m^{2}\delta}{20Mms_{ij}}\right\}+2\exp\{-m\}
\leq 2\exp\left\{-m\left(1\wedge\frac{\delta}{20Ms_{ij}}\right)\right\}. \quad (14)

Combining (14) and (12), we obtain for all $\delta>0$,

\mathbb{P}\left(\left|\boldsymbol{K}_{ij}(\boldsymbol{U})-\hat{\boldsymbol{K}}_{ij}(\boldsymbol{U})\right|>\delta\right)\leq(I)+(II)
\leq 4\exp\left\{-m\left(1\wedge\frac{\delta}{20Ms_{ij}}\wedge\frac{\delta}{M\sigma_{ij}^{2}}\wedge\frac{1}{4|\sigma_{ij}|}\sqrt{\frac{2\delta}{M}}\right)\right\}. ∎
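A small Monte Carlo sketch (our own addition; the covariance entries, the threshold, and the number of replicates are arbitrary) illustrates the concentration that drives this bound: the empirical probability that $|\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}/m-\sigma_{ij}|$ exceeds a fixed threshold decays rapidly in $m$.

```python
import numpy as np

rng = np.random.default_rng(6)
Sigma2 = np.array([[1.0, 0.3],
                   [0.3, 2.0]])      # (sigma_ii, sigma_ij; sigma_ij, sigma_jj)
L = np.linalg.cholesky(Sigma2)
sigma_ij = Sigma2[0, 1]
threshold = 0.1                      # plays the role of sqrt(2*delta/M)
reps = 5_000

for m in (50, 200, 1000):
    Z = rng.standard_normal((reps, m, 2))
    W = Z @ L.T                      # each pair (w_ik, w_jk) is N(0, Sigma2)
    avg = (W[..., 0] * W[..., 1]).mean(axis=1)   # w_i^T w_j / m per replicate
    print(m, np.mean(np.abs(avg - sigma_ij) > threshold))
```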

Remark 2.

The condition (7) can be weakened as follows:

f^{\prime\prime}\left(\lambda\sigma_{ij}+(1-\lambda)\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}\right)=\mathcal{O}_{p}(1)

for all $\lambda\in[0,1]$. In this case, the evaluation of the truncation error can be modified as follows. For all $\delta>0$, there exists $M_{\delta}>0$ such that

\mathbb{P}\left(\left|f^{\prime\prime}(\eta_{ij})\right|>M_{\delta}\right)<\frac{\delta}{2},

where $\eta_{ij}=\lambda\sigma_{ij}+(1-\lambda)\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}$ for some $\lambda\in[0,1]$, and then

\mathbb{P}\left(\left|\boldsymbol{K}_{ij}(\boldsymbol{U})-\hat{\boldsymbol{K}}_{ij}(\boldsymbol{U})\right|>\delta\right)=\mathbb{P}\left(\frac{1}{2}f^{\prime\prime}(\eta_{ij})\left(\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}-\sigma_{ij}\right)^{2}>\delta\right)
\leq\mathbb{P}\left(\left\{\frac{1}{2}f^{\prime\prime}(\eta_{ij})\left(\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}-\sigma_{ij}\right)^{2}>\delta\right\}\cap\left\{|f^{\prime\prime}(\eta_{ij})|\leq M_{\delta}\right\}\right)+\mathbb{P}\left(|f^{\prime\prime}(\eta_{ij})|>M_{\delta}\right)
\leq\mathbb{P}\left(\left|\frac{\boldsymbol{w}_{i}^{T}\boldsymbol{w}_{j}}{m}-\sigma_{ij}\right|>\sqrt{\frac{2\delta}{M_{\delta}}}\right)+\frac{\delta}{2}
\leq 4\exp\left\{-m\left(1\wedge\frac{\delta}{20M_{\delta}s_{ij}}\wedge\frac{\delta}{M_{\delta}\sigma_{ij}^{2}}\wedge\frac{1}{4|\sigma_{ij}|}\sqrt{\frac{2\delta}{M_{\delta}}}\right)\right\}+\frac{\delta}{2}.

Now, we can choose $m>\left(1\wedge\frac{\delta}{20M_{\delta}s_{ij}}\wedge\frac{\delta}{M_{\delta}\sigma_{ij}^{2}}\wedge\frac{1}{4|\sigma_{ij}|}\sqrt{\frac{2\delta}{M_{\delta}}}\right)^{-1}\log\frac{8}{\delta}$ so that

\mathbb{P}\left(\left|\boldsymbol{K}_{ij}(\boldsymbol{U})-\hat{\boldsymbol{K}}_{ij}(\boldsymbol{U})\right|>\delta\right)<\frac{\delta}{2}+\frac{\delta}{2}=\delta,

and hence $\boldsymbol{K}_{ij}(\boldsymbol{U})=\hat{\boldsymbol{K}}_{ij}(\boldsymbol{U})+o_{p}(1)$.

B.2 Proof of Lemma 3

Proof.
  (i)

    When $f(x)=x$, we have $\boldsymbol{K}(\boldsymbol{U})=\frac{1}{m}\boldsymbol{U}\boldsymbol{U}^{T}$, and since the hidden random vectors $\boldsymbol{u}_{1},\ldots,\boldsymbol{u}_{m}$ are i.i.d., we know that

    \boldsymbol{u}_{1}\boldsymbol{u}_{1}^{T},\ldots,\boldsymbol{u}_{m}\boldsymbol{u}_{m}^{T}\sim\textrm{ i.i.d. }\mathcal{W}_{n}\left(1,\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})\right),

    where $\mathcal{W}_{n}\left(1,\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})\right)$ stands for a Wishart distribution with degrees of freedom 1 and covariance matrix $\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})$. Therefore, the Strong Law of Large Numbers implies that, as $m\to\infty$ (a numerical sketch of this convergence is given after the proof),

    \boldsymbol{K}(\boldsymbol{U})=\frac{1}{m}\boldsymbol{U}\boldsymbol{U}^{T}=\frac{1}{m}\sum_{i=1}^{m}\boldsymbol{u}_{i}\boldsymbol{u}_{i}^{T}\to\mathbb{E}\left[\boldsymbol{u}_{1}\boldsymbol{u}_{1}^{T}\right]=\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X}),\quad\textrm{a.s.}

    Since the map $\psi:\boldsymbol{A}\mapsto\boldsymbol{A}^{-1}$ is continuous at non-singular matrices $\boldsymbol{A}$, the Continuous Mapping Theorem gives

    \left(\tilde{\tau}\boldsymbol{K}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\to\left(\sum_{l=1}^{L}\tilde{\tau}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})+\boldsymbol{I}_{n}\right)^{-1},\quad\textrm{a.s., as }m\to\infty.

    Letting $\tilde{\boldsymbol{A}}=(\tilde{\tau}\boldsymbol{K}(\boldsymbol{U})+\boldsymbol{I}_{n})^{-1}$, we have

    \max_{1\leq i,j\leq n}|\tilde{\boldsymbol{A}}_{ij}|\leq\|\tilde{\boldsymbol{A}}\|_{\infty}\lesssim\|\tilde{\boldsymbol{A}}\|_{\textrm{op}}=\lambda_{\max}\left(\left(\tilde{\tau}\boldsymbol{K}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right)=\frac{1}{\tilde{\tau}\lambda_{\min}(\boldsymbol{K}(\boldsymbol{U}))+1}\leq 1<\infty.

    Therefore, by the Bounded Convergence Theorem,

    \boldsymbol{A}:=\mathbb{E}\left[\left(\tilde{\tau}\boldsymbol{K}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]\to\left(\sum_{l=1}^{L}\tilde{\tau}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})+\boldsymbol{I}_{n}\right)^{-1},\quad\textrm{a.s. as }m\to\infty.

    Asymptotically, as $m\to\infty$, we get

    R\simeq\boldsymbol{y}^{T}\left(\sum_{l=1}^{L}\tilde{\tau}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})+\boldsymbol{I}_{n}\right)^{-2}\boldsymbol{y}.
  (ii)

    Note that equation (8) can be further written as

    \boldsymbol{K}(\boldsymbol{U})=f[\boldsymbol{\Sigma}]+o_{P}(1),

    or equivalently, $\boldsymbol{K}(\boldsymbol{U})\xrightarrow{P}f[\boldsymbol{\Sigma}]$ element-wise as $m\to\infty$. Similarly, under the assumption that $\|\boldsymbol{K}(\boldsymbol{U})\|_{\textrm{op}}<\infty$ a.s., we have

    \mathbb{E}\left[\tilde{\tau}\boldsymbol{K}(\boldsymbol{U})+\boldsymbol{I}_{n}\right]\to\tilde{\tau}f[\boldsymbol{\Sigma}]+\boldsymbol{I}_{n}.

    Hence, by the Bounded Convergence Theorem and the Continuous Mapping Theorem, we have

    \boldsymbol{A}=\mathbb{E}\left[\left(\tilde{\tau}\boldsymbol{K}(\boldsymbol{U})+\boldsymbol{I}_{n}\right)^{-1}\right]\to\left(\tilde{\tau}f[\boldsymbol{\Sigma}]+\boldsymbol{I}_{n}\right)^{-1},\quad\textrm{as }m\to\infty,

    which shows that as $m\to\infty$,

    R\simeq\boldsymbol{y}^{T}\left(\tilde{\tau}f\left[\sum_{l=1}^{L}\xi_{l}\boldsymbol{K}_{l}(\boldsymbol{X})\right]+\boldsymbol{I}_{n}\right)^{-2}\boldsymbol{y}. ∎
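As a numerical illustration of the almost sure convergence of $\frac{1}{m}\boldsymbol{U}\boldsymbol{U}^{T}$ used in part (i) above (our own addition; the base kernels and the weights $\xi_{l}$ are synthetic), the sketch below shows the entrywise error shrinking as $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 30
xi = np.array([0.7, 1.3])

# Two synthetic PSD base kernel matrices playing the role of K_l(X).
A1 = rng.standard_normal((n, n)); K1 = A1 @ A1.T / n
A2 = rng.standard_normal((n, n)); K2 = A2 @ A2.T / n
target = xi[0] * K1 + xi[1] * K2

Lc = np.linalg.cholesky(target + 1e-10 * np.eye(n))
for m in (10, 100, 1000, 10000):
    U = Lc @ rng.standard_normal((n, m))      # columns u_i ~ N(0, xi1*K1 + xi2*K2)
    K_U = U @ U.T / m                         # (1/m) U U^T
    print(m, np.max(np.abs(K_U - target)))    # entrywise error decreases with m
```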

B.3 Proof of Proposition 2

Proof.

The result follows by noting that

\textrm{APE}_{\textrm{LMM}}=\mathbb{E}\left[(\boldsymbol{y}-\mathbb{E}[\boldsymbol{a}|\boldsymbol{y}])^{T}(\boldsymbol{y}-\mathbb{E}[\boldsymbol{a}|\boldsymbol{y}])\right]
=\mathbb{E}\left[\boldsymbol{y}^{T}\left(\boldsymbol{I}_{n}-\tilde{\sigma}_{R}^{2}\boldsymbol{\Sigma}(\tilde{\sigma}_{R}^{2}\boldsymbol{\Sigma}+\boldsymbol{I}_{n})^{-1}\right)^{T}\left(\boldsymbol{I}_{n}-\tilde{\sigma}_{R}^{2}\boldsymbol{\Sigma}(\tilde{\sigma}_{R}^{2}\boldsymbol{\Sigma}+\boldsymbol{I}_{n})^{-1}\right)\boldsymbol{y}\right]
=\mathbb{E}\left[\boldsymbol{y}^{T}\left((\tilde{\sigma}_{R}^{2}\boldsymbol{\Sigma}+\boldsymbol{I}_{n})^{-1}\right)^{2}\boldsymbol{y}\right]
=\textrm{tr}\left[\left((\tilde{\sigma}_{R}^{2}\boldsymbol{\Sigma}+\boldsymbol{I}_{n})^{-1}\right)^{2}\left(\sigma_{R}^{2}\boldsymbol{\Sigma}+\phi\boldsymbol{I}_{n}\right)\right]
=\phi\,\textrm{tr}\left[(\tilde{\sigma}_{R}^{2}\boldsymbol{\Sigma}+\boldsymbol{I}_{n})^{-1}\right]
=\phi\sum_{i=1}^{n}\left(\tilde{\sigma}_{R}^{2}\lambda_{i}(\boldsymbol{\Sigma})+1\right)^{-1}. ∎
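A Monte Carlo sanity check of this closed form (our own addition; we take $\tilde{\sigma}_{R}^{2}=\sigma_{R}^{2}/\phi$, which is what the step from the trace expression to the final sum requires, and all other parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 40
sigma2_R, phi = 1.5, 0.7
sigma2_tilde = sigma2_R / phi          # assumed relation used in the final step above

A = rng.standard_normal((n, n))
Sigma = A @ A.T / n                    # an arbitrary PSD Sigma

# Closed form: phi * sum_i 1 / (sigma2_tilde * lambda_i(Sigma) + 1)
lam = np.linalg.eigvalsh(Sigma)
closed_form = phi * np.sum(1.0 / (sigma2_tilde * lam + 1.0))

# Monte Carlo: E[ y^T ((sigma2_tilde*Sigma + I)^{-1})^2 y ] with y ~ N(0, sigma2_R*Sigma + phi*I)
M = np.linalg.inv(sigma2_tilde * Sigma + np.eye(n))
Ly = np.linalg.cholesky(sigma2_R * Sigma + phi * np.eye(n))
reps = 50_000
Y = Ly @ rng.standard_normal((n, reps))
vals = np.einsum('ir,ir->r', M @ M @ Y, Y)
print(closed_form, vals.mean())        # the two numbers should agree closely
```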

Appendix C More Simulation Results

Figure 7 demonstrates the performance of LMM and KNN in terms of prediction error when the inverse logistic function and the polynomial function of order 2 are used.

Figure 7: Boxplots of prediction errors for linear mixed models (LMM) and kernel neural networks (KNN). The left panel shows the results when an inverse logistic function is used and the right panel shows the results when a polynomial function of order 2 is used. On the horizontal axis, “1” corresponds to the LMM; “2” to the KNN with product input kernel and product output kernel; “3” to the KNN with product input and polynomial output; “4” to the KNN with polynomial input and product output; and “5” to the KNN with polynomial input and polynomial output.

References

  • Boucheron et al. (2004) Stéphane Boucheron, Gábor Lugosi, and Olivier Bousquet. Concentration inequalities. In Advanced Lectures on Machine Learning, pages 208–240. Springer, 2004.
  • Buldygin and Kozachenko (2000) Valeriĭ Vladimirovich Buldygin and IU V Kozachenko. Metric characterization of random variables and random processes, volume 188. American Mathematical Soc., 2000.
  • Bush and Moore (2012) William S Bush and Jason H Moore. Genome-wide association studies. PLoS computational biology, 8(12):e1002822, 2012.
  • Corbeil and Searle (1976) Robert R Corbeil and Shayle R Searle. Restricted maximum likelihood (REML) estimation of variance components in the mixed model. Technometrics, 18(1):31–38, 1976.
  • Gentle (2007) James E Gentle. Matrix algebra: theory, computations, and applications in statistics. Springer Science & Business Media, 2007.
  • Hiai (2009) Fumio Hiai. Monotonicity for entrywise functions of matrices. Linear Algebra and its Applications, 431(8):1125–1146, 2009.
  • Hoeffding (1963) Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American statistical association, 58(301):13–30, 1963.
  • Horn and Johnson (2012) Roger A. Horn and Charles R. Johnson. Matrix Analysis. Cambridge University Press, 2 edition, 2012. doi: 10.1017/9781139020411.
  • Inglot (2010) Tadeusz Inglot. Inequalities for quantiles of the chi-square distribution. Probability and Mathematical Statistics, 30(2):339–351, 2010.
  • Kackar and Harville (1981) Raghu N Kackar and David A Harville. Unbiasedness of two-stage estimation and prediction procedures for mixed linear models. Communications in statistics-theory and methods, 10(13):1249–1261, 1981.
  • Ledoux (2005) Michel Ledoux. The concentration of measure phenomenon. Number 89. American Mathematical Soc., 2005.
  • McCulloch et al. (2008) Charles E. McCulloch, S. R. Searle, and John M. Neuhaus. Generalized, linear, and mixed models. Wiley, Hoboken, N.J, 2nd edition, 2008. ISBN 0470073713;9780470073711;.
  • Rao (1970) C Radhakrishna Rao. Estimation of heteroscedastic variances in linear models. Journal of the American Statistical Association, 65(329):161–172, 1970.
  • Rao (1971) C Radhakrishna Rao. Estimation of variance and covariance components—MINQUE theory. Journal of Multivariate Analysis, 1(3):257–275, 1971.
  • Rao (1972) C Radhakrishna Rao. Estimation of variance and covariance components in linear models. Journal of the American Statistical Association, 67(337):112–115, 1972.
  • Shawe-Taylor et al. (2004) John Shawe-Taylor, Nello Cristianini, et al. Kernel methods for pattern analysis. Cambridge university press, 2004.
  • Wu et al. (2011) Michael C Wu, Seunggeun Lee, Tianxi Cai, Yun Li, Michael Boehnke, and Xihong Lin. Rare-variant association testing for sequencing data with the sequence kernel association test. The American Journal of Human Genetics, 89(1):82–93, 2011.
  • Yang et al. (2011) Jian Yang, S Hong Lee, Michael E Goddard, and Peter M Visscher. GCTA: a tool for genome-wide complex trait analysis. The American Journal of Human Genetics, 88(1):76–82, 2011.