
Projective Resampling Imputation Mean Estimation Method for Missing Covariates Problem

Zishu Zhan1,2, Xiangjie Li3,∗, Jingxiao Zhang1,2,∗
Abstract

Missing data are a common problem in clinical data collection and complicate the statistical analysis of such data. To overcome the problems caused by incomplete data, we propose a new imputation method called projective resampling imputation mean estimation (PRIME), which also addresses the "curse of dimensionality" in imputation with less information loss. We use various sample sizes, missing-data rates, covariate correlations, and noise levels in simulation studies, and all results show that PRIME outperforms other methods such as iterative least-squares estimation (ILSE), maximum likelihood (ML), and complete-case analysis (CC). Moreover, we conduct a study of influential factors in cardiac surgery-associated acute kidney injury (CSA-AKI), which shows that our method performs better than the alternatives. Finally, we prove that PRIME is consistent under some regularity conditions.

Keywords : projective resampling, missing covariates problem, imputation method, linear regression

1Center for Applied Statistics, Renmin University of China,
2 School of Statistics, Renmin University of China,
3 State Key Laboratory of Cardiovascular Disease, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College

1 Introduction

In medical research, an investigator's ultimate interest may be in inferring prognostic markers, given the patients' genetic, cytokine, and/or environmental backgrounds [1, 2]. However, in practical applications, data are often missing. The most common approach to address missing-data problems is complete-case analysis (CC), which is simple but inefficient. CC can also lead to biased estimates when the data are not missing completely at random. The maximum likelihood method (ML [3, 4]) and the inverse probability weighting method (IPW [5, 6]) are also widely used approaches to address missing data. However, likelihood-based methods are sensitive to model assumptions, and re-weighting methods do not always make full use of the available data. Alternatively, imputation [7, 8] is a more flexible approach to cope with missing data.

However, a preliminary analysis of the cardiac surgery-associated acute kidney injury (CSA-AKI) data used in Chen et al. [2] indicates that the missing-data patterns vary across individuals. Accordingly, new and more capable quantitative methods are needed for such individual-specific missing data. Furthermore, it is common for only a small fraction of records to have complete information across all sources, and existing methods do not work well when the percentage of unavailable data is high. To estimate the coefficients (rather than predict them), Lin et al. [9] proposed the iterative least-squares estimation (ILSE) method to deal with individual-specific missing-data patterns within the classical regression framework, but it requires a complete set of observations to obtain the initial values, and it may fail to converge when the initial values are poor. Furthermore, the method of Lin et al. [9] may have difficulty accommodating data in which both important and unimportant variables are missing.

In this study, based on the idea of projection resampling/random projection, we propose Projection Resampling Imputation Mean Estimation (PRIME), a method that tackles the aforementioned drawbacks of existing methods. The key idea behind projection resampling/random projection comes from the Johnson-Lindenstrauss lemma [10], which states that pairwise distances are approximately preserved when a set of points is projected onto a randomly chosen low-dimensional subspace. There are several previous studies on projection resampling/random projection for dimension reduction, including Schulman [11] for clustering, Donoho [12] for signal processing, Shi et al. [13] for classification, Maillard and Munos [14] for linear regression, and Le et al. [15] for kernel approximation. Specifically, the idea of PRIME is to project the covariates along randomly sampled directions to obtain samples of scalar-valued predictors and kernels (dimension reduction). Next, a simple geometric average is taken over the scalar-predictor-based kernels to impute the missing parts (using all-sided information). Our method has several advantages. First, PRIME can deal with a high degree of missing data, even data containing no complete observations, whereas most existing methods require at least a fraction of the subjects to have fully complete observations. Second, we can average the imputed estimates from multiple projection directions and fully utilize the available information to reach a more reliable and useful result. Third, to reduce the undesirable influence of unimportant variables, PRIME can be easily extended to sparse PRIME (denoted as SPRIME), which is important in practical applications where many candidate covariates are irrelevant.

The remainder of the paper is organized as follows. Section 2 introduces the basic setup of PRIME and SPRIME. Theoretical properties are discussed in Section 3. Sections 4 and 5 present the numerical results using simulated and real data examples, respectively. Section 6 presents some concluding remarks. In addition, our proposed method is implemented in R, and the scripts to reproduce our results are available at https://github.com/eleozzr/PRIME. The proofs of the theorems are available in the Appendix.

2 Projective Resampling Imputation Mean Estimation (PRIME)

2.1 Model and estimation by PRIME

In this paper, let $\bm{Y}=(Y_{1},Y_{2},\cdots,Y_{n})^{\top}$ be the response variable of interest and $\bm{X}=\{X_{ij}:i=1,2,\cdots,n,\ j=1,2,\cdots,p\}$ be the covariate matrix. We assume that $n>p$. We indicate missing covariates by $r_{ij}$, the missing-data indicator for $X_{ij}$: $r_{ij}=1$ if $X_{ij}$ is missing and $r_{ij}=0$ otherwise. For each unit $i$, $A_{i}=\{j:r_{ij}=0,\ j=1,2,\cdots,p\}$ denotes the set of available covariates, e.g., $A_{i}=\{1,2,\cdots,p\}$ for a complete case.

In this study, we focus on a linear regression model. Assume that the random sample $\{(Y_{i},\bm{X}_{i}):i=1,2,\cdots,n\}$ is generated by

Y_{i}=\sum_{j=1}^{p}\beta_{j}X_{ij}+\varepsilon_{i}=\bm{X}_{i}^{\top}\bm{\beta}+\varepsilon_{i}, \qquad (1)

where $\bm{\beta}=(\beta_{1},\beta_{2},\cdots,\beta_{p})^{\top}$ is the coefficient vector for the covariates and the $\varepsilon_{i}$'s are independent and identically distributed random errors. We assume $E(\varepsilon_{i}|\bm{X}_{i})=0$ and $E(\varepsilon_{i}^{2}|\bm{X}_{i})=\sigma^{2}$.

The covariates can be divided into two parts based on $A_{i}$: $\bm{X}_{i,A_{i}}=(X_{ij}:j\in A_{i})^{\top}$ for the observed covariates and $\bm{X}_{i,\bar{A}_{i}}=(X_{ij}:j\notin A_{i})^{\top}$ for the missing covariates. Thus, equation (1) can be expressed as

Y_{i}=\bm{X}_{i,A_{i}}^{\top}\bm{\beta}_{A_{i}}+\bm{X}_{i,\bar{A}_{i}}^{\top}\bm{\beta}_{\bar{A}_{i}}+\varepsilon_{i},

where $\bm{\beta}_{A_{i}}$ and $\bm{\beta}_{\bar{A}_{i}}$ denote the regression coefficients for the observed and missing covariates, respectively. Because $\bm{X}_{i,\bar{A}_{i}}$ is unobserved, following Lin et al. [9], we take the expectation of $Y_{i}$ given the observed covariates and obtain the following equation:

E(Y_{i}|\bm{X}_{i,A_{i}})=\bm{X}_{i,A_{i}}^{\top}\bm{\beta}_{A_{i}}+E(\bm{X}_{i,\bar{A}_{i}}^{\top}\bm{\beta}_{\bar{A}_{i}}|\bm{X}_{i,A_{i}}). \qquad (2)

Hence, by equation (2), we can impute the incomplete part using the information in $\bm{X}_{i,A_{i}}$ to obtain an estimator for $\bm{\beta}$. We use the following estimator for the missing components of the covariates of unit $i$:

\tilde{X}_{ij}=\frac{\sum_{i^{\prime}=1}^{n}I(A_{i^{\prime}}\supset A_{i}\cup j)\,X_{i^{\prime}j}\,K_{h}(\bm{X}_{i^{\prime},A_{i}}-\bm{X}_{i,A_{i}})}{\sum_{i^{\prime}=1}^{n}I(A_{i^{\prime}}\supset A_{i}\cup j)\,K_{h}(\bm{X}_{i^{\prime},A_{i}}-\bm{X}_{i,A_{i}})},\quad j\notin A_{i}. \qquad (3)

In equation (3), $K_{h}(\cdot)=K(\cdot/h)/h$, where $K(\cdot)$ is a kernel function and $h$ is a bandwidth. In this way, we can make use of the available information as fully as possible. To tackle the "curse of dimensionality", we further transform the estimator in equation (3) using the projective resampling method. As shown in Figure 1, for subject $i^{\prime}$, we project $\bm{X}_{i^{\prime},A_{i}}$ onto random directions $\{\bm{v}_{b,A_{i}}\in\mathcal{R}^{|A_{i}|},\ b=1,2,\cdots,B\}$, where $|A|$ denotes the cardinality of a set $A$; we then obtain $B$ kernel values from the resulting scalars $\{\bm{X}_{i^{\prime},A_{i}}^{\top}\bm{v}_{b,A_{i}},\ b=1,2,\cdots,B\}$ and finally combine the $B$ kernels through their geometric mean:

\hat{X}_{ij}=\frac{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A_{i}\cup j\right)X_{i^{\prime}j}\left[\prod_{b=1}^{B}K_{h}\left(\bm{X}_{i^{\prime},A_{i}}^{\top}\bm{v}_{b,A_{i}}-\bm{X}_{i,A_{i}}^{\top}\bm{v}_{b,A_{i}}\right)\right]^{\frac{1}{B}}}{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A_{i}\cup j\right)\left[\prod_{b=1}^{B}K_{h}\left(\bm{X}_{i^{\prime},A_{i}}^{\top}\bm{v}_{b,A_{i}}-\bm{X}_{i,A_{i}}^{\top}\bm{v}_{b,A_{i}}\right)\right]^{\frac{1}{B}}},\quad j\notin A_{i}, \qquad (4)

where $\bm{v}_{b,A_{i}}$ is a random vector whose entries $v_{b,A_{i},j}$ are chosen independently from a distribution $\mathcal{D}$ that is symmetric about the origin with $E(v_{b,A_{i},j}^{2})=1$. In practice, we usually generate $v_{b,A_{i},j}$ from $N(0,1)$ or $U(-1,1)$, but random vectors from other distributions can also be used.

Figure 1: Random projection.

Applying the above imputation strategy, we obtain $\bm{Z}_{i}$ by using the observed data for part $A_{i}$ and the imputed data for part $\bar{A}_{i}$:

\bm{Z}_{i}=\left(X_{ij}I(j\in A_{i})+\hat{X}_{ij}I(j\notin A_{i}):j=1,2,\cdots,p\right)^{\top}. \qquad (5)

Therefore, we propose the following estimating equation:

U(\bm{\beta})=\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\{Y_{i}-\bm{Z}_{i}^{\top}\bm{\beta}\}. \qquad (6)

Thus, the estimator of the regression coefficients has the closed form

\hat{\bm{\beta}}=\left\{\sum_{i=1}^{n}\bm{Z}_{i}\bm{Z}_{i}^{\top}\right\}^{-1}\sum_{i=1}^{n}\bm{Z}_{i}Y_{i}.

Specifically, we propose the following algorithm:

Algorithm 1: PRIME

Input: $\{(Y_{i},\bm{X}_{i},A_{i}):i=1,2,\cdots,n\}$
Output: $\hat{\bm{\beta}}$

1: Sample the entries of $\bm{v}_{b,A_{i}}\in\mathcal{R}^{|A_{i}|}$ $(b=1,2,\cdots,B)$ i.i.d. from $N(0,1)$ or $U(-1,1)$.
2: for $1\leq i\leq n$ do
3:     apply equation (4) to obtain $\hat{X}_{ij}$ for every $j\notin A_{i}$;
4:     apply equation (5) to obtain the imputed data $\bm{Z}_{i}$ based on $\bm{v}_{b,A_{i}}$ $(b=1,2,\cdots,B)$.
5: end for
6: Solve the closed-form equation to obtain $\hat{\bm{\beta}}$.
7: return $\hat{\bm{\beta}}$
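To make Algorithm 1 concrete, below is a minimal R sketch of the imputation in equation (4) and the closed-form estimator, assuming a Gaussian kernel and standard normal projection directions. The function name prime and the fallback to a column mean when no donor shares the required pattern are our own illustrative choices; this is not the authors' released implementation (see the repository linked in Section 1).

# Minimal R sketch of PRIME (Algorithm 1); missing entries of X are coded as NA.
prime <- function(X, Y, B = 100, h = nrow(X)^(-1/3)) {
  n <- nrow(X); p <- ncol(X)
  obs <- !is.na(X)                        # obs[i, j] is TRUE iff j is in A_i
  Kh  <- function(u) dnorm(u / h) / h     # K_h(u) = K(u/h)/h with a Gaussian K
  Z <- X
  for (i in 1:n) {
    Ai   <- which(obs[i, ])
    miss <- which(!obs[i, ])
    if (length(miss) == 0) next
    V <- matrix(rnorm(length(Ai) * B), nrow = length(Ai), ncol = B)  # directions v_1, ..., v_B
    proj_i <- drop(X[i, Ai, drop = FALSE] %*% V)                     # scalars X_{i,A_i}^T v_b
    for (j in miss) {
      donors <- which(apply(obs[, c(Ai, j), drop = FALSE], 1, all))  # A_{i'} contains A_i and j
      if (length(donors) == 0) { Z[i, j] <- mean(X[, j], na.rm = TRUE); next }  # illustrative fallback
      proj_d <- X[donors, Ai, drop = FALSE] %*% V
      # geometric mean of the B kernel values, computed on the log scale for stability
      w <- exp(rowMeans(log(Kh(sweep(proj_d, 2, proj_i)) + 1e-300)))
      Z[i, j] <- sum(w * X[donors, j]) / sum(w)                      # equation (4)
    }
  }
  beta_hat <- solve(crossprod(Z), crossprod(Z, Y))                   # closed-form least-squares step
  list(beta = drop(beta_hat), Z = Z)
}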

2.2 Simultaneous fitting and selection by sparse PRIME

Because the number of disease-associated biomarkers is not expected to be large, it is of great importance to take the sparsity assumption into account when not all variables contribute to the outcome. Hence, we assume that the linear regression model in equation (1) is sparse and define the index sets of the active and inactive predictors by $\mathcal{I}_{1}=\{j:\beta_{j}\neq 0\}$ and $\mathcal{I}_{2}=\{j:\beta_{j}=0\}$, respectively. Our practical goal is to identify which biomarkers in the CSA-AKI dataset are disease-related and to estimate the corresponding coefficients. The main idea of sparse PRIME is to replace the estimating equation (6) with the following penalized estimating equations:

\left\{\begin{aligned}
\sum_{i=1}^{n}Z_{i1}\{Y_{i}-\bm{Z}_{i}^{\top}\bm{\beta}\}+\lambda_{n}\gamma|\beta_{1}|^{\gamma-1}\mathrm{sign}(\beta_{1})&=0,\\
&\ \vdots\\
\sum_{i=1}^{n}Z_{ip}\{Y_{i}-\bm{Z}_{i}^{\top}\bm{\beta}\}+\lambda_{n}\gamma|\beta_{p}|^{\gamma-1}\mathrm{sign}(\beta_{p})&=0,
\end{aligned}\right. \qquad (7)

where $\lambda_{n}\gamma|\beta_{j}|^{\gamma-1}\mathrm{sign}(\beta_{j})$ is the partial derivative of the penalty function with respect to $\beta_{j}$. The least absolute shrinkage and selection operator (LASSO) estimator corresponds to $\gamma=1$. Optimizing the functions in (7) with $\gamma=1$ is computationally cumbersome because they are non-differentiable. Fortunately, the shooting algorithm proposed by Fu [16] can be used to compute the LASSO estimator. Moreover, Fu [16, 17] proved that the unique solution of (7) is equivalent to the minimizer of the following penalized objective function:

\min_{\bm{\beta}}\frac{1}{2n}\sum_{i=1}^{n}\left\{Y_{i}-\bm{Z}_{i}^{\top}\bm{\beta}\right\}^{2}+\frac{\lambda_{n}}{n}\sum_{j=1}^{p}|\beta_{j}|^{\gamma}.

We can solve this penalized regression problem and obtain a sparse estimator $\hat{\bm{\beta}}_{s}$ using the glmnet package in R.
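As an illustration, the sparse estimator with $\gamma=1$ (the LASSO case) can be obtained from the PRIME-imputed design matrix with glmnet; selecting $\lambda_{n}$ by cross-validation is our own illustrative choice rather than part of the proposed method.

# SPRIME sketch: LASSO (gamma = 1) on the imputed design matrix Z from prime().
library(glmnet)
sprime <- function(Z, Y) {
  cvfit  <- cv.glmnet(Z, Y, alpha = 1, intercept = FALSE, standardize = FALSE)
  beta_s <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]   # drop the intercept slot
  beta_s
}
# Example: fit <- prime(X, Y); beta_sparse <- sprime(fit$Z, Y)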

To decrease the processing time, we can use the following sparse random projection with i.i.d. entries:

v_{b,A_{i},j}=\sqrt{s_{i}}\times\begin{cases}1,&\text{with probability }\frac{1}{2s_{i}},\\ 0,&\text{with probability }1-\frac{1}{s_{i}},\\ -1,&\text{with probability }\frac{1}{2s_{i}},\end{cases}

where Achlioptas [18] used $s_{i}=1$ or $s_{i}=3$, and Li et al. [19] showed that one can use $s_{i}=\sqrt{|A_{i}|}$ or even $s_{i}=\frac{|A_{i}|}{\log|A_{i}|}$ to significantly reduce the computing time with little loss of accuracy.
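A sketch of how such sparse directions can be generated is given below; the default $s=\sqrt{d}$ follows Li et al. [19], and the function name is ours.

# Generate a d x B matrix of sparse random projection directions v_b.
sparse_projection <- function(d, B, s = sqrt(d)) {
  vals <- sample(c(1, 0, -1), size = d * B, replace = TRUE,
                 prob = c(1 / (2 * s), 1 - 1 / s, 1 / (2 * s)))
  sqrt(s) * matrix(vals, nrow = d, ncol = B)
}
# Example: V <- sparse_projection(d = length(Ai), B = 50)  # drop-in for the rnorm draws above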

3 Theoretical properties

We study the consistency of the projective resampling least-squares estimator $\hat{\bm{\beta}}$. Denote the true value of $\bm{\beta}$ by $\bm{\beta}_{0}$. We make the following assumptions:

(A1) $h=O(n^{-l})$ with $1/4<l<1/2$;

(A2) $A_{i}\perp\bm{X}_{i}$;

(A3) the kernel function $K(\cdot)$ is a symmetric density function with compact support $[0,1]$ and a bounded derivative;

(A4) $\bm{\beta}_{0}\in\mathcal{B}$, where $\mathcal{B}$ is a bounded set;

(A5) for a missing-data pattern $A$, let $e_{j,A}(\bm{w}_{A})=E\left(X_{ij}|\bm{X}_{i,A}=\bm{w}_{A}\right)$ be the conditional expectation of $X_{ij}$ and $f_{A}(\bm{w}_{A})$ be the density of $\bm{X}_{i,A}$; assume $e_{j,A}(\bm{w}_{A})$ and $f_{A}(\bm{w}_{A})$ have continuous second derivatives with respect to $\bm{w}_{A}$ on the corresponding support;

(A6) $E(v^{4}_{bj})<\infty$ $(1\leq j\leq p)$;

(A7) $\bm{X}_{i}$ is bounded for all $i\geq 1$, and the limit $\lim_{n\to\infty}\sum_{i=1}^{n}\bm{X}_{i}/n=\bm{X}_{0}$ exists;

(A8) $D_{n}=\frac{1}{n}\sum_{i=1}^{n}\bm{X}_{i}\bm{X}_{i}^{\top}\to D$ and $\frac{1}{n}\max_{1\leq i\leq n}\bm{X}_{i}^{\top}\bm{X}_{i}\to 0$, where $D$ is a finite and positive definite matrix;

(A9) there exists a constant $C_{0}>0$ such that $\inf_{\bm{w}_{A}}P_{r}(A_{i^{\prime}}\supset A\cup j)f_{A}(\bm{w}_{A})>C_{0}$.

Assumptions (A1)–(A5) are the same as the conditions in Lin et al. [9]. Specifically, Assumption (A1) requires under-smoothing to obtain a root-$n$ consistent estimator, which is a commonly used regularity assumption in semiparametric regression. Assumption (A2) addresses the missing-data mechanism and ensures that the PRIME estimator is consistent; as mentioned in Lin et al. [9], it is weaker than assuming the data are missing completely at random. Assumptions (A3)–(A5) are standard in nonparametric regression. Assumption (A3) is achieved when the kernel function is the Gaussian kernel, but it is more general. Assumption (A6) is a moment bound required in Arriaga and Vempala [20] and Li et al. [19] and is a necessary technical condition. Assumptions (A7) and (A8) are commonly used in penalized estimation problems [16, 17]. Assumption (A9), $\inf_{\bm{w}_{A}}P_{r}(A_{i^{\prime}}\supset A\cup j)f_{A}(\bm{w}_{A})>C_{0}$, ensures that enough samples are used to estimate $e_{j,A}(\bm{w}_{A})$.

Theorem 1.

Suppose Assumptions (A1)–(A6) hold and $j\in\mathcal{I}_{1}=\{j:\beta_{j}\neq 0\}$, $j=1,2,\cdots,p$. Then, as $n\to\infty$, $\hat{\bm{\beta}}\to\bm{\beta}_{0}$ in probability.

Theorem 2.

Suppose Assumptions (A1)–(A8) hold, $j\in\mathcal{I}_{1}=\{j:\beta_{j}\neq 0\}$ for $j=1,2,\cdots,q$, $j\in\mathcal{I}_{2}=\{j:\beta_{j}=0\}$ for $j=q+1,\cdots,p$, and the tuning parameter satisfies $\lambda_{n}=o(\sqrt{n})$. Then, as $n\to\infty$, $\hat{\bm{\beta}}_{s}\to\bm{\beta}_{0}$ in probability.

4 Simulation

In this section, we consider several simulated scenarios to highlight the properties of PRIME in contrast to some other methods. We experimentally investigate the performance of the following methods:

Full: the least-squares estimator based on the full data, used as a benchmark;

PRIME: the proposed method;

ILSE: the iterative least-squares method in Lin et al. [9];

ML: the maximum likelihood method proposed in Jiang et al. [4];

CC: the complete-case analysis method.

For each model setting with a specific choice of parameters, we repeat the simulation 100 times and evaluate the performance of models using the normalized absolute distance (NAD) and the mean squared error (MSE), defined as follows:

{\rm NAD}_{j}=\frac{1}{N}\sum_{i=1}^{N}\frac{|\hat{\beta}_{j}^{(i)}-\beta_{0j}|}{\beta_{0j}},\quad j=1,2,\cdots,p,
{\rm MSE}=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{p}\sum_{j=1}^{p}(\hat{\beta}_{j}^{(i)}-\beta_{0j})^{2}.

In addition, we calculate the optimal MSE rate, defined as the proportion of repetitions in which each method (except Full) produces the smallest MSE. The MSE based on $N$ repetitions can be partitioned as $\mathrm{MSE}=\mathrm{Variance}+\mathrm{Bias}^{2}$:

\frac{1}{N}\sum_{i=1}^{N}\frac{1}{p}\sum_{j=1}^{p}(\hat{\beta}_{j}^{(i)}-\beta_{0j})^{2}=\frac{1}{p}\sum_{j=1}^{p}\frac{1}{N}\sum_{i=1}^{N}(\hat{\beta}_{j}^{(i)}-\bar{\beta}_{j})^{2}+\frac{1}{p}\sum_{j=1}^{p}(\bar{\beta}_{j}-\beta_{0j})^{2},

where $\bar{\beta}_{j}=\frac{1}{N}\sum_{i=1}^{N}\hat{\beta}_{j}^{(i)}$ and $N=100$ in this simulation study. For simplicity, we set the bandwidth $h=n^{-1/3}$ for PRIME, SPRIME, and ILSE. In the following sections, we compare the methods under various settings for sample size, missing-data rate, noise level, and feature correlation.
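For reference, the criteria above can be computed as follows; this sketch assumes that beta_hat is an $N\times p$ matrix of estimates across repetitions and beta0 is the true coefficient vector.

# Evaluation metrics used in the simulations.
nad <- function(beta_hat, beta0) colMeans(abs(sweep(beta_hat, 2, beta0))) / beta0
mse <- function(beta_hat, beta0) mean(sweep(beta_hat, 2, beta0)^2)
# MSE = Variance + Bias^2 decomposition
bias_var <- function(beta_hat, beta0) {
  bbar <- colMeans(beta_hat)
  c(variance = mean(sweep(beta_hat, 2, bbar)^2), bias2 = mean((bbar - beta0)^2))
}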

4.1 Scenario 1: Different noise levels

The data generation model has the linear expression

Y_{i}=\sum_{j=1}^{p}\beta_{j}X_{ij}+\varepsilon_{i},\quad i=1,2,\cdots,n,

where $n=100,200$, $p=12$, and $\bm{\beta}=(1,-0.6,1.5,1,1.2,0.4,-1,-0.7,1.3,0.5,1.1,-1.4,0.9)^{\top}$.

We generate $(X_{i1},\cdots,X_{ip})$ from the multivariate normal distribution $N_{p}(\bm{0},\bm{\Sigma})$. We set the off-diagonal elements $\rho_{ij}$ of $\bm{\Sigma}$ equal to $0.5$ and the diagonal elements equal to $1$. For $\varepsilon_{i}$, we use the error distribution $N(0,\sigma^{2})$, where $\sigma^{2}$ varies with $R^{2}=\text{Var}(\bm{X}_{i}^{\top}\bm{\beta})/\{\text{Var}(\bm{X}_{i}^{\top}\bm{\beta})+\sigma^{2}\}$. We consider the cases $R^{2}=0.1,0.2,\cdots,0.9$.
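A sketch of this data-generating process is given below; sigma^2 is derived from the target $R^{2}$ as in the text, and the function name is ours.

# Scenario 1 data generation: compound-symmetric covariates and noise level set by R^2.
library(MASS)  # for mvrnorm
gen_data <- function(n, beta, rho = 0.5, R2 = 0.7) {
  p <- length(beta)
  Sigma <- matrix(rho, p, p); diag(Sigma) <- 1
  X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  var_signal <- drop(t(beta) %*% Sigma %*% beta)   # Var(X^T beta)
  sigma2 <- var_signal * (1 - R2) / R2             # from R^2 = signal / (signal + sigma^2)
  Y <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))
  list(X = X, Y = Y)
}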

Missing data are commonly divided into three generative scenarios or assumptions. Missing completely at random (MCAR) means that the missingness is independent of the values of the data. Missing at random (MAR) means that the propensity of data to be missing depends only on the observed values, whereas missing not at random (MNAR) covers the remaining scenario, in which the missingness depends on the unobserved values (the variables that are missing). In our study, we consider situations that differ from the classical MCAR, MAR, and MNAR mechanisms.

For each sample, the variables $X_{10},X_{11},X_{12}$ are always available. Twelve "typical" missing patterns are considered; the details are shown in Table 1. We divide the 12 patterns into two groups: the first group $\bm{A}^{(1)}$ consists of $(A_{1},A_{2},\cdots,A_{6})$, and the second group $\bm{A}^{(2)}$ consists of the remaining patterns. We randomly assign the missing patterns in $\bm{A}^{(1)}$ to the sample with missing probability $P=a$. Furthermore, we set the missing probability of the $i$th unit for the patterns in $\bm{A}^{(2)}$ to $P=\{1+\exp(b\varepsilon_{i}+c)\}^{-1}$ and then randomly assign the patterns in $\bm{A}^{(2)}$ to the missing samples; a code sketch of this assignment is given after Table 2. The settings in Table 2 are used for the missing rate (MR).

Table 1: Missing patterns for all simulation examples. The table indicates, for each pattern, which of the variables $X_{1}$–$X_{12}$ are available; patterns $A_{1}$–$A_{6}$ form group $\bm{A}^{(1)}$ and patterns $A_{7}$–$A_{12}$ form group $\bm{A}^{(2)}$.
Table 2: Missing rate settings for all simulation examples.

Missing rate        $R^2$:   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
60%            a             0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
               b            -1.5   -1.5   -1.5   -1.5   -1.5   -1.5   -2     -2     -4
               c            -4     -2     -2     -1.5   -1.5   -1     -1     -1     -1
90%            a             0.75   0.75   0.75   0.75   0.75   0.75   0.75   0.75   0.65
               b            -1.5   -1.5   -2     -2     -3     -3     -3.5   -3.5   -4
               c            -4     -4     -4     -4     -4     -4     -4     -4     -4
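The following sketch shows one way to implement the missing-pattern assignment described above. Here, patterns is assumed to be a list of twelve index vectors giving, for each pattern, the covariates set to missing, and the exact interplay between the two groups (a unit first draws a group-1 pattern with probability a and otherwise may draw a group-2 pattern with the logistic probability) is our reading of the text.

# Assign missing patterns: group A^(1) with probability a, group A^(2) with
# probability 1 / (1 + exp(b * eps + c)), which depends on the error term eps.
assign_missing <- function(X, eps, patterns, a, b, c) {
  n <- nrow(X)
  miss1 <- runif(n) < a                                       # units given an A^(1) pattern
  miss2 <- !miss1 & (runif(n) < 1 / (1 + exp(b * eps + c)))   # units given an A^(2) pattern
  for (i in which(miss1)) X[i, patterns[[sample(1:6, 1)]]] <- NA
  for (i in which(miss2)) X[i, patterns[[sample(7:12, 1)]]] <- NA
  X
}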

The MSE results are shown in Figures 3, 6, 9, and 12 for $R^{2}=0.2,0.5,0.8$. The optimal MSE rates are displayed in Figures 4, 7, 10, and 13. To determine the relative performance, we rank the NADs of the five methods in each repetition. The mean NAD ranks are displayed in Figures 2, 5, 8, and 11 for $R^{2}=0.2,0.5,0.8$. The results of the cases not shown here are available in the supplementary materials; in general, they exhibit patterns similar to those shown here.

To make the comparison of Full, PRIME, and ILSE easier, the MSE bar plots for CC and ML are manually scaled because these two methods lead to much higher MSEs than the other methods. The main conclusions are as follows:

1. When $n$ and $R^{2}$ increase, the MSEs of Full, ILSE, and PRIME generally decrease, as expected. However, the relative performance of these three methods does not change.

2. Generally, PRIME outperforms ILSE and ML in terms of NAD, and all three significantly outperform CC. Surprisingly, PRIME is close or even superior to the Full method in estimating the coefficients of $X_{11}$ and $X_{12}$ when $\text{MR}=90\%$, $n=100$, and $R^{2}=0.2$, and in estimating the coefficients of $X_{10}$ and $X_{12}$ when $\text{MR}=90\%$, $n=200$, and $R^{2}=0.2$. These results confirm the superiority of the PRIME method.

3. The comparison of results shows that the proposed PRIME method performs better than the other three methods (ILSE, CC, and ML). The bias and variance decomposition figures show that ILSE produces more biased estimates than PRIME. The CC method's estimates have extremely high variance over almost the whole range of $R^{2}$, and CC's approximation error is larger when $R^{2}$ is not very high. Furthermore, the CC method produces biased estimates, as expected, because of the missing-data mechanism. The performance of ML is not stable: it is close to that of our proposed method when $n=200$, $\text{MR}=90\%$, and $R^{2}=0.8$, but ML performs worst when $n=100$, $\text{MR}=90\%$, and $R^{2}=0.8$, because ML estimators can be highly biased when the MAR assumption does not hold.

4. The optimal MSE rates show a proportion of over 40% for PRIME, which indicates that PRIME yields the smallest MSE among all competitors in more than 40% of the trials in Scenario 1. When PRIME yields the smallest MSE, ILSE or ML most often yields the second smallest; when PRIME does not yield the smallest MSE, ILSE and ML most often do. Not surprisingly, CC rarely produces the smallest MSE except when $\text{MR}=60\%$ and $R^{2}=0.9$.

Figure 2: NAD with $n=100$ and 90% missing data for different methods.
Figure 3: MSE with $n=100$ and 90% missing data for different methods.
Figure 4: Optimal rate of MSE with $n=100$ and 90% missing data for different methods.
Figure 5: NAD with $n=200$ and 90% missing data for different methods.
Figure 6: MSE with $n=200$ and 90% missing data for different methods.
Figure 7: Optimal rate of MSE with $n=200$ and 90% missing data for different methods.
Figure 8: NAD with $n=100$ and 60% missing data for different methods.
Figure 9: MSE with $n=100$ and 60% missing data for different methods.
Figure 10: Optimal rate of MSE with $n=100$ and 60% missing data for different methods.
Figure 11: NAD with $n=200$ and 60% missing data for different methods.
Figure 12: MSE with $n=200$ and 60% missing data for different methods.
Figure 13: Optimal rate of MSE with $n=200$ and 60% missing data for different methods.

4.2 Scenario 2: Varying correlation between variables

To compare the methods under different correlations, we consider the correlation between $X_{i}$ and $X_{j}$ in four situations: $p_{1}$ for $\rho_{ij}=0.2$, $p_{2}$ for $\rho_{ij}=0.5$, $p_{3}$ for $\rho_{ij}=0.8$, and $p_{4}$ for $\rho_{ij}=0.8^{|i-j|}$. Here, we set $\sigma^{2}$ such that $R^{2}=0.7$. All other aspects remain the same as in Scenario 1. For the missing rate, two settings are considered, where

  • $(a,b,c)=(0.1,-2,-1)$, so that the missing rate is approximately 60%;

  • $(a,b,c)=(0.7,-3.5,-4)$, so that the missing rate is approximately 90%.

The NAD and MSE results are shown in Figures 14, 15, 17, 18, 20, 21, 23, and 24. The optimal MSE rates are shown in Figures 16, 19, 22, and 25. The main conclusions are as follows:

1. Scenario 2 yields conclusions similar to those of Scenario 1. PRIME produces the smallest NADs and MSEs in almost all cases. As shown in Figures 21 and 24, although CC has the smallest estimation error, its estimation variance is extremely high, which causes problems in the MSE.

2. The optimal rate results show that PRIME has obvious advantages over the other methods, producing the smallest MSE in almost all situations except when $\text{MR}=90\%$, $n=200$, and $\rho_{ij}=0.8$. However, when $\rho_{ij}$ increases, the gap between PRIME and the other methods increases. CC still has the worst MSE performance among the four methods.

Figure 14: NAD with $n=100$ and 90% missing data for different methods.
Figure 15: MSE with $n=100$ and 90% missing data for different methods.
Figure 16: Optimal rate of MSE with $n=100$ and 90% missing data for different methods.
Figure 17: NAD with $n=200$ and 90% missing data for different methods.
Figure 18: MSE with $n=200$ and 90% missing data for different methods.
Figure 19: Optimal rate of MSE with $n=200$ and 90% missing data for different methods.
Figure 20: NAD with $n=100$ and 60% missing data for different methods.
Figure 21: MSE with $n=100$ and 60% missing data for different methods.
Figure 22: Optimal rate of MSE with $n=100$ and 60% missing data for different methods.
Figure 23: NAD with $n=200$ and 60% missing data for different methods.
Figure 24: MSE with $n=200$ and 60% missing data for different methods.
Figure 25: Optimal rate of MSE with $n=200$ and 60% missing data for different methods.

4.3 Scenario 3: Taking sparse structure into consideration

In this scenario, we illustrate the proposed SPRIME using simulated data. We consider the penalized Full method (denoted as SFull), complete-case analysis with a penalty (denoted as SCC), ILSE, and ML as the alternatives. ILSE in Lin et al. [9] and ML in Jiang et al. [4] were proposed without the sparsity assumption; hence, we use them directly rather than in penalized form. We acknowledge that there are other approaches, such as that in Xue and Qu [21], that can be used to address a high-dimensional missing-data problem. However, the missing-data patterns in those methods differ from the individual-specific case.

The model used to generate data has the linear expression

Y_{i}=\sum_{j=1}^{p}\beta_{j}X_{ij}+\varepsilon_{i},\quad i=1,2,\cdots,n,

where $n=200$, $p=30$, $(\beta_{1},\cdots,\beta_{12})^{\top}=(1,-0.6,1.5,1,1.2,0.4,-1,-0.7,1.3,0.5,1.1,-1.4,0.9)$, and $\beta_{j}=0$ $(j=13,\cdots,30)$. We generate $(X_{i1},\cdots,X_{ip})$ from the multivariate normal distribution $N_{p}(\bm{0},\bm{\Sigma})$. We set the off-diagonal elements $\rho_{ij}$ of $\bm{\Sigma}$ equal to $0.5$. For $\varepsilon_{i}$, we use the error distribution $N(0,\sigma^{2})$, where $\sigma^{2}$ varies with $R^{2}=\text{Var}(\bm{X}_{i}^{\top}\bm{\beta})/\{\text{Var}(\bm{X}_{i}^{\top}\bm{\beta})+\sigma^{2}\}$. We consider only the case in which $R^{2}=0.7$.

When the sparse structure is taken into consideration, the coefficients $\beta_{j}$ $(j=13,\cdots,30)$ are equal to $0$. Thus, the NAD criterion used in Scenarios 1 and 2 is no longer meaningful, so we use only the MSE and the optimal MSE rate to assess the performance of the different methods. Several conclusions can be drawn from the figures:

1. Scenario 3 yields conclusions similar to those of Scenarios 1 and 2. From Figure 26, we can see that SPRIME produces the smallest MSEs in almost all cases.

2. When $\text{MR}=60\%$, the optimal rates of MSE are 0.95, 0.01, 0.00, and 0.04 for SPRIME, ILSE, CC, and ML, respectively. When $\text{MR}=90\%$, the optimal rates of MSE are 0.94, 0.05, 0.01, and 0.00 for SPRIME, ILSE, CC, and ML, respectively. Not surprisingly, SPRIME, which accounts for the sparse structure, performs better than ILSE and ML without the penalty, which reconfirms the superiority of the PRIME-type approach.

Figure 26: MSE with $n=200$ and missing data for different methods.

5 Cardiac surgery-associated acute kidney injury study

In this example, we illustrate the proposed method by analyzing data on cardiac surgery-associated acute kidney injury (CSA-AKI). CSA-AKI is the second most common cause of acute kidney injury in the intensive care setting and sometimes leads to death [2]. Because of the general lack of effective treatment for CSA-AKI, tools or methods for earlier identification are very important for the prevention and management of the syndrome. To find more predictive biomarkers for CSA-AKI, Chen et al. [2] collected 32 plasma cytokines, including CTACK, FGFa, G-CSF, HGF, interferon-$\alpha$2, interferon-gamma (IFN-$\gamma$), IL-1$\alpha$, IL-1$\beta$, IL-2, IL-4, IL-6, IL-7, IL-8, IL-9, IL-10, IL-12p70, IL-12p40, IL-16, IL-17$\alpha$, IL-18, IP-10, MCP-1, MCP-3, M-CSF, MIF, MIG, MIP-1$\alpha$ (macrophage inflammatory protein-1$\alpha$), MIP-1$\beta$ (macrophage inflammatory protein-1$\beta$), SCF, SCGF-$\beta$, SDF-1$\alpha$, and tumor necrosis factor-$\alpha$. CSA-AKI severity is evaluated primarily by the deltaScore, which measures serum creatinine alterations before and after surgery. Serum creatinine concentrations before and after surgery are measured on an identical testing platform in the clinical laboratory of the hospital.

We use the continuous variable deltaScore as the response. For simplicity, we apply a standardizing transformation to both the response and the covariates. Furthermore, we exclude subjects with missing deltaScores because the aforementioned methods (PRIME and SPRIME) apply primarily to missing covariates. Finally, 321 patients are enrolled in the statistical analysis, of whom only approximately 60% have complete covariate information.

Because the number of related variables is not expected to be large, as in Scenario 3 of the simulation study, we use SPRIME and SCC to simultaneously select and estimate the coefficients of the factors that might shed light on the deltaScore. We also use ILSE and ML directly, without the sparsity assumption. The regression coefficient estimates obtained from the four methods are listed in Table 3. Among them, IL-8, IL-10, IFN-$\gamma$, IL-16, and MIP-1$\alpha$ were also found to be related to CSA-AKI in Chen et al. [2]. However, with real-world data it is difficult to objectively evaluate the performance of the candidate methods. Therefore, we delete the subjects with missing covariates and construct missing data manually from the complete-case CSA-AKI data. For the same reason as in Scenario 3, we consider only the MSE to evaluate estimation accuracy. However, because the true coefficients are unknown, we cannot evaluate the MSE as described in the simulation. Hence, we calculate $\text{MSE}_{\text{Full}}$ instead, as follows:

\text{MSE}_{\text{Full}}=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{p}\sum_{j=1}^{p}(\hat{\beta}_{\text{Full},j}-\hat{\beta}_{j})^{2},

where $\hat{\beta}_{\text{Full},j}$ $(j=1,2,\cdots,p)$ are the coefficients estimated using the Full method.

The missing-data patterns are set as in the simulation study, except for the missing probability function. We randomly assign the missing patterns in $\bm{A}^{(1)}$ to the sample with missing probability $P=a$. Furthermore, we set the missing probability of the $i$th unit for the patterns in $\bm{A}^{(2)}$ to $P=\{1+\exp(b\epsilon_{i}+c)\}^{-1}$, where $\epsilon_{i}=Y_{i}-\bm{X}_{i}^{\top}\hat{\bm{\beta}}_{\rm Full}$. Then, we randomly assign the patterns in $\bm{A}^{(2)}$ to the missing samples. For the missing rate, we set $(a,b,c)=(-1.5,-2,0.4)$; consequently, the missing rate is approximately 90%. This process is repeated $N=100$ times to randomly generate missing data. Here, as in the simulation, we set the bandwidth $h=n^{-1/3}$.

The $\text{MSE}_{\text{Full}}$ results are shown in Figure 27. The optimal rates of MSE are 1.00, 0.00, 0.00, and 0.00 for SPRIME, ILSE, CC, and ML, respectively. The results show that SPRIME has an advantage over the other methods in estimation accuracy, as it produces the smallest $\text{MSE}_{\text{Full}}$. The missingness mechanism and incorrect model assumptions may give rise to the worse performance of ILSE, CC, and ML.

Table 3: Regression coefficient estimates for the CSA-AKI data from the four methods (the symbol "-" means that the corresponding variable was not selected).
Variable SPRIME SCC ILSE ML
age -0.058 -0.099 -0.167 -0.012
BMI - - 0.040 0.048
hospitalized time 0.051 0.045 0.069 0.080
CTACK - 0.020 0.207 0.117
FGFa - - 0.137 0.137
G-CSF - - -0.403 -0.462
HGF - - -0.120 -0.162
IFN-$\alpha$2 - - 0.233 0.393
IFN-$\gamma$ 0.089 0.063 0.207 0.199
IL-1$\alpha$ - - -0.126 -0.113
IL-1$\beta$ - - -0.257 -0.256
IL-2 - - 0.349 0.280
IL-4 - - -0.011 -0.012
IL-6 - - -0.126 -0.148
IL-7 - - -0.165 -0.182
IL-8 0.149 0.155 0.527 0.624
IL-9 - -0.008 -0.020 -0.034
IL-10 0.040 0.023 0.043 0.064
IL-12p70 - - 0.347 0.361
IL-12p40 - - -0.029 -0.099
IL-16 0.100 0.109 0.143 0.118
IL-17$\alpha$ - - -0.322 -0.303
IL-18 - - -0.071 -0.080
IP-10 - - -0.057 -0.170
MCP-1 - - 0.081 0.060
MCP-3 - - -0.099 -0.233
M-CSF - 0.080 0.024 0.042
MIF - - -0.131 -0.146
MIG 0.024 0.009 0.169 0.342
MIP-1$\alpha$ 0.036 0.039 0.315 0.330
MIP-1$\beta$ - - -0.187 -0.165
SCF 0.175 0.143 0.166 0.175
SCGF-$\beta$ 0.093 0.0923 0.060 0.121
SDF-1$\alpha$ - - -0.096 -0.057
preLVEF - - 0.056 0.073
TNF-$\alpha$ - - -0.100 -0.071
Figure 27: MSE with real data for different methods.

6 Conclusions

In this study, we propose a projective resampling imputation mean estimation method to estimate regression coefficients when the covariates have a high rate of missingness. Our random features project the data points onto randomly chosen lines, and the resulting scalar-based kernels are averaged to yield a comprehensive result. The random lines are drawn from the standard normal distribution to reduce information loss. We experimentally evaluate the performance of the PRIME method, and the results show that the proposed method is a feasible alternative.

However, the aforementioned work has been developed for a classical setting. Extending PRIME to generalized linear models or Cox models with missing data warrants future research. Furthermore, we considered only missing covariates, even though it is common to encounter cases where both covariates and responses are missing. Hence, developing methods to address these practical issues will be the focus of our future work.

References

  • [1] Mengyun Wu, Jian Huang, and Shuangge Ma. Identifying gene-gene interactions using penalized tensor regression. Statistics in medicine, 37(4):598–610, 2018.
  • [2] Zhongli Chen, Liang Chen, Guangyu Yao, Wenbo Yang, Ke Yang, and Chenglong Xiong. Novel blood cytokine-based model for predicting severe acute kidney injury and poor outcomes after cardiac surgery. Journal of the American Heart Association, 9(22):e018004, 2020.
  • [3] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–22, 1977.
  • [4] Wei Jiang, Julie Josse, Marc Lavielle, and TraumaBase Group. Logistic regression with missing covariates—parameter estimation, model selection and prediction within a joint-modeling framework. Computational Statistics and Data Analysis, 2020.
  • [5] Shaun R Seaman and Ian R White. Review of inverse probability weighting for dealing with missing data. Statistical Methods in Medical Research, 22(3):278–295, 2013.
  • [6] BaoLuo Sun and Eric J. Tchetgen Tchetgen. On inverse probability weighting for nonmonotone missing at random data. Journal of the American Statistical Association, 113(521):369–379, 2018.
  • [7] Judith M Conn, Kung-Jong Lui, and Daniel L McGee. A model-based approach to the imputation of missing data: Home injury incidences. Statistics in Medicine, 8(3):263–266, 1989.
  • [8] Xiaoyan Yin, Daniel Levy, Christine Willinger, Aram Adourian, and Martin G Larson. Multiple imputation and analysis for high-dimensional incomplete proteomics data. Statistics in medicine, 35(8):1315–1326, 2016.
  • [9] Huazhen Lin, Wei Liu, and Wei Lan. Regression analysis with individual-specific patterns of missing covariates. Journal of Business and Economic Statistics, 19(3):231–253, 2019.
  • [10] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
  • [11] Leonard J. Schulman. Clustering for edge-cost minimization (extended abstract). Proceedings of the thirty-second annual ACM symposium on Theory of computing, pages 547–555, 2000.
  • [12] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52:1289–1306, 2006.
  • [13] Qinfeng Shi, Hanxi Li, and Chunhua Shen. Rapid face recognition using hashing. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2753–2760, 2010.
  • [14] Odalric-Ambrym Maillard and Rémi Munos. Linear regression with random projections. Journal of Machine Learning Research, 13(1):2735–2772, 2012.
  • [15] Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood: approximating kernel expansions in loglinear time. ICML’13 Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, 2013.
  • [16] Wenjiang J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.
  • [17] Wenjiang J Fu. Penalized estimating equations. Biometrics, 59(1):126–132, 2003.
  • [18] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.
  • [19] Ping Li, Trevor J Hastie, and Kenneth W Church. Very sparse random projections. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 287–296, 2006.
  • [20] Rosa I Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. Machine Learning, 16:161–182, 2006.
  • [21] Fei Xue and Annie Qu. Integrating multisource block-wise missing data in model selection. Journal of the American Statistical Association, pages 1–14, 2020.
  • [22] Kani Chen, Shaojun Guo, Liuquan Sun, and Jane-Ling Wang. Global partial likelihood for nonparametric proportional hazards models. Journal of the American Statistical Association, 105(490):750–760, 2010.

Appendix

Proof of Theorem 1

Proof.

Following the proof of Lin et al. [9], we first define

\widehat{E}_{j,A}(\bm{w}_{A})=\frac{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)X_{i^{\prime}j}\left[\prod_{b=1}^{B}K_{h}\left(\bm{X}_{i^{\prime},A}^{\top}\bm{v}_{b,A}-\bm{w}_{A}^{\top}\bm{v}_{b,A}\right)\right]^{\frac{1}{B}}}{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left[\prod_{b=1}^{B}K_{h}\left(\bm{X}_{i^{\prime},A}^{\top}\bm{v}_{b,A}-\bm{w}_{A}^{\top}\bm{v}_{b,A}\right)\right]^{\frac{1}{B}}},\quad j\notin A,

and then

\bm{Z}_{i}=\left(X_{ij}I(j\in A_{i})+\widehat{E}_{j,A_{i}}(\bm{X}_{i,A_{i}})I(j\notin A_{i}):1\leq j\leq p\right)^{\top},
U(\bm{\beta})=\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\{Y_{i}-\bm{Z}_{i}^{\top}\bm{\beta}\}.

Let

\bm{z}_{i}=\left(X_{ij}I(j\in A_{i})+e_{j,A_{i}}(\bm{X}_{i,A_{i}})I(j\notin A_{i}):1\leq j\leq p\right)^{\top},
u(\bm{\beta})=E\left[\bm{z}_{i}\left\{\bm{X}_{i}^{\top}\bm{\beta}_{0}-\bm{z}_{i}^{\top}\bm{\beta}\right\}\right].

Similar to Lin et al. [9], we first prove that $\bm{\beta}_{0}$ is a solution of $u(\bm{\beta})=0$. We have

E(Y_{i}|\bm{X}_{i,A_{i}},A_{i})=\bm{z}_{i}^{\top}\bm{\beta}_{0},

and we also have $E(Y_{i}|\bm{X}_{i,A_{i}},A_{i})=E(\bm{X}_{i}^{\top}|\bm{X}_{i,A_{i}},A_{i})\bm{\beta}_{0}$; hence, $E\{(\bm{X}_{i}-\bm{z}_{i})^{\top}|\bm{X}_{i,A_{i}},A_{i}\}\bm{\beta}_{0}=0$. Noting that $\bm{z}_{i}$ is a function of $(\bm{X}_{i,A_{i}},A_{i})$, we obtain

u(\bm{\beta}_{0})=E\left\{\bm{z}_{i}\left(\bm{X}_{i}^{\top}\bm{\beta}_{0}-\bm{z}_{i}^{\top}\bm{\beta}_{0}\right)\right\}=0. \qquad (8)

Similar to Lin et al. [9], it is easy to show that $\bm{\beta}_{0}$ is the unique solution of $u(\bm{\beta})=0$.

To prove Theorem 1, it suffices to show that

\sup_{\bm{\beta}\in\mathcal{B}}\|U(\bm{\beta})-u(\bm{\beta})\|_{2}=o_{p}(1). \qquad (9)

We rewrite

U(\bm{\beta})-u(\bm{\beta})=U_{1}-U_{2}+U_{3}+U_{4},

where

\begin{aligned}
U_{1}&=\frac{1}{n}\sum_{i=1}^{n}(\bm{Z}_{i}-\bm{z}_{i})Y_{i},\\
U_{2}&=\frac{1}{n}\sum_{i=1}^{n}(\bm{Z}_{i}\bm{Z}_{i}^{\top}-\bm{z}_{i}\bm{z}_{i}^{\top})\bm{\beta},\\
U_{3}&=\frac{1}{n}\sum_{i=1}^{n}\left[\bm{z}_{i}\{\bm{X}_{i}^{\top}\bm{\beta}_{0}-\bm{z}_{i}^{\top}\bm{\beta}\}-E\{\bm{z}_{i}(\bm{X}_{i}^{\top}\bm{\beta}_{0}-\bm{z}_{i}^{\top}\bm{\beta})\}\right],\\
U_{4}&=\frac{1}{n}\sum_{i=1}^{n}\bm{z}_{i}\varepsilon_{i0}.
\end{aligned}

Then $\sup_{\bm{\beta}\in\mathcal{B}}\|U_{3}\|_{2}=o_{p}(1)$ and $\sup_{\bm{\beta}\in\mathcal{B}}\|U_{4}\|_{2}=o_{p}(1)$ follow from the weak law of large numbers. Noting that

\hat{E}_{j,A}(\bm{w}_{A})-e_{j,A}(\bm{w}_{A})=\frac{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left\{X_{i^{\prime}j}-e_{j,A}(\bm{w}_{A})\right\}\left[\prod_{b=1}^{B}K_{h}\left(\bm{X}_{i^{\prime},A}^{\top}\bm{v}_{b,A}-\bm{w}_{A}^{\top}\bm{v}_{b,A}\right)\right]^{\frac{1}{B}}}{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left[\prod_{b=1}^{B}K_{h}\left(\bm{X}_{i^{\prime},A}^{\top}\bm{v}_{b,A}-\bm{w}_{A}^{\top}\bm{v}_{b,A}\right)\right]^{\frac{1}{B}}}, \qquad (10)

Because the Gaussian kernel yields better empirical performance than other types of kernels, we assume here that $K_{h}(\cdot)$ is a Gaussian kernel. Then $\left[\prod_{b=1}^{B}K_{h}(\bm{X}_{i^{\prime},A}^{\top}\bm{v}_{b,A}-\bm{w}_{A}^{\top}\bm{v}_{b,A})\right]^{\frac{1}{B}}$ can be written, up to constants, as $\exp\left\{-\left\|\frac{1}{\sqrt{B}}V\bm{X}_{i^{\prime},A}-\frac{1}{\sqrt{B}}V\bm{w}_{A}\right\|^{2}\right\}$, where $V$ is a $B\times|A|$ random matrix whose entries are chosen independently from either $N(0,1)$ or $U(-1,1)$ and whose $b$th row is $\bm{v}_{b,A}$. Hence, using Theorem 1 in Arriaga and Vempala [20], we can show that $\left[\prod_{b=1}^{B}K_{h}(\bm{X}_{i^{\prime},A}^{\top}\bm{v}_{b,A}-\bm{X}_{i,A}^{\top}\bm{v}_{b,A})\right]^{\frac{1}{B}}=K_{h}(\bm{X}_{i^{\prime},A}-\bm{X}_{i,A})+o_{p}(1)$. Then, (10) can be written as

\begin{aligned}
\hat{E}_{j,A}(\bm{w}_{A})-e_{j,A}(\bm{w}_{A})&=\frac{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left\{X_{i^{\prime}j}-e_{j,A}(\bm{w}_{A})\right\}\left[K_{h}\left(\bm{X}_{i^{\prime},A}-\bm{w}_{A}\right)+o_{p}(1)\right]}{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left[K_{h}\left(\bm{X}_{i^{\prime},A}-\bm{w}_{A}\right)+o_{p}(1)\right]}\\
&=\frac{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left\{X_{i^{\prime}j}-e_{j,A}(\bm{w}_{A})\right\}K_{h}\left(\bm{X}_{i^{\prime},A}-\bm{w}_{A}\right)}{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left[K_{h}\left(\bm{X}_{i^{\prime},A}-\bm{w}_{A}\right)+o_{p}(1)\right]}\\
&\quad+\frac{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left\{X_{i^{\prime}j}-e_{j,A}(\bm{w}_{A})\right\}o_{p}(1)}{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left[K_{h}\left(\bm{X}_{i^{\prime},A}-\bm{w}_{A}\right)+o_{p}(1)\right]}\\
&=E_{1}+E_{2}.
\end{aligned}
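For completeness, the geometric-mean identity invoked above can be verified directly; the following short derivation assumes the Gaussian kernel $K_{h}(u)=\frac{1}{\sqrt{2\pi}h}\exp\{-u^{2}/(2h^{2})\}$ and writes $V$ for the $B\times|A|$ matrix whose $b$th row is $\bm{v}_{b,A}$:

\left[\prod_{b=1}^{B}K_{h}\left(\bm{X}_{i^{\prime},A}^{\top}\bm{v}_{b,A}-\bm{w}_{A}^{\top}\bm{v}_{b,A}\right)\right]^{\frac{1}{B}}=\frac{1}{\sqrt{2\pi}h}\exp\left\{-\frac{1}{2h^{2}}\cdot\frac{1}{B}\sum_{b=1}^{B}\left(\bm{v}_{b,A}^{\top}(\bm{X}_{i^{\prime},A}-\bm{w}_{A})\right)^{2}\right\}=\frac{1}{\sqrt{2\pi}h}\exp\left\{-\frac{1}{2h^{2}}\left\|\frac{1}{\sqrt{B}}V\bm{X}_{i^{\prime},A}-\frac{1}{\sqrt{B}}V\bm{w}_{A}\right\|^{2}\right\},

and the Johnson-Lindenstrauss-type bound of Arriaga and Vempala [20], together with $E(v_{b,A,j}^{2})=1$, implies that $\left\|\frac{1}{\sqrt{B}}V(\bm{X}_{i^{\prime},A}-\bm{w}_{A})\right\|^{2}$ concentrates around $\|\bm{X}_{i^{\prime},A}-\bm{w}_{A}\|^{2}$ as $B$ grows, which yields the $K_{h}(\bm{X}_{i^{\prime},A}-\bm{w}_{A})+o_{p}(1)$ approximation.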

Under the given conditions, following the proof of Theorem 1 in Lin et al. [9] and Lemma 4 in Chen et al. [22], we know that

\sup_{\bm{w}_{A}}\|E_{1}\|_{2}\leq\sup_{\bm{w}_{A}}\left\|\frac{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left\{X_{i^{\prime}j}-e_{j,A}(\bm{w}_{A})\right\}K_{h}\left(\bm{X}_{i^{\prime},A}-\bm{w}_{A}\right)}{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left[K_{h}\left(\bm{X}_{i^{\prime},A}-\bm{w}_{A}\right)+o_{p}(1)\right]}\right\|_{2}=o_{p}(1).

Let $R(\bm{w}_{A};j)=\frac{1}{n}\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)K_{h}\left(\bm{X}_{i^{\prime},A}-\bm{w}_{A}\right)$; then $\sup_{\bm{w}_{A}}|R(\bm{w}_{A};j)-P_{r}\left(A_{i^{\prime}}\supset A\cup j\right)f_{A}(\bm{w}_{A})|=O_{p}\left(\sqrt{\log n}/\sqrt{nh}+h^{2}\right)$ follows from Lemma 4 in Chen et al. [22]. Using Assumption (A9), we have $\inf_{\bm{w}_{A}}R(\bm{w}_{A};j)>C=C_{0}-O_{p}\left(\sqrt{\log n}/\sqrt{nh}+h^{2}\right)$. Hence, we have

\sup_{\bm{w}_{A}}\left\|E_{2}\right\|_{2}\leq\sup_{\bm{w}_{A}}\left\|\frac{1}{n}\sum_{i^{\prime}=1}^{n}\frac{I\left(A_{i^{\prime}}\supset A\cup j\right)\left\{X_{i^{\prime}j}-e_{j,A}(\bm{w}_{A})\right\}}{C+\frac{1}{n}\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)o_{p}(1)}\,o_{p}(1)\right\|_{2}.

Let $S_{i^{\prime}}=I\left(A_{i^{\prime}}\supset A\cup j\right)\left\{X_{i^{\prime}j}-e_{j,A}(\bm{w}_{A})\right\}/\{C+\frac{1}{n}\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)o_{p}(1)\}$. Clearly, under the condition $A_{i}\perp\bm{X}_{i}$,

\begin{aligned}
E(S_{i^{\prime}})&=E\left[\frac{I\left(A_{i^{\prime}}\supset A\cup j\right)X_{i^{\prime}j}}{C+\frac{1}{n}\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)o_{p}(1)}\right]-E\left[\frac{I\left(A_{i^{\prime}}\supset A\cup j\right)E(X_{ij}|\bm{X}_{i,A}=\bm{w}_{A})}{C+\frac{1}{n}\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)o_{p}(1)}\right]\\
&=E\left[\frac{X_{i^{\prime}j}}{C+o_{p}(1)}\,\Big|\,I\left(A_{i^{\prime}}\supset A\cup j\right)=1\right]P_{r}(A_{i^{\prime}}\supset A\cup j)\\
&\quad-E\left\{E\left[\frac{X_{ij}}{C+o_{p}(1)}\,\Big|\,\bm{X}_{i,A}=\bm{w}_{A}\right]\Big|\,I\left(A_{i^{\prime}}\supset A\cup j\right)=1\right\}P_{r}(A_{i^{\prime}}\supset A\cup j)\\
&=\frac{1}{C}\left\{E(X_{i^{\prime}j})P_{r}(A_{i^{\prime}}\supset A\cup j)-E\left[E(X_{ij}|\bm{X}_{i,A}=\bm{w}_{A})\right]P_{r}(A_{i^{\prime}}\supset A\cup j)\right\}\\
&=\frac{1}{C}\left\{E(X_{i^{\prime}j})P_{r}(A_{i^{\prime}}\supset A\cup j)-E(X_{ij})P_{r}(A_{i^{\prime}}\supset A\cup j)\right\}=0.
\end{aligned}

Then, using the weak law of large numbers, we have

\sup_{\bm{w}_{A}}\left\|E_{2}\right\|_{2}=o_{p}(1).

Thus, we obtain $\sup_{\bm{\beta}\in\mathcal{B}}\|U_{1}\|_{2}=o_{p}(1)$, and the same argument also applies to $\sup_{\bm{\beta}\in\mathcal{B}}\|U_{2}\|_{2}$. The above convergences imply that $\sup_{\bm{\beta}\in\mathcal{B}}\|U(\bm{\beta})-u(\bm{\beta})\|_{2}=o_{p}(1)$. The remaining technical derivation follows from Lin et al. [9], and Theorem 1 holds. ∎

Proof of Theorem 2

Proof.

It is obvious that equations (8) and (9) hold regardless of whether the assumption $j\in\mathcal{I}_{1}$, $j=1,2,\cdots,p$, is satisfied. Then, the proof of Theorem 2 follows easily from Theorems 1 and 2 in Fu [16] and Theorem 3 in Fu [17]. ∎