
Projective Resampling Imputation Mean Estimation Method for Missing Covariates Problem

Zishu Zhan1,2, Xiangjie Li3,∗, Jingxiao Zhang1,2,∗
Abstract

Missing data are a common problem in clinical data collection and complicate the statistical analysis of such data. To overcome the problems caused by incomplete data, we propose a new imputation method called projective resampling imputation mean estimation (PRIME), which also addresses the "curse of dimensionality" in imputation with less information loss. We use various sample sizes, missing-data rates, covariate correlations, and noise levels in simulation studies, and all results show that PRIME outperforms other methods such as iterative least-squares estimation (ILSE), maximum likelihood (ML), and complete-case analysis (CC). Moreover, we conduct a study of influential factors in cardiac surgery-associated acute kidney injury (CSA-AKI), which shows that our method performs better than the alternatives. Finally, we prove that PRIME is consistent under some regularity conditions.

Keywords : projective resampling, missing covariates problem, imputation method, linear regression

1Center for Applied Statistics, Renmin University of China,
2 School of Statistics, Renmin University of China,
3 State Key Laboratory of Cardiovascular Disease, Fuwai Hospital, National Center for Cardiovascular Diseases, Chinese Academy of Medical Sciences and Peking Union Medical College

1 Introduction

In medical research, an investigator's ultimate interest may be in inferring prognostic markers, given the patients' genetic, cytokine, and/or environmental backgrounds [1, 2]. However, in practical applications, data are often missing. The most common approach to address missing-data problems is complete-case analysis (CC), which is simple but inefficient. CC can also lead to biased estimates when the data are not missing completely at random. The maximum likelihood method (ML [3, 4]) and the inverse probability weighting method (IPW [5, 6]) are also widely used approaches to address missing data. However, likelihood-based methods are sensitive to model assumptions, and re-weighting methods do not always make full use of the available data. Alternatively, imputation [7, 8] is a more flexible approach to cope with missing data.

However, a preliminary analysis of the cardiac surgery-associated acute kidney injury (CSA-AKI) data used in Chen et al. [2] indicates that the missing-data patterns vary across individuals. Accordingly, new and more capable quantitative methods are needed for such individual-specific missing data. Furthermore, it is common for only a small fraction of records to have complete information across all sources, and existing methods do not work well when the percentage of unavailable data is high. To estimate the coefficients (rather than predict them), Lin et al. [9] proposed the iterative least-squares estimation (ILSE) method to deal with individual-specific missing-data patterns within the classical regression framework, but it requires a complete set of observations to obtain the initial values, and it may fail to converge when the initial values are poor. Furthermore, the method of Lin et al. [9] may have difficulty accommodating data in which both important and unimportant variables are missing.

In this study, based on the idea of projection resampling/random projection, we propose Projection Resampling Imputation Mean Estimation (PRIME), a method that tackles the aforementioned drawbacks of existing methods. The key idea behind projection resampling/random projection comes from the Johnson-Lindenstrauss lemma [10], which states that pairwise distances are approximately preserved when a set of points is projected onto a randomly chosen low-dimensional subspace. There are several previous studies on projection resampling/random projection for dimension reduction, including Schulman [11] for clustering, Donoho [12] for signal processing, Shi et al. [13] for classification, Maillard and Munos [14] for linear regression, and Le et al. [15] for kernel approximation. Specifically, the idea of PRIME is to project the covariates along randomly sampled directions to obtain samples of scalar-valued predictors and kernels (dimension reduction). Next, a simple geometric average is taken over the scalar-predictor-based kernels to impute the missing parts (using all-sided information). Our method has several advantages. First, PRIME can deal with a high degree of missing data, even data containing no complete observations, whereas most existing methods require at least a fraction of the subjects to have fully complete observations. Second, we can average the imputed estimates from multiple projection directions and fully utilize the available information to reach a more reliable and useful result. Third, to reduce the undesirable influence of unimportant variables, PRIME can be easily extended to sparse PRIME (denoted as SPRIME), which is important in practical applications where many candidate covariates are irrelevant.

The remainder of the paper is organized as follows. Section 2 introduces the basic setup of PRIME and SPRIME. Theoretical properties are discussed in Section 3. Sections 4 and 5 present the numerical results using simulated and real data examples, respectively. Section 6 presents some concluding remarks. In addition, our proposed method is implemented in R, and the scripts to reproduce our results are available at https://github.com/eleozzr/PRIME. The proofs of the theorems are available in the Appendix.

2 Projective Resampling Imputation Mean Estimation (PRIME)

2.1 Model and estimation by PRIME

In this paper, let $\bm{Y}=(Y_{1},Y_{2},\cdots,Y_{n})^{\top}$ be the response variable of interest and $\bm{X}=\{X_{ij}:i=1,2,\cdots,n,\ j=1,2,\cdots,p\}$ be the covariate matrix. We assume that $n>p$. We indicate missing covariates by $r_{ij}$, the missing-data indicator for $X_{ij}$: $r_{ij}=1$ if $X_{ij}$ is missing and $r_{ij}=0$ otherwise. For each unit $i$, $A_{i}=\{j:r_{ij}=0,\ j=1,2,\cdots,p\}$ denotes the set of available covariates, e.g., $A_{i}=\{1,2,\cdots,p\}$ for a complete case.

In this study, we focus on a linear regression model. Assume that the random sample $\{(Y_{i},\bm{X}_{i}):i=1,2,\cdots,n\}$ is generated by

Y_{i}=\sum_{j=1}^{p}\beta_{j}X_{ij}+\varepsilon_{i}=\bm{X}_{i}^{\top}\bm{\beta}+\varepsilon_{i}, \qquad (1)

where $\bm{\beta}=(\beta_{1},\beta_{2},\cdots,\beta_{p})^{\top}$ is the coefficient vector for the covariates and the $\varepsilon_{i}$'s are independent and identically distributed random errors. We assume $E(\varepsilon_{i}|\bm{X}_{i})=0$ and $E(\varepsilon_{i}^{2}|\bm{X}_{i})=\sigma^{2}$.

The covariates can be divided into two parts based on $A_{i}$: $\bm{X}_{i,A_{i}}=(X_{ij}:j\in A_{i})^{\top}$ for the observed covariates and $\bm{X}_{i,\bar{A}_{i}}=(X_{ij}:j\notin A_{i})^{\top}$ for the missing covariates. Thus, equation (1) can be expressed as

Y_{i}=\bm{X}_{i,A_{i}}^{\top}\bm{\beta}_{A_{i}}+\bm{X}_{i,\bar{A}_{i}}^{\top}\bm{\beta}_{\bar{A}_{i}}+\varepsilon_{i},

where $\bm{\beta}_{A_{i}}$ and $\bm{\beta}_{\bar{A}_{i}}$ denote the regression coefficients for the observed and missing covariates, respectively. Because $\bm{X}_{i,\bar{A}_{i}}$ is unobserved, following Lin et al. [9], we take the expectation of $Y_{i}$ given the observed covariates and obtain the following equation:

E(Y_{i}|\bm{X}_{i,A_{i}})=\bm{X}_{i,A_{i}}^{\top}\bm{\beta}_{A_{i}}+E(\bm{X}_{i,\bar{A}_{i}}^{\top}\bm{\beta}_{\bar{A}_{i}}|\bm{X}_{i,A_{i}}). \qquad (2)

Hence, by equation (2), we can impute the incomplete part using the information in $\bm{X}_{i,A_{i}}$ to obtain an estimator for $\bm{\beta}$. We use the following estimator for the missing components of the covariates of unit $i$:

\tilde{X}_{ij}=\frac{\sum_{i^{\prime}=1}^{n}I(A_{i^{\prime}}\supset A_{i}\cup j)\,X_{i^{\prime}j}\,K_{h}(\bm{X}_{i^{\prime},A_{i}}-\bm{X}_{i,A_{i}})}{\sum_{i^{\prime}=1}^{n}I(A_{i^{\prime}}\supset A_{i}\cup j)\,K_{h}(\bm{X}_{i^{\prime},A_{i}}-\bm{X}_{i,A_{i}})},\quad j\notin A_{i}. \qquad (3)

In equation (3), $K_{h}(\cdot)=K(\cdot/h)/h$, where $K(\cdot)$ is a kernel function and $h$ is a bandwidth. In this way, we can make use of the available information as fully as possible. To tackle the "curse of dimensionality", we further transform the estimator in equation (3) using the projective resampling method. As shown in Figure 1, for subject $i^{\prime}$, we project $\bm{X}_{i^{\prime},A_{i}}$ onto random directions $\{\bm{v}_{b,A_{i}}\in\mathcal{R}^{|A_{i}|},\ b=1,2,\cdots,B\}$, where $|A|$ denotes the cardinality of a set $A$; we then obtain $B$ kernel values from the resulting scalars $\{\bm{X}_{i^{\prime},A_{i}}^{\top}\bm{v}_{b,A_{i}},\ b=1,2,\cdots,B\}$ and finally combine the $B$ kernels through their geometric mean:

\hat{X}_{ij}=\frac{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A_{i}\cup j\right)X_{i^{\prime}j}\left[\prod_{b=1}^{B}K_{h}\left(\bm{X}_{i^{\prime},A_{i}}^{\top}\bm{v}_{b,A_{i}}-\bm{X}_{i,A_{i}}^{\top}\bm{v}_{b,A_{i}}\right)\right]^{\frac{1}{B}}}{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A_{i}\cup j\right)\left[\prod_{b=1}^{B}K_{h}\left(\bm{X}_{i^{\prime},A_{i}}^{\top}\bm{v}_{b,A_{i}}-\bm{X}_{i,A_{i}}^{\top}\bm{v}_{b,A_{i}}\right)\right]^{\frac{1}{B}}},\quad j\notin A_{i}, \qquad (4)

where $\bm{v}_{b,A_{i}}$ is a random vector whose entries $v_{b,A_{i},j}$ are chosen independently from a distribution $\mathcal{D}$ that is symmetric about the origin with $E(v_{b,A_{i},j}^{2})=1$. In practice, we usually generate $v_{b,A_{i},j}$ from $N(0,1)$ or $U(-1,1)$, but random vectors from other distributions can also be used.

Figure 1: Random projection.

Applying the above imputation strategy, we obtain $\bm{Z}_{i}$ by using the observed data for part $A_{i}$ and the imputed data for part $\bar{A}_{i}$:

\bm{Z}_{i}=\left(X_{ij}I(j\in A_{i})+\hat{X}_{ij}I(j\notin A_{i}):j=1,2,\cdots,p\right)^{\top}. \qquad (5)

Therefore, we propose the following estimating equation:

U(\bm{\beta})=\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\{Y_{i}-\bm{Z}_{i}^{\top}\bm{\beta}\}. \qquad (6)

Thus, the estimator of the regression coefficients has the closed form

\hat{\bm{\beta}}=\left\{\sum_{i=1}^{n}\bm{Z}_{i}\bm{Z}_{i}^{\top}\right\}^{-1}\sum_{i=1}^{n}\bm{Z}_{i}Y_{i}.

Specifically, we propose the following algorithm:

Algorithm 1: PRIME

Input: $\{(Y_{i},\bm{X}_{i},A_{i}):i=1,2,\cdots,n\}$
Output: $\hat{\bm{\beta}}$

1: Sample the entries of $\bm{v}_{b,A_{i}}\in\mathcal{R}^{|A_{i}|}$ $(b=1,2,\cdots,B)$ i.i.d. from $N(0,1)$ or $U(-1,1)$.
2: for $1\leq i\leq n$ do
3:     apply equation (4) to obtain $\hat{X}_{ij}$ for every $j\notin A_{i}$;
4:     apply equation (5) to obtain the imputed data $\bm{Z}_{i}$ based on $\bm{v}_{b,A_{i}}$ $(b=1,2,\cdots,B)$.
5: end for
6: Solve the closed-form equation to obtain $\hat{\bm{\beta}}$.
7: return $\hat{\bm{\beta}}$
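To make Algorithm 1 concrete, below is a minimal R sketch of the imputation in equation (4) and the closed-form estimator, assuming a Gaussian kernel and standard normal projection directions. The function name prime and the fallback to a column mean when no donor shares the required pattern are our own illustrative choices; this is not the authors' released implementation (see the repository linked in Section 1).

# Minimal R sketch of PRIME (Algorithm 1); missing entries of X are coded as NA.
prime <- function(X, Y, B = 100, h = nrow(X)^(-1/3)) {
  n <- nrow(X); p <- ncol(X)
  obs <- !is.na(X)                        # obs[i, j] is TRUE iff j is in A_i
  Kh  <- function(u) dnorm(u / h) / h     # K_h(u) = K(u/h)/h with a Gaussian K
  Z <- X
  for (i in 1:n) {
    Ai   <- which(obs[i, ])
    miss <- which(!obs[i, ])
    if (length(miss) == 0) next
    V <- matrix(rnorm(length(Ai) * B), nrow = length(Ai), ncol = B)  # directions v_1, ..., v_B
    proj_i <- drop(X[i, Ai, drop = FALSE] %*% V)                     # scalars X_{i,A_i}^T v_b
    for (j in miss) {
      donors <- which(apply(obs[, c(Ai, j), drop = FALSE], 1, all))  # A_{i'} contains A_i and j
      if (length(donors) == 0) { Z[i, j] <- mean(X[, j], na.rm = TRUE); next }  # illustrative fallback
      proj_d <- X[donors, Ai, drop = FALSE] %*% V
      # geometric mean of the B kernel values, computed on the log scale for stability
      w <- exp(rowMeans(log(Kh(sweep(proj_d, 2, proj_i)) + 1e-300)))
      Z[i, j] <- sum(w * X[donors, j]) / sum(w)                      # equation (4)
    }
  }
  beta_hat <- solve(crossprod(Z), crossprod(Z, Y))                   # closed-form least-squares step
  list(beta = drop(beta_hat), Z = Z)
}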

2.2 Simultaneous fitting and selection by sparse PRIME

Because the number of disease-associated biomarkers is not expected to be large, it is of great importance to take the sparsity assumption into account when not all variables contribute to the outcome. Hence, we assume that the linear regression model in equation (1) is sparse and define the index sets of the active and inactive predictors by $\mathcal{I}_{1}=\{j:\beta_{j}\neq 0\}$ and $\mathcal{I}_{2}=\{j:\beta_{j}=0\}$, respectively. Our practical goal is to identify which biomarkers in the CSA-AKI dataset are disease-related and to estimate the corresponding coefficients. The main idea of sparse PRIME is to replace the estimating equation (6) with the following penalized estimating equations:

\left\{\begin{aligned}
\sum_{i=1}^{n}Z_{i1}\{Y_{i}-\bm{Z}_{i}^{\top}\bm{\beta}\}+\lambda_{n}\gamma|\beta_{1}|^{\gamma-1}\mathrm{sign}(\beta_{1})&=0,\\
&\ \vdots\\
\sum_{i=1}^{n}Z_{ip}\{Y_{i}-\bm{Z}_{i}^{\top}\bm{\beta}\}+\lambda_{n}\gamma|\beta_{p}|^{\gamma-1}\mathrm{sign}(\beta_{p})&=0,
\end{aligned}\right. \qquad (7)

where $\lambda_{n}\gamma|\beta_{j}|^{\gamma-1}\mathrm{sign}(\beta_{j})$ is the partial derivative of the penalty function with respect to $\beta_{j}$. The least absolute shrinkage and selection operator (LASSO) estimator corresponds to $\gamma=1$. Optimizing the functions in (7) with $\gamma=1$ is computationally cumbersome because they are non-differentiable. Fortunately, the shooting algorithm proposed by Fu [16] can be used to compute the LASSO estimator. Moreover, Fu [16, 17] proved that the unique solution of (7) is equivalent to the minimizer of the following penalized objective function:

\min_{\bm{\beta}}\frac{1}{2n}\sum_{i=1}^{n}\left\{Y_{i}-\bm{Z}_{i}^{\top}\bm{\beta}\right\}^{2}+\frac{\lambda_{n}}{n}\sum_{j=1}^{p}|\beta_{j}|^{\gamma}.

We can solve this penalized regression problem and obtain a sparse estimator $\hat{\bm{\beta}}_{s}$ using the glmnet package in R.
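As an illustration, the sparse estimator with $\gamma=1$ (the LASSO case) can be obtained from the PRIME-imputed design matrix with glmnet; selecting $\lambda_{n}$ by cross-validation is our own illustrative choice rather than part of the proposed method.

# SPRIME sketch: LASSO (gamma = 1) on the imputed design matrix Z from prime().
library(glmnet)
sprime <- function(Z, Y) {
  cvfit  <- cv.glmnet(Z, Y, alpha = 1, intercept = FALSE, standardize = FALSE)
  beta_s <- as.numeric(coef(cvfit, s = "lambda.min"))[-1]   # drop the intercept slot
  beta_s
}
# Example: fit <- prime(X, Y); beta_sparse <- sprime(fit$Z, Y)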

To decrease the processing time, we can use the following sparse random projection with i.i.d. entries:

v_{b,A_{i},j}=\sqrt{s_{i}}\times\begin{cases}1,&\text{with probability }\frac{1}{2s_{i}},\\ 0,&\text{with probability }1-\frac{1}{s_{i}},\\ -1,&\text{with probability }\frac{1}{2s_{i}},\end{cases}

where Achlioptas [18] used $s_{i}=1$ or $s_{i}=3$, and Li et al. [19] showed that one can use $s_{i}=\sqrt{|A_{i}|}$ or even $s_{i}=\frac{|A_{i}|}{\log|A_{i}|}$ to significantly reduce the computing time with little loss of accuracy.
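A sketch of how such sparse directions can be generated is given below; the default $s=\sqrt{d}$ follows Li et al. [19], and the function name is ours.

# Generate a d x B matrix of sparse random projection directions v_b.
sparse_projection <- function(d, B, s = sqrt(d)) {
  vals <- sample(c(1, 0, -1), size = d * B, replace = TRUE,
                 prob = c(1 / (2 * s), 1 - 1 / s, 1 / (2 * s)))
  sqrt(s) * matrix(vals, nrow = d, ncol = B)
}
# Example: V <- sparse_projection(d = length(Ai), B = 50)  # drop-in for the rnorm draws above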

3 Theoretical properties

We study the consistency of the projective resampling least-squares estimator $\hat{\bm{\beta}}$. Denote the true value of $\bm{\beta}$ by $\bm{\beta}_{0}$. We make the following assumptions:

(A1) $h=O(n^{-l})$ with $1/4<l<1/2$;

(A2) $A_{i}\perp\bm{X}_{i}$;

(A3) the kernel function $K(\cdot)$ is a symmetric density function with compact support $[0,1]$ and a bounded derivative;

(A4) $\bm{\beta}_{0}\in\mathcal{B}$, where $\mathcal{B}$ is a bounded set;

(A5) for a missing-data pattern $A$, let $e_{j,A}(\bm{w}_{A})=E\left(X_{ij}|\bm{X}_{i,A}=\bm{w}_{A}\right)$ be the conditional expectation of $X_{ij}$ and $f_{A}(\bm{w}_{A})$ be the density of $\bm{X}_{i,A}$; assume $e_{j,A}(\bm{w}_{A})$ and $f_{A}(\bm{w}_{A})$ have continuous second derivatives with respect to $\bm{w}_{A}$ on the corresponding support;

(A6) $E(v^{4}_{bj})<\infty$ $(1\leq j\leq p)$;

(A7) $\bm{X}_{i}$ is bounded for all $i\geq 1$, and the limit $\lim_{n\to\infty}\sum_{i=1}^{n}\bm{X}_{i}/n=\bm{X}_{0}$ exists;

(A8) $D_{n}=\frac{1}{n}\sum_{i=1}^{n}\bm{X}_{i}\bm{X}_{i}^{\top}\to D$ and $\frac{1}{n}\max_{1\leq i\leq n}\bm{X}_{i}^{\top}\bm{X}_{i}\to 0$, where $D$ is a finite and positive definite matrix;

(A9) there exists a constant $C_{0}>0$ such that $\inf_{\bm{w}_{A}}P_{r}(A_{i^{\prime}}\supset A\cup j)f_{A}(\bm{w}_{A})>C_{0}$.

Assumptions (A1)–(A5) are the same as the conditions in Lin et al. [9]. Specifically, Assumption (A1) requires under-smoothing to obtain a root-$n$ consistent estimator, which is a commonly used regularity assumption in semiparametric regression. Assumption (A2) addresses the missing-data mechanism and ensures that the PRIME estimator is consistent; as mentioned in Lin et al. [9], it is weaker than assuming the data are missing completely at random. Assumptions (A3)–(A5) are standard in nonparametric regression. Assumption (A3) is achieved when the kernel function is the Gaussian kernel, but it is more general. Assumption (A6) is a moment bound required in Arriaga and Vempala [20] and Li et al. [19] and is a necessary technical condition. Assumptions (A7) and (A8) are commonly used in penalized estimation problems [16, 17]. Assumption (A9), $\inf_{\bm{w}_{A}}P_{r}(A_{i^{\prime}}\supset A\cup j)f_{A}(\bm{w}_{A})>C_{0}$, ensures that enough samples are used to estimate $e_{j,A}(\bm{w}_{A})$.

Theorem 1.

Suppose Assumptions (A1)–(A6) hold and $j\in\mathcal{I}_{1}=\{j:\beta_{j}\neq 0\}$, $j=1,2,\cdots,p$. Then, as $n\to\infty$, $\hat{\bm{\beta}}\to\bm{\beta}_{0}$ in probability.

Theorem 2.

Suppose Assumptions (A1)–(A8) hold, $j\in\mathcal{I}_{1}=\{j:\beta_{j}\neq 0\}$ for $j=1,2,\cdots,q$, $j\in\mathcal{I}_{2}=\{j:\beta_{j}=0\}$ for $j=q+1,\cdots,p$, and the tuning parameter satisfies $\lambda_{n}=o(\sqrt{n})$. Then, as $n\to\infty$, $\hat{\bm{\beta}}_{s}\to\bm{\beta}_{0}$ in probability.

4 Simulation

In this section, we consider several simulated scenarios to highlight the properties of PRIME in contrast to some other methods. We experimentally investigate the performance of the following methods:

Full: the least-squares estimator based on the full data, used as a benchmark;

PRIME: the proposed method;

ILSE: the iterative least-squares method in Lin et al. [9];

ML: the maximum likelihood method proposed in Jiang et al. [4];

CC: the complete-case analysis method.

For each model setting with a specific choice of parameters, we repeat the simulation 100 times and evaluate the performance of models using the normalized absolute distance (NAD) and the mean squared error (MSE), defined as follows:

{\rm NAD}_{j}=\frac{1}{N}\sum_{i=1}^{N}\frac{|\hat{\beta}_{j}^{(i)}-\beta_{0j}|}{\beta_{0j}},\quad j=1,2,\cdots,p,
{\rm MSE}=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{p}\sum_{j=1}^{p}(\hat{\beta}_{j}^{(i)}-\beta_{0j})^{2}.

In addition, we calculate the optimal MSE rate, defined as the proportion of repetitions in which each method (except Full) produces the smallest MSE. The MSE based on $N$ repetitions can be partitioned as $\mathrm{MSE}=\mathrm{Variance}+\mathrm{Bias}^{2}$:

\frac{1}{N}\sum_{i=1}^{N}\frac{1}{p}\sum_{j=1}^{p}(\hat{\beta}_{j}^{(i)}-\beta_{0j})^{2}=\frac{1}{p}\sum_{j=1}^{p}\frac{1}{N}\sum_{i=1}^{N}(\hat{\beta}_{j}^{(i)}-\bar{\beta}_{j})^{2}+\frac{1}{p}\sum_{j=1}^{p}(\bar{\beta}_{j}-\beta_{0j})^{2},

where $\bar{\beta}_{j}=\frac{1}{N}\sum_{i=1}^{N}\hat{\beta}_{j}^{(i)}$ and $N=100$ in this simulation study. For simplicity, we set the bandwidth $h=n^{-1/3}$ for PRIME, SPRIME, and ILSE. In the following sections, we compare the methods under various settings for sample size, missing-data rate, noise level, and feature correlation.
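For reference, the criteria above can be computed as follows; this sketch assumes that beta_hat is an $N\times p$ matrix of estimates across repetitions and beta0 is the true coefficient vector.

# Evaluation metrics used in the simulations.
nad <- function(beta_hat, beta0) colMeans(abs(sweep(beta_hat, 2, beta0))) / beta0
mse <- function(beta_hat, beta0) mean(sweep(beta_hat, 2, beta0)^2)
# MSE = Variance + Bias^2 decomposition
bias_var <- function(beta_hat, beta0) {
  bbar <- colMeans(beta_hat)
  c(variance = mean(sweep(beta_hat, 2, bbar)^2), bias2 = mean((bbar - beta0)^2))
}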

4.1 Scenario 1: Different noise levels

The data generation model has the linear expression

Y_{i}=\sum_{j=1}^{p}\beta_{j}X_{ij}+\varepsilon_{i},\quad i=1,2,\cdots,n,

where $n=100,200$, $p=12$, and $\bm{\beta}=(1,-0.6,1.5,1,1.2,0.4,-1,-0.7,1.3,0.5,1.1,-1.4,0.9)^{\top}$.

We generate $(X_{i1},\cdots,X_{ip})$ from the multivariate normal distribution $N_{p}(\bm{0},\bm{\Sigma})$. We set the off-diagonal elements $\rho_{ij}$ of $\bm{\Sigma}$ equal to $0.5$ and the diagonal elements equal to $1$. For $\varepsilon_{i}$, we use the error distribution $N(0,\sigma^{2})$, where $\sigma^{2}$ varies with $R^{2}=\text{Var}(\bm{X}_{i}^{\top}\bm{\beta})/\{\text{Var}(\bm{X}_{i}^{\top}\bm{\beta})+\sigma^{2}\}$. We consider the cases $R^{2}=0.1,0.2,\cdots,0.9$.
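A sketch of this data-generating process is given below; sigma^2 is derived from the target $R^{2}$ as in the text, and the function name is ours.

# Scenario 1 data generation: compound-symmetric covariates and noise level set by R^2.
library(MASS)  # for mvrnorm
gen_data <- function(n, beta, rho = 0.5, R2 = 0.7) {
  p <- length(beta)
  Sigma <- matrix(rho, p, p); diag(Sigma) <- 1
  X <- mvrnorm(n, mu = rep(0, p), Sigma = Sigma)
  var_signal <- drop(t(beta) %*% Sigma %*% beta)   # Var(X^T beta)
  sigma2 <- var_signal * (1 - R2) / R2             # from R^2 = signal / (signal + sigma^2)
  Y <- drop(X %*% beta) + rnorm(n, sd = sqrt(sigma2))
  list(X = X, Y = Y)
}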

Missing data are commonly divided into three generative scenarios or assumptions. Missing completely at random (MCAR) means that the missingness is independent of the values of the data. Missing at random (MAR) means that the propensity of data to be missing depends only on the observed values, whereas missing not at random (MNAR) covers the remaining scenario, in which the missingness depends on the unobserved values (the variables that are missing). In our study, we consider situations that differ from the classical MCAR, MAR, and MNAR mechanisms.

For each sample, the variables $X_{10},X_{11},X_{12}$ are always available. Twelve "typical" missing patterns are considered; the details are shown in Table 1. We divide the 12 patterns into two groups: the first group $\bm{A}^{(1)}$ consists of $(A_{1},A_{2},\cdots,A_{6})$, and the second group $\bm{A}^{(2)}$ consists of the remaining patterns. We randomly assign the missing patterns in $\bm{A}^{(1)}$ to the sample with missing probability $P=a$. Furthermore, we set the missing probability of the $i$th unit for the patterns in $\bm{A}^{(2)}$ to $P=\{1+\exp(b\varepsilon_{i}+c)\}^{-1}$ and then randomly assign the patterns in $\bm{A}^{(2)}$ to the missing samples; a code sketch of this assignment is given after Table 2. The settings in Table 2 are used for the missing rate (MR).

Table 1: Missing patterns for all simulation examples. The table indicates, for each pattern, which of the variables $X_{1}$–$X_{12}$ are available; patterns $A_{1}$–$A_{6}$ form group $\bm{A}^{(1)}$ and patterns $A_{7}$–$A_{12}$ form group $\bm{A}^{(2)}$.
Table 2: Missing rate settings for all simulation examples.

Missing rate        $R^2$:   0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9
60%            a             0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1    0.1
               b            -1.5   -1.5   -1.5   -1.5   -1.5   -1.5   -2     -2     -4
               c            -4     -2     -2     -1.5   -1.5   -1     -1     -1     -1
90%            a             0.75   0.75   0.75   0.75   0.75   0.75   0.75   0.75   0.65
               b            -1.5   -1.5   -2     -2     -3     -3     -3.5   -3.5   -4
               c            -4     -4     -4     -4     -4     -4     -4     -4     -4
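The following sketch shows one way to implement the missing-pattern assignment described above. Here, patterns is assumed to be a list of twelve index vectors giving, for each pattern, the covariates set to missing, and the exact interplay between the two groups (a unit first draws a group-1 pattern with probability a and otherwise may draw a group-2 pattern with the logistic probability) is our reading of the text.

# Assign missing patterns: group A^(1) with probability a, group A^(2) with
# probability 1 / (1 + exp(b * eps + c)), which depends on the error term eps.
assign_missing <- function(X, eps, patterns, a, b, c) {
  n <- nrow(X)
  miss1 <- runif(n) < a                                       # units given an A^(1) pattern
  miss2 <- !miss1 & (runif(n) < 1 / (1 + exp(b * eps + c)))   # units given an A^(2) pattern
  for (i in which(miss1)) X[i, patterns[[sample(1:6, 1)]]] <- NA
  for (i in which(miss2)) X[i, patterns[[sample(7:12, 1)]]] <- NA
  X
}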

The MSE results are shown in Figures 3, 6, 9, and 12 for $R^{2}=0.2,0.5,0.8$. The optimal MSE rates are displayed in Figures 4, 7, 10, and 13. To determine the relative performance, we rank the NADs of the five methods in each repetition. The mean NAD ranks are displayed in Figures 2, 5, 8, and 11 for $R^{2}=0.2,0.5,0.8$. The results of the cases not shown here are available in the supplementary materials; in general, they exhibit patterns similar to those shown here.

To make the comparison of Full, PRIME, and ILSE easier, the MSE bar plots for CC and ML are manually scaled because these two methods lead to much higher MSEs than the other methods. The main conclusions are as follows:

1. When $n$ and $R^{2}$ increase, the MSEs of Full, ILSE, and PRIME generally decrease, as expected. However, the relative performance of these three methods does not change.

2. Generally, PRIME outperforms ILSE and ML in terms of NAD, and all three significantly outperform CC. Surprisingly, PRIME is close or even superior to the Full method in estimating the coefficients of $X_{11}$ and $X_{12}$ when $\text{MR}=90\%$, $n=100$, and $R^{2}=0.2$, and in estimating the coefficients of $X_{10}$ and $X_{12}$ when $\text{MR}=90\%$, $n=200$, and $R^{2}=0.2$. These results confirm the superiority of the PRIME method.

3. The comparison of results shows that the proposed PRIME method performs better than the other three methods (ILSE, CC, and ML). The bias and variance decomposition figures show that ILSE produces more biased estimates than PRIME. The CC method's estimates have extremely high variance over almost the whole range of $R^{2}$, and CC's approximation error is larger when $R^{2}$ is not very high. Furthermore, the CC method produces biased estimates, as expected, because of the missing-data mechanism. The performance of ML is not stable: it is close to that of our proposed method when $n=200$, $\text{MR}=90\%$, and $R^{2}=0.8$, but ML performs worst when $n=100$, $\text{MR}=90\%$, and $R^{2}=0.8$, because ML estimators can be highly biased when the MAR assumption does not hold.

4. The optimal MSE rates show a proportion of over 40% for PRIME, which indicates that PRIME yields the smallest MSE among all competitors in more than 40% of the trials in Scenario 1. When PRIME yields the smallest MSE, ILSE or ML most often yields the second smallest; when PRIME does not yield the smallest MSE, ILSE and ML most often do. Not surprisingly, CC rarely produces the smallest MSE except when $\text{MR}=60\%$ and $R^{2}=0.9$.

Figure 2: NAD with $n=100$ and 90% missing data for different methods.
Figure 3: MSE with $n=100$ and 90% missing data for different methods.
Figure 4: Optimal rate of MSE with $n=100$ and 90% missing data for different methods.
Figure 5: NAD with $n=200$ and 90% missing data for different methods.
Figure 6: MSE with $n=200$ and 90% missing data for different methods.
Figure 7: Optimal rate of MSE with $n=200$ and 90% missing data for different methods.
Figure 8: NAD with $n=100$ and 60% missing data for different methods.
Figure 9: MSE with $n=100$ and 60% missing data for different methods.
Figure 10: Optimal rate of MSE with $n=100$ and 60% missing data for different methods.
Figure 11: NAD with $n=200$ and 60% missing data for different methods.
Figure 12: MSE with $n=200$ and 60% missing data for different methods.
Figure 13: Optimal rate of MSE with $n=200$ and 60% missing data for different methods.

4.2 Scenario 2: Varying correlation between variables

To compare the methods under different correlations, we consider the correlation between $X_{i}$ and $X_{j}$ in four situations: $p_{1}$ for $\rho_{ij}=0.2$, $p_{2}$ for $\rho_{ij}=0.5$, $p_{3}$ for $\rho_{ij}=0.8$, and $p_{4}$ for $\rho_{ij}=0.8^{|i-j|}$. Here, we set $\sigma^{2}$ such that $R^{2}=0.7$. All other aspects remain the same as in Scenario 1. For the missing rate, two settings are considered, where

  • $(a,b,c)=(0.1,-2,-1)$, so that the missing rate is approximately 60%;

  • $(a,b,c)=(0.7,-3.5,-4)$, so that the missing rate is approximately 90%.

The NAD and MSE results are shown in Figures 14, 15, 17, 18, 20, 21, 23, and 24. The optimal MSE rates are shown in Figures 16, 19, 22, and 25. The main conclusions are as follows:

1. Scenario 2 yields conclusions similar to those of Scenario 1. PRIME produces the smallest NADs and MSEs in almost all cases. As shown in Figures 21 and 24, although CC has the smallest estimation error, its estimation variance is extremely high, which causes problems in the MSE.

2. The optimal rate results show that PRIME has obvious advantages over the other methods, producing the smallest MSE in almost all situations except when $\text{MR}=90\%$, $n=200$, and $\rho_{ij}=0.8$. However, when $\rho_{ij}$ increases, the gap between PRIME and the other methods increases. CC still has the worst MSE performance among the four methods.

Figure 14: NAD with $n=100$ and 90% missing data for different methods.
Figure 15: MSE with $n=100$ and 90% missing data for different methods.
Figure 16: Optimal rate of MSE with $n=100$ and 90% missing data for different methods.
Figure 17: NAD with $n=200$ and 90% missing data for different methods.
Figure 18: MSE with $n=200$ and 90% missing data for different methods.
Figure 19: Optimal rate of MSE with $n=200$ and 90% missing data for different methods.
Figure 20: NAD with $n=100$ and 60% missing data for different methods.
Figure 21: MSE with $n=100$ and 60% missing data for different methods.
Figure 22: Optimal rate of MSE with $n=100$ and 60% missing data for different methods.
Figure 23: NAD with $n=200$ and 60% missing data for different methods.
Figure 24: MSE with $n=200$ and 60% missing data for different methods.
Figure 25: Optimal rate of MSE with $n=200$ and 60% missing data for different methods.

4.3 Scenario 3: Taking sparse structure into consideration

In this scenario, we illustrate the proposed SPRIME using simulated data. We consider the penalized Full method (denoted as SFull), complete-case analysis with a penalty (denoted as SCC), ILSE, and ML as the alternatives. ILSE in Lin et al. [9] and ML in Jiang et al. [4] were proposed without the sparsity assumption; hence, we use them directly rather than in penalized form. We acknowledge that there are other approaches, such as that in Xue and Qu [21], that can be used to address a high-dimensional missing-data problem. However, the missing-data patterns in those methods differ from the individual-specific case.

The model used to generate data has the linear expression

Y_{i}=\sum_{j=1}^{p}\beta_{j}X_{ij}+\varepsilon_{i},\quad i=1,2,\cdots,n,

where $n=200$, $p=30$, $(\beta_{1},\cdots,\beta_{12})^{\top}=(1,-0.6,1.5,1,1.2,0.4,-1,-0.7,1.3,0.5,1.1,-1.4,0.9)$, and $\beta_{j}=0$ $(j=13,\cdots,30)$. We generate $(X_{i1},\cdots,X_{ip})$ from the multivariate normal distribution $N_{p}(\bm{0},\bm{\Sigma})$. We set the off-diagonal elements $\rho_{ij}$ of $\bm{\Sigma}$ equal to $0.5$. For $\varepsilon_{i}$, we use the error distribution $N(0,\sigma^{2})$, where $\sigma^{2}$ varies with $R^{2}=\text{Var}(\bm{X}_{i}^{\top}\bm{\beta})/\{\text{Var}(\bm{X}_{i}^{\top}\bm{\beta})+\sigma^{2}\}$. We consider only the case in which $R^{2}=0.7$.

When the sparse structure is taken into consideration, the coefficients $\beta_{j}$ $(j=13,\cdots,30)$ are equal to $0$. Thus, the NAD criterion used in Scenarios 1 and 2 is no longer meaningful, so we use only the MSE and the optimal MSE rate to assess the performance of the different methods. Several conclusions can be drawn from the figures:

1. Scenario 3 yields conclusions similar to those of Scenarios 1 and 2. From Figure 26, we can see that SPRIME produces the smallest MSEs in almost all cases.

2. When $\text{MR}=60\%$, the optimal rates of MSE are 0.95, 0.01, 0.00, and 0.04 for SPRIME, ILSE, CC, and ML, respectively. When $\text{MR}=90\%$, the optimal rates of MSE are 0.94, 0.05, 0.01, and 0.00 for SPRIME, ILSE, CC, and ML, respectively. Not surprisingly, SPRIME, which accounts for the sparse structure, performs better than ILSE and ML without the penalty, which reconfirms the superiority of the PRIME-type approach.

Figure 26: MSE with $n=200$ and missing data for different methods.

5 Cardiac surgery-associated acute kidney injury study

In this example, we illustrate the proposed method by analyzing data on cardiac surgery-associated acute kidney injury (CSA-AKI). CSA-AKI is the second most common cause of acute kidney injury in the intensive care setting and sometimes leads to death [2]. Because of the general lack of effective treatment for CSA-AKI, tools or methods for earlier identification are very important for the prevention and management of the syndrome. To find more predictive biomarkers for CSA-AKI, Chen et al. [2] collected 32 plasma cytokines, including CTACK, FGFa, G-CSF, HGF, interferon-$\alpha$2, interferon-gamma (IFN-$\gamma$), IL-1$\alpha$, IL-1$\beta$, IL-2, IL-4, IL-6, IL-7, IL-8, IL-9, IL-10, IL-12p70, IL-12p40, IL-16, IL-17$\alpha$, IL-18, IP-10, MCP-1, MCP-3, M-CSF, MIF, MIG, MIP-1$\alpha$ (macrophage inflammatory protein-1$\alpha$), MIP-1$\beta$ (macrophage inflammatory protein-1$\beta$), SCF, SCGF-$\beta$, SDF-1$\alpha$, and tumor necrosis factor-$\alpha$. CSA-AKI severity is evaluated primarily by the deltaScore, which measures serum creatinine alterations before and after surgery. Serum creatinine concentrations before and after surgery are measured on an identical testing platform in the clinical laboratory of the hospital.

We use the continuous variable deltaScore as the response. For simplicity, we apply a standardizing transformation to both the response and the covariates. Furthermore, we exclude subjects with missing deltaScores because the aforementioned methods (PRIME and SPRIME) apply primarily to missing covariates. Finally, 321 patients are enrolled in the statistical analysis, of whom only approximately 60% have complete covariate information.

Because the number of related variables is not expected to be large, as in Scenario 3 of the simulation study, we use SPRIME and SCC to simultaneously select and estimate the coefficients of the factors that might shed light on the deltaScore. We also use ILSE and ML directly, without the sparsity assumption. The regression coefficient estimates obtained from the four methods are listed in Table 3. Among them, IL-8, IL-10, IFN-$\gamma$, IL-16, and MIP-1$\alpha$ were also found to be related to CSA-AKI in Chen et al. [2]. However, with real-world data it is difficult to objectively evaluate the performance of the candidate methods. Therefore, we delete the subjects with missing covariates and construct missing data manually from the complete-case CSA-AKI data. For the same reason as in Scenario 3, we consider only the MSE to evaluate estimation accuracy. However, because the true coefficients are unknown, we cannot evaluate the MSE as described in the simulation. Hence, we calculate $\text{MSE}_{\text{Full}}$ instead, as follows:

\text{MSE}_{\text{Full}}=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{p}\sum_{j=1}^{p}(\hat{\beta}_{\text{Full},j}-\hat{\beta}_{j})^{2},

where $\hat{\beta}_{\text{Full},j}$ $(j=1,2,\cdots,p)$ are the coefficients estimated using the Full method.

The missing-data patterns are set as in the simulation study, except for the missing probability function. We randomly assign the missing patterns in $\bm{A}^{(1)}$ to the sample with missing probability $P=a$. Furthermore, we set the missing probability of the $i$th unit for the patterns in $\bm{A}^{(2)}$ to $P=\{1+\exp(b\epsilon_{i}+c)\}^{-1}$, where $\epsilon_{i}=Y_{i}-\bm{X}_{i}^{\top}\hat{\bm{\beta}}_{\rm Full}$. Then, we randomly assign the patterns in $\bm{A}^{(2)}$ to the missing samples. For the missing rate, we set $(a,b,c)=(-1.5,-2,0.4)$; consequently, the missing rate is approximately 90%. This process is repeated $N=100$ times to randomly generate missing data. Here, as in the simulation, we set the bandwidth $h=n^{-1/3}$.

The $\text{MSE}_{\text{Full}}$ results are shown in Figure 27. The optimal rates of MSE are 1.00, 0.00, 0.00, and 0.00 for SPRIME, ILSE, CC, and ML, respectively. The results show that SPRIME has an advantage over the other methods in estimation accuracy, as it produces the smallest $\text{MSE}_{\text{Full}}$. The missingness mechanism and incorrect model assumptions may give rise to the worse performance of ILSE, CC, and ML.

Table 3: Regression coefficient estimates for the CSA-AKI data from the four methods (the symbol "-" means that the corresponding variable was not selected).
Variable SPRIME SCC ILSE ML
age -0.058 -0.099 -0.167 -0.012
BMI - - 0.040 0.048
hospitalized time 0.051 0.045 0.069 0.080
CTACK - 0.020 0.207 0.117
FGFa - - 0.137 0.137
G-CSF - - -0.403 -0.462
HGF - - -0.120 -0.162
IFN-$\alpha$2 - - 0.233 0.393
IFN-$\gamma$ 0.089 0.063 0.207 0.199
IL-1$\alpha$ - - -0.126 -0.113
IL-1$\beta$ - - -0.257 -0.256
IL-2 - - 0.349 0.280
IL-4 - - -0.011 -0.012
IL-6 - - -0.126 -0.148
IL-7 - - -0.165 -0.182
IL-8 0.149 0.155 0.527 0.624
IL-9 - -0.008 -0.020 -0.034
IL-10 0.040 0.023 0.043 0.064
IL-12p70 - - 0.347 0.361
IL-12p40 - - -0.029 -0.099
IL-16 0.100 0.109 0.143 0.118
IL-17$\alpha$ - - -0.322 -0.303
IL-18 - - -0.071 -0.080
IP-10 - - -0.057 -0.170
MCP-1 - - 0.081 0.060
MCP-3 - - -0.099 -0.233
M-CSF - 0.080 0.024 0.042
MIF - - -0.131 -0.146
MIG 0.024 0.009 0.169 0.342
MIP-1$\alpha$ 0.036 0.039 0.315 0.330
MIP-1$\beta$ - - -0.187 -0.165
SCF 0.175 0.143 0.166 0.175
SCGF-$\beta$ 0.093 0.0923 0.060 0.121
SDF-1$\alpha$ - - -0.096 -0.057
preLVEF - - 0.056 0.073
TNF-$\alpha$ - - -0.100 -0.071
Figure 27: MSE with real data for different methods.

6 Conclusions

In this study, we propose a projective resampling imputation mean estimation method to estimate regression coefficients when the covariates have a high rate of missingness. Our random features project the data points onto randomly chosen lines, and the resulting scalar-based kernels are averaged to yield a comprehensive result. The random lines are drawn from the standard normal distribution to reduce information loss. We experimentally evaluate the performance of the PRIME method, and the results show that the proposed method is a feasible alternative.

However, the aforementioned work has been developed for a classical setting. Extending PRIME to generalized linear models or Cox models with missing data warrants future research. Furthermore, we considered only missing covariates, even though it is common to encounter cases where both covariates and responses are missing. Hence, developing methods to address these practical issues will be the focus of our future work.

References

  • [1] Mengyun Wu, Jian Huang, and Shuangge Ma. Identifying gene-gene interactions using penalized tensor regression. Statistics in medicine, 37(4):598–610, 2018.
  • [2] Zhongli Chen, Liang Chen, Guangyu Yao, Wenbo Yang, Ke Yang, and Chenglong Xiong. Novel blood cytokine-based model for predicting severe acute kidney injury and poor outcomes after cardiac surgery. Journal of the American Heart Association, 9(22):e018004, 2020.
  • [3] Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1):1–22, 1977.
  • [4] Wei Jiang, Julie Josse, Marc Lavielle, and TraumaBase Group. Logistic regression with missing covariates—parameter estimation, model selection and prediction within a joint-modeling framework. Computational Statistics and Data Analysis, 2020.
  • [5] Shaun R Seaman and Ian R White. Review of inverse probability weighting for dealing with missing data. Statistical Methods in Medical Research, 22(3):278–295, 2013.
  • [6] BaoLuo Sun and Eric J. Tchetgen Tchetgen. On inverse probability weighting for nonmonotone missing at random data. Journal of the American Statistical Association, 113(521):369–379, 2018.
  • [7] Judith M Conn, Kung-Jong Lui, and Daniel L McGee. A model-based approach to the imputation of missing data: Home injury incidences. Statistics in Medicine, 8(3):263–266, 1989.
  • [8] Xiaoyan Yin, Daniel Levy, Christine Willinger, Aram Adourian, and Martin G Larson. Multiple imputation and analysis for high-dimensional incomplete proteomics data. Statistics in medicine, 35(8):1315–1326, 2016.
  • [9] Huazhen Lin, Wei Liu, and Wei Lan. Regression analysis with individual-specific patterns of missing covariates. Journal of Business and Economic Statistics, 19(3):231–253, 2019.
  • [10] William B. Johnson and Joram Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 26:189–206, 1984.
  • [11] Leonard J. Schulman. Clustering for edge-cost minimization (extended abstract). Proceedings of the thirty-second annual ACM symposium on Theory of computing, pages 547–555, 2000.
  • [12] D. L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52:1289–1306, 2006.
  • [13] Qinfeng Shi, Hanxi Li, and Chunhua Shen. Rapid face recognition using hashing. 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pages 2753–2760, 2010.
  • [14] Odalric-Ambrym Maillard and Rémi Munos. Linear regression with random projections. Journal of Machine Learning Research, 13(1):2735–2772, 2012.
  • [15] Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood: approximating kernel expansions in loglinear time. ICML’13 Proceedings of the 30th International Conference on International Conference on Machine Learning - Volume 28, 2013.
  • [16] Wenjiang J. Fu. Penalized regressions: the bridge versus the lasso. Journal of Computational and Graphical Statistics, 7(3):397–416, 1998.
  • [17] Wenjiang J Fu. Penalized estimating equations. Biometrics, 59(1):126–132, 2003.
  • [18] Dimitris Achlioptas. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687, 2003.
  • [19] Ping Li, Trevor J Hastie, and Kenneth W Church. Very sparse random projections. Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 287–296, 2006.
  • [20] Rosa I Arriaga and Santosh Vempala. An algorithmic theory of learning: Robust concepts and random projection. Machine Learning, 16:161–182, 2006.
  • [21] Fei Xue and Annie Qu. Integrating multisource block-wise missing data in model selection. Journal of the American Statistical Association, pages 1–14, 2020.
  • [22] Kani Chen, Shaojun Guo, Liuquan Sun, and Jane-Ling Wang. Global partial likelihood for nonparametric proportional hazards models. Journal of the American Statistical Association, 105(490):750–760, 2010.

Appendix

Proof of Theorem 1

Proof.

Following the proof of Lin et al. [9], we first define

\widehat{E}_{j,A}(\bm{w}_{A})=\frac{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)X_{i^{\prime}j}\left[\prod_{b=1}^{B}K_{h}\left(\bm{X}_{i^{\prime},A}^{\top}\bm{v}_{b,A}-\bm{w}_{A}^{\top}\bm{v}_{b,A}\right)\right]^{\frac{1}{B}}}{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left[\prod_{b=1}^{B}K_{h}\left(\bm{X}_{i^{\prime},A}^{\top}\bm{v}_{b,A}-\bm{w}_{A}^{\top}\bm{v}_{b,A}\right)\right]^{\frac{1}{B}}},\quad j\notin A,

and then

\bm{Z}_{i}=\left(X_{ij}I(j\in A_{i})+\widehat{E}_{j,A_{i}}(\bm{X}_{i,A_{i}})I(j\notin A_{i}):1\leq j\leq p\right)^{\top},
U(\bm{\beta})=\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\{Y_{i}-\bm{Z}_{i}^{\top}\bm{\beta}\}.

Let

\bm{z}_{i}=\left(X_{ij}I(j\in A_{i})+e_{j,A_{i}}(\bm{X}_{i,A_{i}})I(j\notin A_{i}):1\leq j\leq p\right)^{\top},
u(\bm{\beta})=E\left[\bm{z}_{i}\left\{\bm{X}_{i}^{\top}\bm{\beta}_{0}-\bm{z}_{i}^{\top}\bm{\beta}\right\}\right].

Similar to Lin et al. [9], we first prove that $\bm{\beta}_{0}$ is a solution of $u(\bm{\beta})=0$. We have

E(Y_{i}|\bm{X}_{i,A_{i}},A_{i})=\bm{z}_{i}^{\top}\bm{\beta}_{0},

and we also have $E(Y_{i}|\bm{X}_{i,A_{i}},A_{i})=E(\bm{X}_{i}^{\top}|\bm{X}_{i,A_{i}},A_{i})\bm{\beta}_{0}$; hence, $E\{(\bm{X}_{i}-\bm{z}_{i})^{\top}|\bm{X}_{i,A_{i}},A_{i}\}\bm{\beta}_{0}=0$. Noting that $\bm{z}_{i}$ is a function of $(\bm{X}_{i,A_{i}},A_{i})$, we obtain

u(\bm{\beta}_{0})=E\left\{\bm{z}_{i}\left(\bm{X}_{i}^{\top}\bm{\beta}_{0}-\bm{z}_{i}^{\top}\bm{\beta}_{0}\right)\right\}=0. \qquad (8)

Similar to Lin et al. [9], it is easy to show that $\bm{\beta}_{0}$ is the unique solution of $u(\bm{\beta})=0$.

To prove Theorem 1, it suffices to show that

\sup_{\bm{\beta}\in\mathcal{B}}\|U(\bm{\beta})-u(\bm{\beta})\|_{2}=o_{p}(1). \qquad (9)

We rewrite

U(\bm{\beta})-u(\bm{\beta})=U_{1}-U_{2}+U_{3}+U_{4},

where

\begin{aligned}
U_{1}&=\frac{1}{n}\sum_{i=1}^{n}(\bm{Z}_{i}-\bm{z}_{i})Y_{i},\\
U_{2}&=\frac{1}{n}\sum_{i=1}^{n}(\bm{Z}_{i}\bm{Z}_{i}^{\top}-\bm{z}_{i}\bm{z}_{i}^{\top})\bm{\beta},\\
U_{3}&=\frac{1}{n}\sum_{i=1}^{n}\left[\bm{z}_{i}\{\bm{X}_{i}^{\top}\bm{\beta}_{0}-\bm{z}_{i}^{\top}\bm{\beta}\}-E\{\bm{z}_{i}(\bm{X}_{i}^{\top}\bm{\beta}_{0}-\bm{z}_{i}^{\top}\bm{\beta})\}\right],\\
U_{4}&=\frac{1}{n}\sum_{i=1}^{n}\bm{z}_{i}\varepsilon_{i0}.
\end{aligned}

Then $\sup_{\bm{\beta}\in\mathcal{B}}\|U_{3}\|_{2}=o_{p}(1)$ and $\sup_{\bm{\beta}\in\mathcal{B}}\|U_{4}\|_{2}=o_{p}(1)$ follow from the weak law of large numbers. Noting that

\hat{E}_{j,A}(\bm{w}_{A})-e_{j,A}(\bm{w}_{A})=\frac{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left\{X_{i^{\prime}j}-e_{j,A}(\bm{w}_{A})\right\}\left[\prod_{b=1}^{B}K_{h}\left(\bm{X}_{i^{\prime},A}^{\top}\bm{v}_{b,A}-\bm{w}_{A}^{\top}\bm{v}_{b,A}\right)\right]^{\frac{1}{B}}}{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left[\prod_{b=1}^{B}K_{h}\left(\bm{X}_{i^{\prime},A}^{\top}\bm{v}_{b,A}-\bm{w}_{A}^{\top}\bm{v}_{b,A}\right)\right]^{\frac{1}{B}}}, \qquad (10)

Because the Gaussian kernel yields better empirical performance than other types of kernels, we assume here that $K_{h}(\cdot)$ is a Gaussian kernel. Then $\left[\prod_{b=1}^{B}K_{h}(\bm{X}_{i^{\prime},A}^{\top}\bm{v}_{b,A}-\bm{w}_{A}^{\top}\bm{v}_{b,A})\right]^{\frac{1}{B}}$ can be written, up to constants, as $\exp\left\{-\left\|\frac{1}{\sqrt{B}}V\bm{X}_{i^{\prime},A}-\frac{1}{\sqrt{B}}V\bm{w}_{A}\right\|^{2}\right\}$, where $V$ is a $B\times|A|$ random matrix whose entries are chosen independently from either $N(0,1)$ or $U(-1,1)$ and whose $b$th row is $\bm{v}_{b,A}$. Hence, using Theorem 1 in Arriaga and Vempala [20], we can show that $\left[\prod_{b=1}^{B}K_{h}(\bm{X}_{i^{\prime},A}^{\top}\bm{v}_{b,A}-\bm{X}_{i,A}^{\top}\bm{v}_{b,A})\right]^{\frac{1}{B}}=K_{h}(\bm{X}_{i^{\prime},A}-\bm{X}_{i,A})+o_{p}(1)$. Then, (10) can be written as

\begin{aligned}
\hat{E}_{j,A}(\bm{w}_{A})-e_{j,A}(\bm{w}_{A})&=\frac{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left\{X_{i^{\prime}j}-e_{j,A}(\bm{w}_{A})\right\}\left[K_{h}\left(\bm{X}_{i^{\prime},A}-\bm{w}_{A}\right)+o_{p}(1)\right]}{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left[K_{h}\left(\bm{X}_{i^{\prime},A}-\bm{w}_{A}\right)+o_{p}(1)\right]}\\
&=\frac{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left\{X_{i^{\prime}j}-e_{j,A}(\bm{w}_{A})\right\}K_{h}\left(\bm{X}_{i^{\prime},A}-\bm{w}_{A}\right)}{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left[K_{h}\left(\bm{X}_{i^{\prime},A}-\bm{w}_{A}\right)+o_{p}(1)\right]}\\
&\quad+\frac{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left\{X_{i^{\prime}j}-e_{j,A}(\bm{w}_{A})\right\}o_{p}(1)}{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left[K_{h}\left(\bm{X}_{i^{\prime},A}-\bm{w}_{A}\right)+o_{p}(1)\right]}\\
&=E_{1}+E_{2}.
\end{aligned}
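For completeness, the geometric-mean identity invoked above can be verified directly; the following short derivation assumes the Gaussian kernel $K_{h}(u)=\frac{1}{\sqrt{2\pi}h}\exp\{-u^{2}/(2h^{2})\}$ and writes $V$ for the $B\times|A|$ matrix whose $b$th row is $\bm{v}_{b,A}$:

\left[\prod_{b=1}^{B}K_{h}\left(\bm{X}_{i^{\prime},A}^{\top}\bm{v}_{b,A}-\bm{w}_{A}^{\top}\bm{v}_{b,A}\right)\right]^{\frac{1}{B}}=\frac{1}{\sqrt{2\pi}h}\exp\left\{-\frac{1}{2h^{2}}\cdot\frac{1}{B}\sum_{b=1}^{B}\left(\bm{v}_{b,A}^{\top}(\bm{X}_{i^{\prime},A}-\bm{w}_{A})\right)^{2}\right\}=\frac{1}{\sqrt{2\pi}h}\exp\left\{-\frac{1}{2h^{2}}\left\|\frac{1}{\sqrt{B}}V\bm{X}_{i^{\prime},A}-\frac{1}{\sqrt{B}}V\bm{w}_{A}\right\|^{2}\right\},

and the Johnson-Lindenstrauss-type bound of Arriaga and Vempala [20], together with $E(v_{b,A,j}^{2})=1$, implies that $\left\|\frac{1}{\sqrt{B}}V(\bm{X}_{i^{\prime},A}-\bm{w}_{A})\right\|^{2}$ concentrates around $\|\bm{X}_{i^{\prime},A}-\bm{w}_{A}\|^{2}$ as $B$ grows, which yields the $K_{h}(\bm{X}_{i^{\prime},A}-\bm{w}_{A})+o_{p}(1)$ approximation.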

Under the given conditions, following the proof of Theorem 1 in Lin et al. [9] and Lemma 4 in Chen et al. [22], we know that

\sup_{\bm{w}_{A}}\|E_{1}\|_{2}\leq\sup_{\bm{w}_{A}}\left\|\frac{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left\{X_{i^{\prime}j}-e_{j,A}(\bm{w}_{A})\right\}K_{h}\left(\bm{X}_{i^{\prime},A}-\bm{w}_{A}\right)}{\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)\left[K_{h}\left(\bm{X}_{i^{\prime},A}-\bm{w}_{A}\right)+o_{p}(1)\right]}\right\|_{2}=o_{p}(1).

Let $R(\bm{w}_{A};j)=\frac{1}{n}\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)K_{h}\left(\bm{X}_{i^{\prime},A}-\bm{w}_{A}\right)$; then $\sup_{\bm{w}_{A}}|R(\bm{w}_{A};j)-P_{r}\left(A_{i^{\prime}}\supset A\cup j\right)f_{A}(\bm{w}_{A})|=O_{p}\left(\sqrt{\log n}/\sqrt{nh}+h^{2}\right)$ follows from Lemma 4 in Chen et al. [22]. Using Assumption (A9), we have $\inf_{\bm{w}_{A}}R(\bm{w}_{A};j)>C=C_{0}-O_{p}\left(\sqrt{\log n}/\sqrt{nh}+h^{2}\right)$. Hence, we have

\sup_{\bm{w}_{A}}\left\|E_{2}\right\|_{2}\leq\sup_{\bm{w}_{A}}\left\|\frac{1}{n}\sum_{i^{\prime}=1}^{n}\frac{I\left(A_{i^{\prime}}\supset A\cup j\right)\left\{X_{i^{\prime}j}-e_{j,A}(\bm{w}_{A})\right\}}{C+\frac{1}{n}\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)o_{p}(1)}\,o_{p}(1)\right\|_{2}.

Let $S_{i^{\prime}}=I\left(A_{i^{\prime}}\supset A\cup j\right)\left\{X_{i^{\prime}j}-e_{j,A}(\bm{w}_{A})\right\}/\{C+\frac{1}{n}\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)o_{p}(1)\}$. Clearly, under the condition $A_{i}\perp\bm{X}_{i}$,

\begin{aligned}
E(S_{i^{\prime}})&=E\left[\frac{I\left(A_{i^{\prime}}\supset A\cup j\right)X_{i^{\prime}j}}{C+\frac{1}{n}\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)o_{p}(1)}\right]-E\left[\frac{I\left(A_{i^{\prime}}\supset A\cup j\right)E(X_{ij}|\bm{X}_{i,A}=\bm{w}_{A})}{C+\frac{1}{n}\sum_{i^{\prime}=1}^{n}I\left(A_{i^{\prime}}\supset A\cup j\right)o_{p}(1)}\right]\\
&=E\left[\frac{X_{i^{\prime}j}}{C+o_{p}(1)}\,\Big|\,I\left(A_{i^{\prime}}\supset A\cup j\right)=1\right]P_{r}(A_{i^{\prime}}\supset A\cup j)\\
&\quad-E\left\{E\left[\frac{X_{ij}}{C+o_{p}(1)}\,\Big|\,\bm{X}_{i,A}=\bm{w}_{A}\right]\Big|\,I\left(A_{i^{\prime}}\supset A\cup j\right)=1\right\}P_{r}(A_{i^{\prime}}\supset A\cup j)\\
&=\frac{1}{C}\left\{E(X_{i^{\prime}j})P_{r}(A_{i^{\prime}}\supset A\cup j)-E\left[E(X_{ij}|\bm{X}_{i,A}=\bm{w}_{A})\right]P_{r}(A_{i^{\prime}}\supset A\cup j)\right\}\\
&=\frac{1}{C}\left\{E(X_{i^{\prime}j})P_{r}(A_{i^{\prime}}\supset A\cup j)-E(X_{ij})P_{r}(A_{i^{\prime}}\supset A\cup j)\right\}=0.
\end{aligned}

Then, using the weak law of large numbers, we have

\sup_{\bm{w}_{A}}\left\|E_{2}\right\|_{2}=o_{p}(1).

Thus, we obtain $\sup_{\bm{\beta}\in\mathcal{B}}\|U_{1}\|_{2}=o_{p}(1)$, and the same argument also applies to $\sup_{\bm{\beta}\in\mathcal{B}}\|U_{2}\|_{2}$. The above convergences imply that $\sup_{\bm{\beta}\in\mathcal{B}}\|U(\bm{\beta})-u(\bm{\beta})\|_{2}=o_{p}(1)$. The remaining technical derivation follows from Lin et al. [9], and Theorem 1 holds. ∎

Proof of Theorem 2

Proof.

It is obvious that equations (8) and (9) hold regardless of whether the assumption $j\in\mathcal{I}_{1}$, $j=1,2,\cdots,p$, is satisfied. Then, the proof of Theorem 2 follows easily from Theorems 1 and 2 in Fu [16] and Theorem 3 in Fu [17]. ∎