
Adaptive and Efficient Learning with Blockwise Missing and Semi-Supervised Data

Yiming Li, Department of Biostatistics, Columbia University, US; Xuehan Yang, Department of Biostatistics, Columbia University, US; Ying Wei, Department of Biostatistics, Columbia University, US; Molei Liu (corresponding email: [email protected]), Department of Biostatistics, Columbia University, US
Abstract

Data fusion is an important way to realize powerful and generalizable analyses across multiple sources. However, differences in data collection capability across the sources have become a prominent issue in practice. This can result in blockwise missingness (BM) of covariates that is troublesome for integration. Meanwhile, the high cost of obtaining gold-standard labels can cause the response to be missing on a large proportion of samples, known as the semi-supervised (SS) problem. In this paper, we consider a challenging scenario confronting both the BM and SS issues, and propose a novel Data-adaptive projecting Estimation approach for data FUsion in the SEmi-supervised setting (DEFUSE). Starting with a complete-data-only estimator, it involves two successive projection steps to reduce its variance without incurring bias. Compared to existing approaches, DEFUSE achieves a two-fold improvement. First, it leverages the BM labeled sample more efficiently through a novel data-adaptive projection approach robust to model misspecification on the missing covariates, leading to better variance reduction. Second, our method further incorporates the large unlabeled sample to enhance the estimation efficiency through imputation and projection. Compared to the previous SS setting with complete covariates, our work reveals a more essential role of the unlabeled sample in the BM setting. These advantages are justified in asymptotic and simulation studies. We also apply DEFUSE for the risk modeling and inference of heart diseases with the MIMIC-III electronic medical record (EMR) data.

Keywords: Missing data; Data fusion; Control variate; Calibration; Linear allocation; Statistical efficiency.

1 Introduction

1.1 Background

Data fusion enables more powerful and comprehensive analysis by integrating complementary views provided by different data sources, and has been adopted more and more frequently with the increasing capacity for unifying and synthesizing data models across different cohorts or institutions. As an important example, in the field of health and medicine, there has been growing interest and effort in linking electronic medical records (EMRs) with biobank data to tackle complex health issues (Bycroft et al., 2018; Castro et al., 2022, e.g.). EMRs include detailed longitudinal clinical observations of patients, offering a rich source of health information over time. By combining these records with the genetic, genomic and other multi-omics data in biobanks, this type of data fusion enables a much deeper understanding of disease prognosis at the molecular level, potentially transforming patient care and treatment strategies (Zengini et al., 2018; Verma et al., 2023; Wang et al., 2024, e.g.). In addition to combining different types of data, data fusion has also been scaled up to merge data from multiple institutions. The UK Biobank (Sudlow et al., 2015) and the All-of-Us Research Program (The All of Us Research Program Investigators, 2019) in the U.S. are prime examples.

To robustly and efficiently learn from multi-source fused data sets, several main statistical challenges need to be addressed. One is blockwise missingness (BM), which often occurs when certain variables are collected or defined differently across the sources. Consequently, one or multiple entire blocks of covariates are missing in the merged dataset. Dealing with the paucity of accurate outcomes in EMR is another major problem. This is due to the intensive human effort or time cost required to collect accurate outcomes, such as manual chart-review labels created by experts for certain diseases or a long-term endpoint taking years to follow up. In this situation, one is often interested in semi-supervised (SS) learning that combines a small labeled subset with a large unlabeled sample collected from the same population without observations of the true outcome. In the aforementioned EMR applications, the BM and SS problems often occur together, e.g., in our real example in Section 5. This new setup has a more complicated data structure, to be formally described in Section 1.2, and motivates our methodological research in this paper.

1.2 Problem Setup

Let $Y$ be the outcome of interest and $\boldsymbol{X}=(X_{1},X_{2},\ldots,X_{p})^{\top}$ be a vector of $p$-dimensional covariates. As illustrated in the left panel of Figure 1, there are three data structures of the observations from the same population: (i) $N_{\mathcal{LC}}$ labeled and complete observations of $(Y,\boldsymbol{X})$, denoted as $\mathcal{LC}$; (ii) $N_{\mathcal{LM}}$ labeled and covariate-missing observations of $(Y,\boldsymbol{X}_{\mathcal{P}_{m^{c}}})$, denoted as $\mathcal{LM}$, where $\mathcal{P}_{m}\subset\{1,2,\ldots,p\}$ is the index set of the blockwise-missing covariates and $\mathcal{P}_{m^{c}}=\{1,2,\ldots,p\}\setminus\mathcal{P}_{m}$; and (iii) $N_{\mathcal{UC}}$ unlabeled observations with complete covariates, denoted as $\mathcal{UC}$. Note that we consider a missing completely at random (MCAR) regime in which the $\mathcal{LC}$, $\mathcal{LM}$ and $\mathcal{UC}$ samples have a homogeneous underlying distribution. In the SS setting, we assume that $N_{\mathcal{UC}}\gg N_{\mathcal{LC}}+N_{\mathcal{LM}}$. Denote $\rho=N_{\mathcal{LM}}/N_{\mathcal{LC}}$. In our application, $N_{\mathcal{LC}}$ and $N_{\mathcal{LM}}$ have similar magnitudes, while our method and theory also accommodate the $\rho\rightarrow 0$ and $\rho\rightarrow\infty$ scenarios. Let $N=N_{\mathcal{LC}}+N_{\mathcal{UC}}+N_{\mathcal{LM}}$, and write the observed samples as

$$\boldsymbol{D}_{i}=\{I(\mathcal{T}_{i}\neq\mathcal{UC})Y_{i},\; I(\mathcal{T}_{i}\neq\mathcal{LM})\boldsymbol{X}_{i,\mathcal{P}_{m}},\; \boldsymbol{X}_{i,\mathcal{P}_{m^{c}}}\},\quad i=1,2,\ldots,N,$$

where $\mathcal{T}_{i}\in\{\mathcal{LC},\mathcal{UC},\mathcal{LM}\}$ indicates the data type of observation $i$. We are also interested in the scenario with multiple BM structures: $\mathcal{T}_{i}\in\{\mathcal{LC},\mathcal{UC},\mathcal{LM}_{1},\ldots,\mathcal{LM}_{R}\}$, with each $\mathcal{LM}_{r}$ representing a subset of data with observations of $(Y,\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{r}})$ and missing covariates on the index set $\mathcal{P}_{m}^{r}$, satisfying $\mathcal{P}_{m}^{r}\neq\mathcal{P}_{m}^{\ell}$ for any pair $r\neq\ell$ in $\{1,\ldots,R\}$; see the right panel of Figure 1.
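To make this data structure concrete, the following is a minimal simulation sketch of the single-BM and SS design under MCAR. The Gaussian covariates, logistic outcome, and sample sizes are illustrative assumptions and not part of our setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_bm_ss(n_lc=500, n_lm=500, n_uc=5000, p=4, p_m=2):
    """Simulate the single-BM + SS structure: LC (labeled, complete),
    LM (labeled, X_{P_m} missing), UC (unlabeled, complete), all drawn
    MCAR from one homogeneous population."""
    N = n_lc + n_lm + n_uc
    X = rng.normal(size=(N, p))
    gamma = np.linspace(0.5, -0.5, p)                 # illustrative coefficients
    Y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ gamma))).astype(float)
    T = np.array(["LC"] * n_lc + ["LM"] * n_lm + ["UC"] * n_uc)
    rng.shuffle(T)                                    # MCAR type assignment
    X_obs, Y_obs = X.copy(), Y.copy()
    X_obs[T == "LM", :p_m] = np.nan                   # blockwise-missing covariates
    Y_obs[T == "UC"] = np.nan                         # unlabeled outcomes
    return X_obs, Y_obs, T

X_obs, Y_obs, T = simulate_bm_ss()
```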

Figure 1: Data structures of the BM and SS setting considered in this paper. The red block ($\mathcal{LC}$) represents the labeled and complete subset of data where both $Y$ and $\boldsymbol{X}$ are observed completely; the yellow block ($\mathcal{LM}$ or $\mathcal{LM}_{r}$) represents the labeled sample with $\boldsymbol{X}_{\mathcal{P}_{m}}$ or $\boldsymbol{X}_{\mathcal{P}_{m}^{r}}$ missing; and the green block ($\mathcal{UC}$) represents the subset of unlabeled data with a complete observation of $\boldsymbol{X}$. The sample size of $\mathcal{UC}$ is usually much larger than the sample sizes of $\mathcal{LC}$ and $\mathcal{LM}$. We consider both the single BM scenario on the left panel and the multiple BM scenario on the right.

Our primary interest lies in estimating and inferring the generalized linear model (GLM):

$$\mathrm{E}(Y|\boldsymbol{X})=g(\boldsymbol{\gamma}^{\top}\boldsymbol{X}), \quad (1)$$

where $\boldsymbol{\gamma}=(\gamma_{1},\gamma_{2},\ldots,\gamma_{p})^{\top}$ is a vector of unknown coefficients and $g(\cdot)$ is a known link function. For example, $g(a)=a$ corresponds to the Gaussian linear model and $g(a)=e^{a}/(1+e^{a})$ to the logistic model. It is important to note that our framework allows model (1) to be misspecified; we actually define the model coefficients of interest as $\bar{\boldsymbol{\gamma}}$, the solution to the population-level moment equation:

$$\mathbf{U}(\boldsymbol{\gamma})={\rm E}\left[\boldsymbol{X}\left\{Y-g\left(\boldsymbol{\gamma}^{\top}\boldsymbol{X}\right)\right\}\right]=\mathbf{0}. \quad (2)$$

The model parameter $\boldsymbol{\gamma}=(\gamma_{1},\ldots,\gamma_{p})^{\top}$ can be identified and estimated using standard regression on the $\mathcal{LC}$ sample, but $\mathcal{LM}$ and $\mathcal{UC}$ cannot be directly incorporated into this procedure, which results in a loss of effective samples. In this paper, we aim to leverage both the $\mathcal{LM}$ and $\mathcal{UC}$ samples to enhance the estimation efficiency of $\boldsymbol{\gamma}$ without incurring bias. As will be introduced in Section 5.1, the data in our setup can also include auxiliary features that are predictive of $Y$ and $\boldsymbol{X}$, observed on all subjects, and possibly high-dimensional. These variables are not included in the outcome model $Y\sim\boldsymbol{X}$ of our primary interest for scientific reasons but can be readily incorporated into our framework to enhance estimation efficiency. For ease of presentation, we drop the notation of such auxiliary features in the main body of our method and theory.

1.3 Related Literature

With the increasing ability to collect data under common models across multiple institutions, integrative regression with blockwise-missing covariates has become an important problem frequently studied in recent literature. One of the most well-known and commonly used approaches to address missing data is Multiple Imputation with Chained Equations (MICE) (White et al., 2011). However, as a practical, general, and principled strategy, MICE is sometimes subject to high computation costs as well as potential bias and inefficiency caused by misspecification or excessive error in imputation; e.g., see our simulation results.

Recently, Yu et al. (2020) proposed an approach named DISCOM for integrative linear regression under BM. It approximates the covariance of $(\boldsymbol{X},Y)$ through the optimal linear combination across different data structures and incorporates it into an extended sparse regression to derive an integrative estimator. Xue and Qu (2021) proposed a multiple blockwise imputation (MBI) approach that aggregates linear estimating equations with multi-source imputed data sets constructed according to their specific BM structures. They justified the efficiency gain of their proposal over the complete-data-only estimator. Xue et al. (2021) extended this framework to high-dimensional inference with BM data sets. Inspired by the idea of debiasing, Song et al. (2024) developed an approach for a similar problem that avoids imputing the missing blocks while showing better efficiency. Jin and Rothenhäusler (2023) proposed a modular regression approach accommodating BM covariates under the conditional independence assumption across the blocks of covariates given the response.

All works reviewed above primarily focused on the linear model, which is subtly different from the GLM with a nonlinear link, as the latter typically requires the $\mathcal{LC}$ sample to ensure identifiability of the outcome model while the linear model does not. For integrative GLM regression with BM data, Li et al. (2023b) proposed a penalized estimating equation approach with multiple imputation (PEE-MI) that incorporates the correlation of multiple imputed observations into the objective function and realizes effective variable selection. Kundu and Chatterjee (2023) and Zhao et al. (2023) proposed to incorporate the $\mathcal{LM}$ sample through the generalized method of moments (GMM), which actually uses the score functions of a reduced GLM (i.e., the parametric GLM for $Y\sim\boldsymbol{X}_{\mathcal{P}_{m^{c}}}$) on the $\mathcal{LM}$ sample for variance reduction. Such a "reduced model" strategy has been used frequently in other work like Song et al. (2024), as well as for fusing clinical and observational data to enhance the efficiency of causal inference (Yang and Ding, 2019; Hatt et al., 2022; Shi et al., 2023, e.g.).

As will be detailed in our asymptotic and numerical studies, the above-introduced reduced-model or imputation strategies do not characterize the true influence of the $\mathcal{LM}$ data and thus do not achieve semiparametric efficiency in the BM regression setting, which dates back to the seminal work of Robins et al. (1994) and has been explored in various settings (Chen et al., 2008; Li and Luedtke, 2023; Li et al., 2023a; Qiu et al., 2023, e.g.). In particular, Li and Luedtke (2023) derived the canonical gradient for a general data fusion problem and justified its efficiency. Nevertheless, such a semiparametric efficiency property is established under an ideal scenario with a correct conditional model for $\boldsymbol{X}_{\mathcal{P}_{m}}$, which is stringent in practice. Their method could be less robust and effective in the more realistic case when the imputation for $\boldsymbol{X}_{\mathcal{P}_{m}}$ is not consistent with the truth. This issue has been recognized and studied in recent literature mainly focusing on SS inference (Gronsbell et al., 2022; Schmutz et al., 2022; Song et al., 2023; Angelopoulos et al., 2023; Miao et al., 2023; Gan and Liang, 2023; Wu and Liu, 2023). Notably, in our real-world study including high-dimensional auxiliary features to impute the missing variables (see Section 5.1), this problem is even more pronounced due to the difficulty of high-dimensional modeling. In this case, it is challenging to realize adaptive and efficient data fusion with the above-mentioned strategies, as can be seen from the results in Section 5.

Our work is also closely related to recent progress in semi-supervised learning (SSL), which aims to use the large unlabeled sample to enhance the statistical efficiency of analyses with relatively small labeled data sets. On this track, Kawakita and Kanamori (2013) and Song et al. (2023) proposed to use a density ratio model matching the labeled and unlabeled samples to reweight the regression on the labeled sample, which, counter-intuitively, results in SSL estimators with smaller variances than their supervised learning (SL) counterparts. Chakrabortty and Cai (2018) and Gronsbell et al. (2022) proposed to incorporate the unlabeled sample by imputing its unobserved outcome $Y$ with an imputation model constructed with the labeled data. Extending this imputation framework, Cai and Guo (2020), Zhang and Bradic (2022) and Deng et al. (2023) studied the SSL problem with high-dimensional $\boldsymbol{X}$, and Wu and Liu (2023) addressed the SSL of graphical models.

Though having different constructions, these recent SSL approaches achieve a very similar adaptive property: they produce SSL estimators more efficient than SL when the outcome model for $\mathrm{E}(Y|\boldsymbol{X})$ is misspecified and as efficient as SL when it is correct. Azriel et al. (2022) discussed this property and realized it for linear models in a more flexible and effective way. Notably, our SSL setting with BM designs has not been studied in this track of SSL literature. As will be shown later, this problem is essentially distinguished from existing work in SSL, as the efficiency gain from using the $\mathcal{UC}$ sample is attainable even when the outcome model (1) is correctly specified in our case. In addition, we notice that Xue et al. (2021) and Song et al. (2024) also considered using large $\mathcal{UC}$ samples to assist the learning of high-dimensional linear models under the BM setting. Unlike us, they only used the $\mathcal{UC}$ sample to combat and reduce the excessive regularization errors in sparse regression and debiased inference. Nevertheless, for the inference of a low-dimensional projection of the model parameters, the asymptotic variance in their methods still remains unchanged compared to the SL scenario after incorporating the $\mathcal{UC}$ sample.

1.4 Our contribution

In view of the limitations of existing work discussed in the previous section, we propose a Data-adaptive projecting Estimation approach for data FUsion in the SEmi-supervised setting (DEFUSE). Our method contains two major novel steps. It initiates from the simple $\mathcal{LC}$-only estimator and first incorporates the $\mathcal{LM}$ sample to reduce its asymptotic variance through a data-adaptive projection procedure for the score functions. Then, it utilizes the large $\mathcal{UC}$ sample in another projection step to further improve the efficiency of the data-fused estimator resulting from the previous step. In the following, we summarize the main novelty and contributions of our work, which have been justified and demonstrated through asymptotic, simulation, and real-world studies.

  1. Our approach in combining the $\mathcal{LM}$ and $\mathcal{LC}$ samples is shown to produce an asymptotically unbiased and more efficient estimator compared to recent developments for BM regression (Xue and Qu, 2021; Zhao et al., 2023; Song et al., 2024, e.g.). The newly proposed framework is inspired by semiparametric theory and allows the flexible use of general ML algorithms to impute the missing variables. The efficiency gain of our method over existing methods in various settings, such as under model misspecification, is justified from both theoretical and numerical perspectives.

  2. We develop a novel data-driven control function approach to realize adaptive and efficient estimation under potential misspecification or poor quality of the imputation models, especially when there exist nonlinear relationships not captured by them. Compared to recent work addressing problematic nuisance models (Angelopoulos et al., 2023; Miao et al., 2023, e.g.), our strategy is shown to realize more flexible and efficient calibration in our theory, simulation, and real-world studies. This improvement is even clearer when high-dimensional auxiliary features are included in the nuisance model, as in our EMR application.

  3. We establish a novel framework to bring in large unlabeled samples under the SS setting and leverage them to enhance the efficiency of the BM estimation. This enables a more efficient use of the $\mathcal{UC}$ sample, leading to an estimator of strictly smaller variance than that obtained using only $\mathcal{LC}$ and $\mathcal{LM}$, which has not been readily achieved in previous work on a similar setting (Xue et al., 2021; Song et al., 2024). Our development is also shown to be more efficient than existing approaches for the standard SS setting without BM (Gronsbell et al., 2022, e.g.), in both theoretical and numerical studies.

  4. We propose a linear allocation approach to extend our framework to address multiple BM structures, in which the classic semiparametric theory does not apply.

2 Method

2.1 Outline and Preliminary Framework

Without loss of generality, we aim at estimating and inferring a linear projection of the model coefficients $\boldsymbol{\gamma}$, denoted by $\boldsymbol{c}^{\top}\boldsymbol{\gamma}$, where $\boldsymbol{c}$ is an arbitrary vector satisfying $\|\boldsymbol{c}\|_{2}=1$. In this framework, one can take $\boldsymbol{c}$ as the $j$-th unit vector in $\mathbb{R}^{p}$ to infer each GLM effect $\gamma_{j}$, or take $\boldsymbol{c}$ as the observed covariates of a new incoming subject for the purpose of prediction. Let $\widehat{{\rm E}}_{\mathcal{T}}[f(\boldsymbol{D})]$ represent the empirical mean of a measurable function $f(\cdot)$ on the data set $\mathcal{T}$, which can be taken as one or the union of more than one of $\{\mathcal{LC},\mathcal{LM},\mathcal{UC}\}$, e.g.,

$$\widehat{{\rm E}}_{\mathcal{LC}}[f]=\frac{1}{N_{\mathcal{LC}}}\sum_{\mathcal{T}_{i}=\mathcal{LC}}f(\boldsymbol{D}_{i}),\quad\widehat{{\rm E}}_{\mathcal{LC}\cup\mathcal{LM}}[f]=\frac{1}{N_{\mathcal{LC}}+N_{\mathcal{LM}}}\sum_{\mathcal{T}_{i}\in\{\mathcal{LC},\mathcal{LM}\}}f(\boldsymbol{D}_{i}).$$

Similarly, we use $\widehat{{\rm Var}}_{\mathcal{T}}[f]$ to represent the empirical variance of $f(\boldsymbol{D})$ on data set $\mathcal{T}$, e.g., $\widehat{{\rm Var}}_{\mathcal{LC}}[f]=N_{\mathcal{LC}}^{-1}\sum_{\mathcal{T}_{i}=\mathcal{LC}}\{f(\boldsymbol{D}_{i})-\widehat{{\rm E}}_{\mathcal{LC}}[f]\}^{2}$. We start by introducing the DEFUSE method for the scenario with a single $\mathcal{LM}$ data set; see the left panel of Figure 1. It includes three steps outlined in Figure 2 and also in Algorithm 1 presented later.

Figure 2: Outline of the proposed DEFUSE Method. See also Algorithm 1.

First, we solve the empirical version of the estimating equation (2) for $\boldsymbol{\gamma}$ based on the $\mathcal{LC}$ sample:

$$\tilde{\mathbf{U}}_{\mathcal{LC}}(\boldsymbol{\gamma})=\widehat{{\rm E}}_{\mathcal{LC}}[\boldsymbol{X}\{Y-g(\boldsymbol{\gamma}^{\top}\boldsymbol{X})\}]=\mathbf{0}. \quad (3)$$

Denote the solution as $\tilde{\boldsymbol{\gamma}}$. It is also the maximum likelihood estimator (MLE) for the Gaussian linear and logistic models when $g(a)=a$ and $g(a)=e^{a}/(1+e^{a})$, respectively. Using the standard asymptotic theory for M-estimation (Van der Vaart, 2000), we have the expansion

$$\tilde{\boldsymbol{\gamma}}-\bar{\boldsymbol{\gamma}}\approx\frac{1}{N_{\mathcal{LC}}}\sum_{\mathcal{T}_{i}=\mathcal{LC}}\bar{\boldsymbol{H}}^{-1}\boldsymbol{X}_{i}\{Y_{i}-g(\bar{\boldsymbol{\gamma}}^{\top}\boldsymbol{X}_{i})\}:=\frac{1}{N_{\mathcal{LC}}}\sum_{\mathcal{T}_{i}=\mathcal{LC}}\bar{\boldsymbol{S}}_{i};\quad\bar{\boldsymbol{H}}=\mathbb{E}[\dot{g}(\bar{\boldsymbol{\gamma}}^{\top}\boldsymbol{X})\boldsymbol{X}\boldsymbol{X}^{\top}],$$

which will be given more rigorously in Lemma 1. Consequently, the asymptotic variance of $\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}$ is proportional to that of the score function $\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}_{i}$. Therefore, our basic idea for incorporating the $\mathcal{LM}$ sample is to leverage its observations of $(Y_{i},\boldsymbol{X}_{i,\mathcal{P}_{m^{c}}})$ to form a function explaining as much of the variance of $\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}_{i}$ as possible, and use it as a control variate to adjust $\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}$ for variance reduction.
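To illustrate this starting point, the sketch below computes the $\mathcal{LC}$-only estimator $\tilde{\boldsymbol{\gamma}}$ and the plug-in scores $\tilde{\boldsymbol{S}}_{i}$ for the logistic link by Newton-Raphson; the solver details are our own minimal choices and not prescribed by the paper.

```python
import numpy as np

def lc_only_glm(X, Y, n_iter=25):
    """Newton-Raphson for the logistic estimating equation (3) on LC;
    returns gamma_tilde, the Hessian H_tilde = E-hat[g_dot(gamma'X) X X'],
    and per-subject scores S_i = H_tilde^{-1} X_i {Y_i - g(gamma'X_i)}."""
    n, p = X.shape
    gamma = np.zeros(p)
    for _ in range(n_iter):
        mu = 1.0 / (1.0 + np.exp(-X @ gamma))            # g(gamma'X_i)
        H = (X * (mu * (1.0 - mu))[:, None]).T @ X / n   # empirical Hessian
        gamma += np.linalg.solve(H, X.T @ (Y - mu) / n)  # Newton step
    mu = 1.0 / (1.0 + np.exp(-X @ gamma))
    H = (X * (mu * (1.0 - mu))[:, None]).T @ X / n
    S = np.linalg.solve(H, (X * (Y - mu)[:, None]).T).T  # rows are S_i
    return gamma, H, S
```

The variance to be reduced can then be estimated by np.var(S @ c) / n, which is exactly the quantity the control variate introduced next is designed to shrink.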

Specifically, we let $\tilde{\boldsymbol{H}}=\widehat{{\rm E}}_{\mathcal{LC}}[\dot{g}(\tilde{\boldsymbol{\gamma}}^{\top}\boldsymbol{X})\boldsymbol{X}\boldsymbol{X}^{\top}]$ and $\tilde{\boldsymbol{S}}_{i}=\tilde{\boldsymbol{H}}^{-1}\boldsymbol{X}_{i}\{Y_{i}-g(\tilde{\boldsymbol{\gamma}}^{\top}\boldsymbol{X}_{i})\}$. Motivated by semiparametric theory (Robins et al., 1994, e.g.), the ideal control function in this case is $\varphi^{*}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y)=\frac{\rho}{1+\rho}\mathbb{E}_{\boldsymbol{X}_{\mathcal{P}_{m}}}[\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y]$ (with empirical form $\frac{\rho}{1+\rho}\mathbb{E}_{\boldsymbol{X}_{\mathcal{P}_{m}}}[\boldsymbol{c}^{\top}\tilde{\boldsymbol{S}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y]$), with the corresponding estimator incorporating $\mathcal{LM}$ formed as

$$\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}-\widehat{{\rm E}}_{\mathcal{LC}}[\varphi^{*}]+\widehat{{\rm E}}_{\mathcal{LM}}[\varphi^{*}]. \quad (4)$$

Here, the augmentation term $-\widehat{{\rm E}}_{\mathcal{LC}}[\varphi^{*}]+\widehat{{\rm E}}_{\mathcal{LM}}[\varphi^{*}]$ can be viewed as a control variate of mean zero that is negatively correlated with the empirical mean of the score $\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}$ on $\mathcal{LC}$. Among all control variates of this form, $\varphi^{*}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y)$ provides the optimal choice to achieve semiparametric efficiency. However, a critical concern in this classic framework is that computing $\varphi^{*}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y)$ requires knowledge of the conditional distribution ${\rm P}(\boldsymbol{X}_{\mathcal{P}_{m}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y)$, which is typically unknown in practice and needs to be replaced with some data generation model $\mu(\boldsymbol{X}_{\mathcal{P}_{m}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\tilde{\boldsymbol{\eta}})$, where $\mu$ is pre-specified and $\tilde{\boldsymbol{\eta}}$ is an empirical estimate of the unknown model parameters. Correspondingly, our imputation of $\varphi^{*}$ is formed as

$$\tilde{\varphi}_{1}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\tilde{\boldsymbol{\eta}})=\frac{\rho}{1+\rho}\int\boldsymbol{c}^{\top}\tilde{\boldsymbol{S}}(\{\boldsymbol{x}_{\mathcal{P}_{m}},\boldsymbol{X}_{\mathcal{P}_{m^{c}}}\},Y)\,\mu(\boldsymbol{x}_{\mathcal{P}_{m}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\tilde{\boldsymbol{\eta}})\,d\boldsymbol{x}_{\mathcal{P}_{m}}, \quad (5)$$

which is to be used as a control function for the variance reduction of $\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}$. In the following, we remark on the construction strategy and computation of $\tilde{\mu}_{\boldsymbol{X}_{\mathcal{P}_{m}}}:=\mu(\boldsymbol{x}_{\mathcal{P}_{m}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\tilde{\boldsymbol{\eta}})$ and $\tilde{\varphi}_{1,i}:=\tilde{\varphi}_{1}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\tilde{\boldsymbol{\eta}})$ in practice.

Remark 1.

The data imputation model $\tilde{\mu}_{\boldsymbol{X}_{\mathcal{P}_{m}}}$ can be either a simple parametric model with finite-dimensional $\tilde{\boldsymbol{\eta}}$ or a more sophisticated one with a diverging parameter space. For example, for continuous $\boldsymbol{X}_{\mathcal{P}_{m}}$, one can (i) assume that $\boldsymbol{X}_{\mathcal{P}_{m}}$ follows a (multivariate) Gaussian linear model against $\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y$ and learn it through a standard MLE; or (ii) assume $\boldsymbol{X}_{\mathcal{P}_{m}}=m(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y)+\boldsymbol{\varepsilon}$ with $\boldsymbol{\varepsilon}\sim{\rm N}(0,\boldsymbol{\Sigma})$ independent of $\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y$, then use some ML algorithm (e.g., random forest or neural network) to learn the mean model $m(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y)$ and the standard method of moments on the fitted residuals to estimate $\boldsymbol{\Sigma}$; or (iii) implement generative ML, such as a conditional generative adversarial network (GAN) (Mirza and Osindero, 2014), to learn $\mu$ without any model assumptions.

Moreover, as will be discussed and implemented in Section 5, $\tilde{\mu}_{\boldsymbol{X}_{\mathcal{P}_{m}}}$ can be constructed with auxiliary covariates not belonging to the $\boldsymbol{X}$ included in the targeted outcome model. This may invoke learning with high-dimensional features and excessive errors, which highlights the importance of carrying out the flexible data-adaptive calibration to be introduced in Section 2.2. Regarding these choices, our asymptotic analysis generally accommodates all three types, with or without auxiliary covariates.

When using complex ML algorithms like random forests to learn $\tilde{\mu}_{\boldsymbol{X}_{\mathcal{P}_{m}}}$, we adopt cross-fitting to avoid over-fitting bias, as frequently done in related contexts (Chernozhukov et al., 2016; Liu et al., 2023, e.g.). Specifically, we randomly split the $\mathcal{LC}$ sample into $K$ folds of equal size, denoted as $\{\mathcal{LC}^{(1)},\mathcal{LC}^{(2)},\ldots,\mathcal{LC}^{(K)}\}$. For each $k\in[K]$, we implement the ML method on $\mathcal{LC}^{(-k)}=\cup_{j\neq k}\mathcal{LC}^{(j)}$ to obtain an estimator $\tilde{\boldsymbol{\eta}}^{(-k)}$. Then for each $i$ from $\mathcal{LC}^{(k)}$, we use $\tilde{\varphi}_{1}(\boldsymbol{X}_{i,\mathcal{P}_{m^{c}}},Y_{i};\tilde{\boldsymbol{\eta}}^{(-k)})$ as its score imputation, since $\tilde{\boldsymbol{\eta}}^{(-k)}$ is independent of $\boldsymbol{D}_{i}$. For each $i$ from $\mathcal{LM}$, we use the average imputation $K^{-1}\sum_{k=1}^{K}\tilde{\varphi}_{1}(\boldsymbol{X}_{i,\mathcal{P}_{m^{c}}},Y_{i};\tilde{\boldsymbol{\eta}}^{(-k)})$. As shown in Section 3, this strategy is important for eliminating the over-fitting bias of arbitrary ML methods. For ease of presentation, we slightly abuse notation in the remainder of the paper by dropping the superscript $k$ related to cross-fitting.
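A minimal sketch of this cross-fitting step, assuming the Remark 1(ii) working model with a scalar missing block and a random forest mean function; the fold scheme follows the text, while the specific learner is an illustrative choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

def cross_fit_imputation(x_m_lc, Z_lc, K=5, seed=0):
    """Cross-fit the imputation model for a scalar missing block X_{P_m}:
    fit m(X_{P_m^c}, Y) on LC^{(-k)} and predict on the held-out fold
    LC^{(k)}, so each subject's imputation is independent of its own data.
    Z_lc stacks (X_{P_m^c}, Y) columnwise; x_m_lc is X_{P_m} on LC."""
    m_hat = np.empty(len(x_m_lc))
    models = []
    for train, test in KFold(K, shuffle=True, random_state=seed).split(Z_lc):
        rf = RandomForestRegressor(n_estimators=200, random_state=seed)
        rf.fit(Z_lc[train], x_m_lc[train])
        m_hat[test] = rf.predict(Z_lc[test])
        models.append(rf)                # kept to average K predictions on LM
    sigma2 = np.mean((x_m_lc - m_hat) ** 2)  # method-of-moments residual variance
    return m_hat, sigma2, models
```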

With $\tilde{\mu}_{\boldsymbol{X}_{\mathcal{P}_{m}}}$, one can use either numerical methods like Gaussian quadrature or Monte-Carlo (MC) sampling to compute $\tilde{\varphi}_{1,i}$. When $\boldsymbol{X}_{\mathcal{P}_{m}}$ includes many covariates, the computational burden and accuracy of either approach can be a challenging issue. However, inspecting the form of the score function $\boldsymbol{c}^{\top}\tilde{\boldsymbol{S}}(\boldsymbol{X},Y)$, we notice that it depends on the entire $\boldsymbol{X}_{\mathcal{P}_{m}}$ only through the at most two-dimensional linear representations $\tilde{\boldsymbol{\gamma}}_{\mathcal{P}_{m}}^{\top}\boldsymbol{X}_{\mathcal{P}_{m}}$ and $(\boldsymbol{c}^{\top}\bar{\boldsymbol{H}}^{-1})_{\mathcal{P}_{m}}\boldsymbol{X}_{\mathcal{P}_{m}}$. Thus, both Gaussian quadrature and MC can achieve good computational performance thanks to this low dimensionality (Caflisch, 1998; Kovvali, 2022). For MC, we could also use the same set of MC samples for all subjects in certain scenarios (e.g., Gaussian noise under homoscedasticity) to save computational costs. In our case, we choose the MC method to compute $\tilde{\varphi}_{1}$ as it facilitates the data-adaptive procedure to be introduced next.
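For instance, under the Remark 1(ii) working model with a scalar missing covariate, the MC computation of $\tilde{\varphi}_{1,i}$ reduces to the short sketch below; score_fn, a hypothetical helper returning $\boldsymbol{c}^{\top}\tilde{\boldsymbol{S}}(\boldsymbol{X},Y)$, and the ordering that stacks the missing block first are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_phi1_i(x_pmc, y, m_hat_i, sigma2, score_fn, rho, Q=200):
    """MC approximation of the control function (5) for one subject:
    draw Q imputations of the scalar X_{P_m} from N(m_hat_i, sigma2)
    and average c'S-tilde over the completed covariate vectors."""
    draws = m_hat_i + rng.normal(scale=np.sqrt(sigma2), size=Q)
    scores = [score_fn(np.r_[x_m, x_pmc], y) for x_m in draws]
    return rho / (1.0 + rho) * np.mean(scores)
```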

2.2 Data-adaptive Calibration

The numerical approximation of (5) offers a convenient approach for estimating the conditional expectation function $\varphi^{*}$. However, the nuisance model $\tilde{\mu}_{\boldsymbol{X}_{\mathcal{P}_{m}}}$, whether parametric or nonparametric, can deviate from the truth ${\rm P}(\boldsymbol{X}_{\mathcal{P}_{m}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y)$ in practice due to model misspecification or excessive ML errors. This deviation typically does not cause excessive bias (as shown in our asymptotic analysis) but could result in inefficient estimation in (4). Similar observations have been made in the recent missing-data literature (Angelopoulos et al., 2023; Wu and Liu, 2023, e.g.). To address this issue, we propose a novel data-adaptive calibration approach to further reduce the variance in the presence of model misspecification. The key idea is to introduce a broader class of conditional models for $\boldsymbol{X}_{\mathcal{P}_{m}}$ with additional unknown parameters $\boldsymbol{\delta}$, denoted as $\mu^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\boldsymbol{\delta},\tilde{\boldsymbol{\eta}})$. We can then calibrate through its free parameter $\boldsymbol{\delta}$ to produce a more efficient estimator than (4) under a wrong $\tilde{\mu}_{\boldsymbol{X}_{\mathcal{P}_{m}}}$.

Specifically, let $\{\boldsymbol{X}_{i,\mathcal{P}_{m}}^{[q]}:q=1,2,\ldots,Q\}$ be $Q$ independent random draws from the preliminary $\mu(\boldsymbol{X}_{\mathcal{P}_{m}}\mid\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\tilde{\boldsymbol{\eta}})$ for the $i$th subject in $\mathcal{LC}\cup\mathcal{LM}$. The standard MC approximation of $\tilde{\varphi}_{1,i}$ defined in (5) is given by $Q^{-1}\sum_{q=1}^{Q}\frac{\rho}{1+\rho}\boldsymbol{c}^{\top}\tilde{\boldsymbol{S}}(\{\boldsymbol{X}_{i,\mathcal{P}_{m^{c}}},\boldsymbol{X}_{i,\mathcal{P}_{m}}^{[q]}\},Y_{i})$. In Section 3, we show that a size $Q$ moderately increasing with the sample size is sufficient to guarantee the asymptotic normality and efficiency of our estimator. Inspired by importance weighting, we re-construct the control function with our newly introduced $\mu^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\boldsymbol{\delta},\tilde{\boldsymbol{\eta}})$ as follows:

$$\frac{\rho}{1+\rho}\int\frac{\mu^{\prime}(\boldsymbol{x}_{\mathcal{P}_{m}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\boldsymbol{\delta},\tilde{\boldsymbol{\eta}})}{\mu(\boldsymbol{x}_{\mathcal{P}_{m}}\mid\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\tilde{\boldsymbol{\eta}})}\boldsymbol{c}^{\top}\tilde{\boldsymbol{S}}(\{\boldsymbol{x}_{\mathcal{P}_{m}},\boldsymbol{X}_{\mathcal{P}_{m^{c}}}\},Y)\left[\mu(\boldsymbol{x}_{\mathcal{P}_{m}}\mid\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\tilde{\boldsymbol{\eta}})\,d\boldsymbol{x}_{\mathcal{P}_{m}}\right].$$

Its MC approximation is given by

$$\tilde{\varphi}_{1}^{\prime}(\boldsymbol{X}_{i,\mathcal{P}_{m^{c}}},Y_{i};\boldsymbol{\delta},\tilde{\boldsymbol{\eta}})=\frac{1}{Q}\sum_{q=1}^{Q}\frac{\rho}{1+\rho}w(\boldsymbol{X}_{i}^{[q]},Y_{i};\boldsymbol{\delta})\,\boldsymbol{c}^{\top}\tilde{\boldsymbol{S}}(\boldsymbol{X}_{i}^{[q]},Y_{i}),$$

where $\boldsymbol{X}_{i}^{[q]}=\{\boldsymbol{X}_{i,\mathcal{P}_{m^{c}}},\boldsymbol{X}_{i,\mathcal{P}_{m}}^{[q]}\}$ and $w(\cdot)$ is the importance weighting (IW) function such that

$$w(\boldsymbol{X}_{i}^{[q]},Y;\boldsymbol{\delta})=\frac{\mu^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m}}^{[q]}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\boldsymbol{\delta},\tilde{\boldsymbol{\eta}})}{\mu(\boldsymbol{X}_{\mathcal{P}_{m}}^{[q]}\mid\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\tilde{\boldsymbol{\eta}})}.$$

Following the definition of $w$, we can construct $\mu^{\prime}$ as the product of $\mu$ and an IW function $w$ that is parameterized only by $\boldsymbol{\delta}$. It is easy to see that when $w\equiv 1$, $\mu^{\prime}\equiv\mu$. Hence, $\mu^{\prime}$ encompasses a broader class of conditional models for $\boldsymbol{X}_{\mathcal{P}_{m}}$. We then replace $\varphi^{*}$ in (5) by $\tilde{\varphi}_{1}^{\prime}$ and calibrate it by optimizing $\boldsymbol{\delta}$ as follows:

$$\tilde{\boldsymbol{\delta}}=\arg\min_{\boldsymbol{\delta}}\;\widehat{{\rm Var}}_{\mathcal{LC}}\left[\boldsymbol{c}^{\top}\tilde{\boldsymbol{S}}(\boldsymbol{X},Y)-\tilde{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\boldsymbol{\delta},\tilde{\boldsymbol{\eta}})\right]+\frac{1}{\rho}\widehat{{\rm Var}}_{\mathcal{LC}}\left[\tilde{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\boldsymbol{\delta},\tilde{\boldsymbol{\eta}})\right]. \quad (6)$$

By design, $\tilde{\boldsymbol{\delta}}$ maximizes the correlation between $\boldsymbol{c}^{\top}\tilde{\boldsymbol{S}}(\boldsymbol{X},Y)$ and $\tilde{\varphi}_{1}^{\prime}$ while minimizing the variance of $\tilde{\varphi}_{1}^{\prime}$ itself. Consequently, the new control variate $\tilde{\varphi}_{1}^{\prime}$ with optimized $\tilde{\boldsymbol{\delta}}$ can further reduce the estimation variance when $\mu$ is misspecified. We define the corresponding estimator in the single BM scenario as

$$\hat{\beta}_{1}=\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}-\widehat{{\rm E}}_{\mathcal{LC}}[\tilde{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\tilde{\boldsymbol{\delta}},\tilde{\boldsymbol{\eta}})]+\widehat{{\rm E}}_{\mathcal{LM}}[\tilde{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\tilde{\boldsymbol{\delta}},\tilde{\boldsymbol{\eta}})], \quad (7)$$

and denote it as DEFUSE1.

Just like $\mu$, the IW model $w$ can also be constructed using general learners (with cross-fitting needed for complex ML) as long as the constant $1$ belongs to their feasible spaces or sets. Without loss of generality, we impose this constraint as $w(\boldsymbol{x}^{[q]},y;\boldsymbol{0})\equiv 1$. As will be shown later, this ensures that when $\mu(\boldsymbol{X}_{\mathcal{P}_{m}}\mid\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\tilde{\boldsymbol{\eta}})$ is consistent with the true conditional distribution of $\boldsymbol{X}_{\mathcal{P}_{m}}$, we have $\mu^{\prime}\asymp\mu$ and the resulting $\hat{\beta}_{1}$ is asymptotically equivalent to the semiparametric efficient estimator (4). Similarly, it is also desirable to make $w\equiv 0$ feasible in our construction, since this ensures that $\hat{\beta}_{1}$ has no larger variance than the $\mathcal{LC}$-only estimator $\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}$, i.e., it avoids negative effects of incorporating $\mathcal{LM}$. For the choice of $w$, see Assumption 5 and Lemma 2 for more rigorous details.

Notably, even when $\mu^{\prime}$ has negative values or an integral not equal to $1$, $\hat{\beta}_{1}$ can still be asymptotically unbiased and have effectively reduced variance. In general, (6) can be a non-convex problem with a high computational cost to optimize due to the MC sampling. To handle it, one may use stochastic gradient descent (SGD), which generates a few MC samples for each observation (i.e., with small $Q$) and makes gradient updates accordingly over multiple iterations. We remark below on practical choices of $w$ and $\boldsymbol{\delta}$ subject to these computational concerns.

Remark 2.

To avoid high computational costs and potential over-fitting, we recommend setting $w$ to have a simpler form than $\mu$. A simple but usually effective choice is exponential tilting as described in Johns (1988), where $w(\boldsymbol{x},y;\boldsymbol{\delta})=\exp\{\boldsymbol{\delta}^{\top}\xi(\boldsymbol{x},y)\}$ and $\xi(\boldsymbol{x},y)$ is a $d$-dimensional basis with $\boldsymbol{\delta}\in\mathbb{R}^{d}$. However, this makes (6) non-convex and hence challenging to solve. Therefore, we propose further simplifying it to $w(\boldsymbol{x},y;\boldsymbol{\delta})=1+\boldsymbol{\delta}^{\top}\xi(\boldsymbol{x},y)$, which makes (6) a convex problem with an explicit-form solution. This form closely resembles the exponential one when $\boldsymbol{\delta}^{\top}\xi(\boldsymbol{x},y)$ is small. Additionally, when $\xi(\boldsymbol{x},y)$ includes a constant term of $1$ as intercept, both $w\equiv 0$ and $w\equiv 1$ are feasible solutions.
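Under this linear form, (6) becomes an explicit quadratic problem; below is a minimal closed-form sketch, where phi0 and psi are assumed to be precomputed MC summaries (already carrying the $\rho/(1+\rho)$ factor) on the $\mathcal{LC}$ sample.

```python
import numpy as np

def calibrate_delta(score, phi0, psi, rho):
    """Closed-form minimizer of (6) for w(x, y; delta) = 1 + delta'xi(x, y).
    Per LC subject i:
      score[i] : c'S_i,
      phi0[i]  : MC control function evaluated at delta = 0,
      psi[i,:] : MC average of xi(X^[q], Y_i) * c'S(X^[q], Y_i).
    Since phi1'(delta) = phi0 + psi @ delta, the objective in (6) is
    quadratic in delta and one linear solve gives the minimizer."""
    n = len(score)
    psi_c = psi - psi.mean(axis=0)
    A = (1.0 + 1.0 / rho) * (psi_c.T @ psi_c) / n
    b = (psi_c.T @ (score - score.mean())
         - (1.0 + 1.0 / rho) * psi_c.T @ (phi0 - phi0.mean())) / n
    return np.linalg.solve(A, b)
```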

2.3 Semi-supervised Estimation

After obtaining DEFUSE1, we further incorporate the $\mathcal{UC}$ sample to enhance its efficiency. This also involves a data-adaptive projection step that shares a similar spirit with DEFUSE1. Regarding the $\mathcal{LC}$-sample terms composing $\hat{\beta}_{1}$ in (7), an interesting observation is that the score function of $\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}$ is orthogonal to $\boldsymbol{X}$ when the outcome model (1) is correct. This orthogonality leads to limited efficiency gains in existing SSL methods for standard GLMs, such as Chakrabortty and Cai (2018), because their estimates are score-function based. However, the projection of $\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}$ on $(Y,\boldsymbol{X}_{\mathcal{P}_{m^{c}}})$, namely $\tilde{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\tilde{\boldsymbol{\delta}},\tilde{\boldsymbol{\eta}})$, is typically not orthogonal to $\boldsymbol{X}$. Thus, unlike those existing SSL methods, $\boldsymbol{X}$ can partially explain the variance of $\hat{\beta}_{1}$ through the projection, even when the outcome model is correct. This promises the benefit of leveraging the large $\mathcal{UC}$ sample in our estimates.

Inspired by this, we propose to model the conditional distribution of $Y$ given $\boldsymbol{X}$, denoted as $\nu(Y|\boldsymbol{X};\tilde{\boldsymbol{\theta}})$, which can be either directly specified as the outcome model (1) with some extra nuisance parameters (e.g., the variance of Gaussian noise) or estimated using ML as discussed in Remark 1. As in Section 2.2, we introduce an IW function

$$h(\boldsymbol{X},Y;\boldsymbol{\zeta})=\frac{\nu^{\prime}(Y|\boldsymbol{X};\boldsymbol{\zeta},\tilde{\boldsymbol{\theta}})}{\nu(Y|\boldsymbol{X};\tilde{\boldsymbol{\theta}})},$$

where $\nu^{\prime}(Y|\boldsymbol{X};\boldsymbol{\zeta},\tilde{\boldsymbol{\theta}})$ stands for the calibrated conditional model for $Y$, introduced to improve the efficiency when $\nu(Y|\boldsymbol{X};\tilde{\boldsymbol{\theta}})$ is misspecified. The optimal $\boldsymbol{\zeta}$ can be obtained by

$$\tilde{\boldsymbol{\zeta}}=\arg\min_{\boldsymbol{\zeta}}\;\widehat{{\rm Var}}_{\mathcal{LC}}\left[\tilde{S}_{1}(\boldsymbol{X},Y)-\tilde{\varphi}_{2}^{\prime}(\boldsymbol{X};\boldsymbol{\zeta},\tilde{\boldsymbol{\theta}})\right], \quad (8)$$

where $\tilde{S}_{1}(\boldsymbol{X},Y)=\boldsymbol{c}^{\top}\tilde{\boldsymbol{S}}(\boldsymbol{X},Y)-\tilde{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\tilde{\boldsymbol{\delta}},\tilde{\boldsymbol{\eta}})$ is the empirical score function of DEFUSE1 on the $\mathcal{LC}$ sample. The term $\tilde{\varphi}_{2}^{\prime}(\boldsymbol{X};\boldsymbol{\zeta},\tilde{\boldsymbol{\theta}})$ is constructed as $Q^{-1}\sum_{q=1}^{Q}h(\boldsymbol{X},Y^{[q]};\boldsymbol{\zeta})\tilde{S}_{1}(\boldsymbol{X},Y^{[q]})$, where $\{Y^{[q]}:q=1,2,\ldots,Q\}$ are i.i.d. draws from the preliminary model $\nu(Y|\boldsymbol{X};\tilde{\boldsymbol{\theta}})$.
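A minimal per-subject sketch of this construction; h_fn and s1_fn (the IW function $h$ and the DEFUSE1 score $\tilde{S}_{1}$) are hypothetical helpers, and y_draws come from the fitted $\nu(\cdot|\boldsymbol{X};\tilde{\boldsymbol{\theta}})$. The same routine is evaluated on $\mathcal{LC}$ for the objective (8) and on $\mathcal{UC}$ for the augmentation in (9) below.

```python
import numpy as np

def phi2_prime_i(x_i, y_draws, s1_fn, h_fn, zeta):
    """MC construction of phi-tilde-2'(X_i; zeta, theta): re-weight the
    DEFUSE1 score S1 at Y-draws from the preliminary model nu(.|X_i)
    by the IW function h, then average over the Q draws."""
    vals = [h_fn(x_i, y, zeta) * s1_fn(x_i, y) for y in y_draws]
    return np.mean(vals)
```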

Compared to (6), the objective function in (8) does not include the variance term of $\tilde{\varphi}_{2}^{\prime}$. That is because the variance of $\tilde{\varphi}_{2}^{\prime}$ on $\mathcal{UC}$ is negligible in the SS setting with $N_{\mathcal{UC}}\gg N_{\mathcal{LC}}$. Finally, the DEFUSE estimator in the single BM scenario is constructed as

$$\hat{\beta}_{2}=\hat{\beta}_{1}-\widehat{{\rm E}}_{\mathcal{LC}}[\tilde{\varphi}_{2}^{\prime}(\boldsymbol{X};\tilde{\boldsymbol{\zeta}},\tilde{\boldsymbol{\theta}})]+\widehat{{\rm E}}_{\mathcal{UC}}[\tilde{\varphi}_{2}^{\prime}(\boldsymbol{X};\tilde{\boldsymbol{\zeta}},\tilde{\boldsymbol{\theta}})]. \quad (9)$$

Again, we impose $h(\boldsymbol{X},Y;\boldsymbol{0})=1$ so that $\hat{\beta}_{2}$ can achieve semiparametric efficiency in this SS step when the imputation distribution $\nu(\cdot|\boldsymbol{X};\tilde{\boldsymbol{\theta}})$ is consistent with the truth. Also, letting the constant $0$ be in the feasible space of $h$ ensures that $\tilde{\varphi}_{2}^{\prime}=0$ is feasible and $\hat{\beta}_{2}$ has variance no larger than that of $\hat{\beta}_{1}$. For the specification of $h$, we suggest using strategies similar to those proposed in Remark 2. We end our exposition for the single BM and SS setting with a remark on our construction strategy.

Remark 3.

Our framework starts by leveraging the $\mathcal{LM}$ sample through projecting $\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}$ onto $(Y,\boldsymbol{X}_{\mathcal{P}_{m^{c}}})$ and then incorporates $\mathcal{UC}$ to further enhance the resulting DEFUSE1 estimator. Reversing the order of these two projection steps is likely to result in a less efficient estimator because the score of $\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}$ tends to be orthogonal to $\boldsymbol{X}$, and directly using $\mathcal{UC}$ on it may not be helpful when the outcome model (1) is strictly or approximately correct. Also, we notice that $\varphi^{*}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y)$ is orthogonal to $\boldsymbol{X}_{\mathcal{P}_{m^{c}}}$ when the models for $Y$ and $\boldsymbol{X}_{\mathcal{P}_{m}}$ are both correct, and its empirical version $\tilde{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\tilde{\boldsymbol{\delta}},\tilde{\boldsymbol{\eta}})$ tends to be poorly explained by $\boldsymbol{X}_{\mathcal{P}_{m^{c}}}$ on the $\mathcal{LM}$ data in our finite-sample studies. Therefore, we do not further reduce the variance contributed by the $\mathcal{LM}$ sample in our SS procedure, to save computational costs.
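Putting the single-BM pipeline together, the final assembly of (9) with a plug-in standard error takes a few lines; a minimal sketch, assuming the $\mathcal{LC}$ residual scores and the control-function values have been computed as above.

```python
import numpy as np

def defuse_estimate(beta1, s1_lc, phi2_lc, phi2_uc):
    """Final DEFUSE estimator (9) and a plug-in standard error: the
    asymptotic variance is driven by the LC residual score S1 - phi2',
    since the UC average contributes negligible noise when N_UC >> N_LC."""
    beta2 = beta1 - phi2_lc.mean() + phi2_uc.mean()
    se = np.std(s1_lc - phi2_lc, ddof=1) / np.sqrt(len(s1_lc))
    return beta2, se
```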

2.4 DEFUSE in the Multiple BM Scenario

In this section, we extend the proposed DEFUSE method to the scenario of $R\geq 2$ data sets $\{\mathcal{LM}_{r}:r=1,\ldots,R\}$ with different BM sets denoted as $\{\mathcal{P}_{m}^{r}:r=1,\ldots,R\}$ and sample sizes $\{N_{\mathcal{LM}_{r}}:r=1,\ldots,R\}$; see the right panel of Figure 1. As in the single BM scenario, we start with the standard $\mathcal{LC}$-only estimator $\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}$ and leverage the $\mathcal{LM}_{r}$'s to improve its efficiency. To this end, we first implement the data-adaptive calibration proposed in Section 2.2 with each $\mathcal{LM}_{r}$ separately, to obtain the control function $\tilde{\varphi}_{1,r}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{r}},Y)$ analogous to $\tilde{\varphi}_{1}^{\prime}$ in the single BM case. Here we drop the notation of the nuisance parameters in $\tilde{\varphi}_{1,r}^{\prime}$ for simplicity.

Denote $\tilde{\boldsymbol{\varphi}}_{1}^{\bullet}(\boldsymbol{X},Y)=\{\tilde{\varphi}_{1,1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{1}},Y),\ldots,\tilde{\varphi}_{1,R}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{R}},Y)\}^{\top}$. We then assemble the multiple control functions to form the DEFUSE1 estimator as

$$\hat{\beta}_{1}^{\bullet}=\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}+\sum_{r=1}^{R}\tilde{a}_{r}\left\{\widehat{{\rm E}}_{\mathcal{LM}_{r}}\left[\tilde{\varphi}_{1,r}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{r}},Y)\right]-\widehat{{\rm E}}_{\mathcal{LC}}\left[\tilde{\varphi}_{1,r}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{r}},Y)\right]\right\}, \quad (10)$$

where the ensemble weights $\tilde{a}_{r}$ in $\tilde{\boldsymbol{a}}$ are solved from the quadratic programming problem:

$$\min_{\boldsymbol{a}=(a_{1},\ldots,a_{R})^{\top}}~\widehat{{\rm Var}}_{\mathcal{LC}}\left[\boldsymbol{c}^{\top}\tilde{\boldsymbol{S}}(\boldsymbol{X},Y)-\boldsymbol{a}^{\top}\tilde{\boldsymbol{\varphi}}_{1}^{\bullet}(\boldsymbol{X},Y)\right]+\sum_{r=1}^{R}\frac{a_{r}^{2}}{\rho_{r}}\widehat{{\rm Var}}_{\mathcal{LM}_{r}}\left[\tilde{\varphi}_{1,r}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{r}},Y)\right], \quad (11)$$

with $\rho_{r}=\min\{N_{\mathcal{LM}_{r}}/N_{\mathcal{LC}},C\}$. The goal of (11) is to find the optimal allocation $\tilde{\boldsymbol{a}}$ minimizing the asymptotic variance of the DEFUSE1 estimator in (10), which augments the preliminary $\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}$ with a linear combination of the variance-reduction terms coming from every $\mathcal{LM}_{r}\cup\mathcal{LC}$. Correspondingly, the objective function in (11) is set as the empirical variance of such an estimator with weights $\boldsymbol{a}$; see our asymptotic analysis for details.

One can view (11) as a ridge regression with each coefficient $a_{r}$ penalized according to the sample size $N_{\mathcal{LM}_{r}}$ and the variance of $\tilde{\varphi}_{1,r}^{\prime}$. For certain BM patterns, $\tilde{\boldsymbol{\varphi}}_{1}^{\bullet}(\boldsymbol{X},Y)$ can have highly correlated or collinear elements. In this case, the "ridge penalty" on $\tilde{\boldsymbol{a}}$ actually ensures the non-singularity and stability of (11). Thus, we assign an upper bound $C$ to each $\rho_{r}$ in our definition so that $\rho_{r}^{-1}$ cannot become too small and render this penalty ineffective, even when $N_{\mathcal{LM}_{r}}\gg N_{\mathcal{LC}}$.
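Concretely, this ridge-regression view gives (11) a closed-form solution; a minimal sketch, assuming the control functions $\tilde{\varphi}_{1,r}^{\prime}$ have been evaluated on $\mathcal{LC}$ and their variances on each $\mathcal{LM}_{r}$.

```python
import numpy as np

def optimal_allocation(score_lc, Phi_lc, var_lm, rho):
    """Closed-form solution of the quadratic program (11), read as a
    ridge regression: each a_r is shrunk by the penalty var_lm[r] / rho[r].
      score_lc : c'S_i on LC, shape (n,)
      Phi_lc   : phi'_{1,r} evaluated on LC, shape (n, R)
      var_lm   : Var-hat of phi'_{1,r} on each LM_r, shape (R,)
      rho      : capped size ratios min(N_{LM_r}/N_LC, C), shape (R,)"""
    n = len(score_lc)
    Phi_c = Phi_lc - Phi_lc.mean(axis=0)
    G = Phi_c.T @ Phi_c / n                          # Cov-hat of phi on LC
    b = Phi_c.T @ (score_lc - score_lc.mean()) / n   # Cov-hat(phi, c'S)
    return np.linalg.solve(G + np.diag(var_lm / rho), b)
```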

Similarly to the single BM scenario, we incorporate the $\mathcal{UC}$ sample to further reduce the variance of $\hat{\beta}_{1}^{\bullet}$ using the same idea as in Section 2.3. Note that the empirical score function of $\hat{\beta}_{1}^{\bullet}$ on $\mathcal{LC}$ can be written as $\boldsymbol{c}^{\top}\tilde{\boldsymbol{S}}(\boldsymbol{X},Y)-\tilde{\boldsymbol{a}}^{\top}\tilde{\boldsymbol{\varphi}}_{1}^{\bullet}(\boldsymbol{X},Y)$. Naturally, we use it to replace the score $\tilde{S}_{1}(\boldsymbol{X},Y)$ in (8) and solve for the corresponding calibration parameters. We then use this solution to impute the score and augment $\hat{\beta}_{1}^{\bullet}$ in the same form as (9). This procedure can be regarded as a straightforward extension of our proposal in Section 2.3.

We summarize the main steps of the DEFUSE approach in Algorithm 1. We notice that some recent work, like Kundu and Chatterjee (2023), highlighted practical needs to protect the privacy of individual-level data when fusing the data sets $\mathcal{LC},\mathcal{LM}_{1},\ldots,\mathcal{LM}_{R}$ coming from different institutions. This has been achieved by transferring summary data only (e.g., mean vectors, model coefficients, and covariance matrices) across the sites, which is often referred to as the DataSHIELD framework or constraint (Wolfson et al., 2010). Our method can naturally accommodate this situation as well. Specifically, note that Steps 1, 2.1, and 3.1 in Algorithm 1 only rely on the $\mathcal{LC}$ sample, while Steps 2.2 and 3.2 require summary data (i.e., the empirical means of the imputed score functions) from each $\mathcal{LM}_{r}$ or $\mathcal{UC}$. In practice, $\mathcal{LC}$ and $\mathcal{UC}$ are likely to be stored at the same site. Thus, under the data-sharing constraint, the implementation of DEFUSE usually requires one, and at most two, rounds of communication between $\mathcal{LC}$ and the other sites, which is communication-efficient and user-friendly.

Algorithm 1 DEFUSE
Input: samples $\mathcal{LC}$, $\{\mathcal{LM}_{r}:r=1,\ldots,R\}$ ($R\geq 1$), and $\mathcal{UC}$; a loading vector $\boldsymbol{c}$.
[Step 1] Solve equation (3) on the $\mathcal{LC}$ data to obtain $\tilde{\boldsymbol{\gamma}}$ and compute $\tilde{\boldsymbol{S}}$.
[Step 2.1] For each $r\in\{1,\ldots,R\}$: learn the conditional distribution $\tilde{\mu}_{\boldsymbol{X}_{\mathcal{P}_{m}^{r}}}$; then solve for the data-adaptive calibration function via (6) on $\mathcal{LC}$ (with $\mathcal{P}_{m}$ set as each $\mathcal{P}_{m}^{r}$).
[Step 2.2] When $R=1$, construct the DEFUSE1 estimator $\hat{\beta}_{1}$ using (7); when $R\geq 2$, obtain the optimal allocation $\tilde{\boldsymbol{a}}$ by (11) and use it to construct $\hat{\beta}_{1}^{\bullet}$ as in (10).
[Step 3.1] Learn the conditional distribution $\nu(Y|\boldsymbol{X};\tilde{\boldsymbol{\theta}})$ on $\mathcal{LC}$. Obtain the data-adaptive calibration function via (8) on $\mathcal{LC}$.
[Step 3.2] Construct the final DEFUSE estimator by (9), or in a similar way when $R\geq 2$ as discussed in Section 2.4.

3 Asymptotic Analysis

3.1 Notations and Assumptions

In this section, we conduct theoretical analysis of the asymptotic normality and efficiency gain of DEFUSE. Justification of all results presented in this section can be found in the Appendix. We begin with notation and main assumptions. Let ${\rm P}(\boldsymbol{Z})$ denote the true probability density or mass function of random variables $\boldsymbol{Z}$ and ${\rm P}(\boldsymbol{Z}_{1}|\boldsymbol{Z}_{2})$ the conditional probability function, ${\rm Cov}(\cdot)$ the population covariance operator, and ${\rm Diag}(u_{k})_{k=1,\ldots,d}$ the $d\times d$ diagonal matrix with $k$-th diagonal entry $u_{k}$. For simplicity of presentation, denote $n=N_{\mathcal{LC}}$. For two sequences $a_{n}$ and $b_{n}$ with index $n\rightarrow\infty$, we say $a_{n}<\infty$ when there exists some constant $M>0$ such that $a_{n}<M$; $a_{n}=O(b_{n})$ when $\lim_{n\rightarrow\infty}|a_{n}/b_{n}|<\infty$; $a_{n}=o(b_{n})$ when $\lim_{n\rightarrow\infty}a_{n}/b_{n}=0$; and $a_{n}=o_{\sf P}(b_{n})$ or $a_{n}=O_{\sf P}(b_{n})$ when $a_{n}=o(b_{n})$ or $a_{n}=O(b_{n})$ holds with probability approaching $1$. We say $f(x)$ is a Lipschitz function if there exists a constant $A>0$ such that for any $x_{1}$ and $x_{2}$ in the domain of $f$, $|f(x_{1})-f(x_{2})|\leq A|x_{1}-x_{2}|$.

We first introduce the assumptions for analyzing DEFUSE in the single BM scenario. Assume that $\rho=N_{\mathcal{LM}}/n$ converges to some $\bar{\rho}\in[0,\infty]$, $n+N_{\mathcal{LM}}=o(N_{\mathcal{UC}})$, and the number of MC samples $Q>n^{\epsilon}$ for an arbitrary constant $\epsilon>0$, which can be set small to save computational costs without impacting the asymptotic properties of DEFUSE.

Assumption 1.

The derivative of the link function, $\dot{g}(u)$, is Lipschitz; $\boldsymbol{\gamma}$ belongs to a compact space including $\bar{\boldsymbol{\gamma}}$ in its interior; and the Hessian $\bar{\boldsymbol{H}}$ has all its eigenvalues staying away from $0$ and $\infty$. The joint distribution of $(Y,\boldsymbol{X})$ has a continuously differentiable density for the continuous components and satisfies $\mathrm{E}\left[Y^{4}+\|\boldsymbol{X}\|_{2}^{4}+\{g(\boldsymbol{X}^{\top}\bar{\boldsymbol{\gamma}})\}^{2}+\{\dot{g}(\boldsymbol{X}^{\top}\bar{\boldsymbol{\gamma}})\}^{2}\right]<\infty$.

Assumption 2.

There exist population-level parameters $\bar{\boldsymbol{\eta}}$ and $\bar{\boldsymbol{\theta}}$ such that

$${\rm E}_{\bar{\mu}_{\mathcal{P}_{m}}\times{\rm P}_{Y,\mathcal{P}_{m^{c}}}}\left[\Delta_{\mu}(\tilde{\boldsymbol{\eta}},\bar{\boldsymbol{\eta}})\right]^{2}=o_{p}(1);\quad{\rm E}_{\bar{\nu}_{Y}\times{\rm P}_{\boldsymbol{X}}}[\Delta_{\nu}(\tilde{\boldsymbol{\theta}},\bar{\boldsymbol{\theta}})]^{2}=o_{p}(1);$$
$${\rm E}_{\bar{\mu}_{\mathcal{P}_{m}}\times{\rm P}_{Y,\mathcal{P}_{m^{c}}}}\left[Y^{4}+\|\boldsymbol{X}\|_{2}^{4}+\{g(\boldsymbol{X}^{\top}\bar{\boldsymbol{\gamma}})\}^{2}+\{\dot{g}(\boldsymbol{X}^{\top}\bar{\boldsymbol{\gamma}})\}^{2}\right]<\infty;$$
$${\rm E}_{\bar{\nu}_{Y}\times{\rm P}_{\boldsymbol{X}}}\left[Y^{4}+\|\boldsymbol{X}\|_{2}^{4}+\{g(\boldsymbol{X}^{\top}\bar{\boldsymbol{\gamma}})\}^{2}+\{\dot{g}(\boldsymbol{X}^{\top}\bar{\boldsymbol{\gamma}})\}^{2}\right]<\infty,$$

where we denote

$$\Delta_{\mu}(\tilde{\boldsymbol{\eta}},\bar{\boldsymbol{\eta}})=\left|\mu(\boldsymbol{X}_{\mathcal{P}_{m}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\tilde{\boldsymbol{\eta}})-\mu(\boldsymbol{X}_{\mathcal{P}_{m}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\bar{\boldsymbol{\eta}})\right|;$$
$$\Delta_{\nu}(\tilde{\boldsymbol{\theta}},\bar{\boldsymbol{\theta}})=\left|\nu(Y|\boldsymbol{X};\tilde{\boldsymbol{\theta}})-\nu(Y|\boldsymbol{X};\bar{\boldsymbol{\theta}})\right|;$$
$$\bar{\mu}_{\mathcal{P}_{m}}\times{\rm P}_{Y,\mathcal{P}_{m^{c}}}=\mu(\boldsymbol{X}_{\mathcal{P}_{m}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\bar{\boldsymbol{\eta}})\times{\rm P}(Y,\boldsymbol{X}_{\mathcal{P}_{m^{c}}});\quad\bar{\nu}_{Y}\times{\rm P}_{\boldsymbol{X}}=\nu(Y|\boldsymbol{X};\bar{\boldsymbol{\theta}})\times{\rm P}(\boldsymbol{X}).$$
Assumption 3.

There exist population-level parameters 𝛅¯\bar{\boldsymbol{\delta}} and 𝛇¯\bar{\boldsymbol{\zeta}} such that

Eμ¯𝒫m×PY,𝒫mc[Δw(𝜹~,𝜹¯)]2=op(1);Eν¯Y×P𝑿[Δh(𝜻~,𝜻¯)]2=op(1),\displaystyle{\rm E}_{\bar{\mu}_{\mathcal{P}_{m}}\times{\rm P}_{Y,\mathcal{P}_{m^{c}}}}[\Delta_{w}(\tilde{\boldsymbol{\delta}},\bar{\boldsymbol{\delta}})]^{2}=o_{p}(1);\quad{\rm E}_{\bar{\nu}_{Y}\times{\rm P}_{\boldsymbol{X}}}[\Delta_{h}(\tilde{\boldsymbol{\zeta}},\bar{\boldsymbol{\zeta}})]^{2}=o_{p}(1),

where Δw(𝛅~,𝛅¯)=|w(𝐗,Y;𝛅~)w(𝐗,Y;𝛅¯)|\Delta_{w}(\tilde{\boldsymbol{\delta}},\bar{\boldsymbol{\delta}})=|w(\boldsymbol{X},Y;\tilde{\boldsymbol{\delta}})-w(\boldsymbol{X},Y;\bar{\boldsymbol{\delta}})| and Δh(𝛇~,𝛇¯)=|h(𝐗,Y;𝛇~)h(𝐗,Y;𝛇¯)|\Delta_{h}(\tilde{\boldsymbol{\zeta}},\bar{\boldsymbol{\zeta}})=|h(\boldsymbol{X},Y;\tilde{\boldsymbol{\zeta}})-h(\boldsymbol{X},Y;\bar{\boldsymbol{\zeta}})|.

Remark 4.

Assumption 1 includes mild and common regularity conditions for M-estimation (Van der Vaart,, 2000). Assumptions 2 and 3 convey the message that, to establish the asymptotic normality of DEFUSE, all the nuisance estimators used for imputation and calibration need to be consistent for some limiting values. Compared to the double machine learning (DML) framework (Chernozhukov et al.,, 2018), we consider a more stringent MCAR regime, under which the estimation forms (7) and (9) automatically correct the excessive bias of the nuisance estimators (with cross-fitting for general ML). Consequently, our assumption on the quality of the nuisance estimators is substantially weaker than that of DML in two respects. First, we only require them to converge to some population limits, not necessarily the true distributions required by DML. Second, we only assume an op(1)o_{p}(1) convergence rate, which can be essentially slower than the op(n−1/4)o_{p}(n^{-1/4}) rate assumed in DML.

Remark 5.

Assumption 2 on the consistency of the preliminary imputation models is reasonable and justifiable across broad settings, with μ¯𝒫m\bar{\mu}_{\mathcal{P}_{m}} and ν¯Y\bar{\nu}_{Y} constructed using either classic regression or modern ML tools. One can refer to the comprehensive literature on the consistency of various learning approaches under potential misspecification, e.g., parametric GLMs (Tian et al.,, 2007, e.g.); kernel methods (Zhang et al.,, 2023, e.g.); high-dimensional sparse regression (Negahban et al.,, 2012, e.g.); and random forests (Athey et al.,, 2019, e.g.).

For the calibration estimators 𝜹~\tilde{\boldsymbol{\delta}} and 𝜻~\tilde{\boldsymbol{\zeta}}, we present Proposition 1 to justify the consistency required in Assumption 3 under a parametric form satisfying mild regularity conditions.

Proposition 1.

Under the specification w(𝐱,y;𝛅)=1+𝛅ξ(𝐱,y)w(\boldsymbol{x},y;\boldsymbol{\delta})=1+\boldsymbol{\delta}^{\top}\xi(\boldsymbol{x},y) and h(𝐱,y;𝛇)=1+𝛇ϕ(𝐱,y)h(\boldsymbol{x},y;\boldsymbol{\zeta})=1+\boldsymbol{\zeta}^{\top}\phi(\boldsymbol{x},y) with some prespecified basis functions ξ\xi and ϕ\phi, and under Assumptions 1 and 2 as well as the regularity Assumption A1 in the Appendix, Assumption 3 holds.

Based on the population-level parameters introduced in Assumptions 2 and 3, we further define the population control and score functions as

φ¯1(𝑿𝒫mc,Y)=ρ¯1+ρ¯Eμ¯𝒫m(|𝑿𝒫mc,Y;𝜼¯)[w(𝑿,Y;𝜹¯)𝒄𝑺¯𝑿𝒫mc,Y];S¯1(𝑿,Y)=𝒄𝑺¯φ¯1(𝑿𝒫mc,Y);φ¯2(𝑿)=Eν(|𝑿;𝜽¯)h(𝑿,Y;𝜻¯)S¯1(𝑿,Y),\begin{split}&\bar{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y)=\frac{\bar{\rho}}{1+\bar{\rho}}{\rm E}_{\bar{\mu}_{\mathcal{P}_{m}}(\cdot|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\bar{\boldsymbol{\eta}})}[w(\boldsymbol{X},Y;\bar{\boldsymbol{\delta}})\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}\mid\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y];\\ &\bar{S}_{1}(\boldsymbol{X},Y)=\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}-\bar{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y);\quad\bar{\varphi}_{2}^{\prime}(\boldsymbol{X})={\rm E}_{\nu(\cdot|\boldsymbol{X};\bar{\boldsymbol{\theta}})}h(\boldsymbol{X},Y;\bar{\boldsymbol{\zeta}})\bar{S}_{1}(\boldsymbol{X},Y),\end{split} (12)

which will be used later to characterize the asymptotic properties of DEFUSE. Finally, we introduce an additional Assumption 4 used for asymptotic analysis of the multiple BM scenario in Section 2.4.

Assumption 4.

In the multiple BM setting, the following conditions hold: (i) our data and construction for all BM components in {r:r=1,…,R}\{\mathcal{L}\mathcal{M}_{r}:r=1,\ldots,R\} satisfy Assumptions 1–3 stated for \mathcal{L}\mathcal{M} in the single BM scenario; (ii) each ρr=min{Nr/N𝒞,C}\rho_{r}=\min\{N_{\mathcal{L}\mathcal{M}_{r}}/N_{\mathcal{L}\mathcal{C}},C\} converges to some ρ¯r\bar{\rho}_{r} for r=1,…,Rr=1,\ldots,R; (iii) given that (i) and (ii) hold, define φ¯1,r′(𝐗𝒫mcr,Y)\bar{\varphi}_{1,r}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{r}},Y) as the limiting function for each r\mathcal{L}\mathcal{M}_{r} analogous to φ¯1′\bar{\varphi}_{1}^{\prime} in (12), and denote by 𝛗¯1(𝐗,Y)={φ¯1,r′(𝐗𝒫mcr,Y)}r=1,…,R\bar{\boldsymbol{\varphi}}_{1}^{\bullet}(\boldsymbol{X},Y)=\{\bar{\varphi}_{1,r}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{r}},Y)\}_{r=1,\ldots,R}^{\top}. Then the matrix

Var[𝝋¯1(𝑿,Y)]+Diag{1ρ¯rVar[φ¯1,r(𝑿𝒫mcr,Y)]}r=1,,R{\rm Var}\left[\bar{\boldsymbol{\varphi}}_{1}^{\bullet}(\boldsymbol{X},Y)\right]+{\rm Diag}\left\{\frac{1}{\bar{\rho}_{r}}{\rm Var}\left[\bar{\varphi}_{1,r}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{r}},Y)\right]\right\}_{r=1,\ldots,R}

has all its eigenvalues staying away from 0.

Assumption 4 is introduced to guarantee that (11) can produce a consistent estimate of the unique population-level optimal allocation

𝒂¯=(a¯1,,a¯R)=argmin𝒂Var[𝒄𝑺¯(𝑿,Y)𝒂𝝋¯1(𝑿,Y)]+r=1Rar2ρ¯rVar[φ¯1,r(𝑿𝒫mcr,Y)].\bar{\boldsymbol{a}}=(\bar{a}_{1},\ldots,\bar{a}_{R})^{\top}=\arg\min_{\boldsymbol{a}}~{}{\rm Var}\left[\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}(\boldsymbol{X},Y)-\boldsymbol{a}^{\top}\bar{\boldsymbol{\varphi}}_{1}^{\bullet}(\boldsymbol{X},Y)\right]+\sum_{r=1}^{R}\frac{a_{r}^{2}}{\bar{\rho}_{r}}{\rm Var}\left[\bar{\varphi}_{1,r}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{r}},Y)\right].

This assumption holds when every data set r\mathcal{L}\mathcal{M}_{r} is sufficiently informative about 𝒄𝑺¯(𝑿,Y)\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}(\boldsymbol{X},Y) and the variance of φ¯1,r′(𝑿𝒫mcr,Y)\bar{\varphi}_{1,r}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{r}},Y) is not too small.
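Since the objective above is quadratic in 𝒂\boldsymbol{a}, its minimizer admits a closed form; as a clarifying derivation step, the first-order condition gives

\bar{\boldsymbol{a}}=\left({\rm Var}\left[\bar{\boldsymbol{\varphi}}_{1}^{\bullet}(\boldsymbol{X},Y)\right]+{\rm Diag}\left\{\frac{1}{\bar{\rho}_{r}}{\rm Var}\left[\bar{\varphi}_{1,r}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{r}},Y)\right]\right\}_{r=1,\ldots,R}\right)^{-1}{\rm Cov}\left[\bar{\boldsymbol{\varphi}}_{1}^{\bullet}(\boldsymbol{X},Y),\,\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}(\boldsymbol{X},Y)\right],

so Assumption 4 exactly guarantees that the inverted matrix is well-conditioned and 𝒂¯\bar{\boldsymbol{a}} is unique.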

3.2 Asymptotic Normality

In this section, we present the asymptotic properties of the proposed estimators, with their efficiency analyzed and articulated in the next section. We start from the standard M-estimation result for the 𝒞\mathcal{L}\mathcal{C}-only estimator 𝜸~\tilde{\boldsymbol{\gamma}} in Lemma 1, which is well-established (Van der Vaart,, 2000, e.g.).

Lemma 1.

Under Assumptions 1–3, 𝛄~𝑝𝛄¯\tilde{\boldsymbol{\gamma}}\overset{p}{\to}\boldsymbol{\bar{\gamma}} and n1/2(𝐜𝛄~−𝐜𝛄¯)=n1/2E^𝒞[𝐜𝐒¯]+o𝖯(1)n^{1/2}(\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}-\boldsymbol{c}^{\top}\boldsymbol{\bar{\gamma}})=n^{1/2}\widehat{{\rm E}}_{\mathcal{L}\mathcal{C}}[\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}]+o_{\sf P}(1), which weakly converges to the normal distribution N{0,Var(𝐜𝐒¯i)}{\rm N}\{0,{\rm Var}(\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}_{i})\}.

Next, we establish the consistency and asymptotic normality for the DEFUSE1 and DEFUSE estimators in the single BM scenario.

Theorem 1.

Under Assumptions 1–3, we have both β^1,β^2𝑝𝐜𝛄¯\hat{\beta}_{1},\hat{\beta}_{2}\overset{p}{\to}\boldsymbol{c}^{\top}\boldsymbol{\bar{\gamma}} and

n(β^1𝒄𝜸¯)\displaystyle\sqrt{n}(\hat{\beta}_{1}-\boldsymbol{c}^{\top}\boldsymbol{\bar{\gamma}}) =nE^𝒞[S¯1(𝑿,Y)]+nE^[φ¯1(𝑿𝒫mc,Y)]+o𝖯(1)\displaystyle={\sqrt{n}}\widehat{{\rm E}}_{\mathcal{L}\mathcal{C}}\left[\bar{S}_{1}(\boldsymbol{X},Y)\right]+{\sqrt{n}}\widehat{{\rm E}}_{\mathcal{L}\mathcal{M}}\left[\bar{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y)\right]+o_{\sf P}(1)
n(β^2𝒄𝜸¯)\displaystyle\sqrt{n}(\hat{\beta}_{2}-\boldsymbol{c}^{\top}\boldsymbol{\bar{\gamma}}) =nE^𝒞[S¯1(𝑿,Y)φ¯2(𝑿)]+nE^[φ¯1(𝑿𝒫mc,Y)]+o𝖯(1),\displaystyle={\sqrt{n}}\widehat{{\rm E}}_{\mathcal{L}\mathcal{C}}\left[\bar{S}_{1}(\boldsymbol{X},Y)-\bar{\varphi}_{2}^{\prime}(\boldsymbol{X})\right]+{\sqrt{n}}\widehat{{\rm E}}_{\mathcal{L}\mathcal{M}}\left[\bar{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y)\right]+o_{\sf P}(1),

which weakly converge to N{0,Var[S¯1(𝐗,Y)]+Var[φ¯1(𝐗𝒫mc,Y)]/ρ¯}{\rm N}\{0,{\rm Var}[\bar{S}_{1}(\boldsymbol{X},Y)]+{\rm Var}[\bar{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y)]/\bar{\rho}\} and N{0,Var[S¯1(𝐗,Y)φ¯2(𝐗)]+Var[φ¯1(𝐗𝒫mc,Y)]/ρ¯}{\rm N}\{0,{\rm Var}[\bar{S}_{1}(\boldsymbol{X},Y)-\bar{\varphi}_{2}^{\prime}(\boldsymbol{X})]+{\rm Var}[\bar{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y)]/\bar{\rho}\} respectively.

Theorem 1 states that our DEFUSE estimators are n−1/2n^{-1/2}-consistent, asymptotically unbiased, and normal, regardless of whether the imputation distributions are correct and even when the nuisance estimators have slow convergence rates. The efficiency of β^1\hat{\beta}_{1} and β^2\hat{\beta}_{2} will be analyzed and discussed based on Theorem 1 later. As an extension, we also establish an asymptotic theorem for β^1\hat{\beta}_{1}^{\bullet} in the multiple BM setting. Its downstream SS estimation can be justified in essentially the same way as the analysis of β^2\hat{\beta}_{2} in Theorem 1, so we omit this part to avoid redundancy.
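As a practical consequence of Theorem 1, the limiting variance can be estimated by plugging empirical variances of the fitted influence terms into the expression above. The following minimal sketch is our own illustration, not part of the original construction; the inputs S1_lc and phi1_lm are hypothetical arrays holding the fitted values of S̄1 on ℒ𝒞 and of φ̄1′ on ℒℳ.

```python
import numpy as np
from scipy.stats import norm

def defuse1_ci(beta1, S1_lc, phi1_lm, alpha=0.05):
    """Wald-type CI for c'gamma based on the expansion in Theorem 1.

    beta1   : the DEFUSE1 point estimate
    S1_lc   : fitted values of S_bar_1(X, Y) on the n = N_LC complete labeled subjects
    phi1_lm : fitted values of phi_bar_1'(X_{P_m^c}, Y) on the N_LM subjects with BM covariates
    """
    n = len(S1_lc)
    rho = len(phi1_lm) / n                      # sample size ratio N_LM / n
    # plug-in estimate of AVar = Var[S_bar_1] + Var[phi_bar_1'] / rho
    avar = np.var(S1_lc, ddof=1) + np.var(phi1_lm, ddof=1) / rho
    se = np.sqrt(avar / n)                      # beta1 is root-n consistent
    z = norm.ppf(1 - alpha / 2)
    return beta1 - z * se, beta1 + z * se
```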

Theorem 2.

Under Assumption 4, β^1𝑝𝐜𝛄¯\hat{\beta}_{1}^{\bullet}\overset{p}{\to}\boldsymbol{c}^{\top}\boldsymbol{\bar{\gamma}} and

n(β^1𝒄𝜸¯)=nE^𝒞[𝒄𝑺¯(𝑿,Y)𝒂¯𝝋¯1(𝑿,Y)]+nr=1Ra¯rE^r[φ¯1,r(𝑿𝒫mcr,Y)]+o𝖯(1),\sqrt{n}(\hat{\beta}_{1}^{\bullet}-\boldsymbol{c}^{\top}\boldsymbol{\bar{\gamma}})={\sqrt{n}}\widehat{{\rm E}}_{\mathcal{L}\mathcal{C}}\left[\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}(\boldsymbol{X},Y)-\bar{\boldsymbol{a}}^{\top}\bar{\boldsymbol{\varphi}}_{1}^{\bullet}(\boldsymbol{X},Y)\right]+{\sqrt{n}}\sum_{r=1}^{R}\bar{a}_{r}\widehat{{\rm E}}_{\mathcal{L}\mathcal{M}_{r}}\left[\bar{\varphi}_{1,r}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{r}},Y)\right]+o_{\sf P}(1),

which weakly converges to the normal distribution with mean 0 and variance

Var[𝒄𝑺¯(𝑿,Y)𝒂¯𝝋¯1(𝑿,Y)]+r=1Ra¯r2ρ¯rVar[φ¯1,r(𝑿𝒫mcr,Y)].{\rm Var}\left[\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}(\boldsymbol{X},Y)-\bar{\boldsymbol{a}}^{\top}\bar{\boldsymbol{\varphi}}_{1}^{\bullet}(\boldsymbol{X},Y)\right]+\sum_{r=1}^{R}\frac{\bar{a}_{r}^{2}}{\bar{\rho}_{r}}{\rm Var}\left[\bar{\varphi}_{1,r}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{r}},Y)\right].

3.3 Relative Efficiency

Now we shall demonstrate the relative efficiency of the proposed DEFUSE estimators compared to a broad set of existing ones, including: (a) the 𝒞\mathcal{L}\mathcal{C}-only estimator 𝒄𝜸~\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}; (b) the standard semiparametric efficient estimator (Robins et al.,, 1994) based on (4) with φ\varphi^{*} simply imputed by φ~1\tilde{\varphi}_{1}; (c) BM estimators constructed via different strategies (Xue and Qu,, 2021; Zhao et al.,, 2023), as well as Xue et al., (2021) and Song et al., (2024), which also consider the SS setting; and (d) SS methods only incorporating the 𝒰𝒞\mathcal{U}\mathcal{C} sample (Gronsbell et al.,, 2022, e.g.). In addition, we remark on the connection of our data-adaptive calibration strategy to some recent work on SS inference. We begin with the assumptions and a lemma on the feasible set and regularity of the calibration models.

Assumption 5.

The population-level parameters 𝛅¯\bar{\boldsymbol{\delta}} and 𝛇¯\bar{\boldsymbol{\zeta}} defined in Assumption 3 are the unique minimizers of the variance functions V1(𝛅)V_{1}({\boldsymbol{\delta}}) and V2(𝛇)V_{2}(\boldsymbol{\zeta}), respectively, where

V1(𝜹)=\displaystyle V_{1}({\boldsymbol{\delta}})= Var[𝒄𝑺¯(𝑿,Y)φ¯1(𝑿𝒫mc,Y;𝜹,𝜼¯)]+ρ¯1Var[φ¯1(𝑿𝒫mc,Y;𝜹,𝜼¯)];\displaystyle{\rm Var}\left[\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}(\boldsymbol{X},Y)-\bar{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\boldsymbol{\delta},\bar{\boldsymbol{\eta}})\right]+{\bar{\rho}}^{-1}{\rm Var}\left[\bar{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\boldsymbol{\delta},\bar{\boldsymbol{\eta}})\right];
V2(𝜻)=\displaystyle V_{2}(\boldsymbol{\zeta})= Var[S¯1(𝑿,Y)φ¯2(𝑿;𝜻,𝜽¯)]+ρ¯1Var[φ¯1(𝑿𝒫mc,Y;𝜹¯,𝜼¯)].\displaystyle{\rm Var}\left[\bar{S}_{1}(\boldsymbol{X},Y)-\bar{\varphi}_{2}^{\prime}(\boldsymbol{X};\boldsymbol{\zeta},\bar{\boldsymbol{\theta}})\right]+{\bar{\rho}}^{-1}{\rm Var}\left[\bar{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\bar{\boldsymbol{\delta}},\bar{\boldsymbol{\eta}})\right].
Assumption 6.

Constant 11 belongs to the feasible sets of the calibration models and we can set w(𝐱,y;𝟎)1w(\boldsymbol{x},y;\boldsymbol{0})\equiv 1 and h(𝐱,y;𝟎)1h(\boldsymbol{x},y;\boldsymbol{0})\equiv 1 without loss of generality. If some f(𝐱,y)f(\boldsymbol{x},y) belongs to the feasible set of w(𝐱,y;𝛅)w(\boldsymbol{x},y;\boldsymbol{\delta}) or h(𝐱,y;𝛇)h(\boldsymbol{x},y;\boldsymbol{\zeta}), then the function cf(𝐱,y)cf(\boldsymbol{x},y) also belongs to that feasible set for any cc\in\mathbb{R}.

Lemma 2.

Assumptions 5 and 6 will hold when all assumptions used in Proposition 1 hold and the forms of the calibration models are taken as w(𝐱,y;𝛅)=1+𝛅ξ(𝐱,y)w(\boldsymbol{x},y;\boldsymbol{\delta})=1+\boldsymbol{\delta}^{\top}\xi(\boldsymbol{x},y) and h(𝐱,y;𝛇)=1+𝛇ϕ(𝐱,y)h(\boldsymbol{x},y;\boldsymbol{\zeta})=1+\boldsymbol{\zeta}^{\top}\phi(\boldsymbol{x},y) with ξ(𝐱,y)\xi(\boldsymbol{x},y) and ϕ(𝐱,y)\phi(\boldsymbol{x},y) including constant 11.

Though Lemma 2 imposes a linear specification on ww and hh, it can be generalized to other forms such as w(𝒙,y;δ1,𝜹2)=δ1exp{𝜹2ξ(𝒙,y)}w(\boldsymbol{x},y;\delta_{1},\boldsymbol{\delta}_{2})=\delta_{1}\exp\{\boldsymbol{\delta}_{2}^{\top}\xi(\boldsymbol{x},y)\}, although such forms may encounter local-minima issues in practice. The regularity assumptions on ξ\xi and ϕ\phi in Lemma 2, i.e., Assumption A1, are mild and tend to hold under proper choices.
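To make the calibration step concrete, the sketch below is our own illustration under the linear specification of Lemma 2; all inputs are hypothetical stand-ins for quantities produced earlier in the DEFUSE pipeline. It fits 𝜹 by minimizing an empirical analogue of V1(𝜹) from Assumption 5, with the conditional expectation defining φ̄1′ approximated by averaging over the Q Monte Carlo imputations of 𝑿𝒫m.

```python
import numpy as np
from scipy.optimize import minimize

def fit_calibration_delta(cS, cS_mc, xi_mc, rho):
    """Fit the linear calibration w(x, y; delta) = 1 + delta' xi(x, y)
    by minimizing an empirical analogue of V_1(delta) in Assumption 5.

    cS    : (n,)       observed c'S~ on the LC sample
    cS_mc : (n, Q)     c'S~ evaluated on Q Monte Carlo imputations of X_{P_m}
    xi_mc : (n, Q, d)  basis xi(x, y) evaluated on the same MC draws
    rho   : sample size ratio N_LM / N_LC
    """
    scale = rho / (1.0 + rho)
    # phi_1'(delta) is linear in delta: MC averages of c'S and of xi * c'S
    base = cS_mc.mean(axis=1)                          # approx. E_mu[c'S | observed]
    slope = (xi_mc * cS_mc[:, :, None]).mean(axis=1)   # approx. E_mu[xi c'S | observed]

    def V1(delta):
        phi1 = scale * (base + slope @ delta)
        # empirical Var[c'S - phi_1'] + Var[phi_1'] / rho
        return np.var(cS - phi1, ddof=1) + np.var(phi1, ddof=1) / rho

    res = minimize(V1, x0=np.zeros(xi_mc.shape[2]), method="BFGS")
    return res.x
```

Because φ1′(𝜹) is linear in 𝜹, the criterion is a convex quadratic, so the BFGS solve here converges to the unique minimizer whenever the empirical analogue of Assumption 5 holds.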

Remark 6.

We notice that some recent approaches for power enhancement under misspecified nuisance models leverage properties similar to Assumptions 5 and 6. For example, Gronsbell et al., (2022) combined the supervised and SS estimators through their optimal convex allocation with minimum variance. We can show that their strategy is essentially equivalent to setting our h(𝐱,y;𝛇)=1+ζ1∈[0,1]h(\boldsymbol{x},y;\boldsymbol{\zeta})=1+\zeta_{1}\in[0,1] with only one parameter ζ1\zeta_{1} to fit. This easily satisfies the above-introduced assumptions on h()h(\cdot) (with c∈[0,1]c\in[0,1] in Assumption 6). As a result, our following corollaries about efficiency may apply in their case. Similar constructions can also be found in Miao et al., (2023) and others. Consequently, when taking h(𝐱,y;𝛇)h(\boldsymbol{x},y;\boldsymbol{\zeta}) to be a model nesting the simplest one 1+ζ11+\zeta_{1}, e.g., h(𝐱,y;𝛇)=1+𝐱𝛇h(\boldsymbol{x},y;\boldsymbol{\zeta})=1+\boldsymbol{x}^{\top}\boldsymbol{\zeta}, one can obtain more efficient estimators through our strategy. That is, our work implies a more flexible and powerful framework for nuisance model calibration in a broader set of missing data problems such as SS inference.

Now we shall introduce several corollaries of Section 3.2 to characterize the efficiency of the DEFUSE estimators. We define RE(b^1|b^2)=AVar(b^2)/AVar(b^1)\mbox{RE}(\hat{b}_{1}|\hat{b}_{2})={\rm AVar}(\hat{b}_{2})/{\rm AVar}(\hat{b}_{1}) as the relative efficiency between two asymptotically normal estimators b^1\hat{b}_{1} and b^2\hat{b}_{2}, where AVar(){\rm AVar}(\cdot) represents their asymptotic variances. First, we focus on the single BM scenario and compare the asymptotic variances of the DEFUSE1 and 𝒞\mathcal{L}\mathcal{C}-only estimators in Corollary 1. We show that RE(β^1|𝒄𝜸~)\mbox{RE}(\hat{\beta}_{1}|\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}) is ensured to be larger than 11 and to increase with the ratio ρ¯=limN/N𝒞\bar{\rho}=\lim N_{\mathcal{L}\mathcal{M}}/N_{\mathcal{L}\mathcal{C}}. This is natural since w(𝒙,y;𝜹0)≡0w(\boldsymbol{x},y;\boldsymbol{\delta}_{0})\equiv 0 reduces our DEFUSE1 construction to the 𝒞\mathcal{L}\mathcal{C}-only 𝒄𝜸~\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}, and utilizing more \mathcal{L}\mathcal{M} samples amplifies our efficiency gain over 𝒄𝜸~\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}, which does not depend on \mathcal{L}\mathcal{M}.

Corollary 1 (DEFUSE1 v.s. 𝒞\mathcal{L}\mathcal{C}-only).

Under Assumptions 1–3, 5 and 6, there exists some 𝛅0\boldsymbol{\delta}_{0} such that w(𝐱,y;𝛅0)≡0w(\boldsymbol{x},y;\boldsymbol{\delta}_{0})\equiv 0, and we have

RE(β^1|𝒄𝜸~)=V1(𝜹0)min𝜹V1(𝜹)1.\mbox{RE}(\hat{\beta}_{1}|\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}})=\frac{V_{1}(\boldsymbol{\delta}_{0})}{\min_{\boldsymbol{\delta}}V_{1}({\boldsymbol{\delta}})}\geq 1.

When 𝐜𝐒¯(𝐗,Y)\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}(\boldsymbol{X},Y) and φ¯1(𝐗𝒫mc,Y;𝛅,𝛈¯)\bar{\varphi}_{1}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\boldsymbol{\delta},\bar{\boldsymbol{\eta}}) have a non-zero correlation, RE(β^1|𝐜𝛄~)>1\mbox{RE}(\hat{\beta}_{1}|\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}})>1 and RE(β^1|𝐜𝛄~)\mbox{RE}(\hat{\beta}_{1}|\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}) monotonically increases with ρ¯\bar{\rho} (the sample size ratio of \mathcal{L}\mathcal{M} to 𝒞\mathcal{L}\mathcal{C}) with all other parameters unchanged.

Then, still in the single BM scenario, we connect DEFUSE1 with the classic semiparametric efficient estimator β^𝗌𝖾𝗆𝗂𝖾𝖿𝖿:=𝒄𝜸~−E^𝒞[φ~1]+E^[φ~1]\hat{\beta}_{\sf semieff}:=\boldsymbol{c}^{\top}\tilde{\boldsymbol{{\gamma}}}-\widehat{{\rm E}}_{\mathcal{L}\mathcal{C}}[\tilde{\varphi}_{1}]+\widehat{{\rm E}}_{\mathcal{L}\mathcal{M}}[\tilde{\varphi}_{1}], obtained by replacing the truth φ\varphi^{*} in (4) with its estimate φ~1\tilde{\varphi}_{1}. We show that DEFUSE1 is as efficient as β^𝗌𝖾𝗆𝗂𝖾𝖿𝖿\hat{\beta}_{\sf semieff} in the ideal case that φ~1\tilde{\varphi}_{1} is consistent for φ\varphi^{*}. More importantly, β^1\hat{\beta}_{1} is guaranteed to be no less efficient, and is typically more efficient, than β^𝗌𝖾𝗆𝗂𝖾𝖿𝖿\hat{\beta}_{\sf semieff} when the conditional model for 𝑿𝒫m\boldsymbol{X}_{\mathcal{P}_{m}} is misspecified or erroneous and φ~1\tilde{\varphi}_{1} does not converge to the truth. Similar efficiency properties can be justified for our step of leveraging 𝒰𝒞\mathcal{U}\mathcal{C} to construct β^2\hat{\beta}_{2}; see the Appendix.

Corollary 2.

Under Assumptions 1–3, 5 and 6, and regardless of whether μ~𝐗𝒫m\tilde{\mu}_{\boldsymbol{X}_{\mathcal{P}_{m}}} is consistent with the truth, we have

RE(β^1|β^𝗌𝖾𝗆𝗂𝖾𝖿𝖿)=V1(𝟎)min𝜹V1(𝜹)1\mbox{RE}(\hat{\beta}_{1}|\hat{\beta}_{\sf semieff})=\frac{V_{1}(\boldsymbol{0})}{\min_{\boldsymbol{\delta}}V_{1}({\boldsymbol{\delta}})}\geq 1

When μ(𝐗𝒫m|𝐗𝒫mc,Y;𝛈¯)=P(𝐗𝒫m|𝐗𝒫mc,Y){\mu}(\boldsymbol{X}_{\mathcal{P}_{m}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\bar{\boldsymbol{\eta}})={\rm P}(\boldsymbol{X}_{\mathcal{P}_{m}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y), we have RE(β^1|β^𝗌𝖾𝗆𝗂𝖾𝖿𝖿)=1\mbox{RE}(\hat{\beta}_{1}|\hat{\beta}_{\sf semieff})=1 and

{\rm AVar}(\sqrt{n}\hat{\beta}_{1})={\rm E}\left({\rm Var}\left[\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}(\boldsymbol{X},Y)\,\big{|}\,\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y\right]\right)+\frac{{\rm Var}\left\{{\rm E}\left[\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}(\boldsymbol{X},Y)\,\big{|}\,\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y\right]\right\}}{1+{\bar{\rho}}}.
Remark 7.

By Corollary 2, DEFUSE1 reaches the semiparametric efficiency bound (Robins et al.,, 1994) when μ~𝑿𝒫m\tilde{\mu}_{\boldsymbol{X}_{\mathcal{P}_{m}}} is consistent with the truth. This desirable property is not readily achieved by recent BM fusion approaches relying on other fusing strategies (Xue and Qu,, 2021; Zhao et al.,, 2023; Song et al.,, 2024, e.g.). This implies that, with correct nuisance models, AVar(β^1){\rm AVar}(\hat{\beta}_{1}) is ensured to be smaller than those of the existing methods. Our numerical results in the next sections support this point. Nevertheless, our method requires the observation of 𝒞\mathcal{L}\mathcal{C} samples, while methods like Xue and Qu, (2021) consider a more general case that does not necessarily include the 𝒞\mathcal{L}\mathcal{C} data. Also, our method tends to have higher computational costs due to MC sampling.

In Corollary 3, we quantify the efficiency gain from incorporating the 𝒰𝒞\mathcal{U}\mathcal{C} sample to further enhance β^1\widehat{\beta}_{1}. Based on this, all the above corollaries and remarks on β^1\widehat{\beta}_{1} naturally carry over to our final DEFUSE estimator β^2\hat{\beta}_{2}. We also present the asymptotic variance of β^2\hat{\beta}_{2} in the ideal case that both conditional models μ~\tilde{\mu} and ν~\tilde{\nu} are consistent with the true distributions.

Corollary 3.

Under Assumptions 1–3, 5 and 6, there exists some 𝛇0\boldsymbol{\zeta}_{0} such that h(𝐱,y;𝛇0)≡0h(\boldsymbol{x},y;\boldsymbol{\zeta}_{0})\equiv 0 and

RE(β^2|β^1)=V2(𝜻0)min𝜻V2(𝜻)1,\mbox{RE}(\hat{\beta}_{2}|\hat{\beta}_{1})=\frac{V_{2}(\boldsymbol{\zeta}_{0})}{\min_{\boldsymbol{\zeta}}V_{2}({\boldsymbol{\zeta}})}\geq 1,

with a strict >> holding when the correlation between S¯1(𝐗,Y)\bar{S}_{1}(\boldsymbol{X},Y) and φ¯2′(𝐗)\bar{\varphi}_{2}^{\prime}(\boldsymbol{X}) is non-zero, and RE(β^2|β^1)\mbox{RE}(\hat{\beta}_{2}|\hat{\beta}_{1}) monotonically increases with ρ¯=limn→∞ρ\bar{\rho}=\lim_{n\rightarrow\infty}\rho. When μ(𝐗𝒫m|𝐗𝒫mc,Y;𝛈¯)=P(𝐗𝒫m|𝐗𝒫mc,Y){\mu}(\boldsymbol{X}_{\mathcal{P}_{m}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y;\bar{\boldsymbol{\eta}})={\rm P}(\boldsymbol{X}_{\mathcal{P}_{m}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y) and ν(Y|𝐗;𝛉¯)=P(Y|𝐗)\nu(Y|\boldsymbol{X};\bar{\boldsymbol{\theta}})={\rm P}(Y|\boldsymbol{X}), we further have that

AVar(nβ^2)=\displaystyle{\rm AVar}(\sqrt{n}\hat{\beta}_{2})= E{Var[𝒄𝑺¯(𝑿,Y)ρ¯1+ρ¯E[𝒄𝑺¯(𝑿,Y)|𝑿𝒫mc,Y]|𝑿]}\displaystyle{\rm E}\left\{{\rm Var}\left[\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}(\boldsymbol{X},Y)-\frac{\bar{\rho}}{1+\bar{\rho}}{\rm E}\left[\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}(\boldsymbol{X},Y)\big{|}\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y\right]\big{|}\boldsymbol{X}\right]\right\}
+ρ¯Var{E[𝒄𝑺¯(𝑿,Y)|𝑿𝒫mc,Y]}(1+ρ¯)2\displaystyle+\frac{\bar{\rho}{\rm Var}\left\{{\rm E}\left[\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}(\boldsymbol{X},Y)\big{|}\boldsymbol{X}_{\mathcal{P}_{m^{c}}},Y\right]\right\}}{(1+{\bar{\rho}})^{2}}

Moreover, {\rm AVar}(\sqrt{n}\hat{\beta}_{2})<\sigma_{\sf ss}^{2}:={\rm E}\left[{\rm Var}\left\{\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}(\boldsymbol{X},Y)\big{|}\boldsymbol{X}\right\}\right], where σ𝗌𝗌2\sigma_{\sf ss}^{2} is the semiparametric efficiency bound for the estimation of 𝐜𝛄¯\boldsymbol{c}^{\top}\bar{\boldsymbol{\gamma}} under the SS setting with only 𝒞\mathcal{L}\mathcal{C} and 𝒰𝒞\mathcal{U}\mathcal{C}.

Remark 8.

Recent work including Xue et al., (2021) and Song et al., (2024) also considered the combined SS and BM setting but only leveraged the large 𝒰𝒞\mathcal{U}\mathcal{C} sample to assist debiased inference of high-dimensional linear models. The efficiency gain of the SS estimator β^2\hat{\beta}_{2} over β^1\hat{\beta}_{1} stated in Corollary 3 has not been readily achieved in earlier work on this track. For the standard SS setting, methods proposed by Chakrabortty and Cai, (2018) and others achieve the semiparametric efficiency bound σ𝗌𝗌2\sigma_{\sf ss}^{2} when ν(Y|𝐗;𝛉¯)\nu(Y|\boldsymbol{X};\bar{\boldsymbol{\theta}}) is correct. We show in Corollary 3 that AVar(β^2)<σ𝗌𝗌2{\rm AVar}(\hat{\beta}_{2})<\sigma_{\sf ss}^{2} when both μ~\tilde{\mu} and ν~\tilde{\nu} are consistent with the truths. Meanwhile, we demonstrate in our simulation and real-world studies that such an efficiency gain of DEFUSE over existing SS approaches (Gronsbell et al.,, 2022; Miao et al.,, 2023, e.g.) is generally significant, regardless of whether the nuisance models are correctly specified.

At last, we introduce the efficiency property of β^1\hat{\beta}_{1}^{\bullet} in the multiple BM setting. In Corollary 4, we establish the optimality of the linear allocation weights 𝒂~\tilde{\boldsymbol{a}} for the r\mathcal{L}\mathcal{M}_{r}’s and show that β^1\hat{\beta}_{1}^{\bullet} is more efficient than the 𝒞\mathcal{L}\mathcal{C}-only 𝒄𝜸~\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}. Although the classic semiparametric theory only applies to the single BM scenario (see Section 6 for more discussion), we still find in numerical studies that our way of constructing the control functions φ~1,r′(𝑿𝒫mcr,Y)\tilde{\varphi}_{1,r}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{r}},Y) for each r\mathcal{L}\mathcal{M}_{r}, as well as the linear allocation strategy on 1,…,R\mathcal{L}\mathcal{M}_{1},\ldots,\mathcal{L}\mathcal{M}_{R}, helps to improve the performance of β^1\hat{\beta}_{1}^{\bullet} over existing multiple BM methods not relying on the semiparametric construction; see Section 4.2 for details.

Corollary 4.

Under Assumption 4 and provided that each ρ¯r<C\bar{\rho}_{r}<C, β^1\hat{\beta}_{1}^{\bullet} achieves the smallest asymptotic variance among the linear allocation estimators of the form

𝒄𝜸~+r=1Rar{E^r[φ~1,r(𝑿𝒫mcr,Y)]E^𝒞[φ~1,r(𝑿𝒫mcr,Y)]},a1,,aR.\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}+\sum_{r=1}^{R}a_{r}\left\{\widehat{{\rm E}}_{\mathcal{L}\mathcal{M}_{r}}\left[\tilde{\varphi}_{1,r}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{r}},Y)\right]-\widehat{{\rm E}}_{\mathcal{L}\mathcal{C}}\left[\tilde{\varphi}_{1,r}^{\prime}(\boldsymbol{X}_{\mathcal{P}_{m^{c}}^{r}},Y)\right]\right\},\quad a_{1},\ldots,a_{R}\in\mathbb{R}.

Also, RE(β^1|𝐜𝛄~)≥1\mbox{RE}(\hat{\beta}_{1}^{\bullet}|\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}})\geq 1, where >> holds when at least one φ¯1,r(𝐗,Y)\bar{{\varphi}}_{1,r}(\boldsymbol{X},Y), r=1,…,Rr=1,\ldots,R, has a non-zero correlation with 𝐜𝐒¯(𝐗,Y)\boldsymbol{c}^{\top}\bar{\boldsymbol{S}}(\boldsymbol{X},Y).

4 Simulation Study

4.1 Settings and Benchmarks

In this section, we conduct numerical studies to assess the validity, efficiency, and robustness of DEFUSE and compare it with a broad set of existing methods. Our simulation studies include various designs and data generation settings introduced as follows. In every scenario, we consider 𝑿=(X1,X2,X3,X4,X5)\boldsymbol{X}=(X_{1},X_{2},X_{3},X_{4},X_{5})^{\top} with (X1,X2,X3)(X_{1},X_{2},X_{3}) fully observed (in 𝒫mc\mathcal{P}_{m^{c}}) and (X4,X5)(X_{4},X_{5}) possibly missing (in 𝒫m\mathcal{P}_{m}).

(I) Binary YY and mixture linear P𝑿𝒫m{\rm P}_{\boldsymbol{X}_{\mathcal{P}_{m}}}.

In this setting, we first generate binary YY from the Bernoulli distribution with success probability 0.50.5, and then generate 𝑿𝒫mc\boldsymbol{X}_{\mathcal{P}_{m^{c}}} from the following Gaussian mixture:

𝑿𝒫mcY{N{(0.4,0.5,0.5),𝕀3}, if Y=1N{(0.6,0.3,0.7),𝕀3}, if Y=0,\boldsymbol{X}_{\mathcal{P}_{m^{c}}}\mid Y\sim\begin{cases}N\{(0.4,-0.5,0.5),\mathbb{I}_{3}\},&\mbox{ if }Y=1\\ N\{(0.6,-0.3,0.7),\mathbb{I}_{3}\},&\mbox{ if }Y=0,\end{cases}

where 𝕀3\mathbb{I}_{3} represents the identity matrix. With 𝑿𝒫mc\boldsymbol{X}_{\mathcal{P}_{m^{c}}} and YY, we generate

𝑿𝒫m={ω(123221)𝑿𝒫mc+ϵ if Y=1ω(132322)𝑿𝒫mc+ϵ if Y=0\boldsymbol{X}_{\mathcal{P}_{m}}=\begin{cases}\omega\cdot\begin{pmatrix}1&-2&3\\ 2&-2&-1\end{pmatrix}\boldsymbol{X}_{\mathcal{P}_{m^{c}}}+\epsilon&\mbox{ if }Y=1\\ \omega\cdot\begin{pmatrix}-1&3&-2\\ -3&2&2\end{pmatrix}\boldsymbol{X}_{\mathcal{P}_{m^{c}}}+\epsilon&\mbox{ if }Y=0\\ \end{cases}

where ϵ∼N(0,𝕀2)\epsilon\sim{\rm N}(0,\mathbb{I}_{2}) is an independent noise term. The hyperparameter ω\omega ranges from 0.10.1 to 11, characterizing the dependence of 𝑿𝒫m\boldsymbol{X}_{\mathcal{P}_{m}} on the other variables. The sample sizes of the three data sets are set to be N𝒞=500N_{\mathcal{L}\mathcal{C}}=500, N𝒰𝒞=30000N_{\mathcal{U}\mathcal{C}}=30000, and N=ρ⋅500N_{\mathcal{L}\mathcal{M}}=\rho\cdot 500 with the hyperparameter ρ∈{2,3,4,5}\rho\in\{2,3,4,5\}.
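For concreteness, a minimal NumPy sketch of this generative mechanism (our own reproduction of the recipe above, not code from the original study) reads:

```python
import numpy as np

def generate_setting_I(n, omega, rng):
    """Simulate (Y, X) under setting (I): binary Y, a Y-dependent Gaussian
    mixture for X_{P_mc} = (X1, X2, X3), and a Y-dependent linear model
    with omega-scaled coefficients for X_{P_m} = (X4, X5)."""
    y = rng.binomial(1, 0.5, size=n)
    mu1, mu0 = np.array([0.4, -0.5, 0.5]), np.array([0.6, -0.3, 0.7])
    x_obs = rng.standard_normal((n, 3)) + np.where(y[:, None] == 1, mu1, mu0)
    B1 = np.array([[1.0, -2.0, 3.0], [2.0, -2.0, -1.0]])
    B0 = np.array([[-1.0, 3.0, -2.0], [-3.0, 2.0, 2.0]])
    x_mis = omega * np.where(y[:, None] == 1, x_obs @ B1.T, x_obs @ B0.T)
    x_mis += rng.standard_normal((n, 2))   # independent N(0, I_2) noise
    return y, np.hstack([x_obs, x_mis])

rng = np.random.default_rng(0)
y, X = generate_setting_I(500, omega=0.3, rng=rng)   # the LC sample, N_LC = 500
```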

(II) Binary YY and nonlinear P𝑿𝒫m{\rm P}_{\boldsymbol{X}_{\mathcal{P}_{m}}}.

We generate (Y,𝑿𝒫mc)(Y,\boldsymbol{X}_{\mathcal{P}_{m^{c}}}) in the same way as in (I) but generate 𝑿𝒫m\boldsymbol{X}_{\mathcal{P}_{m}} through a nonlinear model: given 𝑿𝒫mc\boldsymbol{X}_{\mathcal{P}_{m^{c}}} and YY, we generate

𝑿𝒫m={(123221)(X12,X22,X32)+(11)+ϵ if Y=1(132322)(X12,X22,X32)+ϵ if Y=0\boldsymbol{X}_{\mathcal{P}_{m}}=\begin{cases}\begin{pmatrix}1&-2&3\\ 2&-2&-1\end{pmatrix}(X_{1}^{2},X_{2}^{2},X_{3}^{2})^{\top}+\begin{pmatrix}-1\\ 1\end{pmatrix}+\epsilon&\mbox{ if }Y=1\\ \begin{pmatrix}-1&3&-2\\ -3&2&2\end{pmatrix}(X_{1}^{2},X_{2}^{2},X_{3}^{2})^{\top}+\epsilon&\mbox{ if }Y=0\\ \end{cases}

where ϵN(0,𝕀2)\epsilon\sim{\rm N}(0,\mathbb{I}_{2}) is an independent noise. Set N𝒞=500N_{\mathcal{L}\mathcal{C}}=500, N𝒰𝒞=30000N_{\mathcal{U}\mathcal{C}}=30000, and N=2500N_{\mathcal{L}\mathcal{M}}=2500.

(III) Continuous YY and linear model of P𝑿𝒫m{\rm P}_{\boldsymbol{X}_{\mathcal{P}_{m}}}.

First, we generate (X1,X2,X3)(X_{1},X_{2},X_{3}) independently from the uniform distribution on [−1,1][-1,1], and then generate the remaining components of 𝑿\boldsymbol{X} by

X=(0.5,0.5,0.5)𝑿𝒫mc+ϵ,ϵN(0,1),for =4,5.X_{\ell}=(0.5,0.5,0.5)\boldsymbol{X}_{\mathcal{P}_{m^{c}}}+\epsilon_{\ell},\quad\epsilon_{\ell}\sim N(0,1),\quad\mbox{for }\ell=4,5.

The following linear model is used to simulate the response YY:

Y=X1X2+X3+ωX4+ωX5+ϵY,ϵYN(0,1).\displaystyle Y=-X_{1}-X_{2}+X_{3}+\omega X_{4}+\omega X_{5}+\epsilon_{Y},\quad\epsilon_{Y}\sim N(0,1).

The hyperparameter ω\omega is set to range from 0.10.1 to 11, characterizing the association between 𝑿𝒫m\boldsymbol{X}_{\mathcal{P}_{m}} and YY conditional on 𝑿𝒫mc\boldsymbol{X}_{\mathcal{P}_{m}^{c}}. The sample sizes of the three data sets are set to be N𝒞=500,N𝒰𝒞=30000N_{\mathcal{L}\mathcal{C}}=500,N_{\mathcal{U}\mathcal{C}}=30000, and N=ρ⋅500N_{\mathcal{L}\mathcal{M}}=\rho\cdot 500 with the hyperparameter ρ∈{2,3,4,5}\rho\in\{2,3,4,5\}.
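A corresponding sketch for this linear setting, under the same caveat that it is our own illustration, is:

```python
import numpy as np

def generate_setting_III(n, omega, rng):
    """Simulate (Y, X) under setting (III): uniform X_{P_mc}, linear
    X4 and X5 given X_{P_mc}, and a Gaussian linear model for Y."""
    x_obs = rng.uniform(-1.0, 1.0, size=(n, 3))
    # X_l = 0.5 (X1 + X2 + X3) + eps_l with independent N(0, 1) noise, l = 4, 5
    x_mis = 0.5 * x_obs.sum(axis=1, keepdims=True) + rng.standard_normal((n, 2))
    X = np.hstack([x_obs, x_mis])
    y = X @ np.array([-1.0, -1.0, 1.0, omega, omega]) + rng.standard_normal(n)
    return y, X
```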

(IV) Linear regression with multiple BM scenario.

First, we generate (Y,𝑿)(Y,\boldsymbol{X}) in the same way as in (III). In this case, set R=2R=2, 𝑿𝒫m1=X4\boldsymbol{X}_{\mathcal{P}_{m}}^{1}=X_{4}, and 𝑿𝒫m2=X5\boldsymbol{X}_{\mathcal{P}_{m}}^{2}=X_{5}. The sample sizes of the four data sets are set to be N𝒞=N1=N2=500N_{\mathcal{L}\mathcal{C}}=N_{\mathcal{L}\mathcal{M}_{1}}=N_{\mathcal{L}\mathcal{M}_{2}}=500 and N𝒰𝒞=30000N_{\mathcal{U}\mathcal{C}}=30000.  

In settings (I) and (II), the outcome model is the logistic model for Y∼𝑿Y\sim\boldsymbol{X}. In (II), the dependence of 𝑿𝒫m\boldsymbol{X}_{\mathcal{P}_{m}} on 𝑿𝒫mc\boldsymbol{X}_{\mathcal{P}_{m}^{c}} is quadratic, which may make the imputation models subject to model misspecification. We also include (III) and (IV) with linear models for YY to facilitate comparison with a broader set of existing methods for BM regression. In (I) and (III), the hyperparameters ω\omega and ρ\rho are introduced and varied to study how the efficiency gain of our estimators changes with the dependence between the missing and observed variables, as well as with the sample size ratio between \mathcal{L}\mathcal{M} and 𝒞\mathcal{L}\mathcal{C}. In (IV), the hyperparameters ω\omega and ρ\rho are fixed, and the more complicated multiple BM structure is brought in. In each scenario, we evaluate and compare the estimators through their bias and mean squared errors on each coefficient or averaged over all of them.

Methods under comparison.

We summarize all methods under comparison with some implementation details; see also Table 1. 𝒞\mathcal{L}\mathcal{C}-only: the standard estimator 𝜸~\tilde{\boldsymbol{\gamma}} obtained using 𝒞\mathcal{L}\mathcal{C} data only. DEFUSE1: our proposed estimator β^1\hat{\beta}_{1} based on 𝒞\mathcal{L}\mathcal{C} and \mathcal{L}\mathcal{M} data. DEFUSE: our final estimator β^2\hat{\beta}_{2}. MICE: multiple imputation by chained equations with the R package mice (van Buuren and Groothuis-Oudshoorn,, 2011). MBI: the multiple blockwise imputation method of Xue and Qu, (2021) combining 𝒞\mathcal{L}\mathcal{C} and \mathcal{L}\mathcal{M} data. HTLGMM: the data fusion approach of Zhao et al., (2023) based on the generalized method of moments. SSB: the SS and BM fusion method of Song et al., (2024) using the 𝒞\mathcal{L}\mathcal{C}, \mathcal{L}\mathcal{M}, and 𝒰𝒞\mathcal{U}\mathcal{C} samples. The benchmarks MBI, HTLGMM, and SSB are all implemented in our low-dimensional regression settings without sparse penalties. SemiEff: the classic semiparametric efficient estimator β^𝗌𝖾𝗆𝗂𝖾𝖿𝖿\hat{\beta}_{\sf semieff} constructed following Robins et al., (1994). SSL: the SS estimator using 𝒞\mathcal{L}\mathcal{C} and 𝒰𝒞\mathcal{U}\mathcal{C} (Gronsbell et al.,, 2022, e.g.), with its imputation model for YY constructed through natural spline regression.

Method | Nonlinear model | Single BM | Multiple BM | Leverage 𝒰𝒞\mathcal{U}\mathcal{C}
DEFUSE1 | \checkmark | \checkmark | \checkmark |
DEFUSE | \checkmark | \checkmark | \checkmark | \checkmark
MICE | \checkmark | \checkmark | \checkmark |
MBI | | \checkmark | \checkmark |
SSL | \checkmark | | | \checkmark
HTLGMM | \checkmark | \checkmark | |
SSB | | \checkmark | \checkmark | \checkmark
𝒞\mathcal{L}\mathcal{C}-only | \checkmark | | |
SemiEff | \checkmark | \checkmark | |
Table 1: Methods under comparison, including a summary of their applicability to nonlinear (logistic) GLMs, whether they use the single or multiple BM data sets, and whether they use the unlabeled complete sample 𝒰𝒞\mathcal{U}\mathcal{C}. All methods can be implemented and evaluated in our linear-model, single BM setting (III), while some are only applicable to a subset of our simulation settings, e.g., MBI only applies to (III) and (IV) with linear outcome models.

For the conditional distribution of 𝑿𝒫m\boldsymbol{X}_{\mathcal{P}_{m}} used in multiple approaches, we carry out parametric regression to learn its conditional mean function and use the standard method of moments to estimate its conditional covariance matrix, as described in Remark 1. In (I) and (II), we fit Gaussian linear models for 𝑿𝒫m∼𝑿𝒫mc\boldsymbol{X}_{\mathcal{P}_{m}}\sim\boldsymbol{X}_{\mathcal{P}_{m}^{c}} stratified by the binary YY. In (III) and (IV), we fit Gaussian linear regression for 𝑿𝒫m∼(𝑿𝒫mc,Y)\boldsymbol{X}_{\mathcal{P}_{m}}\sim(\boldsymbol{X}_{\mathcal{P}_{m}^{c}},Y). In settings (I), (III), and (IV), our parametric model for 𝑿𝒫m\boldsymbol{X}_{\mathcal{P}_{m}} is correctly specified, while it is misspecified under (II) due to the quadratic effects of 𝑿𝒫mc\boldsymbol{X}_{\mathcal{P}_{m}^{c}}. We also fit parametric models for the imputation of YY. For the data-adaptive calibration functions used in DEFUSE, we set w(𝒙,y;𝜹)=1+𝜹(1,𝒙𝒫mc,y)w(\boldsymbol{x},y;\boldsymbol{\delta})=1+\boldsymbol{\delta}^{\top}(1,\boldsymbol{x}_{\mathcal{P}_{m}^{c}},y) and h(𝒙,y;𝜻)=1+ζ1h(\boldsymbol{x},y;\boldsymbol{\zeta})=1+\zeta_{1} in all settings.
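As an illustration of this two-step fit (a least-squares conditional mean plus a method-of-moments residual covariance), the sketch below mirrors the construction used in (III) and (IV); the function names are ours, and the Gaussian draws feed the MC imputation step of DEFUSE.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def fit_gaussian_imputation(x_mis, x_obs, y):
    """Gaussian linear imputation model for X_{P_m} ~ (X_{P_mc}, Y):
    least-squares conditional mean plus a method-of-moments estimate
    of the residual covariance."""
    Z = np.column_stack([x_obs, y])
    reg = LinearRegression().fit(Z, x_mis)
    resid = x_mis - reg.predict(Z)
    Sigma = resid.T @ resid / (len(y) - Z.shape[1] - 1)   # df-adjusted moments
    return reg, Sigma

def draw_imputations(reg, Sigma, x_obs, y, Q, rng):
    """Q Monte Carlo draws of X_{P_m} per subject from the fitted model."""
    Z = np.column_stack([x_obs, y])
    mean = reg.predict(Z)                                 # (n, d_m)
    L = np.linalg.cholesky(Sigma)
    noise = rng.standard_normal((len(y), Q, Sigma.shape[0])) @ L.T
    return mean[:, None, :] + noise                       # (n, Q, d_m)
```

For settings (I) and (II), the same two functions can simply be applied within each stratum of the binary YY instead of including YY as a covariate.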

4.2 Comparison with existing methods

In settings (I) and (III), we evaluate the accuracy and efficiency of the included estimators with a fixed ω=0.3\omega=0.3 in (I) and ω=1\omega=1 in (III), and the sample size ratio ρ∈{2,3,4,5}\rho\in\{2,3,4,5\}. The results are summarized in Table 2 for setting (I) and Table 4 for (III). These two tables present the relative efficiency (RE) to the 𝒞\mathcal{L}\mathcal{C}-only estimator, defined as the inverse ratio of the average mean squared errors (MSE) over the coefficients γ1,…,γ5\gamma_{1},\ldots,\gamma_{5}. They also include the average absolute bias over all coefficients.

All methods except MICE produce small bias and valid estimates. Compared to recent BM data fusion methods, DEFUSE1 attains significantly higher efficiency than HTLGMM in the logistic model setting (I), and higher efficiency than MBI, SSB, and HTLGMM in the linear setting (III). For example, the RE of DEFUSE1 is around 30%30\% larger than that of MBI in setting (III) with ρ=2\rho=2. This improvement becomes even larger as ρ=N/N𝒞\rho=N_{\mathcal{L}\mathcal{M}}/N_{\mathcal{L}\mathcal{C}} grows. As another example, DEFUSE1 is 40%40\% more efficient than HTLGMM in the logistic setting (I) with ρ=3\rho=3. As seen from Table 6, our method also attains better performance than both MBI and SSB in setting (IV) with the linear outcome model and multiple BM data structure. For instance, DEFUSE1 has around 50%50\% higher RE than MBI and SSB on γ1\gamma_{1} in setting (IV).

Meanwhile, DEFUSE1 shows REs close to those of the classic semiparametric efficient estimator (SemiEff) in both settings (I) and (III), where the model μ𝑿𝒫m\mu_{\boldsymbol{X}_{\mathcal{P}_{m}}} for 𝑿𝒫m\boldsymbol{X}_{\mathcal{P}_{m}} is correct. This finite-sample result is consistent with our Corollary 2 and Remark 7 that, when μ~𝑿𝒫m\tilde{\mu}_{\boldsymbol{X}_{\mathcal{P}_{m}}} is consistent with the truth, DEFUSE1 is asymptotically equivalent to SemiEff. Moreover, by incorporating the large 𝒰𝒞\mathcal{U}\mathcal{C} sample, DEFUSE achieves around 10%10\% higher average efficiency than DEFUSE1 in all settings, which agrees with our Corollary 3. Such improvements are demonstrated and discussed more comprehensively in Section 4.3. We also notice that the standard SSL method has no efficiency gain over the 𝒞\mathcal{L}\mathcal{C}-only estimator in settings (I) and (III) because it only incorporates 𝒰𝒞\mathcal{U}\mathcal{C}, which has been shown to contribute zero improvement when the outcome model for YY is correctly specified (Chakrabortty and Cai,, 2018). This finding also corresponds to Remark 3 that reversing our Steps 2 and 3 in Algorithm 1 tends to result in a less efficient estimator.

In Tables 3 and 5, we present the RE (to 𝒞\mathcal{L}\mathcal{C}-only) on each coefficient in simulation setting (II), where the nonlinear model for 𝑿𝒫m\boldsymbol{X}_{\mathcal{P}_{m}} is not captured by the parametric conditional model. To demonstrate the importance of properly specifying the calibration function w(𝒙,y;𝜹)w(\boldsymbol{x},y;\boldsymbol{\delta}), we take w(𝒙,y;𝜹)=1+𝜹ξ(𝒙,y)w(\boldsymbol{x},y;\boldsymbol{\delta})=1+\boldsymbol{\delta}^{\top}\xi(\boldsymbol{x},y) and include two choices of ξ(𝒙,y)\xi(\boldsymbol{x},y) for the construction of DEFUSE1: (1) ξ(𝒙,y)=1\xi(\boldsymbol{x},y)=1 and (2) ξ(𝒙,y)=(1,𝒙𝒫mc,y)\xi(\boldsymbol{x},y)=(1,\boldsymbol{x}_{\mathcal{P}_{m}^{c}},y). Choice (1) involves only an “intercept” term for calibration and is relevant to proposals in recent work such as Miao et al., (2023). Interestingly, DEFUSE1 with ξ(𝒙,y)=(1,𝒙𝒫mc,y)\xi(\boldsymbol{x},y)=(1,\boldsymbol{x}_{\mathcal{P}_{m}^{c}},y) outperforms both SemiEff and DEFUSE1 with ξ(𝒙,y)=1\xi(\boldsymbol{x},y)=1. For example, on γ3\gamma_{3}, DEFUSE1 with ξ(𝒙,y)=(1,𝒙𝒫mc,y)\xi(\boldsymbol{x},y)=(1,\boldsymbol{x}_{\mathcal{P}_{m}^{c}},y) has around 50%50\% higher RE than both alternatives. This demonstrates that our new data-adaptive calibration approach can effectively improve the estimation efficiency over the classic SemiEff when the imputation model μ~𝑿𝒫m\tilde{\mu}_{\boldsymbol{X}_{\mathcal{P}_{m}}} is wrongly specified or of poor quality. To realize this advantage, it is important to specify the calibration function w(𝒙,y;𝜹)w(\boldsymbol{x},y;\boldsymbol{\delta}) properly, as seen by comparing DEFUSE1 under the two different choices of ξ(𝒙,y)\xi(\boldsymbol{x},y) in this setting.

Methods ratio ρ\rho = 2 ratio ρ\rho = 3 ratio ρ\rho = 4 ratio ρ\rho = 5
RE Bias RE Bias RE Bias RE Bias
DEFUSE1 1.74 0.011 1.90 0.012 2.15 0.015 2.23 0.015
DEFUSE 1.88 0.007 2.09 0.010 2.45 0.012 2.56 0.012
MICE 0.93 0.190 1.00 0.214 1.18 0.229 1.30 0.240
SSL 1.01 0.020 1.01 0.020 1.01 0.020 1.01 0.020
HTLGMM 1.42 0.029 1.46 0.033 1.52 0.035 1.54 0.036
𝒞\mathcal{L}\mathcal{C}-only 1 0.019 1 0.019 1 0.019 1 0.019
SemiEff 1.76 0.010 1.92 0.010 2.18 0.012 2.26 0.012
Table 2: Average absolute bias and relative efficiency (RE) to 𝒞\mathcal{L}\mathcal{C}-only of the methods in simulation setting (I) with ω=0.3\omega=0.3 and ρ{2,3,4,5}\rho\in\{2,3,4,5\}.
Methods ratio ρ\rho = 2 ratio ρ\rho = 3 ratio ρ\rho = 4 ratio ρ\rho = 5
RE Bias RE Bias RE Bias RE Bias
SemiEff 1.88 0.020 2.22 0.010 2.52 0.016 2.63 0.019
DEFUSE1 ξ(𝒙,y)=1\xi(\boldsymbol{x},y)=1 1.91 0.018 2.26 0.018 2.57 0.024 2.70 0.022
DEFUSE1 ξ(𝒙,y)=(1,𝒙𝒫mc,y)\xi(\boldsymbol{x},y)=(1,\boldsymbol{x}_{\mathcal{P}_{m}^{c}},y) 2.09 0.013 2.51 0.021 2.98 0.022 3.23 0.027
MICE 2.22 0.103 2.72 0.135 3.66 0.151 4.11 0.167
SSL 1.04 0.047 1.06 0.047 1.04 0.045 1.05 0.046
HTLGMM 1.44 0.038 1.63 0.037 1.78 0.042 1.85 0.046
Table 3: Average absolute bias and relative efficiency (RE) to 𝒞\mathcal{L}\mathcal{C}-only of the methods in simulation setting (II) with ω=1\omega=1 and ρ{2,3,4,5}\rho\in\{2,3,4,5\}.
Methods ratio ρ\rho = 2 ratio ρ\rho = 3 ratio ρ\rho = 4 ratio ρ\rho = 5
RE Bias RE Bias RE Bias RE Bias
DEFUSE1 1.32 0.006 1.38 0.007 1.39 0.007 1.42 0.007
DEFUSE 1.41 0.004 1.51 0.004 1.56 0.004 1.62 0.004
MICE 0.06 0.294 0.04 0.381 0.03 0.453 0.02 0.506
MBI 1.11 0.006 1.15 0.007 1.14 0.007 1.16 0.007
SSL 0.98 0.005 0.98 0.005 0.98 0.005 0.98 0.005
HTLGMM 1.15 0.005 1.21 0.005 1.20 0.006 1.22 0.006
SSB 0.95 0.007 1.03 0.009 1.07 0.006 1.11 0.006
𝒞\mathcal{L}\mathcal{C}-only 1 0.005 1 0.005 1 0.005 1 0.005
SemiEff 1.32 0.006 1.38 0.007 1.39 0.007 1.42 0.007
Table 4: Average absolute bias and relative efficiency (RE) to 𝒞\mathcal{L}\mathcal{C}-only in simulation setting (III) with ω=1\omega=1 and ρ{2,3,4,5}\rho\in\{2,3,4,5\}.
Method γ1\gamma_{1} γ2\gamma_{2} γ3\gamma_{3} γ4\gamma_{4} γ5\gamma_{5}
SemiEff 4.48 3.84 2.86 0.99 1.00
DEFUSE1 ξ(𝒙,y)=1\xi(\boldsymbol{x},y)=1 4.44 3.99 3.12 0.98 0.96
DEFUSE1 ξ(𝒙,y)=(1,𝒙𝒫mc,y)\xi(\boldsymbol{x},y)=(1,\boldsymbol{x}_{\mathcal{P}_{m}^{c}},y) 4.87 5.11 3.90 1.23 1.06
MICE 4.86 10.01 4.92 0.18 0.58
SSL 1.01 1.08 1.00 0.93 1.25
HTLGMM 2.41 2.80 1.90 1.06 1.09
Table 5: Relative efficiency (RE) to 𝒞\mathcal{L}\mathcal{C}-only on each coefficient under setting (II).
Method γ1\gamma_{1} γ2\gamma_{2} γ3\gamma_{3} γ4\gamma_{4} γ5\gamma_{5}
DEFUSE1 1.60 1.78 2.39 1.49 1.67
DEFUSE 1.69 1.83 2.55 1.51 1.68
MICE 0.52 0.63 0.69 0.20 0.22
MBI 1.06 1.09 1.06 0.95 1.01
SSB 1.07 1.12 1.51 0.60 0.81
SSL 1.01 1.00 1.00 0.99 1.02
Table 6: Relative efficiency (RE) to 𝒞\mathcal{L}\mathcal{C}-only on each coefficient under Setting (IV).

4.3 Patterns of relative efficiency

In this section, we present more detailed results in settings (I) and (III) to assess DEFUSE under different values of ρ=N/N𝒞\rho=N_{\mathcal{L}\mathcal{M}}/N_{\mathcal{L}\mathcal{C}} and of ω\omega, which quantifies the dependence of 𝑿𝒫m\boldsymbol{X}_{\mathcal{P}_{m}} on (Y,𝑿𝒫mc)(Y,\boldsymbol{X}_{\mathcal{P}_{m}^{c}}). We present results only for the coefficients γ1,3,5\gamma_{1,3,5}; the results for the other coefficients are similar and can be found in the Appendix. We focus on the RE of DEFUSE1 to the 𝒞\mathcal{L}\mathcal{C}-only estimator as well as the RE of DEFUSE to DEFUSE1.

In Figures 3 and 5, we present the REs of DEFUSE1 to 𝒞\mathcal{L}\mathcal{C}-only. A larger efficiency gain of DEFUSE1 over 𝒞\mathcal{L}\mathcal{C}-only can be observed with an increasing sample size ratio ρ=N/N𝒞\rho=N_{\mathcal{L}\mathcal{M}}/N_{\mathcal{L}\mathcal{C}}; see also Tables 2 and 4. This finding agrees with our Corollary 1 that RE(β^1|𝒄𝜸~)\mbox{RE}(\hat{\beta}_{1}|\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}) monotonically increases with ρ\rho. On γ5\gamma_{5}, corresponding to the partially missing X5X_{5}, the efficiency gain of DEFUSE1 tends to be higher with an increasing dependence parameter ω\omega. Intuitively, when ω=0\omega=0, X5X_{5} is independent of (Y,𝑿𝒫mc)(Y,\boldsymbol{X}_{\mathcal{P}_{m}^{c}}), and the \mathcal{L}\mathcal{M} sample with only (Y,𝑿𝒫mc)(Y,\boldsymbol{X}_{\mathcal{P}_{m}^{c}}) observed cannot provide any information about X5X_{5}. Consequently, incorporating \mathcal{L}\mathcal{M} cannot help much on γ5\gamma_{5}, whose score function largely depends on X5X_{5}. On the other hand, under a large ω\omega, (Y,𝑿𝒫mc)(Y,\boldsymbol{X}_{\mathcal{P}_{m}^{c}}) observed on \mathcal{L}\mathcal{M} can be informative about the score function of γ5\gamma_{5} and improve the estimation efficiency on it. Interestingly, for γ1\gamma_{1} and γ3\gamma_{3}, corresponding to covariates in 𝑿𝒫mc\boldsymbol{X}_{\mathcal{P}_{m}^{c}} observed on \mathcal{L}\mathcal{M}, increasing ω\omega tends to cause a lower efficiency gain of our method, which is opposite to the pattern of REs on γ5\gamma_{5}. For example, in setting (III) with ρ=1\rho=1, the RE of DEFUSE1 on γ3\gamma_{3} decreases from 44 to 22 as ω\omega changes from 0.10.1 to 0.80.8; see Figure 5. This is because, as ω\omega becomes larger, the score functions of γ1,3\gamma_{1,3} become more dependent on 𝑿𝒫m\boldsymbol{X}_{\mathcal{P}_{m}}, which makes the missingness of 𝑿𝒫m\boldsymbol{X}_{\mathcal{P}_{m}} on \mathcal{L}\mathcal{M} more impactful.

In Figures 4 and 6, we explore the RE trends of DEFUSE to DEFUSE1 under different sample size ratios ρ\rho and weights ω\omega. Similarly, we find that a larger ρ\rho results in a greater efficiency gain of DEFUSE over DEFUSE1, which agrees with Corollary 3 that RE(β^2|β^1)\mbox{RE}(\hat{\beta}_{2}|\hat{\beta}_{1}) increases with ρ\rho when the nuisance estimators μ~\tilde{\mu} and ν~\tilde{\nu} are of good quality. Also, on γ5\gamma_{5}, corresponding to the BM covariate X5X_{5}, the RE of DEFUSE to DEFUSE1 tends to become larger as the dependence of X5X_{5} on (Y,𝑿𝒫mc)(Y,\boldsymbol{X}_{\mathcal{P}_{m}^{c}}) characterized by ω\omega grows. Nevertheless, the overall improvement of DEFUSE over DEFUSE1 is relatively small on γ5\gamma_{5}. Interestingly, on γ1\gamma_{1} and γ3\gamma_{3}, we find that the RE first increases and then decreases as ω\omega grows. Future theoretical analysis is needed to better understand and explain this non-monotonic trend.

Figure 3: Relative efficiency (RE) of DEFUSE1 to 𝒞\mathcal{L}\mathcal{C}-only on γ1,3,5\gamma_{1,3,5} in setting (I), with ω{0.1,0.2,,1}\omega\in\{0.1,0.2,\ldots,1\} and ρ{2,3,4,5}\rho\in\{2,3,4,5\}. The horizontal black line represents RE=1\mbox{RE}=1.
Figure 4: Relative efficiency (RE) of DEFUSE to DEFUSE1 on γ1,3,5\gamma_{1,3,5} in setting (I), with ω{0.1,0.2,,1}\omega\in\{0.1,0.2,\ldots,1\} and ρ{2,3,4,5}\rho\in\{2,3,4,5\}. The horizontal black line represents RE=1\mbox{RE}=1.
Figure 5: Relative efficiency (RE) of DEFUSE1 to 𝒞\mathcal{L}\mathcal{C}-only on γ1,3,5\gamma_{1,3,5} in setting (III), with ω{0.1,0.2,,1}\omega\in\{0.1,0.2,\ldots,1\} and ρ{2,3,4,5}\rho\in\{2,3,4,5\}. The horizontal black line represents RE=1\mbox{RE}=1.
Figure 6: Relative efficiency (RE) of DEFUSE to DEFUSE1 on γ1,3,5\gamma_{1,3,5} in setting (III), with ω{0.1,0.2,,1}\omega\in\{0.1,0.2,\ldots,1\} and ρ{2,3,4,5}\rho\in\{2,3,4,5\}. The horizontal black line represents RE=1\mbox{RE}=1.

5 Heart Disease Risk Modeling with MIMIC-III

5.1 Extension on Auxiliary Covariates

We begin by introducing a natural extension of our framework to include auxiliary features, which will be used in our real-world study. Consider the scenario with covariates {𝑿,𝑾}\{\boldsymbol{X},\boldsymbol{W}\} consisting of the primary risk factors 𝑿\boldsymbol{X} and auxiliary or surrogate features 𝑾\boldsymbol{W}. Instead of the whole {𝑿,𝑾}\{\boldsymbol{X},\boldsymbol{W}\}, the focus is still on Y∼𝑿Y\sim\boldsymbol{X} defined by the population equation (2). The missing structure on (Y,𝑿)(Y,\boldsymbol{X}) remains the same as described in Section 1.2, i.e., one or multiple subsets of covariates 𝑿𝒫m\boldsymbol{X}_{\mathcal{P}_{m}} are partially missing on the labeled sample, and YY is unobserved on 𝒰𝒞\mathcal{U}\mathcal{C} despite complete 𝑿\boldsymbol{X}. Meanwhile, suppose 𝑾\boldsymbol{W} is complete on all subjects.

Such auxiliary features 𝑾\boldsymbol{W} are commonly used in practice. For instance, in an EMR-linked biobank study, the interest lies in the relationship between certain genetic variants 𝑿\boldsymbol{X} and disease status YY. Here 𝑾\boldsymbol{W} is taken as EMR proxies (e.g., relevant diagnostic codes) for YY and acts as auxiliary features not included in the genetic risk model. In clinical or epidemiological studies, 𝑿\boldsymbol{X} is the treatment or key risk factors, YY is a long-term outcome, and 𝑾\boldsymbol{W} can be early endpoints or surrogate outcomes (e.g., tumor response rates) occurring and observed after the baseline with only 𝑿\boldsymbol{X}. In this case, one is interested in Y∼𝑿Y\sim\boldsymbol{X} to make clinical decisions at baseline without 𝑾\boldsymbol{W}. In both examples, the auxiliary 𝑾\boldsymbol{W} can still be informative about YY and 𝑿\boldsymbol{X} and is thus included to boost statistical efficiency.

Our framework can be naturally extended to accommodate this setting. For the construction described in Section 2 and summarized in Algorithm 1, the only change to make is to allow the inclusion of 𝑾\boldsymbol{W} in the nuisance imputation models μ(𝑿𝒫m|𝑿𝒫mc,𝑾,Y;𝜼){\mu}(\boldsymbol{X}_{\mathcal{P}_{m}}|\boldsymbol{X}_{\mathcal{P}_{m^{c}}},\boldsymbol{W},Y;\boldsymbol{\eta}) and ν(Y|𝑿,𝑾;𝜽)\nu(Y|\boldsymbol{X},\boldsymbol{W};{\boldsymbol{\theta}}), as well as in the calibration functions w(𝑿,𝑾,Y;𝜹)w(\boldsymbol{X},\boldsymbol{W},Y;{\boldsymbol{\delta}}) and h(𝑿,𝑾,Y;𝜻)h(\boldsymbol{X},\boldsymbol{W},Y;{\boldsymbol{\zeta}}). In this case, 𝑾\boldsymbol{W} can often be high-dimensional, as considered in Zhou et al., (2024) and in our real example. The theoretical results established in Section 3 also apply to this setting. Despite the association of 𝑾\boldsymbol{W} with both 𝑿\boldsymbol{X} and YY, including 𝑾\boldsymbol{W} in the nuisance imputation models will not create bias, thanks to constructions like (7) that enable automatic bias correction with respect to the nuisance estimators. It is also worth noting that our data-adaptive control function approach can be particularly useful in the presence of erroneous nuisance estimators with high-dimensional 𝑾\boldsymbol{W}, as illustrated through the real example in this section.

5.2 Setup

To illustrate the utility of DEFUSE, we apply it to MIMIC-III (Medical Information Mart for Intensive Care III; Johnson et al., (2016)), a large database comprising de-identified electronic medical records (EMRs) of over forty thousand patients associated with their stays in the critical care units of the Beth Israel Deaconess Medical Center between 2001 and 2012. The database includes various types of information such as demographics, vital sign measurements made at the bedside, laboratory test results, procedures, medications, caregiver notes, imaging reports, and mortality in and out of the hospital. As EMRs include large and diverse patient populations and cover comprehensive phenotypes, they have been widely used by researchers to identify underlying risk factors for certain disease conditions. Our goal is to make statistical inferences on the risk prediction model for heart disease (HD) against potential risk factors related to the health condition of a subject.

Our response of interest Y∈{0,1}Y\in\{0,1\} is the HD status identified from the manual chart review labels obtained in a previous study on MIMIC-III data (Gehrmann et al.,, 2018); see their paper for the detailed definition of the HD phenotype. We greatly thank the authors for making their gold-standard labels accessible. For 𝑿\boldsymbol{X}, we include age, encounter of the International Classification of Diseases (ICD) code for low high-density lipoprotein (HDL), the ICD encounter for type II diabetes, and several HD-related laboratory (lab) variables selected using KESER (Hong et al.,, 2021), an online tool for extracting clinical concepts and features relevant to a given phenotype. For the convenience of downstream risk modeling, we derive a lab risk score (LRS) by weighting and combining the selected lab variables, with the main ICD code of HD as a surrogate outcome on the large 𝒰𝒞\mathcal{U}\mathcal{C} sample. Since the LRS is derived from a large sample, it has negligible estimation uncertainty and can be viewed as given. See the list of selected lab variables and details about the LRS in the Appendix.

Noticing that each lab variable is unobserved for a particular fraction of patients, we separately consider two settings of missingness: (i) a single BM task with one BM covariate taken as the derived LRS; (ii) a multiple BM task with two BM lab variables, thyroid-stimulating hormone (TSH) and hypochromia (HPO), with joint missingness on 1\mathcal{L}\mathcal{M}_{1}, TSH-only missingness on 2\mathcal{L}\mathcal{M}_{2}, and HPO-only missingness on 3\mathcal{L}\mathcal{M}_{3}. In task (i), there are N𝒞=467N_{\mathcal{L}\mathcal{C}}=467 labeled subjects with complete observation of 𝑿\boldsymbol{X}, N=678N_{\mathcal{L}\mathcal{M}}=678 labeled subjects with BM covariates, and N𝒰𝒞=7037N_{\mathcal{U}\mathcal{C}}=7037 unlabeled subjects with complete 𝑿\boldsymbol{X}. Similar to the simulation, we compare our approaches to the four benchmarks applicable to this case: 𝒞\mathcal{L}\mathcal{C}-only, SSL, SemiEff, and HTLGMM. We do not include MICE since it tends to produce excessive bias, as seen from the simulation results. For (ii), there are N𝒞=486N_{\mathcal{L}\mathcal{C}}=486 labeled subjects with complete observation of 𝑿\boldsymbol{X}, N1=231N_{\mathcal{L}\mathcal{M}_{1}}=231, N2=175N_{\mathcal{L}\mathcal{M}_{2}}=175, and N3=153N_{\mathcal{L}\mathcal{M}_{3}}=153 for the three types of missingness, and N𝒰𝒞=7014N_{\mathcal{U}\mathcal{C}}=7014.

To assist data fusion, we include auxiliary features 𝑾\boldsymbol{W} as introduced in Section 5.1. With all codified features including the ICD and procedure codes in the EMR, we apply a simple pre-screening procedure with the ICD count of HD as a surrogate outcome against all the EMR features on the whole sample, selecting 140140 features to include in 𝑾\boldsymbol{W} for the downstream risk model estimation. For the implementation of DEFUSE, we follow the same strategy as in our simulation, with the conditional models constructed separately through kernel machines (KM) and the Lasso, with their hyper-parameters tuned by cross-validation. Similar to our simulation design, the calibration functions are set as w(𝒙,y;𝜹)=1+𝜹(1,𝒙𝒫mc,y)w(\boldsymbol{x},y;\boldsymbol{\delta})=1+\boldsymbol{\delta}^{\top}(1,\boldsymbol{x}_{\mathcal{P}_{m}^{c}},y) and h(𝒙,y;𝜻)=1+ζ1h(\boldsymbol{x},y;\boldsymbol{\zeta})=1+\zeta_{1}, and we also include DEFUSE1 with ξ(𝒙,y)=1\xi(\boldsymbol{x},y)=1 for comparison with existing calibration strategies (Miao et al.,, 2023, e.g.). To ensure a fair comparison, we use the same machine learning estimators for missing variable imputation in all the benchmark methods where needed.

5.3 Results

Table 7 reports the relative efficiency (RE) of the data-fusion estimators to 𝒞\mathcal{L}\mathcal{C}-only in the single BM task (i). DEFUSE achieves the highest REs on all coefficients. For example, in estimating the HDL coefficient, our method reduces the variance by more than 50%50\% compared to 𝒞\mathcal{L}\mathcal{C}-only, and its RE relative to HTLGMM is around 1.411.41. Interestingly, under both the KM and Lasso implementations, DEFUSE1 with the calibration basis ξ(𝒙,y)=(1,𝒙𝒫mc,y)\xi(\boldsymbol{x},y)=(1,\boldsymbol{x}_{\mathcal{P}_{m}^{c}},y) attains smaller estimation variance on the partially missing predictor LRS than SemiEff and DEFUSE1 with ξ(𝒙,y)=1\xi(\boldsymbol{x},y)=1, while all three have close performance on the other predictors. Specifically, with Lasso imputation, DEFUSE1 with ξ(𝒙,y)=(1,𝒙𝒫mc,y)\xi(\boldsymbol{x},y)=(1,\boldsymbol{x}_{\mathcal{P}_{m}^{c}},y) has 10%10\% improved efficiency over 𝒞\mathcal{L}\mathcal{C}-only, while neither SemiEff nor DEFUSE1 with ξ(𝒙,y)=1\xi(\boldsymbol{x},y)=1 improves over 𝒞\mathcal{L}\mathcal{C}-only. Similarly, when using KM, DEFUSE1 with ξ(𝒙,y)=(1,𝒙𝒫mc,y)\xi(\boldsymbol{x},y)=(1,\boldsymbol{x}_{\mathcal{P}_{m}^{c}},y) attains a 9%9\% higher RE than SemiEff and DEFUSE1 with ξ(𝒙,y)=1\xi(\boldsymbol{x},y)=1. This demonstrates not only the importance of data-adaptive calibration for erroneous machine learning estimators, but also the effectiveness of our newly proposed control function strategy in Section 2.2, which is more general and sophisticated than the simple one with ξ(𝒙,y)=1\xi(\boldsymbol{x},y)=1. In addition, on LRS, DEFUSE with the SSL step shows a further 6–7%7\% efficiency improvement over DEFUSE1, while such improvement is not significant on the other coefficients.

Table 8 summarizes the estimated coefficients together with the bootstrap standard errors and $p$-values on the single BM task, with the nuisance imputation models fitted using KM. From the output of DEFUSE, one can conclude that, given the other risk factors, a higher risk of HD is associated with low HDL at the $0.1$ level and with type II diabetes at the $0.05$ level. Notably, with $\mathcal{L}\mathcal{C}$-only and SSL, the effects of both HDL and diabetes are insignificant even at the $0.1$ level, and diabetes is not significant under HTLGMM. This suggests that DEFUSE is more powerful than these methods in discovering both signals. The associations of heart disease with diabetes (Haffner, 2000; Peters et al., 2014, e.g.) and with low HDL (Després et al., 2000; Rader and Hovingh, 2014, e.g.) are well documented in the biomedical literature, which supports the findings of DEFUSE.
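The standard errors and $p$-values in Table 8 can be obtained from the bootstrap replicates as sketched below; this is a minimal sketch assuming a two-sided Wald test, and the exact bootstrap scheme follows the description in the paper.

```python
import numpy as np
from scipy.stats import norm

def bootstrap_se_pvalue(beta_hat, beta_boot):
    """Empirical standard error over bootstrap replicates of one fitted
    coefficient, with a two-sided Wald p-value for H0: beta = 0."""
    se = np.std(beta_boot, ddof=1)
    pval = 2.0 * norm.sf(abs(beta_hat) / se)
    return se, pval
```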

Finally, Table 9 summarizes the RE relative to the $\mathcal{L}\mathcal{C}$-only estimator in the multiple BM scenario introduced above. DEFUSE attains more than $25\%$ efficiency improvement over $\mathcal{L}\mathcal{C}$-only on all coefficients, and more than $50\%$ variance reduction on the coefficient of age, again a substantially larger gain than that of SSL. A summary table of the estimates and $p$-values in this scenario is provided in the Appendix.

Method | Age | HDL | Diabetes | LRS
SemiEff (KM) | 2.15 | 2.21 | 2.20 | 1.08
DEFUSE1, $\xi(\boldsymbol{x},y)=1$ (KM) | 2.15 | 2.21 | 2.20 | 1.08
DEFUSE1, $\xi(\boldsymbol{x},y)=(1,\boldsymbol{x}_{\mathcal{P}_{m}^{c}},y)$ (KM) | 2.16 | 2.22 | 2.20 | 1.17
DEFUSE (KM) | 2.19 | 2.24 | 2.22 | 1.24
SemiEff (Lasso) | 2.14 | 2.20 | 2.19 | 0.97
DEFUSE1, $\xi(\boldsymbol{x},y)=1$ (Lasso) | 2.14 | 2.20 | 2.19 | 1.00
DEFUSE1, $\xi(\boldsymbol{x},y)=(1,\boldsymbol{x}_{\mathcal{P}_{m}^{c}},y)$ (Lasso) | 2.14 | 2.20 | 2.19 | 1.10
DEFUSE (Lasso) | 2.17 | 2.23 | 2.22 | 1.16
SSL | 1.07 | 0.99 | 1.00 | 1.00
HTLGMM | 1.76 | 1.55 | 1.56 | 1.02
Table 7: Relative efficiency (RE) relative to the standard $\mathcal{L}\mathcal{C}$-only estimator for each coefficient in the single BM scenario of our real example. KM stands for kernel machine and LRS for lab risk score.
Method | Age: coe (se), $p$ | HDL: coe (se), $p$ | Diabetes: coe (se), $p$ | LRS: coe (se), $p$
$\mathcal{L}\mathcal{C}$-only | 0.41 (0.134), 0.002 | 0.35 (0.232), 0.136 | 0.15 (0.232), 0.509 | 0.25 (0.116), 0.032
SemiEff | 0.46 (0.092), <0.001 | 0.26 (0.156), 0.093 | 0.31 (0.156), 0.047 | 0.32 (0.112), 0.004
DEFUSE1 | 0.46 (0.091), <0.001 | 0.27 (0.156), 0.088 | 0.30 (0.156), 0.051 | 0.32 (0.107), 0.002
DEFUSE | 0.46 (0.091), <0.001 | 0.28 (0.155), 0.07 | 0.30 (0.155), 0.05 | 0.39 (0.104), <0.001
SSL | 0.41 (0.130), 0.001 | 0.34 (0.233), 0.140 | 0.16 (0.232), 0.480 | 0.25 (0.116), 0.032
HTLGMM | 0.47 (0.101), <0.001 | 0.37 (0.186), 0.046 | 0.20 (0.185), 0.174 | 0.25 (0.115), 0.029
Table 8: Fitted coefficient (coe), empirical standard error (se), and $p$-value for the coefficient of each covariate in the single BM scenario of our real-world study.
Method | Age | HDL | Diabetes | TSH | Hypochromia
DEFUSE1 | 2.41 | 1.29 | 1.30 | 1.71 | 1.27
DEFUSE | 2.42 | 1.29 | 1.31 | 1.82 | 1.28
SSL | 1.00 | 0.99 | 1.06 | 0.94 | 1.13
Table 9: Relative efficiency (RE) relative to the standard $\mathcal{L}\mathcal{C}$-only estimator for each coefficient in the multiple BM scenario of our real-world study.

6 Discussion

In this paper, we develop DEFUSE, a novel approach for robust and efficient data fusion in the presence of blockwise missing covariates and large unlabeled samples. Its validity and relative efficiency compared to a comprehensive set of existing methods are justified through theoretical, numerical, and real-world studies. The principle of our construction is to leverage the general and powerful theory of semiparametric efficient estimation while incorporating the key data-adaptive calibration and optimal linear allocation procedures, so as to maintain robustness and effectiveness in the more realistic scenario where the nuisance distributional models are imperfect. We close by pointing out limitations and future directions of our work.

Our linear allocation strategy for the multiple BM scenario does not achieve semiparametric efficiency. We notice that the general canonical gradient method proposed by Li and Luedtke (2023) could be leveraged to further improve $\hat{\beta}_{1}^{\bullet}$ in this direction, even with non-nested sets of observed covariates in $\mathcal{L}\mathcal{M}_{1},\ldots,\mathcal{L}\mathcal{M}_{R}$. Nevertheless, it remains an open question how to incorporate our data-adaptive calibration approach and the SS setting into this more complicated framework. In the simulation study, we adopt parametric and kernel methods to construct the conditional models $\tilde{\mu}$ and $\tilde{\nu}$. As was pointed out, DEFUSE allows the use of more complex tools such as generative adversarial models to estimate the unknown conditional distributions, which could achieve better performance for relatively high-dimensional missing covariates. However, care must be taken to avoid overfitting and excessive computational costs.

Also, our current method requires the $\mathcal{L}\mathcal{C}$ sample, which may not be accessible in certain scenarios (Song et al., 2024, e.g.). It is not hard to see that (2) can actually be identified and constructed with the $\mathcal{U}\mathcal{C}$ and $\mathcal{L}\mathcal{M}_{1},\ldots,\mathcal{L}\mathcal{M}_{R}$ samples as long as the union of the covariate sets observed across the $\mathcal{L}\mathcal{M}_{r}$'s is complete. It would be interesting to generalize our method to accommodate this setting without observation of $\mathcal{L}\mathcal{C}$. One possible way is to first utilize part of the $\mathcal{L}\mathcal{M}$ data sets to derive a preliminary estimator that is asymptotically normal but not efficient, and then implement steps similar to Algorithm 1 to reduce its variance. Other more complicated missing structures, e.g., single or multiple missing blocks in the unlabeled sample, may also warrant future research.

We notice a comprehensive set of recent literature addressing covariate shift in $\boldsymbol{X}$ (Liu et al., 2023; Qiu et al., 2023, e.g.), label or model shift in $Y$ (Li et al., 2023a; Zhao et al., 2023, e.g.), as well as high dimensionality (Xue et al., 2021; Song et al., 2024, e.g.) in data fusion under various types of missingness. DEFUSE can also be generalized to address these practical issues. For a high-dimensional sparse regression model of $Y$, we can use the debiased Lasso (Zhang and Zhang, 2014; Van de Geer et al., 2014; Javanmard and Montanari, 2014, e.g.) in Step 1 of Algorithm 1 to derive the preliminary estimator $\boldsymbol{c}^{\top}\tilde{\boldsymbol{\gamma}}$ that is $n^{-1/2}$-consistent and asymptotically normal. The remaining Steps 2 and 3 can then be implemented as in Algorithm 1. Since our main theorems only require $o_{\sf P}(1)$ convergence of the nuisance models, the construction of our nuisance parameters could naturally incorporate sparse regularization to accommodate high dimensionality without introducing excessive bias. For potential covariate shift across the $\mathcal{L}\mathcal{C}$, $\mathcal{L}\mathcal{M}$, and $\mathcal{U}\mathcal{C}$ samples, one can introduce density ratio (propensity score) models for adjustment (Qiu et al., 2023; Liu et al., 2023, e.g.) on top of our current MCAR construction. This results in doubly robust forms requiring stronger assumptions on the correctness and convergence of the nuisance models. For conditional model shift of $Y$, adaptive data fusion under our setting can be challenging, especially for statistical inference. We notice some related recent literature, such as Han et al. (2021) and Li et al. (2023a), and leave this open problem to future research.
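To make the debiased-Lasso route concrete, the following is a minimal, hypothetical sketch of such a Step-1 preliminary estimator for a single coordinate, written in Python with sklearn's LassoCV for both the outcome and nodewise regressions; it illustrates the debiasing formula rather than our actual implementation.

```python
import numpy as np
from sklearn.linear_model import LassoCV

def debiased_lasso_coef(X, y, j):
    """Debiased Lasso estimate of the j-th coefficient (Zhang and Zhang,
    2014): correct the Lasso estimate of beta_j using the residual z of
    a nodewise Lasso regression of X[:, j] on the remaining covariates.
    The corrected estimate is root-n consistent and asymptotically
    normal, so it can serve as a preliminary estimator in Step 1."""
    lasso = LassoCV(cv=5).fit(X, y)          # initial sparse fit
    resid = y - lasso.predict(X)             # outcome residual
    X_mj = np.delete(X, j, axis=1)
    nodewise = LassoCV(cv=5).fit(X_mj, X[:, j])
    z = X[:, j] - nodewise.predict(X_mj)     # decorrelated direction
    return lasso.coef_[j] + z @ resid / (z @ X[:, j])
```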

References

  • Angelopoulos et al., (2023) Angelopoulos, A. N., Duchi, J. C., and Zrnic, T. (2023). PPI++: Efficient prediction-powered inference. arXiv preprint arXiv:2311.01453.
  • Athey et al., (2019) Athey, S., Tibshirani, J., and Wager, S. (2019). Generalized random forests. The Annals of Statistics, 47(2):1148–1178.
  • Azriel et al., (2022) Azriel, D., Brown, L. D., Sklar, M., Berk, R., Buja, A., and Zhao, L. (2022). Semi-supervised linear regression. Journal of the American Statistical Association, 117(540):2238–2251.
  • Bycroft et al., (2018) Bycroft, C., Freeman, C., Petkova, D., Band, G., Elliott, L. T., Sharp, K., Motyer, A., Vukcevic, D., Delaneau, O., O’Connell, J., et al. (2018). The UK Biobank resource with deep phenotyping and genomic data. Nature, 562(7726):203–209.
  • Caflisch, (1998) Caflisch, R. E. (1998). Monte Carlo and quasi-Monte Carlo methods. Acta Numerica, 7:1–49.
  • Cai and Guo, (2020) Cai, T. T. and Guo, Z. (2020). Semisupervised inference for explained variance in high dimensional linear regression and its applications. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 82(2):391–419.
  • Castro et al., (2022) Castro, V. M., Gainer, V., Wattanasin, N., Benoit, B., Cagan, A., Ghosh, B., Goryachev, S., Metta, R., Park, H., Wang, D., et al. (2022). The Mass General Brigham Biobank portal: an i2b2-based data repository linking disparate and high-dimensional patient data to support multimodal analytics. Journal of the American Medical Informatics Association, 29(4):643–651.
  • Chakrabortty and Cai, (2018) Chakrabortty, A. and Cai, T. (2018). Efficient and adaptive linear regression in semi-supervised settings. The Annals of Statistics, 46(4):1541–1572.
  • Chen et al., (2008) Chen, X., Hong, H., and Tarozzi, A. (2008). Semiparametric efficiency in GMM models with auxiliary data.
  • Chernozhukov et al., (2018) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., and Robins, J. (2018). Double/debiased machine learning for treatment and structural parameters. The Econometrics Journal, 21(1):C1–C68.
  • Chernozhukov et al., (2016) Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., and Newey, W. K. (2016). Double machine learning for treatment and causal parameters. Technical report, cemmap working paper.
  • Deng et al., (2023) Deng, S., Ning, Y., Zhao, J., and Zhang, H. (2023). Optimal and safe estimation for high-dimensional semi-supervised learning. Journal of the American Statistical Association, pages 1–12.
  • Després et al., (2000) Després, J.-P., Lemieux, I., Dagenais, G.-R., Cantin, B., and Lamarche, B. (2000). HDL-cholesterol as a marker of coronary heart disease risk: the Quebec Cardiovascular Study. Atherosclerosis, 153(2):263–272.
  • Gan and Liang, (2023) Gan, F. and Liang, W. (2023). Prediction de-correlated inference. arXiv preprint arXiv:2312.06478.
  • Gehrmann et al., (2018) Gehrmann, S., Dernoncourt, F., Li, Y., Carlson, E. T., Wu, J. T., Welt, J., Foote Jr, J., Moseley, E. T., Grant, D. W., Tyler, P. D., et al. (2018). Comparing deep learning and concept extraction based methods for patient phenotyping from clinical narratives. PLoS ONE, 13(2):e0192360.
  • Gronsbell et al., (2022) Gronsbell, J., Liu, M., Tian, L., and Cai, T. (2022). Efficient evaluation of prediction rules in semi-supervised settings under stratified sampling. Journal of the Royal Statistical Society: Series B: Statistical Methodology.
  • Haffner, (2000) Haffner, S. M. (2000). Coronary heart disease in patients with diabetes.
  • Han et al., (2021) Han, L., Hou, J., Cho, K., Duan, R., and Cai, T. (2021). Federated adaptive causal estimation (face) of target treatment effects. arXiv preprint arXiv:2112.09313.
  • Hatt et al., (2022) Hatt, T., Berrevoets, J., Curth, A., Feuerriegel, S., and van der Schaar, M. (2022). Combining observational and randomized data for estimating heterogeneous treatment effects. arXiv preprint arXiv:2202.12891.
  • Hong et al., (2021) Hong, C., Rush, E., Liu, M., Zhou, D., Sun, J., Sonabend, A., Castro, V. M., Schubert, P., Panickan, V. A., Cai, T., et al. (2021). Clinical knowledge extraction via sparse embedding regression (KESER) with multi-center large scale electronic health record data. NPJ Digital Medicine, 4(1):1–11.
  • Javanmard and Montanari, (2014) Javanmard, A. and Montanari, A. (2014). Confidence intervals and hypothesis testing for high-dimensional regression. The Journal of Machine Learning Research, 15(1):2869–2909.
  • Jin and Rothenhäusler, (2023) Jin, Y. and Rothenhäusler, D. (2023). Modular regression: Improving linear models by incorporating auxiliary data. Journal of Machine Learning Research, 24(351):1–52.
  • Johns, (1988) Johns, M. V. (1988). Importance sampling for bootstrap confidence intervals. Journal of the American Statistical Association, 83(403):709–714.
  • Johnson et al., (2016) Johnson, A. E., Pollard, T. J., Shen, L., Lehman, L.-w. H., Feng, M., Ghassemi, M., Moody, B., Szolovits, P., Anthony Celi, L., and Mark, R. G. (2016). MIMIC-III, a freely accessible critical care database. Scientific Data, 3(1):1–9.
  • Kawakita and Kanamori, (2013) Kawakita, M. and Kanamori, T. (2013). Semi-supervised learning with density-ratio estimation. Machine Learning, 91(2):189–209.
  • Kovvali, (2022) Kovvali, N. (2022). Theory and applications of Gaussian quadrature methods. Springer Nature.
  • Kundu and Chatterjee, (2023) Kundu, P. and Chatterjee, N. (2023). Logistic regression analysis of two-phase studies using generalized method of moments. Biometrics, 79(1):241–252.
  • (28) Li, S., Gilbert, P. B., and Luedtke, A. (2023a). Data fusion using weakly aligned sources. arXiv preprint arXiv:2308.14836.
  • Li and Luedtke, (2023) Li, S. and Luedtke, A. (2023). Efficient estimation under data fusion. Biometrika, 110(4):1041–1054.
  • (30) Li, Y., Yang, H., Yu, H., Huang, H., and Shen, Y. (2023b). Penalized estimating equations for generalized linear models with multiple imputation. The Annals of Applied Statistics, 17(3):2345–2363.
  • Liu et al., (2023) Liu, M., Zhang, Y., Liao, K. P., and Cai, T. (2023). Augmented transfer regression learning with semi-non-parametric nuisance models. Journal of Machine Learning Research, 24(293):1–50.
  • The All of Us Research Program Investigators, (2019) The All of Us Research Program Investigators (2019). The “All of Us” Research Program. New England Journal of Medicine, 381(7):668–676.
  • Miao et al., (2023) Miao, J., Miao, X., Wu, Y., Zhao, J., and Lu, Q. (2023). Assumption-lean and data-adaptive post-prediction inference. arXiv preprint arXiv:2311.14220.
  • Mirza and Osindero, (2014) Mirza, M. and Osindero, S. (2014). Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784.
  • Negahban et al., (2012) Negahban, S. N., Ravikumar, P., Wainwright, M. J., Yu, B., et al. (2012). A unified framework for high-dimensional analysis of M-estimators with decomposable regularizers. Statistical Science, 27(4):538–557.
  • Peters et al., (2014) Peters, S. A., Huxley, R. R., and Woodward, M. (2014). Diabetes as risk factor for incident coronary heart disease in women compared with men: a systematic review and meta-analysis of 64 cohorts including 858,507 individuals and 28,203 coronary events. Diabetologia, 57:1542–1551.
  • Qiu et al., (2023) Qiu, H., Tchetgen, E. T., and Dobriban, E. (2023). Efficient and multiply robust risk estimation under general forms of dataset shift. arXiv preprint arXiv:2306.16406.
  • Rader and Hovingh, (2014) Rader, D. J. and Hovingh, G. K. (2014). HDL and cardiovascular disease. The Lancet, 384(9943):618–625.
  • Robins et al., (1994) Robins, J. M., Rotnitzky, A., and Zhao, L. P. (1994). Estimation of regression coefficients when some regressors are not always observed. Journal of the American Statistical Association, 89(427):846–866.
  • Schmutz et al., (2022) Schmutz, H., Humbert, O., and Mattei, P.-A. (2022). Don’t fear the unlabelled: safe semi-supervised learning via debiasing. In The Eleventh International Conference on Learning Representations.
  • Shi et al., (2023) Shi, X., Pan, Z., and Miao, W. (2023). Data integration in causal inference. Wiley Interdisciplinary Reviews: Computational Statistics, 15(1):e1581.
  • Song et al., (2023) Song, S., Lin, Y., and Zhou, Y. (2023). A general m-estimation theory in semi-supervised framework. Journal of the American Statistical Association, pages 1–11.
  • Song et al., (2024) Song, S., Lin, Y., and Zhou, Y. (2024). Semi-supervised inference for block-wise missing data without imputation. Journal of Machine Learning Research, 25(99):1–36.
  • Sudlow et al., (2015) Sudlow, C., Gallacher, J., Allen, N., Beral, V., Burton, P., Danesh, J., Downey, P., Elliott, P., Green, J., Landray, M., et al. (2015). UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779.
  • Tian et al., (2007) Tian, L., Cai, T., Goetghebeur, E., and Wei, L. (2007). Model evaluation based on the sampling distribution of estimated absolute prediction error. Biometrika, 94(2):297–311.
  • van Buuren and Groothuis-Oudshoorn, (2011) van Buuren, S. and Groothuis-Oudshoorn, K. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3):1–67.
  • Van de Geer et al., (2014) Van de Geer, S., Bühlmann, P., Ritov, Y., Dezeure, R., et al. (2014). On asymptotically optimal confidence regions and tests for high-dimensional models. The Annals of Statistics, 42(3):1166–1202.
  • Van der Vaart, (2000) Van der Vaart, A. W. (2000). Asymptotic Statistics, volume 3. Cambridge University Press.
  • Verma et al., (2023) Verma, A., Huffman, J. E., Rodriguez, A., Conery, M., Liu, M., Ho, Y.-L., Kim, Y., Heise, D. A., Guare, L., Panickan, V. A., et al. (2023). Diversity and scale: genetic architecture of 2,068 traits in the VA Million Veteran Program. medRxiv.
  • Wang et al., (2024) Wang, X., Liu, M., Nogues, I.-E., Chen, T., Xiong, X., Bonzel, C.-L., Zhang, H., Hong, C., Xia, Y., Dahal, K., et al. (2024). Heterogeneous associations between interleukin-6 receptor variants and phenotypes across ancestries and implications for therapy. Scientific Reports, 14(1):8021.
  • White et al., (2011) White, I. R., Royston, P., and Wood, A. M. (2011). Multiple imputation using chained equations: issues and guidance for practice. Statistics in medicine, 30(4):377–399.
  • Wolfson et al., (2010) Wolfson, M., Wallace, S. E., Masca, N., Rowe, G., Sheehan, N. A., Ferretti, V., LaFlamme, P., Tobin, M. D., Macleod, J., Little, J., et al. (2010). DataSHIELD: resolving a conflict in contemporary bioscience performing a pooled analysis of individual-level data without sharing the data. International journal of epidemiology, 39(5):1372–1382.
  • Wu and Liu, (2023) Wu, D. and Liu, M. (2023). Robust and efficient semi-supervised learning for Ising model. arXiv preprint arXiv:2311.15031.
  • Xue et al., (2021) Xue, F., Ma, R., and Li, H. (2021). Statistical inference for high-dimensional linear regression with blockwise missing data. arXiv preprint arXiv:2106.03344.
  • Xue and Qu, (2021) Xue, F. and Qu, A. (2021). Integrating multisource block-wise missing data in model selection. Journal of the American Statistical Association, 116(536):1914–1927.
  • Yang and Ding, (2019) Yang, S. and Ding, P. (2019). Combining multiple observational data sources to estimate causal effects. Journal of the American Statistical Association.
  • Yu et al., (2020) Yu, G., Li, Q., Shen, D., and Liu, Y. (2020). Optimal sparse linear prediction for block-missing multi-modality data without imputation. Journal of the American Statistical Association, 115(531):1406–1419.
  • Zengini et al., (2018) Zengini, E., Hatzikotoulas, K., Tachmazidou, I., Steinberg, J., Hartwig, F. P., Southam, L., Hackinger, S., Boer, C. G., Styrkarsdottir, U., Gilly, A., et al. (2018). Genome-wide analyses using UK Biobank data provide insights into the genetic architecture of osteoarthritis. Nature Genetics, 50(4):549–558.
  • Zhang and Zhang, (2014) Zhang, C.-H. and Zhang, S. S. (2014). Confidence intervals for low dimensional parameters in high dimensional linear models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 76(1):217–242.
  • Zhang et al., (2023) Zhang, H., Li, Y., Lu, W., and Lin, Q. (2023). On the optimality of misspecified kernel ridge regression. In International Conference on Machine Learning, pages 41331–41353. PMLR.
  • Zhang and Bradic, (2022) Zhang, Y. and Bradic, J. (2022). High-dimensional semi-supervised learning: in search of optimal inference of the mean. Biometrika, 109(2):387–403.
  • Zhao et al., (2023) Zhao, R., Kundu, P., Saha, A., and Chatterjee, N. (2023). Heterogeneous transfer learning for building high-dimensional generalized linear models with disparate datasets. arXiv preprint arXiv:2312.12786.
  • Zhou et al., (2024) Zhou, D., Liu, M., Li, M., and Cai, T. (2024). Doubly robust augmented model accuracy transfer inference with high dimensional features. Journal of the American Statistical Association, (just-accepted):1–26.