
Transfer learning of individualized treatment rules from experimental to real-world data

Lili Wu and Shu Yang  
Department of Statistics, North Carolina State University
The authors gratefully acknowledge NSF grant DMS 1811245, NCI grant P01 CA142538, NIA grant 1R01AG066883, and NIEHS grant 1R01ES031651.
Abstract

Individualized treatment effect lies at the heart of precision medicine. Interpretable individualized treatment rules (ITRs) are desirable for clinicians and policymakers due to their intuitive appeal and transparency. The gold-standard approach to estimating ITRs is randomized experiments, where subjects are randomized to different treatment groups and bias is minimized to the extent possible. However, experimental data are limited in external validity because of their selection restrictions and therefore are not representative of the target real-world population. Conventional learning methods of optimal interpretable ITRs for a target population based only on experimental data are biased. On the other hand, real-world data (RWD) are becoming popular and provide a representative sample of the target population. To learn generalizable optimal interpretable ITRs, we propose an integrative transfer learning method based on weighting schemes to calibrate the covariate distribution of the experiment to that of the RWD. We show that the proposed ITR estimator enjoys a theoretical guarantee of risk consistency. We evaluate the finite-sample performance of the transfer learner through simulation and apply it to a real data application of a job training program.


Keywords: Augmented inverse probability weighting, Classification error, Covariate shift, Cross-validation, Transportability, Weighting calibration

1 Introduction

Personalized recommendations, tailored to individual characteristics, are becoming increasingly popular in areas such as healthcare, education, and e-commerce. An optimal individualized treatment rule (ITR) is defined as the rule that maximizes the mean of an outcome of interest over the target population when all individuals in the population follow the given rule. Many machine learning approaches to estimating the optimal ITR are available, such as outcome-weighted learning with support vector machines (Zhao et al., 2012; Zhou et al., 2017), regression-based methods with Adaboost (Kang et al., 2014), regularized linear basis functions (Qian and Murphy, 2011), K-nearest neighbors (Zhou and Kosorok, 2017) and generalized additive models (Moodie et al., 2014). However, the ITRs derived by machine learning methods can be too complex to extract practically meaningful insights for policymakers. For example, in healthcare, clinicians need to scrutinize the estimated ITRs for scientific validity, but black-box rules conceal the relationship between patient characteristics and treatment recommendations. In these cases, parsimonious and interpretable ITRs are desirable.

It is well known that randomized experiments are the gold standard for learning optimal ITRs because randomization of treatments in the design stage ensures no unmeasured confounding (Greenland, 1990). However, randomized experiments might lack external validity or representativeness (Rothwell, 2005) because of the selective criteria for individual eligibility in the experiments. Consequently, the distribution of the covariates in the randomized experiments might differ from that in the target population. Thus, a parsimonious ITR constructed from randomized experiments cannot directly generalize to a target population (Zhao et al., 2019). Alternatively, real-world data (RWD) such as population survey data usually involve a large and representative sample of the target population. A large body of literature in statistics has focused on transporting or generalizing average treatment effects from experiments to a target population or a different study population (e.g., Cole and Stuart, 2010; Hartman et al., 2015). Exceptions for the ITRs are Zhao et al. (2019) and Mo et al. (2020), which estimate the optimal linear ITR from the experimental data by optimizing the worst-case quality assessment among all covariate distributions in the target population satisfying some moment conditions or distributional closeness; in contrast, we address a different situation where a representative sample is already at hand.

In this article, we propose a general framework for estimating a generalizable interpretable optimal ITR from randomized experiments for a target population represented by the RWD. We propose transfer weighting that shifts the covariate distribution in the randomized experiments to that in the RWD. To correctly estimate the population optimal ITR, transfer weights are used in two places. First, the empirical estimator of the population value function is biased when considering only the experimental sample, and hence the transfer weights are used to correct the bias of the value estimator. Second, the nuisance functions in the value function, such as the propensity score and the outcome mean, also require transfer weighting in order to estimate their population counterparts. We then obtain the weighted interpretable optimal ITR by reformulating the problem of maximizing the weighted value as that of minimizing a weighted classification error with flexible loss function choices. We consider different parametric and nonparametric approaches to learning the transfer weights. Moreover, a practical issue arises regarding the choice of transfer weights. Weighting corrects the bias but also reduces the effective sample size and increases the variance of the weighted value estimator. To address this bias-variance tradeoff, we propose a cross-validation procedure that maximizes the estimated population value function to choose among the weighting schemes. Lastly, we provide a theoretical guarantee of the risk consistency of the proposed transfer weighted ITRs.

The remainder of the article is organized as follows. Section 2 presents the basic setup and reviews existing learning methods for ITRs. We introduce our proposed method in Section 3. In Section 4, we study the asymptotic properties of the proposed estimator to guarantee its performance in the target population theoretically. In Section 5, we use various simulation settings to illustrate how this approach works. In Section 6, we apply the proposed method to the National Supported Work program (LaLonde, 1986) to improve decisions on whether to recommend the job training program based on individual characteristics over the target population. We conclude the article in Section 7.

2 Basic Setup

2.1 Notation

Let $\bm{X}\in\mathcal{X}\subseteq\mathbb{R}^{p}$ be a vector of pre-treatment covariates, $A\in\mathcal{A}=\{0,1\}$ the binary treatment, and $Y\in\mathbb{R}$ the outcome of interest. Under the potential outcomes framework (Rubin, 1974), let $Y^{*}(a)$ denote the potential outcome had the individual received treatment $a\in\mathcal{A}$. Define a treatment regime as a mapping $d:\mathcal{X}\rightarrow\mathcal{A}$ from the space of individual characteristics to the treatments. We then define the potential outcome under regime $d$ as $Y^{*}(d)=\sum_{a\in\mathcal{A}}Y^{*}(a)\mathbf{1}\{d(\bm{X})=a\}$, the value of regime $d$ as $V(d)=\mathbb{E}\{Y^{*}(d)\}$, where the expectation is taken over the distribution in the target population, and the optimal regime $d^{\text{opt}}$ as the one satisfying $V(d^{\text{opt}})\geq V(d)$ for all $d$. As discussed in the introduction, in many applications such as healthcare, interpretable treatment regimes are desirable since they bring more domain insight and are more trustworthy to clinicians than complex black-box regimes. Thus, it is valuable to learn an optimal interpretable regime. Specifically, we focus on regimes with linear forms, i.e., $d(\bm{X};\eta)=\mathbf{1}(\eta_{0}+\eta_{1}^{\intercal}\bm{X}>0)$, denoted by $d_{\eta}$ for simplicity.

We aim at learning ITRs that lead to the highest benefits for the target population of size $N$, $\{\bm{X}_{i},Y_{i}^{*}(0),Y_{i}^{*}(1)\}_{i=1}^{N}$. Suppose we have access to two independent data sources: the experimental data $\{(\bm{X}_{i},A_{i},Y_{i})\}_{i\in\mathcal{I}_{n}}$ and the RWD $\{\bm{X}_{i}\}_{i\in\mathcal{I}_{m}}$, where $\mathcal{I}_{n},\mathcal{I}_{m}\subseteq\{1,\dots,N\}$ are the sample index sets with sizes $n$ and $m$, respectively. In the experimental data, individuals are selected based on certain criteria and then completely randomized to treatments. The experimental data might therefore carry selection bias, leading to a distribution of $\bm{X}$ in the experiment that differs from that in the target population. On the other hand, we assume that individuals in the RWD are a random sample from the target population, so the covariate distribution in the RWD characterizes that of the target population. Let $S_{i}$ be the binary indicator that the $i$th individual participates in the randomized experiment: $S_{i}=1$ if $i\in\mathcal{I}_{n}$ and $0$ if $i\in\mathcal{I}_{m}$. We denote $\pi_{A}(\bm{X})$ as the propensity score of receiving the active treatment, $P(A=1\mid\bm{X})$; $Q(\bm{X},A)$ as the conditional expectation of the outcome given covariates and treatment, $\mathbb{E}(Y\mid\bm{X},A)$; and $\tau(\bm{X})$ as the contrast function $Q(\bm{X},1)-Q(\bm{X},0)$.

We make the following assumptions, which are standard and well studied in the causal inference and treatment regime literature (Chakraborty, 2013).

Assumption 1.
a: $A\perp\!\!\!\perp\{Y^{*}(0),Y^{*}(1)\}\mid(\bm{X},S=1)$. b: $Y=Y^{*}(A)$. c: There exist $c_{1},c_{2}\in(0,1)$ such that $\pi_{A}(\bm{X})\in[c_{1},c_{2}]$ for all $\bm{X}\in\mathcal{X}$.

2.2 Existing learning methods for ITRs

Identifying the optimal individualized treatment rule has been studied in a large body of statistical literature. One category is regression-based, directly estimating the conditional expectation of the outcome given patients' characteristics and treatments received (Murphy, 2005). Both parametric and nonparametric models can be used to approximate $Q(\bm{X},A)$. To encourage interpretability, one strategy is to posit a parametric model $Q(\bm{X},A;\beta)$, and the estimated optimal ITR within the class indexed by $\beta$ is $\mathbf{1}\{\tau(\bm{X};\hat{\beta})>0\}$.

Another category is value-based, first estimating the value of an ITR and then searching for the ITR that renders the highest value. One strategy is to estimate $\pi_{A}(\bm{X})$ first, commonly by positing a parametric model $\pi_{A}(\bm{X}_{i};\gamma)$ such as a logistic regression or a constant, and then use inverse probability weighting (IPW; Horvitz and Thompson, 1952) to estimate the value of an ITR indexed by $\eta$. This estimator is consistent for the true value only if $\pi_{A}(\bm{X}_{i};\gamma)$ is correctly specified and can be unstable if the estimate $\pi_{A}(\bm{X}_{i};\hat{\gamma})$ is close to zero. The augmented inverse probability weighting (AIPW; Zhang, Tsiatis, Laber and Davidian, 2012) approach combines IPW and the outcome regression estimator $\widehat{Q}(\bm{X}_{i},A)$ to estimate the value

$\widehat{V}_{\text{aw}}(d_{\eta};\mathcal{I}_{n})=\frac{1}{n}\sum_{i\in\mathcal{I}_{n}}\left(\left[\frac{A_{i}d(\bm{X}_{i};\eta)}{\pi_{A}(\bm{X}_{i};\hat{\gamma})}+\frac{(1-A_{i})\{1-d(\bm{X}_{i};\eta)\}}{1-\pi_{A}(\bm{X}_{i};\hat{\gamma})}\right]\left[Y_{i}-\widehat{Q}\{\bm{X}_{i},d(\bm{X}_{i};\eta)\}\right]+\widehat{Q}\{\bm{X}_{i},d(\bm{X}_{i};\eta)\}\right).$   (1)

The AIPW estimator is doubly robust in the sense that it is consistent for the value of the ITR $d_{\eta}$ if either the model for $\pi_{A}(\bm{X})$ or that for $Q(\bm{X},A)$ is correctly specified, but not necessarily both.
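For illustration, the following R sketch computes (1) under stated assumptions: dat is a hypothetical data frame with covariates X1, X2, treatment A, and outcome Y from the experimental sample, eta is a given rule coefficient vector, and the nuisance estimates are a linear outcome model and a constant propensity score (as used in the simulations of Section 5); none of these object names come from the paper.

# A minimal sketch of the AIPW value estimator (1); all object names
# (dat, eta, ...) are hypothetical placeholders.
aipw_value <- function(dat, d, pi_hat, Q1_hat, Q0_hat) {
  Qd  <- ifelse(d == 1, Q1_hat, Q0_hat)   # Q evaluated at the recommended arm
  ipw <- dat$A * d / pi_hat +             # inverse-probability weight for
    (1 - dat$A) * (1 - d) / (1 - pi_hat)  # units that followed the rule
  mean(ipw * (dat$Y - Qd) + Qd)           # augmented estimator
}

# Nuisance estimates: linear Q function with treatment interactions and a
# constant propensity score.
q_fit  <- lm(Y ~ (X1 + X2) * A, data = dat)
Q1_hat <- predict(q_fit, transform(dat, A = 1))
Q0_hat <- predict(q_fit, transform(dat, A = 0))
pi_hat <- mean(dat$A)
d_eta  <- as.numeric(cbind(1, dat$X1, dat$X2) %*% eta > 0)  # linear rule
aipw_value(dat, d_eta, pi_hat, Q1_hat, Q0_hat)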

Learning the optimal ITR by optimizing the value can also be formulated as a classification problem (Zhang, Tsiatis, Davidian, Zhang and Laber, 2012). The classification perspective has forms equivalent to all the regression-based and value-based methods above, and can also bring additional optimization advantages over direct value search.

Lemma 1.

The optimal treatment rule obtained by maximizing $V(d_{\eta})$ in $\eta$ is equivalent to that obtained by minimizing the risk

$\mathbb{E}\left(|\tau(\bm{X})|\left[\mathbf{1}\{\tau(\bm{X})>0\}-d(\bm{X};\eta)\right]^{2}\right)$   (2)

in $\eta$ (Zhang, Tsiatis, Davidian, Zhang and Laber, 2012).

Lemma 1 reformulates the problem of estimating an optimal treatment rule as a weighted classification problem, where $|\tau(\bm{X})|$ is regarded as the weight, $\mathbf{1}\{\tau(\bm{X})>0\}$ the binary label, and $d(\bm{X};\eta)$ the classification rule of interest. Let $l_{0-1}$ be the 0-1 loss, i.e., $l_{0-1}(u)=\mathbf{1}(u\leq0)$. The objective function (2) can then be rewritten as

$\mathcal{R}_{\mathcal{F}}(\tau;\eta,l_{0-1})=\mathbb{E}\{\mathcal{L}(\tau;\eta,l_{0-1})\},\quad \mathcal{L}(\tau;\eta,l_{0-1})=|\tau(\bm{X})|\,l_{0-1}\left(\left[2\,\mathbf{1}\{\tau(\bm{X})>0\}-1\right]f(\bm{X};\eta)\right),$   (3)

where $f(\bm{x};\eta)=\eta^{\intercal}\bm{x}$, $d(\bm{x};\eta)=\mathbf{1}\{f(\bm{x};\eta)>0\}$, and $\mathcal{F}$ is a certain class covering the parameters of the ITRs, i.e., $\eta\in\mathcal{F}$.

Let $\hat{\tau}_{i}$ be an estimator of $\tau(\bm{X}_{i})$. From Lemma 1, to estimate the optimal ITR, one can minimize the following empirical objective function:

$n^{-1}\sum_{i\in\mathcal{I}_{n}}|\hat{\tau}_{i}|\{\mathbf{1}(\hat{\tau}_{i}>0)-d(\bm{X}_{i};\eta)\}^{2}
= n^{-1}\sum_{i\in\mathcal{I}_{n}}|\hat{\tau}_{i}|\,\mathbf{1}\{\mathbf{1}(\hat{\tau}_{i}>0)\neq d(\bm{X}_{i};\eta)\}
= n^{-1}\sum_{i\in\mathcal{I}_{n}}|\hat{\tau}_{i}|\,l_{0-1}\left[\{2\,\mathbf{1}(\hat{\tau}_{i}>0)-1\}f(\bm{X}_{i};\eta)\right]
= n^{-1}\sum_{i\in\mathcal{I}_{n}}\mathcal{L}(\hat{\tau}_{i};\eta,l_{0-1}).$   (4)

In this classification framework, we can also consider the three different approaches to estimating $\tau(\bm{X}_{i})$, which have forms equivalent to, and enjoy the same properties as, the regression- and value-based estimators. For example, the AIPW estimator of $\tau(\bm{X}_{i})$ is

$\hat{\tau}_{\mathrm{aw},i}=\left[\frac{\mathbf{1}(A_{i}=1)\{Y_{i}-\widehat{Q}(\bm{X}_{i},1)\}}{\hat{\pi}_{A}(\bm{X}_{i})}+\widehat{Q}(\bm{X}_{i},1)\right]-\left[\frac{\mathbf{1}(A_{i}=0)\{Y_{i}-\widehat{Q}(\bm{X}_{i},0)\}}{1-\hat{\pi}_{A}(\bm{X}_{i})}+\widehat{Q}(\bm{X}_{i},0)\right].$

The solution obtained by minimizing (4) in $\eta$ after replacing $\hat{\tau}_{i}$ with $\hat{\tau}_{\text{aw},i}$ is equivalent to that obtained by maximizing the AIPW value estimator $\widehat{V}_{\text{aw}}(d_{\eta};\mathcal{I}_{n})$ in (1) and enjoys the same doubly robust property.

When the experimental sample is representative of the target population, the above learning methods estimate the population optimal ITR that leads to the highest value. To correct for the selection bias of the experimental sample, we propose an integrative strategy that leverages the randomization in the experimental sample and the representativeness of the RWD. As pointed out by Zhao et al. (2019), the optimal rule obtained from the experimental sample is also optimal for the target population if there is no restriction on the class of $d$, but such optimal rules are too complex. Because we focus on learning interpretable and parsimonious ITRs, integrative methods are required to correct for the selection bias of the experimental sample.

3 Integrative transfer learning of ITRs

3.1 Assumptions for identifying population optimal ITRs

To identify population optimal ITRs, we assume that the covariates $\bm{X}$ fully explain the selection mechanism of inclusion into the randomized experiment and that the covariate distribution in the RWD is representative of the target population.

Assumption 2.
a: $S\perp\!\!\!\perp\{Y^{*}(0),Y^{*}(1),A\}\mid\bm{X}$. b: $f(\bm{X}\mid S=0)=f(\bm{X})$. c: There exist $c_{3},c_{4}\in(0,1)$ such that $\pi_{S}(\bm{X})\in[c_{3},c_{4}]$ for all $\bm{X}\in\mathcal{X}$.

Combining Assumptions 2.a and 1.b implies that the experimental sample is non-informative, i.e., the sampling score is $P(S=1\mid A,\bm{X},Y)=P(S=1\mid\bm{X})=\pi_{S}(\bm{X})$. It requires that the observed covariates $\bm{X}$ be rich enough to determine whether or not a subject is selected into the experimental data.

Assumption 2.b requires that the covariate distribution in the RWD be the same as that in the target population, so that we can leverage the representativeness of the RWD. As a result, $m^{-1}\sum_{i\in\mathcal{I}_{m}}g(\bm{X}_{i})$ is an unbiased estimator of $\mathbb{E}\{g(\bm{X})\}$, the covariate expectation in the target population, for any $g$. Assumptions 2.a and 2.b are also adopted in the covariate shift problem in machine learning, a sub-area of domain adaptation, where standard classifiers do not perform well when the covariates in the training data have a distribution different from that in the testing data. Re-weighting individuals to correct the over- or under-representation is one way to cope with covariate shift in classification problems (Kouw and Loog, 2018).

Assumption 2.c requires that each subject have a positive probability of being selected into the experimental data, which guarantees the existence of $\pi_{S}^{-1}(\bm{X})$.

The following proposition identifies the population value function using the inverse of the sampling score as the transfer weight, and leads to an equivalent weighted classification loss function that identifies the population optimal ITR.

Proposition 1.

Let the transfer weight be $w=\pi_{S}^{-1}(\bm{X})$. Under Assumptions 1.b, 2.a and 2.c, the population value of an ITR $d_{\eta}$ is identified by $V(d_{\eta})=\mathbb{E}\left[wSY^{*}\{d_{\eta}(\bm{X})\}\right]$. Then the population optimal ITR can be obtained by minimizing

$\mathbb{E}\left(wS\,|\tau(\bm{X})|\left[\mathbf{1}\{\tau(\bm{X})>0\}-d_{\eta}(\bm{X})\right]^{2}\right)$ in $\eta$.

Two challenges arise in using Proposition 1 to estimate the optimal ITR: (i) the weights $w$ are unknown, and (ii) the estimators of $\tau(\bm{X})$ discussed so far, such as $\hat{\tau}_{\text{aw}}$, are biased when based only on the experimental sample. In the following subsections, we present the estimation of the transfer weights and unbiased estimators of $\tau(\bm{X})$.

3.2 Estimation of transfer weights

We consider various parametric and nonparametric methods to estimate the transfer weights $w$. Each has its own benefits: parametric methods are easy to implement and efficient if the models are correctly specified, while nonparametric methods are more robust and less sensitive to model misspecification.

3.2.1 Parametric approach

For parametric approaches, we assume the target population size $N$ is known. Similar to modeling the propensity score $\pi_{A}(\bm{X})$, we can posit a logistic regression model for $\pi_{S}(\bm{X})$, i.e., $\text{logit}\{\pi_{S}(\bm{X};\alpha)\}=\alpha_{0}+\alpha_{1}^{\intercal}\bm{X}$, where $\alpha=(\alpha_{0},\alpha_{1}^{\intercal})^{\intercal}\in\mathbb{R}^{p+1}$. The typical maximum likelihood estimator of $\alpha$ is

$\hat{\alpha}=\underset{\alpha}{\mathrm{argmax}}\,\frac{1}{N}\sum_{i=1}^{N}\left[S_{i}\log\pi_{S}(\bm{X}_{i};\alpha)+(1-S_{i})\log\{1-\pi_{S}(\bm{X}_{i};\alpha)\}\right]
=\underset{\alpha}{\mathrm{argmax}}\,\frac{1}{N}\sum_{i=1}^{N}\left[S_{i}(\alpha_{0}+\alpha_{1}^{\intercal}\bm{X}_{i})-\log\{1+\exp(\alpha_{0}+\alpha_{1}^{\intercal}\bm{X}_{i})\}\right].$   (5)

However, the second term $N^{-1}\sum_{i=1}^{N}\log\{1+\exp(\alpha_{0}+\alpha_{1}^{\intercal}\bm{X}_{i})\}$ in (5) is not feasible to compute because we cannot observe $\bm{X}$ for all individuals in the population. The key insight is that this term can be estimated by $m^{-1}\sum_{i\in\mathcal{I}_{m}}\log\{1+\exp(\alpha_{0}+\alpha_{1}^{\intercal}\bm{X}_{i})\}$ based on the RWD. This strategy leads to our modified maximum likelihood estimator

$\hat{\alpha}=\underset{\alpha}{\mathrm{argmax}}\left[\frac{1}{N}\sum_{i=1}^{N}S_{i}(\alpha_{0}+\alpha_{1}^{\intercal}\bm{X}_{i})-\frac{1}{m}\sum_{i\in\mathcal{I}_{m}}\log\{1+\exp(\alpha_{0}+\alpha_{1}^{\intercal}\bm{X}_{i})\}\right].$
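A sketch of this modified maximum likelihood estimator, assuming hypothetical inputs X_exp (covariate matrix of the experimental units), X_rwd (covariate matrix of the RWD units), and a known N; optim with BFGS stands in for any smooth optimizer.

# Modified maximum likelihood for the sampling score (sketch); the second
# term of (5) is approximated by the RWD average, as described above.
fit_sampling_score_ml <- function(X_exp, X_rwd, N) {
  Z_exp <- cbind(1, X_exp)                      # add intercept
  Z_rwd <- cbind(1, X_rwd)
  negloglik <- function(alpha) {
    -(sum(drop(Z_exp %*% alpha)) / N -          # (1/N) * sum over S_i = 1
        mean(log1p(exp(drop(Z_rwd %*% alpha)))))  # RWD estimate of 2nd term
  }
  optim(rep(0, ncol(Z_exp)), negloglik, method = "BFGS")$par
}

# Transfer weights for the experimental units: w_i = 1 / pi_S(X_i; alpha_hat).
alpha_hat <- fit_sampling_score_ml(X_exp, X_rwd, N)
w_hat     <- 1 / plogis(drop(cbind(1, X_exp) %*% alpha_hat))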

Alternatively, Assumption 2.b leads to unbiased estimating equations for $\alpha$:

$\mathbb{E}\left\{\frac{Sg(\bm{X})}{\pi_{S}(\bm{X};\alpha)}\right\}=\mathbb{E}\{g(\bm{X})\},$

for any $g$. Therefore, we can solve the following estimating equations

$\frac{1}{N}\sum_{i=1}^{N}\frac{S_{i}g(\bm{X}_{i})}{\pi_{S}(\bm{X}_{i};\alpha)}=\frac{1}{m}\sum_{i\in\mathcal{I}_{m}}g(\bm{X}_{i})$

for $\alpha$, where $g(\bm{X})\in\mathbb{R}^{p+1}$. For simplicity, one can take $g(\bm{X})$ as $\partial\,\text{logit}\{\pi_{S}(\bm{X};\alpha)\}/\partial\alpha$.
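Under the same assumed inputs as the previous sketch, the estimating-equation version with $g(\bm{X})=(1,\bm{X}^{\intercal})^{\intercal}$ can be solved, for instance, by minimizing the squared norm of the empirical moment conditions; this is a sketch, not the authors' implementation.

# Estimating-equation approach (sketch): match the weighted experimental
# moments of g(X) = (1, X') to the RWD moments.
fit_sampling_score_ee <- function(X_exp, X_rwd, N) {
  Z_exp <- cbind(1, X_exp)
  Z_rwd <- cbind(1, X_rwd)
  ee_norm <- function(alpha) {
    w   <- 1 / plogis(drop(Z_exp %*% alpha))  # 1 / pi_S(X; alpha)
    lhs <- colSums(Z_exp * w) / N             # weighted experimental moments
    rhs <- colMeans(Z_rwd)                    # RWD moments
    sum((lhs - rhs)^2)
  }
  optim(rep(0, ncol(Z_exp)), ee_norm, method = "BFGS")$par
}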

3.2.2 Nonparametric approach

The above parametric approaches require $\pi_{S}(\bm{X};\alpha)$ to be correctly specified and the population size $N$ to be known. These requirements may be stringent, because the sampling mechanism of the randomized experiment is unknown and the population size is difficult to obtain. We therefore consider the constrained optimization algorithm of Wang and Zubizarreta (2020) to estimate the transfer weights by

$\underset{w}{\text{min}}\ \sum_{i=1}^{N}S_{i}w_{i}\log w_{i}$
subject to $\Big|\sum_{i=1}^{N}w_{i}S_{i}g_{k}(\bm{X}_{i})-\frac{1}{m}\sum_{i\in\mathcal{I}_{m}}g_{k}(\bm{X}_{i})\Big|\leq\sigma_{k}\quad(k=1,\dots,K),$   (6)

where $\sigma_{k}\geq0$; $w_{i}\in[0,1]$ for $i\in\mathcal{I}_{n}$ with $\sum_{i\in\mathcal{I}_{n}}w_{i}=1$; and $\{g_{1},\ldots,g_{K}\}$ can be chosen as the first-, second- and higher-order moments of the covariate distribution. The constants $\sigma_{k}$ are the tolerance limits on the imbalances in $g_{k}$. When all the $\sigma_{k}$'s are set to $0$, (6) imposes exact balance, and the solution $w$ becomes the entropy balancing weights (Hainmueller, 2012). The choice of $\sigma_{k}$ involves the standard bias-variance tradeoff. On the one hand, if $\sigma_{k}$ is too small, the weight distribution can have large variability in order to satisfy the stringent constraints, and in some extreme scenarios the weights may not exist. On the other hand, if $\sigma_{k}$ is too large, covariate imbalances remain and the resulting weights are not sufficient to correct for the selection bias. Following Wang and Zubizarreta (2020), we choose $\sigma_{k}$ from a pre-specified set by the tuning algorithm described in the Supplementary Materials. A sketch of the exact-balance special case follows.
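The sketch below covers only the exact-balance special case (all $\sigma_{k}=0$), in which the weights solve the entropy balancing dual of Hainmueller (2012); the approximate-balance version with tolerances $\sigma_{k}$ additionally requires the tuning algorithm of Wang and Zubizarreta (2020). G_exp and gbar are hypothetical inputs.

# Entropy balancing (exact balance, all sigma_k = 0): minimize the dual
#   log( sum_i exp(lambda' g(X_i)) ) - lambda' gbar
# over lambda, where the sum runs over experimental units; the weights are
# the softmax of lambda' g(X_i) and satisfy sum_i w_i g_k(X_i) = gbar_k.
entropy_balance <- function(G_exp, gbar) {
  dual <- function(lambda) {
    log(sum(exp(drop(G_exp %*% lambda)))) - sum(lambda * gbar)
  }
  lambda_hat <- optim(rep(0, ncol(G_exp)), dual, method = "BFGS")$par
  w <- exp(drop(G_exp %*% lambda_hat))
  w / sum(w)                        # normalized weights over the experiment
}

# Balancing first and second moments of (X1, X2), a common choice of g_k:
# G_exp <- cbind(dat_exp$X1, dat_exp$X2, dat_exp$X1^2, dat_exp$X2^2)
# gbar  <- colMeans(cbind(dat_rwd$X1, dat_rwd$X2, dat_rwd$X1^2, dat_rwd$X2^2))
# w_np  <- entropy_balance(G_exp, gbar)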

3.3 Estimation of $\tau(\bm{X})$

The transfer weights are also required to estimate $\tau(\bm{X})$ from the experimental data. In parallel to Section 2.2, we can obtain weighted estimators of $\tau(\bm{X})$; for example, the weighted AIPW estimator of $\tau(\bm{X}_{i})$ is

$\hat{\tau}_{\mathrm{aw},i}^{w}=\left[\frac{\mathbf{1}(A_{i}=1)\{Y_{i}-\widehat{Q}_{w}(\bm{X}_{i},1)\}}{\hat{\pi}_{A}^{w}(\bm{X}_{i})}+\widehat{Q}_{w}(\bm{X}_{i},1)\right]-\left[\frac{\mathbf{1}(A_{i}=0)\{Y_{i}-\widehat{Q}_{w}(\bm{X}_{i},0)\}}{1-\hat{\pi}_{A}^{w}(\bm{X}_{i})}+\widehat{Q}_{w}(\bm{X}_{i},0)\right],$

where $\widehat{Q}_{w}(\cdot,a)$ can be estimated by weighted least squares for linear $Q$ functions with the estimated transfer weights, and $\hat{\pi}_{A}^{w}(\bm{X}_{i})=\widehat{P}_{w}(A_{i}=1\mid\bm{X}_{i})$ is the weighted propensity score, which can be obtained by a weighted regression model with the estimated transfer weights using the experimental data.
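For concreteness, a sketch of the weighted nuisance estimates and the weighted AIPW contrast, assuming a hypothetical experimental data frame dat_exp with columns X1, X2, A, Y and transfer weights w_hat; the constant weighted propensity score mirrors the randomized design used in the simulations of Section 5.

# Weighted nuisance estimates via weighted least squares and a weighted
# propensity score, then the weighted AIPW contrast tau_aw^w (sketch).
q_fit <- lm(Y ~ (X1 + X2) * A, data = dat_exp, weights = w_hat)
Q1w   <- predict(q_fit, transform(dat_exp, A = 1))
Q0w   <- predict(q_fit, transform(dat_exp, A = 0))
piAw  <- weighted.mean(dat_exp$A, w_hat)      # weighted constant propensity
tau_aw_w <- with(dat_exp,
  A * (Y - Q1w) / piAw + Q1w -                # treated-arm augmented term
  (1 - A) * (Y - Q0w) / (1 - piAw) - Q0w)     # control-arm augmented term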

Let $\hat{\tau}_{i}^{w}$ be a generic weighted estimator of $\tau(\bm{X}_{i})$. Proposition 1 implies the empirical risk function

$\frac{1}{n}\sum_{i\in\mathcal{I}_{n}}\hat{w}_{i}|\hat{\tau}_{i}^{w}|\{\mathbf{1}(\hat{\tau}_{i}^{w}>0)-d(\bm{X}_{i};\eta)\}^{2}
=\frac{1}{n}\sum_{i\in\mathcal{I}_{n}}\hat{w}_{i}|\hat{\tau}_{i}^{w}|\,l_{0-1}\left[\{2\,\mathbf{1}(\hat{\tau}_{i}^{w}>0)-1\}f(\bm{X}_{i};\eta)\right]$   (7)

to estimate $\eta$.

It is challenging to optimize the objective function (7) since it is non-smooth and non-convex in $\eta$. One way is to optimize it directly using grid search or the genetic algorithm in Zhang, Tsiatis, Laber and Davidian (2012) for learning linear decision rules (implemented with rgenoud in R), but grid search becomes untenable for a higher dimension of $\eta$, and the genetic algorithm cannot guarantee a unique solution. Alternatively, one can replace $l_{0-1}$ with a surrogate loss function, such as the hinge loss used in the support vector machine (Vapnik, 1998) and outcome weighted learning (Zhao et al., 2012), or the ramp loss (Collobert et al., 2006), which truncates the unbounded hinge loss by trading convexity for robustness to outliers (Wu and Liu, 2007). We adopt the smoothed ramp loss function $l(u)$ proposed in Zhou et al. (2017), defined in (8), which retains the robustness and also gains computational advantages from being smooth everywhere. $l(u)$ can be decomposed into the difference of two smooth convex functions, $l(u)=l_{1}(u)-l_{0}(u)$, where $l_{s}(u)$ is defined in (9).

$l(u)=\begin{cases}0&\text{if }u\geq1,\\ (1-u)^{2}&\text{if }0\leq u<1,\\ 2-(1+u)^{2}&\text{if }-1\leq u<0,\\ 2&\text{if }u<-1.\end{cases}$   (8)
$l_{s}(u)=\begin{cases}0&\text{if }u\geq s,\\ (s-u)^{2}&\text{if }s-1\leq u<s,\\ 2s-2u-1&\text{if }u<s-1.\end{cases}$   (9)
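Transcribed directly from (8) and (9), a short R sketch of the smoothed ramp loss and its convex components; one can check numerically that $l(u)=l_{1}(u)-l_{0}(u)$.

# Convex component l_s(u) from (9) and the smoothed ramp loss l(u) from (8),
# recovered as the difference l_1(u) - l_0(u).
l_s <- function(u, s) {
  ifelse(u >= s, 0,
         ifelse(u >= s - 1, (s - u)^2, 2 * s - 2 * u - 1))
}
l_ramp <- function(u) l_s(u, 1) - l_s(u, 0)

stopifnot(all.equal(l_ramp(c(2, 0.5, -0.5, -2)),
                    c(0, 0.25, 1.75, 2)))  # spot-check the four cases of (8)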

Similarly to Zhou et al. (2017), we can apply the d.c. algorithm (Le Thi Hoai and Tao, 1997) to solve the non-convex minimization problem by first decomposing the original objective function into the sum of a convex component and a concave component and then minimizing a sequence of convex subproblems. Applying the decomposition in (9) to (7), our empirical risk objective becomes

$\underbrace{\frac{1}{n}\sum_{i\in\mathcal{I}_{n}}\hat{w}_{i}|\hat{\tau}_{i}^{w}|\,l_{1}(u_{i})}_{\text{a convex function of }\eta}+\underbrace{\frac{1}{n}\sum_{i\in\mathcal{I}_{n}}\left\{-\hat{w}_{i}|\hat{\tau}_{i}^{w}|\,l_{0}(u_{i})\right\}}_{\text{a concave function of }\eta},$   (10)

where $u_{i}=\{2\,\mathbf{1}(\hat{\tau}_{i}^{w}>0)-1\}f(\bm{X}_{i};\eta)$. Let $C_{i}^{\text{cav}}=-\hat{w}_{i}|\hat{\tau}_{i}^{w}|\,l_{0}(u_{i})$ and

$\xi_{i}=\frac{\partial C_{i}^{\text{cav}}}{\partial u_{i}}=-\hat{w}_{i}|\hat{\tau}_{i}^{w}|\frac{\partial l_{0}(u_{i})}{\partial u_{i}},$   (11)

for $i\in\mathcal{I}_{n}$. Therefore, the convex subproblem at the $(t+1)$th iteration of the d.c. algorithm is

$\underset{\eta}{\text{min}}\;\frac{1}{n}\sum_{i\in\mathcal{I}_{n}}\hat{w}_{i}|\hat{\tau}_{i}^{w}|\,l_{1}(u_{i})+\frac{1}{n}\sum_{i\in\mathcal{I}_{n}}\xi_{i}^{(t)}u_{i},$   (12)

which is a smooth unconstrained optimization problem and can be solved with many efficient algorithms such as L-BFGS (Nocedal, 1980). Algorithm 1 summarizes the proposed procedure for obtaining the minimizer $\hat{\eta}$.

Algorithm 1 The d.c. Algorithm for Finding Linear Decision Rule Parameters
Set a small error tolerance, say $\epsilon=10^{-6}$; initialize $t=0$ and $\xi_{i}^{(0)}=2\hat{w}_{i}|\hat{\tau}_{i}^{w}|$.
repeat
     Obtain $\hat{\eta}$ by solving (12) with L-BFGS;
     Update $u_{i}=\{2\,\mathbf{1}(\hat{\tau}_{i}^{w}>0)-1\}f(\bm{X}_{i};\hat{\eta})$;
     Update $\xi_{i}^{(t+1)}$ by (11) and set $t\leftarrow t+1$.
until $\|\bm{\xi}^{(t)}-\bm{\xi}^{(t-1)}\|_{\infty}\leq\epsilon$
Return η^\hat{\eta}.
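A compact sketch of Algorithm 1 under stated assumptions: Z is the $n\times(p+1)$ design matrix with an intercept column, tau_hat holds the weighted contrast estimates $\hat{\tau}_{i}^{w}$, w_hat the transfer weights; optim's L-BFGS-B (with numerical gradients) stands in for L-BFGS, and l_s is the function from the sketch after (9).

# d.c. algorithm for the linear rule (sketch). Each iteration linearizes the
# concave part of (10) via xi in (11) and solves the convex subproblem (12).
dl_s <- function(u, s) {                  # derivative of l_s in (9)
  ifelse(u >= s, 0, ifelse(u >= s - 1, -2 * (s - u), -2))
}

dc_linear_rule <- function(Z, tau_hat, w_hat, eps = 1e-6, max_iter = 100) {
  z   <- 2 * (tau_hat > 0) - 1            # pseudo-labels in {-1, +1}
  a   <- w_hat * abs(tau_hat)             # classification weights
  eta <- rep(0, ncol(Z))
  xi  <- 2 * a                            # xi^(0): the u -> -Inf limit of (11)
  for (t in seq_len(max_iter)) {
    obj <- function(eta) {                # convex subproblem (12)
      u <- z * drop(Z %*% eta)
      mean(a * l_s(u, 1)) + mean(xi * u)
    }
    eta    <- optim(eta, obj, method = "L-BFGS-B")$par
    u      <- z * drop(Z %*% eta)
    xi_new <- -a * dl_s(u, 0)             # update xi by (11)
    if (max(abs(xi_new - xi)) <= eps) break
    xi <- xi_new
  }
  eta
}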

3.4 Cross-validation

We have discussed various approaches to estimating the transfer weights and the individual contrast function. Although transfer weights can correct the selection bias of the experimental data, they also increase the variance of the estimator compared to the unweighted counterpart. The effect of weighting on the precision of estimators can be quantified by the effective sample size, a notion originating in survey sampling (Kish, 1992). If the true contrast function can be approximated well by linear models, the estimated ITR with transfer weights may not outperform the unweighted one. As a result, for a given application, it is critical to have a data-adaptive procedure that chooses the best method among the aforementioned weighted estimators and decides whether or not to use transfer weights.

Toward this end, we propose a cross-validation procedure. We split the sample into two parts, one for learning the ITRs and the other for evaluating the estimated ITRs. We adopt multiple sample splits to increase robustness, since a single split might yield highly variable results due to randomness. To evaluate an ITR $d_{\eta}$, we estimate its value by the weighted AIPW estimator in (13) with the estimated nonparametric transfer weights $\hat{w}$,

$\widehat{V}(d_{\eta};\hat{w},\mathcal{I}_{n})=\sum_{i\in\mathcal{I}_{n}}\hat{w}_{i}\left(\left[\frac{A_{i}d(\bm{X}_{i};\eta)}{\hat{\pi}_{A}^{w}(\bm{X}_{i})}+\frac{(1-A_{i})\{1-d(\bm{X}_{i};\eta)\}}{1-\hat{\pi}_{A}^{w}(\bm{X}_{i})}\right]\left[Y_{i}-\widehat{Q}_{w}\{\bm{X}_{i},d(\bm{X}_{i};\eta)\}\right]+\widehat{Q}_{w}\{\bm{X}_{i},d(\bm{X}_{i};\eta)\}\right).$   (13)

Since we are more interested in comparisons among weighted and unweighted methods than among contrast-function estimation methods, we first fix an estimation method for the contrast function and then use cross-validation to choose among the set of candidate methods $\mathcal{G}$. The set $\mathcal{G}$ contains the proposed learning method with the nonparametric transfer weights described in Section 3.2.2 and the unweighted estimation method. If the target population size $N$ is available, we can also include in $\mathcal{G}$ the weighted estimators with the two parametric approaches introduced in Section 3.2.1. We then compute the value estimates averaged over all sample splits for each learning method in $\mathcal{G}$ and return the method with the highest average value estimate. A more explicit description of the procedure is presented in Algorithm 2.

Algorithm 2 Cross-validation with Multi-Sample Splits
Input The experimental data $\mathcal{D}_{n}=\{(\bm{X}_{i},A_{i},Y_{i})\}_{i\in\mathcal{I}_{n}}$, the RWD $\mathcal{D}_{m}=\{\bm{X}_{i}\}_{i\in\mathcal{I}_{m}}$, the number of sample splits $B$, the set of candidate methods to compare $\mathcal{G}$, and optionally the target population size $N$ (needed if $\mathcal{G}$ contains parametric weighting methods).
for b = 1, …, $B$ do
     Randomly split $\mathcal{D}_{n}$ into two equal-size parts, the experimental training and testing data $\mathcal{D}_{n}^{\text{train}}$ and $\mathcal{D}_{n}^{\text{test}}$; let $\mathcal{I}_{n}^{\text{test}}$ be the index set of $\mathcal{D}_{n}^{\text{test}}$. Similarly, randomly split $\mathcal{D}_{m}$ into two equal-size parts $\mathcal{D}_{m}^{\text{train}}$ and $\mathcal{D}_{m}^{\text{test}}$;
     Estimate ITRs with all methods in $\mathcal{G}$ based on $\mathcal{D}_{n}^{\text{train}}$ and $\mathcal{D}_{m}^{\text{train}}$ (use $N/2$ as the target population size if $\mathcal{G}$ contains parametric weighting methods). Denote the estimated ITRs as $\{\hat{d}_{g}^{\text{train}}\}_{g\in\mathcal{G}}$;
     Obtain the nonparametric transfer weights $\{\widetilde{w}_{i}\}_{i\in\mathcal{I}_{n}^{\text{test}}}$ based on the test data $\mathcal{D}_{n}^{\text{test}}$ and $\mathcal{D}_{m}^{\text{test}}$;
     Estimate the value of each estimated ITR with the weighted AIPW estimator $\widehat{V}(\hat{d}_{g}^{\text{train}};\widetilde{w},\mathcal{I}_{n}^{\text{test}})$, denoted as $\widehat{V}_{g}^{(b)}$, $g\in\mathcal{G}$.
end for
Average the estimated values over the $B$ splits for each estimated ITR: $\overline{V}_{g}=B^{-1}\sum_{b=1}^{B}\widehat{V}_{g}^{(b)}$, $g\in\mathcal{G}$.
Return $\underset{g\in\mathcal{G}}{\arg\max}\,\overline{V}_{g}$.
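A schematic R sketch of Algorithm 2; fit_itr is assumed to be a named list of learner functions, each returning an estimated rule, and value_aipw is a hypothetical implementation of (13). Both names are placeholders, not functions from the paper.

# Multi-split cross-validation (sketch of Algorithm 2).
cv_select <- function(dat_exp, dat_rwd, fit_itr, B = 10) {
  V <- matrix(NA, B, length(fit_itr),
              dimnames = list(NULL, names(fit_itr)))
  for (b in seq_len(B)) {
    tr_e <- sample(nrow(dat_exp), nrow(dat_exp) %/% 2)   # split experiment
    tr_r <- sample(nrow(dat_rwd), nrow(dat_rwd) %/% 2)   # split RWD
    for (g in names(fit_itr)) {
      d_hat <- fit_itr[[g]](dat_exp[tr_e, ], dat_rwd[tr_r, ])
      # value_aipw: hypothetical implementation of (13), with nonparametric
      # transfer weights re-estimated on the test halves
      V[b, g] <- value_aipw(d_hat, dat_exp[-tr_e, ], dat_rwd[-tr_r, ])
    }
  }
  names(which.max(colMeans(V)))   # method with the highest average value
}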

4 Asymptotic Properties

In this section, we provide a theoretical guarantee that the estimated ITR within a certain class $\mathcal{F}$ is consistent for the true population optimal ITR within $\mathcal{F}$. Toward this end, we show that the expected risk of the estimated ITR converges to that of the true optimal ITR within $\mathcal{F}$. In particular, we consider the ITRs learned with the nonparametric transfer weights and the weighted AIPW estimator $\hat{\tau}_{\text{aw}}^{w}$ with the logistic regression model $\pi_{A}(\bm{X};\gamma)$ and the linear $Q$ function. The extensions to other cases with parametric transfer weights and with regression and IPW estimators are straightforward.

We introduce more notation. Let $\theta=(\gamma^{\intercal},\beta^{\intercal})^{\intercal}$ and let the contrast function be

$\tau_{\text{aw}}^{\theta}(\bm{X})=\frac{\mathbf{1}(A=1)\{Y-Q(\bm{X},1;\beta)\}}{\pi_{A}(\bm{X};\gamma)}+Q(\bm{X},1;\beta)-\frac{\mathbf{1}(A=0)\{Y-Q(\bm{X},0;\beta)\}}{1-\pi_{A}(\bm{X};\gamma)}-Q(\bm{X},0;\beta).$   (14)

Denote the general smoothed ramp loss function $l^{\zeta}(u)=l(\zeta_{1}u)/\zeta_{2}$, where $\zeta_{1},\zeta_{2}>0$ and $l$ is defined in (8). It is easy to see that when $\zeta=(\zeta_{1},\zeta_{2})\rightarrow\zeta^{*}\overset{\Delta}{=}(+\infty,2)$, the general smoothed ramp loss converges to the 0-1 loss, i.e., $l^{\zeta}(u)\rightarrow l_{0-1}(u)$ for all $u$. Note that $l^{\zeta}$ can also be written as the difference of two convex functions, $l^{\zeta}(u)=l_{1}(\zeta_{1}u)/\zeta_{2}-l_{0}(\zeta_{1}u)/\zeta_{2}$, so the d.c. algorithm in Algorithm 1 generalizes easily to $l^{\zeta}(u)$. Recalling the loss function $\mathcal{L}(\tau;\eta,l_{0-1})$ and the corresponding $\mathcal{F}$-risk $\mathcal{R}_{\mathcal{F}}(\tau;\eta,l_{0-1})$ defined in (3), we replace $\tau$ with $\tau_{\text{aw}}^{\theta}$ and obtain:

$\mathcal{R}_{\mathcal{F}}(\tau_{\text{aw}}^{\theta};\eta,l_{0-1})=\mathbb{E}\left\{\mathcal{L}(\tau_{\text{aw}}^{\theta};\eta,l_{0-1})\right\},\quad \mathcal{L}(\tau_{\text{aw}}^{\theta};\eta,l_{0-1})=|\tau_{\text{aw}}^{\theta}(\bm{X})|\,l_{0-1}\left([2\,\mathbf{1}\{\tau_{\text{aw}}^{\theta}(\bm{X})>0\}-1]f(\bm{X};\eta)\right),\ \forall\eta\in\mathcal{F},$

denoted as $\mathcal{R}_{\mathcal{F}}(\eta)$ and $\mathcal{L}(\eta,\theta)$, respectively, for simplicity. We define the minimal $\mathcal{F}$-risk as $\mathcal{R}_{\mathcal{F}}^{*}=\inf_{\eta\in\mathcal{F}}\mathcal{R}_{\mathcal{F}}(\eta)$. Similarly, replacing $l_{0-1}$ in $\mathcal{R}_{\mathcal{F}}(\eta)$ and $\mathcal{L}(\eta,\theta)$ with the general smoothed ramp loss $l^{\zeta}$, we obtain $\mathcal{R}_{\mathcal{F}}(\tau_{\text{aw}}^{\theta};\eta,l^{\zeta})$ and $\mathcal{L}(\tau_{\text{aw}}^{\theta};\eta,l^{\zeta})$, denoted as $\mathcal{R}_{\zeta\mathcal{F}}(\eta)$ and $L^{\zeta}(\eta,\theta)$, respectively. Likewise, the minimal $\mathcal{F}$-risk with $l^{\zeta}$ is defined as $\mathcal{R}_{\zeta\mathcal{F}}^{*}=\inf_{\eta\in\mathcal{F}}\mathcal{R}_{\zeta\mathcal{F}}(\eta)$.

The goal is to show that the decision rule is $\mathcal{F}$-consistent (Bartlett et al., 2006), i.e.,

$\lim_{N\rightarrow\infty}\mathcal{R}_{\mathcal{F}}(\hat{\eta})=\mathcal{R}_{\mathcal{F}}^{*},$

where $\hat{\eta}=\arg\min_{\eta\in\mathcal{F}}\sum_{i=1}^{N}S_{i}\hat{w}_{i}L^{\zeta}(\eta,\hat{\theta}_{w})$ and $\hat{\theta}_{w}=(\hat{\gamma}_{w}^{\intercal},\hat{\beta}_{w}^{\intercal})^{\intercal}$. The estimated logistic regression coefficient is $\hat{\gamma}_{w}=\arg\max_{\gamma}\sum_{i=1}^{N}\hat{w}_{i}S_{i}[A_{i}\log\pi_{A}(\bm{X}_{i};\gamma)+(1-A_{i})\log\{1-\pi_{A}(\bm{X}_{i};\gamma)\}]$, which converges in probability to $\arg\max_{\gamma}\mathbb{E}[A\log\pi_{A}(\bm{X};\gamma)+(1-A)\log\{1-\pi_{A}(\bm{X};\gamma)\}]\overset{\Delta}{=}\gamma^{*}$ as $N\rightarrow\infty$ under regularity conditions. The estimated outcome regression coefficient is $\hat{\beta}_{w}=\arg\min_{\beta}\sum_{i=1}^{N}\hat{w}_{i}S_{i}(Y_{i}-\beta^{\intercal}L_{i})^{2}$, where $L_{i}=(1,\bm{X}_{i}^{\intercal},A_{i},A_{i}\bm{X}_{i}^{\intercal})^{\intercal}$; thus we have

$\hat{\beta}_{w}=\left(\sum_{i=1}^{N}N\hat{w}_{i}S_{i}L_{i}L_{i}^{\intercal}\right)^{-1}\sum_{i=1}^{N}N\hat{w}_{i}S_{i}L_{i}Y_{i}\rightarrow\{\mathbb{E}(LL^{\intercal})\}^{-1}\mathbb{E}(LY)\overset{\Delta}{=}\beta^{*},$

in probability as $N\rightarrow\infty$. The convergence of $\hat{\gamma}_{w}$ and $\hat{\beta}_{w}$ follows from the weak law of large numbers and the result $\max_{i}|N\hat{w}_{i}-1/\pi_{S}(\bm{X}_{i})|=o_{p}(1)$ proved in Theorem 2 of Wang and Zubizarreta (2020). Let $\theta^{*}=(\gamma^{*\intercal},\beta^{*\intercal})^{\intercal}$; thus we have

$\hat{\theta}_{w}\xrightarrow{p}\theta^{*}.$   (15)

It is easy to see that $\mathcal{R}_{\zeta\mathcal{F}}(\eta)$ converges to $\mathcal{R}_{\mathcal{F}}(\eta)$ as $\zeta\rightarrow\zeta^{*}$; therefore, to obtain $\mathcal{F}$-consistency, it suffices to show that $\lim_{N\rightarrow\infty}\mathcal{R}_{\zeta\mathcal{F}}(\hat{\eta})=\mathcal{R}_{\zeta\mathcal{F}}^{*}$, which is stated in the theorem below.

Theorem 1.

Under Assumptions S1–S10, we have $\lim_{N\rightarrow\infty}\mathcal{R}_{\zeta\mathcal{F}}(\hat{\eta})=\mathcal{R}_{\zeta\mathcal{F}}^{*}$.

5 Simulation study

We evaluate the finite-sample performance of the proposed estimators through a set of simulation studies. We first generate a target population of size $N=10^{6}$. The covariates are generated by $\bm{X}_{i}=(X_{i1},X_{i2})^{\intercal}\sim N(\mathbf{1}_{2\times1},I_{2\times2})$, and the potential outcomes are generated by $Y_{i}^{*}(a)\mid\bm{X}_{i}=1+2X_{i1}+3X_{i2}+a\tau(\bm{X}_{i})+\epsilon_{i}(a)$, where the $\epsilon_{i}(a)$ are i.i.d. $N(0,0.5^{2})$ for $a=0,1$. We consider three contrast functions: (I) $\tau(\bm{X}_{i})=\arctan\{\exp(1+X_{i1})-3X_{i2}-5\}$; (II) $\tau(\bm{X}_{i})=\cos(1)+\cos(X_{i1})+\cos(X_{i2})-3/2$; (III) $\tau(\bm{X}_{i})=1+2X_{i1}+3X_{i2}$.

To evaluate the necessity and performance of the weighting methods, we consider two categories of data-generating models: one in which the true optimal policies are not linear, i.e., settings (I) and (II), and one in which the true optimal policy is linear, i.e., setting (III). We start with the first two scenarios. From data visualization, we observed that the covariates shift between the experimental data and the RWD to varying extents in the two generative models: the covariates generated by (I) have similar distributions in the experimental data and the RWD, while those generated by (II) have very different ones. We generate the RWD by randomly sampling $m=5000$ subjects from the target population $\{\bm{X}_{i}\}_{i=1}^{N}$, denoted as $\{\bm{X}_{i}\}_{i\in\mathcal{I}_{m}}$. To form the experimental sample, we generate the indicator of selection according to $S_{i}\mid\bm{X}_{i}\sim\text{Bernoulli}\left[\exp(\alpha_{0}+\alpha_{1}^{\intercal}\bm{X}_{i})/\{1+\exp(\alpha_{0}+\alpha_{1}^{\intercal}\bm{X}_{i})\}\right]$, where $\alpha_{1}=(1,-2)^{\intercal}$ and $\alpha_{0}=-8$, which gives an average experimental sample size over the simulation replicates of around 1386. In the experimental data, indexed by $\mathcal{I}_{n}$, treatments are randomly assigned according to $A_{i}\sim\mathrm{Bernoulli}(0.5)$, and the observed outcomes are $Y_{i}=Y_{i}^{*}(A_{i})$, $i\in\mathcal{I}_{n}$. A sketch of this data-generating process is given below.
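For reproducibility, an R sketch of the data-generating process for setting (I); the seed is arbitrary.

# Target population, potential outcomes, RWD sample and experimental sample
# for setting (I), following the description above.
set.seed(2021)                                   # arbitrary seed
N   <- 1e6
X   <- matrix(rnorm(2 * N, mean = 1), ncol = 2)  # X ~ N(1, I)
tau <- atan(exp(1 + X[, 1]) - 3 * X[, 2] - 5)    # contrast (I)
mu  <- 1 + 2 * X[, 1] + 3 * X[, 2]
Y0  <- mu + rnorm(N, sd = 0.5)                   # potential outcomes
Y1  <- mu + tau + rnorm(N, sd = 0.5)

idx_rwd <- sample(N, 5000)                       # RWD: simple random sample
pS      <- plogis(-8 + X[, 1] - 2 * X[, 2])      # sampling score
idx_exp <- which(rbinom(N, 1, pS) == 1)          # experimental selection
A <- rbinom(length(idx_exp), 1, 0.5)             # randomized treatment
Y <- ifelse(A == 1, Y1[idx_exp], Y0[idx_exp])    # observed outcomes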

For each setting, we compare the following estimators:

1. “$w_{1}$”: our proposed weighted estimator with parametric weights learned by the modified maximum likelihood approach introduced in Section 3.2.1.

2. “$w_{2}$”: our proposed weighted estimator with parametric weights learned by the estimating equations introduced in Section 3.2.1.

3. “cv”: our proposed weighted estimator using the cross-validation procedure introduced in Section 3.4; we set the number of sample splits to 10 in all simulation studies.

4. “np”: our proposed weighted estimator with nonparametric weights.

5. “unweight”: using only the experimental data to learn the optimal linear ITRs.

6. “bm”: a benchmark using the weighted estimator but replacing the estimated transfer weights $\hat{w}_{i}$ with the normalized true weights $w_{i}=\{\pi_{S}(\bm{X}_{i})\}^{-1}/\sum_{i\in\mathcal{I}_{n}}\{\pi_{S}(\bm{X}_{i})\}^{-1}$, and the estimated contrast functions $\hat{\tau}_{i}$ with the true contrast values $Y_{i}^{*}(1)-Y_{i}^{*}(0)$.

7. “Imai”: the method proposed in Imai et al. (2013), which constructs the transfer weights by fitting a Bayesian additive regression trees (BART) model using the covariates as predictors and experiment membership as the outcome, and then uses a variant of the SVM to estimate the conditional treatment effect (implemented in the R package FindIt). Given $\bm{X}$, treatment $1$ is assigned if the estimated conditional treatment effect is positive, and $0$ otherwise.

8. “DBN”: the method proposed in Mo et al. (2020), which learns distributionally robust ITRs: value functions are evaluated under all testing covariate distributions that are “close” to the training distribution, the worst case takes the minimum over such distributions, and the ITRs are learned by maximizing the resulting worst-case value function.

For estimators 1–5 above, we estimate the contrast function using $\tau_{\text{aw}}$ with linear $Q$ functions and a constant propensity score estimated as the proportion of $A=1$ in the experimental data, $n^{-1}\sum_{i\in\mathcal{I}_{n}}A_{i}$.

To study the impact of misspecification of $\pi_{S}(\bm{X})$, we consider estimating $\alpha$ by fitting a wrong logistic regression: $\pi_{S}(\bm{X};\alpha)=\exp(\alpha_{0}+\alpha_{1}^{\intercal}\widetilde{\bm{X}})/\{1+\exp(\alpha_{0}+\alpha_{1}^{\intercal}\widetilde{\bm{X}})\}$, where $\widetilde{\bm{X}}=(X_{1}^{2},X_{2}^{2})^{\intercal}$. As described above, only estimators 1–3 require specifying the sampling model. To measure the performance of the different approaches, we use the mean value MSE, defined as $N^{-1}\sum_{i=1}^{N}\left(Y_{i}^{*}\{d_{\eta}(\bm{X}_{i})\}-Y_{i}^{*}[\mathbf{1}\{\tau(\bm{X}_{i})>0\}]\right)^{2}$ and computed as in the sketch below; the lower the MSE, the better the approach performs.
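Continuing the simulated objects from the sketch above (Y0, Y1, tau), the criterion is a direct transcription of the mean value MSE.

# Mean value MSE of a rule d (a {0, 1} vector over the target population),
# comparing outcomes under d with outcomes under the true optimal rule.
value_mse <- function(d, Y0, Y1, tau) {
  y_d   <- ifelse(d == 1, Y1, Y0)       # potential outcome under the rule
  y_opt <- ifelse(tau > 0, Y1, Y0)      # outcome under the true optimum
  mean((y_d - y_opt)^2)
}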

The results are summarized in Figure 1; the dark red point in each boxplot is the mean value MSE averaged over 200 Monte Carlo replicates. For the generative model (I) shown in Figure 1, the unweighted and DBN rules perform worst in all cases. The second parametric weighting method is relatively robust compared with the first when the sampling model is misspecified; this is reasonable since the first uses maximum likelihood estimation, which relies more heavily on the correctness of the parametric model specification. When the training and testing data are similarly distributed, the performances of the parametric weighting, nonparametric weighting, and cross-validation procedures are similar; in this case, sampling model misspecification does not have much influence on rule learning.

Figure 1: Value MSE boxplots of different methods for generative models (I) and (II). The red point in each box is the average of the value MSE.

For the generative model (II) shown in Figure 1, the weighted estimators outperform the unweighted estimators when the sampling model is correctly specified; the nonparametric weighted estimator outperforms the other estimators, and the performance of the parametric weighted estimators deteriorates dramatically when the sampling model is misspecified. These findings make sense: the more different the training and testing data are, the more crucial the bias correction is, and hence sampling model misspecification can have a large influence on the learned rules.

From both generative models, we see that the nonparametric weighting and cross-validation procedures are more robust when the sampling model is unknown and are recommended for practical use. The DBN estimator performs poorly in terms of MSE magnitude and variability; one possible reason is that the tuning parameters measuring the “closeness” of two distributions need to be chosen more carefully in practice (Mo et al., 2020). The Imai method has relatively good performance among all estimators and relatively small variability.

For setting (III), we consider the correctly specified sampling score model. As shown in Figure 2, the unweighted method performs best, close to the benchmark. This demonstrates that when the true optimal policy is linear, learning with only the experimental data performs best; weighting is not a necessity and even inflates the variance of the estimators. The other competitive estimator is the one using cross-validation. Therefore, when we have no knowledge of the true data-generating model, using cross-validation is a good choice in practice.

Figure 2: Value MSE boxplots of different methods for the generative model (III).

6 Real Data Application

We apply the proposed methods to personalized recommendations of a job training program based on post-market earnings. Our analysis is based on two data sources (downloaded from https://users.nber.org/~rdehejia/data/.nswdata2.html): an experimental dataset from the National Supported Work (NSW) program (LaLonde, 1986) and a non-experimental dataset from the Current Population Survey (CPS; LaLonde, 1986). The NSW dataset includes 185 individuals who received the job training program ($A=1$) and 260 individuals who did not ($A=0$), and the CPS dataset includes 15992 individuals without the job training program. Let $\bm{X}$ be a vector including an intercept $1$ and the pre-intervention covariates: age, education, Black (1 if Black, 0 otherwise), Hispanic (1 if Hispanic, 0 otherwise), married (1 if married, 0 otherwise), no high school degree (1 if no degree, 0 otherwise), earnings in 1974, and earnings in 1975. The outcome variable is the earnings in 1978. We transform the earnings in 1974, 1975, and 1978 by taking the logarithm of the earnings plus one, denoted as “log.Re74”, “log.Re75” and “log.Re78”. We aim to learn a linear rule $d(\bm{X};\eta)=\mathbf{1}(\bm{X}^{\intercal}\eta>0)$ tailoring the recommendation of the job training program to an individual with characteristics $\bm{X}$.

The CPS sample is generally representative of the target population, while the participants in the NSW program may not represent the general population. Table 1 contrasts the sample averages of each pre-intervention covariate in the NSW and CPS samples and shows that the covariates are highly imbalanced between the two samples. On average, the NSW sample is younger and less educated, includes more Black individuals, fewer married individuals, and more individuals without a high school degree, and has lower earnings in 1974 and 1975. Because of this imbalance, ITRs for the target population cannot be learned from the NSW sample alone, which motivates the proposed weighted transfer learning approaches to learn optimal linear ITRs that benefit the target population most.

Table 1: Sample averages of pre-intervention variables in the NSW and CPS samples and results from the unweighted and weighted transfer learning methods: the first two rows are sample averages of each pre-intervention covariate in the NSW and CPS samples, respectively; the last two rows are the estimated linear coefficients of each corresponding variable in the linear decision rule from the unweighted and weighted transfer learning methods, respectively.
                                                  intercept   Age     Educ    Black   Hisp    Married  Nodegr   log.Re74  log.Re75
Experimental (NSW)                                1           25.37   10.20   0.83    0.09    0.17     0.78     2.27      2.70
Non-experimental (CPS)                            1           33.23   12.03   0.07    0.07    0.71     0.30     8.23      8.29
                                                  $\eta_{0}$  $\eta_{1}$  $\eta_{2}$  $\eta_{3}$  $\eta_{4}$  $\eta_{5}$  $\eta_{6}$  $\eta_{7}$  $\eta_{8}$
Unweighted learning $\hat{\eta}_{\rm u}$          -1          0.03    0.07    0.81    2.31    1.89     -0.62    -0.60     0.47
Weighted transfer learning $\hat{\eta}_{\rm w}$   -1          0.02    0.05    0.66    0.36    0.14     0.28     -0.27     0.15

We consider the unweighted and weighted transfer learning methods. For the weighted transfer learning method, we use the nonparametric weighting approach, because the parametric approaches require the target population size $N$, which is unknown in this application. To estimate the weighted and unweighted decision rules, we use the AIPW estimator $\hat{\tau}_{\text{aw}}$ with linear $Q$ functions, since it outperforms the other competitors in the simulation study in Section 5. We also implement the cross-validation procedure to choose between the unweighted and weighted transfer learning methods.

Table 1 presents the estimated linear coefficients of each corresponding variable in the linear decision rule from the unweighted and weighted transfer learning methods. Moreover, in the cross-validation procedure, the weighted transfer learning method produces a smaller population risk than the unweighted learning method and is therefore chosen as the final result. Compared with the decision rule learned only from the NSW experimental data, which oversamples low-income individuals, the weighted rule suggests that, in the target population, individuals without high school degrees (i.e., Nodegr = 1) are more likely to have higher outcomes if offered the job training program.

To assess the performance of the learned rules for the target population, following Kallus (2017), we create a test sample that is representative of the target population. We sample 100 individuals at random from the CPS sample to form a control group and then find their closest matches in the NSW treated sample to form the treatment group (using pairmatch in the R package optmatch). The test sample thus contains the 100 matched pairs, mimicking an experimental sample where binary actions are randomly assigned with equal probabilities. We use a random forest on the test data to estimate the $Q$ function $\mathbb{E}(Y\mid\bm{X},a)$, denoting the estimator as $\widehat{Q}_{\text{test}}(\cdot,a)$, $a=0,1$; we then use $\sum_{i\in\text{CPS}}\widehat{Q}_{\text{test}}\{\bm{X}_{i},d(\bm{X}_{i};\eta)\}/|\text{CPS}|$ to estimate the value of an ITR $d(\cdot;\eta)$, where $|\text{CPS}|$ is the CPS sample size. Besides the weighted linear rule $d(\cdot;\hat{\eta}_{w})$ and the unweighted linear rule $d(\cdot;\hat{\eta}_{u})$ learned above, we also compare the Imai method of Imai et al. (2013): first estimate the conditional probability of being selected into the NSW given $\bm{X}$ from the combined NSW and CPS sample using a BART model, and then estimate the conditional treatment effect weighted by the inverse of the sampling probability. The Imai rule recommends the job training program if the conditional treatment effect is positive and does not recommend it otherwise. We replicate this test-data sampling procedure 100 times, estimate the values of the three ITRs for each replicate, and obtain the average estimated values (standard errors) of 8.395 (0.009), 6.106 (0.011), and 8.049 (0.006) for $d(\cdot;\hat{\eta}_{w})$, $d(\cdot;\hat{\eta}_{u})$, and Imai, respectively; the transfer weighted rule thus has the highest value estimate. Although this evaluation method requires the assumption of no unmeasured confounding for the non-experimental CPS data, which is not needed in our learning procedure, it still gives us some confidence that the weighted linear rule generalizes better to the target population.

7 Discussion

We develop a general framework to learn optimal interpretable ITRs for the target population by leveraging the lack of unmeasured confounding in the experimental data and the representativeness of the covariates in the RWD. The proposed procedure is easy to implement. By constructing transfer weights to calibrate the covariate distribution of the experimental data, we can learn optimal interpretable ITRs for the target population. In addition, we provide a data-adaptive procedure to choose among different weighting methods via a modified cross-validation. Moreover, we establish a theoretical guarantee for the risk consistency of the proposed estimator of the ITRs.

Several future directions are worth considering. One is to quantify the uncertainty of the estimated ITRs and thereby conduct statistical inference. Moreover, in some scenarios, besides the covariate information, comparable treatment and outcome information are also available in the RWD. Unlike in the randomized experiment, the treatment assignment in the RWD is determined by the preferences of physicians and patients. In this case, an interesting topic is to develop integrative methods that can leverage the strengths of both data sources for efficient estimation of the ITRs.

References

  • Bartlett et al. (2006) Bartlett, P. L., Jordan, M. I. and McAuliffe, J. D. (2006). Convexity, classification, and risk bounds, Journal of the American Statistical Association 101(473): 138–156.
  • Chakraborty (2013) Chakraborty, B. (2013). Statistical methods for dynamic treatment regimes, Springer.
  • Cole and Stuart (2010) Cole, S. R. and Stuart, E. A. (2010). Generalizing evidence from randomized clinical trials to target populations: the ACTG 320 trial, American Journal of Epidemiology 172(1): 107–115.
  • Collobert et al. (2006) Collobert, R., Sinz, F., Weston, J. and Bottou, L. (2006). Trading convexity for scalability, Proceedings of the 23rd international conference on Machine learning, pp. 201–208.
  • Greenland (1990) Greenland, S. (1990). Randomization, statistics, and causal inference, Epidemiology pp. 421–429.
  • Hainmueller (2012) Hainmueller, J. (2012). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies, Political Analysis 20(1): 25–46.
  • Hartman et al. (2015) Hartman, E., Grieve, R., Ramsahai, R. and Sekhon, J. S. (2015). From sample average treatment effect to population average treatment effect on the treated: combining experimental with observational studies to estimate population treatment effects, Journal of the Royal Statistical Society: Series A (Statistics in Society) 178(3): 757–778.
  • Horvitz and Thompson (1952) Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe, Journal of the American statistical Association 47(260): 663–685.
  • Imai et al. (2013) Imai, K. and Ratkovic, M. (2013). Estimating treatment effect heterogeneity in randomized program evaluation, The Annals of Applied Statistics 7(1): 443–470.
  • Kallus (2017) Kallus, N. (2017). Recursive partitioning for personalization using observational data, Proceedings of the 34th International Conference on Machine Learning, Volume 70, JMLR.org, pp. 1789–1798.
  • Kang et al. (2014) Kang, C., Janes, H. and Huang, Y. (2014). Combining biomarkers to optimize patient treatment recommendations, Biometrics 70(3): 695–707.
  • Kish (1992) Kish, L. (1992). Weighting for unequal Pi, Journal of Official Statistics 8(2): 183–200.
  • Kouw and Loog (2018) Kouw, W. M. and Loog, M. (2018). An introduction to domain adaptation and transfer learning, arXiv preprint arXiv:1812.11806 .
  • LaLonde (1986) LaLonde, R. J. (1986). Evaluating the econometric evaluations of training programs with experimental data, The American economic review pp. 604–620.
  • Le Thi Hoai and Tao (1997) Le Thi Hoai, A. and Tao, P. D. (1997). Solving a class of linearly constrained indefinite quadratic problems by dc algorithms, Journal of global optimization 11(3): 253–285.
  • Mo et al. (2020) Mo, W., Qi, Z. and Liu, Y. (2020). Learning optimal distributionally robust individualized treatment rules, Journal of the American Statistical Association pp. 1–16.
  • Moodie et al. (2014) Moodie, E. E., Dean, N. and Sun, Y. R. (2014). Q-learning: Flexible learning about useful utilities, Statistics in Biosciences 6(2): 223–243.
  • Murphy (2005) Murphy, S. A. (2005). A generalization error for q-learning, Journal of Machine Learning Research 6(Jul): 1073–1097.
  • Nocedal (1980) Nocedal, J. (1980). Updating quasi-newton matrices with limited storage, Mathematics of computation 35(151): 773–782.
  • Qian and Murphy (2011) Qian, M. and Murphy, S. A. (2011). Performance guarantees for individualized treatment rules, Annals of statistics 39(2): 1180.
  • Rothwell (2005) Rothwell, P. M. (2005). External validity of randomised controlled trials: “To whom do the results of this trial apply?”, The Lancet 365(9453): 82–93.
  • Rubin (1974) Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies., Journal of educational Psychology 66(5): 688.
  • Vapnik (1998) Vapnik, V. N. (1998). Statistical Learning Theory, Wiley-Interscience.
  • Wang and Zubizarreta (2020) Wang, Y. and Zubizarreta, J. R. (2020). Minimal dispersion approximately balancing weights: asymptotic properties and practical considerations, Biometrika 107(1): 93–105.
  • Wu and Liu (2007) Wu, Y. and Liu, Y. (2007). Robust truncated hinge loss support vector machines, Journal of the American Statistical Association 102(479): 974–983.
  • Zhang, Tsiatis, Davidian, Zhang and Laber (2012) Zhang, B., Tsiatis, A. A., Davidian, M., Zhang, M. and Laber, E. (2012). Estimating optimal treatment regimes from a classification perspective, Stat 1(1): 103–114.
  • Zhang, Tsiatis, Laber and Davidian (2012) Zhang, B., Tsiatis, A. A., Laber, E. B. and Davidian, M. (2012). A robust method for estimating optimal treatment regimes, Biometrics 68(4): 1010–1018.
  • Zhao et al. (2019) Zhao, Y.-Q., Zeng, D., Tangen, C. M. and LeBlanc, M. L. (2019). Robustifying trial-derived optimal treatment rules for a target population, Electronic journal of statistics 13(1): 1717.
  • Zhao et al. (2012) Zhao, Y., Zeng, D., Rush, A. J. and Kosorok, M. R. (2012). Estimating individualized treatment rules using outcome weighted learning, Journal of the American Statistical Association 107(499): 1106–1118.
  • Zhou and Kosorok (2017) Zhou, X. and Kosorok, M. R. (2017). Causal nearest neighbor rules for optimal treatment regimes, arXiv preprint arXiv:1711.08451 .
  • Zhou et al. (2017) Zhou, X., Mayer-Hamblett, N., Khan, U. and Kosorok, M. R. (2017). Residual weighted learning for estimating individualized treatment rules, Journal of the American Statistical Association 112(517): 169–187.