Individualized treatment rules
under stochastic treatment cost constraints
Abstract
Estimation and evaluation of individualized treatment rules have been studied extensively, but real-world treatment resource constraints have received limited attention in existing methods. We investigate a setting in which treatment is intervened upon based on covariates to optimize the mean counterfactual outcome under treatment cost constraints when the treatment cost is random. In a particularly interesting special case, an instrumental variable corresponding to encouragement to treatment is intervened upon with constraints on the proportion receiving treatment. For such settings, we first develop a method to estimate optimal individualized treatment rules. We further construct an asymptotically efficient plug-in estimator of the corresponding average treatment effect relative to a given reference rule.
1 Introduction
The effect of a treatment often varies across subgroups of the population [38, 52]. When such differences are clinically meaningful, it may be beneficial to assign treatments strategically depending on subgroup membership. Such treatment assignment mechanisms are called individualized treatment rules (ITRs). A treatment rule is commonly evaluated on the basis of the mean counterfactual outcome value it generates — what is often referred to as the treatment rule’s value — and an ITR with an optimal value is called an optimal ITR. There is an extensive literature on estimation of optimal ITRs and their corresponding values using data from randomized trials or observational studies [6, 21, 25, 37, 54].
Most existing approaches for estimating ITRs do not incorporate real-world resource constraints. Without such constraints, an optimal ITR would assign the treatment to members of a subgroup provided there is any benefit for such individuals, even when this benefit is minute. In contrast, under treatment resource limits, it may be more advantageous to reserve treatment for the subgroups with the greatest benefit from treatment. This issue has received attention in recent work. Luedtke and van der Laan developed methods for estimation and evaluation of optimal ITRs with a constraint on the proportion receiving treatment [20]. Qiu et al. instead considered related problems in settings in which instrumental variables (IVs) are available [34]. In one of the settings they considered, the same resource constraint is imposed as in Luedtke and van der Laan [20], but a binary IV is used to identify optimal ITRs even when there may be unmeasured confounders. In another setting, the authors considered interventions on a causal IV, or encouragement status, and developed methods to estimate individualized encouragement (rather than treatment) rules with a constraint on the proportion receiving both encouragement and treatment [33]. They also developed nonparametrically efficient estimators of the average causal effect of optimal rules relative to a prespecified reference rule. Sun et al. considered a setting in which the cost of treatment is random and dependent upon baseline covariates; they developed methods to estimate optimal ITRs under a constraint on the expected additional treatment cost relative to control, though inference on the impact of implementing the optimal ITR in the population was not studied [41]. Sun [42] considered a related problem involving the development of optimal ITRs under resource constraints and established the asymptotic properties of the estimated optimal ITR; that method appears viable when the class of ITRs is restricted by the user a priori.
In this paper, we study estimation and inference for an optimal rule under two different cost constraints. The first is the same as that appearing in Sun et al. [41]. In contrast to earlier work on this setting, we do not constrain the class of ITRs considered, and we provide a means to obtain inference about the optimal ITR. The second constraint we consider places a cap on the total cost under the rule rather than on the incremental cost relative to control. To our knowledge, the latter problem has not previously been considered in the literature. Both of these estimation problems mirror the intervention-on-encouragement setting considered in Qiu et al. [34] but involve different constraints and a more general cost function.
As in Qiu et al. [34], the estimators that we develop are asymptotically efficient within a nonparametric model and enable the construction of asymptotically valid confidence intervals for the impact of implementing the optimal rule. We develop our estimators using similar tools — such as semiparametric efficiency theory [31, 51] and targeted minimum loss-based estimation (TMLE) [45, 49] — as were used to tackle the related problem studied in Qiu et al. [34]. Consequently, our proposed estimators are similar to those in Qiu et al. [34], and we streamline the presentation by highlighting the key similarities and focusing on the differences between these related problems and estimation schemes.
The rest of this paper is organized as follows. In Section 2, we describe the problem setup, introduce notation, and present the causal estimands along with basic causal conditions. In Section 3, we present additional causal conditions and the corresponding nonparametric identification results. In Section 4, we present our proposed estimators and their theoretical properties. In Section 5, we present a simulation illustrating the performance of our proposed estimators. We make concluding remarks in Section 6. Proofs, technical conditions, and additional simulation results can be found in the Supplementary Material.
2 Setup and objectives
To facilitate comparisons with Qiu et al. [34], we adopt similar notation as in that work. Suppose that we observe independent and identically distributed data units , where is an unknown sampling distribution. A prototypical data unit consists of the quadruplet , where is the vector of baseline covariates, is the treatment status, is the random treatment cost, and is the outcome of interest. As a convention, we assume that larger values of are preferable. We use to denote a fixed transformation of upon which we allow treatment decisions to depend. For example, may be a subset of covariates in or a summary of (e.g., BMI as a summary of height and weight). In practice, may be chosen based on prior knowledge on potential modifiers of the treatment effect as well as the cost of measuring various covariates. We distinguish between and because of their different roles. On the one hand, we will assume that the full covariate contains all confounders and thus is used to identify causal effects, while might not be sufficient for this purpose. On the other hand, some covariates in may be expensive or difficult to measure in future applications, and thus implementing an optimal ITR based on a subset of covariate may be desirable. In the rest of this paper, we will use the shorthand notation , and to refer to , and , respectively. We define an individualized (stochastic) treatment rule (ITR) to be a function that prescribes treatment with probability according to an exogenous source of randomness for an individual with covariate value . Any stochastic ITR that only takes values in is referred to as a deterministic ITR.
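To make the definition concrete, the snippet below shows how a stochastic ITR is applied: the rule maps the decision covariate to a treatment probability, and an exogenous source of randomness realizes the treatment assignment. The rule used here is an invented example for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)   # exogenous source of randomness

def example_rule(v):
    """An invented stochastic ITR: treatment probability increasing in v."""
    return 1.0 / (1.0 + np.exp(-v))

v = rng.normal(size=5)    # decision covariates for five individuals
d = example_rule(v)       # prescribed treatment probabilities, each in [0, 1]
a = rng.binomial(1, d)    # exogenous randomization realizes the treatments
# A deterministic ITR is the special case in which d(v) is always 0 or 1.
```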
In this work, we adopt the potential outcomes framework [27, 39]. For each individual, we use and to denote the potential treatment cost and potential outcome, respectively, corresponding to scenarios in which the individual has treatment status . We use to denote an expectation over the counterfactual observations and the exogenous random mechanism defining a rule, and to denote an expectation over observables alone under sampling from . We make the usual Stable Unit Treatment Value Assumption (SUTVA).
Condition A1 (Stable Unit Treatment Value Assumption).
The counterfactual data unit of one individual is unaffected by the treatment assigned to other individuals, and there is only a single version of the treatment, so that implies that and .
Remark 1.
The ITRs we consider are not truly individualized, because they are based on the value of covariate rather than each individual’s unique potential treatment effects and . Nevertheless, depending on the resolution of , these ITRs can be considerably more individualized than assigning everyone to either treatment or control. In this paper, we adopt the conventional nomenclature and refer to the treatment rules we study as ITRs [see, e.g., 5, 7, 14, 18, 19, 29, 32, 40, 47, 54, 55, 57].
We define and to be the counterfactual treatment cost and outcome, respectively, for an ITR under an exogenous random mechanism. We note that if for an individual with covariate , an exogenous random mechanism is used to randomly assign treatment with probability and thus and are random for this given individual. If were implemented in the population, then the population mean outcome would be , where we use to denote expectation under the true data-generating mechanism involving potential outcomes and exogenous randomness in . We consider a generic treatment resource constraint requiring that a convex combination of the population average treatment cost and the population average additional treatment cost compared to control be no greater than a specified constant . Consequently, an optimal ITR under this constraint is a solution in to
maximize | (1) |
Here, is also a constant specified by the investigator. Natural choices of are , corresponding to a constraint on the population average additional treatment cost compared to control, and , corresponding to a constraint on the population average treatment cost. The first choice may be preferred when the control treatment corresponds to the current standard of care and a limited budget is available to fund the novel treatment to some patients. The second choice may be more relevant when both treatment and control incur treatment costs.
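Because the display above did not survive extraction, the following is a sketch of the constrained problem it describes, written in assumed notation that is ours rather than the paper's: $d$ denotes the ITR, $Y(d)$ and $C(d)$ the counterfactual outcome and treatment cost under $d$, $C(0)$ the cost under control, $\alpha \in [0,1]$ the investigator-specified mixing weight, and $\kappa$ the budget.

```latex
% Sketch of problem (1); all symbols are assumptions, not the paper's notation.
\begin{equation*}
  \underset{d}{\text{maximize}} \quad \mathbb{E}^{*}\!\left[Y(d)\right]
  \qquad \text{subject to} \qquad
  \alpha\, \mathbb{E}^{*}\!\left[C(d)\right]
  + (1-\alpha)\, \mathbb{E}^{*}\!\left[C(d) - C(0)\right] \;\le\; \kappa .
\end{equation*}
```

In this parametrization, the mixing weight at one extreme yields a cap on the population average treatment cost and at the other a cap on the average additional cost relative to control, matching the two natural choices just described.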
Remark 2.
Our setup is similar to that in Qiu et al. [34] if we view and defined here as the instrumental variable/encouragement and treatment status defined in those prior works, respectively. However, the constraint in our setup is different from the constraint considered previously. In IV settings, the constraint in (1) with is useful when assigning treatment always incurs a cost, regardless of whether encouragement is applied, such as in distributing a limited supply of an expensive drug within a health system based on the results of a randomized clinical trial. It is instead useful with when no encouragement is present under the standard of care but intervention on the encouragement is of interest when additional treatment resources are available. The constraint considered in Qiu et al. [34, 33] was instead useful in cases in which treatment only incurs a cost when paired with encouragement, such as when housing vouchers are used to encourage individuals to live in a certain area. In the general setting in which is viewed as treatment status and as a random treatment cost, the constraint in (1) with is identical to that considered in Sun et al. [41] — we refer the readers to these works for a more in-depth discussion of the relation between the current problem setup and IV settings.
To evaluate an optimal ITR , we follow Qiu et al. [34] in considering three types of reference ITRs and develop methods for statistical inference on the difference in the mean counterfactual outcome between and a reference ITR . The first type of reference ITR considered, denoted by (=fixed rule), is any fixed ITR that may be specified by the investigator before the study. When , it is usually most reasonable to consider the rule that always assigns control, namely , because the constraint in (1) may arise due to limited funding for implementing treatment whereas the standard of care rule is to always assign control. The second type, denoted by (=random), prescribes treatment completely at random to individuals regardless of their baseline covariates. The probability of prescribing treatment is chosen such that the treatment resource is saturated (i.e., all available resources are used) or all individuals receive treatment, if such a probability exists. Symbolically, this ITR is given by under the condition that and . Although has the same interpretation as the corresponding encouragement rule in Qiu et al. [34], its mathematical expression is different due to the different resource constraint. This rule may be of interest if it is known a priori that treatment is harmless. The third type, denoted by (=true propensity), prescribes treatment according to the true propensity of the treatment implied by the study sampling mechanism , so that equals . This ITR may be of interest in two settings. In one setting, satisfies the treatment resource constraint. The investigator may wish to determine the extent to which the implementation of an optimal ITR would improve upon the standard of care. In the other setting, the treatment resource constraint is newly introduced and the standard of care ITR may lead to overuse of treatment resources. The investigator may then be interested in whether the implementation of an optimal constrained ITR would result, despite the new resource constraint, in a noninferior mean outcome.
3 Identification of causal estimands
In this section, we present nonparametric identification results. Though these results are similar to those for individualized encouragement rules in Qiu et al. [34], there are two key differences. First, the form of some of the conditions in Qiu et al. [34] need to be modified to account for the novel resource constraint considered here. Second, two additional conditions are needed to overcome challenges that arise due to this new constraint.
We first introduce notation that will be useful when presenting our identification results and our proposed estimators. For any observed-data distribution , we define pointwise the conditional mean functions and , where we use to denote an expectation over observables alone under sampling from , and their corresponding contrasts due to different treatment status, and . We also define the average of these contrasts conditional on as and , and the propensity to receive treatment . Additionally, we define , . These quantities play an important role in tackling the problem at hand. Throughout the paper, for ease of notation, if is a quantity or operation indexed by distribution , we may denote by . As an example, we may use to denote .
We now introduce additional causal conditions we will require: positivity and unconfoundedness. In one form or another, these conditions commonly appear in the causal inference literature [49], including in the IV literature [1, 15, 43, 53].
Condition A2 (Strong positivity).
There exists a constant such that holds for -almost every .
Condition A3 (Unconfoundedness of treatment).
For each , and are conditionally independent given for -almost every .
Equipped with these conditions, we are able to state a theorem on the nonparametric identification of the mean counterfactual outcomes and average treatment effect (ATE) — these results can be viewed as a corollary of the well-known G-formula [36].
Theorem 1 (Identification of ATE and expected treatment resource expenditure).
In view of Theorem 1, the objective function in (1) can be identified as
and, similarly, the expected cost is identified as . It follows that the optimization problem (1) is equivalent to
maximize | (2) |
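For concreteness, here is a sketch of the identified quantities entering (2), again in assumed notation: $\mu_Y(a,w) = E_P[Y \mid A=a, W=w]$ and $\mu_C(a,w) = E_P[C \mid A=a, W=w]$ denote the outcome and cost regressions, and $d(V)$ the treatment probability assigned by the rule; the paper's own symbols may differ.

```latex
% G-formula identification sketch (assumed notation; see lead-in).
\begin{align*}
  \mathbb{E}^{*}\!\left[Y(d)\right]
    &= \mathbb{E}_{P}\!\left[ d(V)\,\mu_{Y}(1,W) + \{1 - d(V)\}\,\mu_{Y}(0,W) \right], \\
  \mathbb{E}^{*}\!\left[C(d)\right]
    &= \mathbb{E}_{P}\!\left[ d(V)\,\mu_{C}(1,W) + \{1 - d(V)\}\,\mu_{C}(0,W) \right].
\end{align*}
```

Problem (2) then maximizes the first expression over rules $d$ subject to the identified analog of the cost constraint, with every term now a functional of the observed-data distribution $P$.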
This differs from Equation 3 defining optimal individualized encouragement rules in Qiu et al. [34]. We now present two additional conditions so that (2) is a fractional knapsack problem [9], thereby allowing us to use existing results from the optimization literature. These conditions are similar to those in Sun et al. [41].
Condition A4 (Strictly costlier treatment).
There exists a constant such that holds for -almost every .
Condition A5 (Financial feasibility of assigning treatment).
The inequality holds.
Condition A4 is reasonable if treatment is more expensive than control. When applied to an IV setting as outlined in Remark 2, this condition corresponds to the assumption that the IV is indeed an encouragement to take treatment. This condition is slightly stronger than its counterpart in Sun et al. [41], which only requires that . This stronger condition is needed to ensure the asymptotic linearity of our proposed estimator in Section 4. Under Condition A4, it is evident that Condition A5 is reasonable because if , then no ITR satisfies the treatment resource constraint in view of the fact that , whereas if , then only the trivial ITR satisfies the constraint and there is no need to estimate an optimal ITR.
Under these two additional conditions, (2) is a fractional knapsack problem [9] in which every subgroup defined by a different value of corresponds to a different ‘item’. A solution in the special case in which and was given in Theorem 1 of Sun et al. [41]. We now state a more general result with the following differences: (i) the treatment decision may be based on a summary rather than the entire covariate vector , and (ii) may take any value in rather than only zero. We also explicitly state the randomization probability at the boundary for completeness and clarity. Despite these differences, the result we obtain is similar to Theorem 1 in Sun et al. [41]. Define pointwise , and write and .
Theorem 2 (Optimal ITR).
We note that the reference ITRs introduced in Section 2 are also identified under the above conditions. In particular, it can be shown that and .
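To convey the structure of the solution characterized by Theorem 2, the sketch below solves a discretized fractional knapsack problem greedily: subgroups are treated in decreasing order of estimated benefit-to-cost ratio until the budget is exhausted, with randomization in the boundary subgroup. The function and argument names are illustrative, subgroups are given equal mass, and the mixing-weight and truncation-at-zero details of the theorem are omitted.

```python
import numpy as np

def greedy_fractional_rule(benefit, cost, budget):
    """Greedy solution of a fractional knapsack problem.

    benefit[i]: estimated gain from treating subgroup i (a contrast of
        conditional mean outcomes); cost[i]: expected resource use when
        treating subgroup i; budget: total resource available.
    Returns an array of treatment probabilities in [0, 1], one per subgroup.
    Illustrative only: subgroups have equal mass, and the mixing-weight and
    truncation-at-zero details of Theorem 2 are omitted.
    """
    benefit, cost = np.asarray(benefit, float), np.asarray(cost, float)
    d = np.zeros_like(benefit)
    order = np.argsort(-benefit / cost)   # highest benefit-to-cost ratio first
    remaining = budget
    for i in order:
        if cost[i] <= remaining:          # budget covers this subgroup fully
            d[i], remaining = 1.0, remaining - cost[i]
        else:                             # boundary subgroup: randomize
            d[i] = remaining / cost[i]
            break
    return d

# Toy usage: three subgroups, budget covering roughly half of the total cost.
print(greedy_fractional_rule(benefit=[0.3, 0.1, 0.2], cost=[1.0, 1.0, 2.0], budget=2.0))
```

The threshold implicit in this greedy pass is the boundary subgroup's benefit-to-cost ratio, which is the quantity that the estimation procedure in Section 4 calibrates.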
4 Estimating and evaluating optimal individualized treatment rules
In this section, we present an estimator of an optimal ITR and an inferential procedure for its ATE relative to a reference ITR , where is any of , or . The proposed procedure is an adaptation of the method first proposed in Qiu et al. [34, 33].
We begin by introducing some notation that is useful for defining the estimands. We define the parameter or for each ITR and distribution , depending on whether the domain of is or . Here, we consider the model to be locally nonparametric at [31]. For , the ATE of an optimal ITR relative to a reference ITR equals . We are interested in making inference about , where we have suppressed dependence on from our shorthand notation.
4.1 Pathwise differentiability of the ATE
We first present a result regarding the pathwise differentiability of the ATE. Pathwise differentiability of the parameter of interest serves as the foundation for constructing asymptotically efficient estimators of this parameter, based on which an inferential procedure may be developed. Additional technical conditions are required and are provided in Section S1 in the Supplementary Material. For a distribution , a function , an ITR , and a decision threshold , we define pointwise the following functions:
(3) |
One key condition we rely on is the following non-exceptional law assumption.
Condition B1 (Non-exceptional law).
.
Under this condition, the true optimal ITR is identical to an indicator function. If all covariates are discrete, then we can plug the empirical estimates into the identification formulae in Theorems 1–2 and show that the resulting estimators of the ATE are asymptotically normal by the delta method, even when Condition B1 does not hold. We do not further pursue this simple case in this paper, and thus need to rely on the non-exceptional law assumption, namely Condition B1, to account for continuous covariates. We list additional technical conditions in Supplement S1.
We can now provide a formal result describing the pathwise differentiability of the ATE parameter.
Theorem 3 (Pathwise differentiability of the ATE).
We note that the pathwise differentiability of was established in Theorem 3 of Qiu et al. [34] for . The other results can be proven using similar techniques. The proofs of these results are provided in Supplement S4.2. In view of Theorem 3, it follows that the ATE parameter is pathwise differentiable at with nonparametric canonical gradient
(4) |
for .
Remark 3.
Similar additional terms related to the resource being used appear in the canonical gradients of the mean counterfactual outcome or ATE of optimal ITRs under resource constraints, for example, in Luedtke and van der Laan [20] and Qiu et al. [34]. In our problem, this additional term is
Such terms appear to come from solving a fractional knapsack problem with truncation at zero and take the form of a product of (i) the threshold in the solution, and (ii) a term that equals the influence function of the resource being used under the solution when the resource is saturated. We conjecture that such structures generally exist for fractional knapsack problems.
4.2 Proposed estimator and asymptotic linearity
We next present our proposed nonparametric procedure for estimating an optimal ITR and the corresponding ATE . We will generally use subscript to denote an estimator with sample size , and add a hat to a nuisance function estimator that is targeted toward estimating .
1. Use the empirical distribution of as an estimate of the true marginal distribution of . Compute estimates , , , and of , , , and , respectively, using flexible regression methods. Recall that , , , and . Define pointwise .
2. Estimate an optimal ITR:
   (a) Estimate with a one-step correction estimator
   (b) Let , and . For any , define , and
       The rule is the sample analog of an ITR that prescribes treatment to those with the highest values of , regardless of whether treatment is harmful or not, until treatment resources run out.
   (c) Compute , which is used to define an estimate of for which the plug-in estimator is asymptotically linear under conditions, as follows:
       - if and there is a solution in to
         (5) then take to be this solution;
       - otherwise, set .
   (d) Estimate using the sample analog of with treatment resource constraint , namely
3. Obtain an estimate of the reference ITR as follows:
   - For , take to be .
   - For ,
     (a) obtain a targeted estimate of : run an ordinary least-squares linear regression with outcome , covariate , offset and no intercept. Take to be the fitted mean model;
     (b) take to be the constant function , where we define pointwise .
   - For , take to be .
4. Estimate the ATE of relative to the reference ITR with a targeted minimum loss-based estimator (TMLE) :
   (a) obtain a targeted estimate of : run an ordinary least-squares linear regression with outcome , covariate , offset and no intercept. Take to be the fitted mean function.
   (b) with being any distribution with components and , take
       where is defined as or depending on the covariate used by the reference ITR.
The above procedure is similar to that proposed in Qiu et al. [34]. One key difference is the use of the refined estimator of obtained via the estimating equation (5), which is essential for ensuring the asymptotic linearity of . Another difference is that the denominator of is now , consistent with our different definition of the unit value for solving the fractional knapsack problem (2). As with TMLE for other problems, when or has known bounds (e.g., the closed interval ), we may use logistic regression rather than ordinary least-squares to obtain a corresponding targeted estimate that respects the known bounds [12].
The above procedure has both similarities and substantial differences compared to the estimation procedure proposed by Sun et al. [41]. The main difference is that our procedure is targeted towards efficient estimation of, and inference about, the ATE of the optimal ITR under a nonparametric model, while Sun et al. [41] focus on estimating the optimal ITR and do not evaluate it. This leads to a key difference between the two procedures when estimating the optimal ITR: we need to solve an estimating equation (5), which is crucial to ensuring that the estimator is asymptotically linear, while Sun et al. [41] do not. The requirement of solving (5) is related to the nature of the fractional knapsack problem discussed in Remark 3, and we conjecture that such a calibration of the resource used is necessary for general problems of this nature. Our procedure is also related to the method in Sun [42], which relies on the availability of asymptotically normal estimators of both the average benefit and the average resource used (Assumption 2.4), a nontrivial requirement when the propensity score is unknown in observational studies. Our procedure essentially produces such estimators: in Step 4, an asymptotically normal estimator of the ATE is constructed, whereas an asymptotically normal estimator of the expected resource is produced in Step 2 and used to calibrate the resource expenditure of the estimated optimal ITR in Step 2(c).
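To illustrate the calibration step in Step 2(c), here is a minimal bisection sketch for an empirical resource-saturation equation: it searches for a threshold at which the average spending of the thresholded rule matches the budget. The function and argument names are placeholders, and the actual estimating equation (5) contains correction terms and special cases not reproduced here.

```python
import numpy as np

def calibrate_threshold(ratio, inc_cost, budget, lo=-10.0, hi=10.0, iters=60):
    """Bisection for a threshold tau at which the empirical resource use of
    the rule 'treat if ratio > tau' matches the available budget.

    ratio[i]: estimated benefit-to-cost ratio for unit i; inc_cost[i]:
    estimated (incremental) cost of treating unit i; budget: per-capita
    resource limit.  Placeholder only: the paper's estimating equation (5)
    contains correction terms and special cases not reproduced here.
    """
    ratio, inc_cost = np.asarray(ratio, float), np.asarray(inc_cost, float)

    def excess(tau):  # average spending of the thresholded rule minus the budget
        return float(np.mean(inc_cost * (ratio > tau)) - budget)

    if excess(lo) <= 0:     # treating (almost) everyone already fits the budget
        return lo
    # excess() is nonincreasing in tau; assume hi is large enough that no unit
    # is treated at tau = hi, so a sign change is bracketed by [lo, hi].
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if excess(mid) > 0 else (lo, mid)
    return hi
```

As in Step 2(c), such a root would only be sought when the uncalibrated rule would overspend; otherwise the threshold retains its default value.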
Remark 4.
In Step 1 of the above procedure, we estimate the functions and using a naïve approach based on outcome regression. It is viable to use more advanced techniques, such as the doubly robust methods in van der Laan and Rubin [45], van der Laan and Luedtke [46], Luedtke and van der Laan [22], and Kennedy [17], or R-learning as in Nie and Wager [28]. These methods were developed for conditional average treatment effect estimation and might lead to better estimators of and . It is also possible to develop multiply robust methods to estimate using influence function techniques. Such methods to estimate are beyond the scope of our paper, whose main focus is on inference for the ATE. Our theoretical analysis of the estimator only applies to naïve estimators based on outcome regression, but we expect only minor modifications to be required to study these more advanced estimators once their asymptotic behavior is characterized.
Remark 5.
In Step 2(a), it is also viable to use other efficient estimators of , for example, a targeted minimum loss-based estimator (TMLE). We note that estimating is only one component of estimating the optimal ITR . Methods such as TMLE can be preferable to ensure that the estimator respects known bounds on the estimand. However, in our case, such an improvement in estimating does not necessarily lead to an improvement in the estimation of .
We now present results on the asymptotic linearity and efficiency of our proposed estimator. We state and discuss the technical conditions required by the theorem below in Supplement S1.
Theorem 4 (Asymptotic linearity of ATE estimator).
To conduct inference about , we can directly plug the estimators of nuisance functions into to obtain a consistent estimator of , and then take the sample variance to obtain a consistent estimator of the asymptotic variance . The proof of Theorem 4 can be found in Supplements S4.3 and S4.4.
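As a concrete illustration of this inferential step, the sketch below forms a two-sided Wald confidence interval from estimated influence function values; the argument if_values stands for the estimated canonical gradient evaluated at the observations, an interface assumption rather than the paper's code.

```python
import numpy as np
from scipy import stats

def wald_ci(estimate, if_values, level=0.95):
    """Wald confidence interval for an asymptotically linear estimator.

    estimate: the point estimate (e.g., the TMLE of the ATE);
    if_values: estimated influence function evaluated at each observation.
    The standard error is the sample standard deviation of the influence
    function values divided by the square root of the sample size.
    """
    if_values = np.asarray(if_values, float)
    se = if_values.std(ddof=1) / np.sqrt(if_values.shape[0])
    z = stats.norm.ppf(0.5 + level / 2)
    return estimate - z * se, estimate + z * se, se
```

A one-sided 97.5% lower confidence bound, as examined in the simulation study, is obtained as estimate - stats.norm.ppf(0.975) * se.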
Remark 6.
Remark 7.
We note that, unlike in Qiu et al. [34] where the bound lies in due to the binary nature of treatment status, the methods we propose here do not require knowledge of an upper bound on treatment costs. When such a bound is indeed known (e.g., one), our methods may still be applied as long as all special cases corresponding to or in Section 4 are replaced by being equal to or less than the known bound, respectively.
5 Simulation
5.1 Simulation setting
In this simulation study, we investigate the performance of our proposed estimator of the ATE of an optimal ITR relative to specified reference ITRs. We focus here on the setting . This scenario is more difficult than the case because it requires the estimation of .
We generate data from a model in which the treatment is an IV and the treatment cost and outcome are both binary. This data-generating mechanism satisfies all causal conditions and has an unobserved confounder between treatment cost and outcome. We first generate a trivariate covariate , where , and are mutually independent. We also simulate an unobserved treatment-outcome confounder independently of , and then simulate , , and as follows:
We introduce in the data-generating mechanism to emphasize that we do not require assumptions on the joint distribution of treatment cost and outcome conditional on covariates. We consider all three reference ITRs , where we set . We set , which is an active constraint with and .
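The displayed formulas of the data-generating mechanism did not survive extraction. Purely to illustrate the qualitative structure described above (three mutually independent covariates, an unmeasured confounder shared by the treatment cost and the outcome, and binary treatment, cost, and outcome), here is a hypothetical generator; every coefficient below is invented and is not the mechanism used in the paper.

```python
import numpy as np
from scipy.special import expit

def generate_data(n, seed=None):
    """Hypothetical data generator mirroring the qualitative structure in the
    text: trivariate covariates, an unmeasured confounder U shared by the
    treatment cost C and the outcome Y, and binary A, C, Y.
    All coefficients are invented for illustration."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(n, 3))              # three mutually independent covariates
    U = rng.normal(size=n)                   # unmeasured cost-outcome confounder
    A = rng.binomial(1, expit(0.2 * W[:, 0] - 0.3 * W[:, 1]))              # treatment
    C = rng.binomial(1, expit(-1.0 + 1.5 * A + 0.4 * W[:, 2] + 0.5 * U))   # cost
    Y = rng.binomial(1, expit(0.3 * C + 0.2 * W[:, 0] - 0.5 * U))          # outcome
    return W, A, C, Y
```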
The ITRs we consider are based on all covariates — that is, we take . We estimate the nuisance functions using the Super Learner [48] with library including a logistic regression, generalized additive model with logit link [13], gradient boosting machine [10, 11, 23, 24], support vector machine [2, 8] and neural network [3, 35]. Because none of the nuisance functions follow a logistic regression model, the resulting ensemble learner is not expected to achieve the parametric convergence rate. Since both and are binary, we use logistic regression rather than ordinary least-squares to obtain their corresponding targeted estimates in Section 4.2. We consider sample size , and run 1000 Monte Carlo repetitions for each sample size. We implement the algorithm that incorporates cross-fitting discussed in Remark 6 and described in Section S3 in the Supplementary Material.
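The nuisance regressions were fit with the Super Learner [48]; as a rough Python analogue, the sketch below builds a cross-validated stacked ensemble with scikit-learn whose library loosely mirrors the one listed above (no generalized additive model learner is included, and this is not the authors' implementation).

```python
from sklearn.ensemble import GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

def make_stacked_learner(cv=10):
    """Cross-validated stacked ensemble, loosely analogous to a Super Learner,
    for a binary nuisance regression such as the propensity score."""
    learners = [
        ("logit", LogisticRegression(max_iter=1000)),
        ("gbm", GradientBoostingClassifier()),
        ("svm", SVC(probability=True)),
        ("nnet", MLPClassifier(max_iter=2000)),
    ]
    # The meta-learner combines cross-validated predictions from each learner.
    return StackingClassifier(
        estimators=learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=cv,
        stack_method="predict_proba",
    )

# Example: pi_hat = make_stacked_learner().fit(W, A).predict_proba(W)[:, 1]
```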
To evaluate the performance of our proposed estimator, we investigate the bias and root mean squared error (RMSE) of the estimator. We also investigate the coverage probability and the width of nominal 95% Wald CIs constructed using influence function-based standard error estimates. We further investigate the probability that our confidence lower limit falls below the true ATE, that is, the coverage probability of the 97.5% Wald confidence lower bound.
5.2 Simulation results
Table 1 presents the performance of our proposed estimator in this simulation. For sample sizes 500, 1000 and 4000, the CI coverage of our proposed method is lower than the nominal 95% level. When the sample size is larger (16000), the CI coverage of our proposed method increases to 90–93%. The coverage of the confidence lower bounds is much closer to nominal (97.5%) for all sample sizes considered, though, and is approximately nominal when the sample size is large. For all reference ITRs, the bias and RMSE of our proposed estimator appear to converge to zero faster than and at the same rate as the square root of sample size, respectively. All biases are negative, which is expected in view of Remark 6. All standard errors underestimate the variation of the estimator, with the degree of underestimation decreasing as sample size increases.
Table 1: Performance of the proposed estimator for each reference ITR at sample sizes 500, 1000, 4000 and 16000: 95% Wald CI coverage, 97.5% confidence lower bound coverage, bias, RMSE, and ratio of mean standard error to standard deviation.
Figure 1 presents the width of the Wald CIs scaled by the square root of sample size . Our theory indicates that the CI width should shrink at a root- rate, and our simulation results are consistent with this. There are some outlying cases of extremely wide or narrow CIs. This is expected for small sample sizes because the estimator of in Theorem 4 resembles a sample mean and might not be close to with high probability when sample size is small. In practice, this issue might be slightly mitigated by fine-tuning the involved machine learning algorithms.
Figure 1: Widths of nominal 95% Wald CIs scaled by the square root of sample size.
As indicated in Theorem 4, theoretical guarantees for the validity of the Wald CIs rely on the nuisance function estimators converging to the truth sufficiently quickly. It appears that the undercoverage of our Wald CIs in small samples may owe, in part, to poor estimation of these nuisance functions. To illustrate how our procedure may perform with improved small-sample nuisance function estimators, we conducted two additional simulations: one is identical to those reported earlier in all ways except that the nuisance function estimators , and are taken to be equal to the truth; the other is a simpler scenario with a lower covariate dimension and a parametric model. The results are presented in Section S5 in the Supplementary Material and suggest that our proposed estimator may achieve significantly better performance with improved machine learning estimators of the nuisance functions. This motivates seeking ways to optimize the finite-sample performance of the nuisance function estimators employed in future applications of the proposed method, possibly based on prior subject-matter expertise. The underestimation of standard errors in this simulation also motivates future work exploring whether there are standard error estimators with better finite-sample performance, for example, estimators based on the bootstrap.
6 Conclusion
There is an extensive literature on estimating optimal ITRs and evaluating their performance. Among these works, only a few incorporate treatment resource constraints. In this paper, we build upon Sun et al. [41] and study the problem of estimating optimal ITRs under treatment cost constraints when the treatment cost is random. Using techniques similar to those in Qiu et al. [34], we have proposed novel methods to estimate an optimal ITR and to draw inference about the corresponding average treatment effect relative to a prespecified reference ITR under a locally nonparametric model. Our methods may also be applied to the instrumental variable (IV) settings studied in Qiu et al. [34] when the IV is intervened upon.
References
- Abadie [2003] Abadie, A. (2003). Semiparametric instrumental variable estimation of treatment response models. Journal of Econometrics 113(2), 231–263.
- Bennett and Campbell [2000] Bennett, K. P. and C. Campbell (2000). Support vector machines: hype or hallelujah? SIGKDD Explor. Newsl. 2(2), 1–13.
- Bishop [1995] Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.
- Bolthausen et al. [2002] Bolthausen, E., E. Perkins, and A. van der Vaart (2002). Lectures on Probability Theory and Statistics: Ecole D’Eté de Probabilités de Saint-Flour XXIX-1999, Volume 1781 of Lecture Notes in Mathematics. Berlin, Heidelberg: Springer Science & Business Media.
- Butler et al. [2018] Butler, E. L., E. B. Laber, S. M. Davis, and M. R. Kosorok (2018). Incorporating Patient Preferences into Estimation of Optimal Individualized Treatment Rules. Biometrics 74(1), 18–26.
- Chakraborty and Moodie [2013] Chakraborty, B. and E. E. Moodie (2013). Statistical Methods for Dynamic Treatment Regimes. Statistics for Biology and Health. New York, NY: Springer New York.
- Chen et al. [2018] Chen, J., H. Fu, X. He, M. R. Kosorok, and Y. Liu (2018). Estimating individualized treatment rules for ordinal treatments. Biometrics 74(3), 924–933.
- Cortes and Vapnik [1995] Cortes, C. and V. Vapnik (1995). Support-vector networks. Machine Learning 20(3), 273–297.
- Dantzig [1957] Dantzig, G. B. (1957). Discrete-variable extremum problems. Operations research 5(2), 266–288.
- Friedman [2001] Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine. Technical Report 5.
- Friedman [2002] Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics and Data Analysis 38(4), 367–378.
- Gruber and Van Der Laan [2010] Gruber, S. and M. J. Van Der Laan (2010). A targeted maximum likelihood estimator of a causal effect on a bounded continuous outcome. International Journal of Biostatistics 6(1).
- Hastie and Tibshirani [1990] Hastie, T. and R. Tibshirani (1990). Generalized additive models. Chapman and Hall.
- Imai and Li [2021] Imai, K. and M. L. Li (2021). Experimental Evaluation of Individualized Treatment Rules. Journal of the American Statistical Association.
- Imbens and Angrist [1994] Imbens, G. W. and J. D. Angrist (1994). Identification and Estimation of Local Average Treatment Effects. Econometrica 62(2), 467–475.
- Kennedy [2016] Kennedy, E. H. (2016). Semiparametric theory and empirical processes in causal inference. In Statistical causal inferences and their applications in public health research, pp. 141–167. Springer.
- Kennedy [2020] Kennedy, E. H. (2020). Towards optimal doubly robust estimation of heterogeneous causal effects. arXiv preprint arXiv:2004.14497v3.
- Laber and Zhao [2015] Laber, E. and Y. Zhao (2015). Tree-based methods for individualized treatment regimes. Biometrika 102(3), 501–514.
- Lei et al. [2012] Lei, H., I. Nahum-Shani, K. Lynch, D. Oslin, and S. A. Murphy (2012). A ”SMART” design for building individualized treatment sequences. Annual Review of Clinical Psychology 8, 21–48.
- Luedtke and van der Laan [2016a] Luedtke, A. R. and M. J. van der Laan (2016a). Optimal Individualized Treatments in Resource-Limited Settings. International Journal of Biostatistics 12(1), 283–303.
- Luedtke and van der Laan [2016b] Luedtke, A. R. and M. J. van der Laan (2016b). Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Annals of Statistics 44(2), 713–742.
- Luedtke and van der Laan [2016c] Luedtke, A. R. and M. J. van der Laan (2016c). Super-Learning of an Optimal Dynamic Treatment Rule. International Journal of Biostatistics 12(1), 305–332.
- Mason et al. [1999] Mason, L., J. Baxter, P. Bartlett, and M. Frean (1999). Boosting Algorithms as Gradient Descent in Function Space. Technical report.
- Mason et al. [2000] Mason, L., J. Baxter, P. L. Bartlett, and M. Frean (2000). Boosting Algorithms as Gradient Descent. Technical report.
- Murphy [2003] Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(2), 331–355.
- Newey and Robins [2018] Newey, W. K. and J. R. Robins (2018). Cross-Fitting and Fast Remainder Rates for Semiparametric Estimation. arXiv preprint arXiv:1801.09138v1.
- Neyman [1923] Neyman, J. (1923). Sur les applications de la théorie des probabilités aux expériences agricoles: Essay des principles. (Excerpts reprinted and translated to English, 1990). Statistical Science 5, 463–472.
- Nie and Wager [2021] Nie, X. and S. Wager (2021). Quasi-oracle estimation of heterogeneous treatment effects. Biometrika 108(2), 299–319.
- Petersen et al. [2007] Petersen, M. L., S. G. Deeks, and M. J. van der Laan (2007). Individualized treatment rules: Generating candidate clinical trials. Statistics in Medicine 26(25), 4578–4601.
- Pfanzagl [1982] Pfanzagl, J. (1982). Contributions to a General Asymptotic Statistical Theory, Volume 13 of Lecture Notes in Statistics. New York, NY: Springer New York.
- Pfanzagl [1990] Pfanzagl, J. (1990). Estimation in semiparametric models. In Estimation in Semiparametric Models, pp. 17–22. Springer.
- Qian and Murphy [2011] Qian, M. and S. A. Murphy (2011). Performance guarantees for individualized treatment rules. The Annals of Statistics 39(2), 1180.
- Qiu et al. [2021a] Qiu, H., M. Carone, E. Sadikova, M. Petukhova, R. C. Kessler, and A. Luedtke (2021a). Correction to: “optimal individualized decision rules using instrumental variable methods”. Journal of the American Statistical Association (just-accepted), 1–2.
- Qiu et al. [2021b] Qiu, H., M. Carone, E. Sadikova, M. Petukhova, R. C. Kessler, and A. Luedtke (2021b). Optimal individualized decision rules using instrumental variable methods. Journal of the American Statistical Association 116(533), 174–191.
- Ripley [2014] Ripley, B. D. (2014). Pattern recognition and neural networks. Cambridge University Press.
- Robins [1986] Robins, J. (1986). A new approach to causal inference in mortality studies with a sustained exposure period-application to control of the healthy worker survivor effect. Mathematical modelling 7(9-12), 1393–1512.
- Robins [2004] Robins, J. M. (2004). Optimal Structural Nested Models for Optimal Sequential Decisions. pp. 189–326. Springer, New York, NY.
- Rothwell [2005] Rothwell, P. M. (2005). Subgroup analysis in randomised controlled trials: Importance, indications, and interpretation. Lancet 365(9454), 176–186.
- Rubin [1974] Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology 66(5), 688–701.
- Song et al. [2015] Song, R., M. Kosorok, D. Zeng, Y. Zhao, E. Laber, and M. Yuan (2015). On sparse representation for optimal individualized treatment selection with penalized outcome weighted learning. Stat 4(1), 59–68.
- Sun et al. [2021] Sun, H., S. Du, and S. Wager (2021). Treatment Allocation under Uncertain Costs. arXiv preprint arXiv:2103.11066v1.
- Sun [2021] Sun, L. (2021). Empirical Welfare Maximization with Constraints. arXiv preprint arXiv:2103.15298v1.
- Tchetgen Tchetgen and Vansteelandt [2013] Tchetgen Tchetgen, E. J. and S. Vansteelandt (2013). Alternative Identification and Inference for the Effect of Treatment on the Treated with an Instrumental Variable. Harvard University Biostatistics Working Paper Series.
- van der Laan [2017] van der Laan, M. (2017). A Generally Efficient Targeted Minimum Loss Based Estimator based on the Highly Adaptive Lasso. International Journal of Biostatistics 13(2).
- van der Laan and Rubin [2006] van der Laan, M. and D. Rubin (2006). Targeted Maximum Likelihood Learning. The International Journal of Biostatistics 2(1).
- van der Laan and Luedtke [2014] van der Laan, M. J. and A. R. Luedtke (2014). Targeted Learning of the Mean Outcome under an Optimal Dynamic Treatment Rule. Journal of Causal Inference 3(1).
- van der Laan and Petersen [2007] van der Laan, M. J. and M. L. Petersen (2007). Causal effect models for realistic individualized treatment and intention to treat rules. International Journal of Biostatistics 3(1).
- van der Laan et al. [2007] van der Laan, M. J., E. C. Polley, and A. E. Hubbard (2007). Super Learner. Statistical Applications in Genetics and Molecular Biology 6(1).
- van der Laan and Rose [2018] van der Laan, M. J. and S. Rose (2018). Targeted Learning in Data Science.
- van der Vaart and Wellner [2000] van der Vaart, A. and J. Wellner (2000). Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Series in Statistics. Springer.
- van der Vaart [1998] van der Vaart, A. W. (1998). Asymptotic Statistics. Cambridge University Press.
- Varadhan et al. [2013] Varadhan, R., J. B. Segal, C. M. Boyd, A. W. Wu, and C. O. Weiss (2013). A framework for the analysis of heterogeneity of treatment effect in patient-centered outcomes research. Journal of Clinical Epidemiology 66(8), 818–825.
- Wang and Tchetgen Tchetgen [2018] Wang, L. and E. Tchetgen Tchetgen (2018). Bounded, efficient and multiply robust estimation of average treatment effects using instrumental variables. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 80(3), 531–550.
- Zhao et al. [2012] Zhao, Y., D. Zeng, A. J. Rush, and M. R. Kosorok (2012). Estimating Individualized Treatment Rules Using Outcome Weighted Learning. Journal of the American Statistical Association 107(499), 1106–1118.
- Zhao et al. [2015] Zhao, Y. Q., D. Zeng, E. B. Laber, R. Song, M. Yuan, and M. R. Kosorok (2015). Doubly robust learning for estimating individualized treatment with censored data. Biometrika 102(1), 151–168.
- Zheng and van der Laan [2011] Zheng, W. and M. J. van der Laan (2011). Cross-Validated Targeted Minimum-Loss-Based Estimation. pp. 459–474. Springer, New York, NY.
- Zhou et al. [2017] Zhou, X., N. Mayer-Hamblett, U. Khan, and M. R. Kosorok (2017). Residual Weighted Learning for Estimating Individualized Treatment Rules. Journal of the American Statistical Association 112(517), 169–187.
Supplementary Material for “Individualized treatment rules under stochastic treatment cost constraints”
This Supplementary Material is organized as follows. Section S1 contains technical conditions ensuring that the statistical parameter of interest, the average treatment effect, is pathwise differentiable and that our proposed estimator is asymptotically efficient. In Section S2, we discuss a particular technical condition that may be difficult to verify. In Section S3, we describe a modified version of our proposed estimator with improved performance in small to moderate samples. We present proofs of theoretical results in Section S4. In Section S5, we present the results of a simulation under an idealized setting. These results may provide guidance on interpreting the simulation results in Section 5.
As noted in the main text, the methods proposed in this work build upon tools used in Qiu et al. [34]; as such, the involved technical details bear similarity. To orient readers and facilitate comparisons, we have organized these supplementary materials for these papers similarly and shared portions of technical details when appropriate.
S1 Technical conditions for pathwise differentiability of parameter and asymptotic linearity of proposed estimator
In this section, we list the additional technical conditions required by Theorems 3 and 4 in Section 4 that we omit in the main text. Before doing this, we define pointwise
Condition B2 (Nonzero continuous density of around ).
If , then the distribution of has positive, finite and continuous Lebesgue density in a neighborhood of .
Condition B2 is most plausible when covariates are continuous; in that case, it is also plausible to expect the distribution of to be continuous and thus Condition B2 to hold.
Condition B3 (Smooth treatment cost function or lack of constraint).
If , then the function is continuously differentiable with nonzero derivative in a neighborhood of ; if and , then .
Condition B3 requires different conditions in separate cases. There are three cases in terms of the sufficiency of the budget to treat every individual: (i) there is an infinite budget and no constraint is present (); (ii) the budget is insufficient (); and (iii) the budget is finite but sufficient ( and ). Condition B3 makes no assumption for Case (i). In Case (ii), we require a function to be locally continuously differentiable. Since by Condition A4, this function is nonincreasing and thus only continuous differentiability is required. For each , this function is an integral of additional cost over the set and has a similar nature to survival functions. When covariates are continuous, it is plausible to assume that is continuous and thus is continuously differentiable. In Case (iii), we require that the budget has a surplus. When it is unknown a priori whether the budget is sufficient to treat every individual, namely in Case (ii) or (iii), it is highly unlikely that the budget exactly suffices with no surplus. Therefore, Condition B3 is mild.
Condition B4 (Bounded additional treatment cost).
is bounded.
Condition B5 (Active constraint).
If , then it holds that .
Condition B5 requires that, when the rule that assigns treatment completely at random while respecting the budget constraint is the reference rule of interest, it should not correspond to the trivial rule that assigns treatment to every individual. The rule equals only when the budget is sufficient to treat every individual. Since, as a separate reference rule from given fixed rules , the reference rule is only interesting when the budget constraint is active, Condition B5 often holds automatically.
Condition B6 (Sufficient rates for nuisance estimators).
Condition B6 holds if all above nuisance estimators converge at a rate faster than , which may be much slower than the parametric rate and thus allows for the use of flexible nonparametric estimators. This condition also holds if , , and each converges slower than , as long as the estimated propensity score converges sufficiently fast to compensate.
Condition B7 (Consistency of estimated influence function).
The following terms are all :
Condition B8 (Consistency of strong positivity).
With probability tending to one over the sample used to obtain , it holds that .
Condition B9 (Consistency of strictly more costly treatment).
With probability tending to one over the sample used to obtain and , it holds that and .
Condition B10 (Fast rate of estimated optimal ITR).
As sample size tends to infinity, it holds that
Condition B10 may, at first sight, appear to be difficult to verify and is discussed in detail in Section S2. As shown in Theorem S1 of Section S2, Condition B10 may require faster rates on nuisance estimators than Condition B6. For example, convergence in the -sense at a rate is sufficient for Condition B6, but a rate is needed in order to use Theorem S1 to show that Condition B10 holds.
Condition B11 (Donsker condition).
is a subset of a fixed -Donsker class with probability tending to 1. Additionally, each of , and belongs to a (possibly different) fixed -Donsker class with probability tending to 1.
Condition B12 (Glivenko-Cantelli condition).
and . Moreover, (i) if , then, for any sufficiently close to , belongs to a -Glivenko-Cantelli class with probability tending to 1; (ii) otherwise, if , then, for any with sufficiently large , belongs to a -Glivenko-Cantelli class with probability tending to 1.
S2 Sufficient condition for fast convergence rate of estimated optimal rule
Condition B10, which is required by Theorem 4, may seem unintuitive and difficult to verify. In Theorem S1 below, we present sufficient conditions for Condition B10 that are similar to those in Qiu et al. [34].
Throughout the rest of the Supplement, for two quantities , we use to denote for some constant that may depend on .
Theorem S1 (Sufficient condition for Condition B10).
Assume that . Further assume that each of and belongs to a (possibly different) fixed -Donsker class with probability tending to 1. Suppose also that the distribution of () has nonzero finite continuous Lebesgue density in a neighborhood of and a neighborhood of . Under Condition B4, the following statements hold.
- If for some , then
- If , then
The proof of Theorem S1 is very similar to that of Theorem 5 in Qiu et al. [34] and can be found in Section S4.5.
S3 Modified procedure with cross-fitting
In this section, we describe our proposed procedure to estimate the ATE with cross-fitting, as mentioned in Remark 6. We use to denote a user-specified fixed number of folds into which to split the data. Common choices of used in practice include 5, 10 and 20. A brief code sketch of the fold-splitting step is given after the procedure below.
1. Use the empirical distribution of as an estimate of the true marginal distribution of . Compute estimates , and of , and , respectively, using flexible regression methods.
2. Estimate an optimal individualized treatment rule for each observation:
   (a) Create folds: split the set of observation indices into mutually exclusive and exhaustive folds of (approximately) equal size. Denote these sets by , . Define . For each , let be the index of the fold containing ; in other words, is the unique value of such that .
   (b) Estimate using sample splitting: for each , compute estimates and of and using flexible regression methods based on data . For each , let be the sample splitting estimate of .
   (c) Estimate with a one-step correction estimator
   (d) Let and . For any , define , , and, for ,
   (e) Compute , which is used to define an estimate of for which the plug-in estimator is asymptotically linear.
       - If and there is a solution in to
         (S1) then take to be this solution.
       - Otherwise, set .
   (f) For each , estimate with
3. Obtain an estimate of the reference ITR as follows:
   - For , take to be .
   - For ,
     (a) obtain a targeted estimate of : run an ordinary least-squares regression using observations with outcome , offset , no intercept and covariate . Take to be the fitted mean model;
     (b) take to be the constant function , where we define pointwise .
   - For , take to be .
4. Estimate the ATE of relative to the reference ITR with a targeted minimum loss-based estimator (TMLE) :
   (a) obtain a targeted estimate of : run an ordinary least-squares linear regression using observations with outcome , offset , no intercept and covariate . Take to be the fitted mean function.
   (b) with being any distribution with components and , set where .
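As referenced before the procedure, the sketch below illustrates the fold-splitting and out-of-fold nuisance estimation of Steps 2(a)-(b): each observation's nuisance prediction is produced by a learner trained only on the other folds. The learner factory and the use of predict_proba for a binary nuisance are assumptions for illustration.

```python
import numpy as np

def cross_fit_predictions(X, y, make_learner, n_folds=10, seed=None):
    """Out-of-fold nuisance predictions: split indices into n_folds folds and,
    for each fold, predict its observations with a learner trained on the
    remaining folds only (as in Steps 2(a)-(b))."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    fold_id = rng.permutation(np.arange(n) % n_folds)  # roughly equal fold sizes
    out = np.empty(n, dtype=float)
    for k in range(n_folds):
        test = fold_id == k
        learner = make_learner()
        learner.fit(X[~test], y[~test])                   # fit off-fold
        out[test] = learner.predict_proba(X[test])[:, 1]  # predict in-fold
    return out
```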
S4 Proof of theorems
S4.1 Identification results (Theorems 1 and 2)
Proof of Theorem 1.
Note that
Similarly, . Hence, . By the law of total expectation, this yields that . It then follows that
The results for the treatment cost can be proved similarly. ∎
We next prove Theorem 2.
Proof of Theorem 2.
Let be any ITR that satisfies the constraint that . We will show that , implying that is a solution to (2).
Observe that
Note that if and if . Combining this observation with the fact that , the above shows that
If , then , as desired; otherwise, and , and so it follows that . Therefore, we conclude that is a solution to (2). ∎
S4.2 Pathwise differentiability of ATE parameter (Theorem 3)
We closely follow the existing literature on semiparametric efficiency theory to prove pathwise differentiability of our estimands and asymptotic efficiency of our estimators under nonparametric models. We refer readers to, for example, Pfanzagl [30, 31] and Bolthausen et al. [4] for a more thorough introduction to semiparametric efficiency.
To derive the canonical gradient of the ATE parameters, let be the set of score functions with range contained in and we study the behavior of the parameters under perturbations in an arbitrary direction . We note that the -closure of is indeed .
We define , and via its Radon-Nikodym derivative with respect to :
(S2) |
for any in a sufficiently small neighborhood of 0 such that the right-hand side is positive for all . It is straightforward to verify that the score function for at is indeed . For the rest of this section, we may drop from the notation and use as a shorthand notation for when no confusion should arise.
We will see that each parameter evaluated at depends on the following marginal or conditional distributions in a clean way: the marginal distribution of , the marginal distribution of , the conditional distribution of given , the conditional distribution of given , and the conditional distribution of given . We now derive their closed-form expressions. Let , and . We can then show that
(S3) |
Moreover, , -a.s., -a.s., and -a.s.
We finally introduce some additional notation used in the rest of the section. We use to denote a generic positive constant that may vary line by line. Let be the survival function of the distribution of when . We also use the notation defined in Section S2. For a generic function , we will use the big- and little-oh notations, namely and , respectively, to denote the behavior of as . Finally, for a general function or quantity that depends on a distribution , we use to denote . For example, we may write as a shorthand for . We will also write expectations under as .
The derivation of the canonical gradients of can be found in the Supplement of Qiu et al. [34]. We now derive the canonical gradients of , and , which are different from the parameters in Qiu et al. [34].
S4.2.1 Canonical gradient of (Theorem 3)
Fix a score . Note that, for all , . Combining this, (S3) and the chain rule yields that
where we have used the fact that -a.s., -a.s., and . Therefore, the canonical gradient of at is .
S4.2.2 Canonical gradient of
Let be a score function in . We aim to show that
(S4) |
which shows that is pathwise differentiable with canonical gradient at .
By similar arguments to those in Section 3.4 of Kennedy [16], we can show that
(S5) | ||||
(S6) |
Consequently, and . It follows that, for all in a sufficiently small neighborhood of zero, Condition B5 implies that . Consequently, for each in this neighborhood, , where we have used that . It follows that the derivative is the same as the derivative of at , provided this derivative exists. Noting that is a particular instance of a fixed treatment rule, we may take to be in the results on pathwise differentiability of and show that
(S7) |
As both the above derivative and the derivatives in (S5) and (S6) exist, by the chain rule, it follows that
Note that . Plugging (S5), (S6) and (S7) into the above, we can show that its right-hand side is equal to the right-hand side of (S4). As , we have shown that (S4) holds, and the desired result follows.
S4.2.3 Canonical gradient of
Let be a score function in . The argument that we use parallels that of Luedtke and van der Laan [20] and Qiu et al. [34], except that it is slightly modified to account for the fact that the resource constraint takes a different form in this paper.
We first note that all of the following hold for all sufficiently close to zero:
(S8) | ||||
(S9) | ||||
(S10) |
The derivations of these inequalities are straightforward and hence omitted. Under Condition A4, the above inequalities imply that
(S11) |
For sufficiently close to zero, it will be useful to define
for . We also define when the derivative exists.
We first establish two lemmas. They show that, under a perturbed distribution with perturbation magnitude , the fluctuation in the threshold is of order . This result is crucial for quantifying the convergence rate of two terms in the expansion of , namely terms 1 and 3 in (S12) below. In particular, term 1 is the main challenge in the analysis: it arises from the perturbation of the threshold and is unique to estimation problems involving the evaluation of optimal ITRs. The first lemma studies the convergence of  to . Because it may be the case that , the convergence stated in this result is convergence in the extended real line.
Lemma S1.
Under the conditions of Theorem 3, as .
Proof of Lemma S1.
We separately consider the cases where and .
Suppose that . For all sufficiently small and sufficiently small , by (S10), (S11) and the fact that the range of is contained in , we can show that
Under Condition B3, as long as is small enough, the right-hand side converges to as . Moreover, Conditions B3 and A4 can be combined to show that the derivative of is strictly negative for all for sufficiently small , and so . Because by the definition of under Condition B2, it follows that, for all sufficiently close to zero, . By the definition , it follows that, for all sufficiently close to zero, , that is, .
By similar arguments, we can show that, for all sufficiently close to zero, . Indeed,
The right-hand side converges to as provided is sufficiently small. The derivative of is strictly negative on provided is small enough, and therefore, . Hence, . By the definition of , it follows that .
Combining these two results, we see that, for all sufficiently close to zero, . Hence, . As is an arbitrary in a neighborhood of zero, it follows that . That is, as in the case that .
We now study the case where . If , then it is trivial that for all , and so the desired result holds. Suppose now that . Fix a small enough so that the bound in (S11) is valid for all . Also fix and . By (S11) and the bound on the range of ,
Because is a nonnegative decreasing function, the right-hand side is no greater than . This upper bound tends to as . Hence, . By Condition B3 and the monotonicity of , , and so for all sufficiently close to zero. By the definition of , it follows that for all sufficiently close to zero. Since is arbitrary, the desired result follows. ∎
The next lemma establishes a rate of convergence of to as .
Lemma S2.
Under the conditions of Theorem 3, .
Proof of Lemma S2.
We separately consider the cases where and .
We start with the easier case where . In this case, Lemma S1 shows that is equal to for all sufficiently close to zero. Thus, .
Now consider the more difficult case where . By the Lipschitz property of the function , we can show that . As a consequence, to show that , it suffices to show that . We next establish this statement.
Fix in a sufficiently small neighborhood of zero. By the definition , the bound on the range of , and (S11), it holds that . We use a Taylor expansion of about , which is justified by Condition B3 provided is small enough, and it follows that
By Condition B3, . Plugging this into the above shows that
Note that Condition B3 implies that . Therefore, the above shows that, for all sufficiently close to zero, , which implies that there exists an sequence for which .
A similar argument, which is based on the observation that , can be used to show that there exists an sequence such that . Combining these two bounds shows that , as desired. This concludes the proof. ∎
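In schematic form, writing $S_\epsilon$ for the perturbed survival-type function whose level defines the threshold, $\tau_\epsilon$ for the perturbed threshold, and $\tau_0$ for its limit (all symbols here are illustrative stand-ins for the quantities in the proof above), the key step of Lemma S2 can be summarized as
\[
S_\epsilon(\tau_\epsilon) - S_\epsilon(\tau_0) = S_\epsilon'(\tilde{\tau})\,(\tau_\epsilon - \tau_0)
\quad \text{for some } \tilde{\tau} \text{ between } \tau_0 \text{ and } \tau_\epsilon,
\]
where the left-hand side is $O(\epsilon)$ by the perturbation bounds and $|S_\epsilon'|$ is bounded away from zero near $\tau_0$ for all sufficiently small $\epsilon$, so that
\[
|\tau_\epsilon - \tau_0| \le \frac{\bigl| S_\epsilon(\tau_\epsilon) - S_\epsilon(\tau_0) \bigr|}{\inf_{\tilde{\tau}} |S_\epsilon'(\tilde{\tau})|} = O(\epsilon).
\]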
Our derivation of the canonical gradient is based on the following decomposition:
(S12)
We separately study each of the six terms on the right-hand side, which we refer to as term 1 up to term 6.
Study of term 1 in (S12): We will show that this term is . By Lemma S2 and (S9),
Under Condition B2, . We apply an argument similar to that used to prove Lemma 2 in van der Laan and Luedtke [46]:
Because  implies that either (i)  and  have different signs or (ii) only one of these quantities is zero, the display continues as
Using the facts that  by Condition A4, that  since probabilities are no more than one, and that , the display continues as
Leveraging the bound on  that appears in the indicator function, we see that
where the final equality holds by Condition B1. The integral in the final expression is , and so this expression is .
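The core inequality behind this margin-type argument can be written, in illustrative notation with $b$ the blip-type contrast (possibly centered at the constraint threshold) whose sign determines the rule and $d_\epsilon$, $d_0$ the perturbed and unperturbed rules, as
\[
\Bigl| \int \{ d_\epsilon(w) - d_0(w) \}\, b(w)\, dP_0(w) \Bigr|
\;\le\; \int \mathbf{1}\{ 0 < |b(w)| \le C\epsilon \}\, |b(w)|\, dP_0(w)
\;\le\; C\epsilon \int \mathbf{1}\{ 0 < |b(w)| \le C\epsilon \}\, dP_0(w),
\]
since the two rules can only disagree at points where the contrast is nonzero but no larger than $C\epsilon$ in absolute value; the final integral vanishes as $\epsilon \to 0$, so the whole expression is $o(\epsilon)$.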
Study of term 2 in (S12): By the result on the pathwise differentiability of , setting to be , we see that the second term satisfies , where is equal to .
Study of term 3 in (S12): We will show that the third term is identical to zero for any that is sufficiently close to zero. If , then this term is trivially zero. Otherwise, . Lemma S1 shows that, in this case, for sufficiently close to zero. Hence, . Consequently, term 3 equals zero for all sufficiently close to zero.
Study of term 4 in (S12): We will show that this term can be written as  for an appropriately defined  that does not depend on . Note that there exists a function  for which , , and, for all ,
The function  can be chosen so that the little-oh term above is uniform over  and . By the definition of  from (S3), we see that
where the little-oh terms are uniform over and . Hence,
Since  -a.s., the display continues as
As a consequence, term 4 satisfies
where
Study of term 5 in (S12): By (S3) and the fact that whenever , we see that , where is defined as . Since is defined as , we see that it also holds that .
Conclusion of the derivation of the canonical gradient of : Combining our results regarding the six terms in (S12), we see that
Dividing both sides by and taking the limit as , we see that is the canonical gradient of at .
S4.3 Expansions based on gradients or pseudo-gradients
In this section, we present (approximate) first-order expansions of the ATE parameters, based on which we construct our proposed targeted minimum-loss based estimators (TMLEs) and prove their asymptotic linearity. We refer the reader to Supplement S5 of Qiu et al. [34] for an overview of TMLE based on gradients and pseudo-gradients. The overall idea behind TMLE based on gradients is the following: the empirical mean of the gradient at the estimated distribution can be viewed as the first-order bias of the plug-in estimator, and this bias can be removed by solving the estimating equation that sets it to zero. The idea behind pseudo-gradients is similar, except that the gradient is replaced by an approximation, which we term a pseudo-gradient, chosen so that the corresponding estimating equation can be solved with a single regression step.
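To make this description concrete, the following minimal Python sketch contrasts the one-step correction with TMLE-style targeting for a generic parameter. It illustrates the general idea only, not the estimator proposed in this paper; in particular, the routine `fluctuate`, which performs one targeting update of the nuisance fit, is a hypothetical user-supplied function.

```python
import numpy as np

def debias_illustration(psi_plugin, gradient_values, fluctuate, tol=1e-8, max_iter=50):
    """Illustration of gradient-based debiasing (not this paper's exact estimator).

    psi_plugin      : plug-in estimate of the parameter at the fitted distribution.
    gradient_values : estimated canonical gradient evaluated at each observation.
    fluctuate       : hypothetical routine performing one targeting update of the
                      nuisance fit; returns updated (psi_plugin, gradient_values).
    """
    # One-step correction: the empirical mean of the estimated gradient estimates
    # the first-order bias of the plug-in estimator, so we add it back.
    one_step = psi_plugin + np.mean(gradient_values)

    # TMLE-style targeting: update the nuisance fit until the empirical mean of the
    # gradient (the estimating equation) is approximately zero, then use the plug-in.
    psi, grad = psi_plugin, np.asarray(gradient_values)
    for _ in range(max_iter):
        if abs(np.mean(grad)) < tol:
            break
        psi, grad = fluctuate(psi, grad)
    return one_step, psi
```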
For any ITR that utilizes all covariates, we define
For any ITR that only utilizes , for convenience, we define .
For and , it is straightforward to show that the following expansions hold:
For , we expand this parameter sequentially as follows:
For , straightforward but tedious calculation shows that the following expansion holds:
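For orientation, first-order expansions of this kind are typically organized around the generic von Mises form below, in which $P_0$ is the true distribution, $\hat{P}$ the fitted distribution, $P_n$ the empirical distribution, $D_P$ the (pseudo-)gradient at $P$, $R$ a second-order remainder, and $Q f := \int f \, dQ$ for a distribution $Q$ and function $f$; this notation is illustrative and does not reproduce the specific remainder terms of this paper:
\[
\Psi(\hat{P}) - \Psi(P_0)
= (P_n - P_0) D_{P_0} \;-\; P_n D_{\hat{P}} \;+\; (P_n - P_0)\bigl(D_{\hat{P}} - D_{P_0}\bigr) \;+\; R(\hat{P}, P_0).
\]
The targeting step is designed to make $P_n D_{\hat{P}}$ (approximately) zero, the empirical-process term is controlled by conditions on the nuisance estimators, and $R$ is second order in the nuisance estimation errors.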
S4.4 Asymptotic linearity of proposed estimator (Theorem 4)
For convenience, we set  to have component  and , even though the plug-in estimator does not explicitly involve these functions. We start with some lemmas that facilitate the proof of the main theorem. In this section, we define  and  to simplify notation.
Our proof is centered around the expansions in Supplement S4.3. We first prove a few lemmas. Lemma S3 is a standard asymptotic linearity result for the estimators  and  of the treatment resource used under the constant ITRs  and , respectively; Lemma S4 is a convenient technical tool for converting the conditions on norms in Condition B6 between functions; Lemmas S5–S7 are results for our estimators that parallel Lemmas S1–S2 for deterministic perturbations of , and they lead to the crucial Lemma S8 on the negligibility of the remainder for an arbitrary ITR .
Lemma S3 (Asymptotic linearity of and ).
Under the conditions of Theorem 4,
This result follows from the facts that (i) is a one-step correction estimator of [30], and (ii) is a TMLE for [44, 49]. Therefore the proof is omitted.
Lemma S4 (Lemma S8 in Qiu et al. [34]).
The following Lemmas S5–S7 prove consistency of the estimated thresholds used to define the estimated optimal ITR .
Lemma S5 (Lemma S5 in Qiu et al. [34]).
Let , , be bounded and functions and . Then
If takes values in , then can be replaced by .
This lemma is a stochastic variant of the deterministic result in Lemma S1 and has a similar proof. Therefore, the arguments are slightly abbreviated here.
Proof of Lemma S6.
We separately consider the cases where and .
First consider the case where . We start by showing that, for any sufficiently close to , it holds that . Fix an in a neighborhood of . By the triangle inequality,
(S13)
We will show that the right-hand side is . By Condition B12, the third term on the right is  for  sufficiently close to . Moreover, because the second term is no greater than , Condition B12 implies that the second term is also . We will now argue that the first term is . By Lemma S5 and Condition B4, for any ,
where the final relation follows from Markov’s inequality. We next show that the last line is . Fix . For that is sufficiently close to and that is sufficiently small, by Condition B2, we see that is continuous in and hence, for all sufficiently small , it holds that . Therefore,
Since by Condition B12, the right-hand side of the above display converges to zero as . Therefore, . Recalling (S13), the above results imply that for any that is sufficiently close to .
Fix . For any sufficiently small, the above result and Lemma S3 imply that and . By Condition B3, provided is sufficiently small. It follows that, with probability tending to one, , and hence by the definition of . Because is arbitrary, it follows that .
The case where can be proved similarly. If , then it trivially holds that for all and the desired result holds. Otherwise, for any for which is sufficiently large, a nearly identical argument to that used above shows that . By Condition B3 and monotonicity of , it follows that , and so, with probability tending to one, and hence by the definition of . Because is arbitrary, we have shown that . ∎
Lemma S7 (Consistency of  and existence of a solution to (5) when ).
Assume that the conditions of Theorem 4 hold. The following statements hold:
We separately prove i, ii, and iii in the case that , and then prove iii separately in the cases where , and .
Proof of i from Lemma S7.
Our strategy for showing the existence of a solution to (5) is as follows. First, we show that the left-hand side of (5) consistently estimates the treatment resource being used, uniformly over rules . Next, we show that the left-hand side of (5) is a continuous function in  that takes different signs at  and  with probability tending to one; a schematic illustration of this root-finding step is given after the proof.
Define . We first show that
(S14)
We rely on the fact that, for fixed , is a one-step estimator of . Note that
Conditions B7 and B11, along with Lemma S4, imply that the first term on the right-hand side is . For the second term, we note that the fact that  for all  and all , together with the Cauchy-Schwarz inequality, implies that
Hence, the second term is by Condition B6. The third term is also by an almost identical argument. Combining the previous two displays shows that (S14) holds.
Applying (S14) at  shows that . Therefore,  with probability tending to one. Applying this result at  shows that . Combining this with the fact that  whenever  shows that  with probability tending to one. Combining these results at  and  with the fact that  is a continuous function shows that, with probability tending to one, there exists a  such that . Lemma S6 then implies that  with probability tending to one. ∎
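As a purely illustrative companion to the existence argument above, the sketch below locates a threshold at which an estimated resource-use curve meets the budget by bracketing and root-finding. The function `estimated_cost_at`, the budget `kappa`, and the bracketing interval are assumptions made for the sketch and do not reproduce the exact form of (5); in the proof above, the left-hand side of (5) plays the role of `g`, and Lemma S6 yields consistency of the resulting root.

```python
import numpy as np
from scipy.optimize import brentq

def solve_threshold(estimated_cost_at, kappa, tau_lo, tau_hi):
    """Find a threshold tau with estimated_cost_at(tau) == kappa when the constraint binds.

    estimated_cost_at : callable mapping a candidate threshold to the estimated expected
                        resource use of the rule that treats units whose estimated benefit
                        exceeds that threshold (assumed continuous and nonincreasing).
    kappa             : resource budget.
    """
    g = lambda tau: estimated_cost_at(tau) - kappa
    # If even the most permissive rule stays within budget, the constraint is inactive
    # and no root-finding is needed.
    if g(tau_lo) <= 0:
        return tau_lo
    # Otherwise g is positive at tau_lo and, for a large enough tau_hi, negative at
    # tau_hi, so a sign change exists and a bracketing method locates a root.
    return brentq(g, tau_lo, tau_hi)
```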
Proof of iii from Lemma S7 when .
In this proof, we use to denote a probability statement over the draws of . Fix . We will argue by contradiction to show that and as , implying the consistency of . The consistency of then follows. We study these two events separately. First, we suppose that
(S15)
Then there exists such that, for all in an infinite sequence , the probability is at least . Consequently, for any , the following holds with probability at least :
(S16)
We now show that the first term is . For any and , by Lemma S5 and Condition B4,
Similarly to the proof of Lemma S6, the fact that (Condition B12) ensures that . By Condition B3, is a negative constant. Because (S16) holds with probability at least for infinitely many , this shows that is not . This contradicts our result from part ii of this lemma. Therefore, (S15) is false, that is, .
Next, suppose that, for some , . Then there exists  such that, for all  in an infinite sequence , . Consequently, for any , the following holds with probability at least :
The rest of the argument is almost identical to the contradiction argument for the previous event, and is therefore omitted.
Since is arbitrary, combining the results of these two contradiction arguments shows that , as desired. ∎
Proof of Lemma S8.
We next prove Theorem 4.
Proof of Theorem 4.
First, we note the following facts, which will be sufficient to ensure that the remainders and empirical process terms in all of the first-order expansions given above are . By Condition B6, Lemmas S4 and S8, the Cauchy-Schwarz inequality, and boundedness of the range of an ITR, the following terms are all :
Moreover, by Condition B10, ; by Conditions B7 and B11, for all and
Therefore, all relevant remainders and empirical process terms are .
We separately study the three cases where , , and .
Case I: . It holds that
where the last step follows from the TMLE construction of (Step 4a of our estimator), which implies that
We now show that the second term on the right-hand side is zero with probability tending to one. If , then this term is zero. Otherwise, . By Lemma S7, the following holds with probability tending to one:
and hence the second term is zero with probability tending to one, as desired. Therefore, .
Case II: . It holds that
where we have used and to denote the values that the two functions take, respectively. The TMLE construction of (Step 4a of our estimator) implies that
and hence
which is zero with probability tending to one as proved above. By Condition B5, Lemma S3 and the delta method for influence functions, the value that takes is an asymptotic linear estimator of the value that takes. Straightforward application of the delta method for influence functions implies that
Case III: . It holds that
The TMLE construction of (Step 4a of our estimator) implies that
so
which is zero with probability tending to one as proved above. Therefore,
Conclusion: The asymptotic linearity result on follows from the above results. Consequently, the asymptotic normality result on holds by the central limit theorem and Slutsky’s theorem. ∎
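The asymptotic linearity established above is what licenses Wald-type inference in the simulations reported later. The sketch below shows the generic construction of a two-sided Wald confidence interval and a one-sided lower confidence bound from a point estimate and its estimated influence-function values; the function and variable names are illustrative rather than part of the paper's software.

```python
import numpy as np
from scipy.stats import norm

def wald_intervals(estimate, influence_values, level=0.95):
    """Wald-type inference for an asymptotically linear estimator."""
    influence_values = np.asarray(influence_values)
    n = influence_values.size
    # Standard error from the empirical variance of the estimated influence function.
    se = influence_values.std(ddof=1) / np.sqrt(n)
    z = norm.ppf(1 - (1 - level) / 2)              # e.g. 1.96 for a 95% interval
    two_sided = (estimate - z * se, estimate + z * se)
    lower_bound = estimate - norm.ppf(0.975) * se  # 97.5% one-sided lower confidence bound
    return two_sided, lower_bound
```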
S4.5 Proof of Theorem S1
In this section, we prove Theorem S1. The arguments are almost identical to those in Supplement S9 of Qiu et al. [34], with adaptations to the different treatment resource constraint.
Lemma S9 (Convergence rate of if ).
Assume that the conditions of Theorem 4 hold. Suppose that , that the Lebesgue density of the distribution of  under  is well-defined, nonzero and finite in a neighborhood of , and that . Under these conditions, the following implications hold with probability tending to one:
• If  for some , then .
• If , then .
The condition that is reasonable if has a continuous distribution when , in which case .
Proof of Lemma S9.
We study the three cases where , and separately.
We first study the case where . By Lemma S7, with probability tending to one, where is a solution to (5), and
We argue conditionally on the event that is a solution to (5). Adding to both sides shows that . By a Taylor expansion of under Conditions B2, B3 and A4, the left-hand side is equal to for some , yielding that
which immediately implies that
(S17)
The rest of the proof for this case and the proof for the other two cases are identical to the proof of Lemma S14 in Qiu et al. [34]. We present the argument below for completeness. By Lemma S5 and Condition B4, for any it holds that
Fix a positive sequence , where each may be random through observations , such that as . By a Taylor expansion of , the survival function of the distribution of when , around , which is valid under Condition B2 provided is sufficiently small, it follows that
Here we recall that is finite by Condition B2. Returning to (S17),
If for some , by Markov’s inequality, . In this case, taking yields that with probability tending to one. If , then taking yields that , and hence that with probability tending to one. The desired result follows by noting that and in both cases, with probability tending to one.
We now study the case where . By Lemma S6, with probability tending to one, and hence , as desired.
We finally study the case where . We argue conditional on the event that a solution to (5) exists, which happens with probability tending to one by Lemma S7. Recall that for convenience we let  when . Then exactly one of the following two events happens: (i)  or , in which case ; (ii)  and , in which case an argument similar to the above proof for the case where  shows that the distance between  and  satisfies the desired bound. The desired result holds conditionally on either event, so it also holds unconditionally. ∎
We finally prove Theorem S1.
Proof of Theorem S1.
Observe that
(S18)
Starting from this inequality, the rest of the proof is identical to that of Theorem 5 in Qiu et al. [34]. We present the argument below for completeness. Let be a positive sequence, where each is random through the observations , such that as .
We denote the three terms on the right-hand side by terms 1, 2, and 3, and study these terms separately. It is useful to note that , so the Lebesgue density of the distribution of , , is finite in a neighborhood of with probability tending to one.
Study of term 1 in (S18): Observe that
First consider the bound with the -distance. Because if and only if (i) and take different signs or (ii) only one of them is zero, this event implies , and so this term is upper bounded by
where the second-to-last relation holds by Hölder's inequality and Markov's inequality, and the last relation holds with probability tending to one by Lemma S7 and the assumption that the distribution of , , has a continuous and finite Lebesgue density in a neighborhood of . Taking  yields that .
Next consider the bound with the -distance. We have that
Therefore, the first term is upper bounded by both and , up to an absolute constant.
Study of term 2 in (S18): Because  if and only if the two indicators take different signs or only one of them is zero, these indicators can only take different values if . Therefore, term 2 is bounded as
where the last step holds for with probability tending to one by the assumption that the distribution of , , has a continuous finite Lebesgue density in a neighborhood of and Lemma S7. If , by Lemma S9, with probability tending to one,
Otherwise, by Lemma S6, with probability tending to one, and the above result still holds.
Study of term 3 in (S18): By Lemma S5,
By a Taylor expansion of around , similarly to the proof of Lemma S9, with probability tending to one,
where . If for some , then . Taking yields that . If , then taking yields that with probability tending to one. Also note that, by Lemma S9, if , then, with probability tending to one,
The same holds when since then with probability tending to one.
Therefore, with probability tending to one,
Conclusion of the bound in (S18): We finally combine the bounds for all three terms. Note that for any sequence of non-negative random variables and sequence of constants . It follows that, with probability tending to one,
∎
S5 Additional simulations
S5.1 Results of the simulation with nuisance functions set to the truth
In this section, we present the results of a simulation with a setting identical to that in Section 5 of the main text, except that the nuisance functions are taken to be the truth rather than estimated via machine learning. The purpose of this simulation is to show that the performance of our proposed estimator may improve substantially when machine learning estimators of the nuisance functions outperform those used in the simulation study reported in the main text.
Table S1 presents the performance of our proposed estimator in this simulation. The Wald CI coverage is close to 95% for sample sizes of 1000 or more. The coverage of the confidence lower bounds is also close to the nominal level of 97.5%. Therefore, our proposed procedure appears to have the potential to improve substantially when better estimators of the nuisance functions are used. Figure S1 presents the width of our 95% Wald CI scaled by the square root of the sample size . For each estimand, the scaled width appears to stabilize as  grows and to be similar to the scaled width observed in the simulation reported in Section 5, where the nuisance functions are estimated from data.
Table S1: Performance of the proposed estimator when the nuisance functions are set to the truth: 95% Wald CI coverage, 97.5% confidence lower bound coverage, bias, RMSE, and the ratio of the mean standard error to the standard deviation, for each estimand at sample sizes 500, 1000, 4000, and 16000.
Figure S1: Width of the 95% Wald CI scaled by the square root of the sample size, for each estimand.
S5.2 Simulation in a low-dimensional setting with a parametric model
In this section, we describe an additional simulation in a low-dimensional setting with a parametric model, along with the simulation results.
The data are generated as follows. We first generate a univariate covariate . We then generate ,  and  as follows:
where  and  are independent conditional on . We set , , and , which is an active constraint with  and . We use logistic regression to estimate the functions ,  and . All other simulation settings are identical to those in Section 5.
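A minimal sketch of this low-dimensional setup is given below. The data-generating coefficients, the covariate distribution, and the exact set of nuisance regressions are placeholders chosen only for illustration; they are not the specification used in our simulation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
expit = lambda x: 1.0 / (1.0 + np.exp(-x))

# Hypothetical data-generating process with a univariate covariate W, a binary
# encouragement/instrument Z, a binary treatment A, and a binary outcome Y.
W = rng.uniform(-1.0, 1.0, size=n)
Z = rng.binomial(1, expit(0.5 * W))
A = rng.binomial(1, expit(0.5 * W + 1.0 * Z))
Y = rng.binomial(1, expit(0.5 * W + 0.5 * A))

# Parametric (logistic-regression) estimates of nuisance functions, in the spirit of
# the simulation described above; which regressions are needed is itself an assumption.
X = W.reshape(-1, 1)
XZ = np.column_stack([W, Z])
fit_Z_given_W = LogisticRegression().fit(X, Z)    # estimates P(Z = 1 | W)
fit_A_given_WZ = LogisticRegression().fit(XZ, A)  # estimates P(A = 1 | W, Z)
fit_Y_given_WZ = LogisticRegression().fit(XZ, Y)  # estimates E[Y | W, Z]
```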
The simulation results are presented in Table S2 and Figure S2. The performance generally falls between that in the nonparametric setting of Section 5 and that in the oracle setting of Section S5.1. The CI coverage is much better than in the nonparametric case, suggesting that our method might perform better with improved estimators of the nuisance functions ,  and .
Table S2: Performance of the proposed estimator in the low-dimensional parametric simulation: 95% Wald CI coverage, 97.5% confidence lower bound coverage, bias, RMSE, and the ratio of the mean standard error to the standard deviation, for each estimand at sample sizes 500, 1000, 4000, and 16000.
Figure S2: Width of the 95% Wald CI scaled by the square root of the sample size in the low-dimensional parametric simulation.