
Optimal Individualized Treatment Rule For Combination Treatments Under Budget Constraints

Qi Xu University of California Irvine, Irvine, USA. [email protected]    Haoda Fu Eli Lilly and Company, Indianapolis, USA. [email protected]    Annie Qu University of California Irvine, Irvine, USA. [email protected]
Abstract

The individualized treatment rule (ITR), which recommends an optimal treatment based on individual characteristics, has drawn considerable interest from many areas such as precision medicine, personalized education, and personalized marketing. Existing ITR estimation methods mainly focus on recommending one treatment out of two or more options. However, combinations of multiple treatments could be more effective in many areas. In this paper, we propose a novel Double Encoder Model (DEM) to estimate the individualized treatment rule for combination treatments. The proposed double encoder model is a nonparametric model which not only flexibly incorporates complex treatment effects and interaction effects among treatments, but also improves estimation efficiency via its parameter-sharing feature. In addition, we tailor the estimated ITR to budget constraints through a multi-choice knapsack formulation, which enhances our proposed method under restricted-resource scenarios. In theory, we provide the value reduction bound with or without budget constraints, and an improved convergence rate with respect to the number of treatments under the DEM. Our simulation studies show that the proposed method outperforms existing ITR estimation methods in various settings. We also demonstrate the superior performance of the proposed method on patient-derived xenograft (PDX) data, recommending optimal combination treatments to reduce the tumor size of colorectal cancer.

keywords:
Causal Inference; Combination therapy; Multi-choice knapsack; Neural Network; Precision medicine.

1 Introduction

Individualized decision-making has played a prominent role in many fields such as precision medicine, personalized education, and personalized marketing due to the rapid development of personalized data collection. For example, in precision medicine, individualized treatments based on individuals’ demographic information and their overall comorbidity improve healthcare quality (Schmieder et al., 2015). However, most existing individualized decision-making approaches select one out of multiple treatments, whereas recent advances in medical and marketing research have suggested that applying multiple treatments simultaneously, referred to as combination treatments, could enhance overall healthcare. Specifically, combination treatments are able to reduce treatment failure or fatality rates, and overcome treatment resistance for many chronic diseases (e.g., Mokhtari et al., 2017; Kalra et al., 2010; Maruthur et al., 2016; Bozic et al., 2013; Korkut et al., 2015; Möttönen et al., 1999; Forrest and Tamura, 2010; Tamma et al., 2012). Therefore, it is critical to develop a novel statistical method to recommend individualized combination treatments.

There are various existing methods for estimating the optimal individualized treatment rule. The first is the model-based approach, which estimates an outcome regression model given pre-treatment covariates and the treatment; the optimal ITR is then derived by maximizing the predicted outcome over possible treatments conditional on the pre-treatment covariates. Existing works such as Q-learning (Qian and Murphy, 2011), A-learning (Lu et al., 2013; Shi et al., 2018) and RD-learning (Meng and Qiao, 2020) all belong to this category. The other approach is known as the direct-search approach, which directly maximizes the expected outcome over a class of decision functions to obtain an optimal ITR. Seminal works in this direction include outcome weighted learning (Zhao et al., 2012; Huang et al., 2019), residual weighted learning (Zhou et al., 2017), and augmented outcome weighted learning (Zhou and Kosorok, 2017). However, the aforementioned methods in both categories are designed to select one optimal treatment among two or more candidates. To accommodate combination treatments, Liang et al. (2018) proposed an outcome weighted learning approach using the Hamming loss, extending the direct-search approach to estimate optimal individualized treatment rules for combination treatments. Except for outcome weighted learning with the Hamming loss (Liang et al., 2018), these methods treat the combination treatments as mutually independent, ignoring the correlation between different combinations. Consequently, this modeling strategy suffers from the curse of dimensionality due to the combinatorial nature of combination treatments, which inflates the computational cost and sacrifices estimation efficiency. In addition, the method of Liang et al. (2018) ignores the interaction effects among different treatments, which leads to inconsistent estimation of the ITR. In summary, existing methods ignore either the correlation among combinations or the interactions among treatments, yet both are essential to ensure an accurate and efficient estimation of the individualized treatment rule for combination treatments.

In this paper, we propose a double encoder model (DEM) to estimate the optimal individualized treatment rule for combination treatments. The proposed method incorporates both the interaction effects among different treatments and the correlations among different combinations. Specifically, we introduce an outcome regression model where the treatment effects are represented by the inner product between the outputs of a covariates encoder and a treatment encoder. The treatment encoder is decoupled into an additive treatment encoder and an interactive treatment encoder, where the interactive treatment encoder explicitly models the interaction effects of combination treatments. Meanwhile, the covariates encoder allows either parametric or nonparametric models to learn a low-dimensional representation of the pre-treatment covariates. Finally, we derive the optimal individualized treatment rule for combination treatments by maximizing the outcome regression model over the combination treatments. As we developed our method, a parallel work (Kaddour et al., 2021) proposed the generalized Robinson decomposition, which estimates the conditional average treatment effects (CATE) for structured treatments such as graphs, images, and texts. Their generalized Robinson decomposition also utilizes two neural networks to represent the treatment effects given covariates $\mathbf{X}$ and treatments $\mathbf{A}$. In spite of the overlap, our proposed method targets combination treatments specifically, modeling the interaction effects among different treatments and the correlations between different combinations.

Furthermore, the assignment of combination treatments may be restricted by limited resources in real-world scenarios. Existing works (Luedtke and van der Laan, 2016; Kitagawa and Tetenov, 2018) consider a total amount constraint for binary treatments only, where the assignments are determined by the quantiles of the treatment effects. In contrast, allocating combinations of treatments under a limited budget is an NP-hard problem, so an analytical solution such as a quantile rule does not exist. To address this problem, we formulate the constrained individualized treatment rule as a multi-choice knapsack problem (Kellerer et al., 2004), and solve this optimization problem through an efficient dynamic programming algorithm.

The main advantages and contributions of this paper are summarized as follows. First, the proposed method addresses the curse of dimensionality in combination treatment problems through a double encoder framework. In this framework, the covariates encoder captures the shared function bases of the treatment effects, while the treatment encoder learns the coefficients of those function bases. This shifts the complexity of the treatment effects to the complexities of the covariates and treatment encoders, which are managed through carefully designed architectures. Second, the proposed method enhances the estimation efficiency and the rate of value reduction convergence through the parameter-sharing feature inherent in the neural network-based treatment encoder. Third, the nonparametric modeling strategy employed by the double encoder model accommodates intricate treatment and interaction effects, effectively mitigating model mis-specification and leading to consistent estimation of the ITR. Fourth, the proposed multi-choice knapsack framework enables tailoring individualized decisions to budget constraints. Beyond its use in the proposed method, this framework is generally applicable to other outcome regression models for deriving budget-constrained decisions in multi-arm or combination treatment scenarios.

Regarding the theoretical properties of the estimated ITR, we provide the value reduction bound for the ITR for combination treatments with or without budget constraints. We then provide a non-asymptotic value reduction bound for the DEM, which guarantees that the value function of the estimated individualized treatment rule converges to the optimal value function with high probability, and that the proposed method achieves a faster convergence rate than existing methods for the multi-arm ITR. The improvement in convergence rate is attained by the hierarchical structure of the neural network, where the parameters are shared by all combinations and the input dimension is proportional to the number of treatments rather than the total number of combination treatments.

The proposed method demonstrates superior performance over existing methods in our numerical studies, especially when the number of treatments is large and there exist interaction effects among different treatments. In the real data application, we apply the proposed method to recommend optimal combination treatments for shrinking the tumor size of colorectal cancer; it achieves the largest tumor size shrinkage among the compared methods and shows its potential for improving individualized healthcare.

The rest of this paper is organized as follows. In Section 2, we introduce the notations and background of the Q-learning framework, and the budget constraints problem. In Section 3, we propose the double encoder model (DEM) to estimate the optimal individualized treatment rule for combination treatments and impose a budget constraint on the original problem. In Section 4, we establish the theoretical properties of the proposed method. In Section 5, we illustrate the empirical performance of the proposed method in various simulation studies. In Section 6, we apply the proposed method to PDX data, which aims to recommend optimal combination treatments for colorectal cancer. We provide discussion and concluding remarks in Section 7.

2 Notations and Background

In this section, we introduce the problem setup and notations for the estimation of individualized treatment rules for combination treatments. Consider the data $(\mathbf{X},\mathbf{A},Y)$ collected from designed experiments or observational studies. The subject pre-treatment covariates are denoted by $\mathbf{X}\in\mathcal{X}\subset\mathbb{R}^{p}$, which might include patients' demographics and lab test results. The combinations of $K$ treatments are denoted by $\mathbf{A}=(A^{1},A^{2},\ldots,A^{K})\in\mathcal{A}\subset\{0,1\}^{K}$, where $A^{k}=1$ indicates that the $k$th treatment is administered and $A^{k}=0$ otherwise. Note that some combinations are infeasible in real applications; for example, many drug-drug interactions could pose risks for patients that outweigh the benefits (Rodrigues, 2019). Therefore, we consider a subset $\mathcal{A}$ of all $2^{K}$ possible combinations in the treatment rule. We may also denote the treatments by $\tilde{A}\in\{1,2,\ldots,|\mathcal{A}|\}$ as categorical encodings, and use these two sets of notations interchangeably without ambiguity. The outcome of interest is denoted by $Y\in\mathbb{R}$. Without loss of generality, we assume a larger value of $Y$ is preferable, for example, the shrinkage of tumor size.

In causal inference, the potential outcome framework (Rubin, 1974) describes the possible outcome after a certain treatment is assigned. We use $Y(\mathbf{A})$ to denote the potential outcome throughout the paper. Due to the “fundamental problem of causal inference” (Holland, 1986), which indicates that only one potential outcome is observed for each subject, it is infeasible to estimate the subject-wise optimal treatment. Instead, our goal in estimating the optimal individualized treatment rule for combination treatments is to maximize the population-wise expected potential outcome, also known as the value function:

$$\mathcal{V}(d):=\mathbb{E}[Y\{d(\mathbf{X})\}], \qquad (1)$$

where $d(\cdot):\mathcal{X}\rightarrow\mathcal{A}$ is an individualized treatment rule. The value function is defined as the expectation of the potential outcomes over the population distribution of $(\mathbf{X},\mathbf{A},Y)$ under $\mathbf{A}=d(\mathbf{X})$, which is estimable when the following causal assumptions (Rubin, 1974) hold:

Assumption 1

(a) Stable Unit Treatment Value Assumption (SUTVA): $Y=Y(\mathbf{A})$; (b) No unmeasured confounders: $\mathbf{A}\perp\!\!\!\perp Y(\mathbf{a})\,|\,\mathbf{X}$ for any $\mathbf{a}\in\mathcal{A}$; (c) Positivity: $\mathbb{P}(\mathbf{A}=\mathbf{a}|\mathbf{X})\geq p_{\mathcal{A}}$ for all $\mathbf{a}\in\mathcal{A}$ and $\mathbf{X}\in\mathcal{X}$, for some $p_{\mathcal{A}}>0$.

Assumption (a) is also referred to as “consistency” in causal inference: the potential outcomes of each subject do not vary with the treatments assigned to other subjects, and the treatments are well-defined in that the same treatment leads to the same potential outcome. Assumption (b) states that all confounders are observed in the pre-treatment covariates, so that the treatment and the potential outcomes are conditionally independent given the pre-treatment covariates. Assumption (c) states that for any pre-treatment covariates $\mathbf{X}$, each treatment can be assigned with a positive probability.

Based on these assumptions, the value function defined in (1) can be identified as follows:

$$\mathcal{V}(d)=\mathbb{E}\{Y\,|\,\mathbf{A}=d(\mathbf{X})\}=\mathbb{E}\bigg\{\sum_{\mathbf{A}\in\mathcal{A}}\mathbb{E}(Y|\mathbf{X},\mathbf{A})\,\mathbb{I}\{d(\mathbf{X})=\mathbf{A}\}\bigg\}, \qquad (2)$$

where $\mathbb{I}(\cdot)$ is the indicator function. To maximize the value function, we can first estimate the conditional expectation $\mathbb{E}(Y|\mathbf{X}=\mathbf{x},\mathbf{A}=\mathbf{a})$, namely the Q-function in the literature (Clifton and Laber, 2020). Then the optimal individualized treatment rule can be obtained by

$$d^{*}(\mathbf{x})\in\operatorname*{argmax}_{\mathbf{a}\in\mathcal{A}}\mathbb{E}(Y|\mathbf{X}=\mathbf{x},\mathbf{A}=\mathbf{a}). \qquad (3)$$
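Computationally, once an estimate of the Q-function is available, rule (3) reduces to enumerating the feasible treatment set. Below is a minimal Python sketch; the fitted function `q_hat` and the feasible set `A_feasible` are placeholders, not part of the paper's implementation.

```python
import numpy as np

def itr(x, q_hat, A_feasible):
    """Rule (3): choose the feasible combination maximizing the estimated Q-function."""
    values = [q_hat(x, a) for a in A_feasible]  # one prediction per feasible combination
    return A_feasible[int(np.argmax(values))]
```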

From the multi-arm treatment perspective, the Q-function (Qian and Murphy, 2011; Qi et al., 2020; Kosorok and Laber, 2019) can be formulated as:

$$\mathbb{E}(Y|\mathbf{X},\tilde{A})=m(\mathbf{X})+\sum_{l=1}^{|\mathcal{A}|}\delta_{l}(\mathbf{X})\,\mathbb{I}(\tilde{A}=l), \qquad (4)$$

where $m(\mathbf{X})$ is the treatment-free effect representing a null effect without any treatment, and $\delta_{l}(\mathbf{X})$ is the treatment effect of the $l$th treatment. Two major challenges arise when (4) is applied to the combination treatment problem. First, if the $\delta_{l}(\cdot)$'s are restricted to a parametric model, for example, a linear model (Qian and Murphy, 2011; Kosorok and Laber, 2019), severe misspecification could occur, especially given the complex nature of the interaction effects of combination treatments. Second, as the number of treatments $K$ increases, the number of treatment-specific functions $\delta_{l}(\cdot)$ could grow exponentially. Therefore, the estimation efficiency of the ITR based on the Q-function (4) could be severely compromised for either parametric or nonparametric models, especially in clinical trials or observational studies with limited sample sizes.

In addition, combining multiple treatments expands the treatment space $\mathcal{A}$ and provides many more feasible treatment options, so each individual has more choices than the yes-or-no decision of the binary treatment scenario. This makes it possible to accommodate realistic budget constraints while maintaining an effective outcome. In this paper, we further consider a population-level budget constraint as follows. Suppose the costs of the $K$ treatments are $\mathbf{c}=(c_{1},c_{2},\ldots,c_{K})$, where $c_{k}$ denotes the cost of the $k$th treatment. Then the budget constraint for a population with sample size $n$ is:

$$\mathcal{C}_{n}(d):=\frac{1}{n}\sum_{i=1}^{n}\mathbf{c}^{T}d(\mathbf{X}_{i})\leq B, \qquad (5)$$

where $B$ is the average budget per subject. This budget constraint is suitable for many policy-making problems such as welfare programs (Bhattacharya and Dupas, 2012) and the vaccine distribution problem (Matrajt et al., 2021).
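As a concrete check of (5), the average per-subject cost of a rule can be computed directly. A small sketch with made-up costs and assignments:

```python
import numpy as np

c = np.array([1.0, 2.0, 0.5])   # hypothetical costs for K = 3 treatments
d_X = np.array([[1, 0, 1],      # assigned combinations d(x_i), one row per subject
                [0, 1, 0],
                [1, 0, 0]])
B = 2.0                          # average budget per subject
avg_cost = (d_X @ c).mean()      # C_n(d) in (5): here (1.5 + 2.0 + 1.0) / 3 = 1.5
print(avg_cost <= B)             # True: the rule satisfies the budget
```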

3 Methodology

In Section 3.1, we introduce the proposed Double Encoder Model (DEM) for estimating the optimal ITR for combination treatments. Section 3.2 considers the optimal assignment of combination treatments under budget constraints. The estimation procedure and implementation details are provided in Section 3.3.

3.1 Double Encoder Model for ITRs

Our proposed Double Encoder Model (DEM) formulates the conditional expectation $\mathbb{E}(Y|\mathbf{X},\mathbf{A})$, or the Q-function, as follows:

$$\mathbb{E}(Y|\mathbf{X},\mathbf{A})=m(\mathbf{X})+\alpha(\mathbf{X})^{T}\beta(\mathbf{A}), \qquad (6)$$

where $m(\cdot):\mathcal{X}\rightarrow\mathbb{R}$ is the treatment-free effect as in (4), and $\alpha(\cdot):\mathcal{X}\rightarrow\mathbb{R}^{r}$ is an encoder that represents individuals' pre-treatment covariates in an $r$-dimensional latent space, called the covariates encoder. The map $\beta(\cdot):\mathcal{A}\rightarrow\mathbb{R}^{r}$ is another encoder representing the combination treatment in the same $r$-dimensional latent space, called the treatment encoder. These two encoders capture the unobserved intrinsic features of subjects and treatments; for instance, the covariates encoder $\alpha(\cdot)$ represents the patients' underlying health status, while the treatment encoder $\beta(\cdot)$ learns the physiological mechanisms of the treatments. The inner product $\alpha(\mathbf{X})^{T}\beta(\mathbf{A})$ measures the concordance between subjects and treatments, and hence represents the treatment effects on subjects.

From the perspective of function approximation, the covariates encoder $\alpha(\mathbf{X})$ learns the function bases of the treatment effects, and the treatment encoder $\beta(\mathbf{A})$ learns the coefficients associated with those function bases. Consequently, the treatment effects are represented as linear combinations of $r$ functions:

$$\delta_{l}(\mathbf{X})=\sum_{i=1}^{r}\beta^{(i)}(\tilde{A}_{l})\,\alpha^{(i)}(\mathbf{X}).$$

Note that the model for multi-arm treatments (4) is a special case of the double encoder model (6) with $\alpha(\mathbf{X})=(\delta_{1}(\mathbf{X}),\ldots,\delta_{|\mathcal{A}|}(\mathbf{X}))$ and $\beta(\tilde{A})=(\mathbb{I}(\tilde{A}=\tilde{A}_{1}),\ldots,\mathbb{I}(\tilde{A}=\tilde{A}_{|\mathcal{A}|}))$ if $r=|\mathcal{A}|$. Another special case of (6) is angle-based modeling (Zhang et al., 2020; Qi et al., 2020; Xue et al., 2021), which has been applied to the estimation of the ITR for multi-arm treatments. In the angle-based framework, each treatment is encoded as a fixed vertex of a simplex, and each subject is projected into a latent space of the same dimension, so that the optimal treatment is determined by the angle between the treatment vertices and the subject latent factors. However, the dimension of the simplex and latent space is $r=|\mathcal{A}|-1$, so angle-based modeling suffers from the same inefficiency issue as (4).

Since different combination treatments can contain the same individual treatments, it is over-parameterized to model the treatment effects of each combination treatment independently. For instance, the treatment effect of the combination of drug $A$ and drug $B$ is correlated with the individual treatment effects of drug $A$ and of drug $B$. Therefore, we seek a low-dimensional function space that incorporates the correlation of the combination treatments without over-parameterization. In the DEM (6), the output dimension $r$ of the encoders controls the complexity of the function space spanned by $\alpha^{(1)}(\cdot),\ldots,\alpha^{(r)}(\cdot)$. Empirically, the dimension $r$ is a tuning parameter, which can be determined via a hyper-parameter tuning procedure. In other words, the complexity of the DEM is determined by the data itself, rather than pre-specified. In addition, the reduced dimension leads to a parsimonious model with fewer parameters, which permits an efficient estimation of the treatment effects. Furthermore, we do not impose any parametric assumptions on $\alpha(\cdot)$, which allows us to employ flexible nonlinear or nonparametric models with $r$-dimensional output to avoid potential misspecification of the treatment effects.

Given the double encoder framework in (6), the treatment effects of the combination treatments share the same function bases $\alpha^{(1)}(\cdot),\ldots,\alpha^{(r)}(\cdot)$. Therefore, the treatment encoder $\beta(\cdot)$ must represent all treatments in $\mathcal{A}$ so that $\alpha(\mathbf{X})^{T}\beta(\mathbf{A})$ can represent the treatment effects of all treatments. Through this modeling strategy, we convert the complexity of the $|\mathcal{A}|$ treatment-specific functions $\delta_{l}(\cdot)$ in (4) into the representation complexity of $\beta(\cdot)$, which represents the $|\mathcal{A}|$ treatments in an $r$-dimensional latent space. As a result, we can reduce the complexity of the combination treatment problem and achieve efficient estimation if an efficient representation (i.e., $r\ll|\mathcal{A}|$) of the $|\mathcal{A}|$ treatments can be found.

In summary, the double encoder model (6) is a promising framework to tackle the two challenges in (4) if covariates and treatment encoders can provide flexible and powerful representations of covariates and treatments, respectively, which will be elaborated in the following sections. Before we dive into the details of the covariates and treatment encoders, we first show the universal approximation property of the double encoder model, which guarantees its flexibility in approximating complex treatment effects.
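To fix ideas, below is a minimal PyTorch sketch of the DEM (6). The layer sizes and module layout are illustrative assumptions rather than the paper's exact architecture, and the identifiability constraints on $\beta(\cdot)$ (Sections 3.1.1 and 3.3) are omitted here for brevity.

```python
import torch
import torch.nn as nn

class DoubleEncoderModel(nn.Module):
    """Sketch of (6): E[Y | X, A] = m(X) + <alpha(X), beta(A)>."""

    def __init__(self, p, K, r, hidden=32):
        super().__init__()
        # covariates encoder alpha: X in R^p -> R^r (here a small ReLU network)
        self.alpha = nn.Sequential(
            nn.Linear(p, hidden), nn.ReLU(), nn.Linear(hidden, r))
        # additive treatment encoder beta_0(A) = W A (linear, no bias)
        self.beta0 = nn.Linear(K, r, bias=False)
        # interactive treatment encoder beta_1: ReLU network over A, as in (9)
        self.beta1 = nn.Sequential(
            nn.Linear(K, hidden), nn.ReLU(), nn.Linear(hidden, r))

    def treatment_effect(self, X, A):
        # inner product alpha(X)^T beta(A): one scalar per subject
        beta = self.beta0(A) + self.beta1(A)
        return (self.alpha(X) * beta).sum(dim=1)
```

The estimated rule then maximizes `treatment_effect` over the feasible combinations, as in (3).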

Theorem 1

For any treatment effects $\delta_{l}(\mathbf{X})\in\mathcal{H}^{2}=\{f:\int_{\mathbf{x}\in\mathcal{X}}|f^{(2)}(\mathbf{x})|^{2}\,d\mathbf{x}<\infty\}$ and any $\epsilon>0$, there exist $\alpha(\cdot):\mathcal{X}\rightarrow\mathbb{R}^{r}$ and $\beta(\cdot):\mathcal{A}\rightarrow\mathbb{R}^{r}$, where $K\leq r\leq|\mathcal{A}|$, such that

$$\lVert\delta_{l}(\mathbf{X})-\alpha(\mathbf{X})^{T}\beta(\tilde{A}_{l})\rVert_{\mathcal{H}^{2}}\leq\epsilon,\quad\text{for any }\tilde{A}_{l}\in\mathcal{A}.$$

The above theorem guarantees that the DEM (6) can represent the function space considered in (4) sufficiently well, given a sufficiently large $r$.

3.1.1 Treatment Encoder

In this section, we introduce the detailed modeling strategy for the treatment encoder $\beta(\cdot)$. The treatment effects of combination treatments can be decoupled into two components: additive treatment effects, which are the sum of the treatment effects of the single treatments in the combination; and interaction effects, which are the additional effects induced by combining multiple treatments. Therefore, we formulate the treatment encoder as follows:

$$\beta(\mathbf{A})=\beta_{0}(\mathbf{A})+\beta_{1}(\mathbf{A})=\mathbf{W}\mathbf{A}+\beta_{1}(\mathbf{A}),\quad\text{s.t.}\quad\beta_{1}(\mathbf{A})=0\ \text{ for any }\mathbf{A}\in\Big\{\mathbf{A}:\sum_{k=1}^{K}A^{k}\leq 1\Big\}, \qquad (7)$$

where $\beta_{0}(\mathbf{A})$ and $\beta_{1}(\mathbf{A})$ are the additive and interactive treatment encoders, respectively. In particular, $\beta_{0}(\mathbf{A})$ is a linear function of $\mathbf{A}$, where $\mathbf{W}=(\mathbf{W}_{1},\mathbf{W}_{2},\ldots,\mathbf{W}_{K})\in\mathbb{R}^{r\times K}$ and $\mathbf{W}_{k}$ is the latent representation of the $k$th treatment. As a result, $\alpha(\mathbf{X})^{T}\beta_{0}(\mathbf{A})=\sum_{k:A^{k}=1}\mathbf{W}_{k}^{T}\alpha(\mathbf{X})$ is the additive treatment effect of the combination treatment $\mathbf{A}$. The constraint on $\beta_{1}(\cdot)$ ensures the identifiability of $\beta_{0}(\cdot)$ and $\beta_{1}(\cdot)$, so that any representation $\beta(\mathbf{A})$ can be uniquely decoupled into $\beta_{0}(\mathbf{A})$ and $\beta_{1}(\mathbf{A})$.

The interaction effects are challenging to estimate in combination treatments. A naive solution is to assume that the interaction effects are ignorable, in which case the additive treatment encoder $\beta_{0}(\mathbf{A})$ alone is saturated for estimating the treatment effects of combination treatments. However, interaction effects are widely observed in many fields such as medicine (Stader et al., 2020; Li et al., 2018), psychology (Caspi et al., 2010), and public health (Braveman et al., 2011). Statistically, ignoring the interaction effects could lead to inconsistent estimation of the treatment effects (Zhao and Ding, 2023) and of the ITR (Liang et al., 2018). Hence, it is critical to incorporate the interaction effects when estimating the ITR for combination treatments.

A straightforward approach to modeling the interactive treatment encoder $\beta_{1}(\mathbf{A})$ parallels the additive treatment encoder $\beta_{0}(\mathbf{A})$; we call it the treatment dictionary. Specifically, a matrix $\mathbf{V}=(\mathbf{V}_{1},\mathbf{V}_{2},\ldots,\mathbf{V}_{|\mathcal{A}|})\in\mathbb{R}^{r\times|\mathcal{A}|}$ is a dictionary that stores the latent representation of each combination treatment, so that $\beta_{1}(\mathbf{A})$ is defined as follows:

$$\beta_{1}(\mathbf{A})=\mathbf{V}\mathbf{e}_{\tilde{A}}, \qquad (8)$$

where $\mathbf{e}_{\tilde{A}}$ is the one-hot encoding of the categorical representation of $\mathbf{A}$. Since the number of possible combination treatments $|\mathcal{A}|$ could grow exponentially with $K$, the number of parameters in $\mathbf{V}$ could also explode. Even worse, each column $\mathbf{V}_{l}$ can be updated only when the associated treatment $\tilde{A}_{l}$ is observed. Given a limited sample size, each treatment may be observed only a few times in the combination treatment scenario, so the estimation efficiency of $\mathbf{V}$ is severely compromised. The same issue arises in other methods: for the Q-function in (4), the parameters in $\delta_{l}(\mathbf{X})$ can be updated only when $\tilde{A}_{l}$ is observed; in the Treatment-Agnostic Representation Network (TARNet) (Shalit et al., 2017) and the Dragonnet (Shi et al., 2019), each treatment is associated with an independent set of regression layers to estimate the treatment-specific effects, which results in inefficient estimation for combination treatment problems.

To overcome this issue, we propose to utilize a feed-forward neural network (Goodfellow et al., 2016) to learn efficient latent representations in the $r$-dimensional space. Specifically, the interactive treatment encoder is defined as

$$\beta_{1}(\mathbf{A})=\mathcal{U}_{L}\circ\sigma\circ\cdots\circ\sigma\circ\mathcal{U}_{1}(\mathbf{A}), \qquad (9)$$

where $\mathcal{U}_{l}(\mathbf{x})=\mathbf{U}_{l}\mathbf{x}+\mathbf{b}_{l}$ is a linear operator with weight matrix $\mathbf{U}_{l}\in\mathbb{R}^{r_{l}\times r_{l-1}}$ and biases $\mathbf{b}_{l}$. The activation function is chosen as the ReLU function $\sigma(\mathbf{x})=\max(\mathbf{x},0)$ in this paper. An illustration of the neural network interactive treatment encoder is shown in Figure 1. Note that all parameters in (9) are shared among all possible treatments, so all of the weight matrices and biases in (9) are updated regardless of the input treatment. This improves the estimation efficiency, even though (9) may include more parameters than the treatment dictionary (8). As a result, the double encoder model with (9) not only guarantees a faster convergence rate (with respect to $K$) of the value function but also improves the empirical performance, especially when $K$ is large or the sample size $n$ is small, as shown in our numerical studies and real data analysis. A direct comparison of the neural network interactive treatment encoder (9) with the treatment dictionary (8), the additive model (4), TARNet (Shalit et al., 2017), and Dragonnet (Shi et al., 2019) is also shown in Figure 1.
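One simple way to impose the identifiability constraint in (7) on the network output is to mask the interactive encoder whenever fewer than two treatments are active. A minimal sketch, assuming `beta1_net` is the network in (9) and `A` is a batch of binary treatment vectors:

```python
def interactive_encoder(beta1_net, A):
    # zero out beta_1(A) unless A is a genuine combination (at least two
    # active treatments), matching the identifiability constraint in (7)
    mask = (A.sum(dim=1, keepdim=True) >= 2).float()
    return mask * beta1_net(A)
```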

Figure 1: The left panel shows the parameter update scheme in the additive model (4), the treatment dictionary (8), TARNet (Shalit et al., 2017), and Dragonnet (Shi et al., 2019): only the treatment-specific parameters corresponding to $\mathbf{a}_{0}$ are updated. The right panel shows the parameter-sharing feature of the neural network interactive treatment encoder (9): all parameters except non-activated input parameters are updated based on the gradient with respect to the observation $(\mathbf{x}_{0},\mathbf{a}_{0},y_{0})$.

Although the interactive treatment encoder (9) allows efficient estimation, it is not automatically guaranteed to represent up to $|\mathcal{A}|$ interaction effects. In the treatment dictionary (8), the columns $\mathbf{V}_{l}$ are free parameters that can represent $|\mathcal{A}|$ treatments without any constraints. However, an “under-parameterized” neural network is not capable of representing $|\mathcal{A}|$ treatments in an $r$-dimensional space. For example, suppose there are three treatments to be combined ($K=3$) and the treatment effects are sufficiently captured by a one-dimensional $\alpha(\mathbf{X})$ with different coefficients ($r=1$). We use the following one-hidden-layer neural network to represent the treatment in $\mathbb{R}$:

$$\beta_{1}(\mathbf{A})=u_{2}\,\sigma(\mathbf{U}_{1}\mathbf{A}+b_{1})+b_{2}, \qquad (10)$$

where $u_{2},b_{2},b_{1}\in\mathbb{R}$ are scalars and $\mathbf{U}_{1}\in\mathbb{R}^{1\times 3}$; in other words, the hidden layer includes only one node. In the following, we show that this neural network can only represent restricted interaction effects:

Proposition 1

The one-hidden-layer neural network (10) can only represent the following interaction effects: (a) $\beta_{1}(\mathbf{A})\geq 0$ or $\beta_{1}(\mathbf{A})\leq 0$ for all $\mathbf{A}\in\mathcal{A}$; (b) $\beta_{1}(\mathbf{A})$ takes the same value for all combinations of two treatments.

The proof of Proposition 1 is provided in the supplementary materials. Based on the above observation, it is critical to guarantee the representation power of $\beta_{1}(\mathbf{A})$ to incorporate flexible interaction effects. In the following, we establish a theoretical guarantee of the representation power of $\beta_{1}(\cdot)$ under a mild assumption on the widths of the neural network:

Theorem 2

For any treatment $\mathbf{A}\in\mathcal{A}\subset\{0,1\}^{K}$, if $\beta_{1}(\cdot)$ is a 3-layer fully-connected neural network defined in (9) satisfying $4\lfloor r_{1}/4\rfloor\lfloor r_{2}/(4r)\rfloor\geq|\mathcal{A}|$, then there exist parameters $\{\mathbf{U}_{l},\mathbf{b}_{l},\,l=1,2,3\}$ such that $\beta(\mathbf{A})$ satisfies the identifiability constraints and can take any values in $\mathbb{R}^{r}$.

The above result is adapted from recent work on the memorization capacity of neural networks (Yun et al., 2019; Bubeck et al., 2020). Theorem 2 shows that $\Omega(2^{K/2}r^{1/2})$ hidden nodes suffice for the neural network to represent all possible interaction effects in $\mathbb{R}^{r}$. However, recovering the parameter set $\{\mathbf{U}_{l},\mathbf{b}_{l},\,l=1,2,3\}$ of Theorem 2 via an optimization algorithm is not guaranteed, due to the non-convex loss surface of neural networks. In practice, the network widths in Theorem 2 can serve as a guide, and choosing a wider network is recommended to achieve better empirical performance.

In summary, we propose to formulate the treatment encoder as two decoupled parts: the additive treatment encoder and the interactive encoder. We provide two options for the interactive treatment encoder: the treatment dictionary and the neural network, where the neural network can improve the asymptotic convergence rate and empirical performance with guaranteed representation power. In the numerical studies, we use the neural network interactive treatment encoder for our proposed method, and a comprehensive comparison between the treatment dictionary and the neural network is provided in the supplementary materials.

3.1.2 Covariates Encoder

As introduced in (6), the covariates encoder $\alpha(\cdot):\mathcal{X}\rightarrow\mathbb{R}^{r}$ constitutes the function bases of the treatment effects for all combination treatments. In other words, the treatment effects represented in (6) lie in the space spanned by $\alpha^{(1)}(\mathbf{X}),\ldots,\alpha^{(r)}(\mathbf{X})$. Therefore, it is critical to consider a sufficiently large and flexible function space to accommodate highly complex treatment effects and avoid possible model misspecification. In particular, we adopt three nonlinear or nonparametric models for the covariates encoder: polynomial, B-spline (Hastie et al., 2009), and neural network (Goodfellow et al., 2016).

First, we introduce $\alpha(\mathbf{X})$ as a feed-forward neural network defined as follows:

$$\alpha(\mathbf{X})=\mathcal{T}_{L}\circ\sigma\circ\cdots\circ\sigma\circ\mathcal{T}_{1}(\mathbf{X}), \qquad (11)$$

where $\mathcal{T}_{l}(\mathbf{x})=\mathbf{T}_{l}\mathbf{x}+\mathbf{c}_{l}$ is a linear operator with weight matrix $\mathbf{T}_{l}\in\mathbb{R}^{r_{l}\times r_{l-1}}$ and biases $\mathbf{c}_{l}$, and the activation function is again the ReLU function $\sigma(\mathbf{x})=\max(\mathbf{x},0)$. Note that the depth and width of the covariates encoder $\alpha(\cdot)$ need not be identical to those of the interactive treatment encoder $\beta_{1}(\cdot)$; these are all tuning parameters determined through hyper-parameter tuning.

Figure 2: Model structure of the polynomial covariates encoder. Dashed lines indicate the fixed polynomial expansion procedure, and solid lines are trainable parameters for the linear combination of polynomials.

Even though neural networks achieve superior performance in many fields, their performance in small-sample problems, such as clinical trials or observational studies in medical research, is still deficient. In addition, neural networks lack interpretability due to their recursively composed structure, so their adoption in medical research remains limited. Here, we propose the polynomial and B-spline covariates encoders to incorporate nonlinear treatment effects with better interpretability. For the polynomial covariates encoder, we first expand each covariate $x_{i}$ into polynomials up to a specific order, $(x_{i},x_{i}^{2},\ldots,x_{i}^{d})$, where $d$ is a tuning parameter, and then take linear combinations of all polynomials as the output of the encoder. Figure 2 provides an example of the polynomial covariates encoder with $d=3$. Similarly, for the B-spline covariates encoder, we first expand each covariate into B-spline bases, where the number of knots and the spline degree are tuning parameters, and likewise adopt linear combinations of these B-spline bases as the output of the encoder. Although both polynomial and B-spline covariates encoders could also accommodate interaction terms among the polynomial or B-spline bases for a better approximation of multivariate functions, the number of parameters then grows exponentially as the dimension of the covariates or the degree of the bases increases. In the interest of computational feasibility, we do not consider interaction terms in the following discussion.
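A minimal sketch of the polynomial covariates encoder is given below, assuming a degree-$d$ expansion of each coordinate followed by a trainable linear layer; the class name and sizes are illustrative.

```python
import torch
import torch.nn as nn

class PolynomialEncoder(nn.Module):
    """Expand each covariate into (x, x^2, ..., x^d), then combine linearly into R^r."""

    def __init__(self, p, d, r):
        super().__init__()
        self.d = d
        # trainable linear combination of all p * d polynomial terms
        self.linear = nn.Linear(p * d, r)

    def forward(self, X):
        # fixed (non-trainable) polynomial expansion, as in Figure 2
        powers = torch.cat([X ** k for k in range(1, self.d + 1)], dim=1)
        return self.linear(powers)
```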

3.2 Budget Constrained Individualized Treatment Rule

In this section, we consider optimizing the assignment of combination treatments under budget constraints, where a total cost constraint is imposed on a population with sample size $n$.

We first introduce the budget-constrained ITR for binary treatments. Suppose we have a treatment ($A=1$) and a control ($A=-1$), and only $b\%$ of subjects can be treated with the treatment $A=1$. Luedtke and van der Laan (2016) define a contrast function $\eta(\mathbf{x})=\delta(\mathbf{x},1)-\delta(\mathbf{x},-1)$, and the corresponding ITR under the budget constraint is $d(\mathbf{x},b)=I(\eta(\mathbf{x})\geq q_{b})$, where $q_{b}$ is the upper $b\%$ quantile of the distribution of $\eta(\mathbf{x})$ for a given population with a finite sample size.

Estimating the optimal individualized treatment rule for combination treatments under budget constraints is challenging. First of all, the contrast function is no longer a valid tool to measure the treatment importance to each subject. Given the exponentially increasing choices of combination treatments, the number of contrast functions increases exponentially and the pairwise comparisons do not suffice to determine the optimal assignment. Second, costs over different channels may differ significantly, which makes the quantile no longer an effective criterion for allocating budgets.

In the following, we consider the constrained ITR problem for a finite population:

$$\max_{d}\ \mathcal{V}_{n}(d)\quad\text{s.t.}\quad\mathcal{C}_{n}(d)\leq B, \qquad (12)$$

where $\mathcal{V}_{n}$ and $\mathcal{C}_{n}$ are defined on a pre-specified population with sample size $n$. Here, the covariates $\mathbf{x}_{i}$ ($i=1,2,\ldots,n$) are treated as fixed. Based on the model formulation (6), maximizing the objective function of (12) is equivalent to:

$$\begin{aligned}\operatorname*{argmax}_{d}\mathcal{V}_{n}(d)&=\operatorname*{argmax}_{d}\frac{1}{n}\sum_{i=1}^{n}\big[m(\mathbf{x}_{i})+\alpha(\mathbf{x}_{i})^{T}\beta(d(\mathbf{x}_{i}))\big]\\&=\operatorname*{argmax}_{d}\frac{1}{n}\sum_{i=1}^{n}\alpha(\mathbf{x}_{i})^{T}\beta(d(\mathbf{x}_{i}))\\&=\operatorname*{argmax}_{\{d_{ij}\}}\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{|\mathcal{A}|}\delta_{ij}d_{ij},\end{aligned}$$

where $\delta_{ij}=\alpha(\mathbf{x}_{i})^{T}\beta(\mathbf{a}_{j})$ denotes the treatment effect of the $j$th combination treatment on the $i$th subject, and $d_{ij}=I\{d(\mathbf{x}_{i})=\mathbf{a}_{j}\}\in\{0,1\}$ indicates whether the $i$th subject receives the $j$th combination treatment. Since each subject can receive only one combination treatment, we impose the constraint $\sum_{j}d_{ij}=1$. Similarly, the budget constraint can be formulated as $\frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{|\mathcal{A}|}c_{\tilde{a}_{j}}d_{ij}\leq B$, where $c_{\tilde{a}_{j}}$ is the cost of treatment $\tilde{a}_{j}$ calculated from the cost vector $\mathbf{c}$. The constrained individualized treatment rule can then be solved as follows:

$$\max_{\{d_{ij}\}}\ \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{|\mathcal{A}|}\delta_{ij}d_{ij},\quad\text{s.t.}\ \frac{1}{n}\sum_{i=1}^{n}\sum_{j=1}^{|\mathcal{A}|}c_{\tilde{a}_{j}}d_{ij}\leq B,\quad\sum_{j}d_{ij}=1,\quad d_{ij}\in\{0,1\},\ \text{for any }i,j. \qquad (13)$$

The above optimization problem is equivalent to a multi-choice knapsack problem (Kellerer et al., 2004). In the binary treatment setting, the solution of (13) reduces to the quantile rule for $\eta(\mathbf{X})$, which is thus a special case of our formulation.

To understand the connection between the constrained individualized treatment rule and the multi-choice knapsack problem, note that the priority of a treatment is associated with the notion of dominance in the multi-choice knapsack problem: for any $i\in\{1,2,\ldots,n\}$, if $\delta_{ik}>\delta_{il}$ and $c_{\mathbf{a}_{k}}<c_{\mathbf{a}_{l}}$, then treatment $l$ is dominated by treatment $k$. In other words, treatment $k$ achieves a better outcome than treatment $l$ at a lower cost. Thus, the dominance property can serve as an alternative to contrast functions in the combination treatment setting.
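For intuition, dominated options can be pruned per subject before solving (13); a small self-contained sketch of the dominance check (a simplification, not part of the paper's algorithm):

```python
def prune_dominated(options):
    """Drop options dominated in the MCKP sense: some other option for the
    same subject has a strictly larger effect at a strictly lower cost.
    `options` is a list of (cost, effect) pairs for one subject."""
    return [(c, e) for c, e in options
            if not any(e2 > e and c2 < c for c2, e2 in options)]
```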

Here $\delta_{ij}$ denotes the treatment effects, and no parametric assumptions are required, so this framework is also applicable to other methods that provide estimates of treatment effects, such as the $l_{1}$-penalized least squares (Qian and Murphy, 2011) and outcome weighted learning with multinomial deviance (Huang et al., 2019). However, the objective function in (12) depends on the estimation of $\delta_{ij}$, and we show in Theorem 4 that the value reduction under budget constraints is bounded by the estimation error of the $\delta_{ij}$'s. Consequently, estimation bias in $\delta_{ij}$ could lead to biased results in (13). Since the proposed model (6) provides efficient and accurate estimation of the treatment effects, it is particularly well-suited for solving the budget-constrained individualized treatment rule for combination treatments.

3.3 Estimation and Implementation

In this section, we introduce the estimation and hyper-parameter tuning procedures of the proposed method for the unconstrained and constrained individualized treatment rules for combination treatments.

3.3.1 Estimation of the Double Encoder Model

First of all, we propose the following doubly robust estimator for the treatment effects:

$$\hat{\alpha}(\cdot),\hat{\beta}(\cdot)=\operatorname*{argmin}_{\alpha(\cdot),\beta(\cdot)}\mathbb{E}\bigg\{\frac{1}{\hat{\mathbb{P}}(\mathbf{A}_{i}|\mathbf{X}_{i})}\big(Y_{i}-\hat{m}(\mathbf{X}_{i})-\alpha(\mathbf{X}_{i})^{T}\beta(\mathbf{A}_{i})\big)^{2}\bigg\}, \qquad (14)$$

where $\hat{\mathbb{P}}(\mathbf{A}_{i}|\mathbf{X}_{i})$ is a working model of the propensity score, specifying the probability of treatment assignment given the pre-treatment covariates, and $\hat{m}(\mathbf{X}_{i})$ is a working model of the treatment-free effects. The inverse probability weights given by the propensity scores balance the samples assigned to different combination treatments; the propensity scores are equal across treatments under the randomized clinical trial setting. Removing the treatment-free effects $m(\mathbf{x})$ from the responses before estimating the treatment effects improves numerical stability and reduces the variance of the estimator, as also observed in (Zhou et al., 2017; Fu et al., 2016). Furthermore, the estimator in (14) is doubly robust: if either $\hat{\mathbb{P}}(\cdot|\cdot)$ or $\hat{m}(\cdot)$ is correctly specified, $\hat{\alpha}(\cdot)^{T}\hat{\beta}(\cdot)$ is a consistent estimator of the treatment effects; a detailed proof is provided in the supplementary materials. This result extends those of (Meng and Qiao, 2020) from binary and multiple treatments to combination treatments. Empirically, we minimize the sample average of the loss function (14) with additional penalties: an $L_{2}$ penalty is imposed on the additive treatment encoder to avoid overfitting, and an $L_{1}$ penalty is added to the interactive treatment encoder since interaction effects are usually sparse (Wu and Hamada, 2011).
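A sketch of the penalized empirical version of (14), reusing the `DoubleEncoderModel` sketch from Section 3.1; `m_hat` and `p_hat` are precomputed working-model values, and `lam_a`, `lam_i` are the penalty coefficients $\lambda_{a}$, $\lambda_{i}$.

```python
def dem_loss(model, X, A, y, m_hat, p_hat, lam_a, lam_i):
    # inverse-propensity-weighted squared error: empirical version of (14)
    resid = y - m_hat - model.treatment_effect(X, A)
    loss = (resid ** 2 / p_hat).mean()
    # L2 penalty on the additive encoder, L1 penalty on the interactive encoder
    loss = loss + lam_a * sum(p.pow(2).sum() for p in model.beta0.parameters())
    loss = loss + lam_i * sum(p.abs().sum() for p in model.beta1.parameters())
    return loss
```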

In this work, the working model of the propensity score is obtained via penalized multinomial logistic regression (Friedman et al., 2010). Specifically, the multinomial logistic model is parameterized by $\gamma_{1},\gamma_{2},\ldots,\gamma_{2^{K}}\in\mathbb{R}^{p}$:

$$\mathbb{P}(\tilde{A}=k|\mathbf{x})=\frac{\exp(\gamma_{k}^{T}\mathbf{x})}{\sum_{k^{\prime}=1}^{2^{K}}\exp(\gamma_{k^{\prime}}^{T}\mathbf{x})}.$$

The parameters $\gamma_{k}$ can be estimated by maximizing the penalized likelihood:

$$\max_{\gamma_{1},\ldots,\gamma_{2^{K}}}\frac{1}{n}\sum_{i=1}^{n}\bigg[\sum_{k=1}^{2^{K}}\gamma_{k}^{T}\mathbf{x}_{i}\,I(\tilde{A}_{i}=k)-\log\Big\{\sum_{k^{\prime}=1}^{2^{K}}\exp(\gamma_{k^{\prime}}^{T}\mathbf{x}_{i})\Big\}\bigg]-\lambda\sum_{j=1}^{p}\Big(\sum_{k=1}^{2^{K}}\gamma_{kj}^{2}\Big)^{1/2},$$

where the group lasso (Meier et al., 2008) is used to penalize each covariate's parameters across all treatment groups. A potential issue of propensity score estimation is that the estimated probabilities can be negligible when there are many possible treatments, which leads to unstable estimators of the treatment effects. To alleviate this limitation of inverse probability weighting, we stabilize the propensity scores (Xu et al., 2010) by multiplying the weights by the empirical frequency of the corresponding treatment.
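The working propensity model can be fit with standard software. The sketch below uses scikit-learn's multinomial logistic regression with an $\ell_{2}$ penalty as a stand-in for the group-lasso fit described above (an assumption, since group lasso is not built into scikit-learn), and assumes the combination treatments are encoded as $0,\ldots,|\mathcal{A}|-1$:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_stabilized_weights(X, A_tilde):
    """Working propensity model plus frequency-stabilized inverse-probability weights."""
    clf = LogisticRegression(penalty="l2", C=1.0, max_iter=1000).fit(X, A_tilde)
    # estimated probability of each subject's own observed treatment
    p_hat = clf.predict_proba(X)[np.arange(len(A_tilde)), A_tilde]
    # stabilization: multiply 1 / p_hat by the empirical treatment frequency
    freq = np.bincount(A_tilde, minlength=len(clf.classes_)) / len(A_tilde)
    return freq[A_tilde] / p_hat
```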

For the estimation of the treatment-free effects $m(\cdot)$, we adopt a two-layer neural network:

$$m(\mathbf{x})=(\mathbf{w}_{m}^{2})^{T}\sigma(\mathbf{W}_{m}^{1}\mathbf{x}), \qquad (15)$$

where $\mathbf{w}_{m}^{2}\in\mathbb{R}^{h}$ and $\mathbf{W}_{m}^{1}\in\mathbb{R}^{h\times p}$ are weight matrices, and $\sigma(x)$ is the ReLU function. The width $h$ controls the complexity of this model. The weight matrices are estimated by minimizing:

$$\min_{\mathbf{w}_{m}^{2},\mathbf{W}_{m}^{1}}\frac{1}{n}\sum_{i=1}^{n}\big[y_{i}-(\mathbf{w}_{m}^{2})^{T}\sigma(\mathbf{W}_{m}^{1}\mathbf{x}_{i})\big]^{2}.$$
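A sketch of the working model (15); the `nn.Linear` biases are disabled to match the bias-free form of (15), and the hidden width `h` is a tuning parameter.

```python
import torch.nn as nn

def treatment_free_net(p, h):
    # two-layer ReLU network m(x) = w2^T sigma(W1 x) as in (15);
    # trained by minimizing the mean squared error between y_i and m(x_i)
    return nn.Sequential(nn.Linear(p, h, bias=False),
                         nn.ReLU(),
                         nn.Linear(h, 1, bias=False))
```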

Given the working models $\hat{m}(\mathbf{x})$ and $\hat{\mathbb{P}}(\mathbf{a}|\mathbf{x})$ for the treatment-free effects and the propensity scores, we propose to optimize the double encoder alternately. The detailed algorithm is listed in Algorithm 1. Specifically, we employ the Adam optimizer (Kingma and Ba, 2014) for each encoder: the covariates encoder $\alpha(\mathbf{x})$, the additive treatment encoder $\beta_{0}(\mathbf{a})$, and the interactive treatment encoder $\beta_{1}(\mathbf{a})$. To stabilize the optimization over iterations, we utilize an exponential scheduler (Patterson and Gibson, 2017), which decays the learning rate by a constant factor per epoch; in all of our numerical studies, we use 0.95 as the decay constant. To ensure the identifiability of the treatment effects, we also require the constraint $\sum_{\mathbf{a}\in\mathcal{A}}\beta(\mathbf{a})=0$ on the treatment encoder $\beta(\cdot)$. To satisfy this constraint, we add a normalization layer before the output of $\beta(\cdot)$, which subtracts a weighted mean vector whose weights are the reciprocals of the combination treatment occurrences in the batch. Since this operation only centers the outputs, the theoretical guarantees for $\beta(\cdot)$ in Section 4 still hold. Once the algorithm converges, we obtain the estimates of $\alpha(\cdot)$ and $\beta(\cdot)$, and thus the estimated individualized treatment rule $\hat{d}(\cdot)$ via (3).

Algorithm 1 Double encoder model training algorithm
Input: Training data $(\mathbf{x}_{i},\mathbf{a}_{i},y_{i})_{i=1}^{n}$; working models $\hat{m}(\mathbf{x})$, $\hat{\mathbb{P}}(\mathbf{a}|\mathbf{x})$; hyper-parameters, including network structure-related hyper-parameters (e.g., network depths $L_{\alpha},L_{\beta}$, network widths $r_{\alpha},r_{\beta}$, encoder output dimension $r$) and optimization-related hyper-parameters (e.g., additive treatment encoder penalty coefficient $\lambda_{a}$, interactive treatment encoder penalty coefficient $\lambda_{i}$, mini-batch size $B$, learning rate $\eta$, and number of training epochs $E$).
Initialization: Initialize the parameters in $\hat{\alpha}^{(0)}(\mathbf{x})$, $\hat{\beta}_{0}^{(0)}(\mathbf{a})$, and $\hat{\beta}_{1}^{(0)}(\mathbf{a})$.
Training:
for $e$ in $1:E$ do
     for each mini-batch sampled from $(\mathbf{x}_{i},\mathbf{a}_{i},y_{i})_{i=1}^{n}$ do
          $\hat{\alpha}^{(e)}(\mathbf{x})=\operatorname*{argmin}_{\alpha}\frac{1}{B}\sum\frac{1}{\hat{\mathbb{P}}(\mathbf{a}_{i}|\mathbf{x}_{i})}\big\{y_{i}-\hat{m}(\mathbf{x}_{i})-\alpha(\mathbf{x}_{i})^{T}(\hat{\beta}_{0}^{(e-1)}(\mathbf{a}_{i})+\hat{\beta}_{1}^{(e-1)}(\mathbf{a}_{i}))\big\}^{2}$
          $\hat{\beta}_{0}^{(e)}(\mathbf{a})=\operatorname*{argmin}_{\beta_{0}}\frac{1}{B}\sum\frac{1}{\hat{\mathbb{P}}(\mathbf{a}_{i}|\mathbf{x}_{i})}\big\{y_{i}-\hat{m}(\mathbf{x}_{i})-\hat{\alpha}^{(e)}(\mathbf{x}_{i})^{T}(\beta_{0}(\mathbf{a}_{i})+\hat{\beta}_{1}^{(e-1)}(\mathbf{a}_{i}))\big\}^{2}+\lambda_{a}\lVert\beta_{0}\rVert_{2}$
          $\hat{\beta}_{1}^{(e)}(\mathbf{a})=\operatorname*{argmin}_{\beta_{1}}\frac{1}{B}\sum\frac{1}{\hat{\mathbb{P}}(\mathbf{a}_{i}|\mathbf{x}_{i})}\big\{y_{i}-\hat{m}(\mathbf{x}_{i})-\hat{\alpha}^{(e)}(\mathbf{x}_{i})^{T}(\hat{\beta}_{0}^{(e)}(\mathbf{a}_{i})+\beta_{1}(\mathbf{a}_{i}))\big\}^{2}+\lambda_{i}\lVert\beta_{1}\rVert_{1}$
     end for
end for
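A condensed PyTorch sketch of the alternating updates in Algorithm 1, reusing the `DoubleEncoderModel` and `dem_loss` sketches above; `m_hat_fn` and `p_hat_fn` are the fitted working models, and all names are illustrative.

```python
import torch

def train_dem(model, loader, m_hat_fn, p_hat_fn, epochs, lr, lam_a, lam_i):
    groups = {"alpha": model.alpha, "beta0": model.beta0, "beta1": model.beta1}
    opts = {k: torch.optim.Adam(g.parameters(), lr=lr) for k, g in groups.items()}
    # exponential learning-rate decay with constant 0.95, as in the paper
    scheds = {k: torch.optim.lr_scheduler.ExponentialLR(o, gamma=0.95)
              for k, o in opts.items()}
    for _ in range(epochs):
        for X, A, y in loader:
            for k in ("alpha", "beta0", "beta1"):   # alternate over the encoders
                model.zero_grad()
                loss = dem_loss(model, X, A, y, m_hat_fn(X), p_hat_fn(X, A),
                                lam_a, lam_i)
                loss.backward()
                opts[k].step()                      # update only this encoder
        for s in scheds.values():
            s.step()
    return model
```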

In addition, successful neural network training usually requires careful hyper-parameter tuning. The proposed double encoder model includes multiple hyper-parameters: network structure-related hyper-parameters (e.g., network depths $L_{\alpha}$ and $L_{\beta}$, network widths $r_{\alpha}$ and $r_{\beta}$, encoder output dimension $r$) and optimization-related hyper-parameters (e.g., additive treatment encoder penalty coefficient $\lambda_{a}$, interactive treatment encoder penalty coefficient $\lambda_{i}$, mini-batch size $B$, learning rate $\eta$, and number of training epochs $E$). These hyper-parameters induce an extremely large search space, which makes grid search (Yu and Zhu, 2020) practically infeasible. Instead, we randomly sample 50 hyper-parameter settings in each experiment from a pre-specified search space (a detailed specification is provided in the supplementary materials), and select the setting that attains the largest value function on an independent validation set, as sketched below. Furthermore, due to the non-convexity of the loss function, the convergence of the algorithm also relies heavily on the parameter initialization; detailed numerical results under different parameter initializations are provided in the supplementary materials.
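The random search itself can be sketched in a few lines; `sample_config`, `fit`, and `value_on` are placeholders for the search-space sampler, the training routine, and the validation-set value estimate.

```python
def random_search(n_trials, sample_config, fit, value_on, val_data):
    best_model, best_value = None, float("-inf")
    for _ in range(n_trials):            # e.g., 50 sampled settings per experiment
        config = sample_config()         # draw hyper-parameters from the search space
        model = fit(config)
        v = value_on(model, val_data)    # estimated value function on validation set
        if v > best_value:
            best_model, best_value = model, v
    return best_model, best_value
```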

3.3.2 Budget-constrained ITR estimation

In the following, we introduce our estimation procedure for the budget-constrained individualized treatment rule for combination treatments in (12). We use the plug-in estimates $\hat{\alpha}(\mathbf{x}_{i})^{T}\hat{\beta}(\mathbf{a}_{j})$ for $\delta_{ij}$, and compute the cost of each combination treatment from the cost vector $\mathbf{c}$ via $c_{\mathbf{a}_{j}}=\mathbf{a}_{j}^{T}\mathbf{c}$. We then apply a dynamic programming algorithm to solve (12) with the plug-in $\hat{\delta}_{ij}$'s. Although the multi-choice knapsack problem is NP-hard, it can still be solved in pseudo-polynomial time (Kellerer et al., 2004). Specifically, we denote by $\hat{Z}_{l}(b)$ the optimal value of $\frac{1}{n}\sum_{i=1}^{l}\sum_{j=1}^{|\mathcal{A}|}\hat{\delta}_{ij}d_{ij}$ for the first $l$ subjects under the budget constraint $\frac{1}{n}\sum_{i=1}^{l}\sum_{j=1}^{|\mathcal{A}|}c_{\tilde{a}_{j}}d_{ij}\leq b$. We let $\hat{Z}_{l}(b)=-\infty$ if no solution exists and $\hat{Z}_{0}(b)=0$. We define the budget space as $\mathcal{B}=\{b:0\leq b\leq B\}$, covering all possible average costs for the $n$ subjects, where $0$ is the minimal cost when no treatment is applied and the maximal cost is the specified budget $B$. Once the iterative algorithm terminates, the optimal objective value is $\hat{Z}_{n}(B)$ and the optimal treatment assignment is the output $\{d_{ij}:i=1,\ldots,n,\,j=1,\ldots,|\mathcal{A}|\}$. The detailed procedure is given in Algorithm 2.

Algorithm 2 Pseudo-code of the dynamic programming algorithm
1: Input: Treatment effects $\{\hat{\delta}_{ij}:i=1,2,\ldots,n,\,j=1,\ldots,|\mathcal{A}|\}$, costs $\{c_{\tilde{A}_{j}}:j=1,\ldots,|\mathcal{A}|\}$, budget $B$.
2: Initialize: $\hat{Z}_{0}(b)\leftarrow 0$ for $b\in\mathcal{B}=\{b:0\leq b\leq B\}$; $l\leftarrow 0$
3: while $l<n$ do
4:     $l\leftarrow l+1$
5:     for $b\in\mathcal{B}$ do
6:         $\hat{Z}_{l}(b)\leftarrow\max_{j:b>c_{\tilde{A}_{j}}}\hat{Z}_{l-1}(b-c_{\tilde{A}_{j}})+\hat{\delta}_{lj}/n$
7:         $d_{lj}\leftarrow 1$ if $j=\operatorname*{argmax}_{j:b>c_{\tilde{A}_{j}}}\hat{Z}_{l-1}(b-c_{\tilde{A}_{j}})+\hat{\delta}_{lj}/n$; otherwise $d_{lj}\leftarrow 0$
8:     end for
9: end while
10: Output: $\{d_{ij}:i=1,\ldots,n,\,j=1,\ldots,|\mathcal{A}|\}$
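A runnable sketch of Algorithm 2 under the simplifying assumption that costs are integers (as required for a pseudo-polynomial DP); the budget `B_total` is expressed as a total over subjects rather than a per-subject average, and the final value is divided by $n$ to match the scale of $\hat{Z}_{n}(B)$.

```python
import numpy as np

def mckp_dp(delta, cost, B_total):
    """Multi-choice knapsack DP for (13).
    delta: (n, J) estimated treatment effects; cost: (J,) integer costs;
    B_total: integer total budget over all n subjects."""
    n, J = delta.shape
    Z = np.zeros(B_total + 1)                  # Z_0(b) = 0 for all budgets b
    choice = np.zeros((n, B_total + 1), dtype=int)
    for i in range(n):
        Z_new = np.full(B_total + 1, -np.inf)  # -inf where no assignment is feasible
        for b in range(B_total + 1):
            for j in range(J):
                if cost[j] <= b and Z[b - cost[j]] + delta[i, j] > Z_new[b]:
                    Z_new[b] = Z[b - cost[j]] + delta[i, j]
                    choice[i, b] = j
        Z = Z_new
    # backtrack the assignment achieving Z_n(B_total)
    d, b = np.zeros(n, dtype=int), B_total
    for i in range(n - 1, -1, -1):
        d[i] = choice[i, b]
        b -= cost[d[i]]
    return Z[B_total] / n, d                   # (approximate value, assignments)
```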

4 Theoretical Guarantees

In this section, we establish the theoretical properties of the ITR estimation for combination treatments and the proposed method. First, we establish the value reduction bound for the combination treatments, either with or without budget constraints. Second, we provide a non-asymptotic excess risk bound for the double encoder model, which achieves a faster convergence rate compared with existing methods for multi-arm treatment problems.

4.1 Value Reduction Bound

The value reduction is the difference between the value function of the optimal individualized treatment rule and that of the estimated individualized treatment rule. The value function under a desirable ITR is expected to converge to the value function under the optimal ITR as the sample size goes to infinity. Before presenting the main results, we introduce some necessary notations. The conditional expectation of the outcome $Y$ given the subject covariates $\mathbf{X}$ and the treatment $\mathbf{A}$ is denoted by $Q(\mathbf{X},\mathbf{A})=\mathbb{E}[Y|\mathbf{X},\mathbf{A}]$; the treatment-free effects can be rewritten as $m(\mathbf{X})=\mathbb{E}[Q(\mathbf{X},\mathbf{A})|\mathbf{X}]$, and the treatment effects can be denoted as $\delta(\mathbf{X},\mathbf{A})=Q(\mathbf{X},\mathbf{A})-m(\mathbf{X})$. In particular, the true and estimated treatment effects are denoted by $\delta^{*}(\cdot,\cdot)$ and $\hat{\delta}(\cdot,\cdot)$, respectively. In addition, we introduce an assumption on the treatment effects:

Assumption 2

For any $\epsilon>0$, there exist constants $C>0$ and $\gamma>0$ such that

$$\mathbb{P}\Big(\max_{\mathbf{A},\mathbf{A}^{\prime}\in\mathcal{A}}|\delta^{*}(\mathbf{X},\mathbf{A})-\delta^{*}(\mathbf{X},\mathbf{A}^{\prime})|\leq\epsilon\Big)\leq C\epsilon^{\gamma}. \qquad (16)$$

Assumption 2 is a margin condition characterizing the behavior of the boundary between different combination treatments. A larger value of $\gamma$ indicates that the treatment effects of different treatments are distinguishable with higher probability, suggesting that it is easier to find the optimal individualized treatment rule. Similar assumptions are required in the literature (Qian and Murphy, 2011; Zhao et al., 2012; Qi et al., 2020) to achieve a faster convergence rate of the value reduction bound.

The following theorem shows that the value reduction is bounded by the estimation error of the treatment effects, and the convergence rate can be improved if Assumption 2 holds:

Theorem 3

Suppose the treatment effects satisfy \delta^{*}(\cdot,\cdot)\in\mathcal{H}^{2}. For any estimator \hat{\delta}(\cdot,\cdot) and the corresponding decision rule \hat{d} such that \hat{d}(\mathbf{X})\in\operatorname*{argmax}_{\mathbf{A}\in\mathcal{A}}\hat{\delta}(\mathbf{X},\mathbf{A}), we have

\mathcal{V}(d^{*})-\mathcal{V}(\hat{d})\leq 2\max_{\mathbf{A}\in\mathcal{A}}\big\{\mathbb{E}[\delta^{*}(\mathbf{X},\mathbf{A})-\hat{\delta}(\mathbf{X},\mathbf{A})]^{2}\big\}^{1/2}. (17)

If Assumption 2 holds, the bound improves to

\mathcal{V}(d^{*})-\mathcal{V}(\hat{d})\leq C(\gamma)\max_{\mathbf{A}\in\mathcal{A}}\big\{\mathbb{E}[\delta^{*}(\mathbf{X},\mathbf{A})-\hat{\delta}(\mathbf{X},\mathbf{A})]^{2}\big\}^{(1+\gamma)/(2+\gamma)}, (18)

where C(γ)C(\gamma) is a constant that depends on CC and γ\gamma.

Theorem 3 connects the value reduction to the estimation error of the treatment effects \hat{\delta}(\cdot,\cdot): an accurate estimate of the treatment effects drives the estimated value \mathcal{V}(\hat{d}) toward the optimal value \mathcal{V}(d^{*}). Based on Theorem 3, we can further relate the value reduction bound to the excess risk of the proposed estimator:

Corollary 1

Define the expected risk of a function Q(\cdot,\cdot) as L(Q)=\mathbb{E}[Y-Q(\mathbf{X},\mathbf{A})]^{2}. Then for any estimator \hat{Q}(\cdot,\cdot) of Q(\cdot,\cdot), we have the following value reduction bound:

\mathcal{V}(d^{*})-\mathcal{V}(\hat{d})\leq 2\big\{L(\hat{Q})-L(Q^{*})\big\}^{1/2}.

Further, if Assumption 2 holds, the bound is tightened for \gamma>0:

\mathcal{V}(d^{*})-\mathcal{V}(\hat{d})\leq C(\gamma)\big\{L(\hat{Q})-L(Q^{*})\big\}^{(1+\gamma)/(2+\gamma)}.

Next, we consider the value reduction bound under budget constraints. Since the multi-choice knapsack problem formulated for the budget-constrained ITR is NP-hard (Kellerer et al., 2004), we adopt a pseudo-polynomial dynamic programming algorithm (Dudziński and Walukiewicz, 1987) to obtain an approximate solution. In the following, we analyze the theoretical properties of the approximated value function derived from Algorithm 2. Specifically, the approximated value function is the sum of the treatment effects of the first l subjects divided by the sample size n, namely \hat{Z}_{l}(b) in Algorithm 2. We write Z_{l}^{*}(b) for the approximated value function when the true treatment effects \delta_{ij}^{*} are plugged in. The following result shows that the approximated value function converges whenever the estimation error of \hat{\delta}(\cdot,\cdot) vanishes.

Theorem 4

For the approximated value function obtained from Algorithm 2 and any B>0, we have

|Z^{*}(B)-\hat{Z}(B)|\leq\frac{1}{n}\sum_{i=1}^{n}\max_{\tilde{A}_{j}\in\mathcal{A}}|\delta^{*}(\mathbf{x}_{i},\tilde{A}_{j})-\hat{\delta}(\mathbf{x}_{i},\tilde{A}_{j})|.

In other words, the approximated value function under budget constraints converges provided that \hat{\delta}(\cdot,\cdot) is a consistent estimator of the treatment effects. Note that the proposed estimator is doubly robust: it is consistent as long as either the propensity score model or the treatment-free effect model is correctly specified, which in turn guarantees the convergence of the value function and of the approximated value function under budget constraints.

4.2 Excess Risk Bound

In this subsection, we provide a non-asymptotic value reduction bound for the proposed DEM and show its improved convergence rate. Corollary 1 shows that the value reduction can be bounded by the excess risk between the true and estimated Q-functions. The excess risk thus serves as an intermediate tool for establishing the non-asymptotic properties of the proposed estimator, and it depends on the complexity of the underlying function class. For the proposed method, we focus on the function class \mathcal{Q}=\big\{Q:\mathcal{X}\times\mathcal{A}\rightarrow\mathbb{R}\,|\,Q(\mathbf{x},\mathbf{a})=m(\mathbf{x})+\alpha(\mathbf{x})^{T}\beta(\mathbf{a})\big\}, where m(\cdot), \alpha(\cdot) and \beta(\cdot) are defined in (15), (11) and (7). We establish the following excess risk upper bound for estimators in \mathcal{Q}:

Lemma 1

For any distribution of (\mathbf{X},\mathbf{A},Y) with \mathbb{E}[Y^{2}]\leq c_{1} and any function \hat{Q}\in\mathcal{Q}, with probability at least 1-2\epsilon,

L(\hat{Q})-L(Q^{*})\leq 8C\mathcal{R}_{n}(\mathcal{Q})+\sqrt{\frac{2c_{1}^{2}\log(1/\epsilon)}{n}}, (19)

where CC is the Lipschitz constant of L(Q)L(Q), and n(𝒬)\mathcal{R}_{n}(\mathcal{Q}) is the Rademacher complexity of 𝒬\mathcal{Q}.

Lemma 1 bounds the excess risk in Corollary 1 by the Rademacher complexity of \mathcal{Q}. However, a tight characterization of the Rademacher complexity of general neural networks remains an open problem, and existing bounds are mainly established under different norm constraints on the weight matrices (Bartlett et al., 2017; Golowich et al., 2018; Neyshabur et al., 2017, 2015). In this work, we focus on the following sub-class of \mathcal{Q} with L_{2} and spectral norm constraints:

\mathcal{Q}_{B_{m},B_{\alpha},B_{\beta}}=\big\{Q\in\mathcal{Q}:\lVert\mathbf{w}_{m}^{2}\rVert_{2}\leq B_{m},\ \lVert\mathbf{W}_{m}^{1}\rVert_{2,\infty}\leq B_{m},\ \lVert\mathbf{T}_{l}\rVert_{2}\leq B_{\alpha},\ \lVert\mathbf{U}_{l}\rVert_{2}\leq B_{\beta}\big\},

where \lVert\cdot\rVert_{2} denotes the L_{2}-norm for vectors and the spectral norm for matrices. For any matrix \mathbf{X}=(\mathbf{X}_{1},\ldots,\mathbf{X}_{p}) with ith column \mathbf{X}_{i}, we use \lVert\mathbf{X}\rVert_{2,\infty}=\max_{i}\lVert\mathbf{X}_{i}\rVert_{2} to denote the L_{2,\infty} norm of \mathbf{X}. We then establish the following upper bound on the Rademacher complexity of \mathcal{Q}_{B_{m},B_{\alpha},B_{\beta}}:

Lemma 2

Suppose \mathbb{E}[\lVert\mathbf{X}\rVert_{2}^{2}]\leq c_{2}^{2}. The Rademacher complexity of \mathcal{Q}_{B_{m},B_{\alpha},B_{\beta}} is upper bounded by:

\mathcal{R}_{n}(\mathcal{Q}_{B_{m},B_{\alpha},B_{\beta}})\leq 2B_{m}^{2}c_{2}\sqrt{\frac{h}{n}}+B_{\alpha}^{L_{\alpha}}B_{\beta}^{L_{\beta}}c_{2}\sqrt{\frac{K}{n}}. (20)

Lemma 2 provides an upper bound on the Rademacher complexity of \mathcal{Q}_{B_{m},B_{\alpha},B_{\beta}} at the rate O(n^{-1/2}). The first term of (20) is the upper bound for the function class of m(\mathbf{x}) in (15), which depends on the width h of the hidden layers: a larger h allows m(\mathbf{x}) to approximate a larger function space, at the price of a looser upper bound on the generalization error. The second term of (20) is associated with the function class of the inner product of the double encoders, with rate O(K^{1/2}n^{-1/2}). This rate grows with the number of treatments K rather than with |\mathcal{A}|, owing to the parameter-sharing feature of the interactive treatment encoder and the linearly growing input dimension of \beta(\cdot): the input of \beta(\cdot) is the combination treatment \mathbf{A} itself, and the parameters in the treatment encoder are shared by all combination treatments. Thus, the model complexity is proportional to K and to the product of the spectral norms of the weight matrices. Based on Lemmas 1 and 2, we derive the value reduction bound for the proposed method as follows:

Theorem 5

For any distribution of (\mathbf{X},\mathbf{A},Y) with \mathbb{E}[Y^{2}]\leq c_{1} and \mathbb{E}[\lVert\mathbf{X}\rVert_{2}^{2}]\leq c_{2}^{2}, consider neural networks in the subclass \mathcal{Q}_{B_{m},B_{\alpha},B_{\beta}}. With probability at least 1-2\epsilon, we have the following value reduction bound:

\mathcal{V}(d^{*})-\mathcal{V}(\hat{d})\leq 2\bigg\{16CB_{m}^{2}c_{2}\sqrt{\frac{h}{n}}+8CB_{\alpha}^{L_{\alpha}}B_{\beta}^{L_{\beta}}c_{2}\sqrt{\frac{K}{n}}+\sqrt{\frac{2c_{1}^{2}\log(1/\epsilon)}{n}}\bigg\}^{1/2}.

If Assumption 2 holds, we have a tighter bound for positive \gamma:

\mathcal{V}(d^{*})-\mathcal{V}(\hat{d})\leq C(\gamma)\bigg\{16CB_{m}^{2}c_{2}\sqrt{\frac{h}{n}}+8CB_{\alpha}^{L_{\alpha}}B_{\beta}^{L_{\beta}}c_{2}\sqrt{\frac{K}{n}}+\sqrt{\frac{2c_{1}^{2}\log(1/\epsilon)}{n}}\bigg\}^{(1+\gamma)/(2+\gamma)}.

Theorem 5 establishes a value reduction bound showing that the value of the estimated decision rule approaches the optimal value as the sample size increases. Compared with existing value reduction bounds for multi-arm treatments, the proposed method improves the dependence on the number of combinations from O(|\mathcal{A}|^{1/4}) to O((\log_{2}|\mathcal{A}|)^{1/4}). Further, the order of the value reduction bound approaches n^{-1/2} as \gamma goes to infinity, consistent with the convergence rates established in (Qian and Murphy, 2011; Qi et al., 2020).

5 Simulation Studies

In this section, we evaluate the performance of the proposed method in estimating the individualized treatment rule for combination treatments. Our numerical studies show that the proposed method achieves superior performance to competing methods in both unconstrained and budget-constrained scenarios.

5.1 Unconstrained ITR simulation

We first investigate the empirical performance of the proposed method without budget constraints. The pre-treatment covariates \mathbf{X}=(X_{1},\ldots,X_{10})\in\mathbb{R}^{10} are sampled independently and uniformly from (-1,1). Four simulation settings are designed to evaluate performance under varying conditions. In settings 1 and 2, we consider combinations of 3 treatments, which induces 8 possible combinations, 6 of which are considered as assigned treatments. Similarly, in settings 3 and 4, we consider combinations of 5 treatments and assume that 20 of the 32 possible combinations are assigned to subjects. The treatments are assigned either uniformly or following the propensity score model:

\mathbb{P}(\tilde{A}_{i}\,|\,\mathbf{X})=\frac{\exp\{0.2\,i\,(\mathbf{X}^{T}\beta)\}}{\sum_{j}\exp\{0.2\,j\,(\mathbf{X}^{T}\beta)\}}, (21)

and the marginal treatment assignment distribution is shown in Figure 3.

Figure 3: Treatment assignment distribution in simulation settings. The left panel is for simulation settings 1 and 2, and the right panel is for simulation settings 3 and 4.
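As an illustration, the following Python sketch generates the covariates and propensity-score-based treatment assignments described above. The coefficient vector \beta in (21) is not specified in the text, so an all-ones placeholder is used; the seed and sample size are likewise arbitrary choices.

import numpy as np

rng = np.random.default_rng(2023)           # seed chosen for illustration
n, p, J = 1000, 10, 6                       # settings 1-2: 6 assigned combinations
X = rng.uniform(-1.0, 1.0, size=(n, p))     # X_1,...,X_10 ~ Uniform(-1, 1)

beta = np.ones(p)                           # placeholder: beta is unspecified
score = X @ beta
# Propensity model (21): P(A_i | X) proportional to exp{0.2 * i * X^T beta}.
logits = 0.2 * np.arange(1, J + 1) * score[:, None]          # (n, J)
probs = np.exp(logits - logits.max(axis=1, keepdims=True))   # stabilized softmax
probs /= probs.sum(axis=1, keepdims=True)
A_index = np.array([rng.choice(J, p=probs[i]) for i in range(n)])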
Table 1: Simulation settings 1 and 2: specification of the treatment effect and interaction effect functions. The column “Treatment Effects” gives the effect functions of the individual treatments used in settings 1 and 2; the column “Interaction Effects” gives the interaction effects among individual treatments in setting 2.
Treatment 𝐀\mathbf{A} Treatment Effects Interaction Effects
(0,0,0)(0,0,0) 0 -
(0,0,1)(0,0,1) 2X1+exp(X3+X4)2X_{1}+\exp(X_{3}+X_{4}) -
(0,1,0)(0,1,0) 2X2log(X5)+X72X_{2}\log(X_{5})+X_{7} -
(0,1,1)(0,1,1) - sin(5X12)3(X20.5)2\sin(5X_{1}^{2})-3(X_{2}-0.5)^{2}
(1,0,0)(1,0,0) sin(X3)+2log(X4)+2log(X7)\sin(X_{3})+2\log(X_{4})+2\log(X_{7}) -
(1,1,1)(1,1,1) - 2sin((X2X4)2)2\sin((X_{2}-X_{4})^{2})
Table 2: Simulation settings 3 and 4: specification of the treatment effect and interaction effect functions. The column “Treatment Effects” gives the effect functions of the individual treatments used in settings 3 and 4; the column “Interaction Effects” gives the interaction effects among individual treatments in setting 4.
Treatment 𝐀\mathbf{A} Treatment Effects Interaction Effects
(0,0,0,0,0)(0,0,0,0,0) 0 -
(0,0,0,0,1)(0,0,0,0,1) (X10.25)3(X_{1}-0.25)^{3} -
(0,0,0,1,0)(0,0,0,1,0) 2log(X3)+4log(X8)cos(2πX10)2\log(X_{3})+4\log(X_{8})\cos(2\pi X_{10}) -
(0,0,1,0,0)(0,0,1,0,0) X2sin(X4)1X_{2}\sin(X_{4})-1 -
(0,0,1,0,1)(0,0,1,0,1) - exp(2X2)\exp(2X_{2})
(0,1,0,0,0)(0,1,0,0,0) (X1+X5X82)3(X_{1}+X_{5}-X_{8}^{2})^{3} -
(0,1,0,0,1)(0,1,0,0,1) - exp(2X4+X9)\exp(2X_{4}+X_{9})
(0,1,0,1,1)(0,1,0,1,1) - 4log(X6)-4\log(X_{6})
(0,1,1,0,0)(0,1,1,0,0) - 0
(0,1,1,1,0)(0,1,1,1,0) - 0
(1,0,0,0,0)(1,0,0,0,0) exp(X2X5)\exp(X_{2}-X_{5}) -
(1,0,0,0,1)(1,0,0,0,1) - 0
(1,0,0,1,0)(1,0,0,1,0) - 0
(1,0,1,0,0)(1,0,1,0,0) - 0
(1,0,1,0,1)(1,0,1,0,1) - 3/2cos(2πX1+X82)-3/2\cos(2\pi X_{1}+X_{8}^{2})
(1,1,0,0,0)(1,1,0,0,0) - 0
(1,1,0,0,1)(1,1,0,0,1) - 4log(X6)-4\log(X_{6})
(1,1,0,1,1)(1,1,0,1,1) - X62+1/2sin(2π/X7)X_{6}^{2}+1/2\sin(2\pi/X_{7})
(1,1,1,0,0)(1,1,1,0,0) - 0
(1,1,1,1,0)(1,1,1,1,0) - 0

In simulation setting 1, we assume that the treatment effects of a combination are additive in the individual treatments; the individual treatment effect functions are specified in the column “Treatment Effects” of Table 1. Building on setting 1, simulation setting 2 adds interaction effects among treatments, specified in the column “Interaction Effects” of Table 1, so that the treatment effect of a combination is the sum of the individual treatment effects and the interaction effect. Table 2 specifies the treatment effects and interaction effects for simulation settings 3 and 4 in the same manner: the treatment effects are additive in the individual treatments in setting 3, while interaction effects are added in setting 4. In summary, settings 1 and 3 evaluate the methods under additive treatment effects, and settings 2 and 4 under interactive treatment effects. A generic sketch of how these effect surfaces are assembled is given below.
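The following minimal Python sketch assembles a treatment-effect surface in the manner of settings 2 and 4: the effect of a combination is the sum of the individual effects of its active treatments plus a combination-specific interaction term (zero when none is listed). The stand-in functions in the example are simplified placeholders, not the exact specifications of Table 1.

import numpy as np

def delta(x, a, main_effects, interactions):
    # x: covariate vector; a: binary combination, e.g., (0, 1, 1).
    # main_effects: {treatment index k: function f_k(x)} (individual effects).
    # interactions: {combination tuple: function g_a(x)} (interaction effects).
    value = sum(f(x) for k, f in main_effects.items() if a[k] == 1)
    value += interactions.get(tuple(a), lambda x_: 0.0)(x)
    return value

# Toy stand-in functions (0-indexed covariates), for illustration only:
main = {0: lambda x: np.sin(x[2]), 1: lambda x: x[6], 2: lambda x: 2 * x[0]}
inter = {(0, 1, 1): lambda x: np.sin(5 * x[0] ** 2) - 3 * (x[1] - 0.5) ** 2}
print(delta(np.linspace(-1, 1, 10), (0, 1, 1), main, inter))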

For each simulation setting, the training sample size varies over 500, 1000, and 2000, and each configuration is repeated 200 times. We compare the proposed method with the following methods: the L_{1}-penalized least squares (L_{1}-pls, Qian and Murphy 2011), outcome weighted learning with multinomial deviance (OWL-MD, Huang et al. 2019), multicategory outcome weighted learning with linear decisions (MOWL-linear, Zhang and Liu 2014), outcome weighted learning with deep learning (OWL-DL, Liang et al. 2018), and the treatment-agnostic representation network (TARNet, Shalit et al. 2017). The empirical value functions and accuracies are reported in Tables 3 and 4, where the empirical value function (Qian and Murphy, 2011) is calculated via

\hat{\mathcal{V}}(d)=\frac{\mathbb{E}_{n}[YI\{d(\mathbf{X})=\mathbf{A}\}]}{\mathbb{E}_{n}[I\{d(\mathbf{X})=\mathbf{A}\}]},

where \mathbb{E}_{n} denotes the empirical average.
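A short sketch of this estimator follows, assuming the outcomes, observed treatments, and recommended treatments are stored as arrays (all names are illustrative):

import numpy as np

def empirical_value(Y, A, d_X):
    # Y: (n,) outcomes; A: (n, K) observed combinations;
    # d_X: (n, K) combinations recommended by the rule d.
    match = np.all(A == d_X, axis=1)      # indicator 1{d(X_i) = A_i}
    if not match.any():                   # no subject follows the rule
        return np.nan
    # E_n[Y 1{d(X)=A}] / E_n[1{d(X)=A}] reduces to the mean over matches.
    return Y[match].mean()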

Since simulation settings 1 and 3 do not include interaction effects among treatments, all competing methods except OWL-DL (Liang et al., 2018) are over-parameterized, while the proposed method adapts to the additive setting through a large \lambda_{i}. Therefore, the proposed method and OWL-DL outperform the other competing methods in both settings. In contrast, simulation settings 2 and 4 include complex interaction effects, and the performance of OWL-DL deteriorates because consistent estimation is not guaranteed for OWL-DL in the presence of interaction effects. Although the other competing methods are saturated models that can incorporate interaction effects, their estimation efficiency is still undermined because their decision functions are all treatment-specific, whereas the proposed method shares parameters across treatments. Consequently, the advantage of our method is more pronounced for small sample sizes or large K. Specifically, the proposed method improves the accuracy by 10.9% to 17.8% in simulation setting 4 when the sample size is 500. In addition, we compare the empirical performance of the double encoder model under different choices of covariate and treatment encoders; detailed simulation results are presented in the supplementary materials.

Table 3: Unconstrained simulation study: comparisons of value functions for the proposed method and existing methods, including the L_{1}-penalized least squares (L_{1}-pls, Qian and Murphy 2011), outcome weighted learning with multinomial deviance (OWL-MD, Huang et al. 2019), multicategory outcome weighted learning with linear decisions (MOWL-linear, Zhang et al. 2020), outcome weighted learning with deep learning (OWL-DL, Liang et al. 2018), and the treatment-agnostic representation network (TARNet, Shalit et al. 2017). Two treatment assignment schemes are presented: all treatments are uniformly assigned to subjects (Uniform), or treatments are assigned based on the propensity score model (21) (PS-based).
Value
Treatment Assignment Setting Sample Size Proposed L1L_{1}-pls OWL-MD MOWL-linear OWL-DL TARNet
Uniform 1 500 5.477(0.218) 3.643(0.158) 5.338(0.259) 5.206(0.185) 5.408(0.265) 5.320(0.278)
1000 5.622(0.206) 3.800(0.154) 5.510(0.251) 5.292(0.182) 5.512(0.259) 5.398(0.273)
2000 5.658(0.208) 3.989(0.149) 5.577(0.246) 5.384(0.179) 5.595(0.251) 5.411(0.272)
2 500 5.268(0.325) 3.870(0.291) 5.132(0.306) 5.078(0.291) 5.015(0.400) 5.008(0.405)
1000 5.418(0.312) 4.028(0.285) 5.302(0.299) 5.168(0.279) 5.105(0.398) 5.024(0.402)
2000 5.498(0.311) 4.191(0.282) 5.344(0.289) 5.211(0.272) 5.215(0.382) 5.118(0.386)
3 500 5.600(0.288) 3.562(0.235) 4.479(0.285) 5.216(0.240) 5.332(0.312) 5.354(0.305)
1000 5.667(0.268) 3.702(0.232) 4.987(0.279) 5.283(0.241) 5.403(0.310) 5.366(0.300)
2000 5.719(0.262) 3.855(0.230) 5.274(0.265) 5.459(0.232) 5.598(0.299) 5.423(0.289)
4 500 6.117(0.328) 4.490(0.264) 5.900(0.278) 5.948(0.277) 5.995(0.335) 5.895(0.328)
1000 6.374(0.319) 4.850(0.262) 6.200(0.272) 6.120(0.269) 6.012(0.321) 5.998(0.315)
2000 6.732(0.310) 5.252(0.254) 6.506(0.259) 6.494(0.262) 6.254(0.311) 6.057(0.310)
PS-based 1 500 5.415(0.238) 4.048(0.198) 5.061(0.215) 4.897(0.233) 5.218(0.273) 5.013(0.305)
1000 5.589(0.223) 4.087(0.198) 5.099(0.213) 4.959(0.231) 5.223(0.272) 5.018(0.298)
2000 5.662(0.219) 4.224(0.183) 5.178(0.201) 4.898(0.230) 5.238(0.279) 5.017(0.299)
2 500 5.005(0.324) 3.980(0.236) 4.815(0.254) 4.629(0.279) 4.635(0.336) 4.886(0.352)
1000 5.622(0.322) 4.042(0.235) 4.906(0.249) 4.700(0.276) 4.913(0.334) 5.021(0.341)
2000 5.658(0.320) 4.104(0.229) 5.005(0.245) 4.672(0.274) 4.998(0.326) 5.054(0.331)
3 500 5.665(0.330) 3.384(0.258) 3.540(0.269) 5.401(0.302) 5.505(0.352) 5.476(0.338)
1000 5.792(0.321) 4.560(0.248) 5.009(0.268) 5.519(0.303) 5.784(0.348) 5.676(0.308)
2000 5.796(0.318) 5.273(0.246) 5.307(0.259) 5.582(0.300) 5.788(0.338) 5.774(0.299)
4 500 5.630(0.356) 4.462(0.285) 3.816(0.305) 5.090(0.321) 5.028(0.405) 5.108(0.387)
1000 6.001(0.348) 5.432(0.355) 5.822(0.302) 5.134(0.319) 5.384(0.400) 5.338(0.379)
2000 6.289(0.345) 5.808(0.344) 6.141(0.299) 5.294(0.318) 5.684(0.389) 5.589(0.378)
Table 4: Unconstrained simulation study: comparisons of accuracies for the proposed method and existing methods, including the L_{1}-penalized least squares (L_{1}-pls, Qian and Murphy 2011), outcome weighted learning with multinomial deviance (OWL-MD, Huang et al. 2019), multicategory outcome weighted learning with linear decisions (MOWL-linear, Zhang et al. 2020), outcome weighted learning with deep learning (OWL-DL, Liang et al. 2018), and the treatment-agnostic representation network (TARNet, Shalit et al. 2017). Two treatment assignment schemes are presented: all treatments are uniformly assigned to subjects (Uniform), or treatments are assigned based on the propensity score model (21) (PS-based).
Accuracy
Treatment Assignment Setting Sample Size Proposed L1L_{1}-pls OWL-MD MOWL-linear OWL-DL TARNet
Uniform 1 500 0.622(0.052) 0.338(0.039) 0.572(0.040) 0.491(0.041) 0.598(0.046) 0.553(0.050)
1000 0.707(0.050) 0.358(0.038) 0.642(0.040) 0.507(0.040) 0.682(0.043) 0.640(0.050)
2000 0.738(0.050) 0.382(0.038) 0.694(0.039) 0.552(0.042) 0.710(0.042) 0.686(0.048)
2 500 0.544(0.058) 0.350(0.037) 0.515(0.041) 0.474(0.035) 0.488(0.045) 0.432(0.052)
1000 0.610(0.054) 0.367(0.034) 0.573(0.038) 0.505(0.033) 0.532(0.040) 0.462(0.051)
2000 0.630(0.051) 0.389(0.033) 0.605(0.036) 0.516(0.034) 0.568(0.041) 0.506(0.049)
3 500 0.420(0.041) 0.102(0.025) 0.148(0.031) 0.251(0.034) 0.271(0.047) 0.205(0.057)
1000 0.445(0.039) 0.099(0.024) 0.187(0.027) 0.285(0.033) 0.300(0.046) 0.268(0.053)
2000 0.464(0.039) 0.116(0.021) 0.254(0.027) 0.331(0.031) 0.353(0.048) 0.311(0.045)
4 500 0.324(0.045) 0.146(0.031) 0.205(0.032) 0.215(0.032) 0.154(0.047) 0.162(0.041)
1000 0.335(0.044) 0.191(0.031) 0.279(0.030) 0.229(0.032) 0.189(0.045) 0.193(0.043)
2000 0.372(0.041) 0.222(0.029) 0.323(0.029) 0.241(0.029) 0.228(0.044) 0.225(0.042)
PS-based 1 500 0.571(0.048) 0.317(0.031) 0.477(0.035) 0.430(0.037) 0.498(0.050) 0.432(0.053)
1000 0.648(0.044) 0.327(0.027) 0.502(0.033) 0.451(0.032) 0.525(0.047) 0.462(0.051)
2000 0.699(0.044) 0.332(0.028) 0.522(0.031) 0.448(0.031) 0.552(0.048) 0.481(0.050)
2 500 0.492(0.052) 0.284(0.037) 0.420(0.037) 0.397(0.036) 0.370(0.047) 0.375(0.048)
1000 0.566(0.051) 0.290(0.037) 0.443(0.034) 0.414(0.037) 0.397(0.045) 0.388(0.047)
2000 0.618(0.048) 0.291(0.037) 0.464(0.031) 0.406(0.032) 0.411(0.041) 0.425(0.049)
3 500 0.378(0.041) 0.053(0.030) 0.071(0.033) 0.223(0.036) 0.300(0.051) 0.248(0.050)
1000 0.439(0.041) 0.091(0.031) 0.181(0.032) 0.279(0.034) 0.376(0.041) 0.344(0.048)
2000 0.444(0.040) 0.137(0.029) 0.242(0.030) 0.327(0.031) 0.416(0.038) 0.378(0.047)
4 500 0.223(0.056) 0.101(0.038) 0.083(0.039) 0.090(0.043) 0.102(0.052) 0.084(0.061)
1000 0.267(0.053) 0.136(0.036) 0.195(0.041) 0.089(0.041) 0.121(0.051) 0.098(0.056)
2000 0.279(0.052) 0.205(0.037) 0.245(0.041) 0.101(0.039) 0.168(0.048) 0.127(0.055)

5.2 Budget-constrained ITR simulation

Table 5: Constrained simulation study: Comparisons of value functions for the proposed method and existing methods including the L1L_{1}-penalized least square (L1L_{1}-pls, Qian and Murphy 2011), the outcome weighted learning with multinomial deviance (OWL-MD, Huang et al. 2019), the multicategory outcome weighted learning with linear decisions (MOWL-linear, Zhang et al. 2020), and the treatment-agnostic representation network (TARNet, Shalit et al. 2017). All treatments are uniformly assigned to subjects.
Value
Sample Size Constraints Proposed L1L_{1}-pls OWL-MD MOWL-linear TARNet
500 20% 5.237(0.322) 3.445(0.268) 4.918(0.283) 4.909(0.279) 4.874(0.335)
50% 5.527(0.320) 3.756(0.277) 5.218(0.271) 5.215(0.273) 4.998(0.335)
80% 5.832(0.319) 4.108(0.267) 5.510(0.280) 5.499(0.276) 5.425(0.329)
100% 6.117(0.328) 4.490(0.264) 5.900(0.278) 5.948(0.277) 5.895(0.328)
1000 20% 5.498(0.321) 3.685(0.261) 5.175(0.270) 5.170(0.271) 5.047(0.320)
50% 5.841(0.315) 4.014(0.265) 5.487(0.269) 5.318(0.275) 5.274(0.318)
80% 6.102(0.320) 4.417(0.261) 5.612(0.273) 5.598(0.267) 5.418(0.315)
100% 6.374(0.319) 4.850(0.262) 6.200(0.272) 6.120(0.269) 5.998(0.315)
2000 20% 5.789(0.309) 4.108(0.249) 5.356(0.209) 5.317(0.266) 5.015(0.308)
50% 6.015(0.315) 4.437(0.251) 5.897(0.207) 5.778(0.261) 5.298(0.298)
80% 6.324(0.311) 4.847(0.255) 6.215(0.208) 6.117(0.264) 5.598(0.315)
100% 6.732(0.310) 5.252(0.254) 6.506(0.201) 6.494(0.262) 6.057(0.310)

In this subsection, we investigate the budget-constrained setting with the same data generation mechanism as simulation setting 4 in Section 5.1. For a fair comparison, we consider competing methods that provide a score measuring the utility or effect of each treatment, and we apply the proposed MCKP approach to all of them, since the framework (13) does not require a particular form of the treatment effect estimates.

For the budget constraint, we let the second treatment be the most critical, or most urgently needed, treatment for the population. We therefore constrain the amount of the second treatment so that only a fraction of patients can receive it. The quantiles of the constrained populations are 20\%, 50\%, 80\%, and 100\%, where the last case imposes only a trivial constraint and is equivalent to the unconstrained setting. One way to encode this constraint within the MCKP formulation is sketched below.
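The snippet below is a hypothetical sketch, under our reading of the cost-vector construction in (12): charge unit cost to any combination containing the second treatment and zero cost otherwise, so that the average-cost constraint caps the fraction of subjects receiving that treatment. Enumerating all 2^K combinations is a simplification, since settings 3 and 4 assign only 20 of them.

import itertools
import numpy as np

K = 5                                         # setting 4 uses 5 treatments
combos = np.array(list(itertools.product([0, 1], repeat=K)))  # all 32 combinations
cost_vec = np.zeros(K)
cost_vec[1] = 1.0                             # treatment 2 is the scarce one
combo_costs = combos @ cost_vec               # 1 iff a combination uses treatment 2
q = 0.5                                       # e.g., the 50% constraint
# With these costs, the MCKP constraint (1/n) * sum_ij c_{a_j} d_ij <= q caps
# the fraction of subjects assigned any combination containing treatment 2.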

The simulation results are provided in Table 5 and clearly indicate that the proposed method outperforms the other methods in the constrained cases. Moreover, the proposed method incurs smaller reductions of the value function when budget constraints are imposed: the value functions of the competing methods are reduced by about 0.9 when the budget decreases from 100\% to 20\%. More precisely, when the sample size is 2000, compared with the best-performing competing method, the proposed method improves the value function by 8.08\% when the budget is 20\%, and by 3.47\% when the budget is 100\%. The larger improvement under limited budgets indicates that the proposed method estimates the treatment effects more accurately, and thus yields better individualized treatment rules under restrictive constraints.

6 Application to patient-derived xenograft study

In this section, we apply our method to patient-derived xenograft (PDX) data to inform optimal personalized treatments for cancer patients. Due to ethical issues and other limitations of randomized clinical trials in cancer studies, recent works have utilized patient-derived xenografts (PDXs) to perform large-scale screening in mice for evaluating cancer therapies (Gao et al., 2015). Specifically, samples of primary solid tumors are collected from patients through surgery or biopsy (Hidalgo et al., 2014), and each tumor is implanted into multiple mice to create a PDX line, within which multiple treatments can be applied simultaneously. Meanwhile, high-throughput genomic assays, such as RNA and DNA sequencing of the solid tumors, are collected as pre-treatment covariates; hence the mice within one PDX line share the same pre-treatment covariates. The primary outcome of interest in these studies is the tumor size or growth rate. An illustration of the PDX data collection procedure is provided in Figure 4.

Figure 4: Illustration of the PDX data collection. Tumor samples from a patient are implanted into multiple mice, where a PDX line is formed by these mice. Different treatments can be applied simultaneously. Tumor size, which is the primary interest of the outcome, is measured for each mouse. RNA and DNA sequencing and other features are collected as pre-treatment covariates.

The original data are from the Novartis PDX study (Gao et al., 2015), and we follow the screening and pre-processing steps of Rashid et al. (2020) to obtain the genomic data and tumor measurements. Our focus is colorectal cancer (CRC): 37 PDX lines are investigated, and each of the 13 treatments listed in Table 6 is administered to every PDX line, yielding 481 observations. In contrast to randomized clinical trials or observational studies, the outcomes of each subject (PDX line) are fully observed under all treatments, so the positivity assumption is intrinsically satisfied. In addition, the pre-treatment covariates are 94 genetic features associated with 50 genes, selected by the unsupervised and supervised screening of Rashid et al. (2020); these covariates are inherently balanced since all treatment groups include exactly the same PDX lines. Furthermore, the outcome of interest is the scaled maximum observed shrinkage of tumor size from baseline, where a larger value is more desirable.

For the budget-constrained setting, we impose the following treatment costs: \$79 for BKM120, \$100 for LJC049, \$66 for BYL719, \$240 for cetuximab, \$124 for encorafenib, \$500 for LJM716, and \$79 for binimetinib, where the prices for LJC049 and LJM716 are hypothetical and the other prices are unit-dosage prices from https://www.goodrx.com. We consider hypothetical budgets of \$21,000, \$15,000, \$10,000, and \$5,000 for the 37 PDX lines, where \$21,000 is equivalent to the unconstrained scenario because it suffices to cover the most expensive combination treatment for all PDX lines. The combination costs follow from the cost vector as illustrated below.
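For instance, the per-combination cost follows from c_{\mathbf{a}}=\mathbf{a}^{T}\mathbf{c}; the small check below reproduces the \$5,365 total budget of the BYL719 + binimetinib rule reported in Table 8 (the array ordering and variable names are illustrative).

import numpy as np

# Order: BKM120, LJC049, BYL719, cetuximab, encorafenib, LJM716, binimetinib
unit_cost = np.array([79.0, 100.0, 66.0, 240.0, 124.0, 500.0, 79.0])
a = np.array([0, 0, 1, 0, 0, 0, 1])   # BYL719 + binimetinib
per_line = a @ unit_cost              # c_a = a^T c = 145 per PDX line
print(per_line * 37)                  # 5365 for all 37 lines (cf. Table 8)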

To implement the proposed and competing methods, we randomly split the dataset into training (25 PDX lines), validation (6 PDX lines), and testing (6 PDX lines) sets. All methods are trained on the training set, and hyper-parameters are tuned on the validation set. The value function in the unconstrained scenario is calculated on the testing set via \hat{\mathcal{V}}(d)=\mathbb{E}_{n}[YI\{d(\mathbf{X})=\mathbf{A}\}]/\mathbb{E}_{n}[I\{d(\mathbf{X})=\mathbf{A}\}]. For the budget-constrained scenarios, we apply the selected model to all PDX lines to obtain estimates of the treatment effects, then apply the MCKP algorithm to all PDX lines and calculate the value function over the 37 PDX lines. Finally, we repeat this random splitting 100 times for comparison.

Table 6: Single and combination treatments considered for treating colorectal cancer in the PDX lines.
Type BKM120 LJC049 BYL719 cetuximab encorafenib LJM716 binimetinib
Single 1 0 0 0 0 0 0
Single 0 1 0 0 0 0 0
Single 0 0 1 0 0 0 0
Single 0 0 0 1 0 0 0
Single 0 0 0 0 1 0 0
Single 0 0 0 0 0 0 1
Comb 1 1 0 0 0 0 0
Comb 0 0 1 1 0 0 0
Comb 0 0 1 1 1 0 0
Comb 0 0 1 0 1 0 0
Comb 0 0 1 0 0 1 0
Comb 0 0 1 0 0 0 1
Comb 0 0 0 1 1 0 0
Table 7: Means and standard errors of the value function under different budget constraints. The \$21,000 budget is equivalent to the unconstrained scenario.
Budgets Proposed method L1L_{1}-PLS MOWL-linear MOWL-kernel OWL-DL TARNet
$ 21,000 0.142(0.199) -0.157(0.227) 0.054(0.205) -0.313(0.251) -0.298(0.268) -0.254(0.209)
$ 15,000 0.103(0.196) -0.188(0.231) -0.007(0.207) -0.312(0.255) -0.318(0.264) -0.283(0.210)
$ 10,000 0.067(0.200) -0.209(0.222) -0.077(0.198) -0.330(0.250) -0.320(0.267) -0.299(0.208)
$ 5,000 0.050(0.202) -0.282(0.216) -0.103(0.199) -0.367(0.249) -0.367(0.259) -0.348(0.207)

In the following, we report the means and standard deviations of the value functions under different budget constraints in Table 7. For the unconstrained scenario (\$21,000 budget), the proposed method achieves a substantial improvement in value. As a reference, the value functions under one-size-fits-all rules and under the optimal treatment assignment are shown in Table 8: our proposed method narrows the value gap between the optimal assignment and the one-size-fits-all rules by 23.0\% to 81.1\%. Compared with the competing methods that also estimate individualized treatment rules, our proposed method narrows the value gap between the optimal assignment and those methods by 40.4\% to 65.8\%.

For the budget-constrained scenarios, the proposed method dominates the competing methods: even under the most restrictive budget, its value function remains comparable to what the competing methods achieve in the unconstrained scenario. Compared with one-size-fits-all rules, the proposed method with about a \$5,000 budget already exceeds the value of the best one-size-fits-all rule, the combination of BYL719 and binimetinib, which requires a \$5,365 budget. In summary, the proposed method controls tumor size more effectively than the competing methods and the one-size-fits-all rules, and could have great potential for improving therapy quality for colorectal cancer patients.

Table 8: Value functions under one-size-fits-all rules and under the optimal treatment assignment.
Treatment Value Budget Treatment Value Budget
BKM120 + LJC049 -0.684 $6,623 BKM120 -0.350 $2,923
BYL719 + cetuximab -0.232 $11,322 LJC049 -1.168 $3,700
BYL719 + cetuximab + encorafenib -0.388 $15,910 BYL719 -0.371 $2,442
BYL719 + encorafenib -0.562 $7,030 cetuximab -0.633 $8,880
BYL719 + LJM716 -0.103 $20,942 encorafenib -1.083 $4,588
BYL719 + binimetinib 0.047 $5,365 binimetinib -0.425 $2,923
cetuximab + encorafenib -0.749 $13,468 Optimal 0.447 $8,171

7 Discussion

In this paper, we broaden the scope of individualized treatment rule (ITR) estimation from binary and multi-arm treatments to combination treatments, where the treatments within each combination can interact with each other. We propose the Double Encoder Model (DEM), a nonparametric approach that accommodates the intricate treatment effects of combination treatments. Specifically, our method overcomes the curse of dimensionality by adopting neural network treatment encoders, whose parameter-sharing feature enhances estimation efficiency so that the proposed method can outperform parametric approaches even with small sample sizes. In addition, we adapt the estimated ITR to budget-constrained scenarios through the multi-choice knapsack framework, which strengthens the proposed method in situations with limited resources. Theoretically, we provide value reduction bounds with and without budget constraints, and an improved convergence rate with respect to the number of treatments under the DEM.

Several research directions are worth exploring further. First, the proposed method employs a propensity score model to achieve double robustness. However, inverse probability weighting can be fragile in observational studies with combination treatments because the positivity assumption may be violated, a phenomenon also observed in the binary treatment scenario with high-dimensional covariates (D'Amour et al., 2021). Existing works overcome this limitation in the binary and multi-arm settings by substituting overlap weights for the propensity score (Li, 2019; Li et al., 2019), but this strategy does not resolve the issue for combination treatments. Exploring alternative approaches for combination treatment problems is therefore a worthwhile direction.

Second, compared with binary treatments, combination treatments allow us to optimize multiple outcomes of interest simultaneously. The major challenge with multiple outcomes is that each combination treatment may favor only a few outcomes, so an optimal ITR must achieve a trade-off among them. Recent works have studied trade-offs between the outcome of interest and risk factors (Wang et al., 2018; Huang and Xu, 2020), but trade-offs among multiple outcomes could be more challenging.

Furthermore, interpretability is another desirable property of an individualized treatment rule, especially in medical settings. The proposed method leverages neural networks for their estimation and theoretical advantages, but its interpretation is not straightforward. Existing works (Laber and Zhao, 2015; Zhang et al., 2015) propose tree-type methods for better interpretability in binary or multi-arm settings, but these may not be directly applicable to combination treatments. The explainable machine learning literature offers post-hoc, model-agnostic approaches (Lundberg and Lee, 2017; Shrikumar et al., 2017) that could be adapted, although more sophisticated adaptation might be needed for the combination treatment problem.

References

  • Bartlett et al. (2017) Bartlett, P., Foster, D. J. and Telgarsky, M. (2017) Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498.
  • Bhattacharya and Dupas (2012) Bhattacharya, D. and Dupas, P. (2012) Inferring welfare maximizing treatment assignment under budget constraints. Journal of Econometrics, 167, 168–196.
  • Bozic et al. (2013) Bozic, I., Reiter, J. G., Allen, B., Antal, T., Chatterjee, K., Shah, P., Moon, Y. S., Yaqubie, A., Kelly, N., Le, D. T. et al. (2013) Evolutionary dynamics of cancer in response to targeted combination therapy. elife, 2, e00747.
  • Braveman et al. (2011) Braveman, P. A., Egerter, S. A. and Mockenhaupt, R. E. (2011) Broadening the focus: the need to address the social determinants of health. American journal of preventive medicine, 40, S4–S18.
  • Bubeck et al. (2020) Bubeck, S., Eldan, R., Lee, Y. T. and Mikulincer, D. (2020) Network size and weights size for memorization with two-layers neural networks. arXiv preprint arXiv:2006.02855.
  • Caspi et al. (2010) Caspi, A., Hariri, A. R., Holmes, A., Uher, R. and Moffitt, T. E. (2010) Genetic sensitivity to the environment: the case of the serotonin transporter gene and its implications for studying complex diseases and traits. American journal of Psychiatry, 167, 509–527.
  • Clifton and Laber (2020) Clifton, J. and Laber, E. (2020) Q-learning: Theory and applications. Annual Review of Statistics and Its Application, 7, 279–301.
  • Dudziński and Walukiewicz (1987) Dudziński, K. and Walukiewicz, S. (1987) Exact methods for the knapsack problem and its generalizations. European Journal of Operational Research, 28, 3–21.
  • D’Amour et al. (2021) D’Amour, A., Ding, P., Feller, A., Lei, L. and Sekhon, J. (2021) Overlap in observational studies with high-dimensional covariates. Journal of Econometrics, 221, 644–654.
  • Forrest and Tamura (2010) Forrest, G. N. and Tamura, K. (2010) Rifampin combination therapy for nonmycobacterial infections. Clinical Microbiology Reviews, 23, 14–34.
  • Friedman et al. (2010) Friedman, J., Hastie, T. and Tibshirani, R. (2010) Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1.
  • Fu et al. (2016) Fu, H., Zhou, J. and Faries, D. E. (2016) Estimating optimal treatment regimes via subgroup identification in randomized control trials and observational studies. Statistics in Medicine, 35, 3285–3302.
  • Gao et al. (2015) Gao, H., Korn, J. M., Ferretti, S., Monahan, J. E., Wang, Y., Singh, M., Zhang, C., Schnell, C., Yang, G., Zhang, Y. et al. (2015) High-throughput screening using patient-derived tumor xenografts to predict clinical trial drug response. Nature medicine, 21, 1318–1325.
  • Golowich et al. (2018) Golowich, N., Rakhlin, A. and Shamir, O. (2018) Size-independent sample complexity of neural networks. In Conference On Learning Theory, 297–299. PMLR.
  • Goodfellow et al. (2016) Goodfellow, I., Bengio, Y. and Courville, A. (2016) Deep Learning. MIT press.
  • Hastie et al. (2009) Hastie, T., Tibshirani, R., Friedman, J. H. and Friedman, J. H. (2009) The elements of statistical learning: data mining, inference, and prediction, vol. 2. Springer.
  • Hidalgo et al. (2014) Hidalgo, M., Amant, F., Biankin, A. V., Budinská, E., Byrne, A. T., Caldas, C., Clarke, R. B., de Jong, S., Jonkers, J., Mælandsmo, G. M. et al. (2014) Patient-derived xenograft models: an emerging platform for translational cancer research. Cancer discovery, 4, 998–1013.
  • Holland (1986) Holland, P. W. (1986) Statistics and causal inference. Journal of the American Statistical Association, 81, 945–960.
  • Huang et al. (2019) Huang, X., Goldberg, Y. and Xu, J. (2019) Multicategory individualized treatment regime using outcome weighted learning. Biometrics, 75, 1216–1227.
  • Huang and Xu (2020) Huang, X. and Xu, J. (2020) Estimating individualized treatment rules with risk constraint. Biometrics, 76, 1310–1318.
  • Kaddour et al. (2021) Kaddour, J., Zhu, Y., Liu, Q., Kusner, M. J. and Silva, R. (2021) Causal effect inference for structured treatments. Advances in Neural Information Processing Systems, 34, 24841–24854.
  • Kalra et al. (2010) Kalra, S., Kalra, B. and Agrawal, N. (2010) Combination therapy in hypertension: An update. Diabetology & metabolic syndrome, 2, 1–11.
  • Kellerer et al. (2004) Kellerer, H., Pferschy, U. and Pisinger, D. (2004) Multidimensional knapsack problems. In Knapsack Problems, 235–283. Springer.
  • Kingma and Ba (2014) Kingma, D. P. and Ba, J. (2014) Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • Kitagawa and Tetenov (2018) Kitagawa, T. and Tetenov, A. (2018) Who should be treated? empirical welfare maximization methods for treatment choice. Econometrica, 86, 591–616.
  • Korkut et al. (2015) Korkut, A., Wang, W., Demir, E., Aksoy, B. A., Jing, X., Molinelli, E. J., Babur, Ö., Bemis, D. L., Sumer, S. O., Solit, D. B. et al. (2015) Perturbation biology nominates upstream–downstream drug combinations in RAF inhibitor resistant melanoma cells. Elife, 4, e04640.
  • Kosorok and Laber (2019) Kosorok, M. R. and Laber, E. B. (2019) Precision medicine. Annual Review of Statistics and Its Application, 6, 263–286.
  • Laber and Zhao (2015) Laber, E. B. and Zhao, Y.-Q. (2015) Tree-based methods for individualized treatment regimes. Biometrika, 102, 501–514.
  • Li (2019) Li, F. (2019) Propensity score weighting for causal inference with multiple treatments. The Annals of Applied Statistics, 13, 2389–2415.
  • Li et al. (2019) Li, F., Thomas, L. E. and Li, F. (2019) Addressing extreme propensity scores via the overlap weights. American journal of epidemiology, 188, 250–257.
  • Li et al. (2018) Li, M., Zhu, L., Chen, L., Li, N. and Qi, F. (2018) Assessment of drug–drug interactions between voriconazole and glucocorticoids. Journal of Chemotherapy, 30, 296–303.
  • Liang et al. (2018) Liang, M., Ye, T. and Fu, H. (2018) Estimating individualized optimal combination therapies through outcome weighted deep learning algorithms. Statistics in Medicine, 37, 3869–3886.
  • Lu et al. (2013) Lu, W., Zhang, H. H. and Zeng, D. (2013) Variable selection for optimal treatment decision. Statistical Methods in Medical Research, 22, 493–504.
  • Luedtke and van der Laan (2016) Luedtke, A. R. and van der Laan, M. J. (2016) Optimal individualized treatments in resource-limited settings. The International Journal of Biostatistics, 12, 283–303.
  • Lundberg and Lee (2017) Lundberg, S. M. and Lee, S.-I. (2017) A unified approach to interpreting model predictions. Advances in neural information processing systems, 30.
  • Maruthur et al. (2016) Maruthur, N. M., Tseng, E., Hutfless, S., Wilson, L. M., Suarez-Cuervo, C., Berger, Z., Chu, Y., Iyoha, E., Segal, J. B. and Bolen, S. (2016) Diabetes medications as monotherapy or metformin-based combination therapy for type 2 diabetes: A systematic review and meta-analysis. Annals of Internal Medicine, 164, 740–751.
  • Matrajt et al. (2021) Matrajt, L., Eaton, J., Leung, T. and Brown, E. R. (2021) Vaccine optimization for covid-19: Who to vaccinate first? Science Advances, 7.
  • Meier et al. (2008) Meier, L., Van De Geer, S. and Bühlmann, P. (2008) The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70, 53–71.
  • Meng and Qiao (2020) Meng, H. and Qiao, X. (2020) Doubly robust direct learning for estimating conditional average treatment effect. arXiv preprint arXiv:2004.10108.
  • Mokhtari et al. (2017) Mokhtari, R. B., Homayouni, T. S., Baluch, N., Morgatskaya, E., Kumar, S., Das, B. and Yeger, H. (2017) Combination therapy in combating cancer. Oncotarget, 8, 38022.
  • Möttönen et al. (1999) Möttönen, T., Hannonen, P., Leirisalo-Repo, M., Nissilä, M., Kautiainen, H., Korpela, M., Laasonen, L., Julkunen, H., Luukkainen, R., Vuori, K. et al. (1999) Comparison of combination therapy with single-drug therapy in early rheumatoid arthritis: A randomised trial. The Lancet, 353, 1568–1573.
  • Neyshabur et al. (2017) Neyshabur, B., Bhojanapalli, S. and Srebro, N. (2017) A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564.
  • Neyshabur et al. (2015) Neyshabur, B., Tomioka, R. and Srebro, N. (2015) Norm-based capacity control in neural networks. In Conference on Learning Theory, 1376–1401. PMLR.
  • Patterson and Gibson (2017) Patterson, J. and Gibson, A. (2017) Deep Learning: A Practitioner's Approach. O'Reilly Media, Inc.
  • Qi et al. (2020) Qi, Z., Liu, D., Fu, H. and Liu, Y. (2020) Multi-armed angle-based direct learning for estimating optimal individualized treatment rules with various outcomes. Journal of the American Statistical Association, 115, 678–691.
  • Qian and Murphy (2011) Qian, M. and Murphy, S. A. (2011) Performance guarantees for individualized treatment rules. Annals of Statistics, 39, 1180.
  • Rashid et al. (2020) Rashid, N. U., Luckett, D. J., Chen, J., Lawson, M. T., Wang, L., Zhang, Y., Laber, E. B., Liu, Y., Yeh, J. J., Zeng, D. et al. (2020) High-dimensional precision medicine from patient-derived xenografts. Journal of the American Statistical Association, 116, 1140–1154.
  • Rodrigues (2019) Rodrigues, A. D. (2019) Drug-drug interactions. CRC Press.
  • Rubin (1974) Rubin, D. B. (1974) Estimating causal effects of treatments in randomized and nonrandomized studies. Journal of Educational Psychology, 66, 688.
  • Schmieder et al. (2015) Schmieder, R. E., Gitt, A. K., Koch, C., Bramlage, P., Ouarrak, T. and Tschöpe, D. (2015) Achievement of individualized treatment targets in patients with comorbid type-2 diabetes and hypertension: 6 months results of the dialogue registry. BMC Endocrine Disorders, 15, 1–12.
  • Shalit et al. (2017) Shalit, U., Johansson, F. D. and Sontag, D. (2017) Estimating individual treatment effect: generalization bounds and algorithms. In International Conference on Machine Learning, 3076–3085. PMLR.
  • Shi et al. (2019) Shi, C., Blei, D. and Veitch, V. (2019) Adapting neural networks for the estimation of treatment effects. Advances in neural information processing systems, 32.
  • Shi et al. (2018) Shi, C., Fan, A., Song, R. and Lu, W. (2018) High-dimensional a-learning for optimal dynamic treatment regimes. Annals of Statistics, 46, 925.
  • Shrikumar et al. (2017) Shrikumar, A., Greenside, P. and Kundaje, A. (2017) Learning important features through propagating activation differences. In International conference on machine learning, 3145–3153. PMLR.
  • Stader et al. (2020) Stader, F., Khoo, S., Stoeckle, M., Back, D., Hirsch, H. H., Battegay, M. and Marzolini, C. (2020) Stopping lopinavir/ritonavir in covid-19 patients: duration of the drug interacting effect. Journal of Antimicrobial Chemotherapy, 75, 3084–3086.
  • Tamma et al. (2012) Tamma, P. D., Cosgrove, S. E. and Maragakis, L. L. (2012) Combination therapy for treatment of infections with gram-negative bacteria. Clinical Microbiology Reviews, 25, 450–470.
  • Wang et al. (2018) Wang, Y., Fu, H. and Zeng, D. (2018) Learning optimal personalized treatment rules in consideration of benefit and risk: with an application to treating type 2 diabetes patients with insulin therapies. Journal of the American Statistical Association, 113, 1–13.
  • Wu and Hamada (2011) Wu, C. J. and Hamada, M. S. (2011) Experiments: planning, analysis, and optimization. John Wiley & Sons.
  • Xu et al. (2010) Xu, S., Ross, C., Raebel, M. A., Shetterly, S., Blanchette, C. and Smith, D. (2010) Use of stabilized inverse propensity scores as weights to directly estimate relative risk and its confidence intervals. Value in Health, 13, 273–277.
  • Xue et al. (2021) Xue, F., Zhang, Y., Zhou, W., Fu, H. and Qu, A. (2021) Multicategory angle-based learning for estimating optimal dynamic treatment regimes with censored data. Journal of the American Statistical Association, 1–14.
  • Yu and Zhu (2020) Yu, T. and Zhu, H. (2020) Hyper-parameter optimization: A review of algorithms and applications. arXiv preprint arXiv:2003.05689.
  • Yun et al. (2019) Yun, C., Sra, S. and Jadbabaie, A. (2019) Small relu networks are powerful memorizers: A tight analysis of memorization capacity. Advances in Neural Information Processing Systems, 32.
  • Zhang et al. (2020) Zhang, C., Chen, J., Fu, H., He, X., Zhao, Y.-Q. and Liu, Y. (2020) Multicategory outcome weighted margin-based learning for estimating individualized treatment rules. Statistica Sinica, 30, 1857.
  • Zhang and Liu (2014) Zhang, C. and Liu, Y. (2014) Multicategory angle-based large-margin classification. Biometrika, 101, 625–640.
  • Zhang et al. (2015) Zhang, Y., Laber, E. B., Tsiatis, A. and Davidian, M. (2015) Using decision lists to construct interpretable and parsimonious treatment regimes. Biometrics, 71, 895–904.
  • Zhao and Ding (2023) Zhao, A. and Ding, P. (2023) Covariate adjustment in multiarmed, possibly factorial experiments. Journal of the Royal Statistical Society Series B: Statistical Methodology, 85, 1–23.
  • Zhao et al. (2012) Zhao, Y., Zeng, D., Rush, A. J. and Kosorok, M. R. (2012) Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107, 1106–1118.
  • Zhou and Kosorok (2017) Zhou, X. and Kosorok, M. R. (2017) Augmented outcome-weighted learning for optimal treatment regimes. arXiv preprint arXiv:1711.10654.
  • Zhou et al. (2017) Zhou, X., Mayer-Hamblett, N., Khan, U. and Kosorok, M. R. (2017) Residual weighted learning for estimating individualized treatment rules. Journal of the American Statistical Association, 112, 169–187.