Smoothed Concordance-Assisted Learning for Optimal Treatment Decision in High Dimensional Data
Abstract
An optimal treatment regime is an individualized treatment decision rule that maximizes the expected treatment outcome. A simple example of a treatment decision rule is the linear decision rule, which is characterized by its coefficients and its threshold. As data on patients’ heterogeneity accumulate, it is of interest to estimate the optimal treatment regime with a linear decision rule in high-dimensional settings. At a single decision time point, the optimal treatment regime can be estimated by concordance-assisted learning (CAL), which is based on pairwise comparisons. CAL is flexible and achieves good results in low dimensions. However, because it contains an indicator function, CAL is difficult to optimize in high dimensions. Recently, researchers proposed a smoothing approach that replaces indicator functions with a family of cumulative distribution functions. In this paper, we introduce smoothed concordance-assisted learning (SMCAL), which applies this smoothing method to CAL using a family of sigmoid functions. We then prove convergence rates for the estimated coefficients by analyzing the approximation and stochastic errors when the covariates are continuous. We also consider the discrete-covariate case and establish similar results. Simulation studies are conducted, demonstrating the advantages of our method.
Keywords: optimal treatment regime, precision medicine, smoothing approximation, monotonic single-index model
1 Introduction
In order to make precise treatment decisions, we have to take patients’ heterogeneity into account. Decision rules based on patients’ own features are called individualized treatment rules. In a binary treatment decision setting, are the treatment indicators and
are the i.i.d. observations, where and are the outcome and features of subject . Suppose is the expected outcome for subject with features and treatment . If , treatment is more favorable to subject compared to treatment . Assuming the linear decision rule , a treatment regime is determined by the coefficients and the threshold . Mean treatment outcomes can be modeled as
An optimal treatment regime is the treatment regime that yields the most favorable mean treatment outcome. The optimal treatment decision problem with multiple decision time points is called the optimal dynamic treatment regime problem. There are currently two leading dynamic treatment learning approaches. The first is Q-learning by Watkins, (1989). Watkins and Dayan, (1992) proved the convergence property of Q-learning, and Song et al., (2011) extended Q-learning to penalized Q-learning (PQ-learning). The second is advantage learning (A-learning) by Murphy, (2003). Blatt et al., (2004) showed that A-learning is likely to have smaller bias than Q-learning.
Due to the rapid accumulation of patients’ heterogeneity data, researchers began to consider the high-dimensional setting. Zhu et al., (2019) extended Q-learning to high-dimensional Q-learning, with a focus on two-stage dynamic treatment regimes. Shi et al., (2018) proposed penalized A-learning (PAL), which applies the Dantzig selector under the A-learning framework, with a focus on two- or three-stage settings.
For a single timepoint treatment decision, Zhang et al., (2012) used a doubly robust augmented inverse probability weighted estimator to handle possible model misspecification. Zhao et al., (2012) adopted the support vector machine framework and proposed outcome weighted learning (OWL). Song et al., (2015) modified OWL into penalized outcome weighted learning (POWL). Lu et al., (2013) introduced a robust estimator that does not require estimating the baseline function and makes variable selection easy in high dimensions. CAL by Fan et al., (2017) uses the contrast between two treatments to construct a concordance function, and then uses the concordance function to estimate the coefficients . CAL has a fast convergence rate, and it does not assume that the contrast function is a linear combination of the features. They assumed the stable unit treatment value assumption (SUTVA) in Rubin, (1980),
(1.1) |
and the no-unmeasured-confounders condition,
(1.2) |
The proposed concordance function is
(1.3) |
where the contrast function . Within the concordance function, an important assumption is that is concordant with , which means
(1.4) |
where is an unknown monotone increasing function. Using the propensity score , their is estimated by its unbiased estimator
where can be an arbitrary function independent of , usually chosen as the mean response of the patients who received treatment . The true and the true threshold in the linear decision rule are estimated by
(1.5) |
and
(1.6) |
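To make the pairwise structure concrete, the empirical concordance objective with the original indicator loss can be sketched as follows. This is a minimal illustration in our own notation, not the paper's exact estimator: `contrast` stands for already-computed (e.g. propensity-weighted) estimates of the contrast function for each subject.

```python
import numpy as np

def concordance_objective(beta, X, contrast):
    """Empirical pairwise concordance with the original 0-1 (indicator) loss.

    Schematic sketch only: `X` is the (n, p) covariate matrix and `contrast`
    holds hypothetical per-subject contrast estimates assumed to be computed
    beforehand (e.g. via the propensity-weighted estimator described above).
    """
    score = X @ beta                                       # linear index for each subject
    diff = contrast[:, None] - contrast[None, :]           # D_i - D_j over all ordered pairs
    ind = (score[:, None] > score[None, :]).astype(float)  # indicator 1{score_i > score_j}
    n = len(score)
    return float((diff * ind).sum() / (n * (n - 1)))
```

A coefficient vector that ranks subjects concordantly with the contrast yields a larger objective value than one that reverses the ranking, which is exactly what maximizing the concordance function exploits.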
In the high-dimensional setting, however, the indicator function inside the concordance function makes the optimization of CAL difficult. To deal with this issue, Liang et al., (2018) proposed SCAL, which uses the hinge loss to replace , with the penalty added for variable selection. They estimate and by
and
SCAL achieves the -rate but induces a relatively large bias due to the difference between the hinge loss and the original 0-1 loss.
Our way to overcome the optimization difficulty of CAL for high-dimensional data is based on the smoothing method proposed by Han et al., (2017). A family of sigmoid functions is used to substitute for the indicator function in CAL, and the penalty is added. We then employ the coordinate descent algorithm for optimization. We call our method Smoothed Concordance-Assisted Learning (SMCAL).
Compared to SCAL, SMCAL has a slower convergence rate but achieves a much smaller bias. Numerical comparisons with POWL, SCAL and PAL demonstrate the advantage of our method, especially when the sample size is relatively large. Beyond the continuous-covariate cases of the SCAL setting, we also extend SMCAL to discrete-covariate cases.
There are three main differences between our work and Han et al., (2017). First, they focused on the maximum rank correlation (MRC) proposed in Han, (1987), whereas our smoothing is applied to the concordance function. Second, their paper studied the monotone transformation model, while we focus on the monotone single-index model and obtain a faster convergence rate. Third, we impose weaker assumptions than the normality assumption they place on the covariates, and we extend our method to discrete cases.
The rest of this paper is organized as follows: Section 2 introduces SMCAL and describes the coordinate descent algorithm. Section 3 displays the -error rate in continuous cases and the -error rate in discrete cases, which are the main results of our paper. In Section 4, we conduct numerical comparisons with SCAL, PAL and POWL. A real data application to STAR*D dataset is provided in Section 5. The proofs of lemmas and theorems in Section 3 are left to the supplementary material.
2 Methods
2.1 The Model Set-up
We still assume Assumptions (1.1), (1.2) and (1.4) posed in Fan et al., (2017). Our smoothing procedure approximates their concordance function by
(2.1) |
where we use a family of sigmoid functions
to replace in Fan et al., (2017). is a positive constant depending on . An unbiased estimator of is
(2.2) |
Without loss of generality, we can set . Define
where is a constant to be chosen. will converge to as goes to infinity.
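As a quick illustration of this convergence, the sigmoid approximation to the indicator sharpens as its scale constant grows (the scale values below are arbitrary, chosen only for the demonstration):

```python
import numpy as np

def sigmoid(t):
    """Smoothed indicator: sigmoid(scale * t) approaches 1{t > 0} as scale grows."""
    return 1.0 / (1.0 + np.exp(-t))

# Away from t = 0, the gap between the sigmoid and the indicator shrinks
# monotonically as the scale constant increases.
t = np.array([-0.5, -0.1, 0.1, 0.5])
gaps = [float(np.abs(sigmoid(a * t) - (t > 0)).max()) for a in (1.0, 10.0, 100.0)]
```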
Define , where is a constant so that . Notice that maximizing is equivalent to maximizing
we can write our loss function as
(2.3) |
Then we use the following two steps to estimate and .
(2.4) |
(2.5) |
Our is different from due to the stochastic factors and the penalty. We will call the approximation error and the stochastic error.
2.2 Coordinate Descent Algorithm
We can iteratively apply proximal gradient descent to each coordinate to solve the optimization problem. The step size is a fixed constant.
Consider . When a pair satisfies , we likely have , which means . Because is convex when , our optimization problem is likely to be convex when is close to . and can be chosen by cross-validation.
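A minimal sketch of the proximal update underlying this scheme, using the standard soft-thresholding operator for the L1 penalty (the toy objective, step size and penalty level below are ours, chosen only so the answer is known in closed form):

```python
import numpy as np

def soft_threshold(z, lam):
    """Proximal operator of lam * |.| (the shrinkage step for an L1 penalty)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def proximal_gradient(grad, x0, lam, step=0.1, iters=2000):
    """Generic proximal gradient ascent for a smooth concave objective plus an
    L1 penalty; applying this update one coordinate at a time gives a
    coordinate-descent scheme of the kind described above."""
    x = np.asarray(x0, dtype=float)
    for _ in range(iters):
        x = soft_threshold(x + step * grad(x), step * lam)
    return x

# Toy check: maximize -(1/2)(b - 2)^2 - 0.5*|b|; the penalized maximizer is 1.5.
b = proximal_gradient(lambda b: -(b - 2.0), np.array([0.0]), lam=0.5)
```

The one-dimensional check mirrors the role of cross-validation in practice: the penalty level shifts the maximizer away from the unpenalized optimum by exactly the shrinkage amount.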
3 Convergence Rates of the Estimated Coefficients
In this section, we establish the error rates for under continuous and then discrete covariate cases. Let . Assume the true coefficients are sparse. Let represent the set of nonzero indices in , , and define . Let represent all the entries of whose indices belong to the set . The notation means that if , or equivalently , then there exist and such that for all , we have . If and , then we write .
3.1 Continuous Cases
3.1.1 Approximation Error
To analyze the approximation error, we assume the following assumptions.
Assumption 1.
(A1) is twice differentiable and .
(A2) For all , and , we have .
(A3) Let , where the derivative is taken with respect to every coordinate except the first one because is fixed. Assume .
Remark 1.
(A2) is true for many distributions of . For example, it is easy to show that if and , then (A2) is satisfied.
Remark 2.
Since is the maximizer of , is non-negative definite. An assumption similar to (A3) is also posed in Section 4.1.1 of Han et al., (2017). In the discrete cases, however, we will use another way to build the upper error bound and will no longer need this assumption.
Under the above assumptions, the following Lemma 1 measures the closeness between the concordance function and its approximation .
Lemma 1.
(3.1) |
The convexity assumption (A3) in Assumption 1 implies that and are of the same rate, and thus the approximation error can be bounded through analyzing , where we can apply Lemma 1 and get Theorem 2.
Theorem 2 (Approximation Error Rate).
(3.2) |
Remark 3.
The approximation error in Han et al., (2017) is
We have a faster convergence rate because we use the concordance function estimator
which is different from the Maximum Rank Correlation (MRC) estimator
in their paper.
3.1.2 Stochastic Error
When investigating the stochastic error , we follow the framework of Section 4.2 in Han et al., (2017). To be consistent with their notation, define
and denotes the penalty, while is the dual norm of . The following assumptions are needed in this subsection.
Assumption 2.
(A4).
(A5), .
(A6).
Remark 4.
Our assumption (A4) is very similar to (A1) in Assumption 1, but (A1) does not include any stochastic factors.
Remark 5.
Although it is reasonable to pose (A5) in our optimal treatment decision problem, (A5) inevitably rules out the multivariate normal distribution.
Remark 6.
We provide two necessary steps, Lemmas 3 and 4, in order to apply Theorem 4.8 in Han et al., (2017). Lemma 3 is used to bound the term .
Lemma 3.
(3.3) |
In Han et al., (2017), a parameter was introduced, and they stated that when is fixed. However, in our case can be much larger than , so we use a different approach, our Lemma 4, to analyze the rate of . Define
the following Lemma 4, which provides a bound for with high probability.
Lemma 4.
(3.4) |
is used to bound the difference between and . By using an argument similar to Theorem 4.8 and Lemma 4.10 in Han et al., (2017), we can prove the stochastic error rate stated in the following theorem.
Theorem 5 (Stochastic Error Rate).
(3.5) |
Theorem 6 (Overall Error Rate).
(3.6) |
Remark 7.
For comparison, in Han et al., (2017), they proved and . Choose , then in their paper .
3.2 Discrete Cases
We make different assumptions for the discrete covariate cases discussed in this section. The main difference between the discrete and continuous cases is that in the discrete cases, might be a constant around .
Assumption 3.
(B1) There exists a matrix , such that the conditional distribution is a symmetric distribution centered at the origin.
(B2) .
(B3) .
Remark 8.
In the continuous covariate cases, (B1) is satisfied under some multivariate normal distributions, where we can think of as slightly larger than the set of nonzero indices. We can let include the features that are correlated with the first feature.
Remark 9.
In the discrete cases, we bound the approximation error and the stochastic error at the same time. Lemmas 3 and 4 also hold for the discrete cases, and the proofs are almost the same. Moreover, we have a different main theorem as follows:
Theorem 7 (Error Rate in Discrete Cases).
As long as , for all such that , we have , and .
Remark 10.
In the proof, we choose . Using (B3), the theorem implies that for any , . If , then . Therefore, when , we conclude that once , then . In other words, if the contrast functions , then , which means we will rank the subjects in the right order with high probability.
4 Simulation Studies
Four methods are compared in this section: POWL in Song et al., (2015), PAL in Shi et al., (2018), SCAL in Liang et al., (2018) and SMCAL. POWL method minimizes
(4.1) |
where is the decision function and the tuning parameter is selected to maximize the IPW estimator. PAL is a dynamic treatment regime approach but can be used for the single-stage problem because PAL estimates the latest stage first. PAL performs variable selection using penalized A-learning, and then uses unpenalized A-learning to estimate the coefficients. We used the R package provided by the authors of PAL to obtain its results.
To evaluate the estimated coefficients, we report the MSE after normalization. To evaluate variable selection accuracy, we report the Incorrect Zeros (true coefficient is nonzero but the estimate is zero) and Correct Zeros (true coefficient is zero and the estimate is zero). To evaluate the estimated treatment regime, we report the Percentage of Correct Decisions (PCD) and the mean response when following the estimated treatment regime (Estimated Value). The Estimated Value is calculated by drawing 1000 new samples. Standard errors of the PCDs and Estimated Values are given in parentheses.
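A sketch of how these metrics can be computed, under our reading of the definitions (unit-norm normalization for the MSE and a decision threshold of 0 for the PCD are both assumptions; the paper's exact conventions are not spelled out here):

```python
import numpy as np

def evaluate(beta_hat, beta_true, X):
    """Evaluation metrics of the kind reported in the tables (our reading).

    MSE after normalization: squared error between unit-norm coefficient
    vectors. Incorrect Zeros: true nonzero estimated as zero. Correct Zeros:
    true zero estimated as zero. PCD: fraction of subjects for whom the
    estimated and true linear rules (threshold 0, assumed) agree.
    """
    b_hat = beta_hat / np.linalg.norm(beta_hat)
    b_true = beta_true / np.linalg.norm(beta_true)
    mse = float(np.sum((b_hat - b_true) ** 2))
    incorrect_zeros = int(np.sum((beta_true != 0) & (beta_hat == 0)))
    correct_zeros = int(np.sum((beta_true == 0) & (beta_hat == 0)))
    pcd = float(np.mean((X @ beta_hat > 0) == (X @ beta_true > 0)))
    return mse, incorrect_zeros, correct_zeros, pcd
```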
4.1 Low Dimension
4.1.1 Linear Case
The first example can be found in Zhao et al., (2012) as well as in Liang et al., (2018). Set , are independent variables, all generated from in the continuous cases. is chosen to be for all . , where . simulations are conducted for respectively.
In the discrete cases, everything remains the same except that we generate from the uniform distribution on .
Numerical comparisons are summarized in Table 1. We ran simulations only for SMCAL and PAL; the results for SCAL and POWL are from Liang et al., (2018).
Method | n | MSE | Incorr0(0) | Corr0(48) | PCD | Estimated Value
---|---|---|---|---|---|---|
POWL | 30 | 1.60 | 1.70 | 42.23 | 0.615(0.02) | 1.09(0.02) |
100 | 1.27 | 1.94 | 46.64 | 0.768(0.02) | 1.27(0.02) | |
200 | 1.09 | 1.99 | 47.78 | 0.786(0.02) | 1.30(0.03) | |
PAL | 30 | 1.83 | 1.76 | 46.23 | 0.631(0.012) | 1.17(0.015) |
100 | 1.01 | 0.92 | 46.53 | 0.808(0.009) | 1.37(0.009) | |
200 | 0.32 | 0.17 | 47.00 | 0.903(0.005) | 1.45(0.006) | |
SCAL | 30 | 1.40 | 0.73 | 35.79 | 0.659(0.01) | 1.16(0.01) |
100 | 0.52 | 0.11 | 41.97 | 0.764(0.01) | 1.31(0.01) | |
200 | 0.19 | 0.01 | 46.03 | 0.749(0.01) | 1.32(0.01) | |
SMCAL | 30 | 0.93 | 0.82 | 43.09 | 0.677(0.014) | 1.21(0.017) |
100 | 0.81 | 0.75 | 44.11 | 0.735(0.010) | 1.28(0.013) | |
200 | 0.69 | 0.57 | 45.42 | 0.788(0.007) | 1.34(0.009) | |
SMCAL-Discrete | 30 | 0.95 | 0.89 | 43.12 | 0.653(0.014) | 1.20(0.017) |
100 | 0.79 | 0.74 | 44.15 | 0.723(0.009) | 1.28(0.011) | |
200 | 0.70 | 0.50 | 43.78 | 0.764(0.006) | 1.33(0.008) |
In general, POWL gives many estimated zeros. Our MSEs are larger than SCAL’s but smaller than POWL’s. In fact, our theoretical results indicate a slower convergence rate than SCAL. When the sample size is 200, our PCDs and Estimated Values are significantly higher than those of SCAL, which may be explained by SMCAL’s smaller bias compared to SCAL. We also notice that the PCD of SCAL with 200 samples is abnormally smaller than its PCD with 100 samples. This may be because SCAL does not converge to the true coefficients due to the bias induced by the hinge loss. In summary, we think SCAL is better than SMCAL when the sample size is small, but as the sample size grows, SMCAL becomes increasingly better than SCAL.
In the discrete cases, the MSE, variable selection, PCDs and Estimated Values have similar patterns.
PAL has the best performance in this example. PAL with 100 samples is already much better than SCAL and SMCAL with 200 samples in terms of PCDs and Estimated Values. In this example, the contrast function , which is a linear combination of the features. In such a linear case, pairwise-comparison-based methods like CAL, SCAL and SMCAL seem less efficient than PAL. However, CAL, SCAL and SMCAL only assume , which is flexible and can be applied in nonlinear cases.
4.1.2 Nonlinear Case
The second example, see Table 2, compares PAL and SMCAL in a case where the contrast function is nonlinear in the features. We set , , , and , where . Our MSEs are much smaller than PAL’s, and we can successfully select the important features when the sample size is relatively large. Our PCDs and Estimated Values are also much better than PAL’s. PAL seems unsuitable for such a nonlinear case, but SMCAL performs quite well.
Method | n | MSE | Incorr0(0) | Corr0(46) | PCD | Estimated Value
---|---|---|---|---|---|---|
PAL | 30 | 1.44 | 3.12 | 44.30 | 0.579(0.007) | 10.42(0.431) |
100 | 0.82 | 2.22 | 45.35 | 0.666(0.006) | 14.67(0.486) | |
200 | 0.58 | 1.80 | 45.83 | 0.706(0.005) | 16.39(0.552) | |
SMCAL | 30 | 0.90 | 1.03 | 33.53 | 0.689(0.006) | 17.45(0.464) |
100 | 0.52 | 0.05 | 29.44 | 0.814(0.006) | 18.56(0.319) | |
200 | 0.28 | 0.00 | 30.66 | 0.888(0.004) | 19.92(0.510) |
4.2 High Dimension
We conduct simulations for the following six high-dimensional models, which can be found in Section 4.2 of Liang et al., (2018). We uniformly set the sample size and dimension . The interaction terms with the treatment type are or , which satisfies our monotone single-index model assumption. However, the baseline function can be very flexible: the following models use linear, polynomial, or even sine functions as the baseline. We set , and .
, where , .
, where , , .
, where , , .
, where , .
, where , , .
, where , , .
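The models above share a monotone single-index interaction structure with a flexible baseline. A hedged sketch of such a data-generating process (the names `baseline` and `link`, the covariate distribution, and all constants are our own; the paper's exact designs are model-specific and omitted here):

```python
import numpy as np

def generate(n, p, beta, baseline, link, seed=0):
    """Sketch of a monotone single-index interaction design:
    Y = baseline(X) + A * link(X @ beta) + noise, with `link` monotone
    increasing. All names and distributional choices here are assumptions
    made for illustration, not the paper's exact specification."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, p))   # covariates
    A = rng.integers(0, 2, size=n)            # randomized binary treatment
    eps = rng.normal(0.0, 0.5, size=n)        # noise
    Y = baseline(X) + A * link(X @ beta) + eps
    return X, A, Y
```

Swapping `baseline` between a linear, polynomial, or sine function reproduces the kind of variation across the six models, while the treatment interaction keeps the single-index form the method requires.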
Liang et al., (2018) reported the results of SCAL, and we ran simulations for SMCAL, all summarized in Table 3.
Model | MSE (SCAL) | MSE (SMCAL) | Incorr0(0) (SCAL) | Incorr0(0) (SMCAL) | Corr0(497) (SCAL) | Corr0(497) (SMCAL) | PCD (SCAL) | PCD (SMCAL) | Estimated Value (SCAL) | Estimated Value (SMCAL)
---|---|---|---|---|---|---|---|---|---|---
Model 1 | 0.61 | 0.76 | 0.75 | 0.86 | 482.62 | 490.81 | 0.744(0.01) | 0.732(0.005) | 3.80(0.02) | 3.82(0.016) |
Model 2 | 0.56 | 0.71 | 0.57 | 0.55 | 485.34 | 492.69 | 0.763(0.01) | 0.749(0.006) | 3.79(0.03) | 3.87(0.015) |
Model 3 | 0.44 | 0.69 | 0.49 | 0.47 | 488.12 | 491.59 | 0.786(0.01) | 0.751(0.005) | 1.92(0.02) | 1.88(0.015) |
Model 4 | 0.35 | 0.62 | 0.35 | 0.21 | 486.81 | 481.69 | 0.801(0.01) | 0.752(0.006) | 5.67(0.04) | 5.69(0.043) |
Model 5 | 0.32 | 0.56 | 0.25 | 0.18 | 487.00 | 485.41 | 0.810(0.01) | 0.740(0.008) | 5.63(0.05) | 5.62(0.057) |
Model 6 | 0.29 | 0.55 | 0.20 | 0.05 | 485.33 | 485.09 | 0.820(0.01) | 0.774(0.007) | 3.74(0.04) | 3.78(0.040) |
Table 3 shows that our variable selection results and Estimated Values are close to those of SCAL. Our MSEs and PCDs are not as good as SCAL’s, perhaps because SCAL has a faster convergence rate, which makes it perform better than SMCAL in high dimensions.
5 Real Data Analysis
The STAR*D study, which focused on Major Depressive Disorder (MDD), enrolled over 4,000 outpatients aged 18 to 75. There were altogether four treatment levels: 2, 2A, 3 and 4. At each treatment level, patients were randomly assigned to different treatment groups. Various clinical and socioeconomic factors were recorded, as well as the treatment outcomes. More details can be found in Fava et al., (2003).
We focus on level 2 of the STAR*D study. The data consist of 319 samples, each containing the patient ID, treatment type, treatment outcome and 305 other clinical or genetic features. There are two treatment types: bupropion (BUP) and sertraline (SER). We use to label SER and to label BUP. The 16-item Quick Inventory of Depressive Symptomatology-Clinician-Rated (QIDS-C16) ranges from 0 to 24, with smaller values indicating better outcomes. In our setting, a larger value indicates a better treatment effect, so we use the negative QIDS-C16 as the treatment outcome.
Among the 319 patients selected, 166 received SER and 153 received BUP. After obtaining the estimated treatment regime, we use the inverse probability weighted (IPW) estimator proposed by Zhang et al., (2012)
to calculate the estimated values. We draw 1000 bootstrap samples, each of size 1000, to obtain the Estimated Value and the 95% CI.
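A minimal sketch of this IPW evaluation with bootstrap confidence intervals. The function names are ours, and we assume `propensity` holds each patient's probability of receiving the treatment actually received (0.5 under simple randomization); this is a reading of the Zhang et al., (2012) estimator, not their implementation.

```python
import numpy as np

def ipw_value(Y, A, recommended, propensity):
    """IPW estimate of the mean outcome under a treatment regime: a weighted
    mean of outcomes for patients whose received treatment matches the
    regime's recommendation, weighted by inverse propensity."""
    match = (A == recommended).astype(float)
    w = match / propensity
    return float(np.sum(w * Y) / np.sum(w))

def bootstrap_ci(Y, A, recommended, propensity, B=1000, size=1000, seed=0):
    """Percentile bootstrap CI for the IPW value (B resamples of given size)."""
    rng = np.random.default_rng(seed)
    n = len(Y)
    vals = []
    for _ in range(B):
        idx = rng.integers(0, n, size=size)
        vals.append(ipw_value(Y[idx], A[idx], recommended[idx], propensity[idx]))
    vals = np.sort(np.array(vals))
    return float(vals[int(0.025 * B)]), float(vals[int(0.975 * B)])
```

When the regime recommends exactly the treatment each patient received and the propensity is constant, the estimator reduces to the sample mean of the outcomes, which is a quick sanity check.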
We ran PAL and our SMCAL on this dataset. Liang et al., (2018) reported the results of SCAL, POWL and the non-dynamic treatment regimes. All results are summarized in Table 4.
Treatment Regime | Estimated Value | Diff | 95% CI on Diff |
---|---|---|---|
Optimal Regime (SCAL) | -6.77 | ||
Optimal Regime (SMCAL) | -6.90 | 0.13 | (-0.46,0.69) |
Optimal Regime (PAL) | -8.15 | 1.38 | (0.71,2.02) |
Optimal Regime (POWL) | -9.46 | 2.69 | (1.18,4.24) |
BUP | -10.50 | 3.75 | (2.38,5.19) |
SER | -10.72 | 3.97 | (2.57,5.50) |
Taking SCAL as the baseline, Diff in the table denotes the difference between the estimated value of each method and that of SCAL. According to Table 4, the optimal treatment regimes are all better than the non-dynamic treatment regimes. The optimal treatment regimes by SCAL and SMCAL have the best estimated values, while PAL and POWL are not as good as SCAL and SMCAL in this real data example.
A comparison of the received treatments and the estimated optimal treatment regime is available in Table 5.
Randomized Treatment | Recommended: SER | Recommended: BUP
---|---|---
SER | 75 | 91
BUP | 68 | 85
Randomized Treatment means the treatment actually administered in the STAR*D study, and Recommended means the treatment suggested by the estimated optimal treatment regime. From Table 5 we can see that among the 153 patients who received BUP, the optimal treatment regime by SMCAL recommends SER for 68 of them and BUP for the remaining 85. The recommendations are fairly balanced across the two treatments.
6 Conclusions
In this paper, we have proposed SMCAL, which builds on the concordance-assisted learning (CAL) framework and the smoothing procedure of Han et al., (2017), and aims at resolving the optimization issue of CAL for high-dimensional data. We established convergence results for both continuous-covariate and discrete-covariate cases. SMCAL can be successfully applied when the contrast function is nonlinear in the features, and it induces a smaller bias than SCAL.
SUPPLEMENTARY MATERIAL
7 Proofs
Lemma 8.
There exists , s.t. , where
Proof of Lemma 8.
7.1 Proofs of Section 3.1
Proof of Lemma 1.
Define
and
then
(7.2) |
(A1) and (A2) in Assumption 1 imply and . So
Notice that
we have
Therefore,
(7.3) | ||||
∎
Proof of Theorem 2.
Proof of Lemma 3.
Let
where . Then since is a local maximizer of . Because
Therefore, using the Hoeffding bound for U-statistics which can be found in, for example, Pitcan, (2017),
(7.6) |
Choosing , the proof is complete. ∎
Proof of Lemma 4.
Define
Under Assumption 2, if ,
According to the Hoeffding bound for U-statistics,
Let and , then
Choose , then
(7.7) |
Let be an -covering of , where we constrain and is a small positive constant to be chosen. Since the covering number of should not exceed the packing number, which is less than
we can find a such that
Consider
(7.8) |
then
(7.9) |
When ,
(7.10) |
where is defined in Lemma 8. Then by Lemma 8, we have
(7.11) |
It bounds the difference between and when . Next, consider and in the same direction but . If and , we have
therefore,
(7.12) |
For any , we can find an integer and s.t. , and , where is defined in (7.8).
The proof for the discrete cases under Assumption 3 is essentially the same. ∎
Proof of Theorem 5.
The proof of Theorem 5 is very similar to that of Lemma 4.10 in Han et al., (2017), so here we only give a sketch. According to Lemma 1,
(7.14) |
And based on Taylor Expansion of and ,
(7.15) |
here and do not include the first coordinate. Then, using Theorem 2 and Assumption 1, we have
(7.16) |
where are some proper constants.
Similar to the proof of Theorem 4.8 in Han et al., (2017), as long as and is locally convex differentiable, where is the dual norm of , we have
Then by Lemma 3, Lemma 4 and (7.16), define
(7.17) | ||||
Since , we can check that the assumptions of Theorem 4.8 in Han et al., (2017) are all satisfied. Let . Theorem 4.8 in Han et al., (2017) implies that . ∎
7.2 Proofs of Section 3.2
Proof of Theorem 7.
Recall the definition of in Assumption 3. Define
We let , then according to the definition of , so is nonempty. We then prove the theorem in three steps.
First, let’s prove: for large enough , . Clearly
(7.18) | ||||
and there exists a positive constant s.t.
(7.19) | ||||
If , then , but . Define , where is the p.m.f. of , then independent of . Using (B1) in Assumption 3, we have
(7.20) | ||||
Let , then , which implies that is not a maximizer of . So .
Second, let us prove that the entries of are all 0. By (B1) in Assumption 3,
(7.21) | ||||
Let in the formula, and notice that . Because , we know attains its maximum at when . Therefore, we have
(7.22) | ||||
which means the such that , and will make no smaller than . So without loss of generality we can set , since maximizes .
Finally, define , then
(7.23) | ||||
If , using (7.19), (7.23) and the fact that ,
(7.24) | ||||
then by Lemma 4,
(7.25) | ||||
Choose and , then this satisfies and . So
(7.26) |
If , then , which contradicts the fact that minimizes in . Hence .
Now let us consider . Truncate the part of to 0 and call this new vector . By definition, , so . Similar to (7.22), we know , so
(7.27) |
and therefore . ∎
References
- Blatt et al., (2004) Blatt, D., Murphy, S., and Zhu, J. (2004). A-learning for approximate planning.
- Fan et al., (2017) Fan, C., Lu, W., Song, R., and Zhou, Y. (2017). Concordance-assisted learning for estimating optimal individualized treatment regimes. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5):1565 – 1582.
- Fava et al., (2003) Fava, M., Rush, J., Kupfer, D. J., Trivedi, M. H., Nierenberg, A. A., Thase, M. E., Sackeim, H. A., Quitkin, F. M., Wisniewski, S., Lavori, P. W., Rosenbaum, J. F., Dunner, D. L., and STAR*D Investigators Group (2003). Background and rationale for the sequenced treatment alternatives to relieve depression (STAR*D) study. Psychiatric Clinics of North America, 26(2):457 – 494.
- Han et al., (2017) Han, F., Ji, H., Ji, Z., and Wang, H. (2017). A provable smoothing approach for high dimensional generalized regression with applications in genomics. Electronic Journal of Statistics, 11(2):4347–4403.
- Han, (1987) Han, A. K. (1987). Non-parametric analysis of a generalized regression model: the maximum rank correlation estimator. Journal of Econometrics, 35(2-3):303 – 316.
- Liang et al., (2018) Liang, S., Lu, W., Song, R., and Wang, L. (2018). Sparse concordance-assisted learning for optimal treatment decision. Journal of Machine Learning Research, 18(154):1 – 26.
- Lu et al., (2013) Lu, W., Zhang, H. H., and Zeng, D. (2013). Variable selection for optimal treatment decision. Statistical Methods in Medical Research, 22(5):493 – 504.
- Murphy, (2003) Murphy, S. A. (2003). Optimal dynamic treatment regimes. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 65(2):331 – 366.
- Pitcan, (2017) Pitcan, Y. (2017). A note on concentration inequalities for u-statistics.
- Rubin, (1980) Rubin, D. B. (1980). Randomization analysis of experimental data: The fisher randomization test comment. Journal of the American Statistical Association, 75(371):591 – 593.
- Shi et al., (2018) Shi, C., Fan, A., Song, R., and Lu, W. (2018). High-dimensional a-learning for optimal dynamic treatment regimes. Annals of Statistics, 46(3):925–957.
- Song et al., (2015) Song, R., Kosorok, M., Zeng, D., Zhao, Y., Laber, E., and Yuan, M. (2015). On sparse representation for optimal individualized treatment selection with penalized outcome weighted learning.
- Song et al., (2011) Song, R., Wang, W., Zeng, D., and Kosorok, M. R. (2011). Penalized q-learning for dynamic treatment regimes.
- Watkins, (1989) Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, UK.
- Watkins and Dayan, (1992) Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279 – 292.
- Zhang et al., (2012) Zhang, B., Tsiatis, A., Laber, E., and Davidian, M. (2012). A robust method for estimating optimal treatment regimes. Biometrics, 68(4):1010 – 1018.
- Zhao et al., (2012) Zhao, Y., Zeng, D., Rush, A. J., and Kosorok, M. R. (2012). Estimating individualized treatment rules using outcome weighted learning. Journal of the American Statistical Association, 107(499):1106 – 1118.
- Zhu et al., (2019) Zhu, W., Zeng, D., and Song, R. (2019). Proper inference for value function in high-dimensional q-learning for dynamic treatment regimes. Journal of the American Statistical Association, 114(527):1404–1417.