
Causal Inference Based Single-branch Ensemble Trees For Uplift Modeling

Fanglan Zheng, Menghan Wang\equalcontrib, Kun Li\equalcontrib, Jiang Tian, Xiaojia Xiang
Abstract

In this manuscript (ms), we propose causal inference based single-branch ensemble trees for uplift modeling, namely CIET. Different from standard classification methods that model predictive probability, CIET aims to estimate the change in the predictive probability of an outcome caused by an action or a treatment. In CIET, two partition criteria are specifically designed to maximize the difference in outcome distribution between the treatment and control groups. Next, a novel single-branch tree is built by taking a top-down node partition approach, and the remaining samples are censored since they are not covered by the upper node partition logic. Repeating the tree-building process on the censored data, single-branch ensemble trees with a set of inference rules are thus formed. Moreover, CIET is experimentally demonstrated to significantly outperform previous approaches for uplift modeling in terms of both area under the uplift curve (AUUC) and the Qini coefficient. At present, CIET has already been applied to online personal loans in a national financial holdings group in China. CIET will also be of use to analysts applying machine learning techniques to causal inference in broader business domains such as web advertising, medicine and economics.

Introduction

Uplift modeling involves a set of methods for estimating the expected causal impact of taking an action or a treatment at an individual or subgroup level, which could lead to an increase in their conversion probability (Zhao and Harinen 2019). Typically for financial services and commercial companies looking to provide additional value-added services and products to their customers, marketers may be interested in evaluating the effectiveness of numerous marketing techniques, such as sending promotional coupons. By estimating the change in customers' conversion probabilities, marketers are able to target prospects efficiently. Beyond marketing campaigns, uplift modeling can be applied to a variety of real-world scenarios related to personalization, such as online advertising, insurance, or healthcare, where patients with varying levels of response to a new drug can be identified, including the discovery of adverse effects on specific subgroups (Jaskowski and Jaroszewicz 2012).

In essence, uplift modeling is a problem that combines causal inference and machine learning. For the former, it is impossible to observe both outcomes for the same individual. To overcome this counterfactual problem, samples are randomly assigned to a treatment group (receiving an online advertisement or marketing campaign) and a control group (receiving neither). For the latter, the task is to train a model that predicts the difference in the probability of belonging to a given class between the two groups. At present, two major categories of estimation techniques have been proposed in the literature, namely meta-learners and tailored methods (Zhang, Li, and Liu 2022). The first includes the Two-Model approach (Radcliffe 2007), the X-learner (Künzel et al. 2017) and the transformed outcome methods (Athey and Imbens 2015), which extend classical machine learning techniques. The second refers to direct uplift modeling such as uplift trees (Rzepakowski and Jaroszewicz 2010) and various neural network based methods (Louizos et al. 2017; Yoon, Jordon, and van der Schaar 2018), which modify existing machine learning algorithms to estimate treatment effects. Also, uplift trees can be extended to more general ensemble tree models, such as causal forests (Wager and Athey 2018; Athey, Tibshirani, and Wager 2019), at the cost of losing true interpretability.

In order to take advantage of decision trees and the ensemble approach, we propose causal inference based single-branch ensemble trees for uplift modeling (CIET) with two completely different partition criteria that directly maximize the difference between the outcome distributions of the treatment and control groups. When building a single-branch tree, we employ lift gain and lift gain ratio as loss functions or partition criteria for node splitting in a recursive manner. Since our proposed splitting criteria are highly related to the incremental impact, the performance of CIET is thus expected to be reflected in the uplift estimation. Meanwhile, the splitting logic of all nodes along the path from root to leaf is combined to form a single rule, ensuring the interpretability of CIET. Moreover, the dataset not covered by the rule is then censored, and the above tree-building process is repeated on the censored data. Due to this divide-and-conquer learning strategy, dependencies between the formed rules can be effectively avoided. This leads to the formation of single-branch ensemble trees and a set of decorrelated inference rules.

Note that our CIET is essentially different from decision trees for uplift modeling and causal forests. There are three major differences: (1) single-branch tree vs. standard binary tree; (2) lift gain and its ratio as loss function or splitting criterion vs. Kullback-Leibler divergence and squared Euclidean distance; and (3) decorrelated inference rules vs. correlated inference rules or even no inference rules. It is demonstrated through empirical experiments that CIET can achieve better uplift estimation compared with the existing models. Extensive experimental results on synthetic data and the public credit card data show the success of CIET in uplift modeling. We also train an ensemble model and evaluate its performance on a large real-world online loan application dataset from a national financial holdings group in China. As expected, the corresponding results show a significant improvement in the evaluation metrics in terms of both AUUC and Qini coefficient.

The rest of this ms is organized as follows. First, causal inference based single-branch ensemble trees for uplift modeling is introduced. Next, full details of our experimental results on synthetic data, credit card data and real-world online loan application data are given. It is demonstrated that CIET performs well in estimating causal effects compared to decision trees for uplift modeling. Finally, conclusions are presented.

Causal Inference Based Single-Branch Ensemble Trees (CIET) for Uplift Modeling

This section consists of three parts. We first present two splitting criteria, a single-branch ensemble approach and a pruning strategy specially designed for the uplift estimation problem. Evaluation metrics for uplift modeling are then discussed. Three key algorithms of CIET are further described in detail.

Splitting Criteria, Single-Branch Ensemble Method and Pruning Strategy

Two distinguishing characteristics of CIET are splitting criteria for tree generation and the single-branch ensemble method, respectively.

Our splitting criteria for estimating uplift are motivated by the expectation of achieving the maximum difference between the outcome distributions of the treatment and control groups. Given a class-labeled dataset with N samples, N^T and N^C are the sample sizes of the treatment and control groups (recall that N = N^T + N^C, where T and C represent the treatment and control groups). Formally, in the case of a balanced randomized experiment, the estimator of the difference in sample average outcomes between the two groups is given by:

τ = (P^T - P^C)(N^T + N^C) (1)

where P^T and P^C are the outcome probability distributions of the two groups. Motivated by Eq. (1), the divergence measures we propose for uplift modeling are lift gain and its ratio, LG and LGR for short. The corresponding mathematical forms of LG and LGR can thus be expressed as

LG = (P_R^T - P_R^C) N_R - τ_0 = τ_R - τ_0 (2)
LGR = (P_R^T - P_R^C) / (P_0^T - P_0^C) ∝ (P_R^T - P_R^C) = τ_R / N_R (3)

where P_0^T and P_0^C are the initial outcome probability distributions of the two groups and τ_0 = (P_0^T - P_0^C) N_R. For a node logic R, N_R and Y_R represent the coverage and the correct classifications, while P_R^T and P_R^C are the corresponding outcome probability distributions of the two groups. Evidently, both Eq. (2) and Eq. (3) represent estimators of uplift, and they are proposed as the two splitting criteria in this ms. Compared to the standard binary tree with left and right branches, only one branch is created after each node split in this ms. A characteristic consequence is that both LG and LGR are calculated using only the single-branch observations that remain after a node split; accordingly, no subscript k indexing binary branches appears in the equations above. Furthermore, the second term of LG makes every node partition better than randomization, while LGR has advantages identical to those of the information gain ratio.
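As an illustration, the two criteria could be computed as follows (a minimal sketch; the function and argument names are ours, not from any released CIET implementation):

```python
def lift_gain(y_t, n_t, y_c, n_c, p0_t, p0_c):
    """LG of Eq. (2) on the branch covered by a node logic R.

    y_t, n_t: responders and sample count of the treatment group under R
    y_c, n_c: responders and sample count of the control group under R
    p0_t, p0_c: outcome rates of the two groups before the split
    """
    n_r = n_t + n_c
    tau_r = (y_t / n_t - y_c / n_c) * n_r  # uplift estimate on the branch
    tau_0 = (p0_t - p0_c) * n_r            # baseline: randomization alone
    return tau_r - tau_0


def lift_gain_ratio(y_t, n_t, y_c, n_c):
    """LGR of Eq. (3), up to the constant 1/(P_0^T - P_0^C): the
    per-sample uplift tau_R / N_R on the covered branch."""
    return y_t / n_t - y_c / n_c
```

For example, a branch covering 100 treated samples with 60 responders and 100 controls with 40 responders, against initial rates of 0.5 and 0.4, yields LG = 40 - 20 = 20 and a per-sample uplift of 0.2.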

The proposed splitting criterion for a test attribute A is then defined for any divergence measure D as

Δ = D(P^T(Y) : P^C(Y) | A) - D(P^T(Y) : P^C(Y)) (4)

where D(P^T(Y) : P^C(Y) | A) is the conditional divergence measure. Apparently, Δ is the incremental gain in divergence following a node split. Substituting LG and LGR for D, we obtain our proposed splitting criteria Δ_LG and Δ_LGR. The intuition behind these criteria is as follows: we want to build a single-branch tree such that the divergence between the treatment and control distributions before and after splitting on an attribute differs as much as possible. Thus, the attribute with the highest Δ is chosen as the best splitting attribute. To achieve this, we need to calculate and find the best splitting point for each attribute. In particular, a numerical attribute is sorted by value; for categorical attributes, encoding methods are adopted for conversion to numerical type. The averages of each pair of adjacent values of an attribute with n distinct values form n-1 candidate splitting points, and the point with the highest Δ is taken as the best partition point for that attribute. The best splitting attribute is then obtained by traversing all attributes and keeping the one with the highest Δ. For the best splitting attribute, the instances are divided into two subsets at the best splitting point: one feeds into a single-branch node, while the other is censored. Note that the top-down, recursive partition continues only while some attribute explains the incremental estimation with statistical significance. Also, a histogram-based method can be employed to select the best split for each feature, which effectively reduces the time complexity.
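The candidate scan just described might look as follows for a single numeric attribute (an illustrative sketch, not the paper's implementation: since the parent divergence in Eq. (4) is constant across candidates, maximizing the branch uplift is equivalent to maximizing Δ under the LGR-style criterion; constraints such as min_samples and min_recall are omitted, and only the "≤" direction is scanned):

```python
def best_split(values, treated, outcome):
    """Scan the n-1 midpoints of a numeric attribute and return the
    threshold whose '<=' branch maximizes the uplift, i.e. the treatment
    response rate minus the control response rate on covered samples."""
    order = sorted(set(values))
    cands = [(a + b) / 2 for a, b in zip(order, order[1:])]
    best_thr, best_delta = None, float("-inf")
    for thr in cands:
        idx = [i for i, v in enumerate(values) if v <= thr]
        t = [i for i in idx if treated[i]]          # covered, treated
        c = [i for i in idx if not treated[i]]      # covered, control
        if not t or not c:
            continue  # uplift undefined if one group is empty
        uplift = (sum(outcome[i] for i in t) / len(t)
                  - sum(outcome[i] for i in c) / len(c))
        if uplift > best_delta:
            best_thr, best_delta = thr, uplift
    return best_thr, best_delta
```

A histogram-based variant would scan bin edges instead of all n-1 midpoints, trading a little resolution for much lower cost on large datasets.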

Due to noise and outliers in the dataset, a node may merely represent these abnormal points, resulting in model overfitting. Pruning, i.e., using statistics to cut off unreliable branches, can often deal with this problem effectively. Since no pruning method is essentially better than the others, we use a relatively simple pre-pruning strategy: if the Δ gain is less than a certain threshold, the node partition stops. Thus, a smaller and simpler tree is constructed after pruning. Naturally, decision-makers prefer less complex inference rules, since they are considered more comprehensible and robust from a business perspective.

Evaluation Metrics for Uplift Modeling

As noted above, it is impossible to observe both the control and treatment outcomes for an individual, which makes it difficult to define a loss for each observation. As a result, uplift evaluation differs drastically from traditional machine learning model evaluation: improving the predictive accuracy of the outcome does not necessarily mean that a model will perform better at identifying targets with higher uplift. In practice, most of the uplift literature resorts to aggregated measures such as uplift bins or curves. The two key metrics involved are the area under the uplift curve (AUUC) and the Qini coefficient (Gutierrez and Gérardy 2016). To define AUUC, binned uplift predictions are sorted from largest to smallest. For each t, the cumulative sum of the observation statistic is formulated as below,

f(t) = (Y_t^T / N_t^T - Y_t^C / N_t^C)(N_t^T + N_t^C) (5)

where the subscript t implies that the quantity is calculated on the first (top) t observations. The higher this value, the better the uplift model. The continuity of the uplift curves makes it possible to calculate AUUC, i.e. the area under the real uplift curve, which can be used to evaluate and compare different models. The Qini coefficient represents a natural generalization of the Gini coefficient to uplift modeling. The Qini curve is introduced with the following equation,

g(t) = Y_t^T - Y_t^C N_t^T / N_t^C (6)

There is an obvious parallelism with the uplift curve, since f(t) = g(t)(N_t^T + N_t^C) / N_t^T. The difference between the area under the actual Qini curve and that under the diagonal corresponding to random targeting can be obtained. It is further normalized by the area between the random and the optimal targeting curves, which defines the Qini coefficient.
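A small sketch of how Eqs. (5)-(6) could be evaluated on observations already sorted by predicted uplift (illustrative names; a plain prefix sum rather than any particular package's binning):

```python
def curve_points(sorted_obs, t):
    """f(t) of Eq. (5) and g(t) of Eq. (6) on the top-t observations.
    sorted_obs: (treated, outcome) pairs sorted by predicted uplift,
    largest first."""
    top = sorted_obs[:t]
    n_t = sum(1 for tr, _ in top if tr)
    n_c = t - n_t
    y_t = sum(y for tr, y in top if tr)
    y_c = sum(y for tr, y in top if not tr)
    if n_t == 0 or n_c == 0:          # uplift undefined on one-sided prefixes
        return 0.0, 0.0
    f = (y_t / n_t - y_c / n_c) * (n_t + n_c)
    g = y_t - y_c * n_t / n_c
    return f, g


def auuc(sorted_obs):
    """Area under the uplift curve, as the sum of f(t) over all prefixes."""
    return sum(curve_points(sorted_obs, t)[0]
               for t in range(1, len(sorted_obs) + 1))
```

The identity f(t) = g(t)(N_t^T + N_t^C) / N_t^T from the text can be checked directly on the two returned values.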

Algorithm Modules

Three algorithms are presented below: selecting the best split for each feature using the splitting criteria described above, learning a top-down induced single-branch tree, and progressively forming ensemble trees from the resulting trees.

Algorithm 1 depicts how to find the best split of a single feature F on a given dataset D[group_key, feature, target] using a histogram-based method with the two proposed splitting criteria. Gain_left and Gain_right are the uplift gains of the child nodes after each node partition. If the maximum value of Gain_left is greater than that of Gain_right, the right branch is censored, and vice versa. Thus, the best split is found with its corresponding splitting logic, threshold and uplift gain, denoted by Best_Direction, Best_Threshold and Best_Δ. Besides, several thresholds have to be initialized before training a CIET model, including the minimum number of samples at an inner node min_samples, the minimum recall min_recall and the minimum uplift gain required for splitting min_Δ. The top-down process continues only while these restrictions are satisfied.

Algorithm 1 Selecting the Best Split for One Feature

Input: D, the given class-labeled dataset, including the group key (treatment/control)
Parameter: feature F, min_samples, min_recall, min_Δ
Output: the best split that maximizes the lift gain or lift gain ratio on the feature

1:  Set Best_Value = 0, Best_Direction = "", Best_Δ = 0, Best_Threshold = None
2:  Calculate Y^T, Y^C, N^T, N^C on D
3:  For each feature value v, calculate Y_{F≤v}^T, Y_{F≤v}^C, N_{F≤v}^T, N_{F≤v}^C, and then Gain_left(v) and Gain_right(v) with LG (2) or LGR (3)
4:  Set Gain_left(v) and Gain_right(v) to their minimum value at any v whose split does not satisfy the restrictions on the number of samples / recall rate / divergence gain
5:  v1 = argmax(Gain_left(v)), v2 = argmax(Gain_right(v))
6:  if max(Gain_left) ≥ max(Gain_right) then
7:     Best_Value = max(Gain_left)
8:     Best_Direction = "≤"
9:     Best_Threshold = v1
10:  else
11:     Best_Value = max(Gain_right)
12:     Best_Direction = ">"
13:     Best_Threshold = v2
14:  end if
15:  return Best_Value, Best_Direction, Best_Threshold

Algorithm 2 presents a typical algorithmic framework for top-down induction of a single-branch uplift tree, which is built recursively using a greedy depth-first strategy. The parameter max_depth represents the depth of the tree and cost indicates the threshold on LG or LGR. As the tree grows deeper, more instances are censored since they are not covered by the node partition logic of each layer. As a result, each child node hierarchically subdivides the original dataset into a smaller subset until the stopping criterion is satisfied. Tracing the splitting logics on the path from the root to the leaf node of the tree, an "IF-THEN" inference rule is extracted.
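The path-to-rule combination just described amounts to conjoining one threshold test per layer. A minimal sketch (the triple format is our own illustrative choice):

```python
def rule_to_predicate(conditions):
    """Combine the splitting logics along a root-to-leaf path into one
    'IF-THEN' rule.  conditions: list of (feature_index, direction,
    threshold) triples, with direction in {"<=", ">"}; the returned
    predicate tells whether a sample x is covered by the rule."""
    def covers(x):
        return all(x[f] <= t if d == "<=" else x[f] > t
                   for f, d, t in conditions)
    return covers
```

For instance, the path [(0, "<=", 0.17), (1, "<=", 2.59)] reads "IF x0 ≤ 0.17 AND x1 ≤ 2.59 THEN the rule fires".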

Finally, adopting a divide-and-conquer strategy, the above tree-building process is repeated on the censored samples to form ensemble trees, resulting in the formation of a set of inference rules as shown in Algorithm 3.

Algorithm 2 Learning an "IF-THEN" Uplift Rule of a Single-branch Tree

Input: D, the given class-labeled dataset, including the group key (treatment/control)
Parameter: max_depth, cost, min_samples, min_recall, min_Δ
Output: an "IF-THEN" uplift rule

1:  Set Rule_Single = [], Max_Gain = 0.0
2:  Set Add_Rule = True
3:  while depth ≤ max_depth and Add_Rule do
4:     if the treatment group or control group in D is empty then
5:        break
6:     end if
7:     Set Keep = { }, Best_Split = { }
8:     depth ← depth + 1
9:     Add_Rule = False
10:     for feature in features do
11:        Keep[feature] = Best_Split_for_One_Feature(D, feature, min_samples, min_recall, min_Δ) (Algorithm 1)
12:     end for
13:     for feature in Keep do
14:        if feature's best gain > Max_Gain + cost then
15:           Max_Gain = feature's best gain
16:           Add Keep[feature] to Best_Split
17:           Add_Rule = True
18:        else
19:           continue
20:        end if
21:     end for
22:     Add Best_Split to Rule_Single
23:     D ← D \ { samples covered by Rule_Single }
24:  end while
25:  return Rule_Single
Algorithm 3 Learning a Set of "IF-THEN" Uplift Rules

Input: D, the given class-labeled dataset, including the group key (treatment/control)
Parameter: max_depth, rule_count, cost, min_samples, min_recall, min_Δ
Output: a set of "IF-THEN" uplift rules

1:  Set Rule_Set = {}, number = 0
2:  while number ≤ rule_count do
3:     rule = Single_Uplift_Rule(D, cost, max_depth, min_samples, min_recall, min_Δ) (Algorithm 2)
4:     Add rule to Rule_Set
5:     D ← D \ { dataset covered by rule }
6:     number ← number + 1
7:  end while
8:  return Rule_Set
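Read as plain code, the outer loop of Algorithm 3 is divide-and-conquer over the censored remainder. A minimal Python sketch (learn_single_rule stands in for Algorithm 2 and is assumed to return a predicate marking the samples a rule covers, or None when no rule can be learned):

```python
def learn_rule_set(data, learn_single_rule, rule_count):
    """Sketch of Algorithm 3: learn up to rule_count rules, censoring
    the samples covered by each rule before learning the next one."""
    rules = []
    for _ in range(rule_count):
        if not data:
            break
        rule = learn_single_rule(data)   # stand-in for Algorithm 2
        if rule is None:
            break
        rules.append(rule)
        # divide and conquer: keep only the uncovered samples
        data = [s for s in data if not rule(s)]
    return rules
```

Because each rule is learned only on samples not covered by its predecessors, the resulting rules are decorrelated by construction.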
Figure 1: The uplift curves of the four analyzed classifiers in four different colors for the synthetic dataset; the dashed line corresponds to random targeting.

Experiments

In this section, the effectiveness of CIET is evaluated on synthetic and real-world business datasets. Since CIET fundamentally stems from tree-based approaches, we implement it and compare it with uplift decision trees based on squared Euclidean distance and Kullback-Leibler divergence (Rzepakowski and Jaroszewicz 2010), which are referred to as baselines.

| rule number | "1" | "2" | "3" |
|---|---|---|---|
| node logic | x9_uplift ≤ 0.17 | x3_informative > -1.04 | x6_informative > 0.95 |
| node logic | x10_uplift ≤ 2.59 | x1_informative ≤ 2.71 | x1_informative ≤ 1.58 |
| node logic | null | x2_informative ≤ 1.28 | x9_uplift ≤ 2.09 |
| N_before | 3000 | 2210 | 675 |
| N_before^T | 1500 | 1117 | 348 |
| N_before^C | 1500 | 1093 | 327 |
| N_rule | 790 | 1535 | 180 |
| N_rule^T | 383 | 769 | 76 |
| N_rule^C | 407 | 766 | 104 |
| net gain | 195.99 | 87.14 | 37.66 |
| recall_treatment | 36.62% | 70.42% | 42.11% |
| recall_control | 28.00% | 63.70% | 44.90% |

Table 1: A set of inference rules found by CIET and their corresponding statistical indicators with criterion_type = "LG", rule_count = 3 and max_depth = 3.

Experiments on Synthetic Data

Dataset We can test the methodology with numerical simulations, i.e., by generating synthetic datasets with known causal and non-causal relationships between the outcome, the action (treatment/control) and some confounding variables. More specifically, both the outcome and the action/treatment variables are binary. A synthetic dataset is generated with the make_uplift_classification function of the "Causal ML" package, based on the algorithm in (Guyon 2003). There are 3,000 instances for the treatment and control groups, with response rates of 0.6 and 0.5, respectively. The input consists of 11 features in three categories. 8 of them are used for base classification, composed of 6 informative and 2 irrelevant variables. 2 positive uplift variables are created to carry a positive treatment effect. The remaining one is a mix variable, defined as a linear superposition of a randomly selected informative classification variable and a randomly selected positive uplift variable.
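For readers without the package at hand, the shape of such a dataset can be mimicked with a stdlib-only toy generator (hypothetical simplification: the real make_uplift_classification draws structured informative, irrelevant and uplift features, not i.i.d. Gaussians):

```python
import random

def make_synthetic_uplift(n_per_group=1500, base_rate=0.5, uplift=0.1, seed=0):
    """Toy stand-in for make_uplift_classification: treated samples
    respond at base_rate + uplift, controls at base_rate."""
    rng = random.Random(seed)
    rows = []
    for treated in (1, 0):
        rate = base_rate + uplift * treated
        for _ in range(n_per_group):
            x = [rng.gauss(0.0, 1.0) for _ in range(11)]  # 11 features
            y = int(rng.random() < rate)                  # binary outcome
            rows.append((treated, x, y))
    return rows
```

With base_rate = 0.5 and uplift = 0.1, the empirical response rates come out near 0.6 (treatment) and 0.5 (control), mirroring the setup above.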

Parameters and Results The main hyper-parameters of CIET are criterion_type, max_depth and rule_count. criterion_type includes two options, LG and LGR. More precisely, two main factors, business complexity and the difficulty of online deployment, determine the parameter assignment. Due to the requirement of model generalization and interpretability, max_depth is set to 3; that is, a single inference rule always contains at most 3 business logics. And rule_count is given a value of 3, indicating that a set of no more than three rules is defined to model the causal effect of a treatment on the outcome. Meanwhile, the default values of min_samples, min_recall, cost and min_Δ are 50, 0.1, 0.01 and 0, respectively.

| dataset | KL | Euclid | LG | LGR |
|---|---|---|---|---|
| training | 0.187 | 0.189 | 0.239 | 0.235 |
| test | 0.176 | 0.178 | 0.210 | 0.225 |

Table 2: Qini coefficients of the four analyzed classifiers on the training and test sets of the synthetic data.

Stratified sampling is used to divide the synthetic dataset into training and test sets in a 50/50 ratio. Figure 1 shows the uplift curves of the four analyzed classifiers. The AUUC of CIET with LG and LGR are 294 and 292 on the training set, significantly greater than the 266 and 265 of the decision trees for uplift modeling with KL divergence and squared Euclidean distance. At the 36th percentile of the population, the cumulative profit increases reach 303 and 354 for LG and LGR, a growth rate of more than 18% and 37% compared to the baselines. Besides, AUUC shows little variation between the training and test datasets, indicating that the stability of CIET is also excellent. According to Table 2, the Qini coefficients of CIET are also clearly greater, with increases of more than 24.5% and 17.8%. Furthermore, all three rules are determined by uplift and informative variables as expected, as can be seen from Table 1.

Experiments on Credit Card Data

Dataset We use the publicly available Credit Approval dataset from the UCI repository as one of the real-world examples; it contains 690 credit card applications. All 15 attributes and the outcome are encoded as meaningless symbols, and A7 ≠ v is applied as the condition dividing the dataset into treatment and control groups. There are 291 and 399 observations in the two groups, with response rates of 0.47 and 0.42, respectively. Attributes with more than a 25% difference in distribution between the two groups are removed before any experiments are performed, which leaves 12 attributes as input variables. For further preprocessing, categorical features are binarized through one-hot encoding.
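The one-hot binarization step can be as simple as the following stdlib sketch (a real pipeline would typically use a library encoder instead):

```python
def one_hot(values):
    """Binarize a categorical feature: one indicator column per
    distinct category, in sorted order."""
    cats = sorted(set(values))
    return [[int(v == c) for c in cats] for v in values]
```

For a column ["a", "b", "a"], this yields the two indicator columns [[1, 0], [0, 1], [1, 0]].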

Figure 2: The uplift curves of the four analyzed classifiers in four different colors for the Credit Approval dataset; the dashed curve corresponds to random targeting.
| metrics | KL | Euclid | LG | LGR |
|---|---|---|---|---|
| AUUC | 37.337 | 40.887 | 42.893 | 48.222 |
| Qini | 0.201 | 0.236 | 0.257 | 0.310 |

Table 3: Model performance of the four analyzed classifiers on the Credit Approval dataset.

Parameters and Results From the business decision-making perspective, the initial parameters are the same as above.

Figure 3: The uplift curves of the four analyzed classifiers in four different colors for the real-world online loan application dataset; the dashed line corresponds to random targeting.

To avoid the distribution bias that a split would introduce on such a small dataset, Credit Approval is not divided into training and test parts. Figure 2 shows the uplift curves of the four analyzed classifiers, from which we can see that CIET obtains higher AUUC and Qini coefficients. As shown in Table 3, the former increases from approximately 37-40 for the baselines to 42-48 for CIET, while the latter improves significantly from 0.20-0.23 to 0.25-0.31. Especially when LGR serves as the splitting criterion, the cumulative profit has a distinguished peak of 74.5 while only 48.4% of the samples are covered.

Experiments on Online Loan Application Data

Dataset We further extend CIET to precision marketing for new customer applications. A telephone marketing campaign is designed to encourage customers to apply for personal credit loans at a national financial holdings group in China via its official mobile app. The target is 1/0, indicating whether a customer submits an application or not. The data contains 53,629 individuals, consisting of a treatment group of 32,984 (receiving marketing calls) and a control group of 20,645 (not receiving marketing calls). The two groups have 300 and 124 credit loan applications, with response rates of 0.9% and 0.6%, which are typical values in real-world marketing practice. There are 24 variables in all, characterized as credit card-related information, loan history, customer demographics, etc.

| dataset | KL | Euclid | LG | LGR |
|---|---|---|---|---|
| training | 0.173 | 0.168 | 0.414 | 0.302 |
| test | 0.108 | 0.124 | 0.385 | 0.319 |

Table 4: Qini coefficients of the four analyzed classifiers on the training and test sets of the real-world online loan application data.

Parameters and Results All parameters are the same as in the above experiments. The dataset is first divided into training and test sets in a 60/40 ratio; the response rates of the two groups are consistent across the two sets. Figure 3 displays the results graphically. On the training dataset, CIET based on LG and LGR reaches AUUC of about 104 and 89, while the decision trees based on KL divergence and squared Euclidean distance reach 73 and 72. It can be seen that CIET achieves a significant improvement over the baselines on this real-world dataset, even with a very low response rate. Moreover, as can be seen in Table 4, the Qini coefficient of our approaches increases to 0.30-0.41 from the baselines' 0.16-0.17 on the training dataset. Meanwhile, the Qini coefficient changes little from the training to the test dataset, indicating better stability. Consequently, a classifier built with CIET for precision marketing is effectively improved as well as stabilized in terms of AUUC and Qini coefficient. At present, CIET has already been applied to personal credit telemarketing.

Conclusion

In this ms, we propose new methods for constructing causal inference based single-branch ensemble trees for uplift estimation, CIET for short. Our methods provide two partition criteria for node splitting and a strategy for generating ensemble trees. The corresponding outputs are the uplift gain between the two outcomes and a set of interpretable inference rules, respectively. Compared with classical decision trees for uplift modeling, CIET can not only avoid dependencies among inference rules, but also improve model performance in terms of AUUC and Qini coefficient. It is widely applicable to any randomized controlled trial, such as medical studies and precision marketing.

References

  • Athey and Imbens (2015) Athey, S.; and Imbens, G. 2015. Machine Learning Methods for Estimating Heterogeneous Causal Effects.
  • Athey, Tibshirani, and Wager (2019) Athey, S.; Tibshirani, J.; and Wager, S. 2019. Generalized random forests. The Annals of Statistics, 47(2): 1148–1178.
  • Gutierrez and Gérardy (2016) Gutierrez, P.; and Gérardy, J.-Y. 2016. Causal Inference and Uplift Modelling: A Review of the Literature. In International Conference on Predictive Applications and APIs.
  • Guyon (2003) Guyon, I. 2003. Design of experiments for the NIPS 2003 variable selection benchmark.
  • Jaskowski and Jaroszewicz (2012) Jaskowski, M.; and Jaroszewicz, S. 2012. Uplift modeling for clinical trial data.
  • Künzel et al. (2017) Künzel, S. R.; Sekhon, J. S.; Bickel, P. J.; and Yu, B. 2017. Meta-learners for Estimating Heterogeneous Treatment Effects using Machine Learning. arXiv: Statistics Theory.
  • Louizos et al. (2017) Louizos, C.; Shalit, U.; Mooij, J. M.; Sontag, D. A.; Zemel, R. S.; and Welling, M. 2017. Causal Effect Inference with Deep Latent-Variable Models. In NIPS.
  • Radcliffe (2007) Radcliffe, N. 2007. Using control groups to target on predicted lift: Building and assessing uplift model. Direct Marketing Analytics Journal, 14–21.
  • Rzepakowski and Jaroszewicz (2010) Rzepakowski, P.; and Jaroszewicz, S. 2010. Decision Trees for Uplift Modeling. In 2010 IEEE International Conference on Data Mining, 441–450.
  • Wager and Athey (2018) Wager, S.; and Athey, S. 2018. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. Journal of the American Statistical Association, 113(523): 1228–1242.
  • Yoon, Jordon, and van der Schaar (2018) Yoon, J.; Jordon, J.; and van der Schaar, M. 2018. GANITE: Estimation of Individualized Treatment Effects using Generative Adversarial Nets. In ICLR.
  • Zhang, Li, and Liu (2022) Zhang, W.; Li, J.; and Liu, L. 2022. A Unified Survey of Treatment Effect Heterogeneity Modelling and Uplift Modelling. ACM Computing Surveys (CSUR), 54: 1–36.
  • Zhao and Harinen (2019) Zhao, Z.; and Harinen, T. 2019. Uplift Modeling for Multiple Treatments with Cost Optimization. 2019 IEEE International Conference on Data Science and Advanced Analytics (DSAA), 422–431.