
An End-to-End Framework for Marketing Effectiveness Optimization under Budget Constraint

Ziang Yan, Shusen Wang, Guorui Zhou, Jingjian Lin, Peng Jiang
Kuaishou Inc., Beijing, China
yanziang,wangshusen03,zhouguorui,linjingjian,[email protected]
(2023)
Abstract.

Online platforms often offer incentives to consumers to improve user engagement and platform revenue. Since different consumers might respond differently to incentives, individual-level budget allocation is an essential task in marketing campaigns. Recent advances in this field often address the budget allocation problem using a two-stage paradigm: the first stage estimates the individual-level treatment effects using causal inference algorithms, and the second stage invokes integer programming techniques to find the optimal budget allocation solution. Since the objectives of these two stages might not be perfectly aligned, such a two-stage paradigm could hurt the overall marketing effectiveness.

In this paper, we propose a novel end-to-end framework to directly optimize the business goal under budget constraints. Our core idea is to construct a regularizer to represent the marketing goal and optimize it efficiently using gradient estimation techniques. As such, the obtained models can learn to maximize the marketing goal directly and precisely. We extensively evaluate our proposed method in both offline and online experiments, and experimental results demonstrate that our method outperforms current state-of-the-art methods. Our proposed method is currently deployed to allocate marketing budgets for hundreds of millions of users on a short video platform and achieves significant business goal improvements. Our code will be publicly available.

Marketing, Causal Learning, Integer Programming, Neural Networks, Gradient Estimation
CCS Concepts: Information systems → Personalization; Applied computing → Marketing; Mathematics of computing → Causal networks; Mathematics of computing → Integer programming

1. Introduction

Offering incentives (e.g., cash bonuses, discounts, coupons) to consumers is an effective way for online platforms to acquire new users, increase user engagement, and boost platform revenue (Lin et al., 2017; Zhao and Harinen, 2019; Zhao et al., 2019; Makhijani et al., 2019; Goldenberg et al., 2020; Zhang et al., 2021; Tu et al., 2021; Ai et al., 2022; Albert and Goldenberg, 2022; Zhou et al., 2023). For example, coupons in Taobao (Zhang et al., 2021) are offered to increase user activity, promotions in Booking (Goldenberg et al., 2020) can improve user satisfaction, cash bonuses in Kuaishou (Ai et al., 2022) are used to stimulate user retention, and promotions in Uber (Zhao and Harinen, 2019) can encourage users to start using a new product. Despite their effectiveness, these marketing campaigns can incur high costs, and thus the total budget is often capped in real-world scenarios. From the perspective of online platforms, the goal of a marketing campaign is often described as maximizing a specific business goal (e.g., user retention) under a specific budget constraint. To this end, allocating an appropriate amount of incentive to each user is essential for optimizing marketing effectiveness, since different users can respond differently to incentives. This budget allocation problem has great practical implications and has been investigated for decades.

Many recent advances address the budget allocation problem using a two-stage paradigm (Zhao et al., 2019; Ai et al., 2022; Tu et al., 2021; Albert and Goldenberg, 2022): in the first stage, the individual-level heterogeneous treatment effects are estimated using causal inference techniques, and then the estimated effects are fed into an integer programming formulation as coefficients to find the optimal allocation. However, the objectives of these two stages are fundamentally different, and in practice we have observed that they are not perfectly aligned. For example, if the total budget is extremely low, an effective allocation algorithm should distribute incentives only to a small subset of users who are highly sensitive to incentives. In this case, an improved treatment effect predictor might yield worse marketing effectiveness if it is more accurate on average across all users but performs poorly on this small subset.

In this paper, we propose a novel end-to-end approach to optimize marketing effectiveness under budget constraints, directly and precisely. Our core idea is to construct a regularizer to represent the marketing goal and optimize it efficiently using gradient estimation techniques. Concretely, we use an S-Learner (Künzel et al., 2019) based on Deep Neural Networks (DNNs) to predict the treatment effects of different users. Given the estimated treatment effects, we formulate the budget allocation problem as a multiple choice knapsack problem (MCKP) (Kellerer et al., 2004) and exploit the Expected Outcome Metric (EOM) method to build an unbiased estimate (denoted by $Q$) of the marketing goal (i.e., the expected business goal under a certain budget constraint). The $Q$ function represents the marketing effectiveness and is exactly the objective we want to maximize, so we treat $Q$ as a regularizer and integrate it into the training objective of the S-Learner model. However, one challenge remains: since the construction of $Q$ involves solving a series of MCKPs using Lagrange multipliers and thus contains many non-differentiable operations, the automatic differentiation mechanisms in modern machine learning libraries (such as TensorFlow (Abadi et al., 2015) and PyTorch (Paszke et al., 2019)) would not give correct gradients. Our solution is to treat $Q$ as a black-box function and exploit gradient estimation techniques, such as the finite-difference method or natural evolution strategies (NES) (Wierstra et al., 2014), to obtain the gradients. As such, the regularizer can be effectively optimized jointly with the original learning objective of the S-Learner model. This regularizer endows the S-Learner model with the ability to directly learn from the marketing goal and maximize it in a principled way. Our proposed method is extensively evaluated in both offline simulations and online experiments. Experimental results demonstrate that our end-to-end training framework achieves significantly better marketing effectiveness than current state-of-the-art methods. We further deploy the proposed method on a large-scale short video platform to allocate cash bonuses that incentivize users to create more videos; online A/B tests show our method significantly outperforms the baseline with at least 1.24% more video creators under the same budget, a substantial improvement relative to gains achieved in recent months. Currently, our proposed method is serving hundreds of millions of users.

2. Related Works

2.1. Budget Allocation

Starting from a common objective, many budget allocation methods have been proposed. Early methods (Guelman et al., 2015; Zhao et al., 2017; Zhao and Harinen, 2019) often select the amount of incentives for users heuristically after obtaining the heterogeneous treatment effects estimation. However, lacking an explicit optimization problem formulation could make these methods less effective in maximizing the marketing goal.

A popular way of allocating budget is to follow a two-stage paradigm (Zhao et al., 2019; Makhijani et al., 2019; Goldenberg et al., 2020; Tu et al., 2021; Ai et al., 2022; Albert and Goldenberg, 2022): in the first stage uplift models are used to predict the treatment effects, and in the second stage integer programming is invoked to find the optimal allocation. For instance, Zhao et al. (2019) propose to use a logit response model (Phillips, 2021) to predict treatment effects and then obtain the optimal allocation via root-finding for KKT conditions. Later, Tu et al. (2021) introduce various advanced estimators (including Causal Tree (Athey and Imbens, 2016), Causal Forest (Wager and Athey, 2018) and Meta-Learners (Sołtys et al., 2015; Künzel et al., 2019)) to estimate the heterogeneous effects. The second stage has also received much research attention. Makhijani et al. (2019) formulate the marketing goal as a Min-Cost Flow network optimization problem for better expressivity. More recently, Ai et al. (2022); Albert and Goldenberg (2022) use MCKP to represent the budget allocation problem in the discrete case and develop efficient solvers based on Lagrangian duality. Though effective, owing to the misalignment between the objectives of these two stages, solutions from these two-stage methods could be suboptimal.

Several methods have also been proposed to directly learn the optimal budget assignment policy (Xiao et al., 2019; Du et al., 2019; Goldenberg et al., 2020; Zou et al., 2020; Zhang et al., 2021; Zhou et al., 2023). Xiao et al. (2019); Zhang et al. (2021) develop reinforcement learning solutions based on constrained Markov decision processes to directly learn an optimal policy. However, learning such complicated policies in a purely model-free, black-box fashion without exploiting the causal structure could suffer from sample inefficiency. Du et al. (2019); Zou et al. (2020); Goldenberg et al. (2020) propose to directly learn the ratios between values and costs in the binary-treatment setting, so that treatments can be applied first to users with higher scores. Later, a contemporary work (Zhou et al., 2023) proves that the losses proposed in (Du et al., 2019; Zou et al., 2020) cannot achieve the correct ranking at convergence. Furthermore, the method proposed in (Goldenberg et al., 2020) is applicable only if the budget constraint is that the monetary ROI is $\geq 0$, which is not flexible enough for many marketing campaigns on online Internet platforms. Zhou et al. (2023) extend the decision-focused learning idea to the multiple-treatment setting and develop a loss function to learn decision factors for MCKP solutions. However, the proposed method relies on the assumption of diminishing marginal utility, which often does not strictly hold in practice.

2.2. Gradient Estimation

Estimating the gradients of a black-box function has been extensively studied for decades and widely used in the fields of reinforcement learning (Sutton et al., 1999) and adversarial attacks (Ilyas et al., 2018; Guo et al., 2019). The simplest approach is the finite-difference strategy, i.e., calculating the gradients by definition. The finite-difference strategy gives accurate gradients but can be computationally expensive if the independent variables of the black-box function are high-dimensional. An alternative is natural evolution strategies (NES) (Wierstra et al., 2014), which generate several search directions to probe the local geometry of the black-box function and estimate the gradients based on the sampled search directions. NES is generally more computationally efficient than the finite-difference strategy, but the estimated gradients are noisy. Antithetic sampling (Salimans et al., 2017) can be employed to reduce the variance of the NES gradient estimate. In this paper, we present both finite-difference-based and NES-based algorithms for gradient estimation.

3. Our End-to-End Framework

Many budget allocation methods follow a two-stage paradigm, which may lead to degraded performance due to the misalignment between the objectives of the two stages. We argue that it is beneficial to incorporate the second-stage allocation process into the learning objective, so that the model can directly learn to maximize the marketing goal. In this section, we first introduce the major notations, then briefly analyze the two-stage paradigm, and finally present our end-to-end training framework.

3.1. Notations

In most online Internet platforms, randomized controlled trials (RCTs) can be easily deployed, so a decent amount of RCT data can usually be collected at an acceptable cost in practice. As such, in this paper we focus on training models with RCT data for simplicity and leave the exploitation of observational data to future work. We use $u_{i}\in\mathbb{R}^{d}$ to denote the features of the $i$-th user in a certain user group. The term "user group" might refer to a group of RCT training users or a group of non-RCT users whose budgets are allocated by our model; the exact meaning should be easily inferred from the context. We set $K$ mutually exclusive treatments, and each treatment represents a certain amount of incentive (e.g., a cash bonus or a discount level) assigned to the user. The treatment assigned to user $u_{i}$ in the RCT is denoted by $t_{i}\in\{1,\ldots,K\}$, and the user produces a binary response $y_{i}\in\{0,1\}$ after receiving treatment $t_{i}$. The response is the business goal we want to maximize. Meanwhile, applying treatment $t_{i}$ to user $u_{i}$ induces a non-negative cost $z_{i}\geq 0$. Each sample from an RCT dataset thus consists of a quadruple $(u_{i},t_{i},y_{i},z_{i})$.

3.2. The Two-Stage Paradigm

As discussed, much effort has been devoted to two-stage budget allocation methods (Zhao et al., 2019; Makhijani et al., 2019; Goldenberg et al., 2020; Tu et al., 2021; Ai et al., 2022; Albert and Goldenberg, 2022). In this section, we briefly introduce the common components of two-stage budget allocation methods. In the first stage, two-stage methods build an uplift model to predict the uplifts of responses under all possible treatments. In principle, any uplift model could be applied in this stage, such as S-Learners (Künzel et al., 2019), R-Learners (Nie and Wager, 2021), Uplift Trees (Rzepakowski and Jaroszewicz, 2012), Causal Forests (Athey and Imbens, 2016), GRFs (Athey et al., 2019), ORFs (Oprescu et al., 2019) and LBCFs (Ai et al., 2022). Although some tree-based uplift models like GRFs (Athey et al., 2019) and ORFs (Oprescu et al., 2019) are unbiased uplift estimators with theoretical convergence guarantees, the splitting criteria in their tree-growing process are not flexible enough to be easily extended and integrated with our proposed regularizer. Thus, in this paper we restrict our attention to S-Learners with DNNs (Krizhevsky et al., 2012) as base models, just like (Zhou et al., 2023). A typical DNN S-Learner model $f$ takes the user feature $u_{i}$ as input and feeds $u_{i}$ into a multi-layer perceptron (MLP) with several fully-connected layers to obtain the base hidden representation $h_{i}$ of user $u_{i}$; the model $f$ then branches into $2K$ headers (i.e., shallow MLPs) on top of the base hidden representation $h_{i}$ to predict the responses and costs for every possible treatment, where each treatment corresponds to two headers (one for the response and the other for the cost). In general, any DNN architecture could be used in this stage, provided the model takes $u_{i}$ as input and outputs two $K$-dimensional vectors of predicted responses/costs under all possible treatments; in this paper we use the simple architecture described above and leave the exploration of DNN architectures to future work. Let $\theta$ exhaustively collect the learnable parameters of the above DNN S-Learner $f$, and let $f_{v}\left(u_{i};\theta\right)\in\mathbb{R}^{K}$ and $f_{c}\left(u_{i};\theta\right)\in\mathbb{R}^{K}$ denote the predicted responses and costs of $f$. The predicted response and cost of applying treatment $j$ to user $u_{i}$ are denoted by $f_{v}\left(u_{i};\theta\right)_{j}$ and $f_{c}\left(u_{i};\theta\right)_{j}$, respectively. Given a training set with $N$ RCT users, the DNN S-Learner is trained in the first stage by minimizing the following objective:

(1) L_{\textrm{SL}}\left(\theta\right)=\frac{1}{N}\sum_{i=1}^{N}\left[L_{\textrm{CE}}\left(f_{v}\left(u_{i};\theta\right)_{t_{i}},y_{i}\right)+L_{\textrm{C}}\left(f_{c}\left(u_{i};\theta\right)_{t_{i}},z_{i}\right)\right],

where $L_{\textrm{CE}}(\cdot,\cdot)$ is the cross-entropy loss for the binary response, and $L_{\textrm{C}}(\cdot,\cdot)$ is a suitable loss for the cost target. The formulation of $L_{\textrm{C}}(\cdot,\cdot)$ could vary from one dataset to another. For example, on a dataset where the cost label $z_{i}$ is categorical, it could be reasonable to use a cross-entropy loss for $L_{\textrm{C}}(\cdot,\cdot)$, while on another dataset where the cost label $z_{i}$ is continuous, a regression loss would be more appropriate. After minimizing the objective $L_{\textrm{SL}}(\theta)$ in Equation (1), given the feature vector of a user, the well-trained DNN S-Learner model should predict unbiased estimates of the expected responses/costs of this user under all possible treatments.
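To make the first-stage model and Equation (1) concrete, the following PyTorch sketch shows one possible DNN S-Learner with a shared base MLP and $2K$ treatment-specific headers. The hidden sizes, the sigmoid/softplus output activations, the use of MSE for $L_{\textrm{C}}$, and the zero-indexed treatments are illustrative assumptions rather than the exact configuration used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SLearner(nn.Module):
    """Shared base MLP with 2K headers: K response headers and K cost headers."""
    def __init__(self, d_in, K, hidden=(64, 32)):
        super().__init__()
        layers, prev = [], d_in
        for h in hidden:
            layers += [nn.Linear(prev, h), nn.ReLU()]
            prev = h
        self.base = nn.Sequential(*layers)
        self.resp_headers = nn.ModuleList([nn.Linear(prev, 1) for _ in range(K)])
        self.cost_headers = nn.ModuleList([nn.Linear(prev, 1) for _ in range(K)])

    def forward(self, u):
        h = self.base(u)  # (B, hidden[-1]) shared representation h_i
        v = torch.sigmoid(torch.cat([g(h) for g in self.resp_headers], dim=1))  # (B, K) responses
        c = F.softplus(torch.cat([g(h) for g in self.cost_headers], dim=1))     # (B, K) non-negative costs
        return v, c

def sl_loss(v, c, t, y, z):
    """Equation (1): cross-entropy on the observed treatment's response header plus an
    (assumed) MSE loss on its cost header; t holds zero-indexed RCT treatments."""
    idx = torch.arange(v.size(0))
    return F.binary_cross_entropy(v[idx, t], y.float()) + F.mse_loss(c[idx, t], z)
```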

In the second stage, two-stage methods usually keep the uplift model $f$ fixed and use a multiple choice knapsack problem (MCKP) (Kellerer et al., 2004) formulation to maximize the marketing goal under the budget constraint. Suppose the total budget for a group of $M$ users is set to $T$. The expected response of $u_{i}$ under treatment $j$ is denoted by $v_{ij}\in[0,1]$, and $v_{ij}$ can be predicted by the well-trained DNN S-Learner from the first stage: $v_{ij}=f_{v}\left(u_{i};\theta\right)_{j}$. Let $c_{ij}\geq 0$ denote the expected cost of $u_{i}$ under treatment $j$, which can similarly be predicted by $c_{ij}=f_{c}\left(u_{i};\theta\right)_{j}$. Using these coefficients and constants, two-stage methods typically construct the following MCKP to find the optimal personalized treatment assignment $x\in\{0,1\}^{M\times K}$ that maximizes the expected marketing goal under a given budget constraint:

(2) \begin{aligned} \max_{x}\quad & \sum_{i=1}^{M}\sum_{j=1}^{K}v_{ij}x_{ij}\\ \text{s.t.}\quad & \sum_{i=1}^{M}\sum_{j=1}^{K}c_{ij}x_{ij}\leq T\\ & \sum_{j=1}^{K}x_{ij}=1,\quad i=1,2,\ldots,M\\ & x_{ij}\in\{0,1\},\quad i=1,2,\ldots,M,\quad j=1,2,\ldots,K, \end{aligned}

where the $M\times K$ matrix $x$ represents the budget allocation solution for the $M$ users. All entries in $x$ are either 0 or 1. Each row of $x$ has exactly one entry with value 1, and $x_{ij}=1$ means user $u_{i}$ should be assigned treatment $j$. Although MCKP is proven to be NP-hard (Kellerer et al., 2004), in practice we can approximately solve it by first applying a linear relaxation to drop the integer constraints and then converting the problem to its dual (Ai et al., 2022; Zhou et al., 2023). Specifically, the dual problem of the MCKP has only one scalar variable $\alpha$ under the constraint $\alpha\geq 0$ (Kellerer et al., 2004), and the dual objective is given by

(3) \min_{\alpha\geq 0}\quad \alpha T+\sum_{i=1}^{M}\max_{j}\left(v_{ij}-\alpha c_{ij}\right).

The dual problem (3) can be efficiently solved via bisection-style methods such as the Dyer-Zemel algorithm (Zemel, 1984) or via dual gradient descent (Ai et al., 2022; Zhou et al., 2023). Once the optimal dual solution $\alpha^{*}$ is obtained, an optimal solution $x^{*}$ of the primal MCKP (2) is given by the KKT conditions:

(4) x_{ij}^{*}=\begin{cases}1,&\text{if }j=\operatorname*{arg\,max}_{k}\left(v_{ik}-\alpha^{*}\cdot c_{ik}\right)\\ 0,&\text{otherwise.}\end{cases}

Interested readers can read the insightful paper (Kellerer et al., 2004) for more details about solving MCKPs.
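As a minimal illustration of Equations (3) and (4), the NumPy sketch below evaluates the dual objective for a given $\alpha$ and recovers the corresponding primal assignment via the per-user argmax. It is a didactic sketch of the Lagrangian decomposition under assumed array inputs, not an optimized solver.

```python
import numpy as np

def dual_objective(v, c, alpha, T):
    """Dual objective of the MCKP (Equation (3)): alpha*T + sum_i max_j (v_ij - alpha*c_ij)."""
    return alpha * T + np.max(v - alpha * c, axis=1).sum()

def recover_primal(v, c, alpha):
    """KKT conditions (Equation (4)): each user takes the treatment maximizing v_ij - alpha*c_ij."""
    M, K = v.shape
    x = np.zeros((M, K), dtype=int)
    x[np.arange(M), np.argmax(v - alpha * c, axis=1)] = 1
    return x
```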

3.3. Marketing Goal as a Regularizer

As discussed in Section 1, our core idea for end-to-end training is to construct a regularizer to represent the marketing goal (i.e., the expected response under a certain budget constraint) and optimize it efficiently using gradient estimation techniques. Suppose we have a batch of $B$ RCT users from the training set (we set the batch size to $B=10{,}000$ in all experiments). The S-Learner model can predict responses for each user, and to simplify the notation we stack the predicted responses into a $B\times K$ matrix $v$; the costs of applying arbitrary treatments to these users are similarly stacked into a $B\times K$ matrix $c$. In this section, we present our solution to design a function $Q(v,c)$ that maps the predicted response matrix $v$ and cost matrix $c$ to the marketing goal, and to integrate $Q(v,c)$ into the training objective of the S-Learner model as a regularizer.

Figure 1. Example of the sensitivity curve between the empirical business goal (y-axis) and the empirical cost (x-axis) in the marketing data.

We use the Expected Outcome Metric (EOM) method (Zhao et al., 2017; Ai et al., 2022; Zhou et al., 2023) to construct $Q(v,c)$. Concretely, starting from any dual feasible solution $\alpha\geq 0$, we can obtain the corresponding budget allocation solution $x$ via the KKT conditions in Equation (4). Since $\alpha$ is the Lagrange multiplier associated with the budget constraint in the primal MCKP (2), we know from the complementary slackness conditions that $x$ must be the primal optimal solution of another MCKP in which the total budget is set to $T=\sum_{i=1}^{M}\sum_{j=1}^{K}c_{ij}x_{ij}$ while the other coefficients $c$ and $v$ remain unchanged. In other words, given any $\alpha\geq 0$, the corresponding $x$ is the optimal budget allocation solution under a certain total budget. Using the budget allocation solution $x$, we can find the set of user indices (denoted by $\mathcal{S}_{\alpha}$) among the $B$ RCT users whose treatment in the RCT data equals the treatment assigned by $x$:

(5) \mathcal{S}_{\alpha}=\left\{i\,\middle|\,t_{i}=\operatorname*{arg\,max}_{j}x_{ij},\;i=1,2,\ldots,B\right\},

where $t_{i}$ is the treatment applied to user $u_{i}$ in the RCT data. Based on the set $\mathcal{S}_{\alpha}$, the per-capita response $V\left(v,c,\alpha\right)\in\mathbb{R}_{+}$ and per-capita cost $C\left(v,c,\alpha\right)\in\mathbb{R}_{+}$ can be empirically estimated by:

(6) \begin{aligned} V\left(v,c,\alpha\right) &=\frac{1}{\left\lvert\mathcal{S}_{\alpha}\right\rvert}\sum_{i\in\mathcal{S}_{\alpha}}y_{i}\\ C\left(v,c,\alpha\right) &=\frac{1}{\left\lvert\mathcal{S}_{\alpha}\right\rvert}\sum_{i\in\mathcal{S}_{\alpha}}z_{i}, \end{aligned}

where $y_{i}\in\{0,1\}$ and $z_{i}\in\mathbb{R}_{+}$ denote the response and cost of user $u_{i}$ in the RCT data, respectively, and $\left\lvert\mathcal{S}_{\alpha}\right\rvert$ is the cardinality of the set $\mathcal{S}_{\alpha}$.
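The matched-set estimators in Equations (5) and (6) can be written compactly as below; the sketch assumes the RCT treatments are zero-indexed and stored in NumPy arrays alongside the observed responses and costs.

```python
import numpy as np

def eom_estimate(x, t_rct, y_rct, z_rct):
    """Expected Outcome Metric: average the RCT response/cost over users whose
    RCT treatment coincides with the treatment assigned by the allocation x."""
    assigned = np.argmax(x, axis=1)   # treatment index chosen for each user by x
    matched = (t_rct == assigned)     # the index set S_alpha from Equation (5)
    return y_rct[matched].mean(), z_rct[matched].mean()  # (V, C) from Equation (6)
```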

Algorithm 1 Evaluate $Q(v,c)$ given $v$ and $c$
1:Input: the predicted response/cost matrices $v$/$c$ of a group of $B$ users; the total budget $T$; the maximum number of bisection iterations $I$; the upper bound $\alpha_{\textrm{max}}$ for bisection; the tolerance $\epsilon$ on the per-capita cost.
2:Output: the expected per-capita response $Q(v,c)$ under the optimal budget allocation solution.
3:Initialize $t\leftarrow 1$, $\alpha_{l}\leftarrow 0$, $\alpha_{h}\leftarrow\alpha_{\textrm{max}}$.
4:while $t\leq I$ do
5:     // Evaluate the middle point of the search interval.
6:     $\alpha_{m}\leftarrow\frac{\alpha_{l}+\alpha_{h}}{2}$
7:     Recover the primal solution $x_{m}$ corresponding to $\alpha_{m}$ using the KKT conditions in Equation (4).
8:     Based on $x_{m}$, evaluate the empirical per-capita response $V\left(v,c,\alpha_{m}\right)$ and the empirical per-capita cost $C\left(v,c,\alpha_{m}\right)$ using Equations (5) and (6).
9:     // Check whether we can break early.
10:     Calculate the per-capita cost error: $e\leftarrow\left\lvert C\left(v,c,\alpha_{m}\right)-\frac{T}{B}\right\rvert$.
11:     if $e\leq\epsilon$ then
12:         break
13:     end if
14:     // Update the search interval.
15:     if $C\left(v,c,\alpha_{m}\right)>\frac{T}{B}$ then
16:         // The empirical cost is too high; turn down the budget in the MCKP by increasing $\alpha$.
17:         $\alpha_{l}\leftarrow\alpha_{m}$
18:     else
19:         // The empirical cost is too low; turn up the budget in the MCKP by decreasing $\alpha$.
20:         $\alpha_{h}\leftarrow\alpha_{m}$
21:     end if
22:     // Update the step counter.
23:     $t\leftarrow t+1$
24:end while
25:return the final empirical per-capita response $V\left(v,c,\alpha_{m}\right)$.

As discussed, the obtained $x$ is the optimal budget allocation solution for these $B$ users under a certain budget constraint, so we can tune the total budget by adjusting the value of the dual feasible solution $\alpha$ (Kellerer et al., 2004). As such, we can start from $\alpha=0$ and gradually increase its value, and the total budget in the MCKP decreases accordingly. In this process, we can plot the empirical per-capita cost $C\left(v,c,\cdot\right)$ and the empirical per-capita response $V\left(v,c,\cdot\right)$ in the same figure to illustrate the users' expected responses under different costs; an example is shown in Figure 1. (Note that this curve might not be strictly monotonically increasing since the empirical estimation process is noisy.) Based on this curve, we can intuitively use bisection search to find a suitable $\alpha^{\prime}$ such that the empirical per-capita cost $C\left(v,c,\alpha^{\prime}\right)$ equals the given per-capita budget $\frac{T}{B}$ in the original MCKP, and define $Q(v,c)$ as the empirical per-capita response at that $\alpha^{\prime}$: $Q(v,c)=V\left(v,c,\alpha^{\prime}\right)$. The details of evaluating $Q(v,c)$ given $v$ and $c$ are summarized in Algorithm 1.

The bisection process in Algorithm 1 can also be extended to an $n$-ary search by evaluating more middle points in the search interval in each iteration. Although the bisection method and the $n$-ary search share the same asymptotic time complexity, on modern GPU devices the $n$-ary search usually runs faster if the implementation is properly vectorized, as sketched below.
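Building on the EOM sketch above, the following NumPy code is a hypothetical vectorized $n$-ary variant of Algorithm 1: each iteration evaluates several candidate values of $\alpha$ at once and shrinks the search interval around the candidate whose empirical per-capita cost is closest to the per-capita budget. Treatments are assumed zero-indexed, and the interval-update rule relies on the (approximately) monotone relation between $\alpha$ and spend.

```python
import numpy as np

def evaluate_Q(v, c, t_rct, y_rct, z_rct, per_capita_budget,
               alpha_max=10.0, iters=20, n_points=8, tol=1e-4):
    """n-ary search over the dual variable alpha (a sketch of Algorithm 1)."""
    alpha_lo, alpha_hi = 0.0, alpha_max
    best_V = 0.0
    for _ in range(iters):
        alphas = np.linspace(alpha_lo, alpha_hi, n_points + 2)[1:-1]  # interior candidates
        # KKT recovery (Equation (4)) for all candidates at once: shape (n_points, B).
        assigned = np.argmax(v[None, :, :] - alphas[:, None, None] * c[None, :, :], axis=2)
        matched = (assigned == t_rct[None, :])                  # S_alpha for each candidate
        counts = np.maximum(matched.sum(axis=1), 1)             # avoid division by zero
        V = (matched * y_rct[None, :]).sum(axis=1) / counts     # empirical per-capita responses
        C = (matched * z_rct[None, :]).sum(axis=1) / counts     # empirical per-capita costs
        k = int(np.argmin(np.abs(C - per_capita_budget)))       # candidate closest to the budget
        best_V = V[k]
        if abs(C[k] - per_capita_budget) <= tol:
            break
        # Larger alpha -> smaller spend, so keep the sub-interval around candidate k.
        alpha_lo = alphas[k - 1] if k > 0 else alpha_lo
        alpha_hi = alphas[k + 1] if k < n_points - 1 else alpha_hi
    return best_V
```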

Given a batch of training data with $B$ RCT users, instead of optimizing the original S-Learner objective $L_{\textrm{SL}}(\theta)$ in Equation (1), we can directly optimize the marketing effectiveness by minimizing the following regularized objective $L_{R}(\theta)$ on this batch:

(7) L_{R}(\theta)=L_{\textrm{SL}}(\theta)-\lambda\cdot Q(v,c),

where $L_{\textrm{SL}}(\theta)$ is the original loss in Equation (1) on this batch, and $\lambda>0$ is a scaling factor that balances the importance of the original objective $L_{\textrm{SL}}(\theta)$ and the regularizer $Q(v,c)$.

If the gradients $\frac{\partial Q(v,c)}{\partial v}$ and $\frac{\partial Q(v,c)}{\partial c}$ can be obtained (see Section 3.4), we can integrate them into the back-propagation process (see Section 3.5). As such, the regularizer $Q(v,c)$ can be jointly optimized with the original learning objective $L_{\textrm{SL}}(\theta)$, and the whole objective $L_{R}(\theta)$ in Equation (7) can be efficiently optimized in an end-to-end fashion. This endows the trained model with the ability to directly learn from the marketing goal and maximize it in a principled way.

3.4. Gradient Estimation

As shown in Algorithm 1, the construction of the regularizer $Q(v,c)$ involves many non-differentiable operations. Thus, if we naïvely implement Algorithm 1 as a computational graph in modern machine learning libraries such as TensorFlow (Abadi et al., 2015) and PyTorch (Paszke et al., 2019), the automatic differentiation mechanism would not generate correct gradients. In this section, we present our solution to overcome this challenge and calculate (estimates of) the gradients $\frac{\partial Q(v,c)}{\partial v}$ and $\frac{\partial Q(v,c)}{\partial c}$. We treat $Q(v,c)$ as a black-box function and use two strategies to estimate the gradients: 1) the finite-difference strategy in Section 3.4.1, and 2) natural evolution strategies (NES) (Wierstra et al., 2014) in Section 3.4.2. For clarity of presentation, we detail only the estimation of $\frac{\partial Q(v,c)}{\partial v}$; $\frac{\partial Q(v,c)}{\partial c}$ can be estimated similarly.

3.4.1. Finite-Difference Strategy

The finite-difference strategy is commonly used in the field of black-box adversarial attacks (Chen et al., 2017; Tu et al., 2019) to estimate the gradient of a complicated black-box function. Specifically, given the predicted responses $v\in[0,1]^{B\times K}$ of a batch of $B$ RCT training users, the $l$-th element of the gradient matrix $\frac{\partial Q(v,c)}{\partial v}\in\mathbb{R}^{B\times K}$ can be estimated by:

(8) \left(\frac{\partial Q(v,c)}{\partial v}\right)_{l}\approx\frac{Q\left(v+h\cdot e_{l},c\right)-Q\left(v-h\cdot e_{l},c\right)}{2h},

where $l=1,2,\ldots,B\times K$, $h$ is a small positive constant used to perturb the input of the $Q$ function, and $e_{l}\in\{0,1\}^{B\times K}$ is the standard basis matrix whose $l$-th component is 1 and all other components are 0. We use the symmetric difference quotient (Peter and Maria, 2014) to improve numerical stability.

Using Equation (8), we could in principle estimate each entry of the matrix $\frac{\partial Q(v,c)}{\partial v}$ one by one and stack the results to obtain the entire gradient matrix, at the cost of evaluating the $Q$ function $2\times B\times K$ times. However, the forward pass of $Q$ is sophisticated, so evaluating it $2\times B\times K$ times can be computationally intensive in practice. To accelerate the finite-difference process, we use a strategy somewhat analogous to coordinate descent: randomly masking some entries. Concretely, in each training iteration, instead of evaluating all $B\times K$ entries of $\frac{\partial Q(v,c)}{\partial v}$, we randomly select $F^{\prime}\ll B\times K$ entries and evaluate only these entries using Equation (8), setting the remaining entries to zero. In practice, we can adjust the hyper-parameter $F^{\prime}$ to achieve a sensible trade-off between computational efficiency and the accuracy of the gradient estimate. A sketch of this masked estimator is given below.
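A minimal sketch of the masked finite-difference estimator follows; $Q$ is treated as a black-box callable, and the uniform random choice of the $F^{\prime}$ entries is an assumed sampling scheme.

```python
import numpy as np

def fd_gradient(Q, v, c, n_entries, h=1e-3):
    """Estimate dQ/dv on a random subset of entries via symmetric difference
    quotients (Equation (8)); all other entries of the estimate stay zero."""
    grad = np.zeros_like(v)
    for l in np.random.choice(v.size, size=n_entries, replace=False):
        i, j = np.unravel_index(l, v.shape)
        v_plus, v_minus = v.copy(), v.copy()
        v_plus[i, j] += h
        v_minus[i, j] -= h
        grad[i, j] = (Q(v_plus, c) - Q(v_minus, c)) / (2.0 * h)
    return grad
```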

3.4.2. Natural Evolution Strategy

NES is a popular derivative-free numerical optimization algorithm that is commonly used in the field of black-box adversarial attacks (Ilyas et al., 2018, 2019; Guo et al., 2019). Suppose a random variable $v^{\prime}\in\mathbb{R}^{B\times K}$ follows an isotropic normal distribution centered at $v$ with covariance $\sigma^{2}I$: $v^{\prime}\sim\mathcal{N}\left(v,\sigma^{2}I\right)$, where $\sigma$ is a small positive constant and $I$ is a $(B\times K)$-by-$(B\times K)$ identity matrix. NES seeks to maximize the expected value of the $Q$ function under the distribution of $v^{\prime}$ by estimating the gradient using the "log derivative trick" (a similar trick is used in policy gradient algorithms in reinforcement learning):

(9) \begin{aligned} \frac{\partial}{\partial v}\mathbb{E}_{v^{\prime}}\left[Q\left(v^{\prime},c\right)\right] &=\frac{\partial}{\partial v}\int Q\left(v^{\prime},c\right)\pi\left(v^{\prime}\middle|\,v\right)\,\textrm{d}v^{\prime}\\ &=\int Q\left(v^{\prime},c\right)\frac{\partial}{\partial v}\pi\left(v^{\prime}\middle|\,v\right)\,\textrm{d}v^{\prime}\\ &=\int Q\left(v^{\prime},c\right)\frac{\pi\left(v^{\prime}\middle|\,v\right)}{\pi\left(v^{\prime}\middle|\,v\right)}\frac{\partial}{\partial v}\pi\left(v^{\prime}\middle|\,v\right)\,\textrm{d}v^{\prime}\\ &=\int Q\left(v^{\prime},c\right)\pi\left(v^{\prime}\middle|\,v\right)\frac{\partial}{\partial v}\log\left(\pi\left(v^{\prime}\middle|\,v\right)\right)\,\textrm{d}v^{\prime}\\ &=\mathbb{E}_{v^{\prime}}\left[Q\left(v^{\prime},c\right)\frac{\partial}{\partial v}\log\left(\pi\left(v^{\prime}\middle|\,v\right)\right)\right], \end{aligned}

where $\pi\left(v^{\prime}\middle|\,v\right)$ is the probability density function of the random variable $v^{\prime}$. If the variance $\sigma^{2}$ is small enough, the expected value of $Q$ in Equation (9) should be close to $Q(v,c)$, and thus we can use NES to estimate the gradient $\frac{\partial Q(v,c)}{\partial v}$:

(10) \begin{aligned} \frac{\partial Q(v,c)}{\partial v} &\approx\frac{\partial}{\partial v}\mathbb{E}_{v^{\prime}}\left[Q\left(v^{\prime},c\right)\right]\\ &=\mathbb{E}_{v^{\prime}}\left[Q\left(v^{\prime},c\right)\frac{\partial}{\partial v}\log\left(\pi\left(v^{\prime}\middle|\,v\right)\right)\right]\\ &\approx\frac{1}{\sigma N^{\prime}}\sum_{i=1}^{N^{\prime}}\delta_{i}\cdot Q\left(v+\sigma\delta_{i},c\right), \end{aligned}

where in the last line we use the empirical average to approximate the expectation, $\delta_{i}\in\mathbb{R}^{B\times K}$ is a sample from the standard normal distribution in the $B\times K$-dimensional space, and $N^{\prime}$ is the number of sampled search directions. Using Equation (10), we can obtain an unbiased estimate of the smoothed gradient $\frac{\partial}{\partial v}\mathbb{E}_{v^{\prime}}\left[Q\left(v^{\prime},c\right)\right]$, which approximates $\frac{\partial Q(v,c)}{\partial v}$, in a derivative-free fashion. The hyper-parameter $N^{\prime}$ controls the trade-off between the computational efficiency and the accuracy of NES.
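A minimal sketch of the NES estimator in Equation (10) follows; an antithetic variant is sketched after the discussion of antithetic sampling below.

```python
import numpy as np

def nes_gradient(Q, v, c, n_dirs, sigma=1e-3):
    """Monte Carlo estimate of dQ/dv via natural evolution strategies (Equation (10))."""
    grad = np.zeros_like(v)
    for _ in range(n_dirs):
        delta = np.random.randn(*v.shape)       # one Gaussian search direction
        grad += delta * Q(v + sigma * delta, c)
    return grad / (sigma * n_dirs)
```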

Algorithm 2 Train the DNN model end-to-end for budget allocation
1:Input: a training dataset consisting of RCT users; an initialized DNN model parameter $\theta$.
2:Output: a well-trained DNN model for budget allocation.
3:while not converged do
4:     Get one batch of training data from the training set.
5:     Run forward propagation to get the predicted results $v$ and $c$.
6:     Estimate $\frac{\partial Q(v,c)}{\partial v}$ and $\frac{\partial Q(v,c)}{\partial c}$ using the finite-difference strategy (Equation (8)) or NES (Equation (10)).
7:     Calculate the surrogate loss $L_{S}(\theta)$ using Equation (11).
8:     Run back-propagation to obtain $\frac{\partial L_{S}(\theta)}{\partial\theta}$.
9:     Run one gradient descent step to update $\theta$ based on $\frac{\partial L_{S}(\theta)}{\partial\theta}$.
10:end while
11:return the well-trained DNN model $\theta$.

Like other Monte Carlo methods, NES suffers from high variance when estimating a vector in a high-dimensional space. To reduce the variance, we employ antithetic sampling (Salimans et al., 2017) to generate the $N^{\prime}$ sampling directions: we first sample $\frac{N^{\prime}}{2}$ i.i.d. Gaussian noises $\delta_{i},\,i=1,2,\ldots,\frac{N^{\prime}}{2}$, and then set the remaining $\frac{N^{\prime}}{2}$ directions by flipping the signs of the previous ones: $\delta_{i}=-\delta_{N^{\prime}-i+1},\,i=\frac{N^{\prime}}{2}+1,\frac{N^{\prime}}{2}+2,\ldots,N^{\prime}$. This sampling strategy has been empirically verified to be effective in many previous works (Ilyas et al., 2018, 2019; Guo et al., 2019).
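A variance-reduced variant of the NES sketch above using antithetic sampling could look as follows; it simply pairs each sampled direction with its negation.

```python
import numpy as np

def nes_gradient_antithetic(Q, v, c, n_dirs, sigma=1e-3):
    """NES gradient estimate with antithetic (mirrored) sampling directions."""
    assert n_dirs % 2 == 0, "n_dirs must be even to form antithetic pairs"
    grad = np.zeros_like(v)
    for _ in range(n_dirs // 2):
        delta = np.random.randn(*v.shape)
        grad += delta * Q(v + sigma * delta, c)    # +delta direction
        grad -= delta * Q(v - sigma * delta, c)    # mirrored -delta direction
    return grad / (sigma * n_dirs)
```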

3.5. Training

In this section, we present our solution to train the DNN model to maximize the marketing goal by optimizing the regularized objective $L_{R}(\theta)$ in Equation (7). Common gradient-descent-based optimizers such as SGD or Adam (Kingma and Ba, 2015) require the gradient w.r.t. the DNN model parameters (i.e., $\frac{\partial Q(v,c)}{\partial\theta}$) to update $\theta$, while in Section 3.4 we only obtain $\frac{\partial Q(v,c)}{\partial v}$ and $\frac{\partial Q(v,c)}{\partial c}$. Thus, a back-propagation step that converts $\frac{\partial Q(v,c)}{\partial v}$ and $\frac{\partial Q(v,c)}{\partial c}$ into $\frac{\partial Q(v,c)}{\partial\theta}$ is required. To this end, we design the following surrogate loss $L_{S}(\theta)$:

(11) \begin{split}L_{S}(\theta)=L_{\textrm{SL}}(\theta)&-\lambda\cdot\operatorname{Tr}\left[v^{\mathrm{T}}\,\texttt{stop\_gradient}\left(\frac{\partial Q(v,c)}{\partial v}\right)\right]\\ &-\lambda\cdot\operatorname{Tr}\left[c^{\mathrm{T}}\,\texttt{stop\_gradient}\left(\frac{\partial Q(v,c)}{\partial c}\right)\right],\end{split}

where $\lambda>0$ is a scaling factor that balances the importance of the original objective and the regularizer, $\operatorname{Tr}[\cdot]$ is the matrix trace operator, $v^{\mathrm{T}}$ and $c^{\mathrm{T}}$ are the transposes of the matrices $v$ and $c$ respectively, and $\texttt{stop\_gradient}(\cdot)$ is the operator that treats a variable as a constant and prevents back-propagation through it (in TensorFlow this is achieved via tf.stop_gradient(), and in PyTorch via tensor.detach()). Unlike the regularizer part of $L_{R}(\theta)$, which is cumbersome to handle, the regularizer part of $L_{S}(\theta)$ is a simple linear combination of the elements of the matrices $v$ and $c$, which are the output values of the last layer of the DNN model. As such, the gradient of $L_{S}(\theta)$ w.r.t. $\theta$ can be easily obtained by running standard back-propagation over the DNN model. Furthermore, it can be verified that $L_{S}(\theta)$ and $L_{R}(\theta)$ share the same first-order gradient: $\frac{\partial L_{S}(\theta)}{\partial\theta}=\frac{\partial L_{R}(\theta)}{\partial\theta}$. Thus, with the help of the surrogate loss $L_{S}(\theta)$, we can effectively optimize the original regularized objective $L_{R}(\theta)$, and the well-trained DNN model learns to maximize the marketing goal in an end-to-end manner. Algorithm 2 summarizes the training details.
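A minimal PyTorch sketch of the surrogate loss in Equation (11) is given below, assuming the gradient estimates come from one of the black-box estimators above as NumPy arrays; detaching them makes the regularizer a linear function of the model outputs, so standard back-propagation yields $\frac{\partial L_{S}(\theta)}{\partial\theta}$.

```python
import torch

def surrogate_loss(sl_loss_value, v, c, dQ_dv, dQ_dc, lam):
    """Equation (11): Tr[v^T G] equals the elementwise product sum, and the
    detached gradient estimates act as constants during back-propagation."""
    gv = torch.as_tensor(dQ_dv, dtype=v.dtype, device=v.device).detach()
    gc = torch.as_tensor(dQ_dc, dtype=c.dtype, device=c.device).detach()
    reg = (v * gv).sum() + (c * gc).sum()
    return sl_loss_value - lam * reg
```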

4. Experimental Results

4.1. Datasets

We evaluate the efficacy of our method on three different datasets:

  • Synthetic dataset. To elaborate on the key components of our idea (i.e., EOM and gradient estimation), we generate a synthetic dataset consisting of 10,000 RCT users. We set four different treatments (i.e., $K=4$), and the treatment $t_{i}$ applied to the $i$-th user in the RCT is randomly sampled from these four treatments with equal probabilities. We assume the $i$-th user has a four-dimensional ground-truth vector $v^{\textrm{gt}}_{i}\in[0,1]^{4}$, where the $j$-th element of $v^{\textrm{gt}}_{i}$ (denoted by $v^{\textrm{gt}}_{ij}$) represents the probability of $y_{i}=1$ when treatment $j$ is applied to that user. We use the following three steps to generate $v^{\textrm{gt}}_{i}$: 1) the first element of $v^{\textrm{gt}}_{i}$ (i.e., $v^{\textrm{gt}}_{i1}$) is sampled from the uniform distribution on $[0,0.1]$; 2) to determine the remaining elements of $v^{\textrm{gt}}_{i}$, we sample three i.i.d. numbers from the uniform distribution on $[0,0.2]$ and sort them in descending order (denoted by $0.2\geq k_{1}\geq k_{2}\geq k_{3}\geq 0$); and 3) these three numbers are used as the incremental values of $v^{\textrm{gt}}_{i}$: we let $v^{\textrm{gt}}_{ij}=v^{\textrm{gt}}_{i(j-1)}+k_{j-1}$ for $j=2,3,4$. Once the four-dimensional ground-truth vector $v^{\textrm{gt}}_{i}$ is generated, we take its $t_{i}$-th element (denoted by $v^{\textrm{gt}}_{i(t_{i})}$) and sample the binary response $y_{i}$ of the $i$-th user from the Bernoulli distribution with probability $v^{\textrm{gt}}_{i(t_{i})}$. We assume the cost of applying the $j$-th treatment to any user is $j$; in other words, we set the $(i,j)$-th element of the ground-truth cost matrix $c^{\textrm{gt}}$ to $c^{\textrm{gt}}_{ij}=j$, where $j=1,2,3,4$. The synthetic dataset is designed to illustrate the effectiveness of EOM and gradient estimation, so user features are omitted in this dataset. (A generation sketch is given after this list.)

  • CRITEO-UPLIFT v2 (Diemert et al., 2018). This publicly available dataset is designed for evaluating uplift models. It contains 13.9 million samples collected from an RCT. Each sample has 12 dense features, one binary treatment indicator, and two binary labels: visit and conversion. In this dataset, the treatment is defined as whether the user is targeted by advertising, and a label is positive if the user visited/converted on the advertiser's website during a two-week test period. To evaluate different budget allocation methods, we follow (Zhou et al., 2023) and use the visit/conversion label as the cost/value, respectively. We compare the performance of our proposed method with state-of-the-art methods on CRITEO-UPLIFT v2 to demonstrate the effectiveness of our method.

  • KUAISHOU-PRODUCE-COIN. Produce Coin is a common marketing campaign in Kuaishou that incentivizes short video creators to upload more short videos. In this campaign, we provide a prized task to each short video creator on the Kuaishou platform. Concretely, in each task, if a creator uploads a video within 24 hours, he/she receives a certain amount of coins as a reward. Each creator can see the number of coins he/she could possibly obtain, and if the number seems attractive enough the creator might finish the task to collect these coins. We manually select 30 distinct coin levels (e.g., 1 coin, 2 coins, and so on), and each creator is assigned one of these 30 levels. The amount of coins can be personalized to maximize the total number of video uploads under the constraint that the total amount of coins does not exceed a certain limit. In this dataset, we define the treatment as the coin level, so we have $K=30$ treatments. For user (i.e., creator) $u_{i}$, the response is $y_{i}=1$ if the user successfully completed the task and $y_{i}=0$ otherwise. The cost of applying a treatment to $u_{i}$ is defined as the real number of coins sent to $u_{i}$: if $y_{i}=1$ we set the cost to the number of coins corresponding to the assigned treatment, and if $y_{i}=0$ we set the cost to zero. To collect this dataset we conducted an RCT for one week. In this RCT, for each user we randomly sample one reward level from the 30 predefined levels with equal probabilities as the treatment and collect the response/cost label from the logs. This dataset contains 82.4 million samples, and we deploy several models trained on this dataset on the Kuaishou platform to evaluate the online performance of different methods. To protect data privacy, we normalize the per-capita response/cost and report the normalized values in all tables and figures.
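As referenced in the synthetic dataset description above, the following NumPy sketch generates the synthetic RCT under stated assumptions (zero-indexed treatments, a fixed random seed); it mirrors the three generation steps but is not the exact script used in the paper.

```python
import numpy as np

def generate_synthetic(n_users=10_000, K=4, seed=0):
    """Monotone ground-truth response probabilities with decreasing increments,
    uniform random RCT treatments, and cost j for the j-th treatment."""
    rng = np.random.default_rng(seed)
    v1 = rng.uniform(0.0, 0.1, size=(n_users, 1))                                  # step 1: base probability
    incr = np.sort(rng.uniform(0.0, 0.2, size=(n_users, K - 1)), axis=1)[:, ::-1]  # step 2: k1 >= k2 >= k3
    v_gt = np.concatenate([v1, v1 + np.cumsum(incr, axis=1)], axis=1)              # step 3: cumulative increments
    t = rng.integers(0, K, size=n_users)                                           # zero-indexed RCT treatment
    y = rng.binomial(1, v_gt[np.arange(n_users), t])                               # observed binary response
    c_gt = np.tile(np.arange(1, K + 1), (n_users, 1)).astype(float)                # cost of treatment j is j
    return v_gt, c_gt, t, y
```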

4.2. Evaluation Metrics

We use the following evaluation metrics to evaluate and compare the performance of different methods:

  • AUCC (Area Under the Cost Curve) (Du et al., 2019). AUCC is commonly used in the existing literature (Du et al., 2019; Ai et al., 2022; Zhou et al., 2023) to evaluate the ranking performance of uplift models in the two-treatment setting. Interested readers can check the original paper (Du et al., 2019) for more details about the evaluation process of AUCC. In this paper, we use AUCC to compare the performance of different methods on CRITEO-UPLIFT v2.

  • EOM (Expected Outcome Metric). EOM or similar metrics are commonly used in the existing literature (Ai et al., 2022; Zhou et al., 2023) to empirically estimate the expected outcome (e.g., per-capita response or per-capita cost) of an arbitrary budget allocation policy. EOM provides an unbiased estimate of any outcome, provided that the outcome is computable given a budget allocation solution on RCT data; thus, EOM is more flexible than AUCC in practice. In this paper, we use EOM on the synthetic dataset and the KUAISHOU-PRODUCE-COIN dataset to evaluate different budget allocation methods; the technical details of EOM were introduced in Equations (5) and (6) in Section 3.3.

4.3. Implementation Details

Implementation details on different datasets are shown as follows:

  • The synthetic dataset. To illustrate the basic idea of EOM and gradient estimation in our proposed method, we first generate a cost matrix $c\in\mathbb{R}_{+}^{10,000\times 4}$ and a value matrix $v\in[0,1]^{10,000\times 4}$ as the starting point, and then use the Adam optimizer with a learning rate of 0.005 to optimize $v$ by maximizing $Q(v,c)$ for 100 gradient ascent steps. For the cost matrix $c$, we directly use the ground-truth cost matrix $c=c^{\textrm{gt}}$ and keep it fixed during gradient ascent. For the starting point of the value matrix $v$, we use the same procedure as described in Section 4.1 to re-generate a random matrix $v$ that satisfies the following conditions for all $i$: 1) $0\leq v_{i1}\leq v_{i2}\leq v_{i3}\leq v_{i4}$, and 2) $v_{i2}-v_{i1}\geq v_{i3}-v_{i2}\geq v_{i4}-v_{i3}\geq 0$. We set the per-capita budget to 2.0 when evaluating the $Q$ function. Since user features are omitted in the synthetic dataset, we cannot expect any generalization; thus, we treat all 10,000 samples as training samples and report the per-capita response and per-capita cost estimated by EOM on all training samples.

  • CRITEO-UPLIFT v2. We compare our proposed method with the following baseline methods in terms of AUCC:

    • TSM-SL. The baseline two-stage method used in much of the existing literature (Ai et al., 2022; Zhao et al., 2019; Zhou et al., 2023). In the first stage, a DNN-based S-Learner model is used to predict the conversion uplifts and visit uplifts for individuals. In the second stage, an MCKP formulation can be used to find the optimal budget allocation solution. For AUCC, however, the explicit budget allocation solution is not required; we simply rank individuals by the ratio between the conversion uplift and the visit uplift and use this ranking to compute AUCC.

    • Direct Rank (Du et al., 2019). Du et al. (2019) propose a direct rank method to directly learn the ratios between values and costs and use the ratios to rank individuals.

    • DRP (Zhou et al., 2023). Zhou et al. (2023) design several decision factors for MCKP solutions, and propose a surrogate loss function to directly learn these decision factors.

    Following (Zhou et al., 2023), we randomly sample 70% of the samples from the dataset as the training set, and the remaining samples are used as the test set. We report the performance of different methods on the test set, and for the baseline methods we directly cite the results from (Zhou et al., 2023). For our proposed method, we use the DNN architecture described in Section 3.2: the base DNN is a 4-layer MLP (hidden sizes 512-256-128-64), and each header DNN is a 2-layer MLP (hidden sizes 32-1). We insert a ReLU activation layer after each fully-connected layer in both the base DNN and the header DNNs, except for the final output layer of each header DNN. We also insert a Batch Normalization layer (Ioffe and Szegedy, 2015) after each activation layer for faster convergence. The value of $\lambda$ in the surrogate loss $L_{S}(\theta)$ in Equation (11) is set to 200. For the finite-difference strategy, we set $F^{\prime}=4{,}000$ and $h=0.0003$. For NES, we set $N^{\prime}=2{,}000$ and $\sigma=0.001$. MSRA initialization (He et al., 2015) is used to initialize all weight matrices in our model. The randomly initialized DNN S-Learner model is trained using the Adam optimizer with a learning rate of 0.0001 for 10 epochs. In each training step, the per-capita budget for evaluating the $Q$ function is randomly sampled from the uniform distribution on $[0.039,0.047]$. Following (Zhou et al., 2023), we run our proposed method 20 times and report the mean and variance of the results.

  • KUAISHOU-PRODUCE-COIN. To demonstrate the effectiveness of our proposed training framework in real-world scenarios, we conduct both offline evaluation and online evaluation to compare our proposed method and TSM-SL. The offline experiment and online experiment share the following settings:

    • The architectures of the base DNN and header DNNs are kept the same as in the CRITEO-UPLIFT v2 dataset.

    • We train the DNN S-Learner model using the Adam optimizer with a learning rate of 0.0003 for 20 epochs. To make a fair comparison, we use identical feature engineering/DNN architecture/training policies for our proposed method and TSM-SL.

    • In each training step, the per-capita budget for evaluating the $Q$ function is randomly sampled from the uniform distribution on $[1.0,1.5]$.

    Furthermore, in the offline experiment:

    • We randomly sample 80% of users from the KUAISHOU-PRODUCE-COIN dataset as the training set and the remaining 20% is used as the test set.

    • We employ EOM and bisection to find the per-capita response when the per-capita cost equals 1.2, and this metric is used to measure the performance of the different methods.

    In the online experiment:

    • All samples in the KUAISHOU-PRODUCE-COIN dataset are used for training and the obtained well-trained models are deployed in an online A/B test in the Kuaishou platform to compare the efficacy of their budget allocation policies in the real world.

    • The online A/B test is conducted for one week. There are 14.2 million users in the online A/B test, and these users are disjoint from the users in the KUAISHOU-PRODUCE-COIN dataset. We randomly partition the 14.2 million users into two equally sized groups, one for our proposed method and the other for TSM-SL. For each group, the reported per-capita response/cost is computed directly over the entire 7.1 million users, and EOM is not employed.

Figure 2. Results on the synthetic dataset (panels (a)-(d)).

4.4. Results on Synthetic Dataset

Results on the synthetic dataset are summarized in Figure 2. The black line in Figure 2(a) shows the statistics of the per-capita cost and per-capita value in the RCT. As discussed in Section 3.3, starting from any cost and value matrices we can obtain a series of budget allocation solutions corresponding to different budgets, and further invoke EOM to empirically estimate the per-capita value $V\left(v,c,\alpha\right)$ and per-capita cost $C\left(v,c,\alpha\right)$ of each budget allocation solution. We perform this EOM procedure using the ground-truth $v^{\textrm{gt}}$ and $c^{\textrm{gt}}$, and plot the trajectory of $V\left(v^{\textrm{gt}},c^{\textrm{gt}},\cdot\right)$ and $C\left(v^{\textrm{gt}},c^{\textrm{gt}},\cdot\right)$ as the red dotted line in Figure 2(a). The value of the $Q$ function for the ground-truth $v^{\textrm{gt}}$ and $c^{\textrm{gt}}$ is $Q\left(v^{\textrm{gt}},c^{\textrm{gt}}\right)=0.2123$.

To verify the effectiveness of the gradient estimation process, we first generate the initial $v$ and $c$ as described in Section 4.3 and study the cosine similarity between our estimated gradient and the precise gradient. To this end, we first set $F^{\prime}=40{,}000$ and use the finite-difference strategy to calculate the precise gradient $\frac{\partial Q(v,c)}{\partial v}$. Then, we use NES to estimate $\frac{\partial Q(v,c)}{\partial v}$ (Equation (10)) while gradually increasing $N^{\prime}$. Finally, we check the cosine similarity between the NES-estimated gradient and the precise gradient. Figure 2(b) shows the relation between the cosine similarity and $N^{\prime}$: as more sampling directions are involved in NES, the estimated gradients become more accurate.

We further plug the NES-estimated gradient into an Adam optimizer to maximize $Q(v,c)$, which represents the per-capita value when the per-capita cost equals 2.0. Figure 2(c) shows the per-capita value and per-capita cost of the initial $v$ and $c$; the value of the $Q$ function for the initial $v$ and $c$ is $Q(v,c)=0.1996$. After 100 gradient ascent steps, the value of the $Q$ function increases from 0.1996 to 0.2141. Figure 2(d) shows the per-capita value and per-capita cost of the optimized $v$ and $c$. Comparing Figure 2(c) and Figure 2(d), we see that our proposed method can effectively increase the value of the $Q$ function. From the properties of EOM, we know that $Q$ is an unbiased estimator of the marketing goal, so the marketing goal is effectively maximized. Since we only have 10,000 data samples, the Adam optimizer can severely overfit the training data, which explains the peak in Figure 2(d) around the region where the per-capita budget equals 2.0.

4.5. Results on CRITEO-UPLIFT v2

Table 1. Results on CRITEO-UPLIFT v2.
Method AUCC
TSM-SL 0.7561 $\pm$ 0.0113
Direct Rank (Du et al., 2019) 0.7562 $\pm$ 0.0131
DRP (Zhou et al., 2023) 0.7739 $\pm$ 0.0002
Ours-NES 0.7849 $\pm$ 0.0167
Ours-FD 0.7904 $\pm$ 0.0096

Results on CRITEO-UPLIFT v2 are summarized in Table 1. Ours-FD and Ours-NES refer to our proposed method with the finite-difference strategy and NES as the gradient estimator, respectively. Both Ours-FD and Ours-NES significantly outperform all competing methods in terms of AUCC. Ours-NES has the highest variance in AUCC over the 20 runs, which can be partially explained by the fact that NES gradients are noisy. Although the performance of Ours-FD surpasses that of Ours-NES, Ours-FD costs about 4x more running time on an NVIDIA Tesla T4 GPU (Ours-FD 17.6 hours vs. Ours-NES 4.3 hours) since the $Q$ function must be evaluated more times.

4.6. Results on KUAISHOU-PRODUCE-COIN

We adopt Ours-FD for the KUAISHOU-PRODUCE-COIN dataset and compare it with TSM-SL.

Figure 3. Impact of $\lambda$ on the offline performance of our method on the KUAISHOU-PRODUCE-COIN dataset.

4.6.1. Offline Evaluation

We are interested in how the value of $\lambda$ in Equation (11) affects the performance of Ours-FD. We set $\lambda\in\{0,1,2,5,10,20,50,100,200,500,1000,2000,5000\}$ and measure the performance of Ours-FD; for each value of $\lambda$, we train the model four times and report the average performance. Figure 3 illustrates the impact of $\lambda$ in Equation (11) on Ours-FD. By definition, if we set $\lambda=0$, Ours-FD reduces to TSM-SL. If $\lambda>0$, the regularizer takes effect and the DNN S-Learner model can learn from the marketing goal directly. However, if $\lambda$ is too large, the regularizer dominates the training objective, the predicted responses and costs can drift far from their ground-truth values, and the training process can become unstable. In Figure 3, when $\lambda\leq 200$ the per-capita response increases with $\lambda$, so our regularizer is clearly beneficial for the marketing goal in this region. Furthermore, when $\lambda\geq 500$, the performance of Ours-FD is worse than that of TSM-SL because the training process becomes unstable.

4.6.2. Online Evaluation

For the online evaluation, we re-train the models using the entire KUAISHOU-PRODUCE-COIN dataset. We deploy Ours-FD with $\lambda=200$ and TSM-SL in the online A/B test. To measure the performance of the different methods, we calculate the per-capita response and per-capita cost by averaging the results over the seven days. Compared with TSM-SL, Ours-FD simultaneously achieves a 0.369% lower per-capita cost and a 1.24% higher per-capita response, which is considered a significant improvement relative to typical gains over recent months. The Ours-FD model is currently deployed to allocate marketing budgets for hundreds of millions of users.

5. Conclusion

In this paper, we investigate an end-to-end training framework for the budget allocation problem. We formulate the business goal under a given budget constraint as a black-box regularizer and develop two efficient gradient estimation algorithms to optimize it. For both algorithms, we introduce a hyper-parameter to trade off computational complexity against the accuracy of the estimated gradients. Our proposed method endows the trained DNN model with the ability to learn from the marketing goal directly and to maximize it. Extensive experiments on three datasets demonstrate the superiority of our method in both offline simulation and online experiments. Our future work will focus on reducing the variance of the gradient estimates and applying this approach to more complex real-world marketing scenarios.

References

  • Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.
  • Ai et al. (2022) Meng Ai, Biao Li, Heyang Gong, Qingwei Yu, Shengjie Xue, Yuan Zhang, Yunzhou Zhang, and Peng Jiang. 2022. LBCF: A Large-Scale Budget-Constrained Causal Forest Algorithm. In WWW 2022. 2310–2319.
  • Albert and Goldenberg (2022) Javier Albert and Dmitri Goldenberg. 2022. E-Commerce Promotions Personalization via Online Multiple-Choice Knapsack with Uplift Modeling. In CIKM 2022. 2863–2872.
  • Athey and Imbens (2016) Susan Athey and Guido Imbens. 2016. Recursive partitioning for heterogeneous causal effects. Proceedings of the National Academy of Sciences 113, 27 (2016), 7353–7360.
  • Athey et al. (2019) Susan Athey, Julie Tibshirani, and Stefan Wager. 2019. Generalized random forests. The Annals of Statistics 47, 2 (2019), 1148–1178.
  • Chen et al. (2017) Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. 2017. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM workshop on artificial intelligence and security. 15–26.
  • Diemert et al. (2018) Eustache Diemert, Artem Betlei, Christophe Renaudin, and Massih-Reza Amini. 2018. A large scale benchmark for uplift modeling. In SIGKDD 2018.
  • Du et al. (2019) Shuyang Du, James Lee, and Farzin Ghaffarizadeh. 2019. Improve User Retention with Causal Learning. In SIGKDD Workshop on Causal Discovery 2019. PMLR, 34–49.
  • Goldenberg et al. (2020) Dmitri Goldenberg, Javier Albert, Lucas Bernardi, and Pablo Estevez. 2020. Free lunch! retrospective uplift modeling for dynamic promotions recommendation within roi constraints. In ACM Conference on Recommender Systems 2020. 486–491.
  • Guelman et al. (2015) Leo Guelman, Montserrat Guillén, and Ana M Pérez-Marín. 2015. Uplift random forests. Cybernetics and Systems 46, 3-4 (2015), 230–248.
  • Guo et al. (2019) Yiwen Guo, Ziang Yan, and Changshui Zhang. 2019. Subspace attack: Exploiting promising subspaces for query-efficient black-box attacks. NeurIPS 2019 32.
  • He et al. (2015) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV 2015. 1026–1034.
  • Ilyas et al. (2018) Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. 2018. Black-box adversarial attacks with limited queries and information. In ICML 2018. PMLR, 2137–2146.
  • Ilyas et al. (2019) Andrew Ilyas, Logan Engstrom, and Aleksander Madry. 2019. Prior Convictions: Black-box Adversarial Attacks with Bandits and Priors. In ICLR 2019.
  • Ioffe and Szegedy (2015) Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML 2015. PMLR, 448–456.
  • Kellerer et al. (2004) Hans Kellerer, Ulrich Pferschy, and David Pisinger. 2004. The multiple-choice knapsack problem. In Knapsack Problems. Springer, 317–347.
  • Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR 2015.
  • Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In NIPS 2012, Vol. 25. 1097–1105.
  • Künzel et al. (2019) Sören R Künzel, Jasjeet S Sekhon, Peter J Bickel, and Bin Yu. 2019. Metalearners for estimating heterogeneous treatment effects using machine learning. Proceedings of the national academy of sciences 116, 10 (2019), 4156–4165.
  • Lin et al. (2017) Ying-Chun Lin, Chi-Hsuan Huang, Chu-Cheng Hsieh, Yu-Chen Shu, and Kun-Ta Chuang. 2017. Monetary discount strategies for real-time promotion campaign. In WWW 2017. 1123–1132.
  • Makhijani et al. (2019) Rahul Makhijani, Shreya Chakrabarti, Dale Struble, and Yi Liu. 2019. LORE: a large-scale offer recommendation engine with eligibility and capacity constraints. In ACM Conference on Recommender Systems 2019. 160–168.
  • Nie and Wager (2021) Xinkun Nie and Stefan Wager. 2021. Quasi-oracle estimation of heterogeneous treatment effects. Biometrika 108, 2 (2021), 299–319.
  • Oprescu et al. (2019) Miruna Oprescu, Vasilis Syrgkanis, and Zhiwei Steven Wu. 2019. Orthogonal random forest for causal inference. In ICML 2019. PMLR, 4932–4941.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. In NeurIPS 2019, Vol. 32.
  • Peter and Maria (2014) Peter D Lax and Maria Shea Terrell. 2014. Calculus With Applications.
  • Phillips (2021) Robert L Phillips. 2021. Pricing and revenue optimization. In Pricing and Revenue Optimization. Stanford university press.
  • Rzepakowski and Jaroszewicz (2012) Piotr Rzepakowski and Szymon Jaroszewicz. 2012. Decision trees for uplift modeling with single and multiple treatments. Knowledge and Information Systems 32, 2 (2012), 303–327.
  • Salimans et al. (2017) Tim Salimans, Jonathan Ho, Xi Chen, Szymon Sidor, and Ilya Sutskever. 2017. Evolution strategies as a scalable alternative to reinforcement learning. arXiv preprint arXiv:1703.03864 (2017).
  • Sołtys et al. (2015) Michał Sołtys, Szymon Jaroszewicz, and Piotr Rzepakowski. 2015. Ensemble methods for uplift modeling. Data Mining and Knowledge Discovery 29, 6 (2015), 1531–1559.
  • Sutton et al. (1999) Richard S Sutton, David McAllester, Satinder Singh, and Yishay Mansour. 1999. Policy gradient methods for reinforcement learning with function approximation. NeurIPS 1999 12 (1999).
  • Tu et al. (2019) Chun-Chen Tu, Paishun Ting, Pin-Yu Chen, Sijia Liu, Huan Zhang, Jinfeng Yi, Cho-Jui Hsieh, and Shin-Ming Cheng. 2019. Autozoom: Autoencoder-based zeroth order optimization method for attacking black-box neural networks. In AAAI 2019, Vol. 33. 742–749.
  • Tu et al. (2021) Ye Tu, Kinjal Basu, Cyrus DiCiccio, Romil Bansal, Preetam Nandy, Padmini Jaikumar, and Shaunak Chatterjee. 2021. Personalized treatment selection using causal heterogeneity. In WWW 2021. 1574–1585.
  • Wager and Athey (2018) Stefan Wager and Susan Athey. 2018. Estimation and inference of heterogeneous treatment effects using random forests. J. Amer. Statist. Assoc. 113, 523 (2018), 1228–1242.
  • Wierstra et al. (2014) Daan Wierstra, Tom Schaul, Tobias Glasmachers, Yi Sun, Jan Peters, and Jürgen Schmidhuber. 2014. Natural evolution strategies. JMLR 2014 15, 1 (2014), 949–980.
  • Xiao et al. (2019) Shuai Xiao, Le Guo, Zaifan Jiang, Lei Lv, Yuanbo Chen, Jun Zhu, and Shuang Yang. 2019. Model-based constrained MDP for budget allocation in sequential incentive marketing. In CIKM 2019. 971–980.
  • Zemel (1984) Eitan Zemel. 1984. An O (n) algorithm for the linear multiple choice knapsack problem and related problems. Information processing letters 18, 3 (1984), 123–128.
  • Zhang et al. (2021) Yang Zhang, Bo Tang, Qingyu Yang, Dou An, Hongyin Tang, Chenyang Xi, Xueying Li, and Feiyu Xiong. 2021. BCORLE(λ): An Offline Reinforcement Learning and Evaluation Framework for Coupons Allocation in E-commerce Market. In NeurIPS 2021, Vol. 34. 20410–20422.
  • Zhao et al. (2019) Kui Zhao, Junhao Hua, Ling Yan, Qi Zhang, Huan Xu, and Cheng Yang. 2019. A Unified Framework for Marketing Budget Allocation. In SIGKDD 2019. 1820–1830.
  • Zhao et al. (2017) Yan Zhao, Xiao Fang, and David Simchi-Levi. 2017. Uplift modeling with multiple treatments and general response types. In SIAM 2017. SIAM, 588–596.
  • Zhao and Harinen (2019) Zhenyu Zhao and Totte Harinen. 2019. Uplift modeling for multiple treatments with cost optimization. In IEEE International Conference on Data Science and Advanced Analytics (DSAA) 2019. IEEE, 422–431.
  • Zhou et al. (2023) Hao Zhou, Shaoming Li, Guibin Jiang, Jiaqi Zheng, and Dong Wang. 2023. Direct Heterogeneous Causal Learning for Resource Allocation Problems in Marketing. In AAAI 2023.
  • Zou et al. (2020) Will Y Zou, Shuyang Du, James Lee, and Jan Pedersen. 2020. Heterogeneous causal learning for effectiveness optimization in user marketing. arXiv preprint arXiv:2004.09702 (2020).