Marketing Budget Allocation with Offline Constrained
Deep Reinforcement Learning
Abstract.
We study the budget allocation problem in online marketing campaigns that utilize previously collected offline data. We first discuss the long-term effect of optimizing marketing budget allocation decisions in the offline setting. To overcome this challenge, we propose a novel game-theoretic offline value-based reinforcement learning method using mixed policies. Compared with previous methods that need to store infinitely many policies, the proposed method stores only constantly many policies, which achieves nearly optimal policy efficiency and makes it practical and favorable for industrial usage. We further show that this method is guaranteed to converge to the optimal policy, which cannot be achieved by previous value-based reinforcement learning methods for marketing budget allocation. Our experiments on a large-scale marketing campaign with tens of millions of users and a budget of more than one billion yuan verify the theoretical results and show that the proposed method outperforms various baseline methods. The proposed method has been successfully deployed to serve all the traffic of this marketing campaign.
1. Introduction

Online marketing campaigns play a prominent role in user acquisition and retention. For example, as illustrated in Figure 1, E-commerce and digital payment companies often offer cash coupons to encourage purchases or promote new payment tools (Zhao et al., 2019; Yu et al., 2021; Liu et al., 2019). The core problem underlying these campaigns is to appropriately allocate budgets to incentivize desired user behaviors while satisfying certain budget constraints. With the increasing scale of online marketing campaigns, the budget allocation problem has attracted substantial research attention.
The effectiveness of a marketing campaign is commonly measured by metrics regarding users’ daily activeness, for example, the average Daily Active Users (DAU), which is the average of the DAU in a given time window (e.g. one week). As these metrics are hard to directly optimize, conventional methods use immediate user responses, like the coupon redemption rate (Yu et al., 2021), as surrogates. Typically these methods take a two-step framework (Zhao et al., 2019; Agarwal et al., 2014; Xu et al., 2015; Chen et al., 2021b; Li et al., 2021). They first build a response model, which estimates users’ immediate responses to different incentives (Chen et al., 2021b; Liu et al., 2019), then solve a constrained optimization problem to make the budget allocation (Agarwal et al., 2014; Zhong et al., 2015). Although this framework scales well, it fails to take the long-lasting effects of a marketing campaign into consideration. For example, new users may establish desired payment habits after participating in the marketing campaign several times, which cannot be modeled by response models using immediate signals.
Due to the ability to characterize the long-term effect, Reinforcement Learning (RL) has been introduced to the marketing budget allocation scenarios. For example, airline companies have used small-scale RL to optimize their frequent flyer program for a long time (Labbi and Berrospi, 2007). With the recent development of constrained RL (Liang et al., 2018; Paternain et al., 2019; Stooke et al., 2020; Chen et al., 2021a), various constrained policy optimization methods are proposed to optimize the allocation policy by interacting with user response simulators of the marketing campaign while satisfying required budget constraints (Chow et al., 2017; Wu et al., 2018).
However, accurate simulators are usually very difficult to build in industrial scenarios (Shi et al., 2019; Chen et al., 2021c). To eschew the need for simulators, various offline value-based methods (Q-learning) have been proposed, which can learn solely from offline datasets. Xiao et al. (2019) propose one of the most representative methods in this line of research, which penalizes the Q-value function with the constraint dissatisfaction. Although practically effective, it lacks a theoretical guarantee and may not converge to the optimal policy. To ensure the convergence of value-based methods, it is tempting to consider the recently proposed game-theoretic approaches that take advantage of mixed policies (Le et al., 2019; Bohez et al., 2019; Miryoosefi et al., 2019; Zahavy et al., 2021). A mixed policy can be viewed as a distribution over multiple individual policies. At the start of each episode, a single individual policy is randomly picked from these policies to generate the decisions for the entire episode. Such methods are shown to converge as the mixed policy accumulates (nearly) infinitely many individual policies when the number of training steps goes to infinity. However, in these mixed policy methods, the number of individual policies that must be stored grows linearly with the number of training steps, which makes them intractable in industrial marketing scenarios where each individual policy is represented by a large-scale neural network function approximator.
In this paper, we formulate the marketing budget allocation problem as an offline constrained deep RL problem, in light of the failure of short-term decisions in capturing the long-term effects of marketing campaigns. We then use a game-theoretic value-based method with mixed policies to solve this problem. To resolve the aforementioned issue of the prohibitively high memory cost of storing all historical individual policies as in previous mixed policy methods (Le et al., 2019; Miryoosefi et al., 2019; Zahavy et al., 2021), we propose to express any mixed policy as a convex combination of a finite set of policies, called Affinely Independent Mixed (AIM) policies. These AIM policies form a basis of the affine space spanned by the policies' measurements (i.e., reward and budget constraints). Hence, this approach effectively reduces the number of stored individual policies to a constant (i.e., at most $m+2$ in the setting with $m$ constraints). We further devise two methods to determine the AIM policies and their weights during the learning process, i.e., AIM-mean and AIM-greedy. Specifically, AIM-mean recovers the original mixed policy methods (Le et al., 2019; Miryoosefi et al., 2019; Zahavy et al., 2021) and converges to the optimal policy with constant memory cost. AIM-greedy is analogous to the heuristic methods that use a single policy (Xiao et al., 2019; Le et al., 2019), outperforms them empirically, and is favorable for practical scenarios. Experiments on a real-world online marketing scenario verify our analysis and show superior empirical performance.
The contributions of this paper are summarized as follows: (i) Using data collected from a real-world online marketing campaign, we reveal the long-term effect of the marketing campaign, which cannot be characterized via conventional two-stage methods. Accordingly, we reformulate the task as an offline constrained deep RL problem. (ii) To reduce the memory cost of mixed policy methods, we propose two model-agnostic methods, i.e., AIM-mean and AIM-greedy, which effectively reduce the number of stored individual policies to a constant. Our methods are general solutions to the critical issue of memory consumption in mixed policy methods. (iii) Empirical results demonstrate the superior performance and policy efficiency of the proposed method, which has been successfully deployed on a large-scale online marketing campaign with tens of millions of users and a marketing budget of more than one billion yuan.
2. Related Work
This section reviews prior works on marketing budget allocation.
Classic methods. Many classic methods use a two-step framework (Zhao et al., 2019; Agarwal et al., 2014; Xu et al., 2015; Chen et al., 2021b; Liu et al., 2019; Li et al., 2021; Yu et al., 2021), i.e., a response prediction step and a decision making step. Response prediction models include DNNs (Chen et al., 2021b; Zhao et al., 2019) or GNNs (Liu et al., 2019; Yu et al., 2021), while the decision making step can utilize dual methods (Zhong et al., 2015), linear programming (Yu et al., 2021; Chen et al., 2021b), the bisection method (Zhao et al., 2019), or control-based methods (Xu et al., 2015). These methods only optimize the immediate reward and fail to capture the long-term effect.
Constrained RL. The Constrained Markov Decision Process (CMDP) (Altman, 1999) is a commonly used framework to characterize the long-term effect. When the problem scale is small, a CMDP can be directly solved via dynamic programming (Labbi and Berrospi, 2007) or linear programming (Altman, 1999). In large-scale industrial scenarios where user transitions are very hard to model (Chen et al., 2021c), many methods learn the optimal policy by directly interacting with the environment rather than modeling its whole dynamics. Though policy-based methods such as CPO (Achiam et al., 2017) and RCPO (Tessler et al., 2018) succeed in certain tasks, they are rarely adopted in industrial applications (Afsar et al., 2021). Because the extremely large user and item spaces in industrial applications lead to large variance and low sample efficiency when learning from offline datasets (Zou et al., 2020), value-based or actor-critic methods are more commonly used (Chen et al., 2021c). However, it has been shown that directly applying them may not converge (Altman, 1999; Abernethy et al., 2011). Although mixed policy methods (Miryoosefi et al., 2019; Le et al., 2019; Bohez et al., 2019; Zahavy et al., 2021) are proved to converge to the optimal policy, they conventionally need to store all historical policies. Hence, the memory cost is linear in the number of training steps, which is prohibitively expensive in large-scale scenarios.
3. Long-Term Effect in marketing budget allocation
In online marketing campaigns, the core problem is to appropriately allocate budgets to incentivize desired user behaviors while satisfying certain budget constraints. The effectiveness of a marketing campaign is commonly measured by average DAU over different time windows.
As an example, we here discuss a typical marketing budget allocation scenario, a campaign held by Alipay from Q4 2020 to Q2 2021 to improve the market share of its digital payment app. During the campaign, tens of millions of users participated, and cash coupons worth more than a billion yuan were distributed and redeemed. As illustrated in Figure 1, the campaign offers cash coupons to users, who can redeem one cash coupon per day when paying with the app and then receive another one for the next day. The cash coupon is used to incentivize users to try out the digital payment app, and the users are expected to enjoy the convenience and establish a new paying habit. The performance of the campaign is measured by the weekly average DAU of users who make digital payments. With a slight abuse of notation, when discussing on the user level, we use average DAU to refer to the number of days the user makes payments, whose sum over all users yields the average DAU of the app.
Conventional approaches simplify the above long-term (i.e., one-week) decision making problem to an optimization task over short-term metrics (Yu et al., 2021; Chen et al., 2021b; Liu et al., 2019). These methods first estimate some short-term metrics, like the coupon redemption rate or the users’ DAU in the coming day, under different coupon values. They then perform an optimization task to make the budget allocation (Agarwal et al., 2014; Zhong et al., 2015). In the following, via careful investigation on the real data, we show that the above methods actually oversimplify the marketing budget allocation task, neglecting the important long-term effect.
We first verify whether substituting long-term metrics with short-term metrics is a reasonable simplification in marketing budget allocation tasks. We collect a dataset of 4.2 million samples with a random budget allocation strategy (data collection is discussed in Sec. 6.1). Around the average budget level, we randomly assign different coupon amounts to the users and collect the users' DAU metrics over 1/3/7-day time frames.
Given a fixed budget, the optimal budget allocation plan depends on the sensitivity of the DAU metrics with respect to changes in coupon values (Yu et al., 2021; Liu et al., 2019; Zhong et al., 2015). To simplify the analysis, we set the coupon value to be 1 or 3 yuan and measure the difference in the corresponding users' DAU. As a common practice in marketing budget allocation (Labbi and Berrospi, 2007), we divide all users into an active group and an inactive group according to their historical DAU and inspect their behaviors respectively. Intuitively, the optimal budget allocated to a group of users should be approximately proportional to their DAU changes (Yu et al., 2021). We then compare the DAU changes after allocating the additional 2 yuan to each user. The result, illustrated in Table 1, shows that over a 1-day window the additional 2 yuan brings a 32% larger DAU gain for inactive users than for active users. However, the margin enlarges to a more significant 82% after 7 days. This shows that if we only consider short-term metrics, we are likely to significantly underestimate the potential gain of allocating more budget to inactive users, resulting in a non-optimal policy.
Table 1. DAU changes of inactive and active users after increasing the coupon value from 1 to 3 yuan.
| DAU change over | Inactive users | Active users | Diff. (Inactive/Active - 1) |
| 1 day | +0.15 | +0.11 | 32% |
| 3 days | +0.54 | +0.31 | 72% |
| 7 days | +1.24 | +0.68 | 82% |
4. Offline Constrained RL for marketing budget allocation
In this section, we formalize the marketing budget allocation problem as an offline constrained RL task and review the mixed policy approach (Le et al., 2019; Miryoosefi et al., 2019) to solving it.
4.1. Problem Formulation
A well-established framework for studying constrained RL is the Constrained Markov decision process (CMDP) (Altman, 1999), which can be identified by a tuple $(\mathcal{S}, \mathcal{A}, P, r, c, \gamma)$. Over an infinite time horizon, at each time $t$, the agent observes a state $s_t$ in the state space $\mathcal{S}$ and chooses an action $a_t$ in the action space $\mathcal{A}$. Given the current state and the action chosen, the state at the next observation evolves to $s'$ with probability $P(s' \mid s_t, a_t)$, and the agent receives an immediate reward $r(s_t, a_t)$. In particular, for the marketing budget allocation problem, the corresponding action consumes the limited budget that constrains the usage of resources. Therefore, we assume that action $a_t$ incurs an $m$-dimensional immediate cost $c(s_t, a_t)$, where $m$ is the number of constraints. The decisions are made according to a policy $\pi$: at each state, a policy is a distribution over the action space.
In the following, we consider two types of policies. A deterministic policy chooses a single action for any state and can be identified by a mapping $\pi: \mathcal{S} \to \mathcal{A}$. The set of all deterministic policies is denoted by $\Pi$. To improve the convergence of value-based methods, it is helpful to consider mixed deterministic policies (Le et al., 2019; Bohez et al., 2019; Miryoosefi et al., 2019; Zahavy et al., 2021), where a mixed policy $\mu$ is a distribution over $\Pi$. To execute a mixed policy $\mu$, we first select a deterministic policy $\pi \sim \mu$ and then execute $\pi$ for the entire episode.
For any discount factor $\gamma \in (0,1)$ and any policy $\pi$, the expectations of the discounted reward and the discounted cost are defined as
$$J_r(\pi) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\Big], \qquad J_c(\pi) = \mathbb{E}\Big[\sum_{t=0}^{\infty} \gamma^t\, c(s_t, a_t)\Big],$$
where the expectation is over the stochastic process induced by $\pi$. To simplify notation, we define a policy's measurement vector $z(\pi)$ as the $(m+1)$-dimensional vector containing the expectations of the discounted reward and the discounted cost,
$$z(\pi) = \big(J_r(\pi),\, J_c(\pi)\big) \in \mathbb{R}^{m+1}.$$
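As a concrete illustration (not the deployed implementation), the following sketch estimates a policy's measurement vector from logged episodes by Monte Carlo averaging of discounted returns; the function name and data layout are our own assumptions.

```python
import numpy as np

def measurement_vector(trajectories, gamma):
    """Monte Carlo estimate of the measurement vector z(pi) = (J_r, J_c).

    `trajectories` is a list of episodes; each episode is a list of
    (reward, cost_vector) pairs collected while executing the policy.
    """
    returns = []
    for episode in trajectories:
        g_r, g_c = 0.0, np.zeros(len(episode[0][1]))
        discount = 1.0
        for reward, cost in episode:
            g_r += discount * reward
            g_c += discount * np.asarray(cost, dtype=float)
            discount *= gamma
        returns.append(np.concatenate(([g_r], g_c)))
    # z(pi) is the (m+1)-dimensional vector (J_r(pi), J_c(pi)).
    return np.mean(returns, axis=0)

# Toy usage: two one-step episodes with a single (m = 1) cost dimension.
z = measurement_vector([[(1.0, [0.5])], [(0.0, [0.1])]], gamma=0.95)
print(z)  # approx. [0.5, 0.3]
```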
In offline constrained RL, we have a pre-collected dataset $\mathcal{D} = \{(s, a, r, c, s')\}$ containing transitions generated from (possibly multiple) behavior policies, jointly denoted as $\pi_b$. The goal in offline constrained RL is to find an optimal policy, denoted as $\pi^*$, that maximizes the reward subject to the constraints
$$\max_{\pi}\; J_r(\pi) \quad \text{s.t.}\quad J_c(\pi) \le \bar{c}, \tag{1}$$
where $\bar{c} \in \mathbb{R}^m$ is a vector of known constants. For an optimal policy $\pi^*$, $J_r(\pi^*)$ is called the optimal value of the problem. To simplify the discussion, for any $\pi$ not satisfying the constraints, let $J_r(\pi) = -\infty$.
Without the constraints, learning an optimal policy from a pre-collected dataset is a well-studied problem. For such problems, many effective methods rely on Q-learning to learn in an off-policy style and find optimal deterministic policies efficiently (Sutton and Barto, 2018). However, different from the unconstrained case, in constrained RL none of the optimal policies may be deterministic. We illustrate this with the following example.
Example 4.1.
Consider a CMDP with a single state and two actions, where the first action has $(r, c) = (1, 1)$ and the second action has $(r, c) = (0, 0)$. The process terminates whenever an action is taken. For any $\tau \in (0, 1)$, maximizing the reward subject to $J_c \le \tau$ requires randomization between the two actions, with the second action chosen with probability $1 - \tau$. In this case, the optimal policy exists and is not deterministic.
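Spelling out the arithmetic of this example (with the specific reward/cost values reconstructed above):

```latex
% Single-state, single-step CMDP: action 1 has (r, c) = (1, 1), action 2 has (0, 0).
% A (mixed) policy is summarized by the probability p of choosing action 1.
\begin{aligned}
J_r(p) = p, \qquad J_c(p) = p, \qquad \text{constraint: } J_c(p) \le \tau,\ \tau \in (0,1),\\
\max_{p \in [0,1]} \; p \quad \text{s.t. } p \le \tau \;\Longrightarrow\; p^\star = \tau .
\end{aligned}
% The optimum chooses action 1 with probability tau and action 2 with probability
% 1 - tau; neither deterministic policy (p = 0 or p = 1) is both feasible and optimal.
```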
Therefore to solve offline constrained RL, it is common to consider using mixed deterministic policy.
4.2. Relax to an Unconstrained Problem
We now show that the offline constrained RL problem (1) can be reformulated to an equilibrium problem (2), which can then be solved using online learning techniques.
We first relax the constrained problem (1) into an equivalent unconstrained optimization problem, where the value function is penalized by the constraint violation. Using a Lagrangian multiplier $\lambda \in \mathbb{R}^m_{\ge 0}$, we construct the Lagrangian
$$L(\pi, \lambda) = J_r(\pi) - \lambda^\top \big(J_c(\pi) - \bar{c}\big).$$
We show that solving the optimal value of the original constrained problem (1) is equivalent to solving the equilibrium, or saddle point, of this unconstrained problem
$$\max_{\mu}\, \min_{\lambda \ge 0}\; L(\mu, \lambda). \tag{2}$$
By a standard Lagrangian argument, for any policy $\mu$ that violates certain constraints, setting the corresponding dimensions of $\lambda$ to $+\infty$ and the others to $0$ yields $\min_{\lambda \ge 0} L(\mu, \lambda) = -\infty$. On the other hand, for any feasible policy $\mu$, since all elements of $J_c(\mu) - \bar{c}$ are non-positive, the inner minimization is achieved at $\lambda = 0$, which gives $\min_{\lambda \ge 0} L(\mu, \lambda) = J_r(\mu)$. The outer maximization is then realized by certain feasible policies that maximize the discounted reward $J_r(\mu)$. Hence, we conclude the equivalence between the constrained task (1) and the unconstrained one (2).
Algorithm 1: Solving the equilibrium problem (2) via a repeated game.
Input: no-regret online learning algorithm Learner (for the multiplier $\lambda$); offline RL algorithm that yields a deterministic policy (the best response).
Initialization: $\lambda_1 = 0$.
For $t = 1, \dots, T$: compute the best-response policy $\pi_t \in \arg\max_{\pi \in \Pi} L(\pi, \lambda_t)$ with the offline RL algorithm, then update $\lambda_{t+1} = \text{Learner}(\ell_1, \dots, \ell_t)$ with the loss functions $\ell_i(\lambda) = L(\pi_i, \lambda)$.
Output: the mixed policy $\mu_T$ that uniformly randomizes over $\{\pi_1, \dots, \pi_T\}$.
4.3. Online Learning Techniques
We then take a game-theoretic perspective and solve the equilibrium finding problem (2) with online learning. From this perspective, problem (2) can be viewed as a two-player zero-sum game: the policy player plays against the $\lambda$-player, where $L(\mu, \lambda)$ is the utility gained by the policy player and the loss suffered by the $\lambda$-player. The equilibrium of this game can be found by the general technique of Freund and Schapire (1999), which repeatedly plays a no-regret online learning algorithm against a best response learner.
We take a brief review of the online learning framework, where the core idea is the existence of no-regret learners (Shalev-Shwartz et al., 2011). For a learner who makes a sequence of decisions $\lambda_1, \lambda_2, \dots$, the no-regret property means that the cumulative loss of the learning algorithm converges to that of the best fixed strategy in hindsight sublinearly. That is, after making $T$ decisions,
$$\sum_{t=1}^{T} \ell_t(\lambda_t) - \min_{\lambda \ge 0} \sum_{t=1}^{T} \ell_t(\lambda) = o(T),$$
where $\lambda_t$ is generated by the online learning algorithm given the loss functions $\ell_1, \dots, \ell_{t-1}$ of all previous plays. A classic no-regret learner is Online Gradient Descent (OGD), which performs the projected update
$$\lambda_{t+1} = \big[\lambda_t - \eta_t \nabla_\lambda \ell_t(\lambda_t)\big]_+, \tag{3}$$
where $\eta_t$ is the learning rate and $[\cdot]_+$ denotes projection onto the non-negative orthant. We also consider learners that are prescient, i.e., that can choose $\lambda_t$ with the knowledge of the loss functions up to and including time $t$.
Perhaps the most trivial strategy in the prescient setting is to simply play the best choice for the current loss function. The best response strategy assumes knowledge of the other player's play and simply maximizes the utility for the current loss function. For the policy player, this can be defined as
$$\pi_t \in \arg\max_{\pi \in \Pi} L(\pi, \lambda_t). \tag{4}$$
4.4. Solve the Unconstrained Problem
The method of Freund and Schapire (1999) for solving the equilibrium finding problem can be summarized as Algorithm 1, where the Learner can be, for example, the OGD update defined in (3). As stated in the following proposition, the average of the players' decisions converges to the equilibrium point.
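A minimal sketch of Algorithm 1 under the notation above, assuming generic `best_response` and `estimate_measurement` routines (e.g., offline Q-learning on the penalized reward and Monte Carlo estimation); all names and hyperparameters are illustrative assumptions, not the deployed implementation.

```python
import numpy as np

def train_lagrangian_game(best_response, estimate_measurement, c_bar,
                          num_rounds=100, step_size=0.1):
    """Sketch of Algorithm 1: a no-regret lambda-player (OGD) playing against
    a best-response policy player. Returns all individual policies and the
    uniform mixing weights (the memory-hungry variant)."""
    m = len(c_bar)
    lam = np.zeros(m)                # Lagrange multipliers, lambda >= 0
    policies, measurements = [], []
    for t in range(num_rounds):
        # Policy player: best response to the penalized reward
        # r(s, a) - lam^T c(s, a), solved by any offline RL method.
        pi_t = best_response(lam)
        z_t = np.asarray(estimate_measurement(pi_t), dtype=float)  # (J_r, J_c)
        policies.append(pi_t)
        measurements.append(z_t)
        # Lambda player: projected online gradient descent, cf. (3).
        lam = np.maximum(0.0, lam + step_size * (z_t[1:] - c_bar))
    # The mixed policy randomizes uniformly over all stored policies.
    weights = np.full(len(policies), 1.0 / len(policies))
    return policies, weights
```

Note that this naive variant stores every individual policy, which is exactly the memory issue the AIM methods in Section 5 remove.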
Proposition 4.2.
Let $\mu_T$ be the mixed policy that uniformly randomizes over the deterministic policies $\pi_1, \dots, \pi_T$ generated by Algorithm 1, and let $\bar{\lambda}_T = \frac{1}{T}\sum_{t=1}^{T}\lambda_t$. Then $(\mu_T, \bar{\lambda}_T)$ converges to the equilibrium of (2) as $T \to \infty$.
Proof.
For any given $\lambda$, maximizing $L(\pi, \lambda)$ is equivalent to an ordinary unconstrained RL problem in which a penalized reward $r(s_t, a_t) - \lambda^\top c(s_t, a_t)$ is given at each step. Therefore, the deterministic policies that maximize this penalized reward also maximize $L(\cdot, \lambda)$, and we have
$$\max_{\pi \in \Pi} L(\pi, \lambda) = \max_{\mu} L(\mu, \lambda).$$
Note that this is critical to ensure the convergence of RL algorithms that yield deterministic policies. Since $L(\mu, \lambda)$ is linear in the mixing weights of $\mu$, and $\pi_1, \dots, \pi_T$ are the policies generated by Algorithm 1,
$$L(\mu_T, \lambda) = \frac{1}{T}\sum_{t=1}^{T} L(\pi_t, \lambda) \quad \text{for any } \lambda \ge 0.$$
Then the claim follows from the no-regret property of the Learner (Freund and Schapire, 1999). ∎
When running Algorithm 1 in practice, $\lambda_t$ can be easily maintained in place as a vector and updated via (3) at each round. However, it is much more difficult to maintain the mixed policy $\mu_T$ in our scenario, where deep neural networks are typically used as function approximators. Since neural networks are not linear, recovering the mixed policy requires storing the parameters of all individual policies $\pi_1, \dots, \pi_T$. This memory cost is prohibitively expensive since each neural network can be considerably large.
To overcome this problem, in the following section we introduce the novel notion of an Affinely Independent Mixed (AIM) policy, which expresses any mixed policy using only constantly many individual policies and thereby reduces the memory cost of mixed policy methods to a constant.
5. AIM-based methods
In this section, we first propose the notion of AIM policy, which can be recovered via randomly sampling from a set of individual policies that are affinely independent from each other in the measurement space. Then we propose the AIM-mean method to equivalently transform the mixed policy in Algorithm 1 into an AIM policy, which maintains a constant number of individual policies throughout the training process while still ensuring convergence. We finally propose a variant called AIM-greedy by discarding the undesirable individual policies in AIM-mean, which outperforms the heuristic method using any single policy (Le et al., 2019; Xiao et al., 2019) empirically and is favorable for practical usage.
Note that although our AIM method is motivated by the marketing budget allocation problem, it is actually suitable to a large range of scenarios using deep RL with mixed policies (Zahavy et al., 2021), such as apprenticeship learning (Zahavy et al., 2020; Abbeel and Ng, 2004) and pure exploration (Hazan et al., 2019). Our AIM method is a general approach that fundamentally solves the issue of storing all historical policies in mixed policy methods, which can be used in these deep RL tasks as well.
5.1. Affinely Independent Mixed (AIM) Policy
We first introduce some notation and definitions regarding mixed policies. For any mixed policy $\mu$, we define its active policy set $\Pi_\mu$ as the set of all deterministic policies with non-zero probability in $\mu$, i.e., $\Pi_\mu = \{\pi \in \Pi : \mu(\pi) > 0\}$, where we slightly abuse $\mu(\pi)$ to denote the probability of the deterministic policy $\pi$ under the mixed policy $\mu$.
We also define its active measurement set $V_\mu$ as the set of the measurement vectors of all these active policies, i.e., $V_\mu = \{z(\pi) : \pi \in \Pi_\mu\}$.
Recall that the measurement vector of the mixed policy $\mu$ is given as $z(\mu) = \mathbb{E}_{\pi \sim \mu}[z(\pi)]$.
Suppose the active policy set is finite, i.e., $\Pi_\mu = \{\pi_1, \dots, \pi_k\}$; then we equivalently have
$$z(\mu) = \sum_{i=1}^{k} \mu(\pi_i)\, z(\pi_i),$$
which is a convex combination of its active measurement vectors in $V_\mu$.
We call $\mu$ an AIM policy if its active measurement vectors are affinely independent of each other. Intuitively, the AIM policy characterizes a family of mixed policies with a relatively small active policy set. Specifically, if some mixed policy $\mu$ is not an AIM policy, namely the elements in $V_\mu$ are affinely dependent, we can pick several affinely independent elements such that $z(\mu)$ lies in the convex hull of these elements. In this case, by adjusting the mixing weights accordingly, the mixed policy can be recovered with a smaller active policy set.
To formalize the above reduction, notice that for an AIM policy $\mu$, its active measurement vectors $V_\mu = \{v_1, \dots, v_k\}$ are affinely independent and form a simplex in $\mathbb{R}^{m+1}$. For any policy $\mu'$ whose measurement vector $z(\mu')$ lies in the affine hull $\mathrm{aff}(V_\mu)$, its barycentric coordinates $w = (w_1, \dots, w_k)$ can be solved via
$$\sum_{i=1}^{k} w_i\, v_i = z(\mu'), \qquad \sum_{i=1}^{k} w_i = 1. \tag{5}$$
If $w_i \ge 0$ for every $i$, then $z(\mu')$ lies inside the convex hull $\mathrm{conv}(V_\mu)$. In this case, we can achieve the same measurement vector as $\mu'$ by randomly choosing each policy $\pi_i \in \Pi_\mu$ with probability $w_i$. In this way, we only need to store the policies in $\Pi_\mu$, which might be smaller than $\Pi_{\mu'}$.
In summary, any mixed policy can be equivalently realized by randomly sampling, with proper weights, from several deterministic policies that are affinely independent in the measurement space $\mathbb{R}^{m+1}$. Since any point in $\mathbb{R}^{m+1}$ lies in the affine span of at most $m+2$ affinely independent vectors, intuitively we can reduce the number of stored individual policies to at most $m+2$. In the following, we devise two algorithms that realize the AIM policy.
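The barycentric coordinates in (5) reduce to a small linear system. A sketch with assumed names (a least-squares solve is used for numerical robustness):

```python
import numpy as np

def barycentric_coordinates(vertices, target):
    """Solve (5): weights w with sum(w) = 1 and sum_i w_i * v_i = target.

    `vertices` is a (k, m+1) array of affinely independent measurement
    vectors; `target` is an (m+1,) vector lying in their affine hull.
    """
    vertices = np.asarray(vertices, dtype=float)
    target = np.asarray(target, dtype=float)
    # Stack the affine constraint sum(w) = 1 below the linear system.
    A = np.vstack([vertices.T, np.ones(len(vertices))])
    b = np.concatenate([target, [1.0]])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def in_convex_hull(weights, tol=1e-9):
    """The target lies in conv(V) iff all barycentric coordinates are >= 0."""
    return bool(np.all(weights >= -tol))

# Toy usage in R^2 (one reward plus one cost dimension).
V = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
w = barycentric_coordinates(V, [0.25, 0.25])
print(w, in_convex_hull(w))  # approx. [0.25, 0.25, 0.5] True
```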
5.2. AIM-mean Algorithm
We first propose AIM-mean (Algorithm 2), which realizes the original mixed policy method (Algorithm 1) via AIM policies. At each round $t$, AIM-mean outputs an AIM policy $\mu_t$ whose measurement vector is equal to the average measurement of all historical policies, i.e.,
$$z(\mu_t) = \frac{1}{t}\sum_{i=1}^{t} z(\pi_i).$$
Therefore, AIM-mean evolves exactly the same as the original method, which ensures convergence to the optimal policy, while requiring the storage of no more than $m+2$ individual policies.
The core step of AIM-mean is the maintenance of the active policy set of the output mixed policy, which we now explicate in detail. At the beginning of each round $t$, the algorithm receives a new policy $\pi_t$ generated by the original algorithm, whose measurement vector is $v_t = z(\pi_t)$. After receiving the policy, we first calculate the target vector (i.e., the average measurement vector) at the current round, $\bar{z}_t = \frac{t-1}{t}\, \bar{z}_{t-1} + \frac{1}{t}\, v_t$, then decide whether we should alter the active set:
Algorithm 2 (AIM-mean update). Input: AIM policy $\mu_{t-1}$ with active set $\Pi_{\mu_{t-1}}$ and mixing weights $w$; a new policy $\pi_t$ with measurement vector $v_t = z(\pi_t)$.
Output: AIM policy $\mu_t$ s.t. $z(\mu_t) = \bar{z}_t = \frac{t-1}{t}\, z(\mu_{t-1}) + \frac{1}{t}\, v_t$.
(i) If $v_t \notin \mathrm{aff}(V_{\mu_{t-1}})$, this implies that the size of $V_{\mu_{t-1}}$ is less than $m+2$ (otherwise $\mathrm{aff}(V_{\mu_{t-1}})$ would be the entire space $\mathbb{R}^{m+1}$ and would contain $v_t$). In this case, $v_t$ is affinely independent of the vectors in the current active set $V_{\mu_{t-1}}$, and we add $\pi_t$ to the active set.
(ii) If $\bar{z}_t \in \mathrm{conv}(V_{\mu_{t-1}})$, then we maintain the active set as $\Pi_{\mu_{t-1}}$. Now we just need to recalculate the mixing weights of the elements in $V_{\mu_{t-1}}$ regarding $\bar{z}_t$ via (5).
(iii) If $v_t \in \mathrm{aff}(V_{\mu_{t-1}})$ and $\bar{z}_t \notin \mathrm{conv}(V_{\mu_{t-1}})$, then we need to alter the elements in the active set and express $\bar{z}_t$ as a convex combination. Intuitively, we can replace one element in $V_{\mu_{t-1}}$ by $v_t$ so that the simplex formed by the new active set contains $\bar{z}_t$. Notice that since $\bar{z}_{t-1}$ and $\bar{z}_t$ lie inside and outside of the original simplex respectively, the segment joining the two points must cross the boundary of the original simplex. Therefore, we can find the specific facet where the intersection is located, remove a vertex in $V_{\mu_{t-1}}$ that is not on this facet, and add the new policy $\pi_t$ to the active set.
The above procedure of removing a vertex is summarized as the vertex-removal subroutine, which takes three arguments, namely the vertices of the simplex $V_{\mu_{t-1}}$, the previous average measurement vector $\bar{z}_{t-1}$, and the target vector $\bar{z}_t$. It first finds the intersection of the above segment with a facet of the simplex, then replaces a vertex that is not on the facet with the new policy at the current round. Specifically, let $w$ and $w'$ be the barycentric coordinates of $\bar{z}_{t-1}$ and $\bar{z}_t$ with respect to $V_{\mu_{t-1}}$, and denote $\Delta = w' - w$. Since $\bar{z}_t \notin \mathrm{conv}(V_{\mu_{t-1}})$, some entries of $w'$ must be non-positive. Hence, we can set
$$\alpha = \min_{i:\, \Delta_i < 0}\; \frac{w_i}{w_i - w'_i}. \tag{6}$$
Then $(1 - \alpha)\, \bar{z}_{t-1} + \alpha\, \bar{z}_t$ is the intersection point on the facet of the simplex. Now we just need to remove a vertex in $V_{\mu_{t-1}}$ whose barycentric coordinate at this point is zero, and update $\Pi_{\mu_t}$ and the mixing weights accordingly.
After updating the active set and the mixing weights, we repeatedly call the vertex-removal subroutine to remove any vertex that does not contribute to $\bar{z}_t$ (i.e., has zero weight). This further reduces the number of stored individual policies without affecting the output mixed policy $\mu_t$.
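A sketch of the vertex-removal step following (6), under the reconstruction above; the function names are assumptions:

```python
import numpy as np

def _barycentric(vertices, point):
    """Barycentric coordinates of `point` w.r.t. affinely independent `vertices`, cf. (5)."""
    A = np.vstack([np.asarray(vertices, dtype=float).T, np.ones(len(vertices))])
    b = np.concatenate([np.asarray(point, dtype=float), [1.0]])
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def reduce_vertex(vertices, prev_avg, target):
    """Locate the facet crossed by the segment [prev_avg, target] and pick a
    vertex to drop, following (6). `prev_avg` lies inside the simplex formed
    by `vertices`, while `target` lies outside."""
    w = _barycentric(vertices, prev_avg)      # all entries >= 0
    w_new = _barycentric(vertices, target)    # some entries < 0
    delta = w_new - w
    # Only coordinates that decrease along the segment can reach zero; (6)
    # takes the first one to do so.
    alphas = np.where(delta < 0, w / np.maximum(-delta, 1e-12), np.inf)
    alpha = float(np.min(alphas))             # intersection parameter in [0, 1]
    drop_index = int(np.argmin(alphas))       # vertex not on the crossed facet
    facet_point = np.asarray(prev_avg, dtype=float) + alpha * (
        np.asarray(target, dtype=float) - np.asarray(prev_avg, dtype=float))
    return alpha, drop_index, facet_point
```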
It can be easily checked that at the end of each round, $V_{\mu_t}$ is an affinely independent set and $z(\mu_t) = \bar{z}_t$. This validates the correctness of AIM-mean, as summarized in the following.
Corollary 5.1.
AIM-mean maintains the running average of the measurement vectors of all input policies as an AIM policy.
Table 2. Offline evaluation results of different methods on the simplified and full datasets (average DAU over 1/3/7 days).
| Group | Method | Policy | Simplified: 1d DAU | 3d DAU | 7d DAU | Full: 1d DAU | 3d DAU | 7d DAU |
| Classic | DeepFM | - | 0.5997 | 2.2032 | 4.7806 | 0.6396 | 2.3232 | 5.2114 |
| Classic | DIN | - | 0.6010 | 2.2070 | 4.7854 | 0.6468 | 2.3306 | 5.2237 |
| Off-policy RL | DDQN | Single-best | 0.5272 | 2.3597 | 4.9418 | 0.5981 | 2.5014 | 5.6306 |
| Off-policy RL | DDQN | AIM-mean | 0.5172 | 2.3433 | 4.9360 | 0.5779 | 2.4819 | 5.5759 |
| Off-policy RL | DDQN | AIM-greedy | 0.5339 | 2.3640 | 4.9732 | 0.5998 | 2.5213 | 5.6361 |
| Off-policy RL | EDRR | Single-best | 0.5741 | 2.4003 | 4.9620 | 0.6156 | 2.5243 | 5.6390 |
| Off-policy RL | EDRR | AIM-mean | 0.5572 | 2.3849 | 4.9360 | 0.6045 | 2.5144 | 5.6255 |
| Off-policy RL | EDRR | AIM-greedy | 0.5893 | 2.4193 | 5.1153 | 0.6163 | 2.5253 | 5.6432 |
| Offline RL | BCQ | Single-best | 0.5770 | 2.4068 | 5.0608 | 0.6150 | 2.5318 | 5.6901 |
| Offline RL | BCQ | AIM-mean | 0.5613 | 2.4016 | 5.0461 | 0.6159 | 2.5310 | 5.6685 |
| Offline RL | BCQ | AIM-greedy | 0.5846 | 2.4266 | 5.1911 | 0.6205 | 2.5454 | 5.7020 |
5.3. AIM-greedy Algorithm
Algorithm 3 (AIM-greedy update). Input: AIM policy $\mu_{t-1}$ with active set $\Pi_{\mu_{t-1}}$ and mixing weights $w$; a new policy $\pi_t$ with measurement vector $v_t = z(\pi_t)$.
Output: AIM policy $\mu_t$.
Recall that classic mixed policy methods randomly select an individual policy from all historical policies via uniform sampling. There is another heuristic method (Le et al., 2019) that selects the single best policy among all individual policies, which improves the empirical performance despite lacking a convergence guarantee. To further improve the practical performance, we propose a novel variant called AIM-greedy that removes "bad" active policies from the mixed policy. Although it may not converge, AIM-greedy consistently outperforms the previous heuristic method in our experiments; hence it is favorable for practical usage.
As described in Algorithm 3, AIM-greedy basically follows the procedure of AIM-mean to ensure the AIM property, except that it uses a more radical target vector than AIM-mean.
To decide the target vector, AIM-greedy performs a greedy line search on the segment between the previous target vector $\bar{z}_{t-1}$ and the measurement vector $v_t$ of the new policy $\pi_t$, and selects a target vector that maximizes the reward under the premise of the least constraint violation. Specifically, (i) if both endpoints satisfy the constraints, we set the target to the one with the higher reward; (ii) if neither satisfies the constraints, we choose the one whose Euclidean distance to the constraint set is smaller; (iii) if one endpoint satisfies the constraints while the other does not, and the infeasible one has the higher reward, we move the target to the boundary of the constraints (i.e., line 6 in Algorithm 3).
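A sketch of this case analysis (our own rendering; the function name, vector layout, and tie-breaking details are assumptions):

```python
import numpy as np

def greedy_target(prev_target, new_vec, c_bar):
    """Greedy line search on the segment [prev_target, new_vec] used by
    AIM-greedy. Each vector is (J_r, J_c_1, ..., J_c_m); feasible means
    J_c <= c_bar elementwise."""
    prev_target = np.asarray(prev_target, dtype=float)
    new_vec = np.asarray(new_vec, dtype=float)
    c_bar = np.asarray(c_bar, dtype=float)

    def feasible(z):
        return bool(np.all(z[1:] <= c_bar))

    def violation(z):
        # Euclidean distance to the constraint set {J_c <= c_bar}.
        return float(np.linalg.norm(np.maximum(z[1:] - c_bar, 0.0)))

    f_prev, f_new = feasible(prev_target), feasible(new_vec)
    if f_prev and f_new:            # (i) both feasible: keep the higher reward
        return prev_target if prev_target[0] >= new_vec[0] else new_vec
    if not f_prev and not f_new:    # (ii) both infeasible: keep smaller violation
        return prev_target if violation(prev_target) <= violation(new_vec) else new_vec
    # (iii) exactly one endpoint is feasible.
    feas, infeas = (prev_target, new_vec) if f_prev else (new_vec, prev_target)
    if infeas[0] <= feas[0]:
        return feas                 # feasible endpoint already has the higher reward
    # Otherwise move from the feasible endpoint toward the infeasible one,
    # stopping at the constraint boundary.
    steps = [(c_bar[j] - feas[1 + j]) / (infeas[1 + j] - feas[1 + j])
             for j in range(len(c_bar)) if infeas[1 + j] > c_bar[j]]
    alpha = min(steps) if steps else 1.0
    return feas + alpha * (infeas - feas)
```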
In fact, AIM-greedy generalizes the idea of selecting the single best policy and is guaranteed to perform no worse than any single best policy.
Theorem 5.2.
The performance of the AIM-greedy algorithm is no worse than using any single best policy.
Proof.
At each round $t$, AIM-greedy conducts a line search on the segment joining the previous target vector $\bar{z}_{t-1}$ and $v_t$, which includes both endpoints as candidates, whereas the single-best heuristic selects the best element of $\{v_1, \dots, v_t\}$. We argue by induction: suppose $\bar{z}_{t-1}$ performs no worse than the single best policy among $\pi_1, \dots, \pi_{t-1}$. Since the measurements are linear along the segment and the line search considers both $\bar{z}_{t-1}$ and $v_t$, the new target vector performs no worse than the better of the two endpoints, and hence no worse than the single best policy among $\pi_1, \dots, \pi_t$. ∎
5.4. Policy efficiency
We now discuss policy efficiency and show that the AIM policy achieves nearly optimal policy efficiency.
Theorem 5.3.
For a constrained RL problem with an $m$-dimensional constraint vector, a mixed policy needs to randomize among at least $m+1$ individual deterministic policies to ensure convergence to the optimal policy.
Proof.
We give a constructive proof. Consider a CMDP with a single state and $m+1$ actions. The first action has reward $1$ and zero cost. Each remaining action $i \in \{1, \dots, m\}$ has reward $0$ and cost $-e_i$, where $e_i$ is the unit vector of the $i$-th dimension, and the episode terminates after any action is taken. Let $\bar{c} = -\frac{1}{m+1}\mathbf{1}$, the all-one vector times $-\frac{1}{m+1}$. Then it is clear that maximizing the reward subject to these constraints requires randomizing among all $m+1$ actions uniformly. When deterministic policies are used, this requires randomizing among at least $m+1$ policies. ∎
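Spelling out the constraint arithmetic for the construction as reconstructed above (a verification sketch rather than the original proof's wording):

```latex
% Mixed policy: probability p_0 on the rewarding action, p_i on action i (i = 1..m).
\begin{aligned}
J_c(\mu) \le \bar c = -\tfrac{1}{m+1}\mathbf{1}
&\;\Longleftrightarrow\; -p_i \le -\tfrac{1}{m+1}
\;\Longleftrightarrow\; p_i \ge \tfrac{1}{m+1}, \quad i = 1, \dots, m,\\
J_r(\mu) = p_0 = 1 - \textstyle\sum_{i=1}^{m} p_i
&\;\le\; 1 - \tfrac{m}{m+1} = \tfrac{1}{m+1},
\end{aligned}
% with equality only at the uniform distribution p_0 = p_1 = ... = p_m = 1/(m+1),
% so the optimal mixed policy must randomize over all m+1 deterministic policies.
```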
Since the affine independence property ensures that AIM-type policies store no more than $m+2$ individual policies throughout the learning process, the storage complexity is nearly optimal for ensuring convergence when learning from off-policy samples.
6. Experiments
6.1. Dataset
During the marketing campaign, Alipay uses an online A/B testing system to perform user-level randomization; for a small percentage of users, a random budget allocation policy is used to assign coupon values. After running this randomized trial for a week in Dec 2020, 4.2 million samples are collected. Since the random strategy is likely to be non-optimal, given the average coupon amount of about 2.5 yuan, the collection of these samples is indeed expensive. Each sample has about one thousand raw features, which include timestamp, age, gender, city, and other behavior features. After one-hot encoding, all samples take around one hundred thousand different values. The label of each sample is the corresponding user’s DAU over the next 1/3/7 days.
The full dataset takes about 56 gigabytes on disk, and our deployed model is implemented with the company's private distributed model training infrastructure that supports high-dimensional embedding lookup tables. To simplify the tuning process, we also create a simplified dataset, which contains only 2 user features: 1) whether the user participated in the campaign on the previous day, which is a boolean value, and 2) the activeness level of the user. The results on the two datasets are qualitatively similar, and both are reported.
6.2. Evaluation Setting
The baselines. We first compare our proposed method with classic methods that optimize short-term metrics (i.e., 1d DAU). For classic methods, we adopt the widely used DeepFM (Guo et al., 2017) and DIN (Zhou et al., 2018). We then compare our proposed AIM-mean and AIM-greedy methods with the heuristic single-best method (Le et al., 2019). Since these methods are model-agnostic, we implement them on top of the following RL methods. (i) DDQN uses the double Q technique (Van Hasselt et al., 2015; Fujimoto et al., 2018) on top of the vanilla DQN (Mnih et al., 2013; Liu et al., 2018) to avoid overestimation; we set the weight of the smaller Q network to 0.8 and that of the larger one to 0.2. (ii) EDRR (Liu et al., 2020) is a state-of-the-art method in RL-based recommendation, which augments the aforementioned DDQN with a supervised task to stabilize the training of the embedding layer. (iii) BCQ (Fujimoto et al., 2019) adds a behavior cloning task to encourage the learned policy to stay close to the behavior policy that was used to generate the offline dataset; we add this imitation task to the EDRR model and set the weight of the behavior cloning loss to a fixed constant.
Training setting. For each dataset, we use 90% of the data as the training set and leave the rest as the evaluation set. The division is performed by hashing the user id, which ensures that all interactions of a user fall entirely within either the training set or the evaluation set. Each method is trained for 2 epochs. The discount rate and the batch size are kept the same for all methods, and we utilize the Adam optimizer throughout. We invoke the Learner to update $\lambda$ every 10 steps. To speed up the training process, the policy parameters are evaluated for inclusion in the AIM policy and exported once every 100 steps. For the online learner, we use OGD (3) with the learning rate suggested by Zinkevich (2003).
Budget control. To make a fair comparison between the conventional methods and the offline constrained RL methods, we control the immediate budget spent to the same amount using a linear programming task. We solve this task by treating the Q-values of the RL method as response estimates of the users, applying the two-step framework, and tightening the budget constraint until the costs of the DNN methods and the RL methods are the same.
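A toy sketch of this budget-control step as an LP relaxation (the solver choice, names, and the relaxation itself are illustrative assumptions, not the deployed pipeline):

```python
import numpy as np
from scipy.optimize import linprog

def allocate_budget(q_values, coupon_values, total_budget):
    """Treat the Q-values as response estimates and solve an LP so that
    every compared method spends the same total budget. The decision
    variable x[u, k] is the probability of giving user u the k-th coupon."""
    q_values = np.asarray(q_values, dtype=float)
    coupon_values = np.asarray(coupon_values, dtype=float)
    n_users, n_coupons = q_values.shape
    n = n_users * n_coupons
    # Maximize sum of q * x  <=>  minimize -q . x (linprog minimizes).
    c = -q_values.reshape(n)
    # Budget constraint: sum_{u,k} coupon_values[k] * x[u,k] <= total_budget.
    A_ub = np.tile(coupon_values, n_users).reshape(1, n)
    b_ub = np.array([total_budget])
    # One coupon decision per user: sum_k x[u,k] = 1.
    A_eq = np.zeros((n_users, n))
    for u in range(n_users):
        A_eq[u, u * n_coupons:(u + 1) * n_coupons] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq,
                  b_eq=np.ones(n_users), bounds=(0, 1), method="highs")
    return res.x.reshape(n_users, n_coupons)

# Toy usage: 3 users, coupon values of 1 or 3 yuan, 6-yuan total budget.
q = np.array([[0.2, 0.9], [0.3, 0.5], [0.1, 0.15]])
print(allocate_budget(q, [1.0, 3.0], total_budget=6.0))
```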
Offline evaluation. A straightforward way to evaluate different methods against the ground truth is through an online A/B test of the learned policies. However, this may hurt user experience and is prohibitively expensive in an online marketing campaign. As suggested by Zou et al. (2020), we instead use a clipped importance sampling method to evaluate the policies offline. To ensure a small variance and control the bias, the importance weight is clipped to a fixed threshold.
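A sketch of the clipped importance sampling estimator (the clipping threshold below is an illustrative placeholder, not the value used in the paper):

```python
import numpy as np

def clipped_is_value(rewards, target_probs, behavior_probs, clip=10.0):
    """Clipped importance sampling estimate of a policy's value from logged data.

    `rewards` are the logged outcomes (e.g., 7d DAU), while `target_probs` and
    `behavior_probs` are the probabilities that the evaluated policy and the
    logging policy assign to the logged actions.
    """
    rewards = np.asarray(rewards, dtype=float)
    ratio = np.asarray(target_probs, dtype=float) / np.asarray(behavior_probs, dtype=float)
    # Clipping trades a small bias for a bounded variance.
    ratio = np.minimum(ratio, clip)
    return float(np.mean(ratio * rewards))

# Toy usage: uniform-random logging over two coupon values.
print(clipped_is_value(rewards=[3.0, 5.0, 1.0],
                       target_probs=[0.9, 0.8, 0.1],
                       behavior_probs=[0.5, 0.5, 0.5]))
```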
6.3. Offline Experimental Results
Overall performance comparison. Results are averaged over 5 repeated experiment runs, and we report three metrics in Table 2: the average coupon redemption rate (1d DAU), the average DAU over 3 days (3d DAU), and the average DAU over 7 days (7d DAU). The redemption rate measures the immediate effect of the marketing campaign, while the average DAU over 3 and 7 days measures its long-term effect. From Table 2 we have the following observations: (1) Though the conventional methods yield good results in optimizing the short-term metric (1d DAU), they are outperformed by the offline constrained RL methods when considering the long-term effect (3/7d DAU). (2) Though it lacks a theoretical guarantee, the single-best heuristic with a single Q-network outperforms the AIM-mean method yielded by the game-theoretic approach. Meanwhile, both are consistently outperformed by the AIM-greedy method, which verifies our theoretical analysis. Note that according to Corollary 5.1, the measurement vector of the original mixed policy method is exactly the same as that of the AIM-mean method, and hence they have identical performance.
Figure 2. Memory usage (number of stored policy parameters) of the mixed policy methods during training on the full dataset.
The policy efficiency. The memory usage of the three offline constrained RL methods (using BCQ), with the standard deviation over the 5 repeated runs, is reported in Figure 2 in terms of the number of parameters stored during the training process on the full dataset. Directly applying the original mixed policy approach (Algorithm 1) suffers from an impractical memory cost. In contrast, our proposed methods significantly reduce the memory cost to at most a constant, which is favorable for large-scale industrial scenarios.
6.4. Online Experimental Results
We conduct a series of A/B experiments in a live system serving tens of millions of users and handing out cash coupons worth more than one billion yuan. Figure 3 illustrates the online request serving process and the offline training procedure. We train the model using BCQ, where $\lambda$ is updated by the OGD method. Each candidate model is evaluated by the AIM-greedy method; if the current policy is to be added to the mixed policy, its parameters are exported to disk, together with the mixing weights of all exported models. These models are then loaded onto our online serving machines to provide online budget allocation services.
Figure 3. The online request serving process and the offline training procedure of the deployed system.
The online experiment is conducted by performing user-level randomization. We start the online experiment with a small percentage of traffic and gradually increase the traffic. The offline constrained deep RL model is compared with the baseline model that uses DIN with a linear programming solver to find the allocation plan. The online experiment ran for a week, with 5.5 million cash coupons sent by each of the models. Though the offline constrained RL method has a lower coupon redemption rate (1d DAU) than the previous DIN model, after running the experiment for a week, it improves the weekly average DAU of the app by 0.6% (7d DAU). Meanwhile, it reduces the coupon cost by 0.8%. Our retrospective analysis shows that the emphasis on long-term value drives this model to allocate more budget to inactive users and accelerates the process of user acquisition. The improved retention of these inactive users compensates for the short-term metric deterioration among active users. The model was then successfully deployed to the full traffic of the marketing campaign after a week of A/B testing.
7. Conclusions
Using offline data from a real-world large-scale campaign, we show the difficulty of marketing budget allocation tasks and propose using offline constrained deep RL techniques to deal with the long-term effect. Existing methods require randomizing among infinitely many policies, which suffers from an unacceptably high memory cost in large-scale scenarios. To tackle this issue, we propose the AIM-mean and AIM-greedy algorithms, which reduce the memory consumption to a constant. In addition, we show that AIM-mean is guaranteed to converge to the optimal policy, and AIM-greedy performs no worse than any single-best policy. The theoretical analyses are verified in a real-world marketing campaign, where AIM-greedy is shown to outperform all other methods and is successfully deployed to all traffic in the campaign.
References
- Abbeel and Ng (2004) Pieter Abbeel and Andrew Y Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning. 1.
- Abernethy et al. (2011) Jacob Abernethy, Peter L Bartlett, and Elad Hazan. 2011. Blackwell approachability and no-regret learning are equivalent. In Proceedings of the 24th Annual Conference on Learning Theory. 27–46.
- Achiam et al. (2017) Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. 2017. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 22–31.
- Afsar et al. (2021) M Mehdi Afsar, Trafford Crump, and Behrouz Far. 2021. Reinforcement learning based recommender systems: A survey. ACM Computing Surveys (CSUR) (2021).
- Agarwal et al. (2014) Deepak Agarwal, Souvik Ghosh, Kai Wei, and Siyu You. 2014. Budget pacing for targeted online advertisements at linkedin. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 1613–1619.
- Altman (1999) Eitan Altman. 1999. Constrained Markov decision processes. Vol. 7. CRC Press.
- Bohez et al. (2019) Steven Bohez, Abbas Abdolmaleki, Michael Neunert, Jonas Buchli, Nicolas Heess, and Raia Hadsell. 2019. Value constrained model-free continuous control. arXiv preprint arXiv:1902.04623 (2019).
- Chen et al. (2021b) Xuanying Chen, Zhining Liu, Li Yu, Sen Li, Lihong Gu, Xiaodong Zeng, Yize Tan, and Jinjie Gu. 2021b. Adversarial Learning for Incentive Optimization in Mobile Payment Marketing. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2940–2944.
- Chen et al. (2021c) Xiaocong Chen, Lina Yao, Julian McAuley, Guanglin Zhou, and Xianzhi Wang. 2021c. A survey of deep reinforcement learning in recommender systems: A systematic review and future directions. arXiv preprint arXiv:2109.03540 (2021).
- Chen et al. (2021a) Yi Chen, Jing Dong, and Zhaoran Wang. 2021a. A primal-dual approach to constrained Markov decision processes. arXiv preprint arXiv:2101.10895 (2021).
- Chow et al. (2017) Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. 2017. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18, 1 (2017), 6070–6120.
- Freund and Schapire (1999) Yoav Freund and Robert E Schapire. 1999. Adaptive game playing using multiplicative weights. Games and Economic Behavior 29, 1-2 (1999), 79–103.
- Fujimoto et al. (2019) Scott Fujimoto, David Meger, and Doina Precup. 2019. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning. PMLR, 2052–2062.
- Fujimoto et al. (2018) Scott Fujimoto, Herke Van Hoof, and David Meger. 2018. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477 (2018).
- Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
- Hazan et al. (2019) Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. 2019. Provably efficient maximum entropy exploration. In International Conference on Machine Learning. PMLR, 2681–2691.
- Labbi and Berrospi (2007) Abdel Labbi and Cesar Berrospi. 2007. Optimizing marketing planning and budgeting using Markov decision processes: An airline case study. IBM Journal of Research and Development 51, 3.4 (2007), 421–431.
- Le et al. (2019) Hoang Le, Cameron Voloshin, and Yisong Yue. 2019. Batch Policy Learning under Constraints. In International Conference on Machine Learning. 3703–3712.
- Li et al. (2021) Duanshun Li, Jing Liu, Jinsung Jeon, Seoyoung Hong, Thai Le, Dongwon Lee, and Noseong Park. 2021. Large-Scale Data-Driven Airline Market Influence Maximization. arXiv preprint arXiv:2105.15012 (2021).
- Liang et al. (2018) Qingkai Liang, Fanyu Que, and Eytan Modiano. 2018. Accelerated primal-dual policy optimization for safe reinforcement learning. arXiv preprint arXiv:1802.06480 (2018).
- Liu et al. (2020) Feng Liu, Huifeng Guo, Xutao Li, Ruiming Tang, Yunming Ye, and Xiuqiang He. 2020. End-to-end deep reinforcement learning based recommendation with supervised embedding. In Proceedings of the 13th International Conference on Web Search and Data Mining. 384–392.
- Liu et al. (2018) Feng Liu, Ruiming Tang, Xutao Li, Weinan Zhang, Yunming Ye, Haokun Chen, Huifeng Guo, and Yuzhou Zhang. 2018. Deep reinforcement learning based recommendation with explicit user-item interactions modeling. arXiv preprint arXiv:1810.12027 (2018).
- Liu et al. (2019) Ziqi Liu, Dong Wang, Qianyu Yu, Zhiqiang Zhang, Yue Shen, Jian Ma, Wenliang Zhong, Jinjie Gu, Jun Zhou, Shuang Yang, et al. 2019. Graph representation learning for merchant incentive optimization in mobile payment marketing. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2577–2584.
- Miryoosefi et al. (2019) Sobhan Miryoosefi, Kianté Brantley, Hal Daume III, Miro Dudik, and Robert E Schapire. 2019. Reinforcement learning with convex constraints. In Advances in Neural Information Processing Systems. 14093–14102.
- Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
- Paternain et al. (2019) Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. 2019. Constrained reinforcement learning has zero duality gap. In Advances in Neural Information Processing Systems. 7555–7565.
- Shalev-Shwartz et al. (2011) Shai Shalev-Shwartz et al. 2011. Online learning and online convex optimization. Foundations and trends in Machine Learning 4, 2 (2011), 107–194.
- Shi et al. (2019) Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, and An-Xiang Zeng. 2019. Virtual-taobao: Virtualizing real-world online retail environment for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4902–4909.
- Stooke et al. (2020) Adam Stooke, Joshua Achiam, and Pieter Abbeel. 2020. Responsive safety in reinforcement learning by pid lagrangian methods. In International Conference on Machine Learning. PMLR, 9133–9143.
- Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
- Tessler et al. (2018) Chen Tessler, Daniel J Mankowitz, and Shie Mannor. 2018. Reward Constrained Policy Optimization. In International Conference on Learning Representations.
- Van Hasselt et al. (2015) Hado Van Hasselt, Arthur Guez, and David Silver. 2015. Deep reinforcement learning with double q-learning. arXiv preprint arXiv:1509.06461 (2015).
- Wu et al. (2018) Di Wu, Xiujun Chen, Xun Yang, Hao Wang, Qing Tan, Xiaoxun Zhang, Jian Xu, and Kun Gai. 2018. Budget constrained bidding by model-free reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 1443–1451.
- Xiao et al. (2019) Shuai Xiao, Le Guo, Zaifan Jiang, Lei Lv, Yuanbo Chen, Jun Zhu, and Shuang Yang. 2019. Model-based Constrained MDP for Budget Allocation in Sequential Incentive Marketing. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 971–980.
- Xu et al. (2015) Jian Xu, Kuang-chih Lee, Wentong Li, Hang Qi, and Quan Lu. 2015. Smart pacing for effective online ad campaign optimization. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2217–2226.
- Yu et al. (2021) Li Yu, Zhengwei Wu, Tianchi Cai, Ziqi Liu, Zhiqiang Zhang, Lihong Gu, Xiaodong Zeng, and Jinjie Gu. 2021. Joint Incentive Optimization of Customer and Merchant in Mobile Payment Marketing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 15000–15007.
- Zahavy et al. (2020) Tom Zahavy, Alon Cohen, Haim Kaplan, and Yishay Mansour. 2020. Apprenticeship learning via frank-wolfe. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 6720–6728.
- Zahavy et al. (2021) Tom Zahavy, Brendan O’Donoghue, Guillaume Desjardins, and Satinder Singh. 2021. Reward is enough for convex MDPs. Advances in Neural Information Processing Systems 34 (2021).
- Zhao et al. (2019) Kui Zhao, Junhao Hua, Ling Yan, Qi Zhang, Huan Xu, and Cheng Yang. 2019. A Unified Framework for Marketing Budget Allocation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1820–1830.
- Zhong et al. (2015) Wenliang Zhong, Rong Jin, Cheng Yang, Xiaowei Yan, Qi Zhang, and Qiang Li. 2015. Stock constrained recommendation in tmall. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2287–2296.
- Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1059–1068.
- Zinkevich (2003) Martin Zinkevich. 2003. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (icml-03). 928–936.
- Zou et al. (2020) Lixin Zou, Long Xia, Pan Du, Zhuo Zhang, Ting Bai, Weidong Liu, Jian-Yun Nie, and Dawei Yin. 2020. Pseudo Dyna-Q: A reinforcement learning framework for interactive recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining. 816–824.