
Marketing Budget Allocation with Offline Constrained
Deep Reinforcement Learning

Tianchi Cai (Ant Group, Hangzhou, China), Jiyan Jiang (Tsinghua University, Beijing, China), Wenpeng Zhang (Ant Group, Beijing, China), Shiji Zhou (Tsinghua University, Beijing, China), Xierui Song (Ant Group, Hangzhou, China), Li Yu (Ant Group, Hangzhou, China), Lihong Gu (Ant Group, Hangzhou, China), Xiaodong Zeng (Ant Group, Hangzhou, China), Jinjie Gu (Ant Group, Hangzhou, China), and Guannan Zhang (Ant Group, Hangzhou, China)
(2023)
Abstract.

We study the budget allocation problem in online marketing campaigns that utilize previously collected offline data. We first discuss the long-term effect of optimizing marketing budget allocation decisions in the offline setting. To overcome the resulting challenge, we propose a novel game-theoretic offline value-based reinforcement learning method using mixed policies. The proposed method reduces the need to store the infinitely many policies required by previous methods to only constantly many policies, achieving nearly optimal policy efficiency and making it practical and favorable for industrial usage. We further show that this method is guaranteed to converge to the optimal policy, which cannot be achieved by previous value-based reinforcement learning methods for marketing budget allocation. Our experiments on a large-scale marketing campaign with tens of millions of users and a budget of more than one billion yuan verify the theoretical results and show that the proposed method outperforms various baseline methods. The proposed method has been successfully deployed to serve all the traffic of this marketing campaign.

Offline constrained deep RL, Marketing budget allocation.
journalyear: 2023; copyright: acmlicensed; conference: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, February 27-March 3, 2023, Singapore, Singapore; booktitle: Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining (WSDM '23), February 27-March 3, 2023, Singapore, Singapore; price: 15.00; doi: 10.1145/3539597.3570486; isbn: 978-1-4503-9407-9/23/02; ccs: Computing methodologies, Sequential decision making

1. Introduction

Figure 1. In an online marketing campaign, cash coupons are offered to promote a digital payment app. Users can use one cash coupon every day, and the performance of the marketing campaign is measured by the average DAU over this period.

Online marketing campaigns play a prominent role in user acquisition and retention. For example, as illustrated in Figure 1, E-commerce and digital payment companies often offer cash coupons to encourage purchases or promote new payment tools (Zhao et al., 2019; Yu et al., 2021; Liu et al., 2019). The core problem in these campaigns is to appropriately allocate budgets to incentivize desired user behaviors while satisfying certain budget constraints. With the increasing scale of online marketing campaigns, the budget allocation problem has attracted substantial research attention.

The effectiveness of a marketing campaign is commonly measured by metrics regarding users’ daily activeness, for example, the average Daily Active Users (DAU), which is the average of the DAU in a given time window (e.g. one week). As these metrics are hard to directly optimize, conventional methods use immediate user responses, like the coupon redemption rate (Yu et al., 2021), as surrogates. Typically these methods take a two-step framework (Zhao et al., 2019; Agarwal et al., 2014; Xu et al., 2015; Chen et al., 2021b; Li et al., 2021). They first build a response model, which estimates users’ immediate responses to different incentives (Chen et al., 2021b; Liu et al., 2019), then solve a constrained optimization problem to make the budget allocation (Agarwal et al., 2014; Zhong et al., 2015). Although this framework scales well, it fails to take the long-lasting effects of a marketing campaign into consideration. For example, new users may establish desired payment habits after participating in the marketing campaign several times, which cannot be modeled by response models using immediate signals.

Due to the ability to characterize the long-term effect, Reinforcement Learning (RL) has been introduced to the marketing budget allocation scenarios. For example, airline companies have used small-scale RL to optimize their frequent flyer program for a long time (Labbi and Berrospi, 2007). With the recent development of constrained RL (Liang et al., 2018; Paternain et al., 2019; Stooke et al., 2020; Chen et al., 2021a), various constrained policy optimization methods are proposed to optimize the allocation policy by interacting with user response simulators of the marketing campaign while satisfying required budget constraints (Chow et al., 2017; Wu et al., 2018).

However, accurate simulators are usually very difficult to build in industrial scenarios (Shi et al., 2019; Chen et al., 2021c). To eschew the need for simulators, various offline value-based methods (Q-learning) have been proposed, which learn solely from offline datasets. Xiao et al. (2019) propose one of the most representative methods in this line of research, which penalizes the Q-value function with constraint dissatisfaction. Although practically effective, it lacks a theoretical guarantee and may not converge to the optimal policy. To ensure the convergence of value-based methods, it is tempting to consider the recently proposed game-theoretic approaches that take advantage of mixed policies (Le et al., 2019; Bohez et al., 2019; Miryoosefi et al., 2019; Zahavy et al., 2021). A mixed policy can be viewed as a distribution over multiple individual policies. At the start of each episode, a single individual policy is randomly picked from these policies to generate the decisions for the entire episode. Such methods are shown to converge as the mixed policy accumulates (nearly) infinitely many individual policies when the number of training steps goes to infinity. However, in these mixed policy methods, the number of individual policies that need to be stored grows linearly with the number of training steps, which makes them intractable to implement in industrial marketing scenarios where each individual policy is represented by a large-scale neural network as the function approximator.

In this paper, we formulate the marketing budget allocation problem as an offline constrained deep RL problem, in light of the failure of short-term decisions to capture the long-term effects of marketing campaigns. We then use a game-theoretic value-based method with mixed policies to solve this problem. To resolve the aforementioned issue of the prohibitively high memory cost of storing all historical individual policies as in previous mixed policy methods (Le et al., 2019; Miryoosefi et al., 2019; Zahavy et al., 2021), we propose to express any mixed policy as a convex combination of a finite set of policies, called Affinely Independent Mixed (AIM) policies. These AIM policies form a basis of the affine space spanned by the policies' measurements (i.e., reward and budget constraints). Hence, this approach effectively reduces the number of stored individual policies to a constant (i.e., at most $m+2$ in the setting with $m$ constraints). We further devise two methods to determine the AIM policies and their weights during the learning process, i.e., AIM-mean and AIM-greedy. Specifically, AIM-mean recovers the original mixed policy methods (Le et al., 2019; Miryoosefi et al., 2019; Zahavy et al., 2021) and converges to the optimal policy with constant memory cost. AIM-greedy is an analogue of the heuristic single-policy method (Xiao et al., 2019; Le et al., 2019); it outperforms the previous method empirically and is favorable for practical scenarios. Experiments in a real-world online marketing scenario verify our analysis and show superior empirical performance.

The contributions of this paper are summarized as follows: (i) Using data collected from a real-world online marketing campaign, we reveal the long-term effect of the marketing campaign, which cannot be characterized by conventional two-stage methods. To this end, we reformulate the task as an offline constrained deep RL problem. (ii) To reduce the memory cost of mixed policy methods, we propose two model-agnostic methods, i.e., AIM-mean and AIM-greedy, which effectively reduce the number of stored individual policies to a constant. Our methods are general solutions to the critical issue of memory consumption in mixed policy methods. (iii) Empirical results demonstrate the superior performance and policy efficiency of the proposed method, which has been successfully deployed on a large-scale online marketing campaign with tens of millions of users and a marketing budget of more than one billion yuan.

2. Related Work

This section reviews prior work on marketing budget allocation.
Classic methods. Many classic methods use a two-step framework (Zhao et al., 2019; Agarwal et al., 2014; Xu et al., 2015; Chen et al., 2021b; Liu et al., 2019; Li et al., 2021; Yu et al., 2021), i.e., a response prediction step and a decision making step. Response prediction models include DNNs (Chen et al., 2021b; Zhao et al., 2019) and GNNs (Liu et al., 2019; Yu et al., 2021), while the decision making step can utilize dual methods (Zhong et al., 2015), linear programming (Yu et al., 2021; Chen et al., 2021b), the bisection method (Zhao et al., 2019), or control-based methods (Xu et al., 2015). These methods only optimize the immediate reward and fail to capture the long-term effect.

Constrained RL. The Constrained Markov Decision Process (CMDP) (Altman, 1999) is a commonly used framework for characterizing the long-term effect. When the problem scale is small, a CMDP can be solved directly via dynamic programming (Labbi and Berrospi, 2007) or linear programming (Altman, 1999). In large-scale industrial scenarios where user transitions are very hard to model (Chen et al., 2021c), many methods learn the optimal policy by directly interacting with the environment rather than modeling the whole dynamics of the environment. Though policy-based methods such as CPO (Achiam et al., 2017) and RCPO (Tessler et al., 2018) have succeeded in certain tasks, they are rarely adopted in industrial-level usage (Afsar et al., 2021), because the extremely large user and item spaces in industrial applications lead to large variance and low sample efficiency when learning from offline datasets (Zou et al., 2020); value-based or actor-critic methods are more commonly used (Chen et al., 2021c). However, it has been shown that directly applying them may not converge (Altman, 1999; Abernethy et al., 2011). Although mixed policy methods (Miryoosefi et al., 2019; Le et al., 2019; Bohez et al., 2019; Zahavy et al., 2021) are proved to converge to the optimal policy, they conventionally need to store all historical policies. Hence, their memory cost is linear in the number of training steps, which is prohibitively expensive in large-scale scenarios.

3. Long-Term Effect in Marketing Budget Allocation

In online marketing campaigns, the core problem is to appropriately allocate budgets to incentivize desired user behaviors while satisfying certain budget constraints. The effectiveness of a marketing campaign is commonly measured by average DAU over different time windows.

As an example, we here discuss a typical marketing budget allocation scenario, a campaign held by Alipay from Q4 2020 to Q2 2021 to improve the market share of its digital payment app. During the campaign, tens of millions of users participated and cash coupons worth more than a billion yuan were distributed and redeemed. As illustrated in Figure 1, cash coupons are offered to users, who can redeem one cash coupon per day when paying with the app and receive another one for the next day. The cash coupon is used to incentivize users to try out the digital payment app, with the expectation that users will enjoy the convenience and establish a new payment habit. The performance of the campaign is measured by the weekly average DAU of users who make digital payments. With a slight abuse of notation, when discussing the user level, we use average DAU to refer to the number of days a user makes payments; its sum over all users yields the average DAU of the app.

Conventional approaches simplify the above long-term (i.e., one-week) decision making problem into an optimization task over short-term metrics (Yu et al., 2021; Chen et al., 2021b; Liu et al., 2019). These methods first estimate some short-term metrics, such as the coupon redemption rate or the users' DAU in the coming day, under different coupon values. They then perform an optimization task to make the budget allocation (Agarwal et al., 2014; Zhong et al., 2015). In the following, via a careful investigation of the real data, we show that the above methods actually oversimplify the marketing budget allocation task, neglecting the important long-term effect.

We first verify whether the simplification of substituting long-term metrics with short-term metrics is reasonable in marketing budget allocation tasks. We collect a dataset of 4.2 million samples with a random budget allocation strategy (data collection is discussed in Sec. 6.1). Around the budget averages, we randomly assign different coupon amounts to the users and collect the users' DAU metrics over 1/3/7-day time frames.

Given a fixed budget, the optimal budget allocation plan depends on the sensitivity of the DAU metrics with respect to changes in coupon values (Yu et al., 2021; Liu et al., 2019; Zhong et al., 2015). To simplify the analysis, we set the coupon value to be 1 or 3 yuan and measure the difference in the corresponding users' DAU. As is common practice in marketing budget allocation (Labbi and Berrospi, 2007), we divide all users into an active group and an inactive group according to their historical DAU, and inspect their behaviors respectively. Intuitively, the optimal budget allocated to a group of users should be approximately proportional to their DAU changes (Yu et al., 2021). We then compare the DAU changes after allocating this additional budget to each user. The result, as illustrated in Table 1, shows that the additional 2 yuan has a 32% larger effect on inactive users than on active users over a 1-day window. However, the margin enlarges to a more significant 82% after 7 days. This shows that if we only consider the short-term metrics, we are likely to significantly underestimate the potential gain of allocating more budget to inactive users, resulting in a non-optimal policy.

Table 1. The DAU improvement of increasing the coupon from 1 to 3 yuan over different user groups & time frames.

DAU change over   Inactive users   Active users   Diff. (Inactive/Active - 1)
1 day             +0.15            +0.11          32%
3 days            +0.54            +0.31          72%
7 days            +1.24            +0.68          82%

4. Offline Constrained RL for Marketing Budget Allocation

In this section, we formalize the marketing budget allocation problem as an offline constrained RL task and review the mixed policy approach (Le et al., 2019; Miryoosefi et al., 2019) to solve it.

4.1. Problem Formulation

A well-established framework for studying constrained RL is the Constrained Markov Decision Process (CMDP) (Altman, 1999), which can be identified by a tuple $\{\mathcal{S},\mathcal{A},P,r,{\bm{c}}\}$. Over an infinite time horizon, at each time $t=1,2,\dots$, the agent observes a state $s_t\in\mathcal{S}$ from the state space and chooses an action $a_t\in\mathcal{A}$ from the action space. Given the current state and the chosen action, the state at the next observation evolves to $s_{t+1}\in\mathcal{S}$ with probability $P(s_{t+1}|s_t,a_t)$, and the agent receives an immediate reward $r_t\in\mathbb{R}$. In particular, for the marketing budget allocation problem, the action $a_t$ consumes the limited budget that constrains the usage of resources. Therefore, we assume that action $a_t$ incurs an $m$-dimensional immediate cost ${\bm{c}}_t\in\mathbb{R}^m$, where $m$ is the number of constraints. Decisions are made according to a policy $\pi$, which at each state is a distribution over the action space.

In the following, we consider two types of policies. A deterministic policy chooses a single action deterministically for any state, and can be identified by the mapping $\pi:\mathcal{S}\mapsto\mathcal{A}$. The set of all deterministic policies is denoted by $\Pi$. To improve the convergence of value-based methods, it is helpful to consider mixed deterministic policies $\Delta(\Pi)$ (Le et al., 2019; Bohez et al., 2019; Miryoosefi et al., 2019; Zahavy et al., 2021), where $\mu\in\Delta(\Pi)$ is a distribution over $\Pi$. To execute a policy $\mu\in\Delta(\Pi)$, we first select a policy $\pi\in\Pi$ according to $\pi\sim\mu$ and then execute $\pi$ for the entire episode.

For any discount factor $\gamma\in[0,1)$ and any policy $\pi\in\Delta(\Pi)$, the expectations of the discounted reward $J_r(\pi)$ and the discounted cost $J_{\bm{c}}(\pi)$ are defined as

$J_r(\pi):=\mathbb{E}\left[\sum_{t=1}^{\infty}\gamma^{t-1}r_t\right],\quad J_{\bm{c}}(\pi):=\mathbb{E}\left[\sum_{t=1}^{\infty}\gamma^{t-1}{\bm{c}}_t\right],$

where the expectation is over the stochastic process. To simplify notation, we define a policy's measurement vector as the $(m+1)$-dimensional vector containing the expectations of the discounted reward $J_r(\pi)$ and the discounted cost $J_{\bm{c}}(\pi)$,

${\bm{x}}(\pi):=[J_r(\pi),J_{\bm{c}}(\pi)]\quad\text{(Measurement vector)}.$
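For intuition only, a minimal Monte Carlo sketch of estimating such a measurement vector is given below; it assumes access to a hypothetical `rollout(policy)` episode generator (e.g., a simulator or logged episodes of the same policy), which is not part of the offline setting studied here.

```python
import numpy as np

def estimate_measurement(rollout, policy, gamma, m, n_episodes=1000):
    """Monte Carlo estimate of x(pi) = [J_r(pi), J_c(pi)] in R^{m+1}.
    `rollout(policy)` is assumed to yield a finite sequence of (r_t, c_t) pairs."""
    total = np.zeros(m + 1)
    for _ in range(n_episodes):
        discount = 1.0
        for r, c in rollout(policy):
            total += discount * np.concatenate(([r], np.atleast_1d(c).astype(float)))
            discount *= gamma
    return total / n_episodes
```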

In offline constrained RL, we have a pre-collected dataset $D:=\{(s_i,a_i,s_i^{\prime},r_i,{\bm{c}}_i)\}_{i=1}^{n}$ containing transitions generated from (possibly multiple) behavior policies, jointly denoted as $\pi_B$. The goal in offline constrained RL is to find an optimal policy, denoted as $\mu^*$, that maximizes the reward subject to the constraints

(1)  $\max_{\mu\in\Delta(\Pi)}J_r(\mu)\quad\text{subject to}\quad J_{\bm{c}}(\mu)\preceq\tau$

where $\tau\in\mathbb{R}^m$ is a vector of known constants. For an optimal policy $\mu^*$, $J_r(\mu^*)$ is called the optimal value of the problem. To simplify the discussion, for any $\mu$ not satisfying the constraints, let $J_r(\mu):=-\infty$.

Without the constraints, learning an optimal policy from a pre-collected dataset is a well-studied problem. For such problems, many effective methods rely on Q-learning to learn in an off-policy style and find optimal deterministic policies efficiently (Sutton and Barto, 2018). However, different from the unconstrained case, in constrained RL none of the optimal policies may be deterministic. We illustrate this with the following example.

Example 0.

Consider a CMDP with a single state and two actions, where the first action has $r=0,c=0$ and the second action has $r=1,c=1$. The process terminates whenever an action is taken. For any $\tau\in(0,1)$, maximizing the reward subject to $J_c\leq\tau$ requires randomizing between the two actions, with the second action chosen with probability $\tau$. In this case, the optimal policy exists and is not deterministic.
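To make the example concrete, the measurement of the mixed policy $\mu$ that plays the second action with probability $\tau$ follows directly from the definitions above:

```latex
J_r(\mu) = \tau \cdot 1 + (1-\tau)\cdot 0 = \tau,
\qquad
J_c(\mu) = \tau \cdot 1 + (1-\tau)\cdot 0 = \tau \le \tau,
```

whereas either deterministic policy alone is suboptimal: the first earns zero reward, and the second violates the constraint since $J_c=1>\tau$.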

Therefore, to solve offline constrained RL, it is common to consider using mixed deterministic policies.

4.2. Relax to an Unconstrained Problem

We now show that the offline constrained RL problem (1) can be reformulated as an equilibrium problem (2), which can then be solved using online learning techniques.

We first relax the constrained problem (1) into an equivalent unconstrained optimization problem, where the value function is penalized by the constraint violation. Using a Lagrangian multiplier ${\bm{\lambda}}\in\mathbb{R}^m_+$, we construct the Lagrangian

$L(\mu,{\bm{\lambda}}):=J_r(\mu)-{\bm{\lambda}}^T(J_{\bm{c}}(\mu)-{\bm{\tau}}).$

We show that solving for the optimal value of the original constrained problem (1) is equivalent to solving for the equilibrium, or saddle point, of the unconstrained problem

(2)  $\max_{\mu\in\Delta(\Pi)}\min_{{\bm{\lambda}}\in\mathbb{R}^m_+}L(\mu,{\bm{\lambda}}).$

By a standard Lagrangian argument, for any $\mu\in\Delta(\Pi)$ that violates certain constraints, setting the corresponding dimensions of ${\bm{\lambda}}$ to $\infty$ (and $0$ otherwise) gives $L(\mu,{\bm{\lambda}})=-\infty$. On the other hand, for any feasible policy $\mu$, since all elements of $J_{\bm{c}}(\mu)-\tau$ are non-positive, the inner minimization is achieved at ${\bm{\lambda}}={\bm{0}}$. The outer maximization is then realized by the feasible policies that maximize the discounted reward $J_r(\mu)$. Hence, we conclude the equivalence between the constrained task (1) and the unconstrained one (2).

Algorithm 1 Original mixed policy method

Input: no-regret online learning algorithm $\mathtt{Learner}(\cdot)$; offline RL algorithm that yields a policy $\mathtt{Best\_response}({\bm{\lambda}}_t)$

Initialization: ${\bm{\lambda}}_0\in\mathbb{R}^m_+$, $\pi_0\in\Pi$.

1:  for $t=1,2,\dots,T$ do
2:     ${\bm{\lambda}}_t\leftarrow\mathtt{Learner}(L(\pi_0,\cdot),\dots,L(\pi_{t-1},\cdot))$
3:     $\pi_t\leftarrow\mathtt{Best\_response}({\bm{\lambda}}_t)$
4:     $\mu\leftarrow\{\pi_i\,|\,i=1,2,\dots,t,\ \mu(\pi_i)=1/t\}$ // Mixed policy that uniformly randomly plays $\pi_1,\dots,\pi_t$
5:  end for
6:  return $(\mu,\hat{{\bm{\lambda}}})$: ${\bm{x}}(\mu)=\frac{1}{T}\sum_{t=1}^{T}{\bm{x}}(\pi_t)$, $\hat{{\bm{\lambda}}}=\frac{1}{T}\sum_{t=1}^{T}{\bm{\lambda}}_t$
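For intuition, a minimal Python sketch of this loop is given below; `best_response` and `learner_update` are hypothetical placeholders for an offline RL routine trained on the penalized reward and for the OGD learner of Sec. 4.3, and the measurement vectors ${\bm{x}}(\pi_t)$ are assumed to be estimated offline.

```python
import numpy as np

def lagrangian(x, lam, tau):
    """L(pi, lambda) = J_r - lambda^T (J_c - tau), with x = [J_r, J_c]."""
    return x[0] - lam @ (x[1:] - tau)

def mixed_policy_method(best_response, learner_update, tau, T):
    """Sketch of Algorithm 1: alternate a no-regret lambda-player with a
    best-response policy player, and return the uniform mixed policy."""
    m = len(tau)
    lam, lam_avg = np.zeros(m), np.zeros(m)
    policies = []
    for t in range(1, T + 1):
        lam = learner_update(lam, policies, tau, t)   # line 2: no-regret update
        pi, x_pi = best_response(lam)                 # line 3: offline RL on r - lam^T (c - tau)
        policies.append((pi, x_pi))
        lam_avg += (lam - lam_avg) / t                # running average of lambda
    weights = np.full(len(policies), 1.0 / len(policies))  # line 4: uniform mixture
    return policies, weights, lam_avg
```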

4.3. Online Learning Techniques

We then take a game-theoretic perspective and solve the equilibrium finding problem (2) with online learning. From this perspective, the equilibrium finding problem (2) can be viewed as a two-player zero-sum game: the $\mu$ player plays against the ${\bm{\lambda}}$ player, where $L(\mu,{\bm{\lambda}})$ is the utility gained by the $\mu$ player and the loss suffered by the ${\bm{\lambda}}$ player. The equilibrium of this game can be found with the general technique of Freund and Schapire (1999) by repeatedly playing a no-regret online learning algorithm against a best response learner.

We briefly review the online learning framework, whose core idea is the existence of no-regret learners (Shalev-Shwartz et al., 2011). For a learner who makes a sequence of decisions, the no-regret property means that the gap between the cumulative loss of the learning algorithm and that of the best fixed strategy in hindsight grows sublinearly. That is, after making $T$ decisions,

$\mathtt{Regret}_T:=\sum_{t=1}^{T}L(\mu_t,{\bm{\lambda}}_t)-\min_{{\bm{\lambda}}\in\mathbb{R}^m_+}\sum_{t=1}^{T}L(\mu_t,{\bm{\lambda}})=o(T)$

where ${\bm{\lambda}}_t=\mathtt{Learner}(L(\mu_1,\cdot),\dots,L(\mu_{t-1},\cdot))$ is generated by the online learning algorithm given the loss functions of all previous plays. We also consider learners that are prescient, i.e., that can choose ${\bm{\lambda}}_t$ with knowledge of the loss function $L(\mu_t,\cdot)$ up to and including time $t$.

Perhaps the most trivial strategy in the prescient setting is to simply play the best choice for the current loss function. The best response strategy assumes knowledge of the other player's play and simply maximizes the utility under the current loss function. For the $\mu$ player, this can be defined as

$\mathtt{Best\_response}({\bm{\lambda}}_t):=\operatorname*{arg\,max}_{\mu\in\Delta(\Pi)}L(\mu,{\bm{\lambda}}_t)$

Online Gradient Descent (OGD) (Zinkevich, 2003) is a typical online learning algorithm, defined as

(3)  ${\bm{\lambda}}_{t+1}=\mathtt{OGD}({\bm{\lambda}}_t,L):=\max({\bm{0}},{\bm{\lambda}}_t-\eta_t\nabla L).$

With an appropriate learning rate, OGD is no-regret. Usually, the learning rate is set adaptively as $\eta_t:=1/\sqrt{t}$, which yields a regret of $O(\sqrt{T})$, i.e., an average regret of $O(1/\sqrt{T})$ (Zinkevich, 2003).
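As a concrete sketch of (3) for the ${\bm{\lambda}}$-player (who minimizes $L$), using the fact that $L$ is linear in ${\bm{\lambda}}$ so that $\nabla_{\bm{\lambda}}L=-(J_{\bm{c}}(\pi_t)-\tau)$; the function and variable names here are ours:

```python
import numpy as np

def ogd_step(lam, x_pi, tau, t):
    """One OGD step (Eq. 3) on lambda for the loss L(pi_t, .).
    lambda moves up on violated constraints and is projected back onto R^m_+."""
    grad_L = -(x_pi[1:] - tau)                  # gradient of L w.r.t. lambda
    eta = 1.0 / np.sqrt(t)                      # adaptive learning rate
    return np.maximum(0.0, lam - eta * grad_L)  # equals max(0, lam + eta * (J_c - tau))
```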

4.4. Solve the Unconstrained Problem

The method of Freund and Schapire (1999) for solving the equilibrium finding problem is summarized in Algorithm 1, where the $\mathtt{Learner}$ can be, for example, the OGD defined in (3). As stated in the following proposition, the average of the players' decisions converges to the equilibrium point.

Proposition 4.2.

For Algorithm 1, let $\mathtt{Regret}_T$ be the regret of the online learning algorithm used by the ${\bm{\lambda}}$-player. Then after $T$ rounds of iterations, Algorithm 1 returns $(\mu,\hat{{\bm{\lambda}}})$ with duality gap bounded by

(4)  $\max_{\mu\in\Delta(\Pi)}L(\mu,\hat{{\bm{\lambda}}})-\min_{{\bm{\lambda}}\in\mathbb{R}^m_+}L(\mu,{\bm{\lambda}})\leq\frac{\mathtt{Regret}_T}{T}$
Proof.

For any given $\hat{{\bm{\lambda}}}$, maximizing $L(\cdot,\hat{{\bm{\lambda}}})$ is equivalent to an ordinary unconstrained RL problem in which a penalized reward $r^{\prime}:=r-\hat{{\bm{\lambda}}}^T({\bm{c}}-\tau)$ is given at each step. Therefore, deterministic policies that maximize $r^{\prime}$ also maximize $L(\cdot,\hat{{\bm{\lambda}}})$, and we have

$\max_{\mu\in\Delta(\Pi)}L(\mu,\hat{{\bm{\lambda}}})=\max_{\pi\in\Pi}L(\pi,\hat{{\bm{\lambda}}}).$

Note that this is critical to ensure the convergence of RL algorithms that yield deterministic policies. Since $L(\cdot,{\bm{\lambda}})$ is linear in ${\bm{\lambda}}$, and the $\pi_t$ are $\mathtt{Best\_response}({\bm{\lambda}}_t)$ policies,

$\max_{\pi\in\Pi}L(\pi,\hat{{\bm{\lambda}}})=\max_{\pi\in\Pi}\frac{1}{T}\sum_{t=1}^{T}L(\pi,{\bm{\lambda}}_t)\leq\frac{1}{T}\sum_{t=1}^{T}L(\pi_t,{\bm{\lambda}}_t)$

Then the claim follows from the definition of $\mathtt{Regret}_T$. ∎

When running Algorithm 1 in practice, $\hat{{\bm{\lambda}}}$ can easily be maintained in place as a vector and updated via $\hat{{\bm{\lambda}}}\leftarrow\hat{{\bm{\lambda}}}\cdot(t-1)/t+{\bm{\lambda}}_t/t$ at each round. However, it is much more difficult to maintain the mixed policy $\mu$ in our scenario, where deep neural networks are typically used as function approximators. Since neural networks are not linear, recovering the mixed policy requires storing the parameters of all individual policies $\{\pi_1,\dots,\pi_T\}$. This memory cost is prohibitively expensive since the neural networks can be considerably large.

To overcome this problem, in the following section we introduce the novel notion of the Affinely Independent Mixed (AIM) policy, which expresses any mixed policy using only constantly many individual policies and thereby reduces the memory cost of mixed policy methods to a constant.

5. AIM-based methods

In this section, we first propose the notion of the AIM policy, which can be recovered by randomly sampling from a set of individual policies that are affinely independent of each other in the measurement space. We then propose the AIM-mean method, which equivalently transforms the mixed policy in Algorithm 1 into an AIM policy, maintaining a constant number of individual policies throughout the training process while still ensuring convergence. We finally propose a variant called AIM-greedy that discards undesirable individual policies in AIM-mean; it empirically outperforms the heuristic method using any single policy (Le et al., 2019; Xiao et al., 2019) and is favorable for practical usage.

Note that although our AIM method is motivated by the marketing budget allocation problem, it is actually suitable for a wide range of scenarios using deep RL with mixed policies (Zahavy et al., 2021), such as apprenticeship learning (Zahavy et al., 2020; Abbeel and Ng, 2004) and pure exploration (Hazan et al., 2019). Our AIM method is a general approach that fundamentally resolves the issue of storing all historical policies in mixed policy methods and can be used in these deep RL tasks as well.

5.1. Affinely Independent Mixed (AIM) Policy

We first introduce some notation and definitions regarding mixed policies. For any mixed policy $\mu\in\Delta(\Pi)$, we define its active policy set $S_p(\mu)$ as the set of all deterministic policies with non-zero probability in $\mu$ (we slightly abuse $\mu(\pi)\in[0,1]$ to denote the probability of a deterministic policy $\pi\in\Pi$ under the mixed policy $\mu$), i.e.,

$S_p(\mu):=\{\pi\,|\,\pi\in\Pi,\mu(\pi)>0\}.$

We also define its active measurement set $S_{\bm{x}}(\mu)$ as the set of the measurement vectors of all these active policies, i.e.,

$S_{\bm{x}}(\mu):=\{{\bm{x}}(\pi)\,|\,\pi\in S_p(\mu)\}.$

Recall that the measurement vector of $\mu$ is given as

${\bm{x}}(\mu)=\mathbb{E}_{\pi\sim\mu}\left[{\bm{x}}(\pi)\right]=\sum_{\pi\in\Pi}{\bm{x}}(\pi)\mu(\pi).$

Suppose the active policy set is finite, i.e., $S_p(\mu)=\{\pi_1,\dots,\pi_k\}$; then we equivalently have

${\bm{x}}(\mu)=[{\bm{x}}(\pi_1),\dots,{\bm{x}}(\pi_k)]\,[\mu(\pi_1),\dots,\mu(\pi_k)]^T,$

which is a convex combination of its active measurement vectors ${\bm{x}}(\pi_1),\dots,{\bm{x}}(\pi_k)$ in $S_{\bm{x}}(\mu)$.

We call $\mu$ an AIM policy if its active measurement vectors are affinely independent of each other. Intuitively, the AIM policy characterizes a family of mixed policies with a relatively small active policy set. Specifically, if some mixed policy $\mu^{\prime}\in\Delta(\Pi)$ is not an AIM policy, namely the elements in $S_{\bm{x}}(\mu^{\prime})$ are affinely dependent, we can pick several affinely independent elements such that ${\bm{x}}(\mu^{\prime})$ lies in the convex hull of these elements. In this case, by adjusting the mixing weights accordingly, the mixed policy $\mu^{\prime}$ can be recovered with a smaller active policy set.

To formalize the above reduction, notice that for an AIM policy $\mu$, its active measurement vectors are affinely independent and form a simplex in $\mathbb{R}^{m+1}$. For any policy $\mu^{\prime}$ whose measurement vector ${\bm{x}}(\mu^{\prime})$ lies in the affine hull $\mathtt{aff}(S_{\bm{x}}(\mu))$, its barycentric coordinate ${\bm{\alpha}}\in\mathbb{R}^k$ can be solved via

(5)  ${\bm{x}}(\mu^{\prime})=[{\bm{x}}(\pi_1)\,\dots\,{\bm{x}}(\pi_k)]\,{\bm{\alpha}},\quad\text{s.t.}\ \sum_i\alpha_i=1.$

If $\alpha_i\geq 0$ for every $i$, then ${\bm{x}}(\mu^{\prime})$ lies inside the convex hull $\mathtt{conv}(S_{\bm{x}}(\mu))$. In this case, we can achieve the same measurement vector ${\bm{x}}(\mu^{\prime})$ of $\mu^{\prime}$ by randomly choosing the policy $\pi_i\in S_p(\mu)$ with probability $\alpha_i$. In this way, we only need to store the policies in $S_p(\mu)$, which may be far fewer than those in $S_p(\mu^{\prime})$.

In summary, any mixed policy can be equivalently realized by randomly sampling, with proper weights, from several deterministic policies that are affinely independent in the measurement space $\mathbb{R}^{m+1}$. Since $\mathbb{R}^{m+1}$ is the affine span of any $m+2$ affinely independent vectors, we can intuitively reduce the number of stored individual policies to at most $m+2$. In the following, we devise two algorithms that realize the AIM policy.
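As a minimal numerical sketch (the use of a least-squares solve is our assumption; it returns the exact coordinates whenever the measurement vectors are affinely independent and the target lies in their affine hull), the barycentric coordinates in (5) can be computed as follows:

```python
import numpy as np

def barycentric_coords(S_x, x_target):
    """Solve Eq. (5): express x_target as an affine combination of the columns
    of S_x (each column is one measurement vector x(pi_i)), with weights
    summing to one. x_target lies in conv(S_x) iff all entries are >= 0."""
    X = np.asarray(S_x, dtype=float)                 # shape (m+1, k)
    A = np.vstack([X, np.ones((1, X.shape[1]))])     # append the sum-to-one row
    b = np.append(np.asarray(x_target, dtype=float), 1.0)
    alpha, *_ = np.linalg.lstsq(A, b, rcond=None)
    return alpha

# Example with m = 1: three affinely independent measurements [J_r, J_c].
S = np.array([[0.0, 1.0, 0.5],     # rewards J_r
              [0.0, 1.0, 0.2]])    # costs   J_c
print(barycentric_coords(S, x_target=[0.5, 0.4]))    # -> approx [1/3, 1/3, 1/3]
```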

5.2. AIM-mean Algorithm

We first propose AIM-mean (Algorithm 2), which realizes the original mixed policy method (Algorithm 1) via AIM policies. At each round $t$, AIM-mean outputs an AIM policy whose measurement vector is equal to the average measurement of all historical policies, i.e.,

${\bm{x}}(\mu_t)=\frac{1}{t}\sum_{i=1}^{t}{\bm{x}}(\pi_i).$

Therefore, AIM-mean evolves exactly as the original method does, which ensures convergence to the optimal policy, while storing no more than $m+2$ individual policies.

The core step of AIM-mean is the maintenance of the active policy set of the output mixed policy, which we now explicate in detail. At the beginning of each round $t$, the algorithm receives a new policy $\pi_t$ generated by the original algorithm, whose measurement vector is ${\bm{x}}(\pi_t)\in\mathbb{R}^{m+1}$. After receiving the policy, we first calculate the target vector ${\bm{x}}_t$ (i.e., the average measurement vector) at the current round, then decide whether we should alter the active set:

Algorithm 2 AIM-mean

Input: AIM-mean policy $\mu_{t-1}$, any $\pi_t$ with ${\bm{x}}(\pi_t)\in\mathbb{R}^{m+1}$

Output: AIM-mean policy $\mu_t$ s.t. ${\bm{x}}(\mu_t)=\frac{1}{t}\sum_{i=1}^{t}{\bm{x}}(\pi_i)$

1:  ${\bm{x}}_t\leftarrow{\bm{x}}(\mu_{t-1})\cdot(t-1)/t+{\bm{x}}(\pi_t)/t$ // Target value
2:  if ${\bm{x}}_t$ not in affine hull $\mathtt{aff}(S_{\bm{x}}(\mu_{t-1}))$ then
3:     $S_{\bm{x}}\leftarrow S_{\bm{x}}(\mu_{t-1})\cup\{{\bm{x}}(\pi_t)\}$
4:  else if ${\bm{x}}_t$ in convex hull $\mathtt{conv}(S_{\bm{x}}(\mu_{t-1}))$ then
5:     $S_{\bm{x}}\leftarrow S_{\bm{x}}(\mu_{t-1})$
6:  else
7:     $S_{\bm{x}}\leftarrow S_{\bm{x}}(\mu_{t-1})$, ${\bm{x}}_{t-1}\leftarrow{\bm{x}}(\mu_{t-1})$
8:     $S_{\bm{x}},{\bm{x}}_{t-1}\leftarrow\mathtt{RemoveOneVertex}(S_{\bm{x}},{\bm{x}}_{t-1},{\bm{x}}_t)$
9:     $S_{\bm{x}}\leftarrow S_{\bm{x}}\cup\{{\bm{x}}(\pi_t)\}$
10:  end if
11:  repeat
12:     $S_{\bm{x}},{\bm{x}}_{t-1}\leftarrow\mathtt{RemoveOneVertex}(S_{\bm{x}},{\bm{x}}_{t-1},{\bm{x}}_t)$
13:  until no more vertex can be removed
14:  ${\bm{\alpha}}\leftarrow$ barycentric coordinate of ${\bm{x}}_t$ w.r.t. $S_{\bm{x}}$
15:  return $\mu_t\leftarrow(S_{\bm{x}},{\bm{\alpha}})$ // the mixed policy over $S_{\bm{x}}$ with weights ${\bm{\alpha}}$

(i) If ${\bm{x}}_t\notin\mathtt{aff}(S_{\bm{x}}(\mu_{t-1}))$, this implies that the size of $S_{\bm{x}}(\mu_{t-1})$ is less than $m+2$ (otherwise we would have $\mathtt{aff}(S_{\bm{x}}(\mu_{t-1}))=\mathbb{R}^{m+1}$). In this case, ${\bm{x}}(\pi_t)$ is affinely independent of the vectors in the current active set $S_{\bm{x}}(\mu_{t-1})$, and we add ${\bm{x}}(\pi_t)$ to the active set.

(ii) If ${\bm{x}}_t\in\mathtt{conv}(S_{\bm{x}}(\mu_{t-1}))$, then we keep the active set $S_{\bm{x}}(\mu_{t-1})$ unchanged and only need to recalculate the mixing weights of its elements with respect to ${\bm{x}}_t$ via (5).

(iii) If ${\bm{x}}_t\in\mathtt{aff}(S_{\bm{x}}(\mu_{t-1}))$ and ${\bm{x}}_t\notin\mathtt{conv}(S_{\bm{x}}(\mu_{t-1}))$, then we need to alter the elements in the active set so that ${\bm{x}}_t$ can be expressed as a convex combination. Intuitively, we can replace one element in $S_{\bm{x}}(\mu_{t-1})$ by ${\bm{x}}(\pi_t)$ so that the simplex formed by the new active set contains ${\bm{x}}_t$. Notice that since ${\bm{x}}(\mu_{t-1})$ and ${\bm{x}}_t$ lie inside and outside of the original simplex respectively, the segment joining the two points must cross the boundary of the original simplex. Therefore, we can find the specific facet where the intersection lies, remove any vertex of $S_{\bm{x}}(\mu_{t-1})$ that is not on this facet, and add the new policy to the active set.

The above procedure of removing a vertex is summarized in the $\mathtt{RemoveOneVertex}(S_{\bm{x}},{\bm{x}}_{t-1},{\bm{x}}_t)$ subroutine, which takes three arguments, namely the vertices of the simplex $S_{\bm{x}}$, the previous average measurement vector ${\bm{x}}_{t-1}$, and the target vector ${\bm{x}}_t$. $\mathtt{RemoveOneVertex}$ first finds the intersection of the above segment with a facet of the simplex, then replaces a random vertex that is not on that facet with the new policy ${\bm{x}}(\pi_t)$ at the current round. Specifically, let ${\bm{\alpha}}_{t-1}$ and ${\bm{\alpha}}_t$ be the barycentric coordinates of ${\bm{x}}_{t-1}$ and ${\bm{x}}_t$ with respect to $S_{\bm{x}}$, and denote ${\bm{\alpha}}_t=(\alpha_{t,1},\dots,\alpha_{t,m+1})$. Since ${\bm{x}}_t\notin\mathtt{conv}(S_{\bm{x}})$, some entries of ${\bm{\alpha}}_t$ must be non-positive. Hence, we can set

(6)  $\theta\leftarrow\min_{i:\alpha_{t,i}\leq 0}\frac{\alpha_{t-1,i}}{\alpha_{t-1,i}-\alpha_{t,i}}.$

Then ${\bm{\alpha}}_t\leftarrow\theta{\bm{\alpha}}_t+(1-\theta){\bm{\alpha}}_{t-1}$ gives the barycentric coordinates of the point on the facet of the simplex. Now we just need to remove any vertex in $S_{\bm{x}}$ whose corresponding entry in ${\bm{\alpha}}_t$ is non-positive, and update ${\bm{x}}_{t-1}\leftarrow\theta{\bm{x}}_t+(1-\theta){\bm{x}}_{t-1}$ accordingly.

After updating the active set $S_{\bm{x}}$ and the mixing weights ${\bm{\alpha}}_t$, we repeatedly call $\mathtt{RemoveOneVertex}$ to remove any vertex that does not contribute to ${\bm{x}}_t$. This further reduces the number of stored individual policies without affecting the output mixed policy $\mu_t$.
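A minimal NumPy sketch of the $\mathtt{RemoveOneVertex}$ step described above; it reuses the hypothetical `barycentric_coords` helper sketched after (5), and stores $S_{\bm{x}}$ as a matrix whose columns are the vertices.

```python
import numpy as np

def remove_one_vertex(S_x, x_prev, x_t):
    """Walk from x_prev (inside conv(S_x)) towards x_t, stop at the first
    facet crossed (theta from Eq. 6), drop a vertex whose weight vanishes
    there, and return the shrunken vertex matrix and the facet point."""
    S_x = np.asarray(S_x, dtype=float)
    a_prev = barycentric_coords(S_x, x_prev)
    a_t = barycentric_coords(S_x, x_t)
    neg = np.where(a_t <= 0)[0]
    if len(neg) == 0:                       # x_t already inside: nothing to remove
        return S_x, np.asarray(x_prev, dtype=float)
    ratios = a_prev[neg] / (a_prev[neg] - a_t[neg])
    theta = ratios.min()                    # Eq. (6)
    x_facet = theta * np.asarray(x_t, float) + (1 - theta) * np.asarray(x_prev, float)
    drop = neg[np.argmin(ratios)]           # a vertex not on the crossed facet
    return np.delete(S_x, drop, axis=1), x_facet
```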

It can be easily checked that at the end of each round, $S_{\bm{x}}$ is an affinely independent set and ${\bm{x}}_t\in\mathtt{conv}(S_{\bm{x}})$. This validates the correctness of AIM-mean, as summarized in the following.

Corollary 5.1.

AIM-mean maintains the running average of the measurement vectors of all input policies as an AIM policy.

Table 2. Performance comparison of different methods on Alipay dataset.

                                       Simplified Dataset               Full Dataset
Method                                 1d DAU   3d DAU   7d DAU         1d DAU   3d DAU   7d DAU
Classic        DeepFM   -              0.5997   2.2032   4.7806         0.6396   2.3232   5.2114
               DIN      -              0.6010   2.2070   4.7854         0.6468   2.3306   5.2237
Off-policy RL  DDQN     Single-best    0.5272   2.3597   4.9418         0.5981   2.5014   5.6306
                        AIM-mean       0.5172   2.3433   4.9360         0.5779   2.4819   5.5759
                        AIM-greedy     0.5339   2.3640   4.9732         0.5998   2.5213   5.6361
               EDRR     Single-best    0.5741   2.4003   4.9620         0.6156   2.5243   5.6390
                        AIM-mean       0.5572   2.3849   4.9360         0.6045   2.5144   5.6255
                        AIM-greedy     0.5893   2.4193   5.1153         0.6163   2.5253   5.6432
Offline RL     BCQ      Single-best    0.5770   2.4068   5.0608         0.6150   2.5318   5.6901
                        AIM-mean       0.5613   2.4016   5.0461         0.6159   2.5310   5.6685
                        AIM-greedy     0.5846   2.4266   5.1911         0.6205   2.5454   5.7020

5.3. AIM-greedy Algorithm

Algorithm 3 AIM-greedy

Input: AIM policy $\mu_{t-1}$, any $\pi_t$ with ${\bm{x}}(\pi_t)\in\mathbb{R}^{m+1}$

Output: AIM policy $\mu_t$

1:  Replace line 1 in AIM-mean by choosing ${\bm{x}}_t\in[{\bm{x}}(\mu_{t-1}),{\bm{x}}(\pi_t)]$ that maximizes the reward under the premise of satisfying the constraints:
2:  ${\bm{x}}_t\leftarrow\arg\max_{{\bm{x}}\in[{\bm{x}}(\mu_{t-1}),{\bm{x}}(\pi_t)]}\min_{{\bm{\lambda}}\in\mathbb{R}^m_+}L({\bm{x}},{\bm{\lambda}}).$

Recall that classic mixed policy methods randomly select an individual policy from all historical policies via uniform sampling. There is another heuristic method (Le et al., 2019) that selects the single best policy among all individual policies, which improves the empirical performance despite lacking a convergence guarantee. To further improve practical performance, we propose a novel variant called AIM-greedy that removes "bad" active policies from the mixed policy. Although it may not converge, AIM-greedy consistently outperforms the previous heuristic method in our experiments, making it favorable for practical usage.

As described in Algorithm 3, AIM-greedy follows the procedure of AIM-mean to ensure the AIM property, except that it uses a more aggressive target vector ${\bm{x}}_t$ than AIM-mean.

To decide the target vector, AIM-greedy performs a greedy line search on the segment between the previous target vector ${\bm{x}}_{t-1}={\bm{x}}(\mu_{t-1})$ and the new policy ${\bm{x}}(\pi_t)$, and selects the target vector ${\bm{x}}_t$ that maximizes the reward under the premise of least constraint violation. Specifically, (i) if both ${\bm{x}}_{t-1}$ and ${\bm{x}}(\pi_t)$ satisfy the constraints, we set ${\bm{x}}_t$ to the one with the higher reward; (ii) if neither satisfies the constraints, we choose the one whose Euclidean distance to the target constraints is smaller; (iii) if one satisfies the constraints while the other does not, and the infeasible one has the higher reward, we move ${\bm{x}}_t$ to the boundary of the constraints (i.e., line 6 in Algorithm 5.2).
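A hedged sketch of this target selection, using the same measurement-vector convention as above (${\bm{x}}=[J_r,J_{\bm{c}}]$); the Euclidean violation measure and the boundary interpolation are simplified assumptions of ours:

```python
import numpy as np

def greedy_target(x_prev, x_new, tau):
    """Pick the AIM-greedy target on the segment [x_prev, x_new];
    a point is feasible when all of its cost entries are <= tau."""
    x_prev, x_new, tau = (np.asarray(v, dtype=float) for v in (x_prev, x_new, tau))
    feas_prev, feas_new = np.all(x_prev[1:] <= tau), np.all(x_new[1:] <= tau)
    if feas_prev and feas_new:                      # (i) keep the higher reward
        return x_new if x_new[0] > x_prev[0] else x_prev
    if not feas_prev and not feas_new:              # (ii) smaller constraint violation
        d_prev = np.linalg.norm(np.maximum(0.0, x_prev[1:] - tau))
        d_new = np.linalg.norm(np.maximum(0.0, x_new[1:] - tau))
        return x_new if d_new < d_prev else x_prev
    # (iii) exactly one is feasible; lo is the feasible endpoint.
    lo, hi = (x_prev, x_new) if feas_prev else (x_new, x_prev)
    if hi[0] <= lo[0]:                              # the infeasible point is not better
        return lo
    # Move from lo towards hi until the first constraint becomes tight.
    thetas = [(tau[j] - lo[1 + j]) / (hi[1 + j] - lo[1 + j])
              for j in range(len(tau)) if hi[1 + j] > tau[j]]
    return lo + min(thetas) * (hi - lo)
```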

In fact, AIM-greedy generalizes the idea of selecting the single best policy and is guaranteed to perform no worse than any single best policy.

Theorem 5.2.

The performance of the AIM-greedy algorithm is no worse than using any single best policy.

Proof.

At each round, AIM-greedy conducts a line search over ${\bm{x}}_t\in[{\bm{x}}(\pi_t),{\bm{x}}(\mu_{t-1})]$ to maximize $\min_{{\bm{\lambda}}\in\mathbb{R}^m_+}L({\bm{x}}_t,{\bm{\lambda}})$, whereas the single-best heuristic selects from ${\bm{x}}_t\in\{{\bm{x}}(\pi_t),{\bm{x}}(\pi_{t-1})\}$. Since $L$ is linear in ${\bm{x}}(\mu_t)$, by induction over the rounds, a line search on the segment $[{\bm{x}}(\pi_t),{\bm{x}}(\mu_{t-1})]$ always performs no worse than selecting the single best policy. ∎

5.4. Policy efficiency

We now discuss policy efficiency and show that our AIM policy achieves nearly optimal policy efficiency.

Theorem 5.3.

For a constrained RL problem with an $m$-dimensional constraint vector, a mixed policy needs to randomize among at least $m+1$ individual deterministic policies to ensure convergence.

Proof.

We give a constructive proof. Consider a CMDP with a single state and $m+1$ actions. The first action has reward $r=0$ and cost ${\bm{c}}={\bm{0}}\in\mathbb{R}^m$. Each of the remaining $m$ actions has reward $r=1$ and cost ${\bm{c}}={\bm{e}}_i$, the unit vector of the $i$-th dimension, and the episode terminates after any action is taken. Let $\tau=\frac{1}{m+1}{\bm{1}}$, the all-ones vector scaled by $1/(m+1)$. Then it is clear that maximizing the reward subject to these constraints requires randomizing among all $m+1$ actions uniformly. When deterministic policies are used, this requires randomizing among at least $m+1$ policies. ∎
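For completeness, the value of the uniform mixture $\mu^*$ in this construction follows directly from the definitions:

```latex
J_r(\mu^*) = \sum_{i=1}^{m} \frac{1}{m+1}\cdot 1 = \frac{m}{m+1},
\qquad
J_{c_i}(\mu^*) = \frac{1}{m+1} \le \tau_i, \quad i=1,\dots,m,
```

and dropping any action from the support of the mixture either lowers the attainable reward or forces another action's probability, and hence its cost, above $1/(m+1)$.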

The affine independence property ensures that AIM-type policies store no more than $m+2$ individual policies throughout the learning process; hence their storage complexity is nearly optimal for ensuring convergence when learning from off-policy samples.

6. Experiments

6.1. Dataset

During the marketing campaign, Alipay uses an online A/B testing system to perform user-level randomization; for a small percentage of users, a random budget allocation policy is used to assign coupon values. After running this randomized trial for a week in Dec 2020, 4.2 million samples were collected. Since the random strategy is likely to be non-optimal and the average coupon amount is about 2.5 yuan, the collection of these samples is indeed expensive. Each sample has about one thousand raw features, including timestamp, age, gender, city, and other behavioral features. After one-hot encoding, all samples take around one hundred thousand different values. The label of each sample is the corresponding user's DAU over the next 1/3/7 days.

The full dataset takes about 56 gigabytes on disk, and our deployed model is implemented with the company's private distributed model training infrastructure that supports high-dimensional embedding lookup tables. To simplify the tuning process, we also create a simplified dataset that contains only two user features: 1) whether the user participated on the previous day (a boolean value), and 2) the activeness level of the user. The results on both datasets are qualitatively similar, and both are reported.

6.2. Evaluation Setting

The baselines. We first compare our proposed method with classic methods that optimize short-term metrics (i.e., 1d DAU). For classic methods, we adopt the widely used DeepFM (Guo et al., 2017) and DIN (Zhou et al., 2018). We then compare our proposed AIM-mean and AIM-greedy methods with the heuristic single-best method (Le et al., 2019). Since these methods are model-agnostic, we implement them on top of the following RL methods. (i) DDQN uses double-Q techniques (Van Hasselt et al., 2015; Fujimoto et al., 2018) on the vanilla DQN (Mnih et al., 2013; Liu et al., 2018) to avoid overestimation. We set the weight for the smaller Q network to 0.8 and the higher one to 0.2. (ii) EDRR (Liu et al., 2020) is a state-of-the-art method in RL-based recommendation, which augments the aforementioned DDQN method with a supervised task to stabilize the training of the embedding layer. (iii) BCQ (Fujimoto et al., 2019) adds a behavior cloning task to encourage the learned policy to stay close to the behavior policy used to generate the offline dataset. We add the imitation task to the EDRR model and set the weight of behavior cloning to 1.0.

Training setting. For each dataset, we use 90% of the data as the training set and leave the rest as the evaluation set. The split is performed by hashing the user id, which ensures that all interactions of a user are either in the training set or the evaluation set. Each method is trained for 2 epochs. The discount rate is set to 0.8, and the batch size is 512. We use the Adam optimizer for all methods. We invoke the Learner to update $\lambda$ every 10 steps. To speed up the training process, the policy parameters are evaluated for inclusion in the AIM policy and exported once every 100 steps. For the online learner, we use OGD (3) with learning rate $\eta_t:=1/\sqrt{t}$ (Zinkevich, 2003).

Budget control. To make a fair comparison between the conventional methods and the offline constrained RL methods, we control the immediate budget spent to the same 3 yuan using a linear programming task. We solve the task by taking the Q-values of the RL method as response estimates for the users, applying the two-step framework, and adding a budget constraint so that the cost of the DNN methods and the RL methods is the same.
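A hedged sketch of such a per-user budget-control linear program (the fractional relaxation, the array layout, and the use of scipy.optimize.linprog are our assumptions; the paper only specifies that a two-step LP is used to equalize the immediate cost):

```python
import numpy as np
from scipy.optimize import linprog

def allocate_budget(scores, coupon_values, budget):
    """Fractional LP relaxation: pick one coupon value per user to maximize
    the predicted response (Q-value or response-model score) subject to an
    overall budget, with each user's assignment probabilities summing to 1.
    scores: (n_users, n_coupons) array of predicted responses."""
    n_users, n_coupons = scores.shape
    c = -scores.ravel()                                  # maximize total predicted response
    A_ub = np.tile(coupon_values, n_users)[None, :]      # expected coupon spend <= budget
    b_ub = [budget]
    A_eq = np.zeros((n_users, n_users * n_coupons))      # per-user probabilities sum to 1
    for i in range(n_users):
        A_eq[i, i * n_coupons:(i + 1) * n_coupons] = 1.0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=np.ones(n_users),
                  bounds=(0, 1), method="highs")
    return res.x.reshape(n_users, n_coupons)
```

In practice a dense LP of this form only scales to small batches; it is meant to illustrate the budget-equalization step, not the deployed solver.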

Offline evaluation. A straightforward way to evaluate different methods on ground truth is to evaluate the learned policy through an online A/B test. However, this may hurt user experience and be prohibitively expensive in an online marketing campaign. As suggested by Zou et al. (2020), we use a clipped importance sampling method to evaluate the policies. To ensure a small variance and control the bias, the importance weight is clipped to 20.
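A minimal sketch of such a clipped importance-sampling estimator (the self-normalization and the per-sample weighting are our assumptions):

```python
import numpy as np

def clipped_is_estimate(rewards, target_probs, behavior_probs, clip=20.0):
    """Clipped importance-sampling estimate of a policy's value from logged data.
    rewards: observed outcomes (e.g. the user's 7d DAU label);
    target_probs / behavior_probs: probability the logged action is taken by the
    evaluated policy vs. by the logging (random) behavior policy."""
    w = np.minimum(np.asarray(target_probs) / np.asarray(behavior_probs), clip)
    return np.sum(w * np.asarray(rewards)) / np.sum(w)   # self-normalized estimate
```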

6.3. Offline Experimental Results

Overall performance comparison. Results are averaged over 5 repeated experiment runs, and we report the three metrics in Table 2. For each method, we report the average coupon redemption rate (1d DAU), the average DAU over 3 days (3d DAU), and over 7 days (7d DAU). The redemption rate measures the immediate effect of the marketing campaign, while the average DAU over 3 and 7 days measures its long-term effect. From Table 2 we have the following observations: (1) Though the conventional methods yield good results when optimizing short-term metrics (1d DAU), they are outperformed by the offline constrained RL methods when considering the long-term effect (3/7d DAU). (2) Though it lacks a theoretical guarantee, using a single Q-network outperforms the AIM-mean method yielded by the game-theoretic approach. Meanwhile, both are consistently outperformed by the AIM-greedy method, which verifies our theoretical analysis. Note that according to Corollary 5.1, the measurement vector of the original mixed policy method is exactly the same as that of the AIM-mean method, hence they have identical performance.

Figure 2. The number of stored parameters in original mixed policy and AIM-based methods during the training process.

Policy efficiency. The memory usage of the three offline constrained RL methods (using BCQ), with the standard deviation over the 5 repeated runs, is reported in Figure 2 in terms of the number of parameters stored during training on the full dataset. Directly applying the original mixed policy approach (Algorithm 1) incurs an impractical memory cost. In contrast, our proposed methods significantly reduce the memory cost to a constant, which is favorable for large-scale industrial scenarios.

6.4. Online Experimental Results

We conduct a series of A/B experiments in a live system serving tens of millions of users and handing out cash coupons worth more than one billion yuan. Figure 3 illustrates the online request serving process and the offline training procedure. We train the model using BCQ, where $\lambda$ is updated by an OGD method. The model is evaluated by the AIM-greedy method, and if the current policy should be added to the mixed policy, the model parameters are exported to disk, together with the weights of all exported models. These models are then loaded onto our online serving machines to provide online budget allocation services.

Figure 3. The online request and offline training procedure.

The online experiment is conducted by performing user-level randomization. We start the online experiment with a small percentage of traffic and gradually increase it. The offline constrained deep RL model is compared with a baseline model that uses DIN with a linear programming solver to find the allocation plan. The online experiment ran for a week, with 5.5 million cash coupons sent by each of the models. Though the offline constrained RL method has a lower coupon redemption rate (1d DAU) than the previous DIN model, after running the experiment for a week it improves the weekly average DAU of the app by 0.6% (7d DAU). Meanwhile, it reduces the coupon cost by 0.8%. Our retrospective analysis shows that the emphasis on long-term value drives this model to allocate more budget to inactive users and accelerate the process of user acquisition. The improved retention of these inactive users compensates for the deterioration of short-term metrics among active users. The model was then successfully deployed to the full traffic of the marketing campaign after a week's A/B testing.

7. Conclusions

We show the difficulty of marketing budget allocation tasks using offline data from a real-world large-scale campaign, and propose using offline constrained deep RL techniques to deal with the long-term effect. Existing methods require randomizing among infinitely many policies, which incurs an unacceptably high memory cost in large-scale scenarios. To tackle this issue, we propose the AIM-mean and AIM-greedy algorithms, which reduce the memory consumption to a constant. In addition, we show that AIM-mean is guaranteed to converge to the optimal policy, and AIM-greedy performs no worse than any single best policy. The theoretical analyses are verified in a real-world marketing campaign, where AIM-greedy outperforms all other methods and is successfully deployed to all traffic in the campaign.

References

  • Abbeel and Ng (2004) Pieter Abbeel and Andrew Y Ng. 2004. Apprenticeship learning via inverse reinforcement learning. In Proceedings of the twenty-first international conference on Machine learning. 1.
  • Abernethy et al. (2011) Jacob Abernethy, Peter L Bartlett, and Elad Hazan. 2011. Blackwell approachability and no-regret learning are equivalent. In Proceedings of the 24th Annual Conference on Learning Theory. 27–46.
  • Achiam et al. (2017) Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. 2017. Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org, 22–31.
  • Afsar et al. (2021) M Mehdi Afsar, Trafford Crump, and Behrouz Far. 2021. Reinforcement learning based recommender systems: A survey. ACM Computing Surveys (CSUR) (2021).
  • Agarwal et al. (2014) Deepak Agarwal, Souvik Ghosh, Kai Wei, and Siyu You. 2014. Budget pacing for targeted online advertisements at linkedin. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining. 1613–1619.
  • Altman (1999) Eitan Altman. 1999. Constrained Markov decision processes. Vol. 7. CRC Press.
  • Bohez et al. (2019) Steven Bohez, Abbas Abdolmaleki, Michael Neunert, Jonas Buchli, Nicolas Heess, and Raia Hadsell. 2019. Value constrained model-free continuous control. arXiv preprint arXiv:1902.04623 (2019).
  • Chen et al. (2021b) Xuanying Chen, Zhining Liu, Li Yu, Sen Li, Lihong Gu, Xiaodong Zeng, Yize Tan, and Jinjie Gu. 2021b. Adversarial Learning for Incentive Optimization in Mobile Payment Marketing. In Proceedings of the 30th ACM International Conference on Information & Knowledge Management. 2940–2944.
  • Chen et al. (2021c) Xiaocong Chen, Lina Yao, Julian McAuley, Guanglin Zhou, and Xianzhi Wang. 2021c. A survey of deep reinforcement learning in recommender systems: A systematic review and future directions. arXiv preprint arXiv:2109.03540 (2021).
  • Chen et al. (2021a) Yi Chen, Jing Dong, and Zhaoran Wang. 2021a. A primal-dual approach to constrained Markov decision processes. arXiv preprint arXiv:2101.10895 (2021).
  • Chow et al. (2017) Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. 2017. Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research 18, 1 (2017), 6070–6120.
  • Freund and Schapire (1999) Yoav Freund and Robert E Schapire. 1999. Adaptive game playing using multiplicative weights. Games and Economic Behavior 29, 1-2 (1999), 79–103.
  • Fujimoto et al. (2019) Scott Fujimoto, David Meger, and Doina Precup. 2019. Off-policy deep reinforcement learning without exploration. In International Conference on Machine Learning. PMLR, 2052–2062.
  • Fujimoto et al. (2018) Scott Fujimoto, Herke Van Hoof, and David Meger. 2018. Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477 (2018).
  • Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
  • Hazan et al. (2019) Elad Hazan, Sham Kakade, Karan Singh, and Abby Van Soest. 2019. Provably efficient maximum entropy exploration. In International Conference on Machine Learning. PMLR, 2681–2691.
  • Labbi and Berrospi (2007) Abdel Labbi and Cesar Berrospi. 2007. Optimizing marketing planning and budgeting using Markov decision processes: An airline case study. IBM Journal of Research and Development 51, 3.4 (2007), 421–431.
  • Le et al. (2019) Hoang Le, Cameron Voloshin, and Yisong Yue. 2019. Batch Policy Learning under Constraints. In International Conference on Machine Learning. 3703–3712.
  • Li et al. (2021) Duanshun Li, Jing Liu, Jinsung Jeon, Seoyoung Hong, Thai Le, Dongwon Lee, and Noseong Park. 2021. Large-Scale Data-Driven Airline Market Influence Maximization. arXiv preprint arXiv:2105.15012 (2021).
  • Liang et al. (2018) Qingkai Liang, Fanyu Que, and Eytan Modiano. 2018. Accelerated primal-dual policy optimization for safe reinforcement learning. arXiv preprint arXiv:1802.06480 (2018).
  • Liu et al. (2020) Feng Liu, Huifeng Guo, Xutao Li, Ruiming Tang, Yunming Ye, and Xiuqiang He. 2020. End-to-end deep reinforcement learning based recommendation with supervised embedding. In Proceedings of the 13th International Conference on Web Search and Data Mining. 384–392.
  • Liu et al. (2018) Feng Liu, Ruiming Tang, Xutao Li, Weinan Zhang, Yunming Ye, Haokun Chen, Huifeng Guo, and Yuzhou Zhang. 2018. Deep reinforcement learning based recommendation with explicit user-item interactions modeling. arXiv preprint arXiv:1810.12027 (2018).
  • Liu et al. (2019) Ziqi Liu, Dong Wang, Qianyu Yu, Zhiqiang Zhang, Yue Shen, Jian Ma, Wenliang Zhong, Jinjie Gu, Jun Zhou, Shuang Yang, et al. 2019. Graph representation learning for merchant incentive optimization in mobile payment marketing. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 2577–2584.
  • Miryoosefi et al. (2019) Sobhan Miryoosefi, Kianté Brantley, Hal Daume III, Miro Dudik, and Robert E Schapire. 2019. Reinforcement learning with convex constraints. In Advances in Neural Information Processing Systems. 14093–14102.
  • Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
  • Paternain et al. (2019) Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. 2019. Constrained reinforcement learning has zero duality gap. In Advances in Neural Information Processing Systems. 7555–7565.
  • Shalev-Shwartz et al. (2011) Shai Shalev-Shwartz et al. 2011. Online learning and online convex optimization. Foundations and trends in Machine Learning 4, 2 (2011), 107–194.
  • Shi et al. (2019) Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, and An-Xiang Zeng. 2019. Virtual-taobao: Virtualizing real-world online retail environment for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4902–4909.
  • Stooke et al. (2020) Adam Stooke, Joshua Achiam, and Pieter Abbeel. 2020. Responsive safety in reinforcement learning by pid lagrangian methods. In International Conference on Machine Learning. PMLR, 9133–9143.
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. 2018. Reinforcement learning: An introduction. MIT press.
  • Tessler et al. (2018) Chen Tessler, Daniel J Mankowitz, and Shie Mannor. 2018. Reward Constrained Policy Optimization. In International Conference on Learning Representations.
  • Van Hasselt et al. (2015) Hado Van Hasselt, Arthur Guez, and David Silver. 2015. Deep reinforcement learning with double q-learning. arXiv preprint arXiv:1509.06461 (2015).
  • Wu et al. (2018) Di Wu, Xiujun Chen, Xun Yang, Hao Wang, Qing Tan, Xiaoxun Zhang, Jian Xu, and Kun Gai. 2018. Budget constrained bidding by model-free reinforcement learning in display advertising. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 1443–1451.
  • Xiao et al. (2019) Shuai Xiao, Le Guo, Zaifan Jiang, Lei Lv, Yuanbo Chen, Jun Zhu, and Shuang Yang. 2019. Model-based Constrained MDP for Budget Allocation in Sequential Incentive Marketing. In Proceedings of the 28th ACM International Conference on Information and Knowledge Management. 971–980.
  • Xu et al. (2015) Jian Xu, Kuang-chih Lee, Wentong Li, Hang Qi, and Quan Lu. 2015. Smart pacing for effective online ad campaign optimization. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2217–2226.
  • Yu et al. (2021) Li Yu, Zhengwei Wu, Tianchi Cai, Ziqi Liu, Zhiqiang Zhang, Lihong Gu, Xiaodong Zeng, and Jinjie Gu. 2021. Joint Incentive Optimization of Customer and Merchant in Mobile Payment Marketing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 15000–15007.
  • Zahavy et al. (2020) Tom Zahavy, Alon Cohen, Haim Kaplan, and Yishay Mansour. 2020. Apprenticeship learning via frank-wolfe. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 6720–6728.
  • Zahavy et al. (2021) Tom Zahavy, Brendan O’Donoghue, Guillaume Desjardins, and Satinder Singh. 2021. Reward is enough for convex MDPs. Advances in Neural Information Processing Systems 34 (2021).
  • Zhao et al. (2019) Kui Zhao, Junhao Hua, Ling Yan, Qi Zhang, Huan Xu, and Cheng Yang. 2019. A Unified Framework for Marketing Budget Allocation. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1820–1830.
  • Zhong et al. (2015) Wenliang Zhong, Rong Jin, Cheng Yang, Xiaowei Yan, Qi Zhang, and Qiang Li. 2015. Stock constrained recommendation in tmall. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2287–2296.
  • Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1059–1068.
  • Zinkevich (2003) Martin Zinkevich. 2003. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the 20th international conference on machine learning (icml-03). 928–936.
  • Zou et al. (2020) Lixin Zou, Long Xia, Pan Du, Zhuo Zhang, Ting Bai, Weidong Liu, Jian-Yun Nie, and Dawei Yin. 2020. Pseudo Dyna-Q: A reinforcement learning framework for interactive recommendation. In Proceedings of the 13th International Conference on Web Search and Data Mining. 816–824.