
On Modeling Long-Term User Engagement from Stochastic Feedback

Guoxi Zhang, Graduate School of Informatics, Kyoto University, Kyoto, Japan 606-8501, [email protected]
Xing Yao, China Central Depository & Clearing Co., Ltd., Beijing, China, [email protected]
Xuanji Xiao, Shopee Inc., Shenzhen, China, [email protected]
(2023)
Abstract.

An ultimate goal of recommender systems (RS) is to improve user engagement. Reinforcement learning (RL) is a promising paradigm for this goal, as it directly optimizes the overall performance of sequential recommendation. However, many existing RL-based approaches induce huge computational overhead, because they require not only the recommended items but also all other candidate items to be stored. This paper proposes an efficient alternative that does not require the candidate items. The idea is to model the correlation between user engagement and items directly from data. Moreover, the proposed approach considers randomness in user feedback and termination behavior, which is ubiquitous in RS but rarely discussed in prior RL-based work. With online A/B experiments on a real-world RS, we confirm the efficacy of the proposed approach and the importance of modeling the two types of randomness.

Recommender Systems, Reinforcement Learning
journalyear: 2023; copyright: acmlicensed; conference: Companion Proceedings of the ACM Web Conference 2023, April 30-May 4, 2023, Austin, TX, USA; booktitle: Companion Proceedings of the ACM Web Conference 2023 (WWW '23 Companion), April 30-May 4, 2023, Austin, TX, USA; price: 15.00; doi: 10.1145/3543873.3587626; isbn: 978-1-4503-9419-2/23/04; ccs: Computing methodologies, Markov decision processes; ccs: Information systems, Content ranking; ccs: Information systems, Personalization

1. Introduction

Recommender Systems (RS) are software systems that help users discover engaging items. In particular, a sequential RS allows users to consume new items continuously, usually by scrolling down the RS interface. The main goal of a sequential RS is to increase long-term user engagement. In this regard, myopic decision-making, such as ranking items solely by predicted click-through rate (pCTR), can be suboptimal. For instance, news may have a higher pCTR than drama bloopers, yet drama bloopers prompt more consumption of videos about the drama and its cast.

Reinforcement learning (RL) has emerged as a promising research direction for optimizing long-term user engagement (10.1145/3289600.3290999, 2, 4, 12, 14). With RL, sequential recommendation is modeled as interaction between an agent and an environment. At each step, the agent receives a state and selects an action. Usually, states are user profiles and their historical activities, and actions are items available for recommendation. After selecting each action the agent receives a scalar reward. The goal of RL is to learn a policy for action selection with which the agent maximizes the sum of rewards received while interacting with the environment. Thus, RL is an enticing paradigm if one uses the utility of items (e.g., clicks) as rewards, as it naturally optimizes the long-term utility of a sequential RS.

However, training RL models for industrial RS can be challenging. To understand the issue, consider a minimalist example in which an agent selects one item from a set of candidate items for a user. To train a model for pCTR, one needs to record information about the user, the selected item, and the feedback on that item. However, to train an RL model such as the well-known deep Q-network (DQN) (mnih-dqn-2015, 6), one needs to additionally record the candidate item set. This induces huge overhead in industrial applications, as the candidate set usually contains several hundred items. While the overhead can be reduced by learning RL models on item embedding vectors, it is reported that the resulting performance is much worse than learning RL models on items directly (arXiv.2110.11073, 8).

This paper proposes an efficient alternative for optimizing long-term user engagement. The key observation behind the proposed approach is that, in industrial settings, RS are periodically trained on logs generated by some other RS, which is referred to as the behavior policy hereafter. For applications serving millions of users the behavior policy is usually highly optimized, so by mining the relation between user engagement and the items selected by this policy we can obtain a model of user engagement. Notably, the proposed approach does not require the candidate item set during learning, so it can be integrated into any existing myopic RS without introducing huge overhead.

In addition to the efficiency issue, the proposed approach also considers stochasticity in rewards and in the termination of interactions. RL algorithms such as DQN usually assume that (a) rewards are deterministic with respect to states and actions and (b) interactions last for infinitely many steps. Neither assumption is commonly satisfied in RS. Real-world RS need to serve users with diverse interests, yet they only have limited information about users. Consider two different users who have the same feature values. Because they have different interests, the feedback they provide for the same item will differ. From the perspective of the RS, this means that rewards for the same state and action tend to be random. For the same reason, it is unlikely that the two users will interact with the RS for the same number of steps. Unfortunately, neither type of stochasticity has been addressed so far. To model the former, this paper employs the recent distributional RL framework (Bellemare2017, 1). For the latter, this paper extends the distributional RL framework to a random-termination setting and learns a model that predicts whether users terminate the interaction.

We evaluated the proposed approach on a real-world industrial RS for short videos that serves millions of users. During a week-long online A/B test, deploying the proposed approach improved the average number of videos viewed by a user by 2.72% and the average duration of video views by 1.59%. Moreover, the experiment also confirmed that modeling stochasticity is essential for effectively modeling long-term user engagement.

The contributions of this paper can be summarized as follows.

  • This paper proposes a simple and efficient approach for modeling long-term user engagement in RS.

  • This paper proposes to consider stochasticity in user feedback and termination when modeling user engagement.

  • This paper evaluates the proposed method on an industrial RS and confirms its efficacy.

The rest of this paper is organized as follows. Section 2 briefly reviews relevant literature, and Section 3 provides technical background. Section 4 describes the proposed approach. Section 5 presents experiments on a real-world RS, and Section 6 concludes this paper.

2. Related Work

Existing attempts at RL-based recommendation can be classified into policy-gradient methods (10.1145/3289600.3290999, 2) and variants of the DQN algorithm (10.1145/3240323.3240374, 13, 14). This paper addresses the efficiency issue, and it can be considered a complementary approach for resource-constrained scenarios.

In the RS literature, significant effort has been devoted to neural architectures. The proposed approach leverages the DLRM (DLRM19, 7) architecture, though it can be combined with other architectures as well. Meanwhile, to model the randomness in rewards, the proposed approach leverages the recent distributional RL framework (Bellemare2017, 1, 3). As for random termination, in the literature it has been considered in model-based RL (pmlr-v70-white17a, 9) and modeled with evolutionary algorithms (yoshida2013reinforcement, 11) or meta-gradient descent (NEURIPS2018_2715518c, 10). Our proposed method differs from these approaches in that it learns the discount factor from the leaving behavior of users.

3. Preliminaries

Modeling Sequential Recommendation with RL

Sequential interaction between a user and a recommender agent is modeled as a discounted infinite-horizon Markov Decision Process (MDP) $\langle\mathcal{S},\mathcal{A},R,P,\gamma\rangle$. States $\mathcal{S}$ encode information about users, including features such as interest tags and lists of items consumed in the past few minutes, hours, or days. Actions $\mathcal{A}$ are candidate items. In distributional RL, rewards $R$ are treated as random variables. This paper uses the click signal as the reward, so a realization $r=1$ means the corresponding item is clicked, and $r=0$ otherwise. The transition dynamics $P$ governs state transitions, and the discount factor $\gamma$ is used for defining the value distribution.

Action Value Distribution

We are interested in modeling the sum of rewards obtained during agent-environment interaction, which characterizes long-term user engagement. As rewards are assumed to be random variables, the state-action value $Z^{\pi}(s,a)$, the discounted sum of rewards obtained after selecting action $a$ at state $s$ and following $\pi$ afterwards, is also a random variable. Bellemare et al. showed that $Z^{\pi}(s,a)$ is the fixed point of the distributional Bellman operator $\mathcal{T}^{\pi}$ defined as (Bellemare2017, 1):

(1) $\forall(s,a)\in\mathcal{S}\times\mathcal{A},\quad \mathcal{T}^{\pi}Z^{\pi}(s,a)\overset{D}{=}R(s,a)+\gamma Z^{\pi}(s^{\prime},\pi(s^{\prime})),$

where $s^{\prime}\sim P(\cdot\,|\,s,a)$, and $X\overset{D}{=}Y$ means that the random variable $X$ has the same distribution as the random variable $Y$.

Dabney et al. proposed to parameterize $Z^{\pi}$ with so-called quantile distributions (Dabney2018, 3). The idea is to learn a parametric function $\beta:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{M}$ that estimates $M$ evenly spaced quantiles of $Z^{\pi}$. Given data from $\pi$, they proposed to learn $Z^{\pi}$ by minimizing the quantile Huber loss:

(2) $L(\beta)=\mathbb{E}_{(s,a,r,s^{\prime})\sim\pi}\frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{M}\left[\rho_{\hat{\tau}_{i}}^{\kappa}\left(r+\gamma\beta^{\prime}(s^{\prime},\pi(s^{\prime}))_{j}-\beta(s,a)_{i}\right)\right],$

where $\tau_{i}$ is the $i$-th quantile of the value distribution, and $\hat{\tau}_{m}=\frac{1}{2}(\tau_{m-1}+\tau_{m})$ for $m=1,2,\dots,M$ are the quantile mid-points, with $\tau_{0}=0$ for simplicity. $\rho_{\tau}^{\kappa}(u)=\frac{1}{\kappa}|\tau-\mathbb{I}(u<0)|\rho_{\kappa}(u)$, where $\rho_{\kappa}$ is the Huber loss with threshold $\kappa$ and $\mathbb{I}(\cdot)$ is the indicator function. $\kappa>0$ is a hyper-parameter, and in practice we use $\kappa=1$. $\beta^{\prime}$ has the same neural structure as $\beta$, and its parameters are periodically copied from $\beta$.
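To make Equation 2 concrete, the following is a minimal PyTorch-style sketch of the quantile Huber loss. It is an illustration under our own naming and shape conventions, not the authors' implementation.

```python
import torch

def quantile_huber_loss(beta_sa, target, kappa=1.0):
    """Quantile Huber loss of Equation 2 for one mini-batch (a sketch).

    beta_sa: [B, M] current quantile estimates beta(s, a)_i
    target:  [B, M] bootstrapped targets r + gamma * beta'(s', pi(s'))_j (detached)
    """
    B, M = beta_sa.shape
    # Quantile mid-points tau_hat_i = (i - 0.5) / M for i = 1, ..., M
    tau_hat = (torch.arange(M, dtype=beta_sa.dtype, device=beta_sa.device) + 0.5) / M

    # Pairwise TD errors u[b, i, j] = target_j - beta_i
    u = target.unsqueeze(1) - beta_sa.unsqueeze(2)            # [B, M, M]

    # Huber loss rho_kappa(u)
    huber = torch.where(u.abs() <= kappa,
                        0.5 * u.pow(2),
                        kappa * (u.abs() - 0.5 * kappa))

    # Asymmetric quantile weight |tau_hat_i - I(u < 0)| / kappa
    weight = (tau_hat.view(1, M, 1) - (u < 0).float()).abs() / kappa

    # (1/M) sum_i sum_j, then the empirical expectation over the batch
    return (weight * huber).mean(dim=2).sum(dim=1).mean()
```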

4. Proposed Method

4.1. The Learning Problem

We assume an agent learns from offline data $\mathcal{D}$ generated by a behavior policy $b$. $\mathcal{D}$ consists of logs of sequential interactions between users and the agent. We represent information about the user as states and recommended items as actions. The reward for a decision equals one if the corresponding item is clicked and zero otherwise. Finally, we organize $\mathcal{D}$ into transitions, written as $(s,a,r,s^{\prime})$, where $s$ and $s^{\prime}$ are two consecutive states, $a$ is the action chosen at $s$, and $r$ is the reward for this decision.
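For concreteness, each logged decision in $\mathcal{D}$ could be stored as a small record along the following lines (the field names are our own, hypothetical choices):

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Transition:
    s: Any          # user state: profile features and recent activity
    a: Any          # recommended item chosen by the behavior policy b
    r: float        # 1.0 if the item was clicked, 0.0 otherwise
    s_next: Any     # the user state at the following step
```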

Our goal is to learn a function $g(s,a):\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ that predicts the sum of rewards obtained after selecting $a$ at state $s$ and following $b$ afterwards, which characterizes the long-term user engagement of selecting $a$. With $g(s,a)$, we can rank items according to their long-term user engagement.

4.2. The Proposed Approach

The proposed method is based on the following observation. Real-world RS are often highly optimized using both learning-based and rule-based methods, and they are regularly updated to adapt to new users and new items. As a result, users tend to interact with them sequentially, so the data they generate contain information about items' effect on user engagement. Thus, we propose to mine such information from data.

Meanwhile, as discussed in Section 1, user feedback is stochastic by nature. Consequently, when modeling user engagement as cumulative user feedback, we have to appropriately address such randomness. The idea of the proposed approach is to estimate $Z^{b}(s,a)$, the value distribution of the behavior policy $b$, from data $\mathcal{D}$. This can be achieved with Equation 2. With $Z^{b}(s,a)$, we can use its expectation as a scoring function for long-term user engagement, i.e., $g(s,a)=\mathbb{E}[Z^{b}(s,a)]=\frac{1}{M}\sum_{j=1}^{M}\beta(s,a)_{j}$.
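In code, the scoring function is simply the average of the learned quantiles; a minimal sketch, assuming `beta` returns the $M$ quantile estimates as a tensor:

```python
def engagement_score(beta, s, a):
    """g(s, a) = E[Z^b(s, a)], approximated by the mean of the M quantile estimates."""
    quantiles = beta(s, a)      # length-M tensor of quantile estimates of Z^b(s, a)
    return quantiles.mean()
```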

We now discuss how to handle random termination. While the discount factor $\gamma$ in an MDP is used to define value functions and value distributions, it can also be interpreted as the probability that the interaction continues after a state-action pair (Littman1994, 5). Thus, $1-\gamma$ is precisely the probability that the interaction terminates. Our idea is to estimate this probability from data and learn $Z^{b}(s,a)$ based on the estimate. Consequently, $Z^{b}(s,a)$ becomes aware of the correlation between decisions and the termination of interactions. This leads to a new distributional Bellman operator $\mathcal{T}^{\pi}_{\ell}$:

(3) $\forall(s,a)\in\mathcal{S}\times\mathcal{A},\quad \mathcal{T}^{\pi}_{\ell}Z^{\pi}(s,a)\overset{D}{=}R(s,a)+\gamma^{\prime}(s^{\prime},\pi(s^{\prime}))Z^{\pi}(s^{\prime},\pi(s^{\prime})),$

where $\gamma^{\prime}(s,a):=\min(1-\ell(s,a),\eta)$, $\ell(s,a)$ is the probability that the interaction terminates after $(s,a)$, and $\eta\in(0,1)$ is a hyper-parameter. In this operator, the value distribution of the next state, $Z^{\pi}(s^{\prime},\pi(s^{\prime}))$, is weighted by $\gamma^{\prime}(s^{\prime},\pi(s^{\prime}))$, which characterizes the probability that the interaction continues after $(s^{\prime},\pi(s^{\prime}))$. Thus, with this operator the learned $Z^{b}(s,a)$ is able to model random termination.
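A sketch of how the termination-aware target in Equation 3 might be computed for a batch of transitions; `ell`, `beta_target`, and the tensor shapes are assumptions for illustration, not the production implementation.

```python
import torch

def bootstrap_target(r, s_next, a_next, beta_target, ell, eta):
    """Termination-aware target r + gamma'(s', a') * Z(s', a') from Equation 3."""
    with torch.no_grad():
        p_term = ell(s_next, a_next)                       # predicted termination probability, [B]
        gamma_prime = torch.clamp(1.0 - p_term, max=eta)   # gamma'(s', a') = min(1 - l(s', a'), eta)
        next_quantiles = beta_target(s_next, a_next)       # quantiles of Z(s', a'), [B, M]
        return r.unsqueeze(1) + gamma_prime.unsqueeze(1) * next_quantiles
```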

One potential issue with $\mathcal{T}^{\pi}_{\ell}$ is its convergence. In what follows we show that $\mathcal{T}^{\pi}_{\ell}$ is a contraction in the tabular case, based on the analysis of $\mathcal{T}^{\pi}$ provided by (Bellemare2017, 1). To begin with, let $F_{X}$ and $F_{Y}$ be the cumulative distribution functions (CDF) of random variables $X$ and $Y$. The Wasserstein metric $d_{p}$ between $F_{X}$ and $F_{Y}$ can be written as $d_{p}(F_{X},F_{Y})=\|F_{X}(u)-F_{Y}(u)\|_{p}$ (to simplify notation, random variables and their CDFs are conflated whenever possible). For two action value distributions $Z_{1}$ and $Z_{2}$, define $\bar{d}_{p}(Z_{1},Z_{2}):=\sup_{s,a}d_{p}(Z_{1}(s,a),Z_{2}(s,a))$. Our analysis is based on the following two properties of $d_{p}$ (Bellemare2017, 1). Suppose $A$ is a random variable that is independent of $X$ and $Y$, and $a$ is a constant; then:

(4) $d_{p}(A+X,A+Y)\leq d_{p}(X,Y),\qquad d_{p}(aX,aY)\leq|a|\,d_{p}(X,Y).$

Suppose we have samples from policy $\pi$. Then for two value distributions $Z_{1}$ and $Z_{2}$ over $\mathcal{S}\times\mathcal{A}$, and for any state $s\in\mathcal{S}$ and action $a\in\mathcal{A}$, we can bound the Wasserstein metric between them after applying $\mathcal{T}_{\ell}^{\pi}$ as:

(5) $\begin{split}d_{p}(\mathcal{T}_{\ell}^{\pi}Z_{1}(s,a),\mathcal{T}_{\ell}^{\pi}Z_{2}(s,a))&=d_{p}\bigl(R(s,a)+\gamma^{\prime}(s^{\prime},\pi(s^{\prime}))Z_{1}(s^{\prime},\pi(s^{\prime})),\,R(s,a)+\gamma^{\prime}(s^{\prime},\pi(s^{\prime}))Z_{2}(s^{\prime},\pi(s^{\prime}))\bigr)\\&\leq d_{p}\bigl(\gamma^{\prime}(s^{\prime},\pi(s^{\prime}))Z_{1}(s^{\prime},\pi(s^{\prime})),\,\gamma^{\prime}(s^{\prime},\pi(s^{\prime}))Z_{2}(s^{\prime},\pi(s^{\prime}))\bigr)\\&\leq\eta\,d_{p}\bigl(Z_{1}(s^{\prime},\pi(s^{\prime})),Z_{2}(s^{\prime},\pi(s^{\prime}))\bigr)\\&\leq\eta\,\sup_{s^{\prime},a^{\prime}}d_{p}\bigl(Z_{1}(s^{\prime},a^{\prime}),Z_{2}(s^{\prime},a^{\prime})\bigr)\\&=\eta\,\bar{d}_{p}(Z_{1},Z_{2}).\end{split}$

Hence $\bar{d}_{p}(\mathcal{T}_{\ell}^{\pi}Z_{1},\mathcal{T}_{\ell}^{\pi}Z_{2})=\sup_{s,a}d_{p}(\mathcal{T}_{\ell}^{\pi}Z_{1}(s,a),\mathcal{T}_{\ell}^{\pi}Z_{2}(s,a))\leq\eta\bar{d}_{p}(Z_{1},Z_{2})$, which means $\mathcal{T}^{\pi}_{\ell}$ is an $\eta$-contraction in $\bar{d}_{p}$.

4.3. Practical Algorithm

For a practical algorithm, we estimate $Z^{b}(s,a)$ and $\ell(s,a)$ simultaneously using the data $\mathcal{D}$. The objective function to be minimized is as follows:

(6) $\begin{split}L_{\ell}(\beta,\ell)=\frac{1}{|\mathcal{D}|}\sum_{(s,a,r,s^{\prime})\in\mathcal{D}}\Bigl\{&\frac{1}{M}\sum_{i=1}^{M}\sum_{j=1}^{M}\bigl[\rho_{\hat{\tau}_{i}}^{\kappa}\bigl(r+\gamma^{\prime}\beta^{\prime}(s^{\prime},\pi(s^{\prime}))_{j}-\beta(s,a)_{i}\bigr)\bigr]\\&-e\log\ell(s,a)-(1-e)\log(1-\ell(s,a))\Bigr\}.\end{split}$

Here $e=1$ if the interaction terminates at $(s,a)$ and $e=0$ otherwise. $\beta$ and $\ell$ can be parameterized as a multi-objective network in which they share feature embeddings but have separate feedforward heads; a sketch follows below. Algorithm 1 provides pseudocode for the resulting training procedure.
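The sketch below illustrates one possible multi-objective parameterization and the loss in Equation 6. The fully connected trunk is only a placeholder for the shared feature embeddings (DLRM in our experiments), and `quantile_huber_loss` refers to the sketch given in Section 3.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EngagementModel(nn.Module):
    """Shared feature trunk with a quantile head (beta) and a termination head (ell)."""

    def __init__(self, feature_dim, num_quantiles=200, hidden=256):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(feature_dim, hidden), nn.ReLU())
        self.quantile_head = nn.Linear(hidden, num_quantiles)  # beta(s, a): M quantiles of Z^b
        self.term_head = nn.Linear(hidden, 1)                  # ell(s, a): termination probability

    def forward(self, x):
        # x: concatenated (state, action) features for a mini-batch, shape [B, feature_dim]
        h = self.shared(x)
        return self.quantile_head(h), torch.sigmoid(self.term_head(h)).squeeze(-1)

def combined_loss(beta_sa, target, term_prob, e, kappa=1.0):
    """Equation 6: quantile Huber loss plus cross-entropy on the termination indicator e."""
    quantile_loss = quantile_huber_loss(beta_sa, target, kappa)  # from the earlier sketch
    term_loss = F.binary_cross_entropy(term_prob, e)             # e = 1.0 if the session ended
    return quantile_loss + term_loss
```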

Input: $T_{c}$, the frequency for copying $\beta$ to $\beta^{\prime}$
$B$, the size of mini-batches
Initialize $\beta$ and $\ell$ randomly and copy the values of $\beta$ to $\beta^{\prime}$
for mini-batch $i=1,2,\dots,\lceil|\mathcal{D}|/B\rceil$ do
       Update $\beta$ and $\ell$ using Equation 6.
       if $i \bmod T_{c}=0$ then
             $\beta^{\prime}\leftarrow\beta$
       end if
end for
Algorithm 1: A practical algorithm for the proposed framework
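Under the same assumptions as the sketches above, Algorithm 1 could be rendered as a short training loop. The data pipeline and default hyper-parameters here are illustrative, not the production setup.

```python
import copy
import torch

def train(model, dataloader, T_c=100, lr=1.5e-4, eta=0.99):
    """A rough rendering of Algorithm 1 (eta and the data pipeline are assumptions)."""
    target_model = copy.deepcopy(model)                           # beta'
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for i, (x, x_next, r, e) in enumerate(dataloader, start=1):   # mini-batches of transitions
        beta_sa, term_prob = model(x)                             # beta(s, a), ell(s, a)
        with torch.no_grad():
            beta_next, term_next = target_model(x_next)
            gamma_prime = torch.clamp(1.0 - term_next, max=eta)   # gamma'(s', a')
            target = r.unsqueeze(1) + gamma_prime.unsqueeze(1) * beta_next
        loss = combined_loss(beta_sa, target, term_prob, e)       # Equation 6
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if i % T_c == 0:                                          # copy beta to beta'
            target_model.load_state_dict(model.state_dict())
```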

5. Experiments

This section evaluates the proposed method on an industrial RS that generates a feed of "videos you might also like". In particular, we investigate the following two questions.

  (1) Can we improve the user engagement of a real-world RS by modeling $Z^{b}(s,a)$?

  (2) Is modeling randomness in rewards and termination beneficial in practice?

5.1. Methodology

We followed the standard protocol for A/B testing. Users were randomly hashed into a control group and test groups. Before the evaluation period, we confirmed via statistical testing that there was no significant difference between the control group and the test groups. Every group had at least 630,000 users.

As our goal is to improve the RS by modeling long-term user engagement, we evaluated the performance of the RS after deploying the algorithms to the system. Let $g^{b}(s,a)$ be the current ranking function of the RS. After deploying the engagement estimator $g(s,a)$, the new ranking function becomes $g^{b}_{\text{new}}(s,a)=g^{b}(s,a)+w\cdot g(s,a)$, where $w$ is a parameter tuned in preliminary experiments.
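In other words, deployment amounts to a one-line change to the ranking score; the function names below are illustrative:

```python
def blended_score(s, a, g_b, g, w):
    """New ranking score g_new(s, a) = g_b(s, a) + w * g(s, a)."""
    return g_b(s, a) + w * g(s, a)

def rank_candidates(s, candidates, g_b, g, w):
    """Rank candidate items by the blended score (highest first)."""
    return sorted(candidates, key=lambda a: blended_score(s, a, g_b, g, w), reverse=True)
```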

We used metrics for video consumption as evaluation metrics for user engagement; they are listed in Table 1. We report the change in performance after deploying the algorithms to the RS. A positive change indicates that an algorithm improves the performance of the current online policy.

Table 1. Evaluation metrics used in the online experiments. VV is the number of videos clicked by a user, and IMP is the number of videos shown to a user. DUR is the total time a user spent watching videos. These three metrics measure overall performance during the A/B test, so we consider them metrics for user engagement. CTR and CVR are metrics for individual videos. They reflect the extent to which a user likes individual items, so they are metrics for short-term utility.
Metric Meaning
video views (VV) #clicks / #user
#impression (IMP) #impression / #user
duration (DUR) \sum(play duration) / #user
CTR #clicks / #impression
CVR \sum(play duration) / \sum(clicked videos length)

5.2. Algorithms

Control group

A highly optimized RS for short videos used in an industrial application.

Proposed

The proposed approach, which considers randomness in rewards and termination for modeling user engagement.

Random-Reward

A variant of the proposed approach that considers randomness in rewards but not randomness in termination. We refer to this method as RR.

Full-Deterministic

A variant of the proposed approach that directly estimates $g(s,a)$ without considering randomness in rewards or termination. We refer to this method as FD.

5.3. Implementation Details

We used the DLRM architecture in our experiments. There were 40 features in total, including user ID, item ID, and item keywords. These discrete features were mapped to embedding vectors in $\mathbb{R}^{32}$. Parameters of the target module were copied from the policy module every 100 gradient steps. Algorithms were trained daily using a single P40 GPU. We used Adam as the optimizer for all three algorithms, with the learning rate set to 0.00015. When modeling randomness in rewards, the number of quantiles was set to 200.
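For reference, the stated hyper-parameters can be gathered into a small configuration sketch (the key names are our own shorthand):

```python
# Hyper-parameters as stated in Section 5.3 (key names are our own shorthand)
CONFIG = {
    "num_features": 40,          # user ID, item ID, item keywords, ...
    "embedding_dim": 32,         # discrete features mapped to vectors in R^32
    "num_quantiles": 200,        # M, used when modeling randomness in rewards
    "target_copy_every": 100,    # gradient steps between copies of beta to beta'
    "optimizer": "Adam",
    "learning_rate": 0.00015,
}
```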

Table 2. Results of our online A/B test. The numbers shown in parentheses are p-values. Without considering any randomness, FD failed to improve any metric significantly. After considering randomness in rewards, RR was able to improve VV, CTR, and CVR significantly. The proposed approach improved VV, DUR, CTR, and CVR significantly. The results confirm that modeling randomness in rewards and termination is crucial for modeling user engagement.
Metrics FD RR Proposed
VV 1.19% (0.11) 1.58% (0.031) 2.72% ($2.1\times 10^{-4}$)
IMP 0.140% (0.86) -0.265% (0.74) 0.502% (0.53)
DUR 0.263% (0.73) 1.24% (0.10) 1.59% ($3.5\times 10^{-2}$)
CTR 1.05% (0.93) 1.85% ($3.6\times 10^{-3}$) 2.21% ($4.1\times 10^{-4}$)
CVR 0.410% (0.14) 1.22% ($3.0\times 10^{-6}$) 1.07% ($4.8\times 10^{-5}$)

5.4. Results

Table 2 shows the results of our online evaluation. FD failed to provide any significant improvement over the control group. RR was able to improve VV, CTR, and CVR significantly, though its improvement in DUR was not significant. These results imply that users in this group were likely to watch slightly more videos than those in the control group. Moreover, they show that modeling randomness in rewards is indeed helpful for characterizing items' effect on user engagement. Meanwhile, the proposed method significantly improved VV, DUR, CTR, and CVR, showing that randomness in termination is yet another key factor for modeling user engagement. Together, the results in Table 2 confirm the efficacy of the proposed method as an efficient alternative for modeling long-term user engagement in RS.

6. Conclusion

This paper investigates how to model long-term user engagement for RS efficiently. The proposed approach relies on the observation that the behavior policy (i.e., the RS in the production environment) in industrial applications is often highly optimized. Instead of learning the optimal policy from scratch, the proposed model captures the correlation between user engagement and recommended items in offline data. Although not guaranteed to be optimal, it is computationally efficient and is confirmed to be effective on a real-world system. Moreover, this paper proposes to model randomness in rewards and termination and confirms the importance of these two factors for modeling user engagement.

References

  • (1) Marc G. Bellemare, Will Dabney and Rémi Munos “A Distributional Perspective on Reinforcement Learning” In Proceedings of the Thirty-fourth International Conference on Machine Learning Sydney, Australia: PMLR, 2017, pp. 449–458
  • (2) Minmin Chen et al. “Top-K Off-Policy Correction for a REINFORCE Recommender System” In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining Melbourne, VIC, Australia: Association for Computing Machinery, 2019, pp. 456–464
  • (3) Will Dabney, Mark Rowland, Marc G. Bellemare and Rémi Munos “Distributional Reinforcement Learning With Quantile Regression” In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 2892–2901
  • (4) Eugene Ie et al. “SlateQ: A Tractable Decomposition for Reinforcement Learning with Recommendation Sets” In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence International Joint Conferences on Artificial Intelligence Organization, 2019, pp. 2592–2599
  • (5) Michael L. Littman “Markov Games As a Framework for Multi-agent Reinforcement Learning” In Proceedings of the Eleventh International Conference on International Conference on Machine Learning, 1994, pp. 157–163
  • (6) Volodymyr Mnih et al. “Human-level control through deep reinforcement learning” In Nature 518.7540, 2015, pp. 529–533
  • (7) Maxim Naumov et al. “Deep Learning Recommendation Model for Personalization and Recommendation Systems”, 2019 arXiv:1906.00091 [cs.IR]
  • (8) Kai Wang et al. “RL4RS: A Real-World Benchmark for Reinforcement Learning based Recommender System”, 2021 arXiv:2110.11073 [cs.IR]
  • (9) Martha White “Unifying Task Specification in Reinforcement Learning” In Proceedings of the Thirty-Fourth International Conference on Machine Learning Sydney, Australia: PMLR, 2017, pp. 3742–3750
  • (10) Zhongwen Xu, Hado P Hasselt and David Silver “Meta-Gradient Reinforcement Learning” In Advances in Neural Information Processing Systems Curran Associates, Inc., 2018
  • (11) Naoto Yoshida, Eiji Uchibe and Kenji Doya “Reinforcement learning with state-dependent discount factor” In 2013 IEEE third joint international conference on development and learning and epigenetic robotics, 2013, pp. 1–6 IEEE
  • (12) Xiangyu Zhao et al. “Deep Reinforcement Learning for Online Advertising in Recommender Systems” In arXiv preprint arXiv:1909.03602, 2019
  • (13) Xiangyu Zhao et al. “Deep Reinforcement Learning for Page-Wise Recommendations” In Proceedings of the 12th ACM Conference on Recommender Systems Vancouver, BC, Canada: Association for Computing Machinery, 2018, pp. 95–103
  • (14) Guanjie Zheng et al. “DRN: A Deep Reinforcement Learning Framework for News Recommendation” In Proceedings of the 2018 World Wide Web Conference Lyon, France: International World Wide Web Conferences Steering Committee, 2018, pp. 167–176