Locality-Sensitive Experience Replay for Online Recommendation

Xiaocong Chen [email protected] University of New South WalesSydneyAustralia , Lina Yao [email protected] University of New South WalesSydneyAustralia , Xianzhi Wang [email protected] University of Technology SydneySydneyAustralia and Julian McAuley [email protected] University of California, San DiegoCAUSA

(2018)

Abstract.

Online recommendation requires handling rapidly changing user preferences. Deep reinforcement learning (DRL) is an effective means of capturing users’ dynamic interest during interactions with recommender systems. Generally, it is challenging to train a DRL agent, due to large state space (e.g., user-item rating matrix and user profiles), action space (e.g., candidate items), and sparse rewards. Existing studies leverage experience replay (ER) to let an agent learn from past experience. However, they adapt poorly to the complex environment of online recommender systems and are inefficient in determining an optimal strategy from past experience. To address these issues, we design a novel state-aware experience replay model, which selectively selects the most relevant, salient experiences, and recommends the agent with the optimal policy for online recommendation. In particular, the model uses locality-sensitive hashing to map high dimensional data into low-dimensional representations and a prioritized reward-driven strategy to replay more valuable experience at a higher chance. Experiments on three online simulation platforms demonstrate our model’s feasibility and superiority to several existing experience replay methods.

Recommender System, Deep Reinforcement Learning, Experience Replay

^†^†copyright: none^†^†journalyear: 2018^†^†doi: 10.1145/1122445.1122456^†^†conference: Woodstock ’18: ACM Symposium on Neural Gaze Detection; June 03–05, 2018; Woodstock, NY^†^†booktitle: Woodstock ’18: ACM Symposium on Neural Gaze Detection, June 03–05, 2018, Woodstock, NY^†^†price: 15.00^†^†isbn: 978-1-4503-XXXX-X/18/06

1. Introduction

Online recommendation aims to learn users’ preferences and recommend items dynamically to help users find desired items in highly dynamic environments (Zhang et al., 2019). Deep reinforcement learning (DRL) naturally fits online recommendation as it learns policies through interactions with the environment via maximizing a cumulative reward. Besides, DRL has been widely applied to sequential decision-making (e.g. in Atari (Mnih et al., 2013) and AlphaGo (Silver et al., 2016)) and achieved remarkable progress. Therefore, it is increasing applied for enhancing online recommender systems (Chen et al., 2018; Bai et al., 2019; Zhao et al., 2019).

DRL-based recommender systems cover three categories of methods: deep Q-learning (DQN), policy gradient, and hybrid methods. DQN aims to find the best step via maximizing a Q-value over all possible actions. As the representatives, Zheng et al. (2018) introduced DRL into recommender systems for news recommendation; Chen et al. (2018) introduced a robust reward function to Q-learning, which stabilized the reward in online recommendation. Despite the capability of fast-indexing in selecting a discrete action, Q-learning-based methods conduct the “maximize” operation over the action space (i.e., all available items) and suffer from the stuck agent problem (Dulac-Arnold et al., 2015)—the “maximize” operation becomes unfeasible when the action space has high dimensionality (e.g., 100,000 items form a 10k-dimensional action space) (Chen et al., 2019). Policy-gradient-based methods use the average reward as guideline to mitigate the stuck agent problem (Chen et al., 2019). However, they are prone to converge to sub-optimality (Pan et al., 2019). While both DQN and policy gradient are more suitable for small action and state spaces (Lillicrap et al., 2015; Wang et al., 2018) in a recommendation context, hybrid methods (Chen et al., 2019; Dulac-Arnold et al., 2015; Zhao et al., 2018; Hu et al., 2018) has the capability to map large high-dimensional discrete state spaces into low-dimensional continuous spaces via combines the advantages of Q-learning and policy gradient. A typical hybrid method is the actor-critic network (Konda and Tsitsiklis, 2000), which adopts policy gradient on an actor network and Q-learning on a critic network to achieve Nash equilibrium on both networks. Actor-critic networks have been widely applied to DRL-based recommender systems (Liu et al., 2020; Chen et al., 2020).

Existing DRL-based recommendation methods except policy-gradient-based ones rely heavily on experience replay to learn from previous experience, avoid re-traversal of the state-action space, and stabilize the training on large, sparse state and action spaces (Zha et al., 2019). They generally require long training time, thus suffering from the training inefficiency problem. Further more, in contrast to the larger, diverse pool of continuous actions required in recommendation tasks, existing experience replay methods are mostly designed for games with a small pool of discrete actions. Therefore, a straightforward application of those methods may result in strong biases during the policy learning process (Hou et al., 2017), thus impeding the generalization of optimal recommendation results. For example, Schaul et al. (2016) assume that not every experience is worth replaying and propose a prioritized experience replay (PER) method to replay only the experience with the largest temporal difference error. Sun et al. (2020) propose attentive experience replay (AER), which introduces similarity measurement into PER to boost the efficiency of finding similar states’ experience, but attention mechanisms cause inefficiency on large sized state and action spaces (Kitaev et al., 2020).

We present a novel experience replay structure, Locality-Sensitive Experience Replay (LSER), to address the above challenges. Differing from existing approaches, which apply random or uniform sampling, LSER samples experiences based on expected states. Inspired by collaborative filtering (which measures the similarity between users and items to make recommendations) and AER (Sun et al., 2020), LSER only replays experience from similar states to improve the sampling efficiency. Specifically, we introduce a ranking mechanism to prioritize replays and promote the higher reward experiences. We further use $\epsilon$ -greedy method to avoid replaying high-rears states excessively.

Refer to caption — Figure 1. The proposed LSER with DDPG. The environment provides the current state $s_{t}$ , with the policy $\pi$ learned by the DDPG model; the action $a_{t}$ can be obtained by $a_{t}=\pi(s_{t})$ . $r_{t}$ will be provided by the user (e.g. click or not). LSER takes $s_{t}$ as the input and encodes it on the projective space. Given the encoded states, LSER will return the most similar experience for DDPG to update he parameters. After that, this transition $h(s_{t}):(s_{t},a_{t},s_{t+1},r_{t})$ will be stored.

Considering the high-dimensionality of vectorized representations of states, We convert similarity measurement for high-dimensional data into a hash key matching problem and employ locality-sensitive hashing to transform states into low-dimensional representations. Then, we assign similar vectors the same hash codes (based on the property of locality-sensitive hashing). Such a transformation reduces all the states into low dimension hash keys.

In summary, we make the following contributions in this paper:

•

We propose a novel experience replay method (LSER) for reinforcement-learning-based online recommendation. It employs a similarity measurement to improve training efficiency.
•

LSER replays experience based on the similarity level of the given state and the stored states; the agent thus has a higher chance to learn valuable information than it does with uniform sampling.
•

The experiments on three platforms, VirtualTB, RecSim and RecoGym, demonstrate the efficacy and superiority of LSER to several state-of-the-art experience replay methods.

2. Methodology

In this section, we will briefly introduce the proposed LSER method with theoretical analysis. The overall structure of using LSER in DRL RS can be found in Figure 1.

2.1. Overview

Online Recommendation aims to find a solution that best reflects real-time interactions between users and the recommender system and apply the solution to the recommendation policy. The system needs to analyze users’ behaviors and update the recommend policy dynamically. In particular, reinforcement learning-based recommendation learns from interactions through a Markov Decision Process (MDP).

Given a recommendation problem consisting of a set of users $\mathcal{U}=\{u_{0},u_{1},\cdots u_{n}\}$ , a set of items $\mathcal{I}=\{i_{0},i_{1},\cdots i_{m}\}$ and user’s demographic information $\mathcal{D}=\{d_{0},d_{1},\cdots,d_{n}\}$ , MDP can be represented as a tuple $(\mathcal{S},\mathcal{A},\mathcal{P},\mathcal{R},\gamma)$ , where $\mathcal{S}$ denotes the state space (i.e., the combination of the subsets of $\mathcal{I}$ and its corresponding user information) $\mathcal{A}$ denotes the action space, which represents agent’s selection during recommendation based on the state space $\mathcal{S}$ , $\mathcal{P}$ denotes the set of transition probabilities for state transfer based on the action received, $\mathcal{R}$ is a set of rewards received from users, which are used to evaluate the action taken by the recommender system (each reward is a binary value to indicate whether user has clicked the recommended item or not), and $\gamma$ is a discount factor $\gamma\in[0,1]$ for the trade-off between future and current rewards.

Given a user $u$ and an initial state $s_{0}$ observed by the agent (or the recommender system), which includes a subset of item set $\mathcal{I}$ and user’s profile information $d_{0}$ , a typical recommendation iteration for the user goes as follows: first, the agent takes an action $a_{0}$ based on the recommend policy $\pi_{0}$ under the observed state $s_{0}$ and receives the corresponding reward $r_{0}$ —the reward $r_{0}$ is the numerical representation for user’s behavior such as click through or not; then, the agent generates a new policy $\pi_{1}$ based on the received reward $r_{0}$ and determines the new state $s_{1}$ based on the probability distribution $p(s_{\textit{new}}|s_{0},a_{0})\in\mathcal{P}$ . The cumulative reward (denoted by $r_{c}$ ) after $k$ iterations from the initial state is as follows:

\displaystyle r_{c}=\sum_{k=0}\gamma^{k}r_{k}

DRL-based recommender systems uses a replay buffer to store and replay old experience for training. Given the large state and action space in a recommender system, not every experiences are worth to replay (Chen et al., 2021)—replaying experience that does not contain useful information will increase the training time significantly and introduce extra uncertainty to convergence. Hence, it is reasonable to prioritize replay important experience for DRL recommender systems.

The ideal criterion for measuring the importance of a transition in RL is the amount of knowledge learnable from the transition in its current state (Schaul et al., 2016; Horgan et al., 2018). State-of-the-art methods like AER are unsuitable for recommendation tasks that contain large, higher dimensional state and action spaces as their sampling strategies may not work properly. Thus, we propose a new experience replay method named Locality-sensitive experience replay (LSER) for online recommendation, which uses hashing for dimension reduction when sampling and storing the experiences.

2.2. Locality-sensitive Experience Replay

We formulate the storage and sampling issue in LSER as a similarity measure problem, where LSER stores similar states into the same buckets and samples similar experiences based on state similarities. A popular way of searching similar high-dimensional vectors in Euclidean space is Locality-Sensitive Hashing (LSH), which follows the idea of Approximate Nearest Neighbor (ANN) while allocating similar items into the same buckets to measure the similarity. However, standard LSH conducts bit-sampling on the Hamming space; it requires time-consuming transformation between the Euclidean space to the Hamming space, liable to lose information. Aiming at measuring the similarity between high-dimensional vectors without losing significant information, we propose using $p$ -stable distribution (Nolan, 2003) to conduct dimensionality reduction while preserving the original distance. This converts high-dimensional vectors (states) into low-dimensional representations easier to be handled by the similarity measure.

To address possible hash collision (i.e., dissimilar features may be assigned into the same bucket and recognized as similar), we introduce the formal definition of the collision probability for LSH. Then, we theoretically analyze the collision probability for $p$ -stable distribution to prove that our method has a reasonable boundary for collision probability.

Definition 1 (Collision probability for LSH in $p$ -stable distribution).

Given an LSH function $h_{ab}\in\mathcal{H}$ and the probability density function (PDF) of the absolute value of the $p$ -stable ( $p\in[1,2])$ distribution $f_{p}(t)$ in $L^{p}$ space, the collision probability for vectors u and v is represented by:

(1)

\displaystyle P=Pr[h_{\textbf{ab}}(\textbf{u})=h_{\textbf{ab}}(\textbf{v})]=\int_{0}^{w}\frac{1}{c}f_{p}\bigg{(}\frac{t}{c}\bigg{)}\bigg{(}1-\frac{t}{w}\bigg{)}~{}dt

where $c=\|\textbf{u}-\textbf{v}\|_{p}$ and $w$ is a user-defined fixed distance measure.

Here, we use a 2-state distribution, i.e., normal distribution for dimension reduction. We randomly initialize $n_{h}$ hyperplanes based on normal distribution on the projective space $\mathcal{P}^{n}$ to get the hash representation for a given state $s$ , where $n$ is the dimension of the state. The hashing representation $h(s)$ for the given state $s$ is calculated as follows:

(2)

\displaystyle h_{p\in\mathcal{P}^{n}}(s)=\{0,1\}^{n}\text{with}\begin{cases}1&p_{i}\cdot s_{i}>0\\ 0&p_{i}\cdot s_{i}\leq 0{}\end{cases}

The collision probability of the above method can be represented as:

	$\displaystyle P=\mathit{Pr}[h_{p\in\mathcal{P}^{n}}(\textbf{u})=h_{p\in\mathcal{P}^{n}}(\textbf{v})]=1-\frac{Ang(\textbf{u},\textbf{v})}{\pi}$
(3)		$\displaystyle\text{ where }Ang({\textbf{u},\textbf{v}})=\arccos\frac{\|\textbf{u}\cap\textbf{v}\|}{\sqrt{\|\textbf{u}\|\cdot\|\textbf{v}\|}}~{}$

Eq.(2) formulates the information loss during the projection, where we use term $e$ to represent the quantification between the real value $p\cdot v$ and hashed results induced from $h(\textbf{v})$ . Since the relative positions in original space are preserved during the hash transformation with an extra measurement $e$ , the upper bound and lower bound of collision probability boundary in projective space is guarantee to be intact. That means the more dissimilar states will not receive a higher probability to be allocated into the same hash result.

Lemma 1.

Given an arbitrary hash function $h_{ab}\in\mathcal{H}$ , the collision probability for a given vector u and v is bounded at both ends.

Proof.

Since $\mathit{Pr}[h_{ab}(\textbf{u})=h_{ab}(\textbf{v})]$ monotonically decreases in $c$ for any hash function from the LSH family $\mathcal{H}$ , the collision probability is bounded from above by $Pr[h_{ab}(\textbf{u})=h_{ab}(\textbf{v})]$ for $c-e$ and from below by $Pr[h_{ab}(\textbf{u})=h_{ab}(\textbf{v})]$ for $c+e$ .

\displaystyle P=\int_{0}^{w}\frac{1}{c}f_{p}\bigg{(}\frac{t}{c}\bigg{)}\bigg{(}1-\frac{t}{w}\bigg{)}~{}dt=\int_{0}^{w/c}f_{p}(q)\bigg{(}1-\frac{qc}{w}\bigg{)}~{}dq\text{ with }q=\frac{t}{c}

Then, we have the upper bound:

		$\displaystyle\int_{0}^{w/(c-e)}f_{p}(q)\bigg{(}1-\frac{(c-e)q}{w}\bigg{)}dq\text{ with }q=\frac{t}{c-e}$
	$\displaystyle=$	$\displaystyle\int_{0}^{w/(c-e)}\bigg{(}f_{p}(q)\bigg{(}1-\frac{qc}{w}\bigg{)}+\frac{qef_{p}(q)}{w}\bigg{)}~{}dq$
	$\displaystyle\leq$	$\displaystyle P+\frac{e}{w}\int_{0}^{w/(c-e)}qf_{p}(q)~{}dq\leq P+\frac{e}{c-e}$

and the lower bound:

	$\displaystyle\int_{0}^{w/(c+e)}f_{p}(q)\bigg{(}1-\frac{(c+e)q}{w}\bigg{)}~{}dq\text{ with }q=\frac{t}{c+e}$
	$\displaystyle=\int_{0}^{w/(c+e)}\bigg{(}f_{p}(q)\bigg{(}1-\frac{qc}{w}\bigg{)}-\frac{qef_{p}(q)}{w}\bigg{)}~{}dq$
	$\displaystyle=P-\frac{e}{w}\int_{0}^{w/(c+e)}qf_{p}(q)~{}dq-\int_{w/(c+e)}^{w/c}f_{p}(q)\bigg{(}1-\frac{qc}{w}\bigg{)}~{}dq$
	$\displaystyle\geq P-\frac{e}{c+e}-\bigg{(}1-\frac{c}{c+e}\bigg{)}=P-\frac{2e}{c+e}$

We compute the upper bound based on Hölder’s inequality in $L^{1}$ space:

\displaystyle\int_{0}^{w/(c-e)}qf_{p}(q)~{}dq\leq\bigg{(}\sup_{q\in[0,w/(c-e)]}q\bigg{)}\|f_{p}\|_{1}\leq\frac{w}{c-e}

Considering the $L^{\infty}$ space, we have:

\displaystyle\int_{0}^{w/(c-e)}qf_{p}(q)~{}dq\leq\|f_{p}\|_{\infty}\int_{0}^{w/(c-e)}q~{}dq=\frac{w^{2}\|f_{p}\|_{{\infty}}}{2(c-e)^{2}}

We use the similar method in $L^{1}$ to compute the lower bound:

	$\displaystyle\int_{w/(c+e)}^{w/c}f_{p}(q)\bigg{(}1-\frac{qc}{w}\bigg{)}~{}dq$	$\displaystyle\leq\bigg{(}\sup_{q\in[w/(c+e),w/c]}\bigg{(}1-\frac{qc}{w}\bigg{)}\bigg{)}\\|f_{p}\\|_{1}$
		$\displaystyle\leq 1-\frac{cw/(c+e)}{w}=\frac{e}{c+e}$

and in $L^{\infty}$ :

	$\displaystyle\int_{w/(c+e)}^{w/c}f_{p}(q)\bigg{(}1-\frac{qc}{w}\bigg{)}~{}dq$	$\displaystyle\leq\\|f_{p}\\|_{\infty}\int_{w/(c+e)}^{w/c}\bigg{(}1-\frac{c}{w}q\bigg{)}~{}dq$
		$\displaystyle\leq\frac{e^{2}w\\|f_{p}\\|_{\infty}}{2c(c+e)^{2}}$

The collision probability $Pr[h_{ab}(u)=h_{ab}(v)]$ is bounded from both ends as follows:

\displaystyle\bigg{[}P-\min\bigg{(}\frac{2e}{c+e},\frac{e^{2}w\|f_{p}\|_{\infty}}{2(c+e)^{2}}\bigg{)},P+\min\bigg{(}\frac{e}{c-e},\frac{w^{2}\|f_{p}\|_{{\infty}}}{2(c-e)^{2}}\bigg{)}\bigg{]}

∎

Note that, when calculating the lower and upper bounds, $q$ represents $\frac{t}{c+e}$ and $\frac{t}{c-e}$ , respectively. The algorithm of LSER is shown in Algorithm 1.

In the following, we demonstrate from two perspectives that LSER can find the similar states efficiency. First, we show the efficacy of LSER with theoretical guarantee, i.e., similar states can be sampled given the current state. We formulate ‘the sampling of similar states’ as a neighbor-finding problem in the projective space and provide a theoretical proof of the soundness of LSER. Given a set of states $\mathcal{S}$ , and a query $q_{s}$ , LSER can quickly find a state $s\in\mathcal{S}$ within distance $r_{2}$ or determine that $\mathcal{S}$ has no states within distance $r_{1}$ . Based on existing work (Indyk and Motwani, 1998), the LSH family is $(r_{1},r_{2},p_{1},p_{2})$ -sensitive, i.e., we can find a distribution $\mathcal{H}$ such that $p_{1}\geq Pr_{h\sim\mathcal{H}}[h_{ab}(\textbf{u})=h_{ab}(\textbf{v})]$ when u and v are similar and $p_{2}\leq Pr_{h\sim\mathcal{H}}[h_{ab}(\textbf{u})=h_{ab}(\textbf{v})]$ when u and v are dissimilar.

Theorem 2.

Let $\mathcal{H}$ be $(r_{1},r_{2},p_{1},p_{2})$ -sensitive. Suppose $p_{1}>1/n$ and $p_{2}>1/n$ , where $n$ is the size of data points. There exists a solution for the neighbor finding problem in LSER within $O(n^{\rho}p_{1}^{-1}\log n)$ query time, and $O(n^{1+\rho}p_{1}^{-1})$ space.

Proof.

Assume $r_{1},r_{2},p_{1},p_{2}$ are known, $\rho=\frac{\log(1/p_{1})}{\log(1/p_{2})}$ , and $k=\frac{\log(n)}{\log(1/p_{2})}$ where $k$ is the number of hash functions, and LSH initializes $L$ tables. Based on the definition in (Indyk and Motwani, 1998), we have:

(4)

\displaystyle kL=k\left\lceil p_{1}^{-k}\right\rceil\leq k(e^{\log(1/p_{1})\cdot k}+1)\leq k(n^{\rho}/p_{1}+1)=O(n^{\rho}/p_{1}\log n)

The space complexity is calculated as $O(Lnd_{s})$ where $d_{s}$ is the dimension of state $s$ . It can be written as $O(n^{1+\rho}/p_{1}d_{s})$ (by applying $L=n^{\rho}/p_{1}$ ) and further simplified into $O(n^{1+\rho}/p_{1})$ .

Then, we prove LSER can find similar neighbors. The $L$ table can be classified into two categories: similar and dissimilar. Given a state $s$ , the similar category gives similar states while the dissimilar category provides dissimilar states. We split the two categories such that $L=\lfloor n\rfloor+\left\lceil m\right\rceil$ and its corresponding $\lfloor k\rfloor,\left\lceil k\right\rceil$ . Given any state $s\in\mathcal{S}$ in the distance $r_{1}$ , LSER must be able to find the most similar states in a high probability—the query and the data need to share the same hash-bucket in one of the tables. The probability of their not sharing the same hash-bucket is

(5)	$\displaystyle(1-p_{1}^{\lfloor k\rfloor})^{\lfloor n\rfloor}(1-p_{1}^{\left\lceil k\right\rceil})^{\left\lceil m\right\rceil}$	$\displaystyle\leq(1-p_{1}^{\lfloor k\rfloor})^{n-1}(1-p_{1}^{\left\lceil k\right\rceil})^{m}$
(6)		$\displaystyle\leq e^{-np_{1}^{\lfloor k\rfloor}-mp_{1}^{\left\lceil k\right\rceil}}(1-p_{1}^{\lfloor k\rfloor})^{-1}$
(7)		$\displaystyle=e^{-(np_{1}^{-1+\alpha}+mp_{1}^{\alpha})n^{-\rho}}(1-p_{1}^{\lfloor k\rfloor})^{-1}$
(8)		$\displaystyle=e^{-1}(1-p_{1}^{\lfloor k\rfloor})^{-1}$

where $\alpha=\left\lceil k\right\rceil-k$ . We have applied the definitions $p_{1}^{k}=p_{2}^{\rho k}=n^{-\rho}$ for step 6 to step 7 and $np_{1}^{-1+\alpha}+mp_{1}^{\alpha}=n^{\rho}$ for step (7) to (8). Finally, we get the probability of LSER’s getting the similar states as follows:

\displaystyle P\geq 1-e^{-1}(1-p_{1}^{\lfloor k\rfloor})^{-1}>0

Recall that $p_{1}>1/n$ . Therefore, we conclude that LSER can find the most similar states. ∎

2.3. Store and Sampling Strategy

Existing experience replay methods in DRL research assume that the recent experience is more informative than older experience. Therefore, they simply replace the oldest experience with the newest experience to update the experience buffer in DRL-based recomender systems without further optimization. As such, some valuable experience might be discarded, i.e., catastrophic forgetting. In contrast, we design a state-aware reward-driven experience storage strategy, which removes the experience with the lowest reward—instead of following the First-In-First-Out (FIFO) strategy—when the replay buffer is full. Formally speaking, a transition $\tau_{t}:(s_{t},a_{t},s_{t+1},r_{t})$ will be stored in the replay buffer based on the value $h_{p\in\mathcal{P}^{n}}(\tau_{t}.s_{t})$ . If the replay buffer is full, the transition with the same value of $h_{p\in\mathcal{P}^{n}}(\tau_{t}.s_{t})$ but lower reward will be replaced. In practice, an indicator $m_{t}$ is stored in the transition as well to indicate when the recommendation should terminate.

Sampling strategy is another crucial component of LSER, which determines which experience should be selected for the agent to optimize in LSER. We propose a state-aware reward-driven sampling strategy that only replays the experience with the top-N highest rewards in the same hashing area; this way, the agent can quickly find the correct direction for optimization. We call our sampling strategy ‘state-aware’ because we use a hash key to encode the state and replay the experience based on the hash key. Compared with uniform sampling, our strategy has a higher chance to replay the correct experience. Here, we illustrate how to address three related challenges faced by our sampling strategy: exploitation-vs-exploration dilemma, bias annealing and non-existence dilemma.

Exploitation vs. exploration dilemma. Exploitation and exploration dilemma is a well-known dilemma when training an agent for RL, including LSER. TWhile our reward-driven strategy forces the agent to exploit existing high-rewarding experiences, the agent may converge to a sub-optimal policy instead of the globally optimal one. We use a similar method to $\epsilon$ -greedy to achieve a trade-off between exploitation and exploration. LSER first draws a random probability $p\in[0,1]$ then uses reward-driven sampling if the probability less than a threshold, $\epsilon_{\mathit{max}}$ and random sampling otherwise. The threshold allows LSER to replay low priority experience to fulfill the exploration requirement.

Bias annealing. Prioritizing partial experiences among the replay buffer may introduce inductive bias (Schaul et al., 2016; Sun et al., 2020)—the training process is highly non-stationary (due to changing policies); even a small bias introduced by the sampling strategy may change the solution that the policy converges to. A common solution is to let the priority anneal periodically so that the agent can visit those less-replayed experiences. By using the threshold, our $\epsilon$ -greedy method has the similar effect as annealing on allowing low-priority experiences to be replayed.

Non-existence dilemma. When split the projective space into areas to initialize hyperplanes, some areas may not have any data points (esp. when the number of hyperplanes is large), causing the ‘non-existence dilemma’. Consequently, when a new transition comes, the algorithm will stop if no experience can be found on $h_{p\in\mathcal{P}^{n}}$ . We use the similarity measure to overcome this problem. Specifically, we find the two hash areas that are most similar to each other (based on current $h_{p\in\mathcal{P}^{n}}$ ) and conduct sampling on those two states. We use Jaccard similarity to measure the similarity between hash codes $A,B$ . As such, LSER can always replay the relevant experience.

2.4. Training Procedure

We use Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) as the training backbone. We choose an actor-critic network as the agent and train two parts of the actor-critic network simultaneously. The critic network aims to minimize the following loss function:

	$\displaystyle l(\theta_{\psi})=\frac{1}{N}\sum_{j=1}^{N}((r+\gamma\xi)-\psi_{\theta_{\psi}}(s_{t},a_{t}))^{2}$
	$\displaystyle\text{ where }\xi=\psi_{\theta_{\psi}^{\prime}}(s_{t+1},\phi_{\theta_{\phi}^{\prime}}(s_{t+1}))$

where $\theta_{\psi}$ and $\theta_{\phi}$ are the critic and actor parameters, $N$ is the size of the mini-batch from the replay buffer, $\psi_{\theta_{\psi}^{\prime}}$ and $\phi_{\theta_{\phi}^{\prime}}$ are the target critic and target actor network, respectively. We apply the Ornstein-Uhlenbeck process in the action space to introduce perturbation; this encourages the agent to explore. The target network will be updated based on the corresponding hyper-parameter $\tau$ .

3. Experiments

input : Transition for storage

\tau:(s_{t},a_{t},m_{t},s_{t+1},r_{t})

, capacity

c

, state dimension

d_{s}

, batch size

b

, Hash bits

n_{h}

, state

s

for sampling, epsilon threshold

\epsilon_{max}

1 Initialize

n_{h}

hyperplanes on projection space

\mathcal{P}_{d_{s}}

;

2 Initialize empty dictionary

\mathcal{M}

;

3 Function encode(s) is

4 for $p$ in $\mathcal{P}_{d_{s}}$ do

5 Calculate hash bits by using Eq.2;

7 end for

8 return ””.join(hash bits)

9 end

10Function Push( $\tau$ ) is

v_{h}

= encode(

\tau.s_{t}

) // get the hash code

11 if T.size ¡ T.capacity then

12 if $v_{h}$ in $\mathcal{M}$ then

\mathcal{M}[v_{h}]

.append(

\tau

);

15 else

\mathcal{M}[v_{h}]

\tau

;

18 end if

20 else

21 if $v_{h}$ in $\mathcal{M}$ and $\tau.r_{t}>\mathcal{M}[v_{h}][0].r$ then

// replace tuple has the minimal reward

\mathcal{M}[v_{h}][0]

\tau

;

24 end if

26 end if

27 Sort(

\mathcal{M}[v_{h}]

) based on reward in ascending order;

29 end

30Function Sample(s,b) is

v_{h}

= encode(

s

);

p

= random.random();

33 Find two most similar hash values

v_{1},v_{2}

based on

v_{h}

;

34 if $v_{h}$ in $\mathcal{M}$ then

35 if $p<\epsilon_{max}$ then

36 result =

\mathcal{M}[v_{h}][-b:]

;

38 else

39 result = random.sample(

\mathcal{M}[v_{h}],b

);

41 end if

43 else

44 if $p<\epsilon_{max}$ then

45 result =

\mathcal{M}[v_{1}][-b:]+\mathcal{M}[v_{2}][-b:]

;

47 else

48 result = random.sample(

\mathcal{M}[v_{1}],b

) + random.sample(

\mathcal{M}[v_{2}],b

);

50 end if

52 end if

53 return result;

55 end

Algorithm 1 LSH memory by using dictionary

3.1. Online Simulation Platform Evaluation

We conduct experiments on three widely used public simulation platforms: VirtualTB (Shi et al., 2019), RecSim (Ie et al., 2019) and RecoGym (Rohde et al., 2018), which mimic online recommendations in real-world applications.

VirtualTB is a real-time simulation platform for recommendation, where the agent recommend items based on users’ dynamic interests. VirtualTB uses a pre-trained generative adversarial imitation learning (GAIL) to generate different users who have both static interest and dynamic interest. It’s worth to mention that, the GAIL is pre-trained by using the real-world from Taobao, which is one of the largest online retail platforms in China. Moreover, the interactions between users and items are generated by GAIL as well. Benefit from that, VirualTB can provide a large number of users and the corresponding interactions to simulate the real-world scenario.

RecSim is a configurable platform for authoring simulation environments that naturally supports sequential interaction with users in recommender systems. RecSim differs from VirtualTB in containing different, simpler tasks but fewer users and items. There are two different tasks from RecSim, namely interest evolution and long-term satisfaction. The former (interest evolution) encourages the agent to explore and fulfill the user’s interest without further exploitation; the latter (long-term satisfaction) depicts an environment where a user interacts with content characterized by the level of ‘clickbaitiness.’ Generally, clickbaity items lead to more engagement yet lower long-term satisfaction, while non-clickbaity items have the opposite effect. The challenge lies in balancing the two to achieve a long-term optimal trade-off under the partially observable dynamics of the system, where satisfaction is a latent variable that can only be inferred from the increase/decrease in engagement.

RecoGym is a small Open AI gym-based platform, where users have no long-term goals. Different from RecSim and VirtualTB, RecoGym is designed for computational advertising. Similar with RecSim, RecoGym uses the click or not to represent the reward signal. Moreover, similar with RecSim, users in those two environments do not contain any dynamic interests.

Considering RecoGym and RecSim have limited data points and do not consider users’ dynamic interests, we select VirtualTB as the main platform for evaluations. Our model is implemented in Pytorch (Paszke et al., 2019) and all experiments are conducted on a server with two Intel Xeon CPU E5-2697 v2 CPUs with 6 NVIDIA TITAN X Pascal GPUs, 2 NVIDIA TITAN RTX and 768 GB memory. We use two two-hidden-layer neural networks with 128 hidden unit as the actor network and the critic network, respectively. $\tau$ , $\gamma$ , and $c$ are set to $0.001$ , $0.99$ and $1e^{6}$ , respectively, during experiments.

3.2. Evaluation Metrics and Baselines

The evaluation metrics are environment-specific. For VirtualTB and RecoGym, click-through rate is used as the main evaluation metric. For RecSim, we use the built-in metric, which is a quality score, as the main evaluation metric. We compare our method with the following baselines.

•

Prioritized Experience Replay (PER) (Schaul et al., 2016): an experience replay method for discrete control, which uses TD-error to rank experience and a re-weighting method to conduct the bias annealing.
•

Dynamic Experience Replay (DER) (Luo and Li, 2020): an experience replay method designed for imitation learning, where stores both human demonstrations and previous experience. Those experiences are selected randomly without any priority.
•

Attentive Experience Replay (AER) (Sun et al., 2020): an experience replay method that uses attention to calculate the similarly for boosting sample efficiency with PER.
•

Selective Experience Replay (SER) (Isele and Cosgun, 2018): an experience replay method for lifelong machine learning, which employs LSTM as the experience buffer and selectively stores experience.
•

Hindsight Experience Replay (HER) (Andrychowicz et al., 2017): an experience replay method that replays two experience (one successful, one unsuccessful) each time.

For AER, PER, SER and HER, We use the same training strategy as LSER. For DER, we use its original structure to run experiments without human demonstrations. The size of the replay buffer is set to $1,000,000$ for VirtualTB and $10,000$ for RecSim and RecoGym. The number of episodes for our experiments is set to $90,000$ for VirtualTB and $1,000$ for RecSim and RecoGym. Note that only PER, AER and SER contains a prioritize operation to rank or store the experience.

3.3. Results and Evaluation

Results for the three platforms (Fig 3) demonstrate our method (LSER) outperformed the baselines: LSER yields significant improvements on VirtualTB, which is a large and sparse environment; while AER, DER, PER and SER find a correct policy within around $50,000$ episodes, ours takes around $30,000$ episodes; HER does not perform well because it introduces too much failed experience and has a slow learning process; DER introduces the human demonstration into the vanilla ER which is hard to acquire for recommendation task.

Applying PER to DDPG slightly outperforms applying DER to DDPG, which is consistent with observations by previous work (Novati and Koumoutsakos, 2019; Sun et al., 2020). As PER was originally designed for Deep Q-learning, it uses the high TD-error to indicate the high informative experience for the value-network. When applying PER into DDPG, which is an actor-critic based algorithm, the sampled experience is also used to update the policy network. Those experiences with high TD-error normally diverge far away from the current policy and harm the updates of the policy-network. In contrast, LSER selects experience according to the similary with the current state. This preference for on-distribution states tends to discard experiences that contain old states and stabilize the training process of the policy network. AER does not perform as well as PER in VirtualTB because it heavily relies on the attention mechanism to calculate the similarity score between states. LSER’s $\epsilon$ -greedy method can enforce agent to do more exploration when user’s interest shift.

All methods gained similar results on RecSim and RecoGym because all methods can iterate all possible combinations of states and actions. Fig. 3(b), 3(c) and 3(d) show that LSER is slightly better and more stable than the baselines on RecSim and RecoGym. Since the two platforms are quite small¹¹1RecoGym only contains 100 users and 10 products; RecSim contains 100 users and 20 products., similarity matching and $\epsilon$ -greedy do not significantly improve performance.

3.4. Running Time Comparison

We report the running time of the selected experience replay methods in Table 1 to evaluate the efficiency of LSER. LSER outperforms all While performing poorly on RecSim and RecoGym, it is faster than most of the baselines. In comparison, LSER introduces extra running time in small environments (e.g, RecSim and RecoGym) than large environments. For VirutalTB, AER takes much longer time than all other methods, due to attention calculation (Kitaev et al., 2020).

Table 1. Comparison of running time for DER, PER, SER, AER and LSER coupling with DDPG in three different environments when running the experiments in

90,000

episodes

	Running Time ( $10^{3}$ s)
	RecSim(LTS)	RecSim(IE)	RecoGym	VirtualTB
DER	5.63	5.42	4.53	95.22
PER	5.44	5.15	4.18	94.52
SER	5.31	5.05	4.21	90.05
AER	5.18	4.94	4.12	145.33
HER	5.33	5.11	4.20	120.33
LSER	5.23	5.04	4.15	85.12

3.5. Ablation Study

We further investigate the effect of LSER’s store and sampling strategy by replacing our store strategy with the normal strategy and our sampling strategy with random sampling. The results of our ablation study are shown in Fig. 3(f), where LSER-P denotes LSER with the replaced store strategy and LSER-S denotes the LSER with the replaced sampling strategy. We found the sample strategy played the most critical role in achieving good performance, as LSER-S underperformed LSER significantly. The store strategy also contributed to the better performance. LSER-P was less stable (indicated by a wider error bar). but outperformed LSER at $\sim 30,000$ episodes, due to occurrence of sub-optimal policies.

3.6. Impact of Number of Hyperplanes

In our method, the number of hyperplanes is critical to determine the length of the result hash-bits of a given state. Longer hash-bits can provide more accurate similarity measurement result but low efficiency, while shorter hash-bits can increase efficiency but decrease the accuracy. It’s a trade-off which needs a middle-point to balance between efficiency and accuracy. We want to answer the following question:“Does increase the hyperplanes always boost the recommendation performance?” and find out the optimal number.

We report the experimental results in VirtualTB, where we evaluate the effect by varying number hyperplanes in LSER (shown in Fig 4). The results on the other two platforms show the similar pattern.

The performance gradually increases with more hyperplanes, but it levels off or even drops when number of hyperplanes reaches 20.

3.7. Discussion and Future Extensions

Fig 3(a) shows LSER suffers instability after reaching the first peak at episode $\sim 50,000$ . Different from the other methods, LSER can quickly reach the optimal policy but suffers fluctuation. That indicates $\epsilon$ -greedy tends to lead the agent towards learning from low-priority experience after the optimal policy is reached. We alleviate the issue by adjusting the value of $\epsilon$ . Here, we tried $\epsilon=\{0,0.9,0.99,1\}$ to determine the best choice of the $\epsilon$ on VirtualTB. The results are shown in Fig. 3(e), where $\epsilon=1$ corresponds to greedy sampling while $\epsilon=0$ refers to randomly sampling. Besides, we provide an intervention strategy to stabilize the training process—the agent will stop exploration once the test reward is higher than a reward threshold $T_{r}$ . This strategy allows the agent to find an near-optimal policy at an early stage. We examined the performance under $T_{r}$ =0.95, which delivers a better training process.

4. Related Work

Zhao et al. (2018) first introduced DRL to recommender systems Zhao et al. (2018). They use DQN to embed user and item information for news recommendaion, where Vanilla ER is used to help the agent learn from past experience. And until present, most methods use only vanilla ER, which uniformly samples experiences from the replay buffer. Among them, Zhao et al. (2020b) apply DQN to online recommendation and RNN to generate state embeddings; Chen et al. (2018) point out that DQN receives unstable rewards in dynamic environments such as online recommendation and may harm the agent; Chen et al. (2019) found that traditional methods like DQN become intractable when the state becomes higher-dimensional; DPG addresses the intractability by mapping high-dimensional discrete state into low-dimensional continuous state (Chen et al., 2020; Zhao et al., 2020a).

Intuitively, some instances are more important than others; so a better experience replay strategy is to sampling experiences according to how much current agent can learn from each of them. While such a measure is not directly accessible, proxies propose to retain experiences in the replay buffer or to sample experiences from the buffer. Replay strategies reply on optimization objectives. In simple continuous control tasks,experience replay should contain experiences that are not close to the current policy to prevent fitting to local minima, and the best replay distribution is in between an on-policy distribution and uniform distribution (De Bruin et al., 2015). However, they De Bruin et al. (2015) also note that such a heuristic is unsuitable for complex tasks where policies are updated for many iterations. In DRL problems, when the rewards are sparse, the agent can learn from failed experiences by replacing the original goals with states in reproduced artificial successful trajectories (Andrychowicz et al., 2017)

For complex control tasks, PER (Schaul et al., 2016) measures the importance of experiences using the TD-error and designs a customized importance sampling strategy to avoid the effect of bias. Based on that, Ref-ER (Novati and Koumoutsakos, 2019) actively enforces the similarity between policy and the experience in the replay buffer, considering on-policy transitions are more useful for training the current policy. AER (Sun et al., 2020) is an experience replay method that combines the advantages from PER and Ref-ER. It uses attention score to indicate state similarity and replays those experiences awarded high similarity with high priority. All aforementioned work focuses on optimizing the sampling strategy, aiming to select the salient and relevant agent’s experiences in replay buffer effectively. Selective experience replay (SER) (Isele and Cosgun, 2018), in contrast, aims to optimize the storing process to ensure only valuable experience will be stored. The main idea is to use an Long-short term memory (LSTM) network to store only useful experience.

5. Conclusion

In this paper, we propose state-aware reward-driven experience replay (LSER) to address the sub-optimality and training instability issues with reinforcement learning for online recommender systems. Instead of focusing on improving the sample efficiency for discrete tasks, LSER considers online recommendation as a continuous task; it then uses locality-sensitive hashing to determine state similarity and reward for efficient experience replay. Our evaluation of LSER against several state-of-the-art experience-replay methods on three benchmarks (VirtualTB, RecSim and RecoGym) demonstrate LSER’s feasibility and superior performance.

In the future, we will explore new solutions for improving stability, such as better optimizers to help the agent get rid of saddle points, new algorithms to stabilize the training for DDPG, and trust region policy optimization to increase training stability (Schulman et al., 2015). Moreover, more advance reinforcement learning algorithms could be used to replace the DDPG such as soft actor-critic (SAC) (Haarnoja et al., 2018) or Twin Delayed Deep Deterministic (TD3) (Fujimoto et al., 2018).

References

(1)
Andrychowicz et al. (2017) Marcin Andrychowicz, Filip Wolski, Alex Ray, Jonas Schneider, Rachel Fong, Peter Welinder, Bob McGrew, Josh Tobin, OpenAI Pieter Abbeel, and Wojciech Zaremba. 2017. Hindsight experience replay. In Advances in neural information processing systems. 5048–5058.
Bai et al. (2019) Xueying Bai, Jian Guan, and Hongning Wang. 2019. A Model-Based Reinforcement Learning with Adversarial Training for Online Recommendation. In Advances in Neural Information Processing Systems, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), Vol. 32. Curran Associates, Inc., 10735–10746. https://proceedings.neurips.cc/paper/2019/file/e49eb6523da9e1c347bc148ea8ac55d3-Paper.pdf
Chen et al. (2019) Haokun Chen, Xinyi Dai, Han Cai, Weinan Zhang, Xuejian Wang, Ruiming Tang, Yuzhou Zhang, and Yong Yu. 2019. Large-scale interactive recommendation with tree-structured policy gradient. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 3312–3320.
Chen et al. (2018) Shi-Yong Chen, Yang Yu, Qing Da, Jun Tan, Hai-Kuan Huang, and Hai-Hong Tang. 2018. Stabilizing reinforcement learning in dynamic environment with application to online recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1187–1196.
Chen et al. (2020) Xiaocong Chen, Chaoran Huang, Lina Yao, Xianzhi Wang, Wenjie Zhang, et al. 2020. Knowledge-guided deep reinforcement learning for interactive recommendation. In 2020 International Joint Conference on Neural Networks (IJCNN). IEEE, 1–8.
Chen et al. (2021) Xiaocong Chen, Lina Yao, Julian McAuley, Guangling Zhou, and Xianzhi Wang. 2021. A Survey of Deep Reinforcement Learning in Recommender Systems: A Systematic Review and Future Directions. arXiv preprint arXiv:2109.03540 (2021).
De Bruin et al. (2015) Tim De Bruin, Jens Kober, Karl Tuyls, and Robert Babuška. 2015. The importance of experience replay database composition in deep reinforcement learning. In Deep reinforcement learning workshop, NIPS.
Dulac-Arnold et al. (2015) Gabriel Dulac-Arnold, Richard Evans, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, and Ben Coppin. 2015. Deep reinforcement learning in large discrete action spaces. arXiv preprint arXiv:1512.07679 (2015).
Fujimoto et al. (2018) Scott Fujimoto, Herke Hoof, and David Meger. 2018. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning. PMLR, 1587–1596.
Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. 2018. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International conference on machine learning. PMLR, 1861–1870.
Horgan et al. (2018) Dan Horgan, John Quan, David Budden, Gabriel Barth-Maron, Matteo Hessel, Hado van Hasselt, and David Silver. 2018. Distributed Prioritized Experience Replay. In International Conference on Learning Representations.
Hou et al. (2017) Yuenan Hou, Lifeng Liu, Qing Wei, Xudong Xu, and Chunlin Chen. 2017. A novel ddpg method with prioritized experience replay. In 2017 IEEE international conference on systems, man, and cybernetics (SMC). IEEE, 316–321.
Hu et al. (2018) Yujing Hu, Qing Da, Anxiang Zeng, Yang Yu, and Yinghui Xu. 2018. Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 368–377.
Ie et al. (2019) Eugene Ie, Chih wei Hsu, Martin Mladenov, Vihan Jain, Sanmit Narvekar, Jing Wang, Rui Wu, and Craig Boutilier. 2019. RecSim: A Configurable Simulation Platform for Recommender Systems. (2019). arXiv:1909.04847 [cs.LG]
Indyk and Motwani (1998) Piotr Indyk and Rajeev Motwani. 1998. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proceedings of the thirtieth annual ACM symposium on Theory of computing. 604–613.
Isele and Cosgun (2018) David Isele and Akansel Cosgun. 2018. Selective experience replay for lifelong learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
Kitaev et al. (2020) Nikita Kitaev, Lukasz Kaiser, and Anselm Levskaya. 2020. Reformer: The Efficient Transformer. In International Conference on Learning Representations. https://openreview.net/forum?id=rkgNKkHtvB
Konda and Tsitsiklis (2000) Vijay R Konda and John N Tsitsiklis. 2000. Actor-critic algorithms. In Advances in neural information processing systems. 1008–1014.
Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971 (2015).
Liu et al. (2020) Feng Liu, Huifeng Guo, Xutao Li, Ruiming Tang, Yunming Ye, and Xiuqiang He. 2020. End-to-End Deep Reinforcement Learning based Recommendation with Supervised Embedding. In Proceedings of the 13th International Conference on Web Search and Data Mining. 384–392.
Luo and Li (2020) Jieliang Luo and Hui Li. 2020. Dynamic experience replay. In Conference on Robot Learning. PMLR, 1191–1200.
Mnih et al. (2013) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin Riedmiller. 2013. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602 (2013).
Nolan (2003) John Nolan. 2003. Stable distributions: models for heavy-tailed data. Birkhauser New York.
Novati and Koumoutsakos (2019) Guido Novati and Petros Koumoutsakos. 2019. Remember and forget for experience replay. In International Conference on Machine Learning. PMLR, 4851–4860.
Pan et al. (2019) Feiyang Pan, Qingpeng Cai, Pingzhong Tang, Fuzhen Zhuang, and Qing He. 2019. Policy gradients for contextual recommendations. In The World Wide Web Conference. 1421–1431.
Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. In Advances in neural information processing systems. 8026–8037.
Rohde et al. (2018) David Rohde, Stephen Bonner, Travis Dunlop, Flavian Vasile, and Alexandros Karatzoglou. 2018. Recogym: A reinforcement learning environment for the problem of product recommendation in online advertising. arXiv preprint arXiv:1808.00720 (2018).
Schaul et al. (2016) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. 2016. Prioritized experience replay. In International Conference on Learning Representations.
Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. 2015. Trust region policy optimization. In International conference on machine learning. PMLR, 1889–1897.
Shi et al. (2019) Jing-Cheng Shi, Yang Yu, Qing Da, Shi-Yong Chen, and An-Xiang Zeng. 2019. Virtual-taobao: Virtualizing real-world online retail environment for reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33. 4902–4909.
Silver et al. (2016) David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. 2016. Mastering the game of Go with deep neural networks and tree search. nature 529, 7587 (2016), 484–489.
Sun et al. (2020) Peiquan Sun, Wengang Zhou, and Houqiang Li. 2020. Attentive experience replay. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 5900–5907.
Wang et al. (2018) Lu Wang, Wei Zhang, Xiaofeng He, and Hongyuan Zha. 2018. Supervised reinforcement learning with recurrent neural network for dynamic treatment recommendation. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2447–2456.
Zha et al. (2019) Daochen Zha, Kwei-Herng Lai, Kaixiong Zhou, and Xia Hu. 2019. Experience Replay Optimization. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19. International Joint Conferences on Artificial Intelligence Organization, 4243–4249. https://doi.org/10.24963/ijcai.2019/589
Zhang et al. (2019) Shuai Zhang, Lina Yao, Aixin Sun, and Yi Tay. 2019. Deep learning based recommender system: A survey and new perspectives. ACM Computing Surveys (CSUR) 52, 1 (2019), 1–38.
Zhao et al. (2020a) Kangzhi Zhao, Xiting Wang, Yuren Zhang, Li Zhao, Zheng Liu, Chunxiao Xing, and Xing Xie. 2020a. Leveraging Demonstrations for Reinforcement Recommendation Reasoning over Knowledge Graphs. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 239–248.
Zhao et al. (2019) Xiangyu Zhao, Long Xia, Jiliang Tang, and Dawei Yin. 2019. ” Deep reinforcement learning for search, recommendation, and online advertising: a survey” by Xiangyu Zhao, Long Xia, Jiliang Tang, and Dawei Yin with Martin Vesely as coordinator. ACM SIGWEB Newsletter Spring (2019), 1–15.
Zhao et al. (2018) Xiangyu Zhao, Long Xia, Liang Zhang, Zhuoye Ding, Dawei Yin, and Jiliang Tang. 2018. Deep reinforcement learning for page-wise recommendations. In Proceedings of the 12th ACM Conference on Recommender Systems. 95–103.
Zhao et al. (2020b) Xiangyu Zhao, Xudong Zheng, Xiwang Yang, Xiaobing Liu, and Jiliang Tang. 2020b. Jointly learning to recommend and advertise. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3319–3327.
Zheng et al. (2018) Guanjie Zheng, Fuzheng Zhang, Zihan Zheng, Yang Xiang, Nicholas Jing Yuan, Xing Xie, and Zhenhui Li. 2018. DRN: A deep reinforcement learning framework for news recommendation. In Proceedings of the 2018 World Wide Web Conference. 167–176.

Locality-Sensitive Experience Replay for Online Recommendation

Abstract.

1. Introduction

2. Methodology

2.1. Overview

2.2. Locality-sensitive Experience Replay

Definition 1 (Collision probability for LSH in pp-stable distribution).

Lemma 1.

Proof.

Theorem 2.

Proof.

2.3. Store and Sampling Strategy

2.4. Training Procedure

3. Experiments

3.1. Online Simulation Platform Evaluation

3.2. Evaluation Metrics and Baselines

3.3. Results and Evaluation

3.4. Running Time Comparison

3.5. Ablation Study

3.6. Impact of Number of Hyperplanes

3.7. Discussion and Future Extensions

4. Related Work

5. Conclusion

References

Definition 1 (Collision probability for LSH in $p$ -stable distribution).