
Clustered Embedding Learning for Recommender Systems

Yizhou Chen, Guangda Huzhang ({yizhou.chen, guangda.huzhang}@shopee.com), Shopee Pte Ltd., Singapore; Anxiang Zeng ([email protected]), SCSE, Nanyang Technological University, Singapore; Qingtao Yu, Hui Sun, Heng-Yi Li, Jingyi Li ({qingtao.yu, halsey.sun, hengyi.li, lijingyi}@shopee.com), Shopee Pte Ltd., Singapore; Yabo Ni ([email protected]), SCSE, Nanyang Technological University, Singapore; Han Yu ([email protected]), SCSE, Nanyang Technological University, Singapore; and Zhiming Zhou ([email protected]), Shanghai University of Finance and Economics, Shanghai, China
(2023)
Abstract.

In recent years, recommender systems have advanced rapidly, where embedding learning for users and items plays a critical role. A standard method learns a unique embedding vector for each user and item. However, such a method has two important limitations in real-world applications: 1) it is hard to learn embeddings that generalize well for users and items with rare interactions; and 2) it may incur unbearably high memory costs when the number of users and items scales up. Existing approaches either address only one of these limitations or have flawed overall performance. In this paper, we propose Clustered Embedding Learning (CEL) as an integrated solution to these two problems. CEL is a plug-and-play embedding learning framework that can be combined with any differentiable feature interaction model. It is capable of achieving improved performance, especially for cold users and items, with reduced memory cost. CEL enables automatic and dynamic clustering of users and items in a top-down fashion, where clustered entities jointly learn a shared embedding. The accelerated version of CEL has an optimal time complexity, which supports efficient online updates. Theoretically, we prove the identifiability and the existence of a unique optimal number of clusters for CEL in the context of nonnegative matrix factorization. Empirically, we validate the effectiveness of CEL on three public datasets and one business dataset, showing its consistently superior performance against current state-of-the-art methods. In particular, when incorporating CEL into the business model, it brings an improvement of +0.6% in AUC, which translates into a significant revenue gain; meanwhile, the size of the embedding table gets 2650 times smaller. The code is available at: https://doi.org/10.5281/zenodo.7620448.

Embedding learning, clustering, recommender system, large-scale application, sparse data
journalyear: 2023; copyright: acmlicensed; conference: Proceedings of the ACM Web Conference 2023, May 1–5, 2023, Austin, TX, USA; booktitle: Proceedings of the ACM Web Conference 2023 (WWW ’23), May 1–5, 2023, Austin, TX, USA; price: 15.00; doi: 10.1145/3543507.3583362; isbn: 978-1-4503-9416-1/23/04; ccs: Computing methodologies → Learning latent representations; Information systems → Recommender systems; Applied computing → Electronic commerce

1. Introduction

Recommender systems, which support many real-world applications, have become increasingly accurate in recent years. Important improvements have been brought about by the use of deep learning (Cheng et al., 2016; Guo et al., 2017) and the exploitation of user historical data (Zhou et al., 2018, 2019). In these systems, embedding learning (i.e., learning vector representations) for users and items plays a critical role.

The standard method, full embedding, learns a unique vector representation for each user and item. However, this approach faces important limitations in practice. Firstly, for cold users and items with insufficient historical interactions (which commonly exist in practice), it is hard to learn embeddings that generalize well from their own data alone. Secondly, web-scale recommender systems usually involve an enormous number of users and items, and the memory cost incurred by full embedding can be prohibitively high: 1 billion users, each assigned a 64-dimensional embedding in 32-bit floating point, require 238 GB of memory.
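As a quick check of this figure (assuming 4 bytes per 32-bit float and 1 GB taken as $2^{30}$ bytes):

\[10^{9}\times 64\times 4\,\text{bytes}=2.56\times 10^{11}\,\text{bytes}\approx 238\ \text{GB}.\]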

To mitigate the problem of cold users and items, a common idea is to impose regularization, e.g., forcing all embeddings to follow a given clustering structure (Almutairi et al., 2021; Yang et al., 2016). However, using a fixed predefined clustering structure lacks flexibility; the given clustering structure can hardly be optimal. Besides, existing methods (Almutairi et al., 2021; Yang et al., 2016) require training in full embedding, still suffering from high memory costs.

To mitigate the problem of high memory cost, hashing (Weinberger et al., 2009; Yan et al., 2021a; Tan et al., 2020; Shi et al., 2020) is typically adopted to map entities to a fixed-size embedding table. This directly tackles the memory cost. However, relying on a hashing that is essentially random (Weinberger et al., 2009; Yan et al., 2021a) leads to unsatisfactory performance, and current learning-to-hash methods (Tan et al., 2020; Shi et al., 2020) still leave considerable room for improvement. Besides, pretraining a hash function (Tan et al., 2020) or binding each hash bucket to a warm user (Shi et al., 2020) also makes these methods lack flexibility.

In pursuit of a more flexible and integrated solution to these two problems that also improves performance, we propose Clustered Embedding Learning (CEL). CEL achieves dynamic and automatic clustering of users and items, which gives it the potential for superior performance. In CEL, entities within the same cluster may use and jointly learn a shared embedding; thus it can significantly reduce the memory cost.

CEL performs clustering in a top-down (i.e., divisive) fashion, starting with a single cluster that includes all users (or items) and recursively splitting a cluster into two. Given that we want to share information among similar entities, such divisive clustering is a natural choice: sharing information of all entities at the beginning, then seeking to refine the clustering when possible.

Though a natural idea, implementing divisive clustering for embedding learning is challenging. With shared embeddings, there appears to be no criterion for differentiating entities within the same cluster. To tackle this problem, we propose to refer to their gradients with respect to the shared embedding, since differences in the gradients indicate the different directions in which each entity would like the embedding to be optimized. Naturally, we then propose to split the cluster along the first principal component of the gradients.

After cluster split and associated embedding optimizations, the current cluster assignment might have room for improvement, e.g., the embedding of some other cluster might better fit the interactions of an entity. Thus, fine-grained automatic cluster optimization also requires enabling cluster reassignment. Given that the training loss appears to be the only referable metric, we propose to conduct cluster reassignment based on direct loss minimization.

With such a cluster split and reassignment paradigm, CEL is a sustainable framework: it can start from zero data, and it automatically increases the number of clusters and optimizes the cluster assignments as the interaction data and the number of users/items grow.

Figure 1. The standard behavior prediction paradigm in recommender systems.

We validate the proposed CEL both theoretically and empirically. Firstly, we prove the identifiability and the existence of a unique optimal number of clusters for CEL in the context of nonnegative matrix factorization. Then, we show that the accelerated version of CEL (CEL-Lite) has an optimal time complexity, which allows efficient online updates. Finally, we conduct extensive experiments on public real-world datasets as well as a private business dataset, integrated with various feature interaction models (including NMF, DIN, and DeepFM); the results show that CEL consistently outperforms current state-of-the-art methods. In particular, when incorporating CEL into the business model, it brings an improvement of +0.6% in AUC, which translates into a significant revenue gain; meanwhile, the size of the embedding table gets 2650 times smaller.

2. Related Works

One line of research regularizes the embeddings to follow a given clustering structure to improve their quality. For example, Joint NMF and K-Means (JNKM) (Yang et al., 2016) forces the embeddings to follow a K-clustering structure; eTree (Almutairi et al., 2021) learns the embeddings under an implicit tree (i.e., hierarchical clustering). Prior works in this line also include HSR (Wang et al., 2018), HieVH (Sun et al., 2017), etc. The major distinction of our work is that CEL does not predefine the clustering structure. Instead, it automatically optimizes the number of clusters and the cluster assignment. Besides, these methods require training with a full embedding table, while CEL is trained with a reduced one.

Another line of work uses hashing to share and jointly learn the embeddings. The plain version applies the modulo operation (Weinberger et al., 2009) on the hash ID of each entity to obtain its hash code (i.e., the embedding index). By sharing an embedding within each hash bucket, the embedding size can be reduced by a factor equal to the modulo divisor. Binary Hash (BH) (Yan et al., 2021a; Zhang et al., 2016) adopts a code block strategy on binarized hash IDs to reduce direct collisions. Adaptive Embedding (AE) (DeepRec, 2021) explicitly avoids collisions for warm users who have sufficient historical data. However, the hash assignments in these methods are essentially random and do not adapt during learning. Several works have tried to learn a hash function. For example, HashGNN (Tan et al., 2020) learns a binary-code hash function with graph neural networks, but it requires pretraining the hash function. Preference Hash (PreHash) (Shi et al., 2020) learns to hash users with similar preferences to the same bucket. PreHash binds a warm user to each bucket, while the embedding of every other user is a weighted sum of the embeddings of its top-$K$ related buckets. By contrast, CEL does not force bindings between users and clusters.

Some other works seek to reduce memory cost by reducing the embedding dimension or the number of bits per dimension (i.e., quantization) (Ginart et al., 2021; Liu et al., 2020; Yan et al., 2021b; Zhang et al., 2016; Yang et al., 2019), which are orthogonal to our work from the perspective of reducing the memory cost. The type of clustering algorithm most relevant to CEL is Divisive Hierarchical Clustering (DHC) (Roux, 2015). In particular, Principal Direction Divisive Partitioning (PDDP) (Boley, 1998) splits clusters via principal coordinate analysis (Gower, 2014), which is remotely related to our proposed GPCA (Section 3.2.2).


Figure 2. (a) An overview of CEL: it alternately performs embedding optimization and cluster optimization. (b) An illustration of GPCA: it splits clusters along the first principal component of item gradients.

3. The Proposed Framework

In this section, we present Clustered Embedding Learning (CEL) as a plug-and-play embedding learning framework that can be integrated with any differentiable feature interaction model, such as Nonnegative Matrix Factorization (NMF), NeuMF (He et al., 2017), Wide&Deep (Cheng et al., 2016), DeepFM (Guo et al., 2017), MMoE (Ma et al., 2018), DIN (Zhou et al., 2018), and DIEN (Zhou et al., 2019). Figure 1 illustrates the role of embedding learning in the behavior prediction commonly used in recommender systems.

Assume we have a user-item interaction data matrix $\mathbf{X}\in\mathbb{R}^{N\times M}$ of $N$ users and $M$ items, where $\mathbf{X}(i,j)$ denotes the interaction (rating, click, etc.) of the $i^{th}$ user on the $j^{th}$ item. We aim to recover the data matrix with two low-rank matrices, i.e., $\mathbf{X}\approx Y(\mathbf{A},\mathbf{B})$, where $\mathbf{A}\in\mathbb{R}^{N\times R}$ and $\mathbf{B}\in\mathbb{R}^{M\times R}$ can be regarded as the user embedding matrix and the item embedding matrix, respectively. Each row of $\mathbf{A}$ (or $\mathbf{B}$) corresponds to the embedding of a user (or item). $R$ is the embedding dimension. The prediction $Y(\mathbf{A},\mathbf{B})=[y_{ij}]=[y(\mathbf{A}_{i},\mathbf{B}_{j})]\in\mathbb{R}^{N\times M}$ is produced by a differentiable feature interaction model $y$ that takes the embedding of user $i$ and the embedding of item $j$ as inputs. Generally, embedding learning can be formulated as an optimization problem:

(1) \[\operatorname*{arg\,min}_{\mathbf{A},\mathbf{B}}\;\mathcal{L},\quad\mathcal{L}=\big\|\mathbf{W}\odot(\mathbf{X}-Y(\mathbf{A},\mathbf{B}))\big\|_{F}^{2},\]

where $\mathbf{W}\in\{0,1\}^{N\times M}$ contains 1s at the indices of observed entries in $\mathbf{X}$, and 0s otherwise. $\|\cdot\|_{F}$ denotes the Frobenius norm and $\odot$ denotes element-wise multiplication.
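For concreteness, the sketch below gives a minimal NumPy implementation of the masked objective in Eq. (1); the function name masked_loss and the plain matrix-factorization interaction model are illustrative placeholders, not part of the CEL framework itself.

```python
import numpy as np

def masked_loss(X, W, A, B, interaction_model):
    """Squared error of Eq. (1), restricted to the observed entries.

    X: (N, M) interaction matrix; W: (N, M) 0/1 mask of observed entries;
    A: (N, R) user embeddings; B: (M, R) item embeddings;
    interaction_model: maps (A, B) to an (N, M) prediction matrix.
    """
    Y = interaction_model(A, B)
    return np.sum((W * (X - Y)) ** 2)

def mf(A, B):
    # Simplest interaction model: inner product (as in plain matrix factorization).
    return A @ B.T

rng = np.random.default_rng(0)
N, M, R = 6, 8, 4
A, B = rng.random((N, R)), rng.random((M, R))
X = mf(A, B)                                  # synthetic, exactly recoverable data
W = (rng.random((N, M)) < 0.3).astype(float)  # ~30% of entries observed
print(masked_loss(X, W, A, B, mf))            # 0.0 for this synthetic example
```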

CEL clustering can be similarly applied to users and items. For clarity, we will only discuss the clustering of items. To initiate the clustering of items, we decompose the item embedding matrix as a product following a clustering structure:

(2) \[\mathbf{B}=\mathbf{S}_{q}\mathbf{B}_{q},\quad q=0,1,2,\ldots,\]

where $\mathbf{S}_{q}\in\{0,1\}^{M\times M_{q}}$ and $\mathbf{B}_{q}\in\mathbb{R}^{M_{q}\times R}$ denote the cluster assignment matrix and the cluster embedding matrix of items, respectively. $M_{q}$ denotes the current number of clusters and $q$ denotes the current number of splits. We adopt hard cluster assignment: each row of $\mathbf{S}_{q}$ has exactly one 1, while the other entries are zero.
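A minimal sketch of this hard assignment in NumPy: with an index vector assign (item to cluster id, an illustrative name), the effective item embedding table $\mathbf{S}_{q}\mathbf{B}_{q}$ is simply a row gather of the cluster embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)
M, M_q, R = 8, 3, 4
assign = rng.integers(0, M_q, size=M)   # hard cluster id of each item
B_q = rng.random((M_q, R))              # cluster embedding matrix

# Dense form: S_q is a one-hot assignment matrix and B = S_q @ B_q (Eq. (2)).
S_q = np.eye(M_q)[assign]
B_dense = S_q @ B_q

# Equivalent, and how it is done in practice: a row gather.
B_gather = B_q[assign]
assert np.allclose(B_dense, B_gather)
```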

CEL solves Eq. (1) by alternately performing embedding optimization and cluster optimization (Figure 2).

3.1. Embedding Optimization

In embedding optimization, we optimize $\mathbf{A}$ and $\mathbf{B}_{q}$ subject to a fixed $\mathbf{S}_{q}$. The optimization of $\mathbf{A}$ can be directly conducted with Eqs. (1) and (2). Since hard cluster assignment is adopted, the embedding of each cluster can be optimized separately.

The loss of the $k^{th}$ cluster embedding of $\mathbf{B}_{q}$ can be written as:

(3) \[\mathcal{L}_{k}=\big\|\mathbf{W}\odot\big[\mathbf{X}\cdot\mathrm{diag}(\mathbf{S}_{q}(:,k))-Y(\mathbf{A},\mathbf{S}_{q}(:,k)\mathbf{B}_{q}(k,:))\big]\big\|_{F}^{2},\]

where the colon $(:)$ stands for the slicing operator that takes all values of the respective dimension, and $\mathrm{diag}(\cdot)$ maps a vector into a diagonal matrix. It holds that $\mathcal{L}=\sum_{k=1}^{M_{q}}\mathcal{L}_{k}$.

3.2. Cluster Optimization

In this section, we introduce the two proposed cluster optimization operations: cluster reassignment and cluster split. We assume that a reassignment occurs every $t_{1}$ steps of embedding optimization, while a split occurs every $t_{2}$ steps of embedding optimization.

3.2.1. Cluster Reassignment

An item assigned to a sub-optimal cluster can result in performance loss. When the embeddings have been optimized, we may perform cluster reassignment so that an item can be assigned to a better cluster. Reassignment is an important step toward automatically grouping similar items.

One problem CEL faces is how to measure the distance between an item and a cluster, since their representations live in different domains: items have interaction data, while clusters have cluster embeddings. A natural metric is how well the cluster embedding fits the item's interactions. Thus, we seek to directly minimize the loss of each item w.r.t. its cluster assignment:

(4) \[\mathbf{S}_{q}=\operatorname*{arg\,min}_{\mathbf{S}_{q}}\;\big\|\mathbf{W}\odot(\mathbf{X}-Y(\mathbf{A},\mathbf{S}_{q}\mathbf{B}_{q}))\big\|_{F}^{2}.\]

With fixed embeddings, the reassignment of each item is independent; therefore, Eq. (4) can equivalently be formulated as a set of sub-problems, one for each item:

\[\mathbf{S}_{q}(j,:)=\operatorname*{arg\,min}_{\mathbf{S}_{q}(j,:)}\;\big\|\mathbf{W}(:,j)\odot(\mathbf{X}(:,j)-Y(\mathbf{A},\mathbf{S}_{q}(j,:)\mathbf{B}_{q}))\big\|_{F}^{2}.\]
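A minimal sketch of this per-item reassignment, assuming for readability the NMF interaction $y(\mathbf{A}_{i},\mathbf{B}_{j})=\mathbf{A}_{i}^{\top}\mathbf{B}_{j}$ (with a general interaction model, the per-cluster predictions would come from the model instead); the function and variable names are illustrative.

```python
import numpy as np

def reassign_items(X, W, A, B_q, assign):
    """Each item moves to the cluster whose embedding minimizes its masked
    squared error (Eq. (4)), with all embeddings held fixed."""
    preds = A @ B_q.T                          # (N, M_q): every user's prediction under every cluster
    for j in range(X.shape[1]):
        obs = W[:, j].astype(bool)             # users who interacted with item j
        if not obs.any():
            continue                           # item with no observed data: keep its current cluster
        err = ((X[obs, j][:, None] - preds[obs, :]) ** 2).sum(axis=0)  # loss per candidate cluster
        assign[j] = int(np.argmin(err))
    return assign
```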

3.2.2. Cluster Split with GPCA

With the embeddings optimized, we may consider a cluster split to achieve a finer clustering. Each time, we select one cluster and split it into two. There are many heuristic criteria for choosing the cluster. For example, we may choose the cluster that has the largest total loss, mean loss, number of associated items, or number of associated interactions ($\|\mathbf{W}\mathbf{S}_{q}(:,k)\|_{0}$). According to our experiments, they generally lead to similarly good results, which demonstrates the robustness of CEL with respect to the cluster-choosing criterion. In this paper, we adopt the number of associated interactions as the default criterion.

The key challenge for CEL is how to split a cluster. With hard cluster assignment and embedding sharing, the embeddings within a cluster are identical. Thus we need a representation that can differentiate items in the same cluster. To tackle this problem, we propose Gradient Principal Component Analysis (GPCA), which splits a cluster according to the items' gradients w.r.t. their shared embedding; these gradients indicate the different directions in which each item would like the shared embedding to be optimized. Such a split strategy helps achieve a finer-grained clustering of similar items while further minimizing the loss.

Specifically, we assemble the gradients of the items of a given cluster $k$ (the cluster to be split) into a matrix $\mathbf{G}=[\ldots,g_{j},\ldots]^{\top}$ for all $j\in\{j\,|\,\mathbf{S}_{q}(j,k)=1\}$, where $g_{j}=-\partial\mathcal{L}/\partial\mathbf{B}(j,:)$. Then, we normalize $\mathbf{G}$ (as a standard procedure before performing PCA, without changing its notation) and compute its first principal component:

(5) \[\mathbf{p}=\operatorname*{arg\,max}_{\|\mathbf{p}\|_{2}=1}\;\mathbf{p}^{\top}\mathbf{G}^{\top}\mathbf{G}\mathbf{p}.\]

After that, we split the cluster into two according to the items' first principal component scores $g_{j}^{\top}\mathbf{p}$:

(6) \[\begin{cases}\mathbf{S}_{q+1}(j,k)=1, & \text{if }g_{j}^{\top}\mathbf{p}<\delta;\\ \mathbf{S}_{q+1}(j,M_{q}+1)=1, & \text{if }g_{j}^{\top}\mathbf{p}\geq\delta.\end{cases}\]

One set of items remains in cluster $k$, while the other set is assigned to a new cluster $M_{q}+1$. We may adjust the threshold $\delta$ to control/balance the number of items in each cluster.
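The sketch below outlines one GPCA split under the NMF interaction, where the per-item gradients have a closed form; with other interaction models the rows of $\mathbf{G}$ would instead be obtained from the framework's automatic differentiation. Names such as gpca_split are illustrative.

```python
import numpy as np

def gpca_split(X, W, A, B_q, assign, k, delta=0.0):
    """Split cluster k along the first principal component of its item gradients."""
    items = np.where(assign == k)[0]
    # Negative gradient of the masked squared loss w.r.t. each item's (shared) embedding
    # under y = A @ B.T:  g_j = 2 * A^T (W_j * (X_j - A b_k)).
    resid = W[:, items] * (X[:, items] - (A @ B_q[k])[:, None])   # (N, |items|)
    G = 2.0 * (A.T @ resid).T                                     # (|items|, R)

    G = (G - G.mean(axis=0)) / (G.std(axis=0) + 1e-8)             # standardize before PCA
    _, _, Vt = np.linalg.svd(G, full_matrices=False)
    scores = G @ Vt[0]                                            # first principal component scores (Eq. (5))

    new_k = B_q.shape[0]
    assign[items[scores >= delta]] = new_k                        # Eq. (6): high-score items form a new cluster
    B_q = np.vstack([B_q, B_q[k]])                                # the new cluster inherits the old embedding
    return B_q, assign
```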

We present the pseudocode of CEL in Algorithm 1.

3.3. Theoretical Analysis of CEL under NMF

CEL can be integrated with any differentiable feature interaction model; however, most such models (e.g., those involving other features or neural networks) are hard to analyze. Even so, we are able to study the theoretical properties of CEL when it is integrated with Nonnegative Matrix Factorization (NMF). These analyses demonstrate, to some extent, the validity of CEL.

In NMF, $Y(\mathbf{A},\mathbf{B})=\mathbf{A}\mathbf{B}^{\top}$ with $\mathbf{A},\mathbf{B}\geq 0$, which factorizes the data matrix $\mathbf{X}$ into the product of two low-rank nonnegative matrices. Although low-rank matrix factorization generally has no unique solution, nonnegativity helps establish the identifiability of NMF, which can thus produce essentially unique low-dimensional representations (Lee and Seung, 1999). The NMF $\mathbf{X}=\mathbf{A}\mathbf{B}^{\top}$ is regarded as essentially unique if the nonnegative matrices $\mathbf{A}$ and $\mathbf{B}$ are identifiable up to a permutation and scaling. We show that the solution of CEL is identifiable:

Theorem 1.

(Identifiability of CEL) Assume that the data matrix $\mathbf{X}$ follows $\mathbf{X}=\mathbf{A}\mathbf{B}_{q}^{\top}\mathbf{S}_{q}^{\top}$, where $\mathbf{A}\in\mathbb{R}^{N\times R}$, $\mathbf{B}_{q}\in\mathbb{R}^{M_{q}\times R}$, $\mathbf{S}_{q}\in\{0,1\}^{M\times M_{q}}$, $\|\mathbf{S}_{q}(j,:)\|_{0}=1$ for all $j\in[1,M]$, $\mathrm{rank}(\mathbf{X})=\mathrm{rank}(\mathbf{A})=R$, $\mathbf{B}_{q}$ has no repeated rows, and, without loss of generality, $M_{q}\geq R$. If $\mathbf{S}_{q}$ is of full column rank and the rows of $\mathbf{B}_{q}$ are sufficiently scattered, then $\mathbf{A}$, $\mathbf{B}_{q}$, and $\mathbf{S}_{q}$ are essentially unique.

The formal definitions (e.g., sufficiently scattered) (Almutairi et al., 2021; Yang et al., 2016; Fu et al., 2018) and detailed proofs are provided in Appendix B.

We assume $\mathbf{S}_{q}$ to be of full column rank, which means there is no empty cluster. Enforcing this condition in our implementation improves model performance. The assumption that $\mathbf{B}_{q}$ is sufficiently scattered is common and is likely to be satisfied, as shown by (Almutairi et al., 2021). Based on Theorem 1, we can further deduce the existence of a unique optimal number of clusters:

Proposition 2.

(Optimal Number of Clusters of CEL) Under the assumptions of Theorem 1, further assume that the data matrix can also be decomposed as $\mathbf{X}=\mathbf{A}^{\prime}\mathbf{B}_{q+1}^{\top}\mathbf{S}_{q+1}^{\top}$, where $\mathbf{A}^{\prime}\in\mathbb{R}^{N\times R}$, $\mathbf{B}_{q+1}\in\mathbb{R}^{M_{q+1}\times R}$, $\mathbf{S}_{q+1}\in\{0,1\}^{M\times M_{q+1}}$, $\|\mathbf{S}_{q+1}(j,:)\|_{0}=1$ for all $j\in[1,M]$, $\mathrm{rank}(\mathbf{A}^{\prime})=R$, and $M_{q+1}>M_{q}\geq R$. If $\mathbf{S}_{q+1}$ is of full column rank and the rows of $\mathbf{B}_{q+1}$ are sufficiently scattered, then $\mathbf{B}_{q+1}$ has only $M_{q}$ distinct rows, which can be obtained via permutation and scaling of the rows of $\mathbf{B}_{q}$.

The proof builds on the fact that the identifiability of CEL implies that $\mathbf{S}_{q+1}\mathbf{B}_{q+1}$ is a permutation and scaling of $\mathbf{S}_{q}\mathbf{B}_{q}$. The implication of Proposition 2 is twofold: 1) there exists a unique optimal number of clusters (a formal proof is provided in Appendix B); 2) once the optimal number of clusters is reached, further splits are unnecessary and will only introduce duplicated cluster embeddings.

Based on this observation, and given that we do not have an exact formula for the optimal number, we keep splitting while always initializing the embedding of a newly split cluster to be the same as that of the chosen cluster (which retains the optimality of the previous solution). Besides, in online learning settings where we continually receive streaming data, the optimal number of clusters may keep increasing. In this sense, cluster splitting is well motivated and makes CEL a sustainable learning framework.

3.4. CEL-Lite for Efficient Online Update

3.4.1. Online Learning.

Modern recommender systems generally deal with a massive amount of data. Such systems are thus usually updated in an online fashion: data arrive in sequence over time, and updates take place when the accumulated data reaches a predefined quantity. This framework closely follows the online learning paradigm and allows the model to rapidly adapt to the newest data.

We adapt CEL into CEL-Lite for the online learning setting by optimizing its efficiency in dealing with batch data. CEL-Lite has a theoretically optimal time complexity and inherits most of the advantages of CEL. CEL-Lite is also suitable when we have access to the full data but want to train the model in an online fashion.

3.4.2. Data and Embedding Optimization.

Under the online setting, each new batch $\mathbf{X}_{b}$ consists of a set of user-item interactions $\{\mathbf{X}(i,j)\}$. To enable the use of interaction history and mitigate catastrophic forgetting (Kirkpatrick et al., 2017), we buffer, from the online batches, a small number of the latest interactions ($\leq n$) of each user and item for experience replay (Rolnick et al., 2019), which is commonly used in online systems. Since each interaction is represented by a triplet of $\langle$user ID, item ID, rating$\rangle$, the storage cost incurred by buffering is acceptable. Moreover, the interactions can be stored in distributed file systems, unlike the embeddings or model parameters, which must be stored in memory for efficient updates.

In CEL-Lite, one step of embedding optimization is performed for each batch via stochastic gradient descent on the related buffered interactions. The overall time complexity of embedding optimization is then at most $\mathcal{O}(nDR)$, where $D=\|\mathbf{W}\|_{0}$ denotes the total number of interactions.
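A minimal sketch of such a replay buffer (the class and method names are illustrative): only the latest $n$ interactions per user and per item are retained, so the buffered interactions related to one batch number at most $O(nb)$.

```python
from collections import defaultdict, deque

class ReplayBuffer:
    """Keeps the latest n <user, item, rating> triplets per user and per item."""

    def __init__(self, n=20):
        self.user_hist = defaultdict(lambda: deque(maxlen=n))
        self.item_hist = defaultdict(lambda: deque(maxlen=n))

    def add_batch(self, batch):
        for user, item, rating in batch:
            self.user_hist[user].append((item, rating))
            self.item_hist[item].append((user, rating))

    def related(self, batch):
        """Buffered interactions of all users and items touched by this batch
        (duplicates are possible and harmless for an SGD step)."""
        users = {u for u, _, _ in batch}
        items = {i for _, i, _ in batch}
        return ([(u, i, r) for u in users for i, r in self.user_hist[u]]
                + [(u, i, r) for i in items for u, r in self.item_hist[i]])
```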

3.4.3. Efficient Cluster Reassignment.

To achieve fast cluster reassignment, we incorporate the following accelerations into CEL-Lite: 1) only $m$ additional randomly sampled clusters are considered as candidates for each reassignment; 2) only items involved in the current batch are considered for reassignment.

With these two strategies, the cluster assignments are only improved locally. Nevertheless, with sufficient updates, they may still converge to the global optimum. According to our experiments, these accelerations significantly improve efficiency while incurring only a small drop in performance. The time complexity of each reassignment is now only $\mathcal{O}(nmbR)$, where $b$ is the batch size. Given that there are $D/b$ batches, there are at most $D/b$ cluster reassignments. Thus, the overall time complexity of reassignment is $\mathcal{O}(nmDR)$.
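A minimal sketch of the candidate sampling in acceleration 1 (the function name is illustrative): each item evaluates its current cluster plus $m$ randomly sampled ones, rather than all $M_{q}$ clusters.

```python
import numpy as np

def sampled_candidates(current_k, num_clusters, m=10, rng=np.random.default_rng()):
    """Candidate clusters for one reassignment: the current cluster plus m random others."""
    others = rng.choice(num_clusters, size=min(m, num_clusters), replace=False)
    return np.unique(np.append(others, current_k))
```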

3.4.4. Efficient Cluster Split.

To control the overall complexity of cluster split for CEL-Lite, we propose to constrain the number of associated interactions in each cluster and ensure clusters are approximately balanced. These two conditions limit the maximal number of split operations and the time complexity of each split. Concretely, we set up the following split strategies:

Strategy 1.

If an item has more than $d$ interactions, it directly splits off and forms a cluster by itself.

Strategy 2.

A cluster splits if and only if it has more than $2d$ associated interactions ($\|\mathbf{W}\mathbf{S}_{q}(:,k)\|_{0}$). The split should be balanced such that the difference between the numbers of associated interactions of the resulting clusters is within $d$.

Strategy 1 handles extremely warm items. After executing Strategy 1, each clustered item has no more than $d$ interactions; Strategy 2 can then easily balance the splits, which is done by tuning the threshold $\delta$ in Eq. (6). (The tuning of $\delta$ can be done by Quickselect in $\mathcal{O}(\|\mathbf{S}_{q}(:,k)\|_{0})$.)
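The sketch below shows one way such a balancing threshold can be chosen; it uses a full sort of the GPCA scores for brevity, whereas a Quickselect-style selection keeps the cost linear as noted above. Since every item carries at most $d$ interactions after Strategy 1, cutting at the prefix whose cumulative count is closest to half bounds the imbalance by $d$.

```python
import numpy as np

def balanced_threshold(scores, counts):
    """Pick delta in Eq. (6) so the two resulting clusters have interaction
    counts differing by at most d (= the largest single-item count).
    scores: GPCA score per item; counts: interaction count per item (array)."""
    order = np.argsort(scores)                        # full sort for brevity; Quickselect is linear
    cum = np.cumsum(counts[order])
    cut = int(np.argmin(np.abs(cum - cum[-1] / 2)))   # prefix whose mass is closest to half
    if cut + 1 < len(order):
        return 0.5 * (scores[order[cut]] + scores[order[cut + 1]])
    return scores[order[cut]]                         # degenerate case: nearly all mass in one item
```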

With Strategies 1 and 2, every cluster has at least $d/2$ interactions. Therefore, there are at most $2D/d$ clusters and hence at most $2D/d$ GPCA splits in total. Calculating the gradient of each item with only its $n$ buffered interactions, the time complexity of a single GPCA split is approximately $\mathcal{O}(n\|\mathbf{S}_{q}(:,k)\|_{0}R)$, which is no more than $\mathcal{O}(2ndR)$. The overall time complexity of splits is thus $\mathcal{O}(4nDR)$.

Note that we use a constant $d$ in Strategies 1 and 2 for simplicity and ease of use in practice. It can in fact be generalized to any monotonically nondecreasing sublinear function of $D$ without altering the overall time complexity. These strategies allow the maximum number of clusters to grow with $D$, which is consistent with our theoretical results in Section 3.3.

3.4.5. The Optimal Time Complexity

Regarding $n$ and $m$ as constants, each of the three optimizations of CEL-Lite has a time complexity of $\mathcal{O}(DR)$, so the overall complexity of CEL-Lite is also $\mathcal{O}(DR)$. Note that merely processing all the data, with each interaction invoking one update of an $R$-dimensional embedding, already takes $\mathcal{O}(DR)$ time. In this sense, $\mathcal{O}(DR)$ is a lower bound for embedding learning.

4. Experiments

Table 1. Datasets summary.
Dataset Users Items No. interactions
MovieLens-1m 6,040 3,952 1,000,209
Video Games 826,767 50,210 1,324,753
Electronics 4,201,696 476,002 7,824,482
One-day Sales ≈13m ≈40m ≈700m

In this section, we empirically analyze CEL and compare it with related embedding learning baselines, including: Vanilla full embedding; eTree (Almutairi et al., 2021); JNKM (Yang et al., 2016); Modulo (Weinberger et al., 2009) (i.e., with plain hash); BH (Yan et al., 2021a); AE (DeepRec, 2021); HashGNN (Tan et al., 2020); and PreHash (Shi et al., 2020). They are tested with a variety of feature interaction models, including: NMF, DeepFM (Guo et al., 2017), DIN (Zhou et al., 2018), and a business model from an E-commerce platform.

The datasets considered include: MovieLens, a review dataset with user ratings (1–5) on movies; Video Games and Electronics, review datasets with user ratings (1–5) of products from Amazon; and One-day Sales, sub-sampled real service interaction data from one day on the E-commerce platform Shopee. The key statistics of these datasets are given in Table 1.

The original eTree and JNKM do not share embeddings within clusters. To make the comparison thorough, we adapt eTree such that, in the test phase, leaf nodes (users/items) share the embedding of their parent nodes. Similarly, we adapt JNKM such that clustered items share the embedding of their cluster in the test phase. PreHash has a hyperparameter $K$ that selects the top-$K$ related clusters for each user or item. We report its results with different values of $K$.

We reuse $M_{q}$ to also denote the number of parent nodes in eTree, the number of clusters in JNKM, and the number of hash buckets in hash-based methods. We refer to $M_{q}/M$ as the compression ratio.

Table 2. The test Mean Square Error (MSE) of rating predictions on the MovieLens-1m dataset (averaged over 5-fold cross validation). A compression ratio (defined as $M_{q}/M$) of 100% means the result is obtained under full embedding, while other compression ratios indicate that the results are obtained with clustered/hashed/shared embeddings. Results marked with $^{p}$ are obtained with personalization (Section 4.2.1).
Feature Interaction Model NMF DIN
Compression Ratio 100% 5% 1% 0.5% 100% 5% 1% 0.5%
LTC LMC SIT
Vanilla full embedding 0.8357 - - - 0.7903 - - -
eTree 0.7841 0.8399 0.8482 0.8614 0.7640 0.8306 0.8379 0.8470
JNKM 0.7626 0.8903 0.8616 0.8498 0.7486 0.8868 0.8560 0.8466
Modulo - 1.0227 1.0688 1.0732 - 1.0206 1.0647 1.0692
BH - 0.9753 1.0170 1.0526 - 0.9907 1.0040 1.0495
AE - 0.9959 1.0343 1.0630 - 0.9895 1.0280 1.0624
HashGNN - 0.8491 0.8829 0.8882 - 0.8467 0.8663 0.8723
PreHash ($K=4$) - 0.8169 0.8832 0.8987 - 0.8057 0.8678 0.8897
PreHash ($K=M_{q}$) - 0.7958 0.8047 0.8101 - 0.7871 0.7893 0.7942
CEL (Ours) 0.7507$^{p}$ 0.7784 0.7858 0.7926 0.7390$^{p}$ 0.7718 0.7787 0.7868
CEL-Lite (Ours) 0.7519$^{p}$ 0.7820 0.7901 0.8014 0.7421$^{p}$ 0.7766 0.7789 0.7906

4.1. Performance Versus Baselines

4.1.1. Rating Prediction

In this experiment, we perform rating prediction on the MovieLens datasets. We adopt NMF and DIN (both with $R=64$) as our feature interaction models. We set $t_{1}=40$ and $t_{2}=10$ for CEL, and $b=2000$, $n=20$, $m=10$, $t_{1}=1$ for CEL-Lite. Since some baselines apply their methods only to items or only to users, to ensure a fair comparison, in this experiment we only compress (cluster, hash, etc.) the item embeddings while keeping full user embeddings for all methods.

The results are shown in Table 2. It can be observed that CEL outperforms existing baselines under all settings. According to our results, by training with a predefined clustering structure, eTree and JNKM bring clear improvements over the vanilla method in the full embedding setting. However, they consistently underperform CEL, especially when using a compressed embedding table. For hash-based methods, the improvements brought by BH and AE (over naive Modulo) are relatively limited, while HashGNN and PreHash achieve clearer performance gains. Among them, PreHash with $K=M_{q}$ achieves the best performance in the compressed settings. Nevertheless, CEL clearly outperforms PreHash. Meanwhile, the results of this experiment demonstrate that CEL-Lite, while much more computationally efficient than CEL, performs similarly to CEL and still clearly outperforms the other baselines. Its slight degradation in performance relative to CEL might be attributed to its subsampling accelerations.

In addition, we check in Table 2 three important properties for each method. Low Time Consumption (LTC) indicates whether the method achieves a time complexity of $\mathcal{O}(DR)$ for both training and testing. In particular, algorithms whose time complexity scales linearly with $M_{q}$ are not regarded as LTC; thus eTree, JNKM, PreHash (with $K=M_{q}$), and CEL are not LTC. Neither is HashGNN, as it requires running GNN aggregation over neighborhoods, which has a worst-case time complexity exponential in $D$. Low Memory Consumption (LMC) indicates whether the method can significantly reduce the memory cost for both training and testing compared with full embedding. JNKM, eTree, and HashGNN require a full embedding table during training and thus do not satisfy LMC. Support Incremental Training (SIT) indicates whether the method supports incremental training, which is crucial for practical systems. As the data increases, the predefined tree/clustering structure of eTree/JNKM can become sub-optimal, the pretrained hash function of HashGNN may become outdated, and the pre-specified pivot users in PreHash may become less representative. As a result, these methods may require periodic retraining.

Table 3. The test AUC (presented in % for a better interpretation) on Video Games and Electronics datasets.
Dataset Video Games Electronics
Feature Interaction DeepFM DIN DeepFM DIN
Compression Ratio 100% 10% 5% 2% 100% 10% 5% 2% 100% 10% 5% 2% 100% 10% 5% 2%
Vanilla full embedding 68.19 - - - 70.73 - - - 66.95 - - - 67.62 - - -
Modulo - 61.68 58.81 56.34 - 63.35 61.94 60.25 - 63.13 54.54 48.61 - 63.70 61.69 58.93
HashGNN - 68.51 64.43 61.84 - 71.59 71.11 70.82 - 64.49 61.41 59.45 - 65.50 63.49 60.40
PreHash ($K=M_{q}$) - 69.19 66.21 63.89 - 72.72 71.63 71.46 - 67.50 67.18 66.99 - 70.30 68.80 68.77
CEL-Lite 70.85$^{p}$ 70.33 70.21 69.82 73.85$^{p}$ 73.80 73.50 73.39 68.98$^{p}$ 68.90 68.69 68.03 71.63$^{p}$ 71.56 70.18 69.81

4.1.2. Online Conversion Prediction

To promote desirable behaviors from users, such as clicks and purchases, conversion rate prediction is an important learning task in online E-commerce systems. In this section, we test the models on the conversion prediction task, i.e., predicting whether a user will perform a desirable behavior on an item. We use the Video Games and Electronics datasets for this experiment, treating ratings not less than 4 as positive feedback (i.e., conversions). For performance evaluation, we use the standard Area under the ROC Curve (AUC) on the test sets, which can be interpreted as the probability that a random positive example gets a score higher than a random negative example.

The experiment is conducted in an online fashion, such that the data is received incrementally over time. We adopt HashGNN and PreHash, the top-2 performing hash-based methods in the rating prediction experiment (as well as the naive Modulo), as our baselines, though they do not support incremental training. The hash function of HashGNN requires pretraining on the full data before the actual embedding learning starts; we thus allow it to access the full data before the online training. PreHash requires binding the warmest users to its buckets; we thus allow it to go through all the data to find the warmest users before the online training. It is worth noting that, unlike these two methods, CEL(-Lite) can naturally start embedding learning from scratch without any extra information.

In this experiment, we compress both the user embeddings and the item embeddings. To further demonstrate that our method works well with different feature interaction models, we adopt DeepFM in this experiment while continuing to use DIN. We set $R=8$ and $b=256$. We set $n=20$, $m=50$, and $t_{1}=1$ for CEL-Lite.

The results are shown in Table 3. It can be observed that CEL-Lite still clearly outperforms the baselines.

Table 4. The test AUC (%) on One-day Sales dataset.
Dataset One-day Sales
Embedding size
Full 1m 100k 10k
Business Model 87.09 86.80 86.55 86.46
CEL-Lite - - 87.78 87.70

4.1.3. Integrated with Business Model

To further validate the performance of CEL-Lite, we conduct experiments on a dataset from the E-commerce platform Shopee, which handles billions of requests per day. We use the model deployed on that platform as our baseline; it is a fine-tuned business model with a sophisticated network structure and heavy feature engineering. It optionally uses full embedding, or uses Modulo to compress the size of the embedding table. In this experiment, we integrate CEL-Lite into this business model, replacing its embedding learning module.

As shown in Table 4, CEL-Lite with a small number of clusters (with 10k or 100k embedding entries for users and items, respectively) outperforms the business model with full embedding and, more significantly, the business model with compression. Specifically, CEL-Lite achieves an improvement of +0.61% in AUC with the embedding table being 2650 times smaller. Note that a +0.1% gain in absolute AUC is regarded as significant for conversion rate tasks in practice (Cheng et al., 2016; Zhou et al., 2018).

Figure 3. Averaged test MSE of items with different numbers of interactions. PreHash (with $K=M_{q}$) is the best-performing baseline in Table 2. CEL significantly outperforms PreHash on cold items. Personalization degrades the performance on cold items, while it improves the performance on warm items.

Figure 4. The test MSE at different (initial) compression ratios on (a) the dense MovieLens-1m, (b) the original MovieLens-1m, and (c) the sparse MovieLens-1m. Standard errors over 5 runs are indicated by the shaded regions. CEL with personalization at a given compression ratio indicates the fully personalized result, initialized with CEL at that compression ratio. 100% means gradually splitting to full embedding.

4.2. Personalization and Cold-start Problems

4.2.1. Personalization.

When training with shared embeddings, we may choose to untie the cluster assignments at a certain point and learn a personalized embedding (initialized with the shared embedding) for each user and item. After being untied, the embeddings can be further optimized according to each entity's own historical data. This makes sense when we want full embeddings, i.e., when we have enough data and memory. We refer to this procedure as personalization, which is similar to personalization in federated learning settings (Tan et al., 2022). There are various personalization techniques; we leave them for future investigation. Here, we simply add the following distance regularization to the original objective to guide the personalization:

\[\mathcal{L}_{p}=\frac{\lambda_{p}}{2}\sum_{j=1}^{M}\|\mathbf{B}(j,:)-\mathbf{S}_{q}(j,:)\mathbf{B}_{q}\|_{2}^{2}.\]

We set $\lambda_{p}=50$ in our experiments. Tables 2 and 3 show that CEL with personalization achieves the best results. In particular, it outperforms all other full embedding methods.
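A minimal sketch of this regularizer (the function name is illustrative): each untied item embedding is pulled toward the cluster embedding it was initialized from.

```python
import numpy as np

def personalization_penalty(B, B_q, assign, lam_p=50.0):
    """L_p = (lam_p / 2) * sum_j || B[j] - B_q[assign[j]] ||^2, where B holds
    the untied per-item embeddings initialized as B_q[assign]."""
    diff = B - B_q[assign]
    return 0.5 * lam_p * np.sum(diff ** 2)
```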

4.2.2. Mitigating Cold-start Problems.

We investigate the test loss of items with respect to their number of associated interactions in the training set (on the MovieLens-1m dataset, with a compression ratio of 1%). As shown in Figure 3, CEL significantly outperforms PreHash on cold items. This could be because PreHash emphasizes the warm items, while the cold items are roughly represented as a weighted sum of the warm ones. This may also be why PreHash performs slightly better in the interval of [800, 2000]. On the other hand, we can see that if we apply personalization to CEL, the performance on cold items degrades. This strongly evidences that cold items cannot be well optimized with only their own data, and that CEL helps in this regard. Personalization brings benefits in the intervals above 20 interactions, and hence can yield a better overall performance. We further try personalizing only the warm items (>20 interactions) and find a 1.2% improvement in test MSE.

4.2.3. Sparsity and Compression Ratio.

To investigate the impact of data sparsity and compression ratio on CEL, we create a dense MovieLens-1m dataset by filtering out the users and items with no more than 100 interactions, and a sparse MovieLens-1m dataset by randomly dropping 50% of the interactions (but still ensuring at least 1 sample) for each user.

As shown in Figure 4, on the dense and original datasets, increasing the number of clusters consistently improves the overall performance of CEL; on the sparse dataset, the optimal compression ratio occurs at 2%, and further splits degrade performance.

We can also see that personalization at different compression ratios consistently brings improvement over CEL on the dense and original datasets, which is reasonable as warm items may dominate the overall performance on these datasets. By contrast, on the sparse dataset, personalization consistently degrades performance. More specifically, on the dense dataset, increasing the number of clusters consistently improves the performance of personalization, while on the original and sparse datasets there is an optimal compression ratio of 1%–2% for personalization.

Overall, more clusters are not always better. Too many clusters on a sparse dataset, or on a dataset with many cold items, may degrade performance: at least on the cold items, and possibly even overall.

4.3. Ablation Studies on Clustering

4.3.1. Interpretation of Clustering.

To see how, and how well, CEL clusters items, we visualize the learned clusters of CEL on MovieLens-1m at $M_{q}=10$ in Figure 5.

We find that at this stage, the clusters are mainly divided according to the ratings of items. The standard deviations of ratings within clusters are generally low. This result is reasonable because the task is to predict the ratings.

We can also see that movies within a cluster appear to have some inner correlation. For example, cluster 6 mainly consists of action or science-fiction action movies; cluster 5 mainly consists of music or animation movies suitable for children; cluster 9 mainly consists of crime movies with gangster elements; and 70% of the movies in cluster 8 were directed by Stanley Kubrick. Note that such information is not fed to the model in any form, but is learned by CEL automatically. A calculation of the averaged entropy of the genre distribution in each cluster further evidences that CEL distinguishes movies of different genres better than the other methods.

Interestingly, we find that movies of the same type may form multiple clusters with different average ratings. For example, clusters 4 and 6 both mostly consist of action movies, but with a clear difference in their ratings.

Table 5. The test MSE of rating predictions on MovieLens-1m. We compare CEL with rule-based clusterings, and test using rule-based clusterings as initialization for CEL.
Number of clusters ($M_{q}$) 18 40 200
Initialization Continue Running
By Genre - 1.0515 - -
By Avr. Rating - 0.8397 0.8390 0.8380
By Genre CEL 0.8066 0.7859 0.7776
By Avr. Rating CEL 0.7959 0.7892 0.7833
None CEL 0.7985 0.7858 0.7784

4.3.2. Rule-based Clusterings

CEL achieves automatic clustering of entities, but it is still not clear how it compares against rule-based clusterings. We find that, in the initial stage, CEL mainly clusters items according to the genres and ratings of movies on the MovieLens dataset. In this section, we therefore compare CEL with rule-based clusterings based on genres (partitioning movies according to their genres in the meta data; there are 18 genres in total) and ratings (partitioning movies evenly according to their averaged rating).

As shown in Table 5, neither of these two rule-based clusterings works well. We further try using them as initializations for CEL, to see whether this helps improve performance. As we can see, such initialization brings either no or insignificant improvements.

Figure 5. Clusters learned by CEL, ordered by their averaged ratings. We provide a one-line description of each cluster by summarizing the movies it contains, and list the 5 movies with the largest number of ratings, with outliers marked by *. The table on the right shows the averaged entropy of the genre distribution in the clusters, calculated with the available meta-data.

4.3.3. Split Methods

Table 6. Test MSE of different split methods.
MovieLens 100k 1m
Random Split 0.9392 0.8516
User-Inspired Bisecting 2-Means 0.9245 0.8355
Gradient Bisecting 2-Means 0.8847 0.8186
Gradient Random Projection 0.9201 0.8269
GPCA 0.8707 0.7858

We test several different split methods, including: Random Split, which randomly splits items into two sets; Bisecting 2-Means, an advanced split method from DHC (Steinbach et al., 2000), which bisects items into 2 clusters according to the distance between item representations; Random Projection, which is similar to GPCA but projects the gradients onto a random direction instead of the first principal component; and GPCA. Bisecting 2-Means is performed either on the gradient matrix or on user-inspired representations (i.e., representing an item by averaging the embeddings of its associated users). For all methods, we split until a compression ratio of 1%.

The results are shown in Table 6. We can see that Random Split yields the worst result, which indicates the importance of the split method; Bisecting 2-Means works better on gradients than on user-inspired representations, which suggests that the gradients are more informative; and both Gradient Bisecting 2-Means and Gradient Random Projection work worse than GPCA, which supports splitting along the first principal component.

5. Conclusions

In this paper, we proposed Clustered Embedding Learning (CEL), a novel embedding learning framework that can be plugged into any differentiable feature interaction model. It is capable of achieving improved performance with reduced memory costs. We have shown the theoretical identifiability of CEL and the optimal time complexity of CEL-Lite. Our extensive experiments suggested that CEL can be a powerful component for practical recommender systems.

We have mainly considered the embeddings of users and items in this paper. Nevertheless, CEL can be similarly applied to other ID features or categorical features, such as shop ID or seller ID in E-commerce, search words in search engines, object categories, and geographic locations. We leave these for future work.

Acknowledgments

This research is supported in part by the National Natural Science Foundation of China (U22B2020); the National Research Foundation Singapore and DSO National Laboratories under the AI Singapore Programme (AISG2-RP-2020-019); the RIE 2020 Advanced Manufacturing and Engineering Programmatic Fund (A20G8b0102), Singapore; the Nanyang Assistant Professorship; the Shanghai University of Finance and Economics Research Fund (2021110066).

References

  • Almutairi et al. (2021) Faisal M Almutairi, Yunlong Wang, Dong Wang, Emily Zhao, and Nicholas D Sidiropoulos. 2021. eTREE: Learning Tree-structured Embeddings. In Proc. AAAI, Vol. 35. 6609–6617.
  • Boley (1998) Daniel Boley. 1998. Principal direction divisive partitioning. Data mining and knowledge discovery 2, 4 (1998), 325–344.
  • Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proc. DLRS. 7–10.
  • DeepRec (2021) DeepRec. 2021. Adaptive Embedding. https://github.com/alibaba/DeepRec/blob/main/README.md.
  • Fu et al. (2018) Xiao Fu, Kejun Huang, and Nicholas D Sidiropoulos. 2018. On identifiability of nonnegative matrix factorization. IEEE Signal Processing Letters 25, 3 (2018), 328–332.
  • Ginart et al. (2021) Antonio A Ginart, Maxim Naumov, Dheevatsa Mudigere, Jiyan Yang, and James Zou. 2021. Mixed dimension embeddings with application to memory-efficient recommendation systems. In Proc. IEEE ISIT. IEEE, 2786–2791.
  • Gower (2014) John C Gower. 2014. Principal coordinates analysis. Wiley StatsRef: Statistics Reference Online (2014), 1–7.
  • Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: a factorization-machine based neural network for CTR prediction. arXiv preprint arXiv:1703.04247 (2017).
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proc. WWW. 173–182.
  • Kirkpatrick et al. (2017) James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. 2017. Overcoming catastrophic forgetting in neural networks. PNAS 114, 13 (2017), 3521–3526.
  • Lee and Seung (1999) Daniel D Lee and H Sebastian Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature 401, 6755 (1999), 788–791.
  • Liu et al. (2020) Haochen Liu, Xiangyu Zhao, Chong Wang, Xiaobing Liu, and Jiliang Tang. 2020. Automated embedding size search in deep recommender systems. In Proc. SIGIR. 2307–2316.
  • Ma et al. (2018) Jiaqi Ma, Zhe Zhao, Xinyang Yi, Jilin Chen, Lichan Hong, and Ed H Chi. 2018. Modeling task relationships in multi-task learning with multi-gate mixture-of-experts. In Proceedings of the 24th ACM SIGKDD international conference on knowledge discovery & data mining. 1930–1939.
  • Rolnick et al. (2019) David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. 2019. Experience replay for continual learning. Advances in NeurIPS 32 (2019).
  • Roux (2015) Maurice Roux. 2015. A comparative study of divisive hierarchical clustering algorithms. arXiv preprint arXiv:1506.08977 (2015).
  • Shi et al. (2020) Shaoyun Shi, Weizhi Ma, Min Zhang, Yongfeng Zhang, Xinxing Yu, Houzhi Shan, Yiqun Liu, and Shaoping Ma. 2020. Beyond user embedding matrix: Learning to hash for modeling large-scale users in recommendation. In Proc. SIGIR. 319–328.
  • Steinbach et al. (2000) Michael Steinbach, George Karypis, and Vipin Kumar. 2000. A comparison of document clustering techniques. (2000).
  • Sun et al. (2017) Zhu Sun, Jie Yang, Jie Zhang, and Alessandro Bozzon. 2017. Exploiting both vertical and horizontal dimensions of feature hierarchy for effective recommendation. In Proc. AAAI, Vol. 31.
  • Tan et al. (2022) Alysa Ziying Tan, Han Yu, Lizhen Cui, and Qiang Yang. 2022. Towards personalized federated learning. IEEE Transactions on Neural Networks and Learning Systems (2022).
  • Tan et al. (2020) Qiaoyu Tan, Ninghao Liu, Xing Zhao, Hongxia Yang, Jingren Zhou, and Xia Hu. 2020. Learning to hash with graph neural networks for recommender systems. In Proc. TheWebConf 2020. 1988–1998.
  • Wang et al. (2018) Suhang Wang, Jiliang Tang, Yilin Wang, and Huan Liu. 2018. Exploring hierarchical structures for recommender systems. IEEE Transactions on Knowledge and Data Engineering 30, 6 (2018), 1022–1035.
  • Weinberger et al. (2009) Kilian Weinberger, Anirban Dasgupta, John Langford, Alex Smola, and Josh Attenberg. 2009. Feature hashing for large scale multitask learning. In Proc. ICML. 1113–1120.
  • Yan et al. (2021a) Bencheng Yan, Pengjie Wang, Jinquan Liu, Wei Lin, Kuang-Chih Lee, Jian Xu, and Bo Zheng. 2021a. Binary code based hash embedding for web-scale applications. In Proc. CIKM. 3563–3567.
  • Yan et al. (2021b) Bencheng Yan, Pengjie Wang, Kai Zhang, Wei Lin, Kuang-Chih Lee, Jian Xu, and Bo Zheng. 2021b. Learning Effective and Efficient Embedding via an Adaptively-Masked Twins-based Layer. In Proc. CIKM. 3568–3572.
  • Yang et al. (2016) Bo Yang, Xiao Fu, and Nicholas D Sidiropoulos. 2016. Learning from hidden traits: Joint factor analysis and latent clustering. IEEE Transactions on Signal Processing 65, 1 (2016), 256–269.
  • Yang et al. (2019) Jiwei Yang, Xu Shen, Jun Xing, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, and Xian-sheng Hua. 2019. Quantization networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7308–7316.
  • Zhang et al. (2016) Hanwang Zhang, Fumin Shen, Wei Liu, Xiangnan He, Huanbo Luan, and Tat-Seng Chua. 2016. Discrete collaborative filtering. In Proc. SIGIR. 325–334.
  • Zhou et al. (2019) Guorui Zhou, Na Mou, Ying Fan, Qi Pi, Weijie Bian, Chang Zhou, Xiaoqiang Zhu, and Kun Gai. 2019. Deep interest evolution network for click-through rate prediction. In Proc. AAAI, Vol. 33. 5941–5948.
  • Zhou et al. (2018) Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proc. SIGKDD. 1059–1068.
Table 7. The test MSE of rating predictions on MovieLens datasets with NMF as the feature interaction model.
Datasets MovieLens-100k MovieLens-1m MovieLens-10m
Compression Ratio 100% 5% 1% 0.5% 100% 5% 1% 0.5% 100% 5% 1% 0.5%
Vanilla full embedding 0.9283 - - - 0.8357 - - - 0.7227 - - -
eTree 0.9127 0.9830 0.9973 0.9873 0.7841 0.8399 0.8482 0.8614 0.7000 0.7701 0.7830 0.7902
JNKM 0.8841 1.0501 1.0463 1.1054 0.7626 0.8903 0.8616 0.8498 0.6941 0.7766 0.7923 0.7962
Modulo - 1.0908 1.1045 1.1097 - 1.0227 1.0688 1.0732 - 0.8402 0.8532 0.8658
BH - 1.0243 1.0695 1.0729 - 0.9753 1.0170 1.0526 - 0.8351 0.8441 0.8544
AE - 1.0715 1.0863 1.0969 - 0.9959 1.0343 1.0630 - 0.8377 0.8503 0.8637
PreHash ($K=4$) - 0.9165 0.9602 0.9833 - 0.8169 0.8832 0.8987 - 0.7816 0.7934 0.7937
PreHash ($K=M_{q}$) - 0.9140 0.9533 0.9599 - 0.7958 0.8047 0.8101 - 0.7493 0.7518 0.7692
CEL (Ours) 0.8602$^{p}$ 0.8689 0.8707 0.8906 0.7507$^{p}$ 0.7784 0.7858 0.7926 0.6888$^{p}$ 0.7301 0.7374 0.7389
CEL-Lite (Ours) 0.8611$^{p}$ 0.8702 0.8824 0.8928 0.7519$^{p}$ 0.7820 0.7901 0.8014 0.6904$^{p}$ 0.7343 0.7407 0.7421

Appendix A Appendix

In this appendix, we provide some important implementation details and additional ablation studies. Detailed proofs of identifiability (Appendix B), a detailed time complexity analysis with proofs (Appendix C), and further implementation details and results (Appendix D) can be found at https://arxiv.org/abs/2302.01478.

Input: user-item interaction data matrix $\mathbf{X}$
Output: user embedding matrix $\mathbf{A}$, item cluster assignment matrix $\mathbf{S}_q$, item cluster embedding matrix $\mathbf{B}_q$, and trained model $y$

Initialize $M_q \leftarrow 1$, $\mathbf{S}_q \leftarrow [1]^{M \times M_q}$; set hyperparameters $\{\delta, \eta\}$;
Randomly initialize $\mathbf{A}$, $\mathbf{B}_q$, and the trainable parameters in $y$;
while not converged do
    $\mathcal{L} \leftarrow \big\|\mathbf{W} \odot (\mathbf{X} - Y(\mathbf{A}, \mathbf{S}_q\mathbf{B}_q))\big\|_F^2$;
    $\mathbf{A} \leftarrow \mathbf{A} - \eta\,\partial\mathcal{L}/\partial\mathbf{A}$,  $\mathbf{B}_q \leftarrow \mathbf{B}_q - \eta\,\partial\mathcal{L}/\partial\mathbf{B}_q$;
    if the reassign condition is satisfied then
        $\mathbf{S}_q \leftarrow \operatorname{arg\,min}_{\mathbf{S}_q} \big\|\mathbf{W} \odot (\mathbf{X} - Y(\mathbf{A}, \mathbf{S}_q\mathbf{B}_q))\big\|_F^2$;
    end if
    if the split condition is satisfied then
        $k \leftarrow \operatorname{arg\,max}_k \|\mathbf{W}\mathbf{S}_q(:,k)\|_0$;
        $\mathbf{B}_q(M_q+1,:) \leftarrow \mathbf{B}_q(k,:)$;   // new row of $\mathbf{B}_q$
        $\mathbf{S}_q(:,M_q+1) \leftarrow [0]^{M}$;   // new column of $\mathbf{S}_q$
        Update $\mathbf{S}_q$ with GPCA;   // Eqs. (5) and (6)
        $M_q \leftarrow M_q + 1$;
    end if
end while

Algorithm 1. The pseudocode of CEL.
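To make the overall procedure concrete, the following is a minimal NumPy sketch of Algorithm 1 for the NMF case, i.e., $Y(\mathbf{A},\mathbf{S}_q\mathbf{B}_q)=\mathbf{A}(\mathbf{S}_q\mathbf{B}_q)^{\top}$. It is illustrative only: the fixed-step reassign/split schedules, the split criterion, and the stand-in for the GPCA update of Eqs. (5) and (6) (thresholding the first principal component of the chosen cluster's interaction rows at $\delta=0$) are simplifying assumptions, not the exact settings used in our experiments.

```python
import numpy as np

def cel_nmf(X, W, R=8, M_star=16, eta=1e-3, steps=2000, t1=200, t2=100, seed=0):
    """Sketch of CEL with NMF: X is the N x M rating matrix, W its binary mask."""
    rng = np.random.default_rng(seed)
    N, M = X.shape
    A = np.abs(rng.standard_normal((N, R)))        # user embeddings (nonnegative)
    Bq = np.abs(rng.standard_normal((1, R)))       # cluster embeddings; start from one cluster
    assign = np.zeros(M, dtype=int)                # S_q stored as an index vector: item -> cluster

    for t in range(1, steps + 1):
        E = W * (X - A @ Bq[assign].T)             # weighted residual of the current factorization
        A = np.abs(A + eta * E @ Bq[assign])       # gradient step on A, projected back to >= 0
        G = np.zeros_like(Bq)
        np.add.at(G, assign, E.T @ A)              # accumulate per-item gradients into clusters
        counts = np.bincount(assign, minlength=len(Bq))
        Bq = np.abs(Bq + eta * G / np.maximum(counts, 1)[:, None])  # averaged update (cf. Eq. (7))

        if t % t1 == 0:                            # reassignment: best cluster per item
            preds = A @ Bq.T                       # N x K predictions if an item used cluster k
            losses = np.stack([((W * (X - preds[:, [k]])) ** 2).sum(0) for k in range(len(Bq))])
            assign = losses.argmin(0)

        if t % t2 == 0 and len(Bq) < M_star:       # split the cluster with the most interactions
            k = int(np.bincount(assign, weights=(W != 0).sum(0), minlength=len(Bq)).argmax())
            members = np.where(assign == k)[0]
            Z = (W * X)[:, members].T              # masked interaction rows of the member items
            Zc = Z - Z.mean(0)
            score = Zc @ np.linalg.svd(Zc, full_matrices=False)[2][0]  # 1st PC scores (GPCA proxy)
            Bq = np.vstack([Bq, Bq[k]])            # the new cluster duplicates the old embedding
            assign[members[score > 0.0]] = len(Bq) - 1                 # threshold delta = 0
    return A, Bq, assign
```

The full method additionally relies on the reassign/split conditions of Section 3 and the GPCA update of Eqs. (5) and (6); the sketch only mirrors their overall structure.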

A.1. More Details about Experiment Settings

We apply stochastic gradient descent with learning rates of 0.0001 and 0.05 for CEL and CEL-Lite, respectively. We also adopt the gradient averaging technique (Appendix D.3). The total number of iterations and the other hyperparameters on each dataset are tuned by validation.

By default, we set the initial number of clusters $M_0$ to 1 in all of our experiments. We have also tuned $M_0$ (when $M_0>1$, the users/items are mapped randomly into $M_0$ clusters at initialization) and find generally consistent performance when $M_0$ is relatively small ($\leq 10$).

When splitting with GPCA, we use a priority queue to keep track of the clusters. We set the cluster splitting threshold $d=100$, and splitting stops once the total number of clusters $M_q$ reaches the value implied by the given compression ratio.
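The following is a small sketch of this scheduling under our own simplifying assumptions: clusters are prioritized by their number of associated interactions, a cluster below the threshold $d$ is never split, and the routine that actually performs a split (GPCA in CEL) is abstracted away as a callable.

```python
import heapq

def schedule_splits(interaction_counts, target_clusters, split_fn, d=100):
    """interaction_counts: one entry per initial cluster;
    split_fn(k): splits cluster k (e.g., via GPCA) and returns the counts of the two halves."""
    counts = list(interaction_counts)
    heap = [(-c, k) for k, c in enumerate(counts)]      # max-heap on interaction count
    heapq.heapify(heap)
    while len(counts) < target_clusters and heap:
        neg_c, k = heapq.heappop(heap)
        if -neg_c < d:                                  # largest remaining cluster is below threshold
            break
        left, right = split_fn(k)
        counts[k], new_k = left, len(counts)
        counts.append(right)
        heapq.heappush(heap, (-left, k))
        heapq.heappush(heap, (-right, new_k))
    return counts
```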

We perform 5-fold cross validation for the rating prediction task, and train with 5 random seeds for the other tasks, reporting the average.

For the public datasets used in this paper, we use a split ratio of 80%/20% to divide the train set and the test set. For the Amazon datasets, we report the test performance of items/users that have interaction(s) in the training set.

In our implementation, DIN and DeepFM utilize the sparse features from the available metadata; such sparse features allow feature crossing in DeepFM. Following the common heuristic (He et al., 2017), we use 16 and 8 neurons in the two hidden layers of DIN, respectively, and we use the same network architecture for DeepFM. The business model has a sophisticated network structure and heavy feature engineering, which leads to more than 10m trainable non-embedding parameters.

On the MovieLens datasets, the genre of each movie is provided in the metadata. We define the genre distributions, i.e., a distribution $p_g(x_k),\,k\in[1,M_q]$ for each genre $g$ over the $M_q$ clusters. We can then calculate the averaged entropy (e.g., in Figure 5) of the genre distributions over the clusters, which can be expressed as

\text{Entropy}=-\frac{1}{|\text{Genres}|}\sum_{g\in\text{Genres}}\sum_{k=1}^{M_q}p_g(x_k)\log p_g(x_k).

Lower entropy indicates a better distinction of genres.
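As a reference, a short sketch of this computation is given below, under the assumption that $p_g(x_k)$ is the fraction of genre-$g$ items that fall into cluster $k$; the function and variable names are ours.

```python
import numpy as np

def averaged_genre_entropy(assign, genres, n_clusters):
    """assign[j]: cluster index of item j; genres[j]: set of genres of item j."""
    all_genres = sorted({g for gs in genres for g in gs})
    entropies = []
    for g in all_genres:
        counts = np.zeros(n_clusters)
        for j, gs in enumerate(genres):
            if g in gs:
                counts[assign[j]] += 1.0
        p = counts / counts.sum()                      # p_g(x_k): genre-g mass in each cluster
        p = p[p > 0]
        entropies.append(float(-(p * np.log(p)).sum()))
    return float(np.mean(entropies))                   # lower = genres better separated across clusters
```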


Figure 6. The test MSE of CEL under different settings (averaged over 5 runs): (a) MovieLens-100k + NMF; (b) MovieLens-100k + DIN; (c) MovieLens-1m + NMF. There are in total 5 cluster choosing criteria: 1) the total gradient norm of a cluster (computed as $\text{trace}(\mathbf{G}_k^{\top}\mathbf{G}_k)$); 2) the total loss of a cluster, $\mathcal{L}_k$; 3) the averaged loss of a cluster, $\mathcal{L}_k/\|\mathbf{S}_q(:,k)\|_0$; 4) the number of associated items, $\|\mathbf{S}_q(:,k)\|_0$; 5) the number of associated interactions (data), $\|\mathbf{W}\mathbf{S}_q(:,k)\|_0$. In (d) we also report the averaged ranking (from 1st, the best, to 5th, the worst) of each cluster choosing criterion over all 12 settings tested.

A.2. On Datasets of Different Scales

Besides MovieLens-1m, we also performed experiments on MovieLens-100k, MovieLens-10m, and MovieLens-25m, which contain 100k, 10m, and 25m interactions, respectively. As shown in Tables 7 and 8, CEL consistently outperforms the baselines. The other observations are also in line with our discussion in the main paper.

Table 8. The test MSE of rating predictions on the MovieLens-25m dataset with NMF as the feature interaction model.

Compression Ratio      | 100%     5%      1%      0.5%
Vanilla full embedding | 0.8359   -       -       -
Modulo                 | -        1.0782  1.1680  1.2004
PreHash (K=M_q)        | -        0.9095  0.9566  1.0104
CEL-Lite (Ours)        | 0.7508^p 0.7809  0.8119  0.8404

A.3. Ablation Studies: Split

In this section, we perform a thorough study of the split in CEL.

A.3.1. Split vs. No Split

We first provide empirical evidence of the benefit of the CEL split operation through ablation studies. For the no-split variant, we initialize the number of clusters $M_0$ to exactly the maximum allowed number, with randomly initialized assignments; reassignments are still allowed. The results in Table 9 show that the split operation does bring benefits compared with no split. Another strong piece of evidence supporting the superior performance of our GPCA split can be found in Table 6.

Table 9. The test MSE. Comparison of split vs. no split.

MovieLens-1m      |         NMF          |         DIN
Compression Ratio | 100%   1%     0.5%   | 100%   1%     0.5%
CEL (no split)    | 0.7582 0.7991 0.8025 | 0.7428 0.7824 0.8023
CEL               | 0.7474 0.7846 0.7926 | 0.7390 0.7787 0.7868

A.3.2. Criterion for Choosing the Split Cluster

We test 5 cluster choosing criteria, with results shown in Figure 6. They generally lead to similarly good performance, demonstrating the robustness of CEL with respect to the cluster choosing criterion. It is worth noting that the averaged-loss criterion performs worse than the others, probably because cold items tend to have a larger mean loss than warm items, as revealed by our experiment in Figure 3. Focusing on the mean may therefore overly emphasize clusters of cold items and lead to undesired behavior; in this sense, the total loss (or the other criteria) may perform better than the mean.

A.3.3. Threshold for GPCA Split

We test different thresholds $\delta$ for the split with GPCA. The results are shown in Table 10, where we split until a compression ratio of 2%. They generally lead to similarly good performance, indicating the robustness of the GPCA split w.r.t. the threshold $\delta$. For the other results reported in this paper, we adopt the default value $\delta=0$. Note that GPCA always finds a significant 1st principal component: the eigenvalue decompositions in our experiments show that the largest eigenvalue is usually $\sim$100 times larger than the second largest.

Table 10. The test MSE for different split thresholds.

MovieLens                                                   | 100k   | 1m
$\delta = 0$                                                | 0.8684 | 0.7795
$\delta =$ median of the 1st principal component scores     | 0.8691 | 0.7807
$\delta$ chosen to balance the split of data (Section 3.4)  | 0.8689 | 0.7808
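To make the three threshold choices in Table 10 concrete, here is a sketch (with our own variable names) of how a cluster can be partitioned given the 1st principal component scores of its items and their interaction counts; the "balanced" option is one way to balance the split of data, as an assumption on our part.

```python
import numpy as np

def split_by_threshold(scores, n_interactions, mode="zero"):
    """scores, n_interactions: arrays over the items of the cluster being split;
    returns a boolean mask of items moved to the new cluster."""
    if mode == "zero":
        delta = 0.0
    elif mode == "median":
        delta = float(np.median(scores))
    elif mode == "balanced":                       # put roughly half of the interactions on each side
        order = np.argsort(scores)
        cum = np.cumsum(n_interactions[order])
        delta = float(scores[order][np.searchsorted(cum, cum[-1] / 2)])
    else:
        raise ValueError(mode)
    return scores > delta
```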

A.4. Ablation Studies: Reassignment

A.4.1. Reassignment

We conduct an experiment to quantify the effect of reassignment. The results in Table 11 show that, in the two scenarios tested, performing many reassignment steps leads to slight overfitting, while performing too few leads to underfitting. The total number of reassignments (as well as $t_1$) is thus a hyperparameter that can be tuned (e.g., by grid search). This reveals the rationale behind our strategy of using a constant number of reassignments.

Table 11. The MSE for different frequencies of reassignment.

MovieLens-1m               | Compression Ratio 0.5% | Compression Ratio 1%
$t_1$                      | 20     40     100      | 20     40     100
Total no. of reassignments | ≈100   ≈50    ≈20      | ≈100   ≈50    ≈20
CEL (training MSE)         | 0.7307 0.7318 0.7329   | 0.7071 0.7179 0.7187
CEL (test MSE)             | 0.7931 0.7926 0.7939   | 0.7867 0.7858 0.7860

A.4.2. Retrain with the Learned Clustering

We further conduct experiments that retrain with the learned clustering structure. The settings of the retraining are identical to those of the training from which we obtained the clustering structure; the only difference is that the assignment is now fixed (no cluster split nor cluster reassignment). The results are shown in Table 12. The test performance of retraining is nearly identical to that of training, indicating that our CEL algorithm has learned effective clustering structures that generalize well under different initializations of the embeddings. It also implies that CEL does not encounter obvious optimization difficulties while learning such a clustering structure.

Table 12. The test MSE when performing retrain.

Datasets          |     MovieLens-100k    |     MovieLens-1m
Compression Ratio | 100%   1%     0.5%    | 100%   1%     0.5%
CEL               | 0.8671 0.8707 0.8906  | 0.7507 0.7858 0.7926
CEL (Retrain)     | -      0.8704 0.8905  | -      0.7858 0.7927

Appendix B More about Identifiability of CEL

B.1. Definitions of Related Concepts

B.1.1. Essential Uniqueness.

We say that the NMF $\mathbf{X}=\mathbf{A}\mathbf{B}^{\top}$ is essentially unique if the nonnegative matrices $\mathbf{A}$ and $\mathbf{B}$ are identifiable up to a common permutation and scaling/counter-scaling, that is, $\mathbf{X}=\mathbf{A}'\mathbf{B}'^{\top}$ implies $\mathbf{A}'=\mathbf{A}\mathbf{\Pi}\mathbf{D}$ and $\mathbf{B}'=\mathbf{B}\mathbf{\Pi}\mathbf{D}^{-1}$ for some permutation matrix $\mathbf{\Pi}$ and full-rank diagonal scaling matrix $\mathbf{D}$.

B.1.2. The Sufficiently Scattered Condition.

To investigate identifiability, it is common practice to assume that the rows of $\mathbf{B}$ are sufficiently scattered (Almutairi et al., 2021; Yang et al., 2016; Fu et al., 2018). Formally, this means the following.

Definition 1.

(Sufficiently Scattered (Fu et al., 2018)) The rows of a nonnegative matrix $\mathbf{B}\in\mathbb{R}^{M\times R}$ are said to be sufficiently scattered if: 1) $cone\{\mathbf{B}^{\top}\}\supseteq\mathcal{C}$; and 2) $cone\{\mathbf{B}^{\top}\}^{\star}\cap bd\{\mathcal{C}^{\star}\}=\{\lambda\mathbf{e}_k\,|\,\lambda\geq 0,\,k=1,\ldots,R\}$. Here $cone\{\mathbf{B}^{\top}\}=\{\mathbf{x}\,|\,\mathbf{x}=\mathbf{B}^{\top}\mathbf{\theta},\,\mathbf{\theta}\geq\mathbf{0},\,\mathbf{1}^{\top}\mathbf{\theta}=1\}$ and $cone\{\mathbf{B}^{\top}\}^{\star}=\{\mathbf{y}\,|\,\mathbf{x}^{\top}\mathbf{y}\geq 0,\,\forall\mathbf{x}\in cone\{\mathbf{B}^{\top}\}\}$ are the conic hull of $\mathbf{B}^{\top}$ and its dual cone, respectively; $\mathcal{C}=\{\mathbf{x}\,|\,\mathbf{x}^{\top}\mathbf{1}\geq\sqrt{R-1}\,\|\mathbf{x}\|_2\}$ and $\mathcal{C}^{\star}=\{\mathbf{x}\,|\,\mathbf{x}^{\top}\mathbf{1}\geq\|\mathbf{x}\|_2\}$; $bd$ denotes the boundary of a set.

The sufficiently scattered condition essentially means that $cone\{\mathbf{B}^{\top}\}$ contains $\mathcal{C}$, a central region of the nonnegative orthant, as its subset.

B.2. Proof of Theorem 1

First, we have the following lemma, which is taken from previous work (Fu et al., 2018).

Lemma 2.

(NMF Identifiability) Suppose $\mathbf{X}=\mathbf{A}\mathbf{B}^{\top}$, where $\mathbf{A}\in\mathbb{R}^{N\times R}$ and $\mathbf{B}\in\mathbb{R}^{M\times R}$ are nonnegative matrices and $rank(\mathbf{X})=rank(\mathbf{A})=R$. If, in addition, the rows of $\mathbf{B}$ are sufficiently scattered, then $\mathbf{A}$ and $\mathbf{B}$ are essentially unique.

Note that this lemma is a direct implication of Theorem 1 in (Fu et al., 2018).

Proof.

(Proof of Theorem 1) We assumed the rows of $\mathbf{B}_q$ are sufficiently scattered. Since $\mathbf{S}_q\mathbf{B}_q$ is a permutation with repetition of rows in $\mathbf{B}_q$, the rows of $\mathbf{S}_q\mathbf{B}_q$ are sufficiently scattered (because $cone\{\mathbf{B}_q^{\top}\mathbf{S}_q^{\top}\}=cone\{\mathbf{B}_q^{\top}\}$, following from our assumption of $\mathbf{S}_q$ being of full-column rank). Now, since $\mathbf{A}$ is of full-column rank and the rows of $\mathbf{S}_q\mathbf{B}_q$ are sufficiently scattered, $\mathbf{A}$ and $\mathbf{S}_q\mathbf{B}_q$ in $\mathbf{X}=\mathbf{A}\mathbf{B}_q^{\top}\mathbf{S}_q^{\top}$ are essentially unique by Lemma 2.

Given that $\mathbf{S}_q$ is of full-column rank, $\mathbf{S}_q\mathbf{B}_q$ is a permutation with repetition of the distinct rows in $\mathbf{B}_q$, and every row of $\mathbf{B}_q$ appears in $\mathbf{S}_q\mathbf{B}_q$. Regarding $\mathbf{S}_q\mathbf{B}_q$ as the subject of decomposition, $\mathbf{S}_q$ and $\mathbf{B}_q$ can be uniquely determined up to a permutation. Note that there is no diagonal scaling ambiguity between $\mathbf{S}_q$ and $\mathbf{B}_q$, since we have required $\mathbf{S}_q$ to be a matrix over $\{0,1\}$.

Overall, $\mathbf{A}$, $\mathbf{S}_q$ and $\mathbf{B}_q$ are essentially unique. ∎

B.3. Proof of Proposition 2

Proof.

Following the proof of Theorem 1, we can similarly deduce that $\mathbf{A}'$ and $\mathbf{S}_{q+1}\mathbf{B}_{q+1}$ are essentially unique. Thus $\mathbf{S}_{q+1}\mathbf{B}_{q+1}=\mathbf{S}_q\mathbf{B}_q\mathbf{\Pi}\mathbf{D}$, where $\mathbf{D}$ is a full-rank diagonal nonnegative matrix and $\mathbf{\Pi}$ is a permutation matrix. From the assumption, we know that there are only $M_q$ distinct rows in $\mathbf{S}_q\mathbf{B}_q$, and thus only $M_q$ distinct rows in $\mathbf{S}_{q+1}\mathbf{B}_{q+1}$. From the assumption that $\mathbf{S}_{q+1}$ is of full-column rank, we know that all rows of $\mathbf{B}_{q+1}$ appear in $\mathbf{S}_{q+1}\mathbf{B}_{q+1}$, which implies that $\mathbf{B}_{q+1}$ has only $M_q$ distinct rows, and they correspond to a permutation/scaling of the rows of $\mathbf{B}_q$. ∎

B.4. Uniqueness of the Optimal Cluster Number

Proposition 3.

(Irreducibility of the Optimal Cluster Number) Following from the assumptions of Theorem 1, there does not exist a decomposition of the data matrix such that $\mathbf{X}=\mathbf{A}'\mathbf{B}_{q-1}^{\top}\mathbf{S}_{q-1}^{\top}$, where $\mathbf{A}'\in\mathbb{R}^{N\times R}$, $\mathbf{B}_{q-1}\in\mathbb{R}^{M_{q-1}\times R}$, $\mathbf{S}_{q-1}\in\{0,1\}^{M\times M_{q-1}}$, $\|\mathbf{S}_{q-1}(j,:)\|_0=1,\;\forall j\in[1,M]$, $rank(\mathbf{A}')=R$, $M_q>M_{q-1}\geq R$, and $\mathbf{S}_{q-1}$ is of full-column rank.

Proof.

Suppose such a decomposition exists. Then, by Proposition 2, the decomposition in Theorem 1 would have repeated rows in $\mathbf{B}_q$. This violates our assumption that $\mathbf{B}_q$ has no repeated rows, which proves the proposition by contradiction. Hence, no $M_{q-1}<M_q$ admits a decomposition satisfying the conditions of Theorem 1. ∎

Together with Proposition 2, we know that $M_q$ is unique if the assumptions of Theorem 1 hold. Formally:

Corollary 4.

(Uniqueness of the Optimal Cluster Number) Following from the assumptions of Theorem 1, there does not exist a decomposition of the data matrix such that $\mathbf{X}=\mathbf{A}'\mathbf{B}_p^{\top}\mathbf{S}_p^{\top}$, where $\mathbf{A}'\in\mathbb{R}^{N\times R}$, $\mathbf{B}_p\in\mathbb{R}^{M_p\times R}$ has no repeated rows, $\mathbf{S}_p\in\{0,1\}^{M\times M_p}$, $\|\mathbf{S}_p(j,:)\|_0=1,\;\forall j\in[1,M]$, $rank(\mathbf{A}')=R$, $M_p\neq M_q$, and $\mathbf{S}_p$ is of full-column rank.

In summary, our Theorem 1 identifies a set of solutions for $\mathbf{A}$, $\mathbf{S}_q$ and $\mathbf{B}_q$ such that $\mathbf{B}_q$ contains no repeated rows (otherwise the solution could be further reduced to a simpler form). Such a set of irreducible solutions naturally leads to an optimal cluster number, and the essential uniqueness of the solutions implies the uniqueness of this optimal cluster number.

Appendix C More on Complexity of CEL(-Lite)

C.1. Time and Space Complexities of GPCA

For CEL, the time complexity of GPCA is $\mathcal{O}(\|\mathbf{S}_q(:,k)\|_0 NR)$, which is linear in $\|\mathbf{S}_q(:,k)\|_0$ and matches that of gradient calculation. The PCA (including the eigendecomposition) only involves an $R\times R$ matrix; its time complexity is $\mathcal{O}(\|\mathbf{S}_q(:,k)\|_0 R^2+R^3)$, where $R$ is usually small. The space complexity is also low and can be as low as $\mathcal{O}(R^2)$ in theory. Thus, splitting with GPCA is both time and space efficient, and the time complexity can be further reduced via parallel computation.

Note that when we use a constant number of (at most) $n$ buffered interactions for each item considered in the GPCA, the time complexity reduces to $\mathcal{O}(n\|\mathbf{S}_q(:,k)\|_0 R)$.
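The $R\times R$ claim can be read directly off a covariance-based PCA. Below is a small sketch under the assumption that the $\|\mathbf{S}_q(:,k)\|_0$ member rows have already been reduced to $R$-dimensional representations (the $\mathcal{O}(\|\mathbf{S}_q(:,k)\|_0 NR)$ step); only an $R\times R$ covariance matrix is eigendecomposed.

```python
import numpy as np

def first_principal_component(Z):
    """Z: m x R matrix of member representations; returns PC scores and the top-two eigenvalue ratio."""
    Zc = Z - Z.mean(axis=0)                  # centering: O(mR)
    C = Zc.T @ Zc / len(Zc)                  # R x R covariance: O(mR^2)
    eigval, eigvec = np.linalg.eigh(C)       # eigendecomposition of an R x R matrix: O(R^3)
    v = eigvec[:, -1]                        # eigenvector of the largest eigenvalue (eigh is ascending)
    return Zc @ v, eigval[-1] / eigval[-2]   # scores used for the split; A.3.3 reports the ratio is ~100
```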

C.2. Time Complexity of CEL-Lite Formally

Theorem 1.

The time complexity of CEL-Lite is $\mathcal{O}(DR)$.

Proof.

The total time complexity is the sum of the time complexities of the embedding optimization and the cluster optimization. The cluster optimization consists of splits and reassignments.

Firstly, all reassignments can be processed within $\mathcal{O}(DR)$ time, since they involve a total of $D/b$ batches, and each batch incurs $\mathcal{O}(nmbR)$ time from $b$ sub-problems with $m$ candidate clusters and a constant number $n$ of buffered history interactions considered.

Secondly, all split operations can be processed within $\mathcal{O}(DR)$ time. Here, we generalize $d$ in strategies 1&2 to a monotonically non-decreasing sublinear function $d(D_t)$, where $D_t$ denotes the number of interactions at the current training step $t$; this distinguishes it from $D$, the total number of interactions collected during the entire online learning. Strategies 1&2 then ensure that each cluster newly split at step $t$ has $\geq\frac{1}{2}d(D_t)$ interactions. This remains true throughout training, since the non-decreasing function $d$ ensures that such a cluster still has $\geq\frac{1}{2}d(D_{t'})\geq\frac{1}{2}d(D_t)$ interactions even if it splits at any future step $t'>t$. Denote the number of clusters newly split at step $t$ by $\Delta_t$ and suppose the training lasts for a total of $T$ steps. Summing over the steps gives $\sum_{t=1}^{T}\Delta_t\times\frac{1}{2}d(D_t)\leq D$. Recall that the time complexity of each split operation is $\mathcal{O}(n\|\mathbf{S}_q(:,k)\|_0 R)$ when only a constant number of buffered interactions is considered. The splits at step $t$ then require $\mathcal{O}\big((\Delta_t\times 2d(D_t)+\mathcal{O}(b))nR\big)$ time, where $\mathcal{O}(b)$ accounts for the new interactions from the reassignment or the new batch data at step $t$. As a result, the overall time complexity of all splits is $\mathcal{O}\big(nR\sum_{t=1}^{T}(\Delta_t\times 2d(D_t)+\mathcal{O}(b))\big)\leq\mathcal{O}\big(nR(4D+\mathcal{O}(D))\big)=\mathcal{O}(nDR)=\mathcal{O}(DR)$. (Strategy 1 incurs at most an additional $\mathcal{O}(DR)$ time.)

Note that the embedding optimization also takes $\mathcal{O}(DR)$ time. So the overall time complexity of CEL-Lite is $\mathcal{O}(DR)$. ∎

C.3. Complexity Analysis for CEL

In contrast to CEL-Lite, CEL assumes the data matrix is known prior to the embedding learning, and each optimization step processes the entire data matrix. With our framework in Section 3, CEL splits once per $t_2$ steps of embedding optimization, until a desired cluster number $M^*$ (corresponding to a desired compression ratio) is reached, and performs a reassignment once per $t_1$ steps of embedding optimization. Suppose the training lasts for a total of $E$ steps (of embedding optimization). $E$ is a constant, but it may not be negligible (it can even be very large) depending on the learning rate and the convergence speed.

Theorem 2.

CEL with the maximum cluster number fixed to $M^*$ has a time complexity of $\mathcal{O}\big((E+M^*+\frac{E}{t_1}M^*)DR\big)$.

Proof.

The embedding optimizations have a time complexity of $\mathcal{O}(DR)$ per step, so the total time complexity of $E$ steps of embedding optimization is $\mathcal{O}(EDR)$.

For the split operations, we can imagine a binary tree starting from a single root node. The children represent the two clusters resulting from the split of their parent, and the value of a node represents the number of interactions associated with it. The time complexity sums over all nodes in the tree (except the leaf nodes). We can easily verify that processing all nodes at a particular level (depth) of the tree takes $\mathcal{O}(DR)$ time if the level is completely filled (and less if it is not), since the numbers of interactions associated with the nodes sum to $D$. It thus remains to bound the depth of the tree: the tree is strictly binary (but not necessarily full) with a maximum depth of $M^*$. Therefore, the splits have a time complexity of $\mathcal{O}(M^*DR)$.

For the reassignments, the time complexity of each reassignment is $\mathcal{O}(M^*DR)$. There are $E/t_1$ reassignments in total, so the total time complexity of reassignment is $\mathcal{O}(\frac{E}{t_1}M^*DR)$.

Thus, the overall time complexity is $\mathcal{O}\big((E+M^*+\frac{E}{t_1}M^*)DR\big)$. ∎

C.4. Practical Time Complexities

Our theoretical analysis (Section 3.4 and above) has shown CEL(-Lite) to be an efficient framework. Empirically, for example, on MovieLens-1m with DIN at a compression ratio of 5% (Table 2), CEL runs for 1592s, while PreHash ($K=M_q$) takes 7256s. On a large-scale dataset, Electronics with DIN at a compression ratio of 10% (Table 3), CEL-Lite converges within 9366s; in comparison, PreHash ($K=M_q$) takes 31087s and the fastest baseline, Modulo, takes 6305s.

Appendix D More Details and Results

D.1. Initialization

All embeddings (vectors) are randomly initialized. We first initialize the full embedding: we sample a vector of random Gaussian entries ($\mathcal{N}(0,1)$) and divide it by its maximum absolute value, mapping each entry into the range $[-1,1]$. For NMF, we take the absolute value of each entry to ensure nonnegativity (and we also apply the absolute value after each embedding optimization step). Secondly, we initialize the cluster assignment by assigning users/items to clusters randomly. The cluster embeddings are then initialized as the centers of their associated users/items.
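A minimal sketch of this initialization is given below (the function and variable names are ours; it assumes every initial cluster receives at least one item).

```python
import numpy as np

def init_embeddings(N, M, M0, R, nmf=True, seed=0):
    rng = np.random.default_rng(seed)

    def rand_table(n):
        E = rng.standard_normal((n, R))
        E = E / np.abs(E).max(axis=1, keepdims=True)   # map each entry into [-1, 1]
        return np.abs(E) if nmf else E                 # nonnegativity for NMF

    A = rand_table(N)                                  # user embeddings
    full_item = rand_table(M)                          # per-item embeddings, used only to build centers
    assign = rng.integers(0, M0, size=M)               # random assignment into M0 clusters
    Bq = np.stack([full_item[assign == k].mean(axis=0) for k in range(M0)])  # cluster centers
    return A, Bq, assign
```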

D.2. Baseline Implementation Details

D.2.1. eTree and JNKM implementation details.

For eTree, we adopt a tree structure of depth 3: the first layer is a single root node, the second layer consists of $M_q$ nodes, and the leaf nodes on the last layer represent the items. For JNKM, there are a total of $M_q$ clusters. At test time, we only keep the embeddings of the second-layer nodes of eTree, or the cluster embeddings of JNKM.

D.2.2. BH and AE implementation details.

BH requires setting an integer hyperparameter $B$ that indicates the number of binary code blocks. The total length $L$ of the binary coding should satisfy $2^L\geq M$, while the total embedding size $B\times 2^{L/B}$ should be close to $M_q$. We always choose an appropriate $B$ so that its memory usage $B\times 2^{L/B}\geq M_q$ is only slightly larger than the embedding size $M_q$ required by the corresponding compression ratio in each experiment. The rest of the settings follow the original paper (Yan et al., 2021a). For AE, we allow the top $M_q/2$ warmest items to have their own embeddings, and the rest of the items share the other $M_q/2$ embeddings through the Modulo hash method.
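For concreteness, one way to pick $B$ consistent with the above description is sketched below; this helper is our own illustrative reading, not the exact procedure of BH.

```python
import math

def choose_B(M, Mq):
    """Pick the number of code blocks B with the smallest memory B * 2^ceil(L/B) that still covers Mq."""
    L = math.ceil(math.log2(M))                  # code length with 2^L >= M
    best = None                                  # (B, memory usage)
    for B in range(1, L + 1):
        mem = B * 2 ** math.ceil(L / B)          # embedding rows used by B blocks of length ~L/B
        if mem >= Mq and (best is None or mem < best[1]):
            best = (B, mem)
    return best
```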

D.2.3. PreHash implementation details.

PreHash requires setting a hyperparameter $K$ to select the top-$K$ buckets for each user/item. In our experiments, we set $K$ to range from 4 to $M_q$, where the number of buckets $M_q$ is determined by the corresponding compression ratio in each experiment. For the rating prediction with NMF (Table 2), we do not use the historical interactions of users for hashing; instead, we learn the hash assignment (the weights associated with each bucket) during training. This ensures a fair comparison, since the historical interactions are also not used by the other baselines. For the online conversion rate prediction (Table 3), we follow the original paper and use the user history to calculate the hash assignment. We collect and update the user history after receiving the streamed online data. The buckets are initialized randomly and then bound to the anchor users identified throughout online learning. The same operation is applied to items to reduce the item embedding size accordingly. The rest of the settings follow the original paper (Shi et al., 2020).


Figure 7. (a) Illustration of the sufficiently scattered condition. The dots are rows of $\mathbf{B}_q$; the triangle is the nonnegative orthant; the circle is the central region $\mathcal{C}$; the shaded region is $cone\{\mathbf{B}_q^{\top}\}$. The image is taken from prior work (Fu et al., 2018). (b) The percentage of cluster embeddings (rows of $\mathbf{B}_q$) that lie outside the central region $\mathcal{C}$. In all settings, eventually all cluster embeddings lie outside the central region $\mathcal{C}$.

D.3. Gradient Averaging and Optimizer

We optimize $\mathcal{L}$ with respect to $\mathbf{B}_q$ and $\mathbf{A}$ via first-order optimizers, e.g., (plain) gradient descent (GD) and Adam. Since the number of assigned items can differ significantly among clusters, there may be significant differences in gradient scales. To synchronize gradient descent training among different clusters, we divide the loss of each cluster by the number of associated items, switching from the total loss to the averaged loss:

(7)   $\Delta\mathbf{B}_q(k,:)=-\frac{\eta}{\|\mathbf{S}_q(:,k)\|_0}\;\partial\mathcal{L}_k/\partial\mathbf{B}_q(k,:)$

where $\eta$ is the learning rate and the divisor equals the number of items assigned to the $k$th cluster. We refer to this as gradient averaging. It ensures that the embeddings of all clusters are updated at a similar pace, preventing larger clusters from being updated much faster.
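A short sketch of this update is shown below (the variable names are ours): per-item gradients are accumulated into their clusters and divided by the cluster sizes before the step, which is exactly Eq. (7).

```python
import numpy as np

def averaged_cluster_step(Bq, item_grads, assign, eta):
    """Bq: K x R cluster embeddings; item_grads: M x R per-item gradients; assign: item -> cluster."""
    G = np.zeros_like(Bq)
    np.add.at(G, assign, item_grads)                       # total gradient of each cluster, dL_k/dB_q(k,:)
    counts = np.bincount(assign, minlength=len(Bq))        # ||S_q(:, k)||_0
    return Bq - eta * G / np.maximum(counts, 1)[:, None]   # Eq. (7): averaged rather than total gradient
```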

We performed ablation studies on the optimizer used for gradient descent and the effect of gradient averaging. We found gradient averaging significantly reduces overfitting and keeps training stable.

Tables 13 and 14 present the results of CEL with NMF on the MovieLens-100k dataset (the trend for other feature interaction models and on other datasets is similar). The results show that gradient averaging largely improves the training result of GD, while the improvement is less obvious for Adam, possibly because Adam is, to some extent, invariant to the gradient scale. Among them, GD + gradient averaging achieves the best overall result.

Interestingly, without gradient averaging, we observe that the test MSE starts to go up after ~1000 steps of embedding optimization, while the training MSE keeps going down. A possible explanation is that, without gradient averaging, the embeddings of clusters with few assigned items (and little associated data) are updated at the same speed as those of big clusters; such fast updates of small-cluster embeddings may overfit them to their limited interactions.

Table 13. The MSE results of different optimization settings for rating prediction tasks on MovieLens-100k. The item embedding table is compressed.

Compression Ratio | GD w. grad avr | GD wo. grad avr | Adam w. grad avr | Adam wo. grad avr
1%                | 0.9022         | 1.023           | 0.9139           | 0.9189
2%                | 0.9012         | 1.019           | 0.9150           | 0.9280

Table 14. The MSE results of different optimizers for rating prediction tasks on MovieLens-100k. The item embedding table is compressed to 2%.

GD     | Adam   | AdaGrad | RMSProp
0.8654 | 0.8720 | 0.9160  | 0.8754

D.4. Norm Regularization

To handle the scaling ambiguity between $\mathbf{A}$ and $\mathbf{B}_q$ in NMF, we adopt a common solution and add norm regularization on $\mathbf{A}$ and $\mathbf{B}_q$ to the original objective function:

(8)   $\mathcal{L}_{reg}=\frac{\lambda_{reg}}{2}\big(\|\mathbf{A}\|_F^2+\|\mathbf{B}_q\|_F^2\big).$

We set $\lambda_{reg}=1$ for our experiments.

D.5. The Sufficiently Scattered Condition

As introduced in Appendix B, the sufficiently scattered condition essentially means that the conic hull of the rows of $\mathbf{B}_q$ contains a central region $\mathcal{C}$ of the nonnegative orthant as its subset, as illustrated in Figure 7(a). Although it is hard to visualize or verify whether the condition is satisfied as a whole, we can easily check whether a single cluster embedding (a row of $\mathbf{B}_q$) lies outside the central region $\mathcal{C}$. We perform such a study in Figure 7(b): in all settings, eventually all cluster embeddings lie outside the central region $\mathcal{C}$, implying that the rows of $\mathbf{B}_q$ tend to satisfy the sufficiently scattered condition. Moreover, a larger compression ratio (which leads to a finer clustering structure) is more likely to push the embeddings outside $\mathcal{C}$. Although there is no explicit regularization that encourages the clusters to be scattered, the norm penalty in Eq. (8) tends to push the embeddings toward the boundary of the nonnegative orthant, which may explain why we eventually learn a well-spread set of clusters.
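The per-row check used for Figure 7(b) is straightforward, since $\mathcal{C}=\{\mathbf{x}\,|\,\mathbf{x}^{\top}\mathbf{1}\geq\sqrt{R-1}\,\|\mathbf{x}\|_2\}$; a small sketch (with our own function name) is given below.

```python
import numpy as np

def fraction_outside_C(Bq):
    """Fraction of cluster embeddings (rows of Bq) lying outside the central region C."""
    R = Bq.shape[1]
    inside = Bq.sum(axis=1) >= np.sqrt(R - 1) * np.linalg.norm(Bq, axis=1)
    return float(1.0 - inside.mean())
```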