
Learning Binarized Graph Representations with Multi-faceted Quantization Reinforcement for Top-K Recommendation

Yankai Chen¹, Huifeng Guo², Yingxue Zhang², Chen Ma³, Ruiming Tang², Jingjie Li², Irwin King¹
¹The Chinese University of Hong Kong, ²Huawei Noah's Ark Lab, ³City University of Hong Kong
ykchen, [email protected]; [email protected]; huifeng.guo, yingxue.zhang, tangruiming, [email protected]
(2022)
Abstract.

Learning vectorized embeddings is at the core of various recommender systems for user-item matching. To perform efficient online inference, representation quantization, which embeds the latent features as a compact sequence of discrete numbers, has recently shown promising potential for reducing both memory and computation overheads. However, existing work focuses on numerical quantization while ignoring the concomitant information loss, which consequently leads to conspicuous performance degradation. In this paper, we propose a novel quantization framework to learn Binarized Graph Representations for Top-K Recommendation (BiGeaR). BiGeaR introduces multi-faceted quantization reinforcement at the pre-, mid-, and post-stage of binarized representation learning, which substantially retains representation informativeness against embedding binarization. In addition to saving the memory footprint, BiGeaR further develops solid online inference acceleration with bitwise operations, providing additional flexibility for realistic deployment. Empirical results over five large real-world benchmarks show that BiGeaR achieves about 22%–40% performance improvement over the state-of-the-art quantization-based recommender system, and recovers about 95%–102% of the performance capability of the best full-precision counterpart with over 8× time and space reduction.

Recommender system; Quantization-based; Binarization; Graph Convolutional Network; Graph Representation
journalyear: 2022; copyright: acmcopyright; conference: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 14–18, 2022, Washington, DC, USA; booktitle: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD '22), August 14–18, 2022, Washington, DC, USA; price: 15.00; isbn: 978-1-4503-9385-0/22/08; doi: 10.1145/3534678.3539452; ccs: Information systems → Recommender systems

1. Introduction

Recommender systems, which perform personalized information filtering (Ying et al., 2018), are integral to many Internet applications. Learning vectorized user-item representations (i.e., embeddings), as the core of various recommender models, is the prerequisite for online inference of user-item interactions (He et al., 2020; Wang et al., 2019). With the explosive data expansion (e.g., Amazon has over 150M users and 350M products (user statistics, 2022; product statistics, 2022)), one major challenge is to perform inference efficiently. Inference usually requires large memory and computation budgets (e.g., a 500M-scale full-precision¹ embedding table for Amazon) on certain data centers (Tailor et al., 2021), and therefore hinders deployment to devices with limited resources (Tailor et al., 2021). ¹It could be single-precision or double-precision; we use float32 as the default for explanation and for experiments throughout this paper.

Figure 1. Illustration of BiGeaR.

To tackle this issue, representation quantization has recently emerged as a promising direction. Generally, it learns to quantize the latent features of users and items by converting the continuous full-precision representations into discrete low-bit ones. The quantized representations are thus conducive to model size reduction and inference speed-up with low-bit arithmetic on devices where CPUs are typically more affordable than expensive GPUs (Banner et al., 2018; Bahri et al., 2021). Technically, quantization can be categorized into multi-bit, 2-bit (i.e., ternarized), and 1-bit (i.e., binarized) quantization. With only one bit-width, representation binarization for recommendation takes the most advantage of representation quantization and has therefore drawn growing attention recently (Tan et al., 2020; Kang and McAuley, 2019).

Despite this promise, realistic deployment remains challenging, mainly because of the large performance degradation in Top-K recommendation (Tan et al., 2020; Kang and McAuley, 2019). The crux of the matter is a threefold information loss:

  • Limited expressivity of latent features. Because of the discrete constraints, mapping full-precision embeddings into compact binary codes with equal expressivity is NP-hard (Håstad, 2001). Thus, instead of proposing complex and deep neural structures for quantization (Erin Liong et al., 2015; Zhu et al., 2016), the $\operatorname{sign}(\cdot)$ function is widely adopted to achieve $O(1)$ embedding binarization (Tan et al., 2020; Wang et al., 2017; Lin et al., 2017). However, this only preserves the sign (+/-) of each embedding entry. Compared to the original full-precision embeddings, the binarized targets produced by $\operatorname{sign}(\cdot)$ are naturally less informative.

  • Degraded ranking capability. Ranking capability, as the essential measurement of Top-K recommendation, is the main objective to optimize. Apart from the inevitable feature loss in numerical quantization, previous work further ignores the discrepancy between the hidden knowledge inferred by full-precision and binarized embeddings (Kang and McAuley, 2019; Tan et al., 2020). Such hidden knowledge is vital to reveal users' preferences towards different items; losing it degrades ranking capability and accordingly leads to sub-optimal model learning.

  • Inaccurate gradient estimation. Due to the non-differentiability of the quantization function $\operatorname{sign}(\cdot)$, the Straight-Through Estimator (STE) (Bengio et al., 2013) is widely adopted to treat all propagated gradients as 1 in backpropagation (Tan et al., 2020; Lin et al., 2017; Qin et al., 2020). Intuitively, the antiderivative of the constant 1 is a linear function rather than $\operatorname{sign}(\cdot)$, so this may lead to inaccurate gradient estimation and produce inconsistent optimization directions during model training.

To address the aforementioned problems, we propose a novel quantization framework, namely BiGeaR, to learn Binarized Graph Representations for Top-K Recommendation. Based on the natural bipartite graph topology of user-item interactions, we implement BiGeaR with inspiration from graph-based models, i.e., Graph Convolutional Networks (GCNs) (Kipf and Welling, 2017; Hamilton et al., 2017). By deepening the exploration of multi-hop subgraph structures, GCN-based recommender models capture high-order interactive relations and well simulate Collaborative Filtering for recommendation (He et al., 2020; Wang et al., 2019; Ying et al., 2018). Specifically, BiGeaR sketches graph nodes (i.e., users and items) with binarized representations, which enables nearly one-bit-width representation compression. Furthermore, BiGeaR decomposes the prediction formula (i.e., the inner product) into bitwise operations (i.e., Popcount and XNOR). This dramatically reduces the number of floating-point operations (#FLOP) and thus provides theoretically solid acceleration for online inference. To avoid large information loss, as shown in Figure 1(a), BiGeaR technically consists of multi-faceted quantization reinforcement at the pre-, mid-, and post-stage of binarized representation learning:

  (1)

    At the pre-stage of model learning, we propose the graph layer-wise quantization (from low- to high-order interactions) to consecutively binarize user-item features with different semantics. Our analysis indicates that such layer-wise quantization actually achieves a magnification effect of feature uniqueness, which significantly compensates for the limited expressivity caused by embedding binarization. The empirical study also justifies that this is more effective at enriching quantization informativeness than simply increasing the embedding size in the conventional manner (Tan et al., 2020; Lin et al., 2017; Qin et al., 2020; Rastegari et al., 2016).

  (2)

    During the mid-stage of embedding quantization, BiGeaR introduces self-supervised inference distillation to enable low-loss inheritance of ranking capability. Technically, it first extracts several pseudo-positive training samples inferred by the full-precision embeddings of BiGeaR. These samples then serve as a direct regularization target for the quantized embeddings, so that the quantized embeddings distill the ranking knowledge from the full-precision ones and produce similar inference results. Different from conventional knowledge distillation, our approach is tailored specifically for Top-K recommendation and therefore boosts performance with acceptable training cost.

  (3)

    For the post-stage backpropagation of quantization optimization, we propose to utilize an approximation of the Dirac delta function (i.e., the $\delta$ function) (Bracewell and Bracewell, 1986) for more accurate gradient estimation. In contrast to STE, our gradient estimator provides an optimization direction consistent with $\operatorname{sign}(\cdot)$ in both forward and backward propagation. The empirical study demonstrates its superiority over other gradient estimators.

Empirical Results. Extensive experiments over five real-world benchmarks show that BiGeaR significantly outperforms the state-of-the-art quantization-based recommender model by 25.77%–40.12% and 22.08%–39.52% w.r.t. Recall and NDCG metrics. Furthermore, it attains 95.29%–100.32% and 95.32%–101.94% of the recommendation capability of the best-performing full-precision model, with over 8× inference acceleration and space compression.

Discussion. It is worth mentioning that BiGeaR is related to hashing-based models (i.e., learning to hash) (Kang et al., 2021; Kang and McAuley, 2019), as, conceptually, binary hashing can be viewed as 1-bit quantization. However, as shown in Figure 1(b), they have different motivations. Hashing-based models are usually designed for fast candidate generation, followed by full-precision re-ranking algorithms for accurate prediction. In contrast, BiGeaR is an end-to-end model that makes predictions within the proposed architecture. Hence, we believe BiGeaR is technically related but motivationally orthogonal to them.

Organization. We present BiGeaR methodology and model analysis in § 2 and § 3. Then we report the experiments and review the related work in § 4 and § 5 with the conclusion in  § 6.

2. BiGeaR Methodology

In this section, we formally introduce: (1) graph layer-wise quantization for feature magnification; (2) inference distillation for ranking capability inheritance; (3) gradient estimation for better model optimization. BiGeaR framework is illustrated in Figure 2(a). The notation table and pseudo-codes are attached in Appendix A and B.

Figure 2. BiGeaR first pre-trains the full-precision embeddings and then triggers the (1) graph layer-wise quantization, (2) inference distillation, and (3) accurate gradient estimation to learn the binarized representations (best viewed in color).

Preliminaries: graph convolution. Its general idea is to learn node representations by iteratively propagating and aggregating latent features of neighbors via the graph topology (Wu et al., 2019; He et al., 2020; Kipf and Welling, 2017). We adopt the graph convolution paradigm operating on the continuous space from LightGCN (He et al., 2020), which recently shows good recommendation performance. Let $\boldsymbol{v}^{(l)}_{u},\boldsymbol{v}^{(l)}_{i}\in\mathbb{R}^{d}$ denote the continuous feature embeddings of user $u$ and item $i$ computed after $l$ layers of information propagation, and let $\mathcal{N}(x)$ denote $x$'s neighbor set. They are iteratively updated using information from the $(l-1)$-th layer:

(1) $\boldsymbol{v}^{(l)}_{u}=\sum_{i\in\mathcal{N}(u)}\frac{1}{\sqrt{|\mathcal{N}(u)|\cdot|\mathcal{N}(i)|}}\boldsymbol{v}^{(l-1)}_{i},\quad\boldsymbol{v}^{(l)}_{i}=\sum_{u\in\mathcal{N}(i)}\frac{1}{\sqrt{|\mathcal{N}(i)|\cdot|\mathcal{N}(u)|}}\boldsymbol{v}^{(l-1)}_{u}.$
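As a minimal illustration (not the authors' code), the propagation in Equation (1) amounts to repeated sparse matrix multiplication; the sketch below assumes a pre-built symmetric-normalized adjacency matrix and layer-0 embedding table, with names of our own choosing.

```python
import torch

def propagate(adj_norm: torch.Tensor, emb_0: torch.Tensor, num_layers: int):
    """Symmetric-normalized propagation as in Equation (1).

    adj_norm : sparse (M+N) x (M+N) matrix whose (u, i) entry is
               1 / sqrt(|N(u)| * |N(i)|) for each observed interaction.
    emb_0    : layer-0 user/item embeddings, shape (M+N, d).
    Returns the list [v^(0), v^(1), ..., v^(L)].
    """
    embeddings = [emb_0]
    for _ in range(num_layers):
        embeddings.append(torch.sparse.mm(adj_norm, embeddings[-1]))
    return embeddings
```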

2.1. Graph Layer-wise Quantization

We propose the graph layer-wise quantization mainly by computing quantized embeddings and embedding scalers: (1) the quantized embeddings sketch the full-precision embeddings with $d$-dimensional binarized codes (i.e., $\{-1,1\}^{d}$); and (2) each embedding scaler reveals the value range of the original embedding entries. Specifically, during the graph convolution at each layer, we track the intermediate information (e.g., $\boldsymbol{v}^{(l)}_{u}$) and perform the layer-wise 1-bit quantization in parallel as:

(2) $\boldsymbol{q}_{u}^{(l)}=\operatorname{sign}\big(\boldsymbol{v}^{(l)}_{u}\big),\quad\boldsymbol{q}_{i}^{(l)}=\operatorname{sign}\big(\boldsymbol{v}^{(l)}_{i}\big),$

where the embedding segments $\boldsymbol{q}_{u}^{(l)},\boldsymbol{q}_{i}^{(l)}\in\{-1,1\}^{d}$ retain the node latent features directly from $\boldsymbol{v}^{(l)}_{u}$ and $\boldsymbol{v}^{(l)}_{i}$. To complement the layer-wise quantized embeddings, we further include a layer-wise positive embedding scaler for each node (e.g., $\alpha_{u}^{(l)}\in\mathbb{R}^{+}$), such that $\boldsymbol{v}^{(l)}_{u}\doteq\alpha_{u}^{(l)}\boldsymbol{q}^{(l)}_{u}$. Each entry of $\alpha_{u}^{(l)}\boldsymbol{q}^{(l)}_{u}$ thus remains binary, taking values in $\{-\alpha_{u}^{(l)},\alpha_{u}^{(l)}\}$. In this work, we compute the scaler as the mean of the L1-norm:

(3) $\alpha_{u}^{(l)}=\frac{1}{d}\cdot\big\Vert\boldsymbol{v}_{u}^{(l)}\big\Vert_{1},\quad\alpha_{i}^{(l)}=\frac{1}{d}\cdot\big\Vert\boldsymbol{v}_{i}^{(l)}\big\Vert_{1}.$

Instead of setting $\alpha_{u}^{(l)}$ and $\alpha_{i}^{(l)}$ as learnable parameters, this deterministic computation is simple yet effective: it provides the scaling functionality while substantially pruning the parameter search space. An experimental demonstration is given in Appendix F.2.

After $L$ layers of quantization and scaling, we obtain the following binarized embedding table for each graph node $x$:

(4) $\mathcal{A}_{x}=\{\alpha_{x}^{(0)},\alpha_{x}^{(1)},\cdots,\alpha_{x}^{(L)}\},\quad\mathcal{Q}_{x}=\{\boldsymbol{q}^{(0)}_{x},\boldsymbol{q}^{(1)}_{x},\cdots,\boldsymbol{q}^{(L)}_{x}\}.$

From the technical perspective, BiGeaR binarizes the intermediate semantics at different layers of the receptive field (Veličković et al., 2018; Wu et al., 2020) for each node. This essentially achieves a magnification effect of feature uniqueness, enriching user-item representations via interaction graph exploration. We defer the analysis to § 3.1.
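A minimal PyTorch sketch of Equations (2)-(4), assuming the layer outputs $\boldsymbol{v}^{(0)},\dots,\boldsymbol{v}^{(L)}$ are already computed; the function and variable names are ours, and in training the backward pass of $\operatorname{sign}(\cdot)$ would be replaced by the estimator of § 2.4 rather than its default (zero) gradient.

```python
import torch

def quantize_layerwise(embeddings):
    """1-bit quantization of every layer's output (Equations (2)-(4)).

    embeddings : list of L+1 tensors of shape (num_nodes, d),
                 i.e., v^(0), ..., v^(L) from graph convolution.
    Returns (Q, A): binarized codes in {-1, +1}^d and the positive
    scalers alpha = ||v||_1 / d per node and layer.
    """
    Q, A = [], []
    for v in embeddings:
        q = torch.sign(v)                          # Equation (2)
        q[q == 0] = 1                              # assumption: map zero entries to +1
        alpha = v.abs().mean(dim=1, keepdim=True)  # Equation (3)
        Q.append(q)
        A.append(alpha)
    return Q, A
```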

2.2. Prediction Acceleration

Model Prediction. Based on the learned embedding table, we predict the matching scores by adopting the inner product:

(5) $\widehat{y}_{u,i}=\big\langle f(\mathcal{A}_{u},\mathcal{Q}_{u}),f(\mathcal{A}_{i},\mathcal{Q}_{i})\big\rangle,$

where the function $f(\cdot,\cdot)$ in this work is implemented as:

(6) $f(\mathcal{A}_{u},\mathcal{Q}_{u})=\big\Vert_{l=0}^{L}\,w_{l}\alpha_{u}^{(l)}\boldsymbol{q}^{(l)}_{u},\quad f(\mathcal{A}_{i},\mathcal{Q}_{i})=\big\Vert_{l=0}^{L}\,w_{l}\alpha_{i}^{(l)}\boldsymbol{q}^{(l)}_{i}.$

Here $\Vert$ denotes concatenation of the binarized embedding segments, and the weight $w_{l}$ measures the contribution of each segment to the prediction. $w_{l}$ can be defined as a hyper-parameter or a learnable variable (e.g., with an attention mechanism (Veličković et al., 2018)). In this work, we set $w_{l}\propto l$ so that $w_{l}$ increases linearly from lower-layer to higher-layer segments, mainly for computational simplicity and stability. A sketch of this segment fusion is given below.
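For concreteness, a hedged sketch of the segment fusion in Equation (6); names are ours.

```python
import torch

def fuse_segments(A, Q, layer_weights):
    """f(A_x, Q_x) in Equation (6): concatenate the weighted, scaled
    binary segments of all L+1 layers into one long vector."""
    segments = [w * a * q for w, a, q in zip(layer_weights, A, Q)]
    return torch.cat(segments, dim=-1)

# Equation (5): inner product of the fused user and item representations, e.g.,
# y_hat = (fuse_segments(A_u, Q_u, w) * fuse_segments(A_i, Q_i, w)).sum(dim=-1)
```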

Computation Acceleration. Notice that within each segment of $f(\mathcal{A}_{u},\mathcal{Q}_{u})$, e.g., $w_{l}\alpha_{u}^{(l)}\boldsymbol{q}^{(l)}_{u}$, every entry takes one of two values (i.e., $-w_{l}\alpha_{u}^{(l)}$ or $w_{l}\alpha_{u}^{(l)}$). Thus, we can accelerate prediction by decomposing Equation (5) into bitwise operations. Concretely, in practice, $\boldsymbol{q}^{(l)}_{u}$ and $\boldsymbol{q}^{(l)}_{i}$ are first encoded into basic $d$-bit binary codes, denoted by $\boldsymbol{\ddot{q}}^{(l)}_{u},\boldsymbol{\ddot{q}}^{(l)}_{i}\in\{0,1\}^{d}$. Then we replace Equation (5) with the following formula:

(7) $\widehat{y}_{u,i}=\sum_{l=0}^{L}w_{l}^{2}\alpha_{u}^{(l)}\alpha_{i}^{(l)}\cdot\big(2\,{\tt Popcount}\big({\tt XNOR}(\boldsymbol{\ddot{q}}^{(l)}_{u},\boldsymbol{\ddot{q}}^{(l)}_{i})\big)-d\big).$

Compared to the original computation approach in Equation (5), our bitwise-operation-supported prediction in Equation (7) reduces the number of floating-point operations (#FLOP) with Popcount and XNOR. We illustrate an example in Figure 2(b).
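The per-layer equivalence between Equations (5) and (7) can be checked with a small NumPy sketch (values and names are illustrative, not from the paper; $d$ is chosen as a multiple of 8 so that bit-packing is exact):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
q_u = rng.choice([-1, 1], size=d)          # one layer's binarized codes
q_i = rng.choice([-1, 1], size=d)
alpha_u, alpha_i, w_l = 0.13, 0.08, 1.0    # scalers and layer weight (illustrative)

# Full-precision inner product for this segment, as in Equation (5).
segment_fp = (w_l * alpha_u * q_u) @ (w_l * alpha_i * q_i)

# Bitwise version, as in Equation (7): encode {-1,+1} -> {0,1},
# then XNOR + popcount counts matching bits m, and <q_u, q_i> = 2m - d.
bits_u = np.packbits((q_u > 0).astype(np.uint8))
bits_i = np.packbits((q_i > 0).astype(np.uint8))
xnor = np.bitwise_not(np.bitwise_xor(bits_u, bits_i))   # uint8 NOT acts as XNOR here
popcount = int(np.unpackbits(xnor).sum())
segment_bit = w_l**2 * alpha_u * alpha_i * (2 * popcount - d)

assert np.isclose(segment_fp, segment_bit)
```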

2.3. Self-supervised Inference Distillation

To alleviate the asymmetric inference capability between full-precision and binarized representations, we introduce self-supervised inference distillation so that the binarized embeddings can well inherit the inference knowledge of the full-precision ones. Generally, we treat the full-precision intermediate embeddings (e.g., $\boldsymbol{v}_{u}^{(l)}$) as the teacher embeddings and the quantized segments as the student embeddings. Given both teacher and student embeddings, we obtain their prediction scores $\widehat{y}_{u,i}^{\,tch}$ and $\widehat{y}_{u,i}^{\,std}$. For Top-K recommendation, our target is then to reduce their discrepancy:

(8) $\operatorname{argmin}\ \mathcal{D}\big(\widehat{y}_{u,i}^{\,tch},\widehat{y}_{u,i}^{\,std}\big).$

A straightforward implementation of the function $\mathcal{D}$, following conventional knowledge distillation (Hinton et al., 2015; Anil et al., 2018), is to minimize their Kullback-Leibler divergence (KLD) or mean squared error (MSE). Despite their effectiveness in classification tasks (e.g., visual recognition (Anil et al., 2018; Xie et al., 2020)), they may not be appropriate for Top-K recommendation:

  • Firstly, both KLD and MSE in $\mathcal{D}$ encourage the student logits (e.g., $\widehat{y}_{u,i}^{\,std}$) to be distributed similarly to the teacher logits at a macro level. For ranking tasks, however, they may not capture the relative order of user preferences over items, which is the key to improving Top-K recommendation capability.

  • Secondly, they both conduct distillation over the whole item corpus, which may be computationally excessive for realistic model training. Since item popularity usually follows a long-tail distribution (Park and Tuzhilin, 2008; Tang and Wang, 2018), learning the relative order of the frequently interacted items at the top of the ranking lists actually contributes more to Top-K recommendation performance.

To develop effective inference distillation, we propose to extract additional pseudo-positive training samples from the teacher embeddings to regularize the targeted embeddings at each convolutional layer. Concretely, let $\sigma$ denote the activation function (e.g., Sigmoid). We first pre-train the teacher embeddings to minimize the Bayesian Personalized Ranking (BPR) loss (Rendle et al., 2012):

(9) $\mathcal{L}^{tch}_{BPR}=-\sum_{u\in\mathcal{U}}\sum_{i\in\mathcal{N}(u),\,j\notin\mathcal{N}(u)}\ln\sigma\big(\widehat{y}^{\,tch}_{u,i}-\widehat{y}^{\,tch}_{u,j}\big),$

where $\mathcal{L}^{tch}_{BPR}$ forces the prediction of an observed interaction to be higher than its unobserved counterparts, and the teacher score is computed as $\widehat{y}^{\,tch}_{u,i}=\big\langle\big\Vert_{l=0}^{L}w_{l}\boldsymbol{v}_{u}^{(l)},\ \big\Vert_{l=0}^{L}w_{l}\boldsymbol{v}_{i}^{(l)}\big\rangle$. Note that binarization and its associated gradient estimation are disabled only during pre-training. After the teacher is well trained, for each user $u$ we retrieve the layer-wise teacher inference over all items $\mathcal{I}$:

(10) $\widehat{\boldsymbol{y}}^{\,tch,(l)}_{u}=\big\langle w_{l}\widehat{\boldsymbol{v}}_{u}^{(l)},\,w_{l}\widehat{\boldsymbol{v}}_{i}^{(l)}\big\rangle_{i\in\mathcal{I}}.$

Based on the segment scores $\widehat{\boldsymbol{y}}_{u}^{\,tch,(l)}$ at the $l$-th layer, we first sort out the Top-R items with the highest matching scores, denoted by $S^{(l)}_{tch}(u)$, where the hyper-parameter $R\ll|\mathcal{I}|$. Inspired by (Tang and Wang, 2018), we then propose our layer-wise inference distillation as follows:

(11) $\mathcal{L}_{ID}(u)=\sum_{l=0}^{L}\mathcal{L}_{ID}^{(l)}\big(\widehat{\boldsymbol{y}}_{u}^{\,std,(l)},S^{(l)}_{tch}(u)\big)=-\frac{1}{R}\sum_{l=0}^{L}\sum^{R}_{k=1}w_{k}\cdot\ln\sigma\big(\widehat{y}_{u,S^{(l)}_{tch}(u,k)}^{\,std,(l)}\big),$

where the student scores $\widehat{\boldsymbol{y}}^{\,std,(l)}_{u}$ are computed analogously to Equation (10), and $S^{(l)}_{tch}(u,k)$ returns the $k$-th highest-scored item among the pseudo-positive samples. $w_{k}$ is a ranking-aware weight with two major effects: (1) since the samples in $S^{(l)}_{tch}(u)$ are not necessarily all ground-truth positives, $w_{k}$ balances their contribution to the overall loss; (2) it dynamically adjusts the importance of different ranking positions in $S^{(l)}_{tch}(u)$. To achieve both, $w_{k}$ follows a parameterized geometric distribution that approximates the tailed item popularity (Rendle and Freudenthaler, 2014):

(12) $w_{k}=\lambda_{1}\exp(-\lambda_{2}\cdot k),$

where $\lambda_{1}$ and $\lambda_{2}$ control the loss contribution level and the sharpness of the distribution. Intuitively, $\mathcal{L}_{ID}$ encourages items highly recommended by the full-precision embeddings to appear more frequently in the student's inference list. Moreover, our distillation approach regularizes the embedding quantization in a layer-wise manner, which effectively narrows their inference discrepancy and yields a more correlated recommendation capability.
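A hedged PyTorch sketch of Equations (11)-(12) for one user, assuming the layer-wise teacher and student scores over all items are already computed; hyper-parameter values and names are illustrative, not the paper's.

```python
import torch

def inference_distillation_loss(teacher_scores, student_scores,
                                R=100, lam1=1.0, lam2=0.1):
    """Layer-wise inference distillation (Equations (11)-(12)).

    teacher_scores / student_scores : lists of L+1 tensors, each of shape
        (num_items,), holding one user's layer-l scores over all items
        from the teacher (full-precision) and student (binarized)
        embeddings.  R << num_items; R, lam1, lam2 are hyper-parameters.
    """
    ranks = torch.arange(1, R + 1, dtype=torch.float32)
    w_k = lam1 * torch.exp(-lam2 * ranks)             # Equation (12)
    loss = 0.0
    for y_tch, y_std in zip(teacher_scores, student_scores):
        topR = torch.topk(y_tch, R).indices           # pseudo-positives S_tch^(l)(u)
        loss = loss - (w_k * torch.log(torch.sigmoid(y_std[topR]))).sum() / R
    return loss
```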

Objective Function. Combining $\mathcal{L}^{std}_{BPR}$, which computes the BPR loss (similar to Equation (9)) with the student predictions from Equation (5), and $\mathcal{L}_{ID}$ over all training samples, our final objective for learning embedding binarization is:

(13) $\mathcal{L}=\mathcal{L}^{std}_{BPR}+\mathcal{L}_{ID}+\lambda\Vert\Theta\Vert_{2}^{2},$

where $\Vert\Theta\Vert_{2}^{2}$ is the $L_2$-regularizer of the node embeddings, weighted by the hyper-parameter $\lambda$ to avoid over-fitting.
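A minimal sketch of Equations (9) and (13), assuming the positive/negative student scores, the distillation loss, and the embedding parameters are given; names are ours.

```python
import torch
import torch.nn.functional as F

def bpr_loss(pos_scores, neg_scores):
    """BPR loss as in Equation (9): observed interactions should score
    higher than their unobserved counterparts."""
    return -F.logsigmoid(pos_scores - neg_scores).sum()

def total_loss(pos_std, neg_std, id_loss, params, lam=1e-4):
    """Final objective in Equation (13): student BPR + inference
    distillation + L2 regularization (lam is an illustrative value)."""
    l2 = sum((p ** 2).sum() for p in params)
    return bpr_loss(pos_std, neg_std) + id_loss + lam * l2
```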

2.4. Gradient Estimation

Although the Straight-Through Estimator (STE) (Bengio et al., 2013) enables an executable gradient flow for backpropagation, it may cause inconsistent optimization directions between forward and backward propagation: the antiderivative of the constant 1 used by STE is a linear function rather than $\operatorname{sign}(\cdot)$. To furnish more accurate gradient estimation, in this paper we utilize an approximation of the Dirac delta function (Bracewell and Bracewell, 1986).

Figure 3. Gradient estimation.

Concretely, let $u(\phi)$ denote the unit-step function, a.k.a. the Heaviside step function (function, 2022), where $u(\phi)=1$ for $\phi>0$ and $u(\phi)=0$ otherwise. Clearly, $\operatorname{sign}(\phi)=2u(\phi)-1$, and thus theoretically $\frac{\partial\operatorname{sign}(\phi)}{\partial\phi}=2\frac{\partial u(\phi)}{\partial\phi}$. It has been shown (Bracewell and Bracewell, 1986) that $\frac{\partial u(\phi)}{\partial\phi}=0$ for $\phi\neq 0$ and $\frac{\partial u(\phi)}{\partial\phi}=\infty$ at $\phi=0$, which is exactly the Dirac delta function, a.k.a. the unit impulse function $\delta(\cdot)$ (Bracewell and Bracewell, 1986), shown in Figure 3(a). However, it is intractable to use $\delta(\cdot)$ directly for gradient estimation. A feasible solution is to approximate $\delta(\cdot)$ with a zero-centered Gaussian probability density:

(14) $\delta(\phi)=\lim_{\beta\rightarrow\infty}\frac{|\beta|}{\sqrt{\pi}}\exp\big(-(\beta\phi)^{2}\big),$

which implies that:

(15) $\frac{\partial\operatorname{sign}(\phi)}{\partial\phi}\doteq\frac{2\gamma}{\sqrt{\pi}}\exp\big(-(\gamma\phi)^{2}\big).$

As shown in Figure 3(b)-(c), the hyper-parameter $\gamma$ determines the sharpness of the derivative curve that approximates $\operatorname{sign}(\cdot)$.
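In PyTorch, such an estimator can be realized as a custom autograd function that outputs $\operatorname{sign}(\phi)$ in the forward pass and applies Equation (15) in the backward pass; the sketch below is ours, with $\gamma$ set to an illustrative value.

```python
import math
import torch

class SignWithDeltaGrad(torch.autograd.Function):
    """sign(.) in the forward pass; the backward pass uses the
    zero-centered Gaussian approximation of the Dirac delta in
    Equation (15) instead of the STE's constant 1."""

    gamma = 1.0  # sharpness hyper-parameter (illustrative)

    @staticmethod
    def forward(ctx, phi):
        ctx.save_for_backward(phi)
        return torch.sign(phi)

    @staticmethod
    def backward(ctx, grad_output):
        (phi,) = ctx.saved_tensors
        g = SignWithDeltaGrad.gamma
        grad_sign = (2 * g / math.sqrt(math.pi)) * torch.exp(-(g * phi) ** 2)
        return grad_output * grad_sign

# Usage: q = SignWithDeltaGrad.apply(v)  # binarize v while keeping gradients
```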

Intuitively, our proposed gradient estimator follows the main direction of the factual gradients of $\operatorname{sign}(\cdot)$ in model optimization. This produces a coordinated value quantization from continuous embeddings to quantized ones, so that a series of evolving gradients can be estimated for inputs with diverse value ranges. As we show in § 4.6, our gradient estimator works better than other recent estimators (Gong et al., 2019; Qin et al., 2020; Darabi et al., 2018; Yang et al., 2019; Liu et al., 2019).

3. Model Analysis

3.1. Magnification of Feature Uniqueness

We take user $u$ as an example; the following analysis generalizes to other nodes without loss of generality. Similar to sensitivity analysis in statistics (Koh and Liang, 2017) and influence diffusion in social networks (Xu et al., 2018), we measure how the latent feature of a distant node $x$ (a multi-hop neighbor of $u$) affects $u$'s representation segments before binarization (e.g., $\boldsymbol{v}^{(l)}_{u}$). We define the feature enrichment ratio $\mathbb{E}_{x\rightarrow u}^{(l)}$ as the L1-norm of the Jacobian matrix $\big[{\partial\boldsymbol{v}^{(l)}_{u}}/{\partial\boldsymbol{v}^{(0)}_{x}}\big]$, which captures the absolute influence of fluctuations in the entries of $\boldsymbol{v}^{(0)}_{x}$ on $\boldsymbol{v}^{(l)}_{u}$, i.e., $\mathbb{E}_{x\rightarrow u}^{(l)}=\big\Vert\big[{\partial\boldsymbol{v}^{(l)}_{u}}/{\partial\boldsymbol{v}^{(0)}_{x}}\big]\big\Vert_{1}$. Focusing on an $l$-length path $h$ given by the node sequence $x_{h}^{l}, x_{h}^{l-1},\cdots,x_{h}^{1},x_{h}^{0}$, where $x_{h}^{l}=u$ and $x_{h}^{0}=x$, we apply the chain rule to decompose the derivative as:

(16) $\frac{\partial\boldsymbol{v}^{(l)}_{u}}{\partial\boldsymbol{v}^{(0)}_{x}}=\sum_{h=1}^{H}\left[\frac{\partial\boldsymbol{v}^{(l)}_{x_{h}^{l}}}{\partial\boldsymbol{v}^{(0)}_{x_{h}^{0}}}\right]_{h}=\sum_{h=1}^{H}\prod_{k=l}^{1}\frac{1}{\sqrt{|\mathcal{N}(x^{k}_{h})|}}\cdot\frac{1}{\sqrt{|\mathcal{N}(x^{k-1}_{h})|}}\cdot\boldsymbol{I}=\sqrt{\frac{|\mathcal{N}(u)|}{|\mathcal{N}(x)|}}\sum_{h=1}^{H}\prod_{k=1}^{l}\frac{1}{|\mathcal{N}(x^{k}_{h})|}\cdot\boldsymbol{I},$

where $H$ is the total number of paths between $u$ and $x$. Since all factors in the computation chain are positive, we have:

(17) $\mathbb{E}_{x\rightarrow u}^{(l)}=\left\Vert\left[\frac{\partial\boldsymbol{v}^{(l)}_{u}}{\partial\boldsymbol{v}^{(0)}_{x}}\right]\right\Vert_{1}=d\cdot\sqrt{\frac{|\mathcal{N}(u)|}{|\mathcal{N}(x)|}}\cdot\sum_{h=1}^{H}\prod_{k=1}^{l}\frac{1}{|\mathcal{N}(x^{k}_{h})|}.$

Note that the term $\sum_{h=1}^{H}\prod_{k=1}^{l}1/|\mathcal{N}(x^{k}_{h})|$ is exactly the probability that an $l$-step random walk starting at $u$ arrives at $x$, which can be interpreted as:

(18) $\mathbb{E}_{x\rightarrow u}^{(l)}\propto\frac{1}{\sqrt{|\mathcal{N}(x)|}}\cdot Prob(\text{$l$-step random walk from $u$ arrives at $x$}).$

Magnification Effect of Feature Uniqueness. Equation (18) implies that, given equal probabilities of visiting adjacent neighbors, distant nodes with smaller degrees (i.e., smaller $|\mathcal{N}(x)|$) contribute more feature influence to user $u$. Most importantly, in practice these $l$-hop neighbors of user $u$ usually represent esoteric and unique objects with less popularity. By collecting the intermediate information at different depths of the graph convolution, we achieve a feature magnification effect for all unique nodes within $L$ hops of graph exploration, which finally enriches $u$'s semantics in all embedding segments for quantization. A numeric illustration follows.
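As a numeric illustration of Equation (18) (the numbers are ours, not from the paper): suppose two 2-hop neighbors $x_1$ and $x_2$ of user $u$ are reached by a 2-step random walk with the same probability $p$, but $|\mathcal{N}(x_1)|=2$ (a niche item) and $|\mathcal{N}(x_2)|=50$ (a popular item). Then

$$\frac{\mathbb{E}_{x_1\rightarrow u}^{(2)}}{\mathbb{E}_{x_2\rightarrow u}^{(2)}}=\frac{p/\sqrt{|\mathcal{N}(x_1)|}}{p/\sqrt{|\mathcal{N}(x_2)|}}=\sqrt{\frac{50}{2}}=5,$$

i.e., the rarer, more "unique" neighbor contributes five times more feature influence to $u$'s pre-binarization segments.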

3.2. Complexity Analysis

To discuss the feasibility of realistic deployment, we compare BiGeaR with the best full-precision model, LightGCN (He et al., 2020), as both are end-to-end frameworks with offline model training and online prediction.

Training Time Complexity. Let $M$, $N$, and $E$ denote the numbers of users, items, and edges; $S$ and $B$ are the epoch number and batch size. We use BiGeaR$_{tch}$ and BiGeaR$_{std}$ to denote the pre-training version and the binarized one, respectively. As shown in Table 1: (1) Both BiGeaR$_{tch}$ and BiGeaR$_{std}$ have asymptotically similar graph convolution complexity to LightGCN (BiGeaR$_{std}$ takes an additional $O(2Sd(L+1)E)$ for binarization). (2) For the $\mathcal{L}_{BPR}$ computation, to prevent the over-smoothing issue (Li et al., 2019; Li et al., 2018), usually $L\leq 4$; compared to the convolution operation, the complexity of $\mathcal{L}_{BPR}$ is acceptable. (3) For the $\mathcal{L}_{ID}$ preparation, after training BiGeaR$_{tch}$, we first obtain the layer-wise prediction scores with $O(MNd(L+1))$ time complexity and then sort out the Top-R pseudo-positive samples for each user with $O(N+R\ln R)$. For BiGeaR$_{std}$, the layer-wise inference distillation from BiGeaR$_{tch}$ takes $O(SRd(L+1)E)$. (4) Estimating the gradients for BiGeaR$_{std}$ takes $O(2Sd(L+1)E)$ over all training samples.

Table 1. Training time complexity.
| | LightGCN | BiGeaR$_{tch}$ | BiGeaR$_{std}$ |
| Graph Normalization | $O(2E)$ | $O(2E)$ | - |
| Conv. and Quant. | $O\big(\frac{2SdE^{2}L}{B}\big)$ | $O\big(\frac{2SdE^{2}L}{B}\big)$ | $O\big(2Sd(\frac{E^{2}L}{B}+(L+1)E)\big)$ |
| $\mathcal{L}_{BPR}$ Loss | $O(2SdE)$ | $O(2Sd(L+1)E)$ | $O(2Sd(L+1)E)$ |
| $\mathcal{L}_{ID}$ Loss | - | $O\big(MNd(L+1)(N+R\ln R)\big)$ | $O(SRd(L+1)E)$ |
| Gradient Estimation | - | - | $O(2Sd(L+1)E)$ |

Memory Overhead and Prediction Acceleration. We measure the memory footprint of the embedding tables used for online prediction. From Table 2: (1) Theoretically, the embedding-size compression ratio of our model relative to LightGCN is $\frac{32d}{(L+1)(32+d)}$. Normally $L\leq 4$ and $d\geq 64$, so our model achieves at least 4× space compression. (2) As for the prediction time cost, we compare the numbers of binary operations (#BOP) and floating-point operations (#FLOP) between our model and LightGCN, and find that BiGeaR replaces most floating-point arithmetic (e.g., multiplication) with bitwise operations.

Table 2. Complexity of space cost and online prediction.
| | Embedding size | #FLOP | #BOP |
| LightGCN | $O\big(32(M+N)d\big)$ | $O(2MNd)$ | - |
| BiGeaR | $O\big((M+N)(L+1)(32+d)\big)$ | $O\big(4MN(L+1)\big)$ | $O\big(2MN(L+1)d\big)$ |
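As a quick sanity check of the ratio above (our arithmetic, with illustrative $d$ and $L$ values):

```python
# Embedding-size ratio from Table 2: LightGCN over BiGeaR = 32d / ((L+1)(32+d)).
def compression_ratio(d: int, L: int) -> float:
    return (32 * d) / ((L + 1) * (32 + d))

print(round(compression_ratio(64, 4), 2))   # 4.27x: worst case under L <= 4, d >= 64
print(round(compression_ratio(256, 2), 2))  # 9.48x: e.g., d = 256 as used in Section 4.3
```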

4. Experimental Results

We evaluate our model on Top-K recommendation task with the aim of answering the following research questions:

  • RQ1. How does BiGeaR perform compared to state-of-the-art full-precision and quantization-based models?

  • RQ2. How is the practical resource consumption of BiGeaR?

  • RQ3. How do proposed components affect the performance?

4.1. Experiment Setup

Datasets. To guarantee a fair comparison, we directly use five widely evaluated datasets (including the training/test splits): MovieLens (https://grouplens.org/datasets/movielens/1m/) (Tan et al., 2020; He et al., 2016; Chen et al., 2022b; Chen et al., 2022a), Gowalla (https://github.com/gusye1234/LightGCN-PyTorch/tree/master/data/gowalla) (Wang et al., 2019; Tan et al., 2020; He et al., 2020; Wang et al., 2020), Pinterest (https://sites.google.com/site/xueatalphabeta/dataset-1/pinterest_iccv) (Geng et al., 2015; Tan et al., 2020), Yelp2018 (https://github.com/gusye1234/LightGCN-PyTorch/tree/master/data/yelp2018) (Wang et al., 2019, 2020; He et al., 2020), and Amazon-Book (https://github.com/gusye1234/LightGCN-PyTorch/tree/master/data/amazon-book) (Wang et al., 2019, 2020; He et al., 2020). Dataset statistics and descriptions are reported in Table 3 and Appendix C.

Table 3. The statistics of datasets.
| | MovieLens | Gowalla | Pinterest | Yelp2018 | Amazon-Book |
| #Users | 6,040 | 29,858 | 55,186 | 31,668 | 52,643 |
| #Items | 3,952 | 40,981 | 9,916 | 38,048 | 91,599 |
| #Interactions | 1,000,209 | 1,027,370 | 1,463,556 | 1,561,406 | 2,984,108 |

Evaluation Metric. We adopt two widely used evaluation protocols, Recall@K and NDCG@K, to measure Top-K recommendation capability.

Competing Methods. We include the following recommender models: (1) 1-bit quantization-based methods, including graph-based ones (GumbelRec (Jang et al., 2017; Maddison et al., 2017; Zhang and Zhu, 2019), HashGNN (Tan et al., 2020)) and general model-based ones (LSH (Gionis et al., 1999), HashNet (Cao et al., 2017), CIGAR (Kang and McAuley, 2019)); and (2) full-precision models, including neural-network-based (NeurCF (He et al., 2017)) and graph-based (NGCF (Wang et al., 2019), DGCF (Wang et al., 2020), LightGCN (He et al., 2020)) ones. Detailed introductions of these methods are provided in Appendix D.

We exclude early quantization-based recommendation baselines, e.g., CH (Liu et al., 2014), DiscreteCF (Zhang et al., 2016), and DPR (Zhang et al., 2017), as well as full-precision solutions, e.g., GC-MC (Berg et al., 2017) and PinSage (Ying et al., 2018), mainly because the competing models listed above (Kang and McAuley, 2019; He et al., 2020; Wang et al., 2019; He et al., 2017) have already validated their superiority over them.

Experiment Settings. Our model is implemented in Python 3.7 and PyTorch 1.14.0 with non-distributed training. The experiments are run on a Linux machine with 1 NVIDIA V100 GPU, 4 Intel Core i7-8700 CPUs @ 3.20GHz, and 32 GB of RAM. For all baselines, we follow the officially reported hyper-parameter settings; for methods lacking recommended settings, we apply a grid search over hyper-parameters. The embedding dimension is searched in $\{32, 64, 128, 256, 512, 1024\}$. The learning rate $\eta$ is tuned within $\{10^{-4}, 10^{-3}, 10^{-2}\}$ and the coefficient of $L_2$ normalization $\lambda$ is tuned among $\{10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}\}$. We initialize and optimize all models with the default normal initializer and the Adam optimizer (Kingma and Ba, 2015). We report all hyper-parameters in Appendix E for reproducibility.

Table 4. Performance comparison (waveline and underline represent the best performing full-precision and quantization-based models).
| Model | MovieLens R@20 | MovieLens N@20 | Gowalla R@20 | Gowalla N@20 | Pinterest R@20 | Pinterest N@20 | Yelp2018 R@20 | Yelp2018 N@20 | Amazon-Book R@20 | Amazon-Book N@20 |
| NeurCF | 21.40 ± 1.51 | 37.91 ± 1.14 | 14.64 ± 1.75 | 23.17 ± 1.52 | 12.28 ± 1.88 | 13.41 ± 1.13 | 4.28 ± 0.71 | 7.24 ± 0.53 | 3.49 ± 0.75 | 6.71 ± 0.72 |
| NGCF | 24.69 ± 1.67 | 39.56 ± 1.26 | 16.22 ± 0.90 | 24.18 ± 1.23 | 14.67 ± 0.56 | 13.92 ± 0.44 | 5.89 ± 0.35 | 9.38 ± 0.52 | 3.65 ± 0.73 | 6.90 ± 0.65 |
| DGCF | 25.28 ± 0.39 | 45.98 ± 0.58 | 18.64 ± 0.30 | 25.20 ± 0.41 | 15.52 ± 0.42 | 16.51 ± 0.56 | 6.37 ± 0.55 | 11.08 ± 0.48 | 4.32 ± 0.34 | 7.73 ± 0.27 |
| LightGCN | 26.28 ± 0.20 | 46.04 ± 0.18 | 19.02 ± 0.19 | 25.71 ± 0.25 | 15.33 ± 0.28 | 16.29 ± 0.24 | 6.79 ± 0.31 | 12.17 ± 0.27 | 4.84 ± 0.09 | 8.11 ± 0.11 |
| BiGeaR | 25.57 ± 0.33 | 45.56 ± 0.31 | 18.36 ± 0.14 | 24.96 ± 0.17 | 15.57 ± 0.22 | 16.83 ± 0.46 | 6.47 ± 0.14 | 11.60 ± 0.18 | 4.68 ± 0.11 | 8.12 ± 0.12 |
| Capability | 97.30% | 98.96% | 96.53% | 97.08% | 100.32% | 101.94% | 95.29% | 95.32% | 96.69% | 100.12% |
| LSH | 11.38 ± 1.23 | 14.87 ± 0.76 | 8.14 ± 0.98 | 12.19 ± 0.86 | 7.88 ± 1.21 | 9.84 ± 0.90 | 2.91 ± 0.51 | 5.06 ± 0.67 | 2.41 ± 0.95 | 4.39 ± 1.16 |
| HashNet | 15.43 ± 1.73 | 24.78 ± 1.50 | 11.38 ± 1.25 | 16.50 ± 1.42 | 10.27 ± 1.48 | 11.64 ± 0.91 | 3.37 ± 0.78 | 7.31 ± 1.16 | 2.86 ± 1.51 | 4.75 ± 1.33 |
| CIGAR | 14.84 ± 1.44 | 24.63 ± 1.77 | 11.57 ± 1.01 | 16.77 ± 1.29 | 10.34 ± 0.97 | 11.87 ± 1.20 | 3.65 ± 0.90 | 7.87 ± 1.03 | 3.05 ± 1.32 | 4.98 ± 1.24 |
| GumbelRec | 16.62 ± 2.17 | 29.36 ± 2.53 | 12.26 ± 1.58 | 17.49 ± 1.08 | 10.53 ± 0.79 | 11.86 ± 0.86 | 3.85 ± 1.39 | 7.97 ± 1.59 | 2.69 ± 0.55 | 4.32 ± 0.47 |
| HashGNN$_h$ | 14.21 ± 1.67 | 24.39 ± 1.87 | 11.63 ± 1.47 | 16.82 ± 1.35 | 10.15 ± 1.43 | 11.96 ± 1.58 | 3.77 ± 1.02 | 7.75 ± 1.39 | 3.09 ± 1.29 | 5.19 ± 1.03 |
| HashGNN$_s$ | 19.87 ± 0.93 | 37.32 ± 0.81 | 13.45 ± 0.65 | 19.12 ± 0.68 | 12.38 ± 0.86 | 13.63 ± 0.75 | 4.86 ± 0.36 | 8.83 ± 0.27 | 3.34 ± 0.25 | 5.82 ± 0.24 |
| BiGeaR | 25.57 ± 0.33 | 45.56 ± 0.31 | 18.36 ± 0.14 | 24.96 ± 0.17 | 15.57 ± 0.22 | 16.83 ± 0.46 | 6.47 ± 0.14 | 11.60 ± 0.18 | 4.68 ± 0.11 | 8.12 ± 0.12 |
| Gain | 28.69% | 22.08% | 36.51% | 30.54% | 25.77% | 23.48% | 33.13% | 31.37% | 40.12% | 39.52% |
| p-value | 5.57e-7 | 2.64e-8 | 2.21e-7 | 7.69e-8 | 2.5e-5 | 3.51e-5 | 3.27e-6 | 5.30e-8 | 3.49e-6 | 7.14e-8 |

4.2. Performance Analysis (RQ1)

We evaluate Top-K recommendation by varying K in {20, 40, 60, 80, 100}. We summarize the Top@20 results in Table 4 for detailed comparison and plot the Top-K recommendation curves in Appendix F.1. From Table 4, we have the following observations:

  • Our model offers competitive recommendation capability relative to state-of-the-art full-precision recommender models. (1) BiGeaR generally outperforms most full-precision recommender models other than LightGCN over the five benchmarks. The main reason is that our model and LightGCN adopt a similar graph convolution methodology with network simplification (He et al., 2020), e.g., removing self-connections and feature transformation, which has been proved effective for Top-K ranking and recommendation. Moreover, BiGeaR collects different levels of interactive information at multiple depths of graph exploration, which significantly enriches the semantics of user-item representations for binarization. (2) Compared to the state-of-the-art method LightGCN, our model attains about 95%–102% of its performance w.r.t. Recall@20 and NDCG@20 across all datasets. This shows that our proposed designs effectively narrow the performance gap with the full-precision model LightGCN. Although the binarized embeddings learned by BiGeaR may not reach exact expressivity parity with the full-precision ones learned by LightGCN, considering the space compression and inference acceleration presented later, we argue that such performance is acceptable, especially for resource-limited deployment scenarios.

  • Compared to all binarization-based recommender models, BiGeaR delivers empirically remarkable and statistically significant performance improvements. (1) Two conventional methods (LSH, HashNet) designed for general item retrieval tasks underperform CIGAR, HashGNN, and BiGeaR, showing that a direct model adaptation may be too trivial for Top-K recommendation. (2) Compared to CIGAR, graph-based models generally work better. This is mainly because CIGAR combines general neural networks with learning-to-hash techniques for fast candidate generation, whereas graph-based models are more capable of exploring multi-hop interaction subgraphs to directly simulate the high-order collaborative filtering process. (3) Our model further outperforms HashGNN by about 26%–40% and 22%–40% w.r.t. Recall@20 and NDCG@20, proving the effectiveness of our proposed multi-faceted optimization components for embedding binarization. (4) Moreover, the significance test with p-value < 0.05 indicates that the improvements over all five benchmarks are statistically significant.

4.3. Resource Consumption Analysis (RQ2)

We analyze the resource consumption in training, online inference, and memory footprint by comparing with the two best competing models, i.e., LightGCN and HashGNN. Due to page limits, we report the empirical results on the MovieLens dataset in Table 5.

  (1)

    $T_{train}$: we set the batch size $B=2048$ and the dimension size $d=256$ for all models. We find that HashGNN is considerably more time-consuming than LightGCN and BiGeaR. This is because HashGNN adopts the early GCN framework (Hamilton et al., 2017) as its backbone, while LightGCN and BiGeaR utilize a more simplified graph convolution architecture in which operations such as self-connection, feature transformation, and nonlinear activation are all removed (He et al., 2020). Furthermore, BiGeaR needs 5.1s and 6.2s per epoch for pre-training and quantization, respectively, both of which take slightly more yet asymptotically similar time compared to LightGCN, basically following the complexity analysis in § 3.2.

  (2)

    $T_{infer}$: we randomly generate 1,000 queries for online prediction and conduct experiments with vanilla NumPy (https://www.lfd.uci.edu/~gohlke/pythonlibs/) on CPUs. We observe that HashGNN$_s$ takes a time cost similar to LightGCN. This is because, while HashGNN$_h$ purely binarizes the continuous embeddings, its relaxed version HashGNN$_s$ adopts a Bernoulli random variable to decide the probability of replacing quantized digits with the original real values (Tan et al., 2020). Thus, although HashGNN$_h$ can use the Hamming distance for prediction acceleration, HashGNN$_s$, with its improved recommendation accuracy, can only be computed with floating-point arithmetic. As for BiGeaR, thanks to its bitwise-operation-supported prediction, it runs about 8× faster than LightGCN while maintaining similar performance on the MovieLens dataset.

  (3)

    $S_{ET}$: we only store the embedding tables necessary for online inference. As explained above, HashGNN$_s$ represents part of its embeddings with randomly selected real values, which leads to an expansion of space consumption. In contrast, BiGeaR separately stores the binarized embeddings and the corresponding scalers, striking a balanced trade-off between recommendation accuracy and space overhead.

Table 5. Resource consumption on MovieLens dataset.
| | LightGCN | HashGNN$_h$ | HashGNN$_s$ | BiGeaR |
| $T_{train}$ / #epoch | 4.91s | 186.23s | 204.53s | (5.16+6.22)s |
| $T_{infer}$ / #query | 32.54ms | 2.45ms | 31.76ms | 3.94ms |
| $S_{ET}$ | 9.79MB | 0.34MB | 9.78MB | 1.08MB |
| Recall@20 | 26.28% | 14.21% | 19.87% | 25.57% |
Figure 4. Study of graph layer-wise quantization.

4.4. Study of Layer-wise Quantization (RQ3.A)

To verify the magnification of feature uniqueness in layer-wise quantization, we modify BiGeaR into two variants, denoted as BiGeaR$_{w/o\,LW}$ and BiGeaR$_{w/o\,FU}$. We report the results in Figure 4, denoting Recall@20 and NDCG@20 in red and blue, respectively, and varying color brightness for the different variants. From these results, we make the following observations.

  • Firstly, BiGeaR$_{w/o\,LW}$ discards the layer-wise quantization and adopts the conventional manner of quantizing only the final outputs after $L$ convolution iterations. We fix the dimension $d=256$ and vary the layer number $L$ for BiGeaR, and vary only the dimension $d$ for the variant BiGeaR$_{w/o\,LW}$ with fixed $L=2$. (1) Even when the dimension size is increased continuously from 64 to 1024, BiGeaR$_{w/o\,LW}$ improves Recall@20 and NDCG@20 only slowly. (2) By contrast, our layer-wise quantization improves performance more efficiently as $L$ increases from 0 to 3. When $L=4$, BiGeaR usually exhibits a conspicuous performance decay, mainly because of the common over-smoothing issue in graph-based models (Li et al., 2019; Li et al., 2018). Thus, with a moderate size $d=256$ and convolution number $L\leq 3$, BiGeaR attains better performance with acceptable computational complexity.

  • Secondly, BiGeaR$_{w/o\,FU}$ omits the feature magnification effect by adopting the aggregation used in HashGNN (Hamilton et al., 2017; Tan et al., 2020):

    (19) $\boldsymbol{v}^{(l)}_{x}=\sum_{z\in\mathcal{N}(x)}\frac{1}{|\mathcal{N}(z)|}\boldsymbol{v}^{(l-1)}_{z}.$

    Similar to the analysis in § 3.1, this modification disables the "magnification term" in Equation (18) and reduces it to a vanilla random walk for graph exploration. Although BiGeaR$_{w/o\,FU}$ presents curve trends similar to BiGeaR as $L$ increases, its general performance across all five datasets is unsatisfactory compared to BiGeaR. This validates the effectiveness of BiGeaR's magnification of unique latent features, which enriches user-item representations and accordingly boosts Top-K recommendation performance.

4.5. Study of Inference Distillation (RQ3.B)

4.5.1. Effect of Layer-wise Distillation.

We study the effectiveness of our inference distillation via two ablation variants, namely noID and endID. While noID removes our inference distillation from model training entirely, endID downgrades the original layer-wise distillation to distilling information only from the last layer of graph convolution. As shown in Table 6, both noID and endID cause notable performance degradation. Furthermore, the performance gap between endID and BiGeaR shows that conducting our inference distillation in a layer-wise manner is efficacious for further performance enhancement.

Table 6. Learning inference distillation.
| Variant | MovieLens R@20 | MovieLens N@20 | Gowalla R@20 | Gowalla N@20 | Pinterest R@20 | Pinterest N@20 | Yelp2018 R@20 | Yelp2018 N@20 | Amazon-Book R@20 | Amazon-Book N@20 |
| noID | 24.40 | 44.06 | 17.85 | 24.28 | 15.23 | 16.38 | 6.18 | 11.22 | 4.07 | 7.31 |
| (vs. BiGeaR) | -4.58% | -3.29% | -2.78% | -2.72% | -2.18% | -2.85% | -4.48% | -3.28% | -13.03% | -9.98% |
| endID | 25.02 | 44.75 | 18.05 | 24.73 | 15.28 | 16.58 | 6.29 | 11.37 | 4.44 | 7.78 |
| (vs. BiGeaR) | -2.15% | -1.78% | -1.69% | -0.92% | -1.86% | -1.49% | -2.78% | -1.98% | -5.13% | -4.19% |
| KLD | 24.32 | 44.38 | 17.63 | 24.07 | 14.78 | 15.92 | 5.83 | 10.36 | 4.13 | 7.21 |
| (vs. BiGeaR) | -4.89% | -2.59% | -3.98% | -3.57% | -5.07% | -5.41% | -9.89% | -10.69% | -11.75% | -11.21% |
| BiGeaR | 25.57 | 45.56 | 18.36 | 24.96 | 15.57 | 16.83 | 6.47 | 11.60 | 4.68 | 8.12 |

4.5.2. Conventional Knowledge Distillation.

To compare with the conventional approach, we modify BiGeaR by applying the KL divergence between the layer-wise teacher and student logits, i.e., $\widehat{\boldsymbol{y}}_{u}^{\,tch,(l)}$ vs. $\widehat{\boldsymbol{y}}_{u}^{\,std,(l)}$. We denote this variant as KLD. As shown in Table 6, conventional knowledge distillation with KL divergence leads to suboptimal performance. This is because KL divergence encourages the student to match the teacher's logit distribution, but users' relative order of item preferences cannot be well learned from this process. Compared to the conventional approach, our proposed layer-wise inference distillation is thus more effective for distilling ranking information.

4.6. Study of Gradient Estimation (RQ3.C)

We compare our gradient estimation with several recently studied estimators, including Tanh-like (Qin et al., 2020; Gong et al., 2019), SSwish (Darabi et al., 2018), Sigmoid (Yang et al., 2019), and the projection-based estimator (Liu et al., 2019) (denoted as PBE), by implementing them in BiGeaR. We report their Recall@20 in Figure 5 and compute the performance gain of our approach over these estimators accordingly. We have two main observations:

  (1)

    Our proposed approach shows consistent superiority over all other gradient estimators. These estimators usually adopt visually similar functions, e.g., tanh(·), to approximate $\operatorname{sign}(\cdot)$. However, such functions are not necessarily theoretically related to $\operatorname{sign}(\cdot)$, which may lead to inaccurate gradient estimation. On the contrary, as explained in § 2.4, we translate the unit-step function $u(\cdot)$ into $\operatorname{sign}(\cdot)$ via $\operatorname{sign}(\cdot)=2u(\cdot)-1$, and then estimate the gradients of $\operatorname{sign}(\cdot)$ with the approximated derivatives of $u(\cdot)$. In other words, our approach follows the main optimization direction of the factual gradients of $\operatorname{sign}(\cdot)$; unlike previous solutions, this guarantees coordination between forward and backward propagation.

  (2)

    Furthermore, compared to the other four datasets, MovieLens shows a larger performance disparity between our approach and the others. This is because MovieLens is much denser than the other datasets, i.e., $\frac{\#Interactions}{\#Users\cdot\#Items}=0.0419\gg\{0.00084, 0.00267, 0.0013, 0.00062\}$, meaning that users tend to have more item interactions and more complicated preferences over different items. Consequently, this poses a higher requirement on the gradient estimator's capability in learning ranking information. The empirical results in Figure 5 demonstrate that our solution fulfills this requirement well, especially for dense interaction graphs.

Figure 5. Gradient estimator comparison w.r.t. Recall@20.

5. Related Work

Full-precision recommender models. (1) Collaborative Filtering (CF) is a prevalent methodology in modern recommender systems (Covington et al., 2016; Ying et al., 2018; Yang et al., 2022). Earlier CF methods, e.g., Matrix Factorization (Koren et al., 2009; Rendle et al., 2012), reconstruct historical interactions to learn user-item embeddings. Recent neural-network-based models, e.g., NeurCF (He et al., 2017) and attention-based models (Chen et al., 2017; He et al., 2018), further boost performance with neural networks. (2) Graph-based methods focus on studying the interaction graph topology for recommendation. Graph convolutional networks (GCNs) (Hamilton et al., 2017; Kipf and Welling, 2017) inspire early work, e.g., GC-MC (Berg et al., 2017), PinSage (Ying et al., 2018), and recent models, e.g., NGCF (Wang et al., 2019), DGCF (Wang et al., 2020), and LightGCN (He et al., 2020), mainly because they can well simulate the high-order CF signals among high-hop neighbors for recommendation.

Learning to hash. Hashing-based methods map dense floating-point embeddings into binary spaces to accelerate Approximate Nearest Neighbor (ANN) search. The representative model LSH (Gionis et al., 1999) has inspired a series of work on various tasks, e.g., fast retrieval of images (Cao et al., 2017), documents (Li et al., 2014; Zhang and Zhu, 2020), and categorical information (Kang et al., 2021). For Top-K recommendation, models such as DCF (Zhang et al., 2016) and DPR (Zhang et al., 2017) incorporate neural network architectures. A recent work, CIGAR (Kang and McAuley, 2019), proposes adaptive model designs for fast candidate generation. To exploit the graph structure of user-item interactions, HashGNN (Tan et al., 2020) applies hashing techniques to graph neural networks for recommendation. However, one major issue is that relying solely on the learned binary codes for prediction usually causes a large performance decay. To alleviate this, CIGAR further equips additional full-precision recommender models (e.g., BPR-MF (Rendle et al., 2012)) for fine-grained re-ranking, and HashGNN proposes a relaxed version that mixes full-precision and binary embedding codes.

Quantization-based models. Quantization-based models share similar techniques with hashing-based methods; e.g., $\operatorname{sign}(\cdot)$ is usually adopted mainly because of its simplicity. However, quantization-based models do not pursue extreme encoding compression, and thus they develop multi-bit, 2-bit, and 1-bit quantization for performance adaptation. Recently, there has been growing attention to quantizing graph-based models, such as Bi-GCN (Wang et al., 2021) and BGCN (Bahri et al., 2021). However, these two models are mainly designed for geometric classification tasks, and their capability in product recommendation is unclear. Thus, in this paper we propose BiGeaR to learn 1-bit user-item representation quantization for Top-K recommendation. Different from binary hashing-based methods, BiGeaR makes predictions within its own framework, striking a balanced trade-off between efficiency and performance.

6. Conclusion and Future Work

In this paper, we present BiGeaR, which learns binarized graph representations for recommendation with multi-faceted binarization techniques. Extensive experiments not only validate its performance superiority over competing binarization-based recommender systems, but also justify the effectiveness of all proposed model components. In the future, we plan to investigate two main directions. (1) It is worth developing binarization techniques for model-agnostic recommender systems with diverse learning settings (Song et al., 2021; Yang et al., 2021; Song et al., 2022). (2) Instead of using $\operatorname{sign}(\cdot)$ for quantization, developing compact multi-bit quantization methods with similarity preservation is promising for improving ranking accuracy.

Acknowledgements.
The work described in this paper was partially supported by the National Key Research and Development Program of China (No. 2018AAA0100204), the Research Grants Council of the Hong Kong Special Administrative Region, China (CUHK 2410021, Research Impact Fund, No. R5034-18), and the CUHK Direct Grant (4055147).

References

  • Anil et al. (2018) Rohan Anil, Gabriel Pereyra, Alexandre Passos, Robert Ormandi, George E Dahl, and Geoffrey E Hinton. 2018. Large scale distributed neural network training through online distillation. ICLR.
  • Bahri et al. (2021) Mehdi Bahri, Gaétan Bahl, and Stefanos Zafeiriou. 2021. Binary Graph Neural Networks. In CVPR. 9492–9501.
  • Banner et al. (2018) Ron Banner, Itay Hubara, Elad Hoffer, and Daniel Soudry. 2018. Scalable methods for 8-bit training of neural networks. NeurIPS 31.
  • Bengio et al. (2013) Yoshua Bengio, Nicholas Léonard, and Aaron Courville. 2013. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv.
  • Berg et al. (2017) Rianne van den Berg, Thomas N Kipf, and Max Welling. 2017. Graph convolutional matrix completion. arXiv preprint arXiv:1706.02263.
  • Bracewell and Bracewell (1986) Ronald Newbold Bracewell and Ronald N Bracewell. 1986. The Fourier transform and its applications. Vol. 31999. McGraw-Hill New York.
  • Cao et al. (2017) Zhangjie Cao, Mingsheng Long, Jianmin Wang, and Philip S Yu. 2017. Hashnet: Deep learning to hash by continuation. In ICCV. 5608–5617.
  • Chen et al. (2017) Jingyuan Chen, Hanwang Zhang, Xiangnan He, Liqiang Nie, Wei Liu, and Tat-Seng Chua. 2017. Attentive collaborative filtering: Multimedia recommendation with item-and component-level attention. In SIGIR. 335–344.
  • Chen et al. (2022b) Yankai Chen, Menglin Yang, Yingxue Zhang, Mengchen Zhao, Ziqiao Meng, Jianye Hao, and Irwin King. 2022b. Modeling Scale-free Graphs with Hyperbolic Geometry for Knowledge-aware Recommendation. WSDM.
  • Chen et al. (2022a) Yankai Chen, Yaming Yang, Yujing Wang, Jing Bai, Xiangchen Song, and Irwin King. 2022a. Attentive Knowledge-aware Graph Convolutional Networks with Collaborative Guidance for Personalized Recommendation. ICDE.
  • Covington et al. (2016) Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for youtube recommendations. In Recsys. 191–198.
  • Darabi et al. (2018) Sajad Darabi, Mouloud Belbahri, Matthieu Courbariaux, and Vahid Partovi Nia. 2018. Bnn+: Improved binary network training. arXiv.
  • Erin Liong et al. (2015) Venice Erin Liong, Jiwen Lu, Gang Wang, Pierre Moulin, and Jie Zhou. 2015. Deep hashing for compact binary codes learning. In CVPR. 2475–2483.
  • function (2022) Step function. 2022. https://en.wikipedia.org/wiki/Heaviside_step_function.
  • Geng et al. (2015) Xue Geng, Hanwang Zhang, Jingwen Bian, and Tat-Seng Chua. 2015. Learning image and user features for recommendation in social networks. In ICCV.
  • Gionis et al. (1999) Aristides Gionis, Piotr Indyk, Rajeev Motwani, et al. 1999. Similarity search in high dimensions via hashing. In VLDB, Vol. 99. 518–529.
  • Gong et al. (2019) Ruihao Gong, Xianglong Liu, Shenghu Jiang, Tianxiang Li, Peng Hu, Jiazhen Lin, Fengwei Yu, and Junjie Yan. 2019. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In ICCV. 4852–4861.
  • Hamilton et al. (2017) William L Hamilton, Rex Ying, and Jure Leskovec. 2017. Inductive representation learning on large graphs. In NeurIPS. 1025–1035.
  • Håstad (2001) Johan Håstad. 2001. Some optimal inapproximability results. Journal of the ACM (JACM) 48, 4, 798–859.
  • He and McAuley (2016) Ruining He and Julian McAuley. 2016. Modeling the visual evolution of fashion trends with one-class collaborative filtering. In WWW. 507–517.
  • He et al. (2020) Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. Lightgcn: Simplifying and powering graph convolution network for recommendation. In SIGIR. 639–648.
  • He et al. (2018) Xiangnan He, Zhankui He, Jingkuan Song, Zhenguang Liu, Yu-Gang Jiang, and Tat-Seng Chua. 2018. Nais: Neural attentive item similarity model for recommendation. TKDE 30, 12, 2354–2366.
  • He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In WWW. 173–182.
  • He et al. (2016) Xiangnan He, Hanwang Zhang, Min-Yen Kan, and Tat-Seng Chua. 2016. Fast matrix factorization for recommendation with implicit feedback. In SIGIR.
  • Hinton et al. (2015) Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. 2015. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
  • Jang et al. (2017) Eric Jang, Shixiang Gu, and Ben Poole. 2017. Categorical reparameterization with gumbel-softmax. In 5th ICLR.
  • Kang et al. (2021) Wang-Cheng Kang, Derek Zhiyuan Cheng, Tiansheng Yao, Xinyang Yi, Ting Chen, Lichan Hong, and Ed H Chi. 2021. Learning to embed categorical features without embedding tables for recommendation. SIGKDD.
  • Kang and McAuley (2019) Wang-Cheng Kang and Julian McAuley. 2019. Candidate generation with binary codes for large-scale top-n recommendation. In CIKM. 1523–1532.
  • Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR.
  • Kipf and Welling (2017) Thomas N Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In 5th ICLR.
  • Koh and Liang (2017) Pang Wei Koh and Percy Liang. 2017. Understanding black-box predictions via influence functions. In ICML. PMLR, 1885–1894.
  • Koren et al. (2009) Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8, 30–37.
  • Li et al. (2019) Guohao Li, Matthias Muller, Ali Thabet, and Bernard Ghanem. 2019. Deepgcns: Can gcns go as deep as cnns?. In ICCV. 9267–9276.
  • Li et al. (2014) Hao Li, Wei Liu, and Heng Ji. 2014. Two-Stage Hashing for Fast Document Retrieval.. In ACL (2). 495–500.
  • Li et al. (2018) Qimai Li, Zhichao Han, and Xiao-Ming Wu. 2018. Deeper insights into graph convolutional networks for semi-supervised learning. In AAAI.
  • Liang et al. (2016) Dawen Liang, Laurent Charlin, James McInerney, and David M Blei. 2016. Modeling user exposure in recommendation. In WWW. 951–961.
  • Lin et al. (2017) Xiaofan Lin, Cong Zhao, and Wei Pan. 2017. Towards accurate binary convolutional neural network. In NeurIPS.
  • Liu et al. (2019) Chunlei Liu, Wenrui Ding, Xin Xia, Yuan Hu, Baochang Zhang, Jianzhuang Liu, Bohan Zhuang, and Guodong Guo. 2019. RBCN: Rectified binary convolutional networks for enhancing the performance of 1-bit DCNNs. arXiv.
  • Liu et al. (2014) Xianglong Liu, Junfeng He, Cheng Deng, and Bo Lang. 2014. Collaborative hashing. In CVPR. 2139–2146.
  • Maddison et al. (2017) Chris J Maddison, Andriy Mnih, and Yee Whye Teh. 2017. The concrete distribution: A continuous relaxation of discrete random variables. In 5th ICLR.
  • Park and Tuzhilin (2008) Yoon-Joo Park and Alexander Tuzhilin. 2008. The long tail of recommender systems and how to leverage it. In RecSys. 11–18.
  • product statistics (2022) Amazon product statistics. 2022. https://www.retailtouchpoints.com/resources/how-many-products-does-amazon-carry.
  • Qin et al. (2020) Haotong Qin, Ruihao Gong, Xianglong Liu, Mingzhu Shen, Ziran Wei, Fengwei Yu, and Jingkuan Song. 2020. Forward and backward information retention for accurate binary neural networks. In CVPR. 2250–2259.
  • Rastegari et al. (2016) Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali Farhadi. 2016. Xnor-net: Imagenet classification using binary convolutional neural networks. In ECCV. Springer, 525–542.
  • Rendle and Freudenthaler (2014) Steffen Rendle and Christoph Freudenthaler. 2014. Improving pairwise learning for item recommendation from implicit feedback. In WSDM. 273–282.
  • Rendle et al. (2012) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2012. BPR: Bayesian personalized ranking from implicit feedback. arXiv.
  • Song et al. (2021) Zixing Song, Ziqiao Meng, Yifei Zhang, and Irwin King. 2021. Semi-supervised Multi-label Learning for Graph-structured Data. In CIKM. ACM, 1723–1733.
  • Song et al. (2022) Zixing Song, Xiangli Yang, Zenglin Xu, and Irwin King. 2022. Graph-based semi-supervised learning: A comprehensive review. TNNLS.
  • Tailor et al. (2021) Shyam A Tailor, Javier Fernandez-Marques, and Nicholas D Lane. 2021. Degree-quant: Quantization-aware training for graph neural networks. 9th ICLR.
  • Tan et al. (2020) Qiaoyu Tan, Ninghao Liu, Xing Zhao, Hongxia Yang, Jingren Zhou, and Xia Hu. 2020. Learning to hash with GNNs for recommender systems. In WWW. 1988–1998.
  • Tang and Wang (2018) Jiaxi Tang and Ke Wang. 2018. Learning compact ranking models with high performance for recommender system. In SIGKDD. 2289–2298.
  • user statistics (2022) Amazon user statistics. 2022. https://backlinko.com/amazon-prime-users.
  • Veličković et al. (2018) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2018. Graph attention networks. ICLR.
  • Wang et al. (2021) Junfu Wang, Yunhong Wang, Zhen Yang, Liang Yang, and Yuanfang Guo. 2021. Bi-gcn: Binary graph convolutional network. In CVPR. 1561–1570.
  • Wang et al. (2017) Jingdong Wang, Ting Zhang, Nicu Sebe, Heng Tao Shen, et al. 2017. A survey on learning to hash. TPAMI 40, 4, 769–790.
  • Wang et al. (2019) Xiang Wang, Xiangnan He, Meng Wang, Fuli Feng, and Tat-Seng Chua. 2019. Neural graph collaborative filtering. In SIGIR. 165–174.
  • Wang et al. (2020) Xiang Wang, Hongye Jin, An Zhang, Xiangnan He, Tong Xu, and Tat-Seng Chua. 2020. Disentangled graph collaborative filtering. In SIGIR. 1001–1010.
  • Wu et al. (2019) Felix Wu, Amauri Souza, Tianyi Zhang, Christopher Fifty, Tao Yu, and Kilian Weinberger. 2019. Simplifying graph convolutional networks. In ICML. PMLR.
  • Wu et al. (2020) Zonghan Wu, Shirui Pan, Fengwen Chen, Guodong Long, Chengqi Zhang, and S Yu Philip. 2020. A comprehensive survey on graph neural networks. IEEE TNNLS 32, 1, 4–24.
  • Xie et al. (2020) Qizhe Xie, Minh-Thang Luong, Eduard Hovy, and Quoc V Le. 2020. Self-training with noisy student improves imagenet classification. In CVPR. 10687–10698.
  • Xu et al. (2018) Keyulu Xu, Chengtao Li, Yonglong Tian, Tomohiro Sonobe, Ken-ichi Kawarabayashi, and Stefanie Jegelka. 2018. Representation learning on graphs with jumping knowledge networks. In ICML. PMLR, 5453–5462.
  • Yang et al. (2019) Jiwei Yang, Xu Shen, Jun Xing, Xinmei Tian, Houqiang Li, Bing Deng, Jianqiang Huang, and Xian-sheng Hua. 2019. Quantization networks. In CVPR. 7308–7316.
  • Yang et al. (2021) Menglin Yang, Min Zhou, Marcus Kalander, Zengfeng Huang, and Irwin King. 2021. Discrete-time Temporal Network Embedding via Implicit Hierarchical Learning in Hyperbolic Space. In SIGKDD. 1975–1985.
  • Yang et al. (2022) Menglin Yang, Min Zhou, Jiahong Liu, Defu Lian, and Irwin King. 2022. HRCF: Enhancing collaborative filtering via hyperbolic geometric regularization. In WebConf. 2462–2471.
  • Ying et al. (2018) Rex Ying, Ruining He, Kaifeng Chen, Pong Eksombatchai, William L Hamilton, and Jure Leskovec. 2018. Graph convolutional neural networks for web-scale recommender systems. In SIGKDD. 974–983.
  • Zhang et al. (2016) Hanwang Zhang, Fumin Shen, Wei Liu, Xiangnan He, Huanbo Luan, and Tat-Seng Chua. 2016. Discrete collaborative filtering. In SIGIR. 325–334.
  • Zhang et al. (2017) Yan Zhang, Defu Lian, and Guowu Yang. 2017. Discrete personalized ranking for fast collaborative filtering from implicit feedback. In AAAI, Vol. 31.
  • Zhang and Zhu (2019) Yifei Zhang and Hao Zhu. 2019. Doc2hash: Learning discrete latent variables for documents retrieval. In ACL. 2235–2240.
  • Zhang and Zhu (2020) Yifei Zhang and Hao Zhu. 2020. Discrete Wasserstein Autoencoders for Document Retrieval. In ICASSP. IEEE, 8159–8163.
  • Zhu et al. (2016) Han Zhu, Mingsheng Long, Jianmin Wang, and Yue Cao. 2016. Deep hashing network for efficient similarity retrieval. In AAAI, Vol. 30.
Table 1. Notations and meanings.
Notation Explanation
d, L Embedding dimension and number of graph convolution layers.
\mathcal{U}, \mathcal{I} Collections of users and items.
\mathcal{N}(x) Neighbors of node x.
\boldsymbol{v}_{x}^{(l)} Full-precision embedding of node x at the l-th convolution.
\boldsymbol{q}_{x}^{(l)} Binarized embedding of node x at the l-th quantization.
\alpha_{x}^{(l)} l-th embedding scaler of node x.
\mathcal{A}_{x} and \mathcal{Q}_{x} Binarized embedding tables (with scalers) of node x learned by BiGeaR.
w_{l} l-th weight in predicting the matching score.
y_{u,i} A scalar indicating the existence of a user-item interaction.
\widehat{y}^{\,tch}_{u,i} Predicted score based on full-precision embeddings.
\widehat{y}^{\,std}_{u,i} Predicted score based on binarized embeddings.
\boldsymbol{\widehat{y}}^{tch,\,(l)}_{u} Predicted scores of u based on the l-th embedding segments.
\boldsymbol{\widehat{y}}^{std,\,(l)}_{u} Predicted scores of u based on the l-th quantized segments.
S_{tch}^{(l)}(u) Pseudo-positive training samples of u.
w_{k} k-th weight in the inference distillation loss.
\mathcal{L}_{BPR}^{tch}, \mathcal{L}_{BPR}^{std} BPR losses based on full-precision and binarized scores.
\mathcal{L}_{ID} Inference distillation loss.
\mathcal{L} Objective function of BiGeaR.
u(\cdot), \delta(\cdot) Unit-step function and Dirac delta function.
\lambda, \lambda_{1}, \lambda_{2}, \gamma, \eta Hyper-parameters and the learning rate.

Appendix A Notation Table

We list key notations in Table 1.

Figure 1. Top-K recommendation curve.

Appendix B Pseudo-code of BiGeaR

The pseudo-code of BiGeaR is given in Algorithm 1.

Input: Interaction graph; trainable embeddings \boldsymbol{v}_{\{\cdots\}}; hyper-parameters: L, \eta, \lambda, \lambda_{1}, \lambda_{2}, \gamma.
Output: Prediction function \mathcal{F}(u,i).
\mathcal{A}_{u}\leftarrow\emptyset, \mathcal{A}_{i}\leftarrow\emptyset, \mathcal{Q}_{u}\leftarrow\emptyset, \mathcal{Q}_{i}\leftarrow\emptyset;
while BiGeaR has not converged do
    for l=1,\cdots,L do
        \boldsymbol{v}^{(l)}_{u}\leftarrow\sum_{i\in\mathcal{N}(u)}\frac{1}{\sqrt{|\mathcal{N}(u)|\cdot|\mathcal{N}(i)|}}\boldsymbol{v}^{(l-1)}_{i};
        \boldsymbol{v}^{(l)}_{i}\leftarrow\sum_{u\in\mathcal{N}(i)}\frac{1}{\sqrt{|\mathcal{N}(i)|\cdot|\mathcal{N}(u)|}}\boldsymbol{v}^{(l-1)}_{u};
        if with inference distillation then
            \boldsymbol{q}_{u}^{(l)}\leftarrow\operatorname{sign}\big(\boldsymbol{v}^{(l)}_{u}\big), \boldsymbol{q}_{i}^{(l)}\leftarrow\operatorname{sign}\big(\boldsymbol{v}^{(l)}_{i}\big);
            \alpha_{u}^{(l)}\leftarrow\frac{||\boldsymbol{V}_{u}^{(l)}||_{1}}{d}, \alpha_{i}^{(l)}\leftarrow\frac{||\boldsymbol{V}_{i}^{(l)}||_{1}}{d};
            Update (\mathcal{A}_{u}, \mathcal{Q}_{u}) and (\mathcal{A}_{i}, \mathcal{Q}_{i}) with \alpha_{u}^{(l)}\boldsymbol{q}_{u}^{(l)} and \alpha_{i}^{(l)}\boldsymbol{q}_{i}^{(l)};
    \widehat{y}^{\,tch}_{u,i}\leftarrow\big\langle\,\big|\big|_{l=0}^{L}w_{l}\boldsymbol{v}_{u}^{(l)},\ \big|\big|_{l=0}^{L}w_{l}\boldsymbol{v}_{i}^{(l)}\,\big\rangle;
    if with inference distillation then
        \boldsymbol{q}_{u}^{(0)}\leftarrow\operatorname{sign}\big(\boldsymbol{v}^{(0)}_{u}\big), \boldsymbol{q}_{i}^{(0)}\leftarrow\operatorname{sign}\big(\boldsymbol{v}^{(0)}_{i}\big);
        \alpha_{u}^{(0)}\leftarrow\frac{||\boldsymbol{V}_{u}^{(0)}||_{1}}{d}, \alpha_{i}^{(0)}\leftarrow\frac{||\boldsymbol{V}_{i}^{(0)}||_{1}}{d};
        Update (\mathcal{A}_{u}, \mathcal{Q}_{u}) and (\mathcal{A}_{i}, \mathcal{Q}_{i}) with \alpha_{u}^{(0)}\boldsymbol{q}_{u}^{(0)} and \alpha_{i}^{(0)}\boldsymbol{q}_{i}^{(0)};
        \widehat{y}^{\,std}_{u,i}\leftarrow\big\langle f(\mathcal{A}_{u},\mathcal{Q}_{u}),\ f(\mathcal{A}_{i},\mathcal{Q}_{i})\big\rangle;
        \{\widehat{y}^{tch,\,(l)}_{u,i}\}_{l=0,1,\cdots,L}\leftarrow get score segments from \widehat{y}^{\,tch}_{u,i};
        \{\widehat{y}^{std,\,(l)}_{u,i}\}_{l=0,1,\cdots,L}\leftarrow get score segments from \widehat{y}^{\,std}_{u,i};
        \mathcal{L}_{ID}\leftarrow compute the inference distillation loss with \{\widehat{y}^{tch,\,(l)}_{u,i}\} and \{\widehat{y}^{std,\,(l)}_{u,i}\};
        \mathcal{L}\leftarrow compute \mathcal{L}^{std}_{BPR} and \mathcal{L}_{ID};
    else
        \mathcal{L}\leftarrow compute \mathcal{L}^{tch}_{BPR};
    Optimize BiGeaR with regularization;
return \mathcal{F}.
Algorithm 1 BiGeaR algorithm.
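
For readers who prefer executable code, below is a minimal PyTorch-style sketch of the layer-wise quantization step in Algorithm 1 (the \operatorname{sign}(\cdot) binarization with the L1-norm scaler) and of the teacher score. It is an illustration under simplifying assumptions rather than the released implementation; in particular, the straight-through estimator used to back-propagate through \operatorname{sign}(\cdot) is a common stand-in and may differ from the gradient estimation adopted in the paper.

import torch

def binarize_with_scaler(v):
    # v: [n, d] full-precision embeddings of one convolution layer.
    # Returns the layer scaler alpha = ||v||_1 / d (per row) and the {-1, +1}
    # codes q = sign(v); a straight-through trick keeps v trainable.
    d = v.shape[-1]
    alpha = v.abs().sum(dim=-1, keepdim=True) / d
    q = torch.sign(v)
    q = (q - v).detach() + v          # forward: sign(v); backward: identity
    return alpha, q

def teacher_score(v_u_layers, v_i_layers, w):
    # Inner product of the layer-wise weighted concatenations,
    # i.e., < ||_{l} w_l v_u^(l), ||_{l} w_l v_i^(l) > in Algorithm 1.
    z_u = torch.cat([w[l] * v for l, v in enumerate(v_u_layers)], dim=-1)
    z_i = torch.cat([w[l] * v for l, v in enumerate(v_i_layers)], dim=-1)
    return (z_u * z_i).sum(dim=-1)

The scaled codes \alpha^{(l)}\boldsymbol{q}^{(l)} are what populate (\mathcal{A}, \mathcal{Q}) and are later combined by f(\cdot,\cdot) for the student score, as in Algorithm 1.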

Appendix C Datasets

  • MovieLens (Tan et al., 2020; He et al., 2016; Chen et al., 2022b; Chen et al., 2022a) is a widely adopted benchmark for movie recommendation. Similar to the setting in (Tan et al., 2020; He et al., 2016; Chen et al., 2022b), y_{u,i}=1 if user u has an explicit rating for item i, and y_{u,i}=0 otherwise. In this paper, we use the MovieLens-1M data split.

  • Gowalla (Wang et al., 2019; Tan et al., 2020; He et al., 2020; Wang et al., 2020) is the check-in dataset (Liang et al., 2016) collected from Gowalla, where users share their locations by check-in. To guarantee the quality of the dataset, we extract users and items with no less than 10 interactions similar to (Wang et al., 2019; Tan et al., 2020; He et al., 2020; Wang et al., 2020).

  • Pinterest (Geng et al., 2015; Tan et al., 2020) is an implicit feedback dataset for image recommendation (Geng et al., 2015). Users and images are modeled in a graph. Edges represent the pins over images initiated by users. In this dataset, each user has at least 20 edges.

  • Yelp2018 (Wang et al., 2019, 2020; He et al., 2020) is collected from Yelp Challenge 2018 Edition, where local businesses such as restaurants are treated as items. We retain users and items with over 10 interactions similar to (Wang et al., 2019, 2020; He et al., 2020).

  • Amazon-Book (Wang et al., 2019, 2020; He et al., 2020) is organized from the book collection of Amazon-review for product recommendation (He and McAuley, 2016). Similar to (Wang et al., 2019; He et al., 2020; Wang et al., 2020), we use the 10-core setting so that each graph node has at least ten interactions (a minimal filtering sketch follows this list).
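
The ≥10-interaction (10-core) filtering mentioned above can be realized in a few lines of code. The following sketch is illustrative only: the column names 'user' and 'item' are assumptions, and it is not the exact preprocessing script used for the benchmarks.

import pandas as pd

def k_core_filter(df: pd.DataFrame, k: int = 10) -> pd.DataFrame:
    # Iteratively drop users and items with fewer than k interactions until
    # every remaining user and item has at least k interactions.
    while True:
        keep = (df['user'].map(df['user'].value_counts()) >= k) & \
               (df['item'].map(df['item'].value_counts()) >= k)
        if keep.all():
            return df
        df = df[keep]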

Appendix D Competing Methods

  • LSH (Gionis et al., 1999) is a representative hashing method to approximate similarity search for massive high-dimensional data. We follow the adaptation in (Tan et al., 2020) to apply it to Top-K recommendation.

  • HashNet (Cao et al., 2017) is a state-of-the-art deep hashing method originally proposed for multimedia retrieval tasks. We use the same adaptation strategy as in (Tan et al., 2020) to apply it to recommendation.

  • CIGAR (Kang and McAuley, 2019) is a hashing-based method for fast item candidate generation, followed by complex full-precision re-ranking algorithms. We use its quantization part for a fair comparison.

  • GumbelRec is a variant of our model that implements Gumbel-softmax for categorical variable quantization (Jang et al., 2017; Maddison et al., 2017; Zhang and Zhu, 2019). GumbelRec replaces the \operatorname{sign}(\cdot) function with the Gumbel-softmax trick for embedding binarization (see the sketch after this list).

  • HashGNN (Tan et al., 2020) is the state-of-the-art end-to-end 1-bit quantization recommender system. HashGNN_h denotes its vanilla hard-encoding version, and HashGNN_s is the relaxed version that replaces several quantized digits with full-precision ones.

  • NeurCF (He et al., 2017) is a classical neural network model to capture user-item nonlinear feature interactions for collaborative filtering.

  • NGCF (Wang et al., 2019) is a state-of-the-art graph-based collaborative filtering model that largely follows the standard GCN (Kipf and Welling, 2017).

  • DGCF (Wang et al., 2020) is one of the latest graph-based models; it learns disentangled user intents for better Top-K recommendation.

  • LightGCN (He et al., 2020) is another recent GCN-based recommender system that offers a more concise yet powerful model structure with state-of-the-art performance.
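
To make the GumbelRec baseline concrete, the sketch below shows one way to replace \operatorname{sign}(\cdot) with the Gumbel-softmax trick for embedding binarization. It is a hypothetical illustration, treating each embedding dimension as a two-way choice between +1 and -1, and is not GumbelRec's actual implementation.

import torch
import torch.nn.functional as F

def gumbel_binarize(v, tau=1.0, hard=True):
    # v: [n, d] embeddings. Each dimension picks +1 or -1 via a differentiable
    # (straight-through) Gumbel-softmax sample over two categories whose
    # logits are derived from the embedding value itself.
    logits = torch.stack([v, -v], dim=-1)             # [n, d, 2]
    y = F.gumbel_softmax(logits, tau=tau, hard=hard)  # one-hot in the forward pass
    codes = torch.tensor([1.0, -1.0], device=v.device)
    return (y * codes).sum(dim=-1)                    # values in {-1, +1}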

Appendix E Hyper-parameter Settings

We report all hyper-parameter settings in Table 2.

Table 2. Hyper-parameter settings for the five datasets.
MovieLens Gowalla Pinterest Yelp2018 Amazon-Book
B 2048 2048 2048 2048 2048
d 256 256 256 256 256
\eta 1\times 10^{-3} 1\times 10^{-3} 5\times 10^{-4} 5\times 10^{-4} 5\times 10^{-4}
\lambda 1\times 10^{-4} 5\times 10^{-5} 1\times 10^{-4} 1\times 10^{-4} 1\times 10^{-6}
\lambda_{1} 1 1 1 1 1
\lambda_{2} 0.1 0.1 0.1 0.1 0.1
\gamma 1 1 1 1 1
L 2 2 2 2 2
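
For reference, Table 2 maps directly onto a configuration object. A hypothetical example for the Gowalla column is shown below; the key names are ours and simply follow the symbols in the table.

# Hypothetical configuration mirroring the Gowalla column of Table 2.
GOWALLA_CONFIG = {
    "B": 2048,          # B as listed in Table 2
    "d": 256,           # embedding dimension
    "eta": 1e-3,        # learning rate
    "lambda": 5e-5,     # regularization weight
    "lambda_1": 1.0,
    "lambda_2": 0.1,
    "gamma": 1.0,
    "L": 2,             # number of graph convolution layers
}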

Appendix F Additional Experimental Results

F.1. Top-K Recommendation Curve

We plot the Top-K recommendation curve by varying K from 20 to 100 and compare BiGeaR with several selected models. As shown in Figure 1, BiGeaR consistently outperforms HashGNN and achieves recommendation accuracy competitive with DGCF and LightGCN.
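
As a reference for how such a curve can be produced, the sketch below computes per-user Recall@K and NDCG@K (the R@K and N@K metrics reported throughout the appendix) and sweeps K. It is an illustrative snippet on toy data under our own assumptions, not the paper's evaluation code.

import numpy as np

def recall_at_k(ranked, truth, k):
    # Fraction of held-out items that appear in the top-k ranked list.
    return len(set(ranked[:k]) & set(truth)) / max(len(truth), 1)

def ndcg_at_k(ranked, truth, k):
    # Binary-relevance NDCG@k.
    dcg = sum(1.0 / np.log2(i + 2) for i, it in enumerate(ranked[:k]) if it in truth)
    idcg = sum(1.0 / np.log2(i + 2) for i in range(min(len(truth), k)))
    return dcg / idcg if idcg > 0 else 0.0

# Toy example for a single user; in practice the metrics are averaged over
# all test users for each K in {20, 40, ..., 100} to draw the curve.
ranked = [5, 2, 9, 1, 7, 3]
truth = {2, 7, 8}
print([(k, round(recall_at_k(ranked, truth, k), 3)) for k in (2, 4, 6)])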

F.2. Implementation of Embedding Scaler \alpha^{(l)}

We make the embedding scaler learnable (denoted by LB) and show the results in Table 3. We observe that the learnable embedding scaler does not achieve the expected performance. This is probably because there is no direct mathematical constraint on it, so the parameter search space is too large for stochastic optimization to find the optimum. A minimal code sketch of both scaler designs follows Table 3.

Table 3. Implementation of Embedding Scaler.
MovieLens Gowalla Pinterest Yelp2018 Amazon-book
R@20 N@20 R@20 N@20 R@20 N@20 R@20 N@20 R@20 N@20
LB 23.07 41.42 17.01 23.11 14.19 15.29 6.05 10.80 4.52 7.85
(vs. BiGeaR) -9.78% -9.09% -7.35% -7.41% -8.86% -9.15% -6.49% -6.90% -3.42% -3.33%
BiGeaR 25.57 45.56 18.36 24.96 15.57 16.83 6.47 11.60 4.68 8.12
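
To make the comparison in Table 3 concrete, here is a minimal PyTorch sketch of the two designs: the derived scaler \alpha^{(l)} = ||\boldsymbol{v}^{(l)}||_1 / d used by BiGeaR versus a learnable scaler as in the LB variant. The exact parameterization of LB is not specified here, so the learnable version below (one free scalar) is an assumption for illustration.

import torch
import torch.nn as nn

class DerivedScaler(nn.Module):
    # BiGeaR's choice: alpha = ||v||_1 / d, computed from the embedding itself
    # (no extra parameters, hence a tightly constrained value range).
    def forward(self, v):
        return v.abs().mean(dim=-1, keepdim=True)

class LearnableScaler(nn.Module):
    # The LB variant in Table 3: a free scaler learned end-to-end with no
    # mathematical constraint tying it to v, which enlarges the search space.
    def __init__(self, init=1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init))
    def forward(self, v):
        return self.alpha * torch.ones_like(v[..., :1])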

F.3. Implementation of w_l

We evaluate the following three additional implementations of w_l and report the results in Table 4.

  (1) w_l = \frac{1}{L+1}: contributes equally to all embedding segments.

  (2) w_l = \frac{1}{L+1-l}: positively correlated with l, so as to highlight higher-order structures of the interaction graph.

  (3) w_l = 2^{-(L+1-l)}: positively correlated with l via exponentiation.

The experimental results show that implementation (2) performs fairly well compared to the others, demonstrating the importance of highlighting higher-order graph information. This corroborates our design in BiGeaR, i.e., w_l \propto l, which is simpler yet achieves better recommendation accuracy. A code sketch of these weighting schemes follows Table 4.

Table 4. Implementation of wlw_{l}.
MovieLens Gowalla Pinterest Yelp2018 Amazon-Book
R@20 N@20 R@20 N@20 R@20 N@20 R@20 N@20 R@20 N@20
(1) 22.75 41.13 16.15 21.82 14.16 15.48 5.88 10.32 4.46 7.63
(2) 25.07 44.64 17.81 24.46 15.26 16.57 6.40 11.38 4.58 7.96
(3) 21.23 37.81 15.24 20.71 12.93 14.28 5.24 9.51 3.74 64.98
Best 25.57 45.56 18.36 24.96 15.57 16.83 6.47 11.60 4.68 8.12
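
For clarity, the three candidate schemes for w_l listed above can be written as a small helper. This is an illustrative sketch; the paper's own choice w_l \propto l is intentionally not reproduced here, since its exact form is defined in the main text.

def layer_weights(L, scheme):
    # Alternative layer weights w_l for l = 0, ..., L, as compared in Table 4.
    if scheme == 1:       # (1) equal contribution
        return [1.0 / (L + 1) for _ in range(L + 1)]
    if scheme == 2:       # (2) emphasizes higher-order layers
        return [1.0 / (L + 1 - l) for l in range(L + 1)]
    if scheme == 3:       # (3) exponential emphasis on higher-order layers
        return [2.0 ** (-(L + 1 - l)) for l in range(L + 1)]
    raise ValueError("scheme must be 1, 2, or 3")

print(layer_weights(2, 2))   # [0.333..., 0.5, 1.0] for L = 2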

F.4. Implementation of w_k

We further evaluate different implementations of w_k:

  (1) w_k = \frac{R-k}{R}: negatively correlated with the ranking position k.

  (2) w_k = \frac{1}{k}: inversely proportional to position k.

  (3) w_k = 2^{-k}: exponentially decaying in k.

We observe from Table 5 that implementation (3) works slightly worse than Equation (12) but generally better than the other two methods. This shows that exponential modeling is more effective at depicting the importance contribution of items, approximating the long-tailed item popularity (Rendle and Freudenthaler, 2014). Moreover, Equation (12) introduces hyper-parameters that provide the flexibility to adjust the function's properties for different datasets. A code sketch of these weighting schemes follows Table 5.

Table 5. Implementation of wkw_{k}.
MovieLens Gowalla Pinterest Yelp2018 Amazon-Book
R@20 N@20 R@20 N@20 R@20 N@20 R@20 N@20 R@20 N@20
(1) 24.97 44.33 17.96 24.87 15.11 16.20 6.28 11.21 4.43 7.78
(2) 25.08 45.19 17.95 24.95 15.18 16.34 6.27 11.25 4.48 7.92
(3) 25.16 44.92 18.32 24.81 15.26 16.65 6.33 11.36 4.53 8.06
Best 25.57 45.56 18.36 24.96 15.57 16.83 6.47 11.60 4.68 8.12
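
Analogously, the three alternative w_k schemes can be sketched as follows. Equation (12) itself, with its tunable hyper-parameters, is defined in the main text and is not reproduced in this illustrative snippet.

def rank_weights(R, scheme):
    # Alternative position weights w_k for k = 1, ..., R, as compared in Table 5.
    if scheme == 1:       # (1) linearly decaying with the ranking position k
        return [(R - k) / R for k in range(1, R + 1)]
    if scheme == 2:       # (2) inversely proportional to k
        return [1.0 / k for k in range(1, R + 1)]
    if scheme == 3:       # (3) exponentially decaying with k
        return [2.0 ** (-k) for k in range(1, R + 1)]
    raise ValueError("scheme must be 1, 2, or 3")

print(rank_weights(5, 3))   # [0.5, 0.25, 0.125, 0.0625, 0.03125]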