
EDGE-Rec: Efficient and Data-Guided Edge Diffusion For Recommender Systems Graphs

Utkarsh Priyam ([email protected]), Hemit Shah ([email protected]), and Edoardo Botta ([email protected])
Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
(2024; 29 July 2024)
Abstract.

Most recommender systems research focuses on binary encodings of historical user-item interactions to predict future interactions. User features, item features, and interaction strengths remain largely under-utilized in this space, or are used only indirectly, despite proving highly effective in large-scale production recommendation systems. We propose a new attention mechanism, loosely based on the principles of collaborative filtering, called Row-Column Separable Attention (RCSA), which takes direct advantage of real-valued interaction weights as well as user and item features. Building on this mechanism, we additionally propose a novel Graph Diffusion Transformer (GDiT) architecture which is trained to iteratively denoise the weighted interaction matrix of the user-item interaction graph directly. The weighted interaction matrix is built from the bipartite structure of the user-item interaction graph, with edge weights derived from user-item rating interactions. Inspired by recent progress in text-conditioned image generation, our method directly produces user-item rating predictions on the same scale as the original ratings by conditioning the denoising process on user and item features in a principled way.

recommendation, diffusion, graph, recommender, systems
copyright: acmlicensed; journal year: 2024; conference: Knowledge Discovery and Data Mining, Workshop on Generative AI for Recommender Systems and Personalization ’24; August 25–26, 2024; Barcelona, Spain; ccs: Computing methodologies → Machine learning algorithms; ccs: Computing methodologies → Probabilistic reasoning

1. Introduction

Denoising diffusion models (Ho et al., 2020) form a prominent class of models that have proven remarkably effective across a variety of generation and sampling tasks where the data distribution is high-dimensional, such as video and image synthesis and graph generation (Liu et al., 2023). However, generative modeling of graphs has historically focused on smaller graphs for applications such as molecule generation (Vignac et al., 2023; Liu et al., 2023). Recent work on graph generation with diffusion models shows promising results for reconstructing large undirected graphs with thousands of nodes (Chen et al., 2023; Bergmeister et al., 2024). Prior work also presents promising results with guided denoising steps that incorporate score functions to evaluate graphs generated during denoising (Jo et al., 2022).

Recommendation systems focus on using historical user-item interaction data (such as reviews) to predict/recommend new items for users. Previous research in this space applies diffusion models to augment sparse data, model user dynamics/preference distributions, model item latent distributions, or perform sequential recommendation (Wang et al., 2023; Yang et al., 2023; Ma et al., 2024). In contrast to prior research in diffusion-based recommendation systems, we aim to construct a recommender system based on graph diffusion models by formulating the problem as a graph completion task. We focus primarily on the graph structure of the user-item interaction space.

Figure 1. User-item interaction graph for a movie rating dataset. Each edge weight corresponds to the user-provided rating of the movie. Users and movies are linked only by rating interactions, so the graph is bipartite.

Our proposed method represents historical user-item interactions as edges between nodes in a bipartite graph with real-valued edge weights. These weights can represent the strength of the interactions (e.g., a rating out of 5). We also utilize latent-space representations of users and items to condition the denoising process for the prediction of new interaction links (recommendations). In recommendation systems, the set of users and items (as well as their attributes) is fixed, so we focus on noising the edge information of the interaction graph (the adjacency structure and edge weights) in a continuous manner. Additionally, since recommender system graphs are bipartite, we can represent the edge weights between users and items in a weighted interaction matrix, forming a 2D array similar to an image where each pixel represents the corresponding interaction strength between a user and an item.

To the best of our knowledge, our work is the first to consider the interaction graph in a recommender system with both node features (per user and item) and the strength of interactions as edge weights. By conditioning on node features in the denoising model, we are able to guide the diffusion of weights over all possible interaction edges in the graph to approximate the expected interaction strengths.

2. Related Work

2.1. Diffusion on Adjacency Matrices

In the context of graph generation, a natural formulation of a diffusion model defines the noising process in the space of adjacency matrices. In this spirit, EDGE (Chen et al., 2023) defines the transition kernel of the forward process as a Bernoulli distribution that resamples each edge with a time-varying positive probability. It aims to reconstruct the original graph by removing edges in the forward process, and denoising by adding edges in the reverse process. By explicitly modelling the node degrees and appropriately restricting each step of the backward process to a selected subset of “active” nodes, EDGE achieves promising performance and efficiency on larger graphs.

2.2. Diffusion Transformers

Transformer-based denoising models (those that predict the noise added in a forward step by $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$, or that predict $\mathbf{x}_0$ directly) have become the standard for image diffusion models. Considering our aim to perform continuous diffusion on a 2D weighted interaction matrix with similar characteristics to an image, our denoising model is heavily inspired by the architecture of the recent work on Diffusion Transformers (Peebles and Xie, 2023). Our main goal is to capitalize on the modeling capacity of scaled dot-product attention to capture neighborhood characteristics for each node in the interaction graph. We follow the original Diffusion Transformer architecture in its use of multi-head attention and time-step conditioning via adaptive layer norms.

2.3. Diffusion-based Recommender Systems

Due to the structure implied by user-item interactions, recommendation problems can be naturally reformulated as link prediction problems on graphs. Applications of graph convolutional networks to this task have been extensively studied (Wang et al., 2021), while applications of diffusion models remain relatively understudied. Recently, Wang et al. (2023) proposed DiffRec, a recommendation paradigm based on generative diffusion models. Given a user and a list of items, the diffusion process acts on the space of interaction vectors; that is, it reconstructs binary vectors encoding whether the user has interacted with each item or not.

2.4. Graph Diffusion Recommender Systems

Subsequently, Jiang et al. (2023) and Zhu et al. (2023) incorporated the graph structure of the interaction history, bridging the gap between recommendation systems and graph diffusion models. However, they still do not fully utilize user or item features, and they perform the forward and reverse diffusion processes over the space of binary interaction encodings, regardless of interaction strength.

A concurrent work by Yi et al. (2024) formulates the recommendation problem similarly; however, they first train a graph neural encoder to produce node (user/item) embeddings. In their experiment with continuous diffusion, they noise these embeddings in the forward process and train a denoising model to recover them. Their graph neural encoder relies mainly on the graph defined by historical user-item interactions, and they additionally incorporate user/item features by measuring the similarity between user-user and item-item pairs based on their features.

Our work differs from their approach in that we incorporate user and item features directly in our denoising model architecture rather than using this information to produce node embeddings. While they perform diffusion over the space of these node embeddings, our diffusion process is applied to the space of weighted interaction matrices. As a result, our denoising model directly predicts interaction strengths, whereas theirs recovers node embeddings.

Figure 2. DiffRec: starting from a binary vector of historical interactions, the forward process noises the user's interaction history until time step $T$ via the transition step $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$. The model is trained to recover $\mathbf{x}_0$ using $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ (Wang et al., 2023).

3. Preliminaries

In the traditional formulation of denoising diffusion models (Ho et al., 2020), the forward diffusion process gradually adds Gaussian noise to the original data, $\mathbf{x}_0 \sim q(\mathbf{x})$, over $T$ diffusion steps to reach pure Gaussian noise, $\mathbf{x}_T \sim \mathcal{N}(0, \mathbf{I})$. Each diffusion step corrupts the noised representation from the previous step:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t;\, \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t \mathbf{I}\big)$$

where $\{\beta_t\}_{t=1}^{T}$ are the noise-schedule hyperparameters controlling the amount of noise added at each step, with $\beta_t \in (0, 1)$.

Ho et al. (2020) also provide the analytical form of $\mathbf{x}_t$ given $\mathbf{x}_0$:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\, \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\, (1-\bar{\alpha}_t)\mathbf{I}\big)$$
$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. This enables any arbitrary step of the diffusion process to be generated extremely quickly for any given $\mathbf{x}_0 \sim q(\mathbf{x})$, simply by sampling $\epsilon \sim \mathcal{N}(0, \mathbf{I})$. The noise schedule can be tuned by choosing an appropriate set of $\{\beta_t\}_{t=1}^{T}$.
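For concreteness, this closed form can be implemented in a few lines. The following PyTorch sketch assumes a linear beta schedule; the schedule endpoints and $T$ are illustrative choices:

```python
import torch

# Linear beta schedule as in Ho et al. (2020); T and the endpoints are
# illustrative choices.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t for t = 1..T

def sample_xt(x0: torch.Tensor, t: torch.Tensor):
    """Draw x_t ~ q(x_t | x_0) in closed form, returning (x_t, eps)."""
    eps = torch.randn_like(x0)                          # eps ~ N(0, I)
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast per sample
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps, eps
```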

The reverse diffusion process aims to gradually recover the original data $\mathbf{x}_0$ given the fully corrupted $\mathbf{x}_T$. Ho et al. (2020) show that the posterior Gaussian, $p(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, is tractable (when conditioned on $\mathbf{x}_0$) and has the analytical form:

$$p(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \mu_t(\mathbf{x}_t, \mathbf{x}_0),\, \tilde{\beta}_t \mathbf{I}\big)$$
$$\mu_t(\mathbf{x}_t, \mathbf{x}_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t$$
$$\tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$

Above, conditioning on $\mathbf{x}_0$ is necessary for the posterior to be tractable. The primary goal of a diffusion model is to learn the reverse process such that arbitrary Gaussian noise can be gradually denoised into samples from the original data distribution $q(\mathbf{x})$. Without conditioning on $\mathbf{x}_0$, we instead need:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \mu_\theta(\mathbf{x}_t, t),\, \Sigma_\theta(\mathbf{x}_t, t)\big)$$

The model, parameterized by $\theta$, is trained either to approximate the mean $\mu_\theta$ and variance $\Sigma_\theta$, or to predict the added noise $\epsilon$. Generally, the latter approach has lower variance, and this parameterization more closely resembles Langevin dynamics, with a loss that resembles the denoising score-matching objective (Ho et al., 2020):

$$\mathcal{L} = \left\|\epsilon - \epsilon_\theta(\mathbf{x}_t, t)\right\|^2 = \left\|\epsilon - \epsilon_\theta\!\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\big)\right\|^2$$

where $\epsilon_\theta(\cdot)$ is the noise predicted by the model. We use this vanilla diffusion approach over the space of weighted interaction matrices (applying isotropic noise in the forward steps, and training our GDiT architecture to predict the added noise in the reverse process). Inspired by text-conditioned image-generation methods (Rombach et al., 2022), our approach also conditions on user features, $\mathcal{U}$, and item features, $\mathcal{I}$, such that the model predicts $\epsilon_\theta(\mathbf{x}_t, t, \mathcal{U}, \mathcal{I})$ and learns the posterior distribution $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathcal{U}, \mathcal{I})$.
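A hypothetical $\epsilon$-prediction training step under this conditioned formulation might look as follows; the `model(x_t, t, U, I)` signature mirrors $\epsilon_\theta(\mathbf{x}_t, t, \mathcal{U}, \mathcal{I})$ and is an assumption for illustration (`sample_xt` and `T` come from the sketch above):

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, user_feats, item_feats):
    """One denoising training step: predict the added noise, take its MSE."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random steps
    xt, eps = sample_xt(x0, t)                  # closed-form forward sample
    eps_pred = model(xt, t, user_feats, item_feats)
    return F.mse_loss(eps_pred, eps)            # ||eps - eps_theta(...)||^2
```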

3.1. Adjacency Matrix vs. Interaction Matrix

With a bipartite undirected graph, there is no need to encode the edges with a full adjacency matrix. We instead define our diffusion process to operate over weighted interaction matrices, with users along the rows and items along the columns. For the small interaction graph in Figure 1, we show below both the adjacency matrix and the weighted interaction matrix (abbreviations are used in place of the full movie titles).

        342   254   436   974 | TGM   TTN   AVG
 342                          | 5.0   2.5   3.0
 254                          | ---   1.0   4.0
 436                          | ---   ---   4.5
 974                          | 1.5   ---   3.5
 -----------------------------+-----------------
 TGM    5.0   ---   ---   1.5 |
 TTN    2.5   1.0   ---   --- |
 AVG    3.0   4.0   4.5   3.5 |
Figure 3. Full adjacency matrix for the interaction graph in Figure 1. Note the empty quadrants in the top-left and bottom-right due to the bipartite structure; ‘---’ marks user-item pairs with no rating.

        TGM   TTN   AVG
 342    5.0   2.5   3.0
 254    ---   1.0   4.0
 436    ---   ---   4.5
 974    1.5   ---   3.5
Figure 4. Weighted interaction matrix for the same graph (with non-existent interactions as ‘---’). Our proposed diffusion method operates over these weighted interaction matrices.
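A weighted interaction matrix can be assembled from rating triples in a few lines. The sketch below reproduces the Figure 4 data; dense storage and the index mapping are illustrative choices (production-scale data would be kept sparse):

```python
import numpy as np

def build_interaction_matrix(triples, num_users, num_items, fill=np.nan):
    """Assemble the (users x items) weighted interaction matrix from
    (user_idx, item_idx, rating) triples; unknown entries keep `fill`."""
    R = np.full((num_users, num_items), fill)
    for u, i, r in triples:
        R[u, i] = r
    return R

# The Figure 1 graph: users (342, 254, 436, 974) -> rows 0-3,
# movies (TGM, TTN, AVG) -> columns 0-2.
R = build_interaction_matrix(
    [(0, 0, 5.0), (0, 1, 2.5), (0, 2, 3.0), (1, 1, 1.0), (1, 2, 4.0),
     (2, 2, 4.5), (3, 0, 1.5), (3, 2, 3.5)], 4, 3)
```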

3.2. Recommender System Constraints

Formulating the recommendation task as a graph diffusion problem introduces two key constraints. First, the set of all users and items (as well as their metadata/attributes) is fixed (e.g., Netflix's subscribers and movie library). Second, the graphs we aim to operate on have at least on the order of $10^3$ nodes and $10^5$ edges. This excludes the majority of prior work on generating graphs with diffusion models, which operates on nodes, edges, and attributes in the forward and reverse processes and cannot scale to graphs with thousands of nodes (often because the methods target applications like molecule generation, where the graphs are substantially smaller and denser than those in recommender systems).

4. Proposed Framework

We take inspiration from the notion of “active node” sampling in the interaction graph from EDGE (Chen et al., 2023), thus enforcing the same mechanism of scalable local diffusion. Below, we highlight some key differences between our method and prior works.

(1) Sets of active nodes are sampled a priori and kept fixed throughout the diffusion. This means that our diffusion process is defined on the space of $n \times m$ sub-matrices (called “patches”), whose values represent all ratings in a sub-sample of $n$ users and $m$ items from the dataset. During training, this translates to a batch of $n \times m$ interaction-matrix patches which are corrupted in the forward process. The model is trained to predict the added noise within these patches given the corrupted data, the time step, and the feature information for the $n$ users and $m$ items within each patch.

(2) For all user-item pairs within a sampled $n \times m$ patch, we consider real-valued interaction strengths as opposed to binary interactions (e.g., ratings from 1 to 5 for the MovieLens datasets (Harper and Konstan, 2015)). This makes the task more difficult, but also closer to real applications of recommendation systems. The interaction strengths are mapped to the continuous interval $[-1, 1]$, with 0 representing a neutral interaction, and positive and negative values corresponding to favorable and unfavorable interactions, respectively.

(3) With a large (and dense) enough dataset, it is reasonable to assume that the distribution of these interaction strengths is roughly Gaussian, or can easily be transformed to a standard Gaussian in an invertible manner, such as with a quantile transformation (see the sketch after this list). This motivates the usage of the vanilla diffusion method over the space of (transformed) weighted interaction matrices.

(4) In contrast to EDGE, we noise all edges within a patch at the same time, without employing additional selection at each step. This is made computationally feasible by the a priori subsampling of the active edges, which makes each subsample significantly smaller than the entire graph.

(5) In contrast to DiffRec, we transform the weights of the interaction-graph edges into approximate samples from a standard Gaussian distribution. While DiffRec stops the forward process at a partially-noised time step to encode historical interaction information as a latent prior, we aim to remove this requirement by conditioning on user and item features, enabling generation by sampling directly from random noise.
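To illustrate item (3), a rating distribution can be mapped invertibly toward a standard Gaussian with a quantile transformation. The scikit-learn transformer below is one possible realization; the jitter that breaks ties between identical ratings is our illustrative choice:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Known ratings on a 1-5 scale (flattened known entries of the interaction
# matrix); small jitter breaks ties so the mapping is (nearly) invertible.
ratings = rng.integers(1, 6, size=(10_000, 1)).astype(float)
ratings += rng.uniform(-0.25, 0.25, size=ratings.shape)

qt = QuantileTransformer(output_distribution="normal", random_state=0)
z = qt.fit_transform(ratings)          # approximately N(0, 1)
recovered = qt.inverse_transform(z)    # back to the original rating scale
```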

4.1. Row-Column Separable Attention (RCSA)

A key aspect of weighted interaction matrices is that the ordering of users and items along the axes can be permuted arbitrarily without changing the encoded information. As a result, spatial correlations in an interaction matrix are not analogous to those present in images, and the predictive power present within a square neighborhood of a user-item pair is limited.

We hypothesize that most of the predictive power relevant to the noise prediction at a particular user-item pair is contained in the interaction matrix entries with either the same user or the same item, that is across the same row or the same column. Analogously, we hypothesize the amount of predictive power present across user-item pairs with different items and different users to be low.

Figure 5. A sub-matrix/sub-sample from the full weighted interaction matrix, including the additional features provided to the denoising model (latent representations for user features, $u_i$, and item features, $i_j$). We also show our custom attention mechanism's operation for the user-item pair $(u_3, i_2)$: it attends over other users' ratings for the same item (column), as well as the same user's ratings of other items (row).

The presence of this structure suggests that the commonly used U-Net (Ronneberger et al., 2015) architecture based on square convolutions is not suitable, as it would only capture information within an immediate neighborhood. For this reason, we propose a novel attention module for interaction matrices. Within this module, each element of the matrix attends first to the elements in its row and subsequently to the elements in its column. This is equivalent to a two-headed masked attention module in which the attention coefficients for elements outside the same row or column, respectively, are masked out. In practice, one attention head operates on the sub-sample of the weighted interaction matrix and the other operates on its transpose. Replacing all attention modules with RCSA modules reduces the computation cost for an $N \times M$ patch from $O(N^2 M^2)$ for traditional 2D attention mechanisms to $O(N^2 + M^2)$.
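A minimal PyTorch sketch of the RCSA idea follows; the `(B, N, M, d)` tensor layout and the use of two `nn.MultiheadAttention` modules (one over rows, one over the transpose for columns) are our illustrative assumptions:

```python
import torch
import torch.nn as nn

class RCSA(nn.Module):
    """Row-Column Separable Attention (sketch). Each entry of an (N, M)
    interaction patch attends over the other entries in its row, then over
    the other entries in its column, instead of over all N*M positions."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, M, d) -- a batch of embedded interaction-matrix patches.
        B, N, M, d = x.shape

        # Row attention: fold the N rows into the batch so each user's
        # ratings attend over the M items in that row.
        r = x.reshape(B * N, M, d)
        r, _ = self.row_attn(r, r, r)
        x = r.reshape(B, N, M, d)

        # Column attention: transpose so each item's column of N user
        # ratings forms a sequence, i.e. attention on the transpose.
        c = x.transpose(1, 2).reshape(B * M, N, d)
        c, _ = self.col_attn(c, c, c)
        return c.reshape(B, M, N, d).transpose(1, 2)
```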

4.2. Graph Diffusion Transformer (GDiT)

Figure 6. The GDiT architecture. Node-feature embeddings pass through RCSA self-attention, layer-norm, and MLP blocks (with residual connections) before feeding a row-column cross-attention block. In parallel, the noised interaction strengths pass through adaLN-modulated RCSA self-attention, cross-attention against the node-feature stream, and an MLP, with scale and shift parameters generated from the diffusion time-step embeddings; a final 1×1 convolution produces the noise prediction.

We propose a new transformer architecture (Figure 6) based on Diffusion Transformers (DiT) (Peebles and Xie, 2023), which we use to attend to user and item features when predicting diffusion noise. Specifically, we merge two DiT conditioning variants: adaptive layer norms scaled by the diffusion time step, and cross-attention conditioned on node features. We also apply RCSA (Figure 5) in place of all 2D attention operations. This architectural change better suits attention on bipartite graph structures and improves runtime efficiency by replacing 2D attention with 1D attention. Each block can be repeated an arbitrary number of times to increase the model's capacity for higher-dimensional node-feature embeddings (or to predict multi-dimensional rating/interaction strengths).
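Pulling these pieces together, one GDiT block might be sketched as below. The ordering follows Figure 6, while the 9-way adaLN parameterization (scale/shift/gate per sub-block, in the style of DiT), the shared embedding width for the time-step embedding, and the flattening of the patch for cross-attention are our assumptions (`RCSA` is the module sketched in Section 4.1):

```python
import torch
import torch.nn as nn

class GDiTBlock(nn.Module):
    """One GDiT block (sketch): adaLN-modulated RCSA self-attention over the
    noised patch, cross-attention against node-feature tokens, and an MLP."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = RCSA(dim, heads)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Time-step embedding -> per-sub-block scale/shift/gate parameters,
        # as in DiT's adaptive layer norm.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 9 * dim))

    def forward(self, x, t_emb, feat_tokens):
        # x: (B, N, M, d) noised patch; t_emb: (B, d) time-step embedding;
        # feat_tokens: (B, N + M, d) user and item feature tokens.
        B, N, M, d = x.shape
        s1, b1, g1, s2, b2, g2, s3, b3, g3 = self.ada(t_emb).chunk(9, dim=-1)
        mod = lambda h, s, b: h * (1 + s[:, None, None]) + b[:, None, None]

        x = x + g1[:, None, None] * self.self_attn(mod(self.norm1(x), s1, b1))
        q = mod(self.norm2(x), s2, b2).reshape(B, N * M, d)
        c, _ = self.cross_attn(q, feat_tokens, feat_tokens)
        x = x + g2[:, None, None] * c.reshape(B, N, M, d)
        x = x + g3[:, None, None] * self.mlp(mod(self.norm3(x), s3, b3))
        return x
```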

5. Evaluation

We train and evaluate our proposed method, EDGE-Rec, on the MovieLens datasets ML-100k and ML-1M (Harper and Konstan, 2015). Evaluation is performed using a set of standard recommendation metrics based on the top-$K$ predictions from the model. We train each model for at least 10,000 iterations with a batch size of 16 and a patch size of at least $50 \times 50$. We construct a custom time-aware 90-10 train-test split of the edges, ensuring that test interactions occur after training interactions and that the time ranges covered by the training and validation splits do not overlap.

Training consists of sampling random patches of a specified size $n \times m$ from the target graph. Uniform random sampling of the full weighted interaction matrix over at least 10,000 iterations allows most, if not all, of the users and items to be included during training. The model is provided patches corrupted to uniformly sampled time steps, the time step each patch is noised to, and the user and item features for each patch. Unknown interactions are optionally masked, and, if a user has not interacted with an item, the expected (true) interaction strength is set to 0 (neutral).
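A sketch of this patch sampling routine (the function name, mask convention, and neutral fill are our illustrative choices):

```python
import numpy as np

def sample_patch(R, mask, n, m, rng):
    """Sample an n x m training patch from the full interaction matrix R.

    R: (num_users, num_items) ratings mapped to [-1, 1]; mask marks known
    entries. Unknown interactions are filled with 0 (neutral)."""
    users = rng.choice(R.shape[0], size=n, replace=False)
    items = rng.choice(R.shape[1], size=m, replace=False)
    patch = R[np.ix_(users, items)]
    known = mask[np.ix_(users, items)]
    patch = np.where(known, patch, 0.0)   # neutral fill for unknown entries
    return patch, known, users, items

rng = np.random.default_rng(0)
```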

Finally, we train the GDiT model by maximizing the usual denoising formulation of the ELBO objective, which we regularize at the single-step level using the Bayesian Personalized Ranking loss (Rendle et al., 2012) to encourage coherent rankings. At sampling time, we reconstruct patches of ratings from random noise. Notably, for a given patch we have access to the known training edges, which precede the test edges in time. This allows us to frame the sampling problem as a matrix completion problem. Inspired by inpainting techniques for image completion (Lugmayr et al., 2022), we replace the known portion of the patch with the correspondingly noised known ratings from the training set, further conditioning the generation on the a priori information and encouraging accurate generation of new ratings that “complete” the already-available ones. These known entries are discarded from the candidate set at test time.
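A RePaint-style reverse step adapted to interaction patches might look like the following sketch; the fixed-variance DDPM update and the `model(x_t, t, U, I)` signature are our assumptions:

```python
import torch

@torch.no_grad()
def inpaint_step(x_t, t, x0_known, known_mask, model, betas, alpha_bar, U, I):
    """One reverse step that overwrites known entries with their re-noised
    ground truth (RePaint-style), so sampling completes the observed patch."""
    # Standard DDPM reverse update from the predicted noise (variance beta_t).
    eps = model(x_t, torch.tensor([t], device=x_t.device), U, I)
    mean = (x_t - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / (1 - betas[t]).sqrt()
    x_prev = mean + betas[t].sqrt() * torch.randn_like(x_t) if t > 0 else mean

    # Forward-noise the known training ratings to level t-1 and write them in.
    ab = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
    known = ab.sqrt() * x0_known + (1 - ab).sqrt() * torch.randn_like(x0_known)
    return torch.where(known_mask, known, x_prev)
```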

We also propose a new method for denoising larger subgraphs by randomly tiling the encompassing subgraph with smaller patches at each reverse-diffusion step. As a result, the entire interaction matrix is denoised uniformly during the sampling process, rather than independently, as happens when patches are naively denoised one after another.
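One way to realize this tiled sampling is sketched below; the per-step random grid offsets and the `step_fn` interface (the per-patch reverse step, e.g. the inpainting step above) are our assumptions:

```python
import numpy as np
import torch

@torch.no_grad()
def tiled_denoise_step(x_t, t, tile, step_fn, rng: np.random.Generator):
    """Apply one reverse-diffusion step to a large (N, M) matrix by covering
    it with tile x tile patches whose grid offset is re-randomized at every
    step, so patch boundaries move and the matrix is denoised uniformly."""
    N, M = x_t.shape
    out = x_t.clone()
    dr, dc = int(rng.integers(0, tile)), int(rng.integers(0, tile))
    for r in range(-dr, N, tile):
        for c in range(-dc, M, tile):
            r0, c0 = max(r, 0), max(c, 0)
            r1, c1 = min(r + tile, N), min(c + tile, M)
            if r1 > r0 and c1 > c0:
                # step_fn runs the per-patch reverse step given the patch's
                # user/item index ranges (for feature conditioning).
                out[r0:r1, c0:c1] = step_fn(out[r0:r1, c0:c1], t,
                                            range(r0, r1), range(c0, c1))
    return out
```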

Figures 7 and 8 depict the chosen suite of evaluation metrics for various top-$K$ values. In particular, note that EDGE-Rec surpasses the DiffRec baselines (included in Appendix A) for most patch sampling tasks at edge densities above roughly 55% (in terms of known interactions). Additionally, we observe that scaling up denoising via our novel tiling method does not significantly degrade recommendations, given sufficient subgraph density.

Figure 7. EDGE-Rec 8-block model evaluation metrics on the ML-100k dataset. The figures show average results on arbitrary randomly sampled patches of varying sizes, as well as tiled sampling at $64 \times 64$ over the entire user-item rating graph.

Figure 8. EDGE-Rec 1-block model evaluation metrics on the ML-100k dataset. The top row depicts results on arbitrary $50 \times 50$ patches from subgraphs of specified density. The bottom row compares sampling a fixed $50 \times 50$ patch against sampling the entire subgraph at a fixed label density of 70%, which corresponds to the top 100 users and 100 movies by rating density (i.e., a $100 \times 100$ patch).

5.1. Model and patch size

We scale up the experiments with the proposed architecture by stacking 8 GDiT blocks. Figure 9 shows the results obtained across different choices of density and patch size. Notably, when compared with the corresponding smaller single-block baselines (Figure 8), we observe improvements of up to 0.13 in magnitude across all top-1, top-5, and top-10 metrics, suggesting a strong overall improvement in the quality of the resulting rankings. We further observe the role of the patch size in balancing precision and recall: from top-20 to top-50, as the patch size (and hence the number of ranked candidates) increases, Figure 9 shows an increase in precision and recall and a decrease in NDCG.

Figure 9. EDGE-Rec 8-block model evaluation metrics on the ML-100k dataset. The figure depicts results on arbitrary patches of size $50 \times 50$ and $80 \times 80$ at specified densities.

5.2. Larger Datasets

We train and evaluate a large model with 16 GDiT layers and ${\sim}14$ million parameters on the larger ML-1M dataset (Harper and Konstan, 2015). This dataset contains 1,000,000 ratings from over 6,000 users on roughly 4,000 movies (about 10,000 nodes and 1,000,000 edges). A patch size of $64 \times 64$ was used. Evaluation is performed by randomly sampling 10 patches of this size from the full weighted interaction matrix, performing in-painting as previously described, and averaging the metrics over these patches. Note that model predictions are compared against test edges (user ratings the model has not previously seen).

Figure 10. EDGE-Rec 16-block model evaluation metrics on the ML-1M dataset. The figures show average results on arbitrary randomly sampled patches of size $64 \times 64$.

Given the consistency of the metrics between our evaluations on ML-100K and ML-1M, we conclude that our model scales well to larger datasets. Despite an order of magnitude more data, performance degrades only minimally. This is most likely driven by our training strategy of random patch sampling, which our evaluation strategy also adopts.

6. Conclusions and Limitations

We propose a graph-diffusion-based recommendation model that predicts continuous interaction strengths. Building on advancements in conditioned diffusion models, we generate recommendations by conditioning on user and item features. By tuning the patch size, our model shows promising performance despite patches being two orders of magnitude smaller than the complete graph. Assuming one can identify dense patches of the graph with high accuracy, this makes the training process highly scalable in the number of users and items. However, in highly sparse production-level environments, identifying such patches can be a challenge, as it essentially reduces to the retrieval problem. This makes the dense-patch assumption particularly strong and a clear limitation of the method. Future work could pair this ranking methodology with a trained retrieval model that produces high-density patches of candidate user-item pairs.

Acknowledgements.
To Michael Mu, for the late-night company in the MSML lounge, and for answering our many questions on recent diffusion model developments.

References

  • Bergmeister et al. (2024) Andreas Bergmeister, Karolis Martinkus, Nathanaël Perraudin, and Roger Wattenhofer. 2024. Efficient and Scalable Graph Generation through Iterative Local Expansion. arXiv:2312.11529 [cs.SI]
  • Chen et al. (2023) Xiaohui Chen, Jiaxing He, Xu Han, and Li-Ping Liu. 2023. Efficient and Degree-Guided Graph Generation via Discrete Diffusion Modeling. arXiv:2305.04111 [cs.LG]
  • Harper and Konstan (2015) F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 5, 4, Article 19 (dec 2015), 19 pages. https://doi.org/10.1145/2827872
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. arXiv:2006.11239 [cs.LG]
  • Jiang et al. (2023) Yangqin Jiang, Yuhao Yang, Lianghao Xia, and Chao Huang. 2023. DiffKG: Knowledge Graph Diffusion Model for Recommendation. arXiv:2312.16890 [cs.IR]
  • Jo et al. (2022) Jaehyeong Jo, Seul Lee, and Sung Ju Hwang. 2022. Score-based Generative Modeling of Graphs via the System of Stochastic Differential Equations. arXiv:2202.02514 [cs.LG]
  • Liu et al. (2023) Chengyi Liu, Wenqi Fan, Yunqing Liu, Jiatong Li, Hang Li, Hui Liu, Jiliang Tang, and Qing Li. 2023. Generative Diffusion Models on Graphs: Methods and Applications. arXiv:2302.02591 [cs.LG]
  • Lugmayr et al. (2022) Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. 2022. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 11461–11471.
  • Ma et al. (2024) Haokai Ma, Ruobing Xie, Lei Meng, Xin Chen, Xu Zhang, Leyu Lin, and Zhanhui Kang. 2024. Plug-in Diffusion Model for Sequential Recommendation. arXiv:2401.02913 [cs.IR]
  • Peebles and Xie (2023) William Peebles and Saining Xie. 2023. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4195–4205.
  • Rendle et al. (2012) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2012. BPR: Bayesian Personalized Ranking from Implicit Feedback. arXiv:1205.2618 [cs.IR]
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer, 234–241.
  • Vignac et al. (2023) Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. 2023. DiGress: Discrete Denoising diffusion for graph generation. arXiv:2209.14734 [cs.LG]
  • Wang et al. (2021) Shoujin Wang, Liang Hu, Yan Wang, Xiangnan He, Quan Z. Sheng, Mehmet A. Orgun, Longbing Cao, Francesco Ricci, and Philip S. Yu. 2021. Graph Learning based Recommender Systems: A Review. arXiv:2105.06339 [cs.IR]
  • Wang et al. (2023) Wenjie Wang, Yiyan Xu, Fuli Feng, Xinyu Lin, Xiangnan He, and Tat-Seng Chua. 2023. Diffusion Recommender Model. arXiv:2304.04971 [cs.IR]
  • Yang et al. (2023) Zhengyi Yang, Jiancan Wu, Zhicai Wang, Xiang Wang, Yancheng Yuan, and Xiangnan He. 2023. Generate What You Prefer: Reshaping Sequential Recommendation via Guided Diffusion. arXiv:2310.20453 [cs.IR]
  • Yi et al. (2024) Zixuan Yi, Xi Wang, and Iadh Ounis. 2024. A Directional Diffusion Graph Transformer for Recommendation. arXiv:2404.03326 [cs.IR]
  • Zhu et al. (2023) Yunqin Zhu, Chao Wang, and Hui Xiong. 2023. Towards Graph-Aware Diffusion Modeling for Collaborative Filtering. arXiv:2311.08744 [cs.IR]

Appendix A DiffRec Baseline Metrics

Figure 11. DiffRec (Wang et al., 2023) evaluation metrics on ML-100k and ML-1M.

Appendix B Visualizing the Sparsity in the MovieLens datasets

Figure 12 visualizes the ratings provided by all users for the movies in the MovieLens-100K dataset. Note the inherent sparsity in the encoded interaction matrix.

Figure 12. Heat-map of ratings provided by users for movies ranging from 1 to 5 in the MovieLens-100K dataset. Values are 0 (black) where the user has not rated the movie.

To sample dense subgraphs of the interaction matrix, we heuristically identify “important,” densely connected users and movies by counting the number of nodes within a distance of 2 edges of each user or movie node in the interaction graph. Figure 13 shows a heat-map of the data after sorting users and movies by this heuristic; a sketch of the heuristic follows below. We sample dense subgraphs of the ML-100k dataset by selecting a suitably sized sub-matrix from the lower-left corner of the sorted interaction matrix.
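One way to compute this two-hop heuristic directly from the ratings matrix (a sketch; treating co-rating users, and co-rated movies, as the distance-2 neighbors in the bipartite graph):

```python
import numpy as np

def two_hop_counts(R):
    """Score each user and movie by how many nodes lie within two edges of
    it in the bipartite rating graph: a user's 2-hop neighbors are the other
    users who co-rated at least one movie (symmetrically for movies)."""
    A = (R > 0).astype(np.int64)                    # binary incidence matrix
    user_scores = ((A @ A.T) > 0).sum(axis=1) - 1   # exclude the node itself
    item_scores = ((A.T @ A) > 0).sum(axis=1) - 1
    return user_scores, item_scores

# Sorting by these scores and taking a corner sub-matrix yields a dense patch:
# order_u, order_i = np.argsort(user_scores), np.argsort(item_scores)
```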

Figure 13. Heat-map of user-provided ratings for movies in the MovieLens-100K dataset, with users and movies sorted by the density heuristic described above.