
EDGE-Rec: Efficient and Data-Guided Edge Diffusion For Recommender Systems Graphs

Utkarsh Priyam ([email protected]), Hemit Shah ([email protected]), and Edoardo Botta ([email protected])
Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
(2024; 29 July 2024)
Abstract.

Most recommender systems research focuses on binary encodings of historical user-item interactions to predict future interactions. User features, item features, and interaction strengths remain largely under-utilized in this space, or are used only indirectly, despite proving highly effective in large-scale production recommendation systems. We propose a new attention mechanism, loosely based on the principles of collaborative filtering, called Row-Column Separable Attention (RCSA), which takes direct advantage of real-valued interaction weights as well as user and item features. Building on this mechanism, we additionally propose a novel Graph Diffusion Transformer (GDiT) architecture which is trained to iteratively denoise the weighted interaction matrix of the user-item interaction graph directly. The weighted interaction matrix is built from the bipartite structure of the user-item interaction graph, with edge weights derived from user-item rating interactions. Inspired by recent progress in text-conditioned image generation, our method directly produces user-item rating predictions on the same scale as the original ratings by conditioning the denoising process on user and item features in a principled way.

recommendation, diffusion, graph, recommender, systems
copyright: acmlicensed; journal year: 2024; conference: Knowledge Discovery and Data Mining, Workshop on Generative AI for Recommender Systems and Personalization ’24; August 25–26, 2024; Barcelona, Spain; ccs: Computing methodologies → Machine learning algorithms; ccs: Computing methodologies → Probabilistic reasoning

1. Introduction

Denoising diffusion models (Ho et al., 2020) form a prominent class of models that have proven remarkably effective across a variety of generation and sampling tasks where the data distribution is high-dimensional, such as video and image synthesis and graph generation (Liu et al., 2023). However, generative modeling of graphs has historically focused on smaller graphs for applications such as molecule generation (Vignac et al., 2023; Liu et al., 2023). Recent work on graph generation with diffusion models shows promising results for reconstructing large undirected graphs with thousands of nodes (Chen et al., 2023; Bergmeister et al., 2024). Prior work also presents promising results with guided denoising steps that incorporate score functions to evaluate graphs generated during denoising (Jo et al., 2022).

Recommendation systems focus on using historical user-item interaction data (such as reviews) to predict/recommend new items for users. Previous research in this space applies diffusion models to augment sparse data, model user dynamics/preference distributions, model item latent distributions, or perform sequential recommendation (Wang et al., 2023; Yang et al., 2023; Ma et al., 2024). In contrast to prior research in diffusion-based recommendation systems, we aim to construct a recommender system based on graph diffusion models by formulating the problem as a graph completion task. We focus primarily on the graph structure of the user-item interaction space.

Figure 1. User-item interaction graph for a movie rating dataset. Each edge weight corresponds to the user-provided rating of the movie. Users and movies are linked only by rating interactions, so the graph is bipartite.

Our proposed method represents historical user-item interactions as edges between nodes in a bipartite graph with real-valued edge weights. These weights can represent the strength of the interactions (e.g., a rating out of 5). We also utilize latent-space representations of users and items to condition the denoising process for the prediction of new interaction links (recommendations). In recommendation systems, the set of users and items (as well as their attributes) is fixed, so we focus on noising the edge information of the interaction graph (the adjacency structure and edge weights) in a continuous manner. Additionally, since recommender system graphs are bipartite, we can represent the edge weights between users and items in a weighted interaction matrix, forming a 2D array similar to an image where each pixel represents the corresponding interaction strength between a user and an item.

To the best of our knowledge, our work is the first to consider the interaction graph in a recommender system with both node features (per user and item) and the strength of interactions as edge weights. By conditioning on node features in the denoising model, we are able to guide the diffusion of weights over all possible interaction edges in the graph to approximate the expected interaction strengths.

2. Related Work

2.1. Diffusion on Adjacency Matrices

In the context of graph generation, a natural formulation of a diffusion model defines the noising process in the space of adjacency matrices. In this spirit, EDGE (Chen et al., 2023) defines the transition kernel of the forward process as a Bernoulli distribution that resamples each edge with a time-varying positive probability. It aims to reconstruct the original graph by removing edges in the forward process, and denoising by adding edges in the reverse process. By explicitly modelling the node degrees and appropriately restricting each step of the backward process to a selected subset of “active” nodes, EDGE achieves promising performance and efficiency on larger graphs.

2.2. Diffusion Transformers

Transformer-based denoising models (those that predict the noise added in a forward step by $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$, or that predict $\mathbf{x}_0$ directly) have become the standard for image diffusion models. Considering our aim to perform continuous diffusion on a 2D weighted interaction matrix with similar characteristics to an image, our denoising model is heavily inspired by the architecture of the recent work on Diffusion Transformers (Peebles and Xie, 2023). Our main goal is to capitalize on the modeling capacity of scaled dot-product attention to capture neighborhood characteristics for each node in the interaction graph. We follow the original Diffusion Transformer architecture in its use of multi-head attention and time-step conditioning via adaptive layer norms.

2.3. Diffusion-based Recommender Systems

Due to the structure implied by user-item interactions, recommendation problems can be naturally reformulated as link prediction problems on graphs. Applications of graph convolutional networks to this task have been extensively studied (Wang et al., 2021), while applications of diffusion models remain relatively understudied. Recently, Wang et al. (2023) proposed DiffRec, a recommendation paradigm based on generative diffusion models. Given a user and a list of items, the diffusion process acts on the space of interaction vectors; that is, it reconstructs binary vectors encoding whether the user has interacted with each item or not.

2.4. Graph Diffusion Recommender Systems

Subsequently, Jiang et al. (2023) and Zhu et al. (2023) incorporated the graph structure of the interaction history, bridging the gap between recommendation systems and graph diffusion models. However, they still do not fully utilize user or item features, and they perform the forward and reverse diffusion processes over the space of binary interaction encodings, regardless of interaction strength.

A concurrent work by Yi et al. (2024) formulates the recommendation problem similarly; however, they first train a graph neural encoder to produce node (user/item) embeddings. In their experiment with continuous diffusion, they noise these embeddings in the forward process and train a denoising model to recover them. Their graph neural encoder relies mainly on the graph defined by historical user-item interactions, and they additionally incorporate user/item features by measuring the similarity between user-user and item-item pairs based on their features.

Our work differs from their approach in that we incorporate user and item features directly in our denoising model architecture rather than using this information to produce node embeddings. While they perform diffusion over the space of these node embeddings, our diffusion process is applied to the space of weighted interaction matrices. As a result, our denoising model directly predicts interaction strengths, whereas theirs recovers node embeddings.

Figure 2. DiffRec: starting from a binary vector of historical interactions, the forward process noises the user's interaction history until time step $T$ via the transition step $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$. The model is trained to recover $\mathbf{x}_0$ using $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ (Wang et al., 2023).

3. Preliminaries

In the traditional formulation of denoising diffusion models (Ho et al., 2020), the forward diffusion process gradually adds Gaussian noise to the original data, $\mathbf{x}_0 \sim q(\mathbf{x})$, over $T$ diffusion steps to reach pure Gaussian noise, $\mathbf{x}_T \sim \mathcal{N}(0, \mathbf{I})$. Each diffusion step corrupts the noised representation from the previous step:

$$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}\big(\mathbf{x}_t;\, \sqrt{1-\beta_t}\,\mathbf{x}_{t-1},\, \beta_t \mathbf{I}\big)$$

where $\{\beta_t\}_{t=1}^{T}$ are the noise-schedule hyperparameters controlling the amount of noise added at each step, with $\beta_t \in (0, 1)$.

Ho et al. (2020) also provide the analytical form of $\mathbf{x}_t$ given $\mathbf{x}_0$:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_t;\, \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0,\, (1-\bar{\alpha}_t)\mathbf{I}\big)$$
$$\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, \mathbf{I})$$

where $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. This enables any arbitrary step of the diffusion process to be generated extremely quickly for any given $\mathbf{x}_0 \sim q(\mathbf{x})$, simply by sampling $\epsilon \sim \mathcal{N}(0, \mathbf{I})$. The noise schedule can be tuned by choosing an appropriate set of $\{\beta_t\}_{t=1}^{T}$.
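For concreteness, this closed form can be implemented in a few lines. The following PyTorch sketch assumes a linear beta schedule; the schedule endpoints and $T$ are illustrative choices:

```python
import torch

# Linear beta schedule as in Ho et al. (2020); T and the endpoints are
# illustrative choices.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # \bar{alpha}_t for t = 1..T

def sample_xt(x0: torch.Tensor, t: torch.Tensor):
    """Draw x_t ~ q(x_t | x_0) in closed form, returning (x_t, eps)."""
    eps = torch.randn_like(x0)                          # eps ~ N(0, I)
    a = alpha_bar[t].view(-1, *([1] * (x0.dim() - 1)))  # broadcast per sample
    return a.sqrt() * x0 + (1.0 - a).sqrt() * eps, eps
```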

The reverse diffusion process aims to gradually recover the original data $\mathbf{x}_0$ given the fully corrupted $\mathbf{x}_T$. Ho et al. (2020) show that the posterior Gaussian, $p(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$, is tractable (when conditioned on $\mathbf{x}_0$) and has the analytical form:

$$p(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \mu_t(\mathbf{x}_t, \mathbf{x}_0),\, \tilde{\beta}_t \mathbf{I}\big)$$
$$\mu_t(\mathbf{x}_t, \mathbf{x}_0) := \frac{\sqrt{\bar{\alpha}_{t-1}}\,\beta_t}{1-\bar{\alpha}_t}\,\mathbf{x}_0 + \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}\,\mathbf{x}_t$$
$$\tilde{\beta}_t := \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$$

Above, conditioning on $\mathbf{x}_0$ is necessary for the posterior to be tractable. The primary goal of a diffusion model is to learn the reverse process such that arbitrary Gaussian noise can be gradually denoised into samples from the original data distribution $q(\mathbf{x})$. Without conditioning on $\mathbf{x}_0$, we instead need:

$$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}\big(\mathbf{x}_{t-1};\, \mu_\theta(\mathbf{x}_t, t),\, \Sigma_\theta(\mathbf{x}_t, t)\big)$$

The model, parameterized by $\theta$, is trained either to approximate the mean $\mu_\theta$ and variance $\Sigma_\theta$, or to predict the added noise $\epsilon$. Generally, the latter approach has lower variance, and this parameterization more closely resembles Langevin dynamics, with a loss that resembles the denoising score-matching objective (Ho et al., 2020):

$$\mathcal{L} = \left\|\epsilon - \epsilon_\theta(\mathbf{x}_t, t)\right\|^2 = \left\|\epsilon - \epsilon_\theta\!\big(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\; t\big)\right\|^2$$

where $\epsilon_\theta(\cdot)$ is the noise predicted by the model. We use this vanilla diffusion approach over the space of weighted interaction matrices (applying isotropic noise in the forward steps, and training our GDiT architecture to predict the added noise in the reverse process). Inspired by text-conditioned image-generation methods (Rombach et al., 2022), our approach also conditions on user features, $\mathcal{U}$, and item features, $\mathcal{I}$, such that the model predicts $\epsilon_\theta(\mathbf{x}_t, t, \mathcal{U}, \mathcal{I})$ and learns the posterior distribution $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathcal{U}, \mathcal{I})$.
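A hypothetical $\epsilon$-prediction training step under this conditioned formulation might look as follows; the `model(x_t, t, U, I)` signature mirrors $\epsilon_\theta(\mathbf{x}_t, t, \mathcal{U}, \mathcal{I})$ and is an assumption for illustration (`sample_xt` and `T` come from the sketch above):

```python
import torch
import torch.nn.functional as F

def training_step(model, x0, user_feats, item_feats):
    """One denoising training step: predict the added noise, take its MSE."""
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)  # random steps
    xt, eps = sample_xt(x0, t)                  # closed-form forward sample
    eps_pred = model(xt, t, user_feats, item_feats)
    return F.mse_loss(eps_pred, eps)            # ||eps - eps_theta(...)||^2
```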

3.1. Adjacency Matrix vs. Interaction Matrix

With a bipartite undirected graph, there is no need to encode the edges with a full adjacency matrix. We instead define our diffusion process to operate over weighted interaction matrices, with users along the rows and items along the columns. For the small interaction graph in Figure 1, we show below both the adjacency matrix and the weighted interaction matrix (abbreviations are used in place of the full movie titles).

        342   254   436   974 | TGM   TTN   AVG
 342                          | 5.0   2.5   3.0
 254                          | ---   1.0   4.0
 436                          | ---   ---   4.5
 974                          | 1.5   ---   3.5
 -----------------------------+-----------------
 TGM    5.0   ---   ---   1.5 |
 TTN    2.5   1.0   ---   --- |
 AVG    3.0   4.0   4.5   3.5 |
Figure 3. Full adjacency matrix for the interaction graph in Figure 1. Note the empty quadrants in the top-left and bottom-right due to the bipartite structure; ‘---’ marks user-item pairs with no rating.

        TGM   TTN   AVG
 342    5.0   2.5   3.0
 254    ---   1.0   4.0
 436    ---   ---   4.5
 974    1.5   ---   3.5
Figure 4. Weighted interaction matrix for the same graph (with non-existent interactions as ‘---’). Our proposed diffusion method operates over these weighted interaction matrices.
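A weighted interaction matrix can be assembled from rating triples in a few lines. The sketch below reproduces the Figure 4 data; dense storage and the index mapping are illustrative choices (production-scale data would be kept sparse):

```python
import numpy as np

def build_interaction_matrix(triples, num_users, num_items, fill=np.nan):
    """Assemble the (users x items) weighted interaction matrix from
    (user_idx, item_idx, rating) triples; unknown entries keep `fill`."""
    R = np.full((num_users, num_items), fill)
    for u, i, r in triples:
        R[u, i] = r
    return R

# The Figure 1 graph: users (342, 254, 436, 974) -> rows 0-3,
# movies (TGM, TTN, AVG) -> columns 0-2.
R = build_interaction_matrix(
    [(0, 0, 5.0), (0, 1, 2.5), (0, 2, 3.0), (1, 1, 1.0), (1, 2, 4.0),
     (2, 2, 4.5), (3, 0, 1.5), (3, 2, 3.5)], 4, 3)
```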

3.2. Recommender System Constraints

Formulating the recommendation task as a graph diffusion problem introduces two key constraints. First, the set of all users and items (as well as their metadata/attributes) is fixed (e.g., Netflix's subscribers and movie library). Second, the graphs we aim to operate on have at least on the order of $10^3$ nodes and $10^5$ edges. This excludes the majority of prior work on generating graphs with diffusion models, which operates on nodes, edges, and attributes in the forward and reverse processes and cannot scale to graphs with thousands of nodes (often because the methods target applications like molecule generation, where the graphs are substantially smaller and denser than those in recommender systems).

4. Proposed Framework

We take inspiration from the notion of “active node” sampling in the interaction graph from EDGE (Chen et al., 2023), thus enforcing the same mechanism of scalable local diffusion. Below, we highlight some key differences between our method and prior works.

(1) Sets of active nodes are sampled a priori and kept fixed throughout the diffusion. This means that our diffusion process is defined on the space of $n \times m$ sub-matrices (called “patches”), whose values represent all ratings in a sub-sample of $n$ users and $m$ items from the dataset. During training, this translates to a batch of $n \times m$ interaction-matrix patches which are corrupted in the forward process. The model is trained to predict the added noise within these patches given the corrupted data, the time step, and the feature information for the $n$ users and $m$ items within each patch.

(2) For all user-item pairs within a sampled $n \times m$ patch, we consider real-valued interaction strengths as opposed to binary interactions (e.g., ratings from 1 to 5 for the MovieLens datasets (Harper and Konstan, 2015)). This makes the task more difficult, but also closer to real applications of recommendation systems. The interaction strengths are mapped to the continuous interval $[-1, 1]$, with 0 representing a neutral interaction, and positive and negative values corresponding to favorable and unfavorable interactions, respectively.

(3) With a large (and dense) enough dataset, it is reasonable to assume that the distribution of these interaction strengths is roughly Gaussian, or can easily be transformed to a standard Gaussian in an invertible manner, such as with a quantile transformation (see the sketch after this list). This motivates the usage of the vanilla diffusion method over the space of (transformed) weighted interaction matrices.

(4) In contrast to EDGE, we noise all edges within a patch at the same time, without employing additional selection at each step. This is made computationally feasible by the a priori subsampling of the active edges, which makes each subsample significantly smaller than the entire graph.

(5) In contrast to DiffRec, we transform the weights of the interaction-graph edges into approximate samples from a standard Gaussian distribution. While DiffRec stops the forward process at a partially-noised time step to encode historical interaction information as a latent prior, we aim to remove this requirement by conditioning on user and item features, enabling generation by sampling directly from random noise.
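To illustrate item (3), a rating distribution can be mapped invertibly toward a standard Gaussian with a quantile transformation. The scikit-learn transformer below is one possible realization; the jitter that breaks ties between identical ratings is our illustrative choice:

```python
import numpy as np
from sklearn.preprocessing import QuantileTransformer

rng = np.random.default_rng(0)
# Known ratings on a 1-5 scale (flattened known entries of the interaction
# matrix); small jitter breaks ties so the mapping is (nearly) invertible.
ratings = rng.integers(1, 6, size=(10_000, 1)).astype(float)
ratings += rng.uniform(-0.25, 0.25, size=ratings.shape)

qt = QuantileTransformer(output_distribution="normal", random_state=0)
z = qt.fit_transform(ratings)          # approximately N(0, 1)
recovered = qt.inverse_transform(z)    # back to the original rating scale
```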

4.1. Row-Column Separable Attention (RCSA)

A key aspect of weighted interaction matrices is that the ordering of users and items along the axes can be permuted arbitrarily without changing the encoded information. As a result, spatial correlations in an interaction matrix are not analogous to those present in images, and the predictive power present within a square neighborhood of a user-item pair is limited.

We hypothesize that most of the predictive power relevant to the noise prediction at a particular user-item pair is contained in the interaction matrix entries with either the same user or the same item, that is across the same row or the same column. Analogously, we hypothesize the amount of predictive power present across user-item pairs with different items and different users to be low.

Figure 5. A sub-matrix/sub-sample from the full weighted interaction matrix, including the additional features provided to the denoising model (latent representations for user features, $u_i$, and item features, $i_j$). We also show our custom attention mechanism's operation for the user-item pair $(u_3, i_2)$: it attends over other users' ratings for the same item (column), as well as the same user's ratings of other items (row).

The presence of this structure suggests that the commonly used U-Net (Ronneberger et al., 2015) architecture based on square convolutions is not suitable, as it would only capture information within an immediate neighborhood. For this reason, we propose a novel attention module for interaction matrices. Within this module, each element of the matrix attends first to the elements in its row and subsequently to the elements in its column. This is equivalent to a two-headed masked attention module in which the attention coefficients for elements outside the same row or column, respectively, are masked out. In practice, one attention head operates on the sub-sample of the weighted interaction matrix and the other operates on its transpose. Replacing all attention modules with RCSA modules reduces the computation cost for an $N \times M$ patch from $O(N^2 M^2)$ for traditional 2D attention mechanisms to $O(N^2 + M^2)$.
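A minimal PyTorch sketch of the RCSA idea follows; the `(B, N, M, d)` tensor layout and the use of two `nn.MultiheadAttention` modules (one over rows, one over the transpose for columns) are our illustrative assumptions:

```python
import torch
import torch.nn as nn

class RCSA(nn.Module):
    """Row-Column Separable Attention (sketch). Each entry of an (N, M)
    interaction patch attends over the other entries in its row, then over
    the other entries in its column, instead of over all N*M positions."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, M, d) -- a batch of embedded interaction-matrix patches.
        B, N, M, d = x.shape

        # Row attention: fold the N rows into the batch so each user's
        # ratings attend over the M items in that row.
        r = x.reshape(B * N, M, d)
        r, _ = self.row_attn(r, r, r)
        x = r.reshape(B, N, M, d)

        # Column attention: transpose so each item's column of N user
        # ratings forms a sequence, i.e. attention on the transpose.
        c = x.transpose(1, 2).reshape(B * M, N, d)
        c, _ = self.col_attn(c, c, c)
        return c.reshape(B, M, N, d).transpose(1, 2)
```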

4.2. Graph Diffusion Transformer (GDiT)

Figure 6. The GDiT architecture. Node-feature embeddings pass through RCSA self-attention, layer-norm, and MLP blocks (with residual connections) before feeding a row-column cross-attention block. In parallel, the noised interaction strengths pass through adaLN-modulated RCSA self-attention, cross-attention against the node-feature stream, and an MLP, with scale and shift parameters generated from the diffusion time-step embeddings; a final 1×1 convolution produces the noise prediction.

We propose a new transformer architecture (Figure 6) based on Diffusion Transformers (DiT) (Peebles and Xie, 2023), which we use to attend to user and item features when predicting diffusion noise. Specifically, we merge two DiT conditioning variants: adaptive layer norms scaled by the diffusion time step, and cross-attention conditioned on node features. We also apply RCSA (Figure 5) in place of all 2D attention operations. This architectural change better suits attention on bipartite graph structures and improves runtime efficiency by replacing 2D attention with 1D attention. Each block can be repeated an arbitrary number of times to increase the model's capacity for higher-dimensional node-feature embeddings (or to predict multi-dimensional rating/interaction strengths).
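Pulling these pieces together, one GDiT block might be sketched as below. The ordering follows Figure 6, while the 9-way adaLN parameterization (scale/shift/gate per sub-block, in the style of DiT), the shared embedding width for the time-step embedding, and the flattening of the patch for cross-attention are our assumptions (`RCSA` is the module sketched in Section 4.1):

```python
import torch
import torch.nn as nn

class GDiTBlock(nn.Module):
    """One GDiT block (sketch): adaLN-modulated RCSA self-attention over the
    noised patch, cross-attention against node-feature tokens, and an MLP."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm2 = nn.LayerNorm(dim, elementwise_affine=False)
        self.norm3 = nn.LayerNorm(dim, elementwise_affine=False)
        self.self_attn = RCSA(dim, heads)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
        # Time-step embedding -> per-sub-block scale/shift/gate parameters,
        # as in DiT's adaptive layer norm.
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(dim, 9 * dim))

    def forward(self, x, t_emb, feat_tokens):
        # x: (B, N, M, d) noised patch; t_emb: (B, d) time-step embedding;
        # feat_tokens: (B, N + M, d) user and item feature tokens.
        B, N, M, d = x.shape
        s1, b1, g1, s2, b2, g2, s3, b3, g3 = self.ada(t_emb).chunk(9, dim=-1)
        mod = lambda h, s, b: h * (1 + s[:, None, None]) + b[:, None, None]

        x = x + g1[:, None, None] * self.self_attn(mod(self.norm1(x), s1, b1))
        q = mod(self.norm2(x), s2, b2).reshape(B, N * M, d)
        c, _ = self.cross_attn(q, feat_tokens, feat_tokens)
        x = x + g2[:, None, None] * c.reshape(B, N, M, d)
        x = x + g3[:, None, None] * self.mlp(mod(self.norm3(x), s3, b3))
        return x
```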

5. Evaluation

We train and evaluate our proposed method, EDGE-Rec, on the MovieLens datasets ML-100k and ML-1M (Harper and Konstan, 2015). Evaluation is performed using a set of standard recommendation metrics based on the top-$K$ predictions from the model. We train each model for at least 10,000 iterations with a batch size of 16 and a patch size of at least $50 \times 50$. We construct a custom time-aware 90-10 train-test split of the edges, ensuring that test interactions occur after training interactions and that the time ranges covered by the training and validation splits do not overlap.

Training consists of sampling random patches of a specified size $n \times m$ from the target graph. Uniform random sampling of the full weighted interaction matrix over at least 10,000 iterations allows most, if not all, of the users and items to be included during training. The model is provided patches corrupted to uniformly sampled time steps, the time step each patch is noised to, and the user and item features for each patch. Unknown interactions are optionally masked, and, if a user has not interacted with an item, the expected (true) interaction strength is set to 0 (neutral).
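A sketch of this patch sampling routine (the function name, mask convention, and neutral fill are our illustrative choices):

```python
import numpy as np

def sample_patch(R, mask, n, m, rng):
    """Sample an n x m training patch from the full interaction matrix R.

    R: (num_users, num_items) ratings mapped to [-1, 1]; mask marks known
    entries. Unknown interactions are filled with 0 (neutral)."""
    users = rng.choice(R.shape[0], size=n, replace=False)
    items = rng.choice(R.shape[1], size=m, replace=False)
    patch = R[np.ix_(users, items)]
    known = mask[np.ix_(users, items)]
    patch = np.where(known, patch, 0.0)   # neutral fill for unknown entries
    return patch, known, users, items

rng = np.random.default_rng(0)
```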

Finally, we train the GDiT model by maximizing the usual denoising formulation of the ELBO objective, which we regularize at the single-step level using the Bayesian Personalized Ranking loss (Rendle et al., 2012) to encourage coherent rankings. At sampling time, we reconstruct patches of ratings from random noise. Notably, for a given patch we have access to the known training edges, which precede the test edges in time. This allows us to frame the sampling problem as a matrix completion problem. Inspired by inpainting techniques for image completion (Lugmayr et al., 2022), we replace the known portion of the patch with the correspondingly noised known ratings from the training set, further conditioning the generation on the a priori information and encouraging accurate generation of new ratings that “complete” the already-available ones. These known entries are discarded from the candidate set at test time.
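A RePaint-style reverse step adapted to interaction patches might look like the following sketch; the fixed-variance DDPM update and the `model(x_t, t, U, I)` signature are our assumptions:

```python
import torch

@torch.no_grad()
def inpaint_step(x_t, t, x0_known, known_mask, model, betas, alpha_bar, U, I):
    """One reverse step that overwrites known entries with their re-noised
    ground truth (RePaint-style), so sampling completes the observed patch."""
    # Standard DDPM reverse update from the predicted noise (variance beta_t).
    eps = model(x_t, torch.tensor([t], device=x_t.device), U, I)
    mean = (x_t - betas[t] / (1 - alpha_bar[t]).sqrt() * eps) / (1 - betas[t]).sqrt()
    x_prev = mean + betas[t].sqrt() * torch.randn_like(x_t) if t > 0 else mean

    # Forward-noise the known training ratings to level t-1 and write them in.
    ab = alpha_bar[t - 1] if t > 0 else torch.tensor(1.0)
    known = ab.sqrt() * x0_known + (1 - ab).sqrt() * torch.randn_like(x0_known)
    return torch.where(known_mask, known, x_prev)
```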

We also propose a new method for denoising larger subgraphs by randomly tiling the encompassing subgraph with smaller patches at each reverse-diffusion step. As a result, the entire interaction matrix is denoised uniformly during the sampling process, rather than independently, as happens when patches are naively denoised one after another.
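One way to realize this tiled sampling is sketched below; the per-step random grid offsets and the `step_fn` interface (the per-patch reverse step, e.g. the inpainting step above) are our assumptions:

```python
import numpy as np
import torch

@torch.no_grad()
def tiled_denoise_step(x_t, t, tile, step_fn, rng: np.random.Generator):
    """Apply one reverse-diffusion step to a large (N, M) matrix by covering
    it with tile x tile patches whose grid offset is re-randomized at every
    step, so patch boundaries move and the matrix is denoised uniformly."""
    N, M = x_t.shape
    out = x_t.clone()
    dr, dc = int(rng.integers(0, tile)), int(rng.integers(0, tile))
    for r in range(-dr, N, tile):
        for c in range(-dc, M, tile):
            r0, c0 = max(r, 0), max(c, 0)
            r1, c1 = min(r + tile, N), min(c + tile, M)
            if r1 > r0 and c1 > c0:
                # step_fn runs the per-patch reverse step given the patch's
                # user/item index ranges (for feature conditioning).
                out[r0:r1, c0:c1] = step_fn(out[r0:r1, c0:c1], t,
                                            range(r0, r1), range(c0, c1))
    return out
```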

Figures 7 and 8 depict the chosen suite of evaluation metrics for various top-$K$ values. In particular, note that EDGE-Rec surpasses the DiffRec baselines (included in Appendix A) for most patch sampling tasks at edge densities above roughly 55% (in terms of known interactions). Additionally, we observe that scaling up denoising via our novel tiling method does not significantly degrade recommendations, given sufficient subgraph density.

Figure 7. EDGE-Rec 8-block model evaluation metrics on the ML-100k dataset. The figures show average results on arbitrary randomly sampled patches of varying sizes, as well as tiled sampling at $64 \times 64$ over the entire user-item rating graph.

Figure 8. EDGE-Rec 1-block model evaluation metrics on the ML-100k dataset. The top row depicts results on arbitrary $50 \times 50$ patches from subgraphs of specified density. The bottom row compares sampling a fixed $50 \times 50$ patch against sampling the entire subgraph at a fixed label density of 70%, which corresponds to the top 100 users and 100 movies by rating density (i.e., a $100 \times 100$ patch).

5.1. Model and patch size

We scale up the experiments with the proposed architecture by stacking 8 GDiT blocks. Figure 9 shows the results obtained across different choices of density and patch size. Notably, when compared with the corresponding smaller single-block baselines (Figure 8), we observe improvements of up to 0.13 in magnitude across all top-1, top-5, and top-10 metrics, suggesting a strong overall improvement in the quality of the resulting rankings. We further observe the role of the patch size in balancing precision and recall: from top-20 to top-50, as the patch size (and hence the number of ranked candidates) increases, Figure 9 shows an increase in precision and recall and a decrease in NDCG.

Figure 9. EDGE-Rec 8-block model evaluation metrics on the ML-100k dataset. The figure depicts results on arbitrary patches of size $50 \times 50$ and $80 \times 80$ at specified densities.

5.2. Larger Datasets

We train and evaluate a large model with 16 GDiT layers and ${\sim}14$ million parameters on the larger ML-1M dataset (Harper and Konstan, 2015). This dataset contains 1,000,000 ratings from over 6,000 users on roughly 4,000 movies (about 10,000 nodes and 1,000,000 edges). A patch size of $64 \times 64$ was used. Evaluation is performed by randomly sampling 10 patches of this size from the full weighted interaction matrix, performing in-painting as previously described, and averaging the metrics over these patches. Note that model predictions are compared against test edges (user ratings the model has not previously seen).

Figure 10. EDGE-Rec 16-block model evaluation metrics on the ML-1M dataset. The figures show average results on arbitrary randomly sampled patches of size $64 \times 64$.

Given the consistency of the metrics between our evaluations on ML-100K and ML-1M, we conclude that our model scales well to larger datasets. Despite an order of magnitude more data, performance degrades only minimally. This is most likely driven by our training strategy of random patch sampling, which our evaluation strategy also adopts.

6. Conclusions and Limitations

We propose a graph-diffusion-based recommendation model that predicts continuous interaction strengths. Building on advancements in conditioned diffusion models, we generate recommendations by conditioning on user and item features. By tuning the patch size, our model shows promising performance despite patches being two orders of magnitude smaller than the complete graph. Assuming one can identify dense patches of the graph with high accuracy, this makes the training process highly scalable in the number of users and items. However, in highly sparse production-level environments, identifying such patches can be a challenge, as it essentially reduces to the retrieval problem. This makes the dense-patch assumption particularly strong and a clear limitation of the method. Future work could pair this ranking methodology with a trained retrieval model that produces high-density patches of candidate user-item pairs.

Acknowledgements.
To Michael Mu, for the late-night company in the MSML lounge, and for answering our many questions on recent diffusion model developments.

References

  • Bergmeister et al. (2024) Andreas Bergmeister, Karolis Martinkus, Nathanaël Perraudin, and Roger Wattenhofer. 2024. Efficient and Scalable Graph Generation through Iterative Local Expansion. arXiv:2312.11529 [cs.SI]
  • Chen et al. (2023) Xiaohui Chen, Jiaxing He, Xu Han, and Li-Ping Liu. 2023. Efficient and Degree-Guided Graph Generation via Discrete Diffusion Modeling. arXiv:2305.04111 [cs.LG]
  • Harper and Konstan (2015) F. Maxwell Harper and Joseph A. Konstan. 2015. The MovieLens Datasets: History and Context. ACM Trans. Interact. Intell. Syst. 5, 4, Article 19 (dec 2015), 19 pages. https://doi.org/10.1145/2827872
  • Ho et al. (2020) Jonathan Ho, Ajay Jain, and Pieter Abbeel. 2020. Denoising Diffusion Probabilistic Models. arXiv:2006.11239 [cs.LG]
  • Jiang et al. (2023) Yangqin Jiang, Yuhao Yang, Lianghao Xia, and Chao Huang. 2023. DiffKG: Knowledge Graph Diffusion Model for Recommendation. arXiv:2312.16890 [cs.IR]
  • Jo et al. (2022) Jaehyeong Jo, Seul Lee, and Sung Ju Hwang. 2022. Score-based Generative Modeling of Graphs via the System of Stochastic Differential Equations. arXiv:2202.02514 [cs.LG]
  • Liu et al. (2023) Chengyi Liu, Wenqi Fan, Yunqing Liu, Jiatong Li, Hang Li, Hui Liu, Jiliang Tang, and Qing Li. 2023. Generative Diffusion Models on Graphs: Methods and Applications. arXiv:2302.02591 [cs.LG]
  • Lugmayr et al. (2022) Andreas Lugmayr, Martin Danelljan, Andres Romero, Fisher Yu, Radu Timofte, and Luc Van Gool. 2022. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 11461–11471.
  • Ma et al. (2024) Haokai Ma, Ruobing Xie, Lei Meng, Xin Chen, Xu Zhang, Leyu Lin, and Zhanhui Kang. 2024. Plug-in Diffusion Model for Sequential Recommendation. arXiv:2401.02913 [cs.IR]
  • Peebles and Xie (2023) William Peebles and Saining Xie. 2023. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4195–4205.
  • Rendle et al. (2012) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2012. BPR: Bayesian Personalized Ranking from Implicit Feedback. arXiv:1205.2618 [cs.IR]
  • Rombach et al. (2022) Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 10684–10695.
  • Ronneberger et al. (2015) Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical image computing and computer-assisted intervention–MICCAI 2015: 18th international conference, Munich, Germany, October 5-9, 2015, proceedings, part III 18. Springer, 234–241.
  • Vignac et al. (2023) Clement Vignac, Igor Krawczuk, Antoine Siraudin, Bohan Wang, Volkan Cevher, and Pascal Frossard. 2023. DiGress: Discrete Denoising diffusion for graph generation. arXiv:2209.14734 [cs.LG]
  • Wang et al. (2021) Shoujin Wang, Liang Hu, Yan Wang, Xiangnan He, Quan Z. Sheng, Mehmet A. Orgun, Longbing Cao, Francesco Ricci, and Philip S. Yu. 2021. Graph Learning based Recommender Systems: A Review. arXiv:2105.06339 [cs.IR]
  • Wang et al. (2023) Wenjie Wang, Yiyan Xu, Fuli Feng, Xinyu Lin, Xiangnan He, and Tat-Seng Chua. 2023. Diffusion Recommender Model. arXiv:2304.04971 [cs.IR]
  • Yang et al. (2023) Zhengyi Yang, Jiancan Wu, Zhicai Wang, Xiang Wang, Yancheng Yuan, and Xiangnan He. 2023. Generate What You Prefer: Reshaping Sequential Recommendation via Guided Diffusion. arXiv:2310.20453 [cs.IR]
  • Yi et al. (2024) Zixuan Yi, Xi Wang, and Iadh Ounis. 2024. A Directional Diffusion Graph Transformer for Recommendation. arXiv:2404.03326 [cs.IR]
  • Zhu et al. (2023) Yunqin Zhu, Chao Wang, and Hui Xiong. 2023. Towards Graph-Aware Diffusion Modeling for Collaborative Filtering. arXiv:2311.08744 [cs.IR]

Appendix A DiffRec Baseline Metrics

Figure 11. DiffRec (Wang et al., 2023) evaluation metrics on ML-100k and ML-1M.

Appendix B Visualizing the Sparsity in the MovieLens datasets

Figure 12 visualizes the ratings provided by all users for the movies in the MovieLens-100K dataset. Note the inherent sparsity in the encoded interaction matrix.

Figure 12. Heat-map of ratings provided by users for movies ranging from 1 to 5 in the MovieLens-100K dataset. Values are 0 (black) where the user has not rated the movie.

To sample dense subgraphs of the interaction matrix, we heuristically identify “important,” densely connected users and movies by counting the number of nodes within a distance of 2 edges of each user or movie node in the interaction graph. Figure 13 shows a heat-map of the data after sorting users and movies by this heuristic; a sketch of the heuristic follows below. We sample dense subgraphs of the ML-100k dataset by selecting a suitably sized sub-matrix from the lower-left corner of the sorted interaction matrix.
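One way to compute this two-hop heuristic directly from the ratings matrix (a sketch; treating co-rating users, and co-rated movies, as the distance-2 neighbors in the bipartite graph):

```python
import numpy as np

def two_hop_counts(R):
    """Score each user and movie by how many nodes lie within two edges of
    it in the bipartite rating graph: a user's 2-hop neighbors are the other
    users who co-rated at least one movie (symmetrically for movies)."""
    A = (R > 0).astype(np.int64)                    # binary incidence matrix
    user_scores = ((A @ A.T) > 0).sum(axis=1) - 1   # exclude the node itself
    item_scores = ((A.T @ A) > 0).sum(axis=1) - 1
    return user_scores, item_scores

# Sorting by these scores and taking a corner sub-matrix yields a dense patch:
# order_u, order_i = np.argsort(user_scores), np.argsort(item_scores)
```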

Figure 13. Heat-map of user-provided ratings for movies in the MovieLens-100K dataset, with users and movies sorted by the density heuristic described above.