Learning ID-free Item Representation with Token Crossing for Multimodal Recommendation
Abstract.
Current multimodal recommendation models have extensively explored the effective utilization of multimodal information; however, their reliance on ID embeddings remains a performance bottleneck. Even with the assistance of multimodal information, optimizing ID embeddings remains challenging for ID-based multimodal recommenders when interaction data is sparse. Furthermore, the item-specific nature of ID embeddings hinders information exchange among related items, and the storage requirement of ID embeddings grows with the number of items. Motivated by these limitations, we propose an ID-free MultimOdal TOken Representation scheme named MOTOR that represents each item using learnable multimodal tokens and connects items through shared tokens. Specifically, we first employ product quantization to discretize each item’s multimodal features (e.g., images, text) into discrete token IDs. We then interpret the token embeddings corresponding to these token IDs as implicit item features and introduce a new Token Cross Network to capture the implicit interaction patterns among these tokens. The resulting representations can replace the original ID embeddings and transform the original ID-based multimodal recommender into an ID-free system, without introducing any additional loss design. MOTOR reduces the overall space requirements of these models and facilitates information interaction among related items, while also significantly enhancing the model’s recommendation capability. Extensive experiments (the source code will be available upon acceptance) on nine mainstream models demonstrate the significant performance improvement achieved by MOTOR, highlighting its effectiveness in enhancing multimodal recommendation systems.
1. Introduction
Recommendation systems have become an essential component in enabling users to discover and explore content across various online services, from e-commerce platforms to social networks. Sparse interactions pose a significant challenge for traditional collaborative filtering techniques. Multimodal Recommender Systems address this challenge by incorporating item-related multimedia features, such as images, text descriptions, and other modalities, to enrich the representation of items (He and McAuley, 2016; Wei et al., 2019; Wang et al., 2021; Zhou et al., 2023a; Zhou and Shen, 2023; Yu et al., 2023; Liu et al., 2024; Zhang et al., 2024).
Traditional approaches fuse multimodal features with item ID embeddings through concatenation or summation (He and McAuley, 2016; Liu et al., 2017). The success of graph neural networks has also inspired a series of works that leverage GNNs to capture the high-order semantics between modal features and ID embeddings (Wei et al., 2019; Wang et al., 2021; Wei et al., 2020; Yu et al., 2023). Some works have also considered the semantic differences between modal domains (Zhou et al., 2023a; Tao et al., 2022; Liu et al., 2024), as well as between behavior and modality (Zhang et al., 2024). However, current multimodal recommendation models still heavily rely on learning ID embeddings, with multimodal features serving only an auxiliary role, which results in the following issues: (i) Information Isolation: the independent ID embedding for each item hinders information exchange among related items; (ii) Cold-Start: for items with very few interactions, the ID embeddings are difficult to optimize adequately; (iii) Storage Burden: the storage requirement for ID embeddings grows as the number of items increases.

Considering these constraints, we propose an ID-free multimodal token representation scheme (MOTOR). This approach replaces the original ID embeddings of items with Token Representations, thereby seamlessly transforming an ID-based recommender into an ID-free system, as illustrated in Figure 1. Specifically, given the multimodal features of all items, we first employ optimized product quantization (OPQ) to discretize these features into $D$-dimensional discrete codes: if each item contains $M$ modalities, we discretize the features of each modality into $D$ token IDs, so each item is ultimately assigned $M \times D$ token IDs. In MOTOR, semantically related items (e.g., those with similar raw multimodal features) are effectively connected through shared tokens, addressing the problem of Information Isolation. For the Cold-Start problem, MOTOR leverages the token-sharing mechanism to enhance the representation of cold-start and long-tail items by utilizing information from popular items. Furthermore, we no longer need to maintain an ID Embedding Table for the items; instead, we maintain modal-specific Token Embedding Tables with far fewer trainable parameters, alleviating the Storage Burden. Experimentally, on the item side, we only need approximately 20% of the original embedding parameters to learn item representations on the largest dataset. We obtain the token embeddings of each item by looking up the Token Embedding Tables with its token IDs.
Note that the tokens in MOTOR differ from the semantically explicit word tokens in language models. Large language models often use attention layers or transformer modules (Vaswani et al., 2017a) to model the associations among word tokens, leading to considerable computational and storage requirements. In MOTOR, we aim to explore the interaction patterns between multimodal tokens without incurring significant overhead. Innovatively, we treat these token embeddings as implicit feature vectors. Through a simple yet effective Token Cross Network (TCN), we perform first-order, second-order, and high-order interactions over these feature vectors to compute the corresponding Token Representations of the items. In practice, we consider two types of Token Cross Networks: Modal-specific and Modal-agnostic. Modal-specific means designing a TCN tailored to each modality, wherein token embeddings interact exclusively within the same modality to derive modal-specific Token Representations, which are subsequently fused by aggregating the token representations across modalities. Modal-agnostic utilizes a single larger TCN to perform unified interactions among all token embeddings from all modalities, directly generating the final Token Representation. This Token Representation replaces the original ID embedding, enabling the seamless transformation of existing ID-based multimodal recommendation models into ID-free frameworks. In general, MOTOR is compatible with various multimodal recommendation models without introducing additional loss designs.
Our main contributions are summarized as follows:
• We explore the issues in current ID-based multimodal recommenders and, based on this analysis, propose an ID-free multimodal token representation scheme, MOTOR. To the best of our knowledge, MOTOR is the first work to introduce tokenization for ID-free multimodal recommendation, learning item representations through learnable multimodal token crossing.
• To avoid introducing excessive overhead, we treat token embeddings as implicit item features and design a lightweight Token Cross Network to explore the token interactions. Experimentally, we evaluate both the Modal-specific and Modal-agnostic Token Cross Networks and systematically compare their performance.
• We conduct extensive experiments on nine backbone models, encompassing both general and multimodal recommendation models. MOTOR consistently achieves state-of-the-art performance, with notable improvements for both long-tail and popular items. We will provide an open-source framework compatible with multiple downstream multimodal recommendation models.
2. Related Work
2.1. ID-based Multimodal Recommendation
Due to the achievements in collaborative filtering (He et al., 2017), numerous early multimodal recommendation models aim to seamlessly blend visual or textual auxiliary data with entity ID embeddings in order to effectively capture modality-augmented user-item interactions. For instance, VBPR (He and McAuley, 2016) enhances BPR-based recommendations by incorporating visual features, while Deepstyle (Liu et al., 2017) augments item representations with both visual and style features within the BPR framework. Recently, a new research direction introduces GNNs into multimodal recommendation systems: MMGCN (Wei et al., 2019) leverages modality-specific relational graphs, while DualGNN (Wang et al., 2021) incorporates a user-user graph. Certain studies have also addressed the semantic distinctions across modal domains. BM3 (Zhou et al., 2023a) and SLMRec (Tao et al., 2022) propose self-supervised learning strategies for aligning modalities. AlignRec (Liu et al., 2024) breaks down the recommendation task into three alignment objectives, each governed by a unique objective function. Furthermore, DREAM (Zhang et al., 2024) addresses the issue of modal information forgetting by introducing a similarity-supervised signal.
However, these ID-based models still heavily rely on learning item ID embeddings: even with the integration of multimodal information, if the ID embeddings are not adequately optimized, the model’s recommendation performance can drop significantly. MOTOR does not redesign the downstream multimodal recommendation models; instead, it employs a streamlined framework that uses learnable Token Representations to replace the original ID embeddings, thereby transforming these ID-based recommenders into ID-free systems.
2.2. Tokenization in Recommendation
Current tokenization methods in recommendation mainly fall into two directions: LLM-based recommendation and generative recommendation.
Tokens in LLM-based Recommendation. Owing to the tremendous success of Large Language Models (LLMs) in NLP, researchers are endeavoring to leverage the rich semantic knowledge and powerful reasoning capabilities of LLMs to improve recommendation systems (Lin et al., 2024b; Hou et al., 2023; Geng et al., 2023; Lin et al., 2024a). Early efforts such as VQRec (Hou et al., 2023) convert text features encoded by BERT (Vaswani et al., 2017b) into discretized codes for transferable sequential recommenders. Recent studies explore using in-vocabulary tokens to characterize items in LLMs, where users and items are represented by token combinations. P5 (Geng et al., 2023) converts user-item interactions into natural language formats using numeric IDs constructed from in-vocabulary tokens of the T5 model (Raffel et al., 2023). Further, sequential IDs and collaborative IDs have been developed to improve item information exchange (Hua et al., 2023).
Generative Recommendation. Within this paradigm, item sequences are transformed into token sequences, which are subsequently input into generative models to autoregressively predict the target item’s tokens. TIGER (Rajput et al., 2023) constructs IDs using hierarchical tokens through learnable RQ-VAE. LETTER (Wang et al., 2024) proposes aligning quantized embeddings in RQ-VAE with collaborative embeddings to leverage both collaborative and semantic information.
To the best of our knowledge, MOTOR is the first to introduce product quantization of multimodal features into ID-based multimodal recommenders, proposing an innovative Token Representation computed by a Token Cross Network to replace the original ID embeddings. Thus, our work remains orthogonal to the aforementioned LLM-based and generative recommendation paradigms.
3. Methodology

3.1. Preliminary
Consider a set of users denoted by $\mathcal{U}$ and a set of items denoted by $\mathcal{I}$. Each item $i \in \mathcal{I}$ is associated with multimodal features $\{x_i^m\}_{m \in \mathcal{M}}$, where $\mathcal{M} = \{v, t\}$. In this context, $v$ and $t$ correspond to the visual and textual modalities, respectively.
The user behavior history is represented as a matrix $R \in \{0, 1\}^{|\mathcal{U}| \times |\mathcal{I}|}$, where $R_{u,i} = 1$ signifies that user $u$ has interacted with item $i$, and $R_{u,i} = 0$ otherwise. Essentially, $R$ can be viewed as a sparse behavior graph $\mathcal{G} = (\mathcal{V}, \mathcal{E})$, with $\mathcal{V} = \mathcal{U} \cup \mathcal{I}$ representing the vertices and $\mathcal{E}$ representing the edges. The set of items that user $u$ has interacted with is denoted by $\mathcal{I}_u$.
The aim of ID-based multimodal recommendation is to precisely estimate user preferences by ranking items for each user according to predicted preference scores $\hat{y}_{u,i} = f(u, i \mid \Theta, E^{id}, X)$, where $\Theta$ represents the model’s parameters, $E^{id}$ stands for the ID embeddings, and $X$ denotes the modal features of the items. In MOTOR, we attempt to learn a novel Item Token Representation $h_i$ to replace the item’s ID embedding $e_i^{id}$. The preference function of ID-free recommendation can be denoted as $\hat{y}_{u,i} = f(u, i \mid \Theta, H, X)$. (In this work, we use lowercase letters with subscripts to denote the vector corresponding to a specific item, and uppercase letters to represent the feature matrix of all items. For example, $h_i$ denotes the final Token Representation of item $i$, while $H$ represents the final Token Representations of all items.)
3.2. The Framework of MOTOR
The structure of MOTOR is illustrated in Figure 2. The core objective of MOTOR is to effectively learn $H$. This learning process can be divided into two steps. First, by applying Product Quantization to the raw multimodal features of items, we obtain each item’s multimodal Token IDs $c_i^m$ for every modality $m$. Based on the learned Token IDs, we retrieve the token embeddings of a specific item by looking them up in the Token Embedding Tables; each item thus corresponds to multiple token embeddings. Second, we treat the token embeddings of each item as its implicit feature vectors. Through a lightweight Token Cross Network, we perform first-order, second-order, and high-order feature interactions on these implicit feature vectors, ultimately generating the Token Representation $h_i$. Please note that the parameters in MOTOR (e.g., Token Embedding Tables, Token Cross Network) are trained end-to-end with the downstream multimodal model, rather than through separate learning. Next, we present MOTOR in detail.
3.3. Discretized Token Learning
The initial textual and visual information of items can be compressed into feature vectors $x_i^t$ and $x_i^v$ using various text encoders (Vaswani et al., 2017b; Radford et al., 2019; Reimers and Gurevych, 2019; Raffel et al., 2023) and visual encoders (Simonyan and Zisserman, 2014; He et al., 2016; Dosovitskiy et al., 2020), respectively. Within the MOTOR framework, we do not rely on specific encoders, nor do we require semantic alignment between image features and text features as in CLIP (Radford et al., 2021).
Next, based on Product Quantization (Jegou et al., 2010), we map the modal features into discretized tokens. We take the visual feature $x_i^v \in \mathbb{R}^{d_v}$ as an example; a similar approach is applied to the text feature $x_i^t$. We begin by partitioning each visual feature vector into $D$ subvectors, each of dimensionality $d_v / D$. This can be expressed as:
(1)  $x_i^v = \big[\, x_i^{v,1}, x_i^{v,2}, \dots, x_i^{v,D} \,\big], \quad x_i^{v,j} \in \mathbb{R}^{d_v / D}$
For each subvector position $j$, we apply the K-means clustering algorithm over $\{x_i^{v,j}\}_{i \in \mathcal{I}}$ to generate $K$ cluster centroids, constructing a codebook $\mathcal{B}^{v,j}$. This step can be mathematically formulated as:
(2)  $\mathcal{B}^{v,j} = \big\{\, \mu_1^{v,j}, \mu_2^{v,j}, \dots, \mu_K^{v,j} \,\big\}, \quad j = 1, \dots, D$
where $\mu_k^{v,j}$ represents the $k$-th centroid in the $j$-th subvector’s codebook. To acquire these PQ centroid embeddings, we employ a well-known technique called Optimized Product Quantization (OPQ) (Ge et al., 2013) and optimize the centroids on the visual features of all items in the training dataset. Each subvector $x_i^{v,j}$ is then quantized by finding the nearest centroid in the corresponding codebook $\mathcal{B}^{v,j}$; the index of this centroid acts as the Token ID. This assignment is expressed as:
(3)  $c_i^{v,j} = \arg\min_{k \in \{1, \dots, K\}} \big\| x_i^{v,j} - \mu_k^{v,j} \big\|_2, \qquad c_i^{v} = \big( c_i^{v,1}, c_i^{v,2}, \dots, c_i^{v,D} \big)$
where $c_i^{v,j}$ is the Token ID of item $i$ for the $j$-th subvector in the vision modality, and $c_i^{v}$ collects the visual Token IDs of item $i$. Finally, the discrete Token IDs of all items are stacked as:
(4)  $C^{v} = \big[\, c_1^{v}; c_2^{v}; \dots; c_{|\mathcal{I}|}^{v} \,\big] \in \{1, \dots, K\}^{|\mathcal{I}| \times D}$
In summary, through vector quantization and clustering, the visual features are transformed into a set of discrete Token IDs $C^{v}$. A similar procedure converts the text features into their Token IDs $C^{t}$. Although the original multimodal features are transformed into discretized Token IDs, they still retain significant semantic information (e.g., semantically similar items share certain Token IDs). We further present a semantic analysis of the Token IDs in Appendix 6.3.
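As a concrete reference, the following is a minimal sketch of this tokenization step using the Faiss library (on which our implementation builds, see Section 4.4). The item count, the random features standing in for the real CNN features, and the choice of $D{=}4$ sub-vectors with $K{=}256$ centroids are illustrative assumptions, not the exact training script.

```python
import faiss
import numpy as np

num_items, d_v = 10000, 4096      # hypothetical item count; 4096-dim visual features
D, n_bits = 4, 8                  # D sub-vectors per modality, 2^8 = 256 centroids each

visual = np.random.randn(num_items, d_v).astype("float32")  # stand-in for CNN features

# Learn the OPQ rotation that makes the features easier to product-quantize.
opq = faiss.OPQMatrix(d_v, D)
opq.train(visual)
rotated = opq.apply_py(visual)

# Product quantization: one K-means codebook per sub-vector, nearest-centroid assignment (Eqs. 1-3).
pq = faiss.ProductQuantizer(d_v, D, n_bits)
pq.train(rotated)
visual_tokens = pq.compute_codes(rotated)  # shape (num_items, D), dtype uint8

print(visual_tokens[:2])  # each row holds an item's D visual Token IDs in [0, 255]
```

The same routine applied to the 384-dimensional text features yields the text Token IDs; the quantization runs once offline and is reused by all downstream models.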
3.4. Token Embeddings as Implicit Features
Given the learned discrete Token IDs for each modality, we can directly perform an embedding lookup to obtain the token embeddings.
Considering the visual tokens $C^{v}$, which comprise indexed Token IDs, we construct $D$ Token Embedding Tables for lookup, each corresponding to a particular dimension of the Token IDs. For a given dimension $j$ in the visual modality, all items share the same Token Embedding Table $E^{v,j} \in \mathbb{R}^{K \times d}$, where $K$ corresponds to the number of cluster centers specified in Section 3.3 and also defines the range of index values, and $d$ denotes the dimension of the token embeddings and Token Representations. In MOTOR, the Token Representation replaces the original ID Embedding, so $d$ is kept consistent with the dimension of the original model’s ID Embedding.
Considering the visual tokens of item $i$, $c_i^{v} = (c_i^{v,1}, \dots, c_i^{v,D})$, we obtain the token embeddings by embedding lookup as $e_i^{v,j} = E^{v,j}\big[c_i^{v,j}\big]$, where $E^{v,j}[k]$ denotes the $k$-th row of embedding table $E^{v,j}$. Note that the token embedding $e_i^{v,j}$ is different from the corresponding PQ centroid $\mu_{c_i^{v,j}}^{v,j}$. Similar to ID Embeddings, we initialize the Token Embedding Tables randomly and update them throughout the learning process alongside the downstream multimodal recommendation model in an end-to-end fashion. Analogous to the visual modality, we obtain the token embeddings for the text modality in the same manner. Therefore, the overall token embeddings of item $i$ are $\{ e_i^{v,j} \}_{j=1}^{D} \cup \{ e_i^{t,j} \}_{j=1}^{D}$.
In multimodal recommendation, we assume that items do not have content-related features such as titles or categories, but only multiple modality-related features. Within the MOTOR framework, we regard all obtained token embeddings as implicit features of the item and feed them into the Token Cross Network. An intuitive explanation is that similarity between items’ multimodal features implies content and semantic similarity: when two items share the same token embedding, it is analogous to sharing the same categorical feature (e.g., the same category: technology).
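A minimal PyTorch sketch of these modal-specific Token Embedding Tables and the lookup is given below; the module name and the defaults of $D{=}4$ tokens per modality, $K{=}256$, and $d{=}64$ are illustrative assumptions consistent with our experimental settings, not the exact released code.

```python
import torch
import torch.nn as nn

class TokenEmbeddings(nn.Module):
    """One Token Embedding Table per token position and per modality (Section 3.4)."""
    def __init__(self, num_tokens_per_modality=4, num_centroids=256, dim=64):
        super().__init__()
        self.tables = nn.ModuleDict({
            m: nn.ModuleList([nn.Embedding(num_centroids, dim)
                              for _ in range(num_tokens_per_modality)])
            for m in ("visual", "text")
        })

    def forward(self, token_ids):
        # token_ids: {"visual": LongTensor [B, D], "text": LongTensor [B, D]}
        out = {}
        for m, ids in token_ids.items():
            # Look up the j-th Token ID in the j-th table of modality m, then stack: [B, D, dim]
            out[m] = torch.stack(
                [self.tables[m][j](ids[:, j]) for j in range(ids.size(1))], dim=1)
        return out
```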
3.5. Token Cross Network
In traditional CTR tasks, a core issue is the effective utilization of feature interactions (Cheng et al., 2016; Guo et al., 2017; Wang et al., 2017). In MOTOR, we regard the token embeddings of items as the corresponding implicit features and use a lightweight Token Cross Network to compute the feature interactions. Inspired by DeepFM (Guo et al., 2017), the Token Cross Network models the first-order, second-order, and high-order components between token embeddings.
Considering that an item comprises token embeddings from multiple modalities, we design two mechanisms for token interactions in MOTOR: the Modal-specific Token Cross Network and the Modal-agnostic Token Cross Network. The Modal-specific Token Cross Network crosses the token embeddings within each modality separately and ultimately combines the representations from the multiple modalities. Conversely, the Modal-agnostic Token Cross Network performs holistic interactions over all tokens of an item, directly generating the final Token Representation.
3.5.1. Modal-specific Token Cross Network
Taking the feature interactions of the visual modality as an example, we use a visual-specific Token Cross Network to process the visual token embeddings. The first-order interaction result of item $i$ is the weighted sum of the individual token embeddings:
(5)  $o_i^{v,(1)} = \sum_{j=1}^{D} w_j^{v} \, e_i^{v,j}$
where $w_j^{v}$ is the learnable weight of the corresponding token embedding. In the second-order interaction, we reuse the corresponding weights of the token embeddings:
(6)  $o_i^{v,(2)} = \sum_{j=1}^{D} \sum_{k=j+1}^{D} \big( w_j^{v} e_i^{v,j} \big) \odot \big( w_k^{v} e_i^{v,k} \big)$
where $\odot$ denotes the element-wise multiplication of two vectors. We utilize a simple MLP network to model the high-order interactions of the token embeddings:
(7)  $a^{(0)} = \big[\, e_i^{v,1} \,\|\, e_i^{v,2} \,\|\, \dots \,\|\, e_i^{v,D} \,\big], \quad a^{(l)} = \sigma\big( W^{(l)} a^{(l-1)} + b^{(l)} \big), \quad o_i^{v,(3)} = a^{(L)}$
where $l$ is the layer index, $\sigma$ is the activation function, and $L$ is the total number of layers. The Token Representation of the visual modality is the summation of the first-, second-, and high-order terms:
(8)  $h_i^{v} = o_i^{v,(1)} + o_i^{v,(2)} + o_i^{v,(3)}$
Similarly, we utilize the Text-specific Token Cross Network to generate the text-related Token Representation $h_i^{t}$; the final Token Representation $h_i$ is the summation of $h_i^{v}$ and $h_i^{t}$. The final Token Representation matrix $H$ replaces the original Item ID Embeddings:
(9)  $h_i = h_i^{v} + h_i^{t}, \qquad H = \big[\, h_1; h_2; \dots; h_{|\mathcal{I}|} \,\big]$
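The following PyTorch sketch illustrates one plausible implementation of this modal-specific crossing under our reconstruction of Eqs. (5)-(8); the class name, the pairwise-sum identity used for the second-order term, and the MLP depth and width are illustrative design assumptions rather than the exact released code.

```python
import torch
import torch.nn as nn

class TokenCrossNet(nn.Module):
    """First-, second-, and high-order crossing of one set of token embeddings."""
    def __init__(self, num_tokens=4, dim=64, depth=2):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_tokens))        # learnable token weights
        layers, in_dim = [], num_tokens * dim
        for _ in range(depth):                                # high-order MLP with output size dim
            layers += [nn.Linear(in_dim, dim), nn.ReLU()]
            in_dim = dim
        self.mlp = nn.Sequential(*layers)

    def forward(self, tokens):                                # tokens: [B, D, dim]
        weighted = self.w.view(1, -1, 1) * tokens
        first = weighted.sum(dim=1)                           # Eq. (5): weighted sum
        # Eq. (6): sum over pairs (w_j e_j) * (w_k e_k), computed via the FM identity
        second = 0.5 * (weighted.sum(dim=1) ** 2 - (weighted ** 2).sum(dim=1))
        high = self.mlp(tokens.flatten(1))                    # Eq. (7): MLP over concatenation
        return first + second + high                          # Eq. (8)
```

In the modal-specific setting, one such network per modality produces $h_i^{v}$ and $h_i^{t}$, which are summed as in Eq. (9).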
3.5.2. Modal-agnostic Token Cross Network
The Modal-agnostic Token Cross Network treats the token embeddings from all modalities as a unified whole for feature interactions. Considering the vision and text modalities, there are $2D$ token embeddings in total. The overall computation process is as follows:
(10)  $h_i = \mathrm{TCN}\big( e_i^{v,1}, \dots, e_i^{v,D}, e_i^{t,1}, \dots, e_i^{t,D} \big) = o_i^{(1)} + o_i^{(2)} + o_i^{(3)}$
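Continuing the sketch above, the modal-agnostic variant can reuse the same crossing module over the concatenated tokens of both modalities; the tensors below are random stand-ins for the looked-up token embeddings.

```python
# Modal-agnostic variant: cross all 2*D token embeddings in a single network (Eq. 10).
B, D, dim = 32, 4, 64
visual_emb = torch.randn(B, D, dim)   # stand-in for the looked-up visual token embeddings
text_emb = torch.randn(B, D, dim)     # stand-in for the looked-up text token embeddings

tcn_agnostic = TokenCrossNet(num_tokens=2 * D, dim=dim)
h = tcn_agnostic(torch.cat([visual_emb, text_emb], dim=1))  # final Token Representation: [B, dim]
```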
3.6. Token Representation for Multimodal Recommendation
MOTOR replaces the original Item ID Embeddings with Token Representations and transforms the original ID-based models into ID-free recommenders. During model training, the primary parameters of the MOTOR module (Token Embedding Tables, Token Cross Network) undergo end-to-end training and optimization, analogous to ID Embeddings. Specifically, for a downstream multimodal recommendation model, to generate item recommendations for a specific user $u$, we first predict the interaction scores between the user and the candidate items. We then rank the candidate items in descending order of predicted score and select the top-$N$ items as recommendations for $u$. The interaction score is calculated as:
(11)  $\hat{y}_{u,i} = g_u^{\top} g_i$
where $g_u$ and $g_i$ are the final user and item representations, derived from the output of the multimodal recommendation model as shown in Figure 2.
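For reference, a minimal sketch of this scoring and ranking step is shown below; the tensor shapes are illustrative, and the usual mask over already-interacted items is noted but omitted for brevity.

```python
import torch

# Hypothetical final representations produced by a MOTOR-enhanced backbone.
num_users, num_items, dim = 1000, 5000, 64
user_repr = torch.randn(num_users, dim)
item_repr = torch.randn(num_items, dim)   # built from Token Representations, not ID embeddings

scores = user_repr @ item_repr.T          # Eq. (11): inner-product preference scores
# In practice, items each user has already interacted with are masked out before ranking.
top_n = scores.topk(k=20, dim=1).indices  # top-20 recommended item indices per user
```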
3.7. Discussion
3.7.1. The Distinguishability of Final Item Representation
In recommendation systems, a fundamental requirement is the distinctiveness of the final representations of users and items. Within the MOTOR framework, due to its ID-free nature, the uniqueness of the final item representation is supported by the following aspects: (i) The minimum collision probability of Token IDs. Previous research (Zhan et al., 2021; Hou et al., 2023; Xiong et al., 2009) demonstrated that the minimum collision probability is attained when modal features are quantized evenly across all possible discrete IDs. The training procedure of OPQ tends to create clusters of relatively uniform size, indicating that OPQ can generate tokens with strong distinguishability for items. Specifically, we illustrate the distributions of Token IDs across different datasets in Appendix 6.2. (ii) The distinctiveness of the forward process: during the model’s forward pass, each item further integrates its own multimodal information. Meanwhile, the use of graph neural networks allows different items to assimilate information from their respective connected nodes.
3.7.2. Complexity Analysis
Space Complexity. MOTOR employs Token Representations to transform ID-based recommenders into ID-free models, thereby reducing the space complexity on the item side from $O(|\mathcal{I}| \cdot d)$ to $O(M \cdot D \cdot K \cdot d)$, where $M \cdot D$ is the number of Token IDs per item, $K$ is the number of cluster centers per segment, and $M \cdot D \cdot K \ll |\mathcal{I}|$. Experimentally, we require only 10% to 20% of the original embedding parameters to learn item representations across the three datasets. Furthermore, in Section 4.8, we demonstrate how varying the number of item Token IDs across different backbone models affects the total trainable parameters and model performance.
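As a back-of-the-envelope illustration with the settings used in our experiments ($K{=}256$, $d{=}64$, two modalities), and assuming for the sake of the example $D{=}16$ tokens per modality on the largest dataset (Office):

```latex
\begin{align*}
\text{ID embedding table: }     & |\mathcal{I}| \cdot d = 38{,}140 \times 64 \approx 2.44\,\text{M parameters} \\
\text{Token embedding tables: } & M \cdot D \cdot K \cdot d = 2 \times 16 \times 256 \times 64 \approx 0.52\,\text{M parameters} \;(\approx 21\%)
\end{align*}
```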
Time Complexity. Compared to the original model, the MOTOR-enhanced recommendation model introduces a Token Cross Network, which incurs some additional computation. With $MD$ token embeddings of dimension $d$ per item and an $L$-layer MLP whose hidden widths are on the order of $d$, the per-item complexity of the first-order, second-order, and high-order components is $O(MDd)$, $O((MD)^2 d)$, and $O\big((MD + L) d^2\big)$, respectively, so the total additional time complexity is $O\big((MD)^2 d + (MD + L) d^2\big)$ per item. Considering that the embedding dimension $d$ and the number of Token IDs $MD$ are relatively small, this additional computational cost is negligible.
4. Experiments
Datasets | # Users | # Items | # Interactions | Sparsity |
Music | 57202 | 25717 | 184556 | 99.9875% |
Baby | 139722 | 28795 | 501717 | 99.9874% |
Office | 146048 | 38140 | 435675 | 99.9922% |
Music | Baby | Office | |||||||||||||
Models | R@10 | R@20 | N@10 | N@20 | Improv. | R@10 | R@20 | N@10 | N@20 | Improv. | R@10 | R@20 | N@10 | N@20 | Improv. |
BPR | 0.0203 | 0.0310 | 0.0106 | 0.0132 | — | 0.0254 | 0.0395 | 0.0133 | 0.0169 | — | 0.0178 | 0.0250 | 0.0108 | 0.0126 | — |
+MS | 0.0408 | 0.0602 | 0.0226 | 0.0274 | 94.19 | 0.0313 | 0.0461 | 0.0172 | 0.0210 | 16.71 | 0.0406 | 0.0561 | 0.0238 | 0.0277 | 124.40 |
+MA | 0.0430 | 0.0644 | 0.0241 | 0.0295 | 107.74 | 0.0319 | 0.0462 | 0.0178 | 0.0214 | 16.96 | 0.0424 | 0.0578 | 0.0247 | 0.0286 | 131.20 |
LightGCN | 0.0414 | 0.0588 | 0.0235 | 0.0279 | — | 0.0286 | 0.0438 | 0.0155 | 0.0193 | — | 0.0333 | 0.0434 | 0.0213 | 0.0239 | — |
+MS | 0.0620 | 0.0897 | 0.0338 | 0.0408 | 52.55 | 0.0423 | 0.0618 | 0.0235 | 0.0284 | 41.10 | 0.0561 | 0.0774 | 0.0331 | 0.0385 | 78.34 |
+MA | 0.0627 | 0.0905 | 0.0345 | 0.0416 | 53.91 | 0.0423 | 0.0614 | 0.0235 | 0.0283 | 40.18 | 0.0567 | 0.0774 | 0.0331 | 0.0383 | 78.34 |
LayerGCN | 0.0485 | 0.0690 | 0.0278 | 0.0329 | — | 0.0350 | 0.0521 | 0.0195 | 0.0238 | — | 0.0380 | 0.0508 | 0.0236 | 0.0269 | — |
+MS | 0.0801 | 0.1153 | 0.0435 | 0.0524 | 67.10 | 0.0463 | 0.0699 | 0.0252 | 0.0312 | 34.17 | 0.0581 | 0.0830 | 0.0318 | 0.0382 | 63.39 |
+MA | 0.0807 | 0.1156 | 0.0437 | 0.0525 | 67.54 | 0.0497 | 0.0743 | 0.0275 | 0.0337 | 42.61 | 0.0527 | 0.0769 | 0.0294 | 0.0355 | 51.38 |
VBPR | 0.0367 | 0.0514 | 0.0206 | 0.0243 | — | 0.0323 | 0.0489 | 0.0180 | 0.0222 | — | 0.0388 | 0.0494 | 0.0254 | 0.0281 | — |
+MS | 0.0595 | 0.0856 | 0.0323 | 0.0389 | 66.54 | 0.0386 | 0.0567 | 0.0212 | 0.0257 | 15.95 | 0.0578 | 0.0755 | 0.0349 | 0.0394 | 52.83 |
+MA | 0.0589 | 0.0854 | 0.0321 | 0.0388 | 66.15 | 0.0402 | 0.0578 | 0.0228 | 0.0272 | 18.20 | 0.0554 | 0.0731 | 0.0335 | 0.0380 | 47.98 |
GRCN | 0.0525 | 0.0746 | 0.0291 | 0.0350 | — | 0.0460 | 0.0656 | 0.0266 | 0.0316 | — | 0.0575 | 0.0760 | 0.0355 | 0.0402 | — |
+MS | 0.0561 | 0.0809 | 0.0313 | 0.0375 | 8.45 | 0.0474 | 0.0675 | 0.0270 | 0.0321 | 2.90 | 0.0625 | 0.0842 | 0.0374 | 0.0429 | 10.79 |
+MA | 0.0550 | 0.0802 | 0.0307 | 0.0370 | 7.51 | 0.0474 | 0.0674 | 0.0273 | 0.0324 | 2.74 | 0.0615 | 0.0830 | 0.0372 | 0.0427 | 9.21 |
SLMRec | 0.0603 | 0.0875 | 0.0319 | 0.0387 | — | 0.0490 | 0.0718 | 0.0276 | 0.0333 | — | 0.0557 | 0.0721 | 0.0346 | 0.0387 | — |
+MS | 0.0691 | 0.0988 | 0.0373 | 0.0449 | 12.91 | 0.0547 | 0.0787 | 0.0311 | 0.0372 | 9.61 | 0.0664 | 0.0899 | 0.0386 | 0.0446 | 24.69 |
+MA | 0.0688 | 0.0979 | 0.0375 | 0.0449 | 11.89 | 0.0543 | 0.0782 | 0.0306 | 0.0367 | 8.91 | 0.0650 | 0.0886 | 0.0378 | 0.0438 | 22.88 |
BM3 | 0.0564 | 0.0783 | 0.0317 | 0.0372 | — | 0.0453 | 0.0655 | 0.0256 | 0.0308 | — | 0.0501 | 0.0658 | 0.0313 | 0.0353 | — |
+MS | 0.0649 | 0.0988 | 0.0403 | 0.0488 | 26.18 | 0.0564 | 0.0807 | 0.0313 | 0.0375 | 23.21 | 0.0574 | 0.0837 | 0.0315 | 0.0382 | 27.20 |
+MA | 0.0620 | 0.0849 | 0.0383 | 0.0441 | 8.43 | 0.0542 | 0.0787 | 0.0306 | 0.0368 | 20.15 | 0.0625 | 0.0882 | 0.0361 | 0.0426 | 34.04 |
FREEDOM | 0.0715 | 0.1062 | 0.0389 | 0.0477 | — | 0.0571 | 0.0813 | 0.0322 | 0.0383 | — | 0.0721 | 0.0973 | 0.0421 | 0.0485 | — |
+MS | 0.0818 | 0.1164 | 0.0443 | 0.0531 | 9.60 | 0.0596 | 0.0879 | 0.0347 | 0.0397 | 8.12 | 0.0765 | 0.1063 | 0.0427 | 0.0502 | 9.25 |
+MA | 0.0795 | 0.1157 | 0.0434 | 0.0525 | 8.95 | 0.0592 | 0.0863 | 0.0341 | 0.0395 | 6.15 | 0.0743 | 0.1036 | 0.0425 | 0.0491 | 6.47 |
MGCN | 0.0667 | 0.0928 | 0.0365 | 0.0431 | — | 0.0502 | 0.0703 | 0.0291 | 0.0342 | — | 0.0714 | 0.0928 | 0.0427 | 0.0481 | — |
+MS | 0.0673 | 0.0939 | 0.0371 | 0.0438 | 1.19 | 0.0562 | 0.0804 | 0.0318 | 0.0379 | 14.37 | 0.0775 | 0.1057 | 0.0443 | 0.0514 | 13.90 |
+MA | 0.0676 | 0.0943 | 0.0376 | 0.0441 | 1.62 | 0.0567 | 0.0812 | 0.0322 | 0.0380 | 15.50 | 0.0789 | 0.1074 | 0.0458 | 0.0530 | 15.73 |
We conduct comprehensive experiments to evaluate the performance of MOTOR and answer the following questions:
• RQ1: What performance improvements can be achieved by integrating MOTOR into ID-based multimodal recommendation models?
• RQ2: How do the ID-free recommenders and the original ID-based models perform differently for items with varying numbers of interactions (e.g., long-tail items, popular items)?
• RQ3: How do the different modules (e.g., the multimodal tokens, the Token Cross Network) influence the performance of MOTOR?
• RQ4: How do variations in the number of item Token IDs across different backbone models impact the total trainable parameters and model performance?
4.1. Experimental Datasets
We conduct experiments on three public Amazon review datasets (Ni et al., 2019): (a) Musical Instruments, (b) Baby, and (c) Office Products, which we denote as Music, Baby, and Office for short. To ensure a fair comparison with previous work (Zhou et al., 2023b, a; Zhou and Shen, 2023; Yu et al., 2023), we adopt their feature processing methodology: for image features, we directly use the 4,096-dimensional visual features extracted by a pre-trained CNN (Ni et al., 2019); for textual features, we concatenate textual information such as the title, description, categories, and brand of each item, and extract 384-dimensional textual features using a Sentence Transformer (Reimers and Gurevych, 2019). To further illustrate the impact of MOTOR on items with varying interaction records (e.g., long-tail items, popular items), we apply 1-core filtering to the original datasets, ensuring that each user and item has at least one interaction record in the training dataset. The filtering results are presented in Table 1.
4.2. BackBone Models
Given that MOTOR is compatible with various ID-based multimodal recommendation models, we integrate MOTOR into nine representative recommendation models. These include three general recommendation models (BPR (Rendle et al., 2012), LightGCN (He et al., 2020), LayerGCN (Zhou et al., 2022)), which do not utilize multimodal features as additional supervision signals, and six multimodal recommendation models (VBPR (He and McAuley, 2016), GRCN (Wei et al., 2020), SLMRec (Tao et al., 2022), BM3 (Zhou et al., 2023a), FREEDOM (Zhou and Shen, 2023), MGCN (Yu et al., 2023)), which incorporate multimodal features for learning user and item representations. Detailed descriptions of the backbone models are given in Appendix 6.1.
4.3. Setup and Evaluation Metrics
For a fair comparison, we adhere to the evaluation settings used in [31, 39], employing a random 8:1:1 split of each user’s interaction history for training, validation, and testing. We assess the top-$K$ recommendation performance of the various methods using Recall@$K$ and NDCG@$K$. During the recommendation phase, all items that the user has not previously interacted with are considered candidate items. In our experiments, results are reported for $K$ values of 10 and 20, and we abbreviate Recall@$K$ and NDCG@$K$ as R@$K$ and N@$K$, respectively.
4.4. Implementation Details
First, we implement Optimized Product Quantization based on the Faiss ANNS library (Johnson et al., 2017) and integrate MOTOR into ID-based recommendation models via the MMRec (Zhou et al., 2023b) framework. The vector quantization process needs to be performed only once and can then be reused by multiple downstream models. In line with existing work (Zhou et al., 2023a; Zhou and Shen, 2023; Yu et al., 2023), we fix the embedding sizes of users, items, and tokens to 64 for all models, initialize embedding parameters using the Xavier method (Glorot and Bengio, 2010), and employ Adam (Kingma and Ba, 2014) as the optimizer with a learning rate of 0.001. For a fair comparison, we meticulously tune the parameters of each model according to their respective published papers. The parameter tuning for MOTOR is exceptionally straightforward, involving only a search over the number of tokens per modality. We fix the number of cluster centers at 256. We implement the Token Cross Network in both modal-specific and modal-agnostic manners.
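To make the integration concrete, the sketch below shows how the illustrative TokenEmbeddings and TokenCrossNet modules from Sections 3.4 and 3.5 could replace a backbone’s item ID embedding table; the class and argument names are hypothetical and do not reflect the actual MMRec API.

```python
import torch
import torch.nn as nn

class MOTORItemEncoder(nn.Module):
    """Illustrative drop-in replacement for a backbone's item ID embedding table."""
    def __init__(self, token_ids, dim=64, num_centroids=256):
        super().__init__()
        # token_ids: {"visual": LongTensor [num_items, D], "text": LongTensor [num_items, D]}
        self.register_buffer("visual_ids", token_ids["visual"])
        self.register_buffer("text_ids", token_ids["text"])
        D = self.visual_ids.size(1)
        self.embeds = TokenEmbeddings(D, num_centroids, dim)  # sketch from Section 3.4
        self.tcn_v = TokenCrossNet(D, dim)                    # sketch from Section 3.5
        self.tcn_t = TokenCrossNet(D, dim)

    def forward(self, item_idx):                              # item_idx: LongTensor [B]
        toks = self.embeds({"visual": self.visual_ids[item_idx],
                            "text": self.text_ids[item_idx]})
        # Modal-specific crossing, fused by summation (Eq. 9).
        return self.tcn_v(toks["visual"]) + self.tcn_t(toks["text"])
```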
4.5. Overall Performance Comparison (RQ1)
We perform comprehensive experiments on the Music, Baby, and Office datasets and compare MOTOR-enhanced models with their backbone counterparts in Table 2. From the table, we make the following three observations:
• MOTOR consistently boosts the performance of recommendation models by a considerable margin, independent of the backbone model. The modal-related item-item graphs in FREEDOM and MGCN exhibit exceptional capability in leveraging multimodal information: MOTOR-enhanced FREEDOM achieves the best overall performance on the Music and Baby datasets, while MOTOR-enhanced MGCN performs best on the Office dataset.
• MOTOR’s performance improvement is particularly significant for general recommendation models (e.g., MOTOR-enhanced BPR exceeds the original BPR by more than 100% in Recall@20 on the Music and Office datasets; similarly, LightGCN and LayerGCN improve by over 50%). General recommendation models do not incorporate multimodal information as additional supervision signals during training. It is important to note that MOTOR does not explicitly introduce multimodal features into these models. Instead, by tokenizing multimodal features and transforming the ID-based model into an ID-free one, it mitigates multiple issues of the ID-based paradigm (e.g., Information Isolation, Cold-Start) and implicitly incorporates the semantic information contained in the Token IDs (items with similar multimodal features also share similar Token IDs). Although the original multimodal models generally outperform the general models, after enhancement with MOTOR, certain general models can achieve outstanding performance on specific datasets (e.g., on the Music dataset, MOTOR-enhanced LayerGCN is surpassed only by MOTOR-enhanced FREEDOM).
• We systematically compare the two types of Token Cross Networks (TCN): Modal-Agnostic (MA) and Modal-Specific (MS). In general, both effectively uncover the interaction patterns between multimodal tokens, and their relative performance depends on the specific backbone model and data distribution (e.g., for BPR and MGCN, Modal-agnostic consistently outperforms Modal-specific, whereas for GRCN, SLMRec, and FREEDOM, Modal-specific consistently surpasses Modal-agnostic; for the other backbone models, the two variants perform comparably).
4.6. Analysis on Items with Diverse Interaction Counts (RQ2)
In this section, we specifically examine the recommendation performance of MOTOR-enhanced recommenders (e.g., LightGCN, VBPR, SLMRec, FREEDOM) for items with varying interaction counts. Specifically, we choose the Baby dataset and divide the test items into five buckets based on their interaction counts in the training set: 0-5, 6-10, 10-20, 20-50, and >50. We then analyze the performance of the trained recommenders for each of these buckets. The distribution of items over these buckets is shown in Table 3, and the performance of the four models with and without MOTOR is demonstrated in Figure 3. From the figure, we derive the following observations:
• Both MOTOR-enhanced and original models exhibit significantly higher recommendation performance for popular items (>50) than for long-tail items (0-5); the recommendation performance of these models improves as the number of item interactions increases.
• MOTOR consistently enhances model recommendation performance across all interaction ranges. Notably, due to the heavy reliance on ID embeddings, the Recall@20 of original models such as LightGCN and SLMRec on long-tail items (0-5) drops to nearly zero (0.0007 for LightGCN and 0.0010 for SLMRec), leading to a significant loss in recommendation capability. MOTOR mitigates this dependency through its ID-free learning scheme, thereby achieving superior recommendation performance even for long-tail items (0.0171 for MOTOR-enhanced LightGCN, 0.0025 for MOTOR-enhanced SLMRec).
• For highly popular items (>50), the original ID-based recommendation models demonstrate impressive results (e.g., Recall@20 for LightGCN, VBPR, SLMRec, and FREEDOM is 0.1023, 0.1086, 0.1409, and 0.1333, respectively). Under the MOTOR-enhanced paradigm, despite eliminating the models’ reliance on ID embeddings, the ID-free recommenders still outperform the original models (e.g., Recall@20 for MOTOR-enhanced LightGCN, VBPR, SLMRec, and FREEDOM is 0.1146, 0.1156, 0.1536, and 0.1516, respectively).
0-5 | 6-10 | 10-20 | 20-50 | >50 | All | |
# Items | 19905 | 3538 | 2389 | 1881 | 1082 | 28795 |
Percentage (%) | 69.13 | 12.29 | 8.30 | 6.53 | 2.29 | 100 |

4.7. Ablation Study (RQ3)
We analyze how each of the proposed components affects the performance of MOTOR. We select three backbone models (LightGCN (He et al., 2020), VBPR (He and McAuley, 2016), and SLMRec (Tao et al., 2022)); the experimental results are shown in Table 4. To investigate the impact of multimodal tokens, we mask the tokens of one modality and keep only the other (“Image Tokens” means removing the text tokens and using only the image-domain tokens to generate the Token Representations). To examine the effect of the Token Cross Network (TCN), we replace the TCN with a mean aggregation or a simple linear layer (“Mean” denotes a straightforward average of all token embeddings of an item, while “Linear” denotes a linear mapping of the concatenated token embeddings). From Table 4, we make the following observations: (i) The performance of the Token Representation consistently exceeds that of the original models, even when tokens of a specific modality are removed or the TCN is replaced with Mean or Linear; (ii) Removing tokens from either the image or text domain leads to a decrease in model performance. Comparatively, “Text Tokens” generally achieves better results, suggesting that the semantic information in text features is more conducive to model learning; this is consistent with previous works (Zhou et al., 2023a; Yu et al., 2023; Zhang et al., 2024). Furthermore, we examine the distribution of tokens in the text and vision modalities in Appendix 6.2; compared to image tokens, text tokens produce a more uniform distribution; (iii) Replacing the Token Cross Network with Mean or Linear also results in a notable performance decrease, with Linear slightly outperforming Mean on VBPR and SLMRec.
Music | Office | ||||
Models | Recall@20 | NDCG@20 | Recall@20 | NDCG@20 | |
LightGCN | Original | 0.0588 | 0.0279 | 0.0508 | 0.0269 |
Image Tokens | 0.0636 | 0.0296 | 0.0495 | 0.0258 | |
Text Tokens | 0.0883 | 0.0401 | 0.0734 | 0.0368 | |
Mean | 0.0818 | 0.0360 | 0.0712 | 0.0360 | |
Linear | 0.0797 | 0.0360 | 0.0708 | 0.0357 | |
MOTOR | 0.0897 | 0.0408 | 0.0774 | 0.0385 | |
VBPR | Original | 0.0514 | 0.0243 | 0.0494 | 0.0281 |
Image Tokens | 0.0655 | 0.0302 | 0.0560 | 0.0289 | |
Text Tokens | 0.0817 | 0.0370 | 0.0729 | 0.0344 | |
Mean | 0.0781 | 0.0352 | 0.0683 | 0.0332 | |
Linear | 0.0825 | 0.0372 | 0.0746 | 0.0358 | |
MOTOR | 0.0856 | 0.0389 | 0.0755 | 0.0394 | |
SLMRec | Original | 0.0875 | 0.0387 | 0.0721 | 0.0387 |
Image Tokens | 0.0898 | 0.0396 | 0.0783 | 0.0399 | |
Text Tokens | 0.0963 | 0.0438 | 0.0848 | 0.0405 | |
Mean | 0.0890 | 0.0408 | 0.0759 | 0.0407 | |
Linear | 0.0925 | 0.0425 | 0.0796 | 0.0413 | |
MOTOR | 0.0988 | 0.0449 | 0.0899 | 0.0446 |

4.8. Trainable Parameters and Performance Under Different Token Numbers (RQ4)
In the ID-free framework, we replace the ID embedding table of any backbone model with learnable Token Representations. In Figure 4, we illustrate the overall trainable parameters of the model under different token-number settings and the resulting performance improvements. In general, within a reasonable range of token numbers, ID-free models outperform the original ID-based models while significantly reducing the number of trainable parameters (e.g., when the uni-modal token number is 4, ID-free LightGCN, VBPR, and SLMRec reduce the overall trainable parameters by 27.9%, 16.0%, and 26.3% compared to their original models, with corresponding performance improvements of 52.6%, 49.2%, and 12.6%, respectively).
The influence of token numbers. In MOTOR, we set the number of tokens per modality to $D$, resulting in $2D$ tokens per item in total. The optimal performance of these backbone models (e.g., LightGCN, VBPR, SLMRec) is achieved when the number of uni-modal tokens is 4 or 8. Reducing the number of tokens limits the embedding solution space of a specific item, increasing the likelihood of token collisions with other items. Expanding the number of tokens enhances MOTOR’s expressive capacity but simultaneously introduces additional modality-related noise: items that are unrelated in the interaction records can negatively affect each other because they share Token IDs arising from partially similar multimodal features. This phenomenon underscores the delicate balance between improving model expressiveness and managing noise and unintended associations in token assignments.
5. Conclusion
We identify several issues in current ID-based multimodal recommenders, namely Information Isolation, Cold-Start, and Storage Burden. Based on this, we propose MOTOR, which represents each item using learnable multimodal tokens and transforms ID-based models into ID-free systems. Through a token-sharing mechanism, MOTOR enhances communication between related items, improves the representation of cold-start items, and eliminates the cumbersome Item ID Embedding Table. Extensive experiments on nine representative models demonstrate the performance gains brought by MOTOR. Future exploration directions include more effective discretization techniques and alternative methods for token interaction and integration.
References
- Cheng et al. (2016) Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & Deep Learning for Recommender Systems. arXiv:1606.07792 [cs.LG] https://arxiv.org/abs/1606.07792
- Dosovitskiy et al. (2020) Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020).
- Ge et al. (2013) Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. 2013. Optimized product quantization. IEEE transactions on pattern analysis and machine intelligence 36, 4 (2013), 744–755.
- Geng et al. (2023) Shijie Geng, Shuchang Liu, Zuohui Fu, Yingqiang Ge, and Yongfeng Zhang. 2023. Recommendation as Language Processing (RLP): A Unified Pretrain, Personalized Prompt Predict Paradigm (P5). arXiv:2203.13366 [cs.IR] https://arxiv.org/abs/2203.13366
- Glorot and Bengio (2010) Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics. JMLR Workshop and Conference Proceedings, 249–256.
- Guo et al. (2017) Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. 2017. DeepFM: A Factorization-Machine based Neural Network for CTR Prediction. arXiv:1703.04247 [cs.IR] https://arxiv.org/abs/1703.04247
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
- He and McAuley (2016) Ruining He and Julian McAuley. 2016. VBPR: visual bayesian personalized ranking from implicit feedback. In Proceedings of the AAAI conference on artificial intelligence, Vol. 30.
- He et al. (2020) Xiangnan He, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. 2020. Lightgcn: Simplifying and powering graph convolution network for recommendation. In Proceedings of the 43rd International ACM SIGIR conference on research and development in Information Retrieval. 639–648.
- He et al. (2017) Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural Collaborative Filtering. arXiv:1708.05031 [cs.IR] https://arxiv.org/abs/1708.05031
- Hou et al. (2023) Yupeng Hou, Zhankui He, Julian McAuley, and Wayne Xin Zhao. 2023. Learning Vector-Quantized Item Representation for Transferable Sequential Recommenders. arXiv:2210.12316 [cs.IR] https://arxiv.org/abs/2210.12316
- Hua et al. (2023) Wenyue Hua, Shuyuan Xu, Yingqiang Ge, and Yongfeng Zhang. 2023. How to Index Item IDs for Recommendation Foundation Models. In Proceedings of the Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region (SIGIR-AP ’23, Vol. 17). ACM, 195–204. https://doi.org/10.1145/3624918.3625339
- Jegou et al. (2010) Herve Jegou, Matthijs Douze, and Cordelia Schmid. 2010. Product quantization for nearest neighbor search. IEEE transactions on pattern analysis and machine intelligence 33, 1 (2010), 117–128.
- Johnson et al. (2017) Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv:1702.08734 [cs.CV] https://arxiv.org/abs/1702.08734
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
- Lin et al. (2024a) Jianghao Lin, Bo Chen, Hangyu Wang, Yunjia Xi, Yanru Qu, Xinyi Dai, Kangning Zhang, Ruiming Tang, Yong Yu, and Weinan Zhang. 2024a. ClickPrompt: CTR Models are Strong Prompt Generators for Adapting Language Models to CTR Prediction. In Proceedings of the ACM on Web Conference 2024. 3319–3330.
- Lin et al. (2024b) Jianghao Lin, Xinyi Dai, Yunjia Xi, Weiwen Liu, Bo Chen, Hao Zhang, Yong Liu, Chuhan Wu, Xiangyang Li, Chenxu Zhu, Huifeng Guo, Yong Yu, Ruiming Tang, and Weinan Zhang. 2024b. How Can Recommender Systems Benefit from Large Language Models: A Survey. arXiv:2306.05817 [cs.IR] https://arxiv.org/abs/2306.05817
- Liu et al. (2017) Qiang Liu, Shu Wu, and Liang Wang. 2017. Deepstyle: Learning user preferences for visual recommendation. In Proceedings of the 40th international acm sigir conference on research and development in information retrieval. 841–844.
- Liu et al. (2024) Yifan Liu, Kangning Zhang, Xiangyuan Ren, Yanhua Huang, Jiarui Jin, Yingjie Qin, Ruilong Su, Ruiwen Xu, Yong Yu, and Weinan Zhang. 2024. AlignRec: Aligning and Training in Multimodal Recommendations. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (Boise, ID, USA) (CIKM ’24). Association for Computing Machinery, New York, NY, USA, 1503–1512. https://doi.org/10.1145/3627673.3679626
- Ni et al. (2019) Jianmo Ni, Jiacheng Li, and Julian McAuley. 2019. Justifying recommendations using distantly-labeled reviews and fine-grained aspects. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP). 188–197.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
- Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
- Raffel et al. (2023) Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2023. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. arXiv:1910.10683 [cs.LG] https://arxiv.org/abs/1910.10683
- Rajput et al. (2023) Shashank Rajput, Nikhil Mehta, Anima Singh, Raghunandan H. Keshavan, Trung Vu, Lukasz Heldt, Lichan Hong, Yi Tay, Vinh Q. Tran, Jonah Samost, Maciej Kula, Ed H. Chi, and Maheswaran Sathiamoorthy. 2023. Recommender Systems with Generative Retrieval. arXiv:2305.05065 [cs.IR] https://arxiv.org/abs/2305.05065
- Reimers and Gurevych (2019) Nils Reimers and Iryna Gurevych. 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084 (2019).
- Rendle et al. (2012) Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2012. BPR: Bayesian personalized ranking from implicit feedback. arXiv preprint arXiv:1205.2618 (2012).
- Simonyan and Zisserman (2014) Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- Tao et al. (2022) Zhulin Tao, Xiaohao Liu, Yewei Xia, Xiang Wang, Lifang Yang, Xianglin Huang, and Tat-Seng Chua. 2022. Self-supervised learning for multimedia recommendation. IEEE Transactions on Multimedia (2022).
- Vaswani et al. (2017a) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017a. Attention is all you need. Advances in neural information processing systems 30 (2017).
- Vaswani et al. (2017b) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017b. Attention is all you need. Advances in neural information processing systems 30 (2017).
- Wang et al. (2021) Qifan Wang, Yinwei Wei, Jianhua Yin, Jianlong Wu, Xuemeng Song, and Liqiang Nie. 2021. Dualgnn: Dual graph neural network for multimedia recommendation. IEEE Transactions on Multimedia (2021).
- Wang et al. (2017) Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. 2017. Deep & Cross Network for Ad Click Predictions. arXiv:1708.05123 [cs.LG] https://arxiv.org/abs/1708.05123
- Wang et al. (2024) Wenjie Wang, Honghui Bao, Xinyu Lin, Jizhi Zhang, Yongqi Li, Fuli Feng, See-Kiong Ng, and Tat-Seng Chua. 2024. Learnable Tokenizer for LLM-based Generative Recommendation. arXiv preprint arXiv:2405.07314 (2024).
- Wei et al. (2020) Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, and Tat-Seng Chua. 2020. Graph-refined convolutional network for multimedia recommendation with implicit feedback. In Proceedings of the 28th ACM international conference on multimedia. 3541–3549.
- Wei et al. (2019) Yinwei Wei, Xiang Wang, Liqiang Nie, Xiangnan He, Richang Hong, and Tat-Seng Chua. 2019. MMGCN: Multi-modal graph convolution network for personalized recommendation of micro-video. In Proceedings of the 27th ACM international conference on multimedia. 1437–1445.
- Xiong et al. (2009) Hui Xiong, Junjie Wu, and Jian Chen. 2009. K-Means Clustering Versus Validation Measures: A Data-Distribution Perspective. IEEE Transactions on Systems, Man, and Cybernetics, Part B (Cybernetics) 39, 2 (2009), 318–331. https://doi.org/10.1109/TSMCB.2008.2004559
- Yu et al. (2023) Penghang Yu, Zhiyi Tan, Guanming Lu, and Bing-Kun Bao. 2023. Multi-View Graph Convolutional Network for Multimedia Recommendation. arXiv preprint arXiv:2308.03588 (2023).
- Zhan et al. (2021) Jingtao Zhan, Jiaxin Mao, Yiqun Liu, Jiafeng Guo, Min Zhang, and Shaoping Ma. 2021. Learning Discrete Representations via Constrained Clustering for Effective and Efficient Dense Retrieval. arXiv:2110.05789 [cs.IR] https://arxiv.org/abs/2110.05789
- Zhang et al. (2021) Jinghao Zhang, Yanqiao Zhu, Qiang Liu, Shu Wu, Shuhui Wang, and Liang Wang. 2021. Mining latent structures for multimedia recommendation. In Proceedings of the 29th ACM International Conference on Multimedia. 3872–3880.
- Zhang et al. (2024) Kangning Zhang, Yingjie Qin, Jiarui Jin, Yifan Liu, Ruilong Su, Weinan Zhang, and Yong Yu. 2024. DREAM: A Dual Representation Learning Model for Multimodal Recommendation. arXiv:2404.11119 [cs.IR] https://arxiv.org/abs/2404.11119v2
- Zhou et al. (2023b) Hongyu Zhou, Xin Zhou, Zhiwei Zeng, Lingzi Zhang, and Zhiqi Shen. 2023b. A Comprehensive Survey on Multimodal Recommender Systems: Taxonomy, Evaluation, and Future Directions. arXiv preprint arXiv:2302.04473 (2023).
- Zhou et al. (2022) Xin Zhou, Donghui Lin, Yong Liu, and Chunyan Miao. 2022. Layer-refined Graph Convolutional Networks for Recommendation. arXiv:2207.11088 [cs.IR] https://arxiv.org/abs/2207.11088
- Zhou and Shen (2023) Xin Zhou and Zhiqi Shen. 2023. A Tale of Two Graphs: Freezing and Denoising Graph Structures for Multimodal Recommendation. In Proceedings of the 31st ACM International Conference on Multimedia (MM ’23). ACM. https://doi.org/10.1145/3581783.3611943
- Zhou et al. (2023a) Xin Zhou, Hongyu Zhou, Yong Liu, Zhiwei Zeng, Chunyan Miao, Pengwei Wang, Yuan You, and Feijun Jiang. 2023a. Bootstrap latent representations for multi-modal recommendation. In Proceedings of the ACM Web Conference 2023. 845–854.
6. Appendix

6.1. Backbone Models
General Recommendation Models:
• BPR (Rendle et al., 2012) optimizes for pairwise rankings between observed (interacted) and unobserved (non-interacted) items. The core idea of BPR is to maximize the posterior probability that a user prefers an interacted item over a non-interacted one using stochastic gradient descent.
• LightGCN (He et al., 2020) removes feature transformation and non-linear activation, focusing solely on neighborhood aggregation. It iteratively propagates user and item embeddings through the graph to capture higher-order connectivity.
• LayerGCN (Zhou et al., 2022) is a layer-refined GCN model that refines layer representations during the information propagation and node updating of GCN. Further, LayerGCN prunes the edges of the user-item interaction graph following a degree-sensitive probability instead of a uniform distribution.
Multimodal Recommendation Models
• VBPR (He and McAuley, 2016): extends the BPR framework by incorporating visual features into the item representations.
• GRCN (Wei et al., 2020): enhances the user-item bipartite graph by eliminating false-positive edges to facilitate multimodal recommendations. Utilizing the refined graph, it subsequently learns the representations of users and items through information propagation and aggregation with GCNs.
• SLMRec (Tao et al., 2022): employs self-supervised learning within a graph neural network framework to capture underlying relationships. The self-supervised tasks complement the supervised objective by extracting latent signals directly from the data, performing data augmentation operations (FD, FM, and FAC) followed by a contrastive loss for optimization.
• BM3 (Zhou et al., 2023a): presents a self-supervised learning approach aimed at reducing computational cost and the errors caused by incorrect supervision signals. It leverages a simple dropout technique to produce contrastive views and employs three distinct contrastive loss functions to refine the resulting representations.
• FREEDOM (Zhou and Shen, 2023): freezes the item-item graph constructed from multimodal features and denoises the user-item interaction graph to learn user and item representations for multimodal recommendation.
• MGCN (Yu et al., 2023): purifies modality features using item behavior information to avoid noise contamination and enriches these features in separate views, enhancing feature distinguishability. Additionally, a behavior-aware fuser adaptively learns the relative importance of different modality features.
6.2. The Distribution of Token IDs
We illustrate the distribution of Token IDs obtained by tokenizing the original modality features on the three datasets (Music, Baby, Office) in Figure 6. From Figure 6, we observe the following: (i) Through tokenization of the raw modality features in the textual and visual domains, we obtain Token IDs that follow an approximately uniform distribution. In other words, for a given Token ID, roughly $|\mathcal{I}| / K$ items share the same token, where $|\mathcal{I}|$ is the total number of items and $K$ is the number of tokens per embedding table (e.g., about $38140 / 256 \approx 149$ items per token on Office); (ii) Compared to the visual modality, the distribution of tokens in the textual modality is more uniform, which partly explains why text-modality tokens outperform vision-modality tokens for recommendation in Section 4.7.

6.3. Case Study
The original multimodal features of items preserve useful semantic information, such as similarity relationships between items. In this section, we investigate whether Token IDs retain such semantic information, i.e., whether items with similar Token IDs also share similar content. We randomly select items as queries and retrieve the two items with the most similar Token IDs; the results are presented in Figure 5. Semantically similar items can indeed be discovered solely through token similarity, demonstrating that the discretized tokens retain the similarity information of the original multimodal features.