A Survey of Multimodal Recommendation
Abstract.
Humans perceive the world by processing signals from multiple modalities, such as the sounds we hear, the textures we feel, and the images we see. This has inspired researchers to build models that can relate and interpret data from different modalities. Such models can represent and discover hidden links among modalities and potentially recover information that a uni-modal approach would miss. Natural language processing also requires these abilities in order to approach human-level understanding in various AI tasks. (sulubacak2019multimodal)
1. Introduction
Explain what multimodality is and what a recommender system is. Explain what a multimodal recommender system is.
Summary of existing multimodal (MM) tasks/models in CV/NLP. They do not consider personalisation, which is important for a recommender system.
Why incorporate multimodal information into recommender systems (it can potentially alleviate the sparsity problem and improve the quality of recommendations).
The scope of this survey. How this survey is organized.
2. Taxonomy based on modalities
2.1. Visual, Textual, or Audio
How the features are extracted. How to use features for the recommendation task.
2.1.1. Visual-enriched Models.
Data sources include images and videos.
2.1.2. Textual-enriched Models.
Data sources include user reviews, item descriptions, item titles.
(How the features are extracted)
1. Topic-based.
Topic-based methods normally apply Latent Dirichlet Allocation (LDA)(blei2003latent) to uncover hidden dimensions in text. LDA models each document as a distribution over an underlying set of topics.
Data Sources | Notes | Model | Publications |
---|---|---|---|
Reviews | Items only | LDA | CTR(wang2011collaborative), HFT(mcauley2013hidden) |
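As an illustration of how such topic-based features can be extracted (a minimal sketch, not the pipeline of CTR or HFT; the corpus, topic count, and vocabulary size below are hypothetical), one can fit LDA on the concatenated reviews of each item and use the resulting topic distribution as the item's textual representation:

```python
# Minimal sketch: LDA-based item features from review text (hypothetical toy corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# One "document" per item: all of its reviews concatenated (illustrative only).
item_docs = [
    "great battery life long lasting charge",
    "camera quality is sharp and colors are vivid",
    "battery drains fast but the camera is decent",
]

# Bag-of-words counts, then LDA with an assumed number of topics.
vectorizer = CountVectorizer(max_features=5000, stop_words="english")
counts = vectorizer.fit_transform(item_docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
item_topic_distributions = lda.fit_transform(counts)  # shape: (n_items, n_topics)

# Each row is a topic distribution that can serve as the item's textual representation,
# e.g. to initialise or regularise item latent factors in CTR/HFT-style models.
print(item_topic_distributions)
```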
2. Neural Networks.
Data Sources | Notes | Model | Publications |
---|---|---|---|
Item contents | Items only | SDAE | CDL(wang2015collaborative), CKE(zhang2016collaborative) |
Item contents | Items only | CNN | ConvMF(kim2016convolutional) |
Reviews | Items & Users | CNN | DeepCoNN(zheng2017joint), TransNets(catherine2017transnets), D-Attn(seo2017interpretable), CARL(wu2019context), NARRE(chen2018neural), DAML(liu2019daml) |
Reviews | Items | GRU | TARMF(lu2018coevolutionary) |
Reviews | Items | DNN | GATE(ma2019gated) |
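To make the CNN-based extraction in the table concrete, the following is a minimal sketch of a DeepCoNN-style review encoder; it is not the authors' implementation, and the vocabulary size, embedding dimension, and filter settings are illustrative assumptions:

```python
# Minimal sketch of a DeepCoNN-style review encoder (illustrative hyperparameters).
import torch
import torch.nn as nn

class ReviewCNNEncoder(nn.Module):
    def __init__(self, vocab_size=20000, embed_dim=100, num_filters=64,
                 kernel_size=3, out_dim=32):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        # 1-D convolution over the word sequence, followed by max-over-time pooling.
        self.conv = nn.Conv1d(embed_dim, num_filters, kernel_size, padding=1)
        self.proj = nn.Linear(num_filters, out_dim)

    def forward(self, word_ids):            # word_ids: (batch, seq_len)
        x = self.embedding(word_ids)        # (batch, seq_len, embed_dim)
        x = x.transpose(1, 2)               # (batch, embed_dim, seq_len)
        x = torch.relu(self.conv(x))        # (batch, num_filters, seq_len)
        x = x.max(dim=2).values             # max-over-time pooling
        return self.proj(x)                 # (batch, out_dim) review embedding

# Toy usage: a batch of two padded reviews of length 10.
encoder = ReviewCNNEncoder()
dummy_reviews = torch.randint(1, 20000, (2, 10))
review_embeddings = encoder(dummy_reviews)  # shape: (2, 32)
```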
(How to use features for the recommendation task)
1. The text acts as a regularization term for rating prediction; examples are HFT(mcauley2013hidden), ConvMF(kim2016convolutional), and TARMF(lu2018coevolutionary).
2. Review-based embeddings are combined with rating-based embeddings through summation (NARRE(chen2018neural), DAML(liu2019daml)), concatenation (TransNets(catherine2017transnets)), or a neural network layer (GATE(ma2019gated)); the fused representation is then scored with a dot product, FM, NeuFM, or an autoencoder for the recommendation task (see the sketch after this list).
3. The review-based and rating-based components each produce a prediction score, and the two scores are combined linearly for the final prediction, e.g., CARL(wu2019context).
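The following is a hypothetical sketch of strategy 2: a rating-based ID embedding is fused with a review-based embedding by summation or concatenation, and the fused vectors are scored with a dot product (an FM, NeuFM, or autoencoder head could be substituted). All module names and dimensions are illustrative rather than taken from any specific model.

```python
# Minimal sketch of fusing review-based and rating-based embeddings (strategy 2).
import torch
import torch.nn as nn

class FusedScorer(nn.Module):
    def __init__(self, num_users=1000, num_items=500, dim=32, fusion="sum"):
        super().__init__()
        self.fusion = fusion
        # Rating-based (ID) embeddings, as in classic latent-factor models.
        self.user_id_emb = nn.Embedding(num_users, dim)
        self.item_id_emb = nn.Embedding(num_items, dim)
        # If concatenating, project the doubled vector back to `dim`.
        self.user_proj = nn.Linear(2 * dim, dim)
        self.item_proj = nn.Linear(2 * dim, dim)

    def fuse(self, id_emb, review_emb, proj):
        if self.fusion == "sum":          # summation-style fusion
            return id_emb + review_emb
        return proj(torch.cat([id_emb, review_emb], dim=-1))  # concatenation-style fusion

    def forward(self, user_ids, item_ids, user_review_emb, item_review_emb):
        u = self.fuse(self.user_id_emb(user_ids), user_review_emb, self.user_proj)
        i = self.fuse(self.item_id_emb(item_ids), item_review_emb, self.item_proj)
        return (u * i).sum(dim=-1)  # dot-product score; an FM/NeuFM head could replace this

# Toy usage with review embeddings from any text encoder (e.g. the CNN sketch above).
model = FusedScorer()
scores = model(torch.tensor([0, 1]), torch.tensor([3, 4]),
               torch.randn(2, 32), torch.randn(2, 32))
```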
2.1.3. Hybrid Models.
Data sources include visual, textual, or audio features. How those modalities are fused.
2.2. Knowledge Graphs.
2.3. Social Graphs.
3. Taxonomy based on ML tasks
3.1. Recommendation
(1) deep learning
(2) graphs
3.2. Pre-training
(1) pretrain user embeddings: which pretext tasks
(2) pretrain item embeddings: which pretext tasks
3.3. Cross-modal Retrieval
4. Taxonomy based on specific domains
4.1. Micro-video Recommendation
4.2. Fashion Recommendation
4.3. News Recommendation
5. Datasets and Experiments
5.1. Datasets
5.1.1. Visual and textual
5.1.2. Knowledge graphs
5.1.3. Social graphs
5.2. Experiments
5.2.1. Recommendation performance comparison using visual and textual features.
5.2.2. Downstream task performance comparison using pretrained user or item embeddings.
5.2.3. Performances on different user or item popularity groups.
5.2.4. Will embeddings extracted from different pretrained models affect the performance?
6. Challenges and Future Research Directions
How to effectively fuse multimodal information?
Design of pretext tasks for pretraining