Personalized Multimodal Large Language Models: A Survey
Abstract
Multimodal Large Language Models (MLLMs) have become increasingly important due to their state-of-the-art performance and ability to integrate multiple data modalities, such as text, images, and audio, to perform complex tasks with high accuracy. This paper presents a comprehensive survey on personalized multimodal large language models, focusing on their architecture, training methods, and applications. We propose an intuitive taxonomy for categorizing the techniques used to personalize MLLMs to individual users, and discuss the techniques accordingly. Furthermore, we discuss how such techniques can be combined or adapted when appropriate, highlighting their advantages and underlying rationale. We also provide a succinct summary of personalization tasks investigated in existing research, along with the evaluation metrics commonly used. Additionally, we summarize the datasets that are useful for benchmarking personalized MLLMs. Finally, we outline critical open challenges. This survey aims to serve as a valuable resource for researchers and practitioners seeking to understand and advance the development of personalized multimodal large language models.
1 Introduction
Multimodal Large Language Models (MLLMs), a term used synonymously with Large Multimodal Model (LMM) and Large Vision Language Model (LVLM), have recently become important for generating and reasoning with diverse types of complex data such as text, images, and audio Yang et al. (2023). These models, which process, generate, and combine information across modalities, have found many applications such as healthcare Lu et al. (2024); AlSaad et al. (2024), recommendation Lyu et al. (2024b); Tian et al. (2024), and autonomous vehicles Cui et al. (2024); Chen et al. (2024b). However, to further enhance the performance and utility of these models, personalization plays a crucial role, enabling them to adapt more effectively to a user’s specific preferences, context, and needs Chen et al. (2024a). Personalization offers an improved user experience by saving time and increasing accuracy, for instance, by generating content that is more closely aligned with the user’s interests.
Personalization in multimodal large language models comes with its own set of unique challenges. In particular, creating personalized experiences requires individual-level data from users, and in multimodal scenarios this data must span multiple modalities. For instance, a user might generate an image from a text prompt and then provide feedback, such as a thumbs-up or thumbs-down. In this case, we have two modalities—text and image—along with implicit feedback on the text prompt and explicit feedback on the generated image in the form of a like or dislike.
Table 1: Overview of techniques for personalized MLLMs.

| Category | General Mechanism | Example Models and Methods |
|---|---|---|
| Personalized MLLM Text Generation (Section 3) | Instruction (Sec. 3.1) | CGSMP Yong et al. (2023), ModICT Li et al. (2024c) |
| | Alignment (Sec. 3.2) | MPDialog Agrawal et al. (2023), Athena 3.0 Fan et al. (2023) |
| | Generation (Sec. 3.3) | Wu et al. (2024b), PTSCG Wang et al. (2024a) |
| | Fine-tuning (Sec. 3.4) | Wang et al. (2023), PVIT Pi et al. (2024) |
| Personalized MLLM Image Generation (Section 4) | Instruction (Sec. 4.1) | MuDI Jang et al. (2024), Zhong et al. (2024) |
| | Alignment (Sec. 4.2) | λ-ECLIPSE Patel et al. (2024), MoMA Song et al. (2024) |
| | Generation (Sec. 4.3) | Layout-and-Retouch Kim et al. (2024), InstantBooth Shi et al. (2024a) |
| | Fine-tuning (Sec. 4.4) | MS-Diffusion Wang et al. (2024d), UNIMO-G Li et al. (2024a) |
| Personalized MLLM Recommendation (Section 5) | Instruction (Sec. 5.1) | InteraRec Karra and Tulabandhula (2024), X-Reflect Lyu et al. (2024b) |
| | Alignment (Sec. 5.2) | PMG Shen et al. (2024), MMREC Tian et al. (2024) |
| | Generation (Sec. 5.3) | RA-Rec Yu et al. (2024), Wei et al. (2024a) |
| | Fine-tuning (Sec. 5.4) | GPT4Rec Zhang et al. (2024), MMSSL Wei et al. (2023) |
| Personalized MLLM Retrieval (Section 6) | Instruction (Sec. 6.1) | ConCon-Chi Rosasco et al. (2024), Med-PMC Liu et al. (2024a) |
| | Alignment (Sec. 6.2) | AlignBot Chen et al. (2024c), Xu et al. (2024) |
| | Generation (Sec. 6.3) | Ye et al. (2024a), Yo’LLaVA Nguyen et al. (2024) |
| | Fine-tuning (Sec. 6.4) | FedPAM Feng et al. (2024), VITR Gong et al. (2023) |
We propose an intuitive and symmetric taxonomy for the techniques used for personalized multimodal large language models, including text generation (Section 3), image generation (Section 4), recommendation (Section 5), and retrieval (Section 6). Each category classifies techniques based on how they follow multimodal instructions, enable multimodal alignment, generate personalized responses, or incorporate personalization through model fine-tuning. We illustrate an overview of techniques for personalized multimodal models in Table 1. In parallel to the discussion of techniques, we also summarize various applications in personalized MLLMs (Appendix B).
Summary of Main Contributions. The key contributions of this work are as follows:
- A comprehensive survey of existing work on personalized MLLMs, including the problem settings, evaluation metrics, and datasets used in the literature.
- A set of intuitive taxonomies for personalization in MLLMs, which we use to organize and survey existing work.
- Identification and discussion of key open problems and challenges that future work needs to address in this rapidly growing and vitally important field.
Scope of the survey. In this survey, we focus entirely on recent work that leverages multimodal large language models (MLLMs) to generate personalized text, images, audio, or other modalities. We consider techniques for personalization that elicit and incorporate user preferences when generating multimodal outputs. To study these techniques, we decompose works across the following three dimensions:
- the modality of the content being generated (e.g., text, images, audio, or other modalities);
- the personalization technique being employed (e.g., prompt-based, prefix tuning, fine-tuning/adapters);
- the application of the personalized MLLM (e.g., chat/assistant, recommendation systems, retrieval, classification, image generation, text generation).
2 Overview of Challenges and Techniques
Personalization in multimodal large language models presents several significant challenges, due to the complexity of combining diverse types of data, extracting relevant information, and delivering user-specific insights. To tackle these challenges, researchers have introduced techniques such as multimodal instruction, alignment, and generation.
2.1 Integration of Heterogeneous Data
Multimodal large language models need to combine information from various modalities, such as text, images, audio, video, and user engagement Wei et al. (2024a); Xu et al. (2024). Each modality has distinct characteristics and may convey different types of information. For example, text might describe a product, while an image conveys its visual appearance. Integrating these heterogeneous data types is challenging because they require different encoding methods, processing pipelines, and alignment strategies. Misalignment or incomplete fusion Lyu et al. (2024b); Zhou et al. (2023) can lead to inconsistent or inaccurate user preferences being captured, thus reducing the effectiveness of personalized recommendations. Multimodal alignment can help resolve inconsistencies that may arise from mismatched modalities.
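To make the alignment step concrete, the following is a minimal sketch of a CLIP-style symmetric contrastive objective that pulls matched text-image pairs together in a shared embedding space; the linear "encoders", dimensions, and temperature are illustrative placeholders rather than the setup of any specific model in the surveyed work.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContrastiveAligner(nn.Module):
    """Minimal sketch of symmetric contrastive alignment between two modalities.

    The projection layers below are stand-ins; in practice they would sit on top
    of pretrained text and vision backbones.
    """

    def __init__(self, text_dim=768, image_dim=1024, embed_dim=256, temperature=0.07):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, embed_dim)    # placeholder text head
        self.image_proj = nn.Linear(image_dim, embed_dim)  # placeholder image head
        self.temperature = temperature

    def forward(self, text_feats, image_feats):
        # Project both modalities into a shared space and L2-normalize.
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        v = F.normalize(self.image_proj(image_feats), dim=-1)
        logits = t @ v.t() / self.temperature            # (batch, batch) similarities
        targets = torch.arange(t.size(0), device=t.device)
        # Symmetric InfoNCE: matched (text_i, image_i) pairs are the positives.
        return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2

# Toy usage with random features standing in for encoder outputs.
aligner = ContrastiveAligner()
loss = aligner(torch.randn(8, 768), torch.randn(8, 1024))
loss.backward()
```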
2.2 Data Noise and Redundancy
Different modalities often include noisy, redundant, or irrelevant information Liu et al. (2024c); Lyu and Luo (2022). For example, images of the same product in e-commerce platforms may have varying quality or redundant features, while textual descriptions might include unnecessary details. Extracting meaningful insights from such noisy data is challenging because the model needs to filter out irrelevant content without losing important context. This process becomes even more difficult when handling large amounts of data, as the noise accumulates and complicates the extraction of relevant user preferences. Multimodal instruction can help filter out noisy or redundant data by guiding the model to focus on the most relevant modalities and inputs for each user. By directing the model’s attention to key features in user interactions, this method reduces the impact of irrelevant or repetitive information, ensuring that the generated outputs are more meaningful and concise.
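As a simple illustration of instruction-based filtering, the hypothetical snippet below composes a prompt that tells the model which user-relevant attributes to focus on and what to ignore; the field names and template wording are assumptions for illustration, not a fixed schema from any cited system.

```python
def build_filtering_instruction(user_profile, item):
    """Compose a hypothetical multimodal instruction that narrows the model's focus.

    `user_profile` and `item` are plain dicts; the keys are illustrative only.
    """
    relevant = ", ".join(user_profile.get("preferred_attributes", []))
    return (
        "You are given a product image and its description.\n"
        f"Focus only on these attributes the user cares about: {relevant}.\n"
        "Ignore watermarks, duplicate angles, and boilerplate marketing text.\n"
        f"Image reference: {item['image_id']}\n"
        f"Description: {item['description']}\n"
        "Summarize the item in two sentences tailored to this user."
    )

prompt = build_filtering_instruction(
    {"preferred_attributes": ["color", "material", "fit"]},
    {"image_id": "img_0042",
     "description": "Slim-fit cotton shirt, machine washable, ships in 2 days."},
)
print(prompt)
```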
2.3 Granular Understanding of Multimodal Data
Text-based LLMs are adept at processing linguistic information such as item descriptions Lyu et al. (2024a), and some approaches seek to transform non-textual data into the text space Ye et al. (2024b). However, visual inputs often contain nuances—such as color, texture, and context—that are difficult to capture with language alone Shen et al. (2024). For instance, subtle preferences in fashion, home decor, or art may be driven by visual factors that are abstract or subjective. Multimodal LLMs may struggle to extract these fine-grained visual details and relate them meaningfully to textual descriptions, leading to a loss of personalization depth. Multimodal alignment facilitates a more granular understanding by ensuring that the relationships between different modalities are preserved.
2.4 Scalability and Efficiency
As the volume of multimodal engagement grows, so do the computational demands for processing and personalizing recommendations. Models need to handle a large number of user interactions across various modalities in real-time environments Ye et al. (2024b); Shen et al. (2023), such as social media platforms or e-commerce sites. This necessitates advanced resource allocation strategies, as multimodal large language models often require significant GPU or TPU resources to process images, videos, or audio in parallel with text.
2.5 Capturing Diverse and Dynamic User Preferences
Users interact with multimodal content in diverse ways, and their preferences can evolve over time Rafailidis et al. (2017). Accurately capturing these preferences across modalities is challenging because different data types might signal conflicting or evolving interests. For instance, a user’s engagement with both product reviews and product images may shift over time, requiring the model to adapt its understanding of their preferences in real-time. Additionally, the model needs to continuously update its understanding to reflect new patterns of user behavior.
3 Personalized MLLM Text Generation
3.1 Personalized Multimodal Instruction
Personalized multimodal instruction focuses on guiding MLLMs to generate more tailored content through structured prompts and contextual signals. For example, CGSMP Yong et al. (2023) demonstrates controllable text summarization using multimodal prompts based on image entities, reducing hallucinations and improving summarization quality. ModICT Li et al. (2024c) further proposes multimodal in-context tuning, leveraging the in-context learning abilities of MLLMs to dynamically generate product descriptions from visual and textual cues.
3.2 Personalized Multimodal Alignment
To better reflect user intents in generated texts, a few works explore aligning multimodal inputs with personalized user preferences. For instance, MPDialog Agrawal et al. (2023) aligns character personas and visual scenes to generate context-consistent dialogues. Athena 3.0 Fan et al. (2023) applies this concept to conversational agents, fusing neuro-symbolic strategies with multimodal dialogue generation and aligning responses with user preferences in dynamic contexts. In addition, Sugihara et al. (2024) align video summarization with user-defined semantics by matching textual and visual content, ensuring personalized summaries.
3.3 Personalized Multimodal Generation
To generate text that aligns more closely with user-specific preferences, Wu et al. (2024b) introduce a framework for personalized video commenting, where clip selection and text generation processes are tailored to user preferences. Wang et al. (2024a) generate personalized time-synchronized comments on videos by leveraging a multimodal transformer to integrate visual elements with user-specific commentary.
3.4 Personalized Multimodal Fine-tuning
Because prompting and instructions alone may not achieve satisfactory performance, several fine-tuning methods have been developed to better adapt pre-trained MLLMs to specific user contexts and tasks. Wang et al. (2023) propose prefix-tuning for personalized image captioning, reducing computation costs while retaining high-quality, user-specific outputs. PVIT Pi et al. (2024) proposes personalized visual instruction tuning to address the limitations of generic MLLMs, enabling them to recognize individuals in images and generate personalized dialogues.
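To illustrate the prefix-tuning idea, the sketch below prepends a small set of learnable, user-specific prefix embeddings to the token embeddings of a frozen backbone, so that only the prefixes are updated during training; the module name, prefix length, and dimensions are assumptions, not the exact setup of Wang et al. (2023).

```python
import torch
import torch.nn as nn

class UserPrefix(nn.Module):
    """Minimal prefix-tuning sketch: only per-user prefix vectors are trainable."""

    def __init__(self, num_users, prefix_len=8, hidden_dim=768):
        super().__init__()
        # One learnable prefix of shape (prefix_len, hidden_dim) per user.
        self.prefix = nn.Embedding(num_users, prefix_len * hidden_dim)
        self.prefix_len = prefix_len
        self.hidden_dim = hidden_dim

    def forward(self, user_ids, token_embeds):
        # token_embeds: (batch, seq_len, hidden_dim) from the frozen backbone.
        batch = user_ids.size(0)
        prefix = self.prefix(user_ids).view(batch, self.prefix_len, self.hidden_dim)
        # Prepend the user prefix; the frozen transformer then attends over it.
        return torch.cat([prefix, token_embeds], dim=1)

prefix_module = UserPrefix(num_users=1000)
augmented = prefix_module(torch.tensor([3, 7]), torch.randn(2, 20, 768))
print(augmented.shape)  # torch.Size([2, 28, 768])
```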
4 Personalized MLLM Image Generation
4.1 Personalized Multimodal Instruction
Zhong et al. (2024) propose a novel multimodal prompt that accommodates complex user queries for customized instructions. Gal et al. (2022) enable multimodal inputs to be tokenized into a lookup table, whose indices are then used for generation with a text transformer. MuDI Jang et al. (2024) addresses identity mixing in multi-subject text-to-image personalization by leveraging segmented subjects using the Segment Anything Model Kirillov et al. (2023). MuDI employs data augmentation (Seg-Mix) during training and an innovative inference initialization technique to generate distinct multi-subject images without mixing identities. Subject-Diffusion Ma et al. (2024) introduces an open-domain personalized text-to-image generation framework that does not require test-time fine-tuning and relies on a single reference image for generating personalized images. The method combines text and image features using a custom prompt format, integrates fine-grained object features and location control for enhanced fidelity, and employs cross-attention map control to handle multiple subjects simultaneously.
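The lookup-table idea behind textual inversion Gal et al. (2022) can be sketched as learning a single new embedding vector for a pseudo-token (e.g., "S*") while the rest of the text encoder stays frozen; the sketch below is schematic, with a placeholder loss standing in for the actual diffusion reconstruction objective.

```python
import torch
import torch.nn as nn

# Frozen stand-in for a text encoder's token embedding table.
vocab_size, embed_dim = 49408, 768
embedding_table = nn.Embedding(vocab_size, embed_dim)
embedding_table.weight.requires_grad_(False)

# A single trainable vector representing the new personal concept token "S*".
concept_vector = nn.Parameter(torch.randn(embed_dim) * 0.02)
optimizer = torch.optim.AdamW([concept_vector], lr=5e-3)

def embed_prompt(token_ids, concept_position):
    """Embed a prompt, substituting the learnable concept vector at one position."""
    embeds = embedding_table(token_ids)  # (seq_len, embed_dim), frozen
    return torch.cat([embeds[:concept_position],
                      concept_vector.unsqueeze(0),
                      embeds[concept_position + 1:]], dim=0)

# Schematic training step: in textual inversion the loss would come from the
# diffusion model reconstructing the user's reference images (omitted here).
token_ids = torch.randint(0, vocab_size, (10,))
prompt_embeds = embed_prompt(token_ids, concept_position=4)
loss = prompt_embeds.pow(2).mean()  # placeholder loss
loss.backward()
optimizer.step()
```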
4.2 Personalized Multimodal Alignment
MoMA Song et al. (2024) is a tuning-free, open-vocabulary model for personalized image generation that combines reference image features with text prompts, enabling flexible re-contextualization and texture editing while preserving high detail fidelity and identity. λ-ECLIPSE Patel et al. (2024) leverages the CLIP latent space to accelerate and facilitate personalized generation.
4.3 Personalized Multimodal Generation
Several works generate personalized images without per-user optimization at inference time. Layout-and-Retouch Kim et al. (2024) introduces a dual-stage framework that improves the diversity of personalized image generation, while InstantBooth Shi et al. (2024a) enables personalized text-to-image generation without test-time fine-tuning.
4.4 Personalized Multimodal Fine-tuning
MS-Diffusion Wang et al. (2024d) introduces a zero-shot, layout-guided method for multi-subject image personalization in diffusion models. It integrates a Grounding Resampler to enhance subject detail extraction and a Multi-Subject Cross-Attention mechanism to manage conflicts in multi-subject scenarios, ensuring accurate representation of subjects in specific regions while preserving overall image fidelity. UNIMO-G Li et al. (2024a) unifies end-to-end multimodal fine-tuning across visual and language transformer models.
5 Personalized MLLM Recommendation
5.1 Personalized Multimodal Instruction
Multimodal instructions in recommendation allow for the personalization of user intentions and preferences Ye et al. (2024b); Zhou et al. (2023), richer item context Zhou et al. (2023); Lyu et al. (2024b), and more diverse user-system interactions Karra and Tulabandhula (2024). To better express preferences for novel items, the user can provide a reference image as part of the recommendation instruction Zhou et al. (2023). Based on the user’s interactions with visual items, the multimodal recommender system can analyze user preferences and intentions Ye et al. (2024b); Yu et al. (2024); Liu et al. (2024c). PMG Shen et al. (2024) further transforms multimodal user behaviors into language to model user preferences for recommendation. When item metadata is lacking, MLLMs can extract rich descriptive information about items to improve recommendation Zhou et al. (2023); Lyu et al. (2024b). Many works also encode visual and textual information into latent representations for better item modeling Tian et al. (2024); Wei et al. (2024a). In addition, multimodal instructions enable interactions through screenshots Karra and Tulabandhula (2024), personalized item design Wei et al. (2024a), and conversational recommendations based on reference images Zhou et al. (2023), as illustrated in the sketch below.
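As an illustration of such multimodal recommendation instructions, the hypothetical snippet below assembles a chat-style request that combines the user's interaction history with a reference image expressing a novel preference; the message schema mimics common MLLM chat interfaces but is not tied to any specific API or cited system.

```python
def build_recommendation_request(history, reference_image_path):
    """Assemble a hypothetical multimodal recommendation instruction.

    `history` is a list of (title, rating) pairs; the message structure below
    is illustrative only.
    """
    history_text = "\n".join(f"- {title} (rated {rating}/5)" for title, rating in history)
    return [
        {"role": "system",
         "content": "You are a recommendation assistant that uses the user's history "
                    "and a reference image."},
        {"role": "user",
         "content": [
             {"type": "text",
              "text": f"My recent interactions:\n{history_text}\n"
                      "Recommend 3 items similar in style to the attached reference image, "
                      "and explain each choice in one sentence."},
             # Reference image expressing a preference that history alone cannot capture.
             {"type": "image", "path": reference_image_path},
         ]},
    ]

request = build_recommendation_request(
    [("Minimalist oak desk", 5), ("Industrial metal bookshelf", 3)],
    "reference_chair.jpg",  # hypothetical file name
)
```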
5.2 Personalized Multimodal Alignment
To unify the understanding and reasoning of multimodal information, MLLMs can transcribe visual information into text Shen et al. (2024); Ye et al. (2024b), enable latent representation fusion and alignment Tian et al. (2024); Xu et al. (2024); Hu et al. (2024), and unify multimodal modeling Wei et al. (2024a); Liu et al. (2024c); Yu et al. (2024). Based on a user’s previous interactions with visual items, MLLMs can generate explanations of the interacted items’ descriptions Ye et al. (2024b) and of the user’s interaction behaviors Shen et al. (2024), which can then serve as part of textual prompts for downstream text-only LLM reasoning. Other works use MLLMs’ abilities to encode multimodal representations, enabling item-level augmentation Tian et al. (2024), user-item fusion Xu et al. (2024), and cross-task multimodal knowledge transfer Hu et al. (2024). Recent developments in end-to-end multimodal learning, where multimodal instructions are input as a sequence, enable a new paradigm of generative recommendation that treats recommendation as a next-token prediction task. Some preliminary works directly prompt advanced MLLMs (e.g., GPT-4V) to understand multimodal instructions and produce recommendations through reasoning. Liu et al. (2024c) design such prompting methods for sequential recommendation tasks, re-ranking the recommendation results after generation. Other works tokenize items and users with multimodal information Yu et al. (2024), so that items can be generated directly from the model’s embedding space.
5.3 Personalized Multimodal Generation
Generative recommender systems leverage next-token generation as a unified recommendation policy. By encoding items into the LLM’s embedding space, the LLM can directly generate items as language tokens. Recent ID-based representation learning methods encode item IDs into language embeddings learned from multimodal and collaborative knowledge Yu et al. (2024). In addition, unified frameworks Wei et al. (2024a) encode multi-channel information and generate both recommendations and modified images of items the user may be interested in. Such multimodal generation improves the explainability of recommended items and makes them more convincing to users. However, since item re-ranking is an essential post-processing step, how to leverage multimodal outputs for item re-ranking remains under-explored.
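A minimal sketch of this generative-recommendation idea appears below: item IDs receive their own embeddings in the language model's hidden space, and the model's final hidden state scores all items as if predicting the next token; the frozen backbone is abstracted away and all shapes are placeholder assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GenerativeRecommender(nn.Module):
    """Sketch of next-token-style recommendation over learned item embeddings."""

    def __init__(self, num_items, hidden_dim=768):
        super().__init__()
        # Items live in the same embedding space as the language model's tokens.
        self.item_embeddings = nn.Embedding(num_items, hidden_dim)

    def forward(self, user_state):
        # user_state: (batch, hidden_dim) final hidden state of the (frozen) LLM
        # summarizing the user's multimodal interaction history.
        return user_state @ self.item_embeddings.weight.t()  # (batch, num_items) scores

model = GenerativeRecommender(num_items=10_000)
user_state = torch.randn(4, 768)               # stand-in for LLM hidden states
scores = model(user_state)
top_items = scores.topk(k=5, dim=-1).indices   # recommended item IDs per user
loss = F.cross_entropy(scores, torch.randint(0, 10_000, (4,)))  # next-item prediction loss
```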
5.4 Personalized Multimodal Fine-tuning
To achieve more efficient alignment of personalized MLLM recommendations, several works propose fine-tuning methods for MLLMs. GPT4Rec Zhang et al. (2024) incorporates graph-modality information, enabling structure-level prompting; based on this prompt design, it performs prompt tuning to benefit streaming recommendation at both the node level and the view level. InstructGraph Wang et al. (2024b) also leverages graph structure to unify NLP, information retrieval, and recommendation tasks, further enabling fine-tuning and RLHF for alignment. MMSSL Wei et al. (2023) is a unified learning framework that first decomposes users’ modality-aware preferences and then collaboratively learns inter-modality dependencies and preference signals through self-augmentation. Deng et al. (2024) further propose a unified transformer model that takes multimodal information as input and outputs content features, which can be paired with item representations for direct recommendation.
6 Personalized MLLM Retrieval
6.1 Personalized Multimodal Instruction
Personalized multimodal instruction focuses on improving the ability of MLLMs to tailor their outputs to user-specific needs and preferences. Existing benchmarks, such as ConCon-Chi Rosasco et al. (2024), pose challenges for personalized text-to-image retrieval by introducing complex and varied contexts and instructions for personalized concept learning and compositionality assessment. In a different direction, the Learnable Agent Collaboration Network Shi et al. (2024b) proposes a framework where multiple agents with distinct instructions and roles collaborate to deliver user-specific responses in multimodal search and retrieval engines. Med-PMC Liu et al. (2024a) creates a simulated clinical environment where MLLMs are instructed to interact with a patient simulator, equipped with personalized actors, for multimodal information retrieval and decision making. These works highlight the need for MLLMs to effectively integrate multimodal information and personalize their responses across diverse user interactions.
6.2 Personalized Multimodal Alignment
To enhance the interaction between MLLMs and user-specific inputs, personalized multimodal alignment ensures that models can adapt to unique preferences and contexts. AlignBot Chen et al. (2024c) aligns robot task planning with user reminders by using a tailored LLaVA-7B model as an adapter for GPT-4o. The alignment translates user instructions into structured prompts enabling a dynamic retrieval mechanism that recalls relevant past experiences and improves task execution. In contrast, the Align and Retrieve framework Xu et al. (2024) focuses on image retrieval with text feedback, using a composition-and-decomposition learning strategy to unify visual and textual inputs. This approach creates a robust multimodal representation for precise alignment between composed queries and target images. Both methods underscore the importance of aligning multimodal inputs for complex retrieval tasks with personalized user needs.
6.3 Personalized Multimodal Generation
Capturing personalized user intents for more accurate retrieval results is another challenge for MLLMs. Ye et al. (2024a) propose an iterative user intent expansion framework, demonstrating how MLLMs can parse and compose personalized multimodal user inputs. The framework refines the image search process through stages of parsing and logic generation, and allows users to iteratively refine their search queries. Similarly, Wang et al. (2024e) develop a multimodal query suggestion method leveraging multi-agent reinforcement learning to generate more personalized and diverse query suggestions based on user images, thereby improving the relevance of retrieval results. Additionally, Nguyen et al. (2024) present Yo’LLaVA, a personalized assistant that embeds user-specific visual concepts into latent tokens, enabling tailored interactions and retrieval. These methods collectively emphasize the integration of generation techniques into retrieval systems for more precise and personalized retrieval outcomes.
6.4 Personalized Multimodal Fine-tuning
To further improve the retrieval capabilities of MLLMs in personalized contexts, various fine-tuning techniques have been developed. FedPAM Feng et al. (2024) introduces a federated learning approach for fine-tuning text-to-image retrieval models, allowing them to adapt to user-specific data without sharing confidential information, thereby addressing the data heterogeneity challenge. VITR Gong et al. (2023) enhances vision transformers for cross-modal information retrieval by refining their ability to understand relationships between image regions and textual descriptions. Yeh et al. (2023a) further demonstrate how models can be adapted to identify specific user-defined instances, such as objects or individuals in videos, by extending the model’s vocabulary with learned instance-specific features. Additionally, Chen et al. (2023) explore task-personalized fine-tuning for visually-rich document entity retrieval, utilizing meta-learning to extract unique entity types from few examples. Furthermore, Li et al. (2024b) propose a generative cross-modal retrieval framework that fine-tunes MLLMs to memorize and retrieve visual information directly from model parameters, offering a novel approach to image retrieval. These works show the great potential of fine-tuning MLLMs to enhance their retrieval performance in personalized and diverse multimodal tasks.
7 Evaluation
The evaluation of personalized MLLMs is typically categorized based on the target task. UniMP Wei et al. (2024b) explores various personalized tasks, such as personalized preference prediction, personalized explanation generation, and user-guided image generation, among others. Several models focus on personalized recommendation tasks, as detailed in Section 5 Karra and Tulabandhula (2024); Wei et al. (2023); Zhang et al. (2024); Ye et al. (2024b). In the recommendation setting, the goal is to rank the true target (e.g., item or movie) highest on the list relative to other items. Commonly used metrics for this task include MRR, Recall@k, Hit@k, AUC, HR@k, and NDCG@k, which evaluate how well the model ranks the true target item in comparison to other options.
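These ranking metrics can be computed directly from the position of the target item in the model's ranked list; the helper below is a minimal reference implementation for the common case of a single ground-truth item per user (where Hit@k and Recall@k coincide).

```python
import math

def ranking_metrics(ranked_items, target, k=10):
    """Compute Hit@k, Recall@k, MRR, and NDCG@k for a single ground-truth target.

    `ranked_items` is the model's ranked list of item IDs (best first).
    """
    try:
        rank = ranked_items.index(target) + 1   # 1-based rank of the true item
    except ValueError:
        return {"hit@k": 0.0, "recall@k": 0.0, "mrr": 0.0, "ndcg@k": 0.0}
    hit = 1.0 if rank <= k else 0.0
    return {
        "hit@k": hit,
        "recall@k": hit,                        # identical with one relevant item
        "mrr": 1.0 / rank,
        "ndcg@k": (1.0 / math.log2(rank + 1)) if rank <= k else 0.0,
    }

print(ranking_metrics(["item_9", "item_3", "item_7"], target="item_3", k=2))
# {'hit@k': 1.0, 'recall@k': 1.0, 'mrr': 0.5, 'ndcg@k': 0.6309...}
```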
Personalized multimodal generation focuses on creating customized content, such as images or text, by considering user-specific behavior. This includes generating personalized images, posters for movies, or emojis, as demonstrated by Shen et al. (2024). Shen et al. (2024) utilize various image similarity techniques to evaluate the similarity between the generated content and historical or target items, employing metrics like LPIPS (Learned Perceptual Image Patch Similarity) Zhang et al. (2018) and SSIM (Structural Similarity Index Measure) Wang et al. (2004). Additionally, this area encompasses personalized chatbots Nguyen et al. (2024); Fei et al. (2024); Abuzuraiq and Pasquier (2024), which assess models’ abilities to recognize personalized subjects in images, handle visual and text-based question answering Alaluf et al. (2024); Nguyen et al. (2024), and evaluate emotional intelligence by measuring emotion detection accuracy and response diversity Fei et al. (2024).
Gal et al. (2022) introduce personalized text-to-image generation, synthesizing novel scenes based on user-provided concepts and natural language instructions. They evaluate the model by calculating the average pair-wise CLIP-space cosine similarity between generated images and the concept-specific training set, as well as the editability of prompts by measuring the similarity between the generated images and their textual descriptions using CLIP embeddings. Other methods in this domain Kim et al. (2024); Song et al. (2024) focus on two main aspects: identity preservation, which assesses the model’s ability to maintain the subject’s identity, and prompt fidelity, which ensures alignment between the generated images and the textual prompts. Identity preservation is typically measured by I-CLIP Radford et al. (2021) and I-DINO Caron et al. (2021), which compute subject similarity using CLIP and DINO as backbones. Prompt fidelity is evaluated through the CLIP-based text-image similarity score (T-CLIP). Image diversity is assessed using the Inception Score (IS) to capture the variation within generated sets. Jang et al. (2024) introduce a new metric, Detect-and-Compare (D&C), to evaluate multi-subject fidelity, addressing the limitations of existing metrics (like I-CLIP or DINOv2) that do not account for identity mixing in multi-subject scenarios. Wang et al. (2024d) and others Ma et al. (2024); Li et al. (2024a) focus on multi-subject personalized text-to-image generation, using M-DINO to capture subject fidelity while avoiding the subject neglect that averaged fidelity metrics may overlook.
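The CLIP-based scores described above reduce to cosine similarities between CLIP embeddings. The sketch below computes an image-image (identity-preservation) score and an image-text (prompt-fidelity) score with the Hugging Face transformers CLIP interface; the checkpoint name and file paths are assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint; any CLIP variant with the same interface would work.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_scores(generated_path, reference_path, prompt):
    """Cosine similarities in CLIP space: image-image (I-CLIP-style) and image-text (T-CLIP-style)."""
    gen_img = Image.open(generated_path)
    ref_img = Image.open(reference_path)
    with torch.no_grad():
        img_inputs = processor(images=[gen_img, ref_img], return_tensors="pt")
        img_feats = model.get_image_features(**img_inputs)
        txt_inputs = processor(text=[prompt], return_tensors="pt", padding=True)
        txt_feats = model.get_text_features(**txt_inputs)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    identity = (img_feats[0] @ img_feats[1]).item()   # generated vs. reference subject
    fidelity = (img_feats[0] @ txt_feats[0]).item()   # generated image vs. text prompt
    return {"identity_preservation": identity, "prompt_fidelity": fidelity}

# Hypothetical file names for illustration:
# clip_scores("generated.png", "reference_subject.png", "a photo of my dog at the beach")
```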
Other tasks include personalized image retrieval, where the Vision-Language model is expected to retrieve a collection of relevant images based on a textual query, using personalized context previously provided by the user (either in the form of images or text). Cohen et al. (2022) first introduce the concept of Personalized Vision and Language, along with a benchmark to evaluate models on tasks like personalized image retrieval and personalized image segmentation. ConCon-Chi Rosasco et al. (2024) further extend this by proposing a new benchmark that evaluates models’ ability to learn new meanings and their compositionality with known concepts. The setting of personalized retrieval has also been expanded to videos in the works of kor (2022); Yeh et al. (2023b). Zero-shot Composed Image Retrieval (ZS-CIR) evaluates the model’s capability to retrieve images based on compositional queries, without requiring prior examples for new combinations of known concepts. The metrics typically used for these tasks include measuring the rank of the first ground truth (GT) image using Mean Reciprocal Rank (MRR), Recall@k to determine if any GT image appears in the top-k results, and Mean Average Precision (MAP) to assess the ranking of all GT images. Additionally, MAP@k evaluates the precision of GT images up to the top-k retrieved results. Lastly, personalized semantic segmentation focuses on segmenting an instance of a personalized concept in an image, based on a textual query that refers to that concept. Cohen et al. (2022) use the intersection-over-union (IoU) metric to evaluate this, reporting the rate of predictions with IoU above a specified threshold.
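For the retrieval and segmentation metrics in this paragraph, the helpers below sketch average precision at k over multiple ground-truth images and IoU between binary masks; they assume plain ID lists and NumPy masks as inputs.

```python
import numpy as np

def average_precision_at_k(ranked_ids, relevant_ids, k=10):
    """MAP@k component for one query: mean precision at each rank where a GT image appears."""
    relevant = set(relevant_ids)
    hits, precisions = 0, []
    for rank, item_id in enumerate(ranked_ids[:k], start=1):
        if item_id in relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / min(len(relevant), k) if relevant else 0.0

def iou(pred_mask, gt_mask):
    """Intersection-over-union between two binary segmentation masks."""
    pred, gt = pred_mask.astype(bool), gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    return np.logical_and(pred, gt).sum() / union if union else 0.0

print(average_precision_at_k(["a", "b", "c", "d"], relevant_ids={"b", "d"}, k=4))  # 0.5
pred = np.array([[1, 1], [0, 0]]); gt = np.array([[1, 0], [0, 0]])
print(iou(pred, gt))  # 0.5
```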
8 Datasets
In recent years, the field of multimodal and personalized learning has seen an increase in datasets, each designed to address specific research challenges. These datasets span various domains, including vision-language models, agent collaboration networks, fashion retrieval, and cross-modal retrieval tasks. For instance, ConCon-Chi Rosasco et al. (2024) and MSMTPInfo Shi et al. (2024b) provide benchmarks for evaluating complex, dynamic user interactions in multimodal contexts, while fashion-focused datasets such as FashionIQ Wu et al. (2021), and Fashion200k Han et al. (2017) offer rich collections of images and triplets for advancing research in fashion retrieval and recommendation. UniMP Wei et al. (2024b) uses Amazon review data including the user-item interactions and the items’ images. Other datasets like RefCOCOg Mao et al. (2016) and CLEVR Johnson et al. (2017) focus on relationships between objects and regions in images, contributing to cross-modal retrieval research. In addition, a wide range of datasets in multimodal recommendation and information retrieval are used in developing and evaluating personalized MLLMs. We summarize a series of comprehensive datasets with detailed descriptions in Table 2 (Appendix A).
9 Open Problems & Challenges
In this section, we discuss open problems and highlight important challenges for future work.
9.1 Benchmark Datasets
For developing better personalized MLLMs, there is a need for more robust and comprehensive benchmark datasets to improve both training and evaluation. Currently, only a limited number of multimodal benchmark datasets include user-specific information.
9.2 Evaluation Metrics
Many works have focused mainly on evaluating downstream tasks such as recommendation, rather than directly assessing the quality of generated outputs. However, direct evaluation of generation quality is crucial for improving these models.
9.3 Multimodality Diversity and Complexity
Most existing work leverages only standard modalities such as text and images. Future work should explore more diverse modalities, such as audio, video, and graphs. Furthermore, there is a need to develop techniques that support many modalities simultaneously, as most work has focused on only two.
9.4 Modality Fusion
In MLLMs, a common challenge is the dominance of text during modality fusion. Since these models are typically pre-trained on vast amounts of text data, they become highly proficient at processing and interpreting textual information. Consequently, when integrating multiple modalities, the model tends to over-rely on text, which can overshadow other crucial data sources like images or audio. This text bias often results in suboptimal performance on tasks that require a deeper understanding of non-textual information, where visual or audio cues are key to providing the full context.
9.5 Theoretical Foundations
Developing theoretical foundations for the techniques behind personalized MLLMs remains an open problem Wu et al. (2024a). Understanding their theoretical limits and trade-offs is also of fundamental importance and remains an open direction for future work.
10 Conclusion
In this work, we present a comprehensive survey on personalized multimodal large language models, focusing on their architectures, training methods, and applications. We introduce an intuitive taxonomy for categorizing the techniques used to personalize MLLMs for individual users and provide a detailed discussion of these approaches. Additionally, we explore how these techniques can be combined or adapted when appropriate, highlighting both their advantages and underlying principles. We offer a concise summary of the personalization tasks addressed in existing research and review the evaluation metrics used for each task. We also summarize key datasets that are valuable for benchmarking personalized MLLMs. Finally, we identify important open challenges that remain to be addressed. This survey serves as a valuable resource for researchers and practitioners seeking to understand and advance the development of personalized multimodal large language models.
Limitations
The extent of personalization in MLLMs covered in this paper is inherently limited by the available datasets and applications. Moreover, our focus is on MLLMs that incorporate specific personalized multimodal instructions, and we do not address inherent model biases that may affect personalization. Addressing such biases could be a valuable direction for future research in MLLM personalization.
References
- kor (2022) 2022. Personalised clip or: how to find your vacation videos.
- Abuzuraiq and Pasquier (2024) Ahmed M Abuzuraiq and Philippe Pasquier. 2024. Towards personalizing generative ai with small data for co-creation in the visual arts. In IUI Workshops.
- Agrawal et al. (2023) Harsh Agrawal, Aditya Mishra, Manish Gupta, et al. 2023. Multimodal persona based generation of comic dialogs. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14150–14164.
- Alaluf et al. (2024) Yuval Alaluf, Elad Richardson, Sergey Tulyakov, Kfir Aberman, and Daniel Cohen-Or. 2024. Myvlm: Personalizing vlms for user-specific queries. arXiv preprint arXiv:2403.14599.
- AlSaad et al. (2024) Rawan AlSaad, Alaa Abd-Alrazaq, Sabri Boughorbel, Arfan Ahmed, Max-Antoine Renault, Rafat Damseh, and Javaid Sheikh. 2024. Multimodal large language models in health care: Applications, challenges, and future outlook. Journal of Medical Internet Research, 26:e59505.
- Berg et al. (2010) Tamara L. Berg, Alexander C. Berg, and Jonathan Shih. 2010. Automatic attribute discovery and characterization from noisy web data. In Proceedings of the 11th European Conference on Computer Vision: Part I, ECCV’10, page 663–676, Berlin, Heidelberg. Springer-Verlag.
- Caron et al. (2021) Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. 2021. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660.
- Chen et al. (2023) Jiayi Chen, Hanjun Dai, Bo Dai, Aidong Zhang, and Wei Wei. 2023. On task-personalized multimodal few-shot learning for visually-rich document entity retrieval. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 9006–9025, Singapore. Association for Computational Linguistics.
- Chen et al. (2024a) Jin Chen, Zheng Liu, Xu Huang, Chenwang Wu, Qi Liu, Gangwei Jiang, Yuanhao Pu, Yuxuan Lei, Xiaolong Chen, Xingmei Wang, et al. 2024a. When large language models meet personalization: Perspectives of challenges and opportunities. World Wide Web, 27(4):42.
- Chen et al. (2024b) Long Chen, Oleg Sinavski, Jan Hünermann, Alice Karnsund, Andrew James Willmott, Danny Birch, Daniel Maund, and Jamie Shotton. 2024b. Driving with llms: Fusing object-level vector modality for explainable autonomous driving. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 14093–14100. IEEE.
- Chen et al. (2024c) Pengan Chen, Ziniu Wu, Jiawei Sun, Dong Wang, Peng Zhou, Nieqing Cao, Yan Ding, Bin Zhao, Xuelong Li, et al. 2024c. Alignbot: Aligning vlm-powered customized task planning with user reminders through fine-tuning for household robots. arXiv preprint arXiv:2409.11905.
- Chen et al. (2019) Wen Chen, Pipei Huang, Jiaming Xu, Xin Guo, Cheng Guo, Fei Sun, Chao Li, Andreas Pfadler, Huan Zhao, and Binqiang Zhao. 2019. Pog: personalized outfit generation for fashion recommendation at alibaba ifashion. In Proceedings of the 25th ACM SIGKDD international conference on knowledge discovery & data mining, pages 2662–2670.
- Choudhury et al. (2024) Muntabir Hasan Choudhury, Lamia Salsabil, William A Ingram, Edward A Fox, and Jian Wu. 2024. Etdpc: A multimodality framework for classifying pages in electronic theses and dissertations. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 22878–22884.
- Cohen et al. (2022) Niv Cohen, Rinon Gal, Eli A Meirom, Gal Chechik, and Yuval Atzmon. 2022. “this is my unicorn, fluffy”: Personalizing frozen vision-language representations. In European conference on computer vision, pages 558–577. Springer.
- Cui et al. (2024) Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al. 2024. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 958–979.
- Dai et al. (2024) Qian Dai, Dong Wei, Hong Liu, Jinghan Sun, Liansheng Wang, and Yefeng Zheng. 2024. Federated modality-specific encoders and multimodal anchors for personalized brain tumor segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 1445–1453.
- Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255.
- Deng et al. (2024) Xiuqi Deng, Lu Xu, Xiyao Li, Jinkai Yu, Erpeng Xue, Zhongyuan Wang, Di Zhang, Zhaojie Liu, Guorui Zhou, Yang Song, et al. 2024. End-to-end training of multimodal model and ranking model. arXiv preprint arXiv:2404.06078.
- Fan et al. (2023) Yue Fan, Kevin K Bowden, Wen Cui, Winson Chen, Vrindavan Harrison, Angela Ramirez, Saaket Agashe, Xinyue Gabby Liu, Neha Pullabhotla, NQJ Bheemanpally, et al. 2023. Athena 3.0: personalized multimodal chatbot with neuro-symbolic dialogue generators. Alexa Prize Soc Bot Grand Challenge, 5.
- Fei et al. (2024) Hao Fei, Han Zhang, Bin Wang, Lizi Liao, Qian Liu, and Erik Cambria. 2024. Empathyear: An open-source avatar multimodal empathetic chatbot. arXiv preprint arXiv:2406.15177.
- Feng et al. (2024) Yueying Feng, Fan Ma, Wang Lin, Chang Yao, Jingyuan Chen, and Yi Yang. 2024. Fedpam: Federated personalized augmentation model for text-to-image retrieval. In Proceedings of the 2024 International Conference on Multimedia Retrieval, pages 1185–1189.
- Gal et al. (2022) Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H Bermano, Gal Chechik, and Daniel Cohen-Or. 2022. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618.
- Gong et al. (2023) Yan Gong, Georgina Cosma, and Axel Finke. 2023. Vitr: augmenting vision transformers with relation-focused learning for cross-modal information retrieval. ACM Transactions on Knowledge Discovery from Data.
- Han et al. (2017) Xintong Han, Zuxuan Wu, Phoenix X Huang, Xiao Zhang, Menglong Zhu, Yuan Li, Yang Zhao, and Larry S Davis. 2017. Automatic spatially-aware fashion concept discovery. In Proceedings of the IEEE international conference on computer vision, pages 1463–1471.
- He and McAuley (2016) Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In proceedings of the 25th international conference on world wide web, pages 507–517.
- Hu et al. (2024) Hengchang Hu, Qijiong Liu, Chuang Li, and Min-Yen Kan. 2024. Lightweight modality adaptation to sequential recommendation via correlation supervision. In European Conference on Information Retrieval, pages 123–139. Springer.
- Jang et al. (2024) Sangwon Jang, Jaehyeong Jo, Kimin Lee, and Sung Ju Hwang. 2024. Identity decoupling for multi-subject personalization of text-to-image models. arXiv preprint arXiv:2404.04243.
- Johnson et al. (2017) Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. 2017. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910.
- Karra and Tulabandhula (2024) Saketh Reddy Karra and Theja Tulabandhula. 2024. Interarec: Interactive recommendations using multimodal large language models. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pages 32–43. Springer.
- Kim et al. (2024) Kangyeol Kim, Wooseok Seo, Sehyun Nam, Bodam Kim, Suhyeon Jeong, Wonwoo Cho, Jaegul Choo, and Youngjae Yu. 2024. Layout-and-retouch: A dual-stage framework for improving diversity in personalized image generation. arXiv preprint arXiv:2407.09779.
- Kirillov et al. (2023) Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. 2023. Segment anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4015–4026.
- Kumar et al. (2024) Ishita Kumar, Snigdha Viswanathan, Sushrita Yerra, Alireza Salemi, Ryan A Rossi, Franck Dernoncourt, Hanieh Deilamsalehy, Xiang Chen, Ruiyi Zhang, Shubham Agarwal, et al. 2024. Longlamp: A benchmark for personalized long-form text generation. arXiv preprint arXiv:2407.11016.
- Li et al. (2024a) Wei Li, Xue Xu, Jiachen Liu, and Xinyan Xiao. 2024a. Unimo-g: Unified image generation through multimodal conditional diffusion. arXiv preprint arXiv:2401.13388.
- Li et al. (2024b) Yongqi Li, Wenjie Wang, Leigang Qu, Liqiang Nie, Wenjie Li, and Tat-Seng Chua. 2024b. Generative cross-modal retrieval: Memorizing images in multimodal language models for retrieval and beyond. arXiv preprint arXiv:2402.10805.
- Li et al. (2024c) Yunxin Li, Baotian Hu, Wenhan Luo, Lin Ma, Yuxin Ding, and Min Zhang. 2024c. A multimodal in-context tuning approach for e-commerce product description generation. arXiv preprint arXiv:2402.13587.
- Liu et al. (2024a) Hongcheng Liu, Yusheng Liao, Siqv Ou, Yuhao Wang, Heyang Liu, Yanfeng Wang, and Yu Wang. 2024a. Med-pmc: Medical personalized multi-modal consultation with a proactive ask-first-observe-next paradigm. arXiv preprint arXiv:2408.08693.
- Liu et al. (2024b) Jingping Liu, Mingchuan Zhang, Weichen Li, Chao Wang, Shuang Li, Haiyun Jiang, Sihang Jiang, Yanghua Xiao, and Yunwen Chen. 2024b. Beyond entities: A large-scale multi-modal knowledge graph with triplet fact grounding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 18653–18661.
- Liu et al. (2024c) Yuqing Liu, Yu Wang, Lichao Sun, and Philip S Yu. 2024c. Rec-gpt4v: Multimodal recommendation with large vision-language models. arXiv preprint arXiv:2402.08670.
- Lu et al. (2024) Ming Y Lu, Bowen Chen, Drew FK Williamson, Richard J Chen, Melissa Zhao, Aaron K Chow, Kenji Ikemura, Ahrong Kim, Dimitra Pouli, Ankush Patel, et al. 2024. A multimodal generative ai copilot for human pathology. Nature, pages 1–3.
- Lyu et al. (2024a) Hanjia Lyu, Song Jiang, Hanqing Zeng, Yinglong Xia, Qifan Wang, Si Zhang, Ren Chen, Chris Leung, Jiajie Tang, and Jiebo Luo. 2024a. Llm-rec: Personalized recommendation via prompting large language models. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 583–612.
- Lyu and Luo (2022) Hanjia Lyu and Jiebo Luo. 2022. Understanding political polarization via jointly modeling users, connections and multimodal contents on heterogeneous graphs. In Proceedings of the 30th ACM International Conference on Multimedia, pages 4072–4082.
- Lyu et al. (2024b) Hanjia Lyu, Ryan Rossi, Xiang Chen, Md Mehrab Tanjim, Stefano Petrangeli, Somdeb Sarkhel, and Jiebo Luo. 2024b. X-reflect: Cross-reflection prompting for multimodal recommendation. arXiv preprint arXiv:2408.15172.
- Ma et al. (2024) Jian Ma, Junhao Liang, Chen Chen, and Haonan Lu. 2024. Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning. In ACM SIGGRAPH 2024 Conference Papers, pages 1–12.
- Mao et al. (2016) Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy. 2016. Generation and comprehension of unambiguous object descriptions. In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11–20.
- McAuley et al. (2015) Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th international ACM SIGIR conference on research and development in information retrieval, pages 43–52.
- Nguyen et al. (2024) Thao Nguyen, Haotian Liu, Yuheng Li, Mu Cai, Utkarsh Ojha, and Yong Jae Lee. 2024. Yo’llava: Your personalized language and vision assistant. arXiv preprint arXiv:2406.09400.
- Ni et al. (2023) Yongxin Ni, Yu Cheng, Xiangyan Liu, Junchen Fu, Youhua Li, Xiangnan He, Yongfeng Zhang, and Fajie Yuan. 2023. A content-driven micro-video recommendation dataset at scale. arXiv preprint arXiv:2309.15379.
- Patel et al. (2024) Maitreya Patel, Sangmin Jung, Chitta Baral, and Yezhou Yang. 2024. lambda-eclipse: Multi-concept personalized text-to-image diffusion models by leveraging clip latent space. arXiv preprint arXiv:2402.05195.
- Pi et al. (2024) Renjie Pi, Jianshu Zhang, Tianyang Han, Jipeng Zhang, Rui Pan, and Tong Zhang. 2024. Personalized visual instruction tuning. arXiv preprint arXiv:2410.07113.
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.
- Rafailidis et al. (2017) Dimitrios Rafailidis, Pavlos Kefalas, and Yannis Manolopoulos. 2017. Preference dynamics with multimodal user-item interactions in social media recommendation. Expert Systems with Applications, 74:11–18.
- Rosasco et al. (2024) Andrea Rosasco, Stefano Berti, Giulia Pasquale, Damiano Malafronte, Shogo Sato, Hiroyuki Segawa, Tetsugo Inada, and Lorenzo Natale. 2024. Concon-chi: Concept-context chimera benchmark for personalized vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22239–22248.
- Shen et al. (2023) Jialie Shen, Marie Morrison, and Zhu Li. 2023. Scalable multimodal learning and multimedia recommendation. In 2023 IEEE 9th International Conference on Collaboration and Internet Computing (CIC), pages 121–124. IEEE.
- Shen et al. (2024) Xiaoteng Shen, Rui Zhang, Xiaoyan Zhao, Jieming Zhu, and Xi Xiao. 2024. Pmg: Personalized multimodal generation with large language models. In Proceedings of the ACM on Web Conference 2024, pages 3833–3843.
- Shi et al. (2024a) Jing Shi, Wei Xiong, Zhe Lin, and Hyun Joon Jung. 2024a. Instantbooth: Personalized text-to-image generation without test-time finetuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8543–8552.
- Shi et al. (2024b) Yunxiao Shi, Min Xu, Haimin Zhang, Xing Zi, and Qiang Wu. 2024b. A learnable agent collaboration network framework for personalized multimodal ai search engine. arXiv preprint arXiv:2409.00636.
- Song et al. (2024) Kunpeng Song, Yizhe Zhu, Bingchen Liu, Qing Yan, Ahmed Elgammal, and Xiao Yang. 2024. Moma: Multimodal llm adapter for fast personalized image generation. arXiv preprint arXiv:2404.05674.
- Sugihara et al. (2024) Tomoya Sugihara, Shuntaro Masuda, Ling Xiao, and Toshihiko Yamasaki. 2024. Language-guided self-supervised video summarization using text semantic matching considering the diversity of the video. arXiv preprint arXiv:2405.08890.
- Tian et al. (2024) Jiahao Tian, Jinman Zhao, Zhenkai Wang, and Zhicheng Ding. 2024. Mmrec: Llm based multi-modal recommender system. arXiv preprint arXiv:2408.04211.
- Wang et al. (2024a) Hei-Chia Wang, Martinus Maslim, and Wei-Ting Hong. 2024a. Personalized time-sync comment generation based on a multimodal transformer. Multimedia Systems, 30(2):105.
- Wang et al. (2024b) Jianing Wang, Junda Wu, Yupeng Hou, Yao Liu, Ming Gao, and Julian McAuley. 2024b. Instructgraph: Boosting large language models via graph-centric instruction tuning and preference alignment. arXiv preprint arXiv:2402.08785.
- Wang et al. (2024c) Lei Wang, Yi Hu, Jiabang He, Xing Xu, Ning Liu, Hui Liu, and Heng Tao Shen. 2024c. T-sciq: Teaching multimodal chain-of-thought reasoning via large language model signals for science question answering. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19162–19170.
- Wang et al. (2024d) X Wang, Siming Fu, Qihan Huang, Wanggui He, and Hao Jiang. 2024d. Ms-diffusion: Multi-subject zero-shot image personalization with layout guidance. arXiv preprint arXiv:2406.07209.
- Wang et al. (2023) Xuan Wang, Guanhong Wang, Wenhao Chai, Jiayu Zhou, and Gaoang Wang. 2023. User-aware prefix-tuning is a good learner for personalized image captioning. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 384–395. Springer.
- Wang et al. (2024e) Zheng Wang, Bingzheng Gan, and Wei Shi. 2024e. Multimodal query suggestion with multi-agent reinforcement learning from human feedback. In Proceedings of the ACM on Web Conference 2024, pages 1374–1385.
- Wang et al. (2004) Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Simoncelli. 2004. Image quality assessment: from error visibility to structural similarity. IEEE transactions on image processing, 13(4):600–612.
- Wei et al. (2024a) Tianxin Wei, Bowen Jin, Ruirui Li, Hansi Zeng, Zhengyang Wang, Jianhui Sun, Qingyu Yin, Hanqing Lu, Suhang Wang, Jingrui He, and Xianfeng Tang. 2024a. Towards unified multi-modal personalization: Large vision-language models for generative recommendation and beyond. In The Twelfth International Conference on Learning Representations.
- Wei et al. (2024b) Tianxin Wei, Bowen Jin, Ruirui Li, Hansi Zeng, Zhengyang Wang, Jianhui Sun, Qingyu Yin, Hanqing Lu, Suhang Wang, Jingrui He, and Xianfeng Tang. 2024b. Towards unified multi-modal personalization: Large vision-language models for generative recommendation and beyond. In ICLR.
- Wei et al. (2023) Wei Wei, Chao Huang, Lianghao Xia, and Chuxu Zhang. 2023. Multi-modal self-supervised learning for recommendation. In Proceedings of the ACM Web Conference 2023, pages 790–800.
- Wei et al. (2024c) Zhichao Wei, Qingkun Su, Long Qin, and Weizhi Wang. 2024c. Mm-diff: High-fidelity image personalization via multi-modal condition integration. arXiv preprint arXiv:2403.15059.
- Wu et al. (2021) Hui Wu, Yupeng Gao, Xiaoxiao Guo, Ziad Al-Halah, Steven Rennie, Kristen Grauman, and Rogerio Feris. 2021. Fashion iq: A new dataset towards retrieving images by natural language feedback. In Proceedings of the IEEE/CVF Conference on computer vision and pattern recognition, pages 11307–11317.
- Wu et al. (2024a) Junda Wu, Xintong Li, Tong Yu, Yu Wang, Xiang Chen, Jiuxiang Gu, Lina Yao, Jingbo Shang, and Julian McAuley. 2024a. Commit: Coordinated instruction tuning for multimodal large language models. arXiv preprint arXiv:2407.20454.
- Wu et al. (2024b) Yihan Wu, Ruihua Song, Xu Chen, Hao Jiang, Zhao Cao, and Jin Yu. 2024b. Understanding human preferences: Towards more personalized video to text generation. In Proceedings of the ACM on Web Conference 2024, pages 3952–3963.
- Xie et al. (2024) Jingyi Xie, Rui Yu, He Zhang, Sooyeon Lee, Syed Masum Billah, and John M Carroll. 2024. Emerging practices for large multimodal model (lmm) assistance for people with visual impairments: Implications for design. arXiv preprint arXiv:2407.08882.
- Xu et al. (2024) Yahui Xu, Yi Bin, Jiwei Wei, Yang Yang, Guoqing Wang, and Heng Tao Shen. 2024. Align and retrieve: Composition and decomposition learning in image retrieval with text feedback. IEEE Transactions on Multimedia.
- Yang et al. (2023) Zhengyuan Yang, Linjie Li, Kevin Lin, Jianfeng Wang, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. 2023. The dawn of lmms: Preliminary explorations with gpt-4v (ision). arXiv preprint arXiv:2309.17421, 9(1):1.
- Ye et al. (2024a) Yilin Ye, Qian Zhu, Shishi Xiao, Kang Zhang, and Wei Zeng. 2024a. The contemporary art of image search: Iterative user intent expansion via vision-language model. Proceedings of the ACM on Human-Computer Interaction, 8(CSCW1):1–31.
- Ye et al. (2024b) Yuyang Ye, Zhi Zheng, Yishan Shen, Tianshu Wang, Hengruo Zhang, Peijun Zhu, Runlong Yu, Kai Zhang, and Hui Xiong. 2024b. Harnessing multimodal large language models for multimodal sequential recommendation. arXiv preprint arXiv:2408.09698.
- Yeh et al. (2023a) Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, Fabian Caba Heilbron, and Simon Jenni. 2023a. Meta-personalizing vision-language models to find named instances in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19123–19132.
- Yeh et al. (2023b) Chun-Hsiao Yeh, Bryan Russell, Josef Sivic, Fabian Caba Heilbron, and Simon Jenni. 2023b. Meta-personalizing vision-language models to find named instances in video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19123–19132.
- Yong et al. (2023) Qian Yong, Jueqi Wei, YiRen Zhang, XiLun Zhang, Chao Wei, Simiao Chen, Yunhe Li, Cheng Ye, Bing Huang, and Hao Wang. 2023. Cgsmp: Controllable generative summarization via multimodal prompt. In Proceedings of the 1st Workshop on Large Generative Models Meet Multimodal Applications, pages 45–50.
- Young et al. (2014) Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
- Yu et al. (2024) Xiaohan Yu, Li Zhang, Xin Zhao, Yue Wang, and Zhongrui Ma. 2024. Ra-rec: An efficient id representation alignment framework for llm-based recommendation. arXiv preprint arXiv:2402.04527.
- Zhang et al. (2024) Peiyan Zhang, Yuchen Yan, Xi Zhang, Liying Kang, Chaozhuo Li, Feiran Huang, Senzhang Wang, and Sunghun Kim. 2024. Gpt4rec: Graph prompt tuning for streaming recommendation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1774–1784.
- Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595.
- Zhong et al. (2024) Linhao Zhong, Yan Hong, Wentao Chen, Binglin Zhou, Yiyi Zhang, Jianfu Zhang, and Liqing Zhang. 2024. User-friendly customized generation with multi-modal prompts. arXiv preprint arXiv:2405.16501.
- Zhou et al. (2023) Peilin Zhou, Meng Cao, You-Liang Huang, Qichen Ye, Peiyan Zhang, Junling Liu, Yueqi Xie, Yining Hua, and Jaeboum Kim. 2023. Exploring recommendation capabilities of gpt-4v (ision): A preliminary case study. arXiv preprint arXiv:2311.04199.
Appendix A Summary of Datasets in Personalized MLLM Recommendation and Retrieval
Table 2: Summary of datasets for personalized MLLM recommendation and retrieval.

| Dataset Name | Description | Stats |
|---|---|---|
| ConCon-Chi Rosasco et al. (2024) | Benchmark for personalized vision-language tasks | 4,008 images; 20 concepts; 101 contexts |
| MSMTPInfo Shi et al. (2024b) | Evaluation for Agent Collaboration Network | 13 themes; multiple sessions; dynamic user topics |
| Shoes Berg et al. (2010) | Dataset for interactive fashion image retrieval | 8,900 training triplets; 1,700 test triplets |
| Fashion200k Han et al. (2017) | Large-scale fashion dataset with over 200K images | 172K training images; 33K test queries |
| Business Dataset Deng et al. (2009) | User query images from a real image search engine | 50K images; 5 suggestions per image |
| Yo’LLaVA Nguyen et al. (2024) | Personalized language and vision assistant dataset | 40 subjects; 10-20 images per subject |
| RefCOCOg Mao et al. (2016) | Images from MS-COCO with relational annotations | 21,899 train, 1,300 val, and 2,600 test images |
| Flickr30K Young et al. (2014) | Benchmark for visual-semantic embedding networks | 29,000 train, 1,000 validation, and 1,000 test images |
| MicroLens Ni et al. (2023) | Video introductions and cover images | 1 billion user-item interactions; 34 million users |
| Amazon-Baby McAuley et al. (2015) | Images and product descriptions | 180 million relationships; 6 million objects |
| InteraRec Karra and Tulabandhula (2024) | Screenshot-based recommendations using MLLMs | 1,500 sessions; item IDs; screenshots of webpages |
| POG Chen et al. (2019) | Fashion recommendation via personalized outfit generation | 1.43 million outfits; 80 most frequent categories |
| MovieLens Ni et al. (2023) | 5-star ratings and free-text tagging activity | 100K ratings; 3,683 tag applications; 9,742 movies |
To tackle these diverse challenges, researchers employ a wide array of approaches, ranging from the use of established, large-scale public datasets to the creation of tailored datasets for specific tasks. For example, studies on multimodal sequential recommendation often leverage datasets like Amazon-Baby McAuley et al. (2015), Amazon-Game He and McAuley (2016), and MicroLens Ni et al. (2023) for evaluation. In contrast, some researchers harness the power of Large Language Models (LLMs) to generate synthetic data, as exemplified by the Business Dataset Deng et al. (2009), which utilizes GPT-generated suggestions for labeling. The development of custom datasets also plays a crucial role, as seen in POG’s Chen et al. (2019) adaptation of the iFashion dataset for personalized fashion recommendations, InteraRec’s Karra and Tulabandhula (2024) collection of screenshots from Amazon websites to create a new resource for multimodal recommendation research, and LongLaMP Kumar et al. (2024) benchmark to evaluate long-form personalized text generation. A summary of these datasets can be found in Table 2.
Appendix B Applications
Personalized MLLMs have an extensive range of applications, targeting various tasks in the textual, visual, audio, and other domains.
B.1 Personalized MLLM Recommendation
Liu et al. (2024b) develop a multimodal knowledge graph that recommends missing entities in triplet structures. Their approach predicts relationships between entities (e.g., people) within images.
B.2 Personalized MLLM Retrieval
Choudhury et al. (2024) classify Electronic Theses and Dissertations (ETD) through a combination of visual and textual learning.
B.3 Personalized MLLM Text Generation
Wei et al. (2024a) propose a multimodal learning framework in which a vision model extracts features from images and a language model learns from texts. The extracted features are jointly modeled to yield personalized product recommendations, preference prediction, and explanation generation. Additionally, Wang et al. (2024c) leverage multimodal learning to answer science-related questions using chain-of-thought reasoning.