
A Survey on Personalized Content Synthesis with Diffusion Models

Xulu Zhang1,2, Xiao-Yong Wei3,1, Wengyu Zhang1, Jinlin Wu2,5, Zhaoxiang Zhang2,4,5, Zhen Lei2,4,5, Qing Li1




1Department of Computing, the Hong Kong Polytechnic University, Hong Kong 2Center for Artificial Intelligence and Robotics, HKISI, CAS, Hong Kong 3College of Computer Science, Sichuan University, Chengdu, China 4School of Artificial Intelligence, UCAS, Beijing, China 5State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA, Beijing, China
Abstract

Recent advancements in generative models have significantly impacted content creation, leading to the emergence of Personalized Content Synthesis (PCS). With a small set of user-provided examples, PCS aims to customize the subject of interest to specific user-defined prompts. Over the past two years, more than 150 methods have been proposed. However, existing surveys mainly focus on text-to-image generation, with few providing up-to-date summaries on PCS. This paper offers a comprehensive survey of PCS, with a particular focus on the diffusion models. Specifically, we introduce the generic frameworks of PCS research, which can be broadly classified into optimization-based and learning-based approaches. We further categorize and analyze these methodologies, discussing their strengths, limitations, and key techniques. Additionally, we delve into specialized tasks within the field, such as personalized object generation, face synthesis, and style personalization, highlighting their unique challenges and innovations. Despite encouraging progress, we also present an analysis of the challenges such as overfitting and the trade-off between subject fidelity and text alignment. Through this detailed overview and analysis, we propose future directions to advance the development of PCS.

Index Terms:
Generative Models, Diffusion Models, Personalized Content Synthesis.

I Introduction

Recently, generative models have shown remarkable progress in natural language processing and computer vision. Impressive works such as ChatGPT [1] and diffusion models [2] have demonstrated capabilities comparable to human experts in many scenarios, especially in content creation. Among these advanced applications, Personalized Content Synthesis (PCS) is one of the most important branches of content generation. The goal is to learn the subject of interest (SoI) referred to by a few user-uploaded samples and to generate images aligned with a user-specified context. This greatly facilitates content recreation, such as placing a user's pet in a new background. Owing to its customized nature, this application has attracted increasing public demand, and many researchers and companies have started working on this specific generation task. Toward this goal, GANs were first applied to combine two concepts from different reference images, such as putting sunglasses on a target face. However, this approach produces unnatural copy-paste results, and the process does not support conditional guidance such as text prompts, significantly restricting practical usage. More recently, diffusion models have enabled much easier and more flexible text-guided content generation, spurring a surge of work on text-guided content personalization. Fig. 1 illustrates the total number of research papers on PCS over time. Starting from DreamBooth and Textual Inversion, both released in August 2022, over 100 methods have been proposed in a short time. However, a systematic summary of this research direction has not been conducted, and its potential applications have not been clearly investigated. Therefore, this paper aims to provide a comprehensive survey of this area to promote further progress.

Figure 1: The approximate number of papers on personalized content synthesis.

Different from conventional text-to-image synthesis, which is built on large-scale pre-training, PCS requires capturing the key visual features of the SoI from a limited number of references, occasionally as few as one. The primary challenge is therefore to make this few-shot learning process effective. According to the training strategy, we can roughly divide the approaches into two categories: optimization-based and learning-based. Specifically, optimization-based methods fine-tune a distinct generative model for each personalization request, while learning-based methods aim to train a unified model capable of handling any SoI. In this paper, we adhere to these fundamental frameworks to give a comprehensive overview of the research efforts made in this field. To provide a clear development timeline, significant works are highlighted in Fig. 2. Various methods have been proposed to address specific challenges within this research domain, signifying that the field is both highly valued and rapidly evolving.

Figure 2: A chronological overview of classical PCS methods covered in this survey, illustrating the evolution of techniques over time.

As the field of image personalization matures, studies have begun to delve into more specialized areas. The subject of interest (SoI) now encompasses not only well-defined objects but also extends to human faces, painting styles, actions, and other complex semantic elements. Additionally, there is an increasing demand for generating compositions that integrate multiple SoIs within a single image. Moreover, the scope of research has expanded beyond static images to other modalities, such as video, 3D representations, and speech. These studies are crucial for applications in real-world scenarios, such as digital marketing, virtual reality, and personalized content creation, where a harmonious blend of multiple elements is often required. We illustrate the proportion of different specialized tasks in Fig. 3, providing a visual representation of the current research landscape and the relative emphasis on various domains within the field. In this paper, we mainly focus on image personalization and systematically review the progress in all tasks, highlighting key methodologies, applications, and drawbacks. Through detailed analysis and discussion, we hope to inspire further innovations and collaborations in this field.

While current methods have demonstrated impressive performance, several challenges remain unsolved. A primary concern is overfitting caused by the limited number of available reference images. This limitation often results in the incorporation of irrelevant elements and the neglect of the textual context in the outputs. For instance, overfitted models tend to generate undesired images with the background of the reference image regardless of the user-input prompt. Another notable challenge is the trade-off between subject fidelity and text alignment. Specifically, when a model successfully reconstructs the fine-grained details of the SoI, it often sacrifices controllability. Conversely, enhancing editability often compromises the preservation of the SoI. Further challenges include the absence of robust evaluation metrics, the lack of standardized test datasets, and the need for faster processing times. This paper explores these issues and proposes potential avenues for future research. By addressing these challenges, we aim to propel advancements in personalized content synthesis and improve its practical applications.

Our contributions, which distinguish this work from other image synthesis surveys [3, 4, 5, 6, 7], lie in the following key points:

  • This paper pays special attention to personalized content synthesis, in contrast to surveys that offer a general introduction to image synthesis.

  • We categorize the content personalization into several sub-fields and provide a comprehensive summary of these specialized tasks.

  • We point out the current challenges and suggest potential innovations for future research.

Figure 3: Analysis of publications in PCS. Left: the two basic frameworks and their relative proportions. Middle: by visual medium, PCS applications can be divided into three main categories: image, video, and 3D. Right: specific sub-tasks within image personalization.

II Fundamentals

In this paper, we mainly focus on diffusion models, because recent state-of-the-art methods are mostly built on them. We present the basic formulation of the text-conditioned diffusion process based on Denoising Diffusion Probabilistic Models (DDPMs).

Typically, diffusion models involve two basic processes: a forward process and a reverse process.

The forward process iteratively adds random Gaussian noise $(\boldsymbol{\epsilon}_{t})_{t=1}^{T}\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ to a training sample $\mathbf{x}_{0}$ over a Markov chain of $T$ steps, producing noisy samples $(\mathbf{x}_{t})_{t=1}^{T}$ with

\mathbf{x}_{t}=\sqrt{\alpha_{t}}\,\mathbf{x}_{t-1}+\sqrt{1-\alpha_{t}}\,\boldsymbol{\epsilon}_{t},\quad 1\leq t\leq T, \qquad (1)

where $\alpha_{t}$ controls the variance of the Gaussian noise $\boldsymbol{\epsilon}_{t}$.

With the reparameterization trick [8], Eq. (1) can be written in the closed form

\mathbf{x}_{t}=\sqrt{\bar{\alpha}_{t}}\,\mathbf{x}_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\boldsymbol{\epsilon}_{0}, \qquad (2)

where $\bar{\alpha}_{t}=\prod_{s=1}^{t}\alpha_{s}$, which makes the forward process deterministic given the initial input $\mathbf{x}_{0}$ and a fixed noise $\boldsymbol{\epsilon}_{0}$.
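For concreteness, the following is a minimal PyTorch sketch of drawing $\mathbf{x}_{t}$ directly from $\mathbf{x}_{0}$ via Eq. (2); the linear beta schedule and the function name are illustrative assumptions rather than a specific library API.

```python
import torch

# Hypothetical DDPM-style linear beta schedule (illustrative values).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t for each step

def sample_x_t(x0: torch.Tensor, t: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
    """Draw a noisy x_t from x_0 in a single step using the closed form of Eq. (2)."""
    noise = torch.randn_like(x0)                                  # epsilon_0 ~ N(0, I)
    a_bar = alphas_cumprod.to(x0.device)[t].view(-1, 1, 1, 1)     # broadcast over (B, C, H, W)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise
    return x_t, noise
```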

The reverse process recovers the original sample from the noisy data. Since the Markov process is not directly invertible, we train a neural network $f$ parameterized by $\theta$ to learn an estimated distribution $p_{\theta}(\mathbf{x}_{0}|\mathbf{x}_{t})$. This can be viewed as a $T$-step denoising process. For a random time step $t$, the objective is to minimize the difference between the ground truth $\mathbf{x}_{0}$ and the estimate $\hat{\mathbf{x}}_{0}$ obtained by removing noise from $\mathbf{x}_{t}$, as follows:

L_{rec} = \mathbb{E}\left[w_{t}\,\|\hat{\mathbf{x}}_{0}-\mathbf{x}_{0}\|_{2}^{2}\right] = \mathbb{E}\left[w_{t}\,\|f_{\theta}(\mathbf{x}_{t},t)-\mathbf{x}_{0}\|_{2}^{2}\right], \qquad (3)

where $w_{t}$ is a time-step-dependent weight. Some methods adopt other formulations of the reconstruction process; for example, Muse uses an image-token modeling loss for patch prediction. In the inference phase, we can generate a new sample starting from an arbitrary Gaussian sample as input. However, it is impossible to control the output with only random Gaussian noise. A text prompt is an effective and convenient control signal and can be viewed as an additional condition that guides the reverse process. The reconstruction loss can then be written as

L_{rec}=\mathbb{E}\left[w_{t}\,\|f_{\theta}(\mathbf{x}_{t},t,c)-\mathbf{x}_{0}\|_{2}^{2}\right], \qquad (4)

where $c$ is the text condition. Compared with Eq. (3), this conditioned training strategy significantly improves the controllability of diffusion models. Along with the stability of training, a large amount of work based on the diffusion framework has been proposed to improve performance from various aspects.
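To illustrate how the conditioned objective of Eq. (4) is typically implemented, the sketch below assumes a hypothetical denoiser f_theta that predicts $\mathbf{x}_{0}$ from $(\mathbf{x}_{t},t,c)$, a text_encoder producing the condition $c$, and a per-timestep weight vector w; these names are placeholders, not a specific library API.

```python
import torch

def conditioned_loss(f_theta, text_encoder, x0, captions, alphas_cumprod, w):
    """One training step for the text-conditioned reconstruction objective of Eq. (4)."""
    batch = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (batch,), device=x0.device)
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod.to(x0.device)[t].view(-1, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise      # forward process, Eq. (2)
    c = text_encoder(captions)                                   # text condition
    x0_hat = f_theta(x_t, t, c)                                  # estimated clean sample
    per_sample = ((x0_hat - x0) ** 2).flatten(1).mean(dim=1)     # squared reconstruction error
    return (w.to(x0.device)[t] * per_sample).mean()
```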

Figure 4: Illustration of the optimization-based method for the training process (top) and inference phase (bottom). During the training process, the model fine-tunes its parameters by reconstructing the reference image. The unique modifier is used to depict the SoI and compose a new inference prompt.

III Generic Framework

In this section, we introduce two generic frameworks of PCS, optimization-based methods and learning-based methods, in corresponding subsections.

III-A Optimization-based Framework

Unique Modifier. A crucial aspect of optimization-based personalization is how the SoI is represented in text descriptions so that users can flexibly compose new prompts. To this end, a unique modifier is designed to symbolize the SoI [9, 10], as shown in Fig. 4. More specifically, this modifier can be reused and combined with other descriptions (e.g., “V* on beach”) during the inference phase. The construction of the unique modifier can be divided into three categories:

  • Learnable embedding. This method adds a new token and its corresponding embedding vector to the word dictionary. We call it a pseudo token because it does not exist in the original dictionary. The pseudo token acts as the modifier, with adjustable weights during fine-tuning, while the embeddings of the other tokens in the pre-defined dictionary are not affected (see the sketch after this list).

  • Plain text. This approach utilizes an explicit text description of the SoI. For example, words such as cat or yellow cat could directly represent the user’s cat in the references. This provides detailed semantic information to improve subject fidelity. However, it also alters the original meaning of the text, limiting the general applicability of these words, for example when generating other kinds of cats.

  • Rare token. Employing infrequently used tokens minimizes their impact on commonly used vocabularies and on the generalization capabilities of the pre-trained model. However, these tokens provide little useful semantic information and still exhibit weak representation in the text encoder, potentially causing ambiguity between the original text and the SoI.
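As a concrete example of the learnable-embedding modifier, the sketch below registers a pseudo token in a CLIP text encoder of the kind used by Stable Diffusion; it relies on HuggingFace transformers, and the token name "<sks>", the initializer word "cat", and the learning rate are illustrative assumptions.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

# Register a pseudo token and grow the embedding table by one row.
tokenizer.add_tokens(["<sks>"])
text_encoder.resize_token_embeddings(len(tokenizer))
new_id = tokenizer.convert_tokens_to_ids("<sks>")

# Initialize the new row from a related class noun to ease optimization.
embeds = text_encoder.get_input_embeddings().weight
init_id = tokenizer.encode("cat", add_special_tokens=False)[0]
with torch.no_grad():
    embeds[new_id] = embeds[init_id].clone()

# Freeze the encoder; only the embedding table receives gradients.
text_encoder.requires_grad_(False)
embeds.requires_grad_(True)
optimizer = torch.optim.AdamW([embeds], lr=5e-4)
# In practice, rows other than new_id are restored after each optimizer step
# so that only the pseudo-token embedding is actually updated.
```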

Training Prompt Construction. The construction of training prompts typically starts with adding prefix words, such as “Photo of V*”. However, DreamBooth [10] noted that such a simple description leads to long training times and unsatisfactory performance. To address this, they pair the unique modifier with a class noun to describe the SoI in the references (e.g., “Photo of V* cat”). The text prompt for each training reference can also be made more precise to better disentangle the SoI from irrelevant concepts, such as “Photo of V* cat on the chair” [11]. This follows the trend that high-quality captions in the training set can further improve the accuracy of text control [12].

Training Objective. As depicted in Fig. 4, the primary goal of optimization-based methods is to refine a specific cluster of parameters, denoted as $\theta^{\prime}$, for each personalization request. This process, often called test-time fine-tuning, involves adjusting $\theta^{\prime}$ to reconstruct the SoI conditioned on the reference prompt. The fine-tuning is quantified by a reconstruction loss defined as

L_{rec}=\mathbb{E}\left[w_{t}\,\|f_{\theta^{\prime}}(\mathbf{x}_{t},t,c)-\mathbf{x}_{0}\|_{2}^{2}\right]. \qquad (5)

Compared to the large-scale pre-training described in Eq. (4), the modification lies in the choice of learnable parameters. Commonly adopted options include optimizing token embeddings [9, 13], the entire diffusion model [10], specific subsets of parameters [14, 15, 16], or newly introduced parameters such as adapters [17, 18] and LoRA [19, 20, 21]. The choice of learnable parameters affects several factors, including subject fidelity, tuning speed, and storage requirements. A general observation is that a larger number of trainable parameters correlates with higher visual fidelity.
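As an illustration of restricting fine-tuning to a small parameter subset, the sketch below unfreezes only the cross-attention key and value projections of a Stable Diffusion U-Net, in the spirit of Custom Diffusion [14]; the model identifier and learning rate are assumptions for illustration.

```python
import torch
from diffusers import UNet2DConditionModel

unet = UNet2DConditionModel.from_pretrained(
    "runwayml/stable-diffusion-v1-5", subfolder="unet"
)

# Freeze everything, then unfreeze only the cross-attention key/value projections
# ("attn2" blocks attend to the text condition in this U-Net).
unet.requires_grad_(False)
trainable = []
for name, param in unet.named_parameters():
    if "attn2.to_k" in name or "attn2.to_v" in name:
        param.requires_grad_(True)
        trainable.append(param)

optimizer = torch.optim.AdamW(trainable, lr=1e-5)
print(f"Fine-tuning {sum(p.numel() for p in trainable) / 1e6:.1f}M parameters")
```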

Inference. Once the model has been fine-tuned with the optimized parameters $\theta^{\prime}$, it is ready for the inference stage, where personalized image generation takes place. By constructing new input prompts that include the unique modifier associated with the SoI, the user can generate the desired images.

Figure 5: Two classical architectures of learning-based methods in personalized image synthesis. The first framework (top) utilizes an encoder to extract image features that are injected as feature vectors into the text embedding. The second framework (bottom) introduces a new cross-attention module to fuse image features into the U-Net.

III-B Learning-based Framework

Overview. Recently, the learning-based framework for PCS has gained significant attention due to its rapid inference without the need for test-time fine-tuning. The basic idea is to leverage large-scale datasets to train robust models capable of personalizing diverse subject inputs. The training process involves minimizing a reconstruction loss between the generated images and the ground truth, similar to Eq. (4), to optimize the learnable parameters. However, it is not easy to train such a powerful model. The success of current methods hinges on three critical factors: 1) how to design an effective architecture that facilitates test-time personalization; 2) how to preserve as much SoI information as possible to ensure visual fidelity; and 3) what size of training dataset is appropriate. In the following paragraphs, we present a comprehensive analysis of these factors.

Architecture. In personalization tasks, users typically provide two types of information: one or more reference images and a textual description for content synthesis. These inputs are indispensable for structuring the architecture of the learning-based framework. According to the methods used to fuse features from these two modalities, we can categorize the learning-based approaches into two main groups: placeholder-based and reference-conditioned architectures.

Inspired by the unique modifier used in optimization-based methods, the placeholder-based methods introduce a placeholder that precedes the class noun to represent the visual characteristics of the SoI, as shown at the top of Fig. 5. The placeholder, which stores the extracted image features, is concatenated with the text embeddings processed by the text encoder. These combined features are then fused within subsequent learnable modules, such as adapters or cross-attention layers, to enhance contextual relevance.

Alternatively, the reference-conditioned architecture modifies the U-Net backbone to be conditioned on the image reference. This method employs additional layers, such as cross-attention or adapters, specifically designed to integrate the extra visual input. For instance, IP-Adapter [22] trains a lightweight decoupled cross-attention module in which the image features and text features are processed separately against the query features, and the final output is the sum of the two attention results. In this case, no placeholder is required.
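The decoupled design can be sketched as the module below. This is a simplified PyTorch illustration with assumed dimensions; unlike the actual IP-Adapter, it does not share the query projection between the two branches and trains both attention blocks.

```python
import torch
import torch.nn as nn

class DecoupledCrossAttention(nn.Module):
    """Schematic decoupled cross-attention: text and image features use separate
    key/value projections, and the two attention results are summed."""

    def __init__(self, dim=320, text_dim=768, image_dim=768, heads=8, scale=1.0):
        super().__init__()
        self.attn_text = nn.MultiheadAttention(dim, heads, kdim=text_dim,
                                               vdim=text_dim, batch_first=True)
        self.attn_image = nn.MultiheadAttention(dim, heads, kdim=image_dim,
                                                vdim=image_dim, batch_first=True)
        self.scale = scale  # weight of the image branch

    def forward(self, hidden_states, text_embeds, image_embeds):
        out_text, _ = self.attn_text(hidden_states, text_embeds, text_embeds)
        out_image, _ = self.attn_image(hidden_states, image_embeds, image_embeds)
        return out_text + self.scale * out_image
```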

Moreover, some systems, like Subject-Diffusion [23], integrate both placeholder-based and reference-conditioned modules, taking advantage of the strengths of each approach to enhance the overall personalization capability.

SoI Feature Representation. Extracting representative features of the SoI is crucial in the creation of personalized content. A common approach is to employ an encoder, leveraging pre-trained models such as CLIP [24] and BLIP [25]. While these models excel at capturing global features, they often include irrelevant information that can detract from fidelity and compromise the quality of the personalized output, for example by reproducing the reference background in the generation. To mitigate this issue, some studies incorporate additional prior knowledge to guide the learning process toward the targeted SoI. For instance, an SoI-specific mask [26, 27, 28, 29, 30, 31] contributes to effectively excluding the influence of the background. Moreover, using facial landmarks [32] in the context of human face customization helps improve identity preservation.

Handling multiple input references presents another challenge but is essential for real-world deployment. This necessitates an ensemble of features from the multiple references to augment the framework’s adaptability. Yet, the majority of current learning-based systems are limited to handling one reference input. Some research works [32, 33] propose to average or stack features extracted from multiple references to form a composite SoI representation.

Training Data. Training a learning-based model for PCS necessitates a large-scale dataset. There are primarily two types of training samples utilized:

  • Triplet Data (Reference Image, Target Image, Target Caption). This dataset format is directly aligned with the PCS objective, establishing a clear relation between the reference and the personalized content. However, collecting large-scale triplet samples poses challenges. Several strategies have been proposed to mitigate this issue: 1) Data Augmentation. Techniques such as foreground segmentation followed by placement on a different background are used to construct triplet data [34] (see the sketch after this list). 2) Synthetic Sample Generation. Methods like SuTI [35] utilize multiple optimization-based models to generate synthetic samples, which are then paired with the original references. 3) Utilizing Recognizable SoIs. Collecting images of easily recognizable subjects, such as celebrities, significantly facilitates face personalization [36].

  • Dual Data (Reference Image, Reference Caption). This dataset is essentially a simplified version of the triplet format, where the personalized content is the original image itself. Such datasets are more accessible, including collections like LAION [37] and LAION-FACE [38]. However, a notable drawback is that training tends to focus more on reconstructing the reference image rather than incorporating the text prompts. Consequently, models trained on this type of data might struggle with complex prompts that require substantial modifications or interactions with objects.
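To make the foreground-segmentation augmentation concrete, the helper below composites a segmented subject onto a new background to synthesize a (reference, target) pair; the segmentation mask is assumed to come from an off-the-shelf model, and the function name is illustrative.

```python
import numpy as np
from PIL import Image

def composite_on_background(subject_rgb: Image.Image,
                            subject_mask: Image.Image,
                            background_rgb: Image.Image) -> Image.Image:
    """Paste the masked subject onto a new background image."""
    bg = background_rgb.resize(subject_rgb.size)
    mask = np.asarray(subject_mask.convert("L")) > 127          # binary foreground mask
    out = np.where(mask[..., None], np.asarray(subject_rgb), np.asarray(bg))
    return Image.fromarray(out.astype(np.uint8))
```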

Figure 6: Visualization of several personalized content synthesis tasks. Given text prompts and a few images of the subject of interest, personalization approaches are required to produce expected images.

IV Categorization of Personalization Tasks

As shown in Fig. 6, personalization covers a range of areas, including objects, styles, faces, etc. In the following subsections, we provide an in-depth summary of these tasks.

TABLE I: Comparison between three classical methods on personalized image generation.
Model               Architecture                   Pre-training   Generation Time   Storage   Performance
Textual Inversion   Token embedding fine-tuning    No             ~20 min           Low       Moderate
DreamBooth          Diffusion weight fine-tuning   No             ~6 min            High      Good
ELITE               Large-scale pre-training       Yes            ~4 sec            Low       Moderate

IV-A Personalized Object Generation

Personalized object generation is a fundamental task that refers to the process of creating a customized visual representation of a specific object or entity. In Tab. I, we present a comparative analysis of three classical methods used in PCS: Textual Inversion [9], DreamBooth [10], and ELITE [39]. Each method employs a different approach. We introduce the specifics of each method and explore their subsequent developments.

Textual Inversion [9] applies a simple yet effective method that inserts a new token into the tokenizer to represent the subject of interest. By reconstructing the SoI references from noisy inputs, the learnable pseudo token is optimized. One of the significant benefits of this method is its minimal storage requirement, with the new tokens consuming just a few kilobytes. However, the method has some drawbacks. It compresses complex visual features into a small set of parameters, which can lead to long convergence times and a potential loss in visual fidelity. To address the issue of prolonged training times, the study in [40] identifies that the injected noise causes traditional convergence metrics to fail in determining the precise end of training. After eliminating all randomness, the reconstruction loss becomes significantly more informative, and a stopping criterion that evaluates the loss variance is designed. Recent efforts to enhance the capabilities of pseudo token embeddings are evident in several innovative approaches. P+ [13] introduces distinct textual conditions across different layers of the U-Net architecture, thereby offering better attribute control through additional learnable parameters. NeTI [41] advances this concept by proposing a neural mapper that adaptively outputs token embeddings based on the denoising timestep and the specific U-Net layer. Further, ProSpect [42] demonstrates that different types of prompts (layout, color, structure, and texture) are activated at different stages of the denoising process. Inspired by this, they also recommend optimizing multiple token embeddings tailored to different denoising timesteps. Similarly, a study by [43] leverages this layered activation insight to learn distinct attributes by selectively activating tokens within their respective scopes. Later, HiFiTuner [27] integrates multiple techniques to achieve higher text alignment and subject fidelity, including a mask-guided loss function, parameter regularization, time-dependent embedding, and generation refinement assisted by the nearest reference. In addition to time-dependent embedding, optimizing both negative and positive prompt embeddings, as suggested by DreamArtist [44], represents another method to refine the training process. Although these advanced techniques mainly aim to enhance the accuracy of subject representation, they also contribute to expediting the training phase by introducing more parameters. Beyond these approaches, further refinements and innovations continue to be explored by injecting other prior knowledge or targeting specific requirements. For example, InstructBooth [45] introduces a reinforcement learning framework that utilizes a human preference scorer [46] as a reward model to provide feedback on text fidelity. [47] introduces a gradient-free evolutionary algorithm to iteratively update the learnable token. In summary, recent developments following the foundational work of Textual Inversion focus on reducing training times and enhancing the visual quality of generated images.

In the realm of optimization-based methods for PCS, there is a clear shift towards fine-tuning model weights rather than just the token embeddings. This approach addresses the limitation that token embeddings alone struggle to capture complex semantics not covered by the pre-training data [19, 48]. DreamBooth [10] proposes to use a rare token as the unique modifier to represent the SoI and fine-tunes all parameters of the diffusion model. In addition, a regularization dataset containing 20-30 images of the same category as the SoI is adopted to overcome the overfitting problem. Combined, these two techniques achieve impressive performance and have greatly advanced research on image personalization. However, fine-tuning the entire model or a significant portion of it for each new object incurs considerable storage costs, potentially hindering widespread application. To address this, Custom Diffusion [14] focuses on identifying and fine-tuning critical parameters, particularly the key-value projections in cross-attention layers, to balance visual fidelity and storage efficiency. Another approach, Perfusion [15], also adopts cross-attention fine-tuning and proposes to regularize the update direction of the K (key) projection towards the super-category token embedding and the V (value) projection towards the learnable token embedding. COMCAT [49] introduces a low-rank approximation of attention matrices, which drastically reduces storage requirements to 6 MB while maintaining high fidelity in the outputs. Additionally, methods like adapters [17], LoRA [19, 20, 21, 30, 31], and their variants [50] are increasingly utilized in personalized generation for parameter-efficient fine-tuning. It is worth noting that pseudo token embedding fine-tuning is compatible with diffusion weight fine-tuning. For instance, the fine-tuned prompt embedding can serve as an effective initialization for subsequent diffusion weight fine-tuning [51]. Alternatively, the two parts can be optimized simultaneously with different learning rates [28, 52].

Fast response time is a crucial factor for a user-oriented application, and training a powerful model that can handle any personalization request is the ultimate goal. Re-Imagen [53] introduces a retrieval-augmented generative approach, which leverages features from text-image pairs retrieved via a specific prompt. While it is not specifically tailored for object personalization, it demonstrates the feasibility of training such frameworks. Later, ELITE [39] specifically targets image personalization by combining the global reference features with the text embedding while incorporating local features that exclude irrelevant backgrounds. Both the fused features and the local features serve as conditions for the denoising process. Similarly, InstantBooth [54] retrains CLIP models to extract image features and patch features, which are injected into the diffusion model via the attention mechanism and a learnable adapter, respectively. Additionally, UMM-Diffusion [55] designs a multi-modal encoder that produces fused features based on the reference image and text prompt; the text features and the multi-modal hidden state serve as guidance signals to predict a mixed noise. Another work, SuTI [35], adopts the same architecture as Re-Imagen. The difference lies in the training samples, which are produced by a massive number of optimization-based models, each tuned on a particular subject set. This strategy promotes more precise alignment, with personalization at the instance level rather than the class level of Re-Imagen. Moreover, [56] applies a contrastive regularization technique that pushes the pseudo embedding produced by the image encoder towards its nearest existing pre-trained token, and introduces a dual-path attention module separately conditioned on the nearest token and the pseudo embedding. Compared to methods that use separate encoders to process a single modality, some works have explored the use of pre-trained multi-modal large language models (MLLMs) that can process text and image modalities within a unified framework. For example, BLIP-Diffusion [34] utilizes the pre-trained BLIP-2 [57], which encodes multimodal inputs including the SoI reference and a class noun. The output embedding is then concatenated with the context description and serves as a condition to generate images. Further, Customization Assistant [58] and KOSMOS-G [59] replace the text encoder of Stable Diffusion with a pre-trained MLLM to output a fused feature based on the reference and the context description. Meanwhile, to meet the standard input format of Stable Diffusion, a network is trained to align the dimension of the output embedding.

Recently, some works have started to explore the combination of optimization-based and learning-based methods. Learning-based methods provide a general framework capable of handling a wide range of common objects, while optimization-based techniques enable fine-tuning to specific instances, improving the preservation of fine-grained details [20]. DreamTuner [52] pre-trains a subject encoder that outputs diffusion conditions for accurate reconstruction. In the second stage, it adopts regularization images that are similar to the reference so as to preserve fine-grained details.

IV-B Personalized Style Generation

Personalized style generation seeks to tailor the aesthetic elements of reference images. The notion of “style” now encompasses a broad range of artistic elements, including brush strokes, material textures, color schemes, structural forms, lighting techniques, and cultural influences.

In this field, StyleDrop [18] leverages adapter tuning to efficiently capture the style from a single reference image. The method demonstrates its effectiveness through iterative training, utilizing synthesized images refined by feedback mechanisms such as human evaluations and CLIP scores. This approach not only enhances style learning but also ensures that the generated styles align closely with human aesthetic judgments. Later, GAL [60] explores active learning in generative models, proposing an uncertainty-based evaluation strategy for synthetic data sampling and a weighting scheme to balance the contributions of the additional samples and the original reference. Furthermore, StyleAligned [61] focuses on maintaining stylistic consistency across a batch of images. This is achieved by using the first image as a reference, which acts as an additional key and value in the self-attention layers, ensuring that all subsequent images in the batch adhere to the same stylistic guidelines.

On another front, StyleAdapter [62] employs a dual-path cross-attention mechanism within the learning-based framework. This model introduces a specialized embedding module designed to extract and integrate global features from multiple style references.

IV-C Personalized Face Generation

Personalized face generation aims to generate diverse ID images that adhere to text prompt specifications, utilizing only a few initial face images. Compared to general object personalization, the scope is narrowed to a specific class, humans. It is easy to acquire large-scale human-centric datasets [38, 63, 64] and utilize pre-trained models in well-developed areas, like face landmark detection [65, 66] and face recognition [67].

As for optimization-based methods, [68] trains a diffusion-based PromptNet that encodes the input image and the noisy latent into a pseudo word embedding. To alleviate the overfitting problem, the noises predicted from the pseudo embedding and the context description are balanced through fusion sampling during classifier-free guidance. HyperDreamBooth [20] proposes a second-stage fine-tuning strategy to further enhance fidelity after training a learning-based model on a large-scale dataset. Additionally, [36] offers the novel idea that a personalized ID can be viewed as a composition of celebrity faces already learned by the pre-trained diffusion model. Based on this hypothesis, a simple MLP is optimized to transform face features into the celeb embedding space.

In addition to these optimization-based methods, the number of works in the learning-based framework is rapidly increasing. Face0 [69] detects and crops the face region to extract a refined embedding. During the sampling phase, the output of classifier-free guidance is replaced by a weighted combination of the noise patterns predicted by the face-only embedding, the text-only embedding, and the concatenated face-text embedding. The $\mathcal{W}+$ Adapter [70] constructs a mapping network and residual cross-attention modules to transform facial features from the StyleGAN [71] $\mathcal{W}+$ space into the text embedding space of Stable Diffusion. FaceStudio [72] adapts the cross-attention layer to support hybrid guidance including stylized images, facial images, and textual prompts. Moreover, PhotoMaker [33] constructs a high-quality dataset through a meticulous data collection and filtering pipeline and uses a two-layer MLP to fuse ID features and class embeddings into an overall representation of the human portrait. PortraitBooth [73] also employs a simple MLP, which fuses the text condition with shallow features of a pre-trained face recognition model. To ensure expression manipulation and facial fidelity, it adds an expression token and incorporates an identity preservation loss and a mask-based cross-attention loss. InstantID [32] additionally introduces a variant of ControlNet that takes facial landmarks as input, providing stronger guiding signals compared to methods that rely solely on attention fusion.

IV-D Multiple Subject Composition

Sometimes users intend to compose multiple SoIs together, which gives rise to a new task: multiple subject composition. This task is challenging for optimization-based methods, particularly with respect to integrating parameters of the same module that are separately fine-tuned for individual SoIs.

Some works try to integrate multiple sets of parameters into a single unified set. For instance, Custom Diffusion [14] proposes a constrained optimization method to merge the cross-attention key-value projection weights with the goal of maximizing reconstruction performance for each subject. Similarly, Mix-of-Show [19] updates the LoRA [74] weights using the same objective.

Additionally, some works opt for one-for-one generation followed by a fusion mechanism. StyleDrop [18] dynamically combines the noise predictions from each personalized diffusion model. In OMG [21], the latent predicted by each LoRA-tuned model is spatially composited using the subject mask.
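A schematic of this mask-based latent composition is given below; the function signature and the background fallback are illustrative assumptions rather than the exact OMG procedure.

```python
import torch

def compose_latents(latents: list[torch.Tensor],
                    masks: list[torch.Tensor],
                    background_latent: torch.Tensor) -> torch.Tensor:
    """Spatially merge per-subject latent predictions using their masks.

    latents[i] is the denoised latent from the i-th subject-specific model and
    masks[i] a (soft or binary) mask broadcastable to the latent shape; regions
    not covered by any subject fall back to background_latent."""
    out = background_latent.clone()
    for z, m in zip(latents, masks):
        out = m * z + (1.0 - m) * out
    return out
```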

Another straightforward solution is to train a single unified model on a dataset containing all expected subjects. SVDiff [16] employs a data augmentation method called Cut-Mix to compose several subjects together and applies a location loss to regularize attention maps, ensuring alignment between each subject and its corresponding token. Similar joint training strategies are found in other works [14, 36], which train a single model by reconstructing the appearance of every SoI.

Beyond these categories, Cones [75] aims to find a small cluster of neurons that preserve the most information about the SoI. For multi-concept generation, the neurons belonging to different SoIs are simultaneously activated to generate the combination.

Learning-based methods are also well suited to multi-subject generation, as they naturally accommodate the integration of multiple subject features without conflict. These methods can place each feature in its corresponding placeholder, ensuring a seamless and efficient combination [76].

IV-E High-level Semantic Personalization

The field of image personalization is expanding to include not just direct visual attributes but also complex semantic relationships and high-level concepts. Different approaches have been developed to enhance the capability of models to understand and manipulate these abstract elements.

ReVersion [77] intends to invert object relations from references. Specifically, they use a contrastive loss to guide the optimization of the token embedding towards specific clusters of Part-of-Speech tags, such as prepositions, nouns, and verbs. Meanwhile, they also increase the likelihood of adding noise at larger timesteps during the training process to emphasize the extraction of high-level semantic features.

On the other hand, Lego [78] focuses on more general concepts, such as adjectives, which are frequently intertwined with the subject appearance. The concept can be learned from a contrastive loss applied to the dataset comprising clean subject images and images that embody the desired adjectives.

Moreover, ADI [79] aims to learn the action-specific identifier from the references. To ensure the inversion only focuses on the desired action, ADI extracts gradient invariance from a constructed triplet sample and applies a threshold to mask out the irrelevant feature channels.

IV-F Attack and Defense

Advances in this technology also raise concerns about potentially harmful usage. To mitigate this, Anti-DreamBooth [80] adds a subtle noise perturbation to the references so that any personalized model trained on these samples produces only degraded results. The basic idea is to maximize the reconstruction loss of a surrogate model. Additionally, [81] suggests predefining a collection of trigger words and meaningless images. These data are paired and incorporated during the training phase; once the trigger words are encountered, the synthesized image is intentionally altered for safeguarding.

IV-G Personalization on Extra Conditions

Some personalization tasks include additional conditions for content customization. One popular task uses an additional source image as a base for personalization, where the target is to replace a subject in the source image with the SoI. It can be seen as a cross-domain task combining personalization and image editing. To address this requirement, PhotoSwap [82] first fine-tunes a diffusion model on the references to obtain a personalized model. To better preserve the background of the source image, it initializes the noise with DDIM inversion [83] and replaces the intermediate feature maps with those derived from the source image generation. Later, MagiCapture [70] broadens the scope to face customization. Another similar application is found in virtual try-on technologies, which involve fitting selected clothing onto a target person; the complexities of this task have been thoroughly analyzed in a survey by [84]. Additional conditions in personalization tasks may include adjusting the layout [85], transforming sketches [86], controlling viewpoint [87, 88], or modifying poses [32]. Each of these conditions presents unique challenges and requires specialized approaches to integrate these elements seamlessly into the personalized content.

IV-H Personalized Video Generation

In video personalization, the primary inversion objectives can be categorized into three distinct types: appearance, motion, and the combination of both subject and motion.

In appearance-based personalization, authors typically use an image as the reference point and employ video diffusion models as the foundational technology. The process involves leveraging sophisticated methods from 2D personalization, such as parameter-efficient fine-tuning [89], data augmentation [90, 89], and attention manipulation [91, 92, 89]. Additionally, several studies [93, 91, 92, 94] have explored the learning-based framework. These diffusion models are specifically tailored to synthesize videos based on the image references.

For personalization centered around motion, the reference switches to a video clip containing a consistent action. A common approach is to fine-tune the video diffusion model by reconstructing the action clip [95, 96, 97, 98, 99, 100, 101]. However, distinguishing between appearance and motion within the reference video can be challenging. For example, SAVE [96] applies appearance learning to ensure that appearance is excluded from the motion learning phase, and VMC [97] removes background information during training prompt construction.

When integrating both subject appearance and motion, innovative methods are employed to address the complexities of learning both aspects simultaneously. MotionDirector [102] utilizes spatial and temporal losses to facilitate learning across these dimensions. Another approach, DreamVideo [103], incorporates residual features from a randomly selected frame to emphasize subject information. This technique enables the fine-tuned module to primarily focus on learning motion dynamics.

In summary, video personalization strategies vary significantly based on the specific aspects, appearance, motion, or both. Moreover, due to the current limitations in robust video feature representation, the process of video synthesis that is directly conditioned on another video remains an area under exploration.

IV-I Personalized 3D generation

Basically, the pipeline of 3D personalization begins by fine-tuning a 2D diffusion model using optimization-based methods. This tuned model then guides the optimization of a 3D Neural Radiance Field (NeRF) model [104] for each specific prompt [105, 106]. In the second phase, DreamFusion [107] is the key technique; it introduces Score Distillation Sampling (SDS) to train a 3D model capable of rendering images aligned with the 2D diffusion model. Building on this foundation, several methods have been developed to improve the workflow. DreamBooth3D [108] structures the process into three phases: initializing and optimizing a NeRF from a DreamBooth model, rendering multi-view images, and fine-tuning a secondary DreamBooth for the final 3D NeRF refinement. Consist3D [109] enhances text embeddings by training two distinct tokens, a semantic token and a geometric token, during 3D model optimization. TextureDreamer [110] focuses on extracting texture maps from optimized spatially-varying bidirectional reflectance distribution function (BRDF) fields for rendering texture on a wide range of 3D subjects.

Additionally, advancements extend to 3D avatar rendering and dynamic scenes. Animate124 [111] and Dream-in-4D [112] integrate video diffusion for 4D dynamic scene support within the 3D optimization process. In avatar rendering, PAS [113] generates 3D body poses configurable by avatar settings, StyleAvatar3D [114] facilitates 3D avatar generation based on images, and AvatarBooth [115] employs dual fine-tuned diffusion models for separate face and body generation.

IV-J Others

Some works have introduced different personalization tasks. For example, SVG personalization is introduced by [116], in which a parameter-efficient fine-tuning method is applied to create SVGs. After the first-step generation, the SVGs are refined through a process that includes semantic alignment and a dual optimization approach, which utilizes both image-level and vector-level losses to enhance the final output. [117] introduces continual learning into personalization, which requires fine-tuning a model on sequential SoI inputs while preventing catastrophic forgetting. To address this issue, a loss is used to minimize the weight changes at each training step, and cross-attention parameters trained on different SoI references are regularized into a unified set. Another application, 360-degree panorama customization [118], is also emerging as a potential tool for personalization in the digital imaging realm.

V Techniques in Personalized Image Synthesis

V-A Attention-based Operation

Attention-based operations are a crucial technique in model learning, particularly for processing features effectively. These operations generally involve manipulating how a model attends to different parts of the data, often through the Query-Key-Value (QKV) scheme. However, problems also arise in this delicate module: the unique modifier may dominate the attention map, leading to a focus solely on the SoI and the neglect of other details [119].

To counteract this, a cluster of studies [56, 62, 19, 52] aims to refine this mechanism to enhance feature processing. For example, Mix-of-Show [19] enhances contextual relevance through region-aware cross-attention, replacing the feature map initially generated by the global prompt with distinct regional features corresponding to each entity. DreamTuner [52] designs an attention layer that takes the features of the generated image as the query, the concatenation of generated features as the key, and the reference features as the value. Additionally, the background of the reference is ignored in the attention modules.

Another research branch focuses on restricting the influence of the SoI token within the attention layers. For instance, Layout-Control [85] introduces a method to adjust attention weights around a specified layout without additional training. Cones 2 [120] also defines negative attention areas to penalize illegitimate occupation, allowing multiple-object generation. VICO [121] inserts a new attention layer in which a binary mask selectively obscures the attention map between the noisy latent and the reference image features. In addition to these explicit attention-weight modification methods, many researchers [76, 16, 28, 23, 29, 122, 51] employ localization supervision in the cross-attention module. DreamTuner [52] further refines this approach by designing an attention layer that more effectively integrates features from different parts of the image.
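As a schematic of training-free attention reweighting in this spirit, the function below boosts the cross-attention logits of a chosen token inside a layout region and suppresses them elsewhere; the exact formulation differs across the cited methods, so treat this purely as an illustration.

```python
import torch

def reweight_token_attention(attn_logits: torch.Tensor,
                             token_index: int,
                             region_mask: torch.Tensor,
                             boost: float = 2.0) -> torch.Tensor:
    """Encourage one text token to attend inside a spatial region.

    attn_logits: (batch*heads, num_patches, num_tokens) cross-attention logits
    region_mask: (num_patches,) binary mask of the desired layout region."""
    edited = attn_logits.clone()
    inside = region_mask.bool().unsqueeze(0)                     # (1, num_patches)
    token_logits = edited[:, :, token_index]
    edited[:, :, token_index] = torch.where(inside,
                                            token_logits + boost,
                                            token_logits - boost)
    return edited
```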

V-B Mask-guided Generation

Masks serve as a strong prior indicating the position and contour of the specified object, which is pivotal for guiding the focus of generative models. Benefiting from advanced segmentation methods, the SoI can be precisely isolated from the background. Based on this strategy, plenty of studies [26, 27, 28, 29, 30, 31, 51, 123] choose to discard the pixels of the background area so that the reconstruction loss focuses on the targeted object and excludes irrelevant disturbances. Another technique [124] extends this to background reconstruction for better disentanglement. In addition, as discussed in Section V-A, the layout indicated by the mask can be incorporated into the attention modules as a supervision signal. Moreover, the mask can be used to stitch specific feature maps together to construct more informative semantic patterns [39, 125]. Subject-Diffusion [23] takes this approach a step further by applying masking to the latent features throughout the diffusion stages. There are also other mask integration approaches. AnyDoor [126] employs an extra high-frequency filter to extract detailed features alongside the segmented subject as a condition for the image generation process. DisenBooth [127] defines an identity-irrelevant embedding with a learnable mask; by minimizing the cosine similarity between the identity-preservation embedding and the identity-irrelevant embedding, the mask adaptively excludes redundant information, and thus the subject appearance can be better preserved. PACGen [128] adds two additional prompts, an SoI suppression prompt (e.g., sks person) and a diverse prompt (e.g., high quality, color image), which participate in classifier-free guidance, assisted by a binary mask indicating the subject area. Face-Diffuser [129] derives the mask from the noise predicted by both a pre-trained text-to-image diffusion model and a learning-based personalized model: each model makes its own noise prediction, and the final noise output is a composite created through mask-guided concatenation.
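A minimal sketch of the mask-guided reconstruction loss used by many of the works above, assuming a binary SoI mask resized to the prediction resolution:

```python
import torch

def masked_reconstruction_loss(pred_x0: torch.Tensor,
                               target_x0: torch.Tensor,
                               soi_mask: torch.Tensor) -> torch.Tensor:
    """Restrict the reconstruction loss to SoI pixels so that the reference
    background does not influence fine-tuning.

    pred_x0, target_x0: (B, C, H, W); soi_mask: (B, 1, H, W) binary foreground mask."""
    diff = (pred_x0 - target_x0) ** 2 * soi_mask
    num_foreground = soi_mask.sum() * pred_x0.shape[1]           # pixels x channels
    return diff.sum() / num_foreground.clamp(min=1.0)
```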

V-C Data Augmentation

Due to limited references, existing methods often struggle to capture the complete semantic information of the SoI, making it challenging to produce realistic and diverse images. To address this, various techniques employ data augmentation strategies to enrich the diversity of the SoI. COTI [130] adopts a scorer network to progressively expand the training set by selecting semantically relevant samples with high aesthetic quality from a large web-crawled data pool. SVDiff [16] manually constructs mixed images of multiple SoIs as new training data, thereby enhancing the model's exposure to complex scenarios. Such concept composition is also used in [23, 90, 89]. BLIP-Diffusion [34] segments the foreground subject and composes it with a random background so that the original text-image pairs are expanded into an instruction-following dataset. To create such an instruction-following dataset for face personalization, DreamIdentity [131] leverages the knowledge of celebrities embedded in a large-scale pre-trained diffusion model to generate both the source image and the edited face image. PACGen [128] shows that the spatial position is also entangled with the identity information; rescaling, center cropping, and relocation are effective solutions to this issue. Besides, StyleAdapter [62] shuffles image patches to break up irrelevant subject information while preserving the desired style. Break-A-Scene [28] aims to invert multiple subjects from a single reference image. To achieve this, the method samples a random subset of the target subjects and employs a masking strategy, ensuring that the learning process focuses specifically on the sampled subjects.

V-D Regularization

Regularization refers to methods that constrain the weight updates to avoid overfitting or to better preserve the subject appearance.

A commonly adopted method is to use an additional dataset composed of images of the same category as the SoI [10]. By reconstructing these images, the personalized model is required to preserve the pre-trained knowledge, which is an effective way to alleviate the overfitting issue. Building on this strategy, StyleBoost [132] introduces an auxiliary dataset for the purpose of style personalization. Later, [11] introduces a more delicate construction pipeline for the regularization dataset, including detailed prompts that specify shape, background, color, and texture. To disentangle the subject from the background, the pipeline samples subjects that share the same class noun in both identical and varied background contexts.
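A minimal sketch of this prior-preservation style of regularization, where diffusion_loss is a placeholder for the reconstruction loss of Eq. (5) evaluated on a batch of (image, prompt) pairs:

```python
import torch

def loss_with_prior_preservation(diffusion_loss, soi_batch, prior_batch,
                                 lambda_prior: float = 1.0) -> torch.Tensor:
    """Combine SoI reconstruction with class-prior preservation."""
    loss_soi = diffusion_loss(soi_batch)      # fit the user's subject
    loss_prior = diffusion_loss(prior_batch)  # retain pre-trained class knowledge
    return loss_soi + lambda_prior * loss_prior
```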

Another regularization approach is to utilize the pre-trained text prior learned on large-scale datasets. In an ideal case, the SoI token composes fluently with other text descriptions to generate well-aligned images, just like a pre-trained word. Therefore, pre-trained words can serve as a supervision signal to guide the optimization. For example, Perfusion [15] constrains the key projection towards the embedding of the class noun to inject text-level knowledge and the value projection towards the SoI images to achieve visual fidelity. Moreover, inspired by coached active learning [133, 134], which uses anchor concepts for optimization guidance, Compositional Inversion [119] employs a set of semantically related tokens as anchors to constrain the token embedding search towards areas of high alignment with the SoI. This kind of constraint is applied to the input of the text encoder but is also applicable to the output. [135] aims to learn an encoder that produces an offset on the class token embedding to represent the key visual features; by minimizing the offset, the final word embedding, denoted by the class token plus the offset, achieves better text alignment. Similarly, Cones 2 [120] minimizes the offset by reconstructing the features of 1,000 sentences containing the class noun, and [136] optimizes the learnable token towards the mean textual embedding of 691 well-known names. [56] proposes a contrastive loss to guide the SoI text embedding close to its nearest CLIP tokens pre-trained on large-scale samples. [30] minimizes the embedding similarity between the SoI text and its class noun to improve generalizability. On the other hand, VICO [121] empirically finds that the end-of-text token <|EOT|> maintains the semantic consistency of the SoI. To leverage this observation, an L2 loss is used to reduce the difference in attention similarity logits between the SoI token and <|EOT|>.

In addition to these commonly used regularization terms, several studies have introduced novel methods from different aspects. Because the projection kurtosis of a natural image tends to remain constant across various projection directions [137], [138] introduces a loss function that minimizes the difference between the maximum and minimum kurtosis values extracted via Discrete Wavelet Transform [139].

VI Evaluation

VI-A Evaluation Dataset

To assess the performance of personalized models, various datasets have been developed:

The primary dataset used in DreamBooth [10] includes 30 subjects such as backpacks, animals, cars, toys, etc. This is expanded to DreamBench-v2 [35], which adds 220 test prompts to the subjects. Custom Diffusion [14] focuses on evaluating 10 subjects, each with 20 specific test prompts, and includes tests for multi-subject composition with 5 pairs of subjects and 8 prompts for each pair. Later, the authors released Custom-101 [14], which comprises 101 subjects to provide a broader scope of evaluation. Additionally, a recent dataset, Stellar [31], specifically targets human-centric evaluation, featuring 20,000 prompts on 400 human identities.

VI-B Evaluation Metrics

As personalized content synthesis aims to maintain fidelity to the SoI while ensuring alignment with the textual condition, the metrics are designed for text alignment and visual similarity. To measure how well the semantics of the text prompt are represented in the generated image, the CLIP similarity between text features and image features is widely adopted. To determine how closely the generated subject resembles the SoI, visual similarity is assessed with large-scale pre-trained models such as CLIP and DINO. Conventional metrics such as the Fréchet Inception Distance (FID) [140] and Inception Score (IS) [141] are also applicable for evaluating PCS models; they provide insights into general image quality and coherence. In addition to these commonly adopted metrics, there are discussions of specialized metrics for PCS system evaluation. [50] suggests evaluating personalized models based on fidelity, controllability, diversity, base model preservation, and image quality. [31] develops specific metrics for human personalization, including a soft-penalized CLIP text score, Identity Preservation Score, Attribute Preservation Score, Stability of Identity Score, Grounding Objects Accuracy, and Relation Fidelity Score. These metrics enable a structured and detailed evaluation of personalized models.
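A minimal sketch of the commonly reported CLIP-T (prompt-image) and CLIP-I (image-image) scores using HuggingFace transformers; DINO-based similarity follows the same pattern with a different backbone, and the checkpoint name is an illustrative choice.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_scores(generated_images, reference_images, prompt):
    """CLIP-T: cosine similarity between the prompt and generated images.
       CLIP-I: cosine similarity between generated and reference image features."""
    inputs = processor(text=[prompt],
                       images=list(generated_images) + list(reference_images),
                       return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    n = len(generated_images)
    gen, ref = img[:n], img[n:]
    clip_t = (gen @ txt.T).mean().item()   # text alignment
    clip_i = (gen @ ref.T).mean().item()   # subject similarity
    return clip_t, clip_i
```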

VII Challenge and Outlook

VII-A Overfitting Problem.

Current personalized content synthesis (PCS) systems face a critical challenge of overfitting, particularly when trained on a limited set of reference images. The problem manifests in two ways: 1) Loss of SoI editability: the personalized model tends to produce images that rigidly mirror the SoI in the references, such as consistently depicting a cat in an identical pose. 2) Irrelevant semantic inclusion: elements irrelevant to the SoI in the references, such as backgrounds or objects not pertinent to the current context, are reproduced in the output.

To investigate the rationale behind this behavior, Compositional Inversion [119] observes that the learned token embedding lies in an out-of-distribution area relative to the distribution formed by pre-trained word embeddings. [136] similarly finds that the pseudo token embeddings deviate significantly from the distribution of the initial embedding. Meanwhile, [119, 125, 121] find that the unique modifier dominates the cross-attention layers over the other context tokens, leading to the absence of other semantic content.
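
A simple diagnostic in this spirit, assuming access to the text encoder's token embedding table, is to measure how far the learned embedding drifts from the vocabulary distribution. The score below is an illustrative heuristic, not the exact analysis of [119] or [136], and the attribute access shown in the comment follows common diffusers/transformers conventions.

```python
import torch

def embedding_outlier_score(soi_embedding: torch.Tensor,
                            vocab_embeddings: torch.Tensor) -> float:
    """How many standard deviations the learned embedding's distance to the
    vocabulary centroid exceeds the average distance of pre-trained tokens."""
    center = vocab_embeddings.mean(dim=0)
    dists = (vocab_embeddings - center).norm(dim=-1)
    soi_dist = (soi_embedding - center).norm()
    return ((soi_dist - dists.mean()) / dists.std()).item()

# vocab_embeddings can be taken from the text encoder's embedding table, e.g.
# pipe.text_encoder.get_input_embeddings().weight  (diffusers-style access, assumed).
```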

To address this issue, many solutions have been proposed. Most methods discussed in Sec. V contribute to alleviating overfitting, including excluding redundant background, attention manipulation, regularizing the learnable parameters, and data augmentation. However, the problem remains unsolved, especially when the SoI has a non-rigid appearance [119] or the context prompt is semantically correlated with irrelevant elements in the reference [60]. Addressing overfitting in PCS is thus not merely a technical challenge but a necessity for the practical deployment and scalability of these systems in varied and dynamic real-world environments. There is therefore an urgent need for effective strategies and robust evaluation metrics to achieve broader adoption and greater satisfaction in practical use.

VII-B Trade-off on Subject Fidelity and Text Alignment.

The ultimate goal of personalized content synthesis (PCS) is to create systems that not only render the SoI with high fidelity but also respond effectively to textual prompts. However, achieving excellence in both areas simultaneously presents a notable conflict. Achieving high subject fidelity typically requires capturing and reproducing detailed, specific features of the SoI, which pushes the model to learn and replicate very delicate characteristics. On the other hand, text alignment demands that the system flexibly adapt the SoI to varying textual descriptions, which may call for changes in pose, expression, environment, or style that contradict the reconstruction objective used during training. It is therefore hard to obtain flexible adaptation to different contexts while pushing the model to capture fine-grained details. To address this inherent conflict, Perfusion [15] regularizes the cross-attention key and value projections with respect to these two objectives, and [142] decouples the conditional guidance into two separate processes, allowing subject fidelity and textual alignment to be handled distinctly. Despite these efforts, there remains room for further exploration and refinement of this issue. Enhanced model architectures, innovative training methodologies, and more dynamic data handling strategies could provide new pathways to better balance subject fidelity and text alignment in PCS systems.
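
As a generic illustration of such decoupling (not the specific formulation of [142]), a sampling step can combine two classifier-free-guidance-style directions, one from the text prompt and one from the subject condition, with independent scales:

```python
import torch

def dual_guidance(eps_uncond: torch.Tensor,
                  eps_text: torch.Tensor,
                  eps_subject: torch.Tensor,
                  w_text: float = 7.5,
                  w_subject: float = 3.0) -> torch.Tensor:
    """Combine two guidance directions with independent scales (illustrative)."""
    return (eps_uncond
            + w_text * (eps_text - eps_uncond)
            + w_subject * (eps_subject - eps_uncond))
```

In such a scheme, raising the subject scale favors fidelity to the SoI while raising the text scale favors alignment with the prompt, making the trade-off an explicit sampling-time choice.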

VII-C Standardization and Evaluation.

Despite the popularity of personalization, there is a noticeable lack of standardized test datasets and robust evaluation metrics for measuring progress and comparing different approaches effectively. Therefore, future efforts could focus on creating comprehensive and widely accepted benchmarks that can test various aspects of PCS models. Additionally, there is a need to develop metrics that can more accurately reflect both the qualitative and quantitative performance of PCS systems.

VIII Conclusion

This survey has provided a thorough review of personalized content synthesis, with a particular focus on diffusion models. We explore two main frameworks, optimization-based and learning-based methods, and delve into their mechanics. We also cover recent progress in specific customization areas, including object, face, style, video, and 3D synthesis. All covered personalization papers are summarized in Tab. II, Tab. III, Tab. IV, and Tab. V. Beyond these impressive techniques, we highlight several challenges that still need to be addressed, including preventing overfitting, finding the right balance between reconstruction quality and editability, and standardizing evaluation. Through this detailed analysis and targeted recommendations, our survey aims to promote further innovation and collaboration within the PCS community.

TABLE II: Paper summary on personalization.
Paper Goal Architecture Method Backbone
Textual Inversion [9] Personalized Object Generation Optimization-Based Token Embedding Fine-tuning LDM
DreamBooth [10] Personalized Object Generation Optimization-Based Diffusion Model Fine-tuning; Regularization Dataset Imagen
DreamArtist [44] Personalized Object Generation Optimization-Based Negative Prompt Fine-tuning LDM
DVAR [40] Personalized Object Generation Optimization-Based Randomness Erasing Stable Diffusion 1.5
HiPer [143] Personalized Object Generation Optimization-Based Token Embedding Enhancement Stable Diffusion
P+ [13] Personalized Object Generation Optimization-Based Token Embedding Enhancement Stable Diffusion 1.4
Unet-finetune [17] Personalized Object Generation Optimization-Based Parameter-efficient Fine-tuning Stable Diffusion
Jia et al. [125] Personalized Object Generation Optimization-Based Regularization; Mask-assisted Generation Imagen
COTI [130] Personalized Object Generation Optimization-Based Data Augmentation Stable Diffusion 2.0
Gradient-Free TI [47] Personalized Object Generation Optimization-Based Evolutionary Strategy Stable Diffusion
PerSAM [26] Personalized Object Generation Optimization-Based Mask-assisted Generation Stable Diffusion
DisenBooth [127] Personalized Object Generation Optimization-Based Regularization Stable Diffusion 2.1
NeTI [41] Personalized Object Generation Optimization-Based Token Embedding Enhancement Stable Diffusion 1.4
Prospect [42] Personalized Object Generation Optimization-Based Token Embedding Enhancement Stable Diffusion 1.4
Break-a-Scene [28] Personalized Object Generation Optimization-Based Attention-based Operation; Mask-assisted Generation Stable Diffusion 2.1
COMCAT [49] Personalized Object Generation Optimization-Based Attention-based Operation Stable Diffusion 1.4
ViCo [121] Personalized Object Generation Optimization-Based Attention-based Operation; Regularization Stable Diffusion
OFT [144] Personalized Object Generation Optimization-Based Finetuning Strategy Stable Diffusion 1.5
LyCORIS [50] Personalized Object Generation Optimization-Based Attention-based Operation Stable Diffusion 1.5
He et al. [11] Personalized Object Generation Optimization-Based Data Augmentation SD, SDXL
DIFFNAT [138] Personalized Object Generation Optimization-Based Regularization Stable Diffusion
MATTE [43] Personalized Object Generation; Personalized Style Generation; High-level Semantic Personalization Optimization-Based Regularization Stable Diffusion
LEGO [78] Personalized Object Generation Optimization-Based Regularization Stable Diffusion
Catversion [48] Personalized Object Generation Optimization-Based Embedding Concatenation Stable Diffusion 1.5
CLiC [29] Personalized Object Generation Optimization-Based Attention-based Operation; Mask-assisted Generation Stable Diffusion 1.4
HiFi Tuner [27] Personalized Object Generation Optimization-Based Attention-based Operation; Regularization; Mask-assisted Generation Stable Diffusion 1.4
InstructBooth [45] Personalized Object Generation Optimization-Based Fine-tuning based on Reinforcement Learning Stable Diffusion 1.5
DETEX [124] Personalized Object Generation Optimization-Based Mask-assisted Generation Stable Diffusion 1.4
DreamDistribution [145] Personalized Object Generation Optimization-Based Token Embedding Enhancement Stable Diffusion 2.1
DreamTuner [52] Personalized Object Generation Optimization-Based Attention-based Operation Stable Diffusion
Xie et al. [30] Personalized Object Generation Optimization-Based Regularization; Mask-assisted Generation Stable Diffusion
Infusion [146] Personalized Object Generation Optimization-Based Attention-based Operation Stable Diffusion 1.5
Pair Customization [147] Personalized Object Generation Optimization-Based Disentanglement Approach SDXL
DisenDreamer [148] Personalized Object Generation Optimization-Based Disentanglement Approach Stable Diffusion 2.1
Re-Imagen [53] Personalized Object Generation Learning-Based Retrieval-augmented Paradigm Imagen
Versatile Diffusion [149] Personalized Object Generation Learning-Based Architecture Design Stable Diffusion 1.4
Tuning-Encoder [135] Personalized Object Generation Learning-Based Regularization Stable Diffusion
ELITE [39] Personalized Object Generation Learning-Based Attention-based Operation; Mask-assisted Generation Stable Diffusion 1.4
UMM-Diffusion [55] Personalized Object Generation Learning-Based Reference Feature Injection Stable Diffusion 1.5
SuTI [35] Personalized Object Generation Learning-Based Data Augmentation Imagen
InstantBooth [54] Personalized Object Generation Learning-Based Patch Feature Extraction Stable Diffusion 1.4
BLIP-Diffusion [34] Personalized Object Generation Learning-Based Data Augmentation; Mask-assisted Generation Stable Diffusion 1.5
Domain-Agnostic [56] Personalized Object Generation Learning-Based Regularization; Attention-based Operation Stable Diffusion
IP-Adapter [22] Personalized Object Generation Learning-Based Reference Feature Injection Stable Diffusion 1.5
Kosmos-G [59] Personalized Object Generation; Personalized Face Generation Learning-Based Multimodal Language Modeling Stable Diffusion 1.5
Kim et al. [150] Personalized Object Generation Learning-Based Reference Feature Injection Stable Diffusion 1.4
TABLE III: Paper summary on personalization.
Paper Goal Architecture Method Backbone
CAFÉ [58] Personalized Object Generation; Personalized Face Generation Learning-Based Multimodal Language Modeling Stable Diffusion 2.0
SAG [151] Personalized Object Generation; Personalized Face Generation; Personalized Style Generation Learning-Based Gradient-based Guidance Stable Diffusion 1.5
BootPIG [152] Personalized Object Generation Learning-Based Data Augmentation Stable Diffusion
StyleDrop [18] Personalized Style Generation; Multiple Subject Composition Optimization-Based Data Augmentation Muse
StyleBoost [132] Personalized Style Generation Optimization-Based Regularization Dataset Stable Diffusion 1.5
StyleAligned [61] Personalized Style Generation Optimization-Based Regularization SDXL
GAL [153] Personalized Style Generation; Personalized Object Generation Optimization-Based Data Augmentation Stable Diffusion 1.5
StyleForge [154] Personalized Style Generation Optimization-Based Parameter-efficient Fine-tuning Stable Diffusion 1.5
StyleAdapter [62] Personalized Style Generation Learning-Based Data Augmentation; Reference Feature Extraction Stable Diffusion
ArtAdapter [155] Personalized Style Generation Learning-Based Data Augmentation Stable Diffusion 1.5
ProFusion [68] Personalized Face Generation Optimization-Based Fusion Sampling Stable Diffusion 2.0
Celeb Basis [36] Personalized Face Generation Optimization-Based Face Feature Representation Stable Diffusion
HyperDreamBooth [20] Personalized Face Generation Optimization-Based; Learning-Based Pretraining and Fast Fine-tuning Stable Diffusion 1.5
Banerjee et al. [156] Personalized Face Generation Optimization-Based Regularization Stable Diffusion 1.4
Magicapture [51] Personalized Face Generation; Multiple Subject Composition Optimization-Based Attention-based Operation; Mask-assisted Generation; Identity Loss Stable Diffusion 1.5
Concept-centric [142] Personalized Face Generation Optimization-Based Modified Classifier-free Guidance Stable Diffusion 1.5
Cross Initialization [136] Personalized Face Generation Optimization-Based Regularization Stable Diffusion 2.1
OMG [21] Personalized Face Generation; Multiple Subject Composition Optimization-Based Attention-based Operation; Mask-assisted Generation; Identity Loss SDXL
InstantFamily [157] Personalized Face Generation; Multiple Subject Composition; Personalization on Extra Conditions Optimization-Based Mask-assisted Generation Stable Diffusion 1.5
Su et al. [158] Personalized Face Generation Learning-Based Identity Loss, Multi-task Learning Diffusion Autoencoder, StyleGAN
FastComposer [76] Personalized Face Generation Learning-Based Attention-based Operation; Mask-assisted Generation Stable Diffusion 1.5
Face0 [69] Personalized Face Generation Learning-Based Fusion Sampling Stable Diffusion 1.4
DreamIdentity [131] Personalized Face Generation Learning-Based Attention-based Operation Stable Diffusion v2-1-base
Face-Diffuser [129] Personalized Face Generation Learning-Based Attention-based Operation; Mask-assisted Generation Stable Diffusion 1.5
W-Plus-Adapter [70] Personalized Face Generation Learning-Based Face Feature Representation Stable Diffusion 1.5, StyleGAN
Portrait Diffusion [159] Personalized Face Generation Learning-Based Mask-assisted Generation Stable Diffusion 1.5
RetriNet [123] Personalized Face Generation Learning-Based Mask-assisted Generation Stable Diffusion
FaceStudio [72] Personalized Face Generation Learning-Based Attention-based Operation Stable Diffusion
PVA [160] Personalized Face Generation Learning-Based Mask-assisted Generation LDM
DemoCaricature [86] Personalized Face Generation; Personalization on Extra Conditions Learning-Based Regularization Stable Diffusion 1.5
PhotoMaker [33] Personalized Face Generation Learning-Based Data Collection; Reference Feature Extraction SDXL
Stellar [31] Personalized Face Generation Learning-Based Evaluation Prompts and Metrics SDXL
PortraitBooth [73] Personalized Face Generation Learning-Based Attention-based Operation; Mask-assisted Generation Stable Diffusion 1.5
InstantID [32] Personalized Face Generation; Personalization on Extra Conditions Learning-Based Multi-feature Injection Stable Diffusion
ID-Aligner [161] Personalized Face Generation Learning-Based Feedback Learning Stable Diffusion 1.5, SDXL
MoA [162] Personalized Face Generation; Multiple Subject Composition Learning-Based Attention-based Operation; Mask-assisted Generation Stable Diffusion 1.5
IDAdapter [163] Personalized Face Generation Learning-Based Mixed Facial Features Stable Diffusion 2.1
Infinite-ID [164] Personalized Face Generation Learning-Based Reference Feature Injection SDXL
Face2Diffusion [165] Personalized Face Generation Learning-Based Multi-feature Injection Stable Diffusion
TABLE IV: Paper summary on personalization.
Paper Goal Architecture Method Backbone
Custom Diffusion [14] Multiple Subject Composition Optimization-Based Parameter-efficient Fine-tuning; Constrained Optimization for Multi-concept Composition Stable Diffusion 1.4
Cones [75] Multiple Subject Composition Optimization-Based Concept Neurons Activation Stable Diffusion 1.4
SVDiff [16] Multiple Subject Composition Optimization-Based Data Augmentation; Attention-based Operation Stable Diffusion
Perfusion [15] Multiple Subject Composition Optimization-Based Regularization Stable Diffusion 1.5
Mix-of-Show [19] Multiple Subject Composition Optimization-Based Attention-based Operation; Mask-assisted Generation Stable Diffusion 1.5
Cones-2 [120] Multiple Subject Composition Optimization-Based Attention-based Operation; Regularization; Mask-assisted Generation Stable Diffusion v2-1-base
PACGen [128] Multiple Subject Composition Optimization-Based Mask-assisted Generation Stable Diffusion 1.4
Compositional Inversion [119] Multiple Subject Composition Optimization-Based Attention-based Operation; Regularization Stable Diffusion
EM-style [122] Multiple Subject Composition Optimization-Based Attention-based Operation; Mask-assisted Generation Stable Diffusion
MC2 [166] Multiple Subject Composition Optimization-Based Attention-based Operation Stable Diffusion 1.5
MultiBooth [167] Multiple Subject Composition Optimization-Based Mask-assisted Generation Stable Diffusion 1.5
Matsuda et al. [168] Multiple Subject Composition Optimization-Based Mask-assisted Generation LDM
AnyDoor [126] Multiple Subject Composition Learning-Based Mask-assisted Generation Stable Diffusion 2.1
Subject-Diffusion [23] Multiple Subject Composition Learning-Based Attention-based Operation; Data Augmentation; Mask-assisted Generation Stable Diffusion v2-base
CustomNet [87] Multiple Subject Composition; Personalization on Extra Conditions Learning-Based Multi-condition Integration Stable Diffusion
MIGC [169] Multiple Subject Composition Learning-Based Attention-based Operation; Mask-assisted Generation Stable Diffusion
Anti-DreamBooth [80] Attack and Defense Optimization-Based Alternating Surrogate and Perturbation Learning Stable Diffusion 2.1
Concept Censorship [81] Attack and Defense Optimization-Based Trigger Injection Stable Diffusion 1.4
Huang et al. [170] Attack and Defense Optimization-Based Backdoor Attack Stable Diffusion
ADI [79] High-level Semantic Personalization Optimization-Based Mask-assisted Generation Stable Diffusion v2-1-base
ReVersion [77] High-level Semantic Personalization Optimization-Based Regularization Stable Diffusion
Inv-ReVersion [171] High-level Semantic Personalization Optimization-Based Regularization Stable Diffusion 1.5
PhotoSwap [82] Personalization on Extra Conditions Optimization-Based Attention-based Operation Stable Diffusion 2.1
PE-VITON [172] Personalization on Extra Conditions Optimization-Based Shape and Texture Control Stable Diffusion
Layout-Control [85] Personalization on Extra Conditions Optimization-Based Attention-based Operation Stable Diffusion
SwapAnything [173] Personalization on Extra Conditions Optimization-Based Mask-assisted Generation Stable Diffusion 2.1
Viewpoint Control [88] Personalization on Extra Conditions Optimization-Based 3D Feature Incorporation SDXL
Prompt-Free Diffusion [174] Personalization on Extra Conditions Learning-Based Reference Feature Injection Stable Diffusion 2.0
Uni-ControlNet [175] Personalization on Extra Conditions Learning-Based Multi-feature Injection Stable Diffusion
TryonDiffusion [176] Personalization on Extra Conditions Learning-Based Multi-feature Injection Diffusion Model
ViscoNet [177] Personalization on Extra Conditions Learning-Based Multi-feature Injection Stable Diffusion
Context Diffusion [178] Personalization on Extra Conditions Learning-Based Multi-feature Injection LDM
FreeControl [179] Personalization on Extra Conditions Learning-Based Fusion Guidance Stable Diffusion 1.5, 2.1, SDXL
Li et al. [180] Personalization on Extra Conditions Learning-Based Attention-based Operation; Mask-assisted Generation Stable Diffusion
TABLE V: Paper summary on personalization.
Paper Goal Architecture Method Backbone
Tune-A-Video [95] Personalized Video Generation Optimization-Based Parameter-efficient Fine-tuning Stable Diffusion 1.4
Gen-1 [98] Personalized Video Generation Optimization-Based Multi-feature Injection LDM
Make-A-Protagonist [181] Personalized Video Generation Optimization-Based Attention-based Operation LDM
Animate-A-Story [99] Personalized Video Generation Optimization-Based Retrieval-augmented Paradigm Stable Diffusion
MotionDirector [102] Personalized Video Generation Optimization-Based Disentanglement Approach ZeroScope, ModelScope
LAMP [100] Personalized Video Generation Optimization-Based First-frame Conditioned Pipeline SDXL, Stable Diffusion 1.4
VideoDreamer [90] Personalized Video Generation; Multiple Subject Composition Optimization-Based Data Augmentation Stable Diffusion 2.1
VMC [97] Personalized Video Generation Optimization-Based Parameter-efficient Fine-tuning Show-1
SAVE [96] Personalized Video Generation Optimization-Based Attention-based Operation Video Diffusion Model
Customizing Motion [101] Personalized Video Generation Optimization-Based Regularization ZeroScope
DreamVideo [103] Personalized Video Generation Optimization-Based Disentanglement Approach ModelScope
MotionCrafter [182] Personalized Video Generation Optimization-Based Disentanglement Approach Video Diffusion Model
CustomVideo [89] Personalized Video Generation; Multiple Subject Composition Optimization-Based Data Augmentation; Attention-based Operation ZeroScope
Customize-A-Video [183] Personalized Video Generation Optimization-Based Multi-stage Refinement ModelScope
Magic-Me [184] Personalized Video Generation Optimization-Based Multi-stage Refinement Stable Diffusion 1.5, AnimateDiff
VideoAssembler [93] Personalized Video Generation Learning-Based Reference Feature Injection VidRD
VideoBooth [91] Personalized Video Generation Learning-Based Reference Feature Injection Video Diffusion Model
StyleCrafter [92] Personalized Video Generation; Personalized Style Generation Learning-Based Reference Feature Injection Stable Diffusion 1.5
DreaMoving [94] Personalized Video Generation; Personalized Face Generation; Personalization on Extra Conditions Learning-Based Multi-feature Injection Stable Diffusion
ID-Animator [185] Personalized Video Generation; Personalized Face Generation Learning-Based Reference Feature Injection AnimateDiff
Magic3D [105] Personalized 3D Generation Optimization-Based Score Distillation Sampling eDiff-I (Imagen), LDM
DreamBooth3D [108] Personalized 3D Generation Optimization-Based Multi-stage Refinement Imagen
PAS [113] Personalized 3D Generation Learning-Based Text-to-3D-Pose diffusion model
StyleAvatar3D [114] Personalized 3D Generation Learning-Based Multi-view Alignment LDM
AvatarBooth [115] Personalized 3D Generation Optimization-Based Dual Model Fine-tuning Stable Diffusion
MVDream [106] Personalized 3D Generation Optimization-Based Score Distillation Sampling Stable Diffusion v2-1-base
Consist3D [109] Personalized 3D Generation Optimization-Based Token Embedding Enhancement Stable Diffusion
Animate124 [111] Personalized 3D Generation Optimization-Based Multi-stage Refinement Stable Diffusion 1.5, Zero1-to-3-XL
Dream-in-4D [112] Personalized 3D Generation Optimization-Based Multi-stage Refinement Stable Diffusion 2.1, NeRF
TextureDreamer [110] Personalized 3D Generation Optimization-Based Texture Extraction LDM, ControlNet v1.1
TIP-Editor [186] Personalized 3D Generation Optimization-Based Multi-stage Refinement; Mask-assisted Generation Stable Diffusion
Continual Diffusion [117] Others Optimization-Based Regularization Stable Diffusion
SVGCustomization [116] Others Optimization-Based Fine-tuning and Alignment Stable Diffusion 1.5
StitchDiffusion [118] Others Optimization-Based Parameter-efficient Fine-tuning LDM

References

  • [1] T. Wu, S. He, J. Liu, S. Sun, K. Liu, Q.-L. Han, and Y. Tang, “A brief overview of chatgpt: The history, status quo and potential future development,” IEEE/CAA Journal of Automatica Sinica, vol. 10, no. 5, pp. 1122–1136, 2023.
  • [2] F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, “Diffusion models in vision: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [3] L. Yang, Z. Zhang, Y. Song, S. Hong, R. Xu, Y. Zhao, W. Zhang, B. Cui, and M.-H. Yang, “Diffusion models: A comprehensive survey of methods and applications,” ACM Computing Surveys, vol. 56, no. 4, pp. 1–39, 2023.
  • [4] R. Po, W. Yifan, V. Golyanik, K. Aberman, J. T. Barron, A. H. Bermano, E. R. Chan, T. Dekel, A. Holynski, A. Kanazawa et al., “State of the art on diffusion models for visual computing,” arXiv preprint arXiv:2310.07204, 2023.
  • [5] F.-A. Croitoru, V. Hondru, R. T. Ionescu, and M. Shah, “Diffusion models in vision: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.
  • [6] C. Zhang, C. Zhang, M. Zhang, and I. S. Kweon, “Text-to-image diffusion model in generative ai: A survey,” arXiv preprint arXiv:2303.07909, 2023.
  • [7] P. Cao, F. Zhou, Q. Song, and L. Yang, “Controllable generation with text-to-image diffusion models: A survey,” arXiv preprint arXiv:2403.04279, 2024.
  • [8] J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Advances in Neural Information Processing Systems, vol. 33, pp. 6840–6851, 2020.
  • [9] R. Gal, Y. Alaluf, Y. Atzmon, O. Patashnik, A. H. Bermano, G. Chechik, and D. Cohen-Or, “An image is worth one word: Personalizing text-to-image generation using textual inversion,” arXiv preprint arXiv:2208.01618, 2022.
  • [10] N. Ruiz, Y. Li, V. Jampani, Y. Pritch, M. Rubinstein, and K. Aberman, “Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 500–22 510.
  • [11] X. He, Z. Cao, N. Kolkin, L. Yu, H. Rhodin, and R. Kalarot, “A data perspective on enhanced identity preservation for diffusion personalization,” arXiv preprint arXiv:2311.04315, 2023.
  • [12] J. Betker, G. Goh, L. Jing, T. Brooks, J. Wang, L. Li, L. Ouyang, J. Zhuang, J. Lee, Y. Guo, W. Manassra, P. Dhariwal, C. Chu, Y. Jiao, and A. Ramesh, “Improving image generation with better captions.” [Online]. Available: https://api.semanticscholar.org/CorpusID:264403242
  • [13] A. Voynov, Q. Chu, D. Cohen-Or, and K. Aberman, “P+: Extended textual conditioning in text-to-image generation,” arXiv preprint arXiv:2303.09522, 2023.
  • [14] N. Kumari, B. Zhang, R. Zhang, E. Shechtman, and J.-Y. Zhu, “Multi-concept customization of text-to-image diffusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 1931–1941.
  • [15] Y. Tewel, R. Gal, G. Chechik, and Y. Atzmon, “Key-locked rank one editing for text-to-image personalization,” in ACM SIGGRAPH 2023 Conference Proceedings, 2023, pp. 1–11.
  • [16] L. Han, Y. Li, H. Zhang, P. Milanfar, D. Metaxas, and F. Yang, “Svdiff: Compact parameter space for diffusion fine-tuning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7323–7334.
  • [17] C. Xiang, F. Bao, C. Li, H. Su, and J. Zhu, “A closer look at parameter-efficient tuning in diffusion models,” arXiv preprint arXiv:2303.18181, 2023.
  • [18] K. Sohn, L. Jiang, J. Barber, K. Lee, N. Ruiz, D. Krishnan, H. Chang, Y. Li, I. Essa, M. Rubinstein et al., “Styledrop: Text-to-image synthesis of any style,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [19] Y. Gu, X. Wang, J. Z. Wu, Y. Shi, Y. Chen, Z. Fan, W. Xiao, R. Zhao, S. Chang, W. Wu et al., “Mix-of-show: Decentralized low-rank adaptation for multi-concept customization of diffusion models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [20] N. Ruiz, Y. Li, V. Jampani, W. Wei, T. Hou, Y. Pritch, N. Wadhwa, M. Rubinstein, and K. Aberman, “Hyperdreambooth: Hypernetworks for fast personalization of text-to-image models,” arXiv preprint arXiv:2307.06949, 2023.
  • [21] Z. Kong, Y. Zhang, T. Yang, T. Wang, K. Zhang, B. Wu, G. Chen, W. Liu, and W. Luo, “Omg: Occlusion-friendly personalized multi-concept generation in diffusion models,” arXiv preprint arXiv:2403.10983, 2024.
  • [22] H. Ye, J. Zhang, S. Liu, X. Han, and W. Yang, “Ip-adapter: Text compatible image prompt adapter for text-to-image diffusion models,” arXiv preprint arXiv:2308.06721, 2023.
  • [23] J. Ma, J. Liang, C. Chen, and H. Lu, “Subject-diffusion: Open domain personalized text-to-image generation without test-time fine-tuning,” arXiv preprint arXiv:2307.11410, 2023.
  • [24] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark et al., “Learning transferable visual models from natural language supervision,” in International conference on machine learning.   PMLR, 2021, pp. 8748–8763.
  • [25] J. Li, D. Li, C. Xiong, and S. Hoi, “Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation,” in International conference on machine learning.   PMLR, 2022, pp. 12 888–12 900.
  • [26] R. Zhang, Z. Jiang, Z. Guo, S. Yan, J. Pan, H. Dong, P. Gao, and H. Li, “Personalize segment anything model with one shot,” arXiv preprint arXiv:2305.03048, 2023.
  • [27] Z. Wang, W. Wei, Y. Zhao, Z. Xiao, M. Hasegawa-Johnson, H. Shi, and T. Hou, “Hifi tuner: High-fidelity subject-driven fine-tuning for diffusion models,” arXiv preprint arXiv:2312.00079, 2023.
  • [28] O. Avrahami, K. Aberman, O. Fried, D. Cohen-Or, and D. Lischinski, “Break-a-scene: Extracting multiple concepts from a single image,” in SIGGRAPH Asia 2023 Conference Papers, 2023, pp. 1–12.
  • [29] M. Safaee, A. Mikaeili, O. Patashnik, D. Cohen-Or, and A. Mahdavi-Amiri, “Clic: Concept learning in context,” arXiv preprint arXiv:2311.17083, 2023.
  • [30] J. Lu, C. Xie, and H. Guo, “Object-driven one-shot fine-tuning of text-to-image diffusion with prototypical embedding,” arXiv preprint arXiv:2401.15708, 2024.
  • [31] P. Achlioptas, A. Benetatos, I. Fostiropoulos, and D. Skourtis, “Stellar: Systematic evaluation of human-centric personalized text-to-image methods,” arXiv preprint arXiv:2312.06116, 2023.
  • [32] Q. Wang, X. Bai, H. Wang, Z. Qin, and A. Chen, “Instantid: Zero-shot identity-preserving generation in seconds,” arXiv preprint arXiv:2401.07519, 2024.
  • [33] Z. Li, M. Cao, X. Wang, Z. Qi, M.-M. Cheng, and Y. Shan, “Photomaker: Customizing realistic human photos via stacked id embedding,” arXiv preprint arXiv:2312.04461, 2023.
  • [34] D. Li, J. Li, and S. Hoi, “Blip-diffusion: Pre-trained subject representation for controllable text-to-image generation and editing,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [35] W. Chen, H. Hu, Y. Li, N. Ruiz, X. Jia, M.-W. Chang, and W. W. Cohen, “Subject-driven text-to-image generation via apprenticeship learning,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [36] G. Yuan, X. Cun, Y. Zhang, M. Li, C. Qi, X. Wang, Y. Shan, and H. Zheng, “Inserting anybody in diffusion models via celeb basis,” arXiv preprint arXiv:2306.00926, 2023.
  • [37] C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman et al., “Laion-5b: An open large-scale dataset for training next generation image-text models,” Advances in Neural Information Processing Systems, vol. 35, pp. 25 278–25 294, 2022.
  • [38] Y. Zheng, H. Yang, T. Zhang, J. Bao, D. Chen, Y. Huang, L. Yuan, D. Chen, M. Zeng, and F. Wen, “General facial representation learning in a visual-linguistic manner,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 18 697–18 709.
  • [39] Y. Wei, Y. Zhang, Z. Ji, J. Bai, L. Zhang, and W. Zuo, “Elite: Encoding visual concepts into textual embeddings for customized text-to-image generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 943–15 953.
  • [40] A. Voronov, M. Khoroshikh, A. Babenko, and M. Ryabinin, “Is this loss informative? faster text-to-image customization by tracking objective dynamics,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [41] Y. Alaluf, E. Richardson, G. Metzer, and D. Cohen-Or, “A neural space-time representation for text-to-image personalization,” ACM Transactions on Graphics (TOG), vol. 42, no. 6, pp. 1–10, 2023.
  • [42] Y. Zhang, W. Dong, F. Tang, N. Huang, H. Huang, C. Ma, T.-Y. Lee, O. Deussen, and C. Xu, “Prospect: Prompt spectrum for attribute-aware personalization of diffusion models,” ACM Transactions on Graphics (TOG), vol. 42, no. 6, pp. 1–14, 2023.
  • [43] A. Agarwal, S. Karanam, T. Shukla, and B. V. Srinivasan, “An image is worth multiple words: Multi-attribute inversion for constrained text-to-image synthesis,” arXiv preprint arXiv:2311.11919, 2023.
  • [44] Z. Dong, P. Wei, and L. Lin, “Dreamartist: Towards controllable one-shot text-to-image generation via positive-negative prompt-tuning,” arXiv preprint arXiv:2211.11337, 2022.
  • [45] D. Chae, N. Park, J. Kim, and K. Lee, “Instructbooth: Instruction-following personalized text-to-image generation,” arXiv preprint arXiv:2312.03011, 2023.
  • [46] X. Wu, K. Sun, F. Zhu, R. Zhao, and H. Li, “Human preference score: Better aligning text-to-image models with human preference,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2096–2105.
  • [47] Z. Fei, M. Fan, and J. Huang, “Gradient-free textual inversion,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 1364–1373.
  • [48] R. Zhao, M. Zhu, S. Dong, N. Wang, and X. Gao, “Catversion: Concatenating embeddings for diffusion-based text-to-image personalization,” arXiv preprint arXiv:2311.14631, 2023.
  • [49] J. Xiao, M. Yin, Y. Gong, X. Zang, J. Ren, and B. Yuan, “Comcat: towards efficient compression and customization of attention-based vision models,” arXiv preprint arXiv:2305.17235, 2023.
  • [50] S.-Y. Yeh, Y.-G. Hsieh, Z. Gao, B. B. Yang, G. Oh, and Y. Gong, “Navigating text-to-image customization: From lycoris fine-tuning to model evaluation,” arXiv preprint arXiv:2309.14859, 2023.
  • [51] J. Hyung, J. Shin, and J. Choo, “Magicapture: High-resolution multi-concept portrait customization,” arXiv preprint arXiv:2309.06895, 2023.
  • [52] M. Hua, J. Liu, F. Ding, W. Liu, J. Wu, and Q. He, “Dreamtuner: Single image is enough for subject-driven generation,” arXiv preprint arXiv:2312.13691, 2023.
  • [53] W. Chen, H. Hu, C. Saharia, and W. W. Cohen, “Re-imagen: Retrieval-augmented text-to-image generator,” arXiv preprint arXiv:2209.14491, 2022.
  • [54] J. Shi, W. Xiong, Z. Lin, and H. J. Jung, “Instantbooth: Personalized text-to-image generation without test-time finetuning,” arXiv preprint arXiv:2304.03411, 2023.
  • [55] Y. Ma, H. Yang, W. Wang, J. Fu, and J. Liu, “Unified multi-modal latent diffusion for joint subject and text conditional image generation,” arXiv preprint arXiv:2303.09319, 2023.
  • [56] M. Arar, R. Gal, Y. Atzmon, G. Chechik, D. Cohen-Or, A. Shamir, and A. H. Bermano, “Domain-agnostic tuning-encoder for fast personalization of text-to-image models,” in SIGGRAPH Asia 2023 Conference Papers, 2023, pp. 1–10.
  • [57] J. Li, D. Li, S. Savarese, and S. Hoi, “Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models,” in International conference on machine learning.   PMLR, 2023, pp. 19 730–19 742.
  • [58] Y. Zhou, R. Zhang, J. Gu, and T. Sun, “Customization assistant for text-to-image generation,” arXiv preprint arXiv:2312.03045, 2023.
  • [59] X. Pan, L. Dong, S. Huang, Z. Peng, W. Chen, and F. Wei, “Kosmos-g: Generating images in context with multimodal large language models,” arXiv preprint arXiv:2310.02992, 2023.
  • [60] X. Zhang, W. Zhang, X.-Y. Wei, J. Wu, Z. Zhang, Z. Lei, and Q. Li, “Generative active learning for image synthesis personalization,” arXiv preprint arXiv:2403.14987, 2024.
  • [61] A. Hertz, A. Voynov, S. Fruchter, and D. Cohen-Or, “Style aligned image generation via shared attention,” arXiv preprint arXiv:2312.02133, 2023.
  • [62] Z. Wang, X. Wang, L. Xie, Z. Qi, Y. Shan, W. Wang, and P. Luo, “Styleadapter: A single-pass lora-free model for stylized image generation,” arXiv preprint arXiv:2309.01770, 2023.
  • [63] H. Yang, H. Zhu, Y. Wang, M. Huang, Q. Shen, R. Yang, and X. Cao, “Facescape: a large-scale high quality 3d face dataset and detailed riggable 3d face prediction,” in Proceedings of the ieee/cvf conference on computer vision and pattern recognition, 2020, pp. 601–610.
  • [64] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman, “Vggface2: A dataset for recognising faces across pose and age,” in 2018 13th IEEE international conference on automatic face & gesture recognition (FG 2018).   IEEE, 2018, pp. 67–74.
  • [65] Y. Wu and Q. Ji, “Facial landmark detection: A literature survey,” International Journal of Computer Vision, vol. 127, no. 2, pp. 115–142, 2019.
  • [66] K. Khabarlak and L. Koriashkina, “Fast facial landmark detection and applications: A survey,” arXiv preprint arXiv:2101.10808, 2021.
  • [67] I. Adjabi, A. Ouahabi, A. Benzaoui, and A. Taleb-Ahmed, “Past, present, and future of face recognition: A review,” Electronics, vol. 9, no. 8, p. 1188, 2020.
  • [68] Y. Zhou, R. Zhang, T. Sun, and J. Xu, “Enhancing detail preservation for customized text-to-image generation: A regularization-free approach,” arXiv preprint arXiv:2305.13579, 2023.
  • [69] D. Valevski, D. Lumen, Y. Matias, and Y. Leviathan, “Face0: Instantaneously conditioning a text-to-image model on a face,” in SIGGRAPH Asia 2023 Conference Papers, 2023, pp. 1–10.
  • [70] X. Li, X. Hou, and C. C. Loy, “When stylegan meets stable diffusion: a w+ adapter for personalized image generation,” arXiv preprint arXiv:2311.17461, 2023.
  • [71] T. Karras, S. Laine, and T. Aila, “A style-based generator architecture for generative adversarial networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 4401–4410.
  • [72] Y. Yan, C. Zhang, R. Wang, Y. Zhou, G. Zhang, P. Cheng, G. Yu, and B. Fu, “Facestudio: Put your face everywhere in seconds,” arXiv preprint arXiv:2312.02663, 2023.
  • [73] X. Peng, J. Zhu, B. Jiang, Y. Tai, D. Luo, J. Zhang, W. Lin, T. Jin, C. Wang, and R. Ji, “Portraitbooth: A versatile portrait model for fast identity-preserved personalization,” arXiv preprint arXiv:2312.06354, 2023.
  • [74] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “Lora: Low-rank adaptation of large language models,” arXiv preprint arXiv:2106.09685, 2021.
  • [75] Z. Liu, R. Feng, K. Zhu, Y. Zhang, K. Zheng, Y. Liu, D. Zhao, J. Zhou, and Y. Cao, “Cones: Concept neurons in diffusion models for customized generation,” arXiv preprint arXiv:2303.05125, 2023.
  • [76] G. Xiao, T. Yin, W. T. Freeman, F. Durand, and S. Han, “Fastcomposer: Tuning-free multi-subject image generation with localized attention,” arXiv preprint arXiv:2305.10431, 2023.
  • [77] Z. Huang, T. Wu, Y. Jiang, K. C. Chan, and Z. Liu, “Reversion: Diffusion-based relation inversion from images,” arXiv preprint arXiv:2303.13495, 2023.
  • [78] S. Motamed, D. P. Paudel, and L. Van Gool, “Lego: Learning to disentangle and invert concepts beyond object appearance in text-to-image diffusion models,” arXiv preprint arXiv:2311.13833, 2023.
  • [79] S. Huang, B. Gong, Y. Feng, X. Chen, Y. Fu, Y. Liu, and D. Wang, “Learning disentangled identifiers for action-customized text-to-image generation,” arXiv preprint arXiv:2311.15841, 2023.
  • [80] T. Van Le, H. Phung, T. H. Nguyen, Q. Dao, N. N. Tran, and A. Tran, “Anti-dreambooth: Protecting users from personalized text-to-image synthesis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2116–2127.
  • [81] J. Zhang, F. Kerschbaum, T. Zhang et al., “Backdooring textual inversion for concept censorship,” arXiv preprint arXiv:2308.10718, 2023.
  • [82] J. Gu, Y. Wang, N. Zhao, T.-J. Fu, W. Xiong, Q. Liu, Z. Zhang, H. Zhang, J. Zhang, H. Jung et al., “Photoswap: Personalized subject swapping in images,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [83] J. Song, C. Meng, and S. Ermon, “Denoising diffusion implicit models,” arXiv preprint arXiv:2010.02502, 2020.
  • [84] T. Islam, A. Miron, X. Liu, and Y. Li, “Deep learning in virtual try-on: A comprehensive survey,” IEEE Access, 2024.
  • [85] M. Chen, I. Laina, and A. Vedaldi, “Training-free layout control with cross-attention guidance,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 5343–5353.
  • [86] D.-Y. Chen, S. Koley, A. Sain, P. N. Chowdhury, T. Xiang, A. K. Bhunia, and Y.-Z. Song, “Democaricature: Democratising caricature generation with a rough sketch,” arXiv preprint arXiv:2312.04364, 2023.
  • [87] Z. Yuan, M. Cao, X. Wang, Z. Qi, C. Yuan, and Y. Shan, “Customnet: Zero-shot object customization with variable-viewpoints in text-to-image diffusion models,” arXiv preprint arXiv:2310.19784, 2023.
  • [88] N. Kumari, G. Su, R. Zhang, T. Park, E. Shechtman, and J.-Y. Zhu, “Customizing text-to-image diffusion with camera viewpoint control,” arXiv preprint arXiv:2404.12333, 2024.
  • [89] Z. Wang, A. Li, E. Xie, L. Zhu, Y. Guo, Q. Dou, and Z. Li, “Customvideo: Customizing text-to-video generation with multiple subjects,” arXiv preprint arXiv:2401.09962, 2024.
  • [90] H. Chen, X. Wang, G. Zeng, Y. Zhang, Y. Zhou, F. Han, and W. Zhu, “Videodreamer: Customized multi-subject text-to-video generation with disen-mix finetuning,” arXiv preprint arXiv:2311.00990, 2023.
  • [91] Y. Jiang, T. Wu, S. Yang, C. Si, D. Lin, Y. Qiao, C. C. Loy, and Z. Liu, “Videobooth: Diffusion-based video generation with image prompts,” arXiv preprint arXiv:2312.00777, 2023.
  • [92] G. Liu, M. Xia, Y. Zhang, H. Chen, J. Xing, X. Wang, Y. Yang, and Y. Shan, “Stylecrafter: Enhancing stylized text-to-video generation with style adapter,” arXiv preprint arXiv:2312.00330, 2023.
  • [93] H. Zhao, T. Lu, J. Gu, X. Zhang, Z. Wu, H. Xu, and Y.-G. Jiang, “Videoassembler: Identity-consistent video generation with reference entities using diffusion model,” arXiv preprint arXiv:2311.17338, 2023.
  • [94] M. Feng, J. Liu, K. Yu, Y. Yao, Z. Hui, X. Guo, X. Lin, H. Xue, C. Shi, X. Li et al., “Dreamoving: A human video generation framework based on diffusion models,” arXiv e-prints, pp. arXiv–2312, 2023.
  • [95] J. Z. Wu, Y. Ge, X. Wang, S. W. Lei, Y. Gu, Y. Shi, W. Hsu, Y. Shan, X. Qie, and M. Z. Shou, “Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7623–7633.
  • [96] Y. Song, W. Shin, J. Lee, J. Kim, and N. Kwak, “Save: Protagonist diversification with structure agnostic video editing,” arXiv preprint arXiv:2312.02503, 2023.
  • [97] H. Jeong, G. Y. Park, and J. C. Ye, “Vmc: Video motion customization using temporal attention adaption for text-to-video diffusion models,” arXiv preprint arXiv:2312.00845, 2023.
  • [98] P. Esser, J. Chiu, P. Atighehchian, J. Granskog, and A. Germanidis, “Structure and content-guided video synthesis with diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7346–7356.
  • [99] Y. He, M. Xia, H. Chen, X. Cun, Y. Gong, J. Xing, Y. Zhang, X. Wang, C. Weng, Y. Shan et al., “Animate-a-story: Storytelling with retrieval-augmented video generation,” arXiv preprint arXiv:2307.06940, 2023.
  • [100] R. Wu, L. Chen, T. Yang, C. Guo, C. Li, and X. Zhang, “Lamp: Learn a motion pattern for few-shot-based video generation,” arXiv preprint arXiv:2310.10769, 2023.
  • [101] J. Materzynska, J. Sivic, E. Shechtman, A. Torralba, R. Zhang, and B. Russell, “Customizing motion in text-to-video diffusion models,” arXiv preprint arXiv:2312.04966, 2023.
  • [102] R. Zhao, Y. Gu, J. Z. Wu, D. J. Zhang, J. Liu, W. Wu, J. Keppo, and M. Z. Shou, “Motiondirector: Motion customization of text-to-video diffusion models,” arXiv preprint arXiv:2310.08465, 2023.
  • [103] Y. Wei, S. Zhang, Z. Qing, H. Yuan, Z. Liu, Y. Liu, Y. Zhang, J. Zhou, and H. Shan, “Dreamvideo: Composing your dream videos with customized subject and motion,” arXiv preprint arXiv:2312.04433, 2023.
  • [104] B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
  • [105] C.-H. Lin, J. Gao, L. Tang, T. Takikawa, X. Zeng, X. Huang, K. Kreis, S. Fidler, M.-Y. Liu, and T.-Y. Lin, “Magic3d: High-resolution text-to-3d content creation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 300–309.
  • [106] Y. Shi, P. Wang, J. Ye, M. Long, K. Li, and X. Yang, “Mvdream: Multi-view diffusion for 3d generation,” arXiv preprint arXiv:2308.16512, 2023.
  • [107] B. Poole, A. Jain, J. T. Barron, and B. Mildenhall, “Dreamfusion: Text-to-3d using 2d diffusion,” arXiv preprint arXiv:2209.14988, 2022.
  • [108] A. Raj, S. Kaza, B. Poole, M. Niemeyer, N. Ruiz, B. Mildenhall, S. Zada, K. Aberman, M. Rubinstein, J. Barron et al., “Dreambooth3d: Subject-driven text-to-3d generation,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2349–2359.
  • [109] Y. Ouyang, W. Chai, J. Ye, D. Tao, Y. Zhan, and G. Wang, “Chasing consistency in text-to-3d generation from a single image,” arXiv preprint arXiv:2309.03599, 2023.
  • [110] Y.-Y. Yeh, J.-B. Huang, C. Kim, L. Xiao, T. Nguyen-Phuoc, N. Khan, C. Zhang, M. Chandraker, C. S. Marshall, Z. Dong et al., “Texturedreamer: Image-guided texture synthesis through geometry-aware diffusion,” arXiv preprint arXiv:2401.09416, 2024.
  • [111] Y. Zhao, Z. Yan, E. Xie, L. Hong, Z. Li, and G. H. Lee, “Animate124: Animating one image to 4d dynamic scene,” arXiv preprint arXiv:2311.14603, 2023.
  • [112] Y. Zheng, X. Li, K. Nagano, S. Liu, O. Hilliges, and S. De Mello, “A unified approach for text-and image-guided 4d scene generation,” arXiv preprint arXiv:2311.16854, 2023.
  • [113] S. Azadi, T. Hayes, A. Shah, G. Pang, D. Parikh, and S. Gupta, “Text-conditional contextualized avatars for zero-shot personalization,” arXiv preprint arXiv:2304.07410, 2023.
  • [114] C. Zhang, Y. Chen, Y. Fu, Z. Zhou, G. Yu, B. Wang, B. Fu, T. Chen, G. Lin, and C. Shen, “Styleavatar3d: Leveraging image-text diffusion models for high-fidelity 3d avatar generation,” arXiv preprint arXiv:2305.19012, 2023.
  • [115] Y. Zeng, Y. Lu, X. Ji, Y. Yao, H. Zhu, and X. Cao, “Avatarbooth: High-quality and customizable 3d human avatar generation,” arXiv preprint arXiv:2306.09864, 2023.
  • [116] P. Zhang, N. Zhao, and J. Liao, “Text-guided vector graphics customization,” in SIGGRAPH Asia 2023 Conference Papers, 2023, pp. 1–11.
  • [117] J. S. Smith, Y.-C. Hsu, L. Zhang, T. Hua, Z. Kira, Y. Shen, and H. Jin, “Continual diffusion: Continual customization of text-to-image diffusion with c-lora,” arXiv preprint arXiv:2304.06027, 2023.
  • [118] H. Wang, X. Xiang, Y. Fan, and J.-H. Xue, “Customizing 360-degree panoramas through text-to-image diffusion models,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 4933–4943.
  • [119] X. Zhang, X.-Y. Wei, J. Wu, T. Zhang, Z. Zhang, Z. Lei, and Q. Li, “Compositional inversion for stable diffusion models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 7, 2024, pp. 7350–7358.
  • [120] Z. Liu, Y. Zhang, Y. Shen, K. Zheng, K. Zhu, R. Feng, Y. Liu, D. Zhao, J. Zhou, and Y. Cao, “Cones 2: Customizable image synthesis with multiple subjects,” arXiv preprint arXiv:2305.19327, 2023.
  • [121] S. Hao, K. Han, S. Zhao, and K.-Y. K. Wong, “Vico: Plug-and-play visual condition for personalized text-to-image generation,” 2023.
  • [122] T. Rahman, S. Mahajan, H.-Y. Lee, J. Ren, S. Tulyakov, and L. Sigal, “Visual concept-driven image generation with text-to-image diffusion model,” arXiv preprint arXiv:2402.11487, 2024.
  • [123] H. Tang, X. Zhou, J. Deng, Z. Pan, H. Tian, and P. Chaudhari, “Retrieving conditions from reference images for diffusion models,” arXiv preprint arXiv:2312.02521, 2023.
  • [124] Y. Cai, Y. Wei, Z. Ji, J. Bai, H. Han, and W. Zuo, “Decoupled textual embeddings for customized image generation,” arXiv preprint arXiv:2312.11826, 2023.
  • [125] X. Jia, Y. Zhao, K. C. Chan, Y. Li, H. Zhang, B. Gong, T. Hou, H. Wang, and Y.-C. Su, “Taming encoder for zero fine-tuning image customization with text-to-image diffusion models,” arXiv preprint arXiv:2304.02642, 2023.
  • [126] X. Chen, L. Huang, Y. Liu, Y. Shen, D. Zhao, and H. Zhao, “Anydoor: Zero-shot object-level image customization,” arXiv preprint arXiv:2307.09481, 2023.
  • [127] H. Chen, Y. Zhang, S. Wu, X. Wang, X. Duan, Y. Zhou, and W. Zhu, “Disenbooth: Identity-preserving disentangled tuning for subject-driven text-to-image generation,” in The Twelfth International Conference on Learning Representations, 2023.
  • [128] Y. Li, H. Liu, Y. Wen, and Y. J. Lee, “Generate anything anywhere in any scene,” arXiv preprint arXiv:2306.17154, 2023.
  • [129] Y. Wang, W. Zhang, J. Zheng, and C. Jin, “High-fidelity person-centric subject-to-image synthesis,” arXiv preprint arXiv:2311.10329, 2023.
  • [130] J. Yang, H. Wang, R. Xiao, S. Wu, G. Chen, and J. Zhao, “Controllable textual inversion for personalized text-to-image generation,” arXiv preprint arXiv:2304.05265, 2023.
  • [131] Z. Chen, S. Fang, W. Liu, Q. He, M. Huang, Y. Zhang, and Z. Mao, “Dreamidentity: Improved editability for efficient face-identity preserved image generation,” arXiv preprint arXiv:2307.00300, 2023.
  • [132] J. Park, B. Ko, and H. Jang, “Styleboost: A study of personalizing text-to-image generation in any style using dreambooth,” in 2023 14th International Conference on Information and Communication Technology Convergence (ICTC).   IEEE, 2023, pp. 93–98.
  • [133] X.-Y. Wei and Z.-Q. Yang, “Coaching the exploration and exploitation in active learning for interactive video retrieval,” IEEE Transactions on Image Processing, vol. 22, no. 3, pp. 955–968, 2012.
  • [134] ——, “Coached active learning for interactive video search,” in Proceedings of the 19th ACM international conference on Multimedia, 2011, pp. 443–452.
  • [135] R. Gal, M. Arar, Y. Atzmon, A. H. Bermano, G. Chechik, and D. Cohen-Or, “Encoder-based domain tuning for fast personalization of text-to-image models,” ACM Transactions on Graphics (TOG), vol. 42, no. 4, pp. 1–13, 2023.
  • [136] L. Pang, J. Yin, H. Xie, Q. Wang, Q. Li, and X. Mao, “Cross initialization for personalized text-to-image generation,” arXiv preprint arXiv:2312.15905, 2023.
  • [137] X. Zhang and S. Lyu, “Using projection kurtosis concentration of natural images for blind noise covariance matrix estimation,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 2870–2876.
  • [138] A. Roy, M. Suin, A. Shah, K. Shah, J. Liu, and R. Chellappa, “Diffnat: Improving diffusion image quality using natural image statistics,” arXiv preprint arXiv:2311.09753, 2023.
  • [139] M. J. Shensa et al., “The discrete wavelet transform: Wedding the à trous and Mallat algorithms,” IEEE Transactions on Signal Processing, vol. 40, no. 10, pp. 2464–2482, 1992.
  • [140] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter, “Gans trained by a two time-scale update rule converge to a local nash equilibrium,” Advances in neural information processing systems, vol. 30, 2017.
  • [141] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, “Improved techniques for training gans,” Advances in neural information processing systems, vol. 29, 2016.
  • [142] P. Cao, L. Yang, F. Zhou, T. Huang, and Q. Song, “Concept-centric personalization with large-scale diffusion priors,” arXiv preprint arXiv:2312.08195, 2023.
  • [143] I. Han, S. Yang, T. Kwon, and J. C. Ye, “Highly personalized text embedding for image manipulation by stable diffusion,” arXiv preprint arXiv:2303.08767, 2023.
  • [144] Z. Qiu, W. Liu, H. Feng, Y. Xue, Y. Feng, Z. Liu, D. Zhang, A. Weller, and B. Schölkopf, “Controlling text-to-image diffusion by orthogonal finetuning,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [145] B. N. Zhao, Y. Xiao, J. Xu, X. Jiang, Y. Yang, D. Li, L. Itti, V. Vineet, and Y. Ge, “Dreamdistribution: Prompt distribution learning for text-to-image diffusion models,” arXiv preprint arXiv:2312.14216, 2023.
  • [146] W. Zeng, Y. Yan, Q. Zhu, Z. Chen, P. Chu, W. Zhao, and X. Yang, “Infusion: Preventing customized text-to-image diffusion from overfitting,” arXiv preprint arXiv:2404.14007, 2024.
  • [147] M. Jones, S.-Y. Wang, N. Kumari, D. Bau, and J.-Y. Zhu, “Customizing text-to-image models with a single image pair,” arXiv preprint arXiv:2405.01536, 2024.
  • [148] H. Chen, Y. Zhang, X. Wang, X. Duan, Y. Zhou, and W. Zhu, “Disendreamer: Subject-driven text-to-image generation with sample-aware disentangled tuning,” IEEE Transactions on Circuits and Systems for Video Technology, 2024.
  • [149] X. Xu, Z. Wang, G. Zhang, K. Wang, and H. Shi, “Versatile diffusion: Text, images and variations all in one diffusion model,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7754–7765.
  • [150] M. Kim, J. Yoo, and S. Kwon, “Personalized text-to-image model enhancement strategies: Sod preprocessing and cnn local feature integration,” Electronics, vol. 12, no. 22, p. 4707, 2023.
  • [151] J. Pan, H. Yan, J. H. Liew, J. Feng, and V. Y. Tan, “Towards accurate guided diffusion sampling through symplectic adjoint method,” arXiv preprint arXiv:2312.12030, 2023.
  • [152] S. Purushwalkam, A. Gokul, S. Joty, and N. Naik, “Bootpig: Bootstrapping zero-shot personalized image generation capabilities in pretrained diffusion models,” arXiv preprint arXiv:2401.13974, 2024.
  • [153] X. Zhang, W. Zhang, X.-Y. Wei, J. Wu, Z. Zhang, Z. Lei, and Q. Li, “Generative active learning for image synthesis personalization,” arXiv preprint arXiv:2403.14987, 2024.
  • [154] J. Park, B. Ko, and H. Jang, “Text-to-image synthesis for any artistic styles: Advancements in personalized artistic image generation via subdivision and dual binding,” arXiv preprint arXiv:2404.05256, 2024.
  • [155] D.-Y. Chen, H. Tennent, and C.-W. Hsu, “Artadapter: Text-to-image style transfer using multi-level style encoder and explicit adaptation,” arXiv preprint arXiv:2312.02109, 2023.
  • [156] S. Banerjee, G. Mittal, A. Joshi, C. Hegde, and N. Memon, “Identity-preserving aging of face images via latent diffusion models,” in 2023 IEEE International Joint Conference on Biometrics (IJCB).   IEEE, 2023, pp. 1–10.
  • [157] C. Kim, J. Lee, S. Joung, B. Kim, and Y.-M. Baek, “Instantfamily: Masked attention for zero-shot multi-id image generation,” arXiv preprint arXiv:2404.19427, 2024.
  • [158] Y.-C. Su, K. C. Chan, Y. Li, Y. Zhao, H. Zhang, B. Gong, H. Wang, and X. Jia, “Identity encoder for personalized diffusion,” arXiv preprint arXiv:2304.07429, 2023.
  • [159] J. Liu, H. Huang, C. Jin, and R. He, “Portrait diffusion: Training-free face stylization with chain-of-painting,” arXiv preprint arXiv:2312.02212, 2023.
  • [160] J. Xu, S. Motamed, P. Vaddamanu, C. H. Wu, C. Haene, J.-C. Bazin, and F. De la Torre, “Personalized face inpainting with diffusion models by parallel visual attention,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 5432–5442.
  • [161] W. Chen, J. Zhang, J. Wu, H. Wu, X. Xiao, and L. Lin, “Id-aligner: Enhancing identity-preserving text-to-image generation with reward feedback learning,” arXiv preprint arXiv:2404.15449, 2024.
  • [162] D. Ostashev, Y. Fang, S. Tulyakov, K. Aberman et al., “Moa: Mixture-of-attention for subject-context disentanglement in personalized image generation,” arXiv preprint arXiv:2404.11565, 2024.
  • [163] S. Cui, J. Deng, J. Guo, X. An, Y. Zhao, X. Wei, and Z. Feng, “Idadapter: Learning mixed features for tuning-free personalization of text-to-image models,” arXiv preprint arXiv:2403.13535, 2024.
  • [164] Y. Wu, Z. Li, H. Zheng, C. Wang, and B. Li, “Infinite-id: Identity-preserved personalization via id-semantics decoupling paradigm,” arXiv preprint arXiv:2403.11781, 2024.
  • [165] K. Shiohara and T. Yamasaki, “Face2diffusion for fast and editable face personalization,” arXiv preprint arXiv:2403.05094, 2024.
  • [166] J. Jiang, Y. Zhang, K. Feng, X. Wu, and W. Zuo, “Mc2: Multi-concept guidance for customized multi-concept generation,” arXiv preprint arXiv:2404.05268, 2024.
  • [167] C. Zhu, K. Li, Y. Ma, C. He, and L. Xiu, “Multibooth: Towards generating all your concepts in an image from text,” arXiv preprint arXiv:2404.14239, 2024.
  • [168] H. Matsuda, R. Togo, K. Maeda, T. Ogawa, and M. Haseyama, “Multi-object editing in personalized text-to-image diffusion model via segmentation guidance,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2024, pp. 8140–8144.
  • [169] D. Zhou, Y. Li, F. Ma, Z. Yang, and Y. Yang, “Migc: Multi-instance generation controller for text-to-image synthesis,” arXiv preprint arXiv:2402.05408, 2024.
  • [170] Y. Huang, F. Juefei-Xu, Q. Guo, J. Zhang, Y. Wu, M. Hu, T. Li, G. Pu, and Y. Liu, “Personalization as a shortcut for few-shot backdoor attack against text-to-image diffusion models,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 19, 2024, pp. 21 169–21 178.
  • [171] G. Zhang, Y. Qian, J. Deng, and X. Cai, “Inv-reversion: Enhanced relation inversion based on text-to-image diffusion models,” Applied Sciences, vol. 14, no. 8, p. 3338, 2024.
  • [172] S. Zhang, M. Ni, L. Wang, W. Ding, S. Chen, and Y. Liu, “A two-stage personalized virtual try-on framework with shape control and texture guidance,” arXiv preprint arXiv:2312.15480, 2023.
  • [173] J. Gu, Y. Wang, N. Zhao, W. Xiong, Q. Liu, Z. Zhang, H. Zhang, J. Zhang, H. Jung, and X. E. Wang, “Swapanything: Enabling arbitrary object swapping in personalized visual editing,” arXiv preprint arXiv:2404.05717, 2024.
  • [174] X. Xu, J. Guo, Z. Wang, G. Huang, I. Essa, and H. Shi, “Prompt-free diffusion: Taking ‘text’ out of text-to-image diffusion models,” arXiv preprint arXiv:2305.16223, 2023.
  • [175] S. Zhao, D. Chen, Y.-C. Chen, J. Bao, S. Hao, L. Yuan, and K.-Y. K. Wong, “Uni-controlnet: All-in-one control to text-to-image diffusion models,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  • [176] L. Zhu, D. Yang, T. Zhu, F. Reda, W. Chan, C. Saharia, M. Norouzi, and I. Kemelmacher-Shlizerman, “Tryondiffusion: A tale of two unets,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4606–4615.
  • [177] S. Y. Cheong, A. Mustafa, and A. Gilbert, “Visconet: Bridging and harmonizing visual and textual conditioning for controlnet,” arXiv preprint arXiv:2312.03154, 2023.
  • [178] I. Najdenkoska, A. Sinha, A. Dubey, D. Mahajan, V. Ramanathan, and F. Radenovic, “Context diffusion: In-context aware image generation,” arXiv preprint arXiv:2312.03584, 2023.
  • [179] S. Mo, F. Mu, K. H. Lin, Y. Liu, B. Guan, Y. Li, and B. Zhou, “Freecontrol: Training-free spatial control of any text-to-image diffusion model with any condition,” arXiv preprint arXiv:2312.07536, 2023.
  • [180] P. Li, Q. Nie, Y. Chen, X. Jiang, K. Wu, Y. Lin, Y. Liu, J. Peng, C. Wang, and F. Zheng, “Tuning-free image customization with image and text guidance,” arXiv preprint arXiv:2403.12658, 2024.
  • [181] Y. Zhao, E. Xie, L. Hong, Z. Li, and G. H. Lee, “Make-a-protagonist: Generic video editing with an ensemble of experts,” arXiv preprint arXiv:2305.08850, 2023.
  • [182] Y. Zhang, F. Tang, N. Huang, H. Huang, C. Ma, W. Dong, and C. Xu, “Motioncrafter: One-shot motion customization of diffusion models,” arXiv preprint arXiv:2312.05288, 2023.
  • [183] Y. Ren, Y. Zhou, J. Yang, J. Shi, D. Liu, F. Liu, M. Kwon, and A. Shrivastava, “Customize-a-video: One-shot motion customization of text-to-video diffusion models,” arXiv preprint arXiv:2402.14780, 2024.
  • [184] Z. Ma, D. Zhou, C.-H. Yeh, X.-S. Wang, X. Li, H. Yang, Z. Dong, K. Keutzer, and J. Feng, “Magic-me: Identity-specific video customized diffusion,” arXiv preprint arXiv:2402.09368, 2024.
  • [185] X. He, Q. Liu, S. Qian, X. Wang, T. Hu, K. Cao, K. Yan, M. Zhou, and J. Zhang, “Id-animator: Zero-shot identity-preserving human video generation,” arXiv preprint arXiv:2404.15275, 2024.
  • [186] J. Zhuang, D. Kang, Y.-P. Cao, G. Li, L. Lin, and Y. Shan, “Tip-editor: An accurate 3d editor following both text-prompts and image-prompts,” arXiv preprint arXiv:2401.14828, 2024.