
MRStyle: A Unified Framework for Color Style Transfer with
Multi-Modality Reference

Jiancheng Huang1\equalcontrib, Yu Gao1\equalcontrib, Zequn Jie1, Yujie Zhong1, Xintong Han2, Lin Ma1 (corresponding author)
Abstract

In this paper, we introduce MRStyle, a comprehensive framework that enables color style transfer using multi-modality reference, including image and text. To achieve a unified style feature space for both modalities, we first develop a neural network called IRStyle, which generates stylized 3D lookup tables for image reference. This is accomplished by integrating an interaction dual-mapping network with a combined supervised learning pipeline, resulting in three key benefits: elimination of visual artifacts, efficient handling of high-resolution images with low memory usage, and maintenance of style consistency even in situations with significant color style variations. For text reference, we align the text feature of stable diffusion priors with the style feature of our IRStyle to perform text-guided color style transfer (TRStyle). Our TRStyle method is highly efficient in both training and inference, producing notable open-set text-guided transfer results. Extensive experiments in both image and text settings demonstrate that our proposed method outperforms the state-of-the-art in both qualitative and quantitative evaluations.

Figure 1: Our multi-modality reference color style transfer results. (a) State-of-the-art methods DNCM (Ke et al. 2023) and CAP-VST (Wen, Gao, and Zou 2023) often produce artifacts (e.g., texture in red box) or unsatisfactory colors (e.g., content in blue box). In contrast, our method avoids artifacts and achieves better color transfer effects. (b) Our method can produce amazing stylized results on 8K images given text reference. Zoom in for better visualization.
footnotetext: Equal contribution.  Corresponding author.

1 Introduction

With the surge in popularity of short video and photo-sharing platforms, many users aspire to customize their photo/video’s color style, including brightness, hue, and saturation, before sharing. Existing photo editing software provides expert-defined image filters or lookup tables (LUTs) for color adjustments. However, these pre-set filters/LUTs cannot meet all users’ aesthetic needs and limit user flexibility. To mitigate these limitations, researchers have introduced color style transfer techniques.

Image-guided color style transfer is the most common color style transfer task. It requires the stylized image to align with the reference style image in overall color style while maintaining the content and texture of the original content image. Current image-guided color transfer methods often fail to produce satisfactory transfer results (Fig. 1 (a)). Many recent deep learning-based approaches are built on an encoder-decoder structure with feature whitening and coloring (WCT) operations (Luan et al. 2017; Li et al. 2018; Yoo et al. 2019; An et al. 2020; Ho and Zhou 2021; Chiu and Gurari 2022b). These methods often produce unrealistic artifacts in the output image due to their reliance on an encoder-decoder structure for stylized image generation. Additionally, they have difficulty handling high-resolution images due to their huge network memory usage. An alternative approach, mapping-based methods (Lin et al. 2023; Ke et al. 2023), addresses these issues by applying a predicted color mapping matrix to the original image for color style transfer, instead of using the encoder-decoder pattern. Nevertheless, these methods may fail to deliver satisfactory color transfer effects between images with very different inherent colors, due to the model’s structure and training paradigm.

The task of text-guided color style transfer has been recently proposed, as finding a reference style image that meets personal requirements can be challenging and impractical. Existing text-guided methods (Bau et al. 2021; Patashnik et al. 2021; Kwon and Ye 2022; Shi et al. 2022) require either expensive paired data gathering or time-consuming online optimization for each content and style. With the development of diffusion models, some image editing methods based on diffusion models can also perform text-guided style transfer (Huang et al. 2024; Meng et al. 2021; Brooks, Holynski, and Efros 2023; Huberman-Spiegelglas, Kulikov, and Michaeli 2023). However, their stylization quality is not guaranteed as they are not designed for color style transfer.

All the methods mentioned above consider only one modality. Compared to image prompts, text prompts are more user-friendly and flexible but provide less intuitive style information. Consequently, a natural idea arises: a unified framework for color style transfer that accepts either text or image prompts. The key to this idea is how to unify the style information from text and images into a common space. To achieve this, we first train an image-guided color transfer model, and then align the text features from the stable diffusion priors (Rombach et al. 2022) with the color style features of the pre-trained image-guided model. To the best of our knowledge, our method is the first work that can utilize either an image or a text prompt as the reference for color style transfer.

For image reference, we propose a novel image reference method for color style transfer named IRStyle. Firstly, to avoid artifacts and ensure low memory usage for high-resolution inputs, we follow the mapping-based methods, adopting a simple 3D LUT (Zeng et al. 2020; Cong et al. 2022) to perform the color transfer. Secondly, to preserve style similarity, we introduce an interaction dual-mapping network. Additionally, a hybrid training pipeline combining paired and unpaired supervision is designed, which improves the style similarity metrics.

For text reference, we introduce a lightweight network to align the text features from the pre-trained stable diffusion with the style feature of our IRStyle. Since there are no public datasets available for training, we further design a cost-efficient method for data collection with the help of ChatGPT (Ouyang et al. 2022) and the stable diffusion model (Rombach et al. 2022). Leveraging the prior of the pre-trained stable diffusion model, our method can conduct open-set text-guided style transfer. In addition, due to the excellent performance of our IRStyle, our method achieves impressive style transfer results (Fig. 1 (b)). Furthermore, our model structure and data collection strategy ensure high efficiency in both training and testing. The main contributions are summarized as follows:

  • We propose a generic multi-modality reference color style transfer architecture named MRStyle, which accepts prompts from either images or text as references.

  • For image reference, we propose a stylized LUTs generation method (IRStyle). Our method can eliminate artifacts, handle high-resolution images with low memory usage, and preserve style uniformity for images with significantly different inherent colors.

  • For text reference, we fully exploit priors from the pre-trained stable diffusion model and our IRStyle to design the text-guided color style transfer model (TRStyle). Our model operates efficiently in both training and inference, and generates notable open-set transfer results.

  • Comprehensive evaluations demonstrate that MRStyle outperforms state-of-the-art methods significantly.

2 Related Works

Image Reference Transfer. Image reference color style transfer is the process of color style transition between images. This task can be divided into WCT-based methods (Li et al. 2017; Yoo et al. 2019; Qiao et al. 2021; Ho and Zhou 2021; Li et al. 2018; Chiu and Gurari 2022b; Wen, Gao, and Zou 2023; Xiaopeng Sun and Fan 2021) and mapping-based methods (Lin et al. 2023; Ke et al. 2023; Chen et al. 2023). WCT (Li et al. 2017) first uses the feature whitening and coloring operation with an encoder-decoder structure to complete the color transfer. CAP-VST (Wen, Gao, and Zou 2023) utilizes a similar pipeline with a new effective reversible residual network and an unbiased WCT. Due to the decoding process, they inevitably suffer from visual artifacts and vast memory requirements. In contrast, mapping-based methods (Lin et al. 2023; Ke et al. 2023; Chen et al. 2023) address these problems by using a low-resolution input to predict a color mapping matrix that is applied to the original image. AdaCM (Lin et al. 2023) proposes a network that directly predicts the matrix. DNCM (Ke et al. 2023) further decouples the process into color normalization and stylization. Nonetheless, their color similarity performance remains unsatisfactory. In this work, we propose IRStyle to improve the mapping-based methods by introducing an interaction dual-mapping network and a combined supervised learning pipeline. In addition to image reference, our method can also utilize text as the reference.


Figure 2: The overview of MRStyle. (a) The inference pipeline of MRStyle. (b) The combined supervised training pipeline of IRStyle. Paired supervised losses are $\mathcal{L}_{self}$ and $\mathcal{L}_{cm}$. Unpaired supervised losses are $\mathcal{L}_{style}$, $\mathcal{L}_{hist}$ and $\mathcal{L}_{content}$. The interaction dual-mapping network is detailed in Fig. 3 (c). After the training of IRStyle, we integrate the pre-trained IRStyle with stable diffusion priors to finalize our TRStyle. (c) The training pipeline of TRStyle.

Text Reference Transfer. Text reference color style transfer aims at adapting the content image to the color style described by the provided text. With the development of vision-language pre-training models (Radford et al. 2021; Li et al. 2022; Alayrac et al. 2022; Singh et al. 2022), the information of text and images can be well aligned into a unified space. SpaceEdit (Shi et al. 2022) performs supervised language-guided image editing, which requires costly paired training data and cannot be applied to open-set scenes. Several works (Bau et al. 2021; Patashnik et al. 2021; Kwon and Ye 2022) have demonstrated the capability of performing open-set text-guided style transfer. They achieve this by leveraging CLIP (Radford et al. 2021) to explore the desired style space during each inference. Nevertheless, this online optimization process incurs significant time consumption during inference, rendering it impractical for real-world applications. As generative models develop rapidly, image editing, including stylization, has seen considerable improvement. Many image editing methods based on diffusion models have been proposed, such as SDEdit (Meng et al. 2021), EditAnything (Gao et al. 2023), InstructPix2Pix (Brooks, Holynski, and Efros 2023), and MGIE (Fu et al. 2023). However, they primarily concentrate on image editing rather than color style transfer, which can result in content distortion or subpar color outcomes. Text-guided colorization (Weng et al. 2024; Huang, Zhao, and Liao 2022; Zabari et al. 2023), another color-related topic, primarily aims to convert grayscale images into visually pleasing colorful ones, with its text often focusing on object-level color descriptions. Conversely, text-guided color style transfer is primarily concerned with the color style transfer of photorealistic color images, with its text more concentrated on global color style descriptions. In this study, we explore how to inject the pre-trained diffusion priors into open-set text-guided color style transfer.

3 Method

The overview of MRStyle is shown in Fig. 2. It can conduct color style transfer using either image or text reference in a unified framework. To achieve this, we align the features of reference images and texts into a unified style feature space. Specifically, we first train IRStyle via a neural LUT network (Sec. 3.1), introducing an interaction dual-mapping network and a combined supervised learning pipeline. Then, we employ synthetic text-image pairs to train a feature mapper, which projects text features from the pre-trained Stable Diffusion priors into the style space of IRStyle (Sec. 3.2). During inference, when accepting an image or text as the style input, MRStyle extracts the style features of the corresponding modality and then interacts them with the features of the content image for color transfer (Fig. 2 (a)).

3.1 Image Reference Color Style Transfer

Interaction Dual-Mapping Network


Figure 3: Architectural designs for image reference color style transfer. The feature extraction of input images has been omitted for simplicity. (a) Interaction direct-mapping network, e.g., AdaCM (Lin et al. 2023). (b) Non-interaction dual-mapping network, e.g., DNCM (Ke et al. 2023). (c) Interaction dual-mapping network (ours).

Following mapping-based methods, we design a neural network to generate color mapping matrices and apply them to the original image to complete the color mapping. For simplicity, we consider the combined 3D-LUTs as the mapping matrices. The computational complexity of the 3D-LUT is O(1) per input pixel, requiring only a few floating-point operations, making it extremely fast even at high resolutions. Moreover, when the video scene shows minimal changes, the 3D-LUTs predicted for the initial frame can be applied to subsequent frames, significantly reducing the computational cost. We posit that our method is not limited to 3D-LUTs; other color mapping matrices (e.g., JBL (Xia et al. 2020), DNCM (Ke et al. 2023)) are feasible as well.
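To make the mapping concrete, the snippet below sketches how a predicted 3D-LUT can be applied to a full-resolution image via trilinear interpolation with `grid_sample`. The LUT memory layout, the grid size D = 17, and the use of PyTorch are our own assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def apply_3d_lut(image: torch.Tensor, lut: torch.Tensor) -> torch.Tensor:
    """Map an RGB image through a 3D LUT with trilinear interpolation.

    image: (B, 3, H, W), values in [0, 1].
    lut:   (B, 3, D, D, D); lut[n, :, i, j, k] is the output colour for the
           input colour whose (r, g, b) grid indices are (i, j, k).
           This layout and the grid size D are assumptions of this sketch.
    """
    # grid_sample samples a 5D volume with normalised (x, y, z) coordinates
    # indexing its (W, H, D) axes, so (r, g, b) must be flipped to (b, g, r).
    grid = image.permute(0, 2, 3, 1) * 2.0 - 1.0        # (B, H, W, 3) in [-1, 1]
    grid = grid.flip(-1).unsqueeze(1)                   # (B, 1, H, W, 3)
    out = F.grid_sample(lut, grid, mode="bilinear",     # "bilinear" = trilinear for 5D input
                        padding_mode="border", align_corners=True)
    return out.squeeze(2)                               # (B, 3, H, W)

# Sanity check: an identity LUT leaves the image unchanged.
D = 17
ramp = torch.linspace(0, 1, D)
r, g, b = torch.meshgrid(ramp, ramp, ramp, indexing="ij")
identity_lut = torch.stack([r, g, b], dim=0).unsqueeze(0)   # (1, 3, D, D, D)
img = torch.rand(1, 3, 64, 64)
assert torch.allclose(apply_3d_lut(img, identity_lut), img, atol=1e-4)
```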

Here, we describe our network design in detail. Given an original content image $\mathbf{I_c}$ and a reference style image $\mathbf{I_s}$, both of shape $(3,h,w)$, we downsample them to obtain two thumbnails $\mathbf{\tilde{I}_c}$ and $\mathbf{\tilde{I}_s}$. Then, we feed $\mathbf{\tilde{I}_c}$ and $\mathbf{\tilde{I}_s}$ into a shared encoder $E$ to extract the content features $\mathbf{F_c}$ and style features $\mathbf{F_s}$. To obtain the stylized result $\mathbf{Y}$, we compare three possible architectural designs, as shown in Fig. 3.

a) Interaction direct-mapping network. As depicted in Fig. 3 (a), we simply use an interaction module to merge $\mathbf{F_c}$ and $\mathbf{F_s}$. Then, a predictor $\mathbf{Pred}$ is employed to obtain a direct transfer LUT $\mathbf{L}$. Finally, we directly map the original image $\mathbf{I_c}$ to the final result $\mathbf{Y}$ through $\mathbf{L}$. This network is similar to AdaCM (Lin et al. 2023). However, directly using a single LUT for color transfer can be difficult when the color styles of $\mathbf{I_c}$ and $\mathbf{I_s}$ differ greatly, as shown in Fig. 6 (e).

b) Non-interaction dual-mapping network. As shown in Fig. 3 (b), we employ the content predictor $\mathbf{Pred_c}$ on $\mathbf{F_c}$ to obtain the content LUT $\mathbf{L_c}$, and the style predictor $\mathbf{Pred_s}$ on $\mathbf{F_s}$ to acquire the style LUT $\mathbf{L_s}$. Then we execute a dual mapping (i.e., content extraction followed by stylization) to obtain the final outcome. First, $\mathbf{L_c}$ is applied to $\mathbf{I_c}$ to get the content map $\mathbf{M_c}$, representing the style-free content of $\mathbf{I_c}$. Second, $\mathbf{L_s}$ is applied to $\mathbf{M_c}$ to get the result $\mathbf{Y}$. This network is similar to DNCM (Ke et al. 2023). The predictions of $\mathbf{L_c}$ and $\mathbf{L_s}$ are independent, without interaction. Thus, it requires either the content LUT to normalize all images to a common content space, or the style LUT to transform arbitrary contents, both of which are challenging, as shown in Fig. 6 (a).

c) Interaction dual-mapping network. As previously discussed, we believe that the interaction between content and style features is crucial. Furthermore, explicitly decomposing the transfer process into content extraction and stylization can enhance the final results. Thus, we incorporate both benefits into our final network design, as depicted in Fig. 3 (c). Specifically, we use the VGG encoder to extract content features $\mathbf{F_c}$ and style features $\mathbf{F_s}$ at four scales, where $\mathbf{F_s}=\{\mathbf{F_{s,i}}\}_{1\leq i\leq 4}$ and $\mathbf{F_c}=\{\mathbf{F_{c,i}}\}_{1\leq i\leq 4}$. These multi-scale features interact with each other through AdaIN (Huang and Belongie 2017). After the interaction, they are downsampled to a uniform scale and concatenated. Subsequently, these features are fed into the content predictor $\mathbf{Pred_c}$ to generate the content LUT $\mathbf{L_c}$ of $\mathbf{I_c}$, and into the style predictor $\mathbf{Pred_s}$ to obtain the style LUT $\mathbf{L_s}$ of $\mathbf{I_s}$. Each predictor is composed of four convolution blocks. Finally, we apply $\mathbf{L_c}$ to $\mathbf{I_c}$ to get the content map $\mathbf{M_c}$, and then apply $\mathbf{L_s}$ to $\mathbf{M_c}$ to get the final result $\mathbf{Y}$. A minimal sketch of this forward pass is given below.
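The following PyTorch sketch illustrates one possible realization of the interaction dual-mapping design, reusing the `apply_3d_lut` helper above. The VGG cut points, the AdaIN-based interaction, the pooling size, channel widths, and LUT size are illustrative assumptions; the authors' actual implementation may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg19

def adain(content, style, eps=1e-5):
    """AdaIN (Huang and Belongie 2017): re-normalise content features with
    the channel-wise mean/std of the style features."""
    c_mean, c_std = content.mean((2, 3), keepdim=True), content.std((2, 3), keepdim=True) + eps
    s_mean, s_std = style.mean((2, 3), keepdim=True), style.std((2, 3), keepdim=True) + eps
    return s_std * (content - c_mean) / c_std + s_mean

class LUTPredictor(nn.Module):
    """Four conv blocks plus a linear head that regresses 3D-LUT entries.
    Channel widths and LUT size are illustrative assumptions."""
    def __init__(self, in_ch, lut_size=17):
        super().__init__()
        self.lut_size = lut_size
        chs = [in_ch, 256, 128, 64, 32]
        layers = []
        for i in range(4):
            layers += [nn.Conv2d(chs[i], chs[i + 1], 3, padding=1), nn.ReLU(inplace=True)]
        self.blocks = nn.Sequential(*layers)
        self.head = nn.Linear(32, 3 * lut_size ** 3)

    def forward(self, x):
        x = self.blocks(x).mean((2, 3))                          # global average pooling
        return self.head(x).view(-1, 3, self.lut_size, self.lut_size, self.lut_size)

class IRStyleSketch(nn.Module):
    def __init__(self, feat_dim=64 + 128 + 256 + 512, lut_size=17):
        super().__init__()
        vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
        # Cut points roughly at relu1_1 / relu2_1 / relu3_1 / relu4_1 (an assumption).
        self.slices = nn.ModuleList([vgg[:2], vgg[2:7], vgg[7:12], vgg[12:21]])
        for p in self.slices.parameters():
            p.requires_grad_(False)
        self.pred_c = LUTPredictor(feat_dim, lut_size)           # Pred_c
        self.pred_s = LUTPredictor(feat_dim, lut_size)           # Pred_s

    def encode(self, x):
        feats = []
        for s in self.slices:
            x = s(x)                                             # ImageNet normalisation omitted for brevity
            feats.append(x)
        return feats

    def forward(self, I_c, I_s, thumb=256):
        t_c = F.interpolate(I_c, (thumb, thumb), mode="bilinear", align_corners=False)
        t_s = F.interpolate(I_s, (thumb, thumb), mode="bilinear", align_corners=False)
        F_c, F_s = self.encode(t_c), self.encode(t_s)
        inter = [adain(fc, fs) for fc, fs in zip(F_c, F_s)]      # per-scale interaction
        pooled = torch.cat([F.adaptive_avg_pool2d(f, 16) for f in inter], dim=1)
        L_c, L_s = self.pred_c(pooled), self.pred_s(pooled)      # content / style LUTs
        M_c = apply_3d_lut(I_c, L_c)                             # style-free content map
        return apply_3d_lut(M_c, L_s)                            # stylised result Y
```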

The Combined Supervised Learning Pipeline

A combined learning pipeline is designed to train with paired and unpaired supervision as shown in Fig. 2 (b).

For paired supervision, we randomly apply two filters to a source image, resulting in two images with identical content but different color styles, denoted as $\mathbf{I_1}$ and $\mathbf{I_2}$. First, we take $\mathbf{I_1}$ as the content image and $\mathbf{I_2}$ as the style image. The desired stylized result $\mathbf{Y_1}$ of $\mathbf{I_1}$ should be $\mathbf{I_2}$ itself. Hence, an MSE loss between $\mathbf{Y_1}$ and $\mathbf{I_2}$ is adopted, referred to as $\mathcal{L}_{self}$. Second, we use $\mathbf{I_1}$ and $\mathbf{I_2}$ as content images for model training, respectively. Images with the same content should yield identical content maps, i.e., the content map $\mathbf{M_{c,1}}$ of $\mathbf{I_1}$ and the content map $\mathbf{M_{c,2}}$ of $\mathbf{I_2}$ should be identical. An MSE loss between $\mathbf{M_{c,1}}$ and $\mathbf{M_{c,2}}$ is used, denoted as $\mathcal{L}_{cm}$. The paired supervised losses are:

$$\mathcal{L}_{pair}=\mathcal{L}_{self}(\mathbf{Y_1},\mathbf{I_2})+\mathcal{L}_{cm}(\mathbf{M_{c,1}},\mathbf{M_{c,2}}) \qquad (1)$$

The benefit of the paired supervision technique lies in its capacity to obtain ground truth in a self-supervised manner. However, it is inconsistent with the real inference situation, where there is typically a difference in content between the style and content images. To address this, we introduce unpaired supervision, where the content image $\mathbf{I_2}$ and style image $\mathbf{I_3}$ are derived from different images. Since there is no ground truth for unpaired supervision, we utilize loss functions commonly employed in other color transfer methods (Li et al. 2017; Yoo et al. 2019; Li et al. 2018; Chiu and Gurari 2022b), including the style similarity loss $\mathcal{L}_{style}$ and the structural content loss $\mathcal{L}_{content}$. $\mathcal{L}_{style}$ is implemented as the MSE between the means and standard deviations of shallow feature maps extracted from a pre-trained VGG-Net, while $\mathcal{L}_{content}$ is implemented as the MSE between the deep feature maps. However, $\mathcal{L}_{style}$ generally reflects similarity in the high-level semantic feature space rather than color information (Ke et al. 2023). Therefore, we additionally employ a more interpretable loss function $\mathcal{L}_{hist}$ following (Huang et al. 2023), which measures the distance between soft color histograms. We denote the resulting image as $\mathbf{Y_2}$. The unpaired supervised losses are as follows:

$$\mathcal{L}_{unpair}=\mathcal{L}_{content}(\mathbf{Y_2},\mathbf{I_2})+\mathcal{L}_{style}(\mathbf{Y_2},\mathbf{I_3})+\mathcal{L}_{hist}(\mathbf{Y_2},\mathbf{I_3}) \qquad (2)$$

During training, both the paired and unpaired supervised losses are applied to each sample. A sketch of this combined objective is given below.
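The following is a minimal sketch of the combined objective. It assumes a model interface that also returns the content map (the `return_content_map` flag and the `vgg_encode` helper are hypothetical), and uses a Gaussian-kernel soft histogram as one possible form of $\mathcal{L}_{hist}$ in the spirit of (Huang et al. 2023); loss weights are omitted.

```python
import torch
import torch.nn.functional as F

def mean_std(feat, eps=1e-5):
    return feat.mean((2, 3)), feat.std((2, 3)) + eps

def soft_histogram(img, bins=64, sigma=0.02):
    """Differentiable per-channel colour histogram (one possible L_hist form)."""
    centers = torch.linspace(0.0, 1.0, bins, device=img.device)     # (bins,)
    x = img.flatten(2).unsqueeze(-1)                                 # (B, 3, HW, 1)
    w = torch.exp(-((x - centers) ** 2) / (2 * sigma ** 2))          # (B, 3, HW, bins)
    hist = w.sum(dim=2)
    return hist / (hist.sum(dim=-1, keepdim=True) + 1e-8)

def combined_losses(model, vgg_encode, I1, I2, I3):
    """(I1, I2): two filtered versions of the same photo (paired branch);
    I3: an unrelated style image (unpaired branch). Sketch interfaces only."""
    # Paired branch: stylising I1 towards I2 should reproduce I2 exactly,
    # and both images should yield the same style-free content map.
    Y1, M_c1 = model(I1, I2, return_content_map=True)   # hypothetical flag
    _,  M_c2 = model(I2, I1, return_content_map=True)
    L_self = F.mse_loss(Y1, I2)
    L_cm = F.mse_loss(M_c1, M_c2)

    # Unpaired branch: no ground truth, so use style / content / histogram terms.
    Y2, _ = model(I2, I3, return_content_map=True)
    f_y, f_s, f_c = vgg_encode(Y2), vgg_encode(I3), vgg_encode(I2)
    L_style = 0.0
    for fy, fs in zip(f_y[:2], f_s[:2]):                 # shallow scales -> colour statistics
        (my, sy), (ms, ss) = mean_std(fy), mean_std(fs)
        L_style = L_style + F.mse_loss(my, ms) + F.mse_loss(sy, ss)
    L_content = F.mse_loss(f_y[-1], f_c[-1])             # deep features -> structure
    L_hist = F.l1_loss(soft_histogram(Y2), soft_histogram(I3))

    return (L_self + L_cm) + (L_content + L_style + L_hist)
```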

3.2 Text Reference Color Style Transfer

A Vanilla Text-Guided Way

With the rapid development of text-to-image diffusion models, the stable diffusion model can produce high-quality images according to a text prompt. Therefore, a vanilla way to achieve text reference color style transfer would be to directly use the pre-trained stable diffusion model to generate a reference style image from the provided style text. Then, given the generated style image, we can use our pre-trained IRStyle to obtain the result, as sketched below.
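A rough sketch of this vanilla baseline with the diffusers library is shown here. The model id, the example prompt, the file name, and the `IRStyleSketch` stand-in for a pre-trained IRStyle are assumptions for illustration only.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionPipeline
from torchvision.transforms.functional import to_tensor

# 1) Synthesise a reference style image from the style text.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
style_text = "a moody teal-and-orange cinematic colour grade"        # example prompt (ours)
I_s = pipe(style_text, num_inference_steps=50).images[0]             # PIL style image

# 2) Run image-guided transfer with a pre-trained IRStyle model.
I_c = Image.open("content.jpg").convert("RGB")                        # user's content photo
irstyle = IRStyleSketch().eval().cuda()                               # sketch model stand-in
with torch.no_grad():
    Y = irstyle(to_tensor(I_c).unsqueeze(0).cuda(),
                to_tensor(I_s).unsqueeze(0).cuda())                   # stylised result
```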

Although this solution can accomplish the task, it has two main drawbacks. Firstly, it is time-consuming as the generation of a style image requires multiple steps within the diffusion process. Secondly, the entire process of text-guided color transfer is not end-to-end. Thus, a faster and more elegant alternative needs to be presented.


Figure 4: The color and illumination of the generated image in the diffusion model. (a) The denoising process during image generation. (b) The color similarity analysis under different time steps of diffusion model sampling.

Efficient Priors Feature Mapper

The Stable Diffusion model (Rombach et al. 2022) accomplishes text-to-image generation via a U-Net structure, characterized by a step-by-step denoising process. To ensure that the final generated image is semantically consistent with the given text, the model computes cross-attention between the text embeddings and U-Net features at every step. This suggests that the internal representations of the U-Net features could be well-associated with language-describable semantic concepts, and thus can be exploited to guide the color style transfer. An interesting observation, as shown in Fig. 4 (a), is that the color and illumination of the generated results are decided during the early stage of denoising. To support this finding, we calculate the VGG style loss between the predicted result at each time step and the final image over 100 examples. As shown in Fig. 4 (b), there is only a small difference between the initial and final stages of the denoising process (as stated in (Ke et al. 2023), a loss below 5 usually indicates high similarity). Therefore, we might be able to use the early, or even the first-step, features to guide the color transfer.
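This measurement can be roughly reproduced with diffusers as sketched below: at every DDIM step the scheduler's predicted clean latent (`pred_original_sample`) is decoded and compared with the final image. Classifier-free guidance is omitted for brevity, the prompt is our own example, and `vgg_style_loss` is a hypothetical Gram-based helper rather than released code.

```python
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)
pipe.scheduler.set_timesteps(50)

tokens = pipe.tokenizer("a warm golden-hour landscape", padding="max_length",
                        max_length=pipe.tokenizer.model_max_length,
                        return_tensors="pt").input_ids.to("cuda")
prompt_emb = pipe.text_encoder(tokens)[0]

latents = torch.randn(1, 4, 64, 64, device="cuda") * pipe.scheduler.init_noise_sigma
previews = []
with torch.no_grad():
    for t in pipe.scheduler.timesteps:
        latent_in = pipe.scheduler.scale_model_input(latents, t)
        noise_pred = pipe.unet(latent_in, t, encoder_hidden_states=prompt_emb).sample
        out = pipe.scheduler.step(noise_pred, t, latents)
        latents = out.prev_sample
        x0 = out.pred_original_sample                     # predicted clean latent at step t
        img = pipe.vae.decode(x0 / pipe.vae.config.scaling_factor).sample
        previews.append(img.clamp(-1, 1))

final = previews[-1]
gaps = [vgg_style_loss(p, final) for p in previews]       # hypothetical Gram-based helper
```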

Our method requires only one forward pass of the diffusion model in the whole process. We design an efficient priors feature mapper that maps the stable diffusion priors to the reference style features of our IRStyle. The mapper consists of four different convolution blocks (Goodfellow et al. 2014). Given a random noise latent $\tilde{\mathbf{z}}_T\sim\mathcal{N}(\mathbf{0},\mathbf{I})$ at timestep $T$ and a style text prompt $\mathbf{T_s}$, we extract features from four different layers of the U-Net decoder, denoted as $\mathbf{F_t}=\{\mathbf{F_{t,i}}\}_{1\leq i\leq 4}$. Then, we use the mapper to project $\mathbf{F_t}$ into the corresponding style reference space at each scale, denoted as $\mathbf{F_m}=\{\mathbf{F_{m,i}}\}_{1\leq i\leq 4}$. We subsequently replace $\mathbf{F_s}$ in IRStyle with $\mathbf{F_m}$ for the text-guided color transfer.
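A sketch of the decoder-feature extraction and the mapper is given below. The hook-based capture of the four U-Net up-block outputs and the mapper's channel widths are our assumptions (SD v1-5 up blocks output 1280/1280/640/320 channels; the target widths mirror the VGG style features of the IRStyle sketch), not the released implementation.

```python
import torch
import torch.nn as nn

class PriorsFeatureMapper(nn.Module):
    """Maps four U-Net decoder features F_t to IRStyle's style-feature space F_m."""
    def __init__(self, in_chs=(1280, 1280, 640, 320), out_chs=(512, 256, 128, 64)):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(ci, co, 3, padding=1), nn.ReLU(inplace=True),
                          nn.Conv2d(co, co, 3, padding=1))
            for ci, co in zip(in_chs, out_chs)])          # four convolution blocks

    def forward(self, F_t):
        return [blk(f) for blk, f in zip(self.blocks, F_t)]   # F_m = {F_m,i}, i = 1..4

def extract_unet_decoder_features(pipe, prompt_emb, latents, t):
    """One U-Net forward pass; forward hooks capture each up-block output."""
    feats = []
    hooks = [blk.register_forward_hook(lambda m, i, o: feats.append(o))
             for blk in pipe.unet.up_blocks]
    with torch.no_grad():
        latent_in = pipe.scheduler.scale_model_input(latents, t)
        pipe.unet(latent_in, t, encoder_hidden_states=prompt_emb)
    for h in hooks:
        h.remove()
    return feats                                          # F_t = {F_t,i}, i = 1..4
```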

Training with Synthetic Data. The whole training process is shown in Fig. 2 (c). Since there are no public datasets that can be used for training, we design a highly cost-efficient synthetic data generation scheme. Firstly, we use ChatGPT to generate a style text prompt $\mathbf{T_s}$, and then feed $\mathbf{T_s}$ into the stable diffusion model to generate the corresponding style image $\mathbf{I_s}$. Secondly, given the content image $\mathbf{I_c}$ and $\mathbf{I_s}$, we feed them into our trained IRStyle to generate the result $\mathbf{I_g}$. Finally, we use $(\mathbf{I_c},\mathbf{T_s},\mathbf{I_g})$ as one training sample, where $\mathbf{I_c}$ and $\mathbf{T_s}$ are the inputs and $\mathbf{I_g}$ is the ground truth. We denote the result of TRStyle as $\mathbf{Y}$. An MSE loss between $\mathbf{Y}$ and $\mathbf{I_g}$ is adopted, denoted as $\mathcal{L}_{teach}$. During training, only the mapper is trainable; all other components are frozen.
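The triplet construction can be sketched as follows, assuming the ChatGPT-generated prompts have already been collected into a text file. All file and directory names, as well as the `IRStyleSketch` teacher, are illustrative assumptions.

```python
import json
import torch
from pathlib import Path
from PIL import Image
from diffusers import StableDiffusionPipeline
from torchvision.transforms.functional import to_tensor
from torchvision.utils import save_image

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16).to("cuda")
irstyle = IRStyleSketch().eval().cuda()                   # stand-in for the trained IRStyle teacher

prompts = Path("prompts.txt").read_text().splitlines()    # ChatGPT-generated style texts T_s
content_paths = sorted(Path("coco_images").glob("*.jpg")) # content images I_c
Path("triplets").mkdir(exist_ok=True)

records = []
for i, (T_s, c_path) in enumerate(zip(prompts, content_paths)):
    I_s = pipe(T_s, num_inference_steps=50).images[0]                   # style image from T_s
    I_c = to_tensor(Image.open(c_path).convert("RGB")).unsqueeze(0).cuda()
    with torch.no_grad():
        I_g = irstyle(I_c, to_tensor(I_s).unsqueeze(0).cuda())          # teacher output = ground truth
    save_image(I_g, f"triplets/{i:06d}_gt.png")
    records.append({"content": str(c_path), "text": T_s, "gt": f"triplets/{i:06d}_gt.png"})
Path("triplets/index.json").write_text(json.dumps(records, indent=2))
```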

Discussion. Directly using CLIP features (Radford et al. 2021) to guide the color style transfer is another choice. The reasons behind our choice of stable diffusion features over CLIP features are discussed in the appendix.

4 Experiments

We evaluate our image style transfer in the image reference setting (Sec. 4.1) and the text reference setting (Sec. 4.2). Evaluations on video style transfer are provided in the appendix.

Figure 5: Qualitative comparison of the image reference setting. Our method shows the superiority in both photorealism and stylization over the other methods.
Table 1: Quantitative comparison of the image reference setting. The best and second best are in bold and underlined.
Method PhotoNAS PhotoWCT2 PCA-KD DNCM CAP-VST Ours
Style Gram loss \downarrow 2.9132 1.8485 2.2991 3.4972 0.9310 1.2707
Content SSIM \uparrow 0.7132 0.7136 0.7138 0.7777 0.7618 0.7913
Style score \uparrow 0.8167 0.7873 0.8134 0.7325 0.8647 0.8262
User score \uparrow 2.43 2.59 2.70 2.97 3.35 3.61

4.1 Image Reference Experiments

Comparisons with Other Methods

We compare IRStyle on the reference image color transfer task with recent methods: PhotoWCT2 (Chiu and Gurari 2022b), PhotoNAS (An et al. 2020), PCA-KD (Chiu and Gurari 2022a), CAP-VST (Wen, Gao, and Zou 2023), and DNCM (Ke et al. 2023). We use their publicly available pre-trained models and code for evaluation. We do not compare the time cost with DNCM since it only provides online demos.

Qualitative Results. Fig. 5 shows the visual comparison with other methods. We can see, in most cases, PhotoWCT2, PCA-KD, and PhotoNAS exhibit significant visual noise and loss of detail. While CAP-VST mitigates this problem, it still presents visual noise (e.g., the sky in Fig. 5 (d)). DNCM eliminates visual artifacts but struggles to preserve color similarity (e.g., the tree in Fig. 5 (b)). Compared with the existing methods, our method faithfully maintains image details and delivers superior stylization results. Besides, our method ensures consistent image stylization without artifacts, even with significant variations in color style within the input (e.g., Fig. 5 (a)).

Table 2: Comparison on GPU Inference Time / Memory, and Model Size. All evaluations are conducted with float32 model precision on a Tesla V100 GPU (32GB memory). The units “s”, “GB”, and “M” refer to seconds, gigabytes, and millions, respectively. “OOM” indicates out of memory.
Resolution PhotoNAS PhotoWCT2 PCA-KD
FHD (1920×1080) 0.59s / 15.6GB 0.3s / 14GB 0.05s / 7GB
2K (2560×1440) 0.99s / 23.9GB 0.45s / 20GB 0.06s / 11GB
4K (3840×2160) OOM 1s / 23.8GB 0.1s / 16GB
8K (7680×4320) OOM OOM OOM
Model Size (M) 40.2M 7M 73K
Resolution CAP-VST DNCM Ours
FHD (1920×1080) 1.09s / 24GB - / 1.96GB 0.019s / 3GB
2K (2560×1440) 1.1s / 24GB - / 1.96GB 0.019s / 3GB
4K (3840×2160) 1.1s / 24GB - / 1.96GB 0.021s / 3GB
8K (7680×4320) 1.1s / 24GB - / 1.96GB 0.022s / 3GB
Model Size (M) 4M 5.15M 24M

Quantitative Results. Following previous work (Chiu and Gurari 2022a; Wen, Gao, and Zou 2023; Ke et al. 2023), we employ three metrics for evaluation: Style Gram loss (Gatys, Ecker, and Bethge 2016) and Style score (Ke et al. 2023) to measure style similarity, and Content SSIM (Ke et al. 2023) to measure content similarity. Table 1 shows that our IRStyle provides the best trade-off between content and style similarity. Although CAP-VST achieves higher scores in style similarity, its visual artifacts lead to worse visual quality (e.g., Fig. 5 (d)). More visualizations are shown in the appendix.

User Study. We further conduct a user study to evaluate the subjective quality of different methods. We invite 40 users and show them 20 randomly selected images from the test set, each consisting of an input image, a reference style image, and 6 randomly shuffled transfer results. Participants are requested to rate the overall stylization quality of the transfer results on a scale of 1 to 5, mainly focusing on aspects such as style and content similarity, photorealism, and the visual appeal of the color style. After collecting these results, we calculate the average score for each method. Table 1 suggests that our method is predominantly favored by users. Detailed analyses are provided in the appendix.

Table 3: Ablation study of IRStyle.
Type Style Gram loss \downarrow Style score \uparrow Content SSIM \uparrow
w/o interaction 3.598 0.7745 0.7547
w/o $\mathcal{L}_{pair}$ 2.096 0.8158 0.6678
w/o $\mathcal{L}_{unpair}$ 1.853 0.8193 0.770
w/o $\mathcal{L}_{hist}$ 1.385 0.8196 0.7720
w/o dual-mapping 1.452 0.8294 0.7890
Full version 1.270 0.8262 0.7913
Figure 6: Visualization results of IRStyle’s ablation study. These visual results are consistent with those in Table 3.
Figure 7: Qualitative comparison results of text reference setting. Our method can generate natural stylized images by open-set text reference prompts and preserve the original image texture without any finetuning.

Inference Time and Memory. As shown in Table 2, on FHD, 2K, 4K, and 8K images, our method has the fastest inference speed in all settings, achieving nearly a 3× speedup over the fastest state-of-the-art method, PCA-KD. Moreover, the time cost of our IRStyle is essentially insensitive to practical resolutions. Most WCT-based methods (PhotoWCT2, PhotoNAS, and PCA-KD) demand a significant amount of memory, leading to out-of-memory issues on high-resolution images, even on GPUs with 32GB of memory. In contrast, CAP-VST uniformly resizes inputs to 1280×960 to reduce the memory footprint, but at the cost of sharpness, which contradicts the purpose of using 2K and 4K images. DNCM performs better in terms of parameters and memory. We attribute this to its color mapping matrix, which is not the focus of our research; we also implement this color transfer matrix to achieve a similar model size and competitive results.

Ablation Studies

In this part, we conduct a systematic empirical study on our IRStyle.

Interaction. We construct the proposed IRStyle without the interaction module (Fig. 3 (b)). The "w/o interaction" row of Table 3 shows that removing the interaction module significantly decreases style similarity and slightly affects content similarity. Fig. 6 also highlights the importance of feature interaction between the content and style images.

Dual-mapping. We construct the proposed IRStyle using the direct-mapping architecture (Fig. 3 (a)). The "w/o dual-mapping" row of Table 3 indicates that the dual-mapping design enhances both style and content similarity. The visualization in Fig. 6 further validates this conclusion.

Supervision Functions. We validate the effectiveness of each of our proposed supervision terms, including $\mathcal{L}_{pair}$, $\mathcal{L}_{unpair}$ and $\mathcal{L}_{hist}$. Both Table 3 and Fig. 6 confirm the significance of each term, i.e., employing all supervisions yields the optimal color style transfer results.

Figure 8: Ablation study of TRStyle.

4.2 Text Reference Results

Comparison with Other Methods

We compare our TRStyle with four state-of-the-art methods, including SDEdit (Meng et al. 2021), InstructPix2Pix (Brooks, Holynski, and Efros 2023), MGIE (Fu et al. 2023) and L-CAD (Weng et al. 2024). Since L-CAD falls under text-guided colorization and can only accept grayscale input, we convert the original image to grayscale for inference with L-CAD. As shown in Fig. 7, our method outperforms other techniques in terms of photorealism, stylization, and visual expressiveness. SDEdit exhibits content distortion and inconsistency between the color style and the text description, primarily because such methods focus on content editing and lack sufficient training for color style transfer. While InstructPix2Pix and MGIE achieve better style consistency, they also produce images with content distortion, particularly in portrait scenes (e.g., the eyes in Fig. 7 (c)). Moreover, their color styles are less visually attractive compared to ours. L-CAD exhibits no shortcomings in terms of content; however, it is relatively weaker in stylization (e.g., Fig. 7 (b)). This is because such methods typically concentrate on the specific colors of objects rather than the overall color style. More visualizations are shown in the appendix.

Inference Time and Memory. As shown in Table 4, our method surpasses the others in terms of both speed and memory efficiency. We attribute this to the use of features derived from a single forward pass of Stable Diffusion and the design of our priors feature mapper.

Table 4: Comparison on Inference Time and Memory.
Method MGIE InstructPix2Pix SDEdit L-CAD Ours
Time/Memory\downarrow 13.8s/40GB 9.67s/15.5GB 8.97s/15.5GB 17.6s/14GB 0.316s/12.5GB

Ablation Studies

We provide a detailed ablation analysis of different configurations of our method.

Compared with the Vanilla Way. We compare the results of the vanilla way (Sec. 3.2) with our TRStyle. As shown in Fig. 8, our method achieves results similar to the vanilla way, but with a much shorter running time (one pass vs. $T$ passes). This demonstrates the effectiveness of our feature mapper and synthetic data collection method.

Compared with Features from the Encoder. We construct the proposed TRStyle with features extracted from the encoder of the U-Net instead. As shown in Fig. 8, ours (features extracted from the decoder) clearly achieves greater consistency between the final result and the text prompt. The reason is that the decoder contains more information about the text prompt, as confirmed in (Cao et al. 2023).

5 Conclusion

In this paper, we propose a universal multi-modality reference color style transfer architecture named MRStyle, which accepts prompts from either images or text as references. To the best of our knowledge, this is the first time the text and image modalities have been unified for color style transfer. Benefiting from the proposed interaction dual-mapping network and the combined supervised learning pipeline, our method shows significant improvements over existing methods in various aspects when using image references. Additionally, owing to our proposed efficient priors feature mapper and data construction method, our method shows superiority and effectiveness when accepting text references.

References

  • Alayrac et al. (2022) Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. 2022. Flamingo: a visual language model for few-shot learning. NeurIPS.
  • An et al. (2020) An, J.; Xiong, H.; Huan, J.; and Luo, J. 2020. Ultrafast Photorealistic Style Transfer via Neural Architecture Search. In AAAI.
  • Bau et al. (2021) Bau, D.; Andonian, A.; Cui, A.; Park, Y.; Jahanian, A.; Oliva, A.; and Torralba, A. 2021. Paint by word. arXiv preprint arXiv:2103.10951.
  • Brooks, Holynski, and Efros (2023) Brooks, T.; Holynski, A.; and Efros, A. A. 2023. Instructpix2pix: Learning to follow image editing instructions. CVPR.
  • Cao et al. (2023) Cao, M.; Wang, X.; Qi, Z.; Shan, Y.; Qie, X.; and Zheng, Y. 2023. MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing. arXiv preprint arXiv:2304.08465.
  • Chen et al. (2023) Chen, Y.; Yang, H.; Yang, Y.; Liu, Y.; Wang, W.; Wen, X.; and Xie, C. 2023. NLUT: Neuarl-based 3D Lookup Tables for Video Photorealistic Style Transfer. arXiv preprint arXiv:2303.09170.
  • Chiu and Gurari (2022a) Chiu, T.-Y.; and Gurari, D. 2022a. Pca-based knowledge distillation towards lightweight and content-style balanced photorealistic style transfer models. In CVPR.
  • Chiu and Gurari (2022b) Chiu, T.-Y.; and Gurari, D. 2022b. PhotoWCT2: Compact Autoencoder for Photorealistic Style Transfer Resulting from Blockwise Training and Skip Connections of High-Frequency Residuals. In WACV.
  • Cong et al. (2022) Cong, W.; Tao, X.; Niu, L.; Liang, J.; Gao, X.; Sun, Q.; and Zhang, L. 2022. High-Resolution Image Harmonization via Collaborative Dual Transformations. In CVPR.
  • Diederik and Jimmy (2015) Diederik, K.; and Jimmy, B. 2015. Adam: A Method for Stochastic Optimization. CoRR.
  • Fu et al. (2023) Fu, T.-J.; Hu, W.; Du, X.; Wang, W. Y.; Yang, Y.; and Gan, Z. 2023. Guiding instruction-based image editing via multimodal large language models. ICLR.
  • Gao et al. (2023) Gao, S.; Lin, Z.; Xie, X.; Zhou, P.; Cheng, M.-M.; and Yan, S. 2023. EditAnything: Empowering Unparalleled Flexibility in Image Editing and Generation. In ACMMM, Demo track.
  • Gatys, Ecker, and Bethge (2016) Gatys, L. A.; Ecker, A. S.; and Bethge, M. 2016. Image Style Transfer Using Convolutional Neural Networks. In CVPR.
  • Goodfellow et al. (2014) Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A. C.; and Bengio, Y. 2014. Generative Adversarial Nets. In NeurIPS.
  • Hessel et al. (2021) Hessel, J.; Holtzman, A.; Forbes, M.; Le Bras, R.; and Choi, Y. 2021. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In EMNLP, 7514–7528.
  • Ho and Zhou (2021) Ho, M. M.; and Zhou, J. 2021. Deep Preset: Blending and Retouching Photos with Color Style Transfer. In WACV.
  • Huang et al. (2023) Huang, L.; Chen, D.; Liu, Y.; Shen, Y.; Zhao, D.; and Zhou, J. 2023. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778.
  • Huang and Belongie (2017) Huang, X.; and Belongie, S. 2017. Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization. In ICCV.
  • Huang et al. (2024) Huang, Y.; Huang, J.; Liu, Y.; Yan, M.; Lv, J.; Liu, J.; Xiong, W.; Zhang, H.; Chen, S.; and Cao, L. 2024. Diffusion model-based image editing: A survey. arXiv preprint arXiv:2402.17525.
  • Huang, Zhao, and Liao (2022) Huang, Z.; Zhao, N.; and Liao, J. 2022. Unicolor: A unified framework for multi-modal colorization with transformer. TOG.
  • Huberman-Spiegelglas, Kulikov, and Michaeli (2023) Huberman-Spiegelglas, I.; Kulikov, V.; and Michaeli, T. 2023. An Edit Friendly DDPM Noise Space: Inversion and Manipulations. arXiv preprint arXiv:2304.06140.
  • Ke et al. (2023) Ke, Z.; Liu, Y.; Zhu, L.; Zhao, N.; and Lau, R. W. 2023. Neural Preset for Color Style Transfer. In CVPR.
  • Kwon and Ye (2022) Kwon, G.; and Ye, J. C. 2022. Clipstyler: Image style transfer with a single text condition. In CVPR.
  • Lee et al. (2024) Lee, H.; Kang, K.; Ok, J.; and Cho, S. 2024. CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment. In CVPR.
  • Li et al. (2022) Li, J.; Li, D.; Xiong, C.; and Hoi, S. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML.
  • Li et al. (2017) Li, Y.; Fang, C.; Yang, J.; Wang, Z.; Lu, X.; and Yang, M.-H. 2017. Universal Style Transfer via Feature Transforms. In NeurIPS.
  • Li et al. (2018) Li, Y.; Liu, M.-Y.; Li, X.; Yang, M.-H.; and Kautz, J. 2018. A Closed-form Solution to Photorealistic Image Stylization. In ECCV.
  • Lin et al. (2023) Lin, T.; Lin, H.; Li, F.; He, D.; Wu, W.; Wang, M.; Li, X.; and Liu, Y. 2023. AdaCM: adaptive ColorMLP for real-time universal photo-realistic style transfer. In AAAI.
  • Luan et al. (2017) Luan, F.; Paris, S.; Shechtman, E.; and Bala, K. 2017. Deep Photo Style Transfer. In CVPR.
  • Meng et al. (2021) Meng, C.; Song, Y.; Song, J.; Wu, J.; Zhu, J.-Y.; and Ermon, S. 2021. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073.
  • Ouyang et al. (2022) Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. 2022. Training language models to follow instructions with human feedback. NeurIPS.
  • Patashnik et al. (2021) Patashnik, O.; Wu, Z.; Shechtman, E.; Cohen-Or, D.; and Lischinski, D. 2021. Styleclip: Text-driven manipulation of stylegan imagery. In ICCV.
  • Qiao et al. (2021) Qiao, Y.; Cui, J.; Huang, F.; Liu, H.; Bao, C.; and Li, X. 2021. Efficient Style-Corpus Constrained Learning for Photorealistic Style Transfer. TIP.
  • Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In ICML.
  • Rombach et al. (2022) Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-resolution image synthesis with latent diffusion models. In CVPR.
  • Sergyan (2008) Sergyan, S. 2008. Color histogram features based image classification in content-based image retrieval systems. In 2008 6th international symposium on applied machine intelligence and informatics, 221–224. IEEE.
  • Shi et al. (2022) Shi, J.; Xu, N.; Zheng, H.; Smith, A.; Luo, J.; and Xu, C. 2022. SpaceEdit: Learning a Unified Editing Space for Open-Domain Image Color Editing. In CVPR.
  • Singh et al. (2022) Singh, A.; Hu, R.; Goswami, V.; Couairon, G.; Galuba, W.; Rohrbach, M.; and Kiela, D. 2022. Flava: A foundational language and vision alignment model. In CVPR.
  • Tsung-Yi et al. (2014) Tsung-Yi, L.; Michael, M.; Serge, B.; and James, H. 2014. Microsoft COCO: Common Objects in Context. In ECCV.
  • Wen, Gao, and Zou (2023) Wen, L.; Gao, C.; and Zou, C. 2023. CAP-VSTNet: Content Affinity Preserved Versatile Style Transfer. In CVPR.
  • Weng et al. (2024) Weng, S.; Zhang, P.; Li, Y.; Li, S.; Shi, B.; et al. 2024. L-CAD: Language-based Colorization with Any-level Descriptions using Diffusion Priors. NeurIPS.
  • Wu et al. (2022) Wu, Z.; Zhu, Z.; Du, J.; and Bai, X. 2022. CCPL: contrastive coherence preserving loss for versatile style transfer. In ECCV.
  • Xia et al. (2020) Xia, X.; Zhang, M.; Xue, T.; Sun, Z.; Fang, H.; Kulis, B.; and Chen, J. 2020. Joint Bilateral Learning for Real-time Universal Photorealistic Style Transfer. In ECCV.
  • Xiaopeng Sun and Fan (2021) Xiaopeng Sun, T. H., Muxingzi Li; and Fan, L. 2021. Enhance Image as You Like with Unpaired Learning. In IJCAI.
  • Yoo et al. (2019) Yoo, J.; Uh, Y.; Chun, S.; Kang, B.; and Ha, J.-W. 2019. Photorealistic Style Transfer via Wavelet Transforms. In ICCV.
  • Zabari et al. (2023) Zabari, N.; Azulay, A.; Gorkor, A.; Halperin, T.; and Fried, O. 2023. Diffusing Colors: Image Colorization with Text Guided Diffusion. In SIGGRAPH Asia 2023 Conference Papers.
  • Zeng et al. (2020) Zeng, H.; Cai, J.; Li, L.; Cao, Z.; and Zhang, L. 2020. Learning Image-adaptive 3D Lookup Tables for High Performance Photo Enhancement in Real-time. IEEE TPAMI.
  • Zhang et al. (2022) Zhang, F.; Zeng, H.; Zhang, T.; and Zhang, L. 2022. CLUT-Net: Learning Adaptively Compressed Representations of 3DLUTs for Lightweight Image Enhancement. In ACMMM.
  • Zhang et al. (2018) Zhang, R.; Isola, P.; Efros, A. A.; Shechtman, E.; and Wang, O. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 586–595.
Figure 9: Samples of synthetic data for TRStyle training.

Appendix A Synthetic Data for TRStyle Training

We use ChatGPT (Ouyang et al. 2022) and the Stable Diffusion model (Rombach et al. 2022) (v1-5) to build 100,000 text-image pairs. Each pair consists of a sentence of at most 70 words as the style text reference $\mathbf{T_s}$ and its corresponding generated style image $\mathbf{I_s}$. We then randomly select content images $\mathbf{I_c}$ from the COCO (Tsung-Yi et al. 2014) dataset and construct training triplets $(\mathbf{I_c},\mathbf{T_s},\mathbf{I_g})$ following Sec. 3.2. In Fig. 9, we show samples of our synthetic data.

Appendix B Implementation Details

B.1 Implementation of IRStyle

Following recent color style transfer methods (Yoo et al. 2019; An et al. 2020; Li et al. 2018), we train our model on images from the MS COCO (Tsung-Yi et al. 2014) dataset. We collect about 6,000 3D-LUT files as the filters used in the paired supervised learning pipeline. For evaluation, we use the test images collected by PhotoNAS (An et al. 2020). We take VGG as the encoder and utilize CLUT (Zhang et al. 2022) as our color mapping LUT. The inputs are randomly cropped to 256×256. We train the network with the Adam (Diederik and Jimmy 2015) optimizer for 300 epochs, with a batch size of 24 and an initial learning rate of 5e-4.

B.2 Implementation of TRStyle

We use ChatGPT (Ouyang et al. 2022) and the Stable Diffusion model (Rombach et al. 2022) (v1-5) to build a large dataset of 100,000 text-image pairs; details about the dataset are provided in Appendix A. We train the mapper network in TRStyle with the Adam (Diederik and Jimmy 2015) optimizer for 200 epochs, with a batch size of 8. The initial learning rate is 5e-4 and is multiplied by 0.5 after 30 epochs. The experiments are conducted on a single Tesla V100 32GB GPU. A sketch of this optimization setup is given below.
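One way to write this schedule is with Adam and a StepLR scheduler, reusing the mapper and feature-extraction sketches from Sec. 3.2. The data loader, prompt encoding, and the routine that runs IRStyle with the mapped features are hypothetical placeholders, and reading "multiplied by 0.5 after 30 epochs" as a step every 30 epochs is our own assumption.

```python
import torch
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR

mapper = PriorsFeatureMapper().cuda()                      # the only trainable module
optimizer = Adam(mapper.parameters(), lr=5e-4)
scheduler = StepLR(optimizer, step_size=30, gamma=0.5)     # LR x 0.5 every 30 epochs (our reading)

for epoch in range(200):
    for I_c, T_s, I_g in triplet_loader:                   # synthetic triplets, batch size 8 (hypothetical loader)
        noise = torch.randn(len(T_s), 4, 64, 64, device="cuda")
        F_t = extract_unet_decoder_features(pipe, encode_prompt(T_s), noise,
                                            t=pipe.scheduler.timesteps[0])   # first-step features
        F_m = mapper(F_t)
        Y = irstyle_with_text_style(I_c.cuda(), F_m)       # hypothetical: IRStyle with F_s replaced by F_m
        loss = torch.nn.functional.mse_loss(Y, I_g.cuda()) # L_teach
        optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()
```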

Appendix C Performance on Video Color Style Transfer

For image reference, we compare our method with state-of-the-art methods (Wu et al. 2022; Wen, Gao, and Zou 2023). For quantitative evaluation, we gather 25 pairs of video clips and corresponding style images from various scenes on the Internet. In line with (Wu et al. 2022; Wen, Gao, and Zou 2023), we adopt the temporal loss to measure temporal consistency.

Table 5: Quantitative comparison of image reference video color transfer. ’i’ denotes frame interval. The execution time is the total processing time for 150 frames.
Method Content SSIM\uparrow Style Gram loss\downarrow Temporal loss\downarrow Time cost\downarrow
$i=1$ $i=10$
CCPL 0.7007 1.79 0.092 0.173 19.27s
CAP-VST 0.7446 0.98 0.071 0.132 160s
Ours 0.7766 1.48 0.054 0.108 1.39s
Figure 10: Video color style transfer results of MRStyle

For each video, we generate the LUT using IRStyle on the initial frame and apply it to all frames, ensuring rapid execution and consistent results across frames. The results in Table 5 show that our framework yields comparable results to the other methods in terms of temporal consistency, content similarity, style effect, and time cost. In practical scenarios, we can employ a simple scene judgment method, such as the Lab histogram (Sergyan 2008), to segment video scenes. Subsequently, using the same LUT within each scene enables us to achieve the best trade-off between temporal consistency and style effect. A sketch of this strategy is given below.
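The per-scene LUT reuse with a Lab-histogram scene check can be sketched with OpenCV as follows. The threshold value is arbitrary, and `predict_luts` / `apply_luts` stand in for the IRStyle prediction and LUT application steps; they are hypothetical interfaces, not released code.

```python
import cv2

def lab_hist(frame_bgr, bins=32):
    """Lab colour histogram used as a cheap scene-change signal (Sergyan 2008)."""
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB)
    hist = cv2.calcHist([lab], [0, 1, 2], None, [bins] * 3, [0, 256] * 3)
    return cv2.normalize(hist, hist).flatten()

def stylize_video(frames, style_img, irstyle, scene_thresh=0.6):
    """Re-predict the LUTs only when the Lab-histogram correlation with the
    current scene's first frame drops below `scene_thresh`; otherwise reuse
    the LUTs predicted for that first frame."""
    out, ref_hist, luts = [], None, None
    for frame in frames:                                   # frames: uint8 BGR arrays
        h = lab_hist(frame)
        if ref_hist is None or cv2.compareHist(ref_hist, h, cv2.HISTCMP_CORREL) < scene_thresh:
            ref_hist = h
            luts = irstyle.predict_luts(frame, style_img)  # hypothetical: returns (L_c, L_s)
        out.append(apply_luts(frame, luts))                # hypothetical LUT application
    return out
```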

In Fig. 10, we present a series of stylized frames from our MRStyle, including image and text reference. The style remains consistent across frames and the resulting video is notably stable.

Appendix D User Study of IRStyle

Figure 11: Comparison on user study results.
Table 6: Quantitative comparison of the text reference setting. The best and second best are in bold and underlined, respectively.
Method MGIE InstructPix2Pix SDEdit L-CAD Ours
CLIP Score \uparrow 0.2448 0.2731 0.2366 0.2264 0.2466
Content SSIM \uparrow 0.6796 0.6698 0.4111 0.7508 0.7527
LPIPS \downarrow 0.6796 0.3212 0.3612 0.2531 0.2138
User score \uparrow 2.89 3.46 2.57 3.05 3.72
Figure 12: Extending MRStyle to support simultaneous input of text and images. $\mathbf{w}$ is the weight of the image reference features. When $\mathbf{w}$ equals 1, only the image reference is considered, while a value of 0 represents exclusive reliance on the text reference. (a) This row emphasizes the gradual change between different color series (e.g., pink and orange). (b) This row depicts variations in brightness levels.
Figure 13: Visual results of IRStyle. Our method is robust when generalizing to different input colors.

The overall stylization quality score is rated from 1 (least satisfactory) to 5 (highly satisfactory), with 3 indicating acceptable results. Ratings of 1 and 2 denote negative outcomes, 3 signifies a neutral outcome, and 4 and 5 represent positive outcomes. Fig. 11 displays the distribution of evaluation results for each method. The transfer results of our method exhibit the highest proportion in the neutral-to-positive range, suggesting its effectiveness across a diverse range of cases.

Appendix E Quantitative Experiments of TRStyle

Since there is presently no standardized benchmark for evaluating text-reference color style transfer, we have created a test benchmark consisting of 40 samples. Each sample includes an input image and a style text prompt. The images are sourced from Unsplash (https://unsplash.com/) and Civitai (https://civitai.com/), covering four distinct scenes: buildings, sky, portraits, and mountains. The style text prompts are generated using ChatGPT (Ouyang et al. 2022), e.g., "A gray tone, elegant and vintage style". Following the experimental setting of IRStyle, we employ four metrics for evaluation: CLIP Score (Hessel et al. 2021) to measure style similarity, Content SSIM (Ke et al. 2023) and LPIPS (Zhang et al. 2018) to measure content similarity, and a user score for human sensory evaluation. Table 6 shows that our TRStyle achieves the best trade-off between content and style similarity, aligning most closely with user preferences. Our approach does not yield the highest CLIP score; this can be attributed to the fact that image editing methods often modify the content of the image, which inflates the CLIP score.

Figure 14: Visual results of TRStyle. Our method is robust when generalizing to different scenes, e.g., portraits, landscapes, and food.
Figure 15: Discussion of the CLIP prior in TRStyle.

Appendix F Discussion of the CLIP Prior

We choose stable diffusion features over CLIP features for three primary reasons. First, our IRStyle operates in image space, which aligns more naturally with SD's features due to their lower-level, pixel-wise representation. This contrasts with CLIP's inclination towards capturing abstract, high-level information, which lacks the granularity of pixel-wise guidance. Second, the powerful image generation capability of SD provides aesthetic priors for the final result, leading to better visual effects. Third, the same style description often corresponds to different tones, which is consistent with the randomness of the diffusion model. In contrast, CLIP tends to establish a fixed style correspondence per prompt, which restricts variability.

We further replace the SD prior in our proposed TRStyle with a CLIP prior, i.e., directly extracting text features from CLIP. As depicted in Fig. 15, the CLIP prior setting produces poor transfer results. This further supports the first reason above: compared to CLIP, the features from SD align better with the style features of IRStyle, as both operate at the image level. We posit that a carefully developed mapper like CLIPtone (Lee et al. 2024), integrating a directional color vector, could potentially alleviate this issue. As shown in Fig. 15, CLIPtone shows better text consistency compared to our CLIP prior variant. However, CLIPtone mainly concentrates on the color terms (e.g., 'blue' and 'green') but overlooks the context related to these colors (e.g., 'modern' and 'aurora'). Moreover, compared to our SD prior method (TRStyle), its color tones appear overly saturated. This aligns with our earlier analysis that the SD-based approach can leverage the aesthetic priors of SD, thus yielding better visual results.

Appendix G Simultaneous Text and Image References

We extend our MRStyle to support simultaneous text and image references. The style features of the text and image references are denoted as $\mathbf{F_t}$ and $\mathbf{F_i}$, respectively. Since we align the style information of text and images into a common space, we can mix these styles as $\mathbf{F_m}=\mathbf{w}\cdot\mathbf{F_i}+(1-\mathbf{w})\cdot\mathbf{F_t}$, where $\mathbf{w}\in[0,1]$. Subsequently, we use $\mathbf{F_m}$ as the style feature in our interaction dual-mapping network to complete the style transfer, as sketched below. Fig. 12 demonstrates our method's capability to simultaneously support text and image reference inputs. By adjusting the weight $\mathbf{w}$, we can effectively balance the contribution of both modalities: as $\mathbf{w}$ increases, the transfer becomes more biased towards the style indicated by the image reference. When the textual and pictorial references differ substantially, our method may produce unconventional or unexpected results; a comprehensive analysis of this case is beyond the scope of this paper.
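Since both modalities share the same multi-scale style space, the mixing reduces to a per-scale convex combination of the two feature lists, as in this minimal sketch.

```python
def blend_style_features(F_i, F_t, w):
    """Mix image-reference and text-reference style features per scale:
    F_m = w * F_i + (1 - w) * F_t, with w in [0, 1]."""
    assert 0.0 <= w <= 1.0
    return [w * fi + (1.0 - w) * ft for fi, ft in zip(F_i, F_t)]
```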

Figure 16: Visual results of failure cases.

Appendix H Limitation

For IRStyle, blurriness present in the input could potentially be amplified in the output, as shown in Fig. 16 (a), where the JPEG artifacts are amplified, especially in the background. Moreover, it cannot perform local color adjustments; Fig. 16 (b) illustrates its inability to distinguish elements with similar colors but different semantics, such as the sky and the river. For text reference, since our TRStyle utilizes the priors of IRStyle and stable diffusion, the above limitations of IRStyle also exist in the results of TRStyle. Furthermore, the style similarity is influenced by the generation ability of Stable Diffusion. As shown in Fig. 16 (c), the transfer result is inconsistent with the text description; this problem is caused by stable diffusion, which produces a brown-style reference image.

Appendix I More Results

I.1 Visual Results of IRStyle

We provide more visual results of IRStyle in Fig. 13.

I.2 Visual Results of TRStyle

We provide more visual results of TRStyle in Fig. 14. In reality, a single scene can correspond to various color styles. For instance, a winter scene can be associated with a snowy white color style or vintage tones. Existing image editing software usually provides various styles of filters for the same scene (e.g., Winter 1, Winter 2). Leveraging the stable diffusion priors, our method can generate multiple styles for the same scene using different noise seeds with simple descriptions, as depicted in Fig. 17 (a). Our approach also enables detailed descriptions of color styles, which can reduce generation ambiguity as shown in Fig. 17 (b).

Figure 17: Visual results of TRStyle with different noise seeds. (a) Due to the stochasticity of the stable diffusion model, our TRStyle can produce varying color style transformation outputs for a single simple text prompt, using different noise seeds. (b) Providing detailed text prompts, our TRStyle can generate similar color outputs for different noise seeds.