MRStyle: A Unified Framework for Color Style Transfer with
Multi-Modality Reference
Abstract
In this paper, we introduce MRStyle, a comprehensive framework that enables color style transfer using multi-modality reference, including image and text. To achieve a unified style feature space for both modalities, we first develop a neural network called IRStyle, which generates stylized 3D lookup tables for image reference. This is accomplished by integrating an interaction dual-mapping network with a combined supervised learning pipeline, resulting in three key benefits: elimination of visual artifacts, efficient handling of high-resolution images with low memory usage, and maintenance of style consistency even in situations with significant color style variations. For text reference, we align the text feature of stable diffusion priors with the style feature of our IRStyle to perform text-guided color style transfer (TRStyle). Our TRStyle method is highly efficient in both training and inference, producing notable open-set text-guided transfer results. Extensive experiments in both image and text settings demonstrate that our proposed method outperforms the state-of-the-art in both qualitative and quantitative evaluations.
1 Introduction
With the surge in popularity of short-video and photo-sharing platforms, many users wish to customize the color style of their photos and videos, including brightness, hue, and saturation, before sharing them. Existing photo editing software provides expert-defined image filters or lookup tables (LUTs) for color adjustment, but these pre-set filters/LUTs cannot satisfy every user's aesthetic needs and limit flexibility. To mitigate these limitations, researchers have introduced color style transfer techniques.
Image-guided color style transfer is the most common color style transfer task. It requires the stylized image to match the reference style image in overall color style while preserving the content and texture of the original content image. Current image-guided color transfer methods often fail to produce satisfactory results (Fig. 1 (a)). Many recent deep learning-based approaches are built on an encoder-decoder structure with feature whitening and coloring (WCT) operations (Luan et al. 2017; Li et al. 2018; Yoo et al. 2019; An et al. 2020; Ho and Zhou 2021; Chiu and Gurari 2022b). Because they rely on the decoder to generate the stylized image, these methods often introduce unrealistic artifacts into the output, and their large memory usage makes high-resolution images difficult to handle. An alternative line of work, mapping-based methods (Lin et al. 2023; Ke et al. 2023), addresses these issues by applying a predicted color mapping matrix to the original image instead of decoding a new one. Nevertheless, owing to their model structure and training paradigm, these methods may fail to deliver satisfactory color transfer between images with very different inherent colors.
The task of text-guided color style transfer has been recently proposed, as finding a reference style image that meets personal requirements can be challenging and impractical. Existing text-guided methods (Bau et al. 2021; Patashnik et al. 2021; Kwon and Ye 2022; Shi et al. 2022) require either expensive paired data gathering or time-consuming online optimization for each content and style. With the development of diffusion models, some image editing methods based on diffusion models can also perform text-guided style transfer (Huang et al. 2024; Meng et al. 2021; Brooks, Holynski, and Efros 2023; Huberman-Spiegelglas, Kulikov, and Michaeli 2023). However, their stylization quality is not guaranteed as they are not designed for color style transfer.
All of the methods mentioned above consider only one modality. Compared with image prompts, text prompts are more user-friendly and flexible but provide less intuitive style information. A natural idea therefore arises: build a unified framework for color style transfer that accepts either text or image prompts. The key is to unify the style information of text and images in a common space. To achieve this, we first train an image-guided color transfer model, and then align the text features from the Stable Diffusion priors (Rombach et al. 2022) with the color style features of the pre-trained image-guided model. To the best of our knowledge, ours is the first method that can use either an image or a text prompt as the reference for color style transfer.
For image reference, we propose a novel color style transfer method named IRStyle. First, to avoid artifacts and ensure low memory usage on high-resolution inputs, we follow the mapping-based paradigm and adopt simple 3D LUTs (Zeng et al. 2020; Cong et al. 2022) to perform the color transfer. Second, to preserve style similarity, we introduce an interaction dual-mapping network. In addition, we design a hybrid training pipeline combining paired and unpaired supervision, which further improves the style similarity metrics.
For text reference, we introduce a lightweight network to align the text features from the pre-trained stable diffusion with the style feature of our IRStyle. Since there are no public datasets available for training, we further design a cost-efficient method for data collection with the help of ChatGPT (Ouyang et al. 2022) and the stable diffusion model (Rombach et al. 2022). Leveraging the prior of the pre-trained stable diffusion model, our method can conduct open-set text-guided style transfer. In addition, due to the excellent performance of our IRStyle, our method achieves impressive style transfer results (Fig. 1 (b)). Furthermore, our model structure and data collection strategy ensure high efficiency in both training and testing. The main contributions are summarized as follows:
- We propose a generic multi-modality reference color style transfer architecture named MRStyle, which accepts prompts from either images or text as references.
- For image reference, we propose a stylized LUT generation method (IRStyle). It eliminates artifacts, handles high-resolution images with low memory usage, and preserves style uniformity for images with significantly different inherent colors.
- For text reference, we fully exploit priors from the pre-trained Stable Diffusion model and our IRStyle to design a text-guided color style transfer model (TRStyle). It is efficient in both training and inference and produces strong open-set transfer results.
- Comprehensive evaluations demonstrate that MRStyle significantly outperforms state-of-the-art methods.
2 Related Works
Image Reference Transfer. Image reference color style transfer transfers color style between images. Existing methods can be divided into WCT-based methods (Li et al. 2017; Yoo et al. 2019; Qiao et al. 2021; Ho and Zhou 2021; Li et al. 2018; Chiu and Gurari 2022b; Wen, Gao, and Zou 2023; Xiaopeng Sun and Fan 2021) and mapping-based methods (Lin et al. 2023; Ke et al. 2023; Chen et al. 2023). WCT (Li et al. 2017) first uses the feature whitening and coloring operation with an encoder-decoder structure to perform color transfer. CAP-VST (Wen, Gao, and Zou 2023) follows a similar pipeline with an effective reversible residual network and an unbiased WCT. Due to the decoding process, these methods inevitably suffer from visual defects and large memory requirements. In contrast, mapping-based methods (Lin et al. 2023; Ke et al. 2023; Chen et al. 2023) avoid these problems by using a low-resolution input to predict a color mapping matrix that is then applied to the original image. AdaCM (Lin et al. 2023) proposes a network that directly predicts the matrix, and DNCM (Ke et al. 2023) further decouples the process into color normalization and stylization. Nonetheless, their color similarity remains unsatisfactory. In this work, we propose IRStyle to improve mapping-based methods by introducing an interaction dual-mapping network and a combined supervised learning pipeline. In addition to image references, our method can also take text as a reference.
Text Reference Transfer. Text reference color style transfer aims to adapt the content image to the color style described by a given text. With the development of vision-language pre-training models (Radford et al. 2021; Li et al. 2022; Alayrac et al. 2022; Singh et al. 2022), text and image information can be well aligned into a unified space. SpaceEdit (Shi et al. 2022) performs supervised language-guided image editing, which requires costly paired training data and cannot handle open-set scenes. Other works (Bau et al. 2021; Patashnik et al. 2021; Kwon and Ye 2022) demonstrate open-set text-guided style transfer by leveraging CLIP (Radford et al. 2021) to explore the desired style space during each inference; however, this online optimization incurs significant time cost at inference, making it impractical for real-world applications. As generative models develop rapidly, image editing, including stylization, has improved considerably, and many diffusion-based editing methods have been proposed, such as SDEdit (Meng et al. 2021), EditAnything (Gao et al. 2023), InstructPix2Pix (Brooks, Holynski, and Efros 2023), and MGIE (Fu et al. 2023). However, they primarily target general image editing rather than color style transfer, which can result in content distortion or subpar color outcomes. Text-guided colorization (Weng et al. 2024; Huang, Zhao, and Liao 2022; Zabari et al. 2023), another color-related topic, primarily converts grayscale images into visually pleasing colorful ones, with text that often focuses on object-level color descriptions. In contrast, text-guided color style transfer concerns the color style of photorealistic color images, with text that describes the global color style. In this work, we explore how to inject pre-trained diffusion priors into open-set text-guided color style transfer.
3 Method
The overview of MRStyle is shown in Fig. 2. It conducts color style transfer using either an image or a text reference in a unified framework. To achieve this, we align the features of reference images and texts into a unified style feature space. Specifically, we first train IRStyle via a neural LUT network (Sec. 3.1), introducing an interaction dual-mapping network and a combined supervised learning pipeline. We then employ synthetic text–image pairs to train a feature mapper, which projects text features from the pre-trained Stable Diffusion priors into the style space of IRStyle (Sec. 3.2). During inference, given an image or text as the style input, MRStyle extracts the style features of the corresponding modality, which then interact with the features of the content image for color transfer (Fig. 2 (a)).
3.1 Image Reference Color Style Transfer
Interaction Dual-Mapping Network
Following mapping-based methods, we design a neural network that generates color mapping matrices and applies them to the original image to complete the color mapping. For simplicity, we use combined 3D LUTs as the mapping matrices. Applying a 3D LUT costs O(1) per input pixel, requiring only a handful of floating-point operations for the trilinear lookup, which makes it extremely fast even at high resolutions. Moreover, when the video scene changes little, the 3D LUTs predicted for the initial frame can be applied to subsequent frames, further reducing computation. We note that our method is not limited to 3D LUTs; other color mapping matrices (e.g., JBL (Xia et al. 2020), DNCM (Ke et al. 2023)) are feasible as well.
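To make the cost argument concrete, the following minimal sketch (not the authors' released code) applies a 3D LUT to an image with PyTorch's trilinear grid sampling; the LUT axis ordering and the [0, 1] input range are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def apply_3d_lut(image: torch.Tensor, lut: torch.Tensor) -> torch.Tensor:
    """image: (N, 3, H, W) RGB in [0, 1]; lut: (N, 3, S, S, S) lookup table."""
    # Map pixel colors to normalized grid coordinates in [-1, 1].
    # grid[..., 0] indexes the innermost LUT axis (assumed R), grid[..., 2] the outermost (B).
    r, g, b = image[:, 0], image[:, 1], image[:, 2]
    grid = torch.stack([r, g, b], dim=-1) * 2.0 - 1.0   # (N, H, W, 3)
    grid = grid.unsqueeze(1)                            # (N, 1, H, W, 3)
    out = F.grid_sample(lut, grid, mode="bilinear",     # trilinear for 5-D inputs
                        padding_mode="border", align_corners=True)
    return out.squeeze(2)                               # (N, 3, H, W)
```

Since each pixel needs only one trilinear lookup, the same small LUT can be applied to an 8K frame, or reused across video frames, at negligible cost.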
Here, we describe the network design in detail. Given an original content image $I_c$ and a reference style image $I_s$, both of shape $H \times W \times 3$, we downsample them to obtain two thumbnails $\tilde{I}_c$ and $\tilde{I}_s$. We then feed $\tilde{I}_c$ and $\tilde{I}_s$ into a shared encoder to extract content features $F_c$ and style features $F_s$. To obtain the stylized result $O$, we compare three possible architectural designs, as shown in Fig. 3.
a) Interaction direct-mapping network. As depicted in Fig. 3 (a), we simply use an interaction module to merge $F_c$ and $F_s$. A predictor is then employed to obtain a direct transfer LUT $T_d$, and we map the original image $I_c$ to the final result $O$ through $T_d$. This network is similar to AdaCM (Lin et al. 2023). However, a single LUT may struggle to perform the transfer when the color styles of $I_c$ and $I_s$ differ greatly, as shown in Fig. 6 (e).
b) Non-interaction dual-mapping network. As shown in Fig. 3 (b), we employ a content predictor on $F_c$ to obtain the content LUT $T_c$, and a style predictor on $F_s$ to obtain the style LUT $T_s$. We then execute a dual mapping (i.e., content extraction followed by stylization) to produce the final outcome. First, $T_c$ is applied to $I_c$ to obtain the content map $M_c$, which represents the style-free content of $I_c$. Second, $T_s$ is applied to $M_c$ to obtain the result $O$. This network is similar to DNCM (Ke et al. 2023). However, $T_c$ and $T_s$ are predicted independently, without interaction. This requires either the content LUT to normalize all images into a common content space, or the style LUT to transform arbitrary content, both of which are challenging, as shown in Fig. 6 (a).
c) Interaction dual-mapping network. As discussed above, we believe the interaction between content and style features is crucial, and explicitly decomposing the transfer into content extraction and stylization further improves the results. We therefore incorporate both ideas into our final design, depicted in Fig. 3 (c). Specifically, we use a VGG encoder to extract content features $\{F_c^i\}_{i=1}^{4}$ and style features $\{F_s^i\}_{i=1}^{4}$ at four scales. The features at each scale interact with each other through AdaIN (Huang and Belongie 2017). After the interaction, they are downsampled to a uniform scale and concatenated. The concatenated features are fed into the content predictor to generate the content LUT $T_c$ of $I_c$, and into the style predictor to generate the style LUT $T_s$ of $I_s$. Each predictor consists of four convolution blocks. Finally, we apply $T_c$ to $I_c$ to obtain the content map $M_c$, and then apply $T_s$ to $M_c$ to obtain the final result $O$.
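The forward pass of this design can be summarized by the schematic sketch below; the encoder/predictor interfaces, the pooling size, and `apply_3d_lut` (from the earlier sketch) are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def adain(content_feat, style_feat, eps=1e-5):
    """Align channel-wise mean/std of content features to those of the style features."""
    c_mean, c_std = content_feat.mean((2, 3), keepdim=True), content_feat.std((2, 3), keepdim=True) + eps
    s_mean, s_std = style_feat.mean((2, 3), keepdim=True), style_feat.std((2, 3), keepdim=True) + eps
    return (content_feat - c_mean) / c_std * s_std + s_mean

def irstyle_forward(encoder, content_predictor, style_predictor, I_c, I_s, thumb_size=256):
    # 1) Work on thumbnails so memory does not grow with input resolution.
    tc = F.interpolate(I_c, size=(thumb_size, thumb_size), mode="bilinear", align_corners=False)
    ts = F.interpolate(I_s, size=(thumb_size, thumb_size), mode="bilinear", align_corners=False)

    # 2) Shared encoder returns four scales of features for each thumbnail.
    Fc = encoder(tc)   # list of 4 tensors
    Fs = encoder(ts)

    # 3) Interaction: exchange style statistics at every scale, then pool to one scale.
    mixed = [adain(fc, fs) for fc, fs in zip(Fc, Fs)]
    pooled = torch.cat([F.adaptive_avg_pool2d(m, 8) for m in mixed], dim=1)

    # 4) Dual mapping: predict a content LUT T_c and a style LUT T_s from the shared code.
    T_c = content_predictor(pooled)        # (N, 3, S, S, S)
    T_s = style_predictor(pooled)

    # 5) Apply both LUTs to the full-resolution content image.
    M_c = apply_3d_lut(I_c, T_c)           # style-free content map
    return apply_3d_lut(M_c, T_s)          # final stylized result
```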
The Combined Supervised Learning Pipeline
A combined learning pipeline is designed to train IRStyle with both paired and unpaired supervision, as shown in Fig. 2 (b).
For paired supervision, we randomly apply two filters to a source image, producing two images with identical content but different color styles, denoted $I_1$ and $I_2$. First, we take $I_1$ as the content image and $I_2$ as the style image. Since the two images share the same content, the desired stylized result of $I_1$ should be $I_2$ itself, so an MSE loss between the stylized output and $I_2$ is adopted, referred to as $\mathcal{L}_{res}$. Second, we use $I_1$ and $I_2$ as content images for model training, respectively. Images with the same content should yield identical content maps, i.e., the content map $M_1$ of $I_1$ and the content map $M_2$ of $I_2$ should be identical, so an MSE loss between $M_1$ and $M_2$ is used, denoted $\mathcal{L}_{cmap}$. The paired supervised losses are:
$$\mathcal{L}_{paired} = \mathcal{L}_{res} + \mathcal{L}_{cmap} \qquad (1)$$
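A minimal sketch of the paired-supervision branch is given below, assuming a hypothetical `model(content, style, return_content_map=True)` interface and a `random_lut()` filter sampler; it mirrors the two terms of Eq. (1).

```python
import torch.nn.functional as F

def paired_losses(model, source, random_lut, apply_3d_lut):
    # Two random LUT filters applied to the same source give a content-identical pair,
    # so ground truth comes for free (self-supervision).
    I1 = apply_3d_lut(source, random_lut())   # same content, style A
    I2 = apply_3d_lut(source, random_lut())   # same content, style B

    # L_res: transferring I1 toward the style of I2 should reproduce I2 itself.
    O_12, M1 = model(content=I1, style=I2, return_content_map=True)
    loss_res = F.mse_loss(O_12, I2)

    # L_cmap: content-identical inputs should yield identical content maps.
    _, M2 = model(content=I2, style=I1, return_content_map=True)
    loss_cmap = F.mse_loss(M1, M2)

    return loss_res + loss_cmap
```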
Paired supervision has the benefit of providing ground truth in a self-supervised manner. However, it is inconsistent with the real inference situation, where the style and content images typically differ in content. To address this, we introduce unpaired supervision, where the content image and the style image come from different images. Since unpaired supervision has no ground truth, we adopt loss functions commonly used in other color transfer methods (Li et al. 2017; Yoo et al. 2019; Li et al. 2018; Chiu and Gurari 2022b): a style similarity loss $\mathcal{L}_{style}$ and a structural content loss $\mathcal{L}_{content}$. $\mathcal{L}_{style}$ is implemented as the MSE between the means and standard deviations of shallow feature maps extracted from a pre-trained VGG network, while $\mathcal{L}_{content}$ is the MSE between the deep feature maps. However, $\mathcal{L}_{style}$ generally reflects similarity in the high-level semantic feature space rather than color information (Ke et al. 2023). We therefore additionally employ a more interpretable loss $\mathcal{L}_{hist}$ following (Huang et al. 2023), which measures the distance between soft color histograms. Denoting the unpaired transfer result as $O_u$, the unpaired supervised losses are:
$$\mathcal{L}_{unpaired} = \mathcal{L}_{style} + \mathcal{L}_{content} + \mathcal{L}_{hist} \qquad (2)$$
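The unpaired terms can be sketched as follows; the VGG layers used, the number of histogram bins, and the Gaussian bandwidth are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn.functional as F

def style_stats_loss(feats_out, feats_ref):
    """L_style: MSE between channel-wise mean/std of shallow VGG feature maps."""
    loss = 0.0
    for fo, fr in zip(feats_out, feats_ref):
        loss = loss + F.mse_loss(fo.mean((2, 3)), fr.mean((2, 3))) \
                    + F.mse_loss(fo.std((2, 3)), fr.std((2, 3)))
    return loss

def soft_histogram(img, bins=64, sigma=0.02):
    """Differentiable per-channel color histogram built with Gaussian kernels."""
    centers = torch.linspace(0.0, 1.0, bins, device=img.device)   # (B,)
    x = img.flatten(2).unsqueeze(-1)                              # (N, 3, HW, 1)
    weights = torch.exp(-0.5 * ((x - centers) / sigma) ** 2)      # (N, 3, HW, B)
    hist = weights.sum(dim=2)                                     # (N, 3, B)
    return hist / (hist.sum(dim=-1, keepdim=True) + 1e-8)

def histogram_loss(output, style):
    """L_hist: distance between soft color histograms of the output and style image."""
    return F.l1_loss(soft_histogram(output), soft_histogram(style))
```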
In training, both paired supervised losses and unpaired supervised losses are utilized for each sample.
3.2 Text Reference Color Style Transfer
A Vanilla Text-Guided Way
With the rapid development of text-to-image diffusion models, the Stable Diffusion model can produce high-quality images from a text prompt. A vanilla way to achieve text-reference color style transfer is therefore to use the pre-trained Stable Diffusion model to generate a reference style image from the provided style text, and then apply our pre-trained IRStyle with the generated style image to obtain the result.
Although this solution can accomplish the task, it has two main drawbacks. First, it is time-consuming, as generating a style image requires many denoising steps. Second, the whole text-guided color transfer process is not end-to-end. A faster and more elegant alternative is therefore needed.
Efficient Priors Feature Mapper
The Stable Diffusion model (Rombach et al. 2022) performs text-to-image generation with a U-Net through a step-by-step denoising process. To ensure that the generated image is semantically consistent with the given text, the model computes cross-attention between the text embeddings and the U-Net features at every step. This suggests that the internal U-Net representations are well associated with language-describable semantic concepts and can therefore be exploited to guide the color transfer. An interesting observation, shown in Fig. 4 (a), is that the color and illumination of the generated result are determined during the early stage of denoising. To support this finding, we compute the VGG style loss between the predicted result at each timestep and the final image over 100 examples. As shown in Fig. 4 (b), there is only a small difference between the initial and final stages of the denoising process (as stated in (Ke et al. 2023), a loss below 5 usually indicates high similarity). Therefore, we may be able to use the early, or even the first-step, features to guide the color transfer.
Our method requires only one forward pass of the diffusion model in the whole process. We design an efficient priors feature mapper that maps the Stable Diffusion priors to the reference style features of our IRStyle. The mapper consists of four different convolution blocks (Goodfellow et al. 2014). Given a random noise latent $z_T$ at timestep $T$ and a style text prompt $p$, we extract features from four different layers of the U-Net decoder, denoted $\{D^i\}_{i=1}^{4}$. We then use the mapper to project $\{D^i\}_{i=1}^{4}$ into the corresponding style reference space at each scale, obtaining $\{\hat{F}_s^i\}_{i=1}^{4}$. We subsequently replace $\{F_s^i\}_{i=1}^{4}$ in IRStyle with $\{\hat{F}_s^i\}_{i=1}^{4}$ for text-guided color transfer.
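A hedged sketch of this single-pass feature extraction with the diffusers library is shown below; the hooked up-blocks, the early timestep value, and the `FeatureMapper` architecture are assumptions for illustration, and only the mapper would be trained.

```python
import torch
import torch.nn as nn
from diffusers import StableDiffusionPipeline

class FeatureMapper(nn.Module):
    """Four small convolution blocks mapping U-Net decoder features to style features."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Conv2d(ci, co, 3, padding=1), nn.ReLU(inplace=True),
                          nn.Conv2d(co, co, 3, padding=1))
            for ci, co in zip(in_channels, out_channels))

    def forward(self, feats):
        return [blk(f) for blk, f in zip(self.blocks, feats)]

@torch.no_grad()
def sd_decoder_features(pipe: StableDiffusionPipeline, prompt, timestep=981, device="cuda"):
    feats = []
    # Capture the output of each of the four U-Net decoder (up) blocks.
    hooks = [blk.register_forward_hook(lambda m, i, o: feats.append(o))
             for blk in pipe.unet.up_blocks]
    tokens = pipe.tokenizer(prompt, padding="max_length", truncation=True,
                            max_length=pipe.tokenizer.model_max_length,
                            return_tensors="pt").to(device)
    text_emb = pipe.text_encoder(tokens.input_ids)[0]
    latent = torch.randn(1, pipe.unet.config.in_channels, 64, 64,
                         device=device, dtype=pipe.unet.dtype)
    t = torch.tensor([timestep], device=device)          # a timestep near the start of denoising
    pipe.unet(latent, t, encoder_hidden_states=text_emb)  # single forward pass, no full sampling
    for h in hooks:
        h.remove()
    return feats
```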
Training with Synthetic Data. The whole training process is shown in Fig. 2 (c). Since no public datasets are available for this task, we design a highly cost-efficient synthetic data generation pipeline. First, we use ChatGPT to generate a style text prompt $p$, and feed $p$ into the Stable Diffusion model to generate the corresponding style image $I_s$. Second, given a content image $I_c$ and $I_s$, we feed them into our trained IRStyle to generate the result $O$. Finally, we use the triplet $(I_c, p, O)$ as one training sample, where $I_c$ and $p$ are the inputs and $O$ is the ground truth. Denoting the result of TRStyle as $\hat{O}$, an MSE loss between $\hat{O}$ and $O$ is adopted. During training, only the mapper is trainable; all other components are frozen.
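The data pipeline can be sketched as follows, where `generate_style_prompt()` and `irstyle()` are hypothetical stand-ins for the ChatGPT call and the frozen pre-trained IRStyle model.

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5",
                                               torch_dtype=torch.float16).to("cuda")

def make_triplet(content_image, generate_style_prompt, irstyle):
    prompt = generate_style_prompt()              # e.g. "a gray, elegant vintage tone"
    style_image = pipe(prompt).images[0]          # generated reference style image I_s
    target = irstyle(content_image, style_image)  # pseudo ground truth O from frozen IRStyle
    return content_image, prompt, target          # (input, input, label)
```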
Discussion. Directly using the CLIP features to guide the color style transfer as (Radford et al. 2021) is another choice. The reasons behind our choice of stable diffusion features over CLIP features are discussed in the appendix.
4 Experiments
We evaluate our method in the image reference setting (Sec. 4.1) and the text reference setting (Sec. 4.2). Evaluations on video style transfer are provided in the appendix.

Table 1: Quantitative comparison and user study for image-reference color style transfer.

| Method | PhotoNAS | PhotoWCT2 | PCA-KD | DNCM | CAP-VST | Ours |
| --- | --- | --- | --- | --- | --- | --- |
| Style Gram loss ↓ | 2.9132 | 1.8485 | 2.2991 | 3.4972 | 0.9310 | 1.2707 |
| Content SSIM ↑ | 0.7132 | 0.7136 | 0.7138 | 0.7777 | 0.7618 | 0.7913 |
| Style score ↑ | 0.8167 | 0.7873 | 0.8134 | 0.7325 | 0.8647 | 0.8262 |
| User score ↑ | 2.43 | 2.59 | 2.70 | 2.97 | 3.35 | 3.61 |
4.1 Image Reference Experiments
Comparisons with Other Methods
We compare IRStyle on the image-reference color transfer task with recent methods: PhotoWCT2 (Chiu and Gurari 2022b), PhotoNAS (An et al. 2020), PCA-KD (Chiu and Gurari 2022a), CAP-VST (Wen, Gao, and Zou 2023), and DNCM (Ke et al. 2023). We use their publicly available pre-trained models and code for evaluation. We do not compare time cost with DNCM since it only provides an online demo.
Qualitative Results. Fig. 5 shows the visual comparison with other methods. We can see, in most cases, PhotoWCT2, PCA-KD, and PhotoNAS exhibit significant visual noise and loss of detail. While CAP-VST mitigates this problem, it still presents visual noise (e.g., the sky in Fig. 5 (d)). DNCM eliminates visual artifacts but struggles to preserve color similarity (e.g., the tree in Fig. 5 (b)). Compared with the existing methods, our method faithfully maintains image details and delivers superior stylization results. Besides, our method ensures consistent image stylization without artifacts, even with significant variations in color style within the input (e.g., Fig. 5 (a)).
Table 2: Inference time / GPU memory at different resolutions, and model size.

| Resolution | PhotoNAS | PhotoWCT2 | PCA-KD |
| --- | --- | --- | --- |
| FHD (1920×1080) | 0.59s / 15.6GB | 0.3s / 14GB | 0.05s / 7GB |
| 2K (2560×1440) | 0.99s / 23.9GB | 0.45s / 20GB | 0.06s / 11GB |
| 4K (3840×2160) | OOM | 1s / 23.8GB | 0.1s / 16GB |
| 8K (7680×4320) | OOM | OOM | OOM |
| Model Size | 40.2M | 7M | 73K |

| Resolution | CAP-VST | DNCM | Ours |
| --- | --- | --- | --- |
| FHD (1920×1080) | 1.09s / 24GB | - / 1.96GB | 0.019s / 3GB |
| 2K (2560×1440) | 1.1s / 24GB | - / 1.96GB | 0.019s / 3GB |
| 4K (3840×2160) | 1.1s / 24GB | - / 1.96GB | 0.021s / 3GB |
| 8K (7680×4320) | 1.1s / 24GB | - / 1.96GB | 0.022s / 3GB |
| Model Size | 4M | 5.15M | 24M |
Quantitative Results. Following previous work (Chiu and Gurari 2022a; Wen, Gao, and Zou 2023; Ke et al. 2023), we employ three metrics for evaluation: Style Gram loss (Gatys, Ecker, and Bethge 2016) and Style score (Ke et al. 2023) to measure style similarity, and Content SSIM (Ke et al. 2023) to measure content similarity. Table 1 shows that our IRStyle provides the best trade-off between content and style similarity. Although CAP-VST achieves higher style similarity scores, its visual artifacts lead to worse visual quality (e.g., Fig. 5 (d)). More visualizations are shown in the appendix.
User Study. We further conduct a user study to evaluate the subjective quality of different methods. We invite 40 users and show them 20 randomly selected images from the test set, each consisting of an input image, a reference style image, and six randomly shuffled transfer results. Participants rate the overall stylization quality of each transfer result on a scale of 1 to 5, focusing on style and content similarity, photorealism, and the visual appeal of the color style. We then compute the average score for each method. Table 1 shows that our method is predominantly favored by users. Detailed analyses are provided in the appendix.
Table 3: Ablation study of IRStyle.

| Type | Style Gram loss ↓ | Style score ↑ | Content SSIM ↑ |
| --- | --- | --- | --- |
| w/o interaction | 3.598 | 0.7745 | 0.7547 |
| w/o $\mathcal{L}_{res}$ | 2.096 | 0.8158 | 0.6678 |
| w/o $\mathcal{L}_{cmap}$ | 1.853 | 0.8193 | 0.770 |
| w/o $\mathcal{L}_{hist}$ | 1.385 | 0.8196 | 0.7720 |
| w/o dual-mapping | 1.452 | 0.8294 | 0.7890 |
| Full version | 1.270 | 0.8262 | 0.7913 |


Inference Time and Memory. As shown in Table 2, our method has the fastest inference speed at FHD, 2K, 4K, and 8K resolutions, with nearly a 3× speedup over the fastest state-of-the-art method, PCA-KD. Moreover, the time cost of IRStyle is essentially insensitive to the input resolution in practice. Most WCT-based methods (PhotoWCT2, PhotoNAS, and PCA-KD) demand a large amount of memory, leading to out-of-memory failures on high-resolution images even on GPUs with 32GB of RAM. CAP-VST reduces its memory footprint by uniformly resizing inputs to a fixed lower resolution, but at the cost of sharpness, which defeats the purpose of using 2K and 4K images. DNCM performs better in terms of parameters and memory; we attribute this to its color mapping matrix, which is not the focus of our research. We can also implement such a color mapping matrix to achieve a similar model size with competitive results.
Ablation Studies
In this part, we conduct a systematic empirical study on our IRStyle.
Interaction. We construct the proposed IRStyle without the interaction module (Fig. 3 (b)). The "w/o interaction" row of Table 3 shows that removing the interaction module significantly decreases style similarity and slightly affects content similarity. Fig. 6 also highlights the importance of feature interaction between the content and style images.
Dual-mapping. We construct the proposed IRStyle using the architecture of direct-mapping (Fig. 3 (a)). Table 3 w/o dual-mapping indicates that the dual-mapping design can enhance both style and content similarity. The visualization in Fig. 6 further validates this conclusion.
Supervision Functions. We validate the effectiveness of each proposed supervision term, i.e., $\mathcal{L}_{res}$, $\mathcal{L}_{cmap}$, and $\mathcal{L}_{hist}$. Both Table 3 and Fig. 6 confirm the significance of each term: employing all of them yields the best color style transfer results.

4.2 Text Reference Results
Comparison with Other Methods
We compare our TRStyle with four state-of-the-art methods: SDEdit (Meng et al. 2021), InstructPix2Pix (Brooks, Holynski, and Efros 2023), MGIE (Fu et al. 2023), and L-CAD (Weng et al. 2024). Since L-CAD is a text-guided colorization method that only accepts grayscale input, we convert the original image to grayscale for its inference. As shown in Fig. 7, our method outperforms the other techniques in terms of photorealism, stylization, and visual expressiveness. SDEdit exhibits content distortion and inconsistency between the color style and the text description, primarily because it focuses on content editing and lacks sufficient training for color style transfer. While InstructPix2Pix and MGIE achieve better style consistency, they also produce images with content distortion, particularly in portrait scenes (e.g., the eyes in Fig. 7 (c)), and their color styles are less visually attractive than ours. L-CAD shows no shortcomings in terms of content but is relatively weak in stylization (e.g., Fig. 7 (b)), because such colorization methods typically concentrate on the specific colors of objects rather than the overall color style. More visualizations are shown in the appendix.
Inference Time and Memory. As shown in Table 4, our method surpasses the others in both speed and memory efficiency. We attribute this to the use of features from a single forward pass of Stable Diffusion and to the design of our priors feature mapper.
Table 4: Inference time / GPU memory for text-reference methods.

| Method | MGIE | InstructPix2Pix | SDEdit | L-CAD | Ours |
| --- | --- | --- | --- | --- | --- |
| Time / Memory | 13.8s / 40GB | 9.67s / 15.5GB | 8.97s / 15.5GB | 17.6s / 14GB | 0.316s / 12.5GB |
Ablation Studies
We provide a detailed ablation analysis of different configurations of our method.
Comparison with the Vanilla Way. We compare the results of the vanilla way (Sec. 3.2) with our TRStyle. As shown in Fig. 8, our method achieves results similar to the vanilla way but with a much shorter running time (a single U-Net forward pass versus the full multi-step denoising). This demonstrates the effectiveness of our feature mapper and synthetic data collection method.
Comparison with Features from the Encoder. We construct TRStyle with features extracted from the encoder of the U-Net instead. As shown in Fig. 8, our choice (features extracted from the decoder) achieves greater consistency between the final result and the text prompt. The reason is that the decoder contains more information about the text prompt, as confirmed in (Cao et al. 2023).
5 Conclusion
In this paper, we propose a universal multi-modality reference color style transfer architecture named MRStyle, which accepts prompts from either images or text as references. To the best of our knowledge, this is the first work to unify the text and image modalities for color style transfer. Benefiting from the proposed interaction dual-mapping network and the combined supervised learning pipeline, our method shows significant improvements over existing methods in various aspects when using image references. In addition, owing to the proposed efficient priors feature mapper and data construction method, our method is both effective and efficient when accepting text references.
References
- Alayrac et al. (2022) Alayrac, J.-B.; Donahue, J.; Luc, P.; Miech, A.; Barr, I.; Hasson, Y.; Lenc, K.; Mensch, A.; Millican, K.; Reynolds, M.; et al. 2022. Flamingo: a visual language model for few-shot learning. NeurIPS.
- An et al. (2020) An, J.; Xiong, H.; Huan, J.; and Luo, J. 2020. Ultrafast Photorealistic Style Transfer via Neural Architecture Search. In AAAI.
- Bau et al. (2021) Bau, D.; Andonian, A.; Cui, A.; Park, Y.; Jahanian, A.; Oliva, A.; and Torralba, A. 2021. Paint by word. arXiv preprint arXiv:2103.10951.
- Brooks, Holynski, and Efros (2023) Brooks, T.; Holynski, A.; and Efros, A. A. 2023. Instructpix2pix: Learning to follow image editing instructions. CVPR.
- Cao et al. (2023) Cao, M.; Wang, X.; Qi, Z.; Shan, Y.; Qie, X.; and Zheng, Y. 2023. MasaCtrl: Tuning-Free Mutual Self-Attention Control for Consistent Image Synthesis and Editing. arXiv preprint arXiv:2304.08465.
- Chen et al. (2023) Chen, Y.; Yang, H.; Yang, Y.; Liu, Y.; Wang, W.; Wen, X.; and Xie, C. 2023. NLUT: Neuarl-based 3D Lookup Tables for Video Photorealistic Style Transfer. arXiv preprint arXiv:2303.09170.
- Chiu and Gurari (2022a) Chiu, T.-Y.; and Gurari, D. 2022a. Pca-based knowledge distillation towards lightweight and content-style balanced photorealistic style transfer models. In CVPR.
- Chiu and Gurari (2022b) Chiu, T.-Y.; and Gurari, D. 2022b. PhotoWCT2: Compact Autoencoder for Photorealistic Style Transfer Resulting from Blockwise Training and Skip Connections of High-Frequency Residuals. In WACV.
- Cong et al. (2022) Cong, W.; Tao, X.; Niu, L.; Liang, J.; Gao, X.; Sun, Q.; and Zhang, L. 2022. High-Resolution Image Harmonization via Collaborative Dual Transformations. In CVPR.
- Diederik and Jimmy (2015) Diederik, K.; and Jimmy, B. 2015. Adam: A Method for Stochastic Optimization. CoRR.
- Fu et al. (2023) Fu, T.-J.; Hu, W.; Du, X.; Wang, W. Y.; Yang, Y.; and Gan, Z. 2023. Guiding instruction-based image editing via multimodal large language models. ICLR.
- Gao et al. (2023) Gao, S.; Lin, Z.; Xie, X.; Zhou, P.; Cheng, M.-M.; and Yan, S. 2023. EditAnything: Empowering Unparalleled Flexibility in Image Editing and Generation. In ACMMM, Demo track.
- Gatys, Ecker, and Bethge (2016) Gatys, L. A.; Ecker, A. S.; and Bethge, M. 2016. Image Style Transfer Using Convolutional Neural Networks. In CVPR.
- Goodfellow et al. (2014) Goodfellow, I. J.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A. C.; and Bengio, Y. 2014. Generative Adversarial Nets. In NeurIPS.
- Hessel et al. (2021) Hessel, J.; Holtzman, A.; Forbes, M.; Le Bras, R.; and Choi, Y. 2021. CLIPScore: A Reference-free Evaluation Metric for Image Captioning. In EMNLP, 7514–7528.
- Ho and Zhou (2021) Ho, M. M.; and Zhou, J. 2021. Deep Preset: Blending and Retouching Photos with Color Style Transfer. In WACV.
- Huang et al. (2023) Huang, L.; Chen, D.; Liu, Y.; Shen, Y.; Zhao, D.; and Zhou, J. 2023. Composer: Creative and controllable image synthesis with composable conditions. arXiv preprint arXiv:2302.09778.
- Huang and Belongie (2017) Huang, X.; and Belongie, S. 2017. Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization. In ICCV.
- Huang et al. (2024) Huang, Y.; Huang, J.; Liu, Y.; Yan, M.; Lv, J.; Liu, J.; Xiong, W.; Zhang, H.; Chen, S.; and Cao, L. 2024. Diffusion model-based image editing: A survey. arXiv preprint arXiv:2402.17525.
- Huang, Zhao, and Liao (2022) Huang, Z.; Zhao, N.; and Liao, J. 2022. Unicolor: A unified framework for multi-modal colorization with transformer. TOG.
- Huberman-Spiegelglas, Kulikov, and Michaeli (2023) Huberman-Spiegelglas, I.; Kulikov, V.; and Michaeli, T. 2023. An Edit Friendly DDPM Noise Space: Inversion and Manipulations. arXiv preprint arXiv:2304.06140.
- Ke et al. (2023) Ke, Z.; Liu, Y.; Zhu, L.; Zhao, N.; and Lau, R. W. 2023. Neural Preset for Color Style Transfer. In CVPR.
- Kwon and Ye (2022) Kwon, G.; and Ye, J. C. 2022. Clipstyler: Image style transfer with a single text condition. In CVPR.
- Lee et al. (2024) Lee, H.; Kang, K.; Ok, J.; and Cho, S. 2024. CLIPtone: Unsupervised Learning for Text-based Image Tone Adjustment. In CVPR.
- Li et al. (2022) Li, J.; Li, D.; Xiong, C.; and Hoi, S. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In ICML.
- Li et al. (2017) Li, Y.; Fang, C.; Yang, J.; Wang, Z.; Lu, X.; and Yang, M.-H. 2017. Universal Style Transfer via Feature Transforms. In NeurIPS.
- Li et al. (2018) Li, Y.; Liu, M.-Y.; Li, X.; Yang, M.-H.; and Kautz, J. 2018. A Closed-form Solution to Photorealistic Image Stylization. In ECCV.
- Lin et al. (2023) Lin, T.; Lin, H.; Li, F.; He, D.; Wu, W.; Wang, M.; Li, X.; and Liu, Y. 2023. AdaCM: adaptive ColorMLP for real-time universal photo-realistic style transfer. In AAAI.
- Luan et al. (2017) Luan, F.; Paris, S.; Shechtman, E.; and Bala, K. 2017. Deep Photo Style Transfer. In CVPR.
- Meng et al. (2021) Meng, C.; Song, Y.; Song, J.; Wu, J.; Zhu, J.-Y.; and Ermon, S. 2021. Sdedit: Image synthesis and editing with stochastic differential equations. arXiv preprint arXiv:2108.01073.
- Ouyang et al. (2022) Ouyang, L.; Wu, J.; Jiang, X.; Almeida, D.; Wainwright, C.; Mishkin, P.; Zhang, C.; Agarwal, S.; Slama, K.; Ray, A.; et al. 2022. Training language models to follow instructions with human feedback. NeurIPS.
- Patashnik et al. (2021) Patashnik, O.; Wu, Z.; Shechtman, E.; Cohen-Or, D.; and Lischinski, D. 2021. Styleclip: Text-driven manipulation of stylegan imagery. In ICCV.
- Qiao et al. (2021) Qiao, Y.; Cui, J.; Huang, F.; Liu, H.; Bao, C.; and Li, X. 2021. Efficient Style-Corpus Constrained Learning for Photorealistic Style Transfer. TIP.
- Radford et al. (2021) Radford, A.; Kim, J. W.; Hallacy, C.; Ramesh, A.; Goh, G.; Agarwal, S.; Sastry, G.; Askell, A.; Mishkin, P.; Clark, J.; et al. 2021. Learning transferable visual models from natural language supervision. In ICML.
- Rombach et al. (2022) Rombach, R.; Blattmann, A.; Lorenz, D.; Esser, P.; and Ommer, B. 2022. High-resolution image synthesis with latent diffusion models. In CVPR.
- Sergyan (2008) Sergyan, S. 2008. Color histogram features based image classification in content-based image retrieval systems. In 2008 6th international symposium on applied machine intelligence and informatics, 221–224. IEEE.
- Shi et al. (2022) Shi, J.; Xu, N.; Zheng, H.; Smith, A.; Luo, J.; and Xu, C. 2022. SpaceEdit: Learning a Unified Editing Space for Open-Domain Image Color Editing. In CVPR.
- Singh et al. (2022) Singh, A.; Hu, R.; Goswami, V.; Couairon, G.; Galuba, W.; Rohrbach, M.; and Kiela, D. 2022. Flava: A foundational language and vision alignment model. In CVPR.
- Tsung-Yi et al. (2014) Tsung-Yi, L.; Michael, M.; Serge, B.; and James, H. 2014. Microsoft COCO: Common Objects in Context. In ECCV.
- Wen, Gao, and Zou (2023) Wen, L.; Gao, C.; and Zou, C. 2023. CAP-VSTNet: Content Affinity Preserved Versatile Style Transfer. In CVPR.
- Weng et al. (2024) Weng, S.; Zhang, P.; Li, Y.; Li, S.; Shi, B.; et al. 2024. L-CAD: Language-based Colorization with Any-level Descriptions using Diffusion Priors. NeurIPS.
- Wu et al. (2022) Wu, Z.; Zhu, Z.; Du, J.; and Bai, X. 2022. CCPL: contrastive coherence preserving loss for versatile style transfer. In ECCV.
- Xia et al. (2020) Xia, X.; Zhang, M.; Xue, T.; Sun, Z.; Fang, H.; Kulis, B.; and Chen, J. 2020. Joint Bilateral Learning for Real-time Universal Photorealistic Style Transfer. In ECCV.
- Xiaopeng Sun and Fan (2021) Xiaopeng Sun, T. H., Muxingzi Li; and Fan, L. 2021. Enhance Image as You Like with Unpaired Learning. In IJCAI.
- Yoo et al. (2019) Yoo, J.; Uh, Y.; Chun, S.; Kang, B.; and Ha, J.-W. 2019. Photorealistic Style Transfer via Wavelet Transforms. In ICCV.
- Zabari et al. (2023) Zabari, N.; Azulay, A.; Gorkor, A.; Halperin, T.; and Fried, O. 2023. Diffusing Colors: Image Colorization with Text Guided Diffusion. In SIGGRAPH Asia 2023 Conference Papers.
- Zeng et al. (2020) Zeng, H.; Cai, J.; Li, L.; Cao, Z.; and Zhang, L. 2020. Learning Image-adaptive 3D Lookup Tables for High Performance Photo Enhancement in Real-time. IEEE TPAMI.
- Zhang et al. (2022) Zhang, F.; Zeng, H.; Zhang, T.; and Zhang, L. 2022. CLUT-Net: Learning Adaptively Compressed Representations of 3DLUTs for Lightweight Image Enhancement. In ACMMM.
- Zhang et al. (2018) Zhang, R.; Isola, P.; Efros, A. A.; Shechtman, E.; and Wang, O. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 586–595.
Appendix A Synthetic Data for TRStyle Training
We use ChatGPT (Ouyang et al. 2022) and the Stable Diffusion model (Rombach et al. 2022) (v1-5) to build 100,000 text–image pairs. Each pair consists of a sentence of at most 70 words as the style text reference and its corresponding generated style image. We then randomly select content images from the COCO (Tsung-Yi et al. 2014) dataset and construct training triplets following Sec. 3.2. Fig. 9 shows samples of our synthetic data.
Appendix B Implementation Details
B.1 Implementation of IRStyle
Following recent color style transfer methods (Yoo et al. 2019; An et al. 2020; Li et al. 2018), we train our model on images from the MS COCO (Tsung-Yi et al. 2014) dataset. We collect about 6,000 3D LUT files as filters for the paired supervised learning pipeline. For evaluation, we use the test images collected by PhotoNAS (An et al. 2020). We take VGG as the encoder and use CLUT (Zhang et al. 2022) as our color mapping LUT. The inputs are randomly cropped to a fixed patch size, and the network is trained with the Adam (Diederik and Jimmy 2015) optimizer.
B.2 Implementation of TRStyle
We use ChatGPT (Ouyang et al. 2022) and the Stable Diffusion model (Rombach et al. 2022) (v1-5) to build a dataset of 100,000 text–image pairs; details are given in Appendix A. The mapper network in TRStyle is trained with the Adam (Diederik and Jimmy 2015) optimizer, with the initial learning rate decayed after a fixed number of epochs. All experiments are conducted on a single Tesla V100 32GB GPU.
Appendix C Performance on Video Color Style Transfer
For image reference, we compare our method with state-of-the-art methods (Wu et al. 2022; Wen, Gao, and Zou 2023). For quantitative evaluation, we gather 25 pairs of video clips and corresponding style images covering various scenes from the Internet. In line with (Wu et al. 2022; Wen, Gao, and Zou 2023), we adopt the temporal loss to measure temporal consistency.
Table 5: Quantitative comparison on video color style transfer (the two temporal loss values correspond to the two evaluation settings of the original table).

| Method | Content SSIM ↑ | Style Gram loss ↓ | Temporal loss ↓ | Time cost |
| --- | --- | --- | --- | --- |
| CCPL | 0.7007 | 1.79 | 0.092 / 0.173 | 19.27s |
| CAP-VST | 0.7446 | 0.98 | 0.071 / 0.132 | 160s |
| Ours | 0.7766 | 1.48 | 0.054 / 0.108 | 1.39s |

For each video, we generate the LUT with IRStyle on the initial frame and apply it to all frames, which ensures rapid execution and consistent results across frames. The results in Table 5 show that our framework yields competitive results against the other methods in terms of temporal consistency, content similarity, style effect, and time cost. In practical scenarios, we can employ a simple scene detection method, such as a Lab histogram comparison (Sergyan 2008), to segment video scenes; reusing the same LUT within each scene then gives the best trade-off between temporal consistency and style effect.
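A possible realization of this practical pipeline is sketched below; the histogram settings and the scene-change threshold are illustrative assumptions, and `predict_lut` / `apply_lut` stand in for IRStyle's LUT prediction and the per-frame LUT mapping.

```python
import cv2

def lab_hist(frame_bgr, bins=32):
    lab = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2LAB)
    hist = cv2.calcHist([lab], [0, 1, 2], None, [bins] * 3, [0, 256] * 3)
    return cv2.normalize(hist, hist).flatten()

def stylize_video(frames_bgr, style_image, predict_lut, apply_lut, thresh=0.5):
    outputs, lut, prev_hist = [], None, None
    for frame in frames_bgr:
        hist = lab_hist(frame)
        new_scene = prev_hist is None or \
            cv2.compareHist(prev_hist, hist, cv2.HISTCMP_CORREL) < thresh
        if new_scene:
            lut = predict_lut(frame, style_image)   # run IRStyle once per scene
        outputs.append(apply_lut(frame, lut))       # cheap per-frame LUT mapping
        prev_hist = hist
    return outputs
```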
In Fig. 10, we present a series of stylized frames from our MRStyle, including image and text reference. The style remains consistent across frames and the resulting video is notably stable.
Appendix D User Study of IRStyle

Table 6: Quantitative comparison and user study for text-reference color style transfer.

| Method | MGIE | InstructPix2Pix | SDEdit | L-CAD | Ours |
| --- | --- | --- | --- | --- | --- |
| CLIP Score ↑ | 0.2448 | 0.2731 | 0.2366 | 0.2264 | 0.2466 |
| Content SSIM ↑ | 0.6796 | 0.6698 | 0.4111 | 0.7508 | 0.7527 |
| LPIPS ↓ | 0.6796 | 0.3212 | 0.3612 | 0.2531 | 0.2138 |
| User score ↑ | 2.89 | 3.46 | 2.57 | 3.05 | 3.72 |


The overall stylization quality is rated from 1 (least satisfactory) to 5 (highly satisfactory), with 3 indicating acceptable results. Ratings of 1 and 2 denote negative outcomes, 3 a neutral outcome, and 4 and 5 positive outcomes. Fig. 11 shows the distribution of ratings for each method. Our method has the highest proportion of neutral-to-positive ratings, suggesting its effectiveness across a diverse range of cases.
Appendix E Quantitative Experiments of TRStyle
Since there is currently no standardized benchmark for evaluating text-reference color style transfer, we create a test benchmark of 40 samples, each consisting of an input image and a style text prompt. The images are sourced from Unsplash (https://unsplash.com/) and Civitai (https://civitai.com/), covering four distinct scenes: buildings, sky, portraits, and mountains. The style text prompts are generated with ChatGPT (Ouyang et al. 2022), e.g., "A gray tone, elegant and vintage style". Following the experimental setting of IRStyle, we employ four metrics for evaluating TRStyle: CLIP Score (Hessel et al. 2021) to measure style-text similarity, Content SSIM (Ke et al. 2023) and LPIPS (Zhang et al. 2018) to measure content similarity, and a user score for human evaluation. Table 6 shows that our TRStyle achieves the best trade-off between content and style similarity and aligns most closely with user preferences. Our approach does not obtain the highest CLIP score, which we attribute to the fact that image editing methods often modify the content of the image, leading to a higher CLIP score.
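For reference, a CLIP-based style-text similarity in the spirit of CLIPScore can be computed as in the following sketch; the checkpoint choice and the 2.5 rescaling follow Hessel et al. (2021), while the exact protocol used for Table 6 may differ.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image, text):
    # Encode the stylized image and the style text, then compare their embeddings.
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    out = model(**inputs)
    cos = torch.nn.functional.cosine_similarity(out.image_embeds, out.text_embeds)
    return (2.5 * cos.clamp(min=0)).item()
```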


Appendix F Discussion of the CLIP Prior
We chose stable diffusion features over CLIP features for three primary reasons. First, our IRStyle operates in image space, which aligns more naturally with the SD’s features due to their lower-level, pixel-wise representation. This contrasts with CLIP’s inclination towards capturing abstract, high-level information, which lacks the granularity of pixel-wise guidance. Second, the powerful image generation capability of SD can provide aesthetic priors for the final result, leading to better visual effects. Third, the same style description often has different tones, which is consistent with the randomness of the diffusion model. In contrast, CLIP tends to establish a fixed-style correspondence per prompt, which restricts variability.
We further replace the SD prior in our proposed TRStyle with a CLIP prior, i.e., directly extracting text features from CLIP. As shown in Fig. 15, the CLIP prior setting produces poor transfer results. This supports the first reason above: compared with CLIP, the features from SD align better with the style features of IRStyle, as both operate at the image level. We posit that a carefully designed mapper such as CLIPtone (Lee et al. 2024), which integrates a directional color vector, could partially alleviate this issue. As shown in Fig. 15, CLIPtone shows better text consistency than our CLIP prior variant. However, CLIPtone mainly concentrates on the color terms (e.g., 'blue' and 'green') and overlooks the context related to these colors (e.g., 'modern' and 'aurora'). Moreover, compared with our SD prior method (TRStyle), its color tones appear overly saturated. This aligns with our earlier analysis that the SD-based approach can leverage the aesthetic priors of SD and thus yields better visual results.
Appendix G Simultaneous Text and Image References
We extend our MRStyle to support simultaneous text and image references. The style features of the text and image references are denoted $F_{text}$ and $F_{img}$, respectively. Since the style information of text and images is aligned into a common space, we can mix the two styles as $F_{mix} = \alpha F_{img} + (1-\alpha) F_{text}$, where $\alpha \in [0, 1]$. We then use $F_{mix}$ as the style feature in our interaction dual-mapping network to complete the transfer. Fig. 12 illustrates our method's ability to take text and image references simultaneously. By adjusting the weight $\alpha$, we can balance the contributions of the two modalities; as $\alpha$ increases, the transfer becomes more biased towards the style indicated by the image reference. When the text and image references differ substantially, our method may produce unconventional or unexpected results; a comprehensive analysis of this case is beyond the scope of this paper.

Appendix H Limitation
For IRStyle, blurriness in the input can be amplified in the output, as shown in Fig. 16 (a), where JPEG artifacts are amplified, especially in the background. Moreover, it cannot perform local color adjustments; Fig. 16 (b) illustrates its inability to distinguish elements with similar colors but different semantics, such as the sky and the river. For text reference, since TRStyle builds on the priors of IRStyle and Stable Diffusion, the above limitations of IRStyle carry over to TRStyle. Furthermore, the style similarity is influenced by the generation ability of Stable Diffusion: as shown in Fig. 16 (c), the transfer result is inconsistent with the text description because Stable Diffusion produces a brown-toned reference image.
Appendix I More Results
I.1 Visual Results of IRStyle
We provide more visual results of IRStyle in Fig. 13.
I.2 Visual Results of TRStyle
We provide more visual results of TRStyle in Fig. 14. In reality, a single scene can correspond to various color styles; for instance, a winter scene can be associated with a snowy white style or vintage tones. Existing image editing software usually provides multiple filters for the same scene (e.g., Winter 1, Winter 2). Leveraging the Stable Diffusion priors, our method can generate multiple styles for the same scene from a simple description by using different noise seeds, as shown in Fig. 17 (a). Our approach also supports detailed descriptions of color styles, which reduce generation ambiguity, as shown in Fig. 17 (b).
