ITportrait: Image-Text Coupled 3D Portrait Domain Adaptation
Abstract.
Domain adaptation of 3D portraits has attracted increasing attention. However, the transfer mechanism of existing methods is mainly based on vision or language alone, ignoring the potential of combined vision-language guidance. In this paper, we propose an Image-Text multi-modal framework, namely Image and Text portrait (ITportrait), for 3D portrait domain adaptation. ITportrait relies on a two-stage alternating training strategy. In the first stage, we employ a 3D Artistic Paired Transfer (APT) method for image-guided style transfer. APT constructs paired photo-realistic portraits to obtain accurate artistic poses, which helps ITportrait achieve high-quality 3D style transfer. In the second stage, we propose a 3D Image-Text Embedding (ITE) approach in the CLIP space. ITE uses a threshold function to self-adaptively control the optimization direction of images or texts in the CLIP space. Comprehensive experiments prove that our ITportrait achieves state-of-the-art (SOTA) results and benefits downstream tasks. All source code and pre-trained models will be released to the public.

1. Introduction
Artistic portraits have many applications (Pinkney and Adler, 2020; Yang et al., 2022; Men et al., 2022; Huang and Belongie, 2017; Gatys et al., 2016) in our daily lives, especially in industries related to animation, art, and the metaverse. As shown in Fig. 1, artistic portrait creation can be regarded as a portrait domain adaptation task, which covers transferring the artistic style, the cross-species identity, and the expression or shape change. Current domain adaptation methods mainly fall into two categories: vision-based guidance (artistic images (Yang et al., 2022; Liu et al., 2021; Chong and Forsyth, 2022)) and language-based guidance (text descriptions (Patashnik et al., 2021; Yu et al., 2022)). Combining image-guided and text-driven guidance can not only transfer the precise and detailed style of a reference image but also offer the flexible editing ability of text. Therefore, Image-Text coupled guidance provides better style controllability and artistic merit (Schaldenbrand et al., 2022; Frans et al., 2021). However, the potential of Vision-Language (Image-Text) multi-modal guidance remains under-explored.
Mixing styles from an artistic image and a text description is challenging. To achieve image-text coupled domain adaptation, we first need to implement the transfer of the artistic image style and of the text description, respectively. Nevertheless, as shown in Fig. 1, when directly applying style transfer and text editing successively, the style of the previous stage is lost. In addition, compared with 2D GAN-based methods, 3D GANs (Abdal et al., 2023; Gu et al., 2021; Sun et al., 2022a; Zhou et al., 2021; Chan et al., 2021) are more advantageous for domain adaptation tasks due to their multi-view consistent synthesis. However, achieving portrait domain adaptation on a 3D GAN is even more difficult, for two reasons: 1) From the image-guided perspective, it is challenging for existing methods (Abdal et al., 2023; Jin et al., 2022; Xu et al., 2022) to achieve high-quality 3D style transfer, because the pose of the artistic style reference image is difficult to estimate. 2) From the text-driven aspect, previous 3D portrait text-driven methods (Sun et al., 2022b, a) may cause geometric collapse (Fig. 10) when supervised from a single view (Gal et al., 2021; Kwon and Ye, 2022).
To cope with the above problems, we propose a multi-modal framework, ITportrait (short for "Image and Text"), which supports 3D portrait domain adaptation jointly guided by images and texts. Inspired by CLIP's ability to encode both images and texts, we embed image and text features in the CLIP space to mix image and text styles simultaneously. However, using a single image embedding in the CLIP space as guidance leads to overfitting (Gal et al., 2021). Hence, we additionally train a stylized generator to produce sufficient stylized samples, which are embedded in the CLIP space as image guidance to prevent overfitting. Subsequently, we propose a two-stage alternating training strategy. In the first stage, we design a 3D Artistic Paired Transfer (APT) method for portrait style transfer. We construct paired photo-realistic portrait images, which help obtain accurate pose estimates for art reference images; by this means, high-quality one-shot style transfer can be realized. In the second stage, we propose an Image-Text Embedding (ITE) strategy that includes a text-guided direction and an image-guided direction in the CLIP space. More specifically, the image guidance transfers the global style while the text guidance edits the local portrait. We employ a threshold function to control the direction of domain adaptation in the CLIP space. Furthermore, a 3D multi-view augmentation strategy is proposed to prevent geometry collapse and improve the rendering quality.
To sum up, our contributions are listed as follows:
• We propose a multi-modal domain adaptation framework, ITportrait. To the best of our knowledge, this work is the first attempt to explore the potential of Image-Text coupled domain adaptation for 3D portraits.
• We customize an Image-Text embedding approach, namely ITE, to self-adaptively control the fusion of image and text guidance in the CLIP space. ITE not only employs an artistic paired transfer method (APT) to avoid one-shot stylization overfitting but also constructs 3D multi-view supervision to improve the geometric quality of text-driven editing.
• Our ITportrait boosts downstream 3D-aware one-shot portrait tasks, including image-guided stylization, text-driven manipulation, and Image-Text coupled domain adaptation. ITportrait also pushes the frontier of view-consistent editing for photo-realistic and art-drawing portraits.
2. Related Work
2.1. Artistic Portrait Generation
The artistic portrait is generally generated by GANs. Many works in the 2D vision field are based on StyleGAN (Karras et al., 2019). For instance, Toonify (Pinkney and Adler, 2020) achieves style transfer by exchanging the layers of StyleGAN. BlendGAN (Liu et al., 2021) performs style transfer by training an MLP and injecting style into StyleGAN. JOJOGAN (Chong and Forsyth, 2022) constructs a paired dataset to avoid overfitting. Recently, DualStyleGAN (Yang et al., 2022) yields high-quality style transfer by training a dual-path StyleGAN. With the development of 3D GANs (Chan et al., 2022; Sun et al., 2022a; Gu et al., 2021), more and more works study 3D artistic portrait generation. For example, Dr.3D (Jin et al., 2022) and 3DAvatarGAN (Abdal et al., 2023) are trained on drawing datasets (Pinkney and Adler, 2020; Yang et al., 2022) to achieve domain adaptation. However, such large-scale corresponding domain datasets are tedious and labor-intensive to obtain. The closest work to ours is Your3dEmoji (Xu et al., 2022), which can complete a one-shot 3D style transfer. Nevertheless, Your3dEmoji cannot obtain the pose of artistic reference images, which leads to time-consuming transfer learning and poor results. Besides, IDE-3D (Sun et al., 2022a) and Next3D (Sun et al., 2022b) use the CLIP-based method StyleGAN-NADA (Gal et al., 2021) to generate artistic portraits. However, the corresponding CLIP-based augmentation methods (i.e., perspective augmentations (Frans et al., 2021; Kwon and Ye, 2022)) are all designed for 2D vision and only consider front-view supervision for 3D portraits, which may lead to poor generation of the side of the portrait or geometry collapse, as shown in Fig. 10. In contrast, our APT predicts an accurate artistic pose to improve the quality of image-based one-shot style transfer, and we construct 3D multi-view samples in the CLIP space to enhance the portrait rendering quality.

2.2. Image-Text Based Domain Adaption
Portrait domain adaptation can be divided into two categories: image-guided methods and text-driven methods. The earliest image-guided methods can be regarded as image-to-image translation (Alaluf et al., 2021; Patashnik et al., 2021; Shao and Zhang, 2021; Xie et al., 2021; Zhao et al., 2020). Current image-guided methods such as DynaGAN (Kim et al., 2022) and StyleDomain (Alanov et al., 2022) transfer styles by modifying the structure of StyleGAN (Karras et al., 2020, 2021, 2019). The main disadvantage of image-guided transfer is that it only supports transferring the global style, while local editing of portraits is less flexible than with text-based methods. For text-guided transfer, the CLIP model (Radford et al., 2021) has become the mainstream backbone of text-driven synthesis since it learns a joint embedding space of images and text. For example, StyleCLIP (Patashnik et al., 2021) achieves high-fidelity manipulation by exploring the latent space of StyleGAN, and CF-CLIP (Yu et al., 2022) supports better text-driven controllable face editing. The main disadvantage of text-driven methods is that, although local editing of portraits is convenient, the global cross-domain transfer of portraits is less controllable than with image-based methods. Recently, researchers have proposed Image-Text coupled methods for domain adaptation. StyleCLIPDraw (Schaldenbrand et al., 2022) and CLIPDraw (Frans et al., 2021) propose Image-Text coupled generation approaches. However, their approaches generate from scratch rather than further manipulating specified images. StyleGAN-NADA (Gal et al., 2021) proposes a direction loss for images and text in the CLIP space. Although StyleGAN-NADA can integrate image and text styles, it requires many art reference images as guidance; otherwise, single-image guidance in the CLIP space changes the gender and expression of the content image, leading to overfitting and poor style transfer results. In contrast, we construct an alternating training method and additionally train a stylized generator to produce many style reference samples, which solves the overfitting problem of one-shot image embedding in the CLIP space.
3. Preliminaries
EG3D. We first briefly review the network architecture of a SOTA 3D-aware generator, EG3D (Chan et al., 2022), from which the generators used in our framework are fine-tuned. Specifically, EG3D starts with randomly sampled GAN latent codes; a StyleGAN-based (Karras et al., 2019) feature generator converts the latent codes into 2D features and maps them onto 3D tri-planes. An MLP decoder predicts color and density for 3D points projected onto the tri-planes. Finally, volume rendering generates an image at the given camera pose. This process can be formulated as follows:
(1) $I = G(w, c;\, \theta)$
where $w$ is the latent code, $c$ is the camera pose, $\theta$ denotes the weight parameters of the EG3D generator $G$, and $I$ is the generated image.
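As a concrete illustration of the rendering step in Eq. (1), the following sketch shows generic NeRF-style alpha compositing along camera rays; the random tensors merely stand in for the densities and colors that EG3D's tri-plane features and MLP decoder would produce, so this is only a minimal sketch rather than EG3D's actual implementation.

```python
import torch

def volume_render(densities, colors, deltas):
    """Composite per-ray samples into pixel colors (standard NeRF-style alpha compositing).
    densities: (R, S), colors: (R, S, 3), deltas: (R, S) distances between samples."""
    alphas = 1.0 - torch.exp(-densities * deltas)                    # opacity of each sample
    trans = torch.cumprod(1.0 - alphas + 1e-10, dim=-1)              # accumulated transparency
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alphas * trans                                         # contribution of each sample
    return (weights.unsqueeze(-1) * colors).sum(dim=-2)              # (R, 3) rendered pixel colors

# toy stand-ins for the decoder outputs along 1024 rays with 48 samples each
rays, samples = 1024, 48
pixels = volume_render(torch.rand(rays, samples),
                       torch.rand(rays, samples, 3),
                       torch.full((rays, samples), 0.02))
```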
CLIP-guided Loss. OpenAI proposes CLIP (Radford et al., 2021), a high-performance text-image embedding model trained on 400 million text-image pairs. It consists of two encoders that encode images and text into 512-dimensional embeddings, respectively. A later work, StyleGAN-NADA (Gal et al., 2021), designs a direction CLIP loss to align the CLIP space directions between the source and target text-image pairs as
(2)
$\Delta T = E_{T}(t_{tar}) - E_{T}(t_{src}), \qquad \Delta I = E_{I}(I_{tar}) - E_{I}(I_{src}),$
$\mathcal{L}_{dir} = 1 - \dfrac{\Delta I \cdot \Delta T}{\lVert \Delta I \rVert \, \lVert \Delta T \rVert},$
where $E_{I}$ and $E_{T}$ are the image and text encoders of CLIP, and $t_{tar}$ and $t_{src}$ denote the target style text and the input content text, respectively. When we use natural images as content, $t_{src}$ is set to "photo". $I_{src}$ refers to the source image, $I_{tar}$ represents the manipulated image, and $\mathcal{L}_{dir}$ indicates the direction CLIP loss.
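A minimal sketch of this directional loss is given below; in practice the 512-D embeddings would come from CLIP's image and text encoders, but random tensors are used here so the snippet stays self-contained.

```python
import torch
import torch.nn.functional as F

def clip_direction_loss(e_img_src, e_img_tgt, e_txt_src, e_txt_tgt):
    """Directional CLIP loss of Eq. (2): align the CLIP-space shift of the images
    with the CLIP-space shift of the text prompts (1 - cosine similarity)."""
    delta_i = e_img_tgt - e_img_src          # image direction
    delta_t = e_txt_tgt - e_txt_src          # text direction
    return (1.0 - F.cosine_similarity(delta_i, delta_t, dim=-1)).mean()

# usage with random embeddings standing in for CLIP encoder outputs
e = lambda: torch.randn(4, 512)
loss = clip_direction_loss(e(), e(), e(), e())
```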
4. Method
In this section, we describe the proposed ITportrait framework. Our goal is to achieve Image-Text coupled domain adaptation of 3D portraits, so we embed the image and text features in the CLIP space. Yet, using a single image as guidance leads to overfitting (Gal et al., 2021). Hence, we design an alternating training approach, as shown in Fig. 2, that includes two stages: the one-shot style transfer stage in Sec. 4.1 and the Image-Text fusion stage in Sec. 4.2. The one-shot style transfer stage (APT) generates sufficient stylization images to prevent portraits from overfitting. The Image-Text fusion stage (ITE) embeds the stylization images and text descriptions in the CLIP space for Image-Text fusion adaptation. Eventually, ITportrait gradually achieves Image-Text coupled domain adaptation by alternately training the two stages.
4.1. Image-guided One-shot Stylization
In this section, we propose an Artistic Paired Transfer (APT) method to achieve image-guided 3D style transfer, as shown in the blue part of Fig. 2. Specifically, we first propose a GAN inversion method to obtain the latent code and pose of artistic images. Then we use perturbations to construct paired datasets for stylization.
To begin with, we propose an artistic GAN inversion approach in Eq. (3) to predict the latent code $w$ and pose $c$. In practice, a randomly initialized pose (Ko et al., 2023; Yin et al., 2022) can easily lead to optimization failures. Therefore, we need to obtain an accurate initial pose for the artistic image. As shown in Fig. 3, we observe that an existing 2D GAN inversion method (Tov et al., 2021) can construct a well-aligned real photograph from the artistic reference image. Although estimating the pose of artistic images is difficult, the pose of a portrait in the photo-realistic domain is easy to estimate. Hence, our optimization-based GAN inversion approach to obtain $w$ and $c$ can be formulated as follows:
(3) $w_{per} = \alpha\, w_{e} + (1-\alpha)\, M(n), \qquad I_{real} = G_{2D}(w_{per})$
(4) $w^{*}, c^{*} = \arg\min_{w,\,c} \mathcal{L}\big(G(w, c;\, \theta),\ I_{style}\big)$
where $w_{e}$ is obtained from the style image $I_{style}$ by the 2D encoder-based GAN inversion method ReStyle (Tov et al., 2021), $n$ is randomly sampled noise, and $M$ denotes the style mapping layers of StyleGAN (Karras et al., 2019). $\alpha$ is an interpolation hyperparameter, and $w_{per}$ is obtained by perturbing mapping layers 13-18 of $w_{e}$. The pre-trained parameters of the 2D generator $G_{2D}$ are fixed, and $I_{real}$ refers to the aligned photo-realistic image obtained from StyleGAN. Thus, we can employ an off-the-shelf estimator (Deng et al., 2019) to predict the pose from this real photograph as a proxy for the artistic reference image; the predicted pose is then used to initialize $c$ in Eq. (4). $I_{style}$ is the style reference image, and $G(w, c;\,\theta)$ renders the aligned paired image from EG3D. The loss $\mathcal{L}$ combines reconstruction terms, including LPIPS (Zhang et al., 2018). In this way, our optimization-based artistic GAN inversion in Eq. (4) obtains the latent code $w^{*}$ and pose $c^{*}$ of the artistic image. Please refer to our supplementary materials for more details about the artistic GAN inversion (Eq. (3) and Eq. (4)).
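The sketch below outlines this two-step inversion under stated assumptions: `encoder_2d`, `stylegan_2d`, `pose_estimator`, and `eg3d` are hypothetical callables standing in for the pre-trained networks, the W+ code is assumed to have shape (batch, layers, 512), and the interpolation and loss are simplified placeholders rather than the exact formulation.

```python
import torch

def artistic_gan_inversion(style_img, encoder_2d, stylegan_2d, pose_estimator,
                           eg3d, loss_fn, w_init, alpha=0.5, steps=300, lr=1e-2):
    # Eq. (3): invert the artistic image with a 2D encoder, perturb the late mapping
    # layers, and let the frozen 2D StyleGAN render an aligned, photo-realistic
    # portrait whose pose is easy to estimate.
    with torch.no_grad():
        w_e = encoder_2d(style_img)                                    # assumed W+ code (B, 18, 512)
        noise_w = stylegan_2d.mapping(torch.randn(w_e.shape[0], 512))  # assumed z -> w helper
        w_per = w_e.clone()
        w_per[:, 13:18, :] = alpha * w_e[:, 13:18, :] + (1 - alpha) * noise_w.unsqueeze(1)
        photo = stylegan_2d(w_per)                                     # aligned photo-realistic render
        cam_init = pose_estimator(photo)                               # off-the-shelf pose prediction

    # Eq. (4): optimize the EG3D latent code and camera pose against the style image,
    # starting from the estimated pose instead of a random one.
    w = w_init.clone().detach().requires_grad_(True)                   # e.g. an average latent code
    cam = cam_init.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([w, cam], lr=lr)
    for _ in range(steps):
        recon = eg3d(w, cam)                                           # assumed callable G(w, c)
        loss = loss_fn(recon, style_img)                               # reconstruction terms, e.g. LPIPS
        opt.zero_grad(); loss.backward(); opt.step()
    return w.detach(), cam.detach()
```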

After the artistic GAN inversion, we start the 3D GAN style transfer. Following the 2D style transfer method JOJOGAN (Chong and Forsyth, 2022), our 3D style transfer process can be formulated as follows:
(5) $w_{i} = \beta\, w^{*} + (1-\beta)\, M_{3D}(n)$
(6) $\theta_{s} = \arg\min_{\theta_{s}} \mathcal{L}\big(G_{s}(w_{i}, c^{*};\, \theta_{s}),\ I_{style}\big)$
where $w^{*}$ is given by Eq. (4), $M_{3D}$ denotes the style mapping layers of EG3D, $n$ is randomly sampled noise, and $\beta$ controls the degree of stylization. The interpolated $w_{i}$ is obtained by perturbing mapping layers 9-13 of $w^{*}$, and the loss $\mathcal{L}$ is a reconstruction loss. In each training epoch, we perturb mapping layers 9-13 of $w^{*}$ to get the interpolated $w_{i}$. Subsequently, the frozen EG3D generator uses these $w_{i}$ to generate aligned photo-realistic paired samples. As shown in the blue part of Fig. 2, these paired photo-realistic samples are supervised by the style reference image $I_{style}$ to drive the 3D GAN style transfer; we only fine-tune the parameters $\theta_{s}$ of the stylized generator $G_{s}$. With these techniques, we enable the EG3D generator to achieve high-quality one-shot stylization.
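A hedged sketch of this paired fine-tuning loop is given below. `eg3d_frozen` and `eg3d_train` are hypothetical handles to a frozen and a trainable copy of the generator, `perceptual_loss` stands for the reconstruction loss, and the default `beta` and learning rate are placeholder assumptions rather than the paper's values.

```python
import torch

def stylize_eg3d(style_img, w_star, cam_star, eg3d_frozen, eg3d_train,
                 perceptual_loss, beta=0.7, epochs=400, lr=2e-3):
    # Only the copied generator is updated; the frozen original keeps providing the
    # photo-realistic half of each (photo, style) pair, in the spirit of JOJOGAN.
    opt = torch.optim.Adam(eg3d_train.parameters(), lr=lr)
    for _ in range(epochs):
        # Eq. (5): perturb the mid-level mapping layers (9-13 in the paper) of the
        # inverted code w* with a freshly sampled code to obtain w_i.
        with torch.no_grad():
            noise_w = eg3d_frozen.mapping(torch.randn(w_star.shape[0], 512))  # assumed helper
        w_i = w_star.clone()
        w_i[:, 9:13, :] = beta * w_star[:, 9:13, :] + (1 - beta) * noise_w.unsqueeze(1)

        # Eq. (6): codes that render aligned photo-realistic portraits under the frozen
        # generator are now supervised by the single style reference image.
        stylized = eg3d_train(w_i, cam_star)
        loss = perceptual_loss(stylized, style_img)
        opt.zero_grad(); loss.backward(); opt.step()
    return eg3d_train
```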
4.2. Image-Text Fusion in the CLIP Space
In this section, our goal is to apply Image-Text embedding (ITE) manipulation to 3D portraits through the pre-trained text-image embedding model CLIP (Radford et al., 2021), as depicted in Fig. 2 with yellow color. First, image samples $I_{f}$, $I_{t}$, and $I_{s}$ are generated by the frozen generator $G_{f}$, the adapted generator $G_{t}$, and the stylized generator $G_{s}$ obtained in Sec. 4.1. Then, these samples are fed into the CLIP space for mixing with a text description. Finally, we propose a threshold function and a CLIP-direction loss to control the fusion direction self-adaptively. Specifically, ITE constructs $I_{f}$, $I_{t}$, and $I_{s}$ as
(7) $I_{f} = G_{f}(w, c_{3D}), \qquad I_{t} = G_{t}(w, c_{3D}), \qquad I_{s} = G_{s}(w, c_{2D}),$
where $c_{3D}$ refers to randomly sampled 3D multi-view cameras and $c_{2D}$ refers to the 2D canonical view. The parameters of $G_{f}$ and $G_{t}$ in Eq. (7) are initialized from the original EG3D, while the parameters of $G_{s}$ are trained in Sec. 4.1; $I_{s}$ is a stylization image generated by $G_{s}$. Then, following the CLIP-space direction loss of Eq. (2) in the preliminaries, ITE implements the 3D Image-Text embedding strategy as follows:
(8)
$\Delta T = E_{T}(t_{tar}) - E_{T}(t_{src}), \qquad \Delta I_{img} = E_{I}(I_{s}) - E_{I}(I_{f}), \qquad \Delta I = E_{I}(I_{t}) - E_{I}(I_{f}),$
$\mathcal{L}_{text} = 1 - \dfrac{\Delta I \cdot \Delta T}{\lVert \Delta I \rVert \, \lVert \Delta T \rVert}, \qquad \mathcal{L}_{image} = 1 - \dfrac{\Delta I \cdot \Delta I_{img}}{\lVert \Delta I \rVert \, \lVert \Delta I_{img} \rVert},$
$\mathcal{L}_{ITE} = \lambda\, \mathcal{L}_{image} + (1 - \lambda)\, \mathcal{L}_{text},$

where $t_{tar}$ and $t_{src}$ denote the semantic target text and the input content text of the style object, respectively, and $I_{f}$, $I_{t}$, and $I_{s}$ are defined in Eq. (7). $E_{I}$ and $E_{T}$ are the image and text encoders of CLIP. $\mathcal{L}_{image}$ is used to maintain the image-guided style of the previous stage, $\mathcal{L}_{text}$ is used to obtain the text-driven style, and $\mathcal{L}_{ITE}$ is the final CLIP direction loss. As exhibited in Fig. 4, ITE can thereby achieve Image-Text coupled manipulation of 3D portraits. We use $\lambda$ as a parameter to control the weights of the text-guided direction $\mathcal{L}_{text}$ and the image-guided direction $\mathcal{L}_{image}$. In the experiments, we found that simultaneously optimizing the image and text directions can easily cause an insignificant fusion effect (Fig. 8). Therefore, we design a threshold function to control $\lambda$ so that we only transfer towards one direction, image or text, in each epoch. The expression of $\lambda$ is defined as follows:
(9) $\lambda = \begin{cases} 1, & d > \tau_{1} \ \text{and}\ \mathrm{rand}(1, 100) \ge \tau_{2} \\ 0, & \text{otherwise} \end{cases}$
where $\lambda$ is the fusion weight used in Eq. (8). We use $d$ to measure the degree of image-guided stylization, i.e., the CLIP-space distance between a stylization image and its real photo-realistic counterpart. At the beginning of the alternating training, $G_{s}$ from Sec. 4.1 is less stylized, resulting in an unstable image-guided direction $\Delta I_{img}$ with large deviation. Therefore, only when the distance $d$ exceeds the threshold $\tau_{1}$, which means the image is stylized enough, do we transfer towards the image direction $\mathcal{L}_{image}$. If $d$ is less than $\tau_{1}$, which means the degree of stylization is insufficient, we only perform domain adaptation in the text direction $\mathcal{L}_{text}$. In addition, the experiments show that as training proceeds, $d$ tends to exceed $\tau_{1}$ frequently, which can over-optimize the image-guided direction regardless of the text-driven direction. To handle this issue, we propose a threshold-based text-compensation strategy: we draw a random integer from 1 to 100, and when it is less than $\tau_{2}$, we compensate by transferring towards the text direction $\mathcal{L}_{text}$. With the thresholds $\tau_{1}$ and $\tau_{2}$, we can automatically perform self-adaptive Image-Text fusion in the CLIP space, which helps to maintain the domain styles of both the image and the text.
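The snippet below sketches how Eqs. (8) and (9) could be combined in one training step. It assumes cosine distance as the CLIP-space distance $d$ and uses random 512-D tensors in place of CLIP embeddings, so it illustrates the control flow rather than the exact implementation.

```python
import random
import torch
import torch.nn.functional as F

def cosine_direction_loss(delta_a, delta_b):
    """1 - cosine similarity between two CLIP-space directions (cf. Eq. (2))."""
    return (1.0 - F.cosine_similarity(delta_a, delta_b, dim=-1)).mean()

def ite_step_loss(e_frozen, e_train, e_style, e_txt_src, e_txt_tgt,
                  tau_1=0.7, tau_2=50):
    """Sketch of Eqs. (8)-(9): choose the image-guided or text-driven direction per epoch."""
    delta_train = e_train - e_frozen       # direction of the adapted generator
    delta_image = e_style - e_frozen       # image-guided target direction
    delta_text = e_txt_tgt - e_txt_src     # text-driven target direction

    # Eq. (9): follow the image direction only when the stylized samples have moved far
    # enough from the photo-realistic domain, with a random text-compensation fallback.
    d = (1.0 - F.cosine_similarity(e_style, e_frozen, dim=-1)).mean()
    lam = 1.0 if (d > tau_1 and random.randint(1, 100) >= tau_2) else 0.0

    loss_image = cosine_direction_loss(delta_train, delta_image)
    loss_text = cosine_direction_loss(delta_train, delta_text)
    return lam * loss_image + (1.0 - lam) * loss_text

# usage with random embeddings standing in for CLIP features
e = lambda: torch.randn(4, 512)
loss = ite_step_loss(e(), e(), e(), e(), e())
```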

4.3. Alternating Training
In this section, we introduce the alternating training mechanism. First, we use the artistic GAN inversion in Eq. (4) to obtain $w^{*}$ and $c^{*}$. Then we start the alternating training mechanism shown in Fig. 2. In the first stage, our APT uses Eq. (6) to train the stylized generator $G_{s}$. In the second stage, our ITE exploits Eq. (8) to train the adapted generator $G_{t}$. The original generator $G_{f}$ is always kept frozen. The above two stages are trained alternately.
This alternating training strategy has two advantages: (i) It avoids the insignificant fusion effect of joint Image-Text training (Schaldenbrand et al., 2022; Frans et al., 2021), because the threshold function in Eq. (9) controls $\lambda$ so that each epoch transfers in only one direction (image or text), yielding a better Image-Text fusion effect (Fig. 9). (ii) It effectively prevents one-shot image-guided overfitting (Gal et al., 2021) in the CLIP space, because the perturbed latent codes and the stylized generator $G_{s}$ produce many different stylized samples at each epoch, which helps prevent $G_{t}$ from overfitting to a single image.
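A compact sketch of this schedule is shown below; `apt_epoch` and `ite_epoch` are hypothetical callables wrapping one optimization pass of Eq. (6) and Eq. (8), respectively, and the frozen generator is never passed to an optimizer.

```python
def train_itportrait(apt_epoch, ite_epoch, n_epochs=400):
    """Alternate the two stages of Sec. 4.3 epoch by epoch."""
    for epoch in range(n_epochs):
        apt_epoch(epoch)   # stage 1 (APT): refresh the stylized generator and samples
        ite_epoch(epoch)   # stage 2 (ITE): Image-Text fusion in the CLIP space
```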
5. Experiments
5.1. Implementation Details.
We use the Adam (Kingma and Ba, 2015) optimizer. To speed up training, we retain the layer-wise optimization technique from StyleGAN-NADA (Gal et al., 2021). Considering the network structure of EG3D, we only optimize its first, third, and fourth modules (SynthesisNetwork, Superresolution, Decoder). Thus, fine-tuning ITportrait on an RTX 3090 for 400 epochs takes about 8 minutes with a batch size of 2 for every case. Furthermore, in Eq. (9), the approximate value of the threshold $\tau_{1}$ is 0.7, and the approximate value of the threshold $\tau_{2}$ is 50. The multi-view cameras $c_{3D}$ in Eq. (7) are three view angles randomly selected over fixed yaw and pitch ranges.
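For reference, the snippet below collects these settings into a configuration sketch. The learning rate and the exact yaw/pitch ranges are placeholder assumptions (they are not specified in this text), and the camera-sampling helper is only illustrative.

```python
import math
import random

config = {
    "optimizer": "Adam",
    "lr": 2e-3,                  # placeholder: the learning rate is not given in this text
    "epochs": 400,
    "batch_size": 2,
    "tau_1": 0.7,                # threshold on the CLIP-space stylization distance (Eq. (9))
    "tau_2": 50,                 # text-compensation threshold over a random integer in [1, 100]
    "trainable_modules": ["SynthesisNetwork", "Superresolution", "Decoder"],
}

def sample_multiview_angles(n_views=3, yaw_deg=(-45, 45), pitch_deg=(-20, 20)):
    """Sample (yaw, pitch) pairs in radians for 3D multi-view CLIP supervision.
    The angle ranges are placeholders; the exact values are omitted in this text."""
    return [(math.radians(random.uniform(*yaw_deg)),
             math.radians(random.uniform(*pitch_deg)))
            for _ in range(n_views)]

cams = sample_multiview_angles(n_views=3)
```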
5.2. Comparison with Baselines
Image-Text Coupled Domain Adaptation. We conducted comparative experiments to evaluate the effectiveness of our Image-Text coupled (ITE) results. The goal of style fusion is to maintain both the reference image's style and the text's editing effect. For a fair comparison, we consider two Image-Text fusion strategies: (i) image-guided style transfer first, followed by text editing, i.e., JOJOGAN (Chong and Forsyth, 2022) + StyleCLIP (Patashnik et al., 2021) and JOJOGAN (Chong and Forsyth, 2022) + CLIPstyler (Kwon and Ye, 2022); and (ii) text editing first, followed by image-guided style transfer, i.e., StyleGAN-NADA (Gal et al., 2021) + JOJOGAN (Chong and Forsyth, 2022). As shown in Fig. 5, our ITportrait best maintains both the reference image's style and the text's editing effect. Specifically, the effect of JOJOGAN + StyleCLIP is not obvious because the style of JOJOGAN cannot be preserved in the text-editing stage of StyleCLIP. The result of JOJOGAN + CLIPstyler has artifacts because CLIPstyler incorrectly optimizes the style of JOJOGAN. The effect of StyleGAN-NADA + JOJOGAN is also not obvious because the stylization of JOJOGAN completely conceals the text editing of StyleGAN-NADA, distorting both the image and text styles. In contrast, our method effectively preserves the style and CLIP editing effects thanks to the ITE fusion mechanism in the CLIP space.
Image-guided Stylization. We evaluate our proposed image-guided one-shot stylization method APT. For image-guided style transfer, a high-quality result transfers the style while maintaining the identity of the original portrait. We selected SOTA portrait style transfer methods for comparison, including Your3dEmoji (Xu et al., 2022), BlendGAN (Liu et al., 2021), and MindtheGap (Zhu et al., 2021). The results are illustrated in Fig. 6, and our effect is the best. Specifically, BlendGAN only captures the general color of the style, and the portrait details are insufficient. Your3dEmoji achieves higher-quality stylization details, but its wrinkle geometry deformation is inferior, as in the caricature example (Line 2). The effect of MindtheGap is the closest to ours, but MindtheGap needs about 15 minutes to train, while our APT only needs 4 minutes of fine-tuning.

Text-driven Manipulation. ITportrait supports the manipulation of portraits with only text guidance. We compare our method with SOTA text-driven manipulation methods, including CLIPstyler (Kwon and Ye, 2022), StyleCLIP (Patashnik et al., 2021), and IDE-NADA (IDE-3D (Sun et al., 2022a) + StyleGAN-NADA (Gal et al., 2021)). Since StyleCLIP is good at face editing while IDE-NADA and CLIPstyler are good at domain transfer, for a fair comparison we use long text descriptions containing both face edits and cross-domain information. As shown in Fig. 7, our method achieves more visually pleasing domain adaptation results. This stems from the fact that we construct 3D multi-view supervision: we effectively supervise the portrait's side views and achieve the CLIP enhancement effect. Although IDE-NADA is aimed at 3D portraits, it only employs 2D CLIP augmentation methods (Frans et al., 2021; Kwon and Ye, 2022), which leads to poor domain adaptation results for some text descriptions. For the example of 'wearing glasses', the effect of IDE-NADA is not obvious (Lines 2 and 3).

Quantitative Analysis. Quantitative analysis of style transfer and cross-domain editing is challenging due to the absence of ground truth, so a user study is the most common way to evaluate different style transfer methods. We follow an approach similar to (Gal et al., 2021; Kwon and Ye, 2022). Specifically, we establish four sets of images. Users have unlimited time to rate their preferences from 1 to 5 (five being the best and one being the worst). All images from each group are presented side-by-side in random order. Each group contains comparisons between our method and the other methods, including JOJOGAN (Chong and Forsyth, 2022) + CLIPstyler (Kwon and Ye, 2022), StyleGAN-NADA (Gal et al., 2021) + JOJOGAN (Chong and Forsyth, 2022), and JOJOGAN (Chong and Forsyth, 2022) + StyleCLIP (Patashnik et al., 2021). As shown in Tab. 1, our method is the most popular, because our ITportrait most effectively retains the image-guided and text-driven styles and thus achieves the best Image-Text fusion effect. Please refer to the supplement for more details about the user study.
Table 1. User study preference scores (1 = worst, 5 = best).
Methods | Image-Style | Text-Style | Content
JOJOGAN + StyleCLIP | 1.119 | 1.343 | 3.914 |
StyleGAN-NADA + JOJOGAN | 3.716 | 1.740 | 3.072 |
JOJOGAN + CLIPstyler | 3.466 | 3.986 | 2.867 |
Ours | 4.775 | 4.842 | 3.742 |


5.3. Ablation Studies
Pose estimation. We conduct ablation experiments to evaluate the effectiveness of our APT, which uses photo-realistic portraits for pose estimation (Eq. (3)). Specifically, we compare (i) optimizing Eq. (4) directly without APT initialization and (ii) optimizing Eq. (4) with APT initialization. We select 30 art images to compare GAN inversion accuracy. The results are listed in Tab. 2. Using APT effectively improves the inversion quality: in existing 3D GAN inversion methods (Ko et al., 2023; Yin et al., 2022), the pose and latent code of artistic portraits are randomly initialized and optimized simultaneously, which may lead to artifacts. We instead construct real photographs to initialize the pose, which eases the difficulty of pose optimization and thereby improves the quality of one-shot stylization. More details are shown in the supplement.
Image-Text Fusion Strategy. We conduct ablation experiments to verify the effectiveness of our self-adaptive threshold function (Eq. (9)) for ITE in the CLIP space. The threshold $\tau_{1}$ triggers transfer towards the image direction only when $G_{s}$ is stylized enough, while $\tau_{2}$ maintains the text style and prevents excessive transfer towards the image direction. Specifically, we compare our ITE with (i) not using the threshold function of Eq. (9) at all, (ii) only using the threshold $\tau_{1}$, and (iii) only using the threshold $\tau_{2}$. We keep all other parameters constant and only change the thresholds mentioned above. In addition, we also explore the impact of different values of $\tau_{1}$ and $\tau_{2}$ on the fusion results. As shown in Fig. 9, our threshold method achieves the best fusion effect. Without the threshold function, simultaneously optimizing the image and text CLIP directions has no obvious effect because the image guidance and the text guidance conflict. The thresholds $\tau_{1}$ and $\tau_{2}$ effectively preserve the image and text directions, respectively.
Table 2. Artistic GAN inversion accuracy with and without APT pose initialization.
Method | MSE | SSIM | PSNR | LPIPS | ID
w/o pose init | | | | |
w/ pose init | | | | |

3D View Augment. We conduct ablation experiments to demonstrate the effectiveness of our 3D view augmentation. Specifically, we compare with the original single-view supervision (Sun et al., 2022b, a) and perspective augmentation (Schaldenbrand et al., 2022; Kwon and Ye, 2022). For fairness, we keep all other parameters the same and only change the number of supervised views. As shown in Fig. 10, our method works best. Single-view supervision leads to portrait geometric collapse (Lines 1, 2). The CLIP perspective augmentation method can misinterpret wrinkles around the eyes as the effect of wearing glasses (Line 3) or cause blurry glasses on the side view (Line 4). In summary, although these augmentation methods positively affect the frontal portrait, the side of the portrait is blurred. In contrast, our method is free from distortions in all views because we effectively supervise the side faces.


5.4. Application
Novel View Synthesis. As depicted in Fig. 11, our ITportrait supports novel view synthesis of cross-domain drawing images. The main reason is that ITportrait aggregates image-guided style and text-driven manipulation in the CLIP space while adding 3D multi-view supervision. As shown in the ablation study in Fig. 11, ITportrait effectively prevents side-view artifacts and geometric collapse. Thus, we achieve higher-quality novel view synthesis than previous methods (Sun et al., 2022b, a; Schaldenbrand et al., 2022; Kwon and Ye, 2022).
3D Portrait Domain Adaptation. Our method supports 3D portrait domain adaptation with limited data (i.e., a text or a single image). Fig. 12 shows the 3D portrait domain adaptation effect of Dr.3D (Jin et al., 2022) and 3DAvatarGAN (Abdal et al., 2023) in the caricature domain. However, they need a dataset of the corresponding art domain (Pinkney and Adler, 2020; Yang et al., 2022) and train from scratch, which is very time-consuming (about 4 hours of training). Furthermore, such large-scale domain datasets are tedious and labor-intensive to collect. In contrast, with 8 minutes of fine-tuning and a single image or text as guidance, ITportrait supports 3D portrait domain adaptation.


View-Consistent Portrait Editing. Our ITportrait achieves view-consistent portrait editing. As illustrated in Fig. 13, ITportrait implements image-guided, text-driven, and Image-Text coupled domain adaptation for photo-realistic portraits. As shown in Fig. 14, we also support art-drawing portrait editing. Specifically, we can use APT for a two-stage 3D GAN inversion similar to PTI (Roich et al., 2022): we first obtain the pivot latent code $w^{*}$ and pose $c^{*}$ (Eq. (4)), and then fine-tune the generator using Eq. (6). In this way, APT realizes high-quality 3D GAN inversion for both photo-realistic and art-drawing portraits, after which we apply our ITE approach for editing. Please refer to our supplement for more details.
6. Conclusions
In this paper, we propose ITportrait, a novel framework for Image-Text coupled 3D portrait domain adaptation. To the best of our knowledge, this is the first work to study Image-Text coupled manipulation for 3D portrait domain adaptation, which offers better controllability of artistic portraits. We design a two-stage alternating training strategy for image and text fusion in the CLIP space. In the first stage, we design a one-shot 3D portrait style transfer method (APT). APT constructs paired photo-realistic portraits to accurately estimate the pose of artistic images for high-quality one-shot stylization. In the second stage, we present an Image-Text embedding (ITE) strategy. ITE includes 3D multi-view supervision and a threshold function to self-adaptively control the fusion direction. Comprehensive experiments demonstrate that ITportrait achieves state-of-the-art (SOTA) results and supports a wide range of downstream applications.
References
- Abdal et al. (2023) Rameen Abdal, Hsin-Ying Lee, Peihao Zhu, Menglei Chai, Aliaksandr Siarohin, Peter Wonka, and Sergey Tulyakov. 2023. 3DAvatarGAN: Bridging Domains for Personalized Editable Avatars. arXiv preprint arXiv:2301.02700 (2023).
- Alaluf et al. (2021) Yuval Alaluf, Or Patashnik, and Daniel Cohen-Or. 2021. Only a matter of style: Age transformation using a style-based regression model. ACM Transactions on Graphics (TOG) 40, 4 (2021), 1–12.
- Alanov et al. (2022) Aibek Alanov, Vadim Titov, Maksim Nakhodnov, and Dmitry Vetrov. 2022. StyleDomain: Analysis of StyleSpace for Domain Adaptation of StyleGAN. arXiv preprint arXiv:2212.10229 (2022).
- Chan et al. (2022) Eric R Chan, Connor Z Lin, Matthew A Chan, Koki Nagano, Boxiao Pan, Shalini De Mello, Orazio Gallo, Leonidas J Guibas, Jonathan Tremblay, Sameh Khamis, et al. 2022. Efficient geometry-aware 3D generative adversarial networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 16123–16133.
- Chan et al. (2021) Eric R Chan, Marco Monteiro, Petr Kellnhofer, Jiajun Wu, and Gordon Wetzstein. 2021. pi-gan: Periodic implicit generative adversarial networks for 3d-aware image synthesis. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5799–5809.
- Chong and Forsyth (2022) Min Jin Chong and David Forsyth. 2022. Jojogan: One shot face stylization. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XVI. Springer, 128–152.
- Deng et al. (2019) Yu Deng, Jiaolong Yang, Sicheng Xu, Dong Chen, Yunde Jia, and Xin Tong. 2019. Accurate 3d face reconstruction with weakly-supervised learning: From single image to image set. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops. 0–0.
- Frans et al. (2021) Kevin Frans, Lisa B Soros, and Olaf Witkowski. 2021. Clipdraw: Exploring text-to-drawing synthesis through language-image encoders. arXiv preprint arXiv:2106.14843 (2021).
- Gal et al. (2021) Rinon Gal, Or Patashnik, Haggai Maron, Gal Chechik, and Daniel Cohen-Or. 2021. Stylegan-nada: Clip-guided domain adaptation of image generators. arXiv preprint arXiv:2108.00946 (2021).
- Gatys et al. (2016) Leon A Gatys, Alexander S Ecker, and Matthias Bethge. 2016. Image style transfer using convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2414–2423.
- Gu et al. (2021) Jiatao Gu, Lingjie Liu, Peng Wang, and Christian Theobalt. 2021. Stylenerf: A style-based 3d-aware generator for high-resolution image synthesis. arXiv preprint arXiv:2110.08985 (2021).
- Huang and Belongie (2017) Xun Huang and Serge Belongie. 2017. Arbitrary style transfer in real-time with adaptive instance normalization. In Proceedings of the IEEE international conference on computer vision. 1501–1510.
- Jin et al. (2022) Wonjoon Jin, Nuri Ryu, Geonung Kim, Seung-Hwan Baek, and Sunghyun Cho. 2022. Dr. 3D: Adapting 3D GANs to Artistic Drawings. In SIGGRAPH Asia 2022 Conference Papers. 1–8.
- Karras et al. (2020) Tero Karras, Miika Aittala, Janne Hellsten, Samuli Laine, Jaakko Lehtinen, and Timo Aila. 2020. Training generative adversarial networks with limited data. Advances in neural information processing systems 33 (2020), 12104–12114.
- Karras et al. (2021) Tero Karras, Miika Aittala, Samuli Laine, Erik Härkönen, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. 2021. Alias-free generative adversarial networks. Advances in Neural Information Processing Systems 34 (2021), 852–863.
- Karras et al. (2019) Tero Karras, Samuli Laine, and Timo Aila. 2019. A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 4401–4410.
- Kim et al. (2022) Seongtae Kim, Kyoungkook Kang, Geonung Kim, Seung-Hwan Baek, and Sunghyun Cho. 2022. DynaGAN: Dynamic Few-shot Adaptation of GANs to Multiple Domains. In SIGGRAPH Asia 2022 Conference Papers. 1–8.
- Kingma and Ba (2015) Diederik P. Kingma and Jimmy Lei Ba. 2015. Adam: A Method for Stochastic Optimization. In ICLR.
- Ko et al. (2023) Jaehoon Ko, Kyusun Cho, Daewon Choi, Kwangrok Ryoo, and Seungryong Kim. 2023. 3d gan inversion with pose optimization. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2967–2976.
- Kwon and Ye (2022) Gihyun Kwon and Jong Chul Ye. 2022. Clipstyler: Image style transfer with a single text condition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 18062–18071.
- Liu et al. (2021) Mingcong Liu, Qiang Li, Zekui Qin, Guoxin Zhang, Pengfei Wan, and Wen Zheng. 2021. Blendgan: Implicitly gan blending for arbitrary stylized face generation. Advances in Neural Information Processing Systems 34 (2021), 29710–29722.
- Men et al. (2022) Yifang Men, Yuan Yao, Miaomiao Cui, Zhouhui Lian, and Xuansong Xie. 2022. DCT-net: domain-calibrated translation for portrait stylization. ACM Transactions on Graphics (TOG) 41, 4 (2022), 1–9.
- Patashnik et al. (2021) Or Patashnik, Zongze Wu, Eli Shechtman, Daniel Cohen-Or, and Dani Lischinski. 2021. Styleclip: Text-driven manipulation of stylegan imagery. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 2085–2094.
- Pinkney and Adler (2020) Justin NM Pinkney and Doron Adler. 2020. Resolution dependent gan interpolation for controllable image synthesis between domains. arXiv preprint arXiv:2010.05334 (2020).
- Radford et al. (2021) Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning. PMLR, 8748–8763.
- Roich et al. (2022) Daniel Roich, Ron Mokady, Amit H Bermano, and Daniel Cohen-Or. 2022. Pivotal tuning for latent-based editing of real images. ACM Transactions on Graphics (TOG) 42, 1 (2022), 1–13.
- Schaldenbrand et al. (2022) Peter Schaldenbrand, Zhixuan Liu, and Jean Oh. 2022. Styleclipdraw: Coupling content and style in text-to-drawing translation. arXiv preprint arXiv:2202.12362 (2022).
- Shao and Zhang (2021) Xuning Shao and Weidong Zhang. 2021. SPatchGAN: A statistical feature based discriminator for unsupervised image-to-image translation. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 6546–6555.
- Sun et al. (2022a) Jingxiang Sun, Xuan Wang, Yichun Shi, Lizhen Wang, Jue Wang, and Yebin Liu. 2022a. Ide-3d: Interactive disentangled editing for high-resolution 3d-aware portrait synthesis. ACM Transactions on Graphics (TOG) 41, 6 (2022), 1–10.
- Sun et al. (2022b) Jingxiang Sun, Xuan Wang, Lizhen Wang, Xiaoyu Li, Yong Zhang, Hongwen Zhang, and Yebin Liu. 2022b. Next3D: Generative Neural Texture Rasterization for 3D-Aware Head Avatars. arXiv preprint arXiv:2211.11208 (2022).
- Tov et al. (2021) Omer Tov, Yuval Alaluf, Yotam Nitzan, Or Patashnik, and Daniel Cohen-Or. 2021. Designing an encoder for stylegan image manipulation. ACM Transactions on Graphics (TOG) 40, 4 (2021), 1–14.
- Xie et al. (2021) Shaoan Xie, Mingming Gong, Yanwu Xu, and Kun Zhang. 2021. Unaligned image-to-image translation by learning to reweight. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 14174–14184.
- Xu et al. (2022) Shiyao Xu, Lingzhi Li, Li Shen, Yifang Men, and Zhouhui Lian. 2022. Your3dEmoji: Creating Personalized Emojis via One-shot 3D-aware Cartoon Avatar Synthesis. In SIGGRAPH Asia 2022 Technical Communications. 1–4.
- Yang et al. (2022) Shuai Yang, Liming Jiang, Ziwei Liu, and Chen Change Loy. 2022. Pastiche master: exemplar-based high-resolution portrait style transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 7693–7702.
- Yin et al. (2022) Fei Yin, Yong Zhang, Xuan Wang, Tengfei Wang, Xiaoyu Li, Yuan Gong, Yanbo Fan, Xiaodong Cun, Ying Shan, Cengiz Oztireli, et al. 2022. 3D GAN Inversion with Facial Symmetry Prior. arXiv preprint arXiv:2211.16927 (2022).
- Yu et al. (2022) Yingchen Yu, Fangneng Zhan, Rongliang Wu, Jiahui Zhang, Shijian Lu, Miaomiao Cui, Xuansong Xie, Xian-Sheng Hua, and Chunyan Miao. 2022. Towards counterfactual image manipulation via clip. In Proceedings of the 30th ACM International Conference on Multimedia. 3637–3645.
- Zhang et al. (2018) Richard Zhang, Phillip Isola, Alexei A Efros, Eli Shechtman, and Oliver Wang. 2018. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition. 586–595.
- Zhao et al. (2020) Yihao Zhao, Ruihai Wu, and Hao Dong. 2020. Unpaired image-to-image translation using adversarial consistency loss. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part IX 16. Springer, 800–815.
- Zhou et al. (2021) Peng Zhou, Lingxi Xie, Bingbing Ni, and Qi Tian. 2021. Cips-3d: A 3d-aware generator of gans based on conditionally-independent pixel synthesis. arXiv preprint arXiv:2110.09788 (2021).
- Zhu et al. (2021) Peihao Zhu, Rameen Abdal, John Femiani, and Peter Wonka. 2021. Mind the gap: Domain gap control for single shot domain adaptation for generative adversarial networks. arXiv preprint arXiv:2110.08398 (2021).