ToonAging: Face Re-Aging upon Artistic Portrait Style Transfer
Abstract
Face re-aging is a prominent field in computer vision and graphics, with significant applications in photorealistic domains such as movies, advertising, and live streaming. Recently, the need to apply face re-aging to non-photorealistic rendering (NPR) images, like comics, illustrations, and animations, has emerged as an extension in various entertainment sectors. However, the lack of a network that can seamlessly edit the apparent age in NPR images has limited these tasks to a naive, sequential approach, which often produces unpleasant artifacts and loses facial attributes due to domain discrepancies. In this paper, we introduce a novel one-stage method for face re-aging combined with portrait style transfer, executed in a single generative step. We leverage existing face re-aging and style transfer networks, both trained within the same photorealistic (PR) domain. Our method uniquely fuses distinct latent vectors, each responsible for managing aging-related attributes and NPR appearance. By adopting an exemplar-based approach, our method offers greater flexibility than domain-level fine-tuning approaches, which typically require separate training or fine-tuning for each domain. This effectively sidesteps the need for paired re-aging datasets and for domain-level, data-driven stylization approaches. Our experiments show that our model can effortlessly generate re-aged images while simultaneously transferring the style of exemplars, maintaining both a natural appearance and controllability.
1 Introduction
Face re-aging is a problem in computer graphics that semantically changes a subject's apparent age to match a target age. Traditionally, this task was performed at high cost by expert artists in media production fields such as film, advertisement, and TV series. Once generative models began producing results indistinguishable from real images, several attempts were made to perform re-aging with them. Existing studies [alaluf2021only, hsu2022agetransgan, makhmudkhujaev2021ragan, makhmudkhujaev2023raganpp, zoss2022production, muqeet2023video, li2023pluralistic, yoon2023manipulation] predominantly employ StyleGAN-based approaches. Given an input face image, its current age, and a target age, these methods alter the generated image to resemble the target age. However, all of them are optimized to alter the age of real individuals in photographs. As a result, such methods often struggle when applied to NPR images.
Meanwhile, artistic portraits hold a significant place in everyday life, particularly within industries like comics, animation, posters, and advertising. These portraits, spanning styles from caricature to anime, are essentially stylized face images. As with re-aging, StyleGAN-based architectures [pinkney2020resolution, huang2021unsupervised, song2021agilegan, kim2022cross, li2023parsing, jang2021stylecarigan, yang2022pastiche, wang20223d, kim2023context, zhu2023few, men2022dct, lee2023fix, khowaja2023face, yang2022vtoonify, zhang2022generalizedOD, chefer2022image, back2022webtoonme] have achieved substantial success in generating high-resolution artistic portraits, largely thanks to their hierarchical style control, which also facilitates transfer learning. Image-to-image translation methods [kim2019u, choi2020starganv2], by contrast, train a distinct model for each domain from scratch, requiring extensive datasets and a longer preparation process than methods built on pre-trained models. They are also oriented primarily towards style modification, with limited capability for simultaneous attribute editing.

As previously mentioned, despite various research efforts in both face re-aging and artistic portrait generation, to the best of our knowledge the combination of the two has not been explored. The naive approach to this problem is to apply each method sequentially. We investigated this by first applying SAM [alaluf2021only] to input images and then stylizing the re-aged results with the state-of-the-art style transfer technique DualStyleGAN [yang2022pastiche]. The results are shown in Fig. 2 (b). This approach exhibits two issues: 1) aging details such as wrinkles in the input image are lost, and 2) DualStyleGAN does not faithfully preserve facial attributes, including expressions such as the mouth shape of the input. Both issues stem from the inversion processes, which do not effectively preserve input details. We also conducted the experiment in reverse order, first applying DualStyleGAN and then SAM, as shown in Fig. 2 (c), and encountered the same inversion problem: SAM failed to preserve the artistic details of the stylized inputs. The underlying cause is the training procedure of re-aging methods, which are trained on FFHQ-Aging [or2020lifespan] and thus bound to the real-image distribution modeled by StyleGAN [karras2019stylegan]. A minimal sketch of this sequential baseline follows.
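To make the sequential baseline concrete, the following minimal Python sketch summarizes both orderings. The callables sam and dualstylegan are hypothetical stand-ins for the pre-trained networks (not their actual APIs); each call internally performs its own GAN inversion, which is exactly where the detail loss arises.

def reage_then_stylize(x, s, target_age, sam, dualstylegan):
    # Fig. 2 (b): re-age first, then stylize. The second inversion
    # (inside DualStyleGAN) discards wrinkles and expression details.
    reaged = sam(x, target_age)
    return dualstylegan(reaged, s)

def stylize_then_reage(x, s, target_age, sam, dualstylegan):
    # Fig. 2 (c): stylize first, then re-age. SAM's encoder, trained on
    # real faces, cannot faithfully invert the NPR intermediate result.
    stylized = dualstylegan(x, s)
    return sam(stylized, target_age)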
To address the challenges in combining face re-aging and artistic portrait generation, we have developed a method that effectively merges the latent information serving each goal. Our approach combines SAM [alaluf2021only] for re-aging and DualStyleGAN [yang2022pastiche] for style transfer. Specifically, we extract inversions in the $\mathcal{W}+$ space of a pre-trained StyleGAN network using SAM and apply these, along with an age residual, to the intrinsic path of DualStyleGAN. This combination respects the principle of superposition, allowing the effects of each network to merge seamlessly and without mutual interference. Our analysis suggests that using the $\mathcal{W}+$ space, rather than the $\mathcal{Z}+$ space, for the intrinsic path better preserves the subject's attributes. Additionally, by maintaining the original extrinsic path in DualStyleGAN, we naturally blend age transformation and style transfer (a minimal latent-fusion sketch follows the contribution list below). This unified approach offers precise control over the re-aging and style transfer processes, overcoming the limitations of previous methods and enabling the generation of more natural and realistic images. In summary, we present three primary contributions:
• ToonAging innovatively combines face re-aging and portrait style transfer in a one-stage method, surpassing sequential methods in performance without needing extra datasets or training.
• We utilize a latent fusion approach across different networks, enabling precise control over diverse aspects from coarse to fine details, including shape, rendering style, color, and aging effects.
• In contrast to domain-level style transfer approaches, our method inherits an exemplar-based technique, allowing style transfer across any domain without additional fine-tuning or training.
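To illustrate the latent-fusion idea above, the sketch below outlines a single-pass variant under simplified, assumed interfaces: sam_encoder returns a $\mathcal{W}+$ inversion, age_encoder returns an age residual in $\mathcal{W}+$, ext_encoder encodes the style exemplar, and g_dual is the DualStyleGAN generator taking intrinsic and extrinsic codes. All names and signatures are illustrative rather than the actual APIs.

import torch

@torch.no_grad()
def toonaging(x, s, alpha, sam_encoder, age_encoder, ext_encoder, g_dual):
    # One-stage re-aging + style transfer via latent fusion (illustrative).
    # x: PR input image, s: NPR style exemplar, alpha: target age in [0, 100].
    w_plus = sam_encoder(x)            # W+ inversion of the PR input (single inversion)
    delta_age = age_encoder(x, alpha)  # age-conditioned residual in W+
    w_intrinsic = w_plus + delta_age   # fused code fed to the intrinsic path
    z_ext = ext_encoder(s)             # extrinsic code from the style exemplar
    return g_dual(w_intrinsic, z_ext)  # aging and stylization merge by superposition

Because both codes live in the same learned distribution, the addition in $\mathcal{W}+$ behaves like a superposition of edits, which is why no extra training is needed.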

2 Related Work
In this section, we provide a literature review covering methods for face re-aging and the generation of artistic styles in facial images.
2.1 Face Re-Aging
Face re-aging [alaluf2021only, hsu2022agetransgan, makhmudkhujaev2021ragan, makhmudkhujaev2023raganpp, zoss2022production, muqeet2023video], covering both aging (i.e., the old direction) and de-aging (i.e., the young direction), aims to change the apparent age of a facial image to a specified target age. The re-aging network must perceptually preserve the original facial identity in the generated output [zhai2018identity] while simultaneously producing a pronounced visual re-aging effect [zoss2022production, muqeet2023video]; a trade-off between aging performance and identity preservation is well known [alaluf2021only, zoss2022production, muqeet2023video]. Moreover, the scarcity of paired datasets linking faces to age labels makes the task more difficult.
To mitigate this, [zoss2022production] leverages an image-level re-aging network [alaluf2021only] in a supervised manner. [muqeet2023video] extends this approach to video, achieving consistent re-aging output with stable aging performance and identity preservation. Building on the StyleGAN architecture [karras2019stylegan, karras2020stylegan2], some works explore latent vectors within learned distributions [alaluf2021only, makhmudkhujaev2021ragan, makhmudkhujaev2023raganpp, gomez2022cusp, maeng2023gmba]. There have also been efforts to perform re-aging via face attribute editing [tzaban2022stitch, kim2023diffusionvae, yang2023styleganex]. Recently, some researchers have addressed the difficult training setup by employing diffusion approaches [li2023pluralistic, chen2023fading]. Additionally, one study [duong2019automatic] uses reinforcement learning to identify suitable aging visuals.
2.2 Artistic Style Face Generation
In the field of generative models, Non-Photorealistic Rendering (NPR) style image generation [kim2024minecraft, yang2022pastiche, wang2022ctlgan] and style transfer [pinkney2020resolution, kim2022cross, song2023agilegan3d, shah2022multistylegan, zhou2022hrinversion, li2023multimodal, kim2023context] have emerged as appealing problems in computer graphics due to their attractive, aesthetic results. The first generative approach to stylized face generation is a domain-level technique called layer swapping [pinkney2020resolution, kwak2022generate], which replaces source-domain weights (i.e., realistic rendering) with target-domain weights (i.e., NPR). Similarly, Cross-Domain Style Mixing (CDSM) [kim2022cross] fuses weights from two domains at the latent level in $\mathcal{S}$ space [wu2021stylespace] after fine-tuning [karras2020ada] the pre-trained weights on a cartoon dataset. Unsupervised approaches have also been employed for cartoon-style image generation [huang2021unsupervised] through image-to-image translation schemes [liu2017unit, zhu2017cyclegan, huang2018munit, lee2018drit, lee2020dritpp, kim2019u, xiao2022appearance, men2022unpaired, wang2022realtime]. Recently, a study leveraged a diffusion model for face stylization [liu2023portrait], and StyleGAN-Fusion [song2024styleganfusion] applies domain adaptation to distill the knowledge of pre-trained large-scale diffusion models, enabling transitions to new target domains without requiring any training images from those domains. However, these approaches often require extensive dataset acquisition for cartoon images, leading to insufficient data in specific domains. To address this, recent work adopts an exemplar-based method [yang2022pastiche] that does not require large datasets. For our task, the key point is that an exemplar-based method [yang2022pastiche] sidesteps the large-dataset requirement [zoss2022production, muqeet2023video] that burdens domain-level methods [pinkney2020resolution, kim2022cross], making it the more suitable choice. A weight-level sketch of the layer-swapping idea follows.
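As a weight-level illustration of layer swapping [pinkney2020resolution], the sketch below blends two StyleGAN checkpoints (dictionaries of torch tensors). The layer-name prefixes are placeholders, since real StyleGAN2 checkpoints name their layers differently across codebases.

def layer_swap(pr_state: dict, npr_state: dict,
               fine_prefixes=("convs.8", "convs.9", "to_rgb")) -> dict:
    # Keep coarse (structure) layers from the PR model and take fine
    # (texture/rendering) layers from the fine-tuned NPR model.
    blended = {k: v.clone() for k, v in pr_state.items()}
    for k in npr_state:
        if any(k.startswith(p) for p in fine_prefixes):
            blended[k] = npr_state[k].clone()
    return blended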

3 Methodology
In the context of real-world image datasets, unsupervised learning methods for re-aging, such as those applied to datasets like FFHQ-Aging [or2020lifespan], are feasible. Additionally, data-centric approaches, as proposed in methods like [zoss2022production, muqeet2023video], can be used to create supervised learning datasets for re-aging. However, in the domain of artistic portrait images, there is a significant lack of datasets for re-aging, and the use of data-centric approaches is complex and time-consuming.
Unlike traditional methods, our approach does not rely on existing datasets or the generation of new data. Instead, we utilize existing fine-tuned networks through an exemplar-based scheme. This strategy eliminates the need for additional training by fusing individual latent vectors. For our method to be effective, it is crucial that the learned distribution and alignment within the networks are congruent. Initially, we naively applied face re-aging and portrait style transfer as a two-stage process within the StyleGAN2 latent space [karras2020stylegan2] and the FFHQ distribution [karras2019stylegan]. However, this approach presents its own limitations.
To perform face re-aging with portrait style transfer, we first provide a concise introduction to the existing networks as preliminary knowledge in Sec. 3.1. Then, we elaborate on the details of our approach in Sec. 3.2.



3.1 Preliminaries
3.1.1 Face Re-Aging Network
Face re-aging semantically generates a re-aging effect on the original appearance, following the target age. Many works [hsu2022agetransgan, gomez2022cusp, makhmudkhujaev2021ragan, yao2021hrfae, or2020lifespan, makhmudkhujaev2023raganpp] have proposed re-aging networks for high-quality and dramatic age progression. SAM [alaluf2021only], a StyleGAN-based method, is notably effective at encoding age-related attributes into the latent vector, utilizing residuals in the $\mathcal{W}+$ space in conjunction with the StyleGAN generator and an inversion scheme. Its integration capabilities and performance make it a prominent choice among face re-aging networks. Technically, given an input image $x$, an inversion encoder $E$, and a target age $\alpha$, the face re-aging network SAM is formulated as:
$$\mathrm{SAM}(x, \alpha) = G\big(E(x) + E_{\mathrm{age}}(x, \alpha)\big) \tag{1}$$
where $G$ denotes the StyleGAN generator, and the age-included latent vector is encoded as $w_{\mathrm{age}} = E(x) + E_{\mathrm{age}}(x, \alpha)$, with $\alpha$ being a constant ranging from 0 to 100. It is noteworthy that SAM only requires one additional encoder, the age encoder $E_{\mathrm{age}}$. Thus, in the absence of the age encoder ($E_{\mathrm{age}} \equiv 0$), Eq. 1 effectively reverts to the original StyleGAN inversion formulation $G(E(x))$, demonstrating the foundational role of the age encoder in extending the model's capabilities toward age-specific transformations. This integration of a simple yet impactful age encoder into the latent-vector-based re-aging process underscores our rationale for selecting this model as the basis for our ToonAging method.
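Read as code, Eq. 1 amounts to adding the age residual to the inversion before decoding. The sketch below assumes simplified callables for $E$, $E_{\mathrm{age}}$, and $G$ rather than the actual SAM interfaces.

import torch

@torch.no_grad()
def sam_reage(x, alpha, E, E_age, G):
    # Eq. 1: SAM(x, alpha) = G(E(x) + E_age(x, alpha)), simplified interfaces.
    w = E(x)                 # W+ inversion of the input face
    delta = E_age(x, alpha)  # age-conditioned residual; alpha in [0, 100]
    return G(w + delta)      # with E_age = 0 this reduces to G(E(x))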
3.1.2 Artistic Portrait Style Transfer Network
DualStyleGAN stands out among recent advancements [pinkney2020resolution, kim2022cross, chong2022jojogan, back2021fine] in face style transfer networks due to its structure incorporating both intrinsic and extrinsic paths. This design simplifies the style transfer process, effectively integrates different latent vectors, and excels at applying diverse external styles while maintaining core features. Its streamlined yet versatile nature makes it an ideal choice for our methodology. Given a PR input $x$ and an NPR style exemplar $s$, DualStyleGAN is formulated as:
$$\mathrm{DualStyleGAN}(x, s) = G\big(z_i^+,\, z_e^+,\, \mathbf{w}\big) \tag{2}$$
where $z_i^+$ is the latent vector encoded from the PR image $x$ via the intrinsic path as