High Resolution Face Age Editing
Abstract
Face age editing has become a crucial task in film post-production, and is also becoming popular for general purpose photography. Recently, adversarial training has produced some of the most visually impressive results for image manipulation, including the face aging/de-aging task. In spite of considerable progress, current methods often present visual artifacts and can only deal with low-resolution images. In order to achieve aging/de-aging with the high quality and robustness necessary for wider use, these problems need to be addressed. This is the goal of the present work. We present an encoder-decoder architecture for face age editing. The core idea of our network is to create both a latent space containing the face identity, and a feature modulation layer corresponding to the age of the individual. We then combine these two elements to produce an output image of the person with a desired target age. Our architecture is greatly simplified with respect to other approaches, and allows for continuous age editing on high resolution images in a single unified model. Source codes are available at https://github.com/InterDigitalInc/HRFAE.
Keywords:
High resolution, face aging, image-to-image translation.

1 Introduction
[Figure 1: face age editing results of our method for target ages 25, 35, 45, 55 and 65.]
Learning to manipulate face age is an important topic both in industry and academia. In the movie post-production industry, many actors are retouched in some way, either for beautification or texture editing. More specifically, synthetic aging or de-aging effects are usually generated by makeup or special visual effects. Although impressive results can be obtained digitally, as in Martin Scorsese's recent movie The Irishman, the underlying processes are extremely time consuming. Thus, robust, high-quality algorithms for performing automatic age modification are highly desirable. Nevertheless, editing faces is an intrinsically difficult task. Indeed, the human brain is particularly good at perceiving facial attributes in order to detect, recognize or analyze faces, for instance to infer identity or emotions. Consequently, even small artifacts are immediately perceived and ruin the perception of the results. For this reason, our goal is to produce artifact-free, sharp and photorealistic results on high-resolution face images.
With the success of Generative Adversarial Networks (GANs) [7] in high quality image generation, GAN-based models have been widely used for image-to-image translation [35, 40]. Despite having set new standards for natural image synthesis, GANs are known to suffer from two major flaws: an abundance of small artifacts and a strong instability of the training process. The latest face aging studies [9, 20, 33, 36, 39] also adopt GAN-based models. Specifically, they divide face datasets into different age groups, feed young images into the generator, and rely on the discriminator to map output images to older age distributions. There are multiple limitations to this approach. Firstly, as can be expected, these approaches inherit the drawbacks of GAN-based methods: blurry backgrounds, small parasite structures, and training instability. Secondly, as the aging effect is generated by matching the output image distribution to the target group, these methods are limited to coarse aging/de-aging. To achieve fine-grained transformation, a separate model needs to be trained between each pair of ages.
In this work, we propose an encoder-decoder architecture for face age editing with high visual quality on high resolution images. In order to address the aforementioned limitations, namely the tendency to produce visual artifacts and training instability, we endeavour to keep the architecture as simple as possible. Firstly, we use a single network for both aging and de-aging. This is reasonable since the encoder part of our model is assumed to encode identity, emotion or details of the input image that are not related to age, so that the same latent space can be used for both aging and de-aging. Secondly, we rely on a feature modulation layer, which is compact, acts directly on the latent space and allows for continuous age transitions. Thirdly, unlike in competing methods where the discriminator used during adversarial training is conditioned on the target age, we use an unconditioned discriminator which concentrates solely on the photorealism of the output images, in order to reduce editing artifacts. This discriminator can be seen as a regularizer imposing photorealism rather than as a traditional discriminator trying to match two distributions. Thanks to this design, our model achieves efficient disentanglement of age attributes and face identity. We present experimental results on high resolution images with qualitative and quantitative evaluations. In particular, these experiments provide clear evidence that our results outperform state-of-the-art methods in visual quality. Experiments on alternative datasets further illustrate the generalization capacity of the method.
2 Related Works
Face aging The survey [6] gives an exhaustive overview of traditional age synthesis algorithms. In this work, we are more interested in deep learning based methods, which have made impressive progress on face aging tasks during the last few years. A conditional GAN [24] model was first introduced for the face aging task by [1, 39]. They encode the face image into a latent space, manipulate the latent code, and decode it to an aged face with the generator. However, the identity information is damaged during this process. This is further improved by [36, 38] by adding an identity preserving term to the objective. Despite the improvement, their results are over-smoothed compared with the input images. To capture texture details, wavelet-based generative models are introduced by [19, 20]. Their complex models increase the training difficulty and still yield strong artifacts. All the aforementioned models only enable face aging from one age group to another, e.g., from the 20s to the 40s, which lacks flexibility. Recently, [9] proposed an encoder-decoder network, in which a personalized aging basis is synthesized and an age-specific transform is applied. Their model also relies on a conditional discriminator to distinguish aging patterns between age groups. Differently from these methods, our model is designed for age editing towards an arbitrary target age. Moreover, our approach produces far fewer artifacts, making age editing on high resolution (1024×1024) images possible.
Image-to-image translation Face aging can be considered as an image-to-image translation problem, i.e., translating images between young and old age domains. An optimization based method is proposed by [34], showing that linear interpolation of deep features from pretrained convnets can be used to transform images. GAN based methods [13, 40, 11] further enable real-time translation by training a feed-forward generator. Existing image-to-image translation studies [3, 4, 18, 29, 30, 37] on face images also yield impressive results in manipulating facial attributes. Lample et al. [18] design an autoencoder architecture to reconstruct images and isolate single image characteristics in a latent component via a discriminator; these characteristics can then be modified directly in the latent space. Choi et al. [4] propose a method to perform image-to-image translation for multiple domains using a single model. Pumarola et al. [29] introduce an attention based model, which enables face animation by simple interpolation.
High-resolution image synthesis In spite of the considerable progress of recent methods, manipulating/editing natural images at high resolution has not yet been achieved. Nevertheless, for the related task of image generation, high quality results at high resolution are now available. Image generation at 1024×1024 resolution was first achieved by [15], with a progressive growing of GAN architectures. The quality of the results is further improved by StyleGAN [16, 17], which learns a separation of high-level attributes automatically during training. Based on this work, Shen et al. [32] propose an effective way to interpret the latent space learned by the generator and achieve high visual fidelity face manipulation on synthesized images. However, according to our experiments, only a fraction of natural images can be accurately reconstructed from a latent code, which makes this type of method impractical. In contrast, our proposed method achieves face age editing on 1024×1024 images, with great simplicity of architecture and loss design. The age editing is achieved only by an auxiliary modulating network, which could potentially be generalized to other face manipulation tasks.
3 Method

In this section, we formulate the face age editing problem and present our proposed model in detail. Figure 2 illustrates the proposed age transformer and training procedure.
3.1 Overview
Let $x_0$ be an image drawn randomly from a face dataset. We denote by $a_0$ the age of the person in $x_0$. Our goal is to transform $x_0$ so that the person in this image looks like someone of age $a_t$. We want the age-edited version of $x_0$ to share many age-unrelated characteristics with $x_0$: identity, emotion, haircut, background, etc. That is to say, the facial attributes not relevant to age, as well as the background, need to be preserved during age transformation. Therefore, we assume that a face aging model and a face de-aging model can share most of their parameters. In this setting, we consider a single age transformer $G$ and assume that $G$ can transform any face image to any target age. The inputs of our model are the face image $x_0$ and the target age $a_t$. The output is denoted by $x_t = G(x_0, a_t)$, which depicts $x_0$ at the target age $a_t$.
3.2 Age transformer
The proposed age transformer shown in Figure 2 employs an auto-encoder architecture and is made of an encoder, a feature modulation block and a decoder. The encoder consists of three strided convolutional layers (the first one of stride 1, the other two of stride 2) and four residual blocks [8], while the decoder contains two nearest-neighbour upsampling layers and three convolutional layers, similar to the architecture used in [14, 40]. The main difference compared to these works is our feature modulation block, in which the output features of the encoder are modulated by an age-specific vector (see details below). This idea is inspired by recent works on style transfer [5, 10], which show that different styles can be represented using the parameters of normalization layers.
• Encoder The face image $x_0$ is the input of the encoder. The output features are denoted by $F \in \mathbb{R}^{n \times m}$, where $n$ is the number of channels and $m$ is the product of the two spatial dimensions.
• Feature modulation for age selection The target age $a_t$ is encoded as a one-hot vector, denoted by $\delta_{a_t}$, and passed to the modulating network. This network consists of a single fully connected layer with a sigmoid activation. It outputs a modulation vector $w \in [0,1]^n$, which is used to re-weight the features before passing them into the decoder and obtaining the face image at the desired age. The modulated features are $\mathrm{diag}(w)\,F$, where $\mathrm{diag}(w)$ is the diagonal matrix with diagonal $w$ (a minimal code sketch of this block is given after this list).
• Decoder The decoder takes the modulated features as input, together with two skip connections used to preserve the finer details of the input image. The final output is denoted by $x_t = G(x_0, a_t)$.
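To make the modulation step concrete, here is a minimal PyTorch sketch of the feature modulation block described above. The class name and the layer sizes are illustrative assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class AgeModulation(nn.Module):
    """Maps a one-hot target age to a per-channel modulation vector in [0, 1]."""

    def __init__(self, num_ages: int, num_channels: int):
        super().__init__()
        # A single fully connected layer followed by a sigmoid, as described above.
        self.fc = nn.Linear(num_ages, num_channels)

    def forward(self, age_one_hot: torch.Tensor) -> torch.Tensor:
        # age_one_hot: (batch, num_ages) -> modulation vector w: (batch, num_channels)
        return torch.sigmoid(self.fc(age_one_hot))

def modulate(features: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """Channel-wise re-weighting of the encoder features, i.e. diag(w) applied to F."""
    # features: (batch, C, H, W), w: (batch, C)
    return features * w.unsqueeze(-1).unsqueeze(-1)

# Illustrative usage with assumed sizes (not the paper's exact values).
encoder_features = torch.randn(2, 256, 64, 64)
target_age = nn.functional.one_hot(torch.tensor([35, 60]), num_classes=101).float()
mod = AgeModulation(num_ages=101, num_channels=256)
modulated = modulate(encoder_features, mod(target_age))  # shape: (2, 256, 64, 64)
```

The decoder consumes the modulated features exactly as it would the unmodulated encoder output, so changing the target age only changes the per-channel weights.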
3.3 Training
As illustrated in Figure 2, we train our age transformer $G$ with an age classifier that ensures age-accurate transformation and a discriminator that enforces photorealism.
The initial age $a_0$ of $x_0$ is easy to estimate using a pretrained age classifier, e.g., [31]. We thus do not need an age-annotated dataset for training. The original age range of the training dataset is denoted by $\mathcal{A}$. At test time, the target age can be chosen as any age in $\mathcal{A}$. At training time, it would seem reasonable to choose any value in $\mathcal{A}$ uniformly at random. However, we noticed that the artifacts appearing during large age transformations were better corrected when selecting a target age far enough from $a_0$ during training. We therefore sample $a_t$ at training time from the set $\mathcal{A}_{x_0} = \{a \in \mathcal{A} : |a - a_0| \ge k\}$, where $k$ is a predefined constant representing the minimum age transformation interval. We denote by $\mathcal{U}(\mathcal{A}_{x_0})$ the uniform distribution over $\mathcal{A}_{x_0}$.
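As an illustration, this target-age sampling strategy can be written as follows; the age bounds and the minimum interval are placeholder values, not the settings used in our experiments:

```python
import random

def sample_target_age(a0: int, a_min: int, a_max: int, k: int) -> int:
    """Uniformly sample a target age at least k years away from the estimated age a0."""
    candidates = [a for a in range(a_min, a_max + 1) if abs(a - a0) >= k]
    return random.choice(candidates)

# Example with hypothetical values: for a 30-year-old face, a range of [20, 69] and
# k = 25, the target age is drawn uniformly from {55, ..., 69}.
print(sample_target_age(a0=30, a_min=20, a_max=69, k=25))
```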
Classification loss To measure the age of $x_t$, we use the same age classifier as the one used to estimate $a_0$. During training, we freeze the weights of this classifier. The classifier, denoted by $C$, takes $x_t$ as input and generates a discrete probability distribution over the set of ages $\mathcal{A}$. The classification loss satisfies

$$\mathcal{L}_{\text{class}} = \mathbb{E}_{x_0 \sim P,\; a_t \sim \mathcal{U}(\mathcal{A}_{x_0})}\left[\,\ell\big(C(G(x_0, a_t)),\, \delta_{a_t}\big)\right], \tag{1}$$

where $P$ denotes the training image distribution over $x_0$, $\ell$ denotes the categorical cross-entropy loss, and $\delta_{a_t}$ is the one-hot vector encoding $a_t$.
Adversarial loss To enforce better photorealism of the modified images $x_t$, we adopt an adversarial loss built using PatchGAN [13] with the LSGAN objective [22]. Unlike the latest works on face aging [9, 20, 33, 36, 39], our discriminator is used to distinguish between real and manipulated images without taking the age information into account. In our work, the aging and de-aging effects are obtained solely with the age classification loss.

The discriminator is denoted by $D$. The architecture of $D$ is the same as that proposed in [13], with a patch size adapted to the image resolution. The modified image $x_t$ should be indistinguishable from real samples. Therefore, the losses we use are

$$\mathcal{L}_D = \mathbb{E}_{x_0 \sim P}\left[(D(x_0) - 1)^2\right] + \mathbb{E}_{x_0 \sim P,\; a_t \sim \mathcal{U}(\mathcal{A}_{x_0})}\left[D(G(x_0, a_t))^2\right] \tag{2}$$

when training $D$, and

$$\mathcal{L}_{\text{adv}} = \mathbb{E}_{x_0 \sim P,\; a_t \sim \mathcal{U}(\mathcal{A}_{x_0})}\left[(D(G(x_0, a_t)) - 1)^2\right] \tag{3}$$

when training $G$. We apply $R_1$ regularization [23] on the discriminator.
Reconstruction loss When the age transformer receives $x_0$ and its original age $a_0$ as inputs, the generated output image should be identical to the input image. Hence, we minimize the following reconstruction loss:

$$\mathcal{L}_{\text{recon}} = \mathbb{E}_{x_0 \sim P}\left[\lVert G(x_0, a_0) - x_0 \rVert_1\right]. \tag{4}$$
Full loss We train the age transformer $G$ and the discriminator $D$ by minimizing the full objective

$$\mathcal{L} = \mathcal{L}_{\text{class}} + \lambda_{\text{adv}}\,\mathcal{L}_{\text{adv}} + \lambda_{\text{recon}}\,\mathcal{L}_{\text{recon}}, \tag{5}$$

where $\lambda_{\text{adv}}$ and $\lambda_{\text{recon}}$ are weights balancing the influence of each loss.
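The sketch below shows how the three losses can be combined in one training iteration, assuming a generator G, a frozen age classifier C returning logits over the age range, and a PatchGAN discriminator D; the function and weight names are illustrative, not taken from the released code:

```python
import torch
import torch.nn.functional as F

def transformer_loss(G, C, D, x0, a0_onehot, at_onehot, lambda_adv, lambda_recon):
    """Full objective (5) for one batch, used when updating the age transformer G."""
    x_t = G(x0, at_onehot)                                    # image edited towards a_t
    # (1) classification loss: C(x_t) should predict the target age a_t
    loss_class = F.cross_entropy(C(x_t), at_onehot.argmax(dim=1))
    # (3) LSGAN adversarial loss: edited images should be scored as real by D
    loss_adv = torch.mean((D(x_t) - 1.0) ** 2)
    # (4) reconstruction loss: asking for the original age a_0 should return the input
    loss_recon = torch.mean(torch.abs(G(x0, a0_onehot) - x0))
    return loss_class + lambda_adv * loss_adv + lambda_recon * loss_recon

def discriminator_loss(G, D, x0, at_onehot):
    """LSGAN discriminator objective (2): real images scored towards 1, edited towards 0."""
    with torch.no_grad():
        x_t = G(x0, at_onehot)
    return torch.mean((D(x0) - 1.0) ** 2) + torch.mean(D(x_t) ** 2)
```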
4 Experiments
In this section, we introduce our training setup and present the experimental results. We further evaluate the quality of our results using quantitative metrics.
4.1 Data augmentation with synthetic images
Our training dataset is built upon FFHQ [16], a high resolution dataset which contains 70,000 face images at 1024×1024 resolution. The dataset includes large variations in age, ethnicity, pose, lighting, and image background. However, it contains only unlabeled raw images collected from Flickr.
To obtain the age information, we use an age classifier pretrained on IMDB-WIKI [31]. We observe that FFHQ contains many more samples of young faces than of old ones. This data imbalance is challenging since the aging and de-aging tasks would not be treated equally during training: most faces being young, the age transformer would be trained to perform aging much more often than de-aging, and would fail to yield satisfying de-aging results. To compensate for this imbalance in the age distribution, we perform data augmentation using StyleGAN, a state-of-the-art high resolution image generation model [16]. We use the StyleGAN model pretrained on FFHQ to generate synthetic images. A quick visual inspection shows that most of the generated images have no significant artifacts and are nearly indistinguishable from real images by a human. Therefore, we use them for data augmentation to obtain a quasi-uniform age distribution over the training age range: any age bin with fewer samples than the target count in the original FFHQ dataset is completed with generated synthetic face images, while for any age bin exceeding the target count we randomly select the required number of face images from the original FFHQ dataset. The resulting age-equalized dataset covers the full training age range.
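The age-equalization procedure can be sketched as follows; the per-bin target count and the (image, predicted age) bookkeeping are assumptions made for illustration:

```python
import random
from collections import defaultdict

def equalize_age_distribution(real_samples, synthetic_samples, target_per_bin):
    """Build a quasi-uniform age distribution over the training age range.

    real_samples, synthetic_samples: lists of (image_path, predicted_age) pairs,
    where the ages come from the pretrained age classifier.
    """
    real_bins, synth_bins = defaultdict(list), defaultdict(list)
    for path, age in real_samples:
        real_bins[age].append(path)
    for path, age in synthetic_samples:
        synth_bins[age].append(path)

    dataset = []
    for age in sorted(real_bins.keys() | synth_bins.keys()):
        real = real_bins[age]
        if len(real) >= target_per_bin:
            # Over-represented age: randomly subsample the real FFHQ images.
            dataset += random.sample(real, target_per_bin)
        else:
            # Under-represented age: complete the bin with StyleGAN-generated faces.
            dataset += real + synth_bins[age][:target_per_bin - len(real)]
    return dataset
```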
4.2 Implementation details
Our model is implemented in PyTorch [28]. We split the equalized dataset into a training set and a held-out test set. For the age transformer and the discriminator, spectral normalisation [25] is applied on all the convolution layers except the last one of the age transformer. All the activation layers use Leaky ReLU [21] with a small negative slope.
We consider age transformation only within the training age range. The minimum age transformation interval $k$ is set to a large value: we have observed that the most significant artifacts appear when the gap between the source and target age is large, and by choosing $k$ large enough we force the discriminator to suppress these artifacts during adversarial training. The weights $\lambda_{\text{adv}}$ and $\lambda_{\text{recon}}$ are kept fixed throughout training. We use the Adam optimizer, and the age transformer is updated once after each discriminator update. The model is trained in two stages to achieve face age editing on high resolution images: the first epochs are trained on downsampled images with a larger batch size, and the remaining epochs are trained on full resolution 1024×1024 images, for which we reduce the batch size, the learning rate and the adversarial loss weight.
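For completeness, a hedged sketch of the optimization setup is given below; the stand-in modules and every numeric value (slope, learning rate) are placeholders, since the actual architecture and settings are those of Table 2 and the released code:

```python
import torch
import torch.nn as nn
from torch.nn.utils import spectral_norm

# Stand-in transformer G and discriminator D; the real modules follow Table 2 in the appendix.
G = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.LeakyReLU(0.2),
                  nn.Conv2d(32, 3, 3, padding=1))
D = nn.Sequential(nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
                  nn.Conv2d(32, 1, 4))

# Spectral normalization on every convolution except the last layer of the transformer.
last_conv_of_G = list(G.children())[-1]
for net in (G, D):
    for m in net.modules():
        if isinstance(m, nn.Conv2d) and m is not last_conv_of_G:
            spectral_norm(m)

# Placeholder optimizer settings; G is updated once after each D update.
opt_G = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=1e-4)
```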
4.3 Qualitative evaluation
[Figures 3 and 4: age editing results for target ages 25, 35, 45, 55 and 65, and continuous editing with a smoothly varying target age.]
Figure 3 presents age editing results on input images in different age groups. Our approach yields visually satisfying results with sharp details (best viewed when zooming on the results) and without introducing significant artifacts. Only the age-relevant facial features are modified, while the identity, haircut, emotion and background are well preserved. This is all the more satisfying since no mask has been used to isolate the face from the rest of the image. Figure 4 presents age editing results with a smooth evolution of the target age. The difference between two adjacent results is nearly invisible, which illustrates the smoothness of the aging process.
[Figure 5: (a) comparison with IPCGAN; (b) comparison with PAGGAN, showing inputs and the corresponding 51+ results for PAGGAN and our method.]
We compare our method to the two most recent state-of-the-art face aging methods for which official code has been released: IPCGAN [36] and PAGGAN [38]. We also compare our results to those obtained with FaderNet [18], which allows one to manipulate several facial attributes, including age.
Figure 5 presents the face aging results of IPCGAN, PAGGAN and our method on CACD [2]. Each method produces outputs at its own native resolution. IPCGAN generates satisfying aging results and preserves the identity of the input images well. However, as can be seen e.g. in Figure 5(a), row 1, column 4, the generated image presents noticeable artifacts. PAGGAN generates impressive aging effects but also introduces colored artifacts, as shown in Figure 5(b), row 1, column 2. Both IPCGAN and PAGGAN degrade the quality of the input images. Our method generates consistent aging effects while preserving the fine details of the input images.
[Figure 6: face aging comparison on CelebA-HQ — columns: Input, Fader Network, PAGGAN, IPCGAN, Ours.]
Generalisation capacity on an unseen dataset For a fair comparison, and also to reduce the possible effect of overfitting on the training data, we evaluate all methods on a dataset not seen at training time by any of them. We chose CelebA-HQ [15], a high resolution version of the CelebA dataset. The input images are high resolution and are downsampled to the resolution at which each method was trained, using their official codes; each method then outputs images at its native resolution. We compare only the face aging results from the young age group to the old age group, since PAGGAN and IPCGAN are trained only for aging. Figure 6 shows the results obtained with the different methods. FaderNet [18] introduces little modification. PAGGAN [38] generates satisfying age progression effects; however, noticeable artifacts are present on the face edges and hair. IPCGAN [36] is limited to low resolution and thus strongly degrades the image quality. In comparison, our approach introduces far fewer artifacts and better preserves the fine details of the face and the background.
4.4 Quantitative evaluation
Method | Predicted Age | Blur | Gender Preservation (%) | Smiling Preservation (%) | Neutral Preservation (%) | Happiness Preservation (%)
---|---|---|---|---|---|---
FaderNet [18] | 44.34 ± 11.40 | 9.15 | 97.60 | 95.20 | 90.60 | 92.40
PAGGAN [38] | 49.07 ± 11.22 | 3.68 | 95.10 | 93.10 | 90.20 | 91.70
IPCGAN [36] | 49.72 ± 10.95 | 9.73 | 96.70 | 93.60 | 89.50 | 91.10
Ours | 54.77 ± 8.40 | 2.15 | 97.10 | 96.30 | 91.30 | 92.70
Quantitative evaluation of image-to-image translation tasks is still an open question and there is no universal metric to measure photorealism or quantify artifacts in an image. The recent works [9, 20, 38] on face aging use an online face recognition API to estimate the age and the identity preservation accuracy of the modified images. We thus employ a similar evaluation process.
In our evaluation, the first 1000 images with the true “Young” label in the CelebA-HQ dataset are extracted as test images. Using this test set, we make a quantitative comparison with FaderNet [18], IPCGAN [36] and PAGGAN [38]. Each image is transferred to the oldest age group using the officially released model of each method. For IPCGAN and PAGGAN, the oldest age group is the one defined in their respective works. For FaderNet, the age attribute is set to the default largest value used for aging in their official code. To have a fair comparison with these group-wise methods, we choose as target age for our age transformer the mean of the oldest age range used by the group-wise methods.
We thus obtain 1000 modified images for each method. We further evaluate these output images using the online face recognition API of Face++ [12]. From the detect API, we obtain the following metrics: age, gender, blurriness (whether the face is blurry or not; larger values mean blurrier), smiling and emotion estimation. The emotion estimation covers a series of emotions: sadness, neutral, disgust, anger, surprise, fear and happiness. A preliminary analysis of the results shows that the vast majority of the input images are classified as neutral or happiness, so we keep only these two categories for the emotion preservation comparison. We have also compared the identity preservation rate by using the API to compare the modified images with the original inputs. However, since all methods achieve nearly 100% accuracy, this metric is not reported here.
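For reference, a query to the Face++ detect API can be sketched as below. The endpoint URL, attribute names and response layout are assumptions to be checked against the current Face++ documentation, and the key/secret are placeholders:

```python
import requests

# Placeholder credentials and assumed endpoint/parameter names.
DETECT_URL = "https://api-us.faceplusplus.com/facepp/v3/detect"
API_KEY, API_SECRET = "your_api_key", "your_api_secret"

def estimate_face_attributes(image_path: str) -> dict:
    """Return the estimated age, gender, blur, smiling and emotion for one face image."""
    with open(image_path, "rb") as f:
        response = requests.post(
            DETECT_URL,
            data={
                "api_key": API_KEY,
                "api_secret": API_SECRET,
                "return_attributes": "age,gender,blur,smiling,emotion",
            },
            files={"image_file": f},
        )
    response.raise_for_status()
    faces = response.json().get("faces", [])
    return faces[0]["attributes"] if faces else {}
```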
Table 1 shows the quantitative evaluation results. All methods are given the oldest age group as aging target, and we notice that our method yields the highest average predicted age. The gender preservation rate is calculated by comparing the estimated gender with the original CelebA annotations. On this metric, FaderNet achieves the best performance, followed by our method. For expression preservation (smiling) and emotion preservation (neutral, happiness), our approach yields the best results, although all methods obtain similar scores. For the blur evaluation, the results differ much more markedly: our method performs much better at generating sharp images, which is in agreement with the visual comparisons.
4.5 Discussion
[Figure 7: ablation on the discriminator — rows (a), (b) and (c) correspond to the three settings below; columns show the input and the results for target ages 25 and 65.]
Ablation study on discriminator We have explored three different types of discriminators to train the age transformer. Figure 7 presents the face age editing results corresponding to the different settings.
• Conditional discriminator. We adopt a patch discriminator [13] with a label projection applied on the features before the last convolutional layer, similar to the setting in [26]. The discriminator is conditioned on four age groups. At the training stage, we find it essential to give the discriminator the same number of real and fake images from each class for training to succeed. If we sample a target age from the set $\mathcal{A}_{x_0}$ at training time, the discriminator receives more manipulated images in the youngest and oldest groups, and thus tends to classify all images in these two groups as fake. The conditional discriminator is very sensitive to the original data distribution and needs much more hyper-parameter fine-tuning to converge. Figure 7(a) presents the age editing results with the conditional discriminator; strong artifacts can be observed in the aging results.
• Two separate discriminators. One discriminator receives manipulated and real images whose (desired) age lies in the old age group, while the other one takes manipulated and real images in the young age group. With this setting, the task of generating aging/de-aging effects is shared between the classifier and the discriminators. Although the results in Figure 7(b) are better than those in Figure 7(a), over-smoothing artifacts are perceived in the de-aging results and colored artifacts appear in the aging results.
• One single discriminator. This is our proposed method. The discriminator can be considered as a regularizer which imposes photorealism, as it takes all the manipulated and real images as input. The generation of aging/de-aging effects is dictated solely by the age classifier. We are able to achieve high resolution results only with this last setting.
[Figure 8: input images and their reconstructions obtained by optimizing a StyleGAN latent code.]
Image reconstruction from latent code optimization As mentioned in Section 2, the recent work of Shen et al. [32] proposes an effective way to manipulate the latent code of an image generator in order to achieve high visual quality manipulation of synthesized images. It is therefore tempting to manipulate the latent code directly to perform face manipulation (and thus age editing) on natural images with this approach. However, finding such a latent code for an arbitrary face image is still a challenging problem. According to our experiments using StyleGAN [16], only a fraction of natural face images can be accurately reconstructed from a latent code obtained with [27], i.e., by optimizing over the latent space a code that minimizes the distance between the generated image and the input image. Consequently, this type of method is impractical until a better StyleGAN encoder is made available. Figure 8 supports this claim by showing reconstruction results of natural face images. We notice that the reconstructed images have painting-like artifacts, blurry backgrounds, and sometimes fail to preserve the identity of the person in the input image. Indeed, StyleGAN is much more efficient at sampling random faces from the latent space than at approximating a given face image, since a GAN is not necessarily invertible. Hence, an editing method based on this latent code reconstruction will struggle to correctly handle natural images and to achieve the high visual quality of our method.
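For completeness, the latent-code optimization referred to above can be sketched as follows, assuming a frozen, differentiable pretrained generator `generator` mapping a latent code to an image; the pixel-wise loss, step count and learning rate are illustrative choices, not those of [27]:

```python
import torch

def invert_image(generator, target, latent_dim=512, steps=1000, lr=0.01):
    """Optimize a latent code so that generator(latent) approximates the target image.

    generator: frozen, differentiable image generator (e.g., a pretrained StyleGAN).
    target:    tensor of shape (1, 3, H, W) matching the generator's output size.
    """
    latent = torch.randn(1, latent_dim, requires_grad=True)
    optimizer = torch.optim.Adam([latent], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = torch.nn.functional.mse_loss(generator(latent), target)  # pixel distance
        loss.backward()
        optimizer.step()
    return latent.detach()
```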
Weakly supervised training To the best of our knowledge, our work is the first among recent face aging studies [9, 20, 33, 36, 39] to train on unlabeled data. A classifier pretrained on IMDB-WIKI [31], a low resolution face dataset, is used to provide age information. Moreover, the discriminator in our method is used only to distinguish real from manipulated images. Relying solely on the classifier, we successfully extract the age-specific features and achieve age transformation on high resolution images. This reveals the capacity of the classifier, even though it was trained on low quality images. Our method could potentially be generalized to other face attribute manipulation tasks by using a separate pair of modulating network and classifier for each attribute.
5 Conclusion
In this paper, we have proposed an age transformer architecture, enabling continuous face age editing with a single network, which we have endeavoured to keep as simple as possible. We believe that this approach, combined with an encoder-decoder architecture rather than a complex GAN, is the best path towards high quality, high resolution face editing results. We have demonstrated the capacity of our model to produce photorealistic and sharp results on high resolution (1024×1024) images, without introducing significant artifacts. The proposed feature modulation block appears to achieve an efficient separation of age and identity information. Given the performance achieved, this design can potentially be useful for other face attribute manipulation tasks.
References
- [1] Antipov, G., Baccouche, M., Dugelay, J.L.: Face aging with conditional generative adversarial networks. In: 2017 IEEE International Conference on Image Processing (ICIP). pp. 2089–2093. IEEE (2017)
- [2] Chen, B.C., Chen, C.S., Hsu, W.H.: Cross-age reference coding for age-invariant face recognition and retrieval. In: European conference on computer vision. pp. 768–783. Springer (2014)
- [3] Chen, Y.C., Shen, X., Lin, Z., Lu, X., Pao, I., Jia, J., et al.: Semantic component decomposition for face attribute manipulation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9859–9867 (2019)
- [4] Choi, Y., Choi, M., Kim, M., Ha, J.W., Kim, S., Choo, J.: Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 8789–8797 (2018)
- [5] Dumoulin, V., Shlens, J., Kudlur, M.: A learned representation for artistic style. Proc. of ICLR (2017)
- [6] Fu, Y., Guo, G., Huang, T.S.: Age synthesis and estimation via faces: A survey. IEEE transactions on pattern analysis and machine intelligence 32(11), 1955–1976 (2010)
- [7] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., Bengio, Y.: Generative adversarial nets. In: Advances in neural information processing systems. pp. 2672–2680 (2014)
- [8] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
- [9] He, Z., Kan, M., Shan, S., Chen, X.: S2gan: Share aging factors across ages and share aging trends among individuals. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 9440–9449 (2019)
- [10] Huang, X., Belongie, S.J.: Arbitrary style transfer in real-time with adaptive instance normalization. In: ICCV (2017)
- [11] Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 172–189 (2018)
- [12] Inc, M.: Face++ research toolkit. http://www.faceplusplus.com. (2013)
- [13] Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: CVPR (2017)
- [14] Johnson, J., Alahi, A., Li, F.F.: Perceptual losses for real-time style transfer and super-resolution. In: ECCV. Springer (2016)
- [15] Karras, T., Aila, T., Laine, S., Lehtinen, J.: Progressive growing of GANs for improved quality, stability, and variation. In: International Conference on Learning Representations (2018), https://openreview.net/forum?id=Hk99zCeAb
- [16] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4401–4410 (2019)
- [17] Karras, T., Laine, S., Aittala, M., Hellsten, J., Lehtinen, J., Aila, T.: Analyzing and improving the image quality of stylegan. arXiv preprint arXiv:1912.04958 (2019)
- [18] Lample, G., Zeghidour, N., Usunier, N., Bordes, A., DENOYER, L., et al.: Fader networks: Manipulating images by sliding attributes. In: Advances in Neural Information Processing Systems (2017)
- [19] Li, P., Hu, Y., He, R., Sun, Z.: Global and local consistent wavelet-domain age synthesis. IEEE Transactions on Information Forensics and Security (2019)
- [20] Liu, Y., Li, Q., Sun, Z.: Attribute-aware face aging with wavelet-based generative adversarial networks. In: The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (June 2019)
- [21] Maas, A.L., Hannun, A.Y., Ng, A.Y.: Rectifier nonlinearities improve neural network acoustic models. In: in ICML Workshop on Deep Learning for Audio, Speech and Language Processing. Citeseer (2013)
- [22] Mao, X., Li, Q., Xie, H., Lau, R.Y., Wang, Z., Paul Smolley, S.: Least squares generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 2794–2802 (2017)
- [23] Mescheder, L., Nowozin, S., Geiger, A.: Which training methods for gans do actually converge? In: International Conference on Machine Learning (ICML) (2018)
- [24] Mirza, M., Osindero, S.: Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784 (2014)
- [25] Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. In: International Conference on Learning Representations (2018), https://openreview.net/forum?id=B1QRgziT-
- [26] Miyato, T., Koyama, M.: cgans with projection discriminator. arXiv preprint arXiv:1802.05637 (2018)
- [27] Nikitko, D.: Stylegan encoder for official tensorflow implementation. https://github.com/Puzer/stylegan-encoder (2019)
- [28] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., Lerer, A.: Automatic differentiation in PyTorch. In: NIPS Autodiff Workshop (2017)
- [29] Pumarola, A., Agudo, A., Martinez, A.M., Sanfeliu, A., Moreno-Noguer, F.: Ganimation: Anatomically-aware facial animation from a single image. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 818–833 (2018)
- [30] Qian, S., Lin, K.Y., Wu, W., Liu, Y., Wang, Q., Shen, F., Qian, C., He, R.: Make a face: Towards arbitrary high fidelity face manipulation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 10033–10042 (2019)
- [31] Rothe, R., Timofte, R., Van Gool, L.: Dex: Deep expectation of apparent age from a single image. In: Proceedings of the IEEE International Conference on Computer Vision Workshops. pp. 10–15 (2015)
- [32] Shen, Y., Gu, J., Tang, X., Zhou, B.: Interpreting the latent space of gans for semantic face editing. arXiv preprint arXiv:1907.10786 (2019)
- [33] Song, J., Zhang, J., Gao, L., Liu, X., Shen, H.T.: Dual conditional gans for face aging and rejuvenation. In: IJCAI. pp. 899–905 (2018)
- [34] Upchurch, P., Gardner, J., Pleiss, G., Pless, R., Snavely, N., Bala, K., Weinberger, K.: Deep feature interpolation for image content changes. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7064–7073 (2017)
- [35] Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional gans. In: CVPR (2018)
- [36] Wang, Z., Tang, X., Luo, W., Gao, S.: Face aging with identity-preserved conditional generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 7939–7947 (2018)
- [37] Xiao, T., Hong, J., Ma, J.: Elegant: Exchanging latent encodings with gan for transferring multiple face attributes. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 168–184 (2018)
- [38] Yang, H., Huang, D., Wang, Y., Jain, A.K.: Learning face age progression: A pyramid architecture of gans. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 31–39 (2018)
- [39] Zhang, Z., Song, Y., Qi, H.: Age progression/regression by conditional adversarial autoencoder. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017)
- [40] Zhu, J.Y., Park, T., Isola, P., Efros, A.A.: Unpaired image-to-image translation using cycle-consistent adversarial networks. In: Proceedings of the IEEE international conference on computer vision. pp. 2223–2232 (2017)
Appendix 0.A Network architecture
Table 2 presents the hyperparameters of the proposed network architecture. The discriminator is a patch discriminator: each element of its output feature map corresponds to a patch-sized receptive field on the original input image.
Appendix 0.B Age classifier
To obtain the age information for the FFHQ dataset [16], we use the age classifier of [31], which has been pretrained on IMDB-WIKI. This dataset contains face images of celebrities collected from the IMDB and Wikipedia websites. It mostly covers the adult age interval and has only very few samples for the youngest and oldest age intervals. Consequently, the age classifier might yield less accurate age estimations for very young or very old faces. We therefore choose to restrict training to images within a fixed age range. We pass the images of the FFHQ dataset through the age classifier and observe that FFHQ contains many more samples of young faces than of old ones. We then augment the dataset with synthetic images generated by StyleGAN [16] to achieve a quasi-uniform age distribution over the training age range, as described in Section 4.1 of the paper.
Appendix 0.C Additional results
In this section, we present supplementary results on high resolution images.
0.C.1 Results on FFHQ dataset
0.C.2 Comparison with other methods
In Figure 13, we show additional comparisons of face aging results on CelebA-HQ [15]. As mentioned in the paper, we compare our method against the two most recent state-of-the-art face aging methods for which official code has been released, PAGGAN [38] and IPCGAN [36]. We also compare our results to those obtained with Fader Network [18], which allows one to manipulate several facial attributes, including age. Each input image is transformed to the oldest age group using the officially released model of each method. For IPCGAN and PAGGAN, the oldest age group is the one defined in their respective works. For Fader Network, the age attribute is set to the default largest value used for aging in their official code. To have a fair comparison with these group-wise methods, we choose as target age for our age transformer the mean of the oldest age range used by the group-wise methods.
Operation | Kernel size | Stride | Channel |
---|---|---|---|
Age transformer | |||
Encoder | |||
Convolution | |||
Convolution | |||
Skip connection 1 | |||
Convolution | |||
Skip connection 2 | |||
Residual block | |||
Residual block | |||
Residual block | |||
Residual block | |||
Modulation layer | |||
Decoder | |||
Concatenation with skip connection 2 | |||
Upsampling | |||
Convolution | |||
Concatenation with skip connection 1 | |||
Upsampling | |||
Convolution | |||
Convolution | |||
Discriminator | |||
Convolution | |||
Convolution | |||
Convolution | |||
Convolution | |||
Convolution | |||
Convolution | |||
Upsampling mode | Nearest (scale factor = ) | ||
Padding mode | Reflection | ||
Normalization | InstanceNorm for age transformer | ||
BatchNorm for discriminator | |||
Activation | LeakyReLU (negative slope = ) |
[Figures: additional age editing results of our method on FFHQ images, for target ages 25, 35, 45, 55 and 65.]
[Figure 13: additional face aging comparisons on CelebA-HQ — columns: Input, Fader Network, PAGGAN, IPCGAN, Ours.]