
LEGAN: Disentangled Manipulation of Directional Lighting and Facial Expressions whilst Leveraging Human Perceptual Judgements
*Work done while at Affectiva

Sandipan Banerjee1, Ajjen Joshi1, Prashant Mahajan*2, Sneha Bhattacharya*3, Survi Kyal1 and Taniya Mishra*4
1
Affectiva, USA, 2Amazon, USA, 3Silver Spoon Animation, USA, 4SureStart, USA
{firstname.lastname}@affectiva.com, [email protected], [email protected], [email protected]
Abstract

Building facial analysis systems that generalize to extreme variations in lighting and facial expressions is a challenging problem that can potentially be alleviated using natural-looking synthetic data. Towards that, we propose LEGAN, a novel synthesis framework that leverages perceptual quality judgments for jointly manipulating lighting and expressions in face images, without requiring paired training data. LEGAN disentangles the lighting and expression subspaces and performs transformations in the feature space before upscaling to the desired output image. The fidelity of the synthetic image is further refined by integrating a perceptual quality estimation model, trained with face images rendered using multiple synthesis methods and their crowd-sourced naturalness ratings, into the LEGAN framework as an auxiliary discriminator. Using objective metrics like FID and LPIPS, LEGAN is shown to generate higher quality face images when compared with popular GAN models like StarGAN and StarGAN-v2 for lighting and expression synthesis. We also conduct a perceptual study using images synthesized by LEGAN and other GAN models and show the correlation between our quality estimation and visual fidelity. Finally, we demonstrate the effectiveness of LEGAN as training data augmenter for expression recognition and face verification tasks.

1 Introduction

Deep learning [59] has engendered tremendous progress in automated facial analysis, with applications ranging from face verification [72, 22, 11] to expression classification [92]. However, building robust and accurate models that generalize effectively in-the-wild is still an open problem. A major part of this stems from training datasets failing to represent the “true” distribution of real world data [54, 68, 93, 6] (e.g., extreme lighting conditions [17, 73, 38]), or from training sets that are non-uniformly distributed across classes, leading to the long-tail problem [63].

One way to mitigate the imbalance problem, shown to work in multiple domains [62, 9, 15, 100], is to introduce synthetic samples into the training set. Many approaches for generating synthetic data exist [18, 8, 42], none as successful as GANs [37] in generating realistic face images [16, 51, 52, 25]. Thus, in this work, we design a GAN model for synthesizing variations of an existing face image with the desired illumination and facial expression, while keeping the subject identity and other attributes constant. These synthetic images, when used as supplemental training data, can help build facial analysis systems that better generalize across variations in illumination and expressions.

Figure 1: LEGAN jointly manipulates lighting and expression in face images while preserving the subject identity of the input.
Figure 2: Our quality estimation model is trained on synthetic face images, varying in gender, ethnicity, lighting and facial attributes, generated with five methods [1, 51, 52, 9, 78] and their crowd-sourced perceptual ratings. To account for the subjective nature of human perception, our model utilizes a margin based regression loss to learn representations. This trained model can then generate naturalness predictions (Q(I)) of unseen face images.

One drawback of GAN-based face generation is the absence of an accurate and automated metric to judge the perceptual quality of synthesized images [57, 20]. In order to solve this problem, we also introduce a quality estimation model that can serve as a cheap but efficient proxy for human judgment while evaluating naturalness of synthetic face images. Instead of generating a single score for a distribution of synthetic images [107, 45, 79] or for image pairs [105, 75], our goal is to infer an image-quality score on a continuum for a single synthetic face image. With this in mind, we run an Amazon Mechanical Turk (AMT) experiment where turkers are instructed to score the naturalness of synthetic face images, generated using different 3D-model [9] and GAN-based [51, 52, 78, 2] synthesis approaches. We then build a feed forward CNN to learn representations from these images that map to their corresponding perceptual rating, using a margin based regression loss.

In addition to a traditional discriminator [37], this trained quality model is then used as an auxiliary discriminator in the synthesis framework, named LEGAN (Lighting-Expression GAN), that we propose in this paper. Instead of intertwining the two tasks [25], LEGAN decomposes the lighting and expression sub-spaces using a pair of hourglass networks (encoder-decoder) that generate transformation masks capturing the intensity changes required for target generation. The desired output image is then synthesized by a third hourglass network from these two masks.

To demonstrate the effectiveness of LEGAN, we qualitatively and quantitatively compare its synthesized images with those produced by two popular GAN based models [25, 26] using objective metrics like FID [45], LPIPS [105], SSIM [95] and face match score. We also conduct a human rater study to evaluate the perceptual quality of LEGAN’s images and the contribution of the quality based auxiliary discriminator towards hallucinating perceptually superior images. Finally, we show the efficacy of LEGAN as training data augmenter by improving the generalizability of face verification and expression recognition models.

2 Related Work

Face Synthesis: Early approaches [18, 67] focused on stitching together similar looking facial patches from a gallery to synthesize a new face. Manipulating the facial shape using 3D models [49, 89, 63, 9] or deep features [27, 91] is another popular approach to generate new views. In recent times, however, researchers have predominantly focused on using GANs [37] for synthesis, where an upsampling generator hallucinates faces from a noise vector, either randomly sampled from a distribution [37, 77, 51, 52] or interpreted from a different domain [71]. An existing face can also be encoded and then upsampled to obtain the desired attributes [48, 47, 96, 10, 70, 31, 35, 108, 23, 25].

Figure 3: During training, G takes an image I_a and a target expression and lighting tensor f_b as input and disentangles the feature sub-spaces using a pair of hourglass networks, generating transformation masks M_e and M_l that are concatenated and passed through a third hourglass to hallucinate the target output G(I_a, f_b). To eliminate the need for paired training data, we augment G(I_a, f_b) with the source expression and lighting tensor f_a and pass it through the same generator to reconstruct the input G(G(I_a, f_b), f_a), from which we compute the reconstruction error L_rec. Moreover, with the help of D and the quality based auxiliary discriminator Q, we calculate the adversarial (L_adv), feature classification (L_cls) and quality (L_qual) losses respectively. Utilizing the identity network T, we compute the identity loss L_id. During testing, only G is required to generate synthetic images.

Editing Expressions: Research in this domain started with modeling skin-muscle movements [33] for different facial expressions or swapping facial patches based on visual proximity [18]. With the advent of 3D face models, researchers used static [63] or morphable models [19, 102] to manipulate facial expressions with a higher degree of realism. Recently, the use of VAEs [13] and adversarial image-to-image translation networks has become extremely popular for editing facial expressions [48, 108, 25], with or without paired data for training. Some of these models use attention masks [76], facial shape information [36, 34] or exemplar videos [85] to guide the model in this task.

Editing Lighting: While methods like histogram equalization [74, 109] and gamma correction can shift the global luminance distribution and color encoding of an image, they cannot manipulate the direction of the light source itself. An early method [94] utilizes spherical harmonics to manipulate the directional lighting in 3D. In [24], local linear adjustments are performed on overlapping windows to change the lighting profile of an image. Deep learning based approaches have also been proposed where the reflectance, normal and lighting channels are disentangled and edited to relight images [81, 84, 83, 41]. Alternatively, the desired lighting can be passed to an encoder-decoder pair as the target for lighting manipulation in the input image [106, 86, 65, 69]. Recently, joint facial pose, lighting and expression manipulation has been proposed in [30, 58] where an input image can be manipulated by changing its attributes in feature space leveraging 3D parameters or latent information from synthetic images during training.

Quality Estimation of Synthetic Face Images: Synthetic image quality is commonly evaluated using metrics like the Inception Score [79] or FID [45], which compare statistics of real and synthetic feature distributions, and output a single score for the whole distribution rather than the individual image. The features themselves are extracted from the Inception-v3 model [87], usually pre-trained on objects from [29], and not specifically faces. As these metrics do not take into account human judgements, they do not correlate well with perceptual realism [14, 20]. Consequently, researchers run perceptual studies to score the naturalness of synthetic images [104, 8]. These ratings are also used to design models that measure distortion between real and synthetic pairs [105, 75] or the coarse realism (‘real’ vs ‘fake’) of a synthetic image [107]. None of these evaluation models however are designed specifically for face images. Recently, [57] proposed a metric to rate the perceptual quality of a single image by using binary ratings from [107] as ground truth for synthetic face images generated by [51, 52]. However, their regression based model is trained on only 4,270 images and thus insufficient to reliably model the subjective nature of human judgements.

Unlike these methods, we build a synthetic face quality estimation model by leveraging perceptual ratings of over 37,000 images generated using five different synthesis techniques [51, 52, 78, 1, 9]. Our quality model takes into account the variability in human judgements and generates a realism score for individual images rather than the whole set. We leverage this model as an auxiliary discriminator in the LEGAN framework for simultaneous lighting and facial expression manipulation. This novelty, together with LEGAN’s feature disentanglement, improves the naturalness of the hallucinated images. Additionally, we do not require external 3DMM information or latent vectors during training, nor do we need to fine-tune our model on input images during testing [30, 58].

3 Quality Estimation Model

Our quality estimator model is trained with synthetic face images assembled and annotated in two sequential stages, as described below.

Stage I: We first generate 16,507 synthetic face images using the StyleGAN [52] generator. These images are then annotated by labelers on Amazon Mechanical Turk (AMT) on a scale of 0 - 10 for naturalness, where a rating of 0 represents an unnatural image and 10 a hyper-realistic one. The images are then binned into two broad groups - ‘unnatural’ for AMT ratings between 0 - 5 and ‘natural’ for 5 - 10. We extract descriptors for each image from the ‘avg_pool’ layer of the ResNet50 [44] model, pre-trained on VGGFace2 [22], train a linear SVM [28] with the extracted features of around 12,000 images from this dataset, and use the rest for parameter tuning. Post training, we use this SVM as a rough estimator of naturalness.
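As a rough reference, the Stage I classifier can be sketched as follows. This is a minimal illustration, assuming the ResNet50 ‘avg_pool’ descriptors and the 0 - 10 AMT ratings are already computed and stored in arrays; the function name and split size are illustrative placeholders rather than values fixed by our pipeline.

```python
# Minimal sketch of the Stage I coarse naturalness classifier.
# Assumptions: `features` holds precomputed ResNet50 'avg_pool' descriptors
# (one row per image) and `ratings` the corresponding 0-10 AMT scores.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

def train_coarse_naturalness_svm(features: np.ndarray, ratings: np.ndarray) -> LinearSVC:
    labels = (ratings >= 5.0).astype(int)          # bin: 'unnatural' (<5) vs 'natural' (>=5)
    x_train, x_val, y_train, y_val = train_test_split(
        features, labels, train_size=12000, random_state=0)
    svm = LinearSVC(C=1.0)                         # C is tuned on the held-out split in practice
    svm.fit(x_train, y_train)
    print("held-out accuracy:", svm.score(x_val, y_val))
    return svm
```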

Stage II: In this stage, we perform the same AMT experiment again with a larger set of synthetic face images, collected from the following datasets:
1. FaceForensics++ [78] - we randomly sample 1,000 frames from this dataset, which consists of 1,000 video sequences manipulated with four automated face manipulation methods.
2. DeepFake [2, 3] - we use sampled frames from 620 manipulated videos of 43 actors from [1].
3. ProGAN [51] - we generate 10,000 synthetic face images of non-existent subjects by training NVIDIA’s progressively growing GAN model on the CelebA-HQ dataset [51].
4. StyleGAN [52] - we extract 100,000 hyper-realistic face images of non-existent subjects generated using the StyleGAN model that were pre-filtered for quality [4].
5. Notre Dame Synthetic Face Dataset [9] - we randomly sample 163,000 face images, from the available 2M, of synthetic subjects generated using ‘best-fitting’ 3D models.

To focus on near-frontal faces, we remove images with yaw over 15° in either direction, estimated using [43]. Since gender information is absent in most of the above datasets, we group the synthetic images using gender predictions from a pre-trained model [60]. Our trained SVM (from Stage I) is also used to rate the coarse naturalness of the collected images, using their ResNet50 features. We ensure balance in our synthetic dataset by sampling evenly from the natural and unnatural sets, as estimated by the SVM, and from the perceived gender classes. To focus solely on the facial region, the pixels of each image outside the convex hull formed by its facial landmarks, estimated using [21], are masked out. After the gender, facial yaw and naturalness based filtering, and this pre-processing step, we end up with 37,267 synthetic face images (available here: https://github.com/Affectiva/LEGAN_Perceptual_Dataset) for our second AMT experiment.
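For concreteness, the landmark-based masking step can be sketched as below. This assumes the 2D landmarks have already been detected (e.g., with [21]); OpenCV is used here purely for illustration and is not mandated by our pipeline.

```python
# Sketch of the face-region masking: zero out all pixels outside the convex
# hull of the detected facial landmarks. `landmarks` is an (N, 2) array.
import cv2
import numpy as np

def mask_outside_convex_hull(image: np.ndarray, landmarks: np.ndarray) -> np.ndarray:
    hull = cv2.convexHull(landmarks.astype(np.int32))   # convex hull of the landmark points
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    cv2.fillConvexPoly(mask, hull, 255)                 # 255 inside the face region
    return cv2.bitwise_and(image, image, mask=mask)     # keep only the face pixels
```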

Again, we ask Turkers to rate each image for naturalness on a scale of 0 - 10. Each image is shown to a Turker for 60 seconds to allow them time to make a proper judgement even with a slow network connection. We divide the full set of images into 72 batches such that each batch gets separately rated by 3 different Turkers. Post crowd-sourcing, we compute the mean (μ) and standard deviation (σ) of the 3 scores and assign them as the naturalness label for each image.

To train the quality estimation model, we use 80% of this annotated data and the rest for validation and testing. For augmentation, we only mirror the images, as other techniques like translation, rotation and scaling drastically change their appearance compared to what the Turkers examined. Our model downsamples an input image using a set of strided convolution layers with Leaky ReLU [99] activation, followed by two fully connected layers with linear activation, and outputs a single realness score. Since both μ and σ are passed as image labels, we try to capture the inconsistency in the AMT ratings (i.e., the subjective nature of human perception) by formulating a margin based loss for training. The model weights are tuned such that its prediction is within an acceptable margin, set to σ, from the mean rating μ of the image. The loss L_N can be represented as:

L_{N} = \frac{1}{n}\sum_{i=1}^{n}\left\|\sigma-\left\|\mu-Q(I_{i})\right\|_{2}^{2}\right\|_{2}^{2}   (1)

where n is the batch size and Q(I_i) is the model prediction for the i-th image in the batch. Since the model is trained on the mean rating μ (regression) as the target rather than on fixed classes (classification), L_N pushes the model predictions towards the confidence margin σ around μ.
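A minimal PyTorch sketch of this objective, assuming `pred` holds the batch of Q(I_i) predictions and `mu`, `sigma` the per-image mean and standard deviation of the three AMT ratings:

```python
import torch

def margin_regression_loss(pred: torch.Tensor, mu: torch.Tensor,
                           sigma: torch.Tensor) -> torch.Tensor:
    # L_N = mean over the batch of || sigma - || mu - Q(I) ||^2 ||^2  (Eq. 1)
    inner = (mu - pred) ** 2        # squared distance of the prediction from the mean rating
    return ((sigma - inner) ** 2).mean()
```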

Figure 4: Sample LEGAN result G(I_a, f_b), with generated expression (M_e) and lighting (M_l) masks for input I_a. The salient pixels for the translation task automatically ‘heat up’ in M_e and M_l, similar to flow maps.
Figure 5: Sample results on MultiPIE [39] test images comparing LEGAN with popular GAN models. The target conditions are (a) (Disgust, Slight Left Shadow), (b) (Surprise, Slight Right Shadow), (c) (Squint, Ambient lighting), and (d) (Smile, Bright lighting). LEGAN hallucinates subtle muscle movements like nose wrinkles more prominently (top row) while preserving the subject identity. Additionally, our quality loss eliminates blob artifacts [53] from synthesized images (bottom row, StarGAN vs. StarGAN w/ L_qual).

4 LEGAN

We build LEGAN as a lightweight network that works with unpaired data, similar to the StarGAN family, to focus more on assessing the effect of our quality estimation model (Q) as a perceptual loss. Unlike [30, 58], LEGAN does not require additional networks to regress 3DMM parameters or fine-tuning during inference. We describe the architecture and objective functions of LEGAN in this section, an overview of which can be seen in Figure 3.

4.1 Architecture

Generator: Our generator G, composed of three hourglass networks (encoder-decoder), starts with an input RGB face image I_a and a target attributes vector f_b that corresponds to expression and lighting conditions c_e and c_l respectively. The first hourglass receives I_a concatenated with c_e while the second one receives I_a concatenated with c_l, thus disentangling the transformation task. Inside each hourglass, the concatenated tensor is downsampled using strided convolutions and then passed through a set of residual blocks [44] before being upsampled using pixel shuffling layers [82]. Each convolution layer is followed by instance normalization [90] and ReLU activation [64] for non-linearity. These upsampled images are the transformation masks M_e and M_l that map the changes in pixel intensity required to translate I_a to the conditions specified in f_b. M_e and M_l are concatenated together and fed to the third hourglass to generate the output image G(I_a, f_b). The objective of dividing the generation process into two stages and hallucinating the transformation masks is twofold: (a) easing the task of each hourglass by making it focus on registering the required expression or lighting changes instead of both registration and hallucination, and (b) making the transformation process more explainable, with salient pixels prominent in M_e and M_l, as can be seen in Figure 4.
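A schematic PyTorch sketch of this forward pass is given below. The hourglass internals (strided convolutions, residual blocks, pixel shuffling) are abstracted into submodules, and all module and argument names are illustrative assumptions rather than the exact implementation:

```python
import torch
import torch.nn as nn

class LEGANGenerator(nn.Module):
    def __init__(self, expr_hourglass: nn.Module, light_hourglass: nn.Module,
                 fusion_hourglass: nn.Module):
        super().__init__()
        self.expr_hourglass = expr_hourglass      # [I_a, c_e] -> M_e (3 channels)
        self.light_hourglass = light_hourglass    # [I_a, c_l] -> M_l (3 channels)
        self.fusion_hourglass = fusion_hourglass  # [M_e, M_l] -> G(I_a, f_b)

    def forward(self, image, c_expr, c_light):
        b, _, h, w = image.shape
        # Broadcast the one-hot condition vectors to spatial maps and concatenate.
        e_map = c_expr.view(b, -1, 1, 1).expand(-1, -1, h, w)
        l_map = c_light.view(b, -1, 1, 1).expand(-1, -1, h, w)
        m_e = self.expr_hourglass(torch.cat([image, e_map], dim=1))   # expression mask
        m_l = self.light_hourglass(torch.cat([image, l_map], dim=1))  # lighting mask
        out = self.fusion_hourglass(torch.cat([m_e, m_l], dim=1))     # target image
        return out, m_e, m_l
```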

Discriminator: The discriminator D takes the output image G(I_a, f_b) and predicts not only its realness score but also classifies its attributes f_b. D is composed of strided convolution layers with Leaky ReLU [99] activation that downsample the image to extract its encoded feature map. We use a patch discriminator [48] that takes this encoded feature map and passes it through a single-channel convolution to get the realness map D_src. The same feature map is also passed through a conv layer with k filters to get the attributes prediction map D_cls, where k = the number of channels in f_b.
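A compact sketch of such a discriminator follows, with layer widths loosely following Table 10 in the supplementary material; pooling D_cls into per-attribute logits is our simplification, not a detail from the paper:

```python
import torch.nn as nn

class LEGANDiscriminator(nn.Module):
    def __init__(self, num_attrs: int = 26, base_ch: int = 64, num_layers: int = 6):
        super().__init__()
        layers, ch = [], 3
        for i in range(num_layers):                               # strided downsampling
            layers += [nn.Conv2d(ch, base_ch * 2 ** i, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.01)]
            ch = base_ch * 2 ** i
        self.backbone = nn.Sequential(*layers)
        self.src_head = nn.Conv2d(ch, 1, 3, stride=1, padding=1)  # patch realness map D_src
        self.cls_head = nn.Conv2d(ch, num_attrs, 1)               # attribute map D_cls

    def forward(self, image):
        feat = self.backbone(image)
        d_src = self.src_head(feat)
        d_cls = self.cls_head(feat).mean(dim=(2, 3))              # pool the map to per-attribute logits
        return d_src, d_cls
```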

Auxiliary Discriminator: We integrate the perceptual quality model Q, described in Section 3, into the LEGAN model graph to further refine the naturalness of the images synthesized by G. Unlike D, we do not train Q jointly with G but use the weights of a pre-trained snapshot.

Identity Network: We also add a pre-trained identity preserving network T to estimate the deviation of the output identity from that of the input. T is trained offline on face images with different pose, expression and lighting and its weights are kept frozen throughout LEGAN’s training.

4.2 Loss Function

1. Adversarial Loss: D is trained to distinguish a real face image I_a from its synthetic counterpart and to judge the realness of the hallucinated image G(I_a, f_b). To stabilize the gradients and improve quality, we use the WGAN [7] based objective L_adv for this task with a gradient penalty [40], set as:

L_{adv} = \mathbb{E}_{I_{a}}[D_{src}(I_{a})] - \mathbb{E}_{I_{a},f_{b}}[D_{src}(G(I_{a},f_{b}))] - \lambda_{gp}\,\mathbb{E}_{\hat{I}}[(\|\nabla_{\hat{I}}D_{src}(\hat{I})\|_{2}-1)^{2}]   (2)

where Î is sampled uniformly along straight lines between pairs of real and synthetic images and λ_gp is a tunable parameter. D tries to maximize this objective to separate the synthetic from the real, while G tries to minimize it by fooling D.
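A hedged PyTorch sketch of this objective in the standard WGAN-GP form, assuming the discriminator sketch above (which returns the D_src map and the D_cls logits):

```python
import torch

def gradient_penalty(d, real, fake):
    alpha = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    d_src, _ = d(x_hat)
    grads = torch.autograd.grad(outputs=d_src.sum(), inputs=x_hat, create_graph=True)[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    return ((grad_norm - 1.0) ** 2).mean()

def d_adversarial_loss(d, real, fake, lambda_gp=10.0):
    d_real, _ = d(real)
    d_fake, _ = d(fake.detach())
    # D maximizes E[D(real)] - E[D(fake)] - gp, i.e. minimizes the negative of Eq. (2).
    return -(d_real.mean() - d_fake.mean()) + lambda_gp * gradient_penalty(d, real, fake.detach())

def g_adversarial_loss(d, fake):
    d_fake, _ = d(fake)
    return -d_fake.mean()   # G pushes D's score on its output upwards
```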

2. Classification Loss: To ensure the target lighting and expression are correctly rendered by G and to enable LEGAN to do many-to-many translations, we formulate a classification loss using D's predictions, in the form of D_cls. The loss L_cls is computed as:

L_{cls} = \mathbb{E}_{I_{a},f_{a}}[-\log D_{cls}(f_{a}\mid I_{a})] + \mathbb{E}_{I_{a},f_{b}}[-\log D_{cls}(f_{b}\mid G(I_{a},f_{b}))]   (3)

where f_a and f_b are the original and target attributes of an input image I_a.
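One possible reading of this term in code, treating the 26-channel attribute tensor as a multi-label target; a per-group cross-entropy over the 6 expression and 20 lighting classes would be an equally valid sketch:

```python
import torch
import torch.nn.functional as F

def classification_loss(d_cls_logits: torch.Tensor, target_attrs: torch.Tensor) -> torch.Tensor:
    # d_cls_logits: (B, 26) raw logits from D_cls; target_attrs: (B, 26) one-hot floats.
    return F.binary_cross_entropy_with_logits(d_cls_logits, target_attrs)
```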

3. Identity Loss: To preserve the subject identity without using paired data, we add an identity loss between the input and the translated output G(I_a, f_b) by utilizing representations from T. Both I_a and G(I_a, f_b) are passed through T for feature extraction, and we set the objective L_id to minimize the cosine distance between these two features as:

L_{id} = \mathbb{E}_{I_{a},f_{b}}\left[1-\frac{T(I_{a})\cdot T(G(I_{a},f_{b}))}{\|T(I_{a})\|_{2}\,\|T(G(I_{a},f_{b}))\|_{2}}\right]   (4)

Ideally, the cosine distance between these two feature vectors should be 0, as they belong to the same identity.
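A short sketch of this loss, with `t` a frozen face-embedding network standing in for T:

```python
import torch
import torch.nn.functional as F

def identity_loss(t, real: torch.Tensor, fake: torch.Tensor) -> torch.Tensor:
    with torch.no_grad():
        feat_real = t(real)                       # T is frozen; no gradient needed here
    feat_fake = t(fake)                           # gradients flow back into G through the fake
    cos = F.cosine_similarity(feat_real, feat_fake, dim=1)
    return (1.0 - cos).mean()                     # Eq. (4): 1 - cosine similarity
```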

4. Reconstruction Loss: To keep non-translating features from the input intact in the output image, we use a cyclic reconstruction loss [108] L_rec between I_a and its reconstruction G(G(I_a, f_b), f_a), computed as:

L_{rec} = \mathbb{E}_{I_{a},f_{b},f_{a}}[\|I_{a}-G(G(I_{a},f_{b}),f_{a})\|_{1}]   (5)
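Expressed against the generator sketch above (whose forward pass returns the output image and the two masks), the cycle term can be sketched as:

```python
import torch.nn.functional as F

def reconstruction_loss(g, image, f_src, f_tgt):
    fake, _, _ = g(image, *f_tgt)      # translate to the target expression/lighting
    cycled, _, _ = g(fake, *f_src)     # translate back to the source conditions
    return F.l1_loss(cycled, image)    # Eq. (5): L1 between input and its reconstruction
```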
Table 1: Quantitative comparison with popular GAN models on the held-out CMU-MultiPIE [39] test set.
Models | FID [45] ↓ | LPIPS [105] ↓ | SSIM [95] ↑ | Match Score [44, 22] ↑ | Quality Score ↑ | Human Preference ↑
StarGAN [25] | 38.745 | 0.126 | 0.559 | 0.635 | 5.200 | 22.3%
StarGAN w/ L_qual | 34.045 | 0.123 | 0.567 | 0.647 | 5.391 | 34.7%
StarGAN-v2 [26] | 54.842 | 0.212 | 0.415 | 0.202 | 5.172 | 3.75%
LEGAN | 29.964 | 0.120 | 0.649 | 0.649 | 5.853 | 39.3%
Real Images | 12.931 | - | - | 0.739 | 5.921 | -
Table 2: Quantitative performance of popular GAN models, trained only on CMU-MultiPIE [39], when tested on AFLW [56] images.
Models | FID [45] ↓ | LPIPS [105] ↓ | SSIM [95] ↑ | Match Score [44, 22] ↑ | Quality Score ↑
StarGAN [25] | 43.987 | 0.302 | 0.639 | 0.622 | 6.41
StarGAN w/ L_qual | 41.641 | 0.268 | 0.624 | 0.614 | 6.73
StarGAN-v2 [26] | 59.328 | 0.445 | 0.279 | 0.213 | 6.60
LEGAN | 38.794 | 0.271 | 0.622 | 0.628 | 6.68
Real Images | - | - | - | - | 6.83
Table 3: Quantitative performance of popular GAN models, trained only on CMU-MultiPIE [39], when tested on CelebA [61] images.
Models | FID [45] ↓ | LPIPS [105] ↓ | SSIM [95] ↑ | Match Score [44, 22] ↑ | Quality Score ↑
StarGAN [25] | 42.089 | 0.173 | 0.623 | 0.620 | 6.98
StarGAN w/ L_qual | 36.189 | 0.145 | 0.614 | 0.621 | 7.15
StarGAN-v2 [26] | 50.360 | 0.311 | 0.312 | 0.295 | 6.89
LEGAN | 29.059 | 0.184 | 0.662 | 0.635 | 7.07
Real Images | - | - | - | - | 7.31

5. Quality Loss: We use Q's predictions to further improve the perceptual realism of the synthetic images. Masked versions of the input image I_a', the synthesized output G(I_a, f_b)' and the reconstructed input G(G(I_a, f_b), f_a)', produced using facial landmarks extracted by [21], are used for the loss computation as follows:

L_{qual} = \mathbb{E}_{I_{a},f_{b}}[\|q-Q(G(I_{a},f_{b})^{\prime})\|_{1}] + \mathbb{E}_{I_{a},f_{b},f_{a}}[\|Q(I_{a}^{\prime})-Q(G(G(I_{a},f_{b}),f_{a})^{\prime})\|_{1}]   (6)

where q is a hyper-parameter that can be tuned to lie between 5 (realistic) and 10 (hyper-realistic). We find synthesis results to be optimal when q = 8.
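Sketched in code, with `q_model` the frozen quality estimator Q and `mask_fn` the landmark-based face masking (both assumptions carried over from the earlier sketches):

```python
def quality_loss(q_model, mask_fn, real, fake, cycled, q_target: float = 8.0):
    q_fake = q_model(mask_fn(fake))
    q_real = q_model(mask_fn(real))
    q_cycled = q_model(mask_fn(cycled))
    term_fake = (q_target - q_fake).abs().mean()    # push G's output towards the target score q
    term_cycle = (q_real - q_cycled).abs().mean()   # keep the cycle's perceived quality consistent
    return term_fake + term_cycle
```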

Full Loss: We also apply a total variation loss [50] L_tv on G(I_a, f_b) and G(G(I_a, f_b), f_a) to smooth boundary pixels, and set the final training objective L as a weighted sum of the six losses:

L = L_{adv} + \lambda_{1}L_{cls} + \lambda_{2}L_{rec} + \lambda_{3}L_{id} + \lambda_{4}L_{qual} + \lambda_{5}L_{tv}   (7)
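The weighted combination, using the λ values reported in Section 5 as illustrative defaults:

```python
def total_generator_loss(l_adv, l_cls, l_rec, l_id, l_qual, l_tv,
                         w_cls=20.0, w_rec=10.0, w_id=10.0, w_qual=0.5, w_tv=1e-7):
    # Eq. (7): L = L_adv + λ1 L_cls + λ2 L_rec + λ3 L_id + λ4 L_qual + λ5 L_tv
    return l_adv + w_cls * l_cls + w_rec * l_rec + w_id * l_id + w_qual * l_qual + w_tv * l_tv
```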
Figure 6: LEGAN outputs for images randomly sampled from the AFLW [56] and CelebA [61] datasets. Although trained only on frontal MultiPIE images, LEGAN can realistically manipulate lighting and expression in images with variance in pose, resolution and domain (sketch).
Table 4: Verification performance (TPR@FPR = 1%) of LightCNN-29 [98] on IJB-B [97] and LFW [46] with and without LEGAN based augmentation.
Training Data | Real Images [101] (# Identities) | Synthetic Images (# Identities) | IJB-B [97] Performance | LFW [46] Performance
Original | 439,999 (10,575) | 0 | 0.954 ± 0.002 | 0.966 ± 0.002
Augmented | 439,999 (10,575) | 439,999 (10,575) | 0.967 ± 0.001 | 0.972 ± 0.001
Table 5: Model [32] performance (ROC AUC) on AffectNet [66] with and without LEGAN based augmentation.
Training Data | Real Images [66] | Synthetic Images | ‘Neutral’ | ‘Happy’ | ‘Surprise’ | ‘Disgust’
Original | 204,325 | 0 | 0.851 ± 0.005 | 0.955 ± 0.001 | 0.873 ± 0.004 | 0.887 ± 0.005
Augmented | 204,325 | 279,324 | 0.868 ± 0.005 | 0.956 ± 0.001 | 0.890 ± 0.003 | 0.897 ± 0.001

5 Experiments and Results

Training Data. We utilize 36,657 frontal RGB images from the CMU-MultiPIE dataset [39], with 20 different lighting conditions and 6 acted facial expressions, to build our model. For training, we use 33,305 images of 303 subjects and keep the remaining 3,352 images of 34 subjects for testing. The training data is highly skewed towards ‘Neutral’ and ‘Smile’ compared to the other 4 expressions, but the distribution is almost uniform for the lighting classes. We align each image using its eye landmarks, extracted with [21], and resize it to 128×128×3. We do not fine-tune LEGAN on any other data and rely solely on its generalizability for the different experimental tasks.

Testing Data. Along with MultiPIE’s held out test set, we also utilize the AFLW [56] and CelebA [61] datasets to test the robustness of LEGAN towards in-the-wild conditions. We do not train or fine-tune LEGAN on these datasets and use the model trained on MultiPIE for translation tasks.

Augmentation: Recoloring. Since MultiPIE [39] was acquired in a controlled setting, it has more or less uniform hue and saturation across all images. To artificially inject some diversity in the overall image color, and prevent visual overfitting, we build an image colorization model. Inspired by [104], we train two separate pix2pix [48] style GAN models where the input is a grayscale image and the target is set as its colored counterpart (i.e., the original image). To manipulate the color style of the same grayscale image differently, we train these two models with 10K randomly sampled images from the UMDFaces [12] and FFHQ [51] datasets respectively. The trained generators are used to augment the color style of the MultiPIE training set.

Augmentation: White Balancing. In addition to changing the color style, we also artificially edit the white balance of the training images by utilizing the pre-trained model from [5]. This model can automatically correct the white balance of an image or render it with different camera presets. For each training input, we randomly use it as is or select its recolored or color corrected version, and pass it to LEGAN.

Implementation Details. To learn the model, we use the Adam optimizer [55] with a learning rate of 0.0001 and parameters β_1 and β_2 set to 0.5 and 0.999 respectively. The loss weights λ_gp, λ_1, λ_2, λ_3, λ_4 and λ_5 are set empirically to 10, 20, 10, 10, 0.5 and 0.0000001 respectively. As done in [76], we train D 5 times for each training iteration of G. We train the LightCNN-29 model [98] on the CASIA-WebFace dataset [101] and utilize it as our identity network T. Features from its penultimate layer are used to compute L_id. LEGAN is trained with a batch size of 10 on a single Tesla V100 GPU for 100 epochs.
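The optimization schedule above can be summarized by the following sketch, which reuses the generator, discriminator and loss helpers sketched in Section 4; `loader` and `loss_fns` are assumed to be provided by the surrounding training code and are not part of the paper:

```python
import torch

def train_legan(g, d, loader, loss_fns, epochs=100, device="cuda"):
    g_opt = torch.optim.Adam(g.parameters(), lr=1e-4, betas=(0.5, 0.999))
    d_opt = torch.optim.Adam(d.parameters(), lr=1e-4, betas=(0.5, 0.999))
    for _ in range(epochs):
        for step, (image, f_src, f_tgt) in enumerate(loader):
            image = image.to(device)               # condition tensors assumed already on `device`
            fake, _, _ = g(image, *f_tgt)

            d_opt.zero_grad()                      # 5 D updates for every G update
            loss_fns["d_adv"](d, image, fake).backward()
            d_opt.step()

            if step % 5 == 0:
                g_opt.zero_grad()
                loss_fns["g_total"](g, d, image, f_src, f_tgt).backward()
                g_opt.step()
```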

Comparison with Other GAN Models: Our proposed model is simple and unique as it does not require 3DMM information or external synthetic images during training [30, 58], nor do we need to fine-tune our model on input images during testing. Due to this simplicity, we choose 2 popular publicly available unpaired domain translation models for comparison - StarGAN [25] and the more recent StarGAN-v2 [26]. We train these models for lighting and expression manipulation with the same MultiPIE [39] training split for 100 epochs. Additionally, to gauge the effect of L_qual on off-the-shelf models, we train StarGAN separately with the auxiliary discriminator added (StarGAN w/ L_qual).

Metrics for Quality Estimation. To evaluate the quality of synthetic face images generated by LEGAN and other GAN models, we compare the synthesized output with the corresponding target image (since target images are not available for the AFLW [56] and CelebA [61] datasets, we evaluate the metrics between the source and output images for them) using these metrics: (1) FID [45] and (2) LPIPS [105] to gauge realism, (3) SSIM [95] to measure noise, and (4) face match score using pre-trained ResNet50 [44, 22] features and the Pearson correlation coefficient. We also use our trained quality estimator to directly extract the (5) quality score of real and synthetic images.
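For clarity, the match score in (4) reduces to a Pearson correlation between identity descriptors; `embed` below stands in for the pre-trained ResNet50/VGGFace2 feature extractor and is an assumption of this sketch:

```python
import numpy as np

def match_score(embed, image_a, image_b) -> float:
    feat_a = np.asarray(embed(image_a), dtype=np.float64).ravel()
    feat_b = np.asarray(embed(image_b), dtype=np.float64).ravel()
    return float(np.corrcoef(feat_a, feat_b)[0, 1])   # Pearson correlation of the two descriptors
```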

Human Evaluation. We also run a perceptual study using face images generated by these models where we ask 17 non-expert human raters to pick an image from a lineup that best matches (1) a target facial expression and (2) a target lighting condition, while (3) preserving the identity, for 30 different MultiPIE [39] subjects. The raters are first shown real examples of the target expressions and lighting conditions. Each lineup consists of an actual image of the subject with bright lighting and a neutral expression and the same subject synthesized for the target expression and lighting by StarGAN [25], StarGAN w/ L_qual, StarGAN-v2 [26] and LEGAN, presented in a randomized order. We aggregate the rater votes across all rows and normalize them for each model (rightmost column of Table 1).

Quantitative Results. As can be seen from Table 1, LEGAN synthesizes perceptually superior face images (FID, LPIPS, Quality Score) while retaining subject identity (Match Score) better than the other GAN models. As validated by the human evaluation, LEGAN also effectively translates the input image to the target lighting and expression conditions. Surprisingly, StarGAN-v2 fails to generate realistic images in these experiments. This can be attributed to the fact that our task of joint lighting and expression manipulation presents the model with a much higher number of possible transformation domains (101 to be exact). The StarGAN-v2 model does learn to separate and transform lighting and expression to a certain degree but fails to decouple the other image attributes like identity, gender and race (we are not the first to encounter this issue, as shared here: https://github.com/clovaai/stargan-v2/issues/21). Therefore, it synthesizes images that can be easily picked out by human perusal or identity matching. We also find that adding L_qual to StarGAN improves almost all of its metric scores, underpinning the value of our quality estimator Q even when coupled with off-the-shelf models. On top of enhancing the overall sharpness, L_qual removes bullet-hole artifacts, similar to [53], from the peripheral regions of the output face. Such an artifact can be seen in Figure 5 (row 4, col 2, forehead-hair boundary), and it is eliminated by adding L_qual (row 4, col 3).

For the unseen, in-the-wild datasets, the performance of LEGAN is mostly superior to that of the other models, as shown in Tables 2 and 3 respectively. Due to the non-uniform nature of the data, especially the facial pose, most of the metrics deteriorate relative to Table 1. However, the overall boost in quality score suggests that the high quality images from AFLW [56], and especially CelebA [61], are visually more appealing than images from MultiPIE [39]. Some sample results are shared in Figure 6.

Effectiveness as Training Data Augmenter: We examine the use of LEGAN as a training data augmenter for face verification and expression recognition tasks using the IJB-B [97], LFW [46] and AffectNet [66] datasets. For face verification, we use the CASIA-WebFace dataset [101] and the LightCNN-29 [98] architecture due to their popularity in this domain. We randomly sample 439,999 images of 10,575 subjects from [101] for training and 54,415 images for validation. We augment the training set by randomly manipulating the lighting and expression of each image (Table 4, row 2). The LightCNN-29 model is trained from scratch separately with the original and augmented sets, and its weights are saved when the validation loss plateaus across epochs. These saved snapshots are then used to extract features from a still image or video frame in the IJB-B [97] dataset. For each IJB-B template, a mean feature is computed using video and media pooling operations [62], and the match score between such features is calculated with the Pearson correlation. For the LFW [46] images, we simply compare features between similar and dissimilar identities. To measure the statistical significance of any performance benefit, we run each training 3 separate times. We find that the model trained with the augmented data improves upon the verification performance of the baseline (Table 4). This suggests that the LEGAN generated images retain their original identity and can boost the robustness of classification models towards intra-class variance in expressions and lighting.

For expression classification, we use a modified version of the AU-classification model from [32] (Leaky ReLU [99] and Dropout added) and manually annotated AffectNet [66] images for the (‘Neutral’, ‘Happy’, ‘Surprise’, ‘Disgust’) classes, as these 4 expressions overlap with MultiPIE [39]. The classification model is trained with 204,325 face images from AffectNet’s training split, which is highly skewed towards the ‘Happy’ class (59%) and has very few images for the ‘Surprise’ (6.2%) and ‘Disgust’ (1.6%) expressions. To balance the training distribution, we populate each sparse class with synthetic images generated by LEGAN from real images belonging to any of the other 3 classes. We use the original and augmented (balanced) data separately to train two versions of the model for expression classification. As there is no test split, we use the 2,000 validation images for testing, as done in other works [92]. We find the synthetic images, when used in training, to substantially improve test performance especially for the previously under-represented ‘Surprise’ and ‘Disgust’ classes (Table 5). This further validates the realism of the expressions generated by LEGAN.

6 Conclusion

We propose LEGAN, a GAN framework for performing many-to-many joint manipulation of lighting and expressions of an existing face image without requiring paired training data. Instead of translating the image representations in an entangled feature space like [25], LEGAN estimates transformation maps in the decomposed lighting and expression sub-spaces before combining them to get the desired output image. To enhance the perceptual quality of the synthetic images, we directly integrate a quality estimation model into LEGAN’s pipeline as an auxiliary discriminator. This quality estimation model, built with synthetic face images from different methods [51, 52, 78, 2, 9] and their crowd-sourced naturalness ratings, is trained using a margin based regression loss to capture the subjective nature of human judgement. The usefulness of LEGAN’s feature disentangling towards synthesis quality is shown by objective comparison [45, 105, 95] of its synthesized images to those of other popular GAN models like StarGAN [25] and StarGAN-v2 [26]. These experiments also highlight the usefulness of the proposed quality estimator in LEGAN and StarGAN w/ L_qual, specifically comparing the latter to vanilla StarGAN.

As a potential application, we use LEGAN as a training data augmenter for face verification on the IJB-B [97] and LFW [46] datasets and for facial expression classification on the AffectNet [66] dataset (Tables 4 and 5). An improvement in the verification scores on both datasets suggests LEGAN can enhance the intra-class variance while preserving subject identity. The boost in expression recognition performance validates the realism of the LEGAN generated facial expressions. The output quality, however, could be further improved when translating from one intense expression to another. We plan to address this by (1) using attention masks in our encoder modules, and (2) building translation pathways of facial action units [88] while going from one expression to another. Another future goal is to incorporate a temporal component in LEGAN for synthesizing a sequence of coherent frames.

References

  • [1] Deepfake Detection Challenge: https://www.kaggle.com/c/deepfake-detection-challenge.
  • [2] DeepFake FaceSwap: https://faceswap.dev/.
  • [3] FaceSwap Repository: https://github.com/deepfakes/faceswap.
  • [4] Pre-filtered StyleGAN Images: https://generated.photos/?ref=producthunt.
  • [5] M. Afifi and M.S. Brown. Deep white-balance editing. In CVPR, 2020.
  • [6] M. Alcorn, Q. Li, Z. Gong, C. Wang, L. Mai, W-S. Ku, and A. Nguyen. Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. In CVPR, 2019.
  • [7] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
  • [8] S. Banerjee, J. Bernhard, W. Scheirer, K. Bowyer, and P. Flynn. Srefi: Synthesis of realistic example face images. In IJCB, 2017.
  • [9] S. Banerjee, W. Scheirer, K. Bowyer, and P. Flynn. Fast face image synthesis with minimal training. In WACV, 2019. Dataset available here: https://cvrl.nd.edu/projects/data/.
  • [10] S. Banerjee, W. Scheirer, K. Bowyer, and P. Flynn. On hallucinating context and background pixels from a face mask using multi-scale gans. In WACV, 2020.
  • [11] A. Bansal, C. Castillo, R. Ranjan, and R. Chellappa. The do’s and don’ts for cnn-based face verification. ICCV Workshops, 2017.
  • [12] A. Bansal, A. Nanduri, C. D. Castillo, R. Ranjan, and R. Chellappa. Umdfaces: An annotated face dataset for training deep networks. IJCB, 2017.
  • [13] J. Bao, D. Chen, F. Wen, H. Li, and G. Hua. Towards open-set identity preserving face synthesis. In CVPR, 2018.
  • [14] S. Barratt and R. Sharma. A note on the inception score. arXiv:1801.01973.
  • [15] S. Beery, Y. Liu, D. Morris, J. Piavis, A. Kapoor, M. Meister, N. Joshi, and Pietro Perona. Synthetic examples improve generalization for rare classes. In WACV, 2020.
  • [16] D. Berthelot, T. Schumm, and L. Metz. Began: Boundary equilibrium generative adversarial networks. arXiv:1703.10717.
  • [17] J.R. Beveridge, D.S. Bolme, B.A. Draper, G.H. Givens, Y.M. Lui, and P.J. Phillips. Quantifying how lighting and focus affect face recognition performance. In CVPR Workshops, 2010.
  • [18] D. Bitouk, N. Kumar, S. Dhillon, S. Belhumeur, and S. K. Nayar. Face swapping: Automatically replacing faces in photographs. SIGGRAPH, 2005.
  • [19] V. Blanz and T. Vetter. Face recognition based on fitting a 3d morphable model. IEEE Trans. on Pattern Analysis and Machine Intelligence, 25(9):1063–1074, 2003.
  • [20] A. Borji. Pros and cons of gan evaluation measures. CVIU, 179(3):41–65, 2019.
  • [21] A. Bulat and G. Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In ICCV, 2017.
  • [22] Q. Cao, L. Shen, W. Xie, O. M. Parkhi, and A. Zisserman. Vggface2: A dataset for recognizing faces across pose and age. In arXiv:1710.08092.
  • [23] H. Chang, J. Lu, F. Yu, and A. Finkelstein. Pairedcyclegan: Asymmetric style transfer for applying and removing makeup. In CVPR, 2018.
  • [24] J. Chen, G. Su, J. He, and S. Ben. Relighting using locally constrained global optimization. In ECCV, 2010.
  • [25] Y. Choi, M. Choi, M. Kim, J-W. Ha, S. Kim, and J. Chool. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In CVPR, 2018.
  • [26] Y. Choi, Y. Uh, J. Yoo, and J-W. Ha. Stargan v2: Diverse image synthesis for multiple domains. In CVPR, 2020.
  • [27] F. Cole, D. Belanger, D. Krishnan, A. Sarna, I. Mosseri, and W. T. Freeman. Face synthesis from facial identity features. In CVPR, 2017.
  • [28] C. Cortes and V. Vapnik. Support-vector networks. Machine learning, 20(3):273–297, 1995.
  • [29] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and Fei-Fei. Li. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
  • [30] Y. Deng, J. Yang, D. Chen, F. Wen, and X. Tong. Disentangled and controllable face image generation via 3d imitative-contrastive learning. In CVPR, 2020.
  • [31] X. Di and V.M. Patel. Facial synthesis from visual attributes via sketch using multi-scale generators. TBIOM, 2019.
  • [32] I.O. Ertugrul, J.F. Cohn, L.A. Jeni, Z. Zhang, L. Yin, and Q. Ji. Cross-domain au detection: Domains, learning approaches, and measures. In FG, 2019.
  • [33] M.A. Fischler and R.A. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, 22(1):67–92, 1973.
  • [34] C. Fu, Y. Hu, X. Wu, G. Wang, Q. Zhang, and R. He. High fidelity face manipulation with extreme pose and expression. arXiv:1903.12003.
  • [35] B. Gecer, S. Ploumpis, I. Kotsia, and S. Zafeiriou. Ganfit: Generative adversarial network fitting for high fidelity 3d face reconstruction. In CVPR, 2019.
  • [36] Z. Geng, C. Cao, and S. Tulyakov. 3d guided fine-grained face manipulation. In CVPR, 2019.
  • [37] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NeurIPS, 2014.
  • [38] K. Grm, V. Štruc, A. Artiges, M. Caron, and H.K. Ekenel. Strengths and weaknesses of deep learning models for face recognition against image degradations. IET Biometrics, 7(1):81–89, 2018.
  • [39] R. Gross, I. Matthews, J. Cohn, T. Kanade, and S. Baker. Multi-pie. Image and Vision Computing., 28(5):807–813, 2010.
  • [40] I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. Courville. Improved training of wasserstein gans. In NeurIPS, 2017.
  • [41] X. Han, Y. Liu, H. Yang, G. Xing, and Y. Zhang. Normalization of face illumination with photorealistic texture via deep image prior synthesis. Neurocomputing, 2020.
  • [42] T. Hassner, S. Harel, E. Paz, and R. Enbar. Effective face frontalization in unconstrained images. In CVPR, 2015.
  • [43] K. He and X. Xue. Facial landmark localization by part-aware deep convolutional network. In PCM, 2016.
  • [44] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2016.
  • [45] M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NeurIPS, 2017.
  • [46] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. In Tech Report 07–49, 2007.
  • [47] R. Huang, S. Zhang, T. Li, and R. He. Beyond face rotation: Global and local perception gan for photorealistic and identity preserving frontal view synthesis. ICCV, 2017.
  • [48] P. Isola, J-Y. Zhu, T. Zhou, and A.A. Efros. Image-to-image translation with conditional adversarial nets. In CVPR, 2017.
  • [49] A. S. Jackson, A. Bulat, V. Argyriou, and G. Tzimiropoulos. Large pose 3d face reconstruction from a single image via direct volumetric cnn regression. ICCV, 2017.
  • [50] J. Johnson, A. Alahi, and F-F. Li. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
  • [51] T. Karras, T. Aila, S. Laine, and J. Lehtinen. Progressive growing of gans for improved quality, stability, and variation. ICLR, 2018.
  • [52] T. Karras, S. Laine, and T. Aila. A style-based generator architecture for generative adversarial networks. In arXiv:1812.04948, 2018.
  • [53] T. Karras, S. Laine, M. Aittala, J. Hellsten, J. Lehtinen, and T. Aila. Analyzing and improving the image quality of StyleGAN. CoRR, arXiv:1912.04958, 2019.
  • [54] I. Kemelmacher-Shlizerman, S. Seitz, D. Miller, and E. Brossard. The megaface benchmark: 1 million faces for recognition at scale. In CVPR, 2016.
  • [55] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
  • [56] M. Koestinger, P. Wohlhart, P.M. Roth, and H. Bischof. Annotated Facial Landmarks in the Wild: A Large-scale, Real-world Database for Facial Landmark Localization. In First IEEE International Workshop on Benchmarking Facial Image Analysis Technologies, 2011.
  • [57] Y.A. Kolchinski, S. Zhou, S. Zhao, G. Mitchell, and S. Ermon. Approximating human judgment of generated image quality. arXiv:1912.12121.
  • [58] M. Kowalski, S.J. Garbin, V. Estellers, T. Baltrušaitis, M. Johnson, and J. Shotton. Config: Controllable neural face image generation. In ECCV, 2020.
  • [59] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
  • [60] G. Levi and T. Hassner. Age and gender classification using convolutional neural networks. In CVPR Workshops, 2015.
  • [61] Z. Liu, P. Luo, X. Wang, and X. Tang. Deep learning face attributes in the wild. In ICCV, 2015.
  • [62] I. Masi, T. Hassner, A. T. Tran, and G. Medioni. Rapid synthesis of massive face sets for improved face recognition. FG, 2017.
  • [63] I. Masi, A. T. Tran, J. T. Leksut, T. Hassner, and G. Medioni. Do we really need to collect millions of faces for effective face recognition? In ECCV, 2016.
  • [64] A.L. Mass, A.Y. Hannun, and A.Y. Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML, 2013.
  • [65] A. Meka, C. Haene, R. Pandey, M. Zollhoefer, S. Fanello, G. Fyffe, A. Kowdle, X. Yu, J. Busch, J. Dourgarian, P. Denny, S. Bouaziz, P. Lincoln, M. Whalen, G. Harvey, J. Taylor, S. Izadi, A. Tagliasacchi, P. Debevec, C. Theobalt, J. Valentin, and C. Rhemann. Deep reflectance fields - high-quality facial reflectance field inference from color gradient illumination. In SIGGRAPH, 2019.
  • [66] A. Mollahosseini, B. Hasani, and M.H. Mahoor. Affectnet: A new database for facial expression, valence, and arousal computation in the wild. IEEE Trans. on Affective Computing, 2017.
  • [67] S. Mosaddegh, L. Simon, and F. Jurie. Photorealistic face de-identification by aggregating donors’ face components. In ACCV, 2014.
  • [68] A. Nech and I. Kemelmacher-Shlizerman. Level playing field for million scale face recognition. CVPR, 2017.
  • [69] T. Nestmeyer, J-F. Lalonde, I. Matthews, and A. Lehrmann. Learning physics-guided face relighting under directional light. In CVPR, 2020.
  • [70] Y. Nirkin, Y. Keller, and T. Hassner. Fsgan: Subject agnostic face swapping and reenactment. In ICCV, 2019.
  • [71] T-H. Oh, T. Dekel, C. Kim, I. Mosseri, W.T. Freeman, M. Rubinstein, and W. Matusik. Speech2face: Learning the face behind a voice. In CVPR, 2019.
  • [72] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In BMVC, 2015.
  • [73] P. J. Phillips, J. R. Beveridge, B. A. Draper, G. Givens, A. J. O’Toole, D. S. Bolme, J. Dunlop, Y. M. Lui, H. Sahibzada, and S. Weiment. An introduction to the good, the bad, and the ugly face recognition challenge problem. In FG, 2011.
  • [74] S.M. Pizer and et al. Adaptive histogram equalization and its variations. Computer Vision, Graphics, and Image Processing, 39:355–368, 1987.
  • [75] E. Prashnani, H. Cai, Y. Mostofi, and P. Sen. Pieapp: Perceptual image-error assessment through pairwise preference. In CVPR, 2018.
  • [76] A. Pumarola, A. Agudo, A.M. Martinez, A. Sanfeliu, and F. Moreno-Noguer. Ganimation: Anatomically-aware facial animation from a single image. In ECCV, 2018.
  • [77] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
  • [78] A. Rössler, D. Cozzolino, L. Verdoliva, C. Riess, J. Thies, and M. Nießner. FaceForensics++: Learning to detect manipulated facial images. In ICCV, 2019. Available here: https://github.com/ondyari/FaceForensics.
  • [79] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NeurIPS, 2016.
  • [80] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. Improved techniques for training gans. In NeurIPS, 2016.
  • [81] S. Sengupta, A. Kanazawa, C. Castillo, and D. Jacobs. Sfsnet: Learning shape, reflectance and illuminance of faces in the wild. In CVPR, 2018.
  • [82] W. Shi, J. Caballero, F. Huszar, J. Totz, A.P. Aitken, R. Bishop, D. Rueckert, and Z. Wang. Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In CVPR, 2016.
  • [83] Z. Shu, S. Hadap, E. Shechtman, K. Sunkavalli, S. Paris, and D. Samaras. Portrait lighting transfer using a mass transport approach. In SIGGRAPH, 2017.
  • [84] Z. Shu, E. Yumer, S. Hadap, K. Sunkavalli, E. Shechtman, and D. Samaras. Neural face editing with intrinsic image disentangling. In CVPR, 2017.
  • [85] A. Siarohin, S. Lathuilière, S. Tulyakov, E. Ricci, and N. Sebe. First order motion model for image animation. In NeurIPS, 2019.
  • [86] T. Sun, J.T. Barron, Y-T. Tsai, Z. Xu, X. Yu, G. Fyffe, C. Rhemann, J. Busch, P. Debevac, and R. Ramamoorthi. Single image portrait relighting. In SIGGRAPH, 2019.
  • [87] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • [88] Y. Tian, T. Kanade, and J.F. Cohn. Recognizing action units for facial expression analysis. TPAMI, 23(2):97–115, 2001.
  • [89] A. T. Tran, T. Hassner, I. Masi, and G. Medioni. Regressing robust and discriminative 3D morphable models with a very deep neural network. CVPR, 2017.
  • [90] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022, 2016.
  • [91] P. Upchurch, J. Gardner, G. Pleiss, R. Pless, N. Snavely, K. Bala, and K. Weinberger. Deep feature interpolation for image content changes. CVPR, 2017.
  • [92] R. Vemulapalli and A. Agarwala. A compact embedding for facial expression similarity. In CVPR, 2019.
  • [93] R.G. VidalMata, S. Banerjee, and et al. Bridging the gap between computational photography and visual recognition. arXiv:1901.09482, 2019.
  • [94] Y. Wang, L. Zhang, Z. Liu, G. Hua, Z. Wen, Z. Zhang, and D. Samaras. Face relighting from a single image under arbitrary unknown lighting conditions. In TPAMI, 2009.
  • [95] Z. Wang, A. Bovik, H. Sheikh, and E. Simoncelli. Image quality assessment: From error visibility to structural similarity. IEEE Trans. on Image Processing, 13(4):600–612, 2004.
  • [96] Y. Wen, B. Raj, and R. Singh. Face reconstruction from voice using generative adversarial networks. In NeurIPS, 2019.
  • [97] C. Whitelam, E. Taborsky, A. Blanton, B. Maze, J. Adams, T. Miller, N. Kalka, A. K. Jain, J. A. Duncan, K. Allen, J. Cheney, and P. Grother. Iarpa janus benchmark-b face dataset. In CVPR Workshops, 2017.
  • [98] X. Wu, R. He, Z. Sun, and T. Tan. A light cnn for deep face representation with noisy labels. TIFS, 2018.
  • [99] B. Xu, N. Wang, T. Chen, and M. Li. Empirical evaluation of rectified activations in convolutional network. arXiv:1505.00853, 2015.
  • [100] J. Yang, Z. Zhao, H. Zhang, and Y. Shi. Data augmentation for x-ray prohibited item images using generative adversarial networks. IEEE Access, 7:28894–28902, 2019.
  • [101] D. Yi, Z. Lei, S. Liao, and S. Z. Li. Learning face representation from scratch. In arXiv:1411.7923.
  • [102] H. Yu, O.G. Garrod, and P.G. Schyns. Perception-driven facial expression synthesis. Computers & Graphics, 36(3), 2012.
  • [103] M.D. Zeiler, D. Krishnan, G.W. Taylor, and R. Fergus. Deconvolutional networks. In CVPR, 2010.
  • [104] R. Zhang, P. Isola, and A.A. Efros. Colorful image colorization. In ECCV, 2016.
  • [105] R. Zhang, P. Isola, A.A. Efros, E. Shechtman, and O. Wang. The unreasonable effectiveness of deep features as a perceptual metric. In CVPR, 2018.
  • [106] H. Zhou, S. Hadap, K. Sunkavalli, and D. Jacobs. Deep single-image portrait relighting. In ICCV, 2019.
  • [107] S. Zhou, M.L. Gordon, R. Krishna, A. Narcomey, Fei-Fei. Li, and M.S. Bernstein. Hype: A benchmark for human eye perceptual evaluation of generative models. In NeurIPS, 2019.
  • [108] J.Y. Zhu, T. Park, P. Isola, and A.A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.
  • [109] K. Zuiderveld. Contrast limited adaptive histogram equalization. Graphics Gems IV, 1994.

7 Quality Estimation Model: Architecture Details

We share details of the architecture of our quality estimator Q in Table 6. The fully connected layers in Q are denoted as ‘fc’, while each convolution layer, represented as ‘conv’, is followed by Leaky ReLU [99] activation with a slope of 0.01.

Table 6: Detailed architecture of our quality estimation model Q (input size is 128×128×3).
Layer | Filter/Stride/Dilation | # of filters
input | 128×128 | 3
conv0 | 4×4/2/1 | 64
conv1 | 4×4/2/1 | 128
conv2 | 4×4/2/1 | 256
conv3 | 4×4/2/1 | 512
conv4 | 4×4/2/1 | 1024
conv5 | 4×4/2/1 | 2048
fc0 | - | 256
fc1 | - | 1
Figure 7: Histogram depicting the number of images in each naturalness bin, as rated by the Amazon Mechanical Turkers. Many more images fell in the ‘natural’ half (5 - 10) than in the ‘unnatural’ one (0 - 5), suggesting that the synthetic face images used in our study are more or less realistic.
Table 7: Hourglass architecture for expression mask (M_e) synthesis in the generator G. The input size is 128×128×9: three RGB channels (I_a) and six expression channels (c_e).
Layer | Filter/Stride/Dilation | # of filters
input | 128×128/-/- | 9
conv0 | 7×7/1/1 | 64
conv1 | 4×4/2/1 | 128
conv2 | 4×4/2/1 | 256
RB0 | 3×3/1/1 | 256
RB1 | 3×3/1/1 | 256
RB2 | 3×3/1/1 | 256
RB3 | 3×3/1/1 | 256
RB4 | 3×3/1/1 | 256
RB5 | 3×3/1/1 | 256
PS0 | - | 256
conv3 | 4×4/1/1 | 128
PS1 | - | 128
conv4 | 4×4/1/1 | 64
conv5 (M_e) | 7×7/1/1 | 3
Table 8: Hourglass architecture for lighting mask (M_l) synthesis in the generator G. The input size is 128×128×23: three RGB channels (I_a) and twenty lighting channels (c_l).
Layer | Filter/Stride/Dilation | # of filters
input | 128×128/-/- | 23
conv0 | 7×7/1/1 | 64
conv1 | 4×4/2/1 | 128
conv2 | 4×4/2/1 | 256
RB0 | 3×3/1/1 | 256
RB1 | 3×3/1/1 | 256
RB2 | 3×3/1/1 | 256
RB3 | 3×3/1/1 | 256
RB4 | 3×3/1/1 | 256
RB5 | 3×3/1/1 | 256
PS0 | - | 256
conv3 | 4×4/1/1 | 128
PS1 | - | 128
conv4 | 4×4/1/1 | 64
conv5 (M_l) | 7×7/1/1 | 3
Table 9: Hourglass architecture for target image (G(I_a, f_b)) synthesis in the generator G. The input size is 128×128×6: three expression mask channels (M_e) and three lighting mask channels (M_l).
Layer | Filter/Stride/Dilation | # of filters
input | 128×128/-/- | 6
conv0 | 7×7/1/1 | 64
conv1 | 4×4/2/1 | 128
conv2 | 4×4/2/1 | 256
RB0 | 3×3/1/1 | 256
RB1 | 3×3/1/1 | 256
RB2 | 3×3/1/1 | 256
RB3 | 3×3/1/1 | 256
RB4 | 3×3/1/1 | 256
RB5 | 3×3/1/1 | 256
PS0 | - | 256
conv3 | 4×4/1/1 | 128
PS1 | - | 128
conv4 | 4×4/1/1 | 64
conv5 (G(I_a, f_b)) | 7×7/1/1 | 3
Table 10: Detailed architecture of LEGAN’s discriminator D (input size is 128×128×3).
Layer | Filter/Stride/Dilation | # of filters
input | 128×128 | 3
conv0 | 4×4/2/1 | 64
conv1 | 4×4/2/1 | 128
conv2 | 4×4/2/1 | 256
conv3 | 4×4/2/1 | 512
conv4 | 4×4/2/1 | 1024
conv5 | 4×4/2/1 | 2048
conv6 (D_src) | 3×3/1/1 | 1
conv7 (D_cls) | 1×1/1/1 | 26

8 Quality Estimation Model: Naturalness Rating Distribution in Training

In this section, we share the distribution of the naturalness ratings collected from the Amazon Mechanical Turk (AMT) experiment (Stage II). To do this, we average the three perceptual scores for each synthetic face image and increment the count of the corresponding bin in [(0 - 1), (1 - 2), … , (8 - 9), (9 - 10)] based on the mean score. As described in Section 3 of the main paper, we design the AMT task such that a mean rating between 0 and 5 suggests the synthetic image looks 'unnatural', while a score between 5 and 10 advocates for its naturalness. As can be seen in Figure 7, the majority of the synthetic images used in our study receive a mean score that falls on the 'natural' side, validating their realism. When used to train our quality estimation model Q, these images tune its weights to look for the same perceptual features when rating the naturalness of other images.
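For illustration, the short sketch below reproduces this binning on toy ratings; the array layout and variable names are our own.

```python
import numpy as np

# one row of three crowd-sourced scores (0-10) per synthetic image (toy example)
ratings = np.array([[7, 8, 6],
                    [3, 4, 5],
                    [9, 9, 8]])

mean_scores = ratings.mean(axis=1)        # average the three scores per image
bins = np.arange(0, 11)                   # bin edges for (0-1), (1-2), ..., (9-10)
counts, _ = np.histogram(mean_scores, bins=bins)

natural = (mean_scores >= 5).sum()        # images falling on the 'natural' half
print(counts, natural)
```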

Figure 8: Mean naturalness rating of the different synthesis approaches used in our study [51, 52, 2, 78, 9]. As expected, the StyleGAN [52] images are rated higher than the others, as they were pre-filtered for quality [4].
Figure 9: Mean naturalness rating, as estimated by Turkers (blue) and predicted by our trained quality estimation model Q (red), for the different synthesis approaches used in our study [51, 52, 2, 78, 9]. These ratings are computed only on images from the test split in our experiments, so Q never encountered them during training. Yet, Q predicts the naturalness of these images with a high degree of accuracy.
Figure 10: Perceptual quality predictions by our trained quality estimation model Q on sample test images generated using [52, 51, 2, 78, 9]. For each image, the mean ± standard deviation of the three naturalness scores collected from AMT is shown below, while Q's prediction is shown above in red.

To further check the overall perceptual quality of each synthesis approach used in our study [51, 52, 2, 78, 9], we separately compute the mean rating over all synthetic face images generated by each method, as depicted in Figure 8. It comes as no surprise that the StyleGAN [52] images rank the highest, with a mean score over 7, as its face images were pre-filtered for quality [4]. The other four approaches perform roughly on par, each generating a mean score between 6 and 7.

9 Quality Estimation Model: Loss Function

Our loss computes the L2 norm between the predicted quality p and the mean label μ, and then computes a second L2 norm between this distance and the standard deviation σ; σ acts as a margin in this case. If we consider μ as the center of a circle of radius σ, then our loss tries to push p towards the boundary to fully capture the subjectiveness of human perception. We also tried a hinge version of this loss: $\max\left(0, \|\mu - p\|_{2}^{2} - \sigma\right)$. This function penalizes p falling outside the permissible circle while allowing it to lie anywhere within it. When σ is low, both functions act similarly. We found the quality estimation model's (Q) predictions to be less stochastic when trained with the margin loss than with the hinge loss. On a held-out test set, both losses performed similarly, with only a 0.2% difference in regression accuracy. Experimental results with LEGAN, and especially with StarGAN trained using Q (Tables 1, 2 and 3 in the main text), underscore the effectiveness of the margin loss in capturing naturalness. The improvements in perceptual quality, as measured by LPIPS and FID, further justify its use as an objective for training Q.
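For clarity, a small PyTorch sketch of both variants is given below, assuming scalar ratings and a batch-mean reduction; it illustrates the formulation above rather than reproducing our training code.

```python
import torch

def margin_quality_loss(pred, mu, sigma):
    """Margin regression loss sketch: push the prediction p towards the boundary of a
    circle of radius sigma centred at the mean rating mu (batch-mean reduction assumed)."""
    dist = (mu - pred).pow(2)                  # ||mu - p||_2^2 per sample
    return (dist - sigma).pow(2).mean()        # second L2 norm against the margin sigma

def hinge_quality_loss(pred, mu, sigma):
    """Hinge variant: only penalise predictions that fall outside the circle."""
    dist = (mu - pred).pow(2)
    return torch.clamp(dist - sigma, min=0).mean()

# toy usage
p = torch.tensor([7.2, 4.9])
mu = torch.tensor([7.0, 5.5])
sigma = torch.tensor([0.5, 1.0])
print(margin_quality_loss(p, mu, sigma), hinge_quality_loss(p, mu, sigma))
```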

10 Quality Estimation Model: Prediction Accuracy During Testing

As discussed in Section 3 of the main text, we hold out 10% of the crowd-sourced data (3,727 face images) for testing our quality estimation model Q after training. Since Q never encountered these images during training, we use them to evaluate the effectiveness of our model. We separately compute the mean naturalness score for each synthesis approach used in our study and compare this value with the average quality score predicted by Q. The results can be seen in Figure 9. Overall, our model predicts the naturalness score for each synthesis method with a high degree of accuracy. Some qualitative results can also be seen in Figure 10.

11 LEGAN: Detailed Architecture

In this section, we list the different layers in the generator G and discriminator D of LEGAN. Since G is composed of three hourglass networks, we describe their architectures separately in Tables 7, 8 and 9. The convolution layers, residual blocks and pixel shuffling layers are indicated as 'conv', 'RB' and 'PS' respectively in the tables. After each 'conv' and 'PS' layer in an hourglass, we apply ReLU activation and instance normalization [90], except for the last 'conv' layer, where a tanh activation is used [77, 80]. The description of D can be found in Table 10. As in Q, each convolution layer in D is followed by Leaky ReLU [99] activation with a slope of 0.01, except for the final two convolution layers, which output the realness matrix D_src and the classification map D_cls.
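For reference, below is a minimal PyTorch sketch of one such hourglass, assuming a standard two-convolution residual block, 'same'-style padding for the stride-1 convolutions, and an upscale factor of 2 for each pixel-shuffle layer (which divides the channel count by four); these details are not fully specified in the tables, and the listed filter counts for the 'PS' rows are read here as the incoming channels.

```python
import torch
import torch.nn as nn

def conv_in_relu(in_ch, out_ch, k, s, p):
    # 'conv' layer followed by instance normalization and ReLU
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, k, stride=s, padding=p),
                         nn.InstanceNorm2d(out_ch), nn.ReLU())

class ResidualBlock(nn.Module):
    # RB: two 3x3 convolutions with a skip connection (exact layout is our assumption)
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, 1, 1), nn.InstanceNorm2d(ch), nn.ReLU(),
            nn.Conv2d(ch, ch, 3, 1, 1), nn.InstanceNorm2d(ch))
    def forward(self, x):
        return x + self.body(x)

class Hourglass(nn.Module):
    """One hourglass of G (Tables 7-9): in_ch is 9 (M_e), 23 (M_l) or 6 (target image)."""
    def __init__(self, in_ch):
        super().__init__()
        self.net = nn.Sequential(
            conv_in_relu(in_ch, 64, 7, 1, 3),                        # conv0
            conv_in_relu(64, 128, 4, 2, 1),                          # conv1 (128 -> 64)
            conv_in_relu(128, 256, 4, 2, 1),                         # conv2 (64 -> 32)
            *[ResidualBlock(256) for _ in range(6)],                 # RB0 - RB5
            nn.PixelShuffle(2), nn.InstanceNorm2d(64), nn.ReLU(),    # PS0 (32 -> 64)
            nn.Conv2d(64, 128, 4, 1, padding='same'),                # conv3
            nn.InstanceNorm2d(128), nn.ReLU(),
            nn.PixelShuffle(2), nn.InstanceNorm2d(32), nn.ReLU(),    # PS1 (64 -> 128)
            nn.Conv2d(32, 64, 4, 1, padding='same'),                 # conv4
            nn.InstanceNorm2d(64), nn.ReLU(),
            nn.Conv2d(64, 3, 7, 1, 3), nn.Tanh())                    # conv5 (3-channel output)
    def forward(self, x):
        return self.net(x)

# e.g. the expression-mask hourglass takes I_a (3 ch) concatenated with c_e (6 ch)
out = Hourglass(in_ch=9)(torch.randn(1, 9, 128, 128))
print(out.shape)  # torch.Size([1, 3, 128, 128])
```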

Figure 11: Sample qualitative results from LEGAN and its ablated variants on randomly sampled input images from the MultiPIE [39] test set. The target expression and lighting conditions for each row are: (a) (Smile, Left Shadow), (b) (Squint, Ambient), (c) (Disgust, Left Shadow), and (d) (Surprise, Ambient).
Table 11: Ablation studies - quantitative results on the held-out CMU-MultiPIE [39] test set.
Models | FID [45] ↓ | LPIPS [105] ↓ | SSIM [95] ↑ | Match Score [44, 22] ↑ | Quality Score ↑
wo/ disentangling | 40.244 | 0.148 | 0.557 | 0.601 | 5.348
wo/ L_adv | 351.511 | 0.460 | 0.352 | 0.476 | 1.74
wo/ L_cls | 30.236 | 0.139 | 0.425 | 0.717 | 5.873
wo/ L_rec | 40.479 | 0.135 | 0.550 | 0.676 | 5.475
wo/ L_qual | 46.420 | 0.168 | 0.544 | 0.621 | 5.190
wo/ L_id | 35.429 | 0.140 | 0.566 | 0.587 | 5.861
LEGAN | 29.964 | 0.120 | 0.649 | 0.649 | 5.853

12 LEGAN: Ablation Study

To analyze the contribution of each component to synthesis quality, we prepare six different versions of LEGAN by removing one of feature disentanglement, L_adv, L_cls, L_rec, L_qual and L_id from G at a time, while keeping everything else the same. The qualitative and quantitative results, produced using MultiPIE [39] test data, are shown in Figure 11 and Table 11 respectively. For the quantitative results, the output image is compared with the corresponding target image in MultiPIE, not the source (i.e. input) image.

As expected, we find L_adv to be crucial for realistic hallucinations: in its absence, the model generates non-translated images that lie entirely outside the manifold of real images. The disentanglement of lighting and expression via LEGAN's hourglass pair allows the model to generate the transformation masks independently, which in turn yields more realistic hallucinations. Without the disentanglement, the model synthesizes face images with pale skin tones and suppressed expressions. When L_cls is removed, LEGAN simply returns the input image, as the target attributes are no longer checked by D; since the input image is passed through unchanged, this variant attains high face matching and mean quality scores (Table 11, third row). When the reconstruction loss L_rec is removed, the output images lie somewhere in between the input and target expressions, suggesting that this loss contributes to a smooth translation of pixels. Removing L_qual or L_id deteriorates the overall naturalness, with artifacts manifesting in the eye and mouth regions. As expected, the best overall metrics are obtained when the full LEGAN model with all loss components is used.

13 LEGAN: Optimal Upsampling

To check the effect of different upsampling approaches on hallucination quality, we separately apply bilinear interpolation, transposed convolution [103] and pixel shuffling [82] in the decoder modules of the three hourglass networks in LEGAN's generator G. While the first approach interpolates the upsampled pixels from the original pixels, the other two explicitly learn the pixel intensities during upsampling. More specifically, pixel shuffling blocks learn the intensities of the pixels at the fractional indices of the original image (i.e. the upsampled indices) using a set of convolution channels, and have been shown to generate sharper results than transposed convolutions. Unsurprisingly, pixel shuffling also generates the best quantitative results here, outperforming the other two upsampling approaches on 3 out of the 5 objective metrics, as shown in Table 12. Hence, we use pixel shuffling blocks in our final implementation of LEGAN.
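To make the three options concrete, the sketch below shows interchangeable PyTorch upsampling blocks; the channel counts are illustrative rather than tied to a specific decoder stage of G.

```python
import torch
import torch.nn as nn

ch = 256  # channels entering the upsampling stage (illustrative)

# (a) bilinear interpolation: pixel intensities are interpolated, not learned
bilinear = nn.Sequential(
    nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False),
    nn.Conv2d(ch, ch // 2, kernel_size=3, padding=1))

# (b) transposed convolution: the upsampling kernel itself is learned
transposed = nn.ConvTranspose2d(ch, ch // 2, kernel_size=4, stride=2, padding=1)

# (c) pixel shuffling: a convolution predicts 4x channels, which are then
#     rearranged into a 2x-upscaled feature map (sub-pixel convolution)
pixelshuffle = nn.Sequential(
    nn.Conv2d(ch, 4 * (ch // 2), kernel_size=3, padding=1),
    nn.PixelShuffle(2))

x = torch.randn(1, ch, 32, 32)
for name, up in [('bilinear', bilinear), ('transposed', transposed), ('pixelshuffle', pixelshuffle)]:
    print(name, up(x).shape)   # each yields torch.Size([1, 128, 64, 64])
```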

However, as can be seen in Figure 12, the expression and lighting transformation masks M_e and M_l are more meaningful when interpolated rather than explicitly learned. Interpolation leads to a smoother flow of upsampled pixels, with facial features and their transformations visibly more noticeable than with transposed convolutions or pixel shuffling.

Figure 12: Adding different upsampling techniques to our decoder modules generates hallucinations with slightly different perceptual scores for the same input. Here the target expression and lighting conditions are set as: (a) (Smile, Bright), and (b) (Surprise, Left Shadow). However, the transformation masks M_e and M_l are smoother and more meaningful when bilinear interpolation is used for upsampling. Since both transposed convolution [103] and pixel shuffling [82] learn the intensities of the upsampled pixels instead of simply interpolating them, the masks they generate are more fragmented and discrete. We use pixel shuffling in our final LEGAN model.
Table 12: Effects of different upsampling approaches - quantitative results on the held-out CMU-MultiPIE [39] test set.
Models | FID [45] ↓ | LPIPS [105] ↓ | SSIM [95] ↑ | Match Score [44, 22] ↑ | Quality Score ↑
Bilinear Interpolation | 29.933 | 0.128 | 0.630 | 0.653 | 5.823
Transposed Convolution [103] | 28.585 | 0.125 | 0.635 | 0.644 | 5.835
Pixel Shuffling [82] | 29.964 | 0.120 | 0.649 | 0.649 | 5.853

14 LEGAN: Optimal Value of q

As discussed in the main text, we set the hyper-parameter q = 8 for computing the quality loss L_qual. We arrive at this value after experimenting with different settings. Since q acts as a target for perceptual quality when estimating L_qual during the forward pass, it can reasonably range from 5 (realistic) to 10 (hyper-realistic). We therefore set q to every integer value between 5 and 10 and evaluate the synthesis results both qualitatively (Figure 13) and quantitatively (Table 13).
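To show where q enters the computation, here is a hedged sketch of the quality loss during the forward pass; the exact form of L_qual is defined in the main text, and the simple L1 penalty and function signature below are assumptions made purely for illustration.

```python
import torch

def quality_loss(Q, generated, q_target=8.0):
    """Illustrative only: penalise the deviation of Q's naturalness prediction on the
    generated images from the target q (an L1 penalty is assumed here; see the main
    text for the exact definition of L_qual)."""
    return (Q(generated) - q_target).abs().mean()

# e.g. with the QualityEstimator sketched in Section 7 (Q is kept frozen):
# Q = QualityEstimator().eval()
# loss = quality_loss(Q, fake_images)   # fake_images = G(I_a, f_b)
```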

Figure 13: Sample results illustrating the effect of the hyper-parameter q on synthesis quality. The input images are randomly sampled from the MultiPIE [39] test set, with the target expression and lighting conditions set as: (a) (Disgust, Left Shadow), and (b) (Neutral, Right Shadow). Since it generates more stable and noticeable expressions with fewer artifacts, we set q = 8 for the final LEGAN model.

As can be seen, when q is set to 8, LEGAN generates more stable images with far fewer artifacts than with other values of q. The synthesized expressions are also visibly more noticeable for this value of q (Figure 13, top row). When evaluated quantitatively, images generated by LEGAN with q = 8 obtain the best score on 4 out of 5 objective metrics. This is interesting, as setting q = 10 (rather than 8) should ideally produce hyper-realistic images and, consequently, the best quantitative scores. We attribute this behavior to the naturalness distribution of the images used to train our quality model Q. Since the majority of these images fell in the (7-8) and (8-9) bins, and very few in (9-10) (as shown in Figure 7), Q's representations are aligned to this range. As a result, Q tends to rate hyper-realistic face images (i.e. images with a mean naturalness rating between 8 and 10) with a score around 8. Such an example can be seen in the rightmost column of the first row of Figure 10, where Q rates a hyper-realistic StyleGAN-generated image [52] as 8.3. Thus, setting q = 8 for the L_qual computation (using the trained Q's weights) during LEGAN training produces the optimal results.

Table 13: Quantitative results on the held-out CMU-MultiPIE [39] test set when varying the hyper-parameter q.
Models | FID [45] ↓ | LPIPS [105] ↓ | SSIM [95] ↑ | Match Score [44, 22] ↑ | Quality Score ↑
q = 5 | 41.275 | 0.143 | 0.550 | 0.642 | 5.337
q = 6 | 44.566 | 0.139 | 0.542 | 0.651 | 5.338
q = 7 | 38.684 | 0.137 | 0.631 | 0.663 | 5.585
q = 8 (LEGAN) | 29.964 | 0.120 | 0.649 | 0.649 | 5.853
q = 9 | 42.772 | 0.137 | 0.637 | 0.659 | 5.686
q = 10 | 46.467 | 0.132 | 0.586 | 0.583 | 5.711
Figure 14: Our perceptual study interface: given a base face image with neutral expression and bright lighting (leftmost image), a rater is asked to select the image that best matches the target expression (‘Squint’) and lighting (‘Right Shadow’) for the same subject.

15 LEGAN: Perceptual Study Details

In this section, we share more details about the interface used for our perceptual study. As shown in Figure 14, we ask the raters to pick the image that best matches a target expression and lighting condition. To provide a basis for judgement, we also show a real image of the same subject with a neutral expression under bright lighting. However, this reference is not necessarily the input fed to the synthesis models for generating the target expression and lighting, as we want to estimate how these models perform when the input image contains more extreme expressions and lighting conditions. The image order is also randomized to eliminate any bias.

16 LEGAN: Model Limitations

Although LEGAN is trained only on frontal face images acquired in a controlled setting, it can still generate realistic outputs even for non-frontal images with a variety of expressions, as shown in Figures 16 and 17. However, as with any synthesis model, LEGAN has its limitations. In the majority of cases where LEGAN fails to synthesize a realistic image, the input expression is irregular, or the face is non-frontal or occluded, as can be seen in Figure 15. In these cases, LEGAN fails to generalize and synthesizes images with incomplete translations or very little pixel manipulation. One way to mitigate this problem is to extend both our quality model and LEGAN to non-frontal poses and occlusions by introducing randomly posed face images during training.

Figure 15: Failure cases: for each input image, LEGAN fails to correctly generate the target facial expression. In (a), LEGAN manages to generate a surprised mouth but fails to open the subject's eyes; in (b), the smile is only half generated due to occlusion by the subject's fingers; in (c), the target disgusted expression is missing; and in (d), the subject's eyes are not squinted. Most of these failures are due to either a non-frontal facial pose or occlusion.

17 LEGAN: More Qualitative Results

In this section, we share more qualitative results generated by LEGAN on unconstrained data from the AFLW [56] and CelebA [61] datasets in Figures 16 and 17 respectively. The randomly selected input images vary in ethnicity, gender, color composition, resolution, lighting, expression and facial pose. To judge LEGAN's generalizability, we train the model only on 33k frontal face images from MultiPIE [39] and do not fine-tune it on any other dataset.

Figure 16: Synthesized expressions and lighting conditions for the same input image, as generated by LEGAN. The input images are randomly selected from the AFLW [56] dataset, and the results are generated by randomly setting different expression and lighting targets. LEGAN is trained on 33k frontal face images from MultiPIE [39] and is not fine-tuned on any other dataset. All images are 128×128×3.
Figure 17: Synthesized expressions and lighting conditions for the same input image, as generated by LEGAN. The input images are randomly selected from the CelebA [61] dataset, and the results are generated by randomly setting different expression and lighting targets. LEGAN is trained on 33k frontal face images from MultiPIE [39] and is not fine-tuned on any other dataset. All images are 128×128×3.

18 Recolorization Network: Architecture Details

For the colorization augmentation network, we use a generator architecture similar to the one used in [10] for the 128×128×3 resolution. The generator is an encoder-decoder with skip connections between the encoder and decoder layers, while the discriminator is the popular CASIANet [101] architecture. Details of the generator layers can be found in Table 14.

Table 14: Colorization generator architecture (input size is 128×128×3).
Layer | Filter/Stride/Dilation | # of filters
conv0 | 3×3/1/2 | 128
conv1 | 3×3/2/1 | 64
RB1 | 3×3/1/1 | 64
conv2 | 3×3/2/1 | 128
RB2 | 3×3/1/1 | 128
conv3 | 3×3/2/1 | 256
RB3 | 3×3/1/1 | 256
conv4 | 3×3/2/1 | 512
RB4 | 3×3/1/1 | 512
conv5 | 3×3/2/1 | 1,024
RB5 | 3×3/1/1 | 1,024
fc1 | - | 512
fc2 | - | 16,384
conv3 | 3×3/1/1 | 4*512
PS1 | - | -
conv4 | 3×3/1/1 | 4*256
PS2 | - | -
conv5 | 3×3/1/1 | 4*128
PS3 | - | -
conv6 | 3×3/1/1 | 4*64
PS4 | - | -
conv7 | 3×3/1/1 | 4*64
PS5 | - | -
conv8 | 5×5/1/1 | 3
Figure 18: Recolorization example: we randomly select a test image from the MultiPIE [39] dataset and recolor it using colorization generator snapshots trained on the UMDFaces [12] and FFHQ [51] datasets respectively. Although the image is recolored, its lighting is preserved by the colorization generator.

We train two separate versions of the colorization network using 10,000 randomly selected face images from the UMDFaces [12] and FFHQ [51] datasets respectively. These two trained generators can then be used to augment LEGAN's training set by randomly recoloring the MultiPIE [39] images from the training split. An example is shown in Figure 18.
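A hedged sketch of how these frozen generators might be plugged in as a recoloring augmentation is shown below; the helper name and the way the two snapshots are loaded are hypothetical.

```python
import random
import torch

def recolor_batch(images, colorizers):
    """Hypothetical helper: recolor a batch of MultiPIE training images with a randomly
    chosen pretrained colorization generator (UMDFaces- or FFHQ-trained snapshot)."""
    G_color = random.choice(colorizers)
    with torch.no_grad():                 # the colorizer stays frozen during augmentation
        return G_color(images)

# usage, assuming the two trained generator snapshots have been loaded:
# augmented = recolor_batch(batch, [G_umdfaces, G_ffhq])
```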