Reducing Racial Bias in Facial Age Prediction using Unsupervised Domain Adaptation in Regression

Apoorva Gokhale*, Astuti Sharma*, Kaustav Datta*, Savyasachi*
University of California, San Diego
{agokhale, asharma, kdatta, ssavyasa}@eng.ucsd.edu
* Equal contribution. Names are listed alphabetically.

Abstract

We propose an approach to unsupervised domain adaptation for the task of estimating a person’s age from a face image. Most publicly available face image datasets are racially biased, and models trained on them inherit that bias. To limit this, we perform domain adaptation so that the predictor learns features that are invariant to ethnicity, improving generalization across faces of people from different ethnic backgrounds. Exploiting the ordinality of age, we also impose ranking constraints on the model’s predictions: the model takes a pair of images as input and outputs both the age difference and the rank of the first identity relative to the second in terms of age. Furthermore, we use Multi-Dimensional Scaling to retrieve absolute ages from the predicted age differences using as few as two labelled images from the target domain. We experiment with a publicly available dataset with age labels, dividing it into subsets based on the ethnicity labels and evaluating our approach on data from an ethnicity different from the one the model is trained on. Additionally, we impose one constraint to keep the predictions consistent with respect to relative and absolute ages, and another to ensure that the predictions are smooth with respect to the input. We experiment extensively and compare various domain adaptation approaches for regression.

1 Introduction

Human beings are a wonderful species comprising highly diverse races, each beautiful in its own way. Everyone has distinct facial features, and it is difficult, even for humans, to judge someone’s age just by looking at their face; for machines the task is harder still. Age prediction models are useful in many settings. Some services are unsuitable for certain age groups and are therefore age-restricted. For instance, certain content, both online and in the physical world, is restricted for children, and automated tools that facilitate such access restrictions are useful. We choose this problem because of its challenging nature and its moral and ethical applications.

Prior works [13] deal with this problem. However, as in any other machine learning task, the accuracy of these models is limited by the datasets they are trained on. Most face image datasets are biased towards Caucasian ethnicity [7] because such data is easier to obtain, and other groups such as Asians, African-Americans and South-East Asians are under-represented. As a result, models trained on these datasets perform worse on the under-represented ethnicities. Furthermore, labeling an age dataset is cumbersome and expensive compared to, for example, labeling gender. It is therefore difficult to create sufficiently large labeled datasets for different ethnicities. In our work, we try to tackle these problems and reduce the racial bias in age prediction models using the rapidly evolving techniques of domain adaptation. To the best of our knowledge, no one has tried to apply domain adaptation to the task of age prediction, which further strengthens the relevance of our work.

Domain adaptation techniques help reduce the domain bias of a model trained on biased datasets. In a standard domain adaptation task, there is a single source domain for which ample labeled data is available and one or more target domains where the labeled data is either very limited (Semi-Supervised Domain Adaptation) or not available at all (Unsupervised Domain Adaptation). The objective is to leverage the available data to build a domain-invariant predictor. We discuss this in detail in the next section.

We model this problem as an unsupervised domain adaptation task where we try to adapt a deep convolutional age regressor trained on labeled data where identities are of Caucasian ethnicity to data from other ethnicities. We employ both Adversarial and Maximum Mean Discrepancy-based methods for domain adaptation.

Since different races age differently, we think that the domain shift in an age prediction model might be too large to adapt. To tackle this, we try a different approach in which we train a model to estimate the age difference, and also the ordering, between two given face images and then adapt it across ethnicities. We use Multi-Dimensional Scaling to convert the age differences predicted by the model to actual age values. This is driven by the notion that features capturing the age difference and the ordering should be more adaptable across races than features capturing the absolute age.

In this paper, we first discuss related work in Section 2. In Section 3, we discuss in detail the methods that we employ. Section 4 covers the implementation details and the experiments we run in terms of hyper-parameters and variations of the proposed methods. We conclude and discuss future work and improvements in Section 5.

2 Related Work

2.1 Age prediction

A number of existing works tackle the problem of age estimation from images. Taking a classical machine learning approach, the authors of [13] perform coarse-to-fine prediction of age from images by extracting features such as Bio-Inspired Features (BIF), Kernel-based Local Binary Patterns (KLBP) and Multi-scale Wrinkle Patterns (MWP), which are used for Support Vector Regression.

In [5] the authors use a VGG-16 [14] backbone along with an expectation computed over the softmax output over all possible ages.

Inspired by [5], the authors of [15] propose a novel network structure and a compact stage-wise model for age estimation in which a dynamic range is introduced for each age group. In this method, the age interval of each group can be shifted and scaled depending on the input face image.

The authors of [6] propose an approach for domain adaptation for age estimation in which they introduce a Gaussian kernel MMD-based and Graph Laplacian-based term in the objective function, to ensure that the model learns features that are invariant and are such that the predictions of the model preserve the smoothness in the inputs, across domains.

2.2 Combined Regression and Ranking

The author of [2] proposes the joint optimization of ranking and regression objective functions to improve prediction accuracy while ensuring good performance in both tasks. This is implemented using a combined loss function that weighs the squared error between the predicted and ground-truth values against a logistic loss between the predicted rank and the actual ordering of a pair of values sampled from the same dataset. The experiments in [2] show that optimizing the ranking objective along with the regression objective helps improve the prediction of target values on rarely occurring examples. To exploit this in our problem scenario, we design our model to output both the difference between the ages of the identities in the two input images and the rank, a value in {0, 1} depending on which identity is older. We optimize a combined objective function that takes both of these into account.

2.3 Learning Distance Metrics between Pairs of Images

In [1], the authors tackle the problem of unsupervised domain adaptation when the source and the target domains have disjoint label spaces by reformulating the classification problem as a verification task. They propose a Feature Transfer Network that allows simultaneous optimization of a domain adversarial loss and a domain separation loss, as well as a variant of the N-pair metric loss for entropy minimization on the target domain, where the ground-truth label structure is unknown, to further improve the adaptation quality. They demonstrate this on cross-ethnicity face verification that overcomes label biases in the training data. Our approach is loosely related to this, as we too estimate the difference between an ordinal attribute of two images input to the model simultaneously.

The conditional adversarial autoencoder (CAAE) proposed in [4] achieves face age progression and regression in a holistic framework. Starting from an arbitrary query face without knowing its true age, they are able to freely produce faces at different ages, while at the same time preserving the personality.

DEX [5] tackles the estimation of apparent age in still face images. It poses the age regression problem as a deep classification problem followed by a softmax expected value refinement and shows improvements over direct regression training of CNNs. DEX ensembles the predictions of 20 networks on the cropped face image and does not explicitly employ facial landmarks. The authors also crawled internet face images with available age to create a large public dataset, IMDB-WIKI. However, that dataset does not have ethnicity labels, so it is not directly usable for our objective.
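The softmax expected value step can be illustrated with a minimal sketch. The function names, logits and age bins below are illustrative assumptions and are not taken from [5]; the sketch only shows how a classification output over discrete ages is turned into a single real-valued age.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of raw scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_age(logits, ages):
    """Expected value of the age under the softmax distribution.

    `logits` holds one score per age bin and `ages` the age (in years)
    that each bin represents; both are hypothetical inputs.
    """
    probs = softmax(logits)
    return sum(p * a for p, a in zip(probs, ages))


# Toy example: three age bins with equal scores give the mean of the bins.
print(expected_age([0.0, 0.0, 0.0], [10, 20, 30]))
```

Compared with taking the most likely bin, the expectation yields a continuous estimate that can fall between bins.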

3 Methods

We focus on unsupervised domain adaptation. We have a source dataset $X_{s}=\{(x_{i},y_{i})\}_{i=0}^{N_{s}}$ drawn from a labeled source domain S, and a dataset $X_{t}=\{x_{j}\}_{j=0}^{N_{t}}$ from a different, unlabeled target domain. In our case, the source and target domains are images of people of different ethnicities and the labels are age values. As we treat age as a continuous value, we model this as a regression task. All of our models are deep neural networks with a regression layer at the end.
We experiment with some baseline models and build our approach on top of them. In the next few sections we describe our models, the motivation for using them, and the details of their architecture.

3.1 Baseline: Source Only

As a baseline, we work with single-domain age predictor models. We take a pre-trained model, remove the classifier, plug in our regression module, and fine-tune it to predict age values. The regression module consists of multiple linear layers with a single output at the end that gives the age value. We experiment with MSE (L2) and MAE (L1) losses, find better performance with MAE, and therefore use the MAE loss for all our regression models. We try different pre-trained models, including variants of ResNet [9] and Inception [10]. In our case, ResNet-50 gives the best performance, and we use it as the base model for all of our experiments.

3.2 Pairwise image ranking

We adopt an alternative approach in which our model learns to predict the difference in age between a pair of images, together with their ranking. The two images are concatenated and fed into the network as a 6-channel input. The base architecture remains the same as in our baseline model above, except that instead of a single output at the end of the regression module we now have two outputs: the distance between the images, and a ranking value that comes from a sigmoid activation. The loss function consists of two terms, namely the L1/L2 regression loss and a binary cross-entropy loss for ranking. As mentioned previously, this ranking information helps our model adapt better, because it is reasonable to assume that age differences remain consistent across ethnicities.
Now that we can compute the pairwise distance between any two images, we employ Multi Dimensional Scaling to obtain our final predictions of the absolute ages, as explained in the next section.
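The two-term loss for a single image pair can be sketched as follows. This is a minimal illustration under stated assumptions rather than our exact implementation: the function name, the weight `alpha`, the convention that the rank target is 1 when the first identity is older (and 0 otherwise, including ties), and the clamping constant are all assumptions made for the example.

```python
import math


def pairwise_loss(pred_diff, pred_rank_prob, age_a, age_b, alpha=1.0):
    """L1 loss on the age difference plus alpha * binary cross-entropy on the rank.

    pred_diff      -- predicted age difference (first image minus second)
    pred_rank_prob -- sigmoid output, read here as P(first identity is older)
    age_a, age_b   -- ground-truth ages of the two identities
    alpha          -- hypothetical weight of the ranking term
    """
    true_diff = age_a - age_b
    true_rank = 1.0 if age_a > age_b else 0.0  # assumed convention

    regression = abs(pred_diff - true_diff)

    # Clamp the probability to keep the logarithms finite.
    eps = 1e-7
    p = min(max(pred_rank_prob, eps), 1.0 - eps)
    ranking = -(true_rank * math.log(p) + (1.0 - true_rank) * math.log(1.0 - p))

    return regression + alpha * ranking


# A perfect difference prediction with an undecided rank leaves only the BCE term.
print(pairwise_loss(pred_diff=5.0, pred_rank_prob=0.5, age_a=30, age_b=25))
```

In a batched setting the same expression would be averaged over all sampled source pairs.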

3.3 Multi Dimensional Scaling

Multi Dimensional Scaling (MDS) is an algorithm that, given the pairwise dissimilarities between the objects in a set and a number of dimensions N, maps the objects into an N-dimensional space such that the pairwise distances are preserved as well as possible. We use Metric MDS, which works by minimizing a cost function called Stress, defined as:

$$\mathrm{Stress}_{D}(x_{1},x_{2},\dots,x_{n})=\sqrt{\sum_{i\neq j=1,\dots,n}\left(d_{ij}-\|x_{i}-x_{j}\|\right)^{2}}\tag{1}$$

where $x_{1},...x_{n}$ are the mapped points in N dimensional space, and $d_{ij}$ is an element of the distance matrix.
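As an illustration of the Stress function for a one-dimensional embedding, a minimal sketch is given below. The function name and the toy values are assumptions made for the example; the sketch evaluates the cost for a given embedding and does not perform the minimization itself.

```python
import math


def stress(points, dist):
    """Stress of a 1-D embedding `points` against a distance matrix `dist`.

    points -- list of n scalar coordinates (the mapped points)
    dist   -- n x n nested list of target dissimilarities d_ij
    """
    n = len(points)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += (dist[i][j] - abs(points[i] - points[j])) ** 2
    return math.sqrt(total)


# Toy example: distances that are exactly realizable on a line give zero stress.
points = [0.0, 3.0, 5.0]
dist = [[abs(a - b) for b in points] for a in points]
print(stress(points, dist))
```

Because the sum runs over ordered pairs, each unordered pair contributes twice, which matches the $i\neq j$ indexing of the formula.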

3.4 Label Normalization

In order to make the gradients smoother and more stable while training the model, we experimented with label normalization to limit the output labels, absolute ages or age differences, to the range [0,1]. Empirically, we found that this does not have a positive influence on the training process.

3.5 Identity Constraints

It is desirable that the predictor $f$ (or $f_{1}$ or $f_{2}$ ) behaves reasonably, so we exploit two regularities. First, for two images that are copies of each other, the predicted age difference should be 0 and the ranking function should have maximum uncertainty. Second, swapping the inputs should invert the outputs: the age difference should satisfy $f_{1}(A,B)=-f_{1}(B,A)$ , $f_{1}$ being the predictive function for age difference, and the rank should satisfy $f_{2}(A,B)=f_{2}(B,A)^{\prime}$ , $f_{2}$ being the ranking function and $^{\prime}$ denoting the complement. We include these constraints in our complete objective function.
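One possible way to turn these regularities into a penalty is sketched below. The absolute-value penalties, the reading of "maximum uncertainty" as a rank output of 0.5, the reading of the complement as one minus the sigmoid output, and all function names are assumptions made for illustration and are not a specification of our exact loss.

```python
import math


def constraint_penalty(f1, f2, a, b):
    """Penalty that is zero when the identity and inversion constraints hold.

    f1 -- callable returning the predicted age difference for a pair
    f2 -- callable returning the predicted rank probability for a pair
    a, b -- two inputs (any objects the callables accept)
    """
    # Identity: a pair of identical images has zero difference and rank 0.5.
    identity = abs(f1(a, a)) + abs(f2(a, a) - 0.5)
    # Inversion: swapping inputs negates the difference and complements the rank.
    inversion = abs(f1(a, b) + f1(b, a)) + abs(f2(a, b) - (1.0 - f2(b, a)))
    return identity + inversion


# Toy predictors acting directly on scalar "ages" instead of images.
def toy_diff(a, b):
    return a - b


def toy_rank(a, b):
    return 1.0 / (1.0 + math.exp(-(a - b)))


print(constraint_penalty(toy_diff, toy_rank, 30.0, 25.0))
```

The toy predictors satisfy both constraints by construction, so the penalty is zero up to floating-point error; a predictor that ignores its second argument would be penalized.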

3.6 Adaptive Measures

3.6.1 Adversarial

One approach to domain adaptation is to learn a domain-invariant representation of the data. That is, we learn a model that generalizes well from one domain to another by finding an internal representation that contains no discriminative information about the origin of the input (source or target) while preserving a low risk on the source (labeled) examples. Based on this idea, we use DANN [12] as one of our domain adaptation methods.
DANN (Domain-Adversarial training of Neural Networks) implements the idea discussed above in the context of neural network architectures that are trained on labeled data from the source domain and unlabeled data from the target domain. As training progresses, the approach promotes the emergence of features that are (i) discriminative for the main learning task on the source domain and (ii) indiscriminate with respect to the shift between the domains. This adaptation behaviour can be achieved in almost any feed-forward model by augmenting it with a few standard layers and a new gradient reversal layer. The resulting augmented architecture can be trained using standard backpropagation and stochastic gradient descent. The complete DANN architecture is shown in Figure 1.
We use ResNet as the feature extractor and our regression module as the label predictor, which together form a standard feed-forward architecture. Unsupervised domain adaptation is achieved by adding an adversarial network composed of three linear layers, connected to the feature extractor via a gradient reversal layer that multiplies the gradient by a negative constant during backpropagation-based training.

Figure 1: Domain-Adversarial training of Neural Networks (DANN)
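The behaviour of the gradient reversal layer can be written compactly, following the formulation in [12]. The symbols $\lambda$ (the magnitude of the negative constant), $\mu$ (the learning rate) and $\theta_{f}$ (the feature extractor parameters) are notation assumed here for illustration: the layer is the identity in the forward pass and scales the gradient by $-\lambda$ in the backward pass, so the feature extractor is updated to reduce the regression loss while increasing the domain classification loss.

```latex
% Forward pass: identity. Backward pass: gradient scaled by -\lambda.
R_{\lambda}(x) = x, \qquad \frac{\partial R_{\lambda}}{\partial x} = -\lambda I

% Resulting update of the feature extractor parameters \theta_f
\theta_{f} \leftarrow \theta_{f} - \mu\left(
    \frac{\partial L_{reg}}{\partial \theta_{f}}
    - \lambda \frac{\partial L_{adv}}{\partial \theta_{f}}
\right)
```

The adversarial network itself is still updated by ordinary gradient descent on $L_{adv}$; only the gradient flowing back into the feature extractor changes sign.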

3.6.2 MMD

Because labels are unavailable in the target domain, one commonly used strategy in UDA is to learn a domain-invariant representation by minimizing the discrepancy between the domain distributions. Following this idea, many existing methods aim to bound the target error by the source error plus a discrepancy metric between the source and the target. One such discrepancy metric, MMD (Maximum Mean Discrepancy), is based on the notion of embedding probability distributions in a reproducing kernel Hilbert space. It measures the distance between the kernel mean embeddings of two distributions. For MMD, we use a Gaussian kernel and perform single-layer adaptation on fc1.
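A minimal sketch of a Gaussian-kernel MMD estimate between two small batches of feature vectors is given below. It uses the simple biased estimator (all pairs, including each point with itself) and a single fixed bandwidth; both choices, as well as the function names, are assumptions made for illustration.

```python
import math


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two feature vectors given as lists."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd_squared(source, target, sigma=1.0):
    """Biased estimate of the squared MMD between two sets of feature vectors."""

    def mean_kernel(xs, ys):
        total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
        return total / (len(xs) * len(ys))

    return (
        mean_kernel(source, source)
        + mean_kernel(target, target)
        - 2.0 * mean_kernel(source, target)
    )


# Toy 1-D features: identical batches give zero, shifted batches a positive value.
src = [[0.0], [1.0]]
tgt = [[5.0], [6.0]]
print(mmd_squared(src, src), mmd_squared(src, tgt))
```

When used as a training loss, this quantity would be computed on the activations of the adapted layer for a source batch and a target batch.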

3.6.3 Smoothing

In unsupervised domain adaptation for classification tasks, the unlabeled target data is often used to find a parameter configuration for which the classifier's decision boundary in the target domain passes through low-density regions of the input feature space. This is achieved through entropy minimization, which maximizes the classifier’s confidence on the target domain inputs. Although we do not have a corresponding approach for minimizing the uncertainty of regression outputs, we leverage Graph Laplacian-based smoothing, which enforces that if inputs $x_{i}$ and $x_{j}$ are close to each other in the input space, then the corresponding outputs $f(x_{i})$ and $f(x_{j})$ should also be similar. This smoothness is computed as:

$$f^{T}Lf=\frac{1}{2}\sum_{i,j}w_{ij}\left(f(x_{i})-f(x_{j})\right)^{2}\tag{2}$$

where

$$w_{ij}=\exp\left(\frac{-\|x_{i}-x_{j}\|^{2}}{2\sigma^{2}}\right)\tag{3}$$
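The smoothness term can be computed directly from its pairwise form, as in the following sketch. The function name and the toy inputs are assumptions made for illustration; the sketch uses the pairwise sum rather than building the Laplacian matrix explicitly.

```python
import math


def smoothness(features, outputs, sigma=1.0):
    """Graph Laplacian smoothness: 0.5 * sum_ij w_ij * (f_i - f_j)^2.

    features -- list of feature vectors x_i (lists of floats)
    outputs  -- list of scalar predictions f(x_i)
    sigma    -- bandwidth of the Gaussian weights w_ij
    """
    total = 0.0
    for xi, fi in zip(features, outputs):
        for xj, fj in zip(features, outputs):
            sq_dist = sum((a - b) ** 2 for a, b in zip(xi, xj))
            w = math.exp(-sq_dist / (2.0 * sigma ** 2))
            total += w * (fi - fj) ** 2
    return 0.5 * total


# Two identical inputs with different predictions are penalized with weight 1.
print(smoothness([[0.0], [0.0]], [1.0, 3.0]))
```

Pairs that are far apart in feature space receive weights close to zero, so only disagreements between nearby inputs contribute noticeably to the penalty.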

3.6.4 Final Objective Function

The final objective function that we use to optimize our model combines the L1 loss on the predictions for the labelled source domain data with the objective term associated with the domain adaptation measure (MMD or adversarial). When the input is pairwise, the L1 loss is computed on the age difference predicted for the labelled data. We also include the ranking loss on the ranks output by the model for the source domain examples, the identity and inversion constraint loss, and the smoothing loss. The objective can be written as follows:

$$L=L_{reg}(f_{1}(g(x_{s})),y_{s})+L_{adv}(g(x_{s}),g(x_{t}))\tag{4}$$

or in case of pairwise inputs :

$$\begin{aligned}
L={}&L_{reg}(f_{1}(g(x_{s,1},x_{s,2})),\,y_{s,1}-y_{s,2})\\
&+\alpha L_{rank}(f_{2}(g(x_{s,1},x_{s,2})),\,r(y_{s,1}-y_{s,2}))\\
&+\gamma L_{adv}(g(x_{s,1},x_{s,2}),\,g(x_{t,1},x_{t,2}))\\
&+\beta L_{id}(f_{1}(g(x_{s,1},x_{s,2})),\,f_{1}(g(x_{t,1},x_{t,2})))\\
&+\sigma L_{smooth}(g(x_{s}),g(x_{t}),f_{1}(g(x_{s})),f_{1}(g(x_{t})))
\end{aligned}\tag{5}$$

where $r$ is the ranking function, $g$ is the feature extractor, $f_{1}$ is the regression output function, $f_{2}$ is the ranking output function, and $\alpha$, $\gamma$, $\beta$ and $\sigma$ are the weights of the respective loss terms.

Table 1: Loss (MAE) for various domain combinations using the baseline model

Source	Target	Source Val Loss	Target Loss
Caucasian	African	7.232	7.946
Caucasian	Asian	8.122	7.677

4 Experiments

4.1 Dataset

In this work, we run experiments on the UTKFace [4] dataset, a large-scale face dataset with a long age span (0 to 116 years). The dataset consists of over 20,000 face images with annotations of age, gender, and ethnicity. The images cover large variations in pose, facial expression, illumination, occlusion, resolution, ethnicity, etc. We use the cropped and aligned variant of the dataset. Figure [] shows examples of processed face images of different ethnicities from the dataset. The images are resized to 200x200, and we also employ random horizontal flips for data augmentation. However, as seen in Table 2, there is a large bias even in this dataset in terms of the distribution of the images among the 5 ethnicities.

Table 2: Distribution of Images among Ethnicities

Ethnicity	Number of Images
Caucasian	10,078
African	4,526
Asian	3,434
Indian	3,975
Others	1,692

4.2 Implementation Details

All of the code is written in PyTorch. We find that a learning rate of 1e-3 gives the best convergence and use it for all our models. All models are trained with mini-batch stochastic optimization on a single GPU with a batch size of 16. We use Adam [11] for our regression network and SGD for the adversarial network.

4.3 Pairwise Models

This section describes our experiments where inputs to the model are pairs of images of faces of the same ethnicity.

4.3.1 Rank and Without Rank

We test whether the rank helps drive the regression outputs of the model to better values. We find that augmenting the original L1 loss on the predicted age difference with a weighted binary cross-entropy loss on the predicted rank indeed improves the performance of the model on the regression task. While the original model trained only on the L1 loss achieves a minimum L1 loss of 10.15 on the training set and 12.67 on the validation set, the respective values for the ranking-augmented regression approach are 6.23 and 7.9. This shows that adding the ranking objective improves the accuracy of the regression output, which shares weights with the ranking output up to the bottleneck layer.

4.4 Adaptive Models

Table 3: Loss (MAE) for adaptation of different layers of the baseline model, Caucasian to African

Layers	Source Val Loss	Target Loss
No Adaptation	6.326	6.986
conv+fc123	6.384	6.876
conv+fc1	7.056	8.150
fc1	7.144	7.728
fc123	7.356	7.876

4.4.1 Layers

We experiment with single- and multiple-layer adaptation for both of our adaptive measures, DANN and MMD. We consider both the last convolutional layer of our feature extractor and the fully connected layers of our regression module. We find that adapting the convolutional layer and the first fully connected layer gives better results than adapting a single layer or other combinations of multiple layers. This experiment is performed on the standard (non-pairwise) age prediction model. Table 3 summarizes the results.

Table 4: Loss (MAE) for different values of the DANN hyperparameter $\gamma$ on the pre-trained model, Caucasian to African

$\gamma$ value	Source Val Loss	Target Loss
0.1	6.582	7.787
1	7.281	8.256

4.4.2 Pre-Trained Regression model

Since the adaptive models were giving poor prediction results, in this experiment we first train a baseline (source-only) model until it performs well on the source data. We then take this pre-trained model and plug in adaptation using an adversarial network. We try values of $0.1$ and $1$ for $\gamma$ , the hyperparameter weighting the adversarial loss; Table 4 and the accompanying figure summarize the results. For $\gamma=1$ , we find that as the model tries to make the features domain invariant, it loses its regressive power and thus performs worse than even the baseline model on both the source and target domains. For $\gamma=0.1$ , we conclude that the signal is not strong enough to give any significant boost in overall performance.

4.4.3 DANN vs MMD

We experiment with the two domain adaptation measures discussed in detail in Section 3.6. According to our results, the adversarial loss performs better than MMD: the target loss for DANN is lower than that for MMD after training the network for adaptation (Table 5).

Table 5: Loss (MAE) for DANN and MMD (fc1) on the pre-trained model, Caucasian to African

Method	Source Val Loss	Target Loss
DANN	7.144	7.728
MMD	8.136	8.393

Table 6: Loss (MAE) for different values of the DANN hyperparameter $\gamma$ on the baseline model, Caucasian to African

$\gamma$ value	Source Val Loss	Target Loss
0.1	6.812	7.755
0.3	7.086	7.591
0.6	6.896	7.749
1	7.232	7.946

4.4.4 Adversarial Hyper-Parameter

To control the effect of the adversarial domain adaptation loss, we try different values of the hyperparameter $\gamma$ ; Table 6 summarizes the results. We adapt all the layers in this scenario and use the standard age prediction model for adaptation. As with the pre-trained model, we find that for larger values of $\gamma$ the gradient that pushes the features towards domain invariance dominates, and the prediction capability of the model is compromised. For smaller values, however, it is too weak to show any improvement on the target set.

4.4.5 MDS

We use two labelled images from the target domain as references to recover the absolute ages of the remaining images after mapping the pairwise distances to a single component using Metric MDS. The embedding preserves only the relative age differences between images; using the known absolute ages of the two reference images, we infer the absolute ages of the other images. We report the loss values for MDS in two settings: with and without the ranking loss.
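One way to carry out the anchoring step is to fit the affine map that sends the two reference coordinates to their known ages and apply it to every embedded point. The sketch below illustrates this under stated assumptions: the function name, the argument layout and the toy values are hypothetical, and the affine fit (which also absorbs the reflection and shift ambiguity of a 1-D embedding) is only one possible realization of the procedure.

```python
def calibrate_ages(embedding, ref_indices, ref_ages):
    """Map 1-D MDS coordinates to absolute ages using two labelled references.

    embedding   -- list of scalar MDS coordinates, one per image
    ref_indices -- (i, j): positions of the two reference images in `embedding`
    ref_ages    -- (age_i, age_j): known ages of the two reference images
    """
    i, j = ref_indices
    z_i, z_j = embedding[i], embedding[j]
    if z_i == z_j:
        raise ValueError("reference images must map to distinct coordinates")

    # Affine map age = scale * z + offset passing through both references.
    scale = (ref_ages[1] - ref_ages[0]) / (z_j - z_i)
    offset = ref_ages[0] - scale * z_i
    return [scale * z + offset for z in embedding]


# Toy embedding that is a reflected and shifted copy of the ages 22, 20, 17, 27.
embedding = [-2.0, 0.0, 3.0, -7.0]
print(calibrate_ages(embedding, ref_indices=(0, 1), ref_ages=(22.0, 20.0)))
```

Two references are the minimum needed: a single reference would fix the shift of the embedding but not its orientation, so older and younger could not be told apart.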

Table 7: Loss for MDS

Rank/Regression	Target Loss
Rank + Regression	17.87
Regression	18.67

5 Conclusion and Future Work

We tackle the problem of age estimation from face images and try to reduce the racial bias in the predictor using domain adaptation techniques. We train a deep convolutional regression model for age estimation and then use domain adaptation techniques (adversarial and MMD) to improve its generalization performance. We model age estimation in two ways: as a standard regression problem, and as a pairwise problem in which the model is trained to predict the age difference between a pair of face images and Multi-Dimensional Scaling is then used to recover the absolute age values. We find that even though the features may be more adaptable in a model that captures the difference in age between two face images, its prediction capability is lower than that of the standard regression model. We think this may be due to the greater difficulty of predicting the age difference compared to predicting the absolute age, since there can be more variation within a pair of faces, and to the additional MDS step. In terms of adaptation, adversarial techniques tend to perform better than MMD-based ones, although the difference is small. We also find that, as we adapt features from multiple layers, the features become more domain invariant but the regression capability is compromised. Overall, we find that domain adaptation across races for age estimation is a challenging task. After adaptation we see some improvement in performance, e.g. the target MAE of our baseline with conv+fc1 adaptation decreases by 0.11 compared to the source-only model. However, looking across all our experiments, we do not see any major improvement in performance on the target domain. We observe that when the adaptation feedback is strong, the model loses its general prediction capability and performance drops in both the source and target domains.
Also, if the signal is too weak, there is not much improvement from adaptation. Further hyperparameter tuning may boost performance. Performance is also limited by the dataset we currently use, which may not be large enough for the problem at hand. Few datasets with ethnicity labels are available. In future work, we plan to create an ethnicity-labeled dataset from an existing face image dataset by first training an ethnicity classifier on the given dataset and then using it to infer the ethnicity labels. Domain mapping could also be applied here, using generative modelling to synthesize images of one ethnicity from another such that the attributes that are indicative of age are preserved. Another direction for future improvement is the use of multi-modal inputs such as sound along with images, so that the hoarseness and timbre of the voice can be used as features for estimating the age of the person.

References

[1] Kihyuk Sohn, Wenling Shang, Xiang Yu and Manmohan Chandraker, Unsupervised Domain Adaptation for Distance Metric Learning, International Conference on Learning Representations, 2019
[2] D. Sculley, Combined Regression and Ranking
[3] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I. Jordan, Conditional Adversarial Domain Adaptation
[4] Zhang, Zhifei, Song, Yang, and Qi, Hairong, Age Progression/Regression by Conditional Adversarial Autoencoder, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
[5] R. Rothe, R. Timofte and L. V. Gool, DEX: Deep EXpectation of Apparent Age from a Single Image, 2015 IEEE International Conference on Computer Vision Workshop (ICCVW), Santiago, 2015
[6] Ankita Singh and Shayok Chakraborty, Deep Domain Adaptation for Regression, Development and Analysis of Deep Learning Architectures
[7] M. Merler, N. Ratha, R. S. Feris, and J. R. Smith, Diversity in faces. arXiv preprint arXiv:1901.10436, 2019.
[8] C. Ng, Y. Cheng, G. Hsu and M. H. Yap, Multi-layer age regression for face age estimation, 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA), Nagoya, 2017, pp. 294-297.
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun, Deep Residual Learning for Image Recognition, 2015
[10] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott E. Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich, Going Deeper with Convolutions, 2014
[11] Diederik P. Kingma, Jimmy Ba, Adam: A Method for Stochastic Optimization
[12] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, Victor Lempitsky, Domain-Adversarial Training of Neural Networks
[13] C. Ng, Y. Cheng, G. Hsu and M. H. Yap, Multi-layer age regression for face age estimation, 2017 Fifteenth IAPR International Conference on Machine Vision Applications (MVA), Nagoya, 2017, pp. 294-297.
[14] Karen Simonyan, Andrew Zisserman, Very Deep Convolutional Networks for Large-Scale Image Recognition
[15] Tsun-Yi Yang, Yi-Hsuan Huang, Yen-Yu Lin, Pi-Cheng Hsiu, Yung-Yu Chuang, SSR-Net: A Compact Soft Stagewise Regression Network for Age Estimation