COMBOLOSS FOR FACIAL ATTRACTIVENESS ANALYSIS WITH SQUEEZE-AND-EXCITATION NETWORKS
Abstract
The loss function is crucial for model training and feature representation learning. Conventional models usually treat facial attractiveness recognition as a regression problem and adopt the MSE loss or a Huber-variant loss as supervision to train a deep convolutional neural network (CNN) to predict a facial attractiveness score. Little work has been done to systematically compare the performance of diverse loss functions. In this paper, we first systematically analyze model performance under diverse loss functions. We then propose a novel loss function named ComboLoss to guide an SEResNeXt50 network. The proposed method achieves state-of-the-art performance on the SCUT-FBP, HotOrNot and SCUT-FBP5500 datasets, with improvements of 1.13%, 2.1% and 0.57% over prior arts, respectively. Code and models are available at https://github.com/lucasxlu/ComboLoss.git.
Index Terms— Deep learning, facial beauty prediction (FBP), face analysis
1 Introduction and Related Work
With the popularity of short video platforms (such as TikTok, https://www.tiktok.com) and social networking apps (like Facebook, Instagram and WeChat) among the public, facial attractiveness analysis has gained increased attention. Previous works [1, 2, 3] demonstrate that computational models can be adopted to automatically learn facial attractiveness. Recent years have witnessed many achievements in related areas [4, 5, 6, 7, 8]. However, due to diverse head poses, expressions, ages, low resolution and illumination problems, it is still quite challenging to develop an accurate model to predict facial attractiveness levels.
Facial attractiveness analysis has been researched for decades with many achievements [1, 2, 9, 10, 3, 6, 11, 7]. Traditional methods [1, 4] usually extract hand-crafted features to form a facial representation and train a regressor or a classifier, whose output is regarded as the facial attractiveness level. Gray et al. [3] introduce a shallow neural network to learn facial beauty on their proposed HotOrNot dataset [3] without utilizing facial landmarks. Since the breakthrough of AlexNet [12], researchers have paid more attention to developing more advanced CNN architectures to improve facial beauty prediction (FBP) accuracy. Liang et al. [10] introduce the SCUT-FBP5500 dataset [10] with 5500 portrait images, which facilitates further research [7, 8, 13] in related fields. Xu et al. [6] optimize a classification branch and a regression branch in parallel with their proposed CRNet [6] and report very promising results on the SCUT-FBP [4] and HotOrNot [3] datasets, but their exploration of the classification branch is limited. Lin et al. [8] propose AaNet, which takes beauty-related facial attributes as additional inputs to enhance model performance. Despite the promising performance, the subnetworks of auxiliary tasks induce additional network parameters, which makes the facial attractiveness analysis model quite heavy. In contrast to existing works [7, 6, 8], we do not add any heavy extra subnetworks to the backbone model, resulting in a lower risk of overfitting. By simply making better use of the network's output, we append a classification loss (namely a weighted cross-entropy loss), an expectation loss and a regression loss to better supervise model training. We term our approach ComboLoss; experimental results indicate that ComboLoss achieves state-of-the-art performance on 3 datasets without bells and whistles.
The main contributions of this paper are as follows: (1) We systematically compare different loss functions for guiding a deep CNN to learn facial attractiveness. (2) We present a simple yet effective approach named ComboLoss that better leverages the output of the CNN and enhances performance, while incurring a negligible (less than 0.037%) increase in parameters. (3) The proposed methods achieve state-of-the-art performance on the SCUT-FBP [4], HotOrNot [3] and SCUT-FBP5500 [10] datasets, surpassing prior arts by 1.13%, 2.1% and 0.57%, respectively.
2 Proposed Methods
2.1 ComboLoss
In this section, we give a detailed discussion of our proposed ComboLoss. The target of deep CNN-based facial attractiveness analysis is to find a nonlinear mapping $f: \mathcal{X} \rightarrow \mathcal{Y}$, which means a deep CNN maps an input facial image $x$ to an output beauty score $\hat{y} = f(x)$. Given a set of labelled groundtruth training samples $\{(x_i, y_i)\}_{i=1}^{N}$, the target of model training is to find an optimal $f^{*}$ which minimizes:

$$f^{*} = \arg\min_{f} \frac{1}{N} \sum_{i=1}^{N} \ell\big(f(x_i), y_i\big) \quad (1)$$

where $\ell(\cdot, \cdot)$ is a pre-defined loss function which measures the difference between a predicted beauty score and its groundtruth beauty score. In this setting, the deep CNN is regarded as a regression model.
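For illustration, a minimal PyTorch sketch of one optimization step under this formulation might look as follows; the toy backbone and tensor shapes are placeholder assumptions, not the paper's actual model.

```python
import torch
import torch.nn as nn

# Placeholder f: any CNN mapping an image to a scalar score would do here.
f = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 1))
criterion = nn.MSELoss()  # the conventional supervision discussed above
optimizer = torch.optim.SGD(f.parameters(), lr=0.01, momentum=0.9)

x = torch.randn(8, 3, 224, 224)   # dummy mini-batch of facial images
y = torch.rand(8, 1) * 4 + 1      # groundtruth beauty scores in [1, 5]

loss = criterion(f(x), y)         # empirical risk of Eq. (1)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```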
The design of a proper loss function is crucial for deep CNN-based facial beauty prediction. However, the majority of deep CNN-based FBP systems [4, 10, 11] utilize MSE as supervision. Since the performance of solely leveraging the beauty score as supervision to train a regression network is limited [6, 7, 13], researchers have turned to multi-task learning [7], ranking [13], multi-stream networks [6], label distribution learning [14] and auxiliary networks [8] to enhance performance. However, these methods bring too many parameters, resulting in quite heavy models. In this paper, we propose a new loss function, termed ComboLoss, which further improves the performance of deep CNN-based FBP systems. The components of ComboLoss are a regression loss, an expectation loss and a classification loss.
• Regression Loss. In contrast to mainstream FBP models with MSE loss, we adopt the $\ell_1$ loss as the regression loss instead. The regression loss is defined as:

$$\mathcal{L}_{reg} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{y}_i - y_i \right| \quad (2)$$

where $\hat{y}_i$, $y_i$ and $N$ represent the output score of the regression module, the groundtruth score and the sample capacity, respectively.
• Classification Loss. In addition to the regression module, we also insert a classification module before the last layer of the SEResNeXt50 network. To enhance loss and gradient flow, we adopt a weighted cross entropy as the classification loss, which is defined in Equation 3. The choice of the number of output neurons and the weights $w_j$ is discussed in Section 2.2. We introduce the classification loss into the regression network to learn additional information that better separates attractive faces from unattractive samples.

$$\mathcal{L}_{cls} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{C} w_j \, y_{ij} \log p_{ij} \quad (3)$$

where $p_{ij}$, $y_{ij}$ and $w_j$ denote the predicted probability output by the softmax layer, the correct-classification indicator and the weight of each category, respectively.
• Expectation Loss. Although the classification loss mentioned above can better separate different facial attractiveness levels, it assumes that categories are independent of each other, which is not suitable here (e.g., a face with a beauty level of 4 is undoubtedly more attractive than a face with a beauty level of 1). To further enhance model training, we present a new loss function that leverages the softmax probabilities: a regression manner is applied to minimize the gap between the groundtruth beauty score and the expectation score. We define the expectation as:

$$\hat{e}_i = \sum_{j=1}^{C} s_j \, p_{ij} \quad (4)$$

where $s_j$ denotes the beauty score represented by category $j$. The expectation loss measures the difference between the groundtruth score and the expectation score with the $\ell_1$ loss:

$$\mathcal{L}_{exp} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{e}_i - y_i \right| \quad (5)$$
Having defined the individual items, the ComboLoss is denoted as $\mathcal{L}_{Combo} = \mathcal{L}_{reg} + \alpha \mathcal{L}_{cls} + \beta \mathcal{L}_{exp}$; we set $\alpha = \beta = 1$ in our experiments.
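To make the combination concrete, below is a minimal PyTorch sketch of ComboLoss under the definitions above. The argument names, the `class_scores` vector and the default equal weighting are assumptions of this sketch rather than the released implementation (see the repository linked in the abstract for the authors' code).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ComboLoss(nn.Module):
    """Sketch: L1 regression loss + weighted cross entropy + expectation loss."""

    def __init__(self, class_weights, class_scores, alpha=1.0, beta=1.0):
        super().__init__()
        # class_scores holds the beauty score each category stands for, e.g. [1..5]
        self.register_buffer("w", torch.as_tensor(class_weights, dtype=torch.float))
        self.register_buffer("s", torch.as_tensor(class_scores, dtype=torch.float))
        self.alpha, self.beta = alpha, beta

    def forward(self, reg_out, cls_logits, gt_score, gt_label):
        l_reg = F.l1_loss(reg_out.squeeze(-1), gt_score)               # Eq. (2)
        l_cls = F.cross_entropy(cls_logits, gt_label, weight=self.w)   # Eq. (3)
        expectation = (F.softmax(cls_logits, dim=1) * self.s).sum(1)   # Eq. (4)
        l_exp = F.l1_loss(expectation, gt_score)                       # Eq. (5)
        return l_reg + self.alpha * l_cls + self.beta * l_exp

# Hypothetical usage with 5 beauty levels:
criterion = ComboLoss(class_weights=[1.2, 0.4, 0.8, 1.5, 3.0],
                      class_scores=[1, 2, 3, 4, 5])
```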
2.2 Attractiveness Score Discretization
Since we insert a classification module into a regression network, the groundtruth for classification should be defined for model training. The groundtruth beauty scores in the relevant benchmark datasets [4, 10, 3] are continuous values, which cannot be directly utilized to train a classification task. Therefore, attractiveness score discretization should be conducted first. How to discretize continuous scores, and into how many score ranges, remains an open problem. In this paper, we take the simplest fashion for easy implementation. Formally, in the SCUT-FBP [4] and SCUT-FBP5500 [10] datasets, the rounded integer of the groundtruth score is regarded as the classification label; in HotOrNot [3], the number of intervals is set to 3 as in [6]. More advanced discretization methods may also be adopted, which is left to our future work.
It is worth noting that although we take a similar discretization method to CRNet [6], the imbalance problem after applying discretization is neglected in [6]. The capacity of category 0 is 3 times bigger than that of category 1, which is harmful for model training. In this paper, we address the imbalanced learning problem at the loss level. Namely, the hyper-parameter $w_j$ in Equation 3 is introduced to balance the loss and gradient of each category. The hyper-parameter of category $j$ is determined as:
$$w_j = \frac{N}{C \cdot N_j} \quad (6)$$

where $N_j$ denotes the number of training samples in category $j$ and $C$ the number of categories.
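A sketch of the discretization and weighting step, assuming the inverse-frequency form of Equation 6 and 5 beauty levels:

```python
import numpy as np

def discretize_and_weight(scores, num_classes=5):
    """Round continuous beauty scores to integer category labels and
    derive per-category weights w_j via inverse class frequency (Eq. 6)."""
    labels = np.clip(np.rint(scores).astype(int), 1, num_classes) - 1  # 0-based
    counts = np.bincount(labels, minlength=num_classes).astype(float)
    counts[counts == 0] = 1.0                        # guard against empty categories
    weights = len(scores) / (num_classes * counts)   # rarer categories weigh more
    return labels, weights

labels, weights = discretize_and_weight(np.array([1.8, 2.4, 2.6, 4.9, 2.2]))
```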
2.3 Network Architecture
Inspired by [15], we incorporate the squeeze-and-excitation module [15] into ResNeXt50 [16] to form SEResNeXt50 as the backbone network for feature representation learning. Other architectures (such as ResNet [17] and EfficientNet [18]) may also be utilized, but that is beyond the scope of this paper.
The squeeze operation embeds global spatial information into a channel descriptor by using global average pooling to generate channel-wise statistics. Namely, by globally average pooling the feature map, we get:

$$z_c = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} u_c(i, j) \quad (7)$$

where $H$ and $W$ represent the height and width of the feature map, and $u_c(i, j)$ denotes the pixel value at position $(i, j)$ in channel $c$. After obtaining the statistics $z$, an excitation operation is applied, which is achieved by the following equation:

$$s = \sigma\big(W_2 \, \delta(W_1 z)\big) \quad (8)$$

where $\sigma$ denotes the sigmoid activation, $\delta$ denotes the ReLU activation, $W_1$ is a dimensionality-reduction fully connected (FC) layer with reduction ratio $r$, and $W_2$ is a dimensionality-increasing FC layer with increasing ratio $r$. The output of the squeeze-and-excitation block is calculated by rescaling the transformation output with the activations:

$$\tilde{x}_c = s_c \cdot u_c \quad (9)$$

where $\tilde{x}_c$ is the recalibrated output, $u_c$ represents the feature map of channel $c$, and $s_c$ represents the scalar sigmoid activation output (please refer to [15] for more details). In this paper, we incorporate the squeeze-and-excitation module [15] into ResNeXt50 [16], resulting in the SEResNeXt50 architecture. The network architecture is shown in Fig. 1.
[Fig. 1: The SEResNeXt50 network architecture.]
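A compact PyTorch sketch of the squeeze-and-excitation block described by Equations 7-9, following the original formulation of [15] (the reduction ratio of 16 is the default suggested there):

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation block, Eqs. (7)-(9)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)           # Eq. (7): global average pooling
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # W1: dimensionality reduction
            nn.ReLU(inplace=True),                       # delta
            nn.Linear(channels // reduction, channels),  # W2: dimensionality increase
            nn.Sigmoid(),                                # sigma, Eq. (8)
        )

    def forward(self, u):
        b, c, _, _ = u.shape
        z = self.squeeze(u).view(b, c)       # channel statistics z
        s = self.excite(z).view(b, c, 1, 1)  # channel attention s
        return u * s                         # Eq. (9): rescale the feature map
```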
3 Experiments
3.1 Datasets and Evaluation Settings
We use 3 popular datasets (SCUT-FBP [4], HotOrNot [3] and SCUT-FBP5500 [10]) to evaluate the effectiveness of our proposed methods. SCUT-FBP [4] contains 500 female images with beauty scores ranging from 1 to 5. The HotOrNot dataset [3] contains 2056 facial images collected from the Internet, with varied poses, cluttered backgrounds, overexposure and low resolution. SCUT-FBP5500 [10] contains 5500 facial images with beauty scores in $[1, 5]$. Each image is annotated by 60 volunteers, and the average is used as the groundtruth to remove personal preference bias. All experiments are conducted with 5-fold cross validation, and the average is reported for comparison with other models.
As in related works [4, 3, 10], we adopt the mean absolute error (MAE), root mean squared error (RMSE) and Pearson correlation (PC) to evaluate the performance of different models on SCUT-FBP5500 [10], while PC is used as the metric on the SCUT-FBP [4] and HotOrNot [3] datasets. A computational model with lower MAE, lower RMSE and higher PC performs better.
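For reference, the three metrics can be computed as in the following sketch (the function name is illustrative):

```python
import numpy as np
from scipy.stats import pearsonr

def evaluate(pred, gt):
    """MAE, RMSE and Pearson correlation between predicted and groundtruth scores."""
    pred, gt = np.asarray(pred, dtype=float), np.asarray(gt, dtype=float)
    mae = np.abs(pred - gt).mean()
    rmse = np.sqrt(((pred - gt) ** 2).mean())
    pc, _ = pearsonr(pred, gt)
    return mae, rmse, pc
```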
3.2 Implementation Details
We conduct experiments with PyTorch [19] on two NVIDIA K80 GPUs with cuDNN acceleration. The learning rate starts from 0.01 and is divided by 10 every 50 epochs. Weight decay and batch size are set to 0.001 and 64, respectively. The model is trained via SGD with 0.9 momentum for 200 epochs. The images are resized and randomly cropped to $224 \times 224$ patches; color jittering and random rotation [19] are applied for data augmentation. The network is initialized with ImageNet-pretrained weights to avoid overfitting due to the limited capacity of current public research datasets. We do not adopt additional facial datasets for pretraining.
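The training recipe above roughly corresponds to the following sketch. The exact jitter and rotation magnitudes are not specified in the text and are placeholder assumptions here, and torchvision's ResNeXt50 stands in for SEResNeXt50, which is not a stock torchvision model:

```python
import torch
from torchvision import models, transforms

train_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.RandomCrop(224),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
])

model = models.resnext50_32x4d(pretrained=True)  # ImageNet-pretrained initialization
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=0.001)
# Divide the learning rate by 10 every 50 epochs, for 200 epochs in total.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.1)
```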
3.3 Performance Evaluation
The performance comparisons with other methods on the SCUT-FBP [4], HotOrNot [3] and SCUT-FBP5500 [10] datasets are shown in Table 1, Table 2 and Table 3, respectively. The proposed ComboLoss achieves state-of-the-art performance, outperforming prior arts on SCUT-FBP [4], HotOrNot [3] and SCUT-FBP5500 [10] by 1.13%, 2.1% and 0.57% in PC, respectively. We show 6 precisely predicted samples and 6 inaccurately predicted instances on SCUT-FBP5500 [10] in Fig. 2.
Table 1: Performance comparison on the SCUT-FBP dataset [4].

Model | PC |
---|---|
Combined Features+Gaussian Regression [4] | 0.6482 |
CNN-based [4] | 0.8187 |
Liu et al. [20] | 0.6938 |
Xu et al. [5] | 0.8570 |
KFME [21] | 0.7988 |
RegionScatNet [14] | 0.83 |
PI-CNN [11] | 0.87 |
CRNet [6] | 0.8723 |
HMTNet + Ridge Regression [7] | 0.8977 |
ComboLoss (Ours) | 0.9090 |
Table 2: Performance comparison on the HotOrNot dataset [3].

Model | PC |
---|---|
Eigenface [3] | 0.180 |
Two Layer Model [3] | 0.438 |
Multiscale Model [3] | 0.458 |
S. Wang et al. [22] | 0.437 |
Xu et al. [5] | 0.468 |
CRNet [6] | 0.482 |
ComboLoss (Ours) | 0.503 |
Table 3: Performance comparison on the SCUT-FBP5500 dataset [10].

Model | MAE | RMSE | PC |
---|---|---|---|
ResNeXt-50 [16] | 0.2291 | 0.3017 | 0.8997 |
ResNet-18 [17] | 0.2419 | 0.3166 | 0.8900 |
AlexNet [12] | 0.2651 | 0.3481 | 0.8634 |
HMTNet [7] | 0.2380 | 0.3141 | 0.8912 |
AaNet [8] | 0.2236 | 0.2954 | 0.9055 |
ResNeXt [23] | 0.2416 | 0.3046 | 0.8957 |
CNN [13] | 0.2120 | 0.2800 | 0.9142 |
ComboLoss (Ours) | 0.2050 | 0.2704 | 0.9199 |
[Fig. 2: Precisely and inaccurately predicted samples on SCUT-FBP5500.]
3.4 Ablation Study
To validate the effectiveness of each component of our proposed methods, we perform additional ablation experiments on the SCUT-FBP5500 dataset [10]. Unlike the experimental settings used in the previous performance comparison section, we adopt the data splitting strategy defined in the SCUT-FBP5500 release [10]: 60% of the images are used as the training set and the remaining images as the test set. We use this 60%/40% split [10] in the ablation experiments to avoid the heavy computation of 5-fold cross validation.
3.4.1 Effects on ComboLoss
We conduct experiments under the supervision of different loss functions, namely the MSE loss [4, 10], the $\ell_1$ loss, the smooth $\ell_1$ loss and the Smooth Huber loss [7] (see Table 4). ComboLoss achieves the best performance, surpassing the MSE loss in MAE, RMSE and PC by 0.69%, 1.34% and 1.09%, respectively.
Table 4: Comparison of loss functions on SCUT-FBP5500 [10] (60%/40% split).

Loss Function | MAE | RMSE | PC |
---|---|---|---|
$\ell_1$ Loss | 0.2191 | 0.2918 | 0.9030 |
MSE Loss | 0.2195 | 0.2947 | 0.9008 |
Smooth $\ell_1$ Loss | 0.2194 | 0.2869 | 0.9064 |
Smooth Huber Loss [7] | 0.2196 | 0.2903 | 0.9052 |
ComboLoss (Ours) | 0.2126 | 0.2813 | 0.9117 |
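The first three baselines in Table 4 map directly onto stock PyTorch functionals, as the sketch below shows; the Smooth Huber loss of [7] has no stock equivalent and is omitted here:

```python
import torch
import torch.nn.functional as F

pred, gt = torch.randn(16), torch.randn(16)  # dummy predictions and groundtruth

losses = {
    "l1": F.l1_loss(pred, gt),
    "mse": F.mse_loss(pred, gt),
    "smooth_l1": F.smooth_l1_loss(pred, gt),  # equals Huber loss with delta = 1
}
```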
3.4.2 Effects of Different Network Backbone
The backbone network plays a vital part in feature representation learning and performance [12, 17, 16, 15, 18]. We replace the SEResNeXt50 [15] with a simple ResNet18 [17] to validate the effects of the backbone architecture and the supervision of ComboLoss. Table 5 shows clearly that a stronger backbone architecture leads to better performance (0.9041 vs. 0.9117 and 0.8946 vs. 0.9008). However, the superior performance of our proposed method does not solely come from the stronger backbone: supervised by the vanilla MSE loss, the stronger SEResNeXt50 [15] only achieves a PC of 0.9008, while a simple ResNet18 [17] supervised by ComboLoss achieves a PC of 0.9041. Moreover, the SEResNeXt50 trained with ComboLoss gains 1.09% in PC over its counterpart trained with the MSE loss, which further demonstrates the effectiveness of the proposed method.
Table 5: Effects of backbone architecture and ComboLoss on SCUT-FBP5500 [10] (60%/40% split).

Model | MAE | RMSE | PC |
---|---|---|---|
ResNet18 | 0.2313 | 0.3054 | 0.8946 |
ResNet18 + ComboLoss | 0.2202 | 0.2907 | 0.9041 |
SEResNeXt50 | 0.2195 | 0.2947 | 0.9008 |
SEResNeXt50 + ComboLoss | 0.2126 | 0.2813 | 0.9117 |
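Swapping the backbone while keeping separate regression and classification heads can be sketched as follows; the head layout is an assumption of this sketch, not the released code:

```python
import torch.nn as nn
from torchvision import models

def build_model(num_classes=5):
    """ResNet18 backbone with parallel regression and classification heads."""
    net = models.resnet18(pretrained=True)
    in_feats = net.fc.in_features
    net.fc = nn.Identity()                        # expose the pooled features
    reg_head = nn.Linear(in_feats, 1)             # predicts the beauty score
    cls_head = nn.Linear(in_feats, num_classes)   # predicts the discretized level
    return net, reg_head, cls_head
```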
4 Conclusion and Future Work
In this paper, we first perform a systematic analysis of diverse loss functions for facial attractiveness score regression. Then a simple yet effective approach named ComboLoss is presented to enhance model training. The proposed method achieves state-of-the-art performance on the SCUT-FBP [4], HotOrNot [3] and SCUT-FBP5500 [10] datasets. In future work, we will apply ComboLoss to other regression problems (such as age estimation) and explore more effective discretization approaches.
References
- [1] David I Perrett, Karen A May, and Sin Yoshikawa, “Facial shape and judgements of female attractiveness,” Nature, vol. 368, no. 6468, pp. 239, 1994.
- [2] Yael Eisenthal, Gideon Dror, and Eytan Ruppin, “Facial attractiveness: Beauty and the machine,” Neural Computation, vol. 18, no. 1, pp. 119–142, 2006.
- [3] Douglas Gray, Kai Yu, Wei Xu, and Yihong Gong, “Predicting facial beauty without landmarks,” in European Conference on Computer Vision. Springer, 2010, pp. 434–447.
- [4] Duorui Xie, Lingyu Liang, Lianwen Jin, Jie Xu, and Mengru Li, “Scut-fbp: A benchmark dataset for facial beauty perception,” in 2015 IEEE International Conference on Systems, Man, and Cybernetics. IEEE, 2015, pp. 1821–1826.
- [5] Lu Xu, Jinhai Xiang, and Xiaohui Yuan, “Transferring rich deep features for facial beauty prediction,” arXiv preprint arXiv:1803.07253, 2018.
- [6] Lu Xu, Jinhai Xiang, and Xiaohui Yuan, “Crnet: Classification and regression neural network for facial beauty prediction,” in Pacific Rim Conference on Multimedia. Springer, 2018, pp. 661–671.
- [7] Lu Xu, Heng Fan, and Jinhai Xiang, “Hierarchical multi-task network for race, gender and facial attractiveness recognition,” in 2019 IEEE International Conference on Image Processing (ICIP). IEEE, 2019, pp. 3861–3865.
- [8] Luojun Lin, Lingyu Liang, Lianwen Jin, and Weijie Chen, “Attribute-aware convolutional neural networks for facial beauty prediction,” in Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 2019, pp. 847–853.
- [9] Amit Kagian, Gideon Dror, Tommer Leyvand, Daniel Cohen-Or, and Eytan Ruppin, “A humanlike predictor of facial attractiveness,” in NIPS, 2007.
- [10] Lingyu Liang, Luojun Lin, Lianwen Jin, Duorui Xie, and Mengru Li, “Scut-fbp5500: A diverse benchmark dataset for multi-paradigm facial beauty prediction,” in 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 1598–1603.
- [11] Jie Xu, Lianwen Jin, Lingyu Liang, Ziyong Feng, Duorui Xie, and Huiyun Mao, “Facial attractiveness prediction using psychologically inspired convolutional neural network (pi-cnn),” in ICASSP, 2017.
- [12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton, “Imagenet classification with deep convolutional neural networks,” in NIPS, 2012.
- [13] Luojun Lin, Lingyu Liang, and Lianwen Jin, “Regression guided by relative ranking using convolutional neural network (r3cnn) for facial beauty prediction,” IEEE Transactions on Affective Computing, 2019.
- [14] Shu Liu, Bo Li, Yang-Yu Fan, Zhe Guo, and Ashok Samal, “Facial attractiveness computation by label distribution learning with deep cnn and geometric features,” in ICME, 2017.
- [15] Jie Hu, Li Shen, and Gang Sun, “Squeeze-and-excitation networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7132–7141.
- [16] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He, “Aggregated residual transformations for deep neural networks,” in CVPR, 2017.
- [17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, “Deep residual learning for image recognition,” in CVPR, 2016.
- [18] Mingxing Tan and Quoc Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International Conference on Machine Learning, 2019, pp. 6105–6114.
- [19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al., “Pytorch: An imperative style, high-performance deep learning library,” in Advances in Neural Information Processing Systems, 2019, pp. 8024–8035.
- [20] Shu Liu, Yang-Yu Fan, Zhe Guo, Ashok Samal, and Afan Ali, “A landmark-based data-driven approach on 2.5 d facial attractiveness computation,” Neurocomputing, vol. 238, pp. 168–178, 2017.
- [21] Anne Elorza Deias, “Face beauty analysis via manifold based semi-supervised learning,” 2017.
- [22] Shuyang Wang, Ming Shao, and Yun Fu, “Attractive or not? beauty prediction with attractiveness-aware encoders and robust late fusion,” in Proceedings of the 22nd ACM international conference on Multimedia, 2014, pp. 805–808.
- [23] Luojun Lin, Lingyu Liang, and Lianwen Jin, “R2-resnext: A resnext-based regression model with relative ranking for facial beauty prediction,” in 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 85–90.