\UseRawInputEncoding

Robust and Precise Facial Landmark Detection by Self-Calibrated Pose Attention Network

Jun Wan, Hui Xi, Jie Zhou, Zhihui lai, Witold Pedrycz, Xu Wang, Hang Sun This work is supported by the National Natural Science Foundation of China (Grant No. 62002233, 62076164, 61802267, 61976145 and 61806127), the Shenzhen Science and Technology Program (Grant No. JCYJ20210324094601005, JCYJ20210324094413037 and JCYJ20190813100801664), the Natural Science Foundation of Guangdong Province (Grant No. 2019A1515111121, 2021A1515011861) and the Natural Science Foundation of HuBei Province (Grant No. 2021CFB004). Corresponding author: Jie Zhou.J. Wan and Hui Xi are with the School of Information and Safety Engineering, Zhongnan University of Economics and Law, Wuhan, 430073, China and the College of Computer Science and Software Engineering, Shen zhen University, Shenzhen, 518060, China. (e-mail:[email protected], [email protected]).J. Zhou, Z. Lai and X. Wang are with the Computer Vision Institute, College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China, and also with the Shenzhen Institute of Artificial Intelligence and Robotics for Society, Shenzhen, China. (e-mail: [email protected], [email protected], [email protected]).W. Pedrycz is with the Department of Electrical

\&

Computer Engineering, University of Alberta, Edmonton, Canada, and the Systems Research Institute, Polish Academy of Sciences, Warsaw, Poland. (e-mail: [email protected].)H. Sun is with College of Computer and Information Technology, China Three Gorges University, Yichang, HuBei, China. (e-mail: [email protected].)

Abstract

Current fully-supervised facial landmark detection methods have progressed rapidly and achieved remarkable performance. However, they still suffer when coping with faces under large poses and heavy occlusions for inaccurate facial shape constraints and insufficient labeled training samples. In this paper, we propose a semi-supervised framework, i.e., a Self-Calibrated Pose Attention Network (SCPAN) to achieve more robust and precise facial landmark detection in challenging scenarios. To be specific, a Boundary-Aware Landmark Intensity (BALI) field is proposed to model more effective facial shape constraints by fusing boundary and landmark intensity field information. Moreover, a Self-Calibrated Pose Attention (SCPA) model is designed to provide a self-learned objective function that enforces intermediate supervision without label information by introducing a self-calibrated mechanism and a pose attention mask. We show that by integrating the BALI fields and SCPA model into a novel self-calibrated pose attention network, more facial prior knowledge can be learned and the detection accuracy and robustness of our method for faces with large poses and heavy occlusions have been improved. The experimental results obtained for challenging benchmark datasets demonstrate that our approach outperforms state-of-the-art methods in the literature.

Index Terms:

facial landmark detection, self-calibrated mechanism, shape constraints, heavy occlusions, heatmap regression.

I Introduction

Facial landmark detection, also known as face alignment, aims to locate the predefined landmarks (e.g., eye corners, nose tip, mouth corners) of a face, which has attracted much attention from the computer vision community. Precise and robust facial landmark detection lays the foundation for high-quality performance of many computer vision and computer graphics tasks, such as face recognition [1], face animation [2, 3], and face reenactment [4, 5]. Many face analysis tasks [6, 7, 8] rely on locations of detected facial landmarks, so imprecise landmarks will be propagated to the subsequent tasks and could lead to unsatisfactory face analysis results.

Refer to caption — Figure 1: The proposed Boundary-Aware Landmark Intensity (BALI) fields. By integrating the facial boundary and landmark intensity field information, both boundary and field constraints can be introduced to achieve more robust and precise facial landmark detection.

Most works [9, 10, 11, 12, 13] of facial landmark detection adopt the supervised learning approach and they usually map facial appearances to landmark heatmaps or coordinates, which have achieved great success. However, on the one hand, their performance depends on a large number of training samples with full landmark annotations that are usually tedious and time-consuming to annotate. For example, for 3000 images and 68 landmarks a face, 204000 landmarks need to be annotated. Moreover, the limitations of the human visual system also reduce the precision and consistency of landmarks. On the other hand, most facial landmark detection methods suffer from performance degradation when facing large poses and partial occlusions, as in this case, convolutional neural networks (CNNs) may be misled to learn inaccurate feature representation and shape constraints. Facial boundary heatmaps [14, 15] and part affinity fields [16, 17] are proposed to address these problems. However, the constraints of both boundary heatmaps and part affinity fields are very rough, and it is still hard to achieve high-precision landmark detection. Therefore, how to model more effective facial shape constraints for precise and robust landmark detection with unlabeled face images remains a challenging problem.

To address the above problems, in this paper, we propose a novel semi-supervised approach, i.e., Self-Calibrated Pose Attention Network (SCPAN) for achieving robust and precise facial landmark detection. The overall architecture of the proposed SCPAN is shown in Fig. 2. SCPAN contains two parts: Boundary-Aware Landmark Intensity (BALI) fields and Self-Calibrated Pose Attention (SCPA) model. The proposed BALI fields have a composite structure, which is composed of a scalar component for the confidence of a particular boundary and a vector component that points to the closest landmark in the particular boundary. As shown in Fig. 1, the proposed BALI fields can simultaneously add both boundary and field constraints to the predicted landmarks, thus helping improve the detection accuracy. Moreover, an SCPA model is designed to learn more representative and discriminative features by introducing the self-calibrated mechanism and pose attention mask. The self-calibrated mechanism can provide a natural learning objective function that enforces intermediate supervision without label information, which effectively reduces the dependence of the detection accuracy on labeled facial images. The pose attention mask can selectively emphasize important features and suppress less useful ones for producing more effective landmark heatmaps and BALI fields. Finally, by integrating the proposed BALI fields and SCPA model into a novel SCPAN framework with seamless formulations, more facial prior knowledge can be learned for achieving robust and precise facial landmark detection. The main contributions of this work are summarized as follows:

1) By incorporating boundary heatmaps and landmark intensity fields, we propose a Boundary-Aware Landmark Intensity (BALI) field, in which both boundary and field information can be better used to model the facial shape constraints for detecting more accurate landmarks.

2) A Self-Calibrated Pose Attention (SCPA) model is proposed to learn more representative and discriminative features by introducing the self-calibrated mechanism and pose attention mask. It can help generate more effective landmark heatmaps and BALI fields while dealing with complicated cases, especially for faces with large poses and heavy occlusions.

3) To the best of our knowledge, this is the first study to explore how to incorporate landmark heatmaps, boundary heatmaps and landmark intensity fields for handling facial landmark detection under challenging scenarios in a semi-supervised way. By seamlessly integrating BALI fields and SCPA model, the proposed SCPAN outperforms state-of-the-art methods on challenging benchmark datasets such as 300W [18], Menpo 2D [19], COFW [20], AFLW [21], WFLW [14] and 300VW [22].

The rest of the paper is organized as follows. Section II gives an overview of the related work. Section III shows the proposed method, including the BALI fields and the SCPA model. A series of experiments are conducted to evaluate the performance of the proposed method in Section IV. Finally, Section V concludes the paper.

II Related Work

This section reviews related work completed for fully-supervised facial landmark detection and semi-supervised facial landmark detection methods.

Fully-Supervised Facial Landmark Detection. The mainstream of fully-supervised facial landmark detection usually maps facial appearance features to landmark coordinates or heatmaps and has achieved great success. Early methods such as Active Shape Model (ASM) [23], Active Appearance Model(AAM) [24] and Constrained Local Model (CLM) [25], use the parametric models to enhance the shape variation. They are sensitive to variations on facial poses and occlusions. Recent approaches can be divided into two groups: coordinate regression-based and heatmap regression-based methods. The coordinate regression-based methods[26, 27, 28, 14] directly learn the mapping from facial appearance feature to the landmark coordinate vectors by using different models. In Mnemonic Descent Method (MDM) [28], a recurrent neural network is used to extract the task-based features and model dependencies between cascade iterations for detecting more accurate landmarks. In look-at-boundary (LAB) [14], the stacked hourglass network is used to generate more effective facial boundary heatmaps by introducing the adversarial concept and message passing layers, which helps enhance the shape constraints and improve alignment accuracy. In the occlusion-adaptive deep network (ODN) [29], the Resnet is used to address the occlusion problem for facial landmark detection by recovering more discriminative representations with the learned geometric information. With their favorable regression abilities, these algorithms all achieve good results in restricted conditions. However, they usually regress landmark coordinates by using full connection operations, which can not fully utilize the spatial relationships between pixels and limits their accuracy and robustness against faces in the wild. Since the heatmap regression-based face alignment methods [30, 31, 13, 15, 12] predict landmarks by regressing landmark heatmaps, which makes them better encode the part constraints and context information, thus achieving state-of-the-art performance. Dong et al. propose a style aggregated network (SAN) [30] to address face alignment problems under image styles variations. Liu et al. [31] propose a novel latent variable optimization strategy to find the semantically consistent annotations and alleviate the limitations of human annotations. In Multi-Order High-Precision Hourglass Network (MMHN) [13] and Multi-order Multi-constraint Deep Networks (MMDN) [15], high-order information are utilized to explore more discriminative representations for robust face alignment. In LUVLi [12], a novel end-to-end framework is proposed to achieve state-of-the-art alignment accuracy by jointly estimating facial landmark locations, uncertainty, and visibility. However, their performance still depends on a large scale of annotated training samples and also suffers from faces with large poses and heavy occlusions. While our proposed SCPAN can model more effective facial shape constraints with unlabeled face images, thus reducing the dependency on landmark annotations.

Semi-Supervised Facial Landmark Detection. As the fully-supervised facial landmark detection methods [31, 13, 15, 12] depend highly on the scale of annotated face images, several semi-supervised methods [32, 33, 34, 7] are proposed to improve face alignment by using unlabeled face images. Tang et al. [32] use an iterative coarse-to-fine patch-based scheme and a greedy patch selection strategy to address face alignment by optimizing an objective function defined on both annotated and unannotated images. Honari et al. [33] aim to improve face alignment by proposing an unsupervised technique that leverages equivariant landmark transformation and auxiliary attributes, thereby reducing the need of labeled face images. Dong et al. [34] use the coherency of optical flow as the source of supervision, which can help achieve more precise facial landmark detection. Dong et al. [35] propose an interaction mechanism between a teacher and two students to generate more reliable pseudo labels for addressing partially labeled facial landmark detection problems. Zhu et al. [7] and Yin et al. [36] both use the consistency constraints on the facial sequence to address semi-supervised facial landmark tracking tasks. However, utilizing optical flow or the temporal relations can only address face alignment problems under small facial poses and expressions variations, as an “in-the-wild” facial video usually deforms or zooms gradually and smoothly without sharp changes, thus resulting in a decrease in detection accuracy when facing large variations on facial poses and expressions. By contrast, our SCPAN is able to use unlabeled face images with large poses and heavy occlusions as supervision signals, therefore its detection robustness and accuracy have been enhanced.

III Robust and Precise Facial Landmark Detection by Self-Calibrated Pose Attention Network

In this section, we first elaborate on the Boundary-Aware Landmark Intensity (BALI) fields, and then describe the Self-Calibrated Pose Attention (SCPA) model. Finally, we show the proposed Self-Calibrated Pose Attention Network (SCPAN) and its objective function.

III-A Boundary-Aware Landmark Intensity (BALI) fields

The proposed BALI fields have a composite structure. They consist of a scalar component for confidence (boundary heatmaps as shown in Fig. 3 (b)) and a vector component that points to the closest landmark in this boundary (as shown in Fig. 3 (c) and (d)). The proposed BALI fields need to estimate the confidence $B^{ij}_{c}$ and a vector $(B_{u}^{ij},B_{v}^{ij})$ at every output location $(i,j)$ , which can be expressed as follows:

B^{ij}=\left\{{B^{ij}_{c},B^{ij}_{u},B^{ij}_{v}}\right\}

(1)

(B_{u}^{ij},B_{v}^{ij})=(\hat{u},\hat{v})-(i,j)

(2)

where $(B_{u}^{ij},B_{v}^{ij})$ is calculated within a square region with edge length $2R+1$ and centered at the ground-truth landmark location $(\hat{u},\hat{v})$ . $B_{c}^{ij}$ denotes the boundary heatmap which is constructed according to MMDN [15]. To be specific, for each boundary, landmarks on this boundary are firstly interpolated to get a dense boundary line. Then, a Gaussian distribution is used to construct a ground-truth boundary heatmap by transforming distance map $Dist$ that is obtained with a binary map and distance transform function. The boundary heatmap is constructed as follows:

\hat{B}_{c}^{ij}=\left\{{\begin{array}[]{*{20}{c}}{\exp\left({-\frac{{Dist\left({i,j}\right)}}{{2\sigma^{2}}}}\right){\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt},if{\kern 1.0pt}Dist\left({i,j}\right)<2{\sigma}{\kern 1.0pt}}\\ {\xi,{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}otherwise}\end{array}}\right.

(3)

where $\xi$ is a small constant, $\sigma$ denotes the standard deviation of the corresponding Gaussian distribution and $\hat{B}$ denotes the ground-truth boundary heatmap. So far, the BALI fields have been constructed, however, how to utilize the BALI fields to detect more accurate landmarks is still an unsolved problem. In this paper, we address the above problem with the help of landmark heatmaps, i.e., we generate the landmark heatmap and the BALI fields at the same time. Therefore, by optimizing landmark heatmaps and boundary heatmaps (contained in the proposed BALI fields) in a multi-task way, the facial boundary constraints can be introduced to generate more effective landmark heatmaps. Then, we further use the field constraints to obtain more precise landmark coordinates. To be specific, we firstly compute the coarse landmark coordinates by using $\arg\mathop{\max}\limits_{i,j}\left(H\right)=\left({i^{\prime},j^{\prime}}\right)$ , where $H$ denotes the predicted landmark heatmap. Then, we crop a small square region from the landmark heatmap with edge length $2r+1$ center at $(i^{\prime},j^{\prime})$ , which is denoted by $H^{\prime}$ . Finally, the soft-argmax operation is utilized on $H^{\prime}$ to calculate the final landmark coordinates. The whole process can be formulated as follows:

H^{\prime}=\left\{{\left({i,j}\right)\left|{i\in\left[{i^{\prime}-r,i^{\prime}+r}\right],}\right.j\in\left[{j^{\prime}-r,j^{\prime}+r}\right]}\right\}

(4)

\left({\bar{u},\bar{v}}\right)=\frac{{\sum\nolimits_{i,j\in H^{\prime}}{{H^{ij}}\times\left({\left({B_{u}^{ij},B_{v}^{ij}}\right)+\left({i,j}\right)}\right)}}}{{\sum\nolimits_{i,j\in H^{\prime}}{{H^{ij}}}}}

(5)

where $(\bar{u},\bar{v})$ denotes the final predicted landmark coordinates. Therefore, by designing Eq. (5), both boundary and field constraints can be introduced to detect more accurate landmarks.

III-B Self-Calibrated Pose Attention model

The proposed Boundary-Aware Landmark Intensity (BALI) fields can introduce both boundary and field constraints that would be helpful for detecting more precise landmarks, however, how to generate more accurate and effective BALI fields and landmark heatmaps are still open questions. The heatmap regression-based facial landmark detection methods [30, 31, 12, 13, 15] can generate effective landmark heatmaps and have achieved state-of-the-art performance as they can effectively encode the part constraints and context information. However, these methods suffer from performance degradation when facing large poses and heavy occlusions, because 1) the large pose or occlusions will mislead models’ learning of robust features and accurate facial shape constraints. 2) backpropagated gradients diminish in strength as they are propagated through a very deep network. Therefore, in this paper, a Self-Calibrated Pose Attention (SCPA) model is proposed to address the above problem by introducing a self-calibrated mechanism and a pose attention mask. The self-calibrated mechanism is able to provide intermediate supervision and address the gradient vanishing problem by optimizing the $L_{2}$ -loss of the generated heatmaps between paired images, and the pose attention mask can drive the network to focus on part of interest by using the newly generated heatmaps as attention for learning more representative and discriminative features. Therefore, the SCPA model is able to produce more accurate BALI fields and landmark heatmaps and help achieve more robust and precise facial landmark detection. The network structure of the proposed SCPA model is shown in Fig. 2.

III-B1 Self-Calibrated Mechanism

In CPM [37] and Openpose [16], the sequential architecture is utilized to learn more effective human pose estimation models by communicating increasingly refined heatmaps between stages and providing a natural learning objective function for enforcing intermediate supervision. Therefore, the problem of vanishing gradients during training can be well addressed. Inspired by this observation, we generate and optimize landmark and boundary heatmaps in each SCPA model, and the loss of the generated landmark and boundary heatmaps is used to train the whole network.

As shown in Fig. 2, the SCPA model is actually a modified Hourglass Network unit [38]. The input of the Hourglass Network Unit is denoted as $P$ , and the output is denoted as $Q$ . To produce landmark and boundary heatmaps, $Q$ will go through a residual-block, and the loss between the generated landmark and boundary heatmaps and the ground-truths is usually used to optimize the SCPA model. However, that loss is calculated based on the training samples’ labels, which are usually tedious and time-consuming to obtain. Hence, we further propose a new self-calibrated mechanism, in which the loss of the generated landmark and boundary heatmaps between paired images (called self-calibrated loss) is introduced as part of loss for supervision. On the one hand, the proposed self-calibrated mechanism can boost supervision by introducing the self-calibrated loss. On the other hand, the calculation of self-calibrated loss may not need the label information that can effectively reduce the dependence of labeled facial images. Paired images mean the original image and its disturbed version. As shown in Fig. 4, to build a disturbed image, a combination of the texture disturbance operations and spatial transformation operations are applied to the original image. The texture disturbance can be achieved by the operations such as occlusion, blurring and noises, while the spatial transformation operations can be implemented by the translation, rotation and scaling operation. To be specific, by inputting a pair of face images $(I^{\alpha},I^{\beta})$ (the original face $I^{\alpha}$ and its disturbance $I^{\beta}$ ), the SCPA model produces the corresponding landmark heatmaps $(H^{\alpha},H^{\beta})$ and boundary heatmaps $(B_{c}^{\alpha},B_{c}^{\beta})$ . Hence, the original loss function can be formulated as follows:

{\mathbb{L}_{org}}=\sum\limits_{n=1}^{N}{\sum\limits_{k\in\{\alpha,\beta\}}{\left\|{\left({{H^{n,k}},B_{c}^{n,k}}\right)-\left({{{\hat{H}}^{n,k}},\hat{B}_{c}^{n,k}}\right)}\right\|_{F}^{2}}}

(6)

where $n$ denotes the image index and ${\left\|{{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}\cdot{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}}\right\|_{\rm{F}}}$ is the Frobenius norm. $\alpha$ and $\beta$ corresponds to the original face and its disturbance, respectively. $(\hat{H},\hat{B}_{c})$ represents the ground-truth landmark and boundary heatmaps. Then, the self-calibrated loss can be formulated as follows:

{\mathbb{L}_{scl}}=\sum\limits_{n=1}^{N}{\left\|{D\left({{H^{n,\alpha}},B_{c}^{n,\alpha}}\right)-\left({{H^{n,\beta}},B_{c}^{n,\beta}}\right)}\right\|_{F}^{2}}

(7)

where $D$ corresponds to the disturbance operations. For the texture disturbance, the heatmaps of paired images are the same, $D$ is equal to 1. For the spatial transformation operations, $D$ is the corresponding transformation parameter. $L_{scl}$ represents the self-calibrated loss. The final loss function corresponding to the self-calibrated mechanism can be expressed as follows:

{\mathbb{L}_{scm}}=\eta{\mathbb{L}_{scl}}+\lambda{\mathbb{L}_{org}}

(8)

where $\lambda$ and $\eta$ correspond to the weights of $\mathbb{L}_{org}$ and ${\mathbb{L}_{scl}}$ , respectively. More importantly, when $\lambda$ is set to 0, the proposed self-calibrated mechanism is able to use paired unlabeled images as supervision signals and reduce the dependence of label information. This means the detection accuracy of the proposed method is able to be further boosted by using unlabeled face data.

III-B2 Pose Attention Mask

Since the proposed self-calibrated mechanism can help produce more accurate and effective landmark and boundary heatmaps, which contain rich facial pose information that would be helpful for learning more discriminative representations. Therefore, the learned facial pose information can be leveraged to guide the model by hinting where needs to be noticed and where can be ignored. In the SCPA model, such hints are realized by the pose attention mask denoted as $M_{t}$ , which are computed from the pose information that incorporating both landmark and boundary heatmaps. To be specific, the pose information goes through a residual block and an element-wise sigmoid function, and its values are between 0 and 1 indicating the importance of each element in the pose attention mask. The whole process can be formulated as follows:

{M_{t}}=sig\left({R{B_{2}}\left({R{B_{1}}\left({{Q_{t}}}\right)}\right)}\right)

(9)

where $M_{t}$ denotes the pose attention mask in stage $t$ and $t=1\cdots 3$ . $RB$ represents a residual block and $sig$ denotes the sigmoid function. Having computed $M_{t}$ , the intput of next SPCA model is updated by:

{P_{t+1}}=sig\left({R{B_{2}}\left({R{B_{1}}\left({{Q_{t}}}\right)}\right)}\right)\odot{Q_{t}}+{Q_{t}}

(10)

where $\odot$ denotes element-wise product. By multiplying $Q_{t}$ with the attention masks $M_{t}$ , the input of the next SPCA model (i.e., $P_{t+1}$ ) at certain locations are either preserved or suppressed. Since the self-calibrated mechanism can boost supervision of the network and the pose attention mask can selectively emphasize important features and suppress less useful ones, the proposed SPCA model is able to learn more representative and discriminative features for detecting more accurate landmarks.

III-C Self-Calibrated Pose Attention Network

The proposed BALI field is able to achieve highly precise landmark detection by modeling both boundary and field constraints. Then, the SCPA model can learn more representative and discriminative features by introducing the self-calibrated mechanism and the pose attention mask. Finally, by integrating the SCPA model and BALI fields into a Self-Calibrated Pose Attention Network (SCPAN), we can generate more accurate and effective landmark heatmaps and boundary-aware landmark intensity fields for achieving more robust and precise facial landmark detection. The overall network structure of the proposed SCPAN is shown in Fig. 2.

III-D Objective Function

The proposed SCPAN outputs landmark heatmaps and BALI fields, therefore, the loss between the generated landmark heatmaps and BALI fields and the ground-truths should be used as part of the objective function. Moreover, the loss between the paired images and the loss between the predicted landmark coordinates and the ground-truth coordinates should be a part of the final objective function. Therefore, the objective function can be formulated as follows:

{\mathbb{L}_{scpan}}={\mathbb{L}_{scl}}+{\mathbb{L}_{org}}+{\mathbb{L}_{coor}}

(11)

{\mathbb{L}_{scl}}=\sum\limits_{n=1}^{N}{{loss\left({D\left({{H^{n,\alpha}},B_{c}^{n,\alpha}}\right),\left({{H^{n,\beta}},B_{c}^{n,\beta}}\right)}\right)}}

(12)

{\mathbb{L}_{org}}=\sum\limits_{n=1}^{N}{\sum\limits_{k\in\{\alpha,\beta\}}{loss\left({\left({{H^{n,k}},B^{n,k}}\right),\left({{{\hat{H}}^{n,k}},\hat{B}^{n,k}}\right)}\right)}}

(13)

{\mathbb{L}_{coor}}=\left\|{\left({\bar{u},\bar{v}}\right)-\left({{\hat{u}},{\hat{v}}}\right)}\right\|_{2}^{2}

(14)

Original loss. Generally, the Mean Square Error (MSE) loss is selected for optimizing models [38, 30, 39, 40, 41] to generate landmark heatmaps. However, the MSE loss treats each pixel in the heatmap equally, which easily leads to blurred heatmaps and reduces the detection accuracy. Fortunately, MMDN [15] and MMHN [13] have shown that the Jensen-Shannon divergence loss can pay more attention to the foreground area of heatmaps instead of treating the whole heatmap equally, thus accurately measuring the difference between two distributions. Hence, the Jensen-Shannon Divergence loss is also selected as the objective function to calculate the distribution differences between the generated heatmaps and the ground-truths. The Jensen-Shannon Divergence loss is expressed as follows:

\begin{array}[]{l}JS\left({{p_{G}}||{p_{\hat{G}}}}\right)=\frac{1}{2}KL\left({{p_{G}}\left({i,j}\right)||\frac{{{p_{G}}\left({i,j}\right)+{p_{\hat{G}}}\left({i,j}\right)}}{2}}\right)\\ {\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}{\kern 1.0pt}+\frac{1}{2}KL\left({{p_{\hat{G}}}\left({i,j}\right)||\frac{{{p_{G}}\left({i,j}\right)+{p_{\hat{G}}}\left({i,j}\right)}}{2}}\right)\end{array}

(15)

where $i$ and $j$ denote the indexes of a pixel in the heatmap and $KL$ means the Kullback-Leibler divergence. $p_{G}$ and $p_{\hat{G}}$ denote the probability distributions of the generated and ground-truth heatmaps. Base on the above Jensen-Shannon Divergence loss, the original loss can be reformulated as follows:

\begin{array}[]{l}{\mathbb{L}_{org}}=\sum\limits_{n=1}^{N}\biggl{\{}\sum\limits_{k\in\left\{{\alpha,\beta}\right\}}\Bigl{\{}JS\left({{p_{SE\left({B_{u,v}^{n,k}}\right)}}||{p_{SE\left({\hat{B}_{u,v}^{n,k}}\right)}}}\right)\\ +\{JS\left({p_{{H^{n,k}}}^{T}||p_{{{\hat{H}}^{n,k}}}^{T}}\right)+JS\left({p_{B_{c}^{n,k}}^{T}||p_{\hat{B}_{c}^{n,k}}^{T}}\right)\}\\ +\sum\limits_{t=1}^{T-1}\bigl{\{}JS\left({p_{{H^{n,k}}}^{t}||p_{{{\hat{H}}^{n,k}}}^{t}}\right)+JS\left({p_{B_{c}^{n,k}}^{t}||p_{\hat{B}_{c}^{n,k}}^{t}}\right)\bigr{\}}\Bigr{\}}\biggr{\}}\end{array}

(16)

where $t$ denotes the stage, and $T=4$ . $SE(B_{u,v})$ means cropping the corresponding area (i.e., a square region with edge length $2r+1$ center at the ground-truth landmark location) from $B_{u}$ and $B_{v}$ .

Self-calibrated loss. The proposed self-calibrated mechanism could provide intermediate supervision by optimizing the loss of paired images. We also use the Jensen-Shannon Divergence loss as the objective function, and the self-calibrated loss can be reformulated as follows:

\begin{array}[]{l}{\mathbb{L}_{scl}}=\sum\limits_{n=1}^{N}\Biggl{\{}\sum\limits_{t=1}^{T}\biggl{[}JS\left({p_{D\left({{H^{n,\alpha}}}\right)}^{t}||p_{{H^{n,\beta}}}^{t}}\right)\\ +JS\left({p_{D\left({B_{c}^{n,\alpha}}\right)}^{t}||p_{D\left({B_{c}^{n,\beta}}\right)}^{t}}\right)\biggr{]}\Biggr{\}}\end{array}

(17)

That means in each SCPA model, the self-calibrated loss is introduced to provide intermediate supervision and produce more effective pose attention masks for learning more discriminative representations.

Coordinate loss. The loss between the predicted landmarks and the ground-truths can also be used to optimize the proposed SCPAN, which further help obtain more accurate landmarks. The coordinate loss can be formulated as follows:

{\mathbb{L}_{coor}}=\sum\limits_{n=1}^{N}{\sum\limits_{k\in\{\alpha,\beta\}}{\sum\limits_{\phi=1}^{\Phi}{\left\|{\left({\bar{u}_{\phi}^{n,k},\bar{v}_{\phi}^{n,k}}\right)-\left({\hat{u}_{\phi}^{n,k},\hat{v}_{\phi}^{n,k}}\right)}\right\|_{2}^{2}}}}

(18)

where $\phi$ denote the landmark index. The coordinate loss is able to help better integrate the boundary and field constraints and produce more effective landmark heatmaps and BALI fields for detecting more accurate landmarks.

Overall loss. By combining $\mathbb{L}_{org}$ , $\mathbb{L}_{scl}$ and $\mathbb{L}_{coor}$ , we can obtain the overall loss, which can be formulated as follows:

\begin{array}[]{l}{\mathbb{L}_{scpan}}={\mathbb{L}_{scl}}+{\mathbb{L}_{org}}+{\mathbb{L}_{coor}}\\ =\sum\limits_{n=1}^{N}\Biggl{\{}\sum\limits_{k\in\left\{{\alpha,\beta}\right\}}\biggl{\{}{\lambda_{1}}\bigl{\{}JS\left({p_{{H^{n,k}}}^{T}||p_{{{\hat{H}}^{n,k}}}^{T}}\right)+JS\left({p_{B_{c}^{n,k}}^{T}||p_{\hat{B}_{c}^{n,k}}^{T}}\right)\bigr{\}}\\ +\eta\sum\limits_{t=1}^{T-1}\bigl{\{}JS\left({p_{{H^{n,k}}}^{t}||p_{{{\hat{H}}^{n,k}}}^{t}}\right)+JS\left({p_{B_{c}^{n,k}}^{t}||p_{\hat{B}_{c}^{n,k}}^{t}}\right)\bigl{\}}\\ +{\lambda_{2}}JS\left({{p_{SE\left({B_{u,v}^{n,k}}\right)}}||{p_{SE\left({\hat{B}_{u,v}^{n,k}}\right)}}}\right)\\ +\gamma\sum\limits_{\phi=1}^{\Phi}{\left\|{\left({\bar{u}_{\phi}^{n,k},\bar{v}_{\phi}^{n,k}}\right)-\left({\hat{u}_{\phi}^{n,k},\hat{v}_{\phi}^{n,k}}\right)}\right\|_{2}^{2}}\biggr{\}}\\ +\eta\sum\limits_{t=1}^{T-1}\biggl{\{}JS\left({p_{D\left({{H^{n,\alpha}}}\right)}^{t}\left\|{p_{{H^{n,\beta}}}^{t}}\right.}\right)+JS\left({p_{D\left({B_{c}^{n,\alpha}}\right)}^{t}||p_{D\left({B_{c}^{n,\beta}}\right)}^{t}}\right)\biggr{\}}\Biggr{\}}\end{array}

(19)

With the above overall loss function, the original label information can be better used to improve the detection accuray.

III-E Improving SCPAN with Semi-supervised Learning

SCPAN is able to detect more accurate landmarks for faces with large poses and heavy occlusions by incorporating the proposed BALI fields and SCPA model. However, its performance still depends on large-scale training samples, i.e., high-resolution face images and their landmark annotations. Although it is easy to collect high-resolution face images, annotating them is more expensive and tedious. The proposed self-calibrated mechanism can provide a self-learned objective function that enforces intermediate supervision by utilizing unlabeled face data. In this way, more facial prior knowledge can be learned to enhance the detection accuracy of the proposed SCPAN (i.e., semi-SCPAN). Suppose there are $N$ labeled face images and $M$ unlabeled ones, the objective function of semi-SCPAN can be formulated as follows:

\begin{array}[]{l}{{\mathbb{L}^{\prime}}_{scpan}}={\mathbb{L}_{scpan}}+{{\mathbb{L}^{\prime}}_{scl}}={\mathbb{L}_{scpan}}\\ +\sum\limits_{m=1}^{M}\sum\limits_{t=1}^{T-1}\left(JS\left({{p^{t}_{D\left({{H^{m,\alpha}}}\right)}}\left\|{{p^{t}_{{H^{m,\beta}}}}}\right.}\right){\kern 1.0pt}+JS\left({{p^{t}_{D\left({B_{c}^{m,\alpha}}\right)}}\left\|{{p^{t}_{B_{c}^{m,\beta}}}}\right.}\right)\right)\end{array}

(20)

where ${{\mathbb{L}^{\prime}}_{scl}}$ denotes the self-calibrated loss corresponding to $M$ unlabeled images and ${{\mathbb{L}^{\prime}}_{scpan}}$ represents the final objective function of semi-SCPAN that is designed by utilizing both labeled and unlabeled data. With the above new objective function, the SCPAN is able to use unlabeled face images to further boost the performance of facial landmark detection. We also present the main steps of the proposed Semi-Supervised SCPAN in Alogrithm I.

IV Experiments

In this section, we firstly introduce the evaluation settings including the datasets and methods for comparison. Then, we compare our algorithm with the state-of-the-art facial landmark detection methods on challenging benchmark datasets such as 300W [18], Menpo 2D [19], COFW [20], AFLW [21], WFLW [14] and 300VW [22].

IV-A Datasets and Implementation details

300W (68 landmarks): The training set of 300W is composed of the training set of AFW [42], LFPW [43] and Helen [44], which contains 3148 face pictures. The testing set of 300W includes IBUG and the testing set of LFPW and Helen, which can be further divided as follows: 1) Challenging subset (i.e., IBUG dataset [18]). It contains 135 more general “in the wild” images, and experiments for this dataset are more challenging. 2) Common subset (554 images, of which 224 images are from LFPW test set and 330 images from Helen test set). 3) Fullset (689 images, composed of the challenging subset and common subset). Moreover, following LUVLI [12] and KDN [45], we perform cross dataset evaluation on 300W dataset, i.e., We first train the SCPAN on 300W-LP dataset [46], and then fine-tune on the trainset (3837 samples). We evaluate the SCPAN on 300W testing set, which contains 600 images.

Menpo 2D (68 landmarks): It consists of images from AFLW and FDDB [47], which are re-annotated following 68 landmark annotation scheme. It has two subsets, frontal faces (6679 samples which have 68 annotations) and profile faces (300 samples which have 39 landmark annotations). We use the frontal set for cross dataset evaluation.

COFW (68 landmarks): It is another very challenging dataset on occlusion issues which is published by Burgos-Artizzu et al. [20]. It contains 1345 training images of which 845 images are from LFPW and the others are with heavy occlusions. The testing set includes 507 face images with large variations on the head pose, facial expression and occlusion. We use the testing set for cross dataset evaluation.

AFLW (19 landmarks): It contains 25993 face images which has extremely large variations of jaw angles ranging from ${\rm{-}}{120^{\circ}}$ to ${\rm{+}}{120^{\circ}}$ and pitch angles ranging from ${\rm{-}}{90^{\circ}}$ to ${\rm{+}}{90^{\circ}}$ . Moreover, it also contains very complicated occlusions. AFLW-full selects 24386 images from the whole AFLW dataset and further divides them into two parts: 20000 for training and 4386 for testing. Moreover, 1165 images (i.e., AFLW-frontal) are selected from AFLW-full testing set to evaluate the alignment algorithm on frontal faces.

Algorithm I:Semi-Supervised SCPAN Input: Training set

\{{I^{n}}\}_{n=1}^{N}

. Output: Model

SCPAN

, BALI field

H

and

B

, landmarks

\left({\bar{u},\bar{v}}\right)

. 1: Uses the original face image

{I^{n,\alpha}}

to generate its disturbance

{I^{n,\beta}}

. 2: IF

I^{n}

has landmark annotations THEN 2.1 Generates the corresponding BALI field:

\left({{{\hat{H}}^{n,\alpha}},{{\hat{B}}^{n,\alpha}}}\right)

and

\left({{{\hat{H}}^{n,\beta}},{{\hat{B}}^{n,\beta}}}\right)

. END IF 3: FOR epoch=1 to end_epoch: 3.1 Forward propagation of SCPAN model:

\left({{H^{n,k}},{B^{n,k}}}\right)=SCPAN({I^{n,k}}),{\rm{}}k\in\left({\alpha,\beta}\right)

. 3.2 Computes the loss:

{\mathbb{L}^{\prime}}_{scpan}={\mathbb{L}_{scpan}}+{{\mathbb{L}^{\prime}}_{scl}}

. 3.3 Calculates coarse landmarks:

\left({i^{\prime},j^{\prime}}\right)=\arg\mathop{\max}\limits_{i,j}\left(H\right)

. 3.4 Calculates final landmarks:

\left({\bar{u},\bar{v}}\right)=\frac{{\sum\nolimits_{i,j\in H^{\prime}}{{H^{ij}}\times\left({\left({B_{u}^{ij},B_{v}^{ij}}\right)+\left({i,j}\right)}\right)}}}{{\sum\nolimits_{i,j\in H^{\prime}}{{H^{ij}}}}}

H^{\prime}=\left\{{\left({i,j}\right)|i\in\left[{i^{\prime}-r,i^{\prime}+r}\right],j\in\left[{j^{\prime}-r,j^{\prime}+r}\right]}\right\}

. 3.5 Updates the

SCPAN

model. END FOR

WFLW (98 landmarks): It has 98 landmark annotations and images in WFLW are collected from more complicated scenarios. The training set contains 7500 images and its testing set includes 2500 images. WFLW also has other attribute annotations, including occlusion, pose, makeup, lighting, blur, and expression, which can help more comprehensively evaluate existing alignment algorithms.

300VW (68 landmarks): Following Shen et al. [22], we use 50 videos to train our SCPAN and test it on the rest 64 videos. The testing set is divided into three parts: well-lit (Scenario1, various head poses, occlusions such as glasses and beards), mild unconstrained (Scenario2, different illuminations, dark rooms, overexposed shots and arbitrary expressions) and challenging (Scenario3, illumination conditions, occlusions, make-ups, expressions and head poses) according to the difficulties.

Evaluation Metrics. Facial landmark detection results are envaluated with Normalized Mean Error ( $\rm NME_{box}$ [48, 49], $\rm NME_{diag}$ [50, 14] and $\rm NME_{io}$ [18, 51]), Area Under the Curve (AUC) [12, 13] and Failure Rate (FR) [52, 51]. The $\rm{NME}$ is defined as follows:

{\rm{NME}}=\frac{1}{\Phi}\sum\limits_{\phi=1}^{\Phi}{\frac{{{{\left\|{\left({{{\bar{u}}_{\phi}},{{\bar{v}}_{\phi}}}\right)-\left({{{\hat{u}}_{\phi}},{{\hat{v}}_{\phi}}}\right)}\right\|}_{2}}}}{d}}

(21)

where $\phi$ denotes the landmark index, $(\bar{u},\bar{v})$ and $(\hat{u},\hat{v})$ represent the predicted and ground-truth landmark coordinates, respectively. $d$ denotes the normalization term, which can be set to the interpupil distances, the interocular distance, the distance between the outer corners of the two eyes, the geometric mean of the width and height of the ground-truth bounding box and the diagonal of the tight bounding box for $\rm NME_{ip}$ , $\rm NME_{io}$ , $\rm NME_{box}$ and $\rm NME_{diag}$ , respectively. To compute AUC, we firstly plot the cumulative distribution of the fraction of test images whose NME (%) is less than or equal to the value on the horizontal axis, then the AUC is computed as the area under that curve. FR means the percentage of images in the test set whose NME is larger than a certain threshold.

Implementation Details. In our experiments, all the training and testing images are cropped and resized to 256x256 according to the provided bounding boxes. To generate the disturbance, we use the spatial transformation and texture disturbance operations. Specifically, the spatial transformation contains rotation ( $-60^{\circ},+60^{\circ}$ ), scaling ( $0.5,1$ ), mirror flip and their combinations. The texture disturbance includes occlusion, blurring and noises. The occlusion disturbance is achieved by using two types of occlusions. The first one is occlusion by black, i.e., the occluded area is covered by black, and the second one is occlusion by part of the original face image. The blurring disturbance means downsampling high-resolution images ( $256\times 256$ ) into low-resolution ( $128\times 128$ , $64\times 64$ , $32\times 32$ , $16\times 16$ ) ones with bicubic degradation. We use the Stacked Hourglass Network [38] as our backbone to construct the proposed Self-Calibrated Pose Attention Network, and the spatial size of the output heatmap and field is $128\times 128$ . $\lambda_{1}$ , $\lambda_{2}$ , $\gamma$ and $\eta$ are set to 1, 16, 40 and 4, respectively. $11\times 11$ and $7\times 7$ field regions are used in the training and testing phases, respectively. The training of SCPAN takes 200000 iterations and the staircase function is used to set the learning rate. The initial learning rate is $2.5\times{10^{{\rm{-4}}}}$ and then it is divided by 5, 2 and 2 at iteration 10000, 40000 and 100000, respectively. The SCPAN is trained with Pytorch on 8 Nvidia Tesla V100 GPUs.

Experiment Settings. To evaluate the effectiveness of each module proposed in this paper, we firstly use the Stacked Hourglass Network (SHN) [38] as the baseline, and then we separately construct SHN+BALI and SHN+SCPA by combining SHN with the proposed Boundary-Aware Landmark Intensity (BALI) fields and Self-Calibrated Pose Attention (SCPA) model, respectively. Moreover, we combine SHN with both BALI fields and SCPA model to construct SCPAN (i.e., SHN+BALI+SCPA). Then, as SCPAN can be further boosted by utilizing unlabeled face images (in this paper, we use the CelebA [53] dataset) and the boosted SCPAN is denoted by semi-SCPAN. For CelebA, we use 169854 images for training. To be specific, we firstly use OpenFace [54] to detect 68 landmarks and then obtain the corresponding input images with cutting and scaling operations according to the detected landmarks. For other state-of-the-art methods [55, 30, 56, 35, 57, 12, 58], we either use the original codes released by the authors or restore the experiment, and the results both have achieved the expected effects in the corresponding papers. The detailed comparison is shown below.

TABLE I:

\rm NME_{io}

and

\rm NME_{ip}

comparisions on 300W dataset. (% omitted)

Method	Common Subset	Challenging Subset	Fullset
$\rm NME_{io}$ comparisions
PCD-CNN_CVPR18[55]	3.67	7.62	4.44
SAN_CVPR18[30]	3.34	6.60	3.98
AVS_ICCV19[56]	3.21	6.49	3.86
LAB_CVPR18[14]	2.98	5.19	3.49
Techer_ICCV19[35]	2.91	5.91	3.49
DU-Net_ECCV18[59]	2.90	5.15	3.35
DeCaFa_ICCV19[57]	2.93	5.26	3.39
HR-Net $\rm{}_{19^{\prime}}$ [50]	2.87	5.15	3.32
HG-HSLE_ICCV19[60]	2.85	5.03	3.28
AWing_ICCV19[61]	2.72	4.52	3.07
LUVLi_CVPR20[12]	2.76	5.16	3.23
CCDN_NN21[58]	2.75	4.43	3.08
SHN	3.11	6.23	3.72
SHN+BALI	2.78	5.01	3.21
SHN+SCPA	2.59	4.68	3.00
SCPAN	2.46	4.43	2.85
semi-SCPAN	2.38	4.31	2.76
$\rm NME_{ip}$ comparisions
Honari et al._CVPR18[33]	4.20	7.78	4.90
SBR_CVPR18[34]	3.28	7.58	4.10
TS_CVPR18[35]	3.17	6.41	3.78
Liu et al._CVPR19[31]	3.45	6.38	4.02
ODN_CVPR19[29]	3.56	6.67	4.17
STKI_ACMMM20[7]	3.36	7.39	4.16
MMDN_TNNLS21[15]	3.17	6.08	3.74
semi-SCPAN	2.92	5.96	3.52

IV-B Evaluation under Normal Circumstances

For benchmark datasets such as 300W, Menpo 2D, COFW, AFLW and WFLW, faces in 300W common subset, 300W full set, Menpo 2D and AFLW-frontal dataset are closer to neutral faces and have smaller variations on the head pose, facial expression and occlusion. Hence, we evaluate the effectiveness of the proposed SCPAN method under normal circumstances with these four subsets. Table I includes the $\rm NME_{io}$ and $\rm NME_{ip}$ comparisons of state-of-the-art face alignment methods on 300W common subset and 300W full set. Moreover, Table II shows the $\rm NME_{box}$ and $\rm AUC^{7}_{box}$ comparisons on 300W testing set and Menpo 2D (for cross dataset evaluation). As shown in Tables I and II, the proposed SCPAN outperforms state-of-the-art methods [30, 35, 57, 50, 59, 12] on 300W and Menpo 2D datasets. At the same time, the SCPAN also achieves the best score on AFLW-frontal dataset (as shown in Table III). These results indicate that the SCPAN can improve the detection accuracy under normal circumstances, mainly because 1) the proposed BALI fields can introduce both boundary and field constraints to enhance facial shape constraints and achieve high-precision landmark detection. 2) the SCPA model can boost supervision of the network and selectively emphasize important features and suppress less useful ones to learn more representative and discriminative features by introducing the self-calibrated mechanism and the pose attention mask. 3) by integrating the SCPA model and BALI fields into a Self-Calibrated Pose Attention Network (SCPAN), more accurate and effective landmark heatmaps and BALI fields can be generated for achieving robust and precise facial landmark detection.

TABLE II:

\rm NME_{box}

and

\rm AUC^{7}_{box}

comparisions on 300W Common subset, Menpo 2D and COFW-68 datasets. (- not counted, % omitted) [* = pretrained on 300W-LP-2D [46]]

Method	$\rm NME_{box}$			$\rm AUC^{7}_{box}$
Method	300W	Menpo	COFW	300W	Menpo	COFW
SAN*_CVPR18[30]	2.86	2.95	3.50	59.7	61.9	51.9
2D-FAN*_ICCV17[48]	2.32	2.16	2.95	66.5	69.0	57.5
KDN[45]	2.49	2.26	-	67.3	68.2	-
Softlabel*_ICCV19[52]	2.32	2.27	2.92	66.6	67.4	57.9
KDN*_ICCV19[52]	2.21	2.01	2.73	68.3	71.1	60.1
LUVLi_CVPR20[12]	2.24	2.18	2.75	68.3	70.1	60.8
LUVLi*_CVPR20[12]	2.10	2.04	2.57	70.2	71.9	63.4
SCPAN	2.01	1.93	2.47	71.8	72.8	65.1
SCPAN*	1.95	1.88	2.38	72.5	73.1	65.7

TABLE III:

\rm NME

and

\rm AUC

comparisions on AFLW dataset. (- not counted, % omitted)

Method	$\rm NME_{diag}$		$\rm NME_{box}$	$\rm AUC^{7}_{box}$
Method	Full	Frontal	Full	Full
CCL_CVPR16[21]	2.72	2.17	-	-
LLL_ICCV19[62]	1.97	-	-	-
SAN_CVPR18[30]	1.91	1.85	4.04	54.0
DSRN_CVPR18[63]	1.86	-	-	-
LAB_CVPR18[14]	1.85	1.62	-	-
HR-Net $\rm{}_{19^{\prime}}$ [50]	1.57	1.46	-	-
Wing_CVPR18[64]	-	-	3.56	53.5
KDN[45]	-	-	2.80	60.3
LUVLi_CVPR20[12]	1.39	1.19	2.28	68.0
MHHN_TIP21[13]	1.38	1.19	-	-
SHN	2.46	1.92	3.67	56.4
SHN+BALI	1.84	1.46	2.32	66.1
SHN+SCPA	1.62	1.37	2.21	67.2
SCPAN	1.31	1.10	2.05	69.8
semi-SCPAN	1.23	1.05	2.01	70.7

TABLE IV:

\rm NME_{ip}

and

\rm FR_{ip}^{10}

comparisons on COFW dataset. (- not counted, % omitted)

Method	$\rm NME_{ip}$	$\rm FR_{ip}^{10}$
DRDA_CVPR16[65]	6.46	6.00
RAR_ECCV16[66]	6.03	4.14
DAC-CSR_CVPR17[67]	6.03	4.73
CAM $\rm{}_{19^{\prime}}$ [68]	5.95	3.94
PCD-CNN_CVPR18[55]	5.77	3.73
Wing_CVPR18[64]	5.44	3.75
LAB_CVPR18[14]	5.58	2.76
AWing_ICCV19[6]	4.94	0.99
ODN_CVPR19[29]	5.30	-
MHHN_TIP21[13]	4.95	1.78
SHN	6.21	5.52
SHN+BALI	5.52	3.16
SHN+SCPA	5.21	2.17
SCPAN	4.93	1.78
semi-SCPAN	4.83	1.58

TABLE V:

\rm NME_{io}

comparisons on WFLW dataset. (% omitted).

Method	Testset	Pose Subset	Expression Subset	Illumination Subset	Make-Up Subset	Occlusion Subset	Blur Subset
CCFS_CVPR15[69]	9.07	21.36	10.09	8.30	8.74	11.76	9.96
DVLN_CVPR17[70]	6.08	11.54	6.78	5.73	5.98	7.33	6.88
LAB_CVPR18[14]	5.27	10.24	5.51	5.23	5.15	6.79	6.32
Wing_CVPR18[64]	5.11	8.75	5.36	4.93	5.41	6.37	5.81
MHHN_TIP21[13]	4.77	9.31	4.79	4.72	4.59	6.17	5.82
MMDN_TNNLS21[15]	4.87	8.15	4.99	4.61	4.72	6.17	5.72
HRNet $\rm{}_{19^{\prime}}$ [50]	4.60	7.86	4.78	4.57	4.26	5.42	5.36
AWing_ICCV19[61]	4.36	7.38	4.58	4.32	4.27	5.19	4.96
LUVLi_CVPR20[12]	4.37	7.56	4.77	4.30	4.33	5.29	4.94
SHN	5.78	9.47	6.39	5.83	5.91	7.07	7.21
SHN+BALI	4.73	7.67	4.96	4.73	4.56	5.67	5.47
SHN+SCPA	4.57	7.43	4.77	4.52	4.38	5.44	5.12
SCPAN	4.29	7.22	4.68	4.34	4.21	5.25	4.88
semi-SCPAN	4.21	7.01	4.57	4.15	4.16	5.17	4.81

IV-C Evaluation of Robustness against Occlusion

Variations in occlusion and illumination are classic problems in face alignment task. The state-of-the-art face alignment methods still suffer from heavy occlusions and complicated illuminations. In this paper, we use COFW dataset, 300W challenging subset and WFLW dataset to evaluate the robustness of the proposed SCPAN against occlusion.

For 300W challenging subset, SCPAN can achieve 4.43% $\rm NME_{io}$ as shown in Table I, which outperforms state-of-the-art face alignment methods [30, 35, 57, 50, 59, 12]. This indicates the method can effectively enhance the alignment robustness for faces with heavy occlusions.

For COFW dataset (cross dataset evaluation), the failure rate ( $\rm FR_{ip}^{10}$ ) is defined by the percentage of test images with more than 10% detection error which is normalized by interpupil distance. As illustrated in Table IV, the SCPAN can boost the $\rm NME_{ip}$ to 4.93% and the failure rate to 1.78%, which outperforms the state-of-the-art methods [60, 55, 61, 14, 59, 29, 13]. Moreover, we also use the model trained on 300W training set to further test COFW testing set and the corresponding experimental results (i.e., $\rm NME_{box}$ and $\rm AUC^{7}_{box}$ ) are shown in Table II. These all suggest that the proposed BALI fields and SCPA model play an important role in boosting the ability to address the occlusion problems.

Since WFLW dataset is composed of the Illumination subset, the Make-Up Subset and the Occlusion Subset, which contain complicated occlusions that can be used to evaluate the robustness of the SCPAN against occlusion. From the experimental results shown in Table V, we conclude that SCPAN is more robust to faces with complicated occlusions.

Hence, from the experimental results illustrated in Tables I, II, IV and V, we can conclude that 1) by fusing boundary heatmaps and landmark intensity fields, the proposed BALI fields is able to better model the facial geometric information and context information for enhancing facial shape constraints. 2) the SPCA model can learn more representative and discriminative features by introducing the self-calibrated mechanism and pose attention mask when faces are corrupted by complicated occlusions. 3) the texture disturbance operation can effectively expand the original dataset and make full use of the original label information, which helps achieve more robust landmark detection for faces with heavy occlusions.

IV-D Evaluation of Robustness against Large Poses

Face with large poses is another great challenge for facial landmark detection. We conduct experiments on AFLW-full, 300W challenging subset and WFLW dataset to further evaluate the performance of the SCPAN for faces with large poses. Fig. 5 shows the landmark detection result comparison with the state-of-the-art methods and the ground-truths (i.e., GT) on 300W challenging subset. Tables I, III and V show the corresponding experimental results. To be specific, for AFLW-full dataset, the proposed SCPAN can achieve 1.31% $\rm NME_{diag}$ , 2.05% $\rm NME_{box}$ and 69.8% $\rm AUC_{box}^{7}$ , which exceeds state-of-the-art methods [62, 30, 14, 50, 12, 13]. For WFLW dataset, the $\rm NME_{io}$ on Testset, Pose Subset and Expression Subset all beat the other state-of-the-art methods [63, 14, 13, 50, 59, 12]. From the above experimental results, we can conclude that our method is more robust to faces with large poses, mainly because 1) the spatial transformation disturbance operations can effectively enrich the original dataset, i.e., produce many face images with large poses, thus the robustness of our method against large poses has been improved. 2) the SCPAN can better model the facial shape constraints by integrating the BALI fields and SCPA model, which improves the accuracy of landmark detection for faces with large poses.

TABLE VI:

\rm NME_{ip}

comparisons on 300VW dataset. (% omitted)

Method	Scenario1	Scenario2	Scenario3
TSCN_NIPS14[71]	12.54	7.25	13.13
CCFS_CVPR15[69]	7.68	6.42	13.67
TCDCN_TPAMI16[72]	7.66	6.77	14.98
CCR_ECCV16[73]	7.26	5.89	15.74
iCCR_ECCV16[73]	6.71	4.00	12.75
MDM_CVPR16[28]	5.46	4.59	7.42
TSTN_TPAMI18[74]	5.36	4.51	12.84
FHR_AAAI19[75]	5.07	4.34	7.36
FHR+STA_AAAI19[75]	4.42	4.18	5.98
STKI_ACMMM20[7]	5.04	4.57	6.11
SCPAN	4.49	4.23	5.87

IV-E Evaluation on Face Videos

We evaluate our proposed SCPAN on 300VW dataset with state-of-the-art facial landmark detection methods [64, 65, 66, 67, 7]. From the experimental results as shown in Table VI, we can find that (1) on both Scenario1 and Scenario2, our SCPAN can achieve state-of-the-art accuracy. (2) on Scenario3, our SCPAN beats the best score. These indicate that our proposed SCPAN is more robust to faces under variations on illumination conditions, occlusions, make-ups, expressions and head poses. Moreover, we believe that our SCPAN structure can be further boosted by using a temporal modeling technique [67, 7].

IV-F Evaluation of Semi-SCPAN

Current facial landmark detection methods still suffer insufficient labeled training samples. With a well-designed SCPA model, the proposed SCPAN is able to use unlabeled face images to boost the detection accuracy. To evaluate this, we use both labeled and unlabeled datasets to train SCPAN, i.e., 300W, AFLW, COFW and WFLW are separately mixed with CelebA dataset to train SCPAN (denoted as semi-SCPAN), the corresponding experimental results are shown in Table I, III, IV and V, respectively. From this we can see that semi-SCPAN outperforms the state-of-the-art methods [63, 30, 14, 13, 50, 59, 12] and other semi-supervised methods [33, 34, 35, 7]. This indicates that by introducing unlabeled face datasets, more prior knowledge can be learned to model facial shape constraints by semi-SCPAN for boosting the detection accuracy.

IV-G Self Evaluations

Heatmap generation. Almost heatmap regression-based facial landmark detection methods use heatmaps generated by a Gaussian distribution to regress and predict landmark coordinates. The closer to the landmark, the greater the belief value is. Moreover, the gradient of belief value is also changed. Therefore, by using the Gaussian distribution, neural networks can quickly and directionally reach the landmark. We also explore other two non-Gaussian distributions, i.e., Generalized Error Distribution (GED) [68, 69] and Student-t Distribution (StD) [70] to generate landmark heatmaps and boundary heatmaps. Note that, when we set $df=+\infty$ or $d=0$ , the curves of GED and StD become the Standard Normal Distribution. $df$ means the degree of freedom of StD, $d$ denotes the shape parameter of GED. As shown in Table VII, we can find that using non-Gaussian distributions to generate heatmaps can achieve comparable or even better results than using a Gaussian distribution.

TABLE VII: The effect (

\rm NME_{io}

(%)) of different heatmaps on 300W challenging subset.

Distribution	Gaussian	GED			StD
Distribution	-	d=0.2	d=0.1	d=-0.1	df=1	df=3
NME	4.43	4.67	4.39	5.01	4.42	4.40

Sensitivity analysis of parameters. By combing $\mathbb{L}_{org}$ , $\mathbb{L}_{scl}$ and $\mathbb{L}_{coor}$ , we can obtain the overall loss. The losses corresponding to landmark heatmap and boundary heatmap should be given the same weight. However, the sizes of heatmaps in $t=T$ and $t=1\cdots T-1$ are $128\times 128$ and $64\times 64$ , respectively. Therefore, $\lambda_{1}$ and $\eta$ are set to 1 and 4, respectively. Besides, the loss of field is calculated with a smaller area, $\lambda_{2}$ should be given a larger weight (i.e., 16). When calculating $\mathbb{L}_{coor}$ , we first normalize the landmark coordinates to $[0,1]$ , and then $\gamma$ is set to 40. We also conduct the corresponding experiments by using different $\lambda_{1}$ , $\lambda_{2}$ , $\gamma$ and $\eta$ values on 300W challenging subsets. From the experimental results in Tabel VIII, we can find that $\lambda_{1}=1$ , $\lambda_{2}=16$ , $\gamma=40$ and $\eta=4$ are good choices to balance these four parts.

TABLE VIII: The effect (

\rm{NME_{io}}

(%)) of different

\lambda_{1}

\lambda_{2}

\gamma

and

\eta

values on 300W challenging subset.

$\lambda_{1}$	$\eta$	$\lambda_{2}$	$\gamma$	NME
1	0	0	0	5.96
1	4	0	0	5.69
1	4	8	0	4.96
1	4	16	0	4.78
1	4	32	0	4.84
1	4	16	20	4.54
1	4	16	40	4.43
1	4	16	60	5.01

Time and memory analysis. The proposed SCPAN contains three SCPA models and an Hourglass Network Unit (HNU). Compared to the HNU, our SCPA model increase both the parameter and computational costs. For the baseline i.e., four stacked SHN, its model size and parameters are 16.05MB and 184MB, respectively. While Our SCPAN takes 16.99MB parameters and its model size is 195MB. Besides, the reference speed of our SCPAN can achieve 40FPS in a single Telsa V100 GPU while the SHN achieves 100FPS. If we use multiple GPUs, the reference will speed up. Among all, compared to SHN, our SCPAN introduces some additional parameters which are insignificant compared to 16GB memory on a tesla v100, and the impact of the consumed computation costs will be decreased with the rapid development of the hardware. To reduce the computational costs or employ lightweight backbones, we also conduct the experiment by reducing the number of SCPA models, i.e., one SCPA model and one HNU are used to construct SCPAN. Its parameters and model size will be reduced to 8.73MB and 100MB, respectively. But the $\rm{NME_{io}}$ on 300W challenging dataset will only rise from 4.43% to 4.61%, which still achieves state-of-the-art accuracy.

Visualization of the proposed BALI fields. Since the proposed SCPAN can produce more effective landmark heatmaps and BALI fields and help achieve more accurate facial landmark detection, we visualize the produced landmark heatmaps and BALI fields in Fig. 6. As shown there, the first three rows visualize the landmark heatmaps, boundary heatmaps, BALI fields, the x-offset of the BALI fields and the y-offset of the BALI fields for faces under large poses, and the last three rows show those for faces with complicated occlusions. From Fig. 6, we can see that the SCPAN is able to generate effective and accurate landmark heatmaps and BALI fields for faces in challenging scenarios. Therefore, SCPAN achieves state-of-the-art performance on challenging benchmark datasets.

IV-H Ablation Studies

The proposed Self-Calibrated Pose Attention Network (SCPAN) contains two pivotal components, namely, the BALI fields and the SCPA model. Moreover, SCPAN (denoted as semi-SCPAN) can be further boosted by using unlabeled images. Therefore, the ablation studies are conducted as follows:

(1) The proposed BALI fields can introduce both boundary and field constraints to the predicted landmarks. To evaluate this, we conduct the experiment by combining SHN [38] with the proposed BALI fields (denoted as SHN+BALI) on challenging benchmark datasets including 300W, AFLW, COFW and WFLW. From the experimental results as shown in Tables I, III, IV and V, we can see that SHN+BALI outperforms SHN, which indicates that the proposed BALI fields can better model the facial shape constraints and help detect more accurate landmarks.

(2) The proposed SCPA model is able to learn more representative and discriminative features for producing more effective landmark heatmaps, hence, we conduct the experiment by combining SHN and SCPA model (denoted as SHN+SCPA). The experimental results in Tables I, III, IV and V shows that SHN+SCPA outperforms SHN, which demonstrates that the proposed SCPA model is able to generate more effective landmark heatmaps and help detect more accurate landmarks by introducing a self-calibrated mechanism and a pose attention mask.

(3) The proposed SCPAN is constructed by combining the SHN [38] with the proposed BALI fields and SCPA model. From the experimental results in Tables I, III, IV and V, we can see that SCPAN (SHN+BALI+SCPA) surpasses the SHN+BALI and the SHN+SCPA, respectively. We can conclude that by integrating the BALI fields and SCPA model into the proposed SCPAN, more effective landmark heatmaps and BALI fields can be produced, which can futher enhance the detection robustness against faces with large poses and heavy occlusions.

(4) Finally, SCPAN can be boosted with unlabeled face images, therefore, we conduct the following experiments (denoted as semi-SCPAN) by mixing CelebA dataset with 300W, AFLW, COFW and WFLW datasets, respectively. As shown in Tables I, III, IV and V, semi-SCPAN beats SCPAN on 300W, AFLW, COFW and WFLW datasets, which demonstrates that SCPAN can use unlabeled face images to improve its detection accuracy.

IV-I Experimental Results and Discussions

From the experimental results listed in Tables I-VI and the figures presented in previous subsections, we have the following observations and corresponding analyses.

(1) SCPAN, MHHN [13], AWing [59] and LUVLi [12] are all heatmap regression-based facial landmark detection methods, while the SCPAN outperforms MHHN, Awing and LUVLi as shown in Tables I-VI, which indicates that 1) the proposed BALI fields can effectively enhance the facial shape constraints, 2) the proposed SCPA model is able to learn more representative and discriminative features for producing more effective landmark heatmaps, 3) by integrating the proposed BALI fields and SCPA model into a novel SCPAN framework, the detection accuracy can be further improved.

(2) Facial shape constraints are very important for landmark detection task for faces with large poses and heavy occlusions. SCPAN, MHHN [13], ODN [29] and OpenPose [50] all aim to construct accurate facial shape constraints to improve their detection accuracy. However, from the experimental results in Tables I-VI, we can see that SCPAN outperforms the other methods for faces with large poses and complicated occlusions, which verifies that boundary and field constraints introduced by SCPAN can effectively and precisely model the spatial relationships among landmarks.

(3) Compared to other semi-supervised facial landmark detection methods including Honari et al. [33], SBR [34], TS [35] and STKI [7], semi-SCPAN outperforms them, which indicates that more effective facial prior knowledge can be learned by semi-SCPAN for achieving more robust and precise landmark detection.

IV-J Weakness of the SCPAN

Occlusion problems. Our proposed SCPAN can outperform the state-of-the-art methods for faces in normal circumstances and large poses. However, for heavily occluded faces (e.g., hair, cup and microphone as shown in Fig. 6), the field information becomes less accurate as the area around the landmarks may contain a lot of noise. But this situation can be alleviated by using the boundary heatmaps or larger field region.

The inference speed. The baseline SHN can achieve 100FPS in a single Telsa V100 GPU, while our proposed method needs to use both heatmap and field information to predict landmarks. Moreover, the size of the field region also affects the reference speed. When we use the information of the 7x7 field region, our reference speed will drop to 40 FPS. If we use multiple GPUs, the reference will speed up.

V Conclusion

Robust and precise facial landmark detection is still a very challenging topic due to inaccurate facial shape constraints modeling and insufficient labeled training samples. In this work, we present a SCPAN method to address these problems by seamlessly integrating the BALI fields and SCPA model in a semi-supervised framework. It is shown that the proposed BALI fields can effectively model the spatial relationships among landmarks and the SCPA model can learn more representative and discriminative features for producing more accurate landmark heatmaps and BALI fields, which help achieve more robust and precise facial landmark detection. Moreover, SCPAN can use unlabeled face datasets to further boost its detection accuracy, which effectively reduces the dependence on labeled datasets. Experimental results on challenging benchmark datasets demonstrate that the proposed SCPAN outperforms state-of-the-art methods. It can also be found from the experiment that landmark heatmaps, boundary heatmaps and landmark intensity fields can complement and enhance each other, which further improves the detection accuracy. In the future, we plan to further reduce the dependence of detection accuracy on label information and extend our model to other related topics, such as human pose estimation and hand pose estimation.

References

[1] Y. Wang, Y. Tang, L. Li, and H. Chen, “Modal regression-based atomic representation for robust face recognition and reconstruction,” IEEE Transactions on Cybernetics, vol. 50, pp. 4393–4405, 2020.
[2] S. Wang, H. Ding, and G. Peng, “Dual learning for facial action unit detection under nonfull annotation.” IEEE transactions on cybernetics, vol. PP, 2020.
[3] A. E. Ichim, P. Kadlecek, and L. Kavan, “Phace: physics-based face modeling and animation,” ACM Trans. Graph., vol. 36, pp. 153:1–153:14, 2017.
[4] S. Ha, M. Kersner, B. Kim, S. Seo, and D.-Y. Kim, “Marionette: Few-shot face reenactment preserving identity of unseen targets,” ArXiv, vol. abs/1911.08139, 2020.
[5] G. Yao, Y. Yuan, T. Shao, and K. Zhou, “Mesh guided one-shot face reenactment using graph convolutional networks,” Proceedings of the 28th ACM International Conference on Multimedia, 2020.
[6] W. Xie, L. Shen, and J. Duan, “Adaptive weighting of handcrafted feature losses for facial expression recognition.” IEEE transactions on cybernetics, 2019.
[7] C. Zhu, X. Li, J. Li, G. Ding, and W. Tong, “Spatial-temporal knowledge integration: Robust self-supervised facial landmark tracking,” Proceedings of the 28th ACM International Conference on Multimedia, 2020.
[8] F.-R. Xiong, Y. Xiao, Z. Cao, Y. Wang, J. T. Zhou, and J. Wu, “Ecml: An ensemble cascade metric learning mechanism towards face verification,” IEEE transactions on cybernetics, vol. PP, 2020.
[9] F. Sukno, J. Waddington, and P. Whelan, “3-d facial landmark localization with asymmetry patterns and shape regression from incomplete local features,” IEEE Transactions on Cybernetics, vol. 45, pp. 1717–1730, 2015.
[10] X. Zhao, J. Zou, H. Li, E. Dellandrea, I. A. Kakadiaris, and L. Chen, “Automatic 2.5-d facial landmarking and emotion annotation for social interaction assistance,” IEEE Trans Cybern, vol. 46, no. 9, pp. 2042–2055, 2016.
[11] X. Lin, J. Wan, Y. Xie, S. Zhang, C. Lin, Y. Liang, G. Guo, and S. Z. Li, “Task-oriented feature-fused network with multivariate dataset for joint face analysis,” IEEE transactions on cybernetics, vol. 50, no. 3, pp. 1292–1305, 2019.
[12] A. Kumar and T. K. Marks, “Luvli face alignment: Estimating landmarks’ location, uncertainty, and visibility likelihood,” in CVPR, pp. 8236–8246, 2020.
[13] J. Wan, Z. Lai, J. Liu, J. Zhou, and C. Gao, “Robust face alignment by multi-order high-precision hourglass network,” IEEE Transactions on Image Processing, vol. 30, pp. 121–133, 2021.
[14] W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou, “Look at boundary: A boundary-aware face alignment algorithm,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2129–2138.
[15] J. Wan, Z. Lai, J. Li, J. Zhou, and C. Gao, “Robust facial landmark detection by multi-order multi-constraint deep networks,” IEEE transactions on neural networks and learning systems, vol. PP, 2021.
[16] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1302–1310, 2017.
[17] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh, “Openpose: Realtime multi-person 2d pose estimation using part affinity fields,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 43, pp. 172–186, 2021.
[18] C. Sagonas, E. Antonakos, G. Tzimiropoulos, S. P. Zafeiriou, and M. Pantic, “300 faces in-the-wild challenge: database and results,” Image Vision Comput., vol. 47, pp. 3–18, 2016.
[19] J. Deng, A. Roussos, G. G. Chrysos, E. Ververas, I. Kotsia, J. Shen, and S. Zafeiriou, “The menpo benchmark for multi-pose 2d and 3d facial landmark localisation and tracking,” International Journal of Computer Vision, vol. 127, pp. 599–624, 2018.
[20] X. P. Burgosartizzu, P. Perona, and P. Dollar, “Robust face landmark estimation under occlusion,” in IEEE International Conference on Computer Vision, 2013, pp. 1513–1520.
[21] S. Zhu, C. Li, C. C. Loy, and X. Tang, “Unconstrained face alignment via cascaded compositional learning,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3409–3417.
[22] J. Shen, S. Zafeiriou, and G. G. Chrysos, “The first facial landmark tracking in-the-wild challenge: Benchmark and results,” in ICCVW, pp. 1003–1011, 2015.
[23] T. F. Cootes, C. J. Taylor, D. H. Cooper, and J. Graham, “Active shape models-their training and application,” Computer vision and image understanding, vol. 61, no. 1, pp. 38–59, 1995.
[24] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” IEEE Transactions on pattern analysis and machine intelligence, vol. 23, no. 6, pp. 681–685, 2001.
[25] D. Cristinacce and T. F. Cootes, “Feature detection and tracking with constrained local models,” in British Machine Vision Conference, 2006.
[26] X. Cao, Y. Wei, F. Wen, and J. Sun, “Face alignment by explicit shape regression,” International Journal of Computer Vision, vol. 107, no. 2, pp. 177–190, 2014.
[27] S. Ren, X. Cao, Y. Wei, and J. Sun, “Face alignment at 3000 fps via regressing local binary features,” in IEEE Conference on Computer Vision and Pattern Recognition, 2014, pp. 1685–1692.
[28] G. Trigeorgis, P. Snape, M. A. Nicolaou, E. Antonakos, and S. Zafeiriou, “Mnemonic descent method: A recurrent process applied for end-to-end face alignment,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4177–4187.
[29] M. Zhu, D. Shi, M. Zheng, and M. Sadiq, “Robust facial landmark detection via occlusion-adaptive deep networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3486–3496.
[30] X. Dong, Y. Yan, W. Ouyang, and Y. Yang, “Style aggregated network for facial landmark detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 379–388.
[31] Z. Liu, X. Zhu, G. Hu, H. Guo, M. Tang, Z. Lei, N. M. Robertson, and J. Wang, “Semantic alignment: Finding semantically consistent ground-truth for facial landmark detection,” in IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3467–3476.
[32] X. Tang, F. Guo, J. Shen, and T. Du, “Facial landmark detection by semi-supervised deep learning,” Neurocomputing, vol. 297, pp. 22–32, 2018.
[33] S. Honari, P. Molchanov, S. Tyree, P. Vincent, C. Pal, and J. Kautz, “Improving landmark localization with semi-supervised learning,” 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1546–1555, 2018.
[34] X. Dong, S.-I. Yu, X. Weng, S.-E. Wei, Y. Yang, and Y. Sheikh, “Supervision-by-registration: An unsupervised approach to improve the precision of facial landmark detectors,” in IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 360–368.
[35] X. Dong and Y. Yang, “Teacher supervises students how to learn from partially labeled images for facial landmark detection,” in ICCV, pp. 783–792, 2019.
[36] S. Yin, S. Wang, X. Chen, and E. Chen, “Exploiting self-supervised and semi-supervised learning for facial landmark tracking with unlabeled data,” Proceedings of the 28th ACM International Conference on Multimedia, 2020.
[37] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4724–4732, 2016.
[38] J. Yang, Q. Liu, and K. Zhang, “Stacked hourglass network for robust facial landmark localisation,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017, pp. 2025–2033.
[39] A. Bulat and G. Tzimiropoulos, “Super-fan: Integrated facial landmark localization and super-resolution of real-world low resolution faces in arbitrary poses with gans,” in CVPR, pp. 109–117, 2018.
[40] Y. Chen, Y. Tai, and X. Liu, “Fsrnet: End-to-end learning face super-resolution with facial priors,” in CVPR, pp. 2492–2501, 2018.
[41] C. Ma, Z. Jiang, and Y. Rao, “Deep face super-resolution with iterative collaboration between attentive recovery and landmark estimation,” in CVPR, pp. 5568–5577, 2020.
[42] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar, “Localizing parts of faces using a consensus of exemplars,” in IEEE Conference on Computer Vision and Pattern Recognition, 2011, pp. 545–552.
[43] X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition, 2012, pp. 2879–2886.
[44] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. Huang, “Interactive facial feature localization,” in European Conference on Computer Vision. Springer, 2012, pp. 679–692.
[45] L. Chen, “Kernel density network for quantifying regression uncertainty in face alignment,” 2018.
[46] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li, “Face alignment across large poses: A 3d solution,” in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 146–155.
[47] V. Jain and E. G. Learned-Miller, “Fddb: A benchmark for face detection in unconstrained settings,” 2010.
[48] A. Bulat and G. Tzimiropoulos, “How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks),” in ICCV, pp. 1021–1030, 2017.
[49] S. Zafeiriou, G. Trigeorgis, and G. Chrysos, “The menpo facial landmark localisation challenge: A step towards the solution,” in CVPRW, pp. 2116–2125, 2017.
[50] K. Sun, Y. Zhao, B. Jiang, and T. Cheng, “High-resolution representations for labeling pixels and regions,” ArXiv, vol. abs/1904.04514, 2019.
[51] Z. Tang, X. Peng, K. Li, and D. Metaxas, “Towards efficient u-nets: A coupled and quantized approach,” TPAMI, vol. 42, pp. 2038–2050, 2020.
[52] L. Chen, H. Su, and Q. Ji, “Face alignment with kernel density deep neural network,” in ICCV, pp. 6991–7001, 2019.
[53] Z. Liu, P. Luo, X. Wang, and X. Tang, “Deep learning face attributes in the wild,” in ICCV, pp. 3730–3738, 2015.
[54] T. Baltrusaitis, A. Zadeh, Y. C. Lim, and L.-P. Morency, “Openface 2.0: Facial behavior analysis toolkit,” in FG, pp. 59–66, 2018.
[55] A. Kumar and R. Chellappa, “Disentangling 3d pose in a dendritic cnn for unconstrained 2d face alignment,” in CVPR, pp. 430–439, 2018.
[56] S. Qian, K. Sun, and W. Wu, “Aggregation via separation: Boosting facial landmark detector with semi-supervised style translation,” in ICCV, pp. 10 152–10 162, 2019.
[57] A. Dapogny, K. Bailly, and M. Cord, “Decafa: Deep convolutional cascade for face alignment in the wild,” in ICCV, pp. 6892–6900, 2019.
[58] J. Wan, Z. Lai, L. Shen, J. Zhou, C. Gao, G. Xiao, and X. Hou, “Robust facial landmark detection by cross-order cross-semantic deep network,” Neural networks : the official journal of the International Neural Network Society, 2020.
[59] X. Wang, L. Bo, and F. Li, “Adaptive wing loss for robust face alignment via heatmap regression,” in ArXiv, vol. abs/1904.07399, 2019.
[60] J. Wan, J. Li, J. Chang, Y. Wu, Y. Xiao, X. Li, and H. Zheng, “Face alignment by component adaptive mechanism,” Neurocomputing, vol. 329, pp. 227–236, 2019.
[61] Z.-H. Feng, J. Kittler, M. Awais, P. Huber, and X. Wu, “Wing loss for robust facial landmark localisation with convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2235–2245.
[62] J. P. Robinson and Y. Li, “Laplace landmark localization,” in ICCV, pp. 10 102–10 111, 2019.
[63] W. Wu and S. Yang, “Leveraging intra and inter-dataset variations for robust face alignment,” in Proceedings of the IEEE conference on computer vision and pattern recognition workshops, 2017, pp. 150–159.
[64] S. Zhu, C. Li, C. Change Loy, and X. Tang, “Face alignment by coarse-to-fine shape searching,” in IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 4998–5006.
[65] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Learning deep representation for face alignment with auxiliary attributes,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, pp. 918–930, 2016.
[66] H. Liu, J. Lu, J. Feng, and J. Zhou, “Two-stream transformer networks for video-based face alignment,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, pp. 2546–2554, 2018.
[67] Y. Tai, Y. Liang, X. Liu, L. Duan, J. Li, C. Wang, F. Huang, and Y. Chen, “Towards highly accurate and stable face alignment for high-resolution videos,” ArXiv, vol. abs/1811.00342, 2019.
[68] H. Mohseni, M. Kringelbach, M. Woolrich, A. Baker, T. Aziz, and P. P. Smith, “Non-gaussian probabilistic meg source localisation based on kernel density estimation ,” Neuroimage, vol. 87, pp. 444 – 464, 2014.
[69] V. Havyarimana, Z. Xiao, P. C. Bizimana, D. Hanyurwimfura, and H. Jiang, “Toward accurate intervehicle positioning based on gnss pseudorange measurements under non-gaussian generalized errors,” IEEE Transactions on Instrumentation and Measurement, vol. 70, pp. 1–12, 2021.
[70] Y. Liang, G. Chen, S. M. Naqvi, and J. Chambers, “Independent vector analysis with multivariate student’s t-distribution source prior for speech separation,” Electronics Letters, vol. 49, pp. 1035–1036, 2013.