¹¹institutetext: SenseTime Research
¹¹email: {zhubeier, linchunze, wangquan, qianchen}@sensetime.com ²²institutetext: University of Toronto
²²email: [email protected]

Fast and Accurate: Structure Coherence Component for Face Alignment

Beier Zhu equal contribution11 Chunze Lin^* 11 Quan Wang 11 Renjie Liao 22 Chen Qian 11

Abstract

In this paper, we propose a fast and accurate coordinate regression method for face alignment. Unlike most existing facial landmark regression methods which usually employ fully connected layers to convert feature maps into landmark coordinate, we present a structure coherence component to explicitly take the relation among facial landmarks into account. Due to the geometric structure of human face, structure coherence between different facial parts provides important cues for effectively localizing facial landmarks. However, the dense connection in the fully connected layers overuses such coherence, making the important cues unable to be distinguished from all connections. Instead, our structure coherence component leverages a dynamic sparse graph structure to passing features among the most related landmarks. Furthermore, we propose a novel objective function, named Soft Wing loss, to improve the accuracy. Extensive experiments on three popular benchmarks, including WFLW, COFW and 300W, demonstrate the effectiveness of the proposed method, achieving state-of-the-art performance with fast speed. Our approach is especially robust to challenging cases resulting in impressively low failure rate ( $0\%$ and $2.88\%$ ) in COFW and WFLW datasets.

1 Introduction

Face alignment, also known as facial landmark detection is an important topic in computer vision and has attracted much attention over past few years [43, 14, 8, 44]. As a fundamental step for face image analysis, face alignment plays a key role in many face applications such as face recognition [58], expression analysis [52] and face editing [38]. Although significant progress has been made, face alignment is still a challenging problem due to issues like occlusion, large pose and complicated expression.

Refer to caption — Figure 1: Comparison between the fully connected layer and our graph convolutional layer. (a) Dense connection in the fully connected layer. (b, d) The performance of fully connected based and our Structure Coherence Component based method under different levels of occlusions. Green and red points correspond to ground-truth and prediction, respectively. (c) Sparse and relation-aware graph convolutional layer.

With the success of deep learning in several computer vision tasks such as image classification and object detection, many convolutional neural networks (CNN) based face alignment methods have been proposed. Existing CNN-based face alignment methods can mainly be divided into two categories: coordinate regression based [39, 14, 43] and heatmap regression based ones [48, 8, 37]. Heatmap regression based methods commonly produce higher precise localization for its translation equivariant property [5]. Keeping the sizes of feature maps and heatmap is essential for high accuracy. However, it will also lead to computationally heavy models which are impractical for deployment in real-world applications. Coordinate regression based methods are relatively simpler and can be built on lighter convolutional networks. The fully connected layers (FC) are commonly used in such methods to convert feature maps to facial landmark coordinates [39, 14, 43]. However, the dense connections of fully connected layers make every landmark correlate to each other. As shown in Fig. 1(a), in the FC layer, every landmark coordinate is connected to the same hidden features. The error of one landmark leads to error of all other landmarks, especially in hard cases such as occlusion. As shown in Fig. 1(b), when we progressively occlude human face, the error of face contour leads to the error of other parts of human face.

Structure coherence between different facial parts provides important cues for effectively localizing facial landmarks, which helps keep the structure of face and predict occluded landmarks. In this paper, we propose Structure Coherence Component (SCC) to convert feature maps to facial landmark coordinates by explicitly exploring the relation among facial landmarks. With the help of deep geometric learning, we treat the intermediate features of each landmark as a node, and leverage a sparse graph structure to propagate features among the neighboring nodes, see Fig. 1(c). The sparse graph structure endows the model with the the capability of using the facial structure coherence appropriately. The sparse graph structure is learnt by data-driven based neighborhood construction and dynamic weight adjustment. Fig. 1(d) shows that reasoning with structure coherence cues allows our model to correctly localize the key points in challenging real-world situations such as occlusion and large pose. As shown in Fig. 2, Structure Coherence Component consists of four parts: attention guided multi-scale feature fusion, mapping to node, dynamic adjacency matrix weighting module and graph relation network. The attention guided multi-scale feature fusion provides rich spatial details and semantic information features. The mapping to node module converts these convolutional features into graph node representations and the relation is learnt via dynamic adjacency matrix weighting module, based on which, the graph relation network effectively regresses the coordinate of facial landmarks. The proposed SCC, simple yet effective, permits more precise localization without burdening the model.

Furthermore, we propose Soft Wing Loss to handle the side-effect of Wing loss [14] on small range errors. Since the facial landmarks are not strictly defined, the annotations vary among annotators, introducing some shifts [29]. In such a case, forcing the model to fit the ground-truth with a large gradient would cause unstable training. Therefore, we make the model more focus on the errors of medium ranges.

We evaluate the proposed method on three widely-used face alignment benchmarks including WFLW [43], COFW [3] and 300W [33]. Experimental results demonstrate the effectiveness of our approach, which outperforms existing state-of-the-art regression based methods by a large margin. In addition to the great performance, our model is much faster and lighter than the closest competitors. We conduct extensive ablation studies to show the effectiveness of each proposed modules.

2 Related Work

Traditional models: Traditional facial landmark detection models mainly fall into two categories, i.e, fitting models and constrained local models. Taylor et al. introduce the Active Appearance Model (AAM) [6][11] to fits the facial images with a small number of coefficients, controlling both the facial appearance and the facial shape. Constrained local models[7][34] are introduced to predict the landmarks based on the global facial shape constraints as well as the independent local appearance information around each landmark. Locating facial landmarks with graph structure is related to some previous works[17][57][40], which apply deformable part models (DPM)[12] to face analysis. These methods belong to probabilistic graphical models, which require hand-crafted potential functions and iterative optimization for inference. However, our method is deep learning based graph network, which generates richer and more expressive feature embeddings and enjoys the faster inference.

CNN based coordinate regression models: Coordinate regression models directly map the face image to the landmark coordinates. Zhang et al. [53] improve the robustness of detection through multi-task learning, i.e., learning landmark coordinates and predicting facial attributes at the same time. Feng et al. [14] introduce a modified log loss, named Wing loss, to increase the contribution of small and medium errors to the training process. LAB [43] regresses facial landmark coordinates with the help of boundary information to reduce the annotation ambiguities. In spite of the advantage of explicit inference of landmark coordinates without any post-processing, the coordinate regression models generally underperform heatmap regression models.

CNN based heatmap regression models: Heatmap regression models leverage fully convolutional networks (FCNs) to maintain structure information throughout the whole network, and therefore outperform coordinate regression models. In recent work, stacked hourglass (HG) [30] is widely used to achieve the state-of-the-art performance. Yang et al. [48] first normalize faces with a supervised transform and then prediction heatmap using a HG. Liu et al. [29] develop a latent variable optimization strategy to reduce the impact of ambiguous annotations when training a 4-stacked HG. In addition to HG, architecture like HR-Net [37] is also able to yield excellent performance. Despite their higher accuracy, heatmap regression models are much more costly from a computational point of view compared to coordinate regression models.

Graph Neural Networks (GNNs): GNNs are a class of models which try to generalize deep learning to handle graph-structured data. They are first introduced in [35] and become more and more popular recently [1]. There are mainly two types of GNNs: Message Passing based Neural Networks [35, 24, 19] and Graph Convolution based Neural Networks [2, 21, 25]. Many recent works have shown that GNNs are very effective in many computer vision tasks, e.g., RGBD semantic segmentation [31], visual situation recognition [23], scene graph generation and reasoning [49, 36], image annotation [51], object detection [47] and 3D shape analysis[41]. Specifically, in this work, we closely follow the so-called graph convolutional network (GCN) [21] which greatly simplifies the graph convolution operator by exploiting approximation to the Chebyshev polynomial based graph spectral filters. It provides a simple yet effective way to integrate local neighboring node feature following the graph topology.

3 Approach

In this section, we present the proposed method in detail. As illustrated in Fig. 2, our Structure Coherence Component is mainly composed of four key parts: an attention guided multi-scale features fusion, a mapping to node module, dynamic adjacency matrix weighting module and a graph relation network. Given an input face image, the convolutional backbone computes feature maps of different resolutions which are carefully fused via attention guidance. A sparse graph structure is learnt by dynamic adjacency matrix weighting module. The features extracted from attention module are then mapped into graph node representation and fed into the graph relation network which outputs the coordinates of facial landmarks.

3.1 Attention Guided Multi-scale Features

Since facial landmarks detection requires extreme precise localization, preserving the spatial information are crucial for an accurate model. Heatmap based methods usually uses several hourglass structures [30] to preserve the spatial information. However, such encoder-decoder architecture is extremely heavy and slows down the inference speed. We propose an efficient attention guided multi-scale features module to improve the localization capability. Fig. 3 illustrates the architecture of this module.

Multi-scale Features: The feature maps from shallower layers encode low-level information and spatial details, while deep layers encode high semantic information [4, 27, 26]. We introduce two bottom-up branches to propagate the spatial details from shallow layers to the deepest layer. Specifically, consider a convolutional backbone composed of $L$ convolutional blocks. We denote $\bm{F}_{l}$ as the last feature maps of the $l$ -th block. We exploit the spatial information from the feature maps $\bm{F}_{L-1}$ and $\bm{F}_{L-2}$ to augment the localization precision of the features $\bm{F}_{L}$ . Each branch is composed of a $3\times 3$ Conv-BN-ReLU, an attention mechanism to filter out noisy information and a down-sampling operation. These feature maps with spatial details are then concatenated with $\bm{F}_{L}$ to form more expressive feature maps.

Semantic-guided Attention: Although the feature maps from shallow layers have rich spatial information, they also contain noisy information which are not informative from the perspective of semantic meaning. We propose semantic-guided attention module to filter out such information. Unlike existing self-attention which uses self-features to compute an attention map, we exploit the high-semantics features maps $\bm{F}_{L}$ to guide the feature maps $\bm{F}_{L-1}$ to suppress noisy information while keeping spatial details. We first upsample the feature maps $\bm{F}_{L}$ , concatenate it with $\bm{F}_{L-1}$ and reduce the channel dimension into the channel of $\bm{F}_{L-1}$ via an $1\times 1$ convolution, obtaining $\tilde{\bm{F}}_{L-1}$ . As $\tilde{\bm{F}}_{L-1}$ merges the information from $\bm{F}_{L}$ and $\bm{F}_{L-1}$ , it contains both semantic information and spatial details. We then use the attention module described in [42] to generate spatial attention $\bm{A}^{s}$ and channel-wise attention $\bm{A}^{c}$ from $\tilde{\bm{F}}_{L}$ as:

	$\displaystyle\bm{A}^{c}$	$\displaystyle=\sigma(\mathbf{W}_{1}\rho(\mathbf{W}_{0}\tilde{\bm{F}}^{c}_{\text{avg}})+\mathbf{W}_{1}\rho(\mathbf{W}_{0}\tilde{\bm{F}}^{c}_{\text{max}}))$
	$\displaystyle\bm{A}^{s}$	$\displaystyle=\sigma(\text{conv}_{7\times 7}([\tilde{\bm{F}}^{s}_{\text{avg}};\tilde{\bm{F}}^{s}_{\text{max}}])))\text{,}$		(1)

where $\mathbf{W}_{0}\in\mathbb{R}^{C/2\times C}$ and $\mathbf{W}_{1}\in\mathbb{R}^{C\times C/2}$ , $C$ is the channel number, $\text{conv}_{7\times 7}$ denotes an $7\times 7$ convolution operation, $\sigma$ , $\rho$ and $[\cdot;\cdot]$ denote sigmoid function, ReLU activation and concatenation operation respectively, $\tilde{\bm{F}}^{c}_{\text{avg}}$ / $\tilde{\bm{F}}^{s}_{\text{avg}}$ and $\tilde{\bm{F}}^{c}_{\text{max}}$ / $\tilde{\bm{F}}^{s}_{\text{max}}$ denote spatially/channel-wise average-pooled features and max-pooled features, respectively. The notation $L$ is omitted for more clarity. The attentive features are then obtained via element-wise multiplication and residual addition. Similarly, we compute the attention for feature maps $\tilde{\bm{F}}_{L-2}$ with messages from features $\tilde{\bm{F}}_{L-1}$ and $\tilde{\bm{F}}_{L}$ , and obtain the attentive features. Finally, we concatenate these features to form the attention guided multi-scale features $\bm{F}_{A}$ . Note that designing the attention module is not our main focus, we adopt the commonly used attention module [42] in our semantic-guided process.

3.2 Graph Relation Network

As the relative spatial relationship of facial landmarks is stable, it is desirable to capture and exploit such important cues. We statically calculate the correlation between face landmarks from data analysis and leverage graph relation networks to effectively explore these relation information.

Map-to-Node Module: In order to make our network end-to-end trainable, we design the map-to-node module to seamlessly map convolutional feature maps to graph node representations. The input convolutional feature maps $\bm{F}_{A}\in\mathbb{R}^{C\times H\times W}$ (where $C$ , $H$ and $W$ represents number of channels, height and width) are first transformed to the hidden feature maps by non-linear function $\bm{Z}=\phi(\bm{F}_{A})\in\mathbb{R}^{Nn\times H\times W}$ , where $n\in\mathbb{Z}^{+}$ is the expansion coefficient and $N$ is the number of landmarks. In this paper, we consider two convolution-BN-ReLU blocks with $n=4$ as the non-linear function $\phi(\cdot)$ . $\bm{Z}$ is then reshaped to $\bm{Z}_{\mathrm{node}}\in\mathbb{R}^{N\times nHW}$ to represent the node feature.

Graph Convolution: Unlike standard convolutions that operate on local Euclidean structures, e.g., a image grid, the goal of GCN is to learn a function $f(\cdot,\cdot)$ on a graph $\mathcal{G}$ , which takes node feature $\bm{H}^{l}\in\mathbb{R}^{N\times d_{l}}$ and the corresponding adjacency matrix $\bm{A}\in\mathbb{R}^{N\times N}$ as input, and outputs the node features as $\bm{H}^{l+1}\in\mathbb{R}^{N\times d_{l+1}}$ . Here $N$ , $l$ , $d_{l}$ and $d_{l+1}$ denote the number of nodes, layer index, the dimension of input node features and the dimension of output node feature respectively. Every GCN layer can be written as a non-linear function by,

\bm{H}^{l+1}=f(\bm{H}^{l},\bm{A})

(2)

With the specific graph convolutional operators employed by [21], the layer can be represented as,

\bm{H}^{l+1}=\psi(\widetilde{\bm{D}}^{-\frac{1}{2}}\widetilde{\bm{A}}\widetilde{\bm{D}}^{-\frac{1}{2}}\bm{H}^{l}\bm{W}^{l})

(3)

where $\bm{W}^{l}\in\mathbb{R}^{d_{l}\times d_{l+1}}$ is a transformation matrix to be learned, $\widetilde{\bm{A}}=\bm{A}+\bm{I}$ , $\widetilde{\bm{D}}$ is the degree matrix of $\widetilde{\bm{A}}$ , $\widetilde{\bm{D}}^{-\frac{1}{2}}\widetilde{\bm{A}}\widetilde{\bm{D}}^{-\frac{1}{2}}$ is the symmetric normalized version of $\widetilde{\bm{A}}$ and $\psi(\cdot)$ denotes BN-ReLU operation.

Neighborhoods Construction: Graph relation network propagates information between nodes based on the adjacency matrix which is crucial to be correctly constructed. In our problem, due to the lack of pre-defined adjacency matrix for facial landmarks, we build it through a data-driven way, i.e., treating each landmark as a node and mining the correlation between landmarks within the dataset. Specifically, we assemble the landmark coordinates of the dataset into a rank-three data tensor $\bm{T}\in\mathbb{R}^{M\times N\times 2}$ where $M$ is the number of images, and the last dimension represents the $(x,y)$ coordinates. We then slice the tensor $\bm{T}$ along the last dimension to generate $\bm{T}_{x}$ and $\bm{T}_{y}$ . Based on $\bm{T}_{x}\in\mathbb{R}^{M\times N}$ and $\bm{T}_{y}\in\mathbb{R}^{M\times N}$ , we calculate Pearson’s correlation coefficient in $x$ and $y$ direction respectively to form correlation matrices $\bm{C}_{x}\in\mathbb{R}^{N\times N}$ and $\bm{C}_{y}\in\mathbb{R}^{N\times N}$ . Then, the correlation between nodes is defined as:

\bm{C}=\frac{1}{2}(\text{abs}(\bm{C}_{x})+\text{abs}(\bm{C}_{y}))

(4)

where $\text{abs}(\cdot)$ returns element-wise absolute value of matrix. Considering the computation cost and noisy edges, we only retain the top $k+1$ largest value of each row of $\bm{C}$ to form a sparse adjacency matrix $\bm{M}$ . In other words, most $k$ relevant landmarks are picked as the neighborhood of each landmark. The binary adjacency matrix with self-loops can be written as:

\bm{M}_{ij}=\left\{\begin{array}[]{ll}1,&\text{if}\ \bm{C}_{ij}\in\text{Top}^{k+1}_{t=1,...,N}(\bm{C}_{it})\\ 0,&\text{otherwise}\end{array}\right.

(5)

Dynamic Adjacency Matrix Weighting: The static adjacency matrix $\bm{M}$ is constructed based on the geometric structure of facial landmarks, while learning relationship among landmarks for each face aims to take the facial appearance factors like occlusion and head pose into consideration. Given the binary matrix $\bm{M}$ which determines the node neighborhoods, we seek to adaptively adjust its weights.

Formally, given the features $\bm{Z}$ extracted from the map-to-node modules, we use the global average pooling layer followed by two fully connected layers to map $\bm{Z}$ to vector $\bm{a}$ whose size is equal to the non zeros in $\bm{M}$ . Finally, we replace the non zeros value in $\bm{M}$ with $\bm{a}$ to form the dynamic adjacency matrix $\bm{A}$ . Following the strategy in [54], we adopt a row-wise softmax operation $\sigma_{\text{row}}$ to replace the symmetric normalization in Eq. 3. Softmax operation makes the weights of each node like probabilities over its neighboring nodes, which stabilizes the training process:

\bm{H}^{l+1}=\psi(\sigma_{\text{row}}(\bm{A})\bm{H}^{l}\bm{W}^{l})

(6)

We use the binary matrix $\bm{M}$ to hold the neighborhoods and only learn their weights because the facial shape pattern are stable, fix the sparse connection will greatly reduce the training parameters which makes the learning process easier.

Graph Relation Network: Inspired by the success of ResNet[20], we adopt the graph residual block architecture. Each block consists of two graph convolutional layers and can be formulated based on Eq. (2) as

	$\displaystyle\bm{H}^{l+1}$	$\displaystyle=f(\bm{H}^{l},\bm{A})$
	$\displaystyle\bm{H}^{l+2}$	$\displaystyle=f(\bm{H}^{l+1},\bm{A})+\bm{H}^{l}$		(7)

The overall graph relation network architecture is shown in Fig. 2. The input feature $\bm{Z}_{node}$ is first fed to graph convolution, followed by several graph residual blocks. The last graph convolution (without batch normalization and ReLU) block maps the hidden node features to landmark coordinates $\bm{O}\in\mathbb{R}^{N\times 2}$ .

Comparison with FC-based regression methods. The fully connected layer and our graph convolutional layers embed the feature of landmarks in two different ways. As shown in Fig.1(a) The CNN backbone and the hidden fully connected layer map the input facial image to the hidden vector, which embed the feature of landmarks globally. Thus, the errors of some parts of the prediction effects the other parts, as they share the same hidden feature. As shown in Fig. 1(b), for the FC-based method, the errors of occluded part interfere the prediction of other visible parts. Meanwhile, our SCC embeds the node feature for each landmark, and propagates node feature according to their relationship. If some parts of predictions fail because of the occlusion, large pose or other hard condition, the node feature of other parts degrade gracefully because of the sparse connection among the node features and the dynamic adjustment of the relationship. As shown in Fig. 1(d), the SCC-based method are more robust to hard cases. Besides, fully connected layers are prone to overfit because of the large number of trainable parameters, while the graph convolution layer requires fewer trainable parameters.

3.3 Soft Wing Loss

Wing loss[14] has constant gradient when error is large, and large gradient for small or medium range errors, which is defined as:

\mathrm{Wing}(x)=\left\{\begin{array}[]{ll}\omega\ln({1+\frac{|x|}{\epsilon}})&\mathrm{if}\ |x|<\omega\\ |x|-C&\mathrm{otherwise}\end{array}\right.

(8)

where $x$ is error and $C$ is $\omega-\omega\ln(1+\omega/\epsilon)$ to smoothly link two piece-wise functions. According to our experiment, the performance of Wing loss is not consistently better than L1 loss, especially when we train the neural networks on difficult dataset with heavy occlusion and blur, such as WFLW. As mentioned in [29], this may be caused by inconsistent annotations due to various reasons, e.g., unclear or inaccurate definition of some landmarks, poor quality of some facial images. Imposing a large gradient magnitude around very small error to force the model exactly fit the ground truth landmarks makes the training process unstable. To alleviate this problem, we present Soft Wing loss to more focus on the errors of medium range:

\text{SoftWing}(x)=\left\{\begin{array}[]{ll}|x|&\text{if}\ |x|<\omega_{1}\\ \omega_{2}\ln({1+\frac{|x|}{\epsilon}})+B&\text{otherwise}\end{array}\right.

(9)

which is linear for small values, and take the curve of $\ln(\cdot)$ for medium and large values. Similar to Wing loss, we use the non-negative $\omega_{1}$ to switch between linear and non-linear part, and $\epsilon$ to limit the curvature of the non-linear part. $B$ is set to $\omega_{1}-\omega_{2}\ln(1+\omega_{1}/\epsilon)$ to make function continuous at $\omega_{1}$ . The visualization of L1, Wing and our Soft Wing loss is shown in Fig.4. Note that we discard the linear part of Wing loss, since our proposed loss can adaptively adjust the magnitude of gradient between medium ( $\omega_{1}<|x|<\omega_{2}$ ) and large errors ( $|x|>\omega_{2}$ ). The magnitude of gradient of the non-linear part is $\frac{\omega_{2}}{|x|+\epsilon}\approx\frac{\omega_{2}}{|x|}$ ( $\epsilon$ is commonly set to small value). Our proposed loss is insensitive to outliers where the gradient varies between $[\frac{\omega_{2}}{C},1]$ ( $C$ is the image size). Note that $\omega_{2}$ should not set to small value because it will cause gradient vanishing problem.

Metric	Method	Fullset	Pose	Expression	Illumination	Make-up	Occlusion	Blur
NME	DVLN17[44]	6.08	11.54	6.78	5.73	5.98	7.33	6.88
	LAB18 [43]	5.27	10.24	5.51	5.23	5.15	6.79	6.32
	Wing18 [14]	5.11	8.75	5.36	4.93	5.41	6.37	5.81
	AGCFN19 [28]	4.90	8.78	5.00	4.93	4.85	6.26	5.73
	LAB18 [43] + AVS19 [32]	4.76	8.21	5.14	4.51	5.00	5.76	5.43
	DeCaFA19 [8]	4.62	8.11	4.65	4.41	4.63	5.74	5.38
	HRNet19 [37]	4.60	7.94	4.85	4.55	4.29	5.44	5.42
	Ours	4.40	7.52	4.65	4.31	4.36	5.23	5.04
FR	DVLN17[44]	10.84	46.93	11.15	7.31	11.65	16.30	13.71
	LAB18 [43]	7.56	28.83	6.37	6.73	7.77	13.72	10.74
	Wing18 [14]	6.00	22.72	4.78	4.30	7.77	12.50	7.76
	AGCFN19 [28]	5.92	24.23	5.41	4.72	5.82	11.00	8.79
	LAB18 [43] + AVS19 [32]	5.24	20.86	4.78	3.72	6.31	9.51	7.24
	DeCaFA19 [8]	4.84	21.4	3.73	3.22	6.15	9.26	6.61
	Ours	2.88	13.80	2.55	2.29	2.43	5.98	4.14
AUC	DVLN17[44]	0.4551	0.1474	0.3889	0.4743	0.4494	0.3794	0.3973
	LAB18 [43]	0.5323	0.2345	0.4951	0.5433	0.5394	0.4490	0.4630
	Wing18 [14]	0.5504	0.3100	0.4959	0.5408	0.5582	0.4885	0.4918
	AGCFN19 [28]	0.5452	0.2826	0.5267	0.5511	0.5547	0.4621	0.4823
	LAB18 [43] + AVS19 [32]	0.5460	0.2764	0.5098	0.5660	0.5349	0.4700	0.4923
	DeCaFA19 [8]	0.563	0.292	0.546	0.579	0.575	0.485	0.494
	Ours	0.5666	0.2981	0.5430	0.5761	0.5710	0.4936	0.5095

Table 1: Evaluation of our method and state-of-the-art approaches on Fullset and six typical subsets of WFLW. The results in terms of normalized mean error, NME (

\%

), failure rate at

10\%

, FR (

\%

) and AUC are reported.

4 Experiments

In this section, we evaluate our method on three popular face alignment benchmarks, compare with state-of-the-art approaches and conduct the ablation study.

4.1 Experimental Setup

Datasets: We conduct evaluation on three widely-adopted challenging datasets: WFLW [43], COFW [3] and 300W [33]. WFLW is among the most challenging face alignment benchmark which includes various hard cases such as heavy occlusion, blur and large pose. COFW is collected to present faces with large variations in shape and occlusions in real-world conditions. Various types of occlusions are introduced and result in a $23\%$ occlusion on facial parts on average. We also use the re-annotated test set [18] with 68 landmarks annotation for cross-dataset validation. 300W contains face images with moderate variations in pose, expression and illumination.

Evaluation Metric: We evaluate the proposed method with normalized mean error and failure rate. we use the inter-ocular distance as the normalization factor. Following the protocol in [43], the failure rate for a maximum error of 0.1 is reported. Area under curve (AUC) is also calculated based on the cumulative error distribution for WFLW dataset.

4.2 Implementation details

All training images are center-cropped and resized to $256\times 256$ . Data augmentation is performed with random rotation ( $\pm 40^{\circ}$ ), translation ( $\pm 30\ px$ ), flipping ( $50\ \%$ ), rescaling ( $\pm 10\ \%$ ) and occlusion ( $20\ \%$ of image size). To mitigate the issue of pose variations, we adopt the Pose-based Data Balancing (PDB)[14] strategy with 9 bins. We use ResNet34[20] as our backbone. During the training, we employ vanilla SGD for optimization with a batch size of $64$ for $500$ epochs. We set the weight decay and the momentum to $0.0005$ and $0.9$ respectively. The initial learning rate is $0.01$ which is dropped by 5 every $100$ epochs. The parameters of the Soft Wing loss are set to $\omega_{1}=2$ , $\omega_{2}=20$ and $\epsilon=0.5$ . The $k$ is set to 3 for adjacency matrix. We use 4 graph residual blocks with hidden feature dimension 128. Our models are trained from scratch using Pytorch.

4.3 Comparison with the State of the Art

WFLW: We evaluate our approach on the WFLW dataset and compare with state-of-the-art methods in terms of mean error, failure rate and AUC. To better understand the effectiveness of the proposed method, we analyse the performance on six subsets with specific issue, e.g., large pose, occlusion and exaggerated expression [43]. The overall results are tabulated in Table 1. The proposed method achieves $4.40\%$ NME, $2.88\%$ failure rate and $0.5666$ AUC, which outperforms most state-of-the-art approaches. Our method fails on only $2.88\%$ of all images, which demonstrates the robustness of our model. Qualitative results are depicted in Fig. 5, where our model successively localizes landmarks in hard cases.

Method	Trained on COFW		Trained on 300W
Method	NME	FR	NME	FR
TCDCN14[53]	-	-	7.66	16.17
SAPM15[16]	-	-	6.64	5.72
CFSS15[56]	-	-	6.28	9.07
HPM14[18]	7.50	13.00	6.72	6.71
CCR15[13]	7.03	10.9	-	-
DRDA16[50]	6.46	6.00	-	-
RAR16[46]	6.03	4.14	-	-
SFPD17[45]	6.40	-	-	-
DAC-CSR17[15]	6.03	4.73	-	-
Wing18[14]	5.44	3.37	-	-
ODN19[55]	5.30	-	-	-
LAB18[43]	3.92	0.39	4.62	2.17
SAN18[9] + AVS19[32]	-	-	4.43	2.82
Ours	3.63	0	4.18	0

Table 2: Evaluation on the COFW dataset in terms of NME (

\%

) and Failure Rate (

\%

) at

10\%

COFW: As shown in in Table 2, our method achieves state-of-the-art performance with $3.63\%$ mean error and $0\%$ failure rate. To further verify the generalization capability of our method, we conduct a cross-dataset evaluation using COFW-68 dataset annotated with 68 landmarks [18]. Our method outperforms the existing best approaches by a large margin, with $4.18\%$ mean error and $0\%$ failure rate. Since the COFW dataset is mainly composed of occluded faces, this impressive performance indicates the robustness of our graph relation framework to handle heavy occlusions.

300W: We compare our approach against the existing best performing methods on the 300W dataset. The results are reported in Table 4.3. Our method outperforms most existing approaches. Note that our method achieves the best results on the challenging subset, which highlights the robustness of the proposed approach in hard cases.

Method	Common	Challenging	Full
PCD-CNN18 [22]	3.67	7.62	4.44
CPM+SBR18 [10]	3.28	7.58	4.10
SAN18 [9]	3.34	6.60	3.98
LAB18 [43]	2.98	5.19	3.49
DeCaFA19 [8]	2.93	5.26	3.39
HRNet19 [37]	2.87	5.15	3.32
Ours	2.88	4.93	3.28

Table 3: Evaluation on the 300W Common subset, Challenging subset and Fullset in terms of mean error(

\%

Model	# params (M)	FLOPS (G)	RT (ms)
SAN [9]	199.63	-	343
LAB [43]	32.05	28.583	60
Wing [14]	24.75	5.396	30
Ours	24.68	5.165	23

Table 4: Efficiency comparison in terms of number of parameters, FLOPS and runtime.

Efficiency Comparison: Since facial landmark detection is widely deployed for many real-time applications, the model size, FLOPS and processing speed are key criteria. We evaluate the runtime of our model on a 1080Ti GPU and compare with existing methods in Table 4.3. Our model only takes $23$ ms and $5.165$ FLOPS to process a $256\times 256$ input image and consists of $\sim 24$ M parameters. Overall, our model is faster and smaller than most competitors.

4.4 Ablation Study

Our framework is composed of several pivotal modules such as graph relation network, attention guided multi-scale features, and soft-wing loss. Based on the baseline Resnet34 with $L=5$ layer stages, we examine the contributions of each proposed module on the WFLW dataset and report the overall results in Table 4.4.

Component	Choice
Fully connected	✓
GN		✓	✓	✓		✓
Attention m-s F.			✓		✓	✓
Soft-Wing loss				✓	✓	✓
GN w/o DW					✓
NME ( $\%$ )	5.95	4.64	4.53	4.52	4.47	4.40

Table 5: Ablation study on components on the WFLW dataset. GN: Graph Network. DW: Dynamic adjacency matrix Weighting

Design Choice	NME ( $\%$ )
Self-attention	4.61
Semantic-guided attention	4.53
Feature maps $\{\bm{F}_{5}\}$	4.64
Feature maps $\{\bm{F}_{4},\bm{F}_{5}\}$	4.57
Feature maps $\{\bm{F}_{3},\bm{F}_{4},\bm{F}_{5}\}$	4.53
Feature maps $\{\bm{F}_{2},\bm{F}_{3},\bm{F}_{4},\bm{F}_{5}\}$	4.56

Table 6: Ablation study on attention generation methods and feature maps for spatial message propagation.

Baseline Model: We first utilize the FC layers to directly regress the facial landmarks. This model is our baseline which achieves a NME of $5.95\%$ .

Graph Relation Network: The graph relation network is a key part of our Structure Coherency Component. We obtain a $1.31\%$ improvement by replacing the FC layers with our graph relation network, resulting in a NME of $4.64\%$ .

Top-k value for adjacency matrix: We report the result with different values of $k$ from $k=1$ to $k=97$ in Fig. 7. When $k=3$ , our model achieves the best performance on WFLW dataset. Note that, the performance degrades if the adjacency matrix is too sparse or too dense. When $k$ is too small, each graph node can not get sufficient information from its correlated neighborhoods, meanwhile, when $k$ is too large, the adjacency matrix becomes dense which leads to oversmoothing of the node features.

Dynamic adjacency matrix weighting: We replace the dynamic adjacency matrix with the binary adjacency matrix $\bm{M}$ and we observe a $0.07\%$ degradation.

Top k	1	2	3	4	5	10	20	40	97
NME	4.44	4.43	4.40	4.46	4.50	4.59	4.69	4.71	4.75

Table 7: NME(%) comparison with different values of

k

k

is for building adjacency matrix.

Attention guided multi-scale feature: The attention guided multi-scale features fusion plays a key role in improving the representation capability of features. By endowing the spatial details to high-semantics features, the model performs a NME of $4.53\%$ , which corresponds to a $0.11\%$ improvement.

Semantic-guided attention: We examine the importance of incorporating additional semantic information from deeper layers to guide generating attention maps. To this end, we degenerate our semantic-guided attention structure into a general self-attention mechanism. As shown in Table 4.4, we observe a drop of performance, resulting in a NME of $4.61\%$ . This experiment proves that the semantic information from high-level features is crucial to guide generating high quality attention. The quantitative performances are supported by the qualitative results illustrated in Fig. 6. The semantic guidance permits to make the feature maps focus on all visible key facial parts. Since self-attention only explore self-information, it only highlights the high-activated part in feature maps.

Features combination: To improve the localization capability, we propagate the spatial information from the shallower layers. We study which combination of feature layers is the optimal one. As tabulated in Table 4.4, the performance increases with the the additional spatial information propagation and the combination of features $\{\bm{F}_{3},\bm{F}_{4},\bm{F}_{5}\}$ yields best results. Since layer 2 is quite shallow, $\bm{F}_{2}$ consists few useful information and limits the performance due to the noise.

Soft Wing Loss: The Soft Wing loss improves the results of the graph relation network by $0.12\%$ . We compare the performance of L1, Wing and our Soft Wing loss based on our baseline model, the results are shown in Table 8. Our Soft Wing loss consistently outperforms Wing loss and L1 loss. As discussed in Section 3.3, the performance of Wing loss degrades when $\epsilon$ decreases, while our loss benefits from imposing larger gradients on medium range errors. The performance of Wing loss is even worse than L1 loss when $\epsilon$ is very small.

epsilon	0.1	0.2	0.5	1	1.5	2
L1	5.95
Wing	6.52	5.98	5.81	5.78	5.75	5.72
SoftWing	5.70	5.66	5.67	5.71	5.70	5.71

Table 8: Comparison of different loss functions. Analysis shows the effectiveness of Soft Wing loss in terms of the NME (%).

5 Conclusion

In this paper, we propose a fast and accurate face alignment method. We present a structure coherence component which consists of attention guided multi-scale feature fusion, mapping to node, dynamic adjacency matrix weighting module and graph relation network. We utilize the relation among facial parts appropriately, which permits precise localization of facial landmarks under hard cases. Experimental results in three challenging face alignment benchmarks demonstrate the effectiveness of the proposed method.

References

[1] Bronstein, M.M., Bruna, J., LeCun, Y., Szlam, A., Vandergheynst, P.: Geometric deep learning: Going beyond euclidean data. IEEE SPM 34(4), 18–42 (July 2017)
[2] Bruna, J., Zaremba, W., Szlam, A., LeCun, Y.: Spectral networks and locally connected networks on graphs. arXiv preprint arXiv:1312.6203 (2013)
[3] Burgos-Artizzu, X.P., Perona, P., Dollar, P.: Robust face landmark estimation under occlusion. In: ICCV (December 2013)
[4] Cai, Z., Fan, Q., Feris, R.S., Vasconcelos, N.: A unified multi-scale deep convolutional neural network for fast object detection. In: ECCV. pp. 354–370 (2016)
[5] Cohen, T., Welling, M.: Group equivariant convolutional networks. In: ICML. pp. 2990–2999 (2016)
[6] Cootes, T.F., Edwards, G.J., Taylor, C.J.: Active appearance models. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(6), 681–685 (June 2001)
[7] Cristinacce, D., Cootes, T.: Feature detection and tracking with constrained local models. Pattern Recognit. 41, 929–938 (01 2006)
[8] Dapogny, A., Bailly, K., Cord, M.: Decafa: Deep convolutional cascade for face alignment in the wild. In: ICCV (2019)
[9] Dong, X., Yan, Y., Ouyang, W., Yang, Y.: Style aggregated network for facial landmark detection. In: CVPR. pp. 379–388 (2018)
[10] Dong, X., Yu, S.I., Weng, X., Wei, S.E., Yang, Y., Sheikh, Y.: Supervision-by-registration: An unsupervised approach to improve the precision of facial landmark detectors. In: CVPR. pp. 360–368 (2018)
[11] Edwards, G.J., Taylor, C.J., Cootes, T.F.: Interpreting face images using active appearance models. In: Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition. pp. 300–305 (1998)
[12] Felzenszwalb, P.F., Girshick, R.B., McAllester, D., Ramanan, D.: Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence 32(9), 1627–1645 (2009)
[13] Feng, Z., Hu, G., Kittler, J., Christmas, W., Wu, X.: Cascaded collaborative regression for robust facial landmark detection trained using a mixture of synthetic and real images with dynamic weighting. IEEE TIP 24(11), 3425–3440 (Nov 2015)
[14] Feng, Z.H., Kittler, J., Awais, M., Huber, P., Wu, X.: Wing loss for robust facial landmark localisation with convolutional neural networks. In: CVPR. pp. 2235–2245 (2017)
[15] Feng, Z.H., Kittler, J., Christmas, W., Huber, P., Wu, X.J.: Dynamic attention-controlled cascaded shape regression exploiting training data augmentation and fuzzy-set sample weighting. In: CVPR (July 2017)
[16] Ghiasi, G., Fowlkes, C.: Using segmentation to predict the absence of occluded parts. In: BMVC. pp. 22.1–22.12 (September 2015)
[17] Ghiasi, G., Fowlkes, C.C.: Occlusion coherence: Localizing occluded faces with a hierarchical deformable part model. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 2385–2392 (2014)
[18] Ghiasi, G., Fowlkes, C.C.: Occlusion coherence: Localizing occluded faces with a hierarchical deformable part model. In: CVPR (June 2014)
[19] Gilmer, J., Schoenholz, S.S., Riley, P.F., Vinyals, O., Dahl, G.E.: Neural message passing for quantum chemistry. In: ICML. pp. 1263–1272 (2017)
[20] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 770–778 (June 2016)
[21] Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. In: ICLR. pp. 1–10 (2017)
[22] Kumar, A., Chellappa, R.: Disentangling 3d pose in a dendritic cnn for unconstrained 2d face alignment. In: CVPR. pp. 430–439 (2018)
[23] Li, R., Tapaswi, M., Liao, R., Jia, J., Urtasun, R., Fidler, S.: Situation recognition with graph neural networks. In: ICCV. pp. 4173–4182 (2017)
[24] Li, Y., Tarlow, D., Brockschmidt, M., Zemel, R.: Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493 (2015)
[25] Liao, R., Zhao, Z., Urtasun, R., Zemel, R.S.: Lanczosnet: Multi-scale deep graph convolutional networks. arXiv preprint arXiv:1901.01484 (2019)
[26] Lin, C., Lu, J., Wang, G., Zhou, J.: Graininess-aware deep feature learning for pedestrian detection. In: The European Conference on Computer Vision (ECCV) (September 2018)
[27] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: CVPR. pp. 2117–2125 (2017)
[28] Liu, X., Wang, H., Zhou, J., Tao, L.: Attention-guided coarse-to-fine network for 2d face alignment in the wild. IEEE Access 7, 97196–97207 (2019)
[29] Liu, Z., Zhu, X., Hu, G., Guo, H., Tang, M., Lei, Z., Robertson, N.M., Wang, J.: Semantic alignment: Finding semantically consistent ground-truth for facial landmark detection. In: CVPR (June 2019)
[30] Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: ECCV. pp. 483–499 (2016)
[31] Qi, X., Liao, R., Jia, J., Fidler, S., Urtasun, R.: 3d graph neural networks for rgbd semantic segmentation. In: ICCV. pp. 5199–5208 (2017)
[32] Qian, S., Sun, K., Wu, W., Qian, C., Jia, J.: Aggregation via separation: Boosting facial landmark detector with semi-supervised style translation. In: ICCV (2019)
[33] Sagonas, C., Antonakos, E., Tzimiropoulos, G., Zafeiriou, S., Pantic, M.: 300 faces in-the-wild challenge: database and results. Image and Vision Computing 47, 3 – 18 (2016), 300-W, the First Automatic Facial Landmark Detection in-the-Wild Challenge
[34] Saragih, J., Lucey, S., Cohn, J.: Deformable model fitting by regularized landmark mean-shift. International Journal of Computer Vision 91, 200–215 (01 2011). https://doi.org/10.1007/s11263-010-0380-4
[35] Scarselli, F., Gori, M., Tsoi, A.C., Hagenbuchner, M., Monfardini, G.: The graph neural network model. IEEE NN 20(1), 61–80 (Jan 2009)
[36] Shi, J., Zhang, H., Li, J.: Explainable and explicit visual reasoning over scene graphs. In: CVPR (June 2019)
[37] Sun, K., Zhao, Y., Jiang, B., Cheng, T., Xiao, B., Liu, D., Mu, Y., Wang, X., Liu, W., Wang, J.: High-resolution representations for labeling pixels and regions. CoRR abs/1904.04514 (2019)
[38] Thies, J., Zollhofer, M., Stamminger, M., Theobalt, C., Nießner, M.: Face2face: Real-time face capture and reenactment of rgb videos. In: CVPR. pp. 2387–2395 (2016)
[39] Trigeorgis, G., Snape, P., Nicolaou, M.A., Antonakos, E., Zafeiriou, S.: Mnemonic descent method: A recurrent process applied for end-to-end face alignment. In: CVPR (June 2016)
[40] Valstar, M., Martinez, B., Binefa, X., Pantic, M.: Facial point detection using boosted regression and graph models. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 2729–2736. IEEE (2010)
[41] Verma, N., Boyer, E., Verbeek, J.: FeaStNet: Feature-Steered Graph Convolutions for 3D Shape Analysis. In: CVPR - IEEE Conference on Computer Vision & Pattern Recognition. pp. 2598–2606. IEEE, Salt Lake City, United States (2018)
[42] Woo, S., Park, J., Lee, J.Y., So Kweon, I.: Cbam: Convolutional block attention module. In: ECCV. pp. 3–19 (2018)
[43] Wu, W., Qian, C., Yang, S., Wang, Q., Cai, Y., Zhou, Q.: Look at boundary: A boundary-aware face alignment algorithm. In: CVPR (2018)
[44] Wu, W., Yang, S.: Leveraging intra and inter-dataset variations for robust face alignment. In: CVPRW (July 2017)
[45] Wu, Y., Gou, C., Ji, Q.: Simultaneous facial landmark detection, pose and deformation estimation under facial occlusion. In: CVPR (July 2017)
[46] Xiao, S., Feng, J., Xing, J., Lai, H., Yan, S., Kassim, A.A.: Robust facial landmark detection via recurrent attentive-refinement networks. In: ECCV. pp. 57–72 (2016)
[47] Xu, H., Jiang, C., Liang, X., Li, Z.: Spatial-aware graph relation network for large-scale object detection. In: CVPR (June 2019)
[48] Yang, J., Liu, Q., Zhang, K.: Stacked hourglass network for robust facial landmark localisation. In: CVPRW. pp. 2025–2033 (July 2017)
[49] Yang, J., Lu, J., Lee, S., Batra, D., Parikh, D.: Graph r-cnn for scene graph generation. In: ECCV. pp. 670–685 (2018)
[50] Zhang, J., Kan, M., Shan, S., Chen, X.: Occlusion-free face alignment: Deep regression networks coupled with de-corrupt autoencoders. In: CVPR (June 2016)
[51] Zhang, J., Wu, Q., Zhang, J., Shen, C., Lu, J.: Mind your neighbours: Image annotation with metadata neighbourhood graph co-attention networks. In: CVPR (June 2019)
[52] Zhang, Y., Zhao, R., Dong, W., Hu, B.G., Ji, Q.: Bilateral ordinal relevance multi-instance regression for facial action unit intensity estimation. In: CVPR. pp. 7034–7043 (2018)
[53] Zhang, Z., Luo, P., Loy, C.C., Tang, X.: Facial landmark detection by deep multi-task learning. In: ECCV. pp. 94–108. Cham (2014)
[54] Zhao, L., Peng, X., Tian, Y., Kapadia, M., Metaxas, D.N.: Semantic graph convolutional networks for 3d human pose regression. In: CVPR. pp. 3425–3435 (2019)
[55] Zhu, M., Shi, D., Zheng, M., Sadiq, M.: Robust facial landmark detection via occlusion-adaptive deep networks. In: CVPR (June 2019)
[56] Zhu, S., Li, C., Loy, C.C., Tang, X.: Face alignment by coarse-to-fine shape searching. In: CVPR. pp. 4998–5006 (2015)
[57] Zhu, X., Ramanan, D.: Face detection, pose estimation, and landmark localization in the wild. In: 2012 IEEE conference on computer vision and pattern recognition. pp. 2879–2886. IEEE (2012)
[58] Zhu, X., Lei, Z., Yan, J., Yi, D., Li, S.Z.: High-fidelity pose and expression normalization for face recognition in the wild. In: CVPR. pp. 787–796 (2015)