

Heatmap Regression via Randomized Rounding

Baosheng Yu and Dacheng Tao. Baosheng Yu is with The University of Sydney, Australia. E-mail: [email protected]. Dacheng Tao is with JD Explore Academy, China and The University of Sydney, Australia. E-mail: [email protected]. Corresponding author: Dacheng Tao.
Abstract

Heatmap regression has become the mainstream methodology for deep learning-based semantic landmark localization, including facial landmark localization and human pose estimation. Although heatmap regression is robust to large variations in pose, illumination, and occlusion in unconstrained settings, it usually suffers from a sub-pixel localization problem: because the activation point indices in heatmaps are always integers, a quantization error arises when heatmaps are used to represent numerical coordinates. Previous methods overcome the sub-pixel localization problem by relying on high-resolution heatmaps, so there is always a trade-off between localization accuracy and computational cost, since the computational complexity of heatmap regression grows quadratically with the heatmap resolution. In this paper, we formally analyze the quantization error of vanilla heatmap regression and propose a simple yet effective quantization system to address the sub-pixel localization problem. The proposed quantization system, induced by the randomized rounding operation, 1) encodes the fractional part of numerical coordinates into the ground truth heatmap using a probabilistic approach during training; and 2) decodes the predicted numerical coordinates from a set of activation points during testing. We prove that the proposed quantization system for heatmap regression is unbiased and lossless. Experimental results on popular facial landmark localization datasets (WFLW, 300W, COFW, and AFLW) and human pose estimation datasets (MPII and COCO) demonstrate the effectiveness of the proposed method for efficient and accurate semantic landmark localization. Code is available at http://github.com/baoshengyu/H3R.

Index Terms:
Semantic landmark localization, heatmap regression, quantization error, randomized rounding.

1 Introduction

Semantic landmarks are sets of points or pixels in images containing rich semantic information. They reflect the intrinsic structure or shape of objects such as human faces [1, 2], hands [3, 4], bodies [5, 6], and household objects [7]. Semantic landmark localization is fundamental in computer and robot vision [8, 9, 7, 10]. For example, semantic landmark localization can be used to register correspondences between spatial positions and semantics (semantic alignment), which is extremely useful in visual recognition tasks such as face recognition [8, 11] and person re-identification [12, 13]. Therefore, robust and efficient semantic landmark localization is extremely important in applications requiring accurate semantic landmarks including robotic grasping [7, 10] and facial analysis applications such as face makeup [14, 15], animation [16, 17], and reenactment [18, 9].

Coordinate regression and heatmap regression are two widely-used methods for deep learning-based semantic landmark localization [1, 19]. Rather than directly regressing the numerical coordinates with a fully-connected layer, heatmap-based methods predict a heatmap whose maximum activation point corresponds to the semantic landmark in the input image. An intuitive example of the heatmap representation is shown in Fig. 1. Due to the effective spatial generalization of the heatmap representation, heatmap regression is robust to large variations in pose, illumination, and occlusion in unconstrained settings [19, 20]. Heatmap regression has performed particularly well in semantic landmark localization tasks including facial landmark detection [2, 21] and human pose estimation [6, 22]. Despite this promise, heatmap regression suffers from an inherent drawback: the indices of the activation points in heatmaps are always integers, so vanilla heatmap-based methods fail to predict numerical coordinates with sub-pixel precision. Sub-pixel localization is nevertheless important in real-world scenarios, where the fractional part of numerical coordinates originates from two sources: 1) the input image is captured by a low-resolution camera and/or at a relatively large distance; and 2) the heatmap usually has a much lower resolution than the input image due to the downsampling stride of convolutional neural networks. As a result, low-resolution heatmaps significantly degrade heatmap regression performance. Considering that the computational cost of convolutional neural networks usually depends quadratically on the resolution of the input image or the feature map, there is a trade-off between localization accuracy and computational cost for heatmap regression [6, 23, 24, 25, 26]. Furthermore, the downsampling stride of the heatmap is not always equal to the downsampling stride of the feature map: given an original image of 512×512 pixels, a heatmap regression model with an input size of 128×128 pixels, and a feature map with a downsampling stride of 4 pixels, the heatmap has a size of 32×32 pixels, i.e., the downsampling stride of the heatmap is s=16 pixels with respect to the original image. For simplicity, we do not distinguish between the above two settings and address the quantization error in a unified manner. Unless otherwise mentioned, we refer to s>1 as the downsampling stride of the heatmap.

Refer to caption
Figure 1: An intuitive example of the heatmap representation for the numerical coordinate. Given the numerical coordinate \boldsymbol{x}_{i}=(x_{i},y_{i}), we then have the corresponding heatmap \boldsymbol{h}_{i} with the maximum activation point located at the position (x_{i},y_{i}).

In vanilla heatmap regression, 1) during training, the ground truth numerical coordinates are first quantized to generate the ground truth heatmap; and 2) during testing, the predicted numerical coordinates can be decoded from the maximum activation point in the predicted heatmap. However, typical quantization operations such as floor, round, and ceil discard the fractional part of the ground truth numerical coordinates, making it difficult to reconstruct the fractional part even from the optimal predicted heatmap. This error induced by the transformation between numerical coordinates and heatmap is known as the quantization error. To address the problem of quantization error, here we introduce a new quantization system to form a lossless transformation between the numerical coordinates and the heatmap. In our approach, during training, the proposed quantization system uses a set of activation points, and the fractional part of the numerical coordinate is encoded as the activation probabilities of different activation points. During testing, the fractional part can then be reconstructed according to the activation probabilities of the top k maximum activation points in the predicted heatmap. To achieve this, we introduce a new quantization operation called randomized rounding, or random-round, which is widely used in combinatorial optimization to convert fractional solutions into integer solutions with provable approximation guarantees [27, 28]. Furthermore, the proposed method can easily be implemented using a few lines of source code, making it a plug-and-play replacement for the quantization system of existing heatmap regression methods.

In this paper, we address the problem of quantization error in heatmap regression. The remainder of the paper is structured as follows. In the preliminaries, we briefly review two typical semantic landmark localization methods, coordinate regression and heatmap regression. In the method section, we first formally introduce the problem of quantization error by decomposing the prediction error into the heatmap error and the quantization error. We then discuss the quantization bias in vanilla heatmap regression and prove a tight upper bound on its quantization error. To address the quantization error, we devise a new quantization system and theoretically prove that it is unbiased and lossless. We also discuss the uncertainty in heatmap prediction as well as unbiased annotation when forming a robust semantic landmark localization system. In the experimental section, we demonstrate the effectiveness of our proposed method on popular facial landmark detection datasets (WFLW, 300W, COFW, and AFLW) and human pose estimation datasets (MPII and COCO).

2 Related Work

Semantic landmark localization, which aims to predict the numerical coordinates for a set of pre-defined semantic landmarks in a given image or video, has a variety of applications in computer and robot vision including facial landmark detection [1, 29, 2], hand landmark detection [3, 4], human pose estimation [5, 6, 22], and household object pose estimation [7, 10]. In this section, we briefly review recent works on coordinate regression and heatmap regression for semantic landmark localization, especially in facial landmark localization applications.

2.1 Coordinate Regression

Coordinate regression has been widely and successfully used in semantic landmark localization under constrained settings, where it usually relies on simple yet effective features [30, 31, 32, 33]. To improve the performance of coordinate regression for semantic landmark localization in the wild, several methods have been proposed by using cascade refinement [1, 34, 35, 36, 37, 38], parametric/non-parametric shape models [39, 29, 36, 40], multi-task learning [41, 42, 43], and novel loss functions [44, 45].

2.2 Heatmap Regression

The success of deep learning has prompted the use of heatmap regression for semantic landmark localization, especially for robust and accurate facial landmark localization [2, 46, 47, 23] and human pose estimation [19, 48, 6, 49, 50, 51, 52]. Existing heatmap regression methods either rely on large input images or empirical compensations during inference to mitigate the problem of quantization error [6, 53, 54, 22]. For example, a simple yet effective compensation method known as “shift a quarter to the second maximum activation point” has been widely used in many state-of-the-art heatmap regression methods [6, 55, 22].

Several methods have been developed to address the problem of quantization error in three aspects: 1) jointly predicting the heatmap and the offset in a multi-task manner [56]; 2) encoding and decoding the fractional part of numerical coordinates via a modulated 2D Gaussian distribution [24, 26]; and 3) exploring differentiable transformations between the heatmap and the numerical coordinates [57, 20, 58, 59]. Specifically, [24] generates the fractional part sensitive ground truth heatmap for video-based face alignment, which is known as fractional heatmap regression. Under the assumption that the predicted heatmap follows a 2D Gaussian distribution, [26] decodes the fractional part of numerical coordinates from the modulated predicted heatmap. The soft-argmax operation is differentiable [60, 61, 62, 59], and has been intensively explored in human pose estimation [57, 20].

3 Preliminaries

In this section, we introduce two widely-used semantic landmark localization methods, coordinate regression and heatmap regression. For simplicity, we use facial landmark detection as an intuitive example.

Coordinate Regression. Given a face image, semantic landmark detection aims to find the numerical coordinates of a set of pre-defined facial landmarks \boldsymbol{x}_{i}=\left(x_{i},y_{i}\right), where i=1,2,\dots,K indicates the index of the facial landmark (e.g., a set of five pre-defined facial landmarks can be the left eye, right eye, nose, left mouth corner, and right mouth corner). It is natural to train a model (e.g., a deep neural network) to directly regress the numerical coordinates of all facial landmarks. The coordinate regression model can then be optimized via a typical regression criterion such as the mean squared error (MSE) or the mean absolute error (MAE). For the MSE criterion (also the L2 loss), we have

\mathcal{L}(\boldsymbol{x}^{p}_{i},\boldsymbol{x}^{g}_{i})=\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{g}\|_{2}^{2}, (1)

where \boldsymbol{x}^{p}_{i} and \boldsymbol{x}^{g}_{i} indicate the predicted and the ground truth numerical coordinates, respectively. When using the MAE criterion (also the L1 loss), the loss function \mathcal{L} can be defined in a similar way to (1).

Heatmap Regression. Heatmaps (also known as confidence maps) are simple yet effective representations of semantic landmark locations. Given the numerical coordinate \boldsymbol{x}_{i} of the i-th semantic landmark, it corresponds to a specific heatmap \boldsymbol{h}_{i} as shown in Fig. 1. For simplicity, we assume \boldsymbol{h}_{i} is the same size as the input image in this section and leave the problem of quantization error to the next section. With the heatmap representation, the problem of semantic landmark localization can be translated into heatmap regression via two heatmap subroutines: 1) encode (from the ground truth numerical coordinate \boldsymbol{x}_{i}^{g} to the ground truth heatmap \boldsymbol{h}_{i}^{g}); and 2) decode (from the predicted heatmap \boldsymbol{h}_{i}^{p} to the predicted numerical coordinate \boldsymbol{x}_{i}^{p}). The main framework of deep learning-based heatmap regression for semantic landmark localization is shown in Fig. 2.

Refer to caption
Figure 2: The main framework of deep learning-based heatmap regression for semantic landmark localization. Specifically, during the inference stage, we decode the predicted heatmaps \boldsymbol{h}^{p}=\left(\boldsymbol{h}^{p}_{1},\dots,\boldsymbol{h}^{p}_{K}\right) to obtain the predicted numerical coordinates \boldsymbol{x}^{p}=\left(\boldsymbol{x}^{p}_{1},\dots,\boldsymbol{x}^{p}_{K}\right); during the training stage, we encode the ground truth numerical coordinates \boldsymbol{x}^{g}=\left(\boldsymbol{x}^{g}_{1},\dots,\boldsymbol{x}^{g}_{K}\right) to generate the ground truth heatmaps \boldsymbol{h}^{g}=\left(\boldsymbol{h}^{g}_{1},\dots,\boldsymbol{h}^{g}_{K}\right).

Specifically, during the inference stage, given a predicted heatmap \boldsymbol{h}_{i}^{p}, the value \boldsymbol{h}_{i}^{p}(\boldsymbol{x})\in[0,1] indicates the confidence score that the i-th landmark is located at coordinate \boldsymbol{x}\in\mathbb{N}^{2}. Then, we can decode the predicted numerical coordinate \boldsymbol{x}_{i}^{p} from the predicted heatmap \boldsymbol{h}_{i}^{p} using the argmax operation, i.e.,

\boldsymbol{x}_{i}^{p}=\left(x_{i}^{p},y_{i}^{p}\right)\in\underset{\boldsymbol{x}}{\arg\max}\left\{\boldsymbol{h}_{i}^{p}(\boldsymbol{x})\right\}. (2)

Therefore, with the decode operation in (2), the problem of semantic landmark localization can be solved by training a deep model to predict the heatmap \boldsymbol{h}_{i}^{p}.

To train a heatmap regression model, the ground truth heatmap \boldsymbol{h}_{i}^{g} is indispensable, i.e., we need to encode the ground truth coordinate \boldsymbol{x}_{i}^{g} into the ground truth heatmap \boldsymbol{h}_{i}^{g}. We introduce two widely-used methods to generate the ground truth heatmap, the Gaussian heatmap and the binary heatmap, as follows. Given the ground truth coordinate \boldsymbol{x}_{i}^{g}, the ground truth Gaussian heatmap can be generated by sampling and normalizing from a bivariate normal distribution \mathcal{N}(\boldsymbol{x}_{i}^{g},\Sigma), i.e., the ground truth heatmap \boldsymbol{h}_{i}^{g} at location \boldsymbol{x}\in\mathbb{N}^{2} can be evaluated as

\boldsymbol{h}_{i}^{g}(\boldsymbol{x})=\exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{x}_{i}^{g})^{\top}\Sigma^{-1}(\boldsymbol{x}-\boldsymbol{x}_{i}^{g})\right), (3)

where \Sigma is the covariance matrix (a positive semi-definite matrix) and \sigma>0 is the standard deviation in both directions, i.e.,

\Sigma=\begin{bmatrix}\sigma^{2}&0\\ 0&\sigma^{2}\end{bmatrix}. (4)

When \sigma\to 0, the ground truth heatmap can be generated by assigning a positive value at the ground truth numerical coordinate \boldsymbol{x}_{i}^{g}, i.e.,

\boldsymbol{h}_{i}^{g}(\boldsymbol{x})=\begin{cases}1&\text{if }\boldsymbol{x}=\boldsymbol{x}_{i}^{g},\\ 0&\text{otherwise}.\end{cases} (5)

Specifically, when \sigma\to 0, the ground truth heatmap defined in (5) is also known as the binary heatmap.
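For concreteness, the following NumPy sketch generates the two kinds of ground truth heatmaps for a single landmark. This is a hedged illustration rather than the authors' implementation; the heatmap shape (H, W), the function names, and the default \sigma are assumptions.

```python
import numpy as np

def gaussian_heatmap(coord, shape, sigma=1.5):
    """Ground truth Gaussian heatmap, Eq. (3); coord = (x, y) in heatmap pixels, shape = (H, W)."""
    xs, ys = np.arange(shape[1]), np.arange(shape[0])
    xx, yy = np.meshgrid(xs, ys)
    # isotropic covariance, Eq. (4): Sigma = diag(sigma^2, sigma^2)
    return np.exp(-((xx - coord[0]) ** 2 + (yy - coord[1]) ** 2) / (2.0 * sigma ** 2))

def binary_heatmap(coord, shape):
    """Ground truth binary heatmap, Eq. (5); coord must already be integer indices."""
    h = np.zeros(shape, dtype=np.float32)
    h[int(coord[1]), int(coord[0])] = 1.0  # rows index y, columns index x
    return h
```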

Given the ground truth heatmap, the heatmap regression model can then be optimized using typical pixel-wise regression criteria such as MSE, MAE, or Smooth-L1 [44]. Specifically, for Gaussian heatmaps, the heatmap regression model is usually optimized with the pixel-wise MSE criterion, i.e.,

\mathcal{L}(\boldsymbol{h}_{i}^{p},\boldsymbol{h}_{i}^{g})=\mathbb{E}\|\boldsymbol{h}_{i}^{p}(\boldsymbol{x})-\boldsymbol{h}_{i}^{g}(\boldsymbol{x})\|_{2}^{2}. (6)

When using the MAE/Smooth-L1 criteria, the loss function can be defined in a similar way to (6). For binary heatmap, the heatmap regression model can also be optimized with the pixel-wise cross-entropy criterion, i.e.,

\mathcal{L}(\boldsymbol{h}^{p}_{i},\boldsymbol{h}^{g}_{i})=\mathbb{E}\left(\mathcal{L}_{\text{CE}}\left(\boldsymbol{h}^{p}_{i}(\boldsymbol{x}),\boldsymbol{h}^{g}_{i}(\boldsymbol{x})\right)\right), (7)

where \mathcal{L}_{\text{CE}} indicates the cross-entropy criterion with a softmax function as the activation/normalization function. A comprehensive review of different loss functions for semantic landmark localization is beyond the scope of this paper, but we refer interested readers to [45] for descriptions of coordinate regression and [21] for heatmap regression. Unless otherwise mentioned, we use the MSE criterion for the Gaussian heatmap and the cross-entropy criterion for the binary heatmap in this paper.
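As a sketch of the two training criteria above (a hedged PyTorch illustration, not the authors' code; the tensor shape (N, K, H, W) is an assumption), the Gaussian heatmap is trained with a pixel-wise MSE, while the binary heatmap is trained with a cross-entropy over all spatial positions:

```python
import torch
import torch.nn.functional as F

def gaussian_heatmap_loss(pred, target):
    """Pixel-wise MSE criterion, Eq. (6); pred and target have shape (N, K, H, W)."""
    return F.mse_loss(pred, target)

def binary_heatmap_loss(logits, target):
    """Cross-entropy criterion, Eq. (7), with a softmax over the H*W spatial positions.
    logits: raw scores of shape (N, K, H, W); target: one-hot binary heatmaps (N, K, H, W)."""
    n, k, h, w = logits.shape
    log_prob = F.log_softmax(logits.view(n, k, h * w), dim=-1)
    return -(target.view(n, k, h * w) * log_prob).sum(dim=-1).mean()
```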

4 Method

In this section, we first introduce the quantization system in heatmap regression and then formulate the quantization error in a unified way by correcting the quantization bias in a vanilla quantization system. Lastly, we devise a new quantization system via randomized rounding to address the problem of quantization error.

4.1 Quantization System

Heatmap regression for semantic landmark localization usually contains two key components: 1) heatmap prediction; and 2) transformation between the heatmap and the numerical coordinates. The quantization system in heatmap regression is the combination of the encode and decode operations. During training, when the ground truth numerical coordinates \boldsymbol{x}_{i}^{g} are floating-point numbers, we need to calculate a specific Gaussian kernel matrix using (3) for each landmark, since different numerical coordinates usually have different fractional parts. This significantly increases the training load of the heatmap regression model. For example, given 98 landmarks per face image, a kernel size of 11×11, and a mini-batch of 16 training samples, we have to evaluate (3) 98×16×11×11 = 189,728 times in each training iteration. To address this issue, existing heatmap regression methods usually first quantize numerical coordinates into integers, so that a standard kernel matrix can be shared for efficient ground truth heatmap generation [6, 2, 55, 23]. However, these methods suffer from the inherent drawback of failing to encode the fractional part of numerical coordinates. Therefore, how to efficiently encode the fractional information in numerical coordinates remains challenging. Furthermore, during the inference stage, the predicted numerical coordinates \boldsymbol{x}_{i}^{p} obtained by the decode operation in (2) are also integers. As a result, typical heatmap regression methods usually fail to efficiently handle the fractional part of the numerical coordinates during both training and inference, resulting in localization error.

To analyze the localization error caused by the quantization system in heatmap regression, we upper bound the localization error by the sum of the heatmap error and the quantization error as follows:

\begin{split}\mathcal{E}_{loc}=\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{g}\|_{2}&=\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{opt}+\boldsymbol{x}_{i}^{opt}-\boldsymbol{x}_{i}^{g}\|_{2}\\ &\leq\underbrace{\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{opt}\|_{2}}_{\textbf{heatmap error}}+\underbrace{\|\boldsymbol{x}_{i}^{opt}-\boldsymbol{x}_{i}^{g}\|_{2}}_{\textbf{quantization error}},\end{split} (8)

where \boldsymbol{x}_{i}^{opt} indicates the numerical coordinate decoded from the optimal predicted heatmap. Generally, the heatmap error corresponds to the error in heatmap prediction, i.e., \|\boldsymbol{h}_{i}^{p}-\boldsymbol{h}_{i}^{g}\|_{2}, and the quantization error indicates the error caused by both the encode and decode operations. If there is no heatmap error, the localization error then all originates from the error of the quantization system, i.e.,

\mathcal{E}_{loc}=\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{g}\|_{2}=\|\boldsymbol{x}_{i}^{opt}-\boldsymbol{x}_{i}^{g}\|_{2}. (9)

The generalizability of deep neural networks for heatmap prediction, i.e., the heatmap error, is beyond the scope of this paper. We do not consider the heatmap error during the analysis of quantization error in this paper.

To obtain integer coordinates for the generation of the ground truth heatmap, typical integer quantization operations such as floor, round, and ceil have been widely used in previous heatmap regression methods. To unify the quantization error induced by different integer operations, we first introduce a unified integer quantization operation as follows. Given a downsampling stride s>1 and a threshold t\in[0,1], the coordinate x\in\mathbb{N} can be quantized according to its fractional part \epsilon=x/s-\lfloor x/s\rfloor, i.e.,

\boldsymbol{q}(x,s,t)=\begin{cases}\lfloor x/s\rfloor&\text{if }\epsilon<t,\\ \lfloor x/s\rfloor+1&\text{otherwise}.\end{cases} (10)

That is, for the integer quantization operations floor, round, and ceil, we have t=1.0, t=0.5, and t=0, respectively. Furthermore, when the downsampling stride s>1, the decode operation in (2) becomes

\boldsymbol{x}_{i}^{p}\in s*\left(\underset{\boldsymbol{x}}{\arg\max}\left\{\boldsymbol{h}_{i}^{p}(\boldsymbol{x})\right\}\right). (11)

A vanilla quantization system for heatmap regression can then be formed by the encode operation in (10) and the decode operation in (11). When applied to a vector or a matrix, the integer quantization operation defined in (10) is an element-wise operation.
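The vanilla quantization system can thus be summarized in a few lines of NumPy (a hedged sketch; the function names and the (row, column) heatmap layout are assumptions):

```python
import numpy as np

def quantize(x, s, t):
    """Unified integer quantization, Eq. (10): t=1.0 -> floor, t=0.5 -> round, t=0.0 -> ceil."""
    eps = x / s - np.floor(x / s)        # fractional part of x/s
    return np.floor(x / s) + (eps >= t)  # element-wise for array-valued x

def decode_argmax(heatmap, s):
    """Vanilla decode, Eq. (11): the maximum activation point scaled by the stride s."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)  # rows are y, columns are x
    return s * np.array([x, y], dtype=np.float64)
```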

Refer to caption

Figure 3: An intuitive example of the encode operation via randomized rounding. When the downsampling stride s>1, the ground truth activation point (x_{i}^{g}/s,y_{i}^{g}/s) usually does not correspond to a single pixel in the heatmap. Therefore, we introduce the randomized rounding operation to assign the ground truth activation point to a set of alternative activation points, and the activation probability depends on the fractional part of the ground truth numerical coordinates.

4.2 Quantization Error

In this subsection, we first correct the bias in a vanilla quantization system to form an unbiased vanilla quantization system. With the unbiased quantization system, we then provide a tight upper bound on the quantization error for vanilla heatmap regression.

Let \epsilon_{x} denote the fractional part of x_{i}^{g}/s, and \epsilon_{y} denote the fractional part of y_{i}^{g}/s. Given the downsampling stride of the heatmap s>1, we then have

\begin{split}\epsilon_{x}&=x_{i}^{g}/s-\lfloor x_{i}^{g}/s\rfloor,\\ \epsilon_{y}&=y_{i}^{g}/s-\lfloor y_{i}^{g}/s\rfloor.\end{split} (12)

Given the assumption of a “perfect” heatmap prediction model or no heatmap error, i.e., \boldsymbol{h}_{i}^{p}(\boldsymbol{x})=\boldsymbol{h}_{i}^{g}(\boldsymbol{x}), we then have the predicted numerical coordinates

x_{i}^{p}/s=\begin{cases}\lfloor x_{i}^{g}/s\rfloor&\text{if }\epsilon_{x}<t,\\ \lfloor x_{i}^{g}/s\rfloor+1&\text{otherwise},\end{cases}
y_{i}^{p}/s=\begin{cases}\lfloor y_{i}^{g}/s\rfloor&\text{if }\epsilon_{y}<t,\\ \lfloor y_{i}^{g}/s\rfloor+1&\text{otherwise}.\end{cases}

If data samples satisfy the i.i.d. assumption and the fractional parts \epsilon_{x},\epsilon_{y}\sim\mathbb{U}(0,1), the bias of \boldsymbol{x}_{i}^{p} as an estimator of \boldsymbol{x}_{i}^{g} can then be evaluated as

\begin{split}\mathbb{E}\left(x_{i}^{p}/s-x_{i}^{g}/s\right)&=\mathbb{E}\left(\mathbf{1}\{\epsilon_{x}<t\}(-\epsilon_{x})+\mathbf{1}\{\epsilon_{x}\geq t\}(1-\epsilon_{x})\right)\\ &=0.5-t.\end{split}

Considering that x_{i}^{g} and y_{i}^{g} are independent variables, we thus have the quantization bias in the vanilla quantization system as follows:

\begin{split}\mathbb{E}\left(\boldsymbol{x}_{i}^{p}/s-\boldsymbol{x}_{i}^{g}/s\right)&=\left(\mathbb{E}\left(x_{i}^{p}/s-x_{i}^{g}/s\right),~\mathbb{E}\left(y_{i}^{p}/s-y_{i}^{g}/s\right)\right)\\ &=\left(0.5-t,~0.5-t\right).\end{split}

Therefore, among the integer quantization operations in (10), only the round operation (t=0.5) gives an unbiased encode operation. Furthermore, for any t\in[0,1] in the encode operation in (10), we can correct the bias of the encode operation with a shift in the decode operation, i.e.,

\boldsymbol{x}_{i}^{p}\in s*\left(\underset{\boldsymbol{x}}{\arg\max}\left\{\boldsymbol{h}_{i}^{p}(\boldsymbol{x})\right\}+t-0.5\right). (13)

For simplicity, we use the round operation, i.e., t=0.5, to form an unbiased quantization system as our baseline. Though the vanilla quantization system defined by (10) and (13) is unbiased, it still causes a non-invertible localization error. An intuitive explanation is that the encode operation in (10) directly discards the fractional part of the ground truth numerical coordinates, making it impossible for the decode operation to accurately reconstruct the numerical coordinates.
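The bias correction in (13) is then a one-line change to the vanilla decode (again a hedged NumPy sketch with an illustrative function name):

```python
import numpy as np

def decode_unbiased(heatmap, s, t=0.5):
    """Bias-corrected decode, Eq. (13): shift the argmax location by (t - 0.5) before scaling."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return s * (np.array([x, y], dtype=np.float64) + t - 0.5)
```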

Theorem 1.

Given the unbiased quantization system defined by the encode operation in (10) and the decode operation in (13), the quantization error is tightly upper bounded, i.e.,

\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{g}\|_{2}\leq\sqrt{2}s/2,

where s>1 indicates the downsampling stride of the heatmap.

Proof.

In Appendix. ∎

From Theorem 1, we know that the vanilla quantization system defined by (10) and (13) causes a non-invertible quantization error whose upper bound depends linearly on the downsampling stride of the heatmap. As a result, for a given heatmap regression model, the localization error can become extremely large for large faces in the original input image, which is a significant problem in many important face-related applications such as face makeup, face swapping, and face reenactment.

Refer to caption

Figure 4: An example of different possible sets of alternative activation points.

4.3 Randomized Rounding

In vanilla heatmap regression, each numerical coordinate corresponds to a single activation point in the heatmap, while the indices of the activation point are all integers. As a result, the fractional part of the numerical coordinate is usually ignored during the encode process, making it an inherent drawback of heatmap regression for sub-pixel localization. To retain the fractional information when using heatmap representations, we utilize multiple activation points around the ground truth activation point. Inspired by the randomized rounding method [27], we address the quantization error in vanilla heatmap regression by using a probabilistic approach. Specifically, we encode the fractional part of the numerical coordinate to different activation points with different activation probabilities. An intuitive example is shown in Fig. 3.

We describe the proposed quantization system as follows. Given the ground truth numerical coordinate \boldsymbol{x}_{i}^{g}=(x_{i}^{g},y_{i}^{g}) and a downsampling stride of the heatmap s>1, the ground truth activation point in the heatmap is (x_{i}^{g}/s,y_{i}^{g}/s), whose coordinates are usually floating-point numbers, so we are unable to find the corresponding pixel in the heatmap. If we ignore the fractional part (\epsilon_{x},\epsilon_{y}) using a typical integer quantization operation, e.g., round, the ground truth activation point will be approximated by one of the activation points around it, i.e., \left(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor\right), \left(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor\right), \left(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor+1\right), or \left(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor+1\right). However, this process is not invertible. To address this, we randomly assign the ground truth activation point to one of the alternative activation points around it, where the activation probability is determined by the fractional part of the ground truth activation point as follows:

\begin{split}P\left\{\boldsymbol{h}_{i}^{g}\left(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor\right)=1\right\}&=(1-\epsilon_{x})(1-\epsilon_{y}),\\ P\left\{\boldsymbol{h}_{i}^{g}\left(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor\right)=1\right\}&=\epsilon_{x}(1-\epsilon_{y}),\\ P\left\{\boldsymbol{h}_{i}^{g}\left(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor+1\right)=1\right\}&=(1-\epsilon_{x})\epsilon_{y},\\ P\left\{\boldsymbol{h}_{i}^{g}\left(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor+1\right)=1\right\}&=\epsilon_{x}\epsilon_{y}.\end{split} (14)

To achieve the encode scheme in (14) in conjunction with current minibatch stochastic gradient descent training algorithms for deep learning models, we introduce a new integer quantization operation via randomized rounding, i.e., random-round:

\boldsymbol{q}(x,s)=\begin{cases}\lfloor x/s\rfloor&\text{if }\epsilon<t,~t\sim\mathbb{U}(0,1),\\ \lfloor x/s\rfloor+1&\text{otherwise}.\end{cases} (15)
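A minimal NumPy sketch of the random-round operation in (15) follows (the function name and the RNG handling are assumptions; each coordinate draws its own threshold t\sim\mathbb{U}(0,1)):

```python
import numpy as np

def random_round(x, s, rng=None):
    """Randomized rounding, Eq. (15): round x/s down with probability 1 - eps and up with
    probability eps, where eps is the fractional part of x/s."""
    rng = np.random.default_rng() if rng is None else rng
    eps = x / s - np.floor(x / s)
    t = rng.random(np.shape(x))          # t ~ U(0, 1), drawn independently per coordinate
    return np.floor(x / s) + (eps >= t)  # P{floor + 1} = P{t <= eps} = eps, matching Eq. (14)
```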

Given the encode operation in (15), if we do not consider the heatmap error, we then have the activation probability at \boldsymbol{x}:

\boldsymbol{h}_{i}^{p}(\boldsymbol{x})=P\left\{\boldsymbol{h}_{i}^{g}(\boldsymbol{x})=1\right\}. (16)

As a result, the fractional part of the ground truth numerical coordinate (\epsilon_{x},\epsilon_{y}) can be reconstructed from the predicted heatmap via the activation probabilities of all activation points, i.e.,

\boldsymbol{x}_{i}^{p}=s*\left(\sum_{\boldsymbol{x}\in\mathcal{X}_{i}^{g}}\boldsymbol{h}_{i}^{p}(\boldsymbol{x})*\boldsymbol{x}\right), (17)

where \mathcal{X}_{i}^{g} indicates the set of activation points around the ground truth activation point, i.e.,

\begin{split}\mathcal{X}_{i}^{g}=\{&\left(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor\right),\left(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor\right),\\ &\left(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor+1\right),\left(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor+1\right)\}.\end{split} (18)
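The decode in (17) is then a probability-weighted average over the four points in (18). The sketch below assumes the set \mathcal{X}_{i}^{g} is known and that there is no heatmap error, so the four activations sum to one; the function name and argument layout are illustrative:

```python
import numpy as np

def decode_lossless(heatmap, s, floor_xy):
    """Decode via Eq. (17) over the four activation points of Eq. (18).
    floor_xy = (floor(x/s), floor(y/s)); with a perfect heatmap the weighted average
    recovers the fractional part (eps_x, eps_y) exactly."""
    fx, fy = int(floor_xy[0]), int(floor_xy[1])
    points = np.array([(fx, fy), (fx + 1, fy), (fx, fy + 1), (fx + 1, fy + 1)], dtype=np.float64)
    probs = np.array([heatmap[int(y), int(x)] for x, y in points])  # rows index y, columns index x
    return s * (probs[:, None] * points).sum(axis=0)
```

Under these assumptions the weighted average equals (\lfloor x_{i}^{g}/s\rfloor+\epsilon_{x},\lfloor y_{i}^{g}/s\rfloor+\epsilon_{y}), so multiplying by s recovers \boldsymbol{x}_{i}^{g} exactly, which is the content of Theorem 2.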
Theorem 2.

Given the encode operation in (15) and the decode operation in (17), we have that 1) the encode operation is unbiased; and 2) the quantization system is lossless, i.e., there is no quantization error.

Proof.

In Appendix. ∎

From Theorem 2, we know that the quantization system defined by the encode operation in (15) and the decode operation in (17) is unbiased and lossless.

4.4 Activation Points Selection

The fractional information of the numerical coordinate (\epsilon_{x},\epsilon_{y}) is well-captured by the randomized rounding operation, allowing us to reconstruct the ground truth numerical coordinate \boldsymbol{x}_{i}^{g} without quantization error. However, during the inference phase, the ground truth numerical coordinate \boldsymbol{x}_{i}^{g} is unavailable and heatmap error always exists in practice, making it difficult to identify the proper set of ground truth activation points \mathcal{X}_{i}^{g}. In this section, we describe a way to form a set of alternative activation points in practice.

We introduce two activation point selection methods as follows. The first solution is to estimate all activation points via the points around the maximum activation point. As shown in Fig. 4, given the maximum activation point, we have four different candidate sets of alternative activation points, \mathcal{X}_{i}^{g_{1}},\mathcal{X}_{i}^{g_{2}},\mathcal{X}_{i}^{g_{3}}, and \mathcal{X}_{i}^{g_{4}}. Given a predicted heatmap in practice, we therefore risk choosing an incorrect set of alternative activation points. To find a robust set of alternative activation points, we may use all nine activation points around the maximum activation point, i.e.,

\mathcal{X}_{i}^{g}=\mathcal{X}_{i}^{g_{1}}\cup\mathcal{X}_{i}^{g_{2}}\cup\mathcal{X}_{i}^{g_{3}}\cup\mathcal{X}_{i}^{g_{4}}. (19)

Another solution is to generalize the argmax operation to the argtopk operation, i.e., we decode the predicted heatmap \boldsymbol{h}_{i}^{p} to obtain the numerical coordinate \boldsymbol{x}_{i}^{p} according to the top k largest activation points,

\mathcal{X}_{i}^{g}=\underset{\boldsymbol{x}}{\arg\text{topk}}\left(\boldsymbol{h}_{i}^{p}(\boldsymbol{x})\right). (20)

If there is no heatmap error, the two alternative activation point solutions presented above, i.e., the alternative activation points in (19) and (20), are equal to each other when using the decode operation in (17). Specifically, we find that the activation points in (19) achieve comparable performance to the activation points in (20) when k=9. For simplicity, unless otherwise mentioned, we use the set of alternative activation points defined by (20) in this paper. Furthermore, when we take the heatmap error into consideration, the value of k forms a trade-off in the selection of activation points, i.e., a larger k is more robust to activation point selection whilst also increasing the risk of noise from the heatmap error. See more discussion in Section 4.5 and the experimental results in Section 5.5.
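In practice, the decode can therefore be implemented with the top-k activation points as follows (a hedged sketch; renormalizing the top-k activations so that they sum to one is an assumption of this illustration, not a statement of the authors' exact implementation):

```python
import numpy as np

def decode_topk(heatmap, s, k=9):
    """Approximate X_i^g with the top-k activation points, Eq. (20), then take the
    probability-weighted average of their coordinates as in Eq. (17)."""
    flat = heatmap.ravel()
    top = np.argpartition(flat, -k)[-k:]           # indices of the k largest activations
    probs = flat[top] / flat[top].sum()            # renormalized activation probabilities
    ys, xs = np.unravel_index(top, heatmap.shape)  # rows are y, columns are x
    return s * np.array([np.dot(probs, xs), np.dot(probs, ys)])
```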

4.5 Discussion

In this subsection, we provide some insights into the proposed quantization system with respect to: 1) the influence of human annotators on the proposed quantization system in practice; and 2) the underlying explanation behind the widely used empirical compensation method “shift a quarter to the second maximum activation point”.

Unbiased Annotation. We have assumed that the ground truth numerical coordinates are always accurate, while in practice they are usually labelled by human annotators and thus subject to annotation bias. Given an input image, the ground truth numerical coordinates \boldsymbol{x}_{i}^{g} can be obtained by clicking a specific pixel in the image, which is a simple but effective annotation pipeline provided by most image annotation tools. For sub-pixel numerical coordinates, especially in low-resolution input images, the annotators may click any one of the possible pixels around the ground truth numerical coordinates due to human visual uncertainty. As shown in Fig. 5, clicking any one of the four possible pixels causes an annotation error, which corresponds to the fractional part of the ground truth numerical coordinate (\epsilon_{x}^{\prime},\epsilon_{y}^{\prime})=\left(x_{i}^{g}-\lfloor x_{i}^{g}\rfloor,~y_{i}^{g}-\lfloor y_{i}^{g}\rfloor\right). Given enough data samples, if the annotators click the pixel according to the following distribution, i.e.,

\begin{split}P\left\{\left(\lfloor x_{i}^{g}\rfloor,\lfloor y_{i}^{g}\rfloor\right)\right\}&=(1-\epsilon_{x}^{\prime})(1-\epsilon_{y}^{\prime}),\\ P\left\{\left(\lfloor x_{i}^{g}\rfloor+1,\lfloor y_{i}^{g}\rfloor\right)\right\}&=\epsilon_{x}^{\prime}(1-\epsilon_{y}^{\prime}),\\ P\left\{\left(\lfloor x_{i}^{g}\rfloor,\lfloor y_{i}^{g}\rfloor+1\right)\right\}&=(1-\epsilon_{x}^{\prime})\epsilon_{y}^{\prime},\\ P\left\{\left(\lfloor x_{i}^{g}\rfloor+1,\lfloor y_{i}^{g}\rfloor+1\right)\right\}&=\epsilon_{x}^{\prime}\epsilon_{y}^{\prime},\end{split} (21)

the fractional part then can be well captured by the heatmap regression model and we refer to it as an unbiased annotation.

Refer to caption


Figure 5: An example of unbiased human annotation for semantic landmark localization.

If we take the downsampling stride into consideration, (\epsilon_{x},\epsilon_{y}) is then a joint result of both the downsampling of the heatmap and the annotation process, i.e.,

(\epsilon_{x},~\epsilon_{y})~\propto~\left(\epsilon_{x}^{\prime}/s,~\epsilon_{y}^{\prime}/s\right)+(s-1). (22)

On the one hand, if the heatmap regression model uses a low input resolution (or a large downsampling stride s\gg 1), the fractional part (\epsilon_{x},\epsilon_{y}) mainly comes from the downsampling of the heatmap; on the other hand, if the heatmap regression model uses a high input resolution, the annotation process also has a significant influence on the heatmap regression. Therefore, when using a high input resolution model in practice, a diverse set of human annotators helps reduce the bias in the annotation process.

Empirical Compensation. “Shift a quarter to the second maximum activation point” has become an effective and widely used empirical compensation method for heatmap regression [6, 55, 22], but it still lacks a proper explanation. We thus provide an intuitive explanation according to the proposed quantization system. The proposed quantization system encodes the ground truth numerical coordinates into multiple activation points, and the activation probability of each activation point is decided by the fractional part, i.e., the activation probability indicates the distance between the activation point and the ground truth activation point. Therefore, the ground truth activation point is closer to the i-th maximum activation point than to the (i+1)-th maximum activation point. We report the averaged activation probabilities for the top k activation points on the WFLW dataset in Table I.

TABLE I: The activation probabilities of the top k activation points.
k=1 k=2 k=3 k=4
\boldsymbol{h}_{i}^{p}(\boldsymbol{x}) 0.44 0.26 0.17 0.13
NME(%) 6.45 5.07 4.71 4.68

We find that the marginal improvement decreases as the number of activation points increases, i.e., the second maximum activation point provides the largest improvement to the reconstruction of the fractional part. This observation partially explains the effectiveness of the compensation method “shift a quarter to the second maximum activation point”, which can be seen as a special case of the proposed method (20) with k=2.
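For reference, the compensation heuristic as described above can be sketched as follows (an illustrative implementation of the stated rule, not the authors' code; the function name is an assumption):

```python
import numpy as np

def decode_quarter_shift(heatmap, s):
    """Shift the maximum activation point a quarter pixel toward the second maximum, then scale by s."""
    order = np.argsort(heatmap.ravel())
    y1, x1 = np.unravel_index(order[-1], heatmap.shape)   # maximum activation point
    y2, x2 = np.unravel_index(order[-2], heatmap.shape)   # second maximum activation point
    p = np.array([x1, y1], dtype=np.float64)
    q = np.array([x2, y2], dtype=np.float64)
    return s * (p + 0.25 * np.sign(q - p))                # quarter-pixel shift toward the second maximum
```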

Furthermore, the proposed quantization system shares the same motivation as bilinear interpolation. Specifically, bilinear interpolation aims to find the value of an unknown function f(x,y) given its neighbors f(x_{1},y_{1}), f(x_{1},y_{2}), f(x_{2},y_{1}), and f(x_{2},y_{2}). For the proposed quantization system, we have f(x,y)=(x,y), which indicates the location of landmarks. Specifically, if there is no heatmap error, we have x_{1}=\lfloor x_{i}^{g}/s\rfloor, x_{2}=\lfloor x_{i}^{g}/s\rfloor+1, y_{1}=\lfloor y_{i}^{g}/s\rfloor, and y_{2}=\lfloor y_{i}^{g}/s\rfloor+1. If we take the heatmap error into consideration, the ground truth activation points are usually unknown. Therefore, the number of alternative activation points also controls the trade-off between the robustness of the quantization system and the risk of noise from the heatmap error (also see details in Section 4.4).

TABLE II: Comparison with State-of-the-Arts on WFLW dataset.
Method NME (%), Inter-ocular
test pose expression illumination make-up occlusion blur
ESR [29] 11.13 25.88 11.47 10.49 11.05 13.75 12.20
SDM [31] 10.29 24.10 11.45 9.32 9.38 13.03 11.28
CFSS [36] 9.07 21.36 10.09 8.30 8.74 11.76 9.96
DVLN [63] 6.08 11.54 6.78 5.73 5.98 7.33 6.88
LAB [64] 5.27 10.24 5.51 5.23 5.15 6.79 6.32
Wing [45] 5.11 8.75 5.36 4.93 5.41 6.37 5.81
3DDE [65] 4.68 8.62 5.21 4.65 4.60 5.77 5.41
DeCaFA [66] 4.62 8.11 4.65 4.41 4.63 5.74 5.38
HRNet [23] 4.60 7.86 4.78 4.57 4.26 5.42 5.36
AVS [67] 4.39 8.42 4.68 4.24 4.37 5.60 4.86
LUVLi [68] 4.37 - - - - - -
AWing [21] 4.21 7.21 4.46 4.23 4.02 4.99 4.82
H3R (ours) 3.81 6.45 4.07 3.70 3.66 4.48 4.30
TABLE III: Comparison with State-of-the-Arts on 300W dataset.
Method NME (%), Inter-ocular NME (%), Inter-pupil
private full common challenge private full common challenge
SAN [69] - 3.98 3.34 6.60 - - - -
DAN [70] 4.30 3.59 3.19 5.24 - 5.03 4.42 7.57
SHN [71] 4.05 - - - - 4.68 4.12 7.00
LAB [64] - 3.49 2.98 5.19 - 4.12 3.42 6.98
Wing [45] - - - - - 4.04 3.27 7.18
DeCaFA [66] - 3.39 2.93 5.26 - - - -
DFCE [72] 3.88 3.24 2.76 5.22 - 4.55 3.83 7.54
AVS [67] - 3.86 3.21 6.49 - 4.54 3.98 7.21
HRNet [23] 3.85 3.32 2.87 5.15 - - - -
HG-HSLE [73] - 3.28 2.85 5.03 - 4.59 3.94 7.24
LUVLi [68] - 3.23 2.76 5.16 - - - -
3DDE [65] 3.73 3.13 2.69 4.92 - 4.39 3.73 7.10
AWing [21] 3.56 3.07 2.72 4.52 - 4.31 3.77 6.52
H3R (ours) 3.48 3.02 2.65 4.58 5.07 4.24 3.67 6.60

5 Facial Landmark Detection

In this section, we perform facial landmark detection experiments. We first introduce widely used facial landmark detection datasets. We then describe the implementation details of our proposed method. Finally, we present our experimental results on different datasets and perform comprehensive ablation studies on the most challenging dataset.

5.1 Datasets

We use four widely used facial landmark detection datasets:

  • WFLW [64]. WFLW contains 10,000 face images, including 7,500 training images and 2,500 testing images, with 98 manually annotated facial landmarks. All face images are selected from the WIDER Face dataset [74], which contains face images with large variations in scale, expression, pose, and occlusion.

  • 300W [75]. 300W contains 3,148 training images, including 337 images from AFW [30], 2,000 images from the training set of HELEN [76], and 811 images from the training set of LFPW [39]. For testing, there are four different settings: 1) common: 554 images, including 330 and 224 images from the testsets of HELEN and LFPW, respectively; 2) challenge: 135 images from IBUG; 3) full: 689 images as a combination of common and challenge; and 4) private: 600 indoor/outdoor images. All images are manually annotated with 68 facial landmarks.

  • COFW [34]. COFW contains 1,852 images, including 1,345 training and 507 testing images. All images are manually annotated with 29 facial landmarks.

  • AFLW [77]. AFLW contains 24,386 face images, including 20,000 images for training and 4,836 images for testing. For testing, there are two settings: 1) full: all 4,836 images for testing; and 2) front: 1,314 frontal images selected from the full set. All images are manually annotated with 21 facial landmarks. For fair comparison, we use 19 facial landmarks, i.e., the landmarks on the two ears are ignored.

5.2 Evaluation Metrics

We use the normalized mean error (NME) as the evaluation metric in this paper, i.e.,

\text{NME}=\mathbb{E}\left(\frac{\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{g}\|_{2}}{d}\right), (23)

where d indicates the normalization distance. For fair comparison, we report the performance on WFLW, 300W, and COFW using the two normalization methods, inter-pupil distance (the distance between the eye centers) and inter-ocular distance (the distance between the outer eye corners). We report the performance on AFLW using the size of the face bounding box as the normalization distance, i.e., d=\sqrt{w*h}, where w and h indicate the width and height of the face bounding box, respectively.
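As a small illustration (a hedged sketch; the array shapes are assumptions), the NME in (23) averages the per-landmark Euclidean error divided by the normalization distance d:

```python
import numpy as np

def nme(pred, gt, d):
    """Normalized mean error, Eq. (23). pred and gt have shape (K, 2); d is the normalization
    distance (inter-ocular or inter-pupil distance, or sqrt(w*h) of the face box for AFLW)."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1) / d))
```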

5.3 Implementation Details

We implement the proposed heatmap regression method for facial landmark detection using PyTorch [78]. Following the practice in [23], we use HRNet [22] as our backbone network, which is an efficient counterpart of ResNet [79], U-Net [80], and Hourglass [6] for semantic landmark localization. Unless otherwise mentioned, we use HRNet-W18 as the backbone network in our experiments. All face images are cropped and resized to 256×256 pixels and the downsampling stride of the feature map is 4 pixels. For training, we perform widely-used data augmentation for facial landmark detection as follows. We horizontally flip all training images with probability 0.5 and randomly change the brightness (±0.125), contrast (±0.5), and saturation (±0.5) of each image. We then randomly rotate the image (±30°), rescale the image (±0.25), and translate the image (±16 pixels). We also randomly erase a rectangular region in the training image [81]. All our models are initialized from weights pretrained on ImageNet [82]. We use the Adam optimizer [83] with batch size 16. The learning rate starts from 0.001 and is divided by 10 every 60 epochs, with 150 training epochs in total. During the testing phase, we horizontally flip the testing images as data augmentation and average the predictions.

TABLE IV: Comparison with State-of-the-Arts on COFW dataset.
Method NME (%)
Inter-ocular Inter-pupil
SHN [71] - 5.60
LAB [64] - 5.58
DFCE [72] - 5.27
3DDE [65] - 5.11
Wing [45] - 5.07
HRNet [23] 3.45 -
AWing [21] - 4.94
H3R (ours) 3.15 4.55
TABLE V: Comparison with State-of-the-Arts on AFLW dataset.
Method NME (%)
full front
DFCE [72] 2.12 -
3DDE [65] 2.01 -
SAN [69] 1.91 1.85
LAB [64] 1.85 1.62
Wing [45] 1.65 -
HRNet [23] 1.57 1.46
LUVLi [68] 1.39 1.19
H3R (ours) 1.27 1.11

5.4 Comparison with Current State-of-the-Art

To demonstrate the effectiveness of the proposed method, we compare it with recent state-of-the-art facial landmark detection methods. As shown in Table II, the proposed method outperforms recent state-of-the-art methods on the most challenging dataset, WFLW, by a clear margin in all settings. For the 300W dataset, we report the performance under different settings for fair comparison. As shown in Table III, the proposed method achieves comparable performance across all settings. Specifically, LAB [64] uses boundary information as auxiliary supervision; compared to Wing [45], which uses coordinate regression for semantic landmark localization, heatmap-based methods usually achieve better performance on the challenge subset. In Table IV, we see that the proposed method outperforms recent state-of-the-art methods by a clear margin on the COFW dataset. AFLW captures a wide range of face poses, including both frontal and non-frontal faces. As shown in Table V, the proposed method achieves consistent improvements for both frontal and non-frontal faces, suggesting robustness across different face poses.

5.5 Ablation Studies

To better understand the proposed quantization system in different settings, we perform ablation studies on the most challenging dataset, WFLW [64].

Refer to caption

Figure 6: The influence of different input resolutions.

The influence of different input resolutions. Heatmap regression models use a fixed input resolution, e.g., 256×256 pixels, but training and testing images cover a wide range of resolutions, e.g., most faces in the WFLW dataset have an inter-ocular distance of between 30 and 120 pixels. Therefore, we compare the proposed method with the baseline using input resolutions from 64×64 pixels to 512×512 pixels, i.e., heatmap resolutions from 16×16 pixels to 128×128 pixels. As shown in Fig. 6, the proposed method significantly improves heatmap regression performance when using a low input resolution. The increasing number of high-resolution images/videos in real-world applications makes it costly, in both computation and device memory, to overcome the sub-pixel localization problem by simply increasing the input resolution of deep learning-based heatmap regression models. For example, in the film industry, it has sometimes become necessary to swap the appearance of a target actor and a source actor to generate high-fidelity video frames in visual effects, especially when an actor is unavailable for some scenes [84]. The manipulation of actor faces in video frames relies on accurate localization of facial landmarks and is performed at megapixel resolution, inducing a huge computational cost for extensive frame-by-frame animation. Therefore, instead of using high-resolution input images, our proposed method delivers an efficient alternative for accurate semantic landmark localization.

TABLE VI: The influence of different numbers of alternative activation points when using binary heatmap.
Resolution NME (%)
k=1 k=2 k=3 k=4 k=5 k=6 k=7 k=8 k=9 k=10 k=11 k=12 best
512×512 3.980 3.946 3.930 3.923 3.916 3.912 3.909 3.906 3.903 3.901 3.898 3.897 3.890 / k=25
384×384 3.932 3.875 3.855 3.850 3.842 3.837 3.833 3.830 3.828 3.827 3.826 3.826 3.825 / k=14
256×256 4.005 3.881 3.836 3.832 3.819 3.815 3.810 3.808 3.807 3.807 3.807 3.808 3.807 / k=9
128×128 4.637 4.164 4.029 4.023 3.997 3.991 3.988 3.989 3.991 3.994 3.998 4.002 3.988 / k=7
TABLE VII: The influence of different numbers of alternative activation points when using Gaussian heatmap.
NME (%)
k=1 k=2 k=3 k=4 k=5 k=6 k=7 k=8 k=9 k=10 k=11 k=12 best
\sigma=0.0 4.235 4.102 4.069 4.065 4.090 4.202 4.427 4.775 5.219 5.719 6.256 6.802 4.065 / k=4
\sigma=0.5 4.162 4.045 4.010 4.008 3.991 3.990 3.988 3.992 3.998 4.014 4.047 4.108 3.988 / k=7
\sigma=1.0 4.037 3.928 3.898 3.908 3.871 3.877 3.876 3.876 3.871 3.861 3.855 3.857 3.855 / k=11
\sigma=1.5 4.032 3.927 3.901 3.918 3.873 3.879 3.885 3.882 3.878 3.864 3.857 3.860 3.855 / k=18
\sigma=2.0 4.086 3.983 3.953 3.969 3.923 3.931 3.937 3.931 3.925 3.909 3.902 3.907 3.894 / k=25
TABLE VIII: The influence of different face “bounding box” annotation policies.
Policy NME (%), Inter-ocular
training testing test pose expression illumination make-up occlusion blur
P1 P1 3.81 6.45 4.07 3.70 3.66 4.48 4.30
P2 P2 3.95 6.74 4.17 3.89 3.81 4.73 4.51
P1 P2 5.34 9.68 5.24 5.25 5.57 6.55 6.14
P2 P1 4.04 6.90 4.24 3.99 3.90 4.87 4.65

The influence of different numbers of alternative activation points. In the proposed quantization system, the activation probability indicates the distance between an activation point and the ground truth activation point (x_{i}^{g}/s,y_{i}^{g}/s). If there is no heatmap error, the alternative activation points in (20) give the same result as those in (18). If the heatmap error cannot be ignored, there is a trade-off on the number of alternative activation points: 1) a small k increases the risk of missing the ground truth alternative activation points; and 2) a large k introduces noise from irrelevant activation points, especially for large heatmap error. We demonstrate the performance of the proposed method with different numbers of alternative activation points in Table VI. Specifically, we see that 1) when using a high input resolution, the best performance is achieved with a relatively large k; and 2) the performance is smooth near the optimal number of alternative activation points, making it easy to find a proper k on validation data. As introduced in Section 3, binary heatmaps can be seen as a special case of Gaussian heatmaps with standard deviation \sigma=0. Considering that Gaussian heatmaps have been widely used in semantic landmark localization applications, we generalize the proposed quantization system to Gaussian heatmaps and demonstrate the influence of different numbers of alternative activation points in Table VII. Specifically, we see that 1) when applying the proposed quantization system to the model using the Gaussian heatmap, it achieves comparable performance to the model using the binary heatmap; and 2) the optimal number of alternative activation points increases with the standard deviation \sigma.

The influence of different “bounding box” annotation policies. For facial landmark detection, a reference bounding box is required to indicate the position of the facial area. However, there is a performance gap when using different reference bounding boxes [21]. A comparison between two widely used “bounding box” annotation policies is shown in Fig. 7, and we introduce two different bounding box annotation policies as follows:

  • P1: This annotation policy is usually used in semantic landmark localization tasks, especially in facial landmark localization. Specifically, the rectangular area of the bounding box tightly encloses a set of pre-defined facial landmarks.

  • P2: This annotation policy has been widely used in face detection datasets [74]. The bounding box contains the areas of the forehead, chin, and cheeks. For occluded faces, the bounding box is estimated by the human annotator based on the extent of occlusion.

Refer to caption

Figure 7: A comparison between different “bounding box” annotation policies. P1: the yellow bounding boxes. P2: the blue bounding boxes.

Refer to caption

Figure 8: Qualitative results from the test split of the WFLW dataset (best viewed in color). Blue: the ground truth facial landmarks. Yellow: the predicted facial landmarks. The first row shows some “good” cases and the second row shows some “bad” cases.

We present the experimental results using different annotation policies in Table VIII. Specifically, we find that policy P1 usually achieves better results, possibly because the occluded forehead (e.g., hair) introduces additional variations to the face bounding boxes when using policy P2. Furthermore, the model trained using policy P2 is more robust to a different bounding box policy during testing, suggesting robustness to inaccurate bounding boxes from face detection algorithms.

Qualitative Analysis. As shown in Fig. 8, we present some “good” and “bad” facial landmark detection examples according to NME. For the good cases presented in the first row, most images are of high quality; for the bad cases in the second row, most images contain heavy blurring and/or occlusion, making it difficult to accurately identify the contours of different facial parts.

TABLE IX: Results on the MPII Human Pose dataset. In each block, the first row with “-” indicates the baseline method, i.e., k=1; the second row with “*” indicates the compensation method, i.e., “shift a quarter to the second maximum activation point”; and the third row with “✓” indicates the proposed method H3R.
Backbone Input H3R Head Shoulder Elbow Wrist Hip Knee Ankle Mean [email protected]
ResNet-50 256×256 - 96.3 95.2 89.0 82.9 88.2 83.7 79.4 88.4 29.8
ResNet-50 256×256 * 96.4 95.3 89.0 83.2 88.4 84.0 79.6 88.5 34.0
ResNet-50 256×256 ✓ 96.3 95.2 88.8 83.4 88.5 84.3 79.8 88.6 34.9
ResNet-101 256×256 - 96.7 95.8 89.3 84.2 87.9 84.2 80.7 88.9 30.0
ResNet-101 256×256 * 96.9 95.9 89.5 84.4 88.4 84.5 80.7 89.1 34.0
ResNet-101 256×256 ✓ 96.7 96.0 89.3 84.4 88.5 84.3 80.6 89.1 35.0
ResNet-152 256×256 - 97.0 95.8 89.9 84.7 89.1 85.4 81.3 89.5 31.0
ResNet-152 256×256 * 97.0 95.9 90.0 85.0 89.2 85.3 81.3 89.6 35.0
ResNet-152 256×256 ✓ 96.8 95.9 90.1 84.9 89.3 85.3 81.3 89.6 36.2
HRNet-W32 256×256 - 97.0 95.7 90.1 86.3 88.4 86.8 82.9 90.1 32.8
HRNet-W32 256×256 * 97.1 95.9 90.3 86.4 89.1 87.1 83.3 90.3 37.7
HRNet-W32 256×256 ✓ 97.1 96.1 90.8 86.1 89.2 86.4 82.6 90.3 39.3
TABLE X: Results on the COCO validation set. In each block, the first row with “-” indicates the baseline, i.e., k=1; the second row with “*” indicates the compensation method, i.e., “shift a quarter to the second maximum activation point”, which can be seen as a special case of H3R with k=2; and the third row with “✓” indicates the proposed method H3R.
Backbone Input H3R AP AP .5 AP .75 AP (M) AP (L) AR AR .5 AR .75 AR (M) AR (L)
HRNet-W32 192×128 - 0.674 0.890 0.771 0.648 0.732 0.739 0.932 0.828 0.700 0.795
HRNet-W32 192×128 * 0.710 0.892 0.792 0.682 0.771 0.770 0.933 0.844 0.732 0.827
HRNet-W32 192×128 ✓ 0.720 0.892 0.797 0.691 0.784 0.777 0.933 0.846 0.739 0.834
HRNet-W32 256×192 - 0.723 0.904 0.811 0.690 0.788 0.782 0.941 0.859 0.741 0.841
HRNet-W32 256×192 * 0.744 0.905 0.819 0.708 0.810 0.798 0.942 0.865 0.757 0.858
HRNet-W32 256×192 ✓ 0.750 0.906 0.820 0.715 0.817 0.802 0.942 0.865 0.761 0.861
HRNet-W48 256×192 - 0.730 0.904 0.817 0.693 0.798 0.788 0.943 0.864 0.745 0.852
HRNet-W48 256×192 * 0.751 0.906 0.822 0.715 0.818 0.804 0.943 0.867 0.762 0.864
HRNet-W48 256×192 ✓ 0.756 0.906 0.825 0.718 0.825 0.806 0.941 0.868 0.763 0.869
HRNet-W32 384×288 - 0.748 0.904 0.826 0.712 0.816 0.802 0.941 0.871 0.759 0.864
HRNet-W32 384×288 * 0.758 0.906 0.825 0.720 0.827 0.809 0.943 0.869 0.767 0.871
HRNet-W32 384×288 ✓ 0.762 0.905 0.830 0.725 0.833 0.812 0.942 0.873 0.769 0.874
HRNet-W48 384×288 - 0.753 0.907 0.823 0.712 0.823 0.804 0.941 0.867 0.759 0.869
HRNet-W48 384×288 * 0.763 0.908 0.829 0.723 0.834 0.812 0.942 0.871 0.767 0.876
HRNet-W48 384×288 ✓ 0.765 0.907 0.829 0.724 0.838 0.814 0.941 0.871 0.769 0.878

6 Human Pose Estimation

In this section, we perform human pose estimation experiments to further demonstrate the effectiveness of the proposed quantization system for accurate semantic landmark localization.

6.1 Datasets

We perform experiments on two popular human pose estimation datasets,

  • MPII [85]: The MPII Human Pose dataset contains 28,821 images with 40,522 person instances, of which 11,701 images are used for testing and the remaining 17,120 images for training. Following the experimental setup in [22], we use 22,246 person instances for training and evaluate the performance on the MPII validation set with 2,958 person instances, which is a held-out subset of the MPII training set.

  • COCO [86]: The COCO dataset contains over 200,000 images and 250,000 person instances, in which each person instance is labeled with 17 keypoints. Following the experimental setup in [22], we evaluate the proposed method on the validation set with 5,000 images.

6.2 Implementation Details

We utilize a recent state-of-the-art heatmap regression method for human pose estimation, HRNet [22], as our baseline. The proposed quantization system can be easily integrated into most heatmap regression models, and we have made the source code for human pose estimation based on the HRNet baseline publicly available. For the MPII Human Pose dataset, we use the standard evaluation metric, the head-normalized probability of correct keypoints, or PCKh [85]. Specifically, a correct keypoint should fall within α·l pixels of the ground truth position, where l indicates the normalization distance and α∈[0,1] is the matching threshold. For a fair comparison, we report results at two different matching thresholds, [email protected] and [email protected], where the smaller matching threshold, α=0.1, is a stricter evaluation metric for accurate semantic landmark localization [22]. For the COCO dataset, we use the standard evaluation metrics, average precision (AP) and average recall (AR), where the object keypoint similarity (OKS) is used as the similarity measure between the ground truth and predicted objects [86].
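For reference, a minimal sketch of the PCKh metric described above is given below (Python/NumPy). It assumes predictions and ground truths of shape [N, K, 2] and a per-instance normalization distance l, and it ignores joint visibility for brevity, so it is an illustration rather than the official evaluation script.

```python
import numpy as np

def pckh(pred, gt, norm_dist, alpha=0.5):
    """Fraction of predicted keypoints falling within alpha * l pixels of the
    ground truth, where l is the per-instance normalization distance (PCKh)."""
    dist = np.linalg.norm(pred - gt, axis=-1)   # [N, K] Euclidean distances
    thresh = alpha * norm_dist[:, None]         # per-instance matching threshold
    return float((dist <= thresh).mean())
```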

6.3 Results

The experimental results on the MPII dataset are shown in Table IX. Specifically, when using the coarse evaluation metric, [email protected], both the proposed method and the compensation method achieve performance comparable to the baseline, suggesting that the quantization error is trivial for coarse semantic landmark localization. When using the stricter evaluation metric, [email protected], the compensation method, which can be seen as a special case of H3R with k=2, significantly improves the baseline, e.g., from 32.8 to 37.7, while the proposed method H3R further improves the performance from 37.7 to 39.3. The experimental results on the COCO dataset are shown in Table X. Specifically, the proposed method clearly improves the average precision (AP) in different settings, and the major improvements on AP come from: 1) the strict evaluation metric, e.g., AP .75; and 2) medium/large person instances, i.e., AP (M) and AP (L). Furthermore, we also find that the improvement decreases when increasing the input resolution, e.g., from 0.674 to 0.720 for 192×128 (0.046↑), from 0.723 to 0.750 for 256×192 (0.027↑), and from 0.748 to 0.762 for 384×288 (0.014↑).

7 Conclusion

In this paper, we address the problem of sub-pixel localization for heatmap-based semantic landmark localization. We formally analyze the quantization error in vanilla heatmap regression and propose a new quantization system via a randomized rounding operation, which we prove is unbiased and lossless. Experiments on facial landmark localization and human pose estimation datasets demonstrate the effectiveness of the proposed quantization system for efficient and accurate sub-pixel localization.

Acknowledgement

Dr. Baosheng Yu is supported by ARC project FL-170100117.

References

  • [1] Y. Sun, X. Wang, and X. Tang, “Deep convolutional network cascade for facial point detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3476–3483.
  • [2] A. Bulat and G. Tzimiropoulos, “How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks),” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1021–1030.
  • [3] A. Sinha, C. Choi, and K. Ramani, “Deephand: Robust hand pose estimation by completing a matrix imputed with deep features,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [4] U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz, “Hand pose estimation via latent 2.5d heatmap regression,” in European Conference on Computer Vision (ECCV), 2018, pp. 118–134.
  • [5] A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1653–1660.
  • [6] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision (ECCV), 2016, pp. 483–499.
  • [7] A. Saxena, J. Driemeyer, and A. Y. Ng, “Robotic grasping of novel objects using vision,” International Journal of Robotics Research (IJRR), vol. 27, no. 2, pp. 157–173, 2008. [Online]. Available: https://doi.org/10.1177/0278364907087172
  • [8] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in Neural Information Processing Systems (NIPS), 2014, pp. 1988–1996.
  • [9] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner, “Face2face: Real-time face capture and reenactment of rgb videos,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2387–2395.
  • [10] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese, “Densefusion: 6d object pose estimation by iterative dense fusion,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [11] J. Deng, J. Guo, X. Niannan, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [12] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang, “Spindle net: Person re-identification with human body region guided feature decomposition and fusion,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1077–1085.
  • [13] L. Zheng, Y. Huang, H. Lu, and Y. Yang, “Pose-invariant embedding for deep person re-identification,” IEEE Transactions on Image Processing (TIP), vol. 28, no. 9, pp. 4500–4509, 2019.
  • [14] Y.-C. Chen, X. Shen, and J. Jia, “Makeup-go: Blind reversion of portrait edit,” in IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [15] H. Chang, J. Lu, F. Yu, and A. Finkelstein, “Pairedcyclegan: Asymmetric style transfer for applying and removing makeup,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 40–48.
  • [16] C. Cao, Y. Weng, S. Lin, and K. Zhou, “3d shape regression for real-time facial animation,” ACM Transactions on Graphics (TOG), vol. 32, no. 4, pp. 1–10, 2013.
  • [17] C. Cao, Q. Hou, and K. Zhou, “Displaced dynamic expression regression for real-time facial tracking and animation,” ACM Transactions on Graphics (TOG), vol. 33, no. 4, pp. 1–10, 2014.
  • [18] J. Thies, M. Zollhöfer, M. Nießner, L. Valgaerts, M. Stamminger, and C. Theobalt, “Real-time expression transfer for facial reenactment.” ACM Transactions on Graphics (TOG), vol. 34, no. 6, pp. 183–1, 2015.
  • [19] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint training of a convolutional network and a graphical model for human pose estimation,” in Neural Information Processing Systems (NIPS), 2014, pp. 1799–1807.
  • [20] A. Nibali, Z. He, S. Morgan, and L. Prendergast, “Numerical coordinate regression with convolutional neural networks,” arXiv preprint arXiv:1801.07372, 2018.
  • [21] X. Wang, L. Bo, and L. Fuxin, “Adaptive wing loss for robust face alignment via heatmap regression,” in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 6971–6981.
  • [22] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5693–5703.
  • [23] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao, “Deep high-resolution representation learning for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.
  • [24] Y. Tai, Y. Liang, X. Liu, L. Duan, J. Li, C. Wang, F. Huang, and Y. Chen, “Towards highly accurate and stable face alignment for high-resolution videos,” in AAAI Conference on Artificial Intelligence (AAAI), vol. 33, 2019, pp. 8893–8900.
  • [25] W. Li, Z. Wang, B. Yin, Q. Peng, Y. Du, T. Xiao, G. Yu, H. Lu, Y. Wei, and J. Sun, “Rethinking on multi-stage networks for human pose estimation,” arXiv preprint arXiv:1901.00148, 2019.
  • [26] F. Zhang, X. Zhu, H. Dai, M. Ye, and C. Zhu, “Distribution-aware coordinate representation for human pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 7093–7102.
  • [27] P. Raghavan and C. D. Thompson, “Randomized rounding: a technique for provably good algorithms and algorithmic proofs,” Combinatorica, vol. 7, no. 4, pp. 365–374, 1987.
  • [28] B. Korte and J. Vygen, Combinatorial Optimization.   Springer, 2012, vol. 2.
  • [29] X. Cao, Y. Wei, F. Wen, and J. Sun, “Face alignment by explicit shape regression,” International Journal of Computer Vision (IJCV), vol. 107, no. 2, pp. 177–190, 2014.
  • [30] X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2879–2886.
  • [31] X. Xiong and F. De la Torre, “Supervised descent method and its applications to face alignment,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 532–539.
  • [32] S. Ren, X. Cao, Y. Wei, and J. Sun, “Face alignment at 3000 fps via regressing local binary features,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1685–1692.
  • [33] V. Kazemi and J. Sullivan, “One millisecond face alignment with an ensemble of regression trees,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1867–1874.
  • [34] X. P. Burgos-Artizzu, P. Perona, and P. Dollár, “Robust face landmark estimation under occlusion,” in IEEE International Conference on Computer Vision (ICCV), 2013, pp. 1513–1520.
  • [35] J. Zhang, S. Shan, M. Kan, and X. Chen, “Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment,” in European Conference on Computer Vision (ECCV), 2014, pp. 1–16.
  • [36] S. Zhu, C. Li, C. Change Loy, and X. Tang, “Face alignment by coarse-to-fine shape searching,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4998–5006.
  • [37] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik, “Human pose estimation with iterative error feedback,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4733–4742.
  • [38] V. Belagiannis and A. Zisserman, “Recurrent human pose estimation,” in IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2017, pp. 468–475.
  • [39] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar, “Localizing parts of faces using a consensus of exemplars,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 35, no. 12, pp. 2930–2940, 2013.
  • [40] X. Miao, X. Zhen, X. Liu, C. Deng, V. Athitsos, and H. Huang, “Direct shape regression networks for end-to-end face alignment,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5040–5049.
  • [41] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Facial landmark detection by deep multi-task learning,” in European Conference on Computer Vision (ECCV), 2014, pp. 94–108.
  • [42] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters (SPL), vol. 23, no. 10, pp. 1499–1503, 2016.
  • [43] R. Ranjan, V. M. Patel, and R. Chellappa, “Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 41, no. 1, pp. 121–135, 2017.
  • [44] R. Girshick, “Fast r-cnn,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
  • [45] Z.-H. Feng, J. Kittler, M. Awais, P. Huber, and X.-J. Wu, “Wing loss for robust facial landmark localisation with convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2235–2245.
  • [46] A. Bulat and G. Tzimiropoulos, “Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3706–3714.
  • [47] D. Merget, M. Rock, and G. Rigoll, “Robust facial landmark detection via a fully-convolutional local-global context network,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 781–790.
  • [48] T. Pfister, J. Charles, and A. Zisserman, “Flowing convnets for human pose estimation in videos,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1913–1921.
  • [49] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele, “Deepcut: Joint subset partition and labeling for multi person pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4929–4937.
  • [50] A. Newell, Z. Huang, and J. Deng, “Associative embedding: End-to-end learning for joint detection and grouping,” in Neural Information Processing Systems (NIPS), 2017, pp. 2277–2287.
  • [51] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang, “Learning feature pyramids for human pose estimation,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1281–1290.
  • [52] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy, “Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model,” in European Conference on Computer Vision (ECCV), 2018, pp. 269–286.
  • [53] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4724–4732.
  • [54] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7291–7299.
  • [55] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, “Cascaded pyramid network for multi-person pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7103–7112.
  • [56] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy, “Towards accurate multi-person pose estimation in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4903–4911.
  • [57] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei, “Integral human pose regression,” in European Conference on Computer Vision (ECCV), 2018, pp. 529–545.
  • [58] D. C. Luvizon, D. Picard, and H. Tabia, “2d/3d pose estimation and action recognition using multitask deep learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5137–5146.
  • [59] D. C. Luvizon, H. Tabia, and D. Picard, “Human pose regression by combining indirect part detection and contextual information,” Computers & Graphics, vol. 85, pp. 15–22, 2019.
  • [60] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, “Lift: Learned invariant feature transform,” in European Conference on Computer Vision (ECCV), 2016, pp. 467–483.
  • [61] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” Journal of Machine Learning Research (JMLR), vol. 17, no. 1, pp. 1334–1373, 2016.
  • [62] J. Thewlis, H. Bilen, and A. Vedaldi, “Unsupervised learning of object landmarks by factorized spatial embeddings,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5916–5925.
  • [63] W. Wu and S. Yang, “Leveraging intra and inter-dataset variations for robust face alignment,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 150–159.
  • [64] W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou, “Look at boundary: A boundary-aware face alignment algorithm,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2129–2138.
  • [65] R. Valle, J. M. Buenaposada, A. Valdés, and L. Baumela, “Face alignment using a 3d deeply-initialized ensemble of regression trees,” Computer Vision and Image Understanding (CVIU), vol. 189, p. 102846, 2019.
  • [66] A. Dapogny, K. Bailly, and M. Cord, “Decafa: Deep convolutional cascade for face alignment in the wild,” in IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [67] S. Qian, K. Sun, W. Wu, C. Qian, and J. Jia, “Aggregation via separation: Boosting facial landmark detector with semi-supervised style translation,” in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 10153–10163.
  • [68] A. Kumar, T. K. Marks, W. Mou, Y. Wang, M. Jones, A. Cherian, T. Koike-Akino, X. Liu, and C. Feng, “Luvli face alignment: Estimating landmarks’ location, uncertainty, and visibility likelihood,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8236–8246.
  • [69] X. Dong, Y. Yan, W. Ouyang, and Y. Yang, “Style aggregated network for facial landmark detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 379–388.
  • [70] M. Kowalski, J. Naruniec, and T. Trzcinski, “Deep alignment network: A convolutional neural network for robust face alignment,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 88–97.
  • [71] J. Yang, Q. Liu, and K. Zhang, “Stacked hourglass network for robust facial landmark localisation,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 79–87.
  • [72] R. Valle, J. M. Buenaposada, A. Valdés, and L. Baumela, “A deeply-initialized coarse-to-fine ensemble of regression trees for face alignment,” in European Conference on Computer Vision (ECCV), 2018, pp. 585–601.
  • [73] X. Zou, S. Zhong, L. Yan, X. Zhao, J. Zhou, and Y. Wu, “Learning robust facial landmark detection via hierarchical structured ensemble,” in IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [74] S. Yang, P. Luo, C.-C. Loy, and X. Tang, “Wider face: A face detection benchmark,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5525–5533.
  • [75] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “300 faces in-the-wild challenge: The first facial landmark localization challenge,” in IEEE International Conference on Computer Vision Workshops (ICCVW), 2013, pp. 397–403.
  • [76] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang, “Interactive facial feature localization,” in European Conference on Computer Vision (ECCV), 2012, pp. 679–692.
  • [77] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof, “Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization,” in IEEE International Conference on Computer Vision Workshops (ICCVW), 2011, pp. 2144–2151.
  • [78] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Neural Information Processing Systems (NeurIPS), H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, Eds.   Curran Associates, Inc., 2019, pp. 8024–8035.
  • [79] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [80] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).   Springer, 2015, pp. 234–241.
  • [81] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” AAAI Conference on Artificial Intelligence (AAAI), 2020.
  • [82] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
  • [83] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.
  • [84] J. Naruniec, L. Helminger, C. Schroers, and R. Weber, “High-resolution neural face swapping for visual effects,” Eurographics Symposium on Rendering, vol. 39, no. 4, 2020.
  • [85] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2d human pose estimation: New benchmark and state of the art analysis,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 3686–3693.
  • [86] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision (ECCV), 2014, pp. 740–755.
[Uncaptioned image] Baosheng Yu received a B.E. from the University of Science and Technology of China in 2014, and a Ph.D. from The University of Sydney in 2019. He is currently a Research Fellow in the School of Computer Science and the Faculty of Engineering at The University of Sydney, NSW, Australia. His research interests include machine learning, computer vision, and deep learning.
[Uncaptioned image] Dacheng Tao (F’15) is the President of the JD Explore Academy and a Senior Vice President of JD.com. He is also an advisor and chief scientist of the digital science institute in the University of Sydney. He mainly applies statistics and mathematics to artificial intelligence and data science, and his research is detailed in one monograph and over 200 publications in prestigious journals and proceedings at leading conferences. He received the 2015 Australian Scopus-Eureka Prize, the 2018 IEEE ICDM Research Contributions Award, and the 2021 IEEE Computer Society McCluskey Technical Achievement Award. He is a fellow of the Australian Academy of Science, AAAS, and ACM.

Appendix A Proofs of Theorem 1 and Theorem 2

Theorem 1.

Given an unbiased quantization system defined by the encode operation in (10) and the decode operation in (13), we then have that the quantization error is tightly upper bounded, i.e.,

\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{g}\|_{2}\leq\sqrt{2}s/2,

where s>1 indicates the downsampling stride of the heatmap.

Proof.

Given the ground truth numerical coordinate \boldsymbol{x}_{i}^{g}=\left(x_{i}^{g},y_{i}^{g}\right), the predicted numerical coordinate \boldsymbol{x}_{i}^{p}=\left(x_{i}^{p},y_{i}^{p}\right), and the downsampling stride of the heatmap s>1, if there is no heatmap error, we then have

\boldsymbol{h}_{i}^{p}(\boldsymbol{x})=\boldsymbol{h}_{i}^{g}(\boldsymbol{x}),

where \boldsymbol{h}_{i}^{p}(\boldsymbol{x}) and \boldsymbol{h}_{i}^{g}(\boldsymbol{x}) indicate the predicted heatmap and the ground truth heatmap, respectively. Therefore, according to the decode operation in (13), we have the predicted numerical coordinate as

x_{i}^{p}/s=\begin{cases}\lfloor x_{i}^{g}/s\rfloor+t-0.5&\quad\text{if }\epsilon_{x}<t,\\ \lfloor x_{i}^{g}/s\rfloor+t+0.5&\quad\text{otherwise},\end{cases}
y_{i}^{p}/s=\begin{cases}\lfloor y_{i}^{g}/s\rfloor+t-0.5&\quad\text{if }\epsilon_{y}<t,\\ \lfloor y_{i}^{g}/s\rfloor+t+0.5&\quad\text{otherwise},\end{cases}

where \epsilon_{x}=x_{i}^{g}/s-\lfloor x_{i}^{g}/s\rfloor and \epsilon_{y}=y_{i}^{g}/s-\lfloor y_{i}^{g}/s\rfloor. The quantization error of the vanilla quantization system can then be evaluated as follows:

|x_{i}^{p}/s-x_{i}^{g}/s|=\begin{cases}|t-\epsilon_{x}-0.5|&\quad\text{if }\epsilon_{x}<t,\\ |t-\epsilon_{x}+0.5|&\quad\text{otherwise},\end{cases}
|y_{i}^{p}/s-y_{i}^{g}/s|=\begin{cases}|t-\epsilon_{y}-0.5|&\quad\text{if }\epsilon_{y}<t,\\ |t-\epsilon_{y}+0.5|&\quad\text{otherwise}.\end{cases}

The maximum quantization error |x_{i}^{p}-x_{i}^{g}|=s/2 is achieved when \epsilon_{x}=t. Similarly, the maximum quantization error |y_{i}^{p}-y_{i}^{g}|=s/2 is achieved when \epsilon_{y}=t. Considering that x_{i}^{p} and y_{i}^{p} are independent variables, we thus have

\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{g}\|_{2}=\sqrt{\left(x_{i}^{p}-x_{i}^{g}\right)^{2}+\left(y_{i}^{p}-y_{i}^{g}\right)^{2}}\leq\sqrt{2}s/2.

The maximum quantization error is achieved with \epsilon_{x}=\epsilon_{y}=t. That is, the quantization error of the vanilla quantization system is tightly upper bounded by \sqrt{2}s/2.

Theorem 2.

Given the encode operation in (15) and the decode operation in (17), we then have that 1) the encode operation is unbiased; and 2) the quantization system is lossless, i.e., there is no quantization error.

Proof.

Given the ground truth numerical coordinate \boldsymbol{x}_{i}^{g}=\left(x_{i}^{g},y_{i}^{g}\right), the predicted numerical coordinate \boldsymbol{x}_{i}^{p}=\left(x_{i}^{p},y_{i}^{p}\right), and the downsampling stride of the heatmap s>1, we then have

\begin{split}\mathbb{E}\left(\boldsymbol{q}(x_{i}^{g},s)\right)&=\mathbb{E}\left(P\left\{\epsilon_{x}<t\right\}\lfloor x_{i}^{g}/s\rfloor+P\left\{\epsilon_{x}\geq t\right\}(\lfloor x_{i}^{g}/s\rfloor+1)\right)\\&=\lfloor x_{i}^{g}/s\rfloor(1-\epsilon_{x})+(\lfloor x_{i}^{g}/s\rfloor+1)\epsilon_{x}\\&=x_{i}^{g}/s.\end{split}

Similarly, we have \mathbb{E}\left(\boldsymbol{q}(y_{i}^{g},s)\right)=y_{i}^{g}/s. Considering that x_{i}^{p} and y_{i}^{p} are independent variables, we thus have

\begin{split}\mathbb{E}\left(\boldsymbol{q}(\boldsymbol{x}_{i}^{g},s)\right)&=\left(\mathbb{E}\left(\boldsymbol{q}(x_{i}^{g},s)\right),\mathbb{E}\left(\boldsymbol{q}(y_{i}^{g},s)\right)\right)\\&=(x_{i}^{g}/s,y_{i}^{g}/s).\end{split}

Therefore, the encode operation in (15), i.e., randomized rounding, is an unbiased encode operation for heatmap regression.
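The unbiasedness above is also easy to check numerically. The following sketch (assuming t is drawn uniformly from [0, 1), with illustrative names) performs the randomized rounding of (15) many times and compares the sample mean with x_{i}^{g}/s:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_round(x, s, size):
    """Randomized rounding of x/s: round up with probability equal to the
    fractional part epsilon = x/s - floor(x/s), so that E[q] = x/s."""
    q = x / s
    frac = q - np.floor(q)
    return np.floor(q) + (rng.random(size) < frac)

x, s, n = 137.3, 4.0, 1_000_000
samples = random_round(x, s, n)
print(samples.mean(), x / s)   # the two values agree up to Monte Carlo noise
```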

We then prove that the quantization system is lossless as follows. For the decode operation in (17), if there is no heatmap error, we then have

\begin{split}P\{\boldsymbol{x}_{i}^{p}/s=(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor)\}&=(1-\epsilon_{x})(1-\epsilon_{y}),\\ P\{\boldsymbol{x}_{i}^{p}/s=(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor)\}&=\epsilon_{x}(1-\epsilon_{y}),\\ P\{\boldsymbol{x}_{i}^{p}/s=(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor+1)\}&=(1-\epsilon_{x})\epsilon_{y},\\ P\{\boldsymbol{x}_{i}^{p}/s=(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor+1)\}&=\epsilon_{x}\epsilon_{y}.\end{split}

We can then reconstruct the fractional part of \boldsymbol{x}_{i}^{g}, i.e.,

\begin{split}\left(x_{i}^{p}/s,y_{i}^{p}/s\right)&=\sum\limits_{\boldsymbol{x}_{i}^{p}}P\{\boldsymbol{x}_{i}^{p}\}\,\boldsymbol{x}_{i}^{p}/s\\&=(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor)*(1-\epsilon_{x})(1-\epsilon_{y})\\&\quad+(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor)*\epsilon_{x}(1-\epsilon_{y})\\&\quad+(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor+1)*(1-\epsilon_{x})\epsilon_{y}\\&\quad+(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor+1)*\epsilon_{x}\epsilon_{y}\\&=\left(x_{i}^{g}/s,y_{i}^{g}/s\right).\end{split}

That is, \left(x_{i}^{p},y_{i}^{p}\right)=\left(x_{i}^{g},y_{i}^{g}\right), i.e., there is no quantization error.
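The lossless property can likewise be verified with a short sketch: we place the four probabilities above on the integer neighbours of (x_{i}^{g}/s, y_{i}^{g}/s) (the expected ground-truth heatmap under randomized rounding) and decode via the probability-weighted average, recovering the original coordinate exactly. The names and the heatmap size below are illustrative assumptions.

```python
import numpy as np

def encode_expected(xg, yg, s, size):
    """Expected ground-truth heatmap: bilinear probabilities on the four
    integer neighbours of (xg/s, yg/s)."""
    hm = np.zeros(size)
    qx, qy = xg / s, yg / s
    fx, fy = int(np.floor(qx)), int(np.floor(qy))
    ex, ey = qx - fx, qy - fy
    hm[fy, fx] = (1 - ex) * (1 - ey)
    hm[fy, fx + 1] = ex * (1 - ey)
    hm[fy + 1, fx] = (1 - ex) * ey
    hm[fy + 1, fx + 1] = ex * ey
    return hm

def decode_expected(hm, s):
    """Decode the coordinate as the probability-weighted average of the
    activation points, scaled back by the stride s."""
    ys, xs = np.nonzero(hm)
    w = hm[ys, xs]
    return float((w * xs).sum()) * s, float((w * ys).sum()) * s

hm = encode_expected(13.7, 21.2, 4, (64, 64))
print(decode_expected(hm, 4))   # (13.7, 21.2): no quantization error
```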

Appendix B Experiments

In this section, we provide additional experimental results on facial landmark detection and human pose estimation.

B.1 Facial Landmark Detection

TABLE XI: The influence of different numbers of training samples when using different input resolutions. In each cell, the first number indicates the performance of the baseline method and the second number indicates the performance of the proposed method.
#Samples NME (%)
256×256 128×128 64×64
256 5.67/5.46 6.02/5.47 7.74/6.20
1024 4.79/4.63 5.27/4.77 7.13/5.43
4096 4.14/3.97 4.81/4.17 6.57/4.76
7500 4.00/3.81 4.62/3.99 6.50/4.62
TABLE XII: Comparison of different backbone networks and feature maps on WFLW dataset.
Backbone Input Heatmap FLOPs #Params NME (%), Inter-ocular
test pose expression illumination make-up occlusion blur
HRNet-W18 256×256 64×64 4.84G 9.69M 3.81 6.45 4.07 3.70 3.66 4.48 4.30
HRNet-W18 256×256 256×256 4.98G 9.69M 3.91 6.83 4.09 3.82 3.70 4.59 4.41
U-Net 256×256 256×256 60.55G 31.46M 4.93 9.30 5.08 4.74 4.82 6.43 5.72
HRNet-W18 128×128 32×32 1.21G 9.69M 3.99 6.78 4.26 3.89 3.84 4.65 4.46
HRNet-W18 128×128 128×128 1.24G 9.69M 4.06 6.89 4.41 3.97 3.95 4.78 4.52
U-Net 128×128 128×128 15.14G 31.46M 4.17 7.10 4.45 4.06 4.03 5.00 4.73
HRNet-W18 64×64 16×16 0.30G 9.69M 4.64 7.77 5.05 4.52 4.59 5.31 4.96
HRNet-W18 64×64 64×64 0.31G 9.69M 4.61 7.70 5.00 4.44 4.58 5.28 4.94
U-Net 64×64 64×64 3.79G 31.46M 4.37 7.18 4.69 4.26 4.30 5.12 4.80

The influence of different numbers of training samples. The proposed quantization system does not rely on any assumption about the number of training samples, and it is lossless for heatmap regression if there is no heatmap error. However, the heatmap prediction performance is influenced by the number of training samples: from a learning theory perspective, increasing the number of training samples improves the generalizability of the model. Therefore, we perform experiments to evaluate the performance of the proposed method when using different numbers of training samples in practice. As shown in Table XI, we find that 1) the proposed method delivers consistent improvements when using different numbers of training samples; and 2) increasing the number of training samples significantly improves the performance of heatmap regression models with low-resolution input images.

The influence of different backbone networks. If we do not take the heatmap prediction error into consideration, the quantization error in heatmap regression is caused by the downsampling of heatmaps: 1) the downsampling of input images and 2) the downsampling of CNN feature maps. Though the analysis of heatmap prediction error is outside the scope of this paper, we perform some experiments to demonstrate the influence of different feature maps from the backbone networks in practice. Specifically, we perform experiments using the following two settings: 1) upsampling the feature maps from HRNet [22]; or 2) using the feature maps from a U-shape backbone network, i.e., U-Net [80]. As shown in Table XII, we see that 1) directly upsampling the feature maps achieves comparable performance with the baseline method; 2) U-Net performs better than HRNet-W18 when using a small input resolution (e.g., 64×64 pixels), while it is significantly worse than HRNet when using a large input resolution (e.g., 256×256 pixels); and 3) U-Net contains more parameters and requires much more computation than HRNet when using the same input resolution. It would be interesting to further explore more efficient U-shape networks for low-resolution heatmap-based semantic landmark localization.
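A minimal sketch of the first setting, i.e., bilinearly upsampling the backbone feature maps before the prediction layer so that the heatmap matches the input resolution, is given below. The channel size, upsampling factor, and single 1×1 convolution are assumptions for illustration rather than the released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampledHeatmapHead(nn.Module):
    """Illustrative head that bilinearly upsamples backbone feature maps and
    then predicts one heatmap per landmark at the input resolution."""
    def __init__(self, in_channels=270, num_landmarks=98, scale=4):
        super().__init__()
        self.scale = scale
        self.head = nn.Conv2d(in_channels, num_landmarks, kernel_size=1)

    def forward(self, feats):                      # feats: [B, C, H/scale, W/scale]
        feats = F.interpolate(feats, scale_factor=self.scale,
                              mode='bilinear', align_corners=False)
        return self.head(feats)                    # [B, K, H, W] heatmaps
```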

TABLE XIII: The influence of different types of heatmap when using the heatmap regression model with different input resolutions.
Heatmap NME (%)
256×256 128×128 64×64
Gaussian 3.86 4.21 4.99
Binary 3.81 3.99 4.62

The influence of different types of heatmap. We perform experiments to demonstrate the influence of different types of heatmap, i.e., the Gaussian heatmap and the binary heatmap. As shown in Table XIII, 1) when using a large input resolution, the heatmap regression model achieves comparable performance with either Gaussian or binary heatmaps; and 2) when using a low input resolution, the heatmap regression model achieves better performance with the binary heatmap. We illustrate the differences between binary and Gaussian heatmaps in Figure 9. Specifically, the Gaussian heatmap improves the robustness of heatmap prediction, but at the risk of increasing the uncertainty of the maximum activation point in the predicted heatmap. Therefore, when training very efficient heatmap regression models with a low input resolution, we recommend the binary heatmap.
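For completeness, a sketch of the two ground-truth encodings compared above is given below; σ=0 degenerates to the binary (single activation point) heatmap, and the centre is assumed to be an integer activation point obtained by the encode operation. Names are illustrative and not taken from the released code.

```python
import numpy as np

def ground_truth_heatmap(cx, cy, size, sigma=0.0):
    """Binary heatmap (sigma = 0) or unnormalized Gaussian heatmap centred at
    the integer activation point (cx, cy)."""
    h, w = size
    if sigma == 0:                       # binary heatmap: a single activation point
        hm = np.zeros(size)
        hm[cy, cx] = 1.0
        return hm
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```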


Figure 9: An intuitive example of the ground truth heatmap on the WFLW dataset using different σ. We plot all heatmaps h^g_1, …, h^g_98 in a single figure for better visualization.

The qualitative comparison between the vanilla quantization system and the proposed quantization system. We provide some demo images for facial landmark detection using both the baseline method (i.e., k=1) and the proposed quantization method (i.e., k=9) to demonstrate the effectiveness of the proposed method for accurate semantic landmark localization. As shown in Fig. 10, we see that the black landmarks are closer to the blue landmarks than the yellow landmarks are, especially when using low-resolution models (e.g., 64×64 pixels).


Figure 10: Qualitative results from the test split of the WFLW dataset (best viewed in color). Blue: the ground truth facial landmarks. Yellow: the facial landmarks predicted by the vanilla quantization system (i.e., k=1). Black: the facial landmarks predicted by the proposed quantization system (i.e., k=9). For the above three rows, we use heatmap regression models with input resolutions of 256×256, 128×128, and 64×64 pixels, respectively.

B.2 Human Pose Estimation

To utilize the proposed method for accurate semantic landmark localization, only one hyper-parameter needs to be set: k, the number of alternative activation points. To better understand the effectiveness of the proposed method for human pose estimation, we perform ablation studies using different numbers of alternative activation points on both the MPII and COCO datasets. As shown in Table XIV and Table XV, we find that the proposed method achieves comparable performance when k is between 10 and 25, making it easy to choose a proper k on the validation data for human pose estimation applications.

TABLE XIV: Results on the COCO validation set. The input resolution is 256×192 pixels. The first row indicates the performance when using the compensation method “shift a quarter to the second maximum activation point”.
Backbone k AP AP .5 AP .75 AP (M) AP (L) AR AR .5 AR .75 AR (M) AR (L)
HRNet-W32 - 0.744 0.905 0.819 0.708 0.810 0.798 0.942 0.865 0.757 0.858
HRNet-W32 1 0.723 0.904 0.811 0.690 0.788 0.782 0.941 0.859 0.741 0.841
HRNet-W32 2 0.738 0.905 0.817 0.702 0.805 0.793 0.942 0.864 0.752 0.854
HRNet-W32 3 0.743 0.905 0.819 0.708 0.810 0.797 0.942 0.865 0.756 0.857
HRNet-W32 4 0.743 0.904 0.819 0.709 0.809 0.797 0.941 0.866 0.756 0.857
HRNet-W32 5 0.747 0.904 0.819 0.712 0.814 0.800 0.940 0.866 0.759 0.859
HRNet-W32 6 0.747 0.905 0.820 0.711 0.814 0.799 0.941 0.865 0.759 0.859
HRNet-W32 7 0.745 0.905 0.820 0.709 0.812 0.798 0.941 0.864 0.757 0.857
HRNet-W32 8 0.746 0.905 0.819 0.711 0.813 0.799 0.942 0.863 0.759 0.858
HRNet-W32 9 0.746 0.905 0.819 0.710 0.812 0.798 0.942 0.863 0.758 0.857
HRNet-W32 10 0.749 0.905 0.820 0.713 0.815 0.801 0.943 0.864 0.760 0.860
HRNet-W32 11 0.750 0.906 0.820 0.715 0.817 0.802 0.942 0.865 0.761 0.861
HRNet-W32 12 0.750 0.906 0.821 0.714 0.817 0.801 0.942 0.865 0.760 0.861
HRNet-W32 13 0.749 0.906 0.821 0.713 0.816 0.800 0.942 0.865 0.760 0.860
HRNet-W32 14 0.749 0.905 0.820 0.713 0.816 0.800 0.941 0.864 0.760 0.860
HRNet-W32 15 0.749 0.906 0.820 0.713 0.817 0.800 0.942 0.863 0.760 0.860
HRNet-W32 16 0.749 0.905 0.820 0.713 0.816 0.800 0.941 0.863 0.760 0.860
HRNet-W32 17 0.750 0.905 0.821 0.714 0.817 0.801 0.942 0.863 0.761 0.860
HRNet-W32 18 0.750 0.905 0.820 0.713 0.817 0.801 0.942 0.864 0.760 0.860
HRNet-W32 19 0.750 0.904 0.820 0.714 0.816 0.801 0.941 0.864 0.760 0.860
HRNet-W32 20 0.749 0.904 0.820 0.713 0.816 0.800 0.940 0.864 0.760 0.859
HRNet-W32 21 0.747 0.904 0.819 0.712 0.813 0.799 0.940 0.862 0.758 0.858
HRNet-W32 22 0.749 0.904 0.819 0.713 0.815 0.799 0.941 0.863 0.759 0.859
HRNet-W32 23 0.749 0.904 0.819 0.713 0.816 0.800 0.941 0.863 0.760 0.860
HRNet-W32 24 0.749 0.904 0.820 0.713 0.817 0.800 0.941 0.863 0.759 0.860
HRNet-W32 25 0.749 0.905 0.820 0.713 0.817 0.800 0.941 0.864 0.760 0.860
HRNet-W32 30 0.749 0.905 0.819 0.714 0.818 0.800 0.941 0.862 0.759 0.859
HRNet-W32 35 0.748 0.902 0.819 0.712 0.816 0.799 0.940 0.862 0.758 0.859
HRNet-W32 40 0.746 0.901 0.818 0.710 0.814 0.797 0.939 0.861 0.756 0.857
HRNet-W32 45 0.746 0.901 0.815 0.710 0.814 0.796 0.939 0.860 0.755 0.857
HRNet-W32 50 0.746 0.901 0.814 0.710 0.816 0.797 0.938 0.859 0.755 0.857
HRNet-W32 75 0.741 0.899 0.811 0.705 0.810 0.792 0.936 0.856 0.751 0.853
HRNet-W32 100 0.734 0.897 0.805 0.697 0.803 0.785 0.934 0.850 0.743 0.846
HRNet-W32 125 0.717 0.893 0.795 0.682 0.785 0.770 0.930 0.841 0.727 0.831
HRNet-W32 150 0.651 0.889 0.755 0.629 0.702 0.713 0.925 0.809 0.680 0.762
TABLE XV: Results on the MPII validation set. The input resolution is 256×256 pixels. The first row indicates the performance when using the compensation method “shift a quarter to the second maximum activation point”.
Backbone k Head Shoulder Elbow Wrist Hip Knee Ankle Mean [email protected]
HRNet-W32 - 97.1 95.9 90.3 86.4 89.1 87.1 83.3 90.3 37.7
HRNet-W32 1 97.033 95.703 90.131 86.312 88.402 86.802 82.924 90.055 32.808
HRNet-W32 2 97.033 95.856 90.302 86.416 88.818 87.004 83.325 90.247 35.912
HRNet-W32 3 97.135 95.975 90.302 86.347 89.250 86.741 83.018 90.271 37.668
HRNet-W32 4 97.237 95.907 90.643 86.141 89.164 86.862 83.089 90.294 36.997
HRNet-W32 5 97.203 95.839 90.677 86.141 89.147 87.124 82.994 90.315 38.774
HRNet-W32 6 97.101 95.839 90.677 86.262 89.181 86.923 83.089 90.310 38.548
HRNet-W32 7 97.135 95.822 90.609 86.313 89.199 86.943 83.113 90.312 37.913
HRNet-W32 8 97.033 95.822 90.557 86.175 89.147 86.922 82.876 90.245 38.129
HRNet-W32 9 97.067 95.822 90.694 86.329 89.112 86.942 82.876 90.284 38.457
HRNet-W32 10 97.033 95.873 90.626 86.141 89.372 86.862 82.995 90.297 39.024
HRNet-W32 11 97.101 95.907 90.592 86.244 89.406 86.842 82.853 90.302 39.139
HRNet-W32 12 97.101 95.822 90.660 86.175 89.475 86.761 82.900 90.297 38.996
HRNet-W32 13 97.101 95.822 90.694 86.141 89.475 86.741 82.829 90.289 38.855
HRNet-W32 14 97.033 95.822 90.728 86.141 89.337 86.640 82.806 90.250 38.616
HRNet-W32 15 97.169 95.754 90.609 86.124 89.337 86.801 82.475 90.211 38.798
HRNet-W32 16 97.101 95.771 90.694 86.141 89.199 86.882 82.569 90.226 38.691
HRNet-W32 17 97.067 95.822 90.609 86.142 89.285 86.821 82.617 90.224 39.053
HRNet-W32 18 97.033 95.822 90.626 86.004 89.250 86.801 82.522 90.193 39.048
HRNet-W32 19 97.033 95.839 90.626 85.953 89.250 86.822 82.593 90.198 39.058
HRNet-W32 20 96.999 95.788 90.626 86.039 89.389 86.801 82.664 90.226 39.014
HRNet-W32 21 96.999 95.771 90.677 85.935 89.337 86.862 82.451 90.187 38.855
HRNet-W32 22 96.999 95.856 90.626 86.090 89.268 86.741 82.475 90.198 38.842
HRNet-W32 23 96.999 95.890 90.609 86.038 89.164 86.801 82.499 90.187 39.037
HRNet-W32 24 96.862 95.754 90.592 86.021 89.233 86.842 82.333 90.146 39.089
HRNet-W32 25 96.930 95.754 90.506 86.055 89.302 86.721 82.309 90.135 39.126
HRNet-W32 30 96.862 95.788 90.438 86.004 89.372 86.660 82.144 90.101 38.759
HRNet-W32 35 96.623 95.873 90.523 85.936 89.354 86.701 81.884 90.075 38.964
HRNet-W32 40 96.385 95.856 90.523 85.816 89.406 86.721 81.908 90.049 38.618
HRNet-W32 45 96.317 95.669 90.506 85.746 89.285 86.660 82.026 89.990 38.579
HRNet-W32 50 96.214 95.686 90.506 85.678 89.268 86.439 81.648 89.899 38.584
HRNet-W32 55 96.044 95.669 90.404 85.335 89.337 86.439 81.506 89.815 38.496
HRNet-W32 60 95.805 95.533 90.302 85.147 89.147 86.419 81.270 89.672 38.332
HRNet-W32 65 95.703 95.618 90.251 85.079 89.164 86.419 81.081 89.646 38.218
HRNet-W32 70 95.498 95.584 90.165 85.215 89.302 86.378 80.963 89.633 38.197
HRNet-W32 75 95.259 95.533 90.029 85.147 89.147 86.318 80.467 89.487 38.054
HRNet-W32 80 95.020 95.516 90.046 84.907 89.216 86.197 80.255 89.404 37.684
HRNet-W32 85 94.679 95.465 89.927 84.925 89.337 86.258 80.042 89.357 37.463
HRNet-W32 90 94.407 95.431 89.790 84.907 89.233 86.197 79.735 89.251 37.208
HRNet-W32 95 94.065 95.414 89.773 84.890 89.302 86.097 79.263 89.165 36.719
HRNet-W32 100 93.827 95.482 89.603 84.737 89.320 86.076 78.578 89.032 36.180
HRNet-W32 105 93.520 95.448 89.603 84.599 89.199 85.996 77.964 88.884 35.532
HRNet-W32 110 93.008 95.448 89.518 84.273 89.147 85.835 77.114 88.657 34.770
HRNet-W32 115 92.565 95.296 89.501 84.085 89.112 85.794 76.405 88.483 33.352
HRNet-W32 120 92.121 95.262 89.398 83.930 89.060 85.533 75.272 88.238 31.814
HRNet-W32 125 91.473 95.160 89.160 83.554 88.991 85.472 74.162 87.939 30.065
HRNet-W32 130 90.825 94.939 89.211 83.023 89.095 85.089 72.910 87.609 27.528
HRNet-W32 135 90.246 94.667 89.006 82.577 88.852 84.807 71.209 87.164 24.757
HRNet-W32 140 89.529 94.463 88.853 82.046 88.679 83.800 69.792 86.659 21.788
HRNet-W32 145 88.404 94.124 88.717 81.310 88.645 82.773 67.997 86.053 19.136
HRNet-W32 150 87.756 93.886 88.614 80.146 88.575 81.383 66.084 85.373 16.586
HRNet-W32 175 82.401 90.829 86.927 71.515 86.014 70.504 56.826 80.109 9.204
HRNet-W32 200 68.008 87.075 82.325 62.146 79.176 58.737 41.120 72.009 5.988