

Heatmap Regression via Randomized Rounding

Baosheng Yu and Dacheng Tao. Baosheng Yu is with The University of Sydney, Australia. E-mail: [email protected]. Dacheng Tao is with JD Explore Academy, China and The University of Sydney, Australia. E-mail: [email protected]. Corresponding author: Dacheng Tao.
Abstract

Heatmap regression has become the mainstream methodology for deep learning-based semantic landmark localization, including facial landmark localization and human pose estimation. Although heatmap regression is robust to large variations in pose, illumination, and occlusion in unconstrained settings, it usually suffers from a sub-pixel localization problem: because the activation point indices in heatmaps are always integers, a quantization error arises when heatmaps are used to represent numerical coordinates. Previous methods overcome the sub-pixel localization problem by relying on high-resolution heatmaps, so there is always a trade-off between localization accuracy and computational cost, since the computational complexity of heatmap regression grows quadratically with the heatmap resolution. In this paper, we formally analyze the quantization error of vanilla heatmap regression and propose a simple yet effective quantization system to address the sub-pixel localization problem. The proposed quantization system, induced by the randomized rounding operation, 1) encodes the fractional part of numerical coordinates into the ground truth heatmap using a probabilistic approach during training; and 2) decodes the predicted numerical coordinates from a set of activation points during testing. We prove that the proposed quantization system for heatmap regression is unbiased and lossless. Experimental results on popular facial landmark localization datasets (WFLW, 300W, COFW, and AFLW) and human pose estimation datasets (MPII and COCO) demonstrate the effectiveness of the proposed method for efficient and accurate semantic landmark localization. Code is available at http://github.com/baoshengyu/H3R.

Index Terms:
Semantic landmark localization, heatmap regression, quantization error, randomized rounding.

1 Introduction

Semantic landmarks are sets of points or pixels in images containing rich semantic information. They reflect the intrinsic structure or shape of objects such as human faces [1, 2], hands [3, 4], bodies [5, 6], and household objects [7]. Semantic landmark localization is fundamental in computer and robot vision [8, 9, 7, 10]. For example, semantic landmark localization can be used to register correspondences between spatial positions and semantics (semantic alignment), which is extremely useful in visual recognition tasks such as face recognition [8, 11] and person re-identification [12, 13]. Therefore, robust and efficient semantic landmark localization is extremely important in applications requiring accurate semantic landmarks including robotic grasping [7, 10] and facial analysis applications such as face makeup [14, 15], animation [16, 17], and reenactment [18, 9].

Coordinate regression and heatmap regression are two widely-used methods for deep learning-based semantic landmark localization [1, 19]. Rather than directly regressing the numerical coordinates with a fully-connected layer, heatmap-based methods predict a heatmap whose maximum activation point corresponds to the semantic landmark in the input image. An intuitive example of the heatmap representation is shown in Fig. 1. Due to the effective spatial generalization of the heatmap representation, heatmap regression is robust to large variations in pose, illumination, and occlusion in unconstrained settings [19, 20]. Heatmap regression has performed particularly well in semantic landmark localization tasks including facial landmark detection [2, 21] and human pose estimation [6, 22]. Despite this promise, heatmap regression suffers from an inherent drawback: the indices of the activation points in heatmaps are always integers, so vanilla heatmap-based methods fail to predict numerical coordinates with sub-pixel precision. Sub-pixel localization is nevertheless important in real-world scenarios, where the fractional part of numerical coordinates originates from two sources: 1) the input image is captured by a low-resolution camera and/or at a relatively large distance; and 2) the heatmap usually has a much lower resolution than the input image due to the downsampling stride of convolutional neural networks. As a result, low-resolution heatmaps significantly degrade heatmap regression performance. Considering that the computational cost of convolutional neural networks usually depends quadratically on the resolution of the input image or the feature map, there is a trade-off between localization accuracy and computational cost for heatmap regression [6, 23, 24, 25, 26]. Furthermore, the downsampling stride of the heatmap is not always equal to the downsampling stride of the feature map: given an original image of 512×512 pixels, a heatmap regression model with an input size of 128×128 pixels, and a feature map with a downsampling stride of 4 pixels, the heatmap has a size of 32×32 pixels, i.e., the downsampling stride of the heatmap is s=16 pixels with respect to the original image. For simplicity, we do not distinguish between the above two settings and address the quantization error in a unified manner. Unless otherwise mentioned, we refer to s>1 as the downsampling stride of the heatmap.

Refer to caption
Figure 1: An intuitive example of the heatmap representation for the numerical coordinate. Given the numerical coordinate \boldsymbol{x}_{i}=(x_{i},y_{i}), we then have the corresponding heatmap \boldsymbol{h}_{i} with the maximum activation point located at the position (x_{i},y_{i}).

In vanilla heatmap regression, 1) during training, the ground truth numerical coordinates are first quantized to generate the ground truth heatmap; and 2) during testing, the predicted numerical coordinates can be decoded from the maximum activation point in the predicted heatmap. However, typical quantization operations such as floor, round, and ceil discard the fractional part of the ground truth numerical coordinates, making it difficult to reconstruct the fractional part even from the optimal predicted heatmap. This error induced by the transformation between numerical coordinates and heatmap is known as the quantization error. To address the problem of quantization error, here we introduce a new quantization system to form a lossless transformation between the numerical coordinates and the heatmap. In our approach, during training, the proposed quantization system uses a set of activation points, and the fractional part of the numerical coordinate is encoded as the activation probabilities of different activation points. During testing, the fractional part can then be reconstructed according to the activation probabilities of the top k maximum activation points in the predicted heatmap. To achieve this, we introduce a new quantization operation called randomized rounding, or random-round, which is widely used in combinatorial optimization to convert fractional solutions into integer solutions with provable approximation guarantees [27, 28]. Furthermore, the proposed method can easily be implemented using a few lines of source code, making it a plug-and-play replacement for the quantization system of existing heatmap regression methods.

In this paper, we address the problem of quantization error in heatmap regression. The remainder of the paper is structured as follows. In the preliminaries, we briefly review two typical semantic landmark localization methods, coordinate regression and heatmap regression. In the method section, we first formally introduce the problem of quantization error by decomposing the prediction error into the heatmap error and the quantization error. We then discuss the quantization bias in vanilla heatmap regression and prove a tight upper bound on its quantization error. To address the quantization error, we devise a new quantization system and theoretically prove that it is unbiased and lossless. We also discuss the uncertainty in heatmap prediction as well as unbiased annotation when forming a robust semantic landmark localization system. In the experimental section, we demonstrate the effectiveness of our proposed method on popular facial landmark detection datasets (WFLW, 300W, COFW, and AFLW) and human pose estimation datasets (MPII and COCO).

2 Related Work

Semantic landmark localization, which aims to predict the numerical coordinates for a set of pre-defined semantic landmarks in a given image or video, has a variety of applications in computer and robot vision including facial landmark detection [1, 29, 2], hand landmark detection [3, 4], human pose estimation [5, 6, 22], and household object pose estimation [7, 10]. In this section, we briefly review recent works on coordinate regression and heatmap regression for semantic landmark localization, especially in facial landmark localization applications.

2.1 Coordinate Regression

Coordinate regression has been widely and successfully used in semantic landmark localization under constrained settings, where it usually relies on simple yet effective features [30, 31, 32, 33]. To improve the performance of coordinate regression for semantic landmark localization in the wild, several methods have been proposed by using cascade refinement [1, 34, 35, 36, 37, 38], parametric/non-parametric shape models [39, 29, 36, 40], multi-task learning [41, 42, 43], and novel loss functions [44, 45].

2.2 Heatmap Regression

The success of deep learning has prompted the use of heatmap regression for semantic landmark localization, especially for robust and accurate facial landmark localization [2, 46, 47, 23] and human pose estimation [19, 48, 6, 49, 50, 51, 52]. Existing heatmap regression methods either rely on large input images or empirical compensations during inference to mitigate the problem of quantization error [6, 53, 54, 22]. For example, a simple yet effective compensation method known as “shift a quarter to the second maximum activation point” has been widely used in many state-of-the-art heatmap regression methods [6, 55, 22].

Several methods have been developed to address the problem of quantization error in three aspects: 1) jointly predicting the heatmap and the offset in a multi-task manner [56]; 2) encoding and decoding the fractional part of numerical coordinates via a modulated 2D Gaussian distribution [24, 26]; and 3) exploring differentiable transformations between the heatmap and the numerical coordinates [57, 20, 58, 59]. Specifically, [24] generates the fractional part sensitive ground truth heatmap for video-based face alignment, which is known as fractional heatmap regression. Under the assumption that the predicted heatmap follows a 2D Gaussian distribution, [26] decodes the fractional part of numerical coordinates from the modulated predicted heatmap. The soft-argmax operation is differentiable [60, 61, 62, 59], and has been intensively explored in human pose estimation [57, 20].

3 Preliminaries

In this section, we introduce two widely-used semantic landmark localization methods, coordinate regression and heatmap regression. For simplicity, we use facial landmark detection as an intuitive example.

Coordinate Regression. Given a face image, semantic landmark detection aims to find the numerical coordinates of a set of pre-defined facial landmarks \boldsymbol{x}_{i}=\left(x_{i},y_{i}\right), where i=1,2,\dots,K indicates the index of the facial landmark (e.g., a set of five pre-defined facial landmarks can be the left eye, right eye, nose, left mouth corner, and right mouth corner). It is natural to train a model (e.g., a deep neural network) to directly regress the numerical coordinates of all facial landmarks. The coordinate regression model can then be optimized via a typical regression criterion such as the mean squared error (MSE) or the mean absolute error (MAE). For the MSE criterion (also the L2 loss), we have

\mathcal{L}(\boldsymbol{x}^{p}_{i},\boldsymbol{x}^{g}_{i})=\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{g}\|_{2}^{2}, (1)

where \boldsymbol{x}^{p}_{i} and \boldsymbol{x}^{g}_{i} indicate the predicted and the ground truth numerical coordinates, respectively. When using the MAE criterion (also the L1 loss), the loss function \mathcal{L} can be defined in a similar way to (1).

Heatmap Regression. Heatmaps (also known as confidence maps) are simple yet effective representations of semantic landmark locations. Given the numerical coordinate \boldsymbol{x}_{i} of the i-th semantic landmark, it corresponds to a specific heatmap \boldsymbol{h}_{i} as shown in Fig. 1. For simplicity, we assume \boldsymbol{h}_{i} is the same size as the input image in this section and leave the problem of quantization error to the next section. With the heatmap representation, the problem of semantic landmark localization can be translated into heatmap regression via two heatmap subroutines: 1) encode (from the ground truth numerical coordinate \boldsymbol{x}_{i}^{g} to the ground truth heatmap \boldsymbol{h}_{i}^{g}); and 2) decode (from the predicted heatmap \boldsymbol{h}_{i}^{p} to the predicted numerical coordinate \boldsymbol{x}_{i}^{p}). The main framework of deep learning-based heatmap regression for semantic landmark localization is shown in Fig. 2.

Refer to caption
Figure 2: The main framework of deep learning-based heatmap regression for semantic landmark localization. Specifically, during the inference stage, we decode the predicted heatmaps \boldsymbol{h}^{p}=\left(\boldsymbol{h}^{p}_{1},\dots,\boldsymbol{h}^{p}_{K}\right) to obtain the predicted numerical coordinates \boldsymbol{x}^{p}=\left(\boldsymbol{x}^{p}_{1},\dots,\boldsymbol{x}^{p}_{K}\right); during the training stage, we encode the ground truth numerical coordinates \boldsymbol{x}^{g}=\left(\boldsymbol{x}^{g}_{1},\dots,\boldsymbol{x}^{g}_{K}\right) to generate the ground truth heatmaps \boldsymbol{h}^{g}=\left(\boldsymbol{h}^{g}_{1},\dots,\boldsymbol{h}^{g}_{K}\right).

Specifically, during the inference stage, given a predicted heatmap \boldsymbol{h}_{i}^{p}, the value \boldsymbol{h}_{i}^{p}(\boldsymbol{x})\in[0,1] indicates the confidence score that the i-th landmark is located at coordinate \boldsymbol{x}\in\mathbb{N}^{2}. Then, we can decode the predicted numerical coordinate \boldsymbol{x}_{i}^{p} from the predicted heatmap \boldsymbol{h}_{i}^{p} using the argmax operation, i.e.,

\boldsymbol{x}_{i}^{p}=\left(x_{i}^{p},y_{i}^{p}\right)\in\underset{\boldsymbol{x}}{\arg\max}\left\{\boldsymbol{h}_{i}^{p}(\boldsymbol{x})\right\}. (2)

Therefore, with the decode operation in (2), the problem of semantic landmark localization can be solved by training a deep model to predict the heatmap \boldsymbol{h}_{i}^{p}.

To train a heatmap regression model, the ground truth heatmap \boldsymbol{h}_{i}^{g} is indispensable, i.e., we need to encode the ground truth coordinate \boldsymbol{x}_{i}^{g} into the ground truth heatmap \boldsymbol{h}_{i}^{g}. We introduce two widely-used methods to generate the ground truth heatmap, the Gaussian heatmap and the binary heatmap, as follows. Given the ground truth coordinate \boldsymbol{x}_{i}^{g}, the ground truth Gaussian heatmap can be generated by sampling and normalizing from a bivariate normal distribution \mathcal{N}(\boldsymbol{x}_{i}^{g},\Sigma), i.e., the ground truth heatmap \boldsymbol{h}_{i}^{g} at location \boldsymbol{x}\in\mathbb{N}^{2} can be evaluated as

\boldsymbol{h}_{i}^{g}(\boldsymbol{x})=\exp\left(-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{x}_{i}^{g})^{\top}\Sigma^{-1}(\boldsymbol{x}-\boldsymbol{x}_{i}^{g})\right), (3)

where \Sigma is the covariance matrix (a positive semi-definite matrix) and \sigma>0 is the standard deviation in both directions, i.e.,

\Sigma=\begin{bmatrix}\sigma^{2}&0\\ 0&\sigma^{2}\end{bmatrix}. (4)

When \sigma\to 0, the ground truth heatmap can be generated by assigning a positive value at the ground truth numerical coordinate \boldsymbol{x}_{i}^{g}, i.e.,

\boldsymbol{h}_{i}^{g}(\boldsymbol{x})=\begin{cases}1&\text{if }\boldsymbol{x}=\boldsymbol{x}_{i}^{g},\\ 0&\text{otherwise}.\end{cases} (5)

Specifically, when \sigma\to 0, the ground truth heatmap defined in (5) is also known as the binary heatmap.
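For concreteness, the following NumPy sketch generates the two kinds of ground truth heatmaps for a single landmark. This is a hedged illustration rather than the authors' implementation; the heatmap shape (H, W), the function names, and the default \sigma are assumptions.

```python
import numpy as np

def gaussian_heatmap(coord, shape, sigma=1.5):
    """Ground truth Gaussian heatmap, Eq. (3); coord = (x, y) in heatmap pixels, shape = (H, W)."""
    xs, ys = np.arange(shape[1]), np.arange(shape[0])
    xx, yy = np.meshgrid(xs, ys)
    # isotropic covariance, Eq. (4): Sigma = diag(sigma^2, sigma^2)
    return np.exp(-((xx - coord[0]) ** 2 + (yy - coord[1]) ** 2) / (2.0 * sigma ** 2))

def binary_heatmap(coord, shape):
    """Ground truth binary heatmap, Eq. (5); coord must already be integer indices."""
    h = np.zeros(shape, dtype=np.float32)
    h[int(coord[1]), int(coord[0])] = 1.0  # rows index y, columns index x
    return h
```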

Given the ground truth heatmap, the heatmap regression model can then be optimized using typical pixel-wise regression criteria such as MSE, MAE, or Smooth-L1 [44]. Specifically, for Gaussian heatmaps, the heatmap regression model is usually optimized with the pixel-wise MSE criterion, i.e.,

\mathcal{L}(\boldsymbol{h}_{i}^{p},\boldsymbol{h}_{i}^{g})=\mathbb{E}\|\boldsymbol{h}_{i}^{p}(\boldsymbol{x})-\boldsymbol{h}_{i}^{g}(\boldsymbol{x})\|_{2}^{2}. (6)

When using the MAE/Smooth-L1 criteria, the loss function can be defined in a similar way to (6). For binary heatmap, the heatmap regression model can also be optimized with the pixel-wise cross-entropy criterion, i.e.,

\mathcal{L}(\boldsymbol{h}^{p}_{i},\boldsymbol{h}^{g}_{i})=\mathbb{E}\left(\mathcal{L}_{\text{CE}}\left(\boldsymbol{h}^{p}_{i}(\boldsymbol{x}),\boldsymbol{h}^{g}_{i}(\boldsymbol{x})\right)\right), (7)

where \mathcal{L}_{\text{CE}} indicates the cross-entropy criterion with a softmax function as the activation/normalization function. A comprehensive review of different loss functions for semantic landmark localization is beyond the scope of this paper, but we refer interested readers to [45] for descriptions of coordinate regression and [21] for heatmap regression. Unless otherwise mentioned, we use the MSE criterion for the Gaussian heatmap and the cross-entropy criterion for the binary heatmap in this paper.
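As a sketch of the two training criteria above (a hedged PyTorch illustration, not the authors' code; the tensor shape (N, K, H, W) is an assumption), the Gaussian heatmap is trained with a pixel-wise MSE, while the binary heatmap is trained with a cross-entropy over all spatial positions:

```python
import torch
import torch.nn.functional as F

def gaussian_heatmap_loss(pred, target):
    """Pixel-wise MSE criterion, Eq. (6); pred and target have shape (N, K, H, W)."""
    return F.mse_loss(pred, target)

def binary_heatmap_loss(logits, target):
    """Cross-entropy criterion, Eq. (7), with a softmax over the H*W spatial positions.
    logits: raw scores of shape (N, K, H, W); target: one-hot binary heatmaps (N, K, H, W)."""
    n, k, h, w = logits.shape
    log_prob = F.log_softmax(logits.view(n, k, h * w), dim=-1)
    return -(target.view(n, k, h * w) * log_prob).sum(dim=-1).mean()
```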

4 Method

In this section, we first introduce the quantization system in heatmap regression and then formulate the quantization error in a unified way by correcting the quantization bias in a vanilla quantization system. Lastly, we devise a new quantization system via randomized rounding to address the problem of quantization error.

4.1 Quantization System

Heatmap regression for semantic landmark localization usually contains two key components: 1) heatmap prediction; and 2) transformation between the heatmap and the numerical coordinates. The quantization system in heatmap regression is the combination of the encode and decode operations. During training, when the ground truth numerical coordinates \boldsymbol{x}_{i}^{g} are floating-point numbers, we need to calculate a specific Gaussian kernel matrix using (3) for each landmark, since different numerical coordinates usually have different fractional parts. This significantly increases the training load of the heatmap regression model. For example, given 98 landmarks per face image, a kernel size of 11×11, and a mini-batch of 16 training samples, we have to evaluate (3) 98×16×11×11 = 189,728 times in each training iteration. To address this issue, existing heatmap regression methods usually first quantize numerical coordinates into integers, so that a standard kernel matrix can be shared for efficient ground truth heatmap generation [6, 2, 55, 23]. However, these methods suffer from the inherent drawback of failing to encode the fractional part of numerical coordinates. Therefore, how to efficiently encode the fractional information in numerical coordinates remains challenging. Furthermore, during the inference stage, the predicted numerical coordinates \boldsymbol{x}_{i}^{p} obtained by the decode operation in (2) are also integers. As a result, typical heatmap regression methods usually fail to efficiently handle the fractional part of the numerical coordinates during both training and inference, resulting in localization error.

To analyze the localization error caused by the quantization system in heatmap regression, we upper bound the localization error by the sum of the heatmap error and the quantization error as follows:

\begin{split}\mathcal{E}_{loc}=\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{g}\|_{2}&=\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{opt}+\boldsymbol{x}_{i}^{opt}-\boldsymbol{x}_{i}^{g}\|_{2}\\ &\leq\underbrace{\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{opt}\|_{2}}_{\textbf{heatmap error}}+\underbrace{\|\boldsymbol{x}_{i}^{opt}-\boldsymbol{x}_{i}^{g}\|_{2}}_{\textbf{quantization error}},\end{split} (8)

where \boldsymbol{x}_{i}^{opt} indicates the numerical coordinate decoded from the optimal predicted heatmap. Generally, the heatmap error corresponds to the error in heatmap prediction, i.e., \|\boldsymbol{h}_{i}^{p}-\boldsymbol{h}_{i}^{g}\|_{2}, and the quantization error indicates the error caused by both the encode and decode operations. If there is no heatmap error, the localization error then all originates from the error of the quantization system, i.e.,

\mathcal{E}_{loc}=\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{g}\|_{2}=\|\boldsymbol{x}_{i}^{opt}-\boldsymbol{x}_{i}^{g}\|_{2}. (9)

The generalizability of deep neural networks for heatmap prediction, i.e., the heatmap error, is beyond the scope of this paper. We do not consider the heatmap error during the analysis of quantization error in this paper.

To obtain integer coordinates for the generation of the ground truth heatmap, typical integer quantization operations such as floor, round, and ceil have been widely used in previous heatmap regression methods. To unify the quantization error induced by different integer operations, we first introduce a unified integer quantization operation as follows. Given a downsampling stride s>1 and a threshold t\in[0,1], the coordinate x\in\mathbb{N} can be quantized according to its fractional part \epsilon=x/s-\lfloor x/s\rfloor, i.e.,

\boldsymbol{q}(x,s,t)=\begin{cases}\lfloor x/s\rfloor&\text{if }\epsilon<t,\\ \lfloor x/s\rfloor+1&\text{otherwise}.\end{cases} (10)

That is, for the integer quantization operations floor, round, and ceil, we have t=1.0, t=0.5, and t=0, respectively. Furthermore, when the downsampling stride s>1, the decode operation in (2) becomes

\boldsymbol{x}_{i}^{p}\in s*\left(\underset{\boldsymbol{x}}{\arg\max}\left\{\boldsymbol{h}_{i}^{p}(\boldsymbol{x})\right\}\right). (11)

A vanilla quantization system for heatmap regression can then be formed by the encode operation in (10) and the decode operation in (11). When applied to a vector or a matrix, the integer quantization operation defined in (10) is an element-wise operation.
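The vanilla quantization system can thus be summarized in a few lines of NumPy (a hedged sketch; the function names and the (row, column) heatmap layout are assumptions):

```python
import numpy as np

def quantize(x, s, t):
    """Unified integer quantization, Eq. (10): t=1.0 -> floor, t=0.5 -> round, t=0.0 -> ceil."""
    eps = x / s - np.floor(x / s)        # fractional part of x/s
    return np.floor(x / s) + (eps >= t)  # element-wise for array-valued x

def decode_argmax(heatmap, s):
    """Vanilla decode, Eq. (11): the maximum activation point scaled by the stride s."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)  # rows are y, columns are x
    return s * np.array([x, y], dtype=np.float64)
```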

Refer to caption

Figure 3: An intuitive example of the encode operation via randomized rounding. When the downsampling stride s>1, the ground truth activation point (x_{i}^{g}/s,y_{i}^{g}/s) usually does not correspond to a single pixel in the heatmap. Therefore, we introduce the randomized rounding operation to assign the ground truth activation point to a set of alternative activation points, and the activation probability depends on the fractional part of the ground truth numerical coordinates.

4.2 Quantization Error

In this subsection, we first correct the bias in a vanilla quantization system to form an unbiased vanilla quantization system. With the unbiased quantization system, we then provide a tight upper bound on the quantization error for vanilla heatmap regression.

Let \epsilon_{x} denote the fractional part of x_{i}^{g}/s, and \epsilon_{y} denote the fractional part of y_{i}^{g}/s. Given the downsampling stride of the heatmap s>1, we then have

\begin{split}\epsilon_{x}&=x_{i}^{g}/s-\lfloor x_{i}^{g}/s\rfloor,\\ \epsilon_{y}&=y_{i}^{g}/s-\lfloor y_{i}^{g}/s\rfloor.\end{split} (12)

Given the assumption of a “perfect” heatmap prediction model or no heatmap error, i.e., \boldsymbol{h}_{i}^{p}(\boldsymbol{x})=\boldsymbol{h}_{i}^{g}(\boldsymbol{x}), we then have the predicted numerical coordinates

x_{i}^{p}/s=\begin{cases}\lfloor x_{i}^{g}/s\rfloor&\text{if }\epsilon_{x}<t,\\ \lfloor x_{i}^{g}/s\rfloor+1&\text{otherwise},\end{cases}
y_{i}^{p}/s=\begin{cases}\lfloor y_{i}^{g}/s\rfloor&\text{if }\epsilon_{y}<t,\\ \lfloor y_{i}^{g}/s\rfloor+1&\text{otherwise}.\end{cases}

If data samples satisfy the i.i.d. assumption and the fractional parts \epsilon_{x},\epsilon_{y}\sim\mathbb{U}(0,1), the bias of \boldsymbol{x}_{i}^{p} as an estimator of \boldsymbol{x}_{i}^{g} can then be evaluated as

\begin{split}\mathbb{E}\left(x_{i}^{p}/s-x_{i}^{g}/s\right)&=\mathbb{E}\left(\mathbf{1}\{\epsilon_{x}<t\}(-\epsilon_{x})+\mathbf{1}\{\epsilon_{x}\geq t\}(1-\epsilon_{x})\right)\\ &=0.5-t.\end{split}

Considering that x_{i}^{g} and y_{i}^{g} are independent variables, we thus have the quantization bias in the vanilla quantization system as follows:

\begin{split}\mathbb{E}\left(\boldsymbol{x}_{i}^{p}/s-\boldsymbol{x}_{i}^{g}/s\right)&=\left(\mathbb{E}\left(x_{i}^{p}/s-x_{i}^{g}/s\right),~\mathbb{E}\left(y_{i}^{p}/s-y_{i}^{g}/s\right)\right)\\ &=\left(0.5-t,~0.5-t\right).\end{split}

Therefore, among the integer quantization operations in (10), only the round operation (t=0.5) gives an unbiased encode operation. Furthermore, for any t\in[0,1] in the encode operation in (10), we can correct the bias of the encode operation with a shift in the decode operation, i.e.,

\boldsymbol{x}_{i}^{p}\in s*\left(\underset{\boldsymbol{x}}{\arg\max}\left\{\boldsymbol{h}_{i}^{p}(\boldsymbol{x})\right\}+t-0.5\right). (13)

For simplicity, we use the round operation, i.e., t=0.5, to form an unbiased quantization system as our baseline. Though the vanilla quantization system defined by (10) and (13) is unbiased, it still causes a non-invertible localization error. An intuitive explanation is that the encode operation in (10) directly discards the fractional part of the ground truth numerical coordinates, making it impossible for the decode operation to accurately reconstruct the numerical coordinates.
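The bias correction in (13) is then a one-line change to the vanilla decode (again a hedged NumPy sketch with an illustrative function name):

```python
import numpy as np

def decode_unbiased(heatmap, s, t=0.5):
    """Bias-corrected decode, Eq. (13): shift the argmax location by (t - 0.5) before scaling."""
    y, x = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    return s * (np.array([x, y], dtype=np.float64) + t - 0.5)
```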

Theorem 1.

Given the unbiased quantization system defined by the encode operation in (10) and the decode operation in (13), the quantization error is tightly upper bounded, i.e.,

\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{g}\|_{2}\leq\sqrt{2}s/2,

where s>1 indicates the downsampling stride of the heatmap.

Proof.

In Appendix. ∎

From Theorem 1, we know that the vanilla quantization system defined by (10) and (13) causes a non-invertible quantization error whose upper bound depends linearly on the downsampling stride of the heatmap. As a result, for a given heatmap regression model, the localization error can become extremely large for large faces in the original input image, which is a significant problem in many important face-related applications such as face makeup, face swapping, and face reenactment.

Refer to caption

Figure 4: An example of different possible sets of alternative activation points.

4.3 Randomized Rounding

In vanilla heatmap regression, each numerical coordinate corresponds to a single activation point in the heatmap, while the indices of the activation point are all integers. As a result, the fractional part of the numerical coordinate is usually ignored during the encode process, making it an inherent drawback of heatmap regression for sub-pixel localization. To retain the fractional information when using heatmap representations, we utilize multiple activation points around the ground truth activation point. Inspired by the randomized rounding method [27], we address the quantization error in vanilla heatmap regression by using a probabilistic approach. Specifically, we encode the fractional part of the numerical coordinate to different activation points with different activation probabilities. An intuitive example is shown in Fig. 3.

We describe the proposed quantization system as follows. Given the ground truth numerical coordinate \boldsymbol{x}_{i}^{g}=(x_{i}^{g},y_{i}^{g}) and a downsampling stride of the heatmap s>1, the ground truth activation point in the heatmap is (x_{i}^{g}/s,y_{i}^{g}/s), whose coordinates are usually floating-point numbers, so we are unable to find the corresponding pixel in the heatmap. If we ignore the fractional part (\epsilon_{x},\epsilon_{y}) using a typical integer quantization operation, e.g., round, the ground truth activation point will be approximated by one of the activation points around it, i.e., \left(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor\right), \left(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor\right), \left(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor+1\right), or \left(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor+1\right). However, this process is not invertible. To address this, we randomly assign the ground truth activation point to one of the alternative activation points around it, where the activation probability is determined by the fractional part of the ground truth activation point as follows:

\begin{split}P\left\{\boldsymbol{h}_{i}^{g}\left(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor\right)=1\right\}&=(1-\epsilon_{x})(1-\epsilon_{y}),\\ P\left\{\boldsymbol{h}_{i}^{g}\left(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor\right)=1\right\}&=\epsilon_{x}(1-\epsilon_{y}),\\ P\left\{\boldsymbol{h}_{i}^{g}\left(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor+1\right)=1\right\}&=(1-\epsilon_{x})\epsilon_{y},\\ P\left\{\boldsymbol{h}_{i}^{g}\left(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor+1\right)=1\right\}&=\epsilon_{x}\epsilon_{y}.\end{split} (14)

To achieve the encode scheme in (14) in conjunction with current minibatch stochastic gradient descent training algorithms for deep learning models, we introduce a new integer quantization operation via randomized rounding, i.e., random-round:

\boldsymbol{q}(x,s)=\begin{cases}\lfloor x/s\rfloor&\text{if }\epsilon<t,~t\sim\mathbb{U}(0,1),\\ \lfloor x/s\rfloor+1&\text{otherwise}.\end{cases} (15)
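A minimal NumPy sketch of the random-round operation in (15) follows (the function name and the RNG handling are assumptions; each coordinate draws its own threshold t\sim\mathbb{U}(0,1)):

```python
import numpy as np

def random_round(x, s, rng=None):
    """Randomized rounding, Eq. (15): round x/s down with probability 1 - eps and up with
    probability eps, where eps is the fractional part of x/s."""
    rng = np.random.default_rng() if rng is None else rng
    eps = x / s - np.floor(x / s)
    t = rng.random(np.shape(x))          # t ~ U(0, 1), drawn independently per coordinate
    return np.floor(x / s) + (eps >= t)  # P{floor + 1} = P{t <= eps} = eps, matching Eq. (14)
```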

Given the encode operation in (15), if we do not consider the heatmap error, we then have the activation probability at \boldsymbol{x}:

\boldsymbol{h}_{i}^{p}(\boldsymbol{x})=P\left\{\boldsymbol{h}_{i}^{g}(\boldsymbol{x})=1\right\}. (16)

As a result, the fractional part of the ground truth numerical coordinate (\epsilon_{x},\epsilon_{y}) can be reconstructed from the predicted heatmap via the activation probabilities of all activation points, i.e.,

\boldsymbol{x}_{i}^{p}=s*\left(\sum_{\boldsymbol{x}\in\mathcal{X}_{i}^{g}}\boldsymbol{h}_{i}^{p}(\boldsymbol{x})*\boldsymbol{x}\right), (17)

where \mathcal{X}_{i}^{g} indicates the set of activation points around the ground truth activation point, i.e.,

\begin{split}\mathcal{X}_{i}^{g}=\{&\left(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor\right),\left(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor\right),\\ &\left(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor+1\right),\left(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor+1\right)\}.\end{split} (18)
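The decode in (17) is then a probability-weighted average over the four points in (18). The sketch below assumes the set \mathcal{X}_{i}^{g} is known and that there is no heatmap error, so the four activations sum to one; the function name and argument layout are illustrative:

```python
import numpy as np

def decode_lossless(heatmap, s, floor_xy):
    """Decode via Eq. (17) over the four activation points of Eq. (18).
    floor_xy = (floor(x/s), floor(y/s)); with a perfect heatmap the weighted average
    recovers the fractional part (eps_x, eps_y) exactly."""
    fx, fy = int(floor_xy[0]), int(floor_xy[1])
    points = np.array([(fx, fy), (fx + 1, fy), (fx, fy + 1), (fx + 1, fy + 1)], dtype=np.float64)
    probs = np.array([heatmap[int(y), int(x)] for x, y in points])  # rows index y, columns index x
    return s * (probs[:, None] * points).sum(axis=0)
```

Under these assumptions the weighted average equals (\lfloor x_{i}^{g}/s\rfloor+\epsilon_{x},\lfloor y_{i}^{g}/s\rfloor+\epsilon_{y}), so multiplying by s recovers \boldsymbol{x}_{i}^{g} exactly, which is the content of Theorem 2.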
Theorem 2.

Given the encode operation in (15) and the decode operation in (17), we have that 1) the encode operation is unbiased; and 2) the quantization system is lossless, i.e., there is no quantization error.

Proof.

In Appendix. ∎

From Theorem 2, we know that the quantization system defined by the encode operation in (15) and the decode operation in (17) is unbiased and lossless.

4.4 Activation Points Selection

The fractional information of the numerical coordinate (\epsilon_{x},\epsilon_{y}) is well-captured by the randomized rounding operation, allowing us to reconstruct the ground truth numerical coordinate \boldsymbol{x}_{i}^{g} without quantization error. However, during the inference phase, the ground truth numerical coordinate \boldsymbol{x}_{i}^{g} is unavailable and heatmap error always exists in practice, making it difficult to identify the proper set of ground truth activation points \mathcal{X}_{i}^{g}. In this section, we describe a way to form a set of alternative activation points in practice.

We introduce two activation point selection methods as follows. The first solution is to estimate all activation points via the points around the maximum activation point. As shown in Fig. 4, given the maximum activation point, we have four different candidate sets of alternative activation points, \mathcal{X}_{i}^{g_{1}},\mathcal{X}_{i}^{g_{2}},\mathcal{X}_{i}^{g_{3}}, and \mathcal{X}_{i}^{g_{4}}. Given a predicted heatmap in practice, we therefore risk choosing an incorrect set of alternative activation points. To find a robust set of alternative activation points, we may use all nine activation points around the maximum activation point, i.e.,

\mathcal{X}_{i}^{g}=\mathcal{X}_{i}^{g_{1}}\cup\mathcal{X}_{i}^{g_{2}}\cup\mathcal{X}_{i}^{g_{3}}\cup\mathcal{X}_{i}^{g_{4}}. (19)

Another solution is to generalize the argmax operation to the argtopk operation, i.e., we decode the predicted heatmap \boldsymbol{h}_{i}^{p} to obtain the numerical coordinate \boldsymbol{x}_{i}^{p} according to the top k largest activation points,

\mathcal{X}_{i}^{g}=\underset{\boldsymbol{x}}{\arg\text{topk}}\left(\boldsymbol{h}_{i}^{p}(\boldsymbol{x})\right). (20)

If there is no heatmap error, the two alternative activation point solutions presented above, i.e., the alternative activation points in (19) and (20), are equal to each other when using the decode operation in (17). Specifically, we find that the activation points in (19) achieve comparable performance to the activation points in (20) when k=9. For simplicity, unless otherwise mentioned, we use the set of alternative activation points defined by (20) in this paper. Furthermore, when we take the heatmap error into consideration, the value of k forms a trade-off in the selection of activation points, i.e., a larger k is more robust to activation point selection whilst also increasing the risk of noise from the heatmap error. See more discussion in Section 4.5 and the experimental results in Section 5.5.
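In practice, the decode can therefore be implemented with the top-k activation points as follows (a hedged sketch; renormalizing the top-k activations so that they sum to one is an assumption of this illustration, not a statement of the authors' exact implementation):

```python
import numpy as np

def decode_topk(heatmap, s, k=9):
    """Approximate X_i^g with the top-k activation points, Eq. (20), then take the
    probability-weighted average of their coordinates as in Eq. (17)."""
    flat = heatmap.ravel()
    top = np.argpartition(flat, -k)[-k:]           # indices of the k largest activations
    probs = flat[top] / flat[top].sum()            # renormalized activation probabilities
    ys, xs = np.unravel_index(top, heatmap.shape)  # rows are y, columns are x
    return s * np.array([np.dot(probs, xs), np.dot(probs, ys)])
```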

4.5 Discussion

In this subsection, we provide some insights into the proposed quantization system with respect to: 1) the influence of human annotators on the proposed quantization system in practice; and 2) the underlying explanation behind the widely used empirical compensation method “shift a quarter to the second maximum activation point”.

Unbiased Annotation. We have assumed that the ground truth numerical coordinates are always accurate, while in practice they are usually labelled by human annotators and thus subject to annotation bias. Given an input image, the ground truth numerical coordinates \boldsymbol{x}_{i}^{g} can be obtained by clicking a specific pixel in the image, which is a simple but effective annotation pipeline provided by most image annotation tools. For sub-pixel numerical coordinates, especially in low-resolution input images, the annotators may click any one of the possible pixels around the ground truth numerical coordinates due to human visual uncertainty. As shown in Fig. 5, clicking any one of the four possible pixels causes an annotation error, which corresponds to the fractional part of the ground truth numerical coordinate (\epsilon_{x}^{\prime},\epsilon_{y}^{\prime})=\left(x_{i}^{g}-\lfloor x_{i}^{g}\rfloor,~y_{i}^{g}-\lfloor y_{i}^{g}\rfloor\right). Given enough data samples, if the annotators click the pixel according to the following distribution, i.e.,

\begin{split}P\left\{\left(\lfloor x_{i}^{g}\rfloor,\lfloor y_{i}^{g}\rfloor\right)\right\}&=(1-\epsilon_{x}^{\prime})(1-\epsilon_{y}^{\prime}),\\ P\left\{\left(\lfloor x_{i}^{g}\rfloor+1,\lfloor y_{i}^{g}\rfloor\right)\right\}&=\epsilon_{x}^{\prime}(1-\epsilon_{y}^{\prime}),\\ P\left\{\left(\lfloor x_{i}^{g}\rfloor,\lfloor y_{i}^{g}\rfloor+1\right)\right\}&=(1-\epsilon_{x}^{\prime})\epsilon_{y}^{\prime},\\ P\left\{\left(\lfloor x_{i}^{g}\rfloor+1,\lfloor y_{i}^{g}\rfloor+1\right)\right\}&=\epsilon_{x}^{\prime}\epsilon_{y}^{\prime},\end{split} (21)

the fractional part then can be well captured by the heatmap regression model and we refer to it as an unbiased annotation.

Refer to caption


Figure 5: An example of unbiased human annotation for semantic landmark localization.

If we take the downsampling stride into consideration, (\epsilon_{x},\epsilon_{y}) is then a joint result of both the downsampling of the heatmap and the annotation process, i.e.,

(\epsilon_{x},~\epsilon_{y})~\propto~\left(\epsilon_{x}^{\prime}/s,~\epsilon_{y}^{\prime}/s\right)+(s-1). (22)

On the one hand, if the heatmap regression model uses a low input resolution (or a large downsampling stride s\gg 1), the fractional part (\epsilon_{x},\epsilon_{y}) mainly comes from the downsampling of the heatmap; on the other hand, if the heatmap regression model uses a high input resolution, the annotation process also has a significant influence on the heatmap regression. Therefore, when using a high input resolution model in practice, a diverse set of human annotators helps reduce the bias in the annotation process.

Empirical Compensation. “Shift a quarter to the second maximum activation point” has become an effective and widely used empirical compensation method for heatmap regression [6, 55, 22], but it still lacks a proper explanation. We thus provide an intuitive explanation according to the proposed quantization system. The proposed quantization system encodes the ground truth numerical coordinates into multiple activation points, and the activation probability of each activation point is decided by the fractional part, i.e., the activation probability indicates the distance between the activation point and the ground truth activation point. Therefore, the ground truth activation point is closer to the i-th maximum activation point than to the (i+1)-th maximum activation point. We report the averaged activation probabilities for the top k activation points on the WFLW dataset in Table I.

TABLE I: The activation probabilities of the top k activation points.
k=1 k=2 k=3 k=4
\boldsymbol{h}_{i}^{p}(\boldsymbol{x}) 0.44 0.26 0.17 0.13
NME(%) 6.45 5.07 4.71 4.68

We find that the marginal improvement decreases as the number of activation points increases, i.e., the second maximum activation point provides the largest improvement to the reconstruction of the fractional part. This observation partially explains the effectiveness of the compensation method “shift a quarter to the second maximum activation point”, which can be seen as a special case of the proposed method (20) with k=2.
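For reference, the compensation heuristic as described above can be sketched as follows (an illustrative implementation of the stated rule, not the authors' code; the function name is an assumption):

```python
import numpy as np

def decode_quarter_shift(heatmap, s):
    """Shift the maximum activation point a quarter pixel toward the second maximum, then scale by s."""
    order = np.argsort(heatmap.ravel())
    y1, x1 = np.unravel_index(order[-1], heatmap.shape)   # maximum activation point
    y2, x2 = np.unravel_index(order[-2], heatmap.shape)   # second maximum activation point
    p = np.array([x1, y1], dtype=np.float64)
    q = np.array([x2, y2], dtype=np.float64)
    return s * (p + 0.25 * np.sign(q - p))                # quarter-pixel shift toward the second maximum
```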

Furthermore, the proposed quantization system shares the same motivation as bilinear interpolation. Specifically, bilinear interpolation aims to find the value of an unknown function f(x,y) given its neighbors f(x_{1},y_{1}), f(x_{1},y_{2}), f(x_{2},y_{1}), and f(x_{2},y_{2}). For the proposed quantization system, we have f(x,y)=(x,y), which indicates the location of landmarks. Specifically, if there is no heatmap error, we have x_{1}=\lfloor x_{i}^{g}/s\rfloor, x_{2}=\lfloor x_{i}^{g}/s\rfloor+1, y_{1}=\lfloor y_{i}^{g}/s\rfloor, and y_{2}=\lfloor y_{i}^{g}/s\rfloor+1. If we take the heatmap error into consideration, the ground truth activation points are usually unknown. Therefore, the number of alternative activation points also controls the trade-off between the robustness of the quantization system and the risk of noise from the heatmap error (also see details in Section 4.4).

TABLE II: Comparison with State-of-the-Arts on WFLW dataset.
Method NME (%), Inter-ocular
test pose expression illumination make-up occlusion blur
ESR [29] 11.13 25.88 11.47 10.49 11.05 13.75 12.20
SDM [31] 10.29 24.10 11.45 9.32 9.38 13.03 11.28
CFSS [36] 9.07 21.36 10.09 8.30 8.74 11.76 9.96
DVLN [63] 6.08 11.54 6.78 5.73 5.98 7.33 6.88
LAB [64] 5.27 10.24 5.51 5.23 5.15 6.79 6.32
Wing [45] 5.11 8.75 5.36 4.93 5.41 6.37 5.81
3DDE [65] 4.68 8.62 5.21 4.65 4.60 5.77 5.41
DeCaFA [66] 4.62 8.11 4.65 4.41 4.63 5.74 5.38
HRNet [23] 4.60 7.86 4.78 4.57 4.26 5.42 5.36
AVS [67] 4.39 8.42 4.68 4.24 4.37 5.60 4.86
LUVLi [68] 4.37 - - - - - -
AWing [21] 4.21 7.21 4.46 4.23 4.02 4.99 4.82
H3R (ours) 3.81 6.45 4.07 3.70 3.66 4.48 4.30
TABLE III: Comparison with State-of-the-Arts on 300W dataset.
Method NME (%), Inter-ocular NME (%), Inter-pupil
private full common challenge private full common challenge
SAN [69] - 3.98 3.34 6.60 - - - -
DAN [70] 4.30 3.59 3.19 5.24 - 5.03 4.42 7.57
SHN [71] 4.05 - - - - 4.68 4.12 7.00
LAB [64] - 3.49 2.98 5.19 - 4.12 3.42 6.98
Wing [45] - - - - - 4.04 3.27 7.18
DeCaFA [66] - 3.39 2.93 5.26 - - - -
DFCE [72] 3.88 3.24 2.76 5.22 - 4.55 3.83 7.54
AVS [67] - 3.86 3.21 6.49 - 4.54 3.98 7.21
HRNet [23] 3.85 3.32 2.87 5.15 - - - -
HG-HSLE [73] - 3.28 2.85 5.03 - 4.59 3.94 7.24
LUVLi [68] - 3.23 2.76 5.16 - - - -
3DDE [65] 3.73 3.13 2.69 4.92 - 4.39 3.73 7.10
AWing [21] 3.56 3.07 2.72 4.52 - 4.31 3.77 6.52
H3R (ours) 3.48 3.02 2.65 4.58 5.07 4.24 3.67 6.60

5 Facial Landmark Detection

In this section, we perform facial landmark detection experiments. We first introduce widely used facial landmark detection datasets. We then describe the implementation details of our proposed method. Finally, we present our experimental results on different datasets and perform comprehensive ablation studies on the most challenging dataset.

5.1 Datasets

We use four widely used facial landmark detection datasets:

  • WFLW [64]. WFLW contains 10,000 face images, including 7,500 training images and 2,500 testing images, with 98 manually annotated facial landmarks. All face images are selected from the WIDER Face dataset [74], which contains face images with large variations in scale, expression, pose, and occlusion.

  • 300W [75]. 300W contains 3,148 training images, including 337 images from AFW [30], 2,000 images from the training set of HELEN [76], and 811 images from the training set of LFPW [39]. For testing, there are four different settings: 1) common: 554 images, including 330 and 224 images from the testsets of HELEN and LFPW, respectively; 2) challenge: 135 images from IBUG; 3) full: 689 images as a combination of common and challenge; and 4) private: 600 indoor/outdoor images. All images are manually annotated with 68 facial landmarks.

  • COFW [34]. COFW contains 1,852 images, including 1,345 training and 507 testing images. All images are manually annotated with 29 facial landmarks.

  • AFLW [77]. AFLW contains 24,386 face images, including 20,000 images for training and 4,836 images for testing. For testing, there are two settings: 1) full: all 4,836 images for testing; and 2) front: 1,314 frontal images selected from the full set. All images are manually annotated with 21 facial landmarks. For fair comparison, we use 19 facial landmarks, i.e., the landmarks on the two ears are ignored.

5.2 Evaluation Metrics

We use the normalized mean error (NME) as the evaluation metric in this paper, i.e.,

\text{NME}=\mathbb{E}\left(\frac{\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{g}\|_{2}}{d}\right), (23)

where d indicates the normalization distance. For fair comparison, we report the performance on WFLW, 300W, and COFW using the two normalization methods, inter-pupil distance (the distance between the eye centers) and inter-ocular distance (the distance between the outer eye corners). We report the performance on AFLW using the size of the face bounding box as the normalization distance, i.e., d=\sqrt{w*h}, where w and h indicate the width and height of the face bounding box, respectively.
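As a small illustration (a hedged sketch; the array shapes are assumptions), the NME in (23) averages the per-landmark Euclidean error divided by the normalization distance d:

```python
import numpy as np

def nme(pred, gt, d):
    """Normalized mean error, Eq. (23). pred and gt have shape (K, 2); d is the normalization
    distance (inter-ocular or inter-pupil distance, or sqrt(w*h) of the face box for AFLW)."""
    return float(np.mean(np.linalg.norm(pred - gt, axis=1) / d))
```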

5.3 Implementation Details

We implement the proposed heatmap regression method for facial landmark detection using PyTorch [78]. Following the practice in [23], we use HRNet [22] as our backbone network, which is an efficient counterpart of ResNet [79], U-Net [80], and Hourglass [6] for semantic landmark localization. Unless otherwise mentioned, we use HRNet-W18 as the backbone network in our experiments. All face images are cropped and resized to 256×256 pixels and the downsampling stride of the feature map is 4 pixels. For training, we perform widely-used data augmentation for facial landmark detection as follows. We horizontally flip all training images with probability 0.5 and randomly change the brightness (±0.125), contrast (±0.5), and saturation (±0.5) of each image. We then randomly rotate the image (±30°), rescale the image (±0.25), and translate the image (±16 pixels). We also randomly erase a rectangular region in the training image [81]. All our models are initialized from weights pretrained on ImageNet [82]. We use the Adam optimizer [83] with batch size 16. The learning rate starts from 0.001 and is divided by 10 every 60 epochs, with 150 training epochs in total. During the testing phase, we horizontally flip the testing images as data augmentation and average the predictions.

TABLE IV: Comparison with State-of-the-Arts on COFW dataset.
Method NME (%)
Inter-ocular Inter-pupil
SHN [71] - 5.60
LAB [64] - 5.58
DFCE [72] - 5.27
3DDE [65] - 5.11
Wing [45] - 5.07
HRNet [23] 3.45 -
AWing [21] - 4.94
H3R (ours) 3.15 4.55
TABLE V: Comparison with State-of-the-Arts on AFLW dataset.
Method NME (%)
full front
DFCE [72] 2.12 -
3DDE [65] 2.01 -
SAN [69] 1.91 1.85
LAB [64] 1.85 1.62
Wing [45] 1.65 -
HRNet [23] 1.57 1.46
LUVLi [68] 1.39 1.19
H3R (ours) 1.27 1.11

5.4 Comparison with Current State-of-the-Art

To demonstrate the effectiveness of the proposed method, we compare it with recent state-of-the-art facial landmark detection methods. As shown in Table II, the proposed method outperforms recent state-of-the-art methods on the most challenging dataset, WFLW, by a clear margin in all settings. For the 300W dataset, we report the performance under different settings for fair comparison. As shown in Table III, the proposed method achieves comparable performance across all settings. Specifically, LAB [64] uses boundary information as auxiliary supervision; compared to Wing [45], which uses coordinate regression for semantic landmark localization, heatmap-based methods usually achieve better performance on the challenge subset. In Table IV, we see that the proposed method outperforms recent state-of-the-art methods by a clear margin on the COFW dataset. AFLW captures a wide range of face poses, including both frontal and non-frontal faces. As shown in Table V, the proposed method achieves consistent improvements for both frontal and non-frontal faces, suggesting robustness across different face poses.

5.5 Ablation Studies

To better understand the proposed quantization system in different settings, we perform ablation studies on the most challenging dataset, WFLW [64].

Refer to caption

Figure 6: The influence of different input resolutions.

The influence of different input resolutions. Heatmap regression models use a fixed input resolution, e.g., 256×256 pixels, but training and testing images cover a wide range of resolutions, e.g., most faces in the WFLW dataset have an inter-ocular distance of between 30 and 120 pixels. Therefore, we compare the proposed method with the baseline using input resolutions from 64×64 pixels to 512×512 pixels, i.e., heatmap resolutions from 16×16 pixels to 128×128 pixels. As shown in Fig. 6, the proposed method significantly improves heatmap regression performance when using a low input resolution. The increasing number of high-resolution images/videos in real-world applications makes it costly, in both computation and device memory, to overcome the sub-pixel localization problem by simply increasing the input resolution of deep learning-based heatmap regression models. For example, in the film industry, it has sometimes become necessary to swap the appearance of a target actor and a source actor to generate high-fidelity video frames in visual effects, especially when an actor is unavailable for some scenes [84]. The manipulation of actor faces in video frames relies on accurate localization of facial landmarks and is performed at megapixel resolution, inducing a huge computational cost for extensive frame-by-frame animation. Therefore, instead of using high-resolution input images, our proposed method delivers an efficient alternative for accurate semantic landmark localization.

TABLE VI: The influence of different numbers of alternative activation points when using binary heatmap.
Resolution NME (%)
k=1 k=2 k=3 k=4 k=5 k=6 k=7 k=8 k=9 k=10 k=11 k=12 best
512×512 3.980 3.946 3.930 3.923 3.916 3.912 3.909 3.906 3.903 3.901 3.898 3.897 3.890 / k=25
384×384 3.932 3.875 3.855 3.850 3.842 3.837 3.833 3.830 3.828 3.827 3.826 3.826 3.825 / k=14
256×256 4.005 3.881 3.836 3.832 3.819 3.815 3.810 3.808 3.807 3.807 3.807 3.808 3.807 / k=9
128×128 4.637 4.164 4.029 4.023 3.997 3.991 3.988 3.989 3.991 3.994 3.998 4.002 3.988 / k=7
TABLE VII: The influence of different numbers of alternative activation points when using Gaussian heatmap.
NME (%)
k=1 k=2 k=3 k=4 k=5 k=6 k=7 k=8 k=9 k=10 k=11 k=12 best
\sigma=0.0 4.235 4.102 4.069 4.065 4.090 4.202 4.427 4.775 5.219 5.719 6.256 6.802 4.065 / k=4
\sigma=0.5 4.162 4.045 4.010 4.008 3.991 3.990 3.988 3.992 3.998 4.014 4.047 4.108 3.988 / k=7
\sigma=1.0 4.037 3.928 3.898 3.908 3.871 3.877 3.876 3.876 3.871 3.861 3.855 3.857 3.855 / k=11
\sigma=1.5 4.032 3.927 3.901 3.918 3.873 3.879 3.885 3.882 3.878 3.864 3.857 3.860 3.855 / k=18
\sigma=2.0 4.086 3.983 3.953 3.969 3.923 3.931 3.937 3.931 3.925 3.909 3.902 3.907 3.894 / k=25
TABLE VIII: The influence of different face “bounding box” annotation policies.
Policy NME (%), Inter-ocular
training testing test pose expression illumination make-up occlusion blur
P1 P1 3.81 6.45 4.07 3.70 3.66 4.48 4.30
P2 P2 3.95 6.74 4.17 3.89 3.81 4.73 4.51
P1 P2 5.34 9.68 5.24 5.25 5.57 6.55 6.14
P2 P1 4.04 6.90 4.24 3.99 3.90 4.87 4.65

The influence of different numbers of alternative activation points. In the proposed quantization system, the activation probability indicates the distance between an activation point and the ground truth activation point (x_{i}^{g}/s,y_{i}^{g}/s). If there is no heatmap error, the alternative activation points in (20) give the same result as those in (18). If the heatmap error cannot be ignored, there is a trade-off on the number of alternative activation points: 1) a small k increases the risk of missing the ground truth alternative activation points; and 2) a large k introduces noise from irrelevant activation points, especially for large heatmap error. We demonstrate the performance of the proposed method with different numbers of alternative activation points in Table VI. Specifically, we see that 1) when using a high input resolution, the best performance is achieved with a relatively large k; and 2) the performance is smooth near the optimal number of alternative activation points, making it easy to find a proper k on validation data. As introduced in Section 3, binary heatmaps can be seen as a special case of Gaussian heatmaps with standard deviation \sigma=0. Considering that Gaussian heatmaps have been widely used in semantic landmark localization applications, we generalize the proposed quantization system to Gaussian heatmaps and demonstrate the influence of different numbers of alternative activation points in Table VII. Specifically, we see that 1) when applying the proposed quantization system to the model using the Gaussian heatmap, it achieves comparable performance to the model using the binary heatmap; and 2) the optimal number of alternative activation points increases with the standard deviation \sigma.

The influence of different “bounding box” annotation policies. For facial landmark detection, a reference bounding box is required to indicate the position of the facial area. However, there is a performance gap when using different reference bounding boxes [21]. A comparison between two widely used “bounding box” annotation policies is shown in Fig. 7, and we introduce two different bounding box annotation policies as follows:

  • P1: This annotation policy is usually used in semantic landmark localization tasks, especially in facial landmark localization. Specifically, the rectangular area of the bounding box tightly encloses a set of pre-defined facial landmarks.

  • P2: This annotation policy has been widely used in face detection datasets [74]. The bounding box contains the areas of the forehead, chin, and cheeks. For occluded faces, the bounding box is estimated by the human annotator based on the extent of occlusion.

Refer to caption

Figure 7: A comparison between different “bounding box” annotation policies. P1: the yellow bounding boxes. P2: the blue bounding boxes.

Refer to caption

Figure 8: Qualitative results from the test split of the WFLW dataset (best viewed in color). Blue: the ground truth facial landmarks. Yellow: the predicted facial landmarks. The first row shows some “good” cases and the second row shows some “bad” cases.

We present the experimental results using different annotation policies in Table VIII. Specifically, we find that policy P1 usually achieves better results, possibly because the occluded forehead (e.g., hair) introduces additional variations to the face bounding boxes when using policy P2. Furthermore, the model trained using policy P2 is more robust to a different bounding box policy during testing, suggesting robustness to inaccurate bounding boxes from face detection algorithms.

Qualitative Analysis. As shown in Fig. 8, we present some “good” and “bad” facial landmark detection examples according to NME. For the good cases presented in the first row, most images are of high quality; for the bad cases in the second row, most images contain heavy blurring and/or occlusion, making it difficult to accurately identify the contours of different facial parts.

TABLE IX: Results on the MPII Human Pose dataset. In each block, the first row with “-” indicates the baseline method, i.e., k=1; the second row with “*” indicates the compensation method, i.e., “shift a quarter to the second maximum activation point”; and the third row with “✓” indicates the proposed method H3R.
Backbone Input H3R Head Shoulder Elbow Wrist Hip Knee Ankle Mean [email protected]
ResNet-50 256×256 - 96.3 95.2 89.0 82.9 88.2 83.7 79.4 88.4 29.8
ResNet-50 256×256 * 96.4 95.3 89.0 83.2 88.4 84.0 79.6 88.5 34.0
ResNet-50 256×256 ✓ 96.3 95.2 88.8 83.4 88.5 84.3 79.8 88.6 34.9
ResNet-101 256×256 - 96.7 95.8 89.3 84.2 87.9 84.2 80.7 88.9 30.0
ResNet-101 256×256 * 96.9 95.9 89.5 84.4 88.4 84.5 80.7 89.1 34.0
ResNet-101 256×256 ✓ 96.7 96.0 89.3 84.4 88.5 84.3 80.6 89.1 35.0
ResNet-152 256×256 - 97.0 95.8 89.9 84.7 89.1 85.4 81.3 89.5 31.0
ResNet-152 256×256 * 97.0 95.9 90.0 85.0 89.2 85.3 81.3 89.6 35.0
ResNet-152 256×256 ✓ 96.8 95.9 90.1 84.9 89.3 85.3 81.3 89.6 36.2
HRNet-W32 256×256 - 97.0 95.7 90.1 86.3 88.4 86.8 82.9 90.1 32.8
HRNet-W32 256×256 * 97.1 95.9 90.3 86.4 89.1 87.1 83.3 90.3 37.7
HRNet-W32 256×256 ✓ 97.1 96.1 90.8 86.1 89.2 86.4 82.6 90.3 39.3
TABLE X: Results on the COCO validation set. In each block, the first row with “-” indicates the baseline, i.e., k=1; the second row with “*” indicates the compensation method, i.e., “shift a quarter to the second maximum activation point”, which can be seen as a special case of H3R with k=2; and the third row with “✓” indicates the proposed method H3R.
Backbone Input H3R AP AP .5 AP .75 AP (M) AP (L) AR AR .5 AR .75 AR (M) AR (L)
HRNet-W32 192×128 - 0.674 0.890 0.771 0.648 0.732 0.739 0.932 0.828 0.700 0.795
HRNet-W32 192×128 * 0.710 0.892 0.792 0.682 0.771 0.770 0.933 0.844 0.732 0.827
HRNet-W32 192×128 ✓ 0.720 0.892 0.797 0.691 0.784 0.777 0.933 0.846 0.739 0.834
HRNet-W32 256×192 - 0.723 0.904 0.811 0.690 0.788 0.782 0.941 0.859 0.741 0.841
HRNet-W32 256×192 * 0.744 0.905 0.819 0.708 0.810 0.798 0.942 0.865 0.757 0.858
HRNet-W32 256×192 ✓ 0.750 0.906 0.820 0.715 0.817 0.802 0.942 0.865 0.761 0.861
HRNet-W48 256×192 - 0.730 0.904 0.817 0.693 0.798 0.788 0.943 0.864 0.745 0.852
HRNet-W48 256×192 * 0.751 0.906 0.822 0.715 0.818 0.804 0.943 0.867 0.762 0.864
HRNet-W48 256×192 ✓ 0.756 0.906 0.825 0.718 0.825 0.806 0.941 0.868 0.763 0.869
HRNet-W32 384×288 - 0.748 0.904 0.826 0.712 0.816 0.802 0.941 0.871 0.759 0.864
HRNet-W32 384×288 * 0.758 0.906 0.825 0.720 0.827 0.809 0.943 0.869 0.767 0.871
HRNet-W32 384×288 ✓ 0.762 0.905 0.830 0.725 0.833 0.812 0.942 0.873 0.769 0.874
HRNet-W48 384×288 - 0.753 0.907 0.823 0.712 0.823 0.804 0.941 0.867 0.759 0.869
HRNet-W48 384×288 * 0.763 0.908 0.829 0.723 0.834 0.812 0.942 0.871 0.767 0.876
HRNet-W48 384×288 ✓ 0.765 0.907 0.829 0.724 0.838 0.814 0.941 0.871 0.769 0.878

6 Human Pose Estimation

In this section, we perform human pose estimation experiments to further demonstrate the effectiveness of the proposed quantization system for accurate semantic landmark localization.

6.1 Datasets

We perform experiments on two popular human pose estimation datasets,

  • MPII [85]: The MPII Human Pose dataset contains 28,821 images with 40,522 person instances, of which 11,701 images are used for testing and the remaining 17,120 images for training. Following the experimental setup in [22], we use 22,246 person instances for training and evaluate the performance on the MPII validation set with 2,958 person instances, which is a held-out subset of the MPII training set.

  • COCO [86]: The COCO dataset contains over 200,000 images and 250,000 person instances, in which each person instance is labeled with 17 keypoints. Following the experimental setup in [22], we evaluate the proposed method on the validation set with 5,000 images.

6.2 Implementation Details

We utilize a recent state-of-the-art heatmap regression method for human pose estimation, HRNet [22], as our baseline. The proposed quantization system can be easily integrated into most heatmap regression models, and we have made the source code for human pose estimation based on the HRNet baseline publicly available. For the MPII Human Pose dataset, we use the standard evaluation metric, the head-normalized probability of correct keypoints, or PCKh [85]. Specifically, a correct keypoint should fall within α·l pixels of the ground truth position, where l indicates the normalization distance and α∈[0,1] is the matching threshold. For a fair comparison, we report results at two different matching thresholds, [email protected] and [email protected], where the smaller matching threshold, α=0.1, is a stricter evaluation metric for accurate semantic landmark localization [22]. For the COCO dataset, we use the standard evaluation metrics, average precision (AP) and average recall (AR), where the object keypoint similarity (OKS) is used as the similarity measure between the ground truth and predicted objects [86].
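For reference, a minimal sketch of the PCKh metric described above is given below (Python/NumPy). It assumes predictions and ground truths of shape [N, K, 2] and a per-instance normalization distance l, and it ignores joint visibility for brevity, so it is an illustration rather than the official evaluation script.

```python
import numpy as np

def pckh(pred, gt, norm_dist, alpha=0.5):
    """Fraction of predicted keypoints falling within alpha * l pixels of the
    ground truth, where l is the per-instance normalization distance (PCKh)."""
    dist = np.linalg.norm(pred - gt, axis=-1)   # [N, K] Euclidean distances
    thresh = alpha * norm_dist[:, None]         # per-instance matching threshold
    return float((dist <= thresh).mean())
```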

6.3 Results

The experimental results on the MPII dataset are shown in Table IX. Specifically, when using the coarse evaluation metric, [email protected], both the proposed method and the compensation method achieve performance comparable to the baseline, suggesting that the quantization error is trivial for coarse semantic landmark localization. When using the stricter evaluation metric, [email protected], the compensation method, which can be seen as a special case of H3R with k=2, significantly improves the baseline, e.g., from 32.8 to 37.7, while the proposed method H3R further improves the performance from 37.7 to 39.3. The experimental results on the COCO dataset are shown in Table X. Specifically, the proposed method clearly improves the average precision (AP) in different settings, and the major improvements on AP come from: 1) the strict evaluation metric, e.g., AP .75; and 2) medium/large person instances, i.e., AP (M) and AP (L). Furthermore, we also find that the improvement decreases when increasing the input resolution, e.g., from 0.674 to 0.720 for 192×128 (0.046↑), from 0.723 to 0.750 for 256×192 (0.027↑), and from 0.748 to 0.762 for 384×288 (0.014↑).

7 Conclusion

In this paper, we address the problem of sub-pixel localization for heatmap-based semantic landmark localization. We formally analyze the quantization error in vanilla heatmap regression and propose a new quantization system via a randomized rounding operation, which we prove is unbiased and lossless. Experiments on facial landmark localization and human pose estimation datasets demonstrate the effectiveness of the proposed quantization system for efficient and accurate sub-pixel localization.

Acknowledgement

Dr. Baosheng Yu is supported by ARC project FL-170100117.

References

  • [1] Y. Sun, X. Wang, and X. Tang, “Deep convolutional network cascade for facial point detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 3476–3483.
  • [2] A. Bulat and G. Tzimiropoulos, “How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks),” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1021–1030.
  • [3] A. Sinha, C. Choi, and K. Ramani, “Deephand: Robust hand pose estimation by completing a matrix imputed with deep features,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016.
  • [4] U. Iqbal, P. Molchanov, T. Breuel, J. Gall, and J. Kautz, “Hand pose estimation via latent 2.5d heatmap regression,” in European Conference on Computer Vision (ECCV), 2018, pp. 118–134.
  • [5] A. Toshev and C. Szegedy, “Deeppose: Human pose estimation via deep neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1653–1660.
  • [6] A. Newell, K. Yang, and J. Deng, “Stacked hourglass networks for human pose estimation,” in European Conference on Computer Vision (ECCV), 2016, pp. 483–499.
  • [7] A. Saxena, J. Driemeyer, and A. Y. Ng, “Robotic grasping of novel objects using vision,” International Journal of Robotics Research (IJRR), vol. 27, no. 2, pp. 157–173, 2008. [Online]. Available: https://doi.org/10.1177/0278364907087172
  • [8] Y. Sun, Y. Chen, X. Wang, and X. Tang, “Deep learning face representation by joint identification-verification,” in Neural Information Processing Systems (NIPS), 2014, pp. 1988–1996.
  • [9] J. Thies, M. Zollhofer, M. Stamminger, C. Theobalt, and M. Nießner, “Face2face: Real-time face capture and reenactment of rgb videos,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 2387–2395.
  • [10] C. Wang, D. Xu, Y. Zhu, R. Martín-Martín, C. Lu, L. Fei-Fei, and S. Savarese, “Densefusion: 6d object pose estimation by iterative dense fusion,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [11] J. Deng, J. Guo, X. Niannan, and S. Zafeiriou, “Arcface: Additive angular margin loss for deep face recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  • [12] H. Zhao, M. Tian, S. Sun, J. Shao, J. Yan, S. Yi, X. Wang, and X. Tang, “Spindle net: Person re-identification with human body region guided feature decomposition and fusion,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 1077–1085.
  • [13] L. Zheng, Y. Huang, H. Lu, and Y. Yang, “Pose-invariant embedding for deep person re-identification,” IEEE Transactions on Image Processing (TIP), vol. 28, no. 9, pp. 4500–4509, 2019.
  • [14] Y.-C. Chen, X. Shen, and J. Jia, “Makeup-go: Blind reversion of portrait edit,” in IEEE International Conference on Computer Vision (ICCV), Oct 2017.
  • [15] H. Chang, J. Lu, F. Yu, and A. Finkelstein, “Pairedcyclegan: Asymmetric style transfer for applying and removing makeup,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 40–48.
  • [16] C. Cao, Y. Weng, S. Lin, and K. Zhou, “3d shape regression for real-time facial animation,” ACM Transactions on Graphics (TOG), vol. 32, no. 4, pp. 1–10, 2013.
  • [17] C. Cao, Q. Hou, and K. Zhou, “Displaced dynamic expression regression for real-time facial tracking and animation,” ACM Transactions on Graphics (TOG), vol. 33, no. 4, pp. 1–10, 2014.
  • [18] J. Thies, M. Zollhöfer, M. Nießner, L. Valgaerts, M. Stamminger, and C. Theobalt, “Real-time expression transfer for facial reenactment.” ACM Transactions on Graphics (TOG), vol. 34, no. 6, pp. 183–1, 2015.
  • [19] J. J. Tompson, A. Jain, Y. LeCun, and C. Bregler, “Joint training of a convolutional network and a graphical model for human pose estimation,” in Neural Information Processing Systems (NIPS), 2014, pp. 1799–1807.
  • [20] A. Nibali, Z. He, S. Morgan, and L. Prendergast, “Numerical coordinate regression with convolutional neural networks,” arXiv preprint arXiv:1801.07372, 2018.
  • [21] X. Wang, L. Bo, and L. Fuxin, “Adaptive wing loss for robust face alignment via heatmap regression,” in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 6971–6981.
  • [22] K. Sun, B. Xiao, D. Liu, and J. Wang, “Deep high-resolution representation learning for human pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019, pp. 5693–5703.
  • [23] J. Wang, K. Sun, T. Cheng, B. Jiang, C. Deng, Y. Zhao, D. Liu, Y. Mu, M. Tan, X. Wang, W. Liu, and B. Xiao, “Deep high-resolution representation learning for visual recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2020.
  • [24] Y. Tai, Y. Liang, X. Liu, L. Duan, J. Li, C. Wang, F. Huang, and Y. Chen, “Towards highly accurate and stable face alignment for high-resolution videos,” in AAAI Conference on Artificial Intelligence (AAAI), vol. 33, 2019, pp. 8893–8900.
  • [25] W. Li, Z. Wang, B. Yin, Q. Peng, Y. Du, T. Xiao, G. Yu, H. Lu, Y. Wei, and J. Sun, “Rethinking on multi-stage networks for human pose estimation,” arXiv preprint arXiv:1901.00148, 2019.
  • [26] F. Zhang, X. Zhu, H. Dai, M. Ye, and C. Zhu, “Distribution-aware coordinate representation for human pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 7093–7102.
  • [27] P. Raghavan and C. D. Thompson, “Randomized rounding: a technique for provably good algorithms and algorithmic proofs,” Combinatorica, vol. 7, no. 4, pp. 365–374, 1987.
  • [28] B. Korte and J. Vygen, Combinatorial Optimization.   Springer, 2012, vol. 2.
  • [29] X. Cao, Y. Wei, F. Wen, and J. Sun, “Face alignment by explicit shape regression,” International Journal of Computer Vision (IJCV), vol. 107, no. 2, pp. 177–190, 2014.
  • [30] X. Zhu and D. Ramanan, “Face detection, pose estimation, and landmark localization in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2012, pp. 2879–2886.
  • [31] X. Xiong and F. De la Torre, “Supervised descent method and its applications to face alignment,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013, pp. 532–539.
  • [32] S. Ren, X. Cao, Y. Wei, and J. Sun, “Face alignment at 3000 fps via regressing local binary features,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1685–1692.
  • [33] V. Kazemi and J. Sullivan, “One millisecond face alignment with an ensemble of regression trees,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 1867–1874.
  • [34] X. P. Burgos-Artizzu, P. Perona, and P. Dollár, “Robust face landmark estimation under occlusion,” in IEEE International Conference on Computer Vision (ICCV), 2013, pp. 1513–1520.
  • [35] J. Zhang, S. Shan, M. Kan, and X. Chen, “Coarse-to-fine auto-encoder networks (cfan) for real-time face alignment,” in European Conference on Computer Vision (ECCV), 2014, pp. 1–16.
  • [36] S. Zhu, C. Li, C. Change Loy, and X. Tang, “Face alignment by coarse-to-fine shape searching,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 4998–5006.
  • [37] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik, “Human pose estimation with iterative error feedback,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4733–4742.
  • [38] V. Belagiannis and A. Zisserman, “Recurrent human pose estimation,” in IEEE International Conference on Automatic Face & Gesture Recognition (FG), 2017, pp. 468–475.
  • [39] P. N. Belhumeur, D. W. Jacobs, D. J. Kriegman, and N. Kumar, “Localizing parts of faces using a consensus of exemplars,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 35, no. 12, pp. 2930–2940, 2013.
  • [40] X. Miao, X. Zhen, X. Liu, C. Deng, V. Athitsos, and H. Huang, “Direct shape regression networks for end-to-end face alignment,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5040–5049.
  • [41] Z. Zhang, P. Luo, C. C. Loy, and X. Tang, “Facial landmark detection by deep multi-task learning,” in European Conference on Computer Vision (ECCV), 2014, pp. 94–108.
  • [42] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face detection and alignment using multitask cascaded convolutional networks,” IEEE Signal Processing Letters (SPL), vol. 23, no. 10, pp. 1499–1503, 2016.
  • [43] R. Ranjan, V. M. Patel, and R. Chellappa, “Hyperface: A deep multi-task learning framework for face detection, landmark localization, pose estimation, and gender recognition,” IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), vol. 41, no. 1, pp. 121–135, 2017.
  • [44] R. Girshick, “Fast r-cnn,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1440–1448.
  • [45] Z.-H. Feng, J. Kittler, M. Awais, P. Huber, and X.-J. Wu, “Wing loss for robust facial landmark localisation with convolutional neural networks,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2235–2245.
  • [46] A. Bulat and G. Tzimiropoulos, “Binarized convolutional landmark localizers for human pose estimation and face alignment with limited resources,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 3706–3714.
  • [47] D. Merget, M. Rock, and G. Rigoll, “Robust facial landmark detection via a fully-convolutional local-global context network,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 781–790.
  • [48] T. Pfister, J. Charles, and A. Zisserman, “Flowing convnets for human pose estimation in videos,” in IEEE International Conference on Computer Vision (ICCV), 2015, pp. 1913–1921.
  • [49] L. Pishchulin, E. Insafutdinov, S. Tang, B. Andres, M. Andriluka, P. V. Gehler, and B. Schiele, “Deepcut: Joint subset partition and labeling for multi person pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4929–4937.
  • [50] A. Newell, Z. Huang, and J. Deng, “Associative embedding: End-to-end learning for joint detection and grouping,” in Neural Information Processing Systems (NIPS), 2017, pp. 2277–2287.
  • [51] W. Yang, S. Li, W. Ouyang, H. Li, and X. Wang, “Learning feature pyramids for human pose estimation,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 1281–1290.
  • [52] G. Papandreou, T. Zhu, L.-C. Chen, S. Gidaris, J. Tompson, and K. Murphy, “Personlab: Person pose estimation and instance segmentation with a bottom-up, part-based, geometric embedding model,” in European Conference on Computer Vision (ECCV), 2018, pp. 269–286.
  • [53] S.-E. Wei, V. Ramakrishna, T. Kanade, and Y. Sheikh, “Convolutional pose machines,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 4724–4732.
  • [54] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 7291–7299.
  • [55] Y. Chen, Z. Wang, Y. Peng, Z. Zhang, G. Yu, and J. Sun, “Cascaded pyramid network for multi-person pose estimation,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 7103–7112.
  • [56] G. Papandreou, T. Zhu, N. Kanazawa, A. Toshev, J. Tompson, C. Bregler, and K. Murphy, “Towards accurate multi-person pose estimation in the wild,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017, pp. 4903–4911.
  • [57] X. Sun, B. Xiao, F. Wei, S. Liang, and Y. Wei, “Integral human pose regression,” in European Conference on Computer Vision (ECCV), 2018, pp. 529–545.
  • [58] D. C. Luvizon, D. Picard, and H. Tabia, “2d/3d pose estimation and action recognition using multitask deep learning,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 5137–5146.
  • [59] D. C. Luvizon, H. Tabia, and D. Picard, “Human pose regression by combining indirect part detection and contextual information,” Computers & Graphics, vol. 85, pp. 15–22, 2019.
  • [60] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, “Lift: Learned invariant feature transform,” in European Conference on Computer Vision (ECCV), 2016, pp. 467–483.
  • [61] S. Levine, C. Finn, T. Darrell, and P. Abbeel, “End-to-end training of deep visuomotor policies,” Journal of Machine Learning Research (JMLR), vol. 17, no. 1, pp. 1334–1373, 2016.
  • [62] J. Thewlis, H. Bilen, and A. Vedaldi, “Unsupervised learning of object landmarks by factorized spatial embeddings,” in IEEE International Conference on Computer Vision (ICCV), 2017, pp. 5916–5925.
  • [63] W. Wu and S. Yang, “Leveraging intra and inter-dataset variations for robust face alignment,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 150–159.
  • [64] W. Wu, C. Qian, S. Yang, Q. Wang, Y. Cai, and Q. Zhou, “Look at boundary: A boundary-aware face alignment algorithm,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 2129–2138.
  • [65] R. Valle, J. M. Buenaposada, A. Valdés, and L. Baumela, “Face alignment using a 3d deeply-initialized ensemble of regression trees,” Computer Vision and Image Understanding (CVIU), vol. 189, p. 102846, 2019.
  • [66] A. Dapogny, K. Bailly, and M. Cord, “Decafa: Deep convolutional cascade for face alignment in the wild,” in IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [67] S. Qian, K. Sun, W. Wu, C. Qian, and J. Jia, “Aggregation via separation: Boosting facial landmark detector with semi-supervised style translation,” in IEEE International Conference on Computer Vision (ICCV), 2019, pp. 10153–10163.
  • [68] A. Kumar, T. K. Marks, W. Mou, Y. Wang, M. Jones, A. Cherian, T. Koike-Akino, X. Liu, and C. Feng, “Luvli face alignment: Estimating landmarks’ location, uncertainty, and visibility likelihood,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020, pp. 8236–8246.
  • [69] X. Dong, Y. Yan, W. Ouyang, and Y. Yang, “Style aggregated network for facial landmark detection,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018, pp. 379–388.
  • [70] M. Kowalski, J. Naruniec, and T. Trzcinski, “Deep alignment network: A convolutional neural network for robust face alignment,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 88–97.
  • [71] J. Yang, Q. Liu, and K. Zhang, “Stacked hourglass network for robust facial landmark localisation,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2017, pp. 79–87.
  • [72] R. Valle, J. M. Buenaposada, A. Valdés, and L. Baumela, “A deeply-initialized coarse-to-fine ensemble of regression trees for face alignment,” in European Conference on Computer Vision (ECCV), 2018, pp. 585–601.
  • [73] X. Zou, S. Zhong, L. Yan, X. Zhao, J. Zhou, and Y. Wu, “Learning robust facial landmark detection via hierarchical structured ensemble,” in IEEE International Conference on Computer Vision (ICCV), October 2019.
  • [74] S. Yang, P. Luo, C.-C. Loy, and X. Tang, “Wider face: A face detection benchmark,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 5525–5533.
  • [75] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “300 faces in-the-wild challenge: The first facial landmark localization challenge,” in IEEE International Conference on Computer Vision Workshops (ICCVW), 2013, pp. 397–403.
  • [76] V. Le, J. Brandt, Z. Lin, L. Bourdev, and T. S. Huang, “Interactive facial feature localization,” in European Conference on Computer Vision (ECCV), 2012, pp. 679–692.
  • [77] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof, “Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization,” in IEEE International Conference on Computer Vision Workshops (ICCVW), 2011, pp. 2144–2151.
  • [78] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “Pytorch: An imperative style, high-performance deep learning library,” in Neural Information Processing Systems (NeurIPS), H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alché-Buc, E. Fox, and R. Garnett, Eds.   Curran Associates, Inc., 2019, pp. 8024–8035.
  • [79] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
  • [80] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).   Springer, 2015, pp. 234–241.
  • [81] Z. Zhong, L. Zheng, G. Kang, S. Li, and Y. Yang, “Random erasing data augmentation,” AAAI Conference on Artificial Intelligence (AAAI), 2020.
  • [82] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009, pp. 248–255.
  • [83] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International Conference on Learning Representations (ICLR), 2015.
  • [84] J. Naruniec, L. Helminger, C. Schroers, and R. Weber, “High-resolution neural face swapping for visual effects,” Eurographics Symposium on Rendering, vol. 39, no. 4, 2020.
  • [85] M. Andriluka, L. Pishchulin, P. Gehler, and B. Schiele, “2d human pose estimation: New benchmark and state of the art analysis,” in IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014, pp. 3686–3693.
  • [86] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco: Common objects in context,” in European Conference on Computer Vision (ECCV), 2014, pp. 740–755.
[Uncaptioned image] Baosheng Yu received a B.E. from the University of Science and Technology of China in 2014, and a Ph.D. from The University of Sydney in 2019. He is currently a Research Fellow in the School of Computer Science and the Faculty of Engineering at The University of Sydney, NSW, Australia. His research interests include machine learning, computer vision, and deep learning.
[Uncaptioned image] Dacheng Tao (F’15) is the President of the JD Explore Academy and a Senior Vice President of JD.com. He is also an advisor and chief scientist of the digital science institute in the University of Sydney. He mainly applies statistics and mathematics to artificial intelligence and data science, and his research is detailed in one monograph and over 200 publications in prestigious journals and proceedings at leading conferences. He received the 2015 Australian Scopus-Eureka Prize, the 2018 IEEE ICDM Research Contributions Award, and the 2021 IEEE Computer Society McCluskey Technical Achievement Award. He is a fellow of the Australian Academy of Science, AAAS, and ACM.

Appendix A Proofs of Theorem 1 and Theorem 2

Theorem 1.

Given an unbiased quantization system defined by the encode operation in (10) and the decode operation in (13), we then have that the quantization error is tightly upper bounded, i.e.,

\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{g}\|_{2}\leq\sqrt{2}s/2,

where s>1 indicates the downsampling stride of the heatmap.

Proof.

Given the ground truth numerical coordinate \boldsymbol{x}_{i}^{g}=\left(x_{i}^{g},y_{i}^{g}\right), the predicted numerical coordinate \boldsymbol{x}_{i}^{p}=\left(x_{i}^{p},y_{i}^{p}\right), and the downsampling stride of the heatmap s>1, if there is no heatmap error, we then have

\boldsymbol{h}_{i}^{p}(\boldsymbol{x})=\boldsymbol{h}_{i}^{g}(\boldsymbol{x}),

where \boldsymbol{h}_{i}^{p}(\boldsymbol{x}) and \boldsymbol{h}_{i}^{g}(\boldsymbol{x}) indicate the predicted heatmap and the ground truth heatmap, respectively. Therefore, according to the decode operation in (13), we have the predicted numerical coordinate as

x_{i}^{p}/s=\begin{cases}\lfloor x_{i}^{g}/s\rfloor+t-0.5&\quad\text{if }\epsilon_{x}<t,\\ \lfloor x_{i}^{g}/s\rfloor+t+0.5&\quad\text{otherwise},\end{cases}
y_{i}^{p}/s=\begin{cases}\lfloor y_{i}^{g}/s\rfloor+t-0.5&\quad\text{if }\epsilon_{y}<t,\\ \lfloor y_{i}^{g}/s\rfloor+t+0.5&\quad\text{otherwise},\end{cases}

where \epsilon_{x}=x_{i}^{g}/s-\lfloor x_{i}^{g}/s\rfloor and \epsilon_{y}=y_{i}^{g}/s-\lfloor y_{i}^{g}/s\rfloor. The quantization error of the vanilla quantization system can then be evaluated as follows:

|x_{i}^{p}/s-x_{i}^{g}/s|=\begin{cases}|t-\epsilon_{x}-0.5|&\quad\text{if }\epsilon_{x}<t,\\ |t-\epsilon_{x}+0.5|&\quad\text{otherwise},\end{cases}
|y_{i}^{p}/s-y_{i}^{g}/s|=\begin{cases}|t-\epsilon_{y}-0.5|&\quad\text{if }\epsilon_{y}<t,\\ |t-\epsilon_{y}+0.5|&\quad\text{otherwise}.\end{cases}

The maximum quantization error |x_{i}^{p}-x_{i}^{g}|=s/2 is achieved when \epsilon_{x}=t. Similarly, the maximum quantization error |y_{i}^{p}-y_{i}^{g}|=s/2 is achieved when \epsilon_{y}=t. Considering that x_{i}^{p} and y_{i}^{p} are independent variables, we thus have

\|\boldsymbol{x}_{i}^{p}-\boldsymbol{x}_{i}^{g}\|_{2}=\sqrt{\left(x_{i}^{p}-x_{i}^{g}\right)^{2}+\left(y_{i}^{p}-y_{i}^{g}\right)^{2}}\leq\sqrt{2}s/2.

The maximum quantization error is achieved with \epsilon_{x}=\epsilon_{y}=t. That is, the quantization error of the vanilla quantization system is tightly upper bounded by \sqrt{2}s/2.

Theorem 2.

Given the encode operation in (15) and the decode operation in (17), we then have that 1) the encode operation is unbiased; and 2) the quantization system is lossless, i.e., there is no quantization error.

Proof.

Given the ground truth numerical coordinate \boldsymbol{x}_{i}^{g}=\left(x_{i}^{g},y_{i}^{g}\right), the predicted numerical coordinate \boldsymbol{x}_{i}^{p}=\left(x_{i}^{p},y_{i}^{p}\right), and the downsampling stride of the heatmap s>1, we then have

\begin{split}\mathbb{E}\left(\boldsymbol{q}(x_{i}^{g},s)\right)&=\mathbb{E}\left(P\left\{\epsilon_{x}<t\right\}\lfloor x_{i}^{g}/s\rfloor+P\left\{\epsilon_{x}\geq t\right\}(\lfloor x_{i}^{g}/s\rfloor+1)\right)\\&=\lfloor x_{i}^{g}/s\rfloor(1-\epsilon_{x})+(\lfloor x_{i}^{g}/s\rfloor+1)\epsilon_{x}\\&=x_{i}^{g}/s.\end{split}

Similarly, we have \mathbb{E}\left(\boldsymbol{q}(y_{i}^{g},s)\right)=y_{i}^{g}/s. Considering that x_{i}^{p} and y_{i}^{p} are independent variables, we thus have

\begin{split}\mathbb{E}\left(\boldsymbol{q}(\boldsymbol{x}_{i}^{g},s)\right)&=\left(\mathbb{E}\left(\boldsymbol{q}(x_{i}^{g},s)\right),\mathbb{E}\left(\boldsymbol{q}(y_{i}^{g},s)\right)\right)\\&=(x_{i}^{g}/s,y_{i}^{g}/s).\end{split}

Therefore, the encode operation in (15), i.e., randomized rounding, is an unbiased encode operation for heatmap regression.
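The unbiasedness above is also easy to check numerically. The following sketch (assuming t is drawn uniformly from [0, 1), with illustrative names) performs the randomized rounding of (15) many times and compares the sample mean with x_{i}^{g}/s:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_round(x, s, size):
    """Randomized rounding of x/s: round up with probability equal to the
    fractional part epsilon = x/s - floor(x/s), so that E[q] = x/s."""
    q = x / s
    frac = q - np.floor(q)
    return np.floor(q) + (rng.random(size) < frac)

x, s, n = 137.3, 4.0, 1_000_000
samples = random_round(x, s, n)
print(samples.mean(), x / s)   # the two values agree up to Monte Carlo noise
```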

We then prove that the quantization system is lossless as follows. For the decode operation in (17), if there is no heatmap error, we then have

\begin{split}P\{\boldsymbol{x}_{i}^{p}/s=(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor)\}&=(1-\epsilon_{x})(1-\epsilon_{y}),\\ P\{\boldsymbol{x}_{i}^{p}/s=(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor)\}&=\epsilon_{x}(1-\epsilon_{y}),\\ P\{\boldsymbol{x}_{i}^{p}/s=(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor+1)\}&=(1-\epsilon_{x})\epsilon_{y},\\ P\{\boldsymbol{x}_{i}^{p}/s=(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor+1)\}&=\epsilon_{x}\epsilon_{y}.\end{split}

We can then reconstruct the fractional part of \boldsymbol{x}_{i}^{g}, i.e.,

\begin{split}\left(x_{i}^{p}/s,y_{i}^{p}/s\right)&=\sum\limits_{\boldsymbol{x}_{i}^{p}}P\{\boldsymbol{x}_{i}^{p}\}\,\boldsymbol{x}_{i}^{p}/s\\&=(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor)*(1-\epsilon_{x})(1-\epsilon_{y})\\&\quad+(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor)*\epsilon_{x}(1-\epsilon_{y})\\&\quad+(\lfloor x_{i}^{g}/s\rfloor,\lfloor y_{i}^{g}/s\rfloor+1)*(1-\epsilon_{x})\epsilon_{y}\\&\quad+(\lfloor x_{i}^{g}/s\rfloor+1,\lfloor y_{i}^{g}/s\rfloor+1)*\epsilon_{x}\epsilon_{y}\\&=\left(x_{i}^{g}/s,y_{i}^{g}/s\right).\end{split}

That is, \left(x_{i}^{p},y_{i}^{p}\right)=\left(x_{i}^{g},y_{i}^{g}\right), i.e., there is no quantization error.
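The lossless property can likewise be verified with a short sketch: we place the four probabilities above on the integer neighbours of (x_{i}^{g}/s, y_{i}^{g}/s) (the expected ground-truth heatmap under randomized rounding) and decode via the probability-weighted average, recovering the original coordinate exactly. The names and the heatmap size below are illustrative assumptions.

```python
import numpy as np

def encode_expected(xg, yg, s, size):
    """Expected ground-truth heatmap: bilinear probabilities on the four
    integer neighbours of (xg/s, yg/s)."""
    hm = np.zeros(size)
    qx, qy = xg / s, yg / s
    fx, fy = int(np.floor(qx)), int(np.floor(qy))
    ex, ey = qx - fx, qy - fy
    hm[fy, fx] = (1 - ex) * (1 - ey)
    hm[fy, fx + 1] = ex * (1 - ey)
    hm[fy + 1, fx] = (1 - ex) * ey
    hm[fy + 1, fx + 1] = ex * ey
    return hm

def decode_expected(hm, s):
    """Decode the coordinate as the probability-weighted average of the
    activation points, scaled back by the stride s."""
    ys, xs = np.nonzero(hm)
    w = hm[ys, xs]
    return float((w * xs).sum()) * s, float((w * ys).sum()) * s

hm = encode_expected(13.7, 21.2, 4, (64, 64))
print(decode_expected(hm, 4))   # (13.7, 21.2): no quantization error
```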

Appendix B Experiments

In this section, we provide additional experimental results on facial landmark detection and human pose estimation.

B.1 Facial Landmark Detection

TABLE XI: The influence of different numbers of training samples when using different input resolutions. In each cell, the first number indicates the performance of the baseline method and the second number indicates the performance of the proposed method.
#Samples NME (%)
256×256 128×128 64×64
256 5.67/5.46 6.02/5.47 7.74/6.20
1024 4.79/4.63 5.27/4.77 7.13/5.43
4096 4.14/3.97 4.81/4.17 6.57/4.76
7500 4.00/3.81 4.62/3.99 6.50/4.62
TABLE XII: Comparison of different backbone networks and feature maps on WFLW dataset.
Backbone Input Heatmap FLOPs #Params NME (%), Inter-ocular
test pose expression illumination make-up occlusion blur
HRNet-W18 256×256 64×64 4.84G 9.69M 3.81 6.45 4.07 3.70 3.66 4.48 4.30
HRNet-W18 256×256 256×256 4.98G 9.69M 3.91 6.83 4.09 3.82 3.70 4.59 4.41
U-Net 256×256 256×256 60.55G 31.46M 4.93 9.30 5.08 4.74 4.82 6.43 5.72
HRNet-W18 128×128 32×32 1.21G 9.69M 3.99 6.78 4.26 3.89 3.84 4.65 4.46
HRNet-W18 128×128 128×128 1.24G 9.69M 4.06 6.89 4.41 3.97 3.95 4.78 4.52
U-Net 128×128 128×128 15.14G 31.46M 4.17 7.10 4.45 4.06 4.03 5.00 4.73
HRNet-W18 64×64 16×16 0.30G 9.69M 4.64 7.77 5.05 4.52 4.59 5.31 4.96
HRNet-W18 64×64 64×64 0.31G 9.69M 4.61 7.70 5.00 4.44 4.58 5.28 4.94
U-Net 64×64 64×64 3.79G 31.46M 4.37 7.18 4.69 4.26 4.30 5.12 4.80

The influence of different numbers of training samples. The proposed quantization system does not rely on any assumption about the number of training samples, and it is lossless for heatmap regression if there is no heatmap error. However, the heatmap prediction performance is influenced by the number of training samples: from a learning theory perspective, increasing the number of training samples improves the generalizability of the model. Therefore, we perform experiments to evaluate the performance of the proposed method when using different numbers of training samples in practice. As shown in Table XI, we find that 1) the proposed method delivers consistent improvements when using different numbers of training samples; and 2) increasing the number of training samples significantly improves the performance of heatmap regression models with low-resolution input images.

The influence of different backbone networks. If we do not take the heatmap prediction error into consideration, the quantization error in heatmap regression is caused by the downsampling of heatmaps: 1) the downsampling of input images and 2) the downsampling of CNN feature maps. Though the analysis of heatmap prediction error is outside the scope of this paper, we perform some experiments to demonstrate the influence of different feature maps from the backbone networks in practice. Specifically, we perform experiments using the following two settings: 1) upsampling the feature maps from HRNet [22]; or 2) using the feature maps from a U-shape backbone network, i.e., U-Net [80]. As shown in Table XII, we see that 1) directly upsampling the feature maps achieves comparable performance with the baseline method; 2) U-Net performs better than HRNet-W18 when using a small input resolution (e.g., 64×64 pixels), while it is significantly worse than HRNet when using a large input resolution (e.g., 256×256 pixels); and 3) U-Net contains more parameters and requires much more computation than HRNet when using the same input resolution. It would be interesting to further explore more efficient U-shape networks for low-resolution heatmap-based semantic landmark localization.
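A minimal sketch of the first setting, i.e., bilinearly upsampling the backbone feature maps before the prediction layer so that the heatmap matches the input resolution, is given below. The channel size, upsampling factor, and single 1×1 convolution are assumptions for illustration rather than the released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpsampledHeatmapHead(nn.Module):
    """Illustrative head that bilinearly upsamples backbone feature maps and
    then predicts one heatmap per landmark at the input resolution."""
    def __init__(self, in_channels=270, num_landmarks=98, scale=4):
        super().__init__()
        self.scale = scale
        self.head = nn.Conv2d(in_channels, num_landmarks, kernel_size=1)

    def forward(self, feats):                      # feats: [B, C, H/scale, W/scale]
        feats = F.interpolate(feats, scale_factor=self.scale,
                              mode='bilinear', align_corners=False)
        return self.head(feats)                    # [B, K, H, W] heatmaps
```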

TABLE XIII: The influence of different types of heatmap when using the heatmap regression model with different input resolutions.
Heatmap NME (%)
256×256 128×128 64×64
Gaussian 3.86 4.21 4.99
Binary 3.81 3.99 4.62

The influence of different types of heatmap. We perform experiments to demonstrate the influence of different types of heatmap, i.e., the Gaussian heatmap and the binary heatmap. As shown in Table XIII, 1) when using a large input resolution, the heatmap regression model achieves comparable performance with either Gaussian or binary heatmaps; and 2) when using a low input resolution, the heatmap regression model achieves better performance with the binary heatmap. We illustrate the differences between binary and Gaussian heatmaps in Figure 9. Specifically, the Gaussian heatmap improves the robustness of heatmap prediction, but at the risk of increasing the uncertainty of the maximum activation point in the predicted heatmap. Therefore, when training very efficient heatmap regression models with a low input resolution, we recommend the binary heatmap.
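For completeness, a sketch of the two ground-truth encodings compared above is given below; σ=0 degenerates to the binary (single activation point) heatmap, and the centre is assumed to be an integer activation point obtained by the encode operation. Names are illustrative and not taken from the released code.

```python
import numpy as np

def ground_truth_heatmap(cx, cy, size, sigma=0.0):
    """Binary heatmap (sigma = 0) or unnormalized Gaussian heatmap centred at
    the integer activation point (cx, cy)."""
    h, w = size
    if sigma == 0:                       # binary heatmap: a single activation point
        hm = np.zeros(size)
        hm[cy, cx] = 1.0
        return hm
    ys, xs = np.mgrid[0:h, 0:w]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
```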


Figure 9: An intuitive example of the ground truth heatmap on the WFLW dataset using different σ. We plot all heatmaps h^g_1, …, h^g_98 in a single figure for better visualization.

The qualitative comparison between the vanilla quantization system and the proposed quantization system. We provide some demo images for facial landmark detection using both the baseline method (i.e., k=1) and the proposed quantization method (i.e., k=9) to demonstrate the effectiveness of the proposed method for accurate semantic landmark localization. As shown in Fig. 10, we see that the black landmarks are closer to the blue landmarks than the yellow landmarks are, especially when using low-resolution models (e.g., 64×64 pixels).


Figure 10: Qualitative results from the test split of the WFLW dataset (best viewed in color). Blue: the ground truth facial landmarks. Yellow: the facial landmarks predicted by the vanilla quantization system (i.e., k=1). Black: the facial landmarks predicted by the proposed quantization system (i.e., k=9). For the above three rows, we use heatmap regression models with input resolutions of 256×256, 128×128, and 64×64 pixels, respectively.

B.2 Human Pose Estimation

To utilize the proposed method for accurate semantic landmark localization, only one hyper-parameter needs to be set: k, the number of alternative activation points. To better understand the effectiveness of the proposed method for human pose estimation, we perform ablation studies using different numbers of alternative activation points on both the MPII and COCO datasets. As shown in Table XIV and Table XV, we find that the proposed method achieves comparable performance when k is between 10 and 25, making it easy to choose a proper k on the validation data for human pose estimation applications.

TABLE XIV: Results on the COCO validation set. The input resolution is 256×192 pixels. The first row indicates the performance when using the compensation method “shift a quarter to the second maximum activation point”.
Backbone k AP AP .5 AP .75 AP (M) AP (L) AR AR .5 AR .75 AR (M) AR (L)
HRNet-W32 - 0.744 0.905 0.819 0.708 0.810 0.798 0.942 0.865 0.757 0.858
HRNet-W32 1 0.723 0.904 0.811 0.690 0.788 0.782 0.941 0.859 0.741 0.841
HRNet-W32 2 0.738 0.905 0.817 0.702 0.805 0.793 0.942 0.864 0.752 0.854
HRNet-W32 3 0.743 0.905 0.819 0.708 0.810 0.797 0.942 0.865 0.756 0.857
HRNet-W32 4 0.743 0.904 0.819 0.709 0.809 0.797 0.941 0.866 0.756 0.857
HRNet-W32 5 0.747 0.904 0.819 0.712 0.814 0.800 0.940 0.866 0.759 0.859
HRNet-W32 6 0.747 0.905 0.820 0.711 0.814 0.799 0.941 0.865 0.759 0.859
HRNet-W32 7 0.745 0.905 0.820 0.709 0.812 0.798 0.941 0.864 0.757 0.857
HRNet-W32 8 0.746 0.905 0.819 0.711 0.813 0.799 0.942 0.863 0.759 0.858
HRNet-W32 9 0.746 0.905 0.819 0.710 0.812 0.798 0.942 0.863 0.758 0.857
HRNet-W32 10 0.749 0.905 0.820 0.713 0.815 0.801 0.943 0.864 0.760 0.860
HRNet-W32 11 0.750 0.906 0.820 0.715 0.817 0.802 0.942 0.865 0.761 0.861
HRNet-W32 12 0.750 0.906 0.821 0.714 0.817 0.801 0.942 0.865 0.760 0.861
HRNet-W32 13 0.749 0.906 0.821 0.713 0.816 0.800 0.942 0.865 0.760 0.860
HRNet-W32 14 0.749 0.905 0.820 0.713 0.816 0.800 0.941 0.864 0.760 0.860
HRNet-W32 15 0.749 0.906 0.820 0.713 0.817 0.800 0.942 0.863 0.760 0.860
HRNet-W32 16 0.749 0.905 0.820 0.713 0.816 0.800 0.941 0.863 0.760 0.860
HRNet-W32 17 0.750 0.905 0.821 0.714 0.817 0.801 0.942 0.863 0.761 0.860
HRNet-W32 18 0.750 0.905 0.820 0.713 0.817 0.801 0.942 0.864 0.760 0.860
HRNet-W32 19 0.750 0.904 0.820 0.714 0.816 0.801 0.941 0.864 0.760 0.860
HRNet-W32 20 0.749 0.904 0.820 0.713 0.816 0.800 0.940 0.864 0.760 0.859
HRNet-W32 21 0.747 0.904 0.819 0.712 0.813 0.799 0.940 0.862 0.758 0.858
HRNet-W32 22 0.749 0.904 0.819 0.713 0.815 0.799 0.941 0.863 0.759 0.859
HRNet-W32 23 0.749 0.904 0.819 0.713 0.816 0.800 0.941 0.863 0.760 0.860
HRNet-W32 24 0.749 0.904 0.820 0.713 0.817 0.800 0.941 0.863 0.759 0.860
HRNet-W32 25 0.749 0.905 0.820 0.713 0.817 0.800 0.941 0.864 0.760 0.860
HRNet-W32 30 0.749 0.905 0.819 0.714 0.818 0.800 0.941 0.862 0.759 0.859
HRNet-W32 35 0.748 0.902 0.819 0.712 0.816 0.799 0.940 0.862 0.758 0.859
HRNet-W32 40 0.746 0.901 0.818 0.710 0.814 0.797 0.939 0.861 0.756 0.857
HRNet-W32 45 0.746 0.901 0.815 0.710 0.814 0.796 0.939 0.860 0.755 0.857
HRNet-W32 50 0.746 0.901 0.814 0.710 0.816 0.797 0.938 0.859 0.755 0.857
HRNet-W32 75 0.741 0.899 0.811 0.705 0.810 0.792 0.936 0.856 0.751 0.853
HRNet-W32 100 0.734 0.897 0.805 0.697 0.803 0.785 0.934 0.850 0.743 0.846
HRNet-W32 125 0.717 0.893 0.795 0.682 0.785 0.770 0.930 0.841 0.727 0.831
HRNet-W32 150 0.651 0.889 0.755 0.629 0.702 0.713 0.925 0.809 0.680 0.762
TABLE XV: Results on the MPII validation set. The input resolution is 256×256 pixels. The first row indicates the performance when using the compensation method “shift a quarter to the second maximum activation point”.
Backbone k Head Shoulder Elbow Wrist Hip Knee Ankle Mean [email protected]
HRNet-W32 - 97.1 95.9 90.3 86.4 89.1 87.1 83.3 90.3 37.7
HRNet-W32 1 97.033 95.703 90.131 86.312 88.402 86.802 82.924 90.055 32.808
HRNet-W32 2 97.033 95.856 90.302 86.416 88.818 87.004 83.325 90.247 35.912
HRNet-W32 3 97.135 95.975 90.302 86.347 89.250 86.741 83.018 90.271 37.668
HRNet-W32 4 97.237 95.907 90.643 86.141 89.164 86.862 83.089 90.294 36.997
HRNet-W32 5 97.203 95.839 90.677 86.141 89.147 87.124 82.994 90.315 38.774
HRNet-W32 6 97.101 95.839 90.677 86.262 89.181 86.923 83.089 90.310 38.548
HRNet-W32 7 97.135 95.822 90.609 86.313 89.199 86.943 83.113 90.312 37.913
HRNet-W32 8 97.033 95.822 90.557 86.175 89.147 86.922 82.876 90.245 38.129
HRNet-W32 9 97.067 95.822 90.694 86.329 89.112 86.942 82.876 90.284 38.457
HRNet-W32 10 97.033 95.873 90.626 86.141 89.372 86.862 82.995 90.297 39.024
HRNet-W32 11 97.101 95.907 90.592 86.244 89.406 86.842 82.853 90.302 39.139
HRNet-W32 12 97.101 95.822 90.660 86.175 89.475 86.761 82.900 90.297 38.996
HRNet-W32 13 97.101 95.822 90.694 86.141 89.475 86.741 82.829 90.289 38.855
HRNet-W32 14 97.033 95.822 90.728 86.141 89.337 86.640 82.806 90.250 38.616
HRNet-W32 15 97.169 95.754 90.609 86.124 89.337 86.801 82.475 90.211 38.798
HRNet-W32 16 97.101 95.771 90.694 86.141 89.199 86.882 82.569 90.226 38.691
HRNet-W32 17 97.067 95.822 90.609 86.142 89.285 86.821 82.617 90.224 39.053
HRNet-W32 18 97.033 95.822 90.626 86.004 89.250 86.801 82.522 90.193 39.048
HRNet-W32 19 97.033 95.839 90.626 85.953 89.250 86.822 82.593 90.198 39.058
HRNet-W32 20 96.999 95.788 90.626 86.039 89.389 86.801 82.664 90.226 39.014
HRNet-W32 21 96.999 95.771 90.677 85.935 89.337 86.862 82.451 90.187 38.855
HRNet-W32 22 96.999 95.856 90.626 86.090 89.268 86.741 82.475 90.198 38.842
HRNet-W32 23 96.999 95.890 90.609 86.038 89.164 86.801 82.499 90.187 39.037
HRNet-W32 24 96.862 95.754 90.592 86.021 89.233 86.842 82.333 90.146 39.089
HRNet-W32 25 96.930 95.754 90.506 86.055 89.302 86.721 82.309 90.135 39.126
HRNet-W32 30 96.862 95.788 90.438 86.004 89.372 86.660 82.144 90.101 38.759
HRNet-W32 35 96.623 95.873 90.523 85.936 89.354 86.701 81.884 90.075 38.964
HRNet-W32 40 96.385 95.856 90.523 85.816 89.406 86.721 81.908 90.049 38.618
HRNet-W32 45 96.317 95.669 90.506 85.746 89.285 86.660 82.026 89.990 38.579
HRNet-W32 50 96.214 95.686 90.506 85.678 89.268 86.439 81.648 89.899 38.584
HRNet-W32 55 96.044 95.669 90.404 85.335 89.337 86.439 81.506 89.815 38.496
HRNet-W32 60 95.805 95.533 90.302 85.147 89.147 86.419 81.270 89.672 38.332
HRNet-W32 65 95.703 95.618 90.251 85.079 89.164 86.419 81.081 89.646 38.218
HRNet-W32 70 95.498 95.584 90.165 85.215 89.302 86.378 80.963 89.633 38.197
HRNet-W32 75 95.259 95.533 90.029 85.147 89.147 86.318 80.467 89.487 38.054
HRNet-W32 80 95.020 95.516 90.046 84.907 89.216 86.197 80.255 89.404 37.684
HRNet-W32 85 94.679 95.465 89.927 84.925 89.337 86.258 80.042 89.357 37.463
HRNet-W32 90 94.407 95.431 89.790 84.907 89.233 86.197 79.735 89.251 37.208
HRNet-W32 95 94.065 95.414 89.773 84.890 89.302 86.097 79.263 89.165 36.719
HRNet-W32 100 93.827 95.482 89.603 84.737 89.320 86.076 78.578 89.032 36.180
HRNet-W32 105 93.520 95.448 89.603 84.599 89.199 85.996 77.964 88.884 35.532
HRNet-W32 110 93.008 95.448 89.518 84.273 89.147 85.835 77.114 88.657 34.770
HRNet-W32 115 92.565 95.296 89.501 84.085 89.112 85.794 76.405 88.483 33.352
HRNet-W32 120 92.121 95.262 89.398 83.930 89.060 85.533 75.272 88.238 31.814
HRNet-W32 125 91.473 95.160 89.160 83.554 88.991 85.472 74.162 87.939 30.065
HRNet-W32 130 90.825 94.939 89.211 83.023 89.095 85.089 72.910 87.609 27.528
HRNet-W32 135 90.246 94.667 89.006 82.577 88.852 84.807 71.209 87.164 24.757
HRNet-W32 140 89.529 94.463 88.853 82.046 88.679 83.800 69.792 86.659 21.788
HRNet-W32 145 88.404 94.124 88.717 81.310 88.645 82.773 67.997 86.053 19.136
HRNet-W32 150 87.756 93.886 88.614 80.146 88.575 81.383 66.084 85.373 16.586
HRNet-W32 175 82.401 90.829 86.927 71.515 86.014 70.504 56.826 80.109 9.204
HRNet-W32 200 68.008 87.075 82.325 62.146 79.176 58.737 41.120 72.009 5.988