LEARNING HIGH-QUALITY PROPOSALS FOR ACNE DETECTION

JIANWEI ZHANG LEI ZHANG¹¹1Corresponding author JUNYOU WANG XIN WEI College of Computer Science, Sichuan University, Section 4, Southern 1st Ring Rd,
Wuhou District, Chengdu, Sichuan 610065, China
[email protected]
[email protected]
[email protected]
[email protected]
JIAQI LI XIAN JIANG and DANDU Department of Dermatology, West China Hospital, Sichuan University, No.37 Guoxue Alley,
Wuhou District, Chengdu, Sichuan 610041, China
[email protected]
[email protected]
[email protected]

Abstract

Acne detection is crucial for interpretative diagnosis and precise treatment of skin disease. The arbitrary boundary and small size of acne lesions lead to a significant number of poor-quality proposals in two-stage detection. In this paper, we propose a novel head structure for Region Proposal Network to improve the proposals’ quality in two ways. At first, a Spatial Aware Double Head(SADH) structure is proposed to disentangle the representation learning for classification and localization from two different spatial perspectives. The proposed SADH ensures a steeper classification confidence gradient and suppresses the proposals having low intersection-over-union(IoU) with the matched ground truth. Then, we propose a Normalized Wasserstein Distance prediction branch to improve the correlation between the proposals’ classification scores and IoUs. In addition, to facilitate further research on acne detection, we construct a new dataset named AcneSCU, with high-resolution imageries, precise annotations, and fine-grained lesion categories. Extensive experiments are conducted on both AcneSCU and the public dataset ACNE04, and the results demonstrate the proposed method could improve the proposals’ quality, consistently outperforming state-of-the-art approaches. Code and the collected dataset are available in https://github.com/pingguokiller/acnedetection.

keywords:

Acne detection, Object detection, Region proposal network, Localization confidence prediction.

1 Introduction

Acne vulgaris, well-known as acne, is one of the most popular skin disorders in dermatology ^{1, 2}. Delay of acne treatment would lead to not only physical disfigurements such as scars and pigmentation, but also psychosocial impacts such as social isolation, depression, and suicidality ³. Counting skin lesions of different categories is the first and most crucial step of routine clinical acne treatment. Nevertheless, this step is usually manually performed by dermatologists ^{4, 5}, which is time-consuming and highly dependent on medical expertise.

With the rapid development of deep learning, automated acne detection recently drew a growing interest ^{6, 7}. Rashataprucksa et al. ⁶ proposed to handle acne detection with deep learning, and achieved faster and better performance than traditional methods. Min et al. ⁷ proposed a deep learning based method named ACNet to address several problems on acne detection, such as inconsistent illumination, various lesion scales, and high-density distribution. In summary, previous approaches mostly focus on two-stage detection, which treats the detection as a coarse-to-fine process.

{figurehere}

[Uncaptioned image]

Left: an example from AcneSCU, where the image has a size of $3456\times 5184$ . Top right: a partial region with several lesions from the left example, where the lesion with red boundary is an open comedone with an area about $28\times 28$ pixels; Bottom right: the lesion size distribution of AcneSCU. Please note that this paper defines the lesion size as the square root of the bounding box area.

In the first stage of two-stage detection methods, a Region Proposal Network(RPN) is usually utilized to generate a number of coarse object proposals ⁸. RPN consists of two tasks: the classification task to predict a score to indicate how likely a proposal contains an object and the localization task to predict the expected object’s bounding box. Therefore, every proposal is essentially a bounding box with a classification score. Especially, when training the classification task, label assignment selects positive and negative samples according to the anchors’ IoUs with their corresponding ground truth bounding boxes through an IoU threshold, which is called label threshold for simplicity in this paper. In the second stage, the proposals with top classification scores would be inputted to RCNN ⁹ and refined for more precise classification and localization, while the others would be filtered through a classification score threshold, which is called the proposal threshold for simplicity in this paper. For two-stage detection methods, their performance highly depends on a high Average Recall(AR) of proposals generated by RPN ^{10, 11}.

{figurehere}

[Uncaptioned image]

Left: an illustrating example about the RPN proposals, where the blue and red bounding boxes are the proposals of easy sample and hard sample, respectively. Right: the corresponding classification scores of the proposals in the left sub-figure. It could be easily found the classification score gradient curve is very flat. The dotted line indicates the proposal threshold that filters the proposal of the hard sample.

However, acne detection suffers the arbitrary boundary and small average size of lesions, shown in Fig. 1, aggravating the low quality of RPN proposals. Specifically, a low AR of proposals is always observed in RPN, suppressing the performance of two-stage detection methods. In this paper, we argue the pool quality of proposals is caused by two factors. On the one hand, as shown in Fig. 1, the classification scores of the proposals for easy samples are usually very high and pretty close, while that for hard samples are usually very low and filtered. As a result, RPN tends to output redundant proposals for easy samples and ignore the hard samples, leading to a low AR. In this paper, we refer to this phenomenon as flat classification confidence gradient problem. On the other hand, label assignment in RPN utilize a label threshold to split the anchors into the positive and negative samples. Two samples around the two sides of the label threshold may have very close IoUs, but their classification scores are very different. For example, if the label threshold is 0.5, the IoUs of two anchors are 0.49 and 0.51, and their classification labels in RPN would be 0 and 1, respectively. As a result, the unsmooth label assignment leads to a low correlation between the classification scores and the proposals’ IoUs.

In this paper, we propose a novel detection head structure named Spatial Aware Double Head(SADH) to address the flat classification confidence gradient problem. Specifically, the representation learning for classification and localization are disentangled by two different convolution layers. For the classification task, as shown in Fig. 1(b), a $1\times 1$ convolution is utilized to constrain the classification scores of the proposals at adjacent positions to be more independent. For the localization task, a $3\times 3$ convolution is utilized to constrain the classification scores at adjacent positions to be related. To address the low correlation between the classification scores and the proposals’ IoUs, we propose to predict Normalized Wasserstein Distance(NWD), which is a more appropriate localization confidence metric for small objects ¹² and learned by a novel Soft-style Binary Cross Entropy(SBCE) loss. Then, the NWDs are utilized as localization confidence to rectify the original classification scores. In addition, to comprehensively evaluate the proposed method and facilitate further research on acne detection, a new dataset named AcneSCU, containing 276 images and 31777 annotations with 10 lesions categories, is collected. Experimental results on both AcneSCU and public dataset ACNE04 show that the proposed method outperforms state-of-the-art methods. Our contributions can be summarized as follows:

1.

In order to support and facilitate sophisticated research on acne detection, a new dataset called AcneSCU is collected. Compared to previous datasets for acne detection, AcneSCU has higher-resolution and normalized imageries, fine-grained acne categories, and more precise annotations.
2.

The high-resolution images of AcneSCU calls for cropping the images at first, but the unavoidable partial acne lesions on edge would introduce inaccurate training information. To bypass this issue, we proposed a simple but efficient data preprocessing method named masked crop by masking all the partial lesions.
3.

A new head structure named SADH is proposed to address the flat classification confidence gradient problem. The representation learning for classification and localization is disentangled by two different convolution layers, generating a steeper classification confidence gradient and suppressing the easy samples’ proposals which have low IoUs with the matched ground truths.
4.

A NWD prediction branch along with an SBCE loss is proposed for localization confidence prediction. It has been demonstrated the classification scores rectified by NWDs could improve not only the correlation between classification scores and the proposals’ IoUs but also the overall detection performance.

{figurehere}

(a) The RPN head. The red element on the final feature map is calculated by red and purple elements on FPN feature map, while the blue element on the final feature map is calculated by blue and purple elements. (b) The proposed SADH and NWD prediction branch. The inputs of classification scores at adjacent positions on the final feature map are independent, unlike that in RPN have same inputted purple elements, leading to a steeper classification confidence gradient.

2 Related Work

2.1 Detection Heads

Detection heads have been widely used in object detection and finely designed for different tasks. RPN ⁸, widely adopted in two-stage detection methods, is essentially a kind of fully-convolutional network(FCN) ¹³ to generate proposals at each feature map position. As shown in Fig. 1(a), RPN consists of two branches for classification and localization tasks, respectively. Law et al. ¹⁴ proposed a Cornernet to predict an object’s top-left corner and bottom-right corner points. Duan et al. ¹⁵ further extended Cornernet by predicting the center point of an object. Ma et al. ¹⁶ proposed to predict a matching degree score between the predicted bounding box and the ground truth as well as the IoU. Sun et al. ¹⁷ proposed to a universal relation exploring scheme to explore the relations of different entity level, such as objects, superpixels and pixels. Mask R-CNN ¹⁸ extended Faster R-CNN by adding an instance segmentation head. Wu et al. ¹⁹ conducted a thorough comparison between the fully connected head(fc-head) and the convolution head(conv-head) in R-CNN, and found an interesting phenomenon: fc-head is more suitable for the classification task, while conv-head is more suitable for the localization task.

The proposed SADH structure is related to the thought-provoking work ¹⁹. We both introduce a linear-style layer for the classification task. The difference is that ¹⁹ performs a fc-head on the features from RoIAlign, while the proposed method introduces a computationally efficient 1 $\times$ 1 convolution in RPN, see Fig. 1(b) for example. However, our motivations are thoroughly different. Wu et al. are motivated by the fact fc-head has more spatial sensitivity than conv-head, while the proposed SADH is designed to improve the steepness of the classification confidence gradient.

2.2 Localization Confidence Prediction

Recently, localization confidence prediction has also drawn a growing attention. FCOS ²⁰ introduced a center-ness branch to predict the distance between a pixel and the center of its corresponding ground truth. Jiang et al. ²¹ proposed to predict the intersection-over-union(IoU) between the bounding box and the matched ground truth through an IoU-Net. Ma et al. ¹⁶ proposed to utilize the matching degree score of the corner points between the predicted bounding box and the ground truth to indicate the localization confidence. Mask Scoring R-CNN ²² proposed to predict mask IoU for better segmentation performance based on IoU-Net and Mask R-CNN.

FCOS, IoU-Net, and Mask Scoring R-CNN are all related to the proposed method. FCOS predicts a center-ness at each feature map position to down-weight the low-quality detections based on the center prior rule ²³. IoU-Net predicts the IoU and utilizes it in the non-maximum suppression(NMS) by preserving accurately localized bounding boxes. Mask Scoring R-CNN predicts a mask IoU for the instance segmentation task. In this paper, we also add a new branch to predict localization confidence. The difference is that we adopt the NWD as the metric to evaluate localization confidence, which is more suitable for detecting small acne lesions ¹². In addition, the NWD is learned by a novel SBCE loss and directly utilized to rectify the original classification scores.

3 Method

In this section, we first describe the data acquisition and the proposed data preprocessing method in Subsection 3.1. Then, the details of the proposed Spatial Aware Double Head structure and the NWD prediction branch are introduced in Subsection 3.2 and Subsection 3.3, respectively. Lastly, the training and inferring of the proposed method are described in Subsection 3.4.

3.1 Data acquisition and preprocessing

3.1.1 AcneSCU Dataset

To comprehensively evaluate the proposed method and facilitate further research on acne detection, we construct a new dataset called AcneSCU, which has higher-resolution and more normalized imageries, more fine-grained acne categories, and more precise annotations than previous acne dataset ². AcneSCU consists of 276 facial images shot by VISIA complexion analysis system with 31777 instance segmentation annotations of 10 lesion categories, namely open comedone, closed comedone, papule, pustule, nodule, atrophic scar, hypertrophic scar, melasma, nevus and other. Specifically, the lesions, which are difficult to identify, are labeled as the other class. The resolution of images varies from $3128\times 4171$ to $3456\times 5184$ . In order to ensure the label accuracy, the annotations are labeled by 6 dermatologists and verified by 6 times. An example of AcneSCU and the distribution of lesion size are shown in Fig. 1.

3.1.2 Masked Crop

As most images of AcneSCU are of high resolution around $3500\times 5000$ , directly loading the images into memory is computationally costly. It’s natural to crop each image into several sub-images. However, a large number of lesions per image makes it difficult to ensure each lesion to be cropped as an entirety. Whether keeping or ignoring the annotations of the cropped partial lesions would lead to a large number of inaccurate annotations and inferior performance. In this paper, we propose a simple but efficient way named masked crop by masking all the partial lesions. Each sub-image is cropped with a size of $1024\times 1024$ . To ensure the whole area of the image to participate in training, a total number of ${\lceil\frac{W}{1024}\rceil}\times{\lceil\frac{H}{1024}\rceil}$ sub-images are cropped out, where $W$ and $H$ denotes the width and height of the original image, respectively. Each whole image is first cropped along the horizontal direction with equal overlaps, and then cropped along the vertical direction in the same way. As there exists an overlap between the adjacent sub-images, few lesion annotations are omitted. An example of the masked crop is shown in Fig. 3.1.2.

{figurehere} [Uncaptioned image]

Left: the vanilla crop with keeping the annotations of the partial lesions; Right: the masked crop.

3.2 Spatial Aware Double Head

As shown in Fig. 1(a), vanilla RPN performs a $3\times 3$ convolution on Feature Pyramid Networks(FPN) feature map, where the same six elements would be inputted to the calculation of adjacent elements on the following intermediate and final feature maps. As a result, the classification scores at adjacent positions on the final feature map are pretty close, leading to a flat classification confidence gradient. We argue that adjacent elements on the final feature maps for the classification task should be spatially independent for a steeper classification confidence gradient, while that for the localization task should be spatially correlated for similar regression results. From the perspective of spatial perception, we propose a Spatial Aware Double Head to improve the drawback of flat classification confidence gradient in the original RPN. As shown at the top of Fig. 1(b), the intermediate feature maps for the two tasks of RPN are disentangled into two parts, which are computed with a $1\times 1$ and $3\times 3$ convolution on the FPN feature map respectively.

For the classification task, the $1\times 1$ convolution is crucial to constrain the adjacent elements of final feature maps spatially independent. The red and blue elements on the final feature map are only related to the red and blue on the FPN feature map, respectively. As a result, the classification scores of proposals at adjacent positions are not necessarily too close, thus making the classification scores more discriminative. For the localization task, the intermediate feature map is computed by $3\times 3$ convolution, which is the same as RPN. As shown in Fig. 1(a), the six purple elements in the FPN feature map all participate in the computation of the red and blue elementsfine-grained. The spatial correlation benefits the regression results of proposals at adjacent positions. Moreover, following ²⁰, a group normalization(GN) is added before the ReLU layer to make the training more stable.

3.3 NWD Prediction Branch

Label assignment in RPN selects positive and negative samples according to a label threshold and the anchors’ IoUs with the matched ground truths. The anchors with IoUs under the label threshold are regarded as negative samples, and the corresponding proposals are expected to yield extremely low classification scores, leading to a low correlation between the classification scores and the proposals’ IoUs. Jiang et al. ²¹ proposed to predict IoU to alleviate the issue caused by the low correlation. However, Wang et al. ¹² has demonstrated that NWD was a more suitable metric than IoU to evaluate the similarity of two bounding boxes in small object detection. In this paper, a new NWD prediction branch is proposed by performing a $1\times 1$ convolution on the intermediate feature maps for the localization task, as shown in the bottom part of Fig. 1(b). Besides, a sigmoid layer is performed on the final feature map to constrain the predicted NWDs within (0, 1).

3.3.1 Soft-style Binary Cross Entropy

For the learning of NWD, we propose a novel loss named SBCE for the localization regression in object detection, instead of utilizing the widely used $L_{1}$ loss. The SBCE could be formulated as:

\mathcal{L}_{nwd}=-\sum_{i}{log(1-\mid y_{i}-p_{i}\mid)},

(1)

where $p_{i}$ is the predicted NWD of the $i$ -th proposal, and $y_{i}$ is the label, i.e., the NWD between the $i$ -th proposal and its corresponding ground truth. Referring to ¹², $y_{i}$ is defined as:

y_{i}=NWD(B_{i},G_{i})=exp(-\frac{{W_{2}}(B_{i},G_{i})}{C}),

(2)

where $C$ is a constant which is empirically set as the average size of acne lesions, $B_{i}$ and $G_{i}$ are the $i$ -th proposal bounding box and its corresponding ground truth, respectively, and ${W_{2}}(B_{i},G_{i})$ is the Wasserstein coupling distance between them and could be calculated by:

$\displaystyle{W_{2}}(B,G)={\left\|{\left[{{cx}_{b},{cy}_{b},\frac{w_{b}}{2},\frac{h_{b}}{2}}\right]}^{T}-{\left[{{cx}_{g},{cy}_{g},\frac{w_{g}}{2},\frac{h_{g}}{2}}\right]}^{T}\right\|}_{2},$

(3)

where the $({cx}_{b},{cy}_{b})$ , ${w_{b}}$ , and ${h_{b}}$ denote the center coordinates, width and height of the proposal bounding box $B$ respectively, and the same is true for the ground truth $G$ .

Referring to Eq. 1, it’s interesting to find that Binary Cross Entropy is a special case of SBCE when given a hard label $y_{i}\in\{0,1\}$ , while SBCE enables the NWD prediction to be trained as regression task with soft label. Moreover, the logarithm function has a steeper gradient than $L_{1}$ loss, leading to a faster and better convergence.

\tbl

Comparison results with different generic object detection methods on AcneSCU \topruleMethod AP $AP_{s}$ $AP_{m}$ $AP_{l}$ $AR$ $AR_{s}$ $AR_{m}$ $AR_{l}$ \botruleFCOS ²⁰ 0.247 0.197 0.298 0.320 0.649 0.411 0.714 0.791 FSAF ²⁴ 0.402 0.332 0.445 0.680 0.805 0.806 0.813 0.750 AutoAssign ²⁰ 0.471 0.451 0.509 0.409 0.873 0.862 0.876 0.850 \botruleFaster R-CNN ⁸ 0.464 0.390 0.496 0.434 0.750 0.691 0.725 0.747 Cascade R-CNN ²⁵ 0.451 0.416 0.497 0.463 0.766 0.788 0.748 0.688 PANet ²⁶ 0.490 0.495 0.517 0.482 0.756 0.743 0.719 0.792 Cascade Mask R-CNN ²⁷ 0.455 0.480 0.477 0.437 0.708 0.745 0.661 0.696 Mask Scoring R-CNN ²² 0.464 0.467 0.521 0.429 0.755 0.738 0.749 0.734 HTC ²⁸ 0.455 0.452 0.503 0.466 0.823 0.811 0.847 0.813 GHM ²⁹ 0.482 0.458 0.520 0.411 0.774 0.770 0.758 0.756 Libra R-CNN ³⁰ 0.477 0.475 0.523 0.446 0.784 0.768 0.759 0.786 OHEM ³¹ 0.495 0.509 0.526 0.461 0.736 0.732 0.713 0.706 PISA ³² 0.470 0.466 0.493 0.422 0.750 0.740 0.718 0.753 \botruleMask R-CNN ¹⁸ w/o masked crop 0.457 0.354 0.496 0.425 0.729 0.625 0.709 0.727 Mask R-CNN ¹⁸ 0.481 0.488 0.515 0.469 0.750 0.742 0.714 0.759 Mask R-CNN + SADH 0.497 0.464 0.519 0.502 0.775 0.758 0.742 0.805 Mask R-CNN + SADH + NWD 0.507 0.494 0.555 0.472 0.775 0.761 0.754 0.786 \botrule

•

Please note all the methods are evaluated with masked crop unless specially marked with ”w/o masked crop”.

3.4 Training and Inferring

3.4.1 Training

In the training process, the overall loss of the proposed head structure is modified by adding the loss for NWD prediction, which could be formulated as:

\mathcal{L}_{rpn}=\mathcal{L}_{cls}+\mathcal{L}_{loc}+\lambda_{nwd}\times{\mathcal{L}_{nwd}},

(4)

where $\mathcal{L}_{cls}$ and $\mathcal{L}_{loc}$ are the loss for classification and localization task in original RPN, respectively, and $\lambda_{nwd}$ is the hyper-parameter to adjust the weight of $\mathcal{L}_{nwd}$ and set as 1 by default. Same as the training of $\mathcal{L}_{loc}$ , only the positive samples participate in training $\mathcal{L}_{nwd}$ .

3.4.2 Inferring

In the inferring process, the predicted NWD score is utilized to rectify the original classification score in RPN. To constrain the final classification confidence score $s_{final}$ within (0, 1), we formulated $s_{final}$ as:

s_{final}=\sqrt{{s^{(2-\omega_{nwd})}_{cls}}\times{p^{\omega_{nwd}}}},

(5)

where $s_{cls}$ and $p$ denote the original classification score in RPN and the predicted NWD score, respectively, and $\omega_{nwd}$ is the hyper-parameter to adjust the weight of NWD score and set as 1 by default.

4 Experiments

4.1 Datasets.

We evaluate the proposed method on both the AcneSCU and ACNE04 datasets. For the AcneSCU dataset, we randomly split it into two parts, about $90\%$ (248 images) are held out as the train set while the others are used as the test set. As some persons may exist in more than one image, all the persons in the test set are constrained not in the train set to avoid overfitting. For the ACNE04 dataset, it contains 1457 images with 18983 bounding boxes annotations of only one lesion category. Following ², the whole dataset is randomly split into $80\%$ train set and $20\%$ test set, consisting of 1165 and 292 images, respectively.

4.2 Implementation

Our framework is built based on the Mask R-CNN implemented by MMDetection ³³ with one NVIDIA GeForce 3090. The reasons for choosing Mask R-CNN as the baseline are: 1) FPN ³⁴ is utilized to enhance the detection of various-sized objects; 2) RPN ⁸ is utilized for generating candidate regions efficiently, enabling the designing of more flexible and precise classifiers ¹⁰; 3) it could take advantage of instance annotations to improve the detection performance ¹⁸; 4) it’s robust and widely utilized for comparison in generic object detection.

We initialize the Mask R-CNN with parameters for COCO dataset in ¹⁸ and use a ResNet50 pretrained on ImageNet ³⁵ as the backbone. Deformable convolutional networks(DCN) ³⁶ is added to enhance the transformation modeling capability. To output feature maps with suitable scales for acne lesions, only P2, P3, P4, and P5 feature maps of FPN ³⁴ are preserved. 2000 proposals per feature map from FPN are performed with an NMS with IoU threshold of 0.7, and 1000 proposals per sub-image are outputted at last to R-CNN. The anchors and proposals are assigned as positive samples if their IoUs with the matched ground truth are greater than 0.5 when training both RPN and RCNN. In the inferring process, an NMS with IoU threshold of 0.5 is utilized for all the detection bounding boxes, and only 200 per sub-image at most are outputted as detection results at last. In this paper, we find the overall performance works well with the hyper-parameter $\lambda_{nwd}$ and $\omega_{nwd}$ range within (0.8, 1.2), and they are both set as 1 by default. The network is trained by an SGD optimizer with 15 epochs, where the learning rate, momentum, and weight decay are 0.002, 0.9, and 0.0001, respectively. There are two ways to merge detection results of sub-images: 1) follow ³⁷ to perform an NMS on the detection results of sub-images; 2) directly input the whole image into the model and set a larger number of proposals in RPN and RCNN. As the post-processing techniques are not the focus of this paper, without loss of generality, we only investigate the detection performance on sub-images.

4.3 Metrics

To efficiently evaluate the proposed methods, following ⁷, we evaluate Average Precision(AP) and AR at IoU=0.5 with 100 detection boxes at most. Please note that IoU=0.5 is already a harsh condition in consideration of the arbitrary boundary and pixel-level human label error of small acne lesions; see Fig. 1 for example. In addition, metrics with regard to different object sizes are also reported. Specifically, $AP$ , $AP_{s}$ , $AP_{m}$ , and $AP_{l}$ denote the AP of all, small, medium, and large objects, respectively, and the same is true for AR. Please note AP is much more important than AR in object detection, because it evaluates the average precision under different recalls. However, AR is easy to improve by increasing the detection boxes. All this doesn’t conflict with our efforts to improve the quality and AR of the RPN proposals.

{tablehere}\tbl

Comparison results with previous state-of-the-art methods on ACNE04 \topruleMethod AP \botruleFaster R-CNN ⁸ 0.103 R-FCN ³⁸ 0.140 Rashataprucksa et al. ⁶ 0.147 ACNet ⁷ 0.205 \botruleours 0.22 \botrule

\tbl

Results of our method with different plain detection frameworks on AcneSCU \topruleMethod AP $AP_{s}$ $AP_{m}$ $AP_{l}$ $AR$ $AR_{s}$ $AR_{m}$ $AR_{l}$ \botruleHTC ²⁸ 0.455 0.452 0.503 0.466 0.823 0.811 0.847 0.813 HTC + SADH + NWD 0.462 0.449 0.510 0.411 0.819 0.808 0.838 0.688 \botruleFaster R-CNN ⁸ 0.464 0.390 0.496 0.434 0.750 0.691 0.725 0.747 Faster R-CNN + SADH + NWD 0.480 0.474 0.539 0.447 0.760 0.748 0.746 0.766 \botruleMask R-CNN ¹⁸ 0.481 0.488 0.515 0.469 0.750 0.742 0.714 0.759 Mask R-CNN + SADH + NWD 0.507 0.494 0.555 0.472 0.775 0.761 0.754 0.786 \botrule

4.4 Results

To demonstrate the efficiency of the proposed method, state-of-the-art methods of both one-stage detection methods, e.g. FCOS ²⁰, FSAF ²⁴, and AutoAssign ²⁰, and the two-stage detection, e.g. Faster R-CNN ⁸, Cascade R-CNN ²⁵, PANet ²⁶, Cascade Mask R-CNN ²⁷, Mask Scoring R-CNN ²², HTC ²⁸, GHM ²⁹, Libra R-CNN ³⁰, OHEM ³¹, and PISA ³², are chosen as peer competitors on AcneSCU dataset. To make a fair comparison, we try our best to implement the compared methods with the same settings and hyper-parameters. For ACNE04, we compare the methods which have published results on it.

Table 3.3.1 shows the comparative results on AcneSCU. On the overall metric AP, our method outperforms all the compared state-of-the-art one-stage and two-stage generic object detection methods. Specifically, Mask R-CNN with masked crop outperforms the baseline with $2.4\%$ AP, which demonstrates the efficiency of the proposed masked crop. We argue that this is because masked crop could alleviate the influence caused by unavoidable partial lesions. Mask R-CNN with SADH achieves $1.6\%$ higher AP than that without SADH. This may benefit from a steeper classification confidence gradient, which could suppress the proposals of low IoUs. Compared to ”Mask R-CNN + SADH”, the NWD prediction branch could further improve the AP with $1.0\%$ , which demonstrates the efficiency of NWD rectification, especially on medium and small lesions. In addition, we find AutoAssign achieves the greatest AR. That is because AutoAssign’s detection number is much greater than that of our method, about 5:1.

Table 4.3 shows the comparative results on ACNE04. For a fair comparison, we adopt the same data augmentation with previous method ⁷, which means our result doesn’t benefit from the proposed masked crop data preprocessing method. In addition, we calculate the overall performance on the whole images to ensure comparison under the same evaluation condition. Despite without masked crop, the proposed outperforms all the previous methods on the public ACNE04 dataset.

Moreover, we also evaluate the generalization performance of our method with different detection frameworks on AcneSCU. As shown in Table 4.3, it could be found our method achieves consistent improvement with different detection frameworks, demonstrating the generalization performance of our method. Specifically, our method could improve HTC, Faster R-CNN, and Mask R-CNN with $0.7\%$ , $1.6\%$ , and $2.6\%$ , respectively. The best improvement of Mask R-CNN may be due to its best detection performance of the baseline.

{tablehere}\tbl

Ablation study on ACNESCU \topruleGroup Convolution Loss Metric AP \botrule1 $3\times 3$ - - 0.488 $1\times 1$ - - 0.493 \botrule2 - $L_{1}$ loss - 0.487 - SBCE - 0.498 \botrule3 - - IoU 0.490 - - GIoU 0.480 - - DIoU 0.483 - - NWD 0.498 \botrule

4.5 Ablation Study

To validate different components of the proposed method, three groups of ablation experiments are performed, and the results are shown in Table 4.4. Within each group, all the settings are same, including but not limited to DCNs in the backbone, GN in RPN head, and IoU threshold of NMS, except for the investigated component. Therefore, the influence of GN or DCNs on overall comparison is excluded. In addition, we try to make the smallest modification on the baseline. For example, we use the origin RPN for group 2 and 3.

{figurehere} [Uncaptioned image]

The curve of the classification confidence gradient. The x-axis denotes the Manhattan distance between the feature map positions of the proposal and ground truth, while the y-axis denotes the classification confidence gradient, defined by $\Delta{s_{cls}}=s_{cls}(x)-s_{cls}(0)$ , where $s_{cls}(x)$ denotes the classification score at the distance of x.

Specifically, the first group is conducted to analyze the kernel size of the convolution layer performed on the FPN feature map, emphasized with a bold font in Fig. 1(b), in the classification head of SADH. We find that $1\times 1convolution$ plays an important role in the success of SADH, which demonstrates that the kernel size indeed influences the independence of classification scores at adjacent positions on the final feature map. To valid the efficiency of SBCE loss, the second group compares the $L_{1}$ loss ³⁹ and SBCE loss used in the NWD prediction branch. The results show that the proposed SBCE achieves $1.1\%$ higher AP than $L_{1}$ loss. We argue SBCE loss had a steeper curve than $L_{1}$ loss, leading to a faster and better convergence. To analysis the effectiveness of the localization confidence metric, the third group is performed by learning 4 different metrics, i.e., IoU, GIoU ⁴⁰, DIoU ⁴¹, and NWD. It could be found that classification rectification with NWD is better than that with IoU, GIoU, and DIoU. This may be interpreted that NWD is a more suitable localization confidence metric for small acne lesions.

5 Discussion

In this section, we firstly give some visual views to show the detection results of the proposed method in Subsection 5.1. Then, we intuitively to show the efficiency of the proposed method through in-depth analysis in Subsection 5.2.

5.1 Visual View and Analysis of the Detection Results

The visual view of some detection results is shown in Fig. 1. It could be found most skin lesions, especially those which have obvious visual features, could be well detected. The reasons for false detection results could be concluded as: 1) some lesions located on cheeks do not have apparent texture features on the front-view facial images; 2) several close tiny are detected as one lesion; 3) some tiny lesions are too tiny(with sizes less than 20) to be detected for the harsh IoU threshold; 4) some acne lesions with arbitrary boundaries are hard to detect; 5) noisy labels. These problems would be the subjects of our future works.

5.2 In-depth Analysis

To intuitively demonstrate the efficiency of the proposed SADH, we plot the steepness of the proposals’ classification scores, which is referred to as classification confidence gradient in this paper, moving from the center of the ground truth to its boundary. As shown in Fig. 4.5, it could be found that SADH had a steeper classification confidence gradient than the vanilla RPN. We argue a steeper classification confidence gradient could generate more independent classification scores at adjacent positions on the final feature map, suppress the proposals with low IoUs but high classification scores, and then improve the detection of the hard samples.

To demonstrate the effectiveness of the NWD prediction branch, we visualize the relations between the proposals’ IoUs and classification scores $s_{cls}$ , the predicted NWDs $p$ , and the rectified final scores $s_{final}$ , respectively. As shown in Fig. 2, there exists a large number of proposals with IoUs ranging from 0 to 0.4 and classification scores below 0.1. We argue that this is caused by the label assignment in RPN. The anchors whose IoUs below the label threshold, always set as 0.5, are regarded as negative samples. After the rectification of NWDs, these samples could be assigned with more appropriate classification confidences, i.e., the final scores $S_{final}$ . As a result, the detection performance benefits from the higher relation between the classification confidences and IoUs.

6 Conclusion

In this paper, we propose a new RPN head structure to improve the proposals’ quality in acne detection. At first, a Spatial Aware Double Head structure is proposed to learn a steeper classification confidence gradient and suppress the redundant inferior proposals of easy samples. Then, an NWD prediction branch along with a Soft-style Binary Cross Entropy loss is proposed to rectify the proposals’ classification scores. In addition, we construct a new dataset named AcneSCU, with high-resolution imageries, more precise annotations, and fine-grained lesion categories, to facilitate further research on acne detection. Extensive experiments on both AcneSCU and the public dataset ACNE04 demonstrate the efficiency of the proposed method.

\nonumsection

Acknowledgments This work was supported in part by Natural Science Foundation of China under Grant No.62025601.

References

1 E. Bernardis, H. Shou, J. S. Barbieri, P. J. McMahon, M. J. Perman, L. A. Rola, J. L. Streicher, J. R. Treat, L. Castelo-Soccio and A. C. Yan, Development and initial validation of a multidimensional acne global grading system integrating primary lesions and secondary changes, JAMA dermatology 156(3) (2020) 296–302.
2 X. Wu, N. Wen, J. Liang, Y.-K. Lai, D. She, M.-M. Cheng and J. Yang, Joint acne image grading and counting via label distribution learning, Proceedings of the IEEE/CVF International Conference on Computer Vision, 2019, pp. 10642–10651.
3 J. S. Barbieri, R. Fulton, R. Neergaard, M. N. Nelson, F. K. Barg and D. J. Margolis, Patient perspectives on the lived experience of acne and its treatment among adult women with acne: A qualitative study, JAMA dermatology 157(9) (2021) 1040–1046.
4 N. Kittigul and B. Uyyanonvara, Automatic acne detection system for medical treatment progress report, 2016 7th International Conference of Information and Communication Technology for Embedded Systems (IC-ICTES), IEEE2016, pp. 41–44.
5 G. Maroni, M. Ermidoro, F. Previdi and G. Bigini, Automated detection, extraction and counting of acne lesions for automatic evaluation and tracking of acne severity, 2017 IEEE symposium series on computational intelligence (SSCI), IEEE2017, pp. 1–6.
6 K. Rashataprucksa, C. Chuangchaichatchavarn, S. Triukose, S. Nitinawarat, M. Pongprutthipan and K. Piromsopa, Acne detection with deep neural networks, 2020 2nd International Conference on Image Processing and Machine Vision, 2020, pp. 53–56.
7 K. Min, G.-H. Lee and S.-W. Lee, Acnet: Mask-aware attention with dynamic context enhancement for robust acne detection, arXiv preprint arXiv:2105.14891 (2021).
8 S. Ren, K. He, R. Girshick and J. Sun, Faster r-cnn: Towards real-time object detection with region proposal networks, Advances in neural information processing systems 28 (2015) 91–99.
9 R. Girshick, J. Donahue, T. Darrell and J. Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, Proceedings of the IEEE conference on computer vision and pattern recognition, 2014, pp. 580–587.
10 J. Hosang, R. Benenson, P. Dollár and B. Schiele, What makes for effective detection proposals?, IEEE transactions on pattern analysis and machine intelligence 38(4) (2015) 814–830.
11 Z. Zou, Z. Shi, Y. Guo and J. Ye, Object detection in 20 years: A survey, arXiv preprint arXiv:1905.05055 (2019).
12 J. Wang, C. Xu, W. Yang and L. Yu, A normalized gaussian wasserstein distance for tiny object detection, arXiv preprint arXiv:2110.13389 (2021).
13 J. Long, E. Shelhamer and T. Darrell, Fully convolutional networks for semantic segmentation, Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 3431–3440.
14 H. Law and J. Deng, Cornernet: Detecting objects as paired keypoints, Proceedings of the European conference on computer vision (ECCV), 2018, pp. 734–750.
15 K. Duan, S. Bai, L. Xie, H. Qi, Q. Huang and Q. Tian, Centernet: Keypoint triplets for object detection, Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 6569–6578.
16 T. Ma, W. Tian, P. Kuang and Y. Xie, An anchor-free object detector with novel corner matching method, Knowledge-Based Systems 224 (2021) p. 107083.
17 X. Sun, C. Chen, J. Dong, D. Liu and G. Hu, Exploring ubiquitous relations for boosting classification and localization, Knowledge-Based Systems 196 (2020) p. 105824.
18 K. He, G. Gkioxari, P. Dollár and R. Girshick, Mask r-cnn, Proceedings of the IEEE international conference on computer vision, 2017, pp. 2961–2969.
19 Y. Wu, Y. Chen, L. Yuan, Z. Liu, L. Wang, H. Li and Y. Fu, Rethinking classification and localization for object detection, Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 10186–10195.
20 Z. Tian, C. Shen, H. Chen and T. He, Fcos: Fully convolutional one-stage object detection, Proceedings of the IEEE/CVF international conference on computer vision, 2019, pp. 9627–9636.
21 B. Jiang, R. Luo, J. Mao, T. Xiao and Y. Jiang, Acquisition of localization confidence for accurate object detection, Proceedings of the European conference on computer vision (ECCV), 2018, pp. 784–799.
22 Z. Huang, L. Huang, Y. Gong, C. Huang and X. Wang, Mask scoring r-cnn, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6409–6418.
23 B. Zhu, J. Wang, Z. Jiang, F. Zong, S. Liu, Z. Li and J. Sun, Autoassign: Differentiable label assignment for dense object detection, arXiv preprint arXiv:2007.03496 (2020).
24 C. Zhu, Y. He and M. Savvides, Feature selective anchor-free module for single-shot object detection, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 840–849.
25 Z. Cai and N. Vasconcelos, Cascade r-cnn: Delving into high quality object detection, Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 6154–6162.
26 S. Liu, L. Qi, H. Qin, J. Shi and J. Jia, Path aggregation network for instance segmentation, Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 8759–8768.
27 Z. Cai and N. Vasconcelos, Cascade r-cnn: High quality object detection and instance segmentation, IEEE Transactions on Pattern Analysis and Machine Intelligence (2019) p. 1–1.
28 K. Chen, J. Pang, J. Wang, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Shi, W. Ouyang, C. C. Loy and D. Lin, Hybrid task cascade for instance segmentation, IEEE Conference on Computer Vision and Pattern Recognition, 2019.
29 B. Li, Y. Liu and X. Wang, Gradient harmonized single-stage detector, AAAI Conference on Artificial Intelligence, 2019.
30 J. Pang, K. Chen, Q. Li, Z. Xu, H. Feng, J. Shi, W. Ouyang and D. Lin, Towards balanced learning for instance recognition, International Journal of Computer Vision 129(5) (2021) 1376–1393.
31 A. Shrivastava, A. Gupta and R. Girshick, Training region-based object detectors with online hard example mining, Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 761–769.
32 Y. Cao, K. Chen, C. C. Loy and D. Lin, Prime sample attention in object detection, IEEE Conference on Computer Vision and Pattern Recognition, 2020.
33 K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy and D. Lin, MMDetection: Open mmlab detection toolbox and benchmark, arXiv preprint arXiv:1906.07155 (2019).
34 T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan and S. Belongie, Feature pyramid networks for object detection, Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 2117–2125.
35 J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li and L. Fei-Fei, Imagenet: A large-scale hierarchical image database, 2009 IEEE conference on computer vision and pattern recognition, Ieee2009, pp. 248–255.
36 J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu and Y. Wei, Deformable convolutional networks, Proceedings of the IEEE international conference on computer vision, 2017, pp. 764–773.
37 A. Van Etten, You only look twice: Rapid multi-scale object detection in satellite imagery, arXiv preprint arXiv:1805.09512 (2018).
38 J. Dai, Y. Li, K. He and J. Sun, R-fcn: Object detection via region-based fully convolutional networks, Advances in neural information processing systems, 2016, pp. 379–387.
39 R. Girshick, Fast r-cnn, Proceedings of the IEEE international conference on computer vision, 2015, pp. 1440–1448.
40 H. Rezatofighi, N. Tsoi, J. Gwak, A. Sadeghian, I. Reid and S. Savarese, Generalized intersection over union: A metric and a loss for bounding box regression, Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 658–666.
41 Z. Zheng, P. Wang, W. Liu, J. Li, R. Ye and D. Ren, Distance-iou loss: Faster and better learning for bounding box regression, Proceedings of the AAAI Conference on Artificial Intelligence, 34(07)2020, pp. 12993–13000.