

1 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
2 CRISE, Institute of Automation, Chinese Academy of Sciences, Beijing, China
3 CAS Center for Excellence in Brain Science and Intelligence Technology, Beijing, China
[email protected], {houjing.huang,wenjie.yang,xtchen,kaiqi.huang}@nlpr.ia.ac.cn

Rethinking of Pedestrian Attribute Recognition: Realistic Datasets and A Strong Baseline

Jian Jia 1,2    Houjing Huang 1,2    Wenjie Yang 1,2    Xiaotang Chen 1,2    Kaiqi Huang 1,2,3
Abstract

Although various methods have been proposed to advance pedestrian attribute recognition, a crucial problem in existing datasets is often neglected, namely, a large number of identical pedestrian identities in the train and test sets, which is not consistent with practical applications. As a result, images of the same pedestrian identity in the train set and test set are extremely similar, leading to overestimated performance of state-of-the-art methods on existing datasets. To address this problem, we propose two realistic datasets, PETA_zs and RAPv2_zs, which follow the zero-shot setting of pedestrian identities and are constructed from the PETA and RAPv2 datasets. Furthermore, we observe that recent state-of-the-art methods cannot improve performance over our strong baseline on PETA, RAPv2, PETA_zs, and RAPv2_zs. Experiments on existing and proposed datasets verify the superiority of our method, which achieves state-of-the-art performance. Code is available at https://github.com/valencebond/Strong_Baseline_of_Pedestrian_Attribute_Recognition

Keywords:
Pedestrian Attribute Recognition, Realistic Datasets, Multi-label Classification

1 Introduction

Pedestrian attribute recognition [34] aims to predict multiple attributes of pedestrian images, such as age, gender, and clothing, as semantic descriptions in video surveillance.

Recently, pedestrian attribute recognition has drawn increasing attention due to its great potential in real-world applications such as person retrieval [18], person search [9], and person re-identification [31, 22]. As with many vision tasks, progress on pedestrian attribute recognition has been significantly advanced by deep learning. From the pioneering work DeepMAR [16] based on CaffeNet [8] to the recent work VAC [10] based on ResNet50 [11], mA performance has increased from 73.79 to 78.47 on the RAPv1 [19] dataset (we use RAPv1 and RAPv2 to denote the datasets published by Li et al. [19] and Li et al. [18], respectively).

Figure 1: Extremely similar images of the same pedestrian identity in the train set and test set. (a) Images in the PETA dataset. (b) Images in the RAPv2 dataset. (c) The proportions of common-identity images in the test sets of PETA and RAPv2. The proportion of common-identity images in the RAPv2 test set is at least 31.5%, because some images are not labeled with a pedestrian identity.

However, while various networks have been proposed to improve performance by extracting more discriminative features, a crucial problem in existing popular datasets is often neglected: a large number of pedestrian identities appear in both the train and test sets. This results in plenty of extremely similar images of the same pedestrian identity in the train and test sets. For example, in PETA [7] and RAPv2 [18], the first large-scale pedestrian attribute dataset and the updated version of the most popular dataset RAPv1 [19], images of the same pedestrian identity in the train and test sets are almost identical except for negligible background and pose variation, as shown in Fig. 1(a) and Fig. 1(b).

The proportion of common-identity images in the test set is calculated as shown in Fig. 1(c) (common-identity indicates a pedestrian identity that exists in both the train and test sets; unique-identity indicates an identity that exists only in the train set or only in the test set). It is worth noting that 57.5% and 31.5% of the test-set images have similar counterparts of the same pedestrian in the train sets of PETA and RAPv2, respectively. Although there are also plenty of similar images in the train and test sets of RAPv1, we cannot obtain the accurate proportion of common-identity images in its test set, because pedestrian identity labels are not provided. Thus, existing datasets are unrealistic for practical applications, in which test-set identities barely overlap with train-set identities.

Figure 2: Performance of the (a) MsVAA [26] and (b) VAC [10] methods on common-identity images, unique-identity images, and all images of the PETA test set. There is a significant performance gap between common-identity images and unique-identity images, as well as between unique-identity images and all test images. This remarkable gap shows the irrationality of existing datasets. A similar phenomenon is observed for the ALM [28] method and on the RAPv2 dataset.

More importantly, the performance of state-of-the-art (SOTA) methods is overestimated on existing datasets, which misleads the progress of pedestrian attribute recognition. To verify this hypothesis, we reimplement the recent SOTA methods MsVAA [26], VAC [10], and ALM [28], and conduct experiments to evaluate their performance on common-identity images and unique-identity images separately on PETA. As illustrated in Fig. 2(a), the performance gaps of MsVAA between common-identity and unique-identity images are 20.51%, 24.70%, 15.12%, 16.87%, and 16.32% in mA, Acc, Precision, Recall, and F1 respectively. These significant gaps validate that the existing datasets are impractical for real scenarios. Compared to the performance on all test images, the performance on unique-identity images drops by 12.7%, 12.14%, 6.95%, 8.4%, and 7.86% in mA, Acc, Precision, Recall, and F1 respectively, which shows that the performance of existing methods is overestimated. Similar performance degradation is also observed for the VAC and ALM methods and on RAPv2.

To address this problem, we propose a realistic setting for pedestrian attribute datasets: pedestrian identities of the test set have no overlap with identities of the train set, i.e., pedestrian identities follow the zero-shot setting. Based on the PETA [7] and RAPv2 [18] datasets, which provide pedestrian identity labels, we construct two realistic datasets, PETA_zs and RAPv2_zs, that conform to the zero-shot setting of pedestrian identities, as illustrated in Fig. 3. The consistent performance drop of different SOTA methods on the proposed datasets highlights their rationality, as detailed in Table 2.

Furthermore, we find experimentally that SOTA methods cannot improve performance over our strong baseline method, for two reasons. First, the performance improvements reported by recent SOTA methods are measured against underutilized baselines. Second, SOTA methods resort to localizing attribute-specific regions to achieve better performance, via attention modules [26, 10] or a Spatial Transformer Network (STN) [28]. However, the strong baseline itself already achieves good localization of attribute areas, so further strengthening the localization capability cannot bring additional performance improvement.

The contributions of this paper are as follows:

  • We identify a crucial problem in existing pedestrian attribute datasets, i.e., a large number of identical pedestrian identities in the train and test sets, which is impractical and misleads model evaluation.

  • To solve this dataset problem, we propose two datasets, PETA_zs and RAPv2_zs, in which pedestrian identities follow the zero-shot setting.

  • Based on our strong baseline, we experimentally find that enhancing the localization of attribute-specific areas, as adopted by SOTA methods, does not bring further performance improvement.

The rest of this paper is organized as follows. Section 2 reviews related work on pedestrian attribute recognition. Section 3 rethinks the existing pedestrian attribute setting and proposes two realistic datasets. Section 4 presents our method, and experiments are reported in Section 5. Finally, Section 6 concludes this work and discusses future directions.

2 Related Work

2.1 Pedestrian Attribute Recognition

Most recent efforts resort to learning discriminative attribute-specific feature representations by enhancing attribute localization. Li et al. [16] first formulated pedestrian attribute recognition as a multi-label classification task and proposed the weighted sigmoid cross-entropy loss. Considering that better attribute localization can reduce the impact of irrelevant areas, Liu et al. [25] proposed HydraPlus-Net with multi-directional attention modules to locate fine-grained attributes. Liu et al. [23] proposed a Localization Guided Network based on Class Activation Maps (CAM) [33] and EdgeBoxes [35] to extract attribute-specific local features. Since the spatial scales of different attributes vary, multi-scale feature maps from different backbone layers were reused [32, 26], i.e., a pyramidal feature hierarchy, instead of the single feature map of the last layer. Guo et al. [10] exploited the assumption that visual attention regions should be consistent between different spatial transforms of the same image and proposed an attention consistency loss to obtain robust attribute localization. Inspired by the Feature Pyramid Network (FPN) [20], Tang et al. [28] constructed an Attribute Localization Module with Squeeze-and-Excitation (SE) blocks [13] and a Spatial Transformer Network (STN) [15] to enhance attribute localization. From the viewpoint of capturing attribute semantic dependencies, some methods focus on modeling attribute relationships. Wang et al. [30] transformed pedestrian attribute recognition into sequence prediction with Long Short-Term Memory (LSTM) [12] to explore attribute context and correlation.

Compared to previous work, our work rethinks the recent progress in pedestrian attribute recognition from the perspective of both datasets and methods. First, the dataset problem we identify leads to overestimated performance and misleads the evaluation of recent methods; we therefore propose two reasonable and realistic datasets. Second, based on a fully utilized baseline network, we find that localizing attribute-specific areas, as adopted by recent SOTA methods [26, 10, 28], does not bring further performance improvement.

3 Proposed Realistic Datasets

In this section, we first answer two questions: what is wrong with existing pedestrian attribute datasets, and why does it matter for academic research and industrial applications? Then we introduce a new realistic setting and propose two reasonable and practical datasets, PETA_zs and RAPv2_zs.

3.1 Problems of existing Datasets

As far as we know, APiS [34] is the first pedestrian attribute recognition dataset, followed by PETA [7], RAPv1 [19], PA100k [25], and RAPv2 [18], which have promoted the development of pedestrian attribute recognition. However, no clear, concrete, and unified protocol has been proposed for pedestrian attribute dataset construction.

For the PETA, RAPv1, and RAPv2 datasets, random splitting is adopted as the default protocol to construct the train and test sets, and this protocol is used by almost all methods [19, 25, 30, 18, 26, 10, 28]. This results in a large number of identical pedestrian identities in the train and test sets. Since images of the same pedestrian identity are often collected by the same surveillance camera within very short frame intervals, their appearance is extremely similar with negligible background and pose variation, as shown in Fig. 1(a) and Fig. 1(b). As a result, there are plenty of extremely similar images in the train and test sets, as shown in Fig. 1(c).

Figure 3: Pedestrian identity distribution on PETA and PETA_zs. The x-axis indicates the pedestrian identity index and the y-axis indicates the proportion of images belonging to that identity. There are 1,106 common identities in the train and test sets of PETA, accounting for 19.42% of the identities (55.27% of the images) in the train set and 26.91% of the identities (57.70% of the images) in the test set. Our proposed PETA_zs solves this problem by completely separating the identities of the test set from those of the train set.

To further illuminate the difference between existing datasets and our proposed datasets, the pedestrian identity distribution of PETA is given in Fig. 3. In the existing PETA dataset, there are 1,106 identical identities in the train and test sets, accounting for 26.91% of the identities and 57.70% of the images of the test set. This means that more than half of the test-set images have similar counterparts in the train set, which amounts to a form of data leakage. In our proposed PETA_zs, there is no overlap between the pedestrian identities of the train and test sets. As for RAPv1 (or RAPv2), all (or some) images lack pedestrian identity annotations, so we cannot obtain the exact identity distribution.

Given this problem of existing datasets, the reasons why it matters for industrial application and academic research are as follows. Whether pedestrian attribute recognition is used as the primary task in video surveillance or as an auxiliary task in person retrieval, test-set identities barely overlap with train-set identities, so the existing dataset setting is inconsistent with real-world applications. More importantly, the performance of SOTA methods is overestimated on existing datasets, and model evaluation is misled. We reimplement the recent methods MsVAA [26], VAC [10], and ALM [28] and report their performance in Fig. 2. Compared to the performance on the whole test set, the consistent performance degradation of all three methods on the unique-identity images of the test set confirms that model performance is overestimated.

3.2 Construction of Proposed Datasets

To solve the problem of existing pedestrian attribute datasets, we propose a realistic setting: no pedestrian identity appears in both the train and test sets, i.e., pedestrian identities follow the zero-shot setting. Concretely, we propose the following criteria for dataset construction, to serve as a reference for future datasets.

Criteria of proposed datasets:
  1. $\mathbb{I}_{all}=\mathbb{I}_{train}\cup\mathbb{I}_{valid}\cup\mathbb{I}_{test}$, with $|\mathbb{I}_{train}|:|\mathbb{I}_{valid}|:|\mathbb{I}_{test}| \approx 3:1:1$.

  2. $\mathbb{I}_{train}\cap\mathbb{I}_{valid}=\varnothing$, $\mathbb{I}_{train}\cap\mathbb{I}_{test}=\varnothing$, $\mathbb{I}_{valid}\cap\mathbb{I}_{test}=\varnothing$.

  3. $\big|\,|\mathbb{I}_{valid}|-|\mathbb{I}_{test}|\,\big| < |\mathbb{I}_{all}| \times T_{id}$.

  4. $|N_{valid}-N_{test}| < T_{img}$.

  5. $|R_{valid}-R_{test}| < T_{attr}$.

where $\mathbb{I}$ denotes a pedestrian identity set, and $N$, $R$, $T$ denote the number of images, the positive sample ratio of attributes, and a pre-defined threshold, respectively. $T_{id}=0.01$, $T_{img}=300$, and $T_{attr}=0.03$ are used in our experiments by default. The subscripts $train$, $valid$, $test$ denote the train set, validation set, and test set, respectively. $|\cdot|$ denotes set cardinality.

To solve the problem of identical identities in the train and test sets, Criteria 1, 2, and 3 are proposed to repartition the train and test sets of the PETA and RAPv2 datasets based on pedestrian identities, ensuring that identities follow the zero-shot setting. To better evaluate model performance and to control the attribute distribution difference between the validation and test sets, Criteria 4 and 5 are proposed. A minimal consistency check over these criteria is sketched below.
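The following is a minimal sketch, not the authors' release code, of how a candidate identity-level split could be checked against Criteria 1-5. The input format (a dict mapping each identity to its per-image binary attribute matrix) and the function name check_split are assumptions made for illustration.

```python
import numpy as np

# Thresholds stated in the text.
T_ID, T_IMG, T_ATTR = 0.01, 300, 0.03

def check_split(id2labels, train_ids, valid_ids, test_ids):
    """id2labels: dict mapping identity -> (num_images, num_attrs) binary array."""
    train_ids, valid_ids, test_ids = map(set, (train_ids, valid_ids, test_ids))
    all_ids = set(id2labels)

    # Criterion 1: the three identity sets cover all identities (the ~3:1:1 ratio
    # is enforced by the splitting procedure itself).
    assert train_ids | valid_ids | test_ids == all_ids

    # Criterion 2: zero-shot setting, no identity overlap between any two sets.
    assert not (train_ids & valid_ids)
    assert not (train_ids & test_ids)
    assert not (valid_ids & test_ids)

    # Criterion 3: valid/test identity counts differ by less than |I_all| * T_id.
    assert abs(len(valid_ids) - len(test_ids)) < len(all_ids) * T_ID

    def stack(ids):
        return np.concatenate([id2labels[i] for i in ids], axis=0)

    valid_labels, test_labels = stack(valid_ids), stack(test_ids)

    # Criterion 4: image counts of validation and test sets are close.
    assert abs(len(valid_labels) - len(test_labels)) < T_IMG

    # Criterion 5: per-attribute positive ratios of validation and test sets are close.
    assert np.all(np.abs(valid_labels.mean(0) - test_labels.mean(0)) < T_ATTR)
    return True
```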

Based on the criteria above, and considering that pedestrian identity labels are only provided in the PETA and RAPv2 datasets, two realistic datasets, PETA_zs and RAPv2_zs, are proposed, where the subscript zs denotes that pedestrian identities of the dataset follow the zero-shot setting. Details of the two proposed datasets are given in Table 1 and the supplementary material.

4 Methods

4.1 The Strong Baseline

Consider a dataset $\mathbb{D}=\{(\bm{X}_i, \bm{y}_i)\}$, $\bm{y}_i \in \{0,1\}^M$, $i = 1, 2, \dots, N$, where $\bm{y}_i$ is the ground-truth vector, $N$ and $M$ denote the numbers of training images and attributes respectively, and $\bm{X}_i$ denotes the $i$-th pedestrian image. Pedestrian attribute recognition is a multi-label task that learns to predict attributes $\hat{\bm{y}}_i \in \{0,1\}^M$ given the pedestrian image $\bm{X}_i$. The zeros and ones in the label vectors $\bm{y}$ and $\hat{\bm{y}}$ denote the absence and presence of the corresponding attributes in the pedestrian image.

Pedestrian attribute models generally adopt multiple binary classifiers with a sigmoid function [10, 28] instead of a multi-class classifier with a softmax function [29, 6], so the loss can be computed as

$$Loss = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}\omega_{j}\left(y_{ij}\log(\sigma(logits_{ij})) + (1-y_{ij})\log(1-\sigma(logits_{ij}))\right) \qquad (1)$$

where $\sigma(z) = 1/(1+e^{-z})$, $logits_{ij}$ is the output of the classifier layer, and the weight $\omega_j$, adopted from [17], alleviates the distribution imbalance between attributes:

$$\omega_{j} = \begin{cases} e^{1-r_{j}}, & y_{ij}=1 \\ e^{r_{j}}, & y_{ij}=0 \end{cases} \qquad (2)$$

where $r_j$ is the positive sample ratio of the $j$-th attribute in the train set.
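As a concrete illustration, the weighted loss of Eq. (1)-(2) can be written in a few lines of PyTorch. This is a hedged sketch rather than the authors' exact implementation; pos_ratio, the vector of per-attribute train-set positive ratios r_j, is assumed to be precomputed.

```python
import torch
import torch.nn.functional as F

def weighted_bce_loss(logits, targets, pos_ratio):
    """logits, targets: (batch, M) tensors; pos_ratio: (M,) tensor of r_j."""
    # Eq. (2): w_j = e^{1 - r_j} for positive labels, e^{r_j} for negative labels.
    weights = torch.where(targets == 1,
                          torch.exp(1.0 - pos_ratio),
                          torch.exp(pos_ratio))
    # Eq. (1): weighted sigmoid cross entropy, averaged over the batch.
    return F.binary_cross_entropy_with_logits(
        logits, targets.float(), weight=weights, reduction='mean')
```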

Denoting the feature representation of the $i$-th example as $\mathbf{x}_i \in \mathbb{R}^d$, the conditional probability output by the deep neural network is given by a linear classifier followed by a sigmoid function, as in Eq. (3):

$$p_{ij} = Pr(Y = y_{ij} \mid \mathbf{x}_i) = \frac{1}{1 + e^{-\mathbf{w}_j^{T}\mathbf{x}_i}} \qquad (3)$$

where $[\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_M] \in \mathbb{R}^{d \times M}$ is the weight matrix of the linear classifier and $p_{ij}$ is the probability of the $j$-th attribute for the $i$-th image. We denote this method as the baseline in the following sections; a minimal model sketch is given below.
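For clarity, a minimal sketch of such a baseline is shown below, assuming a torchvision ResNet50 backbone with global average pooling and a single linear classification layer. Details beyond what the text states, such as how the backbone head is removed, are our assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class Baseline(nn.Module):
    """ResNet50 features -> global average pooling -> linear classifier (Eq. 3)."""
    def __init__(self, num_attrs):
        super().__init__()
        backbone = torchvision.models.resnet50(pretrained=True)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool & fc
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(2048, num_attrs)

    def forward(self, x):                                  # x: (batch, 3, 256, 192)
        feat = self.pool(self.features(x)).flatten(1)      # (batch, 2048)
        return self.classifier(feat)                       # logits: (batch, M)

# Usage: per-attribute probabilities and binary predictions at threshold 0.5.
# model = Baseline(num_attrs=35)
# probs = torch.sigmoid(model(images))
# preds = (probs > 0.5).long()
```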

5 Experiments

5.1 Datasets and Evaluation Metrics

We conduct experiments on four existing pedestrian attribute datasets, PETA [7], RAPv1 [19], RAPv2 [18], and PA100k [25], and on the two proposed realistic datasets PETA_zs and RAPv2_zs. Details of each dataset are given in Table 1.

Table 1: Details of existing and proposed datasets. The zero-shot setting of pedestrian identities is adopted in the proposed PETA_zs and RAPv2_zs datasets. I_train, I_valid, I_test indicate the number of identities in the train, validation, and test sets respectively; N_train, N_val, N_test indicate the corresponding numbers of images. Pedestrian identities are not provided in RAPv1 and PA100k and are only partly provided in RAPv2, so the exact quantities cannot be counted, which is denoted by –. Due to the overlapping identities among the train, validation, and test sets of PETA, the sum of I_train, I_valid, and I_test in PETA is not equal to that in PETA_zs. Attribute denotes the number of attributes used for evaluation.

Dataset  | Setting   | I_train | I_valid | I_test | Attribute | Images  | N_train | N_val  | N_test
PETA     | existing  | 4,886   | 1,264   | 4,110  | 35        | 19,000  | 9,500   | 1,900  | 7,600
PETA_zs  | zero-shot | 5,211   | 1,703   | 1,785  | 35        | 19,000  | 11,051  | 3,980  | 3,969
RAPv2    | existing  | –       | –       | –      | 54        | 84,928  | 50,957  | 16,986 | 16,985
RAPv2_zs | zero-shot | 1,508   | 546     | 535    | 54        | 26,632  | 14,729  | 5,961  | 5,948
RAPv1    | existing  | –       | –       | –      | 51        | 41,585  | 33,268  | –      | 8,317
PA100k   | existing  | –       | –       | –      | 26        | 100,000 | 80,000  | 10,000 | 10,000

Two types of metrics, i.e., one label-based metric and four instance-based metrics, are adopted to evaluate attribute recognition performance [18]. For the label-based metric, we compute the mean of the classification accuracies on positive and negative samples for each attribute and then average over all attributes to obtain the mean Accuracy (mA). For the instance-based metrics, Accuracy, Precision, Recall, and F1-score are used. A minimal sketch of both metric types is given below.
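For reference, both metric types can be computed as in the following sketch, which follows the standard definitions in [18] as we understand them; gt and pred are assumed to be binary numpy arrays of shape (num_images, num_attrs).

```python
import numpy as np

def mean_accuracy(gt, pred):
    """Label-based mA: mean of positive and negative accuracy, averaged over attributes."""
    pos_acc = (gt * pred).sum(0) / np.maximum(gt.sum(0), 1)
    neg_acc = ((1 - gt) * (1 - pred)).sum(0) / np.maximum((1 - gt).sum(0), 1)
    return ((pos_acc + neg_acc) / 2).mean()

def instance_metrics(gt, pred, eps=1e-12):
    """Instance-based Accuracy, Precision, Recall and F1."""
    inter = (gt * pred).sum(1)                       # per-image intersection
    union = ((gt + pred) > 0).sum(1)                 # per-image union
    acc = (inter / np.maximum(union, 1)).mean()
    prec = (inter / np.maximum(pred.sum(1), 1)).mean()
    rec = (inter / np.maximum(gt.sum(1), 1)).mean()
    f1 = 2 * prec * rec / (prec + rec + eps)
    return acc, prec, rec, f1
```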

5.2 Implementation Details

The proposed method is implemented with PyTorch and trained end-to-end. ResNet50 [11] is adopted as the backbone to extract pedestrian image features. Pedestrian images are resized to 256×192 with random horizontal mirroring as inputs. SGD is employed for training, with momentum 0.9 and weight decay 0.0005. The initial learning rate is 0.01 and the batch size is 64. A plateau learning rate scheduler is used with reduction factor 0.1, and the total number of training epochs is 30. A rough sketch of this configuration is given below.
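The following sketch reflects this training configuration; model, train_loader, pos_ratio, and the validate helper are assumed to exist, and the quantity monitored by the plateau scheduler is our assumption.

```python
import torch

# Hyper-parameters taken from the text; everything else is illustrative.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)

for epoch in range(30):
    model.train()
    for images, targets in train_loader:            # images resized to 256x192
        logits = model(images)
        loss = weighted_bce_loss(logits, targets, pos_ratio)  # loss sketch above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step(validate(model))                 # e.g. validation loss (assumed helper)
```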

5.3 Comparison with State-of-the-art Methods

Results on proposed datasets. To fully validate the rationality of the proposed datasets, we reimplement the recent methods MsVAA [26], VAC [10], and ALM [28] as MsVAA*, VAC*, and ALM* respectively, based on the ResNet50 backbone [11]. Quantitative experiments are conducted on the PETA_zs and RAPv2_zs datasets, and results are reported in Table 2. It is worth noting that, compared to the existing datasets, a remarkable performance drop of all methods exists on the proposed datasets, even though PETA_zs has more training images than PETA, as shown in Table 1. This result further validates our insight that performance on existing datasets is overestimated and that existing datasets mislead model evaluation. We find experimentally that there is a trade-off between Precision and Recall, so mA and the F1 score are more reliable and convincing. The proposed method achieves state-of-the-art performance with significantly fewer parameters and less computation.

Table 2: Performance comparison of four methods on the PETA and RAPv2 datasets. We use zero-shot to denote the setting of our proposed PETA_zs or RAPv2_zs dataset. Five metrics, mA, Accuracy, Precision, Recall, and F1, are evaluated. Parameters (Params) and multiply-accumulate operations (MACs) of the various methods are also reported.

Method      | Setting   | PETA: mA / Accu / Prec / Recall / F1  | RAPv2: mA / Accu / Prec / Recall / F1 | Params(M) | MACs(G)
MsVAA [26]* | existing  | 84.35 / 78.69 / 87.27 / 85.51 / 86.09 | 77.87 / 67.19 / 79.03 / 79.79 / 79.04 | 141.27    | 6.28
MsVAA [26]* | zero-shot | 71.03 / 59.38 / 74.75 / 70.10 / 72.37 | 71.32 / 63.59 / 77.22 / 76.62 / 76.44 |           |
VAC [10]*   | existing  | 83.63 / 78.94 / 87.63 / 85.45 / 86.23 | 76.74 / 67.52 / 80.42 / 78.78 / 79.24 | 23.61     | 14.335
VAC [10]*   | zero-shot | 71.05 / 58.90 / 74.98 / 70.48 / 72.13 | 70.20 / 65.45 / 79.87 / 76.65 / 77.07 |           |
ALM [28]*   | existing  | 84.24 / 77.84 / 85.79 / 85.60 / 85.41 | 78.21 / 66.98 / 78.25 / 80.43 / 78.93 | 30.86     | 4.32
ALM [28]*   | zero-shot | 70.67 / 58.56 / 72.97 / 71.31 / 71.65 | 71.97 / 64.52 / 77.28 / 77.74 / 77.06 |           |
Baseline    | existing  | 85.11 / 79.14 / 86.99 / 86.33 / 86.39 | 77.34 / 66.12 / 81.99 / 75.62 / 78.21 | 23.61     | 4.05
Baseline    | zero-shot | 71.84 / 58.77 / 77.06 / 68.24 / 71.72 | 70.83 / 63.63 / 82.28 / 72.22 / 76.34 |           |

  • * Results are reimplemented with the same settings as our baseline for a fair comparison.

Table 3: Performance comparison with state-of-the-art methods on the PETA, PA100k, and RAPv1 datasets. Five metrics, mA, Accuracy, Precision, Recall, and F1, are evaluated. – denotes results not reported.

Method                | Backbone     | PETA: mA / Accu / Prec / Recall / F1  | PA100k: mA / Accu / Prec / Recall / F1 | RAPv1: mA / Accu / Prec / Recall / F1
DeepMAR [16] (ACPR15) | CaffeNet     | 82.89 / 75.07 / 83.68 / 83.14 / 83.41 | 72.70 / 70.39 / 82.24 / 80.42 / 81.32  | 73.79 / 62.02 / 74.92 / 76.21 / 75.56
HPNet [25] (ICCV17)   | InceptionNet | 81.77 / 76.13 / 84.92 / 83.24 / 84.07 | 74.21 / 72.19 / 82.97 / 82.09 / 82.53  | 76.12 / 65.39 / 77.33 / 78.79 / 78.05
JRL [30] (ICCV17)     | AlexNet      | 82.13 / – / 82.55 / 82.12 / 82.02     | – / – / – / – / –                      | 74.74 / – / 75.08 / 74.96 / 74.62
LGNet [23] (BMVC18)   | Inception-V2 | – / – / – / – / –                     | 76.96 / 75.55 / 86.99 / 83.17 / 85.04  | 78.68 / 68.00 / 80.36 / 79.82 / 80.09
PGDM [17] (ICME18)    | CaffeNet     | 82.97 / 78.08 / 86.86 / 84.68 / 85.76 | 74.95 / 73.08 / 84.36 / 82.24 / 83.29  | 74.31 / 64.57 / 78.86 / 75.90 / 77.35
MsVAA [26] (ECCV18)   | ResNet101    | 84.59 / 78.56 / 86.79 / 86.12 / 86.46 | – / – / – / – / –                      | – / – / – / – / –
VAC [10] (CVPR19)     | ResNet50     | – / – / – / – / –                     | 79.16 / 79.44 / 88.97 / 86.26 / 87.59  | – / – / – / – / –
ALM [28] (ICCV19)     | BN-Inception | 86.30 / 79.52 / 85.65 / 88.09 / 86.85 | 80.68 / 77.08 / 84.21 / 88.84 / 86.46  | 81.87 / 68.17 / 74.71 / 86.48 / 80.16
FocalLoss             | ResNet50     | 83.00 / 76.18 / 84.85 / 84.44 / 84.31 | 78.49 / 77.87 / 86.96 / 85.00 / 85.58  | 77.32 / 65.91 / 80.74 / 75.13 / 76.47
Baseline              | ResNet50     | 85.11 / 79.14 / 86.99 / 86.33 / 86.39 | 79.38 / 78.56 / 89.41 / 84.78 / 86.55  | 78.48 / 67.17 / 82.84 / 76.25 / 78.94

Results on existing datasets. Experiments are also conducted on the existing PETA, RAPv1, and PA100k datasets to compare with recent methods, and results are reported in Table 3. We compare with state-of-the-art methods, including MsVAA [26], VAC [10], and ALM [28]. From the experiments we make the following observations. 1) Considering the overlapping identities in the train and test sets of the existing PETA and RAPv1 datasets, the performance on PA100k is more convincing. 2) Our proposed method with a ResNet50 backbone achieves better performance with only 16.71% of the parameters and 64.49% of the computation of the MsVAA method, which adopts ResNet101 as the backbone. 3) Compared to VAC [10], which uses extra training augmentation and a two-branch network, our baseline method achieves comparable performance with only 28.25% of the computation on PA100k. Replacing the linear classifier with a cosine-distance based classifier, our method obtains 86.02%, 80.71%, and 80.07% mA on PETA, RAPv1, and PA100k, outperforming the baseline by 0.91%, 2.23%, and 0.69% respectively. The consistent improvements over the baseline demonstrate the effectiveness of our strategy, i.e., normalizing the classifier weights of attributes to make them independent of the number of positive samples per attribute; a minimal sketch of such a cosine classifier is given below.
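A minimal sketch of such a cosine-distance based classifier follows. The text only specifies normalizing the attribute classifier weights; also normalizing the feature and applying a scale factor are common practice for cosine classifiers [29, 6] and are our assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Logit = scale * cosine similarity between the feature and each attribute weight."""
    def __init__(self, feat_dim, num_attrs, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_attrs, feat_dim))
        nn.init.xavier_uniform_(self.weight)
        self.scale = scale                                   # illustrative value

    def forward(self, feat):                                 # feat: (batch, feat_dim)
        cos = F.linear(F.normalize(feat, dim=1),
                       F.normalize(self.weight, dim=1))      # (batch, num_attrs)
        return self.scale * cos
```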

Table 4: Performance comparison of reimplemented methods and their reported baselines on PETA, PA100k, and RAPv1. Five metrics, mA, Accuracy, Precision, Recall, and F1, are evaluated. Parameters (Params) and multiply-accumulate operations (MACs) of the various methods are also reported. – denotes results not reported.

Method               | Backbone     | PETA: mA / Accu / Prec / Recall / F1  | PA100k: mA / Accu / Prec / Recall / F1 | RAPv1: mA / Accu / Prec / Recall / F1 | Params(M) | MACs(G)
Baseline (MsVAA [26])| ResNet101    | 82.67 / 76.63 / 85.13 / 84.46 / 84.79 | – / – / – / – / –                      | – / – / – / – / –                     |           |
Baseline (VAC [10])  | ResNet50     | – / – / – / – / –                     | 78.12 / 75.23 / 88.47 / 83.41 / 85.86  | – / – / – / – / –                     |           |
Baseline (ALM [28])  | BN-Inception | 82.66 / 77.73 / 86.68 / 84.20 / 85.57 | 77.47 / 75.05 / 86.61 / 85.34 / 85.97  | 75.76 / 65.57 / 78.92 / 77.49 / 78.20 |           |
Baseline (ours)      | ResNet50     | 85.11 / 79.14 / 86.99 / 86.33 / 86.09 | 79.38 / 78.56 / 89.41 / 84.78 / 86.25  | 78.48 / 67.17 / 82.84 / 76.25 / 78.94 |           |
MsVAA [26]* (ECCV18) | ResNet50     | 84.35 / 78.69 / 87.27 / 85.51 / 86.09 | 80.10 / 76.98 / 86.26 / 85.62 / 85.50  | 79.75 / 65.74 / 77.69 / 78.99 / 77.93 | 141.27    | 4.93
VAC [10]* (CVPR19)   | ResNet50     | 83.63 / 78.94 / 87.63 / 85.45 / 86.23 | 79.04 / 78.25 / 88.01 / 86.07 / 86.83  | 78.47 / 68.55 / 81.05 / 79.79 / 80.02 | 23.61     | 14.335
ALM [28]* (ICCV19)   | ResNet50     | 84.24 / 77.84 / 85.79 / 85.60 / 85.41 | 77.47 / 75.05 / 86.61 / 85.34 / 85.97  | 75.76 / 65.57 / 78.92 / 77.49 / 78.20 | 30.86     | 4.32

  • * Results are reimplemented with the same settings as our baseline for a fair comparison.

5.4 Ablation Study of Baseline

To make a fair comparison with previous SOTA methods, we reimplement the MsVAA, VAC, and ALM methods and report their performance on PETA, PA100k, and RAPv1, as well as their corresponding baseline performance, in Table 4. It is worth noting that our baseline achieves much better performance than the baselines of previous methods, even when they adopt the more powerful ResNet101 backbone [11]. Moreover, compared to previous SOTA methods reimplemented with the same backbone, our baseline achieves comparable or even better performance. We argue that the effectiveness of a method cannot be fully verified when it is compared against a weak baseline.

The reason why our baseline achieves comparable or even better performance than previous methods is that a strong baseline itself can implicitly learn the locations of attribute-specific areas. We utilize Grad-CAM [27] to locate the discriminative visual cues of our baseline model (a minimal Grad-CAM sketch is given after Fig. 4). As shown in Fig. 4, even without explicitly modeling the localization of attribute-specific areas, our baseline can localize them and learn discriminative representations. The important thing is not to locate the area of a specific attribute, but to distinguish fine-grained attributes within the same area, such as distinguishing sandals from sneakers. Compared to the originally reported performance of the SOTA methods, there is little difference in the performance of the reimplemented methods except for ALM. The reason is that the attention area of ALM is a hard bounding box, which is coarse-grained and introduces environmental noise.

Figure 4: Attribute-specific attention areas of our baseline method on the PETA, RAP, and PA100k datasets (from top to bottom).
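For reference, a minimal Grad-CAM [27] sketch over the Baseline module sketched in Section 4.1 is given below; it is an illustrative reimplementation rather than the authors' visualization code.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, attr_idx):
    """Return an (H, W) attention map for one attribute of a single image."""
    feats, grads = [], []
    h_fwd = model.features.register_forward_hook(lambda m, i, o: feats.append(o))
    h_bwd = model.features.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    logits = model(image.unsqueeze(0))              # (1, M)
    model.zero_grad()
    logits[0, attr_idx].backward()                  # back-propagate one attribute logit

    h_fwd.remove()
    h_bwd.remove()
    fmap, grad = feats[0], grads[0]                 # (1, C, H, W)
    weights = grad.mean(dim=(2, 3), keepdim=True)   # channel importance from pooled gradients
    cam = F.relu((weights * fmap).sum(dim=1))       # (1, H, W)
    return (cam / (cam.max() + 1e-12)).squeeze(0).detach()
```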

6 Conclusion

In this paper, we propose two realistic datasets, PETA_zs and RAPv2_zs, to address the unreasonable and impractical setting of existing datasets, which misleads model evaluation. Meanwhile, we find that SOTA methods cannot achieve further performance improvement over our strong baseline.

References

  • [1] Cao, K., Wei, C., Gaidon, A., Arechiga, N., Ma, T.: Learning imbalanced datasets with label-distribution-aware margin loss. In: NIPS (2019)
  • [2] Zhou, B., Cui, Q., Wei, X.S., Chen, Z.M.: BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. arXiv:1912.02413 (2019)
  • [3] Buda, M., Maki, A., Mazurowski, M.A.: A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks 106, 249–259 (2018)
  • [4] Byrd, J., Lipton, Z.C.: What is the effect of importance weighting in deep learning? arXiv preprint arXiv:1812.03372 (2018)
  • [5] Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9268–9277 (2019)
  • [6] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4690–4699 (2019)
  • [7] Deng, Y., Luo, P., Loy, C.C., Tang, X.: Pedestrian attribute recognition at far distance. In: Proceedings of the 22nd ACM international conference on Multimedia. pp. 789–792. ACM (2014)
  • [8] Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: A deep convolutional activation feature for generic visual recognition. In: International conference on machine learning. pp. 647–655 (2014)
  • [9] Feris, R., Bobbitt, R., Brown, L., Pankanti, S.: Attribute-based people search: Lessons learnt from a practical surveillance system. In: Proceedings of International Conference on Multimedia Retrieval. p. 153. ACM (2014)
  • [10] Guo, H., Zheng, K., Fan, X., Yu, H., Wang, S.: Visual attention consistency under image transforms for multi-label image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 729–739 (2019)
  • [11] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [12] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
  • [13] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7132–7141 (2018)
  • [14] Huang, C., Li, Y., Change Loy, C., Tang, X.: Learning deep representation for imbalanced classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5375–5384 (2016)
  • [15] Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in neural information processing systems. pp. 2017–2025 (2015)
  • [16] Li, D., Chen, X., Huang, K.: Multi-attribute learning for pedestrian attribute recognition in surveillance scenarios. In: ACPR. pp. 111–115 (2015)
  • [17] Li, D., Chen, X., Zhang, Z., Huang, K.: Pose guided deep model for pedestrian attribute recognition in surveillance scenarios. In: 2018 IEEE International Conference on Multimedia and Expo (ICME). pp. 1–6. IEEE (2018)
  • [18] Li, D., Zhang, Z., Chen, X., Huang, K.: A richly annotated pedestrian dataset for person retrieval in real surveillance scenarios. IEEE transactions on image processing 28(4), 1575–1590 (2018)
  • [19] Li, D., Zhang, Z., Chen, X., Ling, H., Huang, K.: A richly annotated dataset for pedestrian attribute recognition. arXiv preprint arXiv:1603.07054 (2016)
  • [20] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2117–2125 (2017)
  • [21] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980–2988 (2017)
  • [22] Lin, Y., Zheng, L., Zheng, Z., Wu, Y., Hu, Z., Yan, C., Yang, Y.: Improving person re-identification by attribute and identity learning. Pattern Recognition (2019)
  • [23] Liu, P., Liu, X., Yan, J., Shao, J.: Localization guided learning for pedestrian attribute recognition. In: BMVC (2018)
  • [24] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 212–220 (2017)
  • [25] Liu, X., Zhao, H., Tian, M., Sheng, L., Shao, J., Yi, S., Yan, J., Wang, X.: Hydraplus-net: Attentive deep features for pedestrian analysis. In: Proceedings of the IEEE international conference on computer vision. pp. 350–359 (2017)
  • [26] Sarafianos, N., Xu, X., Kakadiaris, I.A.: Deep imbalanced attribute classification using visual attention aggregation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 680–697 (2018)
  • [27] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618–626 (2017)
  • [28] Tang, C., Sheng, L., Zhang, Z., Hu, X.: Improving pedestrian attribute recognition with weakly-supervised multi-scale attribute-specific localization. In: Proceedings of the IEEE international conference on computer vision (2019)
  • [29] Wang, F., Xiang, X., Cheng, J., Yuille, A.L.: Normface: l 2 hypersphere embedding for face verification. In: Proceedings of the 25th ACM international conference on Multimedia. pp. 1041–1049. ACM (2017)
  • [30] Wang, J., Zhu, X., Gong, S., Li, W.: Attribute recognition by joint recurrent learning of context and correlation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 531–540 (2017)
  • [31] Yang, W., Huang, H., Zhang, Z., Chen, X., Huang, K., Zhang, S.: Towards rich feature discovery with class activation maps augmentation for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1389–1398 (2019)
  • [32] Yu, K., Leng, B., Zhang, Z., Li, D., Huang, K.: Weakly-supervised learning of mid-level features for pedestrian attribute recognition and localization. In: BMVC (2017)
  • [33] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2921–2929 (2016)
  • [34] Zhu, J., Liao, S., Lei, Z., Yi, D., Li, S.: Pedestrian attribute classification in surveillance: Database and evaluation. In: Proceedings of the IEEE international conference on computer vision workshops. pp. 331–338 (2013)
  • [35] Zitnick, C.L., Dollár, P.: Edge boxes: Locating object proposals from edges. In: European conference on computer vision. pp. 391–405. Springer (2014)
  • [36] Zou, Y., Yu, Z., Vijaya Kumar, B., Wang, J.: Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 289–305 (2018)