

1 School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
2 CRISE, Institute of Automation, Chinese Academy of Sciences, Beijing, China
3 CAS Center for Excellence in Brain Science and Intelligence Technology, Beijing, China
[email protected], {houjing.huang,wenjie.yang,xtchen,kaiqi.huang}@nlpr.ia.ac.cn

Rethinking of Pedestrian Attribute Recognition: Realistic Datasets and A Strong Baseline

Jian Jia 1,2    Houjing Huang 1,2    Wenjie Yang 1,2    Xiaotang Chen 1,2    Kaiqi Huang 1,2,3
Abstract

Although various methods have been proposed to advance pedestrian attribute recognition, a crucial problem in existing datasets is often neglected, namely, a large number of identical pedestrian identities in the train and test sets, which is not consistent with practical applications. As a result, images of the same pedestrian identity in the train set and test set are extremely similar, leading to overestimated performance of state-of-the-art methods on existing datasets. To address this problem, we propose two realistic datasets, PETA_zs and RAPv2_zs, which follow the zero-shot setting of pedestrian identities and are constructed from the PETA and RAPv2 datasets. Furthermore, we observe that recent state-of-the-art methods cannot improve performance over our strong baseline on PETA, RAPv2, PETA_zs, and RAPv2_zs. Experiments on existing and proposed datasets verify the superiority of our method, which achieves state-of-the-art performance. Code is available at https://github.com/valencebond/Strong_Baseline_of_Pedestrian_Attribute_Recognition

Keywords:
Pedestrian Attribute Recognition, Realistic Datasets, Multi-label Classification

1 Introduction

Pedestrian attribute recognition [34] aims to predict multiple attributes of pedestrian images, such as age, gender, and clothing, as semantic descriptions in video surveillance.

Recently, pedestrian attribute recognition has drawn increasing attention due to its great potential in real-world applications such as person retrieval [18], person search [9], and person re-identification [31, 22]. As with many vision tasks, progress on pedestrian attribute recognition has been significantly advanced by deep learning. From the pioneering work DeepMAR [16] based on CaffeNet [8] to the recent work VAC [10] based on ResNet50 [11], mA performance has increased from 73.79 to 78.47 on the RAPv1 [19] dataset (we use RAPv1 and RAPv2 to denote the datasets published by Li et al. [19] and Li et al. [18], respectively).

Figure 1: Extremely similar images of the same pedestrian identity in the train set and test set. (a) Images in the PETA dataset. (b) Images in the RAPv2 dataset. (c) The proportions of common-identity images in the test sets of PETA and RAPv2. The proportion of common-identity images in the RAPv2 test set is at least 31.5%, because some images are not labeled with a pedestrian identity.

However, while various networks have been proposed to improve performance by extracting more discriminative features, a crucial problem in existing popular datasets is often neglected: a large number of pedestrian identities appear in both the train and test sets. This results in plenty of extremely similar images of the same pedestrian identity in the train and test sets. For example, in PETA [7] and RAPv2 [18], the first large-scale pedestrian attribute dataset and the updated version of the most popular dataset RAPv1 [19], images of the same pedestrian identity in the train and test sets are almost identical except for negligible background and pose variation, as shown in Fig. 1(a) and Fig. 1(b).

The proportion of common-identity images in the test set is calculated as shown in Fig. 1(c) (common-identity indicates a pedestrian identity that exists in both the train and test sets; unique-identity indicates an identity that exists only in the train set or only in the test set). It is worth noting that 57.5% and 31.5% of the test-set images have similar counterparts of the same pedestrian in the train sets of PETA and RAPv2, respectively. Although there are also plenty of similar images in the train and test sets of RAPv1, we cannot obtain the accurate proportion of common-identity images in its test set, because pedestrian identity labels are not provided. Thus, existing datasets are unrealistic for practical applications, in which test-set identities barely overlap with train-set identities.

Figure 2: Performance of the (a) MsVAA [26] and (b) VAC [10] methods on common-identity images, unique-identity images, and all images of the PETA test set. There is a significant performance gap between common-identity images and unique-identity images, as well as between unique-identity images and all test images. This remarkable gap shows the irrationality of existing datasets. A similar phenomenon is observed for the ALM [28] method and on the RAPv2 dataset.

More importantly, the performance of state-of-the-art (SOTA) methods is overestimated on existing datasets, which misleads the progress of pedestrian attribute recognition. To verify this hypothesis, we reimplement the recent SOTA methods MsVAA [26], VAC [10], and ALM [28], and conduct experiments to evaluate their performance on common-identity images and unique-identity images separately on PETA. As illustrated in Fig. 2(a), the performance gaps of MsVAA between common-identity and unique-identity images are 20.51%, 24.70%, 15.12%, 16.87%, and 16.32% in mA, Acc, Precision, Recall, and F1 respectively. These significant gaps validate that the existing datasets are impractical for real scenarios. Compared to the performance on all test images, the performance on unique-identity images drops by 12.7%, 12.14%, 6.95%, 8.4%, and 7.86% in mA, Acc, Precision, Recall, and F1 respectively, which shows that the performance of existing methods is overestimated. Similar performance degradation is also observed for the VAC and ALM methods and on RAPv2.

To address this problem, we propose a realistic setting for pedestrian attribute datasets: pedestrian identities of the test set have no overlap with identities of the train set, i.e., pedestrian identities follow the zero-shot setting. Based on the PETA [7] and RAPv2 [18] datasets, which provide pedestrian identity labels, we construct two realistic datasets, PETA_zs and RAPv2_zs, that conform to the zero-shot setting of pedestrian identities, as illustrated in Fig. 3. The consistent performance drop of different SOTA methods on the proposed datasets highlights their rationality, as detailed in Table 2.

Furthermore, we find experimentally that SOTA methods cannot improve performance over our strong baseline method, for two reasons. First, the performance improvements reported by recent SOTA methods are measured against underutilized baselines. Second, SOTA methods resort to localizing attribute-specific regions to achieve better performance, via attention modules [26, 10] or a Spatial Transformer Network (STN) [28]. However, the strong baseline itself already achieves good localization of attribute areas, so further strengthening the localization capability cannot bring additional performance improvement.

The contributions of this paper are as follows:

  • We identify a crucial problem in existing pedestrian attribute datasets, i.e., a large number of identical pedestrian identities in the train and test sets, which is impractical and misleads model evaluation.

  • To solve this dataset problem, we propose two datasets, PETA_zs and RAPv2_zs, in which pedestrian identities follow the zero-shot setting.

  • Based on our strong baseline, we experimentally find that enhancing the localization of attribute-specific areas, as adopted by SOTA methods, does not bring further performance improvement.

The rest of this paper is organized as follows. Section 2 reviews related work on pedestrian attribute recognition. Section 3 rethinks the existing pedestrian attribute setting and proposes two realistic datasets. Section 4 presents our method, and experiments are reported in Section 5. Finally, Section 6 concludes this work and discusses future directions.

2 Related Work

2.1 Pedestrian Attribute Recognition

Most recent efforts resort to learning discriminative attribute-specific feature representations by enhancing attribute localization. Li et al. [16] first formulated pedestrian attribute recognition as a multi-label classification task and proposed the weighted sigmoid cross-entropy loss. Considering that better attribute localization can reduce the impact of irrelevant areas, Liu et al. [25] proposed HydraPlus-Net with multi-directional attention modules to locate fine-grained attributes. Liu et al. [23] proposed a Localization Guided Network based on Class Activation Maps (CAM) [33] and EdgeBoxes [35] to extract attribute-specific local features. Since the spatial scales of different attributes vary, multi-scale feature maps from different backbone layers were reused [32, 26], i.e., a pyramidal feature hierarchy, instead of the single feature map of the last layer. Guo et al. [10] exploited the assumption that visual attention regions should be consistent between different spatial transforms of the same image and proposed an attention consistency loss to obtain robust attribute localization. Inspired by the Feature Pyramid Network (FPN) [20], Tang et al. [28] constructed an Attribute Localization Module with Squeeze-and-Excitation (SE) blocks [13] and a Spatial Transformer Network (STN) [15] to enhance attribute localization. From the viewpoint of capturing attribute semantic dependencies, some methods focus on modeling attribute relationships. Wang et al. [30] transformed pedestrian attribute recognition into sequence prediction with Long Short-Term Memory (LSTM) [12] to explore attribute context and correlation.

Compared to previous work, our work rethinks the recent progress in pedestrian attribute recognition from the perspective of both datasets and methods. First, the dataset problem we identify leads to overestimated performance and misleads the evaluation of recent methods; we therefore propose two reasonable and realistic datasets. Second, based on a fully utilized baseline network, we find that localizing attribute-specific areas, as adopted by recent SOTA methods [26, 10, 28], does not bring further performance improvement.

3 Proposed Realistic Datasets

In this section, we first answer two questions: what is wrong with existing pedestrian attribute datasets, and why does it matter for academic research and industrial applications? Then we introduce a new realistic setting and propose two reasonable and practical datasets, PETA_zs and RAPv2_zs.

3.1 Problems of existing Datasets

As far as we know, APiS [34] is the first pedestrian attribute recognition dataset, followed by PETA [7], RAPv1 [19], PA100k [25], and RAPv2 [18], which have promoted the development of pedestrian attribute recognition. However, no clear, concrete, and unified protocol has been proposed for pedestrian attribute dataset construction.

For the PETA, RAPv1, and RAPv2 datasets, random splitting is adopted as the default protocol to construct the train and test sets, and this protocol is used by almost all methods [19, 25, 30, 18, 26, 10, 28]. This results in a large number of identical pedestrian identities in the train and test sets. Since images of the same pedestrian identity are often collected by the same surveillance camera within very short frame intervals, their appearance is extremely similar with negligible background and pose variation, as shown in Fig. 1(a) and Fig. 1(b). As a result, there are plenty of extremely similar images in the train and test sets, as shown in Fig. 1(c).

Figure 3: Pedestrian identity distribution on PETA and PETA_zs. The x-axis indicates the pedestrian identity index and the y-axis indicates the proportion of images belonging to that identity. There are 1,106 common identities in the train and test sets of PETA, accounting for 19.42% of the identities (55.27% of the images) in the train set and 26.91% of the identities (57.70% of the images) in the test set. Our proposed PETA_zs solves this problem by completely separating the identities of the test set from those of the train set.

To further illuminate the difference between existing datasets and our proposed datasets, the pedestrian identity distribution of PETA is given in Fig. 3. In the existing PETA dataset, there are 1,106 identical identities in the train and test sets, accounting for 26.91% of the identities and 57.70% of the images of the test set. This means that more than half of the test-set images have similar counterparts in the train set, which amounts to a form of data leakage. In our proposed PETA_zs, there is no overlap between the pedestrian identities of the train and test sets. As for RAPv1 (or RAPv2), all (or some) images lack pedestrian identity annotations, so we cannot obtain the exact identity distribution.

Given this problem of existing datasets, the reasons why it matters for industrial application and academic research are as follows. Whether pedestrian attribute recognition is used as the primary task in video surveillance or as an auxiliary task in person retrieval, test-set identities barely overlap with train-set identities, so the existing dataset setting is inconsistent with real-world applications. More importantly, the performance of SOTA methods is overestimated on existing datasets, and model evaluation is misled. We reimplement the recent methods MsVAA [26], VAC [10], and ALM [28] and report their performance in Fig. 2. Compared to the performance on the whole test set, the consistent performance degradation of all three methods on the unique-identity images of the test set confirms that model performance is overestimated.

3.2 Construction of Proposed Datasets

To solve the problem of existing pedestrian attribute datasets, we propose a realistic setting: no pedestrian identity appears in both the train and test sets, i.e., pedestrian identities follow the zero-shot setting. Concretely, we propose the following criteria for dataset construction, to serve as a reference for future datasets.

Criteria of proposed datasets:
  1. $\mathbb{I}_{all}=\mathbb{I}_{train}\cup\mathbb{I}_{valid}\cup\mathbb{I}_{test}$, with $|\mathbb{I}_{train}|:|\mathbb{I}_{valid}|:|\mathbb{I}_{test}| \approx 3:1:1$.

  2. $\mathbb{I}_{train}\cap\mathbb{I}_{valid}=\varnothing$, $\mathbb{I}_{train}\cap\mathbb{I}_{test}=\varnothing$, $\mathbb{I}_{valid}\cap\mathbb{I}_{test}=\varnothing$.

  3. $\big|\,|\mathbb{I}_{valid}|-|\mathbb{I}_{test}|\,\big| < |\mathbb{I}_{all}| \times T_{id}$.

  4. $|N_{valid}-N_{test}| < T_{img}$.

  5. $|R_{valid}-R_{test}| < T_{attr}$.

where $\mathbb{I}$ denotes a pedestrian identity set, and $N$, $R$, $T$ denote the number of images, the positive sample ratio of attributes, and a pre-defined threshold, respectively. $T_{id}=0.01$, $T_{img}=300$, and $T_{attr}=0.03$ are used in our experiments by default. The subscripts $train$, $valid$, $test$ denote the train set, validation set, and test set, respectively. $|\cdot|$ denotes set cardinality.

To solve the problem of identical identities in the train and test sets, Criteria 1, 2, and 3 are proposed to repartition the train and test sets of the PETA and RAPv2 datasets based on pedestrian identities, ensuring that identities follow the zero-shot setting. To better evaluate model performance and to control the attribute distribution difference between the validation and test sets, Criteria 4 and 5 are proposed. A minimal consistency check over these criteria is sketched below.
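The following is a minimal sketch, not the authors' release code, of how a candidate identity-level split could be checked against Criteria 1-5. The input format (a dict mapping each identity to its per-image binary attribute matrix) and the function name check_split are assumptions made for illustration.

```python
import numpy as np

# Thresholds stated in the text.
T_ID, T_IMG, T_ATTR = 0.01, 300, 0.03

def check_split(id2labels, train_ids, valid_ids, test_ids):
    """id2labels: dict mapping identity -> (num_images, num_attrs) binary array."""
    train_ids, valid_ids, test_ids = map(set, (train_ids, valid_ids, test_ids))
    all_ids = set(id2labels)

    # Criterion 1: the three identity sets cover all identities (the ~3:1:1 ratio
    # is enforced by the splitting procedure itself).
    assert train_ids | valid_ids | test_ids == all_ids

    # Criterion 2: zero-shot setting, no identity overlap between any two sets.
    assert not (train_ids & valid_ids)
    assert not (train_ids & test_ids)
    assert not (valid_ids & test_ids)

    # Criterion 3: valid/test identity counts differ by less than |I_all| * T_id.
    assert abs(len(valid_ids) - len(test_ids)) < len(all_ids) * T_ID

    def stack(ids):
        return np.concatenate([id2labels[i] for i in ids], axis=0)

    valid_labels, test_labels = stack(valid_ids), stack(test_ids)

    # Criterion 4: image counts of validation and test sets are close.
    assert abs(len(valid_labels) - len(test_labels)) < T_IMG

    # Criterion 5: per-attribute positive ratios of validation and test sets are close.
    assert np.all(np.abs(valid_labels.mean(0) - test_labels.mean(0)) < T_ATTR)
    return True
```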

Based on the criteria above, and considering that pedestrian identity labels are only provided in the PETA and RAPv2 datasets, two realistic datasets, PETA_zs and RAPv2_zs, are proposed, where the subscript zs denotes that pedestrian identities of the dataset follow the zero-shot setting. Details of the two proposed datasets are given in Table 1 and the supplementary material.

4 Methods

4.1 The Strong Baseline

Consider a dataset $\mathbb{D}=\{(\bm{X}_i, \bm{y}_i)\}$, $\bm{y}_i \in \{0,1\}^M$, $i = 1, 2, \dots, N$, where $\bm{y}_i$ is the ground-truth vector, $N$ and $M$ denote the numbers of training images and attributes respectively, and $\bm{X}_i$ denotes the $i$-th pedestrian image. Pedestrian attribute recognition is a multi-label task that learns to predict attributes $\hat{\bm{y}}_i \in \{0,1\}^M$ given the pedestrian image $\bm{X}_i$. The zeros and ones in the label vectors $\bm{y}$ and $\hat{\bm{y}}$ denote the absence and presence of the corresponding attributes in the pedestrian image.

Pedestrian attribute models generally adopt multiple binary classifiers with a sigmoid function [10, 28] instead of a multi-class classifier with a softmax function [29, 6], so the loss can be computed as

$$Loss = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M}\omega_{j}\left(y_{ij}\log(\sigma(logits_{ij})) + (1-y_{ij})\log(1-\sigma(logits_{ij}))\right) \qquad (1)$$

where $\sigma(z) = 1/(1+e^{-z})$, $logits_{ij}$ is the output of the classifier layer, and the weight $\omega_j$, adopted from [17], alleviates the distribution imbalance between attributes:

$$\omega_{j} = \begin{cases} e^{1-r_{j}}, & y_{ij}=1 \\ e^{r_{j}}, & y_{ij}=0 \end{cases} \qquad (2)$$

where $r_j$ is the positive sample ratio of the $j$-th attribute in the train set.
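As a concrete illustration, the weighted loss of Eq. (1)-(2) can be written in a few lines of PyTorch. This is a hedged sketch rather than the authors' exact implementation; pos_ratio, the vector of per-attribute train-set positive ratios r_j, is assumed to be precomputed.

```python
import torch
import torch.nn.functional as F

def weighted_bce_loss(logits, targets, pos_ratio):
    """logits, targets: (batch, M) tensors; pos_ratio: (M,) tensor of r_j."""
    # Eq. (2): w_j = e^{1 - r_j} for positive labels, e^{r_j} for negative labels.
    weights = torch.where(targets == 1,
                          torch.exp(1.0 - pos_ratio),
                          torch.exp(pos_ratio))
    # Eq. (1): weighted sigmoid cross entropy, averaged over the batch.
    return F.binary_cross_entropy_with_logits(
        logits, targets.float(), weight=weights, reduction='mean')
```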

Denoting the feature representation of the $i$-th example as $\mathbf{x}_i \in \mathbb{R}^d$, the conditional probability output by the deep neural network is given by a linear classifier followed by a sigmoid function, as in Eq. (3):

$$p_{ij} = Pr(Y = y_{ij} \mid \mathbf{x}_i) = \frac{1}{1 + e^{-\mathbf{w}_j^{T}\mathbf{x}_i}} \qquad (3)$$

where $[\mathbf{w}_1, \mathbf{w}_2, \dots, \mathbf{w}_M] \in \mathbb{R}^{d \times M}$ is the weight matrix of the linear classifier and $p_{ij}$ is the probability of the $j$-th attribute for the $i$-th image. We denote this method as the baseline in the following sections; a minimal model sketch is given below.
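For clarity, a minimal sketch of such a baseline is shown below, assuming a torchvision ResNet50 backbone with global average pooling and a single linear classification layer. Details beyond what the text states, such as how the backbone head is removed, are our assumptions.

```python
import torch
import torch.nn as nn
import torchvision

class Baseline(nn.Module):
    """ResNet50 features -> global average pooling -> linear classifier (Eq. 3)."""
    def __init__(self, num_attrs):
        super().__init__()
        backbone = torchvision.models.resnet50(pretrained=True)
        self.features = nn.Sequential(*list(backbone.children())[:-2])  # drop avgpool & fc
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.classifier = nn.Linear(2048, num_attrs)

    def forward(self, x):                                  # x: (batch, 3, 256, 192)
        feat = self.pool(self.features(x)).flatten(1)      # (batch, 2048)
        return self.classifier(feat)                       # logits: (batch, M)

# Usage: per-attribute probabilities and binary predictions at threshold 0.5.
# model = Baseline(num_attrs=35)
# probs = torch.sigmoid(model(images))
# preds = (probs > 0.5).long()
```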

5 Experiments

5.1 Datasets and Evaluation Metrics

We conduct experiments on four existing pedestrian attribute datasets, PETA [7], RAPv1 [19], RAPv2 [18], and PA100k [25], and on the two proposed realistic datasets PETA_zs and RAPv2_zs. Details of each dataset are given in Table 1.

Table 1: Details of existing and proposed datasets. The zero-shot setting of pedestrian identities is adopted in the proposed PETA_zs and RAPv2_zs datasets. I_train, I_valid, I_test indicate the number of identities in the train, validation, and test sets respectively; N_train, N_val, N_test indicate the corresponding numbers of images. Pedestrian identities are not provided in RAPv1 and PA100k and are only partly provided in RAPv2, so the exact quantities cannot be counted, which is denoted by –. Due to the overlapping identities among the train, validation, and test sets of PETA, the sum of I_train, I_valid, and I_test in PETA is not equal to that in PETA_zs. Attribute denotes the number of attributes used for evaluation.

Dataset  | Setting   | I_train | I_valid | I_test | Attribute | Images  | N_train | N_val  | N_test
PETA     | existing  | 4,886   | 1,264   | 4,110  | 35        | 19,000  | 9,500   | 1,900  | 7,600
PETA_zs  | zero-shot | 5,211   | 1,703   | 1,785  | 35        | 19,000  | 11,051  | 3,980  | 3,969
RAPv2    | existing  | –       | –       | –      | 54        | 84,928  | 50,957  | 16,986 | 16,985
RAPv2_zs | zero-shot | 1,508   | 546     | 535    | 54        | 26,632  | 14,729  | 5,961  | 5,948
RAPv1    | existing  | –       | –       | –      | 51        | 41,585  | 33,268  | –      | 8,317
PA100k   | existing  | –       | –       | –      | 26        | 100,000 | 80,000  | 10,000 | 10,000

Two types of metrics, i.e., one label-based metric and four instance-based metrics, are adopted to evaluate attribute recognition performance [18]. For the label-based metric, we compute the mean of the classification accuracies on positive and negative samples for each attribute and then average over all attributes to obtain the mean Accuracy (mA). For the instance-based metrics, Accuracy, Precision, Recall, and F1-score are used. A minimal sketch of both metric types is given below.
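For reference, both metric types can be computed as in the following sketch, which follows the standard definitions in [18] as we understand them; gt and pred are assumed to be binary numpy arrays of shape (num_images, num_attrs).

```python
import numpy as np

def mean_accuracy(gt, pred):
    """Label-based mA: mean of positive and negative accuracy, averaged over attributes."""
    pos_acc = (gt * pred).sum(0) / np.maximum(gt.sum(0), 1)
    neg_acc = ((1 - gt) * (1 - pred)).sum(0) / np.maximum((1 - gt).sum(0), 1)
    return ((pos_acc + neg_acc) / 2).mean()

def instance_metrics(gt, pred, eps=1e-12):
    """Instance-based Accuracy, Precision, Recall and F1."""
    inter = (gt * pred).sum(1)                       # per-image intersection
    union = ((gt + pred) > 0).sum(1)                 # per-image union
    acc = (inter / np.maximum(union, 1)).mean()
    prec = (inter / np.maximum(pred.sum(1), 1)).mean()
    rec = (inter / np.maximum(gt.sum(1), 1)).mean()
    f1 = 2 * prec * rec / (prec + rec + eps)
    return acc, prec, rec, f1
```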

5.2 Implementation Details

The proposed method is implemented with PyTorch and trained end-to-end. ResNet50 [11] is adopted as the backbone to extract pedestrian image features. Pedestrian images are resized to 256×192 with random horizontal mirroring as inputs. SGD is employed for training, with momentum 0.9 and weight decay 0.0005. The initial learning rate is 0.01 and the batch size is 64. A plateau learning rate scheduler is used with reduction factor 0.1, and the total number of training epochs is 30. A rough sketch of this configuration is given below.
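The following sketch reflects this training configuration; model, train_loader, pos_ratio, and the validate helper are assumed to exist, and the quantity monitored by the plateau scheduler is our assumption.

```python
import torch

# Hyper-parameters taken from the text; everything else is illustrative.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.9, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.1)

for epoch in range(30):
    model.train()
    for images, targets in train_loader:            # images resized to 256x192
        logits = model(images)
        loss = weighted_bce_loss(logits, targets, pos_ratio)  # loss sketch above
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step(validate(model))                 # e.g. validation loss (assumed helper)
```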

5.3 Comparison with State-of-the-art Methods

Results on proposed datasets. To fully validate the rationality of the proposed datasets, we reimplement the recent methods MsVAA [26], VAC [10], and ALM [28] as MsVAA*, VAC*, and ALM* respectively, based on the ResNet50 backbone [11]. Quantitative experiments are conducted on the PETA_zs and RAPv2_zs datasets, and results are reported in Table 2. It is worth noting that, compared to the existing datasets, a remarkable performance drop of all methods exists on the proposed datasets, even though PETA_zs has more training images than PETA, as shown in Table 1. This result further validates our insight that performance on existing datasets is overestimated and that existing datasets mislead model evaluation. We find experimentally that there is a trade-off between Precision and Recall, so mA and the F1 score are more reliable and convincing. The proposed method achieves state-of-the-art performance with significantly fewer parameters and less computation.

Table 2: Performance comparison of four methods on the PETA and RAPv2 datasets. We use zero-shot to denote the setting of our proposed PETA_zs or RAPv2_zs dataset. Five metrics, mA, Accuracy, Precision, Recall, and F1, are evaluated. Parameters (Params) and multiply-accumulate operations (MACs) of the various methods are also reported.

Method      | Setting   | PETA: mA / Accu / Prec / Recall / F1  | RAPv2: mA / Accu / Prec / Recall / F1 | Params(M) | MACs(G)
MsVAA [26]* | existing  | 84.35 / 78.69 / 87.27 / 85.51 / 86.09 | 77.87 / 67.19 / 79.03 / 79.79 / 79.04 | 141.27    | 6.28
MsVAA [26]* | zero-shot | 71.03 / 59.38 / 74.75 / 70.10 / 72.37 | 71.32 / 63.59 / 77.22 / 76.62 / 76.44 |           |
VAC [10]*   | existing  | 83.63 / 78.94 / 87.63 / 85.45 / 86.23 | 76.74 / 67.52 / 80.42 / 78.78 / 79.24 | 23.61     | 14.335
VAC [10]*   | zero-shot | 71.05 / 58.90 / 74.98 / 70.48 / 72.13 | 70.20 / 65.45 / 79.87 / 76.65 / 77.07 |           |
ALM [28]*   | existing  | 84.24 / 77.84 / 85.79 / 85.60 / 85.41 | 78.21 / 66.98 / 78.25 / 80.43 / 78.93 | 30.86     | 4.32
ALM [28]*   | zero-shot | 70.67 / 58.56 / 72.97 / 71.31 / 71.65 | 71.97 / 64.52 / 77.28 / 77.74 / 77.06 |           |
Baseline    | existing  | 85.11 / 79.14 / 86.99 / 86.33 / 86.39 | 77.34 / 66.12 / 81.99 / 75.62 / 78.21 | 23.61     | 4.05
Baseline    | zero-shot | 71.84 / 58.77 / 77.06 / 68.24 / 71.72 | 70.83 / 63.63 / 82.28 / 72.22 / 76.34 |           |

  • * Results are reimplemented with the same settings as our baseline for a fair comparison.

Table 3: Performance comparison with state-of-the-art methods on the PETA, PA100k, and RAPv1 datasets. Five metrics, mA, Accuracy, Precision, Recall, and F1, are evaluated. – denotes results not reported.

Method                | Backbone     | PETA: mA / Accu / Prec / Recall / F1  | PA100k: mA / Accu / Prec / Recall / F1 | RAPv1: mA / Accu / Prec / Recall / F1
DeepMAR [16] (ACPR15) | CaffeNet     | 82.89 / 75.07 / 83.68 / 83.14 / 83.41 | 72.70 / 70.39 / 82.24 / 80.42 / 81.32  | 73.79 / 62.02 / 74.92 / 76.21 / 75.56
HPNet [25] (ICCV17)   | InceptionNet | 81.77 / 76.13 / 84.92 / 83.24 / 84.07 | 74.21 / 72.19 / 82.97 / 82.09 / 82.53  | 76.12 / 65.39 / 77.33 / 78.79 / 78.05
JRL [30] (ICCV17)     | AlexNet      | 82.13 / – / 82.55 / 82.12 / 82.02     | – / – / – / – / –                      | 74.74 / – / 75.08 / 74.96 / 74.62
LGNet [23] (BMVC18)   | Inception-V2 | – / – / – / – / –                     | 76.96 / 75.55 / 86.99 / 83.17 / 85.04  | 78.68 / 68.00 / 80.36 / 79.82 / 80.09
PGDM [17] (ICME18)    | CaffeNet     | 82.97 / 78.08 / 86.86 / 84.68 / 85.76 | 74.95 / 73.08 / 84.36 / 82.24 / 83.29  | 74.31 / 64.57 / 78.86 / 75.90 / 77.35
MsVAA [26] (ECCV18)   | ResNet101    | 84.59 / 78.56 / 86.79 / 86.12 / 86.46 | – / – / – / – / –                      | – / – / – / – / –
VAC [10] (CVPR19)     | ResNet50     | – / – / – / – / –                     | 79.16 / 79.44 / 88.97 / 86.26 / 87.59  | – / – / – / – / –
ALM [28] (ICCV19)     | BN-Inception | 86.30 / 79.52 / 85.65 / 88.09 / 86.85 | 80.68 / 77.08 / 84.21 / 88.84 / 86.46  | 81.87 / 68.17 / 74.71 / 86.48 / 80.16
FocalLoss             | ResNet50     | 83.00 / 76.18 / 84.85 / 84.44 / 84.31 | 78.49 / 77.87 / 86.96 / 85.00 / 85.58  | 77.32 / 65.91 / 80.74 / 75.13 / 76.47
Baseline              | ResNet50     | 85.11 / 79.14 / 86.99 / 86.33 / 86.39 | 79.38 / 78.56 / 89.41 / 84.78 / 86.55  | 78.48 / 67.17 / 82.84 / 76.25 / 78.94

Results on existing datasets. Experiments are also conducted on the existing PETA, RAPv1, and PA100k datasets to compare with recent methods, and results are reported in Table 3. We compare with state-of-the-art methods, including MsVAA [26], VAC [10], and ALM [28]. From the experiments we make the following observations. 1) Considering the overlapping identities in the train and test sets of the existing PETA and RAPv1 datasets, the performance on PA100k is more convincing. 2) Our proposed method with a ResNet50 backbone achieves better performance with only 16.71% of the parameters and 64.49% of the computation of the MsVAA method, which adopts ResNet101 as the backbone. 3) Compared to VAC [10], which uses extra training augmentation and a two-branch network, our baseline method achieves comparable performance with only 28.25% of the computation on PA100k. Replacing the linear classifier with a cosine-distance based classifier, our method obtains 86.02%, 80.71%, and 80.07% mA on PETA, RAPv1, and PA100k, outperforming the baseline by 0.91%, 2.23%, and 0.69% respectively. The consistent improvements over the baseline demonstrate the effectiveness of our strategy, i.e., normalizing the classifier weights of attributes to make them independent of the number of positive samples per attribute; a minimal sketch of such a cosine classifier is given below.
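A minimal sketch of such a cosine-distance based classifier follows. The text only specifies normalizing the attribute classifier weights; also normalizing the feature and applying a scale factor are common practice for cosine classifiers [29, 6] and are our assumptions here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CosineClassifier(nn.Module):
    """Logit = scale * cosine similarity between the feature and each attribute weight."""
    def __init__(self, feat_dim, num_attrs, scale=30.0):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_attrs, feat_dim))
        nn.init.xavier_uniform_(self.weight)
        self.scale = scale                                   # illustrative value

    def forward(self, feat):                                 # feat: (batch, feat_dim)
        cos = F.linear(F.normalize(feat, dim=1),
                       F.normalize(self.weight, dim=1))      # (batch, num_attrs)
        return self.scale * cos
```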

Table 4: Performance comparison of reimplemented methods and their reported baselines on PETA, PA100k, and RAPv1. Five metrics, mA, Accuracy, Precision, Recall, and F1, are evaluated. Parameters (Params) and multiply-accumulate operations (MACs) of the various methods are also reported. – denotes results not reported.

Method               | Backbone     | PETA: mA / Accu / Prec / Recall / F1  | PA100k: mA / Accu / Prec / Recall / F1 | RAPv1: mA / Accu / Prec / Recall / F1 | Params(M) | MACs(G)
Baseline (MsVAA [26])| ResNet101    | 82.67 / 76.63 / 85.13 / 84.46 / 84.79 | – / – / – / – / –                      | – / – / – / – / –                     |           |
Baseline (VAC [10])  | ResNet50     | – / – / – / – / –                     | 78.12 / 75.23 / 88.47 / 83.41 / 85.86  | – / – / – / – / –                     |           |
Baseline (ALM [28])  | BN-Inception | 82.66 / 77.73 / 86.68 / 84.20 / 85.57 | 77.47 / 75.05 / 86.61 / 85.34 / 85.97  | 75.76 / 65.57 / 78.92 / 77.49 / 78.20 |           |
Baseline (ours)      | ResNet50     | 85.11 / 79.14 / 86.99 / 86.33 / 86.09 | 79.38 / 78.56 / 89.41 / 84.78 / 86.25  | 78.48 / 67.17 / 82.84 / 76.25 / 78.94 |           |
MsVAA [26]* (ECCV18) | ResNet50     | 84.35 / 78.69 / 87.27 / 85.51 / 86.09 | 80.10 / 76.98 / 86.26 / 85.62 / 85.50  | 79.75 / 65.74 / 77.69 / 78.99 / 77.93 | 141.27    | 4.93
VAC [10]* (CVPR19)   | ResNet50     | 83.63 / 78.94 / 87.63 / 85.45 / 86.23 | 79.04 / 78.25 / 88.01 / 86.07 / 86.83  | 78.47 / 68.55 / 81.05 / 79.79 / 80.02 | 23.61     | 14.335
ALM [28]* (ICCV19)   | ResNet50     | 84.24 / 77.84 / 85.79 / 85.60 / 85.41 | 77.47 / 75.05 / 86.61 / 85.34 / 85.97  | 75.76 / 65.57 / 78.92 / 77.49 / 78.20 | 30.86     | 4.32

  • * Results are reimplemented with the same settings as our baseline for a fair comparison.

5.4 Ablation Study of Baseline

To make a fair comparison with previous SOTA methods, we reimplement the MsVAA, VAC, and ALM methods and report their performance on PETA, PA100k, and RAPv1, as well as their corresponding baseline performance, in Table 4. It is worth noting that our baseline achieves much better performance than the baselines of previous methods, even when they adopt the more powerful ResNet101 backbone [11]. Moreover, compared to previous SOTA methods reimplemented with the same backbone, our baseline achieves comparable or even better performance. We argue that the effectiveness of a method cannot be fully verified when it is compared against a weak baseline.

The reason why our baseline achieves comparable or even better performance than previous methods is that a strong baseline itself can implicitly learn the locations of attribute-specific areas. We utilize Grad-CAM [27] to locate the discriminative visual cues of our baseline model (a minimal Grad-CAM sketch is given after Fig. 4). As shown in Fig. 4, even without explicitly modeling the localization of attribute-specific areas, our baseline can localize them and learn discriminative representations. The important thing is not to locate the area of a specific attribute, but to distinguish fine-grained attributes within the same area, such as distinguishing sandals from sneakers. Compared to the originally reported performance of the SOTA methods, there is little difference in the performance of the reimplemented methods except for ALM. The reason is that the attention area of ALM is a hard bounding box, which is coarse-grained and introduces environmental noise.

Figure 4: Attribute-specific attention areas of our baseline method on the PETA, RAP, and PA100k datasets (from top to bottom).
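For reference, a minimal Grad-CAM [27] sketch over the Baseline module sketched in Section 4.1 is given below; it is an illustrative reimplementation rather than the authors' visualization code.

```python
import torch
import torch.nn.functional as F

def grad_cam(model, image, attr_idx):
    """Return an (H, W) attention map for one attribute of a single image."""
    feats, grads = [], []
    h_fwd = model.features.register_forward_hook(lambda m, i, o: feats.append(o))
    h_bwd = model.features.register_full_backward_hook(lambda m, gi, go: grads.append(go[0]))

    logits = model(image.unsqueeze(0))              # (1, M)
    model.zero_grad()
    logits[0, attr_idx].backward()                  # back-propagate one attribute logit

    h_fwd.remove()
    h_bwd.remove()
    fmap, grad = feats[0], grads[0]                 # (1, C, H, W)
    weights = grad.mean(dim=(2, 3), keepdim=True)   # channel importance from pooled gradients
    cam = F.relu((weights * fmap).sum(dim=1))       # (1, H, W)
    return (cam / (cam.max() + 1e-12)).squeeze(0).detach()
```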

6 Conclusion

In this paper, we propose two realistic datasets, PETA_zs and RAPv2_zs, to address the unreasonable and impractical setting of existing datasets, which misleads model evaluation. Meanwhile, we find that SOTA methods cannot achieve further performance improvement over our strong baseline.

References

  • [1] Cao, K., Wei, C., Gaidon, A., Arechiga, N., Ma, T.: Learning imbalanced datasets with label-distribution-aware margin loss. In: NIPS (2019)
  • [2] Zhou, B., Cui, Q., Wei, X.S., Chen, Z.M.: BBN: Bilateral-branch network with cumulative learning for long-tailed visual recognition. arXiv:1912.02413 (2019)
  • [3] Buda, M., Maki, A., Mazurowski, M.A.: A systematic study of the class imbalance problem in convolutional neural networks. Neural Networks 106, 249–259 (2018)
  • [4] Byrd, J., Lipton, Z.C.: What is the effect of importance weighting in deep learning? arXiv preprint arXiv:1812.03372 (2018)
  • [5] Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 9268–9277 (2019)
  • [6] Deng, J., Guo, J., Xue, N., Zafeiriou, S.: Arcface: Additive angular margin loss for deep face recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 4690–4699 (2019)
  • [7] Deng, Y., Luo, P., Loy, C.C., Tang, X.: Pedestrian attribute recognition at far distance. In: Proceedings of the 22nd ACM international conference on Multimedia. pp. 789–792. ACM (2014)
  • [8] Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: A deep convolutional activation feature for generic visual recognition. In: International conference on machine learning. pp. 647–655 (2014)
  • [9] Feris, R., Bobbitt, R., Brown, L., Pankanti, S.: Attribute-based people search: Lessons learnt from a practical surveillance system. In: Proceedings of International Conference on Multimedia Retrieval. p. 153. ACM (2014)
  • [10] Guo, H., Zheng, K., Fan, X., Yu, H., Wang, S.: Visual attention consistency under image transforms for multi-label image classification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 729–739 (2019)
  • [11] He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 770–778 (2016)
  • [12] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
  • [13] Hu, J., Shen, L., Sun, G.: Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 7132–7141 (2018)
  • [14] Huang, C., Li, Y., Change Loy, C., Tang, X.: Learning deep representation for imbalanced classification. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 5375–5384 (2016)
  • [15] Jaderberg, M., Simonyan, K., Zisserman, A., et al.: Spatial transformer networks. In: Advances in neural information processing systems. pp. 2017–2025 (2015)
  • [16] Li, D., Chen, X., Huang, K.: Multi-attribute learning for pedestrian attribute recognition in surveillance scenarios. In: ACPR. pp. 111–115 (2015)
  • [17] Li, D., Chen, X., Zhang, Z., Huang, K.: Pose guided deep model for pedestrian attribute recognition in surveillance scenarios. In: 2018 IEEE International Conference on Multimedia and Expo (ICME). pp. 1–6. IEEE (2018)
  • [18] Li, D., Zhang, Z., Chen, X., Huang, K.: A richly annotated pedestrian dataset for person retrieval in real surveillance scenarios. IEEE transactions on image processing 28(4), 1575–1590 (2018)
  • [19] Li, D., Zhang, Z., Chen, X., Ling, H., Huang, K.: A richly annotated dataset for pedestrian attribute recognition. arXiv preprint arXiv:1603.07054 (2016)
  • [20] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B., Belongie, S.: Feature pyramid networks for object detection. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2117–2125 (2017)
  • [21] Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE international conference on computer vision. pp. 2980–2988 (2017)
  • [22] Lin, Y., Zheng, L., Zheng, Z., Wu, Y., Hu, Z., Yan, C., Yang, Y.: Improving person re-identification by attribute and identity learning. Pattern Recognition (2019)
  • [23] Liu, P., Liu, X., Yan, J., Shao, J.: Localization guided learning for pedestrian attribute recognition. In: BMVC (2018)
  • [24] Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: Deep hypersphere embedding for face recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 212–220 (2017)
  • [25] Liu, X., Zhao, H., Tian, M., Sheng, L., Shao, J., Yi, S., Yan, J., Wang, X.: Hydraplus-net: Attentive deep features for pedestrian analysis. In: Proceedings of the IEEE international conference on computer vision. pp. 350–359 (2017)
  • [26] Sarafianos, N., Xu, X., Kakadiaris, I.A.: Deep imbalanced attribute classification using visual attention aggregation. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 680–697 (2018)
  • [27] Selvaraju, R.R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D.: Grad-cam: Visual explanations from deep networks via gradient-based localization. In: Proceedings of the IEEE international conference on computer vision. pp. 618–626 (2017)
  • [28] Tang, C., Sheng, L., Zhang, Z., Hu, X.: Improving pedestrian attribute recognition with weakly-supervised multi-scale attribute-specific localization. In: Proceedings of the IEEE international conference on computer vision (2019)
  • [29] Wang, F., Xiang, X., Cheng, J., Yuille, A.L.: Normface: l 2 hypersphere embedding for face verification. In: Proceedings of the 25th ACM international conference on Multimedia. pp. 1041–1049. ACM (2017)
  • [30] Wang, J., Zhu, X., Gong, S., Li, W.: Attribute recognition by joint recurrent learning of context and correlation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 531–540 (2017)
  • [31] Yang, W., Huang, H., Zhang, Z., Chen, X., Huang, K., Zhang, S.: Towards rich feature discovery with class activation maps augmentation for person re-identification. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. pp. 1389–1398 (2019)
  • [32] Yu, K., Leng, B., Zhang, Z., Li, D., Huang, K.: Weakly-supervised learning of mid-level features for pedestrian attribute recognition and localization. In: BMVC (2017)
  • [33] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2921–2929 (2016)
  • [34] Zhu, J., Liao, S., Lei, Z., Yi, D., Li, S.: Pedestrian attribute classification in surveillance: Database and evaluation. In: Proceedings of the IEEE international conference on computer vision workshops. pp. 331–338 (2013)
  • [35] Zitnick, C.L., Dollár, P.: Edge boxes: Locating object proposals from edges. In: European conference on computer vision. pp. 391–405. Springer (2014)
  • [36] Zou, Y., Yu, Z., Vijaya Kumar, B., Wang, J.: Unsupervised domain adaptation for semantic segmentation via class-balanced self-training. In: Proceedings of the European Conference on Computer Vision (ECCV). pp. 289–305 (2018)