
PRSNet: A Masked Self-Supervised Learning Pedestrian Re-Identification Method

1st Zhijie Xiao
School of Information Science and Technology
Tibet University
Lhasa, China
[email protected]

2nd Zhicheng Dong
School of Information Science and Technology
Tibet University
Lhasa, China
[email protected]

3rd Hao Xiang
Artificial Intelligence Academy
Nanjing University of Aeronautics and Astronautics
Nanjing, China
[email protected]
Abstract

In recent years, self-supervised learning has attracted widespread academic attention and addressed many key issues in computer vision. Current research focuses on how to construct a good pretext task that lets the network learn high-level semantic information from images during pre-training, so that the pre-trained model transfers well to the task at hand. Existing feature extraction networks are pre-trained on the ImageNet dataset and cannot extract the fine-grained information in pedestrian images well, and the pretext tasks of existing contrastive self-supervised learning may destroy the original attributes of pedestrian images. To address these problems, this paper designs a mask-reconstruction pretext task to obtain a robust pre-trained model and applies it to the pedestrian re-identification task. The network is optimized by improving the centroid-based triplet loss, and the masked images are added as additional samples to the loss calculation, so that after training the network can better handle pedestrian matching in practical applications. This method achieves about 5% higher mAP on Market1501 and CUHK03 than existing self-supervised learning pedestrian re-identification methods and about 1% higher Rank-1, and ablation experiments are conducted to demonstrate its feasibility. Our code is available at https://github.com/ZJieX/prsnet.

Index Terms:
self-supervised, pedestrian re-identification, mask image, centroid

I Introduction

The term "pedestrian re-identification" was first introduced in 2005 by Wojciech Zajdel et al. from the University of Amsterdam [1] in a study of multi-camera visual tracking. Before the rise of convolutional feature extraction networks, researchers often used features such as pedestrian image color, texture, and shape [2]. With the continuous development of deep learning, more and more strong pedestrian re-identification networks have been made public, from metric learning and representation learning in the early stages to today's unsupervised pedestrian re-identification algorithms.

Deep metric learning studies loss functions that enhance the robustness of the network. L. Zheng et al. [3] treated pedestrian re-identification as an image multi-classification task based on ID loss, and ID-loss-based improvements are often used on specific pedestrian re-identification tasks. Many researchers have also proposed verification losses [4], [5], [6], [7] to optimize the relationship between positive and negative sample pairs.


Figure 1: The structure of our masked self-supervised learning network framework. We use a convolution-deconvolution network structure to form an autoencoder for self-supervised learning. The dashed box in the figure is used to perform the pedestrian re-identification task after the self-supervised pre-training is completed.

Verification loss is usually combined with ID loss [8], [4] to improve the performance of the network. In 2015, Florian Schroff et al. [9] proposed the triplet loss on a face recognition task; since the original triplet loss has limited discriminability, Hermans et al. [10] proposed the hard-sample triplet loss in 2017, i.e., in each training batch the least similar positive samples and the most similar negative samples are used for triplet loss training, which makes the network model highly discriminative. Chen et al. [11] proposed the quadruplet loss based on the triplet loss, which makes intra-class differences smaller and inter-class differences larger, yielding a more robust network after training. These are the most popular deep metric learning approaches for pedestrian re-identification. In representation learning, to capture fine-grained cues in global feature learning, a joint learning framework consisting of single-image representation (SIR) and cross-image representation (CIR) [12] was studied early on, using specific sub-networks for triplet loss training. The widely used identity embedding (IDE) model constructs training as a multi-classification problem by treating each identity as a distinct class and is now widely used in pedestrian re-identification algorithms [13], [14], [15], [16], [17]. In 2016, Yutian Lin et al. [18] from the University of Technology Sydney proposed using pedestrian attribute information to help the network recognize fine-grained pedestrian features; these are classical global representation learning methods. For local representation learning, there are pedestrian re-identification methods that slice the image according to pedestrian body parts, such as PCB [15], AlignedReID [19], and SCP [20]; methods based on human pose estimation, such as PIE [21], PDC [22], and GLAD [23]; and methods that divide the pedestrian image into many equidistant grids for local feature extraction, such as IDLA [24], PersonNet [25], and classical networks such as DSR [26]. Local features capture spatial information that global features do not account for well, and fusing all local features yields more detailed pedestrian representations.

Although the above methods achieve good experimental results in research settings, they do not perform as well once deployed: their feature extraction networks are pre-trained on the ImageNet dataset, which limits them in many practical applications, and pedestrian re-identification datasets cover few scenes and contain relatively few images. As a result, pedestrian re-identification research has moved toward unsupervised methods in the last one or two years. In 2020, Yixiao Ge et al. [27] proposed SPCL, a cluster-based pseudo-label unsupervised learning algorithm that provided a strong baseline for unsupervised pedestrian re-identification. Building on it, Zuozhuo Dai et al. [28] proposed Cluster Contrast ReID, which improves the pseudo-labels and the training loss function and at one point surpassed the supervised pedestrian re-identification algorithms of the time. Subsequently, a large number of new pedestrian re-identification algorithms based on contrastive learning within the unsupervised setting emerged, making pedestrian re-identification widely applicable in practice and achieving milestone results.

Based on the existing studies and problems discussed above, the main contributions of this paper are as follows:

  • Contrastive learning applied to pedestrian re-identification risks confusing the discriminative attributes of person images. In this field, whether pedestrian images come from a dataset or are cropped from other images or video frames, resolution is low and the attribute information itself is fuzzy. Some pretext tasks used in contrastive learning, such as image splitting and stitching, color transformation, and image stretching and recovery, may therefore destroy the target attributes in pedestrian images, so that the resulting pre-trained model interferes with the subsequent pedestrian re-identification task and cannot perform it well. This paper therefore designs image-restoration-based generative learning so that the network can dynamically extract features from the important regions of pedestrian images.

  • Traditional feature extraction networks pre-trained on ImageNet cannot fully exploit the fine-grained features of pedestrian images. Following self-supervised learning, we design a mask-reconstruction pretext task in place of pre-training on ImageNet classification, which makes the network better suited to the corresponding downstream task and more flexible in dealing with the limitations imposed by data and hardware.

  • In addition, to enable the network to focus precisely on the key regions of pedestrian images, this paper puts the masked images from self-supervised learning to use, and we improve the centroid-based triplet loss to optimize network training.

II Related Work

In recent years, self-supervised pedestrian re-identification algorithms have been subdivided into contrastive learning and generative learning. Generative learning is represented by methods such as GAN [29] and VAE [30], which generate data from data so that the result resembles the training data in overall or high-level semantics. In 2022, Kaiming He et al. proposed MAE [31], which enables the network to cope easily with various downstream computer vision tasks by reconstructing masked regions of images during pre-training. Zhongdao Wang et al. [32] proposed that discriminative pedestrian re-identification features can be learned through the cycle consistency of data association, an early attempt at self-supervised learning in the field of pedestrian re-identification. Hao Chen et al. [33] proposed incorporating contrastive learning and generative adversarial networks into a joint training framework to facilitate contrastive learning between the original and generated pedestrian images. Zizheng Yang et al. [34] designed an unsupervised pre-training framework based on contrastive learning to fully exploit the fine-grained local features in pedestrian images and enhance global consistency between pedestrian images. Ke Han et al. [35] investigated the generalization problem of pedestrian re-identification and proposed the BNTA framework to address the severe bias of BN toward the training domain, using a self-supervised learning strategy to adaptively update the BN parameters. Likewise, a good pedestrian re-identification algorithm requires a reasonable loss function for training. Mikolaj Wieczorek et al. [36] proposed using class centroids in both the training and inference phases to alleviate the problems of computational cost and hard-sample mining, where the triplet loss is modified so that each class is represented by a single embedding vector, accelerating retrieval.

In this paper, we propose a generative self-supervised learning method to complete pre-training for the pedestrian re-identification task, and we add the masked pedestrian image features to the centroid-based triplet loss so that the network can be optimized more fully during training.

III Proposed Method

Generative learning can effectively reconstruct the features and information of the data itself, so it does not destroy the feature attributes of pedestrian images and avoids the risk of confusing the discriminative attributes of the task images; at the same time, traditional feature extraction networks pre-trained on the ImageNet dataset cannot fully exploit the fine-grained information of pedestrian images. Based on the above, this paper proposes generative mask-based self-supervised learning and a triplet loss based on the masked centroid to solve these two problems, so that pedestrian re-identification can be better applied in real life. The structure of the proposed method is shown in Fig. 1. Only the encoder is kept for the pedestrian re-identification task, and the decoder is discarded. The self-supervised pre-training in this paper takes its inspiration from MAE but differs from it: MAE uses ViT structures as encoder and decoder, while our method uses a symmetric convolution-deconvolution structure for the autoencoder. The method is described in detail below.

III-A Data Mask Preprocessing

Facing the pedestrian re-identification task, this autoencoder requires an input image size of 128 × 256 × 3 (W × H × 3) for pre-training. Since pedestrian shapes are always irregular, the region blocks used for masking are also irregular, and the pixels of a masked region block are replaced by the value 0. We define the width of each masked region as $p_w$ and the height as $p_h$, with $0<p_w<W$, $0<p_h<H$, and $p_w/p_h\geq 2$. In this section we randomly mask 75% of the image region; see the reconstruction part of Fig. 1, where the black areas are the masked parts. Based on the above, we apply a random mask to the pedestrian image, and after the autoencoder is trained the masked part is recovered as much as possible.
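As a concrete illustration, the sketch below generates such a random region mask in PyTorch. The paper only fixes the constraints $0<p_w<W$, $0<p_h<H$, $p_w/p_h\geq 2$ and the 75% mask ratio; the sampling ranges for $p_h$ and $p_w$ and the greedy stopping rule are our own assumptions.

```python
import torch

def random_region_mask(img, mask_ratio=0.75, max_tries=1000):
    """Randomly zero out rectangular regions of a (3, H, W) pedestrian image
    until roughly `mask_ratio` of the pixels are masked, following Sec. III-A:
    each region has width p_w and height p_h with 0 < p_w < W, 0 < p_h < H and
    p_w / p_h >= 2, and masked pixels are replaced by 0."""
    _, H, W = img.shape                                   # e.g. (3, 256, 128)
    mask = torch.zeros(H, W)                              # 1 = masked pixel
    target = mask_ratio * H * W
    for _ in range(max_tries):
        if mask.sum() >= target:
            break
        p_h = int(torch.randint(1, H // 8 + 1, (1,)))                         # assumed height range
        p_w = int(torch.randint(2 * p_h, min(W - 1, 4 * p_h) + 1, (1,)))      # enforces p_w / p_h >= 2
        y = int(torch.randint(0, H - p_h + 1, (1,)))
        x = int(torch.randint(0, W - p_w + 1, (1,)))
        mask[y:y + p_h, x:x + p_w] = 1.0
    return img * (1.0 - mask), mask                       # masked image, mask map
```

During pre-training, the masked image is fed to the autoencoder while the mask map is kept so that the reconstruction loss of Eq. (1) can be evaluated only on the masked region.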

III-B Autoencoder: Encoder

After the data passes through the random mask, it first enters the encoder. The encoder structure draws on ConvNeXt [37], as shown in Fig. 2.


Figure 2: Encoder structure.

$N_1$, $N_2$, $N_3$ and $N_4$ denote the numbers of blocks in the different stages of the network, i.e., the network has $N_1+N_2+N_3+N_4$ layers in total, and the larger these numbers are, the larger the number of parameters. To keep the parameter count as small as possible while preserving performance, we use $N_1=3$, $N_2=3$, $N_3=9$ and $N_4=3$. The green structure in Fig. 2 performs three downsampling operations, while the blue part only performs convolution and does not change the size of the input feature map. The masked input data has size B × 3 × 256 × 128, and the feature obtained after the encoder has size B × 2048. The structure of each blue convolutional block is shown in Fig. 3 (2). To help the network learn deep image features better, a multi-branch residual structure follows each Conv Block, as detailed in Fig. 3 (1). A sketch of this stage layout is given below.
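To make the stage layout concrete, the following PyTorch sketch assembles an encoder with the depths (3, 3, 9, 3) described above. The stem, channel widths, and the internals of the Conv Block are assumptions made only so that a B × 3 × 256 × 128 input yields a B × 2048 feature; they are not the exact blocks of Fig. 2.

```python
import torch.nn as nn

class ConvBlock(nn.Module):
    """Stand-in for the blue Conv Block in Fig. 2: keeps the spatial size.
    The residual branch used during training is shown separately below."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size=3, padding=1),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )

    def forward(self, x):
        return self.body(x)

class Encoder(nn.Module):
    """Sketch of the encoder in Fig. 2 with stage depths N1..N4 = 3, 3, 9, 3."""
    def __init__(self, depths=(3, 3, 9, 3), dims=(256, 512, 1024, 2048)):
        super().__init__()
        layers = [nn.Conv2d(3, dims[0], kernel_size=4, stride=4)]   # assumed stem: 256x128 -> 64x32
        for i, (n, d) in enumerate(zip(depths, dims)):
            layers += [ConvBlock(d) for _ in range(n)]
            if i < len(dims) - 1:                                   # the three green downsampling steps
                layers.append(nn.Conv2d(d, dims[i + 1], kernel_size=2, stride=2))
        self.features = nn.Sequential(*layers)
        self.pool = nn.AdaptiveAvgPool2d(1)

    def forward(self, x):                                           # x: (B, 3, 256, 128)
        f = self.features(x)                                        # (B, 2048, 8, 4)
        return self.pool(f).flatten(1)                              # (B, 2048)
```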



Figure 3: Training and verification strategy of Conv Block in network structure.

We follow the structure of RepVGG [38] and perform the residual operation only during training, discarding the residual block in the test/validation phase. Using the multi-branch residual block during training enables the network to learn image features well, and its parameters are already contained in the network; with structural re-parameterization, removing the residual structure in the validation phase does not degrade the network's performance, while the network occupies less GPU memory and inference is faster.
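The sketch below illustrates the structural re-parameterization idea in its simplest form, assuming a plain convolution without BatchNorm (the actual RepVGG merge also folds BN statistics into the kernel): during training the block adds an identity branch, and before deployment the identity is folded into the convolution weights so that only one branch remains.

```python
import torch
import torch.nn as nn

class RepConvBlock(nn.Module):
    """Minimal re-parameterizable block: y = conv(x) + x during training,
    y = conv'(x) after folding the identity into the kernel."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1)
        self.deployed = False

    def forward(self, x):
        if self.deployed:
            return self.conv(x)              # single-branch inference path
        return self.conv(x) + x              # multi-branch training path

    @torch.no_grad()
    def reparameterize(self):
        """Fold the identity branch into the kernel: Wx + x = (W + I)x."""
        w = self.conv.weight                 # (dim, dim, 3, 3)
        for c in range(w.shape[0]):
            w[c, c, 1, 1] += 1.0             # identity contributes 1 at the kernel centre
        self.deployed = True
```

After calling reparameterize() on every block, the outputs are numerically unchanged, but validation runs a single convolution per block, which is what keeps memory usage and inference time low.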

III-C Autoencoder: Decoder

The decoder is symmetric to the encoder, and its main purpose is to upsample the feature back to the same size as the original data. The structure of the decoder is shown in Fig. 4, where $N_1$, $N_2$, $N_3$ and $N_4$ take the same values as in the encoder, and the structure of the yellow part of the figure is also shown in Fig. 3 (2). The only differences from the encoder are that we use deconvolution to upsample the features and we do not use the RepVGG structure. When the feature is deconvolved with the first convolution kernel of size 8 × 4, a feature map of size B × C × 8 × 4 is obtained; after that, each subsequent deconvolution upsamples the feature until the reconstructed data has the same size as the original data.
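Under the same assumed channel widths as the encoder sketch, a matching decoder can be sketched as follows: the 2048-d feature is reshaped to a 1 × 1 map, a first transposed convolution with an 8 × 4 kernel produces the B × C × 8 × 4 map mentioned above, and stride-2 transposed convolutions upsample back to 3 × 256 × 128.

```python
import torch.nn as nn

class Decoder(nn.Module):
    """Sketch of the decoder in Fig. 4; channel widths are assumptions."""
    def __init__(self, dims=(2048, 1024, 512, 256, 128, 64, 32)):
        super().__init__()
        layers = [nn.ConvTranspose2d(dims[0], dims[1], kernel_size=(8, 4))]   # 1x1 -> 8x4
        for i in range(1, len(dims) - 1):
            layers += [nn.GELU(),
                       nn.ConvTranspose2d(dims[i], dims[i + 1],
                                           kernel_size=2, stride=2)]          # doubles H and W
        layers.append(nn.Conv2d(dims[-1], 3, kernel_size=1))                  # back to 3 channels
        self.net = nn.Sequential(*layers)

    def forward(self, feat):                       # feat: (B, 2048) from the encoder
        x = feat.view(feat.size(0), feat.size(1), 1, 1)
        return self.net(x)                         # (B, 3, 256, 128)
```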


Figure 4: Decoder Structure.

III-D Loss Calculation

Given the original data $X$, with $B$ samples in one training batch, the random mask operation yields the masked data $X_{mask}$ together with a mask sequence $m$ whose entries mark the locations of the masked regions covering 75% of the image. When $X_{mask}$ passes through the autoencoder to obtain the reconstructed data $X_{rebuild}$, the MSE loss function is used to optimize training so that the difference between the reconstructed data and the original data in the randomly masked region becomes smaller and smaller; see (1).

L_{}=\frac{\sum_{i=0}^{len\left(m\right)}{\left(X-X_{rebuild}\right)^{2}\cdot m_{i}}}{B\cdot\sum_{i=0}^{len\left(m\right)}{m_{i}}}. (1)
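A direct implementation of Eq. (1) is sketched below, assuming the images have shape (B, 3, H, W) and the mask map from Sec. III-A has shape (B, 1, H, W) with 1 on masked pixels; the broadcasting layout is our assumption.

```python
import torch

def masked_mse_loss(x, x_rebuild, mask):
    """Eq. (1): squared error accumulated only over masked positions,
    normalised by B times the total number of masked entries."""
    diff = (x - x_rebuild).pow(2) * mask          # unmasked pixels contribute 0
    return diff.sum() / (x.size(0) * mask.sum().clamp_min(1.0))
```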

III-E Triplet Loss Based on Masked Centroid

After the above self-supervised pre-training, we remove the decoder and keep only the encoder structure for the pedestrian re-identification task; see Fig. 1. For this task, we improve the centroid-based triplet loss, as detailed below.

The loss function proposed in this paper is an improvement of the centroid loss for our task. The triplet loss takes an anchor image $A$, a positive sample image $P$, and a negative sample image $N$; the objective is to minimize the distance between $A$ and $P$ while pushing apart the distance between $A$ and $N$. The detailed expression is given in (2), where $[x]_{+}=\max(x,0)$ and $f$ denotes the feature extraction network in the training phase, which is the encoder part of this paper.

L_{tl}=\left[\left\|f\left(A\right)-f\left(P\right)\right\|_{2}^{2}-\left\|f\left(A\right)-f\left(N\right)\right\|_{2}^{2}+\alpha\right]_{+}. (2)
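For reference, Eq. (2) corresponds to the following batched computation on extracted features; the margin value used here is an assumption.

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, alpha=0.3):
    """Eq. (2): squared-distance triplet loss over a batch of (anchor,
    positive, negative) feature triplets, each of shape (B, D)."""
    d_ap = (f_a - f_p).pow(2).sum(dim=1)
    d_an = (f_a - f_n).pow(2).sum(dim=1)
    return F.relu(d_ap - d_an + alpha).mean()
```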

The centroid-based triplet loss computes the distance between the anchor image $A$ and the centroid $c_P$ of all its positive samples and the centroid $c_N$ of all its negative samples, where a centroid is a simple average over all the corresponding samples. Aggregating samples into centroids produces a robust representation that is less susceptible to single-image mismatches and also reduces retrieval time, because all positive and negative samples share a single representation and there is no need to compare each image. The centroid-based triplet loss is expressed in (3).

L_{ctl}=\left[\left\|f\left(A\right)-c_{P}\right\|_{2}^{2}-\left\|f\left(A\right)-c_{N}\right\|_{2}^{2}+\alpha_{c}\right]_{+}. (3)

In this paper, the pedestrian image data are randomly masked before entering the autoencoder, so we reconstruct the masked pedestrian images and feed them into the centroid-based triplet loss as part of the data; see Fig. 1. Based on this, we propose a triplet loss function based on the masked centroid, shown in (4). The masked image and the original image share the same label; the difference is that the original image is intact, while the randomly masked image is missing many regions and is no longer a complete image. Here $m_P$ and $m_N$ are the centroids of all positive and negative samples of the masked images, and $\lambda_1+\lambda_2=1$, set according to the ratio of the numbers of positive and negative samples among the masked images.

L_{mctl}=\left[\left\|f\left(A\right)-\lambda_{1}m_{P}-c_{P}\right\|_{2}^{2}-\left\|f\left(A\right)-\lambda_{2}m_{N}-c_{N}\right\|_{2}^{2}+\alpha\right]_{+}. (4)

With this loss function, the network actively focuses on the important regions of pedestrian images and learns fine-grained image information, while retaining the advantages of the centroid-based triplet loss.
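The sketch below combines Eqs. (3) and (4) for a single anchor: the centroids are plain averages of the original and masked positive/negative features, and $\lambda_1$, $\lambda_2$ are set by the ratio of masked positive to negative samples, as stated above. The margin value and the per-anchor batching are assumptions.

```python
import torch
import torch.nn.functional as F

def masked_centroid_triplet_loss(f_a, pos_feats, neg_feats,
                                 pos_mask_feats, neg_mask_feats, alpha=0.3):
    """Eq. (4) for one anchor feature f_a of shape (D,).
    pos_feats / neg_feats: features of original positive / negative samples (N, D);
    pos_mask_feats / neg_mask_feats: features of their masked counterparts."""
    c_p = pos_feats.mean(dim=0)                   # centroid of positives, as in Eq. (3)
    c_n = neg_feats.mean(dim=0)                   # centroid of negatives
    m_p = pos_mask_feats.mean(dim=0)              # centroid of masked positives
    m_n = neg_mask_feats.mean(dim=0)              # centroid of masked negatives

    n_p, n_n = pos_mask_feats.size(0), neg_mask_feats.size(0)
    lam1 = n_p / (n_p + n_n)                      # lambda_1 + lambda_2 = 1
    lam2 = n_n / (n_p + n_n)

    d_pos = (f_a - lam1 * m_p - c_p).pow(2).sum()
    d_neg = (f_a - lam2 * m_n - c_n).pow(2).sum()
    return F.relu(d_pos - d_neg + alpha)
```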

IV Experiment

In this paper, we first perform generative self-supervised pre-training: the original image is passed through a random mask, the masked part is reconstructed during autoencoder training, and the decoder is then removed from the autoencoder. The pre-trained weights are loaded into the encoder, which performs the pedestrian re-identification task.

The pre-training data are built from the COCO2017 dataset and the Pascal VOC dataset, using only images of the person category. People are cropped out using the annotation information, keypoints are detected on the cropped pedestrians with the human pose estimation network HRNet [39], and pedestrian images with incomplete keypoints are removed. At this stage the data carry no labels; the training set contains 80,689 pedestrian images in total, and these images are used for self-supervised learning. For the pedestrian re-identification task, we use four classical datasets, Market1501 [40], CUHK03 [6], MSMT17 [41], and DukeMTMC-ReID [13], for training and evaluation respectively.

TABLE I: Experimental environment
CPU               Intel(R) Xeon(R) Platinum 8160
Memory            15 GB
Operating system  CentOS 7
Video card        NVIDIA TESLA T4 × 8

Our method is compared with state-of-the-art (SOTA) self-supervised learning networks for pedestrian re-identification from recent years. Also, since our method draws on MAE, a comparison between our method and MAE is performed using the same training procedure. We further compare the masked-centroid triplet loss proposed in this paper with centroid-loss on four pedestrian re-identification datasets. The experimental environment of this paper is shown in Table I, and the detailed comparisons follow.

IV-A Pre-training Visualization Results

We use the model with the lowest loss value during pre-training for the visualizations in this part, and the same model is used to initialize the encoder weights for the pedestrian re-identification task. The test data are pedestrian images taken by ourselves with no overlap with the training data; the detailed results are shown in Fig. 5.



Figure 5: Masked self-supervised learning for visualization of pedestrian image reconstruction results.

Nine randomly selected images from all the test results are shown: the leftmost column shows the original images, the middle column shows the images after the 75% random mask operation, and the right column shows the images reconstructed by the autoencoder. The figure shows that some details in the pedestrian images are still not restored well, probably because, given the low resolution of pedestrian images, the convolutional structure in the decoder cannot learn the original image content well; in future work we will introduce GAN components to make this part of the pedestrian image reconstruction more detailed.

At the same time, to show the network's image feature extraction more intuitively, we also generated heat maps of the masked images fed into the network, as shown in Fig. 6.


Figure 6: Results of network visualization of heat map of masked images.

It can be seen that the network still focuses well on the general area of the pedestrian even though the image is masked, which also shows that the proposed method attends better to the fine-grained information of pedestrian images after masked self-supervised learning.

IV-B Experimental Results of Pedestrian Re-identification

As shown in Fig. 1, after the masked self-supervised pre-training we use only the encoder for the pedestrian re-identification task and perform training and evaluation on four classical datasets. We then compare with the most advanced self-supervised learning methods of recent years on the Market1501 and CUHK03 datasets; the detailed evaluation results are shown in Table II. The mAP obtained by our method is more than 5 points higher than the existing SOTA self-supervised learning methods on these two datasets; the only shortcoming is that the Rank-1 metric does not improve much and remains essentially the same. In pedestrian re-identification, mAP reflects the degree to which all correctly matched images are ranked at the top of the sorted retrieval list; it evaluates the overall effectiveness of a pedestrian re-identification method and measures its performance more comprehensively.
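For completeness, the per-query average precision used to compute mAP can be sketched as follows; this is the usual re-identification definition rather than code from the paper.

```python
import numpy as np

def average_precision(ranked_hits):
    """`ranked_hits` is a 0/1 sequence over the sorted gallery for one query
    (1 = same identity).  mAP is the mean of this value over all queries."""
    hits = np.asarray(ranked_hits, dtype=float)
    if hits.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(hits) / (np.arange(len(hits)) + 1)
    return float((precision_at_k * hits).sum() / hits.sum())
```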

TABLE II: The results of comparing the method in this paper with the advanced self-supervised learning methods in recent years
Methods        Market1501         CUHK03
               mAP(%)  Rank1(%)   mAP(%)  Rank1(%)
PCB            81.6    93.8       57.5    63.7
OSNet          84.9    94.8       67.8    72.3
P2Net          85.6    95.2       73.6    78.3
DSA            87.6    95.7       75.2    78.9
GCP            88.9    95.2       75.6    77.9
SAN            88.0    96.1       76.4    80.1
ISP            88.6    95.3       74.1    76.5
GASM           84.7    95.3       -       -
RGA-SC         88.4    96.1       77.4    81.1
HOReID         84.9    94.2       -       -
AMD            87.1    94.8       -       -
TransReID      89.5    95.2       -       -
PAT            88.0    95.4       -       -
MGN+UP-ReID    91.1    97.1       85.3    87.6
ours           97.6    97.1       90.8    88.1

Meanwhile, to better demonstrate the reliability of the proposed method, we conducted matched experiments with MAE: we first performed masked self-supervised pre-training and then selected the model with the lowest training loss for the pedestrian re-identification task. The resulting data are shown in Table III, from which it can be seen that on each of the four datasets our method is higher than MAE by more than 3 points in every metric. This also shows that the proposed method is indeed effective.

TABLE III: Comparison of the evaluation of the method in this paper with MAE after equivalent training conditions
Methods   Market1501         CUHK03
          mAP(%)  Rank1(%)   mAP(%)  Rank1(%)
MAE       92.4    90.8       87.0    82.6
ours      97.6    97.1       90.8    88.1

Methods   DukeMTMC           MSMT17
          mAP(%)  Rank1(%)   mAP(%)  Rank1(%)
MAE       89.5    90.2       71.9    65.9
ours      95.5    94.7       86.4    83.4

We also compared centroid-loss (ctl) with the mask-centroid-loss (mctl) proposed in this paper on the four datasets; see Table IV. The metrics obtained after training with the proposed loss function are all higher than those obtained with centroid-loss by at least 0.2 points, which also reflects that the proposed mctl enables the network to learn more latent features in pedestrian images.

TABLE IV: Comparison of post-training evaluation using ctl and mctl, respectively
Methods      Market1501         CUHK03
             mAP(%)  Rank1(%)   mAP(%)  Rank1(%)
ours(ctl)    97.4    97.0       90.3    87.2
ours(mctl)   97.6    97.1       90.8    88.1

Methods      DukeMTMC           MSMT17
             mAP(%)  Rank1(%)   mAP(%)  Rank1(%)
ours(ctl)    95.3    94.3       86.2    83.1
ours(mctl)   95.5    94.7       86.4    83.4

Finally, we visualize part of the evaluation results on the Market1501 dataset; see Fig. 7. Three pedestrians with different IDs are selected: column 1 shows query images of the same pedestrian from different viewpoints, and the next 10 columns show the gallery images matched as the same target, sorted by similarity from largest to smallest. Within Rank-10, essentially all matching results returned by the network correspond to the same target.

V Conclusions

In this paper, we propose a generative self-supervised learning method to address the problem of confusing the discriminative attributes of person images and the problem that traditional feature extraction networks pre-trained on ImageNet cannot fully exploit the fine-grained information of person images in the pedestrian re-identification task, together with a triplet loss based on the masked centroid. After training and evaluation on four classical pedestrian re-identification datasets, the proposed method achieves good results in comparison with existing SOTA models and in the loss-function experiments, which also demonstrates the robustness of the method.

References

  • [1] Wojciech Zajdel, Zoran Zivkovic, and Ben JA Krose. Keeping track of humans: Have i seen this person before? In Proceedings of the 2005 IEEE international conference on robotics and automation, pages 2081–2086. IEEE, 2005.
  • [2] Tetsu Matsukawa and Einoshin Suzuki. Person re-identification using cnn features learned from combination of attributes. In 2016 23rd international conference on pattern recognition (ICPR), pages 2428–2433. IEEE, 2016.
  • [3] Liang Zheng, Hengheng Zhang, Shaoyan Sun, Manmohan Chandraker, Yi Yang, and Qi Tian. Person re-identification in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1367–1376, 2017.
  • [4] Rahul Rama Varior, Bing Shuai, Jiwen Lu, Dong Xu, and Gang Wang. A siamese long short-term memory architecture for human re-identification. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11–14, 2016, Proceedings, Part VII 14, pages 135–153. Springer, 2016.
  • [5] Weijian Deng, Liang Zheng, Qixiang Ye, Guoliang Kang, Yi Yang, and Jianbin Jiao. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 994–1003, 2018.
  • [6] Wei Li, Rui Zhao, Tong Xiao, and Xiaogang Wang. Deepreid: Deep filter pairing neural network for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 152–159, 2014.
  • [7] Zhedong Zheng, Liang Zheng, and Yi Yang. A discriminatively learned cnn embedding for person reidentification. ACM transactions on multimedia computing, communications, and applications (TOMM), 14(1):1–20, 2017.
  • [8] Dapeng Chen, Dan Xu, Hongsheng Li, Nicu Sebe, and Xiaogang Wang. Group consistent similarity learning via deep crf for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8649–8658, 2018.
  • [9] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision, pages 1116–1124, 2015.
  • [10] Alexander Hermans, Lucas Beyer, and Bastian Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • [11] Weihua Chen, Xiaotang Chen, Jianguo Zhang, and Kaiqi Huang. Beyond triplet loss: a deep quadruplet network for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 403–412, 2017.
  • [12] Faqiang Wang, Wangmeng Zuo, Liang Lin, David Zhang, and Lei Zhang. Joint learning of single-image and cross-image representations for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1288–1296, 2016.
  • [13] Zhedong Zheng, Liang Zheng, and Yi Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE international conference on computer vision, pages 3754–3762, 2017.
  • [14] Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1318–1327, 2017.
  • [15] Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European conference on computer vision (ECCV), pages 480–496, 2018.
  • [16] Yifan Sun, Liang Zheng, Weijian Deng, and Shengjin Wang. Svdnet for pedestrian retrieval. In Proceedings of the IEEE international conference on computer vision, pages 3800–3808, 2017.
  • [17] Mang Ye, Xiangyuan Lan, and Pong C Yuen. Robust anchor embedding for unsupervised video person re-identification in the wild. In Proceedings of the European Conference on Computer Vision (ECCV), pages 170–186, 2018.
  • [18] Yutian Lin, Liang Zheng, Zhedong Zheng, Yu Wu, Zhilan Hu, Chenggang Yan, and Yi Yang. Improving person re-identification by attribute and identity learning. Pattern recognition, 95:151–161, 2019.
  • [19] Xuan Zhang, Hao Luo, Xing Fan, Weilai Xiang, Yixiao Sun, Qiqi Xiao, Wei Jiang, Chi Zhang, and Jian Sun. Alignedreid: Surpassing human-level performance in person re-identification. arXiv preprint arXiv:1711.08184, 2017.
  • [20] Xing Fan, Hao Luo, Xuan Zhang, Lingxiao He, Chi Zhang, and Wei Jiang. Scpnet: Spatial-channel parallelism network for joint holistic and partial person re-identification. In Computer Vision–ACCV 2018: 14th Asian Conference on Computer Vision, Perth, Australia, December 2–6, 2018, Revised Selected Papers, Part II 14, pages 19–34. Springer, 2019.
  • [21] Liang Zheng, Yujia Huang, Huchuan Lu, and Yi Yang. Pose-invariant embedding for deep person re-identification. IEEE Transactions on Image Processing, 28(9):4500–4509, 2019.
  • [22] Chi Su, Jianing Li, Shiliang Zhang, Junliang Xing, Wen Gao, and Qi Tian. Pose-driven deep convolutional model for person re-identification. In Proceedings of the IEEE international conference on computer vision, pages 3960–3969, 2017.
  • [23] Longhui Wei, Shiliang Zhang, Hantao Yao, Wen Gao, and Qi Tian. Glad: Global-local-alignment descriptor for pedestrian retrieval. In Proceedings of the 25th ACM international conference on Multimedia, pages 420–428, 2017.
  • [24] Ejaz Ahmed, Michael Jones, and Tim K Marks. An improved deep learning architecture for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3908–3916, 2015.
  • [25] Xiang Li, Ancong Wu, and Wei-Shi Zheng. Adversarial open-world person re-identification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 280–296, 2018.
  • [26] Lingxiao He, Jian Liang, Haiqing Li, and Zhenan Sun. Deep spatial feature reconstruction for partial person re-identification: Alignment-free approach. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7073–7082, 2018.
  • [27] Yixiao Ge, Feng Zhu, Dapeng Chen, Rui Zhao, et al. Self-paced contrastive learning with hybrid memory for domain adaptive object re-id. Advances in Neural Information Processing Systems, 33:11309–11321, 2020.
  • [28] Zuozhuo Dai, Guangyuan Wang, Weihao Yuan, Siyu Zhu, and Ping Tan. Cluster contrast for unsupervised person re-identification. In Proceedings of the Asian Conference on Computer Vision, pages 1142–1160, 2022.
  • [29] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. Communications of the ACM, 63(11):139–144, 2020.
  • [30] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
  • [31] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
  • [32] Zhongdao Wang, Jingwei Zhang, Liang Zheng, Yixuan Liu, Yifan Sun, Yali Li, and Shengjin Wang. Cycas: Self-supervised cycle association for learning re-identifiable descriptions. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XI 16, pages 72–88. Springer, 2020.
  • [33] Hao Chen, Yaohui Wang, Benoit Lagadec, Antitza Dantcheva, and Francois Bremond. Joint generative and contrastive learning for unsupervised person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2004–2013, 2021.
  • [34] Zizheng Yang, Xin Jin, Kecheng Zheng, and Feng Zhao. Unleashing potential of unsupervised pre-training with intra-identity regularization for person re-identification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14298–14307, 2022.
  • [35] Ke Han, Chenyang Si, Yan Huang, Liang Wang, and Tieniu Tan. Generalizable person re-identification via self-supervised batch norm test-time adaption. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 817–825, 2022.
  • [36] Mikołaj Wieczorek, Barbara Rychalska, and Jacek Dabrowski. On the unreasonable effectiveness of centroids in image retrieval. In Neural Information Processing: 28th International Conference, ICONIP 2021, Sanur, Bali, Indonesia, December 8–12, 2021, Proceedings, Part IV 28, pages 212–223. Springer, 2021.
  • [37] Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11976–11986, 2022.
  • [38] Xiaohan Ding, Xiangyu Zhang, Ningning Ma, Jungong Han, Guiguang Ding, and Jian Sun. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13733–13742, 2021.
  • [39] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5693–5703, 2019.
  • [40] Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jiahao Bu, and Qi Tian. Person re-identification meets image search. arXiv preprint arXiv:1502.02171, 2015.
  • [41] Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 79–88, 2018.