Domain Camera Adaptation and Collaborative Multiple Feature Clustering for Unsupervised Person Re-ID
Abstract.
Recently, unsupervised person re-identification (re-ID) has drawn much attention due to its open-world scenario settings, where only limited annotated data is available. Existing supervised methods often fail to generalize well to unseen domains, while most unsupervised methods lack multi-granularity information and are prone to confirmation bias. In this paper, we aim to find better feature representations on the unseen target domain from two aspects: 1) performing unsupervised domain adaptation on the labeled source domain and 2) mining potential similarities on the unlabeled target domain. Besides, a collaborative pseudo re-labeling strategy is proposed to alleviate the influence of confirmation bias. First, a generative adversarial network is utilized to transfer images from the source domain to the target domain. Moreover, person identity preserving and identity mapping losses are introduced to improve the quality of the generated images. Second, we propose a novel collaborative multiple feature clustering framework (CMFC) to learn the internal data structure of the target domain, including a global feature branch and a partial feature branch. The global feature branch (GB) employs unsupervised clustering on the global features of person images, while the partial feature branch (PB) mines similarities within different body regions. Finally, extensive experiments on two benchmark datasets show the competitive performance of our method under unsupervised person re-ID settings.
1. Introduction
Person re-identification (re-ID) aims to retrieve images of a given person from a database collected by non-overlapping cameras. Existing re-ID methods (Zhong et al., 2017; Li et al., 2018b) have achieved impressive performance when trained and tested on the same domain. Nevertheless, researchers have consistently shown that models trained on a source domain may suffer a significant performance drop when directly applied to a target domain, since a large domain gap exists between the two. As shown in Fig. 1, because different datasets are often collected in different environments, images from different datasets often exhibit large appearance variations in illumination and background clutter. Besides, in real-world applications, the training data on the target domain is usually unlabeled or only partially labeled.

To tackle these issues, a considerable body of literature has recently grown up around the theme of unsupervised domain adaptation (UDA). Compared to generic UDA, the source and target domains in person re-ID have entirely different classes (i.e., person identities), so the task is studied as an open-set domain adaptation problem. A few works (Deng et al., 2018; Zhu et al., 2017; Wei et al., 2018; Zhong et al., 2018) have been proposed to deal with it, mainly by transferring the global style of the source domain to the target domain. However, intra-domain image variations also exist because of the disparities among cameras. Note that the examples in each row of Fig. 1 are sampled from different cameras of the same dataset and are distinct in background. Thus, translating only the global style of images may confuse the generative networks and impair the quality of the generated images.
Apart from these methods, some techniques (Ge et al., 2020; Yu et al., 2019) address the unsupervised person re-ID problem by pseudo label estimation. Features of unlabeled target training data are extracted from a model pre-trained on the labeled source dataset. Unsupervised clustering is applied to these features to generate pseudo labels, which are used for subsequent supervised training; model training and unsupervised clustering are executed alternately. Clustering quality is therefore critical to their performance. However, performing unsupervised clustering on the whole image may suffer from severe background clutter. Besides, the label generation lacks multi-granularity information and is prone to confirmation bias.
In this paper, we aim at finding better feature representations on the unseen target domain/dataset from two aspects, using labeled source training images and unlabeled target training images. First, considering the disparities among cameras within the source and target datasets, images from source camera sub-domains are transferred to target camera sub-domains while keeping the person identity. The generated images with specific person identities are thus utilized for supervised learning. Second, to mine the potential identity similarities on the target training set, a two-branch framework is proposed for similarity learning using multiple features to alleviate confirmation bias. Specifically, we propose a novel collaborative multiple feature clustering framework (CMFC) for learning representations on the target dataset, including a global feature guided training branch and a partial feature guided training branch. The global feature branch (GB) performs unsupervised clustering on global features and finetunes the network based on the resulting cluster groups. The partial feature branch (PB) divides the person image into upper and lower body regions and performs unsupervised clustering on each part to obtain pseudo labels. The pseudo labels are utilized to optimize the person re-ID model in a supervised manner. Both global and partial features are combined in CMFC to learn the internal data distribution of the target dataset. In addition, a collaborative pseudo re-labeling strategy is proposed to alleviate the influence of confirmation bias. Finally, we summarize our main contributions as follows:
(1) We propose a cross-domain camera style adaptation module to transfer images from the source dataset to the target dataset at the camera level while preserving the person identity. The transferred images are then fed to the model to obtain more discriminative features in a supervised manner.
(2) We propose a collaborative multiple feature clustering framework (CMFC) to learn identity similarities on the target domain using multiple features while alleviating the influence of confirmation bias. A global feature branch extracts the global feature of pedestrian images and performs unsupervised clustering on it to learn identity similarities. The partial feature branch divides person images into different parts, and the re-ID model is optimized using the potential similarities within these parts.
(3) Through experimental results on two large-scale datasets, we demonstrate the effectiveness of our method and its different components on unsupervised person re-ID.
2. Related work
In this section, we briefly review related works, since the proposed method mainly concerns unsupervised domain adaptation and unsupervised person re-ID.
2.1. Supervised Person Re-identification
Extensive supervised methods have been developed on the widely used benchmarks (Zheng et al., 2015, 2017; Ristani et al., 2016), concentrating on discriminative feature representation learning (Li et al., 2018b; Bai et al., 2020), deep metric learning (Yu et al., 2018; Hu et al., 2015), post-processing procedures (Bai et al., 2017; Zhong et al., 2017; Cao et al., 2020), and other problems such as occlusion (He et al., 2019) and varying image resolutions (Wang et al., 2018). Although great progress has been observed, these supervised approaches may suffer a significant performance drop when applied to an unseen domain due to domain shift.
2.2. Unsupervised Domain Adaptation
Unsupervised domain adaptation (UDA) aims to transfer the knowledge of a labeled source domain to an unlabeled target domain. Several UDA methods try to reduce the distribution discrepancy between the source and target domains at either the feature level (Peng et al., 2016; Hu et al., 2015) or the image level (Deng et al., 2018; Wei et al., 2018). The former focuses on learning domain-invariant representations by aligning feature statistics, such as the mean and covariance of the source and target feature distributions (Sun et al., 2016), or by adversarial training. The latter learns to transform samples in pixel space from the source domain to the target domain using generative adversarial networks (Bousmalis et al., 2017; Hoffman et al., 2018). Nevertheless, most existing UDA methods assume that the source and target domains share the same set of classes, while in the person re-ID task the source and target domains have entirely different classes. Thus, these methods cannot be directly utilized for unsupervised domain adaptation in person re-ID.
2.3. Unsupervised Person Re-identification
Unsupervised person re-ID methods are proposed to utilize unlabeled target data together with large-scale labeled samples. Among them, some works address this problem with domain adaptation at the feature or image level. In (Yang et al., 2020a), PPAN is proposed to enforce feature alignment across domains. Peng et al. (Peng et al., 2016) propose to learn based on asymmetric multi-task dictionary learning. Other works (Deng et al., 2018; Wei et al., 2018) attempt to transfer the source domain images to target domain styles using generative adversarial networks (GAN). Deng et al. (Deng et al., 2018) introduce a similarity preserving cycle-consistent generative adversarial network (SPGAN) to translate images. However, intra-domain image variations still exist because of the distribution discrepancy at the camera level. HHL (Zhong et al., 2018) considers the intra-domain image style variations caused by different camera configurations. M2MGAN (Liang et al., 2018) takes multiple source and target sub-domains into consideration. Different from these methods, our cross-domain camera style adaptation module explicitly considers camera-level disparities and transforms images from the source domain to the different cameras of the target domain. Moreover, we introduce a person identity preserve loss and an identity mapping loss to improve the quality of the generated images. Finally, considering the confirmation bias of label generation, a collaborative pseudo re-labeling strategy is proposed.

Beyond the above methods, some approaches focus on label estimation in the target domain. Fan et al. (Fan et al., 2018) propose an unsupervised re-ID approach for iteratively applying k-means clustering. Yang et al. (Yang et al., 2020b) generate labeled virtual data from the target dataset and propose collaborative filtering on unlabeled data. A Self-similarity Grouping (SSG) approach (Fu et al., 2019) iteratively conducts grouping and re-ID model training in a self-paced manner. A self-training method with progressive augmentation (Zhang et al., 2019) jointly captures the local structure and global data distribution. Soft multi-label learning (Yu et al., 2019) mines the soft label information from a reference set for unsupervised learning. Inspired by these methods, we propose a collaborative multiple feature clustering framework (CMFC) to learn identity similarities using multiple features on the target domain. CMFC is a two-branch framework, including global and partial feature branches, which improves the accuracy of internal data similarity learning using multiple features.
3. Proposed method
Problem definition. For unsupervised person re-ID, we have a labeled dataset $\{(x_i^s, y_i^s)\}_{i=1}^{N_S}$ from the source domain, which contains $N_S$ person images. Each image $x_i^s$ corresponds to a label $y_i^s$, where $y_i^s \in \{1, 2, \dots, M_S\}$ and $M_S$ is the number of identities in the source dataset. We also have an unlabeled dataset $\{x_i^t\}_{i=1}^{N_T}$ from the target domain, containing $N_T$ unlabeled images. Note that the identities of the target images are unknown while the camera labels are available, which conforms to real-world settings. $C_S$ and $C_T$ denote the number of cameras in the source and target datasets, respectively.
Motivation. The goal of this paper is to learn the discriminative embeddings of the target dataset by both leveraging the knowledge of the source dataset and mining internal similarities on the target dataset. Thus, in Section 3.2, we perform cross-domain camera style adaptation to transfer images from the source domain to the target domain. In this way, the generated images with target domain styles can be trained in a supervised manner to improve the discriminative ability of re-ID model. Besides, in Section 3.3, we further explore the similarities on the target dataset and finetune the re-ID model with positive pairs and negative pairs to obtain discriminative feature representation on the target dataset.
3.1. Supervised pre-training
To learn feature embeddings of the source domain, we train on the source dataset and denote the obtained model as baseline. We use ResNet50 (He et al., 2016) as backbone network. Given the labeled images in a training batch, we train the baseline model with cross-entropy loss and batch-hard triplet loss (Hermans et al., 2017) simultaneously. The cross-entropy loss is employed with the output of the FC layer by treating the training process as a classification task. Besides, label smoothing (Szegedy et al., 2016) is used to avoid overfitting. Specifically, the cross-entropy loss with label smoothing can be formulated as:
$$\mathcal{L}_{cross} = -\frac{1}{n_s}\sum_{i=1}^{n_s}\sum_{k=1}^{M_S} q_i(k)\,\log p_i(k) \qquad (1)$$
$$q_i(k) = (1-\varepsilon)\,\mathbb{1}[k=y] + \frac{\varepsilon}{M_S} \qquad (2)$$
where $n_s$ is the number of images in a batch, $p_i(k)$ is the predicted probability of image $i$ belonging to class $k$, $q_i(k)$ is the smoothed label distribution, $y$ is the ground-truth class label, and $\varepsilon$ is a small perturbation term. The triplet loss is employed to enhance intra-class compactness and inter-class separability, which can be written as,
$$\mathcal{L}_{tri} = \frac{1}{n_s}\sum_{a=1}^{n_s}\Big[\, m + D(x_a, x_p) - D(x_a, x_n) \,\Big]_{+} \qquad (3)$$
where $x_a$ denotes the anchor, $x_p$ and $x_n$ represent the hardest positive and the hardest negative sample in the same batch respectively, $m$ is a margin hyperparameter, and $D$ is the Euclidean distance between two features. Thus, the overall loss function is written as follows
$$\mathcal{L}_{base} = \mathcal{L}_{cross} + \lambda\,\mathcal{L}_{tri} \qquad (4)$$
where $\lambda$ can be set to 1 in the experiments for simplicity.
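For concreteness, the following is a minimal PyTorch sketch of the losses in Eqs. (1)-(4); the function names are illustrative, and the default margin follows the baseline value reported in Section 4.2.

```python
import torch
import torch.nn.functional as F

def cross_entropy_label_smoothing(logits, targets, epsilon=0.1):
    """Eqs. (1)-(2): cross-entropy with smoothed targets
    q(k) = (1 - eps) * 1[k = y] + eps / num_classes."""
    num_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    q = torch.full_like(log_probs, epsilon / num_classes)
    q.scatter_(1, targets.unsqueeze(1), 1.0 - epsilon + epsilon / num_classes)
    return -(q * log_probs).sum(dim=1).mean()

def batch_hard_triplet(features, targets, margin=0.5):
    """Eq. (3): batch-hard triplet loss with Euclidean distances."""
    dist = torch.cdist(features, features, p=2)
    same_id = targets.unsqueeze(0) == targets.unsqueeze(1)
    hardest_pos = dist.masked_fill(~same_id, float('-inf')).max(dim=1).values
    hardest_neg = dist.masked_fill(same_id, float('inf')).min(dim=1).values
    return F.relu(margin + hardest_pos - hardest_neg).mean()

def baseline_loss(logits, features, targets):
    """Eq. (4) with lambda = 1."""
    return cross_entropy_label_smoothing(logits, targets) + batch_hard_triplet(features, targets)
```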
3.2. Cross Domain Camera Style Adaptation
To leverage the knowledge of the source dataset, we try to reduce the distribution discrepancy between the source and target dataset at the image level. Instead of transferring the global style of images, we perform image-image translation by viewing each camera as an individual domain.
Given a dataset $X_S$ from the source domain and a dataset $X_T$ from the target domain where camera labels are available, our goal is to train a single generator $G$ that learns the mappings among multiple camera domains. In this way, a given labeled image with camera label $c$ is transferred to another camera style $c_t$ of the target domain, while the identity information is preserved during the translation. The generated images can then be used to train the person re-ID model. To learn the style mapping between the source and target datasets, we employ StarGAN (Choi et al., 2018) and further introduce two loss terms during the image translation training procedure to improve the quality of the transferred images.
StarGAN trains a single generator $G$ to translate an input image $x$ into an output image $y$ conditioned on the target domain label $c$, i.e., $G(x, c) \rightarrow y$. In addition, to distinguish real training examples from samples generated by $G$, an auxiliary classifier is introduced into the discriminator $D$ to handle multiple domains. Thus, the discriminator produces probability distributions over both sources and domain labels, $D: x \rightarrow \{D_{src}(x), D_{cls}(x)\}$. The network architecture of StarGAN is described in Section 4.2.
Adversarial Loss. Adversarial loss is adopted to make the generated images indistinguishable from real images. The generator G tries to minimize the loss while the discriminator tries to maximize it.
$$\mathcal{L}_{adv} = \mathbb{E}_{x}\big[\log D_{src}(x)\big] + \mathbb{E}_{x,c}\big[\log\big(1 - D_{src}(G(x,c))\big)\big] \qquad (5)$$
Domain Classification Loss. To distinguish the domain labels of a real/fake image, an auxiliary classifier is added on top of $D$ and a cross-entropy loss is utilized to optimize both $G$ and $D$. Specifically, the domain classification loss of real images is used to optimize $D$ and the domain classification loss of fake images is used to optimize $G$, that is,
$$\mathcal{L}_{cls}^{r} = \mathbb{E}_{x,c'}\big[-\log D_{cls}(c'\,|\,x)\big] \qquad (6)$$
$$\mathcal{L}_{cls}^{f} = \mathbb{E}_{x,c}\big[-\log D_{cls}(c\,|\,G(x,c))\big] \qquad (7)$$
where c′ is the original domain label of real image x.
Reconstruction Loss. To preserve the content of input images while changing only the domain-related style of images, reconstruction loss is used to formulate forward cycle consistency.
$$\mathcal{L}_{rec} = \mathbb{E}_{x,c,c'}\big[\,\|x - G(G(x,c), c')\|_{1}\,\big] \qquad (8)$$
Identity mapping Loss. Apart from the adversarial loss, domain classification loss, and reconstruction loss, we introduce an identity mapping loss to regularize the generator to behave as an identity mapping on samples from the target domain: when a target image $x_t$ is translated to its own camera domain $c_t$, the output should remain unchanged. SPGAN (Deng et al., 2018) uses this loss to preserve the color composition between the input and output when translating images from the source dataset to the target dataset. The identity mapping loss is written as
$$\mathcal{L}_{idt} = \mathbb{E}_{x_t, c_t}\big[\,\|G(x_t, c_t) - x_t\|_{1}\,\big] \qquad (9)$$
Person Identity Preserve Loss. To utilize the transferred images for supervised person re-ID model training, it is important to preserve the identity of person images while changing their style. We introduce the person identity preserve loss $\mathcal{L}_{pid}$ by evaluating the variations in the person foreground before and after the transfer. However, the common form of identity preserve loss is prone to suffer from perturbations in the foreground masks. To enhance the robustness of $\mathcal{L}_{pid}$, a consistency regularization term is introduced by constraining the outputs of the original image and the augmented image to be consistent, as follows,
$$\mathcal{L}_{pid} = \mathbb{E}_{x,c}\big[\,\|M(x)\odot x - M(x)\odot G(x,c)\|_{1}\,\big] + \mathbb{E}_{x,c}\big[\,\|M(x)\odot G(x,c) - M(\tilde{x})\odot G(\tilde{x},c)\|_{1}\,\big] \qquad (10)$$
where $M(x)$ and $\tilde{x}$ represent the foreground mask and the augmented image of $x$, respectively. In this paper, we use the SOLO (Wang et al., 2019) instance segmentation algorithm to extract the foreground mask.
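A minimal sketch of Eq. (10) is given below, assuming the foreground masks have been precomputed (e.g., with SOLO) and broadcast to the image shape; the function and argument names are illustrative rather than part of a released implementation.

```python
import torch.nn.functional as F

def person_identity_preserve_loss(G, x, x_aug, mask, mask_aug, target_cam):
    """Sketch of Eq. (10): keep the person foreground unchanged across the camera
    transfer, plus a consistency term between the original and augmented inputs."""
    fake = G(x, target_cam)          # style-transferred original image
    fake_aug = G(x_aug, target_cam)  # style-transferred augmented image
    fg_term = F.l1_loss(mask * fake, mask * x)                 # foreground preservation
    consistency = F.l1_loss(mask * fake, mask_aug * fake_aug)  # consistency regularization
    return fg_term + consistency
```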
Overall objective function. Stage II in Fig. 2 shows the overview of the training process of the proposed cross-domain camera style adaptation module. Specifically, the objective functions to optimize G and D are written, respectively, as
$$\mathcal{L}_{G} = \mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}_{cls}^{f} + \lambda_{rec}\,\mathcal{L}_{rec} + \lambda_{idt}\,\mathcal{L}_{idt} + \lambda_{pid}\,\mathcal{L}_{pid}, \qquad \mathcal{L}_{D} = -\mathcal{L}_{adv} + \lambda_{cls}\,\mathcal{L}_{cls}^{r} \qquad (11)$$
where $\lambda_{cls}$, $\lambda_{rec}$, $\lambda_{idt}$, $\lambda_{pid}$ are hyper-parameters that control the relative importance of the domain classification loss, reconstruction loss, identity mapping loss, and person identity preserve loss, respectively. Empirically, we use $\lambda_{cls}=1$, $\lambda_{rec}=1$, $\lambda_{idt}=10$ and $\lambda_{pid}=10$ in our experiments.
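The weighting above can be assembled as in the following sketch, where the individual loss terms of Eqs. (5)-(10) are assumed to be computed elsewhere and passed in as scalars.

```python
# Weights as reported above.
LAMBDA_CLS, LAMBDA_REC, LAMBDA_IDT, LAMBDA_PID = 1.0, 1.0, 10.0, 10.0

def generator_objective(l_adv, l_cls_fake, l_rec, l_idt, l_pid):
    """L_G in Eq. (11): adversarial term plus the weighted auxiliary terms."""
    return (l_adv + LAMBDA_CLS * l_cls_fake + LAMBDA_REC * l_rec
            + LAMBDA_IDT * l_idt + LAMBDA_PID * l_pid)

def discriminator_objective(l_adv, l_cls_real):
    """L_D in Eq. (11): D maximizes the adversarial term and classifies real images."""
    return -l_adv + LAMBDA_CLS * l_cls_real
```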
3.3. Collaborative Multiple feature clustering framework (CMFC)
In addition to performing unsupervised domain adaptation, mining potential label information (identity similarity) on the target domain is essential for unsupervised person re-ID. In this section, we introduce a collaborative multiple feature clustering framework (a two-branch network) based on global and partial features to mine the similarities among target images and train the re-ID model.
Two-branch re-ID model. As illustrated in Stage III of Fig. 2, the two-branch network shares the same backbone as the baseline model. Before training on the target domain, the model is trained on the source dataset and on the transferred images described in Section 3.2, learning useful representations of person images. However, the model is still not discriminative on the target domain. Afterward, two two-branch networks are trained in parallel, each supervised with the pseudo labels generated by the other, to alleviate confirmation bias. Each two-branch network consists of a global feature branch and a partial feature branch.
Global feature branch. Given an unlabeled image $x$, we first feed it into the pre-trained model for feature extraction. The feature map of image $x$ is denoted as $F(x)$ (the output of ResNet50 layer5). Next, we apply global max pooling (GMP) to the feature map to obtain the feature vector $f$. For every image in the target dataset, we extract the feature vector to form a feature vector set $\{f_i\}_{i=1}^{N_T}$. Based on this set, an unsupervised clustering algorithm is utilized to divide the target dataset into different groups. In this paper, we use the DBSCAN (Ester et al., 1996) algorithm to perform unsupervised clustering. According to the clustering results, each image $x$ is assigned a pseudo label $\tilde{y}$. In this way, with the pseudo label of each image in the target dataset, a new training dataset is organized.
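A minimal sketch of this labeling step is shown below; it assumes the backbone returns the layer5 feature map, and the DBSCAN hyper-parameters and the L2 normalization before clustering are illustrative choices rather than values taken from the paper.

```python
import torch
import torch.nn.functional as F
from sklearn.cluster import DBSCAN

@torch.no_grad()
def global_pseudo_labels(backbone, target_loader, device='cuda', eps=0.6, min_samples=4):
    """Global branch: GMP features + DBSCAN clustering; label -1 marks noise samples."""
    feats = []
    for images in target_loader:                 # target images are unlabeled
        fmap = backbone(images.to(device))       # (B, C, H, W) layer5 feature map
        f = fmap.amax(dim=(2, 3))                # global max pooling -> (B, C)
        feats.append(F.normalize(f, dim=1).cpu())  # L2 normalization (a common choice)
    feats = torch.cat(feats).numpy()
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(feats)
```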
We then finetune the re-ID model with the new dataset in a supervised manner. Specifically, batch hard triplet loss and cross-entropy loss are used to train the model. Triplet loss is employed with feature vector f. Cross entropy loss is employed with the person identity classifier, where the global average pooling (GAP) layer and batch normalization (BN) layer are added before it.
Due to the variation of cluster numbers across training iterations, the newly added person identity classifier layer should be re-initialized every time DBSCAN (Ester et al., 1996) is executed. Following the initialization strategy in (Zhang et al., 2019), we exploit the mean feature of each cluster as the initial parameters. Specifically, for each cluster $c$, we calculate the mean feature $\bar{f}_c$ by averaging all the embedding features of its elements. The parameters $W_c$ of the classifier layer are initialized as $W_c = \bar{f}_c$, $c = 1, \dots, C$, where $C$ is the number of clusters in each iteration. Finally, a novel adversarial-erasing-based reconstruction branch is proposed. Specifically, the feature maps of images are used to generate activation maps, and the coordinates with the largest activation values are recorded. Rectangular areas of random size centered on these coordinates are then generated as the erased regions. Finally, the erased images are fed into a decoder network to reconstruct the original images, supervised by a pixel-wise reconstruction loss. With this strategy, the network is encouraged to mine key information from the remaining regions of the erased image, so that it does not over-trust noisy labels.
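The classifier re-initialization can be sketched as follows; the helper assumes a bias-free linear classifier and that DBSCAN noise samples (label -1) are excluded, which is a common convention and an assumption here.

```python
import torch
import torch.nn as nn

def reset_classifier_from_clusters(feat_dim, features, pseudo_labels):
    """Rebuild the identity classifier each clustering round and initialize its
    weights with the mean feature of every cluster (noise label -1 is ignored)."""
    cluster_ids = sorted(set(pseudo_labels.tolist()) - {-1})
    classifier = nn.Linear(feat_dim, len(cluster_ids), bias=False)
    with torch.no_grad():
        for row, c in enumerate(cluster_ids):
            classifier.weight[row] = features[pseudo_labels == c].mean(dim=0)
    return classifier
```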
Partial feature branch. In the person re-ID task, part-level feature learning usually achieves better performance for discriminative re-ID model learning by mining discriminative body regions. In this paper, we divide the feature map of a person image into two parts horizontally, and each part, containing information about the upper or lower body, is used to generate a separate pseudo label for the image. Given an unlabeled image $x$, the feature map is divided into two parts, $F_{up}(x)$ and $F_{low}(x)$. Next, the global max pooling (GMP) operation is applied to the sliced feature maps to obtain the feature vectors of the different body regions. For every image in the target dataset, we extract the feature vectors of the upper and lower body to form two feature vector sets, denoted as $\{f_i^{up}\}_{i=1}^{N_T}$ and $\{f_i^{low}\}_{i=1}^{N_T}$. The unsupervised clustering algorithm is employed on these two vector sets. Therefore, each image $x$ is assigned two pseudo labels, $\tilde{y}^{up}$ and $\tilde{y}^{low}$.
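A minimal sketch of the part-level feature extraction is given below; splitting at the exact middle of the feature map is our reading of the horizontal division described above.

```python
import torch

@torch.no_grad()
def part_features(backbone, images):
    """Partial branch: split the layer5 feature map horizontally into upper and
    lower halves and apply global max pooling to each half."""
    fmap = backbone(images)                       # (B, C, H, W)
    h = fmap.size(2) // 2
    f_up = fmap[:, :, :h, :].amax(dim=(2, 3))     # upper-body descriptor
    f_low = fmap[:, :, h:, :].amax(dim=(2, 3))    # lower-body descriptor
    return f_up, f_low
```

Each descriptor set is then clustered independently (e.g., with DBSCAN as in the global branch) to obtain the two pseudo labels per image.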
Based on the pseudo labels, we cross-finetune the re-ID model in a supervised way with triplet loss, similar to the training strategy of the global branch. Specifically, triplet loss is employed with the feature vectors $f^{up}$ and $f^{low}$, respectively.
Global max pooling operation. The procedure of global max pooling and global average pooling is nonparametric. Average pooling calculates the mean of all the pixels within the pooling region, while max pooling only considers the maximum response values (Zhao et al., 2020). For unsupervised clustering, max pooling can produce better performance since the regions with higher response values contain more discriminative information about person images. In the ablation experiments, we evaluate the effect of global max pooling and global average pooling and demonstrate the superiority of the former (GMP).
Loss function. To fine-tune the re-ID model with the pseudo labels generated from global features and part-level features, the full objective function of the two-branch network is formulated as follows,
$$\mathcal{L}_{target} = \mathcal{L}_{cross}(p_t, \tilde{y}) + \lambda_{g}\,\mathcal{L}_{tri}(f) + \lambda_{up}\,\mathcal{L}_{tri}(f^{up}) + \lambda_{low}\,\mathcal{L}_{tri}(f^{low}) \qquad (12)$$
where $p_t$ denotes the output of the classifier layer in the global branch. $\lambda_{g}$, $\lambda_{up}$, and $\lambda_{low}$ are hyper-parameters that control the relative importance of the global feature triplet loss, upper body feature triplet loss, and lower body feature triplet loss, respectively. In this paper, we use $\lambda_{g}=1$, $\lambda_{up}=1$, and $\lambda_{low}=0.5$ in our experiments, since the upper body of person images usually contains more discriminative regions than the lower body.
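Reusing the helpers from the Section 3.1 sketch, Eq. (12) can be assembled as below; applying each triplet term with its own branch-specific pseudo labels and the 0.3 margin from Section 4.2 is our assumption.

```python
LAMBDA_G, LAMBDA_UP, LAMBDA_LOW = 1.0, 1.0, 0.5

def target_domain_loss(logits, f_g, f_up, f_low, y_g, y_up, y_low):
    """Eq. (12): label-smoothed cross entropy on the global classifier output plus
    weighted batch-hard triplet losses on the global and part features."""
    return (cross_entropy_label_smoothing(logits, y_g)
            + LAMBDA_G * batch_hard_triplet(f_g, y_g, margin=0.3)
            + LAMBDA_UP * batch_hard_triplet(f_up, y_up, margin=0.3)
            + LAMBDA_LOW * batch_hard_triplet(f_low, y_low, margin=0.3))
```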
3.4. Overall algorithm
In this section, we describe the overall framework of the proposed method. As shown in Fig. 2, our method consists of three stages. The first stage learns a pre-trained re-ID baseline model on the labeled source training dataset (Section 3.1). Based on this baseline, the second stage trains the model with images transferred from the source dataset to the target dataset, performing multi-domain image translation (Section 3.2). In the third stage, we alternately conduct unsupervised clustering and cross-finetuning of the model on the target training set. Algorithm 1 presents the optimization procedure of our method.
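A high-level sketch of Stage III is shown below; global_pseudo_labels is the clustering helper sketched in Section 3.3, while finetune_one_epoch is a hypothetical routine that optimizes Eq. (12) for one epoch.

```python
def train_stage_three(model_a, model_b, target_loader, num_epochs=40):
    """Stage III sketch: two peer two-branch networks alternately cluster and
    fine-tune, exchanging pseudo labels to alleviate confirmation bias."""
    for epoch in range(num_epochs):
        # cluster with each peer's current features (global branch shown; the
        # partial branch produces its part-level labels in the same way)
        labels_a = global_pseudo_labels(model_a.backbone, target_loader)
        labels_b = global_pseudo_labels(model_b.backbone, target_loader)
        # collaborative re-labeling: each network is fine-tuned with its peer's labels
        finetune_one_epoch(model_a, target_loader, labels_b)   # hypothetical helper
        finetune_one_epoch(model_b, target_loader, labels_a)   # hypothetical helper
```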
4. Experiments
In this section, we evaluate the proposed method on two benchmark datasets: Market-1501 and DukeMTMC-reID, and compare with state-of-the-art unsupervised re-ID methods.
4.1. Datasets
We conduct experiments on two large-scale benchmark datasets, i.e. Market-1501 (Zheng et al., 2015) and DukeMTMC-ReID (Zheng et al., 2017; Ristani et al., 2016).
Market-1501 (Zheng et al., 2015) contains 32,668 images of 1,501 labeled persons from six camera views. Specifically, 12,936 images of 751 identities detected by DPM (Felzenszwalb et al., 2009) are used for training. For testing, in total 19,732 images of 750 identities plus some distractors form the gallery set, and 3,368 manually cropped person regions from 750 identities form the query set.
DukeMTMC-ReID (Zheng et al., 2017; Ristani et al., 2016) contains 1,812 identities captured by 8 cameras. There are 16,522 training images, 2,228 query images, and 17,661 gallery images, with 1,404 identities appearing in more than two cameras. Similar to Market-1501, the remaining 408 identities are considered as distractors.
4.2. Implementation Details
Baseline Model Training. As described in Section 3.1, we first train a baseline model on the source dataset. All training images are resized to a fixed size before being fed into the network. For data augmentation, we employ random cropping and flipping. To meet the requirement of the batch-hard triplet loss, each mini-batch is sampled with P = 16 randomly selected identities and K = 4 randomly sampled images per identity from the training set, so that the mini-batch size is 64. In the baseline training stage, the margin hyperparameter of the triplet loss is set to 0.5 and $\varepsilon$ is set to 0.1 in label smoothing. We use the Adam (Kingma and Ba, 2015) optimizer with a weight decay of 0.0005 to optimize the parameters for 80 epochs. The learning rate is set to 0.00035 and decayed to 1/10 of its value at the 40th and 70th epochs. During testing, we extract the output of the BN layer as the image descriptor and use the Euclidean distance to compute the similarity between query and gallery images.
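The optimization schedule above can be written as the following sketch; model and train_loader are assumed to exist, the model is assumed to return (logits, features), and baseline_loss is the helper sketched in Section 3.1.

```python
import torch

# Adam, weight decay 5e-4, lr 3.5e-4 decayed by 10x at epochs 40 and 70, 80 epochs.
optimizer = torch.optim.Adam(model.parameters(), lr=3.5e-4, weight_decay=5e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[40, 70], gamma=0.1)

for epoch in range(80):
    for images, labels in train_loader:        # P=16 identities x K=4 images each
        logits, features = model(images.cuda())
        loss = baseline_loss(logits, features, labels.cuda())   # Eq. (4)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```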
Cross Domain Camera Style Adaptation Model. We follow the same architecture as StarGAN (Choi et al., 2018). Specifically, the generator network consists of two convolutional layers with a stride of two for downsampling, six residual blocks, and two transposed convolutional layers with a stride of two for upsampling. PatchGANs (Isola et al., 2017) are leveraged for the discriminator network. The input images are resized to a fixed size. We use the Adam optimizer with $\beta_1$=0.5 and $\beta_2$=0.999 for training. The batch size is set to 16. We perform one generator update after five discriminator updates, as in (Gulrajani et al., 2017). In the training stage, we train the model with the Market-1501 and DukeMTMC-reID training sets simultaneously. In the adaptation stage, for each image in the source set, we generate C style-transferred images (C is the number of cameras in the target set). These C fake images are regarded as containing the same person as the original real image. Then we fine-tune the baseline model with the images transferred from the source dataset to the target dataset.
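The generator described above roughly corresponds to the following sketch, in which the target camera label is spatially replicated and concatenated with the input image; the layer widths and normalization choices follow common StarGAN implementations and are assumptions here.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(dim, dim, 3, 1, 1, bias=False),
            nn.InstanceNorm2d(dim, affine=True), nn.ReLU(inplace=True),
            nn.Conv2d(dim, dim, 3, 1, 1, bias=False),
            nn.InstanceNorm2d(dim, affine=True))

    def forward(self, x):
        return x + self.block(x)

class Generator(nn.Module):
    """StarGAN-style generator: 2 downsampling convs, 6 residual blocks, 2 upsampling convs."""
    def __init__(self, num_domains, dim=64):
        super().__init__()
        layers = [nn.Conv2d(3 + num_domains, dim, 7, 1, 3, bias=False),
                  nn.InstanceNorm2d(dim, affine=True), nn.ReLU(inplace=True)]
        for _ in range(2):                       # downsampling, stride 2
            layers += [nn.Conv2d(dim, dim * 2, 4, 2, 1, bias=False),
                       nn.InstanceNorm2d(dim * 2, affine=True), nn.ReLU(inplace=True)]
            dim *= 2
        layers += [ResidualBlock(dim) for _ in range(6)]
        for _ in range(2):                       # upsampling, stride 2
            layers += [nn.ConvTranspose2d(dim, dim // 2, 4, 2, 1, bias=False),
                       nn.InstanceNorm2d(dim // 2, affine=True), nn.ReLU(inplace=True)]
            dim //= 2
        layers += [nn.Conv2d(dim, 3, 7, 1, 3, bias=False), nn.Tanh()]
        self.main = nn.Sequential(*layers)

    def forward(self, x, c):
        # c: (B, num_domains) one-hot camera code, replicated spatially
        c = c.view(c.size(0), c.size(1), 1, 1).expand(-1, -1, x.size(2), x.size(3))
        return self.main(torch.cat([x, c], dim=1))
```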
Unsupervised re-ID model training. For unsupervised training with the target dataset, we train the model for a total of 40 epochs. In each epoch, we perform the unsupervised clustering algorithm and deep re-ID model training alternately. The DBSCAN (Ester et al., 1996) clustering method is used to obtain the pseudo label of each image. Then the newly organized dataset is utilized to finetune the re-ID model. For data augmentation, random cropping, random flipping, and random erasing (Zhong et al., 2020) are applied. The margin hyperparameter of the triplet loss is set to 0.3. We use Adam with a weight decay of 0.0005 and a learning rate of 0.00035 to optimize the parameters.
Training set | Market-1501: mAP | R1 | R5 | R10 | DukeMTMC-reID: mAP | R1 | R5 | R10
---|---|---|---|---|---|---|---|---
Market-1501 | 79.6 | 92.6 | 97.2 | 98.3 | 22.3 | 37.6 | 54.2 | 59.9 |
DukeMTMC-reID | 24.6 | 53.7 | 69.3 | 74.9 | 69.4 | 83.3 | 92.1 | 94.7 |
4.3. Experimental Results
We conduct cross-domain person re-ID evaluation and compare the results with state-of-the-art methods, including two hand-crafted features, i.e., Bag-of-Words (BoW) (Zheng et al., 2015) and local maximal occurrence (LOMO) (Liao et al., 2015), four unsupervised domain adaptation methods, including PTGAN (Wei et al., 2018), SPGAN (Deng et al., 2018), ARN (Li et al., 2018a) and UDAP (Song et al., 2020), and nine unsupervised methods, including PUL (Fan et al., 2018), HHL (Zhong et al., 2018), MAR (Yu et al., 2019), ECN (Zhong et al., 2019), SSG (Fu et al., 2019), MMT (Ge et al., 2020), DCF (Li et al., 2021), GLT (Zheng et al., 2021a) and UNRN (Zheng et al., 2021b). Specifically, we use Market-1501 as the source dataset and DukeMTMC-ReID as the target dataset and report the results on the DukeMTMC-ReID test set, and vice versa.
Baseline performance. Here we report the baseline model performance in Table 1. When trained with a labeled dataset and tested on the same dataset, the baseline model achieves a rank-1 accuracy of 92.6% and mAP of 79.6% on Market-1501. However, due to the existence of a domain gap, the performance drops significantly when directly used for another dataset. The rank-1 accuracy declines to 37.6% when tested on DukeMTMC-reID dataset.
Comparison with the state-of-the-art methods on Market-1501 Dataset. Table 2 presents the comparisons when testing on the Market-1501 dataset. Our method outperforms all four unsupervised domain adaptation approaches: the mAP reaches 81.0%, surpassing UDAP (Song et al., 2020) by 27.3%. Compared with other unsupervised methods, which benefit from initializing the model with labeled source data and learning with unlabeled target data, our method is also superior. Compared with MMT (Ge et al., 2020), our results are higher by 7.2% in rank-1 accuracy and 12.0% in mAP.
Method | Duke→Market: rank-1 | rank-5 | rank-10 | mAP | Market→Duke: rank-1 | rank-5 | rank-10 | mAP
---|---|---|---|---|---|---|---|---
LOMO (Liao et al., 2015) | 27.2 | 41.6 | 49.1 | 8.0 | 12.3 | 21.3 | 26.6 | 4.8 |
Bow (Zheng et al., 2015) | 35.8 | 52.4 | 60.3 | 14.8 | 17.1 | 28.8 | 34.9 | 8.3 |
PTGAN (Wei et al., 2018) | 38.6 | - | 66.1 | - | 27.4 | - | 50.7 | - |
SPGAN (Deng et al., 2018) | 51.5 | 70.1 | 76.8 | 22.8 | 41.1 | 56.6 | 63.0 | 22.3
SPGAN+LMP (Deng et al., 2018) | 57.7 | 75.8 | 82.4 | 26.7 | 46.4 | 62.3 | 68.0 | 26.2
ARN (Li et al., 2018a) | 70.3 | 80.4 | 86.3 | 39.4 | 60.2 | 73.9 | 79.5 | 33.4 |
UDAP (Song et al., 2020) | 75.8 | 89.5 | 93.2 | 53.7 | 68.4 | 80.1 | 83.5 | 49.0 |
PUL (Fan et al., 2018) | 45.5 | 60.7 | 66.7 | 20.5 | 30.0 | 43.4 | 48.5 | 16.4 |
HHL (Zhong et al., 2018) | 62.2 | 78.8 | 84.0 | 31.4 | 46.9 | 61.0 | 66.7 | 27.2 |
MAR (Yu et al., 2019) | 67.7 | 81.9 | - | 40.0 | 67.1 | 79.8 | - | 48.0 |
ECN (Zhong et al., 2019) | 75.1 | 87.6 | 91.6 | 43.0 | 63.3 | 75.8 | 80.4 | 40.4 |
SSG (Fu et al., 2019) | 80.0 | 90.0 | 92.4 | 58.3 | 73.0 | 80.6 | 83.2 | 53.4 |
MMT (Ge et al., 2020) | 86.8 | 94.6 | 96.9 | 69.0 | 78.0 | 88.8 | 92.5 | 65.1 |
DCF (Li et al., 2021) | 86.1 | 94.2 | 96.0 | 67.6 | 75.8 | 86.5 | 89.4 | 58.3 |
GLT (Zheng et al., 2021a) | 92.2 | 96.5 | 97.8 | 79.5 | 82.0 | 90.2 | 92.8 | 69.2 |
UNRN (Zheng et al., 2021b) | 91.9 | 96.1 | 97.8 | 78.1 | 82.0 | 90.7 | 93.5 | 69.1 |
CMFC(Ours) | 94.0 | 97.1 | 98.3 | 81.0 | 83.2 | 91.6 | 94.0 | 71.2 |
Comparison with the state-of-the-art methods on DukeMTMC-ReID Dataset. A similar improvement can be observed on the DukeMTMC-reID dataset. As shown in Table 2, our method achieves rank-1 accuracy = 83.2% and mAP = 71.2%, outperforming all the competing UDA methods. For example, compared with UDAP (Song et al., 2020), our results are higher by 14.8% in rank-1 accuracy and 22.2% in mAP. When compared with unsupervised methods, our method is still superior to most existing approaches, and it also shows a clear advantage over MMT in accuracy.
4.4. Ablation study
We also conduct extensive ablation studies to analyze the effectiveness of each component of the proposed method.
Effectiveness of Cross Domain Camera Style Adaptation Module. In this paper, we first train a baseline model on the source dataset. Then we try to reduce the distribution discrepancy at the image level by training a multi-domain image-to-image generator and transferring images from the source dataset to the target dataset. Specifically, for each image in the Market-1501 dataset, we generate eight fake images and assign them the same label as the original image; for each image in the DukeMTMC-reID dataset, we generate six fake images. Fig. 3 shows generated samples. In this way, we can train the model in a supervised way with the generated images. For example, given Market-1501 as the source dataset and DukeMTMC-reID as the target dataset, each image in Market-1501 is transferred to the eight camera styles of DukeMTMC-reID, and the generated images with assigned labels are utilized for supervised re-ID training.
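The per-camera generation step can be sketched as follows; a one-hot camera code and a generator G trained as in Section 3.2 are assumed.

```python
import torch

@torch.no_grad()
def camera_style_augment(G, image, identity_label, num_target_cams):
    """Translate one labeled source image into every target camera style; each
    generated image keeps the identity label of the original image."""
    fakes, labels = [], []
    for cam in range(num_target_cams):
        code = torch.zeros(1, num_target_cams)
        code[0, cam] = 1.0                      # one-hot target camera code
        fakes.append(G(image.unsqueeze(0), code).squeeze(0))
        labels.append(identity_label)
    return fakes, labels
```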
Method | Market→Duke: rank-1 | rank-5 | rank-10 | mAP | Duke→Market: rank-1 | rank-5 | rank-10 | mAP
---|---|---|---|---|---|---|---|---
Basel. | 37.6 | 54.2 | 59.9 | 22.3 | 53.7 | 69.3 | 74.9 | 24.6 |
CDCSA | 53.8 | 67.8 | 72.2 | 31.4 | 66.8 | 82.1 | 87.9 | 34.0 |

We report the results of the re-ID model trained with the transferred images in Table 3. Specifically, when training with the images transferred from Market-1501 to DukeMTMC-ReID, the model achieves rank-1 = 53.8% and mAP = 31.4% on the DukeMTMC-ReID test set. Compared to PTGAN (Wei et al., 2018) and SPGAN (Deng et al., 2018) (see Table 2), our generation model yields better results, which also validates that viewing each camera as an individual domain results in better generation quality. Similarly, when training with the images generated from DukeMTMC-ReID, the model achieves rank-1 accuracy = 66.8% and mAP = 34.0%, surpassing PTGAN and SPGAN. We further conduct experiments to validate the effectiveness of the different loss functions. Specifically, when training with images generated by the original StarGAN with adversarial loss, domain classification loss, and reconstruction loss only, the model achieves rank-1 accuracy = 53.4% and mAP = 30.8% on the DukeMTMC-ReID dataset (see Table 4). Finally, after adding the identity mapping loss and the person identity preserve loss to constrain the training process, our generation model achieves better results.
Method | Market-1501→DukeMTMC-reID: rank-1 | rank-5 | rank-10 | mAP
---|---|---|---|---
StarGAN | 53.4 | 67.3 | 71.8 | 30.8 |
StarGAN + $\mathcal{L}_{idt}$ | 53.6 | 67.4 | 72.8 | 31.2
StarGAN + $\mathcal{L}_{pid}$ | 53.0 | 67.9 | 73.0 | 31.0
StarGAN + $\mathcal{L}_{idt}$ + $\mathcal{L}_{pid}$ | 53.8 | 67.8 | 72.2 | 31.4
Effectiveness of CMFC framework. To verify the effectiveness of our two-branch network and the collaborative pseudo re-labeling strategy for the target domain, we further conduct experiments over baseline. The results are shown in Table 5. Specifically, the model achieves rank-1 accuracy = 83.2% and mAP = 71.2% when tested on DukeMTMC-reID dataset. The model achieves 81.0% and 94.0% on mAP and rank-1 accuracy when tested on Market-1501. Both the proposed two-branch framework and the collaborative re-labeling strategy boost the overall performance.
Specifically, to isolate the contribution of the different branches, the ablation studies are evaluated over the baseline. First, compared with the baseline model, both the global and partial feature branches consistently improve the performance, indicating that mining the potential similarity on the target domain benefits discriminative feature learning. The global feature branch obtains higher results than the partial feature branch; we believe this is because the global feature leads to better clustering quality. Moreover, the partial feature branch can strengthen the representation by mining additional information from different body regions. The final two-branch results demonstrate the effectiveness of combining the global and partial branches. A comparison of visual retrieval results on the Market-1501 dataset between the overall two-branch framework and the individual branches is shown in Fig. 4; the two-branch framework achieves better results than either branch alone. Besides, although the cross-domain camera style adaptation module provides a higher baseline, the additional gain it brings on top of the two-branch network is marginal, suggesting that mining identity similarities on the target domain has more potential.
Method | Duke→Market: rank-1 | rank-5 | rank-10 | mAP | Market→Duke: rank-1 | rank-5 | rank-10 | mAP
---|---|---|---|---|---|---|---|---
Basel. | 53.7 | 69.3 | 74.9 | 24.6 | 37.6 | 54.2 | 59.9 | 22.3 |
Re-labeling(G) | 86.9 | 93.1 | 95.2 | 74.3 | 78.2 | 87.6 | 90.0 | 66.2 |
Re-labeling(P) | 78.2 | 85.3 | 88.2 | 61.4 | 69.2 | 81.6 | 84.0 | 58.2 |
Two-branch | 88.0 | 94.5 | 96.3 | 76.0 | 75.1 | 86.0 | 89.5 | 64.9 |
CMFC | 94.0 | 97.1 | 98.3 | 81.0 | 83.2 | 91.6 | 94.0 | 71.2 |

Effect of different pooling operations. We also evaluate the effect of different pooling operations on the two-branch framework. We employ the global max pooling (GMP) and global average pooling (GAP) operations on the feature maps to obtain feature vectors, respectively. As shown in Table 6, GMP surpasses GAP in both transfer directions; for example, in the Duke→Market setting the model achieves rank-1 accuracy = 91.3% and mAP = 78.0% with GMP, compared to 89.6% and 73.2% with GAP. A similar improvement is observed in the Market→Duke setting. The superiority of GMP probably lies in that max pooling filters out some detrimental signals and focuses on the high response values of the feature maps, which benefits the discriminative feature extraction of pedestrian images.
Method | Duke→Market: rank-1 | rank-5 | rank-10 | mAP | Market→Duke: rank-1 | rank-5 | rank-10 | mAP
---|---|---|---|---|---|---|---|---
GMP | 91.3 | 93.3 | 95.3 | 78.0 | 80.7 | 89.1 | 92.0 | 68.4 |
GAP | 89.6 | 90.5 | 93.8 | 73.2 | 75.2 | 86.0 | 90.1 | 64.7 |
Influences of hyper-parameters. The weights of the different loss terms in Eq. (12) are key hyper-parameters affecting the performance of feature representation learning. $\lambda_{g}$, $\lambda_{up}$, and $\lambda_{low}$ control the relative importance of the whole-body, upper-body, and lower-body similarity constraints. In the experiments, we set $\lambda_{g}=1$, $\lambda_{up}=1$ and change the value of $\lambda_{low}$, which is set to 0.5 and 1 respectively; the evaluation results are shown in Table 7. Specifically, $\lambda_{low}=0.5$ yields better accuracy. A possible explanation is that the upper body contains more discriminative information about person images than the lower body; however, the lower body still contributes to feature learning on the target domain.
Method | $\lambda_{low}$ | Duke→Market: rank-1 | rank-5 | rank-10 | mAP | Market→Duke: rank-1 | rank-5 | rank-10 | mAP
---|---|---|---|---|---|---|---|---|---
Two-branch | 1 | 92.7 | 95.9 | 97.2 | 79.3 | 81.5 | 90.1 | 92.2 | 69.3 |
Two-branch | 0.5 | 94.0 | 97.1 | 98.3 | 81.0 | 83.2 | 91.6 | 94.0 | 71.2 |
partial-branch | 1 | 75.4 | 81.2 | 87.5 | 63.4 | 69.6 | 71.2 | 84.1 | 58.5 |
partial-branch | 0.5 | 81.3 | 86.9 | 91.3 | 70.1 | 74.2 | 78.0 | 86.0 | 62.3 |
5. Conclusion
This paper focuses on unsupervised person re-ID. Specifically, we perform unsupervised domain adaptation with labeled source training images and unsupervised person re-ID with unlabeled target training images. Besides, a person identity preserve loss and an identity mapping loss are utilized to change the style of images while preserving the identity. Moreover, we propose a novel collaborative multiple feature clustering framework (CMFC) for learning representations on the target dataset, with a global feature guided training branch and a partial feature guided training branch. Extensive quantitative experiments validate that learning the potential data similarities on the target domain indeed improves the discriminative representation ability of the person re-ID model. Our method achieves state-of-the-art performance under unsupervised re-ID settings on both datasets.
References
- Bai et al. (2017) Song Bai, Xiang Bai, and Qi Tian. 2017. Scalable person re-identification on supervised smoothed manifold. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2530–2539.
- Bai et al. (2020) Xiang Bai, Mingkun Yang, Tengteng Huang, Zhiyong Dou, Rui Yu, and Yongchao Xu. 2020. Deep-person: Learning discriminative deep features for person re-identification. Pattern Recognition 98 (2020), 107036.
- Bousmalis et al. (2017) Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. 2017. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 3722–3731.
- Cao et al. (2020) Min Cao, Chen Chen, Hao Dou, Xiyuan Hu, Silong Peng, and Arjan Kuijper. 2020. Progressive Bilateral-Context Driven Model for Post-Processing Person Re-Identification. IEEE Transactions on Multimedia (2020). https://doi.org/10.1109/TMM.2020.2994524
- Choi et al. (2018) Yunjey Choi, Minje Choi, Munyoung Kim, Jung-Woo Ha, Sunghun Kim, and Jaegul Choo. 2018. Stargan: Unified generative adversarial networks for multi-domain image-to-image translation. In Proceedings of the IEEE conference on computer vision and pattern recognition. 8789–8797.
- Deng et al. (2018) Weijian Deng, Liang Zheng, Qixiang Ye, Guoliang Kang, Yi Yang, and Jianbin Jiao. 2018. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition. 994–1003.
- Ester et al. (1996) Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu, et al. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise.. In Kdd, Vol. 96. 226–231.
- Fan et al. (2018) Hehe Fan, Liang Zheng, Chenggang Yan, and Yi Yang. 2018. Unsupervised person re-identification: Clustering and fine-tuning. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) 14, 4 (2018), 1–18.
- Felzenszwalb et al. (2009) Pedro F Felzenszwalb, Ross B Girshick, David McAllester, and Deva Ramanan. 2009. Object detection with discriminatively trained part-based models. IEEE transactions on pattern analysis and machine intelligence 32, 9 (2009), 1627–1645.
- Fu et al. (2019) Yang Fu, Yunchao Wei, Guanshuo Wang, Yuqian Zhou, Honghui Shi, and Thomas S Huang. 2019. Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision. 6112–6121.
- Ge et al. (2020) Yixiao Ge, Dapeng Chen, and Hongsheng Li. 2020. Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification. In International Conference on Learning Representations.
- Gulrajani et al. (2017) Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. 2017. Improved training of wasserstein gans. Advances in neural information processing systems 30 (2017), 5767–5777.
- He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
- He et al. (2019) Lingxiao He, Yinggang Wang, Wu Liu, He Zhao, Zhenan Sun, and Jiashi Feng. 2019. Foreground-aware pyramid reconstruction for alignment-free occluded person re-identification. In Proceedings of the IEEE International Conference on Computer Vision. 8450–8459.
- Hermans et al. (2017) Alexander Hermans, Lucas Beyer, and Bastian Leibe. 2017. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737 (2017).
- Hoffman et al. (2018) Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. 2018. Cycada: Cycle-consistent adversarial domain adaptation. In International conference on machine learning. PMLR, 1989–1998.
- Hu et al. (2015) Junlin Hu, Jiwen Lu, and Yap-Peng Tan. 2015. Deep transfer metric learning. In Proceedings of the IEEE conference on computer vision and pattern recognition. 325–333.
- Isola et al. (2017) Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. 2017. Image-to-image translation with conditional adversarial networks. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1125–1134.
- Kingma and Ba (2015) Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization, In 3rd International Conference on Learning Representations. arXiv preprint arXiv:1412.6980.
- Li et al. (2018b) Wei Li, Xiatian Zhu, and Shaogang Gong. 2018b. Harmonious attention network for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2285–2294.
- Li et al. (2018a) Yu-Jhe Li, Fu-En Yang, Yen-Cheng Liu, Yu-Ying Yeh, Xiaofei Du, and Yu-Chiang Frank Wang. 2018a. Adaptation and re-identification network: An unsupervised deep transfer learning approach to person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops. 172–178.
- Li et al. (2021) Zhihao Li, Bing Han, Xinbo Gao, Biao Hou, and Zongyuan Liu. 2021. Distance constraint between features for unsupervised domain adaptive person re-identification. Neurocomputing 462 (2021), 113–122.
- Liang et al. (2018) Wenqi Liang, Guangcong Wang, Jianhuang Lai, and Junyong Zhu. 2018. M2m-gan: Many-to-many generative adversarial transfer learning for person re-identification. arXiv preprint arXiv:1811.03768 (2018).
- Liao et al. (2015) Shengcai Liao, Yang Hu, Xiangyu Zhu, and Stan Z Li. 2015. Person re-identification by local maximal occurrence representation and metric learning. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2197–2206.
- Peng et al. (2016) Peixi Peng, Tao Xiang, Yaowei Wang, Massimiliano Pontil, Shaogang Gong, Tiejun Huang, and Yonghong Tian. 2016. Unsupervised cross-dataset transfer learning for person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition. 1306–1315.
- Ristani et al. (2016) Ergys Ristani, Francesco Solera, Roger Zou, Rita Cucchiara, and Carlo Tomasi. 2016. Performance measures and a data set for multi-target, multi-camera tracking. In Computer Vision – ECCV 2016 Workshops. Springer, 17–35.
- Song et al. (2020) Liangchen Song, Cheng Wang, Lefei Zhang, Bo Du, Qian Zhang, Chang Huang, and Xinggang Wang. 2020. Unsupervised domain adaptive re-identification: Theory and practice. Pattern Recognition 102 (2020), 107173.
- Sun et al. (2016) Baochen Sun, Jiashi Feng, and Kate Saenko. 2016. Return of frustratingly easy domain adaptation. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence. 2058––2065.
- Szegedy et al. (2016) Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. 2016. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition. 2818–2826.
- Wang et al. (2019) Xinlong Wang, Tao Kong, Chunhua Shen, Yuning Jiang, and Lei Li. 2019. Solo: Segmenting objects by locations. arXiv preprint arXiv:1912.04488 (2019).
- Wang et al. (2018) Zheng Wang, Mang Ye, Fan Yang, Xiang Bai, and Shin’ichi Satoh. 2018. Cascaded SR-GAN for Scale-Adaptive Low Resolution Person Re-identification. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence. 3891–3897.
- Wei et al. (2018) Longhui Wei, Shiliang Zhang, Wen Gao, and Qi Tian. 2018. Person transfer gan to bridge domain gap for person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 79–88.
- Yang et al. (2020a) Fan Yang, Ke Yan, Shijian Lu, Huizhu Jia, Don Xie, Zongqiao Yu, Xiaowei Guo, Feiyue Huang, and Wen Gao. 2020a. Part-aware progressive unsupervised domain adaptation for person re-identification. IEEE Transactions on Multimedia (2020). https://doi.org/10.1109/TMM.2020.3001522
- Yang et al. (2020b) Fengxiang Yang, Zhun Zhong, Zhiming Luo, Sheng Lian, and Shaozi Li. 2020b. Leveraging virtual and real person for unsupervised person re-identification. IEEE Transactions on Multimedia 22, 9 (Sep. 2020), 2444–2453.
- Yu et al. (2019) Hong-Xing Yu, Wei-Shi Zheng, Ancong Wu, Xiaowei Guo, Shaogang Gong, and Jian-Huang Lai. 2019. Unsupervised person re-identification by soft multilabel learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2148–2157.
- Yu et al. (2018) Rui Yu, Zhiyong Dou, Song Bai, Zhaoxiang Zhang, Yongchao Xu, and Xiang Bai. 2018. Hard-aware point-to-set deep metric for person re-identification. In Proceedings of the European Conference on Computer Vision. 188–204.
- Zhang et al. (2019) Xinyu Zhang, Jiewei Cao, Chunhua Shen, and Mingyu You. 2019. Self-training with progressive augmentation for unsupervised cross-domain person re-identification. In Proceedings of the IEEE International Conference on Computer Vision. 8222–8231.
- Zhao et al. (2020) Cairong Zhao, Xinbi Lv, Zhang Zhang, Wangmeng Zuo, Jun Wu, and Duoqian Miao. 2020. Deep fusion feature representation learning with hard mining center-triplet loss for person re-identification. IEEE Transactions on Multimedia 22, 12 (Dec. 2020), 3180–3195.
- Zheng et al. (2021a) Kecheng Zheng, Cuiling Lan, Wenjun Zeng, Zhizheng Zhang, and Zheng-Jun Zha. 2021a. Exploiting sample uncertainty for domain adaptive person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 3538–3546.
- Zheng et al. (2021b) Kecheng Zheng, Wu Liu, Lingxiao He, Tao Mei, Jiebo Luo, and Zheng-Jun Zha. 2021b. Group-aware label transfer for domain adaptive person re-identification. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 5310–5319.
- Zheng et al. (2015) Liang Zheng, Liyue Shen, Lu Tian, Shengjin Wang, Jingdong Wang, and Qi Tian. 2015. Scalable person re-identification: A benchmark. In Proceedings of the IEEE international conference on computer vision. 1116–1124.
- Zheng et al. (2017) Zhedong Zheng, Liang Zheng, and Yi Yang. 2017. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In Proceedings of the IEEE International Conference on Computer Vision. 3754–3762.
- Zhong et al. (2017) Zhun Zhong, Liang Zheng, Donglin Cao, and Shaozi Li. 2017. Re-ranking person re-identification with k-reciprocal encoding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1318–1327.
- Zhong et al. (2020) Zhun Zhong, Liang Zheng, Guoliang Kang, Shaozi Li, and Yi Yang. 2020. Random Erasing Data Augmentation.. In AAAI. 13001–13008.
- Zhong et al. (2018) Zhun Zhong, Liang Zheng, Shaozi Li, and Yi Yang. 2018. Generalizing a person retrieval model hetero-and homogeneously. In Proceedings of the European Conference on Computer Vision. 172–188.
- Zhong et al. (2019) Zhun Zhong, Liang Zheng, Zhiming Luo, Shaozi Li, and Yi Yang. 2019. Invariance matters: Exemplar memory for domain adaptive person re-identification. In Proceedings of the IEEE conference on computer vision and pattern recognition. 598–607.
- Zhu et al. (2017) Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer vision. 2223–2232.