
Unsupervised Person Re-Identification with Multi-Label Learning Guided Self-Paced Clustering

Qing Li1,3, Xiaojiang Peng2, Yu Qiao3, Qi Hao1
1School of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China
2College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China

3Department of Multimedia Laboratory, Shenzhen Institutes of Advanced Technology,
Chinese Academy of Sciences, Shenzhen, China
[email protected], [email protected], [email protected], [email protected]
  
Abstract

Although unsupervised person re-identification (Re-ID) has drawn increasing research attention recently, it remains challenging to learn discriminative features without annotations across disjoint camera views. In this paper, we address unsupervised person Re-ID with a conceptually novel yet simple framework, termed Multi-label Learning guided self-paced Clustering (MLC). MLC learns discriminative features with three crucial modules, namely a multi-scale network, a multi-label learning module, and a self-paced clustering module. Specifically, the multi-scale network generates multi-granularity person features in both global and local views. The multi-label learning module leverages a memory feature bank and assigns each image a multi-label vector based on the similarities between the image and the feature bank. After multi-label training for several epochs, the self-paced clustering module joins the training and assigns a pseudo label to each image. The benefits of our MLC come from three aspects: i) the multi-scale person features enable better similarity measurement; ii) the multi-label assignment over the whole dataset ensures that every image contributes to training; and iii) the self-paced clustering removes noisy samples for better feature learning. Extensive experiments on three popular large-scale Re-ID benchmarks demonstrate that our MLC outperforms previous state-of-the-art methods and significantly improves the performance of unsupervised person Re-ID.

1 Introduction

Person re-identification (Re-ID) aims at searching for people across non-overlapping surveillance camera views deployed at different locations by matching person images [22, 59, 35]. Due to its importance in smart cities and large-scale surveillance systems, person Re-ID has become a well-established research problem in computer vision [13, 17, 62, 20]. Although great progress has been made in both benchmarks and approaches in recent years, person Re-ID remains an open and challenging problem due to the difficulty of learning robust and discriminative representations under large intra-person appearance variations and high inter-person similarity.

Over the past decades, most existing person Re-ID works have focused on feature design and metric learning [25, 4, 46]. Recently, modern deep learning has been applied in the Re-ID community and achieved significant progress [22, 5, 57]. Most of these works tackle Re-ID in a supervised manner and are thus limited by the small scale of labeled Re-ID datasets. In contrast, collecting unlabeled pedestrian images is much easier and cheaper, so training deep networks on large-scale unlabeled data becomes increasingly necessary and practical.

In fact, unsupervised learning for person Re-ID has become a hot topic in recent years [6, 52, 33, 44]. There are mainly two types of unsupervised person Re-ID methods. The first is based on unsupervised domain adaptation (UDA), where the source domain is usually a labeled dataset and the target domain is an unlabeled dataset. Most UDA-based methods use transfer learning to acquire knowledge from the labeled source Re-ID dataset and transfer it to the target dataset [39, 32, 48, 42]. Specifically, some works use generative adversarial networks (GANs) to transfer sample images from the source domain to the target domain while preserving person identity as much as possible [41, 63, 65]. Others first train models on the source domain, then leverage self-supervised learning and clustering to iteratively estimate pseudo labels on the target domain and fine-tune the pre-trained model [33, 44, 9, 19]. The main disadvantages of UDA-based Re-ID are two-fold. On the one hand, it still needs expensive labeled data, and the performance is usually limited by the scale of the labeled source dataset. On the other hand, these methods ignore the sample relations between the source and target datasets.

The second type of unsupervised Re-ID method is based on fully unsupervised learning, whose goal is to learn discriminative representations from large-scale unlabeled data. Most of these methods use clustering to generate pseudo labels. For example, Lin et al. [27] propose a bottom-up clustering (BUC) framework that trains a network with pseudo labels iteratively. However, the inaccuracy of clustering under large intra-class variations makes the pseudo labels noisy, which in turn leads to poor performance. To avoid wrong merging and make full use of all images, Ding et al. [7] propose an elegant and practical density-based clustering approach that incorporates a cluster validity criterion. Wang et al. [34] treat unsupervised person Re-ID as a multi-label classification task to progressively seek true labels. They introduce a Memory-based Multi-label Classification Loss (MMCL) method, which iteratively predicts multi-labels and updates the network with a multi-label classification loss. As illustrated in Figure 1, the density-based clustering strategy tries to keep high-purity samples for model training but may ignore useful hard samples, while the multi-label strategy keeps all samples in memory, which may introduce noisy samples into the training phase.

Figure 1: Comparison of recent unsupervised person Re-ID methods and our multi-label learning guided self-paced clustering (MLC) method.

In this paper, to address the above issues of the clustering and multi-label strategies, we propose a conceptually novel yet simple framework for unsupervised person Re-ID, termed multi-label learning guided self-paced clustering (MLC). Specifically, MLC learns discriminative information with three crucial modules, namely a multi-scale network (MN), a multi-label learning (ML) module, and a self-paced clustering (SC) module. The MN module mines multi-scale person features for better similarity measurement. Compared to previous methods that extract only global features, the MN module captures more non-salient or infrequent local information. Local feature learning has been demonstrated to be an effective strategy for enhancing feature representations [10] and is complementary to global features [36, 12]. The ML module generates a multi-label vector for each image based on a memory bank. Specifically, each sample in the memory bank is viewed as a single class, and a sample is assigned a multi-hot vector whose entries are activated if the sample is similar to the corresponding indexed samples in memory. In this way, images with the same identity obtain similar multi-label vectors. To avoid training on noisy samples, which may hurt the final model, the SC module is added after several epochs of ML training. The SC module removes noisy samples via a density-based clustering algorithm and assigns pseudo labels for multi-class training. We jointly train the whole network in an end-to-end manner.

We evaluate the proposed MLC framework on three large-scale datasets, including Market-1501, DukeMTMC-reID, and MSMT17, without leveraging their annotations. Experimental results show that our MLC significantly improves the performance of unsupervised person Re-ID without any annotations and achieves performance superior or comparable to the state-of-the-art methods.

2 Related work

In this section, we review person re-identification (Re-ID) from the perspectives of supervised learning, unsupervised domain adaptation (UDA), and unsupervised learning.

2.1 Supervised Learning for Person Re-ID

Most existing person Re-ID methods employ supervised model learning on per-camera-pair, manually labeled pairwise training data. The main techniques of these methods focus on distance metric or subspace learning [4, 46] and view-invariant discriminative feature learning [25, 37, 56]. With the surge of deep learning techniques, the field of supervised person Re-ID has witnessed rapid progress in recent years [62, 23, 1]. Generally, most of these methods assume that person images are well aligned. In practice, it is difficult and impractical to obtain perfect annotations while person poses keep changing. To overcome this limitation, many works have adopted attentional deep learning approaches to tackle the misalignment problem. However, supervised learning methods rely on substantial, costly, and time-consuming labeled training data, which limits their scalability and practicability. Although these methods show a certain generalization ability on labeled data, supervised Re-ID methods still lack effective and practical applications on unlabeled data.

2.2 UDA for Person Re-ID

With the progress of deep learning on unsupervised feature learning, researchers have begun to apply deep learning to unsupervised person Re-ID tasks [27, 43]. Open-set domain adaptation has been extensively applied to image classification tasks [31, 11], where several classes are unknown in the two domains (or in the target domain). Recently, the domain adaptation strategy has also been widely used for unsupervised person Re-ID [6, 32, 39]. Most of these UDA methods rely on transfer learning to acquire knowledge from the source dataset and transfer it to the target dataset, where the former is an auxiliary and necessarily labeled dataset while the latter is unlabeled [52, 42, 21, 51]. However, the classes of the two domains are entirely different for UDA in person Re-ID, which presents a greater challenge.

To address this domain adaptation problem, there are three typical categories of methods. The first category explores image-style transformation from the labeled source domain to the unlabeled target domain [6, 39, 41]. In [41], PTGAN enforces the self-similarity of an image before and after translation and the domain dissimilarity between a translated source image and a target image. In [62], Zhong et al. propose a Hetero-Homogeneous Learning (HHL) method to learn a camera-invariant network for the target domain. However, HHL overlooks the latent positive pairs in the target domain, which might make the Re-ID model sensitive to background or pose variations in the target domain. To overcome these drawbacks, Zhang et al. [55] propose a self-training method with a progressive augmentation framework to progressively promote model performance on the target dataset.

The second category of methods utilizes model distillation, using a teacher model to guide the learning of a student model. A vast majority of knowledge distillation methods have been adopted for Re-ID to alleviate cross-camera scene variation explicitly or implicitly [13, 42]. In [42], a multi-teacher adaptive similarity distillation framework is proposed to learn a user-specified lightweight student model from multiple teacher models without access to source domain data. Wu et al. [43] propose to learn consistent pairwise similarity distributions for intra-camera and cross-camera matching with the guidance of prior common knowledge of intra-camera matching.

The third category of methods attempts to optimize the Re-ID model with soft labels for target-domain samples by measuring similarities with reference images or features. Zhong et al. [63] investigate the impact of intra-domain variations and impose three types of invariance constraints on target samples. They assign soft labels and minimize the target invariance with an exemplar memory module [16, 40], which caches the feature vector of every instance. Yu et al. [51] propose a deep soft multi-label reference learning (MAR) method, which conducts soft multi-label learning by comparing each sample with a set of reference persons. Ge et al. [14] design an asymmetrical framework that generates more robust soft labels via mutual mean-teaching. However, these methods need to learn one or several effective teacher models, whose quality depends on the diversity and quality of the source domain data. Moreover, they still need labeled source data and do not explore the sample similarities between the source and target domains. In contrast, our method not only discards the requirement for labeled source-domain data but also mines sample similarities directly on the target data. Specifically, we leverage the MN module to mine multi-granularity person features and the ML module to store these up-to-date multi-scale features for the whole target dataset.

Figure 2: The pipeline of our MLC framework for unsupervised person Re-ID. It mainly contains three sub-components (indicated by three colored dotted rectangles): 1) the multi-scale network (MN), 2) the multi-label learning (ML) module, and 3) the self-paced clustering (SC) module. Given an image, the multi-scale network extracts both a global feature from the whole image and local features from sub-regions by global max pooling. The ML module leverages an updatable memory bank to assign multi-label vectors to all unlabeled images and uses a multi-label classification loss for training. The SC module filters noisy samples and assigns pseudo labels by step-wise clustering and sampling.

2.3 Unsupervised Learning for Person Re-ID

Unsupervised learning methods have attracted much attention because of their capability to save the cost of manual annotation. Their main task has shifted to fully mining the useful information in ever-growing unlabeled datasets. Unsupervised learning methods for person Re-ID generally involve two families: traditional unsupervised methods and clustering-guided deep learning methods.

Traditional unsupervised person Re-ID mainly focuses on feature learning, creating hand-crafted features [2] that can be applied directly to unlabeled datasets. In earlier works, Liao et al. [25] propose the local maximal occurrence (LOMO) descriptor, which includes color and SILTP histograms. Zheng et al. [58] propose to extract an 11-dim color names descriptor for each local patch and aggregate them into a global vector through a Bag-of-Words model. It is worth pointing out that the performance of these methods on unlabeled datasets is not satisfactory; recent studies suggest that it is difficult to learn robust and discriminative features with traditional unsupervised methods.

With the surge of deep learning techniques, recent studies have focused on clustering-guided deep learning methods for unsupervised person Re-ID [52, 50, 38], mainly because clustering is intuitive and efficient for unsupervised machine learning. These methods generally generate pseudo labels on the target domain and then use them to learn deep models in a supervised manner. Liu et al. [29] propose a stepwise metric promotion approach that refines pseudo labels by iteratively estimating the annotations of training tracklets. Wu et al. [45] propose a progressive sampling method to gradually predict reliable pseudo labels and uncover unlabeled data for one-shot video-based Re-ID. Yang et al. [47] introduce an asymmetric co-teaching strategy to refine pseudo labels in clustering-based methods. Zhai et al. [54] present a novel augmented discriminative clustering technique that incorporates style-translated images to improve the discriminativeness of instance features. However, such methods rely on a good deep Re-ID model as an initialized feature extractor. Besides requiring an auxiliary Re-ID model, they also struggle with hard samples, which affect the quality of label prediction. Therefore, some researchers instead focus on the fully unsupervised Re-ID task without relying on any initialization model.

Along this line, Lin et al. [27] propose a bottom-up clustering framework that iteratively trains a network with pseudo labels generated by unsupervised clustering; it not only considers the diversity over samples but also exploits the similarity within each class. Ding et al. [7] design a novel dispersion-based clustering approach that can discover the underlying feature space of unlabeled pedestrian images. Zeng et al. [53] propose hierarchical clustering with a hard-batch triplet loss. These approaches explore cluster distributions in the target domain but still face the challenge of precisely predicting the labels of hard samples. Different from these methods, which classify each image into a single class, multi-label classification has the potential to achieve better efficiency and accuracy. For example, Lin et al. [28] propose an unsupervised Re-ID network that softens labels to reflect image similarity and eliminates the hard quantization error. Wang et al. [34] treat unsupervised person Re-ID as a multi-label classification task, iteratively predicting multi-class labels and updating the network with a multi-label classification loss. However, this multi-label strategy keeps all samples in memory, which may introduce noisy samples into the training phase. Different from all the above methods, our method addresses the fully unsupervised person Re-ID problem with joint multi-label learning and self-paced clustering.

3 Methodology

In this section, we first provide an overview of our method, then present the preliminaries of unsupervised person Re-ID, and finally describe the individual modules of our framework and the training strategy.

3.1 Overview

To tackle unsupervised person Re-ID, we propose a multi-label learning guided self-paced clustering (MLC) framework, as shown in Figure 2. Our MLC framework includes three crucial modules, namely a multi-scale network (MN), a multi-label learning (ML) module, and a self-paced clustering (SC) module. Given an image, the multi-scale network extracts both global features from the whole image and local features from sub-regions by global max pooling. The ML module leverages an updatable memory bank to assign multi-hot labels to all unlabeled images and uses a multi-label classification loss for training. The memory bank is composed of the multi-scale features of all samples in the dataset, and the multi-hot label of an image is determined by the similarities between the image feature and the memory bank. The SC module filters noisy samples and assigns pseudo labels by step-wise clustering and sampling. In practice, we first perform several epochs of multi-label training to ensure that every image is used for training, and then jointly apply SC and ML, which trades off noisy samples against hard samples.

3.2 Preliminary and Initialization

Preliminary. In fully unsupervised person Re-ID tasks, we only have an unlabeled training dataset $X=\{x_{1},x_{2},\cdots,x_{N}\}$ containing $N$ person images. Our purpose is to learn a discriminative feature extractor $\phi(\theta;x_{i})$ from $X$ without any available annotations. The parameters of $\phi$ are optimized iteratively using an objective function. This feature extractor can be applied to the gallery set $G=\{g_{1},g_{2},\cdots,g_{N_{t}}\}$ of $N_{t}$ images and the query set $Q=\{q_{1},q_{2},\cdots,q_{N_{q}}\}$ of $N_{q}$ images. During evaluation, for any query image $q$, the feature extractor is expected to produce a feature vector that retrieves the images $g$ containing the same person from the gallery set $G$. In other words, we use the query feature $\phi(\theta;q)$ to search for the most similar gallery features. Hence, it is critical to learn a discriminative feature extractor $\phi(\theta;\cdot)$ for the person Re-ID model. Conceptually, the retrieval objective over image pairs is defined as,

$\hat{g}=\arg\min_{g\in G}\,\mathrm{dist}(\phi(\theta;q),\phi(\theta;g))$ (1)

where $\mathrm{dist}(\cdot)$ is the distance metric, e.g., the L2 distance. Person Re-ID generally adopts the Euclidean or cosine distance at the retrieval stage; in this work, we use the Euclidean distance to rank the Re-ID results. Moreover, we employ a k-reciprocal encoding method with the Jaccard distance between probe and gallery images to compute the distance matrix for our self-paced clustering [61]; its detailed application is introduced in Section 3.5.
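As a minimal illustration of this retrieval step, the sketch below ranks gallery features by Euclidean distance to a query feature; the tensors are assumed to be features already produced by $\phi(\theta;\cdot)$, and the function name is illustrative.

```python
import torch

def retrieve(query_feat: torch.Tensor, gallery_feats: torch.Tensor) -> torch.Tensor:
    # query_feat: (d,) feature of one query image; gallery_feats: (N_t, d).
    # Eq. (1): rank gallery images by Euclidean (L2) distance, nearest first.
    dists = torch.cdist(query_feat.unsqueeze(0), gallery_feats).squeeze(0)
    return torch.argsort(dists)
```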

Initialization with hard labels. In order to learn a discriminative feature extractor, the traditional supervised learning method needs person identity labels for each image. However, there are no manually annotated labels in fully unsupervised Re-ID tasks, so we need to generate pseudo labels instead. Thus, we start by treating each training image $x_{i}$ as an individual class and initially assign $x_{i}$ a label $y_{i}$ given by its index, i.e., $Y=\{y_{1},y_{2},\cdots,y_{N}\}$. The feature extractor $\phi(\theta;\cdot)$ is appended with a classifier $f(w;\phi)\in\mathbb{R}^{N}$ parameterized by $w$. The optimization is defined by the following objective function:

$\min_{\theta,w}\sum_{i=1}^{N}\ell\big(f(w;\phi(\theta;x_{i})),\,y_{i}\big),$ (2)

where $\ell$ is the cross-entropy (CE) loss for classification. The hard labels $Y$ are suitable for initialization but not reliable for unsupervised feature learning. We warm up the neural network with the initialized hard labels, allowing the model to reach a certain local optimum from which subsequent approaches can be explored.
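A minimal sketch of this warm-up objective is shown below: a linear classifier over $N$ instance classes trained with CE loss against the image indices. The class and variable names are illustrative.

```python
import torch.nn as nn

class WarmUpHead(nn.Module):
    # f(w; .): a linear classifier that treats each of the N training images
    # as its own class, so the hard label of an image is simply its index.
    def __init__(self, feat_dim: int, num_images: int):
        super().__init__()
        self.classifier = nn.Linear(feat_dim, num_images)

    def forward(self, feats):
        return self.classifier(feats)

# Eq. (2): cross-entropy between predicted logits and the instance indices,
# e.g., loss = nn.CrossEntropyLoss()(head(features), image_indices)
```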

3.3 Multi-scale Network

As the feature extractor, the multi-scale network aims to capture multi-granularity person features for similarity computing. Specifically, we use ResNet-50 as our backbone since it is widely adopted in person Re-ID tasks and obtains good performance. Inspired by the recent domain-adaptive Re-ID work [12], we compute the similarity between two persons not only with global information from the whole body but also with local information from different parts of a person. The detailed architecture of our MN is illustrated in Figure 3.

Figure 3: The architecture of our multi-scale network.

We remove the down-sampling operations before $res\_conv5\_1$ and uniformly split the feature maps into an upper and a lower part at the last $conv$ layer. We apply a global average pooling (GAP) operation on the whole final feature map and on the partial feature maps to obtain three feature vectors for each image, i.e., $f_{g}$, $f_{up}$, and $f_{low}$. To obtain better multi-granularity features, we concatenate them into a final person representation $f_{all}$.
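The sketch below gives one possible PyTorch realization of this design, assuming the standard torchvision ResNet-50; the stride modification removes the last-stage down-sampling, and the pooling follows the GAP choice stated above. It is an illustrative reading of the architecture, not the exact implementation.

```python
import torch
import torch.nn as nn
import torchvision

class MultiScaleNet(nn.Module):
    def __init__(self):
        super().__init__()
        resnet = torchvision.models.resnet50(pretrained=True)
        # remove the down-sampling before res_conv5_1 (stride 2 -> 1)
        resnet.layer4[0].conv2.stride = (1, 1)
        resnet.layer4[0].downsample[0].stride = (1, 1)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])

    def forward(self, x):
        fmap = self.backbone(x)                        # (B, 2048, H, W)
        h = fmap.size(2)
        f_g = fmap.mean(dim=(2, 3))                    # global feature
        f_up = fmap[:, :, : h // 2].mean(dim=(2, 3))   # upper-part feature
        f_low = fmap[:, :, h // 2:].mean(dim=(2, 3))   # lower-part feature
        return torch.cat([f_g, f_up, f_low], dim=1)    # f_all
```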

3.4 Multi-label learning module

The multi-label learning (ML) module aims to learn discriminative features by assigning similar multi-hot labels to similar images based on a feature memory bank.

Memory Bank. A memory bank consists of the representations of all samples in the dataset. Following [34, 16, 40], we maintain the memory bank as a feature storage that saves up-to-date features of the training dataset. The memory bank allows the network to discover more negative samples from the memory buffer to pair with positive samples without recomputing their features. Compared with previous methods, the benefit of a memory bank is to collect more informative feature pairs at the cost of the memory space for the stored features.

Initialization. We initialize the memory module $\boldsymbol{M}$ by computing the features of a set of randomly sampled training images based on the warm-up model. Formally, $\boldsymbol{M}=\{f^{1}_{all},f^{2}_{all},\cdots,f^{N}_{all}\}$, where $\boldsymbol{M}\in\mathbb{R}^{N\times d}$ and $d$ is the dimension of the features. Each $f^{i}_{all}$ is initialized as the feature of the $i$-th sample $x_{i}$. In the memory bank, each slot $\boldsymbol{M}[i]$ stores the L2-normalized feature $f^{i}_{all}$ in its key part and the hard label in its value part.

Updating. During each training iteration, the feature vectors of each mini-batch are involved in the memory bank update. The whole unlabeled dataset can be cached in the memory bank, which is dynamically updated with the features computed during training. We update the memory bank in a running-average manner as follows,

$\boldsymbol{M}[i]^{t}\leftarrow\alpha\boldsymbol{M}[i]^{t}+(1-\alpha)f^{i}_{all},\qquad \boldsymbol{M}[i]^{t}=\boldsymbol{M}[i]^{t}/\|\boldsymbol{M}[i]^{t}\|_{2},$ (3)

where the superscript $t$ denotes the current training epoch and $\alpha$ is the momentum of the updating rate. We then utilize the Memory-based Positive Label Prediction (MPLP) method [34] with the hard labels to predict the multi-hot label $\bar{y}$ based on the memory bank $\boldsymbol{M}$. Following MPLP, we first compute a ranking list $R_{i}$ that stores the similarities between a sample and the memory bank as follows,

$R_{i}=\mathop{\mathrm{arg\,sort}}_{j}(s_{i,j}),\; j\in[1,N],$ (4)
$s_{i,j}=\boldsymbol{M}[i]^{T}\boldsymbol{M}[j],$ (5)

where $s_{i,j}$ is the similarity score between $x_{i}$ and $x_{j}$, and $R_{i}$ provides the candidates for reliable labels of $x_{i}$. We use a similarity threshold and a cycle-consistency scheme to select relevant label candidates and filter out hard negative labels. The positive label set is defined as,

$P_{i}=\hat{R}_{i}[1:l]$ (6)

where $\hat{R}_{i}=R_{i}[1:k_{i}]$ contains the top-$k_{i}$ nearest labels, and $R_{i}[k_{i}]$ is the last label whose similarity score is higher than a threshold that decides the quantity of label candidates. The index $l$ satisfies $i\in R_{\hat{R}_{i}[l]}$ and $i\notin R_{\hat{R}_{i}[l+1]}$, i.e., candidates are kept up to the first one whose own ranking list no longer contains $x_{i}$. As $P_{i}$ contains $l$ labels, $x_{i}$ is assigned a multi-label vector $\bar{y}_{i}$ in which the value 1 indicates positive classes,

$\bar{y}_{i}[j]=\begin{cases}\;\;\,1 & j\in P_{i}\\ -1 & j\notin P_{i}\end{cases}$ (7)
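Putting Eqs. (3)-(7) together, the sketch below summarizes the memory update and the multi-label prediction. The cycle-consistency check is a simplified reading that reuses $k_{i}$ for the candidates' own neighbor lists, and all names and the default values are illustrative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def update_memory(memory, feats, indices, alpha=0.5):
    # Eq. (3): running-average update of the visited slots, then L2-normalize.
    memory[indices] = alpha * memory[indices] + (1.0 - alpha) * feats
    memory[indices] = F.normalize(memory[indices], dim=1)

@torch.no_grad()
def predict_multilabel(memory, i, threshold=0.6):
    # Eqs. (4)-(5): similarities of sample i to all memory slots, then ranking.
    sims = memory @ memory[i]
    rank = torch.argsort(sims, descending=True)            # R_i
    k_i = max(int((sims >= threshold).sum()), 1)           # labels above threshold
    candidates = rank[:k_i]                                # \hat{R}_i

    # Cycle consistency: keep candidates up to the first one whose own
    # ranking list no longer contains i.
    positives = []
    for j in candidates.tolist():
        neighbors_j = torch.argsort(memory @ memory[j], descending=True)[:k_i]
        if i not in neighbors_j.tolist():
            break
        positives.append(j)                                # P_i, Eq. (6)

    # Eq. (7): +1 for positive classes, -1 elsewhere.
    y_bar = -torch.ones(memory.size(0))
    y_bar[torch.tensor(positives, dtype=torch.long)] = 1.0
    return y_bar
```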

The multi-label classification loss, which is computed on positive classes and sampled hard negative classes, is shown below.

$\mathcal{L}_{mmcl}=\sum_{i=1}^{N}\Big[\frac{\delta}{|P_{i}|}\sum_{p\in P_{i}}\ell(p|x_{i})+\frac{1}{|S_{i}|}\sum_{s\in S_{i}}\ell(s|x_{i})\Big]$ (8)

where $S_{i}$ is the collection of hard negative classes for $x_{i}$; we select the top-$r\%$ classes as the hard negative classes, so $|S_{i}|=(N-|P_{i}|)\cdot r\%$. $\delta$ is a coefficient measuring the importance of the multi-label classification loss.
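One way to instantiate Eq. (8) is sketched below, assuming the per-class loss $\ell(\cdot|x_{i})$ is the squared error between the memory similarity score and the $\pm 1$ label, as in the MMCL formulation [34]; the values of $\delta$ and the hard-negative ratio are illustrative assumptions.

```python
import torch

def mmcl_loss(scores, y_bar, delta=5.0, hard_ratio=0.01):
    # scores: (B, N) similarities between batch features and the memory bank.
    # y_bar:  (B, N) multi-label vectors from Eq. (7), entries in {+1, -1}.
    loss = scores.new_zeros(())
    for i in range(scores.size(0)):
        pos = scores[i][y_bar[i] > 0]
        neg = scores[i][y_bar[i] < 0]
        # hard negative mining: keep only the top-r% highest-scoring negatives
        k = max(1, int(neg.numel() * hard_ratio))
        hard_neg = neg.topk(k).values                      # S_i
        loss = loss + delta * ((pos - 1.0) ** 2).mean() \
                    + ((hard_neg + 1.0) ** 2).mean()
    return loss / scores.size(0)
```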

3.5 Self-paced clustering module

As mentioned above, the multi-label learning strategy keeps all samples for training, which may hurt feature learning due to noisy samples. The self-paced clustering (SC) module is proposed to update the training dataset with cleaner samples and assign pseudo labels for training.

As shown in the bottom-right of Figure 2, we first employ the SC module to cluster the final features extracted by our MN. The clustering method aims to group similar entities together after computing the distance matrix with k-reciprocal encoding [61] over all training samples. The sampling step is used to filter noisy samples and assigns pseudo labels $\tilde{y}_{i}$ according to the grouped entities. Finally, we form a new training dataset with pseudo labels for joint learning with the SC and ML modules.

For clustering methods, the selection of nearest neighbors is crucial: it merges instances into the right clusters and ultimately affects the clustering results and the quality of the pseudo labels. The conventional k-means clustering algorithm is a natural choice, selecting nearest neighbors according to the distance of a sample to the cluster centroids. As an alternative, the DBSCAN [8] algorithm selects nearest neighbors via the Jaccard distance matrix [61] and the k-reciprocal nearest neighbor method. We use DBSCAN as our default SC module and apply step-wise clustering and sampling to remove noisy samples from the whole dataset. During the progressive clustering stage, we try to keep the most reliable clusters; to avoid accumulating training errors caused by noisy clusters, we remove the noisy samples and constitute a new training set with pseudo labels for joint training.
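A compact sketch of this step is shown below, assuming the Jaccard distance matrix from k-reciprocal re-ranking [61] has already been computed; the eps value is illustrative, while min_samples corresponds to $K_{sample}$ in Section 4.4.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def self_paced_clustering(jaccard_dist: np.ndarray, eps: float = 0.6,
                          min_samples: int = 4):
    # Cluster on the precomputed (N, N) Jaccard distance matrix.
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="precomputed").fit_predict(jaccard_dist)
    # DBSCAN marks noisy samples with -1; keep only clustered samples and
    # use their cluster ids as pseudo labels for the new training set.
    keep = np.where(labels != -1)[0]
    return keep, labels[keep]
```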

Since labels are available after self-paced clustering, we treat the training process as a classification problem. Specifically, we adopt Cross-Entropy (CE) loss for classification and triplet loss for metric learning. We apply CE loss with label smoothing as follows,

$\mathcal{L}_{CE_{p}}=\frac{1}{N}\sum_{i=1}^{N}\ell\big(f(w;\phi(\theta;x_{i})),\tilde{y}_{i}^{s}\big)$ (9)

where the smoothed $j$-th entry is $\tilde{y}_{j,i}^{s}=1-\varepsilon+\frac{\varepsilon}{C}$ if $j=\tilde{y}_{i}$, and $\tilde{y}_{j,i}^{s}=\frac{\varepsilon}{C}$ otherwise. $C$ is the number of identities predicted by the SC module, and $\varepsilon$ is a small constant for label smoothing.
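A short sketch of this smoothed CE loss follows; $\varepsilon=0.1$ is a common default assumed here rather than a value prescribed above.

```python
import torch
import torch.nn.functional as F

def smoothed_ce(logits, pseudo_labels, eps=0.1):
    # Eq. (9) with label smoothing: the target puts 1-eps+eps/C on the
    # pseudo-label class and eps/C on every other class.
    C = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    targets = torch.full_like(log_probs, eps / C)
    targets.scatter_(1, pseudo_labels.unsqueeze(1), 1.0 - eps + eps / C)
    return (-targets * log_probs).sum(dim=1).mean()
```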

Formally, the triplet loss function is defined as follows,

$\mathcal{L}_{tri}=\frac{1}{N}\sum_{i=1}^{N}\max\big(0,\,\|f(w;\phi(\theta;x_{i}))-f(w;\phi(\theta;x_{i,p}))\|+m-\|f(w;\phi(\theta;x_{i}))-f(w;\phi(\theta;x_{i,n}))\|\big)$ (10)

where $\|\cdot\|$ denotes the $L_{2}$-norm distance, $x_{i,p}$ and $x_{i,n}$ indicate the hardest positive and hardest negative samples in each mini-batch, and $m$ denotes the triplet distance margin, whose default value is set to 0. The overall loss function for optimization is the combination of the multi-label classification loss and the pseudo-label classification losses (CE loss and triplet loss) as follows,

$\mathcal{L}_{o}=\lambda_{1}\mathcal{L}_{mmcl}+\lambda_{2}(\mathcal{L}_{CE_{p}}+\mathcal{L}_{tri}).$ (11)
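For completeness, a sketch of the batch-hard triplet term of Eq. (10) is given below: for each anchor, the hardest positive and hardest negative are mined within the mini-batch, matching the description above.

```python
import torch

def batch_hard_triplet(feats, labels, margin=0.0):
    # Pairwise L2 distances within the mini-batch.
    dist = torch.cdist(feats, feats)
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool, device=feats.device)
    # Hardest positive: farthest same-identity sample (excluding the anchor).
    pos = dist.masked_fill(~same | eye, float('-inf')).max(dim=1).values
    # Hardest negative: closest different-identity sample.
    neg = dist.masked_fill(same, float('inf')).min(dim=1).values
    return torch.clamp(pos + margin - neg, min=0).mean()
```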

4 Experiments

4.1 Datasets and settings

We evaluate our approach on three large-scale benchmark datasets: Market1501 [58], DukeMTMC-reID [60], and MSMT17 [41].

Market1501 Dataset. Market1501 contains 1,501 person identities with 32,668 images captured by six cameras. It contains 12,936 images of 751 identities for training and 19,732 images of 750 identities for testing.

DukeMTMC-reID Dataset. DukeMTMC-reID is a subset of the DukeMTMC dataset. It has 1,404 person identities from eight cameras, with 36,411 labeled images. It contains 16,522 images of 702 identities for training and the remaining images of another 702 identities for testing, including 2,228 query images and 17,661 gallery images.

Table 1: Performance (%) comparison of our framework with the baseline methods on Market-1501 (left columns) and DukeMTMC-reID (right columns).
Method | source | Rank-1 | Rank-5 | Rank-10 | mAP | source | Rank-1 | Rank-5 | Rank-10 | mAP
Baseline1: fully-supervised [25] | Supervised | 88.5 | 96.5 | 97.9 | 70.7 | Supervised | 74.8 | 87.1 | 91.5 | 58.0
Baseline2: ImageNet model | None | 8.1 | 17.5 | 23.6 | 2.2 | None | 5.6 | 11.5 | 14.9 | 1.6
Baseline3: MMCL [34] | None | 79.8 | 88.4 | 91.6 | 44.7 | None | 65.6 | 75.9 | 80.1 | 39.6
MLC w/o MN | None | 85.2 | 92.2 | 94.6 | 62.6 | None | 71.9 | 81.2 | 84.4 | 50.0
MLC | None | 86.7 | 93.5 | 95.6 | 66.2 | None | 73.6 | 82.3 | 85.5 | 52.3

MSMT17 Dataset. MSMT17 is the most challenging and currently largest-scale dataset, containing 126,441 images of 4,101 person identities captured from 15 camera views. It is split into 32,621 images of 1,041 identities for training and 93,820 images of 3,060 identities for testing.

Evaluation Metrics. We follow the standard training/test split and adopt the single-query evaluation setting. For evaluation metrics, we use the Rank-1/Rank-5 matching accuracy from the Cumulated Matching Characteristics (CMC) [15], which measures whether the query has a correct match in the top-k ranking list, as well as the mean Average Precision (mAP).

Implementation Details. We implement our method in PyTorch. For data pre-processing, input images are resized to $256\times 128$. We apply commonly-used data augmentation methods, including random horizontal flipping, random cropping, color jitter, random erasing, and CamStyle [64]. For a fair comparison, we adopt ResNet-50 pre-trained on ImageNet as our backbone network. The batch size is set to 128 for training. We use standard SGD as the optimizer, with momentum 0.9 and weight decay $5\times 10^{-4}$. The initial learning rate is 0.1. We train the model for 60 epochs, and the learning rate is divided by 10 after every 30 epochs. Following [34], we initialize the memory bank with all zeros, use the hard labels to warm up the model, and fully update the memory for 5 epochs. The memory updating rate $\alpha$ starts from 0 and grows linearly to 0.5. The similarity threshold in the ML module is 0.6. We jointly train the ML and SC modules after the 15th epoch.
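The optimization schedule above corresponds to a setup like the following sketch, where the plain torchvision ResNet-50 stands in for our multi-scale network.

```python
import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=5e-4)
# 60 epochs in total; learning rate divided by 10 after every 30 epochs.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)
```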

4.2 Comparison with the baseline

To investigate the effect of the joint learning in our proposed MLC, we compare MLC with three baselines: 1) basic supervised learning [25] with triplet loss on the labeled data; 2) direct feature evaluation with the ImageNet pre-trained ResNet-50 model; 3) unsupervised learning with the multi-label classification loss [34]. The experimental results are shown in Table 1. The first two baselines represent the upper and lower bounds of the backbone model. The third baseline utilizes multi-label prediction similar to ours for discriminative feature learning. In theory, the closer our generated pseudo labels are to the true labels, the closer the performance of MLC approaches the fully-supervised Baseline1. For fair comparison, we also present the results obtained without the MN module, i.e., with the same feature extractor as the baseline methods.

As can be seen, Baseline1 achieves the highest performance with supervised learning, e.g., 70.7% mAP on Market-1501 and 58.0% mAP on DukeMTMC-reID. In contrast, the performance of Baseline2 is very poor, which shows a large gap between person Re-ID and ImageNet object classification. Based on the predicted multi-labels, Baseline3 boosts Baseline2 significantly, with 42.5% and 38.0% mAP improvements on Market-1501 and DukeMTMC-reID, respectively. Even without the multi-scale network, our MLC already outperforms the unsupervised baselines by a large margin, with 62.6% mAP on Market-1501 and 50.0% mAP on DukeMTMC-reID. By considering both global and local information, the MN module further boosts performance by 3.6% mAP on Market-1501 and 2.3% mAP on DukeMTMC-reID.

Table 2: Performance (%) comparison of our framework with state-of-the-art methods on Market-1501 (left columns) and DukeMTMC-reID (right columns).
Method | source | Rank-1 | Rank-5 | Rank-10 | mAP | source | Rank-1 | Rank-5 | Rank-10 | mAP
LOMO [25] | None | 27.2 | 41.6 | 49.1 | 8.0 | None | 12.3 | 21.3 | 26.6 | 4.8
BoW [58] | None | 35.8 | 52.4 | 60.3 | 14.8 | None | 17.1 | 28.8 | 34.9 | 8.3
BUC [27] | None | 66.2 | 79.6 | 84.5 | 38.3 | None | 47.4 | 62.6 | 68.4 | 27.5
SSL [28] | None | 71.7 | 83.8 | 87.4 | 37.8 | None | 52.5 | 63.5 | 68.9 | 28.6
DBC [7] | None | 69.2 | 83.0 | 87.8 | 41.3 | None | 51.5 | 64.6 | 70.1 | 30.0
HCT [53] | None | 80.0 | 91.6 | 95.2 | 56.4 | None | 69.6 | 83.4 | 87.4 | 50.7
MMCL [34] | None | 80.3 | 89.4 | 92.3 | 45.5 | None | 65.2 | 75.9 | 80.0 | 40.2
MLC w/o MN | None | 85.2 | 92.2 | 94.6 | 62.6 | None | 71.9 | 81.2 | 84.4 | 50.0
MLC | None | 86.7 | 93.5 | 95.6 | 66.2 | None | 73.6 | 82.3 | 85.5 | 52.3
UMDL [32] | Duke | 34.5 | 52.6 | 59.6 | 12.4 | Market | 18.5 | 31.4 | 37.4 | 7.3
CAMEL [50] | Duke | 54.5 | 73.1 | - | 26.3 | Market | 40.3 | 57.6 | - | 19.8
PUL [9] | Duke | 45.5 | 60.7 | 66.7 | 20.5 | Market | 30.0 | 43.4 | 48.5 | 16.4
PTGAN [41] | Duke | 38.6 | 57.3 | 66.1 | 15.7 | Market | 27.4 | 43.6 | 50.7 | 13.5
SPGAN+LMP [6] | Duke | 57.7 | 75.8 | 82.4 | 26.7 | Market | 46.4 | 62.3 | 68.0 | 26.2
MMFA [26] | Duke | 56.7 | 75.0 | 81.8 | 27.4 | Market | 45.3 | 59.8 | 66.3 | 24.7
TJ-AIDL [39] | Duke | 58.2 | 74.8 | 81.1 | 26.5 | Market | 44.3 | 59.6 | 65.0 | 23.0
HHL [62] | Duke | 62.2 | 78.8 | 84.0 | 31.4 | Market | 46.9 | 61.0 | 66.7 | 27.2
ECN [63] | Duke | 75.1 | 87.6 | 91.6 | 43.0 | Market | 63.3 | 75.8 | 80.4 | 40.4
MAR [51] | MSMT | 67.7 | 81.9 | - | 40.0 | MSMT | 67.1 | 79.8 | - | 48.0
PAUL [49] | MSMT | 68.5 | 82.4 | 87.4 | 40.1 | MSMT | 72.0 | 82.7 | 86.0 | 53.2
SSG [12] | Duke | 80.0 | 90.0 | 92.4 | 58.3 | Market | 73.0 | 80.6 | 83.2 | 53.4
CR-GAN [3] | Duke | 77.7 | 89.7 | 92.7 | 54.0 | Market | 68.9 | 80.2 | 84.7 | 48.6
CASCL [43] | MSMT | 65.4 | 80.6 | 86.2 | 35.5 | MSMT | 59.3 | 73.2 | 77.5 | 37.8
PDA-Net [24] | Duke | 75.2 | 86.3 | 90.2 | 47.6 | Market | 63.2 | 77.0 | 82.5 | 45.1
MMCL [34] | Duke | 84.4 | 92.8 | 95.0 | 60.4 | Market | 72.4 | 82.9 | 85.0 | 51.4
ADTC [18] | Duke | 79.3 | 90.8 | 94.1 | 59.7 | Market | 71.9 | 84.1 | 87.5 | 52.5
D-MMD [30] | Duke | 70.6 | 87.0 | 91.5 | 48.8 | Market | 63.5 | 78.8 | 83.9 | 46.0
MLC w/o MN | Duke | 88.4 | 94.6 | 96.2 | 64.7 | Market | 73.0 | 82.9 | 85.9 | 53.8
MLC | Duke | 85.6 | 93.9 | 96.0 | 65.9 | Market | 74.1 | 83.8 | 86.3 | 55.0

4.3 Comparison with the state-of-the-art methods

We compare our approach with state-of-the-art unsupervised learning methods for person Re-ID, including: 1) hand-crafted features (LOMO [25] and BoW [58]); 2) pseudo label learning methods without any labeled dataset (e.g., BUC [27], SSL [28], DBC [7], HCT [53], MMCL [34]); 3) unsupervised transfer learning and domain adaptation approaches (e.g., UMDL [32], CAMEL [50], PUL [9], PTGAN [41], SPGAN+LMP [6], MMFA [26], TJ-AIDL [39], HHL [62], ECN [63], MAR [51], PAUL [49], SSG [12], CR-GAN [3], CASCL [43], PDA-Net [24], MMCL [34], ADTC [18], and D-MMD [30]). The comparison results on Market-1501 and DukeMTMC-reID are presented in Table 2, and the comparison on MSMT17 is listed in Table 3.

From Table 2, we observe that our MLC consistently outperforms recent state-of-the-art methods on both Market-1501 and DukeMTMC-reID, with or without a source dataset. Without a source dataset, the hand-crafted feature based methods show the worst performance, since the representation ability of designed features is limited. The deep learning methods with pseudo labels (from BUC to MMCL) significantly outperform the hand-crafted feature based methods, which indicates that pseudo labels and deep networks are effective. Our MLC achieves the best results, with 66.2% and 52.3% mAP on Market-1501 and DukeMTMC-reID. Compared with pure clustering based methods like SSL, our MLC leverages the ML module to learn better initial representations that guide subsequent clustering. Compared with MMCL, our MLC uses the additional SC module to generate a new training set and refine pseudo labels, which largely enhances the Re-ID model through joint training.

We further compare our method with unsupervised transfer learning and domain adaptation methods. Our MLC not only surpasses the fully unsupervised methods but is also better than the unsupervised transfer learning and domain adaptation methods. Specifically, under the transfer learning setting, our method achieves the best performance on both Market1501 and DukeMTMC-reID, obtaining 65.9% and 55.0% mAP, respectively. An interesting observation is that the results of our MLC are similar whether or not an annotated source dataset is available, which indicates that our method is well suited to unsupervised person Re-ID.

We also conduct experiments on MSMT17, with results shown in Table 3. Compared with the other two datasets, MSMT17 is larger and more challenging because of its more complex lighting and scene variations. A limited number of works have reported performance on MSMT17, covering both unsupervised learning and transfer learning with domain adaptation, such as MMCL [34], PTGAN [41], ECN [63], and SSG [12]. As shown in Table 3, our approach outperforms existing methods by large margins under both settings. Under the purely unsupervised setting, our MLC without the MN module obtains the best performance: it achieves 12.7% mAP, outperforming the full MLC and MMCL by 0.7% and 1.5%, respectively. Under the transfer learning setting, our MLC obtains 16.5% and 18.0% mAP with Market1501 and DukeMTMC-reID as source datasets, while MLC without the MN module achieves 16.2% and 16.7%, respectively. The limited gain of the MN module on MSMT17 may be explained by the fact that person images in MSMT17 vary more in posture and scene than those of the other two datasets. Overall, our MLC improves the state of the art (i.e., MMCL) by 1.8% in mAP and 2.8% in Rank-1 when using Duke as the source dataset.

Table 3: Performance (%) comparison of our framework with state-of-the-art methods on MSMT17.
Method | source | Rank-1 | Rank-5 | Rank-10 | mAP
MMCL [34] | None | 35.4 | 44.8 | 49.8 | 11.2
MLC w/o MN | None | 39.2 | 49.4 | 53.9 | 12.7
MLC | None | 37.1 | 47.5 | 52.4 | 12.0
PTGAN [41] | Market | 10.2 | - | 24.4 | 2.9
ECN [63] | Market | 25.3 | 36.3 | 42.1 | 8.5
SSG [12] | Market | 31.6 | - | 49.6 | 13.2
MMCL [34] | Market | 40.8 | 51.8 | 56.7 | 15.1
MLC w/o MN | Market | 42.4 | 53.0 | 57.9 | 16.2
MLC | Market | 43.9 | 55.3 | 60.4 | 16.5
PTGAN [41] | Duke | 11.8 | - | 27.4 | 3.3
ECN [63] | Duke | 30.2 | 41.5 | 46.8 | 10.2
SSG [12] | Duke | 32.2 | - | 51.2 | 13.3
MMCL [34] | Duke | 43.6 | 54.3 | 58.9 | 16.2
MLC w/o MN | Duke | 45.0 | 55.9 | 60.8 | 16.7
MLC | Duke | 46.4 | 57.9 | 62.7 | 18.0

4.4 Ablation study on Market1501 and DukeMTMC-reID

Table 4: Performance (%) comparison of the proposed individual components of MLC on Market-1501 (left columns) and DukeMTMC-reID (right columns).
MN | ML | SC | Rank-1 | Rank-5 | Rank-10 | mAP | Rank-1 | Rank-5 | Rank-10 | mAP
√ | √ | k-means (k=500) | 80.5 | 89.9 | 92.4 | 49.1 | 65.0 | 76.4 | 80.9 | 41.6
√ | √ | k-means (k=750) | 81.8 | 90.3 | 92.8 | 49.3 | 64.5 | 76.4 | 80.5 | 41.3
√ | √ | k-means (k=1000) | 81.5 | 89.8 | 92.4 | 48.5 | 65.2 | 76.0 | 80.8 | 42.1
× | √ | × | 79.8 | 88.4 | 91.6 | 44.7 | 65.6 | 75.9 | 80.1 | 39.6
√ | √ | × | 79.7 | 89.3 | 92.3 | 44.1 | 66.1 | 76.8 | 80.1 | 41.5
× | √ | DBSCAN | 85.2 | 92.2 | 94.6 | 62.6 | 71.9 | 81.2 | 84.4 | 50.0
√ | √ | DBSCAN | 86.7 | 93.5 | 95.6 | 66.2 | 73.6 | 82.3 | 85.5 | 52.3

Evaluation of individual modules in MLC. To verify the effectiveness of each module in MLC, we conduct experiments evaluating the performance contribution of the different modules on the Market-1501 and DukeMTMC-reID datasets, summarized in Table 4. For comparison, we also use k-means instead of DBSCAN as the clustering method in our SC module, with the number of pseudo identities of k-means clustering set to 500, 750, and 1000. A cross (or tick) in the MN, ML, and SC columns denotes the result of MLC without (or with) the corresponding module.

As seen in Table 4, when using k-means instead of DBSCAN, the best mAP is 49.3% on Market-1501 (with $k=750$) and 42.1% on DukeMTMC-reID (with $k=1000$). These results show that the SC module improves MLC's performance markedly on Market-1501 and slightly on DukeMTMC-reID: even an elementary clustering algorithm enables the SC module to drive the network to learn more discriminative image features, and the SC module is beneficial to all modules in our MLC. Compared with MLC without the SC module, our full MLC significantly boosts mAP by 22.1% on Market-1501 and 10.8% on DukeMTMC-reID. In particular, MLC achieves the best performance with all modules enabled on both datasets. This result shows that, on the one hand, the SC module can remove noisy samples when jointly learning with the ML module; on the other hand, it helps our MLC generate more accurate pseudo labels for unsupervised learning.

Figure 4: Evaluation of different values of $K_{1}$, $K_{2}$ and $K_{sample}$ on the Market-1501 and DukeMTMC-reID datasets. (a), (c) and (e): rank-1 accuracy. (b), (d) and (f): mAP.
Figure 5: Evaluation of different values of $\lambda$ on the Market-1501 and DukeMTMC-reID datasets. (a): rank-1 accuracy. (b): mAP.
Figure 6: Visualization of the clustering results of our MLC with and without the MN module on the Market-1501 and DukeMTMC-reID datasets. Clustered samples with the top-20 nearest neighbors obtained by MLC w/o MN and by MLC are shown in the left and right columns, respectively. An image with a red bounding box is a noisy example in that cluster, and the rest are correct examples. An image with a green bounding box is a positive sample found only by MLC.

Evaluation of the parameters $K_{1}$, $K_{2}$ and $K_{sample}$. In the clustering phase, $K_{1}$ and $K_{2}$ denote the parameters in the evaluation of the Jaccard distance [61]: the former controls the contextual knowledge used to re-calculate the distance between the probe and the gallery, and the latter is the number of candidates in the top-k samples of the ranking list of the probe. The parameter $K_{sample}$ is the minimum number of samples per group in the DBSCAN method. To investigate the properties of our MLC with respect to these parameters, we fix the hyperparameter $\lambda$ at 0.3, with the default settings of $K_{1}$, $K_{2}$ and $K_{sample}$ being 20, 6 and 4, respectively. We first increase $K_{1}$ from 15 to 35 and present the results in Figure 4 (a) and (d). The performance first increases with the growth of $K_{1}$ and then slowly declines after $K_{1}=20$; a larger $K_{1}$ is more likely to include false matches in the k-reciprocal set, resulting in a decline in performance. The impact of $K_{2}$, varied from 2 to 10, is shown in Figure 4 (b) and (e): the best performance is obtained at $K_{2}=6$, and assigning a too-large value to $K_{2}$ also reduces the performance. Finally, we evaluate $K_{sample}$ from 2 to 10, with results reported in Figure 4 (c) and (f): the performance grows as $K_{sample}$ increases within a reasonable range, while values greater than 4 reduce the performance, since a too-large $K_{sample}$ harms DBSCAN clustering.

Evaluation of the hyperparameter $\lambda$. We evaluate how $\lambda$ (the relative importance of the multi-label classification loss) affects model learning by varying it from 0.1 to 10. The rank-1 accuracy and mAP on Market-1501 and DukeMTMC-reID are shown in Figure 5 (a) and (b). We observe that increasing $\lambda$ boosts performance in the beginning, while performance gradually degrades after $\lambda=0.4$ on Market-1501 and $\lambda=0.3$ on DukeMTMC-reID. A too-large $\lambda$ leads to fast degradation, which indicates that the pseudo-label classification loss is important for jointly training our model. As a matter of fact, this experiment evidences that SC joining the training with ML is a simple and effective strategy for unsupervised person Re-ID.

Qualitative analysis of clustering visualization. To better investigate the effectiveness of our MLC, we visualize the clustering results in Figure 6. We illustrate the clustering performance of MLC with and without the MN module in the right and left columns, respectively, on both the Market-1501 and DukeMTMC-reID datasets. The clustered images show that the MN module helps not only purify negative samples (images with red bounding boxes) but also recover hard positive samples (images with green bounding boxes) in each group. This indicates that our MN module can improve the clustering quality by mining multi-scale features for better similarity measurement.

5 Conclusion

In this paper, we proposed a novel multi-label learning guided self-paced clustering (MLC) framework for unsupervised person re-identification (Re-ID). It mainly contains three modules: a multi-scale network that obtains global and local person representations, a multi-label learning module that trains the network with a memory bank and a multi-label classification loss, and a self-paced clustering module that removes noisy samples and assigns pseudo labels for training. Extensive experiments on three challenging large-scale datasets demonstrate the effectiveness of all the modules, and our MLC framework achieves state-of-the-art performance on these datasets.

References

  • [1] E. Ahmed, M. Jones, and T. K. Marks. An improved deep learning architecture for person re-identification. In CVPR, pages 3908–3916, 2015.
  • [2] L. Bazzani, M. Cristani, and V. Murino. Symmetry-driven accumulation of local features for human characterization and re-identification. Computer Vision and Image Understanding, 117(2):130–144, 2013.
  • [3] Y. Chen, X. Zhu, and S. Gong. Instance-guided context rendering for cross-domain person re-identification. In ICCV, pages 232–242, 2019.
  • [4] Y.-C. Chen, X. Zhu, W.-S. Zheng, and J.-H. Lai. Person re-identification by camera correlation aware feature augmentation. IEEE transactions on pattern analysis and machine intelligence, 40(2):392–408, 2017.
  • [5] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng. Person re-identification by multi-channel parts-based cnn with improved triplet loss function. In CVPR, pages 1335–1344, 2016.
  • [6] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao. Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification. In CVPR, pages 994–1003, 2018.
  • [7] G. Ding, S. Khan, Z. Tang, J. Zhang, and F. Porikli. Towards better validity: Dispersion based clustering for unsupervised person re-identification. arXiv preprint arXiv:1906.01308, 2019.
  • [8] M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining, volume 96, pages 226–231, 1996.
  • [9] H. Fan, L. Zheng, C. Yan, and Y. Yang. Unsupervised person re-identification: Clustering and fine-tuning. ACM Transactions on Multimedia Computing, Communications, and Applications, 14(4):83, 2018.
  • [10] M. Farenzena, L. Bazzani, A. Perina, V. Murino, and M. Cristani. Person re-identification by symmetry-driven accumulation of local features. In CVPR, pages 2360–2367. IEEE, 2010.
  • [11] Q. Feng, G. Kang, H. Fan, and Y. Yang. Attract or distract: Exploit the margin of open set. In CVPR, pages 7990–7999, 2019.
  • [12] Y. Fu, Y. Wei, G. Wang, Y. Zhou, H. Shi, and T. S. Huang. Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification. In CVPR, pages 6112–6121, 2019.
  • [13] T. Fukuda, M. Suzuki, G. Kurata, S. Thomas, J. Cui, and B. Ramabhadran. Efficient knowledge distillation from an ensemble of teachers. In Interspeech, pages 3697–3701, 2017.
  • [14] Y. Ge, D. Chen, and H. Li. Mutual mean-teaching: Pseudo label refinery for unsupervised domain adaptation on person re-identification. In International Conference on Learning Representations, 2019.
  • [15] D. Gray, S. Brennan, and H. Tao. Evaluating appearance models for recognition, reacquisition, and tracking. In Proc. IEEE international workshop on performance evaluation for tracking and surveillance (PETS), volume 3, pages 1–7. Citeseer, 2007.
  • [16] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, pages 9729–9738, 2020.
  • [17] A. Hermans, L. Beyer, and B. Leibe. In defense of the triplet loss for person re-identification. arXiv preprint arXiv:1703.07737, 2017.
  • [18] Z. Ji, X. Zou, X. Lin, X. Liu, T. Huang, and S. Wu. An attention-driven two-stage clustering method for unsupervised person re-identification. In ECCV, pages 20–36. Springer, 2020.
  • [19] X. Jin, C. Lan, W. Zeng, and Z. Chen. Global distance-distributions separation for unsupervised person re-identification. arXiv preprint arXiv:2006.00752, 2020.
  • [20] E. Kodirov, T. Xiang, Z. Fu, and S. Gong. Person re-identification by unsupervised $\ell_1$ graph learning. In ECCV, pages 178–195. Springer, 2016.
  • [21] E. Kodirov, T. Xiang, and S. Gong. Dictionary learning with iterative laplacian regularisation for unsupervised person re-identification. In BMVC, volume 3, page 8, 2015.
  • [22] W. Li, R. Zhao, T. Xiao, and X. Wang. Deepreid: Deep filter pairing neural network for person re-identification. In CVPR, pages 152–159, 2014.
  • [23] W. Li, X. Zhu, and S. Gong. Harmonious attention network for person re-identification. In CVPR, pages 2285–2294, 2018.
  • [24] Y.-J. Li, C.-S. Lin, Y.-B. Lin, and Y.-C. F. Wang. Cross-dataset person re-identification via unsupervised pose disentanglement and adaptation. In ICCV, pages 7919–7929, 2019.
  • [25] S. Liao, Y. Hu, X. Zhu, and S. Z. Li. Person re-identification by local maximal occurrence representation and metric learning. In CVPR, pages 2197–2206, 2015.
  • [26] S. Lin, H. Li, C.-T. Li, and A. C. Kot. Multi-task mid-level feature alignment network for unsupervised cross-dataset person re-identification. arXiv preprint arXiv:1807.01440, 2018.
  • [27] Y. Lin, X. Dong, L. Zheng, Y. Yan, and Y. Yang. A bottom-up clustering approach to unsupervised person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 8738–8745, 2019.
  • [28] Y. Lin, L. Xie, Y. Wu, C. Yan, and Q. Tian. Unsupervised person re-identification via softened similarity learning. In CVPR, pages 3390–3399, 2020.
  • [29] Z. Liu, D. Wang, and H. Lu. Stepwise metric promotion for unsupervised video person re-identification. In CVPR, pages 2429–2438, 2017.
  • [30] D. Mekhazni, A. Bhuiyan, G. Ekladious, and E. Granger. Unsupervised domain adaptation in the dissimilarity space for person re-identification. In ECCV, pages 159–174. Springer, 2020.
  • [31] P. Panareda Busto and J. Gall. Open set domain adaptation. In CVPR, pages 754–763, 2017.
  • [32] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, and Y. Tian. Unsupervised cross-dataset transfer learning for person re-identification. In CVPR, pages 1306–1315, 2016.
  • [33] L. Song, C. Wang, L. Zhang, B. Du, Q. Zhang, C. Huang, and X. Wang. Unsupervised domain adaptive re-identification: Theory and practice. Pattern Recognition, 102:107173, 2020.
  • [34] D. Wang and S. Zhang. Unsupervised person re-identification via multi-label classification. In CVPR, pages 10981–10990, 2020.
  • [35] F. Wang, W. Zuo, L. Lin, D. Zhang, and L. Zhang. Joint learning of single-image and cross-image representations for person re-identification. In CVPR, pages 1288–1296, 2016.
  • [36] G. Wang, Y. Yuan, X. Chen, J. Li, and X. Zhou. Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the ACM international conference on Multimedia, pages 274–282, 2018.
  • [37] H. Wang, S. Gong, and T. Xiang. Unsupervised learning of generative topic saliency for person re-identification. In Proceedings of the British Machine Vision Conference. BMVA Press, 2014.
  • [38] H. Wang, X. Zhu, T. Xiang, and S. Gong. Towards unsupervised open-set person re-identification. In IEEE International Conference on Image Processing, pages 769–773. IEEE, 2016.
  • [39] J. Wang, X. Zhu, S. Gong, and W. Li. Transferable joint attribute-identity deep learning for unsupervised person re-identification. In CVPR, pages 2275–2284, 2018.
  • [40] X. Wang, H. Zhang, W. Huang, and M. R. Scott. Cross-batch memory for embedding learning. In CVPR, pages 6388–6397, 2020.
  • [41] L. Wei, S. Zhang, W. Gao, and Q. Tian. Person transfer gan to bridge domain gap for person re-identification. In CVPR, pages 79–88, 2018.
  • [42] A. Wu, W.-S. Zheng, X. Guo, and J.-H. Lai. Distilled person re-identification: Towards a more scalable system. In CVPR, pages 1187–1196, 2019.
  • [43] A. Wu, W.-S. Zheng, and J.-H. Lai. Unsupervised person re-identification by camera-aware similarity consistency learning. In CVPR, pages 6922–6931, 2019.
  • [44] J. Wu, S. Liao, X. Wang, Y. Yang, S. Z. Li, et al. Clustering and dynamic sampling based unsupervised domain adaptation for person re-identification. In 2019 IEEE International Conference on Multimedia and Expo (ICME), pages 886–891. IEEE, 2019.
  • [45] Y. Wu, Y. Lin, X. Dong, Y. Yan, W. Ouyang, and Y. Yang. Exploit the unknown gradually: One-shot video-based person re-identification by stepwise learning. In CVPR, pages 5177–5186, 2018.
  • [46] F. Xiong, M. Gou, O. Camps, and M. Sznaier. Person re-identification using kernel-based metric learning methods. In ECCV, pages 1–16. Springer, 2014.
  • [47] F. Yang, K. Li, Z. Zhong, Z. Luo, X. Sun, H. Cheng, X. Guo, F. Huang, R. Ji, and S. Li. Asymmetric co-teaching for unsupervised cross-domain person re-identification. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 12597–12604, 2020.
  • [48] F. Yang, K. Yan, S. Lu, H. Jia, D. Xie, Z. Yu, X. Guo, F. Huang, and W. Gao. Part-aware progressive unsupervised domain adaptation for person re-identification. IEEE Transactions on Multimedia, 2020.
  • [49] Q. Yang, H.-X. Yu, A. Wu, and W.-S. Zheng. Patch-based discriminative feature learning for unsupervised person re-identification. In CVPR, pages 3633–3642, 2019.
  • [50] H. Yu, A. Wu, and W. Zheng. Cross-view asymmetric metric learning for unsupervised person re-identification. In CVPR, pages 994–1002, 2017.
  • [51] H.-X. Yu, W.-S. Zheng, A. Wu, X. Guo, S. Gong, and J.-H. Lai. Unsupervised person re-identification by soft multilabel learning. In CVPR, 2019.
  • [52] H.-X. Yu, A. Wu, and W.-S. Zheng. Unsupervised person re-identification by deep asymmetric metric embedding. IEEE transactions on pattern analysis and machine intelligence, 2019.
  • [53] K. Zeng, M. Ning, Y. Wang, and Y. Guo. Hierarchical clustering with hard-batch triplet loss for person re-identification. In CVPR, pages 13657–13665, 2020.
  • [54] Y. Zhai, S. Lu, Q. Ye, X. Shan, J. Chen, R. Ji, and Y. Tian. Ad-cluster: Augmented discriminative clustering for domain adaptive person re-identification. In CVPR, pages 9021–9030, 2020.
  • [55] X. Zhang, J. Cao, C. Shen, and M. You. Self-training with progressive augmentation for unsupervised cross-domain person re-identification. In CVPR, pages 8222–8231, 2019.
  • [56] R. Zhao, W. Ouyang, and X. Wang. Unsupervised salience learning for person re-identification. In CVPR, pages 3586–3593, 2013.
  • [57] L. Zheng, Y. Huang, H. Lu, and Y. Yang. Pose-invariant embedding for deep person re-identification. IEEE Transactions on Image Processing, 28(9):4500–4509, 2019.
  • [58] L. Zheng, L. Shen, L. Tian, S. Wang, J. Wang, and Q. Tian. Scalable person re-identification: A benchmark. In CVPR, pages 1116–1124, 2015.
  • [59] L. Zheng, Y. Yang, and A. G. Hauptmann. Person re-identification: Past, present and future. arXiv preprint arXiv:1610.02984, 2016.
  • [60] Z. Zheng, L. Zheng, and Y. Yang. Unlabeled samples generated by gan improve the person re-identification baseline in vitro. In CVPR, pages 3754–3762, 2017.
  • [61] Z. Zhong, L. Zheng, D. Cao, and S. Li. Re-ranking person re-identification with k-reciprocal encoding. In CVPR, pages 1318–1327, 2017.
  • [62] Z. Zhong, L. Zheng, S. Li, and Y. Yang. Generalizing a person retrieval model hetero-and homogeneously. In ECCV, pages 172–188, 2018.
  • [63] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang. Invariance matters: Exemplar memory for domain adaptive person re-identification. In CVPR, pages 598–607, 2019.
  • [64] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang. Camera style adaptation for person re-identification. In CVPR, pages 5157–5166, 2018.
  • [65] Z. Zhong, L. Zheng, Z. Zheng, S. Li, and Y. Yang. Camstyle: A novel data augmentation method for person re-identification. IEEE Transactions on Image Processing, 28(3):1176–1190, 2018.