
Viewpoint-aware Progressive Clustering for Unsupervised Vehicle Re-identification

Aihua Zheng, Xia Sun,  Chenglong Li,  Jin Tang This research is supported in part by the Major Project for New Generation of AI under Grant (No. 2018AAA0100400), National Natural Science Foundation of China (61976002, 61976003 and 61860206004), the Natural Science Foundation of Anhui Higher Education Institutions of China (KJ2019A0033), and the Open Project Program of the National Laboratory of Pattern Recognition (NLPR) (201900046). (Corresponding author: Chenglong Li.) A. Zheng, X. Sun, C. Li, and J. Tang are with the Anhui Provincial Key Laboratory of Multi-modal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, 230601, China (e-mail: [email protected]; [email protected]; [email protected]; [email protected])
Abstract

Vehicle re-identification (Re-ID) is an active task due to its importance in large-scale intelligent monitoring in smart cities. Despite the rapid progress in recent years, most existing methods handle the vehicle Re-ID task in a supervised manner, which is both time- and labor-consuming and limits their application to real-life scenarios. Recently, unsupervised person Re-ID methods have achieved impressive performance by exploring domain adaptation or clustering-based techniques. However, one cannot directly generalize these methods to vehicle Re-ID since vehicle images present huge appearance variations across viewpoints. To handle this problem, we propose a novel viewpoint-aware clustering algorithm for unsupervised vehicle Re-ID. In particular, we first divide the entire feature space into different subspaces according to the predicted viewpoints and then perform progressive clustering to mine the accurate relationship among samples. Comprehensive experiments against state-of-the-art methods on two multi-viewpoint benchmark datasets, VeRi and VeRi-Wild, validate the promising performance of the proposed method for unsupervised vehicle Re-ID, both with and without domain adaptation.

Index Terms:
Viewpoint-aware, Progressive Clustering, Vehicle Re-ID, Unsupervised Learning.

I Introduction

Vehicle re-identification aims to identify a specific vehicle across non-overlapping camera networks. It is a crucial task in modern society with potential applications in intelligent transportation, smart city and public security, to name a few. Similar to the person Re-ID task, vehicle Re-ID faces common challenges such as illumination and viewpoint changes across cameras, background clutter, and occlusions. Moreover, vehicle Re-ID dramatically suffers from the challenges of large intra-class discrepancy and inter-class similarity: different vehicles may present nearly identical appearances, while the same vehicle may present totally different appearances across viewpoints, as shown in Fig. 1. Therefore, one cannot directly deploy person Re-ID models to achieve satisfactory performance in vehicle Re-ID.

With the blossoming of deep learning techniques and their powerful learning ability on large labeled data, various supervised learning architectures [1, 2, 3, 4, 5, 6, 7, 8, 9] have been proposed and have achieved remarkable performance for vehicle Re-ID. Despite this great progress, supervised learning-based methods require numerous annotations to train the deep models, which is time- and labor-consuming and significantly limits real-life applications of vehicle Re-ID.

Figure 1: Illustration of major challenges in vehicle Re-ID. Different vehicles with the same viewpoint have higher visual similarity than the same vehicle with different viewpoints, and it is very common in real scenes that the same vehicle under different viewpoints has a similar appearance to different vehicles under different viewpoints. These examples demonstrate that vehicle Re-ID greatly suffers from the challenges of large intra-class discrepancy and inter-class similarity.

Domain adaptation, which transfers learned information from the source domain (labeled data) to the target domain (unlabeled data), has been widely explored in the past decade as one unsupervised learning paradigm in both person Re-ID [10, 11, 12, 13] and vehicle Re-ID [14, 15]. However, these methods still require large amounts of annotation in the source domain. In addition, when the style gap between the two domains is too large, such transfer learning methods are also limited.

Different from domain adaptation-based methods, we study the problem of vehicle Re-ID in the target-only unsupervised learning setting, which does not rely on any labeled data in the source domain. As one type of target-only unsupervised method, clustering-based methods have been widely explored in related computer vision tasks [16, 17, 18, 19, 15, 20]. Recent clustering-based efforts in person Re-ID assign pseudo labels to samples via clustering algorithms and then use these pseudo-labeled samples to train Re-ID models [17, 18, 15, 21].

However, one cannot directly apply these techniques to vehicle Re-ID. One of the key reasons is the large viewpoint variation of vehicles, which poses significant challenges to clustering algorithms. As shown in Fig. 1, by directly calculating the cosine similarity between vehicle images, we can see that the similarity between images of the same vehicle with different viewpoints is even lower than that between different vehicles with a similar appearance in the same viewpoint, which is referred to as the similarity dilemma of vehicles in this paper. Due to the inter-instance similarity and intra-instance discrepancy caused by large viewpoint variations, the accuracy of clustering algorithms is significantly affected, and the performance of vehicle Re-ID is thus extremely degraded.

To handle this problem, we propose a novel viewpoint-aware progressive clustering framework (VAPC) for robust unsupervised vehicle Re-ID. We observe that vehicle images from different viewpoints of the same ID are more similar than vehicle images from different viewpoints of different IDs, e.g., the image pair {ID1(front), ID1(front_side)} is more similar than {ID1(front), ID2(front_side)}. Therefore, we can divide the vehicles into different viewpoints. The vehicles within each viewpoint cluster are free of the effects of large viewpoint variations, and the same ID from different viewpoints can be correctly associated according to the degree of similarity. In addition, when clustering is performed only among samples of the same viewpoint, comparisons between different viewpoints of the same ID are excluded, which further reduces intra-class differences and simplifies the clustering task. Therefore, we propose a viewpoint-aware progressive clustering framework, which consists of three parts. First, considering the extreme viewpoint changes of vehicles, we design a viewpoint-aware network, which can be pre-trained using viewpoint annotations [22], to predict the viewpoints of vehicle images as prior information. Second, feature extraction is crucial to the performance of clustering. To extract a discriminative feature for each sample, it is necessary to train an initial model with strong feature extraction capability. In this paper, we use a self-supervised manner to learn the discriminative feature of each sample. Without ground-truth labels in the target-only unsupervised setting, we treat each sample as a category and force the network to learn the discriminative feature of each sample via the repelled loss [23, 17], which we call the recognition stage. Third, we design a viewpoint-aware progressive clustering algorithm to handle the similarity dilemma discussed above. Specifically, we first perform clustering within each vehicle image set of the same viewpoint and then cluster across viewpoints by comparing the similarity of clusters from different viewpoints. In this way, we can distinguish the small gaps between different identities in the same viewpoint, and mine the samples of the same identity with large gaps between different viewpoints.

We use the clustered results to train the Re-ID network in a supervised way after the progressive viewpoint-aware clustering. However, the clustering performance across different viewpoints significantly relies on the clustering results within the same viewpoint. Therefore, we introduce the $k$-reciprocal encoding [24, 20, 15] as the distance metric for feature comparison within the same viewpoint due to its powerful ability in mining similar samples.

In addition, recent methods [19, 15, 20] achieve remarkable performance on target-only unsupervised person Re-ID. However, they directly employ the prevalent DBSCAN [25] to obtain pseudo labels while discarding all noisy samples (the hard positive and hard negative samples whose pseudo labels are assigned as -1) in the training stage. We argue that it is more important to learn discriminative embeddings by mining hard positive samples than to naively learn from simple samples, which has been proven in a large number of machine learning tasks [26, 27, 28, 29, 30, 31]. To this end, we propose a noise selection method that classifies each noise sample into a suitable cluster according to the similarity between the noise sample and other clusters.

Based on the above discussion, VAPC addresses unsupervised vehicle Re-ID through a viewpoint-aware progressive clustering framework. We alleviate the impact of the vehicle similarity dilemma on clustering by transforming global comparison into progressive clustering based on viewpoints. To improve the clustering quality within the same-viewpoint clusters, we introduce the $k$-reciprocal encoding [24, 20, 15] as the distance metric for DBSCAN [25] clustering. To deal with outlier noise samples, we propose a noise selection method to further improve the generalization ability of the model. The major contributions of this work are summarized as follows.

  • We propose a novel progressive clustering method to handle the similarity dilemma of vehicles in unsupervised vehicle Re-ID. To the best of our knowledge, this is the first attempt to employ a viewpoint-aware progressive clustering algorithm for unsupervised vehicle Re-ID.

  • We design a noise selection scheme to mine the hard positive samples with the same identity while considering their relationship to the hard negative samples, which significantly improves the discriminative ability of our network.

  • Comprehensive experimental results on two benchmark datasets, VeRi-776 [1] and VeRi-Wild [32], demonstrate the promising performance of our method and yield a new state-of-the-art for unsupervised vehicle Re-ID.

II Related Works

Since most vehicle Re-ID methods are in a supervised fashion, we briefly review the progress in supervised vehicle Re-ID and recent advances in unsupervised person/vehicle Re-ID.

II-A Vehicle Re-ID.

Most existing deep vehicle re-identification methods follow a supervised setting. Pioneering vehicle Re-ID methods [33, 34, 35] focus on discriminative feature learning. Lou et al. [34] mine similar negative samples so that the features learned by the model become more robust. He et al. [35] propose an efficient feature-preserving method, which enhances the ability to perceive subtle differences. Some works [36, 37, 38, 5, 6, 7] introduce additional attribute information, such as color or type, to improve the discrimination of deep features for vehicle Re-ID. Temporal path information is another form of auxiliary information and has been widely employed [39, 40] to improve the robustness of vehicle Re-ID, especially for vehicles with a similar appearance from the same manufacturer. To handle the viewpoint variation issue in vehicle Re-ID, Zhou et al. [5, 6] and Liu et al. [7] employ GANs to infer multi-view information from a single view of the input image at either the image or feature level, boosting performance by integrating the input and the generated images or features. Chu et al. [41] separate Re-ID into similar-viewpoint and different-viewpoint modes, and learn a respective deep metric for each case. When a 3D bounding box of the vehicle image is known, Sochor et al. [42] calculate orientation information from the 3D coordinates and add it to the feature map to improve performance. Despite the significant progress on vehicle Re-ID, these supervised deep learning-based methods require extensive training data, which is expensive in both time and labor.

Figure 2: Overview of our framework. We first predict the viewpoint of each image, and then the viewpoint-aware unlabeled training set is fed into the CNN model for feature extraction, which divides the features into different viewpoint clusters. We then go through a recognition stage to make the feature extracted for each sample more discriminative. We design a clustering method that proceeds by viewpoint and by period. In the first period, we cluster within the same viewpoint; for the noise samples found during clustering, we design a noise selection method to handle them. In the second period, we compare the distances between all different viewpoints, and clusters whose distance is smaller than the threshold $\tau$ are merged. The network is iteratively trained based on the final clustering results.

II-B Unsupervised Person/Vehicle Re-ID.

Along with the great achievements in person Re-ID, unsupervised person Re-ID poses more challenges and has attracted more and more attention recently. Recent unsupervised person Re-ID methods generally fall into two categories. 1) Domain adaptation-based methods [10, 11, 12, 13, 43, 44], which aim to transfer the knowledge in the labeled source domain to the unlabeled target domain. Although domain adaptation-based methods achieve impressive results in unsupervised Re-ID by exploring domain-invariant features, they still require a large amount of label annotation in the source domain. Furthermore, the huge diversity between different domains limits their transfer capability. 2) Target-only methods [17, 18, 45], which fulfill the unsupervised task by dividing the unlabeled samples into different categories based on a specific similarity. Lin et al. [17] treat each image as a single category and then gradually reduce the number of categories in subsequent clustering steps. Lin et al. [45] propose a framework that mines similarity as a soft constraint and introduces camera information to encourage similar samples under different cameras to approach each other.

To the best of our knowledge, there are few works on unsupervised vehicle Re-ID. Peng et al. [14] propose to use a style GAN to translate vehicle images in the source domain into the style of the target domain. They assume that the source domain contains more viewpoints than the target domain for better generation. Song et al. [15] introduce theoretical guarantees for unsupervised domain adaptive Re-ID and use a self-training scheme to iteratively optimize the unsupervised domain adaptation model. However, it only focuses on unsupervised domain adaptation, not target-only unsupervised learning. Bashir et al. [46] employ clustering and reliable result selection with embedded color information to iteratively fine-tune a cascade network. However, besides the annotation of color information, this method requires the number of identities to be specified, which is hard to know in real-life scenarios.

III Proposed Approach

The pipeline of the proposed framework is shown in Fig. 2, which includes three parts: 1) viewpoint prediction, that identifies the viewpoint information through a viewpoint prediction network on input data, 2) recognition stage, that learns the discriminative feature for each sample using the repelled loss, and 3) progressive clustering, that uses the two-period algorithm to handle the problem of the similarity dilemma in clustering. The detailed optimization process is shown in Algorithm 1.

III-A Viewpoint Prediction.

Due to the extreme viewpoint changes of vehicles, the inter-class differences between different vehicles are relatively small. We argue that the global comparison in previous unsupervised clustering methods [17, 18, 15] tends to group different vehicles with the same viewpoint into the same cluster. Therefore, this global comparison scheme cannot guarantee promising performance for target-only unsupervised vehicle Re-ID without any label supervision in network training. To handle this problem, we introduce a viewpoint prediction model to identify the vehicle's viewpoint information for the forthcoming clustering.

Specifically, we use a viewpoint prediction network to predict the viewpoint of each unlabeled vehicle image $x_{i}$ in the training set $X=\{x_{1},x_{2},...,x_{N}\}$. We train our viewpoint prediction model on VeRi-776 [1], which contains all visible viewpoints of vehicles. Following the viewpoint annotation in previous work [22], we divide vehicle images into five viewpoints, i.e., $v=\{front, front\_side, side, rear\_side, rear\}$. Furthermore, we additionally labeled 3000 samples in VeRi-Wild [32] to fine-tune the model and improve the robustness of the viewpoint prediction. We use the commonly used cross-entropy loss $L_{\eta}$ to optimize the viewpoint classifier $W(x_{i}\mid\theta)$,

L_{\eta}=-\sum_{i=1}^{N}y_{v}\log\left(W\left(x_{i}\mid\theta\right)\right)   (1)

where $y_{v}$ is the one-hot vector of the ground-truth viewpoint label.
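To make the viewpoint prediction step concrete, the following is a minimal PyTorch-style sketch of training a viewpoint classifier with the cross-entropy loss of Eq. (1) and then assigning viewpoints to unlabeled images. The backbone choice, data-loader interface and function names are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torchvision

# Hypothetical sketch of the viewpoint classifier (Eq. 1): a ResNet backbone
# with a 5-way head for {front, front_side, side, rear_side, rear}.
NUM_VIEWPOINTS = 5

model = torchvision.models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, NUM_VIEWPOINTS)
criterion = nn.CrossEntropyLoss()            # implements Eq. (1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)

def train_one_epoch(loader):
    """`loader` is assumed to yield (image_batch, viewpoint_label) pairs."""
    model.train()
    for images, viewpoint_labels in loader:
        logits = model(images)               # W(x_i | theta)
        loss = criterion(logits, viewpoint_labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

def predict_viewpoints(loader):
    """Assign each unlabeled image to one of the 5 viewpoint subsets."""
    model.eval()
    preds = []
    with torch.no_grad():
        for images, _ in loader:
            preds.append(model(images).argmax(dim=1))
    return torch.cat(preds)
```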

III-B Recognition Stage.

After the viewpoint prediction, we obtain the viewpoint-aware unlabeled training set $X^{v}=\{x_{1}^{v},x_{2}^{v},...,x_{N}^{v}\}$, and the current training set can be regarded as clusters divided according to viewpoint. For example, VeRi-776 [1] falls into five different viewpoint clusters. For each image in $X^{v}$, we assign a unique index-label $y_{ind}=\{1,2,3,...,N\}$ to indicate the category of each sample. To learn discriminative features, one could directly use the triplet loss [47, 26] or the cross-entropy loss via classification. However, learning driven by these losses, which mainly calculate the similarity within each batch, becomes inefficient and difficult to converge as the dataset scale grows. Herein, we employ the more efficient repelled loss [23, 48, 17], which calculates the feature similarity between the current sample and all training samples at once.

The repelled loss is equipped with a key-value structure that stores the features of all training samples, and the index-label $y_{ind}$ is stored in the key memory. The $y_{ind}$ does not change during the entire training process. We calculate the feature similarity between $f_{i}^{v}$, the feature of the $i$-th image in the $v$-th viewpoint, and all the samples,

p(y_{p}\mid x_{i}^{v})=\frac{\exp\left(M[i]^{T}f_{i}^{v}/\beta\right)}{\sum^{N}_{j=1}\exp\left(M[j]^{T}f_{i}^{v}/\beta\right)}   (2)

where $M[i]$ denotes the $i$-th slot of the value memory $M$. $\beta$ is a hyper-parameter that controls the softness of the probability distribution over classes, and is set to 0.1 following [17]. $N$ indicates the number of clusters. $y_{p}$ is the pseudo label, and we initialize $y_{p}=y_{ind}$. We maximize the distance between samples by assigning each sample to its own slot,

L_{\alpha}=-\log\left(p(y_{p}\mid x_{i}^{v})\right)   (3)

During back propagation, the feature memory is updated by $M[y_{i}]\leftarrow\frac{1}{2}(M[y_{i}]+f_{i}^{v})$. At the recognition stage, $M[y_{i}]$ stores the feature of each training sample. At the subsequent progressive clustering stage, the pseudo label $y_{p}$ of each sample is redistributed according to the clustering results, while each slot stores the feature of its cluster.
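Below is a minimal sketch of how the repelled loss (Eqs. (2)-(3)) and the memory update could be implemented; the tensor shapes, function names and the extra re-normalization of memory slots are our assumptions rather than the exact implementation.

```python
import torch
import torch.nn.functional as F

def repelled_loss(features, indices, memory, beta=0.1):
    """
    features : (B, D) L2-normalized features f_i^v of the current batch
    indices  : (B,) pseudo-label / slot index y_p of each sample
    memory   : (N, D) value memory M storing one feature per class/sample
    Implements Eqs. (2)-(3): softmax over similarities to all memory slots.
    """
    sims = features @ memory.t() / beta      # (B, N): M[j]^T f_i^v / beta
    log_probs = F.log_softmax(sims, dim=1)   # Eq. (2) in log space
    loss = -log_probs[torch.arange(features.size(0)), indices].mean()  # Eq. (3)
    return loss

@torch.no_grad()
def update_memory(features, indices, memory):
    """Slot update M[y_i] <- (M[y_i] + f_i^v) / 2, then re-normalize (assumed)."""
    memory[indices] = F.normalize(0.5 * (memory[indices] + features), dim=1)

# Usage sketch: N training images, D-dimensional features.
N, D = 1000, 2048
memory = F.normalize(torch.randn(N, D), dim=1)
feats = F.normalize(torch.randn(16, D), dim=1)
idx = torch.randint(0, N, (16,))
loss = repelled_loss(feats, idx, memory)
update_memory(feats, idx, memory)
```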

III-C Progressive Clustering.

Without any identity information, we propose a progressive clustering algorithm for unsupervised vehicle Re-ID. It mainly contains three aspects: a two-period algorithm to avoid the similarity dilemma caused by the extreme viewpoint changes of vehicles, the $k$-reciprocal encoding to re-measure the distance for more robust clustering, and clustering with noisy sample selection to deal with outliers that are difficult to cluster in real scenes.

The First Period. Through the recognition stage, the model learns a more recognizable identity feature for each image. The features obtained from the training set are $F^{*}=\{F_{1},F_{2},...,F_{V}\}$,

F_{v}=\{f_{1}^{v},f_{2}^{v},...,f_{N_{v}}^{v}\}   (4)

where $F_{v}$ and $N_{v}$ represent the feature set and the number of samples of the $v$-th viewpoint. We compare the similarity of all features in $F_{v}$ belonging to the same viewpoint cluster to obtain the distance matrix $D(F_{m},F_{n}),\ m=n$, where $D$ is the scoring matrix of the Euclidean distance $d_{ij}=\|f_{i}-f_{j}\|^{2}$. There is no doubt that the same vehicle with the same viewpoint has the highest similarity and thus tends to be clustered together (assigned the same pseudo label) with the highest priority. For each distance matrix of the same viewpoint, we obtain pseudo labels with the prevalent clustering algorithm DBSCAN [25], which can effectively deal with noise points and discover spatial clusters of arbitrary shape without requiring the number of clusters, in contrast to conventional k-means [49] clustering.
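A small sketch of the first period, assuming scikit-learn's DBSCAN and per-viewpoint feature arrays; in the full method the precomputed distance of Eq. (9) would be passed via metric='precomputed', while plain Euclidean distance is used here for brevity. All names and parameter values are illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def first_period_clustering(features_by_view, eps=0.5, min_samples=4):
    """
    features_by_view: dict {viewpoint: (N_v, D) feature array F_v}.
    Returns pseudo labels per viewpoint; -1 marks noise samples, and cluster
    ids are offset so that labels never collide across viewpoints.
    """
    pseudo_labels = {}
    offset = 0
    for view, feats in features_by_view.items():
        labels = DBSCAN(eps=eps, min_samples=min_samples,
                        metric="euclidean").fit_predict(feats)
        labels[labels >= 0] += offset          # keep viewpoint label spaces disjoint
        pseudo_labels[view] = labels
        if (labels >= 0).any():
            offset = labels.max() + 1
    return pseudo_labels

# Usage sketch with random features for two viewpoints.
feats = {"front": np.random.rand(100, 128), "side": np.random.rand(80, 128)}
labels = first_period_clustering(feats)
```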

The Second Period. In the second period, we compare the distance between clusters from different viewpoints. We take the shortest distance between features in two clusters as the measure of the distance between the clusters. Considering that we do not know whether the current sample has positive samples (with the same identity) in other viewpoints, we comprehensively compare the distances between all clusters of different viewpoints,

D^{*}_{mn}=\left\{D(F_{1},F_{2}),...,D(F_{m},F_{n})\right\},\ m\neq n.   (5)

We argue that the higher the similarity, the more likely two clusters share the same identity. We thus adopt a progressive strategy to gradually merge clusters across different viewpoints. Specifically, we first calculate a rank list $R$,

R=\mathrm{argsort}(D_{mn}^{*}),\ m\neq n   (6)

where $R$ ranks the most similar clusters among all different viewpoints. We set a strict distance threshold $\tau$, and merge clusters from different viewpoints only when the distance of the candidates in $R^{*}$ is less than $\tau$, i.e.,

R^{*}=R[1:C(d=\tau)]   (7)

where $C=\{c_{i},c_{j}\}$ is the last sample pair from different clusters with distance less than $\tau$. Intuitively, due to the style diversity of different datasets, we expect the setting of $\tau$ to be independent of the dataset. In our method, after the recognition stage, we sort the calculated $D^{*}_{mn}$ in ascending order and set the distance of the $ti$-th smallest sample pair as the threshold $\tau$. The distance threshold is calculated only once after the recognition stage and then fixed for the whole training process. We alternately execute the above two periods during each iteration. The model learns the features of vehicles from the same viewpoint while continuously mining the features of vehicles with the same identity from different viewpoints.
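The second period can be sketched as follows under our reading of Eqs. (5)-(7): compute the minimum inter-cluster distance for every pair of clusters from different viewpoints, rank the pairs, and merge those whose distance is below the threshold $\tau$. For brevity the threshold is derived inside the function from the $ti$-th smallest cross-viewpoint distance, whereas the paper fixes $\tau$ once after the recognition stage; the data structures and names are assumptions.

```python
import numpy as np
from scipy.spatial.distance import cdist

def second_period_merge(clusters, ti=1200):
    """
    clusters: list of dicts {"view": v, "feats": (n_i, D) array, "label": id}.
    Returns a mapping old_label -> merged_label after cross-viewpoint merging.
    """
    # Distance between two clusters = shortest distance between their members.
    pairs = []
    for i in range(len(clusters)):
        for j in range(i + 1, len(clusters)):
            if clusters[i]["view"] == clusters[j]["view"]:
                continue                        # only compare different viewpoints (Eq. 5)
            d = cdist(clusters[i]["feats"], clusters[j]["feats"]).min()
            pairs.append((d, i, j))
    pairs.sort(key=lambda p: p[0])              # rank list R (Eq. 6)

    tau = pairs[min(ti, len(pairs)) - 1][0] if pairs else 0.0  # ti-th smallest distance
    parent = {c["label"]: c["label"] for c in clusters}

    def find(x):                                # union-find to chain merges
        while parent[x] != x:
            x = parent[x]
        return x

    for d, i, j in pairs:                       # R*: pairs with d < tau (Eq. 7)
        if d >= tau:
            break
        parent[find(clusters[j]["label"])] = find(clusters[i]["label"])
    return {lbl: find(lbl) for lbl in parent}
```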

Algorithm 1 The viewpoint-aware progressive clustering method (VAPC)
0:  Unlabeled training set $X=\{x_{1},x_{2},x_{3},...,x_{N}\}$; recognition stage epochs $E_{r}$; the distance of the $ti$-th most similar sample pair set as the distance threshold; CNN model $M$; index-labels $y_{ind}=1,2,3,...,N$.
1:  Viewpoint prediction: $X \rightarrow X^{v}$, $V=5$;
2:  Recognition stage:
3:  for $i<E_{r}$ do
4:     Train CNN model $M$ with $X$ and $y_{ind}$ according to Eq. (3).
5:  end for
6:  Calculate threshold $\tau$;
0:  Best CNN model $M$
7:  Progressive clustering stage:
8:  First period:
9:  for $i<V$ do
10:     Calculate distance matrix $D(F_{i},F_{i})$.
11:     Re-measure the distance by Eq. (9) to obtain $D_{J}(F_{i},F_{i})$.
12:     Use DBSCAN to obtain clustering results.
13:  end for
14:  Mine noise samples according to Eq. (10).
15:  Second period:
16:  Compare feature sets across different viewpoints to obtain the distance matrix $D(F_{m},F_{n}),\ m\neq n$.
17:  Select the clusters to be merged across different viewpoints through Eq. (6) and Eq. (7).
18:  Retrain CNN model $M$ with $X$ and $y_{p}$ according to Eq. (3).
19:  Evaluate on the test set $\rightarrow$ performance $P$.
20:  if $P>P^{\ast}$ then
21:     $P^{\ast}=P$
22:     Save the best model $M$.
23:  end if

Distance metric by $k$-reciprocal encoding. Clearly, the more positive samples gathered in the same-viewpoint clusters in the first period, the higher the clustering quality across different viewpoints in the second period, which in turn benefits the performance of the next iteration. Since the clustering result significantly relies on the distance metric, we introduce the widely used $k$-reciprocal encoding [24, 20, 15] as the distance metric for feature comparison. For a sample $x_{i}^{v}$ in $X^{v}$, we record its $k$ nearest neighbors with index-labels $K_{k}(x_{i}^{v})$. For every index $ind\in K_{k}(x_{i}^{v})$, if $\left|K_{k}(x_{i}^{v})\cap K_{\frac{k}{2}}(x_{ind}^{v})\right|\geqslant\frac{2}{3}\left|K_{\frac{k}{2}}(x_{ind}^{v})\right|$, the mutual $k$-nearest-neighbor set of $x_{i}^{v}$ is expanded as $S_{i}\leftarrow K_{k}(x_{i}^{v})\cup K_{\frac{k}{2}}(x_{ind}^{v})$. In this way, all reliable samples similar to $x_{i}^{v}$ are recorded in $S_{i}$. Then the distance $d_{ij}$ of each sample pair in the same-viewpoint distance matrix $D(F_{m},F_{n}),\ m=n$, is reassigned a weight by,

\tilde{d}_{ij}=\begin{cases}e^{-d_{ij}} & \text{if } j\in S_{i},\\ 0 & \text{otherwise}\end{cases}   (8)

For each image pair $(x_{i}^{v},x_{j}^{v})$ at the same viewpoint, we obtain a new distance matrix $D_{J}(F_{m},F_{n}),\ m=n$ for clustering, which can be calculated by,

d_{J}(x_{i}^{v},x_{j}^{v})=1-\frac{\sum_{l=1}^{N_{v}}\min(\tilde{d}_{il},\tilde{d}_{jl})}{\sum_{l=1}^{N_{v}}\max(\tilde{d}_{il},\tilde{d}_{jl})}   (9)

where $N_{v}$ is the total number of samples in viewpoint $v$.
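For reference, a compact NumPy sketch of the re-metric step of Eqs. (8)-(9): build the expanded $k$-reciprocal neighbor sets, re-weight the pairwise distances, and compute the Jaccard-style distance used for same-viewpoint clustering. Variable names follow our reading of the equations and are assumptions.

```python
import numpy as np

def k_reciprocal_distance(feats, k=20):
    """
    feats: (N, D) features of one viewpoint subset.
    Returns an (N, N) Jaccard-style distance matrix d_J (Eq. 9).
    """
    N = feats.shape[0]
    d = np.linalg.norm(feats[:, None] - feats[None, :], axis=2) ** 2  # d_ij
    ranks = np.argsort(d, axis=1)

    # Expanded k-reciprocal neighbor set S_i for every sample.
    S = []
    for i in range(N):
        k_nn = set(ranks[i, :k])
        S_i = set(k_nn)
        for ind in k_nn:
            half_nn = set(ranks[ind, :k // 2])
            if len(k_nn & half_nn) >= (2.0 / 3.0) * len(half_nn):
                S_i |= half_nn
        S.append(S_i)

    # Re-weighted distances (Eq. 8): e^{-d_ij} if j in S_i, else 0.
    d_tilde = np.zeros_like(d)
    for i in range(N):
        idx = np.fromiter(S[i], dtype=int)
        d_tilde[i, idx] = np.exp(-d[i, idx])

    # Jaccard-style distance (Eq. 9).
    d_J = np.zeros_like(d)
    for i in range(N):
        mins = np.minimum(d_tilde[i], d_tilde).sum(axis=1)
        maxs = np.maximum(d_tilde[i], d_tilde).sum(axis=1)
        d_J[i] = 1.0 - mins / np.maximum(maxs, 1e-12)
    return d_J
```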

Figure 3: Illustration of noise selection. Samples with the same color belong to the same cluster, except that red marks noisy samples (pseudo label -1). $P_{1}$ and $P_{2}$ represent two different noise situations. After noise selection, we reconstruct the clusters for the noise samples by comparing each noise sample with the other clusters. $H_{p}$ and $H_{n}$ represent hard positive samples and hard negative samples, respectively.

Clustering with noisy sample selection. Our viewpoint-aware clustering strategy avoids comparing different viewpoints of vehicles during the first period of clustering, which alleviates the intra-class gap and greatly reduces the difficulty of clustering. However, due to the complexity of real scenes, some hard samples are still difficult to cluster and are then regarded as noise. The reason is that, although DBSCAN [25] can generate clusters for data of any spatial shape, it uses two parameters, $eps$ and $minPts$, to define the density condition that must be met when forming clusters in the training set; it thus tends to cluster the samples with small intra-class gaps and to treat the samples with larger intra-class gaps as noise, as shown in Fig. 3. We observe that these noise samples usually derive from two situations, shown as $P_{1}$ and $P_{2}$ in Fig. 3. In $P_{1}$, due to occlusion, misalignment of the bounding box, or deviation of the viewpoint prediction, samples with the same identity but far from an already formed cluster (the blue cluster in Fig. 3) are regarded as noise. In $P_{2}$, some samples from the same identity fail to form a cluster since they cannot meet the density condition due to large intra-class differences.

For $P_{1}$, we expect the noise samples to be classified into clusters with the same identity. For $P_{2}$, we expect the noise samples with the same identity to be gathered into a new cluster. Specifically, we use a set $S_{n}$ to collect all the noise samples. For each member $s_{i}$ in $S_{n}$, we look for its most similar sample $p_{i}$ with the same viewpoint, and constitute a set of similar sample pairs, $\{\{s_{1},p_{1}\},\{s_{2},p_{2}\},...,\{s_{n},p_{n}\}\}$, which is sorted in descending order of pairwise similarity. We then judge which situation each noise sample belongs to based on $p_{i}$. If $p_{i}$ belongs to a cluster that has already been formed, it corresponds to the first situation $P_{1}$; otherwise, $p_{i}$ is itself a noise sample and it belongs to the second situation $P_{2}$. However, directly merging $\{s_{i},p_{i}\}$ is not reliable because of hard negative samples, so we take a more reliable approach as follows. Inspired by the $k$-reciprocal encoding, if $\{s_{i},p_{i}\}$ belong to the same identity, their neighbor image sets should be similar, which also means that they should be located in each other's $\tilde{k}$-nearest neighbors. Therefore, we calculate the $\tilde{k}$-nearest-neighbor image set of the same viewpoint as $p_{i}$. If $s_{i}$ appears in $top\_\tilde{k}[p_{i}]$, $s_{i}$ is regarded as a reliable hard positive sample and is merged with $p_{i}$. Otherwise, the noisy sample $s_{i}$ is treated as a hard negative sample and divided into a new cluster to further learn its discriminative feature. Formally, we construct:

\left\{\begin{array}{l}H_{p_{1}}=\left\{\left(s_{i},p_{i}\right)\mid p_{i}\notin S_{n},\ top\_1[s_{i}]=p_{i},\ s_{i}\in top\_\tilde{k}[p_{i}]\right\}\\ H_{p_{2}}=\left\{\left(s_{i},p_{i}\right)\mid p_{i}\in S_{n},\ top\_1[s_{i}]=p_{i},\ s_{i}\in top\_\tilde{k}[p_{i}]\right\}\\ H_{n}=\left\{\left(s_{i}\right)\mid top\_1[s_{i}]=p_{i},\ s_{i}\notin top\_\tilde{k}[p_{i}]\right\}\end{array}\right.   (10)

We merge each $s_{i}$ in $H_{p_{1}}$ with the corresponding $p_{i}$; in $H_{p_{2}}$, we form a new cluster $C_{ni}$ for each pair $(s_{i},p_{i})$; and we treat each hard negative sample in $H_{n}$ as a single cluster. Note that once $C_{ni}$ is created, for each $c_{i}$ in $C_{ni}$ we dynamically look for $top\_\tilde{k}[c_{i}]$ as candidates and decide whether to merge a candidate into $C_{ni}$ according to the condition of $H_{p_{2}}$; a candidate merged into $C_{ni}$ will not be merged with other clusters afterwards.
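The noise selection rule of Eq. (10) can be sketched as below: for each noise sample, find its most similar same-viewpoint neighbor, check the mutual top-$\tilde{k}$ condition, and then either join an existing cluster ($P_1$), form a new cluster with another noise sample ($P_2$), or remain as its own cluster (hard negative). The data layout and function signature are illustrative assumptions.

```python
import numpy as np

def noise_selection(feats, labels, k_tilde=2):
    """
    feats:  (N, D) features of one viewpoint subset.
    labels: (N,) pseudo labels from DBSCAN; -1 marks noise samples.
    Returns updated labels after handling noise (Eq. 10).
    """
    labels = labels.copy()
    d = np.linalg.norm(feats[:, None] - feats[None, :], axis=2)
    np.fill_diagonal(d, np.inf)
    order = np.argsort(d, axis=1)

    noise_idx = np.where(labels == -1)[0]
    # Handle noise samples in order of similarity to their nearest neighbor.
    noise_idx = sorted(noise_idx, key=lambda s: d[s, order[s, 0]])
    next_label = labels.max() + 1

    for s in noise_idx:
        p = order[s, 0]                         # most similar same-viewpoint sample p_i
        mutual = s in order[p, :k_tilde]        # s_i must lie in p_i's top-k~ neighbors
        if not mutual:
            labels[s] = next_label              # H_n: hard negative, its own new cluster
            next_label += 1
        elif labels[p] != -1:
            labels[s] = labels[p]               # P1 / H_p1: join p_i's existing cluster
        else:
            labels[s] = labels[p] = next_label  # P2 / H_p2: form a new cluster together
            next_label += 1
    return labels
```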

IV Experiments

We evaluate our proposed method VAPC on two benchmark datasets, VeRi-776 [1] and VeRi-Wild [32], which contain 5 and 4 viewpoints, respectively. We compare our method with prevalent domain adaptation-based unsupervised methods and with target-only methods without domain adaptation.

IV-A Datasets and Evaluation Protocol.

VeRi-776 [1] is a comprehensive vehicle re-identification dataset providing rich attribute information such as color, type and temporal path. It contains 776 different vehicles captured by 20 cameras, yielding 49,357 images and about 9,000 tracks. The training and testing sets contain 37,728 images of 576 vehicles and 11,579 images of 200 vehicles, respectively. Both training and testing sets contain 5 common visible viewpoints: front, front_side, side, rear_side, and rear. Following the protocol in [1], we only return matchings from cameras different from that of the query vehicle as the results. We use the mean average precision (mAP) and the cumulative matching characteristic (CMC) at Rank-1, Rank-5 and Rank-20 as the measurement metrics.

VeRi-Wild [32] is a large-scale vehicle Re-ID dataset containing more than 400 thousand images of 40 thousand vehicle IDs captured by 174 cameras in a surveillance system. It contains complex backgrounds, various viewpoints and illumination variations in real-world scenes. The training set contains 277,797 images of 30,671 vehicles. After viewpoint prediction of the training set, VeRi-Wild contains 4 viewpoints, front, front_side, rear_side, and rear, with 110,204, 52,716, 64,968 and 49,909 images, respectively. Due to hardware limitations, we use all the training data in the recognition stage and sample 10,000 images per viewpoint in the clustering stage. The testing set consists of three subsets, test-3000, test-5000, and test-10000, with different testing sizes. Following the protocol in [32], all the references of a given query are in the gallery. We use mAP, Rank-1 and Rank-5 as the evaluation metrics.

TABLE I: Comparison with the state-of-the-art target-only Re-ID and domain adaptive Re-ID methods on VeRi-776 and VeRi-Wild. "src" denotes the source domain/dataset, where "N/A" indicates target-only methods and "VehicleID" indicates domain adaptive methods using the VehicleID dataset. "VAPC_TO", "VAPC_DT" and "VAPC_DA" indicate our VAPC in the target-only, direct transfer and domain adaptation fashions, respectively.
method | src | VeRi-776: R1 / R5 / R20 / mAP | VeRi-Wild test-3000: R1 / R5 / mAP | test-5000: R1 / R5 / mAP | test-10000: R1 / R5 / mAP
OIM [23] | N/A | 45.1 / 62.2 / 78.1 / 12.2 | 48.7 / 66.6 / 14.4 | 45.0 / 60.9 / 12.6 | 38.8 / 54.4 / 10.0
Bottom [17] | N/A | 63.7 / 73.4 / 83.4 / 23.5 | 70.5 / 86.0 / 30.7 | 64.2 / 82.2 / 27.1 | 55.2 / 75.1 / 21.6
AE [50] | N/A | 73.4 / 82.5 / 89.7 / 26.2 | 68.5 / 87.0 / 29.9 | 61.8 / 81.5 / 26.2 | 53.1 / 73.7 / 20.9
VAPC_TO (ours) | N/A | 76.2 / 81.2 / 85.3 / 30.4 | 72.1 / 87.7 / 33.0 | 64.3 / 83.0 / 28.1 | 55.9 / 75.9 / 22.6
SPGAN [43] | VehicleID | 57.4 / 70.0 / - / 16.4 | 59.1 / 76.2 / 24.1 | 55.0 / 74.5 / 21.6 | 47.4 / 66.1 / 17.5
ECN [48] | VehicleID | 60.8 / 70.9 / 85.4 / 27.7 | 73.4 / 88.8 / 34.7 | 68.6 / 84.6 / 30.6 | 61.0 / 78.2 / 24.7
UDAP [15] | VehicleID | 76.9 / 85.8 / - / 35.8 | 68.4 / 85.3 / 30.0 | 62.5 / 81.8 / 26.2 | 53.7 / 73.9 / 20.8
VAPC_DT (ours) | VehicleID | 69.1 / 79.0 / 88.2 / 35.5 | 74.0 / 88.6 / 37.7 | 68.1 / 84.8 / 33.1 | 60.2 / 78.7 / 26.3
VAPC_DA (ours) | VehicleID | 77.4 / 84.6 / 91.6 / 40.3 | 75.3 / 89.0 / 39.7 | 69.0 / 85.5 / 34.5 | 61.0 / 79.7 / 27.4
(a) OIM
(b) Bottom
(c) AE
(d) VAPC_TO (ours)
Figure 4: Visualization for features extracted by target-only method, OIM [23], Bottom [17], AE [50] and our method. 37 identities with 2000 images in the gallery of VeRi-776 are used. Each point represents an image, and each color represents a vehicle identity.

VehicleID [51] is another large-scale vehicle Re-ID dataset. It includes 110,178 real-scene images of 13,134 vehicles as the training set, and 111,585 images of 13,113 vehicles as the test set. In this paper, to compare with existing unsupervised domain adaptation methods, we also use the VehicleID dataset as the source domain for supervised training.

IV-B Implementation Details.

We use ResNet50 [52] as the backbone by removing the last classification layer. All experiments are implemented on two NVIDIA TITAN Xp GPUs. We initialize our model with weights pre-trained on ImageNet [53]. For the viewpoint prediction network, we set the batch size to 32 and the learning rate to 0.001, with a maximum of 20 epochs. If not specified, we use stochastic gradient descent with a momentum of 0.9 and a dropout rate of 0.5 to optimize the model. For the Re-ID feature extraction network, we resize the input images of VeRi-776 [1] and VeRi-Wild [32] to (384, 384). The batch size is set to 16. The learning rate is set to 0.1 at the recognition stage and divided by 10 after every 15 epochs, and set to 0.001 in the clustering stage. We only use random horizontal flipping as a data augmentation strategy. Following the protocol in [24], we set $k$ to 20.

IV-C Comparison with State-of-the-art Methods.

We compare our method with state-of-the-art unsupervised Re-ID methods on VeRi-776 [1] and VeRi-Wild [32] in both target-only and domain adaptation scenarios, as shown in TABLE I.

Comparison with target-only methods. We first compare our method with three state-of-the-art target-only unsupervised methods, OIM [23], Bottom [17] and AE [50]. Generally speaking, our method (VAPC_TO) outperforms the three state-of-the-art target-only methods by a large margin by exploring the intra-class relationship. OIM [23] is devoted to extracting discriminative features efficiently but ignores the intra-class relationship, which results in inferior performance. Bottom [17] designs a bottom-up clustering strategy by merging a fixed number of clusters at each step. However, each clustering step may produce wrong assignments, and more clustering steps accumulate more clustering errors. Especially on VeRi-776 [1], where almost all visible viewpoints are included, this brings greater clustering challenges: each clustering step tends to focus on the same viewpoint and cannot bring more samples of different viewpoints together. Our method effectively alleviates this problem and brings greater improvement. AE [50] clusters the samples via a similarity threshold and constrains the cluster size by embedding a balance term into the loss. However, due to the similarity dilemma of vehicles, where the same viewpoint of different identities may have higher similarity, it is difficult to set an optimal similarity threshold for clustering. In addition, more and more samples meeting the similarity threshold are treated as the same identity during training; especially on the larger-scale VeRi-Wild [32] dataset, this causes more severe data imbalance in each cluster and damages the feature representation. Therefore the performance of AE [50] on VeRi-Wild [32] declines compared with Bottom [17].

We further use t-SNE [54] to visualize the feature space distribution of our method compared with the three state-of-the-art target-only methods, as shown in Fig. 4. Compared with ours, the distribution of points is sparser for OIM [23] and Bottom [17], while more points of different colors are mixed together for AE [50]. Our method presents a better feature distribution, which demonstrates that VAPC_TO can successfully cluster more images of vehicles with the same identity and effectively improve the feature representation for unsupervised vehicle Re-ID.

Comparison with unsupervised domain adaptation. To evidence the effectiveness of our method on unsupervised vehicle Re-ID, we further evaluate our method in the domain adaptation fashion. Following the protocol in [15], we use VehicleID [51] as the source domain and employ the repelled loss [17] for supervised training, replacing the recognition stage in Section III-B. We compare our method in the domain adaptation fashion (VAPC_DA) with three state-of-the-art unsupervised domain adaptation methods, including SPGAN [43], ECN [48] and UDAP [15], as shown in the lower half of TABLE I.

SPGAN [43] considers the style change among different datasets and trains a style conversion model to bridge the style discrepancy between the source domain and the target domain. However, due to the huge gap between vehicle datasets in real scenes, e.g., the diverse viewpoints, resolution and illumination, it is challenging to obtain the desired translated image, which is crucial for SPGAN [43], and this results in poor performance for vehicle Re-ID. ECN [48] uses the source domain for model constraints while using the $k$-nearest-neighbor algorithm to mine the same identity in the target domain. The setting of the $k$ value has a large impact on the experimental results, and the most similar top-$k$ samples are almost always from the same viewpoint. UDAP [15] uses source domain data to initialize the model and theoretically analyzes the rules that the model needs to follow when adapting from the source domain to the target domain. It achieves satisfactory results on vehicle Re-ID by strengthening the constraints on target domain training: the target domain feature extractor has stronger learnability while obtaining source domain knowledge. However, it relies on global comparison, which may cause more clustering errors, especially on the VeRi-Wild [32] dataset, which presents much smaller inter-class differences than VeRi-776 [1].

TABLE II: Results evaluated on VeRi-776 and the test-3000 set of VeRi-Wild. kR means the distance metric by $k$-reciprocal encoding, NS means noise selection, and tP represents our two-period (first and second period) clustering strategy.
method | VeRi-776: R1 / R5 / mAP | VeRi-Wild test-3000: R1 / R5 / mAP
(a) Ours | 76.2 / 81.2 / 30.4 | 72.1 / 87.7 / 33.0
(b) w/o tP | 68.7 / 73.2 / 25.0 | 68.5 / 85.0 / 30.3
(c) w/o kR | 71.0 / 78.9 / 24.1 | 69.2 / 86.0 / 29.7
(d) w/o NS | 71.3 / 78.8 / 27.8 | 70.1 / 87.0 / 32.5
(e) w/o tP + kR + NS | 61.4 / 72.5 / 18.2 | 48.7 / 66.6 / 14.4
Figure 5: Illustrations with and without two-period strategy on VeRi-776. The same color represents the same cluster, and different shapes represent different identities. The red circle marks the false clustered samples.

In addition, we evaluate our method in the "direct transfer" fashion by training on the source domain and directly testing on the target domain, indicated as VAPC_DT in TABLE I. First, by leveraging the information in the source training data, VAPC_DT generally outperforms VAPC_TO, which verifies that the knowledge of the source domain obtained during training improves the vehicle retrieval ability of the model. The only exception is the Rank-1 score on VeRi-776 [1]. The main reason is the huge gap between the VehicleID [51] and VeRi-776 [1] datasets, e.g., VeRi-776 has lower resolution and more viewpoints, which results in poor generalization. Even so, VAPC_DT still significantly boosts the mAP score on VeRi-776 [1] compared with the target-only fashion (VAPC_TO). Second, VAPC_DT is even significantly superior to the domain adaptation methods SPGAN [43] and ECN [48], and comparable to UDAP [15] in mAP, which demonstrates the robustness of our method for unsupervised vehicle Re-ID.

Note that our method in the target-only fashion (VAPC_TO) even surpasses most unsupervised domain adaptation methods such as SPGAN [43] and ECN [48], and works comparably to UDAP [15]. This further verifies the promising performance of our method for unsupervised vehicle Re-ID, especially when no prior annotation or source data is available.

IV-D Ablation Study.

In this section, we thoroughly analyze the effectiveness of three critical components in the VAPC framework, including the two-period (tP) clustering strategy based on viewpoint prediction, the $k$-reciprocal encoding (kR) and the noise selection (NS), as reported in TABLE II.

Figure 6: Examples of ranking results with and without noise selection on VeRi-776 dataset. For each query, the top and the bottom rows show the ranking result without and with noise selection, respectively. The green and red boxes indicate the right and the wrong matchings, respectively.
Figure 7: Clustering illustrations with and without the distance metric by $k$-reciprocal encoding on VeRi-776. The same shape represents the same identity, and the same color represents the same cluster.

Quantitative study. One of the key contributions of our progressive clustering is the two-period clustering over both the same and different viewpoints for vehicle Re-ID. As shown in TABLE II (b), without dividing the viewpoints, i.e., removing the two-period (tP) strategy and clustering all training samples directly after the recognition stage, both mAP and rank scores drop significantly: -7.5% in Rank-1 and -5.4% in mAP on VeRi-776 [1], and -3.6% and -2.7% on VeRi-Wild [32] test-3000, which verifies the effectiveness of the progressive clustering for unsupervised vehicle Re-ID. Similar phenomena occur for the $k$-reciprocal encoding (kR) and the noise selection (NS), as shown in TABLE II (c) and TABLE II (d): removing the corresponding component causes both mAP and rank scores to decline significantly, which evidences the role of each component. Without any of the three components, the baseline (TABLE II (e)) performs poorly on both datasets due to its inability to cope with the various challenges brought about by the extreme viewpoint changes of vehicles. By integrating all three components, our method, as shown in TABLE II (a), achieves promising results for unsupervised Re-ID.

Qualitative study. To further understand the contribution of the three components, we visualize the results of the different variants discussed in TABLE II in terms of sample distribution or ranking list, as shown in Fig. 5 to Fig. 7. From Fig. 5 (a), we can see that more hard negative samples (different identities with highly similar appearance) from the same viewpoint tend to be clustered together without the two-period clustering strategy. Our method successfully gathers vehicle images of diverse viewpoints, even with large appearance differences due to viewpoint and illumination changes. This further evidences the effectiveness of the proposed two-period clustering strategy, which can distinguish small gaps between different identities in the same viewpoint and mine samples of the same identity with large gaps between different viewpoints. The role of the $k$-reciprocal encoding is to mine samples sharing the most similar features despite appearance differences. As shown in Fig. 7 (a), without the $k$-reciprocal encoding the same identity with different appearances caused by viewpoint and illumination changes tends to be split into individual clusters, while they are merged into one single cluster after introducing the $k$-reciprocal encoding, as shown in Fig. 7 (b). Fig. 6 demonstrates the qualitative comparison of the ranking results of three queries with and without noise selection. Clearly, after introducing the noise selection scheme, our method hits more correct matchings at earlier ranks and removes false matchings that have a similar appearance to the queries.

IV-E Analysis of Clustering Quality.

Clustering quality is a crucial factor in clustering-based methods for vehicle Re-ID. Therefore, we measure the clustering quality of our method via the Adjusted Mutual Information (AMI) [55] and compare with the state-of-the-art methods. AMI measures, through mutual information, how well the distribution of the pseudo labels generated by clustering matches that of the ground truth. The larger the AMI, the closer the distributions of the ground-truth and pseudo labels, and thus the better the clustering quality. We compare our method with Bottom [17], k-means [49] and DBSCAN [25], which also allocate pseudo labels during clustering.
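For reference, the AMI used here can be computed directly with scikit-learn; a minimal example, assuming the ground-truth identities and the clustering-assigned pseudo labels are available as integer arrays:

```python
from sklearn.metrics import adjusted_mutual_info_score

# Ground-truth vehicle identities vs. pseudo labels produced by clustering.
ground_truth = [0, 0, 1, 1, 2, 2]
pseudo_labels = [1, 1, 0, 0, 2, 2]

ami = adjusted_mutual_info_score(ground_truth, pseudo_labels)
print(f"AMI = {ami:.3f}")  # 1.0 here: the partitions match up to relabeling
```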

As illustrated in Fig. 8, the classic clustering algorithms k-means [49] and DBSCAN [25] perform poorly in the global comparison fashion. k-means [49] requires the number of clusters to be specified, which makes the membership of each cluster relatively stable; however, due to global comparison, a large number of samples with the same viewpoint but different identities appear in the same cluster, so model training keeps deteriorating. DBSCAN [25] is sensitive to noise; therefore, the large number of noise samples caused by the various challenges in real scenes deteriorates the clustering quality. Bottom [17] eventually collapses due to the accumulation of clustering errors at each step. Since clustering based on viewpoint division greatly simplifies the clustering task, and the progressive strategy gradually merges different viewpoints and gathers vehicles of the same identity across viewpoints, our method continues to improve with training.

IV-F Investigation of Viewpoint Prediction.

Viewpoint prediction is a prerequisite component in our framework, as discussed in Section III-A. To investigate the influence of viewpoint prediction on our method, we trained a series of classifiers with different accuracies for viewpoint prediction. The experimental results are shown in Fig. 9. As expected, the Re-ID accuracy of VAPC_TO decreases as the accuracy of the viewpoint classifier decreases. When the accuracy of the viewpoint classifier drops to 0.5, the Rank-1 accuracy drops from 76.2% to 70.0% (-6.2%) on VeRi-776 [1]. Even with only 0.5 viewpoint classification accuracy, our method still outperforms most unsupervised algorithms, as shown in TABLE I. We can see that a robust viewpoint classifier can significantly improve the performance of our algorithm; moreover, thanks to our reasonable clustering strategy and effective noise processing, our method still performs well with a poor viewpoint classifier.

Figure 8: The performance of clustering quality (AMI) on VeRi-776. Each step represents an iteration of progressive clustering and retraining the model.
Figure 9: The performance with viewpoint predictors of different error rates on VeRi-776.
(a) The parameter $ti$
(b) The parameter $\tilde{k}$
Figure 10: Parameter and method analysis. (a) The impact of $ti$ in progressive clustering. (b) The impact of $\tilde{k}$ in noise selection.

IV-G Parameter Analysis.

There are two essential parameters in our method: $ti$, which specifies that the distance of the $ti$-th most similar sample pair is taken as the threshold for combining clusters from different viewpoints, as explained in Section III-C and Eq. (7); and $\tilde{k}$ in Eq. (10), which defines the judgment condition when selecting noise, as explained in Section III-C. We evaluate the impact of these two parameters in this section.

The impact of the number $ti$. As shown in Fig. 10 (a), we vary $ti$ from 0 to 4000 to calculate the distance threshold $\tau$ and test the model performance. $ti=0$ means clustering only within the same viewpoint; a bigger $ti$ gives a larger threshold $\tau$. A large $ti$ harms the model performance; for example, when $ti>3500$, a substantial performance drop can be observed. This is because an overly large $ti$ may cause too many clusters from different viewpoints to be merged at one time, resulting in a large number of incorrect assignments. However, an overly small $ti$ selects only a few correct clusters, which also leads to poor performance. Considering the overall performance on VeRi-776 [1] and VeRi-Wild [32], we set $ti$ to 1200.

The impact of the number $\tilde{k}$. Fig. 10 (b) reports the analysis of $\tilde{k}$ during noise selection. As discussed in Section III-C, $\tilde{k}$ limits how noise samples are combined with clusters or with other noise samples: the larger $\tilde{k}$, the weaker the limitation. A larger $\tilde{k}$ degrades the performance on VeRi-Wild [32], while the performance remains stable on VeRi-776 [1]. The reason is that VeRi-Wild has smaller inter-class differences than VeRi-776 [1]; when $\tilde{k}$ increases, the constraint for judging whether two clusters should be merged is weakened, which increases the error rate. Based on the results in Fig. 10 (b), we set $\tilde{k}=2$ for the best balance.

V Conclusion

In this paper, we propose a viewpoint-aware progressive clustering method to solve the unsupervised vehicle Re-ID problem. We analyze the similarity dilemma of vehicle comparison and, for the first time, explore progressive clustering by dividing the training set into different subsets according to the viewpoint. In addition, we propose a noise selection strategy to handle the noise generated in the clustering process. Extensive experimental results demonstrate the effectiveness of the proposed method for unsupervised vehicle Re-ID.

Our method is based on the observation that images of vehicles from adjacent viewpoints normally share a large degree of common appearance and can therefore be merged during clustering. However, it is still difficult to cluster vehicles observed from only two viewpoints with large discrepancies, such as front and rear. In the future, we will explore more effective methods to deal with these more challenging situations.

References

  • [1] X. Liu, W. Liu, T. Mei, and H. Ma, “A deep learning-based approach to progressive vehicle re-identification for urban surveillance,” in European Conference on Computer Vision, 2016, pp. 869–884.
  • [2] H. Guo, C. Zhao, Z. Liu, J. Wang, and H. Lu, “Learning coarse-to-fine structured feature embedding for vehicle re-identification,” in Thirty-Second AAAI Conference on Artificial Intelligence, 2018, pp. 6853–6860.
  • [3] Z. Zheng, T. Ruan, Y. Wei, and Y. Yang, “Vehiclenet: Learning robust feature representation for vehicle re-identification.” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 1–4.
  • [4] B. He, J. Li, Y. Zhao, and Y. Tian, “Part-regularized near-duplicate vehicle re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3997–4005.
  • [5] Y. Zhou and L. Shao, “Cross-view gan based vehicle generation for re-identification.” in British Machine Vision Conference, 2017, pp. 1–12.
  • [6] ——, “Vehicle re-identification by adversarial bi-directional lstm network,” in 2018 IEEE Winter Conference on Applications of Computer Vision, 2018, pp. 653–662.
  • [7] X. Liu, S. Zhang, Q. Huang, and W. Gao, “Ram: a region-aware deep model for vehicle re-identification,” in 2018 IEEE International Conference on Multimedia and Expo, 2018, pp. 1–6.
  • [8] H. Wang, J. Peng, G. Jiang, F. Xu, and X. Fu, “Discriminative feature and dictionary learning with part-aware model for vehicle re-identification,” arXiv preprint arXiv:2003.07139, 2020.
  • [9] A. Suprem and C. Pu, “Looking glamorous: Vehicle re-id in heterogeneous cameras networks with global and local attention,” arXiv preprint arXiv:2002.02256, 2020.
  • [10] Z. Zhong, L. Zheng, S. Li, and Y. Yang, “Generalizing a person retrieval model hetero-and homogeneously,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 172–188.
  • [11] H. Fan, L. Zheng, C. Yan, and Y. Yang, “Unsupervised person re-identification: Clustering and fine-tuning,” ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 14, no. 4, pp. 1–18, 2018.
  • [12] P. Peng, T. Xiang, Y. Wang, M. Pontil, S. Gong, T. Huang, and Y. Tian, “Unsupervised cross-dataset transfer learning for person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1306–1315.
  • [13] J. Wang, X. Zhu, S. Gong, and W. Li, “Transferable joint attribute-identity deep learning for unsupervised person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 2275–2284.
  • [14] J. Peng, H. Wang, T. Zhao, and X. Fu, “Cross domain knowledge transfer for unsupervised vehicle re-identification,” in IEEE International Conference on Multimedia and Expo Workshops, 2019, pp. 453–458.
  • [15] L. Song, C. Wang, L. Zhang, B. Du, Q. Zhang, C. Huang, and X. Wang, “Unsupervised domain adaptive re-identification: Theory and practice,” Pattern Recognition, vol. 102, p. 107173, 2020.
  • [16] M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 132–149.
  • [17] Y. Lin, X. Dong, L. Zheng, Y. Yan, and Y. Yang, “A bottom-up clustering approach to unsupervised person re-identification,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2019, pp. 8738–8745.
  • [18] G. Ding, S. H. Khan, and Z. Tang, “Dispersion based clustering for unsupervised person re-identification,” in The British Machine Vision Conference, 2019, p. 264.
  • [19] X. Zhang, J. Cao, C. Shen, and M. You, “Self-training with progressive augmentation for unsupervised cross-domain person re-identification,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8222–8231.
  • [20] Y. Fu, Y. Wei, G. Wang, Y. Zhou, H. Shi, and T. S. Huang, “Self-similarity grouping: A simple unsupervised cross domain adaptation approach for person re-identification,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 6112–6121.
  • [21] F. Zhao, S. Liao, G.-S. Xie, J. Zhao, K. Zhang, and L. Shao, “Unsupervised domain adaptation with noise resistible mutual-training for person re-identification,” in European Conference on Computer Vision (ECCV), Glasgow, UK, 2020, pp. 1–18.
  • [22] A. Zheng, X. Lin, C. Li, R. He, and J. Tang, “Attributes guided feature learning for vehicle re-identification,” arXiv preprint arXiv:1905.08997, 2019.
  • [23] T. Xiao, S. Li, B. Wang, L. Lin, and X. Wang, “Joint detection and identification feature learning for person search,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3415–3424.
  • [24] Z. Zhong, L. Zheng, D. Cao, and S. Li, “Re-ranking person re-identification with k-reciprocal encoding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1318–1327.
  • [25] M. Ester, H.-P. Kriegel, J. Sander, X. Xu et al., “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Knowledge Discovery and Data Mining, 1996, pp. 226–231.
  • [26] A. Hermans, L. Beyer, and B. Leibe, “In defense of the triplet loss for person re-identification,” arXiv preprint arXiv:1703.07737, 2017.
  • [27] D. Wang and S. Zhang, “Unsupervised person re-identification via multi-label classification,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10981–10990.
  • [28] H.-X. Yu, W.-S. Zheng, A. Wu, X. Guo, S. Gong, and J.-H. Lai, “Unsupervised person re-identification by soft multilabel learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 2148–2157.
  • [29] K. Sohn, “Improved deep metric learning with multi-class n-pair loss objective,” in Advances in Neural Information Processing Systems, 2016, pp. 1857–1865.
  • [30] H. Shi, Y. Yang, X. Zhu, S. Liao, Z. Lei, W. Zheng, and S. Z. Li, “Embedding deep metric for person re-identification: A study against large variations,” in European Conference on Computer Vision, 2016, pp. 732–748.
  • [31] H. Oh Song, Y. Xiang, S. Jegelka, and S. Savarese, “Deep metric learning via lifted structured feature embedding,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 4004–4012.
  • [32] Y. Lou, Y. Bai, J. Liu, S. Wang, and L. Duan, “VeRi-Wild: A large dataset and a new method for vehicle re-identification in the wild,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3235–3243.
  • [33] Z. Zheng, T. Ruan, Y. Wei, and Y. Yang, “VehicleNet: Learning robust feature representation for vehicle re-identification,” in IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 1–4.
  • [34] Y. Lou, Y. Bai, J. Liu, S. Wang, and L. Duan, “Embedding adversarial learning for vehicle re-identification,” IEEE Transactions on Image Processing, vol. 28, no. 8, pp. 3794–3807, 2019.
  • [35] B. He, J. Li, Y. Zhao, and Y. Tian, “Part-regularized near-duplicate vehicle re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 3997–4005.
  • [36] M. Cormier, L. Sommer, and M. Teutsch, “Low resolution vehicle re-identification based on appearance features for wide area motion imagery,” in 2016 IEEE Winter Applications of Computer Vision Workshops, 2016, pp. 1–7.
  • [37] X. Liu, W. Liu, H. Ma, and H. Fu, “Large-scale vehicle re-identification in urban surveillance videos,” in 2016 IEEE International Conference on Multimedia and Expo, 2016, pp. 1–6.
  • [38] X. Liu, W. Liu, T. Mei, and H. Ma, “PROVID: Progressive and multimodal vehicle reidentification for large-scale urban surveillance,” IEEE Transactions on Multimedia, vol. 20, no. 3, pp. 645–658, 2017.
  • [39] Y. Shen, T. Xiao, H. Li, S. Yi, and X. Wang, “Learning deep neural networks for vehicle re-id with visual-spatio-temporal path proposals,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1900–1909.
  • [40] Z. Wang, L. Tang, X. Liu, Z. Yao, S. Yi, J. Shao, J. Yan, S. Wang, H. Li, and X. Wang, “Orientation invariant feature embedding and spatial temporal regularization for vehicle re-identification,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 379–387.
  • [41] R. Chu, Y. Sun, Y. Li, Z. Liu, C. Zhang, and Y. Wei, “Vehicle re-identification with viewpoint-aware metric learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 8282–8291.
  • [42] J. Sochor, A. Herout, and J. Havel, “BoxCars: 3D boxes as CNN input for improved fine-grained vehicle recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3006–3015.
  • [43] W. Deng, L. Zheng, Q. Ye, G. Kang, Y. Yang, and J. Jiao, “Image-image domain adaptation with preserved self-similarity and domain-dissimilarity for person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 994–1003.
  • [44] L. Wei, S. Zhang, W. Gao, and Q. Tian, “Person transfer GAN to bridge domain gap for person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 79–88.
  • [45] Y. Lin, L. Xie, Y. Wu, C. Yan, and Q. Tian, “Unsupervised person re-identification via softened similarity learning,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 3390–3399.
  • [46] R. M. S. Bashir, M. Shahzad, and M. Fraz, “VR-PROUD: Vehicle re-identification using progressive unsupervised deep architecture,” Pattern Recognition, vol. 90, pp. 52–65, 2019.
  • [47] D. Cheng, Y. Gong, S. Zhou, J. Wang, and N. Zheng, “Person re-identification by multi-channel parts-based cnn with improved triplet loss function,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 1335–1344.
  • [48] Z. Zhong, L. Zheng, Z. Luo, S. Li, and Y. Yang, “Invariance matters: Exemplar memory for domain adaptive person re-identification,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 598–607.
  • [49] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, and A. Y. Wu, “An efficient k-means clustering algorithm: Analysis and implementation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 881–892, 2002.
  • [50] Y. Ding, H. Fan, M. Xu, and Y. Yang, “Adaptive exploration for unsupervised person re-identification,” ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 16, no. 1, pp. 3:1–3:19, 2020.
  • [51] H. Liu, Y. Tian, Y. Yang, L. Pang, and T. Huang, “Deep relative distance learning: Tell the difference between similar vehicles,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2167–2175.
  • [52] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
  • [53] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255.
  • [54] L. van der Maaten and G. E. Hinton, “Visualizing data using t-SNE,” Journal of Machine Learning Research, vol. 9, pp. 2579–2605, 2008.
  • [55] N. X. Vinh, J. Epps, and J. Bailey, “Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance,” Journal of Machine Learning Research, vol. 11, pp. 2837–2854, 2010.